Product Quantization Network for Fast Visual Search

Yu , Tan; Meng, Jingjing; Fang, Chen; Jin, Hailin; Yuan, Junsong

doi:10.1007/s11263-020-01326-x

Product Quantization Network for Fast Visual Search

Published: 23 April 2020

Volume 128, pages 2325–2343, (2020)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

International Journal of Computer Vision Aims and scope Submit manuscript

Product Quantization Network for Fast Visual Search

Download PDF

Tan Yu ¹,
Jingjing Meng²,
Chen Fang³,
Hailin Jin³ &
…
Junsong Yuan²

970 Accesses
12 Citations
Explore all metrics

Abstract

Product quantization has been widely used in fast image retrieval due to its effectiveness of coding high-dimensional visual features. By constructing the approximation function, we extend the hard-assignment quantization to soft-assignment quantization. Thanks to the differentiable property of the soft-assignment quantization, the product quantization operation can be integrated as a layer in a convolutional neural network, constructing the proposed product quantization network (PQN). Meanwhile, by extending the triplet loss to the asymmetric triplet loss, we directly optimize the retrieval accuracy of the learned representation based on asymmetric similarity measurement. Utilizing PQN, we can learn a discriminative and compact image representation in an end-to-end manner, which further enables a fast and accurate image retrieval. By revisiting residual quantization, we further extend the proposed PQN to residual product quantization network (RPQN). Benefited from the residual learning triggered by residual quantization, RPQN achieves a higher accuracy than PQN using the same computation cost. Moreover, we extend PQN to temporal product quantization network (TPQN) by exploiting temporal consistency in videos to speed up the video retrieval. It integrates frame-wise feature learning, frame-wise features aggregation and video-level feature quantization in a single neural network. Comprehensive experiments conducted on multiple public benchmark datasets demonstrate the state-of-the-art performance of the proposed PQN, RPQN and TPQN in fast image and video retrieval.

Product Quantization Network for Fast Image Retrieval

A Triplet-loss Dilated Residual Network for High-Resolution Representation Learning in Image Retrieval

Article 25 April 2023

SOLAR: Second-Order Loss and Attention for Image Retrieval

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Visual search has been a fundamental research topic in computer vision. Given a query (an image or a video), it aims to retrieve the query’s relevant items from a database. Accuracy and efficiency are two key aspects for a retrieval system. These two aspects drive the research on visual search to progress in two directions.

The first direction focuses on representation designing or learning for a higher search accuracy (Philbin et al. 2007; Perronnin et al. 2010; Jégou et al. 2010; Babenko et al. 2014; Babenko and Lempitsky 2015; Ng et al. 2015; Gordo et al. 2016; Yu et al. 2017c; Bai et al. 2018, 2017). A good representation maintains a large distance between irrelevant items in feature space and a close distance between relevant ones. Traditional retrieval systems first extract hand-crafted local features like SIFT and then aggregate local features into a global feature (Philbin et al. 2007; Perronnin et al. 2010; Jégou et al. 2010). With the progress of deep learning, the convolutional neural network provides an effective representation (Babenko et al. 2014; Babenko and Lempitsky 2015; Ng et al. 2015; Yu et al. 2017a, d), which is trained by the semantic information and thus is robust to low-level image transformations.

The second direction aims to boost the retrieval speed, especially when dealing with a large-scale dataset. In this circumstance, a compact visual representation is necessary. Generically speaking, there are two types of schemes to gain a compact representation, hashing and quantization. Hashing maps the real-value vectors into binary codes, which enables a faster distance computation and lower memory cost. One of widely used hashing methods is locality sensitivity hashing (LSH) (Charikar 2002). Nevertheless, LSH is data-independent, which ignores the data distribution and is sub-optimal to a specific dataset. To further improve the performance, some hashing methods (Salakhutdinov and Hinton 2007; Weiss et al. 2009; Gong et al. 2013) learn the projection from the data, which caters better to a specific dataset and achieves a higher retrieval precision. Note that traditional hashing methods are based on a two-step procedure. To be specific, they first conduct feature extraction and then carry on Hamming embedding based on the extracted features. These two steps are independently conducted and their mutual influence are ignored. More recently, inspired by the progress of deep learning, some deep hashing methods (Xia et al. 2104; Lai et al. 2015; Liu et al. 2016; Li et al. 2017) are proposed, which simultaneously conduct feature learning and feature compression through a unified neural network, achieving much better retrieval performance compared with methods based on off-the-shelf features.

Nevertheless, hashing methods are only able to produce a few distinct distances, limiting its discriminative power in characterizing distances between data points. In parallel to hashing methods, another widely used data compression method in visual retrieval is product quantization. It quantizes each feature vector into a Cartesian product of several codewords and the similarity can be efficiently computed through looking up cached similarity table. Thanks to the asymmetric distance calculation mechanism, product quantization enables a more accurate distance calculation than hashing methods using the same code length. The product quantization (PQ) (Jegou et al. 2011) and its optimized versions like OPQ (Ge et al. 2013), CKmeans (Norouzi and Fleet 2013), APQ (Babenko and Lempitsky 2014) and CQ (Zhang et al. 2014; Hong et al. 2018) are originally designed for unsupervised scenarios where no labeled data are provided. SQ (Wang et al. 2016c) extends product quantization to supervised scenarios. However, SQ is based on the hand-crafted features or CNN features from the pretrained model, therefore it might not be optimal with respect to a specific dataset.

To jointly optimize feature learning and product quantization, Cao et al. (2016) propose a deep quantization network (DQN) which supports an end-to-end training. It optimizes a weighted sum of similarity preserving loss and product quantization loss. It iteratively updates codewords and other parameters of a neural network. Therefore, in each iteration, the codewords are directly updated by k-means whereas the label information is ignored. Recently, Klein and Wolf (2017) propose a deep product quantization (DPQ). They learn a cascade of two fully-connected layers followed by a softmax layer to determine a soft codeword assignment. It is different from original product quantization as the codeword assignment is no longer determined by distance between the original feature and codewords. Nevertheless, the additional parameters introduced in the cascade of fully-connected layers make DPQ more prone to over-fitting.

In this paper, we also attempt to incorporate the product quantization in a neural network and train it in an end-to-end manner. We construct a soft product quantization layer through an approximation function $f({{\mathbf {x}}},\alpha )$, which is differentiable and supports an end-to-end training. The original product quantization function $q({{\mathbf {x}}})$ is a special case of the constructed approximation function when $\alpha \rightarrow +\infty $. Different from DPQ, fully-connected layers are no longer needed for codebook assignment, instead, in our method, the codeword assignment is determined by the similarity between features and the codewords. Therefore, we significantly reduce the number of parameters to be trained, making our product quantization network (PQN) less prone to over-fitting compared with DPQ. Meanwhile, inspired by the success of the triplet loss in metric learning and the advantage of the asymmetric similarity measurement in feature compression, we propose a novel asymmetric triplet loss to directly optimize the representation’s adaptability to asymmetric similarity measurement in an end-to-end manner. Interestingly, our experiments show that product quantization not also boosts retrieval efficiency, but also serves as a regularization approach, which improves the model’s generalizability and improves the retrieval precision.

Meanwhile, we revisit the residual quantization and extend PQN to residual product quantization network (RPQN). In general, residual quantization cannot achieve comparable distortion error as product quantization, which limits its effectiveness in retrieval based on hand-crafted features. Nevertheless, there are large redundancy in features obtained from deep neural network and thus the quantization error is less important. Benefited from residual learning triggered through residual quantization, the proposed RPQN achieves a higher accuracy than PQN. Figure 1 visualizes the architecture of the proposed RPQN. It takes a triplet consisting of an achor image, a positive image and a negative image $[I,I_+,I_-]$ as input. The backbone network, e.g., AlexNet, obtains the images’ real-value feature vectors $[{{\mathbf {v}}},{{\mathbf {v}}}_+,{{\mathbf {v}}}_-]$. The feature vectors of the positve image and negative image are further feed in the proposed residual product quantization layer (RPQL) to generate their quantized features $[{{\mathbf {q}}}_+,{{\mathbf {q}}}_-]$. The anchor image’s real-value feature and the quantized features of the negative image and the positive image constitutes a triplet $[{{\mathbf {v}}},{{\mathbf {q}}}_+,{{\mathbf {q}}}_-]$ to compute the assymetric loss which aims to enlarge the distance between ${{\mathbf {v}}}$ and ${{\mathbf {q}}}_-$, and decrease the distance between ${{\mathbf {v}}}$ and ${{\mathbf {q}}}_+$.

Moreover, we extend PQN to temporal product quantization network (TPQN) by exploiting the temporal consistency inherited in videos through a temporal pyramid pooling. The temporal pyramid pooling not only enhances the discriminativeness of the video representation but also provides a natural feature partition for product quantization. It integrates frame feature learning, frame features aggregation and video feature product quantization in a single neural network.

This paper is an extension of our conference paper (Yu et al. 2018). New contributions are summarized as follows:

We give a deeper insight into making the quantization differentiable. We propose to construct a function to approximate the hard quantization to achieve the functionality of the quantization. The softmax function used in the original conference paper (Yu et al. 2018) is a special case of the proposed approximation function in this paper. We give a more detailed derivation for the gradient back-propagation and conduct a more comprehensive ablation study.
We extend the proposed product quantization network (PQN) in the conference paper to residual product quantization network (RPQN) by revisiting the residual quantization. The residual quantization not only brings a finer feature space partition but also triggers the residual learning, enhancing the representation’s discriminability. We conduct comprehensive experiments on RPQN and demonstrate that it achieves higher accuracies than the PQN proposed in our conference paper.
We tackle a new task, video retrieval. By exploiting the temporal consistency inherited in videos, we extend the PQN proposed in our conference paper to temporal product quantization (TPQN). The proposed TPQN integrates frame feature learning, frame features pooling and video feature compression in a single network. We conduct systematic experiments based on the proposed TPQN and achieve state-of-the-art performance in video retrieval task.

The remainder of the paper is organized as follows: Related work for Hashing, quantization and fast video retrieval are discussed in Sect. 2. Product quantization network (PQN) is introduced in Sect. 3. The proposed residual product quantization network (RPQN) is introduced in Sect. 4 and temporal product quantization network (TPQN) is introduced in Sect. 5. Sections 6 and 7 provide experimental results in fast image retrieval and video retrieval, respectively.

2 Related Work

2.1 Hashing

Hashing (Charikar 2002; Salakhutdinov and Hinton 2007; Weiss et al. 2009; Gong et al. 2013; Xia et al. 2104; Lai et al. 2015; Liu et al. 2016; Li et al. 2017; Hong et al. 2017; Hong and Yuan 2018; Hong et al. 2018; Yu et al. 2017b) aims to map a feature vector into a short code consisting of a sequence of bits, which enables fast distance computation as well as small memory cost. One of traditional hashing methods, locality sensitivity hashing (LSH) (Charikar 2002), directly utilizes random projection to map the real-value vectors into binary codes. Since the random projection in LSH is independent with data distribution, it might not be optimal for a specific dataset. To optimize the projection for a specific dataset, spetral hashing (SH) (Weiss et al. 2009) and iterative quantization (ITQ) (Gong et al. 2013) learn the projection by minimizing the distortion errors in an unsupervised manner. They achieve better performance than LSH. To further exploit the supervision information, some supervised hashing methods (Liu et al. 2012; Norouzi et al. 2012; Shen et al. 2015) are proposed, achieving better performance than their unsupervised counterpart such as ITQ and SH.

Nevertheless, the above-mentioned hashing methods all adopt a two-step procedure. They first obtain real-value image features and then compress the features into binary codes. In that case, the representation learning and feature compression are conducted separately and the mutual influence between them is ignored. Recently, motivated by the success of deep learning, some works (Xia et al. 2104; Lai et al. 2015; Liu et al. 2016; Li et al. 2017) propose deep hashing methods by incorporating hashing as a layer into a deep neural network. The end-to-end training mechanism of deep hashing simultaneously optimizes the representation learning and feature compression, achieving better performance than traditional hashing methods based on the two-step procedure.

2.2 Product Quantization

Since the hashing methods are only able to produce a few distinct distances, it has limited capability of describing the distance between data points. Parallelly, another scheme termed product quantization (PQ) (Jegou et al. 2011) possesses a more powerful representation capability. It decomposes the space into a Cartesian product of subspaces and quantizes each subspace individually. Thanks to the product settings, it could partition the feature space into a huge number of fine cells. By utilizing the asymmetric similarity measurement, it only causes quantization error in reference images’ features and uses the original feature of the query without any distortion. Meanwhile, through looking up tables, it achieves comparable search speed as Hashing methods. Some following work (Ge et al. 2013; Babenko and Lempitsky 2014; Zhang et al. 2014) further optimize the product quantization through reducing the distortion and achieve higher retrieval precision and recall. Note that, production quantization and its optimized versions such as OPQ (Ge et al. 2013), AQ (Babenko and Lempitsky 2014) and CQ (Zhang et al. 2014) are designed for unsupervised scenarios where no supervision is provided, limiting their effectiveness in the supervised scenario.

To exploit the labels provided in the supervised scenario, Wang et al. (2016c) propose supervised quantization (SQ). Nevertheless, SQ conducts feature extraction and quantization individually, whereas the interaction between these two steps are ignored. To simultaneously learn image representation and product quantization, deep quantization network (DQN) (Cao et al. 2016) adds a fully connected bottleneck layer in the convolutional network. It optimizes a combined loss consisting of a similarity-preserving loss and a product quantization loss. Nevertheless, the codebook in DPQ is trained through k-means clustering and thus the supervision is ignored. Recently, deep product quantization (DPQ) (Klein and Wolf 2017) is proposed where the codebook and network parameters are learned in an end-to-end manner. Different from original product quantization which determines the codeword assignment according to distances between features and codewords, DPQ determines the codeword assignment through a fully connected layer with parameters learned from data. Nevertheless, the additional parameters in the cascade of fully connected layers make the network more prone to over-fitting.

Our work is also an attempt of incorporating product quantization in a neural network. We propose a soft product quantization layer and build our product quantization network (PQN), which can be trained in an end-to-end manner. Different from DPQ, our PQN determines the codeword assignment according to the similarity between features for coding and codewords, which can be seen as a soft extension of original product quantization. Unlike DPQ, we do not need additional fully-connected layers to determine the codeword assignment and the only parameters in our soft product quantization layer are codewords. Therefore, ours is less prone to over-fitting. Besides, we exploit residual quantization besides product quantization in RPQN. The residual quantization triggers the residual learning, which boosts the representation’s discriminabilty. Moreover, we extend product quantization to temporal product quantization for speeding up video retrieval. The temporal pyramid used in temporal product quantization not only provides a natural feature partition but also enhances the representation’s discriminabilty.

2.3 Fast Video Search

Traditional fast video search methods (Cao et al. 2012; Ye et al. 2013) follow a two-step pipeline by extracting the video feature (Tu et al. 2018, 2019) followed by the feature compression. Nevertheless, this two-step pipeline ignores the interaction between two steps, feature learning and compression, and thus the obtained compact video representation might be sub-optimal. To improve the performance of learned compact code, Wu et al. (2017b) propose a deep video hashing method which jointly optimize frame feature learning and hashing. Similarly, Liong et al. (2017) incorporate video feature learning and hashing function optimization in a single neural network. Recently, Liu et al. (2017) propose a deep video hashing framework as well as a category mask to increase the discriminativity of hash codes. Different from these deep hashing methods, we exploit product quantization to obtain a compact video representation and incorporate it in a neural network. Thanks to asymmetric distance measurement, product quantization results in a smaller distortion error than hashing. Moreover, we exploit the temporal consistency inherited in videos and extend the product quantization to temporal product quantization, constructing the proposed temporal product quantization network (TPQN).

3 Product Quantization Network

3.1 Limitation of Product Quantization

Let us denote by ${{\mathbf {v}}} \in {{\mathbb {R}}}^d$ the feature of a reference image I, we divide the feature ${{\mathbf {v}}}$ into M subvectors $ [{{\mathbf {v}}}_1,\ldots ,{{\mathbf {v}}}_m,\ldots ,{{\mathbf {v}}}_M]$ in the feature space where ${{\mathbf {v}}}_m \in {{\mathbb {R}}}^{d/M}$ is a subvector. The product quantization further approximates ${{\mathbf {v}}}$ by

$$\begin{aligned} {{\mathbf {q}}} = [{q}_{1}({{\mathbf {v}}}_1),\ldots ,{q}_{m}({{\mathbf {v}}}_m),\ldots ,{q}_{M}({{\mathbf {v}}}_M)], \end{aligned}$$

(1)

where $q_{m}(\cdot )$ is m-th quantizer defined as

$$\begin{aligned} q_{m}({{{\mathbf {v}}}}_{m}) = {{\mathbf {c}}}_{m}^{k^*}, \end{aligned}$$

(2)

in which,

$$\begin{aligned} k^{*} = \underset{k \in [1,K]}{\mathrm {argmax}} \langle {{{\mathbf {v}}}}_{m} , {{\mathbf {c}}}_{m}^k\rangle . \end{aligned}$$

(3)

In Eq. (3), $\{{{\mathbf {c}}}_{m}^{1},\ldots , {{\mathbf {c}}}_{m}^{K}\}$ are codewords for $q_{m}(\cdot )$. Originally, the codewords are learned through unsupervised k-means by minimizing the distortion error between each data point and its nearest codeword. Nevertheless, only minimizing distortion errors between data points and codewords ignores the supervision information. To exploit the provided supervision when incorporated product quantization in a neural network, we seek to learn the codewords through back-propagation in a end-to-end manner.

Let us attempt to derive the back-propagation of gradients in product quantization operation. We define L as the training loss, which we will introduce in Sect. 3.5. Let us assume we have already obtained ${\partial L}/{\partial {q}_{m}({{\mathbf {v}}}_m)}$, i.e., the derivative of loss ${{\mathcal {L}}}$ with respective to the output. The back-propagation seeks to compute the derivative of loss ${{\mathcal {L}}}$ with respective to the codewords, ${\partial L} / {\partial {{\mathbf {c}}}_{m}^{k^*}}$, and the derivative of loss ${{\mathcal {L}}}$ with respective to the input, ${\partial L} / {\partial {{{\mathbf {v}}}}_{m}}$. From the definition, it is straightforward to derive that ${\partial L} / {\partial {{\mathbf {c}}}_{m}^{k^*}} = {\partial L} / {\partial {q}_{m}({{\mathbf {v}}}_m)}$. Intuitively, it back-propagates the gradients of the output to the codewords it is assigned to and leave the other codewords unchanged. Nevertheless, ${q}_{m}({{\mathbf {v}}}_m)$ is not a continuous function of ${{{\mathbf {v}}}}_{m}$, making it infeasible to back-propagate ${\partial L}/{\partial {q}_{m}({{\mathbf {v}}}_m)}$ to ${\partial L} / {\partial {{{\mathbf {v}}}}_{m}}$. It drives us to seek other solutions to make the back-propagation in product quantization feasible.

3.2 From Hard Quantization to Soft Quantization

To overcome the non-differentiable problem, we generalize the quantization defined in Eq. (2) by constructing a continuously differential approximation function defined as $f_{m}({{{\mathbf {v}}}}_{m} ,\alpha )$ which satisfies $ {\lim }_{{\alpha \rightarrow +\infty }} f_{m}({{{\mathbf {v}}}}_{m} ,\alpha ) = q_{m}({{{\mathbf {v}}}}_{m})$, where $\alpha $ is a scalar controlling the consistency between $f_{m}({{{\mathbf {v}}}}_{m} ,\alpha )$ and $q_{m}({{{\mathbf {v}}}}_{m})$. In practice, we can choose a large $\alpha $ to achieve a good approximation. There are multiple choices to construct the function $f_{m}({{{\mathbf {v}}}}_{m} ,\alpha )$. Below we specify two instances and the soft quantization function proposed in our conference paper (Yu et al. 2018) corresponds to the second one:

$$\begin{aligned} \begin{aligned} f_{m}^1({{{\mathbf {v}}}}_{m} ,\alpha )&= \sum _{k=1}^K \frac{\langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^k \rangle ^{\alpha } }{\sum _{k'=1}^K \langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^{k'} \rangle ^{\alpha }} {{\mathbf {c}}}_{m}^k,\\ f_{m}^2({{{\mathbf {v}}}}_{m} ,\alpha )&= \sum _{k=1}^K \frac{ e^{\alpha \langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^k \rangle }}{\sum _{k'=1}^K e^{\alpha \langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^{k'} \rangle } } {{\mathbf {c}}}_{m}^k \end{aligned} \end{aligned}$$

(4)

Note that, ${\lim }_{{\alpha \rightarrow +\infty }} f^1_{m}({{{\mathbf {v}}}}_{m} ,\alpha ) = q_{m}({{{\mathbf {v}}}}_{m})$ might not be satisfied if $\langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^{k} \rangle <0$. To tackle this issue, we clip $\langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^{k} \rangle $ to ensure it is non-negative. Below we derive the gradient back-propagation for $f_{m}^1({{{\mathbf {v}}}}_{m} ,\alpha )$ and that of $f_{m}^2({{{\mathbf {v}}}}_{m} ,\alpha )$ can be derived in a similar manner. To facilitate the derivation, we introduce a series of intermediate variables $\{a_{m}^k\}_{m=1,k=1}^{M,K}$ defined as

$$\begin{aligned} a_{m}^k = {\langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^k \rangle ^{\alpha }}/ {{\sum _{k'=1}^K \langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^{k'} \rangle ^{\alpha }} }. \end{aligned}$$

(5)

Meanwhile, let us define ${{\mathbf {q}}}_m = f_{m}^1({{{\mathbf {v}}}}_{m} ,\alpha )$ as output of the approximation function taking ${{\mathbf {v}}}_m$ as input. By plugging Eq. (5) into Eq. (4), we can obtain

$$\begin{aligned} {{\mathbf {q}}}_{m} = \sum _{k=1}^K a_{m}^k {{\mathbf {c}}}_{m}^k. \end{aligned}$$

(6)

Since ${{\mathbf {q}}}_{m}$ is a function of ${{{\mathbf {v}}}}_{m}$ and $\{{{\mathbf {c}}}_{m}^k\}_{k=1}^K$, it is straightforward to obtain

$$\begin{aligned} \begin{aligned} dL&= \sum _{m=1}^{M} \Big (d{{\mathbf {q}}}_{m}\Big )^{\top } \frac{\partial L}{\partial {{\mathbf {q}}}_{m} }\\&= \sum _{m=1}^{M} \Big ( \frac{\partial {{\mathbf {q}}}_{m}}{\partial {{{\mathbf {v}}}}_{m}} d {{{\mathbf {v}}}}_{m} + \sum _{k=1}^{K} \frac{\partial {{\mathbf {q}}}_{m}}{\partial {{{\mathbf {c}}}}^k_{m}} d {{{\mathbf {c}}}}^k_{m}\Big )^{\top }\frac{\partial L}{\partial {{\mathbf {q}}}_{m} } . \end{aligned} \end{aligned}$$

(7)

Meanwhile, according to the definition of ${{\mathbf {q}}}_{m}$ in Eq. (6),

$$\begin{aligned} \begin{aligned} \frac{\partial {{\mathbf {q}}}_{m}}{\partial {{{\mathbf {v}}}}_{m}}&= \sum _{k=1}^{K} \frac{\partial {{\mathbf {q}}}_{m}}{\partial {a}_{m}^k } \Big ( \frac{\partial {a}_{m}^k}{\partial {{{\mathbf {v}}}}_{m} } \Big ) ^{\top } = \sum _{k=1}^{K} {{\mathbf {c}}}_{m}^k \Big ( \frac{\partial {a}_{m}^k}{\partial {{{\mathbf {v}}}}_{m} } \Big ) ^{\top },\\ \frac{\partial {{\mathbf {q}}}_{m}}{\partial {{{\mathbf {c}}}}^k_{m}}&= \sum _{k''=1}^{K} \frac{\partial {{\mathbf {q}}}_{m}}{\partial {a}_{m}^{k''} } \Big ( \frac{\partial {a}_{m}^{k''}}{\partial {{{\mathbf {c}}}}_{m}^k } \Big ) ^{\top } + a_{m}^k \frac{\partial {{\mathbf {c}}}_{m}^{k}}{\partial {{{\mathbf {c}}}}_{m}^k } \\&= \sum _{k''=1}^{K} {{\mathbf {c}}}_{m}^{k''} \Big ( \frac{\partial {a}_{m}^{k''}}{\partial {{{\mathbf {c}}}}_{m}^k } \Big ) ^{\top } + a_{m}^k {{\mathbf {I}}}, \end{aligned} \end{aligned}$$

(8)

where ${{\mathbf {I}}}$ is an identity matrix. By plugging Eq. (8) into Eq. (7), we further obtain

$$\begin{aligned} \begin{aligned} dL&=\sum _{m=1}^{M} \Big \{ \sum _{k=1}^{K} {{\mathbf {c}}}_{m}^k \Big ( \frac{\partial {a}_{m}^k}{\partial {{{\mathbf {v}}}}_{m} } \Big ) ^{\top } d {{{\mathbf {v}}}}_{m} \\&\quad + \sum _{k=1}^{K} \Big [\sum _{k''=1}^{K} {{\mathbf {c}}}_{m}^{k''} \Big ( \frac{\partial {a}_{m}^{k''}}{\partial {{{\mathbf {c}}}}_{m}^k } \Big ) ^{\top } + a_{m}^k {{\mathbf {I}}} \Big ] d {{{\mathbf {c}}}}^k_{m}\Big \}^{\top }\frac{\partial L}{\partial {{\mathbf {q}}}_{m} } . \end{aligned} \end{aligned}$$

(9)

Meanwhile, L is the function of ${{{\mathbf {v}}}}_{m}$ and ${{\mathbf {c}}}_{m}^k$, therefore,

$$\begin{aligned} dL = \sum _{m=1}^{M} \Big [ \Big ( d{{{\mathbf {v}}}}_{m} \Big )^{\top } \frac{\partial L}{\partial {{{\mathbf {v}}}}_{m} } + \sum _{k=1}^{K} \Big (d {{\mathbf {c}}}_{m}^{k} \Big )^{\top } \frac{\partial {L}}{\partial {{\mathbf {c}}}_{m}^k } \Big ]. \end{aligned}$$

(10)

Comparing Eq. (10) with Eq. (9), we obtain

$$\begin{aligned} \begin{aligned} \frac{\partial L}{\partial {{{\mathbf {v}}}}_{m}}&= \sum _{k=1}^K \frac{\partial a_{m}^k }{\partial {{\mathbf {v}}}_{m}} \Big ({{\mathbf {c}}}_{m}^{k} \Big )^{\top } \frac{\partial L}{\partial {{\mathbf {q}}}_{m}},\\ \frac{\partial L}{\partial {{\mathbf {c}}}_{m}^{k}}&= \sum _{k''=1}^K \frac{\partial a_{m}^{k''} }{\partial {\mathbf {{{\mathbf {c}}}}}_{m}^k} \Big ({{\mathbf {c}}}_{m}^{k''} \Big )^{\top }\frac{\partial L}{\partial {{\mathbf {q}}}_{m}} + a_{m}^k \frac{\partial L}{\partial {{\mathbf {q}}}_{m}}. \end{aligned} \end{aligned}$$

(11)

where

$$\begin{aligned} \begin{aligned} \frac{\partial a_{m}^k}{\partial {{{{\mathbf {v}}}}_{m}} }&= \frac{\alpha \sum _{k'=1}^K \langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^{k'} \rangle ^{\alpha -1} \langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^k \rangle ^{\alpha -1} {{\mathbf {c}}}_{m}^k\langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^{k'} \rangle }{(\sum _{k'=1}^K \langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^{k'} \rangle ^{\alpha })^2} \\&\quad - \frac{ \alpha \sum _{k'=1}^K \langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^{k'} \rangle ^{\alpha -1} \langle {{\mathbf {v}}}_{m}, {{\mathbf {c}}}_{m}^k \rangle ^{\alpha -1} {{\mathbf {c}}}_{m}^{k'}\langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^k \rangle }{(\sum _{k'=1}^K \langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^{k'} \rangle ^{\alpha })^2}, \\ \frac{\partial a_{m}^{k''}}{\partial {{{{\mathbf {c}}}}_{m}^k} }&= \frac{\alpha {{\mathbb {I}}}(k= k'') {{\mathbf {c}}}_{m}^{k''}\langle {{\mathbf {v}}}_{m}, {{\mathbf {c}}}_{m}^{k''} \rangle ^{\alpha -1} \sum _{k'=1}^K \langle {{\mathbf {v}}}_{m}, {{\mathbf {c}}}_{m}^{k'} \rangle ^{\alpha } }{(\sum _{k'=1}^K \langle {{{\mathbf {v}}}}_{m}, {{\mathbf {c}}}_{m}^{k'} \rangle ^{\alpha })^2} \\&\quad - \frac{\alpha {{\mathbf {c}}}_{m}^{k}\langle {{\mathbf {v}}}_{m}, {{\mathbf {c}}}_{m}^k \rangle ^{\alpha -1}\langle {{\mathbf {v}}}_{m}, {{\mathbf {c}}}_{m}^{k''} \rangle ^{\alpha } }{(\sum _{k'=1}^K \langle {{\mathbf {v}}}_{m}, {{\mathbf {c}}}_{m}^{k'} \rangle ^{\alpha })^2}. \end{aligned} \end{aligned}$$

(12)

${{\mathbb {I}}}(k=k') = 1$ only if $k=k'$ otherwise 0. In fact, our experiments in Sect. 6.1.1 show that, $f_m^1(\cdot ,\alpha )$ and $f_m^2(\cdot ,\alpha )$ achieve comparable performance. For convenience of implementation, we select $f_m^2(\cdot ,\alpha )$ as default approximation function.

3.3 Regularization

Interestingly, the quantization mechanism works as a regularization, effectively suppressing over-fitting. Different from traditional regularization method putting constraints on the weights of the network such as $\ell _1$/$\ell _2$ regularization, quantization, instead, puts constraints on the activations. We will show that even though the quantized features can be seen as an approximation of the original features with inevitable distortion, it achieves a much better retrieval precision than that using the original features.

3.4 Initialization

We initialize the parameters of convolutional layers by fine-tuning a standard convolutional neural network without quantization, e.g., Alexnet, on the specific dataset. Since we adopt inner-product similarity, we add an intra-normalization layer after the fully-connected layer to fine-tune the network to make it compatible with the proposed product quantization network. After that, we extract the features from the fine-tuned neural network and conduct k-means followed by $\ell _2$-normalization to obtain the initialized codewords $\{{{\mathbf {c}}}_{mk} \}_{k=1,m=1}^{K,M}$ in the soft product quantization layer.

3.5 Asymmetric Triplet Loss

We extend triplet loss originally used in metric learning to asymmetric triplet loss to boost the performance of learned representation in asymmetric similarity measurement. We define $(I, I_+, I_-)$ as a training triplet, where $I_+$ represents a relevant (positive) image and $I_-$ is an irrelevant (negative) image with respect to I. We denote by ${{\mathbf {v}}}$ the feature of I before soft product quantization and denote by ${{\mathbf {q}}}_{+}$ and ${{\mathbf {q}}}_{-}$ the features of $I_+$ and $I_-$ after soft product quantization. We define asymmetric similarity between I and $I+$ as $\langle {{\mathbf {v}}}, {{\mathbf {q}}}_{+} \rangle $, where $\langle \cdot , \cdot \rangle $ denotes the inner-product between two vectors. The proposed asymmetric triplet loss is defined as

$$\begin{aligned} {{\mathcal {L}}}_{ATL} = \langle {{\mathbf {v}}}, {{\mathbf {q}}}_{-} \rangle - \langle {{\mathbf {v}}}, {{\mathbf {q}}}_{+} \rangle . \end{aligned}$$

(13)

Intuitively, it aims to increase the asymmetric similarity between pairs of relevant images and decrease that of pairs of irrelevant images. It is a natural extension of the original triplet loss to asymmetric distance. The difference is that, a training triplet used in the original triplet loss consists of three features of the same type, whereas a training triplet used in the proposed asymmetric triplet loss consists of one feature without quantization and two features after quantization. In fact, our experiments in Sect. 6.1.1 show that a better performance is achieved by processing the above loss through a sigmoid function and a revised loss function is defined as Eq. (14). The better performance might be attributed to the fact that the sigmoid function can normalize the original loss so that the training will not be biased by some samples with huge loss. By default, the asymmetric triplet loss (ATL) we mention in this paper is ${{\mathcal {L}}}_{ATL}^{+}$ in Eq. (14).

$$\begin{aligned} {{\mathcal {L}}}_{ATL}^{+} = \frac{1}{ 1 + e^{\langle {{\mathbf {v}}}, {{\mathbf {q}}}_{+} \rangle - \langle {{\mathbf {v}}}, {{\mathbf {q}}}_{-} \rangle } }. \end{aligned}$$

(14)

3.6 Encoding and Retrieval

After training the proposed product quantization network, the reference images in the database will be encoded by hard product quantization. We define the layer before the soft product quantization layer as embedding layer. Given a reference image I, we obtain its feature from embedding layer ${{\mathbf {v}}} = [ {{\mathbf {v}}}_1,\ldots ,{{\mathbf {v}}}_m,\ldots ,{{\mathbf {v}}}_M]$ and then obtain its PQ code $ {{\mathbf {i}}} = [i_1,\ldots , i_m,\ldots ,i_M]$ where $i_m$ is computed by

$$\begin{aligned} i_m = \underset{k \in [1,K]}{\mathrm {argmax}} \langle {{\mathbf {v}}}_m, {{\mathbf {c}}}_{m}^k \rangle , \end{aligned}$$

(15)

where $\{ {{\mathbf {c}}}_{m}^k\}_{m=1,k=1}^{M,K}$ are codewords learned from our product quantization network. In the retrieval phase, we obtain the query’s feature from the embedding layer ${{\mathbf {v}}}^q = [{{\mathbf {v}}}_1^q,\ldots ,{{\mathbf {v}}}_m^q,\ldots ,{{\mathbf {v}}}_M^q]$. The relevance between the query and a reference image represented by its product quantization code ${{\mathbf {i}}} = [i_1,\ldots ,i_m,\ldots ,i_M]$ is computed by the asymmetric similarity $s({{\mathbf {v}}}^q, {{\mathbf {i}}})$ defined as

$$\begin{aligned} s({{\mathbf {v}}}^q, {{\mathbf {i}}}) = \sum _{m=1}^{M} \langle {{\mathbf {v}}}_m^q, {{\mathbf {c}}}_{m}^{i_{m}} \rangle . \end{aligned}$$

(16)

Since $\{\langle {{\mathbf {v}}}_m^q, {{\mathbf {c}}}_{m}^{k} \rangle \}_{k=1}^K$ is computed only once for all the reference images in the database and thus obtaining $ s({{\mathbf {v}}}^q, {{\mathbf {i}}})$ only requires to sum up the pre-computed similarity scores in the look-up table, considerably speeding up the image retrieval process. Meanwhile, storing the product quantization code ${{\mathbf {i}}}$ only requires Mlog$_2K$ bits, which considerably reduces the memory cost.

3.7 Relation to Existing Methods

DQN (Cao et al. 2016) is the first attempt of incorporating product quantization in the neural network. It alternately optimizes codewords and other parameters of the network. It is worth noting that when updating codewords, it only minimizes the quantization errors through k-means and the supervision information is ignored.

SUBIC (Jain et al. 2017) integrates the one-hot block encoding layer in the deep neural network. It represents each image by a product of one-hot blocks, following the spirit of product quantization. Nevertheless, the sparse property limits its representation capability, making it perform not as well as ours as shown in Table 11.

DPQ (Klein and Wolf 2017) is another attempt of incorporating product quantization into the neural network. It determines the codeword assignment through a cascade of two fully-connected layers. In contrast, our method determines the codeword assignment according to the similarity between original features and the codewords. Note that, the additional parameters from these two fully-connected layers in DPQ not only increase the computation complexity in training the neural network, but also make the network more prone to over-fitting. Our experiments show that our proposed PQN considerably outperforms DPQ.

4 Residual Product Quantization Network

Different from PQ, residual quantization (RQ) (Chen et al. 2010) performs quantization on the entire feature space, and then recursively applies vector quantization (VQ) to the residuals of the previous quantization level. It is a stacked quantization model. In other words, different from PQ conducting VQ independently on several partitioned sub-spaces, RQ conducts VQ on multiple levels instead. For the first level, RQ simply applies VQ to quantize a feature vector ${{\mathbf {v}}}$ into one of K centroids through $q_0({{\mathbf {v}}})$. The feature vector’s residual vector ${{\mathbf {r}}}_1$ can be obtained by subtracting its corresponding centroid as ${{\mathbf {v}}} - q_0({{\mathbf {v}}})$. After than, ${{\mathbf {r}}}_1$ will be further quantized into K centroids through $q_1({{\mathbf {r}}}_1)$. By repeatedly quantizing the residuals $M_r-1$ times, we can approximate ${{\mathbf {v}}}$ by $q_0({{\mathbf {v}}}) +\sum _{m=1}^{M_r-1}q_m({{\mathbf {r}}}_m)$. Through combining codewords in $M_r$ levels, we obtain a considerably huge codebook consisting of $K^{M_r}$ codewords. Generally, residual quantization cannot achieve comparable distortion error as product quantization, which limits its effectiveness in retrieval based on off-the-shelf features.

Nevertheless, in a deep learning network, there existing huge amount of redundancy in activations and the quantization error is less important. Meanwhile, we observe that the residual quantization naturally triggers residual learning (He et al. 2016), which can boost the features’ discriminability. Based on the above motivation, we attempt to integrate the residual quantization in a deep neural network. We demonstrate the advantage of residual quantization over product quantization when incorporated into a deep neural network. In fact, the residual quantization in our framework not only minimizes the distortion error caused by the quantization, but more importantly, it triggers residual learning and enhances the discriminability of the learned compact representation.

4.1 Residual Product Quantization Layer

Figure 2 illustrates the proposed residual product quantization layer (RPQL). In this case, we set the residual level $M_r =2$. We denote by ${{\mathbf {v}}} \in {{\mathbb {R}}}^d$ the input feature vector. RPQL divides ${{\mathbf {v}}}$ into $M_p$ sub-vectors:

$$\begin{aligned} {{\mathbf {v}}} \rightarrow [{{\mathbf {v}}}_1,\ldots ,{{\mathbf {v}}}_m,\ldots ,{{\mathbf {v}}}_{M_p}] \end{aligned}$$

(17)

For each sub-vector ${{\mathbf {v}}}_m$, it goes through the following forward path:

1.
${{\mathbf {q}}}_m^1 \leftarrow f({{\mathbf {v}}}_m,\alpha )$
2.
${{\mathbf {r}}}_m^1 \leftarrow {{\mathbf {v}}}_m - {{\mathbf {q}}}_m^1$
3.
${{\mathbf {q}}}_m^2 \leftarrow f({{\mathbf {r}}}_m^1,\alpha )$
4.
${{\mathbf {q}}}_m \leftarrow {{\mathbf {q}}}_m^1 + {{\mathbf {q}}}_m^2$

where $f(\cdot ,\alpha )$ is the soft function defined in Sect. 3.

4.2 Indexing and Retrieval

After training the network, in the indexing phase, we encode each reference image I in the database through hard-assignment residual product quantization. Let us define the layer before residual product quantization layer as embedding layer. We denote by ${{\mathbf {v}}}$ the output of embedding layer when input is I. ${{\mathbf {v}}}$ is partitioned into $ [{{\mathbf {v}}}_1,\ldots ,{{\mathbf {v}}}_{m},\ldots ,{{\mathbf {v}}}_{M_p}]$ and each sub-vector ${{\mathbf {v}}}_m$ is indexed as follows:

1.
${i}_m^1 \leftarrow \underset{k \in [1,K]}{\mathrm {argmax}} \langle {{\mathbf {x}}}_m, {{\mathbf {c}}}_{m}^{1,k} \rangle $
2.
${{\mathbf {r}}}_m^1 \leftarrow {{\mathbf {x}}}_m - {{\mathbf {c}}}_{m}^{1{i}_m^1}$
3.
${i}_m^2 \leftarrow \underset{k \in [1,K]}{\mathrm {argmax}} \langle {{\mathbf {r}}}_m^1, {{\mathbf {c}}}_{m}^{2,k} \rangle $

In the indexing phase, for each reference image I, we only need to store indices of their corresponding codewords $\{i_m^1\}_{m=1}^{M_p}$ and $\{i_m^2\}_{m=1}^{M_p}$, taking $2M_plog_2K$ bits in total. In the retrieval phase, we utilize the asymmetric similarity scores to rank the reference images in the dataset. Let denote by ${{\mathbf {v}}}^q$ the output of embedding layer when the input is query image q. Let denote by $[i_1^1,...,i_{M_p}^1]$ and $[i_1^2,...,i_{M_p}^2]$ indices in two-level residual product quantization of the reference image I. The asymmetric similarity is defined as

$$\begin{aligned} sim_{asym}(q,I) = \sum _{m=1}^{M_p} [ \langle {{\mathbf {v}}}_q, {{\mathbf {c}}}_m^{1,i^1_m} \rangle + \langle {{\mathbf {v}}}_q, {{\mathbf {c}}}_m^{2,i^2_m} \rangle ]. \end{aligned}$$

(18)

In this case, for each query, we only need to compute its similarity with all the codewords for only once and then similarity between the query and each reference image can be efficiently obtained through $2M_p$ table look-ups.

5 Temporal Product Quantization Network

The proposed PQN and RPQN in the previous section are designed for fast image retrieval and are not optimal for video retrieval task. In this section, we introduce our temporal product quantization network (TPQN) for effective and efficient video retrieval. The proposed TPQN is an extension of the proposed product quantization network by further exploiting the temporal consistency inherited in videos. It integrates frame-wise feature learning, frame-wise features aggregation and video-level feature quantization in a single neural network, and supports an end-to-end training. Figure 3 visualizes the architecture of the proposed TPQN. In Sects. 5.1 and 5.2, we will introduce two core modules of TPQN sequentially. In Sect. 5.3, the indexing and retrieval strategy will be introduced.

5.1 Convolutional and Pooling Layers

Given a video V, we uniformly sample N frames from it and define a global set of sampled frames as ${{\mathcal {S}}} = \{f_1,\ldots ,f_N \}$. We feed frames $\{f_i\}_{i=1}^{N}$ in parallel into the backbone network consisting of convolutional and pooling layers and obtain a set of frame-level features $\{ F(f_i) | f_i \in {{\mathcal {S}}} \}$. We uniformly partition the global set of frames ${{\mathcal {S}}}$ into T subsets $\{{{\mathcal {S}}}^t\}_{t=1}^T$ where ${{\mathcal {S}}}^t = \{f_i|i \in [ \lfloor N/T \rfloor (t-1) + 1,\lfloor N/T\rfloor t ]\}$. Each subset ${{\mathcal {S}}}^t$ represents a continuous interval of the video. Features of frames in each subset ${{\mathcal {S}}}^t$ are max-pooled into an interval’s feature:

$$\begin{aligned} {{\mathbf {v}}}^t = \mathrm {maxpool}(\{F(f)|f\in {{\mathcal {S}}}^t \}),~~t=1,\ldots ,T+1 \end{aligned}$$

(19)

where ${{\mathcal {S}}}^{T+1}={{\mathcal {S}}}$ accounts for the global set containing all sampled frames. All the interval-level features will be concatenated into a global feature ${{\mathbf {v}}} = [{{\mathbf {v}}}^1,\ldots ,{{\mathbf {v}}}^{T+1}]$, which is the input of the proposed temporal product quantization layer. Intuitively, ${{\mathbf {v}}}$ is a two-level temporal pyramid pooling feature which exploits the temporal consistency inherited in the videos. But more importantly, the pyramid structure provides a natural feature partition for product quantization.

5.2 Temporal Product Quantization Layer

For each interval-level feature ${{\mathbf {v}}}^t$, following the standard product quantization, we equally split it into M sub-vectors as ${{\mathbf {v}}}^t = [{{\mathbf {v}}}_{1}^t,\ldots ,{{\mathbf {v}}}_{M}^t]$. Each ${{\mathbf {v}}}^{t}_m$ goes through the following pipeline consisting of 4 steps:

1.
${{\mathbf {y}}}_{tm} = \mathrm {relu}({{\mathbf {W}}}_{m}^t {{\mathbf {v}}}_{m}^t + {{\mathbf {b}}}_{m}^t)$
2.
$\bar{{{\mathbf {y}}}}_{m}^t = \frac{{{\mathbf {y}}}_{m}^t}{\Vert {{\mathbf {y}}}_{m}^t\Vert _2}$
3.
${{\mathbf {z}}}_{m}^t = f^t_m(\bar{{{\mathbf {y}}}}_{m}^t,\alpha )$
4.
$\bar{{{\mathbf {z}}}}_{m}^t = \beta \frac{{{\mathbf {z}}}_{m}^t}{\Vert {{\mathbf {z}}}_{m}^t\Vert _2}$

where ${{\mathbf {W}}}_{m}^t$ and ${{\mathbf {b}}}_{m}^t$ in step 1 are parameters of a fully-connected layer used to further boost the feature’s discriminability. $\beta $ in step 4 is a scalar to speed up the training and quantizer $f_{m}^t(\bar{{{\mathbf {y}}}}_{m}^t,\alpha )$ is defined as Eq. 4.

5.3 Indexing and Retrieval

After training the neural network, we will use the learned codewords but adopt hard quantization for indexing. To be specific, given a reference video V in the database, let ${{\mathbf {v}}}^t = [{{\mathbf {v}}}_{1}^t,\ldots ,{{\mathbf {v}}}^{t}_m,\ldots , {{\mathbf {v}}}^t_{M}]$ denote the max-pooled feature of the intervals $I_t$ where ${{\mathbf {v}}}^t_{m}$ is the m-th sub-vector. For each ${{\mathbf {v}}}_{m}^t$, we obtain its index $i_{m}^t$ by

1.
${{\mathbf {y}}}^t_{m} = \mathrm {relu}({{\mathbf {W}}}_{m}^t{{\mathbf {b}}}^t_{m} + {{\mathbf {b}}}_{m}^t)$
2.
$\bar{{{\mathbf {y}}}}^t_{m} = \frac{{{\mathbf {y}}}^t_{m}}{\Vert {{\mathbf {y}}}_{m}^t\Vert _2}$
3.
${i}_{m}^t = \underset{k \in [1,K]}{\mathrm {argmax}}\langle \bar{{{\mathbf {y}}}}_{m}^t,{{\mathbf {c}}}_{m}^{t,k} \rangle $

where K is the number of codewords used by each quantizer $q_{m}^t(\cdot )$. It takes $log_2(K)$ bits to store ${i}^t_{m}$ and $(T+1)Mlog_2(K)$ bits to store all indices $\{[{i}^t_{1},\ldots ,{i}_{M}^{t}]\}_{t=1}^{T+1}$ for the video V. The advantage of the proposed temporal product quantization over the original product quantization is that it exploits the temporal consistency inherited in the video, significantly boosting the representation’s discriminability.

In the retrieval phase, given a query video q, we can directly obtain its concatenated max-pooled feature $[{{\mathbf {v}}}^1_{q,1},\ldots ,{{\mathbf {v}}}^{T+1}_{q,M}]$ through the trained backbone network, further post-process each ${{\mathbf {v}}}^t_{q,m}$ through $\{{{\mathbf {W}}}_{m}^t,{{\mathbf {b}}}_{m}^t\}$ followed by $\ell _2$ normalization and obtain the final feature $[\bar{{{\mathbf {y}}}}_{q,1}^1,\ldots ,\bar{{{\mathbf {y}}}}_{q,M}^{T+1}]$. Note that we do not conduct quantization on the query’s feature due to the asymmetric similarity measurement. The similarity between the query q and a reference video V is computed by

$$\begin{aligned} \mathrm {sim}(q,V) = \sum _{t=1}^{T+1} \sum _{m=1}^{M} \langle \bar{{{\mathbf {y}}}}^t_{q,m}, {{\mathbf {c}}}_{m}^{t,{i}_{m}^t}\rangle . \end{aligned}$$

(20)

Computing Eq. (20) is considerably efficient by utilizing look-up table (Jegou et al. 2011). To be specific, we only need to compute the similarity between query video’s feature with each codeword $\langle \bar{{{\mathbf {y}}}}^t_{q,m}, {{\mathbf {c}}}_{m}^{t,k} \rangle $ for one time and then store the similarity in a look-up table. Computing the similarity between the query video q and a reference video V defined in Eq. (20) only need $M\times (T+1)$ times similarity table look-ups.

We summarize architectures of PQN, RPQN, and TPQN in Table 1.

Table 1 Comparisons between PQN, RPQN, and TPQN

Product Quantization Network for Fast Visual Search

Abstract

Similar content being viewed by others

Product Quantization Network for Fast Image Retrieval

A Triplet-loss Dilated Residual Network for High-Resolution Representation Learning in Image Retrieval

SOLAR: Second-Order Loss and Attention for Image Retrieval

Explore related subjects

1 Introduction

2 Related Work

2.1 Hashing

2.2 Product Quantization

2.3 Fast Video Search

3 Product Quantization Network

3.1 Limitation of Product Quantization

3.2 From Hard Quantization to Soft Quantization

3.3 Regularization

3.4 Initialization

3.5 Asymmetric Triplet Loss

3.6 Encoding and Retrieval

3.7 Relation to Existing Methods

4 Residual Product Quantization Network

4.1 Residual Product Quantization Layer

4.2 Indexing and Retrieval

5 Temporal Product Quantization Network

5.1 Convolutional and Pooling Layers

5.2 Temporal Product Quantization Layer

5.3 Indexing and Retrieval

6 Experiments on Image Retrieval

6.1 CIFAR-10 Using 3CNet

6.1.1 Ablation Study on PQN

6.1.2 Ablation Study on RPQN

6.1.3 Compare with State-of-the-Art Methods

6.2 CIFAR-10 Using AlexNet

6.2.1 Comparions with State-of-the-Art Methods

6.2.2 Extremely Short Code Evaluation

6.3 NUS-WIDE

6.4 ImageNet100

7 Experiments on Video Retrieval

7.1 Ablation Study

7.1.1 Influence of T

7.1.2 Influence of M

7.1.3 Influence of \(\alpha \)

7.1.4 Action Recognition

7.2 Complexity Analysis and Regularization

7.3 Comparison with State-of-the-Art Methods

8 Conclusion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation