1 Introduction

Person re-identification (ReID) [8, 50] aims to match images of a person across disjoint cameras, and is widely used in video surveillance, security and smart-city applications. Many methods [11, 16, 21, 22, 26, 32, 43, 50, 51] have been proposed for person ReID. However, for higher accuracy, most of them use a large deep network to learn high-dimensional real-value features, compute similarities by Euclidean distance, and return a rank list by quick-sort [13]. Quick-sort of high-dimensional deep features can be slow, especially when the gallery set is large. Table 1 shows that the query time per ReID probe image grows rapidly with the gallery size, and that counting-sort [1] is much more efficient than quick-sort: the former has linear complexity in the gallery size (O(n)) whilst the latter has linearithmic complexity (O(n log n)).

Several fast ReID methods [4, 5, 7, 24, 41, 47, 55, 56] have been proposed to increase ReID speed whilst retaining ReID accuracy. Their common main idea is hashing, which learns binary codes instead of real-value features. To sort binary codes, the inefficient Euclidean distance and quick-sort are replaced by the Hamming distance and counting-sort [1]. Table 2 shows that computing a Hamming distance between 2048-dimensional binary codes is \(229\times \) faster than computing a Euclidean distance between real-value features of the same length.
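
To make the cost difference concrete, the following is a minimal NumPy sketch (our own illustration, not the paper's benchmark code; actual speedups depend on the popcount implementation) of how a binary code is packed into bytes so that a Hamming distance reduces to a XOR plus a popcount, versus a full floating-point Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pack two 2048-bit binary codes into 256 bytes each; the Hamming
# distance is then a byte-wise XOR followed by a popcount.
a_bits = rng.integers(0, 2, 2048).astype(np.uint8)
b_bits = rng.integers(0, 2, 2048).astype(np.uint8)
a_packed, b_packed = np.packbits(a_bits), np.packbits(b_bits)
hamming = int(np.unpackbits(a_packed ^ b_packed).sum())

# Euclidean distance between 2048-d real-value features needs 2048
# floating-point multiply-adds per gallery comparison.
a_real, b_real = rng.standard_normal(2048), rng.standard_normal(2048)
euclidean = float(np.sqrt(((a_real - b_real) ** 2).sum()))
```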

Table 1. ReID search time per probe image by quick-sort (real-value features) and counting-sort (binary codes). The latter is much faster.
Table 2. Comparing Euclidean and Hamming distance computation: Euclidean distances and longer code lengths are slower to compute.

Different from common image retrieval tasks, which are category-level matching in a closed set, ReID is instance-level matching in an open set (a zero-shot setting, ZSL). For image retrieval on ImageNet [28], the classes of the training and test sets are the same, and the visual appearances of different classes, such as dog, car and airplane, differ greatly. In contrast, the training and test ReID images have completely different ID classes without any overlap (ZSL), whilst the appearances of different persons can be very similar, differing only in subtle (fine-grained) details of clothing, body characteristics, gender, and carried objects. The ZSL and fine-grained characteristics of ReID force state-of-the-art hashing-based fast ReID models [24] to employ very long binary codes, e.g. 2048 bits, in order to retain competitive ReID accuracy. However, the binary code length significantly affects the cost of computing a Hamming distance. Table 2 shows that computing a Hamming distance between two 2048-dimensional binary codes takes \(1.7\times 10^{-5}\) s, which is \(7\times \) slower than the \(2.4\times 10^{-6}\) s needed for 32-dimensional binary codes. This motivates us to solve the following problem: how to yield higher accuracy from hashing-based ReID using shorter binary codes.

Fig. 1. A Coarse-to-Fine (CtF) hashing code search strategy to speed up ReID, where Q is a query image, \(\{G_i\}_{i=1}^{3}\) are the positive images in the gallery set, \(B=\{b_k\}_{k=1}^{N}\) are binary codes of lengths \(L=\{l_k\}_{k=1}^{N}\), and \(T=\{t_k\}_{k=2}^{N}\) are Hamming distance thresholds, where gallery images are selected by each \(t_k\) for further comparison with increasingly longer codes \(b_{k}\).

To that end, we propose a novel Coarse-to-Fine (CtF) search strategy for faster ReID whilst also retaining competitive accuracy. At test time, our model (CtF) first uses shorter codes to coarsely rank the gallery, then iteratively uses longer codes to further rank the selected top candidates, where the top-ranked candidates are defined iteratively by a set of Hamming distance thresholds. Thus, the long codes are only used for a decreasing number of matches in ranking, reducing the overall search time whilst retaining ReID accuracy. This is an intuitively straightforward idea but not easily computable for ReID, due to three difficulties: (1) Coarse-to-fine search requires multiple codes of different lengths. Computing them with multiple independent models is both time-consuming and sub-optimal. (2) The coarse ranking must be accurate enough to minimise missing true-match candidates in fine-grained ranking whilst keeping their numbers small, thus reducing the total search time. Paradoxically, shorter codes perform much worse than longer codes in the ReID task, and are therefore hard to make sufficiently accurate. (3) The set of distance thresholds guiding the coarse search affects both final accuracy and overall speed. How to determine these thresholds automatically, to optimally balance accuracy and speed, is both important and nontrivial.

In this work, we propose a novel All-in-One (AiO) framework together with a Distance Threshold Optimization (DTO) algorithm to simultaneously solve these three problems. The AiO framework simultaneously learns and enhances multiple codes of different lengths in a single model. It progressively learns the multiple codes in a pyramid structure, where the knowledge from the bottom longer code is shared with the top shorter code. We encourage shorter codes to mimic longer codes by both probability- and similarity-distillation. This makes shorter codes more powerful without importing extra teacher networks. The DTO algorithm solves a complex threshold search problem with a simple optimization process, where the balance between search accuracy and speed is controlled by a single parameter. It uses an \(F_{\beta }\) score as the optimization target, formulated in terms of Gaussian cumulative distribution functions whose parameters are estimated from the statistics of Gaussian probability distributions modeling the distances of positive and negative pairs. Finally, by maximizing the \(F_{\beta }\) score, we iteratively compute the optimal distance thresholds.

Our contributions are: (1) We propose a novel Coarse-to-Fine (CtF) search strategy for faster ReID, not only speeding up hashing-based ReID, but also improving its accuracy. To the best of our knowledge, this is the first work to introduce such a search strategy into ReID. (2) A novel All-in-One (AiO) framework is proposed to learn and enhance multiple codes of different lengths in a single framework by viewing this as a multi-channel self-distillation problem. In the framework, the multiple codes are learned in a pyramid structure, and shorter codes mimic longer codes via probability- and similarity-distillation losses. (3) A novel Distance Threshold Optimization (DTO) algorithm is proposed to find the optimal thresholds for coarse-to-fine search by reducing the threshold search task to an \(F_{\beta }\) score optimization problem. The \(F_{\beta }\) score is represented with Gaussian cumulative distribution functions, whose means and variances can be estimated by fitting a small validation set. (4) Extensive experimental results on two datasets show that our proposed method is \(50\times \) faster than non-hashing ReID methods, and \(5\times \) faster and \(8\%\) more accurate than hashing ReID methods.

2 Related Work

In this work, we address the fast ReID problem within a hashing framework by proposing an All-in-One (AiO) hashing learning module and a Distance Threshold Optimization (DTO) algorithm. Thus, we mainly discuss related work on the (non-fast) person re-identification (ReID) task, the fast ReID task, and hashing algorithms.

Person Re-identification. Person re-identification addresses the problem of matching pedestrian images across disjoint cameras [8]. The key challenges lie in the large intra-class and small inter-class variations caused by different views, poses, illuminations, and occlusions. Existing methods can be grouped into hand-crafted descriptors [21, 26, 43], metric learning methods [16, 22, 51] and deep learning algorithms [11, 32, 35,36,37,38, 50]. The goal of hand-crafted descriptors is to design robust features. Metric learning aims to make a pair of true matches have a relatively smaller distance than a pair of wrong matches in a discriminative manner. Deep learning algorithms adopt deep neural networks to directly learn robust and discriminative features in an end-to-end manner, and achieve the best performance. However, all the ReID methods above learn real-value features for high accuracy, which is slow to match and rank.

Hashing Algorithms. Hashing algorithms can be mainly divided into unsupervised and (semi-)supervised ones. Unsupervised hashing methods (LSH [6], SH [40], ITQ [19]) employ unlabeled data, or even no data. (Semi-)supervised methods (SSH [39], BRE [17], KSH [23], SDH [30], SSGAH [34]) utilize label information to improve the binary codes. Recently, inspired by powerful deep networks, deep hashing methods (CNNH [42], NINH [18], DPSH [20]) have been proposed and achieve much better performance. They usually utilize a CNN to extract meaningful features, formulate the hashing function as a fully-connected layer with a tanh/sigmoid activation function, and quantize the features with a sign function. The framework can be optimized with a relaxation of the quantization or with iterative strategies. However, all these hashing methods are designed for closed-set category-level retrieval tasks, and cannot be directly used for person ReID, an open-set fine-grained search problem.

Fast Person Re-identification. Fast ReID methods aim to search at high speed whilst obtaining accuracy as high as possible. Their main idea is hashing, which learns binary codes instead of real-value features. With binary codes, the inefficient Euclidean distance and quick-sort can be replaced by the efficient Hamming distance and counting-sort. Zheng et al. [47] learn cross-view binary codes using two hash functions for two different views. Wu et al. [41] simultaneously learn a CNN feature and hash functions to get robust yet discriminative features and similarity-preserving binary codes. CSBT [4] solves the cross-camera variation problem by employing a subspace projection to maximize intra-person similarity and inter-person discrepancy. The method in [55] integrates spatial information for discriminative features by representing horizontal parts as binary codes. ABC [24] improves binary codes by implicitly fitting the feature distribution to a pre-defined binary one with the Wasserstein distance. However, all these fast ReID methods take very long binary codes (e.g. 2048 bits) for high accuracy. Different from them, we propose a coarse-to-fine search strategy which complementarily uses codes of different lengths, obtaining not only faster speed but also higher accuracy.

3 Proposed Method

In this work, we propose a coarse-to-fine (CtF) search strategy for fast and accurate ReID. To implement the strategy effectively, we design an All-in-One (AiO) framework together with a Distance Threshold Optimization (DTO) algorithm. The former learns and enhances multiple codes of different lengths in a single framework. The latter finds the optimal distance thresholds to balance time and accuracy.

3.1 Coarse-to-Fine Search

As illustrated in the introduction, although long binary codes give high accuracy, they take much longer to compare than short codes. This motivates us to ask whether we can reduce the use of long codes to further speed up hashing-based ReID. A simple but efficient solution is to use short and long codes complementarily: shorter codes quickly return a rough rank list of the gallery, and longer codes carefully refine a small number of top candidates. Figure 1 shows the procedure; a code sketch follows below.
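
To make the procedure concrete, here is a minimal NumPy sketch of CtF search (our own illustration; the function and variable names are hypothetical, and np.argsort stands in for the counting-sort used in practice). It assumes the per-stage codes are ordered from shortest to longest and bit-packed with np.packbits:

```python
import numpy as np

def hamming(q, G):
    # q: one packed code (l/8 bytes); G: (m, l/8) packed gallery codes
    return np.unpackbits(G ^ q, axis=1).sum(axis=1)

def ctf_search(query_codes, gallery_codes, thresholds):
    n = gallery_codes[0].shape[0]
    ranking = np.arange(n)              # full rank list over the gallery
    cand = np.arange(n)                 # indices still being refined
    for k, (q, G) in enumerate(zip(query_codes, gallery_codes)):
        d = hamming(q, G[cand])
        refined = cand[np.argsort(d, kind="stable")]
        # splice the refined candidates back on top of the rank list,
        # keeping the coarse order for everything already filtered out
        rest = ranking[~np.isin(ranking, cand)]
        ranking = np.concatenate([refined, rest])
        if k < len(thresholds):
            cand = cand[d <= thresholds[k]]   # shrink the candidate set
    return ranking
```

Only the shrinking candidate set is compared with the next, longer code, which is what saves the long-code distance computations.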

Although the idea is straightforward, three difficulties prevent it from being applied to ReID. (1) Coarse-to-fine search requires multiple codes of different lengths. Computing them with multiple independent models is both time-consuming and sub-optimal. (2) The coarse ranking must be accurate enough to minimise missing true-match candidates in fine-grained ranking whilst keeping their numbers small, thus reducing the total search time. Paradoxically, shorter codes perform much worse than longer codes in the ReID task. (3) The set of distance thresholds guiding the coarse search affects both final accuracy and overall speed. How to determine these thresholds automatically, to optimally balance accuracy and speed, is both important and nontrivial. To solve these problems, we propose an All-in-One (AiO) framework and a Distance Threshold Optimization (DTO) algorithm, detailed in the next two parts.

Fig. 2. The All-in-One framework. It learns and enhances multiple codes of different lengths in a single framework with a code pyramid structure and self-distillation learning.

3.2 All-in-One Framework

The All-in-One (AiO) framework aims to simultaneously learn and enhance multiple codes of different lengths in a single model, whose architecture is shown in Fig. 2. Specifically, it first utilizes a convolutional network to extract real-value feature vectors, then learns multiple codes of different lengths in a pyramid structure, and finally enhances the codes by encouraging shorter codes to mimic longer codes via self-distillation.

Learn Multiple Codes in a Pyramid Structure. The code pyramid learns multiple codes of different lengths, where the shorter codes are derived from the longer ones. With such a structure, we can not only learn many codes in one shot, but also share the knowledge of longer codes with shorter codes. The equations are as follows:

$$\begin{aligned} v_{0} = F(x), \ \ v_{k} = FC_{k}(v_{k-1}), \ \ k \in \{1, 2, \ldots , N\}, \end{aligned}$$
(1)

where x is the input image, F is the CNN backbone, N is the number of codes, \(V = \{v_k\}_{k=1}^{N}\) are the real-value feature vectors with different lengths \(L = \{l_k\}_{k=1}^{N}\), and \(FC_k\) is a fully-connected layer with input size \(l_{k-1}\) and output size \(l_{k}\) (with \(l_0\) the dimension of the backbone feature \(v_0\)). After obtaining real-value features of different lengths, we compute their binary codes \(B = \{b_{k}\}_{k=1}^{N}\) as follows:

$$\begin{aligned} b_{k} = sgn(bn(v_{k})), \end{aligned}$$
(2)

where bn is a batch normalization layer and sgn is the sign function. We use the batch normalization layer because it normalizes the real-value features to be symmetric about 0, which reduces the quantization loss.
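
To make Eqs. (1)–(2) concrete, below is a minimal PyTorch sketch of the code pyramid head (our own illustration; the class and argument names are hypothetical, and the lengths follow the pyramid ordering, longest at the bottom). Following the relaxation described after Eq. (5), it uses tanh during training and the hard sign at test time:

```python
import torch
import torch.nn as nn

class CodePyramid(nn.Module):
    """Eqs. (1)-(2): project the backbone feature v0 through stacked FC
    layers to shorter vectors, batch-normalize, then binarize each one."""
    def __init__(self, in_dim=2048, lengths=(2048, 512, 128, 32)):
        super().__init__()
        dims = (in_dim,) + tuple(lengths)
        self.fcs = nn.ModuleList(
            [nn.Linear(dims[k], dims[k + 1]) for k in range(len(lengths))])
        self.bns = nn.ModuleList([nn.BatchNorm1d(l) for l in lengths])

    def forward(self, v0):
        codes, v = [], v0
        for fc, bn in zip(self.fcs, self.bns):
            v = fc(v)                         # v_k = FC_k(v_{k-1})
            h = bn(v)                         # center features around 0
            # tanh relaxation when training, hard sign at test time
            codes.append(torch.tanh(h) if self.training else torch.sign(h))
        return codes                          # ordered longest to shortest
```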

Enhance Codes with Self-distillation Learning. As discussed in the introduction, the coarse ranking must be accurate enough to minimise missing true-match candidates in fine-grained ranking. Inspired by [12, 33], we introduce self-distillation learning to enhance the multiple codes in a single framework without importing an extra teacher network. Different from conventional distillation models, which import a separate large teacher network to supervise a small student network, we perform distillation learning within a single network and achieve better performance, which is important for fast ReID.

Specifically, our self-distillation learning is composed of probability-distillation and similarity-distillation. The probability-distillation transfers instance-level knowledge in the form of softened class scores. Its formulation is given by

$$\begin{aligned} \mathcal {L}_{pro} = \frac{1}{N-1} \sum _{k=1} ^{N-1} \mathcal {L}_{ce}(\sigma (\frac{z_{k+1}}{T}), \sigma (\frac{\hat{z}_{k}}{T})), \end{aligned}$$
(3)

where \(\mathcal {L}_{ce}(\cdot , \cdot )\) denotes the cross-entropy loss, \(\sigma \) is the softmax function, \(z_{k+1}/\hat{z}_{k}\) are the output logits of the binary codes \(b_{k+1}/b_{k}\), the hat in \(\hat{z}_{k}\) means that it acts as the teacher and is fixed during training, and T is a temperature hyperparameter, set to 1.0 empirically. The similarity-distillation transfers relational knowledge from longer codes to shorter ones, as formulated in Eq. (4). This is motivated by the observation that, as an image search task, ReID features should also capture the relationships among samples, i.e. to what extent sample A is similar or dissimilar to sample B.

$$\begin{aligned} \mathcal {L}_{sim} = \frac{1}{N-1} \sum _{k=1}^{N-1} \sum _{i,j} ||\frac{1}{l_{k+1}}{G}^{i,j}_{k+1} - \frac{1}{l_{k}} \hat{G}^{i,j}_{k}||^2, \end{aligned}$$
(4)

where \(G^{i,j}_{k}/G^{i,j}_{k+1}\) is the Hamming distance between \(b^{i}_{k}/b^{i}_{k+1}\) and \(b^j_{k}/b^j_{k+1}\), \(b^{i/j}_{k/k+1}\) is the binary code of image \(x_i/x_j\) with length \(l_{k}/l_{k+1}\), and the hat in \(\hat{G}\) means that G acts as a label and is fixed during the optimization process, thus contributing nothing to the gradients.
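
Below is a sketch of the two distillation terms (our own illustration; function names are hypothetical). It assumes the per-code logits and tanh-relaxed codes are listed from longest to shortest, so that code k teaches code k+1 as in Eqs. (3)–(4), and replaces the Hamming-distance matrix G with an inner product, following the relaxation described after Eq. (5):

```python
import torch
import torch.nn.functional as F

def probability_distillation(logits, T=1.0):
    # Eq. (3): soft cross-entropy between softened class scores; the
    # longer code's logits are detached to act as a fixed teacher.
    loss = 0.0
    for k in range(len(logits) - 1):
        teacher = F.softmax(logits[k].detach() / T, dim=1)
        student = F.log_softmax(logits[k + 1] / T, dim=1)
        loss = loss - (teacher * student).sum(dim=1).mean()
    return loss / (len(logits) - 1)

def similarity_distillation(codes, lengths):
    # Eq. (4): pairwise similarity matrices, each normalized by its code
    # length; the longer code's matrix is detached and used as a label.
    loss = 0.0
    for k in range(len(codes) - 1):
        g_teacher = (codes[k] @ codes[k].t() / lengths[k]).detach()
        g_student = codes[k + 1] @ codes[k + 1].t() / lengths[k + 1]
        loss = loss + F.mse_loss(g_student, g_teacher, reduction="sum")
    return loss / (len(codes) - 1)
```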

Overall Objective Function and Training. Recent progress on ReID has shown the effectiveness of the classification [50] and triplet [11] losses. Thus, our final objective function combines the proposed probability- and similarity-distillation losses with the classification and triplet losses, as given in Eq. (5):

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{ce} + \mathcal {L}_{tri} + \lambda _1 \mathcal {L}_{pro} + \lambda _2 \mathcal {L}_{sim} \end{aligned}$$
(5)

Considering that the mapping function sgn in Eq. (2) is discrete and the Hamming distance in Eq. (4) is not differentiable, a natural relaxation [20] is used when optimizing Eq. (5): sgn is replaced with tanh, and the Hamming distance with the inner-product distance. Finally, our All-in-One framework can be optimized end-to-end by minimizing the loss in Eq. (5).
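
Reusing the two functions sketched above, the total objective of Eq. (5) can be assembled as follows (our own illustration; ce_loss and triplet_loss are assumed precomputed per code, and the default weights are the cross-validated values reported in Sect. 4.2):

```python
def total_loss(ce_loss, triplet_loss, logits, codes, lengths,
               lam1=1.0, lam2=1000.0):
    # Eq. (5): classification + triplet + the two distillation terms
    return (ce_loss + triplet_loss
            + lam1 * probability_distillation(logits)
            + lam2 * similarity_distillation(codes, lengths))
```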

3.3 Distance Threshold Optimization


After obtaining the multiple codes of different lengths \(B=\{b_i\}_{i=1}^{N}\), we can perform the Coarse-to-Fine (CtF) search. CtF search has two requirements, i.e. high accuracy and fast speed. For fast speed, the number of candidates returned by the coarse search should be small. For high accuracy, the candidates returned by the coarse search should include as many relevant images as possible. These two requirements naturally conflict. Thus, it is important to find the proper thresholds to optimally balance the two targets, i.e. both high accuracy and fast speed. One simple solution is brute-force search via cross-validation. However, the search space is too large. For example, with binary codes of lengths \(L = \{32, 128, 512, 2048\}\), the brute-force search requires \(\prod _{l \in L} l = 32 \times 128 \times 512 \times 2048 > 4 \times 10^{9}\) trials.

In this part, we propose a novel Distance Threshold Optimization (DTO) algorithm which replaces the time-consuming brute-force parameter search with a simple optimization process. Specifically, inspired by [9], we first explicitly formulate the two sub-targets as two scores, i.e. the precision (P) and recall (R) scores in Eq. (6). Then we balance the two sub-targets by mixing the two scores with a single parameter \(\beta \), obtaining the \(F_{\beta }\) score in Eq. (6).

$$\begin{aligned} \begin{aligned} P = \frac{TP}{TP+FP}, \ \ R = \frac{TP}{TP+FN} , \ \ F_{\beta } = (\beta ^2+1)\frac{PR}{\beta ^2P+R} \end{aligned} \end{aligned}$$
(6)

Here, TP is the number of relevant images among the candidates, FP is the number of non-relevant images among the candidates, and FN is the number of relevant samples not retrieved. The precision P is the rate of relevant images among the candidates; a high P usually means a small candidate number, which is good for fast speed. The recall R is the rate of returned relevant samples over all relevant samples; a high R means more relevant samples are returned, which is important for high accuracy. The \(F_{\beta }\) score mixes precision and recall with a parameter \(\beta \), thereby considering both speed and accuracy.

$$\begin{aligned} \begin{aligned} PDF(t) = \frac{1}{\sigma \sqrt{2\pi }} \exp \left( -\frac{(t-u)^2}{2\sigma ^2}\right) , \ \ CDF(t) = \frac{1}{2}\left( 1 + erf\frac{t-u}{\sigma \sqrt{2}}\right) \end{aligned} \end{aligned}$$
(7)
$$\begin{aligned} F_{\beta } = \frac{(\beta ^2 + 1)\,CDF^{r}}{CDF^{n} + CDF^{r} + \beta ^2} \end{aligned}$$
(8)

Considering that TP/FP/FN are statistics that cannot be directly optimized, we replace them with Gaussian cumulative distribution functions of the threshold t: substituting \(TP = CDF^{r}(t)\), \(FP = CDF^{n}(t)\) and \(FN = 1 - CDF^{r}(t)\) into Eq. (6) yields Eq. (8), where \(CDF^{r}/CDF^{n}\) are the cumulative distributions, in the form of Eq. (7) (right), of the distances of relevant (positive) and non-relevant (negative) pairs. Their parameters u and \(\sigma \) are estimated by fitting a validation set with the Gaussian probability density function in Eq. (7) (left). Finally, by maximizing \(F_{\beta }\) in Eq. (8), we obtain the optimal distance thresholds \(T=\{t_k\}_{k=2}^{N}\), balanced by \(\beta \).
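
The resulting procedure is easy to implement; below is a minimal sketch (our own illustration with hypothetical names) that fits the two Gaussians on validation-set Hamming distances and scans the integer thresholds for the maximizer of Eq. (8):

```python
import numpy as np
from scipy.stats import norm

def dto_threshold(pos_dists, neg_dists, code_len, beta=2.0):
    """Fit Gaussians to the Hamming distances of positive (relevant) and
    negative validation pairs, then return the threshold t maximizing
    the F_beta score of Eq. (8)."""
    u_r, s_r = pos_dists.mean(), pos_dists.std()
    u_n, s_n = neg_dists.mean(), neg_dists.std()
    t = np.arange(code_len + 1)             # Hamming distances are integers
    cdf_r = norm.cdf(t, u_r, s_r)           # recall term: P(d <= t | pos)
    cdf_n = norm.cdf(t, u_n, s_n)           # false-positive term
    f_beta = (beta**2 + 1) * cdf_r / (cdf_r + cdf_n + beta**2)
    return int(t[np.argmax(f_beta)])
```

A larger \(\beta \) weights recall more heavily, yielding larger thresholds, more surviving candidates, and hence higher accuracy at slower speed, consistent with the analysis in Sect. 4.6.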

4 Experiments

4.1 Dataset and Evaluation Protocols

Datasets. We extensively evaluate our proposed method on two common datasets (Market-1501 [49] and DukeMTMC-reID [52]) and one large-scale dataset (Market-1501+500k [49]). The Market-1501 dataset contains 1,501 identities observed under 6 cameras, split into 12,936 training, 3,368 query and 15,913 gallery images. Market-1501+500k enlarges the gallery of Market-1501 with an extra 500,000 distractors, making it more challenging for both accuracy and speed. DukeMTMC-reID contains 1,404 identities with 16,522 training, 2,228 query and 17,661 gallery images.

Evaluation Protocols. For accuracy, we use standard metrics including Cumulative Matching Characteristic (CMC) curves and mean average precision (mAP). All results use the single-query setting. To evaluate speed, we use the average query time per probe image, including distance computation and sorting time. For fair evaluation, we do not use any parallel algorithms for distance computation or sorting.

4.2 Implementation Details

We implemented our method in PyTorch on a PC with 2.6 GHz Intel Core i5 CPUs, 10 GB memory, and an NVIDIA RTX 2080Ti GPU. For fair comparison, and following [24, 25], we use ResNet50 [10] as the CNN backbone. In the training stage, each image is resized to \(256\times 128\) and augmented by horizontal flipping and random erasing [53]. A batch contains 64 images from 16 different persons, with 4 images per person. The lengths \(L=\{l_k\}_{k=1}^{N}\) of the multiple codes are empirically set to \(\{32, 128, 512, 2048\}\). The margin of the triplet loss in Eq. (5) is 0.3. The framework is optimized by Adam [15] for 120 epochs in total. The initial learning rate is 0.00035, warmed up for 10 epochs and decayed to \(0.1\times \) and \(0.01\times \) of its initial value at epochs 40 and 70. We randomly split the training data into a training and a validation set with a 6 : 4 ratio, decide the hyperparameters via cross-validation, and then train our method on all training data. \(\lambda _1\) and \(\lambda _2\) in Eq. (5) are set to 1.0 and 1,000, and \(\beta \) in Eq. (8) is set to 2.0; these three parameters are decided via cross-validation. Code is available on GitHub.
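
For reference, the learning-rate schedule above corresponds to the following sketch (our own illustration; the dummy `model` stands in for the AiO network):

```python
import torch

model = torch.nn.Linear(2048, 32)   # placeholder for the AiO network

def lr_factor(epoch):
    # linear warm-up for 10 epochs, then decay to 0.1x / 0.01x at 40 / 70
    if epoch < 10:
        return (epoch + 1) / 10
    if epoch < 40:
        return 1.0
    return 0.1 if epoch < 70 else 0.01

optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
for epoch in range(120):
    ...  # one training epoch over batches of 64 images (16 ids x 4 each)
    scheduler.step()
```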

Table 3. Comparisons with non-hashing ReID methods using real-value features of different lengths on Market-1501 and DukeMTMC-reID. \(\mathbf {B}\): binary code, \(\mathbf {R}\): real-value feature. Longer real-value features give higher accuracy but slower query speed. Our model CtF (including AiO) has a very fast query speed (two orders of magnitude faster) and accuracy comparable with non-hashing ReID methods.

4.3 Comparisons with Non-hashing ReID Methods

Non-hashing ReID methods use long real-value features, such as 2048-dimensional float64 features, for better accuracy. This significantly affects their speed, i.e. query time. Table 3 shows that our proposed CtF (including AiO) method is significantly (two orders of magnitude) faster than non-hashing ReID methods. CtF also achieves very competitive accuracy, with Rank-1 scores close to those of the best non-hashing ReID method BoT [25] (93.7% vs. 94.1% on Market-1501; 87.6% vs. 86.4% on DukeMTMC-reID), and better than all the other non-hashing methods with various feature lengths, of which 5 methods have features no longer than 2,048 (PSE [29], IDE [50], PN-GAN [27], CamStyle [54], PIE [48]) and 3 methods have features longer than 10,240 (SPReID [14], PCB [32], VPM [31]). Overall, a longer feature usually contributes to higher accuracy but slower speed. For example, SPReID, PCB and VPM use features longer than 10,240 and achieve \(92\%\)-\(93\%\) and \(83\%\)-\(84\%\) Rank-1 scores on Market-1501 and DukeMTMC-reID, respectively. The others use features no longer than 2,048, achieving Rank-1 scores below \(92\%\) and \(80\%\). On the other hand, the query speed of the methods with long features is much slower. For example, PCB takes 6.9 s and 6.3 s to query each image on the two datasets respectively, which is 3-\(4\times \) slower than IDE at about 2 s on either dataset. In summary, CtF is much faster than non-hashing methods whilst achieving accuracy comparable to the best of them: CtF achieves \(93.7\%/87.6\%\) Rank-1 scores on Market-1501/DukeMTMC-reID, compared to BoT at \(94.1\%/86.4\%\). This is because CtF (including AiO) combines the all-in-one framework with the coarse-to-fine search strategy, which not only learns powerful binary codes, but also complementarily uses short and long codes for both high accuracy and fast speed.

Table 4. Comparisons with state-of-the-art hashing ReID methods on Market-1501 and DukeMTMC-reID. AiO+k means learning multiple codes with the all-in-one framework but querying with only the code of length \(l_{k}\). AiO+CtF not only learns multiple codes with the all-in-one framework, but also queries with the coarse-to-fine search strategy. Our AiO+CtF achieves a good balance between accuracy and speed.

4.4 Comparisons with Hashing ReID Methods

Hashing ReID methods learn binary codes using a hashing algorithm. Binary codes are good for speed but sacrifice accuracy. To mitigate this problem, state-of-the-art hashing ReID methods usually employ long codes such as 2048 bits; in binary coding, 2048 is very long compared to the more commonly used length of 512, unlike for the real-value feature lengths compared above. Table 4 shows that CtF (with AiO) not only achieves the best accuracy (even compared to other hashing methods using much shorter codes), but is also significantly faster than existing hashing ReID methods (even compared to other hashing methods using the same code length). Overall, hashing ReID methods usually perform much worse than non-hashing methods. For example, the best non-hashing ReID method achieves \(94.1\%\) and \(86.4\%\) Rank-1 scores on Market-1501 and DukeMTMC-reID respectively, whilst the best hashing ReID method only obtains \(81.4\%\) and \(82.5\%\). Moreover, existing hashing ReID models can increase accuracy by using longer codes at the cost of speed. For example, ABC with 512-dimensional binary codes achieves \(69.4\%/69.9\%\) Rank-1 scores at \(9.8/7.5\times 10^{-2}\) s query time per probe image. With 2048-bit binary codes, its Rank-1 scores increase to \(81.4\%/82.5\%\) whilst its query time slows down to \(2.8/2.0\times 10^{-1}\) s. This observation is also verified by our method CtF (with AiO) using different code lengths. Importantly, CtF (with AiO) significantly outperforms all existing hashing ReID methods in terms of both accuracy (Rank-1 12.3% or 5.1% better) and speed (\(5\times \) faster). Specifically, CtF with AiO achieves accuracy very close to AiO without CtF at the 2048 code length, whilst yielding a speed advantage comparable to the much shorter 128-bit binary code: CtF obtains \(93.7\%\) and \(87.6\%\) Rank-1 scores, similar to AiO without CtF at a fixed 2048 length with \(93.7\%\) and \(87.7\%\).

Fig. 3. Experimental results on the large-scale ReID dataset Market-1501+500k. Our Coarse-to-Fine (CtF) search achieves accuracy comparable with a non-hashing ReID method using long codes, and speed comparable with a hashing ReID method using short codes.

4.5 Evaluation on Large-Scale ReID

Gallery size significantly affects ReID search accuracy and speed. To show the effectiveness of our proposed Coarse-to-Fine (CtF) search strategy, we evaluated it on the large-scale ReID dataset Market-1501+500k, which is based on Market-1501 and enlarged with 500,000 distractors. The experimental results are shown in Fig. 3. We observe the following.

Firstly, as the gallery size increases, the Rank-1 and mAP scores of all methods decrease, and the ReID speed per probe image gradually slows down. The reason is that a larger gallery is more likely to contain difficult samples, which make the ReID search more challenging. Also, the extra gallery images significantly increase the time for computing all the distance comparisons and sorting required for each probe image. Secondly, the non-hashing method with 2048-D real-value features achieves the best accuracy but the slowest speed. This is because real-value features are more discriminative but slow to compare and sort. Thirdly, among hashing ReID methods, the 2048-D binary code obtains ReID accuracy comparable to that of the non-hashing model, but \(10\times \) faster, because Hamming distances and counting-sort are faster to compute. The 32-D binary code is a further \(5\times \) faster than the 2048-D binary code, but its accuracy drops dramatically. Finally, the proposed CtF model achieves accuracy comparable to that of the non-hashing method with speed similar to that of a hashing ReID method using 32-D binary codes. Critically, this advantage is independent of the gallery size. Overall, these experiments demonstrate the effectiveness of CtF for large-scale ReID.

Table 5. Analysis of the All-in-One (AiO) framework. CP: learning multiple codes in a pyramid structure (otherwise with separate models). SD: enhancing binary codes via self-distillation. \(\mathbf {B}\) and \(\mathbf {R}\) denote binary codes and real-value features, respectively.

4.6 Model Analysis

Analysis of AiO. The All-in-One (AiO) framework aims to learn and enhance multiple codes of different lengths in a single model, using a code pyramid (CP) structure and self-distillation (SD) learning. Results are shown in Table 5. Firstly, longer codes contribute to better accuracy. This holds in all settings, regardless of whether CP or SD is used and of the code type. Secondly, for short codes, real-value features are much better than binary ones, whilst for long codes they obtain similar accuracy. For example, the 32-dimensional real-value feature obtains an \(82.7\%\) Rank-1 score, outperforming the 32-dimensional binary code, which achieves only \(25.5\%\), by over \(57\%\). But at the 2048 code length, binary codes and real-value features both achieve approximately \(94\%\) Rank-1 and \(84\%\) mAP. This suggests that the quantization loss of short codes is significantly larger than that of long codes. Thirdly, learning with the code pyramid (CP) structure and self-distillation (SD) improves short codes significantly. For example, CP+SD boosts the 32-dimensional binary codes from \(25.5\%\) to \(60.0\%\) Rank-1, a \(34.5\%\) gain. It is evident that both the code pyramid (CP) structure and self-distillation (SD) learning contribute to the effectiveness of the coarse-to-fine (CtF) search strategy, and significantly improve model performance.

Fig. 4. Accuracy and speed controlled by \(\beta \). As \(\beta \) increases, accuracy increases and speed gradually slows down.

Analysis of DTO. We further analyzed the parameter \(\beta \) of the Distance Threshold Optimization (DTO) algorithm, which controls the balance between ReID accuracy and speed. Figure 4 shows model accuracy and speed for different \(\beta \) values on Market-1501 and DukeMTMC-reID. Firstly, it is evident that \(\beta \) gives good control over accuracy and speed: increasing \(\beta \) slows down the search but improves accuracy. For example, when \(\beta =10^{-2}\), ReID is fastest, at approximately 0.03 and 0.02 s per probe image on Market-1501 and DukeMTMC-reID, but the mAP scores are only about \(40\%\) and \(30\%\). In contrast, \(\beta =10^{1}\) gives high mAP scores of \(85\%\) and \(75\%\), but the query speed is \(5\times \) slower at approximately 0.1 and 0.2 s. Secondly, when \(\beta \) is close to \(10^{0}\), Rank-1 and mAP almost peak whilst maintaining a good balance with speed.

5 Conclusion

In this work, we proposed a novel Coarse-to-Fine (CtF) search strategy for faster person re-identification that also improves accuracy over conventional hashing-based ReID. Extensive experiments show that our method is \(5\times \) faster than existing hashing ReID methods and achieves accuracy comparable to non-hashing ReID models that are \(50\times \) slower.