HOB-net: high-order block network via deep metric learning for person re-identification

Chen, Dongyue; Wu, Pengfei; Jia, Tong; Xu, Fangbin

doi:10.1007/s10489-021-02450-y

HOB-net: high-order block network via deep metric learning for person re-identification

Published: 29 July 2021

Volume 52, pages 4844–4857, (2022)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Applied Intelligence Aims and scope Submit manuscript

HOB-net: high-order block network via deep metric learning for person re-identification

Download PDF

Dongyue Chen¹,
Pengfei Wu¹,
Tong Jia¹ &
…
Fangbin Xu¹

549 Accesses
6 Citations
Explore all metrics

Abstract

Learning effective feature representations with deep convolutional neural network (CNN) and metric learning methods to distinguish pedestrians is the key to the success of recent advances in person re-identification (re-ID) tasks. However, the features provided by the common CNN network are not strong enough to distinguish between similar subjects because these common features describe only the scattered local patterns while neglecting the correlation and combinations between them. In fact, the high-order correlations of common features can be significant to the recognition of identities. In this paper, to tackle this problem, we design a flexible high-order block (HOB) module and a scheme of deep metric learning to produce the high-order representations of deep features for the re-identification of pedestrians. Extensive experiments prove the superiority of our proposed HOB module for person re-ID issue. On three large-scale datasets, including Market-1501, DukeMTMC-ReID, and CUHK03-NP, the HOB-net method achieves the competitive results with the state-of-the-arts, particularly in the mAP.

Person Re-identification by Multiple Feature Representations and Metric Learning

A Simple Deep Feature Representation for Person Re-identification

A siamese pedestrian alignment network for person re-identification

Article 30 August 2021

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Person re-identification is a branch of image retrieval, which aims to identify the same person from multiple detected pedestrian images, typically captured from different cameras without view overlap [1]. In the last few decades, with the development of human recognition and image restoration technology in various real-world scenarios, such as video surveillance, pedestrian detection, and video restoration [2, 3], the algorithm of person re-ID has accomplished quite a lot of important applications in the intelligent security system, individual tracking, and smart mall system, etc. Although the re-ID issue has already achieved dramatic progress with the utilization of CNN, it is still a challenging problem with a few aspects remaining to be bettered due to the complexity of pedestrians in real-world scenarios, e.g. pose changes, environment illumination, and view angle changes, etc. As shown in the Fig. 1.

Affected by the aforementioned factors, the feature representations of pedestrians obtained by the CNN extractor actually cannot represent the input images exactly. Therefore, finding a group of pedestrian representations with good anti-interference, invariance, and distinguish ability has become the key for re-ID issue. For this purpose, many works [4, 5] choose to localize different body parts and align their associated features, while many other works [6, 7] use spatial or channel-based attention selection network to improve feature learning. However, all of the above works are confined by first-order occurrences and just mine simple and coarse information, which cannot be well applied to model person in re-ID cases. Considering the subtle differences among pedestrians caused by complexity in real-world scenarios, a simple representation based on the first-order features is obviously insufficient to capture the interactions of visual parts. As a result, the extracted feature presentations of pedestrians are not discriminative enough for the target task.

In this paper, we propose a flexible and powerful feature extraction module, referred to as High-Order Block (HOB), to extract high-order statistics from the deep features provided by the backbone network. We dedicate to modeling the deep metric learning mechanism via high-order statistics so as to capture the relationships among training samples and produce refined feature representations for pedestrians. To this end, we design a HOB-based network (HOB-net) to upgrade the quality of features by integrating the high-order statistics information into representations of input images.

Our main contributions of the paper are summarized as follows:

First of all, based on the comparison between three different layouts of convolutional layers, the High-Order Block (HOB) module with the new architecture is proposed as an embedding part of the backbone network to extract high-order statistics from the deep features.
Second, a new feature extraction network based on the HOB module in multiple orders is proposed, with which the effectiveness of different combinations between HOB module and loss functions are explored to find out their best scheme for the re-ID task.
At last, through extensive experiments, we prove the superiority of the proposed HOB-net over a wide range of state-of-the-art re-ID models on three large benchmarks, i.e. DukeMTMC-ReID [8], Market-1501 [9] and CUHK03-NP [10].

2 Related work

2.1 Backbone architecture for re-ID

One of the key issues of re-ID model based on deep learning is the architecture of feature extraction deep network. At present, many advanced architectures of feature networks have been presented, including strip-based, attention-based, and spatial deformation, etc. Stripe-based methods [4, 5, 11, 12] aggregate the salient local features from different body parts and global cues together to improve the representation. In which, [4] split the input feature map horizontally into a fixed number of strips, from which local features are aggregated. Attention-based methods [6, 13,14,15,16] enhanced the feature representations with attention mechanism, which guides the feature-extract network to capture and focus on attentive regions, to handle the imperfect detection of the bounding box and the misalignment of body parts. Besides, spatial deformation methods [17, 18] introduce a spatial transformer network (STN) to align pedestrians or introduce generative adversarial network (GAN) to generate standard posture to alleviate the influence of pose variance. All the three methods above boost features based on new fashions of spatial arrangement while ignoring the representation ability and distinguish ability of the features themselves in nature. Consequently, these methods are limited to distinguish and recognize very similar targets. Hence, finding more detailed, comprehensive, and distinctive features by digging into the structure and learning methods of the deep network is a promising direction in the researches on re-ID issue.

2.2 High-order metric learning

Inspired by some impressive works on fine-grained image classification [19,20,21,22,23], we can obtain the features with stronger capacity of the representation and discrimination, namely high-order metric learning. High-order metric learning is to change the fixed first-order distance measure between feature and loss function into parameterized high-order distance measure. And the parameters of high-order distance measure can be learned in the process of optimization. Furthermore, through gradient back-propagation, the feature representation can get more refined learning based on high-order measurement, so as to improve their ability to distinguish between different targets with the first-order similarity but insufficient high-order similarity. Many recent works [23] about fine-grained visual categorization and large-scale visual recognition tasks have demonstrated that second-order statistics have better performance than descriptors exploiting first-order statistics. However, only using second-order or lower moments information might not be enough when the feature distribution is not Gaussian [21]. Naturally, the higher-order (greater than two) statistics have been explored in many works [19,20,21,22]. Among them, [21] utilizes the third-order statistics for person re-ID. [19, 20, 22] exploit higher-order statistics for visual concept detection and fine-grained visual categorization. Although there are some differences between fine-grained classification and re-ID in task requirements, we can transform high-order metric learning into high-order feature learning by adjusting and improving the structure and loss function of the backbone network of the re-ID model. Thus, it is expected to provide high-order features with better discrimination ability for person re-ID task.

2.3 Loss function and learning strategy for re-ID

In addition to the network structure, feature learning still largely depends on loss function and training strategy. The focus includes the mathematical form of the loss function and the corresponding sampling strategy. On the mathematic form, many works proposed the loss function based on distance metrics, such as triplet loss [24], ranked list loss [25], etc. The principle of all loss function design is to better match the basic functional requirements. However, the above loss functions only take into account the distribution of the representations when designing a loss function. Finally, ensemble methods [26,27,28] have become an increasingly popular way of improving the performances of deep metric learning (DML) architectures. On the learning strategy, many works aim to mine samples that improve robustness, such as hard triplet mining [29], margin sample mining [30], etc. Since deep metric learning methods are sensitive to the samples of pairs, selecting suitable samples to train the model by mining strategy is shown to be effective. Furthermore, to learn high-order features through deep networks, it is necessary to select and design a suitable loss function and sampling strategy according to the characteristics of high-order metric learning.

The above works provide a theoretical basis on the backbone architecture design, high-order statistics computation, and metric learning strategy. In which, the works in Section 2.1 focus our research on the backbone architecture. And works in Section 2.2 prompt us to exploit higher-order statistics for re-ID task. The loss function and learning strategy in Section 2.3 make it possible to learn high-order person representations through high-order metric learning.

3 Proposed approach

In this section, we will first describe the pipeline of the proposed method in Section 3.1, then detail the proposed High-order Block module in Section 3.2, finally show the design of loss function for the overall framework in Section 3.3.

3.1 Network architecture

As described in Fig. 2, a person image patch is fed into a plain CNN to extract a deep feature map of size h × w × d, where h and w are the height and width of the feature map respectively, and d is the number of feature channels. Following standard first-order DML practices, in the training stage, these features are aggregated using Global Average Pooling (GAP) to build a representation which is projected into an embedding space through the first-order block directly to optimize both the classification loss and the similarity loss. Within this pipeline, however, these representations without high-order metric learning are relatively coarse and are unable to capture the complex interactions among different parts, resulting in less discriminative in the deep features and scattering distribution in the deep embedding space. To this end, we dedicate to modeling with high-order statistics.

In our proposed HOB-net, we directly modulate the embedding feature space by minimizing the high-order distance between samples with the same person-ID while maximizing the high-order distance between samples with different IDs. By ”high-order distance”, we mean a distance-like metric that can be approximated by calculating some high-order moments. Combining all these high-order metrics together, a DML-based loss function can be computed and combined with the classification loss to guide the learning of all the features representations through the proposed feature network in an end-to-end fashion. During training, as shown in Fig. 2, the deep metric learning loss is applied on both the first-order branch and the high-order branch, and the classification loss is applied only on the first-order branch. During testing, the features from both branches are concatenated into a long plain vector to describe the testing image patch of the person.

3.2 High-order computation

Suppose that a person image I is passed by a plain CNN, and the corresponding 3D feature tensor of the output convolutional layer is denoted by x ∈ R^h×w×d. Most of the existing person re-ID models turn the feature tensor x into a vector φ(x) ∈R^d using global average pooling and then transform it into a new vector v ∈R^l through a fully connected (FC) layer. Technically, it is expected that the representation vector v should focus on the local features which keep invariant within the same person while distinguishing between different persons. Going further along with this approach, we consider that the high-order relationships between local features could be a better representation than the first-order features. Therefore, we design a group of high-order mapping modules f_k(x) to produce multiple feature vectors in different orders.

In practice, the dimension of the high-order features of an input image is extremely high and the corresponding calculation cost is too large to implement in real applications. Fortunately, for the person re-ID task, it only needs to calculate the similarity of the high-order feature vectors between a query image Q and the image S to be matched. If it can be proved mathematically that the inner product of the high-order feature vectors between Q and S can be approximated by the inner product of two relatively lower-dimensional vectors, we will only need to design a network to extract the lower-dimensional features as image representation. In fact, the RM algorithm [31] provides an effective solution to this problem, which relies on a set of random projectors to approximate the inner product between the k th-order moments of two basic feature vectors x,y ∈R^l, which is denoted as $\left \langle \textbf {x},\textbf {y}\right \rangle ^{k}$.

$$ \begin{array}{@{}rcl@{}} \left\langle\textbf{x},\textbf{y}\right\rangle^{k} &=&\left\langle\underbrace{\textbf{x}\otimes {\cdots} \otimes \textbf{x}}_{k\ times},\underbrace{\textbf{y}\otimes {\cdots} \otimes \textbf{y}}_{k\ times}\right\rangle \\&=&E_{w_{1},{\dots} ,w_{k} \sim p_{w}} [{{\varPhi}}_{k}(\textbf{x}){{\varPhi}}_{k}(\textbf{y})] \end{array} $$

(1)

where ⊗ is the Kronecker product, $E_{w_{1},{\dots } ,w_{k} \sim p_{w}}$ is the expectation over the random projectors $w_{1},{\dots } ,w_{k}\in \Re ^{l}$, whose elements follow the uniform distribution p_w in the interval of [− 1,+ 1], and Φ_k(x) ∈R is defined as follows:

$$ {{\varPhi}}_{k}(\textbf{x})= {\prod\limits_{i}^{k}}\left\langle w_{i},\textbf{x} \right\rangle $$

(2)

According to the works in [20], the k th-order inner product $\left \langle \textbf {x},\textbf {y}\right \rangle ^{k}$ can be further approximated to the inner product between two lower-dimensional vectors ψ_k(x),ψ_k(y) ∈R^s,s ≪ l^k

$$ \left\langle\textbf{x},\textbf{y}\right\rangle^{k}\approx \frac{1}{s}\left\langle \psi_{k}(\textbf{x}),\psi_{k}(\textbf{y})\right\rangle $$

(3)

Using Tensor Tucker Decomposition [32], ψ_k(x) can be written as:

$$ \psi_{k}(\textbf{x})=(W_{k,1}^{T}\textbf{x})\odot (W_{k,2}^{T}\textbf{x})\odot {\cdots} \odot (W_{k,k}^{T}\textbf{x}) $$

(4)

in which, $W_{k,i}\in \Re ^{l\times s}, i=1,2,\dots ,k$ are a series of random matrices sampled independently, ⊙ represents the Hadamard (element-wise) product. In this case, the feature x in (4) is exactly a 3D tensor in the size of l = h × w × d, so the calculation of $W_{k,i}^{T}x, i=1,2,\dots ,k$ can be evaluated by feeding the 3D tensor x ∈R^h×w×d into multiple individual convolutional layers respectively.

Actually, two typical architectures have been used in relative works to evaluate ψ_k(x), as shown in Fig. 3b and c. The former applies a cascade architecture that evaluates the k th-order component based on the k − 1 order gradually [20]:

$$ \psi_{k}(\textbf{x})=\psi_{k-1}(\textbf{x}) \odot (W_{k,k}^{T}\textbf{x}) $$

(5)

Equation (5) can be regarded as a particular case of (4) under the hypothesis that W_p,k = W_q,k,∀q,p ≥ k. For the duplicate architecture given by Fig. 3c, it is supposed that [22]:

$$ \psi_{k}(\textbf{x})=(\underbrace{W_{k,k}^{T}\textbf{x}) \odot (W_{k,k}^{T}\textbf{x})\odot\cdots\odot(W_{k,k}^{T}\textbf{x}}_{k\ times}) $$

(6)

That means all the random matrices $W_{k,i}^{T}x, i=1,2,\dots ,k$ for the k th-order moment are supposed to be the same, namely $W_{k,k}^{T}x, \forall i\leq k$. The above two architectures can reduce the number of trainable parameters and save the computational cost. However, it also lowers the ability of the network to approximate the high-order moments. Considering the challenge of pedestrian Re-ID task, we still employ the original result of Tensor Tucker Decomposition [32] according to (4). In other words, each convolutional layer corresponding to the parameter matrix is learned independently in the fashion of end-to-end from the training data, and then the output tensors of all the convolutional layers at the same order are multiplied element-wise, as shown as Fig. 3a.

In comparison, suppose that the maximal order is K, both the cascade and duplicate architectures need K parameter matrices while our model uses K(K + 1)/2 matrices. That means our model introduces more trainable weights to improve the performance of the network. Although more parameters will increase computation cost, to avoid over fitting in practical problems, the maximal order K is usually not large. Experimental results have revealed that the performance of the network starts to saturate at K = 6, which means that the ratio of the trainable parameters between our model and the other two models is about 15:6. The additional calculation cost is completely acceptable.

At last, the Hadamard product of all the convolutional layers’ outputs in the k th-order block is transformed to a feature vector through the modules of GAP and a FC layer, as illustrated in Fig. 4. The final feature vector of the k th-order block is denoted as $\textbf {f}_{k}(\textbf {x}),k=1,2,\dots ,K$.

3.3 Loss functions

With the proposed architecture of the HOB-net, we compute a DML-based loss function on each of the high-order moments using the empirical estimator, such that similar (respectively dissimilar) images have similar (respectively dissimilar) higher-order moments:

$$ \begin{aligned} L_{DML}(I,J)&=\sum\limits_{k=1}^{K}L_{k}\left( E_{x\sim I}[\textbf{f}_{k}(\textbf{x}),\textbf{f}_{k}(\textbf{y})]\right)\\ &=\sum\limits_{k=1}^{K}L_{k}\left( \frac{1}{|I|}\underset{x_{i}}{\sum}^{}\textbf{f}_{k}(\textbf{x}_{i}),\frac{1}{|J|}\underset{x_{j}}{\sum}^{}\textbf{f}_{k}(\textbf{x}_{j})\right) \end{aligned} $$

(7)

where x_i and x_j are the sets of deep features exacted from person images I and J. As describe in Section 3.1, in our practical application for the classification and DML, we utilize the softmax loss and the batch-hard triplet loss proposed in [29] respectively. We randomly sample P identities and N instances for each identity in each mini-batch to meet the requirement of the batch-hard triplet loss. Typically, the loss function

$$ \begin{array}{@{}rcl@{}} L_{triplet}&=&\sum\limits_{i=1}^{P}\sum\limits_{a=1}^{N} \left( \vphantom{\underset{\underset{\underset{n\neq a,j\neq i}{j=1,\dots,P;}}{{n=1,\dots,N;}}}{\min}} \alpha +\displaystyle\underset{\underset{p\neq a}{p=1,\dots,N;}}{\max}D\left( f(x_{a}^{(i)}),f(x_{p}^{(i)})\right) \right.\\ &&\left.-\displaystyle\underset{\underset{\underset{n\neq a,j\neq i}{j=1,\dots,P;}}{{n=1,\dots,N;}}}{\min}D\left( f(x_{a}^{(i)}),f(x_{n}^{(j)})\right) \right)_{+} \end{array} $$

(8)

where $f(x_{a}^{(i)},f(x_{p}^{(i)}),f(x_{n}^{(j)})$ are the features extracted from the anchor, positive, and negative samples respectively. D(x,y) computes the Euclidean distance between vectors x and y, and α is the margin hyper parameter. In addition to batch-hard triplet loss, we employ softmax cross entropy loss for the discriminative learning of the model as well, which can be formulated as follows:

$$ L_{softmax}=-\sum\limits_{i=1}^{P}\sum\limits_{a=1}^{N}log\frac{e^{\phi_{i,a,y_{a,i}}}}{{\sum}_{k=1}^{C}e^{\phi_{i,a,k}}} $$

(9)

where y_a,i is the ground truth identity of the a-th sample of the i-th identity in the mini-batch, ϕ_i,a,k denotes the k-th element of the corresponding feature, and C is total number of person identities. Then the overall loss function for optimization is the combination of softmax loss and batch-hard triplet loss. Specifically, according to the following experimental analysis, we employ batch-hard triplet loss for all orders but softmax cross entropy loss only for the first order. Therefore, the overall loss function for the HOB-net is formulated as follows:

$$ L_{HOB-net}=L_{softmax} + \upbeta\cdot L_{triplet} $$

(10)

where β ∈ [0,1] is the scale factor of L_triplet. By minimizing the overall loss function given by (10), the proposed HOB-net can be trained well to provide the multi-order features which are used to measure the similarity between the query images and the images to be matched.

4 Experiments

4.1 Datasets

Our approach is evaluated on three popular used re-ID datasets: Market-1501, DukeMTMC-ReID and CUHK03-NP.

Market-1501 is a large-scale person re-ID dataset collected from six cameras, which contains 32668 annotated images of 1501 identities. For evaluation, there are 12936 training images of 751 identities and 19732 testing images of 750 identities. Gallery and query sets have 19,732 and 3,368 images respectively with another 750 identities.

DukeMTMC-ReID is a subset of DukeMTMC, which is a multi-target, multi-camera pedestrian tracking dataset. It includes 36411 bounding boxes of 1404 identities. We divide the dataset into 16522 images of 702 identities for training and 17661 images of the other 702 identities for testing. The 2208 query images are picked from 17661 gallery images set.

CUHK03-NP is a new training-testing dataset, following the new protocol. It splits the CUHK03 into two subsets which contain labeled (by human) and detected (by a person detector) person images. The detected set includes 7365 training images from two cameras, 1,400 query images and 5,332 gallery images. The labeled set contains 7,368 training images, 1,400 query and 5,328 gallery images respectively. And according to the new protocol, their training and testing sets are split into 767 and 700 identities.

4.2 Experimental setting

Implementation details

As described in Section 3.3, The proposed HOB network is constructed based on the architecture of ResNet50 [33]. We train both the baseline model and our model according to the strategy presented in [34]. Specifically, we keep the aspect ratio of all images and resize them to 288 × 144. Two data augmentation methods, random cropping and random horizontal flipping, are employed during training. To meet the requirement of hard-batch triplet loss, each mini-batch is sampled with randomly selected P = 16 identities and randomly sampled N = 8 images for each identity from the training set, so that the mini-batch size is 128. The Warmup learning rate is applied to bootstrap the network. The initial learning rate is set to 0.00035, which increases linearly to 0.0035 in 10 epochs and decreases by 0.1 at the 40th epoch and the 70th epoch respectively. We use the Adam solver to optimize the parameters for a total of 120 epochs. Our model is implemented on the Pytorch platform and trained with one NVIDIA 2080Ti GPU. All our experiments on different datasets follow the same settings as above.

Evaluation metrics

In the test phase, to take advantage of the high-order moments, the final feature vectors are concatenated to form the final representation. Then, the metrics of cumulative matching characteristic (CMC) and mean average precision (mAP) are used to evaluate the overall performance of our model. For the sake of fairness, the re-ranking tricks are not adopted.

4.3 Comparison with the state-of-the-art methods

We present the superiority of our method by comparing our results with the state-of-the-arts in Tables 1 and 2, in which all methods have been divided into different types, including mask-guided [1, 35,36,37], stripe-based [4, 5, 11, 12], attention-based [6, 13,14,15,16], GAN-based [38], and Global feature-based [33, 39,40,41,42,43,44,45,46]. And for a fair comparison, we select the ResNet50-based backbone models, and the top two results are shown in red and blue respectively. From these two tables, it is noticeable that our HOB methods can significantly improve the performances over the baseline methods (e.g. comparing with baseline, HOB-6 achieve 4.5%/5.1% improvements of R-1/mAP on Market-1501 and 8.9%/11.4% improvements of R-1/mAP on DukeMTMC-ReID). Moreover, compared with other recent methods of different types, our method achieves state-of-the-art performance on both Market-1501 and DukeMTMC-ReID datasets. For CUHK03-NP dataset, our method achieves competitive results with the state-of-the-arts, particularly in the mAP. The above results prove the superiority of our HOB method.

Table 1 Comparison of the proposed method with the art on Market-1501 and DukeMTMC-ReID

Full size table

Table 2 Comparison of the proposed method with the art on CUHK03-NP. The best/second results are shown in red/blue respectively

Full size table

4.4 Component analysis

Analysis on the maximal order

In order to find the proper maximal order of the HOB module for person re-ID task, we conduct comparisons between HOB networks with the same backbone and loss functions but different maximal orders. The results on CUHK03, DukeMTMC-ReID, and Market-1501 datasets are shown in Fig. 5. From the results, it is noticeable that all the HOB modules in different orders can significantly improve the performances over the baseline methods. Concretely, comparing the HOB-2 with the baseline, there are 3.9%/6.2% on Market-1501, 7.1%/4.8% on DukeMTMC-ReID and 3.0%/4.4% on CUHK03-L improvements of R-1/mAP respectively. Moreover, it can be observed that the higher-order HOB modules can boost the capability of feature representations and achieve better performances. Specifically, e.g. on CUHK03-L, the scores of R-1/mAP enhance by 5.3%/2.1% from 64.1%/59.0 to 69.4%/66.5% when the order of HOB module increases from 2 to 6. The similar improvements can be found on the other two datasets. The experimental results show that employing higher-order HOB modules benefits to the ability of recognizing the person identities. Besides, the CMC and mAP results of HOB-net over all the three benchmarks demonstrate the effectiveness of our method. As a complementary account, when the further increase in the order of HOB module, e.g. 7 or 8, the performance improvements are few. Even with the higher order, the performance may start to decline. In this case, the number of network parameters increase sharply, which will increase the burden of model in training phase. Therefore, we select 6 as the proper maximal order of our proposed HOB module.

Analysis on the combined loss function

Since loss function has a dramatic impact on the final performance of the re-ID models, comparisons between different combinations of loss functions are conducted to find out the best solution. More specifically, with the fixed maximal order, and Softmax and Triplet loss as the candidates of the loss functions, the corresponding CMC and mAP results on DukeMTMC-ReID and Market-1501 datasets have shown in Table 3, from which it can be seen clearly that the combination of “Softmax +Triplet” for the first-order and ”Triplet” for the high-orders achieves the best performances. Specifically, the way of using only Softmax loss or Triplet loss reaches 90.6% or 93.0% of R-1 respectively on Market-1501 dataset. The combinations of both the Softmax loss and Triplet loss can significantly improve the performances, up to over 93.4% of R-1. Moreover, there is also a gap between different combinations. As shown in Table 3, the combination of Softmax loss with the first-order and Triplet loss with the high-order modules can achieve 94.3% of R-1 on Market-1501 dataset. And when assigning Triplet loss to the first-order, it is up to 94.7%. However, the score is down to 94.1% of R-1 when the high-order modules are allotted with Softmax loss. A similar degradation of performance can be also observed on the DukeMTMC-ReID dataset, which indicates the alliance between the high-order modules and deep metric learning loss functions (e.g. triplet loss) benefits to the enhancement of the final performance.

Table 3 Effect of the combination mode about loss function

Full size table

Analysis on architecture of HOB module

As illustrated in Fig. 3, there are three different architectures that can be used to construct the HOB module. Therefore, we carried out some experiments to validate the superiority of our proposal (shown in Fig. 3a). The CMC and mAP results on DukeMTMC-ReID and Market-1501 datasets are shown in Fig. 6, where our proposal achieves the best results, reaching 94.7%/86.3% and 88.2%/77.2% of R-1/mAP on the two datasets respectively. The cascade architecture comes second, reaching 94.4%/86.0% and 87.3%/76.2% of R-1/mAP. And the duplicate architecture obtains the result of 94.2%/85.8% and 87.0%/74.6% of R-1/mAP. Comparing with the other two methods, cascade architecture is the most concise architecture, which limits the number of trainable weights. On the contrary, our module provides the maximal number of trainable weights among the three ones due to the complicated architecture, which enhances the nonlinear expressive ability of the whole network model and completes more complex feature extraction for re-ID task. Consequently, our proposed HOB module achieves the best performance on both two datasets.

Analysis on backbone network and effectiveness of HOB

In order to confirm the backbone network of our model, we tried many popular CNN-based feature networks by fixing the maximal order and the loss function. From the results shown in Fig. 7, it can be observed that the proposed HOB module is universally valid for different backbone networks. For the Market-1501 dataset, SeResnext 101 achieves the best, reaching 95.8%/88.5% of R-1/mAP. Also, for the DukeMTMC-ReID dataset, using IBN-net50 as the backbone shows the highest scores that 90.9%/88.2% of R-1/mAP, which is the best performance ever on DukeMTMC-ReID dataset even compared with other state-of-the-art results. Furthermore, several person re-ID examples on Market-1501 and DukeMTMC-ReID produced by Baseline and HOB-net are shown in Figs. 8 and 9, where six query images with different pedestrians and camera views are selected respectively. From the retrieval results, it can be observed that our proposed HOB-net performs more superior than the baseline model, which effectively ranks more true person at the top of the ranking list, even with large variations in the gallery including pose change, illumination, view angle change, and so on.

Model size and time/memory complexity

We present the model size of the proposed HOB-net and time/memory complexity in the training phase in Table 4. From the results in the table, it can be observed that the number of parameters, training time, and maximum memory of our HOB-net increase with the order. While comparing with the baseline, the growth of training time and the maximum memory of each order HOB-net is reasonable. In terms of performance, our model has been greatly improved, showing that the HOB module is indeed flexible and efficient.

Table 4 Model size and time/memory complexity comparisons

Full size table

5 Conclusion

In this paper, we propose a flexible High-order Block (HOB) module which boosts the feature representations by introducing the high-order statistics and the corresponding high-order deep metric learning. And we study the influence of the different architectures of HOB and different maximal orders on the final performance. With taking the advantage of the HOB module, we propose the High-order Block Network (HOB-net) to improves the distinguishing ability of the deep features. Furthermore, we explore the training schemes of different combinations between HOB module and loss functions. As shown in the experimental results, the proposed HOB-net achieves very competitive performances on the three popular benchmarks, which validates the motivations and the conclusions that the proposed model can effectively improve the reliability and distinguishing ability of the features. In addition, the proposed HOB-net provides a strong baseline for the higher-order framework of re-ID feature network, and it can be also applied in real applications to enhance the recognition precision.

Data Availability

All the experimental data in this paper are published data sets.

Code Availability

The code is available on https://github.com/NothingToSay99/HOB-net.

References

Dai Z, Chen M, Gu X, Zhu S, Tan P (2019) Batch DropBlock network for person re-identification and beyond. In: Proceedings of the IEEE international conference on computer vision, pp 3691–3701
Yu X, Ye X, Gao Q (2020) Infrared handprint image restoration algorithm based on apoptotic mechanism. IEEE Access 8:47334–47343
Article Google Scholar
Khan MA, Javed K, Khan SA, Saba T, Habib U, Khan JA, Abbasi A (2020) Human action recognition using fusion of multiview and deep features: an application to video surveillance. Multimedia Tools Appl :1–27
Suh Y, Wang J, Tang S, Mei T, Mu Lee K (2018) Part-aligned bilinear representations for person re-identification. In: Proceedings of the European conference on computer vision (ECCV), pp 402–419
Sun Y, Zheng L, Yang Y, Tian Q, Wang S (2018) Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In: Proceedings of the European conference on computer vision (ECCV), pp 480–496
Li W, Zhu X, Gong S (2018) Harmonious attention network for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2285–2294
Tay CP, Roy S, Yap KH (2019) Aanet: Attribute attention network for person re-identifications. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7134–7143
Ristani E, Solera F, Zou R, Cucchiara R, Tomasi C (2016) Performance measures and a data set for multi-target, multi-camera tracking. In: European conference on computer vision, pp 17–35
Zheng L, Shen L, Tian L, Wang S, Wang J, Tian Q (2015) Scalable person re-identification: a benchmark. In: Proceedings of the IEEE international conference on computer vision, pp 1116–1124
Li W, Zhao R, Xiao T, Wang X (2014) Deepreid: Deep filter pairing neural network for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 152–159
Yang W, Huang H, Zhang Z, Chen X, Huang K, Zhang S (2019) Towards rich feature discovery with class activation maps augmentation for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1389–1398
Xiang S, Fu Y, Chen H, Ran W, Liu T (2020) Multi-level feature learning with attention for person re-identification. Multimed Tools Appl 79(43):32079–32093
Article Google Scholar
Wang C, Zhang Q, Huang C, Liu W, Wang X (2018) Mancs: a multi-task attentional network with curriculum sampling for person re-identification. In: Proceedings of the European conference on computer vision (ECCV), pp 365–381
Zheng M, Karanam S, Wu Z, Radke RJ (2019) Re-identification with consistent attentive siamese networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5735–5744
Li R, Zhang B, Teng Z, Fan J (2020) A divide-and-unite deep network for person re-identification. Appl Intell :1–13
Chen B, Deng W, Hu J (2019) Mixed high-order attention network for person re-identification. In: Proceedings of the IEEE International conference on computer vision, pp 371–381
Chen D, Chen P, Yu X, Cao M, Jia T (2019) Deeply-learned spatial alignment for person re-identification. IEEE Access 7:143684–143692
Article Google Scholar
Qian X, Fu Y, Xiang T, Wang W, Qiu J, Wu Y, Xue X (2018) Pose-normalized image generation for person re-identification. In: Proceedings of the European conference on computer vision (ECCV), pp 650–667
Cai S, Zuo W, Zhang L (2017) Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In: Proceedings of the IEEE international conference on computer vision, pp 511–520
Jacob P, Picard D, Histace A, Klein E (2019) Metric learning with horde: high-order regularizer for deep embeddings. In: Proceedings of the IEEE international conference on computer vision, pp 6539–6548
Gou M, Camps O, Sznaier M (2017) Mom: Mean of moments feature for person re-identification. In: Proceedings of the IEEE international conference on computer vision workshops. pp 1294–1303
Koniusz P, Yan F, Gosselin PH, Mikolajczyk K (2016) Higher-order occurrence pooling for bags-of-words: visual concept detection. IEEE Trans Pattern Anal Mach Intell 39(2):313–326
Article Google Scholar
Li P, Xie J, Wang Q, Zuo W (2017) Is second-order information helpful for large-scale visual recognition?. In: Proceedings of the IEEE international conference on computer vision, pp 2070–2078
Cheng D, Gong Y, Zhou S, Wang J, Zheng N (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1335– 1344
Wang X, Hua Y, Kodirov E, Hu G, Garnier R, Robertson NM (2019) Ranked list loss for deep metric learning. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 5207–5216
Kim W, Goyal B, Chawla K, Lee J, Kwon K (2018) Attention-based ensemble for deep metric learning. In: Proceedings of the European conference on computer vision (ECCV), pp 736–751
Xuan H, Souvenir R, Pless R (2018) Deep randomized ensembles for metric learning. In: Proceedings of the European conference on computer vision (ECCV), pp 723–734
Zhong Z, Zheng L, Li S, Yang Y (2018) Generalizing a person retrieval model hetero-and homogeneously. In: Proceedings of the European conference on computer vision (ECCV), pp 172– 188
Hermans A, Beyer L, Leibe B (2017)
Xiao Q, Luo H, Zhang C (2017) Margin sample mining loss: a deep learning based method for person re-identification. arXiv:1710.00478
Kar P, Karnick H (2012) Random feature maps for dot product kernels. In: Artificial intelligence and statistics, pp 583–591
Kolda TG, Bader BW (2009) Tensor decompositions and applications. SIAM Rev 51(3):455–500
Article MathSciNet Google Scholar
Zheng L, Yang Y, Hauptmann AG (2016) Person re-identification: Past, present and future. arXiv:1610.02984
Luo H, Gu Y, Liao X, Lai S, Jiang W (2019) Bag of tricks and a strong baseline for deep person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 0–0
Kalayeh MM, Basaran E, Gökmen M., Kamasak ME, Shah M (2018) Human semantic parsing for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1062–1071
Zheng F, Cai T, Wang Y, Deng C, Chen Z, Zhu H (2020) A mask-pooling model with local-level triplet loss for person re-identification. IEEE Access 8:138191–138202
Article Google Scholar
Song C, Huang Y, Ouyang W, Wang L (2018) Mask-guided contrastive attention model for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1179–1188
Zhong Z, Zheng L, Zheng Z, Li S, Yang Y (2018) Camstyle: a novel data augmentation method for person re-identification. IEEE Trans Image Process 28(3):1176–1190
Article MathSciNet Google Scholar
Chang X, Hospedales TM, Xiang T (2018) Multi-level factorisation net for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2109–2118
Wang Y, Wang L, You Y, Zou X, Chen V, Li S, Weinberger KQ (2018) Resource aware person re-identification across multiple resolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8042–8051
Hou R, Ma B, Chang H, Gu X, Shan S, Chen X (2019) Interaction-and-aggregation network for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 9317–9326
Lin Y, Zheng L, Zheng Z, Wu Y, Hu Z, Yan C, Yang Y (2019) Improving person re-identification by attribute and identity learning. Pattern Recogn 95:151–161
Article Google Scholar
Wang Z, Jiang J, Wu Y, Ye M, Bai X, Satoh SI (2019) Learning sparse and identity-preserved hidden attributes for person re-identification. IEEE Trans Image Process 29(1):2013–2025
Google Scholar
Ye H, Liu H, Meng F, Li X (2020) Bi-directional exponential angular triplet loss for rgb-infrared person re-identification. IEEE Trans Image Process (TIP) 2020:3045261
Google Scholar
Yin J, Fan Z, Chen S, Wang Y (2020) In-depth exploration of attribute information for person re-identification. Appl Intell 50(11):3607–3622
Article Google Scholar
Yang Z, Liu T, Liu J, Wang L, Zhao S (2020) A novel soft margin loss function for deep discriminative embedding learning. IEEE Access 8:202785–202794
Article Google Scholar

Download references

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant U1613214 and in part supported by the Fundamental Research Funds for the Central Universities of China under Grant N170402008.

Author information

Authors and Affiliations

College of Information Science and Engineering, Northeastern University, Shenyang, China
Dongyue Chen, Pengfei Wu, Tong Jia & Fangbin Xu

Authors

Dongyue Chen
View author publications
You can also search for this author in PubMed Google Scholar
Pengfei Wu
View author publications
You can also search for this author in PubMed Google Scholar
Tong Jia
View author publications
You can also search for this author in PubMed Google Scholar
Fangbin Xu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Tong Jia.

Ethics declarations

Ethics approval

Compliance with ethical standards.

Conflict of Interests

The authors declare that they have no conflict of interests.

Consent for Publication

The authors declare that they consent to publication.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Chen, D., Wu, P., Jia, T. et al. HOB-net: high-order block network via deep metric learning for person re-identification. Appl Intell 52, 4844–4857 (2022). https://doi.org/10.1007/s10489-021-02450-y

Download citation

Accepted: 20 April 2021
Published: 29 July 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s10489-021-02450-y

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

HOB-net: high-order block network via deep metric learning for person re-identification

Abstract

Similar content being viewed by others

Person Re-identification by Multiple Feature Representations and Metric Learning

A Simple Deep Feature Representation for Person Re-identification

A siamese pedestrian alignment network for person re-identification

Explore related subjects

1 Introduction

2 Related work

2.1 Backbone architecture for re-ID

2.2 High-order metric learning

2.3 Loss function and learning strategy for re-ID

3 Proposed approach

3.1 Network architecture

3.2 High-order computation

3.3 Loss functions

4 Experiments

4.1 Datasets

4.2 Experimental setting

Implementation details

Evaluation metrics

4.3 Comparison with the state-of-the-art methods

4.4 Component analysis

Analysis on the maximal order

Analysis on the combined loss function

Analysis on architecture of HOB module

Analysis on backbone network and effectiveness of HOB

Model size and time/memory complexity

5 Conclusion

Data Availability

Code Availability

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Ethics approval

Conflict of Interests

Consent for Publication

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation