Learning deep embedding with mini-cluster loss for person re-identification

Yuan, Caihong; Guo, Jingjuan; Feng, Ping; Zhao, Zhiqiang; Luo, Yihao; Xu, Chunyan; Wang, Tianjiang; Duan, Kui

doi:10.1007/s11042-019-7446-2

Learning deep embedding with mini-cluster loss for person re-identification

Published: 14 March 2019

Volume 78, pages 21145–21166, (2019)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Multimedia Tools and Applications Aims and scope Submit manuscript

Learning deep embedding with mini-cluster loss for person re-identification

Download PDF

Caihong Yuan^1,2,
Jingjuan Guo^1,3,
Ping Feng¹,
Zhiqiang Zhao³,
Yihao Luo¹,
Chunyan Xu⁴,
Tianjiang Wang¹ &
…
Kui Duan⁵

439 Accesses
9 Citations
Explore all metrics

Abstract

Recently, the triplet loss is commonly used in many deep person re-identification (ReID) frameworks to learn an embedding space in which similar data points are close and dissimilar data points are far away. However, the triplet loss simply focuses on the relative orders of points. This may lead to a relatively large intra-class variance and then a weak generalization capacity on the test set. In this paper, we propose a mini-cluster loss, which regards images belonging to the same identity as a mini-cluster and treats them as a whole during the training instead of considering them separately. For each mini-cluster in a batch, we define the largest distance between points in a mini-cluster as its inner divergence and the shortest distance with outer points as its outer divergence. By constraining the outer divergence larger than the inner divergence, our framework with the mini-cluster loss achieves the more compact mini-clusters while keeping the diversity distributions of the classes. As a result, a better generalization ability and a higher performance can be obtained. In the extensive experiments, our proposed framework achieves a state-of-the-art performance on two large-scale person ReID datasets (Market1501, DukeMTMC-reID) which clearly demonstrates its effectiveness. Specifically, 72.44% mAP and 87.05% rank-1 score are achieved on the Market1501 dataset with single query setting, 78.17% mAP and 91.05% rank-1 score with multiply query setting, and on the DukeMTMC-reID dataset, 60.19% mAP and 77.20% rank-1 score are obtained.

A Deep Clustering-Guide Learning for Unsupervised Person Re-identification

Embedding Deep Metric for Person Re-identification: A Study Against Large Variations

Centroid Tuplet Loss for Person Re-Identification

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Person re-identification (ReID ) focuses on the problem of searching for person images owning the same identity as the given anchor pedestrian image, which are captured from the surveillance videos across multiple cameras or across time [7, 27]. Due to its importance in the intelligent surveillance application, it has been attracting more and more attentions of the computer vision and pattern recognition committee. Despite intense studies [1, 4, 9, 10, 36, 46, 47] have been done in the recent years, ReID is still a very challenging task due to its particularity. Compared with many other tasks, such as face recognition [20, 41], there are more posture variances of the same individual and more complex backgrounds. Besides these, it also usually undergoes illumination variances and large occlusions.

Recently, deep learning gains a tremendous success in many areas, which can implicitly capture intricate distributions of large scale data by learning and representing data with multiple levels of abstraction and understand multi-modal information with multiple processing layers [28]. Motivated by this, more and more deep frameworks [2, 5, 9, 10, 22, 29, 34, 36, 46, 47] for person ReID have been proposed by researchers. The main goals of them are to learn a feature embedding where points from the same identity are close and those from different are far away. Recently, the triplet loss becomes a major means and is commonly used in many deep person ReID frameworks to keep the correct order for the anchor image with its positive and negative. In a batch there are several images for every identity which could be included in the triplets. During the training, the optimization is done on them separately. And after thousands of iterations, small intra-class variances and large inter-class variances are expected. However, as shown in Fig. 1, during an iteration, some points of the same identity are pushed to completely different directions. Thus, it will result in an unstable convergence process of the models. And since just the relative distances of a single point with its negatives and positives are constrained, it easily leads to a relatively large intra-class variance. And it is found in [4, 31] that large intra-class variances may lead to a weak generalization of the learned models on the test set. And they argue that reducing the intra-class variances and enlarging the inter-class variances may improve the performance. For this, ImpTrpLoss [5] introduces an additional item to the triplet loss by keeping the distances of positive pairs smaller than a pre-defined value. The quadruplet loss [4] adds the relative constraint on the positive pairs and negative pairs without the same anchor to the triplet loss. However, both methods may destroy the diversity distributions of different classes.

Motivated by this, we propose a novel objective called mini-cluster loss defined on a blob. Images belonging to the same identity are defined as a mini-cluster and treated as a whole during the training. The proposed framework and the scheme of the mini-cluster loss are shown in Figs. 2 and 3 respectively. The input blob of the network contains P mini-clusters and each mini-cluster contains K images. For each mini-cluster in a batch, we define the largest distance between points in a mini-cluster as its inner divergence and the shortest distance with outer points as its outer divergence. By making the outer divergence larger than the inner divergence, the mini-cluster loss could keep the correct relative relationships of points and helps to form a feature embedding with smaller intra-class variances and larger inter-class variances. It can be found in our learned feature embedding which is shown in Fig. 4.

It should be mentioned that the mini-cluster loss shares similar spirits with hard samples mining methods, but the goals of them are different. For the triplet loss, hard triplets sampling is regarded as an effective measure for improving the performance. The goal of them is to mine the informative samples to enhance the gradients variance and speed up the convergence of the models while keeping the correct relative distances between the positive and negative pair with the same anchor. However for the mini-cluster loss, our searching for the inner divergence and the outer divergence is to represent the general characteristics of a cluster. And the aim is not to simply keep the relative relationship among a triplet but push the negative points far away from the anchor mini-cluster .

For explaining why the proposed method is effective, we conduct the experimental analysis in Section 4. Based on the triplet loss with batch hard [9], we explore the affects of the pairs of different hard levels in a mini-cluster for the performance. Through the analysis, we argue that the defined inner divergence and outer divergence are the appropriate representations for the general mini-cluster.

Generally, the main contributions of this paper can be listed as follows:

(1)
We propose two concepts which are the inner divergence and outer divergence to represent the general characteristic of a mini-cluster.
(2)
Based on them, we further propose a novel loss named mini-cluster loss, which regards images of the same identity in a batch as a whole. Just resorting to the especially selected positive (inner divergence) and negative pair (outer divergence), it could achieve more compact clusters while well keeping the diversities of different classes.
(3)
Experiments on two large-scale person ReID datasets show its significant performance.

This paper is organized as follows: in Section 2 we present the related works of person ReID. Then in Section 3, the triplet loss and the quadruplet loss are first reviewed. And then we elaborate the proposed mini-cluster loss. Experimental valuations on 2 large scale datasets are demonstrated in Section 4. Moreover, the impacts of the pairs with different hard levels in a mini-cluster to the performance are experimental analyzed. And Section 5 is the conclusions.

2 Related work

There are two camps in person ReID which are traditional methods and deep methods. For those traditional methods, good feature representation [18, 24, 38, 42] and appropriate distance metric [11, 16, 21, 37] are two key subtasks [12, 13, 44]. However the solution of two separated steps is hard to achieve an optimal results [39]. With the revolutionary success of the deep learning method on many computer vision tasks, more and more deep frameworks [2, 5, 9, 10, 22, 29, 34, 36, 47] for person ReID have been proposed by the researchers. These frameworks can learn a feature embedding or a similarity metric in an end-to-end way.

There are 3 usual solutions for deep ReID: classification, verification and ranking. For these classification based person ReID frameworks [32, 34], they treat it as an identity recognition task by minimizing the softmax loss in the network. In the training, the model is learned by assigning an image to an identity. And in the testing, the top probability vector is abandoned and the previous outputted vector is extracted as the feature. Then a basic distance function (i.e. Euclidean distance and Cosine distance) is applied to measure the distances or similarities between images. However, since the goal of these methods is to learn boundaries between classes, it may lead to insufficient intra-class compaction [3]. And the objective of these verification based approaches [5, 39, 43] is to verify whether a pair of images are of the same person. They usually borrow the siamese network with the contrastive loss or binary classifier to reduce the intra-class variances and enlarge the inter-class variances. These methods simply focus on the absolute distance but neglect relative relationships. However, due to the complex scenarios of ReID, distributions of different classes are discrepant. Thus verification based methods may be not the best option for person ReID [33]. And for these ranking based approaches [6, 9, 33], the triplet loss is widely used to learn an embedding space in which the distance of the negative pair should be larger than the that of the positive pair in a triple. These methods could well keep the relative relationships between positive and negative pairs. However, frameworks based on triplet loss easily suffer from slow convergence and sub-optimization.

To solve this problem, many researches explore some hard triplets sampling methods or novel loss functions based on the triplet loss. For triplets sampling, there are two trends which are making full use of all the relationships as much as possible and just sampling the hard triplets in a blob. For the former, Batch All sampling [6] considers all the triplets among a blob. However, there are many easy triplets which will generate zero loss and are not effective for the gradient generation. Thus, this sampling method will reduce the gradients and easily lead to a suboptimal solution. Then the Lifted Structure loss [19] lifts the distances between vector pairs to a pairwise distance matrix and defines a structured loss based on all positive and negative samples. And N-pair loss [23] samples N classes and randomly selects a pair of images for every class. Then they propose a N + 1 tuple loss by comparing a positive pair with all its negative pairs. The second trend is hard triplets sampling in a blob which is proved to be very efficient for improving the performance. Among them, Batch Hard [9] is an effective triplets mining method by selecting one hardest positive image and one hardest negative image for every image in a batch. Resorting to a soft margin triplet loss, it illustrates a state-of-the-art performance in the experiments. Besides these, Wu et al. [33] proposed a distance weighted sampling method and proved that sampling matters in the deep embedding learning.

At the same time, some researchers argue that the triplet loss just considers the relationships of positive pair and negative pair with the same anchor image may result in a large intra-class variance. They design some new loss functions by applying different constraints. Such as, ImpTrpLoss [5] introduces an additional item to the triplet loss by keeping the distances of positive pairs smaller than a pre-defined value. The quadruplet loss [4] is defined on a quadruplet including 4 images from 3 classes. Besides considering the relative distances between positive-negative pairs with the same anchor as the triplet loss does, an extra constraint for those without the same anchor is also added. The angular loss [30] is designed to introduce the scale invariance and a third-order geometric constraint by constraining the angle at the negative point of triplet triangles. The center loss [31] pushes the feature vectors to the center vector of the same class, and combined with the softmax loss, it effectively reduces the intra-class distance of the deep features.

Different with the triplet loss and its variants, our proposed mini-cluster loss considers images of the same identity in a batch as a whole and use the inner divergence and outer divergence to describe a mini-cluster. By limiting the outer divergence larger than the inner divergence, the relative relationships between the intra-class and inter-class are kept while more compact clusters are obtained.

3 The proposed method

Our proposed mini-cluster loss is closely related with the triplet loss and quadruplet loss. So, in this section, we introduce them firstly and then represent our mini-cluster loss in detail.

3.1 Preliminary

We define (x_i,s_i) as an input data, where $x_{i}\in \mathcal {X}$ is the image and s_i is the class label of x_i. The feature embedding kernel $f(\cdot ;\theta ):\mathcal {X} \rightarrow \mathbb {R}^{k}$ maps an input image x_i to a feature vector of k dimensions. To simplify, we use f(x_i) to represent f(x_i;𝜃).

The triplet loss

The triplet loss [20] takes triplets as input, each including three images (x_i,x_j,x_k), where x_i and x_j are from the same person and x_k is from the different one. The goal of the triplet loss is to keep x_i closer to x_j than x_k. The triplet loss on a single triplet is defined as following:

$$ \begin{array}{llll} \mathcal{L}_{triplet}= \frac{1}{2} \times \big{[}m+\left\lVert f(x_{i})-f(x_{j})\right\rVert{_{2}^{2}}\\ -\left\lVert f(x_{i})-f(x_{k})\right\rVert{_{2}^{2}}\big{]}_{+} \end{array} $$

(1)

where [⋅] = max(⋅,0), and m is a predefined margin for the relative distance. If $\mathcal {L}_{triplet}(\cdot )= 0$, the partial differentials with respect to the input f(x_i), f(x_j) and f(x_k) are all zeros. If not, they are as follows:

$$ \begin{array}{llll} & \frac{\partial(\mathcal{L}_{triplet}(\cdot))}{\partial(f(x_{i}))}=f(x_{k})-f(x_{j}) \\ & \frac{\partial(\mathcal{L}_{triplet}(\cdot))}{\partial(f(x_{j}))}=f(x_{j})-f(x_{i}) \\ & \frac{\partial(\mathcal{L}_{triplet}(\cdot))}{\partial(f(x_{k}))}=f(x_{i})-f(x_{k}) \end{array} $$

(2)

The quadruplet loss

The quadruplet loss [4] is a variant of the triplet loss. It is based on a quadruplet (x_i,x_j,x_k,x_l), where x_i and x_j are images from the same person, but x_k and x_l are from other persons. Besides considering the relative distance between the positive and negative pair with the same probe image, the quadruplet loss also imports another constraint of the relationship between the positive and another negative pair with different probe person. The quadruplet loss is defined in (3), in which g(⋅,⋅) represents the distance between two images, the first item is a main constraint and the second is a relatively weaker auxiliary constraint.

$$ \begin{array}{llll} \mathcal{L}_{quad}=\sum\limits_{i,j,k}^{N}\big{[}g\big{(}f(x_{i}),f(x_{j})\big{)}^{2}-g\big{(}f(x_{i}),f(x_{k})\big{)}^{2}+ \alpha_{1} \big{]}_{+} \\ +\sum\limits_{i,j,k,l}^{N}\big{[}g\big{(}f(x_{i}),f(x_{j})\big{)}^{2}-g\big{(}f(x_{l}),f(x_{k})\big{)}^{2}+\alpha_{2}\big{]}_{+}\\ s_{i}=s_{j}, s_{l}\ne s_{k}, s_{i} \ne s_{l}, s_{i}\ne s_{k} \end{array} $$

(3)

3.2 The mini-cluster loss

The scheme of the proposed mini-cluster loss is shown in Fig. 3. Suppose the input batch of our framework is consisted of P persons and K images for each person, which is called PK-batch in our paper. We consider K images of an individual as a mini-cluster. Before introducing the mini-cluster loss, we first define two concepts which are inner divergence and outer divergence to characterize a cluster. First, we define its inner divergence $\mathcal {I}dvg_{a}$ as the furthest distance between points of a mini-cluster :

$$ \mathcal{I}dvg_{a}=\max\limits_{x_{i},x_{j}\in C_{a}}\left\lVert f(x_{i})-f(x_{j})\right\rVert_{2} $$

(4)

in which, C_a is the ath mini-cluster, i.e., the set of the images with the same identity label l_a in a batch. x_i and x_j are different images from C_a. Moreover, for better representing the outer divergence, the distance between two mini-clusters is defined as the closest distance of points across two mini-clusters:

$$ D_{a,b}=\min\limits_{x_{i}\in C_{a}, x_{j}\in C_{b}}\left\lVert f(x_{i})-f(x_{j})\right\rVert_{2} $$

(5)

Based on it, the outer divergence $\mathcal {O}dvg_{a}$ is defined as:

$$ \mathcal{O}dvg_{a}=\min\limits_{b \ne a,b\in{1 {\cdots} P}}\{D_{a,b}\} $$

(6)

For each mini-cluster, there is an inner divergence representing its compact degree, and an outer divergence representing its separation degree with other mini-clusters. Intuitively, for a mini-cluster C_a, it should be satisfied with

$$ \mathcal{I}dvg_{a}<\mathcal{O}dvg_{a} $$

(7)

So the proposed mini-cluster loss on a batch is defined as:

$$ \begin{array}{lll} &\mathcal{L}_{mini} =\sum\limits_{a = 1}^{P}log\big{(}1+exp(\mathcal{I}dvg_{a}-\mathcal{O}dvg_{a})\big{)}\\ &\qquad\quad=\sum\limits_{a = 1}^{P}log\big{(}1+exp(\max\limits_{x_{i},x_{j}\in C_{a}}\left\lVert f(x_{i})-f(x_{j})\right\rVert_{2}\\ &\qquad\qquad\quad -\min\limits_{\begin{array}{l}b = 1,2{\cdots} P\\ b\neq a \end{array}}\{\min\limits_{\begin{array}{l}x_{k}\in C_{a}\\ x_{l}\in C_{b} \end{array}}\left\lVert f(x_{k})-f(x_{l})\right\rVert_{2}\})\big{)} \end{array} $$

(8)

For each mini-cluster, a tuple (x_i,x_j,x_k,x_l) is carefully selected, where x_i,x_j and x_k are from the anchor mini-cluster, and x_l is from another. Thus, for a batch, P tuples are used to generate punishments for the general loss.

3.3 Runtime complexity

The computation of the proposed mini-cluster loss mainly includes three parts: distance matrix, the inner divergence and outer divergence, and the loss value. Assuming that there are M images in the training set and the batch size is N = P × K, then an epoch should include $M\over N$ batches. In each batch, the complexity of computing the distance matrix, divergences and the loss value could be expressed as O((PK)²), O(P × [K² + K × (PK − K)]) and O(P) respectively. Thus, the complexity of an epoch should be $O(2MN+{M\over K})$, which can be simply represented as O(MN). As N → M, we can get a worst cast complexity O(M²). Compared with the naive hard mining algorithm of the triplet loss, which has a worst case complexity of O(M³) on an epoch, our method is computationally more efficient. And, compared with the recent methods which are the batch hard sampling of the triplet loss [9] whose complexity is O(2MN + M), and the quadruplet loss [4] whose complexity is O(MN³), the proposed method still has the advantage on the computational complexity.

3.4 Relationship with other metric objectives

Our mini-cluster loss is designed based on the triplet loss (1). And from the form of tuple, it is similar with the quadruplet loss (3). In this subsection, we discuss the relationship of our mini-cluster loss with them.

First, both the triplet loss and the quadruplet loss are not built on the whole mini-clusters but the individual points . They consider images of the same identity individually and keep the relative relationships among points. Thus, they apply some tuple mining method to generate tuples and several images of the same class may be contained. These selected tuples just represent themselves, not the whole mini-cluster. However, our proposed mini-cluster loss uses the inner divergence and the outer divergence to represent a mini-cluster, and ignores all other points here. Due to the particularity of the two concepts, just by limiting the outer divergence larger than the inner divergence, it would keep the correct relative relationship between intra-class and inter-class while pulling the clusters more compact. Of course, the mini-cluster loss can be regarded as a special hard tuple mining methods. In a PK-batch, for each mini-cluster a closest positive pair (x_i,x_j) and a furthest negative pair (x_k,x_l) are selected to form a tuple $(x_{i}, x_{j}, x_{k}, x_{l})_{x_{i}, x_{j}, x_{k}\in C_{a}, x_{l} \in C_{b}}$. The mini-cluster loss is defined on such P tuples. Hard samples selecting has been proved important for the improvement of performance in many researches. The sampling method of mini-cluster loss discards most of tuples and just keeps one tuple for a mini-cluster. In the following experiments, we illustrate that our models trained on the P selected tuples can show a better performance than other hard sampling methods.

Second, from the form, the tuple $(x_{i}, x_{j}, x_{k}, x_{l})_{x_{i}, x_{j}, x_{k}\in C_{a}, x_{l} \in C_{b}}$ our mini-cluster loss defined on is a quadruplet. But it is different with that in the quadruplet loss (3). Images in the tuple of our mini-cluster are just from two clusters. But the quadruplet $(x_{i}, x_{j}, x_{k}, x_{l})_{x_{i}, x_{j}\in C_{a}, x_{k}\in C_{b}, x_{l}\in C_{c}}$ in the quadruplet loss is from three clusters. The quadruplet loss not only considers the relative distance between the positive pair and the negative pair with the same anchor as the triplet loss does, but also the relative distance between the positive pair and negative pair from other clusters. But the extra constraint item may destroy the diverse of the distributions of the clusters. However, the mini-cluster loss keeps the relative relationship of the inner divergence and the outer divergence of a mini-cluster. Due to the outer divergence is a strict constrain, it helps to pull the mini-cluster more compact. And the diverse of the different clusters can also be kept.

4 Experiments

4.1 Setup

Datasets: :: We evaluate our method on two large-scale ReID benchmark datasets which are Market1501 [45] and DukeMTMC-reID [47]. The Market1501 dataset is captured in front of a supermarket from 6 different camera views including 5 high-resolution cameras and one low-resolution camera. It is detected by Deformable Part Model (DPM). There are 12,936 images of 751 IDs in the training set, 3368 query images and 19,732 gallery images of the remaining 750 IDs in the testing set. The DukeMTMC-reID dataset is cropped from 85-minute high-resolution videos from 8 different cameras by handcraft. It contains 1404 IDs in total where the randomly selected 702 IDs are as the training set and the remaining 702 IDs as the testing set. As a result, there are 16,522 images in the training set, 2228 query images and 17,661 gallery images (702 IDs + 408 distractor IDs) in the testing set.
Data augmentation: :: In various of deep learning tasks, data augmentation has been proved to be a very effective means for alleviating the over-fitting problem. In the following implementation, all the images are first resized into 288 × 144 pixels. Then during the training, we crop a 256 × 128 pixels image region with a random perturbation and mirror it or not decided in the manner of flipping a coin. Then the cropped images subtract the mean values (123.68,116.78,103.94) for Market1501 and (127.50,127.50,127.50) for DukeMTMC-reID in the corresponding dimensions. During the test, five crops in the upper left, upper right, lower left, lower right and center and their mirrors are used to extract their feature vectors. Then the average feature vectors are applied to compute the distance matrix and evaluate the performance.
Optimization setting: :: We use ResNet50 [8] as the CNN network. First we discard the last fully connected layer and the softmax loss layer and add a fully connected layer with 128 dimensions and the mini-cluster loss layer. Then we fine-tune the model on the pre-trained model of ResNet50. Our experiments are conducted in the Tensorflow deep learning framework. We adopt adaptive moment estimation optimizer (Adam) to optimize our deep framework. The exponentially rate decaying schedule is applied in the training. 25,000 iterations are conducted on both datasets. The initial learning rate is set to 3e-4 for the Market1501 dataset and 1e-4 for the DukeMTMC-reID dataset, and then gradually decays after 10000 iterations till 0.01 folds of the initial learning rate.

4.2 Construction of the input blob

Since the mini-cluster loss considers a cluster as a whole, the appropriate form of the input blob in the optimizing is PK-blob which randomly samples P identities and K images for each identity. Although it is expected to set a large value for P and K, it is limited by the capacity of GPU. So the scale of the blob, i.e. P and K should be carefully selected. In our experiments, we fix the size of a blob (P × K) to be 108. By set K to be 18, 12, 9, 6, 3 and 2 respectively, we investigate the influence to the results of K and explore the balance between K and P. We conduct them on the Market1501 dataset, and the results are illustrated in Table 1. From it, it can be seen that the best results are achieved when K is set to be 6 and P is 18. We argue the best balance is achieved at this setting when the batch size is set to be 108.

Table 1 Balance between K and P in a blob

Full size table

Moreover, it is noticed that a large performance decreasing can be found when P is larger or smaller, especially when K = 18 and K = 2. When K is set to be a larger value, P will be smaller. Then it may cause large distractions introduced by the large mini-cluster and the insufficient negative images. For example, when setting K to be 18, almost all the images of the same identity could be involved in a blob. For those classes without outliers, the inner divergence can well represent the mini-cluster. However, if the outliers are existed, it may draw the whole cluster to fit the outlier and result in an indistinct feature embedding. On the other hand, the small negative classes would lead to a weak outer divergence and then a zero loss in the later period of the optimization. It could generate an under-fitting model. On the contrary, the smaller K easily leads to a local inner divergence and then a suboptimal performance. So, in the following experiments, we set K to be 6 and P to be 18.

Table 2 Comparisons of our method with several state-of-the-art methods on both Market1501 and DukeMTMC-reID

Full size table

4.3 Progressive training strategy

Based on the keep observation on many training processes, it is found that it is easily collapsed in the early period of the optimization for models with the mini-cluster loss no matter how small of the learning rate. We analyze it is because that in the early period of the optimization, both the inner divergence and the outer divergence make no sense. Only when the feature embedding is elementary ordered, the mini-cluster loss would tap its potential in the training. So we apply a strategy of progressive training which is 2-phase training. In the first phase, we train the model by the triplet loss with the batch hard sampling defined in [9]. And then in the second phase, fine-tune the pre-trained model in the first phase. In our experiments, the general iteration number is set to 25,000 where the first 1000 iterations are trained by the method of [9] and the remaining 24,000 iterations are trained by the proposed mini-cluster loss.

4.4 Comparison with state-of-the-art methods

In Table 2, we compare our proposed method with several recent state-of-the-art works on both Market1501 and DukeMTMT-reID dataset. There are traditional methods and deep learning methods compared. The first two methods are traditional methods. Among them, DNS [40] designs a closed-form solution to make the images of the same person collapse into a single point while maximizing the relative between-class separation for solving the small sample size problem. And SSM [1] is a manifold-based affinity learning method for ReID. Methods of the next part of the Table 2 are deep ReID frameworks. For instance, Resnet+OIM [35] conducts the pedestrian detection and the identification task jointly by optimizing the online instance matching loss. Latent Parts [14] learns the localize deformable pedestrian parts first and then learns the context-aware features by integrating the learning of the full body and local parts in a united framework. Moreover, AACN [36] introduces pose and attention information to the deep framework. Besides these, another 2 methods aggregate the training samples using generative adversarial network (GAN), which are Verif-Identif.+LSRO [47] and dMpRL. Especially, Triplet+Batch hard [9] uses batch hard sampling method and soft margin triplet loss to learn the embedding which is most related with ours.

Table 3 Comparisons with some methods closely related with our method on the Market1501 and DukeMTMC-reID dataset

Full size table

As it can be seen that our proposed method shows better performances compared with all the above methods on both large-scale datasets. Especially, our method improves the mAP score from 69.14% [9] to 72.44% and improves the rank-1 accuracy from 85.90% [36] to 87.05% on the Market1501 dataset at the single query mode. And at the multiple query model, there is a slight improvement compared with the best results. On the DukeMTMC-reID dataset, there is nearly a 1% performance improvement compared with 59.25% [36] at mAP. It is noted that compared with [9] which is closely with our method, there is about average 2.50% improvement of the mAP score on the Market1501.

It is mentioned above that our method is motivated by or close with some deep learning methods. However, no available published results achieved from them on the 2 large-scale datasets can be found, or it was conducted under different experimental settings. So, for better and fair comparisons, we conduct 5 groups of the related experiments under the same setting as ours, which are listed in Table 3. Concretely, we re-implement the triplet loss with random sampling, batch all sampling and batch hard sampling, the quadruplet loss and the lifted structure loss. All the experiments are trained on the pre-trained ResNet50 model, and the forms of the input blobs are PK-blob where P is set to be 18 and K is be 6. For the triplet loss, we use the form of soft-margin as is done with [9] and ours. And for the quadruplet loss [4] (3), we set the two margins to be 1.0 and 0.5 respectively. Among these 5 methods, the Triplet+Batch hard(18 × 6) achieves the best results. And compared with the results published in [9] which are listed in Table 2, there is more than 2% improvement of the mAP score under the single query mode. And the single difference is applying a different blob construction with [9], which again illustrates that the balance between P and K is important for the performance improving.

Moreover, compared with these 5 methods, our method shows a higher performance. Especially on the DukeMTMC-reID dataset, it improves 55.75% mAP achieved by the batch hard method to 60.19% and 73.61% rank-1 accuracy to 77.20%. The better performance implies that the mini-cluster loss is an effective method for person ReID. However, it should be clarified that the delightful performances of the mini-cluster loss are achieved just on the large-scale person ReID datasets, we are not sure whether it is effective for the other applications which should be tested on.

For further qualitatively investigating the performance of our method, we illustrate a small part of the Barnes-Hut t-SNE [26] of the learned embedding for the Market1501 test set including the query images and the gallery images in Fig. 4, and show some ranked gallery image lists generated by our method in Fig. 5. From Fig. 4, it can be seen that the meaningful feature embedding is learned by our method. Especially, from Fig. 5, we can clearly observe that these top ranked images are all similar with the query image in the appearance. For those mismatched images, we also find they are dressed in the similar clothing or take the similar backpack. It should be noticed that the color and the global feature can be well abstracted by our method, but the local features about the detail information of an image are not well captured.

4.5 Discussion

In Tables 2 and 3, it can be seen that the batch hard triplets mining method [9] shows a better performance compared with others. So, based on the batch hard method proposed by [9], we conduct experiments on the Market1501 dataset for exploring the informative ability of the positive pairs and negative pairs with different hard levels and explain why the inner divergence and outer divergence are an appropriate representation for a mini-cluster.

For better understanding the setting of the experiments, we show a toy example in Fig. 6. With the batch hard method shown in Fig. 6b, a hardest positive and a hardest negative are selected for each image. And there will generate K triplets for each mini-cluster and PK triplets for the whole blob. For better conducting the analysis, we define two types of hard levels for the triplets which are hard levels on the positive pairs and those on the negative pairs in Fig. 6c–d. Based on this, we conduct 4 groups of experiments on 4 triplet sets of different hard level, which are pos-hard-m, neg-hard-m, pos-easy-m, neg-easy-m. By setting an increasing m, we analyze the impact of the added triplets to the performance.

Specifically, for the hard levels on the positive pairs (Fig. 6c), we first sort the K triplets in each mini-cluster in the descending order according to the distance of the positive pairs. Then we sample the first m(m = 1,2,⋯,K) triplets from every mini-cluster and the selected m × P triplets are sent to the triplet loss layer. We represent this group of experiments as pos-hard-m= {pos_ij|i = 1,⋯,m,j = 1,⋯,P}. And if we sample the last m(m = 1,⋯, K) triplets from every mini-cluster, this group of experiment is represented as pos-easy-m= {pos_ij|i = K,⋯,K − m + 1,j = 1,⋯,P}. Similar with those, we first sort K triplets in each mini-cluster in the ascending order according to the distances of the negative pairs (Fig. 6d), and then sample the first or last m triplets. These two groups of experiments are represented as neg-hard-m= {neg_ij|i = 1⋯m,j = 1,⋯,K} and neg-easy-m= {neg_ij|i = K,⋯,K − m + 1,j = 1,⋯,P}. Since K is 6 and P is 18 in our experiments, we also apply this setting in these 4 groups of experiments.

The results of the 4 groups of experiments are listed in Table 4 and shown in Fig. 7. As we have known, for pos-hard-m, we gradually add a triplet with positive pair of next hard level to the training set for a mini-cluster with m increasing 1. From Table 4, it can be seen that even if m is 1, that is, just 1 triplet is selected in a mini-cluster, we also achieve very competitive results. And with m increasing, there is no evident performance improvement. Similar character can be found in the columns of neg-hard-m. It illustrates that the hardest positive and negative pair in a mini-cluster play very important roles for the optimization and those sub-hard positives and negatives may not bring a large performance improvement. However, the performances in the columns of pos-easy-m and neg-easy-m are almost progressively improved with m increasing. This means it could bring significant performance improvements by gradually introducing the harder positives or harder negatives of a mini-cluster to the training samples. Based on this, we speculate that both the hardest positive pair and the hardest negative pair in a mini-cluster are very important for the model training. This finding lays the foundation of our mini-cluster as shown in Fig. 6a. As we have introduced above, we define the distance of the hardest positive pair and the hardest negative pair as the inner divergence and outer divergence respectively. And the inner divergence and outer divergence can be used to characterize a cluster. If a cluster has a small inner divergence and a large outer divergence, that means a small intra-class variance and a large inter-class variance. And keeping the outer divergence larger than the inner divergence in the training, it would help to generate an embedding with more compact clusters and more separation between clusters.

Table 4 Performance of the selected triplets from different hard levels on the Market1501 dataset

Full size table

5 Conclusion

In this work, we proposed a novel loss objective called mini-cluster loss for the person ReID task. Instead of considering samples in a blob separately, the proposed loss regards images of the same person in a blob as a mini-cluster and treats them as a whole in the training. For each mini-cluster, we define 2 concepts which are the inner divergence and the outer divergence to characterize a mini-cluster. By keeping the outer divergence larger than the inner divergence, the mini-cluster loss could learn a feature embedding with small intra-class variances and large inter-class variances. Moreover, we apply a progressive strategy to train our models. Extensive experimental results on both large-scale datasets (Market1501 and DukeMTMC-reID) clearly demonstrate the effectiveness of the proposed mini-cluster loss for person ReID. However, although our method can well capture the color and the global feature, the local features about the detail information of an image are not well obtained. We think that our method combined with the local feature extracted by using attribute information can largely improve the ReID performance. Meanwhile, we also notice that if there are just a small count of images for an identity in a ReID dataset, the proposed mini-cluster loss will not well reach its potential performance. Future work should be aimed at breaking through this limit by generating more images of an identity by using GANs(Generative Adversarial Networks).

References

Bai S, Bai X, Qi T (2017) Scalable person re-identification on supervised smoothed manifold. In: CVPR, pp 2530–2539
Barbosa IB, Cristani M, Caputo B, Rognhaugen A, Theoharis T (2017) Looking beyond appearances: synthetic training data for deep cnns in re-identification. Comput Vis Image Underst, 1–14
Chen Y, Chen Y, Wang X, Tang X (2014) Deep learning face representation by joint identification-verification. In: NIPS, pp 1988–1996
Chen W, Chen X, Zhang J, Huang K (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In: CVPR, pp 403–412
De C, Gong Y, Zhou S, Wang J, Zheng N (2016) Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: CVPR, pp 1335–1344
Ding S, Lin L, Wang G, Chao H (2015) Deep feature learning with relative distance comparison for person re-identification. Pattern Recogn 48(10):2993–3003
Article Google Scholar
Gong S, Cristani M, Yan S, Chen CL (2014) Person re-identification. Springer
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778
Hermans A, Beyer L, Leibe B (2017) In defense of the triplet loss for person re-identification. arXiv:1703.07737, 1–10
Huang Y, Xu J, Wu Q, Zheng Z, Zhang Z, Zhang J (2018) Multi-pseudo regularized label for generated samples in person re-identification. arXiv:1801.06742, 1–12
Li W, Wang X (2013) Locally aligned feature transforms across views. In: CVPR, pp 3594–3601
Li W, Zhao R, Wang X (2012) Human reidentification with transferred metric learning. In: ACCV, pp 31–44
Li Z, Chang S, Liang F, Huang TS, Cao L, Smith JR (2013) Learning locally-adaptive decision functions for person verification. In: CVPR, pp 3610–3617
Li D, Chen X, Zhang Z, Huang K (2017) Learning deep context-aware features over body and latent parts for person re-identification. In: CVPR, pp 384–393
Li W, Zhu X, Gong S (2017) Person re-identification by deep joint learning of multi-loss classification. IJCAI, 2194–2200
Liao S, Li SZ (2015) Efficient psd constrained asymmetric metric learning for person re-identification. In: ICCV, pp 3685–3693
Liu J, Zha ZJ, Qi T, Liu D, Yao T, Ling Q, Mei T (2016) Multi-scale triplet cnn for person re-identification. In: ACM on multimedia conference, pp 192–196
Matsukawa T, Okabe T, Suzuki E, Sato Y (2016) Hierarchical gaussian descriptor for person re-identification. In: CVPR, pp 1363–1372
Song HO, Yu X, Jegelka S, Savarese S (2016) Deep metric learning via lifted structured feature embedding. In: CVPR, pp 4004–4012
Schroff F, Kalenichenko D, Philbin J (2015) Facenet: a unified embedding for face recognition and clustering. In: CVPR, pp 815–823
Shen Y, Lin W, Yan J, Xu M, Wu J, Wang J (2015) Person re-identification with correspondence structure learning. In: ICCV, pp 3200–3208
Shi H, Yang Y, Zhu X, Liao S, Lei Z, Zheng W, Li SZ (2016) Embedding deep metric for person re-identification: a study against large variations. In: ECCV, pp 732–748
Sohn K (2016) Improved deep metric learning with multi-class n-pair loss objective. In: NIPS, pp 1857–1865
Su C, Yang F, Zhang S, Qi T, Davis LS, Gao W (2015) Multi-task learning with low rank attribute embedding for person re-identification. In: CVPR, pp 3739–3747
Sun Y, Zheng L, Deng W, Wang S (2017) Svdnet for pedestrian retrieval. ICCV, 3800–3808
Van Der Maaten L (2014) Accelerating t-sne using tree-based algorithms, vol 15
Vezzani R, Baltieri D, Cucchiara R (2013) People reidentification in surveillance and forensics:a survey. Acm Comput Surv 46(2):1–37
Article Google Scholar
Voulodimos A, Doulamis N, Doulamis A, Protopapadakis E (2018) Deep learning for computer vision: a brief review. In: Comput Intell Neurosci, pp 1–13
Wang F, Zuo W, Lin L, Zhang D, Zhang L (2016) Joint learning of single-image and cross-image representations for person re-identification. In: CVPR, pp 1288–1296
Wang J, Zhou F, Wen S, Liu X, Lin Y (2017) Deep metric learning with angular loss. In: ICCV, pp 2593–2601
Wen Y, Zhang K, Li Z, Yu Q (2016) A discriminative feature learning approach for deep face recognition. In: ECCV, pp 499–515
Wu S, Chen YC, Li X, Wu AC, You JJ, Zheng WS (2016) An enhanced deep feature representation for person re-identification. In: WACV, pp 1–8
Wu C-Y, Manmatha R, Smola AJ, Krahenbuhl P (2017) Sampling matters in deep embedding learning. In: CVPR, pp 2840–2848
Xiao T, Li H, Ouyang W, Wang X (2016) Learning deep feature representations with domain guided dropout for person re-identification. In: CVPR, pp 1249–1258
Xiao T, Li S, Wang B, Lin L, Wang X (2017) Joint detection and identification feature learning for person search. In: CVPR, pp 3376–3385
Xu J, Zhao R, Zhu F, Wang H, Ouyang W (2018) Attention-aware compositional network for person re-identification. CVPR, 2119–2128
Yang Y, Lei Z, Zhang S, Shi H, Li SZ (2016) Metric embedded discriminative vocabulary learning for high-level person representation. In: AAAI, pp 3648–3654
Yang Y, Wen L, Lyu S, Li SZ (2017) Unsupervised learning of multi-level descriptors for person re-identification. In: AAAI, vol 1, pp 4306–4312
Yi D, Lei Z, Li SZ (2014) Deep metric learning for practical person re-identification. Comput Sci, 34–39
Zhang L, Xiang T, Gong S (2016) Learning a discriminative null space for person re-identification. In: CVPR, pp 1239–1248
Zhang X, Fang Z, Wen Y, Li Z, Yu Q (2017) Range loss for deep face recognition with long-tail. ICCV, 1–10
Zhao R, Ouyang W, Wang X (2014) Learning mid-level filters for person re-identification. In: CVPR, pp 144–151
Zheng WS, Gong S, Xiang T (2011) Person re-identification by probabilistic relative distance comparison. In: CVPR, pp 649–656
Zheng W, Gong S, Xiang T (2013) Reidentification by relative distance comparison. IEEE Trans Pattern Anal Mach Intell 35(3):653–668
Article Google Scholar
Zheng L, Shen L, Lu T, Wang S, Wang J, Qi T (2015) Scalable person re-identification: a benchmark. In: ICCV, pp 1116–1124
Zheng Z, Zheng L, Yang Y (2017) A discriminatively learned cnn embedding for person re-identification. arXiv:1611.05666, 1–10
Zheng Z, Zheng L, Yi Y (2017) Unlabeled samples generated by gan improve the person re-identification baseline in vitro. ICCV, 3754–3762
Zhou S, Wang J, Wang J, Gong Y, Zheng N (2017) Point to set similarity based deep feature learning for person reidentification. In: CVPR, pp 3741–3750

Download references

Acknowledgment

This work was supported by the National Natural Science Foundation of China (61572214, 61602244, U1536203 and U1504611), and partially sponsored by CCF-Tencent Open Research Fund.

Author information

Authors and Affiliations

School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
Caihong Yuan, Jingjuan Guo, Ping Feng, Yihao Luo & Tianjiang Wang
School of Computer and Information Engineering, Henan University, Kaifeng, 475004, China
Caihong Yuan
School of Information Science and Technology, Jiujiang University, Jiujiang, 332005, China
Jingjuan Guo & Zhiqiang Zhao
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
Chunyan Xu
Huazhong University of Science and Technology, Wuhan, 430074, China
Kui Duan

Authors

Caihong Yuan
View author publications
You can also search for this author in PubMed Google Scholar
Jingjuan Guo
View author publications
You can also search for this author in PubMed Google Scholar
Ping Feng
View author publications
You can also search for this author in PubMed Google Scholar
Zhiqiang Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Yihao Luo
View author publications
You can also search for this author in PubMed Google Scholar
Chunyan Xu
View author publications
You can also search for this author in PubMed Google Scholar
Tianjiang Wang
View author publications
You can also search for this author in PubMed Google Scholar
Kui Duan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Caihong Yuan or Kui Duan.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Yuan, C., Guo, J., Feng, P. et al. Learning deep embedding with mini-cluster loss for person re-identification. Multimed Tools Appl 78, 21145–21166 (2019). https://doi.org/10.1007/s11042-019-7446-2

Download citation

Received: 20 July 2018
Revised: 02 January 2019
Accepted: 06 March 2019
Published: 14 March 2019
Issue Date: 15 August 2019
DOI: https://doi.org/10.1007/s11042-019-7446-2

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Learning deep embedding with mini-cluster loss for person re-identification

Abstract

Similar content being viewed by others

A Deep Clustering-Guide Learning for Unsupervised Person Re-identification

Embedding Deep Metric for Person Re-identification: A Study Against Large Variations

Centroid Tuplet Loss for Person Re-Identification

1 Introduction

2 Related work