1 Introduction

As a basic task in multi-camera surveillance systems, person re-identification aims to re-identify a query pedestrian observed by non-overlapping cameras, or at different times by a single camera [1]. Person re-identification is an important part of many computer vision tasks, including behavioral understanding, threat detection [2,3,4,5,6] and video surveillance [7,8,9]. Recently, the task has drawn significant attention in the computer vision community. Although researchers have made great efforts to address it, the task remains challenging because the appearance of the same pedestrian may change significantly across non-overlapping camera views.

Recently, owing to the great success of convolutional neural networks (CNNs) in the computer vision community [10,11,12,13,14], many CNN-based methods have been introduced to address person re-identification, and they have achieved promising performance. Unlike hand-crafted methods, CNN-based methods can learn deep features automatically in an end-to-end manner [28]. CNN-based methods for person re-identification can be roughly divided into three categories: identification models, verification models and triplet models. The three categories differ in the form of their input data and in their loss functions, and each has its own advantages and limitations. Our motivation is to combine the advantages of the three models to learn more discriminative deep pedestrian descriptors.

The identification model treats person re-identification as a multi-class classification task. The model can make full use of the annotation information of datasets [1]. However, it does not consider the distance metric between image pairs. The verification model regards person re-identification as a binary classification task: it takes a pair of images as input and outputs whether the two images belong to the same person. Thus, the verification model considers the relationship between different images, but it does not make full use of the label information of datasets. The triplet model [15] takes a triplet unit \( (x_{i} ,x_{j} ,x_{k} ) \) as input, where \( x_{i} \) and \( x_{j} \) are the matched pair and \( x_{i} \) and \( x_{k} \) are the mismatched pair. Given a triplet of images, the model tries to make the distance between \( x_{i} \) and \( x_{k} \) larger than the distance between \( x_{i} \) and \( x_{j} \). Accordingly, the triplet model can learn a similarity measurement for image pairs, but it also cannot take full advantage of the label information of the datasets and uses weak labels only [1].

We design a CNN-based architecture that combines the three popular types of deep models used for person re-identification, i.e. identification, verification and triplet models. The proposed architecture jointly outputs the IDs of the input triplet images and their similarity scores, and forces the L2 distance between the mismatched pair to be larger than that between the matched pair, thus enhancing the discriminative ability of the learned deep features.

2 Proposed Method

As shown in Fig. 1, the proposed method is a triplet-based CNN model that combines identification, verification and triplet losses.

Fig. 1. The proposed model

2.1 Loss Function

Identification Loss. In this work, we use the Online Instance Matching (OIM) loss [16] instead of the cross-entropy loss to supervise the identification submodule. For one image with feature M, the similarity score of it belonging to ID j is written as:

$$ p_{j} = \frac{\exp (v_{j}^{T} M/\partial )}{\sum\limits_{i = 1}^{L} \exp (v_{i}^{T} M/\partial ) + \sum\limits_{k = 1}^{Q} \exp (\mu_{k}^{T} M/\partial )} $$
(1)

where \( v_{j}^{T} \) is the transpose of the j-th column of the lookup table, \( \mu_{k}^{T} \) is the transpose of the k-th entry of the circular queue, L and Q are the numbers of entries in the lookup table and the circular queue, respectively, and \( \partial \) is the temperature scalar.
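As a rough illustration of Eq. (1), the following NumPy sketch computes the score \( p_{j} \) for every labeled ID at once. The function name, the array layout (one column per entry) and the default temperature of 0.1 are assumptions made for this example; they are not taken from the paper or from [16].

```python
import numpy as np

def oim_probabilities(feature, lookup_table, circular_queue, temperature=0.1):
    """Return p_j of Eq. (1) for every labeled ID j.

    feature:        (D,) feature vector M of one image
    lookup_table:   (D, L) matrix whose j-th column is v_j
    circular_queue: (D, Q) matrix whose k-th column is mu_k
    temperature:    temperature scalar (0.1 is an assumed placeholder value)
    """
    feature = np.asarray(feature, dtype=float)
    labeled_logits = lookup_table.T @ feature / temperature      # shape (L,)
    unlabeled_logits = circular_queue.T @ feature / temperature  # shape (Q,)
    all_logits = np.concatenate([labeled_logits, unlabeled_logits])
    # Subtracting the maximum logit keeps exp() stable and leaves the ratio unchanged.
    all_logits = all_logits - all_logits.max()
    exp_logits = np.exp(all_logits)
    # Numerator: labeled entries only. Denominator: labeled plus queue entries.
    return exp_logits[: labeled_logits.shape[0]] / exp_logits.sum()
```

Because the denominator also contains the circular-queue terms, the returned scores sum to less than one whenever the queue is non-empty. In the OIM formulation, training typically maximizes \( \log p_{t} \) for the true ID t of the image.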

Verification Loss. For the verification subnetwork, we treat person re-identification as a binary classification task. We adopt the cross-entropy loss as the verification loss function, which is:

$$ \begin{aligned} q &= {\text{softmax}}({\text{conv}}), \\ L_{2} &= \sum\limits_{i = 1}^{2} - n_{i} \log (q_{i}) \end{aligned} $$
(2)

in which \( n_{i} \) is the label of the image pair: when the pair shows the same pedestrian, \( n_{1} = 1 \) and \( n_{2} = 0 \); otherwise, \( n_{1} = 0 \) and \( n_{2} = 1 \).
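The following minimal sketch evaluates Eq. (2) for a single image pair. The name `scores` stands for the two raw outputs that the equation calls conv; the function name and the boolean `same_person` argument are hypothetical and used only for illustration.

```python
import numpy as np

def verification_loss(scores, same_person):
    """Eq. (2): softmax over two scores followed by cross-entropy.

    scores:      (2,) raw outputs of the verification branch
    same_person: True  -> labels (n1, n2) = (1, 0)
                 False -> labels (n1, n2) = (0, 1)
    """
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()                    # numerical stability
    log_q = shifted - np.log(np.exp(shifted).sum())    # log of softmax(scores)
    n = np.array([1.0, 0.0]) if same_person else np.array([0.0, 1.0])
    return float(-(n * log_q).sum())
```

With equal scores the branch is undecided and the loss equals \( \log 2 \); the loss shrinks as the score of the correct class grows relative to the other.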

Triplet Loss. The triplet subnetwork is adopted to make the Euclidean distance between the positive pairs smaller than that between negative pairs. For one triplet unit \( R_{i} \), the triplet loss function can be written as:

$$ L_{3} = [thre + d(F_{w} (R_{i}^{o}), F_{w} (R_{i}^{+})) - d(F_{w} (R_{i}^{o}), F_{w} (R_{i}^{-}))]_{+} $$
(3)

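where \( R_{i}^{o} \), \( R_{i}^{+} \) and \( R_{i}^{-} \) denote the anchor, positive and negative images of the triplet, \( F_{w}(\cdot) \) is the feature extracted by the network, \( thre \) is a positive threshold value, \( [x]_{+} = \max (0,x) \), and \( d(\cdot,\cdot) \) is the Euclidean distance.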
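A small sketch of Eq. (3) for one triplet of already-extracted feature vectors is given below. The default threshold of 0.3 is an assumed placeholder, since the paper does not state the value it uses.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, thre=0.3):
    """Eq. (3): hinge on the gap between positive and negative distances.

    anchor, positive, negative: (D,) feature vectors of the triplet
    thre: positive threshold (0.3 is an assumed placeholder value)
    """
    anchor = np.asarray(anchor, dtype=float)
    d_pos = np.linalg.norm(anchor - np.asarray(positive, dtype=float))
    d_neg = np.linalg.norm(anchor - np.asarray(negative, dtype=float))
    return float(max(0.0, thre + d_pos - d_neg))
```

The loss is zero once the negative is farther from the anchor than the positive by at least `thre`, so triplets that are already well separated contribute no gradient.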

Hybrid Loss. The deep architecture is jointly supervised by three OIM losses, two cross-entropy losses and one triplet loss. During the training phase, the hybrid loss function can be written as:

$$ L = \alpha_{1} L_{1} + \alpha_{2} L_{2} + \alpha_{3} L_{3} $$
(4)

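in which \( L_{1} \), \( L_{2} \) and \( L_{3} \) are the identification, verification and triplet losses, and \( \alpha_{1} \), \( \alpha_{2} \) and \( \alpha_{3} \) are the balance parameters.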
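The weighted sum in Eq. (4) can be sketched as follows. The paper does not say how the three OIM losses and the two cross-entropy losses are aggregated into \( L_{1} \) and \( L_{2} \); this example assumes a plain sum, and the default weights of 1.0 are placeholders rather than the values used by the authors.

```python
def hybrid_loss(oim_losses, verification_losses, triplet, alphas=(1.0, 1.0, 1.0)):
    """Eq. (4): weighted sum of identification, verification and triplet terms.

    oim_losses:          per-image OIM losses of the triplet (three values)
    verification_losses: per-pair cross-entropy losses (two values)
    triplet:             the triplet loss L3
    alphas:              balance parameters (placeholder defaults)
    """
    a1, a2, a3 = alphas
    l1 = sum(oim_losses)            # assumption: L1 is the sum of the OIM terms
    l2 = sum(verification_losses)   # assumption: L2 is the sum of the pair terms
    return a1 * l1 + a2 * l2 + a3 * triplet
```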

2.2 Training Strategies

We use the PyTorch framework to implement our network and Adaptive Moment Estimation (Adam) as the optimizer of the deep model. We use the two training strategies proposed in [17]. Specifically, for a large-scale dataset such as Market-1501, we train the designed model directly on its training set. For small datasets (e.g. CUHK01), we first train the model on a large-scale person re-identification dataset (e.g. Market-1501) and then fine-tune it on the small dataset.

3 Experiments

We first resize the training images to 256×128 and then subtract the mean image from the resized training images. For our hybrid model, it is crucial to organize each mini-batch so that it satisfies the training purposes of the identification, verification and triplet subnetworks. In this study, we follow an existing sampling protocol: we randomly sample Q identities and then select R images of each identity from the training set, so that QR images constitute one mini-batch. Among these QR images, we choose the hardest positive and the hardest negative sample for each anchor to form the triplet units, and we randomly select 100 image pairs for verification training. For the identification subnetwork, we use all QR images in the mini-batch for training.
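The hardest-sample selection described above can be sketched in NumPy as follows. The sketch assumes the usual reading of "hardest": for each anchor, the hardest positive is the farthest image with the same ID and the hardest negative is the closest image with a different ID. It also assumes that the mini-batch contains at least two identities and at least two images per identity. The function name is hypothetical.

```python
import numpy as np

def hardest_triplets(features, labels):
    """Pick the hardest positive and hardest negative for every anchor.

    features: (N, D) feature vectors of the N = Q*R images in the mini-batch
    labels:   (N,) identity label of each image
    Returns two (N,) index arrays: hardest positive and hardest negative.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))            # (N, N) Euclidean distances
    same_id = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    # Hardest positive: farthest image with the same ID, excluding the anchor.
    pos_dist = np.where(same_id & not_self, dist, -np.inf)
    hardest_pos = pos_dist.argmax(axis=1)
    # Hardest negative: closest image with a different ID.
    neg_dist = np.where(~same_id, dist, np.inf)
    hardest_neg = neg_dist.argmin(axis=1)
    return hardest_pos, hardest_neg
```

Each anchor index `a` then forms the triplet unit `(a, hardest_pos[a], hardest_neg[a])`. The pairwise distance matrix costs memory quadratic in the batch size, which is acceptable for typical mini-batches.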

  1. Evaluation on CUHK01

    For this dataset, 485 pedestrians are randomly selected to form the training set, and the remaining 486 identities are used for testing. From Table 1, we can observe that the proposed method outperforms most of the compared approaches, which demonstrates its effectiveness.

    Table 1. Results (Rank-1, Rank-5 and Rank-10 matching accuracy in %) on the CUHK01 dataset. ‘-’ means that no reported result is available.
  2. Evaluation on Market-1501

    We compare the proposed model with eight state-of-the-art methods on the Market-1501 dataset. We report mean average precision (mAP), Rank-1 and Rank-5 accuracy. All results are based on single-query evaluation. From Table 2, it can be observed that the proposed model achieves an mAP of 77.43%, a Rank-1 accuracy of 91.60% and a Rank-5 accuracy of 98.33%, and that it outperforms all the other competing methods, which further supports the effectiveness of the proposed method.

    Table 2. Results (mAP, Rank-1 and Rank-5 matching accuracy in %) on the Market-1501 dataset under single-query evaluation. ‘-’ means that no reported result is available.
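For readers unfamiliar with the reported metrics, the sketch below shows how a Rank-k hit and the average precision can be computed for a single query from its distances to the gallery. It is a simplified illustration with hypothetical names: it omits the filtering of same-camera and junk gallery images that the standard Market-1501 protocol applies, and it is not the authors' evaluation code.

```python
import numpy as np

def rank_k_and_ap(distances, is_match, k=1):
    """Evaluate one query against a gallery.

    distances: (G,) distance from the query to each gallery image
    is_match:  (G,) True where the gallery image has the query's identity
    Returns (rank_k_hit, average_precision).
    """
    order = np.argsort(np.asarray(distances, dtype=float))   # closest first
    hits = np.asarray(is_match, dtype=bool)[order]
    rank_k_hit = bool(hits[:k].any())
    hit_ranks = np.flatnonzero(hits) + 1                      # 1-based ranks of true matches
    if hit_ranks.size == 0:
        return rank_k_hit, 0.0
    # Precision at each true match: matches found so far divided by current rank.
    precisions = np.arange(1, hit_ranks.size + 1) / hit_ranks
    return rank_k_hit, float(precisions.mean())
```

Rank-k accuracy is the fraction of queries for which `rank_k_hit` is true, and mAP is the mean of the per-query average precision.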

4 Conclusions

In this paper, we design a hybrid deep model for person re-identification. The proposed model combines the identification, verification and triplet losses to handle the intra-class and inter-class distances. Through this hybrid strategy, the model learns a similarity measurement and discriminative features at the same time. The proposed model outperforms most state-of-the-art methods on the CUHK01 and Market-1501 datasets.