
1 Introduction

Person re-identification aims to determine whether two pedestrian images observed from disjoint camera views belong to the same person, a task of great significance in video surveillance systems. Large variations in viewpoint, illumination, and body posture across camera views cause great appearance variance, which keeps re-identification a challenging problem. Typically, re-identification methods involve two vital steps: (1) developing robust feature representations to handle the variations in pedestrian images; (2) designing discriminative distance metrics to measure the similarity between pedestrian images.

Representative feature descriptors include [1,2,3,4,5,6,7,8], which are mostly derived from color and texture. Gray et al. [1] used boosting to select a subset of optimal features composed of texture and color features. Farenzena et al. [2] proposed Symmetry-Driven Accumulation of Local Features (SDALF), consisting of both symmetric and asymmetric color and texture information. Zhao et al. [3] learned mid-level filters (Mid-Level) from patch clusters with coherent appearance, obtained by pruning hierarchical clustering trees, to get view-invariant and discriminative features. SalMatch [5] was proposed to exploit both patch matching and salience matching for person re-identification. In [7], the Local Maximal Occurrence (LOMO) descriptor was proposed, extracted from local HSV histograms and SILTP features with sliding windows to build a representation stable against viewpoint changes. However, hand-crafted feature descriptors can hardly capture the abstract and intrinsic characteristics of images, so these appearance-based features are highly susceptible to view changes and struggle to balance discriminative power and robustness.
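To make the maximal-occurrence idea behind LOMO concrete, the toy Python sketch below is a heavy simplification of the full descriptor in [7]: the window size, stride, and bin counts are illustrative assumptions, and the SILTP channel and multi-scale pyramid are omitted.

```python
import numpy as np

def lomo_like_hsv(img_hsv, win=10, stride=5, bins=8):
    # Toy sketch of the maximal-occurrence idea behind LOMO [7], not the
    # full descriptor: compute a local HSV histogram for every sliding
    # window and keep, per horizontal strip, the element-wise maximum
    # over all windows in that strip (robust to horizontal view shifts).
    H, W, _ = img_hsv.shape
    feats = []
    for top in range(0, H - win + 1, stride):          # horizontal strips
        strip_hists = []
        for left in range(0, W - win + 1, stride):     # windows in the strip
            patch = img_hsv[top:top + win, left:left + win]
            hist, _ = np.histogramdd(
                patch.reshape(-1, 3), bins=(bins,) * 3,
                range=((0, 256),) * 3)
            strip_hists.append(hist.ravel())
        feats.append(np.max(strip_hists, axis=0))      # maximal occurrence
    return np.concatenate(feats)
```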

In recent years, many metric learning approaches have been proposed [5, 7, 9,10,11,12,13,14] and have achieved remarkable performance for person re-identification. Representative methods include Cross-view Quadratic Discriminant Analysis (XQDA) [7], Large Scale Metric Learning from Equivalence Constraints (KISSME) [9], Metric Learning to Rank (MLR) [10], Pairwise Constrained Component Analysis (PCCA) [11], and Large Margin Nearest Neighbor (LMNN) [14]. These methods first extract hand-crafted features and then learn a transformation matrix of the initial feature space such that, in the transformed space, distances become smaller between images of the same individual and larger between images of different individuals; some of them achieved impressive improvements for person re-identification.

Compared with the hand-crafted-feature-based methods above, several deep learning based methods have been proposed [15,16,17,18,19]. More abstract and intrinsic features can be learned automatically with a deep architecture, which makes the feature representation considerably more robust than hand-crafted features. Li et al. [15] proposed a novel filter pairing neural network (FPNN) to jointly optimize feature learning, geometric transforms, photometric transforms, misalignment, occlusions, and classification. Yi et al. [16] used a siamese deep convolutional architecture to learn texture features, color features, and a metric together in a fully cross-dataset setting. Ahmed et al. [17] presented a deep neural network with layers specially designed for capturing relationships between different camera views. Wu et al. [18] used very small convolution filters and increased the depth of the network to improve re-identification performance. Xiao et al. [19] learned deep feature representations from multiple domains with Convolutional Neural Networks (CNNs). However, these deep neural networks need to learn a large number of parameters, so small datasets usually cannot yield remarkable results.

To address these problems, we first learn a general Convolutional Neural Network on a mixture of various datasets as our deep feature extractor, which increases the scale of the training set so that small datasets become applicable and better features can be learned from multiple datasets. Then we extract appearance-based features of the pedestrian images as a supplement. Finally, we learn separate metrics for the deep feature representation and the hand-crafted feature representation, which makes the distance metrics more discriminative. Experiments show the superior performance of our proposed approach compared with state-of-the-art works.

2 Proposed Approach

In this paper, we extract both deep features and hand-crafted features to represent pedestrian images, and then learn a distance metric for each of the two types of feature representations to measure the similarity between images in a more discriminative way. Section 2.1 introduces the Convolutional Neural Network we propose to extract the deep feature representation of the images. Section 2.2 introduces the multiple feature representations and our independent metric learning.

2.1 Our Deep Architecture

Inspired by [19, 20], we build the CNN model described in Table 1 and mix various datasets together to train a general CNN that serves as the deep feature extractor for all the datasets. Specifically, three benchmark datasets, VIPER, CUHK01, and CUHK03, are used to validate our method, and all images are scaled to \(144 \times 56\) pixels.
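A minimal sketch of this mixing step is given below; the directory layout and file extension are our own assumptions, but relabeling identities into one contiguous range (so a single softmax can span all datasets) and the \(144 \times 56\) resizing follow the text.

```python
from pathlib import Path
from PIL import Image

def build_mixed_training_set(dataset_dirs):
    # Pool several re-id datasets into one training set: identities from
    # different datasets are relabeled into a single contiguous range,
    # and every image is resized to 144x56 (height x width).
    samples, next_id = [], 0
    for root in dataset_dirs:
        local_ids = {}
        for img_path in sorted(Path(root).glob('*/*.jpg')):  # <person>/<image>.jpg, assumed layout
            pid = img_path.parent.name
            if pid not in local_ids:
                local_ids[pid] = next_id      # assign a new global identity label
                next_id += 1
            img = Image.open(img_path).resize((56, 144))  # PIL takes (width, height)
            samples.append((img, local_ids[pid]))
    return samples, next_id  # next_id = total identities (2168 in our setup)
```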

Table 1. The Architecture of Our Proposed CNN
Fig. 1. Inception modules used in our CNN structure, all proposed in [20] for image classification; the module in (b) is an expansion of (a) to promote high-dimensional representations on the coarsest grid

The structure of our CNN is the same as [19] except for the last two Inception modules and the two fully connected layers. Figure 1(b) shows the structure of our last two Inception modules, which was applied to image classification in [20]; it expands the filter bank outputs of the original Inception module in Fig. 1(a) to promote high-dimensional representations. After these modules, two fully connected layers are applied: the first has 2048 channels, and the second has as many channels as there are individuals in the training set, which is 2168 in our model.
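For illustration, the PyTorch-style sketch below shows one plausible form of such an expanded module, in which the \(3 \times 3\) branch is split into parallel \(1 \times 3\) and \(3 \times 1\) convolutions whose outputs are concatenated, as in [20]; the branch widths are illustrative assumptions, not the exact configuration of Table 1.

```python
import torch
import torch.nn as nn

class ExpandedInception(nn.Module):
    # Sketch of an Inception module with an expanded filter bank
    # (Fig. 1(b)): the 3x3 branch is replaced by parallel 1x3 and 3x1
    # convolutions whose outputs are concatenated, widening the
    # representation on the coarsest grid. Channel sizes are assumed.
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.reduce3 = nn.Conv2d(in_ch, 96, kernel_size=1)
        self.branch3a = nn.Conv2d(96, 64, kernel_size=(1, 3), padding=(0, 1))
        self.branch3b = nn.Conv2d(96, 64, kernel_size=(3, 1), padding=(1, 0))
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)
        self.branch_pool = nn.Conv2d(in_ch, 64, kernel_size=1)

    def forward(self, x):
        r = torch.relu(self.reduce3(x))
        return torch.cat([
            torch.relu(self.branch1(x)),
            torch.relu(self.branch3a(r)),   # expanded: 1x3 and 3x1 in parallel
            torch.relu(self.branch3b(r)),
            torch.relu(self.branch_pool(self.pool(x))),
        ], dim=1)
```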

2.2 Multi-features Fusion and Independent Metric Learning

After training the proposed CNN, we extract the output of the fc7 layer as the deep feature representation for the training and testing sets, and exploit the hand-crafted LOMO feature [7], consisting of local HSV histograms and SILTP features, as a complement. Then we learn an independent distance metric with XQDA [7] for each of the two types of feature representations.
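For reference, a minimal pycaffe sketch of the extraction step might look as follows; the prototxt and weight file names are placeholders, and image preprocessing (mean subtraction, channel order) is omitted.

```python
import caffe

# Placeholders for the deployed network definition and trained weights.
net = caffe.Net('deploy.prototxt', 'trained.caffemodel', caffe.TEST)

def extract_fc7(batch):
    # batch: float32 array of shape (n, 3, 144, 56) pedestrian crops.
    net.blobs['data'].reshape(*batch.shape)
    net.blobs['data'].data[...] = batch
    net.forward()
    return net.blobs['fc7'].data.copy()   # (n, 2048) deep features
```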

XQDA aims to learn a discriminant subspace and an effective distance metric at the same time. Given a pair of images \((i, j)\) captured from different views, let \(\mathbf {x}_i\) and \(\mathbf {y}_j\) be the original features of the two images. The distance between images i and j is formulated as:

$$\begin{aligned} f(\mathbf {x}_i,\mathbf {y}_j)=(\mathbf {x}_i-\mathbf {y}_j)^{\mathrm {T}}\mathbf {W}\mathbf {M}{\mathbf {W}}^{\mathrm {T}} (\mathbf {x}_i-\mathbf {y}_j)\;. \end{aligned}$$
(1)

where \(\mathbf {W}\in {{\mathbb {R}}^{d \times r}}\) is the subspace projection matrix, \(\mathbf {M}\in {{\mathbb {R}}^{r \times r}}\) is the learned metric kernel, d is the dimension of the original feature space, and \(r(r<d)\) is the dimension of the transformed feature space.
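The following numpy sketch outlines one simplified way to obtain \(\mathbf {W}\) and \(\mathbf {M}\) following the XQDA formulation of [7]; the naive \(O(n^2)\) pairwise construction and the regularizer are our own simplifications, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh, inv

def xqda_sketch(X, Y, lx, ly, r=64, reg=1e-3):
    # X, Y: (n, d) features from two camera views; lx, ly: identity labels.
    n, d = X.shape
    diffs = X[:, None, :] - Y[None, :, :]        # naive (n, n, d) pairwise diffs
    same = lx[:, None] == ly[None, :]
    intra, extra = diffs[same], diffs[~same]
    S_I = intra.T @ intra / len(intra) + reg * np.eye(d)   # intra-personal cov.
    S_E = extra.T @ extra / len(extra) + reg * np.eye(d)   # extra-personal cov.
    # Directions maximizing the extra/intra variance ratio:
    # generalized eigenproblem S_E w = lambda * S_I w.
    evals, evecs = eigh(S_E, S_I)
    W = evecs[:, np.argsort(evals)[::-1][:r]]    # top-r eigenvectors -> (d, r)
    # Metric kernel in the projected subspace, as in [7].
    M = inv(W.T @ S_I @ W) - inv(W.T @ S_E @ W)
    return W, M

def xqda_distance(x, y, W, M):
    z = W.T @ (x - y)                            # Eq. (1)
    return float(z @ M @ z)
```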

In this paper, let \(\mathbf {x}_i^{dr}\) and \(\mathbf {x}_i^{hr}\) be the deep feature representation and the hand-crafted feature representation of image i, respectively; \(\mathbf {y}_j^{dr}\) and \(\mathbf {y}_j^{hr}\) are defined analogously. The distance between images i and j can then be re-formulated as:

$$\begin{aligned} d(i,j)=d_{n}(\mathbf {x}_i^{dr},\mathbf {y}_j^{dr})+d_{n}(\mathbf {x}_i^{hr},\mathbf {y}_j^{hr})\;. \end{aligned}$$
(2)

where \(d_{n}(\mathbf {x}_i^{dr},\mathbf {y}_j^{dr})\) is the normalization of \(d(\mathbf {x}_i^{dr},\mathbf {y}_j^{dr})\) and \(d_{n}(\mathbf {x}_i^{hr},\mathbf {y}_j^{hr})\) is the normalization of \(d(\mathbf {x}_i^{hr},\mathbf {y}_j^{hr})\), both computed by Eq. (1).
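A short sketch of Eq. (2) follows; the paper does not specify the normalization scheme, so min-max scaling of each distance matrix over the gallery is assumed here for illustration.

```python
import numpy as np

def fused_distance(d_deep, d_hand):
    # d_deep, d_hand: (n_query, n_gallery) distance matrices computed by
    # Eq. (1) with the metric learned for each representation.
    def minmax(d):
        # Assumed normalization: rescale each matrix to [0, 1].
        return (d - d.min()) / (d.max() - d.min() + 1e-12)
    return minmax(d_deep) + minmax(d_hand)       # Eq. (2)
```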

3 Experiments

3.1 Datasets and Experiment Protocols

We validate the proposed approach on three widely used person re-identification datasets: VIPER [21], CUHK01 [22], and CUHK03 [15].

VIPER is one of the most challenging datasets for person re-identification; it contains 632 pairs of person images taken from two camera views with various poses, viewpoints, and illumination conditions. The CUHK01 dataset is larger in scale than VIPER; it contains 971 persons captured from two disjoint views, with two images per person in each camera view: camera A captures the frontal or back view of the individuals, while camera B captures side views. The CUHK03 dataset is one of the largest published person re-identification datasets; it includes five different pairs of camera views with more than 14,000 images of 1467 pedestrians. In addition, both manually cropped pedestrian images and images automatically detected by the pedestrian detector of [23] are provided, a more realistic setting that accounts for misalignment, occlusions, missing body parts, and detector errors.

We follow the widely adopted experimental protocols for the VIPER and CUHK01 datasets: the individuals in each dataset are randomly divided in half for training and half for testing. For CUHK03, we follow the settings in [19, 24], using both manually cropped and automatically detected images together; we randomly select 100 individuals for testing, and the other 1367 individuals are used for training. We mix the three training sets together to train a general CNN, which is employed to extract the deep feature representations of images for all datasets; we implement our network with the Caffe [25] deep learning framework. We then use the same individuals used for training our CNN to learn a distance metric for each dataset separately. Results are evaluated by the cumulative matching characteristic (CMC) curve [26], also known as rank-n, which estimates the probability of finding the correct match among the top n ranked candidates. This procedure is repeated 10 times, and the average rank-n is reported for each dataset.
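As a reference for the evaluation protocol, the following sketch computes rank-n matching rates from a query-gallery distance matrix (a single-shot setting, with one correct match per query, is assumed).

```python
import numpy as np

def cmc(dist, query_ids, gallery_ids, max_rank=20):
    # dist: (n_query, n_gallery) fused distances; rank-n is the fraction
    # of queries whose correct match appears among the n nearest
    # gallery entries.
    order = np.argsort(dist, axis=1)                 # ascending distance
    hits = gallery_ids[order] == query_ids[:, None]
    first_hit = hits.argmax(axis=1)                  # rank index of true match
    return np.array([(first_hit < n).mean() for n in range(1, max_rank + 1)])
```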

3.2 Evaluations of Proposed Method

To validate the proposed method, we conduct a series of experiments with different settings, which include: (i) our proposed method; (ii) replacing our CNN with JSTL [19]; (iii) our method without the hand-crafted feature representation; (iv) our method without the deep feature representation.

Fig. 2. Experimental results of the different settings on the three datasets, measured by rank-1, rank-5, rank-10, and rank-20 matching rates. The red curves represent our proposed method, which achieves the best rank-1 matching rate on all three datasets (color figure online)

Figure 2 shows the rank-n (\(n=1,5,10,20\)) matching rates for the different experiments and datasets. The results demonstrate the effectiveness of the proposed method: our method achieves better performance than the other compared settings on all three datasets. The first two experiments validate the effectiveness of our proposed CNN; by expanding the filter bank outputs to promote higher-dimensional representations, we achieve better performance. The last two experiments confirm that the two types of feature representations complement each other well.

3.3 Comparison with State-of-the-Arts

We compare our approach with the following state-of-the-art methods: Metric Ensembles (Ensembles) [24], mFilter+LADF [3], mFilter [3], LOMO+XQDA [7], FT-JSTL+DGD [19], and JointRe-id [17]. Figure 3 shows the results on the VIPER, CUHK01, and CUHK03 datasets. Our method improves the rank-1 recognition rate over the state-of-the-art by 5.4%, 7.6%, and 8.7% on the three datasets, respectively.

Fig. 3. Performance comparison of the proposed method with the state-of-the-art methods on the VIPER, CUHK01, and CUHK03 datasets. Our approach outperforms all the state-of-the-art methods in most cases, especially on the rank-1 matching rate

4 Conclusion

In this paper, we present an effective deep architecture trained on a mixture of various datasets to extract deep features of pedestrian images; we then use the deep feature representation and the hand-crafted feature representation to learn separate metrics. Using both types of feature representation yields more robust and comprehensive features, and learning an independent distance metric for each of the two representations achieves higher discriminative power. We conduct extensive experiments on three widely used person re-identification datasets to validate our approach. Experimental results demonstrate that our method achieves better performance than other state-of-the-art methods in most cases.