1 Introduction

Person re-identification aims to recognize people observed across disjoint camera views, and it plays a crucial role in video surveillance and visual information retrieval. Due to the large changes in appearance caused by variations in viewing angle, illumination, background clutter and occlusion, person re-identification remains a very challenging problem.

Recently proposed approaches that improve person re-identification performance [19] can be mainly divided into two categories: (1) extracting robust descriptors to deal with changes in person appearance; (2) designing discriminative metric functions to measure the similarity of person images. For the first category, several effective descriptors have been proposed, such as the covariance descriptor [7] and local maximal occurrence (LOMO) [2]. For the second category, a discriminative metric is learned under which the distance between images of the same person is decreased while the distance between images of different persons is increased. Among these methods, Liong et al. [6] model and regularize the eigen-spectrums of covariance matrices in a parametric manner. Pedagadi et al. [9] learn the distance function by maximizing the between-class scatter matrix while minimizing the within-class scatter matrix using the Fisher discriminant objective. However, most metric learning methods focus only on a global measurement and neglect local discriminative power.

Chen et al. [1] proposed a similarity learning method with spatial constraints, which partitions the person images into several sub-regions, measures the similarity of each region, and then combines these measurements by linear superposition. They also combined the local measurements with global measurements and incorporated multiple visual cues to improve performance. Experimental results show the effectiveness of this method. Considering that different sub-regions and different features make different contributions to the final similarity, we propose weighted local metric learning (WLML), which combines the similarity measurements of different sub-regions and different features by weighted summation. We then learn the weights by structured learning [3] instead of pre-defining them. Experimental results on two widely used datasets demonstrate the efficacy of the proposed method.

2 Our Approach

In this section, we first present weighted local metric learning and exploit a structured learning method to learn the weights. We then introduce the metric method applied in our experiments.

2.1 Weighted Local Metric Learning

In this section, we introduce the overall similarity function first and then formulate the learning problem specifically.

Similarity Function.

Given a pedestrian image, we divide it into R non-overlapping horizontal stripe regions and extract F types of color and texture features from each stripe region. After that, we can obtain the f-th descriptor \( x^{r,\,f} \) for the r-th stripe region, where \( r \in \left\{ {1, \ldots ,R} \right\} \) and \( f \in \left\{ {1, \ldots ,F} \right\} \).
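For concreteness, a minimal sketch of this stripe-and-feature decomposition is given below; the helper name `extract_stripe_descriptors` and the list of `feature_extractors` callables are our own illustrative choices, not part of the paper.

```python
import numpy as np

def extract_stripe_descriptors(image, R, feature_extractors):
    """Divide a pedestrian image into R non-overlapping horizontal
    stripes and apply each of the F feature extractors to every stripe.
    Returns a dict keyed by (r, f) holding the descriptors x^{r,f}."""
    H = image.shape[0]
    bounds = np.linspace(0, H, R + 1).astype(int)  # stripe row boundaries
    descriptors = {}
    for r in range(R):
        stripe = image[bounds[r]:bounds[r + 1]]
        for f, extract in enumerate(feature_extractors):
            descriptors[(r, f)] = extract(stripe)  # descriptor x^{r,f}
    return descriptors
```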

To measure the similarity between image descriptors \( x_{a} ,x_{b} \in {\mathbb{R}}^{d \times 1} \), where \( x_{a} \) and \( x_{b} \) are from camera views A and B, respectively, we employ an existing distance function \( d\left( {x_{a} ,x_{b} } \right) \) (to be introduced in Sect. 2.2) to calculate the distance between \( x_{a} \) and \( x_{b} \). Correspondingly, we define the similarity function for the f-th descriptor of the r-th stripe region as:

$$ d^{r,\,f} \left( {x_{a} ,x_{b} } \right) = d\left( {x_{a}^{r,\,f} ,x_{b}^{r,\,f} } \right) $$
(1)

Considering that one specific type of visual feature may not be powerful enough to discriminate individuals with similar visual appearance, we employ a weighted summation to combine these features. The similarity function for the r-th stripe region can be written as:

$$ d^{r} (x_{a} ,x_{b} ) = \sum\limits_{f = 1}^{F} {w_{r,\,f} d^{r,\,f} (x_{a} ,x_{b} )} $$
(2)

where \( w_{r,\,f} \ge 0 \) is the weight of \( d^{r,\,f} (x_{a} ,x_{b} ) \). Different regions also make different contributions to the local similarity score, so for all R regions the local similarity function is represented as:

$$ d^{local} (x_{a} ,x_{b} ) = \sum\limits_{r = 1}^{R} {w_{r} d^{r} (x_{a} ,x_{b} )} $$
(3)

where \( w_{r} \ge 0 \) is the weight of \( d^{r} (x_{a} ,x_{b} ) \). Since the horizontal stripe regions are non-overlapping, the local descriptors cannot capture the matching of large patterns across stripes. We therefore combine the local similarity with a global similarity, and the overall similarity function can be written as:

$$ d(x_{a} ,x_{b} ) = d^{local} (x_{a} ,x_{b} ) + w_{G} d^{global} (x_{a} ,x_{b} ) $$
(4)

where \( w_{G} \ge 0 \) is the weight of \( d^{global} (x_{a} ,x_{b} ) \), and the global similarity function \( d^{global} (x_{a} ,x_{b} ) \) is defined as:

$$ d^{global} (x_{a} ,x_{b} ) = \sum\limits_{f = 1}^{F} {w_{G,\,f} d^{G,\,f} (x_{a} ,x_{b} )} $$
(5)

where \( d^{G,\,f} (x_{a} ,x_{b} ) = d(x_{a}^{G,\,f} ,x_{b}^{G,\,f} ) \), and \( x_{a}^{G,\,f} \), \( x_{b}^{G,\,f} \) are the f-th type of global visual feature for images a and b, respectively. The expanded form of Eq. (4) can then be written as:

$$ d(x_{a} ,x_{b} ) = \sum\limits_{r = 1}^{R} {\sum\limits_{f = 1}^{F} {w_{r} w_{r,\,f} d^{r,\,f} (x_{a} ,x_{b} )} } + \sum\limits_{f = 1}^{F} {w_{G} w_{G,\,f} d^{G,\,f} (x_{a} ,x_{b} )} $$
(6)

Equation (6) can be simplified as follows by replacing \( w_{r} w_{r,\,f} \) and \( w_{G} w_{G,\,f} \) with \( w^{\prime}_{r,\,f} \) and \( w^{\prime}_{G,\,f} \):

$$ \begin{aligned} d\left( {x_{a} ,x_{b} } \right) = & \sum\limits_{r = 1}^{R} {\sum\limits_{f = 1}^{F} {w^{\prime}_{r,\,f} d^{r,\,f} \left( {x_{a} ,x_{b} } \right)} } + \sum\limits_{f = 1}^{F} {w^{\prime}_{G,\,f} d^{G,\,f} \left( {x_{a} ,x_{b} } \right)} \\ = & w^{T} \cdot d \\ \end{aligned} $$
(7)

where

$$ w = [w^{\prime}_{1,1} , \ldots ,w^{\prime}_{R,\,F} ,w^{\prime}_{G,1} , \ldots ,w^{\prime}_{G,\,F} ] $$

and

$$ d = [d^{1,1} (x_{a} ,x_{b} ), \ldots ,d^{R,\,F} (x_{a} ,x_{b} ),d^{G,1} (x_{a} ,x_{b} ), \ldots ,d^{G,\,F} (x_{a} ,x_{b} )]. $$

In the next section, we introduce the learning of the weighted similarity function in Eq. (7), which makes \( d(x_{a} ,x_{b} ) \) smaller when \( x_{a} \) and \( x_{b} \) are from the same individual.
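To make the vectorized form of Eq. (7) concrete, the sketch below stacks all local and global base distances into a vector \( d \) and returns its inner product with \( w \); the dictionary layout of the descriptors and the function names are illustrative assumptions.

```python
import numpy as np

def overall_similarity(w, base_dist, desc_a, desc_b, R, F):
    """Evaluate Eq. (7): d(x_a, x_b) = w^T . d, where d stacks the
    R*F local distances d^{r,f} followed by the F global distances
    d^{G,f}. `base_dist` stands for the underlying metric of
    Sect. 2.2; `desc_a`/`desc_b` map (r, f) and ('G', f) to vectors."""
    d = [base_dist(desc_a[(r, f)], desc_b[(r, f)])
         for r in range(R) for f in range(F)]              # local terms
    d += [base_dist(desc_a[('G', f)], desc_b[('G', f)])
          for f in range(F)]                               # global terms
    return float(np.dot(w, d))                             # w^T . d
```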

Structured Learning of Similarity Model.

In the training process, we randomly select one image per individual to form the gallery set, and the remaining images are used to form the probe set. We denote the training set as \( \chi = \{ x_{p}^{q} \}_{p = a,b}^{{q = 1, \ldots ,N_{p} }} \), where \( x_{p}^{q} \) denotes the q-th pedestrian from camera view p. We refer to the probe set as camera view a and the gallery set as camera view b. We also denote the ground-truth ranking structure as \( y^{ * } = \left\{ {y_{ij}^{ * } } \right\} \), where \( y_{ij}^{*} = 1 \) if \( x_{a}^{i} \) and \( x_{b}^{j} \) are the same person and \( y_{ij}^{*} = 0 \) otherwise. The training process can then be formulated as the following structured learning problem:

$$ \begin{aligned} \mathop {\hbox{min} }\limits_{w,\xi } \; & \frac{1}{2}\left\| w \right\|_{2}^{2} + C\xi \\ s.t.\; & w^{T} (\psi (\chi ,y^{ * } ) - \psi (\chi ,y)) \ge \Delta (y^{ * } ,y) - \xi ,\forall y \in \mathcal{Y},\xi \ge 0 \\ \end{aligned} $$
(8)

where \( w \) is the weight vector in Eq. (7), \( y \in \mathcal{Y} \) denotes an arbitrary predicted ranking structure, \( \left\| \cdot \right\|_{2} \) denotes the \( l_{2} {\text{-}}{\text{norm}} \) of a vector, and \( C > 0 \) is the regularization parameter. We define the feature map \( \psi (\chi ,y) \) as:

$$ \psi (\chi ,y) = \frac{1}{{N_{a} }}\sum\limits_{i = 1}^{{N_{a} }} {\sum\limits_{{k \in \chi_{i}^{ + } }}^{{}} {\sum\limits_{{j \in \chi_{i}^{ - } }}^{{}} {(1 - y_{ij} )\frac{{d(x_{a}^{i} ,x_{b}^{j} ) - d(x_{a}^{i} ,x_{b}^{k} )}}{{\left| {\chi_{i}^{ + } } \right| \cdot \left| {\chi_{i}^{ - } } \right|}}} } } $$
(9)

where \( \chi_{i}^{ - } \) denotes the set of individuals irrelevant to \( x_{a}^{i} \), and \( \chi_{i}^{ + } \) correspondingly denotes the set of relevant individuals. Since we use single-shot training, \( \left| {\chi_{i}^{ + } } \right| = 1 \) and \( \left| {\chi_{i}^{ - } } \right| = N_{b} - 1 \), so Eq. (9) simplifies to:

$$ \psi (\chi ,y) = \frac{1}{{N_{a} (N_{b} - 1)}}\sum\limits_{i = 1}^{{N_{a} }} {\sum\limits_{{j \in \chi_{i}^{ - } }}^{{}} {(1 - y_{ij} )(d(x_{a}^{i} ,x_{b}^{j} ) - d(x_{a}^{i} ,x_{b}^{k} )),k \in \chi_{i}^{ + } } } $$
(10)
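A minimal sketch of this single-shot feature map is shown below, assuming the stacked distance vectors of Eq. (7) have been precomputed into an array `D`; the array layout and names are our own.

```python
import numpy as np

def feature_map(D, pos_idx, y):
    """Eq. (10) for single-shot training. D[i, j] is the stacked
    distance vector d(x_a^i, x_b^j) of Eq. (7), shape (N_a, N_b, dim);
    pos_idx[i] is the single relevant gallery index k for probe i;
    y is the N_a x N_b ranking structure."""
    N_a, N_b, dim = D.shape
    psi = np.zeros(dim)
    for i in range(N_a):
        k = pos_idx[i]
        for j in range(N_b):
            if j == k:                       # skip the relevant individual
                continue
            psi += (1 - y[i, j]) * (D[i, j] - D[i, k])
    return psi / (N_a * (N_b - 1))
```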

The constraints enforce that, under the ground-truth ranking structure, the margin between relevant and irrelevant individuals is the largest among all ranking structures. Following the large-margin framework, we define the loss function as:

$$ \Delta (y^{ * } ,y) = \frac{1}{{N_{a} (N_{b} - 1)}}\sum\limits_{i = 1}^{{N_{a} }} {\sum\limits_{{j \in \chi_{i}^{ - } }}^{{}} {(y_{ij} - y_{ij}^{ * } )} } $$
(11)

which denotes the mean loss incurred by predicting the ranking structure \( y \) instead of the ground-truth ranking structure \( y^{ * } \). Note that other convex loss functions can also be applied.

Optimization.

In principle, we can solve this structured learning problem with the cutting-plane algorithm [12]. The basic idea of the cutting-plane algorithm is that an \( \varepsilon \)-approximate solution of the optimization problem can be obtained using only a small subset of all constraints. We list the steps in Algorithm 1. The algorithm begins with an empty constraint set. At each iteration, we solve the optimization problem to find a suitable w over the current constraint set. Based on this w, we find the most violated ranking structure \( \bar{y} \) and add it to the constraint set. The cutting-plane algorithm repeats these steps until it converges.

The calculation of the most violated constraint (Algorithm 1, step 2) can be written as:

$$ \begin{aligned} \bar{y} = & \arg \mathop {\hbox{max} }\limits_{{y \in {\mathcal{Y}}}} \Delta (y^{ * } ,y) - w^{T} (\psi (\chi ,y^{ * } ) - \psi (\chi ,y)) \\ = & \arg \mathop {\hbox{max} }\limits_{{y \in {\mathcal{Y}}}} \frac{1}{{N_{a} (N_{b} - 1)}}\sum\limits_{i = 1}^{{N_{a} }} {\sum\limits_{{j \in \chi_{i}^{ - } }}^{{}} {y_{ij} } } - \frac{1}{{N_{a} (N_{b} - 1)}}\sum\limits_{i = 1}^{{N_{a} }} {\sum\limits_{{j \in \chi_{i}^{ - } }}^{{}} {y_{ij} w^{T} (d(x_{a}^{i} ,x_{b}^{j} ) - d(x_{a}^{i} ,x_{b}^{k} ))} } \\ = & \arg \mathop {\hbox{max} }\limits_{{y \in {\mathcal{Y}}}} \frac{1}{{N_{a} (N_{b} - 1)}}\sum\limits_{i = 1}^{{N_{a} }} {\sum\limits_{{j \in \chi_{i}^{ - } }}^{{}} {(y_{ij} (1 - w^{T} d_{ij} ))} } \\ \end{aligned} $$
(12)

where \( d_{ij} = d\left( {x_{a}^{i} ,x_{b}^{j} } \right) - d\left( {x_{a}^{i} ,x_{b}^{k} } \right) \) with \( k \in \chi_{i}^{ + } \). The maximizer \( \bar{y} \) is obtained element-wise as:

$$ \bar{y}_{ij} = \left\{ \begin{aligned} 1, & \;{\text{if}}\;w^{T} d_{ij} \le 1 \\ 0, & \;{\text{otherwise}} \\ \end{aligned} \right. $$
(13)
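Putting the pieces together, the sketch below follows the cutting-plane procedure described above, using Eq. (13) for the most violated constraint, Eq. (11) for the loss, and the `feature_map` sketch from Sect. 2.1; the QP solver for Eq. (8) over the working set is abstracted behind a hypothetical `solve_qp` callable, since the paper does not specify one.

```python
import numpy as np

def most_violated(w, D, pos_idx):
    """Eq. (13): y_ij = 1 iff w^T d_ij <= 1, where
    d_ij = d(x_a^i, x_b^j) - d(x_a^i, x_b^k)."""
    N_a, N_b, _ = D.shape
    y = np.zeros((N_a, N_b))
    for i in range(N_a):
        k = pos_idx[i]
        for j in range(N_b):
            if j != k and w @ (D[i, j] - D[i, k]) <= 1:
                y[i, j] = 1
    return y

def cutting_plane(D, pos_idx, C, eps, solve_qp, max_iter=100):
    """Sketch of Algorithm 1: grow a working set of constraints until
    no constraint is violated by more than eps."""
    N_a, N_b, dim = D.shape
    y_star = np.zeros((N_a, N_b))       # y*_ij = 0 for all j in chi_i^-
    w, xi = np.zeros(dim), 0.0
    working_set = []
    for _ in range(max_iter):
        y_bar = most_violated(w, D, pos_idx)
        delta = y_bar.sum() / (N_a * (N_b - 1))            # loss, Eq. (11)
        margin = w @ (feature_map(D, pos_idx, y_star)
                      - feature_map(D, pos_idx, y_bar))
        if margin >= delta - xi - eps:  # no eps-violated constraint left
            break
        working_set.append(y_bar)
        w, xi = solve_qp(working_set, D, pos_idx, C)       # re-solve Eq. (8)
    return w
```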

2.2 Metric Method

Metric learning methods can be divided into linear [6, 9] and non-linear [2, 4] methods. In a linear method, a projection matrix M is sought so that the distance between \( x_{a} \) and \( x_{b} \) can be written as \( d(x_{a} ,x_{b} ) = (x_{a} - x_{b} )^{T} M(x_{a} - x_{b} ) \), which is small if \( x_{a} \) and \( x_{b} \) are from the same person and large otherwise. By kernelization, a linear method can easily be extended to a non-linear one, whose distance can be written as \( d\left( {x_{a} ,x_{b} } \right) = \left( {\phi \left( {x_{a} } \right) - \phi \left( {x_{b} } \right)} \right)^{T} M\left( {\phi \left( {x_{a} } \right) - \phi \left( {x_{b} } \right)} \right) \), where \( \phi \left( x \right) \) maps the feature vector into a kernel space.
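For reference, a minimal sketch of the linear distance is given below; \( M \) is assumed to be a learned positive semi-definite matrix. The kernelized form replaces \( x \) with \( \phi(x) \) and is computed in practice through kernel evaluations rather than an explicit mapping.

```python
import numpy as np

def mahalanobis_dist(x_a, x_b, M):
    """Linear metric: (x_a - x_b)^T M (x_a - x_b); small for the same
    person and large otherwise once M has been learned."""
    diff = x_a - x_b
    return float(diff @ M @ diff)
```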

In our experiments, we use kernel Local Fisher Discriminant Analysis (kLFDA) [4], a non-linear extension of the previously proposed LFDA [9]. Unlike LFDA, kLFDA learns the projection matrix M in the kernel space \( \phi \left( x \right) \). Note that other metric learning methods can also be used in our framework; we simply choose kLFDA for our experiments.

3 Experiments

We evaluate the proposed WLML method on two widely used person re-identification datasets, namely the VIPeR [10] and i-LIDS [11] databases. The following subsections describe the details of our experiments and results.

3.1 Feature Extraction

We divide a pedestrian image into 4 non-overlapping horizontal stripe regions. For each stripe region, we extract 4 types of basic features, multi-HS, multi-RGB, SILTP [13] and dense SIFT [5], which describe different aspects of person images. Among them, multi-HS and multi-RGB are 8 × 8 and 8 × 8 × 8 joint histograms, respectively, while SILTP and dense SIFT are texture descriptors extracted from the RGB and LAB channels, respectively. Each histogram feature is then normalized with the \( l_{2} {\text{-}}{\text{norm}} \). Finally, two visual cues \( F_{1} \) and \( F_{2} \) are organized as multi-HS/SILTP and multi-RGB/SIFT. For the global descriptor, we extract concatenated HS and RGB histograms with 32 bins per channel and concatenate them with a HOG [14] histogram.
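As an illustration of one of these descriptors, the sketch below computes the 8 × 8 joint hue/saturation histogram for a single stripe and normalizes it with the \( l_{2} {\text{-}}{\text{norm}} \); the use of OpenCV and the exact value ranges are our own assumptions, as the implementation is not specified in the paper.

```python
import numpy as np
import cv2  # assumed available for color conversion

def multi_hs_histogram(stripe_bgr, bins=8):
    """Illustrative 8 x 8 joint HS histogram, l2-normalized."""
    hsv = cv2.cvtColor(stripe_bgr, cv2.COLOR_BGR2HSV)
    h, s = hsv[..., 0].ravel(), hsv[..., 1].ravel()
    hist, _, _ = np.histogram2d(h, s, bins=bins,
                                range=[[0, 180], [0, 256]])  # 8-bit hue in [0, 180)
    hist = hist.ravel()
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```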

3.2 Settings

In our experiments, we use single-shot training and testing, where one image per person is randomly selected to form the gallery set and the remaining images form the probe set. For our WLML method, we set the regularization parameter C to \( 10^{2.8} \) and the accuracy threshold \( \varepsilon \) to \( 10^{ - 6} \) for all experiments. For the non-linear metric method kLFDA [4], we set the regularization parameter of the class scatter matrix to 0.01 and apply the \( {\text{RBF}}{\text{-}}\chi^{2} \) kernel to all features, setting the value of \( \sigma^{2} \) to the first quantile of all distances [4]. To evaluate the proposed method, we report the average cumulative matching characteristic (CMC) curve, which gives the rate at which the correct match is found within the top n ranks, averaged over 10 repetitions of each experiment.
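A minimal sketch of the CMC computation under the single-shot protocol is given below, assuming a precomputed probe-by-gallery distance matrix; all names are illustrative.

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids, max_rank=20):
    """dist[i, j] is the learned distance between probe i and gallery j
    (smaller means more similar). Under single-shot testing each probe
    has exactly one correct gallery match."""
    order = np.argsort(dist, axis=1)                 # rank gallery per probe
    ranked_ids = np.asarray(gallery_ids)[order]
    hits = ranked_ids == np.asarray(probe_ids)[:, None]
    first_hit = hits.argmax(axis=1)                  # rank of correct match
    return np.array([(first_hit < n).mean()          # matching rate at rank n
                     for n in range(1, max_rank + 1)])
```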

3.3 Evaluation on the VIPeR Dataset

The VIPeR dataset [10] is one of the most popular datasets for person re-identification. It consists of 632 persons captured by two cameras with a viewpoint change of 90° and varying illumination conditions. We randomly select 316 persons to form the training set, and the remaining 316 persons form the test set.

Table 1 shows the matching results compared with 6 representative methods, including kLFDA [4], LOMO+XQDA [2], ME [3] and SCSP [1], all state-of-the-art techniques that have presented promising performance in person re-identification. Our proposed WLML method outperforms most existing methods, achieving 50.9 % rank-1 accuracy on the VIPeR dataset.

Table 1. Matching rates (%) of different metric learning methods on the VIPeR dataset

To further validate the effectiveness of our proposed method, we compare it with other multi-feature methods in Table 2. Since we do not have the source code of SCSP, we compare our proposed method with ME and with the linear superposition of the local metrics in our method (SLML); SCSP and SLML differ in their local metrics and kernels. For a fair comparison, all approaches use the same features and the same training/testing sets. Our proposed WLML method outperforms the SLML and ME methods by approximately 4 % and 3 % in rank-1 accuracy, respectively, which validates the effectiveness of the proposed weighting scheme and local metric.

Table 2. Matching rates (%) of 3 methods using the same features and testing/training set

3.4 Evaluation on the i-LIDS Dataset

The i-LIDS dataset consists of 476 images of 119 persons captured by eight disjoint cameras [11]. The number of images per individual varies from 2 to 8, and the dataset presents severe occlusions caused by busy crowds and luggage. We randomly select 59 persons to form the training set, and the remaining 60 persons form the test set. Table 3 shows the matching rates compared with state-of-the-art methods. Our proposed WLML method outperforms all other methods.

Table 3. Matching rates (%) of different metric learning methods on the i-LIDS dataset

4 Conclusion

In this paper, we have proposed weighted local metric learning (WLML) for person re-identification. The proposed method learns the weights of local metrics and combines the similarity measurements of different sub-regions and different visual cues by weighted summation. Experimental results on two widely used re-identification datasets have shown the effectiveness of the proposed method.