1 Introduction

Person re-identification (re-id), which addresses the problem of identifying the same person captured by different non-overlapping cameras, is a valuable research subject for building intelligent video surveillance systems. It is also a challenging one, due to the significant intra-class variations in visual appearance caused by changes in illumination, occlusion and person pose across different views. Nevertheless, person re-id has made great progress in recent years thanks to the continuous efforts of researchers.

Most existing person re-id studies focus on metric learning in the supervised setting [1,2,3,4]. A training set with labelled matching pairs for each pair of camera views is fully utilized to learn an optimal mapping from the original feature space to a new feature space, so that a positive pair (i.e. a pair of people from different views sharing the same identity) has a smaller distance than a negative pair (i.e. a pair of people from different views with different identities) in the new feature space.

Towards this end, the most direct way is to learn a mapping function under which the distance of the positive pair equals the minimum of the distances of all pairs involving the given query person. Based on this, we propose a novel loss function for person re-id that minimizes the difference between the distance of the positive pair and the minimum of the distances of all pairs. Other loss functions for person re-id impose constraints on distances in different forms: for example, the distance of a positive pair must be lower than a given threshold and the distance of a negative pair greater than that threshold [5, 6], or the distance of a positive pair must be lower than the distance of a negative pair by a margin [1, 2, 7]. Although these loss functions ultimately optimize towards the same goal as the proposed one, i.e. positive pairs having smaller distances than negative pairs, they achieve the ranking goal in a relatively indirect way, for instance, measuring comparative similarity within a huge number of small subsets (e.g. the triplet loss or quadruplet loss) or setting a hard threshold that is neither sufficient nor necessary (e.g. the binary classification loss). The proposed loss function works towards the ranking goal directly, pushing the positive pair to the first place in the ranking, and obtains better re-id performance than other loss functions. In addition, since the proposed loss function acts directly and efficiently on the whole candidate set, the sample imbalance problem that arises when implementing other loss functions does not exist in our method.

To achieve better model optimization, considering that the minimum function in the proposed loss is non-smooth and non-differentiable at some points, we use the p-norm (with \(p<0\)), an analytic function with good differentiability, as a smooth approximation of the minimum function. The introduced parameter p is set by theoretical analysis rather than by the careful tuning that other loss functions require. It is worth mentioning that a relatively large value of p (i.e. \(p \rightarrow 0\)) imposes a strong constraint towards a larger margin between the distances of positive pairs and negative pairs, that is, it further enlarges inter-class variations and reduces intra-class variations, which is effective for improving re-id accuracy. Furthermore, we develop an efficient solver for the model in which only a small portion of pairs is used in the computation. It obtains almost the same performance on the testing set as the standard solver that uses all negative pairs.

In summary, we make three contributions in this paper as follows:

  1. We propose a novel ranking loss function for metric learning based person re-id, which optimizes the ultimate re-id ranking in the most direct way.

  2. We utilize the p-norm to develop a smooth and continuously differentiable version of the proposed ranking loss function, with the advantage of controlling the distance margin by manipulating the parameter p to achieve better generalization ability.

  3. We propose an efficient optimization algorithm that uses only a small portion of pedestrian pairs instead of all pairs in computation.

2 Related Work

In general, the person re-id problem is addressed in three crucial steps: feature extraction [8,9,10], metric learning [1, 3, 5,6,7, 11] and re-ranking [12,13,14]. Numerous methods have been proposed focusing on one or more of these steps. For a detailed review of person re-id, interested readers can refer to [15]. In this section, we only briefly review some representative metric learning based methods that are related to our work.

We can divide metric learning based person re-id methods into two broad categories: closed-form solution based and iterative-learning based. The closed-form solution based methods [9, 11, 16] are usually related to Linear Discriminant Analysis, and the optimal solution can be obtained by generalized eigenvalue decomposition. However, because the number of samples in person re-id is generally lower than the dimension of the feature vectors, these methods usually suffer from the singularity problem. In the iterative-learning based methods [1, 3, 5,6,7], an objective function is constructed and solved by iterative optimization algorithms to satisfy constraints on samples of the training set. According to the objective function, the iterative-learning methods can be grouped into three main types: binary classification loss based [5, 6], triplet loss based [1, 7] and quadruplet loss based [2]. In PCCA [5], the objective function is optimized so that the distances of positive pairs are lower than a given threshold and the distances of negative pairs are greater than it. Compared with the binary classification loss with a fixed threshold, a generalized similarity metric for person re-id was proposed with an adaptive threshold [6]. Ding et al. [7] proposed a deep neural network in which the relative distances between positive pairs and negative ones are maximized. Zhou et al. [4] presented a novel set-to-set (S2S) loss layer in a deep learning framework that focuses on maximizing the margin between the intra-class set and the inter-class set while preserving the compactness of intra-class samples. In addition to the triplet loss based methods, some methods combine and jointly optimize the binary classification loss and the triplet loss to pursue better performance [17, 18]. Recently, Chen et al. [2] proposed a quadruplet deep network, introducing a quadruplet ranking loss to achieve smaller intra-class variation and larger inter-class variation.

3 Ranking Loss Function for Person Re-identification

3.1 Problem Description

Consider a query person q from one camera view and a candidate set of N people \(\mathcal {G} = \{ g_i \mid i = 1,2, \ldots ,N \}\) from another camera view. The distance between q and \(g_i\) is measured by

$$\begin{aligned} d_L(q,{g_i}) = {\left\| {L({\varvec{x}_q} - {\varvec{x}_{g_i}})} \right\| ^2}, \end{aligned}$$
(1)

where \(\varvec{x}_q\) and \(\varvec{x}_{g_i}\) are the feature vectors of persons q and \(g_i\) obtained in the feature extraction step, and L is a feature transformation matrix.

A ranking list \(\mathcal {L}(q,\mathcal {G}) = \{ g_1^r,g_2^r, \ldots ,g_N^r\}\) is obtained by sorting the distances between the query person and the candidates in ascending order, i.e. \(d(q,g_1^r)< d(q,g_2^r)< \ldots < d(q,g_N^r)\). For person re-id, we hope that the candidate sharing the same identity as the query person q (also known as the positive sample) is close to the top of the ranking list; ideally, the positive sample ranks first. To this end, we develop a model to learn an optimal feature transformation L from the training set.
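As a minimal illustration, Eq. 1 and the construction of the ranking list can be written as the NumPy sketch below. This is our own hedged rendering, not the authors' code; the dimensions and random features are assumptions for demonstration only.

```python
import numpy as np

def distance(L, x_q, x_g):
    """Eq. 1: d_L(q, g) = ||L (x_q - x_g)||^2."""
    diff = L @ (x_q - x_g)
    return float(diff @ diff)

def ranking_list(L, x_q, X_gallery):
    """Return gallery indices sorted by ascending distance to the query."""
    dists = np.array([distance(L, x_q, x_g) for x_g in X_gallery])
    return np.argsort(dists)

# Toy usage with random features (shapes are illustrative only).
rng = np.random.default_rng(0)
dim, N = 16, 10
L = rng.standard_normal((8, dim))        # feature transformation matrix
x_q = rng.standard_normal(dim)           # query feature vector
X_gallery = rng.standard_normal((N, dim))
print(ranking_list(L, x_q, X_gallery))   # ideally the positive sample comes first
```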

3.2 Modeling

Without loss of generality, we introduce our method in this subsection under the single-shot assumption, i.e. each query person shares the same identity with exactly one gallery person in the candidate set.

For each query person \(q_i\) (\(i = 1, \ldots ,M\)), the candidate set \(\mathcal {G}\) always contains one person, denoted by \(g^{i+}\), with the same identity as \(q_i\), while all remaining people, denoted by \(g_j^{i - }\) \((j = 1,2, \ldots ,N - 1)\), have different identities from \(q_i\). With the goal of placing the positive sample \(g^{i+}\) at the top of the ranking list, we seek an optimal feature transformation L under which the distance between the feature vectors of \(q_i\) and the positive sample \(g^{i+}\) is the minimum of the distances between \(q_i\) and every sample in \(\mathcal {G}\). The objective function is therefore:

$$\begin{aligned} \mathop {\min }\limits _L {f^*}(L) = \sum \nolimits _{i = 1}^M {{[{d_L}({q_i},g^{i + }) - \mathop {\min }\limits _{{g_n} \in {\mathcal {G}}} {d_L}({q_i},{g_n})} ]}. \end{aligned}$$
(2)

For convenience of expression, we abbreviate \({{d_L}({q_i},g^{i + })}\) as \({d_{{q_i},g^{i + }}}\); all other \({d_L}(\cdot ,\cdot )\) are abbreviated analogously below.

When each term in \(f^*(L)\) equals zero, that is, when every positive sample ranks first in the corresponding ranking list, the objective function reaches its lower bound of zero. However, the minimum function in Eq. 2 is non-smooth and non-differentiable at some points, which makes the model in Eq. 2 intractable to solve. We therefore use the p-norm as a smooth approximation of the minimum function:

$$\begin{aligned} {(\sum \nolimits _{{g_n} \in {\mathcal {G}}} {{{({d_{{q_i},{g_n}}})}^p}} )^{\frac{1}{p}}} \approx \mathop {\min }\limits _{{g_n} \in {\mathcal {G}}} {d_{{q_i},{g_n}}}. \end{aligned}$$
(3)

Then, the new objective function is given by:

$$\begin{aligned} \mathop {\min }\limits _L f(L) = \sum \nolimits _{i = 1}^M { {[{d_{{q_i},g^{i + }}} - {{(\sum \nolimits _{{g_n} \in {\mathcal {G}}} {{{({d_{{q_i},{g_n}}})}^p}} )}^{\frac{1}{p}}}} ]}. \end{aligned}$$
(4)

The model in Eq. 4 not only inherits the advantage of the one in Eq. 2, i.e. it learns the transformation L in a direct way by requiring the positive pair to be ranked first, but is also amenable to solving for the optimal transformation L. In addition, the introduced parameter p controls the degree of margin between the distances of positive and negative pairs, giving the model more flexibility to obtain better performance.
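As a concrete illustration, the sketch below implements the smooth minimum of Eq. 3 and the loss of Eq. 4; the numeric check at the end shows how the p-norm tracks the true minimum from below for \(p<0\). Again, this is our own hedged rendering of the formulas, not the authors' implementation.

```python
import numpy as np

def pnorm_min(dists, p=-5.0):
    """Smooth approximation of min(dists) via the p-norm with p < 0 (Eq. 3)."""
    dists = np.asarray(dists, dtype=float)
    return np.sum(dists ** p) ** (1.0 / p)

def ranking_loss(L, queries, positives, galleries, p=-5.0):
    """Eq. 4: sum_i [ d(q_i, g^{i+}) - smooth min over the candidate set ]."""
    total = 0.0
    for x_q, x_pos, X_g in zip(queries, positives, galleries):
        d_pos = float(np.sum((L @ (x_q - x_pos)) ** 2))
        d_all = [np.sum((L @ (x_q - x_g)) ** 2) for x_g in X_g]
        total += d_pos - pnorm_min(d_all, p)
    return total

# The approximation tightens as p -> -inf and loosens as p -> 0.
print(pnorm_min([2.0, 3.0, 5.0], p=-50.0))  # ~2.0, close to the true minimum
print(pnorm_min([2.0, 3.0, 5.0], p=-5.0))   # ~1.95, slightly below the minimum
```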

Now, we elaborate on why the model in Eq. 4 has these advantages. To this end, we reformulate the p-norm in Eq. 3 as:

$$\begin{aligned} {\Big (\sum \nolimits _{{g_n} \in {\mathcal {G}}} {{{({d_{{q_i},{g_n}}})}^p}} \Big )^{\frac{1}{p}}} = {d_{{q_i},g^{i + }}}{\Big (1 + \sum \nolimits _{t = 1}^{N - 1} {{{\Big (\frac{{{d_{{q_i},g_t^{i - }}}}}{{{d_{{q_i},g^{i + }}}}}\Big )}^p}} \Big )^{\frac{1}{p}}}. \end{aligned}$$
(5)

Substituting Eq. 5 into Eq. 4, we find that the minimum value of the objective function f(L) is reached if and only if \({(\frac{{{d_{{q_i},g_t^{i - }}}}}{{{d_{{q_i},g^{i + }}}}})^p} = 0\) \((\forall t = 1, \ldots ,{N-1})\) for each person \(q_i\). Since we want \({d_{{q_i},g^{i + }}} < {d_{{q_i},g_t^{i - }}}\) \((\forall t = 1, \ldots ,{N-1})\) for each person \(q_i\), meaning \(\frac{{{d_{{q_i},g_t^{i - }}}}}{{{d_{{q_i},g^{i + }}}}} > 1\) for all terms, \({(\frac{{{d_{{q_i},g_t^{i - }}}}}{{{d_{{q_i},g^{i + }}}}})^p} = 0\) can only be satisfied if \(p < 0\). It follows that the model in Eq. 4, with a smooth and continuously differentiable loss function, achieves the same functionality as the one in Eq. 2, as long as we choose an appropriate value of p.

Under the basic constraint \(p<0\), there are two cases in which \({(\frac{{{d_{{q_i},g_t^{i - }}}}}{{{d_{{q_i},g^{i + }}}}})^p} = 0\) is satisfied:

  1. When \(p \rightarrow - \infty \), the value of \(\frac{{{d_{{q_i},g_t^{i - }}}}}{{{d_{{q_i},g^{i + }}}}}\) needs only to be slightly larger than 1, meaning that the margin between the distances of positive pairs and negative pairs is very small;

  2. When \(p \rightarrow 0\), the value of \(\frac{{{d_{{q_i},g_t^{i - }}}}}{{{d_{{q_i},g^{i + }}}}}\) needs to be much larger than 1, meaning that the margin between the distances of positive pairs and negative pairs is large.

It is obvious that when we optimize the objective function in Eq. 4 and search for its minimum, the value of p controls the degree of margin between the distances of positive pairs and negative pairs. We thus have the flexibility of choosing p based on the demanded margin.

We argue that the performance of the model on the testing set can be improved by further enlarging the inter-class variations and reducing the intra-class variations [19]. Therefore, it is beneficial to set p to a relatively large value (i.e. \(p \rightarrow 0\)) in the experiments. However, it is imperative to note that, as p gets closer and closer to zero, the learnt transformation L drives \(\frac{{{d_{{q_i},g_t^{i - }}}}}{{{d_{{q_i},g^{i + }}}}} \rightarrow \infty \) in order to satisfy \({(\frac{{{d_{{q_i},g_t^{i - }}}}}{{{d_{{q_i},g^{i + }}}}})^p} = 0\), which means the denominator \({d_{{q_i},g^{i + }}} \rightarrow 0\) while the constraint on the numerator \({{d_{{q_i},g_t^{i - }}}}\) is weakened during optimization. To avoid two undesirable consequences, (1) model overfitting and (2) losing control of the margin between the distances of positive pairs and negative pairs, we make the safer choice of setting p relatively large (towards zero) but not too close to zero. In the experiments, we set \(p=-5\) and provide a detailed validation of the proposed method with different values of p to justify the analysis above.
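A quick numeric check of this analysis: for a single term \(r^p\) (with r the negative-to-positive distance ratio) to fall below a tolerance \(\varepsilon\), the required ratio is \(r = \varepsilon^{1/p}\). The tolerance below is our own choice, purely for illustration.

```python
# Required ratio r = d(neg)/d(pos) so that r**p drops below a tolerance eps.
eps = 1e-3  # illustrative tolerance, our own assumption
for p in (-50.0, -10.0, -5.0, -1.0):
    print(f"p = {p:6.1f}  ->  required ratio r = {eps ** (1.0 / p):8.2f}")
# p -> -inf needs r barely above 1 (tiny margin); p -> 0 forces a huge margin.
```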

3.3 Optimization

It is worth noting that we improve the model from Eq. 2 to Eq. 4, obtaining a smooth and continuously differentiable objective function, at the cost of increased computational complexity. Specifically, all negative pairs corresponding to each positive pair need to be traversed when optimizing the model in Eq. 4, whereas only one negative pair per positive pair is involved in the computation when optimizing the model in Eq. 2.

Therefore, when optimizing the model in Eq. 4, we simplify the computation so that only the k sample pairs with the smallest distances for each positive pair are used. We use a gradient descent scheme with line search to solve the model, and the simplified objective function at the t-th iteration is formulated as:

$$\begin{aligned} {f_t}(L) = \sum \nolimits _{i = 1}^M { {[{d_{{q_i},g^{i + }}} - {{(\sum \nolimits _{{g_n} \in \varOmega _t^{i,j,k}} {{{({d_{{q_i},{g_n}}})}^p}} )}^{\frac{1}{p}}}]} }, \end{aligned}$$
(6)

where \(\varOmega _t^{i,j,k} \subseteq {\mathcal {G}}\) contains the k sample pairs with the smallest distances under the feature transformation \({L_{t - 1}}\). If \(k = N\), then \(\varOmega _t^{i,j,k} = {\mathcal {G}}\) and the optimization process is equivalent to that of Eq. 4; if \(k = 1\), the model reverts to the one in Eq. 2.

The corresponding gradient at the t-th iteration is derived as follows:

$$\begin{aligned} \frac{{\partial f_t(L)}}{{\partial L}} = \sum \nolimits _{i = 1}^M { {[{\pi _L}({q_i},g^{i + }) - \rho ({q_i},g^{i + })} \sum \nolimits _{{g_n} \in \varOmega _t^{i,j,k}} {{{({d_{{q_i},{g_n}}})}^{p - 1}}{\pi _L}({q_i},{g_n})} ]}, \end{aligned}$$
(7)

where \({\pi _L}(\cdot , * )\) is the derivative of distance measurement \({d _L}\) with respect to L:

$$\begin{aligned} {\pi _L}(\cdot , * ) = 2L(\varvec{x_\cdot } - \varvec{x_ * }){(\varvec{x_\cdot } - \varvec{x_ * })^T}, \end{aligned}$$
(8)

and \(\rho ({q_i},g^{i + })\) is a scalar coefficient:

$$\begin{aligned} \rho ({q_i},g^{i + }) = {(\sum \nolimits _{{g_n} \in \varOmega _t^{i,j,k}} {{{({d_{{q_i},{g_n}}})}^p}} )^{\frac{1}{p} - 1}}. \end{aligned}$$
(9)

As the iterations approach the optimal solution, the positive pair \(({q_i},g^{i + })\) is steadily pushed to the first rank, with the minimum distance among all sample pairs \(({q_i},{g_n})\), \({g_n} \in \mathcal {G}\). Hence, although we discard some sample pairs during optimization, the procedure is still effectively equivalent to solving the model with all sample pairs. In the experiments we set \(k = 2\); the related comparison experiments show that the proposed method with \(k=2\) achieves almost the same performance as with \(k = N\) while requiring a shorter running time.
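The sketch below shows one evaluation of the simplified loss and gradient (Eqs. 6–9) under the top-k scheme. It is a minimal NumPy rendering with our own data-layout assumptions (rows of X_g are gallery feature vectors), and a plain gradient step stands in for the line search of Algorithm 1.

```python
import numpy as np

def rloss_value_and_grad(L, queries, positives, galleries, p=-5.0, k=2):
    """Loss and gradient of Eq. 6 using, per query, only the k gallery
    samples with the smallest current distances (the set Omega)."""
    grad = np.zeros_like(L)
    loss = 0.0
    for x_q, x_pos, X_g in zip(queries, positives, galleries):
        proj = (x_q - X_g) @ L.T                 # row n is L (x_q - x_g_n)
        dists = np.sum(proj ** 2, axis=1)
        top = np.argsort(dists)[:k]              # Omega: k smallest distances
        s = np.sum(dists[top] ** p)
        d_pos = float(np.sum((L @ (x_q - x_pos)) ** 2))
        loss += d_pos - s ** (1.0 / p)
        pi_pos = 2.0 * np.outer(L @ (x_q - x_pos), x_q - x_pos)   # Eq. 8
        rho = s ** (1.0 / p - 1.0)                                # Eq. 9
        pi_neg = np.zeros_like(L)
        for n in top:                            # second term of Eq. 7
            pi_neg += dists[n] ** (p - 1.0) * 2.0 * np.outer(proj[n], x_q - X_g[n])
        grad += pi_pos - rho * pi_neg
    return loss, grad

# One plain gradient step (Algorithm 1 uses line search instead):
# loss, g = rloss_value_and_grad(L, Q, P, G); L = L - 1e-3 * g
```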

For convenience, we refer to the proposed method as R-Loss below. The overall optimization process of R-Loss is shown in Algorithm 1.

3.4 Model Extension

In this section, we extend our method to the general multi-shot case, i.e. for a query person there may be more than one positive sample in the candidate set.

We formally express this case. For each query person \(q_i\) (\(i = 1, \ldots ,M\)), there are several positive samples, denoted by \({\mathcal {G}^{i + }} = \{ g_j^{i + } \in \mathcal {G} \mid j = 1,2, \ldots ,{N^{i + }}\}\), and a majority of negative samples, denoted by \({\mathcal {G}^{i - }} = \{ g_j^{i - } \in \mathcal {G} \mid j = 1,2, \ldots ,{N^{i - }}\}\), where \({N^{i + }} + {N^{i - }} = N\).

[Algorithm 1. The overall optimization process of R-Loss.]

In the single-shot case, we hope that the positive sample ranks first in the ranking list on the test set, so our training objective is that the distance of the positive pair be the minimum of the distances of all pairs. Similarly, in the multi-shot case, we certainly hope that one of the positive samples ranks first on the test set. To obtain better test performance, we impose a stronger constraint on the training set: all positive pairs should be at the front of the ranking list. To this end, we learn the optimal transformation L so that the distance of each positive pair is the minimum of the distances of this positive pair and all negative pairs sharing the same query person. The objective function is therefore:

$$\begin{aligned} \mathop {\min }\limits _L {f_{ex}}(L) = \sum \nolimits _{i = 1}^M {\sum \nolimits _{g_j^{i + } \in {\mathcal {G}^{i + }}} {[{d_{{q_i},g_j^{i + }}} - {{(\sum \nolimits _{{g_n} \in \{ g_j^{i + }\} \cup {\mathcal {G}^{i - }}} {{{({d_{{q_i},{g_n}}})}^p}} )}^{\frac{1}{p}}}]} }. \end{aligned}$$
(10)

When \({N^{i + }}=1\), the model in Eq. 10 reverts to the one in Eq. 4. We use the gradient descent scheme to solve the model in Eq. 10; the solving process is similar to Algorithm 1 and is not repeated here.
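For reference, a minimal per-query sketch of Eq. 10 under the same NumPy conventions as above; note that each positive competes only against the negatives plus itself, never against the other positives.

```python
import numpy as np

def multishot_loss(L, x_q, X_pos, X_neg, p=-5.0):
    """Eq. 10 for a single query: sum over positives of
    d(q, g+) minus the p-norm smooth min over {g+} union G^-."""
    d_neg = np.array([np.sum((L @ (x_q - x)) ** 2) for x in X_neg])
    loss = 0.0
    for x_pos in X_pos:
        d_pos = float(np.sum((L @ (x_q - x_pos)) ** 2))
        pool = np.append(d_neg, d_pos)           # {g_j^{i+}} union G^{i-}
        loss += d_pos - np.sum(pool ** p) ** (1.0 / p)
    return loss
```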

4 Discussion About Different Loss Functions

In this section, we present a detailed discussion of the differences between the proposed ranking loss function and three existing classical loss functions: the binary classification loss, the triplet loss and the quadruplet loss. Figure 1 illustrates how each loss function constrains the relationship between the distances of positive pairs and negative pairs.

Fig. 1. The relationships between positive pairs' distances and negative pairs' distances for the loss functions. The blue line and point represent the positive pair and the red ones represent the negative pair. Better viewed in colour. (Color figure online)

First of all, we formalize the three loss functions for the general multi-shot case as follows:

(1) the binary classification loss function,

$$\begin{aligned} \mathop {\min }\limits _L {f^*_{binary}}(L) = \sum \nolimits _{i = 1}^M {\sum \nolimits _{j = 1}^N {\max (0,{y_{ij}}({d_{{q_i},{g_j}}} - c))} } , \end{aligned}$$
(11)

where \({y_{ij}} = \left\{ {\begin{array}{*{20}{c}} 1 &{} {{g_j} \in {\mathcal {G}^{i + }}} \\ { - 1} &{} {{g_j} \in {\mathcal {G}^{i - }}} \\ \end{array}} \right. \) is a sign function and c is a margin threshold. Considering that the hinge function \(h(x) = \max (0,x)\) is non-smooth and non-differentiable, another form of the binary classification loss function is usually used for person re-id,

$$\begin{aligned} \mathop {\min }\limits _L {f_{binary}}(L) = \sum \nolimits _{i = 1}^M {\sum \nolimits _{j = 1}^N {{l_\beta }({y_{ij}}({d_{{q_i},{g_j}}} - c))} }, \end{aligned}$$
(12)

where \({l_\beta }(x) = \frac{1}{\beta }\log (1 + {e^{\beta x}})\) is a smooth approximation of the hinge function, with \(\beta > 0\) controlling the sharpness of the approximation.

(2) the triplet loss function,

$$\begin{aligned} \mathop {\min }\limits _L {f_{triplet}}(L) = \sum \nolimits _{i = 1}^M {\sum \nolimits _{g_j^{i + } \in {\mathcal {G}^{i + }}} {\sum \nolimits _{g_l^{i - } \in {\mathcal {G}^{i - }}} {\max (0,{d_{{q_i},g_j^{i + }}} - {d_{{q_i},g_l^{i - }}} + c)} } }, \end{aligned}$$
(13)

where c is a margin threshold.

(3) the quadruplet loss function,

$$\begin{aligned} \mathop {\min }\limits _L {f_{quad}}(L) = \sum \nolimits _{i = 1}^M \sum \nolimits _{g_j^{i + } \in {\mathcal {G}^{i + }}} \Big [ \sum \nolimits _{g_l^{i - } \in {\mathcal {G}^{i - }}} \max (0,\,{d_{{q_i},g_j^{i + }}} - {d_{{q_i},g_l^{i - }}} + c_1) + \sum \nolimits _{m \ne i} \sum \nolimits _{g_l^{m - } \in {\mathcal {G}^{m - }}} \max (0,\,{d_{{q_i},g_j^{i + }}} - {d_{{q_m},g_l^{m - }}} + c_2) \Big ], \end{aligned}$$
(14)

where \(c_1\) and \(c_2\) are the margin thresholds.
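For concreteness, minimal per-term sketches of Eqs. 11–14 follow. These are our own hedged renderings of the formulas above, and the default thresholds simply mirror the settings reported later in Sect. 5.3.

```python
import numpy as np

def binary_term(d, y, c=1.0, beta=None):
    """One term of Eq. 11 (hinge) or Eq. 12 (smooth, if beta is given)."""
    z = y * (d - c)
    if beta is None:
        return max(0.0, z)
    return float(np.log1p(np.exp(beta * z)) / beta)   # generalized logistic l_beta

def triplet_term(d_pos, d_neg, c=1.0):
    """One term of Eq. 13."""
    return max(0.0, d_pos - d_neg + c)

def quadruplet_term(d_pos, d_neg_same_probe, d_neg_other_probe, c1=1.0, c2=0.5):
    """One term of Eq. 14: the second hinge uses a negative pair whose
    probe differs from that of the positive pair."""
    return (max(0.0, d_pos - d_neg_same_probe + c1)
            + max(0.0, d_pos - d_neg_other_probe + c2))
```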

For the binary classification loss function, only the upper bound of positive pairs' distances and the lower bound of negative pairs' distances are constrained by the threshold c; the margin between positive and negative pairs is not constrained. In the proposed loss function, by contrast, the value of p controls the margin, and we can set p to yield a large margin between positive and negative pairs in the newly learnt space, which benefits re-id performance.

For the triplet loss function, the value of c directly determines the margin between positive and negative pairs in the newly learnt space: a small value of c results in a small margin and vice versa. However, in general we do not know what the margin should be, so it is difficult to choose a specific value of c. In the proposed loss function, by contrast, the value of p controls the degree of the margin, and since we know that a large margin in the newly learnt space benefits performance, we can set p accordingly.

For the quadruplet loss function, a stricter constraint on the margin is established. Similar to the triplet loss function, it is difficult to choose appropriate margin values; worse still, there are two parameters, \(c_1\) and \(c_2\), that need to be set.

The proposed ranking loss function and the three loss functions above all aim to drive the distance of the positive pair to the minimum of the distances of all pairs for each query person, but they reach this goal in different ways, and the proposed ranking loss function takes the most direct one, which best matches the person re-id problem. In addition, from Eqs. 11–14 we can see that all negative pairs are involved in the computation for each positive pair, resulting in a high computational demand. Therefore, researchers have proposed solving the model with one positive pair versus a fraction of the negative pairs for the binary classification loss function, and one positive pair versus the hardest negative pair (i.e. the negative pair with the smallest distance among all negative pairs) for the triplet and quadruplet loss functions. Even so, the binary classification loss function suffers from sample imbalance, and the improved models for the triplet and quadruplet loss functions are intractable to solve due to the operation of finding the hardest negative pair. In contrast, the sample imbalance problem does not exist in the proposed ranking loss function, since we only impose constraints on the rank of the positive pair; meanwhile, only two pairs per positive pair are involved in the computation of our model, with almost the same performance and higher computational efficiency compared to using all pairs.

5 Experiments

In this section, we conduct experiments from three aspects: (1) evaluating the performance of the proposed method with different settings; (2) comparing the proposed method with other loss function based methods; (3) comparing the proposed method with state-of-the-art methods. Before starting, we give a detailed introduction to the experimental settings.

5.1 Experimental Settings

Datasets. Two publicly available datasets are used in the experiments: VIPeR [20] and CUHK01 [21]. The VIPeR dataset includes 632 pedestrian image pairs captured by two different cameras in an outdoor environment, with only one image per person for each view. The CUHK01 dataset has 971 identities taken from two camera views in a campus environment, with 2 images of every person under each camera view.

Features. Three feature descriptors are used in the experiments: Local Maximal Occurrence (LOMO) [9], Gaussian Of Gaussian (GOG) [22] and Histogram of Intensity Pattern & Histogram of Ordinal Pattern (HIPHOP) [23]. For the LOMO descriptor, the horizontal occurrences of local features described by the Scale Invariant Local Ternary Pattern (SILTP) and HSV histograms are maximized to make a representation stable against viewpoint changes. For the GOG descriptor, the person image is described by cascaded Gaussian distributions in which both means and covariances are considered. The HIPHOP descriptor describes the person image based on the AlexNet [24] convolutional neural network.

Evaluation Protocol. We evaluate the performance of person re-id methods by Cumulative Matching Characteristics (CMC), where Rank r represents the expectation of finding the correct match within the top r of the ranking list. The reported results are averaged over 10 random partitions of the training and test sets. For each partition, following most publications, the identities of the VIPeR dataset are randomly divided into two equal parts for training and testing respectively; for the CUHK01 dataset, 485 identities are used for training and 486 for testing, under both single-shot (M1) and multi-shot (M2) settings.
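A minimal sketch of the single-shot CMC computation used here, assuming a precomputed query-by-gallery distance matrix; the variable names are our own.

```python
import numpy as np

def cmc(dist_matrix, query_ids, gallery_ids, max_rank=20):
    """Fraction of queries whose correct match appears within the top r
    of the ranking list, for r = 1..max_rank (single-shot)."""
    gallery_ids = np.asarray(gallery_ids)
    hits = np.zeros(max_rank)
    for i, q_id in enumerate(query_ids):
        order = np.argsort(dist_matrix[i])            # ascending distances
        rank = int(np.where(gallery_ids[order] == q_id)[0][0])
        if rank < max_rank:
            hits[rank:] += 1.0
    return hits / len(query_ids)

# cmc(D, q_ids, g_ids)[0] is the Rank 1 accuracy.
```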

5.2 Analysis of the Proposed Method

p-norm vs. min. We aim to learn an optimal feature transformation L under which the distance of the positive pair is the minimum of the distances of all sample pairs sharing the same query person in the new feature space. The p-norm is used as a smooth approximation of the minimum function in Eq. 3, and its superiority over directly using the minimum function was analyzed theoretically in Sect. 3.2. In this subsection, we validate the theoretical results experimentally with the GOG feature descriptor. The CMC curves of R-Loss(p-norm) and R-Loss(min) on the VIPeR and CUHK01 datasets are reported in Fig. 2. We observe that R-Loss(p-norm) improves on R-Loss(min) by a large margin on all datasets: using the p-norm as a smooth approximation of the minimum function increases Rank 1 accuracy by 24.55%, 17.53% and 20.02% on VIPeR, CUHK01(M1) and CUHK01(M2), respectively. This reveals that the gradient descent optimization converges to a better solution for the R-Loss(p-norm) model than for the R-Loss(min) model.

Fig. 2. Comparison of CMC curves for the proposed method using the p-norm function and the minimum function.

On the Parameter p. In Sect. 3.2, we presented a detailed analysis of the selection of the parameter p and concluded that, under the basic constraint \(p<0\), a larger value of p that is nevertheless not too close to zero benefits re-id accuracy. In this subsection, we compare the performance of the proposed R-Loss method with different values of p on the VIPeR and CUHK01 datasets, using the LOMO and GOG descriptors as the feature representations respectively. As shown in Fig. 3, as \(p \rightarrow - \infty \) the Rank 1 accuracy gradually degrades on all datasets; more specifically, when \(p < - 10\) the accuracy degrades rapidly. This indicates that a small value of p (\(p \rightarrow - \infty \)) is detrimental to performance. We can also see that the Rank 1 accuracy degrades rapidly with \(p = -1\) on the CUHK01(M1) and CUHK01(M2) datasets. Although the proposed method with \(p = -1\) performs well on the VIPeR dataset, setting p close to zero is the riskier choice. Therefore, we set \(p=-5\) in the experiments.

Fig. 3. Rank 1 performance of the proposed method with different values of p.

On the Parameter k. In the process of optimizing the model in Eq. 4, we introduced a simplified algorithm to find the optimal solution: at each iteration, the k sample pairs with the smallest distances for each positive pair are used in the computation. In this subsection, we investigate the effect of k on the performance and running time of the proposed R-Loss method, using the LOMO and GOG feature descriptors on the VIPeR and CUHK01 datasets. Table 1 shows the results obtained by varying k from 2 to N. We can see that the proposed method is largely insensitive to k in performance (fluctuations of around 2%), while the running time gradually decreases as k decreases. When \(k = N\), all samples are used in the optimization, making the optimization process much more complex, which may slightly affect performance. As a result, \(k=2\) is the optimal choice.

5.3 Comparison with Other Loss Function Based Methods

In this paper, we propose a novel metric learning based person re-id method that optimizes a ranking loss function. Compared with the binary classification loss based [5, 6], triplet loss based [1, 7] and quadruplet loss based [2] methods, the proposed ranking loss based method requires no careful parameter selection and has no sample imbalance problem. In the following, we verify a further critical advantage of the proposed method over classical loss function based methods: better generalization ability on the testing set (Table 1).

Table 1. Effects of k on the performance and running time of the proposed R-Loss method. Running time refers to the time of one iteration of the optimization algorithm. \(k=N\) represents the case of using all samples in optimization.
Table 2. Comparison with other loss function based methods on the VIPeR and CUHK01 datasets. The best results (%) are shown in red. Better viewed in colour.

For a fair comparison, the GOG feature descriptor is used in this experiment. Following most methods, we set \(c=1\) for both the binary classification loss function and the triplet loss function, and \(c_1=1\), \(c_2=0.5\) for the quadruplet loss function. The comparison results are shown in Table 2: the proposed R-Loss method outperforms the other loss function based methods on all datasets. Specifically, our method achieves a gain of about 3.9% at Rank 1 over the second-best result, from the smooth binary classification loss based method, on the CUHK01 dataset under both single-shot and multi-shot settings.

5.4 Comparison with State-of-the-Art Methods

In this section, we evaluate our method against state-of-the-art person re-id methods. The results of our method are obtained by a simple fusion of the scores produced with the three feature descriptors mentioned above.

Result on VIPeR Dataset. We compare the proposed method with 13 existing person re-id methods. From the results shown in Table 3, we can see that our method achieves the highest performance at Rank 1, Rank 5 and Rank 20. More specifically, our method beats the closest competitor, EBG [25], by 1.11% at Rank 1. EBG [25], a feature extraction based method in which a deep neural network is proposed to solve the background bias problem, still yields poorer results than our method, which is based on classical metric learning. This indicates that deep learning based methods currently do not work well on small person re-id datasets.

Table 3. Comparison with the state-of-the-art methods on VIPeR dataset. The best results for deep learning based and traditional methods (%) are respectively shown in boldface and red. Better viewed in colour.
Table 4. Comparison with the state-of-the-art methods on CUHK01 dataset. The best results for deep learning based and traditional methods (%) are respectively shown in boldface and red. Better viewed in colour.

Result on CUHK01 Dataset. The proposed R-Loss method is compared with traditional methods and deep learning based methods on the CUHK01 dataset under the multi-shot setting. Table 4 shows that the best results come from the deep learning based method MC-PPMN [32], in which deep feature representations for semantic components and color-texture distributions are learned with a pyramid person matching network. Nevertheless, the proposed method is competitive among the traditional methods at all ranks.

6 Conclusions

In this paper, we propose a novel ranking loss function that approaches the person re-id problem from a new perspective. The loss function is optimized so that the distance of the positive pair becomes the minimum of the distances of all pairs involving the given query person, which closely matches the person re-id problem. Moreover, we propose using the p-norm to approximate the minimum function, yielding a smooth and continuously differentiable loss function that is favorable for model solving. In addition, only the two sample pairs with the smallest distances are used during model optimization to improve efficiency, with almost the same accuracy as the model using all pairs. Extensive experiments with thorough analysis demonstrate that the proposed R-Loss method achieves superior performance over state-of-the-art methods on the VIPeR and CUHK01 datasets. In future research, we will explore applying the ranking loss in deep learning frameworks so that competitive performance can also be achieved on large datasets.