1 Introduction

The goal of person re-identification (re-id) is to match the same individuals across multiple non-overlapping cameras. Given one target pedestrian image, we aim to find the same individual among a set of candidate images in the gallery set, which are captured by disjoint camera networks. This is a crucial task in the computer vision community and is essential for many video-surveillance applications. It remains challenging for four reasons: 1) variations in visual appearance, caused by viewpoint changes across the camera network or by illumination changes; 2) background clutter and occlusions, which occur in crowded scenes such as airports or markets; 3) changes in human pose, which occur as one person performs different activities at different times and places; 4) different identities sharing a similar appearance, such as many people wearing similar clothes in public spaces. Fig. 1 illustrates some examples from three person re-id benchmark datasets, namely i-LIDS [37], PRID2011 [19] and CUHK03 [53]. Images in the red bounding boxes show the same individuals, and the first three rows show the challenges caused by viewpoint and illumination changes, background clutter and occlusion, and changes in human pose, respectively. Images in the yellow bounding boxes are from different identities that nevertheless share a very similar appearance. Although many algorithms have been proposed to tackle this problem, the representation power of the learned features or metrics may still be limited.

Fig. 1

Typical challenging examples for person re-id in the i-LIDS, PRID2011 and CUHK03 datasets. Images in red bounding boxes in rows a, b and c are from the same person, and the rows show the challenges caused by viewpoint and illumination changes, background clutter and occlusion, and changes in human pose, respectively. Images in yellow bounding boxes in row d are from different identities who wear similar clothes and therefore have a similar appearance

To address these challenges, many methods have been proposed in the past few years [14, 17, 31, 32, 39, 44]. They can be divided into two main categories: one learns discriminative and robust feature representations for both the query and gallery images, and the other develops good distance metrics to measure the similarity between pedestrian images. In the first category, different cues, such as the color, texture and shape of the pedestrian images, are explored to obtain robust and distinctive feature representations of each individual's visual appearance. Representative works include [2, 8, 12, 13, 14, 17, 21, 31, 39]. In the metric learning category, a distance metric is learned from the labeled training samples, with the aim of bringing images of the same individual closer and pushing images of different individuals far apart. Representative works include [17, 32, 39].

Deep learning based methods have recently achieved promising results on almost all computer vision tasks, especially image classification [15, 24]. This encourages us to use a deep convolutional neural network (CNN) to learn image feature representations under the supervision of widely used distance metrics, in order to further improve performance. The triplet loss is one of the most widely used loss functions for person re-id; it requires the distance between images of the same individual to be smaller than the distance between images of different individuals by a large margin. The softmax loss, in turn, is commonly used in classification tasks to capture the identity information of each class. In this paper, we analyze the effectiveness of these two loss functions on the person re-id task, using different architectures on widely used benchmark datasets. We run experiments with a shallow part-based model on two relatively small datasets, i-LIDS and PRID2011, and with the deep DGD [44] model on the larger CUHK03 dataset. We did not use the deep model on the small datasets or the shallow network on the large dataset, because these two cases can lead to over-fitting and under-fitting, respectively. We find that the triplet loss is better suited to relatively small datasets with a shallow network architecture, while the softmax loss is better suited to large datasets with a relatively deep network architecture. The reason is as follows. The triplet loss is formed over image triplets, which is inefficient and unstable on large datasets because the number of training triplets grows rapidly with the number of training images. On small datasets, this relatively weakly supervised loss can easily make use of nearly all the sample triplets. On large datasets, however, it is inefficient to use all possible sample triplets with a deep network architecture, so the network cannot capture all the useful information and under-fits the data. The softmax loss, by contrast, is a relatively strongly supervised loss function. On small datasets it can easily over-fit the training set and then generalizes poorly to the test set. This observation can guide the design of neural network architectures for person re-id in future work.

In addition, we propose an asymmetric method to improve the training of the original triplet loss, in which the distance between images of different persons is constructed from two anchor images instead of only one. Moreover, based on our observations about the two loss functions, we propose to train the CNN architecture under their joint supervision. The triplet loss can greatly reduce the variations among different images of one person, while the identification loss can enlarge the inter-personal variations. Our experimental results show that the joint learning method obtains slightly better performance than either loss alone.

The main contributions of this paper are twofold: 1) We analyze in detail the triplet loss and the softmax loss for the person re-id task with different network architectures. We conclude that the triplet loss is suitable for smaller datasets with a shallow network architecture, while the softmax loss is suitable for large datasets with a relatively deeper network architecture. We also use the “center-loss” embedded softmax loss function in our experiments.

2) Based on the above observations, we combine the triplet loss with the softmax loss as a joint supervision, which yields slightly better performance. Experimental results on three widely used person re-id benchmark datasets illustrate the effectiveness of our proposed method.

The rest of the paper is organized as follows. Section 2 briefly reviews the related work. Section 3 introduces a shallow parts-based neural network under the joint supervision of the triplet comparison and softmax loss functions, together with its training algorithm. The experimental results, comparisons and analysis are presented in Section 4. Section 5 concludes the paper.

2 Related work

The unsolved problem of person re-id has drawn great attention and has become an important topic in visual surveillance in recent years. Most existing works on person re-id consist of two major components: extracting discriminative and reliable features to represent the query and gallery images, and learning a robust distance metric to measure the similarity between those features across images or subspaces. Research on person re-id usually focuses on one of these two aspects or on a combination of both, and systems in the literature use various combinations of features and distance metric learning approaches.

The feature representation based methods mainly focus on developing discriminative features that are robust to illumination, viewpoint and pose changes. The most useful features designed for the person re-id task include local binary patterns (LBP) [22, 23, 25, 28, 41, 52], color histograms and their variants [22, 23, 28, 41], color names [48], Gabor features [25], and other contextual or appearance features [3, 5, 6, 7, 25, 52, 57, 58, 59]. Other works investigate combining multiple visual cues or part-based features [4, 22, 25, 41].

Metric learning methods have also been well explored for the person re-id task in the past few years [30, 46]. These methods aim to learn a distance measure under which the features of the same person are closer than those of different persons. Representative works on metric learning for person re-id include [9, 22, 23, 26, 41]. All of these methods cast person re-id as an image retrieval task and embed ranking constraints in the model learning process, so that the learned metric is robust to intra-personal variations.

Recently, many computer vision tasks have achieved promising results using deep learning based methods [15, 24, 35, 44, 50, 51], and many deep convolutional neural network (DCNN) models have appeared for the person re-id task. As far as we know, most state-of-the-art results on person re-id benchmark datasets such as i-LIDS and CUHK03 are obtained by these DCNN-based methods [44]. One main advantage of using CNNs for re-id is the ability to combine the separate steps of the traditional re-id pipeline into an end-to-end learning procedure, enabling automatic interactions among feature extraction, feature transformation and similarity measurement. In the following, we describe some representative works closely related to ours. The triplet network has been successfully used in fine-grained image recognition [40], face recognition [33], and person re-id [11]. In this paper, we analyze two widely used loss functions on the person re-id task, namely the softmax loss and the triplet loss. We find that the triplet loss is suitable for small datasets with a relatively shallow network architecture, while the softmax loss obtains better performance on larger datasets with a deeper network architecture. We then combine them to form a joint supervision that yields slightly better performance.

Thus, the joint triplet and identification loss function yields learned features with small intra-personal variations and large inter-personal variations. Specifically, the triplet loss pulls the feature vectors of the same individual closer and pushes the feature vectors of different individuals far apart, which effectively reduces the intra-personal variations of the learned features. The identification loss, in turn, learns discriminative identity-related features, which enlarges the inter-personal variations. Moreover, our method learns per-image descriptors and then compares them with a simple Euclidean distance. In Section 4, we compare some of the above-mentioned methods with our proposed method.

3 The proposed person re-id method

In this section, we present our proposed person re-id method. We first illustrate the overall framework, and then present the joint triplet distance comparison and identification loss function used to train our CNN model, together with its training method.

3.1 The overall framework

Motivated by work on face recognition, we propose a person re-id network architecture that both satisfies the triplet distance comparison objective and predicts each individual's identity. The network is therefore trained under the joint supervision of the triplet loss and the identification loss. The triplet loss makes the distance between matched pairs smaller than that between mismatched pairs in the learned feature space, which effectively reduces the intra-personal variations. Meanwhile, the identification cost helps to enlarge the inter-personal variations and to learn a more discriminative identity-related feature representation.

For the triplet loss, we denote \(t_{i} = \langle I_{i}, I_{i}^{+}, I_{i}^{-} \rangle\) as the i-th sample triplet, which is constructed from three input images: \(I_{i}\) and \(I_{i}^{+}\) are images of the same individual, while \(I_{i}^{-}\) is an image of a different person from \(I_{i}\). We project the triplet images \(t_{i}\) from the raw image space into the learned feature space through our proposed triplet CNN framework, whose three branches share the same parameter set \(\textbf{w}\), i.e., weights and biases. In the feature space, the sample triplet \(t_{i}\) is represented by \(f_{\textbf{w}}(t_{i}) = \langle f_{\textbf{w}}(I_{i}), f_{\textbf{w}}(I_{i}^{+}), f_{\textbf{w}}(I_{i}^{-}) \rangle\). Each CNN in Fig. 2 is our proposed multi-channel parts-based CNN model, which extracts both global full-body features and local body-part features. Because the CNN model is trained under the joint supervision of the triplet loss and the “center loss” embedded softmax loss function, the learned feature space has the following two properties. First, the distance between \(f_{\textbf{w}}(I_{i})\) and \(f_{\textbf{w}}(I_{i}^{+})\) is smaller than both the distance between \(f_{\textbf{w}}(I_{i})\) and \(f_{\textbf{w}}(I_{i}^{-})\) and the distance between \(f_{\textbf{w}}(I_{i}^{+})\) and \(f_{\textbf{w}}(I_{i}^{-})\). The triplet loss thus pulls images of the same individual closer and pushes images of different individuals far apart in the projected feature space. Second, the identification loss encourages the learned features to produce similar outputs for images of the same individual and dissimilar outputs for different individuals, which enlarges the inter-personal variations. We use the “center loss” embedded softmax loss function as the identification cost, which consists of the cross-entropy between the predicted label probabilities and the ground-truth identity labels together with the center-loss term.

Fig. 2

The joint training framework on small datasets with the shallow parts-based network. Each image of a constructed training triplet is fed into its corresponding network, and the three networks share the same parameter set. The network architecture is trained under the joint supervision of the triplet distance comparison and identification costs. The triplet distance comparison cost encourages the distance between matched pairs to be smaller than that between mismatched pairs in the learned deep CNN feature space. Meanwhile, the identification cost helps to learn more discriminative identity-related features with large inter-personal variations

As aforementioned, in our approach, we train a multi-channel parts-based CNN with joint triplet and identification loss function. The joint learning objective can be described as follows,

$$ \underset{\textbf{w}}{\text{argmin}}\ L = \frac{1}{T}\sum\limits_{i=1}^{T} {\Gamma}(t_{i};\textbf{w}) + \lambda \ell(I_{j},y_{j};\textbf{w},{\Theta},\textbf{c}). $$
(1)

The first term \({\Gamma}(t_{i};\textbf{w})\) in (1) is the triplet loss, and T is the total number of constructed triplet training examples. The second term \(\ell(I_{j},y_{j};\textbf{w},{\Theta},\textbf{c})\) is the identification cost, for which we use the softmax cross-entropy with the center loss embedded. Here \(y_{j}\) is the identity label of the corresponding image \(I_{j}\), \(\textbf{w}\) denotes the network parameters, \({\Theta}\) is the softmax weight matrix, and \(\textbf{c}\) denotes the center-loss parameters learned during training; see [42] for more details. λ is a parameter that balances the triplet loss and the identification loss.
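As a minimal sketch of how the two terms in (1) combine, the following Python snippet averages per-triplet losses and adds the weighted identification cost. The function name `joint_objective` and the toy values are hypothetical and serve only as an illustration; the per-triplet losses and the identification loss are assumed to have been computed elsewhere.

```python
import numpy as np

def joint_objective(triplet_losses, identification_loss, lam=0.01):
    """Eq. (1): mean of the T per-triplet losses plus lambda times the identification cost.

    triplet_losses: sequence of T per-triplet loss values (assumed precomputed).
    identification_loss: scalar identification cost (assumed precomputed).
    lam: balancing parameter lambda.
    """
    triplet_losses = np.asarray(triplet_losses, dtype=float)
    return triplet_losses.mean() + lam * identification_loss

# Toy values, chosen only for illustration.
total = joint_objective([0.5, 1.5, 1.0], identification_loss=2.0, lam=0.01)
triplet_only = joint_objective([0.5, 1.5, 1.0], identification_loss=2.0, lam=0.0)
```

With λ = 0 the objective reduces to the mean triplet loss, and larger λ gives the identification cost more weight.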

In the following section, one shallow parts-based CNN architecture, the triplet loss, identification loss function, and its joint training process will be introduced in detail, separately.

3.2 One shallow parts-based CNN model

We first introduce a shallow parts-based CNN model, which obtains promising performance on relatively small datasets. As shown in Fig. 2, our network architecture contains four types of layers: one global convolutional layer, four body-part convolutional layers, five channel-wise fully connected layers, and one network-wise fully connected layer. The global convolutional layer is the first layer of the network; it consists of 64 feature maps with a convolutional kernel of 7 × 7 × 3 and a stride of 3. The output of this layer is divided into four equal parts \(P_{i}, i \in \{1,\ldots,4\}\), and each part is fed to its own convolutional layer whose parameters have size 32 × 3 × 3 × 64. These are followed by one fully connected layer for each part and one fully connected layer over all four parts, which produces the full-body features. We then concatenate the body-part features and the full-body features into one vector. Finally, the last layer is a fully connected layer whose dimension is the same as that of the previous one. We use the output of this last fully connected layer as the input to the triplet and identification loss functions. In this way, our network architecture uses the full-body and body-part features together.

3.3 Introduction of the asymmetric triplet loss function

As described in Section 3.1, we use the joint triplet and softmax loss function to train the network model. We first introduce the triplet loss. Given a triplet \(t_{i} = \langle I_{i}, I_{i}^{+}, I_{i}^{-} \rangle\), the CNN model projects \(t_{i}\) into the learned feature space as \(f_{\textbf{w}}(t_{i}) = \langle f_{\textbf{w}}(I_{i}), f_{\textbf{w}}(I_{i}^{+}), f_{\textbf{w}}(I_{i}^{-}) \rangle\). The similarities between the triplet images \(I_{i}, I_{i}^{+}, I_{i}^{-}\) are measured by the Euclidean distances among \(f_{\textbf{w}}(I_{i}), f_{\textbf{w}}(I_{i}^{+}), f_{\textbf{w}}(I_{i}^{-})\). The triplet loss function requires that the distances between the different-person pairs \((f_{\textbf{w}}(I_{i}), f_{\textbf{w}}(I_{i}^{-}))\) and \((f_{\textbf{w}}(I_{i}^{+}), f_{\textbf{w}}(I_{i}^{-}))\) be larger than that of the same-person pair \((f_{\textbf{w}}(I_{i}), f_{\textbf{w}}(I_{i}^{+}))\) by a large margin, and we use the following inequalities to enforce these requirements:

$$\begin{array}{@{}rcl@{}} d(f_{\textbf{w}}(I_{i}),f_{\textbf{w}}(I_{i}^{+})) - d(f_{\textbf{w}}(I_{i}),f_{\textbf{w}}(I_{i}^{-})) \leq{\tau}. \end{array} $$
$$\begin{array}{@{}rcl@{}} d(f_{\textbf{w}}(I_{i}),f_{\textbf{w}}(I_{i}^{+})) - d(f_{\textbf{w}}(I_{i}^{+}),f_{\textbf{w}}(I_{i}^{-})) \leq{\tau}. \end{array} $$

In order to merge the above requirements into one constraint, we have simplified it as follows,

$$\begin{array}{@{}rcl@{}} d^{n}(I_{i},I_{i}^{+},I_{i}^{-};\textbf{w}) &=& d(f_{\textbf{w}}(I_{i}),f_{\textbf{w}}(I_{i}^{+}))\\ & & - \frac{1}{2} [d(f_{\textbf{w}}(I_{i}),f_{\textbf{w}}(I_{i}^{-}))+d(f_{\textbf{w}}(I_{i}^{+}), f_{\textbf{w}}(I_{i}^{-}))] \leq{\tau}. \end{array} $$
(2)

In (2), τ is the margin parameter, which is negative. The equation shows that the triplet loss function requires the distance between images of the same individual to be smaller, by a large margin, than the average of the two distances to the image of the different individual. Using both of these different-person distances improves the training of the triplet loss function, and can be seen as a form of data augmentation that further improves CNN training. With this triplet loss function, we can effectively reduce the intra-personal variations for the person re-id task.

In summary, the asymmetric triplet loss function is defined as follows:

$$ \frac{1}{T}\sum\limits_{i=1}^{T} {\Gamma}(t_{i};\textbf{w}) = \frac{1}{T}\sum\limits_{i=1}^{T} \max\{d^{n}(I_{i},I_{i}^{+},I_{i}^{-};\textbf{w}),\tau\} $$
(3)

where T is the total number of sample triplets. In our experiments, the distance measure d(·,·) is the squared Euclidean distance,

$$ d(f_{\textbf{w}}(I_{i}),f_{\textbf{w}}(I_{i}^{+})) = ||f_{\textbf{w}}(I_{i})-f_{\textbf{w}} (I_{i}^{+})||^{2}. $$
(4)
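The definitions in (2), (3) and (4) can be sketched in a few lines of NumPy. This is an illustrative forward computation only, not the Caffe loss layer used in our experiments; the function names are hypothetical, and the default margin `tau=-1.0` is an assumption chosen to be negative, as the text requires.

```python
import numpy as np

def sq_dist(a, b):
    """Squared Euclidean distance of Eq. (4), computed row-wise."""
    return np.sum((a - b) ** 2, axis=-1)

def asymmetric_triplet_loss(f_anchor, f_pos, f_neg, tau=-1.0):
    """Eqs. (2)-(3): mean over T triplets of max{d^n, tau}.

    f_anchor, f_pos, f_neg: arrays of shape (T, d) holding the features
    f_w(I_i), f_w(I_i^+) and f_w(I_i^-). tau is an assumed negative margin.
    """
    d_pos = sq_dist(f_anchor, f_pos)
    # Average of the two different-person distances, as in Eq. (2).
    d_neg = 0.5 * (sq_dist(f_anchor, f_neg) + sq_dist(f_pos, f_neg))
    d_n = d_pos - d_neg
    return np.mean(np.maximum(d_n, tau))

# Two toy triplets in a 2-D feature space (values are illustrative only).
anchors = np.array([[0.0, 0.0], [0.0, 0.0]])
positives = np.array([[1.0, 0.0], [2.0, 0.0]])
negatives = np.array([[3.0, 0.0], [1.0, 0.0]])
loss = asymmetric_triplet_loss(anchors, positives, negatives, tau=-1.0)
```

In the first toy triplet the constraint is already satisfied (d^n = -5.5), so its term is clipped to τ; in the second the negative lies closer than the positive (d^n = 3), so it contributes a larger loss.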

3.4 The “center-loss” embedded softmax cost function

The goal of the softmax loss function is to make the learned features produce similar outputs for images of the same person and dissimilar outputs for different people. Using this identification loss, we train the CNN models to learn discriminative identity-related features with large inter-personal variations, an approach that has also been adopted in face recognition. Specifically, we use the recently proposed softmax cross-entropy with “center loss” [42] embedding as the identification loss function.

Suppose the training dataset for person re-id consists of N images of K different identities. Let \(\{(I_{j}, y_{j})\}_{j=1}^{N}\) denote all the training examples for the identification loss, where \(I_{j}\) is the j-th image in the dataset and \(y_{j} \in \{1,2,\ldots,K\}\) is the identity of the corresponding image \(I_{j}\). The identification loss employs a multi-class classification objective to classify each person image into one of the K identities. It is implemented by one fully connected layer (N-fc) with a K-way softmax layer following the final feature representation layer, which outputs a probability distribution over the K classes. In addition, we use the “center loss” to further reduce the intra-class variations globally during training. The network is trained to minimize the cross-entropy loss with the “center loss” embedding, which we call the identification loss function. It is denoted as follows,

$$\begin{array}{@{}rcl@{}} \underset{\textbf{w},{\Theta},\textbf{c}}{\text{argmin}}& &\ell(I_{j},y_{j};\textbf{w},{\Theta},\textbf{c})\\ &= &\ell_{s} + \ell_{c}\\ &= &-\frac{1}{N} \sum\limits_{j=1}^{N} y_{j} \log (p(f_{\textbf{w}}(I_{j}),{\Theta})) + \frac{\beta}{2}\sum\limits_{j=1}^{N} ||f_{\textbf{w}}(I_{j}) - c_{y_{j}}||_{2}^{2}. \end{array} $$
(5)

\(\ell_{s}\) is the traditional softmax cross-entropy loss function, and \(\ell_{c}\) stands for the “center loss”. In the softmax loss, \(f_{\textbf{w}}(\cdot)\) is the feature extractor, \(y_{j}\) is the identity label, and \(p(\cdot)\) is the probability of the person's identity predicted by the classifier, which is defined as follows:

$$ p(f_{\textbf{w}}(I);{\Theta}) = \frac{\exp ({\Theta}_{i}\cdot f_{\textbf{w}}(I))}{{\sum}_{k} \exp({\Theta}_{k} \cdot f_{\textbf{w}}(I))} $$
(6)

In (6), Θ is the softmax weight matrix, and \({\Theta}_{i}\) and \({\Theta}_{k}\) refer to the i-th and k-th columns of Θ, respectively. In the “center loss”, \(c_{y_{j}} \in R^{d}\) denotes the \(y_{j}\)-th class center of the deep features. This formulation effectively characterizes the intra-class variations, and \(c_{y_{j}}\) is updated as the CNN features change. Most importantly, this center objective can be optimized to obtain the global centers \(\textbf{c}\) during training. See [42] for the optimization method of the “center loss”.

This identification loss encourages the network to learn more discriminative identity-related features (i.e., features with large inter-personal variations), which can help correctly classify all the classes simultaneously.
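To make the two parts of the identification loss concrete, the following NumPy sketch evaluates a softmax cross-entropy term and a center term for a small batch. It is an illustration under stated assumptions rather than the Caffe implementation: the function name, the default `beta=0.5`, and the zero-based labels are hypothetical choices, and the center term is computed on the deep features \(f_{\textbf{w}}(I_{j})\), in line with the description of \(c_{y_{j}}\) as a class center of the deep features and with [42].

```python
import numpy as np

def identification_loss(features, labels, theta, centers, beta=0.5):
    """Softmax cross-entropy plus center term, in the spirit of Eqs. (5)-(6).

    features: (N, d) deep features f_w(I_j).
    labels:   (N,) integer identities in [0, K) (zero-based, an assumption).
    theta:    (d, K) softmax weight matrix, one column per identity.
    centers:  (K, d) class centers c.
    beta:     weight of the center term (hypothetical default).
    """
    logits = features @ theta
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = features.shape[0]
    # Cross-entropy: negative log-probability of the true identity, averaged over N.
    softmax_term = -log_probs[np.arange(n), labels].mean()
    # Center term: beta/2 times the summed squared distances to the class centers.
    center_term = 0.5 * beta * np.sum((features - centers[labels]) ** 2)
    return softmax_term + center_term

# Toy check: one 2-D feature, two identities, all-zero classifier weights.
feats = np.array([[1.0, 0.0]])
loss = identification_loss(feats, np.array([0]), np.zeros((2, 2)), np.zeros((2, 2)), beta=0.5)
```

With all-zero weights the classifier is uniform over the two identities, so the cross-entropy part equals log 2, and the center part equals 0.25 for this feature.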

The objective function in (1) consists of two terms. The first is the proposed asymmetric triplet loss function. The second, the identification term, is implemented by the K-way softmax function with “center loss” embedding, which can be obtained directly from the Caffe softmax loss layer and from [42] for the “center loss”. For the optimization process, we therefore focus mainly on the asymmetric triplet loss function. We use the stochastic gradient descent algorithm to jointly train the proposed CNN architecture with the asymmetric triplet loss and the identification loss.

Remark

In our experiments, we use the parts-based shallow network introduced above on the two small datasets, i-LIDS and PRID2011, and the recent DGD network architecture [44] on the large CUHK03 dataset, where DGDNet [44] is based on a reduced GoogLeNet [36]. We did not use the deep model on the small datasets or the shallow network on the large dataset, because these two cases can lead to over-fitting and under-fitting, respectively.

4 Experiments

4.1 Setup

Data augmentation

To increase the number of training samples and to alleviate over-fitting, we use data augmentation. In the experiments, we randomly crop an 80 × 230 pixel region from the original 100 × 250 pixel image during training. Applying this random cropping in each iteration yields a substantial performance improvement, especially on relatively small datasets. We use this setting to train the part-based CNN model.
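The random cropping step can be sketched as follows. This is a minimal NumPy illustration, not our Caffe data layer; the function name is hypothetical, and it assumes that the stated sizes are width × height, so that a 100 × 250 image is stored as an array of shape (250, 100, 3).

```python
import numpy as np

def random_crop(image, crop_h=230, crop_w=80, rng=None):
    """Randomly crop a (crop_h, crop_w) region from an (H, W, C) image array."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # The upper bound of rng.integers is exclusive, so +1 allows the last valid offset.
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

# Dummy image standing in for a 100 x 250 pedestrian image.
image = np.zeros((250, 100, 3), dtype=np.uint8)
crop = random_crop(image, rng=np.random.default_rng(0))
```

Drawing a new offset in every iteration means the network sees a slightly different view of each training image across epochs.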

Setting training parameters

For the CNN model, we initialize the weights from a zero-mean Gaussian distribution with a standard deviation of 0.01, and the bias terms are initialized to zero. In the experiments, the parameters τ and λ are set to 1.0 and 0.01, respectively. To generate the triplet inputs for our proposed framework, as illustrated in Fig. 3, for each image we select its matched reference from the same class and randomly select the mismatched one from the remaining classes.
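The triplet construction described above can be sketched as follows. The function name and the fixed seed are hypothetical, and drawing the matched reference at random from the same class is an assumption; the text only states that it comes from the same class.

```python
import random
from collections import defaultdict

def sample_triplets(labels, seed=0):
    """Return (anchor, positive, negative) index triplets, one per usable image.

    labels: list of identity labels, one per training image.
    Images whose identity has no second image are skipped.
    """
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_id[lab].append(idx)

    triplets = []
    for idx, lab in enumerate(labels):
        positives = [j for j in by_id[lab] if j != idx]
        negatives = [j for j, other in enumerate(labels) if other != lab]
        if not positives or not negatives:
            continue
        # Matched reference from the same class, mismatched one from the other classes.
        triplets.append((idx, rng.choice(positives), rng.choice(negatives)))
    return triplets

# Toy identity labels: two images each for identities 0 and 1, one for identity 2.
toy_labels = [0, 0, 1, 1, 2]
toy_triplets = sample_triplets(toy_labels, seed=0)
```

In this toy example the single image of identity 2 produces no triplet as an anchor, because it has no matched reference, but it can still serve as a mismatched image for the others.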

Fig. 3

Illustration of the triplet training examples. The images in each red bounding box form a triplet training example: the first two images are the positive pair, and the third is the mismatched reference

Implementation details

We implemented our model on the Caffe framework, with only the data layer and the loss layer replaced. The identification cost is implemented with the softmax layer provided by Caffe, and the “center loss” follows [42]. We trained the network on a Tesla K40 GPU with 12 GB of memory.

Datasets

Three widely used person re-id benchmark datasets are used for performance evaluation in our experiments, namely i-LIDS, PRID2011 and CUHK03. Each dataset contains many different individuals, each of whom has several images captured by various non-overlapping cameras. The datasets are described in detail below:

i-LIDS dataset: It includes 479 images of 119 different individuals, captured in a busy airport hall. There are on average 4 images per individual, many of which contain occlusions and illumination changes.

PRID2011 dataset: This dataset contains 1134 persons, but only 200 individuals appear in both camera views. Camera A captures 385 individuals, while camera B captures 749 individuals. In the testing phase, we use only 100 person pairs to construct the gallery and probe images for evaluation.

CUHK03 dataset: This dataset consists of 13164 images of 1360 different individuals. The images are captured by six different surveillance cameras in a campus scenario. It is one of the largest and most challenging datasets for person re-id, and contains large pose and illumination variations.

Evaluation protocol

On the i-LIDS dataset, our experiments follow the evaluation protocol in [11]; on the PRID2011 and CUHK03 datasets, we use the images of 100 identities for testing. In the testing phase, we randomly select two images from two different camera views; one is added to the gallery set and the other to the probe set. We match each probe image against every image in the gallery set using the projected CNN features, and rank the gallery images according to the Euclidean distances between the features. Finally, the widely used CMC metric [11] is adopted for the quantitative evaluation in our experiments.
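For illustration, the ranking and CMC computation described above can be sketched with NumPy as follows. The function name and the toy features are hypothetical, and the sketch assumes a single-shot setting in which each probe identity has one matching gallery image; it is not the exact evaluation code of [11].

```python
import numpy as np

def cmc_curve(probe_feats, gallery_feats, probe_ids, gallery_ids, max_rank=20):
    """Cumulative matching characteristic: fraction of probes matched within rank r."""
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    # Pairwise squared Euclidean distances, shape (num_probe, num_gallery).
    diff = probe_feats[:, None, :] - gallery_feats[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    order = np.argsort(dist, axis=1)          # gallery indices sorted by distance
    matches = gallery_ids[order] == probe_ids[:, None]
    first_hit = matches.argmax(axis=1)        # zero-based position of first correct match
    has_match = matches.any(axis=1)
    ranks = np.arange(max_rank)
    hits = (first_hit[:, None] <= ranks[None, :]) & has_match[:, None]
    return hits.mean(axis=0)

# Toy 2-D features for three identities (values are illustrative only).
gallery = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])
probes = np.array([[1.0, 0.0], [19.0, 0.0], [12.0, 0.0]])
cmc = cmc_curve(probes, gallery, probe_ids=[0, 1, 2], gallery_ids=[0, 1, 2], max_rank=3)
```

In this toy example only the first probe is matched at rank 1, while the other two find their correct match at rank 2, so the curve rises from 1/3 to 1.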

4.2 Experimental evaluations

As mentioned above, the main contributions of this paper are as follows: 1) we conclude that the triplet loss is suitable for smaller datasets with a shallow network architecture, while the softmax loss is suitable for large datasets with a relatively deeper network architecture; 2) to further improve re-id performance, we use the joint supervision of the triplet loss and the identification loss to optimize the CNN model. To support these conclusions, we train the part-based shallow network on the small datasets with the triplet loss and with the identification loss separately, and likewise train the relatively deeper network on the large dataset with each of the two losses. We then compare their performance. Finally, we train these CNN models with the joint triplet and identification loss. To reveal how much each component of the joint training contributes to the performance improvement, we implemented the following three variants of the person re-id method, and we compare our proposed method with many representative algorithms in the recent literature:

  • Variant 1 (denoted as OursT): We train the CNN models only with the asymmetric triplet loss function.

  • Variant 2 (denoted as OursI): We train the CNN models only with the “center-loss” embedded softmax loss function, which is named as the identification loss, as our baseline.

  • Variant 3 (denoted as OursTI): We train the CNN models with the joint asymmetric triplet and identification loss function.

To evaluate the effectiveness of the “center loss” embedded softmax loss function, we also ran experiments with and without the “center loss” on the CUHK03 dataset using the deep DGDNet. Embedding the “center loss” into the softmax loss function yields a 1.6% performance improvement.

Tables 1, 2 and 3 show the experimental results on the three person re-id benchmark datasets, respectively. We use the widely adopted CMC metric to measure our experimental results, including the rank 1, 5, 10, 15 and 20 accuracies. We compare our method with several representative methods on each dataset, mainly the related deep learning based methods, such as Ding's method [11], FPPN [27], DeepM [49], mFilter [54] and Ejaz's [1], and also some traditional metric learning based methods, including Sakrapee's method [30] and mFilter+LADF [54]. The tables show that our method obtains promising results on the person re-id benchmark datasets.

Table 1 Experimental evaluations on i-LIDS dataset
Table 2 Experimental evaluations on PRID2011 dataset
Table 3 Experimental results on CUHK03 dataset with the DGDNet

Compared with the representative methods listed from the recent literature, our proposed method OursTI achieves the top performance on the three benchmark datasets on the six compared measurements. We can draw the following observations from the three tables. Comparing the identification (center loss embedded softmax) loss with the asymmetric triplet loss, the triplet loss performs better than the identification loss on the i-LIDS and PRID2011 datasets, while the identification loss performs better than the triplet loss on the large CUHK03 dataset. Moreover, when we train our network under the joint supervision of the triplet and identification losses, a further 0.7–4% performance improvement is achieved.

4.3 Balancing the comparison and identification cost

We investigate the interaction of the triplet distance comparison cost and the identification cost in feature learning by varying λ from 0 to \(+\infty \). At λ = 0, the identification signal vanishes and only the triplet comparison signal takes effect. As λ increases, the identification signal gradually dominates the training process. At the other extreme, \(\lambda \rightarrow +\infty \), only the identification loss remains. Fig. 4 shows the top-1 matching results for various settings of λ. It clearly illustrates that neither the identification cost nor the triplet comparison cost alone is optimal for learning features; instead, effective features come from an appropriate combination of the two. As an aside, we found that integrating the identification cost with the triplet distance comparison loss speeds up convergence. When we train the network with only the triplet distance comparison cost, it needs 25k iterations to converge, while joint training needs only 15k iterations. To train our model efficiently, we first train it with only the softmax loss as an initialization and then train it jointly; we found that this helps the training of the triplet framework.

Fig. 4

Top 1 matching accuracy by varying the weighting parameter λ on PRID2011 dataset for person re-id

4.4 Performance analysis of the two loss functions on the person re-id task

Tables 1 and 2 show that the triplet loss obtains better performance than the softmax loss on the small datasets. On the i-LIDS dataset, with only 69 training identities and 276 images, the triplet loss performs 24.9% better than the softmax loss. On the PRID2011 dataset, with 934 training images of 100 paired identities and some other individuals, the performance gap between the triplet loss and the softmax loss is just 1 percent. We therefore observe that the triplet loss works better than the softmax loss on relatively smaller datasets. For the larger CUHK03 dataset, with 1367 identities and 11367 images for training, we use the DGDNet, which is deeper than our parts-based model. As shown in Table 3, the softmax loss clearly outperforms the triplet loss with this deeper network architecture on the large dataset. Note also that CUHK03 contains about 10 images per person on average, which is another reason why the softmax loss works well on this large dataset: because the softmax loss treats the task as a classification problem, more training data per class benefits its training. This is also consistent with Niall's [29] and Yan's [47] works, which used the softmax loss for video-based person re-id. In video-based person re-id datasets, there are many more training images per identity, which makes them well suited to training with the softmax loss (Fig. 5).

Fig. 5

The ranking results on three benchmark datasets, where the big green rectangle marks the top 5 results for the query image on the left, and the images in the small red rectangles are the matched references for the query image

5 Conclusion

In this paper, we analyze two widely used loss functions, namely the softmax loss and the triplet loss, on datasets of different sizes with different network architectures. We find that the triplet loss is better suited to shallow networks on small datasets, while the softmax loss works better on large datasets with deeper network architectures. We then propose a new algorithm with the joint supervision of the asymmetric triplet loss function and the center loss embedded softmax cost function. In this framework, the CNN architecture is trained with two supervision signals. The asymmetric triplet cost aims to produce features that pull instances of the same person closer and push instances of different persons far apart in the learned feature space. Meanwhile, the softmax cost helps to learn more discriminative identity-related features with large inter-personal variations. Experimental results show that the joint supervision obtains slightly better performance than either loss alone, and our model achieves promising results on three widely used benchmark datasets.