1 Introduction

In recent years, person re-identification (re-ID) has become increasingly popular due to its important applications in many real-world scenarios such as video surveillance [2], robotics [23] and automated driving. Person re-ID has achieved steady improvements in recent years thanks to the considerable progress of deep learning techniques. It can be seen as an image matching problem [5]: the challenge is to match two images of the same person coming from non-overlapping camera views [7]. Despite the advances in this field, many challenges remain, such as complex environments, various body poses [4], occlusion [10], background variations [6, 20], illumination variations [11] and different camera views [13]. Two images of the same person can present many differences related to background variation, while two different persons can be captured against the same background.

Recent re-identification approaches are generally based on deep learning techniques. They learn Convolutional Neural Networks (CNN) to extract global features from the whole image, without differentiating the different parts of the persons to be identified. Some recent works in the literature have shown that it is critically important to examine multiple highly discriminating local regions of the person's images to deal with large variances in appearance. The authors in [24] proposed a network that generates saliency maps focusing on one part and suppressing the activations on other parts of the image. In [21], the authors introduced a gating function to selectively underline such fine common local patterns by comparing the mid-level features across pairs of images. The network proposed in [3] is a multi-channel convolutional network model composed of one global convolutional layer which is divided into four equal local convolutional layers. The authors in [18] proposed a part-aware model to extract features from four body parts and generate four part vectors. [15] presented a multi-scale context-aware network to learn features from the full body and body parts in order to capture the local context knowledge. The work [12] proposed a multi-stream network that fuses feature distances extracted from the whole image, three body regions with efficient alignment and one segmented image. Besides the use of global and local features proposed by the papers cited above, we showed in [6] that person re-identification performance can be improved by background subtraction. The authors in [20] proposed a person-region guided pooling deep neural network based on human parsing maps in order to avoid the background bias problem. In addition to the way of extracting the characteristics from the image, some work has been devoted to the re-ranking process by considering person re-identification as a retrieval process. The works presented in [17, 27] demonstrated that the re-ranking process can significantly improve the re-ID accuracy of multiple baseline methods.

In this work, we propose a multi-stream person re-ID method that combines multiple body-part streams to capture complementary information. We first extract person body parts using a CNN-based semantic segmentation network. Second, we fuse the predicted confidence scores of multiple body-part stream re-ID CNNs. To extract visual similarities from the different body parts, four streams are combined: the first one processes the original image to extract global features, while the second stream exploits a background-subtracted image to avoid the background bias problem. The third and fourth streams take as input segmented images containing, respectively, the middle and down parts of the body, in order to focus on local features. To evaluate our proposed method, we conducted experiments on the large benchmark dataset Market-1501 [25].

2 Overview of the Proposed Method

In this section, we provide the details of the proposed method. Figure 1 shows a flowchart of the whole framework. Our approach can be divided into three separate stages. The first stage is the semantic segmentation, which consists in extracting three body-part images and a background-free image from the original image. The second stage is the feature extraction and re-ID on the different streams, which takes as input four images of the same person (the original image and three segmented images) and yields as output four similarity measure vectors. The third stage is the combination of the four stream outputs.

Fig. 1. Overview of the proposed multi-stream feature similarity fusion method. (Images used are from the Market-1501 dataset.)

2.1 Segmentation Step

Using segmentation in this work has two advantages. First, it reduces the effect of background variation due to the person's posture and the multitude of cameras. In some cases, two images of the same person taken by two different cameras present difficulties caused by background variation: the first image may have a light background while the second has a dark one, and even the texture can differ. We deal with this problem through background subtraction. Second, segmentation allows us to extract information from local parts of the image that is complementary to the global information extracted from the original image. Rather than using boxes to detect the body parts [12], we use semantic segmentation to remove the background and to create specific body parts. The goal is to extract information only from the person's body and not from the background, which is still present inside a box and can influence the re-ID decision [20].

Our segmentation network (SEG-CNN) is inspired by the work proposed in [16]. SEG-CNN is a deep residual network based on ResNet-101 with atrous spatial pyramid pooling (ASPP) in the output layer, which improves the segmentation and makes it more robust. Moreover, two Res-5 convolution blocks follow in order to generate the context used in the refinement phase. A structure-sensitive loss and a segmentation loss are also used to compute the similarity between the ground truth and the output image. Unlike [16], which segments the person into 19 body parts and predicts the pose, we changed the network architecture to obtain only 3 body parts: top, middle and down (see Fig. 3). From one person image, we obtain three different images, each containing one part of the person's body. In addition, we use these 3 parts to create a binary mask which is used to remove the background from the original image.
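To make the hand-off between segmentation and re-ID concrete, the sketch below shows one way the four stream inputs could be derived from a SEG-CNN label map. This is a minimal NumPy illustration; the array layout and the label convention (0 = background, 1/2/3 = top/middle/down) are assumptions, not the exact pipeline of [16].

```python
import numpy as np

def split_streams(image, part_map):
    """Derive the four stream inputs from an image and its SEG-CNN label map.

    Assumed conventions (illustrative only):
      image    -- H x W x 3 uint8 array
      part_map -- H x W int array: 0 = background, 1 = top, 2 = middle, 3 = down
    """
    foreground = part_map > 0                     # binary mask over the whole body
    no_bk = image * foreground[..., None]         # background-subtracted image
    mid = image * (part_map == 2)[..., None]      # middle body part only
    dwn = image * (part_map == 3)[..., None]      # down body part only
    return image, no_bk, mid, dwn                 # Full, No_bk, Mid, Dwn inputs
```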

Since the Market-1501 dataset is not annotated for segmentation, we used SEG-CNN trained on the Look Into Person (LIP) dataset [16], which is designed for human parsing, to segment the Market-1501 dataset. The results on Market-1501 are not satisfactory, as multiple images are over-segmented (see Fig. 2, right). To overcome this problem, we visually selected the well-segmented images to create a Market-1501 sub-dataset, which is then used to fine-tune our SEG-CNN.

Thanks to the fine-tuning, the segmentation results with SEG-CNN were considerably improved compared to the segmentation results obtained when the network was trained on only the LIP dataset. Figure 2 shows the improvement in the segmentation of SEG-CNN after it was fine-tuned on the Market-1501 sub-dataset.

Fig. 2. Qualitative comparison between the segmentation produced by SEG-CNN trained only on the LIP dataset (right) and by SEG-CNN fine-tuned on the Market-1501 sub-dataset (left).

Fig. 3. Sample of the semantic segmentation produced by SEG-CNN (from right to left: original image, down part, middle part and top part).

2.2 Multi-stream Feature Extraction Method

In this work, we propose to combine global and local features to enhance re-ID performance. Our multi-stream feature extraction module is composed of four branches, as shown in Fig. 1. The first branch processes the whole image to obtain global features (called the Full stream). The second branch processes the background-subtracted image (called the \(No\_bk\) stream) to focus only on the body, which allows us to deal with the background variation problem. For local features, we use the last two branches, each of which focuses on a segmented image (called the Mid and Dwn streams, for the middle and down body parts respectively). We found empirically that the top body part does not contribute to the improvement of the final fusion score, since it only contains the person's face, which is unclear due to the low resolution of the images.

Given a probe image p and a gallery set G with N images, \(G = \{g_1, g_2, ... , g_N \}\), the output of the first feature extractor network is a vector \(V^1 = \{ S^1_1, S^1_2, ... , S^1_N \}\) of N similarity measures \(S^1_i\) computed between the probe image p and each image \(g_i\) of the gallery set. In the same way, the outputs of the three other feature extractor networks are three vectors (\(V^2\), \(V^3\) and \(V^4\)) that contain the similarities between the segmented probe image \(p^*\) and each segmented image \(g^*_i\) in the gallery set.
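As an illustration, each similarity vector could be computed as cosine similarities between the probe embedding and the gallery embeddings. This is a sketch: the paper does not specify the similarity measure, so cosine similarity is an assumption.

```python
import numpy as np

def similarity_vector(probe_feat, gallery_feats):
    """Return V = {S_1, ..., S_N} for one probe against N gallery images.

    probe_feat    -- 1D feature vector of length D from one stream's extractor
    gallery_feats -- N x D matrix of gallery features from the same stream
    """
    p = probe_feat / np.linalg.norm(probe_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return g @ p  # cosine similarities; higher means more similar
```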

2.3 Similarity Scores Fusion

Several score fusion strategies have been introduced in the literature [1]. Two types of fusion can be distinguished: early fusion and late fusion. Early fusion operates at the feature level: the outputs of the unimodal analyses are fused before training. In late fusion, the unimodal analyses first produce distinct scores for every modality, and a final score is then computed by combining the outputs of the individual classifiers. Compared with early fusion, late fusion is easier to implement and often proves effective in practice.

In this work, we adopt a late fusion model to aggregate the outputs of the four feature extractor networks. To compute the final similarity measure \(V^{fus} = \{ S^{fus}_1, S^{fus}_2, ... , S^{fus}_N \}\), a weighted sum is used to combine the outputs of each stream feature extractor (see Eq. 1).

$$\begin{aligned} V^{fus}=V^1 \oplus V^2 \oplus V^3 \oplus V^4 \end{aligned}$$
(1)

where \(S^{fus}_i = \alpha \cdot S^{1}_i + \beta \cdot S^{2}_i + \sigma \cdot S^{3}_i + \gamma \cdot S^{4}_i\), with \(\alpha, \beta, \sigma, \gamma \in [0, 1]\) and \(\alpha + \beta + \sigma + \gamma = 1\).

To set the weighted-sum parameters (\(\alpha\), \(\beta\), \(\sigma\) and \(\gamma\)), we use a greedy search over the weight space. At each iteration, we fix three weights and vary the fourth by a fixed step, so that the sum always equals 1. In this way, we go through all possible combinations. A step of 0.05 was chosen.
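A minimal sketch of this weight search is given below. It enumerates the full 0.05 grid over the simplex, as described above; `rank1_accuracy` stands for a hypothetical validation-set evaluator and is not part of the paper.

```python
import itertools
import numpy as np

def search_fusion_weights(V, rank1_accuracy, step=0.05):
    """Grid search for (alpha, beta, sigma, gamma) summing to 1.

    V              -- list of four P x N score matrices, one per stream
    rank1_accuracy -- hypothetical callable scoring a fused matrix on validation data
    """
    ticks = np.arange(0.0, 1.0 + step / 2, step)
    best_w, best_acc = None, -1.0
    for a, b, s in itertools.product(ticks, repeat=3):
        g = 1.0 - (a + b + s)          # fourth weight forced so the sum equals 1
        if g < -1e-9:                  # skip infeasible combinations
            continue
        g = max(g, 0.0)
        fused = a * V[0] + b * V[1] + s * V[2] + g * V[3]
        acc = rank1_accuracy(fused)
        if acc > best_acc:
            best_w, best_acc = (a, b, s, g), acc
    return best_w, best_acc
```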

3 Experiments

We empirically evaluate the proposed method in this section. First, we introduce the datasets used, then we present the implementation details. Finally, we present the experimental results of the proposed approach.

3.1 Datasets

The LIP dataset [16] focuses on the semantic understanding of persons. It gives a more detailed comprehension of image contents thanks to the segmentation of a human image into multiple semantic parts. It comprises more than 50,000 annotated images with 19 semantic part labels (hair, head, left foot, right foot, etc.) and 16 body joints, captured from many viewpoints and with varied occlusions and background complexities.

Market-1501 [25] is a publicly available large-scale person re-ID dataset, used in this work to evaluate our proposed method. It is made up of 32,667 images of 1,501 persons, partitioned into 751 identities for training and 750 for testing. Images are taken by one low-resolution and five high-resolution cameras.

Three segmented datasets (images without background, images of the middle body part and images of the down body part) were created from the Market-1501 database; they are organized identically to Market-1501 and contain exactly the same number of images.

3.2 Implementation

Two neural network architectures were used for the feature extraction and re-ID step. First, we used a ResNet-50 [8] (baseline) pre-trained on the ImageNet dataset [19] and then trained separately on the different datasets (original Market-1501 dataset, Market-1501 dataset without background, middle-part and down-part datasets). The categorical cross-entropy loss was used to output a probability distribution over all identities for each image. Stochastic gradient descent (SGD) is used to minimize the loss function and update the network parameters. Moreover, we applied data augmentation to enlarge the training dataset using transformations such as shear, width shift, height shift and horizontal flip. We used \(90\%\) of the images for training and the remaining \(10\%\) for validation.
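Under the assumption that the implementation uses Keras (the augmentation names above match its `ImageDataGenerator` arguments), the augmentation and 90/10 split could look like the following sketch; the directory path, target size and parameter values are illustrative, not reported by the paper.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation ranges; the paper does not report the exact values.
datagen = ImageDataGenerator(
    shear_range=0.2,           # shear
    width_shift_range=0.1,     # width shift
    height_shift_range=0.1,    # height shift
    horizontal_flip=True,      # horizontal flip
    validation_split=0.1,      # 90% training / 10% validation
)

train_iter = datagen.flow_from_directory(
    "market1501/train",        # hypothetical path: one sub-folder per identity
    target_size=(256, 128), class_mode="categorical", subset="training")
val_iter = datagen.flow_from_directory(
    "market1501/train",
    target_size=(256, 128), class_mode="categorical", subset="validation")
```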

Fig. 4. Architecture of our S-CNN: two ResNet-50 networks individually extract feature vectors from two images, which are used to predict the identity of each input image and a binary same/different classification.

The second network is a siamese network (S-CNN) [26] that combines the verification and identification losses. This network is trained to minimize three cross-entropy losses jointly: one identification loss for each siamese branch and one verification loss. We show in Fig. 4 the architecture of our siamese network. S-CNN is a two-input network formed by two parallel ResNet-50 CNNs with shared weights and biases. The final fully connected layer of the ResNet-50 architecture is removed, and three additional convolutional layers and one square layer are added [8]. Each ResNet-50 branch is pre-trained beforehand on the corresponding dataset. S-CNN is then trained separately on the four datasets using alternated positive and negative pairs (details of S-CNN training can be found in our previous work [6]).
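A simplified sketch of this architecture is shown below (Keras is assumed for illustration). It keeps the shared ResNet-50 backbone, the two identification heads and the verification head, but replaces the three additional convolutional layers with a global-pooled embedding; the square layer is realized as an element-wise squared difference.

```python
from tensorflow.keras import Input, Model, layers
from tensorflow.keras.applications import ResNet50

NUM_IDS = 751  # Market-1501 training identities

# One backbone called twice = two branches with shared weights and biases.
backbone = ResNet50(include_top=False, pooling="avg", weights="imagenet")

x1, x2 = Input((256, 128, 3)), Input((256, 128, 3))
f1, f2 = backbone(x1), backbone(x2)

id_head = layers.Dense(NUM_IDS, activation="softmax", name="id")
id1, id2 = id_head(f1), id_head(f2)            # one identification loss per branch

square = layers.Lambda(lambda t: (t[0] - t[1]) ** 2)([f1, f2])  # square layer
verif = layers.Dense(2, activation="softmax", name="verif")(square)  # same/different

scnn = Model([x1, x2], [id1, id2, verif])
scnn.compile(optimizer="sgd",
             loss=["categorical_crossentropy"] * 3)  # three joint cross-entropy losses
```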

To evaluate the performance of our re-identification method, we use two common evaluation metrics: the cumulative matching characteristics (CMC) at rank-1, rank-5 and rank-10 and mean average precision (mAP).
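For reference, both metrics can be computed directly from a similarity matrix, as in the following sketch. This is a simplified version of the standard protocol; the official Market-1501 evaluation additionally filters junk and same-camera images, which is omitted here.

```python
import numpy as np

def cmc_and_map(scores, probe_ids, gallery_ids, ks=(1, 5, 10)):
    """CMC at the given ranks and mAP from a P x N similarity matrix."""
    cmc = np.zeros(len(ks))
    aps = []
    for i, pid in enumerate(probe_ids):
        order = np.argsort(-scores[i])            # most similar gallery image first
        matches = gallery_ids[order] == pid
        hits = np.flatnonzero(matches)            # ranks of the correct matches
        if hits.size == 0:
            aps.append(0.0)
            continue
        for j, k in enumerate(ks):
            cmc[j] += hits[0] < k                 # first correct match within top k
        precisions = (np.arange(hits.size) + 1) / (hits + 1)
        aps.append(precisions.mean())             # average precision for this probe
    return cmc / len(probe_ids), float(np.mean(aps))
```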

3.3 Results

We provide in this section the empirical results of the proposed method, and we show activation maps to demonstrate that using segmented images pushes the network to extract local features.

Extracting features from only the whole body can miss some significant information (see Fig. 5). The active region in the \(No\_bk\) stream is larger than the active region in the Full stream, which can cover more discriminative features. The activation map of the Dwn stream clearly shows that the use of segmented images allows us to exploit regions of the image that are not activated in the global stream.

Fig. 5. Samples of activation maps from our four streams for the same image. From left to right: the activation maps of the Full, \(No\_bk\), Dwn and Mid streams.

Table 1. Fusion results obtained for different fusion methods on Market-1501 dataset.

In the literature, there are several score fusion methods that can be used in the context of our work [14]. In order to choose the best one, a comparative study was made. The fusion methods tested are: max rule, sum rule, product rule, accuracy-weighted sum and greedy weighted sum. We display in Table 1 the results of this study. Table 1 shows that the weighted sum using the greedy search technique gives the highest accuracy for both the S-CNN and ResNet-50 networks. Since we have four streams with fairly different recognition rates, the weighted sum makes it possible to penalize the streams with low precision and favor those with high precision. With the greedy search technique, we can find suitable weights for our recognition model.
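The rules compared in Table 1 reduce to simple element-wise operations over the stacked stream scores, as in the sketch below (with `V` as a 4 x N array of per-stream similarity scores for one probe; the function name is illustrative).

```python
import numpy as np

def fuse_scores(V, rule="sum", weights=None):
    """Fuse a 4 x N array of per-stream similarity scores into one N-vector."""
    if rule == "max":
        return V.max(axis=0)        # max rule
    if rule == "sum":
        return V.sum(axis=0)        # sum rule
    if rule == "product":
        return V.prod(axis=0)       # product rule
    if rule == "weighted":
        return weights @ V          # accuracy- or greedy-weighted sum
    raise ValueError(f"unknown rule: {rule}")
```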

We show in the top of Table 2 the results of our two re-ID networks for the four streams (Full, \(No\_bk\), Mid and Dwn). We notice that S-CNN provides better results except for the Mid stream, where S-CNN is slightly worse than the baseline. For both S-CNN and ResNet-50, the best performance is obtained for the Full stream, which extracts information from the whole image.

Table 2. Fusion results of the multi-stream outputs on Market-1501 dataset.

In the bottom of Table 2, we present the combination results of the four stream re-ID model outputs using the weighted sum method. A greedy search method is used to get the best results on the validation dataset. The weight setting consists in choosing the best parameters \(\alpha\), \(\beta\), \(\sigma\) and \(\gamma\) (the combination coefficients of the Full, \(No\_bk\), Mid and Dwn streams respectively), where \(\alpha\) + \(\beta\) + \(\sigma\) + \(\gamma\) = 1.

Table 2 shows that the best combination is obtained when all four streams are activated, and that S-CNN always provides better results than the baseline across the different combinations. With S-CNN, we reach \(87.02\%\) rank-1 accuracy, and with ResNet-50 we reach \(83.87\%\).

The obtained results show that the combination of multiple streams significantly improves re-ID performance. Compared to the Full stream results, the improvement in rank-1/mAP is \(7.24\%/9.5\%\) with S-CNN and \(10.27\%/11.26\%\) with ResNet-50 when the four streams are combined. Smaller improvements are observed when only three or two streams are combined. This improvement confirms that the features extracted from the body parts (local features) provide information complementary to the whole person image features (global features).

We notice that for a re-ID model based on only three streams, the best combination is obtained when the Mid and Dwn streams are combined with the Full stream. The rank-1 accuracies for S-CNN and ResNet-50 are 85.71% and 83.16%, respectively. This result confirms that local features provide additional complementary information that can enhance the re-ID decision.

Fig. 6. Top 10 predictions for each stream and for the fusion, for the same query image. The person surrounded by a blue box corresponds to the probe image. The first four rows correspond to the results produced by S-CNN on the Full, \(No\_bk\), Mid and Dwn streams respectively; the fifth row corresponds to the results produced by the fusion method. A green box denotes the same person as the probe, and a red box denotes a negative prediction. (Color figure online)

We display in Fig. 6 one qualitative result showing how the fusion of multi-stream outputs can significantly improve the prediction results. Even when all four streams provide a negative prediction at rank-1, the fusion method succeeds in retrieving the same identity as the probe image. Some images with the same identity as the probe can move up in the top-10 ranking list despite being absent from the top 10 of all four streams. This can be explained by the slight difference between scores in cases where images of different people are visually difficult to distinguish. The fusion method therefore makes it possible to favor images corresponding to the probe identity at the expense of false identities.

Our method is also compared to several state-of-the-art person re-ID methods based on a multi-stream strategy. This comparative study is presented in Table 3. The obtained results highlight the performance of the proposed method and show that it outperforms many previous works by a large margin in rank-1 accuracy and mAP. After embedding the re-ranking method [27], we obtain \(89.22\%\) rank-1 accuracy, an improvement of \(2.2\%\).

Table 3. Comparison with the state-of-the-art approaches on Market-1501 dataset.

4 Conclusion

In this paper, we proposed a multi-stream method for person re-ID that exploits global and local features and avoids problems related to background variations. We first perform a semantic segmentation to extract body parts, then combine local and global feature extractor outputs to make the final decision. We showed that combining multiple highly discriminative local region features of person images leads to an accurate person re-identification system. Experiments on the Market-1501 dataset clearly demonstrate that our proposed approach considerably improves performance compared with mono-stream methods and state-of-the-art approaches.