1 Introduction

As one of the most challenging problems in computer vision, person re-identification aims to judge whether two images from different non-overlapping cameras contain the same pedestrian. It has practical value in video surveillance, where it can save substantial human effort, and it has therefore received significant attention in the past few years. However, because of the many variations between different cameras, such as changes in a person's appearance, body pose, camera angle, occlusion, and illumination, identifying the same pedestrian across different camera views remains unsolved.

To address this problem, many efforts have been made on person re-identification. With the rapid development of deep learning, some researchers started employing CNNs to learn global features [2, 3, 6]. However, global features alone are not discriminative enough to identify the same pedestrian across different camera views, because a) global features contain much useless background information, and b) they lack useful local information.

Therefore, in recent years, some works pay more attention to local details. To explore local features, some of them cut images into fixed rigid parts, of which horizontal stripes are the most common, and then learn discriminative local features [7, 33, 35, 40]. These simple partition methods assume that the position and pose of each person are similar across bounding boxes, which is usually not the case. It is therefore important to design a method that can alleviate the negative effect of body-part misalignment. Some works deal with misalignment by extracting features from patches [16, 48], stripes [1], or pose-guided regions of interest (RoI) [31, 49, 53]. However, these methods still retain much useless background information. Recently, a few methods have been proposed that combine part maps and feature maps to form part-aligned representations [32, 45]. They usually design two sub-networks, one to learn part maps and another to learn appearance maps, and then combine the two kinds of maps as the final appearance features. Some examples of body-part misalignment are shown in Figure 1. However, it is also difficult to judge whether an image belongs to a certain person from local features alone, so some researchers combine global and local features to obtain a stronger representation. In [42], a global feature and a local feature are first extracted by a Harmonious Attention CNN model and then combined for better performance. Li et al. [18] jointly learn local and global features to exploit their correlation.

Figure 1

Some examples of misalignment caused by different poses/viewpoints in different cameras. We can see the corresponding body parts are usually not spatially aligned, which increases the difficulty of person re-identification

In this paper, we propose a network that learns powerful features by combining global features and part-aligned local features. More specifically, a pose estimation model is first employed to detect key points; then, with the help of these key points, we process the person images into several part-based images to reduce the influence of background on the local features. These part-based images are fed into the deep network to extract local features, which are concatenated to achieve part alignment. Finally, the global and local features are concatenated to form the final features.

Re-ranking is an effective way to improve the final performance. It can be applied to any baseline because it does not need additional training samples. Some previous works perform re-ranking based on the similarity between the probe and top-ranked gallery images (e.g., k-nearest neighbors). They assume that a top-ranked gallery image is likely to be a true match. However, the accuracy of this hypothesis depends largely on the initial rank list: once the top-k results of the initial rank list are all false matches, re-ranking cannot improve the final accuracy and may even yield a worse result. Some re-ranking methods try to alleviate this negative effect. Leng et al. [14] exploit a bidirectional ranking method that simultaneously computes both content and context similarities between bidirectional ranking lists. Garcia et al. [9] try to find and remove visual ambiguities in a ranking list. Zhong et al. [55] introduce the concept of k-reciprocal nearest neighbors to alleviate the effect of false matches. However, all existing re-ranking methods exploit only visual cues to refine the initial rank list, ignoring the potential spatial-temporal information. Our hypothesis is that different camera pairs follow different spatial-temporal models, i.e., the probability distributions of transfer time between different cameras vary. Based on this hypothesis, we propose a spatial-temporal information based re-ranking method for person re-identification. To summarize, the contributions of this paper are as follows:

  • We propose a new network to extract discriminative local features for person re-identification from a series of parts. By introducing a pose estimation model to detect key points, the local features are aligned automatically, which yields more distinctive visual cues. Combining global and local features then makes the final features more discriminative and robust.

  • We propose a novel re-ranking method that exploits spatial-temporal information. The proposed approach is fully automatic and requires no human feedback. All it needs is spatial-temporal information, which can easily be stored in the image names, so it can be applied to any baseline to improve performance.

  • Extensive experiments on GRID, Market-1501, and DukeMTMC-reID demonstrate the effectiveness and efficiency of our approach.

The rest of this paper is organized as follows. Related work is discussed in Section 2. In Section 3, we describe the proposed method in detail. Section 4 presents the experimental results. Finally, Section 5 concludes the paper and outlines future work.

2 Related work

Traditional methods for person re-identification mainly focus on two aspects: (1) extracting robust and discriminative hand-crafted features [4, 19, 24, 26, 27] or deep learning features [7, 15, 34, 36, 38], and (2) learning a more robust metric [5, 11, 13, 19, 20, 28]. In addition, some works bring in further ideas such as attributes [30], transfer learning [25], spatial-temporal information [23], and re-ranking.

Regular spatial-partition based methods

This kind of method usually divides every person image into several parts by a fixed partition, such as grid cells [2, 15] or horizontal stripes [7, 35]. The assumption is that the position and pose of each person are similar across bounding boxes. However, this rarely holds in the real world, and when an image violates the assumption, the result is unsatisfactory.

Body part-aligned based methods

Body-part misalignment is unavoidable in person re-identification and has become one of the most crucial problems to solve. In the past few years, body-part and key-point detection have been brought into person re-identification to address misalignment, and owing to the popularity of deep learning, some works [17, 41, 49] use deep learning techniques to achieve this goal. Some of them separate body-part images detected by a pose estimator to extract more discriminative local features, while others introduce attention maps [32, 45]. These methods usually design two sub-networks, one to learn part maps and another to learn appearance maps, and then combine the two kinds of maps as the final appearance features. Our method combines global features and part-aligned local features to form the final powerful features.

Re-ranking methods

Re-ranking methods can be divided into two categories, depending on whether human feedback is needed during the re-ranking process: 1) re-ranking with human feedback and 2) re-ranking without human feedback. In [21], the end user selects one strong negative sample or additional weak negative samples as feedback to refine the initial rank list during the test stage. In [37], both similar and dissimilar samples are chosen by the end user. In [39], a new incremental model is proposed that becomes stronger with human feedback. However, re-ranking methods that require human feedback are limited because human feedback can be expensive. Thus, some researchers pay more attention to automatic re-ranking approaches without human feedback. In [9], a re-ranking method is proposed that removes visual ambiguities in a ranking by analyzing the content and context information in the initial ranking. In [46], the authors argue that if two images belong to the same person, they should have a similar appearance not only globally but also locally. In [47], two initial rank lists are obtained by different baseline methods, and the final rank list is computed by aggregating the similarity and dissimilarity information in the two initial lists. In contrast to the above re-ranking methods, our proposed method explores the potential of spatial-temporal information. We believe there are inherent behavior patterns between different camera pairs; that is, with the help of spatial-temporal models it is feasible to filter out negative samples that have high visual similarity in the top-k list.

3 Proposed method

In this work, we aim to learn discriminative feature representations for persons and improve the performance of person re-identification. Our proposed network is shown in Figure 2. The deep network extracts discriminative global features and part-aligned local features at the same time. In the re-ranking stage, we design a novel spatial-temporal re-ranking method that refines the initial rank list by exploiting spatial-temporal information.

Figure 2

The framework of our proposed method

In this section, we introduce the proposed network for learning powerful features in Section 3.1 and the re-ranking method in Section 3.2.

3.1 Proposed network

The proposed network consists of two components: a global feature learning component and a local feature learning component. The first extracts discriminative global features from whole person images. The second is a 3-branch deep convolutional network designed to obtain part-aligned local features.

Global features learning

The global feature learning component is trained with a multi-task network: one task is identification and the other is verification. As shown in Figure 3, the global part contains two branches that share the same parameters during training. A pair of images is fed into the component as input. After the parameter-shared network, each image's representation is extracted and fed into a final fully connected layer that classifies the image's ID for the identification task. The verification task judges whether the two images belong to the same ID and is accomplished through a sub-module consisting of a square layer and a fully connected layer, which outputs a two-dimensional vector. Once the model is trained, the global features fglobal can be extracted from it.
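To make the structure concrete, the following is a minimal PyTorch sketch of this two-branch design. The ResNet-50 backbone, feature dimension, and head shapes are our assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class GlobalNet(nn.Module):
    def __init__(self, num_ids, feat_dim=2048):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # Shared backbone: both branches use the same parameters.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.id_fc = nn.Linear(feat_dim, num_ids)  # identification head
        self.veri_fc = nn.Linear(feat_dim, 2)      # verification head (same / different)

    def forward(self, img_a, img_b):
        f_a = self.features(img_a).flatten(1)
        f_b = self.features(img_b).flatten(1)
        id_a, id_b = self.id_fc(f_a), self.id_fc(f_b)
        # Square layer: element-wise squared difference of the two
        # representations, followed by a fully connected layer that
        # outputs a two-dimensional same/different vector.
        veri = self.veri_fc((f_a - f_b) ** 2)
        return id_a, id_b, veri, f_a, f_b
```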

Figure 3

The structure of Global Features Learning Network

Part-alignment local features learning

As mentioned above, body-part alignment is key to improving the result. In this paper, we therefore propose a 3-branch local network to generate local features with aligned body parts. As shown in Figure 4, to learn the part-aligned local features, we first employ a pose estimation model [43] to detect key points, and then process the person images into several part-based images with the help of these key points. These part-based images are fed into the deep network to extract local features. Finally, all local features are concatenated, and body-part alignment is achieved automatically. In this part, the three branches do not share parameters. During training, each branch has an independent classifier that takes only part of the image as input, which forces the network to extract discriminative details of each part. The keypoint-guided cropping step is sketched below.
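Here is a hedged sketch of the cropping step. The COCO-style keypoint indices and their grouping into upper, arm, and leg regions are our assumptions; the paper uses the pose estimator of [43], whose keypoint layout may differ.

```python
import numpy as np

def crop_part(img, keypoints, indices, margin=10):
    """Crop the axis-aligned box around a subset of (x, y) key points."""
    pts = np.asarray([keypoints[i] for i in indices], dtype=int)
    x0, y0 = np.maximum(pts.min(axis=0) - margin, 0)
    x1, y1 = pts.max(axis=0) + margin
    return img[y0:y1, x0:x1]

# Hypothetical COCO-style keypoint groups for the three branches.
UPPER = [0, 1, 2, 3, 4, 5, 6]     # head and shoulders
ARM = [5, 6, 7, 8, 9, 10]         # shoulders, elbows, wrists
LEG = [11, 12, 13, 14, 15, 16]    # hips, knees, ankles

def split_into_parts(img, keypoints):
    """Return the three part-based images fed to the local branches."""
    return [crop_part(img, keypoints, g) for g in (UPPER, ARM, LEG)]
```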

Figure 4

The structure of Local Features Learning Network

Person representation

In the test stage, the final robust features are formed by concatenating the global features and all local features, as follows:

$$ f=[\alpha\times f_{global}, \beta\times f_{upper}, \gamma\times f_{arm}, \sigma\times f_{leg}] $$
(1)

where α, β, γ, σ are weight parameters, fglobal denotes the global features, and fupper, farm, fleg are the local features from the three branches, respectively.
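Eq. (1) is a weighted concatenation; a direct NumPy transcription follows. The weight values here are placeholders, since the paper treats them as tunable parameters.

```python
import numpy as np

def fuse_features(f_global, f_upper, f_arm, f_leg,
                  alpha=1.0, beta=0.5, gamma=0.5, sigma=0.5):
    """Weighted concatenation of global and local features, Eq. (1)."""
    return np.concatenate([alpha * f_global, beta * f_upper,
                           gamma * f_arm, sigma * f_leg])
```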

Training

In the global phase, the loss function can be formulated as:

$$ L_{g}=\beta \ell_{id}+\gamma \ell_{veri} $$
(2)

where β, γ are weights, \(\ell_{id}\) is the loss of the identification task, and \(\ell_{veri}\) is the loss of the verification task.

In the local phase, the loss function can be formulated as:

$$ L_{p}= \lambda_{1}\ell_{id}^{u}+\lambda_{2}\ell_{id}^{a}+\lambda_{3}\ell_{id}^{l} $$
(3)

where λ1, λ2, λ3 are weights, and \(\ell _{id}^{u}\), \(\ell _{id}^{a}\), \(\ell _{id}^{l}\) are the losses from the upper, arm, and leg branches, respectively.

The softmax loss is used as the classification loss, and the cross-entropy loss is used for the verification task.
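As a sketch, the two objectives of Eqs. (2) and (3) can be written as below, using cross-entropy for both the classification losses and the two-way verification loss; the weight values are assumptions.

```python
import torch.nn.functional as F

def global_loss(id_a, id_b, veri, label_a, label_b, same, beta=1.0, gamma=1.0):
    """Eq. (2): weighted sum of identification and verification losses.
    'same' is 1 if the image pair shares an identity, else 0."""
    l_id = F.cross_entropy(id_a, label_a) + F.cross_entropy(id_b, label_b)
    l_veri = F.cross_entropy(veri, same)
    return beta * l_id + gamma * l_veri

def local_loss(logits_upper, logits_arm, logits_leg, labels,
               lambdas=(1.0, 1.0, 1.0)):
    """Eq. (3): weighted sum of the three per-branch identification losses."""
    branch_logits = (logits_upper, logits_arm, logits_leg)
    return sum(w * F.cross_entropy(lg, labels)
               for w, lg in zip(lambdas, branch_logits))
```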

3.2 Proposed re-ranking method

Most existing re-ranking methods exploit only visual cues to refine the initial rank list, ignoring the potential spatial-temporal information. Our hypothesis is that different camera pairs follow different spatial-temporal models, i.e., the probability distributions of transfer time between different cameras vary. Based on this hypothesis, we propose a spatial-temporal information based re-ranking method for person re-identification. Specifically, given the probe images, an initial rank list for each probe image is obtained through the baseline. Then, some reliable gallery samples are selected for each probe image according to the initial rank list and treated as true matches. From these samples, spatial-temporal models between different camera pairs are learned. Finally, the final distance is calculated by combining the original distance with the probability for the corresponding camera pair, and the re-ranked list is obtained from the final distance.

Suppose there are three sets, a probe set P, a gallery set G, and a training set T, with sizes Np, Ng, and Nt, respectively. Given a probe image \(p_{i}\left (i = 1 , 2, ... N_{p} \right )\) and a gallery image \(g_{j}\left (j = 1 , 2, ... N_{g} \right )\), the initial distance between pi and gj can be computed with a Euclidean or Mahalanobis distance,

$$ d\left (p_{i},g_{j} \right )=\left (x_{p_{i}}- x_{g_{j}}\right )^{T}M\left (x_{p_{i}}- x_{g_{j}}\right ) $$
(4)

where \(x_{p_{i}}\) and \(x_{g_{j}}\) are the features of probe pi and gallery gj respectively, and M is a positive semidefinite matrix (the identity matrix recovers the squared Euclidean distance). Then we obtain the initial rank list \(R\left (p_{i},G \right )=\left \{ g_{1},g_{2},...g_{N_{g}} \right \}\) by sorting the distances between probe pi and every gallery image gj in G in ascending order, i.e., \(d\left (p_{i},g_{j} \right )< d\left (p_{i},g_{j+1} \right )\).
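A sketch of Eq. (4) and the initial ranking follows; passing M = I gives the squared Euclidean distance, while a learned positive semidefinite M (e.g., from XQDA) gives a Mahalanobis metric.

```python
import numpy as np

def initial_rank(x_probe, X_gallery, M=None):
    """Return gallery indices sorted by the metric of Eq. (4)."""
    if M is None:
        M = np.eye(x_probe.shape[0])      # Euclidean special case
    diffs = X_gallery - x_probe           # shape (N_g, d)
    dists = np.einsum('nd,de,ne->n', diffs, M, diffs)
    order = np.argsort(dists)             # ascending distance
    return order, dists[order]
```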

The spatial-temporal information is collected after the initial rank list is obtained. Given an initial rank list \(R\left (p_{i},G \right )=\left \{ g_{1},g_{2},...g_{N_{g}} \right \}\), the top-k samples, i.e., g1, g2,...gk, are selected and treated as true matches of probe pi. We then fix a positive direction, e.g., the direction from the camera with the smaller ID to the camera with the larger ID. The spatial-temporal information is computed as follows:

$$ \left\{ \begin{array}{rcl} st_{c_{p_{i}},c_{g_{j}}}= f_{j}-f_{i}, & {c_{p_{i}}<c_{g_{j}},f_{j}>f_{i}} \\ st_{c_{g_{j}},c_{p_{i}}}= f_{j}-f_{i}, & {c_{p_{i}}<c_{g_{j}},f_{j}<f_{i}} \\ st_{c_{g_{j}},c_{p_{i}}}= f_{i}-f_{j}, & {c_{p_{i}}>c_{g_{j}},f_{j}<f_{i}} \\ st_{c_{p_{i}},c_{g_{j}}}= f_{i}-f_{j}, & {c_{p_{i}}>c_{g_{j}},f_{j}>f_{i}} \end{array} \right. $$
(5)

where \(c_{p_{i}}\) and \(c_{g_{j}}\) denote the cameras that pi and \(g_{j}\left (j=1,2,...k \right )\) come from, and fi and fj denote the frames of pi and gj, respectively. After the spatial-temporal information of every sample pair is collected, it is sorted into different spatial-temporal information sets \(ST_{c_{p_{i}},c_{g_{j}}}\) according to the camera identifiers. Cameras carry spatial information and frames carry temporal information. Finally, we sort the temporal information of every camera pair, which yields \(n\ast \left (n-1 \right )\) spatial-temporal models, where n is the number of cameras. For example, with 6 cameras there are 30 time-information sets, excluding the 6 sets corresponding to identical cameras. After the Np × k time differences are obtained, they are sorted into the 30 sets according to the camera pairs from which the image pairs come.
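The following sketch shows one reading of how the time differences of Eq. (5) can be collected, keyed by directed camera pairs; camera IDs and frame numbers are assumed to be parsed from the image names.

```python
from collections import defaultdict

def collect_st_info(matches):
    """matches: iterable of (cam_p, frame_p, cam_g, frame_g) tuples for each
    probe image and its top-k gallery candidates."""
    ST = defaultdict(list)
    for cp, fp, cg, fg in matches:
        if cp == cg:
            continue  # same-camera pairs carry no transfer-time information
        if cp < cg:   # value: frame at larger-ID camera minus the smaller one
            key = (cp, cg) if fg > fp else (cg, cp)
            ST[key].append(fg - fp)
        else:
            key = (cg, cp) if fg < fp else (cp, cg)
            ST[key].append(fp - fg)
    for key in ST:
        ST[key].sort()  # sorted lists support the interval count of Eq. (8)
    return ST
```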

Besides, different sample pairs should contribute differently to the final result according to their similarities: pairs with high similarity are more likely to be true matches. Therefore, similarities are also collected when obtaining the spatial-temporal information, as follows:

$$ \left\{ \begin{array}{rcl} S_{c_{p_{i}},c_{g_{j}}}=\sum s_{p_{i},g_{j}}, & {c_{p_{i}}<c_{g_{j}},f_{j}>f_{i}} \\ S_{c_{g_{j}},c_{p_{i}}}=\sum s_{p_{i},g_{j}}, & {c_{p_{i}}<c_{g_{j}},f_{j}<f_{i}} \\ S_{c_{g_{j}},c_{p_{i}}}=\sum s_{p_{i},g_{j}}, & {c_{p_{i}}>c_{g_{j}},f_{j}<f_{i}} \\ S_{c_{p_{i}},c_{g_{j}}}=\sum s_{p_{i},g_{j}}, & {c_{p_{i}}>c_{g_{j}},f_{j}>f_{i}} \end{array} \right. $$
(6)

where \(S_{c_{p_{i}},c_{g_{j}}}\) is the sum of all sample pairs' similarities for each camera pair, and \(s_{p_{i},g_{j}}\) is the similarity between pi and gj. For example, with 6 cameras, 30 sums are computed for the 30 distinct camera pairs, excluding the 6 pairs consisting of the same camera.

After the spatial-temporal information of each probe image and its top-k gallery images has been calculated, the potential behavior patterns between different camera pairs are obtained.

In the test stage, the spatial-temporal probability that a probe image and each gallery image in the gallery set G belong to the same person is calculated from the above behavior patterns. To make the probability more reliable, an interval is set around the time difference between the probe image and the gallery image. The probability is computed as follows:

$$ probability\left (p_{i},g_{j} \right )=s_{p_{i},g_{j}}\ast Num_{p_{i},g_{j}}/S_{c_{p_{i}},c_{g_{j}}} $$
(7)
$$ Num_{p_{i},g_{j}}=Index\left (t+{\varDelta} \right )- Index\left (t-{\varDelta} \right ) $$
(8)

where t is the time difference between probe image pi and gallery image gj, Δ is an interval, Index(x) is the index of x in \(ST_{c_{p_{i}},c_{g_{j}}}\), \(S_{c_{p_{i}},c_{g_{j}}}\) is the total similarity weight between cameras \(c_{p_{i}}\) and \(c_{g_{j}}\), and \(Num_{p_{i},g_{j}}\) is the number of spatial-temporal records within the interval \(\left [t-{\varDelta }, t+{\varDelta } \right ]\).
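On a sorted spatial-temporal set, Num of Eq. (8) is an interval count that can be obtained with two binary searches, as in this sketch.

```python
import bisect

def st_probability(t, st_sorted, s_ij, S_pair, delta):
    """Eqs. (7)-(8): count the records inside [t - delta, t + delta] and
    normalize by the camera pair's accumulated similarity S_pair."""
    num = (bisect.bisect_right(st_sorted, t + delta)
           - bisect.bisect_left(st_sorted, t - delta))
    return s_ij * num / S_pair if S_pair > 0 else 0.0
```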

Since the spatial-temporal information should be treated as auxiliary information complementary to the appearance representations, we jointly aggregate the initial distance and the spatial-temporal probability to revise the initial rank list. The final distance dfinal is defined as

$$ d_{final}\left (p_{i},g_{j} \right )=d_{initial}\left (p_{i},g_{j} \right )/probability\left (p_{i},g_{j} \right ) $$
(9)

Besides, spatial-temporal information should not be used for all pairs, because doing so could let negative samples that are visually dissimilar jump into the top-k list due to high spatial-temporal probabilities. It is therefore necessary to set a distance limitation that decides whether spatial-temporal information should enter the final distance; if it is ignored, the spatial-temporal probability is set to a minimum value. Finally, the final rank list is obtained by sorting the final distance. A sketch of this step is given below.
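This minimal sketch combines Eq. (9) with the distance limitation; how the two thresholds interact is our reading of the text, and epsilon stands in for the unspecified minimum probability.

```python
EPS = 1e-7  # assumed minimum probability for pairs failing the limitation

def rerank(d_initial, probs, dist_limit, prob_threshold):
    """Eq. (9) with the distance limitation: d_initial and probs are
    per-gallery values for one probe; returns re-ranked gallery indices."""
    d_final = []
    for d, p in zip(d_initial, probs):
        if d > dist_limit or p < prob_threshold:
            p = EPS  # ignore unreliable spatial-temporal evidence
        d_final.append(d / p)
    return sorted(range(len(d_final)), key=d_final.__getitem__)
```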

4 Experiments

4.1 Datasets and settings

Datasets

Although spatial-temporal information can easily be recorded in the name of each image, only three datasets contain it: QMUL GRID [22], Market-1501 [50], and DukeMTMC-reID [54]. Experiments are therefore conducted on these three datasets. An overview is shown in Table 1.

Table 1 The details of datasets

QMUL GRID

QMUL GRID, introduced by Loy et al. [22], is the first re-ID benchmark dataset that contains spatial-temporal information in the image names. It consists of 500 pedestrian images of 250 persons; each person has a pair of images from different cameras. The images were captured by 8 cameras in a busy underground station.

Market-1501

Market-1501, introduced by Zheng et al. [50], is a large-scale image-based dataset containing spatial-temporal information in the image names. It consists of 32,668 pedestrian images of 1,501 persons captured by 6 different cameras, with bounding boxes detected by the Deformable Part Model (DPM) [8]. In the experiments, we follow the standard training and evaluation protocol of [50], in which 751 identities are used for training and the remaining 750 for testing.

DukeMTMC-reID

DukeMTMC-reID, introduced by Zheng et al. [54], is a subset of DukeMTMC for image-based re-identification, with the same format as Market-1501. It consists of 36,411 pedestrian images of 1,404 persons captured by 8 different cameras. In the experiments, we follow the standard training and evaluation protocol, in which 702 identities are used for training and the remaining 702 for testing.

Evaluation metrics

For the small-scale QMUL GRID dataset, we use only the Cumulative Match Characteristic (CMC) curve to evaluate Re-ID performance, because there is only one gallery image per identity. For the large-scale Market-1501 and DukeMTMC-reID datasets, we use both the CMC curve and mean average precision (mAP).

Feature representations

Besides the features extracted from our proposed network, we also employ several other features to demonstrate the effect of our re-ranking method. The Local Maximal Occurrence (LOMO) [19] features represent person appearance on all three datasets; they are hand-crafted and robust to view and illumination changes. In addition, we employ the ID-discriminative Embedding (IDE) feature proposed in [52] for Market-1501, whose model is based on ResNet-50 [10]. Finally, we employ the baseline deep feature proposed in [51] for Market-1501 and DukeMTMC-reID.

Implementation details

We implement the proposed person re-identification model based on an improved ResNet-50 framework. Stochastic gradient descent is used to optimize the networks. The initial learning rate, weight decay, and momentum are set to 2 × 10−4, 5 × 10−4, and 0.9, respectively.
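For reference, this optimizer configuration corresponds to the following PyTorch call; the model object here is only a stand-in.

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 751)  # stand-in for the re-ID network
optimizer = torch.optim.SGD(model.parameters(), lr=2e-4,
                            weight_decay=5e-4, momentum=0.9)
```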

4.2 Experiments on QMUL GRID

We first conduct experiments on the small-scale QMUL GRID dataset. Since there are only 250 pairs of images, we use only the hand-crafted LOMO feature with XQDA as the metric. We set the interval Δ to 10, the max_interval to 30, and m to 50, where max_interval is the threshold on the frame difference and m is the number of distances used to compute a similarity threshold. The spatial-temporal probability threshold is then calculated as follows: first, we compute the mean of all query images' top-50 initial distances and initialize the spatial-temporal probability threshold to 1e-7; then we take the mean of all spatial-temporal probabilities greater than 1e-7 as the final spatial-temporal probability threshold (a sketch is given below). The result on QMUL GRID is shown in Table 2.
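A hedged reading of this procedure follows; we assume the mean of the top-m initial distances serves as the distance limitation of Section 3.2.

```python
import numpy as np

def compute_thresholds(dist_matrix, st_probs, m=50, init_prob=1e-7):
    """dist_matrix: (num_query, num_gallery) initial distances;
    st_probs: all spatial-temporal probabilities collected so far."""
    top_m = np.sort(dist_matrix, axis=1)[:, :m]
    dist_limit = top_m.mean()                     # distance limitation
    kept = st_probs[st_probs > init_prob]
    prob_threshold = kept.mean() if kept.size else init_prob
    return dist_limit, prob_threshold
```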

Table 2 The results of our approach on the QMUL GRID

Table 2 clearly shows a huge margin between the baseline and our method: for LOMO + XQDA, our method gains 54.08%, 61.12%, and 45.52% in rank-1, rank-5, and rank-20, respectively. Moreover, no matter which metric we use, our method always improves rank-1 significantly. The reason for such a huge margin could be that the behavior patterns between different camera pairs are rather simple, so the spatial-temporal information plays a dominant role.

4.3 Experiments on Market-1501

We follow the standard training and evaluation protocol of [50], in which 751 identities are used for training and the remaining 750 for testing, and employ LOMO [19], IDE [52], and the baseline deep feature [51] to evaluate performance. We set the interval to 700, the max_interval to 40000, and m to 50. The spatial-temporal probability threshold is calculated in the same way as for QMUL GRID. The result is shown in Table 3.

Table 3 The results of our re-ranking method on the Market-1501

As Table 3 shows, our re-ranking method consistently improves rank-1 accuracy and mAP over all features and metrics. Especially when the baseline performs poorly, our method brings a large improvement. Even when the baseline is already strong, such as the deep feature, our method still improves rank-1 and mAP by 3.32% and 2.02%, respectively. Compared with other re-ranking methods, ours achieves higher rank-1 accuracy. It is reasonable that the mAP improvement is smaller than that of other re-ranking methods, because our method focuses on improving top-k accuracy and thus considers only the gallery images of high visual similarity, i.e., the top-m candidates of the initial rank list. True matches of low visual similarity are therefore ignored by our re-ranking method, so the benefit for mAP is less pronounced.

Since our features are fused from several components, we compare the effectiveness of the different components. "Global" denotes fglobal extracted by the global network; "Upper-L", "Arm-L", and "Leg-L" denote fupper, farm, and fleg, respectively; "All-L" means all local features fused together. The result is shown in Table 4. Every part is effective, and the best performance is obtained by fusing all of them, which proves that they are complementary (Figure 5).

Table 4 The results of our deep network on the Market-1501
Figure 5

The CMC curve of feature fusion on Market-1501

Table 5 compares our best result with other state-of-the-art methods on the Market-1501 dataset. Our best result is competitive with these state-of-the-art methods.

Table 5 Comparison with state-of-the-art methods on the Market-1501

4.4 Experiments on DukeMTMC-reID

Finally, we conduct experiments on the recent large-scale dataset DukeMTMC-reID. We follow the standard training and evaluation protocol, in which 702 IDs are used as the training set and the remaining 702 IDs as the testing set. We set the interval to 700, the max_interval to 40000, and m to 50. The spatial-temporal probability threshold is calculated in the same way as for QMUL GRID. LOMO and the baseline deep feature are employed to evaluate performance. The result is shown in Table 6, and the result of fusing different parts is shown in Table 7.

Table 6 The results of our re-ranking method on the DukeMTMC-reID
Table 7 The results of our deep network on the DukeMTMC-reID

It is clear from Table 6 that our re-ranking method gains at least 7% in rank-1 accuracy, no matter which feature or metric is used. This indicates that spatial-temporal information is a key cue for person re-identification (Figure 6).

Figure 6

The CMC curve of feature fusion on DukeMTMC-reID

Table 8 compares our best result with other state-of-the-art methods on the DukeMTMC-reID dataset. Our proposed method with re-ranking achieves competitive performance.

Table 8 Comparison with state-of-the-art methods on the DukeMTMC-reID

4.5 Experiments on cross-dataset

In practice, a model trained on one labeled dataset may be deployed in an entirely new environment. Therefore, two further cross-dataset experiments are conducted: one trains the model on DukeMTMC-reID and tests on Market-1501; the other trains on Market-1501 and tests on DukeMTMC-reID. The results are shown in Tables 9 and 10, respectively.

Table 9 The results from DukeMTMC-reID to Market-1501
Table 10 The results from Market-1501 to DukeMTMC-reID

Apparently, our re-ranking method improves performance by a large margin even when the model is trained on another labeled dataset, which suggests that our method generalizes well and can be applied to a new environment with any pretrained base model.

5 Conclusion

In this paper, an effective framework for person re-identification is proposed, consisting of a new deep network and a re-ranking method. In the network, in addition to extracting global features, we design a multi-branch network that extracts features from a series of local regions, which alleviates the effect of misalignment. In the re-ranking phase, we propose a novel method that exploits spatial-temporal information to obtain better performance. In future work, we will focus on the network and try to design a more robust and generalized architecture.