1 Introduction

Vision-based person re-identification (re-id) aims at identifying a target person in a gallery of pedestrian images. It is important in many video surveillance applications, such as finding criminals, cross-camera person tracking, and person activity analysis. The problem is challenging because of complex variations in human pose, camera viewpoint, lighting, occlusion, resolution, and background clutter, and it has therefore drawn considerable research attention in recent years [28, 29, 51].

The traditional pipeline of person re-id methods consists of two stages: detection and recognition.

  1. Person detection tries to find people in images. Histograms of oriented gradients (HOG), proposed by Dalal et al., have proven successful in the past few years [7]. Recently, researchers have focused on complex situations such as occlusion [20], crowded scenarios [15], and mobile cameras [21]. In addition, this problem has also been tackled using multiple features [19, 35, 38,39,40,41].

  2. Person recognition tries to match the detected person with a specific target in the dataset. Of the two stages, person recognition is the more difficult: even with perfect detection results, recognition remains challenging. Researchers have therefore studied different aspects of this problem, such as features [27, 37], metrics [25], and matching [18].

As in many other applications of computer vision, the key to re-id is finding descriptive features. Deep learning has been used to achieve this. Yi et al. [36] and Li et al. [16] both employ a siamese neural network [3] to determine whether a pair of input images belong to the same identity. The reason for choosing the siamese model is probably that the number of training samples for each identity is limited (usually two).

Existing person re-identification benchmarks and methods mainly focus on matching cropped pedestrian images between queries and candidates. However, this differs from real-world scenarios, where annotations of pedestrian bounding boxes are unavailable and the target person must be searched for in a gallery of whole scene images. Recently, some researchers have proposed end-to-end methods to tackle the re-id problem. Xiao et al. use the Faster R-CNN framework [31]. Zheng et al. discuss the relationship between detection and recognition [45,46,47] and, in this way, try to find the best combination of detectors and recognizers.

Although numerous person re-id datasets and methods have been proposed, there is still a big gap between the problem setting itself and real-world applications. In most benchmarks, the gallery only contains manually cropped pedestrian images, while in real applications, the goal is to find a target person in a gallery of whole scene images. Following the protocols of these benchmarks, most of the existing person re-id methods assume perfect pedestrian detections. However, these manually cropped bounding boxes are unavailable in practical applications. Off-the-shelf pedestrian detectors would inevitably produce false alarms, misdetections, and misalignments, which could harm the final searching performance significantly.

To tackle the above problem, we propose a method that combines body part detection and person recognition. Specifically, we design a novel deep architecture named multi-task deep neural networks (MDNN) for person re-id. Different from existing methods, the proposed method defines part detection and person recognition as two tasks. It applies deep neural network (DNN)-based feature extraction to represent pedestrian images and a multi-task learning based model to construct the mapping from images to identities. The contributions of this paper are summarized below:

  1. First, we propose a new multi-task learning framework based on deep neural networks (DNNs). In this framework, DNN-based feature mapping and multi-task learning are connected to obtain a DNN-based regression for re-id, which unifies the multimodal problem in a single model.

  2. Second, in the proposed framework, re-id consists of two tasks: part detection and person recognition. A two-stream network is applied, and the outputs of the two tasks are unified to compute the results.

  3. Finally, the proposed framework is naturally multimodal. We conduct comprehensive experiments on a challenging dataset [31]. The experimental results validate the effectiveness of our method.

The remainder of this paper is organized as follows. Related works on multi-task learning and deep learning are reviewed in Sect. 2. Then, the proposed MDNN-based re-id method is presented in Sect. 3. After that, we demonstrate the effectiveness of MDNN by experimental comparisons with other state-of-the-art methods in Sect. 4. We conclude in Sect. 5.

2 Related works

2.1 Multi-task learning

Multi-task learning (MTL) has recently been employed in image classification [43], visual tracking [44], multi-view action recognition [34], egocentric daily activity recognition [33], and image privacy protection [42]. Given a set of related tasks, MTL [4] seeks to simultaneously learn a set of task-specific classification or regression models. The intuition behind MTL is that a joint learning procedure accounting for task relationships is more efficient than learning each task separately. Traditional MTL methods [1, 8] assume that all the tasks are related and that their dependencies can be modeled by a set of latent variables. However, in many real-world applications, not all tasks are related, and enforcing erroneous (or nonexistent) dependencies may lead to negative knowledge transfer.

Recently, sophisticated methods have been introduced to counter this problem. These methods assume a-priori knowledge (e.g., a graph) defining task dependencies [6], or learn task relationships in conjunction with task-specific parameters [9, 11, 14, 48, 49]. Among these, our work is most similar to [6].

2.2 Deep learning

Feature description of images is critical to image-based analysis [22, 32]. To obtain descriptive representation, deep learning architectures [23, 24] have been efficient in exploring hidden representations in natural images and have achieved proven success in a variety of vision applications. For example, an autoencoder [2] is an efficient unsupervised feature learning method in which the internal layer acts as a generic extractor of inner image representations. A double-layer structure, which efficiently maps the input data onto appropriate outputs, is obtained by using a multilayer perceptron. In addition, deep learning can exploit parallel GPU computation and deliver high speeds in the forward pass. These advantages make deep models an attractive option for handling the re-id problem.

3 Multi-task deep neural networks

3.1 Overview of the proposed method

The flowchart of the proposed method is shown in Fig. 1. The training process consists of two tasks: part detection and person recognition. The two tasks correspond to two modalities and are trained in two deep neural networks. The outputs of the DNNs are then fed into multi-task regression learning. With the trained model, re-id can be achieved.

Fig. 1 Flowchart of the proposed method, which consists of three stages. First, we extract parts from the images. Second, we use deep neural networks to compute features. Finally, we connect multi-task learning with the neural networks to compute the regression model for re-id. With this model, we can map image features to person identities.

3.2 The framework of multi-task learning

As mentioned before, the traditional routine for re-id is to map images to people with pre-computed regression models; the key is to define a reasonable loss function between the estimate and the ground truth during training. We therefore aim at computing a well-defined regression model. In data mining and machine learning, a common paradigm for classification and regression is to minimize the penalized empirical loss:

$$\begin{aligned} \arg \min _W \ell (W)+\varPhi (W), \end{aligned}$$
(1)

where W is the parameter to be estimated from the training samples, \(\ell (W)\) is the loss function, and \(\varPhi (W)\) is the regularization term that encodes task relatedness.

In our application, two-stream re-id with V modalities can be considered as a multi-task process with V tasks, where \(V=2\). The training data for the v-th task are denoted by \(\{x_i^v,y_i^v\}\), where \(v=1,2\), \(i = 1,\ldots ,N\), and N is the number of samples. \(X=\{x_i^v\}\), with \(x_i^v \in \mathbb {R}^{d_1}\), are the image features, where \(d_1\) is the feature dimension. The goal of multi-task learning can be defined as:

$$\begin{aligned} \arg \min \sum _{v=1}^V \sum _{i=1}^N \ell (y_i^v,f(x_i^v;w^v))+\varPhi (w^v), \end{aligned}$$
(2)

where \(f(x_i^v;w^v)\) is a function of \(x_i^v\) and parameterized by a weight vector \(w^v\). There are several existing choices of \(\ell (\cdot )\).

  • Trace norm regularized learning with least squares loss (LeastTrace) [12]: the loss function is defined as:

    $$\begin{aligned} \arg \min _W \sum _{i=1}^t \frac{1}{2} \parallel Y_i-X_i^\top W_{:,i} \parallel _2^2+\rho _1 \parallel W \parallel _*, \end{aligned}$$
    (3)

    where \(\parallel W \parallel _*=\sum _k \sigma _k(W)\) is the trace norm, i.e., the sum of the singular values of W.

  • L21 joint feature learning with least squares loss (LeastL21) [1]: the loss function is defined as:

    $$\begin{aligned}&\arg \min _W \sum _{i=1}^t \frac{1}{2} \parallel Y_i-X_i^\top W_{:,i} \parallel _2^2\nonumber \\&\quad +\,\rho _{L2} \parallel W \parallel ^2_2+\rho _1 \parallel W \parallel _{2,1}. \end{aligned}$$
    (4)

  • Sparse structure-regularized learning with least squares loss (LeastLasso) [26]: the loss function is defined as:

    $$\begin{aligned}&\arg \min _W \sum _{i=1}^t \frac{1}{2} \parallel Y_i-X_i^\top W_{:,i} \parallel _2^2\nonumber \\&\quad +\,\rho _{L2} \parallel W \parallel ^2_F+\rho _1 \parallel W \parallel _1. \end{aligned}$$
    (5)

  • Incoherent sparse and low-rank learning with least squares loss (LeastSparseTrace) [5]: the loss function is defined as:

    $$\begin{aligned}&\arg \min _W \sum _{i=1}^t \frac{1}{2} \parallel Y_i-X_i^\top W_{:,i} \parallel _2^2+\gamma \parallel P \parallel _1,\nonumber \\&\quad \text {subject to: } W=P+Q,\ \parallel Q \parallel _* \le \tau , \end{aligned}$$
    (6)

    where \(\parallel Q \parallel _* = \sum _k \sigma _k(Q)\) is the trace norm of Q.
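The least squares term and the entrywise and row-wise penalties of Eqs. (4) and (5) can be evaluated with a few lines of code. The following is a minimal sketch rather than the MALSAR implementation: the function names are hypothetical, each task's data is assumed to be stored as a list of feature rows so that a prediction is the inner product of a row with column i of W, and the squared norm weighted by \(\rho _{L2}\) is assumed to be the squared Frobenius norm in both equations. The trace norm is omitted because it requires a singular value decomposition.

```python
import math


def least_squares_loss(X_tasks, Y_tasks, W):
    """Sum over tasks of 0.5 * ||Y_i - X_i w_i||^2, with w_i = column i of W.

    X_tasks[i]: list of feature rows for task i (n_i x d, an assumed layout).
    Y_tasks[i]: list of n_i scalar targets for task i.
    W: d x t weight matrix stored as a list of rows.
    """
    total = 0.0
    for i, (X, Y) in enumerate(zip(X_tasks, Y_tasks)):
        w = [row[i] for row in W]  # column i of W
        for x, y in zip(X, Y):
            pred = sum(xj * wj for xj, wj in zip(x, w))
            total += 0.5 * (y - pred) ** 2
    return total


def l1_norm(W):
    """Entrywise L1 norm: sum of absolute values."""
    return sum(abs(v) for row in W for v in row)


def l21_norm(W):
    """L2,1 norm: sum of the Euclidean norms of the rows of W."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)


def frobenius_sq(W):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(v * v for row in W for v in row)


def least_l21_objective(X_tasks, Y_tasks, W, rho1, rho_l2):
    """Objective of Eq. (4) under the assumptions stated above."""
    return (least_squares_loss(X_tasks, Y_tasks, W)
            + rho_l2 * frobenius_sq(W) + rho1 * l21_norm(W))


def least_lasso_objective(X_tasks, Y_tasks, W, rho1, rho_l2):
    """Objective of Eq. (5) under the assumptions stated above."""
    return (least_squares_loss(X_tasks, Y_tasks, W)
            + rho_l2 * frobenius_sq(W) + rho1 * l1_norm(W))
```

The L2,1 penalty sums row norms, so it tends to switch off a feature for all tasks at once, whereas the L1 penalty acts on individual entries.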

According to existing reports and our experiments, LeastSparseTrace outperforms the other loss functions. Owing to its sparsity constraint, LeastSparseTrace can improve the descriptive ability of features from different tasks. \(\varPhi (w^v)\) is the regularization term that penalizes the complexity of the weights. The objective function can therefore be rewritten as:

$$\begin{aligned} \arg \min _W \frac{1}{2} \sum _{v=1}^V \left( \frac{1}{2} \parallel Y - \mathcal {F}(X^v;W^v) \parallel ^2\right) +\sum _{v=1}^V \parallel W^v \parallel ^2 , \end{aligned}$$
(7)

where \(W = \{w^v\}\) is the weight matrix, with the same meaning as in Eq. (1). To solve the above problem, the key is how to define an optimized regression function \(\mathcal {F}(X^v;W^v)\).

3.3 Deep neural network based regression

Deep neural networks (DNNs) have proven successful in image description, especially with multi-task learning [17]. In our method, we solve \(f(\cdot )\) by using DNNs, where this mapping is built from activation functions. In a computational network, the activation function of a node defines the output of that node given an input or set of inputs. In a deep neural network, activation functions gradually project \(x^v_i\) to higher-level representations by learning a sequence of nonlinear mappings, which can be defined as:

$$\begin{aligned} (x^v_i)^0\xrightarrow [W]{\mathcal {R}}\left( x^v_i\right) ^1\xrightarrow [W]{\mathcal {R}}...\xrightarrow [W]{\mathcal {R}}\left( x^v_i\right) ^l, \end{aligned}$$
(8)

where l is the number of layers and \(\mathcal {R}\) is the mapping function from input to estimated output.
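The layer-by-layer mapping of Eq. (8) can be illustrated with a small forward pass. This is only a sketch under assumptions not stated in the paper: each mapping \(\mathcal {R}\) is taken to be an affine transform followed by a sigmoid, and the layer sizes in the usage note are hypothetical.

```python
import math


def dense(x, W, b):
    """Affine transform W x + b, with W stored as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]


def sigmoid(z):
    """Elementwise logistic activation (an assumed choice of nonlinearity)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def forward(x, layers):
    """Return the representations (x)^0, (x)^1, ..., (x)^l of Eq. (8).

    layers: list of (W, b) pairs, one per layer.
    """
    activations = [x]
    for W, b in layers:
        x = sigmoid(dense(x, W, b))
        activations.append(x)
    return activations
```

For example, `forward([1.0, 2.0], layers)` with two `(W, b)` pairs returns three vectors: the input, the hidden representation, and the output.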

To optimize the weight matrix W, which contains the mapping parameters, we use a back-propagation strategy. In each epoch of this process, the weight matrix is updated by \(\varDelta W\), which is defined by:

$$\begin{aligned} \varDelta W = - \eta \frac{\partial E}{\partial W} \end{aligned}$$
(9)

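where \(\eta \) is the learning rate and, taking E as half the squared error between \(y_i^v\) and \(\mathcal {R}(x_i^v)\), the gradient is:

$$\begin{aligned} \frac{\partial E}{\partial W} = -(y_i^v-\mathcal {R}(x_i^v))(x_i^v)^T. \end{aligned}$$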
(10)

In this way, we try to minimize the differences between the groundtruth \(y_i^v\) and the estimated output \(\mathcal {R}(x_i^v)\). The back-propagation strategy can be modeled by:

$$\begin{aligned} (x^v_i)^0\xleftarrow [W]{\mathcal {R}}(x^v_i)^1\xleftarrow [W]{\mathcal {R}}...\xleftarrow [W]{\mathcal {R}}(x^v_i)^l. \end{aligned}$$
(11)
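A single update of the form \(\varDelta W = -\eta \, \partial E / \partial W\) can be sketched for the simplest case. The sketch assumes one linear layer, \(\mathcal {R}(x) = Wx\), and the error \(E = \frac{1}{2} \parallel y - Wx \parallel ^2\), for which the gradient is \(-(y - Wx)x^T\); a full network would propagate this term backwards through every layer. All function names are hypothetical.

```python
def predict(W, x):
    """Linear mapping W x, with W stored as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def squared_error(W, x, y):
    """E = 0.5 * ||y - W x||^2 (assumed error function)."""
    return 0.5 * sum((yi - pi) ** 2 for yi, pi in zip(y, predict(W, x)))


def gradient(W, x, y):
    """dE/dW = -(y - W x) x^T for the error defined above."""
    residual = [yi - pi for yi, pi in zip(y, predict(W, x))]
    return [[-r * xj for xj in x] for r in residual]


def update(W, x, y, eta):
    """One gradient step: W <- W + Delta W, with Delta W = -eta * dE/dW."""
    grad = gradient(W, x, y)
    return [[wij - eta * gij for wij, gij in zip(w_row, g_row)]
            for w_row, g_row in zip(W, grad)]
```

Repeating `update` over the training samples with a sufficiently small `eta` reduces the error between the ground truth and the estimated output.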

3.4 Implementation details

In most re-id datasets, the body parts are not labeled. To obtain them, we use EdgeBox to automatically extract body parts [52]. If a detected part is located inside a labeled bounding box of a person, we add it to the body part stream. To get a uniform representation, all the body parts are cropped and normalized to \(16 \times 16\). The people within the bounding boxes are used in the whole person stream. The features of body parts and whole bodies are then extracted in two networks. Both DNNs contain three hidden layers, and the parameters are shown in Fig. 1. The DNNs are implemented based on Caffe [13]. Finally, the outputs of the DNNs are used as features for multi-task learning, and the final re-id decision is made. Multi-task learning is implemented based on MALSAR [50]. The experiments are conducted on a workstation equipped with four Titan X (Pascal) GPUs.
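The part selection and normalization steps can be sketched as follows. This is an illustrative sketch, not the authors' code: the proposals are assumed to be already produced by EdgeBox, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and nearest-neighbour interpolation is an assumed choice, since the interpolation used for the \(16 \times 16\) normalization is not specified.

```python
def inside(part, person):
    """True if box `part` lies fully within box `person` (x1, y1, x2, y2)."""
    return (part[0] >= person[0] and part[1] >= person[1]
            and part[2] <= person[2] and part[3] <= person[3])


def select_parts(proposals, person_boxes):
    """Keep the proposals that fall inside at least one labeled person box."""
    return [p for p in proposals if any(inside(p, b) for b in person_boxes)]


def resize_nearest(patch, size=16):
    """Nearest-neighbour resize of a 2-D list of pixel values to size x size."""
    h, w = len(patch), len(patch[0])
    return [[patch[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]
```

Proposals that pass `select_parts` would be cropped from the image, resized with `resize_nearest`, and fed to the body part stream.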

4 Experimental results and discussion

4.1 Dataset and settings

In this paper, we use the dataset introduced with PersonSearch [31], a large-scale person search image set that has been collected and annotated. Two data sources are exploited to diversify the scenes. On the one hand, hand-held cameras are used to shoot street snaps around a city. On the other hand, images are collected from movie snapshots that contain pedestrians, as they enrich the variations of viewpoints, lighting, and background conditions. In the 18184 images of this dataset, 96143 bounding boxes of pedestrians are annotated. The same person appearing across different images is then associated, resulting in 8432 identities.

In the experiment, the dataset is split into a training subset and a test subset, ensuring that no images or labeled identities overlap between them. The test identity instances are further divided into queries and galleries. For each of the 2900 test identities, one of his/her instances is randomly chosen as the query, while the corresponding gallery set consists of two parts: all the images containing the other instances, and some randomly sampled images not containing this person. Different queries have different galleries, and jointly they cover all the 6978 test images. This process is repeated 20 times, and the average results are recorded.
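The query and gallery construction can be sketched as below. The data layout is an assumption made for illustration: `instances` maps an identity to the list of image ids in which it appears (one instance per image), and `num_distractors` is a hypothetical parameter, since the number of randomly sampled images is not specified here.

```python
import random


def build_gallery(identity, instances, all_images, num_distractors, rng):
    """Return (query_image, gallery_images) for one test identity.

    instances: {identity: [image_id, ...]} (assumed layout).
    all_images: list of all test image ids.
    rng: a random.Random instance, so that repeated runs can use different seeds.
    """
    images = list(instances[identity])
    query = rng.choice(images)
    # Part 1: all images containing the other instances of this person.
    positives = [img for img in images if img != query]
    # Part 2: randomly sampled images that do not contain this person.
    pool = [img for img in all_images if img not in images]
    distractors = rng.sample(pool, min(num_distractors, len(pool)))
    return query, positives + distractors
```

Calling `build_gallery` with 20 different seeds and averaging the resulting scores would mirror the repeated evaluation described above.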

To evaluate the performance, we employ cumulative matching characteristics (CMC top-K). This metric is inherited from the person re-id problem: a match is counted if at least one of the top-K predicted bounding boxes overlaps with the ground truth with an intersection-over-union (IoU) greater than or equal to 0.5. To simplify the presentation, we use top-1 as the evaluation metric.
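The matching rule can be written down compactly. The following is a minimal sketch with assumed data structures: boxes are `(x1, y1, x2, y2)` tuples, each ranked prediction is an `(image_id, box)` pair sorted by decreasing similarity, and the ground truth of a query maps each gallery image that contains the person to its box.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(ranked, gt, k=1, thr=0.5):
    """True if any of the top-k predictions matches the ground truth.

    ranked: list of (image_id, box), best match first.
    gt: {image_id: box} for the gallery images containing the query person.
    """
    return any(img in gt and iou(box, gt[img]) >= thr
               for img, box in ranked[:k])


def cmc_top_k(queries, k=1, thr=0.5):
    """Fraction of queries with a hit among their top-k predictions.

    queries: list of (ranked, gt) pairs, one per query.
    """
    hits = sum(1 for ranked, gt in queries if is_hit(ranked, gt, k, thr))
    return hits / len(queries)
```

With `k=1`, `cmc_top_k` gives the top-1 score used as the evaluation metric.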

4.2 Experimental results

In the proposed multi-task learning framework, \(\gamma \) in Eq. (6) may influence the performance. We therefore try different settings of \(\gamma \), and the results are shown in Fig. 2. The best performance is achieved when \(\gamma = 0.7\), and this setting is used in the following experiments.

Fig. 2 Performance with different values of \(\gamma \)

In our experiments, we compare the proposed method with several state-of-the-art methods: PersonSearch, DomainDropout, and TripletLoss. They are all based on deep learning and achieve the best performance on the dataset we use. In addition, their theoretical contributions are similar to those of the proposed method.

  1. PersonSearch [31]: instead of breaking person search down into two separate tasks, pedestrian detection and person re-identification, this method handles both aspects jointly in a single convolutional neural network. An online instance matching (OIM) loss function is proposed to train the network effectively, and it is scalable to datasets with numerous identities.

  2. DomainDropout [30]: different from the standard Dropout, which treats all the neurons equally, this method assigns each neuron a specific dropout rate for each domain according to its effectiveness on that domain.

  3. TripletLoss [10]: this work first introduces variants of the classic triplet loss that render mining of hard triplets unnecessary and evaluates these variants systematically. It then shows that, contrary to the prevailing opinion, a triplet loss with no special layers achieves state-of-the-art results both with a pretrained CNN and with a model trained from scratch.

The performance of the different methods is shown in Fig. 3, from which we draw the following conclusions:

  1. The proposed MDNN achieves the best performance.

  2. DomainDropout produces stable results.

  3. PersonSearch performs well, but PersonSearch and TripletLoss are less stable than the other methods.

Fig. 3 Comparison of re-id accuracy

Some typical re-id results are shown in Fig. 4. They indicate that the proposed method is effective for re-id.

Fig. 4 Sample re-id results

5 Conclusion

In this paper, we propose a novel person re-identification method. It improves on previous methods by employing deep learning and multi-task learning. First, we define the task of re-id as the combination of two tasks, part detection and person recognition, and design a two-stream strategy accordingly. Second, to train the mapping model from images to identities, we propose a multi-task learning framework based on deep neural networks (DNNs). In this framework, DNN-based feature mapping and multi-task learning are connected to obtain a DNN-based regression for re-id. Third, experimental results show that the proposed method outperforms state-of-the-art re-id methods.