
1 Introduction

Person re-identification (Re-ID), also known as pedestrian re-identification, is a computer vision technology that determines whether the same target pedestrian appears in images from non-overlapping camera views. Early research on pedestrian re-identification focused on hand-crafted pedestrian feature extraction, identification across different camera angles, and distance-metric learning methods [1]. However, owing to the cost of manual labeling and the limitations of traditional methods, this line of research made little progress. With the development of machine learning and deep learning, and following the success of the Hinton team's convolutional neural network in the 2012 ImageNet competition [2], CNNs were introduced into pedestrian detection and achieved good experimental results. Since 2014, researchers have attempted to apply deep learning to the pedestrian re-identification problem. Deep learning offers a way to solve computer vision problems without extensive manual annotation of image features: the back-propagation algorithm dynamically adjusts the parameters of the CNN so that feature extraction and pairwise comparison are unified in a single network.

In the real world, however, the appearance of a pedestrian is strongly affected by differences in camera angle, illumination, height, and so on. Hand-crafted annotation features handle such variation well, and applying them can make a method more reliable in the pedestrian re-identification task. To combine the manually annotated features with CNN features effectively, a deep feature fusion network is proposed: the annotated attribute features are used to adjust the CNN training process, while the features extracted by the CNN in turn complement the manually annotated features.

Experiments on three challenging pedestrian re-identification attribute data sets (VIPeR, PRID, and CUHK) demonstrate the validity of the new features. Compared with existing methods, the rank-1 recognition rate is significantly improved. In summary, the manually annotated attribute features improve the CNN feature extraction process and yield a more robust image representation.

2 Related Work

Owing to the outstanding performance of deep learning in pedestrian detection and recognition, recent work on pedestrian re-identification has focused mainly on deep learning, and many researchers have proposed improved algorithms. There are two main approaches to pedestrian re-identification with deep learning. The first uses end-to-end techniques that extract pedestrian features with a convolutional neural network (CNN), such as DeepReID, a deep filter pairing neural network for person re-identification. The second combines high-level semantic attributes of pedestrians. For example, Li Jiali et al. [3] added multi-classification features based on human-body-structure detection to deep learning features and established a multi-feature fusion model with enhanced depth features; reference [4] not only identifies each type of pedestrian attribute separately but also arranges and combines pedestrian attributes, then jointly recognizes multiple attributes and individual attributes; reference [5] jointly recognizes pedestrian attributes and pedestrian IDs, making full use of the pedestrian annotation information and improving recognition accuracy; and reference [8] combined color histogram features (RGB, HSV, YCbCr, Lab, and YIQ) and texture features with the features extracted by a traditional CNN, greatly improving recognition accuracy. These efforts have driven rapid progress in pedestrian recognition.

Approaches that combine feature extraction and image-pair classification in a single CNN, comparing paired pictures through a symmetric structure, are widely used, but pairwise comparison requires forming many pairs for each probe image and performing deep convolutions on all of them [6]. Inspired by [6], we propose a feature fusion network that extracts deep features from a single image without paired inputs. The manually labeled pedestrian attribute features are merged with the deep features in a single unified network, and the loss function proposed in this paper is then used to optimize the recognition results. The test results show that the proposed algorithm outperforms current mainstream pedestrian recognition algorithms on public data sets, especially in rank-1 accuracy.

3 Model

3.1 Network Structure

We use a fine-tuned feature fusion network to learn the new features. The network consists of two parts, as shown in Fig. 1. The first part uses a ResNet to extract features from pedestrian sample images; ResNet alleviates the degradation and convergence problems of training deep multi-layer CNNs and can extract deeper features. The second part processes the hand-crafted features of the same picture. Finally, the two sub-networks are combined to produce a complete image description, with the hand-crafted branch adjusting the features produced by the first part during training.

Fig. 1. Fusion feature network structure model

3.2 CNN Features

Image features are extracted with ResNet, shown in the upper part of Fig. 1. We use ResNet-50, a residual network 50 layers deep composed mainly of convolutional layers, pooling layers, and residual blocks. As deep learning developed, researchers found that as the number of network layers grows, the network suffers from vanishing and exploding gradients during training and, as a result, fails to converge. Unlike a general CNN, ResNet has a distinctive residual block structure: by learning a residual function and realizing an identity mapping, it avoids network degradation and convergence problems during training without introducing additional parameters or computational complexity.
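To make the residual structure concrete, the following is a minimal sketch of one residual block in Keras (the framework used in Sect. 4.2). The filter counts are illustrative, and the input is assumed to already have the matching number of channels; the actual ResNet-50 block also uses 1*1 bottleneck convolutions and projection shortcuts when shapes differ.

```python
# Minimal sketch of a residual block: the block learns the residual
# F(x) and adds it to the identity-mapped input, giving F(x) + x.
# Assumes x already has `filters` channels; real ResNet-50 blocks use
# 1x1 bottlenecks and a projection shortcut when shapes differ.
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                   # identity mapping
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                # F(x) + x
    return layers.Activation("relu")(y)
```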

In our framework, back-propagation lets the hand-crafted features influence the parameters of the entire CNN. Our goal is to combine the features into a unified feature space. We propose a feature fusion deep neural network that adjusts the CNN features with the hand-crafted features, so that the CNN extracts complementary features and yields a more complete representation. A buffer layer serves as a bridge between the extracted CNN features and the composite attribute features, reducing the large gap between the different feature types and ensuring the convergence of the fusion layer. If the input of the fusion layer is:

$$ X = \left[ \text{Composite\_Attribute\_Features},\ \text{CNN\_Features} \right], $$
(1)

Then the output of this layer is calculated by the following formula:

$$ Z_{Fusion} \left( x \right) = h\left( W_{Fusion}^{T} x + b_{Fusion} \right), $$
(2)

where h(·) denotes the activation function.
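The following is a minimal sketch of the buffer and fusion layers of Eqs. (1) and (2) in Keras; the attribute-vector dimension, layer widths, number of identities, and the choice of ReLU for h(·) are all assumptions made for illustration, not the exact configuration of our network.

```python
# Sketch of the buffer + fusion layers (Eqs. 1 and 2).
# Dimensions, ReLU as h(.), and num_ids are illustrative assumptions.
from tensorflow.keras import Input, Model, layers
from tensorflow.keras.applications import ResNet50

img_in = Input(shape=(256, 128, 3))   # pedestrian image
attr_in = Input(shape=(105,))         # composite attribute features (dim. assumed)

# CNN branch: pre-trained ResNet-50 backbone with global average pooling.
backbone = ResNet50(include_top=False, pooling="avg", weights="imagenet")
cnn_feat = backbone(img_in)           # 2048-d CNN feature

# Buffer layer: bridges CNN features and attribute features.
buffered = layers.Dense(512, activation="relu")(cnn_feat)

# Fusion layer: Z_Fusion(x) = h(W^T x + b) on X = [attr, CNN] (Eq. 1).
x = layers.Concatenate()([attr_in, buffered])
fusion = layers.Dense(512, activation="relu")(x)

# Identity classifier trained with the softmax loss (num_ids assumed).
num_ids = 1000
out = layers.Dense(num_ids, activation="softmax")(fusion)
model = Model([img_in, attr_in], out)
```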

Existing deep networks for person re-identification adopt deviance loss or maximum mean discrepancy as the loss function [6]. Our goal is to extract the deep features of each image effectively rather than to compare image pairs through a deep neural network, so we use the softmax loss function: a more discriminative feature representation yields a lower softmax loss.

3.3 Hand-Crafted Features

Semantic attributes of pedestrian images, such as gender, clothing, and backpack color, can serve as auxiliary information to improve pedestrian recognition accuracy. Using manually annotated attribute features has several advantages. First, because many people have similar appearances (e.g., clothing color, backpack, hair color), it is difficult to manually match the same pedestrian across low-resolution cameras, whereas labeling pedestrian attributes is simpler and more accurate. Second, the number of pedestrian attribute classes is smaller than the number of distinct pedestrians, since different pedestrians share the same attributes.

The training data set contains N pedestrian images, and each pedestrian image carries multiple attribute annotations (e.g., gender male, upperbody black), which we organize into groups. Following [4], each image is labeled with G attribute groups (e.g., gender, age), and every attribute group has a different number of attributes, denoted K(g); for example, the gender group has male and female, and the upperbody-clothing group has sweater, t-shirt, suit, and so on. We assume that each attribute group takes exactly one value per image; for example, if a pedestrian's upperbody color is both black and white, we select only one color as the attribute value.
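For concreteness, here is a hypothetical encoding of such attribute groups; the group names and values are examples in the spirit of Table 2, not the exact grouping used in our experiments.

```python
# Hypothetical attribute grouping: G groups, each with K(g) mutually
# exclusive values; exactly one value is selected per group per image.
ATTRIBUTE_GROUPS = {
    "gender":    ["male", "female"],
    "age":       ["under15", "15-30", "30-45", "45-60", "over60"],
    "upperbody": ["sweater", "tshirt", "suit", "jacket"],
    "ub_color":  ["black", "white", "red", "blue", "grey"],
}

def encode(labels):
    """One-hot encode one value per group and concatenate, e.g.
    {"gender": "male", "ub_color": "black"} -> fixed-length vector."""
    vec = []
    for group, values in ATTRIBUTE_GROUPS.items():
        one_hot = [0] * len(values)
        if group in labels:  # the single annotated value for this group
            one_hot[values.index(labels[group])] = 1
        vec.extend(one_hot)
    return vec
```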

Since the number of categories differs across attribute groups, we define a weighted cross-entropy loss function. The probability of output node j is computed as follows:

$$ P_{(y = j)} = \frac{e^{a_{j}}}{\sum_{k=1}^{T} e^{a_{k}}} $$
(3)

where T represents the number of categories. The cross-entropy loss function is as follows:

$$ L = - \frac{1}{N^{g}} \sum_{j=1}^{N} \sum_{k=1}^{K^{g}} \frac{y_{j} \log P_{j}}{N_{k(i)}^{g}} $$
(4)

where N represents the number of pedestrian images, \( N^{g} \) represents the number of pedestrian images in the g-th attribute group, \( N_{k(i)}^{g} \) is the number of training samples of the k-th attribute in the g-th attribute group, and \( P_{j} \) is computed by formula (3).
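A minimal sketch of this weighted cross-entropy for a single attribute group follows, assuming the per-attribute training-sample counts \( N_{k}^{g} \) are known in advance; it is written against TensorFlow/Keras, the framework used in Sect. 4.2.

```python
# Sketch of the weighted cross-entropy (Eq. 4) for one attribute group:
# each class k is weighted by 1 / N^g_k, the inverse of its sample count.
import tensorflow as tf

def weighted_group_loss(class_counts):
    """class_counts: list of N^g_k, training samples per attribute k."""
    w = tf.constant([1.0 / c for c in class_counts], dtype=tf.float32)

    def loss(y_true, y_pred):
        # y_true: one-hot labels y_j; y_pred: softmax outputs P_j (Eq. 3).
        per_class = -y_true * tf.math.log(y_pred + 1e-8) * w
        return tf.reduce_mean(tf.reduce_sum(per_class, axis=-1))

    return loss
```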

4 Experiment

4.1 Dataset

This paper uses the PETA (PEdesTrian Attribute) data set, the largest data set currently available for pedestrian attribute recognition tasks. It contains 8,705 pedestrians with a total of 19,000 images (resolutions from 17*39 to 169*365). Each pedestrian is annotated with 61 binary attributes (e.g., whether the person is under 15 years old) and 4 multi-class attributes (e.g., upperbody colors, several of which may coexist). In fact, PETA is a collection of several smaller pedestrian re-identification data sets labeled with attributes; some of the included data sets are shown in Table 1:

Table 1. Partial data sets included in PETA.

We divided all the manually annotated attributes into 8 groups, each containing a different number of attributes, as shown in Table 2. Attributes with fewer than 10 samples are removed. In addition, if an attribute group has two attribute values at the same time (for example, upperbody has both black and white), we randomly select one value as the attribute label.

Table 2. The attribute groups

4.2 Setup

We evaluate the fine-tuned CNN features on three pedestrian re-identification databases and implement our approach in the Keras framework. We resize all training images to 256*128 pixels, pad them with 10 pixels, and then randomly crop 256*128 sub-windows. At test time, we resize all input images to 256*128 pixels. The CNN parameters are initialized from the pre-trained ResNet-50 model, except for the last fully connected layer, which starts from random weights. The batch size is set to 256, the initial learning rate to \( \gamma = 0.0001 \), and every 20,000 iterations the rate is reduced to \( \gamma_{new} = 0.1 * \gamma \).
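A sketch of this preprocessing and step-wise learning-rate schedule follows; the padding mode and the choice of SGD are assumptions, since implementation details beyond those stated above are not specified here.

```python
# Sketch of the preprocessing and step-wise learning-rate schedule
# described above (zero padding and SGD are assumptions).
import tensorflow as tf

def preprocess(image, training=True):
    image = tf.image.resize(image, (256, 128))
    if training:
        image = tf.pad(image, [[10, 10], [10, 10], [0, 0]])  # 10-pixel pad
        image = tf.image.random_crop(image, (256, 128, 3))   # random sub-window
    return image

# gamma = 1e-4 initially, multiplied by 0.1 every 20,000 iterations.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=20_000,
    decay_rate=0.1,
    staircase=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)
```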

The proposed method was tested on the three data sets and compared with other methods; the experimental results are shown in Table 3. The rank-1 results on the VIPeR, PRID, and CUHK data sets were 45.23%, 50.22%, and 49.20%, respectively. Compared with the methods listed in the table, the accuracy achieved by the proposed method on all three data sets is considerably higher, which demonstrates its effectiveness.

Table 3. Performance comparison with different methods

5 Conclusion

This paper proposed a network structure that fuses processed manually annotated attribute features with CNN-extracted features and then uses the proposed loss function to optimize the recognition results. The CNN features are extracted by a ResNet backbone and combined with the hand-crafted attribute features, merging the two sub-networks into a single network structure. Through back-propagation, the attribute features supplement the CNN features extracted by the network, yielding a more complete feature representation and more accurate pedestrian recognition. Experiments on three challenging public data sets (VIPeR, PRID, CUHK) confirm the effectiveness of the proposed method. Future work will further study pedestrian attributes in the hope of further improving re-identification accuracy.