1 Introduction

With the rapid development of deep learning, more and more convolutional neural network models effectively deal with problems such as image classification and object detection in computer vision. Current large convolutional networks require large-scale data to ensure good generalization. Person re-identification (person re-ID), identifying pedestrians across different camera views, faces the challenge that many person re-ID datasets are small. Although recognition accuracy on different datasets has been increasing, the problem that some large networks cannot work well on small datasets still exists. For instance, the same CNN model achieves higher recognition accuracy on Market-1501 [36], which contains over 30,000 images, than on PRID450s [27], which contains fewer than 1,000 images.

Data augmentation, viewed as a data preprocessing method, generates new training samples from the original dataset. It is widely used to increase dataset size in image classification, object recognition and person re-identification. Moreover, data augmentation has gradually come to play a key role in deep learning because it effectively alleviates the over-fitting of large convolutional network models. Two commonly used data augmentation methods, Random Cropping [32] and Random Flipping [29], effectively improve the generalization ability of most existing CNN models. In detail, Random Cropping decreases the influence of the background on the CNN decision and focuses more on parts of the object than on the whole object. Random Flipping enables CNN models to learn from different orientations of the input images. Thus, both methods operate on a single image in isolation.

Recently, Random Erasing [41] has been proposed and can be implemented in most existing CNN models. For person re-identification, it generates occluded images by selecting a random rectangular region in a pedestrian image and erasing its pixels with random values. Training images with different levels of occlusion are generated to reduce the risk of over-fitting. However, it only addresses the simple case in which noise is embedded within a single image.

In this paper, we propose Random Linear Interpolation (RLI), a new data augmentation method that generates images with more complicated occlusion for person re-ID datasets. Random Linear Interpolation keeps a portion of the images unchanged and generates new images by fusing pairs of images with linear interpolation [3, 4, 23]. Many mixed images, which look similar to the original images but are not identical to them, are generated during training to strengthen the generalization ability of CNN models. Furthermore, Random Linear Interpolation greatly improves recognition accuracy in person re-ID and makes CNN models robust to outliers and to images with varying levels of occlusion. Examples of Random Linear Interpolation are shown in Fig. 1.

Fig. 1 Examples of Random Linear Interpolation applied to two person re-identification datasets, i.e., Market1501 and DukeMTMC-reID

Fig. 1 shows two randomly generated images with different interpolation strengths on Market1501 [36] and DukeMTMC-reID [39]. Image A is viewed as the base image, and image B is fused into it. The newly generated image A* shares features of both image A and image B. μ is a hyper-parameter that controls the strength of interpolation between pedestrian images. We can clearly observe that the similarity between image A and image A* is determined by μ. For instance, when μ is equal to 0.3004, only a few features of the pedestrian in image A remain in the generated image A* on Market1501. As μ increases, more of the original features remain in the new images. Specifically, when μ is equal to 0.9212, only a few features of the pedestrian change on DukeMTMC-reID.

In summary, this paper makes the following contributions:

  • We propose a new data augmentation method, Random Linear Interpolation (RLI), which is lightweight and can be implemented in most existing convolutional neural network models to improve their generalization ability.

  • For person re-identification, RLI reduces the I/O requirement and introduces various levels of complicated occlusion by fusing images within mini-batches during training. The proportion of interpolated samples is adjusted softly in RLI to control the model's ability to learn from outliers.

  • The proposed method improves the performance of baseline models including ResNet and DenseNet. Furthermore, we explore a new direction for data augmentation by operating on pairs of random samples rather than on a single sample in isolation.

2 Related work

2.1 Data augmentation

Data augmentation was first proposed in [31] to deal with missing-value problems, such as training samples without labels. Recently, it has received renewed attention as convolutional neural networks (CNNs) have developed rapidly. Data size is vital to most existing CNN models because a CNN model trained with a large number of samples has a lower risk of over-fitting. Data augmentation is the technique that enlarges the number of training samples by processing the original samples. In [20], random cropping was first proposed to construct a large dataset of 80 million tiny images for object and scene recognition. Simonyan and Zisserman [29] further adopted random flipping to augment the training dataset for large-scale image recognition.

Later, to deal with the tasks of image classification, object detection and person re-identification, random erasing [41] was proposed to randomly select a rectangular region in a training image and erase its pixels with random values. The deep convolutional generative adversarial network (DCGAN) was proposed in [39]. In DCGAN, a generator subnetwork is used to generate virtual data and a discriminator network is used to identify whether a sample is virtual or real. Different from these methods, this paper presents a new method named Random Linear Interpolation, which randomly generates mixed images by fusing one image with another and feeds both generated and real images into the CNN model, thereby enhancing the robustness of the model to outliers.

2.2 Person re-identification

Person re-identification is a challenging image retrieval problem [7, 21, 42]. Due to factors such as varying lighting and camera angles, the same person displays different appearance features, which poses a great challenge for recognition. In order to find the invariant features of pedestrian images across different camera views, a large number of distance metric learning methods, including KISSME, KCCA, MLF, EIML, RPLM and APML [13, 14, 22, 25, 28, 33], have been proposed. Some methods based on dictionary learning [6, 17, 18, 19] have also been used to deal with the unsupervised person re-ID problem because they can convert high-dimensional visual features into low-dimensional sparse codes.

In recent years, due to the continuous expansion of person re-ID datasets, many convolutional neural network models have been proposed to improve the accuracy of pedestrian recognition. Zheng et al. [37] proposed the IDE method, which deals with person re-ID by learning an identity-discriminative embedding. By using the triplet loss to maximize the distance between positive and negative samples while minimizing the distance between positive pairs, TriNet [12] alleviates the problem of small dataset scale and works effectively with various convolutional neural network models. SVDNet [30] optimizes the deep representation learning process in CNN training for person re-identification. Zheng et al. [39] use a deep convolutional neural network to generate unlabeled samples and propose a label smoothing regularization method (LSRO) for integrating the unlabeled outliers.

3 Our approach

In this section, we present the details of implementing Random Linear Interpolation in convolutional neural network models for person re-identification. We first describe RLI, which randomly selects two images from the training set to interpolate. Then the theoretical basis of Random Linear Interpolation for person re-ID is analyzed. Finally, we analyze the differences between Random Linear Interpolation and other data augmentation methods.

3.1 Random linear interpolation

Random Linear Interpolation is used during the training process to randomly mix two images by linear interpolation within the convolutional neural network model. Suppose there are n images in a mini-batch; we keep k samples unchanged and perform the linear interpolation operation on the remaining images. In the data pre-processing step, numerous virtual samples similar to the originals are generated randomly. Each new image is the fusion of two random original images; in other words, such a sample in the training set owns two labels.

Algorithm 1 Random Linear Interpolation

Two original images Ia = (xa, ya) and It = (xt, yt), where Ia is the sample to be interpolated and It is a random image from the same mini-batch, are selected and all of their pixels are mixed. Specifically, the new virtual sample has a soft label representation, which differs from the hard label representation in which one sample corresponds to exactly one label. A sample in the new dataset can be written as \( I_{a}^{*} = (\overline {x}_{a},\overline {y}_{a}) \), where \( \overline {x}_{a} \) represents the features of the virtual sample generated by Random Linear Interpolation and \( \overline {y}_{a}\) is the mixed label of (ya, yt). Moreover, we keep Ia unchanged if it is not augmented by linear interpolation. The new virtual sample generated by Random Linear Interpolation is defined as:

$$ \overline{x}_{a}\left\{ R,G,B \right\} = \mu x_{a}\left\{ R,G,B \right\} + (1-\mu)x_{t}\left\{ R,G,B \right\} $$
(1)
$$ \overline{y}_{a} = \mu y_{a} + (1-\mu)y_{t} $$
(2)

where R, G, B represent the pixel values of the three channels of the original input images. μ is a random number drawn from a Beta distribution Beta(α, β). For simplicity, we set β equal to α in this paper. Note that, in training, we use only one data loader to obtain each mini-batch and generate virtual samples within the same mini-batch. This operation reduces the I/O requirement and also reduces confusion in the dataset. The pipeline of Random Linear Interpolation is shown as Algorithm 1.
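To make the pipeline concrete, the following PyTorch-style sketch mixes samples within a single mini-batch under the settings above; the function name rli_mix, the one-hot label tensors and the per-sample drawing of μ are our illustrative assumptions rather than the authors' released code.

```python
import numpy as np
import torch


def rli_mix(images, labels_onehot, alpha=0.1, gamma=0.4):
    """Random Linear Interpolation within one mini-batch (illustrative sketch).

    images:        float tensor of shape (n, 3, H, W)
    labels_onehot: tensor of shape (n, num_classes)
    alpha:         Beta-distribution parameter controlling interpolation strength
    gamma:         probability that a sample is interpolated; the rest stay unchanged
    """
    labels_onehot = labels_onehot.float()
    n = images.size(0)
    mixed_images = images.clone()
    mixed_labels = labels_onehot.clone()

    interpolate = torch.rand(n) < gamma          # which samples get interpolated
    partners = torch.randperm(n)                 # partners come from the same mini-batch

    for i in torch.nonzero(interpolate, as_tuple=False).flatten().tolist():
        mu = float(np.random.beta(alpha, alpha))                          # mu ~ Beta(alpha, alpha)
        j = int(partners[i])
        mixed_images[i] = mu * images[i] + (1.0 - mu) * images[j]         # Eq. (1)
        mixed_labels[i] = mu * labels_onehot[i] + (1.0 - mu) * labels_onehot[j]  # Eq. (2)

    return mixed_images, mixed_labels
```

Because the partner images are drawn from the current mini-batch, no additional images have to be loaded, which is how the I/O cost stays unchanged.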

3.2 Random linear interpolation for person re-identification

In the person re-identification problem, we focus on finding a function f ∈ F that matches the same identity across different camera views. Most existing methods based on convolutional neural networks belong to supervised learning. Recent convolutional neural networks contain a large number of parameters, sometimes even more than the number of samples in the dataset; they may over-learn the features of the training data and generalize poorly to outliers and occlusion. Therefore, we try to learn from augmented data that approximates the original samples, which fits well with the idea of Vicinal Risk Minimization (VRM) [5].

Assume that the distribution of the raw dataset \(D=\left \{ (x_{i},y_{i}) \right \}_{i = 1}^{m}\) is P, with (xi, yi) ∼ P for all i = 1, 2, ... , m. In most practical situations the data distribution P is unknown, and \(\hat {P}\) is defined to approximate the true distribution P. VRM is a principle that minimizes the expected risk under such an approximated vicinal distribution. The approximated distribution \(\hat {P}\) in VRM is shown as (3):

$$ \hat{P}_{v}(\overline{x},\overline{y}) = \frac{1}{m}\sum\limits_{i = 1}^{m}v(\overline{x},\overline{y} |x_{i}, y_{i}) $$
(3)

where \(v(\overline {x},\overline {y} | x_{i},y_{i}) \) is the vicinal distribution. However, instead of the Gaussian kernel used in classical VRM, we construct a virtual dataset \(D_{v} = \left \{ (\overline {x}_{i},\overline {y}_{i}) \right \}_{i = 1}^{n}\) whose samples are generated by the linear interpolation of two random samples in D. The new vicinal distribution can be written as (4) and (5):

$$ P^{*}(\overline{x},\overline{y}) = \frac{1}{n} \sum\limits_{i = 1}^{n} \delta(\overline{x},\overline{y}) $$
(4)
$$ \delta(\overline{x},\overline{y}) = \delta(\overline{x}= \mu \cdot x_{i}+(1-\mu)\cdot x_{j}, \overline{y}=\mu \cdot y_{i}+(1-\mu) \cdot y_{j}) $$
(5)

where δ(⋅) is the Dirac distribution and (xj, yj) represents a random sample in D. As a hyper-parameter, μ controls the similarity between virtual and real samples.
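As a small worked example of how μ shapes a virtual label (the 4-class setting and the value μ = 0.7 below are arbitrary choices for illustration only):

```python
import numpy as np

# Two one-hot labels from a toy 4-class problem (values chosen only for illustration).
y_i = np.array([1.0, 0.0, 0.0, 0.0])   # identity i
y_j = np.array([0.0, 0.0, 1.0, 0.0])   # identity j
mu = 0.7                                # interpolation strength drawn from Beta(alpha, alpha)

y_mixed = mu * y_i + (1.0 - mu) * y_j
print(y_mixed)                          # [0.7 0.  0.3 0. ] -- mostly identity i, partly identity j
```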

3.3 Comparison with other data augmentation methods

We compare Random Linear Interpolation (RLI) with three effective data augmentation methods for person re-identification: random cropping, random flipping and random erasing. First, random cropping is a classic data augmentation method in which training images are cropped randomly to fit the input size of the convolutional neural network. With random cropping, parts of the input pedestrian images receive more attention than the whole image, while background information can be ignored. Different from random cropping, Random Linear Interpolation takes each pedestrian image as a whole and fuses another image into it linearly. In other words, although we change all the pixel values of an image, we do not change the internal structure of the image data.

Random flipping is another data augmentation method commonly used in person re-identification. It enables a CNN model to learn the same image in different orientations and does not incur information loss in pedestrian images during augmentation. In comparison with random flipping, each time the model is trained with Random Linear Interpolation, the newly generated image has different values for all of its pixels. Moreover, the fusion of two images can be viewed as adding various levels of noise to one of the images.

Recently, random erasing has been introduced to deal with person re-ID problems. It randomly selects a rectangular region in an image and erases its pixels with random values. To make the CNN model robust to occlusion in person re-identification, various levels of occlusion in pedestrian images are considered during training. Compared to random erasing, Random Linear Interpolation generates a new image by fusing a pair of images. In addition, by changing the pixels of the whole image through linear interpolation, RLI expands the scope of image occlusion and improves the generalization ability of the model.

In our experiments, we observe that all of these data augmentation methods improve the recognition accuracy of person re-ID, and the proposed method achieves the best performance.

4 Experiments

In this section, we describe in detail the experimental evaluation of Random Linear Interpolation (RLI) for person re-identification. Two standard datasets, i.e., Market1501 and DukeMTMC-reID, are introduced first, and then we evaluate the performance of the proposed method on several baseline models, namely ResNet18, ResNet34, ResNet50, ResNet101, ResNet152 [11] and DenseNet [15]. Finally, we analyze the results of RLI compared with state-of-the-art data augmentation and person re-identification methods on the two benchmark datasets.

4.1 Datasets

Two commonly used datasets, Market1501 and DukeMTMC-reID, are summarized in Table 1, and we conduct extensive experiments on both. Market1501 was collected by six cameras in front of a supermarket at Tsinghua University. Overall, this dataset contains 32,668 bounding boxes of 1,501 pedestrian identities, with about 20 images per identity on average. Bounding boxes of the 1,501 identities were labeled both by hand and by the Deformable Part Model (DPM) [8]. We follow [36] and use 12,936 images for training and the remaining 19,732 images for testing.

Table 1 Information of four person re-ID datasets in our experiments

DukeMTMC-reID is derived from the DukeMTMC tracking dataset and contains a total of 36,411 images of 1,404 identities. We use 702 identities for training and the remaining identities for testing. All images in this dataset are captured by 8 cameras, and pedestrian bounding boxes are labeled by hand. Following [39], in testing we pick one query image for each identity in each camera and put the remaining images in the gallery.

4.2 Settings

4.2.1 Baseline CNN model for person Re-ID

ResNet is chosen as the baseline model to evaluate the performance of the proposed method. In our experiments, we evaluate five ResNet variants: ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152. Note that we remove the fully connected layer of ResNet, add a linear layer after the backbone, and follow it with Batch Normalization [16] to regularize each mini-batch of inputs. Moreover, we use Leaky ReLU [34] as the activation function with a negative slope of 0.01, and add a dropout layer with probability 0.5. We initialize the model with ResNet weights pre-trained on ImageNet for fine-tuning and set the biases to 0.
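A minimal sketch of this modified head is given below, assuming a torchvision ResNet-50 backbone; the 512-dimensional bottleneck and the class name ReIDBaseline are our own illustrative choices, since the text does not specify them.

```python
import torch.nn as nn
from torchvision import models


class ReIDBaseline(nn.Module):
    """ResNet-50 backbone with the modified head described above (illustrative sketch)."""

    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        backbone = models.resnet50(pretrained=True)   # ImageNet pre-trained weights
        backbone.fc = nn.Identity()                   # drop the original fully connected layer
        self.backbone = backbone
        self.bottleneck = nn.Sequential(
            nn.Linear(2048, feat_dim),                # added linear layer
            nn.BatchNorm1d(feat_dim),                 # batch normalization [16]
            nn.LeakyReLU(negative_slope=0.01),        # Leaky ReLU [34] with slope 0.01
            nn.Dropout(p=0.5),                        # dropout with probability 0.5
        )
        self.classifier = nn.Linear(feat_dim, num_classes)
        nn.init.zeros_(self.classifier.bias)          # biases initialized to 0

    def forward(self, x):
        feat = self.backbone(x)                       # (batch, 2048) pooled features
        feat = self.bottleneck(feat)
        return self.classifier(feat)                  # identity logits
```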

During training, we divide the input images into mini-batches and adopt the stochastic gradient descent algorithm proposed in [10]. Note that we set the training batch size of ResNet101 and ResNet152 to 16 due to memory limits, and use a batch size of 32 for the other networks. The size of the input images is 256 × 128. The learning rate is set to 0.1 for the fully connected and classification layers and 0.01 for the other layers of the ResNet50 model. After 40 training epochs, we decrease the learning rate to 0.001. The momentum is 0.9 and the weight decay is set to 5e-4. We train all networks for 60 epochs.
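The optimizer configuration described above might be set up as follows; this sketch reuses the hypothetical ReIDBaseline class from the previous sketch and assumes the learning-rate drop applies uniformly as a factor of 10.

```python
import torch

# "model" is an instance of the ReIDBaseline sketch above; 751 is the number of
# training identities in Market1501 and is used here only as an example.
model = ReIDBaseline(num_classes=751)

backbone_params = model.backbone.parameters()
new_params = list(model.bottleneck.parameters()) + list(model.classifier.parameters())

optimizer = torch.optim.SGD(
    [
        {"params": backbone_params, "lr": 0.01},   # pre-trained layers
        {"params": new_params, "lr": 0.1},         # added FC and classification layers
    ],
    momentum=0.9,
    weight_decay=5e-4,
)

# Divide every learning rate by 10 after 40 epochs (0.01 -> 0.001 for the backbone);
# training runs for 60 epochs in total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)
```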

After augmentation, an input image in a mini-batch has two target labels. Most existing loss functions cannot be applied directly to this situation, in which the error between the predicted label ypred and two ground-truth labels (ya, yb) must be calculated. Thus, we adopt the loss function shown in (6):

$$ loss = \mu \cdot (y_{pred}-y_{a}) + (1-\mu) \cdot (y_{pred} - y_{b}) $$
(6)
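One possible reading of (6), treating each error term as the classification criterion (cross-entropy in this sketch), is shown below; this interpretation is ours and is equivalent to computing cross-entropy against the soft label of (2).

```python
import torch.nn.functional as F


def rli_loss(logits, y_a, y_b, mu):
    """Mixed loss of Eq. (6), reading each error term as cross-entropy (our assumption).

    logits: (batch, num_classes) network outputs
    y_a:    (batch,) integer labels of the base images
    y_b:    (batch,) integer labels of the fused partner images
    mu:     interpolation strength used when the pair was mixed
    """
    return mu * F.cross_entropy(logits, y_a) + (1.0 - mu) * F.cross_entropy(logits, y_b)
```

For a per-sample μ, the two cross-entropy terms can be computed with reduction='none' and weighted element-wise before averaging.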

For ResNet50, the training loss over the first 10 epochs on Market1501 and DukeMTMC-reID is shown in Fig. 2. It can be seen that the training loss of the model with Random Linear Interpolation is above the baseline, but this does not affect recognition accuracy.

Fig. 2 Training loss over 10 epochs on DukeMTMC-reID (a) and Market1501 (b)

4.2.2 Leaky ReLUs and dropout

Like the rectified linear unit (ReLU), Leaky ReLU is a commonly used activation function that alleviates the vanishing gradient problem by passing positive values through unchanged. It was first proposed for acoustic models [26]. In comparison with ReLU, Leaky ReLU allows negative values, assigning all of them a small non-zero slope. Xu et al. [34] demonstrated that Leaky ReLU performs better than ReLU on image classification tasks. We use Leaky ReLU as the activation function for better performance in person re-identification.

Because of large cross-view misalignment, occlusions and pose variations in person re-ID, some patch information on the same pedestrian is likely to be learned incorrectly during training. We adopt the dropout strategy to overcome the influence of mismatched patches and unreliable neural units. When a training sample is fed in as input in each epoch, some outputs of the convolutional layers are randomly set to zero. We set the dropout probability to 0.5, randomly dropping half of the units in a layer. This makes the CNN model more stable for person re-ID.

4.3 Parameter analysis

Two important parameters are involved in Random Linear Interpolation, i.e., the interpolation strength α and the interpolation probability γ. α is the parameter of the Beta distribution that determines the value of the hyper-parameter μ. For each interpolated image, the value of μ, which controls the strength of interpolation, is randomly drawn from Beta(α, α). With γ set to 0.4, we report the recognition accuracy of the proposed method as α is adjusted from 0.1 to 1 in steps of 0.1. As can be seen in Fig. 3a, Random Linear Interpolation improves over the baseline model for every value of α, and the rank-1 accuracy on Market1501 varies only slightly across different values of α.
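The effect of α on the sampled interpolation strength can be seen directly by drawing from the Beta distribution; the sample size below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 0.5, 1.0):
    mu = rng.beta(alpha, alpha, size=10_000)
    # Small alpha pushes mu toward 0 or 1, so most mixed images stay close to one of
    # the two originals; alpha = 1 makes mu uniformly distributed on [0, 1].
    frac_extreme = np.mean((mu < 0.1) | (mu > 0.9))
    print(f"alpha = {alpha}: fraction of mu within 0.1 of an endpoint = {frac_extreme:.2f}")
```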

Fig. 3 Results of the sensitivity analysis. (a) Results of ResNet50 with α varying from 0.1 to 1 while γ is fixed. (b) Scores on DukeMTMC-reID with γ varying from 0 to 1 while α is fixed

γ is another parameter, which determines the number of samples subjected to Random Linear Interpolation. When α equals 0.1, the rank-1 accuracy for different values of γ is shown in Fig. 3b. We observe that when γ is set to 0.4, ResNet50 achieves the highest recognition accuracy on DukeMTMC-reID. Furthermore, as γ increases further, the model over-learns the features of the generated samples and pays less attention to the original matched samples.

4.4 Performance evaluation

4.4.1 Improving different baseline models

We first verify the effectiveness of Random Linear Interpolation. γ and α are set to 0.4 and 0.1, respectively. The same parameters, such as the learning rate, weight decay and dropout probability, are used in the following experiments.

Three convolutional neural network models, ResNet18, ResNet34 and ResNet50, are used as baselines for person re-identification. Rank-1 accuracy and mAP results on Market1501 and DukeMTMC-reID are shown in Fig. 4.

Fig. 4 Comparison of the performance of models with and without RLI on Market1501 and DukeMTMC-reID

In Fig. 4a, with RLI, the rank-1 accuracy of the three models increases by between 1.58% and 2.17%, and the mAP increases by between 1.07% and 6.23% on Market1501. Specifically, after applying RLI to the models on DukeMTMC-reID, shown in Fig. 4b, the rank-1 accuracy increases by 3.06%, 10.09% and 3.33% for ResNet18, ResNet34 and ResNet50, respectively, and the mAP increases by an average of 4.13%. These results show the clear gains obtained when a baseline CNN model uses Random Linear Interpolation data augmentation for person re-ID.

4.4.2 Compared to superior data augmentation methods

The comparison among different data augmentation methods, including Random Cropping (RC), Random Erasing (RE) and Random Linear Interpolation (RLI), on Market1501 and DukeMTMC-reID is shown in Table 2. For a fair comparison, ResNet50 with the same structure and initialization as in Section 4.2.1 is used as the baseline. Note that random cropping is not adopted when evaluating the performance of random erasing and Random Linear Interpolation. Furthermore, we combine the models with re-ranking [1, 40].

Table 2 Person re-identification performance with Random Cropping (RC), Random Erasing (RE) and Random Linear Interpolation (RLI) on Market-1501 and DukeMTMC-reID based on the ResNet50 model. Random cropping is not adopted when evaluating performance of RE and RLI

It can be observed that all of the data augmentation methods improve the rank-1 accuracy and mAP of the baseline. On Market1501, both random erasing and Random Linear Interpolation perform better than random cropping. Specifically, ResNet50 with RLI achieves a rank-1 accuracy of 88.93% and an mAP of 72.09%, which is 1.05% higher in rank-1 accuracy and 1.31% higher in mAP than ResNet50 with RC. In addition, ResNet50 with RLI obtains 70.87% rank-1 accuracy and 63.69% mAP on DukeMTMC-reID with re-ranking. Compared to random erasing, the proposed data augmentation is 1.64% higher in rank-1 accuracy and 1.42% higher in mAP. The experimental results in Table 2 demonstrate that Random Linear Interpolation is an effective data augmentation method and achieves better performance than the two commonly used methods, i.e., Random Cropping and Random Erasing.

Furthermore, we compare the proposed method with Random Cropping on three state-of-the-art baseline models: ResNet101, ResNet152 and DenseNet121. When running experiments on ResNet101 and ResNet152, we set the batch size to 16 because of memory limitations. The experimental results are shown in Table 3. The proposed method increases the average rank-1 accuracy of the three models by 2.34% and 6.32% on Market1501 and DukeMTMC-reID, respectively. In addition, our method brings a larger improvement on the DukeMTMC-reID dataset, which is harder to re-identify accurately, suggesting that it can be applied to more difficult person re-identification tasks. We achieve 91.38% rank-1 accuracy and 76.89% mAP on Market1501 with DenseNet121, and 74.01% rank-1 accuracy and 53.64% mAP on DukeMTMC-reID with ResNet152.

Table 3 Comparison with Random Cropping (RC) on three state-of-the-art baseline models including ResNet101, ResNet152 and DenseNet121

4.4.3 Comparison with state-of-the-art methods for person Re-ID

Many state-of-the-art methods deal efficiently with person re-identification on the two benchmark datasets, Market1501 and DukeMTMC-reID, both of which have been widely used recently to evaluate new methods. In Table 4, we compare our approach with other methods, including [2, 9, 12, 24, 30, 35, 36, 37, 38, 39, 41], on Market1501. It can be seen that our method achieves results competitive with the state of the art. Specifically, our method obtains a rank-1 accuracy of 92.71% and an mAP of 88.98% on Market1501 with re-ranking. In addition, several strong methods, including DCGAN [39], LOMO + XQDA [24], PAN [38], IDE [37] and BoW + KISSME [36], are compared with our method on DukeMTMC-reID. The detailed rank-1 accuracy and mAP scores are summarized in Table 5. Among the compared methods, Random Linear Interpolation exceeds PAN [38] by 2.41% in rank-1 accuracy and 2.14% in mAP with ResNet152. Furthermore, the proposed method achieves a rank-1 accuracy of 82.19% and an mAP of 75.91% on DukeMTMC-reID with re-ranking.

Table 4 Comparison of our method with the state-of-the-art methods on the Market-1501 dataset. Rank-1 accuracy (%) and mAP(%) are shown
Table 5 Comparison with prior art on DukeMTMC-reID. Rank-1 accuracy (%) and mAP (%) are shown

5 Conclusion

In this paper, we proposed a novel data augmentation method named Random Linear Interpolation for person re-identification. Pairs of training images are mixed by interpolating their pixels. New samples with various levels of complicated occlusion are generated during training, which improves the robustness of convolutional neural network models. Experiments conducted on the two benchmark datasets Market1501 and DukeMTMC-reID demonstrate that Random Linear Interpolation improves the performance of baseline CNN models for person re-identification. However, there are also many standard single-shot person re-identification datasets, such as VIPeR and PRID450S, with only one image per camera for each pedestrian, which poses a serious problem for existing data augmentation methods. Therefore, in future work we plan to optimize our method and explore a better data augmentation approach based on generating mixed pedestrian backgrounds to address person re-identification on single-shot datasets.