
1 Introduction

Face recognition has been an active research topic for decades and has attracted considerable attention. In recent years, with the development of deep learning techniques and the emergence of large-scale face datasets, deep network-based methods have significantly advanced face recognition [1,2,3]. Applications of face recognition are emerging in video surveillance, social security, company attendance, and identity authentication.

Although deep models have proved effective in improving face recognition performance, most existing face recognition systems are trained with large-scale datasets. In some applications, each person has only 1–2 samples, and the training data may be inadequate to cover variations in illumination, pose, and image quality. In particular, the performance of existing deep networks drops dramatically under extreme lighting conditions. Therefore, learning robust deep representations from insufficient training samples is a worthwhile topic for face recognition.

A natural idea is to augment the training data by generating additional training samples. Some recent works have focused on synthesizing novel face images with varying poses, attributes, and identities using GANs (generative adversarial networks) [4,5,6]. Learning deep networks with the synthesized face images improves performance on specific problems. However, the generalizability of GAN-based approaches to other datasets remains an open question. Besides, controlling the facial details and identity of the generated image with a GAN is extremely difficult. Furthermore, training GAN models is time-consuming, and labeling face image attributes is costly.

In this paper, we focus on improving the performance of face recognition systems through data augmentation. We propose illumination augmentation to increase the diversity of the training data. First, we generate different reference illumination templates from other datasets. For each training sample, our approach simulates different illumination conditions using the pre-defined illumination templates. Finally, we utilize the singular value decomposition (SVD) algorithm to transform the output image into a color subspace consistent with the input image. Furthermore, we construct a new dataset by collecting images stored in second-generation ID cards and images captured in a realistic surveillance environment. We also build a testing set comprising images captured in a railway station. Experiments demonstrate that the proposed illumination augmentation approach effectively improves the performance of deep network-based face recognition models.

2 Related Works

Deep networks have achieved remarkable success in face recognition and dramatically improved the performance of the state-of-the-art methods [1,2,3]. Taigman et al. [1] proposed a pioneering CNN model named DeepFace which outperformed traditional face recognition methods and closely approached human-level performance. Sun et al. [2] proposed the DeepID network, which employed identification and verification supervisory signals to improve recognition performance. Schroff et al. [3] proposed a network named FaceNet which adopted the triplet loss to enforce a margin between the distances of intra-class samples and those of inter-class samples.

3 Proposed Method

3.1 Overall Framework

Figure 1 illustrates the overall framework of the proposed illumination augmentation approach. We first perform Gaussian filtering on reference images from an external benchmark to extract the reference illumination masks. Then, we extract the facial details of the input image by subtracting its illumination mask. After that, we combine the facial details of the input image with a reference illumination mask to generate face images under different illumination conditions. Finally, we perform color correction to ensure that the color components of the augmented image are consistent with those of the input image.

Fig. 1. Proposed framework. We first simulate different illumination situations using the reference images from an external dataset and then perform color correction to obtain the illumination augmentation output.

3.2 Illumination Variation Simulation

We perform Gaussian filtering with a large blur kernel on the reference image and the input image to extract the corresponding illumination masks. Denoting the input image and the reference image as \(\varvec{X}_i\) and \(\varvec{X}_r\), we compute the illumination masks \(\varvec{X}^m_i\) and \(\varvec{X}^m_r\) as follows:

$$\begin{aligned} \begin{aligned} \varvec{X}_i^m&=\varvec{X}_i*\mathcal {G},\\ \varvec{X}_r^m&=\varvec{X}_r*\mathcal {G},\\ \end{aligned} \end{aligned}$$
(1)

where \(*\) denotes convolution and \(\mathcal {G}\) is the Gaussian kernel, defined as follows:

$$\begin{aligned} \begin{aligned} \mathcal {G}(x,y)=\frac{1}{2\pi \sigma ^2}e^{-\frac{x^2+y^2}{2\sigma ^2}}. \end{aligned} \end{aligned}$$
(2)

We obtain the facial details of the input image by subtracting its illumination mask. Denoting \(\varvec{X}^d_i\) as the facial details, the computation is as follows:

$$\begin{aligned} \begin{aligned} \varvec{X}_i^d&=\varvec{X}_i-\varvec{X}_i^m. \end{aligned} \end{aligned}$$
(3)

Then we combine the facial details with the reference illumination mask to simulate different illumination conditions \(\varvec{X}^v_i\), which is formulated as follows:

$$\begin{aligned} \begin{aligned} \varvec{X}_i^v&=\varvec{X}_i^d+\varvec{X}_r^m. \end{aligned} \end{aligned}$$
(4)
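The simulation step can be summarized in a few lines of code. The following is a minimal sketch of Eqs. (1)–(4) in Python, assuming float RGB images in the range [0, 1]; the blur strength `sigma` is an illustrative choice, not a value reported in the paper.

```python
import numpy as np
import cv2


def illumination_mask(img, sigma=30.0):
    """Eq. (1): a large Gaussian blur (kernel of Eq. (2)) approximates the illumination mask."""
    # ksize=(0, 0) lets OpenCV derive the kernel size from sigma.
    return cv2.GaussianBlur(img, (0, 0), sigma)


def simulate_illumination(x_input, x_reference, sigma=30.0):
    """Eqs. (3)-(4): transfer the reference illumination onto the input's facial details."""
    mask_i = illumination_mask(x_input, sigma)       # X_i^m
    mask_r = illumination_mask(x_reference, sigma)   # X_r^m
    details = x_input - mask_i                       # X_i^d, Eq. (3)
    simulated = details + mask_r                     # X_i^v, Eq. (4)
    return np.clip(simulated, 0.0, 1.0)
```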

3.3 Color Correction

The color components of the simulated image may differ from those of the input image. We propose to conduct color correction to ensure that the color components of the final output are consistent with those of the input image. We first apply the SVD [11] algorithm to each channel of both the input image and the simulated output to extract their color components. Denoting \(\varvec{X}_{iA}\) and \(\varvec{X}_{iA}^v\), \(A=\{R,G,B\}\), as the channels of the input and simulated images, we have:

$$\begin{aligned} \begin{aligned} \varvec{X}_{iA}&=U_{iA}\varSigma _{iA}V_{iA}^T,A=\{R,G,B\},\\ \varvec{X}_{iA}^v&=U_{iA}^v\varSigma _{iA}^v(V_{iA}^v)^T,A=\{R,G,B\}. \end{aligned} \end{aligned}$$
(5)

Note that \(\varSigma _{iA}\) and \(\varSigma _{iA}^v\) contain the color components of the input and simulated images; we can therefore correct the color of the simulated image to match the input image by replacing \(\varSigma _{iA}^v\) with \(\varSigma _{iA}\). Denoting \(\varvec{X}_{oA}\) as the augmented output, we have:

$$\begin{aligned} \begin{aligned} \varvec{X}_{oA}&=U_{iA}^v\varSigma _{iA}(V_{iA}^v)^T,A=\{R,G,B\}. \end{aligned} \end{aligned}$$
(6)
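For completeness, the color correction in Eqs. (5)–(6) admits an equally short sketch, again assuming float RGB images in [0, 1]; numpy.linalg.svd stands in here for whichever SVD routine [11] is used in practice.

```python
import numpy as np


def color_correct(x_input, x_simulated):
    """Eqs. (5)-(6): give each simulated channel the singular values of the input channel."""
    corrected = np.empty_like(x_simulated)
    for c in range(x_simulated.shape[2]):            # A in {R, G, B}
        # Eq. (5): per-channel SVD of the input and the simulated image.
        _, s_in, _ = np.linalg.svd(x_input[..., c], full_matrices=False)
        u_v, _, vt_v = np.linalg.svd(x_simulated[..., c], full_matrices=False)
        # Eq. (6): keep the simulated U and V but replace the spectrum with the input's.
        corrected[..., c] = (u_v * s_in) @ vt_v
    return np.clip(corrected, 0.0, 1.0)
```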

4 Experiment

4.1 Experimental Settings

Training Set. We utilize CASIA-WebFace [7], a popular public face dataset, to train the baseline model. CASIA-WebFace contains 494,414 samples of 10,575 subjects collected from the Internet.

We also construct a domestic dataset to train a stronger model for domestic face recognition. The domestic training dataset contains 864,652 samples of 386,847 subjects. Most subjects have only 2–3 images, of which one is from the second-generation ID card and the others are from surveillance videos. Training a robust model on the domestic dataset is challenging because of the lack of training samples for each person.

Testing Set. We evaluate the performance of the proposed illumination augmentation approach on the LFW dataset [8]. The LFW dataset contains 13,233 images of 5,749 subjects captured in unconstrained environments and is currently the most popular benchmark for face recognition. We adopt the standard verification protocol to conduct a fair comparison with other methods.

We also construct a domestic testing set to evaluate the performance of face recognition models in a realistic surveillance environment. The domestic testing set contains 3,722 probe images of 39 subjects captured in a railway station. The challenges include illumination, pose, and occlusion. For testing, we match the domestic testing set against a gallery set of 10,039 images from second-generation ID cards.
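As a concrete illustration of this protocol, the following hedged sketch performs rank-1 identification by cosine similarity between face embeddings; the embedding extraction itself (the network in Table 1) is outside the snippet, and `probe_feats`, `gallery_feats`, and `gallery_ids` are hypothetical arrays.

```python
import numpy as np


def rank1_identify(probe_feats, gallery_feats, gallery_ids):
    """Return the gallery identity with the highest cosine similarity for each probe."""
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = p @ g.T                      # (num_probe, num_gallery) cosine scores
    return gallery_ids[np.argmax(sims, axis=1)]
```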

Table 1. The DenseNet structure
Fig. 2. Illumination augmentation results on CASIA-WebFace. (b) and (c) are the augmented images of (a), while (e) and (f) are the augmented images of (d).

Fig. 3. Illumination augmentation results on the domestic training set. (b) and (c) are the augmented images of (a), while (e) and (f) are the augmented images of (d).

Implementation Details. We select 20 images from the CMU-PIE [9] dataset to generate reference illumination templates. For each training sample, we randomly select 2 reference templates and obtain 2 illumination augmentation results. Our training process has two steps. First, we train a baseline model on CASIA-WebFace using the DenseNet [10] structure; Table 1 gives the details of the network. Then we use the triplet loss [3] to fine-tune the baseline model on the domestic training set. For the first step, we set the batch size to 128, the initial learning rate to 0.1 (halved every 40,000 iterations), and the weight decay to \(5\times 10^{-4}\). For the second step, we set the batch size to 120, the initial learning rate to 0.01 (halved every 40,000 iterations), and the weight decay to \(5\times 10^{-4}\).
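As a rough illustration of this schedule, the sketch below sets up the two optimization steps in PyTorch; torchvision's densenet121 is only a stand-in for the network in Table 1, and the training loops and triplet sampling are omitted.

```python
import torch
from torchvision.models import densenet121

model = densenet121(num_classes=10575)  # stand-in for the DenseNet structure in Table 1

# Step 1: baseline training on CASIA-WebFace (batch size 128, initial lr 0.1).
opt1 = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-4)
sched1 = torch.optim.lr_scheduler.StepLR(opt1, step_size=40000, gamma=0.5)

# Step 2: triplet-loss fine-tuning on the domestic set (batch size 120, initial lr 0.01).
opt2 = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=5e-4)
sched2 = torch.optim.lr_scheduler.StepLR(opt2, step_size=40000, gamma=0.5)

# Each scheduler's step() is called once per training iteration, so the learning
# rate is halved every 40,000 iterations, matching the schedule described above.
```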

4.2 Qualitative Evaluation of Illumination Augmentation

Figures 2 and 3 show the illumination augmentation results on CASIA-WebFace and the domestic training set. The illumination augmentation approach adds illumination variations to the input image without changing the facial details on both CASIA-WebFace and the domestic training set. The approach thus adapts to training samples with varying illumination, pose, and image quality.

4.3 Quantitative Evaluation of Illumination Augmentation

Evaluation on LFW. Table 2 reports the quantitative evaluation of the proposed illumination augmentation (IA) approach on the LFW dataset. We compare our method with DeepFace [1], DeepID2+ [2], and FaceNet [3]. The experimental results verify that the proposed IA approach effectively improves the performance of existing deep models: training the proposed network with the augmented samples improves verification accuracy by 0.27%. We also note that the verification accuracy of the proposed approach surpasses DeepFace and DeepID2+, and that with the same training samples, the accuracy of our method is higher than that of FaceNet. Consequently, our method is competitive with the state-of-the-art methods.

Table 2. Evaluation on LFW
Table 3. Evaluation on the domestic testing set

Evaluation on the Domestic Testing Set. Table 3 reports the evaluation results on the domestic testing set. Training deep models on the domestic training set improves recognition accuracy on the test set captured in a realistic surveillance environment. Compared with the deep models trained on CASIA-WebFace, an improvement of 24.6% is obtained for FaceNet trained on the domestic training set; similarly, an improvement of 19.77% is observed for the proposed network. With more training data, the performance of deep networks continues to improve: as the amount of training data increases from 0.12M to 0.86M images, we see a performance gain of 7% for FaceNet and 8.73% for our network. Furthermore, the proposed IA approach effectively improves the performance of deep networks on the domestic dataset; with IA, an improvement of 4.02% is observed for the proposed network. Note that the proposed network trained on CASIA-WebFace outperforms FaceNet by a margin of 8.56%. Consequently, our method achieves better generalizability than FaceNet.

5 Conclusion

In this paper, we study data augmentation for face recognition and propose an illumination augmentation (IA) method. We first simulate different illumination conditions using an external benchmark and then perform color correction to obtain augmented training samples with additional illumination variations while preserving the facial details. The IA approach is applicable to face images with varying illumination, pose, and image quality. To further improve the performance of deep networks for robust face recognition in realistic environments, we construct a domestic training set together with a domestic testing set. Experimental results on LFW and the domestic testing set verify the effectiveness of the proposed approach.