1 Introduction

Face recognition is the method of choice behind some of the most widely deployed biometric authentication systems, currently supporting a range of applications, from passport control at airports to mobile phone or laptop login. A key weakness of the technology is its vulnerability to presentation attacks, where imposters attempt to gain wrongful access by presenting a photo or a video in front of the system’s camera, or by wearing a mask resembling a registered person. As a solution to this problem, algorithms for presentation attack detection (PAD) are developed, that is, binary classifiers trained to distinguish between bona fide samples coming from live subjects and those coming from imposters.

Here, we deal with the problem of race bias in face anti-spoofing algorithms. The proposed race bias analysis process has three key characteristics. First, the focus is on the bona fide error, that is, on genuine people wrongly classified as imposters. Biases in this type of error have significant ethical, legal and regulatory ramifications and, as has recently been pointed out, such bias “creates customer annoyance and inconvenience” [12]. Second, we do not analyse just the final binary classification outcome, but also the scalar responses of the network prior to thresholding. Third, we treat the value of the threshold, which determines the classifier’s operating point on the ROC curve, as a variable; we do not assume it is fixed by the vendor of the biometric verification system in a black-box process.

We demonstrate the proposed bias analysis approach on a face anti-spoofing algorithm based on the Vector Quantized Variational Autoencoder (VQ-VAE) architecture [20]. The network is trained and validated on the SiW database, and tested for bona fide racial bias on the SiW and RFW databases. Hypotheses are tested using the chi-squared test on the binary outcomes, the Mann-Whitney U test on the scalar responses, and Hartigan’s Dip test for bimodality in the response distributions.

Our main finding is that racial bias can be attributed to several characteristics of the response distributions across the various demographics: different means; different variances; bimodality; outliers. As a secondary contribution, we also demonstrate that a database which does not specialise in face anti-spoofing, such as RFW, can nevertheless be used to analyse face anti-spoofing algorithms.

The rest of the paper is organised as follows. In Sect. 2, we review the relevant literature. In Sect. 3, we describe the experimental setup. In Sects. 4 and 5, we present the bias analysis on the SiW and RFW databases, respectively. We briefly conclude in Sect. 6.

2 Background

We briefly review the area of face anti-spoofing, and then focus on previous studies of bias in machine learning, and PAD in particular.

2.1 Face Anti-spoofing

The state-of-the-art in face anti-spoofing [5, 14, 25–28, 30, 31] is based on various forms of deep learning, such as Central Difference Convolutional Networks (CDCN) [27, 28] or transformers [23]. Following some earlier approaches [4, 15], the state-of-the-art may also utilise depth information [22, 24, 25, 30], usually estimated by an independently trained neural network, while the use of GAN-estimated Near Infrared (NIR) information was proposed in [14].

Regarding the face anti-spoofing databases used in this paper, our training data comes from the SiW database, introduced in [15]. It comprises videos of 165 subjects from four ethnicities: 35% Asian, 35% Caucasian, 23% Indian, and 7% African American. The bias analysis is performed on SiW, with the subjects annotated for ethnicity by us, and on the already annotated RFW database [21], which is widely used in the bias analysis literature. RFW also comprises four ethnicities: Caucasian, Asian, Indian, and African.

2.2 Bias in Machine Learning

Because of the ethical, legal, and regulatory issues associated with the problem of bias within human populations, there is a considerable amount of research on the subject, especially in face recognition (FR). A recent comprehensive survey can be found in [17], where the significant sources of bias are categorised and discussed, and the negative effect of bias on downstream tasks is pointed out.

In one of the earliest studies of bias in FR, predating deep learning, [18] reported differences between algorithms developed in Western countries and in East Asia in their performance on people of Caucasian and East Asian descent. In [9], several deep learning-based FR algorithms are analysed and a small amount of bias is detected in all of them.

In [10], the authors compute cluster validation measures on the clusters of the various demographics inside the whole population, aiming to measure the algorithm’s potential for bias, rather than actual bias. Their result is negative, and they argue for the need for more sophisticated clustering approaches. In [19], the aim is to detect bias by analysing the activation ratios at the various layers of the network. Similarly to our work, their target application is the detection of race bias in a binary classification problem, gender classification in their case. Their result is positive, in that they report a correlation between the measured activation ratios and bias in the final outcomes of the classifier. However, it is not clear whether their method can be used to measure and assess the statistical significance of the expected bias.

In Cavazos et al. [6], similarly to our approach, most of the analysis assumes a one-sided error cost, in their case the false acceptance rate, and the operating thresholds are treated as user-defined variables. However, the analytical tools they use, mostly visual inspection of ROC curves, do not allow for a deep study of the distributions of the similarity scores; here, we give a more in-depth analysis of the response distributions, which are the equivalent of similarity scores. In Pereira and Marcel [8], a fairness metric is proposed which can be optimised over the decision thresholds, but again, there is no in-depth statistical analysis of the scores.

The literature on bias in presentation attack detection is sparser. Race bias was the key theme in the competition of face anti-spoofing algorithms on the CASIA-SURF CeFA database [13], where bias was assessed by the performance of the algorithms under a cross-ethnicity validation scenario, and standard performance metrics, such as APCER, BPCER and ACER, were reported. In [2], the standard CNN models ResNet-50 and VGG16 were compared for gender bias against the debiasing-VAE proposed in [3], and several performance metrics were reported. A recent white paper by the ID R&D company presents the results of a large-scale bias assessment experiment conducted by Bixelab, a NIST-accredited independent laboratory [12]. Similarly to our approach, they focus on bona fide errors, and their aim is for the BPCER to be below a prespecified threshold across all demographics. Regarding other biometric identification modalities, [7] studied gender bias in iris PAD algorithms.

3 Experimental Setup

We chose the VQ-VAE architecture because of impressive results recently reported on various computer vision problems. For a more detailed description of the classifier, see our arXiv preprint [1].

3.1 The VQ-VAE Classifier

The encoder consists of two convolutional layers of kernel size 4, stride 2, padding 1, followed by a ReLU; one convolutional layer of kernel size 3, stride 1, padding 1; and two residual blocks, each implemented as ReLU, \(3 \times 3\) convolution, ReLU, \(1 \times 1\) convolution. It outputs a \(16\times 16\) grid of vectors quantized on a codebook of size 512. The decoder is symmetric to the encoder, using transposed convolutions. The model was optimised with Adam, learning rate 1e-3, for 100 epochs, with batch size 16. The weight factor \(\beta \) was set to 0.25.
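For concreteness, the following is a minimal PyTorch sketch of the encoder just described. The channel widths, the embedding dimension, and the exact placement of the ReLUs in the downsampling stack are our own illustrative assumptions; the text above fixes only the kernel sizes, strides, padding, the \(16 \times 16\) output grid, and the codebook size of 512.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """ReLU -> 3x3 conv -> ReLU -> 1x1 conv, with a skip connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)


class Encoder(nn.Module):
    """Maps a 3 x 64 x 64 face crop to a 16 x 16 grid of latent vectors,
    which are then quantised against a 512-entry codebook (not shown)."""

    def __init__(self, hidden: int = 128, embedding_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=4, stride=2, padding=1),       # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Conv2d(hidden, embedding_dim, kernel_size=3, stride=1, padding=1),
            ResidualBlock(embedding_dim),
            ResidualBlock(embedding_dim),
        )

    def forward(self, x):
        return self.net(x)
```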

For face detection we used the Multi-Task Cascaded Convolutional Neural Network (MTCNN) [29]. The detected faces were horizontally aligned and cropped to \(64 \times 64\) pixels. As our classifier is based on anomaly detection, the training set consisted of bona fide data only, 124,000 samples. We assessed performance on a test set of 1,600 samples, 400 from each race, with an equal split between bona fide and attack. At an operating threshold of 0.054, corresponding to the EER value on an independent validation set, we obtained an HTER of 0.169, which indicates satisfactory performance.
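As a sketch of how the operating threshold and the HTER reported above can be computed from the scalar responses, assuming the convention used throughout this paper that higher responses indicate an attack:

```python
import numpy as np


def eer_threshold(bona_fide_scores, attack_scores):
    """Threshold at which the bona fide and attack error rates are (approximately) equal."""
    thresholds = np.sort(np.concatenate([bona_fide_scores, attack_scores]))
    bpcer = np.array([(bona_fide_scores > t).mean() for t in thresholds])  # bona fide rejected
    apcer = np.array([(attack_scores <= t).mean() for t in thresholds])    # attacks accepted
    return thresholds[np.argmin(np.abs(bpcer - apcer))]


def hter(bona_fide_scores, attack_scores, threshold):
    """Half Total Error Rate at a given operating threshold."""
    bpcer = (bona_fide_scores > threshold).mean()
    apcer = (attack_scores <= threshold).mean()
    return 0.5 * (bpcer + apcer)


# Usage: fix the threshold on an independent validation set, report HTER on the test set.
# t = eer_threshold(val_bona_fide, val_attack)
# print(hter(test_bona_fide, test_attack, t))
```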

3.2 Overview of the Bias Analysis Process

The bias analysis process is summarised in Fig. 1. The binary outcomes of the classifier are analysed with the chi-squared test, and the scalar responses with the Mann-Whitney U test [16].

Fig. 1. The bias analysis process. The binary outcome analysis is shown in purple and the scalar response analysis in blue.

4 Bias Analysis on SiW

We perform bias analysis on the bona fide samples of the SiW test set described in Sect. 3.

4.1 Statistical Analysis of the Binary Outcomes

First, we analyse the binary outcomes corresponding to the operating threshold of 0.054, which was used in the validation of the classifier in Sect. 3. For each pair of races, we form the \(2\times 2\) contingency table and apply the chi-squared test, computing the p-value for the hypothesis that samples from the race with the most misclassifications have a higher misclassification probability. The results are summarised in Table 1. In several cases, the p-values are low, meaning that the bias hypothesis is accepted at any reasonable level of statistical significance. In other cases, p-values above 0.05 mean that bias has not been detected.

Table 1. p-values of the chi-squared tests for the 0.054 threshold used in Sect. 3.
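The pairwise test can be sketched as follows, using scipy, and assuming the misclassification counts have already been obtained at the chosen threshold. Obtaining a one-sided p-value by halving the two-sided one is our interpretation of the stated hypothesis, not a description of the exact implementation.

```python
import numpy as np
from scipy.stats import chi2_contingency


def pairwise_bias_pvalue(errors_a, n_a, errors_b, n_b):
    """One-sided chi-squared test that the race with more misclassifications
    has a higher misclassification probability.

    errors_*: bona fide samples misclassified as attacks; n_*: bona fide totals.
    """
    table = np.array([[errors_a, n_a - errors_a],
                      [errors_b, n_b - errors_b]])
    _, p_two_sided, _, _ = chi2_contingency(table, correction=False)
    if errors_a / n_a == errors_b / n_b:
        return 1.0
    return p_two_sided / 2.0  # the alternative is aligned with the observed direction


# Hypothetical counts, for illustration only:
# print(pairwise_bias_pvalue(errors_a=31, n_a=200, errors_b=14, n_b=200))
```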

Next, we treat the operating threshold as a variable. Figure 2 shows the p-values as a function of the threshold for the six pairs of races. We notice that, over the range of thresholds, there can be several disconnected intervals corresponding to high bias (low p-values), which means that threshold optimisation for low bias should not assume a unique solution, as is often implicitly assumed in the literature.

Fig. 2. For each pair of races in SiW, graphs of the p-value as a function of the threshold.
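Curves like those in Fig. 2 can be produced by sweeping the threshold and repeating the pairwise test at each value; a minimal sketch, reusing pairwise_bias_pvalue from the previous snippet:

```python
import numpy as np

# pairwise_bias_pvalue as defined in the previous sketch.


def pvalue_curve(resp_a, resp_b, thresholds):
    """For each threshold, count bona fide responses above it as misclassifications
    and recompute the pairwise chi-squared p-value."""
    curve = []
    for t in thresholds:
        e_a, e_b = int((resp_a > t).sum()), int((resp_b > t).sum())
        curve.append(pairwise_bias_pvalue(e_a, len(resp_a), e_b, len(resp_b)))
    return curve


# thresholds = np.linspace(responses.min(), responses.max(), 200)
# plt.plot(thresholds, pvalue_curve(resp_asian, resp_indian, thresholds))
```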

4.2 Statistical Analysis of the Scalar Responses

For an insight into the behaviour of the graphs in Fig. 2, we analyse the classifier’s scalar responses, on the premise that complex behaviour of their density functions will induce complex bias behaviour. Table 2 summarises the statistics computed on the responses of each race: mean, standard deviation, and Hartigan’s Dip value [11]. Figure 3 shows histograms and density functions for each pair of races.

Table 2. Response means, st. dev., and Hartigan’s dip values for each race in SiW.
Fig. 3. For each pair of races in SiW, histograms and density functions of the responses.
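The per-race summary statistics of Table 2 can be computed as in the following sketch. We assume the diptest PyPI package for Hartigan’s Dip statistic; this is one of several available implementations and is named here only for illustration.

```python
import numpy as np
import diptest  # pip install diptest (assumed implementation of Hartigan's Dip)


def response_summary(responses):
    """Mean, standard deviation, and Hartigan's Dip statistic for one race's
    bona fide responses (cf. Table 2)."""
    return {
        "mean": float(np.mean(responses)),
        "std": float(np.std(responses, ddof=1)),
        "dip": float(diptest.dipstat(responses)),
    }
```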

We tested for statistically different mean responses with the Mann-Whitney U test, as the Shapiro-Wilk test rejected the normality hypothesis. Table 3 shows, for each pair of races, the p-value for the hypothesis that randomly selected responses from the two populations have different values. We note, for example, that the p-value for the Asian and Indian pair is very low, and that the large range of high bias thresholds in the corresponding U-shaped diagram in Fig. 2 is due to a statistically significantly higher mean response on Asians compared to Indians.
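A sketch of this comparison for one pair of races, using scipy; the two-sided alternative mirrors the hypothesis stated above.

```python
from scipy.stats import mannwhitneyu, shapiro


def compare_responses(resp_a, resp_b, alpha=0.05):
    """Check normality with Shapiro-Wilk, then compare the two response
    samples with the non-parametric two-sided Mann-Whitney U test."""
    both_normal = shapiro(resp_a).pvalue > alpha and shapiro(resp_b).pvalue > alpha
    u_stat, p_value = mannwhitneyu(resp_a, resp_b, alternative="two-sided")
    return {"both_normal": both_normal, "U": float(u_stat), "p": float(p_value)}
```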

Table 3. p-values of the Mann-Whitney U test on each pair of races.
Fig. 4. For each pair of races in RFW, graphs of the p-value as a function of the threshold.

In contrast, the mean response difference between Asians and Caucasians is not statistically significant. Thus, the bias observed in the corresponding diagram in Fig. 2, which is statistically significant for the small threshold values on the left-hand side of the diagram, is due to different standard deviations.

We checked for bimodality using Hartigan’s Dip test with 50 bins. For the 200 samples we have from each race, a statistical significance level of 95% corresponds to a critical value of 0.037. We notice that all Dip values are below this critical value, and thus all populations should be considered unimodal. In particular, this means that some very high responses on African people should be treated as outliers. We note that, against each of the other three races, these outliers create a second, or third, region of high bias thresholds, in which samples from the African population are treated less favourably.
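The 0.037 critical value can be approximated by Monte Carlo simulation under the uniform null distribution, which is the standard calibration for the Dip test; a sketch, again assuming the diptest package:

```python
import numpy as np
import diptest  # assumed implementation of Hartigan's Dip


def dip_critical_value(n=200, alpha=0.05, n_sim=10_000, seed=0):
    """Approximate the (1 - alpha) critical value of the Dip statistic for
    sample size n, simulated from the uniform null. For n = 200 and
    alpha = 0.05 this should land close to the 0.037 quoted in the text."""
    rng = np.random.default_rng(seed)
    dips = [diptest.dipstat(rng.uniform(size=n)) for _ in range(n_sim)]
    return float(np.quantile(dips, 1.0 - alpha))
```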

5 Bias Analysis on RFW

Here, we apply the same analysis to a test set from the RFW database, consisting of 200 images from each race. This time the race labels are part of the database, rather than being annotated by us. As RFW is not a specialised face anti-spoofing database, we do not have imposter images, and thus we do not have empirically established operating thresholds, such as those corresponding to EER values. Instead, in our diagrams we indicate thresholds corresponding to bona fide error rates of 1%, 2%, 5%, 10%, and 20%.
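In the absence of imposter samples, these thresholds can be read off the empirical distribution of the bona fide responses; a minimal sketch (whether the quantiles are taken over the pooled responses or per race is our assumption):

```python
import numpy as np


def thresholds_for_bona_fide_error(responses, rates=(0.01, 0.02, 0.05, 0.10, 0.20)):
    """Thresholds at which a given fraction of the bona fide responses would be
    rejected (responses above the threshold count as bona fide errors)."""
    return {r: float(np.quantile(responses, 1.0 - r)) for r in rates}
```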

In Fig. 4, for each race pair, we plot the p-values of the chi-squared test as a function of the threshold. We observe behaviours similar to those in Sect. 4.

Table 4 shows the means, standard deviations and Dip values for each race, and Table 5 shows the p-values of the Mann-Whitney U test for each race pair. We note in Table 4 that Hartigan’s test detects bimodality in the responses on Indian people, with a Dip value of 0.055, above the critical value of 0.037. This can also be verified by visual inspection of the corresponding histograms and density functions, shown in Fig. 5 for each race pair. We also note that this bimodality can be detected in the behaviour of the corresponding graphs of the p-values of the chi-squared test. Indeed, in the three graphs in Fig. 4 corresponding to Indian people, we can detect two distinct regions of higher bias, even though the second one does not reach the level of statistical significance.

Table 4. Response means, st. dev., and Hartigan’s Dip values for each race in RFW.
Table 5. p-values of the Mann-Whitney U test on each pair of races.
Fig. 5. For each race pair in RFW, histograms and density functions of the responses.

6 Conclusion

We conducted an empirical study of race bias in face anti-spoofing with the following characteristics: we analysed the bona fide error; the classifier’s binary outcomes and scalar responses were both analysed for bias; the threshold determining the classifier’s operating point was considered a variable.

Our main finding is that the behaviour of race bias depends on several characteristics of the response distributions: different means or different variances between two demographics, and bimodality or the existence of outliers in a certain demographic. The implication is that race bias cannot always be attributed to different mean responses, a misconception sometimes reinforced by the fact that, in statistics, the term bias is often used colloquially to describe the component of the error corresponding to the difference in means. As a practical implication of our findings, we note that methods for automatically choosing low bias thresholds should not assume a unique solution to the problem.

In our future work, we would like to conduct a theoretical study of bias, assuming, for example, that the responses follow log-normal distributions.