Abstract
Iris Presentation Attack Detection (PAD) algorithms address the vulnerability of iris recognition systems to presentation attacks. With the great success of deep learning methods in various computer vision fields, neural network-based iris PAD algorithms have emerged. However, most PAD networks suffer from overfitting due to insufficient iris data variability. Therefore, we explore the impact of various data augmentation techniques on the performance and the generalizability of iris PAD. We apply several data augmentation methods to generate variability, such as shift, rotation, and brightness. We provide in-depth analyses of the overlapping effect of these methods on performance. In addition to these widely used augmentation techniques, we also propose an augmentation selection protocol based on the assumption that various augmentation techniques contribute differently to the PAD performance. Moreover, two fusion methods are performed for more comparisons: the strategy-level and the score-level combination. We conduct experiments on two fine-tuned models and one network trained from scratch, and evaluate on the datasets of the Iris-LivDet-2017 competition, which are designed for generalizability evaluation. Our experimental results show that augmentation methods improve iris PAD performance in many cases. Our least-overlap-based augmentation selection protocol achieves the lowest error rates for two networks. Besides, the shift augmentation strategy also exceeds state-of-the-art (SoTA) algorithms on the Clarkson and IIITD-WVU datasets.
1 Introduction
Iris recognition systems are vulnerable to presentation attacks (PAs). An imposter can use a printed image or replay an iris video to impersonate an enrolled user, or wear textured contact lenses to escape recognition. Therefore, developing a reliable iris PAD algorithm is still a challenging task. Considering that neural networks successfully improve the performance in many computer vision fields, deep learning-based algorithms are further applied for iris PAD [6, 15, 17, 30, 39]. However, most neural networks suffer from overfitting, where the network does not generalize well on an unseen test set. Several strategies have therefore been proposed to improve the generalizability of networks, e.g., Dropout [34] and Batch Normalization [25]. In contrast to such methods, the data augmentation technique targets the root problem: insufficient training data variability. Due to privacy concerns, most iris PAD datasets are small-scale compared to datasets used for general purposes. Data augmentation can be categorized into data warping and oversampling. Data warping creates more images based on affine transformations like rotation or translation. Oversampling generates synthetic images, for example using Generative Adversarial Networks (GANs) [18]. Data augmentation techniques undoubtedly improve the performance of modern image classifiers [10, 23, 32]. In the iris PAD field, several studies also showed performance improvements through augmentation techniques. Gragnaniello et al. [19] generated more training data by rotating the original images for the iris PAD task; their results improved slightly when applying data augmentation. Raghavendra et al. [30], Chen et al. [6], and Choudhary et al. [7] also utilized augmentation techniques to avoid overfitting in the training phase (see Table 1).
However, the contribution of augmentation techniques is not clear because no analysis or experimental comparison is provided, as summarized in Table 1. It is worth noting that iris images generated by GANs [18] cannot be used as augmented data in our application to improve the performance, as is done for general computer vision tasks. This is because the generated iris images are considered another type of presentation attack for impersonation [37]. As a result, we chose to explore the effect of data warping techniques on iris PAD performance due to these restricted conditions on augmentation techniques in the PAD field.
Furthermore, the detailed effect of the data augmentation on iris PAD performance is relatively understudied. In this regard, this work provides answers to the following questions: (1) What is the relative effect of various data augmentation techniques on the performance of iris PAD? (2) Does the combination of all augmentation techniques at various design levels always lead to superior performance, or can there be a formal approach to augmentation methods selection? (3) Do different augmentation strategies improve PAD performance by bringing the “same” misclassified samples to the correct classes? Or do they have a less overlapping effect?
To answer these questions, we explore the impact of different augmentation techniques, specifically data warping techniques, on the generalization of deep learning-based iris PAD. The main contributions of the work are as follows: (1) provide a first in-depth analysis of the role of data augmentation techniques in the performance and reliability of iris PAD, (2) propose a classification-error-overlap-based augmentation selection protocol, (3) demonstrate experiments with fine-tuned and trained-from-scratch networks with various augmentations under multiple cross-validation dataset scenarios, (4) visualize and discuss the overlapping effect of different augmentation techniques to provide a better explanation of the generalizability induced by augmentation techniques.
2 Related work
Iris recognition systems have been widely applied in different recognition scenarios due to the uniqueness and high accuracy of iris features [2,3,4,5]. However, the operational security of iris recognition has raised many concerns. This section provides a brief review of deep learning-based iris PAD algorithms and general data augmentation techniques. Recent iris PAD competitions, such as Iris-LivDet-2017 [42] and Iris-LivDet-2020 [11], were organized to evaluate the generalizability of iris PAD algorithms, and their datasets and protocols indicated that improving this generalizability remains a major challenge. Because the 2020 edition [11], in contrast to Iris-LivDet-2017 [42], did not offer any official training data and its test data are not yet publicly available, the experiments and analysis in this work are based on the protocols designed in the Iris-LivDet-2017 competition [42]. Hence, we focus here on the algorithms and results of Iris-LivDet-2017. The protocols in this competition are designed under cross-dataset and cross-PA scenarios to reflect real-world situations. In this competition [42], CASIA proposed to train two SpoofNets to detect printouts and textured contact lenses separately, while UNINA relied on the Scale Invariant Descriptor (SID) and Bag of Words (BoW) to classify the attacks. Afterward, Kuehlkamp et al. [27] proposed to combine 61 lightweight CNNs via meta-fusion to classify multiple Binarized Statistical Image Features (BSIF) views of the iris image to overcome such generalization problems. Their results outperformed the winners of the competition. Furthermore, Sharma et al. [31] proposed a DenseNet-based iris PA detector, D-NetPAD, and demonstrated experiments on a proprietary dataset and four public competition datasets. They trained a D-NetPAD model on their private dataset, comprising 12,772 training images.
Then, this pre-trained model was used in three ways to examine the generalizability on the competition datasets: 1) the pre-trained D-NetPAD is used directly on the test sets of the competition, 2) a D-NetPAD model is trained from scratch on the competition training sets, 3) the pre-trained model is fine-tuned on the competition training sets. As expected, the fine-tuned model performed best. It achieved the lowest error rate (0.30% ACER) on the Notre Dame dataset in the competition, whereas the second-lowest error is 3.28% from the earlier Meta-Fusion method. However, a caveat is that their proprietary training data include the test data of Notre Dame. To fairly compare results using the same data, we only report the D-NetPAD trained from scratch, along with the Meta-Fusion results, later in Table 12. Besides, we compare our results with the multi-layer fusion (MLF) method, which achieves a 2.31% ACER on Notre Dame, and the recently published micro-stripe analysis (MSA) method ([14, 17]), which obtains good performance (11.13% ACER) on the IIITD-WVU dataset.
Even though such neural network-based algorithms obtain good performance, they still suffer from overfitting. One reason is that training data are insufficient, in quantity and variation. For example, there are only 1200 training iris images in the Notre Dame dataset in the competition [42], which is quite limited compared to datasets designed for generic computer vision tasks. Moreover, this problem is not unique to iris PAD algorithms; most networks suffer from overfitting, leading to low generalization. Under this condition, data augmentation can help to reduce overfitting and enhance the generalizability of networks by virtually generating more training images (more variations) from the original data. Data augmentation techniques can be categorized into data warping and synthetic oversampling [38]. The term data warping can be traced back to the distortion of handwriting in [1]. Warped data are created by applying geometric and color augmentations, such as rotation, shift, flipping, and changing the contrast. In addition to data warping applied in data-space, synthetic oversampling creates images in feature-space, for example by using GANs. The recent iris PAD studies and their used augmentation techniques are presented in Table 1. Notably, many works did not mention applying data augmentation, and those that did, did not study the effect of that augmentation in an ablation study. Only [19] measured this effect; however, like all other works, it neither studied multiple augmentation methods nor provided a formal protocol for augmentation selection. It should be noted that such generated synthetic images [26, 41] are classified as a type of presentation attack in the PAD field, i.e., they can only increase the number of attack samples, not of bona fide samples. Such synthetically generated iris images can be exploited by an adversary to impersonate someone else's identity. For example, Yadav et al. [41] studied the impact of synthetic data on PAD algorithms when used as a presentation attack. Hence, we explore the impact of augmentation techniques on the performance of iris PAD algorithms, but we only perform data warping augmentation methods due to the imbalanced generation of synthetic oversampling techniques.
As summarized in Table 1, the augmentation techniques used in most iris PAD works are rotation, flip, and shear. However, the exact impact of these transformations on PAD performance is unspecified in these works. Moreover, our experimental results (in Sect. 5) show that not all single or combined augmentations increase iris PAD performance. Therefore, it is essential to identify the most contributing augmentations by considering the unique characteristics of iris data, e.g., NIR illumination, specific sensors, and noise-free backgrounds. Furthermore, as shown later in Sect. 5, the individual data augmentations that improve the performance and generalizability of networks help us understand the nature of the variations in the attacks. Consequently, studying the specific role of augmentations inspired us to fuse them by ranking the overlap of their classification errors.
3 Methodology
In this section, we will introduce the investigated data augmentation techniques along with the augmentation selection and fusion protocols, as well as the three CNNs used in our iris PAD study.
3.1 Data augmentation techniques
The collection of large-scale iris datasets is challenging for iris research because of various factors, e.g., privacy concerns and high demands on the acquisition environment. Deep learning-based iris PAD studies are thus limited by inadequate datasets. Compared to datasets designed for general purposes like the ImageNet dataset [12], most iris PAD datasets have only a dozen to a hundred distinct irises (distinct subjects), as summarized in [8]. The problem with training on small-scale datasets is overfitting, which refers to the phenomenon that a trained network cannot generalize well on unseen data. Besides, the Iris-LivDet-2017 competition results suggested that cross-PA and cross-dataset scenarios can be considered the major challenges in the current iris PAD field. To address both insufficient data resources and cross-scenario evaluation, we explore the impact of data augmentation methods on iris PAD generalization ability.
To observe the respective impact of data augmentation strategies, we perform six transformation-based augmentation techniques. Notably, oversampling augmentation techniques are neglected in this work because iris data generated by GANs [18] are considered fake irises [26, 41], i.e., attacks. The six basic augmentations explored in this study are: horizontal shift, vertical shift, rotation, brightness adjustment, zoom in/out, and horizontal flipping. Such augmentation techniques are widely used in the computer vision field with proven positive effects [10, 23, 32] and also in the iris PAD field [6, 7, 19, 30]. Further reasons that lead us to choose these augmentations are: (1) even under a controlled environment, the irises are not in the same position and viewpoint; there is still a small geometric variation between iris images; (2) the capture light conditions vary between the different datasets when performing cross-dataset evaluation; (3) the size of the captured irises varies slightly depending on the collectors; (4) iris textures are distinct between the left and right eyes of the same person [8]. However, in some cases, only a single eye per person is contained in PAD datasets [12]. Hence, it is interesting to explore whether horizontal flipping of iris images can improve the performance of PAD algorithms. Considering that the position, direction, size, and illumination differences of iris images are small, we augment the images to a relatively small degree to avoid inducing unwanted noise. The detailed augmentation parameters are listed in Table 4, and the corresponding explanation is in Sect. 4.2. Most interestingly, we look at the effect of each of these augmentations with respect to the others.
3.2 Fusion and augmentation protocol
Furthermore, we investigate two methods to fuse the above individual augmentation strategies: strategy-level and score-level fusion. For the former category, the training data are generated by using a combination of several augmentation strategies. For example, an iris image can be rotated, shifted, zoomed, and other operations simultaneously. For the latter category, the prediction scores by each network (trained with one of the single augmentation methods) are fused to calculate a final prediction.
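The score-level combination can be sketched as follows. This is a minimal illustration with hypothetical scores, assuming simple score averaging (the exact fusion rule is not specified above):

```python
import numpy as np

def score_level_fusion(score_lists):
    """Average the prediction scores of networks trained with
    individual augmentation strategies (one score array per network)."""
    return np.mean(np.stack(score_lists, axis=0), axis=0)

# Hypothetical PAD scores (probability of bona fide) from three
# networks, each trained with a different single augmentation.
scores_shift = np.array([0.9, 0.2, 0.7])
scores_bright = np.array([0.8, 0.4, 0.6])
scores_flip = np.array([0.7, 0.3, 0.8])

fused = score_level_fusion([scores_shift, scores_bright, scores_flip])
# Final decision at the 0.5 threshold used throughout this work.
decisions = fused >= 0.5
```

Strategy-level fusion, by contrast, requires no score post-processing: the combined augmentations are applied while generating the training data, and a single network is trained on the result.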
On the other hand, we investigate an augmentation selection protocol. This protocol is based on the overlapping ratio of misclassified samples caused by the different augmentations (as explained later) and thus their relative effect on the performance. This selection step is based on two assumptions: (1) different augmentation techniques contribute to different aspects of the PAD performance, (2) selecting augmentations with a lower overlap of misclassified samples for fusion may improve the results, as they focus on different types of variability in the images.
Let \(A = \{A_1, ..., A_n\}\) define a set of augmentation techniques. \(I_{A_n}^{a} = \{ I_{A_n}^{a_1}, ..., I_{A_n}^{a_m} \}\) presents a set of misclassified attack images with augmentation \(A_n\) and \(I_{A_n}^{bf} = \{I_{A_n}^{bf_1}, ..., I_{A_n}^{bf_k}\}\) is a set of misclassified bona fide images with augmentation \(A_n\). The misclassified attacks overlap ratio \(O_{A_{pq}}^{a}\) denotes the ratio of attack samples classified incorrectly with augmentation technique \(A_p\) that are also classified wrongly with augmentation \(A_q\). Similarly, the misclassified bona fides overlap ratio \(O_{A_{pq}}^{bf}\) denotes the ratio of bona fide samples misclassified with augmentation technique \(A_p\) that are also misclassified with augmentation \(A_q\). The ratios can be computed as:

\(O_{A_{pq}}^{a} = \frac{|I_{A_p}^{a} \cap I_{A_q}^{a}|}{|I_{A_p}^{a}|}, \quad O_{A_{pq}}^{bf} = \frac{|I_{A_p}^{bf} \cap I_{A_q}^{bf}|}{|I_{A_p}^{bf}|}\)    (1)
where \(p, q \in \{1, ..., n\}\). Then, the overall overlap ratio \(O_{A_{pq}}\) between augmentation techniques \(A_p\) and \(A_q\) is:

\(O_{A_{pq}} = \frac{1}{2}\left(O_{A_{pq}}^{a} + O_{A_{pq}}^{bf}\right)\)    (2)
The detailed pseudo-code of the selection protocol can be found in Algorithm 1. We set \(k = 3\) in our experiment and select the \(A_b\) with the minimum Equal Error Rate (EER) values.
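The overlap computation at the core of the selection protocol can be sketched as below. This is an illustrative reading of the definitions above, not the authors' exact Algorithm 1; in particular, averaging the attack and bona fide ratios into the overall overlap is an assumption, and the sample IDs are hypothetical:

```python
def overlap_ratio(misclassified_p, misclassified_q):
    """Fraction of the samples misclassified under augmentation A_p
    that are also misclassified under augmentation A_q."""
    if not misclassified_p:
        return 0.0
    return len(misclassified_p & misclassified_q) / len(misclassified_p)

def overall_overlap(attack_errs, bf_errs, p, q):
    # Overall overlap O_{A_pq}: here assumed to be the average of the
    # attack and bona fide overlap ratios.
    return 0.5 * (overlap_ratio(attack_errs[p], attack_errs[q])
                  + overlap_ratio(bf_errs[p], bf_errs[q]))

# Toy misclassified-sample IDs per augmentation (hypothetical).
attack_errs = {"shift": {1, 2, 3, 4}, "flip": {1, 2, 9}, "zoom": {7}}
bf_errs = {"shift": {10, 11}, "flip": {10}, "zoom": {12}}

o_shift_flip = overall_overlap(attack_errs, bf_errs, "shift", "flip")
o_shift_zoom = overall_overlap(attack_errs, bf_errs, "shift", "zoom")
# "zoom" overlaps least with "shift", so under the least-overlap
# assumption it would be preferred as a fusion partner for "shift".
```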
3.3 Neural networks
To evaluate the effect of data augmentation on iris PAD more generally, we train three neural networks: (1) fine-tuned ResNet50, (2) fine-tuned VGG16, (3) MobileNetV3-Small trained from scratch. On the one hand, ResNet and VGG networks are widely used either as feature extractors or as end-to-end architectures in biometric research fields [29, 36, 39]. For example, Nguyen et al. [35] used ResNet [21], VGGNet [33], etc., to extract image features for iris recognition. Yadav et al. [39] fused features extracted from an off-the-shelf VGG16 model with handcrafted Haralick features to detect iris presentation attacks. Therefore, we fine-tune the pre-trained ResNet50 [21] and VGG16 [33] to perform iris PAD. On the other hand, most generic models trained on the ImageNet dataset [12] have learned patterns different from those of iris images. Therefore, we additionally train a lightweight network architecture, MobileNetV3-Small [22], from scratch to target iris PAD. MobileNetV3-Small has only 2.25M parameters, which makes it suitable for deployment on mobile devices and for training on limited iris data, while ResNet50 has 25.64M parameters and VGG16 has 138M parameters. MobileNetV3 [22] uses depth-wise convolutions and squeeze-and-excitation modules to reduce the number of parameters while preserving accuracy. The training hyperparameters are listed in Table 3. In this work, we focus on the impact of various augmentation techniques and aim to discover the consistency of data augmentation effects, the augmentation selection protocols, and the fusion protocols under diverse network architectures and training strategies. Therefore, we intentionally selected a diverse set of networks and training protocols that have shown good performance on iris PAD in previous works [14, 16, 17, 39]. Hence, we fine-tune the ResNet50 and VGG16 networks and train MobileNetV3 from scratch following the experimental settings adopted in [14, 16, 17, 39].
4 Experimental setup
This section describes the datasets, the used parameters in the neural networks and data augmentation techniques, and the evaluation metrics.
4.1 Datasets
The experiments are conducted on the publicly available benchmark datasets used in the Iris-LivDet-2017 competition [42] to explore the impact of different data augmentation techniques on PAD performance. The Iris-LivDet-2017 competition [42] contains four datasets: Clarkson, Warsaw, Notre Dame, and IIITD-WVU. Because the Warsaw dataset is no longer publicly available, we use the remaining three datasets in our experiments. Furthermore, the Iris-LivDet-2017 datasets are designed for cross-PA, cross-sensor, and cross-dataset evaluation. Figure 1 presents iris samples from the training and test sets of each of the used datasets. The varying appearance between different datasets indicates the challenging nature of cross-dataset PAD. Table 2 summarizes the description of the used datasets, including the number of images in the training and test sets and the sensors.
Clarkson dataset The Clarkson dataset is designed for cross-PA evaluation. The test set contains additional unknown attack image types that are not present in the training set. The unknown data include visible-light printout attacks and extra-pattern contact lenses produced by different manufacturers. Bona fide visible-light images are present in neither the training set nor the test set.
Notre Dame dataset The Notre Dame dataset contains bona fide iris images (without lenses) and textured contact lens attacks. The test set is a combination of a known subset and an unknown subset, corresponding to the cross-PA scenario. The unknown subset includes iris images with textured lenses produced by different manufacturers (different patterns) that are not represented in the training data. Another difficulty of this dataset is the limited training data.
IIITD-WVU dataset The IIITD-WVU dataset is an amalgamation of two datasets: the IIITD dataset used for training and the WVU dataset for testing. The experiments performed on the IIITD-WVU dataset correspond to the cross-dataset evaluation because the sensors, data acquisition environments, subject population, and PA generation procedures for the training and testing are different. The training set (IIITD set) was selected from the IIIT-Delhi Contact Lens Iris (CLI) dataset [40] and IIITD Iris Spoofing (IIS) dataset [20], where the images were captured by multiple sensors under a controlled environment. The test set (WVU set) was captured using a mobile iris sensor under both controlled (indoor) and uncontrolled (outdoor) environments. The Iris-LivDet-2017 competition results [42] indicated that the cross-dataset evaluation was considered the most challenging task on account of the significant variations.
4.2 Parameters setting
To make our experimental setting compliant with the Iris-LivDet-2017 competition [42], we use the pre-defined training and test sets as described in Sect. 4.1. Additionally, 20% of the images are selected randomly from each training set to serve as a validation set during the training procedure. The training hyperparameters listed in Table 3 are used to fine-tune the ResNet50 [21] and VGG16 [33] networks and to train the MobileNetV3-Small [22] from scratch. The input size of the three networks is \(480 \times 640 \times 3\), where the grey-scale iris images are converted to three-channel images by replicating the same pixel values. The number of actual training epochs is controlled by early stopping: in our experiments, training stops when the validation loss has not decreased for ten epochs or the maximum number of training epochs is reached.
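The stopping rule above can be sketched as a small helper; this is an illustrative re-implementation of the stated criterion (Keras offers the equivalent EarlyStopping callback with patience=10), not the authors' exact code:

```python
def should_stop(val_losses, patience=10):
    """Stop once the validation loss has not improved for `patience`
    consecutive epochs (checked after every completed epoch)."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # Stop if none of the last `patience` losses beat the earlier best.
    return min(val_losses[-patience:]) >= best_before
```

For example, with patience=2, a loss history of [0.8, 0.9, 0.9, 0.9] triggers a stop, while a still-improving history such as [1.0, 0.9, 0.8] does not.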
The parameters of the augmentation techniques are listed in Table 4. An image can be shifted horizontally or vertically by a specific ratio of the image width or height; this ratio can range from 0 to 100% and is set to 10% in our case. A rotation augmentation randomly rotates the image clockwise between 0 and 360 degrees; we limited the maximum rotation to 15 degrees. Also, the brightness of the image can be augmented by randomly darkening or brightening it. The brightness argument ranges from 0 to 200%: the brightness is unchanged at 100%, values below 100% darken the image, and values above 100% brighten it. Furthermore, the iris image can be zoomed in/out by a specific ratio. The zoom argument ranges from 0 to 200%, where 100% leaves the image unchanged; in our experiments, we zoom the images randomly between 85% and 115%. Finally, more iris images can be produced by horizontal flipping. The code is implemented based on the Keras library.
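The transformations in Table 4 can be approximated in a few lines of NumPy. This is an illustrative re-implementation (the paper itself uses Keras's ImageDataGenerator); for brevity, the zoom here is anchored at the top-left corner rather than centred, and the shift wraps pixels around instead of filling them:

```python
import numpy as np

def horizontal_flip(img):
    # Mirror the image left-to-right.
    return img[:, ::-1]

def brightness(img, factor):
    # factor < 1.0 darkens, factor > 1.0 brightens (1.0 = unchanged).
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def zoom(img, factor):
    # Nearest-neighbour zoom in/out by `factor` (e.g. 0.85 to 1.15).
    h, w = img.shape[:2]
    rows = np.clip((np.arange(h) / factor).astype(int), 0, h - 1)
    cols = np.clip((np.arange(w) / factor).astype(int), 0, w - 1)
    return img[np.ix_(rows, cols)]

def horizontal_shift(img, ratio):
    # Shift by `ratio` of the image width (up to 10% in our setting).
    return np.roll(img, int(ratio * img.shape[1]), axis=1)
```

Vertical shift and rotation follow the same pattern along the other axis and via a rotation matrix, respectively.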
4.3 Evaluation metrics
The following metrics are used to measure the PAD algorithm performance:
- Attack Presentation Classification Error Rate (APCER): the proportion of attack images incorrectly classified as bona fide samples.

- Bona Fide Presentation Classification Error Rate (BPCER): the proportion of bona fide images incorrectly classified as attack samples.

- Average Classification Error Rate (ACER): the average of the APCER and BPCER.
The APCER, BPCER, and ACER follow the standard definitions presented in ISO/IEC 30107-3 [24]. The threshold used to decide whether an iris image is bona fide is 0.5, as defined in the Iris-LivDet-2017 protocol [42]. Moreover, the Detection Equal Error Rate (D-EER) and the BPCER value at a fixed APCER of \(1\%\) are reported for further analysis.
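Following the ISO/IEC 30107-3 definitions above, the three error rates can be computed as below; a minimal sketch assuming PAD scores express the probability of bona fide, with classification at the fixed 0.5 threshold:

```python
def pad_metrics(attack_scores, bona_fide_scores, threshold=0.5):
    """Return (APCER, BPCER, ACER) at a fixed decision threshold."""
    # APCER: attacks wrongly accepted as bona fide.
    apcer = sum(s >= threshold for s in attack_scores) / len(attack_scores)
    # BPCER: bona fides wrongly rejected as attacks.
    bpcer = sum(s < threshold for s in bona_fide_scores) / len(bona_fide_scores)
    return apcer, bpcer, (apcer + bpcer) / 2
```

The D-EER is then the error rate at the threshold where APCER and BPCER become equal, found by sweeping the threshold over the score range.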
Furthermore, we use the Fisher Discriminant Ratio (FDR) to examine the class separability (attack versus bona fide) achieved under different augmentation settings, as an indicator of classification generalizability. The FDR is described in [28] and [9] as a measurement of the separability between genuine and imposter scores. In our work, a high separation between bona fide and attack scores indicates higher reliability of the applied augmentation technique in the iris PAD system. The FDR is described in Equ. 3:

\(FDR = \frac{(\mu ^{bf} - \mu ^{a})^2}{(\sigma ^{bf})^2 + (\sigma ^{a})^2}\)    (3)

where \(\mu ^{bf}\) and \(\mu ^{a}\) are the respective mean values of the bona fide and attack scores, and \(\sigma ^{bf}\) and \(\sigma ^{a}\) are their standard deviations. We also analyze the differences in the augmentation-induced enhancement of different augmentation strategies with the help of the confusion matrix plotted based on the overlapping misclassified samples, as mentioned in Sect. 3.2. The details of this confusion matrix are described in Sect. 5.2.
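Equation 3 translates directly into code; a small sketch with NumPy and hypothetical score lists:

```python
import numpy as np

def fdr(bona_fide_scores, attack_scores):
    """Fisher Discriminant Ratio (Equ. 3): the larger the value, the
    better separated the bona fide and attack score distributions."""
    mu_bf, mu_a = np.mean(bona_fide_scores), np.mean(attack_scores)
    var_bf, var_a = np.var(bona_fide_scores), np.var(attack_scores)
    return (mu_bf - mu_a) ** 2 / (var_bf + var_a)
```

Well-separated, low-variance score distributions yield a high FDR, which is why the metric is threshold-independent, unlike ACER.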
5 Experiments evaluation
This section evaluates the augmentation techniques using the three models on three datasets in terms of the different metrics. In addition to the individual augmentation methods (see Tables 5, 6, 7, 8, 9, 10), we also report the results of the strategy-level and score-level combinations in Table 11. We also draw the ROC curves of the single augmentation techniques (Fig. 2) and the multiple fusion methods (Fig. 3). Furthermore, we analyze the overlapping misclassified images by employing the confusion matrix (as shown in Figs. 6, 7 and 8).
5.1 Results
In this subsection, we first analyze the results in terms of individual datasets per specific augmentation technique. Then, for further study, the fusion-based results are discussed. Finally, we compare our results with the SoTA algorithms for an overall analysis.
Clarkson Results Table 5 reports the iris PAD performance in terms of D-EER, the BPCER value at 1% APCER, and FDR. It can be observed that (1) translation, brightness, and horizontal flip augmentation produce better results in some cases, e.g., when applying the MobileNetV3 model, (2) however, not all augmentations improve the PAD performance, and (3) higher FDR values mostly coincide with lower D-EER values. By looking at Table 5 and Table 6 together, we find that the FDR value is a better indicator of a low ACER value than the D-EER metric.
Notre Dame Results Table 7 and Table 8 describe the iris PAD performance on the Notre Dame dataset. As shown in Table 7, the models fine-tuned without augmentation (ResNet50 and VGG16) outperform most augmentation settings. In contrast, the performance of MobileNetV3 (scratch) is mostly improved compared to training without any augmentation technique. Moreover, unlike the lowest ACER acquired by MobileNetV3 on the Clarkson dataset, ResNet50 achieved the best result (9.56% ACER) in Table 8 by using brightness augmentation on the Notre Dame dataset. The Clarkson and Notre Dame datasets both correspond to cross-PA scenarios that include unseen cosmetic lens patterns. However, the same network architectures show a significant difference: as shown in Table 6 and Table 8, ResNet50 performed worst on Clarkson and best on Notre Dame, whereas MobileNetV3 performed best on Clarkson and worst on Notre Dame. One possible reason for this opposite behavior is insufficient training data in the Notre Dame dataset (4937 training images in Clarkson versus 1200 in Notre Dame). Another possibility is the difference in the ratio of unknown PAs in the test sets (21.03% unknown attacks in the attack test subset of Clarkson versus 50% in Notre Dame). Considering these two reasons, we argue that models pre-trained on large-scale datasets may perform better on unseen pattern data when the training data for fine-tuning are insufficient. Besides, similar to the third finding on the Clarkson dataset, augmentation techniques with higher FDR values also achieve lower ACER values under a pre-defined threshold.
IIITD-WVU Results Table 9 and Table 10 report the results on the IIITD-WVU dataset, which corresponds to a challenging cross-dataset scenario. It can be observed that the D-EER and ACER values on IIITD-WVU are higher than those on the Clarkson and Notre Dame datasets. As shown in Fig. 2, when fixing the APCER values (x-axis), the ROC curves indicate that the IIITD-WVU dataset has higher BPCER values (the y-axis coordinate is 1-BPCER) than the Clarkson and Notre Dame datasets. Moreover, the variation between individual augmentation techniques is more pronounced on the IIITD-WVU dataset. In addition to such variations across datasets, the effects of augmentation techniques differ slightly across methods. For example, ResNet50 and VGG16 achieve better results with vertical shift on all datasets; however, the MobileNetV3 model performs worse when using vertical shift (see the corresponding AUC values). Looking at Table 5, horizontal shift and zoom yield better results with the VGG16 and MobileNetV3 networks. The lowest D-EER (9.26%) and the lowest ACER (10.05%) are achieved by the vertical shift when fine-tuning the ResNet50 model. Consistent with the observations on the Clarkson and Notre Dame datasets, a higher FDR value points to a lower ACER value in most cases. Therefore, we conclude that the FDR metric is more suitable than the D-EER metric for measuring the reliability and generalizability of PAD algorithms.
Fusion-based Results Table 11 presents the performance of the Best Single augmentation (BS) for each dataset and network, and four fusion-based methods: (1) STrategy-level fusion (ST) with all augmentations, (2) SCore-level fusion (SC) with all augmentations, (3) Least Overlap-based strategy-level fusion (\(LO_{ST}\)), and (4) Least Overlap-based score-level fusion (\(LO_{SC}\)). The augmentations used for \(LO_{ST}\) and \(LO_{SC}\) are selected by Algorithm 1 described in Sect. 3.2. It can be observed in Table 11 that strategy-level fusion is more likely to produce the best results than score-level fusion. For instance, the ST method obtains the lowest D-EER values on the Clarkson and Notre Dame datasets using the ResNet50 model, and \(LO_{ST}\) fusion achieves the best performance on Clarkson with VGG16 and on IIITD-WVU with the MobileNetV3 network. Moreover, for the VGG16 and MobileNetV3 networks, our augmentation selection protocol achieves one of the two lowest ACERs in five of the six experimental setups. Although a pre-defined threshold can influence the ACER value, a higher FDR value consistently suggests a lower ACER value. Therefore, the higher the FDR value, the higher the reliability of the PAD algorithm.
Comparison with SoTAs We also compare our results with several SoTA algorithms in Table 12. The first three rows are the winners of the Iris-LivDet-2017 competition [42], followed by four of the latest SoTAs, and then the best results of our three networks, respectively. A detailed description of the competition and the SoTA algorithms is presented in Sect. 2. The Meta-Fusion [27] approach combined 61 CNNs via SVM meta-fusion to classify multiple BSIF views of the iris images. The D-NetPAD method [31] adopted a DenseNet model pre-trained on a private combined iris dataset; the authors also trained a DenseNet model on the competition datasets from scratch, and we report these scratch D-NetPAD results for a fair comparison on the same data resources. The MLF method [13] fused information from multiple network layers to make a PAD decision. The MSA approach [14, 17] focuses on artifact differences in the image dynamics around the iris/sclera border by extracting information from micro-stripes. Because MLF and MSA do not report results on the Clarkson dataset, we mark '-' in Table 12.
For the Clarkson dataset, the lowest ACER value (0.84%) is produced by the MobileNetV3 network trained with the horizontal shift augmentation. For the IIITD-WVU dataset, our ResNet50 model trained with vertically shifted data achieves the best result with an ACER value of 10.05%. However, the MLF [13] method achieves the best results on the Notre Dame dataset, while our solutions perform worse than the Anon1, D-NetPAD, Meta-Fusion, and MSA methods. Due to the lack of training data in the Notre Dame dataset (1,200 training samples vs. 3,600 test samples), the model still overfits even though data augmentation improves the results. Considering all the previous results, we conclude that shift augmentation is worth attempting to improve PAD performance, and that fusing various augmentations at the strategy level is a good starting point for iris PAD.
Cross-dataset evaluation In addition to the inter-dataset evaluation, we also report cross-dataset results in terms of D-EER, ACER, and FDR values in Table 13. In the cross-dataset scenario, the training data are the training subset of one dataset, while the test data are the test subsets of the other two datasets. For instance, the model trained on the Clarkson dataset is used to produce prediction scores on the test subsets of the Notre Dame and IIITD-WVU datasets. The threshold is set to 0.5 as defined in the Iris-LivDet-2017 competition protocol. To demonstrate the generalizability of the different fusion strategies, we provide the results generated by the BS, ST, SC, \(LO_{ST}\), and \(LO_{SC}\) settings, similar to the inter-dataset results in Table 11. In addition to the fusion methods, the results of training without any augmentation technique (denoted as No) are also reported for comparison. The bold values in the D-EER and ACER columns are the two lowest error rates, and the bold values in the FDR column indicate the Top-2 separability measured by FDR. For further comparison, we also provide a visual representation of the D-EER values achieved by the different experimental settings in Fig. 4 and the ROC curves in Fig. 5. From Table 13, it can be concluded that: (1) training without augmentation techniques performs worse than using augmentations in most cases; (2) the BS and ST methods achieve one of the two lowest ACER values in half of the experimental setups; (3) the SC augmentation method obtains one of the lowest D-EER values in nine of the eighteen experimental setups; notably, eight of the nine lowest D-EER values are produced by the fine-tuned ResNet50 and VGG16. Furthermore, the reliability of the FDR value is consistent with the observation from the previous inter-dataset results that a higher FDR value hints at a lower ACER value, even though the ACER value can be affected by a pre-defined threshold.
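The metrics used throughout these tables follow standard definitions: ACER is the mean of APCER and BPCER at a fixed threshold (ISO/IEC 30107-3 [24]), and FDR is the Fisher Discriminant Ratio measuring the separability of the two score distributions [28]. A minimal sketch, assuming PAD scores in [0, 1] with higher values indicating bona fide, and using hypothetical score values:

```python
import numpy as np

def acer(attack_scores, bonafide_scores, thr=0.5):
    """ACER per ISO/IEC 30107-3: mean of APCER (attacks accepted as
    bona fide) and BPCER (bona fides rejected as attacks)."""
    apcer = np.mean(attack_scores >= thr)   # attacks scoring as bona fide
    bpcer = np.mean(bonafide_scores < thr)  # bona fides scoring as attack
    return (apcer + bpcer) / 2

def fdr(attack_scores, bonafide_scores):
    """Fisher Discriminant Ratio: squared mean gap over summed
    variances; higher values mean better-separated distributions."""
    mu_a, mu_b = attack_scores.mean(), bonafide_scores.mean()
    return (mu_b - mu_a) ** 2 / (attack_scores.var() + bonafide_scores.var())

attacks   = np.array([0.1, 0.2, 0.3, 0.6])  # one attack misclassified
bonafides = np.array([0.7, 0.8, 0.9, 0.4])  # one bona fide misclassified
print(acer(attacks, bonafides))  # APCER = BPCER = 0.25 -> ACER = 0.25
```

This also illustrates why ACER depends on the pre-defined threshold (here 0.5, per the competition protocol) while FDR does not, which is why a high FDR is the more threshold-robust indicator of reliability.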
It can also be noticed in Table 13 that training MobileNetV3 from scratch with \(LO_{ST}\) performs better than with the other augmentation strategies in most cases. A similar observation can be made in Fig. 4: the SC (yellow) and \(LO_{SC}\) (green) methods achieve lower D-EER values than the ST (grey) and \(LO_{ST}\) (navy blue) methods for the ResNet50 and VGG16 networks. In contrast, SC and \(LO_{SC}\) produce higher D-EER values than ST and \(LO_{ST}\) for the MobileNetV3 network. One possible reason is the different training strategies of the networks.
5.2 Analysis and discussion
This section explores whether different augmentations lead to the same or different kinds of performance improvement. To do that, we analyze the overlap of misclassified samples between the different augmentation protocols, including the four fusion methods, with the help of confusion matrices. Furthermore, the limitations and potentials of our analyses are discussed. The confusion matrices for each dataset are shown in Figs. 6, 7 and 8. The horizontal axis (X axis) from left to right and the vertical axis (Y axis) from top to bottom correspond to the augmentation strategies: No, Shift\(_{h}\), Shift\(_{v}\), Rotation, Brightness, Zoom, Flip\(_{h}\), ST, SC, \(LO_{ST}\) and \(LO_{SC}\), respectively.
The matrices from left to right are generated by ResNet50, VGG16, and MobileNetV3, respectively. The values in the top matrices refer to the misclassified-attacks overlap ratio \(O_{A_{pq}}^{a}\) computed as in Eq. (1a), and the bottom matrices present the misclassified bona fide overlap ratio \(O_{A_{pq}}^{bf}\) computed as in Eq. (1b) in Sect. 3.1.
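The overlap computation can be sketched as follows. The exact form of Eqs. (1a) and (1b) is given in Sect. 3.1; the snippet below assumes an intersection-over-union style ratio between the sets of samples misclassified by two strategies p and q, and the sample IDs are hypothetical.

```python
def overlap_ratio(misclassified_p, misclassified_q):
    """Overlap of misclassified samples between two augmentation
    strategies p and q, computed here as intersection over union.
    (An assumption; the paper's exact definition is Eq. (1a)/(1b).)"""
    p, q = set(misclassified_p), set(misclassified_q)
    if not p | q:          # neither strategy misclassified anything
        return 0.0
    return len(p & q) / len(p | q)

# Hypothetical sample IDs misclassified by two strategies; the same
# computation is run separately for attacks (1a) and bona fides (1b).
shift_errors    = {"img_01", "img_07", "img_12"}
rotation_errors = {"img_07", "img_12", "img_20", "img_33"}
print(overlap_ratio(shift_errors, rotation_errors))  # 2 / 5 = 0.4
```

A low ratio means the two strategies fail on largely different samples, which is exactly the property the least overlap-based selection protocol exploits.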
As can be seen from the previous results, different augmentation strategies improve the performance on different datasets; shift, rotation, and horizontal flip play a relatively prominent role. Most overlap values in the confusion matrix plots lie between 0.2 and 0.7. In general, lower overlap rates indicate that different augmentation techniques help the model adapt to different variations in the iris samples. As shown in Figs. 6, 7 and 8, the misclassification overlap rate of the MobileNetV3 network is lower (lighter blue) than that of ResNet50 and VGG16 for each dataset. A general observation from Figs. 6, 7 and 8 is that the fusion of multiple augmentation techniques (all of them, or those chosen by our proposed augmentation selection protocol), especially at the score level (SC and \(LO_{SC}\)), leads to a higher overlap with the basic augmentation methods. This indicates our success in addressing a larger number of variations in the data simultaneously. This is not the case when we apply the strategy-level fusion method, as the multiple augmentation methods used in the training phase might cause confusion.
Summing up all the results, we can see that training with augmentation techniques significantly improves PAD performance compared to training only with the original data. Each augmentation method plays a positive role on a particular dataset or network. Shift augmentation performs better than the other methods in most cases. However, the results are not perfectly consistent across all networks, augmentation techniques, and datasets. One improvement would be to preserve the generated images in memory rather than augmenting them randomly and feeding them to the network during training. The advantage is knowing the exact numbers of original and augmented samples afterwards, whereas the drawback is the higher hardware requirement. Data augmentation techniques fall into two general categories: data warping and oversampling. Because images generated by oversampling methods, e.g., by a GAN, should themselves be detected as attack images, oversampling could easily exacerbate the imbalance in the data. For iris PAD, only data warping can therefore be applied to augment the training data. However, there is no consensus on the best augmentation strategy, and especially not on the best way to combine strategies, because the intrinsic biases in the capture environment, subject population, and scale differ between datasets. Consequently, a first line of future work is to learn an optimal augmentation strategy automatically. We also need to find an optimal dataset size after augmentation by balancing the chosen strategy against the memory available for storing augmented images. Moreover, the imbalance between bona fide and attack samples could be addressed.
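The data-warping operations discussed above (shift, flip, brightness) are simple array transforms. A minimal NumPy sketch, assuming grayscale images stored as float arrays in [0, 1]; in practice a framework generator (e.g., the Keras API cited in the Notes) would apply such transforms on the fly with random parameters:

```python
import numpy as np

def shift_h(img, px):
    """Horizontal shift by px pixels; vacated columns are zero-filled."""
    out = np.zeros_like(img)
    if px > 0:
        out[:, px:] = img[:, :-px]
    elif px < 0:
        out[:, :px] = img[:, -px:]
    else:
        out = img.copy()
    return out

def flip_h(img):
    """Horizontal flip (mirror along the vertical axis)."""
    return img[:, ::-1]

def brightness(img, factor):
    """Scale pixel intensities, clipping back into the valid range."""
    return np.clip(img * factor, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((4, 4))  # toy 4x4 "iris" image
augmented = [shift_h(img, 1), flip_h(img), brightness(img, 1.2)]
```

Storing the outputs of such transforms (the in-memory variant mentioned above) trades higher memory use for exact bookkeeping of how many augmented samples the network actually saw.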
6 Conclusion
This paper addresses a clear research gap by providing an in-depth analysis of the role of data augmentation in iris PAD. Data augmentation is a crucial step in addressing the limited availability of iris attack data. We explore the impact of widely used data augmentation strategies, and of two combination methods (strategy-level and score-level), on the generalization of iris PAD. We also propose a least overlap-based augmentation selection protocol that aims to bring different types of wrongly classified samples into the correct classification. It is based on a detailed analysis of the overlap between the effects of different augmentation techniques. The experiments are performed on three datasets of the Iris-LivDet-2017 competition [42] and with three neural networks for comparison and analysis. The experimental results linked certain data augmentation methods to significant enhancements of generalizability and indicated the relatively low-overlapping effect of these augmentations.
Notes
Keras: A high-level NNs API (https://keras.io/).
References
Baird, H.S.: Document image defect models and their uses. In: 2nd International Conference on Document Analysis and Recognition, pp. 62–67. IEEE Computer Society, Tsukuba City, Japan (1993)
Bakshi, S., Mehrotra, H., Majhi, B.: Postmatch pruning of SIFT pairs for iris recognition. Int. J. Biom. 5(2), 160–180 (2013). https://doi.org/10.1504/IJBM.2013.052965
Barpanda, S.S., Sa, P.K., Marques, O., Majhi, B., Bakshi, S.: Iris recognition with tunable filter bank based feature. Multim. Tools Appl. 77(6), 7637–7674 (2018). https://doi.org/10.1007/s11042-017-4668-z
Boutros, F., Damer, N., Raja, K.B., Ramachandra, R., Kirchbuchner, F., Kuijper, A.: Iris and periocular biometrics for head mounted displays: Segmentation, recognition, and synthetic data generation. Image Vis. Comput. 104, 104007 (2020)
Boutros, F., Damer, N., Raja, K.B., Ramachandra, R., Kirchbuchner, F., Kuijper, A.: On benchmarking iris recognition within a head-mounted display for AR/VR applications. In: 2020 IEEE International Joint Conference on Biometrics, IJCB 2020, pp. 1–10. IEEE (2020)
Chen, C., Ross, A.: A multi-task convolutional neural network for joint iris detection and presentation attack detection. In: 2018 IEEE Winter Applications of Computer Vision Workshops, WACV Workshops 2018, Lake Tahoe, NV, USA, March 15, 2018, pp. 44–51. IEEE Computer Society (2018). https://doi.org/10.1109/WACVW.2018.00011
Choudhary, M., Tiwari, V., Uduthalapally, V.: Iris presentation attack detection based on best-k feature selection from YOLO inspired RoI. Neural Comput. Appl. (2020)
Czajka, A., Bowyer, K.W.: Presentation attack detection for iris recognition: An assessment of the state-of-the-art. ACM Comput. Surv. 51(4), 86:1-86:35 (2018). https://doi.org/10.1145/3232849
Damer, N., Opel, A., Nouak, A.: Biometric source weighting in multi-biometric fusion: Towards a generalized and robust solution. In: 22nd European Signal Processing Conference, EUSIPCO 2014, Lisbon, Portugal, September 1-5, 2014, pp. 1382–1386. IEEE (2014)
Dao, T., Gu, A., Ratner, A., Smith, V., Sa, C.D., Ré, C.: A kernel theory of modern data augmentation. In: K. Chaudhuri, R. Salakhutdinov (eds.) Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, Proceedings of Machine Learning Research, vol. 97, pp. 1528–1537. PMLR (2019)
Das, P., McGrath, J., Fang, Z., Boyd, A., Jang, G., Mohammadi, A., Purnapatra, S., Yambay, D., Marcel, S., Trokielewicz, M., Maciejewicz, P., Bowyer, K.W., Czajka, A., Schuckers, S., Tapia, J.E., Gonzalez, S., Fang, M., Damer, N., Boutros, F., Kuijper, A., Sharma, R., Chen, C., Ross, A.: Iris liveness detection competition (livdet-iris) - the 2020 edition. In: 2020 IEEE International Joint Conference on Biometrics, IJCB 2020, Houston, TX, USA, September 28 - October 1, 2020, pp. 1–9. IEEE (2020).https://doi.org/10.1109/IJCB48548.2020.9304941
Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pp. 248–255. IEEE Computer Society (2009)
Fang, M., Damer, N., Boutros, F., Kirchbuchner, F., Kuijper, A.: Deep learning multi-layer fusion for an accurate iris presentation attack detection. In: IEEE 23rd International Conference on Information Fusion, FUSION 2020, Rustenburg, South Africa, July 6-9, 2020, pp. 1–8. IEEE (2020). https://doi.org/10.23919/FUSION45008.2020.9190424
Fang, M., Damer, N., Boutros, F., Kirchbuchner, F., Kuijper, A.: Cross-database and cross-attack iris presentation attack detection using micro stripes analyses. Image Vis. Comput. 105, 104057 (2021). https://doi.org/10.1016/j.imavis.2020.104057
Fang, M., Damer, N., Boutros, F., Kirchbuchner, F., Kuijper, A.: Iris presentation attack detection by attention-based and deep pixel-wise binary supervision network. In: 2021 IEEE International Joint Conference on Biometrics, IJCB 2021, Shenzhen, China, Aug.4 - 7, 2021. IEEE (2021)
Fang, M., Damer, N., Kirchbuchner, F., Kuijper, A.: Demographic bias in presentation attack detection of iris recognition systems. In: 28th European Signal Processing Conference, EUSIPCO 2020, Amsterdam, Netherlands, January 18-21, 2021, pp. 835–839. IEEE (2020). https://doi.org/10.23919/Eusipco47968.2020.9287321
Fang, M., Damer, N., Kirchbuchner, F., Kuijper, A.: Micro stripes analyses for iris presentation attack detection. In: 2020 IEEE International Joint Conference on Biometrics, IJCB 2020, Houston, TX, USA, September 28 - October 1, 2020, pp. 1–10. IEEE (2020). https://doi.org/10.1109/IJCB48548.2020.9304886
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, K.Q. Weinberger (eds.) Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc. (2014). http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
Gragnaniello, D., Sansone, C., Poggi, G., Verdoliva, L.: Biometric spoofing detection by a domain-aware convolutional neural network. In: 2016 12th International Conference on Signal-Image Technology Internet-Based Systems (SITIS), pp. 193–198 (2016). https://doi.org/10.1109/SITIS.2016.38
Gupta, P., Behera, S., Vatsa, M., Singh, R.: On iris spoofing using print attack. In: 22nd International Conference on Pattern Recognition, ICPR 2014, Stockholm, Sweden, August 24-28, 2014, pp. 1681–1686. IEEE Computer Society (2014). https://doi.org/10.1109/ICPR.2014.296
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE CVPR, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society (2016)
Howard, A., Pang, R., Adam, H., Le, Q.V., Sandler, M., Chen, B., Wang, W., Chen, L., Tan, M., Chu, G., Vasudevan, V., Zhu, Y.: Searching for mobilenetv3. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 1314–1324. IEEE (2019)
Hu, B., Lei, C., Wang, D., Zhang, S., Chen, Z.: A preliminary study on data augmentation of deep learning for image classification. CoRR arXiv:1906.11887 (2019)
International Organization for Standardization: ISO/IEC DIS 30107-3:2016: Information Technology – Biometric presentation attack detection – P. 3: Testing and reporting (2017)
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: F. Bach, D. Blei (eds.) Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 37, pp. 448–456. PMLR, Lille, France (2015)
Kohli, N., Yadav, D., Vatsa, M., Singh, R., Noore, A.: Synthetic iris presentation attack using idcgan. In: 2017 IEEE International Joint Conference on Biometrics, IJCB 2017, Denver, CO, USA, October 1-4, 2017, pp. 674–680. IEEE (2017). https://doi.org/10.1109/BTAS.2017.8272756
Kuehlkamp, A., Pinto, A., Rocha, A., Bowyer, K.W., Czajka, A.: Ensemble of multi-view learning classifiers for cross-domain iris presentation attack detection. IEEE Transactions on Information Forensics and Security 14(6), 1419–1431 (2019)
Lorena, A.C., de Leon Ferreira de Carvalho, A.C.P.: Building binary-tree-based multiclass classifiers using separability measures. Neurocomputing 73(16–18), 2837–2845 (2010). https://doi.org/10.1016/j.neucom.2010.03.027
Nguyen, D.T., Pham, T.D., Lee, Y., Park, K.R.: Deep learning-based enhanced presentation attack detection for iris recognition by combining features from local and global regions based on NIR camera sensor. Sensors 18(8), 2601 (2018)
Raghavendra, R., Raja, K.B., Busch, C.: Contlensnet: Robust iris contact lens detection using deep convolutional neural networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, Santa Rosa, CA, USA, March 24-31, 2017, pp. 1160–1167. IEEE Computer Society (2017). https://doi.org/10.1109/WACV.2017.134
Sharma, R., Ross, A.: D-NetPAD: An explainable and interpretable iris presentation attack detector. In: 2020 IEEE International Joint Conference on Biometrics, IJCB 2020, Sep. 28 - Oct. 1, 2020, online conference. arXiv:2007.01381 (2020)
Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for deep learning. J. Big Data 6, 60 (2019). https://doi.org/10.1186/s40537-019-0197-0
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Y. Bengio, Y. LeCun (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(56), 1929–1958 (2014)
Thanh, K.N., Fookes, C., Ross, A., Sridharan, S.: Iris recognition with off-the-shelf CNN features: A deep learning perspective. IEEE Access 6, 18848–18855 (2018). https://doi.org/10.1109/ACCESS.2017.2784352
Tolosana, R., Gomez-Barrero, M., Busch, C., Ortega-Garcia, J.: Biometric presentation attack detection: Beyond the visible spectrum. IEEE Trans. Information Forensics and Security 15, 1261–1275 (2020)
Wei, Z., Tan, T., Sun, Z.: Synthesis of large realistic iris databases using patch-based sampling. In: 19th International Conference on Pattern Recognition (ICPR 2008), December 8-11, 2008, Tampa, Florida, USA, pp. 1–4. IEEE Computer Society (2008). https://doi.org/10.1109/ICPR.2008.4761674
Wong, S.C., Gatt, A., Stamatescu, V., McDonnell, M.D.: Understanding data augmentation for classification: When to warp? In: 2016 International Conference on DICTA, 2016, Gold Coast, Australia, November 30 - December 2, 2016, pp. 1–6. IEEE (2016)
Yadav, D., Kohli, N., Agarwal, A., Vatsa, M., Singh, R., Noore, A.: Fusion of handcrafted and deep learning features for large-scale multiple iris presentation attack detection. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 572–579. IEEE Computer Society (2018). https://doi.org/10.1109/CVPRW.2018.00099
Yadav, D., Kohli, N., Doyle Jr., J.S., Singh, R., Vatsa, M., Bowyer, K.W.: Unraveling the effect of textured contact lenses on iris recognition. IEEE Trans. Inf. Forensics Secur. 9(5), 851–862 (2014). https://doi.org/10.1109/TIFS.2014.2313025
Yadav, S., Chen, C., Ross, A.: Synthesizing iris images using rasgan with application in presentation attack detection. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 2422–2430. Computer Vision Foundation / IEEE (2019). https://doi.org/10.1109/CVPRW.2019.00297
Yambay, D., Becker, B., Kohli, N., Yadav, D., Czajka, A., Bowyer, K.W., Schuckers, S., Singh, R., Vatsa, M., Noore, A., Gragnaniello, D., Sansone, C., Verdoliva, L., He, L., Ru, Y., Li, H., Liu, N., Sun, Z., Tan, T.: Livdet iris 2017 - iris liveness detection competition 2017. In: 2017 IEEE IJCB, Denver, CO, USA, October 1-4, 2017, pp. 733–741. IEEE (2017)
Open Access
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This research work has been funded by the German Federal Ministry of Education and Research and the Hessen State Ministry for Higher Education, Research and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE.
Fang, M., Damer, N., Boutros, F. et al. The overlapping effect and fusion protocols of data augmentation techniques in iris PAD. Machine Vision and Applications 33, 8 (2022). https://doi.org/10.1007/s00138-021-01256-9