1 Introduction

Iris recognition (IR) has attracted vigorous research interest owing to its distinctive properties, such as a rich morphological structure, uniqueness across individuals (even twins), and stability of micro-features with age [1]. Nevertheless, IR systems are susceptible to presentation attacks that attempt to undermine application security. These attacks present forged or deliberately fabricated iris patterns to the iris camera/sensor to subvert the operation of the IR system [2]. They may be used to enroll fabricated irises, deliberately obscure a person’s identity, or even impersonate the iris pattern of another person [3]. Iris patterns can be reproduced in several ways, such as textured contact lenses, printed iris images, artificial eyeballs, replaying iris images/videos on an LCD, and drug-induced pupil manipulation [2, 3], as depicted in Fig. 1. As IR systems are increasingly deployed in security-critical applications, e.g., border control and airport security, there is a pressing need for mechanisms that detect presentation attacks. With this motivation, various presentation attack detection (PAD) mechanisms have been introduced in the literature [2].

Fig. 1
figure 1

Depiction of various iris presentation attack samples (bottom row) corresponding to live iris (top row). The iris liveness detection methods are supposed to discriminate between these two categories

Current iris PAD approaches are categorized as either sensor-based or image-based. Sensor-based approaches generally incorporate additional hardware to acquire visual or physical properties of the eye [4, 5], whereas image-based methods analyze the micro-structures within the iris image through handcrafted descriptors and a classifier [6]. Typically, an image-based method takes an ocular image captured by the iris sensor, segments the iris region, extracts local and/or global features, and categorizes the image as “live” or “attack” through a classifier. Several feature descriptors exist in the literature, such as local binary patterns (LBP) [7], binarized statistical image features (BSIF) [8], and scale-invariant descriptors (SID) [9], to construct pixel-level features. Recent research extensively utilizes convolutional neural networks (CNNs) for self-learned features to realize PAD [10, 11]. The variability of the micro-structures of live iris and attack samples results in many distinct patterns within the same class; thus, a purposely designed handcrafted feature may be incapable of capturing all possible patterns [12]. Besides, several methods include iris segmentation as a crucial stage before local feature extraction from the segmented region [3]. Iris segmentation locates the inner and outer iris boundaries in the image. Nevertheless, it suffers from some problems since the iris is not essentially circular; in fact, it has no fixed shape. Therefore, detecting the iris boundaries without any pattern loss is extremely challenging. Consequently, a PAD scheme relying on such intrinsic segmentation techniques is not robust [12].

The Iris Liveness Detection (LivDet-Iris) competition series began in 2013 to examine evolving PAD algorithms and to reveal the state of progress in iris PAD. The most recent edition was held in 2017 [10], and it uncovered some interesting open problems, e.g., cross-sensor and cross-dataset (also known as cross-domain) evaluation in the context of iris PAD. With the aim of improving cross-domain iris PAD, this paper introduces an image-based PAD scheme, depicted in Fig. 2. It begins with region of interest (RoI) localization, which is carried out by a prominent CNN framework, DarkNet-19 [13], originally designed for generic object detection. This model predicts the spatial dimensions of the rectangular box that tightly bounds the RoI, and the RoI is then cropped from the image according to the predicted box using OpenCV in Python. In the next step, selective image enhancement is performed over the RoI of the given iris image to remove blurriness and magnify the pixel intensities [14, 15]. The enhanced RoI is then fed to various handcrafted and data-driven algorithms to extract key features and produce the corresponding feature-vectors. Finally, an optimal feature set is obtained through a Friedman test based feature selection approach and is fused using score-level fusion for the final attack prediction.
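The cropping step above can be illustrated with a short OpenCV sketch; the box format (center coordinates plus width and height) and the function name are assumptions for illustration rather than the exact implementation.

```python
import cv2

def crop_roi(image_path, box):
    """Crop the localized iris RoI; box = (cx, cy, w, h) in pixels (assumed format)."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    cx, cy, w, h = box
    x1, y1 = max(int(cx - w / 2), 0), max(int(cy - h / 2), 0)
    x2, y2 = min(int(cx + w / 2), img.shape[1]), min(int(cy + h / 2), img.shape[0])
    return img[y1:y2, x1:x2]          # NumPy slicing performs the crop
```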

Fig. 2
figure 2

Flowchart of the proposed approach. The region of interest (RoI) is identified using the YOLO framework, then image enhancement is carried out on the selected region. Next, various local and global feature descriptors are employed to extract features from the enhanced RoI. Further, the top-k features are selected through the Friedman test and are combined using score-level fusion for attack prediction

1.1 The motivation behind the proposed approach

This work is based on some important observations related to iris PAD. Most iris presentation attacks primarily alter the iris region rather than the entire eye image [2], and the alteration caused by such attacks is most evident within the iris region. Based on these observations, we conclude that analyzing the entire eye image for feature extraction is not desirable. Instead, it is beneficial to identify the region of interest (RoI) within the eye image, where the most discriminative features exist for all possible presentation attacks [11]. The majority of existing iris segmentation approaches [16, 17] follow handcrafted procedures to detect the inner and outer iris boundary pixels. However, such procedures need a set of empirically defined parameters and, thus, are less generalizable to varying image categories. Besides, the circular Hough transform and the integro-differential operator for detecting iris and pupil boundaries work well on high-quality images but are less robust to blurred and noisy images [14]. In this vein, some approaches [12] apply filters directly on raw iris images for feature extraction, whereas others [3, 6] down-sample the images to their required input size before feature extraction. Both cases are inadequate, because feature extraction from raw images leads to a huge amount of computation and possibly redundant features. A previous study [14] reported a comparison between RoI images and raw eye images using accuracy as the performance metric, where the accuracy on RoI images is substantially higher than on raw images. The authors reasoned that heavy eyelids and eyelashes may occlude the informative iris regions for feature extraction and may cause intra-class variations. Besides, down-sampling images to the required input size (e.g., VGG-16 requires a 224 × 224 input) results in significant spatial feature loss [15]. Therefore, RoI detection and segmentation from the given eye image is an adequate choice for constructing better features. The choice of the YOLO framework is inspired by the speed gain achieved at test time, as it analyzes the given image only once, unlike other object localization models such as region proposal CNNs and Faster RCNN with repetitive image processing [18]. It also exhibits better generalizability and fewer errors (less than half of the background errors compared to Fast and Faster RCNN) [19]. Speed is particularly important because iris localization is an integral step of the test-time procedure of the proposed approach.

The comprehensive study of computer vision and image classification suggests that fusing multiple features substantially enhances classification performance compared to a single feature [20,21,22]. A similar concept has been adopted in iris-related applications, where handcrafted and data-driven features are combined to construct an enhanced feature set [12, 20, 22]. However, the entire feature set may contain redundant or less important features, which should be removed to obtain an optimal feature set [23].

1.2 Contribution

YOLO- and CNN-based approaches have already been used for RoI detection and feature extraction in iris PAD. However, to the best of our knowledge, no previous work focuses on optimal feature selection out of multiple handcrafted and data-driven features and deploys it in the cross-domain environment. The integration of handcrafted and data-driven features is aimed at exploiting their respective benefits and building an iris PAD algorithm with appropriate generalizability to various attack categories. The use of multiple algorithms provides key features extracted with diverse views of the data, since each method examines the features from a distinct perspective. Specifically, the use of a deep CNN model for RoI detection provides the flexibility to adapt to varying image qualities without extra parameter adjustments. Additionally, we present a novel insight into the well-established Friedman test, which may be used for optimal feature selection by examining each feature-vector against the corresponding output. The score-level fusion of the optimal features yields a fair contribution of each feature to the attack prediction. The major contributions and novelties of this paper can be summarized as follows:

  • A novel approach that employs the YOLO model for iris region localization, since it is fast and accurate in predicting the RoI at test time.

  • An algorithm that employs multiple handcrafted and CNN-based methods for feature extraction in order to perceive key features from multiple perspectives.

  • A new feature selection mechanism based on the Friedman test that examines each feature-vector with output labels on distinct databases to enhance the robustness of the optimal feature set.

  • A comprehensive cross-domain assessment of the proposed PAD approach on datasets currently used to evaluate the state of the art in the field of iris PAD.

  • The proposed PAD approach outperforms the winner of LivDet-Iris-2017 (the most recently conducted iris liveness detection competition).

The remainder of this paper is structured as follows. Section 2 reviews the literature on progress in iris PAD together with the current issues. Section 3 thoroughly describes the proposed scheme and the various phases involved in processing the iris images. Section 4 describes the datasets and validation protocols used to assess the proposed method, along with the experimental outcomes and discussion. Finally, Section 5 concludes the work.

2 Related work summary

Over the last two decades, the vulnerability of IR systems to presentation attacks has attracted considerable research interest. Presentation attack detection can be carried out at the sensor level, pixel level, or algorithm level. At the sensor level, specific designs of iris cameras/sensors can simplify live/fake iris detection. Lee et al. [5] addressed PAD by inspecting the specular spots produced by a collimated infrared light-emitting diode (IR-LED). However, this approach is incapable of identifying contact lenses, as the visibility of the iris texture worsens when a lens is worn. Further, the authors of [24] incorporated algorithms based on pupil dynamics to detect forged irises, but the approach fails to identify textured lenses and artificial irises. Sensor-level schemes may extensively acquire the ocular properties of the legitimate iris pattern; however, their generalization capability is limited, as they require specially designed sensors and depend upon special hardware functionalities.

Conversely, pixel-level PAD schemes do not demand special iris sensors; instead, they exploit the optical and textural features of the iris image to classify live/attack samples. Therefore, techniques utilizing local descriptors to scrutinize iris micro-structures are particularly attractive [25, 26]. In this context, Daugman [27] introduced a real-time PAD system in which extra peaks in the Fourier amplitude spectrum, computed via the 2-D Fourier transform, can be recognized for a cosmetic lens but do not occur in the spectrum of a real iris. He et al. [7] suggested utilizing local binary patterns (LBP) for contact lens-based PAD, where LBPs are extracted from six related iris subregions and an AdaBoost algorithm identifies the principal LBP features. Several other feature descriptors, such as co-occurrence of adjacent LBP (CoA LBP) [28], DAISY [29], and HOG [30], have also been proposed in the literature. In addition, the authors in [25] demonstrated an in-depth investigation of spoofing attacks on IR systems and employed multiscale BSIF for feature extraction from iris images, focusing primarily on printed iris images and iris video images captured from an LCD. Furthermore, the authors in [31] proposed a novel scheme to collect features from the pupil and sclera regions, where LBPs are independently extracted from several regions of normalized iris images and then concatenated to discriminate between “attack” and “live” samples. The authors in [32] jointly utilized frequency analysis and additional quality features for printed iris and cosmetic lens detection. Moreover, Sharifi et al. [33] detect cosmetic alterations on iris and face images by exploiting a combination of micro-texton and color-space information to capture edges, spots, curves, etc. Additionally, the change in the variability scores conveyed by fake and real textures is exploited as a distinguishing factor. It is noticeable that the aforementioned schemes rely on handcrafted feature extraction to produce iris codes and observe the deviation of attack samples from their genuine counterparts. However, a similar effect could be attained by learning the attack patterns within the raw eye image. Precisely, matching the features or score distribution of fake irises with legitimate counterparts is not sufficient. Besides, in contrast to handcrafted features, self-feature learning can adapt to identify attack patterns. Such approaches are described below.

An adequate framework adhering to the self-feature learning (data-driven) paradigm is the convolutional neural network (CNN). Menotti et al. [34] designed a three-layer CNN, namely SpoofNet, as a liveness detection model for fingerprints, iris, and face. The model is capable of extracting iris features and deriving semantic and visual features from raw iris images. Next, an analogous architecture was used in [35] to discriminate among normal, soft, and textured lenses. However, such models exhibited reduced accuracy owing to their shallow architectures. Further, in the LivDet 2017 challenge [10], a seven-layer inception-based CNN architecture was submitted to distinguish between live (real) and fake irises. Another multi-patch CNN model was proposed by He et al. [36], which is trained on 28 patches of real and attack samples; the respective outputs of all patches are gathered and fed to the decision layer to classify live and fake irises. However, the computational expense increases, since a single training step entails 28 CNN operations. In addition, Choudhary et al. [6] exploited the DenseNet121 model with some customizations for feature extraction and an SVM classifier for classification among iris contact lenses. However, these models map the given lens-category images to the respective class instead of considering the entire dataset. Besides, Chen et al. [11] introduced a multi-task CNN-based framework that concurrently detects the iris region and presentation attacks in terms of probability. Notice that all these methods incorporate a single feature extractor, i.e., classification is carried out based on a single feature vector. However, as presentation attacks may alter the real iris in several respects, analyzing them from a single perspective does not yield satisfactory results. Therefore, the authors in [21] proposed the premise that a pool of complementary features results in significantly enhanced discrimination. They presented a feature selection and fusion network using six different local features, i.e., LBP, HOG, CoA LBP, BSIF, SID, and DAISY, together with an eight-layer VGG model. However, the authors focused merely on textured lens-based PAD, leaving other categories of presentation attacks unexplored. Besides, Yadav et al. [22] described the feature-level fusion of VGG-8 and Haralick features for multiple iris presentation attack detection. Similarly, Kohli et al. [37] fused Zernike moment-based features with LBP with variance (LBPV) to handle the medley of iris presentation attacks. Furthermore, a recent work [38] suggested combining features from three distinct local and global regions within the given eye image through feature-level and score-level fusion. Table 1 comprehensively outlines the literature on iris PAD in terms of the underlying feature extraction mechanisms.

Table 1 Summarizing literature analysis in terms of different feature extraction methods focusing on various eye regions

3 Proposed approach: YOLO with statistical methods

This section describes the three subsequent phases of the proposed approach to improve iris liveness detection. It begins with region of interest (RoI) localization through bounding-box regression and describes the architecture and functionality of the YOLO framework [13] used. It then details the seven distinct feature extractors used to derive features from the enhanced RoI, and the Friedman test [39] used to accomplish best-k feature selection from the resulting features. Finally, the selected features are combined using score-level fusion to make the final attack prediction.

3.1 CNN framework for RoI detection

The RoI detection procedure is inspired by an earlier work [11] with a slight modification: it focuses on iris localization instead of liveness detection. The framework (model) deployed to detect the RoI is depicted in Fig. 3 and follows a reduced version of the DarkNet-19 CNN [13]. This framework contains nine convolutional blocks, where the first six are followed by max pooling layers. Each convolution block represents a combination of three subsequent operations, i.e., convolution, batch normalization (BN), and rectified linear unit (ReLU). Each convolution layer is implemented with a filter size of 3 × 3, excluding the last, where 1 × 1 filters are used. Notice that a fully connected layer is not included, so the model automatically adapts to accept inputs of varying size. The topmost layer uses a softmax function, which is responsible for predicting the coordinates of the bounding box representing the iris region. Table 2 shows the entire network architecture along with its parameters. Here, the input image is resized to 416 × 416 × 3 before being fed to the model. The spatial dimensions are reduced by a factor of 32 after a chain of convolution and pooling operations, and the output dimensions become 13 × 13 × 25. Indeed, the number of channels in the output feature map is (#class + #coords + 1) × #anchors, where #class denotes the number of output classes, #coords represents the coordinates of the bounding box to predict (i.e., the x, y coordinates of the box center, along with the height and width of the iris region), and #anchors is the number of predefined anchors (i.e., the number of bounding boxes examined) used to obtain the best bounding box; in the experiment, it is set to 5. Notice that, since merely single-object detection and localization is carried out in the proposed setup, #class is set to 0, and the framework is utilized as a regressor instead of a classifier.
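For illustration, the following Keras sketch shows one DarkNet-style convolution block and the channel arithmetic that yields the 13 × 13 × 25 output; the filter count is a placeholder assumption, not the exact published configuration.

```python
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size=3):
    """One convolution block: convolution, batch normalization, ReLU."""
    x = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

# Channel arithmetic for the output feature map described above.
n_class, n_coords, n_anchors = 0, 4, 5
out_channels = (n_class + n_coords + 1) * n_anchors   # (0 + 4 + 1) * 5 = 25
```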

Fig. 3
figure 3

Depiction of the CNN framework employed for region of interest (RoI) detection, working as a regressor, predicting the dimensions of bounding box, which tightly bounds the iris region

Table 2 Various parameters involved in the YOLO framework in the experimental setup

The DarkNet-19 model was originally trained on the ImageNet dataset with 1000 distinct output categories; however, that model was designed to act as a classifier. Therefore, it is retrained with explicitly annotated and labeled iris images along with the predefined coordinates of the bounding boxes. The pre-trained model is employed to transfer knowledge when retraining on the standard iris datasets. Moreover, initializing with the weights learned during pre-training yields quick network convergence with improved accuracy [13]. This is because the initial CNN layers learn generic features, such as points, blobs, and edges, and during retraining such knowledge is successfully transferred to diverse tasks. Since the model is not too deep, it is entirely fine-tuned rather than freezing some initial layers and fine-tuning the rest. Next, to further enlarge the training set and to address overfitting, data augmentation methods are employed that perform transformations such as random flipping, shearing, rotation, and cropping on the training batches. The augmented images are supplied to the model by ImageDataGenerator, an open-source tool provided by Keras [40], during model training only. Since the model does not perform classification, the classification loss is excluded from the loss function, which then contains only the localization loss and the confidence loss. Therefore, the loss function used to accomplish regression (predicting the bounding box parameters) is defined as follows:

$$ \begin{aligned} L &= \lambda_{\text{coord}} \sum_{i=0}^{C^{2}} \sum_{j=0}^{N} B_{ij}^{\text{iris}} \left[ (x_{\text{trgt},i} - \hat{x}_{\text{pred},i})^{2} + (y_{\text{trgt},i} - \hat{y}_{\text{pred},i})^{2} \right] + \lambda_{\text{coord}} \sum_{i=0}^{C^{2}} \sum_{j=0}^{N} B_{ij}^{\text{iris}} \left[ (w_{\text{trgt},i} - \hat{w}_{\text{pred},i})^{2} + (h_{\text{trgt},i} - \hat{h}_{\text{pred},i})^{2} \right] \\ &\quad + \lambda_{\text{iris}} \sum_{i=0}^{C^{2}} \sum_{j=0}^{N} B_{ij}^{\text{iris}} \left( O_{\text{trgt},i} - \hat{O}_{\text{pred},i} \right)^{2} + \lambda_{\text{no\_iris}} \sum_{i=0}^{C^{2}} \sum_{j=0}^{N} B_{ij}^{\text{no\_iris}} \left( O_{\text{trgt},i} - \hat{O}_{\text{pred},i} \right)^{2} \end{aligned} $$
(1)

Here, \( B_{ij}^{\text{iris}} \) indicates that the iris exists in the ith cell and that the jth bounding box in the ith cell is primarily responsible for the prediction. In contrast, \( B_{ij}^{\text{no}\_\text{iris}} \) denotes the case when the jth bounding box in the ith cell does not contain the iris. C2 denotes the total number of cells in the last feature map, and N is the number of bounding boxes to predict. In the experiment, the values of N and C are set to 5 and 13, respectively. Ultimately, the total number of bounding boxes predicted for an image is given by C × C × N, i.e., 13 × 13 × 5. The target value for the output class \( O_{\text{trgt},i} \) is computed as follows:

$$ O_{\text{trgt},i} = \begin{cases} 1 & \text{if } B_{ij}^{\text{iris}} = 1 \\ 0 & \text{if } B_{ij}^{\text{no\_iris}} = 1 \end{cases} $$
(2)

Fundamentally, the above derivation calculates the difference between the actual and predicted values and then measures the L2 loss. \( \lambda_{\text{coord}} \), \( \lambda_{\text{iris}} \), and \( \lambda_{\text{no}\_\text{iris}} \) are hyper-parameters employed to weight the distinct regression losses. In our experiment, these values are set to 1, 5, and 1, respectively.
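The following NumPy sketch mirrors Eq. (1) with the cells and anchor boxes flattened into one axis; the tensor shapes and mask names are assumptions for illustration, not the exact implementation.

```python
import numpy as np

def yolo_regression_loss(target, pred, iris_mask, no_iris_mask,
                         lam_coord=1.0, lam_iris=5.0, lam_no_iris=1.0):
    """target, pred: (n_boxes, 5) arrays holding (x, y, w, h, objectness) per box;
    iris_mask / no_iris_mask: (n_boxes,) 0/1 arrays marking responsible boxes."""
    coord_err = ((target[:, :4] - pred[:, :4]) ** 2).sum(axis=1)   # x, y, w, h terms
    obj_err = (target[:, 4] - pred[:, 4]) ** 2                     # objectness term
    return (lam_coord * (iris_mask * coord_err).sum()
            + lam_iris * (iris_mask * obj_err).sum()
            + lam_no_iris * (no_iris_mask * obj_err).sum())
```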

Referring to Eq. (1), the first term of the loss function accounts for the coordinates of the predicted bounding box’s center, whereas the second term relates to the box’s height and width. The third and fourth terms concern the probability of the box containing the iris. All the loss terms are summed together to form the unified L2 regression loss. The model is trained using SGD with momentum and a batch size of 64. Let boxtg and boxpd represent the target and predicted bounding boxes, respectively. Then,

$$\begin{aligned}\text{box}_{\text{tg}} &= (x_{\text{trgt}} ,y_{\text{trgt}} ,w_{\text{trgt}} ,h_{\text{trgt}} ) \\ \text{box}_{\text{pd}} &= (\hat{x}_{\text{pred}} ,\hat{y}_{\text{pred}} ,\hat{w}_{\text{pred}} ,\hat{h}_{\text{pred}} )\end{aligned} $$
(3)

Here, the tuple \( (x_{\text{trgt}} ,y_{\text{trgt}} ,w_{\text{trgt}} ,h_{\text{trgt}} ) \) contains the target output values taken from the coordinates of the labeled bounding boxes, whereas \( (\hat{x}_{\text{pred}} ,\hat{y}_{\text{pred}} ,\hat{w}_{\text{pred}} ,\hat{h}_{\text{pred}} ) \) denotes the predicted coordinate values of the bounding boxes. Let (bx, by, bw, bh) be the outputs of the last convolutional layer; they are transformed through the pre-specified anchor dimensions (lw, lh) into the offsets given below [13].

$$ \begin{aligned} \hat{x}_{\text{pred}} &= \sigma(b_{x}) + c_{x} \\ \hat{y}_{\text{pred}} &= \sigma(b_{y}) + c_{y} \\ \hat{w}_{\text{pred}} &= l_{w} \exp(b_{w}) \\ \hat{h}_{\text{pred}} &= l_{h} \exp(b_{h}) \end{aligned} $$
(4)

Here, (cx, cy) denotes the coordinates of the upper-left corner of the current cell of the output feature map. The default anchor dimensions given in [13] are used in the model, i.e., {(3.42, 4.41), (1.08, 1.19), (9.42, 5.11), (6.63, 11.38), (16.62, 10.52)}.
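A direct transcription of Eq. (4), assuming σ denotes the logistic sigmoid as in [13], is sketched below.

```python
import numpy as np

def decode_box(bx, by, bw, bh, cx, cy, lw, lh):
    """Map raw network outputs to box center and size using the cell corner and anchor."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    x_pred = sigmoid(bx) + cx
    y_pred = sigmoid(by) + cy
    w_pred = lw * np.exp(bw)
    h_pred = lh * np.exp(bh)
    return x_pred, y_pred, w_pred, h_pred
```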

$$ \text{Intersection over Union (IoU)} = \frac{\cap(\text{Box}_{\text{tg}}, \text{Box}_{\text{pd}})}{\cup(\text{Box}_{\text{tg}}, \text{Box}_{\text{pd}})} $$
(5)

The procedure for predicting bounding boxes by incorporating the pre-specified anchors is depicted in Fig. 4. The predicted bounding boxes, computed from the five pre-specified anchors, are shown as blue rectangles. The framework convolves over all grid cells and calculates the IoU between the target bounding box (boxtg) and the predicted bounding box (boxpd) as given in (5), and the largest IoU is recorded. If the largest IoU exceeds a preset threshold, the respective cell produces zero loss when calculating the probability of the bounding box enclosing the iris. The output of the iris localization framework for images from different datasets is shown in Fig. 5.
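For completeness, the IoU in Eq. (5) for two axis-aligned boxes expressed as corner coordinates (x1, y1, x2, y2) can be computed as sketched below; the corner representation is an assumption that follows directly from the center/size form.

```python
def iou(box_a, box_b):
    """box = (x1, y1, x2, y2); returns intersection area divided by union area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```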

Fig. 4
figure 4

The pictorial representation of the process of bounding box prediction using the pre-specified anchors

Fig. 5
figure 5

Output of the iris localization framework containing the bounding box indicating the iris region along with the probability score

3.2 Selective image enhancement on RoI

The localized RoI (iris region) is cropped from the image according to the predicted bounding box using OpenCV in Python and then undergoes selective image enhancement through rescaling, sharpening, and color and contrast adjustment. This increases the subjective and textural quality of the image and enriches its textural details. The cropped iris image is first rescaled by a factor of 1.25, and then image sharpening is performed. Sharpening highlights the fine details in the region and suppresses noise; edge sharpness and fine details are determined by the sharp transitions in image intensity. Sharpening is produced by preserving the high-frequency components and attenuating the low-frequency components. Contrast, in turn, arises from the difference in illumination reflected from two neighboring surfaces. Specifically, contrast refers to the difference in chromatic properties that makes an object discernible from other objects and from the background, and it is determined by differences in brightness and color. Various linear and nonlinear functions can accomplish contrast enhancement; a logarithmic transformation is used in our experiments. Figure 6 shows sample output images after image enhancement.
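The enhancement pipeline above might be realized with OpenCV as sketched below; the sharpening kernel and the logarithmic scaling constant are illustrative assumptions rather than the exact values used.

```python
import cv2
import numpy as np

def enhance_roi(roi_gray):
    """Rescale by 1.25, sharpen (boost high frequencies), apply a log contrast transform."""
    roi = cv2.resize(roi_gray, None, fx=1.25, fy=1.25, interpolation=cv2.INTER_CUBIC)
    sharpen_kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])   # edge-boosting kernel
    roi = cv2.filter2D(roi, -1, sharpen_kernel)
    c = 255.0 / np.log1p(float(roi.max()))           # keep output within [0, 255]
    return np.uint8(c * np.log1p(roi.astype(np.float64)))
```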

Fig. 6
figure 6

Depicting image enhancement of the Region of Interest (RoI) through rescaling, sharpening, and color and contrast variation

3.3 Feature extraction from RoI

Feature extraction from the RoI is carried out using three different approaches: key-point-based, local and global descriptors, and a deep learning-based feature extractor [3]. The key-point-based feature extractor is the scale-invariant feature transform (SIFT), which extracts a set of local key points. Besides, local descriptors such as LBP [25], CoA LBP [28], multiscale BSIF (MBSIF) [41], Zernike moments [37], and Haralick features [22] perform textural analysis to construct discriminative patterns and generate the output feature-vectors. Further, the VGG-8 model is employed to extract deep learning-based features; it is an imitation of VGGNet [42] with eight layers instead of sixteen. These feature extractors are described in the following subsections.

Key-point-based feature extraction: key points denote points in the image that are invariant to chromatic deviations and image rescaling. In our experiment, the SIFT descriptor is used to identify unique key points within the RoI. Such key points carry distinct local appearance cues that help discriminate between features and are considered beneficial for detecting presentation attacks based on cosmetic lenses and printed iris images.
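A minimal OpenCV sketch of this step is given below; it assumes the enhanced RoI is supplied as a grayscale array and that opencv-python ≥ 4.4 is available so that SIFT ships with the main module.

```python
import cv2

def sift_features(roi_gray):
    """Detect SIFT key points and compute their 128-dimensional descriptors."""
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(roi_gray, None)
    return keypoints, descriptors          # descriptors: (n_keypoints, 128) or None
```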

Local and global feature extraction: local descriptors characterize image neighborhoods computed at several interest points. Usually, interest points are detected at various scales and are expected to recur across distinct views of an object; they are also likely to capture the essence of the object’s appearance. Such feature descriptors describe the image patch surrounding the point of interest. The prime advantage of employing local features is their significant potential to recognize the object despite clutter and occlusion. In this work, co-occurrence of adjacent LBP (CoA LBP), MBSIF, Zernike moments, and Haralick features are employed. CoA LBP is an image feature that relies upon the spatial co-occurrence among the micro-structures represented by LBPs. It was introduced to counter a limitation of conventional LBP, where all the LBPs of micro-features are packed into a single histogram. This discards essential information about the spatial relations among the LBPs, even though such relations may carry information about the image’s global structure.

Besides, MBSIF is an advancement over the traditional BSIF method, where, instead of a single fixed-size filter, multiple filters with varying scales are incorporated. The multiple filter responses are combined to create an improved, unique feature set. This work utilizes three domain-specific filters of sizes 17 × 17, 7 × 7, and 5 × 5 with bit lengths of 12, 10, and 8, respectively, publicly provided in [43]. These filters are domain-specific since they are trained on patches of iris samples, and they are more powerful than generic BSIF filters [41].

On the other side, Zernike moments [37] are robust to variations in rotation, scale, and translation and have also been applied effectively to iris segmentation and recognition [44]. In this work, Zernike moments are used to capture the fluctuations in the contour between live iris and attack samples. An orthogonal set of polynomials is used to define the Zernike moments of an image, and a radial polynomial Pm,n is computed over it. This radial polynomial is defined as follows:

$$ P_{m}^{n} (\rho )\, = \,\,\,\sum\limits_{k = 0}^{{\frac{m - \left| n \right|}{2}}} {\frac{{( - 1)^{k} \rho^{m - 2k} (m - k)!}}{{k!\,\left( {\frac{m + \left| n \right|}{2} - k} \right)!\,\,\left( {\frac{m - \left| n \right|}{2} - k} \right)!}}} $$
(6)

where \( \rho \) is the distance between a point (i, j) and the center of the image, m is the order of the polynomial function, and n is the repetition such that |n| ≤ m and m − |n| is even. The Zernike function is directly calculated in the Cartesian coordinate space as:

$$ Z_{m,n}(p,q) = P_{m}^{n}(\rho_{p,q}) \, e^{-jn\theta_{p,q}} $$
(7)

Consider an image of size S × S; then

$$ \rho_{p,q} \,\, = \,\,\frac{1}{S}\, \times \sqrt {\left( {2p - S + 1} \right)^{2} + \left( {S - 1 - 2q} \right)^{2} } $$
(8)
$$ \theta_{p,q} \,\, = \,\,tan^{ - 1} \left( {\frac{S - 1 - 2q}{2p - S + 1}} \right) $$
(9)

Let I be an iris image; Zernike moments are then computed at (x, y) over non-overlapping cells of size n × n. Several (x, y) pairs are chosen to calculate the amplitudes of multi-order Zernike moments, which improves the representation of the input image. In our experiment, Zernike moments are calculated on non-overlapping patches of 4 × 4, 8 × 8, and 16 × 16, and the resulting features are concatenated to produce a combined feature vector.
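A sketch of this multi-scale computation, using the mahotas library as one possible implementation (the radius and degree settings are assumptions), is shown below.

```python
import numpy as np
import mahotas

def zernike_features(roi_gray, patch_sizes=(4, 8, 16), degree=8):
    """Concatenate Zernike moment amplitudes from non-overlapping patches at three scales."""
    feats = []
    for s in patch_sizes:
        for y in range(0, roi_gray.shape[0] - s + 1, s):
            for x in range(0, roi_gray.shape[1] - s + 1, s):
                patch = roi_gray[y:y + s, x:x + s]
                feats.append(mahotas.features.zernike_moments(patch, radius=s // 2,
                                                              degree=degree))
    return np.concatenate(feats)
```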

Next, Haralick features [22] are well-known statistical global descriptors used to encode the textural details within an image. They have been effectively employed in several domains, such as medical imaging, texture classification, and face presentation attack detection. Haralick features exploit gray-level co-occurrence matrices (GLCM), a tabulation of the occurrences of distinct combinations of gray-level pairs in an image. Typically, the aim is to map a given unknown sample to one of a set of predefined texture classes. The textural features may be discrete histograms, scalar values, or empirical distributions. Moreover, such features characterize the image’s textural properties, such as contrast, orientation, spatial structure, and roughness, and exhibit a definite correlation with the target output. The GLCM represents the distribution of co-occurring pixel intensities within the image I at a defined offset (∆p, ∆q); its entry at position (x, y), GLCM∆p, ∆q (x, y), is calculated as follows:

$$ \text{GLCM}_{\Delta p, \Delta q}(x, y) = \sum_{p=1}^{m} \sum_{q=1}^{n} \begin{cases} 1, & \text{if } I(p,q) = x \text{ and } I(p + \Delta p, q + \Delta q) = y \\ 0, & \text{otherwise} \end{cases} $$
(10)

After calculating the GLCM, the Haralick features are computed to encode the textural details of the image. In total, there are 13 Haralick features: angular second moment, contrast, correlation, sum of squares (variance), inverse difference moment, sum average, sum variance, sum entropy, entropy, difference variance, difference entropy, and two information measures of correlation.
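As one possible implementation, the mahotas library computes the 13 Haralick descriptors from the GLCM directly; averaging over the four co-occurrence directions, as sketched below, is an assumption.

```python
import mahotas

def haralick_features(roi_gray):
    """Return the 13 Haralick descriptors averaged over the four GLCM directions."""
    per_direction = mahotas.features.haralick(roi_gray)   # shape: (4, 13)
    return per_direction.mean(axis=0)                     # 13-dimensional feature vector
```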

After feature extraction and encoding by using the aforementioned local and global descriptors, corresponding feature-vectors are generated, and they may vary in length depending upon the number of feature values extracted by a particular descriptor. Each feature-vector is fed to a dedicated SVM classifier to generate corresponding output class, i.e., to label the given eye image as either “live” or “attack”. The training procedure of the SVM classifier with the feature-vectors to generate the output labels is described in the next subsection.

3.4 Local classification using SVM

In the proposed approach, multiple SVM classifiers are used at various stages, e.g., in best-k feature selection and score-level fusion. Therefore, this subsection provides a deep insight into the working principle of the SVM classifier.

Let \( \left\{ {F_{k} } \right\}_{k = 1}^{N} \) be the training feature set and \( \left\{ {Y_{k} } \right\}_{k = 1}^{N} \) the corresponding labels; the SVM attempts to learn a hyperplane γ as follows:

$$ \arg \min_{\gamma} \left\| \gamma \right\|_{2}^{2} + \rho \sum_{k} L\left( \gamma, F_{k}, Y_{k} \right) $$
(11)

where ρ and L(γ, Fk, Yk) denote the penalty parameter and the loss function, respectively. Owing to the efficacy of the quadratic (squared) hinge loss in image classification, it is employed in the experiments; it is expressed as follows:

$$ L\left( \gamma, F_{k}, Y_{k} \right) = \left[ \max \left( 0, \, 1 - Y_{k} \gamma^{T} F_{k} \right) \right]^{2} $$
(12)

where Yk is set to 1 for a live iris, whereas −1 signifies a spoof. After learning γ, a test feature vector Ftest is given to the SVM classifier, and the classification uses the sign of γTFtest. Here, ρ is set to 0.1. As mentioned earlier, a dedicated SVM classifier is associated with each feature descriptor (selected in the best-k features) for local classification, i.e., class prediction using a single feature vector. Further, the score-level fusion of all individual classifications is accomplished to obtain the global prediction.
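The per-feature classifier can be sketched with scikit-learn's LinearSVC, whose objective matches Eq. (11) with the squared hinge loss; mapping the penalty ρ = 0.1 onto the C parameter and the placeholder feature matrices are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data; in the pipeline these are descriptor outputs (e.g., MBSIF vectors).
rng = np.random.default_rng(0)
train_features, train_labels = rng.normal(size=(100, 512)), rng.choice([-1, 1], size=100)
test_features = rng.normal(size=(20, 512))

clf = LinearSVC(loss="squared_hinge", C=0.1)          # C assumed to play the role of rho
clf.fit(train_features, train_labels)                  # +1 = live, -1 = attack
local_scores = clf.decision_function(test_features)    # sign(score) is the local prediction
```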

3.5 Global classification through score-level fusion

Let c1, c2, …, ck represent the local outputs of the k classifiers; score-level fusion learns a set of weights w1, w2, …, wk, with w1 + w2 + ⋯ + wk = 1, to compute the fused output (Cs) as follows:

$$ C_{s} \, = \,c_{1} w_{1} \, + \,c_{2} w_{2} \, + \cdots + \,c_{k} w_{k} $$
(13)

The weights w1, w2, …, wk in (13) are learned by recursively evaluating the individual performances of the classifiers on the test set. This is done using varying partitions of the train and test sets, where the train set is assigned a large number of samples so that the classifiers can learn effective feature discrimination. However, instead of combining the outputs of all classifiers, the best-k features are identified that result in more accurate classification. In this vein, we exploit well-established statistical tests for the concurrent assessment of several classifiers. As all SVMs are identical, they differ in performance only because of the feature-vectors used for their training. Thus, the association of each feature extractor with an SVM acts as a classification algorithm, and the best-k methods are identified using the Friedman test [39]. It performs feature selection by simultaneously evaluating each feature against the output and ranks the features based on their performances. Further, the optimal set of weights for score-level fusion is obtained by an inclusive analysis of the train set over various experiments. In particular, the partitioning is performed as 5267 training and 1316 testing samples for IIITD CLD, 8482 training and 2120 testing samples for IIITD-CSD, 4080 training and 1020 testing samples for ND CLD, 3840 training and 960 testing samples for ND-LivDet, 6356 training and 1589 testing samples for Clarkson-2015, and 6476 training and 1619 testing samples for Clarkson-2017. Each dataset is randomly split 100 times, where the best-k features are identified through tenfold cross-validation on the train set. Finally, the local responses of these k classifiers on the test samples are fused using (13) to obtain the global outcome. The first 50 random splits are utilized to learn the weights, whereas the performance of score-level fusion is computed using the remaining 50 partitions.
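A minimal sketch of the weighted fusion in Eq. (13) is shown below; the decision threshold at zero is an assumption that matches the sign-based SVM outputs.

```python
import numpy as np

def fuse_scores(local_scores, weights):
    """local_scores, weights: length-k arrays; the weights are assumed to sum to one."""
    fused = float(np.dot(weights, local_scores))     # Eq. (13)
    return "live" if fused >= 0 else "attack"
```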

  1. Best-k feature selection through the Friedman test

In general, the Friedman test aims to reject the null hypothesis that the given classifiers are statistically similar, i.e., that they all exhibit equal performance. To this end, the Friedman test employs two statistics, expressed as follows:

$$ \chi_{F}^{2} = \frac{12D}{n(n + 1)}\left[ {\sum\limits_{i} {R_{i}^{2} - \frac{{n(n + 1)^{2} }}{4}} } \right] $$
(14)
$$ F_{F} = \frac{{\left( {D - 1} \right)\chi_{F}^{2} }}{{D\left( {n - 1} \right) - \chi_{F}^{2} }} $$
(15)
figure a (Algorithm 1: the subsampling procedure applied to the iris datasets)

where n, R, and D signify the number of classifiers, the average rank, and the number of datasets, respectively. In the proposed approach, the null hypothesis is modified to “all extracted features exhibit identical contributions to the output prediction.” For the Friedman test, D and n should be large enough (as a rule of thumb, D > 10 and n > 5) [39]. There are multiple feature-vectors, each with a separate SVM, acting as classifiers (n), yet the number of datasets (D) is limited. Therefore, subsampling is performed over the datasets as given in Algorithm 1 to counter this, where three distinct samples are created from each dataset. Thus, a total of 18 sampled datasets are produced from the six datasets and are treated as separate datasets. However, sampling the raw dataset alone seems inadequate, as not every iris image would be included in the sampled dataset. In addition, the size of the sampled datasets would be critically small, and thus the overall outcome may be affected. Therefore, before subsampling, image augmentation is carried out on each dataset using ImageDataGenerator (an image processing tool provided in Keras [40]). It performs certain transformations on each image matrix in the dataset (as described in the next subsection) and generates similar augmented images.
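The two statistics in Eqs. (14) and (15) can be computed from a D × n matrix of AUC values (one row per sampled dataset, one column per feature+SVM classifier) as sketched below; ranking higher AUC as the better rank follows [39].

```python
import numpy as np
from scipy.stats import rankdata

def friedman_statistics(auc):
    """auc: (D, n) array of AUC scores; returns average ranks, chi^2_F, and F_F."""
    D, n = auc.shape
    ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, auc)  # rank 1 = best AUC
    R = ranks.mean(axis=0)                                           # average rank per classifier
    chi2_f = 12 * D / (n * (n + 1)) * (np.sum(R ** 2) - n * (n + 1) ** 2 / 4)
    f_f = (D - 1) * chi2_f / (D * (n - 1) - chi2_f)
    return R, chi2_f, f_f
```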

  2. Image augmentation

In order to generate auxiliary training samples, several augmentation methods are applied that perform various transformations on the images of the given iris datasets. Such transformations include rotation, flipping, shearing, shearing after rotation, rotation after shearing at varying directions and angles, etc. Indeed, these transformations are similar to regular matrix operations, i.e., the input image matrix is modified in terms of pixel values and their locations with respect to the axes. Flipping is implemented both horizontally and vertically, where pixels are mirrored along the width and height. Similarly, rotation moves the pixel values in the 2D plane counter-clockwise by a predefined angle about the origin. Besides, shearing displaces pixel values according to the variation in their distances from each axis. These transformations are carried out by employing ‘ImageDataGenerator’, a class provided by Keras [40] for image pre-processing. The parameters specified for these transformations are: shear (≤ 0.2), rotation (≤ 40°), flip (horizontal and vertical = True), height shift (≤ 0.2), and width shift (≤ 0.2). This helps produce auxiliary images with analogous features to augment and enrich the model input.
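The augmentation configuration described above corresponds to the following Keras sketch; the generator is then attached to the training loop via its flow methods.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(shear_range=0.2,
                               rotation_range=40,
                               horizontal_flip=True,
                               vertical_flip=True,
                               height_shift_range=0.2,
                               width_shift_range=0.2)
# augmenter.flow(...) or augmenter.flow_from_directory(...) yields augmented
# batches during model training only, as described above.
```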

4 Experimental framework and discussion

Aiming to realize iris liveness detection, six different iris datasets containing images with various attack variants are considered in this study. The primary reason for including many datasets is a prerequisite of the Friedman test for best-k feature selection. Nonetheless, the proposed approach is examined on each dataset for performance validation. A series of experiments is performed on these datasets to validate the efficacy of the proposed approach. The following subsections describe the databases and the validation protocols used in the experiments, and the respective outcomes are discussed. A comparative study between the proposed approach and the state of the art is also presented.

4.1 Description of iris datasets and validation protocols

Table 3 describes the iris datasets used in this work, along with the underlying image distribution and attack types. The IIITD contact lens dataset (CLD) [45] contains live, soft-lens, and patterned-lens iris samples of 101 distinct subjects captured through Cogent and Vista sensors. Besides, the IIITD combined spoofing dataset (CSD) [37] contains live, patterned, print-scan, and print-capture iris images. Notice that IIITD-CSD contains the IIITD contact lens dataset as a part of it; thus, we exclude it from the combined spoofing dataset. Further, the ND-LivDet-2017 [10] and ND contact lens 2013 [46] datasets contain only live irises and patterned iris samples. Furthermore, both the LivDet-Iris 2015 and LivDet-Iris 2017 (Clarkson) [10] datasets contain images of live irises, iris printouts, and textured contact lenses. In addition, a “Combined” dataset is prepared for some experiments by merging images from all datasets. Each dataset is provided with train and test partitions to facilitate the training and testing of algorithms. The datasets belonging to LivDet-Iris 2017 further divide the test samples into two groups: test-known and test-unknown. In the former group, both live samples and artifacts possess the same “known” properties as the train samples, whereas the second group has unknown or different properties from the train samples. Note that this work adheres to binary classification, i.e., live versus attack, and does not discriminate among attack types.

Table 3 Description of iris PAD datasets in terms of image distribution with attack types

For all the abovementioned datasets, the experiments follow the predefined train-test partitioning for feature extraction and SVM training. However, fivefold cross-validation is used for best-k feature selection, where each dataset is divided into five equal parts. In each training phase, one part is considered the test set, and the remaining four are used for algorithm training. Notice that the fivefold cross-validation is employed only in the Friedman test to compute area under the curve (AUC) values, whereas the remaining experiments follow the score-level fusion procedure described in Sect. 3.5. The feature extraction methods construct the discriminative features from the train and test sets to form the train and test features. The classifiers are trained using the train features to realize binary classification, where authentic iris samples are labeled as “live” and artifacts as “attack.” In the experiments, the PAD performance is expressed as per the ISO/IEC SC37 [47] metrics listed below:

  • Accuracy: Ratio of correctly classified samples out of total samples.

  • Bona fide presentation classification error rate (BPCER): Ratio of live (bona fide) irises incorrectly classified as attacks, out of the total bona fide samples.

  • Attack presentation classification error rate (APCER): Ratio of attack samples erroneously classified as live, out of the total attack samples.

  • Average classification error rate (ACER): Average of BPCER and APCER.

  • Equal error rate (EER): The point/value at which APCER and BPCER are equal.

Here, APCER and BPCER correspond to the false acceptance rate (FAR) and false rejection rate (FRR), respectively. These error rates vary with the threshold applied to the classifier’s output. As mentioned above, the point/value at which both error rates become equal is referred to as the equal error rate (EER). The trade-off between FAR and FRR is outlined using detection error trade-off (DET) curves plotted over varying thresholds [47], where the EER is given by the point at which the diagonal line meets the DET curve.
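As a minimal sketch (assuming classifier scores where larger values indicate “live” and a label convention of 1 = live, 0 = attack), the metrics at a fixed threshold can be computed as follows; sweeping the threshold traces the DET curve, and the point where APCER = BPCER gives the EER.

```python
import numpy as np

def pad_metrics(scores, labels, threshold=0.0):
    """scores, labels: NumPy arrays; labels use 1 = live (bona fide), 0 = attack."""
    pred_live = scores >= threshold
    live, attack = labels == 1, labels == 0
    bpcer = np.mean(~pred_live[live])        # bona fide classified as attack
    apcer = np.mean(pred_live[attack])       # attack classified as bona fide
    acer = (apcer + bpcer) / 2
    accuracy = np.mean(pred_live == live)
    return accuracy, apcer, bpcer, acer
```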

4.2 Best-k feature selection

Table 4 compares seven distinct algorithms (feature extractor plus SVM) on 18 datasets using the Friedman test, where the intermediate results are expressed in terms of average ranks. Separate rankings are produced according to the higher values of the area under the curve (AUC) for each dataset, as in [39]. Fivefold cross-validation is employed on each dataset, and an average AUC is computed. Although several feature extraction algorithms are available in the literature, only a few selected methods are examined, owing to the improved performances reported in previous works [21, 22, 37]. In particular, SIFT, LBP, MBSIF, CoA LBP, Haralick features, Zernike moments, and the VGG-8 model exhibit better performance in textural classification. In this view, these methods are coupled with dedicated SVM classifiers to discriminate between attack and live samples and are examined on 18 different sampled datasets. More details about the Friedman test can be found in [39].

Table 4 Analysis of seven distinct features on 18 subsampled iris PAD datasets for best-k feature selection through the Friedman test

Considering Table 4, \( \chi_{F}^{2} \) and \( F_{F} \) are evaluated as 91.29 and 86.70 using (14) and (15). FF is distributed with (7 − 1) and (7 − 1)(18 − 1) degrees of freedom according to the F distribution for 7 classifiers and 18 datasets. Since the critical value of F(6, 102) at α = 0.05 is 2.14, which is far smaller than FF, we reject the null hypothesis following the Friedman statistics. Further, it can be concluded from the average ranks that SIFT, VGG-8, and MBSIF are the top three algorithms, whereas CoA LBP, Zernike, and Haralick features are the bottom three. It is worth identifying their statistical differences to find the best-k features, which is done using two post hoc tests (the Nemenyi test and the Bonferroni-Dunn test) [39]. These tests state that two classifiers differ statistically if the difference in their average ranks is larger than the critical difference (CD), which is computed as follows:

$$ {\text{CD}} = q_{\alpha } \sqrt {\frac{n(n + 1)}{6D}} $$
(16)

Here, qα is the critical value for a given α. The next step is to apply the Nemenyi test for 7 classifiers, where qα = 2.949 (for α = 0.05) and CD = 2.123. It infers that SIFT, VGG-8, and MBSIF perform equally; likewise, Zernike, CoA LBP, and Haralick features are similar. However, no conclusion can be drawn for LBPV, as its average rank differs from those of the MBSIF and CoA LBP methods by less than the critical difference. Further, at α = 0.1, qα = 2.693, and the CD is computed as 1.939, which signifies that LBPV is significantly different from MBSIF. Thus, LBPV is not included in the best-k features. Consequently, SIFT, VGG-8, and MBSIF are selected as the best-k features.
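The critical differences quoted above follow directly from Eq. (16), as the short sketch below reproduces.

```python
import math

def critical_difference(q_alpha, n, D):
    """Eq. (16): Nemenyi critical difference for n classifiers over D datasets."""
    return q_alpha * math.sqrt(n * (n + 1) / (6 * D))

print(critical_difference(2.949, 7, 18))   # alpha = 0.05 -> approximately 2.123
print(critical_difference(2.693, 7, 18))   # alpha = 0.10 -> approximately 1.939
```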

Figure 7 illustrates the outcome of the post hoc tests, which shows that the average ranks also provide a fair assessment of the classifiers. The top horizontal line in the figure depicts the axis along which the average ranks are plotted, from left (lowest value/best rank) to right (highest value/worst rank); thus, the methods on the left are superior. Methods that are statistically similar are connected by a horizontal line. The result reveals that the SVMs report the three best classifications with the SIFT, VGG-8, and MBSIF features. Therefore, these three features are combined through the score-level fusion expressed in (13) to obtain the final output.

Fig. 7
figure 7

Graphical illustration of the results reported by the Friedman test and Nemenyi test, where SIFT, VGG-8, and MBSIF are ranked as the top three features and are statistically similar

4.3 Validation of the Friedman test’s outcome

As depicted in Fig. 7, the Friedman test clustered the features into two groups according to the similarity of their performance on the various databases. We denote these as group-A (SIFT, MBSIF, VGG-8) and group-B (LBPV, CoA LBP, Zernike, Haralick). As discussed in the preceding subsection, group-A is selected as the most discriminative (optimal) feature set by the Nemenyi test. This subsection further validates the effectiveness of this feature set through a comparative analysis between group-A and group-B. To achieve this, the methods in both groups are examined on the original datasets: IIITD CLD, ND CLD, ND-LivDet, IIITD-CSD, and Clarkson. The training and testing procedures follow the validation protocols provided with each dataset, i.e., validation is performed on the predefined test sets. More specifically, the methods in each group individually perform feature extraction and classification, whereas the final score is obtained by score-level fusion of their outcomes. The experimental results for all datasets are shown in Fig. 8, where the methods in group-A outperform group-B with a significant performance improvement on all datasets. A question arises here: why are the selected k features optimal? The reason lies in the underlying feature extraction procedures of the MBSIF, SIFT, and VGG-8 methods. Specifically, the MBSIF method uses domain-specific filters [43] to construct iris features, which are more powerful than the generic filters [8]. Similarly, VGG-8 is retrained on the iris datasets, and thus the feature maps learned by the VGG-8 model are also domain-specific. Therefore, their combination yields a significant improvement in iris pattern discrimination. Besides, SIFT features identify key points within the iris region that certainly differ for presentation attacks with cosmetic lenses and printed iris images [25]. Each of these methods individually performs good classification; however, their score-level fusion yields an additional improvement towards the correct output prediction.

Fig. 8
figure 8

Performance of two different groups of features (group-A and group-B) clustered by Friedman test, on five datasets. The feature extraction methods plus SVMs (in each group) are individually trained and the results are computed using score-level fusion on their individual scores. It is observed that methods in group-A perform significantly better for all datasets

4.4 Fusion methods comparative analysis

A natural question in this study is why “score-level fusion” is used instead of other fusion schemes. To answer it, we present a fair comparison among four distinct yet widely used fusion methods: score-level fusion, majority voting, feature-level fusion, and rank-level fusion. Except for feature-level fusion, all methods operate at the classifier level, i.e., on the predicted output labels. In feature-level fusion, all three feature-vectors of an iris image are concatenated to form a combined feature-vector that is fed to the SVM classifier for output prediction. Table 5 summarizes the outcomes of the abovementioned fusion methods on all datasets in terms of accuracy, APCER, BPCER, and ACER. Although there is no universal trend towards using a particular fusion method, score-level fusion is used comparatively more in the iris-related literature. Additionally, in the experimental outcomes, score-level fusion outperforms the other counterparts by a significant margin. The reason score-level fusion performs better is that it assigns a fair contribution to each feature in the output prediction based on its prediction accuracy. More specifically, instead of neglecting features upon a wrong prediction (as in majority voting), each feature is assigned a weight, and thus a fair contribution from each is achieved.

Table 5 Results of four different fusion approaches on various datasets

4.5 Intra-domain evaluation

In this validation scheme, the proposed approach is evaluated on individual datasets, where the algorithm’s training and testing are carried out on the predefined train-test partitions within each dataset. However, an additional sub-split is created during training, where 20% of the images are randomly selected from each train set to serve as the validation set. In a nutshell, the selected feature extractors, i.e., SIFT, MBSIF, and VGG-8, extract features that are fed to dedicated SVMs to map the given iris images to either the real or the attack category, and the outcome is obtained after score-level fusion. Afterward, two vectors of error rates (APCER and BPCER) are calculated based on varying thresholds on the SVM outputs. Further, a DET curve plotting APCER against BPCER is generated for each dataset, as shown in Fig. 9. The EER value for each dataset is obtained at the point where the curve intersects the diagonal line and is listed in Table 6.

Fig. 9
figure 9

DET plots for Intra-sensor evaluation of the proposed method on various datasets. The method performs best for IIITD CLD dataset, while highest misclassification rate is reported for IIITD-CSD dataset

Table 6 Error rates resulted by the proposed approach within intra-dataset domain

It is observed from the intra-dataset evaluation results that the proposed approach performs significantly well when trained and tested on a single dataset. The discrimination error rate is as low as 1.07% and 1.62% for the IIITD Contact Lens and IIITD Combined Spoofing datasets, respectively. This infers that attack patterns with textured and printed irises can be successfully discriminated from genuine samples. Besides, the proposed method performs PAD with less than a 3% misclassification rate for the ND datasets. However, for the Clarkson datasets, the misclassification error still needs to be reduced further.

4.6 Cross-domain evaluation

Except for Clarkson 2017, all remaining datasets facilitate cross-sensor evaluation, since the images in their train and test partitions are captured through distinct sensors and environments. Therefore, in this section, we explore the proposed method in the cross-domain setting, where the train and test sets possess high intra-class variations in the iris samples. First, the inter-domain evaluation is performed at the sensor level, where images from different sensors within the same dataset serve as the train and test sets. As both IIITD datasets (CLD and CSD) contain images captured through Cogent and Vista sensors, we design train-test pairs from these sensors for both datasets. Likewise, the images in the ND datasets (ND CLD and ND-LivDet) were captured with IrisGuard AD100 (ND-I) and LG4000 (ND-II) sensors; thus, these are also arranged in cross-sensor train-test pairs. The performance of the proposed approach for these sensor pairs, in terms of accuracy and error rates, is reported in Table 7, and the respective DET curves are shown in Fig. 10.

Table 7 Experimental results reported by the proposed approach in cross-sensor evaluation
Fig. 10
figure 10

DET plots for cross-sensor evaluation of the proposed method on various datasets. The method performs best for the Cogent → Vista pair, while the highest misclassification rate is reported for ND-II → ND-I

Further, the images in Clarkson 2015 were captured using the Dalsa and LG sensors, whose textural details and acquisition patterns differ significantly. Therefore, the error rates are expected to be comparatively high for the LG-Dalsa train-test pairs. The DET curves corresponding to the Clarkson 2015 dataset are illustrated in Fig. 11. The experimental outcomes show that the proposed approach achieves better accuracy for the Cogent → Vista and ND-I → ND-II pairs, whereas for the LG-Dalsa sensor pairs it exhibits a comparatively high misclassification rate.

Fig. 11
figure 11

DET plots for cross-sensor evaluation of the proposed method on the Clarkson 2015 dataset. The method performs better when the LG and Dalsa sensor images are used for training and testing, respectively

In the next phase, different datasets are considered as train-test pairs (cross-dataset evaluation) to validate the likelihood of transfer learning across these datasets. Since both IIITD CLD and IIITD-CSD contain iris samples captured with the Vista and Cogent sensors, the intra-class variation exists at the sensor level rather than the dataset level. Therefore, direct cross-dataset evaluation may be expected to perform similarly to the cross-sensor case. Accordingly, a cross-dataset experiment is conducted in which training and testing are performed on the IIITD-CSD and IIITD CLD datasets, respectively. Similarly, the ND CLD and ND-LivDet datasets also contain images captured with identical sensors; thus, knowledge transfer may be expected. However, according to the results shown in Table 8 and Fig. 12, the proposed method does not generalize well in cross-dataset validation. The reported EER values for the IIITD and ND datasets are 16.07% and 22.10%, respectively. In contrast, the texture of the Clarkson samples differs substantially from that of the other datasets, as can be seen in Fig. 13. Combining Clarkson with any other dataset as a train-test pair yields accuracy comparable to random prediction. The results show that variation among the datasets in attack properties (textured lens, print-capture, print-scan, or a mix of these), class distribution, acquisition sensors, and environmental conditions limits the ability to transfer knowledge.

Table 8 Experimental results reported by the proposed approach in cross-dataset evaluation
Fig. 12
figure 12

DET plots for cross-dataset evaluation of the proposed approach on three dataset pairs. The method reports lower discrimination accuracy for the IIITD and ND datasets compared to their intra-domain counterparts. For Clarkson, the performance is no better than random prediction and is therefore not included here

Fig. 13
figure 13

Presentation attack samples from various iris datasets, illustrating the intra-class variation that makes PAD difficult

It may be inferred from Table 8 that direct cross-dataset evaluation is not effective. A further possibility for successful cross-domain evaluation is to pool the images of all datasets into a single combined dataset. The pooling respects the individual train-test partitions; more specifically, the images of the training and testing partitions are pooled separately. For the Clarkson 2017 and 2015 datasets, the test-known and test-unknown partitions are retained during pooling, a distinction that does not apply to the other databases. Accordingly, two different test partitions are generated: “known-test”, which contains the test-known partitions of the Clarkson datasets and the test sets of all other datasets, and “unknown-test”, which contains the test-unknown partitions of both Clarkson datasets. Table 9 shows the proposed method’s performance on the combined dataset for both of these test sets in terms of ACER (%). On the known test, the proposed method achieves 5.81% ACER with a classification accuracy above 90%. On the unknown test, it achieves 7.22% ACER with a classification accuracy above 88.8%. The corresponding DET curves are depicted in Fig. 14.
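The partition-pooling logic described above could be sketched as follows, assuming each dataset is represented as a dictionary of image-path lists; the dictionary keys and the function name are illustrative assumptions.

def build_combined_dataset(datasets):
    """Pool per-dataset partitions into a combined dataset (illustrative).

    `datasets` maps a dataset name to a dict of image lists with keys
    'train' and 'test' (non-Clarkson datasets) or 'train', 'test_known',
    and 'test_unknown' (Clarkson 2015/2017).
    """
    train, known_test, unknown_test = [], [], []
    for name, parts in datasets.items():
        train += parts['train']
        if 'test_known' in parts:              # Clarkson-style split
            known_test += parts['test_known']
            unknown_test += parts['test_unknown']
        else:                                  # other datasets: whole test set
            known_test += parts['test']
    return train, known_test, unknown_test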

Table 9 Experimental results reported by the proposed approach in combined-dataset evaluation
Fig. 14
figure 14

DET plots obtained by the proposed method for the combined-dataset evaluation

4.7 Comparative analysis with the state of the art

This section compares the proposed method with the state-of-the-art methods that perform best on the datasets used in this study. As the ND-LivDet and Clarkson datasets belong to the LivDet-Iris 2017 competition, their results are compared with those of the competition winner as well as a recently introduced Meta-Fusion-based PAD method [12]. For the other datasets, the results are compared with the methods performing best on the respective datasets. Moreover, we have carefully implemented the DensePAD framework described in [48] on all the datasets considered in this work, since DensePAD was not evaluated on these datasets in the original work. Table 10 summarizes the state-of-the-art results for each dataset and compares them with the proposed method in terms of error reduction, i.e., the percentage by which the proposed method reduces the error relative to the state of the art. Except for ND-LivDet, the proposed approach outperforms the state of the art on all datasets. For the IIITD CLD dataset, the proposed method achieves more than 40% error reduction, and on the ND CLD dataset it reduces the error rate by 17.73%. In the case of Clarkson, no significant error reduction is reported, and for the ND-LivDet dataset the proposed method falls 69.17% behind the existing counterpart. The DET curves obtained by the proposed method for the abovementioned experiments are shown in Fig. 15.
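For reference, the error-reduction metric can be read as the relative decrease in EER; the minimal sketch below uses hypothetical values only (the actual EERs are those reported in Table 10).

def error_reduction(eer_state_of_the_art, eer_proposed):
    """Percentage of error reduced relative to the state-of-the-art EER."""
    return 100.0 * (eer_state_of_the_art - eer_proposed) / eer_state_of_the_art

# Hypothetical EER values in %, for illustration only.
print(round(error_reduction(2.0, 1.1), 2))  # -> 45.0 (% error reduction)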

Table 10 Comparison with the state of the art in terms of EER (%)
Fig. 15
figure 15

DET plots obtained by the proposed method for various datasets with predefined train and test samples

4.8 Discussion

The proposed scheme incorporates five consecutive steps, i.e., RoI localization, image enhancement, feature extraction, best-k feature selection, and classification, to accomplish iris PAD. Each step contributes to the attack prediction. The feature extraction stage employs multiple feature extraction methods, both handcrafted and CNN-based. A CNN alone could, in principle, learn features discriminative enough for near-errorless classification; however, it requires thousands of images per class, which current iris datasets do not provide. Moreover, the textural quality of images within iris datasets varies significantly due to differences in the hardware and wavelength range of different iris sensors, as depicted in Fig. 13. This, in turn, produces intra-class variation in iris datasets that may not be captured by a single feature extraction method. Since each method analyzes the iris samples from a particular perspective, using multiple features allows the images to be analyzed from multiple angles and may improve discrimination. Further, the feature selection procedure based on the Friedman test removes redundant features with an insignificant contribution to the output prediction and yields an optimal feature set that improves iris PAD.
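As an illustration only, the sketch below shows one way a Friedman-test-based selection could be realized, assuming a table of per-dataset classification accuracies for the candidate features; the mean-rank ordering, the top-k rule, and the function name are our assumptions rather than the exact procedure used here.

import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def select_top_k_features(accuracy_table, feature_names, k=3):
    """Rank candidate features by per-dataset accuracy and keep the top-k.

    accuracy_table : (n_datasets, n_features) classification accuracies,
                     one row per dataset, one column per candidate feature.
    The Friedman test checks whether the rank differences between features
    across the datasets are statistically significant.
    """
    stat, p_value = friedmanchisquare(*accuracy_table.T)
    ranks = rankdata(-accuracy_table, axis=1)   # rank 1 = best accuracy per dataset
    mean_ranks = ranks.mean(axis=0)             # average rank of each feature
    order = np.argsort(mean_ranks)              # best (lowest) mean rank first
    selected = [feature_names[i] for i in order[:k]]
    return selected, p_value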

The experimental results indicate that the proposed optimal feature set performs excellently for intra-dataset iris PAD, with the lowest error obtained on the IIITD CLD dataset. A modest increase in the error rate is observed for cross-sensor deployment, with the best outcome obtained when the train and test samples come from ND-I and ND-II, respectively. The cross-dataset assessment, in contrast, results in a substantial increase in the error rate, where the two highest error rates occur for ND-LivDet → ND CLD and IIITD CLD → ND CLD as the training and testing datasets, respectively. This indicates that there is considerable scope for reducing iris presentation attack detection errors in cross-domain setups. Furthermore, the analysis of fusion approaches for combining the features in the optimal feature set suggests that score-level fusion is an appropriate choice for improving cross-domain iris PAD.

The entire test-time procedure, i.e., RoI localization, feature extraction, and SVM classification, requires an average of 0.021 s per sample. Note that this execution time corresponds to a single image rather than the entire dataset. The experiments were executed on an Intel Xeon Scalable 4114 processor with 64 GB DDR4 RAM and a GTX 1080Ti 11 GB GPU.

5 Conclusion

This study focuses on improving the discrimination between live iris samples and attack patterns to strengthen iris presentation attack detection. It primarily addresses iris attacks launched at the sensor level through patterned contact lenses and printed or scanned copies of genuine iris images. To deal with such attacks, an approach is proposed as a sequence of phases, each mitigating a specific challenge. The YOLO-based approach localizes the iris region without pattern loss, retaining the important textural details where discriminatory patterns exist. The RoI localization reduces the amount of computation required for feature extraction from the iris samples. The feature extraction procedure, with both handcrafted and CNN-based methods, aims to construct features from multiple perspectives. Further, the feature selection reduces the number of features to process without compromising the average classification accuracy; it therefore yields a significant reduction in computational cost and execution time, covering feature extraction from iris images and the corresponding classifiers’ training. As a result of feature selection, SIFT, MBSIF, and VGG-8 are selected as the top three features, discriminating most strongly between live and attack patterns. These features are then combined by performing score-level fusion on the corresponding classifiers’ outcomes. The feature selection is robust, as it is not biased towards a particular dataset; the features are examined on multiple iris PAD datasets to verify their consistency across domains.

Compared with the state of the art, the proposed method outperforms all existing methods with significant error reduction, except on the ND-LivDet dataset. The improved performance is due to the efficacy of the domain-specific MBSIF filters in textural feature construction, the robust keypoint detection of SIFT for identifying printed irises and cosmetic lenses, and the iris-specific feature maps learned by the VGG-8 model. In addition, the score-level fusion boosts the accuracy by assigning appropriate weights to each feature. Although the proposed approach involves several stages, it minimizes the misclassification error rate for both attack and genuine iris patterns.