Introduction

Biometric systems based on ocular images have been extensively investigated due to the high distinctiveness of the iris and because the periocular region can provide discriminative patterns even in noisy images1,2,3,4,5,6. The term ocular comprises the periocular and iris regions7. The periocular region comprises the eyebrows, eyelashes, and eyelids, while the iris is the colored region between the sclera and the pupil. An ocular biometric system can operate in two main modes: identification (1:N comparison) and verification (1:1 comparison). The identification task consists of determining a subject’s identity, whereas the verification task checks whether a subject is who she/he claims to be. There are also two main protocols to evaluate biometric systems: closed-world and open-world8,9. In the former, the training and test sets contain different samples from exactly the same subjects. In the open-world protocol, on the other hand, the training and test sets must have samples from different subjects. With these modes and protocols, it is possible to evaluate both the ability of biometric approaches to produce discriminative features and their generalization capability.

Table 1 Comparison of the available periocular datasets containing visible (VIS) images with our dataset (UFPR-Periocular).

Nowadays, with the advancement of deep learning-based techniques, several methodologies applying them to ocular images have been proposed for tasks such as spoofing detection24,25, iris and periocular region detection26,27,28, iris and sclera segmentation29,30, and iris and periocular recognition31,32,33,34,35,36,37. The advancement of these technologies can be observed in the recent contests conducted to evaluate the evolution of state-of-the-art methods for different applications, such as iris recognition under heterogeneous lighting conditions (NICE.I and NICE.II)21,38, iris recognition using mobile images (MICHE.I and MICHE.II)2,16, iris and periocular recognition in cross-spectral scenarios (Cross-Eyed 1 and 2)17,18, and periocular recognition using mobile images captured under different lighting conditions (VISOB 1 and 2)23. Note that all these contests used datasets containing images obtained in the visible wavelength. The most recent contests also used images captured by mobile devices2,23. The results achieved by the proposed methods have shown that it is challenging to develop a robust biometric system in such conditions, mainly due to the high intra-class variability. Based on recent works2,5,7, we can state that developing an ocular biometric system that operates in unconstrained environments is still a challenging task, especially with images obtained by mobile devices. In such conditions, the images captured by the volunteers may present several variations caused by occlusion, pose, eye gaze, off-angle capture, distance, resolution, and image quality (which depends on the mobile device).

With the existing periocular datasets, it is difficult to assess the scalability of biometric applications, i.e., whether an approach can produce discriminative features even on a dataset with a large number of subjects. As can be seen in Table 1, the datasets in the literature do not contain a large number of subjects and have few capture devices and capture sessions. As described in previous works5,6, one common problem in ocular biometric systems is within-class variability, which is generally caused by noise and attributes that vary across images of the same individual. A robust biometric system must handle images obtained from different capture devices, extracting distinctive representations regardless of the source and environment. In this sense, samples from the same subject obtained in different sessions are of paramount importance to capture the intra-class variation caused by various noise factors.

Considering the above discussion, in this work we introduce a new periocular dataset, called UFPR-Periocular. The subjects themselves collected the images that compose our dataset through a mobile application (app). In this way, the images were captured in unconstrained environments, with minimal cooperation from the participants, and contain real-world noise caused by poor lighting, occlusion, specular reflection, blur, and motion blur. Figure 1 shows some samples from the UFPR-Periocular dataset. As part of this work, we also present an extensive benchmark, employing several state-of-the-art CNN architectures that have been explored to develop ocular (periocular and iris) recognition systems. Face and eye detection are not covered in this work; the recognition methods are evaluated on manually pre-processed images (also available in the dataset).

Figure 1

Sample images from the UFPR-Periocular dataset. Observe the great diversity in terms of lighting conditions, age, gender, eyeglasses, specular reflection, occlusion, resolution, eye gaze, and ethnicity.

Note that our dataset is the largest one in terms of the number of subjects, sessions, and capture devices, as shown in Table 1. It also has more images than all datasets except VISOB. Another key feature is that the images were captured by 196 different mobile device models. Since the samples were captured with little cooperation from the participants, in unconstrained environments, and across three different sessions, the ocular images present several variations. To the best of our knowledge, this is the first periocular dataset with more than 1,000 subjects and the one with the largest number of capture devices in the literature. Thus, we believe that it can provide a new benchmark to evaluate and develop robust periocular biometric approaches.

Recently, with the advancement of devices enabling the self-capture of images that can be used for biometrics, the term “selfie biometrics” has been extensively explored by the research community39,40, especially in face and iris recognition41,42,43. As described by Rattani et al. [3], “selfie biometrics” refers to a biometric system in which the input data is acquired by the users themselves using the cameras available in their own devices. Thus, the UFPR-Periocular dataset presented in this work can be considered a selfie biometric dataset, since its images were acquired by the users through their own smartphones.

The remainder of this work is organized as follows. In “Related work”, we describe the periocular datasets containing VIS images for periocular biometrics. In “Dataset”, we present information about the UFPR-Periocular dataset and the proposed protocol to evaluate biometric systems. “Benchmark” presents the CNN architectures used to perform the benchmark. In “Results and discussion”, we present and discuss the benchmark results. Finally, the conclusions are given in “Conclusion”.

Related work

In recent years, several ocular contests and datasets have been released to evaluate state-of-the-art methods for many applications. Zanlorensi et al.7 detailed and described several datasets and contests for iris and periocular recognition. Different problems have been addressed by researchers, such as ocular recognition in unconstrained environments, ocular recognition in cross-spectral scenarios, iris/periocular region detection, iris/periocular region segmentation, and sclera segmentation44.

Existing periocular datasets can be organized into those captured in constrained (controlled) and unconstrained (uncontrolled) environments. The quality of the images differs between these settings, as images captured in unconstrained environments can be affected by noise such as lighting variation, occlusion, blur, specular reflection, and distance. Images can also be acquired cooperatively or non-cooperatively, depending on the capture restrictions imposed on the subject. Non-cooperative ocular images can present problems caused by off-angle capture, focus, distance, motion blur, and occlusions caused by attributes such as eyeglasses, contact lenses, and makeup.

As described in7, datasets containing images obtained at the near-infrared (NIR) wavelength were created mainly to investigate the intricate patterns present in the iris region45,46. There are also other studies on NIR ocular images, such as generating synthetic iris images47,48, spoofing and liveness detection49,50,51,52, contact lens detection53,54,55,56, and template aging57,58. The use of NIR ocular images captured in controlled environments by biometric systems has been studied for several years. Thus, it can be considered a mature technology that has been successfully employed in several applications3,45,46,59,60.

In general, biometric methods using VIS images achieve better results by exploring the periocular region rather than the iris trait, as the iris is rich in melanin pigment, which absorbs most of the visible light, so the iris texture is not revealed as it is under NIR illumination59. Also, the low resolution of ocular images is a common problem that makes it almost impracticable to use the iris trait alone. Given these problems, the use of VIS ocular images captured non-cooperatively in unconstrained environments has become a recent challenge. In this sense, several studies have been carried out on periocular biometric recognition using images obtained by different mobile devices in uncontrolled environments10,16,23. The following datasets were developed to investigate the use of iris and periocular traits in VIS images: UPOL14, UBIRIS.v120, UBIRIS.v221 and UBIPr22. There are also datasets of iris and periocular region images for cross-spectral recognition, i.e., matching ocular images obtained at different wavelengths (NIR against VIS and vice versa): UTIRIS15, IIITD Multi-spectral Periocular13, PolyU Cross-Spectral19, CROSS-EYED17,18, and QUT Multispectral Periocular12. Focusing specifically on ocular recognition using non-cooperative images obtained in uncontrolled environments by mobile devices, we highlight the following datasets: MICHE-I16, VSSIRIS10, CSIP11 and VISOB23.

Nowadays, it is difficult to evaluate the scalability of state-of-the-art biometric approaches due to the limited size, in terms of subjects and images, of the available datasets. As shown in Table 1, the most extensive dataset regarding subjects and images is VISOB23, which has 158,136 images from 550 subjects. The ICIP 2016 Competition on mobile ocular biometric recognition23 employed this dataset, and a second version of the dataset was launched in the WCCI/IJCNN 2020 challenge (VISOB 2.0 dataset and competition results available at https://sce.umkc.edu/research-sites/cibit/dataset.html). Both contests evaluated periocular recognition using VIS images obtained by mobile devices. The main difference in the second contest is that each input was a stack of five periocular images belonging to the same subject. The best methods achieved EERs of 0.06% and 5.26% in the first and second contests, respectively.

Also using VIS ocular images, other contests were carried out to evaluate iris and periocular recognition: NICE.II38, MICHE.II2, and CROSS-EYED I17 and II18. The NICE.II contest evaluated iris recognition using images containing noise within the iris region. The winning method fused features extracted from the iris and the periocular region using ordinal measures, color histograms, texton histograms, and semantic information. The MICHE.II contest also evaluated iris and periocular recognition, but using images captured by mobile devices. The winning approach extracted features from the iris using the rubber sheet model normalization61 and a 1-D Log-Gabor filter, and from the periocular region using Multi-Block Transitional Local Binary Patterns. Lastly, the CROSS-EYED I and II contests evaluated iris and periocular recognition in the cross-spectral scenario. In both contests, the winning approach employed handcrafted features based on Symmetry Patterns (SAFE), Gabor Spectral Decomposition (GABOR), Scale-Invariant Feature Transform (SIFT), Local Binary Patterns (LBP), and Histogram of Oriented Gradients (HOG).

Inspired by the impressive results achieved by deep learning-based techniques in multiple domains62, several methods proposing and applying such techniques have been developed to address different tasks using ocular images4,5,6,24,25,26,27,28,29,30,31,32,33,34,35,36,37. Still, as found in the literature, deep learning frameworks for ocular biometric systems are a recent technology that needs further improvement7. Ocular recognition using images captured by mobile devices in unconstrained environments is a challenging task that has gained attention in recent years2,5,7,23,63.

Dataset

The UFPR-Periocular dataset was created to obtain images in unconstrained scenarios that contain realistic noise caused by occlusion, blur, and variations in lighting, distance, and angle. To this end, we developed a mobile application (app) enabling the participants to collect their pictures using their own smartphones (project approved by the Ethics Committee Board from the Health Science Sector of the Federal University of Paraná, Brazil—Process CAAE 02166918.2.0000.0102, registered in the Plataforma Brasil system—https://plataformabrasil.saude.gov.br/). We confirm that all methods were carried out following the relevant guidelines and regulations of the Ethics Committee Board from the Health Science Sector of the Federal University of Paraná. Furthermore, we confirm that an informed consent form was obtained from all subjects, and we do not store any data that could be used to identify a subject. We confirm that all periocular images presented in this paper (Figs. 1, 3, 4, 5, 6, 7, and 10) were extracted from the UFPR-Periocular dataset and that we have permission to publish these images in an open access journal. The only instruction given to the participants was to place their eyes within the region of interest marked by a rectangle drawn in the app, as illustrated in “Picture” in Fig. 3. We also restricted the images to be captured in 3 sessions, with 5 images per session and a minimum interval of 8 hours between sessions. In this way, we guarantee that the dataset has samples of the same subject with different noise, mainly due to different lighting and environments. Furthermore, by imposing this minimum time interval between sessions, it is possible to capture different attributes in the periocular region of the same subject, as the images are taken at different times of the day, e.g., subjects wearing and not wearing glasses and makeup. Another attractive feature of this dataset is that all participants are Brazilian, and as Brazil has great ethnic diversity, there are images of subjects from different ethnic backgrounds, making this one of the first periocular datasets with such diversity.

The images were collected from June 2019 to January 2020. The gender distribution of the subjects is 53.65% male and 46.35% female, and approximately \(66\%\) of the subjects are under 31 years old. In total, the dataset has images captured by 196 different mobile device models—the five most used models were: Apple iPhone 8 (4.1%), Apple iPhone 9 (3.1%), Xiaomi Mi 8 Lite (3.0%), Apple iPhone 7 (3.0%), and Samsung Galaxy J7 Prime (2.7%).

We remark that each subject captured all of their images using the same device model. The distribution of age, gender, and image resolutions present in our dataset is shown in Fig. 2.

Figure 2

Age, gender, and image resolution distributions in the UFPR-Periocular dataset. (a) Note that gender has a balanced distribution, but the age range is concentrated under 30 years old (64% of the subjects). (b) More than 45% of the images have a resolution between \(1034\times 480\) and \(1736\times 772\) pixels, and more than 65% of the images have a resolution higher than \(740\times 400\) pixels.

The dataset has 16,830 images of both eyes from 1,122 subjects. Image resolutions vary from \(360\times 160\) to \(1862\times 1008\) pixels, depending on the mobile device used to capture the image. We cropped and separated the periocular regions of the right and left eyes to perform the benchmark, assigning a unique class to each side. Note that, once an image was cropped, the remainder of the image was discarded, as stated in our project request to the Ethics Committee Board, to preserve the participants’ identities as much as possible. We manually annotated the eye corners with four points per image (inner and outer corners of each eye) and used these points to normalize the periocular region regarding scale and rotation. This process is detailed in Fig. 3.

Figure 3

Image acquisition and normalization process. First, after the subject took the shot, the rectangular region (outlined in blue) was cropped and stored. Then, the images were normalized in terms of rotation and scale using the manual annotations of the corners of the eyes. Lastly, the normalized images were cropped, generating the periocular regions of the left and right eyes.

Using the center point of each eye (the average of its corner points), the images were rotated and scaled to normalize the eye positions in a size of \(512\times 256\) pixels. Then, the images were split into two patches (\(256\times 256\) pixels each) corresponding to the left and right eyes, generating 33,660 periocular images from 2,244 classes. The intra- and inter-class variability in this dataset is mainly caused by lighting, occlusion, specular reflection, blur, motion blur, eyeglasses, off-angle capture, eye gaze, makeup, and facial expression.

This new periocular dataset is the main contribution of this work. It can be employed in future works to evaluate and perform research in biometrics, including recognition, detection, and segmentation. Furthermore, it can also be used to explore recent topics such as gender and age bias64,65, and to assess the scalability of biometric systems, since this dataset is the largest one in the literature in terms of the number of subjects. Regarding the semantic segmentation problem, we reproduced the experiments presented by Banerjee et al.66, which were proposed to generate segmentation masks for iris detection. This method consists of first transforming the raw image into the HSV and YCbCr color spaces, then using a threshold to binarize both images (HSV mask and YCbCr mask), and finally applying a dot product to the two masks to generate the final global mask. However, as the images from our dataset have considerably more noise than those employed in the original work, the method could not obtain masks of satisfactory quality for us to consider them as ground truth. For this reason, the semantic segmentation problem will be addressed in future work. A minimal sketch of this mask-generation procedure is shown below.
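The following Python sketch illustrates the color-space thresholding idea described above, interpreting the “dot product” of the two masks as an element-wise multiplication. The specific channels and threshold values are illustrative assumptions, not the parameters used by Banerjee et al.66.

```python
# Illustrative sketch only: channel choices and thresholds are assumptions.
import cv2
import numpy as np

def global_mask(bgr_image, hsv_thresh=128, ycbcr_thresh=128):
    """Binarize the image in the HSV and YCbCr color spaces and combine the masks."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    ycbcr = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)

    # Threshold one channel of each color space to obtain binary masks.
    _, hsv_mask = cv2.threshold(hsv[:, :, 2], hsv_thresh, 1, cv2.THRESH_BINARY)
    _, ycbcr_mask = cv2.threshold(ycbcr[:, :, 0], ycbcr_thresh, 1, cv2.THRESH_BINARY)

    # Element-wise product of the two masks yields the final global mask.
    return (hsv_mask * ycbcr_mask).astype(np.uint8)
```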

Experimental protocols

We propose protocols for the two most common tasks in biometric systems: identification (1:N) and verification (1:1). The identification task consists of determining the identity of a probe sample within a known dataset or gallery. The probe is compared against all gallery samples, and the closest match is taken as the subject’s identity. Alternatively, probabilistic models can be trained on the gallery data to determine the probe subject’s identity based on the highest confidence output. The verification task refers to the problem of verifying whether a subject is who she/he claims to be. If two samples match sufficiently, the identity is verified; otherwise, it is rejected59. Verification is usually used for positive recognition, where the goal is to prevent multiple people from using the same identity, whereas identification is a critical component of negative recognition, where the goal is to prevent a single person from using multiple identities67. Furthermore, the proposed protocol encompasses two different scenarios: closed-world and open-world. In the closed-world protocol, the dataset is split across different samples from the same subjects, i.e., the training and test sets have samples of the same subjects. In the open-world protocol, the training and test sets contain different subjects. The identification task is performed in the closed-world protocol, while the verification task can be performed in both the closed- and open-world protocols. For the open-world protocol, we also propose two different splits regarding the training and validation sets. Note that we do not change the test set, keeping it in the open-world protocol, and only vary the training protocols. The first split follows the closed-world protocol, in which the training and validation sets have samples from the same subjects. The second split, on the other hand, has different subjects in the training and validation sets, i.e., an open-world split. With these two training/validation splits, it is possible to use multi-class networks (classification/identification) as well as models based on the similarity of two distinct inputs (verification task): Siamese networks, triplet networks, and pairwise filters. Although models built for the verification task can be trained using the closed-world protocol, their design can be better tuned by using the open-world protocol to split the training and validation sets, as it is a more realistic scenario with respect to the test set. Table 2 summarizes the proposed protocols.

Table 2 Images, classes, and pairwise comparison distributions for the closed-world (CW) and open-world (OW) protocols.

We defined 3 folds with a stratified split into training, validation, and test sets for both biometric tasks (identification and verification) in all protocols. The test set comprises all-against-all comparisons for genuine pairs; to reduce the number of pairwise comparisons, impostor pairs are generated only between images with the same sequence index, i.e., the i-th images of the different subjects are combined two at a time to generate all impostor pairs, for \(1\le i\le n\), where \(n = 3 \text { sessions} \times 5 \text { images}\). As the UFPR-Periocular dataset has images captured in 3 sessions, we designated one session as the test set for each fold in the closed-world protocol. Thus, we have images from sessions 1 and 2, 2 and 3, and 3 and 1 for training/validation, and sessions 3, 1, and 2 for testing, respectively, for each of the three folds. To evaluate the ability of the models to recognize subjects in different environments, for all folds, we employed samples from both sessions in the training and validation sets to feed the models with images of the same subject captured under varying conditions. For each subject, we employed the first 3 images of each session for training and the remaining 2 for validation (\(60\%/40\%\) training/validation split). The test set contains new images of the subjects present in the training/validation sets, with different noise caused by the environment, lighting, occlusion, and facial attributes.
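The pair-generation scheme described above can be summarized by the following sketch, where `images_per_subject` is a hypothetical mapping from a subject identifier to its n = 15 test images ordered by sequence index.

```python
from itertools import combinations

def build_test_pairs(images_per_subject):
    """images_per_subject: dict mapping subject id -> list of n images,
    where the i-th entry of every list has the same sequence index."""
    genuine, impostor = [], []

    # Genuine pairs: all-against-all comparisons within each subject.
    for images in images_per_subject.values():
        genuine.extend(combinations(images, 2))

    # Impostor pairs: only images sharing the same sequence index i,
    # combined two at a time across subjects (1 <= i <= n).
    n = len(next(iter(images_per_subject.values())))
    for i in range(n):
        ith_images = [imgs[i] for imgs in images_per_subject.values()]
        impostor.extend(combinations(ith_images, 2))

    return genuine, impostor
```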

For the open-world protocol, we generated the training, validation, and test sets by splitting the dataset across different subjects. Thus, for each fold, the test set has samples of subjects not present in the training/validation set. Splitting sequentially by subject index for each fold, we have samples of 748 subjects for training/validation and 374 subjects for testing. Moreover, we propose two different training/validation splits: the first contains images of the same subjects in the training and validation sets (closed-world validation), while the second contains samples from different subjects in the training and validation sets (open-world validation). Both training/validation protocols have pros and cons. The advantage of the closed-world validation is that the training set has samples from more subjects than in the open-world validation protocol. However, in this scenario, the models may learn distinctive features only for the gallery samples and may not extract distinctive features for subjects not present in the training process. On the other hand, the open-world validation has samples from fewer subjects than the closed-world validation protocol, but presents a more realistic scenario, since samples of subjects not seen in the training stage appear in the validation set. In the closed-world validation protocol, for each of the 748 subjects in the training set, we used the first 3 images of each session for training and the remaining 2 for validation (\(60\%/40\%\) training/validation split). In the open-world validation protocol, we employed samples from the first 700 subjects for training and samples from the remaining 48 subjects for validation in each fold. The number of generated pairwise comparisons for all protocols is detailed in Table 2. The files defining all splits and setups described in this section are available along with the UFPR-Periocular dataset.
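As a rough illustration of the sequential subject split (the official split files distributed with the dataset remain the reference), the open-world folds could be built as follows; `subject_ids` is a hypothetical ordered list of the 1,122 subject identifiers.

```python
def open_world_folds(subject_ids, n_folds=3, n_test=374):
    """Assign a disjoint sequential block of subjects to the test set of each
    fold; the remaining subjects form the training/validation pool."""
    folds = []
    for k in range(n_folds):
        test_subjects = subject_ids[k * n_test:(k + 1) * n_test]
        test_block = set(test_subjects)
        trainval_subjects = [s for s in subject_ids if s not in test_block]
        folds.append({"trainval": trainval_subjects, "test": test_subjects})
    return folds
```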

Benchmark

To carry out an extensive benchmark, we employ different deep learning-based models and strategies that achieved promising results on the ImageNet dataset/contest68 and have been applied in recent works on ocular recognition6,32,35,36,69. These methods differ from each other in network architecture, loss function, and training strategy. We employed the following CNN models: Multi-class classification, Multi-task learning, Siamese networks, and Pairwise filters networks. Please note that we did not evaluate detection in this paper; we employ the images already cropped and resized (to normalize distance and rotation) to evaluate the recognition methods. In the following subsections, we describe and detail each of them.

Multi-class classification

Multi-class classification is the task of classifying instances into three or more classes, where each sample has a single class/label. Several techniques70,71,72 have been proposed that combine multiple binary classifiers to solve multi-class classification problems. Deep learning-based approaches usually address this problem through CNN models with a softmax cross-entropy loss. Therefore, we start by evaluating several CNN architectures that achieved expressive results on the ImageNet dataset/contest68. In summary, these architectures comprise several convolutional, pooling, activation, and fully connected layers, as shown in Fig. 4.

Figure 4

Multi-class classification CNN architecture.

In the training stage, these models are fed with batches of images and their labels. The model extracts the image features through convolutional, pooling, and fully connected (dense) layers. The last layer is a fully connected layer trained with the softmax cross-entropy loss. In this work, following previous approaches21,73,74, we considered each eye of each subject as a unique class, i.e., the left and right eyes belong to different classes. In this way, as expected, a person’s identity can only be verified by the same eye side, i.e., the left and right eyes of the same person cannot be matched. Below we describe the main characteristics of each model.
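As a minimal sketch of this setup (a Keras backbone pre-trained on ImageNet followed by a feature layer and a softmax layer over the 2,244 eye classes), one could write the following; the 256-d feature layer matches the experimental setup described later, while the ReLU activation and pooling choices are assumptions.

```python
import tensorflow as tf

NUM_CLASSES = 2244  # left and right eyes of the 1,122 subjects as separate classes

def build_multiclass_model(input_shape=(224, 224, 3)):
    # Backbone pre-trained on ImageNet (MobileNetV2 shown here for brevity).
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights="imagenet", pooling="avg")
    features = tf.keras.layers.Dense(256, activation="relu")(backbone.output)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(features)
    model = tf.keras.Model(backbone.input, outputs)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```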

VGG

The VGG model, proposed by Simonyan and Zisserman75, consists of a CNN using small convolution filters (\(3\times 3\)) with a fixed stride of 1 pixel. The spatial pooling is computed by 5 max-pooling layers over a \(2\times 2\) pixel window. Two models were proposed, varying the number of convolutional layers: VGG16 and VGG19. Both models have two fully connected layers at the top with 4096 channels each—these architectures achieved the first and second places in the localization and classification tracks of the ImageNet Challenge 2014. The authors also stated that it is possible to improve prior-art configurations by increasing the depth of the models. Parkhi et al.76 applied these models (called VGG16-Face) to the face recognition problem, showing that a deep CNN with a simpler network architecture can achieve results comparable to the state of the art. Furthermore, recent approaches for ocular (iris/periocular) biometrics employing VGG models have demonstrated the ability to produce discriminant features6,32,35,36,69,77,78. In this work, we employed the VGG16 and VGG16-Face models to perform the benchmark.

ResNet

The Residual Network (ResNet) was introduced by He et al.79 and has been applied in biometrics to face recognition80, iris recognition6,35,69,77,81 and periocular recognition6,37,78,82. The authors addressed the degradation (vanishing gradient) problem caused by deeper network architectures by proposing a deep residual learning framework. They added shortcut connections between residual blocks to insert residual information. These residual blocks are composed of a weighted layer followed by batch normalization, an activation function, another weighted layer, and batch normalization. Let F(x) be a residual block and x the input of this block (identity mapping); the residual information consists of adding x to F(x), i.e., \(F(x) + x\), and using the result as input to the next residual block. Different architectures were proposed and evaluated, varying the depth of the models: ResNet50, ResNet101, and ResNet152. These models achieved promising results on the ImageNet dataset68. In a later work83, He et al. proposed ResNetV2, modifying the residual block to use pre-activation. Empirical experiments showed that the proposed method improved the network generalization ability, reporting better results than ResNetV1 on ImageNet.
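A simplified residual block in Keras could look like the sketch below; it omits the projection shortcut used when the spatial size or number of channels changes, so it only covers the identity-mapping case described above.

```python
import tensorflow as tf

def residual_block(x, filters):
    """Identity-shortcut residual block: output = ReLU(F(x) + x)."""
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.Add()([y, shortcut])  # residual connection: F(x) + x
    return tf.keras.layers.ReLU()(y)
```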

InceptionResNet

The InceptionResNet model84 combines residual connections79 with the Inception architecture85. The first Inception model86, known as GoogLeNet, introduced the Inception module aiming to increase the network depth while keeping a relatively low computational cost. The main idea of Inception is to approximate a sparse CNN with a normal dense construction. The Inception module consists of several convolutional layers whose output filter banks are concatenated and used as the input to the next module; the model versions differ in the organization of their Inception modules. Combining the residual connections with the InceptionV3 and InceptionV4 models, the authors developed InceptionResNetV1 and InceptionResNetV2, respectively. Experiments performed on the ImageNet dataset showed that the InceptionResNet models trained faster and reached slightly better results than the pure Inception architecture84. In our experiments, we employed the InceptionResNetV2 model, since it achieved the best results on ImageNet.

MobileNet

The first version of the MobileNet model (MobileNetV1)87 was developed focusing on mobile and embedded vision applications, in which it is desirable that the CNN model has a small size and high computational efficiency. This model is based on depthwise separable filters, which are composed of depthwise and pointwise convolutions. As described in87, depthwise convolutions apply a single filter to each input channel, and pointwise convolutions use a \(1\times 1\) convolution to compute a linear combination of the depthwise outputs. Both layers use batch normalization and ReLU activation. MobileNetV1 achieved promising results in terms of both efficiency and accuracy on several tasks such as fine-grained recognition, large-scale geolocation, face attribute classification, object detection, and face recognition87. MobileNetV288 combines the first version’s architecture with an inverted residual structure (inspired by ResNet79), which has shortcut connections between the bottleneck layers. Experiments performed on different tasks such as image classification, object detection, and image segmentation showed that MobileNetV2 can achieve high accuracy with low computational cost compared to state-of-the-art methods88.
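The depthwise separable building block can be sketched in Keras as follows; the kernel size and ReLU variant are illustrative assumptions.

```python
import tensorflow as tf

def depthwise_separable_block(x, pointwise_filters, stride=1):
    """Depthwise conv (one 3x3 filter per input channel) followed by a 1x1
    pointwise conv, each with batch normalization and ReLU."""
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(pointwise_filters, 1, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)
```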

DenseNet

The Dense Convolutional Network (DenseNet) model89 consists of a CNN architecture in which each layer is connected to every other layer in a feed-forward fashion. Thus, for a network with L layers, a DenseNet has \(\frac{L(L+1)}{2}\) direct connections, instead of the L connections of a traditional CNN. As in the ResNet models79,83, these connections mitigate the vanishing-gradient problem and ensure maximum information flow between layers. The feed-forward nature is preserved by passing the outputs of all preceding layers as additional inputs to the subsequent ones through channel-wise concatenation. The DenseNet models achieved state-of-the-art accuracies in image classification on the CIFAR-10/100 and ImageNet datasets68,89. The authors proposed different models varying the depth of the network; in our experiments, we employed DenseNet121 (the shallowest one).
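The dense connectivity pattern can be sketched as below; the growth rate, number of layers, and the omission of the 1×1 bottleneck are simplifying assumptions.

```python
import tensorflow as tf

def dense_block(x, num_layers=4, growth_rate=32):
    """Each layer receives the channel-wise concatenation of all preceding
    feature maps and appends its own output to it."""
    for _ in range(num_layers):
        y = tf.keras.layers.BatchNormalization()(x)
        y = tf.keras.layers.ReLU()(y)
        y = tf.keras.layers.Conv2D(growth_rate, 3, padding="same")(y)
        x = tf.keras.layers.Concatenate()([x, y])  # dense (concatenative) connection
    return x
```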

Xception

The Xception model was inspired by the Inception modules, which its author interprets as an intermediate step between a regular convolution and a depthwise separable convolution90. The proposed architecture replaces the standard Inception modules with depthwise separable convolutions and adds residual connections. Xception has a number of parameters similar to InceptionV3 but outperforms it on the ImageNet dataset68.

Multi-task learning

Multi-task learning improves generalization by using the domain information of related tasks as an inductive bias91. This architecture learns several tasks using a shared CNN model, where each task can help the generalization of the others. Caruana91 introduced the multi-task learning concept and evaluated it in different domains, demonstrating that this method can achieve better results than single-task learning models for related tasks. In deep neural networks, multi-task learning can be performed in two different setups: hard or soft parameter sharing92. In hard parameter sharing, all the hidden (convolutional) layer weights are shared, i.e., the model learns a single representation for all tasks; in this configuration, it is also possible to add task-specific layers93. In soft parameter sharing, on the other hand, each task is processed by a different model, and the parameters of these models are regularized to encourage similarity among them.

As shown in Fig. 5, our Multi-task network shares all convolutional layers and some dense layers. The model has exclusive dense layers for each task, followed by the prediction layers, using the softmax cross-entropy as the loss function.

Figure 5

Multi-task CNN architecture. In this model, each task has its own output and all tasks share the convolutional layers. The loss of all tasks is used to update the weights of the convolutional layers.

In this work, based on the results of multi-class classification, we employ MobileNetV2 as the base model on our multi-task approach. Furthermore, as detailed in Table 3, we build our multi-task model with hard parameter sharing for the following 5 tasks: (i) class prediction, (ii) age rate, (iii) gender, (iv) eye side, and (v) smartphone model.

Table 3 Multi-task architecture in the closed-world protocol.

For the age estimation task, we generate the classes by grouping ages into the following 10 ranges: 18–20, 21–23, 24–26, 27–29, 30–34, 35–39, 40–49, 50–59, 60–69, and 70–79. The gender and eye side prediction tasks have only 2 classes, while the smartphone model prediction has 196 classes. Note that it is possible to employ weighted loss for each task in the Multi-task learning networks, penalizing the wrong classification of some tasks more than others. For simplicity, in this work, we do not use weighted losses in our experiments, giving equal importance to all tasks.

As shown in Table 3, we build exclusive dense layers for each task by connecting them directly to the backbone model (MobileNetV2). Then, each dense layer is connected to its respective prediction layer, so that each task has its own specialized (feature) dense layer. A sketch of this architecture is given below.
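The hard-parameter-sharing architecture, with one exclusive dense layer and one softmax head per task, could be sketched as follows; the head sizes and activations here are assumptions, and the official layer dimensions are those given in Table 3.

```python
import tensorflow as tf

TASKS = [("class", 2244), ("age", 10), ("gender", 2),
         ("eye_side", 2), ("device", 196)]

def build_multitask_model(input_shape=(224, 224, 3)):
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights="imagenet", pooling="avg")
    shared = backbone.output

    outputs, losses = [], {}
    for name, n_classes in TASKS:
        # Exclusive dense (feature) layer per task, connected to the backbone.
        feat = tf.keras.layers.Dense(256, activation="relu",
                                     name=f"{name}_feat")(shared)
        outputs.append(tf.keras.layers.Dense(n_classes, activation="softmax",
                                             name=name)(feat))
        losses[name] = "sparse_categorical_crossentropy"

    model = tf.keras.Model(backbone.input, outputs)
    # Unweighted sum of the task losses (equal importance to all tasks).
    model.compile(optimizer="sgd", loss=losses)
    return model
```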

Pairwise filters network

Inspired by Liu et al.94, which is one of the first works applying deep learning to iris verification, we also evaluate the performance of a pairwise filters network. This kind of model directly learns the similarity between a pair of images through pairwise filters. The Pairwise Filters Network is essentially a classification model with one or two outputs indicating whether the input pair belongs to the same class or to different classes; the difference is that the network input is a pair of images instead of a single image. Thus, the network architecture consists of convolutional, pooling, activation, and fully connected layers, as shown in Fig. 6.

Figure 6

Pairwise filters CNN architecture. This model contains filters that directly learn the similarity between a pair of images. The output informs whether the images are of the same person or not.

As described by Liu et al.94, in this kind of model the similarity map is generated through convolutions that summarize the feature maps of a pair of input images. We generate the input pairs by concatenating the images along the channel dimension. For example, given two RGB images with shapes of \(224\times 224\times 3\), concatenating them by their channels produces an input of shape \(224\times 224\times 6\) (\(224\times 224\) pixels by 6 channels, 3 from the first image and 3 from the second). These inputs pass through convolutional layers that generate feature maps encoding their similarity. The output of our model has two neurons and uses a softmax cross-entropy loss. Since the verification problem has only two classes, this model’s output could instead have a single neuron with a binary cross-entropy loss. As in the Multi-task network, we employ MobileNetV2 as the base model for our Pairwise Filters Network.
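A sketch of this setup is shown below. Since ImageNet weights cannot be loaded directly for a 6-channel input, the backbone here is randomly initialized; whether and how pre-trained weights are transferred to the 6-channel variant is not detailed in this work, so that part is an assumption.

```python
import tensorflow as tf

def build_pairwise_filters_model(input_shape=(224, 224, 6)):
    """The 6-channel input is the channel-wise concatenation of an RGB image pair."""
    inputs = tf.keras.Input(shape=input_shape)
    backbone = tf.keras.applications.MobileNetV2(
        input_tensor=inputs, include_top=False, weights=None, pooling="avg")
    outputs = tf.keras.layers.Dense(2, activation="softmax")(backbone.output)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
    return model

# Building a single 6-channel input from an RGB pair:
# pair = tf.concat([image_a, image_b], axis=-1)  # shape (224, 224, 6)
```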

Siamese network

Siamese networks were first described by Bromley et al.95 for signature verification. This architecture consists of twin branches sharing their trainable parameters. Such models are generally employed for verification tasks since they learn similarities/distances between a pair of inputs. As illustrated in Fig. 7, each branch of the Siamese structure is composed of a CNN model followed by some dense layers. These models can also have shared and non-shared dense layers at the top.

Figure 7

Siamese CNN architecture. This model is composed of two twin branches of convolutional layers sharing their trainable parameters. The output computes a distance between the input image pairs.

As detailed in Table 4, we employ MobileNetV2 as the base model for each branch of the Siamese network. We use the contrastive loss96,97 in the training stage to compute the similarity between the input pair images.

Table 4 Siamese network architecture description.

As described in97, let \(D_W\) be the Euclidean distance between two input vectors, the contrastive loss can be written as follows:

$$\begin{aligned} C(W) = \sum _{i=1}^{P}L(W,(Y,\mathbf {X_{1}},\mathbf {X_{2}})^{i}), \end{aligned}$$
(1)

where

$$\begin{aligned} L(W,(Y,\mathbf {X_{1}},\mathbf {X_{2}})^{i}) = (1 - Y)L_{S}(D_{W}^{i}) + YL_{D}(D_{W}^{i}) \, , \end{aligned}$$
(2)

and P is the number of training pairs, \((Y,\mathbf {X_{1}},\mathbf {X_{2}})^{i}\) corresponds to the i-th label (Y) of the sample pair \(\mathbf {X_{1}},\mathbf {X_{2}}\), and \(L_{S}\) and \(L_{D}\) are the partial losses for similar and dissimilar pairs, respectively. Minimizing L drives \(D_{W}\) toward low values for similar pairs (through \(L_{S}\)) and high values for dissimilar pairs (through \(L_{D}\)).

The contrastive loss was proposed and applied to face verification96,97 and has been employed for periocular recognition98,99 and iris recognition69.
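A common instantiation of this loss (the margin-based form of Hadsell et al.97, with the margin value as a hyperparameter of our choosing) can be written as follows, using the label convention of Eq. (2), where Y = 0 for genuine pairs and Y = 1 for impostor pairs:

```python
import tensorflow as tf

def contrastive_loss(y_true, d_w, margin=1.0):
    """y_true: 0 for similar (genuine) pairs, 1 for dissimilar (impostor) pairs.
    d_w: Euclidean distance between the embeddings of the two branches."""
    y_true = tf.cast(y_true, d_w.dtype)
    loss_similar = 0.5 * tf.square(d_w)                               # L_S
    loss_dissimilar = 0.5 * tf.square(tf.maximum(margin - d_w, 0.0))  # L_D
    return tf.reduce_mean((1.0 - y_true) * loss_similar + y_true * loss_dissimilar)
```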

Results and discussion

This section presents the benchmark results for the identification and verification tasks. We first describe the experimental setup used to perform the benchmark. Then, we report and discuss the results achieved by each approach.

Experimental setup

Inspired by several recent works6,32,34,35,37,63,69,82,100, we perform the benchmark employing models pre-trained on ImageNet and also for face recognition (VGG16-Face and ResNet50-Face). Afterward, we fine-tuned these models using the UFPR-Periocular dataset. Similar to recent works on ocular recognition7,32,35,36, we modify all models by adding a fully connected layer before the last (softmax) layer to generate a feature vector of size 256 for each image. The default input size of the models is \(224\times 224\times 3\), except for the InceptionResNet and Xception models, which have an input size of \(299\times 299\times 3\). Note that the input dimensions differ because we are using pre-trained models, and therefore our fine-tuning process follows the original architectures’ input sizes. Thus, for training and evaluation, the periocular images were resized to fit the input size required by each method, i.e., \(299\times 299\times 3\) for both InceptionResNet and Xception and \(224\times 224\times 3\) for the remaining models.

For all methods, training was performed for 60 epochs with a learning rate of \(10^{-3}\) for the first 15 epochs and \(5\times 10^{-4}\) for the remaining epochs, using the Stochastic Gradient Descent (SGD) optimizer. Then, we used the weights from the epoch that achieved the lowest loss on the validation set to perform the evaluation.
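In Keras, this schedule and model-selection criterion could be implemented roughly as below; the checkpoint filename and the `train_ds`/`val_ds` dataset objects are placeholders.

```python
import tensorflow as tf

# Learning-rate schedule: 1e-3 for the first 15 epochs, 5e-4 afterwards.
def lr_schedule(epoch, lr):
    return 1e-3 if epoch < 15 else 5e-4

callbacks = [
    tf.keras.callbacks.LearningRateScheduler(lr_schedule),
    # Keep the weights of the epoch with the lowest validation loss.
    tf.keras.callbacks.ModelCheckpoint("best_weights.h5", monitor="val_loss",
                                       save_best_only=True),
]
# model.compile(optimizer=tf.keras.optimizers.SGD(), loss=...)
# model.fit(train_ds, validation_data=val_ds, epochs=60, callbacks=callbacks)
```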

We employ Rank 1 and Rank 5 accuracy for the identification task, and the Area Under the Curve (AUC), Equal Error Rate (EER), and Decidability (DEC) metrics for verification. Furthermore, to generate the verification scores, we compute the cosine distance between the deep representations generated by each CNN model. As described and applied in several works with state-of-the-art results5,6,32,35, the cosine distance is computed from the cosine of the angle between two vectors, being invariant to scalar transformations. This measure gives more weight to the orientation than to the magnitude of the representations, making it a suitable metric to compute the similarity between two vectors. The cosine distance is given by:

$$\begin{aligned} d_{c}(A,B) = 1 - \frac{\sum _{j=1}^{N}A_{j}B_{j}}{\sqrt{\sum _{j=1}^{N}A_{j}^2} \sqrt{\sum _{j=1}^{N}B_{j}^2}} \,, \end{aligned}$$
(3)

where A and B stand for the feature vectors.
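Equation (3) corresponds directly to the following NumPy one-liner:

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two feature vectors, as in Eq. (3)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```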

Regarding the models explicitly developed for the verification task, i.e., the Siamese and Pairwise Filters networks, since this task has an unbalanced number of genuine and impostor pairs, selecting the best samples to perform the training is challenging. Thus, trying to fit the models by feeding them samples as diverse as possible, we employed all genuine pairs and randomly selected the same number of impostor pairs for each epoch. Hence, each epoch may have different impostor samples. However, for a fair comparison, we generated the random impostor pairs only once for each epoch and fold, and used the same samples for training both models.
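This sampling strategy can be sketched as follows; seeding the random generator by epoch and fold (a choice we assume here) makes the same impostor subset reusable across both models.

```python
import random

def epoch_pairs(genuine_pairs, impostor_pairs, epoch, fold):
    """All genuine pairs plus an equally sized, reproducible random subset
    of impostor pairs for the given epoch and fold."""
    rng = random.Random(1000 * fold + epoch)  # deterministic per (fold, epoch)
    sampled = rng.sample(impostor_pairs, len(genuine_pairs))
    return genuine_pairs + sampled
```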

The reported results are from five repetitions for each fold, except for the Siamese and Pairwise filter networks, in which we ran only three repetitions due to the high computational cost. All experiments were performed on a computer with an AMD Ryzen Threadripper 1920X 3.5GHz (4.0GHz Turbo) CPU, 64 GB of RAM and an NVIDIA Quadro RTX 8000 GPU (48 GB). All CNN models were implemented in Python using the Tensorflow (https://www.tensorflow.org/) and Keras (https://keras.io/) frameworks.

Benchmark results

This section presents the results obtained by each approach in the closed-world and open-world protocols. An ablation study was performed to evaluate each task’s influence in the identification mode of the Multi-task learning network. Table 5 shows the size and the number of trainable parameters of each CNN model used in the benchmark. This information was extracted from the models employed in the closed-world protocol, since they have more neurons in the last layer than the open-world protocol models. We also report the results achieved by the state-of-the-art method that took first place in the VISOB 2 competition on mobile ocular biometric recognition101. This method6 consists of an ensemble of five ResNet-50 models pre-trained for face recognition and fine-tuned using the periocular images of our dataset, following the same experimental protocol described in this work.

Table 5 Size (MB) and number of trainable parameters of the CNN models used in the benchmark.

As can be seen, the benchmark has a great diversity of models with different sizes and parameters due to their difference in structure, depth, concept, and architectures.

Closed-world protocol

We perform the benchmark for both the identification and verification tasks in the closed-world protocol. All results are presented in Table 6 and Fig. 8. Even though MobileNetV2 is the smallest model in terms of size and number of trainable parameters, it achieved the best results for the identification and verification tasks. Therefore, we employed MobileNetV2 as the base model for the Multi-task, Siamese, and Pairwise Filters networks.

Table 6 Benchmark results in the closed-world protocol for the identification and verification tasks.
Figure 8

Receiver operating characteristic curve to compare methods in the closed-world protocol.

The Multi-task model achieved the best results in terms of the Rank 1, Rank 5, AUC, and EER metrics. We emphasize that the other tasks—age, gender, eye side, and mobile device model—were used only in the training stage of this model. We extracted the representations only from the classification task to evaluate the identification (using the softmax layer) and verification (using the cosine distance) tasks. The Siamese network obtained the worst results in the benchmark. In contrast, the Pairwise Filters network reached the highest Decidability index, indicating that it was the best at separating the genuine and impostor distributions. Nevertheless, it did not achieve the best results in terms of AUC and EER.

The models pre-trained for face recognition generally achieved better results than those pre-trained on the ImageNet dataset, as stated in previous works32,100.

Open-world protocol

The main idea of the open-world protocol is to evaluate the capability of the methods to extract discriminant features from samples of classes that are not present in the training stage. Thus, for this protocol, we perform a benchmark only for the verification task. The results are shown in Table 7 and Fig. 9.

Table 7 Benchmark results in the open-world protocol for the verification task.
Figure 9

Receiver operating characteristic curve to compare methods in the open-world protocol.

As in the closed-world protocol, the Multi-task model achieved the best results in terms of AUC and EER, and the Pairwise Filters network achieved the best Decidability index. The Siamese and Pairwise Filters networks trained using the closed-world validation split reached better results than when trained using the open-world validation split. We believe this occurred because there are fewer classes in the training set of the open-world validation split than in the closed-world validation split. Although the open-world validation split corresponds to a more realistic scenario with respect to the test set, networks trained with samples from a larger number of classes can reach a higher generalization capability, producing discriminative representations even for samples from classes not present in the training stage.

Multi-task learning

The Multi-task model reached the best results in the closed- and open-world protocols. As this network simultaneously learns different tasks, we perform an ablation study by running some experiments with 4 new models created by removing one of the tasks at a time. The experiments were carried out in the closed-world protocol evaluating the performance for identification and verification. We also evaluated the results achieved by all models in each task.

Table 8 Results (%) from several multi-task models trained to predict different tasks.

According to Table 8, the Multi-task network without the mobile device model prediction was the most penalized in the identification task, followed by the variations without age, gender, and eye side estimation, respectively. All models handled the gender and eye side classification tasks well, while the device model and age range classification tasks proved more challenging. One problem in the device model and age range classification is the unbalanced number of samples per class; such bias probably contributed to the lower results achieved in these two tasks.

Note that we only employed the class prediction for the matching in both closed-world and open-world protocols. However, as shown in Table 8, the multi-task architecture also achieved promising results in the other tasks. In this sense, it may be possible to further improve the recognition results by adopting heuristic rules based on the scores of the other tasks.

Subjective evaluation

In this section, we perform a subjective evaluation through visual inspection of the image pairs erroneously classified by the Multi-task model, which achieved the best result in the verification task in the closed-world protocol. The best impostor pairs (impostors classified as genuine) and the worst genuine pairs (genuine pairs classified as impostors) are presented in Fig. 10.

Figure 10

Pairwise images wrongly classified by the model that obtained the best result in the verification task in the open-world protocol. Higher scores mean that the pair of periocular images is more likely to be genuine.

Performing a visual analysis of all pairwise errors, it is clear that hair occlusion, age, eyeglasses, and eye shape were the most influential factors that led the model to the wrong classification of genuine pairs (intra-class comparison). In pairs wrongly classified as impostors (inter-class comparison), we observed that lighting, blur, eyeglasses, off-angle capture, eye gaze, reflection, and facial expression caused the main differences between the images. We hypothesize that some errors caused by lighting, blur, reflection, and occlusion can be reduced by employing data augmentation techniques in the training stage. Attribute normalization5 can also reduce the errors caused by attributes present in the periocular region such as eyeglasses, eye gaze, makeup, and some types of occlusion. Although such methods can reduce the matching errors, there are still several characteristics in these images that make mobile periocular recognition a challenging task, mainly due to the high intra-class variations.

Conclusion

This article introduces a new periocular dataset containing images captured in unconstrained environments, in different sessions, using several mobile device models. The main idea was to create a dataset with real-world images regarding lighting, noise, and attributes in the periocular region. To the best of our knowledge, this is the first periocular dataset in the literature with more than 1,000 subjects and the one with the largest number of different sensors (196).

We presented an extensive benchmark with several CNN models and architectures employed in recent works for periocular recognition. These architectures consist of models for multi-class classification and multi-task learning, in addition to Siamese and pairwise filters networks. We evaluated the methods in the closed-world and open-world protocols, as well as for the identification and verification tasks. For both protocols and tasks, the multi-task model achieved the best results. Thus, we conducted an ablation study on this model to understand which tasks most influenced the results. We found that the mobile device model identification task was the most important, followed by age range, gender, and eye side classification. Note that we did not conduct experiments employing only left or right eye images or images separated by gender. The model trained using all these tasks reported the best results for identification and verification in both the closed- and open-world protocols.

In a complementary way, we performed a subjective analysis of the image pairs wrongly classified (impostor pairs accepted as genuine and genuine pairs rejected as impostors) by the Multi-task model, which achieved the best performance for the verification task. We observed that lighting, occlusion, and image resolution were the most critical factors that led the model to wrong verification decisions.

We believe that the UFPR-Periocular dataset will be of great relevance to assist in evolving periocular biometric systems using images obtained by mobile devices in unconstrained scenarios. This dataset is the most extensive in terms of the number of subjects in the literature and has natural within-class variability due to samples captured in different sessions.

The Multi-task network using MobileNetV2 as the base model achieved the best benchmark results for the identification and verification tasks, reaching a Rank 1 accuracy of 84.32% and an EER of 0.81% in the closed-world protocol, and an EER of 2.81% in the open-world protocol, with decision thresholds of 0.80 and 0.78, respectively. Therefore, there is still room for improvement in both the identification and verification tasks.