1 Introduction

Biometric user verification on mobile devices has all but won the top spot as the user access control method of choice [8, 13]. Biometrics has brought convenience and enhanced security to a wide range of applications such as user login, payments, and eCommerce in general. The use of biometrics on mobile devices is termed mobile biometrics [13].

Thanks to deep learning and advances in camera technology, mobile face biometrics has come a long way in terms of robustness, accuracy, and user experience. However, given recent privacy concerns, especially amid the COVID-19 pandemic and the resulting face-covering mandates, there is an intensified desire for alternatives to face recognition [1, 2]. According to a 2020 NIST study [9], the presence of face masks can cause face recognition systems to fail at rates of up to \(50\%\). Ocular biometrics offers a viable alternative to mobile face recognition given that, similar to the face, the ocular band can be acquired using the front-facing RGB camera of a mobile device. Ocular biometrics in its own right has attracted considerable attention from the research community thanks to its accuracy, security, and robustness to facial expressions [12, 16]. The ocular regions studied for their biometric utility include the iris [5], the conjunctival and episcleral vasculature [4], and the periocular region [7]. Several datasets have been published capturing ocular images in the visible spectrum under various conditions, including UBIRIS [11] (241 subjects), MICHE-I [3] (92 subjects), and VISOB [10]. The last offers the largest number of subjects (550), captured in a mobile environment; part of this dataset was used for the VISOB 1.0 ICIP 2016 ocular biometric recognition competition.

Following the success of our VISOB ICIP 2016 competition [14], we organized the VISOB 2.0 competition [10] as part of the IEEE WCCI 2020 conference using a different subset of the VISOB database. The differences between the dataset used in WCCI 2020 and the ICIP 2016 version are given in Table 1. In the VISOB 2.0 competition, we extended the region of interest from a tight eye crop (mainly iris, conjunctival, and episcleral vasculature) to a larger periocular region (encompassing the eye and the surrounding skin). The evaluation protocol for VISOB 2.0 is subject-independent (akin to open-set identification), in which the subjects in the training and testing sets do not overlap, compared to the less challenging subject-dependent evaluation used in the ICIP VISOB 1.0 competition. More specifically, in VISOB 1.0 the 150 subjects in the testing set overlapped with the 550 identities in the training set, whereas there are no overlapping identities between the training and testing sets in VISOB 2.0. Further, instead of the single-frame eye captures of VISOB 1.0, VISOB 2.0 samples consist of stacks of five images captured in rapid succession (burst mode), opening the door for multi-frame enhancement.

Table 1. Differences between the VISOB 1.0 and VISOB 2.0 competitions.

We note that multi-frame ocular biometrics in the visible spectrum has not attracted much attention in the research community [15], which could be due in part to the lack of public multi-frame datasets, something that VISOB 2.0 strives to overcome. Single-frame captures from the front-facing “selfie” camera may introduce degradation due to illumination variations, noise, blur, and user-to-camera distance, all of which adversely affect matching performance. One way to mitigate this problem is to capture multiple frames of the eye in burst mode, followed by multi-frame image enhancement. Frames may be fused at the input level (e.g., using multi-frame image enhancement and super-resolution techniques) or at the feature or score level (e.g., a multi-matcher system) for enhanced matching performance (Fig. 1).

Fig. 1. Example eye images from VISOB 2.0, WCCI 2020 competition edition.

2 VISOB 2.0 Dataset and Protocol

VISOB 2.0 Dataset: The VISOB 2.0 dataset used in the WCCI 2020 competition is publicly available and consists of stacks of eye images captured in burst mode by two mobile devices: a Samsung Note 4 and an Oppo N1. During data collection, volunteers were asked to take selfie images in two visits, 2 to 4 weeks apart. The selfie-like images were captured with the participant holding the phone naturally, using the front-facing camera of the device under three lighting conditions: daylight, indoor (office) lighting, and dim indoors, in two sessions (about 10 to 15 min apart). The ocular burst stacks were cropped from the full-face frames. A burst sequence was retained only if the correlation coefficient between the center frame and each of the remaining four frames was greater than \(90\%\) (i.e., no excessive motion). We detected face and eye landmarks using the Dlib library [6]. The eye crops were generated such that the width and height of the crop are \(2.5\times \) the eye’s corner-to-corner width.
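For illustration, the sketch below shows how the two preprocessing steps described above (burst screening by correlation with the center frame, and eye cropping at \(2.5\times\) the corner-to-corner eye width) could be implemented with Dlib and NumPy. The grayscale correlation, the centering of the crop on the midpoint of the eye corners, and the `shape_predictor_68_face_landmarks.dat` model file are assumptions made for the sketch, not details taken from the competition pipeline.

```python
# Minimal sketch of the burst screening and eye-cropping steps (assumptions noted above).
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def burst_is_stable(frames, threshold=0.90):
    """Keep a 5-frame burst only if every frame correlates > threshold with the center frame."""
    center = frames[len(frames) // 2].ravel().astype(np.float64)
    for i, frame in enumerate(frames):
        if i == len(frames) // 2:
            continue
        r = np.corrcoef(center, frame.ravel().astype(np.float64))[0, 1]
        if r <= threshold:
            return False
    return True

def crop_eye(gray, corner_a=36, corner_b=39, scale=2.5):
    """Crop a square region whose side is `scale` times the eye's corner-to-corner width.

    Landmark indices 36/39 are the outer/inner corners of one eye in Dlib's
    68-point model; use 42/45 for the other eye.
    """
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    xa, ya = shape.part(corner_a).x, shape.part(corner_a).y
    xb, yb = shape.part(corner_b).x, shape.part(corner_b).y
    eye_width = np.hypot(xb - xa, yb - ya)
    side = int(round(scale * eye_width))
    cx, cy = (xa + xb) // 2, (ya + yb) // 2          # assumed: crop centered between the corners
    x0, y0 = max(cx - side // 2, 0), max(cy - side // 2, 0)
    return gray[y0:y0 + side, x0:x0 + side]
```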

Protocol: VISOB 2.0, WCCI 2020 edition, consists of captures from 150 identities. Both left and right eyes from two visits were provided to the participants; the data characteristics are given in Table 2. Images from visit 1 and visit 2 (2–4 weeks apart) under the three aforementioned lighting conditions were provided in order to keep the focus on long-term verification and cross-illumination comparisons. No image enhancement was applied to the data, so that participants could perform end-to-end learning to obtain the best fusion of biometric information and multi-frame image enhancement from the burst of input images. In order to evaluate the submissions under realistic scenarios, we set up this competition in a subject-independent manner. Participants were asked to submit a model that generates a match score from a pair of images (a simple reference-probe comparison). Table 3 shows the 18 experiments, comprising 3.6M comparisons across different lighting conditions at the evaluation stage. We used the Equal Error Rate (EER), the ROC Area Under the Curve (AUC), and Genuine Match Rates (GMR) at \(10^{-2}\), \(10^{-3}\), and \(10^{-4}\) False Match Rates (FMR) to evaluate accuracy.
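As a reference for how these metrics relate to the raw scores, the following is a minimal sketch of computing EER, AUC, and GMR at fixed FMR targets from lists of genuine/impostor match scores using scikit-learn. The thresholding and interpolation conventions used by the organizers are not specified, so this is illustrative only.

```python
# Illustrative verification metrics from raw match scores (not the organizers' exact code).
import numpy as np
from sklearn.metrics import roc_curve, auc

def verification_metrics(scores, labels, fmr_targets=(1e-2, 1e-3, 1e-4)):
    """labels: 1 for genuine pairs, 0 for impostor pairs; scores: higher means more similar."""
    fmr, gmr, _ = roc_curve(labels, scores)              # FMR = false positive rate, GMR = true positive rate
    roc_auc = auc(fmr, gmr)
    # EER approximated at the ROC point where FMR is closest to FNMR (= 1 - GMR).
    eer = fmr[np.nanargmin(np.abs(fmr - (1.0 - gmr)))]
    # GMR at each target FMR via linear interpolation along the ROC curve.
    gmr_at_fmr = {t: float(np.interp(t, fmr, gmr)) for t in fmr_targets}
    return {"EER": float(eer), "AUC": float(roc_auc), "GMR@FMR": gmr_at_fmr}
```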

Table 2. Number of VISOB 2.0 training images provided to the challenge participants.

3 Summary of Participants’ Algorithms

Department of Informatics, Federal University of Parana (UFPR), Curitiba, PR, Brazil: Zanlorensi et al.’s submission is an ensemble of five ResNet-50 models pre-trained on the VGG-Face dataset proposed in [17]. Each ResNet-50 was fine-tuned with a softmax loss for 30 epochs on the periocular images of the VISOB 2.0 training subset. The last fully connected layer of the original architecture was removed and replaced by two fully connected layers: a feature layer containing 256 neurons and a prediction layer consisting of 300 neurons, matching the number of classes in the training set (left and right eyes from 150 subjects). At inference time, the prediction layer was removed and the output of the feature layer was taken as the deep feature vector for each input image. For each stack of five images, the ensemble of five ResNet-50 models generates a combined feature vector of length 1280 (5\(\times \)256). The authors used cosine similarity to generate a match score between enrollment and probe ocular image pairs.
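A minimal sketch of this matching stage is given below, assuming each fine-tuned ResNet-50 is available as a callable that maps an image to its 256-D feature vector (a hypothetical interface) and that each of the five models encodes one frame of the burst; the concatenated 1280-D descriptors are then compared with cosine similarity. How the team actually pairs models with frames is not stated, so this pairing is an assumption.

```python
# Sketch of the ensemble descriptor and cosine match score (interface assumed, see lead-in).
import numpy as np

def stack_descriptor(models, frames):
    """Concatenate the 256-D outputs of the five models over a 5-frame burst (5 x 256 = 1280-D)."""
    feats = [model(frame) for model, frame in zip(models, frames)]
    return np.concatenate(feats)

def cosine_score(enroll_vec, probe_vec):
    """Cosine similarity used as the match score between enrollment and probe descriptors."""
    num = float(np.dot(enroll_vec, probe_vec))
    den = np.linalg.norm(enroll_vec) * np.linalg.norm(probe_vec) + 1e-12
    return num / den
```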

Bennett University, India: Ritesh Vyas’ submission employed hand-crafted features, namely directional threshold local binary patterns (DTLBP) combined with a wavelet transform for feature extraction; this was the only non-deep-learning approach submitted to the competition. The author used Daubechies orthogonal wavelets to facilitate multi-resolution analysis, while the local texture operator captures the distinctive intensity variations of the periocular image. DTLBP is more robust to noise and extracts more discriminative feature representations than the standard local binary pattern (LBP). The chi-square distance was used to compare features from two stacks of images, followed by score normalization.
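The sketch below illustrates the comparison step only: a chi-square distance between two non-negative, histogram-style feature vectors, followed by a simple min-max normalization. The DTLBP/wavelet extraction itself is not reproduced, and the particular normalization scheme used in the submission is not stated, so the min-max choice here is an assumption.

```python
# Chi-square comparison of histogram features plus an assumed min-max score normalization.
import numpy as np

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two non-negative histogram feature vectors (lower = more similar)."""
    h1 = np.asarray(h1, dtype=np.float64)
    h2 = np.asarray(h2, dtype=np.float64)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def min_max_normalize(distances):
    """Map a batch of raw distances to [0, 1]; 1 - normalized distance can serve as a similarity score."""
    d = np.asarray(distances, dtype=np.float64)
    return (d - d.min()) / (d.max() - d.min() + 1e-12)
```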

Table 3. Data distribution for the 18 experiments performed on the test set, as used by the organizers to evaluate the submitted methods.
Table 4. Details of the algorithms submitted to the IEEE WCCI VISOB 2.0 competition.

Anonymous Participant: The authors used a GoogleNet pre-trained on the ImageNet dataset to extract representation features. The Euclidean distance was employed to calculate the similarity between pairs of periocular images. The resulting distance scores were then used to train a Long Short-Term Memory (LSTM) model to predict whether a pair of images belongs to the same individual.
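How the distance scores are arranged into a sequence for the LSTM is not detailed in the submission; one plausible arrangement, sketched below with Keras, feeds the five per-frame Euclidean distances of a burst comparison to a small LSTM that outputs a genuine/impostor probability. The layer sizes are arbitrary illustrative choices, not the participant's configuration.

```python
# Hypothetical decision stage: per-frame Euclidean distances fed to a small LSTM classifier.
import numpy as np
import tensorflow as tf

def euclidean_scores(enroll_feats, probe_feats):
    """Per-frame Euclidean distances between two stacks of deep features (shape: frames x dims)."""
    return np.linalg.norm(np.asarray(enroll_feats) - np.asarray(probe_feats), axis=1)

def build_score_lstm(seq_len=5):
    """Tiny LSTM mapping a distance sequence to a same-person probability (sizes are illustrative)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(seq_len, 1)),
        tf.keras.layers.LSTM(16),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```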

4 Results and Discussion

Table 4 gives the details of the three algorithms submitted to the competition. Experiments were set up in a subject-independent (open-set-like) manner. All the algorithms consist of a feature extractor and a similarity-based matcher: the former extracts a feature representation of the image, and the latter computes a match score between two data samples (enrollment and verification). Two of the three submissions employed deep-learning-based approaches.

Fig. 2. GMR (%) at \(10^{-2}\), \(10^{-3}\), and \(10^{-4}\) FMR for the three submissions.

Table 5 shows the EER and AUC of the competition’s 18 experiments using the Note 4 and OPPO N1 challenge data for the three submitted algorithms (note that the OPPO N1 has a better camera). Figure 2 shows the GMRs at different FMRs, averaged over the 18 experiments. Team 1 outperformed the other two teams by a large margin. The best result obtained by team 1 for the Note 4 is 5.256% EER and 0.988 AUC in the 9th experiment (office versus office), as shown in the results table. For the OPPO N1, team 1 achieved its best performance under the dim-light versus dim-light condition, with 6.394% EER and 0.984 AUC. The three experiments with enrollment and verification under the same lighting condition (experiments 10, 14, and 18) generally obtained slightly better performance than the others, which implies that cross-illumination comparison degrades the performance of the model submitted by team 1.

As shown in Table 5, team 2 took second place in the competition. In contrast to team 1, team 2 relied on a non-deep-learning textural feature extractor, DTLBP, paired with a distance-based matcher. The lowest EER for team 2 was 27.05% for the Note 4 and 26.208% for the OPPO N1 device in the office versus office lighting setting. However, performance degraded significantly in the other experiments, with EERs fluctuating between 30% and 43%. It appears that the hand-crafted DTLBP features are not as robust to changes in illumination. Team 3’s model did not obtain satisfactory results in any of the experiments.

Table 5. EER and AUC of the 9 experiments for three submissions, Note 4 device.

5 Conclusion

Ocular biometrics is becoming an attractive alternative to face recognition in the mobile environment, especially due to the occlusion caused by masks worn during the COVID-19 pandemic. We organized the VISOB 2.0 competition at the IEEE WCCI 2020 conference to further advance the state of the art in ocular recognition, with a focus on multi-frame captures. We performed a thorough evaluation of the three ocular recognition algorithms submitted to our VISOB 2.0 challenge. The VISOB 2.0 dataset consists of stacks of five ocular images captured in burst mode using the front-facing cameras of two different smartphones. The test results show that the deep learning approach obtained better results in our more challenging subject-independent evaluation setting. Comparisons across different illumination settings adversely affected the performance of all three submissions. These results can serve as a reference for future research and development in multi-frame RGB ocular recognition.