
1 Introduction

Mobile devices, such as smartphones and tablets, have become an integral part of many people’s lives. Nowadays, personal information is transferred and processed, and various financial transactions are carried out, using mobile devices. Since these devices are personal, an authentication procedure is required.

Biometric authentication methods have been actively promoted to replace conventional schemes that use keys, personal identification numbers, etc. The growing interest in biometric technologies is associated mainly with stricter system security requirements and the demand for usability. A lot of attention has been paid to mobile biometrics in recent years [1,2,3].

The present paper is focused on the human iris as the most reliable biometric modality. The goal of iris recognition is to recognize a human’s identity through the textural characteristics of the muscular patterns of the iris. A typical iris recognition system consists of the following stages: iris image acquisition, iris image segmentation, feature extraction, and pattern matching [4]. Iris image acquisition is usually performed using a high-resolution camera, either near-infrared (NIR) or visible-spectrum (VIS), under controlled environmental conditions [5]. The minimal requirements for iris image capturing are summarized in ISO/IEC 19794-6:2011 [6].

Since the mobile market is global, all the possible behavioral- and race-dependent features of the end users must be taken into account. For this reason, in particular, only the NIR spectrum is considered in this paper. The advantages of using NIR images have been well explained in the literature [6,7,8,9]. It should be noted that the development and implementation of an iris capturing camera for mobile devices is outside the scope of the present paper. Most of the issues related to the iris capturing device are well summarized in [5, 7].

In the case of a mobile device, it is not always possible to satisfy all the aforementioned requirements imposed on the camera. There are two main reasons for this: the camera module must be small enough and inexpensive to manufacture, since production costs play a significant role in a mass market. Another challenge is that iris images are captured under uncontrolled environmental conditions. These factors greatly affect the quality of the iris image.

This paper describes a method of iris feature extraction and matching that is capable of working in real-time on a mobile device equipped with an NIR camera.

The rest of this paper is organized as follows: the key issues of iris recognition on a mobile device are explained in Sect. 2; Sect. 3 surveys related work; the proposed approach is presented in Sect. 4; and experimental results and conclusions are presented in Sects. 5 and 6 respectively.

2 Problem Statement

A mobile biometric sensor should be able to handle data under constantly changing environmental conditions and take user-inherent features into account. In biometric systems that use an image as input data, environmental factors become even more important. One of them is the ambient illumination, which varies from \(10^{-4}\) Lux at night to \(10^{5}\) Lux under direct sunlight. Another is the randomness of the locations of the light sources, along with their individual characteristics, which creates a random distribution of the illuminance over the iris area. These factors deform the iris structure by changing the pupil size, make users squint, and degrade the overall image quality. Several examples of iris images are given in Fig. 1. The influence of the environment is well described in the literature [10,11,12,13]. Other factors inherent to the user also affect the quality of the output: the use of glasses or contact lenses; hand tremor or the mere act of walking, which introduces shaking of the device; variation in the distance to the iris, which causes the iris to move out of the camera’s depth of field; and occlusion of the iris area by eyelids and eyelashes if the user’s eye is not opened wide enough [10]. All these and many other factors affect the quality of the input biometric data, thus influencing the accuracy of the recognition [14, 15].

Fig. 1 Examples of iris images captured with a mobile device

Mobile applications should be simple and convenient to use. In the case of a biometric system on a mobile device, being convenient means providing easy user interaction and a high recognition speed, which is determined by the computational complexity. Mobile secure systems that process personal data are even more limited in computational resources, a point to which few researchers have attached importance. Such systems typically take the form of a system-on-a-chip (SoC) that is completely abstracted from external resources and keeps all processing inside itself. They were initially developed to carry out simple operations on simple data (PINs, passwords, etc.), did not require large resources, and were not ready for the complex processing of biometric information, although they have continued to be improved. The restrictions of such systems usually mean an even lower CPU frequency and more limited RAM.

All these problems greatly complicate iris feature extraction, making most of the existing methods unreliable, and promising techniques such as deep neural networks (DNN) inoperable.

There are several commercial mobile iris recognition solutions known to date. The first smartphones enabled with the technology were introduced by Fujitsu [16] and Microsoft [17] in 2015. All Samsung flagship devices were equipped with iris recognition technology during 2016–2018 [18]. Some B2B and B2G applications of the technology are also known in the market, such as Samsung Tab Iris [19] and IrisGuard EyePay Phone [20]. The scope of the applications of this technology is growing and has brought about a demand for its further improvement.

The present paper is focused on the feature extraction and matching parts of the iris recognition pipeline. Feature extraction, in this case, means a numeric representation of the unique iris features extracted from the preliminarily determined iris area of the image. Matching means calculating a measure of dissimilarity between the two extracted feature vectors.

3 Related Work

Recent achievements in the field of deep learning have allowed a significant leap in the reliability and quality of research in biometrics and, in particular, in iris recognition. One of the first attempts to explore the capabilities of DNNs was a feasibility analysis by Minaee et al. of DNN embeddings trained on ImageNet for classification, with PCA+SVM applied over the VGG embeddings [21]. Furthermore, Gangwar et al. [22] introduced DeepIrisNet as a model combining all successful deep learning techniques known at the time; they thoroughly investigated the obtained features and produced a strong baseline as a robust foundation for future research. A year later, similar work based on these embeddings was introduced by Tang et al. [23]. At the same time, Proenca et al. [24] presented IRINA. The idea was to use a DNN to find corresponding patches in the examined images, an MRF to perform precise deformable registration, and an SVM to classify genuine and impostor data. They achieved unprecedented robustness to pupil/iris variations and segmentation errors, but the accuracy of the solution was traded off against performance, and the proposed design significantly limited the applicability of the method for mobile applications. Another approach, with two fully-convolutional networks and a modified triplet loss function, has been proposed recently [25]: one of the networks is used for iris template extraction whereas the second produces the accompanying mask. Fuzzy image enhancement combined with simple linear iterative clustering and an SOM neural network was proposed in [26]; although this method was designed for iris recognition on a mobile device, real-time performance was not achieved. Another recent paper [13], also aimed at the mobile scenario, proposed a two-headed (iris and periocular) CNN with a fusion of the embeddings. Thus, the published literature offers no fully optimal solution for iris feature extraction and matching.

4 Iris Feature Extraction and Matching

The proposed method is a CNN designed to exploit the advantages of the normalized iris image as an invariant representation, both low- and high-level discriminative feature representations, and information about the environment. It contains iris feature extraction and matching parts that are trained together.

4.1 Recognition Pipeline

A common iris recognition pipeline consists of several stages separated by intermediate quality checks. The feature extraction part is preceded by the segmentation stage and followed by the matching. All the input data for the feature extraction (normalized iris and mask images) were obtained automatically by an algorithm developed in our lab. The basic structure of the algorithm was taken from [10] with two modifications: (i) the scheme that contains a special quality buffer was replaced with a straightforward structure as depicted in Fig. 2; (ii) the feature extraction and matching parts were also replaced with the new ones. All the other parts of the algorithm and quality checks were used with no modifications.

Fig. 2 Quality assessment scheme

4.2 Low-Level Feature Representation

It is known from the literature [27, 28] that shallow layers in CNNs are responsible for the extraction of low-level textural information, while a high-level representation is achieved with depth. Methods of iris feature extraction based on local texture characteristics, which are calculated by spatially and spectrally local transformations [9, 29], are essentially attempts to use a low-level description of the texture. These methods have proven their reliability in scenarios with an almost unchanging environment, but are highly sensitive to environmental variations.

A normalized iris image allows element-wise textural features to remain useful when environmental conditions change only within a narrow range, since the features stay well aligned with each other in such a case. For this reason, iris recognition is a good example of a task for which the benefit of using low-level feature representations can be investigated in the context of CNN-based methods and a wide range of environmental changes.

The influence of shallow features on recognition performance in the context of CNNs is studied in this paper. A classic approach [9] using a Hamming-distance-based dissimilarity score was taken as the basis. The vector \(FV_{sh}\) with elements \(x_i\) is used as a representation of low-level discriminative features:

$$\displaystyle \begin{aligned} x_i = \frac{\sum\left(|FM^{Sq}_{1,i}-FM^{Sq}_{2,i}| \times M_c\right)}{\sum{M_c}}, \end{aligned} $$
(1)

where \(FM^{Sq}_{k,i}\) is the ith feature map of the kth iris after normalization to zero mean and unit variance, binarized by sign, and \(M_c\) is a binary mask representing noise, obtained as a combination of the masks \(M_1\) and \(M_2\).
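As an illustration, Eq. (1) can be computed directly from the binarized feature maps and the combined noise mask. The following NumPy sketch is illustrative only; all shapes and names are placeholders rather than the actual implementation:

```python
import numpy as np

def shallow_dissimilarity(fm1, fm2, mask):
    """Per-filter dissimilarity x_i from Eq. (1).

    fm1, fm2 : arrays of shape (n_filters, H, W) -- feature maps of the two
               irises after zero-mean/unit-variance normalization and sign
               binarization (values in {0, 1}).
    mask     : array of shape (H, W) -- combined binary noise mask M_c
               (1 = valid iris pixel in both images).
    Returns the vector FV_sh with one element per filter.
    """
    diff = np.abs(fm1 - fm2)                      # element-wise disagreement
    masked = diff * mask                          # ignore noisy pixels
    return masked.sum(axis=(1, 2)) / mask.sum()   # normalize by the valid area

# usage sketch with random binary maps (illustrative only)
fm1 = (np.random.randn(8, 64, 256) > 0).astype(np.float32)
fm2 = (np.random.randn(8, 64, 256) > 0).astype(np.float32)
mask = np.ones((64, 256), dtype=np.float32)
fv_sh = shallow_dissimilarity(fm1, fm2, mask)     # shape: (8,)
```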

The shallow feature extraction block is depicted in Fig. 3 and the structure of convolution block #1 is presented in Table 1. Depth-wise separable convolution blocks, first proposed in [30] as memory- and computationally-efficient structures, were chosen as the basic structural elements for the entire network. The feature maps \(FM^{Sq}_{1,i}\) and \(FM^{Sq}_{2,i}\) in (1) are obtained after the first convolution layer (Table 1).
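For reference, a generic depth-wise separable convolution block in the spirit of [30] might look as follows in PyTorch; the channel counts, kernel size, and stride are placeholders and do not reproduce the exact configuration given in Table 1:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    """Depth-wise separable convolution block in the spirit of [30].

    The channel counts, kernel size, and stride are placeholders, not the
    exact configuration from Table 1.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # per-channel spatial convolution
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch,
                                   bias=False)
        # 1x1 convolution mixing the channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```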

Fig. 3 Proposed model scheme

Table 1 Structure of the convolution blocks
Table 2 Datasets details

After 100 epochs of training, the distributions of the elements of \(FV_{sh}\) for genuine and impostor comparisons from the validation set appear as in Fig. 4. Although the filters vary considerably, the distributions look very similar. The shapes of the distributions for both classes resemble a Gaussian, therefore the d′ and EER values were chosen to evaluate their degree of separation. How the values for each filter changed during training is presented in Fig. 5. The results presented in Table 3 show that the model using \(FV_{sh}\) as an additive factor obtains slightly better results than the baseline model with 3×3 kernels on the first layer. It is also shown that for larger kernels the difference in performance becomes more significant (Table 3).
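Assuming d′ denotes the standard decidability index commonly used in iris recognition, it can be computed from the genuine and impostor score samples as in the minimal NumPy sketch below:

```python
import numpy as np

def decidability(genuine_scores, impostor_scores):
    """Decidability index d' between two score distributions, assumed here to
    be the standard definition: the distance between the class means
    normalized by the pooled standard deviation."""
    g = np.asarray(genuine_scores)
    i = np.asarray(impostor_scores)
    return abs(g.mean() - i.mean()) / np.sqrt(0.5 * (g.var() + i.var()))
```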

Fig. 4 Distributions of elements of \(FV_{sh}\) after 100 epochs

Fig. 5 The dependency of d′ (left) and EER (right) for each filter on the number of epochs

Table 3 Recognition performance results for several model modifications on IM dataset

4.3 Deep Feature Representation

The deep (high-level) feature representation is obtained with convolution block #2. The feature maps \(FM^{Sq}_{1,i}\) and \(FM^{Sq}_{2,i}\) are concatenated channel-wise after block #1 and passed through it (Fig. 3). The rationale for the concatenation at this stage is the invariance property of the normalized iris image. Experiments showed the advantages of this approach in comparison with standard techniques [31], where the feature vectors have highly reduced dimensionality; however, the large size of the vectors and the complexity of the matching procedure are among the drawbacks of this approach. The structure of the block is presented in Table 1. The output vector \(FV_{deep}\in R^{128}\) reflects a high-level representation of the discriminative features and is assumed to handle complex non-linear distortions of the iris texture.
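A minimal sketch of the channel-wise concatenation followed by a stand-in for convolution block #2 is given below; plain convolutions and placeholder layer sizes are used only to keep the example short and do not reflect the actual Table 1 configuration:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for convolution block #2 (layer sizes are placeholders,
# not the configuration from Table 1).
conv_block_2 = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 128),                   # FV_deep in R^128
)

fm1 = torch.randn(1, 8, 64, 256)          # block #1 output for iris 1 (placeholder shape)
fm2 = torch.randn(1, 8, 64, 256)          # block #1 output for iris 2
fused = torch.cat([fm1, fm2], dim=1)      # channel-wise concatenation -> 16 channels
fv_deep = conv_block_2(fused)             # shape: (1, 128)
```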

4.4 Matching Score Calculation

The analysis of outliers, along with the nature of the distributions of the elements of \(FV_{sh}\), gave rise to the idea of using a variational inference technique for regularization. This means that some vectors are represented as n-dimensional random variables with distributions of a certain shape. In the present paper, the representations of both the \(FV_{sh}\) and \(FV_{deep}\) vectors are described as having multivariate normal distributions \(FV^{\prime }_{sh}\sim N(\mu _{sh},\Sigma _{sh})\) and \(FV^{\prime }_{deep}\sim N(\mu _{deep},\Sigma _{deep})\) respectively, where μ is the vector of mean values and Σ is the covariance matrix. Variational inference is performed with the so-called re-parametrization trick described in [32]. Sampling from the distributions is performed only during training, while only the values of μ are used for inference. A sigmoid activation function is then applied to the result. The same procedure is further performed for the concatenation of the vectors \(FV^{\prime }_{sh}\), \(FV^{\prime }_{deep}\) and \(FV_{env}\). Here, \(FV_{env}\) reflects the environmental conditions and contains information about the iris area and pupil dilation: \(FV_{env}=\left \lbrace {\Delta {NPR},AOI}\right \rbrace \), where the area of intersection is \(AOI=\Sigma {M_c}/(M^h_c\times M^w_c)\) and ΔNPR is given by

$$\displaystyle \begin{aligned} \Delta{NPR}=\left|\frac{R^p_1}{R^i_1}-\frac{R^p_2}{R^i_2}\right| \end{aligned} $$
(2)

where \(R^p\) and \(R^i\) are the radii of the pupil and the iris, respectively. The output vector \(FV^{\prime }_d\in R^{128}\) is the input for the last fully-connected layer, with two nodes describing the classes. A SoftMax classifier is applied to the values from the nodes for probability (matching score) estimation.
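A rough sketch of how such a variational layer and the subsequent score head could be implemented is shown below; a diagonal covariance and placeholder dimensions are assumed, so this is an illustration rather than the exact model:

```python
import torch
import torch.nn as nn

class VariationalLayer(nn.Module):
    """Variational-inference layer sketch (diagonal covariance assumed).

    During training the output is sampled with the re-parametrization trick
    [32]; at inference time only the mean is used. A sigmoid is applied to
    the result, as described in the text. All dimensions are placeholders.
    """
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, out_dim)
        self.logvar = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        mu = self.mu(x)
        if self.training:
            std = torch.exp(0.5 * self.logvar(x))
            mu = mu + std * torch.randn_like(std)   # re-parametrization trick
        return torch.sigmoid(mu)

# Downstream sketch: FV'_sh and FV'_deep are concatenated with FV_env
# (delta-NPR and AOI), passed through another variational layer producing
# FV'_d in R^128, and finally through a 2-node fully-connected layer + SoftMax.
vi_sh, vi_deep = VariationalLayer(8, 8), VariationalLayer(128, 128)
vi_d = VariationalLayer(8 + 128 + 2, 128)
classifier = nn.Linear(128, 2)

fv_sh, fv_deep, fv_env = torch.randn(1, 8), torch.randn(1, 128), torch.randn(1, 2)
fv_d = vi_d(torch.cat([vi_sh(fv_sh), vi_deep(fv_deep), fv_env], dim=1))
score = torch.softmax(classifier(fv_d), dim=1)      # matching-score estimate
```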

According to the obtained results (Table 3), the application of variational inference (VI) improved the recognition performance (VI = No means the replacement of the VI structure with simple fully-connected layers of the same dimensionality and activations). It is worth mentioning, however, that the benefit of VI diminishes as the amount of training data increases.

4.5 Weighted Loss

A specially designed loss function is another proposed feature. Sometimes two images of the same iris are very different from each other. This can happen for various reasons: different parts of the iris can be occluded by noise, one of the images can be badly distorted due to segmentation errors, etc. Thus, it is almost impossible to attribute them to the same class, and for this reason a certain portion of the genuine comparisons in the training data obstructs the convergence of the model. It is therefore reasonable to down-weight or even completely ignore these comparisons during training. The following algorithm is proposed: (i) calculate the loss function (e.g., cross-entropy) for each comparison in the batch; (ii) apply weights \(\{w_0,\ldots,w_K\}\) to the top k highest values among the genuine matches; (iii) output the overall sum. In this paper, the values were set to weights = 0 and k = 10%. This approach provided better convergence and achieved a better recognition performance.
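A possible implementation of this weighting scheme (with the paper’s setting of zero weight for the hardest 10% of genuine comparisons) is sketched below; the function and tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def weighted_loss(logits, labels, k_frac=0.10, top_weight=0.0):
    """Loss sketch: down-weight the hardest genuine comparisons in a batch.

    logits : (N, 2) class scores; labels : (N,) with 1 = genuine, 0 = impostor.
    The k_frac fraction of genuine pairs with the highest loss (the pairs that
    are hardest to attribute to the genuine class) get weight `top_weight`
    (0 in the paper, i.e. they are ignored); all other pairs keep weight 1.
    """
    losses = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.ones_like(losses)
    genuine_idx = (labels == 1).nonzero(as_tuple=True)[0]
    k = int(k_frac * genuine_idx.numel())
    if k > 0:
        hardest = genuine_idx[losses[genuine_idx].topk(k).indices]
        weights[hardest] = top_weight
    return (losses * weights).sum()

# usage sketch
logits = torch.randn(64, 2)
labels = torch.randint(0, 2, (64,))
loss = weighted_loss(logits, labels)
```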

4.6 Multi-instance Iris Fusion

The input images may contain both eyes, as depicted in Fig. 6. In this case both irises can be used at the same time [33], which is an obvious way to increase the reliability and convenience of the recognition. It has been observed that at least 40% of the iris area should be visible to achieve the given accuracy level. In other words, the user should open the eyes wider during one-eye recognition, which is not always convenient. Often the iris is significantly occluded by the eyelids, eyelashes, highlights, etc., mainly because of a complex environment in which the user cannot open the eyes wide enough (bright illumination, windy weather, etc.). This makes the application of the multi-instance iris approach reasonable.

Fig. 6 Examples of the images captured with the mobile device equipped with an NIR camera

An ideal scenario for matching is when the two compared irises are well aligned with each other spatially and the capturing conditions are the same in both cases [15, 34]. This is impossible to satisfy in practice, especially in the mobile case, but it is reasonable to use information about the initial relative position of the compared irises before normalization. A method that performs the fusion of the two irises and uses the relative spatial information, along with several factors describing the environment, is also considered an important part of the presented research.

The final dissimilarity score is calculated as a logistic function of the form

$$\displaystyle \begin{aligned} score=\frac{1}{1+e^{-\sum w_i\cdot M_i}} \end{aligned} $$
(3)

where \(M \in R^7\) is the set of the following measures:

$$\displaystyle \begin{aligned} M=\left \{\Delta d_{0}, d_{avg}, AOI_{min}, AOI_{max}, \Delta ND_{min}, \Delta ND_{max}, \Delta PIR_{avg} \right \} \end{aligned} $$
(4)

where

  • \(\Delta d_0\) is the normalized score difference for the two pairs of irises,

    $$\displaystyle \begin{aligned} \Delta d_{0}=\frac{\left|d_0^{left}-d_0^{right}\right|}{d_0^{left}+d_0^{right}} \end{aligned} $$
    (5)
  • \(d_{avg}\) is the average score for the pair,

    $$\displaystyle \begin{aligned} d_{avg}=0.5\cdot(d_0^{left}+d_0^{right}) \end{aligned} $$
    (6)

  • \(AOI_{min}\), \(AOI_{max}\) are the minimum and maximum values of the area of intersection between the two binary noise masks \(M_{prb}\) and \(M_{enr}\) in each pair,

$$\displaystyle \begin{aligned} AOI=\Sigma{M_c}/(M^h_c\times M^w_c), \; M_c=M_{prb} \times M_{enr} \end{aligned} $$
(7)

  • \(\Delta ND_{min}\), \(\Delta ND_{max}\) are the minimum and maximum values of the normalized distance \(\Delta ND\) between the centers of the pupil and the iris,

$$\displaystyle \begin{aligned} \Delta{ND}=\sqrt{(NDX_{prb}-NDX_{enr})^2+(NDY_{prb}-NDY_{enr})^2} \end{aligned} $$
(8)
$$\displaystyle \begin{aligned} NDX=\frac{x_{P}-x_{I}}{R_{I}}, NDY=\frac{y_{P}-y_{I}}{R_{I}}, \end{aligned} $$
(9)

where \(x_P\) and \(y_P\) are the coordinates of the center of the pupil and \(R_P\) is its radius, while \(x_I\) and \(y_I\) are the coordinates of the center of the iris and \(R_I\) is its radius, as depicted in Fig. 7.

Fig. 7 Parameters of the pupil and iris used for the iris fusion

  • \(\Delta PIR_{avg}\) reflects the difference in pupil dilation between the enrollment and the probe, using the value \(PIR = R_P/R_I\):

$$\displaystyle \begin{aligned} \Delta{PIR_{avg}}=0.5\cdot\left(\left|PIR^{left}_{enr}-PIR^{left}_{prb}\right|+\left|PIR^{right}_{enr}-PIR^{right}_{prb}\right|\right) \end{aligned} $$
(10)

where \(R_P\) and \(R_I\) are the radii of the pupil and the iris, respectively.

The weight coefficients for the logistic function were obtained after the training of the classifier on genuine and impostor matches on a small subset of the data. In case only one out of two feature vectors is extracted, all the pairs of values used in the weighted sum are assumed to be equal.
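For illustration, the fusion of Eqs. (3)–(10) could be assembled as follows; the weight values shown are placeholders, not the coefficients obtained from the trained logistic classifier:

```python
import numpy as np

def pairwise_measures(d_left, d_right, aoi_l, aoi_r, dnd_l, dnd_r, dpir_l, dpir_r):
    """Assemble the measure vector M of Eq. (4) from per-eye quantities.

    d_*    : per-eye dissimilarity scores d_0
    aoi_*  : per-eye areas of intersection, Eq. (7)
    dnd_*  : per-eye normalized center distances, Eq. (8)
    dpir_* : per-eye pupil-to-iris ratio differences |PIR_enr - PIR_prb|
    If only one eye is available, the missing values are assumed equal to
    the available ones, as described in the text.
    """
    delta_d0 = abs(d_left - d_right) / (d_left + d_right)      # Eq. (5)
    d_avg = 0.5 * (d_left + d_right)                           # Eq. (6)
    dpir_avg = 0.5 * (dpir_l + dpir_r)                         # Eq. (10)
    return np.array([delta_d0, d_avg,
                     min(aoi_l, aoi_r), max(aoi_l, aoi_r),
                     min(dnd_l, dnd_r), max(dnd_l, dnd_r),
                     dpir_avg])

def fusion_score(measures, weights):
    """Logistic fusion of Eq. (3)."""
    return 1.0 / (1.0 + np.exp(-np.dot(weights, measures)))

# placeholder weights -- the real coefficients come from training the logistic
# classifier on genuine/impostor matches from a small subset of the data
w = np.array([1.0, 4.0, -1.0, -1.0, 0.5, 0.5, 2.0])
m = pairwise_measures(0.30, 0.34, 0.55, 0.62, 0.03, 0.05, 0.02, 0.04)
print(fusion_score(m, w))
```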

The proposed method helped to increase the recognition accuracy. It also allowed the threshold for the visible iris area to be decreased from 40% to 29% during verification/identification without any loss in accuracy or performance, which means a decreased overall FRR as a result.

A comparison of the proposed method with the well-known consensus and minimum rules was carried out. According to the consensus rule, a matching is considered successful if both \(d_0^{left}\) and \(d_0^{right}\) are less than the decision threshold. In the minimum rule, it is required that the minimum of the two values, \(\min (d_0^{left}, d_0^{right})\), be less than the threshold. The testing results are presented in Table 7.

5 Experimental Results

The main objectives of the biometric system performance evaluation include assessing the progress in improving the accuracy during the development of the algorithms and providing an objective reflection of the performance when the system is in operation [35]. To meet these goals, two types of evaluation were conducted: (i) an image-to-image evaluation of the proposed feature extraction and matching method with state-of-the-art methods on several datasets, including publicly available ones; (ii) a video-to-video evaluation to simulate real-world usage of the whole iris recognition solution and test the proposed multi-instance iris fusion approach.

5.1 Image-to-Image Evaluation

Three different datasets were used for the comparison. The following methods were selected as state-of-the-art: (1) FCN+ETL, proposed by Zhao and Kumar in [25], which is one of the most cutting-edge solutions with the highest recognition performance; (2) DeepIrisNet [22], representing a classic deep neural network approach and one of the earliest applications of deep learning in the field of iris recognition. The lightweight CNN recently proposed in [13] can also be used for comparison, since its results were obtained on the same dataset; refer to the original paper [13] for the results on the CASIA-Iris-M1-S3 dataset [36].

Many methods were excluded from consideration due to their computational complexity and therefore unsuitability for mobile applications.

5.1.1 Dataset Description

The following datasets were used for training and evaluation: CASIA-Iris-M1-S2 (CMS2) [36], CASIA-Iris-M1-S3 (CMS3) [36], and Iris-Mobile (IM). The last one was collected privately using a mobile device with an embedded NIR camera to simulate real authentication scenarios of a mobile device user. The images were captured under a wide range of illumination changes, both indoors and outdoors, with and without glasses (Table 2). Examples of the images are presented in Fig. 1. Images from all the datasets were labeled automatically by an algorithm developed in our lab. Examples of iris and mask images are presented in Fig. 3. Each dataset was initially divided into training, validation, and testing subsets in proportions of 70/10/20 (%) respectively, such that no images of the same iris appeared in two different subsets.

5.1.2 Training

The number of genuine comparisons \(N_G\) was much smaller than the number of impostor comparisons; therefore, all genuine comparisons were used for training and the number of impostor comparisons was fixed at \(N_I = 10N_G\). The model that showed the lowest EER on the validation set was selected for evaluation on the testing subset. All the models were trained for 150 epochs using the Adam optimizer. The training of the proposed model was organized so that one epoch was equivalent to one iteration over all the genuine comparisons, whereas the impostors were always randomly selected from the entire set for each batch. The proportion of genuine and impostor comparisons in a batch was set to \(N^b_I=10N^b_G\), and AOI ≥ 0.2 was required for all the image pairs.
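A minimal sketch of the described batch composition (one epoch covering all genuine pairs, with impostors re-sampled at a 10:1 ratio) might look as follows; the batch size is a placeholder:

```python
import random

def epoch_batches(genuine_pairs, impostor_pairs, n_genuine_per_batch=32):
    """One epoch = one pass over all genuine comparisons; for every batch the
    impostor comparisons are re-sampled from the entire impostor set at a
    10:1 impostor-to-genuine ratio."""
    order = random.sample(genuine_pairs, len(genuine_pairs))   # shuffled copy
    for start in range(0, len(order), n_genuine_per_batch):
        genuine = order[start:start + n_genuine_per_batch]
        impostor = random.sample(impostor_pairs, 10 * len(genuine))
        yield genuine + impostor
```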

5.1.3 Performance Evaluation

The recognition accuracy results are presented in Table 4 and Fig. 8. The proposed feature extraction and matching method outperforms the chosen state-of-the-art ones on all the datasets. Since the number of comparisons for the CMS2 and CMS3 testing sets did not exceed 10 million after the division into subsets, it was impossible to estimate the FNMR at FMR = \(10^{-7}\). Another experiment was therefore used to estimate the performance of the proposed model on those datasets without training on them: the model trained on IM was evaluated on the entire CMS2 and CMS3 datasets in order to obtain the FNMR at FMR = \(10^{-7}\) (CrossDB). The results presented in Table 4 and Fig. 8 demonstrate the high generalization ability of the model. However, it is fair to note that the IM dataset contains much more data than the other two.

Fig. 8 ROC curves obtained for comparison with state-of-the-art methods on different datasets: (a) CASIA-Iris-M1-S2 (CMS2), (b) CASIA-Iris-M1-S3 (CMS3) and (c) Iris Mobile (IM)

Table 4 Recognition performance evaluation results

A mobile device equipped with a Qualcomm Snapdragon 835 CPU was used to estimate the overall execution time of the compared iris feature extraction and matching methods; it should be noted that only a single CPU core was used. The results are summarized in Table 4.

5.2 Video-to-Video Evaluation

In practice, both the registration and the verification procedures involve the processing of not just one image but a sequence of images. The video format gives more information about the possible behavior of the user and the environment. Unfortunately, no such iris datasets are publicly available, so, in order to test the recognition performance on data close to real-world scenarios, an additional dataset was collected privately. It is a set of two-second video sequences, each of which represents a real enrollment/verification attempt.

5.2.1 Dataset Description

The dataset was collected using a mobile device with an embedded NIR camera. It contains videos captured in different environments: (i) indoors (IN) and outdoors (OT); (ii) with and without glasses; (iii) at different distances. The illumination conditions during capturing were as follows: (i) three levels for the indoor samples (0–30, 30–300 and 300–1000 Lux); (ii) a random value in the range 1–100 K Lux for the outdoor samples (the data was collected on a sunny day). Different positions of the device relative to the sun were also considered during capturing. A detailed description of the dataset is presented in Table 5. The Iris Mobile (IM) dataset used for the image-to-image evaluation was also randomly sampled from these videos. Examples of the pictures from the videos are depicted in Fig. 6.

Table 5 Dataset specification

5.2.2 Testing Procedure

All the video sequences were used for simulating both the enrollment and verification transactions (attempts) in the non-glasses (NG) case. The sequences captured for users wearing glasses (G) were used for simulation of the verification attempts only. Each video sequence is considered as a single attempt. The extracted probe/enrollment template is the result of a successful attempt and may contain a maximum of 60 (30 frames × 2 eyes) iris feature vectors. The successful construction of the feature vector means passing all the intermediate quality checks in the recognition pipeline.

The testing procedure consists of the following steps:

  1. Passing all the videos that satisfy the condition IN&NG through the feature extraction to produce the enrollment templates. A template is considered successfully created if the following requirements are met:

     a. at least 5 feature vectors were constructed for each eye;

     b. at least 20 out of 30 frames were processed.

  2. Passing all the videos through the feature extraction to produce the probe templates. A probe template is considered successfully created if at least 1 feature vector is constructed.

  3. Creating a pair-wise matching table of the dissimilarity scores for all the comparisons: each probe template is compared with all enrollment templates except the ones generated from the same video.

  4. Calculating the measures: FTE, FTA, FNMR(FMR) and FRR(FAR).

One important difference between the enrollment and verification procedures lies in the values of the following thresholds: (i) the normalized eye opening (NEO) value, described in [10], was set to 0.5 for the enrollment and 0.2 for the verification; (ii) the non-masked area of the iris (not occluded by any noise) was set to 0.4 and 0.29 for the enrollment and the probe, respectively.

5.2.3 Performance Evaluation

The recognition accuracy results are presented in Table 6. The proposed feature extraction and matching method is compared with the one proposed in [10] as part of the whole pipeline. The compared method is based on Gabor wavelets with an adaptive phase quantization technique (Gabor+AQ). Both methods were tested in three different verification environments: indoors without glasses (IN&NG), indoors with glasses (IN&G), and outdoors without glasses (OT&NG). The enrollment was always carried out indoors without glasses and, for this reason, the value of FTE = 3.15 is the same for all the cases. The target FMR = \(10^{-7}\) was set in every experiment.

Table 6 Recognition accuracy in different verification conditions

Applying different matching rules was also investigated. The proposed multi-instance fusion showed advantages over the other compared rules (Table 7).

Table 7 Recognition accuracy for different matching rules

The overall execution time for the whole pipeline was measured on a single core of a Qualcomm Snapdragon 835 CPU and was 55 milliseconds, which corresponds to about 18 FPS, i.e., real-time performance.

6 Conclusion

A novel approach to iris feature extraction and matching was proposed in this paper. It showed robustness to the high variability of the iris representation caused by changes in the environment and by physiological features of the iris itself. The benefit of using shallow textural features, feature fusion, and variational inference as a regularization technique was also investigated in the context of the iris recognition task. Another feature of the proposed solution is its multi-instance iris fusion, which helps to increase the performance when the input image contains both eyes at the same time. The proposed solution was tested in the video-to-video scenario and showed its ability to work in real time in an uncontrolled environment. Although it showed high accuracy indoors, outdoor recognition remains challenging.