1 Introduction

In the current digital era, smartphones and mobile devices are ubiquitous. With the growth of smartphone usage, people store enormous amounts of personal and confidential information on their devices, which demands suitable security mechanisms. Traditional security measures include passwords, patterns, and PINs. However, these methods must be memorized by the users and are vulnerable to shoulder-surfing attacks [1]. Alternatively, biometric-based user authentication is now more popular and requires minimal effort from the users.

Fig. 2.1 Acquisition sensors and their corresponding captured modalities

As illustrated in Fig. 2.1, modern smartphones have multiple sensors that can facilitate user authentication. For instance, cameras can be used to capture face selfies [2] and finger-selfies, while fingerprint sensors can be used to acquire fingerprints. Researchers and commercial entities have explored the usability of all three modalities, each of which poses certain advantages and constraints. For instance, traditional fingerprints are accurate but require the installation of additional capacitive sensors [3]. Face selfies are easy to capture, but they may be affected by several external factors. Similarly, finger-selfies do not need any additional sensors, but the technology requires more research to demonstrate its effectiveness. This chapter focuses on the finger-selfie, presenting a review of the research efforts related to improving the usability and accuracy of finger-selfie recognition.

Fig. 2.2 An illustration of the acquisition mechanism of a finger-selfie and the corresponding finger-selfie

As shown in Fig. 2.2, finger-selfie acquisition involves the user capturing the ridge-valley details present on the tip of the finger using the device camera. Overcoming a drawback of traditional fingerprint acquisition, a finger-selfie does not require an additional sensor; all it needs is the smartphone’s built-in camera. As per Tim Ahonen’s Phone Book [4] and Statista [5], approximately 89% of all digital photographs are taken with handheld devices such as tablets and smartphones. While these statistics motivate the use of finger-selfies as a cost-effective method for authentication, there are other advantages as well. Finger-selfies act as a contactless fingerprint acquisition technique, which is hygienic and secure, leaving no latent impressions on the surface of the sensor. Compared with flattened livescan fingerprints, finger-selfies also contain additional information such as finger shape and phalanx lines. While these lines may not have global uniqueness, a localized correlation with ridge-valley patterns in the neighborhood may aid person identification [6].

Fig. 2.3 Visual difference between a finger-selfie and a legacy fingerprint image

Other than authentication for device unlocking, law enforcement agencies have also shown interest in finger-selfies. For instance, after finding a photograph of a potential drug dealer holding drugs in his fingers, the South Wales Police and their scientific support unit used the visible finger-selfie to identify the culprit [7]. Similarly, a hacker used an image of a German minister’s finger, acquired from a distance of three meters, to generate fingerprints [8]. Such use cases highlight the need for finger-selfie-based recognition systems.

Fig. 2.4 Sample finger-selfie images from the proposed UNFIT database. While the database incorporates numerous challenges, a real-life unconstrained acquisition of finger-selfies might contain one or more challenges together, making finger-selfie recognition a complex problem. Varying camera resolutions add to the challenges of finger-selfie recognition

Turning to the other side of the coin, finger-selfie-based user authentication is not perfect either. As illustrated in Fig. 2.3, a finger-selfie looks drastically different from a traditional fingerprint, with skin and background visible along with the ridge-valley details. While its acquisition requires minimal effort from the user, a lack of cooperation might introduce many challenges. Unlike capturing face selfies, acquiring a good-quality finger-selfie may not be a trivial task, and the captured finger-selfie might exhibit several variations such as illumination changes, in- and out-of-plane rotations, blur, and occlusion. Users might even present multiple fingers in the same frame. A summary of these challenges is illustrated in Fig. 2.4. Since these challenges reflect a real-life unconstrained acquisition scenario, detection and recognition of such finger-selfies for smartphone authentication become a cumbersome task.

To promote unconstrained finger-selfie-based recognition, this chapter first provides a review of existing research on finger-selfies, followed by finger-selfie-based authentication in an unconstrained environment. This research is inspired by our preliminary work, which showcased the application of finger-selfies in an unconstrained environment [9]. The important research contributions of this chapter are:

  1.

    A review of existing databases utilized in the literature for finger-selfie/image/photograph-based recognition and a detailed summary of existing approaches for finger-selfie recognition are discussed.

  2.

    A novel, publicly available UNconstrained FIngerphoTo (UNFIT) database captured under challenging unconstrained conditions. The database contains manual annotations of identity and finger location for 3450 images from 115 subjects.

  3.

    A segmentation algorithm to segment finger regions from a finger-selfie using the existing VGG SegNet [10] model. The performance of the segmentation algorithm is compared with other segmentation methods such as FCN 8 [11]. We show that existing deep learning algorithms for segmentation can easily outperform the traditional skin color-based segmentation [12] methods used in the literature.

  4.

    Finally, recognition of the segmented finger is performed. Feature extraction and matching are benchmarked using CompCode [13] and ResNet50 [14], followed by Hamming distance and cosine similarity, respectively. Experimental results show that despite the multiple challenges present in the UNFIT database, finger-selfie-based biometric authentication is feasible and pragmatic.

2 Related Work

Recent studies have demonstrated the use of fingerphotos/contactless fingerprints acquired using smartphones and other digital cameras for benchmarking contactless fingerprint recognition. However, a significant limitation of these studies is the use of constrained or semi-constrained fingerphoto datasets. A summary of the datasets is presented in Table 2.1, and their details are given below.

Table 2.1 Literature review of existing databases of contactless fingerprints/fingerphotos

2.1 Existing Databases

Several researchers have designed algorithms and shown results on contactless fingerprint recognition. However, a significant limitation of research on finger-selfie recognition is the unavailability of public datasets. While four of the datasets are publicly available, they include only one or two variations and lack the challenging acquisition scenarios commonly present in finger-selfies. A summary of these datasets is presented below.

2.1.1 Publicly Available Databases

As illustrated in Table 2.1, there exist databases for contactless fingerprints; however, for benchmarking and algorithmic evaluation, only the following databases are publicly available in the research community:

  • HKPU Low-Resolution Fingerprint Database [6]: The database has a total of 1566 low-resolution contactless fingerprint images from 156 subjects. The contactless fingerprints are acquired using a webcam in two different sessions. While the database is acquired at a low resolution, it incorporates no other challenge during acquisition. Hence, the database can be termed as semi-constrained.

  • IIITD Smartphone Fingerphoto Database [12]: In 2015, Sankaran et al. proposed this database, containing 4096 fingerphoto images from 64 participants acquired using a smartphone camera. The database also includes 1024 livescan images to promote matching of fingerphoto with legacy fingerprint databases. The subsets of the database include varying background and illumination. Hence, this database can also be considered as semi-constrained.

  • PolyU Contactless to Contact-based Fingerprint Database [27]: Recently, Lin and Kumar proposed a constrained dataset, with 1800 contactless fingerprint samples from 300 different fingers. While the images of fingers were acquired in a constrained setting, the database aimed to establish the matching of contactless fingerprints with contact-based livescan fingerprints. Hence, the database also includes 1800 contact-based livescan images.

  • Other than the databases mentioned above, Taneja et al. [26] proposed a Spoofed Fingerphoto Database, which aimed to establish the effect of spoofing fingerphotos using display and print attacks. This database was created using fingerphotos taken from the IIITD Smartphone Fingerphoto Database [12].

Using the in-house and publicly available touchless fingerprint databases, researchers have demonstrated benchmarking results of their proposed algorithms. A summary of these algorithms is presented below.

2.2 Finger-Selfie Recognition Techniques

For touchless fingerprint recognition, Song et al. [15] used only the blue channel information of finger images. They utilized mean and coherence for segmentation and Gabor filters to enhance ridge details. Their results were illustrated visually on a touchless fingerprint image. In 2006, Lee et al. [16] performed segmentation by combining a normalized color (RB) model and frequency information extracted using the Tenengrad method. Minutiae were extracted from the segmented image, following which the authors reported about 80% GAR at 0.01% FAR. In 2008, Lee et al. [17] aimed at focus estimation by estimating blur. They also used coherence and symmetry for quality estimation and frame differencing (contour extraction) for pose estimation. On the Samsung Databases (SDB) I, II, III, and IV, comprising 60, 30, and 30 image sequences and 1200 fingerprint images, respectively, the authors reported a rejection rate of 5.67% and an EER of 3.02%.

Piuri and Scotti [18] performed blur reduction using the Lucy–Richardson and Wiener filter algorithms, followed by color model and morphology-based segmentation. After performing fingerphoto registration, enhancement, and minutiae extraction using MINDTCT, the authors reported an EER of 0.042% for 150 images. Hiew et al. [19] utilized Gabor features, followed by PCA and an SVM for verification, and reported an EER of 1.23%. In 2011, while proposing a publicly available dataset, Kumar and Zhou [6] performed enhancement by Sobel filtering and area thresholding on the acquired image, followed by Gaussian sharpening. Using LRT and CompCode features followed by Hamming distance, the authors reported a cross-session EER of 3.95% with 93.97% accuracy on the proposed dataset. In the same year, Derawi et al. [20] performed feature extraction and matching using COTS and reported EERs of 0.00–23.62% for different fingers on their in-house database.

Yang et al. [21,22,23] utilized their semi-constrained database with 2100 samples for quality assessment of fingerprint images captured with a smartphone camera. They defined a total of seven [21] and twelve [22] quality metrics to determine the quality of a contactless fingerprint image. Using the same dataset, Raghavendra et al. [23] performed mean shift clustering to segment the probable finger regions. The final finger is detected from the top five largest regions using a fusion of Pearson, Fourier magnitude, and wavelet-transform-based energy measures. They reported an average segmentation accuracy of 96.46%. Using NBIS MINDTCT for minutiae extraction followed by matching, the authors reported an EER of 3.74%. In 2013, Stein et al. [24] performed spoof detection, followed by minutiae extraction and matching. The authors reported 1.20% EER for contactless fingerprints and 3.00% EER for finger videos. Tiwari and Gupta [25] located the ROI in fingerphotos by adaptive thresholding followed by morphological operations. They aligned the image using PCA, followed by image enhancement using adaptive histogram equalization. Using SURF features, the authors reported an EER of 3.33% on their in-house database.

In 2015, Sankaran et al. [12] created the IIITD Smartphone Fingerphoto Database and proposed fingerphoto-to-fingerphoto and fingerphoto-to-livescan matching algorithms. With segmentation performed using adaptive thresholding, the authors applied image sharpening and median filtering to enhance the image [28]. From the enhanced image, ScatNet features were extracted, followed by PCA and matching using an RDF classifier. On the proposed semi-constrained dataset, the authors reported EERs of 3.65–7.45% on different subsets of fingerphoto-to-fingerphoto matching and 7.07–10.43% for fingerphoto-to-livescan matching. Later, in 2017, Malhotra et al. [29] further improved the state-of-the-art performance on the IIITD Smartphone Fingerphoto Database. Using an LBP-based enhancement, the authors reported EERs of 1.47–8.36% on different subsets of fingerphoto-to-fingerphoto matching and 6.44–7.61% for fingerphoto-to-livescan matching. Recently, Lin and Kumar [27] proposed a livescan and contactless fingerprint image database. To align the contactless images with livescan images, the authors proposed an RTPS-based fingerprint deformation correction model. By performing minutiae- and ridge-based matching, the authors reported a rank-1 accuracy of 94.11% using their proposed algorithm.

While these algorithms have shown good accuracies and low error rates, their performance has not been evaluated in a real-life scenario of unconstrained finger-selfie recognition. A primary reason is the absence of an unconstrained finger-selfie database. To address this concern and to promote finger-selfie recognition in an uncontrolled scenario, we present UNFIT, an unconstrained fingerphoto database, in the next section.

3 UNconstrained FingerPhoto (UNFIT) Dataset

In Sect. 2.2.1.1, we highlighted publicly available databases for contactless fingerprint recognition. While these datasets have an ample number of samples, the samples are acquired in a constrained or semi-constrained environment. In this research, we create the first unconstrained fingerphoto (UNFIT) database and make it available to the research community (Footnote 1). The database incorporates the many challenges that would be present in a finger-selfie acquired in an uncontrolled environment with minimal user cooperation. The details of the dataset are presented below.

3.1 Database Acquisition

Forty-five different smartphones belonging to the subjects are used to capture finger-selfies. This brings variations in terms of resolution and camera sensor to the database. OnePlus and iPhone devices are used to acquire 48% of the images in the database, followed by other phones including Redmi devices, Google Nexus, Lenovo K3 Note, Lenovo K4, Mi 4, Le 1s, Samsung Galaxy, Micromax Canvas, Moto G, Moto C, Moto M, and HTC devices. The camera resolutions of these smartphones vary from 8 to 16 MP. The distribution of different smartphone devices used for finger-selfie acquisition can be seen in Fig. 2.5a.

Fig. 2.5 Acquisition details: a Devices used for finger-selfie acquisition, and b Offline and online mechanisms used for obtaining finger-selfies

The database is collected via both online and offline methods, which helps incorporate the effect of image compression due to transmission. WhatsApp, Telegram, Google Drive, Gmail, and Facebook Messenger are used for online data collection, whereas for offline data collection, different phones belonging to the subjects are used, followed by transmission via a pen drive. Figure 2.5b shows the distribution of images collected using the different modes of online and offline data collection. In addition, variations in illumination, intensity, and blur are present in the database due to the optional usage of auto-focus and flash while acquiring finger-selfies.

During database acquisition, no constraints are enforced on the distance of the finger from the camera sensor. Varying distance allows the presence of more challenges, such as position and scale variation. However, the amount of ridge-valley detail captured remains limited by the camera sensor. The minimum and maximum distances for a focused, detailed acquisition depend upon the camera’s aperture and the lens’s focal length. With 45 different smartphones used to obtain finger-selfies, the aperture and focal length vary across devices. Hence, a generic claim about the minimum and maximum distance for a focused image cannot be made. Thus, varying sensors, lenses, finger distance, illumination, and background variations make locating, segmenting, and recognizing ridge-valley details in the finger challenging.

3.2 Database Statistics

Over a span of three months, we collated a novel finger-selfie database consisting of 3450 images, termed the UNconstrained FIngerphoTo (UNFIT) database. The database has multiple images of the index and middle fingers of each subject, where the two fingers of the same participant are considered different classes. We refrained from acquiring thumb finger-selfies since capturing the frontal region of the thumb while holding a phone facing downward in the other hand is inconvenient for subjects. During acquisition, the participants are allowed to use either hand for capturing the finger-selfies, as long as all the finger-selfies arise from the same hand. The database contains 230 different classes belonging to 115 participants, of whom 38 are female and 77 are male. The details of the database can be seen in Table 2.2. Figure 2.6 exhibits some sample images from the database. Two different sets of finger-selfies are collected from each subject:

Table 2.2 A summary of various subsets present in the UNFIT database
Fig. 2.6 Sample finger-selfie images from different subsets of the proposed UNFIT database

  • Set I: Single Finger—Images of the index and middle fingers belonging to the same hand of a user are captured. Finger-selfies are collected from either the left or the right hand of the user, as per his/her convenience, without enforcing any constraints regarding background, illumination, resolution, position, or orientation of the finger. Figure 2.6a and b demonstrate sample images belonging to this set. The set contains a total of 2300 images (=115 subjects \(\times \) 2 fingers \(\times \) 10 instances per finger).

  • Set II: Multiple Fingers—At times, users may capture multiple fingers, intentionally or unintentionally, and this additional information can be useful for improving finger-selfie recognition performance. This set is thus useful for studying the effect of multiple fingers on finger-selfie recognition. Figure 2.6c shows sample images belonging to this set. The set contains a total of 1150 samples (=115 subjects \(\times \) 10 instances per participant), each showing both the index and middle fingers of the same hand together.

3.3 Challenges

In a scenario where user cooperation is minimal, intra-class variations may increase. Some of these variations are shown in Fig. 2.4. A detailed description of the challenges included in the proposed UNFIT database is as follows:

  • Affine variations: Finger-selfie acquisition involves presenting the finger in front of the rear or front camera of the smartphone. While this task sounds trivial, there can be enormous affine variations, including translation and rotation of the finger. Rotation variation may be caused both by rotation of the finger in the 2D image plane (Fig. 2.4c–d) and by rolling of the finger about its own axis. While rotation in the 2D image plane does not lead to any information loss, rotation along the finger axis may result in different amounts of acquired ridge-valley detail. The varying distance from the acquisition camera results in scale variations.

  • Multiple fingers: As a part of the UNFIT dataset, index and middle fingers are collected together. While the multiple fingers can be placed in any order and may experience all variations a single finger can, multiple fingers may encounter other challenges as well. As illustrated in Fig. 2.4e–f, the multiple fingers may be split or may be presented together. The split-finger scenario aids in the robust testing of segmentation algorithms, since the algorithms should be able to segment the fingers in both situations.

  • Illumination: Finger-selfies can be captured in both indoor and outdoor environments. This induces illumination variations, which may result in dull or bright finger-selfies. Usage of the camera flash, as illustrated in Fig. 2.4h, may also result in localized bright regions.

  • Background: Since any natural background is allowed, finger-selfies may contain backgrounds that look similar to the finger. In addition, there may be skin regions in the background (Fig. 2.4k). In such a scenario, selecting the salient fingers becomes a tedious task.

  • Blur: During the capture process, a common problem is unfocused acquisition, which may lead to a blurred finger-selfie in which ridge-valley details are not prominent. Similarly, a finger-selfie may incur motion blur due to hand movement or unstable holding of the smartphone.

  • Deformation: In some cases, participants provided finger-selfies with crooked fingers.

3.4 Ground-Truth Annotation

Due to the various challenges incorporated in the proposed database (as mentioned in Sect. 2.3.3), the position and appearance of fingers in the images vary. To determine the exact location of the finger, it is necessary to generate ground-truth annotations. A segmentation tool is developed in MATLAB using Piotr Dollar’s toolbox [30]. The GUI of the tool allows the user to manually bound the finger region with rotatable and resizable rectangular boxes. With a rectangular region representing a finger, only a minimal number of background pixels are labeled as foreground. It acts as a loose bound for the finger, ensuring that there is only a negligible loss of ridge-valley details. The rectangular region can easily be cropped and fed to recognition modules. The ground-truth annotations, represented as masks, are also publicly available along with the database under the same image names in a separate folder.
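To make the annotation format concrete, the Python sketch below shows how a rotatable, resizable rectangular annotation could be rasterized into a binary foreground mask. The original tool is implemented in MATLAB; the box parametrization and the helper `rectangle_to_mask` are illustrative assumptions, not the released annotation format.

```python
import cv2
import numpy as np

def rectangle_to_mask(image_shape, center, size, angle_deg):
    """Rasterize a rotatable, resizable rectangular annotation into a binary mask.

    center: (cx, cy) and size: (width, height) in pixels; angle_deg is the
    in-plane rotation of the box. Hypothetical parametrization for illustration.
    """
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    # cv2.boxPoints converts a rotated rectangle into its four corner points.
    corners = cv2.boxPoints((center, size, angle_deg)).astype(np.int32)
    cv2.fillPoly(mask, [corners], 255)  # finger region -> white (foreground)
    return mask

# Example: a loose bound around a finger in a 1920x1080 finger-selfie.
mask = rectangle_to_mask((1920, 1080, 3), center=(540, 900), size=(300, 700), angle_deg=25)
```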

3.5 Experimental Protocol

As mentioned in Sect. 2.3.2, the UNFIT database is collected from 115 subjects with 30 images taken from each participant. For training and testing, a 50:50 subject-disjoint split is maintained. Hence, the training data includes 1740 images corresponding to 58 subjects, and the testing data consists of the remaining 1710 images from 57 participants. The index and middle fingers of the same subject are considered different classes, resulting in 116 classes during training and 114 classes during testing. During testing, the first five images of each case (index, middle, or both fingers) are treated as the gallery, whereas the remaining images (samples #6–10) are used as the query images. While generating scores, genuine scores are produced when index–index, middle–middle, and multiple–multiple fingers of the same subject are matched. All other combinations of match scores generated by matching query with gallery images are treated as imposter scores.
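The sketch below illustrates the gallery/probe split and the genuine/imposter labeling described above. The data structure and function names (`samples`, `build_protocol`) are assumptions made only for illustration.

```python
import itertools

def build_protocol(samples):
    """Split per-class samples into gallery (instances #1-5) and probe (#6-10) sets.

    `samples` maps a class label (subject_id, finger) to an ordered list of
    ten feature templates; the structure is an illustrative assumption.
    """
    gallery, probe = {}, {}
    for label, templates in samples.items():
        gallery[label] = templates[:5]   # first five instances form the gallery
        probe[label] = templates[5:]     # remaining instances are the query images

    genuine_pairs, imposter_pairs = [], []
    for q_label, q_templates in probe.items():
        for g_label, g_templates in gallery.items():
            pairs = itertools.product(q_templates, g_templates)
            if q_label == g_label:       # index-index, middle-middle, multi-multi of the same subject
                genuine_pairs.extend(pairs)
            else:                        # all other combinations are imposter comparisons
                imposter_pairs.extend(pairs)
    return gallery, probe, genuine_pairs, imposter_pairs
```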

4 Segmentation Framework

The unique and discriminative features of a fingerprint lie in its ridge-valley pattern. These details are present on the fingertip, which constitutes the foreground of the finger-selfie image. Hence, a framework is presented that aims to discard the background pixels and keep only the foreground information. A summary of the segmentation framework is illustrated in Fig. 2.7, and its details are elaborated below.

Fig. 2.7 Illustration of the segmentation framework using VGG SegNet followed by 32 \(\times \) 32 block-wise smoothening

4.1 Segmentation Using VGG SegNet

The segmentation framework primarily utilizes VGG SegNet to classify pixels as foreground or background. The VGG SegNet architecture has an encoder and a decoder network. While the role of the encoder is to convert the input data into a meaningful feature map at a lower dimension, the decoder upsamples the lower-dimensional feature map. The lower-dimensional feature map is produced by max-pooling operations that follow a sequence of convolution, batch normalization, and ReLU activation (which introduces nonlinearity). The locations of the features retained by max-pooling are stored for further computation.

The decoder network utilizes the pooling indices (the ones stored during encoding) to perform a nonlinear upsampling that counters the effect of max-pooling. The stored pooling indices guide the decoder network to map a lower-dimensional input feature map to a higher-dimensional feature map. Hence, the upsampled feature map obtained from the decoder network is a sparse representation of the input. Upsampling using pooling indices is a training-free operation, which reduces the number of trainable parameters of the model.
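The PyTorch sketch below illustrates this index-guided, parameter-free upsampling with a single encoder/decoder stage. It is a minimal illustration of the mechanism, not the full VGG SegNet configuration; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TinySegNetBlock(nn.Module):
    """One encoder/decoder stage illustrating max-pooling with stored indices.

    The real VGG SegNet stacks several such stages with VGG-16 encoder weights;
    this only sketches the index-guided, training-free upsampling.
    """
    def __init__(self, in_ch=3, mid_ch=64, out_ch=2):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # remember argmax locations
        self.unpool = nn.MaxUnpool2d(2, stride=2)                   # parameter-free upsampling
        self.dec = nn.Conv2d(mid_ch, out_ch, 3, padding=1)          # 2 channels: background / finger

    def forward(self, x):
        feats = self.enc(x)
        pooled, indices = self.pool(feats)        # encoder: downsample, keep pooling indices
        upsampled = self.unpool(pooled, indices)  # decoder: sparse upsampling guided by indices
        return self.dec(upsampled)

logits = TinySegNetBlock()(torch.randn(1, 3, 224, 224))  # -> (1, 2, 224, 224) per-pixel scores
```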

While pooling is known to provide local invariance, in this work, a standard encoder–decoder network with pooling layers is utilized. Previous encoder–decoder architectures also use standard pooling in their models (or global average pooling at the end of the network). It can be noted that networks that use pooling [11, 31, 32] have worked well for the task of object segmentation. However, to eliminate pooling, the entire model would have to be revamped and replaced by a capsule-net-style architecture. Such a scenario would require training from scratch, preventing the use of a pre-trained network. With a limited number of training instances, training a pooling-free network is beyond the scope of the proposed framework.

The sparse representation is fed as input to a convolutional layer, which is succeeded by a Softmax classification layer. The Softmax layer classifies each image pixel as foreground or background. Thus, the VGG SegNet-based segmentation algorithm utilizes a pre-trained VGG SegNet model, which is fine-tuned using finger-selfies. However, as we explain in Sect. 2.4.4.1, the predicted mask is tightly bound, due to which a significant foreground area is lost. Therefore, the VGG SegNet architecture is succeeded by a \(32\times 32\) block-wise smoothening layer to increase the number of foreground pixels. The full segmentation pipeline is shown in Fig. 2.7, and Algorithm 1 summarizes the complete segmentation algorithm.

Algorithm 1 Finger-selfie segmentation using VGG SegNet followed by \(32 \times 32\) block-wise smoothening
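Since the algorithm is presented as a figure, a minimal Python sketch of the overall pipeline, as described in the text, is given below. The helpers `segnet_model` and `blockwise_smooth` are hypothetical placeholders (a sketch of the latter appears in Sect. 2.4.4.1), and the probability threshold is an assumption.

```python
import cv2
import numpy as np

def segment_finger_selfie(image_bgr, segnet_model, blockwise_smooth):
    """Hedged sketch of the segmentation pipeline summarized in Algorithm 1.

    `segnet_model` is assumed to return a per-pixel foreground-probability map
    for a 224x224 RGB input; `blockwise_smooth` is the 32x32 smoothening step.
    """
    h, w = image_bgr.shape[:2]
    rgb = cv2.cvtColor(cv2.resize(image_bgr, (224, 224)), cv2.COLOR_BGR2RGB)

    prob_fg = segnet_model(rgb / 255.0)               # Softmax foreground probabilities
    mask = (prob_fg > 0.5).astype(np.uint8)           # per-pixel binary prediction
    mask = blockwise_smooth(mask, block=32)           # loosen the tight boundary

    mask_full = cv2.resize(mask * 255, (w, h), interpolation=cv2.INTER_NEAREST)
    ys, xs = np.nonzero(mask_full)
    if len(xs) == 0:
        return None                                   # no finger located
    return image_bgr[ys.min():ys.max() + 1, xs.min():xs.max() + 1]  # segmented finger
```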

4.2 Implementation Details

To train the VGG SegNet + \(32\times 32\) block-wise smoothening network, finger-selfies of size \(224\times 224 \times 3\) are used along with their corresponding ground-truth annotations of size \(224 \times 224 \times 1\). As illustrated in Fig. 2.7, VGG SegNet consists of an encoder and a decoder network. The output dimension of the encoder network is \(14 \times 14 \times 512\). This multi-channel output is fed to the decoder network, which in turn gives an output of dimension \(112 \times 112 \times 2\). The output of the decoder network serves as input to the Softmax layer, whose task is to provide a binary prediction for each pixel. A white pixel in the predicted binary mask represents the finger region, whereas a black pixel represents the background. Similar to VGG SegNet, FCN 8 is also provided with finger-selfies and their corresponding ground-truth annotations.

The VGG SegNet and FCN 8 architectures are fine-tuned using an augmented training set. The augmented training data is created by extending the original training set with mirror-flipped, intensity-changed, blurred, and rotated finger-selfies. Rotation of finger-selfies is performed at three different angles: \(90^{\circ }, 180^{\circ }\), and \(270^{\circ }\). After image augmentation, the size of the training set increases to 27,600 images. The corresponding finger location annotations are generated for the augmented images from the original ground-truth annotations. Using the augmented training set, the deep architectures are fine-tuned for 100 epochs.
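A minimal sketch of this augmentation step is given below, assuming OpenCV and illustrative parameter values (the brightness shift and blur kernel size are not specified in the text).

```python
import cv2
import numpy as np

def augment(image, mask):
    """Generate augmented (image, mask) pairs: mirror flip, intensity change,
    blur, and 90/180/270 degree rotations. Parameter values are illustrative."""
    pairs = [(image, mask)]
    pairs.append((cv2.flip(image, 1), cv2.flip(mask, 1)))                 # horizontal mirror
    brighter = np.clip(image.astype(np.int16) + 40, 0, 255).astype(np.uint8)
    pairs.append((brighter, mask))                                        # intensity shift
    pairs.append((cv2.GaussianBlur(image, (7, 7), 0), mask))              # synthetic blur
    for code in (cv2.ROTATE_90_CLOCKWISE, cv2.ROTATE_180, cv2.ROTATE_90_COUNTERCLOCKWISE):
        pairs.append((cv2.rotate(image, code), cv2.rotate(mask, code)))   # 90, 180, 270 degrees
    return pairs
```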

4.3 Performance Evaluation Metrics

To evaluate the performance of the segmentation algorithm, the following metrics are used:

  • Segmentation accuracy (SA):

    $$\begin{aligned} SA = \frac{CPB}{TB} \end{aligned}$$
    (2.1)

    where CPB is a count of the correctly predicted blocks while TB is the total number of blocks.

  • Foreground segmentation accuracy (FSA):

    $$\begin{aligned} FSA = \frac{CPFB}{TFB} \end{aligned}$$
    (2.2)

    FSA is the normalized foreground segmentation accuracy, where CPFB represents the number of correctly predicted foreground blocks, normalized with respect to the total count of foreground annotated blocks (TFB).

  • Background Segmentation Accuracy (BSA):

    $$\begin{aligned} BSA = \frac{CPBB}{TBB} \end{aligned}$$
    (2.3)

    BSA is the normalized background segmentation accuracy, where CPBB portrays the number of correctly predicted background blocks normalized with respect to the total count of background annotated blocks (TBB).
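The three metrics can be computed directly from binary block labels; the sketch below assumes `pred` and `gt` are arrays of per-block labels with 1 denoting foreground.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute SA, FSA, and BSA from binary block labels (1 = foreground)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    correct = pred == gt
    sa = correct.mean()                                # Eq. (2.1): CPB / TB
    fsa = correct[gt].mean() if gt.any() else 0.0      # Eq. (2.2): CPFB / TFB
    bsa = correct[~gt].mean() if (~gt).any() else 0.0  # Eq. (2.3): CPBB / TBB
    return sa, fsa, bsa
```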

Figure 2.8 demonstrates a visual elucidation of FSA and BSA using the segmentation algorithm.

Fig. 2.8 Interpretation of FSA and BSA while segmenting finger-selfies

4.4 Segmentation Performance

Table 2.3 reports the segmentation performance of the algorithm in terms of FSA, BSA, and SA. VGG SegNet along with \(32\times 32\) block-wise smoothening provides the best foreground segmentation accuracy and performs well in terms of BSA and SA as well. Tables 2.4 and 2.5 compare various segmentation techniques with the VGG SegNet + block-wise smoothening algorithm. Figure 2.9 shows a few samples where the segmentation framework segments the finger-selfie correctly, whereas Fig. 2.10 shows some failure cases of the segmentation algorithm.

Table 2.3 Segmentation performance of the VGG SegNet + \(32 \times 32\) block-wise smoothening finger-selfie segmentation algorithm
Table 2.4 Comparison of the segmentation framework with VGG SegNet: illustrating the effect of \(32 \times 32\) block-wise smoothening
Fig. 2.9 Illustration of the successful cases of the segmentation framework

In the proposed UNFIT database, background pixels constitute 86.21% of all pixels compared to 13.79% foreground pixels. While FSA is lower than BSA in Table 2.3, the reported segmentation accuracy (SA) is biased toward BSA for all fingers. This is due to the higher number of background pixels in the UNFIT database as compared to foreground finger-region pixels.

Fig. 2.10 Illustration of the failure cases of the segmentation framework

Fig. 2.11 Significance of \(32\times 32\) smoothening over the VGG SegNet architecture

4.4.1 Effect of \(32\times 32\) Block-Wise Smoothening

Table 2.4 shows a comparison of the proposed architecture with VGG SegNet alone. For VGG SegNet, it can be observed that BSA outperforms FSA for all the fingers. The reason for the higher BSA is the tight bound over the located finger-selfie produced by the trained VGG SegNet. A drawback of such a tight bound is that some foreground finger regions are labeled as background, while most background regions are correctly predicted as background. Thus, for VGG SegNet, BSA is higher than FSA due to the erroneous classification of foreground pixels on the boundary of the located finger-selfie.

As observed from the segmentation performance of VGG SegNet in Table 2.4, FSA remains lower due to misclassification of foreground pixels located on the boundary of the located finger-selfie. Loosening the boundary predicted by VGG SegNet increases the number of foreground pixels, in turn increasing FSA. Thus, a \(32\times 32\) block-wise smoothening layer is added to the VGG SegNet architecture, which increases the FSA from 66.75 to 71.22%. While there is a trade-off of 1.04% and 1.98% reductions in SA and BSA, respectively, the distinctive ridge-valley details present in the foreground region of finger-selfies are not compromised. An illustration of the effect of smoothening over VGG SegNet is shown in Fig. 2.11.
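The exact smoothening rule is not spelled out here, so the sketch below assumes a simple hypothetical rule: a \(32\times 32\) block is relabeled foreground whenever the fraction of predicted foreground pixels inside it exceeds a small threshold, which loosens the tight boundary as described.

```python
import numpy as np

def blockwise_smooth(mask, block=32, fg_fraction=0.1):
    """Hypothetical 32x32 block-wise smoothening of a binary mask (1 = foreground).

    A whole block is marked foreground if at least `fg_fraction` of its pixels are
    predicted foreground; this loosens the tight VGG SegNet boundary so that
    ridge-valley details near the finger border are retained.
    """
    h, w = mask.shape
    out = np.zeros_like(mask)
    for y in range(0, h, block):
        for x in range(0, w, block):
            patch = mask[y:y + block, x:x + block]
            if patch.mean() >= fg_fraction:
                out[y:y + block, x:x + block] = 1
    return out
```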

4.4.2 Comparison of VGG SegNet with FCN 8

Similar to VGG SegNet, an FCN 8 architecture is also fine-tuned. Inferring from the positive effect of \(32\times 32\) block-wise smoothening on FSA, the FCN 8 architecture is also followed by \(32\times 32\) block-wise smoothening. FCN 8 trains a fully convolutional encoder–decoder network, using an AdaDelta optimizer and a cross-entropy loss function.

Table 2.5 Comparison of segmentation performance of the finger-selfie segmentation framework with FCN 8

Table 2.5 compares the segmentation performance of the FCN 8-based segmentation with the VGG SegNet-based segmentation algorithm. With the highest FSA and overall segmentation accuracy, the VGG SegNet + block-wise smoothening model outperforms FCN 8 under both scenarios. One of the major reasons for the better performance of the VGG SegNet-based approach is its smaller number of trainable parameters [33]. Using the max-pooling indices from the respective encoding layers, the decoder in VGG SegNet performs sparse upsampling. This procedure reduces computation time and increases the generalizability of the model. On the contrary, FCN 8 learns parameters for upsampling as well. Hence, despite data augmentation, the training data may not be enough to train the additional parameters, which explains why VGG SegNet outperforms FCN 8.

4.4.3 Comparison with Skin Color-Based Segmentation

Inspired by existing studies [12, 16, 18, 23], the VGG SegNet + \(32 \times 32\) block-wise smoothening model is also compared with various skin color-based segmentation algorithms. The results are presented in Fig. 2.12. The first comparison is performed with a thresholding-based skin color segmentation algorithm [34, 35]. The finger-selfie image, available in the RGB color space, is converted to the HSV and YCbCr color spaces. The information in the Hue, Cb, and Cr channels is used to find probable skin regions using pre-defined thresholds. While the VGG SegNet + 32\(\times \)32 block-wise smoothening method provides an FSA of 71.22%, skin color-based segmentation provides an FSA of 58%. The segmentation algorithm proposed by Sankaran et al. [12] also fails to perform well. Due to image augmentation with varying intensities, our fine-tuned model becomes robust toward illumination variations and flash usage in finger-selfies. However, because of overly bright or overly dull skin regions in certain cases, the standard skin color algorithms fail due to their fixed thresholds.
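For reference, a minimal sketch of such a fixed-threshold skin color baseline is given below. The Hue and Cr/Cb ranges are commonly used illustrative values, not the exact thresholds of [34, 35]; their fixed nature is precisely what makes the baseline brittle under flash or dull illumination.

```python
import cv2

def skin_color_mask(image_bgr):
    """Fixed-threshold skin segmentation in HSV and YCbCr space (baseline sketch)."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    hue_mask = cv2.inRange(hsv, (0, 30, 60), (25, 255, 255))       # Hue band for skin tones
    crcb_mask = cv2.inRange(ycrcb, (0, 135, 85), (255, 180, 135))  # Cr/Cb band for skin tones
    return cv2.bitwise_and(hue_mask, crcb_mask)                    # probable skin regions
```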

Additionally, skin color segmentation is compared in combination with the deep architecture. First, the salient region is cropped out using skin color-based segmentation. The salient region is then fed as input to the VGG SegNet + \(32\times 32\) smoothening architecture. However, both SA and FSA are reduced. The results are shown in Fig. 2.12. These results indicate that in an unconstrained scenario, skin color-based segmentation is likely to fail.

Fig. 2.12 Comparison of segmentation accuracies obtained with the skin color-based techniques and the VGG SegNet with block-wise smoothening algorithm

5 Finger-Selfie Recognition

In 2013, Li et al. [22] highlighted that minutiae-based techniques for feature extraction and matching would fail for finger-selfies. Sankaran et al. [12] reached a similar conclusion, highlighting that minutiae-based techniques fail even for semi-constrained scenarios; hence, they used ScatNet for their experiments. While ScatNet worked for the semi-constrained scenario, such a representation would fail to encode discriminatory information under the deformations and rotational variations present in the UNFIT database. We therefore utilize two non-minutiae-based algorithms for feature extraction, namely CompCode and ResNet50. The details are given in the subsections below.

5.1 Feature Representations

Non-deep learning: Competitive Coding (CompCode) [13, 36] is a popular non-minutiae-based feature representation, commonly deployed for fingerprint and palmprint recognition. Recently, CompCode and its variants were exploited to utilize the ridge-valley details present in palmprints for person recognition [36]. With the ridge-valley pattern forming a unique structure, filters that encode orientation information can provide an efficient feature representation. CompCode features are extracted by convolving the real part of the Gabor filter \(G_r\) over the image I. The Gabor filters \(G_r\) have J different orientations, each differing from the previous one by \(\frac{\pi }{J}\). Along with orientation, the Gabor filters also differ in frequency W. Hence, the total number of filters convolved to obtain the feature representation is \(J\times W\). The response of the filters, convolved over the segmented finger-selfie I, is given as:

$$\begin{aligned} R=I(x,y)*\psi _R(x,y,\omega _i, \theta _j) \end{aligned}$$
(2.4)

Here, \(\psi _R\) is the real part of the Gabor filter \(\psi \), while \(\omega _i\) and \(\theta _j\) are the frequency and orientation of the Gabor filter, respectively. Note that the segmented output is upscaled to a fixed size of 400\(\times \)400 before applying the Gabor filters to obtain the representation.
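A minimal sketch of this orientation-coding step is shown below, using OpenCV's real Gabor kernels. The filter parameters and the winner-take-all encoding are illustrative assumptions rather than the exact CompCode formulation of [13].

```python
import cv2
import numpy as np

def compcode_like(finger_gray, J=6, frequencies=(0.08, 0.12)):
    """Orientation coding with real Gabor filters, in the spirit of Eq. (2.4).

    Assumes a grayscale segmented finger-selfie; J orientations spaced by pi/J
    and a small set of spatial frequencies. The dominant filter index per pixel
    forms the code that is later compared with the Hamming distance.
    """
    finger_gray = cv2.resize(finger_gray, (400, 400))        # fixed size before filtering
    responses = []
    for w in frequencies:                                     # frequency (wavelength = 1/w)
        for j in range(J):                                    # orientations theta_j = j*pi/J
            kernel = cv2.getGaborKernel((31, 31), sigma=4.0, theta=j * np.pi / J,
                                        lambd=1.0 / w, gamma=0.5, psi=0)  # real part
            responses.append(cv2.filter2D(finger_gray.astype(np.float32), -1, kernel))
    responses = np.stack(responses)                           # (J * |W|, 400, 400)
    return np.argmax(responses, axis=0).astype(np.uint8)      # dominant orientation code
```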

Deep learning-based approach: The segmented finger-selfie serves as input to the ResNet50 architecture [14]. ResNets have demonstrated strong performance on general object recognition with deeper networks. To counter the effects of vanishing gradients and overfitting, ResNets have shortcut connections between different convolutional layers. Intuitively, along with the feedforward mapping F(x) from the previous layer \(C_l\), the input to the next convolutional layer \(C_{l+1}\) also includes an identity mapping x from some previous layer \(C_{l-k}\). Hence, the input to convolutional layer \(C_{l+1}\) can be written as:

$$\begin{aligned} F(x,\{W_i\})+x \end{aligned}$$
(2.5)

where \(W_i\) signifies the transformation through multiple convolutional layers. In the ResNet50 architecture, the function F(x) involves two stacked convolutional layers. This implies that the input x is taken from the activated output of layer \(C_{l-2}\), and \(F(x,\{W_i\})\) is a transformation of x over two convolutional layers.

Fig. 2.13 Procedure to obtain the feature representation using the ResNet50 architecture

The segmented RGB image is provided to the network at a fixed size of \(224\times 224\). In our experiments, the ResNet50 architecture is initialized using the weights of a model trained on the ImageNet database. With the Softmax classification layer removed, the network provides a feature vector of dimension \(2048\times 1\), which is treated as the feature representation of the finger-selfie. The intermediate layers of ResNet50 respond to different shapes and strokes. Hence, the final feature representation encodes curves, vertical and horizontal lines, and other shapes, which correspond to ridge orientations, finger shape, and phalanx lines. The procedure to obtain the feature representation is illustrated in Fig. 2.13.
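A minimal sketch of this feature extraction step is given below, assuming torchvision's ImageNet-pretrained ResNet50 with the classification head removed; the preprocessing constants are the standard ImageNet statistics, and any fine-tuning details are not reproduced.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet-pretrained ResNet50 with the 1000-way classification head removed,
# yielding a 2048-d descriptor per segmented finger-selfie.
backbone = models.resnet50(weights="IMAGENET1K_V1")   # older torchvision: pretrained=True
backbone.fc = torch.nn.Identity()                     # drop the Softmax classification layer
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),                             # fixed input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet statistics
])

def extract_descriptor(segmented_rgb):
    """Return a 2048-d feature vector for a segmented RGB finger-selfie (H x W x 3, uint8)."""
    with torch.no_grad():
        x = preprocess(segmented_rgb).unsqueeze(0)    # (1, 3, 224, 224)
        return backbone(x).squeeze(0)                 # (2048,)
```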

5.2 Finger-Selfie Recognition Performance

After extracting features from the finger-selfie images, the next step is to match the query feature templates with the gallery templates. CompCode features are matched with the gallery templates using the Hamming distance to obtain a distance score. Similarly, the representations obtained from the ResNet50 architecture are matched with the gallery templates using cosine similarity.
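A minimal sketch of the two matchers is given below, assuming flattened CompCode maps and 2048-d ResNet50 descriptors; note that Hamming distance is a dissimilarity while cosine similarity is a similarity, so their score orientations differ.

```python
from scipy.spatial.distance import hamming, cosine

def match_compcode(query_code, gallery_code):
    """Hamming distance between two flattened CompCode maps (lower = more similar)."""
    return hamming(query_code.ravel(), gallery_code.ravel())

def match_resnet(query_feat, gallery_feat):
    """Cosine similarity between two 2048-d ResNet50 descriptors (higher = more similar)."""
    return 1.0 - cosine(query_feat, gallery_feat)
```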

Fig. 2.14 Receiver operating characteristic (ROC) curve for the VGG SegNet + \(32\times 32\) segmentation pipeline. Representation from ResNet50 architecture is matched using cosine similarity, and CompCode features are matched using Hamming distance metric on the test set of UNFIT database

Table 2.6 Confusion matrix when feature representation from CompCode and ResNet50 are matched using Hamming distance and cosine similarity, respectively. From a total of 731,025 pairs (855 probe representations matched with 855 gallery representations), there are 4275 genuine and 726,750 imposter pairs. Values are reported at 10% FAR

On the testing set of 57 subjects, the receiver operating characteristic (ROC) curve is used to report the verification performance. The ROC curves are shown in Fig. 2.14. Table 2.6 shows the confusion matrix when feature representations from CompCode and ResNet50 are matched using Hamming distance and cosine similarity, respectively. In spite of the potency of CompCode for palmprint and fingerprint recognition, we observe an EER of 41.41% for finger-selfie matching. On the other hand, cosine similarity over the ResNet50-based representation yields better performance, with an EER of 35.32%.
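For completeness, the sketch below shows one common way to estimate the EER from genuine and imposter similarity scores by sweeping a threshold; it is a generic illustration, not the exact evaluation code used for Table 2.6.

```python
import numpy as np

def equal_error_rate(genuine_scores, imposter_scores):
    """Estimate the EER from similarity scores (higher = more genuine).

    FAR and FRR are swept over thresholds; the EER is read off where they cross.
    Distance scores (e.g., Hamming) should be negated before calling this.
    """
    thresholds = np.sort(np.concatenate([genuine_scores, imposter_scores]))
    far = np.array([(imposter_scores >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])    # false rejects
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```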

The UNFIT finger-selfie database has numerous variations, which occur due to the unconstrained environment. The ResNet50 model is pre-trained on the ImageNet database, where objects appear in different shapes and sizes; the learned weights can therefore handle variations in the scale and orientation of finger-selfies. Also, the ResNet50 feature representations of segmented finger-selfies are matched using cosine similarity. Since cosine similarity is an angular similarity between two vectors, variations introduced in the magnitude of the representations due to illumination changes do not affect it. Hence, the recognition model becomes robust toward illumination variations. Thus, the overall performance of ResNet50 + cosine similarity is better than CompCode + Hamming distance-based recognition.

While these results are encouraging and suggest that deep architectures have better potential for finger-selfie recognition, there is still a long way to go for recognition of finger-selfies in an unconstrained scenario. With the proposed UNFIT database, we expect that the research community will be driven toward building better segmentation, enhancement, quality assessment, and feature representation modules for finger-selfie-based recognition.

6 Conclusion

This chapter presents a review of existing research on finger-selfies and then introduces finger-selfie recognition in an unconstrained environment. The proposed UNconstrained FIngerphoTo (UNFIT) database incorporates various challenges such as rotation, translation, orientation, position, scale, multiple fingers, illumination, background, and resolution, which arise due to the differing environments in which the finger-selfies are acquired. The database includes manual annotations and an experimental protocol, using which segmentation and verification results are benchmarked. A VGG SegNet-based segmentation approach is presented along with baseline results, followed by matching algorithms using CompCode and ResNet50 representations. We assert that the proposed database can take forward the research in this domain and that the presented pipeline can segment and authenticate finger-selfies despite the challenges posed by the database. Future work can include quality assessment for the detection of poor-quality finger-selfies and the use of minutiae in conjunction with deep learning features for improved recognition performance.