
1 Introduction

Fingerprints, i.e. ridge-valley patterns on the tip of a human finger, are one of the most important biometric characteristics due to their known uniqueness and persistence properties [9, 16]. In contrast to touch-based systems, touchless fingerprint recognition does not suffer from problems like distortions caused by pressing the finger onto a sensor plate, areas of low contrast caused by dirt or humidity, or latent fingerprints left on the sensor plate [12, 17].

Fig. 1.

Example of (a) an input image to the proposed touchless fingerprint segmentation approach and (b) the corresponding segmentation result. Note that the depicted thumb does not contain a fingerprint suitable for recognition purposes.

However, preprocessing touchless fingerprint images such that sufficient biometric performance is reached is a challenging task. Segmenting the fingertip from the background is an essential step of this preprocessing [15]. In this context, the fingertip is defined as only the tip on the underside of the finger where the ridge-valley patterns are located. Figure 1 illustrates the segmentation of the fingertip region from a 2D image by the segmentation approach presented in this work.

State-of-the-art hand and finger segmentation systems employed in touchless fingerprint recognition schemes are mostly based on the analysis of sharpness, color, or shape. Early approaches employ simple filters, e.g. the Sobel operator [13] or a Gaussian filter [14], in order to separate the sharp foreground from the background area. Such approaches require a clear gap between sharp and blurred areas and assume that the finger area is in focus. Jonietz et al. [10] proposed a combination of shape- and color-based finger detection using edge pairing; the authors apply machine learning algorithms to estimate the finger shape on color-based segmented images. Several contributions use properties of color models to segment the skin-tone color from the background, where the YCbCr color model is most prominent [1, 8, 23, 24]. Raghavendra et al. [22] used mean shift segmentation to filter the input image and segment it by fusing the convergence points in homogeneous regions. Multiple approaches utilize Otsu's algorithm to find a proper threshold between hand and background area, e.g. Wang et al. [24]. The detection of fingertips is further investigated by Raghavendra et al. [22], who aim to find the first finger knuckle based on its darker color. Lee et al. [13] present a region growing scheme which analyzes ridge-valley patterns in the frequency domain. Such two-stage schemes of first segmenting the hand area and then detecting the fingertip are considered error-prone in unconstrained use cases.

Semantic segmentation using deep learning techniques has been an active field of research in recent years; for a comprehensive overview, the interested reader is referred to corresponding surveys [5, 6, 18]. Especially in challenging environments, object detection and segmentation greatly benefit from machine learning. Given the requirements of a touchless fingerprint capturing process, deep learning techniques are highly suitable for segmenting the hand area and the fingertips. To the best of the authors' knowledge, no comprehensive research has yet been published on this topic.

This work proposes a fingertip segmentation system based on deep learning which is able to segment the hand area and fingertips in a single processing step. The contributions of this work are:

  • An adaptation of the state-of-the-art general purpose deep learning model DeepLabv3+ to the specific requirements of touchless fingerprint recognition.

  • The extension of the database for hand gesture recognition (HGR) by fingertip ground truth masks.

  • The application of suitable data augmentation to the database to obtain a sufficient amount of training samples.

  • A comprehensive evaluation in a tenfold cross-validation on a subject disjoint training and evaluation split including a comparison against a color-based segmentation system (baseline) and a detailed discussion of the segmentation performance.

Fig. 2.

Overview of the proposed system: (a) in the training stage, the skin-based ground truth of a hand gesture recognition database is extended by a fingertip class and data augmentation is employed; (b) in the evaluation stage, a comprehensive evaluation of the semantic segmentation network based on DeepLabv3+ for hand and fingertip segmentation is performed.

The rest of this paper is structured as follows: The following Sect. 2 describes the proposed system. Section 3 presents our experimental setup. Section 4 summarizes the results obtained in our experiments. Finally, Sect. 5 concludes.

2 Proposed System

The workflow of the proposed system consists of two stages: (1) data preparation and preprocessing, and (2) semantic segmentation of hands and fingertips based on the DeepLabv3+ model. Figure 2 gives an overview of the key components of the proposed system.

2.1 Preprocessing and Training Data Preparation

For this work, we use all subsets of the database for Hand Gesture Recognition (HGR) [7, 11, 19] created at the Silesian University of Technology. The original intent of this database is to provide gestures from the Polish and American sign languages; a total of 53 gestures are represented. Overall, the database provides 1,558 images from 33 subjects along with skin-based ground truth masks and a list of feature point positions, e.g. fingertips and knuckles.

To make the HGR database suitable for training a semantic hand and fingertip segmentation network, we implemented the following adaptations. First, the ground truth masks are extended by a fingertip class. To this end, the feature points representing the fingertips and the first finger knuckles are employed: a circular area is defined with the fingertip as center point and the distance between the fingertip and the first knuckle as radius. This circular area is intersected with the hand-labeled pixels so that only hand area pixels are considered as fingertip. This process is illustrated in Fig. 3 and sketched in the code example below. During a manual revision, the labels of a few fingertips were post-processed in order to increase their accuracy; in particular, fingertips for which the underside of the finger is not visible were discarded.
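The following Python sketch illustrates this label extension under the stated definition. The class indices, the function name, and the use of OpenCV are our own illustrative assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

HAND, FINGERTIP = 1, 2  # assumed class indices (background = 0)

def add_fingertip_class(skin_mask, fingertips, knuckles):
    """Extend a binary skin mask by a fingertip class: for each fingertip
    feature point, a circle with the fingertip-to-knuckle distance as
    radius is intersected with the hand-labeled pixels."""
    label = np.where(skin_mask > 0, HAND, 0).astype(np.uint8)
    for (fx, fy), (kx, ky) in zip(fingertips, knuckles):
        radius = int(round(np.hypot(fx - kx, fy - ky)))
        circle = np.zeros_like(label)
        cv2.circle(circle, (int(fx), int(fy)), radius, 1, thickness=-1)
        # only pixels inside the circle that are also hand pixels become fingertip
        label[(circle == 1) & (label == HAND)] = FINGERTIP
    return label
```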

Fig. 3.

Example of the proposed extension of the original ground truth skin mask by a fingertip class using the provided feature points.

Fig. 4.

Examples of an original image (left) and the resulting cropped (middle) and zoomed (right) images used for subsequent data augmentation.

The employed deep learning model DeepLabv3+ works with square input images whose size is a power of two. For this reason, all samples are scaled and cropped to a size of \(512\times 512\) pixels, cf. Fig. 4 (left, middle). Even though a \(512\times 512\) pixels hand pose image is not suitable for fingerprint extraction, the resulting segmentation mask can be utilized for fingerprint extraction on the full-size image in a practical application. The feature points are used to preserve as much of the hand area and the fingertips of the original image as possible during cropping.
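A possible cropping routine is sketched below. The exact heuristic used by the authors is not specified, so this sketch simply fits a square window around the provided feature points, clips it to the image bounds, and resizes it; all names and parameters are illustrative.

```python
import cv2
import numpy as np

def crop_to_square(image, mask, feature_points, size=512):
    """Fit a square window around the hand feature points (clipped to the
    image bounds) and resize image and label mask to size x size pixels."""
    pts = np.asarray(feature_points)
    x0, y0 = pts.min(axis=0).astype(int)
    x1, y1 = pts.max(axis=0).astype(int)
    side = max(x1 - x0, y1 - y0)
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    h, w = image.shape[:2]
    x0, x1 = max(cx - side // 2, 0), min(cx + side // 2, w)
    y0, y1 = max(cy - side // 2, 0), min(cy + side // 2, h)
    image_c = cv2.resize(image[y0:y1, x0:x1], (size, size), interpolation=cv2.INTER_LINEAR)
    mask_c = cv2.resize(mask[y0:y1, x0:x1], (size, size), interpolation=cv2.INTER_NEAREST)
    return image_c, mask_c
```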

Second, data augmentation is applied. The cropped samples are augmented by rotations of 90\(^\circ \), −90\(^\circ \), and 180\(^\circ \), and by mirroring on the vertical or horizontal axis. Additionally, a combination of a rotation by 90\(^\circ \) and a vertical mirroring is applied. Furthermore, zooming is applied to a subset of 1,103 samples to emphasize the fingertip region, cf. Fig. 4 (right). It should be noted that not all samples are suitable for zooming because not every sample contains a fingertip area. The zoomed samples are also augmented by a rotation of 90\(^\circ \) or −90\(^\circ \) and a mirroring on the vertical or horizontal axis. In total, this results in 15,318 samples.
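A minimal sketch of the described augmentation variants, assuming image and mask are NumPy arrays; the function name is hypothetical.

```python
import numpy as np

def augment(image, mask):
    """Generate the described augmentation variants of a cropped sample:
    rotations by +/-90 and 180 degrees, vertical/horizontal mirroring,
    and a combined 90-degree rotation with vertical mirroring."""
    variants = [
        (np.rot90(image, 1), np.rot90(mask, 1)),           # +90 degrees
        (np.rot90(image, -1), np.rot90(mask, -1)),          # -90 degrees
        (np.rot90(image, 2), np.rot90(mask, 2)),            # 180 degrees
        (np.flip(image, axis=1), np.flip(mask, axis=1)),    # mirror on vertical axis
        (np.flip(image, axis=0), np.flip(mask, axis=0)),    # mirror on horizontal axis
        (np.flip(np.rot90(image, 1), axis=1),               # +90 degrees + vertical mirror
         np.flip(np.rot90(mask, 1), axis=1)),
    ]
    return variants
```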

2.2 Semantic Segmentation

As deep learning model, DeepLabv3+ [2, 3] is utilized. To this day, DeepLabv3+ is one of the best performing general-purpose segmentation networks on the Pascal VOC challenge [4]. It is based on an encoder-decoder structure using Atrous Spatial Pyramid Pooling (ASPP). For more details on DeepLabv3+ the reader is referred to [2, 3].
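As a rough illustration of the ASPP idea only, not of the actual DeepLabv3+ module, the following PyTorch sketch applies parallel atrous (dilated) convolutions with different rates and fuses their outputs to capture multi-scale context; the dilation rates follow a common DeepLab setting.

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Simplified ASPP illustration: parallel dilated convolutions with
    different rates, concatenated and projected back to out_ch channels."""
    def __init__(self, in_ch, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=1 if r == 1 else 3,
                      padding=0 if r == 1 else r, dilation=r, bias=False)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```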

Fig. 5.

Example images for poses of the three categories (a) “good”, (b) “bad”, and (c) “ugly”.

For better segmentation results on small data sets, the DeepLab developers provide pre-trained models. In this work, a model pre-trained on the general-purpose Pascal VOC data set is used for transfer learning [20]. The number of classes is reduced to three (hand area, fingertips, and background) in order to fit the extended HGR database. The neural network is trained over 24 epochs, where one epoch consists of a pass over all images included in the training set.
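The sketch below illustrates such a transfer-learning setup with three output classes. It uses torchvision's DeepLabv3 model (torchvision >= 0.13) as a stand-in; only the three classes are taken from the description above, while the library, optimizer, and learning rate are illustrative assumptions and the authors' actual implementation may differ.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet101

# Stand-in sketch: a pre-trained DeepLabv3 with the classifier head replaced
# for the three classes background, hand, and fingertip.
NUM_CLASSES = 3
model = deeplabv3_resnet101(weights="DEFAULT")
model.classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_epoch(loader, device="cuda"):
    """One epoch: a pass over all training images and their label masks."""
    model.to(device).train()
    for images, targets in loader:  # targets: HxW class-index masks
        logits = model(images.to(device))["out"]
        loss = criterion(logits, targets.to(device).long())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```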

3 Experimental Setup

For a reliable evaluation of the segmentation results, the HGR database is separated into a training set and an evaluation set. The separation is performed subject-disjoint to ensure that the model is not evaluated on subjects which it has already seen during the training phase.

Despite the data augmentation, the amount of training and evaluation samples is relatively low for a deep learning-based semantic segmentation network. For this reason, a cross-validation is performed: over ten rounds, different randomly selected distributions of subjects are used for training and evaluation. This ensures a proper assessment of the segmentation performance.
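A minimal sketch of such subject-disjoint rounds, assuming scikit-learn's GroupShuffleSplit with subject identifiers as groups; the split ratio and random seed are illustrative assumptions.

```python
from sklearn.model_selection import GroupShuffleSplit

def cross_validation_rounds(samples, subject_ids, rounds=10, test_size=0.3, seed=42):
    """Yield ten randomly drawn, subject-disjoint training/evaluation splits:
    all samples of a subject end up on exactly one side of each split."""
    splitter = GroupShuffleSplit(n_splits=rounds, test_size=test_size, random_state=seed)
    for train_idx, eval_idx in splitter.split(samples, groups=subject_ids):
        yield train_idx, eval_idx
```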

The HGR database contains 53 poses of different relevance for the conducted experiment. For example, a pose where four fingers of the inner hand are shown (cf. Fig. 1) has a much higher relevance for the use case of touchless fingerprint recognition than a pose showing a fist. For this reason, the evaluation is done for three categories of poses:

  • Good: poses which are well-suited for our scenario, e.g. an image of multiple fingers of the inner hand where the fingertips are clearly shown (Fig. 5(a)).

  • Bad: more challenging poses where fingertips are shown but rotated (Fig. 5(b) left, middle), or partly covered by other fingers or parts of the hand (Fig. 5(b) right).

  • Ugly: poses which do not contain a clear fingertip area or the fingertips are only partially visible. This is the case, e.g. when the hand is clenched in a fist (Fig. 5(c) middle), or the back of the hand is shown (Fig. 5(c) left, right).

In total 18 poses (470 samples) are categorized as “good”, 15 (441 samples) as “bad”, and 20 (674 samples) as “ugly”. It should be noted that the number of samples per pose varies and that not every pose is performed by each subject.

For the evaluation of our segmentation results, we use the common Intersection over Union (IoU) metric as defined in Eq. (1), where G denotes the ground truth and S the segmentation result. The IoU is estimated for single classes, while the mean IoU (mIoU) is estimated over all classes including the background. Additionally, the inter-class IoU is computed; this refers to the erratic intersection between two different classes which should be disjoint and gives a better understanding of how the segmentation errors are distributed.

$$\begin{aligned} IoU = \frac{|G \cap S|}{|G \cup S|} \end{aligned}$$
(1)
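The following sketch computes the per-class IoU of Eq. (1), the mean IoU, and an inter-class IoU. The exact definition of the inter-class IoU used by the authors is not spelled out; here it is assumed to be the IoU between a ground-truth class and a different predicted class, and the class indices are illustrative.

```python
import numpy as np

BACKGROUND, HAND, FINGERTIP = 0, 1, 2  # assumed class indices

def iou(gt, seg, cls):
    """Per-class IoU according to Eq. (1) for label images gt and seg."""
    g, s = (gt == cls), (seg == cls)
    union = np.logical_or(g, s).sum()
    return np.logical_and(g, s).sum() / union if union else 1.0

def mean_iou(gt, seg, classes=(BACKGROUND, HAND, FINGERTIP)):
    """Mean IoU over all classes including the background."""
    return float(np.mean([iou(gt, seg, c) for c in classes]))

def inter_class_iou(gt, seg, cls_gt, cls_seg):
    """Erratic overlap between two classes that should be disjoint, e.g.
    ground-truth hand pixels predicted as fingertip."""
    g, s = (gt == cls_gt), (seg == cls_seg)
    union = np.logical_or(g, s).sum()
    return np.logical_and(g, s).sum() / union if union else 0.0
```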

To compare the deep learning model against a baseline, we implemented a well-established color-based segmentation system: Otsu's adaptive threshold algorithm is applied to the RGB image and segments the hand area from the background, which results in a binary image as segmentation mask.

Table 1. Comparison of the segmentation performance between the deep learning-based system and the baseline approach.
Fig. 6.

Comparison of the segmentation performance between the baseline and the deep learning system.

Depending on the brightness distribution of each color channel, the hand area in the segmented image is either black or red. Therefore, the IoU is computed on the original segmentation mask and its inverse, and the mask with the better IoU score is selected. It should be noted that this approach is in principle not capable of detecting the fingertip regions.
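A simplified sketch of this baseline is given below. It applies Otsu's threshold to a grayscale conversion rather than to the individual color channels, and the selection between the mask and its inverse follows the description above; details of the authors' implementation may differ.

```python
import cv2
import numpy as np

def binary_iou(a, b):
    """IoU of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def otsu_baseline(rgb_image, gt_hand_mask):
    """Color-based baseline: Otsu thresholding yields a binary mask; since
    the hand may end up as either region, the mask and its inverse are
    scored against the ground truth and the better one is kept."""
    gray = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2GRAY)
    _, mask = cv2.threshold(gray, 0, 1, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return max((mask, 1 - mask), key=lambda m: binary_iou(gt_hand_mask, m))
```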

4 Results

In our experiments, we first compare the segmentation performance in terms of IoU and mIoU between the proposed system and the baseline. Subsequently, a more detailed evaluation of the segmentation results on the hand and fingertip areas is performed. All results are obtained using the tenfold cross-validation.

4.1 Proposed System vs. Baseline

In a first experiment, we compare the color-based baseline system with the deep learning-based segmentation. This evaluation considers only the overall segmentation performance and the segmentation performance on the hand area because the baseline system does not feature a fingertip detection.

Fig. 7.

Comparison of the results on (a) a challenging input image with (b) corresponding ground truth for (c) the color-based baseline system and (d) our proposed system. It can be observed that the result of the proposed system is close to the ground truth, whereas the baseline system fails to segment the hand area.

Both systems show a promising overall segmentation accuracy. The deep learning system achieves a slightly better mIoU of \(95.02\%\) than the baseline system with \(89.01\%\), as shown in Fig. 6. Consequently, the IoU of the hand area shows similar results (deep learning: \(89.01\%\), baseline: \(85.20\%\)), as summarized in Table 1. The good performance of the color-based segmentation is attributed to the homogeneous background and the high contrast between hand and background in most of the images in the database. Figure 7 shows a more challenging sample: a background with a skin tone-like color leads to an inaccurate segmentation result of the baseline system, whereas the deep learning system segments the hand area more accurately. On the one hand, from Fig. 7(c) it can be observed that the baseline system is more prone to segmenting background area as hand area because it does not take the shape into account. On the other hand, the deep learning system segments the hand area more thoroughly, as can be seen in Fig. 7(d). The high standard deviations of the baseline system (mean IoU: \(20.97\%\), hand IoU: \(16.44\%\)) are also attributed to challenging samples. This illustrates a lack of robustness of the color-based segmentation.

4.2 Segmentation of Hand and Fingertips

In a second experiment, we evaluate the segmentation performance with a focus on the fingertip class and analyze the kind of errors the deep learning system makes. As discussed, we separate the poses into the three categories “good”, “bad”, and “ugly”. For each category, the mean IoU, hand IoU, and fingertip IoU are computed. Obtained results are listed in Table 2 and plotted in Fig. 8. Moreover, we estimate the inter-class IoU between the background and the hand, and between the background and the fingertip. Illustratively, this can be seen as the erratic IoU between two classes which should be separated. Corresponding results are summarized in Table 3.

Table 2. Segmentation performance of the proposed system across image categories.
Fig. 8.

Overview of the segmentation performance in terms of IoU for the three categories “good”, “bad”, and “ugly”.

In general, deep learning techniques learn color, contrast, and shape properties of every class. In our use case, the hand area is well separated from the background by color and contrast. The fingertip class, in contrast, naturally has no separation from the hand area by color or contrast; here, the learning of shapes is most important. Overall, the experimental results of our proposed deep learning system showcase a competitive hand and fingertip segmentation performance on the most relevant category. The results show that the learning of fingertip areas was successful in most cases but also highlight some challenges. Figure 9 shows a collection of well-performing samples.

The fingertip segmentation performs competitively on samples categorized as “good” with a fingertip IoU of \(68.03\%\), cf. Fig. 8. Samples categorized as “bad” still show a fingertip IoU of \(61.10\%\), whereas the performance on “ugly” samples drops to \(33.12\%\). The reason for the better results of the “good” and “bad” categories is that one or more fingers are raised and the inner hand is presented. The performance drop on the “ugly” category can be explained by the challenging task of estimating whether the front or the back of the hand is visible. In such cases, the deep learning system often fails by wrongly segmenting fingertips on the back of the hand (cf. Fig. 10).

Table 3. Segmentation errors (inter-class IoU) of the proposed system across image categories.
Fig. 9.

Examples of correctly segmented fingertips for various poses: all visible fingertips with visible fingerprints are segmented.

Fig. 10.

Examples of inaccurately segmented fingertips: falsely segmented fingertips on the back of the hand, and connected fingertip areas.

An important aspect of the proposed system is which kind of segmentation errors it makes. For this reason, the inter-class intersections between background, hand, and fingertip are computed. From Table 3 we observe that the IoU between background and fingertip is very low (“good”: \(0.08\%\), “bad”: \(0.12\%\), “ugly”: \(0.06\%\)). The IoU between background and hand is similarly low (“good”: \(0.34\%\), “bad”: \(0.48\%\), “ugly”: \(0.49\%\)). The highest inter-class IoU is observed between the hand and the fingertip (“good”: \(1.88\%\), “bad”: \(1.95\%\), “ugly”: \(1.85\%\)). This distribution suggests that the system segments fingertips almost exclusively within the hand area. The comparably high hand-to-fingertip IoU is caused by segmenting a fingertip area which is too large or a fingertip at the back of the hand. However, such errors might not be considered critical as the fingerprint will still be contained in the segmented fingertip. Further examples of sub-optimal segmentation results are shown in Fig. 10.

In all three categories, the standard deviation on the fingertips is much higher than the standard deviation on the hand area and the background. One hypothesis is that learning solely from shape (fingertip) is more vulnerable to mis-segmentation than learning based on color, shape, and contrast (hand). Another important aspect is that in some training-evaluation splits, hand poses with few samples are not shown to the neural network during the training phase but appear in the evaluation. In these constellations, the segmentation performance on these poses is rather low, which leads to a higher standard deviation. On “ugly” poses the standard deviation is even higher; here, the fact that only a few of the “ugly” categorized samples contain a fingertip area at all further increases the standard deviation.

Some aspects could further improve the segmentation results: a training database which is more suitable for the intended application scenario will most likely lead to a more robust segmentation. Furthermore, in a guided capturing process, instructions can be given not to present the back of the hand; this lowers the variety of poses which must be estimated and subsequently increases the segmentation accuracy. The proposed system is not able to assess how many fingertips are segmented and of which quality they are; quality assessment of touchless fingerprints using NFIQ2.0 is investigated in [21]. Moreover, multiple fingertips of the inner hand might be segmented as one connected area, and rotated fingertips, especially the thumb, are segmented regardless of their rotation angle (cf. Fig. 10). Hence, a dedicated postprocessing would be required to extract single fingerprint regions.
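One illustrative direction for such postprocessing, which is not part of the proposed system, is a connected-component analysis of the predicted fingertip class; note that this can only separate spatially disjoint regions and would not split fingertips that are segmented as one connected area. All names and the minimum-area threshold are illustrative assumptions.

```python
import cv2
import numpy as np

def extract_fingertip_regions(label_mask, fingertip_class=2, min_area=500):
    """Split the predicted fingertip class into individual regions via
    connected-component analysis and return their bounding boxes; small
    blobs below min_area pixels are discarded as noise."""
    fingertip = (label_mask == fingertip_class).astype(np.uint8)
    num, _, stats, _ = cv2.connectedComponentsWithStats(fingertip)
    boxes = []
    for i in range(1, num):  # component 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes
```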

5 Conclusion

In this work, we presented a feasibility study on direct fingertip segmentation for touchless fingerprint recognition through the use of deep learning-based semantic segmentation. For this purpose, we adapted a general-purpose segmentation network to the use case of fingertip segmentation by extending a hand gesture database with a fingertip class and applying suitable data augmentation. A tenfold cross-validation was conducted and evaluated, including a comparison to a well-established color-based segmentation scheme. The comparison with the color-based baseline system shows superior segmentation results on the hand area and demonstrates the feasibility of direct fingertip segmentation. Compared to traditional contrast-based finger knuckle detection approaches, the presented method is expected to be less error-prone. Especially in unconstrained environments with challenging heterogeneous backgrounds and illumination, the proposed system is expected to be more robust.

The development of adequate postprocessing to extract single fingerprint images from the obtained segmentation results, as well as the integration into the processing pipeline of a touchless fingerprint recognition system, are subject to future work.