1 Introduction

Nowadays, biometric systems are among the most widely used technologies in the world for person authentication. The development of new technologies that effectively capture biometric traits has also promoted their use for human recognition. Many research studies in this area employ unimodal and multimodal biometric traits for person identification or verification [1, 2]. Among these systems, face, iris, fingerprint, palmprint, gait and voice are the most widely used biometric traits. However, the current trend is the use of vein images captured from the finger, the palm and other regions of the hand such as the wrist.

Palm vein has received considerable attention as a biometric trait in recent years because the vein pattern lies under the skin, which makes it very difficult to spoof. It is also relatively stable and free from occlusion and noise such as hair. The development of low-cost devices capable of capturing palm vein patterns has also made it popular for high-security authentication systems. The image capturing process is fast and user friendly, which encourages users to adopt such devices willingly.

In this paper, we concentrate on palm vein as a biometric trait and propose a method for improving palm vein biometric authentication systems by combining a texture-based method with a convolutional neural network (CNN)-based method. The first method extracts features from five overlapping sub-regions of palm vein images using Binarized Statistical Image Features (BSIF), a state-of-the-art texture-based algorithm. BSIF obtains a binary code for the neighborhood of each pixel by binarizing the responses of a set of linear filters convolved with the image [3]. The features extracted by BSIF are matched and combined with a score-level fusion strategy. The second method is a deep learning CNN architecture based on the AlexNet structure [4], which was trained to obtain the second decision. The final decision of the proposed system is obtained by fusing the decisions of the two structures, namely BSIF with five sub-regions and the CNN model.

Fig. 1

Sample palm vein images and their regions of interest for five palm vein databases

The experiments were conducted on the CASIA [5], FYO [6], PUT [7], VERA [8] and Tongji [9] databases to evaluate the proposed method. Sample palm vein images from each dataset are shown in Fig. 1 together with their corresponding Region of Interest (ROI) images. Additionally, we compared the proposed method with several state-of-the-art methods. The contributions of this study can be summarized as follows:

  • A new palm vein recognition method that fuses texture features extracted from different overlapping sub-regions of a palm vein image is proposed.

  • Score-level fusion of powerful texture-based BSIF on overlapping regions is applied for palm vein recognition to obtain the first decision.

  • A CNN model for fast deep learning classification system is employed for palm vein recognition to obtain the second decision.

  • Decision-level fusion of two strong decisions obtained by texture-based and CNN-based approaches is conducted for boosting palm vein recognition.

The rest of the paper is organized as follows. Section 2 gives an overview of related research on palm vein as a biometric trait. Section 3 describes the proposed method with a step-by-step explanation. Experiments are presented in Sect. 4, while the conclusion and future work are given in Sect. 5.

2 Literature review

Most of the existing palm vein recognition studies in the literature use palm vein images captured by infrared or near-infrared cameras [6, 10], and a variety of feature extraction methods have been employed, from hand-crafted descriptors to artificial neural networks [11, 12].

In 2011, the first publicly available palm vein and wrist vein database, called the PUT database, was introduced by Kabacinski and Kowalski [7]. The authors presented experimental results as within-series and between-series comparisons for the collected images. The Equal Error Rate (EER) for the within-series comparison was reported as 1.1%, which was better than the 3.8% EER obtained for the between-series comparison. Lee [13] constructed a near-infrared camera-based device in 2012 to capture palm vein images and extracted features from them with a 2-D Gabor filter, reporting an accuracy of 99.18% and an EER of 1.82%. An adaptive Gabor filter was used in the study of Han and Lee [14], where the appropriate parameters for the Gabor filter were selected at different orientations and frequencies, as an improvement on [13] where the parameters had to be fixed in advance. The highest reported accuracy is 99.38%.

Palm vein recognition under spoof attacks of the print-attack category was studied by Tome and Marcel in 2015 [8]. The VERA palm vein database was introduced in that study, and experimental results were reported for two different regions of interest, with the best EER given as 3.33%.

Shah et al. [15] developed a low-cost system consisting of a web camera and infrared LED illumination for acquiring vein images, reporting an accuracy of 93.54%. Similarly, a palm vein modality for access control in a multilayered security system, where Principal Component Analysis (PCA) and template matching were used as the verification algorithms, was proposed by Athale et al. [16]; they concluded that the system can successfully authenticate subjects with an average accuracy of 92.00%.

Moreover, Zhang et al. [9] introduced a large-scale contactless palm vein dataset in 2018, collected at Tongji University. The authors presented a deep convolutional neural network (DCNN)-based palmprint and palm vein recognition system and reported results for both palmprint and palm vein identification. Palm vein identification accuracy was reported as 100%, and the verification EER as 2.30%.

The aforementioned studies on palm vein recognition are summarized and presented in chronological order in Table 1 with the details of each study.

Table 1 Comparison of palm vein recognition systems

3 Proposed method

In this study, we propose a palm vein recognition system that combines two decisions from two different methods. In the first method, BSIF is used to extract features from five overlapping sub-regions of palm vein images. Texture-based feature extraction methods have been used extensively in different research areas, such as the classification of brain tumors from magnetic resonance imaging (MRI) [17]. BSIF was recently reported to be the most powerful hand-crafted texture-based method for vein recognition [6]. Additionally, BSIF achieved the best performance for gender and texture classification among 13 variants of the Local Binary Patterns approach [18]. The second method is a CNN-based approach proposed in [6], which was modeled after AlexNet [4]. The two decisions are fused to obtain the final decision.

Palm vein recognition systems generally involve a pre-processing stage for the enhancement of vein images, a feature extraction method that extracts features from the enhanced images, and finally a matching stage that compares the extracted features and makes the identification decision. Accordingly, we propose a novel palm vein recognition system that divides the ROI of a palm vein image into five overlapping sub-regions and applies histogram equalization for image enhancement. Features are then extracted separately for each of the five sub-regions. The matching scores of the sub-regions are combined, and classification is performed with a k-Nearest Neighbor (k-NN) classifier to obtain Decision I.

Decision II of the system is obtained from the CNN-based approach, which is applied to the whole palm vein image. The proposed method fuses the two decisions obtained by the texture-based and CNN-based approaches using a decision-level fusion technique to obtain the final decision of the system. The stages of the proposed method are shown in Fig. 2 and explained in the following subsections.

Fig. 2

Block diagram of the proposed method

3.1 Pre-processing

The pre-processing step consists of three stages: ROI cropping, sub-region definition and image enhancement. ROI cropping removes the background and the parts of the hand, such as the fingers, the wrist and the crests between the fingers, that are not needed in this experiment. The ROI is cropped semi-automatically for all sample images by determining the dimensions that best present the region of interest for each sample. Consequently, the ROI dimensions are not the same for all sample images in a dataset, due to variation in hand size and hand position. Cropping is therefore done in two repeating phases: the first phase is automatic segmentation and the second is manual inspection for images that are badly segmented, either due to rotation or hand size. The badly segmented images are automatically re-segmented with different dimensions, and these phases are repeated until all images are properly cropped.

The proposed sub-region definition is carried out in the next stage. The regions are five overlapping parts of the image, as shown in Fig. 2. The first step is resizing all images to the same size, since images may differ in size after the cropping stage. The image dimensions are then used to divide it into quarters, and each quarter is enlarged to overlap about 10% into its neighboring quarters. A window of the same size as the adjusted quarters is defined at the middle of the image as well, completing the five overlapping sub-regions.
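For illustration, the following NumPy sketch shows one possible way to define the five overlapping sub-regions. The 10% overlap fraction follows the description above, while the example ROI size is an assumption chosen only for the demonstration.

```python
import numpy as np

def five_subregions(roi, overlap=0.10):
    """Split an ROI image into four corner quarters enlarged by `overlap`
    into their neighbours, plus a centre window of the same size."""
    h, w = roi.shape[:2]
    # Quarter size enlarged by the overlap fraction (assumed 10%).
    sh, sw = int(h / 2 * (1 + overlap)), int(w / 2 * (1 + overlap))
    top_left     = roi[:sh, :sw]
    top_right    = roi[:sh, w - sw:]
    bottom_left  = roi[h - sh:, :sw]
    bottom_right = roi[h - sh:, w - sw:]
    # Centre window of the same size as the adjusted quarters.
    cy, cx = (h - sh) // 2, (w - sw) // 2
    middle = roi[cy:cy + sh, cx:cx + sw]
    return [top_left, top_right, bottom_left, bottom_right, middle]

# Example usage on a dummy ROI of 450x500 pixels (rows x columns).
roi = np.zeros((450, 500), dtype=np.uint8)
subregions = five_subregions(roi)
print([r.shape for r in subregions])
```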

The third pre-processing stage is image enhancement using histogram equalization, an image processing technique that improves the contrast of the ROI images.
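As a brief illustration, histogram equalization of an ROI image can be applied with OpenCV as in the sketch below; the use of OpenCV and the file name are assumptions for the example, not details stated in the paper.

```python
import cv2

# Load an ROI image in grayscale (file name is hypothetical).
roi = cv2.imread("palm_roi.png", cv2.IMREAD_GRAYSCALE)

# Spread the gray-level histogram to improve contrast before feature extraction.
enhanced = cv2.equalizeHist(roi)
```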

3.2 Texture-based feature extraction

The BSIF method used for texture-based feature extraction was proposed by Kannala and Rahtu [3] for face recognition and texture classification. The method was inspired by the Local Binary Patterns (LBP) and Local Phase Quantization (LPQ) approaches [3]. LBP and LPQ can be seen as statistics of labels computed in local pixel neighbourhoods through filtering and quantization. These methods describe each pixel's neighbourhood by a binary code which is obtained by first convolving the image with a manually predefined set of linear filters and then binarizing the filter responses. The bits in the code string correspond to the binarized responses of the different filters. These methods have proven very effective, showing good results in a variety of computer vision problems.

The idea of BSIF is to automatically learn a fixed set of filters from a small set of natural images instead of using hand-crafted filters as in LBP and LPQ. The BSIF approach to palm vein representation therefore relies on learning, rather than manual tuning, to obtain a statistically meaningful representation of the palm vein data. This enables efficient information encoding using simple element-wise quantization, as used for fingerprints in [3].

The histograms of the pixels' BSIF code values are used to characterize the texture properties within each palm vein sub-region. The value of each element (i.e. bit) in the BSIF binary code string is computed by binarizing the response of a linear filter with a threshold at zero. Each bit is associated with a different filter, and the desired length of the bit string determines the number of filters used [8].

The BSIF approach in [3] considers an image patch X of size \(l \times l\) pixels and a linear filter \(W_i\) of the same size. The filter response \(s_i\) is then obtained by (1) as follows:

$$\begin{aligned} s_i = \sum _{u,v} W_i(u,v)\,X(u,v) = w_i^T x \end{aligned}$$
(1)

where the vectors \(w_i\) and x contain the pixels of \(W_i\) and X, respectively. The binarized feature \(b_i\) extracted by BSIF is then obtained by setting \(b_i = 1\) if \(s_i > 0\) and \(b_i = 0\) otherwise. The filters \(W_i\) are learnt using Independent Component Analysis (ICA) by maximizing the statistical independence of the \(s_i\) [19].
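A minimal sketch of this computation is given below, assuming a set of pre-learnt BSIF filters stacked in a NumPy array; the filter bank contents, the 8-bit code length and the 7x7 filter size are assumptions for the example (pre-trained ICA filters are distributed with the original BSIF implementation).

```python
import numpy as np
from scipy.signal import convolve2d

def bsif_histogram(image, filters):
    """Compute a BSIF code image and its histogram.

    image   : 2-D grayscale array (one palm vein sub-region)
    filters : array of shape (n_bits, l, l) with pre-learnt ICA filters
    """
    n_bits = filters.shape[0]
    codes = np.zeros(image.shape, dtype=np.int32)
    for i in range(n_bits):
        # Filter response s_i (Eq. 1), binarized at zero to give b_i.
        s_i = convolve2d(image, filters[i], mode="same")
        codes += (s_i > 0).astype(np.int32) << i
    # The histogram of code values characterizes the sub-region texture.
    hist, _ = np.histogram(codes, bins=2 ** n_bits, range=(0, 2 ** n_bits))
    return hist.astype(np.float64)

# Example with random placeholder filters (real filters would be ICA-learnt).
rng = np.random.default_rng(0)
filters = rng.standard_normal((8, 7, 7))
sub_region = rng.integers(0, 256, size=(247, 275)).astype(np.float64)
feature = bsif_histogram(sub_region, filters)
```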

Fig. 3

CNN architecture

3.3 Matching

The features extracted in the feature extraction stage for all training and test ROI images are compared in the matching stage. Each test feature vector is matched against its training counterparts to generate scores using the Manhattan distance. The Manhattan distance d measures the dissimilarity between two features produced by the feature descriptor; for a feature p1 located at (x1, y1) and a feature p2 located at (x2, y2), it is calculated as

$$\begin{aligned} d = \vert x_1 - x_2 \vert + \vert y_1 - y_2 \vert \end{aligned}$$
(2)
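In practice the BSIF features are histograms rather than 2-D points, so the same Manhattan (L1) distance is applied element-wise over the feature vectors. The sketch below illustrates this generalization of (2); the function and variable names are hypothetical.

```python
import numpy as np

def manhattan_distance(feat_a, feat_b):
    """L1 distance between two BSIF feature histograms (Eq. 2 generalized
    to vectors): the sum of absolute element-wise differences."""
    return np.sum(np.abs(feat_a - feat_b))

def match_scores(test_feat, train_feats):
    """Match one test feature against a gallery of training features."""
    return np.array([manhattan_distance(test_feat, t) for t in train_feats])
```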

3.4 Score-level fusion and classification

Fusion can be performed at different levels in multimodal biometric systems. Score-level fusion is one of the most popular and easy-to-use strategies for combining multiple traits [1, 2]. In this study, we applied score-level fusion to the scores obtained after matching each of the five sub-region features.

Score-level fusion of different traits requires score normalization to ensure that the scores are on a common scale. However, in this study the features are extracted from the same ROI image using the same algorithm, so they are already on a common scale and normalization is not required.

Match scores of all sub-regions are integrated in score-level fusion to produce a single match score vector which is used at the classification stage to classify sample test images.

A k-NN classifier is used to decide which training image most resembles the test image, producing Decision I of the system. k-NN was selected because the classes can naturally be set to the subjects in a dataset, and distances between fused scores can be computed with the Manhattan distance. The method is relatively fast because of its simple computation, new data can be added seamlessly since no training is required, and it remains effective on large datasets.
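The sketch below shows one simple way to realize this score-level fusion and nearest-neighbour decision: per-sub-region Manhattan distances are summed into a single fused score per training sample, and the label of the smallest fused score is returned. The unweighted sum and the 1-NN decision are assumptions made for the illustration.

```python
import numpy as np

def decision_one(test_feats, train_feats, train_labels):
    """test_feats  : list of 5 sub-region histograms for one test image
    train_feats : array (n_train, 5, feat_dim) of training histograms
    train_labels: array (n_train,) of subject identities"""
    # Manhattan distance per sub-region, then score-level fusion by summing.
    fused = np.zeros(len(train_labels))
    for r in range(5):
        fused += np.abs(train_feats[:, r, :] - test_feats[r]).sum(axis=1)
    # Nearest-neighbour decision on the fused scores (Decision I).
    return train_labels[np.argmin(fused)]
```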

3.5 CNN-based feature extraction and classification

A deep learning CNN architecture modeled in line with the AlexNet structure [4] was used to obtain Decision II. The structure has five convolution layers, as in AlexNet, but with fewer filters to reduce computation time, as shown in Fig. 3.

Each convolution layer maps values from its input using a set of filters. The feature map is computed by (3), where the input image is denoted by f and the filter by h. The row and column indexes of the resulting matrix are denoted by m and n, respectively. The feature map G[m,n] is given as follows:

$$\begin{aligned} G[m,n] = (f * h)[m,n] = \sum _{j} \sum _{k} h[j,k] \, f[m-j,n-k] \end{aligned}$$
(3)

The activation function for each convolution layer is a Rectified Linear Unit (ReLU), given by \(y = \max(0, x)\), which is the identity for all positive values and zero for all negative values. Batch normalization is performed so that the data resembles a normal distribution, using (4):

$$\begin{aligned} y_{i} = \frac{x_{i}-\mu _{B}}{\sqrt{\sigma _{B}^{2} + \varepsilon }} \end{aligned}$$
(4)

where \(\mu _{B}\) is the mean of a training batch and \(\sigma _{B}^{2}\) is its variance. A max pooling layer follows each convolution layer; it downsizes its input by taking the maximum of each \(2 \times 2\) region, creating an output matrix where each element is the maximum of a region in the original input. The architecture is completed with a dropout layer and a fully connected layer with Softmax as the activation function, which normalizes the input values to (0,1) using (5):

$$\begin{aligned} y_i = \frac{e^{x_{i}}}{\sum _{j=1}^{K} e^{x_{j}}} \end{aligned}$$
(5)

where the exponential of each input \( x_{i} \) is divided by the sum of the exponentials of all K input values.
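The sketch below outlines a Keras model following this description: five convolution blocks with ReLU, batch normalization and 2x2 max pooling, followed by dropout and a softmax fully connected layer. The filter counts, kernel sizes, input size and dropout rate are assumptions chosen for the illustration, since the paper only states that fewer filters than AlexNet are used.

```python
from tensorflow.keras import layers, models

def build_palm_vein_cnn(num_classes, input_shape=(128, 128, 1)):
    """AlexNet-like CNN with five convolution blocks but fewer filters
    (all layer sizes below are illustrative assumptions)."""
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (16, 32, 64, 64, 32):          # five convolution blocks
        model.add(layers.Conv2D(filters, (3, 3), padding="same"))
        model.add(layers.Activation("relu"))      # ReLU activation
        model.add(layers.BatchNormalization())    # batch normalization, Eq. (4)
        model.add(layers.MaxPooling2D((2, 2)))    # 2x2 max pooling
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.5))                # dropout layer
    model.add(layers.Dense(num_classes, activation="softmax"))  # Eq. (5)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```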

3.6 Decision-level fusion

The final decision of the proposed method is obtained by fusing the two decisions, namely Decision I and Decision II, using a weighted OR rule. Each decision returns either a correct recognition (True) or an incorrect recognition (False). A True decision is given a weight of 1, while a False decision is assigned 0. The weights of the two decisions are summed and compared against a threshold set at 0.9 to obtain the final decision.
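As described, this rule accepts a sample whenever at least one of the two decisions is True, since a weight sum of 1 or 2 exceeds the 0.9 threshold. The small sketch below makes this explicit; the function and variable names are hypothetical.

```python
def fuse_decisions(decision_i: bool, decision_ii: bool, threshold: float = 0.9) -> bool:
    """Weighted OR rule: True -> weight 1, False -> weight 0;
    the fused decision is True if the summed weights exceed the threshold."""
    weight_sum = int(decision_i) + int(decision_ii)
    return weight_sum > threshold
```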

4 Experimental analysis

Experiments are conducted on a computer with a Windows 7 (64-bit) operating system, an Intel Core i5 processor and 16 GB RAM. The details of the experiments are explained in the following subsections.

4.1 Palm vein databases

Five palm vein datasets were employed in this paper: the CASIA, FYO, VERA, PUT and Tongji contactless palm vein datasets.

The CASIA Multi-Spectral Palmprint Image Database [5] contains 7200 palm images captured from 100 unique subjects with a multispectral imaging device. The images are stored as 8-bit gray-level JPEG files. A total of 36 images were captured from each hand in two sessions. Each sample contains six palm images captured at the same time under six different electromagnetic spectra. The wavelengths of the illuminator corresponding to the six spectra are 460 nm, 630 nm, 700 nm, 850 nm, 940 nm and white light, respectively.

The FYO Multimodal Vein Database [6] consists of dorsal, palmar and wrist vein images, with a total of 1920 images taken from both hands of 160 subjects in two sessions: 640 for each of the three traits. The database also contains a folder of generated images obtained by producing 10 images from each ROI image with the Keras data generator, giving a total of 6400 images for each of the three traits. We only used the palm vein images (the FYOPV dataset) from this database.

The PUT vein pattern database [7] consists of 2400 images, half of which are palm vein images while the other half are wrist vein patterns. These were acquired from both hands of 50 students, amounting to 100 different patterns each for palm and wrist; only the palm vein images were used in this study.

The Tongji contactless palm vein dataset [9] is a large-scale database whose images were collected from 300 volunteers, composed of 192 males and 108 females aged between 20 and 50 years. Image acquisition was carried out in two separate sessions. In each session, ten images (palm vein and palmprint) were taken from each palm, amounting to 40 images per person across both sessions. In total, the database therefore contains 12,000 images captured from 600 different palms. We only used Session 2 of this database, since only that session contains palm vein images.

The VERA palm vein database [8] was provided by the Idiap Research Institute in Martigny and the Haute Ecole Spécialisée de Suisse Occidentale in Sion, Switzerland. It consists of 2200 images taken from 110 different people. Five images of each hand were taken from every volunteer in two sessions. The volunteers comprise 40 women and 70 men, aged between 18 and 60 with an average age of 33. Image acquisition was carried out in two different locations: the first 78 individuals at one location and the remaining 32 at another.

4.2 Preliminary experiments

A palm vein recognition system operating on the whole ROI image with the BSIF feature descriptor was implemented first. Then, experiments were carried out on the sub-regions of the image: the middle sub-region and the top-left, top-right, bottom-left and bottom-right corner sub-regions. They are defined as follows: the ROI image is set to \(500 \times 450\) pixels and each sub-region to \(300 \times 250\) pixels. The middle sub-region is a \(300 \times 250\)-pixel window centered on the image, and the other sub-regions are \(300 \times 250\)-pixel windows shifted to the top-left, top-right, bottom-left and bottom-right corners of the ROI image.

Preliminary experiments were conducted with the BSIF feature descriptor, considering the entropy H given by (6):

$$\begin{aligned} H= -\sum _{k=1}^{K} p_k \mathrm{log}_2(p_k) \end{aligned}$$
(6)

where K is the number of gray levels and \(p_k\) is the probability associated with gray level k. The entropy, computed for each sub-region, showed that each sub-region carries a different amount of information, as shown in Table 2, where the entropies of each sub-region are summed over all training sample images. The lowest entropy value is obtained for the middle sub-region, which implies an almost stable sub-region [21]. The strength of the middle sub-region was tested on all the datasets used, and the corresponding accuracies are shown in row 2 of Table 3. These results, together with the entropy given by (6), showed that experimenting on each region and fusing their results would improve the system.
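A short sketch of how the entropy of a sub-region can be computed from its gray-level histogram is shown below; the 256 gray levels are an assumption for 8-bit images.

```python
import numpy as np

def gray_level_entropy(region):
    """Shannon entropy (Eq. 6) of an 8-bit grayscale sub-region."""
    hist, _ = np.histogram(region, bins=256, range=(0, 256))
    p = hist / hist.sum()                 # probability of each gray level
    p = p[p > 0]                          # skip empty bins (0 * log 0 = 0)
    return -np.sum(p * np.log2(p))
```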

Table 2 Sum of sub-region entropies
Table 3 Accuracy for experimented methods

4.3 Experiments on the proposed method

In order to fuse the different sub-regions, scores are obtained by matching the test-sample features of each sub-region individually against their corresponding training sets. The scores are combined by score-level fusion, and the decision (Decision I of the proposed method) is determined for each test sample with the k-NN classifier. In order to compare different fusion techniques with the proposed system, we also implemented feature-level and decision-level fusion; the corresponding results on the datasets are presented in Table 3. Similarly, the aforementioned CNN model was trained with approximately 80 percent of the data for each dataset. Predictions for the test sets are obtained as Decision II, and the accuracy of this CNN system is also shown in Table 3. The last step is the fusion of Decisions I and II, i.e. the proposed method, shown at the end of Table 3.

Table 4 Average training and testing time (seconds)

The accuracies presented in Table 3 demonstrate that the proposed method achieves the best accuracy on all datasets, reaching 100% on the FYO, VERA and Tongji databases. From these results, we conclude that score-level fusion is the best method for combining the texture-based sub-regions in palm vein recognition, and that the proposed method achieves the best accuracies compared to the other methods on all palm vein datasets. A run-time analysis of the model is also presented in Table 4. The CNN model was implemented in Python, while BSIF feature extraction and identification with k-NN were implemented in MATLAB. The algorithm runs offline.

4.4 Comparison with the state-of-the-art

Finally, we compared our results with state-of-the-art methods in Table 5, adapted from [9], where a large-scale palm vein image database comprising 12,000 images acquired from 600 different palms was introduced. A more recent study in which palmar, dorsal and wrist vein datasets were introduced and used in a multimodal biometric system [6] is also included. The comparison was performed by taking the accuracies of the state-of-the-art methods reported in [9] on the same dataset and by running BSIF and our proposed method on that dataset. The results in Table 5 show that the proposed method outperforms most of the state-of-the-art palm vein recognition methods and achieves 100% accuracy, as in [9].

The proposed unimodal biometric system is robust and compares favorably with multimodal systems such as the method proposed in [6]. Since acquiring more than one trait is more expensive in practice, the proposed system is also the more practical choice.

Table 5 Comparison with the state-of-the-art methods

5 Conclusion

In this paper, we introduced a novel method for palm vein recognition that combines a texture-based system with a CNN-based architecture using decision-level fusion. Features of overlapping palm vein image sub-regions are extracted by the texture-based BSIF algorithm and combined by score-level fusion. Additionally, a deep learning CNN architecture was modeled in line with the AlexNet structure, but with fewer filters. The decision from the CNN and the decision from the five BSIF sub-regions are fused to obtain the final decision. The proposed method was implemented on five palm vein databases with various strategies for proper comparison. For all the datasets used, our proposed method achieved the best accuracies compared to the other methods implemented in this study. Moreover, we compared the proposed method with state-of-the-art palm vein recognition systems evaluated on the Tongji Contactless Palm Vein Dataset and the FYO Palm Vein (FYOPV) Dataset; the proposed method compared favorably, reaching 100% accuracy. Future work will focus on applying the proposed method to hand dorsal vein and wrist vein images for biometric authentication.