1 Introduction

One active area of ‘soft biometrics’ research involves classifying the gender of a person from a biometric sample. Most work on gender classification has involved the analysis of face or periocular images. Various types of classifiers have been used for gender classification after feature extraction and selection. The classifiers that have yielded the highest classification accuracy in gender-from-face are AdaBoost, RBF networks, and Support Vector Machines (SVM) [1, 20, 22, 24, 25, 31]. Recently, Levi et al. [19] proposed automatic face age and gender classification using a simple convolutional network architecture that can be used when the quantity of learning data is limited. Also, Juefei-Xu et al. [15] classified gender from face images with occlusion and low resolution using a progressively trained convolutional neural network with attention.

Gender classification using iris information is a rather new topic, with only a few papers published [2, 17, 26, 28]. Most gender classification methods reported in the literature use all iris texture features for classification. As a result, gender-irrelevant information may be fed into the classifier, which can result in poor generalization, especially when the training set is small. It has been shown both theoretically and empirically that reducing the number of irrelevant or redundant features increases the learning efficiency of the classifier [21].

In terms of comparing iris codes for identity recognition, iris codes of different individuals, and even of the left and right eyes of the same individual, have been shown to be independent. At the same time, several authors have reported that, using an analysis of iris texture different from that used for identity recognition, it is possible to classify the gender of the person with an accuracy much higher than chance.

There are several reasons why gender-from-iris is an interesting and potentially useful problem [27]. One possible use arises in searching an enrolled database for a match. If the gender of the sample can be determined, then it can be used to order the search and reduce the average search time. Another possible use arises in social settings where it may be useful to screen entry to some area based on gender, but without recording identity. Gender classification is also important for demographic information collection, marketing research, and real-time electronic marketing. Another possible use is in high-security scenarios, where there may be value in knowing the gender of the people who attempt entry but are not recognized as any of the enrolled persons. And, at a basic science level, it is of value to more fully understand what information about a person can be extracted from analysis of their iris texture.

Thomas et al. [28] were the first to explore gender-from-iris, using images acquired with an LG 2200 sensor. They segmented the iris region, created a normalized iris image, and then applied a log-Gabor filter to the normalized image, employing machine learning techniques to develop models that predict gender from the iris texture features. They ran a rigorous quality check to discard images of poor quality, such as those with motion blur or poor focus. In addition to the log-Gabor texture features, they used seven geometric features of the pupil and iris, and were able to reach an accuracy close to 80%.

Lagree et al. [17] experimented with iris images acquired using an LG 4000 sensor. They computed texture features separately for eight five-pixel horizontal bands, running from the pupil-iris boundary out to the iris-sclera boundary, and ten 24-pixel vertical bands of a \(40\times 240\) image. Unlike Thomas et al. [28], they did not process the normalized image with the log-Gabor filters used by the IrisBEE software [23] to create the ‘iris code’ for biometric purposes, and they did not use any geometric features in the models that predict gender and ethnicity from the iris texture features. This approach reached an accuracy close to 62% for gender and close to 80% for ethnicity.

Bansal et al. [2] experimented with iris images acquired with a Cross Match SCAN-2 dual-iris camera. A statistical feature extraction technique based on the correlation between adjacent pixels was combined with a 2D wavelet tree based feature extraction technique to extract significant features from the iris image. This approach reached an accuracy of 83.06% for gender classification. Nevertheless, the database used in this experiment was very small (300 images) compared to other studies published in the literature.

Tapia et al. [26] experimented with iris images acquired using an LG 4000 sensor to classify the gender of a person based on analysis of the iris texture. They used different implementations of Local Binary Patterns (LBP) on the iris image, taking the occlusion mask into account. Uniform LBP with concatenated histograms significantly improved the accuracy of gender prediction relative to using the whole iris image. Using a non-subject-disjoint test set, they were able to achieve over 91% correct gender prediction using the texture of the left iris.

Costa-Abreu et al. [8] explored the gender prediction task with three approaches: using only geometric features, using only texture features, and using both geometric and texture features extracted from iris images. This work used the BioSecure Multimodal DataBase (BMDB), whose images were taken with an LG Iris Access EOU-3000. They were able to achieve 89.74% correct gender prediction using the texture of the iris. Nevertheless, the dataset is not publicly available and the authors selected the images; subjects were not allowed to wear spectacles, although contact lenses were allowed. Our database is a more realistic representation.

Bobeldik et al. [4] explored the gender prediction accuracy of four different regions of NIR iris images: the extended ocular region, the iris-excluded ocular region, the iris-only region, and the normalized iris-only region. They used the Binarized Statistical Image Features (BSIF) texture operator [16] to extract features from these regions. The ocular region reached the best performance, with 85.7%, while the normalized images exhibited the worst performance (65%), almost a 20% difference with respect to the ocular region. This suggests that the normalization process may be filtering out useful information.

Recently, Tapia et al. [27] predicted gender directly from the same binary iris code that could be used for recognition. They found that information for gender prediction is distributed across the iris, rather than localized in particular concentric bands. They also found that using selected features representing a subset of the iris region achieves better accuracy than using features representing the whole iris region, reaching 89% correct gender prediction using the fusion of the best iris-code features from the left and right eyes.

A summary of these methods’ experimental setup is presented in Table 9.1.

Table 9.1 Gender classification summary of previously published papers. N represents Normalized Image, E represents Encoded Image, and P represents Periocular Image

2 Deep-Learning Models

In this chapter, we first propose an unsupervised approach that uses a large quantity of unlabeled iris images to pretrain a Deep Belief Network [13], which is then used for fast fine tuning of a Deep Multilayer Neural Network (MLP) for gender classification. A second approach uses a Lenet-5 [18] Convolutional Neural Network (CNN) model to improve gender classification from normalized Near-Infrared (NIR) iris images. We then compare and discuss the results obtained by both methods.

2.1 Semi-supervised Method

In the iris biometric scenario we usually have a lot of information without labels (age, gender, ethnicity), because iris databases were created with a focus on iris identification using encoded or periocular images, and not on soft biometric problems such as classifying gender-from-iris. Thus, we commonly have databases with iris images taken from the same subject across several sessions, many of which do not have gender information available. Because of this, it can be a hard task to create a large person-disjoint dataset to properly train and evaluate soft biometric iris algorithms. The unsupervised DBN algorithm may help to take advantage of such unlabeled iris images.

2.1.1 Deep Belief Network

A Deep Belief Network (DBN) consists of multiple layers of stochastic latent variables trained using an unsupervised learning algorithm, followed by a supervised learning phase using a feed-forward back-propagation Multilayer Neural Network (MLP) [13]. In the unsupervised pretraining stage, each layer is trained as a Restricted Boltzmann Machine (RBM). Unsupervised pretraining is an important step when solving a classification problem with large amounts of highly variable, unlabeled data, which is the main concern in this proposal. A DBN is a graphical model in which the neurons of a hidden layer are conditionally independent of one another for a particular configuration of the visible layer, and vice versa. A DBN can be trained layer-wise by iteratively maximizing the conditional probability of the input (visible) vectors given the hidden vectors and a particular set of layer weights. This layer-wise training helps to improve the variational lower bound on the probability of the input training data, which in turn leads to an improvement of the overall generative model. DBNs are graphical models that learn to extract a deep hierarchical representation of the training data. They model the joint distribution between an observed vector x and the l hidden layers \(h^k\) as follows:

$$\begin{aligned} P(x,h^1,\ldots ,h^l)=\left( \prod _{k=0}^{l-2} P(h^k \mid h^{k+1})\right) P(h^{l-1},h^l) \end{aligned}$$
(9.1)

where \(x=h^0\), \(P(h^{k-1}\mid h^k)\) is the conditional distribution for the visible units conditioned on the hidden units of the RBM at level k, and \(P(h^{l-1},h^l)\) is the visible-hidden joint distribution in the top-level RBM. This is illustrated in Fig. 9.1.

Fig. 9.1
figure 1

Block diagram of the RBM stack. X represents the input layer, h1–h3 the hidden layers

According to Hinton et al. [13], the principle of greedy layer-wise unsupervised training can be applied to DBNs with RBMs as the building blocks for each layer [3]. The process is as follows (a code sketch follows the list):

1. Train the first layer as an RBM that models the raw input \(x = h(0)\) as its visible layer.

2. Use that first layer to obtain a representation of the input that will be used as data for the second layer. Two common solutions exist: this representation can be chosen as the mean activations \(p(h(1) = 1|h(0))\) or as samples of p(h(1)|h(0)).

3. Train the second layer as an RBM, taking the transformed data (samples or mean activations) as training examples (for the visible layer of that RBM).

4. Iterate steps 2 and 3 for the desired number of layers, each time propagating upward either samples or mean values.

5. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions).
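To make the procedure concrete, the following is a minimal NumPy sketch of steps 1–4, assuming Bernoulli-Bernoulli RBMs trained with one step of Contrastive Divergence (CD-1, described below). The layer sizes, learning rate, and random placeholder data are illustrative only and are not the settings used in this chapter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM(object):
    """Minimal Bernoulli-Bernoulli RBM trained with CD-1 (illustrative only)."""
    def __init__(self, n_visible, n_hidden, lr=0.01, seed=0):
        rng = np.random.RandomState(seed)
        self.W = 0.01 * rng.randn(n_visible, n_hidden)
        self.a = np.zeros(n_visible)   # visible biases
        self.b = np.zeros(n_hidden)    # hidden biases
        self.lr = lr
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b)        # P(h=1 | v), cf. Eq. (9.3)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.a)      # P(v=1 | h), cf. Eq. (9.4)

    def cd1_update(self, v0):
        """One CD-1 step on a mini-batch v0 of shape (batch, n_visible)."""
        ph0 = self.hidden_probs(v0)
        h0 = (ph0 > self.rng.rand(*ph0.shape)).astype(float)
        v1 = self.visible_probs(h0)                # one-step reconstruction (mean-field)
        ph1 = self.hidden_probs(v1)
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - v1.T @ ph1) / batch
        self.a += self.lr * (v0 - v1).mean(axis=0)
        self.b += self.lr * (ph0 - ph1).mean(axis=0)

def pretrain_dbn(X, layer_sizes, epochs=10, batch=100):
    """Greedy layer-wise pretraining (steps 1-4): train one RBM per layer, then
    propagate mean activations upward as the 'data' for the next RBM."""
    rbms, data = [], X
    for n_hidden in layer_sizes:
        rbm = RBM(data.shape[1], n_hidden)
        for _ in range(epochs):
            for i in range(0, data.shape[0], batch):
                rbm.cd1_update(data[i:i + batch])
        rbms.append(rbm)
        data = rbm.hidden_probs(data)              # mean activations for the next layer
    return rbms

# Example: pretrain on normalized 20x240 iris images flattened to 4800-dim vectors
X = np.random.rand(1000, 20 * 240)                  # placeholder for unlabeled images
dbn = pretrain_dbn(X, layer_sizes=[1000, 500, 100])
```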

The RBM can be denoted by the energy function E

$$\begin{aligned} E(v,h)=-\sum _{i} a_{i}v_{i} - \sum _{j} b_{j}h_{j} - \sum _{i} \sum _{j} v_{i} w_{i,j} h_{j} \end{aligned}$$
(9.2)

where the RBM consists of a matrix of layer weights \(W = (w_{i,j})\) between the hidden units \(h_{j}\) and the visible units \(v_{i}\). The \(a_{i}\) and \(b_{j}\) are the bias weights for the visible units and the hidden units, respectively. The RBM has the structure of a bipartite graph: it only has inter-layer connections between the hidden and visible layer neurons, and no intra-layer connections within the hidden or visible layers. So, the activations of the visible unit neurons are mutually independent for a given set of hidden unit activations, and vice versa. Hence, by holding either h or v constant, we can compute the conditional distribution of the other as follows [13]:

$$\begin{aligned} P(h_{j} =1 \mid v) = \sigma \left( b_{j} + \sum _{i=1}^{m} w_{i,j} v_{i}\right) \end{aligned}$$
(9.3)
$$\begin{aligned} P(v_{i} =1 \mid h) = \sigma \left( a_{i} + \sum _{j=1}^{n} w_{i,j} h_{j}\right) \end{aligned}$$
(9.4)

where \(\sigma \) denotes the logistic sigmoid function. The training algorithm maximizes the expected log probability assigned to the training dataset V. So, if the training dataset V consists of the visible vectors v, then the objective function is as follows:

$$\begin{aligned} \mathop {\mathrm {arg\,max}}_{W} \; E\left[ \sum _{v \in V} \log P(v)\right] \end{aligned}$$
(9.5)

An RBM is trained using the Contrastive Divergence algorithm [7]. Once trained, the DBN can be used to initialize the weights of the MLP for the supervised learning phase.
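For reference, the CD-1 weight and bias updates implied by Eqs. (9.2)–(9.5) take the standard form below, where \(\epsilon \) is the learning rate, \(\langle \cdot \rangle _{data}\) is an average over the training mini-batch, and \(\langle \cdot \rangle _{recon}\) is the same average computed on the one-step reconstruction (a textbook formulation rather than a quotation from this chapter):

$$\begin{aligned} \Delta w_{i,j} = \epsilon \left( \langle v_{i} h_{j} \rangle _{data} - \langle v_{i} h_{j} \rangle _{recon}\right) , \quad \Delta a_{i} = \epsilon \left( \langle v_{i} \rangle _{data} - \langle v_{i} \rangle _{recon}\right) , \quad \Delta b_{j} = \epsilon \left( \langle h_{j} \rangle _{data} - \langle h_{j} \rangle _{recon}\right) \end{aligned}$$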

In this work, we also focus on fine tuning via supervised gradient descent (Fig. 9.2). Specifically, we use a logistic regression classifier to classify the input x based on the output of the last hidden layer h(l) of the DBN. Fine tuning is then performed via supervised gradient descent of the negative log-likelihood cost function. Since the supervised gradient is only non-null for the weights and hidden layer biases of each layer (i.e., it is null for the visible biases of each RBM), this procedure is equivalent to initializing the parameters of a deep MLP with the weights and hidden layer biases obtained with the unsupervised training strategy. One can also observe that the DBN procedure is very similar to that of Stacked Denoising Autoencoders [30], because both involve the principle of unsupervised layer-wise pretraining followed by supervised fine tuning of a Deep MLP. In this chapter we use the RBM class instead of the denoising autoencoder class. The main remaining problem is to choose the best hyperparameters for the Deep MLP.

Fig. 9.2
figure 2

Block diagram with the main stages of semi-supervised iris gender classification. Pretraining (unsupervised stage) was done with a DBN and fine tuning with a Deep MLP (supervised stage)

Once all layers are pretrained, the network goes through a second stage of training called fine tuning. Here we consider supervised fine tuning, where we want to minimize prediction error on a supervised task. For this, we first add a logistic regression layer on top of the network (more precisely, on the output code of the output layer). We then train the entire network as we would train a Deep MLP. This stage is supervised, since we now use the target class during training.
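As an illustration, the fine-tuning stage can be sketched in Keras as follows, assuming the list of pretrained RBMs from the NumPy sketch above (with weight matrices rbm.W and hidden biases rbm.b). The layer sizes, optimizer, and learning rate are placeholders, and the syntax corresponds to older Keras versions.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

def build_finetune_mlp(rbms, n_classes=2):
    """Deep MLP whose hidden layers are initialized with pretrained RBM weights,
    topped by a logistic-regression (softmax) layer for gender classification."""
    model = Sequential()
    for k, rbm in enumerate(rbms):
        if k == 0:
            layer = Dense(rbm.W.shape[1], activation='sigmoid',
                          input_dim=rbm.W.shape[0])
        else:
            layer = Dense(rbm.W.shape[1], activation='sigmoid')
        model.add(layer)
        layer.set_weights([rbm.W, rbm.b])          # reuse unsupervised weights/biases
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(optimizer=SGD(lr=0.1), loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# mlp = build_finetune_mlp(dbn)
# mlp.fit(X_train, y_train_onehot, epochs=50, batch_size=100,
#         validation_data=(X_test, y_test_onehot))
```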

2.2 Supervised Method

A second approach uses a Lenet-5 [18] Convolutional Neural Network (CNN) model to improve gender classification from normalized Near-Infrared (NIR) iris images.

2.2.1 CNN

Convolutional Neural Networks (CNNs) are biologically inspired variants of MLPs [12]. From Hubel and Wiesel’s early work on the cat’s visual cortex [14], we know the visual cortex contains a complex arrangement of cells. These cells are sensitive to small subregions of the visual field, called receptive fields. The subregions are tiled to cover the entire visual field. These cells act as local filters over the input space and are well-suited to exploit the strong spatially local correlation present in natural images. Additionally, two basic cell types have been identified: simple cells respond maximally to specific edge-like patterns within their receptive field, while complex cells have larger receptive fields and are locally invariant to the exact position of the pattern.

2.2.2 Sparse Connectivity

CNNs exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers. In other words, the inputs of hidden units in layer m are from a subset of units in layer \(m-1\), units that have spatially contiguous receptive fields. We can illustrate this graphically as follows: Imagine that layer \(m-1\) is the input retina. In Fig. 9.3, units in layer m have receptive fields of width 3 in the input retina and are thus only connected to three adjacent neurons in the retina layer. Units in layer \(m+1\) have a similar connectivity with the layer below. We say that their receptive field with respect to the layer below is also 3, but their receptive field with respect to the input is larger (5). Each unit is unresponsive to variations outside of its receptive field with respect to the retina. The architecture thus ensures that the learnt ‘filters’ produce the strongest response to a spatially local input pattern.

Fig. 9.3
figure 3

Representation of sparse connectivity

However, as shown in Fig. 9.3, stacking many such layers leads to (nonlinear) ‘filters’ that become increasingly ‘global’ (i.e., responsive to a larger region of pixel space). For example, the unit in hidden layer \(m+1\) can encode a nonlinear feature of width 5 (in terms of pixel space). One of the first applications of convolutional neural networks (CNNs) is perhaps the LeNet-5 network described in [18] for optical character recognition. Compared to modern deep CNNs, that network was relatively modest due to the limited computational resources of the time and the algorithmic challenges of training bigger networks. Though much potential lay in deeper CNN architectures (networks with more neuron layers), only recently have they become prevalent, following the dramatic increase in computational power due to the availability of Graphics Processing Units (GPUs), the amount of training data readily available on the Internet, and the development of more effective methods for training such complex models. One recent and notable example is the use of deep CNNs for image classification on the challenging Imagenet benchmark [11]. Deep CNNs have additionally been successfully applied to applications including human pose estimation, facial keypoint detection, speech recognition, and action classification. To our knowledge, this is the first report of their application to the task of gender classification from iris images.

2.2.3 Shared Weight

In addition, in CNNs each filter \(h_{i}\) is replicated across the entire visual field. These replicated units share the same parametrization (weight vector and bias) and form a feature map. In Fig. 9.4, we show three hidden units belonging to the same feature map. Weights of the same color are shared, i.e., constrained to be identical. Gradient descent can still be used to learn such shared parameters, with only a small change to the original algorithm: the gradient of a shared weight is simply the sum of the gradients of the parameters being shared, as illustrated in the sketch below. Replicating units in this way allows features to be detected regardless of their position in the visual field. Additionally, weight sharing increases learning efficiency by greatly reducing the number of free parameters to be learnt. These constraints on the model enable CNNs to achieve better generalization on vision problems.

Fig. 9.4
figure 4

Representation of shared weight on CNN
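The following small NumPy example illustrates weight sharing in one dimension: the same three-tap filter is applied at every position of the input, and the gradient of the loss with respect to the shared filter is the sum of the per-position gradients. The numbers are arbitrary.

```python
import numpy as np

x = np.array([0.2, 0.5, 0.1, 0.9, 0.4, 0.7])   # 1-D input signal
w = np.array([0.1, -0.2, 0.3])                  # one shared 3-tap filter

# Forward pass: the same weights are replicated across all valid positions
y = np.array([x[i:i + 3] @ w for i in range(len(x) - 2)])

# Suppose dL/dy is supplied by backpropagation from the layers above
dL_dy = np.ones_like(y)

# Gradient of the shared weight vector: the sum of the gradients at every position
dL_dw = np.zeros_like(w)
for i in range(len(y)):
    dL_dw += dL_dy[i] * x[i:i + 3]
print(dL_dw)
```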

2.2.4 Pooling

Another important concept in CNNs is pooling. Today one of the most used pooling algorithms is Max Pooling, but others can also be used, such as Average, Sum, and Mean Pooling. The pooling layers are used mainly to spatially down-sample the volumes and to reduce the feature maps of previous layers. This chapter uses a max pooling approach.

Max pooling partitions the input image into a set of non-overlapping rectangles (subregions) and, for each subregion, outputs the maximum value. Max pooling is useful for two main reasons: (a) by eliminating non-maximal values, it reduces computation for upper layers; and (b) it provides a form of translation invariance. Imagine cascading a max pooling layer with a convolutional layer. There are eight directions in which one can translate the input image by a single pixel. If max pooling is done over a \(2\times 2\) region, three out of these eight possible configurations will produce exactly the same output at the convolutional layer. When max pooling is applied over a \(3\times 3\) window, this jumps to 5/8. Since it provides additional robustness to position, max pooling is a ‘smart’ way of reducing the dimensionality of intermediate representations within a convolutional neural network.
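A minimal NumPy sketch of non-overlapping \(2\times 2\) max pooling with stride 2 is shown below; for simplicity it assumes the feature map has even dimensions.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling with stride 2 on a 2-D feature map."""
    h, w = fmap.shape
    assert h % 2 == 0 and w % 2 == 0, "this simple sketch assumes even dimensions"
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))
# [[ 5.  7.]
#  [13. 15.]]
```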

2.2.5 Network Architecture

In this work we trained two Lenet-5 networks, named CNN-1 and CNN-2. Our proposed network architecture is used throughout our experiments for gender classification. The network comprises only three convolutional layers and two fully connected layers with a small number of neurons. Our choice of a smaller network design is motivated both by our desire to reduce the risk of overfitting and by the nature of the problem we are attempting to solve (gender classification has only two classes). Only one channel (grayscale images with the occlusion mask) is processed directly by the network. Normalized iris images of size \(20\times 240\) are fed into the network.

Sparse, convolutional layers and max pooling are at the heart of the LeNet family of models. Figure 9.5 shows a graphical depiction of a LeNet model. The three convolutional layers are defined as follows for CNN-1 (a Keras sketch follows the list):

Fig. 9.5
figure 5

Representation of the main stages of the Lenet-5 CNN for iris gender classification. The input is a normalized NIR iris image with the mask shown in yellow

1. 20 filters of size \(1\times 5\times 5\) pixels are applied to the input in the first convolutional layer, followed by a rectified linear operator (ReLU), a max pooling layer taking the maximal value of \(2\times 2\) regions with two-pixel strides, and a local response normalization layer.

2. 50 filters of size \(1\times 10\times 10\) pixels are applied to the input of the second convolutional layer, followed by a rectified linear operator (ReLU), a max pooling layer taking the maximal value of \(2\times 2\) regions with two-pixel strides, and a local response normalization layer.

3. 100 filters of size \(1\times 20\times 20\) pixels are applied to the input of the third convolutional layer, followed by a rectified linear operator (ReLU), a max pooling layer taking the maximal value of \(2\times 2\) regions with two-pixel strides, and a local response normalization layer.

The following fully connected layers are then defined:

4. A first fully connected layer that receives the output of the third convolutional layer and contains 4,800 neurons, followed by a ReLU and a dropout layer (0.2).

5. A second fully connected layer that receives the 3,840-dimensional output of the first fully connected layer and contains 500 neurons, followed by a ReLU and a dropout layer (0.2).

6. Finally, the output of the last fully connected layer is fed to a softmax layer that assigns a probability to each class. The prediction itself is made by taking the class with the maximal probability for the given test image.
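To make the architecture concrete, the following is a minimal sketch of CNN-1 in modern Keras syntax. It is an illustration under assumptions rather than the exact training script: ‘same’ padding is assumed, the local response normalization layers are omitted (Keras has no built-in LRN layer), and the optimizer and loss settings are placeholders.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
# Conv block 1: 20 filters of 5x5 on the 20x240x1 normalized iris image
model.add(Conv2D(20, (5, 5), padding='same', activation='relu',
                 input_shape=(20, 240, 1)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
# Conv block 2: 50 filters of 10x10
model.add(Conv2D(50, (10, 10), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
# Conv block 3: 100 filters of 20x20
model.add(Conv2D(100, (20, 20), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
# Fully connected layers with dropout, then the two-class softmax output
model.add(Flatten())
model.add(Dense(4800, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```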

The first CNN was modified to use more appropriate filter sizes, considering that our images are much wider than they are tall. The three convolutional layers are then defined as follows for CNN-2:

1. 128 filters of size \(1\times 5\times 21\) pixels are applied to the input in the first convolutional layer, followed by a rectified linear operator (ReLU), a max pooling layer taking the maximal value of \(2\times 2\) regions with two-pixel strides, and a local response normalization layer.

2. 256 filters of size \(1\times 5\times 21\) pixels are applied to the input of the second convolutional layer, followed by a rectified linear operator (ReLU), a max pooling layer taking the maximal value of \(2\times 2\) regions with two-pixel strides, and a local response normalization layer.

3. 512 filters of size \(1\times 2\times 45\) pixels are applied to the input of the third convolutional layer, followed by a rectified linear operator (ReLU), a max pooling layer taking the maximal value of \(2\times 2\) regions with two-pixel strides, and a local response normalization layer.

The following fully connected layers are then defined:

4. A first fully connected layer that receives the output of the third convolutional layer and contains 4,800 neurons, followed by a ReLU and a dropout layer (0.5).

5. A second fully connected layer that receives the 2,400-dimensional output of the first fully connected layer and contains 4,800 neurons, followed by a ReLU and a dropout layer (0.5).

6. Finally, the output of the last fully connected layer is fed to a softmax layer that assigns a probability to each class. The prediction itself is made by taking the class with the maximal probability for the given test image.

3 Iris Normalized Images

In this work, we used normalized iris images because the main idea is to extract information from the same processing pipeline used for iris identification, so that gender classification could be used as a secondary stage of information.

The iris feature extraction process involves the following steps: First, a camera acquires an image of the eye. All commercial iris recognition systems use near-infrared illumination, to be able to image iris texture of both ‘dark’ and ‘light’ eyes [26]. Next, the iris region is located within the image. The annular region of the iris is transformed from raw image coordinates to normalized polar coordinates. This results in what is sometimes called an ‘unwrapped’ or ‘rectangular’ iris image. See Fig. 9.6. A texture filter is applied at a grid of locations on this unwrapped iris image, and the filter responses are quantized to yield a binary iris code [5]. Iris recognition systems operating on these principles are widely used in a variety of applications around the world [6, 10, 29].

The radial resolution (r) and angular resolution \((\theta )\) used during the normalization or unwrapping step determine the size of the normalized iris image, and can significantly influence the iris recognition rate. This unwrapping is referred to as Daugman’s rubber sheet model [9]. In this work, we use a normalized image of \(20 (r)\times 240 (\theta )\), created using Daugman’s method and the Osiris implementation. Both implementations also create a segmentation mask of the same size as the normalized image. The segmentation mask indicates the portions of the normalized iris image that are not valid due to occlusion by eyelids, eyelashes, or specular reflections.
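A simplified NumPy sketch of this unwrapping is given below. It assumes concentric circular pupil and iris boundaries and nearest-neighbour sampling; real implementations such as Osiris handle non-concentric boundaries and also produce the occlusion mask. The boundary parameters in the example are arbitrary.

```python
import numpy as np

def rubber_sheet(eye_img, cx, cy, r_pupil, r_iris, out_r=20, out_theta=240):
    """Map the annular iris region to a fixed-size polar (normalized) image.
    Assumes concentric circular boundaries centred at (cx, cy)."""
    norm = np.zeros((out_r, out_theta), dtype=eye_img.dtype)
    thetas = np.linspace(0, 2 * np.pi, out_theta, endpoint=False)
    for i in range(out_r):
        # radial position between the pupil boundary (0) and the iris boundary (1)
        r = r_pupil + (r_iris - r_pupil) * (i / float(out_r - 1))
        xs = (cx + r * np.cos(thetas)).round().astype(int)
        ys = (cy + r * np.sin(thetas)).round().astype(int)
        norm[i, :] = eye_img[np.clip(ys, 0, eye_img.shape[0] - 1),
                             np.clip(xs, 0, eye_img.shape[1] - 1)]
    return norm

# Example with a synthetic 480x640 image and illustrative boundary parameters
eye = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
normalized = rubber_sheet(eye, cx=320, cy=240, r_pupil=40, r_iris=110)
print(normalized.shape)   # (20, 240)
```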

Fig. 9.6
figure 6

Transformation of Cartesian coordinates (x, y) to polar coordinates in order to generate the normalized image

Fig. 9.7
figure 7

Example of iris images captured using an LG 4000 sensor

4 Dataset

The images used in this chapter were taken with an LG 4000 sensor (labeled images) and an IrisGuard AD-100 (unlabeled images). The LG 4000 and AD-100 use near-infrared illumination and acquire a \(480\times 640\), 8-bit/pixel image. Examples of iris images appear in Fig. 9.7. We used two subsets, one for the unsupervised stage and the other for the supervised stage. The image dataset for the supervised stage is person-disjoint.

The whole database consists of two datasets. The first dataset has 10,000 unlabeled images: 5 K left iris images and 5 K right iris images. We do not know the numbers of male and female iris images. We used the fusion of these 10 K images (left and right) for the pretraining (unsupervised) stage. Currently this dataset is not publicly available.

The second dataset has one left eye image and one right eye image for each of 750 males and 750 females, for a total of 3,000 images. Of the 1,500 distinct persons in the dataset, visual inspection of the images indicates that about one-fourth are wearing clear contact lenses. In order to improve the results we used the fusion of the left and right normalized iris images, totalling 3,000 normalized images with their corresponding segmentation masks computed using the Osiris implementation (footnote 1). This dataset is available to other researchers as the GFI gender-from-iris dataset (footnote 2).

A training portion of this 3,000-image dataset was created by randomly selecting 60% of the males and 60% of the females; the remaining 40% was used as the test set. Once parameter selection was finalized, the same validation set used in [27] was used for the final evaluation. This validation dataset contains 1,944 images: three left eye images and three right eye images for each of 175 males and 149 females. It is known that some subjects are wearing clear contact lenses, and evidence of this is visible in some images. Also, a few subjects are wearing cosmetic contact lenses in some images. In order to replicate real-life conditions we did not remove any images containing any type of defect (blur or noise).
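A sketch of the person-disjoint 60/40 split described above is shown below; the subject IDs, gender labels, and random seed are hypothetical placeholders.

```python
import numpy as np

def person_disjoint_split(subject_ids, genders, train_frac=0.6, seed=0):
    """Randomly assign 60% of the males and 60% of the females to training;
    the remaining subjects (and all of their images) form the test set."""
    rng = np.random.RandomState(seed)
    subject_ids = np.asarray(subject_ids)
    genders = np.asarray(genders)
    train_subjects = []
    for g in ('M', 'F'):
        ids = subject_ids[genders == g]          # copy, safe to shuffle
        rng.shuffle(ids)
        train_subjects.extend(ids[:int(train_frac * len(ids))])
    train_mask = np.isin(subject_ids, train_subjects)
    return subject_ids[train_mask], subject_ids[~train_mask]

# Hypothetical example: 1,500 subjects, half male and half female
ids = np.arange(1500)
labels = np.array(['M'] * 750 + ['F'] * 750)
train_ids, test_ids = person_disjoint_split(ids, labels)
print(len(train_ids), len(test_ids))   # 900 600
```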

It is important to note that iris images of different persons, or even the left and right iris images of a given person, may not present exactly the same imaging conditions. The illumination from the LEDs may come from either side of the sensor, specular highlights may be present in different places in the image, the inclination of the head may be different, the eyelid occlusion may be different, and so forth. Also, the training, test, and validation subsets are all independent and person-disjoint in this work.

5 Experiments

Our method is implemented using the Theano (footnote 3) and Keras (footnote 4) open-source frameworks. Training was performed on an Amazon EC2 GPU machine (g2.2xlarge instance) with a GRID K520 and 8 GB of video memory. For the DBN we needed to train two different stages: the pretraining stage with RBMs and the fine tuning with the Deep MLP. The first took 207 min to train and the second 150 min.

Training the convolutional networks required about four hours for CNN-1 (238 s per epoch) and 10 h for CNN-2. Predicting gender on a single normalized image using our network requires about 100 ms for CNN-1 and 178 ms for CNN-2. Training times could conceivably be substantially improved by running the training on image batches.

6 Hyperparameters Selection

In this work, we used grid search to find the best hyperparameters of the DBN + Deep MLP and the Lenet-5 CNNs. We looked at tuning the batch size and the number of epochs used when fitting the networks. The batch size in iterative gradient descent is the number of patterns shown to the network before the weights are updated. It is also an optimization in the training of the network, defining how many patterns are read at a time and kept in memory.

The number of epochs is the number of times that the entire training dataset is shown to the network during training. Some networks, such as Multilayer Perceptron networks and Convolutional Neural Networks, are sensitive to the batch size.

We evaluated a suite of different mini-batch sizes, from n \(=\) 8 to n \(=\) 500 in powers of two for CNN-1 and CNN-2, and from m \(=\) 10 to m \(=\) 100 in steps of 10.
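A sketch of this grid search using scikit-learn’s GridSearchCV together with the (older) Keras scikit-learn wrapper is shown below; the model builder and the candidate values are illustrative placeholders rather than the exact grids used in this chapter.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def create_model():
    # Placeholder model builder; in practice this returns CNN-1, CNN-2 or the Deep MLP
    model = Sequential()
    model.add(Dense(500, activation='relu', input_dim=20 * 240))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
    return model

clf = KerasClassifier(build_fn=create_model, verbose=0)
param_grid = {'batch_size': [8, 16, 32, 64, 128, 256, 500],
              'epochs': [10, 20, 50, 100]}
grid = GridSearchCV(estimator=clf, param_grid=param_grid, cv=3)
# grid_result = grid.fit(X_train, y_train)
# print(grid_result.best_score_, grid_result.best_params_)
```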

Keras and Theano offer a suite of different state-of-the-art optimization algorithms, such as ‘SGD’, ‘RMSprop’, ‘Adagrad’, ‘Adadelta’, ‘Adam’, and others. We tuned the optimization algorithm used to train the network, each with default parameters. The best result was reached with ‘SGD’.

The learning rate parameter controls how much the weights are updated at the end of each batch, and the momentum controls how much the previous update influences the current weight update. We tried a small set of standard learning rates and momentum values ranging from 0.1 to 0.9 in steps of 0.1 for CNN-1 and CNN-2. For the DBN, one learning rate parameter for the pretraining stage and another for the fine tuning stage are needed; both were searched in the range \(1e-1\) to \(1e-5\).
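For reference, the standard SGD-with-momentum update that these two hyperparameters control can be written as follows, where \(\eta \) is the learning rate and \(\mu \) the momentum (a textbook formulation, not a quotation from this chapter):

$$\begin{aligned} v_{t+1} = \mu \, v_{t} - \eta \, \nabla _{w} L(w_{t}), \qquad w_{t+1} = w_{t} + v_{t+1} \end{aligned}$$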

The activation function controls the non-linearity of individual neurons and when they fire. Generally, the rectifier activation function is the most popular, but sigmoid and tanh functions are also widely used and may still be more suitable for certain problems. In this work, we evaluated the suite of different activation functions available. We only used these functions in the hidden layers; a sigmoid activation function in the output layer for the binary classification problem gave the best results with the Deep MLP. For the CNNs, the best results were reached with ReLU and softmax activation functions.

We also looked at tuning the dropout rate for regularization, in an effort to limit overfitting and improve the model’s ability to generalize. To get good results, dropout is best combined with a weight constraint such as the max-norm constraint. This involves fitting both the dropout percentage and the weight constraint. We tried dropout percentages between 0.0 and 0.9 and max-norm weight constraint values between 0 and 5. The best results were reached with a dropout of 0.5.
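As an illustration, attaching a dropout rate and a max-norm weight constraint to a fully connected layer in Keras looks as follows; the layer size and the values (0.5 and 3) are illustrative.

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import max_norm

model = Sequential()
# Hidden layer whose incoming weight vectors are limited to a maximum L2 norm of 3,
# combined with 50% dropout on its activations (values are illustrative)
model.add(Dense(500, activation='relu', input_dim=20 * 240,
                kernel_constraint=max_norm(3.0)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
```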

Table 9.2 Gender classification rates using DBN \(+\) Deep MLP. The numbers of hidden layers are indicated in parentheses. The best results were reached with a batch size of 100. ‘preT-LR’ represents the pre-training stage learning rate, with a value of \(1e-2\), and ‘FineT’ represents the fine tuning learning rate, with a value of \(1e-1\)
Table 9.3 Gender classification rates using CNN-1 with a batch size of 64. LR represents the learning rate

The number of neurons in a layer is also an important parameter to tune. Generally, the number of neurons in a layer controls the representational capacity of the network, at least at that point in the topology. The best numbers of neurons are shown in Tables 9.2 and 9.3.

7 Results

7.1 Semi-supervised Method

The RBM pretraining stage is very sensitive to the batch size and learning rate; therefore we used batch sizes from 1 to 200 and learning rates from 1e-1 to 1e-5, with 10 K images for the pretraining stage and nearly 4,900 images for the fine tuning stage. On average the pretraining stage takes 307 min and fine tuning only 150 min. Table 9.2 shows a summary of the results.

Table 9.4 Gender classification rates using CNN-2 with a batch size of 64. LR represents the learning rate
Table 9.5 Accuracy of our proposed models (*) and of state-of-the-art methods
Table 9.6 Summary of the best gender classification rates. The second column shows results without data augmentation. The third column shows results with data augmentation (DA), in relation to Tables 9.3 and 9.4

7.2 Supervised Method

In order to compare the results with the semi-supervised method, we trained two ‘small’ Lenet-5 CNNs. We consider these networks ‘small’ because we do not have a huge quantity of iris images available, unlike databases such as Deepface, Imagenet, or VGG(16–19). This is one of our main concerns in applying deep-learning techniques to any iris-based soft biometric problem. A future challenge for researchers in this area is to increase the number of soft-biometric-labeled images captured with different sensors in order to train more complex models.

Tables 9.2, 9.3 and 9.4 show a summary of the results using different numbers of parameters. Table 9.5 shows the results of state-of-the-art algorithms compared with our two proposals.

In order to create a larger dataset to train the supervised stage, we used data augmentation by flipping the original iris images vertically and horizontally, tripling the number of images for each iris. We applied this technique only to the best results of Tables 9.3 and 9.4. As future work we propose to study other techniques to analyze the influence of data augmentation on the results. A summary of the results using data augmentation is presented in Table 9.6.
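A sketch of this flipping-based augmentation is shown below: each normalized iris image (and its segmentation mask, so the two stay aligned) yields the original plus a vertical and a horizontal flip, tripling the number of samples per iris.

```python
import numpy as np

def augment_flips(image, mask):
    """Return the original sample plus its vertical and horizontal flips.
    The occlusion mask is flipped in the same way so it stays aligned."""
    return [(image, mask),
            (np.flipud(image), np.flipud(mask)),     # vertical flip
            (np.fliplr(image), np.fliplr(mask))]     # horizontal flip

norm_iris = np.random.rand(20, 240)                   # placeholder normalized image
norm_mask = np.ones((20, 240), dtype=np.uint8)        # placeholder segmentation mask
samples = augment_flips(norm_iris, norm_mask)
print(len(samples))   # 3 samples per iris
```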

8 Conclusion

Our best approach was to use convolutional neural networks trained with labeled iris data, which reached 83.00% without data augmentation and 84.66% with data augmentation, placing it among the top performers relative to other state-of-the-art methods, as shown in Table 9.5. We did not outperform our previous work in [27], but the deep-learning method presented here did not use any preprocessing stages or feature selection methods. Also, this proposal used normalized images rather than the encoded information; according to the results reported by Bobeldik et al. [4], the normalization process may be filtering out useful information. We consider these to be good results given the quantity of images used to train our methods, since deep-learning techniques usually require much larger databases to reach their full potential. We think that this work further validates the hypothesis that the iris contains information about a person’s gender.

It is also important to highlight the results reached by the semi-supervised method, even though it did not reach the best results in our tests. Its accuracy is very encouraging considering that the pretraining was performed with unlabeled iris images.

One of the main contributions of this work is that, to our knowledge, it is the first to use deep-learning techniques (both semi-supervised and supervised) to try to solve the problem of gender-from-iris. Our proposal reached good performance by training a common deep-learning architecture designed to avoid overfitting given the limited availability of labeled iris images. Our network is modest compared to some recent network architectures, thereby reducing the number of its parameters and the chance of overfitting.

Also, we increased the number of images in our training set using data augmentation in order to make a preliminary test of its usefulness in improving the gender recognition process. The resulting techniques were tested using unfiltered iris images and perform comparably to most recent state-of-the-art methods.

Another contribution was to propose using the large quantity of unlabeled iris images available to help initialize the weights of a deep neural network. A larger number of unlabeled images may help to improve the gender classification rate. If we had more images captured with different sensors, we could create a general-purpose gender classification method and understand whether the information present in the iris is dependent on the sensor used. To our understanding, the features present in the iris that make gender classification feasible are independent of the sensor used to capture the image. As a future challenge for all iris researchers, we propose to make an effort to capture more images with labeled soft-biometric data.

Two important conclusions can be made from our results. First, convolutional neural networks can be used to provide competitive iris gender classification results, even considering the much smaller size of contemporary unconstrained image sets labeled for gender. Second, the simplicity of our model implies that more elaborate systems using more training data may be capable of substantially improving results beyond those reported here.