1 Introduction

Facial gender classification has long been one of the most studied soft-biometric topics. Over the past decade, gender classification on constrained faces has nearly been perfected. However, challenges remain on less-constrained faces, such as faces with occlusions, low resolution, and off-angle poses. Traditional methods such as support vector machines (SVMs) and their kernel extensions perform well on this classic two-class problem, as listed in Table 8.8. In this work, we approach the problem from a very different angle. Inspired by the rapid progress of deep convolutional neural networks (CNNs) and attention models, we achieve occlusion- and low-resolution-robust facial gender classification by progressively training the CNN with attention shift. From an orthogonal direction, when images are under severe degradations such as contiguous occlusions, we utilize a deep generative approach to recover the missing pixels so that the facial gender classification performance is further improved on occluded facial images. On one hand, we aim at building a robust gender classifier that is tolerant to image degradations such as occlusions and low resolution; on the other hand, we aim at mitigating and eliminating the degradations through a generative approach. Together, we stride toward pushing the boundary of unconstrained facial gender classification.

Fig. 8.1
figure 1

(Top) Periocular region on human faces exhibits the highest saliency. (Bottom) Foreground object in focus exhibits the highest saliency. Background is blurred with less high-frequency details preserved

1.1 Motivation

Xu et al. [70] proposed an attention-based model that automatically learns to describe the content of images, inspired by recent work in machine translation [3] and object detection [2, 57]. In their work, two attention-based image caption generators were introduced under a common framework: (1) a ‘soft’ deterministic attention mechanism that can be trained by standard back-propagation and (2) a ‘hard’ stochastic attention mechanism that can be trained by maximizing an approximate variational lower bound. The encoder of the models uses a convolutional neural network as a feature extractor, and the decoder is a recurrent neural network (RNN) with long short-term memory (LSTM) architecture in which the attention mechanism is learned. The authors can then visualize how the network automatically fixes its gaze on the salient objects (regions) in the image while generating the image caption word by word.

For facial gender classification, we know from previous work [16, 56] that the periocular region provides the most important cues for determining gender. The periocular region is also the most salient region on human faces, as shown in the top part of Fig. 8.1 using a general-purpose saliency detection algorithm [14]. Similar results can be obtained using other saliency detection algorithms such as [15, 17]. We can observe from the saliency heat map that the periocular region indeed fires most strongly compared to the remainder of the face.

This leads us to the following question:

\(\mathcal {Q}\) : How can we let the CNN shift its attention toward the periocular region, where gender classification has been proven to be the most effective?

The answer comes from our day-to-day experience with photography. If you use a DSLR camera with a large-aperture lens and set the focal point on an object in the foreground, everything behind the object in focus becomes blurred. This is illustrated in the bottom part of Fig. 8.1: the sharp foreground object (cherry blossom in hand) attracts the most attention in the saliency heat map.

Thus, we can control the attention region by specifying where the image is blurred and where it remains sharp. In the context of gender classification, we know that we can benefit from fixing the attention onto the periocular region. Therefore, we ‘force’ which part of the image the network weighs the most by progressively training the CNN on images with increasing blur levels that zoom into the periocular region, as shown in Table 8.1. Since we still want to use a full-face model, we hope that, with this strategy, the learned deep model is at least on par with other full-face deep models while harnessing gender cues in the periocular region.

Table 8.1 Blurred images with increasing levels of blur

\(\mathcal {Q}\) : Why not just use the periocular region crop?

Although, experimentally, the periocular region is the best facial region for gender classification, we still want to draw on other facial parts (e.g., beard and mustache) that provide valuable gender cues. This is especially true when the periocular region is less than ideal; for example, occlusions such as sunglasses could be blocking the eye region, and we want our network to still generalize well and perform robustly even when the periocular region is corrupted.

To strike a good balance between full face-only and periocular-only models, we carry out a progressive training paradigm for the CNN that starts with the full face and progressively zooms into the periocular region by blurring the other facial regions. In addition, we hope that the progressively trained network is sufficiently generalized so that it can be robust to occlusions of arbitrary types and at arbitrary locations.

\(\mathcal {Q}\) : Why blurring instead of blackening out?

We only want to steer the focus rather than completely eliminate the background, as in the DSLR photo example shown in the bottom part of Fig. 8.1. Blackening out would create abrupt edges that confuse the filters during training. When the image is blurred, low-frequency information is still well preserved: one can still recognize the content of the image, e.g., a dog, a human face, or other objects.

Blurring outside the periocular region while leaving the high-frequency details within it preserves the global and structural context of the image and keeps the minute details intact at the region of interest, which helps gender classification, and fine-grained categorization in general.

\(\mathcal {Q}\) : Why not let CNN directly learn the blurring step?

CNN filters operate on the entire image, whereas blurring only part of the image is a pixel-location-dependent operation and is thus difficult to emulate within the CNN framework. Therefore, we carry out the proposed progressive training paradigm to enforce where the network attention should be.

2 Related Work

In this section, we provide relevant background on facial gender classification and attention models.

The periocular region has been shown to be the best facial region for recognition purposes [20,21,22, 24, 27, 31,32,33,34,35,36, 38, 39, 40, 59, 62, 63], especially for gender classification tasks [16, 56]. A few recent works also apply CNNs to gender classification [4, 50]. More related work on gender classification is consolidated in Table 8.8.

Attention models such as the one used for image captioning [70] have gained much popularity only recently. Rather than compressing an entire image into a static representation, attention allows salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image. It also helps in gaining insight and interpreting results by visualizing where the model attends for certain tasks. The mechanism can be viewed as a learnable saliency detector that can be tailored to various tasks, as opposed to traditional ones such as [8, 14, 15, 17].

It is worth mentioning the key difference between soft attention and hard attention. Soft attention is easy to implement: it produces a distribution over input locations, reweights the features, and feeds them as input, and it can attend to arbitrary input locations using spatial transformer networks [19]. Hard attention, on the other hand, attends to a single input location and cannot be optimized with gradient descent; the common practice is to use reinforcement learning.

Other applications involving attention models include machine translation with attention over the input [54]; speech recognition with attention over input sounds [6, 9]; video captioning with attention over input frames [72]; visual question answering with attention over the image itself [69, 75]; and many more [67, 68].

3 Proposed Method

Our proposed method involves two major components: a progressively trained attention shift convolutional neural network (PTAS-CNN) framework for training the unconstrained gender classifier, and a deep convolutional generative adversarial network (DCGAN) for recovering missing pixels in facial images so that gender recognition performance is further improved on images with occlusions.

3.1 Progressively Trained Attention Shift Convolutional Neural Networks (PTAS-CNN)

In this section, we detail our proposed method for progressively training the CNN with attention shift. The entire training procedure involves \((k+1)\) epoch groups, from epoch group 0 to k, where each epoch group corresponds to one particular blur level.

3.1.1 Enforcing Attention in the Training Images

In our experiments, we heuristically choose 7 blur levels, including one with no blur at all. Example images with increasing blur levels are illustrated in Table 8.1. We use a Gaussian blur kernel with \(\sigma = 7\) to blur the corresponding image regions. Doing so conceptually enforces the network's attention in the training images without changing the network architecture.
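A minimal sketch of how one such blur level might be produced is given below. OpenCV/NumPy, the function name `blur_outside_window`, and the specific window coordinates are illustrative assumptions; only the Gaussian \(\sigma = 7\) is taken from the text.

```python
# Sketch: blur everything outside a (progressively shrinking) sharp window.
import cv2
import numpy as np

def blur_outside_window(img, window, sigma=7.0):
    """Blur everything outside `window` = (x0, y0, x1, y1); keep the window sharp."""
    blurred = cv2.GaussianBlur(img, ksize=(0, 0), sigmaX=sigma)  # kernel size derived from sigma
    mask = np.zeros(img.shape[:2], dtype=np.float32)
    x0, y0, x1, y1 = window
    mask[y0:y1, x0:x1] = 1.0
    if img.ndim == 3:
        mask = mask[..., None]
    return (mask * img + (1.0 - mask) * blurred).astype(img.dtype)

# Hypothetical sharp windows for a 168x210 face, shrinking from the full face
# toward the periocular region (one window per blur level / epoch group).
windows = [(0, 0, 168, 210), (10, 20, 158, 190), (20, 40, 148, 160),
           (25, 55, 143, 135), (30, 65, 138, 115), (35, 75, 133, 100),
           (40, 80, 128, 90)]
face = cv2.imread("face_168x210.png")           # hypothetical centered face image
levels = [blur_outside_window(face, w) for w in windows]
```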

3.1.2 Progressive CNN Training with Attention

We employ the AlexNet [46] architecture for our progressive CNN training. AlexNet has 60 million parameters and 650,000 neurons, consisting of 5 convolution layers and 3 fully connected layers with a final 1000-way softmax. To reduce overfitting in the fully connected layers, AlexNet employs “dropout” and data augmentation, both of which are preserved in our training. The main difference is that we only need a 2-way softmax due to the binary nature of the gender classification problem.
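As a concrete illustration, a minimal sketch of a 2-way AlexNet classifier is given below. PyTorch/torchvision and the resizing of inputs to 224×224 are assumptions for illustration, not the chapter's exact implementation.

```python
# Sketch: AlexNet with the final 1000-way layer replaced by a 2-way head.
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(weights=None)        # 5 conv + 3 FC layers, dropout kept
model.classifier[6] = nn.Linear(4096, 2)    # 2-way output for gender classification
criterion = nn.CrossEntropyLoss()           # softmax + negative log-likelihood

x = torch.randn(128, 3, 224, 224)           # one batch (batch size 128, as in the chapter)
logits = model(x)                           # shape: (128, 2)
```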

As illustrated in Fig. 8.2, the progressive CNN training begins with the first epoch group (Epoch Group 0, images with no blur), and the first CNN model \(\mathcal {M}_0\) is obtained and frozen after convergence. We then input the next epoch group to fine-tune \(\mathcal {M}_0\), producing the second model \(\mathcal {M}_1\), with attention enforced through the training images. The procedure is carried out sequentially until the final model \(\mathcal {M}_k\) is obtained. Each \(\mathcal {M}_j (j = 0,\ldots ,k)\) is trained for 1000 epochs with a batch size of 128. At the end of training step j, the model with the best validation accuracy is carried over to the next step \((j+1)\).

Fig. 8.2
figure 2

Progressive CNN training with attention
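The schedule of Fig. 8.2 can be summarized in a short sketch; PyTorch is assumed here (the chapter does not prescribe a framework), and the optimizer settings are illustrative.

```python
# Sketch: progressive training, where model M_j is initialized from M_{j-1},
# fine-tuned on epoch group j (the next blur level), and the checkpoint with
# the best validation accuracy is carried forward.
import copy
import torch

def train_one_epoch(model, loader, optimizer, criterion):
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

def validate(model, loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            correct += (model(images).argmax(1) == labels).sum().item()
            total += labels.numel()
    return correct / total

def progressive_training(model, epoch_groups, val_loader, criterion, n_epochs=1000):
    """epoch_groups[j] is a DataLoader over images at blur level j."""
    for j, loader in enumerate(epoch_groups):          # j = 0, ..., k
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # assumed settings
        best_acc, best_state = 0.0, copy.deepcopy(model.state_dict())
        for _ in range(n_epochs):
            train_one_epoch(model, loader, optimizer, criterion)
            acc = validate(model, val_loader)
            if acc > best_acc:
                best_acc, best_state = acc, copy.deepcopy(model.state_dict())
        model.load_state_dict(best_state)              # carry M_j forward to step j+1
    return model
```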

3.1.3 Implicit Low-Rank Regularization in CNN

Blurring the training images in our paradigm may have further implications. Here we want to show the similarities between low-pass filtering in Fourier analysis and low-rank approximation in the SVD. Through this analysis, we hope to make connections to low-rank regularization in the CNN. We have learned from recent work [65] that enforcing a low-rank regularization and removing the redundancy in the convolution kernels is important and can improve both classification accuracy and computation speed. Fourier analysis involves expansion of the original data \(x_{ij}\) (taken from the data matrix \(\mathbf {X}\in \mathbb {R}^{m\times n}\)) in an orthogonal basis, which is the inverse Fourier transform:

$$\begin{aligned} {x}_{ij} = \frac{1}{m} \sum _{k=0}^{m-1} c_k e^{\mathbf {i} 2 \pi jk/m}. \end{aligned}$$
(8.1)

The connection with SVD can be explicitly illustrated by normalizing the vector \(\{ e^{\mathbf {i} 2 \pi jk/m} \}\) and by naming it \(\mathbf {v}'_k\):

$$\begin{aligned} {x}_{ij} = \sum _{k=0}^{m-1} b_{ik} v'_{jk} = \sum _{k=0}^{m-1} u'_{ik} s'_k v'_{jk}, \end{aligned}$$
(8.2)

which generates the matrix equation \(\mathbf {X} = \mathbf {U}'\varvec{\varSigma }'\mathbf {V}'^\top \). However, unlike the SVD, even though the \(\{\mathbf {v}'_k\}\) are an orthonormal basis, the \(\{\mathbf {u}'_k\}\) are not in general orthogonal. Nevertheless this demonstrates how the SVD is similar to a Fourier transform. Next, we will show that the low-pass filtering in Fourier analysis is closely related to the low-rank approximation in SVD.

Suppose we have N image data samples of dimension d in their original two-dimensional form \(\{\mathbf {x}_1, \mathbf {x}_2, \ldots , \mathbf {x}_N\}\). Let matrix \(\hat{\mathbf {X}}\) contain all the data samples after the 2D Fourier transform \(\mathcal {F}(\cdot )\), in vectorized form:

$$\begin{aligned} \hat{\mathbf {X}} = \big [\mathrm {vec}(\mathcal {F}(\mathbf {x}_1)), \mathrm {vec}(\mathcal {F}(\mathbf {x}_2)), \ldots , \mathrm {vec}(\mathcal {F}(\mathbf {x}_N))\big ]. \end{aligned}$$

Matrix \(\hat{\mathbf {X}}\) can be decomposed using SVD: \(\hat{\mathbf {X}} = \hat{\mathbf {U}}\hat{\varvec{\varSigma }}\hat{\mathbf {V}}^\top \). Without loss of generality, let us assume that \(N=d\) for brevity. Let \(\mathbf {g}\) and \(\hat{\mathbf {g}}\) be the Gaussian filter in the spatial domain and frequency domain, respectively, namely \(\hat{\mathbf {g}} = \mathcal {F}(\mathbf {g})\). Let \(\hat{\mathbf {G}}\) be a diagonal matrix with \(\hat{\mathbf {g}}\) on its diagonal. Convolution becomes element-wise multiplication in the frequency domain, so the blurring operation becomes

$$\begin{aligned} \hat{\mathbf {X}}_{\mathrm {blur}} = \hat{\mathbf {G}} \cdot \hat{\mathbf {X}} = \hat{\mathbf {G}} \cdot \hat{\mathbf {U}}\hat{\varvec{\varSigma }}\hat{\mathbf {V}}^\top , \end{aligned}$$
(8.3)

where \(\hat{\varvec{\varSigma }} = \mathrm {diag}({\sigma }_1,{\sigma }_2,\ldots ,{\sigma }_d)\) contains the singular values of \(\hat{\mathbf {X}}\), already sorted in descending order: \(\sigma _1 \ge {\sigma }_2 \ge \ldots \ge {\sigma }_d\). Suppose we can find a permutation matrix \(\mathbf {P}\) such that, when applied to the diagonal matrix \(\hat{\mathbf {G}}\), the diagonal elements are sorted in descending order of magnitude: \(\hat{\mathbf {G}}' = \mathbf {P} \hat{\mathbf {G}} = \mathrm {diag}(\hat{g}'_1,\hat{g}'_2,\ldots ,\hat{g}'_d)\). Applying the same permutation to \(\hat{\mathbf {X}}_{\mathrm {blur}}\), we have the following relationship:

$$\begin{aligned} \mathbf {P} \cdot \hat{\mathbf {X}}_{\mathrm {blur}}&= \mathbf {P}\cdot \hat{\mathbf {G}} \cdot \hat{\mathbf {U}}\hat{\varvec{\varSigma }}\hat{\mathbf {V}}^\top \end{aligned}$$
(8.4)
$$\begin{aligned} \hat{\mathbf {X}}_{\mathrm {blur}}'&= \hat{\mathbf {G}}' \cdot \hat{\mathbf {U}}\hat{\varvec{\varSigma }}\hat{\mathbf {V}}^\top = \hat{\mathbf {U}} \cdot (\hat{\mathbf {G}}'\hat{\varvec{\varSigma }} )\cdot \hat{\mathbf {V}}^\top \end{aligned}$$
(8.5)
$$\begin{aligned}&= \hat{\mathbf {U}} \cdot \mathrm {diag}(\hat{g}'_1\sigma _1, \hat{g}'_2\sigma _2,\ldots , \hat{g}'_d\sigma _d) \cdot \hat{\mathbf {V}}^\top . \end{aligned}$$
(8.6)

Since the Gaussian is not a heavy-tailed distribution, its frequency-domain weights decay quickly, and the already small singular values are driven toward 0 by the Gaussian weights. Therefore, \(\hat{\mathbf {X}}_{\mathrm {blur}}\) effectively becomes low-rank after Gaussian low-pass filtering. To this end, we can say that low-pass filtering in Fourier analysis is closely related to low-rank approximation in the SVD, up to a permutation.
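As a quick numerical sanity check of this claim, the following sketch (NumPy, a simplified 1-D analogue with synthetic data and an assumed Gaussian bandwidth) compares the effective rank of a data matrix before and after Gaussian low-pass filtering in the frequency domain:

```python
# Sketch: Gaussian low-pass filtering in the frequency domain drives the
# trailing singular values toward zero, i.e., the filtered matrix is
# (approximately) low-rank.
import numpy as np

rng = np.random.default_rng(0)
d = 256
X = rng.standard_normal((d, d))                     # full-rank data matrix
X_hat = np.fft.fft(X, axis=0)                       # column-wise Fourier transform

freqs = np.fft.fftfreq(d)                           # frequencies in [-0.5, 0.5)
sigma_g = 0.02                                      # assumed Gaussian bandwidth
g_hat = np.exp(-0.5 * (freqs / sigma_g) ** 2)       # Gaussian low-pass weights
X_blur_hat = g_hat[:, None] * X_hat                 # row-wise weighting, as in Eq. (8.3)

s_orig = np.linalg.svd(X, compute_uv=False)
s_blur = np.linalg.svd(X_blur_hat, compute_uv=False)

def effective_rank(s, tol=1e-6):
    return int((s > tol * s[0]).sum())

print(effective_rank(s_orig), effective_rank(s_blur))   # full rank vs. a much smaller rank
```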

This phenomenon is loosely observed through the visualization of the trained filters, as shown in Fig. 8.15, which will be further analyzed and studied in future work.

3.2 Occlusion Removal via Deep Convolutional Generative Adversarial Networks (DCGAN)

In this section, we first review the basics of DCGAN and then show how DCGAN can be utilized for occlusion removal, or missing pixel recovery on face images.

3.2.1 Deep Convolutional Generative Adversarial Networks

The generative adversarial network (GAN) [12] is capable of generating high-quality images. The framework trains two networks, a generator \(\mathcal {G}_\theta (\mathbf {z}):\mathbf {z}\rightarrow \mathbf {x}\), and a discriminator \(\mathcal {D}_\omega (\mathbf {x}): \mathbf {x}\rightarrow [0,1]\). \(\mathcal {G}\) maps a random vector \(\mathbf {z}\), sampled from a prior distribution \(p_{\mathbf {z}}(\mathbf {z})\), to the image space. \(\mathcal {D}\) maps an input image to a likelihood. The purpose of \(\mathcal {G}\) is to generate realistic images, while \(\mathcal {D}\) plays an adversarial role to discriminate between the image generated from \(\mathcal {G}\), and the image sampled from data distribution \(p_\mathrm {data}\). The networks are trained by optimizing the following minimax loss function:

$$\begin{aligned} \min \limits _{\mathcal {G}} \max \limits _{\mathcal {D}} V(\mathcal {G},\mathcal {D}) = \mathbb {E}_{\mathbf {x} \sim p_\mathrm {data}(\mathbf {x})} \Big [\log (\mathcal {D}(\mathbf {x})) \Big ] + \mathbb {E}_{\mathbf {z}\sim p_\mathbf {z}(\mathbf {z})} \Big [\log (1-\mathcal {D}(\mathcal {G}(\mathbf {z}))) \Big ], \end{aligned}$$

where \(\mathbf {x}\) is the sample from the \(p_{\mathrm {data}}\) distribution; \(\mathbf {z}\) is randomly generated and lies in some latent space. There are many ways to structure \(\mathcal {G}(\mathbf {z})\). The deep convolutional generative adversarial network (DCGAN) [60] uses fractionally strided convolutions to upsample images instead of fully connected neurons as shown in Fig. 8.3.

The generator \(\mathcal {G}\) is updated to fool the discriminator \(\mathcal {D}\) into wrongly classifying the generated sample, \(\mathcal {G}(\mathbf {z})\), while the discriminator \(\mathcal {D}\) tries not to be fooled. In this work, both \(\mathcal {G}\) and \(\mathcal {D}\) are deep convolutional neural networks and are trained with an alternating gradient descent algorithm. After convergence, \(\mathcal {D}\) is able to reject images that are too fake, and \(\mathcal {G}\) can produce high-quality images faithful to the training distribution (true distribution \(p_\mathrm {data}\)).
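For concreteness, a minimal DCGAN sketch is given below. PyTorch, the 64×64 resolution, and the layer widths are assumptions in the spirit of [60], not the exact architecture used in this chapter.

```python
# Sketch: generator with fractionally strided (transposed) convolutions,
# discriminator, and one alternating update of the minimax objective above.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, ch * 8, 4, 1, 0), nn.BatchNorm2d(ch * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 8, ch * 4, 4, 2, 1), nn.BatchNorm2d(ch * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1), nn.BatchNorm2d(ch * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1), nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh())          # 1x1 -> 64x64

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

class Discriminator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.BatchNorm2d(ch * 2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.BatchNorm2d(ch * 4), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 4, ch * 8, 4, 2, 1), nn.BatchNorm2d(ch * 8), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 8, 1, 4, 1, 0), nn.Sigmoid())            # 64x64 -> scalar in [0, 1]

    def forward(self, x):
        return self.net(x).view(-1)

def gan_step(G, D, opt_G, opt_D, real, z_dim=100):
    """One alternating update: D is pushed up, then G is pushed to fool D."""
    bce = nn.BCELoss()
    ones, zeros = torch.ones(real.size(0)), torch.zeros(real.size(0))
    z = torch.randn(real.size(0), z_dim)

    opt_D.zero_grad()
    d_loss = bce(D(real), ones) + bce(D(G(z).detach()), zeros)
    d_loss.backward()
    opt_D.step()

    opt_G.zero_grad()
    g_loss = bce(D(G(z)), ones)       # non-saturating form of minimizing log(1 - D(G(z)))
    g_loss.backward()
    opt_G.step()
```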

Fig. 8.3
figure 3

Pipeline of a standard DCGAN with the generator \(\mathcal {G}\) mapping a random vector \(\mathbf {z}\) to an image and the discriminator \(\mathcal {D}\) mapping the image (from true distribution or generated) to a probability value

3.2.2 Occlusion Removal via DCGAN

To take on the missing data challenge, we need to utilize both the \(\mathcal {G}\) and \(\mathcal {D}\) networks from DCGAN, pre-trained with uncorrupted data. After training, \(\mathcal {G}\) is able to embed the images from \(p_\mathrm {data}\) onto some nonlinear manifold of \(\mathbf {z}\). An image that is not from \(p_\mathrm {data}\) (e.g., corrupted face image with missing pixels) should not lie on the learned manifold. Therefore, we seek to recover the “closest” image on the manifold to the corrupted image as the proper reconstruction.

Let us denote the corrupted image as \(\mathbf {y}\). To quantify the “closest” mapping from \(\mathbf {y}\) to the reconstruction, we define a function consisting of contextual loss and perceptual loss, following the work of Yeh et al. [73].

In order to incorporate the information from the uncorrupted portion of the given image, the contextual loss is used to measure the fidelity between the reconstructed image portion and the uncorrupted image portion, which is defined as

$$\begin{aligned} \mathcal {L}_\mathrm {contextual}(\mathbf {z}) = \Vert \mathbf {M} \odot \mathcal {G}(\mathbf {z})-\mathbf {M} \odot \mathbf {y}\Vert _1, \end{aligned}$$
(8.7)

where \(\mathbf {M}\) denotes the binary mask of the uncorrupted region and \(\odot \) denotes the element-wise Hadamard product operation. The corrupted portion, i.e., \((1- \mathbf {M}) \odot \mathbf {y}\), is not used in the loss. The choice of \(\ell _1\)-norm is empirical. From the experiments carried out in [73], images recovered with \(\ell _1\)-norm loss tend to be sharper and with higher quality compared to ones reconstructed with \(\ell _2\)-norm.

The perceptual loss encourages the reconstructed image to be similar to the samples drawn from the training set (true distribution \(p_\mathrm {data}\)). This is achieved by updating \(\mathbf {z}\) to fool \(\mathcal {D}\), or equivalently to have a high value of \(\mathcal {D}(\mathcal {G}(\mathbf {z}))\). As a result, \(\mathcal {D}\) will predict \(\mathcal {G}(\mathbf {z})\) to be from the data with a high probability. The same loss for fooling \(\mathcal {D}\) as in DCGAN is used:

$$\begin{aligned} \mathcal {L}_\mathrm {perceptual}(\mathbf {z}) = \log (1- \mathcal {D}( \mathcal {G} (\mathbf {z}))). \end{aligned}$$
(8.8)

The corrupted image with missing pixels can now be mapped to the closest \(\mathbf {z}\) in the latent representation space with the defined perceptual and contextual losses. We follow the training procedure in [60] and use Adam [45] for optimization. \(\mathbf {z}\) is updated using back-propagation with the total loss:

$$\begin{aligned} \hat{\mathbf {z}} = \mathop {\mathrm {arg\,min}}_{\mathbf {z}}(\mathcal {L}_\mathrm {contextual}(\mathbf {z}) + \lambda \mathcal {L}_\mathrm {perceptual}(\mathbf {z})) \end{aligned}$$
(8.9)

where \(\lambda \) is a weighting parameter. After finding the optimal solution \(\hat{\mathbf {z}}\), the hallucinated full-face image can be obtained by

$$\begin{aligned} \mathbf {x}_\mathrm {hallucinated} = \mathbf {M} \odot \mathbf {y} + (1 - \mathbf {M}) \odot \mathcal {G}(\hat{\mathbf {z}}). \end{aligned}$$
(8.10)
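Putting Eqs. (8.7)–(8.10) together, the recovery step can be sketched as follows. PyTorch, the hyper-parameters (\(\lambda \), learning rate, iteration count), and the function name `recover` are illustrative assumptions.

```python
# Sketch: starting from a random z, Adam updates z to minimize the contextual
# plus weighted perceptual loss; the final face composites observed pixels
# with generated pixels.
import torch

def recover(G, D, y, M, z_dim=100, lam=0.1, n_iters=2000, lr=0.01):
    """y: corrupted image (1, 3, 64, 64); M: binary mask of uncorrupted pixels (same shape)."""
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        Gz = G(z)
        contextual = torch.norm(M * Gz - M * y, p=1)        # Eq. (8.7)
        perceptual = torch.log(1.0 - D(Gz) + 1e-8)          # Eq. (8.8)
        loss = contextual + lam * perceptual                # Eq. (8.9)
        loss.backward()
        opt.step()
    with torch.no_grad():
        x_hallucinated = M * y + (1 - M) * G(z)             # Eq. (8.10)
    return x_hallucinated
```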

Examples of face recovery from contiguous occlusions are shown in Fig. 8.11 using DCGAN. Applying this deep generative approach for occluded image recovery is expected to improve the performance of unconstrained gender classification.

4 Experiments: Part A

In this section, we detail the training and testing protocols employed, as well as the various occlusions and low-resolution degradations modeled in the testing set. The accompanying figures and tables for each subsection present the results and observations, which are elaborated in the corresponding subsections.

4.1 Database and Preprocessing

Training set: We source images from 5 different datasets, each containing samples of both classes: J-Mugshot, O-Mugshot, M-Mugshot, P-Mugshot, and Pinellas. All the datasets except Pinellas are evenly split between males and females of different ethnicities. The images are centered: certain facial landmark points are detected and anchored to fixed coordinates in the resulting training image; for example, the eyes are anchored at the same coordinates in every image. All of our input images have the same dimension of \(168\times 210\). The details of the training datasets are listed in Table 8.2. The images are partitioned into training and validation sets, and the progressive blur is applied to each image as explained in the previous section. Hence, for a given model iteration, the training set consists of \(\sim \)90 K images.

Testing set: The testing set was built primarily from the following two datasets. (1) The AR Face database [55] is one of the most widely used face databases with occlusions. It contains 3,288 color images from 135 subjects (76 male + 59 female). Typical occlusions include sunglasses and scarves, and the database also captures expression variations and lighting changes. (2) The Pinellas County Sheriff’s Office (PCSO) mugshot database is a large-scale database of over 1.4 million images, from which we took a subset of around 400 K images. These images are not seen during training.

Table 8.2 Datasets used for progressive CNN training
Fig. 8.4
figure 4

Overall classification accuracy on the PCSO (400 K). Images are not corrupted

The testing images are centered and cropped in the same way as the training images, though other preprocessing steps, such as the progressive blur, are not applied. Instead, to model real-world occlusions, we conduct the experiments discussed in Sect. 8.4.2.

4.2 Experiment I: Occlusion Robustness

In Experiment I, we carry out occlusion-robust gender classification on both the AR Face database and the PCSO mugshot database. We manually add artificial occlusions to test the efficacy of our method on the PCSO database, and test on the various image sets in the AR Face database.

4.2.1 Experiments on the PCSO Mugshot Database

To begin with, the performance of the various models on the clean PCSO data is shown in Fig. 8.4, where \(\mathcal {M}_F\) denotes the full-face model trained in the conventional (non-progressive) manner and \(\mathcal {M}_P\) the periocular-only model. As expected, if the testing images are clean, it is preferable to use \(\mathcal {M}_F\) rather than \(\mathcal {M}_P\). We can see that the progressively trained models \(\mathcal {M}_1-\mathcal {M}_6\) are on par with \(\mathcal {M}_F\).

Fig. 8.5
figure 5

Various degradations applied to the testing images for Experiments I and II. Row 1: random missing pixel occlusions. Row 2: random additive Gaussian noise occlusions. Row 3: random contiguous occlusions. Percentages of degradation for Rows 1–3: 10, 25, 35, 50, 65, and 75%. Row 4: various zooming factors (2x, 4x, 8x, 16x) for low-resolution degradations

We corrupt the testing images (400 K) with three types of facial occlusions. These are visualized in Fig. 8.5, with each row corresponding to one type of modeled occlusion.
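A minimal sketch of how these degradations, together with the low-resolution degradation of Sect. 8.4.3, might be generated is given below. NumPy/OpenCV and details such as filling dropped pixels with zeros are illustrative assumptions; the noise \(\sigma = 6\) and the percentage levels are taken from the text.

```python
# Sketch of the four degradation types in Fig. 8.5 for an HxWx3 uint8 image.
import cv2
import numpy as np

rng = np.random.default_rng(0)

def missing_pixels(img, frac):
    """Drop a fraction `frac` of the pixels (Row 1); dropped pixels set to zero here."""
    out = img.copy()
    out[rng.random(img.shape[:2]) < frac] = 0
    return out

def gaussian_noise(img, frac, sigma=6.0):
    """Add Gaussian white noise (sigma = 6) to a fraction `frac` of the pixels (Row 2)."""
    out = img.astype(np.float32)
    mask = rng.random(img.shape[:2]) < frac
    out[mask] += rng.normal(0.0, sigma, size=(mask.sum(), img.shape[2]))
    return np.clip(out, 0, 255).astype(img.dtype)

def contiguous_occlusion(img, frac):
    """Blank out a random contiguous square covering `frac` of the area (Row 3)."""
    h, w = img.shape[:2]
    side = int(np.sqrt(frac * h * w))
    y0 = rng.integers(0, h - side + 1)
    x0 = rng.integers(0, w - side + 1)
    out = img.copy()
    out[y0:y0 + side, x0:x0 + side] = 0
    return out

def low_resolution(img, factor):
    """Down-sample by `factor`, then blow back up to the original size (Row 4)."""
    h, w = img.shape[:2]
    small = cv2.resize(img, (w // factor, h // factor), interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)
```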

(1) Random Missing Pixels Occlusions

Varying fractions of the image pixels (10, 25, 35, 50, 65, and 75%) were dropped to model lost information and grainy images. This corresponds to the first row in Fig. 8.5. From Table 8.3 and Fig. 8.6, \(\mathcal {M}_5\) performs the best, with \(\mathcal {M}_6\) showing a dip in accuracy, suggesting that too tight a periocular region is not well suited for such applications, i.e., a limit on the periocular region needs to be maintained in the blur set. There is a flip in the performance of the models \(\mathcal {M}_P\) and \(\mathcal {M}_F\) going from the original images to \(25\%\) corruption, with the periocular model generalizing better for higher corruptions. As the percentage of missing pixels increases, the performance gap between \(\mathcal {M}_P\) and \(\mathcal {M}_F\) increases. As hypothesized, the trend of improving performance among the progressively trained models is maintained across corruption levels, indicating a model that has learned to be more robust toward noise.

Table 8.3 Overall classification accuracy on the PCSO (400 K). Images are corrupted with random missing pixels of various percentages
Fig. 8.6
figure 6

Overall classification accuracy on the PCSO (400 K). Images are corrupted with random missing pixels of various percentages

(2) Random Additive Gaussian Noise Occlusions

Gaussian white noise \((\sigma = 6)\) was added to varying fractions of the image pixels (10, 25, 35, 50, 65, and 75%). This corresponds to the second row in Fig. 8.5 and is done to model high-noise data and bad compression. From Table 8.4 and Fig. 8.7, \(\mathcal {M}_4-\mathcal {M}_6\) perform best for medium noise, while for high noise \(\mathcal {M}_5\) is the most robust. Just as before, as the noise increases, the relative trends between \(\mathcal {M}_P\) and \(\mathcal {M}_F\) and between \(\mathcal {M}_5\) and \(\mathcal {M}_6\) are maintained, as is the performance trend of the progressively trained models.

Table 8.4 Overall classification accuracy on the PCSO (400 K). Images are corrupted with additive Gaussian random noise of various percentages
Fig. 8.7
figure 7

Overall classification accuracy on the PCSO (400 K). Images are corrupted with additive Gaussian random noise of various percentages

(3) Random Contiguous Occlusions

To model large occlusions such as sunglasses or other contiguous elements, contiguous patches of pixels (10, 25, 35, 50, 65, and 75%) were dropped from the image, as seen in the third row of Fig. 8.5. The most realistic occlusions correspond to the first few patch sizes; the others are extreme cases. For the former, \(\mathcal {M}_1-\mathcal {M}_3\) predict the classes with the highest accuracy. From Table 8.5 and Fig. 8.8, for such large occlusions and missing data, more contextual information is needed for correct classification, since \(\mathcal {M}_1-\mathcal {M}_3\) perform better than the other models. Nevertheless, since they also perform better than \(\mathcal {M}_F\), our scheme of focused saliency helps generalize over occlusions.

Table 8.5 Overall classification accuracy on the PCSO (400 K). Images are corrupted with random contiguous occlusions of various percentages
Fig. 8.8
figure 8

Overall classification accuracy on the PCSO (400 K). Images are corrupted with random contiguous occlusions of various percentages

4.2.2 Experiments on the AR Face Database

We partitioned the original set into smaller subsets to better understand our methodology’s performance under different conditions. Set 1 consists of neutral-expression, full-face subjects. Set 2 has full faces but varied expressions. Set 3 includes periocular occlusions such as sunglasses, and Set 4 includes these as well as other occlusions such as clothing. Set 5 is the entire dataset, including illumination variations.

Referring to Table 8.6 and Fig. 8.9, for Set 1, the full-face model performs the best; this is expected, as the model was trained on images very similar to these. Set 2 suggests that the models need more contextual information when expressions are introduced; thus \(\mathcal {M}_4\), which focuses on the periocular region but retains face information, performs best. For Set 3, we can see two things: first, \(\mathcal {M}_P\) performs better than \(\mathcal {M}_F\), indicative of its robustness to periocular occlusions; second, \(\mathcal {M}_5\) is the best, as it combines periocular focus with the contextual information gained from incremental training.

The Set 4 performance brings out why the periocular region is preferred for occluded faces. We ascertained that texture variations and the loss of the face contour throw off the models \(\mathcal {M}_1-\mathcal {M}_6\). The performance of the models on Set 5 reiterates the previously stated observations on the combined importance of contextual information about face contours and of the periocular region, which is why the best accuracy is reported by \(\mathcal {M}_3\).

Table 8.6 Gender classification accuracy on the AR Face database
Fig. 8.9
figure 9

Gender classification accuracy on the AR Face database

Table 8.7 Overall classification accuracy on the PCSO (400 K). Images are down-sampled to a lower resolution with various zooming factors
Fig. 8.10
figure 10

Overall classification accuracy on the PCSO (400 K). Images are down-sampled to a lower resolution with various zooming factors

4.3 Experiment II: Low-Resolution Robustness

Our scheme of training on Gaussian-blurred images should generalize well to low-resolution images. To test this hypothesis, we tested our models on images from the PCSO mugshot dataset by first down-sampling them by a factor and then blowing them back up (zooming factors of 2x, 4x, 8x, and 16x). This simulates the loss of edge information and other high-frequency detail and is captured in the last row of Fig. 8.5. As seen in Table 8.7 and Fig. 8.10, for the 2x, 4x, and 8x cases, the trend among \(\mathcal {M}_1-\mathcal {M}_6\) and their performance relative to \(\mathcal {M}_F\) is maintained. As mentioned before, \(\mathcal {M}_4\) performs well due to the balance between focus on the periocular region and preservation of the contextual information of the face.

Table 8.8 Summary of related work on gender classification. The proposed method is shown in the top rows

4.4 Discussion

We have proposed a methodology for building a gender recognition system that is robust to occlusions. It involves training a deep model incrementally over several batches of input data preprocessed with progressive blur. The intuition and intent are twofold: first, to have the network focus on the periocular region of the face for gender recognition; and second, to preserve the contextual information of facial contours so as to generalize better over occlusions.

Through various experiments, we have observed that our hypothesis indeed holds and that, for a given occlusion set, it is possible to obtain high accuracy from a model that encompasses both of the above-stated properties. Even though we did not train on any occluded data or optimize for a particular set of occlusions, our models generalize well over both synthetic degradations and real-life facial occlusion images.

We have summarized the overall experiments and consolidated the results in Table 8.8. For the large-scale PCSO experiments, we believe that 35% occlusion is the right amount of degradation on which accuracies should be reported. Therefore, we average the accuracy of our best model over the three types of occlusions (missing pixels, additive Gaussian noise, and contiguous occlusions), which gives 93.12% in Table 8.8. For the low-resolution experiments, we believe the 8x zooming factor is the right amount of degradation, so we report the accuracy of 95.67% in Table 8.8. Much other related work on gender classification is also listed for a quick comparison. This table is based on [16].

5 Experiments: Part B

In order to boost the classification accuracy and support the models trained in the previous section, we trained a GAN to hallucinate the occluded and missing information in the input image. The next two subsections detail our curation of the input data and the selection of the testing sets for our evaluation. The gender model (\(\mathcal {M}_k\)) definitions remain the same as in the previous section, and we use \(\mathcal {G}_z\) and \(\mathcal {D}_x\) to denote the generator and discriminator networks of the GAN.

5.1 Network Architecture

The network architectures and approach that we use are similar to those of [60]. The input image size is \(64\times 64\). During training, we used the Adam [45] optimization method because it requires little hand-tuning of the learning rate, momentum, and other hyper-parameters.

For learning \(\mathbf {z}\), we use the same hyper-parameters as in the previous model. In our experiments, running 2000 epochs was sufficient to converge to a low loss.

5.2 Database and Preprocessing

In order to train \(\mathcal {G}_z\) and \(\mathcal {D}_x\), our initial approach was to use the same input images as used to train \(\mathcal {M}_k\). However, this resulted in the network not converging to a low loss; in other words, we were not able to learn a generative distribution that the generator could sample from. Our analysis and intuition suggested that, for the adversarial loss to work, the task has to be a challenge for both \(\mathcal {G}_z\) and \(\mathcal {D}_x\); our fully frontal-pose images were ill-suited for this model.

Hence, to train the GAN, we used the Labeled Faces in the Wild (LFW) database [47] and aligned the faces using dlib as provided by OpenFace [1]. We trained on the entire dataset comprising around 13,000 images, with 128 images held out for qualitatively showing the image recovery results in Fig. 8.11. In this case, both visual and analytical inspection showed that \(\mathcal {G}_z\) was able to learn a generative distribution \(p_g\) that is a strong representation of the data distribution, i.e., \(p_g \approx p_{\mathrm {data}}\). The results and detailed evaluation of the model are presented in Sect. 8.5.3. In this part of the experiment, we use a subset of the PCSO database containing 10,000 images (5,000 male and 5,000 female) for testing the gender classification accuracy. We did not test on the entire 400 K PCSO set simply because the occlusion removal step involves an iterative solver, which is time-consuming.

Fig. 8.11
figure 11

Qualitative results of occlusion removal using DCGAN on images from LFW dataset

5.3 Experiment III: Further Occlusion Robustness via Deep Generative Approach

As shown in Fig. 8.11, we chose to model occlusions based on the percentage of pixels missing from the center of the image. The images shown are from the held-out portion of the LFW dataset. We show the recovered images in Fig. 8.11 as visual confirmation that the DCGAN is able to recover unseen images with high fidelity, even under heavy occlusions of as much as 75% in the face center. The recovery results on the PCSO data to be tested for gender classification are comparable to those on LFW, but we are not permitted to display mugshot images in any published work.

Fig. 8.12
figure 12

Quality measure of occlusion removal on the PCSO subset

Table 8.9 Image quality measures (in dB) on the masked and occlusion removed images
Fig. 8.13
figure 13

Overall classification accuracy for Experiment III on the PCSO subset. Images are not corrupted

Table 8.10 Overall classification accuracy for Experiment III on the PCSO subset. Images are corrupted with centered contiguous missing pixels of various percentages

For training \(\mathcal {G}_z\) and \(\mathcal {D}_x\), we used the LFW data, which capture a high variance of poses, illumination, and faces. We found this to be critical, especially in helping \(\mathcal {G}_z\) converge to stable weights. The model was able to generalize well to various faces and poses; as can be seen in Fig. 8.11, it generates the missing information effectively.

Rather than relying on visual inspection alone, we plot the PSNR and SNR of the recovered (occlusion-removed) faces from the 10 K PCSO subset in Fig. 8.12; this is our quality measure of occlusion removal using the GAN. As can be seen in Table 8.9, the PSNR/SNR is better for the completed images and, as expected, is higher for images with less occlusion.
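For reference, the PSNR and SNR measures can be computed as in the sketch below (NumPy, assuming 8-bit images with a peak value of 255):

```python
# Sketch: PSNR and SNR (in dB) of a recovered face against the original
# un-occluded one.
import numpy as np

def psnr(original, recovered, peak=255.0):
    mse = np.mean((original.astype(np.float64) - recovered.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def snr(original, recovered):
    signal = np.sum(original.astype(np.float64) ** 2)
    noise = np.sum((original.astype(np.float64) - recovered.astype(np.float64)) ** 2)
    return 10.0 * np.log10(signal / noise)
```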

The primary motivation behind training a GAN was to improve the classification accuracy of the \(\mathcal {M}_k\) models; this is covered in Table 8.10. The first column of the table is our baseline case, constructed by upsampling images from the resolution required by \(\mathcal {G}_z\) and \(\mathcal {D}_x\) to the resolution expected by \(\mathcal {M}_k\). All other accuracies should be evaluated with respect to this case (Fig. 8.13).

Figure 8.14 corresponds to the other columns of Table 8.10. The accuracies on images completed using \(\mathcal {G}_z\) are significantly better than the accuracies on masked images, which suggests that the hallucinations preserve the gender-discriminative information of the original images.

Fig. 8.14
figure 14

Overall classification accuracy on the PCSO subset. Images are corrupted with centered contiguous occlusions of various percentages

The above statement can also be verified through visual inspection of Fig. 8.11. Even for high-percentage occlusions of \(50-75\%\), the contours and features of the original face are reproduced quite accurately by \(\mathcal {G}_z\).

5.4 Discussion

In the current generative model for image occlusion removal, we assume that the occlusion mask, the \(\mathbf {M}\) in Eq. 8.7, is known to the algorithm. Although it is beyond the scope of this work to study how an algorithm can automatically determine the occlusion region, it is an interesting research direction; for example, [61] can automatically tell which part of an image is face and which is non-face.

One big advantage of the DCGAN, and of GANs in general, is that training is entirely unsupervised. The loss function is essentially based on measuring the similarity of two distributions (the true image distribution and the generated image distribution) rather than on image-wise comparisons, which would require labeled ground-truth images to be provided. The unsupervised nature of the GAN makes the training process much easier by removing the need for careful curation of labeled data.

The effectiveness of the DCGAN is also noteworthy. Not only can it recover high percentages of missing data with high fidelity, which translates to significant improvement on the gender classification task, but it can also be used for future data augmentation in a totally unsupervised manner. We can essentially generate as much gender-specific data as needed, which will be an asset for training an even larger model.

During our experimentation, we found that training the DCGAN using more constrained faces (fewer pose, lighting, and expression variations, etc.), such as the PCSO mugshot images, actually degrades the recovery performance. The reason could be as follows. Consider one extreme case of a less diversified training set: a dataset comprising thousands of images from only one subject. In this case, the ‘true distribution’ becomes a very densely clustered mass, so that whatever images the generator \(\mathcal {G}\) comes up with, the discriminator \(\mathcal {D}\) will (almost) always say that the generated ones are not from the true distribution, because the generated image distribution can hardly hit that densely clustered mass. We are then giving the discriminator too easy a task, which prevents it from becoming a strong one and, in this adversarial setting, leads to a poorly performing generator as well. In a nutshell, DCGAN training definitely needs more variation in the training corpus.

6 Conclusions and Future Work

In this work, we have undertaken the task of occlusion- and low-resolution-robust facial gender classification. Inspired by trainable attention models via deep architectures, and by the fact that the periocular region is proven to be the most salient region for gender classification purposes, we designed a progressive convolutional neural network training paradigm to enforce attention shift during the learning process. The goal is to enable the network to attend to particular high-profile regions (e.g., the periocular region) without changing the network architecture itself. The network benefits from this attention shift and becomes more robust toward occlusions and low-resolution degradations. With the progressively trained CNN models, we achieved better gender classification results on the large-scale PCSO mugshot database with 400 K images under occlusion and low-resolution settings, compared to a model trained in the traditional manner. In addition, our progressively trained network is sufficiently generalized to be robust to occlusions of arbitrary types and at arbitrary locations, as well as to low resolution.

To further improve the gender classification performance on occluded facial images, we invoke a deep generative approach via deep convolutional generative adversarial networks (DCGAN) to fill in the missing pixels for the occluded facial regions. The recovered images not only show high fidelity as compared to the original un-occluded images but also significantly improve the gender classification performance.

In summary, on one hand, we aim at building a robust gender classifier that is tolerant to image degradations such as occlusions and low resolution, and on the other hand, we aim at mitigating and eliminating the degradations through a generative approach. Together, we are able to push the boundary of unconstrained facial gender classification.

Future work: We have carried out a set of large-scale testing experiments on the PCSO mugshot database with 400 K images, shown in the experimental section. We have noticed that, under the same testing environment, the amount of time it takes to test on the entire 400 K images varies dramatically across the progressively trained models \((\mathcal {M}_0 - \mathcal {M}_6)\). As shown in Fig. 8.15, we observe a decreasing trend in testing time from \(\mathcal {M}_0\) to \(\mathcal {M}_6\), where the curves correspond to the additive Gaussian noise occlusion robustness experiments. The same trend is observed across the board for all the large-scale experiments on PCSO. The time difference is striking: for example, on the green curve, \(\mathcal {M}_0\) takes over 5000 seconds while \(\mathcal {M}_6\) takes only around 500. One future direction is to study the cause of this phenomenon, for example, by studying the sparsity or the smoothness of the learned filters.

Fig. 8.15
figure 15

(Top) Testing time for the additive Gaussian noise occlusion experiments on various models. (Bottom) Visualization of the 64 first-layer filters for models \(\mathcal {M}_0\), \(\mathcal {M}_3\), and \(\mathcal {M}_6\), respectively

As shown in our visualization (Fig. 8.15) of the 64 first-layer filters in AlexNet for models \(\mathcal {M}_0\), \(\mathcal {M}_3\), and \(\mathcal {M}_6\), respectively, the progressively trained filters appear smoother, which may be due to the implicit low-rank regularization phenomenon discussed in Sect. 8.3.1.3. Other future work may include studying how an ensemble of models [43, 44] can further improve the performance and how various multimodal soft-biometric traits [5, 23, 25, 26, 28,29,30, 37, 41, 42, 64, 66, 74] can be fused for improved gender classification, especially under more unconstrained scenarios.