1 Introduction

Unconstrained face detection is challenging due to pose and illumination variations, occlusions, blur, etc. While illumination variations are handled relatively well thanks to many physical models, pose variations and occlusions are the most commonly encountered problems in practice. Many approaches have been proposed specifically to handle pose variations [2,3,4] or occlusions [5,6,7,8,9]; however, few of them address pose variations and occlusions explicitly and simultaneously in a single model.

Recently, a number of Convolutional Neural Network (CNN) [10] based face detection methods [11,12,13,14,15] have been proposed, owing to the power of CNNs in dealing with computer vision problems. However, CNN models generally deal with the problems in face detection by learning from a large number of diverse training samples. Such data-driven solutions may handle various face variations well; however, they usually result in very complex models that run slowly, which limits their application in practice, especially on embedded devices. On the other hand, Yang et al. [13] proposed a specific architecture called Faceness-Net, which considers facial component based scoring and the spatial configuration of the components to explicitly deal with occluded face detection. This work suggests that explicit modeling of the challenges in face detection is still required and is more effective than a purely data-driven approach, though the fixed spatial configuration in Faceness-Net is still an issue, and the model is still expensive to apply.

Putting occlusions and large pose variations together, a common issue is that some facial components are invisible under either condition. This motivates us to only detect visible components that share some pose invariance property, and adaptively aggregate them together to form the whole face detection. Therefore, in this paper we propose a novel face detection method called Aggregating Visible Components (AVC), which addresses pose variations and occlusions simultaneously in a single framework.

Specifically, to handle pose variations, we define two pose-invariant (or pose-robust) components by considering half facial view, and a regression based local landmark alignment. Such a consistent component definition helps to reduce the model complexity. Accordingly, we train two component detectors, mirror them to detect the other half view, and introduce a local region competition strategy to alleviate false detections. To handle facial occlusions, we only detect visible facial components, and build a local to global aggregation strategy to detect the whole face adaptively. Experiments on the FDDB and AFW databases show that the proposed method is robust in handling pose variations and occlusions, achieving much better performance but lower model complexity compared to the corresponding holistic face detector.

The remaining parts of this paper are organized as follows. Section 2 gives a concise review of related works. Section 3 gives an overview of the proposed AVC detector. Section 4 introduces the pose-invariant component definition and the detector training. In Sect. 5, we present the local region competition strategy and the adaptive local to global aggregation strategy. Experimental results on AFW and FDDB are shown and discussed in Sect. 6 and we conclude the paper in Sect. 7.

2 Related Works

Given that the original Viola-Jones face detector [16] is limited in multi-view face detection, various cascade structures have been proposed to handle pose variations [2,3,4]. Today, partitioning poses into discrete ranges and training a detector for each range independently is still a popular way to handle pose variations, for example in recent works [12, 17]. Zhu and Ramanan [18] proposed to jointly detect a face, estimate its pose, and localize face landmarks in the wild by a Deformable Parts-based Model (DPM), which was further improved in [19, 20]. Ranjan et al. [21] proposed to combine deep pyramid features and DPM to handle faces with various sizes and poses in unconstrained settings. Chen et al. [22] proposed to combine the face detection and landmark estimation tasks in a joint cascade framework to refine face detection by precise landmark detections. Liao et al. [23] proposed to learn features in deep quadratic trees, where different views could be automatically partitioned. These methods are effective in dealing with pose variations, but they do not handle occlusions at the same time.

Face detection under occlusions is also an important issue but has received less attention than multi-view face detection, partly due to the difficulty of classifying arbitrary occlusions into predefined categories. Component-based face detection is a promising way of handling occlusions. For example, Chen et al. [8] proposed a modified Viola-Jones face detector, where the trained detector was divided into sub-classifiers related to several predefined local patches, and the outputs of the sub-classifiers were re-weighted. Goldmann et al. [24] proposed to connect facial parts using a topology graph. Recently, Yang et al. [13] proposed a specific architecture called Faceness-Net, which considers faceness scoring in generic object proposal windows based on facial component responses and their spatial configuration, so that face detection with occlusions can be explicitly handled. However, none of the above methods considered face detection with both occlusions and pose variations simultaneously in unconstrained scenarios.

Our work also differs from other part-based methods such as [25,26,27,28,29]. The method in [25] describes an object by a non-rigid constellation of parts and jointly optimizes the parameters, whereas we learn component detectors independently and apply an aggregation strategy to constitute a global representation. Moreover, AVC defines parts via component-invariant mapping, in contrast to [26], which defines parts by a search procedure, and to [27,28,29], which deploy CNN structures.

Recently, Convolutional Neural Network (CNN) [10] based methods [11,12,13,14,15] have been proposed for face detection, owing to the power of CNNs in dealing with computer vision problems. For example, Li et al. [11] proposed a cascade architecture based on CNN, and the performance was improved by alternating between the detection net and the calibration net. Most recently, Zhang et al. [14] and Ranjan et al. [15] combined face detection with other vision tasks such as face alignment and introduced a multi-task loss into the CNN cascade.

3 Overview of the Proposed Method

Figure 1 is an overview of the proposed AVC face detection method. It includes three main steps in the detection phase: visible component detection step, local region competition step, and the local to global aggregation step. AVC works by detecting only the visible components which would be later aggregated to represent the whole face. Two half-view facial component detectors are trained, and for this we introduce a pose-invariant component definition via a regression based local landmark alignment, which is crucial for training sample cropping and pose-invariant component detection. Then the two learned detectors are mirrored to detect the other half view of the facial components. Next, the detected visible facial components go through a local region competition module to alleviate false detections, and finally a local to global aggregation strategy is applied to detect the whole face adaptively.

Fig. 1. The processing steps of the proposed AVC face detection method. (a) Input image. (b) Visible eye detection. (c) Detection of all visible components (Red: left eye; Blue: right eye; Green: left mouth; Pink: right mouth). (d) Refinement after local region competition. (e) Aggregated whole face detection. (Color figure online)

The intuition behind our component-based design is that face images in real-world applications often exhibit large pose variations and occlusions. Consider, for example, a face turned left by more than 60 degrees (see Fig. 2(a)), where the holistic face detector unavoidably includes unwanted background (see Fig. 2(b)).

Fig. 2. Illustration of holistic face detection and component-based face detection. (a) Input image. (b) Typical holistic face detection. (c) Left eye (LE) detection. (d) Left mouth (LM) detection. (e) Aggregating LE and LM to get a global detection.

However, a robust face detector should not only predict the number of faces but also give bounding boxes that are as tight as possible. This criterion was first introduced by FDDB [1], a face benchmark that employs both a discrete metric and a continuous metric for evaluation. While a typical face detector may fail to bound a profile face tightly and may miss faces under occlusions, we find that pose variations and occlusions can be jointly handled by locating and aggregating facial components. We trained two facial component detectors, one for the left eyebrow + left eye (denoted as LE, Fig. 2(c)) and one for the left nose + left mouth (denoted as LM, Fig. 2(d)).

It is observed that although a large rotation of the face towards the left may make the left eye invisible, we can still locate the right eye, the mouth, the nose, etc. under this circumstance. The same applies to occlusions: for example, when the left half of the face is occluded by another person’s shoulder, we can still locate the whole face by detecting the right eye. Furthermore, we only train two half facial view components and mirror them to detect the other half view. This strategy not only reduces the training effort, but also enables us to deal with larger pose variations because, for example, the left eye component appears to be invariant under 0–60\(^\circ\) pose changes, and beyond this range the right eye or another component is usually detectable.

4 Pose-Invariant Component Detection

4.1 Pose-Invariant Component Mapping

As indicated in AFLW [30], although there is broad agreement on how to define the anchor points and the extent of the rectangle for frontal faces, this is not so obvious for profile and semi-profile views, which makes it harder to get consistently annotated samples for training. Unlike a holistic face detector, a facial part detector requires uniform eye patches and mouth patches as its training set. This would not be possible without pose-invariant component mapping.

Samples in AFLW are annotated with 21 landmarks. We first calculate the mean shape of the whole database, with samples normalized and missing coordinates excluded. The region of the mean shape that we want to map, i.e. the left eyebrow and left eye for the LE component, is mapped directly to a new input sample by applying the transformation

$$\begin{aligned} a\mathbf {\bar{x}} + {x_0}\mathbf {E} = \mathbf {x} \end{aligned}$$
(1)
$$\begin{aligned} a\mathbf {\bar{y}} + {y_0}\mathbf {E} = \mathbf {y} \end{aligned}$$
(2)

Note that in (1) and (2), \(\mathbf {\bar{x}}\) and \(\mathbf {\bar{y}}\) are vectors holding the x and y coordinates of the mean shape, while \(\mathbf {x}\) and \(\mathbf {y}\) hold those of a new sample. \(\mathbf {E}\) is an \(n \times 1\) vector with all elements equal to 1, \(a\) is a scalar scale factor, \(x_0, y_0\) are scalars that denote offsets, and \(n\) is the number of landmarks used for regression. The least-squares closed-form solution is

$$\begin{aligned} a = \frac{{\mathbf {\bar{x}}^T} \cdot \mathbf {x} + {\mathbf {\bar{y}}^T} \cdot \mathbf {y} - \frac{1}{n} \cdot ({\mathbf {\bar{x}}^T} \cdot \mathbf {E})({\mathbf {x}^T} \cdot \mathbf {E}) - \frac{1}{n} \cdot ({\mathbf {\bar{y}}^T} \cdot \mathbf {E})({\mathbf {y}^T} \cdot \mathbf {E})}{{\mathbf {\bar{x}}^T} \cdot \mathbf {\bar{x}} + {\mathbf {\bar{y}}^T} \cdot \mathbf {\bar{y}} - \frac{1}{n} \cdot {({\mathbf {\bar{x}}^T} \cdot \mathbf {E})}^2 - \frac{1}{n} \cdot {({\mathbf {\bar{y}}^T} \cdot \mathbf {E})}^2} \end{aligned}$$
(3)
$$\begin{aligned} {x_0} = \frac{1}{n} \cdot {\mathbf {x}^T} \cdot {\mathbf {E}} - a\frac{1}{n} \cdot {\mathbf {{\bar{x}}}^T} \cdot {\mathbf {E}} \end{aligned}$$
(4)
$$\begin{aligned} {y_0} = \frac{1}{n} \cdot {\mathbf {y}^T} \cdot {\mathbf {E}} - a\frac{1}{n} \cdot {\mathbf {{\bar{y}}}^T} \cdot {\mathbf {E}} \end{aligned}$$
(5)
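The fit can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the implementation used in the paper: the function names `fit_scale_translation` and `map_points` are hypothetical, and the scale is computed from the least-squares normal equations, so its denominator is the centered squared norm of the mean-shape landmarks.

```python
def fit_scale_translation(mean_pts, sample_pts):
    """Least-squares fit of (a, x0, y0) such that a * mean + offset ~ sample.

    Both arguments are lists of (x, y) tuples for the same n landmarks.
    """
    n = len(mean_pts)
    sum_mx = sum(mx for mx, _ in mean_pts)      # x_bar^T E
    sum_my = sum(my for _, my in mean_pts)      # y_bar^T E
    sum_sx = sum(sx for sx, _ in sample_pts)    # x^T E
    sum_sy = sum(sy for _, sy in sample_pts)    # y^T E
    cross = sum(mx * sx + my * sy
                for (mx, my), (sx, sy) in zip(mean_pts, sample_pts))
    mean_sq = sum(mx * mx + my * my for mx, my in mean_pts)

    # Scale: centered cross term over centered squared norm of the mean shape.
    num = cross - (sum_mx * sum_sx + sum_my * sum_sy) / n
    den = mean_sq - (sum_mx ** 2 + sum_my ** 2) / n
    a = num / den
    # Offsets: difference between the sample centroid and the scaled mean centroid.
    x0 = (sum_sx - a * sum_mx) / n
    y0 = (sum_sy - a * sum_my) / n
    return a, x0, y0


def map_points(points, a, x0, y0):
    """Apply the fitted transform to mean-shape points (e.g. region corners)."""
    return [(a * px + x0, a * py + y0) for px, py in points]
```

Under this reading, the corners of the LE or LM region in the mean shape would be passed through `map_points` to obtain the crop rectangle in the new sample.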

An intuitive visual interpretation is shown in Fig. 3. In Fig. 3(c), blue points are annotated landmarks while red points are mapped from the mean shape. Positive samples extracted in this way are highly uniform and are used for training the LE and LM component detectors. The pose-invariant component mapping method is also used for preparing negative samples for bootstrapping (see Fig. 4).

Fig. 3. Pose-invariant component mapping and cropping. (a) Input. (b) Mean shape of the LE component. (c) Regression based local landmark alignment of the LE component. (d) Cropping of the LE component. (e) Input. (f) Mean shape of the LM component. (g) Regression based local landmark alignment of the LM component. (h) Cropping of the LM component. (Color figure online)

Fig. 4. Positive and negative examples for components. The first and third rows show positive training samples of the LE and LM components respectively, while the second and fourth rows show images for bootstrapping negative LE and LM samples respectively.

4.2 Why LE and LM?

In this paper, we train two part-based detectors, namely LE (left eyebrow and left eye) and LM (left nose and left mouth); Fig. 4 displays some positive and hard-negative training samples obtained using the method of the previous subsection. But why not eyes, noses or other patches? Our motivations are: (1) These patches are not defined arbitrarily or conceptually but are based on the regression of local landmarks. As in Fig. 3, these landmarks are derived from the LE/LM mean shape of AFLW to ensure that they remain invariant throughout the database. (2) Why 6 landmarks instead of 3 or 9? According to AFLW, a nose is defined by 3 landmarks, so the width and height of such patches would be too small for training and testing, while 9 landmarks would cover a facial area so broad that it becomes vulnerable to occlusions.

4.3 Training Procedure

In this subsection, we give a brief introduction about the feature employed for facial representation as well as the work flow of the training algorithm.

Feature: We choose NPD [23] as our feature mainly for two properties: it is illumination invariant, and it is fast because each feature computation involves only two pixels. For an image with \(p=w \times h\) pixels, the number of features is \(C_p^2 = p(p-1)/2\), which can be computed beforehand, leading to superior speed in real-world applications. With the scale-invariance property of NPD (a feature value is unchanged when both pixel intensities are multiplied by the same factor), the facial component detector is expected to be robust against illumination changes, which is important in practice.
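As a brief illustration, the sketch below evaluates one NPD feature and counts the candidate features of a patch. The formula \((x-y)/(x+y)\), with the value set to 0 when both pixels are 0, is the definition given in the NPD work cited as [23] and is not restated in this section; the function names are hypothetical.

```python
from math import comb


def npd(x, y):
    """Normalized pixel difference of two intensities; 0 when both are 0."""
    if x == 0 and y == 0:
        return 0.0
    return (x - y) / (x + y)


def num_npd_features(width, height):
    """Number of unordered pixel pairs in a width x height patch."""
    p = width * height
    return comb(p, 2)
```

For the patch sizes used later in the paper, `num_npd_features(15, 20)` gives 44850 candidate features for LE and `num_npd_features(20, 20)` gives 79800 for LM. Scaling both pixels by the same factor, as in `npd(100, 50)` versus `npd(200, 100)`, leaves the value unchanged, which is the scale-invariance property referred to above.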

Training Framework: The Deep Quadratic Tree (DQT) [23] is used as the weak classifier; it learns two thresholds per split and is deeper than typical tree classifiers. Soft-Cascade [31] as well as hard-negative mining are applied for cascade training. While individual NPD [32] features may be “weak”, the Gentle AdaBoost algorithm is used to learn a subset of NPD features organized in DQTs for stronger discriminative ability.

5 Local to Global Aggregation

5.1 Symmetric Component Detection

Figure 5 shows some example outputs of the LE and LM detectors respectively. As can be seen, our component-based detector has inherent advantages under occlusions (Fig. 5(a,h)) and pose variations (Fig. 5(c,g)), where a holistic detector would normally fail. The detection of the right eyebrow + right eye (RE) and the right mouth + right nose (RM) can be achieved by deploying the detectors of their left counterparts. Figure 6(a) to (d) illustrates how we locate RM and RE using the same detectors as for LM and LE.

Fig. 5. Some example component detections by the proposed LE (upper row) and LM (lower row) facial component detectors.

Fig. 6. (a) Input image. (b) Left eye detection. (c) Left eye detection in the mirrored image. (d) Right eye detection mapped back to the original image.
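The mirroring step illustrated in Fig. 6 can be sketched as follows. This is a minimal sketch under stated assumptions: boxes are `(x, y, w, h)` tuples in pixel coordinates, the image is a list of pixel rows, and `left_detector` is a hypothetical callable standing in for a trained LE or LM detector.

```python
def mirror_box(box, image_width):
    """Reflect an (x, y, w, h) box about the vertical centre line of the image."""
    x, y, w, h = box
    return (image_width - x - w, y, w, h)


def detect_right_components(image_rows, left_detector):
    """Run a left-component detector on the mirrored image and map boxes back.

    image_rows: image as a list of pixel rows.
    left_detector: hypothetical callable returning [(x, y, w, h, score), ...].
    """
    width = len(image_rows[0])
    flipped = [row[::-1] for row in image_rows]   # horizontal flip
    results = []
    for x, y, w, h, score in left_detector(flipped):
        mx, my, mw, mh = mirror_box((x, y, w, h), width)
        results.append((mx, my, mw, mh, score))
    return results
```

Because `mirror_box` is its own inverse, the same helper can serve both for mapping detections back and for mirroring coordinates before a left-side regression is applied.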

5.2 Local Region Competition

Adopting facial part detection also brings several difficulties which, if handled improperly, cause the performance to vary greatly. First, the LE, LM, RE and RM detectors for different facial parts each produce a set of candidate positive windows with a set of confidence scores. The goal of face detection, however, is to locate each face with a bounding box that is as tight as possible, so we need to merge the detections from the different facial part detectors and remove duplicated windows. A common solution is Non-Maximum Suppression (NMS) [33], but the question arises of how to merge windows with a good trade-off between high precision and high detection rate. Second, different benchmarks with different annotation styles can lead to biased evaluation. As noted in [20], this diversity becomes more prominent for profile faces. In this section, we address the above issues by exploiting the advantage of a competitive strategy.

Figure 1 illustrates the idea of the proposed local region competition. The core idea is to reject false alarms during merging (compete) while improving localization accuracy during aggregation (collaborate). Lines 6 to 11 of Algorithm 1 first obtain the candidate outputs of the LE, RE, LM and RM facial part detectors within a specific facial region, denoted as region_rects; as an example, Fig. 1(c) shows the detection results of all components and Fig. 1(d) shows the result after competition. In this example, the left eye region may well contain the outputs of other facial part detectors such as RE (false alarms) in addition to those of LE, and vice versa. Through this competitive strategy we ensure that the candidate windows of only one facial part detector are retained for each region, so that false alarms are not used for aggregation.
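The competition within one region can be sketched as below. This is only an illustration of the stated outcome (one surviving part detector per region). The winner rule used here, the largest summed confidence score, is an assumption and not taken from Algorithm 1; the name `compete` and the tuple layout of `region_rects` are likewise hypothetical.

```python
from collections import defaultdict


def compete(region_rects):
    """Keep only the candidates of the winning part detector in one region.

    region_rects: list of (part, rect, score) tuples, where part is one of
    "LE", "RE", "LM", "RM" and rect is an (x, y, w, h) box.
    The winner is the part with the largest summed score (an assumption).
    """
    totals = defaultdict(float)
    for part, _, score in region_rects:
        totals[part] += score
    if not totals:
        return []
    winner = max(totals, key=totals.get)
    # Candidates of all other parts are treated as false alarms and dropped.
    return [cand for cand in region_rects if cand[0] == winner]
```

In this sketch, a left eye region holding two LE windows with scores 5.0 and 4.0 and one RE window with score 6.0 would keep the two LE windows and discard the RE window.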

Algorithm 1 (listing not reproduced)

5.3 Aggregation Strategy

After deploying the competitive strategy to exclude possible false positives, the task is to ensure accurate localization of the detection outputs. This is achieved by making full use of the information from the rects of different regions. We use a rectangle as the facial representation. Note that our proposed pipeline also applies to an elliptical representation, as the aforementioned workflow remains unchanged.

In line 12 of Algorithm 1, the winning rectangles from each region, as illustrated in Fig. 5, are regressed directly to whole-face bounding boxes. Note that we only learn two sets of regression parameters (linear regression), because during inference the coordinates of an RE/RM component are first mirrored, then regressed, and then mirrored back using the parameters of its left counterpart. This is a local to global, bottom-up strategy because the rects of different facial regions are mapped to global facial representations. In lines 15 to 18 of Algorithm 1, these rects are then concatenated and partitioned using a disjoint-set algorithm. The locations of the partitioned rects are then translated and adjusted by tuning their widths and heights according to their confidence scores (weights). Through this process, the information from different regions is combined to obtain a more accurate localization of the whole face. Finally, NMS [33] is deployed to eliminate interior rects.
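The partition-and-fuse step can be sketched as follows. It is a sketch under stated assumptions rather than the procedure of Algorithm 1: two face-level rects are placed in the same set when their intersection over union reaches a hypothetical `min_iou` threshold, and each set is fused by a confidence-weighted average of its box coordinates. The regression to whole-face boxes and the final NMS are not shown.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def find(parent, i):
    """Root of element i in the disjoint set, with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def aggregate(rects, scores, min_iou=0.5):
    """Partition face-level rects with a disjoint set, then fuse each group.

    rects: (x, y, w, h) boxes already regressed to whole-face extent.
    scores: matching confidence scores, assumed positive, used as weights.
    Returns a list of (fused_box, summed_score) pairs.
    """
    parent = list(range(len(rects)))
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if iou(rects[i], rects[j]) >= min_iou:
                parent[find(parent, i)] = find(parent, j)   # union

    groups = {}
    for i in range(len(rects)):
        groups.setdefault(find(parent, i), []).append(i)

    fused = []
    for members in groups.values():
        total = sum(scores[i] for i in members)
        box = tuple(sum(scores[i] * rects[i][k] for i in members) / total
                    for k in range(4))
        fused.append((box, total))
    return fused
```

Weighting by confidence pulls the fused box towards the components the detectors are most certain about, which is one plausible reading of adjusting positions, widths and heights "according to their confidence scores".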

6 Experiments

6.1 Training Parameters

Annotated Facial Landmarks in the Wild (AFLW) [30] is an unconstrained face benchmark that contains 25993 face annotations in 21997 real-world images with large pose variations, occlusions, illumination changes as well as a diversity of ages, genders, and ethnicities. In total, we use 43994 images, i.e. the AFLW images together with their flipped counterparts, as positive samples and 300000 background images for training. An additional 12300 images of natural scenes, scraped from the Internet, are used to mask face components for hard-negative mining. In training AVC, images of 15×20 pixels are used for the LE component while images of 20×20 pixels are used for LM. Pose-invariant component mapping is deployed to crop positive training patches and prepare bootstrapping samples.

6.2 AFW Results

Annotated Faces in the Wild (AFW) [18] contains 205 images collected from Flickr that contain images of cluttered scenes and different viewpoints.

To evaluate on AFW, we fit the winning rects from the local component detectors to rectangle representations of the whole face, which are then used for further aggregation. The fitting parameters are learned on AFLW using 10-fold cross-validation; the same applies to the learning of the elliptical fitting parameters for testing on FDDB.

We use the evaluation toolbox provided by [20]. The comparison of the Precision-Recall curves generated by different methods is shown in Fig. 7(a). We compare AVC with both academic methods such as DPM, HeadHunter and Structured Models, and commercial systems such as Face++ and Picasa. As can be seen from the figure, AVC outperforms DPM and is superior or equal to Face++ and Google Picasa. The precision of AVC is 98.68% with a recall of 97.13%, and the AP of AVC is 98.08%, which is comparable with the state-of-the-art methods. Example detection results are shown in the first row of Fig. 8; note that we output rectangles for evaluation on AFW.

Fig. 7. Experimental results on the AFW and FDDB databases. Best viewed in color.

6.3 FDDB Results

Face Detection Data Set and Benchmark (FDDB) [1] contains 2845 images with 5171 faces, with a wide range of arbitrary poses, occlusions, illumination changes and resolutions. FDDB uses elliptical annotations and two types of evaluation metrics are applied. One is the discrete score metric which counts the number of detected faces versus the number of false alarms. A detected bounding box is considered true positive if it has an IoU of over 0.5 with ground truth. The other is the continuous score metric that measures the IoU ratio as the indicator for performance.
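The two scores can be illustrated with a short sketch. It simplifies the benchmark in two ways that should be read as assumptions: regions are axis-aligned `(x, y, w, h)` boxes rather than ellipses, and detections are taken to be already matched one-to-one with ground-truth faces. The function name `fddb_style_scores` is hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def fddb_style_scores(matches, threshold=0.5):
    """Discrete and continuous scores over matched (detection, truth) pairs.

    Discrete: number of pairs whose IoU exceeds the threshold.
    Continuous: sum of the IoU values themselves.
    """
    ious = [iou(det, gt) for det, gt in matches]
    discrete = sum(1 for value in ious if value > threshold)
    continuous = sum(ious)
    return discrete, continuous
```

The continuous score rewards tight boxes even among detections that already count as true positives, which is why accurate localization matters for this benchmark.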

Fig. 8. Qualitative results of AVC on AFW (first row, using rectangle representations) and FDDB (second and third rows, using elliptical representations).

As FDDB uses ellipses for annotations, we fit the output rectangles to elliptical representations of the whole face. We use the evaluation code provided by Jain and Learned-Miller [1], and the results using the discrete score metric are shown in Fig. 7. We compare our results with the latest published methods on FDDB, including MTCNN, DP2MFD, Faceness-Net and Hyperface. Our method performs worse than MTCNN and DP2MFD, which resort to powerful yet complex CNN features, but better than Faceness-Net, which is also component-based but relies on a CNN structure. AVC achieves a detection rate of 84.4% at FP = 100 and 89.0% at FP = 300. Example detection results are shown in the second and third rows of Fig. 8, where faces under pose changes and occlusions have been successfully located.

6.4 Does Component-Invariant Mapping Help?

We have tried two other methods for preparing the facial component patches used to train the component detectors. One is to define anchor points and the extent of the rectangle; the other is to project 3D landmarks back to the 2D plane. However, unlike training a holistic face detector, which gets by with ordinary methods, the uniformity of the component training set under profile or semi-profile views deteriorates notably compared to that under frontal views. The resulting detectors that we trained achieve at best 81% AP on FDDB. To the best of our knowledge, how to achieve consistency under profile views remains a tricky issue [30]. This motivates us to explore component-invariant mapping, whose performance is further boosted by symmetric component detection for three reasons. First, when a face only exposes the RE/RM component, the LE/LM component detector alone would fail. Second, it is likely that symmetric component detection presents a symmetric but unblocked or simpler view to the detector. Third, symmetric detection obviates the need to train two more detectors and sets of regression parameters. Experiments show that part detectors trained on conventionally cropped patches decrease AP by about 8.2% on FDDB.

6.5 Model Complexity

As shown in Table 1, different tree levels for training have been evaluated, leading to different numbers of training stages and weak classifiers. The training FAR indicates to what extent AVC has converged, but it cannot reflect the performance of the model on the test set. The complexity of the model is measured by aveEval, the average number of NPD features evaluated per detection window. The lower the value of aveEval, the faster the detector. For the sake of speed, this index is important for the choice of our component models.

Table 1. Comparison of model complexity between AVC and NPD

The aveEval values of LE and LM are 24.754 and 26.755 respectively (see Table 1). The total number of features per detection window that AVC has to evaluate is therefore 103.018 with symmetric detection considered, which is fewer than for the NPD holistic face detector implemented in [23], which has 46401 weak classifiers and an aveEval of 114.507. With regard to pose variations and occlusions, AVC also outperforms the NPD detector by a notable margin on FDDB (see Fig. 7(c)). Another advantage of AVC is that the storage memory it requires is low compared to CNN methods, which is crucial for real-world applications. The total model size of AVC is only 2.65 MB, smaller than that of NPD (6.31 MB) or of a typical CNN model.
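The total can be checked by hand, assuming that "with symmetric detection considered" means each of the two detectors is run twice per window, once on the original image and once on its mirror:

$$\begin{aligned} 2 \times (24.754 + 26.755) = 2 \times 51.509 = 103.018, \qquad \frac{114.507 - 103.018}{114.507} \approx 0.100 \end{aligned}$$

Under this reading, AVC evaluates roughly 10% fewer features per detection window than the holistic NPD detector.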

7 Conclusion

In this paper, we proposed a new component-based face detection method called AVC, which addresses pose variations and occlusions simultaneously in a single framework with low complexity. We presented a consistent component definition which helps to achieve pose-invariant component detection. To handle facial occlusions, we only detect visible facial components and build a local to global aggregation strategy to detect the whole face adaptively. Experiments on the FDDB and AFW databases show that the proposed method is robust in handling illumination changes, occlusions and pose variations, achieving much better performance with lower model complexity than the corresponding holistic face detector. The proposed face detector is able to output local facial components as well as mean shape landmarks, which may be helpful for landmark detection initialization and pose estimation. We leave this investigation as future work.