1 Introduction

Face image analysis is a major field in computer vision. We focus on 3D reconstruction of a face given a single still image under occlusions. Since the problem of reconstructing a 3D shape from a single 2D image is inherently ill-posed, a strong object prior is required. Our approach builds on the analysis-by-synthesis strategy. We extend the approach by an integrated segmentation into face, beard and non-face to handle occlusions.

Fig. 1

Illumination dominates facial appearance. We indicate the RMS-distance in color space of different renderings against a target rendering (a). We rendered the same target under new illumination conditions (b-d) and compare them to other changes (e-g). We present a frontal illumination (b), an illumination from the side (c) and a real-world illumination (d). For comparison, we rendered the original face (a) under the original illumination conditions with strong changes in pose (e), shape (f) and texture (g). All those changes (e-g) influence the distance to the target image less than the changes in illumination (b-d). The shown RMS distances caused by illumination are on average 50% higher than those caused by changing other parameters

We work with a 3D morphable face model (3DMM) as originally proposed by Blanz and Vetter (1999) representing the face by shape and color parameters. With illumination and rendering parameters we can synthesize facial images. We adapt all parameters to render an image which is as similar as possible to the target. This process is called fitting.

Most “in the wild” face images depict occluders like glasses, hands, facial hair, microphones and various other things (see Köstinger et al. 2011). Such occlusions are a challenge in an analysis-by-synthesis setting because they mislead the fitting procedure of 3DMMs. Occlusions either make the face model drift away, or distort the shape and color parameters because the face model adapts to the occlusions. Standard methods such as robust error measures are not sufficient for generative face image analysis since they tend to exclude important details like the eyes, eyebrows or mouth region which are more difficult to explain by the face model (see Fig. 2). At the same time, it is impossible to detect occlusions without strong prior knowledge. The challenge of occlusion-aware model adaptation is to explain as much as possible by the face model and to exclude only non-face pixels, while not including occlusions or strong outliers.

In the proposed framework we distinguish between occlusions that are specific to faces, like beards or glasses, and occlusions that are not face specific. Face-specific occlusions are suitable to be modeled explicitly, whereas other objects like microphones, food or tennis rackets are captured by a general occlusion model. In this way, we avoid modeling explicitly all objects that happen to occur near faces. Beards cannot be handled as general occlusion since the illumination model can explain them by a darker illumination. The beard region tends to mislead illumination estimation and the model adaptation. Another challenge with beards is that they share their appearance with parts included in the model like the eyebrows. Compared to other occlusions, beards are a part of the face and therefore occlude the face even in a cooperative setting. In this paper, we model beards explicitly and handle other occlusions with a general occlusion model. The beard prior is composed of a prototype-based shape model and an appearance model derived from the beard region in the target image.

Fig. 2

The effect of illumination on a standard robust technique. (a) depicts the target image from the AR face database by Martinez and Benavente (1998) under homogeneous and inhomogeneous illumination. (b, d) show the segmentation and fitting result using a trimmed evaluator as described in Egger et al. (2016) (considering only the best matching 70% of the pixels). (c, e) depict the segmentation and fitting result of the proposed method. Whilst the segmentation using the trimmed version succeeds only with homogeneous frontal illumination and fails with illumination from the side, our approach is illumination-invariant and succeeds in both cases for segmentation and fitting. (a) target, (b) z, trimmed 70%, (c) z, proposed, (d) \(\theta \), trimmed 70%, (e) \(\theta \), proposed

Fig. 3

The target image (a) contains strong occlusions through sunglasses and facial hair. Non-robust illumination estimation techniques are strongly misled by those occlusions. We present the non-robustly estimated illumination (Schönborn et al. 2017) rendered on the mean face (b) and on a sphere with average face albedo (c). In (d, e) we present the result obtained with our robust illumination estimation technique. The white pixels in (f) are pixels selected for illumination estimation by our robust approach. The target image is from the AFLW database (Köstinger et al. 2011)

Our approach is to segment the image into face, beard and non-face regions. The beard model is coupled to the face model via its location. Non-face regions can arise from occlusions or outliers in the face or background region. We formulate an extended morphable face model that deals explicitly with beards and other occlusions. This enables us to adapt the face model only to regions labeled as face. The beard model is defined on the face surface in 3D and therefore its location in a 2D image is strongly coupled to the face model. This coupling allows the beard model to guide the face model relative to its location. During inference, we also update the segmentation into face, beard and non-face regions. Segmentation and parameter adaptation cannot be performed separately. We require a set of face model parameters as prior for segmentation and a given segmentation for parameter adaptation. We, therefore, use an expectation–maximization-like (EM) algorithm to perform segmentation and parameter adaptation in alternation.

Fig. 4

Algorithm overview: as input, we need a target image and fiducial points. We use the external CLandmark library for automated fiducial point detection from still images (Uřičář et al. 2015). We start with an initial face model fit of our average face with a pose estimation. Then we perform a RANSAC-like robust illumination estimation to initialize the segmentation label z and the illumination setting (for more details see Fig. 9). Then our face model and the segmentation are simultaneously adapted to the target image \(\tilde{I}\). The result is a set of face model parameters \(\theta \) and a segmentation into face and non-face regions. The presented target image is from the LFW face database (Huang et al. 2007)

Previous work on handling occlusions with 3DMMs was evaluated on controlled, homogeneous illumination only. However, complex illuminations are omnipresent in real photographs and illumination determines facial appearance, see Fig. 1. We analyze “in the wild” facial images. They contain complex illumination settings and their occlusions cannot be handled with previous approaches. We show the corrupting effect of illumination using a standard robust technique for face model adaptation in Fig. 2. We also present the effect of occlusions on non-robust illumination estimation techniques in Fig. 3. Therefore, we propose a robust illumination estimation. We incorporate the face shape prior in a RANSAC-like algorithm for robust illumination estimation. The results of this illumination estimation are illumination parameters and an initial segmentation into face and non-face regions. Based on this robust illumination estimation we initialize our face model parameters and the segmentation labels.

The Annotated Facial Landmarks in the Wild database (AFLW) (Köstinger et al. 2011) provides “in the wild” photographs under diverse illumination settings. We estimate the illumination conditions on this database to obtain an unprecedented prior on natural illumination conditions. The prior spans the space of real-world illumination conditions. The obtained data are made publicly available. This prior can be integrated into probabilistic image analysis frameworks like those of Schönborn et al. (2017) and Kulkarni et al. (2015).

Furthermore, the resulting illumination prior is helpful for discriminative methods which aim to be robust against illumination. Training data can be augmented (Jourabloo and Liu 2016; Zhu et al. 2016) or synthesized (Richardson et al. 2016; Kulkarni et al. 2015) using the proposed illumination prior. This is especially helpful for data-greedy methods like deep learning. Those methods already include a 3D Morphable Model as prior for face shape and texture, but currently no illumination prior learned on real-world data is available. The proposed illumination prior is an ideal companion of the 3D Morphable Model, allows the synthesis of more realistic images and is first used in a publicly available data generator (footnote 1) by Kortylewski et al. (2017).

An overview of our full occlusion-aware face image analysis framework is depicted in Fig. 4. The paper focuses on three building blocks: a robust illumination estimation, which can be applied independently or used as initialization; the combined segmentation and face model adaptation framework; and an illumination prior derived from real-world face images.

1.1 Related Work

3DMMs are widely applied for 3D reconstruction in an analysis-by-synthesis setting from single 2D images (Blanz and Vetter 1999; Romdhani and Vetter 2003; Aldrian and Smith 2013; Schönborn et al. 2013; Zhu et al. 2015; Huber et al. 2015). Recently, methods based on convolutional neural networks have emerged for the task of 3D reconstruction using 3D morphable face models. Apart from the work of Tewari et al. (2017), they reconstruct only the shape and do not estimate facial color or the illumination in the scene. The work of Tewari et al. is a model adaptation framework which can be trained end-to-end. It is not robust to occlusions and states this as a hard challenge. Although occlusions are omnipresent in face images, most research using 3DMMs relies on occlusion-free data. There exist only a few approaches for fitting a 3DMM under occlusion. Standard robust error measures are not sufficient for generative face image analysis. Areas like the mouth or eye regions tend to be excluded from the fitting because of their strong variability in appearance (Romdhani and Vetter 2003; De Smet et al. 2006), and robust error measures like those applied in Pierrard and Vetter (2007) are highly sensitive to illumination. Therefore, we explicitly cover as many pixels as possible by the face model.

De Smet et al. (2006) learned an appearance distribution of the observed occlusion per image. This approach focuses on large-area occlusions like sunglasses and scarves. However, it is sensitive to appearance changes due to illumination and cannot handle thin occlusions. The work relies on manually labeled fiducial points and is optimized to the database.

Morel-Forster (2017) predicts facial occlusions by hair using a random forest. The detected occlusions are integrated into the image likelihood but stay constant during model adaptation.

The recent work of Saito et al. (2016) explicitly segments the target image into face and non-face regions using convolutional neural networks. Compared to our method it relies on additional training data for this segmentation. The model adaptation does not include color or illumination estimation and is performed in a tracking setting incorporating multiple frames.

The work of Huang et al. (2004) combines deformable models with Markov random fields (MRF) for segmentation of digits. The beard prior in our work is integrated in a similar way as they incorporate a prior from deformable models. The method of Wang et al. (2007) also works with a spherical harmonics illumination model and a 3D morphable model. The face is also segmented into face and non-face regions to handle occlusions using MRFs. They work on grayscale images in a highly controlled setting and focus on relighting of frontal face images. They do not show any segmentation results which we can compare to. The idea of having different models competing to explain different regions of the image is related to the image parsing framework proposed by Tu et al. (2005); our implementation is unique for face image analysis.

Maninchedda et al. (2016) proposed joint reconstruction and segmentation working with 3D input data. The semantic segmentation is also used to improve the quality of face reconstruction. The general challenges are related but they used depth information, which is not available in our setting. Depth information strongly helps when searching for occlusions, beards or glasses.

Previous work on occlusion-aware 3DMM adaptation focused on databases with artificial and homogeneous, frontal illumination settings. We present a model which can handle occlusion during 3DMM adaptation under illumination conditions arising in “in the wild” databases.

Yildirim et al. (2017) presented a generative model including occlusions by various objects. 3D occlusions are included in the training data. During inference, the input image is decomposed into face and occluding object and occlusions are excluded for face model adaptation.

Similar models have recently been proposed beyond applications of the 3DMM: in medical imaging for atlas-based segmentation of leukoaraiosis and strokes from MRI brain images (Dalca et al. 2014) and for model-based forensic shoe-print recognition from highly cluttered and occluded images (Kortylewski 2017).

Robust illumination estimation or inverse lighting is an important cornerstone of our approach. Inverse lighting (Marschner and Greenberg 1997) is an inverse rendering technique which tries to reconstruct the illumination condition. Inverse rendering is applied to scenes (Barron and Malik 2015) or specific objects. For faces, the 3D Morphable Model by Blanz and Vetter (1999) is the most prominent technique used in inverse rendering settings. The recent work of Shahlaei and Blanz (2015) focuses on illumination estimation and provides a detailed overview of face-specific inverse lighting techniques. The main focus of the presented methods is face model adaptation in an analysis-by-synthesis setting. Those methods are limited either to near-ambient illumination conditions (De Smet et al. 2006; Pierrard and Vetter 2007) or cannot handle occlusions (Romdhani and Vetter 2003; Aldrian and Smith 2013; Schönborn et al. 2017).

Our robust illumination estimation technique handles both occlusions and complex illumination conditions by approximating the environment map using a spherical harmonics illumination model. Few methods incorporate prior knowledge of illumination conditions. The most sophisticated priors are multivariate normal distributions learned on spherical harmonics parameters estimated from data, as proposed in Schönborn et al. (2017) and Barron and Malik (2015). Those priors are less general and not available to the research community. Our robust estimation method enables us to estimate an illumination prior from available face databases. This illumination prior fills a gap for generative models.

This work extends our previous work (Egger et al. 2016). We additionally integrate an explicit beard model and a prior during segmentation. The Chan–Vese segmentation (Chan and Vese 2001) was therefore replaced by a Markov random field segmentation, which supports multi-class segmentation and can easily integrate a prior on the segmentation label. The explicit modeling of beards improves the quality of fits compared to handling beards as occlusion. The robust illumination estimation technique was proposed in our previous work but not explained in detail; we add a complete description of the algorithm and an evaluation of its robustness. As an application of this work, we obtain and publish the first illumination prior available to the research community.

2 Methods

In this section, we present a method for combining occlusion segmentation and 3DMM adaptation into an occlusion-aware face analysis framework (Fig. 4). Our segmentation distinguishes between face, beard, and non-face. The initialization of the segmentation is estimated by a RANSAC strategy for robust illumination estimation.

Our approach is based on five main ideas: First, we exclude occlusions during the face model adaptation. The face model should be adapted only to pixels belonging to the face. Second, we model beards explicitly. The segmentation of the beard on a face can guide face model adaptation. Third, we semantically segment the target image into occlusions, beards and the face. We pose segmentation as a Markov random field (MRF) with a beard prior. Fourth, we robustly estimate illumination for initialization. Illumination dominates facial appearance and has to be estimated to find occlusions. Fifth, we perform face model adaptation and segmentation at the same time using an EM-like procedure. The face model adaptation assumes a given segmentation and vice-versa.

2.1 Image Model

The 3D morphable face model was first described by Blanz and Vetter (1999). Principal component analysis (PCA) is applied to build a statistical model of face shape and color. Faces are synthesized by rendering them with an illumination and camera model. We work with the publicly available Basel Face Model (BFM) presented by Paysan et al. (2009). The model was interpreted in a Bayesian framework using Probabilistic PCA by Schönborn et al. (2017).

The aim of face model adaptation (fitting) is to synthesize a face image which is as similar to the target image as possible. A likelihood model is used to rate face model parameters \(\theta \) given a target image. The parameter set \(\theta \) consists of the color transform, camera (pose), illumination and statistical parameters describing the face. The likelihood model is a product over the pixels i of the target image \(\tilde{I}\). In the formulation of Schönborn et al. (2017), pixels belong to the face model (\(\mathcal {F}\)) or the background model (\(\mathcal {B}\)). The foreground and background likelihoods (\(\ell _{\mathrm{face}}, b\)) compete to explain pixels in the image. The full likelihood model covering all pixels i in the image is

$$\begin{aligned} \ell \left( { \theta ; \tilde{I} }\right) = \prod _{i \in \mathcal {F}} \ell _{\mathrm{face}} \left( { \theta ; \tilde{I}_i }\right) \prod _{i' \in \mathcal {B}} b\left( {\tilde{I}_{i'} }\right) . \end{aligned}$$
(1)

The foreground \(\mathcal {F}\) is defined solely by the position of the face model (see Fig. 5) and therefore this formulation cannot handle occlusions.

Fig. 5

On the left, the regions used by the likelihood model of Schönborn et al. (2013) are shown. Each pixel belongs to the face model region \(\mathcal {F}\) or the background model region \(\mathcal {B}\). Assignment to foreground or background is based on the face model visibility only. In the proposed framework we have the same labels \(\mathcal {F}\) and \(\mathcal {B}\) but additional segmentation variables z to integrate occlusions as shown on the right. We assign a label z indicating if the pixel belongs to face, beard or non-face. Occlusions in the face model region \(\mathcal {F}\) (in this case glasses) can hereby be excluded from the face model adaptation. Beards are handled explicitly and labeled separately

2.2 Occlusion Extension of the Image Model

We extend (1) to handle occlusion. We distinguish between beards and other sources of occlusions. Therefore, we introduce a random variable z, indicating the class k a pixel belongs to. The standard likelihood model (1) is therefore extended to incorporate different classes:

$$\begin{aligned} \ell \left( { \theta ; \tilde{I}, z }\right) = \prod _{i} \prod _{k} \ell _{k} \left( { \theta ; \tilde{I}_i }\right) ^{z_{ik}} \end{aligned}$$
(2)

with \(\sum _{k} z_{ik} = 1 \forall i\) and \(z_{ik} \in \{0,1\}\).

The likelihood model is open for various models for different parts of the image. In this work we use three classes k, namely face (\(z_\text {face}\)), beard (\(z_\text {beard}\)) and non-face (\(z_\text {non-face}\)). In Fig. 5 we present all different labels and regions.

The main difference to the formulation by Schönborn et al. (2017) is that the face model does not have to fit all pixels in the face region. Pixels in the image are evaluated by different likelihoods \(\ell _{k}\) for the respective class models k.

For face model adaptation we choose the strongest label z for every pixel, \(\max _k z_{ik}\). The generative face model is adapted to pixels with the label \(z_\text {face}\) only, according to (2). Beard and other non-face pixels are handled by separate likelihoods during face model adaptation. Non-face pixels are only characterized by a low face and beard likelihood. Thus, they can be outliers, occlusions or background pixels. A graphical model of the full occlusion-aware Morphable Model is depicted in Fig. 6.
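As a minimal illustration of how (2) can be evaluated in the log domain, the following sketch assumes that per-pixel likelihood maps for the classes face, beard and non-face are already available; all array names are illustrative and simplified.

```python
import numpy as np

def extended_log_likelihood(likelihoods, z):
    """likelihoods: dict class name -> (H, W) per-pixel likelihood map.
    z: dict class name -> (H, W) binary map, one-hot across classes per pixel."""
    log_l = 0.0
    for k, lik in likelihoods.items():
        # a pixel contributes the likelihood of class k only where z_ik = 1
        log_l += np.sum(z[k] * np.log(lik + 1e-12))
    return log_l

def strongest_labels(likelihoods):
    """Pick the strongest label per pixel, as used during face model adaptation."""
    classes = list(likelihoods.keys())
    stacked = np.stack([likelihoods[k] for k in classes], axis=0)  # (K, H, W)
    return np.argmax(stacked, axis=0), classes
```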

Fig. 6

Graphical model of our occlusion-aware Morphable Model. We observe the target image \(\tilde{I}\) under Gaussian noise \(\sigma \). To explain different pixels i by different models we introduce a class segmentation label z for each pixel. The segmentation uses a beard prior c which is explained in more detail in Sect. 2.3. The pixels labeled as face are explained by the face model parameters \(\theta \)

Fig. 7

Graphical model of our Markov Random field with beard prior c. We are interested in the segmentation labels \(z_i\) and observe the image pixels \(\tilde{I_i}\). Every label \(z_i\) at position i is connected to the prior c and its neighbours

Fig. 8

The seven beard prototypes derived from k-means\({++}\) clustering on manual beard segmentations on the Multi-PIE database (blue labels). We manually added a prototype for no-beard and one to handle occlusions over the complete beard region (right). The prototypes are defined on the 3D face model and can be rendered to the image according to the face model parameters \(\theta \) (Color figure online)

2.3 Segmentation

To estimate the label z for a given parameter set \(\theta \) we use an extension of the classic Markov random field segmentation technique including a beard prior, similar to Huang et al. (2004), see Fig. 7. The beard prior c is a prior on the labels z: \(P(z|\theta ,c)\).

The beard prior is a simple prior based on m prototype shapes \(l \in \{1,\ldots ,m\}\). We use an explicit beard prior, which is defined on the face model (see Fig. 8). Its location in the image depends on the current camera and shape model parameters contained in \(\theta \). For the non-face and face label, we use a uniform prior.

The MRF is formulated as follows:

$$\begin{aligned} \begin{aligned}&P(z|\tilde{I}, \theta ) \propto \\&\quad \prod _i \prod _k \ell _{k}\left( { \theta ; \tilde{I}_i }\right) ^{z_{ik}} {P(z_{ik}|\theta )}P(c) \prod _{j \in n(i)} P(z_{ik}, z_{jk}). \end{aligned} \end{aligned}$$
(3)

The data term is built from the likelihoods of all classes k over all pixels i and is combined with the beard prior. The smoothness assumption enforces spatial contiguity of all pixels j which are neighbors n(i) of i.

The prior for beards is a prior over the segmentation label z at a certain pixel position i. The label z depends on the beard prior P(c). The prior is based on prototype shapes defined on the face surface. The segmentation prior c is derived from the current set of face model parameters \(\theta \) since the pose and shape of the face influence its position and shape in the image. We derived the prototypes from manual beard segmentations labeled on the Multi-PIE database (Gross et al. 2010). We used the k-means++ clustering technique as proposed in Arthur and Vassilvitskii (2007) to derive a small set of prototypes. The resulting prototypes are shown in Fig. 8. We manually added a prototype for no-beard and another one to handle occlusion of the complete beard region. Those additional prototypes allow us to consider all possible labels in the beard region of the face. All prototypes are defined on the face surface and their position in the image depends on the current face model parameters \(\theta \).
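The following sketch illustrates how the data term, the prototype-based prior and the smoothness term of (3) interact. A few iterated-conditional-modes sweeps serve here purely as a simplified stand-in; the actual inference uses loopy belief propagation (Sect. 2.5). Array names and the Potts-style smoothness weight are illustrative.

```python
import numpy as np

def icm_segmentation(log_lik, log_prior, beta=1.0, sweeps=5):
    """log_lik: (K, H, W) per-class log data term (the class likelihoods of (3)).
    log_prior: (K, H, W) per-class log prior (e.g. the rendered beard prototype)."""
    K, H, W = log_lik.shape
    labels = np.argmax(log_lik + log_prior, axis=0)   # initialization without smoothness
    for _ in range(sweeps):
        scores = log_lik + log_prior
        for k in range(K):
            # count 4-neighbours currently carrying label k (Potts-style smoothness)
            same = np.zeros((H, W))
            same[1:, :] += labels[:-1, :] == k
            same[:-1, :] += labels[1:, :] == k
            same[:, 1:] += labels[:, :-1] == k
            same[:, :-1] += labels[:, 1:] == k
            scores[k] += beta * same
        labels = np.argmax(scores, axis=0)
    return labels
```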

2.4 Likelihood Models

Depending on the class label z we apply different likelihood models during model adaptation for each pixel. The same likelihoods are used for segmentation.

2.4.1 Face Likelihood

The likelihood of pixels to be explained by the face model is the following:

$$\begin{aligned} \ell _{\mathrm{face}} \left( { \theta ; \tilde{I}_i}\right) = {\left\{ \begin{array}{ll} \frac{1}{N} \exp \left( - \frac{1}{2 \sigma ^2} \left||\tilde{I}_i - I_i(\theta ) \right||^2 \right) &{} \text {if}\,i \in \mathcal {F}\\ \frac{1}{\delta } h_f(\tilde{I}_i, \theta ) &{} \text {if}\,i \in \mathcal {B}. \end{array}\right. } \end{aligned}$$
(4)

Pixels are evaluated by the face model if they are labeled as face (\(z_\text {face}\)) and are located in the face region \(\mathcal {F}\). The rendering function I generates a face for given parameters \(\theta \). This synthesized image \(I_i(\theta )\) is compared to the observed image \(\tilde{I}\). The likelihood model for pixels covered by the face model assumes per-pixel Gaussian noise in the face region. The likelihood \(\ell _{\mathrm{face}}\) is defined over the whole image and therefore also in the non-face region \(\mathcal {B}\). For those pixels that are not covered by the generative face model, we use a simple color model to compute the likelihood of the full image. We use a color histogram \(h_f\) with \(\delta \) bins estimated on all pixels in \(\mathcal {F}\) labelled as face (\(z_\text {face}\)).
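A simplified sketch of (4), assuming the rendered image \(I(\theta )\), the face-region mask and the histogram \(h_f\) are given; constant normalization factors are dropped and the names are illustrative.

```python
import numpy as np

def face_log_likelihood(target, rendered, face_region, hist_face, sigma, n_bins=16):
    """target, rendered: (H, W, 3) colors in [0, 1]; face_region: (H, W) bool mask
    of pixels covered by the rendered face; hist_face: normalized RGB histogram."""
    # Gaussian term for pixels inside the face region F (log, up to constants)
    gauss = -0.5 * np.sum((target - rendered) ** 2, axis=-1) / sigma ** 2
    # histogram lookup for pixels outside F
    idx = np.clip((target * n_bins).astype(int), 0, n_bins - 1)
    hist = np.log(hist_face[idx[..., 0], idx[..., 1], idx[..., 2]] + 1e-12)
    return np.where(face_region, gauss, hist)
```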

2.4.2 Beard Likelihood

The beard likelihood is a simple histogram color model. The histogram \(h_b\) with \(\delta \) bins is estimated on the current belief on \(z_\text {beard}\):

$$\begin{aligned} \ell _{\mathrm{beard}} \left( { \theta ; \tilde{I}_i}\right) = \frac{1}{\delta } h_b(\tilde{I}_i, \theta ). \end{aligned}$$
(5)

In Egger (2017) we also proposed a discriminative approach based on hair detection as beard likelihood. This likelihood can be extended to more elaborate appearance and shape models as proposed in Le et al. (2015) or Nguyen et al. (2008).

2.4.3 Occlusion and Background Likelihood

The likelihood of the non-face model to describe occluding and background pixels is the following:

$$\begin{aligned} \ell _{\text {non-face}} \left( { \theta ; \tilde{I}_i}\right) = b\left( {\tilde{I}_{i} }\right) = \frac{1}{\delta } h_{\tilde{I}}(\tilde{I}_i). \end{aligned}$$
(6)

This likelihood allows us to integrate simple color models based on the observed image \(\tilde{I}\). We use a simple color histogram estimated on the whole image \(\tilde{I}\) to estimate the background likelihood as described by Egger et al. (2014).
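Both (5) and (6) rely on simple normalized color histograms. The following sketch shows one way to estimate such a histogram and to evaluate it per pixel; the bin count and names are illustrative.

```python
import numpy as np

def fit_color_histogram(pixels, n_bins=16):
    """pixels: (N, 3) colors in [0, 1]; returns a normalized 3D RGB histogram."""
    hist, _ = np.histogramdd(pixels, bins=n_bins, range=[(0, 1)] * 3)
    return hist / max(hist.sum(), 1.0)

def histogram_likelihood(image, hist, n_bins=16):
    """image: (H, W, 3) colors in [0, 1]; returns per-pixel likelihood under the histogram."""
    idx = np.clip((image * n_bins).astype(int), 0, n_bins - 1)
    return hist[idx[..., 0], idx[..., 1], idx[..., 2]]
```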

2.5 Inference

The full model consists of the likelihoods for face model adaptation shown in (2) and for segmentation from (3). Those equations depend on each other. To estimate the segmentation label z we assume a given set of face model parameters \(\theta \), and vice versa we assume a given segmentation label z to adapt the face model parameters \(\theta \). Neither is known in advance; both are adapted during the inference process to obtain a MAP estimate of the face model parameters and the segmentation. We use an EM-like algorithm (Dempster et al. 1977) for alternating inference of the full model. In the expectation step, we update the label z of the segmentation. In the maximisation step, we adapt the face model parameters \(\theta \). An overview of those alternating steps is illustrated in Fig. 4.
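The alternating scheme can be summarized by the following sketch, where `adapt_face_model` stands for the MCMC face model adaptation (M-step) and `segment` for the MRF segmentation (E-step); both function names are placeholders for the components described in this paper.

```python
def occlusion_aware_fit(target, theta_init, z_init, n_rounds=5):
    """EM-like alternation between face model adaptation and segmentation."""
    theta, z = theta_init, z_init
    for _ in range(n_rounds):
        theta = adapt_face_model(target, theta, z)   # M-step: update theta given labels z
        z = segment(target, theta)                   # E-step: update labels z given theta
    return theta, z
```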

Face model adaptation is performed by a Markov Chain Monte Carlo strategy (Schönborn et al. 2017) with our extended likelihood from (2). MRF segmentation is performed using loopy belief propagation with a sum-product algorithm as proposed by Murphy et al. (1999).

The histogram-based appearance model of beards is adapted during segmentation and fitting. During the segmentation, the appearance is updated respecting the current belief on \(z_i\). During the fitting, the beard appearance is also updated due to the spatial coupling with the face model. When the shape or camera parameters of the face model change, the beard model has to be updated.

During segmentation, we assume fixed parameters \(\theta \) and during fitting, we assume given labels \(\max _k z_{ik}\). Since the fixed values are only an approximation during the optimisation process, we account for those uncertainties. These misalignments are especially important in regions like the eye, nose, and mouth. These regions are often mislabelled as occlusion due to their high variability in appearance when using other robust error measures. In the inference process, those regions are gradually and automatically incorporated by adapting the face and non-face likelihoods to this uncertainty. To account for the uncertainty of the face model parameters \(\theta \) during segmentation, we adapt the face model likelihood for segmentation [compare to (4)]:

$$\begin{aligned} \ell '_{\mathrm{face}} ( \theta ; \tilde{I}_i) = \frac{1}{N} \exp \left( - \frac{1}{2 \sigma ^2} \min _{j \in n(i)} \left||\tilde{I}_i - I_{i,j}(\theta ) \right||^2 \right) . \end{aligned}$$
(7)

The small misalignment of the current state of the fit is taken into account by the neighbouring pixels j in the target image. In our case we take the minimum over a patch of \(9\times 9\) neighbouring pixels (the interpupillary distance is \(\sim \,120\) pixels).
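A sketch of (7), where the squared color error at a pixel is replaced by the minimum over relative offsets within a 9×9 patch; the padding mode and names are illustrative and constant factors are dropped.

```python
import numpy as np

def robust_face_log_likelihood(target, rendered, sigma, radius=4):
    """target, rendered: (H, W, 3); returns the per-pixel log-likelihood of (7),
    taking the minimum error over a (2*radius+1)^2 patch of relative offsets."""
    H, W, _ = target.shape
    pad = np.pad(target, ((radius, radius), (radius, radius), (0, 0)), mode='edge')
    min_err = np.full((H, W), np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = pad[radius + dy:radius + dy + H, radius + dx:radius + dx + W]
            err = np.sum((shifted - rendered) ** 2, axis=-1)
            min_err = np.minimum(min_err, err)
    return -0.5 * min_err / sigma ** 2
```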

To account for the uncertainty of the segmentation label z for face model adaptation, we adapt the likelihood of the non-face during face model adaptation [compare to (6)]. Pixels which are masked as non-face can be explained by the face model if it can do better:

$$\begin{aligned} \ell '_{\text {non-face}} \left( { \theta , \tilde{I}_i}\right) = \max \left( \ell _{\mathrm{face}}\left( { \theta , \tilde{I}_i}\right) , b\left( {\tilde{I}_{i} }\right) \right)&\text {if}\,i \in \mathcal {F}. \end{aligned}$$
(8)

Both modifications make it more likely that pixels are labeled as face, which leads to their consideration during face model adaptation.

Fig. 9
figure 9

Robust illumination estimation: our RANSAC-like algorithm (compare Algorithm 1) takes the target image and a pose estimation as input. We added a strong occlusion (white bar) for better visualization. The algorithm iteratively samples points on the face surface (step 1). We estimate the illumination from the appearance of those points in the target image (step 2). The estimation is then evaluated by comparing the model color to the target image and thresholding its likelihood (step 3). The illumination estimation is misled when occluded regions are included (red points). Choosing good points (green points) leads to a reasonable estimation. If the consensus set is big enough we re-estimate the illumination on the full consensus set. We repeat the estimation for different point sets and the most consistent one is chosen as the result (Color figure online)

2.6 Initialisation and Robust Illumination Estimation

In the early steps of face model adaptation under occlusion, we need an initial label z. Occlusions are however hard to determine in the beginning of the face model adaptation due to the strong influence of illumination on facial appearance (see Figs. 1, 2).

Algorithm 1 Robust illumination estimation (pseudo-code; see Fig. 9 for a visual overview)

The main requirement for our illumination estimation method is robustness against occlusions. The idea is to find the illumination setting which most consistently explains the observed face in the image. With a set of points with known albedo and shape, we estimate the illumination condition.

Occlusions and outliers render this task ill-posed and mislead non-robust illumination techniques. The set of points used for the estimation should not contain outliers or occlusions. We use an adapted random sample consensus algorithm (RANSAC) (Fischler and Bolles 1981) for robust model fitting. We synthesize illumination conditions estimated on different point sets and find the illumination parameters most consistent with the target image. The steps of our procedure are visualized in Fig. 9 and given as pseudo-code in Algorithm 1.

The idea of the RANSAC algorithm adapted to our task is to randomly select a set of points on the face surface which are visible in the target image \(\tilde{I}\) and to estimate the illumination parameters \(L_{{ lm}}\) from the appearance of those points (steps 1 and 2 of Algorithm 1). The estimated illumination is then evaluated on all available points (step 3). The full set of points consistent with this estimation is called the consensus set. Consistency is measured by counting the pixels of the target image which are explained well by the current illumination setting. If the consensus set contains more than 40% of the visible surface points, the illumination is re-estimated on 1000 random points from the full consensus set for a better approximation. If this illumination estimation is better than the last best estimation according to Eq. (4), it is kept as the currently best estimation.
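A condensed sketch of this loop is given below, assuming arrays of albedo, normals and observed target colors for the visible surface points, a helper `estimate_sh_illumination` (the linear solve of Sect. 2.6.1, sketched there) and a helper `render_sh` predicting point colors under an illumination. The defaults mirror the values reported in Sect. 3.1, but the code and its names are illustrative only.

```python
import numpy as np

def ransac_illumination(albedo, normals, observed, n=30, t=2 * 0.043,
                        min_consensus=0.4, n_iter=500, seed=0):
    """albedo, observed: (N, 3) colors of visible surface points; normals: (N, 3)."""
    rng = np.random.default_rng(seed)
    best_light, best_inliers = None, -1
    n_points = len(observed)
    for _ in range(n_iter):
        sample = rng.choice(n_points, size=n, replace=False)              # step 1
        light = estimate_sh_illumination(albedo[sample], normals[sample],
                                         observed[sample])                # step 2
        err = np.linalg.norm(render_sh(albedo, normals, light) - observed, axis=-1)
        consensus = err < t                                               # step 3
        if consensus.mean() > min_consensus:                              # re-estimate
            refit = rng.choice(np.flatnonzero(consensus),
                               size=min(1000, int(consensus.sum())), replace=False)
            light = estimate_sh_illumination(albedo[refit], normals[refit],
                                             observed[refit])
            err = np.linalg.norm(render_sh(albedo, normals, light) - observed, axis=-1)
            consensus = err < t
        if consensus.sum() > best_inliers:                                # keep the best
            best_light, best_inliers = light, consensus.sum()
    return best_light
```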

The sampling on the face surface (step 1 of Algorithm 1) considers domain knowledge. There are some occlusions which often occur on faces like facial hair or glasses. Therefore we include a region prior to sample more efficiently in suitable regions. Details on how to obtain this prior are in Sect. 2.6.2.

Most “in the wild” face images contain compression artifacts (e.g. from the jpeg file format). To reduce those artifacts and noise we blur the image in a first step with a Gaussian kernel (\(r=4\) pixels with an image resolution of \(512 \times 512\) pixels).

2.6.1 Illumination Model

The proposed algorithm is not limited to a specific illumination model. The main requirement is that it should be possible to estimate the illumination from a few points with given shape, albedo, and appearance. The spherical harmonics illumination model has two main advantages. First, it is able to render natural illumination conditions by approximating the environment map. Second, the illumination estimation from a set of points can be solved as a system of linear equations.

Spherical harmonics allow an efficient representation of an environment map with a small set of parameters \(L_{{ lm}}\) (Ramamoorthi and Hanrahan 2001; Basri and Jacobs 2003). The radiance function is parametrized through real spherical harmonics basis functions \(Y_{{ lm}}\). The radiance \(p_j^c\) per color channel c and for every point j on the surface is derived from its albedo \(a_j\) and surface normal \(\mathbf {n}_j\) and the illumination parameters \(L_{{ lm}}\):

$$\begin{aligned} p_j^c = a_j^c \sum _{l=0}^2 \sum _{m=-l}^l Y_{{ lm}}(\mathbf {n}_j) L_{{ lm}}^c \alpha _l. \end{aligned}$$
(9)

The expansion of the convolution with the reflectance kernel is given by \(\alpha _l\); for details, refer to Basri and Jacobs (2003). We use Phong shading and interpolate the surface normal for each pixel. As the light model is linear (Eq. 9), the illumination expansion coefficients \(L_{{ lm}}^c\) are estimated directly by solving a linear system (least squares) with given geometry, albedo and observed radiance as described by Zivanov et al. (2013). This system of linear equations is solved during the RANSAC algorithm using a set of vertices j.
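A sketch of this linear estimation: the nine real spherical-harmonics basis values for each normal, scaled by the kernel coefficients \(\alpha _l\) and the albedo, form a linear system whose least-squares solution yields the coefficients \(L_{{ lm}}^c\) per color channel. The basis constants follow Ramamoorthi and Hanrahan (2001); function and variable names are illustrative.

```python
import numpy as np

ALPHA = np.array([np.pi, 2 * np.pi / 3, np.pi / 4])    # clamped-cosine kernel per band
BAND = np.array([0, 1, 1, 1, 2, 2, 2, 2, 2])            # band index l of each basis function

def sh_basis(normals):
    """Real spherical harmonics (bands 0-2) for unit normals of shape (N, 3)."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        np.full_like(x, 0.282095),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z ** 2 - 1),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2),
    ], axis=1)                                           # (N, 9)

def estimate_sh_illumination(albedo, normals, observed):
    """Least-squares estimate of the (9, 3) coefficients L_lm^c from point samples."""
    basis = sh_basis(normals) * ALPHA[BAND]              # includes the alpha_l factors
    coeffs = np.empty((9, 3))
    for c in range(3):                                   # one linear system per channel
        A = basis * albedo[:, c:c + 1]                   # radiance = albedo * (basis @ L)
        coeffs[:, c], *_ = np.linalg.lstsq(A, observed[:, c], rcond=None)
    return coeffs

def render_sh(albedo, normals, coeffs):
    """Predicted radiance of surface points under the given illumination coefficients."""
    return albedo * ((sh_basis(normals) * ALPHA[BAND]) @ coeffs)
```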

2.6.2 Region Prior

Facial regions differ in appearance and exhibit strong variations. Whilst some regions like the eyebrows or the beard region vary strongly between different faces, other regions like the cheeks stay more constant. Also, common occlusions through glasses or beards strongly influence facial appearance. Regions with low appearance variation are more suitable for illumination estimation than those with stronger variation. We restrict the samples in the first step of the RANSAC algorithm to the most constant regions. The regions with strong variation are excluded from sample generation (step 1 of Algorithm 1) but included in all other steps.

We estimate the texture variation on the Multi-PIE database (Gross et al. 2010) which contains faces with glasses and beards under controlled illumination. We chose the images with frontal pose (camera 051) and with frontal, almost ambient illumination (flash 16) from the first session. We chose this subset to exclude all variation in illumination and pose (which we model explicitly) from our prior. The variation is estimated on all 330 identities. The images are mapped onto the face model surface by adapting the Basel Face Model to each image with the approach by Schönborn et al. (2017). For the first step of the illumination estimation algorithm, we use the regions of the face where the texture variation is below half of the strongest variation. The variation and the resulting mask are depicted in Fig. 10.
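The region prior amounts to a simple threshold on the per-vertex appearance variance; a minimal sketch, assuming the registered per-vertex textures are given as an array with illustrative names:

```python
import numpy as np

def region_prior_mask(textures):
    """textures: (n_faces, n_vertices, 3) per-vertex colors in correspondence.
    Returns a boolean mask of vertices suitable for sampling in step 1."""
    variance = textures.var(axis=0).mean(axis=-1)   # average variance over color channels
    return variance < 0.5 * variance.max()          # keep the most constant regions
```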

Fig. 10

We derive a region prior from the average variance of appearance over all color channels per vertex on the face surface (on the left), details in Sect. 2.6.2. Scaling is normalized by the maximal observed variance. On the right, the mask obtained by thresholding at half of the maximal observed variance. We use the white regions to sample points in the first step of our algorithm. Note that especially multi-modal regions which may or may not contain facial hair or glasses are excluded

2.6.3 Illumination and Segmentation Initialisation

Our robust illumination estimation gives a rough estimate of the illumination and segmentation. However, the obtained mask underestimates the face region. Especially the eye, eyebrow, and mouth regions are not included in this first estimate. Those regions differ from the skin regions of the face by their higher variance in appearance; they will be gradually incorporated during the full model inference.

The initialization of the beard model is derived from the segmentation obtained by robust illumination estimation. The prior is initialized by the mean of all beard prototypes. The appearance is estimated from the pixels in the prototype region segmented as non-face by the initial segmentation.

3 Experiments and Results

We present results measuring the performance of our initialization and the full model adaptation separately. The robust illumination estimation plays an important role since later steps rely on a proper initialization. For the full model, we separately evaluate the segmentation and quality of the fit.

The software implementation is based on the Statismo (Lüthi et al. 2012) and Scalismo (footnote 2) software frameworks. Our source code will be released within the Scalismo-faces project (footnote 3). We also prepared tutorials with working examples (footnote 4).

3.1 Robustness of Illumination Estimation

To evaluate the robust illumination estimation we performed our experiments on faces using a spherical harmonics illumination model; the algorithm, however, is open to various objects and illumination models. We use synthetic data to evaluate the sensitivity against occlusion. Our algorithm assumes a given pose, so we investigate the errors introduced by this pose estimation. We exclude appearance variations by using a simple prior for shape and texture and also examine the sensitivity to shape and texture changes. Besides the quantitative evaluation on synthetic data in this section, the performance of our method is evaluated on real-world face images in a qualitative experiment in Sect. 3.1.1. Last, we present a novel illumination prior learned on empirical data in Sect. 3.1.2.

We rely on the mean face of the Basel Face Model (Paysan et al. 2009) as prior for face shape and texture in all experiments. We use \(n=30\) points for the illumination estimation in step 2. We use \(\sigma =0.043\), estimated on how well the Basel Face Model is able to explain a face image (Schönborn et al. 2017), and threshold the points for the consensus set at \(t=2\sigma \). We re-estimate the illumination on the full consensus set if it contains more than \(x=40\%\) of the surface points visible in the rendering. We stop the algorithm after \(m=500\) iterations.

Fig. 11

Two examples of our synthetic data to measure the robustness of our approach against occlusions. In the successful case (a-f), the consensus set is perfect and the occlusion is excluded from the illumination estimation. In the failed case (g-l) the chosen occlusion is similar in color appearance to the observed face appearance. This leads the best consensus set to fail in explaining the correct region of the image. The first example (a-f) is a synthesized target image with 30% of the face occluded. The second example has 60% occlusion (g-l). The target image is shown in (a, g). The ground truth illumination is rendered on a sphere for comparison (b, h). The ground truth occlusion map is rendered in (c, i). The baseline illumination estimation, estimated on 1000 random points, is shown in (d, j). Our robust illumination estimation result (e, k), as well as the estimated mask, is shown in (f, l). Together with the visual result, we indicate the measured RMS-distance on the rendered sphere in color space

We measure on synthetic data how much occlusion our algorithm is able to handle. We also investigate how robust the algorithm is against pose misalignments and how much our simplified shape and texture prior influences the result. We need ground truth albedo and illumination and therefore generate synthetic data. We use the mean shape and texture of the Basel Face Model as an object and render it under 50 random illumination conditions.

Fig. 12

We measured the illumination estimation error of our approach relative to the degree of occlusion. We compare randomly generated illumination conditions with those arising from real-world settings. We observe that our algorithm is robust up to 40% occlusion

Fig. 13

We measured the illumination estimation error of our approach relative to pose estimation errors. Our algorithm handles the pose deviations which arise from erroneous pose estimation of the input

Fig. 14

We measured the illumination estimation error of our approach related to shape and texture changes. Even with a very simple appearance prior we reasonably estimate the illumination condition

For the first experiment, we add a synthetic random occlusion to this data. The random occlusion is a block with a random color. Note that this type of occlusion is a worst-case occlusion. The synthetic occlusions are positioned randomly on the face. We randomly generate illumination parameters \(L_{{ lm}}\) to synthesize illumination conditions. We sample all parameters according to a uniform distribution between \(-1\) and 1. An example of the synthesized data is depicted in Fig. 11. We estimate the illumination condition on these data using our robust illumination estimation technique to measure the robustness. We measure the approximation error as the RMS-distance in color space between the sphere rendered with the estimated illumination condition and the sphere rendered with the ground truth illumination condition, as proposed by Barron and Malik (2015).
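For reference, this error measure can be written as follows, where the two inputs are the spheres rendered under the estimated and the ground-truth illumination and the optional mask selects the sphere pixels; this is a sketch of one plausible definition of the RMS distance in color space.

```python
import numpy as np

def rms_color_distance(render_est, render_gt, mask=None):
    """render_est, render_gt: (H, W, 3) renderings; mask: optional (H, W) bool of sphere pixels."""
    diff = render_est - render_gt
    if mask is not None:
        diff = diff[mask]
    return np.sqrt(np.mean(diff ** 2))   # RMS over all pixels and color channels
```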

Fig. 15

A qualitative evaluation of the illumination estimation on the AFLW database (Köstinger et al. 2011). We show the target image (a), pose initialization (b), the obtained inlier mask (c), the mean face of the Basel Face Model rendered with the estimated illumination condition (d) and a sphere with the average face albedo rendered with the estimated illumination condition (e). The sphere provides a normalized rendering of the illumination condition. We observe that glasses, facial hair and various other occlusions are excluded from the illumination estimation. At the same time, we see minor limitations: things that are not well explained by our simplified illumination model (like specular highlights or cast shadows) or strong variations in facial appearance (e.g. in the mouth, eye or eyebrow region). Affected regions do not mislead the illumination estimation but are excluded by our robust approach

Fig. 16

We visualize the first two eigenmodes of our illumination prior using PCA. The first parameter represents luminosity. From the second eigenmode, we see that illuminations from above on both sides are very prominent in the dataset. The illumination conditions are rendered on the mean face of the Basel Face Model and a sphere with the average face albedo. From those eigenmodes we derive the strongest variation in illumination of natural images

Fig. 17

Random samples from our illumination prior represent real-world illumination conditions. The proposed prior covers a wide range of different illumination conditions. The samples are rendered on the mean face of the Basel Face Model (a) and a sphere with the average face albedo (b)

Our method copes with up to 40% of occlusion and reaches consistently good estimates, see Fig. 12. Occlusions which surpass 40% and which can partially be explained by illumination are not properly estimated by our algorithm, see Fig. 11. The generated illumination conditions are unnatural; therefore we also evaluated the robustness against occlusion under observed real-world illumination conditions (Sect. 3.1.1). Every measurement is based on 50 estimations.

Our algorithm relies on a given pose estimation. Pose estimation is a problem which can only be solved approximately. We therefore show how robust our algorithm is against small misalignments in the pose. We again generate synthetic data with random illumination parameters and manipulate the pose before we estimate the illumination. The results are shown in Fig. 13. We present the separate effects of yaw, pitch, and roll as well as the combination of all three error sources. Small pose estimation errors still lead to good estimations; as expected, the errors grow with stronger pose deviations. We have to expect errors smaller than 10 degrees from pose estimation methods (compare Murphy-Chutorian and Trivedi 2009).

We use a simple prior for shape and texture for the proposed illumination estimation. In this experiment, we want to measure how changes in shape and texture influence the illumination estimation. We therefore manipulate all shape and color parameters, respectively, by a fixed standard deviation. The result is presented in Fig. 14. We observe that those variations influence the illumination estimation but do not break it.

3.1.1 Illumination Estimation in the Wild

We applied our robust illumination estimation method on the Annotated Facial Landmarks in the Wild database (AFLW) (Köstinger et al. 2011) containing 25’993 images with a high variety of pose and illumination. For this experiment and the construction of the illumination prior we used the manual landmarks provided with the database (all other experiments use CLandmark detections) for a rough pose estimation following the algorithm proposed by Schönborn et al. (2017). The illumination conditions are not limited to lab settings but are complex and highly heterogeneous. We observe that small misalignments due to pose estimation still lead to a reasonable illumination estimation. The affected pixels, e.g. in the nose region, are automatically discarded by the algorithm.

Both the estimated illumination and the occlusion mask arising from the consensus set can be integrated for further image analysis. We present a selection of images under a variety of illumination conditions with and without occlusions in Fig. 15. The illumination estimation results demonstrate the robustness of our approach against occlusions like facial hair, glasses, and sunglasses in real-world images.

Table 1 Comparison of segmentation performance in SMC and in brackets the F1-Scores (class|rest) for all labels on the AR face database (Martinez and Benavente 1998). We present separate results for our initialization using robust illumination estimation (line 1–3) and the full model including segmentation (line 4–6)

3.1.2 Illumination Prior

The idea of an illumination prior is to learn a distribution of natural illumination conditions from training data. There are various areas of application for such a statistical prior. It can be directly integrated into generative approaches for image analysis or be used to synthesize training data for discriminative approaches.

For faces, the 3D morphable face model (Blanz and Vetter 1999; Paysan et al. 2009) provides such a prior distribution for shape and color but it does not contain a prior on illumination. Such priors for generating synthetic data have recently attracted the attention of the deep learning community. 3D morphable face models were already used as an appearance prior for data synthesis (Richardson et al. 2016; Kulkarni et al. 2015) or augmentation (Jourabloo and Liu 2016; Zhu et al. 2016). An additional illumination prior is essential to generate realistic images. We therefore publish an illumination prior for the diffuse spherical harmonics illumination model (see Sect. 2.6.1) learned on real-world illumination conditions.

We derive an illumination prior on real-world face images by applying our proposed robust illumination estimation method described in Algorithm 1 to a large face database. We again chose the AFLW database since it contains a high variety of illumination settings. We applied the pose and robust illumination estimation as described in Sect. 3.1.1. We excluded gray-scale images and faces which do not match our face model prior (strong make-up, dark skin, strong filter effects). We manually excluded the images where the estimation failed and used the remaining 14’348 images as training data. The parameters for illumination estimation were chosen as described in Sect. 3.1.

By concatenating all spherical harmonic parameters \(L_{{ lm}}^c\) of the first 3 bands and color channels we get 27 parameters. As a model, we propose to use a multivariate normal distribution on the spherical harmonics parameters \(L_{{ lm}}^c\) and present the first eigenmodes (PCA) in Fig. 16. The illumination conditions are normalized relative to the camera position (not to the facial pose). We show some new unseen random illumination conditions in Fig. 17.
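A minimal sketch of how such a prior can be built and sampled, assuming the estimated spherical-harmonics parameters are collected in a matrix with one 27-dimensional row per image; names are illustrative.

```python
import numpy as np

def fit_illumination_prior(coeffs):
    """coeffs: (n_images, 27) concatenated SH parameters (9 coefficients x 3 channels)."""
    return coeffs.mean(axis=0), np.cov(coeffs, rowvar=False)

def sample_illumination(mean, cov, n=1, seed=0):
    """Draw n new illumination conditions, returned as (n, 9, 3) coefficient arrays."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n)
    return samples.reshape(n, 9, 3)
```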

Together with this paper, we publish the estimated illumination (empirical distribution) of our illumination prior as well as the multivariate normal distribution. The prior allows us to generate unseen random instances from the prior distribution to synthesize illumination settings.

The limitations of this illumination prior and the robust illumination estimation are directly deduced from the used spherical harmonics illumination model. We did not incorporate specular highlights or effects of self-occlusion explicitly to keep the model simple. This simplification does not mislead our approximation since regions which are sensitive to self-occlusion or contain specular highlights are excluded from the illumination estimation during our robust approach (see Fig. 15).

3.2 Occlusion-Aware Model

For the 3DMM adaptation, the 3D pose has to be initialized. In the literature, this is performed manually (Blanz and Vetter 1999; Romdhani and Vetter 2003; Aldrian and Smith 2013) or by using fiducial point detections (Schönborn et al. 2013). For our model adaptation experiments on the AR and LFW databases, we use automated fiducial point detection results from the external CLandmark library made available by Uřičář et al. (2015). We integrate the fiducial point detections in our probabilistic framework as proposed in Schönborn et al. (2013) and assume an uncertainty \(\sigma \) of 4 pixels on the detections. Our method is therefore fully automated and does not need manual input.

For the model adaptation experiments, we perform alternating 2000 Metropolis–Hastings sampling steps (best sample is taken to proceed) followed by a segmentation step with 5 iterations and repeat this procedure five times. This amounts to a total of 10,000 samples and 25 segmentation iterations.

Fig. 18

(a) Target images from the AR face database (first three) (Martinez and Benavente 1998) and the LFW database (Huang et al. 2007). (b, c) depict our initialisation arising from the robust illumination estimation, (d, e) present the final results. The shape component of our reconstruction is depicted in (f) under a normalized pose. Our final segmentation and synthesized face include much more information about the eye, eyebrow, nose and mouth regions than the initialisation. (a) target, (b) z, init, (c) \(\theta \), init, (d) z, full fit, (e) \(\theta \), full fit, (f) \(\theta \), shape only

Fig. 19

We present common cases of failure: scarves can be explained by the light and color model and are therefore mislabelled. Hands have similar color appearance and do not distort the face model adaptation but lead to a wrong segmentation. The prototype for chin-beards is misled by shadows under the chin which are not modelled in our illumination model (such errors could be eliminated by more complex illumination models, e.g. Schneider et al. 2017). Note that our method is not adapted to a specific kind of occlusion

Fig. 20

We present the benefit of explicitly modeling beards. The beards cannot be reliably excluded during illumination estimation since they can partially be explained by illumination effects (b, c). We compare our results to the same approach without modeling beards explicitly (d, e, Egger et al. 2016). By explicitly modeling beards the face model adaptation is not misled by the beard region (f). Through the coupling of the beard model with the face model, the underlying face is kept at the correct position (g), (a) target, (b) z, init, (c) \(\theta \), init, (d) z, old, (e) \(\theta \), old, (f) z, proposed, (g) \(\theta \), proposed

3.2.1 Segmentation

To evaluate the segmentation performance of the proposed algorithm, we manually segmented occlusions and beards in the AR face dataset proposed by Martinez and Benavente (1998). We took a subset of images consisting of the first 10 male and female participants appearing in both sessions. Unlike other evaluations, we include images under illumination from the front and the side. We selected the neutral images (containing glasses and beards) from the first session and images with scarves and sunglasses from the second session. The total set consists of 120 images. We had to exclude 5 images because the fiducial point detection (4) or the illumination estimation (1) failed for them (\(\text {m-009-25}\), \(\text {w-009-25}\), \(\text {w-002-22}\), \(\text {w-012-24}\), \(\text {w-012-25}\)). For evaluation, we manually labeled an elliptical face region and the occlusions inside it, with a separate label for beards to distinguish between occlusions and beards. The evaluation was done within the face region only. We have made the manual annotations used in this evaluation publicly available (footnote 5).

In our previous work (Egger et al. 2016), we compared our method to a standard technique to handle outliers, namely a trimmed estimator including only n% of the pixels which are best explained by the face model. In this work, we present the segmentation result including beards as an additional label. We present the simple matching coefficient (SMC) and the F1-Score for detailed analysis in Table 1. In our experiments, we distinguish the three image settings: neutral, scarf and sunglasses. We include the result of the initialization to depict its contribution and show that the fitting improves the segmentation even more.

Fig. 21

The results at different steps in our framework including the EM-strategy. The target image is shown in (a). The result of the robust illumination estimation is shown in (b); we observe a strong pose misalignment in the roll angle. After the first 1000 samples of our model adaptation process the pose is adapted to the image (c). During the later model adaptation the correspondence gets better and the beard and face regions are segmented more accurately. We present the result after 3000 samples (d) and after the full 10’000 samples (e)

3.2.2 Quality of Fit

We present qualitative results of our fitting quality on the AR face database (Martinez and Benavente 1998) and the Labelled Faces in the Wild database (LFW) (Huang et al. 2007). The results in Fig. 18 include images with beards, with occlusions, and without occlusions. In Fig. 19 we also include results where our method fails. Our method detects occlusions via an appearance prior from the face model. If occlusions can be explained by the color or illumination model, the segmentation will be wrong.

Through the explicit modeling of beards, the errors in the beard region are reduced and the face model no longer drifts away in this region (see Fig. 20). An additional result of the inference process is an explicit beard type estimation. The quality of the model adaptation on the AFLW database is almost the same as the performance we obtain on data without occlusions. Interesting parts of the face are included gradually during the fitting process, see Fig. 21. Our method also performs well on images without occlusions and does not tend to exclude parts of the face. In our recent work, we showed that our approach is not limited to neutral faces but can also be applied using a model with facial expressions (Egger et al. 2017) like the Basel Face Model 2017 (Gerig et al. 2017).

3.2.3 Runtime

The method’s performance in terms of speed is lower than that of optimization-only strategies. The stochastic sampler runs in approximately 20 min on current workstation hardware (Intel Xeon E5-2670 with 2.6 GHz), single-threaded. The segmentation takes around 2 min including the updates of the histograms in every step. The robust illumination estimation method needs 30 s.

The full inference process takes around 25 min per image. The long runtime is in large part due to the use of a software renderer and the high resolution of the face model; we did not invest in tuning computational performance. The overall computational effort is within the range of other techniques for full Morphable Model adaptation including texture and illumination.

3.2.4 Discussion

Quantitative evaluation of our full framework and a comparison to other state-of-the-art techniques is currently not possible. For in-the-wild images, especially under occlusion, there is no ground truth available for the shape or other parameters like the illumination. Such a benchmark would be highly beneficial for the community but is out of the scope of this publication. We chose to evaluate each component of our method quantitatively and evaluated the full framework qualitatively. For future comparisons, we release our source code. In our previous work (Schönborn et al. 2017) we provide an in-depth performance analysis including a quantitative evaluation of 3D reconstruction. On these semi-synthetic renderings, we achieved state-of-the-art results.

4 Conclusion

We proposed a novel approach for combined segmentation and 3D morphable face model adaptation. Jointly solving model adaptation and segmentation leads to an occlusion-aware face analysis framework and a semantic segmentation. Regions like the eye, eyebrow, nose, and mouth are harder to fit by the face model and are therefore often excluded by other robust methods. Our model includes those regions automatically during the inference process.

Illumination estimation is a critical part of robust image analysis. A proper initialization of the illumination parameters is necessary for the robust analysis-by-synthesis process. We propose a RANSAC-like robust illumination estimation leading to good estimations even under occlusions.

Beards are especially challenging as they are not included in the face model and mislead the illumination estimation. We explicitly add a prototype-based beard model to overcome this limitation. Beards are segmented separately and guide the face model adaptation.

Additionally, we applied the robust illumination estimation technique on the AFLW database containing faces in a huge variety of scenes under arbitrary illumination conditions. The resulting prior is highly applicable for probabilistic frameworks as well as for data-greedy algorithms like deep learning methods for augmenting or generating data under unconstrained illumination conditions. The published illumination prior, derived from a broad range of real-world photographs, is the first publicly available one.

Our model is a fully probabilistic occlusion-aware face analysis framework. It builds upon a generative model which is built from a few high-quality face scans instead of a large amount of training data. It does not require any manual input or database adaptation and is therefore fully automated. The approach performs well on images under complex illumination settings and on an “in the wild” database for various kinds of occlusions.