1 Introduction

Fair machine learning aims to produce predictive models that do not induce any prejudice or favoritism toward an individual or a group based on a set of sensitive characteristics. To date, the large majority of work in the field has focused on group fairness, which ensures a form of conditional independence between the outcomes of the models \({\hat{Y}}\) and any sensitive attribute A. However, group fairness may have dramatic consequences for some individuals. For example, a person may be refused a position only because she belongs to a privileged group, regardless of her merit within the group.

Recently, counterfactual fairness (Kusner et al., 2017) proposed to assess fairness at the individual level, by leveraging causal inference to ensure that some sensitive attributes are not the cause of a change in prediction. It is argued to lead to a more intuitive, powerful, and less error-prone way of reasoning about fairness (Chiappa, 2019). The idea is to imagine what any individual would look like with a variation of a given attribute of interest, such as a different gender or race for instance, in order to ensure similar outcomes for every alternate version of the same individual. While plenty of methods have been proposed recently to tackle this challenge for discrete variables, to the best of our knowledge no approach addresses the continuous case. The existing approaches may not hold when, for instance, the sensitive attribute is the age or the weight of an individual. As discussed in Sect. 2.2, discretizing sensitive attributes is not an option in most cases. Moreover, existing approaches present some limitations for counterfactual inference even in the discrete case (see end of Sect. 2.2).

The main contributions of this paper are:

  • We propose a novel adversarial learning approach to overcome these limitations for counterfactual inference;

  • Based on this, we define an approach for counterfactual fairness that handles continuous sensitive features, notably via a dynamic sampling method that focuses on the hardest regions of the sensitive space for each individual.

Section 2 first gives details on counterfactual fairness, which we believe are essential for a good understanding of our contributions. Section 3 details our approach in two main steps. Section 4 evaluates performance in both the discrete and the continuous settings.

2 Background

Recently, there has been a dramatic rise of interest in fair machine learning within the academic community. Many questions have been raised, such as: How to define fairness (Hinnefeld et al., 2018; Hardt et al., 2016; Dwork et al., 2012; Kusner et al., 2017)? How to mitigate the sensitive bias (Zhang et al., 2018; Grari et al., 2019; Kamiran & Calders, 2012; Bellamy et al., 2018; Calmon et al., 2017; Zafar et al., 2015; Celis et al., 2019; Wadsworth et al., 2018; Louppe et al., 2017; Chen et al., 2019; Kearns et al., 2017)? How to keep a high prediction accuracy while remaining fair in a complex real-world scenario (Grari et al., 2019; Adel et al., 2019)? To answer these questions, three main families of fairness approaches exist in the literature. While pre-processing (Kamiran & Calders, 2012; Bellamy et al., 2018; Calmon et al., 2017) and post-processing (Hardt et al., 2016; Chen et al., 2019) approaches respectively act on the input or the output of a classically trained predictor, in-processing approaches mitigate the undesired bias directly during the training phase (Zafar et al., 2015; Celis et al., 2019; Zhang et al., 2018; Wadsworth et al., 2018; Louppe et al., 2017). In this paper we focus on in-processing fairness, which proves to be the most powerful framework for settings where acting on the training process is an option.

Throughout this document, the aim is to learn a predictive function \(h_\theta\) from training data consisting of m examples \({(x_{i},a_{i},y_{i})}_{i=1}^{m}\), where \(x_{i} \in {\mathbb {R}}^{p}\) is the p-sized feature vector X of the i-th example, \(a_i \in \Omega _{A}\) the value of its sensitive attribute and \(y_{i}\) its label to be predicted. Depending on the setting, the domain \(\Omega _{A}\) of the sensitive attribute A can be either a discrete or a continuous set. The outcome Y is also either binary or continuous. The objective is to ensure some individual fairness guarantees on the outcomes of the predictor \({\hat{Y}}=h_\theta (X,A)\), by way of counterfactual fairness.

2.1 Fairness definitions and metrics

The vast majority of fairness research has focused on two metrics that have become very popular in the field: Demographic Parity (Dwork et al., 2012) and Equalized Odds (Hardt et al., 2016). Both consider fairness globally, by focusing on equity between groups of people, classically defined according to one or several categorical sensitive attributes.

Some recent works propose to extend this to the continuous setting by minimizing the (non-linear) correlation between predictions \({\hat{Y}}\) and sensitive attributes A (Mary et al., 2019; Grari et al., 2019), which can be measured for instance via the Hirschfeld-Gebelein-Rényi maximal correlation (HGR).

However, even such approaches in the continuous setting only consider fairness globally and can lead to particularly unfair decisions at the individual level. For example, a fair algorithm can choose to accept a high MSE error on the outcome of a given person if this allows the distribution \(P({\widehat{Y}}|A)\) to get closer to \(P({\widehat{Y}})\). The penalization can be arbitrarily high for a given kind of individual profile compared to any other equivalent one, depending only on where the learning process converged. Global fairness is unfair.

To tackle this problem, counterfactual fairness has recently been introduced to quantify fairness at the individual level (Kusner et al., 2017). The idea is to consider that a decision is fair for an individual if it coincides with the one that would have been taken in a counterfactual world in which the values of its sensitive attributes were different. It leverages the work of Pearl (2009), which introduced a causal framework to learn from biased data by exploring the relationship between sensitive features and data.

Definition 1

Counterfactual demographic parity (Kusner et al., 2017): A predictive function \(h_\theta\) is considered counterfactually fair for a causal world G if, for any \(x \in X\), \(\forall y \in Y\), \(\forall (a,a') \in \Omega _{A}^2\) with \(a \ne a'\): \(p({\hat{Y}}_{A\leftarrow {a}} =y|X=x,A=a) = p({\hat{Y}}_{A\leftarrow {a'}} =y|X=x,A=a)\), where \({\hat{Y}}_{A\leftarrow {a'}} = h_\theta (X_{A\leftarrow {a'}},a')\) is the outcome of the predictive function \(h_\theta\) for any transformation \(X_{A\leftarrow {a'}}\) of input X, resulting from setting \(a'\) as its sensitive attribute value, according to the causal graph G.

Following Definition 1, an algorithm is considered counterfactually fair in terms of demographic parity if the predictions are equal for each individual in the factual causal world where \(A=a\) and in any counterfactual world where \(A=a'\). It therefore compares the predictions for the same individual and for an alternate version of him/herself. A similar extension can be made to adapt the Equalized Odds objective to the counterfactual framework (Pfohl et al., 2019). Learning transformations \({\hat{X}}_{A\leftarrow {a'}}\) for a given causal graph is at the heart of counterfactual fairness, as described below.

2.2 Counterfactual fairness

Fig. 1: Graphical causal model

In this paper, we focus on the classical causal graph depicted in Fig. 1, often used in the counterfactual fairness literature (Kusner et al., 2017; Pfohl et al., 2019; Chiappa, 2019), which applies to most applications. For more specific tasks, note that our approach could easily be adapted to different graphs, such as those explored in Kusner et al. (2017) for instance. In this causal graph, both the input X and the outcome Y only depend on the sensitive attribute A and a latent variable U, which represents all the relevant knowledge not dependent on the sensitive feature A. In that setting, the knowledge of U can be used during training to simulate various versions of the same individual, corresponding to different values of A, in order to obtain a predictive function \(h_\theta\) which respects the fairness objective from Definition 1. For any training sample, U has to be inferred since only X, A and Y are observed. This inference must however ensure that no dependence is created between U and A (there is no arrow from U to A in the graph of Fig. 1); otherwise the generation of proper alternative versions of X and Y for any value of A would be prevented.

Concerning causal effect identifiability (i.e., whether a joint distribution of latent and observed confounder variables can be uniquely inferred from observations), the sufficient conditions raised in (Louizos et al., 2017; Madras et al., 2019; Kilbertus et al., 2020) imply strong assumptions which require specific directed acyclic graphs different from ours. As in Pfohl et al. (2019), which considers the same causal graph, we make no formal guarantee on identification even in the case where these assumptions hold (more information in their article). However, we argue that, given any distribution P(U, A, X, Y) exactly inferred from a sufficiently large amount of observations (X, Y, A), with a constant prior on U, the counterfactual quantities \(P(X_{A\leftarrow a'}|X,Y,A)\) and \(P(Y_{A\leftarrow a'}|X,Y,A)\) are identifiable whenever U is independent of A. From this, if the prior P(U) is the true one and the decoding is sufficiently powerful, a classifier can be trained to minimize counterfactual unfairness according to the inferred model (step 2 in the following).

Several current approaches (Louizos et al., 2017; Kim et al., 2021; Kocaoglu et al., 2017; Xu et al., 2019) enforce fairness on counterfactual data generated by their model. These works, which do not focus on the final predictor itself, assume that giving fair generated counterfactual observations as input to a traditional machine learning algorithm is sufficient to maintain the fairness objective. We argue that this is not always the case and that the final predictions need to be evaluated to ensure a good fairness level. For this reason, we rather leverage a two-step method, as already considered in Russell et al. (2017) and Pfohl et al. (2019), which focuses separately on causal inference (step 1) and model learning (step 2). We develop and discuss the general principles of this family of methods in the following.

2.2.1 Step 1: Counterfactual Inference

The goal is to define a way to generate counterfactual versions of original individuals. As discussed above, this is usually done via approximate Bayesian inference, according to a pre-defined causal graph. The initial idea for performing inference was to assume, under strong hypotheses, a non-deterministic structural model with some specific distribution for each causal link (Kusner et al., 2017). In this setting, the posterior distribution of U was estimated using the probabilistic programming language Stan (Team et al., 2016). Then, leveraging recent developments in approximate inference with deep learning, many works proposed to use Variational Autoencoder (VAE) methods (Kingma & Welling, 2013) to generalize this first model and capture more complex, non-linear, dependencies in the causal graph. This leads to considering the following loss, derived from the evidence lower bound (ELBO), on the training set \({{\mathcal {D}}}\):

$${\mathcal{L}}_{ELBO} = - {\mathbb{E}}_{\substack{(x,y,a)\sim {\mathcal{D}},\\ u\sim q_{\phi }(u|x,y,a)}} \left[ \log p_{\theta }(x,y|u,a) \right] + D_{KL}\left( q_{\phi }(u|x,y,a)\,||\,p(u) \right)$$

where \(D_{KL}\) denotes the Kullback-Leibler divergence of the posterior \(q_{\phi }(u|x,y,a)\) from a prior p(u), typically a standard Gaussian distribution \({{{\mathcal {N}}}}(0,I)\). The posterior \(q_{\phi }(u|x,y,a)\) is represented by a deep neural network with parameters \(\phi\), which typically outputs the mean \(\mu _\phi\) and the variance \(\sigma _\phi\) of a diagonal Gaussian distribution \({{{\mathcal {N}}}}(\mu _\phi ,\sigma _\phi I)\). The likelihood term factorizes as \(p_{\theta }(x,y|u,a)= p_{\theta }(x|u,a)p_{\theta }(y|u,a)\), both defined as neural networks with parameters \(\theta\). Since it is attracted toward a standard prior, the posterior is expected to remove probability mass for any features of U that are not involved in the reconstruction of X and Y. Since A is given together with U as input of the likelihoods, all the information from A should be removed from the posterior distribution of U.
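To make this concrete, below is a minimal PyTorch sketch of such an inference model for the causal graph of Fig. 1: an encoder \(q_\phi(u|x,y,a)\) outputting the mean and log-variance of a diagonal Gaussian, and decoders \(p_\theta(x|u,a)\) and \(p_\theta(y|u,a)\). The layer sizes, the class name `CFVAE` and the Gaussian parameterization are illustrative assumptions, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class CFVAE(nn.Module):
    """Minimal conditional VAE for the causal graph of Fig. 1 (illustrative sizes)."""
    def __init__(self, x_dim, a_dim=1, y_dim=1, u_dim=5, hidden=64):
        super().__init__()
        # Encoder q_phi(u | x, y, a): outputs mean and log-variance of a diagonal Gaussian
        self.encoder = nn.Sequential(
            nn.Linear(x_dim + y_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * u_dim))
        # Decoders p_theta(x | u, a) and p_theta(y | u, a)
        self.dec_x = nn.Sequential(
            nn.Linear(u_dim + a_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim))
        self.dec_y = nn.Sequential(
            nn.Linear(u_dim + a_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))

    def encode(self, x, y, a):
        mu, logvar = self.encoder(torch.cat([x, y, a], dim=1)).chunk(2, dim=1)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        # u = mu + sigma * eps (re-parametrization trick)
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def forward(self, x, y, a):
        mu, logvar = self.encode(x, y, a)
        u = self.reparameterize(mu, logvar)
        x_hat = self.dec_x(torch.cat([u, a], dim=1))
        y_hat = self.dec_y(torch.cat([u, a], dim=1))
        return x_hat, y_hat, mu, logvar
```

Under a Gaussian likelihood assumption, the reconstruction terms of the loss reduce to (weighted) mean squared errors between (x, y) and (x_hat, y_hat).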

However, some works (Chiappa, 2019; Louizos et al., 2017; Madras et al., 2019; Pfohl et al., 2019) show that the resulting latent space U and the sensitive variable A remain too highly correlated with this classical ELBO optimization: some information from A leaks into the inferred U. To cope with this, a specific TARNet (Shalit et al., 2017) architecture can be employed (Madras et al., 2019) or a penalization term can be added to the loss function. For example, Chiappa (2019) and Pfohl et al. (2019) add a Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) constraint. The MMD term can be used to pull every aggregated posterior toward the prior distribution (Pfohl et al., 2019): \({\mathcal {L}}_{MMD}(q_{\phi }(u|A=a_{k})||p(u))\) for all \(a_{k} \in \Omega _{A}\) (referred to as MMD wrt P(U) in the following). Alternatively, the constraint can directly enforce the matching between pairs of posteriors (Chiappa, 2019): \({\mathcal {L}}_{MMD}(q_{\phi }(u|A=a_{k})||q_{\phi }(u|A=a))\) for all \(a_{k} \in \Omega _{A}\), with a standing for the original sensitive value of the considered individual (referred to as MMD wrt \(U_a\) in the following). Notice that while this additional term can improve independence, it can also encourage the model to ignore the latent confounders U if it is too restrictive. One possible approach to address this issue is to apply weights \(\uplambda\) (hyperparameters) to control the relative importance of the different terms. In addition, we employ in this paper a variant of the ELBO optimization as done in Pfohl et al. (2019), where the \(D_{KL}(q_{\phi }(u|x,y,a)||p(u))\) term is replaced by an MMD term \({\mathcal {L}}_{MMD}(q_{\phi }(u)||p(u))\) between the aggregated posterior \(q_{\phi }(u)\) and the prior. This has been shown to be more powerful than the classical \(D_{KL}\) for ELBO optimization in Zhao et al. (2017), as the latter can prove too restrictive (the uninformative latent code problem) (Chen et al., 2016; Bowman et al., 2015; Sønderby et al., 2016) and can also tend to overfit the data (variance over-estimation in feature space). Finally, the inference for counterfactual fairness can be optimized by minimizing (Pfohl et al., 2019):

$$\begin{aligned} {\mathcal {L}}_{CE-VAE} =&- \mathop {{\mathbb {E}}}_{\substack{(x,y,a)\sim {\mathcal {D}}, \\ u \sim q_{\phi }(u|x,y,a)}}\left[ \uplambda _x \log (p_{\theta }(x|u,a)) + \uplambda _y \log (p_{\theta }(y|u,a)) \right] \\&+ \uplambda _{MMD} \ {\mathcal {L}}_{MMD}(q_{\phi }(u)||p(u)) + \frac{\uplambda _{ADV}}{|\Omega _{A}|}\sum _{a_{k} \in \Omega _{A}} {\mathcal {L}}_{MMD}(q_{\phi }(u|a=a_{k})||p(u)) \end{aligned}$$

where \(\uplambda _{x}\), \(\uplambda _{y}\), \(\uplambda _{MMD}\), \(\uplambda _{ADV}\) are scalar hyperparameters. The additional MMD objective can be interpreted as minimizing the distance between all moments of each aggregated latent code distribution and the prior distribution, in order to remove most of the sensitive dependency from the code generator. It requires, however, a careful design of the kernel used for MMD computations (typically a zero-mean isotropic Gaussian). Note that we chose to present all models with a generic inference scheme q(U|X, Y, A), while most approaches from the literature only consider q(U|X, A). Using Y as input is allowed since U is only used during training, for generating the counterfactual samples used to learn the predictive model in step 2. Various inference schemes are considered in our experiments (Sect. 4).
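For reference, here is a minimal sketch of the (biased) squared MMD estimate with a Gaussian RBF kernel, as it could be used for the \({\mathcal {L}}_{MMD}(q_{\phi }(u)||p(u))\) term above; the bandwidth value and the function names are illustrative assumptions.

```python
import torch

def gaussian_kernel(x, y, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
    sq_dist = torch.cdist(x, y, p=2) ** 2
    return torch.exp(-sq_dist / (2 * bandwidth ** 2))

def mmd_loss(u_samples, prior_samples, bandwidth=1.0):
    """Biased estimate of MMD^2 between the aggregated posterior and the prior."""
    k_uu = gaussian_kernel(u_samples, u_samples, bandwidth).mean()
    k_pp = gaussian_kernel(prior_samples, prior_samples, bandwidth).mean()
    k_up = gaussian_kernel(u_samples, prior_samples, bandwidth).mean()
    return k_uu + k_pp - 2 * k_up

# Usage: compare a batch of inferred codes u (shape: batch x u_dim)
# to samples from the N(0, I) prior:
#   penalty = mmd_loss(u, torch.randn_like(u))
```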

2.2.2 Step 2: Counterfactual predictive model

Once the causal model is learned, the goal is to use it to learn a fair predictive function \(h_\theta\), by leveraging the ability of the model to generate alternative versions of each sample. The global loss function is usually composed of the traditional predictor loss \(l(h_{\theta }(x_{i},a_{i}),y_{i})\) (e.g., cross-entropy for instance i) and a counterfactual unfairness term \({\mathcal{L}}_{{{\mathcal{C}\mathcal{F}}}} (\theta)\):

$${\mathcal{L}} = \frac{1}{m}\sum\limits_{i=1}^{m} l \left( h_{\theta } (x_{i} ),y_{i} \right) + \uplambda \, {\mathcal{L}}_{{{\mathcal{C}\mathcal{F}}}} (\theta )$$
(1)

where \(\uplambda\) is a hyperparameter which controls the impact of the counterfactual loss in the optimization. The counterfactual loss \({\mathcal{L}}_{{{\mathcal{C}\mathcal{F}}}} \left( \theta \right)\) considers differences of predictions between alternative versions of any individual. For example, Russell et al. (2017) consider the following Monte-Carlo estimate from S samples for each individual i and each value \(a \in \Omega _{A}\):

$${\mathcal{L}}_{{{\mathcal{C}\mathcal{F}}}} (\theta ) = \frac{1}{m}\sum\limits_{i = 1}^{m} \frac{1}{|\Omega _{A}|} \sum\limits_{a_{k} \in \Omega _{A}} \frac{1}{S} \sum\limits_{s = 1}^{S} \Delta _{a_{k}}^{i,s}$$
(2)

where \(\Delta _{a_k}^{i,s}=\Delta (h_{\theta }(x_{i,A\leftarrow {a_i}}^{s},a_i),h_{\theta }(x_{i,A\leftarrow {a_{k}}}^{s},a_k))\) is a loss function that compares two predictions, and \(x_{i,A\leftarrow {a}}^{s}\) denotes the s-th sample from the causal model for the i-th individual of the training set and the sensitive attribute value a. Following the causal model learned at step 1, \(x_{i,A\leftarrow {a}}^{s}\) is obtained by first inferring a sample u from \(q_{\phi }(u|x_i,a_i,y_i)\) and then sampling \(x_{i,A\leftarrow {a}}^{s}\) from \(p_{\theta }(x|u,a)\) with the counterfactual (or factual) attribute value a. Depending on the task, \(\Delta\) can take various forms. For binary classification, it can correspond to a logit pairing loss as done in Pfohl et al. (2019): \(\Delta (z,z')=(\sigma ^{-1}(z)-\sigma ^{-1}(z'))^2\), where \(\sigma ^{-1}\) is the logit function. For continuous outcomes, it can simply correspond to a mean squared difference.
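The Monte-Carlo estimate of Eq. (2) with the logit pairing cost could be sketched as follows; `model`, `encoder.sample` and `decoder_x.sample` are hypothetical placeholders for the predictor \(h_\theta\) and the step 1 causal model, and binary predictions in (0, 1) are assumed.

```python
import torch

def logit(p, eps=1e-6):
    # sigma^{-1}(p), clamped for numerical stability
    p = p.clamp(eps, 1 - eps)
    return torch.log(p / (1 - p))

def counterfactual_loss_discrete(model, encoder, decoder_x, x, y, a, a_values, n_samples=1):
    """Monte-Carlo estimate of the counterfactual loss (Eq. 2) with logit pairing."""
    total = 0.0
    for _ in range(n_samples):
        u = encoder.sample(x, y, a)                 # u ~ q_phi(u | x, y, a)
        x_factual = decoder_x.sample(u, a)          # x ~ p_theta(x | u, a), factual value
        pred_factual = model(x_factual, a)
        for a_k in a_values:                        # enumerate the discrete modalities of A
            a_cf = torch.full_like(a, a_k)
            x_cf = decoder_x.sample(u, a_cf)        # counterfactual x under A <- a_k
            pred_cf = model(x_cf, a_cf)
            total = total + ((logit(pred_factual) - logit(pred_cf)) ** 2).mean()
    return total / (n_samples * len(a_values))
```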

2.2.3 Discussion

So far, state-of-the-art approaches have focused specifically on categorical variables A. Unfortunately, the classical methodology for counterfactual fairness as described above cannot be directly generalized to continuous sensitive attributes, because the two steps involve enumerations of the discrete counterfactual modalities \(a_{k}\) in the set \(\Omega _{A}\). In particular, in step 1, sampling A from a uniform distribution to approximate the expectation \(E_{{a\sim p(A)}}{\mathcal{L}}_{{MMD}} \left( {q_{\phi } \left( {u|A = a} \right)||p\left( u \right)} \right)\) is not an option, since this requires a good estimation of \(q_{\phi }(u|A=a)\) for any \(a \in \Omega _A\), which is difficult in the continuous case. While such a posterior can be obtained for discrete sensitive attributes (at least when \(|\Omega _A| \ll m\)) by aggregating the posteriors \(q_{\phi }(u|x_i,a_i,y_i)\) over training samples i such that \(a_i=a\), such a simple aggregation over filtered samples is not possible for continuous attributes. Note that splitting samples into bins according to their sensitive value A is not an option either, due to the difficulty of setting an effective discretization step size: while a large step size induces aggregating too different sensitive values (leading to dependencies on A inside bins), small steps imply unreliable aggregated posterior estimates due to the small number of samples in each bin (especially for data unevenly spread over \(\Omega _A\)).

Moreover, existing approaches based on MMD costs require inferring codes U from a distribution that takes A as input, in order to be able to obtain the required aggregated distributions via \(q_\phi (u|a) = {\mathbb {E}}_{p_{data}(x,y|a)} [q_\phi (u\vert x,y,a)]\). Omitting A from the conditioning of the generator would correspond to assuming the mutual independence of u and a given x and y, which is usually wrong. On the other hand, passing A to the generator of U can encourage their mutual dependency in some settings, as we observe in our experiments. This is not the case with our proposal below.

3 Adversarial learning for counterfactual fairness

In this section we revisit the two steps described above, using adversarial learning rather than MMD costs to ensure counterfactual fairness. Our contribution covers a broad range of scenarios, where the sensitive attribute A and the outcome value Y can each be either discrete or continuous.

3.1 Step 1: Counterfactual Inference

Fig. 2: Our counterfactual inference architecture. Circles are observed variables, squares are samples from the neural distributions. Arrows represent backpropagated gradients

To avoid the comparison of distributions for each possible sensitive value, which proves particularly problematic in the continuous setting, we propose to employ an adversarial learning framework, which allows one to avoid the enumeration of possible values in \(\Omega _A\). We follow an approach similar to the adversarial auto-encoders proposed in Makhzani et al. (2015), but where the real/fake discriminator is replaced by a sensitive value predictor. The idea is to prevent any adversarial function from being able to decode A from the code U inferred by the encoder \(q_\phi\), which allows one to ensure mutual independence of A and U. This defines a two-player adversarial game, as in GANs (Goodfellow et al., 2014), where the goal is to find parameters \(\phi\) which minimize the reconstruction loss for X and Y, while maximizing the reconstruction loss of A according to the best decoder \(p_{\psi }(A|U)\):

$$\mathop {\arg \min }\limits _{\theta ,\phi } \max _{\psi } {\mathcal{L}}_{ADV} \left( \theta ,\phi ,\psi \right)$$
(3)
$$\begin{aligned} {\mathcal {L}}_{ADV}(\theta ,\phi ,\psi ) \ge&- \mathop {{\mathbb {E}}}_{\substack{(x,y,a)\sim {\mathcal {D}}, \\ u \sim q_{\phi }(u|x,y,a)}}\left[ \uplambda _x \log (p_{\theta }(x|u,a)) + \uplambda _y \log (p_{\theta }(y|u,a)) \right] \\&+ \uplambda _{MMD} \ {\mathcal {L}}_{MMD}(q_{\phi }(u)||p(u)) + \uplambda _{ADV} \mathop {{\mathbb {E}}}_{\substack{(x,y,a)\sim {\mathcal {D}}, \\ u \sim q_{\phi }(u|x,y,a)}}[\log (p_{\psi }(a|u))] \end{aligned}$$

where \(\uplambda _{x}\), \(\uplambda _{y}\), \(\uplambda _{MMD}\), \(\uplambda _{ADV}\) are scalar hyperparameters. Compared to the existing approaches presented in the previous section, the difference is the last term, which corresponds to the expectation of the log-likelihood of A given U according to the decoder with parameters \(\psi\). This decoder is a neural network which outputs the parameters of the distribution of A given U (i.e., the logits of a Categorical distribution in the discrete case, the mean and log-variance of a diagonal Gaussian in the continuous case).

All parameters are learned jointly. Figure 2 gives the full architecture of our variational adversarial inference for the causal model from Fig. 1. It depicts the neural network encoder \(q_{\phi }(U|X,Y,A)\), which generates a latent code U from the inputs X, Y and A. A neural network decoder \(p_{\theta }(X,Y|U,A)\) reconstructs the original X and Y from both U and A. The adversarial network \(p_\psi\) tries to reconstruct the sensitive attribute A from the confounder U. As classically done in adversarial learning, we alternate steps of adversarial maximization and steps of global loss minimization (one gradient descent iteration on the same batch of data at each step). Optimization is done via the re-parametrization trick (Kingma & Welling, 2013) to handle stochasticity.
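A minimal sketch of this alternating optimization is given below, assuming the continuous case (a Gaussian likelihood for A, approximated here by an MSE surrogate for its negative log-likelihood); `vae` refers to an inference model as sketched in Sect. 2.2.1, `mmd_loss` to the MMD helper sketched there, and `lambdas` to the hyperparameter weights — all hypothetical names, not the authors' implementation.

```python
import torch
import torch.nn as nn

def train_step1(vae, adversary, loader, lambdas, n_epochs=50, lr=1e-3):
    """Alternating adversarial optimization of the counterfactual inference model (step 1)."""
    opt_vae = torch.optim.Adam(vae.parameters(), lr=lr)
    opt_adv = torch.optim.Adam(adversary.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(n_epochs):
        for x, y, a in loader:
            mu, logvar = vae.encode(x, y, a)
            u = vae.reparameterize(mu, logvar)

            # (1) adversarial maximization: p_psi tries to recover A from U
            adv_loss = mse(adversary(u.detach()), a)
            opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

            # (2) global minimization: reconstruct X and Y, match the prior,
            #     and fool the adversary (i.e., maximize its reconstruction error on A)
            x_hat = vae.dec_x(torch.cat([u, a], dim=1))
            y_hat = vae.dec_y(torch.cat([u, a], dim=1))
            loss = (lambdas['x'] * mse(x_hat, x)
                    + lambdas['y'] * mse(y_hat, y)
                    + lambdas['mmd'] * mmd_loss(u, torch.randn_like(u))
                    - lambdas['adv'] * mse(adversary(u), a))
            opt_vae.zero_grad(); loss.backward(); opt_vae.step()
```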

3.2 Step 2: Counterfactual predictive model

As described in Sect. 2.2.2, counterfactual fairness in the predictive model learned at step 2 is ensured by comparing, for each training individual, the counterfactual predictions \(Y_{A \leftarrow a'}\) for all \(a' \in \Omega _A\). For the discrete case (i.e., A is a Categorical variable), we keep this process for our experiments. However, for the continuous setting (i.e., A is for instance generated from a Gaussian), such an approach must be adapted, due to the infinite set \(\Omega _{A}\). In that case, we can consider a sampling distribution \(P'(A)\) to formulate the following loss, which can be optimized via Monte-Carlo sampling and stochastic gradient descent (SGD):

$$\begin{aligned} {\mathcal {L}}_{CF}(\theta ) = \frac{1}{m} \sum _i^{m} l(h_{\theta }(x_i),y_i) + \uplambda \mathop {{\mathbb {E}}}_{\substack{u \sim P(u|x_i,a_i,y_i), \\ {\tilde{x}} \sim P(x|u,a_i), \\ a'\sim P'(A),\ x' \sim P(x|u,a')}}[(h_{\theta }({\tilde{x}}) - h_{\theta }(x'))^2] \end{aligned}$$
(4)

This formulation is equivalent to the one from Eq. (2) for continuous outcomes \({\hat{Y}}\) (thus considering a least squares cost as \(\Delta\)) and for continuous attributes A (thus using the sampling distribution \(P'(A)\) rather than considering every possible \(a \in \Omega _A\)). Note that using a non-uniform sampling distribution \(P'(A)\) would concentrate the penalization near the mass of the distribution. This prevents using the prior of A estimated from the training set, since this would tend to reproduce inequity between individuals: counterfactual predictions for rare A values would be hardly taken into account during training. We therefore consider a uniform \(P'(A)\) in our experiments for the continuous setting when using the \({\mathcal {L}}_{CF}(\theta )\) objective at step 2.

However, for the specific case of high-dimensional sensitive attributes A, using a uniform sampling distribution \(P'(A)\) could be inefficient. The risk is that a high number of counterfactual samples fall in areas that are easy for the learning process, while some difficult areas, where important work for fairness has to be performed, remain insufficiently visited. To tackle this problem, we propose to allow the learning process to dynamically focus on the most useful areas of \(\Omega _A\) for each individual. During learning, we consider an adversarial process which is in charge of moving the sampling distribution \(P'(A)\) so that the counterfactual loss is maximized. This allows the learning process to select useful counterfactuals for ensuring fairness: by dynamically focusing on the hardest areas, one can expect fairness everywhere. Again, we face a two-player adversarial game, which is formulated as follows:

$$\begin{aligned}&\mathop {\arg \min }\limits _{\theta }\mathop {\max }\limits _{\phi }{\mathcal {L}}_{DynCF}(\theta ,\phi ) \nonumber \\&{\mathcal {L}}_{DynCF}(\theta ,\phi ) = \frac{1}{m} \sum _i^{m} l(h_{\theta }(x_i),y_i) + \uplambda \mathop {{\mathbb {E}}}_{\substack{u \sim P(u|x_i,a_i,y_i), \\ {\tilde{x}} \sim P(x|u,a_i), \\ a' \sim P_\phi (a|u),\ x' \sim P(x|u,a')}} [(h_{\theta }({\tilde{x}}) - h_{\theta }(x'))^2] \end{aligned}$$
(5)

Compared to Eq. (4), this formulation considers an adversarial sampling distribution \(P_\phi (A|U)\) rather than a static uniform distribution \(P'(A)\). It takes the form of a neural network that outputs the parameters of the sampling distribution for a given individual representation U. In our experiments we use a diagonal logit-Normal distribution \(sigmoid({\mathcal {N}}(\mu _{\phi }(u),\sigma _{\phi }^2(u)I))\), where \(\mu _{\phi }(u)\) and \(\sigma _{\phi }^2(u)\) stand for the mean and variance parameters provided by the network for the latent code u. Samples from this distribution are then projected onto the support \(\Omega _A\) via a linear mapping depending on the shape of the set. Passing U as input to the network allows the process to define different distributions for different codes: depending on the individual profile, the unfair areas are not always the same. This also limits the risk that the adversarial process gets stuck in sub-optima of the sensitive manifold. As done for the adversarial learning of step 1, all parameters are learned jointly, by alternating steps of adversarial maximization and steps of global loss minimization. The re-parametrization trick (Kingma & Welling, 2013) is also used for the adversarial optimization of \(P_\phi (A|U)\).
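A possible implementation of the adversarial sampling network \(P_\phi(A|U)\) with the diagonal logit-Normal distribution described above is sketched below; the layer sizes and the linear mapping onto \(\Omega _A\) (assumed here to be an interval \([a_{min}, a_{max}]\)) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DynamicSensitiveSampler(nn.Module):
    """Adversarial sampler P_phi(a | u): a diagonal logit-Normal over the sensitive space."""
    def __init__(self, u_dim, a_dim=1, hidden=32, a_min=0.0, a_max=1.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(u_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * a_dim))
        self.a_min, self.a_max = a_min, a_max

    def sample(self, u):
        mu, log_var = self.net(u).chunk(2, dim=1)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps            # re-parametrized Gaussian sample
        a = torch.sigmoid(z)                               # logit-Normal value in (0, 1)
        return self.a_min + (self.a_max - self.a_min) * a  # linear mapping onto Omega_A

# During step 2, this sampler would be trained to maximize the counterfactual penalty of
# Eq. (5), while the predictor h_theta is trained to minimize it (alternating gradient steps).
```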

4 Experiments

We empirically evaluate the performance of our contribution on 6 real-world data sets. For the discrete scenario, and specifically the binary case (\(Y \in \{0,1\}, A \in \{0,1\}\)), we use 3 popular data sets: the Adult UCI income data set (Dua & Graff, 2017) with gender as sensitive attribute (male or female), the COMPAS data set (Angwin et al., 2016) with race as sensitive attribute (Caucasian or not-Caucasian) and the Bank data set (Moro et al., 2014) with age as sensitive attribute (age between 30 and 60 years, or not). For the continuous setting (Y and A are continuous), we use the 3 following data sets: the US Census data set (US Census Bureau, 2019) with gender rate as sensitive attribute, encoded as the percentage of women in the census tract, the Motor data set (The Institute of Actuaries of France, 2015) with the driver's age as sensitive attribute and the Crime data set (Dua & Graff, 2017) with the ratio of an ethnic group per population as sensitive attribute.

In addition to the 6 real-world data sets, we consider a synthetic scenario, which allows us to perform a further analysis of the relative performances of the approaches. The synthetic scenario concerns a pricing algorithm for a fictional car insurance policy, which follows the causal graph from Fig. 1. We simulate both a binary and a continuous data set from this scenario. The main advantage of these synthetic scenarios is that it is possible to get "ground truth" counterfactuals for each code U, obtained using the true relationships of the generation model while varying A uniformly in \(\Omega _A\). This allows us to evaluate the counterfactual fairness of the models without depending on a given inference process for the evaluation metric, by relying on prediction differences between these true counterfactuals and the original individual. The objective of this scenario is to achieve a counterfactually fair predictor which estimates the average cost history of insurance customers. We suppose 5 unobserved variables (Aggressiveness, Inattention, Restlessness, Recklessness and Overreaction) which correspond to a 5-dimensional confounder U. The input X is composed of four explicit variables \(X_1,...,X_4\) which stand for vehicle age, speed average, horsepower and average kilometers per year respectively. We consider the policyholder's age as the sensitive attribute A. The input X and the average cost variable Y are sampled from U and A as depicted in Fig. 1. We propose both a binary and a continuous version of this scenario. For both of them, 5000 individuals are sampled. Details of the distributions used for the continuous setting of this synthetic scenario are given below:

$$\begin{aligned} U&\sim {\mathcal {N}}\left( \begin{pmatrix} 0\\ 0.5\\ 1\\ 1.5\\ 2 \end{pmatrix},\ \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 \\ 0 & 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 0 & 2 \end{pmatrix} \right) \\ X_1&\sim {\mathcal {N}}(7+0.1*A+ U_{1}+U_{2}+U_{3},\,1) \\ X_2&\sim {\mathcal {N}}(80+ A + U_{2}^{2},\,10) \\ X_3&\sim {\mathcal {N}}(200+5*A + 5*U_{3},\,20) \\ X_4&\sim {\mathcal {N}}(10^4+5*A + U_{4}+U_{5},\,1000) \\ X&= [X_1,X_2,X_3,X_4] \\ A&\sim {\mathcal {N}}(45,\,5) \\ Y&\sim {{{\mathcal {N}}}}( 2*(7*A + 20*\textstyle \sum _j U_{j}),\,0.1) \end{aligned}$$
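For illustration, the continuous synthetic generation above can be sketched in NumPy as follows, under the assumption that the second parameter of each Gaussian denotes its standard deviation; the function name is ours.

```python
import numpy as np

def generate_synthetic_continuous(n=5000, seed=0):
    """Continuous synthetic car-insurance scenario (5 latent confounders, age as sensitive)."""
    rng = np.random.default_rng(seed)
    mean_u = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
    cov_u = np.diag([1.0, 4.0, 2.0, 3.0, 2.0])
    u = rng.multivariate_normal(mean_u, cov_u, size=n)                 # unobserved confounders
    a = rng.normal(45.0, 5.0, size=n)                                  # sensitive: policyholder's age
    x1 = rng.normal(7 + 0.1 * a + u[:, 0] + u[:, 1] + u[:, 2], 1.0)    # vehicle age
    x2 = rng.normal(80 + a + u[:, 1] ** 2, 10.0)                       # speed average
    x3 = rng.normal(200 + 5 * a + 5 * u[:, 2], 20.0)                   # horsepower
    x4 = rng.normal(1e4 + 5 * a + u[:, 3] + u[:, 4], 1000.0)           # average km per year
    y = rng.normal(2 * (7 * a + 20 * u.sum(axis=1)), 0.1)              # average cost
    x = np.stack([x1, x2, x3, x4], axis=1)
    return x, a, y, u
```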

4.1 Step 1: Counterfactual Inference

In this section, we report the experiments performed to assess our adversarial approach for counterfactual inference (step 1 of the previous section). We compare our adversarial approach with two versions of the CE-VAE approach of Sect. 2.2.1, each using one of the two MMD constraints, MMD wrt P(U) or MMD wrt \(U_a\), as presented in Sect. 2.2.1 (step 1). Note that these approaches are not applicable to continuous data sets, as discussed at the end of Sect. 2. For every approach, we compare three different inference schemes for U: \(q_{\phi }(u|x,y,a)\), \(q_{\phi }(u|x,y)\) and \(q_{\phi }(u|x,a)\). As a baseline, we also use a classical Variational Autoencoder inference without counterfactual independence constraint (i.e., the CE-VAE objective without its last term).

All hyper-parameters of every approach have been tuned by 5-fold cross-validation. For our approach on the US Census data set, for instance, the encoder \(q_\phi\) architecture is an MLP with 3 hidden layers of 128, 64 and 32 units respectively, with ReLU activations. On this data set, the decoder \(p_\theta\) is an MLP with a single hidden layer of 64 units with a ReLU activation function, and the output consists of one single node with linear activation to reconstruct Y and 37 units to reconstruct X (the number of features). The adversarial neural network \(p_\psi\) is an MLP with two hidden layers of 32 and 16 units respectively. For the binary data sets, a sigmoid is applied on the outputs of the decoders for A and Y. For both MMD constraints we used a Gaussian radial basis function kernel. For all data sets, the prior distribution p(U) considered for training the models is a five-dimensional standard Gaussian.

In order to evaluate the level of dependence between the latent space U and the sensitive variable A, we compare the different approaches using the neural estimation of the HGR correlation coefficient given in Grari et al. (2019). This coefficient assesses the level of non-linear dependency between two jointly distributed random variables. The estimator is trained for each data set and each approach on the training set, comparing the observed variables A with the corresponding inferred codes U.
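As a rough sketch of this kind of estimator (following the neural HGR idea of Grari et al. (2019), not their exact implementation), two small networks f and g can be trained to maximize the correlation between their standardized outputs; the architecture and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

def hgr_estimate(u, a, hidden=16, steps=2000, lr=1e-3):
    """Neural estimate of the HGR coefficient between codes U and sensitive A (illustrative).

    u, a: 2-D float tensors of shape (n_samples, dim).
    """
    f = nn.Sequential(nn.Linear(u.shape[1], hidden), nn.ReLU(), nn.Linear(hidden, 1))
    g = nn.Sequential(nn.Linear(a.shape[1], hidden), nn.ReLU(), nn.Linear(hidden, 1))
    opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=lr)
    for _ in range(steps):
        fu, ga = f(u), g(a)
        fu = (fu - fu.mean()) / (fu.std() + 1e-8)   # standardize both transformations
        ga = (ga - ga.mean()) / (ga.std() + 1e-8)
        loss = -(fu * ga).mean()                    # maximize their correlation
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        fu, ga = f(u), g(a)
        fu = (fu - fu.mean()) / (fu.std() + 1e-8)
        ga = (ga - ga.mean()) / (ga.std() + 1e-8)
        return (fu * ga).mean().abs().item()
```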

For all data sets, we repeat five experiments by randomly sampling two subsets, \(80\%\) for the training set and \(20\%\) for the test set. Finally, we report the average reconstruction loss for X and Y on the test set, as well as the HGR between the inferred test codes and the corresponding sensitive attributes. The results of our experiments can be found in Table 1 for the discrete case and Table 2 for the continuous case. For all of them, we attempted, via the different hyperparameters (\(\uplambda _{x}\), \(\uplambda _{y}\), \(\uplambda _{MMD}\), \(\uplambda _{ADV}\)), to obtain the lowest dependence measure while keeping the reconstruction losses for X and Y as low as possible.

Fig. 3: Impact of \(\uplambda\) (Crime data set) on a specific instance i. Blue points are counterfactual predictions \(h_{\theta }(x_{i,A\leftarrow {a'}})\) for 1,000 values \(A\leftarrow {a'}\) generated randomly. The red cross represents the prediction \(h_{\theta }(x_{i,A\leftarrow {a}})\) for the real \(A=a\) of instance i (Color figure online)

Table 1 Inference results in the discrete case
Table 2 Inference results in the continuous case

As expected, the baseline without the independence constraint achieves the best X and Y reconstruction losses, but it is also the most biased one, with the worst dependence in terms of HGR on most data sets. Comparing the different constraints in the discrete case, the adversarial one globally achieves the best results, with the lowest HGR while maintaining a reasonable reconstruction of X and Y. Neither MMD constraint clearly outperforms the other. We observe that the best results in terms of independence are obtained without the sensitive variable given as input to the inference network (inference with X and Y only). Note however that for the MMD constraints, this setting implies making the wrong assumption of independence of U w.r.t. A given X and Y for the estimation of the constraint (as discussed at the end of Sect. 2). This is not the case for our adversarial approach, which obtains particularly good results in this setting on the discrete data sets. On continuous data sets, our approach succeeds in maintaining reasonable reconstruction losses for important gains in terms of HGR compared to the classical VAE approach (without constraint). Interestingly, on these data sets, our approach obtains slightly better results when using the full information (X, Y and A) as input to the inference network. We explain this by the fact that removing the influence of a binary input is harder than that of a smoother continuous one, while it can be useful information for generating relevant codes.

4.2 Step 2: Counterfactual predictive model

This section reports experiments involving the training procedure of step 2 as described in Sect. 3. The goal of these experiments is threefold: 1. assess the impact of the adversarial inference on the target task of counterfactual fairness; 2. compare our two proposals for counterfactual bias mitigation (i.e., using a uniform distribution or an adversarial dynamic one for the sampling of counterfactual sensitive values); and 3. assess the impact of the control parameter from Eq. (5).

The predictive model used in our experiments is an MLP with 3 hidden layers. The adversarial network \(P_\phi\) from Eq. (5) is an MLP with 2 hidden layers and ReLU activations. For all our experiments, a single counterfactual for each individual is sampled at each iteration during the training of the models. Optimization is performed using ADAM.

Tables 3 and 4 report the results for the discrete and the continuous case respectively. The inference column refers to the inference process that was used for sampling the counterfactuals used to learn the predictive model. For each setting, we use the best configuration from Tables 1 and 2. The mitigation column refers to the type of counterfactual mitigation that is used: no mitigation or \(L_{CF}\) (Eq. 2) for the discrete case; no mitigation, \(L_{CF}\) (Eq. 4) or \(L_{DynCF}\) (Eq. 5) for the continuous setting. Results are reported in terms of accuracy (for the discrete case) or MSE (for the continuous case) and of counterfactual fairness (CF). The CF measure is defined, for the \(m_{test}\) individuals of the test set, as:

$$\begin{aligned} CF=\frac{1}{m_{test}} \sum _i^{m_{test}} {\mathbb {E}}_{(x',a') \sim C(i)} [\Delta (h_{\theta }(x_i, a_i),h_{\theta }(x',a'))] \end{aligned}$$
(6)

where C(i) is the set of counterfactual samples for the i-th individual of the test set. These correspond to counterfactuals sampled with the adversarial inference process defined at step 1 (with the best configuration reported in Tables 1 and 2). As discussed above, the synthetic data sets allow one to rely on "true" counterfactuals for the computation of counterfactual fairness, rather than relying on an inference process which may include some bias. For these data sets, we thus also report an additional RealCF metric, which is defined as in Eq. (6), but using counterfactuals sampled from the true codes used to generate the test data. For both CF and RealCF, for every i from the test set, |C(i)| equals 1 in the binary settings and 1000 in the continuous one. \(\Delta\) is a cost function between two predictions: the logit pairing cost for the binary case (more details in Sect. 2.2, step 2) and a simple squared difference for the continuous setting.
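The CF measure of Eq. (6) could be computed on the test set along the following lines; `counterfactual_sampler` is a hypothetical helper standing for the step 1 inference and generation process, and the continuous setting (squared difference for \(\Delta\)) is assumed.

```python
import torch

def cf_metric(model, counterfactual_sampler, x_test, a_test, n_counterfactuals=1000):
    """Counterfactual fairness (CF) of Eq. (6), continuous setting (squared difference)."""
    gaps = []
    for i in range(x_test.shape[0]):
        pred = model(x_test[i:i + 1], a_test[i:i + 1])
        # C(i): counterfactual versions (x', a') of individual i from the step 1 model
        x_cf, a_cf = counterfactual_sampler(x_test[i:i + 1], a_test[i:i + 1],
                                            n_samples=n_counterfactuals)
        pred_cf = model(x_cf, a_cf)
        gaps.append(((pred - pred_cf) ** 2).mean())
    return torch.stack(gaps).mean().item()
```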

The results from both tables first confirm the good behavior of our inference model from step 1, which yields substantially better results than the other inference processes for both the discrete and the continuous settings. Our adversarial counterfactual inference framework produces codes that can easily be used to generate relevant counterfactual individuals. For this observation, the most important results are those given for the synthetic scenarios, for which the RealCF metric shows good results for our method; this metric is strongly reliable since it relies on counterfactuals sampled from the true codes of the individuals.

Secondly, results from Table 4 show that, even in the continuous setting where the enumeration of all values from \(\Omega _A\) is not possible, it is possible to define counterfactual mitigation methods such as our approaches \(L_{CF}\) and \(L_{DynCF}\). These two methods, used in conjunction with our adversarial inference, give significantly better results than no mitigation on every data set. Interestingly, we also observe that \(L_{DynCF}\) improves results over \(L_{CF}\), which shows the relevance of the proposed dynamic sampling process. Furthermore, note that we can reasonably expect even larger gains over \(L_{CF}\) on data with higher-dimensional sensitive attributes.

To illustrate the impact of the hyperparameter \(\uplambda\) on the prediction accuracy (MSE) and the counterfactual fairness estimation (CF), we plot results for 10 different values of \(\uplambda\) (5 runs each) in Fig. 4 for the Crime data set. It clearly confirms that higher values of \(\uplambda\) produce fairer predictions, while a value near 0 only focuses on optimizing the predictor loss. This is also observable in Fig. 3, which plots counterfactual predictions for a specific instance i of the test set. Higher values of \(\uplambda\) clearly produce more stable counterfactual predictions.

Table 3 Counterfactual fairness results for the discrete case
Table 4 Counterfactual fairness results for the continuous case
Fig. 4: Impact of hyperparameter \(\uplambda\) (Crime data set): higher values of \(\uplambda\) produce fairer predictions, while \(\uplambda\) near 0 only focuses on optimizing the regression loss

In Fig. 5, we consider the distribution of the sampled counterfactuals w.r.t. the sensitive variable A, for the uniform sampling strategy from \(P'(A)\) and for the dynamic strategy defined in Eq. (5). This is done on the Motor data set and for a specific randomly sampled instance i with sensitive attribute \(a_i=75\), at a given point of the optimization, well before convergence (the model is clearly unfair at this point). The blue points are the counterfactual fairness estimations \((h_\theta ({X_{i,A\leftarrow {a}},a)}-h_\theta (X_{i,A\leftarrow {a'}},a'))\) for 1,000 counterfactual values a' sampled from the uniform distribution \(P'(A)\). The red points are the counterfactual fairness estimations for counterfactuals corresponding to 30 values a' sampled from our dynamic distribution \(P_\phi (a'|u)={\mathcal {N}}(\mu _{\phi }(u),\sigma _{\phi }^2(u)I)\), where \(\phi\) are the parameters of the adversarial network which outputs the best mean \(\mu _{\phi }(u)\) and variance \(\sigma _{\phi }^2(u)\) for each latent code u. Being optimized to maximize the error at each gradient step, the red points are sampled at lower values of A, where the error is the largest. More importantly, very few points are sampled in the easy area, near the true sensitive value of i, which is 75. This demonstrates the good behavior of our dynamic sampling process.

Fig. 5: Dynamic sampling visualization for a randomly sampled individual whose age A is 75. Red points are counterfactuals sampled from the dynamic distribution \(P_\phi (a'|u)\), with u the inferred confounder for this individual (Color figure online)

4.3 Total and counterfactual effect

In addition, we propose to compare the performance of our approach with works based on fair data generation (Louizos et al., 2017; Kim et al., 2021; Kocaoglu et al., 2017; Xu et al., 2019), as mentioned at the end of Sect. 2.2, to emphasize the benefits of our two-step process for learning counterfactually fair prediction models.

Traditionally, these methods are evaluated in terms of the total and counterfactual causal effects of the sensitive attribute on the data generated by the models. The total causal effect (TCE) aims at assessing the statistical parity of the outcomes generated from a causal intervention. The TCE for binary sensitive attributes is defined as:

$$\begin{aligned} TCE= P(Y_{A\leftarrow {a_{1}}})-P(Y_{A\leftarrow {a_{0}}}) \end{aligned}$$
(7)

where \(Y_{A\leftarrow {a}}\) corresponds to the generated causal transformation of the outcome Y, resulting from setting a as the sensitive attribute value of the corresponding individual, according to the causal graph G (i.e., obtained via the distribution \(P(Y_{A\leftarrow {a}}|X,Y,A)\)).

A limit of such a metric is that it only considers fairness in the data given as training set to the predictor model. We claim that this is not enough, since any residual bias in these data may strongly impact the final predictions (on both training and testing data). To overcome this limitation, and to assess the total effect of the sensitive attribute on predictions rather than on training data only, we introduce the Total Predictions Effect (TPE), which refers to the statistical parity of the output predictions under intervention. The metric is defined in the binary case as:

$$\begin{aligned} TPE= P(h_\theta (X_{A\leftarrow {a_{1}}}))-P(h_\theta (X_{A\leftarrow {a_{0}}})) \end{aligned}$$
(8)

which takes into account the fairness of the predictor on the transformed data \(X_{A\leftarrow {a}}\).

Following (Kim et al., 2021; Xu et al., 2019), we also consider counterfactual effects, which depend on the effect of the sensitive attribute on the outcome for specific individuals (or groups of individuals). Similarly to the total effect, for any observation o, we consider the Counterfactual Causal Effect defined as \(CCE= P(Y_{A\leftarrow {a_{1}}}|o)-P(Y_{A\leftarrow {a_{0}}}|o)\), and we introduce the Counterfactual Prediction Effect as \(CPE= P(h_\theta (X_{A\leftarrow {a_{1}}})|o)-P(h_\theta (X_{A\leftarrow {a_{0}}})|o)\).
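For a binary sensitive attribute, TPE and CPE could be estimated as sketched below; `generate_intervened` is a hypothetical helper returning the transformed inputs \(X_{A\leftarrow a}\) for a test set, and predictions are assumed to be in \(\{0,1\}\).

```python
import numpy as np

def total_prediction_effect(model, generate_intervened, x, a):
    """TPE (Eq. 8): difference in positive prediction rates under A <- 1 vs A <- 0."""
    x_a1 = generate_intervened(x, a, target=1)   # X_{A <- 1} from the causal model
    x_a0 = generate_intervened(x, a, target=0)   # X_{A <- 0}
    return model(x_a1).mean() - model(x_a0).mean()

def counterfactual_prediction_effect(model, generate_intervened, x, a, groups):
    """CPE: same quantity, restricted to each observed group o (e.g., race x native_country)."""
    effects = {}
    for o in np.unique(groups):
        mask = groups == o
        x_a1 = generate_intervened(x[mask], a[mask], target=1)
        x_a0 = generate_intervened(x[mask], a[mask], target=0)
        effects[o] = model(x_a1).mean() - model(x_a0).mean()
    return effects
```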

Causal Effect In Table 5, we report the results obtained from the different generated data observations on the Adult UCI data set. We consider the condition observations o as the concatenation of the features race and native_country, as in Xu et al. (2019) and Kim et al. (2021) (\(O=\{race,native\_country\}\)). We report the chi-square distance \(\chi ^2\), which indicates the similarity between the generated and the real data set. We consider three baselines that are unaware of the fairness constraint: CausalGAN (Kocaoglu et al., 2017), which preserves the causal structure, DCEVAE WR, which represents the DCEVAE architecture (Kim et al., 2021) without any fairness regularization term (i.e., \(\beta _f=0\) according to the notations in Kim et al. (2021)), and the original data. Our approach, which contains no fairness penalty on the generated outcomes (in step 1), is also designed to reflect the causal structure. We also analyze the impact of CFGAN TE (Xu et al., 2019), which aims at decreasing the TCE in the generated data, CFGAN CE (Xu et al., 2019), which in turn aims at decreasing the 4 different counterfactual causal effects (CCE), and finally DCEVAE, which corresponds to the DCEVAE model with a fairness penalization set to \(\beta _f=0.3\).

As expected, only the three last methods, which act on the data rather than on the predictor itself, are able to mitigate the TCE and CCE. Our method does not seek to mitigate biases in the inferred outcomes, but to leverage inferred variables that allow it to learn a fair predictor. It is thus no surprise that the reconstructed Y is not unbiased with regard to the sensitive attribute; this is even a good indication that no information is lost in step 1 of our process, despite mitigating the correlation between the latent confounder U and the sensitive attribute A. In the following, we compare these observations to the results in terms of prediction effects.

Table 5 Total causal effect and counterfactual causal effect on adult UCI

Predictions Effect In this part, we focus on the level of fairness of the final predictor model. A Logistic Regression (LR), a Neural Network (NN) and a classification tree (CART) are considered in the following. These predictors are either trained on the data sets produced by the generation-focused models (i.e., CausalGAN, CFGAN TE, CFGAN CE, DCEVAE WR and DCEVAE), or trained in the second step as described in Sect. 3.2 for our two-step model. Please note that our algorithm requires differentiable predictors for its gradient-based optimization; we therefore discard the CART tree for our approach.

We report the results for each prediction model in Table 6, in terms of TPE, CPE (measured on generated data for test samples) and prediction accuracy (measured on the original test data set). From this table, we observe completely different results than those from the previous table, with generation-based models such as CFGAN greatly penalized compared to two-step methods such as ours. This confirms our intuition that, even if the produced data have well-mitigated biases on the test set (as seen in Table 5), some small residuals of these biases can remain in the data. The learning process is then free to place important emphasis on these problematic features if this helps to achieve a good prediction accuracy. In two-step approaches such as ours, this is not the case, since the biases of the outcomes are mitigated while the prediction model is learned, which yields more robust fairness on test data.

Table 6 Total predictions effect and counterfactual predictions effect on adult UCI

5 Conclusion

We developed a new adversarial learning approach for counterfactual fairness. To the best of our knowledge, this is the first such method that can be applied to continuous sensitive attributes. The method proved to be very effective for different dependence metrics on various artificial and real-world data sets, for both the discrete and the continuous settings. Finally, our proposal is applicable to any causal graph to achieve generic counterfactual fairness. As future work, it might be interesting to consider a generalization of our proposal to Path-Specific (Chiappa, 2019) counterfactual fairness in the continuous case.