1 Introduction

Artificial Intelligence (AI) now permeates many processes used to make important decisions, such as filtering job applicants, deciding whether an applicant should receive credit and recognizing people in images [15, 25]. Given this, it is essential to ensure that AI-driven models do not exhibit behaviour which is morally or legally undesirable. In AI, data is a collection of attributes, which can be either explicit (e.g. labels) or implicit (e.g. information contained in an image). Some of these attributes are referred to as protected attributes because they should not be used as a basis for discrimination (e.g. gender, race or age). However, it has been shown numerous times that naïvely trained AI models are biased against one or more of these protected attributes, exhibiting lower accuracy for some demographics [4, 11, 19]. Such behaviour discriminates against those demographics and is morally or legally undesirable, or simply unfair. There are two common sources of unfair behaviour in AI systems. The first is bias present in the data used to train AI models. Bias in the data with respect to a protected attribute can cause a model trained on that data to discriminate against that attribute [2]. For example, if a dataset used to train a facial recognition model for unlocking doors contains only images of men (i.e. bias with respect to gender), then the learned model will not accurately recognise and admit women (i.e. unfair behaviour). The second source of bias arises because some values or demographics of a protected attribute are inherently harder for AI to recognize than others. For example, it has been shown that even when training with a balanced dataset, faces with a darker skin tone are harder for facial recognition algorithms to recognize [28].

Various solutions to the fairness problem have been proposed. We focus on algorithmic in-processing methods for reducing bias [5, 6, 8,9,10, 14, 29, 33]. In-processing addresses the bias of a model by applying an extra objective during training which makes the model bias-aware and consequently yields a fairer model. In-processing has proven to be quite effective at reducing the unfair behaviour of AI. However, in-processing methods often include extra models, which can increase training cost and complexity [16]; use adversarial training [6, 8, 9, 33], which is notoriously unstable [24]; or make assumptions about the representation space of the model which may not hold in all cases [10, 21, 32]. Creating fair AI models is particularly difficult in the computer vision domain, as any problems with extra computational cost and complexity are exacerbated by the large models involved. Additionally, the high dimensionality of images means they can contain many implicit attributes, which are often highly correlated with each other and with protected attributes. Disentangling the implicit factors is especially challenging in these cases.

In this paper, we introduce Bias Accuracy Standard deviation Estimation, or BASE, a novel fairness algorithm which optimizes a differentiable approximation of the fairness metric standard deviation of accuracy across demographics (\(\sigma _{\textrm{Acc}}\)) to learn an AI model which is fair with respect to equalized odds (EO). Models that exhibit a low standard deviation of accuracy across demographics, or variance of demographics, have the property of equal performance on a target task regardless of the demographic of the protected attribute. For example, a facial recognition model with a low variance of demographics for ethnicity is equally likely to correctly recognize the identity of a person in an image regardless of their ethnicity. Reducing the variance of demographics of a model therefore makes it fairer w.r.t. EO. However, the variance of demographics is difficult to use directly for a model trained with gradient-based optimization. This is because the accuracy of a single sample, an integral part of the variance of demographics (Sect. 2.3), has an undefined gradient at the decision boundary and a zero gradient everywhere else, and so has no influence on the model parameters. BASE overcomes this difficulty by instead using a sigmoid-based approximation of accuracy, which we call soft accuracy, inside the variance of demographics metric. This approach has multiple advantages. The first is computational efficiency: for example, training an image classifier with BASE incurs only the extra computation of calculating the variance of demographics. Compare this to training a classifier with knowledge distillation [16] or adversarial debiasing [33], where additional models are used which incur extra memory usage for the model parameters and gradients, along with extra computation for the forward pass of the additional model. Secondly, BASE makes no assumptions about the representation space; the model automatically learns the representation space structure required to reduce the variance of demographics. Furthermore, due to its simplicity, BASE can be combined with other solutions.

To summarize the main contributions of our work are:

  • Provide a novel method for improving the fairness of AI models trained with gradient-based optimization that is algorithmically simpler than existing approaches and does not rely on training additional models (Sect. 3.1).

  • Show that our method is competitive with, and in some cases outperforms, current state-of-the-art fair image classifiers when using either a biased or an unbiased dataset (Sect. 4.4).

  • Show that our method increasingly outperforms a naïve classifier in fairness when exposed to increasingly biased training sets in which target and protected attributes are strongly correlated. Our method also achieves higher overall accuracy on heavily biased datasets (Sect. 4.4).

2 Related Work and Preliminaries

Fair AI has received increasing attention in the past few years and a varied range of solutions has been proposed. Algorithmic methods for reducing bias can be broken down into three main categories based upon when they apply their fairness constraint. Pre-processing methods aim to change the distribution of the data used for training such that a fairer model is produced. These methods include re-sampling, which changes the sampling rate of data during training to ensure each protected class is equally represented [1, 23, 26], and augmentation methods, which add synthetic data to the dataset [3, 22, 31, 34] to balance the protected classes. The second class of methods, post-processing methods, aim to adjust predictions after the fact to compensate for the bias [30]. Pre-processing and post-processing have some major drawbacks. Pre-processing only addresses the bias in the dataset, and the inherent difficulty of some demographics can still produce a biased model [28, 29]. On the other hand, post-processing methods require protected attribute labels to be known at inference time or assume that the target and protected attribute are independent [30]. Our method belongs to the final category, in-processing, which is discussed further below. In-processing methods typically run under a constrained optimization scheme where a loss penalty or a special construction of the AI model is used to reduce the bias during optimization.

2.1 In-processing for Fair Classification

Like many machine learning tasks, the fairness problem is difficult to optimize directly, and adversarial training has become a common method to create fair representations and predictors [6, 8, 9, 29, 33]. These methods use an adversarial model, or adversary, whose purpose is to learn the relationship between the predictor and the protected attribute. The output of the adversary is then used to enforce a fairness constraint upon the predictor, either by gradient reversal of the adversary or by maximising the entropy of the adversary's predictions. If a strong adversary is unable to determine a relationship between the predictor and the protected attribute, then fairness of the predictor can be guaranteed [33].

Other constrained optimization methods have been proposed and their approaches vary greatly. Gong et al. [10] minimize the variance of sample density across different demographics within the representation space. Cho et al. [5] use a kernel density estimate to approximate, in a differentiable manner, the conditional distributions used for measuring fairness. Hwang et al. [14] reduce the Wasserstein distance between protected groups within the representation space. Finally, in the work most similar to our own, Shen et al. [27] use cross-entropy loss as a proxy for probability during training to optimise for fairness. Our method differs in two main aspects: our objective directly considers the two elements of the model's output vector responsible for determining accuracy, and we evaluate our work in the computer vision domain.

2.2 Problem Definition

The ultimate goal of fair machine learning is to create predictors which contain no bias. There are, however, many different forms of bias that can present themselves and, as a consequence, multiple different definitions of fairness. The three most common definitions are demographic parity [33], equalized odds [12] and equalized opportunity [12]. In the following section A, \(\hat{Y}\) and Y are random variables which represent the protected attribute, the output of a predictor and the true value of the target attribute, respectively.

Demographic Parity. Demographic parity is the simplest form of fairness since it only considers the output of the predictor and the protected attribute. A predictor satisfies demographic parity when its output is independent of the protected attribute. That is, \(\forall a\in \mathcal {A}; \textrm{Pr}(\hat{Y}=\hat{y}|A=a) = \textrm{Pr}(\hat{Y}=\hat{y})\). However, this definition does not always allow for perfect classification [12]. If there is any correlation between the protected attribute and the target task, then maintaining independence forces a reduction in performance. For example, if we learned a predictor for university admittance with age as a protected attribute, then achieving demographic parity would require our predictor to admit young children with the same probability as those who had just finished high school, regardless of each individual's suitability.

Equalized Odds. Equalized odds is another definition of fairness that is more commonly applied to computer vision tasks. A predictor satisfies equalized odds when its output is conditionally independent of the protected attribute for every class of the target task. That is, \(\forall y\in \mathcal {Y}, \forall a,a'\in \mathcal {A}, \textrm{Pr}(\hat{Y}=y|A=a,Y=y) = \textrm{Pr}(\hat{Y}=y|A=a',Y=y)\). This definition allows us to maintain performance, as it is satisfied when a predictor achieves the same level of accuracy for each demographic of the protected attribute.

Equalized Opportunity. Equalized opportunity is a special case of equalized odds for which there is a class of the target task \(y_+\in \mathcal {Y}\) that confers advantage, e.g., receiving a loan or being hired for a job. It is a relaxation of equalized odds that is satisfied when the output of the predictor is conditionally independent of the protected attribute for only the advantageous class. That is, \(\forall a,a'\in \mathcal {A}, \textrm{Pr}(\hat{Y}=y_+|A=a,Y=y_+) = \textrm{Pr}(\hat{Y}=y_+|A=a',Y=y_+)\).

Equalized odds and equalized opportunity are more practical definitions of fairness when applied to computer vision problems because they still allow full predictive capability [12]. Further, since equalized opportunity is a relaxation of equalized odds, achieving equalized odds also achieves equalized opportunity. Therefore, in this work we aim to create predictors that satisfy equalized odds.

2.3 Distance Measures for Equalized Odds

Though the goal is to achieve true equalized odds, current methods are unable to achieve it [5, 16, 33]. Therefore, we need to use metrics to quantify how far a predictor is from true equalized odds. In this work we use three different metrics to measure the level of fairness of a predictor. The first two metrics use the difference in predictor output between different demographics of a protected attribute. This difference is called the difference of equalized odds (DEO).

$$\begin{aligned} \textrm{DEO}(a,a',y) \triangleq \big |\textrm{Pr}(\hat{Y}=y|A=a,Y=y) - \textrm{Pr}(\hat{Y}=y|A=a',Y=y)\big |\;. \end{aligned}$$
(1)

DEO can be used directly when the protected attribute is binary and can easily be extended to more demographics by aggregating DEO across the different target and protected attribute values. The methods used to aggregate DEO differ between various works in the literature. We use the aggregation methods from Jung et al. [16], who propose two different methods of aggregation, \(\mathrm {DEO_{max}}\) and \(\mathrm {DEO_{avg}}\), shown in Eqs. (2) and (3), respectively. \(\mathrm {DEO_{max}}\) can be used to understand the peak bias of an AI model and \(\mathrm {DEO_{avg}}\) can be used to understand the bias of a model in the majority of cases.

$$\begin{aligned} \mathrm {DEO_{max}} \triangleq \max _{y}(\max _{a,a'}(\textrm{DEO}(a,a',y)))\;. \end{aligned}$$
(2)
$$\begin{aligned} \mathrm {DEO_{avg}} \triangleq \frac{1}{|\mathcal {Y}|}\sum _{y}(\max _{a,a'}(\textrm{DEO}(a,a',y)))\;. \end{aligned}$$
(3)
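
As a concrete illustration, the sketch below estimates DEO, \(\mathrm {DEO_{max}}\) and \(\mathrm {DEO_{avg}}\) from arrays of labels. The NumPy-based formulation and the function name are our own illustrative choices rather than code from the cited works, and it assumes integer-encoded target and protected labels.

```python
import numpy as np

def deo_max_avg(y_true, y_pred, protected):
    """Estimate DEO_max (Eq. 2) and DEO_avg (Eq. 3) from integer label arrays."""
    per_class_spread = []
    for y in np.unique(y_true):
        # Pr(Yhat = y | A = a, Y = y) for every demographic a present for class y
        rates = []
        for a in np.unique(protected):
            mask = (y_true == y) & (protected == a)
            if mask.any():
                rates.append((y_pred[mask] == y).mean())
        # max_{a, a'} DEO(a, a', y) is the spread of these conditional rates
        per_class_spread.append(max(rates) - min(rates))
    return max(per_class_spread), float(np.mean(per_class_spread))
```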

Another fairness metric that is commonly reported, often in the fair face recognition literature, is the standard deviation of accuracy across the demographics of the protected attribute, denoted by \(\sigma _{\textrm{Acc}}\). This metric is shown in Eq. (5), where \(\mu \), defined in Eq. (4), is the average accuracy across all the demographics. Note that \(\textrm{Pr}(\hat{Y}=y|A=a)\) is equivalent to the accuracy of the predictor \(\hat{Y}\) in the domain of demographic a.

$$\begin{aligned} \mu = \frac{1}{|\mathcal {A}|}\sum _{a\in \mathcal {A}}[\textrm{Pr}(\hat{Y}=y|A=a)] \end{aligned}$$
(4)
$$\begin{aligned} \sigma _{\textrm{Acc}} \triangleq \sqrt{\frac{1}{|\mathcal {A}|}\sum _{a}\left[ \textrm{Pr}(\hat{Y}=y|A=a) - \mu \right] ^2} \end{aligned}$$
(5)
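
Similarly, \(\sigma _{\textrm{Acc}}\) of Eqs. (4) and (5) can be estimated directly from per-demographic accuracies; the short sketch below is again an illustrative formulation of ours.

```python
import numpy as np

def sigma_acc(y_true, y_pred, protected):
    """Standard deviation of accuracy across demographics (Eqs. 4 and 5)."""
    accs = np.array([(y_pred[protected == a] == y_true[protected == a]).mean()
                     for a in np.unique(protected)])
    # Population standard deviation: sqrt(mean((acc_a - mu)^2))
    return float(accs.std())
```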

All these metrics represent a distance from true equalized odds. In all cases this means that lower values indicate a fairer classifier.

3 Method

3.1 A Differentiable Approximation for Distance from Equalized Odds

The standard strategy for training an AI classification model uses a distance measure between the model's output distribution and the true data distribution, referred to as the loss or objective function. A gradient-based optimization method is then used to update the parameters of the model to reduce this distance. This is a simple but incredibly effective strategy. We aim to use the same strategy to increase the fairness of an AI model, using \(\sigma _{\textrm{Acc}}\) as an objective function to reduce the distance from true EO.

In what follows, we use boldface fonts to denote vectors, e.g., \(\boldsymbol{\mathrm {\hat{y}}} \in \mathcal {\hat{Y}}\) denotes the output vector of the model. We use \(\hat{y}_t\) to denote the element of \(\boldsymbol{\mathrm {\hat{y}}}\) corresponding to the ground truth label y. Furthermore, \(\hat{y}_m = \textrm{max}(\boldsymbol{\mathrm {\hat{y}}}\setminus \{\hat{y}_{t}\})\) represents the largest non-ground-truth element of \(\boldsymbol{\mathrm {\hat{y}}}\), and \(\mathcal {\hat{Y}}_a\) represents the domain of demographic a for the protected attribute. The accuracy of a single sample \(\boldsymbol{\mathrm {\hat{y}}}\) is defined in Eq. (6).

$$\begin{aligned} \textrm{Acc}(\hat{y}_t, \hat{y}_m) \triangleq {\left\{ \begin{array}{ll} 1 &{} \hat{y}_t > \hat{y}_m \\ 0 &{} \mathrm {otherwise.} \end{array}\right. } \end{aligned}$$
(6)

In essence, if the element \(\hat{y}_t\) is greater than all other elements, the model has correctly predicted the outcome for this sample and therefore, has an accuracy of one.

Since \(\mathbb {E}_{\boldsymbol{\mathrm {\hat{y}}}\sim \mathcal {\hat{Y}}_a}[\textrm{Acc}(\hat{y}_t, \hat{y}_m)]=\textrm{Pr}(\hat{Y}=y|A=a)\), we substitute the expectation into Eqs. (4) and (5), which gives us Eqs. (7) and (8).

$$\begin{aligned} \mu = \frac{1}{|\mathcal {A}|}\sum _{a\in \mathcal {A}}\mathbb {E}_{\boldsymbol{\mathrm {\hat{y}}}\sim \mathcal {\hat{Y}}_a}[\textrm{Acc}(\hat{y}_t, \hat{y}_m)] \end{aligned}$$
(7)
$$\begin{aligned} \sigma _{\textrm{Acc}} = \sqrt{\frac{1}{|\mathcal {A}|}\sum _{a\in \mathcal {A}}\left[ \mathbb {E}_{\boldsymbol{\mathrm {\hat{y}}}\sim \mathcal {\hat{Y}}_a}[\textrm{Acc}(\hat{y}_t, \hat{y}_m)] - \mu \right] ^2} \end{aligned}$$
(8)

This is the objective we would like to optimize. However, to be compatible with the gradient-based optimization used to train AI models, an objective needs to be differentiable, which \(\sigma _{\textrm{Acc}}\) is not, due to the undefined gradient of \(\textrm{Acc}(\hat{y}_t, \hat{y}_m)\) at \(\hat{y}_t=\hat{y}_m\). Instead, we approximate the accuracy with a sigmoid-based soft accuracy function, shown in Eq. (9), which is differentiable. The soft accuracy is characterised by \(\kappa \), a hyper-parameter that controls the sharpness of the function. A higher value of \(\kappa \) leads to a closer approximation of accuracy, with \(\lim _{\kappa \rightarrow \infty } \mathrm {Acc_{soft}}(\hat{y}_t, \hat{y}_m) = \textrm{Acc}(\hat{y}_t,\hat{y}_m)\); however, this comes at the cost of an increasingly sparse gradient.

$$\begin{aligned} \mathrm {Acc_{soft}}(\hat{y}_t, \hat{y}_m) \triangleq \frac{1}{1+e^{-\kappa (\hat{y}_t-\hat{y}_m)}} \end{aligned}$$
(9)
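
For concreteness, a PyTorch sketch of Eq. (9) is given below. How \(\hat{y}_t\) and \(\hat{y}_m\) are extracted from a batch of output vectors, and the default value of \(\kappa \), are our own illustrative choices; the paper text does not prescribe a specific implementation.

```python
import torch

def soft_accuracy(outputs, targets, kappa=10.0):
    """Differentiable approximation of per-sample accuracy (Eq. 9).

    outputs: (batch, num_classes) model output vectors.
    targets: (batch,) ground-truth class indices.
    kappa:   sharpness hyper-parameter (illustrative default).
    """
    y_t = outputs.gather(1, targets.unsqueeze(1)).squeeze(1)         # ground-truth element
    masked = outputs.scatter(1, targets.unsqueeze(1), float("-inf"))  # mask out ground truth
    y_m = masked.max(dim=1).values                                    # largest non-ground-truth element
    return torch.sigmoid(kappa * (y_t - y_m))
```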

We then substitute soft accuracy for accuracy in \(\sigma _{\textrm{Acc}}\). This gives us the objective shown in Eq. (11).

$$\begin{aligned} \mu _{\textrm{soft}} = \frac{1}{|\mathcal {A}|}\sum _{a\in \mathcal {A}}\mathbb {E}_{\boldsymbol{\mathrm {\hat{y}}}\sim \mathcal {\hat{Y}}_a}[\mathrm {Acc_{soft}}(\hat{y}_t, \hat{y}_m)] \end{aligned}$$
(10)
$$\begin{aligned} \sigma _{\mathrm {Acc_{soft}}} \triangleq \sqrt{\frac{1}{|\mathcal {A}|}\sum _{a\in \mathcal {A}}\Big [\mathbb {E}_{\boldsymbol{\mathrm {\hat{y}}}\sim \mathcal {\hat{Y}}_a}\big [ \mathrm {Acc_{soft}}(\hat{y}_t, \hat{y}_m)\big ] - \mu _{\textrm{soft}}\Big ]^2} \end{aligned}$$
(11)

This is the differentiable objective that we can optimize to obtain a fair predictor.
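
Building on the soft_accuracy sketch above, a mini-batch estimate of Eqs. (10) and (11) might look as follows. Restricting the average to the demographics present in the batch is our assumption; the balanced sampling of Sect. 3.3 ensures every demographic is represented.

```python
import torch

def sigma_acc_soft(outputs, targets, protected, kappa=10.0):
    """Differentiable standard deviation of soft accuracy across demographics
    (Eqs. 10 and 11), estimated on a single mini-batch."""
    acc = soft_accuracy(outputs, targets, kappa)              # per-sample soft accuracy
    group_means = torch.stack([acc[protected == a].mean()     # E[Acc_soft | A = a]
                               for a in protected.unique()])
    mu_soft = group_means.mean()                              # Eq. (10)
    return torch.sqrt(((group_means - mu_soft) ** 2).mean())  # Eq. (11)
```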

3.2 Training Objective

By itself, the soft accuracy fairness objective does not learn to classify; in fact, the easiest way for a model to achieve equalized odds is to classify each sample randomly. Since it is important that the model still achieves high utility, we combine the soft accuracy fairness objective with a cross-entropy classification objective. This gives us the full objective shown in Eq. (12).

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{\textrm{ce}} + \gamma \sigma _{\mathrm {Acc_{soft}}} \end{aligned}$$
(12)

The two losses, \(\mathcal {L}_{\textrm{ce}}\) and \(\sigma _{\mathrm {Acc_{soft}}}\), target different objectives, namely classification performance and fairness respectively, and applying too much weight to one can harm the other. We use \(\gamma \) as a hyper-parameter to balance the utility of the model with its fairness. A higher value of \(\gamma \) results in a fairer classifier, although this can often come at the cost of classification performance. We experimentally determined the optimal value of \(\gamma \) for each dataset by performing a grid search. However, we observe that an extensive search is not required; finding the correct order of magnitude results in good performance.
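
As a usage illustration, the combined objective of Eq. (12) could be applied in a standard training step as sketched below; the backbone, optimizer settings, data loader and the value of \(\gamma \) are placeholders, and sigma_acc_soft refers to the sketch in Sect. 3.1.

```python
import torch
from torchvision.models import resnet18  # illustrative backbone, matching Sect. 4.3

model = resnet18(num_classes=3)                            # e.g. three age classes
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # placeholder optimizer settings
criterion = torch.nn.CrossEntropyLoss()
gamma = 1.0  # placeholder; in practice chosen per dataset by a coarse grid search

for images, targets, protected in train_loader:  # hypothetical loader yielding protected labels
    outputs = model(images)
    loss = criterion(outputs, targets) + gamma * sigma_acc_soft(outputs, targets, protected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```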

3.3 Balancing the Training Dataset

When calculating \(\sigma _{\mathrm {Acc_{soft}}}\) on a mini-batch, the number of samples used to estimate the soft accuracy of each protected demographic is highly important. If the number of samples for a particular demographic is too low, the variance of its soft accuracy estimate increases. Differences in variance between the demographics lead to unstable training gradients, which has a negative impact on performance. To counter this effect we simply oversample the training dataset such that each (protected, target) attribute pair is evenly sampled. This is achieved by randomly duplicating samples from the under-represented pairs until all (protected, target) attribute pairs contain the same number of samples. More sophisticated methods exist [22, 31] which could be used to augment the training dataset and may lead to gains in performance; however, we leave this investigation to future work.
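
A minimal sketch of this oversampling, assuming integer-encoded target and protected labels; the index-level formulation and the random duplication strategy are our own illustrative reading of the description.

```python
import random
from collections import defaultdict

def oversample_indices(targets, protected, seed=0):
    """Return dataset indices in which every (protected, target) pair appears
    the same number of times, by randomly duplicating under-represented pairs."""
    rng = random.Random(seed)
    pairs = defaultdict(list)
    for idx, (y, a) in enumerate(zip(targets, protected)):
        pairs[(a, y)].append(idx)
    largest = max(len(idxs) for idxs in pairs.values())
    balanced = []
    for idxs in pairs.values():
        balanced += idxs + rng.choices(idxs, k=largest - len(idxs))
    rng.shuffle(balanced)
    return balanced
```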

4 Experiments

In the following section we thoroughly investigate and validate the capability of our soft accuracy fairness objective.

4.1 Baselines

We compare our algorithm with four different baselines. The first is a naïve classifier that is not aware of fairness in any regard; this baseline represents the worst-case scenario for fairness. Since one source of bias is an unbalanced dataset, we also include a naïve classifier trained by oversampling the dataset such that it is balanced, which we refer to as Naïve Balanced. The third baseline is Adversarial Debiasing (AD) [33], a common benchmark method, and the final baseline is the state-of-the-art in-processing method MFD [16]. The original MFD paper only provided results for the age task with the UTKFace dataset and implemented only a simple data augmentation scheme. We employed further data augmentations, which allowed our naïve classifier to achieve a much higher accuracy (\(74.7\%\) vs \(83.1\%\)). In the spirit of fair comparison, we run their code with our datasets and augmentation scheme, which allows MFD to achieve comparable accuracy. Where applicable, results from the original paper are reported as MFD\(^\diamond \). Similarly, AD was originally implemented on non-computer-vision tasks, so we re-implement AD for evaluation on CV tasks. For both re-implementations we perform a sweep of the bias loss hyper-parameter, discard hyper-parameter choices that lead to a large reduction in accuracy and report the best results.

4.2 Datasets

We use three datasets for our experiments: UTKFace [17], CelebA [20] and Fairface [18]. UTKFace and CelebA are face image datasets commonly used to benchmark fairness. UTKFace contains \(\sim 20\textrm{k}\) samples with annotations of age, gender and ethnicity. CelebA contains \(\sim 200\textrm{k}\) images which are labelled with 40 binary attributes. The images from UTKFace and CelebA cover a large variation in position, facial expression, illumination, occlusion and resolution. Buolamwini and Gebru [4] note that collecting a balanced dataset should be the first step in a fairness solution. It is therefore important that we also evaluate our method under these conditions, for which we use the Fairface dataset. Fairface is also a face image dataset. It contains \(\sim 98\textrm{k}\) images with annotations of age, gender and ethnicity. Fairface was created in an effort to reduce racial bias in existing datasets, with a strong focus on reducing the imbalance of races during its creation. As shown in Fig. 1, the race labels in Fairface are much more balanced than in UTKFace. Using UTKFace, CelebA and Fairface we evaluate two scenarios: one where a task is trained with a balanced dataset and one where it is trained with a biased dataset. UTKFace provides age labels as integers; instead of learning a regression problem we group ages into classes. To allow comparison we follow the division used by Jung et al. [16], where ages are divided into three classes: less than 20, 20–39 and greater than 40. Fairface provides age labels in classes already; however, they are heavily imbalanced, with far fewer samples in the youngest and oldest classes. To maintain Fairface as a balanced test set we divide the ages into four new classes to balance them: 0–19, 20–29, 30–39 and 40+.
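
As a small illustration of the age grouping described above, the following sketch maps an integer UTKFace age to one of the three classes; the function name is ours and the handling of an age of exactly 40 is an assumption, since the text specifies only "less than 20, 20–39 and greater than 40".

```python
def utkface_age_class(age: int) -> int:
    """Map an integer UTKFace age label to the three classes of Jung et al. [16]."""
    if age < 20:
        return 0   # under 20
    if age < 40:
        return 1   # 20-39
    return 2       # oldest class (treatment of exactly 40 is our assumption)
```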

Fig. 1. The relative distribution of different races in the UTKFace dataset and Fairface dataset. UTKFace labels both East Asian and South East Asian faces together, so these are shown under East Asian.

Skewed Fairface. Since it is imperative to understand how the performance of a fairness algorithm relates to the bias of a dataset, we present a protocol for controlling the bias within a dataset. We apply this protocol to Fairface to create a dataset which we name Fairface Skewed (FairfaceS). FairfaceS is characterised by a skew parameter (s) which ranges from 0 to 1. The skew parameter describes the relative distribution of (target, protected) attribute pairs, where a higher skew parameter leads to a dataset with a higher correlation between the target attribute and the protected attribute. The relative distribution is calculated by arranging the classes into a 2D array. Two diagonal corners are assigned the value 1 and the other two diagonal corners are assigned the value \(1-s\). Bilinear interpolation is then used to calculate the remaining values of the matrix. An example of the relative distribution for different skews is shown in Fig. 2. Fairface is then under-sampled such that the relative distribution of each (target, protected) pair matches that in the matrix. This protocol imposes an order on the classes; however, in the absence of a rigorous similarity metric between separate demographics and target attribute values, we simply order the classes alphabetically.
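
One way to realise the relative-distribution matrix described above is sketched below; which corner pair receives the value 1 and the separable form of the bilinear interpolation are inferred from the description, so this should be read as an illustration rather than the exact protocol.

```python
import numpy as np

def skew_matrix(n_target, n_protected, s):
    """Relative distribution of (target, protected) pairs for a skew parameter s.

    One pair of diagonal corners is set to 1, the other pair to 1 - s, and the
    remaining entries are filled by bilinear interpolation."""
    c00 = c11 = 1.0        # main-diagonal corners
    c01 = c10 = 1.0 - s    # anti-diagonal corners
    u = np.linspace(0.0, 1.0, n_target)[:, None]
    v = np.linspace(0.0, 1.0, n_protected)[None, :]
    return ((1 - u) * (1 - v) * c00 + (1 - u) * v * c01
            + u * (1 - v) * c10 + u * v * c11)
```

In this sketch, s = 0 yields a uniform matrix, while s = 1 drives the anti-diagonal corners to zero, maximising the correlation between the two attributes.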

Fig. 2. The relative distribution of (protected, target) pairs in the FairfaceS dataset for different skew values.

As the skew value increases, the mutual information between the protected attribute and the target attribute increases, leading to an increase in bias. Using FairfaceS allows us to evaluate how a fairness algorithm performs under varying degrees of dataset bias. Additionally, because FairfaceS creates bias from genuine attributes that can be linked in complicated ways, rather than from augmentations such as grayscaling an image, it allows for greater understanding of how a system may behave in a real-world scenario.

Balanced Test Set and Triplicate Experiments. When evaluating the fairness of a model it is important that the test set have a uniform distribution of (protected, target) pairs. If a particular pair is under- or over-sampled it will have a disproportionate impact on the results; e.g., if the (White, Male) pair is more prevalent in the test set then the accuracy on this pair will affect the average accuracy more. To ensure that our results do not include any bias toward a particular label pair, we select samples for the test set such that each (protected, target) label pair is included in equal numbers. We also observed that whilst the target classification performance is stable over different training and test splits, the fairness varies by a large degree. To ensure robust results we perform our experiments on three different train/test splits and report the mean and standard deviation. The exception is our experiments with CelebA, for which we use the official train, validation and test sets as this allows us to compare to previous work. The results for CelebA are reported over three different random initializations.
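
A minimal sketch of the balanced test-set selection, assuming integer label arrays; down-sampling every (protected, target) pair to the size of the rarest pair is our reading of the description.

```python
import numpy as np

def balanced_test_indices(targets, protected, seed=0):
    """Select indices so that every (protected, target) pair is equally represented."""
    targets, protected = np.asarray(targets), np.asarray(protected)
    rng = np.random.default_rng(seed)
    pairs = [(a, y) for a in np.unique(protected) for y in np.unique(targets)]
    n = min(np.sum((protected == a) & (targets == y)) for a, y in pairs)  # rarest pair
    chosen = [rng.choice(np.where((protected == a) & (targets == y))[0],
                         size=n, replace=False)
              for a, y in pairs]
    return np.concatenate(chosen)
```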

4.3 Implementation Details

For all experiments we use a ResNet-18 [13]. For experiments on UTKFace, Fairface and FairfaceS, models are initialised from weights pretrained on ImageNet-1k [7]. Models in the CelebA experiments are randomly initialised. More details about the exact training procedure can be found in the supplementary material. MFD and AD are implemented according to their original papers; however, we follow Jung et al. [16] and remove the gradient projection from the original work to increase the stability of training.

4.4 Classification Tasks

In this section we investigate the performance of our method on two tasks, age and gender classification.

Unbalanced Data. First we test the scenario in which the training data for the task is not balanced. This is the case for the majority of AI tasks unless special care has been taken during the creation of the dataset. For this experiment, we use the UTKFace and CelebA datasets. For UTKFace we use race as the protected attribute and test both age and gender as target attributes. For CelebA we use the Male attribute as the protected attribute and Attractive as the target attribute. The results are shown in Tables 1, 2 and 3, respectively. In both UTKFace scenarios all fairness methods improve fairness over a naïve classifier. The age classification task is harder than gender and is also much less fair, with the naïve classifier only achieving a \(\sigma _{\textrm{Acc}}\) of 8.5 compared to 3.0 for the gender task. In the highly unfair scenario with age as the target attribute, we observe that BASE achieves the best fairness for \(\sigma _{\textrm{Acc}}\) and \(\mathrm {DEO_{avg}}\), whilst achieving the highest overall accuracy. It is only outperformed on \(\mathrm {DEO_{max}}\) by MFD\(^\diamond \), which does so at a significantly lower overall accuracy. Whilst the data for the gender task is still unbalanced, we observe that the naïve classifier can already achieve a better level of fairness, leading us to believe this is a fairer task. For this task, BASE is competitive and achieves the second best accuracy and fairness. MFD achieves greater fairness; however, this comes at the expense of a lower overall accuracy. In the CelebA scenario BASE achieves the highest performance in all metrics. Again the fairness of the naïve classifier is low for this scenario, showing that the CelebA task is unfair. These experiments show that BASE works best in an environment that is particularly unfair.

Table 1. Comparison of methods on the UTKFace dataset with age as the target variable. Best results are bold and second best are underlined. Results marked \(\diamond \) are reported directly from [16].
Table 2. Comparison of methods on the UTKFace dataset with gender as the target variable. Best results are bold and second best are underlined.
Table 3. Comparison of methods on the CelebA dataset with attractive as the target variable. Best results are bold and second best are underlined. Results marked \(\diamond \) are reported directly from [21].

Balanced Data. Next, we test the scenario in which the training data for the task has been collected with a focus on ensuring that it is balanced with respect to the protected attribute. For this experiment, we use the Fairface dataset and race as the protected attribute. For the classification target attribute we test both age and gender. The results are shown in Tables 4 and 5, respectively.

Table 4. Comparison of methods on the Fairface dataset with age as the target variable. Best results are bold and second best are underlined.
Table 5. Comparison of methods on the Fairface dataset with gender as the target variable. Best results are bold and second best are underlined.

In these two scenarios, we observe that the fairness of the naïve classifier is already high due to the balanced nature of the data. For both target attributes, the naïve classifier with balanced sampling achieves the best fairness on two of the three metrics; however, this comes at the cost of accuracy for the age task. For both tasks BASE achieves the second best results for \(\sigma _{\textrm{Acc}}\), with the equal highest overall accuracy on the age task and the second best overall accuracy on the gender task.

Biased Data. Finally, we investigate how our method performs with an increasingly biased dataset. For this experiment we use the FairfaceS dataset (Sect. 4.2) with gender as the target variable. We evaluate a naïve classifier and BASE over a range of skew parameters and observe the effect on accuracy and fairness. The results are shown in Fig. 3.

Fig. 3. The accuracy and fairness of a naïve classifier and BASE over different skew parameters of the FairfaceS dataset. Error bars are the 95% confidence interval over 3-fold cross-validation.

We observe that, as one would expect, as the skew and consequently the bias in the dataset increase, both accuracy and fairness decay for both methods. Additionally, at low levels of bias, whilst BASE is able to increase the fairness of the classifier on all metrics, this comes at the cost of overall accuracy compared to the naïve classifier. However, as the skew increases, the accuracy of the naïve classifier decays at a greater rate than that of BASE; at extreme skew levels, BASE even achieves a higher overall accuracy. The same trend can be seen in the fairness metrics, with the performance of the naïve classifier decaying at a higher rate than that of BASE. BASE already produces a fairer predictor at low skew levels, and the performance gap only increases as the skew increases.

5 Conclusion

In this work, we introduce a new fairness objective based upon optimising the standard deviation of soft accuracy across the demographics of a protected attribute. Experimental results on UTKFace, CelebA, Fairface and FairfaceS show that our method produces fairer AI models for computer vision tasks under widely varying conditions, is particularly effective in more unfair scenarios, and can even improve overall accuracy compared to a naïve model on heavily biased datasets.