
1 Introduction

Data analytics has grown in popularity due to its ability to automate decision-making through machine learning. However, real-world data can contain biases that produce unfair outcomes, making fairness in data pipelines involving machine learning a pressing concern. Fairness in machine learning typically deals with intervening in algorithms so that they provide equitable outcomes regardless of protected characteristics such as gender, race, or age group.

The existing related works can be divided into three categories [5, 8, 20]. The first category comprises pre-processing methods, which aim to reduce bias in the data; examples include data augmentation and data balancing [2]. The second category comprises in-processing methods, which enforce fairness constraints during the training procedure [15]; examples include regularization techniques and constrained optimization [31]. The last category comprises post-processing methods, which improve fairness after training by correcting the outputs of the trained model [14].

The goal of this paper is to introduce a pre-processing method that achieves fairness by including generated data points. This is done by utilizing a statistical model that learns the distribution of the dataset, enabling the generation of synthetic samples. Additionally, a discrimination measure is employed to evaluate fairness when incorporating the generated data points. Our method treats the discrimination measure as a black box, so it can optimize any discrimination measure defined by the user. We refer to this property of our algorithm as fairness-agnostic. This makes it suitable for cases where a specific fairness notion is required.

For the experimentation, multiple datasets known to be discriminatory were used. The experiments were performed by first loading the datasets and then pre-processing them using different pre-processing techniques. The pre-processed datasets were then fed into several classifiers, and each classifier was evaluated in terms of prediction performance and fairness to assess the effectiveness of the pre-processing methods. Our experiments show empirically that our technique effectively lessens discrimination without sacrificing the classifiers' prediction quality. Moreover, it is compatible with any machine learning model. Of the other pre-processors tested, none met all of these conditions. The scope of our method is not limited to tabular data and classification tasks, even though the experiments were conducted on them. The method is more broadly suitable for supervised learning tasks where the data, label, and protected attribute are available; only an appropriate discrimination measure has to be derived for the task at hand. In summary, our primary contributions are:

  • The introduction of a novel pre-processing technique that can optimize any given fairness metric by pre-selecting generated data points to include in the new, fair dataset.

  • We carry out a comprehensive empirical study, comparing our method against three widely recognized pre-processors [9, 13, 31], using multiple datasets commonly found in fairness literature.

  • We present further valuable properties of our method, such as empirical evidence that it consistently improved fairness compared to the unprocessed data.

2 Related Work

Many pre-processing algorithms in the literature alter the dataset to achieve fairness [4, 9, 31]. Because these methods simply return a fair dataset, they can be used with any estimator. However, such approaches are not easy to use: they often require a parameter that controls how aggressive the change should be. As the approaches differ in their methodology, it is hard to interpret this parameter and its unexpected effects on the data. Data alteration methods also carry a higher risk of producing data that no longer resembles the original data distribution.

Other approaches return a weight for each sample in the dataset that the estimator should account for when fitting the data [1, 13]. While these approaches seem promising [1, 13], they require estimators that can handle sample weights. A workaround is to replicate samples according to their weights; however, this is not computationally scalable for larger datasets or for larger differences between the sample weights.

Another related approach is removing data samples that influence estimators in a discriminatory way [28]. Nevertheless, this approach does not seem feasible for smaller datasets.

Differently from related works, we present an algorithm that does not come with the above-mentioned drawbacks. Further, our approach can satisfy any fairness notion that is defined for measuring discrimination or bias in the dataset. While the work of Agarwal et al. [1] also features this property, its fairness definitions must be formalizable as linear inequalities on conditional moments. In contrast, our work only requires the fairness definitions to quantify discrimination on a numeric scale where lower values indicate less discrimination. This can be as simple as calculating the difference of probabilistic outcomes between groups.

While there exist works that train fair generative models to produce data that is fair with respect to the protected attribute for images [7, 24, 27] or tabular data [12, 23], our approach can be seen as a framework that employs generative models and can therefore be used for any data where the protected attribute is accessible. Specifically, our research question is not “How can fair generative models be constructed?”; instead, we deal with the question “Using any statistical or generative model that learns the distribution of the dataset, how can the samples drawn from the distribution be selected and then included in the dataset such that fairness can be guaranteed?”. Other works that generate data for fairness include generating counterfactuals [26] and generating pseudo-labels for unlabeled data [6].

3 Measuring Discrimination

In this section, we briefly present discrimination measures that assess the fairness of data. For that, we make use of the following notation [5, 8, 20]: A data point or sample is represented as a triple \((x, y, z)\), where \(x \in X\) is the feature, \(y \in Y\) is the ground truth label indicating favorable or unfavorable outcomes, and \(z \in Z\) is the protected attribute, which is used to differentiate between groups. The sets \(X, Y, Z\) typically hold numeric values and are defined as \(X = \mathbb {R}^d\), \(Y=\{0, 1\}\), and \(Z = \{1, 2, \ldots , k\}\) with \(k \ge 2\). For simplicity, we consider the case where protected attributes are binary, i.e., \(k=2\). Following the preceding notation, a dataset is defined as the set of data points, i.e., \(\mathcal {D} = \{(x_i, y_i, z_i)\}_{i=1}^n\). Machine learning models \(\phi : X \times Z \rightarrow Y\) are trained using these datasets to predict the target variable \(y \in Y\) based on the input variables \(x \in X\) and \(z \in Z\). We call the output \(\hat{y}:=\phi (x, z)\) the prediction.

Based on the work of [32], we adapt discrimination measures to the needs of the pre-processing method in this paper. For our algorithm to work, a discrimination measure must satisfy certain properties, which we introduce in the following.

Definition 1

A discrimination measure is a function \(\psi : \mathbb {D} \rightarrow \mathbb {R}^+\), where \(\mathbb {D}\) is the set of all datasets, satisfying the following axioms:

  1. The discrimination measure \(\psi (\cdot )\) is bounded by [0, 1]. (Normalization)

  2. Minimal and maximal discrimination are captured by \(\psi (\cdot )\) with the values 0 and 1, respectively.

The first and second axioms together ensure that minimal or maximal discrimination can be assessed by the measure. Furthermore, the normalization makes it possible to evaluate the amount of bias present and its proximity to the optimal solution. As achieving no discrimination, i.e., \(\psi (\mathcal {D}) = 0\), is not always possible, we consider lower discrimination as better and define the fairer of two datasets as the one with the lower discrimination measure.

The literature [2, 5, 8, 19, 20, 32] on fairness-aware machine learning classifies fairness notions as representing either group or individual fairness. We subdivide the most relevant fairness notions into two categories, dataset notions and prediction notions, and derive discrimination measures from them as suggested by [32]. From now on, we denote \(x, y, z\) as random variables describing the events of observing an individual from a dataset \(\mathcal {D}\) taking specific values.

Dataset notions typically demand the independence of two variables. When the protected attribute and the label of a dataset are independent, the dataset is considered fair because the protected attribute does not influence or determine the label. One way to measure such dependence is the normalized mutual information (NMI) [29], where independence holds if and only if the score is zero. Because it is normalized, as the name suggests, it is a discrimination measure.

Definition 2

(Normalized mutual information). Let \(H(\cdot )\) be the entropy and \(I(y; z)\) be the mutual information [25]. The normalized mutual information score is defined as follows [30]:

$$\begin{aligned} \psi _{NMI}(\mathcal {D}) = 2\frac{I(y; z)}{H(y) + H(z)}. \end{aligned}$$
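As an illustration, \(\psi_{NMI}\) can be computed directly from a dataset's label and protected-attribute columns. The sketch below assumes a pandas DataFrame with columns named y and z (illustrative names, not from the paper) and uses scikit-learn's normalized_mutual_info_score, whose arithmetic averaging corresponds to the formula above:

```python
# Minimal sketch of the NMI-based discrimination measure (column names are assumptions).
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

def psi_nmi(D: pd.DataFrame, label_col: str = "y", protected_col: str = "z") -> float:
    # 2 * I(y; z) / (H(y) + H(z)); 0 indicates independence, values lie in [0, 1].
    return normalized_mutual_info_score(
        D[label_col], D[protected_col], average_method="arithmetic"
    )
```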

Statistical parity [15, 31] and disparate impact [9] are similar notions that also demand independence, except that they are specifically designed for binary variables. Kang et al. [16] proved that zero mutual information is equivalent to statistical parity. To translate statistical parity into a discrimination measure, we make use of differences, similarly to Žliobaitė [32].

Definition 3

(Statistical parity). Demanding that each group has the same probability of receiving the favorable outcome is statistical parity, i.e.,

$$\begin{aligned} p(y=1 \mid z=1) = p(y=1 \mid z=0). \end{aligned}$$

Because we want to minimize discrimination towards any group, we measure the absolute difference between the two groups to assess the extent to which the dataset fulfills statistical parity. This is also known as (absolute) statistical disparity (SDP) [8]. A value of 0 indicates minimal discrimination:

$$\begin{aligned} \psi _{SDP}(\mathcal {D}) = |p(y=1 \mid z=1) - p(y=1 \mid z=0)|. \end{aligned}$$
(1)
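A minimal sketch of \(\psi_{SDP}\) (1) for binary y and z follows; the DataFrame column names are illustrative assumptions:

```python
# Sketch: absolute statistical disparity (Eq. 1) on a pandas DataFrame.
import pandas as pd

def psi_sdp(D: pd.DataFrame, label_col: str = "y", protected_col: str = "z") -> float:
    # P(y=1 | z=1) and P(y=1 | z=0), compared by their absolute difference.
    p_priv = (D.loc[D[protected_col] == 1, label_col] == 1).mean()
    p_unpriv = (D.loc[D[protected_col] == 0, label_col] == 1).mean()
    return abs(p_priv - p_unpriv)
```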

Because disparate impact [9] essentially demands the same as statistical parity but involves a ratio, dividing by zero is a potential issue. Therefore, we disregard its use [32]. Note that dataset notions can also be applied to measure fairness on predictions by exchanging the data label with the prediction label.

Parity-based notions, fulfilling the separation or sufficiency criterion [2], require both prediction and ground truth labels to evaluate fairness. In contrast to the previous category, measuring solely on datasets is not possible here. It is nevertheless essential to evaluate such measures to account for algorithmic bias. Here, the discrimination measure takes an additional argument, namely the prediction label \(\hat{y}\) as a random variable. Corresponding fairness notions are, for example, equality of opportunity [10], predictive parity [2], and equalized odds [2].

Definition 4

(Equalized odds). Equalized odds is defined over the satisfaction of both equality of opportunity and predictive parity [10],

$$\begin{aligned} p(\hat{y}=1 \mid y=i, z=1) = p(\hat{y}=1 \mid y=i, z=0) \quad \forall i \in \{0, 1\}, \end{aligned}$$

where equality of opportunity corresponds to the case \(i=1\) and predictive parity to the case \(i=0\). Making use of the absolute difference, analogously to SDP (1), we denote the measure of equality of opportunity as \(\psi _{EO}(\mathcal {D},\hat{y})\) and that of predictive parity as \(\psi _{PP}(\mathcal {D},\hat{y})\).

To turn equalized odds into a discrimination measure, we can calculate the average of the absolute differences for both equality of opportunity and predictive parity. This is referred to as average odds error [3]:

$$\begin{aligned} \psi _{ODDS}(\mathcal {D},\hat{y}) = \frac{\psi _{EO}(\mathcal {D},\hat{y}) + \psi _{PP}(\mathcal {D},\hat{y})}{2}. \end{aligned}$$
(2)
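The prediction-based measures can be sketched in the same spirit; the snippet below computes \(\psi_{EO}\), \(\psi_{PP}\), and \(\psi_{ODDS}\) (2) from arrays of ground truth labels, predictions, and protected attributes (argument names and the handling of empty groups are assumptions):

```python
# Sketch: equality of opportunity, predictive parity, and average odds error (Eq. 2).
import numpy as np

def group_rate(y, y_hat, z, label_value, group):
    # P(y_hat = 1 | y = label_value, z = group); returns 0.0 if the group is empty (assumption).
    mask = (y == label_value) & (z == group)
    return float(y_hat[mask].mean()) if mask.any() else 0.0

def psi_odds(y: np.ndarray, y_hat: np.ndarray, z: np.ndarray) -> float:
    psi_eo = abs(group_rate(y, y_hat, z, 1, 1) - group_rate(y, y_hat, z, 1, 0))  # case i = 1
    psi_pp = abs(group_rate(y, y_hat, z, 0, 1) - group_rate(y, y_hat, z, 0, 0))  # case i = 0
    return (psi_eo + psi_pp) / 2
```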

4 Problem Formulation

Intuitively, the goal is to add a number of synthetic data points to the original data to achieve minimal discrimination. With the right discrimination measure chosen, it can be ensured that the unprivileged group gets more exposure and representation in receiving the favorable outcome. Still, the synthetic data should resemble the distribution of the original data. The problem can be stated formally as follows: Let \(\mathcal {D}\) be a dataset with cardinality n, and let \(\tilde{n}\) be the number of samples to be added to \(\mathcal {D}\). The goal is to find a set of data points \(\mathcal {S} = \{d_1, d_2, \ldots , d_{\tilde{n}}\}\) with \(\mathcal {S} \sim P(\mathcal {D})\) that, when added to the dataset, i.e., \(\mathcal {D} \cup \mathcal {S}\), minimizes the discrimination function \(\psi (\mathcal {D} \cup \mathcal {S})\). Hence, we consider the following constrained problem:

$$\begin{aligned} \min&\quad \psi (\mathcal {D} \cup \mathcal {S}) \nonumber \\ \text {subject to}&\quad \mathcal {S} \sim P(\mathcal {D}) \nonumber \\&\quad |\mathcal {S}| = \tilde{n}. \end{aligned}$$
(3)

The formulation in (3) requires that the samples \(d_i\) added to the dataset \(\mathcal {D}\) are drawn from \(P(\mathcal {D})\). To draw from \(P(\mathcal {D})\), a statistical or generative model \(P_G\) that learns the data distribution can be used. Therefore, generating data samples and mitigating bias are treated as sequential tasks, where the former can be solved by methods from the literature [22]. Because the discrimination measure \(\psi \) can be of any form, the optimization objective is treated as a black box and solved heuristically.

5 Methodology

Our algorithm relies on a statistical model, specifically the Gaussian copula [22], to learn the distribution \(P(\mathcal {D})\) of the given dataset. A Gaussian copula captures the dependence between variables using a Gaussian structure. While the dependence is assumed Gaussian, the individual distributions of the variables can be any continuous distribution, providing flexibility in modeling the data.
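For concreteness, one way to obtain such a model for tabular data is the Gaussian copula synthesizer from the SDV library; the sketch below assumes the SDV 1.x API and is not the paper's own implementation:

```python
# Sketch: fitting a Gaussian copula to learn P(D) and drawing candidate samples (assumes the SDV library).
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

def fit_generative_model(D: pd.DataFrame) -> GaussianCopulaSynthesizer:
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(D)   # infer column types from the data
    model = GaussianCopulaSynthesizer(metadata)
    model.fit(D)
    return model

# Drawing m candidate samples from the learned distribution:
# candidates = fit_generative_model(train_df).sample(num_rows=5)
```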

Still, the model used for this task can be chosen by the user as long as it can sample from \(P(\mathcal {D})\). Because discrimination functions are treated as black boxes, the algorithm does not require the derivatives of \(\psi \), and optimizing in this way leads to our desired fairness-agnostic property: the method is suitable for any fairness notion that can be expressed as a discrimination function. Our method handles the size constraint in Eq. (3) as an upper bound, where at most \(\tilde{n}\) samples are added to \(\mathcal {D}\).

Algorithm 1: fairness-agnostic pre-processing (pseudocode)

Our method, outlined in Algorithm 1, begins by initializing \(\hat{\mathcal {D}}\) with the biased dataset \(\mathcal {D}\). Then \(\hat{n}\) is set to a multiple \(r > 1\) of the original dataset's size. Lastly in the initialization, the distribution \(P(\mathcal {D})\) is learned by a generative model \(P_G\). The algorithm then draws m samples from the generative model \(P_G\), which are referred to as the set of candidates C. The next step is decisive for the optimization (Line 9): the candidate that reduces the discrimination the most when included in the dataset \(\hat{\mathcal {D}}\) is added to \(\hat{\mathcal {D}}\). The steps of drawing samples and adding the best candidate are repeated until \(\hat{\mathcal {D}}\) has a cardinality of \(\hat{n}\) or the discrimination does not exceed the fairness threshold \(\epsilon \). Because \(\epsilon \) is set to 0 by default, the algorithm can stop before the dataset reaches its requested size if the discrimination cannot be reduced any further, i.e., \(\psi (\hat{\mathcal {D}}) = 0\). Because calculating \(\psi (\hat{\mathcal {D}} \cup \{c\})\) (Line 9) does not involve retraining any classifier and solely evaluates the dataset, this step is fast in practice. In our implementation, we generate a set of synthetic data points prior to the for-loop, eliminating the sampling cost during the optimization step. We refer to Appendix A for the proof outlining the polynomial time complexity of the presented method.
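To make the loop concrete, the following is a minimal Python sketch of the procedure described above; the function and parameter names (fair_augment, sample_fn) are illustrative rather than taken from the paper's implementation, and psi can be any of the measures defined in Sect. 3:

```python
# Sketch of Algorithm 1: greedy inclusion of candidates under a black-box discrimination measure.
from typing import Callable
import pandas as pd

def fair_augment(
    D: pd.DataFrame,
    psi: Callable[[pd.DataFrame], float],       # black-box discrimination measure
    sample_fn: Callable[[int], pd.DataFrame],   # draws candidates from the fitted model P_G
    r: float = 1.25,                            # multiplicative target size (> 1)
    m: int = 5,                                 # candidates drawn per iteration
    eps: float = 0.0,                           # fairness threshold
) -> pd.DataFrame:
    D_hat = D.copy()
    n_hat = int(r * len(D))                     # requested cardinality of the augmented dataset
    while len(D_hat) < n_hat and psi(D_hat) > eps:
        candidates = sample_fn(m)
        # Line 9: pick the candidate whose inclusion yields the lowest discrimination.
        best = min(
            (candidates.iloc[[i]] for i in range(len(candidates))),
            key=lambda c: psi(pd.concat([D_hat, c], ignore_index=True)),
        )
        D_hat = pd.concat([D_hat, best], ignore_index=True)
    return D_hat
```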

6 Evaluation

To evaluate the effectiveness of the presented method against other pre-processors in ensuring fairness in the data used to train machine learning models, we aim to answer the following research questions:

  • RQ1 What pre-processing approach can effectively improve fairness while maintaining classification accuracy, and how does it perform across different datasets?

  • RQ2 How stable are the performance and fairness results of classifiers trained on pre-processed datasets?

  • RQ3 How does pursuing statistical parity, a dataset-based notion, affect a prediction-based notion such as the average odds error?

  • RQ4 Is the presented method fairness-agnostic as stated?

To address the first three research questions in particular, which deal with effectiveness and stability, we adopted the following experimental methodology: We examined our approach against three pre-processors on four real-world datasets (see Table 1). The pre-processors we compare against are Reweighing [13], Learning Fair Representations [31] (LFR), and Disparate Impact Remover [9] (DIR). The data were prepared such that categorical features were one-hot encoded and rows containing empty values were removed. We selected sex, age, race, and foreign worker as the protected attributes for the respective datasets. Generally, the data preparation was adopted from AIF360 [3].

Table 1. Overview of datasets.

All hyperparameter settings of the pre-processors were kept at the defaults of the implementation provided by AIF360 [3]. For LFR, we empirically had to lower the hyperparameter that weights the fairness objective; its initial value was set so high that it led to identical predictions for all data points. For our approach, we set \(r=1.25\), which returns a dataset containing up to 25% additional samples relative to the dataset's initial size. The discrimination measure chosen was the absolute statistical disparity (1), which all other methods also optimize for. Further, we set \(m = 5\) and \(\epsilon =0\) as shown in Algorithm 1.
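As a point of reference, the competing pre-processors were invoked through AIF360; a hedged sketch of applying Reweighing to an AIF360 dataset object is shown below (group definitions and column names are placeholders, and the data frame is assumed to be fully numeric, i.e., already one-hot encoded):

```python
# Sketch: applying AIF360's Reweighing pre-processor (illustrative group/column names).
from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing import Reweighing

def reweigh(train_df, label_col="y", protected_col="sex"):
    dataset = BinaryLabelDataset(
        df=train_df,                                  # numeric, one-hot encoded data
        label_names=[label_col],
        protected_attribute_names=[protected_col],
        favorable_label=1,
        unfavorable_label=0,
    )
    rw = Reweighing(
        unprivileged_groups=[{protected_col: 0}],
        privileged_groups=[{protected_col: 1}],
    )
    return rw.fit_transform(dataset)   # dataset with per-sample instance_weights
```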

The experimental methodology for a single dataset is visualized in Fig. 1 as a pipeline. The given dataset is first split into a training (80%) and test set (20%). The training set is then passed to the available pre-processors. Afterwards, all debiased data are used to train several classifiers. We employed three different machine learning algorithms, namely k-nearest neighbors (KNN), logistic regression (LR), and decision tree classifier (DT), to analyze the pre-processed datasets and the original, unprocessed dataset for comparison. The unprocessed dataset is referred to as the baseline. Finally, performance and fairness are evaluated on the predictions for the test set. It is noteworthy that the test sets were left untouched to demonstrate that, by pre-processing the training data, unbiased results can be achieved in the prediction space even without performing bias mitigation on the test data. For stability reasons (and to address RQ2), we used Monte Carlo cross-validation to shuffle and split each dataset; this was repeated 10 times for all datasets, and the results from these runs set the performance-fairness baseline. While our optimization focuses on SDP, we address RQ3 by assessing the average odds error. To answer RQ4, we refer to Sect. 6.2 for the experimentation and discussion.
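A condensed sketch of one Monte Carlo repetition of this pipeline, for a single pre-processor and a logistic regression classifier, may look as follows (the preprocess argument, column names, and helper names are illustrative assumptions):

```python
# Sketch: one Monte Carlo repetition of the evaluation pipeline (illustrative names).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def run_once(df: pd.DataFrame, preprocess, label_col="y", protected_col="z", seed=0):
    train, test = train_test_split(df, test_size=0.2, random_state=seed, shuffle=True)
    train_fair = preprocess(train)                     # our method or a competing pre-processor
    feature_cols = [c for c in df.columns if c != label_col]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_fair[feature_cols], train_fair[label_col])
    y_hat = clf.predict(test[feature_cols])            # the untouched test set is used as-is
    auc = roc_auc_score(test[label_col], clf.predict_proba(test[feature_cols])[:, 1])
    z_test = test[protected_col].to_numpy()
    sdp = abs(y_hat[z_test == 1].mean() - y_hat[z_test == 0].mean())
    return auc, sdp

# Repeating run_once with seeds 0..9 yields the Monte Carlo cross-validation results.
```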

Fig. 1. Experimental methodology visualized as a pipeline.

6.1 Comparing Pre-processors

Table 2 presents the performance-fairness test results of the pre-processors on different datasets (RQ1). For discrimination, the table displays the SDP and average odds error of the predictions on the test sets. To assess the classifiers' performance, we used the area under the receiver operating characteristic curve (AUC). An estimator that guesses classes randomly would produce an AUC score of 0.5; higher scores imply better prediction performance. Means and standard deviations of the Monte Carlo cross-validation results are also displayed to evaluate robustness (RQ2). We note that all classifiers except KNN were able to handle sample weights in training, which are required for Reweighing. Therefore, Reweighing was not able to mitigate bias for KNN and performed only as well as the baseline, in contrast to the other approaches, including ours.

Because all pre-processors aim to reduce statistical disparity (or an equivalent formulation), we compare the SDP scores between the pre-processors: In most cases, our approach produced Pareto-optimal solutions with respect to both SDP and AUC. Generally, only Reweighing and our approach appear to consistently improve fairness without sacrificing notable prediction power. In direct comparison, LFR improved fairness the most across all experiments, but at the same time sacrificed the prediction quality of all classifiers to such an extent that the predictions become essentially useless. In experiments where LFR attained standard deviations of 0 across all scores (Table 2b, 2d), we investigated the pre-processed data and found that LFR had modified almost all labels to a single value. As a result, the estimators were unable to classify the data effectively, as they predicted only one outcome. The results of DIR are very inconsistent. DIR sometimes even worsens fairness, as seen in the COMPAS and German datasets, where SDP and average odds error increased in most settings. This happens when the existing discrimination against the unprivileged group is over-corrected, leading to discrimination against the privileged group. If the discrimination measure is defined such that it does not distinguish between privileged and unprivileged groups (as in this paper), such reverse discrimination cannot mistakenly be introduced by our approach. This additional property renders our method more suitable for responsible use cases.

When comparing the average odds error rates (RQ3), our approach successfully reduced algorithmic bias without explicitly aiming for it in nearly all experiments. An increase in the mean average odds error rate, albeit negligible, was observed only when training DT on the Banking data and LR on the German dataset. In all other ten model and dataset configurations, our approach reduced the error rate without particularly optimizing for it. This can be expected in practice, as the independence of the label from the protected attribute (SDP) is a sufficient condition for average odds.

Table 2. The tables display each classifier's mean test performance and discrimination when trained on different pre-processed training sets. The best performing statistic for each classifier is marked in bold. Minimal standard deviations are marked in bold, too. All values displayed are percentages.

6.2 Investigating the Fairness-Agnostic Property

To demonstrate the fairness-agnostic property of our algorithm (RQ4), we evaluated our method against the baseline dataset on multiple measures and examined whether the respective objective was improved (see Fig. 2). The COMPAS dataset was used for this experiment. The chosen objectives are the absolute value of Pearson's \(\rho \), NMI (Definition 2), and the objective of disparate impact (DI) as given by [9]. All other experimental settings remained as described previously, except that the other pre-processing methods were not used.
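To illustrate how such alternative objectives plug in, the absolute Pearson correlation can be wrapped as another black-box measure and passed to the algorithm unchanged (a sketch under the same naming assumptions as the earlier snippets):

```python
# Sketch: the absolute value of Pearson's rho as another black-box discrimination objective.
import pandas as pd
from scipy.stats import pearsonr

def psi_abs_pearson(D: pd.DataFrame, label_col: str = "y", protected_col: str = "z") -> float:
    rho, _ = pearsonr(D[label_col], D[protected_col])
    return abs(rho)

# Passed to the method exactly like psi_sdp or psi_nmi, e.g.:
# D_fair = fair_augment(D, psi=psi_abs_pearson, sample_fn=sampler)
```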

Fig. 2. Results of optimizing different discrimination objectives with our method on the COMPAS dataset. Objectives are ordered by columns, classifiers by rows. The y-axis displays AUC.

It can be observed that all discrimination measures were lowered significantly. Generally, our method was able to optimize any of these fairness notions, as evidenced here and in Sect. 6.1. It was even able to outperform algorithms that were specifically designed for a single metric, demonstrating its adaptability.

7 Conclusion

Machine learning can be misused if estimators are trained on data that is biased against certain demographic groups. This can have a severely negative impact on the decisions made and on the groups being discriminated against.

The pre-processing method presented in this work is a sampling-based optimization algorithm that first uses a statistical model to learn the distribution of the given dataset, then samples points from this distribution, and determines which of them to add to the data to minimize discrimination. This process continues until the predefined criteria set by the user are satisfied. The method can optimize any discrimination measure, as the measure is treated as a black box, making the method accessible for a wider range of use cases.

The results of our experiments demonstrate that our technique is reliable and significantly reduces discrimination while not compromising accuracy. Although a few other methods performed similarly in some experiments, they were not compatible with certain estimators or even added bias to the original data. Because fairness improved across the experiments while our method only adds samples, this indicates that representativeness can be achieved with our method. Our research underscores the importance of addressing bias in data, and we hope to contribute to addressing such concerns in data analytics and knowledge discovery applications.

8 Discussion and Future Work

The results of our approach demonstrate that it is possible to achieve fairness in machine learning models using generated data points. Despite our approach showing promise, it is important to acknowledge that our results rely heavily on the quality of the statistical model used to generate synthetic data. For tabular data, Gaussian copula [22] seems to be a good choice.

In future work, we aim to explore the potential of our method for making pre-trained models fairer. While retraining large models on debiased datasets may not always be cost-effective, our approach allows using the generated data to fine-tune a model for fairness, which provides a more efficient alternative.

Additionally, our evaluation deals with datasets where the protected attribute is a binary variable, which leaves some use cases untreated. Neglecting to recognize non-binary groups can lead to overlooking those who are most in need of attention. Similarly, future research could address multiple protected attributes at the same time to ensure that no protected group is disadvantaged. Previous studies have touched on this subject [1, 4, 32], but we hope to reformulate these issues as objectives that work with our approach.