1 Introduction

Automated decision systems, which are fundamental to many of our daily activities, enhance our experiences through personalized recommendations in areas like movies, products, and even potential dating partners. These systems, driven by machine learning (ML) algorithms, are adept at identifying patterns in extensive datasets. Unlike humans, machines do not tire or lose interest, and they can process a significantly larger number of variables [1]. However, similar to human decision-making, these algorithms can exhibit biases, potentially leading to unfair outcomes [2]. Such biases often mirror human-like semantic prejudices, especially when processing data related to human outcomes [3], and can lead to decisions that disproportionately benefit certain groups, thereby raising substantial ethical concerns [4].

Bias is commonly understood as a preference or prejudice for or against a specific thing, person, or group, often in an unfair way [5, 6]. Examples include biases related to gender, race, other demographic characteristics, or sexual orientation. The aim of fairness is to detect and mitigate the effects of these biases [7], ensuring that machine learning systems do not reinforce existing human and societal biases or introduce new ones.

Reflect on the pervasive influence of algorithmic biases, which subtly yet significantly shape outcomes in ways that often go unnoticed until scrutinized. Many examples from various sectors highlight this issue. The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) system, used in U.S. courts, has demonstrated racial biases in its risk assessments. A well-known healthcare algorithm also showed significant racial biases in its decision-making [8]. Amazon’s hiring algorithm was found to favor men, indicating a gender bias. Additionally, Facebook’s targeted housing advertisements were implicated in discriminatory practices based on race and color. These cases underline how deep-seated biases in algorithms can lead to unfair outcomes across a range of applications.

Natural Language Processing (NLP), as a branch of artificial intelligence, also encounters biases in its applications. These biases in textual data are a widespread and ingrained problem, often originating from cognitive biases that shape our conversations, perspectives, and comprehension of information [9]. This bias can manifest explicitly, as seen in language that discriminates against specific racial or ethnic groups [10], commonly found in social media content. Implicit bias [11], however, operates more subtly, reinforcing prejudices through unintentional language choices, yet it is also detrimental. The need for unbiased and reliable text data has intensified across various fields, including healthcare [12] and social media [13].

Such data is crucial for training NLP models that perform a range of downstream tasks, such as generating news recommendations. These news recommenders frequently inherit biases from their underlying data, which can influence the beliefs and behaviors of news consumers [10, 14]. For instance, research [13] demonstrates that offering unbiased news to users helps to broaden their understanding of societal issues. Exposure to news that incorporates biased language can influence users’ perceptions about specific demographic groups or the stories themselves. Therefore, our project aims to deliver news with reduced bias.

A key contribution of this research is the development of a comprehensive framework for detecting and mitigating bias in text data, particularly in news. The specific contributions of this work are outlined as follows:

  1. We introduce FairFrame (Fairness Framework), a framework specifically designed to detect and mitigate bias within textual content, such as news articles.

  2. We develop a bias detection module utilizing state-of-the-art transformer models. This module demonstrates superior performance in identifying textual biases compared to existing benchmarks.

  3. Our framework integrates an explainable AI component based on LIME, which provides clear and interpretable insights into the decisions made by our bias detection module, thereby enhancing transparency.

  4. We pioneer the use of large language models (LLMs) for bias mitigation through tailored few-shot prompting techniques. To our knowledge, this is the first instance of employing LLMs specifically for the mitigation of bias in text.

  5. We conduct comprehensive experiments to evaluate the effectiveness of FairFrame against other leading-edge fairness methodologies. Additionally, we assess the performance of each individual component within FairFrame across various experimental setups to ascertain their efficacy and impact.

The rest of this paper is organized as follows: Sect. 2, “Related Work”, provides an overview of previous studies on bias detection and mitigation. Section 3, “FairFrame: A Fairness Framework for Bias Detection and Mitigation in News”, outlines our bifurcated approach, introducing the detection and mitigation modules. Section 4, “Experiments”, details the experimental design. Section 5, “Results”, presents the findings of the experiments. Section 6, “Discussion”, delves into the implications of these findings. Finally, Sect. 7, “Conclusion and future works”, summarizes the study’s major insights and outlines the future directions for research.

2 Related works

In this section, we aim to gain insights into related works on bias detection and mitigation, initially in AI broadly and then specifically in NLP. Finally, we will introduce few-shot prompting techniques for LLMs, as these form the foundation of our bias mitigation module.

2.1 Fairness algorithms

In the study of fairness within AI and ML [15], algorithms designed to reduce bias are generally classified into three main categories: (1) pre-processing algorithms, (2) in-processing algorithms, and (3) post-processing algorithms.

2.1.1 Pre-processing algorithms

Pre-processing algorithms aim to address biases in datasets related to sensitive attributes such as race, gender, caste, or religion before training begins. These methods strive to preserve the data’s integrity while ensuring fairness.

A key technique is the reweighting algorithm, which adjusts the weights of training samples to balance group representation without changing the actual data features or labels, as highlighted in [16]. The Learning Fair Representations algorithm, detailed in [17], creates new data representations that mask protected attributes to prevent bias in decision-making processes. Another approach, the Disparate Impact Remover, modifies feature values to promote group fairness while maintaining the internal rank order within each group [18]. Lastly, the Optimized Pre-processing algorithm employs a probabilistic transformation of both features and labels to ensure both individual and group fairness, as described in [19].
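To make the reweighting idea concrete, the following is a minimal sketch (not the exact algorithm of [16]) that computes per-sample weights so that a protected group attribute and the label become statistically independent; the pandas-based interface and column names are illustrative assumptions.

```python
import pandas as pd

def reweighing_weights(df: pd.DataFrame, group_col: str, label_col: str) -> pd.Series:
    """Per-sample weights w(g, y) = P(g) * P(y) / P(g, y), which balance group
    representation without altering features or labels (reweighting idea of [16])."""
    n = len(df)
    p_group = df[group_col].value_counts() / n               # P(group)
    p_label = df[label_col].value_counts() / n               # P(label)
    p_joint = df.groupby([group_col, label_col]).size() / n  # P(group, label)

    def weight(row):
        g, y = row[group_col], row[label_col]
        return (p_group[g] * p_label[y]) / p_joint[(g, y)]

    return df.apply(weight, axis=1)

# Under-represented (group, label) combinations, e.g. unprivileged samples with
# positive labels, receive weights above 1 and are emphasized during training.
```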

2.1.2 In-processing algorithms

In-processing algorithms are pivotal in integrating fairness directly during the model training phase. These techniques modify the model’s loss function to embed fairness into its core operations, addressing biases efficiently [20, 21].

A prominent method, the Prejudice Remover, adds a discrimination-aware regularization term to the learning objective, significantly reducing biased predictions based on sensitive attributes [20]. The Adversarial De-biasing algorithm introduces a dual strategy: training a primary classifier for accuracy and an adversarial model to obscure protected attributes, minimizing bias in predictions [22]. Additionally, the Exponentiated Gradient Reduction algorithm treats fair classification as a series of cost-sensitive problems, resulting in a randomized classifier that balances accuracy and fairness constraints [23]. The Meta Fair Classifier provides a tailored approach by optimizing a classifier based on a specified fairness metric, allowing customization of fairness goals to suit specific definitions and needs [21].

2.1.3 Post-processing algorithms

Post-processing algorithms are designed to mitigate biases in model outputs after the training phase, offering the advantage of applicability to existing classifiers without the need for retraining.

A key example is the Reject Option Classification algorithm, which adjusts decisions to benefit historically disadvantaged groups and is particularly useful in contexts such as employment [24]. The Equalized Odds algorithm uses linear programming to modify output labels to achieve fairness across different groups by equalizing true and false positive rates [25]. Another approach, the Calibrated Equalized Odds algorithm, optimizes the model’s score outputs to align with fairness objectives, balancing accuracy and fairness [26]. These methods typically require access to protected attributes to adjust outputs accordingly, ensuring that final model predictions do not perpetuate biases. Post-processing is a practical solution for enhancing fairness in AI systems, especially when retraining is not an option.

In addition to these methods, the software engineering community has developed tools like FairML [27], FairTest [28], Themis-ml [29], and AIF360 [5].

2.2 Detecting bias in NLP

Detecting and mitigating bias in NLP is crucial due to its widespread use across various applications [3]. Biases in NLP can manifest as unfair discrimination, often reflecting societal and cultural prejudices encoded in the training datasets [30, 31]. Such biases may not only skew NLP outputs but also reinforce harmful stereotypes [32, 33].

Researchers have developed methods to detect and correct biases in NLP. These include statistical techniques to identify biased patterns in data [34] and innovative approaches using advanced machine learning to explore different aspects of bias, such as gender, race, and disability [35,36,37]. Notably, efforts have been made to debias word embeddings and to mitigate attribute bias in tasks like natural language inference [38, 39]. Moreover, emerging research has expanded the understanding of bias beyond simple demographic factors, investigating how biases related to race, gender, disability, nationality, and religion are replicated in NLP models [40,41,42]. Tools like Perturbation Analysis and StereoSet have been developed to measure these biases systematically [43, 44]. Identifying and addressing these biases is essential for the development of fairer and more inclusive NLP technologies, as biases can lead to social harm by fostering prejudices and perpetuating stereotypes [32, 45, 46].

2.3 Few-shot prompting

The training of LLMs on massive datasets improves their performance in line with scaling laws [47]. This development has introduced a new method in NLP called prompt engineering, aimed at efficiently using the vast knowledge stored in these models [48]. Various strategies for crafting prompts have been introduced, aiming to steer model utility across differing research domains [49]. The advent of LLMs like GPT-3 and ChatGPT has popularized prompt-based techniques for an array of tasks. Broadly, there are two main approaches:

Zero-shot Prompting: Zero-shot prompting, using well-crafted prompts without example inputs, has proven highly effective, with GPT models excelling in tasks like data extraction, often outperforming traditional models [50]. In healthcare, the DeIDGPT system uses precision-engineered prompts on platforms like ChatGPT for privacy-preserving medical data summarization, achieving superior results [51]. Additionally, ChatAug, a method for augmenting data on ChatGPT, has been shown to surpass other approaches, highlighting the importance of domain expertise and suggesting fine-tuning strategies for further research [52]. Studies on manual prompting have also enhanced translation tasks, demonstrating the significant impact of well-defined prompts [53]. Similarly, HealthPrompt employs various prompt structures to improve zero-shot learning in clinical text classification, emphasizing the potential of prompt design to boost NLP performance [54].

Few-shot Prompting: Zero-shot prompting, despite its efficacy across many tasks, faces challenges related to the limitations of pre-existing models and can sometimes produce inaccurate outputs [55]. To address this, few-shot prompting, which uses a small set of example prompts to guide the model more accurately, has been found effective. This approach provides clear prompts that help achieve the desired results. For instance, few-shot prompting has been used with GPT-4 for evaluating medical multiple-choice questions (MCQs), avoiding more complex methods like chain-of-thought processing [56]. These prompt-based strategies harness the contextual understanding of LLMs, showing impressive results on platforms like ChatGPT/GPT-4 [57]. Furthermore, applications such as text translation, data augmentation, content generation, and summarization have seen performance enhancements with few-shot prompting, leading to better accuracy on public datasets compared to traditional benchmarks [58, 59].

2.4 Comparison with state-of-the-art approaches

While the previous works discussed in this section are valuable and represent incremental progress, they largely overlook the data sources where bias initially originates. As highlighted in the literature [4, 60], it is critical to address biases at the earliest stages of the data process to prevent them from being introduced and subsequently amplified by model predictions. In this study, our objective is to eliminate biases during the data ingestion phase (i.e., the pre-processing phase, see Sect. 2.1.1) through a framework that focuses on bias detection and mitigation. Additionally, our bias detection module surpasses state-of-the-art baselines by demonstrating superior performance. Furthermore, we integrate an explainable AI module post-detection, which enhances transparency and bolsters the perception of fairness. Finally, we uniquely employ LLMs in our bias mitigation module. Although various studies [61, 62] have raised concerns about LLMs, our research highlights a constructive application of this emerging technology. The remarkable efficacy of LLMs across diverse tasks stems primarily from their proficiency in contextual learning, which makes them instrumental in addressing numerous research challenges. Consequently, we utilize LLMs to mitigate bias in text.

3 FairFrame: a fairness framework for bias detection and mitigation in news

In this section, we delve into FairFrame, a framework to address a prevalent issue in the realm of news dissemination: the presence of biases within articles. The core objective of our research is to identify and neutralize such biases.

3.1 Problem statement

Given a dataset of \(N\) articles \(A_n\), our goal is to detect the biased content \(B_n\) in each article and subsequently produce a debiased version \(D_n\). More formally, for each given article \(A_n\), we aim to identify biased words, which we denote as \(B_n = \{ b_{n,i} \}_{i \le |B_n|}\). Once biases are detected, the objective is to generate debiased content \(D_n = \{ d_{n,i} \}_{i \le |D_n|}\). This involves replacing the identified biased words \(b_{n,i}\) with neutral alternatives \(d_{n,i}\) that maintain the original meaning of the content but without the biased connotations. The debiasing process aims to ensure that the modified articles exhibit reduced bias, thereby enhancing the perceived objectivity and impartiality of the information presented.

3.2 Overview of FairFrame

FairFrame operates through a dual-component system, illustrated in Fig. 1, which consists of a Bias Detector and a Bias Mitigator. The Bias Detector’s role is to examine news articles to determine the presence of bias, thereby categorizing the content as either biased or unbiased. Following detection, the Bias Mitigator intervenes by altering the biased words within the articles. It replaces biased words with neutral expressions, ensuring the output is an unbiased version of the original article.
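The flow of Fig. 1 can be summarized with the following sketch; `detector` and `mitigator` are hypothetical objects wrapping the modules described in Sects. 3.3 and 3.4, and the method names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FairFrameResult:
    original: str
    biased: bool
    biased_words: List[str]  # tokens highlighted by the detector / LIME
    debiased: str            # equals `original` when no bias is detected

def run_fairframe(articles: List[str], detector, mitigator) -> List[FairFrameResult]:
    """Pass each article through the Bias Detector and, when bias is found,
    through the Bias Mitigator (dual-component flow of Fig. 1)."""
    results = []
    for text in articles:
        is_biased, words = detector.detect(text)  # classification + explanation
        debiased = mitigator.debias(text, words) if is_biased else text
        results.append(FairFrameResult(text, is_biased, words, debiased))
    return results
```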

Fig. 1 Overview of FairFrame

3.3 Bias detector

Figure 2 illustrates the pipeline architecture of the Bias Detector component, comprising three distinct phases: Training Phase, Classification Phase, and Explainable AI Phase.

3.3.1 Training phase

The objective of the bias detection module is to ascertain whether a sentence exhibits bias or not. Consequently, the Learning Task is defined as follows:

Given a corpus \(\mathcal {X}\) and a randomly sampled sequence of tokens \(x_i \in \mathcal {X}\) with \(i \in \{1, \ldots , N\}\), the learning task consists of assigning the correct label \(y_i\) to \(x_i\) where \(y_i \in \{0, 1\}\) represents the neutral and biased classes, respectively. The supervised task can be optimized by minimizing the binary cross-entropy loss

$$\begin{aligned} \mathcal {L} := -\frac{1}{N} \sum _{i=1}^{N} \sum _{k \in \{0,1\}} f_k(x_i) \cdot \log \bigl (\hat{f}_k(x_i)\bigr ) \end{aligned}$$
(1)

where \(f_k(x_i)\) is a binary indicator that equals 1 when the true label of \(x_i\) is class \(k\) (neutral or biased) and 0 otherwise, and \(\hat{f}_k(x_i)\) is the probability the language model assigns to class \(k\) for the given sequence.
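For reference, a minimal PyTorch rendering of Eq. (1) is shown below, assuming the model outputs class probabilities \(\hat{f}_k(x_i)\) for the two classes; in practice the Transformer classification heads compute the same quantity from raw logits via cross-entropy.

```python
import torch

def bias_detection_loss(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Eq. (1): `probs` has shape (N, 2) with class probabilities and `labels`
    is a LongTensor of shape (N,) with 0 = neutral and 1 = biased."""
    one_hot = torch.nn.functional.one_hot(labels, num_classes=2).float()  # f_k(x_i)
    return -(one_hot * torch.log(probs.clamp_min(1e-12))).sum(dim=1).mean()

# Equivalent in practice: torch.nn.CrossEntropyLoss() applied to the raw logits.
```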

The initial phase, deemed the most crucial, begins with an input dataset. This data includes a variety of biased instances identified in news articles, utilized for training our models. This is followed by the preprocessing stage, during which tokenization is employed. Subsequently, we proceed to fine-tune and assess a range of Transformer-based models sourced from HuggingFace’s Transformers library, with a comprehensive account of this process provided in the experiments section.

Fig. 2 Bias detector pipeline

Our approach entails fitting the binary indicator function \(f_k(\cdot )\) with an array of advanced language processing models. The foundational element of these models’ architecture is the encoder stack of the Transformer [63], which relies exclusively on the attention mechanism. Our implementation includes the BERT model [64], along with its derivatives such as DistilBERT [65] and RoBERTa [66]. These models are adept at acquiring bidirectional language representations from unlabeled text. DistilBERT is notable for being a more compact iteration of BERT, while RoBERTa differentiates itself by employing a modified loss function and enhanced training dataset. Additionally, we examine models with transformer-based architectures that have unique training objectives. For instance, DistilBERT and RoBERTa apply masked language modeling in their pre-training phase, whereas ELECTRA [67] adopts a discriminative training method to capture language representations. Our analysis also encompasses XLNet [68], which serves as a representative of autoregressive models, to provide a broad perspective in our systematic evaluation.
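A sketch of how the candidate encoders can be instantiated through HuggingFace’s Transformers library follows; the hub checkpoint names are the standard base variants and are an assumption about the exact models used.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Candidate checkpoints for the bias detector (base variants assumed).
CHECKPOINTS = {
    "BERT": "bert-base-uncased",
    "DistilBERT": "distilbert-base-uncased",
    "RoBERTa": "roberta-base",
    "ELECTRA": "google/electra-base-discriminator",
    "XLNet": "xlnet-base-cased",
}

def load_detector(name: str):
    """Return (tokenizer, model) with a freshly initialized binary classification head."""
    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    return tokenizer, model
```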

3.3.2 Classification phase

In the classification phase, the trained transformer model is used to analyze new, unseen articles. The model classifies these articles as either biased or non-biased based on patterns and features it learned during the training phase. The output is a set of biased content \(B_n = \{ b_{n,i} \}_{i \le |B_n|}\) identified in the articles.
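A short inference sketch for this phase is given below; the local path `./fairframe-detector` is a hypothetical location of a fine-tuned checkpoint.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./fairframe-detector")  # path assumed
model = AutoModelForSequenceClassification.from_pretrained("./fairframe-detector")
model.eval()

def classify(sentence: str) -> str:
    """Label a single sentence as 'biased' or 'neutral' with the fine-tuned detector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return "biased" if logits.argmax(dim=-1).item() == 1 else "neutral"
```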

3.3.3 Explainable AI phase

The final stage of our pipeline is the XAI phase, designed to deliver transparent explanations, enabling users to gain confidence in the system’s outputs. To achieve this, we integrate LIME (Local Interpretable Model-agnostic Explanations) [69].

LIME functions independently from FairFrame’s main prediction mechanism, acting as an auxiliary tool that provides localized insights into specific predictions. While it does not alter the system’s core operations, it significantly enhances user understanding by offering interpretable insights based on individual cases.

By treating any machine learning model as an independent “black box”, LIME enables model-agnostic explanations that are inherently interpretable through input features. This method allows LIME to offer targeted insights into the bias detector component, revealing which features or words the detector relies on to determine if a text is biased.
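Wiring LIME to the detector can look like the following sketch; `tokenizer` and `model` are the fine-tuned objects from the classification phase, and the wrapper must return class probabilities for a batch of raw strings, which is the interface LIME expects.

```python
import torch
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    """Probability wrapper around the fine-tuned detector for LIME."""
    inputs = tokenizer(list(texts), return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["neutral", "biased"])
explanation = explainer.explain_instance(
    "Example sentence to analyze.",  # any sentence flagged by the detector
    predict_proba,
    num_features=10,
)
print(explanation.as_list())  # [(word, weight), ...] -- words driving the decision
```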

3.4 Bias mitigator

LLMs exhibit a capability for in-context learning, enabling them to understand and perform various tasks based solely on task descriptions and examples provided within a prompt, without the need for specialized fine-tuning for each new task [70].

The bias mitigation phase involves several key steps, as illustrated in Fig. 3.

Fig. 3 Bias mitigator pipeline

3.4.1 Formulation

After detecting the biased content, the next step is to neutralize these biases and generate debiased content. Let \(A_n = \{a_{n,i}\}_{i \le |A_n|}\) represent the set of biased articles provided by the user. To guide the debiasing process, we define a set of few-shot prompts \(P = \{(b_{p,i}, d_{p,i})\}_{i \le |P|}\), where \(b_{p,i}\) are examples of biased text and \(d_{p,i}\) are their debiased counterparts. These prompts instruct the model on how to transform biased text into neutral text. Additionally, a knowledge base \(K\) provides further context and information, including dictionaries of biased words.

The few-shot prompts \(P\) and relevant information from the knowledge base \(K\) are combined to form a comprehensive prompt \(q\). An LLM \(\mathcal {M}\) processes the prompt \(q\) along with the biased articles \(A_n\) to generate debiased content:

$$\begin{aligned} D_n = \mathcal {M}(q, A_n) \end{aligned}$$

where \(D_n = \{d_{n,i}\}_{i \le |D_n|}\) is the set of debiased articles. The final output \(D_n\) is a debiased version of the input news articles, intended to be more objective.
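A sketch of the mitigation step \(D_n = \mathcal {M}(q, A_n)\) with the OpenAI Python client is shown below; the structured prompt `q` bundles the few-shot pairs \(P\) and the knowledge base \(K\), and the temperature setting is an illustrative assumption rather than a value reported in the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def debias_article(article: str, prompt_q: str, model: str = "gpt-4") -> str:
    """D_n = M(q, A_n): send the structured prompt plus one biased article and
    return the debiased rewrite."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": prompt_q},
            {"role": "user", "content": article},
        ],
        temperature=0,  # deterministic rewrites (assumption)
    )
    return response.choices[0].message.content
```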

3.4.2 Prompt design

Crafting effective prompts is key to maximizing the benefits of LLMs. This process entails creating the initial input, or “prompt”, that steers the model toward generating the desired output [55]. To enhance the effectiveness of our approach, we advocate for the use of a meticulously structured prompt, illustrated in Fig. 4. This prompt is designed to include five crucial elements that are key to achieving the desired results:

  1. Context: Provides a backdrop for the request, establishing the scenario or domain within which the model operates. This ensures alignment with the intended purpose or environment.

  2. Knowledge: Encapsulates relevant information, facts, or principles necessary for the task, enabling the model to generate informed and accurate responses.

  3. General Request: Specifies the overall objective or the type of output sought from the model, guiding its action or response type.

  4. Few-Shot Examples: Provides a small number of example inputs and their corresponding outputs. These examples serve as a guide for the model, showing the expected format, style, or approach in the responses; they teach the model through direct examples without requiring extensive training data.

  5. Input to Debias: The specific input text to be debiased, supplied so that the model produces a fair and balanced rewrite.

The experimental values for each element are presented in Table 1.
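The assembly of the five elements into a single prompt can be sketched as follows; the wording of each element is illustrative, and the actual experimental values are those reported in Table 1.

```python
def build_prompt(context: str, knowledge: str, request: str, few_shots, text_to_debias: str) -> str:
    """Assemble the structured prompt of Fig. 4: Context, Knowledge, General
    Request, Few-Shot Examples, and Input to Debias."""
    shots = "\n".join(
        f"Biased: {biased}\nDebiased: {debiased}" for biased, debiased in few_shots
    )
    return (
        f"Context: {context}\n\n"
        f"Knowledge: {knowledge}\n\n"
        f"Request: {request}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Input to debias: {text_to_debias}"
    )

# Zero-shot corresponds to few_shots = []; the 2-shot and 4-shot settings pass
# two or four (biased, debiased) pairs such as those listed in Table 2.
```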

Fig. 4 Illustration of the structured prompt

4 Experimental setup

4.1 Used dataset

In this research, our data source is MBIC, a media bias annotation dataset [71]. This dataset encompasses 17,000 annotated sentences from roughly 1,000 news articles sourced from various outlets, including HuffPost, MSNBC, AlterNet, Fox News, Breitbart, USA Today, Reuters, and others. It comprises approximately 10,000 biased and 7,000 unbiased annotations. The features of the dataset utilized in this study include:

  • Sentence: A sentence extracted from a news article.

  • News Link: The URL of the source news article.

  • News Outlet: The publishing source of the news (e.g., USA Today, MSNBC).

  • Topic: The subject matter of the news (e.g., gun control, coronavirus, white nationalism).

  • Biased Words: Words identified as biased by experts.

  • Label: Classification of the news as biased or unbiased.

In this study, we utilize protected attributes from the dataset as defined in existing literature [72]: “gender” includes Male and Female; “age” is categorized into Elder, Young, and Adult; “education” is split into College degree and High school; “language” distinguishes between English speaker and Non-English speaker; “race” comprises Black, White, Caucasian, and Asian. Furthermore, we define privileged attributes as follows: Male for gender, College degree for education, English Speaker for language, and White for race. Conversely, the unprivileged attributes are Female for gender, High school for education, Non-English Speaker for language, and both Black and Asian for race. These attributes are grouped into privileged and unprivileged based on the prevalence of biased language associated with each. The selection of these attributes reflects the marginalization observed in various societal domains such as gender, race, ethnicity, religion, disability, and sexual orientation, as discussed in literature [72].
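For later fairness computations, the privileged/unprivileged grouping defined above can be captured in a simple mapping; the data structure is an implementation assumption, while the values follow the definitions in the text.

```python
# Privileged vs. unprivileged values for each protected attribute (see text above).
PROTECTED_GROUPS = {
    "gender":    {"privileged": ["Male"],            "unprivileged": ["Female"]},
    "education": {"privileged": ["College degree"],  "unprivileged": ["High school"]},
    "language":  {"privileged": ["English speaker"], "unprivileged": ["Non-English speaker"]},
    "race":      {"privileged": ["White"],           "unprivileged": ["Black", "Asian"]},
}
```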

We selected this dataset as our main source of data due to its ability to encompass a wide array of biases. It is particularly valuable because it gauges public perceptions of bias. Furthermore, the dataset includes articles covering a diverse spectrum of topics such as politics, science, and ethnicity, among others. This diversity is crucial to our goal of identifying various forms of textual bias.

4.2 Bias detector implementation

In the bias detector module, we use various transformer models; we therefore detail their experimental settings below.

Training: Our training protocol adopts the neural models available through the Transformer API by HuggingFace [73]. These models are initialized with their pre-trained parameters, while the parameters for the classification elements are set up and refined consistently. The process begins with fine-tuning and assessing the neural models using the MBIC dataset.

Hyperparameter Tuning: During the model training process, we employ a 5-fold cross-validation strategy to fine-tune the hyperparameters and to ensure that our model is robust and generalizes well to unseen data. The hyperparameters we have selected for the training process are as follows:

  • Buffer Size: Set to 10,000, this variable determines the size of the buffer used in shuffling the dataset, ensuring that our training samples are provided in random order.

  • Batch Size: With a value of 8, the batch size controls the number of training samples to work through before the model’s internal parameters are updated.

  • Learning Rate: The learning rate is set to \(5 \times 10^{-5}\), which dictates the step size at each iteration while moving toward a minimum of the loss function. We use Adam optimization.

  • Early Stopping: A callback is implemented to monitor the validation loss with a patience of 1 epoch, aiming to prevent overfitting by halting the training process if no improvement is observed.

All computations were performed on Google Colab Pro+.
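A sketch of this training configuration with the HuggingFace Trainer API and scikit-learn’s KFold is shown below; `load_detector` is the helper from the earlier sketch, `dataset` is assumed to be a tokenized HuggingFace Dataset with a `label` column, the shuffle buffer mentioned above belongs to a tf.data input pipeline and is replaced here by KFold shuffling, and argument names may vary slightly across library versions.

```python
import numpy as np
from sklearn.model_selection import KFold
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./fairframe-detector",  # checkpoint directory (assumed)
    per_device_train_batch_size=8,      # batch size of 8
    learning_rate=5e-5,                 # Adam with a 5e-5 learning rate
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # 5-fold cross-validation
for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(len(dataset)))):
    tokenizer, model = load_detector("DistilBERT")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset.select(train_idx),
        eval_dataset=dataset.select(val_idx),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],  # patience of 1 epoch
    )
    trainer.train()
```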

4.3 Bias mitigator implementation

Large Language Model: In our experiments, we utilized the GPT-4 model from OpenAI, a large-scale, multimodal model capable of processing both image and text inputs to generate text outputs. Although GPT-4 does not match human capabilities in numerous real-world situations, it achieves human-like performance across a range of professional and academic benchmarks [74].

Prompt: As outlined in the previous section, our structured prompt is composed of several distinct arguments. Table 1 displays the experimental values assigned to each part of the prompt, providing a detailed breakdown of how each argument contributes to the overall structure and functionality of the prompt.

Table 1 Experimental inputs for each structured prompt element

We implemented a progression from zero-shot to few-shot learning techniques to assess model responsiveness and accuracy. Initially, in the zero-shot scenario, the models were evaluated without any prior examples, relying solely on the prompt. Subsequently, we introduced few-shot learning, specifically with two-shot and four-shot scenarios, to observe how the incremental introduction of examples influences performance.

Table 2 showcases examples of biased text alongside their debiased versions, which serve as input for the models in our 2-shot and 4-shot experiments. This methodical incorporation of examples enables us to scrutinize the model’s adaptability and learning efficacy as it progresses from a zero-shot to a few-shot learning context.

Table 2 Examples of biased and unbiased text used in few-shot learning scenarios

4.4 Baselines

We were unable to identify any state-of-the-art models capable of simultaneously performing both tasks: (1) bias detection and (2) bias mitigation. Therefore, we have employed alternative baseline methods to assess the effectiveness of the individual components of FairFrame.

Bias Detector: We are assessing the performance of the bias detection module within FairFrame. This involves evaluating a variety of classification models alongside our fine-tuned transformers to determine which combination yields the most accurate results. For fine-tuning the bias detector module, we experiment with different models and embeddings, aiming to identify the optimal setup for the classification task. The models employed in this experiment include traditional machine learning methods, deep neural networks, and advanced Transformer-based methods featuring self-attention:

  • Logistic Regression with TFIDF Vectorization (LG-TFIDF): We employ Logistic Regression (LG) combined with TfidfVectorizer for word embedding. This setup, known for its effectiveness in various classification tasks like hate speech detection and text classification, serves as a solid baseline.

  • Random Forest with TFIDF Vectorization (RF-TFIDF): The Random Forest (RF) classifier is paired with TfidfVectorizer for word embedding. This combination is commonly used in text classification, sentiment analysis, and similar tasks.

  • Gradient Boosting Machine with TFIDF Vectorization (GBM-TFIDF): We utilize the Gradient Boosting Machine (GBM) with TfidfVectorizer for word embedding.

  • Logistic Regression with ELMO (LG-ELMO): Logistic Regression is used in conjunction with ELMO embeddings, a contextual word embedding technique based on bi-directional LSTM networks.

  • Multilayer Perceptron with ELMO (MLP-ELMO): We employ the multilayer perceptron (MLP), a feedforward artificial neural network, with ELMO embeddings, noted for its strong performance in classification tasks.

Bias Mitigator: We adopt the evaluation strategy outlined in the related work [7], which classifies fairness methods into three categories: (1) fairness pre-processing, (2) fairness in-processing, and (3) fairness post-processing methods:

  • Disparate impact remover (DIR) [18] is a pre-processing technique designed to enhance fairness between groups, specifically between privileged and unprivileged groups. It modifies feature values, such as those indicating privilege or lack thereof, to create unbiased data while retaining essential information. Following the application of this algorithm, any machine learning or deep learning model can be developed with the adjusted data. The efficacy of this process is assessed using the Disparate Impact metric, which verifies whether the model operates within an acceptable bias threshold. In our baseline approach, we employ several methods via AutoML, and report on the outcomes from the most effective model. Among the models tested, Logistic Regression yielded the best results.

  • Adversarial De-biasing (ADB) [22] utilizes the framework of generative adversarial networks (GANs). This in-processing method involves training a model to de-bias word and general feature embeddings. It focuses on internalizing definitions of fairness, including demographic parity, equality of odds, and equality of opportunity. In this setup, a discriminator (part of the GAN) is tasked with predicting the protected attribute reflected in the bias of the original feature vector. Concurrently, a generator (also part of the GAN) strives to produce more de-biased embeddings to effectively challenge the discriminator.

  • Calibrated Equalized Odds (CEO) [26] post-processing is a technique that adjusts calibrated classifier score outputs. It optimizes these scores to determine the probabilities for modifying output labels to meet an equalized odds objective. This method falls under the category of post-processing techniques.

We also compare our framework with Dbias [76], designed to ensure fairness in news articles. It can analyze any text to determine if it exhibits bias. Dbias identifies biased words within the text, masks them, and then suggests alternative sentences using new words that are bias-free or significantly less biased.

4.5 Evaluation metrics

4.5.1 Detection phase

In this phase, we assess the performance of our proposed model through several key metrics commonly employed in machine learning detection systems to provide a comprehensive understanding of its effectiveness. We use the following metrics: accuracy (Acc), precision (Pre), recall (Rec), and F1-score (F1).

4.5.2 Mitigation phase

Disparate Impact (DI) [28] is a metric used to evaluate fairness. It compares the proportion of individuals that receive a positive output for two groups: an unprivileged group and a privileged group. The industry standard for DI is the four-fifths rule [77], which means that if the unprivileged group receives a positive outcome at less than 80% of the rate of the privileged group, this constitutes a disparate impact violation. An acceptable value lies between 0.8 and 1.25, with values below 0.8 favoring the privileged group and values above 1.25 favoring the unprivileged group [77]. Mathematically, it can be defined as:

$$\begin{aligned} DI = \frac{\frac{\text {num}\_\text {positives}(\text {privileged} = \text {False})}{\text {num}\_\text {instances}(\text {privileged} = \text {False})}}{\frac{\text {num}\_\text {positives}(\text {privileged} = \text {True})}{\text {num}\_\text {instances}(\text {privileged} = \text {True})}} \end{aligned}$$
(2)

where num_positives is the number of individuals in the group: either privileged = False (unprivileged), or privileged = True (privileged), who received a positive outcome. The num_instances are the total number of individuals in the group.

Although DI is not specifically designed for analyzing text-based biases, taking inspiration from related works [78], we measure bias using three quantities (the number of positives, the number of negatives, and the total number of instances) computed over the test-set sentences that mention the identities (gender, education, spoken language) of specific groups using biased or unbiased words.
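A direct implementation of Eq. (2) over parallel lists of outcomes and group membership flags can be sketched as follows; the list-based interface is an illustrative assumption.

```python
def disparate_impact(outcomes, privileged_flags):
    """Eq. (2): ratio of positive-outcome rates, unprivileged over privileged.
    `outcomes` holds 1/0 (or True/False) outcomes; `privileged_flags` holds
    True for privileged and False for unprivileged individuals."""
    def positive_rate(is_privileged):
        group = [o for o, p in zip(outcomes, privileged_flags) if p == is_privileged]
        return sum(group) / len(group) if group else float("nan")

    return positive_rate(False) / positive_rate(True)

# Values within [0.8, 1.25] satisfy the four-fifths rule [77]; values below 0.8
# favor the privileged group, values above 1.25 favor the unprivileged group.
```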

5 Results

In this section, we provide an interpretation of the results as well as a comparison with state-of-the-art methods.

5.1 Effectiveness of the bias detection module

Table 3 presents the results of various bias detection models, evaluated using Precision (Pre), Recall (Rec), and F1-score (F1) metrics.

The performance of the various models in detecting bias varies significantly. The LG-TFIDF model shows balanced but moderate performance, with precision, recall, and F1-score all at 0.61. Both RF-TFIDF and GBM-TFIDF offer slight improvements, with F1-scores of 0.64 and 0.65, respectively. The LG-ELMO model achieves a higher F1-score of 0.67, demonstrating the advantage of ELMo embeddings in capturing contextual information. The MLP-ELMO model has a very high precision of 0.96 but a lower recall of 0.67, resulting in an F1-score of 0.78, indicating it is conservative in its predictions. Dbias, with a balanced F1-score of 0.75, stands out among the baseline models.

Table 3 Bias detector results

In this research, we employ transformer models for bias detection, achieving high effectiveness. BERT and DistilBERT lead with F1-scores of 0.85 and 0.84, with BERT showing superior recall at 0.88. DistilBERT proves that even streamlined models can perform excellently in detecting bias. RoBERTa, with the highest recall at 0.90, tends to generate more false positives, reflected in a lower precision of 0.72 and an F1-score of 0.79. ELECTRA and XLNet also perform well, scoring F1-scores of 0.81 and 0.80, respectively, with ELECTRA showing balanced precision and recall and XLNet demonstrating high recall and reasonable precision.

DistilBERT’s performance closely approaches that of BERT. Despite the slight difference in precision and recall, DistilBERT offers a significant advantage in terms of faster inference speeds and reduced computational load. This model is a distilled version of BERT (smaller, faster, and requiring less computational power), making it an optimal choice for environments where quick model responsiveness is crucial.

5.2 Assessing the effectiveness of LIME

Following the detection phase, we enter the Explainable AI phase, where we employ the LIME method. As depicted in Table 4, this method enables a side-by-side comparison of biases identified by experts (specifically, the “Biased Words” column in the dataset described in Sect. 4.1) with those detected by our model using LIME.

Applying LIME to highlight the specific words flagged by experts as biased provides strong validation of our model’s capability to recognize and interpret nuanced biases. The analysis presented in Table 4 shows that LIME not only captures broad themes of bias but also matches expert evaluations closely at the word level, showcasing a high degree of accuracy in identifying biases.

LIME focuses on identifying and highlighting words that the model deems crucial for detecting bias. These highlighted words are significant as they encompass the primary features that the model uses to determine whether a text exhibits bias. These words include, but are not limited to, the terms identified by experts.

Table 4 Comparison of expert-identified bias words and those highlighted by LIME

For example, in the first row of the table, the bias under discussion is “Belated, Birtherism”. Both the expert-identified biased words and the model-identified biased words via LIME are closely aligned. The experts have labeled the phenomenon as “Belated, Birtherism”, encapsulating the entire phrase as indicative of bias. LIME, in its analysis, separately identifies the words “Belated”, “Birtherism”, and “conspiracy”, which are core components of the experts’ terminology. This alignment underscores the efficacy of our proposed model in detecting bias, as it successfully identifies largely the same keywords as the experts. By doing so, LIME confirms that the model’s decision-making process aligns with expert human judgment, highlighting the precise terms contributing to perceived bias. In fact, it highlights words beyond those identified as biased by human experts, revealing the features the model relies on for classifying text as biased. This approach helps validate and refine the model’s understanding of textual biases, offering deeper insight into its detection logic.

5.3 Effectiveness of the bias mitigation module

We compare the proposed approach’s performance against the baseline methods. The fairness and accuracy metrics for all methods, including the baselines and FairFrame, are detailed in Table 5. The experiments are structured in two phases: (1) pre-debiasing evaluation and (2) post-debiasing evaluation, following previous research [7, 72]. Initially, the pre-debiasing evaluation uses the protected variable values to compute Disparate Impact (DI) and identify pre-existing biases in the dataset. Subsequently, the post-debiasing phase involves applying the various bias mitigation baselines to the original data.

Table 5 Comparison of FairFrame with the baseline methods

In the “Pre-debiasing” evaluation phase, the DI ratio for all models remains constant because it is calculated using the original dataset before any techniques are applied. The DI score in the “Pre-debiasing” evaluation is 0.7, indicating that unprivileged groups receive positive outcomes less than 80% of the time compared to privileged groups, which constitutes a disparate impact violation.

In the “Post-debiasing” evaluation, we see a notable enhancement in the DI ratio with our method. The DIR model shows a trade-off, with improved fairness but reduced performance, while ADB achieves a balance, slightly losing accuracy but gaining significantly in fairness. CEO maintains consistent performance with minor gains in fairness. Dbias reaches 1.01, indicating a clear enhancement in fairness across the models. An ideal DI value falls between 0.8 and 1.25, ensuring equitable treatment across different groups [72]. Our model achieves a DI ratio of 1.18, demonstrating an effective reduction of disparities. While the baseline methods exhibit various strengths and weaknesses before debiasing, post-debiasing improvements in disparate impact are most notable for Dbias and ADB. However, FairFrame consistently outperforms the baseline methods on most metrics, both pre- and post-debiasing, highlighting its effectiveness in achieving high performance and enhanced fairness.

Tradeoff between accuracy and fairness: The results suggest a trade-off between increased fairness and decreased overall performance. These findings confirm earlier research [5, 76], which indicates that detecting bias becomes markedly harder following debiasing efforts. During the “post-debiasing” phase, biases must be identified in sentences whose originally biased words have been altered. As a result, the effectiveness of bias detection is likely to decrease, since these sentences no longer appear overtly biased. This is in line with both theoretical expectations and previous empirical studies in the field.

5.3.1 Ablation study

We tested different settings using the GPT-4 model across various configurations: zero-shot, two-shot, and four-shot prompting, in both Prompting Learning (PL) and Knowledge-based Prompting Learning (KPL).

Table 6 provides examples of input (original biased text) and output (debiased text) across different settings. This showcases how the debiasing process varies with different prompt configurations. For example, the 0-Shots PL setting effectively substitutes the term “birtherism” with “conspiracy theories”, thus preserving the original context of the text while removing its biased connotation. In contrast, the 2-Shots KPL approach goes further by clarifying YouTube’s stance, promoting a more balanced narrative. A comparative analysis shows distinct patterns in how each setting mitigates bias. Notably, the PL settings, especially the 4-Shots PL, consistently achieve a high level of neutrality in the texts produced. This suggests that using multiple examples (shots) during the debiasing process enhances the model’s ability to accurately understand and eliminate bias. Meanwhile, the KPL settings offer a more nuanced approach that carefully balances maintaining the integrity of the original text with the need to expunge biased language.

Table 6 Examples of debiased texts across different settings, with expert-identified biased words bolded in the original text

Additionally, we assess the configurations by comparing DI scores, which are detailed in Fig. 5. The findings revealed that for PL, the DI scores progressively increased with the number of shots: starting at 0.92 for zero-shot, rising marginally to 1.02 for two-shots, and further to 1.10 for four-shots, indicating incremental improvements with additional example prompts. Conversely, KPL demonstrated superior initial performance with a DI score of 1.09 in the zero-shot setup, which suggests that the integration of domain-specific knowledge enhances the model’s baseline effectiveness. Further improvements were noted in two-shot KPL, achieving a DI score of 1.18. However, extending to four-shots did not further enhance performance, maintaining the DI score at 1.18. This plateau suggests a potential saturation point or diminishing returns with additional prompts in KPL. These findings underscore the significant impact of knowledge integration in prompting strategies and highlight the efficiency of KPL over PL, particularly in scenarios where prompt optimization is crucial for balancing performance with computational efficiency.

Fig. 5 Disparate impact scores for GPT-4 under various prompting configurations

6 Discussion

The problem of bias in news is far from resolved and has only been partially addressed. Through our framework, we strive to offer news that is either unbiased or less biased. In this work, we concentrate on mitigating biases in textual data, which differs from detecting and correcting biases in numeric data [79, 80]. Moreover, while other researchers employ either XAI methods for bias detection [81] or binary classification [76], our method combines both to enhance performance and interpretability. Previous research often involves multiple components to address bias: bias detection, bias recognition, bias masking, and fairness infilling [76]. This structure can be complex and time-consuming for debiasing text. Our method, however, consolidates the process into two main components. This streamlined approach reduces complexity and accelerates the debiasing process.

6.1 Transformers in bias detection

Our findings show that transformer-based models consistently outperform baseline models across all metrics, illustrating the advantage of advanced deep learning architectures in capturing nuanced patterns indicative of bias. However, these models can also embed systemic biases from their training data, potentially perpetuating and amplifying these biases in predictive tasks [82]. In this study, we acknowledge the potential risk of introducing new biases via transfer learning. However, our findings support that carefully fine-tuning the models proves advantageous. This fine-tuning entails specifically adjusting the model parameters to mitigate bias amplification by prioritizing fairness and equitable representation during training. To further safeguard against these issues, we employ Explainable AI with LIME to gain insights into the model’s decision-making process.

6.2 Interpreting AI decisions

To directly address the critical issue of bias amplification mentioned in Sect. 6.1, we have integrated the use of Local Interpretable Model-agnostic Explanations (LIME) into our methodology. LIME enhances the transparency of our transformer-based models by providing interpretable explanations for individual predictions. This interpretability is crucial for uncovering and understanding the model’s decision-making process at a granular level. By analyzing how specific features, particularly words flagged by experts as potentially biased, influence predictions, LIME allows us to dissect and address these biases effectively. Table 4 demonstrates how LIME identifies the features that are most impactful in the model’s decisions, including those contributing to bias, thereby significantly enhancing our confidence in the model’s outputs. This approach not only illuminates the “why” and “how” behind the model’s conclusions but also serves as a critical tool in our efforts to minimize bias amplification by making the model’s reasoning processes transparent and adjustable.

6.3 Debiasing text with large language models

Our method for mitigating bias in text utilizes LLMs by prompting them to replace biased words, capitalizing on their advanced linguistic abilities. Recent studies [48] have shown that language models can self-diagnose and self-debias when given correctly formulated prompts. Despite these promising capabilities, the question remains: Can LLMs inherently embed biases from their training data? The answer is complex. While LLMs can adjust their outputs based on debiased instructions, they are fundamentally shaped by the vast datasets on which they are trained, which often contain biases reflective of historical and cultural prejudices. Therefore, even as LLMs exhibit the ability to self-correct, the embedded biases from their training phase can still influence their behavior subtly and persistently. To leverage the self-diagnosis and debiasing capabilities effectively, our methodology included precise and contextually aware prompting. This involved not just instructing the LLMs to replace overtly biased terms but also guiding them to recognize patterns in the data where biases manifest.

6.4 Concepts of bias and fairness

In this research, bias refers to the phenomenon where computer systems “systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others” [83]. This can occur due to biased training data, differential use of information, or inherent biases in the algorithms themselves.

Currently, there is no universally accepted definition of bias and fairness [76]. Different types of biases require different approaches since the characteristics of gender bias, for example, do not apply to biases related to ethnicity or social status. To develop more standardized definitions in the future, it is essential first to examine a diverse array of biases in various contexts. This exploration will help accurately determine the fairness of data and algorithms.

While our approach employs technical definitions of bias and fairness, it is crucial to recognize that algorithmic bias is not merely a technical issue but also a complex sociopolitical one. The impact of algorithmic bias goes beyond technology, as it mirrors and perpetuates existing sociopolitical inequalities. For instance, biased algorithms can result in discrimination based on race, gender, or socioeconomic status, thereby affecting fundamental rights and freedoms [84].

6.5 Limitations

In our research study, we acknowledge several limitations that indicate substantial work remains. Primarily, we have applied only the DI fairness metrics, recognizing the need to explore additional metrics and assess their impact on performance. A significant challenge in fairness research is data collection. For this study, we utilized a manually annotated news dataset to identify bias-bearing words. Moreover, we are aware that crowdsourced datasets often embody significant social biases. To address this, one future direction is to evaluate the biases of crowd workers using counterfactual fairness metrics [85]. Additionally, we recommend that dataset providers enhance transparency in their annotation processes to better support fairness studies.

7 Conclusion and future works

In this paper, we introduce FairFrame, a framework designed to facilitate the dissemination of news that is less influenced by societal and other biases. FairFrame comprises two primary components: a bias detection module and a bias mitigation module. We employ a Transformer-based model to identify biased news using labeled news datasets. Additionally, we leverage the capabilities of large language models (LLMs) to debias text, substituting biased terms with neutral alternatives. We evaluate FairFrame’s performance against leading fairness methodologies in the field. This study provides a platform for scholars focused on text debiasing. Despite progress, considerable efforts are still needed to advance fairness in machine learning. Consequently, a potential future direction is to expand the framework’s usage to additional datasets, including those containing fake news.