
1 Introduction

Recent advancements in machine learning and natural language processing research have paved the way for the development of sophisticated LLMs. Their widespread availability and the ease with which they can generate coherent content are contributing to the production of massive volumes of automatically generated online content. LLMs have demonstrated remarkable performance in producing human-like language, showcasing their potential across a wide range of applications, such as domain-specific tasks in legal [20] and financial services [23]. Foundation models such as OpenAI's GPT-3 [1] and BigScience's BLOOM [19] are publicly available and can generate highly sophisticated content from basic text prompts. This often makes it challenging to discern between human-written and LLM-generated text.

While LLMs demonstrate the ability to understand the context and generate coherent human-like responses, they do not have a true understanding of what they are producing [12]. This could potentially lead to adverse consequences when used in downstream applications. Generating plausible but false content (hallucination [10]) may inadvertently help propagate misinformation, fake news, and spam [9].

There is a considerable body of research available on detecting text generated by artificial intelligence (AI) systems [9, 21]. However, the identification of a specific LLM responsible for generating such text is a relatively new area of research. We argue that attributing the generated text to a specific LLM is a vital research area, as the knowledge of the source LLM would enable one to be vigilant regarding potential known biases and limitations associated with that model and use the content appropriately in downstream applications with suitable oversight [21].

In this study, we focus on identifying the source of AI-generated text (referred to as model attribution hereafter) in two different languages, English and Spanish. More specifically, given a piece of text, the goal is to determine which specific LLM generated the text. To address this problem, we propose an ensemble classifier, where the probabilities generated by various state-of-the-art LLMs are used as input feature vectors to traditional machine learning classification models that produce the final predictions. Our experiments show that multiple instances of the proposed framework outperform several baselines when evaluated with well-established metrics.

2 Related Work

The majority of research in this area is focused on differentiating between text authored by humans and text generated by AI [3, 17].

The use of neural networks leveraging complex linguistic features and their derivatives is most prevalent in detecting AI-generated text. DetectGPT [15] generates minor perturbations of a passage using a generic pre-trained Text-to-Text Transfer Transformer (T5) model, and then compares the log probability of the original sample with each perturbed sample to determine if it is AI-generated. Deng et al. [4] build upon the DetectGPT model by incorporating a Bayesian surrogate model to select text samples more efficiently, which achieves similar performance to DetectGPT using half the number of samples. Mitrovic et al. [16] developed a fine-tuned Transformer-based approach to distinguish between human-written and ChatGPT-generated text, with the addition of SHapley Additive exPlanations (SHAP) values for model explainability. This approach provides insight into the reasoning behind the model's predictions. Statistical methods have also been applied for the detection of AI-generated text, such as the Giant Language model Test Room (GLTR) approach [6].

The increasing sophistication of generative AI models, coupled with adversarial attacks, makes the detection of AI-generated text especially challenging. Two forms of attacks that create additional complications are paraphrasing attacks and adversarial human spoofing [17]. Automatically generated text may also show factual, grammatical, or coherence artifacts [14], along with statistical abnormalities that affect the distributions of automatically generated and human-written texts [8]. The importance of detecting AI-generated text and the corresponding challenges will foster further research on this topic.

In addition to distinguishing between human and AI-generated text, identifying the specific LLM that generated the artificial text is becoming increasingly important. Uchendu et al. [21] explored the Robustly optimized BERT approach (RoBERTa) model to classify AI-generated text into eight different classes. Li et al. [11] developed a model for AI-generated multi-class text classification for the Russian language, using Decoding-enhanced BERT with disentangled attention (DeBERTa) as a pre-trained language model for category classification. These prior works focused on model attribution for only a single language, such as English or Russian. In contrast to the aforementioned research, and to the best of our knowledge, our approach to model attribution is the first to be applied across multiple languages, demonstrating the robustness of our approach across attributable LLMs, languages, and domains.

3 AuTexTification Dataset

The dataset used in this study comes from the Iberian Languages Evaluation Forum (IberLEF) AuTexTification shared task [18]. The data consists of texts from five domains, where three domains (legal, wiki, and tweets) are used for training, and two different domains (reviews and news) are used for testing. It contains machine-generated text from six text generation models, labeled bloom-1b7 (A), bloom-3b (B), bloom-7b1 (C), babbage (D), curie (E), and text-davinci-003 (F), for two different languages, English and Spanish. The LLMs used to generate the text have increasing numbers of parameters, ranging from 2B to 175B. The motivation is to emulate realistic AI text detection settings, which require approaches versatile enough to handle a diverse set of text generation models and writing styles. The number of samples in each class for both languages is shown in Table 1. To showcase the complexity of the problem, we also present samples for each category from both the English and Spanish datasets in Tables 2 and 3.

Table 1. Label distribution across the languages for model attribution task. Train and test splits for each language are also shown.
Table 2. Samples of English AI-generated text, with corresponding source models (labeled A-F).
Table 3. Samples of Spanish AI-generated text, with corresponding source models (labeled A-F).
Table 4. Models explored for English and Spanish datasets

4 Proposed Ensemble Approach

In this section, we detail our approach to generative language model attribution. We first provide a description of the LLMs and machine learning models that we explored for model attribution. Next, we discuss the proposed ensemble neural architecture, where we fine-tune the LLMs and then pass their predictions to various traditional machine learning models to perform the ensemble operation.

4.1 Models

LLMs: We explored various state-of-the-art LLMs [22], such as Bidirectional Encoder Representations from Transformers (BERT), DeBERTa, RoBERTa, and the cross-lingual language model RoBERTa (XLM-RoBERTa), along with their variants. Since the datasets differ for each language and the same set of models does not fit both, we fine-tuned different models for each language. We investigated more than 15 distinct models per language and selected the ones presented in this paper based on their performance on the validation data. This selection was made to ensure model diversity, which aids generalisation and improves comprehension of context and semantics. Table 4 lists the models that we selected for the two languages under consideration. We briefly describe each of the LLMs below.

  • microsoft/deberta-base [7] is a transformer model that improves on the BERT and RoBERTa models using disentangled attention and an enhanced mask decoder.

  • xlm-roberta-large-finetuned-conll03-english is an XLM-RoBERTa-based model [2], a large multilingual language model trained on 2.5TB of filtered CommonCrawl data. The conll03-english variant is fine-tuned from the XLM-RoBERTa model on the English CoNLL-2003 dataset.

  • roberta-large and PlanTL-GOB-ES/roberta-large-bne are RoBERTa-based models [13] pre-trained in a self-supervised fashion using a Masked Language Modeling (MLM) objective; roberta-large is pre-trained on a large corpus of English data. The roberta-large-bne model has been pre-trained on the largest Spanish corpus to date, a total of 570GB of text compiled from web crawls.

  • dbmdz/bert-base-multilingual-cased-finetuned-conll03-spanish, hiiamsid/sentence_similarity_spanish_es, allenai/scibert_scivocab_cased, bert-large-uncased-whole-word-masking-finetuned-squad, and allenai/longformer-base-4096 are BERT-based models [5]. The bert-base-multilingual model is pre-trained with an MLM objective on Wikipedia data covering the 104 languages with the largest Wikipedias, and is further fine-tuned on the Spanish CoNLL-2002 dataset. The sentence similarity Spanish model is a sentence-transformer model whose base model is BETO, which is trained on a large Spanish corpus. The scibert model is trained on papers from Semantic Scholar. The BERT-large SQuAD model differs slightly from the other BERT models in that it is trained with a whole-word masking technique and further fine-tuned on the Stanford Question Answering Dataset (SQuAD). The Long-Document Transformer (Longformer) model is a BERT-like model initialized from the RoBERTa checkpoint and pre-trained for MLM on long documents; it supports sequences of length up to 4,096 tokens.

Machine Learning (ML) Models: We explored various traditional machine learning and ensembling models such as Bagging, Voting, OneVsRest, Error-Correcting Output Codes (ECOC), and LinearSVC [24].
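
As a rough illustration, the snippet below instantiates scikit-learn counterparts of these classifiers; the hyper-parameters shown are illustrative defaults rather than the tuned values used in our experiments.

```python
# Illustrative sketch of the traditional ML / ensembling classifiers explored.
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier, VotingClassifier, RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier, OutputCodeClassifier

candidate_classifiers = {
    "linear_svc": LinearSVC(C=1.0),
    "bagging": BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=10),
    "voting": VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(n_estimators=200))],
        voting="soft",  # soft voting averages the estimators' class probabilities
    ),
    "one_vs_rest": OneVsRestClassifier(LinearSVC(C=1.0)),
    "ecoc": OutputCodeClassifier(LinearSVC(C=1.0), code_size=2.0),  # ECOC
}
```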

4.2 Proposed Ensemble Neural Architecture

As shown in Fig. 1, an input text is passed through variants of the pre-trained LLMs, such as DeBERTa (D), XLM-RoBERTa (X), RoBERTa (R), and BERT (B). During the model training phase, these models are fine-tuned on the training data. For inference and testing, each of these models independently generates classification probabilities (P), namely \(P^{D}\), \(P^{X}\), \(P^{R}\), \(P^{B}\), etc. In order to maximize the contribution of each model, these probabilities are either concatenated (\(P^{C}\)) or averaged (\(P^{A}\)), and the resulting output is passed as a feature vector to train various traditional ML models that produce the final predictions.
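
The ensemble step can be summarised by the minimal sketch below. The synthetic probability arrays and variable names are purely illustrative; in practice the probabilities come from the fine-tuned LLMs, and Linear SVC is only one of the final-stage ML models explored.

```python
import numpy as np
from sklearn.svm import LinearSVC

def build_features(prob_list, mode="concat"):
    """prob_list: list of (n_samples, n_classes) probability arrays, one per LLM."""
    if mode == "concat":     # P^C: concatenate along the class dimension
        return np.concatenate(prob_list, axis=1)
    if mode == "average":    # P^A: element-wise mean over the LLMs
        return np.mean(np.stack(prob_list, axis=0), axis=0)
    raise ValueError(mode)

# Placeholder probabilities standing in for the outputs of four fine-tuned LLMs
# (e.g. DeBERTa, XLM-RoBERTa, RoBERTa, BERT); 6 classes correspond to labels A-F.
rng = np.random.default_rng(0)
train_probs = [rng.dirichlet(np.ones(6), size=100) for _ in range(4)]
test_probs = [rng.dirichlet(np.ones(6), size=20) for _ in range(4)]
y_train = rng.integers(0, 6, size=100)            # placeholder gold labels

X_train = build_features(train_probs, mode="concat")   # P^C feature vectors
X_test = build_features(test_probs, mode="concat")

clf = LinearSVC()                 # one of several ML models tried as the final stage
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```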

Fig. 1. Proposed ensemble neural architecture

5 Experiments

In this section, we discuss the evaluation of the proposed methods. We report model performance using well-established metrics such as accuracy (Acc), macro F1 score (\(F_{macro}\)), precision (Prec) and recall (Rec).
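
For reference, these metrics can be computed with scikit-learn as in the toy snippet below; the labels shown are illustrative only.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["A", "B", "C", "A", "F", "E"]   # toy gold labels (source models A-F)
y_pred = ["A", "B", "C", "C", "F", "D"]   # toy predictions

acc = accuracy_score(y_true, y_pred)
prec, rec, f_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"Acc={acc:.3f} Prec={prec:.3f} Rec={rec:.3f} F_macro={f_macro:.3f}")
```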

5.1 Baselines

We establish Linear Support Vector Classification (SVC), Logistic Regression (LR), and Random Forests (RF) as baselines, where each baseline model takes two distinct feature sets – word n-grams and character n-grams. We also explored other baselines like the Symanto Brain Few-shot and Zero-shot without label verbalization approaches, but due to their relatively low performance compared to the approaches presented in Table 5, we do not report those results.
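
A minimal sketch of these n-gram baselines as scikit-learn pipelines is shown below; the TF-IDF weighting, n-gram ranges, and classifier hyper-parameters are illustrative assumptions, not the exact settings used in our experiments.

```python
from sklearn.base import clone
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

feature_sets = {
    "word_ngrams": TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    "char_ngrams": TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
}
classifiers = {
    "linear_svc": LinearSVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300),
}

# One pipeline per (feature set, classifier) pair; each is fit on the training
# texts and evaluated on the held-out test split.
baselines = {
    (feat, name): make_pipeline(clone(vec), clone(clf))
    for feat, vec in feature_sets.items()
    for name, clf in classifiers.items()
}
```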

5.2 Implementation Details

During model training, we set aside 20% of the training data for validation. For the held-out testing phase, however, the validation set is merged back into the training set. The following hyper-parameters are used for model fine-tuning: a batch size of 128, a learning rate of \(3 \times 10^{-5}\), a maximum sequence length of 128, and 20 training epochs. We also used a sliding window to prevent the truncation of longer sequences, allowing the model to handle longer sentences.
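
The sketch below illustrates this fine-tuning setup with the Hugging Face Transformers Trainer using the hyper-parameters listed above. The checkpoint is one of the models in Table 4, the sliding-window stride is an assumed value, and train_dataset and validation_dataset are placeholders for the tokenized AuTexTification splits.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/deberta-base"   # one of the LLMs listed in Table 4
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=6)

def tokenize(batch):
    # Sliding window: return_overflowing_tokens with a stride splits long texts
    # into overlapping 128-token chunks instead of truncating them away.
    return tokenizer(batch["text"], truncation=True, max_length=128,
                     stride=32, return_overflowing_tokens=True)

training_args = TrainingArguments(
    output_dir="model-attribution",
    per_device_train_batch_size=128,
    learning_rate=3e-5,
    num_train_epochs=20,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,        # placeholder: tokenized training split
    eval_dataset=validation_dataset,    # placeholder: 20% held-out validation split
    tokenizer=tokenizer,                # enables dynamic padding per batch
)
trainer.train()
```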

Table 5. Baseline results of model attribution for both English and Spanish.
Table 6. Results of model attribution on the English dataset
Table 7. Results of model attribution on the Spanish dataset

5.3 Results

Table 5 shows results produced using three traditional ML methods (Linear SVC, LR, and RF) across two different feature sets (word n-grams and character n-grams) for both languages. LR with character n-grams outperforms other approaches on the macro F1 performance metric for both languages.

Tables 6 and 7 provide results on the English and Spanish datasets, respectively, with different variants of the proposed architecture. The first block in each table shows the results for the individual LLMs. The second and third blocks show the ensemble results with \(P^{C}\) and \(P^{A}\), respectively, as the input feature vector to several machine learning models.

Fig. 2. Class-wise F-scores for the best-performing baseline (LR with character n-grams) and the proposed ensemble method (Linear SVC) on the English dataset

The results on the English test data are shown in Table 6. Out of all the combinations, Linear SVC with the concatenated feature vector (\(P^{C}\)) as input outperforms the other approaches on a majority of the evaluation metrics, with an \(F_{macro}\) score of 0.63. Table 7 shows the results on the Spanish test dataset, where the Linear SVC classifier with the concatenated feature vector (\(P^{C}\)) as input outperforms the other approaches with an \(F_{macro}\) score of 0.656.

Overall, we observed that the ensemble models performed well when compared to the individual LLMs. Ensembling the models provides additional cues from each individual model, which helps enhance the performance. Furthermore, several variants of the proposed framework outperform each of the baselines across the evaluated metrics.

Table 8. Samples from the English test dataset where the prediction from the ensemble model (Linear SVC) is accurate while the predictions from the individual LLMs are not.

Figure 2 shows the class-wise performance comparison of our best ensemble method (Linear SVC) with that of the best baseline (LR with character n-grams) on the English and Spanish datasets. For every class in both datasets, the F1 score of the proposed method exceeds that of the baseline. Even though the LLMs we explore have comparatively modest parameter counts, our proposed ensemble approach performs very well on text generated by the largest model, text-davinci-003, with 175B parameters.

Tables 8 and 9 show a few samples from the test data for English and Spanish, respectively. For these samples, we demonstrate that while no individual LLM predicts the ground truth label correctly, the ensemble Linear SVC classifier predicts the correct label. We also show the ground truth label associated with each sample.

Table 9. Samples from the Spanish test dataset where the prediction from the ensemble model (Linear SVC) is accurate while the predictions from the individual LLMs are not.

6 Conclusion

In this paper, we explored generative language model attribution for the English and Spanish languages. We proposed an ensemble neural architecture where the probabilities of individual LLMs are concatenated and passed as input to machine learning models. Each of the variants of the proposed ensemble approach outperformed several traditional machine learning baselines and the individual LLMs for both languages. Our model achieves \(F_{macro}\) scores of 0.63 and 0.656 on the English and Spanish data, respectively, outperforming the baseline approaches. Our analysis showed that the proposed approach is also effective at classifying samples generated by LLMs with a large number of parameters. Our approach also performs well on out-of-domain data, since the domains in the test dataset differ from those in the training dataset. Directions for future work include developing a multi-task approach for generative language model attribution as well as exploring other multilingual datasets.