1 Introduction

In this paper, we study authorship attribution in the field of Vietnamese literature. Authorship Attribution (AA) is the task of assigning a text to the correct author among a closed set of candidate authors; in other words, it is the task of identifying the author of a text.

The most significant difference between AA and other text classification tasks is the need to capture the specific style of each author. This area has attracted attention because of its relevance to many applications, including identifying the author of anonymous documents or phishing emails (Chask 2005; Lambers and Veenman 2009; Iqbal et al. 2010), and plagiarism detection (Kimler 2003; Gollub et al. 2013).

Existing AA methods focus on capturing each author’s writing style through models such as SVM (Schwartz et al. 2013), CNN (Shrestha et al. 2017) and RNN (Bagnall 2015). Stylistic features captured through word and character n-grams, together with syntactic and semantic information (Ding et al., 2016), have been used in this task. In addition to learning features directly from the original text (such as word and character n-grams), some models use stylometric features, such as text length and the frequency of the first characters (Stamatatos, 2013). Huang et al. (2020) introduce latent style features, such as the emotional orientation of each user’s tweets, and obtain good results.

In machine learning approaches, authorship attribution can be treated as a form of text classification. Let D = {d1, d2, …, dn} be a set of documents and A = {a1, a2, …, am} be a fixed set of authors. The task of authorship attribution is to assign an author from A to each document in D. Deep learning models have achieved strong results in various text classification tasks, and pre-trained language models can be fine-tuned for a target task with only a few labeled training samples. Wang et al. (2021) integrate RoBERTa fine-tuning with user writing styles for AA on tweets, achieving state-of-the-art results to date.

In this research, we propose an approach based on transfer learning, fine-tuning PhoBERT, a pre-trained language model for Vietnamese. We add a dense layer with softmax activation to classify the author, and the model, which we call ViBert4Author (V4A), is trained over several epochs. Our tests were performed on the VN-Literature dataset, which we collected with a crawler tool developed to collect and extract data from the Internet. The collected data are Vietnamese literary works with clearly identified authors that have been made public on the Internet. Unlike most deep learning methods for AA, V4A requires neither preprocessing nor feature engineering. Our method offers a state-of-the-art (SOTA) approach through transfer learning, with significantly improved accuracy relative to the compared methods. We also evaluate and compare it with other methods to show the architecture’s strengths and weaknesses, and we build an overall architecture that combines style with hybrid features to improve the macro-averaged F1-score. In addition, we evaluate on two published standard English corpora: the Blog Authorship Corpus (introduced by Schler et al., 2004) and Enron Email (released by the Federal Energy Regulatory Commission). For these tests, we use pre-trained multilingual models (BERT, XLM-R and DistilBERT) with transfer learning techniques.

The following section presents research related to author classification. Section 3 presents the self-built corpus and an overview of the corpora used for comparison. Section 4 details the existing and proposed architectures. Section 5 describes the results obtained. Section 6 discusses our method, as well as future work. Finally, Sect. 7 presents our conclusions.

2 Related Work

Traditionally, studies have relied on feature extraction (Fig. 1), either from content (such as Bag-of-Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF) at the word or n-gram level) or from writing style (such as the use of punctuation, capital letters, numbers, and POS tags, as in the study of Stamatatos 2009). More recently, approaches using neural networks such as the Convolutional Neural Network (CNN) have become more widely available for AA tasks, especially transfer learning methods. In addition, some authors build their models with ensemble methods, as described in Fig. 2. The rest of this section presents recent studies of these methods.

Fig. 1. Authorship Attribution classic pipeline

Fig. 2. Authorship Attribution Ensemble model

Usually, the set of features analyzed in a text is divided into five categories (according to Stamatatos 2009): vocabulary, structure, specific content, syntax, and features. Lexical (vocabulary) features are defined over the characters and words an individual uses. They include the distribution of uppercase letters and special characters, the average length of words used per sentence, and other characteristics. This set of features describes an author’s lexical richness. Structural features describe how writers organize elements in a text, such as the number of paragraphs and sentences or their average length. Iyer et al. (2019) addressed identifying the author of a manuscript of a literary work using a model trained on 50 authors. They treat the task as a multi-class text classification problem and propose a supervised machine learning model with stylometric feature extraction. After cross-validation and optimization, the accuracy reaches nearly 93%.

The authorship of short texts collected from Twitter was examined by Huang et al. (2022). The authors proposed a method of learning text representations that uses a joint representation of word and character n-grams as input to a neural network (NN). In addition, they used an additional set of 10 features: text length, number of usernames, subject, emoji, URL, numeric expression, time expression, date expression, month, degree of polarity and degree of subjectivity. The models evaluated are CNN and LSTM. The method reaches an accuracy of 83.6% on a corpus containing 50 authors.

An approach combining words, n-grams, and latent Dirichlet allocation (LDA) has been proposed by Anwar et al. (2018). The LDA-based approach can process sparse data and large texts, providing a more accurate representation. The described approach is an unsupervised computational method that can account for the dataset’s heterogeneity, multiple text styles, and the specifics of the Urdu language. It was tested on 6000 texts written by 15 authors in Urdu, using the improved sqrt-cosine similarity as the classifier. The accuracy achieved is 92.89%.

Dmitrin et al. (2018) present the analysis and application of different NN architectures (RNN, LSTM, CNN, bidirectional LSTM). The study was conducted on three datasets in Russian (Habrahabr blog: 30 authors, average text length 2000 words; vk.com: 50 and 100 authors, average text length 100 words; Echo.msk.ru: 50 and 100 authors, average text length 2000 words). CNN achieved the best results (87% for the Habrahabr blog, 59% and 53% for 50 and 100 authors with vk.com, respectively). In addition, the authors found that character tri-grams are not very good for short texts from social networks. In contrast, for longer texts, tri-grams and tetra-grams achieve almost the same accuracy (84% for tri-grams, 87% for tetra-grams).

Convolutional Neural Networks (CNN) can extract features from raw signals, as in speech or vision processing. Ruder et al. (2016) explored word- and character-level CNNs for AA and found that character-level CNNs tend to perform better than other simple approaches based on SVM. Rhodes applied a multi-layer CNN with max-over-time pooling to an n-gram model with 3-, 4- and 5-grams as input.

Uchendu et al. (2020) used human-written and machine-written texts (CTRL, GPT, GPT2, GROVER, XLM, XL-NET, PPLM, FAIR) to perform authorship verification between texts written by human writers and by text generators. Most machine-written texts differ significantly from human texts, which makes it easier to identify the author. However, documents generated with the GPT2, GROVER and FAIR models are of better quality than those of the other generators, which leads to confusion in the classification process. For this study, the authors used a convolutional neural network (CNN) because the CNN architecture is suitable for representing each author’s characteristics. In addition, they improved the CNN implementation by using word n-grams and part-of-speech (PoS) tags. The “human-machine” classification result ranges from 81% to 97%, depending on the generation method.

3 The Corpora

This paper presents a corpus that we collected ourselves with a tool we developed (Fig. 3). The collected data comprise 839 Vietnamese prose works by eight authors. Raw data is collected and stored as files without preprocessing. The structured data is stored as a CSV document with the following fields: work, title, content, and author (Table 2). Each work has a single author, with no co-authors, which makes identifying authorship from style clearer and helps the classifier achieve better results. The content consists of literary works; each work is a collection of many sentences, and works differ in length. In our tests, we examined the length profile of the texts to reduce the standard deviation in length between them. Table 1 presents the statistics of the data we collected.

Fig. 3. Overview of our data collection and processing tools

Next, we evaluate the model on two publicly available datasets, the Blog Authorship Corpus and Enron Email. Many researchers have used these two datasets in recent years for the task of identifying the author of a text from a list of N potential authors. Details of the datasets are presented below.

The Blog Authorship Corpus (publicly available on Kaggle) consists of the collected posts of 19,320 bloggers, gathered from blogger.com in August 2004. It was introduced by Schler et al. as part of a study on the effects of age and gender on blogging. The archive combines 681,288 posts and over 140 million words, or about 35 posts and 7250 words per person. Each blog is stored in a separate file, the name of which indicates the blogger’s ID along with the gender, age, industry and astrological sign provided by the blogger.

Table 1. Statistics of collected datasets
Table 2. Collected data storage details

Enron Email is a document repository containing more than 0.5 million emails. This data was initially made public and posted to the web by the Federal Energy Regulatory Commission during its investigation. It includes the data of 150 users, most of whom are senior Enron managers. This dataset was used in the study of Klimt and Yang (2004) on the email classification problem, and it is publicly available. The Enron Email archive has been used for several tasks, including authorship analysis in Halvani et al. (2020).

For our tests, we consider the eight authors with the largest number of documents in each dataset, which gives the most objective comparison between the proposed model on these datasets and on the VN-Literature dataset. Table 3 presents summary statistics on the text length and number of documents per author for each dataset tested. These statistics show that the Enron Email and VN-Literature datasets are similar in size, both in the length of each text and in the number of documents per author. For the Blog dataset, we likewise selected the top 8 authors with the most text.
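The author selection described above can be sketched as follows. This is a minimal illustration, not the code used in our experiments; the `(author, text)` record layout, the function name `select_top_authors`, and the toy data are assumptions made for the example.

```python
from collections import Counter


def select_top_authors(records, k=8):
    """Keep only documents written by the k authors with the most documents.

    `records` is a list of (author, text) pairs; this layout is an assumption.
    """
    counts = Counter(author for author, _ in records)
    top = {author for author, _ in counts.most_common(k)}
    return [(author, text) for author, text in records if author in top]


# Hypothetical toy data: three authors, keep the two most prolific.
records = [
    ("a", "t1"), ("a", "t2"), ("a", "t3"),
    ("b", "t4"), ("b", "t5"),
    ("c", "t6"),
]
subset = select_top_authors(records, k=2)
```

With `k=8`, the same function would apply to any of the three datasets once they are loaded into this layout.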

Table 3. Summary statistics on each author’s average length and the number of documents on the dataset

4 The Methodologies

4.1 A Brief Introduction to Authorship Attribution Task

Authorship Attribution (AA) is the process of assigning an author to an anonymous document based on the characteristics of the text. Several attribution methods have been developed for natural languages such as English, Russian, Chinese, and Dutch. However, the number of works on Vietnamese is still limited. Many machine learning models have been tested, including conventional machine learning and deep learning models. However, it is rarely noted that combining stylometric features with deep learning models, or using transfer learning techniques, can significantly affect classifier performance. We therefore study their use by building a model that identifies the likely author of a Vietnamese text, specifically a prose work, by combining writing style with the PhoBERT model (a pre-trained language model for Vietnamese). We evaluated these models on a large Vietnamese dataset of works by eight different authors, collected with self-developed tools, and compared them with other existing methods. The test results show that our model provides the best results and can attribute a text’s author with an accuracy of 84.7%. Furthermore, the comparison with related methods indicates that our proposed method is suitable for authorship attribution.

Fig. 4. Overview of our proposed method model (V4A) combining writing style, feature synthesis, and the BERT language model

First, we introduce a baseline model. The AA problem can be reduced to a text classification problem, which Aggarwal and Zhai (2012) define as follows: we are given a set of training documents \(D = \{X_1, X_2, \ldots, X_N\}\), where each \(X_i \in D\) has a label from the set \(\{1, \ldots, k\}\). The training data is used to build the classification model; then, for incoming unlabeled data, the model predicts a label. We discuss each approach in the following sections and present our proposed method.

4.2 Traditional Method

Traditionally, AA has relied on extracting content-related features (e.g., Bag-of-Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF) at the word or n-gram level) or an author’s stylistic features (e.g., the use of punctuation, capital letters, numbers, and POS tags), following Stamatatos (2009). A classifier is then trained on these features, such as the popular Logistic Regression (LR) used by Madigan et al. (2005) and Anwar et al. (2018). In addition, we use other machine learning models: Multinomial Naive Bayes, Decision Tree, and Random Forest.

  1. Multinomial Naive Bayes: This algorithm predicts and classifies data based on observed data and statistics, using Bayes’ theorem of probability. Multinomial Naive Bayes is a popular supervised learning algorithm because it is relatively easy to train and achieves high performance (I. Rish et al., 2001).

  2. Logistic Regression: A classification algorithm, binary in its basic form, that is simple, well known, and among the most important methods in machine learning. By analyzing the relationship between the available independent variables, the logistic regression model predicts a dependent variable (Genkin et al., 2007; Hosmer Jr et al., 2013). In natural language processing, however, this method requires features extracted manually from the data; in our tests, we used TF-IDF, stylometric, and character n-gram features.

  3. Decision Tree: The decision tree is one of the most powerful and popular methods for classification (Pranckeviciu, 2017). It is a tree structure in which each internal node represents a test on an attribute, each branch is an outcome of the test, and each leaf node is a class label of the target variable.

  4. Random Forest: Random forest is a supervised machine learning method for classification and regression tasks (Davidson, 2017). The Random Forest model is very effective for classification problems because it combines hundreds of smaller internal models with different rules to make the final decision.
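To make the hand-crafted features used by these classifiers concrete, the following minimal sketch computes a few stylometric ratios and character n-gram counts. The particular features (punctuation, uppercase and digit ratios, average word length) and the function names are illustrative assumptions, not the exact feature set of our experiments.

```python
import string
from collections import Counter


def stylometric_features(text):
    """Return a few simple style ratios for one text (illustrative feature set)."""
    n_chars = max(len(text), 1)  # avoid division by zero on empty input
    words = text.split()
    return {
        "punct_ratio": sum(ch in string.punctuation for ch in text) / n_chars,
        "upper_ratio": sum(ch.isupper() for ch in text) / n_chars,
        "digit_ratio": sum(ch.isdigit() for ch in text) / n_chars,
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }


def char_ngrams(text, n):
    """Count overlapping character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


features = stylometric_features("Ab, 1!")
bigrams = char_ngrams("abab", 2)
```

Vectors of this kind, or TF-IDF weights computed over the n-gram counts, would then be passed to a classifier such as LR.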

4.3 Deep Learning

Convolutional Neural Network (Text-CNN): Convolutional Neural Network (CNN) is a multi-layer neural network architecture developed for classification (Kim et al., 2014). Its convolutional layers can detect combinations of neighboring features. In our test, we used it together with pre-trained word embeddings covering 157 languages (fastText), as suggested by Grave et al. (2018). This embedding turns a word into a 300-dimensional vector. Finally, a softmax function is applied to the network output to predict the label of the text.
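As a hedged illustration of how one convolutional filter detects a local pattern in a sequence of word vectors, the sketch below slides a single filter over toy embeddings and keeps its strongest response. The max-over-time pooling step follows the usual Text-CNN design and is an assumption here, as are the 2-dimensional toy vectors and the function names.

```python
def conv1d_feature_map(embeddings, kernel, bias=0.0):
    """Slide one filter over a sequence of word vectors and apply ReLU."""
    width = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = bias
        for word_vec, kernel_row in zip(window, kernel):
            total += sum(x * w for x, w in zip(word_vec, kernel_row))
        feature_map.append(max(total, 0.0))  # ReLU
    return feature_map


def max_over_time(feature_map):
    """Keep the strongest activation of the filter across all positions."""
    return max(feature_map)


# Toy 2-dimensional vectors for a 4-word text (the fastText vectors have 300 dimensions).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # one filter spanning 2 consecutive words
feature_map = conv1d_feature_map(embeddings, kernel)
pooled = max_over_time(feature_map)
```

A full model would use many such filters of several widths and feed the pooled values to the softmax layer.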

Bidirectional Long Short-Term Memory (Bi-LSTM): Bi-LSTM (Schuster et al., 1997) is a well-known variant of Recurrent Neural Networks (RNN) (Medsker et al., 1999). Bi-LSTM can be trained using all available input information in both the past and the future of a given time frame. This method is considered very powerful for classification problems and often achieves high performance. Therefore, we use it for comparison with the other classification models in this task.

4.4 Transformers Model

Transfer learning has recently attracted the attention of NLP researchers because of its outstanding effectiveness. Above all, transformer models are an advanced architecture that relies on attention mechanisms and deep neural networks, replacing the recurrent layers inside the encoder-decoder with special layers called multi-head self-attention (Yang et al. 2019). The most prominent is the language model BERT (Bidirectional Encoder Representations from Transformers), proposed by Devlin et al. (2019). BERT and its variants, such as DistilBERT (Sanh et al., 2020), XLM-R (Conneau et al. 2020), and especially PhoBERT (Nguyen et al., 2020), have demonstrated their strength in natural language processing tasks in recent years.

4.5 Proposed Model

While examining previous studies on authorship attribution, we found no proposed model that combines pre-trained language models with stylometric features. That led us to introduce a new model, ViBert4Author (V4A), a simple fine-tuning of PhoBERT with a dense layer and softmax activation, trained for several epochs for authorship attribution. The output size of the dense layer corresponds to the number of authors in the corpus.

For Vietnamese, the state-of-the-art (SOTA) method is PhoBERT, developed by Nguyen et al. (2020) to solve Vietnamese NLP problems. PhoBERT is a pre-trained model that shares the same idea as RoBERTa, a BERT replication study proposed by Liu et al. (2019), and has been adapted for Vietnamese. BERT, widely used in research in recent years, is a contextual word representation model built using bidirectional transformers and based on a masked language model. As described by Sun et al. (2020), to use BERT as a classifier, a simple dense layer with a softmax activation function is applied to the final hidden state h of the first token [CLS] through a weight matrix W, and the probability of label c is predicted as follows:

$$ p(c \mid h) = \mathrm{softmax}(Wh) $$

Then, all weights, including those of BERT and those of W, are adjusted to maximize the log probability of the correct labels; that is, training uses the Cross-Entropy loss function. In this study, we use PhoBERT, a pre-trained language model available through the Transformers library (Wolf et al., 2020) and trained on a large Vietnamese corpus. Fine-tuning for the AA task was performed on Google Colab with a Tesla T4 GPU (PCIe, 16 GB).
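The classification head and its loss can be illustrated with a short numerical sketch. It only reproduces the formula p(c | h) = softmax(Wh) and the cross-entropy of one example on toy values; the matrix W, the vector h, and the function names are hypothetical, and the actual fine-tuning relies on the Transformers library rather than on this code.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W, h):
    """Compute p(c | h) = softmax(W h) for a weight matrix W with one row per author."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    return softmax(logits)


def cross_entropy(probs, label):
    """Negative log-probability of the correct label."""
    return -math.log(probs[label])


# Toy values: 3 authors and a hidden state of size 2.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
h = [2.0, 0.0]
probs = classify(W, h)
loss = cross_entropy(probs, label=0)
```

Minimizing this loss over the training set is equivalent to maximizing the log probability of the correct labels.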

Inspired by previous studies, we have incorporated writing style features into V4A through two models, called V4A + Style and V4A + Style + Hybrid Feature, each combined with a Logistic Regression model. The hybrid features are character-level bi-grams and tri-grams, classified by an LR classifier. Finally, we collect the output probabilities of the BERT model and of the stylometric and hybrid-feature classifiers, which are concatenated and classified using an additional LR classifier. Such a model examines content, writing style, and the synthesis of features. The architecture of V4A + Style + Hybrid Feature is shown in Fig. 4.
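The concatenation step of this ensemble can be sketched as below, assuming each base classifier outputs one probability per author. The function name and the probability values are hypothetical, and the final LR classifier that consumes the stacked vector is not shown.

```python
def stack_probabilities(bert_probs, style_probs, hybrid_probs):
    """Concatenate per-author probabilities from the three base classifiers."""
    n_authors = len(bert_probs)
    if len(style_probs) != n_authors or len(hybrid_probs) != n_authors:
        raise ValueError("all classifiers must score the same set of authors")
    return list(bert_probs) + list(style_probs) + list(hybrid_probs)


# Hypothetical probabilities for one document and 3 authors.
meta_features = stack_probabilities(
    [0.7, 0.2, 0.1],  # fine-tuned language model
    [0.5, 0.3, 0.2],  # LR on stylometric features
    [0.6, 0.1, 0.3],  # LR on character bi-grams and tri-grams
)
```

Under this assumption, a corpus with eight authors yields a 24-dimensional input for the final LR classifier.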

5 Results

The parameters chosen for the tested architectures are shown in Table 4. We ran tests on eight authors for all three datasets presented above. Our model was trained for 5 epochs in each test. The results are shown in Tables 5 and 6. We held out 20% of the data as the test set using stratified sampling, i.e., the proportion of each class is kept equal in the train and test sets. We compare the proposed approach with traditional and deep learning models using the feature extraction methods presented earlier. In addition, we compare our approach with the word-level TF-IDF - LR model with root word removal and word segmentation. We also add a benchmark for the performance of an LR trained only on stylistic features and an additional LR trained on character n-gram hybrid features.
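The stratified 80/20 split can be illustrated with the following minimal sketch. It is an illustration of the sampling principle only, written with the standard library; the function name, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: author "a" has 10 documents, author "b" has 5.
labels = ["a"] * 10 + ["b"] * 5
train_idx, test_idx = stratified_split(labels)
```

Each author contributes the same fraction of documents to the test set, so the class proportions in the train and test sets match.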

Table 4. Parameters of the experiments

In this study, we focus on evaluation on the VN-Literature dataset we built ourselves. The results show that the proposed model V4A (fine-tuned from the pre-trained language model PhoBERT) works well for author identification in Vietnamese literature. On our dataset, V4A outperforms traditional machine learning models using TF-IDF. Among the conventional machine learning models, Random Forest scores highest, with a measured accuracy of 70.8%. The best deep neural network model is BiLSTM combined with the pre-trained fastText word embedding, with an accuracy of 71%. With the proposed approach, accuracy exceeds 80% with multilingual models such as BERT, XLM-RoBERTa and DistilBERT. The PhoBERT language model, which is built specifically for Vietnamese, gives the best accuracy, up to 84.6%, and combining writing style and hybrid features helps increase accuracy further.

The two other datasets included in our experiment also gave strong results. Data preparation takes a long time for the Enron Email dataset, which contains much unimportant information. We decided to delete short emails and emails containing special characters and signatures. Experimental results show that the state-of-the-art models (excluding the monolingual PhoBERT) improve on traditional models based on TF-IDF and deep learning models with word embeddings, with an average accuracy increase of more than 13.3%.

The Blog Authorship dataset poses a greater challenge because it is a large dataset, so we selected only the top 8 authors for our test, which gives an average word count of about 1700 for each author and provides enough data for model training and evaluation. Our testing shows that traditional machine learning and deep learning models still do not perform as well as the proposed model combining pre-trained multilingual models and feature techniques, which improves accuracy significantly, by 9.8%.

Table 5. A detailed description of the results of our tests based on two English datasets
Table 6. A detailed description of the results of our tests based on the Vietnamese dataset (VN-Literature)

6 Discussion

Our research combined a Vietnamese language model with writing style and feature extraction methods to achieve strong scores on the AA task. We improve the results by adding a dense layer with softmax activation on top of the pre-trained language model PhoBERT. The approach of BERT and PhoBERT relies mainly on the inherent word representation, syntax and context, without requiring any preprocessing. Previous studies (Sari et al. 2018) have shown that using stylometric and hybrid features improves the accuracy of AA tasks. We also show that adding such features to pre-trained language models can improve the macro-averaged F1-score. Our proposed model V4A, fine-tuned from the pre-trained language model PhoBERT, shows similar effects. That indicates that our model works well when each author has enough training data. However, having enough data available for all authors is not easy in practice. In addition, short texts from some authors also cause some imbalance in the data.
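For reference, the macro-averaged F1-score discussed above is the unweighted mean of the per-author F1-scores, so every author counts equally regardless of how many documents they have. The sketch below is a minimal illustration on hypothetical labels, not our evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical predictions for two authors with unequal document counts.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
score = macro_f1(y_true, y_pred)
```

In this toy case, the accuracy is 0.75 while the macro-averaged F1-score is lower, because the error weighs more heavily on the smaller class.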

There are many possible ways to extend this study. Our experiments on three datasets (VN-Literature, Enron Email, and Blog Authorship) successfully used transformer-based language and representation models. In addition, we repeated them with other pre-trained multilingual models that are variations of BERT, with positive results. In future work, we will focus on building a more extensive experimental dataset with a larger number of authors. Next, we will attempt to extract additional typographic, association, profiling, or content-related features.

7 Conclusion

The development of science and technology has led to increasingly advanced research. Deep learning and transformer techniques are gradually becoming more popular in natural language processing, while feature engineering and text preprocessing are becoming less and less necessary. In this work, we presented V4A, a fine-tuning approach based on pre-trained PhoBERT for author classification. We also introduced a self-developed dataset for the task of author classification in Vietnamese literature. Our model works best when each author has enough training data, there is no significant label imbalance, and the text is not too short. In addition, we applied style features (as in previous studies) and hybrid features to V4A, which significantly improved the F1-score compared with the model alone. Future work will explore more features for V4A, build more datasets with a larger number of authors, and review and suggest ways to deal with large data imbalances. Finally, we will explore other pre-trained language models and expand our approach to Authorship Attribution.