Vietnamese Text’s Writing Styles Based Authorship Identification Model

Dong, Khoa Dang; Nguyen, Dang Tuan

doi:10.1007/978-981-19-8069-5_23

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1688)

Included in the following conference series:

International Conference on Future Data and Security Engineering

Abstract

Authorship identification is a research topic in natural language processing that has attracted growing interest in recent years. Earlier work identified the author of a text through a wide variety of feature extraction methods. More recently, deep learning approaches have been applied to authorship attribution. This paper introduces a new model called ViBert4Author (V4A), a fine-tuned version of the pre-trained PhoBERT language model with an added dense layer and softmax output. Combined with feature extraction methods, it is used for author classification in Vietnamese literature. We also introduce VN-Literature, a dataset of over 800 works by 8 authors that we collected with self-developed tools. To evaluate the model further, we ran many tests on English datasets (blogs and emails published on Kaggle) using pre-trained multilingual models. We give a comprehensive analysis of the advantages and disadvantages of the proposed method. In addition, we evaluate the contribution of additional features (stylometric and hybrid features) using the F1-score. The results show that our proposed model improves on previous methods, and that the variant combining stylistic features with modern methods achieves the best performance.

1 Introduction

In this paper, we study the identification of authorship in the field of Vietnamese literature. Authorship Attribution (AA) is the task of assigning a text to the correct author among a closed set of candidate authors; in other words, it is the task of identifying the author of a text.

The most significant difference between AA and other text classification tasks is the need to capture the specific style of each author. This area has attracted attention because of its relevance to many applications, including identifying the author of anonymous documents or phishing emails (Chaski 2005; Lambers and Veenman 2009; Iqbal et al. 2010), and plagiarism detection (Kimler 2003; Gollub et al. 2013).

Existing AA methods focus on capturing each author’s writing style through models such as SVM (Schwartz et al. 2013), CNN (Shrestha et al. 2017) and RNN (Bagnall 2015). Stylistic features captured through word and character n-grams, together with syntactic and semantic information (Ding et al., 2016), have been used in this task. In addition to learning features directly from the original text (such as word and character n-grams), some models use stylometric features, such as text length and character frequencies (Stamatatos, 2013). Huang et al. 2020 introduce latent style features, such as the emotional orientation of each user’s tweets, and obtain good results.

In machine learning approaches, authorship attribution can be thought of as a form of text classification. Let D = {d₁, d₂, …, d_n} be a set of documents and A = {a₁, a₂, …, a_m} be a fixed set of authors. The main task of authorship attribution is to assign an author from A to each document in D. Deep learning models have given significant results in various text classification tasks, and pre-trained language models can be fine-tuned for a target task with only a few labeled training samples. Wang et al. (2021) integrate RoBERTa fine-tuning and user writing styles for AA on tweets, achieving state-of-the-art results.

In this research, we propose an approach based on transfer learning, fine-tuning a pre-trained language model for Vietnamese named PhoBERT. We add a dense layer with softmax activation to classify the author, and train the model over several epochs. We tested on the VN-Literature dataset, which we collected with a crawler tool developed to gather and extract data from the Internet. The collected data are Vietnamese literary works with clearly identified authors that have been made public on the Internet. Unlike most deep learning methods for AA, V4A requires neither preprocessing nor feature engineering. Our method provides a state-of-the-art (SOTA) approach through transfer learning, with significantly improved relative accuracy. In addition, we evaluate and compare it with other methods to show the architecture’s strengths and weaknesses. We also build an overall architecture that combines style and hybrid features to improve the macro-averaged F1-score. Our evaluation also used published standard English corpora: the Blog Authorship Corpus (introduced by Schler et al., 2004) and Enron Email (released by the Federal Energy Regulatory Commission). For these tests we used pre-trained multilingual models (BERT, XLM-R and DistilBERT) with transfer learning techniques.

The following section presents research related to the problem of author classification. Section 3 presents the self-built corpus and an overview of the corpora used for comparison. Section 4 details the existing architectures and our proposed one. Section 5 describes the results obtained. Section 6 discusses our method, as well as future work. Finally, Sect. 7 presents our conclusions.

2 Related Work

Traditionally, studies have been based on feature extraction (Fig. 1), either from the content (such as Bag-of-Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF) at the word or n-gram level) or from the writing style (such as the use of punctuation, capital letters, numbers, and POS tags, as in the study of E. Stamatatos 2009). More recently, approaches using Convolutional Neural Networks (CNN) for AA tasks have become more widely available, as have transfer learning methods. In addition, some authors use ensemble methods, as described in Fig. 2, to build their models. The rest of this section presents recent studies of these methods.

Usually, the set of features analyzed in a text is divided into five categories (according to E. Stamatatos 2009): vocabulary, structure, specific content, syntax, and features. Lexical features are defined over the characters and words used by an individual. They include the distribution of uppercase letters, special characters, the average length of words used per sentence, and other characteristics. This set of features describes an author’s lexical richness. Structural features describe how writers organize the elements of a text, such as the number of paragraphs and sentences or their average length. Iyer et al. (2019) addressed identifying the author of a manuscript of any literary work, based on a pre-trained model with 50 authors. They treat the task as a multi-class text classification problem and propose a supervised machine learning model with stylometric feature extraction. After cross-validation and optimization, the accuracy increased significantly, reaching nearly 93%.
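The lexical and structural features described above can be illustrated with a minimal sketch. The function name `stylometric_features`, the particular features chosen, and the sentence and paragraph splitting rules are assumptions made for illustration; they are not the feature set used in any of the cited studies.

```python
import re
import string


def stylometric_features(text: str) -> dict[str, float]:
    """Compute a few illustrative lexical and structural style features."""
    words = re.findall(r"\w+", text)
    # Naive sentence split on ., ! and ? (an assumption, not a real tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    letters = [ch for ch in text if ch.isalpha()]
    n_chars = max(len(text), 1)
    return {
        # Lexical features
        "uppercase_ratio": sum(ch.isupper() for ch in letters) / max(len(letters), 1),
        "punctuation_ratio": sum(ch in string.punctuation for ch in text) / n_chars,
        "digit_ratio": sum(ch.isdigit() for ch in text) / n_chars,
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
        # Structural features
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
        "n_sentences": float(len(sentences)),
        "n_paragraphs": float(len(paragraphs)),
    }
```

Note that `string.punctuation` covers ASCII punctuation only, so a real Vietnamese pipeline would likely need a broader character set.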

The authorship of short texts collected from Twitter was examined by Huang et al. (2022). The authors proposed a method of learning text representations that uses a joint representation of word and character n-grams as input to a neural network. In addition, they used an additional set of 10 features: text length, number of usernames, subject, emoji, URL, numeric expression, time expression, date expression, month, degree of polarization and degree of subjectivity. The models evaluated are CNN and LSTM. The method has an accuracy of 83.6% on a corpus containing 50 authors.

An approach that combines words, n-grams, and latent Dirichlet allocation (LDA) has been proposed by Anwar et al. (2018). The LDA-based approach can process sparse data and large volumes of text, providing a more accurate representation. The described approach is an unsupervised computational method that can account for the dataset’s heterogeneity, multiple text styles, and the specifics of the Urdu language. It was tested on 6000 texts written by 15 authors in Urdu, with the improved sqrt-cosine similarity used as the classifier. The accuracy achieved is 92.89%.

Dmitrin et al. (2018) present the analysis and application of different NN architectures (RNN, LSTM, CNN, bidirectional LSTM). The study was conducted on three datasets in Russian (Habrahabr blog: 30 authors, average text length 2000 words; vk.com: 50 and 100 authors, average text length 100 words; Echo.msk.ru: 50 and 100 authors, average text length 2000 words). CNN achieved the best results (87% for the Habrahabr blog, and 59% and 53% for 50 and 100 authors on vk.com, respectively). In addition, the authors found that character tri-grams are not very good for short texts from social networks. In contrast, for longer texts, tri-grams and tetra-grams achieve almost the same accuracy (84% for tri-grams, 87% for tetra-grams).

Convolutional Neural Networks (CNN) can extract information from raw signals in speech or vision processing. Ruder et al. 2016 explored word- and character-level CNNs for AA and found that character-level CNNs tend to perform better than other simple approaches based on SVM. Rhodes applied a multi-layer CNN with max-over-time pooling to an n-gram model with 3-, 4- and 5-grams as input.

Adaku Uchendu et al. (2020) used human-written and machine-written texts (CTRL, GPT, GPT2, GROVER, XLM, XL-NET, PPLM, FAIR) to perform authorship verification between texts written by human writers and by text generators. Most machine-written texts differ significantly from human texts, which makes it easier to identify the author. However, documents generated with the GPT2, GROVER and FAIR models are of better quality than those from the other methods, which leads to confusion in the classification process. For this study, the authors used a convolutional neural network (CNN) because the CNN architecture is suitable for representing each author’s characteristics. In addition, they improved the CNN implementation by using word n-grams and part-of-speech (PoS) tags. The accuracy of the “human-machine” classification ranges from 81% to 97%, depending on the generation method.

3 The Corpora

This paper presents a corpus that we collected with a tool we developed. The collected data are Vietnamese prose works: 839 different works by eight authors. The raw data is collected and stored as files without preprocessing. The structured data is stored as a CSV document with the fields work, title, content, and author (Table 2). Each work has a single author with no co-authors, which makes style-based authorship identification clearer and helps classification. The content consists of literary works; each work is a collection of many sentences, and works differ in length. In our testing, we examined the length profile of each text of the same kind to reduce the standard deviation between sentences. Table 1 presents the statistics of the data we collected.

Next, we evaluate the model through two publicly available datasets from the Internet, Blog Authorship Corpus and Enron Email. Many different authors have studied these two datasets in recent years to perform the task of identifying the author of a text in a list of N potential authors on each dataset. Details of the datasets are presented below.

The Blog Authorship Corpus (publicly available on Kaggle) consists of the collected posts of 19,320 bloggers, gathered from blogger.com in August 2004. It was introduced by Schler et al. as part of a study on the effects of age and gender on blogging. The archive combines 681,288 posts and over 140 million words, or about 35 posts and 7250 words per person. Each blog is stored in a separate file, the name of which indicates the blogger’s id# and the gender, age, industry and astrological sign provided by the blogger.

Table 1. Statistics of collected datasets

Table 2. Collected data storage details

Enron Email is a document repository containing more than 0.5 million emails. The data was originally made public and posted to the web by the Federal Energy Regulatory Commission during its investigation. It includes the data of 150 users, most of whom are senior Enron managers. This dataset was used in the study of Klimt and Yang 2004 on the email classification problem. The data is publicly available online. The Enron Email archive has been used in research on several tasks, including authorship analysis in Halvani et al., 2020.

For our tests, we consider the eight authors with the largest number of documents in each dataset, to provide the most objective assessment of the proposed model and the VN-Literature dataset. Table 3 presents summary statistics on the document length and the number of documents per author for each dataset tested. These statistics show that the Enron Email and VN-Literature datasets are similar in text size, that is, in the length of each text and the number of documents per author. For the Blog dataset, we likewise selected the top 8 authors with the most text.

Table 3. Summary statistics on each author’s average length and the number of documents on the dataset

4 The Methodologies

4.1 A Brief Introduction to Authorship Attribution Task

Authorship Attribution (AA) is the process of assigning an author to an anonymous document based on the characteristics of the text. Several attribution methods have been developed for natural languages such as English, Russian, Chinese, and Dutch. However, the number of works related to Vietnamese is still limited. Many machine learning models have been tested, including conventional machine learning and deep learning models. However, it is not often mentioned that combining stylometric features with deep learning models, or using transfer learning techniques, can significantly affect classifier performance. Therefore, we propose to study their use by building a model that identifies the likely author of a given Vietnamese text, in particular prose works, through the combination of writing style and the PhoBERT model (a pre-trained language model for Vietnamese). We evaluated these models on a large Vietnamese dataset of works by eight different authors, collected through self-developed tools. We also compared them with other existing methods. The test results show that our model provides the best results and can attribute the text’s author with an accuracy of 84.7%. Furthermore, the comparison with related methods indicates that our proposed method is suitable for authorship attribution.

First, we introduce a baseline model. The AA problem can be reduced to a text classification problem, which Aggarwal and Zhai (2012) define as follows: given a set of training documents $D = \{X_1, X_2, \ldots, X_N\}$, each $X_i \in D$ has a label from the set $\{1, \ldots, k\}$. The training data is first used to build the classification model. Then, for incoming unlabeled data, the classification model predicts a label. We discuss each approach in the following sections and then present our proposed method.

4.2 Traditional Method

Traditionally, AA has relied on the extraction of content-related features (e.g., Bag-of-Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF) at the word or n-gram level) or of an author’s stylistic features (e.g., the use of punctuation, capital letters, numbers, and POS tags), following Stamatatos (2009). A classifier is then trained on these features, such as the popular Logistic Regression (LR) used by Madigan et al. (2005) and Anwar et al. (2018). In addition, we also use other machine learning models: Multinomial Naive Bayes, Decision Tree, and Random Forest.

1. Multinomial Naive Bayes: This algorithm predicts and classifies data based on observed data and statistics, using Bayes’ theorem of probability. Multinomial Naive Bayes is a popular supervised learning algorithm in machine learning because it is relatively easy to train and achieves high performance (I. Rish et al., 2001).
2. Logistic Regression: A binary classification algorithm, and one of the simplest and best-known methods in machine learning. By analyzing the relationship between all available independent variables, the logistic regression model predicts a dependent variable (Genkin et al., 2007; Hosmer Jr et al., 2013). In natural language processing, however, this method requires features extracted manually from the data; in our tests we used TF-IDF, stylometric, and character n-gram features.
3. Decision Tree: The decision tree is a powerful and popular method for classification (Pranckeviciu, 2017). A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch is a test outcome, and each leaf node is a class label of the target variable.
4. Random Forest: Random forest is a supervised machine learning method for classification and regression tasks (Davidson, 2017). The Random Forest model is very effective for classification problems because it combines hundreds of smaller internal models with different rules to make the final decision.
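To make the first of these classifiers concrete, the following is a minimal, dependency-free sketch of Multinomial Naive Bayes over word counts. The class name `MultinomialNB`, whitespace tokenization, and the Laplace smoothing parameter `alpha` are illustrative assumptions; the paper does not describe its implementation, and a practical system would more likely use an established library.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over word counts with Laplace smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # smoothing strength (assumed default)

    def fit(self, docs: list[str], labels: list[str]) -> "MultinomialNB":
        self.class_docs = Counter(labels)          # documents per author
        self.word_counts = defaultdict(Counter)    # word counts per author
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.n_docs = len(docs)
        return self

    def log_scores(self, doc: str) -> dict[str, float]:
        """Return log P(author) + sum of log P(word | author) per author."""
        scores = {}
        for label, n in self.class_docs.items():
            counts = self.word_counts[label]
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / self.n_docs)
            for word in doc.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + self.alpha) / total)
            scores[label] = score
        return scores

    def predict(self, doc: str) -> str:
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)
```

The classifier is fitted with `MultinomialNB().fit(docs, labels)` and queried with `predict(text)`, which returns the author label with the highest log score.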

4.3 Deep Learning

Convolutional Neural Network (Text-CNN): The Convolutional Neural Network (CNN) is a multi-layer neural network architecture developed for classification (Kim et al., 2014). Its convolutional layers make it possible to detect combined features. In our tests, we used the pre-trained fastText word embeddings, available for 157 different languages, proposed by Grave et al., 2018. This embedding turns each word into a 300-dimensional vector. Finally, a softmax function uses the resulting features to predict the label of the text.
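The core operation of a Text-CNN, a filter sliding over consecutive word vectors followed by max-over-time pooling, can be sketched in a few lines. This is an illustration only: the function name `conv1d_max_over_time`, the single filter, and the ReLU activation are assumptions, and the toy vectors in the usage stand in for the 300-dimensional embeddings mentioned above.

```python
def conv1d_max_over_time(embeddings, kernel, bias=0.0):
    """Slide one filter over a sequence of word vectors and max-pool over time.

    embeddings: list of word vectors (each a list of floats), length T
    kernel:     list of `width` weight vectors, same dimension as the embeddings
    Returns the largest ReLU activation over all T - width + 1 window positions.
    """
    width = len(kernel)
    if len(embeddings) < width:
        raise ValueError("sequence is shorter than the filter width")
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        # Dot product between the filter and the current window of word vectors.
        z = bias + sum(
            w * x
            for w_vec, x_vec in zip(kernel, window)
            for w, x in zip(w_vec, x_vec)
        )
        activations.append(max(0.0, z))  # ReLU
    return max(activations)


# Toy usage: four 2-dimensional "word vectors" and one filter of width 2.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
feature = conv1d_max_over_time(sentence, bigram_filter)
```

In a full model, many such filters of different widths each produce one pooled feature, and the concatenated features feed the softmax layer.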

Bidirectional Long Short-Term Memory (Bi-LSTM): Bi-LSTM (Schuster et al., 1997) is a well-known variant of Recurrent Neural Networks (RNN) (Medsker et al., 1999). A Bi-LSTM can be trained using all available input information in both the past and the future of a given time frame. This method is considered very powerful for classification problems and usually achieves high performance. Therefore, we use it as a point of comparison with the other classification models in this task.

4.4 Transformers Model

Transfer learning has recently attracted the attention of NLP researchers because of its outstanding effectiveness. In particular, transformer models are an advanced architecture that relies on attention mechanisms and deep neural networks, replacing the recurrent layers inside the encoder-decoder with layers called multi-head self-attention (Yang et al. 2019). The most prominent is the language model BERT, which stands for Bidirectional Encoder Representations from Transformers, proposed by Devlin et al., 2019. BERT and its variants, such as DistilBERT (Sanh et al., 2020), XLM-R (Conneau et al. 2020), and especially PhoBERT (Nguyen et al., 2020), have proven their strength in natural language processing tasks in recent years.
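As a rough illustration of the self-attention mechanism mentioned above, the sketch below computes single-head scaled dot-product attention with plain Python lists. It deliberately omits the learned query, key and value projections and the multiple heads of a real transformer layer; the function names are illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over a token sequence.

    Each argument is a list of vectors, one per token; queries and keys share
    dimension d. Returns one context vector per query token.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        context = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(context)
    return outputs
```

Each output vector is a weighted average of the value vectors, so every token's representation can draw on every other token in the sequence, which is what replaces recurrence.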

4.5 Proposed Model

While examining previous studies on authorship attribution, we found no proposed model that combines pre-trained language models with stylometric features. That led us to introduce a new model, ViBert4Author (V4A), a simple fine-tuning of PhoBERT with a dense layer and a softmax activation, trained for several epochs for authorship attribution. The output size of the dense layer corresponds to the number of authors in the corpus.

For Vietnamese, the state-of-the-art (SOTA) method is PhoBERT (Nguyen et al., 2020), developed to solve Vietnamese NLP tasks. PhoBERT is a pre-trained model that follows the same idea as RoBERTa, a BERT replication study proposed by Liu et al., 2019, adapted for Vietnamese. BERT, widely used in research in recent years, is a contextual word representation model built using bidirectional transformers and based on a masked language model. As described by Sun et al. (2020), to use BERT as a classifier, a simple dense layer with a softmax activation function is applied to the final hidden state h of the first token [CLS] through a weight matrix W, and the probability of label c is predicted as follows:

$$ p(c \mid h) = \mathrm{softmax}(Wh) $$

Then, all weights, including the weights of BERT and of W, are adjusted to maximize the log probability of the correct labels; training uses the cross-entropy loss function. In this study, we use PhoBERT, a pre-trained language model available through the Transformers library (Wolf et al., 2020) and trained on a large Vietnamese corpus. Fine-tuning for the AA task was done on Google Colab with a Tesla T4 GPU (PCIe, 16 GB).
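The classification head and loss described above can be written out directly. The sketch below is a plain-Python illustration of p(c | h) = softmax(Wh) and the cross-entropy loss for one example; in the actual model h would be the [CLS] hidden state produced by PhoBERT and W would be learned, whereas here both are hypothetical toy values supplied by the caller.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W, h):
    """Return p(c | h) = softmax(W h) for a weight matrix W (one row per author)."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    return softmax(logits)


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct label."""
    return -math.log(probs[true_class])


# Toy usage: 3 authors, a 2-dimensional "hidden state".
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
h = [2.0, 0.0]
probs = classify(W, h)
loss = cross_entropy(probs, true_class=0)
```

Minimizing this loss over the training set is equivalent to maximizing the log probability of the correct labels.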

Inspired by previous studies, we have incorporated writing style features into V4A through two models, called V4A + Style and V4A + Style + Hybrid Feature, each combined with a Logistic Regression model. Hybrid features are extracted from character-level bi-grams and tri-grams and passed through an LR classifier. Finally, we collect the BERT model’s output probabilities together with those from the stylometric and hybrid features, which are concatenated and classified using an additional LR classifier. Such a model takes into account the content, the writing style, and their combination. The architecture of V4A + Style + Hybrid Feature is shown in Fig. 4.
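Two pieces of this pipeline lend themselves to a short sketch: counting character bi-grams and tri-grams, and assembling the input of the final classifier. The sketch assumes that each component model (V4A, the style LR, the hybrid LR) outputs one probability per author and that these vectors are simply concatenated; the paper does not spell out the exact layout, and the names `char_ngrams` and `stack_features` are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, sizes=(2, 3)) -> Counter:
    """Count character-level n-grams (bi-grams and tri-grams by default)."""
    counts = Counter()
    for n in sizes:
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts


def stack_features(bert_probs, style_probs, hybrid_probs):
    """Concatenate per-author probabilities from the three component models.

    The result is the input vector for the final (meta) classifier.
    Assumption: every component scores the same authors in the same order.
    """
    n_authors = len(bert_probs)
    if not (len(style_probs) == len(hybrid_probs) == n_authors):
        raise ValueError("all components must score the same set of authors")
    return list(bert_probs) + list(style_probs) + list(hybrid_probs)
```

With eight authors, the stacked vector would have 24 entries under this assumption, which keeps the final LR classifier small compared with the component models.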

5 Results

The parameters we chose for the tested architectures are shown in Table 4. We ran tests on eight authors for each of the three datasets presented above. Our model was trained for 5 epochs in each test. The results are shown in Tables 5 and 6. We held out 20% of the data as the test set using stratified sampling, i.e., the proportion of each class is kept equal in the train and test sets. We compare the proposed approach with traditional and deep learning models using the feature extraction methods presented earlier. In addition, we compare our approach with the word-level TF-IDF - LR model with root word removal and word breaks. We also add a benchmark for the performance of an LR trained only on stylistic features and an additional LR trained on Char N-gram hybrid features.
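The stratified 80/20 split described above can be sketched without external libraries. The function name `stratified_split`, the fixed random seed, and the rounding rule for small classes are assumptions for illustration; the paper does not state how the split was implemented.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.2, seed=0):
    """Return (train_indices, test_indices) preserving per-class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy usage: an imbalanced two-author corpus.
labels = ["a"] * 10 + ["b"] * 20
train_idx, test_idx = stratified_split(labels)
```

Because sampling is done within each author's documents, an author with twice as many works contributes twice as many test documents.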

Table 4. Parameters of the experiments

In this study, we focus on evaluation with the VN-Literature dataset we built ourselves. The results show that the proposed model V4A (fine-tuned from the pre-trained language model PhoBERT) works well for author identification in Vietnamese literature. On our dataset, V4A outperforms traditional machine learning models using TF-IDF. Among conventional machine learning models, Random Forest scores highest, with a measured accuracy of 70.8%. The best deep neural network model is BiLSTM combined with pre-trained fastText word embeddings, with an accuracy of 71%. With the proposed approach the results are more impressive: accuracy exceeds 80% with multilingual models such as BERT, XLM-RoBERTa and DistilBERT, and the PhoBERT language model, which is built specifically for Vietnamese, gives the best accuracy, up to 84.6%. Combining writing style and hybrid features increases accuracy further.

The two other datasets included in our experiment also gave strong results. Data preparation takes a long time for the Enron Email dataset, which contains much unimportant information. We decided to delete short emails and emails containing special characters and signatures. Experimental results show that the state-of-the-art models (excluding the experiment with monolingual PhoBERT) improve on traditional models based on TF-IDF and on deep learning models with word embeddings, with an average accuracy increase of more than 13.3%.

The Blog Authorship dataset poses a greater challenge. Because it is a large dataset, we selected only the top 8 authors for our test, which gives an average word count of about 1700 for each author and provides enough data for model training and evaluation. Our testing shows that traditional machine learning and deep learning models still do not perform as well as the proposed model combining pre-trained multilingual models and feature techniques. Here too, the results show a significant improvement in accuracy, of 9.8%.

Table 5. A detailed description of the results of our tests based on two English datasets

Table 6. A detailed description of the results of our tests based on the Vietnamese dataset (VN-Literature)

6 Discussion

Our research combined a Vietnamese language model with writing style and feature extraction methods to achieve strong scores on the AA task. The results are obtained by adding a dense layer with a softmax activation function on top of the pre-trained language model PhoBERT. The approach of BERT and PhoBERT focuses mainly on the inherent elements of word representation, syntax and context, without requiring any preprocessing. Previous studies (Sari et al. 2018) have shown that using style and hybrid features improves the accuracy of AA tasks. We also show that combining such features with pre-trained language models can improve the macro-averaged F1-score. Our proposed model V4A, fine-tuned from the pre-trained language model PhoBERT, shows similar effects. This indicates that our model works well when each author has enough training data. In practice, however, it is not easy to have such data available for all authors. In addition, short paragraphs from some authors also cause some imbalance in the data.
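Since the macro-averaged F1-score is the measure referred to here, a short sketch of how it is computed may help. This is the standard definition (the unweighted mean of per-class F1 scores); the function name `macro_f1` and the convention of scoring a class with no true or predicted instances as 0 are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every author counts equally regardless of how many documents they have, macro-F1 penalizes a model that does well only on the authors with the most data, which is why it is informative under the imbalance discussed above.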

There are many possible ways to extend this study. Our experiments on three datasets (VN-Literature, Enron Email, and Blog Authorship) successfully used transformer-based language and representation models. In addition, we ran them with other pre-trained multilingual models that are variations of BERT, with positive results. In future work, we will focus on building a larger experimental dataset with a greater number of authors. Next, we will attempt to extract additional typographic, association, profiling, or content-related features.

7 Conclusion

The development of science and technology has led to more advanced research. Deep learning and transformer techniques are gradually becoming more popular in natural language processing, while feature engineering and text preprocessing are becoming less and less necessary. In this work, we presented V4A, a fine-tuning approach based on pre-trained PhoBERT for author classification. We also introduced a self-developed dataset for the task of author classification in Vietnamese literature. Our model works best when each author has enough training data, there is no significant label imbalance, and the texts are not too short. In addition, we applied style features (as previous authors have done) and hybrid features to V4A, which significantly improved the F1-score compared with the base model. Future work will explore more features for V4A, build more datasets with a larger number of authors, and review and suggest ways to deal with large data imbalances. Finally, we will explore other pre-trained language models and extend our approach to Authorship Attribution.

References

Chaski, C.E.: Who’s at the keyboard? Authorship attribution in digital evidence investigations. International Journal of Digital Evidence 4, 1–13 (2005)
Lambers, M., Veenman, C.J.: Forensic authorship attribution using compression distances to prototypes. In: Proceedings of the 3rd International Workshop on Computational Forensics, pp. 13–24 (2009)
Iqbal, F., Binsalleeh, H., Fung, B.C.M., Debbabi, M.: Mining write prints from anonymous e-mails for forensic investigation. Digit. Investig. 7(1–2), 56–64 (2010)
Kimler: Using style markers for detecting plagiarism in natural language (2003)
Huang, Iwaihara, M.: Authorship attribution based on pre-trained language model and capsule network (2022)
Gollub, T., et al.: Recent trends in digital text forensics and its evaluation – plagiarism detection, author identification, and author profiling (2013)
Schwartz, R., Tsur, O., Rappoport, A., Koppel, M.: Authorship attribution of micro messages. In: Conf. Empirical Methods in Natural Language Processing, pp. 1880–1891 (2013)
Shrestha, P., et al.: Convolutional neural networks for authorship attribution of short texts (2017)
Bagnall, D.: Author identification using multi-headed recurrent neural network. In: Working Notes Papers of the CLEF 2015 Evaluation Labs, vol. 1391 (2015)
Ding, S.H.H., Fung, B.C.M., Iqbal, F., Cheung, W.K.: Learning stylometric representations for authorship analysis. In: IEEE Transactions on Cybernetics, pp. 107–121 (2016)
Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inform. Sci. Technol. 60(3), 538–556 (2009)
Huang, W., Su, R., Iwaihara, M.: Contribution of improved character embedding and latent posting styles to authorship attribution of short texts, pp. 261–269. Springer (2020)
Wang, X., Iwaihara, M.: Integrating RoBERTa fine-tuning and user writing styles for authorship attribution of short texts (2021)
Iyer, R.R., Rose, C.P.: A machine learning framework for authorship identification from texts. arXiv Prepr. arXiv:1912.10204 (2019)
Anwar, W., Bajwa, I.S., Choudhary, M.A., Ramzan, S.: An empirical study on forensic analysis of Urdu text using LDA-based authorship attribution (2018)
Dmitrin, Y.V., Botov, D.S., Klenin, J.D., Nikolaev, I.E.: Comparison of deep neural network architectures for authorship attribution of Russian social media texts (2018)
Uchendu, A., Le, T., Shu, K., Lee, D.: Authorship attribution for neural text generation (2020)
Ruder, S., Ghaffari, P., Breslin, J.G.: Character-level and multi-channel convolutional neural networks for large-scale authorship attribution. arXiv Prepr. arXiv:1609.06686 (2016)
Klimt, B., Yang, Y.: A new dataset for email classification research. In: Boulicaut, J.F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) Machine Learning: ECML (2004)
Aggarwal, C., Zhai, C.: A survey of text clustering algorithms. In: Aggarwal, C.C., Zhai, C. (eds.) Mining Text Data, pp. 77–128. Springer, US (2012)
Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on EMNLP, pp. 1746–1751 (2014)
Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
Medsker, L., Jain, L.C.: Recurrent Neural Networks: Design and Applications. CRC Press (1999)
Yang, X., Yang, L., Bi, R., Lin, H.: A comprehensive verification of transformer in text classification. In: China National Conference on Chinese Computational Linguistics, pp. 207–218. Springer (2019)
Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale (2020)
Nguyen, D.Q., Nguyen, T.A.: PhoBERT: Pre-trained language models for Vietnamese. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1037–1042. Association for Computational Linguistics, Online (2020)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692 (2019)
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan (2018)
Sari, Y., Stevenson, M., Vlachos, A.: Topic or style? Exploring the most useful features for authorship attribution. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 343–353. Association for Computational Linguistics, Santa Fe, New Mexico, USA (2018)

Author information

Authors and Affiliations

University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam
Khoa Dang Dong
Saigon University, Ho Chi Minh City, Vietnam
Dang Tuan Nguyen

Corresponding author

Correspondence to Dang Tuan Nguyen .

Editor information

Editors and Affiliations

Ho Chi Minh City University of Food Industry, Ho Chi Minh City, Vietnam
Tran Khanh Dang
Johannes Kepler University of Linz, Linz, Austria
Josef Küng
Sungkyunkwan University, Seoul, Korea (Republic of)
Tai M. Chung

About this paper

Cite this paper

Dong, K.D., Nguyen, D.T. (2022). Vietnamese Text’s Writing Styles Based Authorship Identification Model. In: Dang, T.K., Küng, J., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2022. Communications in Computer and Information Science, vol 1688. Springer, Singapore. https://doi.org/10.1007/978-981-19-8069-5_23

DOI: https://doi.org/10.1007/978-981-19-8069-5_23
Published: 20 November 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8068-8
Online ISBN: 978-981-19-8069-5
eBook Packages: Computer Science (R0)
