
1 Introduction

Vietnam has witnessed strong growth in e-commerce in recent years. Many Vietnamese online trading platforms have been built and are attracting consumers. Online shopping is now part of people's daily routines because of its convenience and flexibility. However, besides the advantages of online shopping, the rise of counterfeit products and fraudulent practices in online trading concerns both customers and shop owners.

Customer reviews play an essential part in consumer behavior in online shopping. Reviews express customers' opinions, emotions, and attitudes about products, and these opinions influence other customers' decisions about whether to buy a product. A customer who wants to buy a product in an online shop tends to consult previous customers' reviews of that product. Exploiting this behavior, some people create spam reviews, which use illegitimate means and untrue claims about the actual quality of products to mislead consumers and boost the business or fame of an individual or organization [9]. Detecting these spam reviews therefore protects both sellers and customers from the risk of low-quality products and preserves the reputation of the sellers.

Our purpose in this paper is to propose a method for detecting spam reviews of products on online shopping platforms. First, we construct a corpus for spam detection from users' textual reviews. Second, we use machine learning approaches to build classification models for detecting spam reviews and evaluate the models' performance on the constructed dataset.

The paper is structured as follows. Section 1 introduces our work. Section 2 surveys relevant research on the problem of online spam detection. Section 3 describes the data creation process and gives an overview of our dataset. Section 4 introduces our machine learning and deep learning approaches for the spam review detection problem. Section 5 presents our empirical results and an analysis of the classification models' performance. Finally, Sect. 6 concludes our research and proposes future work.

2 Related Works

Preliminary research on spam reviews showed how to construct a classification model for detecting whether a user's review is spam and highlighted the difficulty of detecting spam from both the content of the review and the reviewer [9]. The challenge of identifying spam reviews comes from reviewers' behavior: spammers try to make their content look like that of other, innocent reviewers. Therefore, [10] proposes three types of spam reviews to characterize the problem.

Many approaches have been applied to the task of spam review detection, including patterns and rules [13, 14, 30], machine learning and deep learning [24, 26], and linguistic features [15, 22]. Overall, the dataset is the key factor for training and evaluating the classification models applied to the opinion spam detection task. Available datasets for detecting spam reviews are introduced in [25].

For Vietnamese, there are several datasets of user reviews on e-commerce platforms, such as the datasets of phone and restaurant reviews [18, 29], the smartphone feedback datasets [16, 19], and the dataset for complaint detection on e-commerce websites [20]. However, there is no dedicated dataset for spam review detection on Vietnamese e-commerce websites yet. Hence, our motivation is to construct a dataset for detecting spam reviews on Vietnamese e-commerce platforms.

3 The Dataset

3.1 Dataset Creation Process

We collected data from leading online shopping platforms in Vietnam. For each product category, we selected some of the most recently sold products and collected up to 15 reviews per product. After collection, we obtained a dataset of 19,868 product reviews, each containing the star rating, the comment about the product, and a link to that product. Subsequently, we constructed the annotation guidelines and annotated the corpus.

Our data annotation process is inspired by the MATTER framework [23]. The annotation process consists of two phases. The first is the training phase for the annotators. The annotators read guidelines describing the meaning of each label, a sample review, and some examples of specific cases, and then annotate 300 random samples from the dataset. Then, we calculate and evaluate the average inter-annotator agreement. If the inter-annotator agreement is satisfactory, we move to the second phase, which is the annotation phase. Otherwise, we re-train the annotators and update the annotation guidelines. According to [4], the inter-annotator agreement measured by Cohen's Kappa [2] is acceptable when it is higher than 0.5.

In the second phase, the annotators are provided with the complete dataset and annotate it. Three annotators participate throughout the annotation phases, and the final label of each review is decided by voting for the most frequently assigned label. To ensure objectivity when annotating the dataset, the annotators work independently.
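The final-label rule can be illustrated with a minimal sketch, assuming each review simply receives the label chosen by the majority of the three annotators (the function name and tie-breaking behavior are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(labels):
    """Return the label assigned by the most annotators.

    `labels` is a list such as ["SPAM-1", "SPAM-1", "NO-SPAM"].
    Ties are resolved by the first most-common label returned by Counter.
    """
    return Counter(labels).most_common(1)[0][0]

# Example: three annotators label one review
print(majority_vote(["SPAM-2", "NO-SPAM", "SPAM-2"]))  # -> "SPAM-2"
```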

3.2 Annotation Guidelines

Our dataset comprises two tasks. The first task determines whether a review is spam or not (Task 1), and the second task identifies the type of spam review (Task 2). The dataset contains two labels: SPAM and NO-SPAM. Each spam review is further labeled with one of three spam types [10]. The labels are described as follows:

NO-SPAM: Reviews with this label are regular reviews that truthfully reflect the product. Such reviews provide helpful information for buyers to get an overview of the product before deciding whether to buy it.

SPAM: Reviews with this label are entirely or partially untrue about the products sold on e-commerce sites. Such reviews are often written to boost sales or to hurt the sales and reputation of stores, and they provide inaccurate or unhelpful information. Following [10], we divide spam reviews into three labels:

  • SPAM-1 (fake review): These reviews mislead customers either by giving negative comments to damage the reputation of the store selling the product, or by giving an excellent review in order to attract customers to the product and the shop, even though the review does not reflect the actual product.

  • SPAM-2 (review on brand only): These reviews do not comment specifically on the product but only on the brand, manufacturer, or seller of the product. Although such reviews can be informative for buyers, they are often negative and are considered spam.

  • SPAM-3 (non-review): These are reviews whose content is not about the product or anything related to it. Such reviews tend to promote another product, earn commissions from an e-commerce site, or serve no purpose at all.

Table 1. Several example reviews and instructions for annotating labels

In addition, Table 1 presents several example reviews from users and the explanation for choosing the label of each review. For each review, annotators choose one suitable label.

3.3 Inter-annotators Agreement and Discussion

Three annotators annotate the dataset. In the training phase, we let the three annotators work independently on the sample in order to measure the inter-annotator agreement. Then, we calculate the inter-annotator agreement for each pair of annotators using Cohen's Kappa [2].
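A minimal sketch of the pairwise agreement computation, assuming scikit-learn's implementation of Cohen's Kappa and hypothetical label lists from two annotators on the same sample:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same sample of reviews
annotator_1 = ["NO-SPAM", "SPAM-1", "NO-SPAM", "SPAM-3", "SPAM-2"]
annotator_2 = ["NO-SPAM", "SPAM-1", "SPAM-3", "SPAM-3", "NO-SPAM"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's Kappa: {kappa:.2f}")
```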

Table 2. Inter-annotator agreement of three annotators A1, A2 and A3 on the two tasks. Annotators worked independently
Table 3. Inter-annotator agreement of three annotators A1, A2 and A3 on the two tasks after re-training with the updated annotation guidelines

According to Table 2, the average inter-annotator agreement on both tasks is lower than 0.5, which does not satisfy the minimum agreement level according to [4]. Therefore, we updated the annotation guidelines with more explanations and examples and re-trained the annotators to improve the quality of the annotation. As illustrated in Table 3, the inter-annotator agreement improved on both tasks. The final inter-annotator agreement of our dataset is 0.72 and 0.68 for the two tasks, which indicates substantial agreement according to [12].

Table 4. Confusion matrix between three pairs of annotators when annotating the samples. We calculate the values by taking the average of three pairs

In addition, Table 4 describes the number of comments annotated by the three annotators. It can be seen that most disagreements fall into the case of deciding whether a comment belongs to a specific spam type or is not a spam review. Therefore, we attach the original product links to the reviews for the annotators to reference. This way, annotators can identify the context of a review and then give an accurate label. However, as shown in Table 5, two annotators disagree when annotating Review #1, because this review contains the user's opinion not only about the product but also about the provider's brand, which is categorized as SPAM-2 according to the annotation guidelines. Besides, Reviews #2 and #3 give opinions about the product's quality and design but do not mention the product specifically, which makes it difficult for annotators to decide whether these comments are non-reviews (SPAM-3).

In general, the challenge of annotation for this task is identifying whether or not a review is spam. Despite detailed annotation guidelines and the information about the products relevant to the reviews, annotators still misjudge some reviews because consumers' opinions are diverse and the style of users' reviews is unclear. Therefore, to guarantee objectivity, we let three annotators annotate the entire dataset and then take the final label by majority voting.

Table 5. Several sample reviews that contain disagreement labels between annotators

3.4 Dataset Overview

Fig. 1. The distributions of three labels on the train, development, and test sets

After annotating the dataset, we have nearly 19,000 user reviews, each categorized as spam or not spam. If a review is spam, it is also labeled with the type of spam. Then, we divided the dataset into train, development, and test sets with a ratio of 7:1:2. The overall information about the dataset is shown in Table 6.
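A minimal sketch of how a 7:1:2 split could be produced, assuming scikit-learn and hypothetical variables `reviews` and `labels` holding the annotated corpus (the paper does not state whether the split was stratified; stratification is an assumption here):

```python
from sklearn.model_selection import train_test_split

# reviews: list of review texts, labels: list of corresponding labels (assumed variables).
# First split off 20% as the test set, then take 1/8 of the remainder as the
# development set, yielding an overall 7:1:2 ratio.
train_dev_texts, test_texts, train_dev_labels, test_labels = train_test_split(
    reviews, labels, test_size=0.2, stratify=labels, random_state=42)
train_texts, dev_texts, train_labels, dev_labels = train_test_split(
    train_dev_texts, train_dev_labels, test_size=0.125,
    stratify=train_dev_labels, random_state=42)
```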

In addition, Fig. 1 shows the distribution of reviews by label on the train, development, and test sets. Reviews annotated as not spam account for the highest proportion. Among the spam types, SPAM-3 reviews outnumber the two remaining types. The label distribution is similar across the training, development, and test sets.

Table 6. Overview of the dataset. The vocabulary size is computed at the syllable level
Table 7. The distribution of spam and non-spam reviews based on users' star ratings

Besides, according to Table 7, most spam reviews are rated 5 stars by users. For non-spam reviews, although the distribution from 1 to 4 stars is more uniform, most reviews are also rated 5 stars. Hence, the star rating of a product is not reliable information for capturing users' opinions about the quality of the product.

4 Methodologies

4.1 Task Definition

The problem of spam review detection is cast as a text classification task. It comprises two tasks: Task 1 is a binary classification task that classifies whether a review is spam or not, and Task 2 is a multi-class classification task that identifies the type of spam, which is one of the three types mentioned in Sect. 3.

4.2 Word Embedding

Word embeddings are vector representations of text data that can capture relationships, semantic similarity, and the context of words. In natural language processing, word representation plays a vital role in many downstream tasks such as text classification. For text classification, the empirical results in [8] showed that the fastText pre-trained embeddings provided by [5] obtain robust results when integrated with various deep neural networks on social media texts. Therefore, we choose the fastText word embeddings (Footnote 1) for our experiments.
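A minimal sketch of how pre-trained fastText vectors could be loaded and used to embed a review, assuming the publicly released Common Crawl Vietnamese vectors in the `.vec` text format (the file name and the use of gensim are assumptions; the paper's footnote with the exact source is not reproduced here):

```python
import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained Vietnamese fastText vectors in the word2vec-compatible .vec format.
word_vectors = KeyedVectors.load_word2vec_format("cc.vi.300.vec", binary=False)

def embed_tokens(tokens, dim=300):
    """Map a tokenized review to a sequence of 300-d vectors; OOV tokens get zeros."""
    return np.stack([word_vectors[t] if t in word_vectors else np.zeros(dim)
                     for t in tokens])

matrix = embed_tokens("sản phẩm giao nhanh".split())
print(matrix.shape)  # (4, 300)
```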

4.3 Deep Neural Network Models

Text-CNN [11]: The convolutional neural network (CNN) is a model that combines many different layers. CNNs are often applied in computer vision to extract features from images for image classification and have achieved higher performance than traditional approaches. CNNs are also applied to natural language processing problems, typically text classification with the Text-CNN model. This model uses a convolutional architecture to extract useful features from natural-language texts.
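A minimal PyTorch sketch of the Text-CNN idea described above: parallel convolutions over the embedded sequence followed by max-over-time pooling and a linear classifier. The hyperparameters (kernel sizes, number of filters) are illustrative defaults, not the paper's settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Minimal Text-CNN: parallel 1-D convolutions, max-over-time pooling, linear head."""
    def __init__(self, vocab_size, embed_dim=300, num_classes=2,
                 kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # (batch, num_classes)

# Example forward pass with random token ids
logits = TextCNN(vocab_size=10000)(torch.randint(0, 10000, (8, 64)))
```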

LSTM [6]: Long Short-Term Memory (LSTM) is an improvement of the Recurrent Neural Network (RNN). LSTM helps the model remember previous information over long distances, which is a limitation of the RNN. LSTM comprises three gates: the input gate, output gate, and forget gate. The input gate selects information to add to the context, the output gate decides whether the input is necessary at the present step, and the forget gate removes information from the context when it is no longer needed. This architecture helps classify text better because it can capture contextual information across the entire text.

GRU [1]: The Gated Recurrent Unit (GRU) is a variant of the LSTM model with lower complexity. While the LSTM has three gates, the GRU has only two: the update gate and the reset gate. The update gate determines how much past information is retained and used in the future, and the reset gate decides how much past information to forget. The advantage of the GRU is that it uses fewer parameters during training, therefore uses less memory, and trains faster than the LSTM.
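The LSTM and GRU classifiers share the same overall structure; a minimal PyTorch sketch covering both is shown below, with the last hidden state feeding a linear layer. The dimensions are illustrative, not the paper's settings:

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    """Recurrent text classifier; set cell="lstm" or cell="gru"."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128,
                 num_classes=2, cell="lstm"):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        embedded = self.embedding(token_ids)
        _, hidden = self.rnn(embedded)
        # LSTM returns (h_n, c_n); GRU returns h_n only.
        last = hidden[0][-1] if isinstance(hidden, tuple) else hidden[-1]
        return self.fc(last)                       # (batch, num_classes)

# Example: a GRU-based classifier on random token ids
logits = RNNClassifier(vocab_size=10000, cell="gru")(torch.randint(0, 10000, (8, 64)))
```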

Transformers are an architecture that has been proposed in recent years and is currently in widespread use. The appearance of BERT [3] helps many downstream NLP tasks attain high performance even when training on a small dataset. BERT and its variants have become baseline approaches in many NLP tasks, a line of work referred to as BERTology [27].

For Vietnamese, there are two kinds of BERTology approaches: multilingual and monolingual models [7, 28]. The monolingual models obtained better results than the multilingual models on the text classification task [28] and the sequence-to-sequence task [7]. Therefore, we applied two monolingual BERT models, PhoBERT [17] and BERT4News [21], to our problem of detecting spam reviews.
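A minimal sketch of fine-tuning-style usage with the HuggingFace `transformers` library and the public PhoBERT checkpoint; the `num_labels` value and the raw example text are assumptions for illustration, and PhoBERT normally expects word-segmented Vietnamese input:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# PhoBERT base model from the HuggingFace hub; num_labels=2 for Task 1
# (spam vs. no-spam); a larger value would be used for the spam-type task.
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/phobert-base", num_labels=2)

# PhoBERT expects word-segmented input (e.g. via VnCoreNLP);
# the underscore-joined text here is only illustrative.
inputs = tokenizer("sản_phẩm tốt , giao hàng nhanh", return_tensors="pt")
logits = model(**inputs).logits
```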

5 Empirical Results

5.1 Baseline Results

We implement two tasks for the spam detection problem. First, we implement a binary classifier to classify reviews as spam or not spam. Second, we construct a classification model to determine the spam type of each review. We adapt the Text-CNN, LSTM, and GRU models, as well as the transformer models PhoBERT and BERT4News, to our tasks. Finally, we use the Accuracy and macro-averaged F1-score metrics to evaluate the performance of the baseline models.
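The two metrics can be computed with scikit-learn; a minimal sketch, assuming hypothetical variables `y_true` and `y_pred` holding the gold and predicted labels on the test set:

```python
from sklearn.metrics import accuracy_score, f1_score

# y_true and y_pred are the gold and predicted labels on the test set (assumed variables)
accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Accuracy: {accuracy:.4f}, macro F1: {macro_f1:.4f}")
```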

Table 8. The empirical results of classification models on the dataset

For classifying reviews as spam or not, the Text-CNN model achieves an Accuracy of 84.18% and a macro-averaged F1-score of 77.89% on the test set, better than the LSTM and GRU models. On this task, PhoBERT achieves better results than BERT4News, with an Accuracy and macro-averaged F1-score of 90.01% and 86.89%, respectively. For the spam type detection task, the PhoBERT model obtains 88.93% Accuracy and 72.17% macro-averaged F1-score, higher than BERT4News. The evaluation results of all models are described in Table 8.

Based on the training and evaluation results, we can see that the Transformer-based models give better performance than the deep neural network models, and the PhoBERT model obtains the best performance on both tasks. For spam type detection, the gap between Accuracy and F1-score is larger than for spam classification because of the imbalance between the data labels. Besides, classifying reviews as spam or not is less complicated than detecting the type of spam, so the results of the first task are higher than those of the spam type detection task.

5.2 Error Analysis

According to Fig. 2, errors between SPAM and NO SPAM on the first task are not frequent. In contrast, the errors differ significantly on the second task. Most SPAM-2 reviews are predicted as NO SPAM, and the number of wrong predictions (177 predicted as NO SPAM) is higher than the number of correct predictions (135 predicted as SPAM-2). The proportion of wrong predictions for SPAM-1 reviews is also very high, with most SPAM-1 reviews predicted as NO SPAM; however, these errors are not numerous in the whole test set. Reviews of type SPAM-3 behave similarly to type SPAM-1. In general, most wrong predictions are caused by confusion between the NO SPAM label and the other labels. Thus, the challenge for classification models on our dataset is both to determine whether reviews are spam and to identify the type of spam review.

Fig. 2. Confusion matrices of the PhoBERT model on the two tasks. The confusion matrices are created using the sklearn library
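As noted in the caption of Fig. 2, the confusion matrices are produced with scikit-learn; a minimal sketch, assuming hypothetical `y_true` and `y_pred` arrays for the spam-type task:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# y_true and y_pred hold the gold and predicted labels on the test set (assumed variables)
labels = ["NO-SPAM", "SPAM-1", "SPAM-2", "SPAM-3"]
ConfusionMatrixDisplay.from_predictions(y_true, y_pred, labels=labels)
plt.show()
```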

Table 9. Several wrong predictions of reviews with type SPAM-2 to NO SPAM label

To study the wrong predictions for the SPAM-2 label, we take several random examples with labels predicted by the best classification model and compare them with the gold labels. These examples are described in Table 9. According to Table 9, there are two main reasons for the wrong predictions. The first is the difficulty of distinguishing praise for the brand of the product, the retailer, or the provider from reviews about the quality of the products. For example, reviews No #1, No #3, and No #4 give opinions about Tefal and Ariel (two famous consumer goods brands in Vietnam) and the Tiki brand (an online shopping service provider and retailer). However, those reviews do not focus on product quality; they only express thanks to the retailer and the brands. The classification model cannot discriminate between praise of the brands and customers' opinions about product quality. The second reason is short user reviews that do not give any information about the product or the service provider, such as reviews No #2 and No #3.

In general, the weakness of the current classification models on this dataset is the confusion between spam and non-spam reviews, since genuine reviews must focus directly on the characteristics of the product, its quality, and its services. To overcome this problem, the model should integrate extra information about the product, such as the product information page and previous user reviews, to enhance its classification ability.

6 Conclusion

This paper introduced ViSpamReviews, a dataset for spam review detection on Vietnamese online shopping websites with more than 19,000 human-annotated reviews. The dataset follows a strict annotation process, and annotators are provided with detailed annotation guidelines for labeling the dataset. The final inter-annotator agreements are \(\kappa =0.72\) for the task of determining whether a review is spam and \(\kappa =0.68\) for the task of detecting the type of spam review. Besides, we applied robust classification models to the dataset, and the PhoBERT model obtained the highest results, with an F1-score of 86.89% for the spam classification task and 72.17% for the spam type detection task. From the error analysis, we found that it is necessary to integrate extra metadata about the product as well as previous reviews to improve the classification models.

Our next study is to extend the dataset for detecting spans of spam within the reviews and for identifying users' opinions on specific characteristics of products and their related services. Finally, based on the current results, the dataset can be used to develop an application that helps shop owners filter spam reviews from users.