1 Introduction

Product reviews affect users’ shopping behavior. According to one survey, 64% of users read reviews before buying goods, 87% choose to buy after reading good reviews, and 80% give up after reading bad reviews [1]. Fake reviews are reviews written to intentionally mislead consumers [2], and they are widespread: research by the Washington Post [3] found that more than 60% of reviews of electronic products on Amazon.com were fake. It is therefore important to automatically assess the authenticity of information on online platforms and provide users with more trustworthy information. Because reviews are text, identifying spam reviews can be treated as a text classification problem. Li et al. [4] propose a neural network composed of two convolutional layers combined with sentence importance weights for deceptive review detection. Liu et al. [5] detect fake reviews by combining a bidirectional long short-term memory network with features, which captures long-distance dependencies in the sequence well.

However, the recognition performance of existing models has plateaued. One reason is that the embedding layer usually provides context-independent word-level features from Word2Vec or GloVe models. Moreover, spam review datasets are too small to train a task-specific architecture. Generating context-aware word vectors with language models pre-trained on large-scale datasets therefore has good potential to further improve performance.

In addition, experimental results show that sentiment features are effective for recognizing fake reviews [6]. After mining the Yelp fake review datasets, we find that reviewers usually describe many aspects of a product to express their opinions and convince others. Different users produce a variety of fine-grained evaluations of the same product, which together reflect its quality from all sides. Because spammers have no personal experience of the product, the information they post may differ from the public evaluation. For example, for a restaurant, real users may evaluate the dishes negatively and the drinks positively, whereas spammers praise both the drinks and the dishes. J. Surowiecki points out in the book The Wisdom of Crowds: “In the right environment, a group has extraordinary intelligence, and this intelligence often beats the smartest person in the group” [7]. The group’s evaluation of a product aspect can therefore represent the real quality of that aspect.

Therefore, if we can mine the latent group intelligence in product reviews and combine it with a user’s sentiment towards product aspects, we can check whether that sentiment is genuine and use the public’s intelligence to detect spam reviews more effectively.

In this paper, we fuse group intelligence with users’ personalized sentiment information and contextual semantic information to generate multidimensional representations for identifying spam reviews. We propose a new model, Triple BERT (T-Bert), based on the structure of the Triple Network [8] and the BERT component, which provides a new solution strategy for the task of spam review detection.

2 Related Work

2.1 Spam Review Detection

Research on spam review detection began in 2007 [9]. Spam review detection is a specific application of the general problem of deception detection and mainly uses text and behavioral features. Behavioral features include the number of good/bad reviews [10], the frequency of comments [12], etc.; text features include the length of the comment text [10, 11], various lexical and syntactic features [13], etc. Some works combine text features with behavioral features. Wang et al. [14] combined the two as a sentence representation based on a CNN model, which addressed the cold-start problem in spam review detection. Wang et al. [15] used an MLP to obtain user behavior features and a CNN to obtain textual features, and combined them with an attention-based neural network to identify fake reviews. Yuan et al. [16] used a hierarchical fusion attention mechanism to generate fused text representations from the perspectives of user and product, and used the TransH algorithm to model the relationship among user, product and review text, generating a more reliable review representation.

Previous work mostly relied on Word2Vec or GloVe for word vector representation, which is not enough to capture the complex semantic relations in sentences. Recently, pre-trained language models such as ELMo and BERT have been shown to generate effective context-aware word vectors and to improve performance in a number of natural language processing applications. So far, however, no work has applied BERT to the spam review detection task. In this paper, we use BERT as the base model to construct word vectors and verify its performance on this task.

2.2 Fine-Grained Sentiment Analysis

Fine-grained sentiment analysis is a challenging and significant subtask of sentiment analysis. It aims to identify the sentiment polarity of specific aspects [17]. This task lets users assess the sentiments towards all aspects of a given product or service and gain a more comprehensive understanding of its quality [18]. Fine-grained sentiment analysis can be subdivided into three categories. The first detects the sentiment polarity corresponding to a given aspect in a sentence [19, 20], but it is difficult to apply because the fine-grained aspects need to be labeled in advance. The second is Aspect-oriented Opinion Words Extraction (AOWE) [21, 22], which aims to extract the opinion words corresponding to a given aspect from the sentence. The last is End-to-End Aspect-based Sentiment Analysis (E2E-ABSA), whose goal is to jointly detect aspect terms/categories and the corresponding aspect sentiment.

On the one hand, existing studies [6, 25] show that sentiment features can effectively identify fake reviews. On the other hand, as mentioned in Sect. 1, to integrate group intelligence into the model and further improve the reliability of spam review detection, we conduct E2E-ABSA on the spam review data to obtain the group sentiment corresponding to each aspect of the product. This also makes it possible to measure how far a user’s sentiment deviates from the public’s sentiment. The group sentiment and the user’s sentiment are used as auxiliary information for spam review detection, which provides a new solution for this task.

3 Methodology

The structure of T-Bert is shown in Fig. 1. We treat spam review detection as a binary classification task. First, for each user’s (or product’s) reviews, we conduct fine-grained sentiment analysis on the sentences and cluster the fine-grained aspects to obtain the user’s (or product’s) sentiment tendency for each aspect category, that is, the group sentiment tendency \(G_{i}=\left\{ a_{i1},a_{i2},a_{i3},a_{i4},a_{i5},a_{i6}\right\} \) of product \(P_{i}\) and the personal sentiment tendency \(S_{i}=\left\{ b_{i1},b_{i2},b_{i3},b_{i4},b_{i5},b_{i6}\right\} \) of user \(U_{j}\) towards product \(P_{i}\). Second, given an input sentence \(X_{i}=\left\{ x_{i1},x_{i2},\ldots ,x_{iT}\right\} \) of length T, we encode it with a BERT component of L Transformer layers to obtain a contextualized sentence representation \(E^{L}=\left\{ e_{1}^L,e_{2}^L,\ldots ,e_{T}^L\right\} \in \mathbb {R}^{T\times D}\), where D is the dimension of each vector. Finally, we combine \(G_{i}\), \(S_{i}\) and \(E^{L}\) to determine whether \(X_{i}\) is a spam review.

Fig. 1. Overall structure of the T-Bert model. \(S_{i}\) stands for the user’s personalized sentiment, \(G_{i}\) stands for group intelligence.

3.1 Group Intelligence and User Personalized Sentiment

We extract the fine-grained aspects that users care about from the reviews. A fine-grained aspect is a product attribute mentioned in a user’s review. Fine-grained aspects are not annotated in the spam review datasets, the data are too large to annotate manually, and an annotation standard is hard to define, so we use transfer learning to annotate the fine-grained information. This research is based on the Yelp datasets, which include restaurant reviews and a small number of hotel reviews. We therefore use the method in [23] to train a fine-grained sentiment analysis model on the SemEval 2016 data [26], and use that model to label the Yelp datasets. Each review is annotated with a triple \((A_{i},W_{i},POS/NEG)\): the fine-grained aspects \(A_{i}=\left\{ A_{i1},A_{i2},\ldots ,A_{in}\right\} \) mentioned in sentence \(X_{i}\), the sentiment word \(W_{i}\) corresponding to each fine-grained aspect \(A_{ix}\), and the sentiment tendency of that aspect, where POS denotes positive sentiment and NEG denotes negative sentiment. To obtain group intelligence and user personalized sentiment, we further analyze the annotated fine-grained sentiment information.

We follow the labeling standard of the SemEval dataset and divide the fine-grained aspects into 6 categories: restaurant, food, drink, service, ambience, and location. First, we de-duplicate and merge the fine-grained aspect words contained in all review sentences to obtain the fine-grained aspect word set ASP. We then count word frequencies over ASP and, in descending order of frequency, select 10 seed words for each category to form a seed word set \(\bar{A}\). Second, we train a Word2Vec model on the Yelp review dataset to obtain word vectors. Finally, based on these word vectors, we calculate the similarity between each aspect word \(ASP_{i}\) and every seed word of each category in \(\bar{A}\). If the average similarity to a category’s seed words is greater than the threshold \(\alpha \), then \(ASP_{i}\) is assigned to that category. The fine-grained aspect word set \(\tilde{A}\), divided into 6 categories, is generated by these steps; part of it is shown in Table 1.
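The seed-word assignment step can be illustrated with a minimal Python sketch. It assumes cosine similarity as the similarity measure and, when several categories exceed the threshold, keeps the one with the highest average; neither choice is stated in the text. The toy vectors, the seed lists and the name `assign_category` are hypothetical stand-ins for the trained Word2Vec vectors and the real seed set.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_category(word, vectors, seeds, alpha):
    """Return the category whose seed words are, on average, most similar
    to `word`, or None if no category's average similarity exceeds alpha."""
    best_cat, best_sim = None, alpha
    for cat, seed_words in seeds.items():
        sims = [cosine(vectors[word], vectors[s]) for s in seed_words if s in vectors]
        if not sims:
            continue
        avg = sum(sims) / len(sims)
        if avg > best_sim:  # assumption: keep only the best-scoring category
            best_cat, best_sim = cat, avg
    return best_cat

# Toy 3-dimensional vectors (hypothetical, not real Word2Vec output).
VECTORS = {
    "pizza": [0.9, 0.1, 0.0], "pasta": [0.8, 0.2, 0.1],
    "waiter": [0.1, 0.9, 0.1], "staff": [0.0, 0.95, 0.2],
    "burger": [0.85, 0.15, 0.05], "parking": [0.0, 0.0, 1.0],
}
SEEDS = {"food": ["pizza", "pasta"], "service": ["waiter", "staff"]}

burger_category = assign_category("burger", VECTORS, SEEDS, alpha=0.7)
parking_category = assign_category("parking", VECTORS, SEEDS, alpha=0.7)
```

In this toy setting "burger" lands in the food category, while "parking" stays unassigned because it is not close enough to any seed set.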

Table 1. Part of the fine-grained aspect word set \(\tilde{A}\)

We determine the sentiment polarity of each category in a review sentence with simple rules. For example, in the food category, if the number of positive fine-grained words is greater than the number of negative ones, the sentiment towards food is positive, and vice versa. From the product dimension, we perform fine-grained sentiment analysis and clustering on all reviews of product \(P_{i}\) to obtain its group sentiment feature \(G_{i}=\left\{ a_{i1},a_{i2},a_{i3},a_{i4},a_{i5},a_{i6}\right\} \), where \(a_{ix}\) represents the group sentiment polarity of one category. From the user dimension, the fine-grained sentiment analysis results of \(U_{j}\)’s review of \(P_{i}\) are clustered according to \(\tilde{A}\) to obtain the user’s personalized sentiment feature \(S_{i}=\left\{ b_{i1},b_{i2},b_{i3},b_{i4},b_{i5},b_{i6}\right\} \), where \(b_{ix}\) represents the user’s sentiment polarity for one category.
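As a rough illustration of the majority rule, the sketch below builds a six-value polarity vector from annotated triples. The numeric encoding (+1 positive, -1 negative, 0 for a tie or an unmentioned category) and the names `polarity_vector` and `WORD_TO_CATEGORY` are assumptions made for the example; the text does not specify how ties or missing categories are handled. The toy data mirror the restaurant example from Sect. 1.

```python
from collections import Counter

CATEGORIES = ["restaurant", "food", "drink", "service", "ambience", "location"]

def polarity_vector(triples, word_to_category):
    """Majority-vote polarity per category.

    triples: iterable of (aspect_word, opinion_word, "POS" or "NEG").
    Returns a list aligned with CATEGORIES: +1, -1, or 0 (tie / not mentioned).
    """
    counts = {c: Counter() for c in CATEGORIES}
    for aspect, _opinion, label in triples:
        cat = word_to_category.get(aspect)
        if cat is not None:
            counts[cat][label] += 1
    vec = []
    for c in CATEGORIES:
        pos, neg = counts[c]["POS"], counts[c]["NEG"]
        vec.append(1 if pos > neg else -1 if neg > pos else 0)
    return vec

# Hypothetical aspect-word lookup and annotations.
WORD_TO_CATEGORY = {"pasta": "food", "steak": "food", "wine": "drink", "cocktail": "drink"}
product_triples = [  # all reviews of one product
    ("pasta", "bland", "NEG"), ("steak", "tough", "NEG"), ("steak", "tasty", "POS"),
    ("wine", "great", "POS"), ("cocktail", "nice", "POS"),
]
user_triples = [("pasta", "amazing", "POS"), ("wine", "great", "POS")]  # one user's review

G = polarity_vector(product_triples, WORD_TO_CATEGORY)  # group sentiment
S = polarity_vector(user_triples, WORD_TO_CATEGORY)     # user sentiment

# Categories where the user contradicts the group (illustration only;
# in the model both vectors are fed to BERT rather than compared by rule).
disagreements = [c for c, g, s in zip(CATEGORIES, G, S) if g and s and g != s]
```

Here the user agrees with the group on drinks but praises the food that the group rates negatively, which is the kind of mismatch the group sentiment feature is meant to expose.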

3.2 Triple Bert

The BERT model is a language model that uses bidirectional Transformers pre-trained on large corpora and performs strongly on many NLP tasks. We build the spam review detection model T-Bert on the Triple Network framework and BERT.

Embedding Layer. We use the BERT component as the embedding layer of the T-Bert model. For each token \(x_{it}\) in sentence \(X_{i}\), we sum the token embedding, segment embedding and position embedding to obtain \(e_{t}\), \(t\in [1,T]\), which forms the input feature \(E^{0}=\left\{ e_{1},e_{2},\ldots ,e_{T}\right\} \) of the first branch of the embedding layer. Then L Transformer layers refine the token-level features layer by layer. Finally, the output \(E^{L}\), obtained by combining the outputs of the last four layers with equal weights, is the representation of the review sentence \(X_{i}\).

$$\begin{aligned} \begin{array}{c} E^L = 0.25\times E^{L-1}+0.25\times E^{L-2}+0.25\times E^{L-3}+0.25\times E^{L-4} \end{array} \end{aligned}$$
(1)
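Eq. (1) is an equal-weight average of four layer outputs. A minimal pure-Python sketch follows, with nested lists standing in for the \(T\times D\) matrices. The function name `average_layers` and the toy values are hypothetical, and which four layers are passed in is left to the caller.

```python
def average_layers(layers):
    """Element-wise equal-weight mean of several T x D layer outputs.

    layers: list of matrices (lists of rows), all with the same shape.
    With four layers, each weight is 0.25 as in Eq. (1).
    """
    weight = 1.0 / len(layers)
    num_tokens, dim = len(layers[0]), len(layers[0][0])
    return [
        [sum(weight * layer[t][d] for layer in layers) for d in range(dim)]
        for t in range(num_tokens)
    ]

# Four toy 2 x 2 layer outputs filled with 1.0, 2.0, 3.0 and 4.0.
toy_layers = [[[float(v)] * 2 for _ in range(2)] for v in (1, 2, 3, 4)]
E_L = average_layers(toy_layers)
```

With a tensor library the same operation is a stack followed by a mean over the layer axis.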

To combine group intelligence, user personalized sentiment and text information for spam review detection, we use the BERT component to transform the two kinds of sentiment information constructed in Sect. 3.1. First, the two features \(G_{i}\) and \(S_{i}\) are one-hot encoded and normalized. Then, we pack the feature values of the user sentiment feature \(S_{i}\) for product \(P_{i}\) as \(E^{s0}=\left\{ e_{s1},e_{s2},\ldots ,e_{s12}\right\} \), where \(e_{st},t\in [1,12]\) is the combination of the token embedding, segment embedding, and position embedding of the corresponding input feature token. This is the second branch of the embedding layer. The input feature \(E^{g0}=\left\{ e_{g1},e_{g2},\ldots ,e_{g12}\right\} \) of the third branch is generated from \(G_{i}\) in the same way. Note that the BERT components of the three branches share weights. The calculation is given in Eq. (2), where \(E^{gl}\in \mathbb {R}^{12\times D},E^{sl}\in \mathbb {R}^{12\times D}\) are the representations of the group intelligence feature \(G_{i}\) and the user sentiment feature \(S_{i}\), respectively.

$$\begin{aligned} \begin{array}{c} E^{gl} = Transformer_{l}(E^{gl-1}),\\ E^{sl} = Transformer_{l}(E^{sl-1}). \end{array} \end{aligned}$$
(2)
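The text does not spell out how six category polarities become the 12 input positions of each sentiment branch. One plausible reading, sketched below purely as an assumption, is that each category is one-hot encoded into two slots (positive, negative), giving \(6\times 2=12\) values. The +1/-1/0 polarity encoding and the name `one_hot_polarity` are hypothetical.

```python
def one_hot_polarity(polarities):
    """Flatten six polarities into 12 values: [is_positive, is_negative] per category.

    polarities: six values, assumed to be +1 (positive), -1 (negative) or 0 (none).
    """
    flat = []
    for p in polarities:
        flat.extend([1 if p > 0 else 0, 1 if p < 0 else 0])
    return flat

# Example: negative on the 2nd category, positive on the 3rd, nothing elsewhere.
encoded = one_hot_polarity([0, -1, 1, 0, 0, 0])
```

Each of the 12 values would then be treated as one input token for the shared BERT encoder in Eq. (2).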

Spam Review Detection Layer. To identify spam reviews, we build four different spam review detection layers on top of the embedding layer to classify the feature representations obtained above. We concatenate \(E^{L}\), \(E^{gl}\) and \(E^{sl}\) to form the input \(E^{F}\in \mathbb {R}^{(T+24)\times D}\) of the spam review detection layer.

Linear. The obtained \(E^{F}\) is input into a max pooling layer, which selects the most distinctive features in each sentence to form a sentence representation \(h_{L}\in \mathbb {R}^{D}\). This representation is then fed into the linear classification layer. Finally, the softmax function is used to calculate the class probabilities as follows:

$$\begin{aligned} \begin{array}{c} h_{L} = \max \limits _{dim = 1}(E^{F}),\\ P = softmax(h_{L}W_{L}), \end{array} \end{aligned}$$
(3)
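Eq. (3) can be traced with a small pure-Python sketch: max pooling over the \(T+24\) positions yields \(h_{L}\in \mathbb {R}^{D}\), which is projected by \(W_{L}\) and normalized with softmax. The toy matrices and the function names are hypothetical, and no bias term is used because Eq. (3) has none.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def max_pool_classify(E_F, W_L):
    """E_F: (T+24) x D rows of features; W_L: D x C weight matrix.

    Returns (h_L, P): the max-pooled vector and the class probabilities.
    """
    dim = len(E_F[0])
    h_L = [max(row[d] for row in E_F) for d in range(dim)]  # max over positions
    num_classes = len(W_L[0])
    logits = [sum(h_L[d] * W_L[d][c] for d in range(dim)) for c in range(num_classes)]
    return h_L, softmax(logits)

# Toy input: 3 positions, D = 2, C = 2 classes.
E_F = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]]
W_L = [[1.0, -1.0], [0.5, -0.5]]
h_L, P = max_pool_classify(E_F, W_L)
```

The pooled vector keeps, for each feature dimension, the strongest activation found anywhere in the concatenated text and sentiment positions.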

where \(W_{L}\in \mathbb {R}^{D\times C}\), C is the number of categories.

Bidirectional Long Short-Term Memory (BiLSTM). BiLSTM is a combination of a forward LSTM and a backward LSTM, which can better capture bidirectional semantic dependencies. We input the obtained \(E^{F}\) into the BiLSTM to obtain the task-specific hidden representation \(h\in \mathbb {R}^{2H}\), where H is the hidden layer size of the BiLSTM, and then obtain the predicted value P through the softmax function:

$$\begin{aligned} \begin{array}{c} h=BiLSTM(E^{F})=[\overrightarrow{h},\overleftarrow{h}],\\ P=softmax(hW_{2}). \end{array} \end{aligned}$$
(4)

Attention Network. The attention mechanism in seq2seq models removes the limitation that the encoder can only pass on a single final vector, so that the model focuses on the input information that is most important for the output. We apply the attention mechanism to \(E^{F}\) to extract the implicit features in sentences, focus on the words that are important for classification, and generate a task-specific representation \(h_{A}\in \mathbb {R}^{D}\).

$$\begin{aligned} \begin{array}{c} h_{A}=\beta E^{F},\\ \beta _{i}=\frac{exp({E_{i}}^{'} )}{ {\textstyle \sum _{n=1}^{T+24}exp({E_{n}}^{'})} },\\ E^{'}=tanh(E^{F}W_a), \end{array} \end{aligned}$$
(5)

where \(\beta \) is the score function that determines the importance of the words in the whole sentence, \(W_{a}\in \mathbb {R}^{D\times D}\) is the transformation matrix. Similarly, a linear layer with softmax activation as before is stacked on the designed attention layer to output the prediction.
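A minimal sketch of this attention pooling is given below. It assumes the weights are normalized with a softmax over the \(T+24\) positions, computed separately for each feature dimension (since \(W_{a}\in \mathbb {R}^{D\times D}\) gives \(E^{'}\) the same shape as \(E^{F}\)), and that \(h_{A}\) is the resulting weighted sum of the rows of \(E^{F}\). The function name and the toy inputs are hypothetical.

```python
import math

def attention_pool(E_F, W_a):
    """Attention pooling: E' = tanh(E_F W_a), softmax over positions, weighted sum.

    E_F: N x D feature rows; W_a: D x D transformation matrix.
    Returns h_A, a list of D values.
    """
    n_pos, dim = len(E_F), len(E_F[0])
    E_prime = [
        [math.tanh(sum(E_F[n][k] * W_a[k][d] for k in range(dim))) for d in range(dim)]
        for n in range(n_pos)
    ]
    h_A = []
    for d in range(dim):
        column = [E_prime[n][d] for n in range(n_pos)]
        m = max(column)
        exps = [math.exp(v - m) for v in column]  # stable softmax over positions
        total = sum(exps)
        beta = [e / total for e in exps]
        h_A.append(sum(beta[n] * E_F[n][d] for n in range(n_pos)))
    return h_A

# Toy input: 2 positions, D = 2, identity transformation.
h_A = attention_pool([[0.0, 1.0], [2.0, 3.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Because the weights are positive and sum to one, each component of the result lies between the smallest and largest value of that feature across positions, pulled towards the positions with the higher scores.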

Convolutional Neural Network (CNN). The CNN model has proved effective for NLP and achieved excellent results in semantic analysis [27]. In this paper, we apply the convolution kernels of the CNN layer to the review sentence representation \(E^{L}\) to obtain the hidden features \(O_{i}\in \mathbb {R}^{f\times (T-k+1)}\) in the text.

$$\begin{aligned} \begin{array}{c} O_{i}=W\cdot E_{i:i+k-1}^{L},\\ V_{c}=\max \limits _{0\le i\le T-k}(O_{i}). \end{array} \end{aligned}$$
(6)

where f is the number of convolution channels and k is the width of the convolution kernel. \(\cdot \) represents the dot product operation of the matrix, \(i=0,1,2,\ldots ,T-k\) and \(W\in \mathbb {R}^{k\times D}\). The convolution kernel is applied repeatedly along the sentence, and the outputs are fed into the max pooling layer to filter features.

The above describes feature extraction by a single filter. In this paper, m filters of different sizes are used to extract as many features as possible, and these features are concatenated to obtain the review representation \(h_{c1}\in \mathbb {R}^{m\times f}\). We then combine the filtered token-level text features with the sentence-level output of the BERT model and with \(E^{gl}\) and \(E^{sl}\) to obtain the final sentence representation \(h_{c}\in \mathbb {R}^{m\times (f+3D)}\). Finally, \(h_{c}\) is input into the linear layer with softmax activation to obtain the classification result.
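The convolution and max pooling of Eq. (6) can be sketched in pure Python for one channel per kernel width. The toy token matrix, the kernels and the function names are hypothetical. Following Eq. (6), no bias or nonlinearity is applied, and the later concatenation with the BERT sentence output and sentiment representations is omitted.

```python
def conv_max(E, W):
    """Slide a k x D kernel W over the T x D matrix E and max-pool the outputs.

    O_i is the dot product of W with the window E[i : i + k], for
    i = 0 .. T - k; the function returns max_i O_i.
    """
    k, num_tokens, dim = len(W), len(E), len(W[0])
    outputs = [
        sum(W[j][d] * E[i + j][d] for j in range(k) for d in range(dim))
        for i in range(num_tokens - k + 1)
    ]
    return max(outputs)

def multi_width_features(E, kernels):
    """Apply kernels of different widths and collect one pooled feature per kernel."""
    return [conv_max(E, W) for W in kernels]

# Toy sentence: T = 4 tokens, D = 2.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernels = [
    [[1.0, 1.0]],              # width k = 1
    [[1.0, 0.0], [1.0, 0.0]],  # width k = 2
]
features = multi_width_features(E, kernels)
```

In the setting of the paper each width would have f such kernels, so every width contributes f pooled features.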

4 Experiment

4.1 Datasets and the Evaluation Metrics

To verify the effectiveness of the model, we conducted experiments on three public datasets: YelpChi [10], YelpNYC and YelpZIP [11]. The data are real business reviews of restaurants and hotels from different areas, collected from the Yelp website. The average sentence length of real reviews is longer than that of spam reviews, because real reviews describe fine-grained aspects. In sentence-level sentiment analysis, there is no significant difference between spam reviews and real reviews.

We used precision, recall and F1 scores to evaluate the effectiveness of the model. The precision reflects the correctness of the model in predicting spam reviews, and the recall reflects the proportion of correctly predicted spam reviews by the model in all spam reviews. F1 score is the harmonic mean of precision and recall.
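For concreteness, the three metrics can be computed as in the sketch below, which treats spam as the positive class. The label encoding (1 = spam, 0 = real) and the toy predictions are assumptions made for the example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 4 spam reviews, 4 real reviews.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In this toy case 3 of the 4 predicted spam reviews are correct and 3 of the 4 true spam reviews are found, so all three metrics equal 0.75.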

4.2 Baselines and Implementation Detail

In the comparison experiments, we compare the BERT-based models with several advanced existing methods. ABNN [15] is an attention-based neural network that uses an MLP to obtain user behavior features and a CNN to obtain textual features, and combines the two with attention to identify spam reviews. HFAN [16] uses hierarchical fusion attention over users, reviews and products to obtain a review representation that integrates the three for classification. DFFNN [14] is a deep feedforward neural network that combines bag-of-words/n-gram features, word embeddings and multiple emotion indicators of the review sentence as the representation. In addition, we compare the modeling effect of spam review detection layers with different network structures and the influence of different sentiment features on detection ability.

In the embedding layer, we use the pre-trained “BERT-base-uncased” model, with \(L=12\) Transformer layers and hidden size \(D=768\), so the sentence representation dimension is 768; the sentence length is \(T=200\). In the spam review detection layer, the learning rate is set to \(2\times 10^{-5}\), the dropout rate is set to 0.5 and the training batch size is 128. The hidden layer dimension of the BiLSTM is set to 300. In the convolutional neural network, the number of channels per kernel is \(f = 50\), and the kernel width ranges from 1 to 11, so 11 filter sizes are used in total.
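These settings can be gathered into a single configuration object, as in the sketch below. The class and field names are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TBertConfig:
    """Hyperparameters reported in Sect. 4.2 (field names are illustrative)."""
    pretrained_model: str = "bert-base-uncased"
    num_transformer_layers: int = 12   # L
    hidden_size: int = 768             # D
    max_sentence_length: int = 200     # T
    learning_rate: float = 2e-5
    dropout: float = 0.5
    batch_size: int = 128
    bilstm_hidden_size: int = 300      # H
    cnn_channels: int = 50             # f
    cnn_kernel_widths: tuple = tuple(range(1, 12))  # widths 1..11

config = TBertConfig()
num_filter_sizes = len(config.cnn_kernel_widths)
```

Keeping the kernel widths as an explicit tuple makes the count of 11 filter sizes easy to check against the description.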

4.3 Results and Analysis

The Embedding Effect of the BERT Model: The experimental results are shown in Table 2. When only text information is used as the detection feature, BERT + Linear does not match the best of the methods that do not use BERT. However, its recall and F1 value are close to those of models that use several kinds of information, which validates the performance of the BERT model on the spam review detection task. This shows that BERT, which encodes the association between any two tokens, can generate a review representation with rich contextual information for the spam review detection layer.

Table 2. Experimental results of single BERT using only text information

Performance of Different Spam Review Detection Layers: The experimental results are shown in Table 2. When only text information is used as the clue for spam review detection, the precision, recall and F1 values of BERT + ATT, BERT + LSTM and BERT + CNN are higher than those of BERT + Linear. A more powerful network structure therefore brings better results for the spam review detection task than a linear layer alone. This result shows that incorporating context information helps sequence modeling and provides a more effective sentence representation for text classification.

Performance of Different Sentiment Information: The results are shown in Fig. 2(a), Fig. 2(b) and Fig. 2(c). S-BERT refers to Siamese BERT, which takes text information as the input of the first branch of the embedding layer and the user personalized sentiment feature as the input of the second branch. The rest of the model structure is the same as that of T-Bert. As shown in Fig. 2(a), when sentiment features are used as an additional clue for spam review detection, precision does not improve much over using only text information, but recall and F1 value improve greatly, which indicates that fusing fine-grained sentiment information improves the model’s ability to identify spam reviews. Fig. 2(b) and Fig. 2(c) show that detection ability improves further when the product and user dimensions are combined, that is, when group intelligence is combined with the user’s personalized sentiment. This supports our hypothesis that the effective use of group intelligence can better detect spam reviews.

Fig. 2. Performance of different models on the YelpChi dataset using different sentiment features.

Compared with the results in Table 2, the recall and F1 value of T-Bert are greatly improved. Compared with existing methods, the recall and F1 value averaged over the three datasets improve by 4.6% and 2.4%, respectively. The experimental results verify the effectiveness and feasibility of the proposed strategy. However, there is still room for improvement: the gain in precision is smaller. The reason is that transfer learning is used to obtain the fine-grained aspect annotations, and the annotation accuracy cannot reach 100%; annotation errors in turn affect the accuracy of the subsequent spam review detection. Further improving the model is our next research plan.

5 Conclusion

In this paper, we propose a new research strategy for the spam review detection task and verify the effectiveness of the BERT component on this task. Specifically, we propose to improve spam review detection by using group intelligence and user personalized sentiment information. To use the intelligence of the group effectively, we combine group intelligence and user personalized sentiment information with text information to generate a multidimensional representation, and propose a new model, Triple BERT (T-Bert), based on the structure of the Triple Network and the BERT component. We use the BERT model as the embedding layer to generate review representations with rich contextual information and couple the BERT component with multiple neural models. Extensive experiments on three benchmark datasets verify the effectiveness of the proposed strategy. The results show that BERT performs well on spam review detection and improves the effectiveness of the T-Bert model.