
1 Introduction

The internet contains a wealth of reviews and opinions on almost any topic. User-generated content comes in various forms and sizes, ranging from objective to subjective opinions. Postings in internet forums and user comments on websites are important sources of information. In the information age, people's decision-making is affected by the opinions of others [4]. When a person wants to change jobs, he or she will start by searching for reviews and opinions written by employees and former employees of the companies on his or her wish list. However, the number of reviews is often very large, which causes many reviews and opinions to go unnoticed, even though some of them are very helpful. As a result, predicting the helpfulness of a review is very important.

Many websites rank reviews by publication time, product rating, user voting, etc. Compared with sorting by publication time or product rating, user voting tends to be more helpful, since its results accumulate over many visitors. For example, Amazon.com employs a voting system that collects feedback by asking "Was this review helpful to you? Yes/No". It would be useful to rank reviews by quality as soon as they are posted; this would save a great deal of time spent browsing pages to find helpful reviews. However, user voting mechanisms are controversial, suffering from the imbalance vote bias, the winner circle bias and the early bird bias [14]. These kinds of bias show that a voting system is not the best choice for ranking user-generated content.

Previous works approximate the ground truth of helpfulness from users' voting results: if X out of Y users consider a review helpful, its helpfulness score is X/Y. However, it is hard to collect the right value of Y. For example, when a user opens a product detail page with many reviews, he may just read the basic product information and leave; it is hard to decide whether we should add 1 to the Y of every review on that page. In addition, the voting itself can be influenced by many factors, such as page structure adjustments, review recommendation, etc.
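Formally, this voting-based ground truth can be written as

$$\mathrm{helpfulness}(r) = \frac{X}{Y},$$

where X is the number of users who voted the review r helpful and Y is the total number of users who voted on it.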

In this paper, we model review helpfulness prediction as a regression problem and analyze the performance of features used in previous research. Many studies [3, 5, 10, 18, 25, 26] focus on exploring new features to model review text and thereby improve results on the task. However, novel features are limited by data resources, the language of the text, third-party tools, etc. To overcome these limitations, we introduce word embedding features to model review text. Experimental results show that word embedding features outperform the features used in previous research. From the viewpoint of dimensionality reduction, we also compare unigram features processed with Latent Semantic Analysis (LSA) against other features; the results show that LSA applied to unigram features yields better performance.

The following section discusses related work on review helpfulness prediction. The definition of helpfulness prediction and the format of the data used in our experiments are given in Sect. 3. Details of the features used in our approach are introduced in Sect. 4. Experiments and evaluation metrics are described in Sect. 5, where we also discuss and analyze the results. We conclude and present directions for future research in the last section.

2 Related Work

Presenting helpful content to visitors is an important component of any content-centric website. Engineers of such websites have been committed to improving the click-through rate of reviews, either with normal ranking mechanisms or carefully improved ones. Consequently, there has been plenty of research on various aspects of ratings and the quality of review content.

Some studies focus on finding the most helpful features for predicting the quality of review content [9, 10, 14, 20, 25], while others focus on exploring new algorithms [5, 13, 22, 26, 27].

In the research of Kim et al. [9], lexical, structural, syntactic, semantic and metadata-related features were used for automatic helpfulness prediction. Text surface features and unigrams proved to be the most helpful features and were widely used in later research.

Zhang and Varadarajan [27] built a regression model incorporating a diverse set of features and achieved highly competitive utility-scoring performance on three real-world datasets. Their experiments also showed that shallow syntactic features were the most influential predictors.

Liu [14] worked on detecting low-quality reviews, introducing features to model the informativeness, subjectiveness and readability of a review and classifying reviews as high or low quality.

Yang et al. [25] hypothesized that helpfulness is an internal property of text and introduced LIWC and INQUIRER semantic features to model review text. Their experiments showed that the two semantic features could accurately predict helpfulness scores and greatly improve performance compared with previously used features.

RevRank [22] is an unsupervised algorithm for ranking the helpfulness of online book reviews. It first constructs a lexicon of dominant terms across reviews, then creates a virtual core review based on this lexicon; the distance between the virtual review and each real review determines the overall helpfulness ranking.

Hong et al. [5] developed a binary helpfulness classification system using a set of novel features based on needs fulfillment, information reliability and a sentiment divergence measure. Their system outperformed earlier approaches on the same dataset.

Lee and Choeh [13] proposed a neural network model for helpfulness prediction that uses product, review and reviewer characteristics as features. This was the first study to predict helpfulness with neural networks, and the authors showed that their model outperforms a conventional linear regression model.

Rong Zhang et al. [26] proposed a comment-based collaborative filtering approach that captures correlations between hidden aspects in review comments and numeric ratings. They also estimated the aspects of comments based on profiles of users and items; their model outperformed the baseline system on a Chinese review dataset.

Srikumar [10] proposed a predictive model that extracts novel linguistic category features by analysing the textual content of reviews, together with review metadata, subjectivity and readability features, for helpfulness prediction. He showed that the proposed linguistic category features were better predictors of review helpfulness for experience goods.

Table 1. An example review from the Amazon dataset

3 Task Definition

In this section, we define the task of review helpfulness prediction (RHP) and introduce the data format of Amazon.com reviews. This dataset has been successfully used in related review helpfulness prediction studies. All data analysis, illustrations and experiments in this paper are based on this dataset.

3.1 Task of RHP

The task of RHP is to automatically predict the helpfulness scores of a specific product's reviews. To eliminate interference from external information, only textual information is considered, rather than human interaction information such as user background, user level, etc. An RHP system should assign a high score to a review that received a high manual voting score and a low score to a review that received a low one.

Therefore, given a set of reviews, an RHP system should output each review's helpfulness score. We treat this as a regression task over reviews with respect to their helpfulness.
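Stated more formally (a formulation we add for clarity; the paper itself gives only the prose definition), the task is to learn a function that maps each review to a real-valued score approximating its voting-based helpfulness:

$$f : r_i \mapsto \hat{y}_i \in [0, 1], \qquad \hat{y}_i \approx \mathrm{helpfulness}(r_i).$$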

3.2 Amazon.com Data Format

We use the Amazon review data originally prepared for opinion spam detection [6]. This dataset provides 5.8 million reviews of products sold on Amazon. Each review contains a product number, date, number of helpful feedbacks, number of feedbacks, rating, title and body. An example review from this dataset is shown in Table 1.

In this paper, we only consider the body part of each review as the available local resource for RHP. The 'body' field gives the content of a review. Other fields, such as 'title' and 'rating', are not available for every product in this dataset; to avoid dealing with missing information, we do not use the title or other optional fields in this experiment. The length of the 'body' field varies widely across reviews. The word count distribution of the reviews in the corpus is given in Fig. 1.

Fig. 1. Word count distribution in the corpus

4 Features

To make the experiment reproducible, only text-based features are used and discussed in this work. Text surface features [9, 15, 17, 24], unigram features [1, 9, 24] and part-of-speech (POS) features [9, 10, 15] are widely used in previous research, so we consider them as baselines.

4.1 Surface Features

Following previous research [24, 25], the text surface features we use are shown in Table 2. These features have proven effective and are easy to implement for a new corpus.

Table 2. The description of surface features
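Since Table 2 is not reproduced here, the following is a minimal sketch of how surface statistics of this kind are typically computed; the specific features below are illustrative assumptions, not necessarily the exact set in Table 2.

```python
# Hypothetical surface-feature extraction; the exact statistics
# listed in Table 2 may differ from the ones sketched here.
import re

def surface_features(body: str) -> dict:
    words = body.split()
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    return {
        "num_words": len(words),
        "num_sentences": len(sentences),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "num_exclamations": body.count("!"),
    }
```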

4.2 Unigram Features

Previous work [25] showed that unigram features are reliable for review helpfulness prediction. After removing all stop words and all words with frequency lower than 10, we build a vocabulary. Each review is then represented as a word vector whose values are TF-IDF weights.
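As a sketch of this representation with scikit-learn (a library choice we assume, since the paper does not name its tooling; `review_bodies` is an assumed list of review body strings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigram TF-IDF vectors: drop English stop words; min_df=10 prunes
# rare words (document frequency, used here as a proxy for the
# word-frequency threshold described above).
vectorizer = TfidfVectorizer(stop_words="english", min_df=10)
X_unigram = vectorizer.fit_transform(review_bodies)  # (n_reviews, |vocab|)
```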

In addition, to obtain semantic features and save training time, we also employ LSA [11] to reduce the dimensionality of the vector space: each review represented with unigram features is re-represented in a lower-dimensional vector space.
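A corresponding LSA step can be implemented with truncated SVD, a standard realization of LSA; the target dimensionality below is an assumption, as the paper does not state it:

```python
from sklearn.decomposition import TruncatedSVD

# Project the sparse TF-IDF space into a low-dimensional latent
# semantic space. n_components=100 is an assumed value for illustration.
lsa = TruncatedSVD(n_components=100, random_state=0)
X_luf = lsa.fit_transform(X_unigram)  # (n_reviews, 100)
```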

4.3 POS Features

The effectiveness of part-of-speech (POS) features has been demonstrated in previous research, and there is little difference among ways of implementing them, which makes them a reasonable choice for RHP. We use the following POS features: the number of noun, adjective, verb and adverb words.
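A minimal sketch of these counts, assuming NLTK's Penn Treebank tagger (the paper does not specify which tagger it uses):

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data

def pos_features(body: str) -> dict:
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(body))]
    # Penn Treebank tag prefixes: NN* nouns, JJ* adjectives,
    # VB* verbs, RB* adverbs.
    return {
        "num_nouns": sum(t.startswith("NN") for t in tags),
        "num_adjectives": sum(t.startswith("JJ") for t in tags),
        "num_verbs": sum(t.startswith("VB") for t in tags),
        "num_adverbs": sum(t.startswith("RB") for t in tags),
    }
```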

4.4 Word Embedding Features

We use the Gensim tool to learn word embeddings from the provided 5.8 M Amazon product reviews, with the following settings:

1. We removed non-English reviews, which reduces the corpus to 5.5 M reviews.

2. We used the skip-gram model with window size 5 and filtered out words with a frequency less than 10.

We use word embeddings of size 100, i.e., the dimension of the output vector is 100. This matches the default settings of other tools, such as word2vec. Details of computing word embedding features are described in previous research [16, 19].
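A sketch of this setup in Gensim (4.x parameter names; the hyperparameters follow the settings above, while averaging word vectors into a review vector is our assumption about the aggregation in [16, 19]):

```python
import numpy as np
from gensim.models import Word2Vec

# tokenized_reviews: a list of token lists, one per review (assumed prepared).
model = Word2Vec(
    sentences=tokenized_reviews,
    vector_size=100,   # embedding dimension, as in the paper
    window=5,          # skip-gram window size
    min_count=10,      # drop words with frequency < 10
    sg=1,              # use the skip-gram model
)

def review_embedding(tokens):
    """Average the embeddings of in-vocabulary tokens (assumed aggregation)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)
```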

5 Experiments and Results

We empirically evaluate our approach, described in Sect. 4, by comparing the performance of different feature combinations. Below, we describe our experimental setup and choice of evaluation metric, present our results, and analyze the performance of the different features.

5.1 Evaluation Setup and Evaluation Metrics

In order to predict the helpfulness scores of reviews, we focus on reviews with helpful-feedback voting in the Amazon dataset. To remove duplicate reviews, we use Hong's [5] deduplication method to filter out redundant reviews. Many reviews lack voting or feedback information, so we also filter out reviews with fewer than 100 feedbacks.
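A sketch of this filtering step, assuming the reviews have been loaded into a pandas DataFrame; the file name and the column names helpful_votes and total_votes are hypothetical stand-ins for the dataset's feedback fields:

```python
import pandas as pd

# 'helpful_votes' and 'total_votes' are hypothetical names for the
# "number of helpful feedbacks" and "number of feedbacks" fields.
df = pd.read_csv("amazon_reviews.csv")  # assumed input file
df = df[df["total_votes"] >= 100]       # keep only well-voted reviews
df["helpfulness"] = df["helpful_votes"] / df["total_votes"]
```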

Fig. 2. Distribution of review ratings

The final dataset contains 19,030 reviews of 9,805 products. The distribution of review ratings is shown in Fig. 2. To obtain the helpfulness voting score, we follow the annotation of review quality defined by Liu [14]. On this basis, we test each group of feature combinations on the whole dataset.

In the training process, we use three regression methods: Linear Regression (LR), Linear Support Vector Regression (LSVR) and Support Vector Regression (SVR) [21]. In the evaluation process, we run 10-fold cross validation. The original Amazon ratings are not used as ground truth, because those stars are assigned by the author to the product, not to the review text.
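A sketch of this evaluation protocol with scikit-learn (a library choice we assume; X is any feature matrix from Sect. 4, y the voting-based helpfulness scores, and the scores are reported with the RMSE metric introduced below):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR, LinearSVR

models = {
    "LR": LinearRegression(),
    "LSVR": LinearSVR(),
    "SVR": SVR(),
}
for name, model in models.items():
    # 10-fold cross validation; scikit-learn returns negated MSE.
    mse = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_squared_error")
    print(name, "RMSE: %.4f" % np.sqrt(mse.mean()))
```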

In our experiment, we use the Root Mean Square Error (RMSE) metric to evaluate the performance.
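For n reviews with voting-based scores y_i and predicted scores ŷ_i, RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.$$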

5.2 Results

In this experiment, we test the performance of the single feature groups described in Sect. 4; the results are shown in Table 3. Different combinations of features are also tested, with results shown in Table 4.

Table 3. RMSE of single feature

Feature Performance. The first group of results forms the baseline of this experiment. As described in previous research, surface features (SF) capture statistical information, and they are used as the baseline.

The second group of results covers unigram features with LSA (LUF) and word embedding features (EF). LUF achieves better performance than the word embedding features, although the difference between them is not large. Compared with unigram features (UF) without LSA, LUF improves performance considerably. The word embedding features also perform better than plain unigram features.

Table 4. RMSE of feature combinations

The first group in Table 4 covers combinations of UF, SF and PF (POS features), which we use as feature-combination baselines.

The second group shows the performance of LUF combined with other features. Compared with the corresponding unigram-feature combinations, this group achieves a notable improvement (about 13%).

The third group shows the performance of word embedding features combined with other features. The results show that combinations with EF perform better than the corresponding combinations with UF and LUF, which verifies the effectiveness of the EF features.

The last group in Table 4 shows combinations of all the features. The results show that the combination of LUF, SF, PF and EF performs better than that of UF, SF, PF and EF, demonstrating once more that LUF improves performance. Moreover, this combination achieves the best performance among all combinations.

Table 5. Model performance

Regression Model Performance. Furthermore, we try to find the relationship between the features and the underlying model of helpfulness prediction. For each feature in Table 3 and each feature combination in Table 4, we record which of the three regression models performs best. The statistics are shown in Table 5. Linear regression is best most often for both single features and feature combinations, and Linear SVR also performs better than SVR. This suggests that a linear relationship exists between these features and the helpfulness of reviews.

6 Conclusions and Future Work

Until now, the helpfulness of reviews has been studied with various kinds of features, including unigram features, text structural features, part-of-speech features, semantic features, etc. However, the features used in previous research produce results that are too unreliable to form the basis of a discourse-level prediction. We assert that the helpfulness of an online review should be predicted from its hidden structural information and lexical information. In this paper, we first define the task of review helpfulness prediction and then introduce word embedding features to predict the helpfulness score. Our experiments show that word embedding features lead to a substantial improvement over previously used features. In addition, we apply LSA to unigram features, and the results show that LSA leads to a substantial improvement over plain unigram features. By testing different feature combinations, we also analyze the hidden relationship between features and the helpfulness of a review.

In the future, we will test prediction performance on different corpora and explore prediction with deep learning [12]. Convolutional neural networks (CNNs) have proved effective in modeling sentences [8], text categorization [7, 23] and machine reasoning [2]. We will further investigate how to bring CNNs into this research to predict the helpfulness of reviews.