
1 Introduction

With the rapid development of Web 2.0, Internet users have changed from information recipients into information producers. In particular, the growth of online shopping platforms has led more and more consumers to share their shopping experience and product feedback on the Internet. This information is presented as semi-structured or unstructured data, such as text, numbers and images, and provides a more objective reference for other potential consumers. A survey found that 63% of online shoppers consistently search for and read reviews before making a purchasing decision. Among them, 64% spend 10 min reading reviews, whereas 33% spend 30 min or more. Although many scholars have adopted content-based recommendation, collaborative filtering and knowledge-based recommendation methods in recommendation research, relatively few recommendation methods are based on online review information [1, 2]. Early studies used the numerical scores in online reviews to rank products. Mary et al. selected 8 million product reviews and 1.5 million merchant reviews to evaluate quality and rank different products, and found that the average value performs better [3]. However, because numerical scores are fuzzy, they cannot adequately reflect users' sentiment towards the distinctive characteristics of a product. That is, products with the same score cannot be distinguished from each other [4]. As valuable user-generated content, review texts better reflect users' emotional expression. Khan et al. proposed a new text sentiment analysis method (eSAP) for decision support systems and carried out experiments with data sets from seven different fields to verify the effectiveness of the method [5]. To address unbalanced sentiment classification in Chinese product reviews, Tian et al. proposed a new method based on topic sentences to improve classification accuracy [6].

However, text reviews cannot cover comprehensive information, since most of them address only one or a few aspects of a product. Recently, models that integrate numerical scores and review texts have attracted a lot of attention [7, 8]. In existing product evaluation methods, the fusion of different data is rather simple, and the weights are mostly determined by subjective decision or by weighted averaging. Yang et al. proposed a method based on multi-source heterogeneous information. However, only online reviews are taken into consideration, and the subjective opinions of consumers are neglected when calculating weights. Zhang et al. proposed a method to calculate weights based on the validity of comments and the importance of time, where the former is determined by the votes from other consumers and the latter by the time at which the review was published. However, when calculating the overall score of a product, the same weight is given to every aspect, which cannot reflect their different importance to consumers [10]. Other methods, such as intuitionistic fuzzy set theory and TOPSIS, are also used for product ranking [11, 12].

There are different types of online reviews, such as numeric ratings and text descriptions. Such heterogeneous information makes purchase decisions more complex for customers. Zhang et al. used a directed graph model to evaluate the relative advantages of products, but did not analyze the node weights in the network [13]. In such a graph, a node represents the comprehensive sentiment value of a product, the direction of an edge represents the relative advantage between two products, and the weight of an edge represents the value of that relative advantage.

In this paper, a directed graph model is proposed to integrate such rich and heterogeneous information. A modified PageRank method is used to calculate the value of each node and to rank different products. The rest of this paper is organized as follows. In Sect. 2, a method to mine and integrate text and numerical information is proposed; a directed graph model is constructed and the importance of each node is calculated by an improved PageRank algorithm. In Sect. 3, an experimental study based on online automobile reviews is conducted to illustrate the feasibility of the proposed method. Finally, in Sect. 4, the contributions of this paper and future work are summarized.

2 Research Methods

In order to make full use of user-generated content for more effective ranking of different products, this paper analyzes aspect-based online text reviews together with numerical data. A directed graph model is used for information fusion, and an improved PageRank method is used to compute the node values, which are the final scores of the different products. The research framework is shown in Fig. 1.

Fig. 1. Research framework

2.1 Text Information Processing Method

In this section, we first formulate the problem of aspect-based products ranking. Then we describe the method of sentiment intensity calculation and aspect-based weight determination.

2.1.1 Sentiment Intensity Calculation Method

A large number of online aspect-based reviews are crawled from a third-party website. The set \( P = \left\{ {P_{1} , P_{2} , \ldots ,P_{i} , \ldots ,P_{I} } \right\} \) denotes the set of I alternative products, where \( P_{i} \) is the ith alternative product. The original corpus is denoted as \( C_{0} = \left\{ {r_{i1} ,r_{i2} , \ldots ,r_{ij} , \ldots ,r_{IN} } \right\} \), where \( r_{ij} \) is the jth online review of alternative product \( P_{i} \). First, the original corpus is preprocessed, including word segmentation and stop-word removal. The result is a word set denoted as \( W_{ij} = \left\{ {w_{ij1} ,w_{ij2} , \ldots ,w_{ijm} , \ldots ,w_{IJM} } \right\} \), where \( w_{ijm} \) is the mth word segmentation result in the jth online review of alternative product \( P_{i} \).

In this paper, the sentiment word lexicon established by the information retrieval team of Dalian University of Technology is used as the basis of text analysis [14]. The sentiment word lexicon is denoted by \( S{\text{W}}_{0} = \left\{ {sw_{1} ,{\text{sw}}_{2} , \ldots ,sw_{q} , \ldots ,sw_{Q} } \right\} \), where \( {\text{sw}}_{q} \) denotes the qth sentiment word. The sentiment intensity of sentiment word is denoted by \( S{\text{V}}_{0} = \left\{ {sv_{1} ,{\text{sv}}_{2} , \ldots ,sv_{q} , \ldots ,sv_{Q} } \right\} \), where \( {\text{sv}}_{\text{q}} \) denotes the sentiment intensity of the qth sentiment word in the lexicon. Let \( W_{ij}^{k} = \{ W_{ij}^{1} ,W_{ij}^{2} , \ldots ,W_{ij}^{k} , \ldots ,W_{ij}^{n} \} \), where \( W_{ij}^{k} \) denotes the word segmentation result of the jth online review concerning the kth aspect of product \( P_{i} \). Let \( S_{ij}^{k} = \{ S_{ij}^{1} ,S_{ij}^{2} , \ldots ,S_{ij}^{k} , \ldots ,S_{ij}^{n} \} \), where \( S_{ij}^{k} \) denotes the sentiment intensity of the kth aspect in the jth online review of alternative product \( P_{i} \). The sentiment intensity calculation process is given below.

  • Step 1: If the ith word of \( W_{ij}^{k} \) appears in \( S{\text{W}}_{0} \) and its corresponding sentiment word is \( sw_{i} \), then its sentiment intensity and sentiment polarity are denoted as \( s{\text{v}}_{i} \) and \( sp_{i} \), respectively.

  • Step 2: Take the location of \( W_{ij}^{kn} \) as the center and two characters as the window size, and determine whether any negation word occurs within this range. If not, the modified sentiment intensity is \( sv_{i}^{{\prime }} = sv_{i} \); otherwise, it is \( sv_{i}^{{\prime }} = - sv_{i} \).

  • Step 3: Determine the sentiment polarity \( sp_{i} \). If \( sp_{i} = 1 \), which means the review is positive, then \( S_{ij}^{k} = S_{ij}^{k} + sv_{i}^{{\prime }} \); otherwise, \( S_{ij}^{k} = S_{ij}^{k} - sv_{i}^{{\prime }} \).

In this paper, we assume that reviews from different customers for the kth aspect of the ith product carry equal weight, that is, \( S_{ij}^{k} \) and \( S_{ij + 1}^{k} \) are treated without individual differences. Therefore, the sentiment intensity of the kth aspect of alternative product \( P_{i} \) can be calculated as:

$$ S_{i}^{k} = \frac{1}{\text{n}}\sum\limits_{{{\text{j = }}1}}^{\text{n}} {S_{ij}^{k} } $$
(1)
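Steps 1 to 3 and Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the toy lexicon `LEXICON`, the negation list `NEGATIONS`, the English tokens, and the coding of negative polarity as -1 are all hypothetical, and the window is counted in tokens rather than in Chinese characters.

```python
from statistics import mean

# Hypothetical toy lexicon: word -> (intensity sv, polarity sp).
# sp = 1 marks a positive word (as in Step 3); -1 for negative is an assumed coding.
LEXICON = {"comfortable": (5.0, 1), "spacious": (3.0, 1), "noisy": (4.0, -1)}
NEGATIONS = {"not", "never", "no"}  # assumed negation words
WINDOW = 2  # window size, counted in tokens here (characters in the paper)


def review_intensity(tokens):
    """Return S_ij^k for one aspect-level review given as a list of tokens."""
    score = 0.0
    for pos, word in enumerate(tokens):
        if word not in LEXICON:          # Step 1: look the word up in the lexicon
            continue
        sv, sp = LEXICON[word]
        lo, hi = max(0, pos - WINDOW), pos + WINDOW + 1
        negated = any(t in NEGATIONS for t in tokens[lo:hi])
        sv_mod = -sv if negated else sv  # Step 2: flip the sign under negation
        score += sv_mod if sp == 1 else -sv_mod  # Step 3: add or subtract by polarity
    return score


def aspect_intensity(reviews):
    """Eq. (1): average S_ij^k over the n reviews of one aspect."""
    return mean(review_intensity(r) for r in reviews)


reviews = [["seats", "comfortable"], ["cabin", "not", "spacious"], ["engine", "noisy"]]
aspect_score = aspect_intensity(reviews)
print(aspect_score)
```

In this toy input the three reviews contribute 5, -3 and -4, so the aspect-level intensity is their mean.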

2.1.2 Aspect-Based Weight Determination Method

In order to obtain the overall sentiment intensity of product \( P_{i} \), we need to calculate the weight of each aspect. LDA (Latent Dirichlet Allocation) is a typical topic model, and it is a generative directed graph model. It is mainly used to deal with discrete data, and has a wide range of applications in information retrieval, natural language processing and other fields.

The relationships among the LDA variables are described in Fig. 2, where hollow circles represent hidden variables and solid circles represent observable variables. Only the word frequency \( w_{t,n} \) of word n in document t is observable. It depends on the topic \( Z_{t,n} \) of word n in document t and on the word distribution \( \beta_{k} \) of topic k. Meanwhile, \( Z_{t,n} \) depends on the topic distribution \( \theta_{t} \), \( \theta_{t} \) depends on the parameter \( \alpha \) of a Dirichlet distribution, and \( \beta_{k} \) depends on the parameter \( \eta \). Accordingly, the probability distribution of LDA is given as follows:

$$ p(W,z,\beta ,\theta |\alpha ,\eta ) = \prod\limits_{k = 1}^{K} {p(\beta_{k} |\eta )} \prod\limits_{t = 1}^{T} {\left( {p(\theta_{t} |\alpha )\prod\limits_{n = 1}^{N} {p(w_{t,n} |z_{t,n} ,\beta )\,p(z_{t,n} |\theta_{t} )} } \right)} $$
(2)

where \( p(\theta_{t} |\alpha ) = \frac{{\varGamma (\sum\nolimits_{k} {\alpha_{k} } )}}{{\prod\nolimits_{k} {\varGamma (\alpha_{k} )} }}\prod\limits_{k} {\theta_{t,k}^{{\alpha_{k} - 1}} } \), and \( p(\theta_{t} |\alpha ) \) and \( p(\beta_{k} |\eta ) \) usually follow K-dimensional and N-dimensional Dirichlet distributions with parameters \( \alpha \) and \( \eta \), respectively. The aspect-based weight calculation process is given below.

Fig. 2. LDA model

  • Step 1: Using prior knowledge, determine the number of topics \( K \). The topic-word probability matrix is obtained from \( \beta_{k} \), and a word cloud is generated for each topic from it. This visualization displays each topic more intuitively and allows the practical meaning of each topic to be summarized in light of the application background;

  • Step 2: From \( \theta_{t} \), obtain the document-topic probability matrix, which gives the proportion \( \theta_{t,k} \) of topic k in document t. Let \( \alpha_{k} \) be the weight of topic k, computed as the average proportion over all T documents:

    $$ \alpha_{k} = \frac{1}{T}\sum\limits_{t = 1}^{T} {\theta_{t,k} } $$
    (3)
  • Step 3: Compute the comprehensive sentiment value \( TS_{i} \) of a product. It consists of two parts. The first is the objective sentiment value based on consumers' product reviews, i.e.,

    $$ S_{i} = \sum\limits_{k = 1}^{K} {\alpha_{k} S_{i}^{k} } $$
    (4)

The second is the subjective sentiment value based on the preferences of consumers who intend to purchase, i.e.,

$$ S_{i}^{{\prime }} = \sum\limits_{k = 1}^{K} {\beta_{k} S_{i}^{k} } $$
(5)

where \( \beta_{k} \) here represents the consumer's personal preference weight for the kth aspect of the product (not the topic-word distribution in Eq. (2)). Therefore, the comprehensive sentiment value of a product \( P_{i} \) can be computed as:

$$ TS_{i} = \delta S_{i} + ( 1- \delta )S_{i}^{{\prime }} $$
(6)
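The weight determination and fusion in Eqs. (3) to (6) reduce to a column average followed by two weighted sums. The sketch below assumes that Eq. (3) averages the proportion of topic k over all documents and that \( \delta \) lies in [0, 1]; the matrix `theta`, the preference vector `beta`, the aspect scores and the value of `delta` are made-up illustrative inputs, not data from this paper.

```python
def topic_weights(theta):
    """Eq. (3): average the proportion of each topic over all T documents."""
    n_docs = len(theta)
    n_topics = len(theta[0])
    return [sum(doc[k] for doc in theta) / n_docs for k in range(n_topics)]


def weighted_sum(weights, aspect_scores):
    """Shared form of Eqs. (4) and (5): sum_k weight_k * S_i^k."""
    return sum(w * s for w, s in zip(weights, aspect_scores))


def comprehensive_sentiment(alpha, beta, aspect_scores, delta):
    """Eq. (6): blend the objective and subjective sentiment values."""
    s_objective = weighted_sum(alpha, aspect_scores)   # Eq. (4)
    s_subjective = weighted_sum(beta, aspect_scores)   # Eq. (5)
    return delta * s_objective + (1 - delta) * s_subjective


# Hypothetical document-topic matrix: 2 documents, 3 topics (rows sum to 1).
theta = [[0.5, 0.3, 0.2],
         [0.1, 0.6, 0.3]]
alpha = topic_weights(theta)       # topic weights from the corpus
beta = [0.2, 0.2, 0.6]             # assumed personal preference weights
aspect_scores = [4.0, 2.0, -1.0]   # assumed S_i^k values from Eq. (1)
ts = comprehensive_sentiment(alpha, beta, aspect_scores, delta=0.5)
print(alpha, ts)
```

Setting `delta` closer to 1 lets the review-derived weights dominate, while a value closer to 0 favors the stated preferences of the individual consumer.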

2.2 Numerical Information Processing Method

Numerical information is adopted by many scholars because it is easy to obtain and intuitive to interpret. A numerical value represents the degree of the consumer's satisfaction with the product. This paper analyzes it from two aspects: the numerical rating of the product itself and the numerical rating based on comparison.

  a. The numerical rating based on itself

    In this paper, we assume that ratings from different customers for the jth aspect of a product carry equal weight. \( R_{ij}^{k} \) denotes the numerical rating of the jth aspect in the kth online review of alternative product \( P_{i} \). The comprehensive numerical rating of product \( P_{i} \) concerning the jth aspect can therefore be computed as the average over its K reviews:

    $$ R_{ij} = \frac{1}{K}\sum\limits_{k = 1}^{K} {R_{ij}^{k} } $$
    (7)

    In summary, the comprehensive numerical rating of product \( P_{i} \) can be computed by weighting each aspect rating with the topic weight of that aspect, \( \alpha_{j} \), obtained from Eq. (3):

    $$ TR_{i} = \sum\limits_{j = 1}^{J} {\alpha_{j} R_{ij} } $$
    (8)
  b. The numerical rating based on the comparison

    Nowadays, many third-party platforms provide comparison sections for homogeneous products. Let \( CR_{ij} \) be the comparative rating concerning the jth aspect of alternative product \( P_{i} \). Let \( PR_{ij} \) be the superiority of product \( P_{i} \) over other products of the same level in the jth aspect. If \( P_{i} \) is superior to the other products in the jth aspect, then \( PR_{ij} > 0 \). If it is inferior to the other products, \( PR_{ij} < 0 \). If they are the same in the jth aspect, \( PR_{ij} = 0 \). The relative comparative superiority of product \( P_{i} \) over \( P_{m} \) can then be computed as:

    $$ Q_{im} = \frac{1}{J}\frac{{\sum\limits_{j = 1}^{J} {CR_{ij} } }}{{\sum\limits_{j = 1}^{J} {CR_{mj} } }} $$
    (9)
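The numerical side of the method can be illustrated with three small functions. This is a hedged sketch with made-up ratings and weights: Eq. (8) is read here as a sum over aspects with one weight per aspect, and Eq. (9) is implemented exactly as printed, including the 1/J factor.

```python
def aspect_rating(ratings):
    """Eq. (7): mean rating of one aspect over its K reviews."""
    return sum(ratings) / len(ratings)


def overall_rating(weights, aspect_ratings):
    """Eq. (8), read as a weighted sum of the aspect ratings."""
    return sum(w * r for w, r in zip(weights, aspect_ratings))


def relative_superiority(cr_i, cr_m):
    """Eq. (9): (1/J) * sum_j CR_ij / sum_j CR_mj for two products."""
    n_aspects = len(cr_i)
    return (sum(cr_i) / sum(cr_m)) / n_aspects


# Hypothetical ratings of one product: two aspects, a few reviews each.
ratings_by_aspect = [[4, 5, 3], [2, 4]]
r_i = [aspect_rating(r) for r in ratings_by_aspect]
tr_i = overall_rating([0.6, 0.4], r_i)   # assumed aspect weights

# Hypothetical comparative ratings of two products over three aspects.
q_im = relative_superiority([3, 4, 5], [2, 2, 2])
print(r_i, tr_i, q_im)
```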

2.3 Information Fusion Network Construction Method

We construct a directed graph model \( G(V,E,Q^{v} ,Q^{E} ) \). A node \( V \) denotes a product, and an edge \( E \) denotes a directed connection between two products. \( {\text{Q}}^{v} \) denotes the weight of node V, which is the comprehensive sentiment value \( TS_{i} \). \( {\text{Q}}^{E} \) denotes the weight of edge E, that is, the relative comparative superiority \( q_{im} \), which can be computed as:

$$ q_{im} = \frac{{\left| {{\text{Q}}^{i} - {\text{Q}}^{m} } \right|}}{{\min ({\text{Q}}^{i} ,{\text{Q}}^{m} )}} $$
(10)

If \( q_{im} > 1 \), the edge is directed from \( P_{m} \) to \( P_{i} \). If \( q_{im} < 1 \), the direction is reversed. If \( q_{im} = 1 \), there is no connection between these two nodes.
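As a small illustration of Eq. (10), the edge weights of all product pairs can be computed from the node weights as follows. The node weights are hypothetical and assumed to be positive so that the denominator is well defined; edge directions are deliberately left out of this sketch.

```python
from itertools import combinations


def edge_weight(q_i, q_m):
    """Eq. (10): absolute difference of two node weights over the smaller one."""
    return abs(q_i - q_m) / min(q_i, q_m)


# Hypothetical positive node weights (comprehensive sentiment values TS_i).
node_weights = {"P1": 2.0, "P2": 2.5, "P3": 4.0}

edge_weights = {
    (a, b): edge_weight(node_weights[a], node_weights[b])
    for a, b in combinations(node_weights, 2)
}
print(edge_weights)
```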

There are many ways to calculate the importance of network nodes. The classical search engine page ranking algorithm PageRank is based on the idea that pages linked from high-quality web pages are themselves likely to be high-quality web pages [15]. Similarly, a product that has a comparative advantage over another highly evaluated product should itself be highly evaluated. We use an improved PageRank algorithm to calculate the importance of each node. Assume that there are N nodes in a directed graph, indexed by n = 1, 2, 3, …, N, and that the number of outgoing edges of node n is denoted as \( L(n) \). The importance of node A can then be computed as:

$$ {\text{Q}}^{A} = {\text{Q}}^{A} + \sum\limits_{i \to A} {\frac{{{\text{Q}}^{i} q_{iA} }}{L(i)}} $$
(11)

where the sum runs over all nodes \( i \) that have a directed edge to node A. Taking the directed graph in Fig. 3 as an example, in which nodes 2 and 3 both point to node 1 and each has two outgoing edges, the importance of node 1 can be computed as:

$$ {\text{Q}}^{1} = {\text{Q}}^{1} + \frac{{{\text{Q}}^{2} q_{21} }}{2} + \frac{{{\text{Q}}^{3} q_{31} }}{2} $$
(12)
Fig. 3. Directed graph example
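The update in Eqs. (11) and (12) can be sketched as a single pass over the edges, assuming that the contributions of all in-neighbours are summed, as Eq. (12) does for node 1. The graph below is hypothetical (it only mirrors the property that nodes 2 and 3 each have two outgoing edges, one of them to node 1), and the node and edge weights are made-up numbers. Whether the update is iterated is not stated in the text, so only one pass is shown.

```python
def update_importance(node_values, edges):
    """One pass of Eq. (11).

    node_values: dict node -> current weight Q
    edges: dict (i, A) -> q_iA for each directed edge i -> A
    """
    out_degree = {}
    for src, _dst in edges:
        out_degree[src] = out_degree.get(src, 0) + 1   # L(i)

    updated = dict(node_values)                        # start from the old Q^A
    for (src, dst), q in edges.items():
        # Each in-neighbour i contributes Q^i * q_iA / L(i), using the old Q^i.
        updated[dst] += node_values[src] * q / out_degree[src]
    return updated


# Hypothetical four-node graph: nodes 2 and 3 each point to nodes 1 and 4.
Q = {1: 2.0, 2: 1.0, 3: 4.0, 4: 3.0}
edges = {(2, 1): 0.5, (2, 4): 1.0, (3, 1): 0.25, (3, 4): 0.5}
updated = update_importance(Q, edges)
print(updated)
```

Nodes without incoming edges keep their original weight, which is why the comprehensive sentiment value still matters for products that no other product is compared against.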

3 Experimental Study

With the rapid development of China's economy, people's living standards have improved significantly, and the automobile has become a necessity in people's lives. Third-party consumer review platforms have been shown to enjoy a better reputation and recognition [16].

3.1 Data Collection and Preprocessing

In order to verify the practicality and effectiveness of the proposed method, two third-party consumer review platforms, autohome (http://www.autohome.com.cn/) and Pcauto Network (http://www.pcauto.com.cn/), are selected for online review analysis. According to product level and price, we choose compact SUV automobiles with prices between 100,000 and 300,000 yuan. We choose four automobile brands, Mazda, Honda, Trumpchi and Roewe, and then use a web crawler to obtain the data of these four products. The detailed data is shown in Table 1.

Table 1. Data set

To preprocess the text data, we apply Stanford Parser to segment the Chinese text. An external dictionary is added to the word segmentation step to avoid ambiguities caused by splitting domain-specific words. The external dictionary includes 2401 words selected from the automobile-field entries of the Sogou cell lexicon.

3.2 Data Experiment

The comprehensive reviews are decomposed into an aspect-based review corpus, on which sentiment analysis is carried out. The sentiment intensity values \( S_{ij}^{k} \) of product \( P_{i} \) for the different aspects are obtained. The sentiment value of product \( P_{i} \) towards the kth aspect, \( S_{i}^{k} \), is then calculated by Eq. (1), \( S_{i}^{k} = \frac{1}{n}\sum\limits_{j = 1}^{n} {S_{ij}^{k} } \). The aspect-based sentiment values of the four products are calculated, and the results are shown in Table 2.

Table 2. Aspect-based sentiment values of different products

According to the method proposed in this paper, the preprocessed text review data is used to determine the weights based on the LDA topic model. Take the Mazda review data as an example. Given K = 8, the document-topic probability matrix \( \theta_{t} \) can be obtained. Meanwhile, with the LDA topic model, the distribution of keywords under each topic can be determined, and a word frequency threshold f can then be set. With f = 40, 163 keywords are kept, from which an 8 × 163 topic-keyword probability matrix \( \beta_{k} \) is generated:

$$ \beta_{k} = \begin{array}{c|cccc} & {\text{Word1}} & {\text{Word2}} & \ldots & {\text{Word163}} \\ \hline {\text{Topic1}} & 0.0034 & 0.0146 & \ldots & 0.0023 \\ {\text{Topic2}} & 0.0043 & 0.0092 & \ldots & 0.0022 \\ \vdots & \vdots & \vdots & & \vdots \\ {\text{Topic8}} & 0.0047 & 0.0104 & \ldots & 0.0026 \\ \end{array} $$

With \( \beta_{k} \), we can generate word clouds that visualize the topic words, from which the practical meaning of each LDA topic in the automobile reviews can be identified. By Eq. (4), the objective sentiment values of the different products based on online reviews can be obtained. The results are shown in Table 3.

Table 3. Topic-product correspondence and sentiment value analysis

Based on analysis in Sect. 2.3, we can construct the directed graph model after calculating the weights of each network node, the direction and weight of each edge. The result is shown in Fig. 4.

Fig. 4. Directed graph model of four products

Based on the improved PageRank algorithm, the final scores of the different products can be obtained. By Eq. (11), the final scores of the four products are 3.3713, 4.1464, 5.2842 and 3.8805. The final ranking is therefore \( P_{3} > P_{2} > P_{4} > P_{1} \).

The competitor ranking provided by the Pacific automotive network shows that the actual ranking is \( P_{3} > P_{1} > P_{2} > P_{4} \). The result calculated by the proposed method is therefore relatively consistent with the performance of each product in the competitor ranking: both rankings place \( P_{3} \) first. However, the two rankings are not identical. We attribute this difference to the text review information that the proposed method takes into account, which also supports the effectiveness of the proposed method.
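One simple way to make "relatively consistent" concrete is to count the product pairs that the two rankings order in the same way. The sketch below is an illustration added for the reader and not part of the reported experiment; it assumes that the four final scores are listed in the order \( P_{1} \) to \( P_{4} \), which matches the ranking stated above.

```python
from itertools import combinations

# Final scores reported above, assumed to be listed in the order P1..P4.
scores = {"P1": 3.3713, "P2": 4.1464, "P3": 5.2842, "P4": 3.8805}
predicted = sorted(scores, key=scores.get, reverse=True)
actual = ["P3", "P1", "P2", "P4"]   # ranking from the third-party platform


def concordant_pairs(rank_a, rank_b):
    """Count product pairs that both rankings place in the same relative order."""
    pos_a = {p: i for i, p in enumerate(rank_a)}
    pos_b = {p: i for i, p in enumerate(rank_b)}
    return sum(
        (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y])
        for x, y in combinations(rank_a, 2)
    )


agree = concordant_pairs(predicted, actual)
total = len(list(combinations(predicted, 2)))
print(predicted, agree, total)
```

With these inputs, four of the six product pairs are ordered identically, and the two disagreements both involve \( P_{1} \).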

4 Conclusion

In this paper, a new method of mining aspect-based online reviews is proposed. First, unlike studies that rely on a single type of data, this paper uses both numerical data and review text data. Second, the LDA topic model from the text mining field is used to estimate the document-topic and topic-keyword probability distributions, from which a word cloud for each topic is obtained. Through this visualization of the text data, the content of each topic is shown. The latent topics of the LDA model are thereby given practical meaning, and the weights of the aspects are calculated. By making full use of the large volume of online review knowledge, the customers' objective sentiment values are finally mined.

At the same time, considering that different consumers have different preferences towards different aspects of products, their subjective sentiment orientation weights are taken into account. The weights derived from the text and those derived from personal preference are integrated to calculate the comprehensive sentiment value of the product, which makes the result more persuasive and reliable. The algorithm proposed in this paper provides a new theoretical method for aspect-based online text review mining. Moreover, it is of practical significance for developing decision support systems for personalized recommendation.