1 Introduction

Smart appliances such as the Apple Watch and smart cars play increasingly important roles in our daily lives. In particular, they make it much easier for customers to express their opinions in different ways, so it is not surprising that a vast number of reviews are now available online. Sentiment classification for such reviews has attracted growing attention from the natural language processing (NLP) community.

Sentiment analysis refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information from source materials. It aims to determine the attitude of a speaker with respect to some topic, or the overall contextual polarity of a document, such as ‘positive’ or ‘negative’, ‘thumbs up’ or ‘thumbs down’ [1].

Methods for document sentiment classification are generally lexicon-based or corpus-based [7, 8]. Lexicon-based approaches derive a sentiment measure for text from sentiment lexicons. Corpus-based approaches employ statistical classification methods; they usually outperform lexicon-based ones and have been used in unsupervised [9], supervised and semi-supervised learning.

Early research in this field includes Pang et al. [2] and Turney [3], who applied supervised and unsupervised learning, respectively, to classify the sentiment of movie reviews. In particular, Pang et al. applied several methods based on n-grams and POS features (including Naïve Bayes, Maximum Entropy and SVM) to classify a review as either positive or negative. Although such supervised methods perform well, they rely on labeled data, which is normally difficult to obtain. Unsupervised learning for sentiment classification works without any labeled reviews [3], but it is difficult because of the prevalence of sentimentally ambiguous reviews. SO-PMI [4] is a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words. Furthermore, some scholars apply machine learning approaches to derive a classifier through supervised learning [5, 6].

Several semi-supervised learning approaches have been proposed, which use a large amount of unlabeled data together with labeled data for learning [10,11,12]. Sindhwani and Melville [10] propose a semi-supervised sentiment classification algorithm that utilizes lexical prior knowledge in conjunction with unlabeled data. More recently, deep belief networks (DBN) [11] have proved to be an effective model for semi-supervised sentiment classification. To embed prior knowledge in the network structure, Zhou et al. [13] propose a semi-supervised learning algorithm called fuzzy deep belief networks for sentiment classification, which combines the deep learning algorithm DBN with fuzzy sets [14].

However, existing semi-supervised learning methods have several defects. On the one hand, they cannot reasonably handle the data near the separating hyper-plane between classes. On the other hand, it is difficult to determine the correlation between the fuzzy sets and the neurons. In this paper, we propose to enhance DBN with fuzzy mathematics and a genetic algorithm in order to address these challenges. In particular, our proposal maps input data to output using fuzzy rules according to their memberships of the fuzzy sets. Specifically, we design a new fuzzy set with special fuzzy rules for the data near the separating hyper-plane between the classes, and we introduce a genetic algorithm to determine the correlation between the fuzzy sets and neurons. We refer to our algorithm as evolutionary fuzzy deep belief networks with incremental rules (EFDBNI). We show that our proposal tackles the aforementioned challenges and achieves better performance than existing approaches.

The remainder of this paper is organized as follows. Section 2 introduces related work on sentiment classification. Section 3 presents our semi-supervised learning method EFDBNI in detail. Section 4 presents the experimental results. We conclude the paper in Sect. 5.

2 Related Work

Many works have been proposed for sentiment classification. According to their dependence on labeled data, methods of sentiment classification fall into three categories: supervised learning, unsupervised learning and semi-supervised learning.

The study of supervised learning methods for sentiment classification began with the work in [2]. These methods are widely used in analyzing the sentiment of various topics, such as movie reviews [15], product reviews [16, 17], microblogs [18, 19] and so on. The idea is to train a domain-specific sentiment classifier for each target domain using the labeled data in that domain. Although these methods perform well, they rely on labeled training data, which is normally difficult to obtain: annotating a large-scale corpus for each domain is very expensive, even though several works attempt to mitigate this domain-specific challenge with domain adaptation approaches [20,21,22,23,24].

Unsupervised learning for sentiment classification maximizes the likelihood of the observed data without any labeled reviews [25]. In [3], the classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs; a phrase has a positive semantic orientation when it has good associations (e.g., “subtle nuances”) and a negative semantic orientation when it has bad associations (e.g., “very cavalier”). In [26], Read and Carroll investigate the effectiveness of word similarity techniques for weakly-supervised sentiment classification. Because unsupervised approaches do not use labeled data, they are expected to be less accurate than supervised ones [27].

Semi-supervised learning addresses this problem by using a large amount of unlabeled data, together with the labeled data, to build better classifiers [28]. Many semi-supervised methods have been presented for sentiment classification. To address the sentiment analysis task of rating inference, Goldberg and Zhu [29] present a graph-based semi-supervised learning algorithm which infers numerical ratings based on the perceived sentiment. In [30], a novel semi-supervised sentiment prediction algorithm utilizes lexical prior knowledge in conjunction with unlabeled examples; it performs joint sentiment analysis of documents and words based on a bipartite graph representation of the data. More recently, deep belief networks (DBN) [11] have proved to be an effective model for semi-supervised sentiment classification. To embed prior knowledge in the network structure, Zhou et al. [13] propose a semi-supervised learning algorithm called fuzzy deep belief networks for sentiment classification. In [31], a novel sentiment analysis model based on a recurrent neural network takes part of a document as input and predicts the sentiment label distribution of the following parts, rather than the next word. In this paper, we focus on document-level sentiment classification.

3 Method

We describe the procedure for training an Evolutionary Fuzzy Deep Belief Network with Incremental rules (EFDBNI). Suppose that we construct an EFDBNI with one input layer, one output layer and N − 1 hidden layers. Firstly, we preprocess the sentiment classification data set. Secondly, we train a normal deep belief network with one input layer, one output layer and N − 1 hidden layers. Thirdly, we define fuzzy sets associated with fuzzy rules and construct special hidden layers corresponding to these fuzzy sets on top of the deep belief network mentioned above. Applying the membership functions to the deep structure then yields a new fuzzy deep belief network. The whole procedure is shown in Fig. 1.

Fig. 1. The whole procedure for establishing an EFDBNI.

3.1 Preprocessing

As the sentiment classification data set is normally composed of many review documents, we preprocess each review into a bag-of-words representation in advance, in the same way as [32].

3.2 Normal Deep Belief Networks

Preprocessed as mentioned above, each review is represented as a vector of binary weights xi. If the jth word of the vocabulary occurs in the ith review, we set \( x_{j}^{i} = 1 \); otherwise, \( x_{j}^{i} = 0 \). The data set is then denoted by

$$ X = [x^{1} ,x^{2} , \ldots ,x^{R + T} ] $$
(1)

where

$$ x^{i} = [x_{1}^{i} ,x_{2}^{i} , \ldots ,x_{D}^{i} ],i \in 1,2, \ldots ,R + T $$
(2)

where R is the number of training reviews, T is the number of test reviews, and D is the number of feature words in the data set.

The L training reviews that are labeled manually are denoted by XL; these reviews are chosen randomly. The labels corresponding to the L labeled training reviews are aggregated into a set of labels Y, denoted as

$$ Y = [y^{1} ,y^{2} , \ldots ,y^{L} ] $$
(3)

where

$$ y^{i} = [y_{1}^{i} ,y_{2}^{i} , \ldots ,y_{c}^{i} ]^{{\prime }} $$
(4)
$$ y_{j}^{i} = \left\{ {\begin{array}{*{20}c} {1,x \in jth\;class} \\ {0,x \notin jth\;class} \\ \end{array} } \right. $$
(5)

where c is the number of classes. In sentiment classification, if a review xi is positive, yi = [1, 0]′; otherwise yi = [0, 1]′.
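To make the representation concrete, the following sketch (in Python, with hypothetical tokenized `reviews`, vocabulary `vocab` and polarity list inputs that are not part of the original description) builds the binary matrix X of (1)–(2) and the label vectors of (3)–(5):

```python
import numpy as np

def binary_bow(reviews, vocab):
    """Encode review i as x^i with x_j^i = 1 iff the j-th vocabulary
    word occurs in it (eqs. 1-2)."""
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(reviews), len(vocab)))
    for i, tokens in enumerate(reviews):
        for t in tokens:
            if t in index:
                X[i, index[t]] = 1.0
    return X

def one_hot_labels(polarities):
    """y^i = [1, 0] for a positive review, [0, 1] for a negative one (eqs. 3-5)."""
    return np.array([[1.0, 0.0] if p == "pos" else [0.0, 1.0]
                     for p in polarities])
```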

We construct a DBN with one input layer, one output layer and N − 1 hidden layers; the desired EFDBNI likewise has one input layer, one output layer and N − 1 hidden layers. In both the intermediate DBN and the final EFDBNI, the input layer h0 has D units and the output layer has C units. The output layer has a linear activation function, and every hidden layer uses a sigmoid activation function.

We train the DBN using all reviews as inputs. Firstly, we build the DBN layer by layer from RBMs using the traditional algorithm [33]; each RBM consists of a binary input layer and a binary output layer [34]. Secondly, we refine the parameter space W using the L labeled reviews by back-propagation. For this task, we define the optimization problem as

$$ \mathop {\arg \hbox{min} }\limits_{w} f(h^{N} (X^{L} ),Y^{L} ) $$
(6)
$$ f(h^{N} (X^{L} ),Y^{L} ) = \frac{1}{2}\sum\limits_{i = 1}^{L} {\sum\limits_{j = 1}^{C} {(h_{j}^{N} (x^{i} ) - y_{j}^{i} )^{2} } } $$
(7)

where C is the number of classes and it is equal to 2 in the case of sentiment classification.

The layer hN is obtained as follows:

$$ h_{t}^{N} (x) = c_{t}^{N} + \sum\limits_{s = 1}^{{D_{N - 1} }} {w_{st}^{N} h_{s}^{N - 1} (x),t = 1,2, \ldots ,D_{N} } $$
(8)
$$ h_{t}^{k} (x) = sigm(c_{t}^{k} + \sum\limits_{s = 1}^{{D_{k - 1} }} {w_{st}^{k} h_{s}^{k - 1} (x)),t = 1,2, \ldots ,D_{k} ;k = 1,2, \ldots ,N - 1 $$
(9)

where \( w_{st}^{k} \) is the symmetric interaction term between unit s in layer hk−1 and unit t in layer hk, k = 1, 2, …, N − 1, while \( c_{t}^{k} \) is the tth bias of layer hk. Likewise, \( w_{st}^{N} \) is the symmetric interaction term between unit s in layer hN−1 and unit t in layer hN, and \( c_{t}^{N} \) is the tth bias of layer hN.
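For illustration, a minimal sketch of the forward pass in (8) and (9), assuming `weights` and `biases` hold the parameters wk and ck obtained from RBM pre-training and back-propagation (the list layout is our assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute h^N(x): sigmoid hidden layers h^1..h^{N-1} (eq. 9) and a
    linear output layer (eq. 8). weights[k] has shape (D_k, D_{k+1});
    biases[k] has shape (D_{k+1},)."""
    h = x
    for W, c in zip(weights[:-1], biases[:-1]):
        h = sigmoid(c + h @ W)            # hidden layers h^1 .. h^{N-1}
    return biases[-1] + h @ weights[-1]   # linear output layer h^N
```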

3.3 Membership Function of Fuzzy Rule

Training reviews are divided into three fuzzy sets A, B and C, each of which is inferred using the fuzzy rules.

Definition 1.

For a data set X, the positive fuzzy set A in X is characterized by a membership function μA(x) which associates each review x with a real number in the interval [0, 1], representing the grade of x as a positive review in A:

$$ A = \{ (x,\mu_{A} (x));x \in X\} $$
(10)

where

$$ \mu_{A} (x):X \to [0,1] $$
(11)

Definition 2.

For a data set X, the negative fuzzy set B in X is characterized by a membership function μB(x) which associates each review x with a real number in the interval [0, 1], representing the grade of x as a negative review in B:

$$ B = \{ (x,\mu_{B} (x));x \in X\} $$
(12)

where

$$ \mu_{B} (x):X \to [0,1] $$
(13)

The membership functions are based on the value of hN(x) from the deep belief network trained in Sect. 3.2. In the case of sentiment classification, the dimension of hN(x) is 2 (corresponding to the positive and negative classes). Thus, the class separation line is

$$ h_{1}^{N} = h_{2}^{N} $$
(14)

The distance between a point hN(xi) and a separation line is

$$ d(x^{i} ) = (h_{1}^{N} (x^{i} ) - h_{2}^{N} (x^{i} ))/\sqrt 2 $$
(15)

If d(xi) > 0, xi is positive; otherwise, xi is negative.

The two membership functions μA(x) and μB(x) are expressed as

$$ \mu_{A} (x;\beta ,\gamma ) = \left\{ {\begin{array}{*{20}l} {S(d(x);\gamma - \beta ,\gamma - \beta /2,\gamma ),} \hfill & {d(x) \le \gamma } \hfill \\ {1,} \hfill & {d(x) > \gamma } \hfill \\ \end{array} } \right. $$
(16)
$$ \mu_{B} (x;\beta , - \gamma ) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {d(x) < - \gamma } \hfill \\ {1 - S(d(x); - \gamma , - \gamma + \beta /2, - \gamma + \beta ),} \hfill & {d(x) \ge - \gamma } \hfill \\ \end{array} } \right. $$
(17)
$$ S(d;\alpha ,\beta ,\gamma ) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {d \le \alpha } \hfill \\ {2(\frac{d - \alpha }{\gamma - \alpha })^{2} ,} \hfill & {\alpha \le d \le \beta } \hfill \\ {1 - 2(\frac{d - \gamma }{\gamma - \alpha })^{2} ,} \hfill & {\beta \le d < \gamma } \hfill \\ {1,} \hfill & {d \ge \gamma } \hfill \\ \end{array} } \right. $$
(18)

When d(x) > 0 we have μA(x) > μB(x), i.e., the grade of membership in A is larger than in B; when d(x) < 0 we have μA(x) < μB(x), i.e., the grade of membership in A is smaller than in B.

The two parameters β and γ are estimated by

$$ \gamma = \hbox{max} (d(x^{i} )) $$
(19)
$$ \beta = \xi \times \gamma ,\xi \ge 2 $$
(20)

where ξ is a constant which indicates the degree of separation for the two classes.
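For illustration, the following sketch implements (15)–(20); the function names, and the default ξ = 3 taken from Sect. 4.1, are our choices:

```python
import numpy as np

def S(d, a, b, g):
    """Zadeh's S-function, eq. (18); note g - a equals the paper's beta."""
    if d <= a:
        return 0.0
    if d <= b:
        return 2.0 * ((d - a) / (g - a)) ** 2
    if d < g:
        return 1.0 - 2.0 * ((d - g) / (g - a)) ** 2
    return 1.0

def mu_A(d, beta, gamma):
    """Positive membership, eq. (16); d = (h_1^N - h_2^N)/sqrt(2), eq. (15)."""
    return 1.0 if d > gamma else S(d, gamma - beta, gamma - beta / 2.0, gamma)

def mu_B(d, beta, gamma):
    """Negative membership, eq. (17)."""
    return 1.0 if d < -gamma else 1.0 - S(d, -gamma, -gamma + beta / 2.0,
                                          -gamma + beta)

def estimate_params(distances, xi=3.0):
    """gamma = max_i d(x^i) (eq. 19); beta = xi * gamma, xi >= 2 (eq. 20)."""
    gamma = float(np.max(distances))
    return xi * gamma, gamma  # (beta, gamma)
```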

However, it is not sufficient to partition the data set in this way: these fuzzy sets only model the two sentiment polarities and their fuzzy rules. The input data is then mapped to the output using the fuzzy rules according to its memberships of the fuzzy sets; the higher the degree to which an input belongs to a fuzzy set, the more appropriate it is to apply the corresponding rules. Conversely, although fuzzy set theory allows an input to lie in the overlap of several fuzzy sets, no reasonable rule can be applied to an input that belongs to every set only to a low degree. In other words, using only the existing fuzzy rules leads to inexact inference for the data near the separating hyper-plane. Hence, we design a new fuzzy set with special fuzzy rules for the data near the separating hyper-plane. This new rule compensates for the previous rules.

Definition 3.

For a data set X, the neutral fuzzy set C in X is characterized by a membership function μC(x) which associates each review x with a real number in the interval [0, 1], representing the grade of x as a neutral review in C:

$$ C = \{ (x,\mu_{C} (x));x \in X\} $$
(21)

According to set theory, we reformulate the definition of fuzzy set C as follows:

$$ C = A^{c} \cap B^{c} $$
(22)

where the superscript c denotes the complement of a fuzzy set.

On the basis of fuzzy mathematics, we have

$$ \mu_{C} (x) = \hbox{min} (1 - \mu_{A} (x),1 - \mu_{B} (x)) $$
(23)

Substituting (16)–(18) into (23), and noting that β ≥ 2γ by (20) (so that γ − β ≤ −γ), the membership function μC(x) reduces, for the moderate values of ξ used in this paper, to

$$ \mu_{C} (x) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {d \le - \gamma } \hfill \\ {2(\frac{d + \gamma }{\beta })^{2} ,} \hfill & { - \gamma < d \le 0} \hfill \\ {2(\frac{d - \gamma }{\beta })^{2} ,} \hfill & {0 < d < \gamma } \hfill \\ {0,} \hfill & {d \ge \gamma } \hfill \\ \end{array} } \right. $$
(24)

The parameters β and γ in (24) are the same as the two parameters in (16) and (17). Thus, the parameters β and γ in (24) can also be estimated by (19) and (20).
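Since (23) fixes μC completely, no new parameters are needed; a minimal sketch, reusing `mu_A` and `mu_B` from the sketch above:

```python
def mu_C(d, beta, gamma):
    """Neutral membership via the standard fuzzy complement and
    intersection, eq. (23)."""
    return min(1.0 - mu_A(d, beta, gamma), 1.0 - mu_B(d, beta, gamma))
```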

3.4 Evolutionary Fuzzy Deep Belief Networks Algorithm with Incremental Rules

After the fuzzy parameters are extracted, the deep architecture is constructed, as shown in Fig. 2. The top hidden layer hN−1 is divided into three parts, corresponding to the three fuzzy rules respectively.

Fig. 2. Architecture of fuzzy deep belief networks.

The label of new data is denoted by \( {\hat{j}} \). It is determined by

$$ \hat{j} = \mathop {\arg \hbox{max} }\limits_{j} \,h_{j}^{N} (x) $$
(25)

The final outputs of the network are computed from μA(x), μB(x), μC(x) and the outputs of hN−1 as

$$ h_{t}^{N} (x) = c_{t}^{N} + \mu_{A} (x)P + \mu_{B} (x)Q + \mu_{C} (x)R,t = 1,2, \ldots ,D_{N} $$
(26)

where

$$ P = \sum\limits_{s = 1}^{{D_{N - 1} /3}} {w_{st}^{N} h_{s}^{N - 1} (x)} $$
(27)
$$ Q = \sum\limits_{{s = D_{N - 1} /3 + 1}}^{{2D_{N - 1} /3}} {w_{st}^{N} h_{s}^{N - 1} (x)} $$
(28)
$$ R = \sum\limits_{{s = 2D_{N - 1} /3 + 1}}^{{D_{N - 1} }} {w_{st}^{N} h_{s}^{N - 1} (x)} $$
(29)

where hN−1 is given in (9). We use a genetic algorithm to determine which units in layer hN−1 are used to represent each rule.
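For illustration, a minimal sketch of (26)–(29); the array shapes and the `groups` argument (the three index sets chosen by the GA described next) are our assumptions:

```python
import numpy as np

def fuzzy_output(h_prev, W, c, groups, mA, mB, mC):
    """Rule-weighted output layer, eqs. (26)-(29).
    h_prev: h^{N-1}(x), shape (D_{N-1},); W: shape (D_{N-1}, D_N);
    c: shape (D_N,); groups: three disjoint index arrays over the units
    of h^{N-1}; mA/mB/mC: memberships of x in the fuzzy sets A, B, C."""
    gA, gB, gC = groups
    P = h_prev[gA] @ W[gA, :]   # eq. (27): units assigned to rule A
    Q = h_prev[gB] @ W[gB, :]   # eq. (28): units assigned to rule B
    R = h_prev[gC] @ W[gC, :]   # eq. (29): units assigned to rule C
    return c + mA * P + mB * Q + mC * R   # eq. (26)
```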

That is, we determine the indexes (denoted by s) of the hidden units in layer hN−1 involved in each of (27), (28) and (29). We divide the units in layer hN−1 into three groups, each associated with one of the fuzzy rules. According to the conclusion in [35], it is helpful to determine the units associated with each fuzzy rule, so we need principles for deciding which features, represented by the units in layer hN−1, are associated with each fuzzy rule. We implement this process by applying a genetic algorithm (GA) for the grouping problem: the fuzzy rules are the groups, and the features are the objects to be grouped.

In our work, we only need to consider the equal-group-size problem. Different from previous genetic algorithms for maximally diverse grouping problems [36, 37], our purpose is to assign the right fuzzy rules to the features represented by the units in layer hN−1. Thus, we design a novel genetic algorithm that tackles this particular problem. The details are described as follows.

We divide the units in layer hN−1 into three mutually disjoint groups, numbered one, two and three.

As in previous genetic algorithms for grouping problems, we encode each solution as a chromosome. In our work, any encoding scheme that can represent the space of solutions is suitable, so we adopt the most straightforward one, namely one gene per object. For example, the chromosome 132312 encodes the solution where the first object is in group 1, the second in group 3, the third in group 2, the fourth in group 3, the fifth in group 1 and the sixth in group 2.

Although there are various compositions of the groups, we can reduce all the diversities among compositions to two types. We denote three objects, one in each group, by A, B and C: Group 1, Group 2 and Group 3 include object A, object B and object C respectively. These relationships are described in Table 1.

Table 1. The assignment of objects to groups.

Another solution to the grouping problem is to assign these three objects to different groups. We list all the cases in Table 2.

Table 2. The composition of groups.

If we ignore the differences caused by the capital letters (for example, the pair ABC and BAC is equivalent to the pair CBA and BCA), we can further reduce the cases to three. This is described in Table 3.

Table 3. The composition of groups.

The differences between these solutions thus reduce to the assignment (grouping) of each object, and they can be classified into the two types described in Table 3. For example, let one solution be represented by chromosome 123123 and another by 213312. On the one hand, the diversity between the first three genes 123 of the first chromosome and the first three genes 213 of the second chromosome is the case described in the second and third rows of Table 3 (counting the row containing the phrase ‘group number’ as the first row). On the other hand, the diversity between the last three genes 123 of the first chromosome and the last three genes 312 of the second chromosome is the case described in the second and fourth rows of Table 3. So we can decompose a pair of solutions into their consistent part and inconsistent part. Based on this fact, we design specific GA operators for our problem; the details are described as follows, with an illustrative sketch after the note below.

  • Step 1. We denote the crossover rate by φ. The parameter ranges from 0.5 to 1 and is set manually.

  • Step 2. We classify the diversities between the parents into the two types in Table 3 and collect them into a set. We assign a random number, ranging from 0 to 1, to each term of the set. If a term’s number is bigger than φ, the term is added to the list for crossover.

  • Step 3. We make duplicates of the parents.

  • Step 4. We take the top term off the list and exchange the corresponding objects in the duplicate of the first parent; we only need to regroup the objects involved in that diversity term, after which their assignment is the same as in the second parent. At the same time, we make the assignment of those objects in the duplicate of the second parent the same as in the first parent. We repeat this process until the list is empty.

It should be noted that the crossover rate φ must be larger than 0.5: if it is too small, most of the diversities are added to the list, so the duplicate of one parent may be changed into a chromosome identical to the other parent, and the process would create offspring that are the same as the parents.
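For illustration, a simplified sketch of the crossover operator: it exchanges disagreeing gene positions individually, whereas the operator above exchanges whole diversity terms so that group sizes stay equal, so this is an approximation rather than the exact operator:

```python
import numpy as np

def crossover(p1, p2, phi=0.8, rng=np.random.default_rng()):
    """p1, p2: integer arrays of group labels in {1, 2, 3}; phi in (0.5, 1].
    A disagreeing position is exchanged with probability 1 - phi."""
    c1, c2 = p1.copy(), p2.copy()         # Step 3: duplicate the parents
    for pos in np.flatnonzero(p1 != p2):  # Step 2: the diversity set
        if rng.random() > phi:            # term selected for the crossover list
            c1[pos], c2[pos] = p2[pos], p1[pos]  # Step 4: exchange assignments
    return c1, c2
```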

We redefine the mutation operator as follows:

  • Step 1. We denote the mutation rate by ϕ. The parameter ranges from 0 to 0.5 and is set manually.

  • Step 2. Generate a random number with uniform distribution. If the number is bigger than ϕ, we go to the next step; otherwise, we quit the process.

  • Step 3. Pick two groups for mutation at random, with the three groups having equal probabilities of being chosen. Mutation then occurs in each of the two groups at a random position: the probability of a mutation of a bit in a group is equal to 3/DN−1, where DN−1 is the number of units in layer hN−1. After the positions are determined, we exchange the group numbers of the two objects.

It should be noted that in Step 3 we keep the size of each group constant and ensure that an object is included in exactly one group.
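A minimal sketch of the mutation operator (the `rng` plumbing is ours); note that, following Step 2 literally, mutation proceeds when the random draw exceeds ϕ:

```python
import numpy as np

def mutate(chrom, phi=0.1, rng=np.random.default_rng()):
    """Swap the group labels of one random unit from each of two randomly
    chosen groups, which keeps all three group sizes constant (Step 3)."""
    c = chrom.copy()
    if rng.random() <= phi:   # Step 2: quit the process
        return c
    g1, g2 = rng.choice([1, 2, 3], size=2, replace=False)
    i = rng.choice(np.flatnonzero(c == g1))  # random position in group g1
    j = rng.choice(np.flatnonzero(c == g2))  # random position in group g2
    c[i], c[j] = g2, g1                      # exchange the group numbers
    return c
```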

To evaluate a solution, we need a fitness function. Different from previous genetic algorithms for maximally diverse grouping problems, our purpose is to assign the right fuzzy rules to the features represented by the units in layer hN−1, so we design a new fitness function for our problem: we use the error rate of the network whose structure is described by a solution as the criterion for fitness. To obtain the initial population, we select a number of solutions from the whole space of possible solutions. It should be noted that, according to the work in [35], the characteristics of a fuzzy model depend heavily on the structure rather than on the parameters of the membership functions; the authors further confirm that it is practical to select the structure before identifying the parameters of the membership functions. In our proposed GA, we only need to determine the connections between layers hN−1 and hN. In the neural network, the structure from layer h0 to hN−1 represents the membership functions, and the parameters of this structure have already been well adjusted by the learning in Sect. 3.2. Thus, for each generation, we just use gradient descent to adjust the symmetric interaction terms between layers hN−1 and hN as well as the biases of layer hN, i.e., wN and cN. The process of the GA is shown in Algorithm 1, where gen denotes the generation number and i indexes the population.

Algorithm 1. The GA for grouping the units in layer hN−1.
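Since Algorithm 1 is only reproduced as a figure, the following sketch gives one possible reading of the GA loop; the truncation selection scheme and the `fitness` callback (error rate of the network with the top layer retrained for a given grouping) are our assumptions:

```python
import numpy as np

def run_ga(population, fitness, max_gen=100, phi=0.8, mut=0.1,
           rng=np.random.default_rng()):
    """population: list of chromosomes; fitness(chrom) -> error rate of
    the network obtained by retraining w^N, c^N for that grouping."""
    size = len(population)
    for gen in range(max_gen):
        parents = sorted(population, key=fitness)[: size // 2]  # keep best half
        children = []
        while len(parents) + len(children) < size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            c1, c2 = crossover(parents[i], parents[j], phi, rng)
            children += [mutate(c1, mut, rng), mutate(c2, mut, rng)]
        population = parents + children[: size - len(parents)]
    return min(population, key=fitness)  # best grouping found
```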

After we determine the structure of the whole network using the GA, we identify all parameters of the whole fuzzy deep belief network by supervised learning: the EFDBNI algorithm again uses gradient descent to retrain the parameters through the whole deep architecture, with the same optimization problem as in (6) and (7). The whole fuzzy deep belief networks algorithm is shown in Algorithm 2.

Algorithm 2. The EFDBNI algorithm.

4 Results and Discussion

4.1 Experimental Setup

We use three sentiment classification data sets: electronics (ELE), restaurants (RES) and movies (MOV). Each of them contains 1000 positive and 1000 negative reviews.

We divide the 2000 reviews into two parts. Half of the reviews are randomly selected as training data and the remaining reviews are used for testing. Only 10% of the training reviews are labeled.

All of the neural networks consist of one input layer, one output layer and three hidden layers, with 100, 100 and 200 hidden units in the three hidden layers respectively. However, the structures of the neural networks differ among the three data sets, because the number of units in the input layer equals the dimensionality of each data set. The maximum number of iterations of unsupervised learning is set to 1000, and the supervised learning is repeated 10 times for each labeled datum.

The parameter ξ is set empirically for different data sets. Since EFDBNI obtains relatively better results on all data sets when ξ = 3, we set ξ to 3. The maximum number of GA generations is set to 100.

4.2 Performance Comparison

We compare the classification performance of EFDBNI with two representative semi-supervised learning classifiers, i.e., transductive SVM (TSVM) [38] and fuzzy deep belief networks (FDBN) [13].

The test accuracy on the three data sets with three rules is shown in Fig. 3. We can see that the performance of the proposed method is better than that of the others on all three data sets.

Fig. 3. Test accuracy with 100 labeled reviews on three data sets for TSVM, FDBN and EFDBNI.

5 Conclusion

This paper proposes a novel semi-supervised learning algorithm, EFDBNI, to address the sentiment classification problem with a small number of labeled reviews. Our proposal inherits the advantages of previous work on deep learning for sentiment classification and significantly improves the performance of the existing deep learning architecture.