1 Introduction

With the development of Web technology, more and more people tend to express their opinions and emotions on the Web, as well as absorb others’ feelings online. People prefer to share their sentiments about a product or a service via electronic word-of-mouth reviews [5, 37]. Automatic prediction of the sentiment orientation of reviews has been widely applied in many fields. For instance, it can help potential customers go through reviews about a product of interest and obtain its overall orientation [30]. The orientations of reviewers also have important implications for managers with respect to brand building, product development, etc. [7]. In addition, it can help governments analyze and monitor public opinion.

Sentiment prediction, or sentiment classification, as a special case of text classification for subjective texts, is becoming a hot research topic for handling the rapidly increasing volume of reviews. A scheme for (consumers’) review sentiment classification aims at automatically judging the sentiment orientation of a review: positive, negative, or neutral. There are mainly two strategies for prediction: machine learning techniques and semantic orientation aggregation [28]. The former follows traditional text classification techniques, such as Naïve Bayes (NB) [18] and support vector machine (SVM) [13]. The second strategy is to classify features into two classes, positive and negative, and then aggregate the overall orientation of a review. Only a few studies have focused on sentiment classification for Chinese reviews. Ye et al. [35] first worked on this area in 2005. Tan and Zhang [28] presented an experimental study with four feature selection methods and five machine learning methods. Wang et al. [30] presented a novel feature selection method based on utilizing mean and variance to improve Fisher’s discriminant ratio. Zhang et al. [37] studied sentiment classification of restaurant reviews written in Cantonese.

Generally, there are some debates on feature selection and classification techniques in review sentiment classification. Pang et al. [13] pointed out that binary-based features show better performance than frequency-based features, while the opposite conclusion can be found in [30]. SVM performs the best in [13] on unigram features, whereas NB performs better than SVM in [6] using unigram features as well. Moreover, sentiment classification is strongly dependent on domains or topics [28]. Xia et al. [31] argued that this is mainly because different types of features have distinct distributions and would probably vary in performance among different machine learning algorithms. Thus, they first combined distinct feature sets and separate machine learning classifiers to improve the accuracy of sentiment classification. A multiple classifier system (MCS), which fuses a pool of component classifiers, is a popular research topic, as it offers more robustness, accuracy, and generality than a single classifier [1].

Besides the above-mentioned debates, there are some disadvantages in existing techniques. First, due to the uncertainty of the outputs of component classifiers, it is reasonable to consider the fuzziness of those outputs, while most classification schemes harden the outputs of a component classifier by a maximum operator. Second, although Xia et al. [31] employed MCS to improve accuracies, only the degree to which a test pattern belongs to a class, as supported by a component classifier, is considered, while the degree to which the test pattern does not belong to that class is not taken into account. If we want to measure both aspects and their fuzziness, the theory of intuitionistic fuzzy sets (IFSs) [2] is a natural choice. In addition, we investigate weighting strategies for MCS in Chinese review sentiment prediction through proper operators on IFSs.

Huang et al. [9, 10, 11] presented a fast learning algorithm referred to as extreme learning machine (ELM) for single-hidden layer feedforward networks (SLFNs). Some developments of ELM are presented in [8, 17, 21, 23, 24, 39]. The hidden layer of ELM need not be iteratively tuned, which results in extremely fast learning. Therefore, we employ ELM to study Chinese review sentiment prediction. Meanwhile, it is hard to prepare a high-quality corpus with enough reviews suitable for training classifiers. Thus, a rational compromise is to organize a small number of reviews to train initial classifiers; when new labeled reviews arrive, new and more accurate classifiers can be retrained by online sequential learning techniques. In addition, since word-of-mouth expressions may change over a long time span, it is necessary to add new reviews to capture the changes. Thus, a further development of ELM, the online sequential extreme learning machine (OS-ELM) [19], is used as the component classifier.

In this paper, we aim at making an intensive study of the effectiveness of ensemble learning schemes, that is, ensembles of OS-ELMs (EOS-ELMs), for predicting Chinese review sentiments based on intuitionistic fuzzy aggregation. A novel multi-classifier fusion algorithm is proposed, in which the outputs of component classifiers are first equivalently represented by a set of intuitionistic fuzzy values and the fusion process is interpreted as an aggregation of IFSs. Further, we set up experiments to seek empirical answers to the following questions:

(1) Are ELM and OS-ELM effective for Chinese review sentiment classification?

(2) Can the performance of Chinese review sentiment classification benefit from the proposed ensemble learning techniques?

(3) Among all versions of the proposed fusion algorithm and other comparable algorithms, which generally performs the best?

(4) When weighting component classifiers, which strategy works better for Chinese review sentiment classification?

The remainder of this paper is organized as follows. Section 2 reviews and presents some preliminaries on ELM, OS-ELM, IFS and induced operators. In Sect. 3, we present the Parallel EOS-ELM algorithm under the intuitionistic fuzzy framework and the specific process of the intuitionistic fuzzy fusion method. Section 4 evaluates the performance of the proposed scheme through experiments. Some further in-depth discussions on both theoretical and application issues are given in Sect. 5. The conclusions are given in Sect. 6.

2 Preliminaries

2.1 ELM and OS-ELM

ELM [10], as a learning algorithm for SLFNs, randomly selects the weights and biases of the hidden nodes and analytically determines the output weights by finding the least squares solution. OS-ELM [19, 25], proposed by Liang et al., was developed on the basis of ELM for SLFNs with additive and radial basis function hidden nodes. Consider N arbitrary distinct samples \(({\bf x}_i ,{\bf t}_i) \in {\bf R}^n\times {\bf R}^m. \) If an SLFN with L hidden nodes can approximate these N samples with zero error, it implies that there exist β i , a i , and b i such that

$$ f_L \left( {\bf x}_j \right) = \sum\limits_{i = 1}^L \!{\beta _i G\left( {{\bf a}_i ,b_i ,{\bf x}_j } \right)} = {\bf t}_j ,\quad j = 1,2, \ldots, N $$
(1)

where a i and b i are the learning parameters of the hidden nodes, β i is the output weight, and \(G( {{\bf a}_i ,b_i ,{\bf x}_j } )\) denotes the output of the ith hidden node with respect to the input x j .

Assume the data \(\aleph \!=\! \{ ( {\bf x}_i ,{\bf t}_i )| {\bf x}_i \in {\bf R}^n,{\bf t}_i \in {\bf R}^m \}_{i = 1}^N \) is presented to the network (with L hidden nodes) sequentially (one-by-one or chunk-by-chunk with fixed or varying chunk size). There are two phases in the OS-ELM algorithm: the initialization phase and the sequential learning phase.

Initialization phase: A small chunk of training data, \(\aleph _0 = \{ ( {\bf x}_i ,{\bf t}_i )\}_{i = 1}^{N_0 } \) from the given training set ℵ, is used to initialize the learning, where N 0 ≥ L and L = rank (H 0).

(a) Randomly assign the input parameters.

(b) Calculate the initial hidden layer output matrix \({\bf H}_0\):

    $$ {\bf H}_0 = \left[ {{\begin{array}{ccc} G\left( {\bf a}_1 ,b_1 ,{\bf x}_1 \right) & \ldots & G\left( {{\bf a}_L ,b_L ,{\bf x}_1 } \right) \\ \vdots & \ddots & \vdots \\ {G\left( {{\bf a}_1 ,b_1 ,{\bf x}_{N_0 } } \right)} & \ldots & {G\left( {{\bf a}_L ,b_L ,{\bf x}_{N_0 } } \right)} \\ \end{array} }} \right]_{N_0 \times L} $$
    (2)
(c) Estimate the initial output weight: \(\beta _{(0)} = {\bf P}_0 {\bf H}_0^T {\bf T}_0, \) where \({\bf P}_0 = ({\bf H}_0^T {\bf H}_0)^{-1}\) and \( {\bf T}_{0} = [ {{\bf t}_{1} , \ldots ,{\bf t}_{N_0} }]^T. \)

(d) Set k = 0 (k is a parameter indicating the number of chunks of data that have been presented to the network).

Sequential learning phase: Present the (k + 1)th chunk of new samples, \(\aleph _{k + 1} = \{ {( {\bf x}_i ,{\bf t}_i )} \}_{i = \sum\nolimits_{j = 0}^k {N_j } + 1}^{\sum\nolimits_{j = 0}^{k + 1} {N_j } }, \) where N k+1 denotes the number of samples in the (k + 1)th chunk.

(e) Compute the partial hidden layer output matrix \({\bf H}_{k+1}\):

    $$ {\bf H}_{k + 1} \!\!=\!\!\! \left[\!\! {{\begin{array}{ccc} {G\!\!\left(\! {{\bf a}_1 \!,b_1 \!,{\bf x}_{\sum\nolimits_{j = 0}^k {N_j } + 1} }\!\! \right)} &\ldots\! & {G\!\!\left(\! {{\bf a}_L \!,b_L \!,{\bf x}_{\sum\nolimits_{j = 0}^k {N_j } + 1} } \!\!\right)} \\ \vdots & \ddots & \!\vdots\! \\ {G\!\!\left(\! {{\bf a}_1 \!,b_1 \!,{\bf x}_{\sum\nolimits_{j = 0}^{k + 1} {N_j } } } \!\!\right)} & \!\ldots\! & {G\!\!\left(\! {{\bf a}_L \!,b_L \!,{\bf x}_{\sum\nolimits_{j = 0}^{k + 1} {N_j } } } \!\!\right)} \\ \end{array} }} \!\!\!\right] $$
    (3)
(f) Calculate the output weight \(\beta_{(k+1)}\):

    $$ {\bf T}_{k + 1} = \left[ {{\bf t}_{\sum\nolimits_{j = 0}^k {N_j } + 1} , \ldots ,{\bf t}_{\sum\nolimits_{j = 0}^{k + 1} {N_j } } } \right]^T $$
    (4)
    $$ {\bf P}_{k + 1} \!=\! {\bf P}_k - {\bf P}_k {\bf H}_{k + 1}^T \!\left( {{\bf I} + {\bf H}_{k + 1} {\bf P}_k {\bf H}_{k + 1}^T } \right)\!^{ - 1}{\bf H}_{k + 1} {\bf P}_k $$
    (5)
    $$ \beta _{\left( {k + 1} \right)} = \beta _{\left( k \right)} + {\bf P}_{k + 1} {\bf H}_{k + 1}^T \left( {{\bf T}_{k + 1} - {\bf H}_{k + 1} \beta _{\left( k \right)} } \right) $$
    (6)
(g) Set k = k + 1. Go to (e).
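For concreteness, the two phases above can be sketched in a few lines of NumPy. This is only an illustrative reading of Eqs. (2)–(6) with sigmoid additive hidden nodes; the class and method names are our own, and the explicit inverse in the initialization phase assumes \({\bf H}_0^T {\bf H}_0\) is nonsingular (N 0 ≥ L).

```python
import numpy as np

class OSELM:
    """Minimal OS-ELM sketch: initialization phase plus sequential updates (Eqs. 2-6)."""

    def __init__(self, n_inputs, n_hidden, rng=None):
        rng = np.random.RandomState(0) if rng is None else rng
        # Step (a): randomly assign input weights a_i and biases b_i.
        self.A = rng.uniform(-1.0, 1.0, (n_inputs, n_hidden))
        self.b = rng.uniform(-1.0, 1.0, n_hidden)
        self.P = None
        self.beta = None

    def _hidden(self, X):
        # G(a_i, b_i, x): sigmoid additive hidden nodes.
        return 1.0 / (1.0 + np.exp(-(X @ self.A + self.b)))

    def init_phase(self, X0, T0):
        # Steps (b)-(c): H_0, P_0 = (H_0^T H_0)^{-1}, beta_(0) = P_0 H_0^T T_0.
        H0 = self._hidden(X0)
        self.P = np.linalg.inv(H0.T @ H0)
        self.beta = self.P @ H0.T @ T0

    def partial_fit(self, Xk, Tk):
        # Steps (e)-(f): recursive least-squares update of P_k and beta_(k), Eqs. (5)-(6).
        Hk = self._hidden(Xk)
        K = np.eye(Hk.shape[0]) + Hk @ self.P @ Hk.T
        self.P = self.P - self.P @ Hk.T @ np.linalg.solve(K, Hk @ self.P)
        self.beta = self.beta + self.P @ Hk.T @ (Tk - Hk @ self.beta)

    def predict(self, X):
        return self._hidden(X) @ self.beta
```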

Furthermore, there are some studies on ensemble schemes of OS-ELM [8, 17, 21]. The corresponding fusion strategies will be specified and implemented in a unified form hereinafter.

2.2 IFS and induced operators

Since the introduction of fuzzy sets, several extensions of this concept have been defined. The most accepted one might be Atanassov’s intuitionistic fuzzy sets (IFSs) [2]. The concept of IFS is as follows.

Definition 1 [2]

Let X be an ordinary finite nonempty set. An IFS in X is an expression A given by

$$ A = \left\{ {\left\langle {x,\mu _A \left( x \right),v_A \left( x \right)} \right\rangle \vert x \in X} \right\} $$
(7)

where μ A :X → [0, 1], v A :X → [0, 1] with the condition: 0 ≤ μ A (x) + v A (x) ≤ 1, for all x in X.

The functions μ A (x), v A (x), and π A (x) = 1 − μ A (x) − v A (x) denote, respectively, the degrees of membership, non-membership, and hesitation of the element x in the set A. The ordered pair α(x) = (μ α(x), v α(x)) is called an intuitionistic fuzzy value (IFV) [32], where \(\mu _\alpha ( x ),v_\alpha ( x ) \in [ {0,1} ]\) and μ α(x) + v α(x) ≤ 1. Let \(\Upomega \) be the set of all IFVs. In this paper, an IFV is abbreviated as α = (μ, v). Two operations on IFVs are as follows:

Definition 2 [32]

Let α = (μ α, v α) and β = (μ β, v β) be two IFVs and λ > 0; then

(1) \(\alpha \oplus \beta = ( {\mu _\alpha + \mu _\beta - \mu _\alpha \mu _\beta ,v_\alpha v_\beta } )\)

(2) \(\lambda \alpha = ( {1 - ( {1 - \mu _\alpha } )^\lambda ,v_\alpha ^\lambda } ). \)

Xu [32] proposed the following process to compare two IFVs:

Definition 3

Let α = (μ α, v α) and β = (μ β, v β) be two IFVs, S(α) = μ α − v α and S(β) = μ β − v β be the scores of α and β, respectively, and H(α) = μ α + v α and H(β) = μ β + v β be the accuracy degrees of α and β, respectively; then

(1) If S(α) > S(β), then α > β;

(2) If S(α) = S(β) and H(α) > H(β), then α > β;

(3) If S(α) = S(β) and H(α) = H(β), then α = β.
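Definitions 2 and 3 are simple enough to state directly in code; the helper names below are ours, and the sketch assumes an IFV is a plain (μ, v) tuple.

```python
def ifv_add(a, b):
    """alpha (+) beta = (mu_a + mu_b - mu_a*mu_b, v_a*v_b), Definition 2(1)."""
    (ma, va), (mb, vb) = a, b
    return (ma + mb - ma * mb, va * vb)

def ifv_scale(lam, a):
    """lambda*alpha = (1 - (1 - mu)**lambda, v**lambda), Definition 2(2), lambda > 0."""
    mu, v = a
    return (1.0 - (1.0 - mu) ** lam, v ** lam)

def ifv_compare(a, b):
    """Return 1, -1, or 0 following Definition 3 (score first, then accuracy degree)."""
    score = lambda x: x[0] - x[1]
    accuracy = lambda x: x[0] + x[1]
    if score(a) != score(b):
        return 1 if score(a) > score(b) else -1
    if accuracy(a) != accuracy(b):
        return 1 if accuracy(a) > accuracy(b) else -1
    return 0
```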

To aggregate IFVs, several aggregation operators [32] have been proposed, such as the intuitionistic fuzzy arithmetic averaging (IFAA) operator, the intuitionistic fuzzy weighted averaging (IFWA) operator, and the intuitionistic fuzzy ordered weighted averaging (IFOWA) operator. The operators used for classifier fusion are cited here.

Definition 4

Let \(\alpha _j = ( {\mu _j ,v_j } )(j = 1,2, \ldots ,n)\) be a collection of IFVs with the weighting vector \(w = ( {w_1 ,w_2 , \ldots ,w_n } )^T\) such that \(w_j \in[0,1]\) and \(\sum\nolimits_{j = 1}^n {w_j } = 1. \) An IFWA operator of dimension n is a mapping IFWA: \(\Upomega ^n \to \Upomega , \) and

$$ {\rm IFWA}_w \left( {\alpha _1 ,\alpha _2 , \ldots ,\alpha _n } \right) = \mathop \oplus \limits_{j = 1}^n \left( {w_j \alpha _j } \right) $$
(8)

In particular, if \(w = (1/n,\,1/n, \ldots ,1 /n)^T, \) then the IFWA operator reduces to the IFAA operator:

$${\rm IFAA}\left( {\alpha _1 ,\alpha _2 , \ldots ,\alpha _n } \right) = \frac{1}{n}\mathop \oplus \limits_{j = 1}^n \left( {\alpha _j } \right) $$
(9)

The induced ordered weighted averaging operator proposed by Xu and Da [33] reorders the elements by an order inducing vector, and the associated weighting vector weights the element in the reordered position. We define an induced intuitionistic fuzzy ordered weighted averaging (I-IFOWA) operator for classifier fusion.

Definition 5

Let \(\alpha _j = ({\mu _j ,v_j })(j = 1,2, \ldots ,n)\) be a collection of IFVs. An I-IFOWA operator of dimension n is a mapping I-IFOWA: \(\Upomega ^n \to \Upomega , \) furthermore,

$$ \begin{aligned} {\rm I-IFOWA}_w &\left( {\left\langle {u_1 ,\alpha _1 } \right\rangle ,\left\langle {u_2 ,\alpha _2 } \right\rangle , \ldots ,\left\langle {u_n ,\alpha _n } \right\rangle } \right) \quad \\ &\quad= \mathop \oplus \limits_{j = 1}^n \!\! \left( {w_j \alpha _{\sigma \left( j \right)} } \right) = \left( {\!\! 1\!\! - \!\! \prod\limits_{j = 1}^n \!\!{\left( {1\!- \mu _{\sigma \left( j \right)} } \right)^{w_j }\!\!,} \prod\limits_{j = 1}^n \!\!{\left( {v_{\sigma \left( j \right)} } \right)^{w_j }} }\!\! \right) \end{aligned} $$
(10)

where \(w = (w_1 ,w_2 , \ldots ,w_n)^T\) is the weighting vector, \(w_j \in [0,1]\) and \(\sum\nolimits_{j = 1}^n {w_j } = 1, \alpha _{\sigma (j)} \) is the α j value of the pair \(\langle {u_j ,\alpha _j } \rangle \) having the jth largest \(u_j ( {u_j \in [ {0,1} ]}), \) and u j is referred to as the order inducing variable.

We can obtain the IFWA, IFAA operators by choosing different manifestations of the weighting vector and the order inducing vector \(u=(u_1, u_2, \ldots, u_n). \)
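A direct implementation of Eq. (10) only needs to reorder the IFVs by their order inducing values and apply the weights to the reordered sequence; the function name below is ours, and ties in u are broken by the (stable) sort order.

```python
import numpy as np

def i_ifowa(pairs, w):
    """I-IFOWA (Eq. 10). pairs: iterable of (u_j, (mu_j, v_j)); w: weights summing to 1."""
    # alpha_{sigma(j)} is the IFV of the pair with the j-th largest u_j.
    ordered = [alpha for _, alpha in sorted(pairs, key=lambda p: p[0], reverse=True)]
    mus = np.array([mu for mu, _ in ordered])
    vs = np.array([v for _, v in ordered])
    w = np.asarray(w, dtype=float)
    return (1.0 - np.prod((1.0 - mus) ** w), np.prod(vs ** w))
```

With w = (1/n, ..., 1/n) this reduces to the IFAA operator, and with w built from the reordered u values it behaves as the IFWA operator, as discussed in Sect. 3.4.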

3 Integrating the I-IFOWA operator and EOS-ELM for sentiment prediction

3.1 Problem description

The problem of predicting consumer sentiments can be formally defined as an online sequential sentiment prediction problem similar to [3].

The labeled Chinese review corpus consists of N samples \(\aleph = \{ ( {\bf x}_i ,{\bf t}_i )| {\bf x}_i \in {\bf R}^n,{\bf t}_i \in {\bf R}^m \}_{i = 1}^N , \) where \({\bf x}_i = [ {x_{i1} , x_{i2} , \ldots ,x_{in} } ]^T\) is the random vector corresponding to the features \(\{ X_1 ,X_2 , \ldots ,X_n \}. \) In binary-based feature representation, x i is derived by encoding the presence or absence of features in the ith review, that is, \(x_{ij} \in \{ {0,1} \}\), \(j = 1,2, \ldots ,n. \) In frequency-based feature representation, x i is derived by encoding the frequencies of features in the ith review, that is, \(x_{ij} \in \{ 0 \}\bigcup {Z^ + } \), \(j = 1,2, \ldots ,n\), where Z + is the set of positive integers. Similar to [23], we represent the overall sentiment by the class label t i coded by m bits: \({\bf t}_i = [ {t_{i1} ,t_{i2} , \ldots ,t_{im} } ]^T, \) where m = 2 if “neutral” is excluded, or m = 3 if “neutral” is included. For a pattern of class k, only t ik is “1” and the rest are “−1,” \(k = 1,2, \ldots ,m. \) The online sequential sentiment prediction problem can be defined as follows.

Online Sequential Sentiment Prediction: A corpus of N reviews arrives sequentially. In the initial phase, N 0 reviews \(\aleph _0 = \{ ({\bf x}_i ,t_i)\}_{i = 1}^{N_0 } \) are given. The others arrive one-by-one or chunk-by-chunk (with fixed or varying chunk size). If N 0 = N, sequential prediction reduces to batch prediction. We want to predict the overall sentiment of a new review by a learned mapping \(f:{\bf R}^n \to {\bf R}. \)
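As a concrete (and purely illustrative) reading of this representation, a review could be encoded as a binary bigram vector and a ±1-coded target as follows; the vocabulary and helper names are hypothetical.

```python
import numpy as np

def encode_review(review, bigram_vocab):
    """Binary features: x_ij = 1 if the j-th bigram occurs in the review, else 0."""
    bigrams = {review[k:k + 2] for k in range(len(review) - 1)}
    return np.array([1.0 if bg in bigrams else 0.0 for bg in bigram_vocab])

def encode_label(class_index, m=2):
    """Class label coded by m bits: +1 for the true class, -1 elsewhere."""
    t = -np.ones(m)
    t[class_index] = 1.0
    return t
```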

3.2 Parallel EOS-ELM

Parallel EOS-ELM means that the OS-ELMs in the ensemble are parallel and thus can be implemented by concurrent techniques, with each thread generating one OS-ELM. A typical parallel EOS-ELM can be found in [17]. In this paper, as we try to use the induced operator to fuse the outputs of the OS-ELMs, the order inducing variables need to be evaluated. Similar to existing fusion methods with a weighting procedure, we adopt the intuitive hypothesis that a “better” classifier should be assigned a bigger weight. The concept of “better” is measured by quantitative criteria such as the norm of the output weights \(\|\beta \|\) (only for ELM-related methods), accuracy, time, etc. It is commonly agreed that a smaller \(\| \beta\|\) may lead to better generalization performance [21, 39, 10]. Therefore, the smaller \(\| \beta \|\) is, the bigger the assigned weight should be. Accuracy is another rational measurement. In some application areas, time is a crucial criterion, and then time can be used as the order inducing variable. Note that if the norm of the output weights or time is used, the weighting vector is ascending, whereas if accuracy is used, the weighting vector is descending. Parallel EOS-ELM consists of Q OS-ELM networks with the same number of hidden nodes and the same activation function for each node. The proposed Parallel EOS-ELM is illustrated in Table 1. In order to increase the diversity of the ensemble, some other techniques can be used; for example, each OS-ELM can be trained with distinct samples.

Table 1 Algorithm of parallel EOS-ELM
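Table 1 is not reproduced here, but its core can be sketched as follows, reusing the OSELM class sketched in Sect. 2.1: train Q OS-ELMs independently (a plain loop stands in for the concurrent threads) and record an order inducing value for each, from either the norm of its output weights or its accuracy on held-out data. Negating the norm so that a larger value always means a “better” classifier is our simplification; the paper instead keeps raw norms and uses an ascending weighting vector.

```python
import numpy as np

def train_parallel_eoselm(X0, T0, chunks, Q, n_hidden, X_val=None, T_val=None,
                          inducer="norm"):
    """Train Q OS-ELMs and return the ensemble with its order inducing values u."""
    ensemble, u = [], []
    for q in range(Q):
        net = OSELM(X0.shape[1], n_hidden, rng=np.random.RandomState(q))
        net.init_phase(X0, T0)
        for Xk, Tk in chunks:              # one-by-one or chunk-by-chunk stream
            net.partial_fit(Xk, Tk)
        if inducer == "norm":
            # Smaller ||beta|| is taken as 'better', hence the negation.
            u.append(-np.linalg.norm(net.beta))
        else:                              # accuracy on a held-out validation set
            pred = np.argmax(net.predict(X_val), axis=1)
            u.append(float(np.mean(pred == np.argmax(T_val, axis=1))))
        ensemble.append(net)
    return ensemble, np.array(u)
```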

3.3 Intuitionistic fuzzy fusion

When an ensemble is prepared, we can use it to predict the class label of a test pattern, that is, an unlabeled Chinese review. The components of the outputs of a classifier can be viewed as degrees of support, belief, certainty, possibility, etc., not necessarily coming from a statistical classifier [16]. In this paper, we use the I-IFOWA operator for two purposes. First, we present a generalized form of weighting fusion schemes that includes all existing weighting fusion functions as special cases; thus, we can compare different weighting strategies by a uniform formula. Second, we use IFVs to represent the outputs of component classifiers because we consider both the degree to which a test pattern belongs to a class and the degree to which it does not belong to that class. We develop the following fusion schemes based on the hypothesis that a “better” classifier should be assigned a bigger weight.

Let us begin by analyzing the outputs of the OS-ELMs. Assume that OS-ELM q (q = 1, 2, ..., Q) are the Q OS-ELMs derived by Parallel EOS-ELM. Given a test sample \({\bf x} \in {\bf R}^n, \) we denote the outputs of those Q OS-ELMs by

$$ {\rm DP}\left( {\bf x} \right) = \left[ {{\begin{array}{ccc} {y_{1,1} \left( {\bf x} \right)} &\ldots & {y_{1,m} \left( {\bf x} \right)} \\ \vdots & \ddots & \vdots \\ {y_{Q,1} \left( {\bf x} \right)} & \ldots & {y_{Q,m} \left( {\bf x} \right)} \\ \end{array} }} \right]_{Q\times m} $$
(11)

where \({\rm DP}_q ( {\bf x} ) = [ {y_{q,1} ( {\bf x} ) \, \ldots \, y_{q,m} ( {\bf x} )} ]\) is the output of OS-ELM q , such that \(y_{q,j} ( {\bf x} ) \in [ { -1,1} ], \) where m is the number of pattern classes. In practice, this interval may not always hold. If \(y_{q,j} ( {\bf x} ) \notin [ { -1,1} ], \) we normalize DP q (x) by \({y_{q,j} ( {\bf x} ) = y_{q,j} ( {\bf x} )} / {\mathop {\max}\nolimits_j } \{ {| {y_{q,j} ( {\bf x} )} |} \}. \) Motivated by [15], we call this matrix the decision profile (DP). As described in Sect. 3.1, the ideal DP of an EOS-ELM for a test pattern x of class j should be:

$$ {\rm DP}\left( {\bf x} \right) = \left[ {{\begin{array}{ccccc} {-1}& \ldots & 1 & \ldots & {-1} \\ \vdots & & \vdots & & \vdots \\ {-1} & \ldots & 1 & \ldots & {-1} \\ \end{array} }} \right]_{Q\times m} $$
(12)

However, the ideal case in (12) does not usually happen because of the fuzziness and uncertainty of both the training data and the test data. Thus, in the decision making phase, our task is to develop a process to handle them. The maximum operator and the weighted vote function are rational but elementary solutions. We utilize the theory of IFSs to represent the fuzziness and uncertainty by the degrees of membership and non-membership simultaneously.

As we use “1” to represent the correct label and “−1” to represent the incorrect label, the output of OS-ELM q carries rich information. Let X be the set of samples of class j. The degree to which y q,j (x) approximates “1” indicates the degree to which x belongs to class j, denoted by

$$ \mu _{q,j} \!\left( {\bf x} \right) = P\left( {\bf x}\! \in \!X \right) = \frac{y_{q,j}\! \left( {\bf x} \right) \!-\! \left( { - 1} \right)}{1 - \left( { - 1} \right)} = \frac{y_{q,j} \left( {\bf x} \right) + 1}{2}; $$
(13)

while the degree to which y q,j (x) approximates “−1” indicates the degree to which x does not belong to class j, denoted by

$$ v_{q,j} \left( {\bf x} \right) \!=\! P\left( {\bf x}\! \notin\! X \right) = \frac{1 - y_{q,j} \left( {\bf x} \right)}{1 - \left( { - 1} \right)} = \frac{1 - y_{q,j} \left( {\bf x} \right)}{2}. $$
(14)

In effect, we calculate the possibility that x belongs to class j and the possibility that x does not belong to class j. Because both indicate evidence of whether x belongs to class j, motivated by the processes of forming an IFV suggested in [14, 20], we transform y q,j (x) into an IFV α q,j (x) = (μ q,j (x), v q,j (x)) based on the degree of approximation, where \(q = 1,2, \ldots ,Q,\, j = 1,2, \ldots ,m. \) The corresponding IFS is

$$ A_{q,j} = \left\{ \left\langle {\bf x},\mu _{q,j} \left( {\bf x} \right),v_{q,j} \left( {\bf x} \right) \right\rangle \vert {\bf x} \in X \right\} $$
(15)

where μ q,j (x) represents the degree of \({\bf x} \in X\) and v q,j (x) represents the degree of \({\bf x} \notin X. \) Note that this transformation enables us to use intuitionistic fuzzy theories to fuse the EOS-ELM. We clarify the equivalence of the transformation in the following theorem.
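In code, the transformation of a decision profile into its intuitionistic fuzzy counterpart is a direct application of Eqs. (13) and (14); the sketch below also includes the row-wise normalization of Sect. 3.3 for outputs falling outside [−1, 1].

```python
import numpy as np

def to_ifdp(dp):
    """Transform a Q x m decision profile into (mu, v) matrices per Eqs. (13)-(14)."""
    dp = np.asarray(dp, dtype=float)
    # Normalize each row only if some entry leaves [-1, 1] (Sect. 3.3).
    scale = np.maximum(np.abs(dp).max(axis=1, keepdims=True), 1.0)
    dp = dp / scale
    mu = (dp + 1.0) / 2.0   # degree that x belongs to class j, Eq. (13)
    v = (1.0 - dp) / 2.0    # degree that x does not belong to class j, Eq. (14)
    return mu, v
```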

Theorem 1

Let y q,j be an entry of the outputs of an OS-ELM, and let the IFV α q,j (x) = (μ q,j (x), v q,j (x)) be its transformation obtained from (13)–(15). Then, y q,j is equal to the expected value of the pair (μ q,j (x), v q,j (x)). If a piecewise linear utility function u is assumed, then the utility of y q,j is equal to the expected utility of α q,j (x).

Proof

According to the procedure of transformation, the expected value of α q,j (x):

$$ E\!\left( \left( {\mu _{q,j} \left( {\bf x} \right)\!,v_{q,j} \left( {\bf x} \right)} \right) \right)\! =\! -1 \times \frac{1 - y_{q,j} }{2} + 1\times \frac{ y_{q,j} + 1 }{2} = y_{q,j} $$

In addition, assume a piecewise linear utility function u for the class of a test pattern. If a test pattern belongs to class j, the utility is denoted by u(1); if a test pattern does not belong to class j, the utility is denoted by u(−1). Then, the expected utility of (μ q,j (x), v q,j (x)) is:

$$ u\left( \left( {\mu _{q,j} \left( {\bf x} \right),v_{q,j} \left( {\bf x} \right)} \right) \right) = \mu _{q,j} \left( {\bf x} \right) u\left( 1 \right) + v_{q,j} \left( {\bf x} \right) u\left( -1 \right) $$

the utility of y q,j can be calculated by [34]:

$$ \begin{aligned} u\left( y_{q,j} \right) &= u\left( 1 \right) - \frac{1 - y_{q,j}}{1 - \left( -1 \right)} \left( u\left( 1 \right) - u\left( -1 \right) \right) \\ & = \frac{y_{q,j} + 1}{2} u\left( 1 \right) + \frac{1 - y_{q,j}}{2} u\left( -1 \right) \\ & = \mu _{q,j} \left( {\bf x} \right) u\left( 1 \right) + v_{q,j} \left( {\bf x} \right) u\left( -1 \right) \\ \end{aligned} $$

which completes the proof. □

It is commonly known that any piecewise continuous function can be arbitrarily approximated by piecewise linear functions. Thus, our transformation results are equivalent to the original data if any piecewise continuous utility function is used.

Therefore, the DP of EOS-ELM can be equivalently transformed into the following form

$$ {\rm IFDP}\left( {\bf x} \right) = \left[ {{\begin{array}{ccc} {\alpha _{1,1} \left( {\bf x} \right)} & \ldots & {\alpha _{1,m} \left( {\bf x} \right)} \\ \vdots & \ddots & \vdots \\ {\alpha _{Q,1} \left( {\bf x} \right)} & \ldots & {\alpha _{Q,m} \left( {\bf x} \right)} \\ \end{array} }} \right]_{Q\times m} $$
(16)

and the qth row, \(\alpha _q ( {\bf x} ) = [ {\alpha _{q,1} ( {\bf x} ), \ldots ,\alpha _{q,m} ( {\bf x} )} ], \) represents the output of OS-ELM q , while the jth column represents the degrees to which x belongs and does not belong to the jth class.

The task of classifier combination is to derive the fused output of Q classifiers for final prediction, and the fusion process can be denoted by

$$ D\left( {\bf x} \right) = \Uptheta \left( {IFDP\left( {\bf x} \right)} \right) = \Uptheta \left( {\alpha _1 \left( {\bf x} \right), \ldots ,\alpha _Q \left( {\bf x} \right)} \right) $$
(17)

where \(\Uptheta \) is regarded as an aggregation operator. Recalling the operators mentioned in Sect. 2.2, the order inducing variables \(u_q ({q = 1,2, \ldots ,Q})\) derived in Sect. 3.2, and the weighting vector w of the ensemble, we present the classifier fusion algorithm in Table 2.

Table 2 Algorithm of the I-IFOWA operator based fusion
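Table 2 is not reproduced, but the fusion step amounts to building IFDP(x), aggregating each class column with the I-IFOWA operator, and choosing the class whose aggregated IFV ranks highest under Definition 3. A hedged sketch reusing the `to_ifdp` and `i_ifowa` helpers above:

```python
def fuse_and_decide(dp, u, w):
    """Fuse a Q x m decision profile DP(x) and return the predicted class index."""
    mu, v = to_ifdp(dp)                                    # Eq. (16)
    Q, m = mu.shape
    fused = [i_ifowa([(u[q], (mu[q, j], v[q, j])) for q in range(Q)], w)
             for j in range(m)]                            # Eq. (10) per class column
    # Rank aggregated IFVs by score mu - v, then by accuracy mu + v (Definition 3).
    return max(range(m), key=lambda j: (fused[j][0] - fused[j][1],
                                        fused[j][0] + fused[j][1]))
```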

There are some differences between the proposed fusion scheme and other fusion schemes. We take into account the degrees of support and non-support at the same time, whereas existing fusion schemes consider only the degree of support. Moreover, as seen in Definition 5, we use a nonlinear combination rather than a linear combination (e.g., a sum).

3.4 Decision making using the I-IFOWA operator

All the existing weighting strategies can be seen as special cases of the proposed fusion scheme if \(\Uptheta \) in (17) is the I-IFOWA operator implemented by different manifestations of order-inducing variables or weighting vector.

(1) If \(w_q = 1/Q,\,q = 1,2, \ldots ,Q, \) the I-IFOWA operator reduces to the IFAA operator. Thus, component classifiers are treated indistinguishably, as seen in [17, 27].

(2) If \(w_q = {u_q } /{\sum\nolimits_{q = 1}^Q {u_q } },\,q = 1,2,\ldots ,Q, \) then obviously \(\sum\nolimits_{q = 1}^Q {w_q } = 1\) and the order in {w q } Q q=1 is equal to the order in {u q } Q q=1 ; thus, the I-IFOWA operator reduces to the IFWA operator, which implements the weighted average strategy as seen in [8, 26].

(3) If w q  = 2/Q for \(q = 1, 2, \ldots ,Q /2\) and \(w_q = 0\) for \(q = Q/2 + 1,Q/2 + 2, \ldots ,Q, \) we actually select the first half of “better” classifiers from the ensemble for decision making, which is the weighting strategy used in [21]. Some more versions of the I-IFOWA operator can be generated based on other fusion strategies.

(4) Ng and Abramson [22] suggested using the best individual classifier for the final decision. This can be implemented by the maximum operator. The intuitionistic fuzzy maximum operator is obtained by setting w 1 = 1 and \(w_q = 0,\,q = 2, \ldots ,Q. \)

(5) Recently, another active research issue is the selective fusion scheme [38, 12], which may perform better than using all component classifiers. In the intuitionistic fuzzy fusion framework, the I-IFOWA operator can help to select the best M of the Q component classifiers by proper weighting vectors. If the M classifiers are treated identically, we set the weighting vector as w j  = 1/M for \(j = 1,2, \ldots ,M\) and \(w_j = 0\) for \(j = M + 1, \ldots ,Q; \) if we want to weight the M classifiers, the weighting strategies above can be used.

Other versions of the I-IFOWA operator can also be derived by similar discussion, and other fusion algorithms can be constructed.
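The strategies above differ only in the weighting vector handed to the I-IFOWA operator. A small helper that constructs them (the strategy names are ours; the IFWA branch assumes non-negative order inducing values such as accuracies, and the first-half branch assumes Q is even):

```python
import numpy as np

def make_weights(Q, strategy, u=None, M=None):
    """Weighting vectors for the I-IFOWA operator, one per case in Sect. 3.4."""
    if strategy == "ifaa":            # (1) equal weights -> IFAA
        return np.full(Q, 1.0 / Q)
    if strategy == "ifwa":            # (2) weights proportional to u -> IFWA
        # i_ifowa weights the IFV with the j-th largest u, so sort u descending.
        u = np.sort(np.asarray(u, dtype=float))[::-1]
        return u / u.sum()
    if strategy == "first_half":      # (3) first half of 'better' classifiers
        w = np.zeros(Q)
        w[: Q // 2] = 2.0 / Q
        return w
    if strategy == "best":            # (4) best individual classifier (maximum operator)
        w = np.zeros(Q)
        w[0] = 1.0
        return w
    if strategy == "select_m":        # (5) best M classifiers, equally weighted
        w = np.zeros(Q)
        w[:M] = 1.0 / M
        return w
    raise ValueError("unknown strategy: %s" % strategy)
```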

3.5 The proposed procedure

Based on the preparation above, we present the paradigm of the proposed procedure step by step as a brief illustration. An online prediction system for product suppliers or consumers can be implemented based on this paradigm.

Step 1 Initialization. This step includes preparing the training corpus and test reviews, selecting classification features, forming the input vector for each sample, and normalization if needed. The training reviews and test reviews should be about the same topic and written in Chinese. For example, if we want to predict review sentiments about a hotel, then the training reviews should also be reviews about hotels, or even about hotels in the same city or exactly the same hotel.

Step 2 Training. Train the EOS-ELM by Parallel EOS-ELM using the training corpus. The outputs of the training phase are an ensemble of ELM network models with distinct input and output parameters, together with their order inducing variables.

Step 3 Prediction. Given a test pattern x, utilize the trained EOS-ELM to generate Q outputs and construct IFDP(x) as in (16). Then, the final decision can be made using the I-IFOWA operator based fusion schemes.

Step 4 Conclusion. After all reviews are labeled, the overall consumer sentiment, as the most important index for product suppliers and consumers, can be calculated by evaluating the ratio of positive reviews. An alarm mechanism can then be set up for product managers, where the threshold on the positive ratio should be determined according to management science.
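Putting the four steps together, a minimal usage sketch follows; `load_corpus`, the 0-indexed positive class, and the chunking scheme are hypothetical stand-ins, not parts of this paper.

```python
import numpy as np

X_train, T_train, X_test = load_corpus("hotel_reviews")        # Step 1 (hypothetical helper)
init_X, init_T = X_train[:300], T_train[:300]                   # N_0 = L + 100 with L = 200
chunks = [(X_train[i:i + 50], T_train[i:i + 50])                 # chunk-by-chunk stream
          for i in range(300, len(X_train), 50)]

ensemble, u = train_parallel_eoselm(init_X, init_T, chunks,      # Step 2
                                    Q=20, n_hidden=200)
w = make_weights(Q=20, strategy="first_half")

labels = [fuse_and_decide(np.vstack([net.predict(x[None, :])     # Step 3
                                     for net in ensemble]), u, w)
          for x in X_test]
positive_ratio = np.mean(np.array(labels) == 0)                  # Step 4, assuming class 0 = positive
```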

4 Performance evaluation

4.1 Experimental setup

Experiments are conducted on three open datasets in distinct domains. The datasets [29] used in this section are summarized in Table 3. Each dataset has 2,000 positive samples and 2,000 negative samples. For each simulation, we randomly select 60% of the samples of each class for training, and the rest are used for testing.

Table 3 Summary of three datasets

As feature selection and reduction are not the focus of this study, we simply use algorithms proposed in the literature to obtain the input vector of a sample. We combine sets of n adjacent Chinese characters into n-gram features in a review as in [36, 35]. As bigrams usually outperform unigrams and trigrams [37], we use only bigrams in this study. Each Chinese review is then represented by a vector whose values are fixed by binary-based weights, due to the better results reported in [37, 31]. Obviously, a large number of features needs to be reduced so that the computational complexity can be alleviated and the accuracy may be improved as well. Therefore, we adopt a feature selection method based on the improved Fisher’s discriminant ratio [30] to finally select a certain number of features. This method makes use of conditional means and conditional variances to improve the well-known Fisher linear discriminant. We construct a feature set for each dataset independently and select 1,000, 1,500 and 1,000 features for HOTEL, BOOK, and NBC, respectively.
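The exact improved Fisher’s discriminant ratio of [30] is not reproduced in this paper; as a hedged approximation, the sketch below scores each feature by a plain Fisher-style ratio of the squared between-class mean difference to the sum of within-class variances and keeps the top k features.

```python
import numpy as np

def fisher_ratio_select(X, y, k):
    """Approximate Fisher-ratio feature selection (not the exact criterion of [30])."""
    X = np.asarray(X, dtype=float)
    pos, neg = X[y == 1], X[y == 0]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0) + neg.var(axis=0) + 1e-12   # avoid division by zero
    return np.argsort(num / den)[::-1][:k]            # indices of the k selected features
```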

In the following experiments, the additive hidden node is used, and the activation function of ELM is the sigmoidal function \(y = 1/(1 + e^{-x})\). The input weights and biases are randomly generated from the range [−1, 1]. For the initialization phase, the number of training data N 0 is L + 100, where L is the network size.

Criteria such as training time, testing time, and standard deviation (SD) are used to evaluate the performance of the proposed methodology and to compare it with other existing methods. Each unit of experiments runs 50 trials, from which the mean and SD are derived.

4.2 Experiments

We conduct all experiments in a Matlab 7.7.0 environment running on a desktop with a 2.93 GHz CPU and 3 GB RAM. The code of ELM [40] was downloaded from the Internet; we modified it to output soft predicted labels.

We first focus on the parameters of ELM and OS-ELM. There is only one parameter of ELM, namely the number of hidden nodes, that needs to be optimized. We conduct 50 trials of simulations to obtain the mean accuracy for each dataset over a series of numbers of hidden nodes to find the optimal choice. Table 4 shows the simulations; its first row lists the numbers of hidden nodes. The optimal number of hidden nodes is selected as 200 for all three datasets. We did not select a larger number because it would increase the computational complexity, although it might yield slightly more accurate results. The number of hidden nodes in each OS-ELM of the EOS-ELM is set equal to that of the corresponding dataset as well.

Table 4 Accuracies with distinct numbers of hidden nodes using ELM

SVM and NB are the most frequently used machine learning methods in recent research on sentiment classification. SVM, a family of supervised learning algorithms that select models maximizing the margin on a training set, has been successfully used in many pattern classification problems. We adopt the Matlab code for the SVM classifier from [4]. NB assumes that features are mutually independent. Conditional probabilities can then be simply calculated by \(P( {\bf x}| {{\bf y}_j } ) = P( [ {\bf x}_1 ,{\bf x}_2 , \ldots ,{\bf x}_n ]| {\bf y}_j ) \approx \prod\nolimits_{k = 1}^m P( {\bf x}_k | {{\bf y}_j } ), \) and the final decision of NB is obtained by the maximum index of \(\prod\nolimits_{k = 1}^m P( {\bf x}_k | {{\bf y}_j } ) P( {{\bf y}_j } ), \) where \(j = 1,2, \ldots , m. \) Laplace smoothing is used to prevent infrequently occurring words from being assigned zero probabilities.
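For completeness, the NB baseline as described (feature independence plus Laplace smoothing over binary features) can be sketched as follows; this is our illustration, not the code actually used in the experiments.

```python
import numpy as np

def train_nb(X, y):
    """Bernoulli Naive Bayes with Laplace (add-one) smoothing on binary features."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    cond = {c: (X[y == c].sum(axis=0) + 1.0) / (np.sum(y == c) + 2.0) for c in classes}
    return classes, priors, cond

def predict_nb(x, classes, priors, cond):
    """argmax_j of P(y_j) * prod_k P(x_k | y_j), computed in log space."""
    log_post = [np.log(priors[c])
                + np.sum(x * np.log(cond[c]) + (1 - x) * np.log(1.0 - cond[c]))
                for c in classes]
    return classes[int(np.argmax(log_post))]
```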

We conducted experiments with ELM, OS-ELM, SVM, and NB for Chinese review sentiment prediction on the three datasets. Table 5 shows the numerical results for several criteria.

Table 5 Performances on three datasets of ELM, OS-ELM, SVM and NB

In Table 5, it is clear that ELM and OS-ELM are effective algorithms for sentiment prediction of Chinese reviews. To further improve the performance of ELM and OS-ELM, we examine the proposed fusion scheme and compare it with some existing fusion schemes. As we have no prior knowledge about whether the norm of the output weights or the accuracy is the better measure of a component classifier, we try both measurements in Parallel EOS-ELM. The specific fusion methods are categorized into two groups.

Group 1 includes some existing methods which directly use the raw outputs of component classifiers:

(1) Arithmetic average (AA): As in [17], the final prediction is made by \(\arg \max \frac{1}{Q}\sum\nolimits_{q = 1}^Q {f^{( q )}( {\bf x} )} , \) where f (q)(x) is the output of OS-ELM q with the input pattern x.

(2) Weighted average (WA): Different from AA, the weights of component classifiers are taken into account. Frequently, the weights are represented by the accuracies of the component classifiers.

(3) First half arithmetic average (FHAA): Liu and Wang [21] used the first half of “better” component classifiers for the final decision. The first half is induced by the norm of the output weights, but classifiers within the first half are not distinguished from each other.

Group 2 is formed by the proposed intuitionistic fuzzy fusion scheme with several weighting strategies and two order inducing strategies. These methods are implemented by the I-IFOWA operator, and the outputs of component classifiers are first transformed into IFVs. Corresponding to Group 1, it includes:

(1) Intuitionistic fuzzy arithmetic average (IFAA): The ensemble is formed by Parallel EOS-ELM, and the outputs of the OS-ELMs are transformed into the IFDP before fusion. Similarly, we have:

(2) NIFWA denotes the intuitionistic fuzzy weighted average (IFWA) method of Parallel EOS-ELM whose order inducing variables and weighting vector are obtained from the norms of the output weights; that is, we weight the ensemble of OS-ELMs by the norms of the output weights. If the order inducing variables and weighting vector are obtained from the accuracies of the OS-ELMs, IFWA is denoted by AIFWA; that is, we weight the ensemble by accuracies.

(3) NIFFHAA denotes the method of using the I-IFOWA operator to fuse the first half of “better” component classifiers when the order inducing variables are obtained from the norms of the output weights; AIFFHAA denotes the method of using the I-IFOWA operator to fuse the first half of “better” component classifiers when the order inducing variables are obtained from the accuracies of the OS-ELMs.

Note that the order inducing variables are useful only in (3) of Group 2. How to use the I-IFOWA operator to implement Group 2 has been described in Sect. 3.4.

We test these eight fusion methods on each dataset; the performances are shown in Tables 6, 7, and 8, respectively. In these tables, the accuracies of the best individual classifier (BIC) and of the Oracle are also presented. The Oracle works as follows: the correct class label is assigned to x if and only if at least one individual classifier produces the correct class label of x when its decision is hardened. The numbers in the first row of Tables 6, 7, and 8 are the ensemble sizes.

Table 6 Accuracies of fusion methods with respect to size of ensemble on HOTEL (%)
Table 7 Accuracies of fusion methods with respect to size of ensemble on BOOK (%)
Table 8 Accuracies of fusion methods with respect to size of ensemble on NBC (%)

5 Discussions

Based on the numerical results, we make an in-depth discussion of the questions raised in Sect. 1. The accuracies obtained in Sect. 4 are higher than those reported on the same dataset (HOTEL) in [41], which shows that the feature selection process is effective.

5.1 Effectiveness of single classifiers

As can be seen in Table 5, ELM produces the highest training accuracies on HOTEL and NBC, and OS-ELM achieves the highest testing accuracy on NBC. SVM seems to be the best classifier on BOOK, as both its training and testing accuracies are much higher than the others. However, the SDs of ELM, OS-ELM and NB are much lower than those obtained by SVM, which means ELM, OS-ELM, and NB are more stable than SVM. Moreover, training time and testing time are significant criteria in practical applications, especially in ensemble learning. Table 5 shows that, with respect to training time, ELM and OS-ELM are at least 10 times faster than SVM and 30 times faster than NB. The advantage of ELM and OS-ELM in testing time is even more prominent. Regarding the SDs of training time and testing time, ELM and OS-ELM are also more outstanding. Generally, we can argue that ELM and OS-ELM are effective in the application of Chinese review sentiment prediction. Similar to the classification results of [19], OS-ELM costs slightly more training time than ELM, while their testing times are almost equivalent. In most cases, the accuracy of OS-ELM outperforms that of ELM. This is another reason why we choose OS-ELM.

5.2 Effectiveness of ensemble learning

Comparing with the values of BIC in Tables 6, 7, 8, we can see that each of the fusion methods improves performance over the best individual classifiers. The highest improvements on the three datasets are 3.09%, 6.25%, and 2.55%, respectively, all of which are obtained by the fusion methods proposed in this paper, while the lowest improvements are 2.08%, 2.40%, and 1.60%, respectively. Further, the highest improvements occur when the ensemble sizes are 40, 30, and 20, respectively. The gaps between BIC and Oracle indicate the potential improvement of fusion. However, the gaps between BIC and Oracle are much bigger than the gaps between BIC and the fusion methods, which means the fusion methods used in the experiments did not improve as much as theoretically possible. This may be caused by many reasons, such as the dependence among component classifiers, noise in the data, etc.

5.3 Regarding fusion methods

Fusion methods with the same weighting strategies are further compared. With regard to AA and IFAA, it is clear that the accuracies obtained by IFAA are always higher than those obtained by AA when the ensemble size is bigger than 20. Thus, AA only suits small ensembles, while IFAA performs much better when the size is bigger; that is, as the ensemble size increases, IFAA suffers much less from the accompanying noise and conflict than AA. AIFWA outperforms WA in most cases on HOTEL and BOOK, whereas NIFWA fails to outperform WA in many cases. NIFFHAA performs better than FHAA 20 times and worse than FHAA once, whereas AIFFHAA performs better than FHAA 10 times and worse than FHAA 11 times. Thus, we conclude that the intuitionistic fuzzy fusion methods proposed in this paper are generally better than the three existing fusion methods.

5.4 Regarding weighting and order inducing strategies

We now focus on how weighting strategies and order inducing strategies influence the performance of fusion. First, the existing weighting methods (weighting by accuracy and weighting by the norm of the output weights) are not effective in Chinese review sentiment prediction; in other words, the arithmetic average is the best weighting strategy. The highest accuracies occur 3 times when AA is used, 10 times when IFAA is used, and 5 times when NIFFHAA is used, but only 3 times when methods with weighting strategies are used. AA performs nearly equivalently to WA because the differences among the accuracies of the component classifiers are very small, so the accuracy benefits almost nothing from the weighting process. Second, accuracy performs better when it is used as the weight of a component classifier (approaching the performance of AA) and worse when it is used as the order inducing variable; this is concluded from the fact that NIFAA performs even worse when the ensemble size is bigger on all three datasets. Moreover, the norm of the output weights performs better when it is used as the order inducing variable (inducing the first half obtains considerable performance) and worse when it is used as the weight of a component classifier. Finally, the selective methods usually outperform methods fusing the whole ensemble, even though they actually use only part (a half) of the component classifiers: FHAA outperforms AA 4 times, and NIFFHAA outperforms IFAA 5 times.

6 Conclusions

This paper has combined OS-ELM and intuitionistic fuzzy theories to develop a novel ensemble learning methodology for Chinese review sentiment prediction. Under the framework of intuitionistic fuzzy fusion, an ensemble of OS-ELMs together with their order inducing variables is first trained in parallel. Then, the outputs of the OS-ELMs are equivalently transformed into the IFDP of the test pattern. Associated with proper order inducing variables and a weighting vector, the I-IFOWA operator is used to fuse the degrees of membership and non-membership simultaneously. It has been shown that the existing fusion strategies are just special cases of this generalized ensemble learning framework. The proposed scheme has been successfully applied to Chinese review sentiment prediction with excellent performance.