1 Introduction

Sentiment classification plays a significant role in everyday life, in political activities, in activities relating to commodity production, and commercial activities. Finding a solution for the accurate and timely classification of emotion is a challenging task.

Data clustering is the process of grouping objects into classes of similar objects. A cluster is a set of data objects which are similar to one another but dissimilar to the objects in other clusters. The number of clusters can either be fixed in advance based on experience or be identified automatically by the clustering method.

This technique clusters a set of n data object vectors X = {x1, x2, …, xn} \(\subset R^{s}\) into c fuzzy clusters by minimizing an objective function that measures the quality of the clustering, and finds the center of each cluster so that this cost function is minimized. A fuzzy set is a set in which each basic member x is assigned a real value μ(x) in [0, 1] that expresses the degree to which the member belongs to the set. When the dependency measure equals 0, the basic member does not belong to the set; if the dependency measure equals 1, the basic member belongs to the set completely. Therefore, a fuzzy set is a set of (x, μ(x)) pairs.

Fuzzy C-Means (FCM) is also a method of clustering which allows one element to belong to two or more clusters. It is often used to cluster the data but seldom used for data classification.

We propose the following basic principles for our model, which classifies the opinions (positive, negative, neutral) expressed in the English documents of the English testing data set based on the large number of English sentences in the English training data set. The model uses clustering techniques to classify the semantics of the English documents as follows. Assume that an English document contains n English sentences. The English document has a positive polarity if the number of its sentences clustered into the 30,000 positive English sentences of the training data set is greater than the number clustered into the 30,000 negative English sentences of the training data set. Conversely, the English document has a negative polarity if the number of its sentences clustered into the 30,000 positive English sentences is less than the number clustered into the 30,000 negative English sentences. Finally, the English document has a neutral polarity if the two numbers are equal.
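The decision rule above can be sketched in a few lines of Python. This is only an illustration of the counting rule, not the authors' implementation; the function name `classify_document` and the label strings are assumptions introduced for the example.

```python
from collections import Counter

def classify_document(sentence_labels):
    """Return the document polarity from per-sentence cluster labels.

    sentence_labels: iterable of "positive" / "negative" strings, one per
    sentence (hypothetical label names chosen for this sketch).
    """
    counts = Counter(sentence_labels)
    pos, neg = counts["positive"], counts["negative"]
    if pos > neg:
        return "positive"
    if pos < neg:
        return "negative"
    return "neutral"  # equal counts, including a document with no sentences

print(classify_document(["positive", "negative", "positive"]))  # positive
```

Note that under this rule a document is neutral only when the two counts tie, so neutrality is a by-product of the vote rather than a separately learned class.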

Based on these principles, we implement our proposed model in the Cloudera parallel network environment. Our model uses FCM combined with Hadoop Map (M)/Reduce (R) to classify the sentiments (positive, negative, neutral) of one English document in the English testing data set into either positive polarity, negative polarity or neutral polarity in the Cloudera parallel network environment.

To implement this study, we use the basic Fuzzy C-Means algorithm presented in [8–24]. Many studies use FCM for semantic classification (opinion mining, sentiment analysis), but few use FCM for sentiment analysis with the aforementioned principles of our proposed model.

FCM is a clustering technique from the data mining field. Applying it to natural language processing posed many difficulties, and this research took a long time to implement. FCM has many advantages: it is unsupervised; it always converges; it provides membership values which are useful for interpretation; it is flexible with respect to the distance used; and if some of the membership values are known, this can be incorporated into the numerical optimization. FCM also has several disadvantages: long computational time; sensitivity to the initial guess (speed, local minima); and sensitivity to noise, since one expects a low (or even no) membership degree for outliers (noisy points).

In addition, based on the work related to FCM and the sentiment analysis of big data in [8–24], no studies use FCM for sentiment classification of big data. We use FCM in our model for opinion mining in big data, although the English data sets in this work are small, with 25,000 English documents in each testing data set.

In addition, many works relate to FCM in parallel systems (or FCM in distributed systems) [25–27], many studies relate to parallel systems or distributed systems [28–42], and FCM has been used for sentiment classification in [43–50]. However, to the best of our knowledge, no study uses FCM for sentiment classification in a parallel system, whereas our model uses FCM for semantic analysis in a distributed system.

Many studies, such as [2–56], use Hadoop Map (M)/Reduce (R) and Cloudera; Vector Space Models (VSM); FCM; FCM in parallel systems (distributed systems); sentiment classification; and big data. However, to the best of our knowledge, no studies use all of them. Our proposed model uses all of these.

Finally, we build many FCM-related algorithms in our new model based on the basic FCM in the Cloudera distributed system with Hadoop Map (M) /Reduce (R) and these algorithms have not been used in any other study.

This study comprises six sections: Section 1 is the introduction; Section 2 discusses the related work on Fuzzy C-Means (FCM), Hadoop, Cloudera, etc.; Section 3 discusses the English data set; Section 4 overviews the methodology of our proposed model; Section 5 describes the experiment and Section 6 provides the conclusion.

2 Related work

In this section, we overview several studies related to Fuzzy C-Means (FCM), the Vector Space Model, Hadoop, Cloudera, etc.

There are many studies which are related to the Vector Space Model [2–4]. First of all, the authors of [2] transfer all English sentences into many factors which are used in the VSM algorithm. In this research, the authors examine the Vector Space Model, an information retrieval technique, and its variations. The rapid growth of the World Wide Web and the abundance of documents and different forms of information available on it have resulted in the need for better information retrieval techniques. The Vector Space Model is an algebraic model used for information retrieval. It represents a natural language document in a formal manner by the use of vectors in a multi-dimensional space, and allows decisions to be made as to which documents are similar to each other and to the queries fired. This work also explains the existing variations of the VSM and proposes a new variation that should be considered [3]. In the text classification task, one of the main problems is to choose which features give the best results. Various features can be used such as words, n-grams, syntactic n-grams of various types (POS tags, dependency relations, mixed, etc.), or a combination of these features. Also, algorithms to reduce the dimensionality of these sets of features can be applied, such as Latent Dirichlet Allocation (LDA). In this research, the authors consider the multi-label text classification task and apply various feature sets. The authors consider a subset of multi-labeled files of the Reuters-21578 corpus. The authors used traditional TF-IDF values of the features and tried both considering and ignoring the stop words. The authors also tried several combinations of features, like bi-grams and uni-grams. The authors also experimented by adding LDA results into Vector Space Models as new features. These latter experiments obtained the best results [4]. KNN and SVM are two machine learning approaches to text categorization (TC) based on the Vector Space Model. In this model, borrowed from information retrieval, documents are represented as a vector where each component is associated with a particular word from the vocabulary. Traditionally, each component value is assigned using the information retrieval TF-IDF measure. While this weighting method seems very appropriate for IR, it is not clear that it is the best choice for TC problems. Actually, this weighting method does not leverage the information implicitly contained in the categorization task to represent documents. In this research, the authors introduce a new weighting method based on the statistical estimation of the importance of a word for a specific categorization problem. This method also has the benefit of making feature selection implicit, since useless features of the categorization problem considered are assigned a very small weight. Extensive experiments reported in the research show that this new weighting method significantly improves classification accuracy as measured on many categorization tasks.

Many studies such as [5–7] are related to the implementation of algorithms and applications in the parallel network environment. Hadoop is an Apache-based framework which is used to handle large data sets on clusters consisting of multiple computers, using the Map and Reduce programming model. The two main projects of Hadoop are the Hadoop Distributed File System (HDFS) and Hadoop M/R (Hadoop Map/Reduce). Hadoop M/R allows engineers to write applications for the parallel processing of large data sets on clusters consisting of multiple computers. An M/R task has two main components: (1) Map and (2) Reduce. This framework splits the input data into chunks, each of which a Map task can handle in parallel as a separate data partition. The outputs of the Map tasks are sorted and then gathered and processed by the Reduce tasks. The inputs and outputs of each M/R job are stored in HDFS. Because the Map tasks and the Reduce tasks operate on (key, value) pairs, the input and output formats are also (key, value) pairs [7]. Cloudera, the global provider of the fastest, easiest, and most secure data management and analytics platform built on Apache™ Hadoop® and the latest open source technologies, announced in November 2015 that it will submit proposals for Impala and Kudu to join the Apache Software Foundation (ASF). By donating its leading analytic database and columnar storage projects to the ASF, Cloudera aims to accelerate the growth and diversity of their respective developer communities. Cloudera delivers the modern data management and analytics platform built on Apache Hadoop and the latest open source technologies. The world’s leading organizations trust Cloudera to help solve their most challenging business problems with Cloudera Enterprise, the fastest, easiest and most secure data platform currently available. Cloudera’s customers are able to efficiently capture, store, process and analyze vast amounts of data, empowering them to use advanced analytics to drive business decisions quickly, flexibly and at a lower cost than has been possible before. To ensure Cloudera’s customers are successful, it offers comprehensive support, training and professional services.

There are many studies, such as [8–24], which are related to the FCM algorithm.

Many studies are related to FCM in parallel systems (or FCM in distributed systems), such as the work in [25–27].

Many studies, such as [28–42], are related to parallel systems or distributed systems.

Research using FCM for sentiment classification can be found in [43–50].

The latest research on sentiment classification can be found in [51–54, 56].

3 Data set

The English training data set includes 60,000 English sentences in the movie field, of which 30,000 are positive English sentences and 30,000 are negative English sentences.

All English sentences in our English training data set have been automatically extracted from Facebook and websites in social networks, after which we labeled them as either positive or negative. Figure 1 is the English training data set of this model.

Fig. 1 Our English training data set

We used a publicly available large data set of movie reviews from the Internet Movie Database [1]. This English data set comprises a testing data set, which we refer to as the first testing data set, and a training data set, which we refer to as the second testing data set. Both our first testing data set and our second testing data set contain 25,000 English documents, each with 12,500 positive English movie reviews and 12,500 negative English movie reviews. Figure 2 is the English testing data set of this model.

Fig. 2 Our English testing data set

4 Methodology

The methodology section comprises two parts. The first part presents the sentiment classification of the 25,000 English documents in testing data set t1 and the 25,000 English documents in testing data set t2 in the sequential environment. The second part presents the sentiment classification of the same two testing data sets in the parallel network environment.

In the English training data set, there are two clusters: the first, called the positive cluster, contains 30,000 positive English sentences and the second, called the negative cluster, contains 30,000 negative English sentences. All English sentences in both clusters have undergone word segmentation and stop-word removal, after which they are transferred into vectors (vector representation). The 30,000 positive English sentences in the positive cluster are transferred into 30,000 positive vectors, called the positive vector group (or the positive vector cluster). The 30,000 negative English sentences in the negative cluster are transferred into 30,000 negative vectors, called the negative vector group (or the negative vector cluster). Therefore, the English training data set includes the positive vector group and the negative vector group [2–4]. The VSM is an algebraic model used for information retrieval. It represents a natural language document in a formal manner by the use of vectors in a multidimensional space, that is, it represents documents through the words that they contain. Vector space modeling places terms, documents, and queries in a term-document space so that it is possible to compute the similarities between queries and the terms or documents, and to rank the results of the computation according to the similarity measure between them. The VSM allows decisions to be made about which documents are similar to each other and to queries.

We transferred all the English sentences in the training data set into vectors similar to VSM [2–4].
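A minimal sketch of this vectorization step is shown below. It assumes simple term-count weighting and a toy stop-word list; the source does not state which weighting or stop-word list was used, and the names `tokenize`, `build_vocabulary` and `to_vector` are hypothetical.

```python
import re

# Toy stop-word list, an assumption made only for this example.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "this"}

def tokenize(sentence):
    """Word segmentation followed by stop-word removal."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_vocabulary(sentences):
    """Map every remaining word of the corpus to one vector dimension."""
    vocab = sorted({w for s in sentences for w in tokenize(s)})
    return {word: index for index, word in enumerate(vocab)}

def to_vector(sentence, vocabulary):
    """Represent a sentence by its term counts over the vocabulary."""
    vector = [0.0] * len(vocabulary)
    for word in tokenize(sentence):
        if word in vocabulary:  # words unseen in training are ignored
            vector[vocabulary[word]] += 1.0
    return vector

training = ["This movie is great", "The plot was boring and the acting was bad"]
vocabulary = build_vocabulary(training)
vectors = [to_vector(s, vocabulary) for s in training]
print(vectors[0])  # [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
```

Term counts could be replaced by TF-IDF weights without changing the rest of the pipeline, since FCM only needs each sentence as a point in the same vector space.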

4.1 Fuzzy C-means algorithm in the sequential environment

Figure 3 illustrates how sentiment classification is undertaken in the sequential environment.

Fig. 3 Fuzzy C-Means algorithm in the sequential environment

We assume that each English document in the English testing data set has n English sentences, and we transfer the n English sentences into n vectors similar to VSM [2–4]. Thus, the document has n vectors. For each of the n vectors, we use FCM to cluster the vector into the positive vector group or the negative vector group in the sequential environment. Following [8–17], we implement an FCM algorithm which is enhanced to be able to classify the sentiment of the English sentences.

The set of all fuzzy partitions of N objects in D into c clusters is defined as follows:

$$E_{fc} =\left\{U\in R^{c\times N} \;\middle\vert\; {\underset{1\le i\le c\wedge 1\le k\le N}{\forall}} u_{ik} \in [0,1],\ \sum\limits_{i=1}^{c} u_{ik} =1,\ 0<\sum\limits_{k=1}^{N} u_{ik} <N\right\} $$

Minimize the objective function:

$$J_{m} (U,V)=\sum\limits_{i=1}^{c} \sum\limits_{k=1}^{N} (u_{ik} )^{m} d_{ik}^{2} $$
$$d_{ik}^{2} =\Vert x_{k} - v_{i} \Vert_{A}^{2} =(x_{k} - v_{i})^{T} A\,(x_{k} - v_{i}) $$

V = [v1, v2, ..., vc] is the matrix of the cluster centers, A is a positive definite matrix, and m is the weighting exponent in [1, \(\infty \)).

The objective function reaches a minimum value if and only if:

$$ {\underset{1\le k\le N}{\forall}} \,\,\,I_{k} =\{i \mid 1\le i\le c;\ d_{ik} =0\} $$
$$ {\underset{1\le i\le c\wedge 1\le k\le N}{\forall}} \,\,\,u_{ik} =\left\{\begin{array}{ll} {(d_{ik} )}^{\frac{2}{1-m}} \left[\sum\limits_{j=1}^{c} {(d_{jk})}^{\frac{2}{1-m}}\right]^{-1} , & I_{k} =\emptyset \\ 0 \text{ for } i\notin I_{k} ,\ \text{with } \sum\limits_{i\in I_{k}} u_{ik} =1, & I_{k} \ne \emptyset \end{array}\right. $$
(1)
$$ {\underset{1\le i\le c}{\forall}} \,\,\,v_{i} =\frac{\sum\limits_{k=1}^{N} {(u_{ik} )}^{m} x_{k} } {\sum\limits_{k=1}^{N} {(u_{ik} )}^{m} } $$
(2)
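As a small worked illustration of the membership update (1), consider one object \(x_{k}\) and c = 2 clusters. The values m = 2, \(d_{1k} = 1\) and \(d_{2k} = 2\) are chosen only for this example and do not come from the experiments. The exponent is then 2/(1 - m) = -2, and \(I_{k}\) is empty because neither distance is zero:

$$ u_{1k} =\frac{1^{-2}}{1^{-2} +2^{-2}} =\frac{1}{1.25} =0.8,\qquad u_{2k} =\frac{2^{-2}}{1^{-2} +2^{-2}} =\frac{0.25}{1.25} =0.2 $$

The two memberships sum to 1, as required by the definition of a fuzzy partition, and the closer center receives the larger membership.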

The FCM algorithm comprises the following steps, with \({\vert \vert U\vert \vert } {_{F}^{2}} =\sum \nolimits _{i} {\sum \nolimits _{k} U_{ik}^{2}}\):

figure b
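Since the step listing itself is only available as a figure, the following Python sketch shows the standard alternating FCM iteration built from updates (1) and (2). It is a sketch under stated assumptions, not the authors' enhanced algorithm: it uses the Euclidean distance (A equal to the identity), takes the initial centers as an explicit argument, shares membership equally when an object coincides with several centers, and stops when the squared Frobenius norm of the change in U falls below a threshold `eps`. All function and parameter names are hypothetical.

```python
import math

def update_memberships(points, centers, m):
    """Equation (1): U[i][k] is the membership of point k in cluster i."""
    c, n = len(centers), len(points)
    U = [[0.0] * n for _ in range(c)]
    power = 2.0 / (1.0 - m)
    for k, x in enumerate(points):
        d = [math.dist(x, v) for v in centers]
        zero = [i for i in range(c) if d[i] == 0.0]
        if zero:  # the point sits exactly on one or more centers
            for i in zero:
                U[i][k] = 1.0 / len(zero)
        else:
            w = [di ** power for di in d]
            total = sum(w)
            for i in range(c):
                U[i][k] = w[i] / total
    return U

def update_centers(points, U, m):
    """Equation (2): membership-weighted mean of the points per cluster."""
    dim = len(points[0])
    centers = []
    for row in U:
        weights = [u ** m for u in row]
        total = sum(weights)
        centers.append([sum(w * x[j] for w, x in zip(weights, points)) / total
                        for j in range(dim)])
    return centers

def fuzzy_c_means(points, initial_centers, m=2.0, eps=1e-6, max_iter=100):
    """Alternate (2) and (1) until U changes by less than eps."""
    U = update_memberships(points, initial_centers, m)
    centers = initial_centers
    for _ in range(max_iter):
        centers = update_centers(points, U, m)
        U_new = update_memberships(points, centers, m)
        # Squared Frobenius norm of the change in the partition matrix.
        change = sum((a - b) ** 2
                     for r_new, r_old in zip(U_new, U)
                     for a, b in zip(r_new, r_old))
        U = U_new
        if change < eps:
            break
    return U, centers

points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
U, centers = fuzzy_c_means(points, initial_centers=[[0.0, 0.0], [5.0, 5.0]])
print([round(v, 2) for v in centers[0]])
```

Passing the initial centers explicitly makes the sensitivity to the initial guess mentioned in the introduction easy to explore: two starting centers that lie close together in the same group converge more slowly and may stall near a poor solution.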

Given the clustering results of the n vectors of a document in the testing data set, the document has a positive sentiment if the number of its vectors clustered into the positive vector group is greater than the number clustered into the negative vector group. The document has a negative sentiment if the number of its vectors clustered into the positive vector group is less than the number clustered into the negative vector group. The document has a neutral sentiment if the two numbers are equal.

4.2 Fuzzy C-means (FCM) in the parallel network environment

Figure 4 illustrates how semantic classification is undertaken in a parallel network environment.

Fig. 4 Fuzzy C-Means algorithm in the parallel network environment

We transfer the 60,000 English sentences in the training data set into 60,000 vectors using Hadoop Map (M)/Reduce (R) in the Cloudera parallel network environment to shorten the execution time of this task. Figure 5 overviews the process of transferring each English sentence into one vector in the Cloudera network environment.

Fig. 5 Overview of the process of transforming each English sentence into one vector in Cloudera

Transferring each English sentence into one vector in the Cloudera network environment involves two phases: the Map (M) phase and the Reduce (R) phase. The input of the Map phase is one English sentence and the output of the Map phase is the set of components of the vector which corresponds to the sentence. In the Map phase of Cloudera, we transfer the sentence into these components similar to VSM [2–4]. The input of the Reduce phase is the output of the Map phase, that is, the components of a vector. The output of the Reduce phase is the vector which corresponds to the sentence. In the Reduce phase of Cloudera, these components are built into one vector.
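The two phases can be imitated locally in plain Python to make the data flow concrete. This sketch does not use the Hadoop API; `run_job` is a hypothetical stand-in for the shuffle step that groups the Map output by key, and the vocabulary is a toy assumption.

```python
from collections import defaultdict

# Toy vocabulary mapping words to vector dimensions (an assumption).
VOCABULARY = {"bad": 0, "boring": 1, "great": 2, "movie": 3}

def map_sentence(sentence_id, sentence):
    """Map: emit one (key, value) pair per recognised word of the sentence."""
    for word in sentence.lower().split():
        if word in VOCABULARY:
            yield sentence_id, (VOCABULARY[word], 1.0)

def reduce_components(sentence_id, components):
    """Reduce: assemble the emitted components into one dense vector."""
    vector = [0.0] * len(VOCABULARY)
    for index, value in components:
        vector[index] += value
    return sentence_id, vector

def run_job(sentences):
    """Local stand-in for the shuffle that groups Map output by key."""
    grouped = defaultdict(list)
    for sentence_id, sentence in sentences.items():
        for key, value in map_sentence(sentence_id, sentence):
            grouped[key].append(value)
    return dict(reduce_components(k, v) for k, v in grouped.items())

vectors = run_job({"s1": "great movie", "s2": "boring boring movie"})
print(vectors["s2"])  # [0.0, 2.0, 0.0, 1.0]
```

Using the sentence identifier as the key ensures that all components of one sentence reach the same Reduce call. In this simplified version a sentence with no recognised words emits nothing and therefore produces no vector.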

Each English document in the testing data set contains n English sentences. We transfer each English sentence in the n English sentences into one vector similar to the process shown in Fig. 5. Hence, the document also has n vectors.

FCM in the Cloudera parallel network environment comprises two phases: the first phase is the Hadoop Map (M) phase in Cloudera and the second phase is the Hadoop Reduce (R) phase in Cloudera. In the Map phase, the input is the n vectors of one English document, which are to be clustered into either the positive vector group or the negative vector group, and the output is the clustering results of the n vectors of the document. In the Reduce phase, the input is the output of the Map phase, that is, the clustering results of the n vectors of the document, and the output is the sentiment classification result of the document as having positive polarity, negative polarity, or neutral polarity. In the Reduce phase, the English document is classified as having a positive sentiment if the number of its vectors in the positive vector group is greater than the number of its vectors in the negative vector group; as having a negative sentiment if the number of its vectors in the positive vector group is less than the number of its vectors in the negative vector group; and as having a neutral sentiment if the two numbers are equal.
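The Map and Reduce phases described above can be sketched locally as follows. The sketch makes a strong simplifying assumption that is not taken from the source: each vector group is summarised by a single center, and a vector is assigned to the group in which its FCM membership is highest. It does not use the Hadoop API, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def nearest_group(vector, centers, m=2.0):
    """Label of the group with the highest FCM membership for this vector.

    The normalising sum in the membership formula is the same for every
    group, so comparing the unnormalised weights gives the same answer.
    """
    power = 2.0 / (1.0 - m)
    best_label, best_weight = None, -1.0
    for label, center in centers.items():
        d = math.dist(vector, center)
        weight = math.inf if d == 0.0 else d ** power
        if weight > best_weight:
            best_label, best_weight = label, weight
    return best_label

def map_document(doc_id, vectors, centers):
    """Map: emit (document id, group label) for every sentence vector."""
    for vector in vectors:
        yield doc_id, nearest_group(vector, centers)

def reduce_document(doc_id, labels):
    """Reduce: compare the two counts to decide the document polarity."""
    counts = Counter(labels)
    if counts["positive"] > counts["negative"]:
        return doc_id, "positive"
    if counts["positive"] < counts["negative"]:
        return doc_id, "negative"
    return doc_id, "neutral"

def classify(documents, centers):
    """Local stand-in for a Map/Reduce job over all documents."""
    grouped = defaultdict(list)
    for doc_id, vectors in documents.items():
        for key, label in map_document(doc_id, vectors, centers):
            grouped[key].append(label)
    return dict(reduce_document(k, v) for k, v in grouped.items())

centers = {"positive": [1.0, 0.0], "negative": [0.0, 1.0]}
documents = {
    "d1": [[0.9, 0.1], [0.8, 0.0], [0.1, 0.7]],
    "d2": [[0.2, 0.9], [0.9, 0.2]],
}
result = classify(documents, centers)
print(result)  # {'d1': 'positive', 'd2': 'neutral'}
```

Because each document is processed independently in the Map phase, documents can be spread over the nodes of the cluster without any communication until the per-document counts are reduced.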

4.2.1 Hadoop Map (M)

Figure 6 illustrates the Hadoop Map phase.

Fig. 6 Overview of Fuzzy C-Means in Hadoop Map (M) in Cloudera

Similar to [7–17], we propose FCM as follows:

figure c

4.2.2 Hadoop Reduce (R)

Figure 7 illustrates the Hadoop Reduce phase.

Fig. 7 Overview of Hadoop Reduce (R) in Cloudera

5 Experiment

We used measures such as accuracy (A) to calculate the accuracy of the results of sentiment classification.
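For reference, accuracy here can be read as the fraction of documents whose predicted polarity equals the labeled polarity. The short Python sketch below illustrates that definition; the function name and the sample labels are assumptions made for the example and are not experimental data.

```python
def accuracy(predicted, actual):
    """Fraction of documents whose predicted label matches the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

predicted = ["positive", "negative", "neutral", "positive"]
actual = ["positive", "negative", "positive", "negative"]
print(accuracy(predicted, actual))  # 0.5
```

Since the testing data sets contain only positive and negative reviews, every document that the model labels as neutral counts as an error under this measure.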

To implement the proposed model, we used the Java programming language to store the English training data set, the English testing data sets (the 25,000 English documents in testing data set t1 and the 25,000 English documents in testing data set t2) and the results of the sentiment classification.

The sequential environment in this research comprises one node (one server). The Java language is used to program FCM. The configuration of the server in the sequential environment is Intel® Server Board S1200V3RPS, Intel® Pentium® Processor G3220 (3M Cache, 3.00 GHz), 2GB PC3-10600 ECC 1333 MHz LP Unbuffered DIMMs. The operating system of the server is Cloudera.

We implement FCM in the Cloudera parallel network environment; this Cloudera system comprises four nodes (four servers). The Java language is used to program the application of the FCM in Cloudera. The configuration of each server in the Cloudera system is Intel® Server Board S1200V3RPS, Intel® Pentium® Processor G3220 (3M Cache, 3.00 GHz), 2GB PC3-10600 ECC 1333 MHz LP Unbuffered DIMMs. The operating system of each of the four nodes is Cloudera. All nodes have the same configuration.

The results of the sentiment classification of the 25,000 English documents in testing data set t1 are presented in Table 1.

Table 1 The results of the 25,000 English documents in testing data set t1

The results of the sentiment classification of the 25,000 English documents in testing data set t2 are presented in Table 2.

Table 2 The results of the 25,000 English documents in testing data set t2

The accuracy of the sentiment classification of the 25,000 English documents in testing data set t1 is shown in Table 3.

Table 3 The accuracy of our proposed model for the sentiment classification of the 25,000 English documents in testing data set t1

The accuracy of the sentiment classification of the 25,000 English documents in testing data set t2 is shown in Table 4.

Table 4 The accuracy of our proposed model for the sentiment classification of the 25,000 English documents in testing data set t2

6 Conclusion

Although our proposed model was tested on an English data set, it can also be applied to many other languages. In this paper, our model was tested on the 25,000 English documents in the testing data set t1 and the 25,000 English documents in the testing data set t2 which are small data sets. However, our model can be applied to a big data set containing millions of English documents in a very short time.

In this work, we proposed a new model to classify the sentiments of English documents using the Fuzzy C-Means algorithm (FCM) with Hadoop Map (M)/Reduce (R) in the Cloudera parallel network environment. The experimental results show that our proposed model achieves accuracies of 60.2 % and 59.8 % on the English testing data sets. Currently, there is a paucity of research which shows that clustering methods can be used to classify data. Our research shows that clustering methods are able to classify data and, in particular, that they are useful for the sentiment classification of text.

As shown in Table 3, the average time taken for the sentiment classification of the 25,000 English documents in testing data set t1 using the FCM algorithm in the sequential environment is 150,590 seconds, which is greater than the average time taken for the sentiment classification of the 25,000 English documents using FCM in the Cloudera parallel network environment, which is 37,659 seconds.

As shown in Table 4, the average time taken for the sentiment classification of the 25,000 English documents in testing data set t2 using the FCM algorithm in the sequential environment is 151,590 seconds, which is greater than the average time taken for the sentiment classification of the 25,000 English documents in testing data set t2 using FCM in the Cloudera parallel network environment, which is 37,875 seconds.
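The reported times can be turned into a speedup figure with a short calculation. The sketch below uses only the four times quoted above and the four-node configuration from the experiment section; the notions of speedup (sequential time divided by parallel time) and efficiency (speedup divided by the number of nodes) are standard measures introduced here for illustration and are not reported in the tables.

```python
def speedup(sequential_seconds, parallel_seconds):
    """Ratio of sequential to parallel execution time."""
    return sequential_seconds / parallel_seconds

NODES = 4  # the Cloudera system in the experiment has four servers
runs = {"t1": (150_590, 37_659), "t2": (151_590, 37_875)}

for name, (seq, par) in runs.items():
    s = speedup(seq, par)
    print(f"{name}: speedup {s:.2f}, efficiency {s / NODES:.2f}")
# t1: speedup 4.00, efficiency 1.00
# t2: speedup 4.00, efficiency 1.00
```

On both testing data sets the four-node system is therefore about four times faster than the single server, which corresponds to nearly linear scaling for this workload.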

The execution time of the FCM in Cloudera is dependent on the performance of the Cloudera parallel system and is also dependent on the performance of each server on the Cloudera system.

The principles underpinning our proposed model for classifying the sentiment (positive, negative, neutral) of the English documents in the English testing data set in the sequential environment, based on the numerous English sentences in the English training data set are similar to the principles underpinning our proposed model for classifying the sentiment (positive, negative, neutral) of the English documents in English testing data set in the distributed environment, based on the numerous English sentences in English training data set.

The FCM of our proposed model in the sequential environment is different from the FCM of our proposed model in the parallel environment. We built many algorithms related to the FCM to implement our model in the distributed system.

The execution time of our model in the parallel environment is less than the execution time of our model in the sequential environment. The execution time of our model in the distributed system is shorter if the performance of the distributed system is higher.

In addition, the execution time of any model is also dependent on the algorithms. For example, using the same algorithms, different systems perform differently and have different execution times. Using the same system with the same performance, different algorithms may have different execution times.

Our model has both advantages and disadvantages. The advantages are that it can process big data involving millions of English documents and that its execution time for sentiment classification on big data is short. The disadvantages are that it takes a long time to implement and that it is costly to build the algorithms of the model in the distributed system.

To understand the scientific value of this research, we compare our model’s results with the results of models used in other studies.

Table 5 compares our model’s results with the studies in [2–4], using the following abbreviations:

  • cluster technique: CT.

  • sentiment classification: SC (opinion mining, or semantic classification, or emotion classification).

  • parallel network system: PNS (distributed system).

  • special domain: SD.

  • dependence on the training data set: DT.

  • language: L.

  • Vector Space Model: VSM.

  • no mention: NM.

  • English language: EL.

  • Fuzzy C-Means: FCM.

Table 5 Comparison of our model’s results with the work in [2–4]

Table 6 compares our model’s results with the work related to the Fuzzy C-Means (FCM) algorithm in [8–24].

Table 6 Comparison of our model’s results with the work related to the Fuzzy C-Means (FCM) algorithm in [8–24]

Table 7 compares our model’s results with studies related to Fuzzy C-Means in the parallel system (or FCM in the distributed system) in [25–27].

Table 7 Comparison of our model’s results with studies related to Fuzzy C-Means in the parallel system (or FCM in the distributed system) in [25–27]

Table 8 compares our model’s results with studies related to FCM for sentiment classification in [43–50].

Table 8 Comparison of our model’s results with the FCM used for sentiment classification in [43–50]

Table 9 compares our model’s results with the latest research on sentiment classification (or sentiment analysis or opinion mining) in [51–56].

Table 9 Comparison of the proposed model with the latest sentiment classification models (or the latest sentiment classification methods) in [51–56]