Fuzzy C-means for english sentiment classification in a distributed system

Phu, Vo Ngoc; Dat, Nguyen Duy; Ngoc Tran, Vo Thi; Ngoc Chau, Vo Thi; Nguyen, Tuan A.

doi:10.1007/s10489-016-0858-z


Published: 05 November 2016

Volume 46, pages 717–738, (2017)

Applied Intelligence


Vo Ngoc Phu ORCID: orcid.org/0000-0001-6047-9066^1,2,
Nguyen Duy Dat³,
Vo Thi Ngoc Tran⁴,
Vo Thi Ngoc Chau⁵ &
…
Tuan A. Nguyen⁶


Abstract

Sentiment classification plays a significant role in everyday life, in political activities, in activities relating to commodity production, and commercial activities. Finding a solution for the accurate and timely classification of emotions is a challenging task. In this research, we propose a new model for big data sentiment classification in the parallel network environment. Our proposed model uses the Fuzzy C-Means (FCM) method for English sentiment classification with Hadoop MAP (M) /REDUCE (R) in Cloudera. Cloudera is a parallel network environment. Our proposed model can classify the sentiments of millions of English documents in the parallel network environment. We tested our model using the testing data set (which comprised 25,000 English reviews, 12,500 being positive and 12,500 negative) and achieved 60.2 % accuracy. Our English training data set has 60,000 English sentences, comprising 30,000 positive English sentences and 30,000 negative English sentences.


1 Introduction

Sentiment classification plays a significant role in everyday life, in political activities, in activities relating to commodity production, and commercial activities. Finding a solution for the accurate and timely classification of emotion is a challenging task.

Data clustering is the process of grouping similar objects into classes. A cluster is a set of data objects that are similar to one another but dissimilar to the objects in other clusters. The number of clusters can either be fixed in advance, based on experience, or be identified automatically by the clustering method.

Fuzzy clustering partitions a set of n data object vectors $X = \{x_1, x_2, \ldots, x_n\} \subset R^s$ into c fuzzy clusters by minimizing an objective function that measures the quality of the clustering, finding the center of each cluster so that the cost function is minimized.

A fuzzy set is a set in which each basic member x is assigned a real value μ(x) in [0, 1] that expresses the degree to which the member belongs to the set. When the degree equals 0, the member does not belong to the set; when it equals 1, the member belongs to the set completely. Therefore, a fuzzy set is a set of (x, μ(x)) pairs.

Fuzzy C-Means (FCM) is also a method of clustering which allows one element to belong to two or more clusters. It is often used to cluster the data but seldom used for data classification.

We propose the following basic principles for classifying the opinions (positive, negative, neutral) expressed in the English documents of the English testing data set, based on the large number of English sentences in the English training data set. Our model uses clustering techniques to classify the semantics of the English documents. Assume that an English document contains n English sentences. The document has a positive polarity if the number of its sentences clustered into the 30,000 positive English sentences of the training data set is greater than the number clustered into the 30,000 negative English sentences of the training data set. Conversely, the document has a negative polarity if the number clustered into the positive sentences is less than the number clustered into the negative sentences. Finally, the document has a neutral polarity if the two numbers are equal.

Based on these principles, we implement our proposed model in the Cloudera parallel network environment. Our model uses FCM combined with Hadoop Map (M)/Reduce (R) to classify the sentiments (positive, negative, neutral) of one English document in the English testing data set into either positive polarity, negative polarity or neutral polarity in the Cloudera parallel network environment.

To implement this study, we use the basic Fuzzy C-Means algorithm presented in [8–24]. Many studies use FCM in semantic classification (opinion mining, sentiment analysis), but few use FCM for sentiment analysis with the aforementioned principles of our proposed model.

FCM is a clustering technique from the data mining field. Applying it to natural language processing presented many difficulties, and this research took a long time to implement. FCM has several advantages: it is unsupervised; it always converges; it provides membership values that are useful for interpretation; it is flexible with respect to the distance used; and, if some of the membership values are known, they can be incorporated into the numerical optimization. It also has several disadvantages: long computational time; sensitivity to the initial guess (speed, local minima); and sensitivity to noise, since one would expect a low (or even zero) membership degree for outliers (noisy points).

In addition, among the works related to FCM and the sentiment analysis of big data in [8–24], there are no studies that use FCM for big data sentiment classification. We use FCM in our model for opinion mining in big data, although the English data sets in this work are small, with 25,000 English documents in each testing data set.

In addition, although many works address FCM in parallel or distributed systems [25–27], parallel or distributed systems in general [28–42], and FCM for sentiment classification [43–50], to the best of our knowledge no study has used FCM for sentiment classification in a parallel system. Our model uses FCM for sentiment analysis in a distributed system.

Many studies, such as [2–56], use Hadoop Map (M)/Reduce (R) and Cloudera; Vector Space Models (VSM); FCM; FCM in parallel (distributed) systems; sentiment classification; and big data. However, to the best of our knowledge, no studies use all of them. Our proposed model uses all of these.

Finally, we build many FCM-related algorithms in our new model based on the basic FCM in the Cloudera distributed system with Hadoop Map (M) /Reduce (R) and these algorithms have not been used in any other study.

This study comprises six sections: Section 1 is the introduction; Section 2 discusses the related work on Fuzzy C-Means (FCM), Hadoop, Cloudera, etc.; Section 3 discusses the English data set; Section 4 overviews the methodology of our proposed model; Section 5 describes the experiment and Section 6 provides the conclusion.

2 Related work

In this section, we overview several studies related to Fuzzy C-Means (FCM), the Vector Space Model, Hadoop, Cloudera, etc.

There are many studies related to the Vector Space Model [2–4]. The authors of [2] examine the Vector Space Model, an information retrieval technique, and its variations, transferring English sentences into the components used by the VSM algorithm. The rapid growth of the World Wide Web, and the abundance of documents and other forms of information available on it, has created a need for better information retrieval techniques. The Vector Space Model is an algebraic model used for information retrieval. It represents a natural language document formally as a vector in a multi-dimensional space, and allows decisions to be made as to which documents are similar to each other and to the queries issued. This work also explains the existing variations of the VSM and proposes a new variation.

The authors of [3] address one of the main problems of the text classification task: choosing which features give the best results. Various features can be used, such as words, n-grams, syntactic n-grams of various types (POS tags, dependency relations, mixed, etc.), or a combination of these. Algorithms that reduce the dimensionality of these feature sets, such as Latent Dirichlet Allocation (LDA), can also be applied. The authors consider the multi-label text classification task on a subset of multi-labeled files of the Reuters-21578 corpus and apply various feature sets. They use traditional TF-IDF values of the features, both with and without stop words, and try several combinations of features, such as bi-grams and uni-grams. They also add LDA results to the Vector Space Model as new features, and these latter experiments obtained the best results.

The authors of [4] consider KNN and SVM, two machine learning approaches to text categorization (TC) based on the Vector Space Model. In this model, borrowed from information retrieval, documents are represented as vectors in which each component is associated with a particular word from the vocabulary. Traditionally, each component value is assigned using the information retrieval TFIDF measure. While this weighting method seems very appropriate for IR, it is not clear that it is the best choice for TC problems, because it does not use the information implicitly contained in the categorization task to represent documents. The authors introduce a new weighting method based on a statistical estimate of the importance of a word for a specific categorization problem. This method also makes feature selection implicit, since features that are useless for the categorization problem are assigned a very small weight. Extensive experiments reported in that work show that the new weighting method significantly improves classification accuracy on many categorization tasks.

Many studies such as [5–7] are related to the implementation of algorithms and applications in the parallel network environment. Hadoop is an Apache-based framework used to handle large data sets on clusters of multiple computers, using the Map and Reduce programming model. The two main projects of Hadoop are the Hadoop Distributed File System (HDFS) and Hadoop M/R (Hadoop Map/Reduce). Hadoop M/R allows engineers to write applications for the parallel processing of large data sets on clusters of multiple computers. An M/R job has two main components: (1) Map and (2) Reduce. The framework splits the input data into chunks, each of which is handled in parallel by a Map task as a separate data partition. The outputs of the Map tasks are sorted, gathered, and then processed by the Reduce tasks. The inputs and outputs of each M/R job are stored in HDFS. Because the Map and Reduce tasks operate on (key, value) pairs, the input and output are also formatted as (key, value) pairs [7].

Cloudera, a provider of a data management and analytics platform built on Apache Hadoop and the latest open source technologies, announced in November 2015 that it would submit proposals for Impala and Kudu to join the Apache Software Foundation (ASF). By donating its analytic database and columnar storage projects to the ASF, Cloudera aims to accelerate the growth and diversity of their respective developer communities. Its platform, Cloudera Enterprise, lets customers capture, store, process, and analyze vast amounts of data and use advanced analytics to drive business decisions. Cloudera also offers support, training, and professional services.

There are many studies, such as [8–24] which are related to the FCM algorithm.

Many studies are related to FCM in parallel systems (or FCM in distributed systems) such as the work in [25–27].

Many studies, such as [28–42] are related to parallel systems or distributed systems.

Research using FCM for sentiment classification can be found in [43–50].

The latest research on sentiment classification can be found in [51–54, 56].

3 Data set

The English training data set includes 60,000 English sentences in the movie field, of which 30,000 are positive English sentences and 30,000 are negative English sentences.

All English sentences in our English training data set have been automatically extracted from Facebook and websites in social networks, after which we labeled them as either positive or negative. Figure 1 shows the English training data set of this model.

We used a publicly available large data set of movie reviews from the Internet Movie Database [1]. This English data set comprises a testing data set, which we refer to as the first testing data set, and a training data set, which we refer to as the second testing data set. Both our first testing data set and our second testing data set contain 25,000 English documents, each with 12,500 positive English movie reviews and 12,500 negative English movie reviews. Figure 2 shows the English testing data set of this model.

4 Methodology

The methodology section comprises two parts. The first part presents the sentiment classification of the 25,000 English documents in testing data set t1 and the 25,000 English documents in testing data set t2 in the sequential environment. The second part presents the sentiment classification of the same two testing data sets in the parallel network environment.

In the English training data set, there are two clusters: the first, called the positive cluster, contains 30,000 positive English sentences and the second, called the negative cluster, contains 30,000 negative English sentences. All English sentences in both the first cluster and the second cluster have undergone word segmentation and stop-word removal after which they are transferred into vectors (vector representation). The 30,000 positive English sentences in the positive cluster are transferred into the 30,000 positive vectors, called the positive vector group (or the positive vector cluster). The 30,000 negative English sentences in the negative cluster are transferred into 30,000 negative vectors, called the negative vector group (or the negative vector cluster). Therefore, the English training data set includes the positive vector group (or the positive vector cluster) and the negative vector group (or the negative vector cluster) [2–4]. The VSM is an algebraic model used for information retrieval. It represents a natural language document in a formal manner by the use of vectors in a multidimensional space. The VSM is a way of representing documents through the words that they contain. Vector space modeling places terms, documents, and queries in a term-document space so it is possible to compute the similarities between queries and the terms or documents, and allow the results of the computation to be ranked according to the similarity measure between them. The VSM allows decisions to be made about which documents are similar to each other and to queries.

We transferred all the English sentences in the training data set into vectors similar to VSM [2–4].
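The transfer of sentences into vectors can be sketched as follows. This is only a minimal Python illustration of a term-frequency Vector Space Model, not the authors' Java implementation: the stop-word list, the tokenizer, the use of raw term counts as weights, and the function names `tokenize`, `build_vocabulary`, and `to_vector` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the paper does not specify the one it used.
STOP_WORDS = {"a", "an", "the", "is", "was", "this", "of", "and", "it"}

def tokenize(sentence):
    """Lowercase, split into word tokens, and drop stop words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_vocabulary(sentences):
    """Map each distinct term to a fixed vector index (sorted for determinism)."""
    terms = sorted({w for s in sentences for w in tokenize(s)})
    return {term: i for i, term in enumerate(terms)}

def to_vector(sentence, vocabulary):
    """Build a term-frequency vector; terms outside the vocabulary are ignored."""
    counts = Counter(tokenize(sentence))
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector

training = ["This movie is great",
            "The plot was boring and the acting was boring"]
vocab = build_vocabulary(training)
vectors = [to_vector(s, vocab) for s in training]
```

Every sentence is mapped into the same coordinate system, so distances between sentence vectors are well defined, which is what the clustering step requires.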

4.1 Fuzzy C-means algorithm in the sequential environment

Figure 3 illustrates how sentiment classification is undertaken in the sequential environment.

For each English document in the English testing data set, we assume that the document has n English sentences, and we transfer the n English sentences into n vectors using the VSM [2–4]. Thus, the document has n vectors. For each of the n vectors, we use FCM to cluster the vector into the positive vector group or the negative vector group in the sequential environment. Following [8–17], we implement an enhanced version of the FCM algorithm that is able to classify the sentiment of English sentences.

The set of all fuzzy partitions of N objects in D into c clusters is defined as follows:

$$E_{fc} =\left\{ U\in R^{c\times N} \;\middle\vert\; \forall\, 1\le i\le c,\ 1\le k\le N:\ u_{ik} \in [0,1];\ \sum\limits_{i=1}^{c} u_{ik} =1;\ 0<\sum\limits_{k=1}^{N} u_{ik} <N \right\}$$

Minimize the objective function:

$$J_{m} (U,V)=\sum\limits_{i=1}^{c} \sum\limits_{k=1}^{N} (u_{ik} )^{m} d_{ik}^{2} $$

$$d_{ik}^{2} =\Vert x_{k} - v_{i} \Vert_{A}^{2} = (x_{k} - v_{i})^{T} A\, (x_{k} - v_{i})$$

Here U = [u_ik] is the fuzzy partition (membership) matrix, V = [v1, v2, ..., vc] is the matrix of cluster centers, A is a positive definite matrix, and m is the weighting exponent in [1, $\infty $).

A pair (U, V) can minimize the objective function only if the following conditions hold (for m > 1):

$$\forall\, 1\le k\le N:\quad I_{k} =\{\, i \mid 1\le i\le c;\ d_{ik} =0 \,\}$$

$$\forall\, 1\le i\le c,\ 1\le k\le N:\quad u_{ik} =\begin{cases} (d_{ik})^{\frac{2}{1-m}} \left[\sum\limits_{j=1}^{c} (d_{jk})^{\frac{2}{1-m}}\right]^{-1}, & I_{k} =\emptyset \\[6pt] 0 \ \text{for } i\notin I_{k}, \ \text{with } \sum\limits_{i\in I_{k}} u_{ik} =1, & I_{k} \ne \emptyset \end{cases}$$

(1)

$$ {\underset{{1\le i\le c}}{\forall}} v_{i} =\frac{\sum\limits_{k=1}^{N} {{(u_{ik} )}^{m} x_{k}} } {\sum\limits_{k=1}^{N} {{(u_{ik} )}^{m}} } $$

(2)
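As a worked illustration of update (1), with hypothetical values chosen only to keep the arithmetic simple, take c = 2 clusters, m = 2, and a point x_k at distances d_1k = 1 and d_2k = 3 from the two centers. The exponent 2/(1 − m) is then −2, so:

$$u_{1k} =\frac{1^{-2}}{1^{-2} +3^{-2}} =\frac{1}{1+\frac{1}{9}} =0.9,\qquad u_{2k} =\frac{3^{-2}}{1^{-2} +3^{-2}} =\frac{\frac{1}{9}}{\frac{10}{9}} =0.1$$

The two memberships sum to 1, as the partition constraint requires, and the nearer center receives the larger membership.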

The FCM algorithm comprises the following steps, with $\Vert U\Vert_{F}^{2} =\sum_{i} \sum_{k} U_{ik}^{2}$:
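The iteration can be sketched in Python as follows. This is a minimal sketch of the standard FCM loop (alternating updates (2) and (1) until the squared Frobenius norm of the change in U falls below a threshold), not the authors' enhanced Java implementation. The function names, the Euclidean distance (A taken as the identity), the random initialization, the equal split of membership among coincident centers, and the parameters `eps`, `max_iter`, and `seed` are all assumptions.

```python
import random

def sq_dist(x, v):
    """Squared Euclidean distance (the A-norm with A assumed to be the identity)."""
    return sum((a - b) ** 2 for a, b in zip(x, v))

def update_centers(points, U, m):
    """Update (2): each center is the u^m-weighted mean of the points."""
    dim = len(points[0])
    centers = []
    for row in U:
        weights = [u ** m for u in row]
        total = sum(weights)
        centers.append([sum(w * p[d] for w, p in zip(weights, points)) / total
                        for d in range(dim)])
    return centers

def update_memberships(points, centers, m):
    """Update (1), including the case where a point coincides with a center."""
    c, n = len(centers), len(points)
    U = [[0.0] * n for _ in range(c)]
    for k, x in enumerate(points):
        d2 = [sq_dist(x, v) for v in centers]
        zero = [i for i in range(c) if d2[i] == 0.0]
        if zero:
            # I_k is non-empty: any split summing to 1 is valid; we split equally.
            for i in zero:
                U[i][k] = 1.0 / len(zero)
        else:
            # d_ik^(2/(1-m)) written in terms of the squared distance.
            inv = [d ** (1.0 / (1.0 - m)) for d in d2]
            s = sum(inv)
            for i in range(c):
                U[i][k] = inv[i] / s
    return U

def fcm(points, c, m=2.0, eps=1e-6, max_iter=100, seed=0):
    """Run FCM (m > 1, max_iter >= 1) and return (U, centers)."""
    rng = random.Random(seed)
    n = len(points)
    # Random initial partition, normalized so that each column sums to 1.
    U = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for k in range(n):
        col = sum(U[i][k] for i in range(c))
        for i in range(c):
            U[i][k] /= col
    for _ in range(max_iter):
        centers = update_centers(points, U, m)
        new_U = update_memberships(points, centers, m)
        # Squared Frobenius norm of the change in the membership matrix.
        change = sum((new_U[i][k] - U[i][k]) ** 2
                     for i in range(c) for k in range(n))
        U = new_U
        if change < eps:
            break
    return U, centers

points = [[0.0], [0.1], [5.0], [5.1]]
U, centers = fcm(points, c=2)
```

On this toy input the two low points and the two high points end up with their largest membership in different clusters. Because FCM is sensitive to initialization, a different `seed` may swap which cluster index corresponds to which group.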

Given the clustering results of the n vectors of a document in the testing data set, the document has a positive sentiment if the number of its vectors clustered into the positive vector group is greater than the number clustered into the negative vector group. The document has a negative sentiment if the number clustered into the positive vector group is less than the number clustered into the negative vector group. The document has a neutral sentiment if the two numbers are equal.

4.2 Fuzzy C-means (FCM) in the parallel network environment

Figure 4 illustrates how semantic classification is undertaken in a parallel network environment.

We transfer the 60,000 English sentences in the training data set into 60,000 vectors using Hadoop Map (M)/Reduce (R) in the Cloudera parallel network environment to shorten the execution time of this task. Figure 5 overviews the process of transferring each English sentence into one vector in the Cloudera network environment.

Transferring each English sentence into one vector in the Cloudera network environment involves two phases: a Map (M) phase and a Reduce (R) phase. The input of the Map phase is one English sentence, and its output is the set of components of the vector that corresponds to the sentence. In the Map phase of Cloudera, we transfer the sentence into vector components using the VSM [2–4]. The input of the Reduce phase is the output of the Map phase, that is, the components of the vector. In the Reduce phase of Cloudera, these components are assembled into one vector, which is the output of the Reduce phase and corresponds to the sentence.
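The two phases can be simulated in plain Python to show the (key, value) flow. This is a hedged sketch rather than Hadoop code: the fixed vocabulary, the whitespace tokenizer, and the names `map_sentence`, `shuffle`, and `reduce_components` are assumptions, and the `shuffle` function stands in for the grouping that the framework performs between the Map and Reduce phases.

```python
from collections import defaultdict

# Hypothetical fixed vocabulary; in practice it would come from the training set.
VOCABULARY = {"acting": 0, "boring": 1, "great": 2, "movie": 3, "plot": 4}

def map_sentence(sentence_id, sentence):
    """Map phase: emit one (key, value) pair per recognized term occurrence."""
    for word in sentence.lower().split():
        if word in VOCABULARY:
            yield sentence_id, (VOCABULARY[word], 1)

def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_components(sentence_id, components):
    """Reduce phase: assemble the emitted components into one vector."""
    vector = [0] * len(VOCABULARY)
    for index, count in components:
        vector[index] += count
    return sentence_id, vector

sentences = {0: "great movie", 1: "boring plot boring acting"}
pairs = [p for sid, s in sentences.items() for p in map_sentence(sid, s)]
vectors = dict(reduce_components(k, v) for k, v in shuffle(pairs).items())
```

Keying every component by the sentence identifier lets the Map calls run independently on different nodes while still guaranteeing that all components of one sentence reach the same Reduce call.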

Each English document in the testing data set contains n English sentences. We transfer each English sentence in the n English sentences into one vector similar to the process shown in Fig. 5. Hence, the document also has n vectors.

FCM in the Cloudera parallel network environment comprises two phases: the Hadoop Map (M) phase and the Hadoop Reduce (R) phase. In the Map phase, the input is the n vectors of one English document, which are to be clustered into either the positive vector group or the negative vector group, and the output is the clustering results of the n vectors. In the Reduce phase, the input is the output of the Map phase, namely the clustering results of the n vectors, and the output is the sentiment classification of the document as having positive, negative, or neutral polarity. The document is classified as having a positive sentiment if the number of its vectors in the positive vector group is greater than the number in the negative vector group, a negative sentiment if it is less, and a neutral sentiment if the two numbers are equal.
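The Map and Reduce phases of the classification step can be sketched in the same simulated style. The source does not spell out how a single vector is assigned to a group, so this sketch assumes each group is summarized by one center and that a vector joins the group whose center is nearer (with two centers, that is the group with the higher FCM membership). The centers, the tie-breaking choice, and all function names are hypothetical.

```python
from collections import defaultdict

def sq_dist(x, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, v))

def map_vector(doc_id, vector, pos_center, neg_center):
    """Map phase: emit (doc_id, group) for one sentence vector.

    Assumption: exact ties fall to the negative group, an arbitrary choice.
    """
    nearer_positive = sq_dist(vector, pos_center) < sq_dist(vector, neg_center)
    return doc_id, "positive" if nearer_positive else "negative"

def reduce_document(doc_id, groups):
    """Reduce phase: compare the group counts to decide the document polarity."""
    pos = sum(1 for g in groups if g == "positive")
    neg = len(groups) - pos
    if pos > neg:
        return doc_id, "positive"
    if pos < neg:
        return doc_id, "negative"
    return doc_id, "neutral"

# Hypothetical group centers and documents (each document is a list of vectors).
pos_center, neg_center = [1.0, 0.0], [0.0, 1.0]
documents = {
    "d1": [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0]],
    "d2": [[0.0, 1.0], [1.0, 0.0]],
    "d3": [[0.1, 0.8]],
}

grouped = defaultdict(list)  # stands in for the framework's shuffle step
for doc_id, doc_vectors in documents.items():
    for vector in doc_vectors:
        key, group = map_vector(doc_id, vector, pos_center, neg_center)
        grouped[key].append(group)
results = dict(reduce_document(k, g) for k, g in grouped.items())
```

Here `d1` has two positive vectors against one negative, `d2` has one of each, and `d3` has a single negative vector, which exercises all three outcomes of the rule.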

4.2.1 Hadoop Map (M)

Figure 6 illustrates the Hadoop Map phase.

Similar to [7–17], we propose FCM as follows:

4.2.2 Hadoop Reduce (R)

Figure 7 illustrates the Hadoop Reduce phase.

5 Experiment

We used measures such as accuracy (A) to calculate the accuracy of the results of sentiment classification.
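For reference, accuracy here can be read as the fraction of documents whose predicted polarity matches the labeled polarity. The sketch below assumes that reading. Since the testing data sets contain only positive and negative reviews, it also assumes that a neutral prediction counts as incorrect. The labels in the example are invented for illustration and are not results from the paper.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical labels for five documents.
predicted = ["positive", "negative", "neutral", "positive", "negative"]
actual = ["positive", "negative", "negative", "negative", "negative"]
score = accuracy(predicted, actual)
```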

To implement the proposed model, we used the Java programming language to store the English training data set, the English testing data sets (the 25,000 English documents in testing data set t1 and the 25,000 English documents in testing data set t2), and the results of the sentiment classification.

The sequential environment in this research comprises one node (one server). The Java language is used to program FCM. The configuration of the server in the sequential environment is Intel® Server Board S1200V3RPS, Intel® Pentium® Processor G3220 (3M Cache, 3.00 GHz), 2GB PC3-10600 ECC 1333 MHz LP Unbuffered DIMMs. The operating system of the server is Cloudera.

We implement FCM in the Cloudera parallel network environment. This Cloudera system comprises four nodes (four servers). The Java language is used to program the application of the FCM in Cloudera. The configuration of each server in the Cloudera system is Intel® Server Board S1200V3RPS, Intel® Pentium® Processor G3220 (3M Cache, 3.00 GHz), 2GB PC3-10600 ECC 1333 MHz LP Unbuffered DIMMs. The operating system of each of the four nodes is Cloudera. All nodes have the same configuration.

The results of the sentiment classification of the 25,000 English documents in testing data set t1 are presented in Table 1.

Table 1 The results of the 25,000 english documents in testing data set t1


The results of the sentiment classification of the 25,000 English documents in testing data set t2 are presented in Table 2.

Table 2 The results of the 25,000 english documents in testing data set t2


The accuracy of the sentiment classification of the 25,000 English documents in testing dataset t1 is shown in Table 3.

Table 3 The accuracy of our proposed model for the sentiment classification of the 25,000 english documents in testing data set t1


The accuracy of the sentiment classification of the 25,000 English documents in testing dataset t2 is shown in Table 4.

Table 4 The accuracy of our proposed model for the sentiment classification of the 25,000 english documents in testing data set t2


6 Conclusion

Although our proposed model was tested on an English data set, it can also be applied to many other languages. In this paper, our model was tested on the 25,000 English documents in the testing data set t1 and the 25,000 English documents in the testing data set t2 which are small data sets. However, our model can be applied to a big data set containing millions of English documents in a very short time.

In this work, we proposed a new model to classify the sentiments of English documents using the Fuzzy C-Means (FCM) algorithm with Hadoop Map (M)/Reduce (R) in the Cloudera parallel network environment. The experiment results show that our proposed model achieves 60.2 % and 59.8 % accuracy on the two testing data sets. Currently, there is a paucity of research showing that clustering methods can be used to classify data. Our research shows that clustering methods are able to classify data and, in particular, that they are useful for the sentiment classification of text.

As shown in Table 3, the average time taken for the sentiment classification of the 25,000 English documents in testing data set t1 using the FCM algorithm in the sequential environment is 150,590 seconds, which is greater than the average time taken for the sentiment classification of the 25,000 English documents using FCM in the Cloudera parallel network environment, which is 37,659 seconds.

As shown in Table 4, the average time taken for the sentiment classification of the 25,000 English documents in testing data set t2 using the FCM algorithm in the sequential environment is 151,590 seconds, which is greater than the average time taken for the sentiment classification of the 25,000 English documents in testing data set t2 using FCM in the Cloudera parallel network environment, which is 37,875 seconds.
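Taking the reported times at face value, the speedup S of the four-node Cloudera system over the single-node sequential environment is the ratio of the two execution times for each testing data set:

$$S_{t1} =\frac{150{,}590}{37{,}659} \approx 4.00,\qquad S_{t2} =\frac{151{,}590}{37{,}875} \approx 4.00$$

Both ratios are close to 4, which matches the number of nodes and suggests near-linear scaling on these data sets.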

The execution time of the FCM in Cloudera is dependent on the performance of the Cloudera parallel system and is also dependent on the performance of each server on the Cloudera system.

The principles underpinning our proposed model for classifying the sentiment (positive, negative, neutral) of the English documents in the English testing data set, based on the numerous English sentences in the English training data set, are similar in the sequential environment and in the distributed environment.

The FCM of our proposed model in the sequential environment is different from the FCM of our proposed model in the parallel environment. We built many algorithms related to the FCM to implement our model in the distributed system.

The execution time of our model in the parallel environment is less than the execution time of our model in the sequential environment. The execution time of our model in the distributed system is shorter if the performance of the distributed system is higher.

In addition, the execution time of any model is also dependent on the algorithms. For example, using the same algorithms, different systems perform differently and have different execution times. Using the same system with the same performance, different algorithms may have different execution times.

Our model has several advantages and disadvantages. The advantages are that it can process big data involving millions of English documents and that its execution time for sentiment classification on big data is short. The disadvantages are that it takes a long time to implement and that it is costly to build the algorithms of the model in the distributed system.

To understand the scientific value of this research, we compare our model’s results with the results of models used in other studies.

Table 5 compares our model’s results with the studies in [2–4] as follows:

Cluster technique: CT.
Sentiment classification: SC (opinion mining, semantic classification, or emotion classification).
Parallel network system: PNS (distributed system).
Special domain: SD.
Dependence on the training data set: DT.
Language: L.
Vector Space Model: VSM.
No mention: NM.
English language: EL.
Fuzzy C-Means: FCM.

Table 5 Comparison of our model’s results with the work in [2–4]


Table 6 compares our model’s results with the work related to the Fuzzy C-Means (FCM) algorithm in [8–24].

Table 6 Comparison of our model’s results with the work related to the Fuzzy C-Means (FCM) algorithm in [8–24]


Table 7 compares our model’s results with studies related to Fuzzy C-Means in the parallel system (or FCM in the distributed system) in [25–27].

Table 7 Comparison of our model’s results with studies related to Fuzzy C-Means in the parallel system (or FCM in the distributed system) in [25–27]


Table 8 compares our model’s results with studies related to FCM for sentiment classification in [43–50].

Table 8 Comparison of our model’s results with the FCM used for sentiment classification in [43–50]


Table 9 compares our model’s results with the latest research on sentiment classification (or sentiment analysis or opinion mining) in [51–56].

Table 9 Comparison of the proposed model with the latest sentiment classification models (or the latest sentiment classification methods) in [51–56]


References

Large movie review dataset (2016) http://ai.stanford.edu/~amaas/data/sentiment/
Singh V K, Singh V K (2015) Vector space model: an information retrieval system. International Journal of Advanced Engineering Research and Studies
Carrera-Trejo V, Sidorov G, Miranda-Jiménez S, Moreno Ibarra M, Cadena Martínez R (2015) Latent Dirichlet allocation complement in the vector space model for multi-label text classification. International Journal of Combinatorial Optimization Problems and Informatics 6(1):7–19
Google Scholar
Soucy P, Mineau G W (2005) Beyond TFIDF weighting for text categorization in the vector space model. In: Proceedings of the 19th international joint conference on Artificial intelligence, USA, pp 1130–1135
Hadoop (2016). http://hadoop.apache.org
Apache (2016). http://apache.org
Cloudera (2016). http://www.cloudera.com
Ghaffari M, Ghadiri N (2016) Ambiguity-driven fuzzy C-means clustering: how to detect uncertain clustered records. Applied Intelligence (APIN):1–12
RJ Hathaway J C, Bezdek Y H u (2000) Generalized fuzzy c-means clustering strategies using L/sub p/ norm distances. IEEE Trans Fuzzy Syst 8(5):576–582
Article Google Scholar
Tsao E C-K, Bezdek J C, Pal N R (1994) Fuzzy Kohonen clustering networks. Pattern Recogn 27(5):757–764
Hathaway R J, Bezdek J C (2001) Fuzzy c-means clustering of incomplete data. IEEE Trans Syst Man Cybern B (Cybern) 31(5):735–744
Lim Y W, Lee S U (1990) On the color image segmentation algorithm based on the thresholding and the fuzzy c-means techniques. Pattern Recogn 23(9):935–952
Bezdek J C, Ehrlich R, Full W (1984) FCM: the fuzzy c-means clustering algorithm. Comput Geosci 10(2–3):191–203
Pal N R, Bezdek J C (2002) On cluster validity for the fuzzy c-means model. IEEE Trans Fuzzy Syst 3(3):370–379
Pal N R, Pal K, Keller J M, Bezdek J C (2005) A possibilistic fuzzy c-means clustering algorithm. IEEE Trans Fuzzy Syst 13(4):517–530
Ahmed M N, Yamany S M, Mohamed N, Farag A A (2002) A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data. IEEE Trans Med Imaging 21(3):193–199
Cannon R L, Dave J V, Bezdek J C (2009) Efficient implementation of the fuzzy c-means clustering algorithms. IEEE Trans Pattern Anal Mach Intell 8(2):248–255
Bezdek J C, Hathaway R J, Sabin M J, Tucker W T (1987) Convergence theory for fuzzy c-means: Counterexamples and repairs. IEEE Trans Syst Man Cybern 17(5):873–877
Hathaway R J, Bezdek J C (1994) Nerf c-means: non-euclidean relational fuzzy clustering. Pattern Recogn 27(3):429–437
Zhang D-Q, Chen S-C (2004) A novel kernelized fuzzy C-means algorithm with application in medical image segmentation. Artif Intell Med 32(1):37–50
Hathaway R J, Davenport J W, Bezdek J C (1989) Relational duals of the c-means clustering algorithms. Pattern Recogn 22(2):205–212
Chuang K-S, Tzeng H-L, Chen S, Wu J, Chen T-J (2006) Fuzzy c-means clustering with spatial information for image segmentation. Comput Med Imaging Graph 30(1):9–15
Bahrampour S, Moshiri B, Salahshoor K (2011) Weighted and constrained possibilistic C-means clustering for online fault detection and isolation. Appl Intell (APIN) 35(2):269–284
Zhang D-Q, Chen S-C (2003) Clustering incomplete data using kernel-based fuzzy c-means algorithm. Neural Process Lett 18(3):155–162
Hall L O, Bensaid A M, Clarke L P, Velthuizen R P (2002) A comparison of neural network and fuzzy clustering techniques in segmenting magnetic resonance images of the brain. IEEE Trans Neural Netw 3(5):672–682
Kuo R J, Ho L M, Hu C M (2002) Integration of self-organizing feature map and K-means algorithm for market segmentation. Comput Oper Res 29(11):1475–1493
Kwok T, Smith K, Lozano S, Taniar D (2002) Parallel Fuzzy c-Means Clustering for Large Data Sets, Euro-Par 2002 Parallel Processing, Volume 2400 of the series Lecture Notes in Computer Science, pp 365–374
Xylogiannopoulos K F, Karampelas P, Alhajj R (2016) Repeated patterns detection in big data using classification and parallelism on LERP Reduced Suffix Arrays. Appl Intell (APIN):1–31
Carns P H, Ligon III W B, Ross R B, Thakur R (2000) PVFS: A parallel file system for linux clusters. In: Proceedings of the extreme linux track: 4th annual linux showcase and conference
Moyer S A, Sunderam V S (1994) PIOUS: a scalable parallel I/O system for distributed computing environments. In: Proceedings of the scalable high-performance computing conference
Shirazi B A, Kavi K M, Hurson A R (1995) Scheduling and load balancing in parallel and distributed systems. USA
Andrews G R (1999) Foundations of parallel and distributed programming, 1st edn. USA
Gropp W, Lusk E, Doss N, Skjellum A (1996) A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput 22(6):789–828
Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson Ú, Gunda P K, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: Symposium on operating system design and implementation (OSDI)
Shvachko K, Kuang H, Radia S, Chansler R (2010) The hadoop distributed file system
Guerrero J M, Matas J, Garcia de Vicuna L, Castilla M, Miret J (2007) Decentralized control for parallel operation of distributed generation inverters using resistive output impedance. IEEE Trans Ind Electron 54:2
van Steen M, Homburg P, Tanenbaum A S (1999) Globe: a wide-area distributed system. IEEE Concurr 7(1):70–78
Shende S S, Malony A D (2006) The tau parallel performance system. Int J High Perform Comput Appl 20(2):287–311
Bagrodia R, Meyer R, Takai M, Chen Y-A, Zeng X, Martin J, Song H Y (1998) Parsec: a parallel simulation environment for complex systems. Computer 31(10):77–85
Rumelhart D E, Hinton G E, McClelland J L (1986) A general framework for parallel distributed processing. In: Parallel distributed processing: explorations in the microstructure of cognition, USA, vol 1, pp 45–76
Ikudome K, Fox G C, Kolawa A, Flower J W (1990) An automatic and symbolic parallelization system for distributed memory parallel computers. In: Proceedings of the fifth distributed memory computing conference
Wang H O, Tanaka K, Griffin M (1995) Parallel distributed compensation of nonlinear systems by Takagi-Sugeno fuzzy model
Poria S, Gelbukh A, Cambria E, Hussain A, Huang G-B (2014) EmoSenticSpace: a novel framework for affective common-sense reasoning. Knowl-Based Syst 69:108–123
Poria S, Gelbukh A, Das D, Bandyopadhyay S (2013) Fuzzy clustering for semi-supervised learning – case study: construction of an emotion lexicon. In: Advances in artificial intelligence, volume 7629 of the series lecture notes in computer science, pp 73–86
Vinchurkar S V, Nirkhi S M (2012) Feature extraction of product from customer feedback through blog. International Journal of Emerging Technology and Advanced Engineering 2(1):2250–2459
IndiraPriya P, Ghosh D K (2013) A Survey on Different Clustering Algorithms in Data Mining Technique. International Journal of Modern Engineering Research (IJMER) 3(1):267–274
Ghasemi J, Ghaderi R, Karami Mollaei M R, Hojjatoleslami S A (2013) A novel fuzzy Dempster–Shafer inference system for brain MRI segmentation. Inf Sci 223:205–220
Sheeba J I, Vivekanandan K (2014) A fuzzy logic based on sentiment classification. International Journal of Data Mining & Knowledge Management Process (IJDKP) 4(4)
Liu C-L, Chang T-H, Li H-H (2013) Clustering documents with labeled and unlabeled documents using fuzzy semi-Kmeans. Fuzzy Sets Syst 221:48–64
Manek A S, Deepa Shenoy P, Chandra Mohan M, Venugopal K R (2016) Aspect term extraction for sentiment analysis in large movie reviews using gini index feature selection method and SVM classifier. World Wide Web, 1–20. doi:10.1007/s11280-015-0381-x. Print ISSN 1386-145X, US
Agarwal B, Mittal N (2016) Machine learning approach for sentiment analysis. Prominent feature extraction for sentiment analysis, 21–45. doi:10.1007/978-3-319-25343-5_3. Print ISBN 978-3-319-25341-1
Agarwal B, Mittal N (2016) Semantic orientation-based approach for sentiment analysis. Prominent feature extraction for sentiment analysis, 77–88. doi:10.1007/978-3-319-25343-5_6. Print ISBN 978-3-319-25341-1
Canuto S, Gonçalves M A, Benevenuto F (2016) Exploiting new sentiment-based meta-level features for effective sentiment analysis. In: Proceedings of the ninth ACM international conference on web search and data mining (WSDM ’16), New York, USA, pp 53–62
Ahmed S, Danti A (2016) Effective sentimental analysis and opinion mining of web reviews using rule based classifiers. Computational Intelligence in Data Mining 1:171–179. doi:10.1007/978-81-322-2734-2_18. Print ISBN 978-81-322-2732-8, India
Phu V N, Tuoi P T (2014) Sentiment classification using enhanced contextual valence shifters. In: International Conference on Asian Language Processing (IALP), pp 224–229
Tran V T N, Phu V N, Tuoi P T (2014) Learning more chi square feature selection to improve the fastest and most accurate sentiment classification. In: The third asian conference on information systems (ACIS 2014)

Author information

Authors and Affiliations

Division of Computational Mathematics and Engineering, Institute for Computational Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam
Vo Ngoc Phu
Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
Vo Ngoc Phu
Faculty of Information Technology, Ly Tu Trong Technical College, Ho Chi Minh City, Vietnam
Nguyen Duy Dat
School of Industrial Management (SIM), Ho Chi Minh City University of Technology, Vietnam National University, Ho Chi Minh City, Vietnam
Vo Thi Ngoc Tran
Computer Science & Engineering (CSE), Ho Chi Minh City University of Technology, Vietnam National University, Ho Chi Minh City, Vietnam
Vo Thi Ngoc Chau
Faculty of Computer Networks and Communications, University of Information Technology, Vietnam National University of Hochiminh City, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
Tuan A. Nguyen

Corresponding author

Correspondence to Vo Ngoc Phu.

About this article

Cite this article

Phu, V.N., Dat, N.D., Ngoc Tran, V.T. et al. Fuzzy C-means for english sentiment classification in a distributed system. Appl Intell 46, 717–738 (2017). https://doi.org/10.1007/s10489-016-0858-z

Published: 05 November 2016
Issue Date: April 2017
DOI: https://doi.org/10.1007/s10489-016-0858-z
