
1 Introduction

An opinion is an unproven judgement or view that someone holds about something. It is not necessarily based on facts or knowledge, but it can be useful in many situations. Following [1], an opinion is a quintuple $(e_i, a_{ij}, s_{ijkl}, h_k, t_l)$, where $e_i$ is the name of an entity, $a_{ij}$ is an aspect of $e_i$, $s_{ijkl}$ is the sentiment on aspect $a_{ij}$ of entity $e_i$, $h_k$ is the opinion holder, and $t_l$ is the time when the opinion is expressed by $h_k$. For example, in a hotel review such as "the staff were very friendly", the entity is the hotel, the aspect is its staff, the sentiment is positive, and the holder and time are the reviewer and the posting date. The sentiment scale varies with the application: it may be binary (positive or negative), ternary (positive, negative, or neutral), or a point scale, e.g., 1–10, as used in hotel or film ratings.

Nowadays there is a huge amount of people's reviews and opinions online and in social media, and analyzing them is important for future decision making. Consequently, both business and academia have recently focused on finding the best way to analyze this huge amount of online opinions using machine learning techniques.

Opinion mining is an active field of text mining and natural language processing that aims to identify and extract subjective information from people's opinions. It is also known as sentiment analysis, and it involves the development of systems that collect, analyze, summarize, or categorize opinions based on different criteria. Referring to the opinion definition in [1], the aim of opinion mining is to evaluate the quintuple of the opinion. Opinion mining can be performed at document level, at sentence level, and at aspect level.

The aim of this paper is to evaluate through experiments the classification algorithms used to perform opinion mining at document level. For this purpose, we created a text corpus of opinions in Albanian collected from well-known Albanian online newspapers. The corpus contains opinions on five different subjects, and for each subject the opinions are categorized as positive or negative. First, we cleaned the dataset by preprocessing the text with stop-word removal and a stemmer. The cleaned dataset was then used to train and test the classification algorithms. We used the Weka software to evaluate the performance of 53 classification algorithms.

The structure of the paper is as follows: Sect. 2 presents background and related work; Sect. 3 presents the methodology for opinion categorization; Sect. 4 gives a short description of the classifiers; Sect. 5 presents the experimental results; and Sect. 6 concludes our work and gives ideas for future work.

2 Background and Related Work

In [1] the author gives an in-depth, comprehensive study of all research fields of opinion mining. Over the last decade, opinion mining has been one of the fastest growing research areas in natural language processing. Companies and researchers are increasingly focused on opinion mining and on using its results in their daily work, for example for market prediction or for analyzing student feedback on a given lesson.

Since opinion mining is a classification task, the main machine learning approaches for text classification, supervised and unsupervised, can be applied to it. Supervised learning uses a labeled dataset to build the classification model, whereas unsupervised learning works on unlabeled data, which makes it less straightforward to apply to opinion mining.

The work in [2] consists of a performance evaluation of three supervised machine learning techniques, Maximum Entropy, Naïve Bayes, and SVM, for opinion mining on an English Twitter dataset, combined with semantic orientation based on WordNet for extracting synonyms of the content. From the results of the performed experiments we can conclude that the Naïve Bayes algorithm used with the unigram technique outperforms the other two algorithms, and that the use of WordNet improves the accuracy.

To address the problem of unlabeled datasets and use unsupervised methods in opinion mining, the work in [3] proposed to use a clustering algorithm. In [4] the author used an unsupervised machine learning technique, spectral clustering, to cluster an English Twitter dataset into positive and negative opinions. The proposed system has four main steps: cleaning and normalizing the dataset; applying a test dataset to cluster the data using the unsupervised machine learning algorithm; applying k-means to normalize the generated matrix; and finally applying a hierarchical clustering algorithm to merge the clusters into one. The experimental results indicate that the unsupervised machine learning technique outperforms the supervised ones.

Opinion mining can be performed at document level, at sentence level, and at aspect level. At document level, the whole document is classified as expressing a positive or a negative opinion. The documents can come from one domain or across domains, and from one or more languages. At sentence level, each sentence is classified as expressing a positive or a negative opinion [5]. The main tasks in aspect-level opinion mining are identifying the aspect, the words related to this aspect, and the orientation of those words [6]. The focus of our work is to evaluate the performance of classification algorithms at document level, with documents classified as positive or negative opinions.

One of the first studies in opinion mining classification was carried out in [7], which experimentally evaluated three machine learning algorithms, Naïve Bayes, SVM, and Maximum Entropy, on a movie review corpus classified into positive and negative opinions, using unigrams and bigrams. The authors concluded that the algorithms do not perform as well as in topic-based categorization and that many challenges remain to be addressed.

To improve the performance of opinion classification, many researchers have experimentally evaluated machine learning algorithms, including new features and testing different feature combinations.

An interesting study was conducted in [8] to evaluate the performance of four machine learning algorithms, Naïve Bayes, Stochastic Gradient Descent, SVM, and Maximum Entropy, on reviews of the IMDb dataset categorized into positive and negative opinions, using TF-IDF and Count Vectorizer techniques. The authors tested all the algorithms with different n-gram models, and compared the results with those of 20 previous works by different authors. The value of n in the n-gram model is inversely related to classification accuracy: the results obtained with unigrams and bigrams are better than with trigrams, four-grams, and five-grams. The technique used to convert the text documents into a matrix also improves the accuracy.

In [9] the authors proposed a method that combines machine learning and semantic orientation, based on the integration of knowledge resources, to perform polarity opinion classification with a majority voting system. They used three corpora: one with opinions in Arabic, a second with the corresponding opinions in English, and a third, bilingual corpus. They used SVM combined with different features, such as preprocessing the data or not, and different n-gram models. Each corpus obtained its best F1 score with a different combination of SVM and features, but the proposed technique performed best on the bilingual corpus.

A substantial amount of research on opinion mining has been conducted in different languages, such as English, German, Spanish, Turkish, Arabic, etc. [10,11,12,13]. The paper [12] reviews opinion mining research on categorizing opinions in Spanish for the period 2012 to 2015. Compared with research in English, research in Spanish is sporadic, but the results of this work are promising and encourage further research in this language. Another work [13] presents a multiple classifier system used for opinion mining in Turkish. This system is based on a voting algorithm combined with three machine learning algorithms: Naïve Bayes, SVM, and Bagging. The proposed solution showed better performance than each of the three classifiers used individually.

Very little research on opinion classification has been done for the Albanian language. The first paper in this field is [14], which evaluates through experiments the application of machine learning algorithms to opinion classification.

All the research work discussed above and reviewed during the preparation of this paper uses and recommends a preprocessing phase on the dataset before applying the machine learning algorithms for opinion mining. There are few attempts to develop preprocessing tools for the Albanian language. The authors in [15] provided a naïve single-step stemming algorithm for Albanian and a stop-words list. They evaluated the algorithm through human experts, who indicated that there are additional rules, not included in the algorithm, that could improve its capabilities. A rule-based stemmer for Albanian implemented in the Java programming language is presented in [16]. This stemmer is an improved version of the one implemented in [17]. We used the stemmer introduced in [16] to preprocess our dataset.

3 Methodology for Opinion Classification

The aim of this work is to evaluate through experiments the performance of 53 classification algorithms for opinion classification at document level on a corpus in the Albanian language. We use the Weka software to train and test the classification algorithms on our corpus. To perform this work, we followed the steps below.

3.1 Data Collection

We created a text corpus of 500 opinions written in Albanian, collected from different well-known Albanian newspapers. The corpus covers five subjects of current public debate in Albania: tourism in Albania; politics; the inclusion of small businesses in the VAT scheme; waste imports in Albania; and the Higher Education law. For each subject we collected fifty text documents categorized as positive opinions and fifty text documents categorized as negative opinions.

3.2 Preprocessing

In line with all the research work reviewed, the corpus is preprocessed before it is used to train and test the classification algorithms; only in [9] is one of the corpora used without preprocessing. We used the rule-based stemmer implemented in [16] to perform the preprocessing step. This stemmer contains 134 rules based on Albanian morphology, but it does not take into consideration the linguistic meaning of a word when finding its stem. The stemmer was experimentally tested by its authors on text document classification, and the results demonstrated its effectiveness.

The preprocessing process includes the following steps:

  1. The first step is stop-word removal, which includes:

    • converting all the text to lower case;

    • removing stop-words such as "dhe", "ne", "sepse", "kur", "edhe", "ndonëse", "mbase", etc. These are Albanian words that carry no positive or negative connotation and do not affect the emotion of the opinion, so they can be safely removed;

    • removing special characters (including punctuation) and numbers.

  2. The second step is applying the stemmer, which finds the stems of all the words in the text document.

At the end of this process, the text file contains only words without any linguistic structure, as sketched below.
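To make the first step concrete, the following is a minimal sketch of the stop-word removal in Java (the language of the stemmer in [16] and of Weka). The file names and the stop-word list file are hypothetical; the rule-based stemmer of [16] would then be applied to the output and is not reproduced here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class Preprocess {
    public static void main(String[] args) throws IOException {
        // Hypothetical file names; adjust to the actual corpus layout.
        Set<String> stopWords = new HashSet<>(
                Files.readAllLines(Paths.get("stopwords-sq.txt")));
        String text = new String(Files.readAllBytes(Paths.get("opinion.txt")));

        // Set all the data to lower case.
        text = text.toLowerCase();

        // Remove special characters (including punctuation) and numbers,
        // keeping letters (including Albanian ë and ç) and whitespace.
        text = text.replaceAll("[^\\p{L}\\s]", " ");

        // Remove stop-words such as "dhe", "sepse", "kur", ...
        String cleaned = Arrays.stream(text.split("\\s+"))
                .filter(w -> !w.isEmpty() && !stopWords.contains(w))
                .collect(Collectors.joining(" "));

        // Second step (not shown): pass `cleaned` to the stemmer of [16].
        Files.write(Paths.get("opinion-clean.txt"), cleaned.getBytes());
    }
}
```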

3.3 Experiment Settings

As mentioned above, we created five corpora containing opinions in the Albanian language. Each corpus contains fifty text documents with positive opinions and fifty text documents with negative opinions. Each corpus is first preprocessed, and the resulting text files are then used as input for the algorithms. To evaluate the performance of the classification algorithms for opinion mining, we used the WEKA software [18]. Weka is implemented in the Java programming language and contains a collection of machine learning algorithms and tools for data mining tasks.

To perform the experiments in Weka, we created an ARFF file for each corpus. We load all the text documents of one corpus into Weka using the TextDirectoryLoader class. This class loads the text documents in a folder, using the names of the sub-folders as class labels, and stores the content of each document in a string attribute. We then apply the StringToWordVector filter, which converts the string attribute into a word vector representing the occurrence of words in the text.
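The sketch below shows this loading and vectorization pipeline through the Weka API; the directory name is hypothetical, with sub-folders (e.g. positive/ and negative/) providing the class labels.

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BuildVectors {
    public static void main(String[] args) throws Exception {
        // Hypothetical layout: corpus-c1/positive/*.txt, corpus-c1/negative/*.txt;
        // the sub-folder names become the class labels.
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("corpus-c1"));
        Instances raw = loader.getDataSet();

        // Turn the single string attribute into a word-occurrence vector.
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, filter);

        System.out.println(vectors.numAttributes() + " attributes, "
                + vectors.numInstances() + " documents");
    }
}
```

The resulting Instances object can then be saved to an ARFF file, for example with weka.core.converters.ArffSaver.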

We chose 53 classification algorithms implemented in Weka and evaluated their performance on each corpus. We used 10-fold cross-validation to train and test all the chosen algorithms.
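A minimal sketch of the evaluation for one classifier is given below; Evaluation.crossValidateModel performs the 10-fold cross-validation, and the same call is repeated over all 53 classifiers.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

public class CrossValidate {
    // `data` is the word-vector dataset produced by StringToWordVector.
    static double pctCorrect(Classifier classifier, Instances data) throws Exception {
        if (data.classIndex() == -1) {
            // Assumption: the class attribute (positive/negative) is last.
            data.setClassIndex(data.numAttributes() - 1);
        }
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1)); // 10-fold CV
        return eval.pctCorrect(); // percent of correctly classified instances
    }
}
```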

4 The Classifiers

Text classification is a widely studied problem in data mining, text mining, and natural language processing that assigns categorical values to texts. Referring to [19], the classification problem is defined as follows: given a set of training records $D = \{X_1, X_2, X_3, \ldots, X_n\}$, each record is labeled with one of $k$ different discrete values indexed by $\{1, \ldots, k\}$. The training data are used to build a classification model that links the features of each record to one of the labels; for each unseen test instance, the model predicts a label.
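In other words, training searches for a mapping from records to labels; written compactly (our notation):

$$f : \mathcal{X} \rightarrow \{1, \ldots, k\}, \qquad \hat{y} = f(X) \ \text{for an unseen test instance } X.$$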

The classification methods fall into five categories: Probabilistic and Naïve Bayes Classifiers, Rule-based Classifiers, Proximity-based Classifiers, Linear Classifiers, and Decision Tree Classifiers.

4.1 Probabilistic and Naïve Bayes Classifiers

Probabilistic classifiers use a mixture model and can predict, for a given input, a probability distribution over a set of classes, rather than only the most likely class the input belongs to.

The Naïve Bayes classifier is a straightforward and powerful algorithm that comes in two model variants. Both calculate the posterior probability of a class based on the distribution of the words in the text document; they differ in whether they use word frequencies (the multinomial model) or only word presence (the Bernoulli model), and consequently in how they model the probability space.
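For instance, under the multinomial model with the usual word-independence assumption, the class posterior for a document $d = (w_1, \ldots, w_m)$ is estimated as

$$P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)} \;\propto\; P(c) \prod_{i=1}^{m} P(w_i \mid c),$$

and the predicted class is the one maximizing this quantity.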

The probabilistic and Naïve Bayes classifiers considered in this paper, as implemented in Weka, are: Bayesian Logistic Regression, Bayes Net, Complement Naïve Bayes, Naïve Bayes, Naïve Bayes Multinomial, Naïve Bayes Multinomial Text, Naïve Bayes Multinomial Updateable, and Naïve Bayes Updateable.

4.2 Rule-Based Classifiers

Rule-based classifiers use a set of rules to classify the data. They follow a separate-and-conquer method: a repetitive process of generating a rule that covers a subset of the training data and then removing all the data covered by that rule from the training set. This process is repeated until no data are left to be covered. These techniques are widely used because they are easy to interpret and maintain. One of the most used rule-based classifiers is RIPPER, implemented in Weka as JRip.

The rule-based classifiers considered in this paper, as implemented in Weka, are: Conjunctive Rule, Decision Table, JRip, NNge, OneR, PART, Ridor, and ZeroR.

4.3 Proximity-Based Classifiers

Proximity-based classifiers use distance-based measures to perform the classification. Documents are assigned to a class based on similarity measures such as the dot product or the cosine metric. A test instance is classified by computing its similarity to the training data and assigning it to the class containing the greatest number of similar instances.
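For example, the cosine similarity between two document vectors $\mathbf{d}_1$ and $\mathbf{d}_2$ is

$$\operatorname{sim}(\mathbf{d}_1, \mathbf{d}_2) = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{\lVert \mathbf{d}_1 \rVert \, \lVert \mathbf{d}_2 \rVert}.$$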

The proximity-based classifiers considered in this paper, as implemented in Weka, are: IB1, IBk, KStar, and LWL.

4.4 Linear Classifiers

Linear classifiers identify the class of an instance by making a classification decision based on the value of a linear combination of its characteristics. These characteristics are presented to the machine in a vector called the feature vector.
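For a binary task such as positive/negative opinion classification, the decision typically takes the form

$$\hat{y} = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b),$$

where $\mathbf{x}$ is the feature vector, $\mathbf{w}$ the learned weight vector, and $b$ a bias term.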

Support Vector Machine classifiers, regression-based classifiers, and neural network classifiers are included in this category.

The linear classifiers considered in this paper, as implemented in Weka, are: Logistic, RBF Classifier, RBF Network, SGD, SGD Text, Simple Logistic, SMO, and Voted Perceptron.

4.5 Decision Tree Classifiers

Decision tree classifiers perform a hierarchical division of the training dataset using a condition, or predicate, on attribute values. The division is designed to create partitions of the dataset that are more skewed in terms of their class distribution. For text data, these predicates are typically conditions on the presence or absence of one or more words in the document.

The decision tree classifiers considered in this paper, as implemented in Weka, are: Decision Stump, HoeffdingTree, J48, LMT, Random Forest, Random Tree, REPTree, and SimpleCart.

We also considered algorithms implemented in two further Weka classifier groups: the Meta class and the Misc class. The classifiers in the Meta class are combinations of different classifiers: AdaBoost M1, Attribute Selected Classifier, Bagging, Classification Via Regression, Filtered Classifier, Iterative Classifier Optimizer, Logit Boost, Multi Class Classifier, Multi Class Classifier Updateable, Random Committee, Randomizable Filtered Classifier, Random Sub Space, Real Ada Boost, and Vote. The Misc class contains the FLR, Hyper Pipes, and Input Mapped Classifier classifiers.
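As an illustration, a representative subset of these classifiers can be instantiated through the Weka API as sketched below. The selection is ours; Hyper Pipes, RBF Classifier, and RBF Network are omitted here because in recent Weka versions they ship in separately installed packages.

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Bagging;
import weka.classifiers.meta.MultiClassClassifier;
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class Roster {
    // One or two representatives per classifier family evaluated in this paper.
    static Classifier[] classifiers() {
        return new Classifier[] {
            new NaiveBayes(), new NaiveBayesMultinomial(), // probabilistic
            new JRip(),                                    // rule-based
            new IBk(),                                     // proximity-based
            new Logistic(), new SMO(),                     // linear
            new J48(), new RandomForest(),                 // decision trees
            new Bagging(), new MultiClassClassifier()      // meta
        };
    }
}
```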

5 Experiment Results

We performed all the experiments based on the specifications in Sect. 3 and evaluated the performance of the classification algorithms on each corpus in terms of the percentage of correctly classified instances.

Table 1 shows the results of the experiments, highlighting in red the best percentage of correctly classified instances and in blue the second-best percentage for each corpus.

Table 1. Experimental results in terms of percent of correctly classified instances for each algorithm.

Analyzing the experimental results, we note that different algorithms perform best on different corpora. The results are as follows:

  1. For C1, two algorithms, Logistic and Multi Class Classifier, achieve the best percentage of correctly classified instances, 94%;

  2. For C2, the best performing algorithm is Hyper Pipes, with 92% correctly classified instances;

  3. For C3, the best performing algorithm is RBF Classifier, with 79% correctly classified instances;

  4. For the last two corpora, RBF Network is the best performing algorithm, with 86% correctly classified instances for C4 and 89% for C5.

There are thus five best performing algorithms, with percentages of correctly classified instances varying from 79% to 94%. An interesting fact is that there is a significant gap between the best and the worst percentage of correctly classified instances for each algorithm across the corpora: Logistic, 28%; Multi Class Classifier, 28%; Hyper Pipes, 25%; RBF Classifier, 12%; RBF Network, 40%. Logistic and Multi Class Classifier have identical performance. Interestingly, RBF Classifier achieves its highest percentage of correctly classified instances on corpus C1 rather than on C3, even though C3 is the corpus where it is the best performing algorithm.

To rank these five algorithms from best to worst, for each corpus we rated each algorithm with a score from 5 down to 1 based on its percentage of correctly classified instances. We then used these scores to calculate a weighted average of the percentage of correctly classified instances for each algorithm.

For example, for corpus C1 we ordered the algorithms by their percentage of correctly classified instances and rated each of them with points. The best performing algorithms, Logistic and Multi Class Classifier with 94% correctly classified instances, were rated with 5 points. The next best algorithm, Hyper Pipes with 92%, was rated with 4 points; then RBF Classifier with 85% was rated with 3 points; and the worst performing algorithm, RBF Network with 49%, was rated with 2 points. We applied the same marking scheme to each corpus. The results of this scheme and the calculation of the weighted average of the percentage of correctly classified instances for each algorithm are shown in Table 2.
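Using the scores as weights, the weighted average for an algorithm $A$ over the five corpora is computed as (our notation):

$$\bar{p}_A = \frac{\sum_{c=1}^{5} s_{A,c}\, p_{A,c}}{\sum_{c=1}^{5} s_{A,c}},$$

where $s_{A,c}$ is the score of algorithm $A$ on corpus $c$ and $p_{A,c}$ its percentage of correctly classified instances on that corpus.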

Table 2. The rank of best performing algorithms in terms of weighted average of percent of correctly classified instances.

The algorithms are ranked from the highest weighted average of the percentage of correctly classified instances to the lowest. Hyper Pipes is the best performing algorithm, with a weighted average of 83.62%, followed by Logistic and Multi Class Classifier at a distance of 1.09%. Then comes RBF Network, with a weighted average of 80.16%, and the worst performing algorithm is RBF Classifier.

We performed another experimental evaluation to identify any statistical difference between the best performing algorithms, using percent_correct as the comparison field. Based on the results of Table 2, we chose Hyper Pipes as the base algorithm and compared it with the other four. The cross-validation experiment was performed for each corpus using the Weka Experimenter tool with the five best performing algorithms, 10-fold cross-validation, and 10 repetitions. Table 3 shows the results of this experiment.

Table 3. Cross-validation experiment results in term of percent_correct for each algorithm.

For corpora C1 and C2, the experimental results indicate no important differences in performance between the algorithms. The result of RBF Classifier for corpus C3 is marked with a "v", which means that RBF Classifier performed statistically significantly better than Hyper Pipes. The same sign appears in the result of RBF Network on corpus C5, meaning that RBF Network also performed statistically significantly better than Hyper Pipes. However, the result of RBF Classifier for corpus C4 is marked with a "*", meaning that it performed statistically significantly worse than Hyper Pipes.

In conclusion, even with this experiment we cannot determine which algorithm performs statistically better overall.

For the five best performing algorithms, we performed further experiments varying the values of their parameters. These changes did not increase the performance; in some cases the performance even decreased.

We also performed experiments using unigram, bigram, and trigram tokenizers instead of the WordTokenizer in the StringToWordVector filter. With unigrams, the performance in terms of the percentage of correctly classified instances is equal to the performance with the WordTokenizer. For some algorithms, the percentage of correctly classified instances increased insignificantly with bigrams, while with trigrams it decreased significantly for all the algorithms.
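The sketch below shows how such a tokenizer can be configured in Weka; setting the minimum and maximum n-gram size to the same value yields a pure unigram, bigram, or trigram model.

```java
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class NGramSetup {
    // Build a StringToWordVector filter that uses an n-gram tokenizer
    // instead of the default WordTokenizer; n = 2 gives pure bigrams.
    static StringToWordVector nGramFilter(int n) {
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(n);
        tokenizer.setNGramMaxSize(n);
        StringToWordVector filter = new StringToWordVector();
        filter.setTokenizer(tokenizer);
        return filter;
    }
}
```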

6 Conclusions and Future Work

In this paper we performed a thorough experimental evaluation of algorithms for opinion mining in the Albanian language. At the beginning of our work we reviewed the state of the art and the recent work done in this field.

We created a text corpus of 500 opinions written in Albanian, collected from different well-known Albanian newspapers. The corpus covers five subjects of current public debate in Albania, and each subject is used as a separate corpus to evaluate the performance of the algorithms. Each corpus has fifty text documents categorized as positive opinions and fifty categorized as negative opinions.

First, we cleaned the dataset in a preprocessing phase consisting of stop-word removal and stemming. The resulting unstructured data were then used as input to train and test 53 classification algorithms in Weka.

The experimental results show five different best performing algorithms in terms of the percentage of correctly classified instances: Logistic and Multi Class Classifier for corpus C1; Hyper Pipes for corpus C2; RBF Classifier for corpus C3; and RBF Network for corpora C4 and C5.

We evaluated the performance using the WordTokenizer as well as unigram, bigram, and trigram tokenizers. All the results reported in this paper use the WordTokenizer. The performance with unigrams is equal to the performance with the WordTokenizer; some algorithms perform insignificantly better with bigrams; and with trigrams the performance decreases significantly.

In the future, it would be of great interest to evaluate the performance of the classification algorithms for opinion mining on a bigger corpus and on a cross-domain corpus in Albanian. It would also be interesting to test the algorithms when the opinions are classified with more than two levels, for example as positive, neutral, and negative, or with points from 1 to 10.