
1 Introduction

The behavior of online users has transformed with the growing volume of data on the internet. According to the statistics in [11], 70% of online users who intend to purchase electronics stated that they read online reviews before purchasing the product. This situation has accelerated research on sentiment analysis and opinion mining for online resources such as Twitter and Facebook. There are several studies [4, 14, 18] that analyze microblogging platforms such as Twitter and Facebook to derive people's opinions and classify them according to sentiment. Various kinds of information can be extracted from these resources. For example, manufacturers can collect information about their products, and political parties and social organizations may plan their activities according to the results of such research. With the recent advances in machine learning techniques, this issue has been addressed and has come a long way in English. However, for agglutinative languages such as Turkish, it is still one of the hot topics [13, 16, 25]. While there are some advantages to using microblogging platforms, especially Twitter, such as the variety and velocity of the data [18], some tweets contain sarcasm [5], and it can be problematic to label such a huge amount of data by looking only at some of its features [7]. These issues can lead to inefficient results and degrade the performance of machine learning models. To address them and to build models that predict more accurately, we follow a different approach. In the literature, there have been studies that use movie reviews [8,9,10, 19, 24], and the approach we follow is similar. We use reviews on Google Play to avoid the ambiguity of the texts, because in such reviews people directly state their opinions and rate the application before submitting. Since the rating is given together with the text, the aforementioned problems can be resolved easily.

Although our study has similarities with the current literature, our main difference is that we build a deep neural network for sentiment analysis and opinion mining on a Turkish text corpus. We also contribute to the literature by sharing the corpus that we have prepared for this study [1]. In our study, we apply different machine learning algorithms and build a deep neural network to analyze how the reviews on Google Play in Turkey can be utilized for sentiment analysis and opinion mining. We select these reviews for the following reasons:

  • The application reviews section is used directly by users to express their opinions.

  • Application reviews provide a large amount of text and are up to date.

  • Many different people use an enormous number of different applications, which means a wide range of contexts.

We collected a corpus of 11,000 reviews from Google Play Turkey, balancing their ratings evenly between two sets of reviews:

  1. reviews with four and five stars as positive emotions

  2. reviews with one and two stars as negative emotions (Table 1).

Table 1. Examples of reviews on Google Play

The rest of the paper is organized as follows: In Sect. 2 we review the related work. In Sect. 3, we explain how we collect the data for our study. Sect. 4 gives an analysis of the corpus that we use to train the models. In Sect. 5, we present the model building and the results of our study. Sect. 6 concludes the paper.

2 Related Works and Background

2.1 Related Works

There have been many studies on sentiment analysis and opinion mining in the literature. The main motivation behind these studies is the traceability of consumer behavior and the wide range of available data.

In 2012, Kaya et al. [13] integrate sentiment classification techniques into the domain of political news on Turkish news sites. They compare supervised machine learning algorithms, namely Naïve Bayes, Maximum Entropy, SVM, and a character-based N-Gram Language Model, for sentiment analysis of Turkish political news. Kucuk et al. [16] work on named entity recognition (NER) and report experiments on NER for Turkish tweets. In 2015, Yildirim et al. [25] report the effects of preprocessing layers on the sentiment classification of Turkish social media texts. While there are some benefits to using microblogging platforms, especially Twitter and Facebook, such as the variety and velocity of the data [18], some microblogging texts contain sarcasm [5], and working with such a huge amount of data by looking only at some of its features can be ambiguous [7]. These characteristics of the data may cause inefficient results. We use reviews on Google Play to avoid the ambiguity of the texts, because in such reviews people directly state their opinions and rate the application before submitting. As a result, the aforementioned problems can be resolved easily. In the literature, there have been other studies that use movie reviews [8,9,10, 19, 24], and the approach we follow is similar.

The background of our study consists of three main parts: feature engineering, machine learning classifiers, and the sequential deep learning model. In the feature engineering part, we explain why we need different representations of the text data to extract information from it. In the second part, we describe the machine learning classifiers that we have used in this study, and finally, we give details about the deep neural network that we have used in our research.

2.2 Feature Engineering for NLP

In natural language processing (NLP), we cannot directly use raw text to extract information and build machine learning models. In order to utilize the text, we need to convert it into numerical values. A simple and effective model that makes it possible to use text documents in machine learning is called the Bag-of-Words model, or BoW [12]. This simple model focuses on the occurrence of each word in a text. By using this method, we can encode every text as a fixed-length vector whose length is the size of the known vocabulary. The scikit-learn library [20] provides three ways to use this model: CountVectorizer, TfidfVectorizer, and HashingVectorizer. Since we use CountVectorizer and TfidfVectorizer, the following subsections explain them in detail.

CountVectorizer. The CountVectorizer implements both tokenization and occurrence counting. It builds a vocabulary of known words and uses this vocabulary to encode new text data. Although it is a good solution, there are some drawbacks to using CountVectorizer, such as irrelevant words that occur very frequently and dominate the counts.

TfidfVectorizer. An alternative to plain word frequencies is Term Frequency - Inverse Document Frequency (TF-IDF), which scores each word by combining two components (a minimal usage sketch of both vectorizers follows the list below):

  • Term Frequency: how often a given word appears within a text.

  • Inverse Document Frequency: downscales words that appear very frequently across many texts.
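The following minimal sketch shows how both vectorizers can be applied with scikit-learn; the review strings are English placeholders for illustration only, not items from our Turkish corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Placeholder reviews for illustration; our corpus consists of Turkish Google Play reviews.
reviews = [
    "great app works fine",
    "terrible app keeps crashing",
    "great design but keeps crashing",
]

# CountVectorizer: raw term counts, one row per review, one column per vocabulary word.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(reviews)

# TfidfVectorizer: TF-IDF weights, so words common to many reviews are downscaled.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(reviews)

print(sorted(count_vec.vocabulary_))   # the learned vocabulary
print(counts.toarray())                # occurrence counts
print(tfidf.toarray().round(2))        # TF-IDF weights
```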

2.3 Machine Learning Classifiers

In this subsection, we are going to give some details about the machine learning classifiers that we have used in our study.

Multinomial Naive Bayes. The Multinomial Naive Bayes classifier is based on Bayes' theorem [17]. It is a simple baseline approach for building classifiers since it is fast and easy to implement. In [21], the authors note that while Naive Bayes is very efficient, its assumptions affect the quality of the results. In order to eliminate such drawbacks, they introduce the multinomial Naive Bayes (MNB) method. MNB models the distribution of words in a corpus as a multinomial distribution: the text is treated as a sequence of words, and the word at each position is assumed to be generated independently of the others.

K-Nearest Neighbors. KNN (K-Nearest Neighbors) is one of the simplest and most widely used classification algorithms. It is a non-parametric, lazy learning algorithm used for both classification and regression [3]. The main idea behind the nearest neighbor method is to find the points closest to a new point according to a distance metric and to predict its label from the labels of these neighbors. The number of neighbors can be defined by the user.

Decision Tree Learning. Decision tree learning is one of the most commonly used methods in predictive analysis [22]. The purpose is to build a model that predicts the correct label of the target variable from the input variables. A tree is created by splitting on the input variables, and the classification features play an important role in how the tree is split [23]. There are some advantages to using decision trees: they are simple to interpret and they use a white-box model.

In our study, we pick the above classifiers and analyze the effects of different vectorizers on different models. We have also built a sequential deep learning model to predict the target value. The deep learning network that we have used in our study can be seen in Fig. 1, and we explain it in detail in the Model Building and Results section.
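A minimal sketch of this comparison is given below; `texts` and `labels` are placeholder names for the cleaned review strings and their 0/1 sentiment labels prepared from the collected corpus, and the split parameters are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# texts: cleaned review strings, labels: 1 (positive) / 0 (negative); placeholder names,
# assumed to be prepared from the corpus described in Sects. 3 and 4.
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

for vec_name, Vec in [("CountVectorizer", CountVectorizer), ("TfidfVectorizer", TfidfVectorizer)]:
    for clf_name, Clf in [("MNB", MultinomialNB), ("KNN", KNeighborsClassifier), ("DecisionTree", DecisionTreeClassifier)]:
        model = make_pipeline(Vec(), Clf())          # vectorize the text, then classify
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"{vec_name} + {clf_name}: {acc:.4f}")
```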

2.4 Sequential Deep Learning Model

Deep learning is a family of machine learning methods based on artificial neural networks. The use of multiple layers makes the learning process deep, which is where the adjective "deep" comes from. The sequential model is one of the simplest ways to build a model in Keras, a deep learning framework [6] built on top of TensorFlow 2.0 [2]. It allows us to build the model layer by layer.

In our study, we have also built a sequential deep learning model to predict the target value according to textual information.

Fig. 1. Deep neural network with 2 hidden layers

3 Corpus Collection

There are several data collection methods, such as APIs and web scraping. We use web scraping to extract the data from Google Play Turkey. We select the 112 most popular applications and extract the 100 most useful reviews from each; in this way, we collect the most relevant reviews. We collect only the text of each review and its rating.

We eliminate the reviews with 3-star ratings and use the following procedure to label the remaining reviews according to their ratings:

  • Positive label for reviews with 4 or 5 stars

  • Negative label for reviews with 1 or 2 stars

These two types of labeled data are used to train a classifier to predict whether a given review has a positive or negative sentiment. In our research, we specifically use the Turkish language. The categories of the 112 applications from which we extract reviews can be seen in Table 2 below.
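As a minimal sketch of this labeling step, assume the scraped reviews are held in a pandas DataFrame with hypothetical `text` and `rating` columns; the rows below are placeholders, not items from our corpus:

```python
import pandas as pd

# Placeholder rows; the real DataFrame holds the scraped Google Play reviews.
df = pd.DataFrame({
    "text": ["harika bir uygulama", "cok kotu surekli donuyor", "idare eder", "bayildim", "berbat"],
    "rating": [5, 1, 3, 4, 2],
})

# Drop 3-star reviews, then map 4-5 stars to positive (1) and 1-2 stars to negative (0).
df = df[df["rating"] != 3].copy()
df["label"] = (df["rating"] >= 4).astype(int)
print(df)
```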

Table 2. Categories of applications in Google Play

4 Corpus Analysis

In order to build a machine learning model, we examine our data and explore additional features, such as the length of a review and its percentage of capital letters, that we could use for classification. Before modeling, we check the correlation between these extra features and the target value. Figure 2 below depicts the relationship between the percentage of digits and the percentage of capital letters in the reviews.

Fig. 2. Distribution of the percentage of digits and the percentage of capital letters in the reviews

The other features that we extract from our corpus are the digit percentage, exclamation mark usage, review length, and capital letter percentage. We find that none of these features is related to the target value; correlation values such as those in Fig. 2 show no relation between these features and the target value used for sentiment analysis. Our overall design can be seen in Fig. 3 below. As mentioned above, we collected over 10,000 reviews from Google Play Turkey.
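Continuing the labeling sketch in the previous section (so `df` has hypothetical `text` and `label` columns), these surface features and their correlation with the target could be computed roughly as follows; the exact feature definitions are illustrative assumptions:

```python
# Surface features explored before modeling; in our data none of them
# correlates with the sentiment label.
df["length"] = df["text"].str.len()
df["digit_pct"] = df["text"].apply(lambda t: sum(c.isdigit() for c in t) / max(len(t), 1))
df["capital_pct"] = df["text"].apply(lambda t: sum(c.isupper() for c in t) / max(len(t), 1))
df["exclamations"] = df["text"].str.count("!")

print(df[["length", "digit_pct", "capital_pct", "exclamations", "label"]].corr()["label"])
```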

Fig. 3. Overall design

After collecting the data, we explored it and shared some of our findings, as in Fig. 2. We then cleaned our data by applying the following procedures (a minimal sketch of these cleaning steps is given after the list):

  • Replacing similar emoticons with a determined keyword

  • Removing punctuations

  • Replacing some Turkish letters with their corresponding English letters

  • Lowercasing the text

  • Removing digits

  • Removing extra white spaces and punctuation
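The following is a minimal sketch of such a cleaning function; the emoticon mapping and the keywords it substitutes are illustrative placeholders, not the exact mapping used in our pipeline:

```python
import re
import string

EMOTICONS = {":)": " pozitifikon ", ":(": " negatifikon "}      # placeholder emoticon mapping
TURKISH_TO_ENGLISH = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

def clean_review(text: str) -> str:
    for emoticon, keyword in EMOTICONS.items():        # replace similar emoticons with a keyword
        text = text.replace(emoticon, keyword)
    text = text.translate(TURKISH_TO_ENGLISH)          # Turkish letters -> English letters
    text = text.lower()                                # lowercase the text
    text = re.sub(r"\d+", " ", text)                   # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return re.sub(r"\s+", " ", text).strip()           # remove extra white space

print(clean_review("Harika :) 10 numara uygulama!!"))  # -> "harika pozitifikon numara uygulama"
```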

After these procedures, there are additional requirements for extracting features from the text data because of its structure. The text should be parsed to extract tokens, and the words must then be encoded as numerical values, such as integers or floats, to be used in machine learning models. To do this, we use the scikit-learn library, a free machine learning library [20]. Using scikit-learn, we perform both tokenization and feature extraction on our corpus.

We have used two feature extraction methods, CountVectorizer and TfidfVectorizer, and compared their effects on prediction accuracy.

In our study, we used both machine learning classifiers and a deep learning model for algorithm selection and model training. We give the details about these parts in the following sections.

5 Model Building and Results

We build a classifier using multinomial Naive Bayes, which is based on Bayes' theorem [15]. We also build models with decision trees and K-Nearest Neighbors. As mentioned earlier, we use both CountVectorizer and TfidfVectorizer for feature engineering and analyze the results.

5.1 Results for Machine Learning Classifiers

The bar chart in Fig. 4 shows the prediction accuracy of the machine learning classifiers which we have used in our study. As can be seen from the figure, there is no significant difference in prediction accuracy when we use CountVectorizer or TfidfVectorizer.

Fig. 4. Prediction accuracy of the classifiers

The prediction accuracy results can be seen in Table 3. While the choice of vectorizer makes a difference for the multinomial Naive Bayes classifier, it makes no difference for the decision tree and KNN classifiers.

Table 3. Prediction accuracy table

5.2 Results for Deep Learning Model

As mentioned, we have also built a sequential deep learning model. Our model has 2 hidden layers, and we apply dropout to 20% of the nodes in order to avoid overfitting. Figure 1 depicts the deep learning network used in our study.
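A minimal Keras sketch of such a network is given below; the input dimension, hidden layer sizes, exact dropout placement, and training hyperparameters are illustrative assumptions, not the exact values of our model:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

num_features = 10000  # hypothetical input size: number of features produced by the vectorizer

model = Sequential([
    Dense(64, activation="relu", input_shape=(num_features,)),  # first hidden layer
    Dropout(0.2),                                               # drop 20% of the nodes
    Dense(64, activation="relu"),                               # second hidden layer
    Dropout(0.2),
    Dense(1, activation="sigmoid"),                             # positive / negative output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X_train / X_test would be dense arrays from the fitted vectorizer (e.g. .toarray()),
# and y_train / y_test the 0/1 sentiment labels (placeholder names):
# history = model.fit(X_train, y_train, validation_split=0.1, epochs=10, batch_size=32)
```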

Table 4. Prediction accuracy results for the deep learning model
Fig. 5. Training and validation accuracy for TfidfVectorizer

As can be seen in Table 4, our results show a prediction accuracy on the test data of 95.87% for TfidfVectorizer and 95.71% for CountVectorizer. However, as Fig. 5 and Fig. 6 show, even though we use exactly the same model for both, there may be overfitting when we use CountVectorizer. The results in Table 4 also point in this direction: while the prediction accuracy on the training data is 99.73%, the accuracy on the test data is roughly 4 percentage points lower. This suggests that the same model can overfit when used with CountVectorizer. In order to remove this overfitting, we may need to optimize the parameters of the model.

Fig. 6. Training and validation accuracy for CountVectorizer

6 Conclusions

It is clear that sentiment analysis and opinion mining play an important role in tracing consumer behavior. With the recent advances in machine learning techniques, this issue has been addressed and has come a long way in English. However, for agglutinative languages such as Turkish, it is still one of the hot topics. In this study, we compare several classification methods and a deep learning model for sentiment analysis of Turkish reviews collected from Google Play. We have explained the machine learning classifiers that we have used and depicted our overall design. We obtained significant results both from the machine learning classifiers and from the deep learning model, and our prediction accuracy results are satisfactory: 87.30% for multinomial Naive Bayes and 95.87% for the deep learning model. Our experiments showed that we could have overfitting when we use different vectorizer techniques. While there is no difference between the machine learning classifiers when we use different vectorizers, there is a difference when we build a deep learning model to predict the target value. Although we build our model from reviews extracted from Google Play, it can also be applied to data from Twitter, Facebook, or any other microblogging platform.