1 Introduction

In the twenty-first century, social media platforms offer powerful tools with which people can easily share their feelings and opinions about various topics with large audiences. As a result, social media usage has become an important part of daily life (Gaál et al. 2015; Anstead and O’Loughlin 2015; Chen et al. 2011). Sentiment analysis is also known as opinion mining or emotion artificial intelligence in the literature (Yang and Lin 2018; Appel et al. 2018; Öztürk and Ayvaz 2018; Zheng et al. 2018; Houlihan and Creamer 2017; El-Masri et al. 2017; Geetha et al. 2017; Ma et al. 2017). It uses natural language processing, text analysis, computational linguistics and biometrics to systematically identify, extract, quantify and study affective states and subjective information (Antonio et al. 2018; Hameed et al. 2018; Ruan et al. 2018).

Information technology (IT)-based social media data analysis has affected companies’ ability to discover their social media intelligence (Lee 2018). As an example of this type of study, sentiment analysis is applied to customers’ online or written reviews and survey responses. It has a wide range of applications across disciplines, especially in marketing (Abdi et al. 2018; Li 2018; Ruan et al. 2018). Sentiment analysis classifies the polarity of a given text at the document, sentence or feature/aspect level; that is, it determines whether the opinion expressed in a document, a sentence or an entity feature/aspect is positive, negative or neutral (Daniel et al. 2017; Oscar et al. 2017; Nguyen and Jung 2017).

Almost 70% of adult Internet users use social media, and this percentage is increasing (Pew-Research 2018). Twitter is one of the most popular applications for sharing feelings and opinions (Mondal et al. 2017; Chappel et al. 2017; LaPoe et al. 2017; Sul et al. 2016). According to the ‘Digital in 2017 Global Overview,’ 48 million people use social media actively in Turkey (Digital in 2017 Global Overview 2018). Because social media is used so intensively and news about the agenda spreads quickly and directly through it, social media is an important source of data for analyzing people’s feelings about the events that shape the country’s agenda. Figure 1 shows the social media statistics of Turkey from January 2016 to April 2018. According to these statistics, Twitter is the second most used social media platform in Turkey, with a share of 17.67%.

Fig. 1
Social media statistics of Turkey from January 2016 to April 2018 (StatCounter—GlobalStats 2018)

With the increasing popularity of the Internet, tourism facilities have become more digital, with increased interconnections between customers, suppliers and firms (Zhou et al. 2014; Philips et al. 2016; Podnar and Javernik 2012; Filieri and McLeay 2013). Therefore, social media data analysis plays an important role in gaining competitive advantage by helping firms understand why customers choose a product or a service (Brooks et al. 2014). Booking.com is an example of an IT-based tourism facility. Booking.com was established in 1996 in Amsterdam and has grown from a small Dutch start-up to one of the largest travel e-commerce companies in the world. The firm now employs more than 15,000 people in 198 offices in 70 countries worldwide. The Booking.com Web site and mobile apps are available in over 40 languages, offer 1,742,015 properties and cover 130,452 destinations in 227 countries and territories worldwide. Each day, more than 1,550,000 room nights are reserved on its platform (Booking 2018). Moreover, the firm reports long-term relationships with more than 560,000 hotels worldwide, more than 40,000,000 guest reviews, 750,000 rooms booked per day, the position of most visited travel site by traffic, more than 100 million visits a month and access to over 180 countries (DPO 2018).

On March 29, 2017, an Istanbul court ordered the suspension of the activities of the Web site (www.booking.com) in Turkey, citing accusations of unfair competition, following a lawsuit filed by the Association of Turkish Travel Agencies (TURSAB). As a result of the lawsuit, www.booking.com’s hotel search and booking services in Turkey have been limited since 2017. The Web site, which had around 13,000 hotel members from Turkey, halted selling rooms in Turkey to Turkish users on March 30, one day after the court decided to block the Web site in the country. The Web site can still be used from foreign countries to make reservations for Turkish hotels and from Turkey to make reservations abroad (Figs. 2 and 3). According to a sector player, Turkey’s city hotels take around 35% of their reservations via Web sites, with Booking.com taking a large share of this total (Independent 2018; Hotel Management 2018; Hurriyet 2018).

Fig. 2
Booking.com Web site homepage in Turkey

Fig. 3
Booking.com Web site search example for Warsaw

In this study, sentiment analysis is applied to Turkish tweets about www.booking.com posted after the court decided to stop the activities of Booking.com in Turkey. In addition, traffic data of other major Web sites serving this sector was obtained for the period after Booking.com stopped its services, and how these sites were influenced by the event is interpreted. The tweets were collected from Twitter starting from the date of the Booking.com closure in Turkey. The Turkish Twitter messages were collected manually because historical tweet data is expensive to obtain. The data was passed through preprocessing, feature selection and classification stages. Finally, the data was analyzed using various text mining algorithms, and the success rates achieved are compared and interpreted.

2 Web Traffic of Online Reservation in Turkey After Booking.com Limitation

This study has two main aims. The first is to analyze customer sentiment after the limitation of Booking.com in Turkey; the second is to analyze how other companies in the same sector were affected by it. According to Turkish media reports, other companies saw sales increases of up to 30% after the activities of Booking.com were stopped (Posta 2018). To determine how other companies in the sector were affected, traffic data for the Web pages of other companies with a high market share in Turkey was obtained from the www.semrush.com Web site.

The numbers of ‘unique visitors’ of the pages are taken into account and interpreted. The number of unique visitors counts all entries from the same IP address over a given period of time as a single entry. The data obtained before and after the activities of Booking.com were stopped was analyzed. The company ceased its activities in Turkey on March 29, 2017. Table 1 shows the traffic analytics of unique visitors between April 2016 and October 2017, in thousands (k).
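As a rough illustration of the unique-visitor count described above, the sketch below counts repeated entries from the same visitor within a period only once. The function name, the (period, visitor) record layout and the sample log are assumptions made for illustration; they are not taken from the traffic data used in the study.

```python
from collections import defaultdict


def unique_visitors_per_period(entries):
    """Count distinct visitor identifiers per period.

    `entries` is an iterable of (period, visitor_id) pairs.
    Repeated entries from the same visitor within a period count once.
    """
    seen = defaultdict(set)
    for period, visitor_id in entries:
        seen[period].add(visitor_id)
    return {period: len(ids) for period, ids in seen.items()}


# Hypothetical log, not taken from the study's data
sample_log = [
    ("2017-03", "10.0.0.1"),
    ("2017-03", "10.0.0.1"),  # repeat visit from the same address
    ("2017-03", "10.0.0.2"),
    ("2017-04", "10.0.0.1"),
]
counts = unique_visitors_per_period(sample_log)
print(counts)
```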

Table 1 Traffic analytics: unique visitor table—April 2016–October 2017 (k)

Traffic data (Fig. 4) shows that traffic to the Web pages of other companies started to increase visibly from April 2017. Although this may be related to the start of the summer season and many other factors, the graphs show that the number of unique visitors in the summer of 2017 is higher than in the earlier data. This suggests that stopping the activities of Booking.com led users to turn to the sites of other holiday agencies, with a favorable effect on those agencies. Figure 4 shows the other firms’ Web traffic from April 2016 to October 2017. According to the statistics, ETS Tur, Jolly Tur, Neredekal.com, Tatilbudur.com, tatilsepeti.com, Tatil.com, trivago and Anı Tur increased their unique visitor numbers.

Fig. 4
Traffic analytics: unique visitor chart—April 2016–October 2017 (k)

3 Sentiment Analysis of Turkish Tweets via Text Mining

In this study, the data set was generated from Twitter messages about Booking.com posted after it stopped its activities in Turkey. The data was obtained manually by filtering Twitter for Turkish-language messages. Manual collection was chosen because obtaining past tweets in large quantities is very costly. Parts of the collected tweets that are not used in the study, such as user names and likes, were removed so that the data set consists only of message text. Messages in the data set are grouped into three categories: positive, negative and neutral. After the text preprocessing and cleaning step, the data set is composed of 2000 tweets. Of these 2000 tweets, 382 are positive, 1274 are negative and 344 are neutral.
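The class counts reported above can be turned into class shares with a short calculation. The sketch below also computes the accuracy of always predicting the most frequent class. This majority-class baseline is not reported in the study and is added here only as a reference point for reading accuracy figures on an imbalanced data set.

```python
# Class counts as reported in the text
counts = {"positive": 382, "negative": 1274, "neutral": 344}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority baseline ({majority_label}): {majority_baseline:.1%}")
```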

First, typing and punctuation mistakes in the data set were corrected. In the same step, abbreviations used when tweeting and letter repetitions used to emphasize a word were also corrected. RapidMiner software was used for the further preprocessing, machine learning, classification and analysis phases. With RapidMiner, the following operations were applied: converting all letters in the data set to lowercase, removing unnecessary characters such as @ and #, removing punctuation marks, removing words with more or fewer than a certain number of characters, removing stop words that carry no meaning for the study according to the generated stop word dictionary, stemming, and tokenization. The Turkish stemmer of the Snowball library was used to determine the roots of the words. An N-gram model was used for the feature selection process. Then, term frequency weighting was applied to count how many times each term occurs in the data set. Supervised machine learning is used in the study.
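The preprocessing steps above were carried out in RapidMiner. The following standard-library sketch only approximates them for illustration. The stop word list, the length limits, the example tweet and all function names are assumptions, stemming is omitted, and generating every n-gram up to a maximum length is one possible reading of the N-gram step.

```python
import re
from collections import Counter

# Hypothetical stop word list; the study's dictionary is not given in the text
STOP_WORDS = {"ve", "bir", "bu", "da", "de"}


def turkish_lower(text):
    # Python's str.lower() does not apply Turkish casing rules,
    # so map dotless and dotted capital I explicitly first.
    return text.replace("I", "ı").replace("İ", "i").lower()


def preprocess(tweet, min_len=2, max_len=25):
    text = turkish_lower(tweet)
    text = re.sub(r"[@#]", " ", text)     # clear @ and # characters
    text = re.sub(r"[^\w\s]", " ", text)  # clear punctuation marks
    tokens = text.split()                 # tokenize on whitespace
    # Drop tokens outside the length limits and stop words
    return [
        t for t in tokens
        if min_len <= len(t) <= max_len and t not in STOP_WORDS
    ]


def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(tokens, max_n=3):
    # Count every n-gram from length 1 up to max_n
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return Counter(features)


# Made-up example tweet, not from the study's data set
tokens = preprocess("@Booking BU karar ÇOK kötü!!! #booking")
tf = term_frequencies(tokens, max_n=2)
print(tokens)
print(tf.most_common(3))
```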

After these steps were completed, experimental results were obtained with the Naive Bayes (kernel), gradient boosted tree, Naive Bayes, k-nearest neighbors (KNN), sequential minimal optimization (SMO), random forest and decision tree (J48) algorithms. For each algorithm, the training set was selected by shuffled sampling at ratios of 0.7, 0.5 and 0.3, and each configuration was analyzed separately. The resulting success rates are compared. Figure 5 shows the accuracy rates.
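A shuffled split at the three training ratios mentioned above might look like the sketch below. It assumes that the samples not chosen for training form the test set, which the text does not state explicitly. The function name, the fixed seed and the placeholder data are illustrative only.

```python
import random


def shuffled_split(samples, train_ratio, seed=0):
    """Shuffle a copy of `samples` and split it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


data = list(range(2000))  # stand-in for the 2000 labeled tweets
sizes = {}
for ratio in (0.7, 0.5, 0.3):
    train, test = shuffled_split(data, ratio)
    sizes[ratio] = (len(train), len(test))
print(sizes)
```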

Fig. 5
Accuracy rate

4 Conclusion

Sentiment analysis of Turkish tweets is challenging because Turkish expressions are short and open to different semantic interpretations. At the classification stage, the attributes were tested in three different ways for the N-gram model: 2-, 3- and 4-grams. The best result was observed with the 3-gram model. The success rates of the study were affected by the class imbalance of the data set: the predominance of negative sentences raises the success rates for predictions driven by negative cues. In this study, machine learning-based approaches were used for sentiment analysis. The highest success rate, 79.29%, was obtained with the sequential minimal optimization (SMO) algorithm. The highest success rate was also always obtained when a training ratio of 0.7 was selected. The success rates of the Naive Bayes (kernel) algorithm are almost the same as those of the SMO algorithm. Among the algorithms used in the study, KNN gave the lowest success rates.

In addition to the accuracy rate, the kappa statistics obtained are also compared. The kappa statistic ranges from −1 to +1 and measures the relationship between the observed agreement and the agreement expected by chance among the classes. When the kappa statistic is equal to one, there is full agreement; if it is greater than zero, the observed agreement exceeds the agreement expected by chance. If the kappa statistic is less than zero, the classification is considered unreliable (Aha and Kibler 1991; Nizam and Akın 2014). Table 2 shows the kappa statistics of the algorithms used in the study. The kappa statistic of sequential minimal optimization, the most successful algorithm, was measured as 0.583. This value indicates that the classification is reliable.
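For reference, the kappa statistic is commonly computed as Cohen's kappa, (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The sketch below assumes this is the statistic reported in the study and computes it from a confusion matrix. The matrices are hypothetical and are not results from the study.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Chance agreement: product of the marginal proportions, summed over classes
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrices, not results from the study
perfect = [[5, 0], [0, 5]]       # full agreement
chance = [[25, 25], [25, 25]]    # agreement no better than chance
inverted = [[0, 5], [5, 0]]      # systematic disagreement

print(cohen_kappa(perfect), cohen_kappa(chance), cohen_kappa(inverted))
```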

Table 2 Kappa statistical values

The training data used in classification and the attributes extracted from the data set directly affect the results. For this reason, future studies are planned to construct a data set consisting of samples with much better discriminability. It is also planned to increase the size of the data by adding more messages, thereby improving the classifiers’ ability to generalize, and to apply additional semantic and mathematical methods in order to extract attributes with better distinguishing characteristics at the feature extraction stage.