Introduction

Social media offer a plethora of textual data posted by online users on the most disparate topics, e.g., praise or criticism of politicians and celebrities, feedback on public events, comparisons among product brands, and suggestions for vacations. In this work, we consider online hotel reviews and we focus on those review systems that allow a reviewer to post both a textual description of the hotel and a numeric score, which directly summarizes the reviewer's overall satisfaction with the facility. Indeed, many e-commerce and e-advice websites, including major players like Amazon, Walmart, and Yelp, offer such dual functionality.

The presence of both review text and score conveys to consumers a significant amount of information, which can be exploited in different ways. On the one hand, the score conveniently acts as a direct indicator, guiding the consumer to a faster choice, without getting lost in the details naturally contained in the review text. On the other hand, the richness and variety of the information included in the text are supposed to improve consumer awareness, supporting the consumer's cognitive process and, ultimately, leading to a satisfying purchase decision. Whether relying on the text, the score, or both, the implicit belief is that a correspondence exists between the polarity expressed by the textual data and the numerical value associated with them. In this paper, instead, we consider the presence of a possible misalignment between the review text and the review score. As noticed in [1], such a misalignment could increase the consumer's cognitive processing costs, lead to sub-optimal purchase decisions, and, ultimately, neutralize the utility of the review site. Here, we approach the issue from a different perspective, with the aim of positively exploiting the hidden information that may exist in those reviews featuring a disagreement between the text and the score.

Our intuition is that misalignment can naturally occur, since users' opinions are largely subjective and it can be difficult, and reductive, to summarize a whole experience with a single value. For instance, less demanding people will probably turn a blind eye to the furniture of a hotel room, leading to a higher numerical score than that given by a hard-to-please client, yet both reviews may feature the same textual description of that furniture. The same holds even for the evaluation of an objective characteristic of a service (e.g., the number of delayed flights of an airline): different users, such as businessmen and young travelers, may have different perspectives.

Following this intuition, we evaluate the disagreement between the text of a review and the associated score. For our investigation, we consider a large dataset consisting of around 160k hotel reviews collected from TripAdvisor. To evaluate whether a mismatch exists between the text and the review score in the TripAdvisor dataset, we carry out a polarity detection task, where texts are classified as positive or negative [2,3,4,5]. The polarity mismatch attribute (i.e., the information about the correspondence—or not—between the review text polarity and the review score) is computed by constructing a reliable classification model that leverages state-of-the-art techniques for sentiment analysis [6, 7] and exploits a labeled dataset from the Booking website. Thus, we leverage techniques inherited from the cognitive computing area (such as sentiment analysis), with the specific goals of identifying and analyzing mismatches between text and score on online review platforms. To the best of our knowledge, this is the first time that such techniques have been applied to this specific task.

Main Findings

Our findings are as follows, for the dataset under investigation:

  • Around 12% of the reviews with an actual score of 1 or 2 have been classified as positive by the classifier.

  • Around 5% of the reviews with an actual score of 4 or 5 have been classified as negative by the classifier.

  • Among the mismatched reviews (i.e., reviews for which the detected polarity of the review text is opposite to the review score), the majority of the mismatches occur for reviews with an associated score of 2 or 4, rather than for reviews with the highest or lowest score; in addition, by analyzing the mismatched reviews in detail, we find that their texts present a mixture of positive and negative content.

  • Reviews for which no mismatch is detected contain only negative aspects (for low scores) or only positive aspects (for high scores).

The proposed approach allows us to slim down the set of reviews to take into account when searching for significant aspects of the products being reviewed. Indeed, the mismatch classification provides a selection of reviews in which positive and negative aspects of a product are mixed. Such a base represents a meaningful and compact piece of information, useful to both providers and consumers. By relying only on that focused subset of reviews, the former can benefit by adjusting, e.g., their product lines and advertisement campaigns, while the latter can concentrate on it to address their needs and match their expectations. We think that our novel approach can be applied to other scenarios as well, where a text is associated with a value from a fixed scale, such as surveys, peer reviews of academic papers, and student grade evaluations. This could lead to the design and development of a cognitive computing platform that helps identify and, possibly, mitigate any mismatch that could arise when users generate their contents. The platform could also be used by service administrators as a pre-filter, to highlight the most ambiguous or unsettled contents to be considered for further analysis or alternative evaluations. Furthermore, our experimental results can also provide an additional approach in the human-computer interaction field, shedding light on how humans interact with and perceive online review platforms. The ultimate goal is the identification and understanding of the reasons behind such physiological anomalies—the mismatches—that characterize user-generated contents.

The remainder of this paper is organized as follows. "Related Work" discusses related work in the area of online review platforms, approached with cognitive computation and, specifically, polarity detection techniques. In "Datasets," we detail the datasets used in our study. The construction of the polarity classification model and the evaluation of its performance are described in "Polarity Classification Model Construction." In "Application of the Classification Model and Discussion," the classification model previously learned is applied to the TripAdvisor dataset, in order to evaluate the polarity mismatch of each review. Then, we quantify the detected mismatches over the whole dataset and focus on specific kinds of mismatches, by showing and discussing real examples from the dataset. We also give further hints about some useful applications of the mismatch detection process. Finally, "Conclusions" concludes the paper.

Related Work

E-advice technology offers a form of "electronic word-of-mouth," with new potential for gathering valid suggestions that guide the consumer's choice. For some years, extensive and nationally representative surveys have been carried out "to evaluate the specific aspects of ratings information that affect people's attitudes towards e-commerce." This is the case, e.g., of the work in [8], which highlights how people, while taking into account the average rating of a product, still pay little attention to the number of reviews that average is based on. The high impact of reviews on consumers is also testified by the fact that a positive (or negative) review about a product can be as effective as a recommendation by a friend. Further, positive comments convey a series of strong benefits, such as an improvement in search engines' rankings, a stronger perception of trust, and increased sales [9,10,11].

In this work, we explore online reviews to understand whether the text reflects the associated score, i.e., whether there exists a polarity mismatch between text and score. A polarity mismatch can be detected by first applying polarity detection techniques to the text, whose outcome is the evaluation of the text content as expressing a positive (or negative) sentiment, and then comparing such positivity (or negativity) with the score associated with that text.

Polarity detection techniques fall under the wide umbrella of sentiment analysis [2, 12]. Several approaches have been proposed in the literature for polarity detection. A significant branch relies on lexicon-based features, due to the availability of lexical resources for sentiment analysis, such as the SenticNet and SentiWordNet lexicons and a Twitter opinion lexicon, proposed in [13,14,15,16]. Usually, lexicon-based approaches involve the extraction of term polarities from the sentiment lexicons and the aggregation of the single polarities to predict the overall sentiment of a piece of text.

Concerning subjectivity in texts, i.e., those expressions representing opinions and speculations, the work in [3] is one of the first studies performing subjectivity analysis, identifying subjective sentences and their features. In the specific field of polarity detection applied to product reviews, the work in [17] assigns a numerical score to a textual description by exploiting the SentiWordNet lexicon: this task is especially useful when a review platform only allows users to leave a text as a review, without an associated numerical score. A more recent work, [18], considers analogous topics. The work in [19] proposes an unsupervised approach that involves the extraction of term and slang polarities from three sentiment lexicons and the aggregation of such scores to predict the overall sentiment of a tweet. In [4, 5, 20], the authors consider the contextual polarity of a word, i.e., the polarity acquired by the word in the context of the sentence in which it appears. For a survey of sentiment analysis algorithms and applications, the interested reader can refer to [21]. For the specific scenario of polarity evaluation and sentiment analysis in specific social networks, the interested reader can refer to the series of works in [22, 23], concerning Twitter.

Still regarding opinion mining in reviews, effort has been devoted to investigating aspect extraction, i.e., the association between the expressed opinion and the opinion target [24], the analysis of scarce-resource languages, like Singaporean English [25], and the future emotional behaviors of interactive reviewers [26]. The work in [27] evaluates the differences in preferences between American and Chinese users. The work in [28] combines information extraction with sentiment analysis to identify a topic (e.g., "wifi") from a review segment, to recognize the dimension through which the topic is evaluated in the review (e.g., "fast," "free," "poor"), and to evaluate how that topic was rated within that review segment. A similar approach is proposed in [29], where the authors present an unsupervised system to infer salient aspects in online reviews, together with their sentiment polarity. A more recent method to help with word polarity disambiguation has been proposed in [30]: the authors define the problem with a probabilistic model and adopt Bayesian models to calculate the polarity probability of a given word within a given context. Specific applications of polarity detection can be found in [31] and [32]. The former describes an unsupervised method for polarity detection in Turkish movie reviews, while the latter aims at detecting the polarity of a Spanish corpus of movie reviews by combining supervised and unsupervised learning to develop a polarity classification system. Similarly, in [33], the authors rely on labeled review corpora to test a novel approach for sentiment analysis, based on semantic relationships between words in natural language texts.

This brief overview of the literature shows heterogeneous techniques and applications for polarity detection, both supervised and unsupervised, in different contexts and for different goals. In this work, thanks to the availability of a labeled dataset, we exploit a supervised approach, which automatically learns a model from the annotated data. In order to choose the most effective algorithm, we test different supervised algorithms and we finally select a linear support vector machine (SVM) [34], due to its efficiency in dealing with the task at hand.

Fig. 1 Distribution of scores in the TripAdvisor dataset

Datasets

In this section, we present the datasets used for our analysis. We consider two datasets, both composed of hotel reviews, downloaded from two popular e-advice sites, namely Booking and TripAdvisor. The first dataset is labeled and is used to train a text classifier, which learns the polarity of the reviews constituting it. The second one is not annotated and serves as input to the learned model. Both datasets were collected by developing ad hoc software, which crawls the web pages of the hotels and extracts the review data.

Booking-Labeled Dataset

In order to train a text classifier, we rely on a specific dataset, i.e., the Booking-labeled dataset, which is focused on the hotel review domain. To this end, we downloaded 726,327 reviews and the associated metadata from the Booking website, considering all the hotel reviews available for the city of London until July 2016. Then, we filtered out all the reviews shorter than 20 words or written in a language different from English, by exploiting a language detection Python library. We finally obtained 467,863 reviews. To tag each review with a strong positive (or negative) polarity, we applied the following procedure (a code sketch follows the list):

  • For each review, we considered the text content and its score. Since the Booking scoring system ranges over {0, …, 10}, we discarded those reviews with a “close-to-neutral” score, namely between 4 and 8.

  • The remaining reviews were tagged with a positive polarity if their score was above 8, and with a negative polarity if it was below 4.

  • We then manually inspected each review, to assess whether the text content was in line with the polarity assigned in the previous step.

  • We finally kept 2,000 reviews for each polarity, to speed up the learning process of the classification model.
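The following is a minimal sketch of the automatic part of this procedure (the manual inspection step is not captured). The use of the langdetect library is an assumption on our side, since the text only mentions a language detection Python library.

```python
# Minimal sketch of the filtering and tagging procedure above.
# langdetect is an assumed stand-in for "a language detection Python library".
from langdetect import detect

def label_review(text: str, score: float):
    """Return 'positive', 'negative', or None if the review is discarded."""
    if len(text.split()) < 20:        # discard reviews shorter than 20 words
        return None
    try:
        if detect(text) != "en":      # keep English reviews only
            return None
    except Exception:                 # undetectable text is discarded too
        return None
    if 4 <= score <= 8:               # drop "close-to-neutral" scores
        return None
    return "positive" if score > 8 else "negative"
```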

Fig. 2 Polarity classification model construction

Thus, the Booking-labeled dataset includes 4,000 reviews, half tagged with a strong positive polarity and the remaining half tagged with a strong negative polarity.

TripAdvisor Dataset

The TripAdvisor dataset is composed of reviews taken from the TripAdvisor website. This dataset contains all the reviews that could be accessed on the website between the 26th of June 2013 and the 25th of June 2014—the date of the newest extracted review—for hotels in New York, Rome, Paris, Rio de Janeiro, and Tokyo. With a straightforward approach, we were able to collect the following information for each review:

  • The review date, text, and numeric score

  • The reviewer username, location, and TripType, one of the following five categories: family, friends, couple, solo traveler, and businessman

  • The ID of the hotel to which the review refers

We focus on the text and the score associated with each review. The reviews accessible from TripAdvisor in the year under investigation amount to 353,167. We then apply a filtering process to discard reviews whose textual part is not in English, since the approach presented in "Polarity Classification Model Construction" is specific to the English language. Reviews in English were selected by following the language identification and analysis approach presented in [35].

Figure 1 shows the distribution of the reviews per score value. The distribution is highly unbalanced, with the highest score being the most frequent in the dataset (reflecting the distribution usually featured by review platforms). Since in "Polarity Classification Model Construction" we will focus on strong disagreements, we further discarded from the TripAdvisor dataset those reviews having a score equal to 3. Thus, after removing the non-English reviews and discarding the reviews with a score equal to 3, the final dataset consists of 164,300 reviews, in English, provided by 142,583 registered TripAdvisor users who reviewed 4,019 hotels.

Polarity Classification Model Construction

A polarity mismatch (PM) occurs when there is a disagreement between the text polarity of a review and the score assigned to it. In particular, here, we focus on strong disagreements: on a scale of five stars, if a review text is evaluated as strongly negative, we expect the associated score to be 1 or 2 stars. Instead, if the text features a strongly positive polarity, we expect the score to be 4 or 5 stars.

Given a set of reviews, our aim is to compute the PM for each of them, by performing polarity recognition on the reviews' text. To this end, after testing the performance of different classification algorithms, we adopt a linear SVM [34] and train the classifier on the Booking-labeled dataset described in "Booking-Labeled Dataset". We use this dataset to learn a polarity classification model that automatically detects the polarity expressed by hotel reviews.

The remainder of the section presents the steps performed to learn the polarity classification model, and how the model has been tested and validated. These steps are summarized in Fig. 2.

Text Filtering

Before building a classification model, the text needs to be pre-processed through natural language processing (NLP) techniques, and the most relevant features need to be selected. To this end, we exploit the string to word vector (STWV) and attribute selection (AS) filters provided by Weka. This step is represented in Fig. 2 by the building block STWV+AS Filters, which returns a pre-processed review, given its text.

The STWV filter supports all the common steps required in nearly every text classification task, like breaking text utterances into indexing terms (word stems, collocations) and assigning them a weight in term vectors. The STWV is an unsupervised filter that converts a text into a set of attributes representing the occurrences of the words in the text.

The set of words (attributes) is determined by the first batch filtered (typically, the training data). This filter has a significant number of parameters that can be set. In Table 1, we report the parameters we have considered, together with the chosen values and their descriptions.

Table 1 Parameters used for the STWV Weka filter

Since the STWV filter keeps a large number of tokens as attributes, it generates a huge attribute space. Therefore, we perform a dimensionality reduction, to transform the list of attributes into a more compact one and, also, to decrease the computational time required by the classification algorithms to build the models. To this end, we apply the AS filter of Weka. This is a supervised filter, which selects a subset of the original representation attributes according to an information-theoretic quality metric. In Table 2, we report the parameters used for the AS filter.

Table 2 Parameters used for the AS Weka filter
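For concreteness, the sketch below gives a rough scikit-learn analogue of the STWV+AS pre-processing. It is not the Weka pipeline used in the paper: the vectorizer options, the chi-squared selection criterion, and the 1,000-attribute budget are illustrative assumptions, not the values of Tables 1 and 2.

```python
# Rough scikit-learn analogue of Weka's STWV + AS filters (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["the room was clean and the staff friendly",
         "dirty bathroom, rude staff, never again"]
labels = [1, 0]                       # 1 = positive, 0 = negative polarity

# STWV analogue: turn each text into a term-occurrence vector
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(texts)

# AS analogue: supervised attribute selection with a quality metric
# (chi-squared here; Weka's AS filter uses an information-theoretic score)
selector = SelectKBest(chi2, k=min(1000, X.shape[1]))
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)
```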

Classification Model Construction and Evaluation

The Booking-labeled dataset is used to train several classification models, exploiting different machine learning algorithms. In particular, we applied the implementation of SVMs described in [37], the C4.5 decision tree algorithm [38], the PART algorithm [39], and a Naive Bayes (NB) classifier based on probabilistic classification [40].
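As a hedged illustration of this comparison, the sketch below uses scikit-learn counterparts of the Weka implementations (LinearSVC for the SVM, DecisionTreeClassifier as a C4.5-like learner, MultinomialNB for NB; PART has no direct scikit-learn equivalent and is omitted), on placeholder data standing in for the pre-processed Booking reviews.

```python
# Sketch of the classifier comparison with 5-fold cross-validation.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.random((200, 50))             # term-vector features (placeholder)
y = rng.integers(0, 2, 200)           # polarity labels (placeholder)

classifiers = {
    "linear SVM": LinearSVC(),
    "decision tree (C4.5-like)": DecisionTreeClassifier(),
    "Naive Bayes": MultinomialNB(),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5).mean()   # default scoring: accuracy
    print(f"{name}: mean accuracy = {acc:.3f}")
```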

The experiments were performed with the classifiers' parameters set to their default values in Weka, and an n-fold cross-validation methodology was applied, with n = 5. In order to establish which classifier performs best, we evaluate and compare them by relying on standard metrics generally adopted for classification problems: accuracy, precision, recall, and F-score, whose computation is based on the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts. Accuracy is the overall effectiveness of the classifier, i.e., the number of correctly classified patterns over the total number of patterns. Precision is the number of correctly classified patterns of a class over the number of patterns classified as belonging to that class. Recall is the number of correctly classified patterns of a class over the number of samples of that class. The F-score is the weighted harmonic mean of precision and recall and is used to compare different classifiers.
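In symbols, with per-class TP, FP, and FN, and noting that the F-score shown here is the balanced (equally weighted) form of the harmonic mean:

\[
\mathrm{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad
F\text{-score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}
\]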

The average classification results are summarized in Table 3. Specifically, for each classifier, the table reports the accuracy and the per-class values of precision, recall, and F-score. All the values are averaged over the five folds of the cross-validation. The best results have been obtained by the SVM, with an average accuracy of 97.0%.

Table 3 Comparison of classification results of different learning algorithms—Booking dataset

Since the results obtained by the SVM classifier clearly outperform those obtained by the other classifiers, we choose this algorithm (building block Learning Algorithm (SVM) in Fig. 2) to construct the polarity classification model that will be used to detect the polarity of reviews belonging to the TripAdvisor dataset described in “Datasets.” This process is detailed in the following section.

Application of the Classification Model and Discussion

The polarity classification model learned by using the Booking-labeled dataset is here exploited to compute the PM for each review belonging to the TripAdvisor dataset. The followed approach is summarized in Fig. 3.

Fig. 3 Mismatch detection approach

The texts of the reviews are pre-processed, as already done for the Booking dataset, and then evaluated by the polarity classification model, which outputs the predicted polarity for each review. The predicted polarity is finally compared with the actual polarity associated with the review text, which is derived from the score assigned by the reviewer.

Thus, if the detected polarity is positive and the review score is 4 or 5, then PM is set to 0 (no mismatch detected). Similarly, if the detected polarity is negative and the review score is 1 or 2, PM is again set to 0. Quite intuitively, if the detected polarity is positive but the score is 1 or 2, then PM is set to 1. The same happens if the detected polarity is negative and the score is 4 or 5.
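This rule can be transcribed directly into code; a minimal sketch follows (the function name and the 0/1 encoding are ours, consistent with the description above).

```python
# Direct transcription of the PM rule; scores of 3 were discarded earlier.
def polarity_mismatch(predicted_polarity: str, score: int) -> int:
    """Return 1 if the predicted text polarity disagrees with the score."""
    assert score in (1, 2, 4, 5)
    score_polarity = "positive" if score >= 4 else "negative"
    return 0 if predicted_polarity == score_polarity else 1
```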

For the TripAdvisor dataset, the polarity classification model assigns PM equal to 0 to 94.07% of the patterns, i.e., in 94.07% of the cases, the predicted polarity matches the score associated with the review. It thus detects a PM for 9,744 patterns, the sum of the false positives (1,806) and the false negatives (7,938). Table 4 shows the confusion matrix for the SVM classifier.
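As a consistency check on these figures: 1,806 + 7,938 = 9,744 mismatched patterns, and 9,744 / 164,300 ≈ 5.93%, which is indeed the complement of the 94.07% of matching patterns reported above.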

Table 4 Confusion matrix

Table 5 expands the matrix by grouping reviews according to their original score: the actual positive polarity is split over scores 4 and 5, and the actual negative polarity over scores 1 and 2.

Table 5 Confusion matrix considering the actual score

By looking at Table 5, we can draw some considerations on the obtained results. Focusing on the false positives, the percentage of mismatches with a score equal to 1 is 16%, considerably lower than the percentage of mismatches with a score equal to 2. Similarly, if we consider the false negatives, the percentage of mismatches with a score equal to 5 is 31%, lower than the percentage of mismatches with a score equal to 4. Thus, reviews with an intermediate score tend to be classified as mismatches more often than reviews with extreme scores.

Then, we compute the percentage of mismatched reviews with respect to each score and we report the results in Table 6.

Table 6 Mismatched reviews over the total number of reviews, per score

Table 6 highlights that the majority of mismatches happen when the score associated with the review is 2 or 4. Nevertheless, for the 2-score group, the percentage of mismatches is more than double that of the 4-score group.

Considering the good performance of the SVM classifier on the annotated Booking dataset, reported in Table 3, we can reasonably assert that a relevant part of the mismatched reviews in the TripAdvisor dataset indeed presents a PM between the text and the assigned score. To support this outcome, we report some of the mismatched reviews. Table 7 shows examples of false positive reviews, scored with 1 or 2, but classified as positive by the classifier. For this excerpt, we do not select ad hoc reviews; rather, we randomly pick some examples from the available set. It can be noticed that all the reviews include some positive words describing positive aspects of the hotel (mainly the location, in these samples), which misleads the classifier.

Table 7 False positive reviews (scores 1 and 2)

In Table 8, we report some examples of true negative reviews, i.e., reviews scored with 1 or 2 and correctly detected as negative by the classifier. The excerpt highlights that these reviews essentially describe only negative aspects. Also, considering the numbers in Table 5, we can argue that, among the reviews scored with 1 or 2, there exists a small subset featuring a PM. Such reviews mainly contain a mixture of positive and negative opinions, rather than only negative (or positive) ones.

Table 8 True negative reviews (scores 1 and 2)

We also investigate the dual situation, by considering positive reviews, i.e., reviews scored with 4 or 5. To this end, Table 9 reports an excerpt of false negative reviews, with a score equal to 4 or 5, but detected as negative by the classifier. From the table, we notice that these reviews globally express a positive sentiment, but the customers highlight some issues within them.

Table 9 False negative reviews (scores 4 and 5)

Finally, we report some examples of true positive reviews in Table 10: such reviews are scored with 4 or 5 and have been correctly classified as positive by the classifier. Overall, these reviews express a positive sentiment about the hotel they refer to, describing only positive aspects.

Table 10 True positive reviews (scores 4 and 5)

Concluding, the classifier is able to detect reviews with a PM. The mismatch should not be interpreted as an inconsistency between the text written in the review and the score assigned to it; instead, it indicates that the considered review is a mixture of positive and negative opinions, to a greater extent than the other reviews belonging to the same score class. Thus, this approach proves its benefits when exploited to perform an initial selection of reviews, by tagging with a mismatch those reviews that are worth being further investigated in detail.

Open Problems and Further Investigations

The approach presented so far lets a specific subset of reviews emerge from a wider set. The characteristic of this subset is that the texts of its reviews contain a mixture of positive and negative aspects, leading the classification algorithm to label such reviews with a polarity in contrast with the associated score.

This paves the way for further investigation. Indeed, even if in this work we only considered the relationship between the review text and the score associated with it, a further possibility could be to explore connections between the polarity mismatch and some characteristics of the reviewer (e.g., the gender) and of the reviewed product (e.g., specific attractions in the hotel neighborhood). As an example, the work in [41] studies how review scores may be affected by external, environmental factors, such as the weather conditions and the daylight length. Also, a series of recent studies correlates the huge amount of textual data available online with the demographic and psychological characteristics of the users who author them. This is the case, e.g., of the work proposed in [42], where the authors consider millions of Facebook messages, extract hundreds of millions of words, topics, and sentences from them, and automatically correlate them with the gender, age, and personality of the users who posted them. Still referring to users' demographic characteristics, the work in [43] investigates textual online reviews, to test how the words—and their use—in a review are linked to the reviewer's gender, country, and age. The work in [44] focuses on review manipulation: by comparing hotel reviews and related features across different review sites, it improves the detection of suspicious hotel reviews with respect to checking the reviews on each site in isolation.

Therefore, we envisage the possibility of extracting additional features from the TripAdvisor dataset and applying adequate techniques to discover frequent patterns, correlations, and causal structures among them. To this aim, we may follow different approaches. One is represented by the well-known and widely applied methodology of association rule mining [45], which allows the induction of rules predicting the occurrence of one feature (or more), given the occurrence of other features in the same set (a sketch of this step is given below). This may lead to finding correlations among the review features, such as the score, the occurrence of a possible polarity mismatch, and additional reviewer and review metadata. Furthermore, one could apply statistical techniques for preference measurement [46], largely applied in market analysis. Among them, we recall conjoint analysis [47, 48], which aims at determining the combination of features that is most influential in the choice of a product. The goal, in our scenario, would be to recognize the most significant features leading to, e.g., a polarity mismatch.
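To make the envisaged mining step concrete, the following is a minimal sketch using the mlxtend library; both the library choice and the boolean review features are our illustrative assumptions, not prescribed by [45].

```python
# Hypothetical association rule mining over review features (toy data).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

reviews = pd.DataFrame({              # one row per review; invented features
    "score_low":     [True, False, True, False, True],
    "score_high":    [False, True, False, True, False],
    "mismatch":      [True, False, True, False, True],
    "business_trip": [True, True, False, False, True],
})

frequent = apriori(reviews, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

On this toy data, the mining would surface rules such as {score_low} -> {mismatch}, i.e., exactly the kind of feature-to-mismatch correlation we would look for at scale.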

Conclusions

In this work, we started from the intuition that a misalignment can exist between a review text and the score associated with it. To verify this hypothesis, we first constructed a reliable classifier by using an annotated dataset of hotel reviews taken from Booking. We then used this classifier to classify a large dataset of around 160k hotel reviews taken from TripAdvisor, according to the positive or negative polarity expressed by their textual content.

As the main result of our approach, we found that reviews tagged with a polarity mismatch present a mixture of positive and negative aspects of the product under examination. Thus, the mismatch classification is able to reduce the set of reviews users may focus on when searching for significant aspects of the products being reviewed.

We argue that, by focusing only on the texts associated with a mismatch, instead of manually investigating all the review texts in a dataset, consumers could achieve a better awareness of what has been liked—or not—about a product. Also, providers could understand how to improve their services. The proposed technique is applicable to a wide range of services: accommodation, car rental, and food services, just to cite a few.

As future work, we first aim at running a semantic analysis on the texts of the reviews marked as mismatches, focusing in particular on aspect extraction, to link the opinionated text with the target of the opinion. Secondly, we will apply association rule mining techniques to the features associated with the mismatched reviews, to possibly find correlations among review features, scores, and mismatches.