1 Introduction

In the era of ubiquitous and pervasive computing (Jameson and Krüger 2005), the increasing amount of personal data collected by digital devices (e.g., smartphones, Google Glass) can be used to construct accurate and flexible user models for personalized recommender systems. Most existing systems use contextual data (Adomavicius and Tuzhilin 2011; Yu et al. 2006), which generally refer to any information that characterizes the situation of an entity (e.g., a person, a place, or an object) (Abowd et al. 1999). For example, in a typical context-aware recommendation approach named contextual pre-filtering (Adomavicius et al. 2005), when the recommender is estimating the rating of an item for the target user, it considers data from other users that were acquired in the same context, because these data are more relevant for predicting the target user’s contextual preference. Empirical studies indicate that context-aware approaches can produce more accurate recommendations than non-context-aware approaches (Karatzoglou et al. 2010; Adomavicius et al. 2005; Zheng et al. 2013; Hariri et al. 2012; Park et al. 2006). However, most existing context-aware recommendation methods are limited in that users’ preferences are modeled purely at the item level (i.e., the contextual preferences are related to the overall evaluations of items); they do not consider that preferences can be modeled at the more fine-grained aspect level. Aspects are general features that are used to describe an item. For example, a hotel may have aspects such as “location”, “food quality”, and “service” (Liu et al. 2011; Jannach et al. 2012; Ganu et al. 2013). Indeed, users’ aspect-level preferences are likely to be influenced by contextual factors, especially for service items (i.e., items consisting of certain business services in return for money, such as hotels, restaurants, movies, etc.) (Fuchs and Zanker 2012). Consider a hotel review from TripAdvisor (see Fig. 1) as an example. We can clearly see that this reviewer, in the context of a business trip, places more emphasis on the aspect “location”, but if he were taking a family trip, the aspect “room” would become more important. Therefore, understanding users’ contextual preferences as they relate to aspects should be meaningful.

Fig. 1 A hotel review example from TripAdvisor. The user’s opinions about the item’s aspects are highlighted with solid lines, and the contexts are highlighted with dashed lines

As the goal of this study is to develop more effective service recommender systems, we propose a method for deriving users’ aspect-level contextual preferences. Given the increasing number of users who share their experiences (i.e., opinions) with products and services in online reviews (Moghaddam and Ester 2012), we exploit the value of this type of textual information to accomplish our goal. Specifically, we contribute to the development of context-aware recommender systems in the following ways: (1) we develop an automatic technique for extracting aspect-level contextual opinions from user-generated reviews; (2) we use contextual weighting strategies to derive users’ aspect-level contextual preferences; and (3) we implement a stochastic gradient descent learning method to automatically integrate users’ contextual preferences into the recommendation process. In our technique, we discriminate between two types of user preferences: context-dependent and context-independent. The context-dependent preferences are the aspect-level contextual preferences that are common to users in the same context, whereas context-independent preferences reflect users’ stable requirements for an item’s aspects over time and are, as a result, less sensitive to contextual change.

An intuitive method to determine the context-dependent preferences is to count an aspect’s occurrence frequency (i.e., the occurrence frequency of any term related to the aspect) in reviews written in a specific context. In other words, the more frequently an aspect is mentioned, the more important it is to users in that context (i.e., the higher its weight) (Levi et al. 2012). However, this method cannot distinguish between aspects that appear the same number of times. We argue that the relative importance of each aspect-related term should also be considered when determining the aspect’s weight. To this end, we borrow knowledge from text categorization (Yang and Pedersen 1997) and propose three alternative contextual weighting methods for determining a term’s weight. Each variant is based on a different text feature selection strategy: mutual information (MI), information gain (IG), and chi-square statistic (CHI). On the other hand, context-independent preferences can also be extracted from reviews, but to do this accurately it is necessary to account for the different properties of new users and repeated users. For new users (i.e., those with few history records in the system (Jamali and Ester 2009; Massa and Avesani 2007)), we apply the probabilistic regression model (PRM), which can detect the preferences of new users by treating the detection as a Bayesian learning process. For repeated users (i.e., those with abundant history data), we compare the effectiveness of two models, i.e., PRM and the linear regression model (LRM), as the latter can be used to detect users’ preferences in a rich data condition.

Finally, to automatically combine the two types of user preferences into the recommendation process, we propose a linear-regression-based algorithm that uses a stochastic gradient descent learning procedure. We demonstrate the superior accuracy of our approach in comparison with related methods on two real-life datasets (hotel reviews from TripAdvisor, and restaurant reviews from Yelp).

The rest of this article is organized as follows. We first summarize the related work and classify it into two categories: context-aware and review-based recommender systems (Sect. 2). After discussing their respective strengths and limitations, we state our research problem and sketch the flow of our proposed system (Sect. 3). In Sect. 4, we describe the methodology we have developed. We compare the variations of our method with related approaches in Sect. 5, and summarize the experimental results and discuss our work’s practical implications and limitations in Sect. 6. Finally, we conclude with our main findings and describe directions for future research (Sect. 7).

2 Related work

Our work is closely related to two types of recommender systems: context-aware recommender systems and review-based recommender systems. In this section, we describe the state of the art on these two subjects.

2.1 Context-aware recommender systems

One of the most broadly accepted definitions of context is given in (Abowd et al. 1999): “Context is any information that can be used to characterize the situation of an entity. An entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and application themselves.” Adomavicius and Tuzhilin (2011) classified existing context-aware recommendation techniques into three categories according to the phase of the process in which the contextual information is applied: (1) contextual pre-filtering uses context to filter out irrelevant rating data before running a classical recommendation approach such as collaborative filtering (Adomavicius et al. 2005; Panniello et al. 2009); (2) contextual post-filtering uses context to distill the recommendation results after the classical approach has been applied (Panniello et al. 2009); and (3) contextual modeling directly incorporates context into the recommendation model (Zheng et al. 2013; Karatzoglou et al. 2010). Although contextual pre/post-filtering based approaches have been successful in some applications, researchers have pointed out that they are highly dependent on the selection of the recommendation algorithm, and a simple filtering strategy can cause the loss of valuable contextual information and hence damage the system’s prediction accuracy (Adomavicius et al. 2005; Panniello et al. 2009; Karatzoglou et al. 2010). In comparison, contextual modeling based approaches provide a more natural way to capture the interaction between user behavior and related context, so they have received more attention in recent years. For example, Karatzoglou et al. (2010) modeled the user-item-context relationship as a multi-dimensional tensor, which is an extension of the traditional two-dimensional (i.e., user-item) matrix factorization model. The tensor model is approximated by applying the stochastic gradient descent method. However, these approaches mostly use contextual information as hard constraints, which does not work well when the data are severely sparse.

A new trend in context-aware recommender systems is to measure the similarity between ratings acquired in different contexts. In this way, it can be decided whether a rating given in a specific context can be used to calculate recommendations in another context. For instance, Zheng et al. (2013) stated that the different steps of the user-based k-Nearest Neighbor (k-NN) algorithm (such as searching for similar users and calculating a user’s average rating) can be performed with different data selection strategies, among which the data are weighted by their context similarities, calculated by applying the particle swarm optimization algorithm. This algorithm was shown to improve prediction accuracy while maintaining prediction coverage (i.e., predicting as many unknown ratings as possible). Codina et al. (2013) proposed a singular value decomposition based analysis method to measure the semantic similarity between contexts, which in turn indicates the similarity between ratings attained in different contexts. Then, when computing recommendations requested in a certain context, if the target user’s history ratings pertinent to that context are lacking, the ratings acquired in other contexts are taken into account by weighting them with the context similarity. These works assume that context is explicitly specified by users. However, datasets that contain both ratings and user-specified contexts are rare in practice (Li et al. 2010).

Our study can be regarded as an extension of the contextual pre-filtering based approach, as it also first filters out ratings according to the target user’s contexts and then generates recommendations; but the innovation is that our pre-filtering is conducted at the aspect level instead of at the item level. Our approach is superior to the above-mentioned approaches for the following reasons: (1) it capitalizes on textual reviews to acquire users’ contextual information; and (2) it refines users’ preferences by establishing the relationship between aspect-level opinions and contextual factors, and then incorporates the fine-grained contextual preferences into the recommendation process.

2.2 Review-based recommender systems

The common rationale behind review-based recommenders is that advanced opinion mining techniques can transform user-generated textual reviews into opinion ratings. For instance, some studies inferred the so-called virtual ratings from reviews (Zhang et al. 2013). The inferred ratings have been found to be comparable to users’ real ratings for the purposes of performing collaborative filtering techniques (Zhang et al. 2013; Leung et al. 2006; Poirier et al. 2010). In addition, researchers have attempted to combine users’ real ratings and review texts when they are both available. For example, Pero and Horváth (2013) incorporated both users’ real ratings and ratings inferred from reviews into the Matrix Factorization model in which either (1) real ratings are adjusted by inferred ratings before being input to the model; (2) inferred ratings and real ratings serve as separate inputs into the model and the resulting predictions are combined for recommendation; or (3) both inferred ratings and real ratings are used in the training phase when constructing the model.

In another sub-branch of research, aspect-level ratings are derived from reviews and used to represent the reviewer’s perception of an item along multiple dimensions. For instance, Ganu et al. (2013) developed a multi-label text classifier based on the support vector machine to classify review sentences into different aspect categories (e.g., food, service) and sentiment categories (i.e., positive and negative). Sentences classified into a specific \(\langle aspect, sentiment\rangle \) pair are used to calculate the opinion rating of the corresponding aspect. All of the aspects’ opinion ratings are then used to produce recommendations through regression-based and clustering-based algorithms. In a different approach, Wang and Chen (2012) and Chen and Wang (2013) applied the latent class regression model to reviews to detect reviewers’ cluster-level weight preferences for features, and then used these preferences to compute user-user similarity during the recommendation process. Dong et al. (2013a, b) harnessed the extracted product features (i.e., aspect-related terms) and the accompanying opinions to build product profiles. These profiles are used to prioritize retrieved products that are similar to a user’s query product and have also been positively reviewed by users. In contrast to these heuristic-based algorithms, some researchers developed model-based recommendation approaches for capitalizing on aspect-level ratings derived from reviews. Jakob et al. (2009) used multi-relational matrix factorization to model interactions between users, items, and users’ opinions about aspects. The predicted rating of an item for the target user is calculated by multiplying the latent factors of the involved entities (i.e., user and item). Similarly, Wang et al. (2012) implemented a three-dimensional tensor model to accommodate the latent relationship between users, items, and aspect-level opinions. The tensor model is concretely approximated by a decomposition-based method named CP-WOPT (Acar et al. 2011), and the learnt latent factors of user, item, and overall rating are used to predict the rating. This sub-branch of work based on aspect-level ratings is essentially similar to multi-criteria recommender systems (Adomavicius et al. 2011), as the latter type of system also uses users’ evaluations of multiple aspects of an item to enhance recommendation (Liu et al. 2011; Adomavicius and Kwon 2007; Jannach et al. 2012; Zhang et al. 2009). However, unlike multi-criteria recommenders, in which users rate a fixed set of system-predefined aspects, reviews can contain aspects that users freely mention in text. Moreover, the words in a review text may more precisely indicate the reviewer’s personal opinions about aspects, which may help recommenders to more accurately model her/his preferences.

Several studies have additionally used the contextual information extracted from reviews to improve recommendation accuracy. Hariri et al. (2011) applied labeled latent Dirichlet allocation (LDA) to extract contexts from hotel reviews and computed recommendations by taking into account both context-based and rating-based similarities when predicting an item’s utility for the target user. In (Li et al. 2010), two methods, a string-matching-based method and a text classifier, were adopted to extract four types of context from restaurant reviews: time, occasion, location, and companion. That work postulates that a user’s interest in an item is influenced by (1) the user’s long-term preference, which can be learnt from the user’s history ratings, and (2) the current context. Our approach differs from this study in the following ways: (1) we detect users’ contextual preferences at the aspect level, instead of at the item level, and (2) we explicitly model users’ aspect-level contextual preferences through review analysis, rather than only using the available information (i.e., users’ history ratings and the extracted contexts) as input for training a probabilistic latent relational model.

All of the aforementioned studies provide insights into how to use free-text review information to improve recommender systems. However, their main limitation is that they do not explore the underlying relationship between aspect-level opinions and contexts. To the best of our knowledge, few studies have attempted to fill this gap. One such work is that of Carter et al. (2011). After extracting users’ opinions about camera features from reviews, the authors manually correlated the opinions with the product usage information (also expressed in reviews) so as to construct aspect-context relations. This study has three limitations: (1) it requires manual effort to identify aspect-context relations; (2) the contextual influences on users’ aspect-level preferences are not explicitly modeled; and (3) it lacks an experimental evaluation of the proposed recommendation algorithm. Another study (Levi et al. 2012) suggested that the aspect-level opinions expressed in users’ hotel reviews can be correlated with their self-specified contexts (such as trip intent and nationality) to capture underlying aspect-context relations. The derived relations are then used to calculate users’ relative weights for aspects in different contexts. This approach is still limited, as the researchers did not extract from reviews different opinions about the same aspect in different contexts (e.g., a user’s different opinions about the aspect “room” in the contexts business trip and family trip as expressed in a review; see Fig. 1). To overcome these limitations, we previously proposed determining users’ contextual opinions through review analysis, and deriving users’ aspect-level contextual preferences using feature selection metrics (Chen and Chen 2014). However, that approach is only applicable to repeated users. Another limitation is that users’ contextual preferences are fused into the recommendation process via a fixed parameter that cannot adapt to changes in users’ preferences between different contexts.

In comparison with the above-described methods, the innovations of our current approach are as follows: (1) we identify the effect of contextual factors on users’ aspect-level preferences in a more precise way by discriminating the aspect-related term’s relative importance to the context; (2) we propose a recommendation algorithm that is applicable to both new users and repeated users; and (3) we integrate users’ context-dependent preferences into the recommendation process using stochastic gradient descent learning.

3 Research problems and our system’s workflow

We believe that the widely available user-generated reviews on the web can be used to more accurately model users’ preferences, especially preferences influenced by contextual factors; this approach considers the users’ opinions about aspects of an item, instead of solely relying on their overall ratings. We particularly aim to augment service recommender systems by detecting users’ aspect-level contextual preferences and combining them with context-independent preferences. This study solves two research problems: (1) how to discover the relation between aspect-level opinions and contextual information in reviews, and to use this information to derive users’ context-dependent preferences, and (2) how to leverage both context-dependent and context-independent preferences from reviews and use them together to generate recommendations.

As discussed in Sect. 1, using an aspect’s occurrence frequency in reviews as the only feature pertinent to a specific context might not truly reflect its importance to users in that context. Therefore, a more sophisticated contextual weighting method should be investigated. In addition, we also need to detect users’ context-independent preferences, as they reflect users’ relatively stable requirements for item aspects over time. A recommendation method should combine both types of preferences in a precise way. Our system’s workflow can be seen in Fig. 2.

Fig. 2 The workflow of our developed recommender system, based on contextual opinions extracted from user reviews

(1) Contextual opinion extraction. We first implement an automatic method to conduct contextual review analysis that extracts users’ aspect-related contextual opinions from their reviews. Specifically, users’ contextual opinions are their evaluations of an item’s aspects (e.g., a hotel’s “location”, “food quality”, and “service”) contingent upon a certain context. We formally denote the contextual opinion as a tuple consisting of four elements, \(\langle i, {rev}_{u,i}, {a}_{k}, {Con}_{u,i,k} \rangle \), which represents user \(u\)’s opinion \({a}_{k}\) of aspect \(k\) of item \(i\) in contexts \({Con}_{u,i,k}\), as expressed in review \({rev}_{u,i}\) (where \(1 \le k \le K\), \(K\) denotes the number of aspects, and \({Con}_{u,i,k}\) is a boolean vector whose element value is equal to 1 when the associated context occurs, and 0 otherwise). For example, suppose \({Con}_{u,i,k}\) is five-dimensional in the restaurant domain, containing the five context values family, friends, colleague, couple, and solo. If a contextual opinion is tagged with the context values friends and couple, then \({Con}_{u,i,k}\) is represented as \(\langle 0,1,0,1,0 \rangle \) (a minimal encoding of this tuple is sketched after this list). In this step, the main question is how to extract both aspect-level opinions and contexts from reviews and reveal their relationship. We give our solution in Sect. 4.1.

(2) Context-independent preference inference. As mentioned before, context-independent preferences reflect users’ relatively stable requirements for item aspects over time. Therefore, we believe that a user’s history data can be used to determine such preferences. Normally, there are two types of users in a dataset: new users, who have few history records; and repeated users, who possess abundant history data. For a new user, it is almost impossible to build a preference inference model with the limited amount of history records s/he has provided. We hence test the performance of the probabilistic regression model (PRM), which derives new users’ preferences by treating the problem as a Bayesian learning process. For deriving the context-independent preferences of repeated users, in addition to PRM, we investigate the linear regression model’s (LRM) suitability. In our previous work (Chen and Chen 2014), we only used LRM; it is hence meaningful to experimentally compare it with PRM in the current work. The details of these two models are given in Sect. 4.2, and the results of comparing them are in Sect. 5.

(3) Context-dependent preference inference. In contrast to context-independent preferences, context-dependent preferences reflect users’ desires for certain item aspects in a specific context. Some recent studies have pointed out that people in the same context tend to have similar preferences for item aspects (Fuchs and Zanker 2012; Levi et al. 2012); this finding motivates us to consider all of the reviews written within a context when determining users’ context-dependent preferences. In other words, the inference of context-dependent preferences is not user-specific, and consequently there is no need to discriminate between new users and repeated users in this process. We concretely derive users’ context-dependent preferences based on two observations: (a) a more important aspect usually has a higher occurrence frequency in reviews; and (b) aspect-related terms may be of varying importance to users in different contexts. For example, the term “Wifi”, which is related to the aspect “facility”, may be more important to users staying in a hotel for business than to those traveling with friends, as business travelers often expect hotels to have Wifi. Hence, we first implement a frequency-based approach to assign weights to aspects, and then refine the weights using knowledge from text categorization (Yang and Pedersen 1997) that assesses each aspect-related term’s relative importance. Specifically, we propose three alternative contextual weighting methods (Chen and Chen 2014) for capturing users’ aspect-level contextual preferences. The three methods are respectively based on mutual information (MI), information gain (IG), and chi-square statistic (CHI), as these feature selection metrics can all be used to measure the dependency between two random variables (i.e., an aspect-related term and a context in our case); this enables us to discriminate the relative importance of an aspect-related term in different contexts. The differences between the three methods are detailed in Sect. 4.3, and their performance is tested in the experiment (see Sect. 5.4).

(4) Ranking and recommendation. The above three steps result in two types of preferences: context-independent preferences (including variations of LRM-based and PRM-based preferences) and context-dependent preferences (including variations of MI-based, IG-based and CHI-based preferences). To automatically incorporate both types of preferences into the recommendation process, we propose a linear-regression-based method that uses stochastic gradient descent learning (see Sect. 4.4). Moreover, as these two types of preferences can be combined in different ways, i.e., at the holistic level or the aspect level, we conduct an in-depth investigation of different combination strategies through experimental comparison (see Sect. 5.4). The recommendation algorithm returns the top-N items, and our evaluation task is to identify whether the user’s target choice appears in the recommendation list.
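To make the data structure introduced in step (1) concrete, the following is a minimal Python sketch of how a contextual opinion tuple and its boolean context vector could be encoded. All names and the fixed ordering of context values are illustrative assumptions, not part of the system's specification.

```python
from dataclasses import dataclass
from typing import List

# Assumed ordering of context values for the restaurant domain;
# any fixed ordering works as long as it is used consistently.
CONTEXT_VALUES = ["family", "friends", "colleague", "couple", "solo"]

@dataclass
class ContextualOpinion:
    item_id: str       # i
    review_id: str     # rev_{u,i}
    aspect: str        # aspect k, e.g. "room"
    opinion: float     # a_k, the aggregated opinion score for the aspect
    con: List[int]     # Con_{u,i,k}, the boolean context vector

def encode_contexts(active: List[str]) -> List[int]:
    """Build the boolean vector Con_{u,i,k} from the tagged context values."""
    return [1 if c in active else 0 for c in CONTEXT_VALUES]

# A tuple tagged with "friends" and "couple" yields <0,1,0,1,0>:
t = ContextualOpinion("restaurant_42", "rev_7", "atmosphere", 1.0,
                      encode_contexts(["friends", "couple"]))
assert t.con == [0, 1, 0, 1, 0]
```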

In the following, we describe each step in detail.

4 Our methodology

4.1 Extracting contextual opinion tuples from reviews

As mentioned, the first task is to transform user-generated reviews into structured contextual opinion tuples, formally denoted as \(\{\langle i, {rev}_{u,i}, {a}_{k}, {Con}_{u,i,k} \rangle \mid 1\le k \le K \}\). Following related work on aspect-level opinion mining (Jakob et al. 2009; Chen and Wang 2013; Wang et al. 2010) and contextual information extraction (Li et al. 2010), we propose an automatic method to perform contextual review analysis. The proposed method has four main sub-steps.

(1) Aspect identification. Notice that each aspect of an item is concretely represented by a set of related terms in the reviews (e.g., the aspect “service” corresponds to terms such as “service”, “waiter”, “waitress”, “attitude”, etc.). Therefore, we first need to identify aspect-related terms. The approaches used to complete this task in previous studies can be classified into two categories: heuristic-based and model-based. The heuristic-based approaches usually initialize each aspect with a few predefined keywords, and then search for other related terms by applying a clustering method (Wang and Chen 2012), relying on certain syntactic relations (Wang et al. 2012), or measuring the dependency between terms (Wang et al. 2010; Jakob et al. 2009). Among the model-based approaches, the latent Dirichlet allocation (LDA) model has been widely applied (Blei et al. 2003); with this model, we only need to define the number of aspects, and the aspect-related terms are then retrieved automatically. In our study, because there is prior knowledge that can be used to describe the recommended service (i.e., the aspects defined to describe the service), we prefer the heuristic-based method. We concretely apply the bootstrapping method introduced in (Wang et al. 2010), as it has been proven effective for processing service reviews. For example, for hotels, we define eight aspects: “value”, “location”, “service”, “room”, “facility”, “sleep quality”, “food quality”, and “cleanliness”; for restaurants, we define five aspects: “value”, “food quality”, “atmosphere”, “service”, and “location”. Each aspect is then equipped with a few terms as seed words (see examples in Tables 1 and 2), and the other terms are found by measuring the dependency between the seed words and a candidate term based on the chi-square statistic (Yang and Pedersen 1997). Notice that only frequently occurring nouns and noun phrases, which are extracted by using a part-of-speech (POS) tagger, are considered to be prospective terms.

Table 1 Hotel aspects and aspect-related terms
Table 2 Restaurant aspects and aspect-related terms
(2) Opinion orientation. Adjectives and adverbs in reviews can be regarded as users’ opinion carriers. We first use a POS tagger to extract these words from reviews, and then determine their orientations as numeric scores (+1 for positive and \(-\)1 for negative (Ding et al. 2008; Hu and Liu 2004)) with the aid of the opinion lexicon constructed in (Wilson et al. 2005). We then consider two strategies to reveal the connection between the aspect-related terms and opinions: syntactic-based (Wang et al. 2012; Jakob et al. 2009) and distance-based (Levi et al. 2012; Ding et al. 2008). The syntactic-based approach relies on certain syntactic patterns such as adjectival modifiers (e.g., “the comfortable bed”, in which the adjective “comfortable” modifies the term “bed”) and nominal subjects (e.g., “The location of the hotel is perfect”, in which the term “location” is the subject of “perfect”). The distance-based approach applies a more flexible strategy: if the aspect-related term and the opinion co-occur in the same sentence, they are correlated. In our collected reviews, we notice that some users tend to write in a rather free and unrestricted way. In other words, some sentences in reviews do not strictly follow standard syntactic patterns (such as the sentence “Everybody is there to make you happy, from the owner to the chef in the kitchen.”). If we relied purely on standard patterns, opinions buried in these sentences could not be captured. Therefore, we aggregate all of the opinions expressed in a single sentence for each aspect-related term using the distance-based technique: \(score(s, f)={\sum }_{{op \in s }} {sent}_{op} \big / {d( op, f )}\), where f denotes an aspect-related term that appears in sentence s, op denotes an opinion word in sentence s, \({sent}_{op}\) denotes its sentiment score, and \(d\left( op, f \right) \) gives the distance from op to f (e.g., in “such a wonderful place”, the distance from “wonderful” to “place” is 1). In addition, when performing this sub-step, we adopt two opinion rules (Ding et al. 2008): the Negation rule (i.e., the opinion’s sentiment score is reversed if there is a negation word such as “no”, “not”, “never”, etc.) and the But rule (i.e., the opinions expressed before and after the word “but” are opposite to each other). A code sketch illustrating this scoring appears after this list.

(3) Context extraction. To reveal the underlying relation between aspects and contexts, not only the aspect-level opinions but also the contexts should be extracted from reviews. Following (Abowd et al. 1999), we regard context as any information that can be used to characterize the situation of an entity. For example, the contextual variables Companion (whether a user is accompanied by others) and Occasion (e.g., anniversary, birthday, etc.) have often been considered important factors when a user is choosing a hotel to stay in or a restaurant to dine at (Fuchs and Zanker 2012). In addition, for restaurant service, Time (i.e., the time of taking the meal) has been regarded as an important contextual factor influencing users’ choices (Li et al. 2010). Each contextual variable can be concretely assigned a value that we call a “context value”. For example, the optional context values of Companion are family, friends, colleague, couple, solo, etc. Moreover, each context value can be defined by a set of keywords. For instance, the keywords related to the context value colleague are “colleague”, “business”, “coworker”, “boss”, and so on. Thus, if any of the keywords appear in a review sentence, that sentence will be labeled with the corresponding context value. Table 3 lists the contextual variables, context values, and value-related keywords for hotel and restaurant services.

(4) Aspect-context relation identification. From the above three sub-steps, we can obtain both aspect-level opinions and contextual information from reviews. The question then becomes how to determine their relations. We have observed two common patterns in user-generated reviews. (a) Users usually specify their context in the first sentence of the review, e.g., “We went to this restaurant for dinner,” “I chose this hotel for enjoying the holiday with my wife.” This observation is supported by a statistical analysis of our experiment datasets: 72.3 % of hotel reviews and 64.9 % of restaurant reviews contain this pattern. (b) In addition to stating context at the beginning, users may evaluate item aspects in another imagined context later in the text, such as “However, this hotel might not suit those enjoying a family trip due to its limited room space” (see Fig. 1). Statistically, 23.1 % of hotel reviews and 20.7 % of restaurant reviews in our datasets exhibit this pattern. Based on these two observations, we propose the following rules for automatically identifying aspect-context relations: (a) if both an aspect-level opinion and a context occur in the same sentence, they are related; (b) if a sentence only contains an aspect-level opinion without mentioning a context, the opinion is related to the context that appears in the nearest preceding sentence. Then, for a certain context, we sum all of the opinions pertinent to an aspect that is related to this context. That is, aspect \(k\)’s opinion \({a}_{k}\) as contained in the tuple \(\langle i, {rev}_{u,i}, {a}_{k}, {Con}_{u,i,k} \rangle \) is the result of aggregating all of the opinion scores of the aspect-related terms that are associated with the context \({Con}_{u,i,k}\). An aspect may be assigned to different opinion tuples in different contexts. For instance, the aspect “room” in the review presented in Fig. 1 is contained in two tuples, \(\langle i, {rev}_{u,i}, {a}_\mathrm{room}=1, {Con}_{u,i,\mathrm{room}}=\text {``business''} \rangle \) and \(\langle i, {rev}_{u,i}, {a}_{\mathrm{room}} = -1, {Con}_{u,i,\text {room}}=\text {``family''} \rangle \), which express opposite opinions about this aspect in two different contexts.

Table 3 Contextual variables, context values, and value-related keywords for hotel and restaurant services
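As promised in sub-step (2), here is a minimal Python sketch of the distance-based opinion scoring with the Negation and But rules, together with the context-propagation rule of sub-step (4) for sentences that mention no context. The tiny lexicons, the token-distance measure, and the one-sided treatment of the But rule are simplifying assumptions for illustration; the actual system uses a POS tagger, the bootstrapped aspect terms, and the full opinion lexicon of Wilson et al. (2005).

```python
from typing import List, Optional

# Toy lexicons for illustration only.
SENTIMENT = {"wonderful": 1, "perfect": 1, "comfortable": 1, "limited": -1}
NEGATIONS = {"no", "not", "never"}
CONTEXT_KEYWORDS = {"business": "colleague", "family": "family"}

def score_term(tokens: List[str], f_idx: int) -> float:
    """score(s, f) = sum over opinion words op of sent_op / d(op, f),
    with the Negation and But rules of Ding et al. (2008)."""
    score = 0.0
    but_idx = tokens.index("but") if "but" in tokens else None
    for i, tok in enumerate(tokens):
        if tok not in SENTIMENT or i == f_idx:
            continue
        sent = SENTIMENT[tok]
        # Negation rule: reverse the polarity after a preceding negation word.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            sent = -sent
        # But rule (simplified here): flip the polarity when the opinion word
        # and the aspect-related term lie on opposite sides of "but".
        if but_idx is not None and (i < but_idx) != (f_idx < but_idx):
            sent = -sent
        score += sent / abs(i - f_idx)  # d(op, f): token distance
    return score

def label_contexts(sentences: List[List[str]]) -> List[Optional[str]]:
    """Relation rules: a sentence keeps the context value it mentions;
    otherwise it inherits the one from the nearest preceding sentence."""
    labels, current = [], None
    for toks in sentences:
        mentioned = [CONTEXT_KEYWORDS[t] for t in toks if t in CONTEXT_KEYWORDS]
        if mentioned:
            current = mentioned[0]
        labels.append(current)
    return labels

# "such a wonderful place": the distance from "wonderful" to "place" is 1.
s = "such a wonderful place".split()
assert score_term(s, s.index("place")) == 1.0
```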

4.2 Inferring context-independent preferences

As context-independent preferences reflect a user’s relatively stable requirements for item aspects, the user’s history data can be used to infer these preferences. To accomplish this task, we consider two alternative inference models: the linear regression model (LRM) and the probabilistic regression model (PRM).

4.2.1 Linear regression model based inference

This approach assumes that a user’s overall evaluation of an item (like the overall rating) is the sum of her/his opinions about different aspects of the item, so it can be generated by aggregating the aspect-level opinions. For our purpose, the coefficient assigned to each aspect variable in the aggregation function can be interpreted as the weight that the user gives to that aspect; it essentially defines the relative contribution of the aspect to the overall rating.

More specifically, we apply the linear least-squares regression function (Franklin 2005) to define this aggregation relationship. By performing the contextual review analysis described in Sect. 4.1, each review written by a user can be represented as a rating vector \(\mathbf {{A}_{u,{rev}_{u,i}}} = \langle {a}_{1},\ldots , {a}_{K}\rangle \), in which \({a}_{k}\) (\(1 \le k \le K\)) represents the user’s opinion rating for aspect \(k\). All of the rating vectors (corresponding to the set of reviews written by the user) can then be used to construct the linear least-squares regression function, which is formally denoted as:

$$\begin{aligned} {R}_{{rev}_{u,i}}={{\mathbf {W}_{u}}}^{T}{\mathbf {A}_{u,{rev}_{u,i}}} + \varepsilon \end{aligned}$$
(1)

where \({R}_{{rev}_{u,i}}\) denotes the overall rating that accompanies the review, \(\varepsilon \) denotes the error term, and \(\mathbf {{W}_{u}}=\langle {w}_{u,1}, \ldots , {w}_{u,K} \rangle \) denotes the weight vector that the user gives to different aspects.

As the obtained weights might not all be statistically significant (i.e., \({a}_{k}\) may have little influence on \({R}_{{rev}_{u,i}}\), in which case there is no significant linear relationship between \({a}_{k}\) and \({R}_{{rev}_{u,i}}\)), we apply the statistical t test to select the weights that pass the significance level (\(0.1\)) and regard these weights as the user’s context-independent preferences. To be specific, the null hypothesis of the t test is that there is no significant linear relationship between \({a}_{k}\) and \({R}_{{rev}_{u,i}}\), and thus \({w}_{u,k}\) is equal to 0. We then calculate the test statistic for each weight \({w}_{u,k}\) via \(\frac{{w}_{u,k}}{\sqrt{\frac{1}{K} \sum _{k=1}^{K}{\left( {w}_{u,k} - \varsigma \right) }^{2}}}\), where \(\varsigma = \frac{1}{K} \cdot \sum _{k=1}^{K}{w}_{u,k}\) denotes the mean of all of the acquired weights. If the corresponding p value is lower than the significance level that we set, we can reject the null hypothesis and conclude that aspect \(k\) is important to the user and its associated weight reflects the user’s preference for it.
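As a minimal sketch of this inference step, the following Python code fits Eq. 1 per user with ordinary least squares and filters the weights with the t statistic defined above. numpy and scipy are assumed to be available; since the text does not state the degrees of freedom used for the p value, df = K - 1 below is our assumption, and the small numerical guard is our addition.

```python
import numpy as np
from scipy import stats

def context_independent_weights(A: np.ndarray, r: np.ndarray,
                                alpha: float = 0.1) -> np.ndarray:
    """A: (n_reviews x K) aspect opinion ratings of one user;
    r: (n_reviews,) overall ratings. Returns the weight vector W_u,
    with weights that fail the significance test set to 0."""
    w, *_ = np.linalg.lstsq(A, r, rcond=None)       # Eq. 1 via least squares
    K = len(w)
    spread = np.sqrt(np.mean((w - w.mean()) ** 2))  # denominator of the t statistic
    t = w / max(spread, 1e-12)                      # numerical guard (ours)
    p = 2 * stats.t.sf(np.abs(t), df=K - 1)         # two-sided p value; df assumed
    w = w.copy()
    w[p >= alpha] = 0.0                             # keep only significant weights
    return w
```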

4.2.2 Probabilistic regression model based inference

Like the linear regression model, the probabilistic regression model (PRM) postulates that the relation between the overall rating and all aspects’ opinions is essentially a regression problem. The difference is that PRM models the underlying relation via a Bayesian treatment, so that prior knowledge can be incorporated into the model. Specifically, this approach assumes that the noise term \(\varepsilon \) in Eq. 1 is drawn from a Gaussian distribution with a mean of 0 and a variance of \({\sigma }^{2}\): \(\varepsilon \sim \mathcal {N} \left( 0, {\sigma }^{2} \right) \). Inspired by (Yu et al. 2011; Chen and Wang 2013), we treat the overall rating \({R}_{{rev}_{u,i}}\) as a sample drawn from a Gaussian distribution with a mean of \({{\mathbf {W}_{u}}}^{T}{\mathbf {A}_{u,{rev}_{u,i}}}\) and a variance of \({\sigma }^{2}\). In other words, the conditional probability that a user \(u\) gives the overall rating \({R}_{{rev}_{u,i}}\) to an item \(i\) can be defined as follows:

$$\begin{aligned} \begin{aligned} p\left( {R}_{{rev}_{u,i}} \mid {{\mathbf {W}_{u}}},{\mathbf {A}_{u,{rev}_{u,i}}} \right)&= \mathcal {N}\left( {R}_{{rev}_{u,i}} \mid {{\mathbf {W}_{u}}}^{T}{\mathbf {A}_{u,{rev}_{u,i}}}, {\sigma }^{2} \right) \\&= \frac{1}{\sqrt{2 \pi } \sigma }exp\left( - \frac{ {\left( {R}_{{rev}_{u,i}}- {{\mathbf {W}_{u}}}^{T}{\mathbf {A}_{u,{rev}_{u,i}}} \right) }^{2}}{2{\sigma }^{2}} \right) \end{aligned} \end{aligned}$$
(2)

According to Bayes’ theorem (Franklin 2005), the posterior probability of \({\mathbf {W}_{u}}\) can be defined as the product of Eq. 2 and the incorporated prior probability:

$$\begin{aligned} p\left( \mathbf {W}_{u}\mid \mathcal {S} \right) \propto \prod _{\langle u,i\rangle \in \mathcal {S}} p\left( {R}_{{rev}_{u,i}} \mid {{\mathbf {W}_{u}}},{\mathbf {A}_{u,{rev}_{u,i}}} \right) \times p\left( {\mathbf {W}_{u}} \mid \mu ,{\varSigma }\right) \times p\left( \mu , {\varSigma }\right) \qquad \end{aligned}$$
(3)

where \(\mathcal S\) denotes the set of user-item pairs, in which \(\langle u, i\rangle \in \mathcal S\) indicates that user \(u\) posted a review for item \(i\), and \( p\left( {\mathbf {W}_{u}} \mid \mu ,{\varSigma }\right) \) is the prior probability of \({\mathbf {W}_{u}}\), which is drawn from a multivariate Gaussian distribution with \(\mu \) as the mean and \({\varSigma }\) as the covariance matrix:

$$\begin{aligned} p\left( {\mathbf {W}_{u}} \mid \mu ,{\varSigma }\right) = \mathcal {N}\left( {\mathbf {W}_{u}} \mid \mu , {\varSigma }\right) \end{aligned}$$
(4)

Given that important aspects are usually commented on more frequently by users, we concretely incorporate an aspect’s occurrence frequency as the prior knowledge (denoted as \({\mu }_{0}\)) into \({\mathcal {N}}\left( \mu , {\varSigma }\right) \). The prior probability of the distribution \(p\left( \mu , {\varSigma }\right) \) is hence defined as:

$$\begin{aligned} p\left( \mu , {\varSigma }\right) = exp\left( - \psi \cdot KL \left( \mathcal {N}\left( \mu , {\varSigma }\right) \mid \mathcal {N}\left( {\mu }_{0}, \mathbf {I} \right) \right) \right) \end{aligned}$$
(5)

where \(KL\left( \cdot \mid \cdot \right) \) is the KL-divergence for computing the difference between distributions \(\mathcal {N}\left( \mu , {\varSigma }\right) \) and \(\mathcal {N}\left( {\mu }_{0}, \mathbf {I} \right) \), \(\psi \) is the trade-off parameter, and \(\mathbf {I}\) is an identity covariance matrix.

The parameters in the constructed model include \({\varPsi }= \{ \mathbf {{W}_{1}},...,\mathbf {{W}_{\left| U \right| }} , \mu , {\varSigma }, {\sigma }^{2} \}\), in which \(U\) denotes the set of users. To estimate these parameters, we optimize the following function, which searches for an optimal \(\hat{{\varPsi }}\) that maximizes the log-posterior probability given the review corpus:

$$\begin{aligned} \hat{{\varPsi }} = {arg max}_{{\varPsi }} \, p\left( {\varPsi }\mid \mathcal {S} \right) = {arg max}_{{\varPsi }} \sum _{{\langle u,i\rangle \in \mathcal {S}}}log\left( p\left( \mathbf {W}_{u}\mid \mathcal {S} \right) \right) \end{aligned}$$
(6)

This can be solved by applying the Expectation–Maximization (EM) algorithm (Dempster et al. 1977). Thus, with PRM, we can also obtain the weights \(\mathbf {W}_{u}\) that a user holds for different aspects as the user’s context-independent preferences.
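The full PRM is fitted with EM as above. To illustrate, in a hedged way, how the frequency-based prior stabilizes the weights of new users, the sketch below solves a simplified MAP special case in which the prior is fixed at \(\mathcal {N}\left( {\mu }_{0}, \tau ^{2}\mathbf {I} \right) \) rather than learned; under this assumption the posterior mode has a ridge-regression-like closed form. This is a simplification for exposition only, not the EM procedure of Eq. 6.

```python
import numpy as np

def map_weights(A: np.ndarray, r: np.ndarray, mu0: np.ndarray,
                sigma2: float = 1.0, tau2: float = 1.0) -> np.ndarray:
    """MAP estimate of W_u under R = W^T A + eps, eps ~ N(0, sigma2), with
    the fixed prior W_u ~ N(mu0, tau2 * I) (mu0: aspect occurrence frequencies).
    Maximizing the posterior then amounts to minimizing
        ||r - A w||^2 / sigma2 + ||w - mu0||^2 / tau2,
    whose minimizer is (A^T A + lam * I)^{-1} (A^T r + lam * mu0)."""
    lam = sigma2 / tau2
    K = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(K), A.T @ r + lam * mu0)

# Even for a new user with a single review (fewer samples than aspects),
# the system stays well-posed and the estimate shrinks toward mu0.
```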

4.2.3 Discussion

In our previous study (Chen and Chen 2014), we used only the linear regression model (LRM) to derive users’ context-independent preferences, as the goal of that project was to improve recommendations for repeated users, who have abundant history data. In such cases, the number of data samples used as input for training the regression model can be larger than or equal to the number of independent variables. In the current study, we also consider new users, who have few ratings and reviews, so that the amount of user data does not satisfy the LRM’s requirement. We believe that the probabilistic regression model (PRM) can address this limitation by treating the preference inference problem as a Bayesian learning process and considering prior knowledge (i.e., the aspect’s occurrence frequency). Therefore, in this study, we use PRM for new users, but we consider both LRM and PRM for repeated users, and compare their effectiveness in an experiment (see Sect. 5).

4.3 Inferring context-dependent preferences

Unlike context-independent preferences, context-dependent preferences indicate the aspect-level contextual needs that are common to users in the same context. To capture such preferences, we propose three variations of contextual weighting methods.

Intuitively, if an aspect appears more frequently than others in reviews pertaining to a certain context, this aspect should be more valued by users in this context and thus receive a higher weight. Therefore, our basic approach is to assign weights to aspects by analyzing the relation between the aspect’s occurrence frequency and the context. We first develop the following formula to calculate the occurrence frequency of aspect k in context value c:

$$\begin{aligned} {freq}_{k,c}=\frac{\sum _{rev \in R}\sum _{s \in rev} {{\varDelta }}_{s,c} \cdot \left( \sum _{f \in s} {{\varTheta }}_{f,k} \right) }{\sum _{rev \in R}\sum _{s \in rev} {{\varDelta }}_{s,c} \cdot \left( \sum _{f \in s} 1 \right) } \end{aligned}$$
(7)

where f, s, and rev, respectively, represent an aspect-related term, a sentence, and a review; R denotes the set of all reviews; \({{\varDelta }}_{s,c}\) is an indicator function, whose value is equal to 1 if the sentence s is related to context value c, and 0 otherwise; and \({{\varTheta }}_{f,k}\) is another indicator function, whose value is equal to 1 if the term f is related to aspect k, and 0 otherwise. In fact, Eq. 7 calculates the aspect’s occurrence frequency based on its related terms’ occurrences in sentences labeled with context value c. Once the frequencies of the aspect in different context values are obtained, we compute the aspect’s average frequency as \({avg}_{k}= {\sum _{c \in \mathcal C} {{freq}_{k,c}}}/{ \left| \mathcal C \right| }\), the standard deviation as \({stdv}_{k}=\sqrt{\sum _{c \in \mathcal C}{{\left( {freq}_{k,c} - {avg}_{k} \right) }^{2}}/{\left| \mathcal C \right| }}\) (where \(\mathcal C\) denotes the set of context values), and \({dev}_{k,c} = {freq}_{k,c} - {avg}_{k}\) (Levi et al. 2012). Next, we adopt the strategy proposed in (Levi et al. 2012) for computing the weight of aspect k regarding context value c:

$$\begin{aligned} {w}_{k,c} = {\left\{ \begin{array}{ll} 1, &{}\text { if } \left| {dev}_{k,c} \right| < {stdv}_{k} \\ Max\left( 0.1, 1/{\left| \frac{{dev}_{k,c}}{{stdv}_{k}} \right| } \right) , &{}\text { if } \frac{{dev}_{k,c}}{{stdv}_{k}} < -1\\ Min\left( 3,\frac{{dev}_{k,c}}{{stdv}_{k}} \right) , &{}\text { else } \end{array}\right. } \end{aligned}$$
(8)

However, Eq. 7 does not distinguish the relative importance of an aspect-related term in different contexts. In our view, the same term might be valued differently by users in different contexts, as explained by the example we gave in Sect. 3. To account for this, we extend this method (Eq. 7) by weighting each term using knowledge from text categorization (Yang and Pedersen 1997). We concretely build on the text categorization methods for selecting representative features (i.e., terms) when categorizing documents, and develop three contextual weighting methods; we then compare their effectiveness in an experiment.
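Before the refinement, here is a sketch of the frequency-based weighting itself (Eqs. 7 and 8), assuming the contextual review analysis of Sect. 4.1 has already labeled each sentence with a context value and mapped each aspect-related term occurrence to its aspect; the input layout and the guard for a zero standard deviation are our assumptions.

```python
import numpy as np

def aspect_context_frequencies(sentences, aspects, contexts):
    """Eq. 7. sentences: list of (context_value, [aspect of each
    aspect-related term occurrence]) pairs. freq_{k,c} divides the
    occurrences of aspect-k terms by those of all aspect-related terms
    in sentences labeled with context value c."""
    count = {(k, c): 0 for k in aspects for c in contexts}
    total = {c: 0 for c in contexts}
    for c_label, term_aspects in sentences:
        if c_label not in total:
            continue
        for k in term_aspects:
            count[(k, c_label)] += 1
            total[c_label] += 1
    return {(k, c): count[(k, c)] / max(total[c], 1)
            for k in aspects for c in contexts}

def contextual_weights(freq, aspects, contexts):
    """Eq. 8, applied to one aspect's frequencies across context values."""
    w = {}
    for k in aspects:
        f = np.array([freq[(k, c)] for c in contexts], dtype=float)
        avg, stdv = f.mean(), f.std()  # population std, matching the text
        for c, dev in zip(contexts, f - avg):
            if stdv == 0:                  # guard; not specified in Eq. 8
                w[(k, c)] = 1.0
            elif abs(dev) < stdv:
                w[(k, c)] = 1.0
            elif dev / stdv < -1:
                w[(k, c)] = max(0.1, 1.0 / abs(dev / stdv))
            else:
                w[(k, c)] = min(3.0, dev / stdv)
    return w
```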

4.3.1 Mutual information (MI)

Mutual information was originally used in information theory to measure the mutual dependence between two random variables (Yang and Pedersen 1997). For our task, the two random variables are an aspect-related term and a context. Given a term f and a context value c, the mutual information between them is defined as:

$$\begin{aligned} MI\left( f,c \right) = log\frac{p\left( f \wedge c \right) }{p\left( f \right) \cdot p\left( c \right) } \end{aligned}$$
(9)

where \(p\left( f \right) \) denotes the probability of f appearing in sentences, \(p\left( c \right) \) denotes the probability of sentences that are associated with context value c, and \(p\left( f \wedge c \right) \) denotes the probability that f appears in sentences that are related to context value c.

4.3.2 Information gain (IG)

In the area of text categorization, information gain measures the number of bits of information obtained for category prediction by knowing the presence or absence of a word in a document (Yang and Pedersen 1997). Hence, we can use this metric to measure the importance of an aspect-related term within a specific context. To suit our need, we implement this metric as a binary classification model, in which each sentence is classified into one of two categories, related to context value c or not: \(\mathcal {O} = \{{c}_{presence},{c}_{absence}\}\). The information gain is then calculated as follows:

$$\begin{aligned} IG\left( f,c \right)= & {} -\sum _{c \in \mathcal {O}}p\left( c\right) \cdot log \, p\left( c\right) + p\left( f\right) \sum _{c \in \mathcal {O}}p\left( c\mid f \right) log \, p\left( c\mid f \right) \nonumber \\&+ p\left( \bar{f}\right) \sum _{c \in \mathcal {O}}p\left( c\mid \bar{f} \right) log \, p\left( c\mid \bar{f} \right) \end{aligned}$$
(10)

where \(\bar{f}\) denotes the absence of f in a sentence, and \(p\left( c\mid f \right) \) denotes the probability that sentences containing f are related to context value c.

4.3.3 Chi-square statistic (CHI)

Generally speaking, we can measure the lack of independence between two random variables by comparing the observed co-occurrence counts with those expected under independence; the resulting statistic follows the chi-square distribution (Yang and Pedersen 1997). For our purpose, the lack of independence is computed between an aspect-related term f and a context value c, and regarded as \(f\)’s weight for \(c\), which is formally defined as follows:

$$\begin{aligned} CHI\left( f,c \right) = \frac{D \times {\left( D_1 D_4 - D_2 D_3\right) }^{2}}{\left( D_1 + D_3 \right) \times \left( D_2 + D_4\right) \times \left( D_1 + D_2 \right) \times \left( D_3 + D_4 \right) } \end{aligned}$$
(11)

where \(D_1\) is the number of sentences related to c that contain f, \(D_2\) is the number of sentences not related to c that contain f, \(D_3\) is the number of sentences related to c that do not contain f, \(D_4\) is the number of sentences that are neither related to c nor contain f, and D is the total number of sentences.

There are several inherent differences between the above three methods: (1) MI treats an aspect-related term and a context value as two random variables and computes their dependency based on the probability of them co-occurring in a sentence, which is rather straightforward; (2) IG regards the reviews written in a context as a corpus, and computes the dependency as the amount of information (measured by applying entropy theory) obtained by observing the aspect-related term in the corpus; and (3) like MI, CHI starts from the assumption that the two random variables are independent, but computes the dependency as the deviation of the observed counts from those expected under independence. A common property of IG and CHI is that they both consider a term as having presence and absence statuses in relation to a context value.
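All three metrics can be computed from the same four sentence-level counts. The sketch below follows Eqs. 9-11 directly; the small smoothing constant is our addition to avoid log(0) and division by zero, not part of the original formulas.

```python
import math

def mi_ig_chi(n_fc: int, n_f: int, n_c: int, n: int):
    """n_fc: sentences related to c that contain term f; n_f: sentences
    containing f; n_c: sentences related to c; n: total number of
    sentences. Returns (MI, IG, CHI) per Eqs. 9-11."""
    eps = 1e-12  # smoothing; our addition
    p_f, p_c, p_fc = n_f / n, n_c / n, n_fc / n

    # Eq. 9: mutual information.
    mi = math.log((p_fc + eps) / (p_f * p_c + eps))

    # Eq. 10: information gain over O = {c present, c absent}.
    def h(ps):  # sum of p * log p
        return sum(p * math.log(p + eps) for p in ps)
    p_c_given_f = [(n_fc + eps) / (n_f + eps),
                   (n_f - n_fc + eps) / (n_f + eps)]
    nf_bar = n - n_f
    p_c_given_fbar = [(n_c - n_fc + eps) / (nf_bar + eps),
                      (nf_bar - (n_c - n_fc) + eps) / (nf_bar + eps)]
    ig = -h([p_c, 1 - p_c]) + p_f * h(p_c_given_f) + (1 - p_f) * h(p_c_given_fbar)

    # Eq. 11: chi-square statistic from the 2x2 contingency table.
    d1, d2 = n_fc, n_f - n_fc                    # f with / without c
    d3, d4 = n_c - n_fc, n - n_f - (n_c - n_fc)  # c without f / neither
    chi = (n * (d1 * d4 - d2 * d3) ** 2 /
           ((d1 + d3) * (d2 + d4) * (d1 + d2) * (d3 + d4) + eps))
    return mi, ig, chi
```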

Through any of these three methods, we can obtain the weights of the aspect-related terms with respect to different context values; this information can then be used for computing an aspect’s frequency by modifying Eq. 7 as follows:

$$\begin{aligned} {freq}_{k,c}=\frac{\sum _{rev \in R}\sum _{s \in rev} {{\varDelta }}_{s,c} \cdot \left( \sum _{f \in s} {{\varTheta }}_{f,k} \cdot MI\left( f,c \right) \right) }{\sum _{rev \in R}\sum _{s \in rev} {{\varDelta }}_{s,c} \cdot \left( \sum _{f \in s} MI\left( f,c \right) \right) } \end{aligned}$$
(12)

where \(MI\left( f,c \right) \) is calculated via Eq. 9, which can be replaced with \(IG\left( f,c \right) \) (Eq. 10) or \(CHI\left( f,c \right) \) (Eq. 11). The results can then be applied to Eq. 8 to determine the aspect’s weight in a certain context.

4.4 Generating recommendation

Considering that users’ behavior can be influenced by both context-independent and context-dependent preferences, we implement a linear-regression-based method to combine both types of preferences when computing a score for review \({rev}_{v,i}\) (i.e., a review written by reviewer v for item i) for the target user u (suppose item i is unknown to u):

$$\begin{aligned} score\left( u,{rev}_{v,i}, T \right)= & {} \sum _{\langle i, {rev}_{v,i}, {a}_{k}, {Con}_{v,i,k} \rangle \in S\left( {rev}_{v,i} \right) } \left( \prod _{c \in T} \left( 1 + {\alpha }_{k,c} \cdot {w}_{k,c} \right) \right) \nonumber \\&\cdot {w}_{u,k} \cdot {a}_{k} \cdot g\left( {Con}_{u},{Con}_{v,i,k} \right) \end{aligned}$$
(13)

In Eq. 13, \({w}_{k,c}\) is the context-dependent preference for aspect k in context value c (common to users in that context, and derived via one of the three contextual weighting methods proposed in Sect. 4.3), \({w}_{u,k}\) is user u’s context-independent preference for aspect k (see Sect. 4.2), \({a}_{k}\) is aspect k’s opinion score contained in the contextual opinion tuple \(\langle i, {rev}_{v,i}, {a}_{k}, {Con}_{v,i,k} \rangle \), \(S\left( {rev}_{v,i} \right) \) is the set of contextual opinion tuples derived from \({rev}_{v,i}\), \(T\) contains the target user’s current contexts, and \({Con}_{u}\) denotes their vector form. The indicator function \(g\left( {Con}_{u},{Con}_{v,i,k} \right) \) is defined as follows:

$$\begin{aligned} g\left( {Con}_{u},{Con}_{v,i,k} \right) = {\left\{ \begin{array}{ll} 1,&{} \text { if } {Con}_{u} \cdot {Con}_{v,i,k}\ne 0 \\ 0,&{} \text { else } \end{array}\right. } \end{aligned}$$
(14)

Equation 14 is used to ensure that only the aspect-level opinions pertinent to the target user’s current contexts are taken into account.

The score of item i for user u is then calculated by averaging the scores of all of its reviews using the following formula:

$$\begin{aligned} score\left( u,i \right) = {avg}_{{rev}_{v,i} \in R\left( i \right) }\left[ score\left( u,{rev}_{v,i}, T \right) \right] \end{aligned}$$
(15)

where \(R\left( i \right) \) denotes the set of reviews for item i. As Eq. 13 considers the target user’s context-dependent and context-independent preferences, the predicted score of a review reflects its relevance to the target user. The higher the predicted score, the more interested the target user is in the aspects mentioned in the review. So, if the average score across all of the reviews of an item is high, this item can be recommended to the target user. The top-N items with the highest scores are retrieved in our system. In the experiment, we set \(N\) to 5, 10, and 15.
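To make the scoring step concrete, here is a sketch of Eqs. 13-15, assuming the contextual opinion tuples of Sect. 4.1 and the preference weights from Sects. 4.2 and 4.3 are available as plain dictionaries; all container layouts and names are illustrative assumptions.

```python
import numpy as np

def review_score(tuples, T, w_ctx, alpha, w_user, con_u):
    """Eq. 13. tuples: list of (aspect k, opinion a_k, context vector Con).
    T: the target user's current context values; con_u: their vector form.
    w_ctx[(k, c)]: context-dependent weights; w_user[k]: context-independent
    weights; alpha[(k, c)]: the learned combination parameters."""
    score = 0.0
    for k, a_k, con in tuples:
        if np.dot(con_u, con) == 0:        # Eq. 14: skip irrelevant contexts
            continue
        boost = 1.0
        for c in T:                        # product over current contexts
            boost *= 1.0 + alpha[(k, c)] * w_ctx[(k, c)]
        score += boost * w_user.get(k, 0.0) * a_k
    return score

def item_score(reviews, T, w_ctx, alpha, w_user, con_u):
    """Eq. 15: average the scores of all of the item's reviews."""
    return float(np.mean([review_score(tps, T, w_ctx, alpha, w_user, con_u)
                          for tps in reviews]))
```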

It is worth noting that in Eq. 13, \({\alpha }_{k,c}\) is a combination parameter used to control the relative contributions of a user’s context-independent and context-dependent preferences for aspect \(k\) in context value \(c\) when computing a review’s score. To automatically learn this parameter for each \(\langle aspect, context \rangle \) pair, we propose a stochastic gradient descent learning method. As our task is top-N recommendation, we use an objective function that measures the ranking error (i.e., the extent to which items enjoyed by the target user are ranked below those not enjoyed by her/him) (Weston et al. 2011):

$$\begin{aligned} \sum _{u \in U}\sum _{i \in {I}^{+}}\sum _{\bar{i} \in {I}^{-}} L\left( F\left( score(u,\bar{i}) \ge score(u,i) \right) \right) \end{aligned}$$
(16)

Here, \(U\) denotes the set of users, \({I}^{+}\) denotes the set of items enjoyed by user \(u\) (i.e., positive items), and \({I}^{-}\) denotes the set of items not enjoyed by \(u\) (i.e., negative items). \(score(u,i)\) is computed via Eq. 15, which involves the parameter \({\alpha }_{k,c}\) that we aim to learn. The indicator function \(F(\tau )\) is equal to 1 if \(\tau \) is true, and 0 otherwise. The function \(L(\varrho )\) is used to convert the ranking error \(\varrho \) (i.e., \(F(\tau )\)) into a weight. There are two choices for \(L(\varrho )\): (1) \(L(\varrho ) = \mathcal {H}\cdot \varrho \), in which \(\mathcal {H}\) denotes a constant; and (2) \(L(\varrho ) = \sum _{x=1}^{\varrho }\frac{1}{x}\). It has been demonstrated that the first choice optimizes the mean rank of the recommendation list, whereas the second optimizes the top of the ranked list (Jason et al. 2012). For instance, given two items whose true ranking positions are 1 and 100, respectively, the first choice tends to favor functions that rank them both at 50, whereas the second prefers functions that rank them at their true positions, which matches our aim of optimizing the ranking of the top-N items in the recommendation list. We thus adopt the second choice for \(L(\varrho )\).

However, Eq. 16 is not continuous and thus not arbitrarily differentiable, which prevents us from applying the stochastic gradient descent based method to solve it. Inspired by (Weston et al. 2011), we add a margin to Eq. 16 and approximate it as follows:

$$\begin{aligned}&\sum _{u \in U}\sum _{i \in {I}^{+}}\sum _{\bar{i} \in {I}^{-}} L\left( F\left( 1 + score(u,\bar{i}) \ge score(u,i) \right) \right) \nonumber \\&\quad \cdot \big |1+ score(u,\bar{i}) - score(u,i) \big | \end{aligned}$$
(17)
Algorithm 1 Stochastic gradient descent learning of the combination parameters \({\alpha }_{k,c}\)

By doing this, we make the stochastic gradient descent learning method feasible for minimizing the objective function so as to learn the optimal combination parameter \({\alpha }_{k,c}\) for each \(\langle aspect, context \rangle \) pair. Algorithm 1 sketches the general scheme of our developed method. Specifically, it works as follows. Before the learning process starts, the combination parameters are randomly initialized (line \(1\)). In each iteration, for each positive item \(i\) enjoyed by user \(u\), we calculate the corresponding ranking error, i.e., \(F\left( 1 + score(u,\bar{i}) \ge score(u,i) \right) \) (lines \(5 \sim 9\)). Particularly, due to the large number of items in real-life datasets, the computation of the ranking error would be costly. Therefore, we adopt the following sampling approximation: for a positive item \(i\), sample \(N\) items until a violation is found, i.e., \(score(u,\bar{i}) + 1 > score(u,i)\), and then approximate the ranking error with \(\left| {I}^{+} \bigcup {I}^{-} \right| /N\). Then, the ranking error is converted into a weight \(L\left( \left| {I}^{+} \bigcup {I}^{-} \right| /N \right) \) and used to adjust the value of \({\alpha }_{k,c}\) (that is used to compute \(score(u,i)\) and \(score(u,\bar{i})\) via Eq. 15) in the direction in which we expect an improvement (lines \(10 \sim 12\)):

$$\begin{aligned} {\alpha }_{k,c} \leftarrow {\alpha }_{k,c} + {\uplambda }L\left( \left| {I}^{+} \bigcup {I}^{-} \right| /N \right) \end{aligned}$$
(18)

where \(k\in \left[ 1,K \right] \), \(c\in \left[ 1,C \right] \), and \({\uplambda }\) controls the learning rate; it is set to 0.02 in our experiment. To avoid over-fitting, in each step we ensure that the learned parameter vectors (i.e., \({\varvec{\alpha }}_{c} = \langle {\alpha }_{1,c},...,{\alpha }_{K,c} \rangle \)) satisfy \(||{\varvec{{\alpha }}}_{c} ||\le \mathcal {H}\), where \(\mathcal {H}\) denotes a constant value that is set to 8 through experimental trials. If \(||{\varvec{{\alpha }}}_{c} ||> \mathcal {H}\), we carry out the following regularization strategy (lines \(13 \sim 15\)):

$$\begin{aligned} {\varvec{\alpha }}_{c} \leftarrow \mathcal {H}{\varvec{\alpha }}_{c} / ||{\varvec{{\alpha }}}_{c} ||\end{aligned}$$
(19)

The algorithm stops when the difference between the objective function values of two successive iterations is smaller than a pre-defined threshold (line \(16\)). The learned parameter \({\alpha }_{k,c}\) is then incorporated into Eq. 13 for calculating the score of a review of the candidate item for the target user.
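For readers who prefer code, the following Python sketch outlines the general scheme of Algorithm 1 under simplifying assumptions: `score(u, i, alpha)` stands in for Eq. 15 and is supplied by the caller, the inputs `users`, `pos_items`, and `neg_items` are hypothetical container names, negative items are sampled uniformly, and the update of Eq. 18 is applied to all entries of \(\varvec{\alpha }\), exactly as the text states. It is a minimal illustration of the learning scheme, not the full implementation:

```python
import random
import numpy as np

def harmonic(rho: float) -> float:
    # L(rho) = sum_{x=1}^{rho} 1/x: the top-weighted loss adopted above.
    return sum(1.0 / x for x in range(1, int(rho) + 1))

def learn_alpha(users, pos_items, neg_items, score, K, C,
                lam=0.02, H=8.0, tol=1e-4, max_iter=100):
    """Sketch of Algorithm 1. pos_items[u] and neg_items[u] are the sets
    I+ and I- of user u; score(u, i, alpha) is assumed to implement Eq. 15."""
    alpha = np.random.rand(K, C)               # line 1: random initialization
    prev_obj = float("inf")
    for _ in range(max_iter):
        obj = 0.0
        for u in users:
            n_items = len(pos_items[u]) + len(neg_items[u])   # |I+ U I-|
            for i in pos_items[u]:
                # Lines 5-9: sample negative items until a margin violation
                # is found; N counts the samples drawn.
                N, violated = 0, False
                while N < len(neg_items[u]):
                    j = random.choice(tuple(neg_items[u]))
                    N += 1
                    if score(u, j, alpha) + 1.0 > score(u, i, alpha):
                        violated = True
                        break
                if not violated:
                    continue
                w = harmonic(n_items / N)      # approximate ranking error
                obj += w
                alpha += lam * w               # Eq. 18 (lines 10-12)
                # Lines 13-15: project each context's parameter vector
                # back onto the ball of radius H (Eq. 19).
                norms = np.linalg.norm(alpha, axis=0)
                over = norms > H
                alpha[:, over] *= H / norms[over]
        if abs(prev_obj - obj) < tol:          # line 16: stopping test
            break
        prev_obj = obj
    return alpha
```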

5 Experiment and results

5.1 Datasets and experiment setup

We use two real-life datasets to test our approach: the first is a hotel-service dataset crawled from TripAdvisor, and the second is a restaurant-service dataset from Yelp, as published by the RecSys'13 challenge. In both datasets, each textual review is accompanied by an overall rating, ranging from 1 to 5 stars, assigned by the reviewer. To ensure that each review contains sufficient evaluation information and that each item receives sufficient reviews to be analyzed, we first perform the following cleaning procedure: (1) remove reviews that contain fewer than three sentences; (2) remove users who have posted only one review; and (3) remove items that have received fewer than 15 reviews. The descriptions of the two datasets are given in Table 4. Note that the data sparsity is defined as \(1-\frac{\# \text { of reviews}}{\# \text { of users } \times \text { } \# \text { of items}}\). The full sets of retrieved aspect-related terms for the two datasets are shown in Tables 5 and 6.

Table 4 Descriptions of the two datasets
Table 5 The full set of retrieved aspect-related terms for the hotel dataset
Table 6 The full set of retrieved aspect-related terms for the restaurant dataset
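As a concrete illustration, the cleaning procedure and the sparsity statistic can be expressed as the following sketch; the field names `user`, `item`, and `text` are hypothetical stand-ins for whatever schema the raw crawls use, and the sentence splitter is deliberately naive:

```python
import re
from collections import Counter

def clean(reviews, min_sentences=3, min_user_reviews=2, min_item_reviews=15):
    """Three-step cleaning; each review is assumed to be a dict with the
    hypothetical keys 'user', 'item', and 'text'."""
    # (1) remove reviews with fewer than three sentences (naive splitting)
    def n_sentences(text):
        return len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    reviews = [r for r in reviews if n_sentences(r["text"]) >= min_sentences]
    # (2) remove users who have posted only one review
    per_user = Counter(r["user"] for r in reviews)
    reviews = [r for r in reviews if per_user[r["user"]] >= min_user_reviews]
    # (3) remove items that have received fewer than 15 reviews
    per_item = Counter(r["item"] for r in reviews)
    return [r for r in reviews if per_item[r["item"]] >= min_item_reviews]

def sparsity(reviews):
    # 1 - (# of reviews) / (# of users * # of items), as defined above
    users = {r["user"] for r in reviews}
    items = {r["item"] for r in reviews}
    return 1.0 - len(reviews) / (len(users) * len(items))
```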

For the evaluation procedure, we adopt a widely used per-user evaluation scheme (Shani and Gunawardana 2011; Codina et al. 2013). That is, for each user, we randomly select a certain number of ratings that are above four stars (i.e., enjoyed items) as testing data, while the remaining ratings serve as training data. In the hotel dataset, the average number of ratings (and reviews) given by new users (i.e., users who have fewer than five history records; Jamali and Ester 2009) is 2.37, versus 14.08 for repeated users. In the restaurant dataset, the average number of ratings (and reviews) given by new users is 2.80, versus 15.99 for repeated users. Therefore, in the experiment, for each new user we randomly select one rating as the testing data, whereas for each repeated user we randomly select three ratings. As for the target user's current contexts, in the hotel dataset, such context information is obtained from the tested item's associated context (as provided by the user); in the restaurant dataset, because such information is not available, it is simulated by performing a contextual analysis of the user's review for the tested item. All of the reported results are the averages of per-user evaluations; the Student's t Test (Smucker et al. 2007) is applied to assess the statistical significance of the differences between the compared methods.
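The per-user split translates into a short sampling routine. The sketch below follows the thresholds stated above (ratings above four stars count as enjoyed; users with fewer than five records count as new) and assumes each record carries a hypothetical `rating` field:

```python
import random

def per_user_split(user_records, new_user_threshold=5):
    """Per-user split: one test rating for new users, three for repeated
    users; only enjoyed items (rating above four stars) are tested."""
    train, test = [], []
    for user, records in user_records.items():
        enjoyed = [r for r in records if r["rating"] > 4]
        n_test = 1 if len(records) < new_user_threshold else 3
        chosen = random.sample(enjoyed, min(n_test, len(enjoyed)))
        chosen_ids = {id(r) for r in chosen}
        test.extend(chosen)
        train.extend(r for r in records if id(r) not in chosen_ids)
    return train, test
```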

The experiment is designed to answer the following questions: (1) how can we accurately infer the context-independent and context-dependent preferences of both new and repeated users? and (2) when the two types of preferences are combined to generate recommendations, is the stochastic gradient descent based method capable of learning the combination parameters? Accordingly, the experiment is divided into three parts: (1) apply the method to a sample group of new users to identify the ideal strategy for inferring their preferences; (2) apply the method to a sample group of repeated users to identify the ideal strategy for inferring their preferences; and (3) apply the method to the whole dataset to test the effectiveness of the proposed stochastic gradient descent learning method. The results are given in Sect. 5.4.

5.2 Compared methods

The variations of our recommendation algorithm are denoted as LRM/PRM + MI/IG/CHI connecter; they combine users' context-independent preferences (inferred by either the linear regression model (LRM) or the probabilistic regression model (PRM); see Sect. 4.2) with context-dependent preferences (inferred by one of the three contextual weighting methods based respectively on mutual information (MI), information gain (IG), and the chi-square statistic (CHI); see Sect. 4.3).

We compare our algorithms with three related methods: the first does not consider contextual information when generating recommendations (i.e., context freer); the second only uses contextual information to filter data before a traditional recommendation algorithm is applied (i.e., context pre-filter); and the third uses reviews to infer users' contextual preferences but does not take into account the relative importance of aspect-related terms (i.e., simple connecter).

  • Context freer: this method adopts the regression-based approach proposed by Adomavicius and Kwon (2007); it uses the aspect-level opinions, i.e., \(\{{a}_{k} \mid 1 \le k \le K\}\) (where \(K\) denotes the number of aspects), to calculate the score of a review \({rev}_{v,i}\) for the target user \(u\). In effect, this method uses a simplified version of Eq. 13 that does not consider the user's context-dependent preferences:

    $$\begin{aligned} score \left( u,{rev}_{v,i} \right) = \sum _{k=1}^{K}{a}_{k} \cdot {w}_{u,k} \end{aligned}$$
    (20)

    Here, \({w}_{u,k}\) represents user \(u\)'s context-independent preference for aspect \(k\). Then, Eq. 15 is applied to compute the score of item \(i\) for user \(u\). We denote this method as Freer (a code sketch of this score and of the pre-filtered variant below follows the list).

  • Context pre-filter: following Adomavicius et al. (2005), this method uses contextual information at the item level. That is, we first pre-filter the ratings according to the user-specified contexts and then apply the recommendation algorithm Freer. This results in a modified version of Eq. 13:

    $$\begin{aligned} score \left( u,{rev}_{v,i} \right) = g\left( {Con}_{u},{Con}_{{rev}_{v,i}} \right) \cdot \sum _{k=1}^{K}{a}_{k} \cdot {w}_{u,k} \end{aligned}$$
    (21)

    Here, \({Con}_{{rev}_{v,i}}\) denotes the contexts extracted from review \({rev}_{v,i}\), \({Con}_{u}\) denotes the contexts specified by the target user, and \(g\left( {Con}_{u},{Con}_{{rev}_{v,i}} \right) \) is an indicator function used to ensure that only the opinions pertinent to the target user’s current contexts are considered (as defined in Eq. 14). In fact, this method only considers reviews written in the target user’s contexts when calculating the item’s score via Eq. 15. We denote it as Pre-filter.

  • Simple connecter: this method originates from Levi et al. (2012). It uses the results of the contextual review analysis described in Sect. 4.1 to assign context-dependent weights to aspects via Eq. 8. Unlike our approaches, this method does not consider the relative weights of aspect-related terms in different contexts. We denote it as Simpler.
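For concreteness, here is a minimal sketch of the two equation-based baselines. The aspect opinions \(a_k\) and the context-independent weights \(w_{u,k}\) are assumed to be given as equal-length sequences, and the indicator \(g\) of Eq. 14 is reduced to a simple context-set comparison purely for illustration; the actual definition may differ:

```python
def score_freer(aspect_opinions, w_u):
    # Eq. 20: weighted sum of the review's aspect-level opinions a_k
    # under the user's context-independent preferences w_{u,k}.
    return sum(a * w for a, w in zip(aspect_opinions, w_u))

def score_prefilter(aspect_opinions, w_u, review_contexts, user_contexts):
    # Eq. 21: the same score gated by the indicator g, so that only
    # reviews pertinent to the target user's current contexts count.
    # g is illustrated here as set containment (an assumption; Eq. 14
    # defines the actual indicator).
    g = 1.0 if set(user_contexts) <= set(review_contexts) else 0.0
    return g * score_freer(aspect_opinions, w_u)
```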

5.3 Evaluation metrics

For top-N recommendations, researchers have commonly stressed two points (Deshpande and Karypis 2004; Gunawardana and Shani 2009): whether a user's target choice appears in the recommendation list, and how highly the target choice is ranked in the list. We therefore apply two metrics to measure the recommendation accuracy (both are sketched in code after the list).

  • Hit ratio @ top-N recommendations: shortened to H@N, this metric measures whether a user's target choice appears in the top-N recommendation list (Chen and Wang 2013). It is computed as the fraction of hits over all tests:

    $$\begin{aligned} H@N = \sum _{z=1}^{Z}{\delta }_{{rank}_{z}\le N}/{Z} \end{aligned}$$
    (22)

    where \(Z\) is the number of tests, \({rank}_{z}\) is the ranking position of the user's target choice in the z-th test, and \({\delta }_{{rank}_{z}\le N}\) is an indicator function that equals 1 if \({rank}_{z}\le N\) (i.e., the recommendation list contains the choice), and 0 otherwise.

  • Mean reciprocal rank: shortened to MRR, this metric evaluates the ranking position of a user's target choice in the recommendation list (Chen and Wang 2013) and is formally defined as:

    $$\begin{aligned} MRR = {\sum _{z=1}^{Z} \frac{{\delta }_{{rank}_{z}\le N}}{{rank}_{z}}}/{Z} \end{aligned}$$
    (23)
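Both metrics reduce to simple arithmetic over the 1-indexed ranking positions of the users' target choices, as the following sketch shows:

```python
def hit_ratio_and_mrr(ranks, n):
    """H@N (Eq. 22) and MRR (Eq. 23) from the ranking positions of the
    target choices over Z tests."""
    Z = len(ranks)
    h_at_n = sum(1 for r in ranks if r <= n) / Z      # Eq. 22
    mrr = sum(1.0 / r for r in ranks if r <= n) / Z   # Eq. 23
    return h_at_n, mrr

# Example: target choices ranked 2, 7, and 40 in three tests.
print(hit_ratio_and_mrr([2, 7, 40], n=10))  # H@10 = 2/3, MRR = (1/2 + 1/7)/3
```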

5.4 Results analysis

As mentioned, we divide our experiment into three parts: experiments on new users, on repeated users, and on the whole dataset. Notice that for each sample, we can determine the combination parameter \(\alpha \) for generating recommendations (via Eqs. 13–15) in either of two ways: (1) manual selection, which chooses the best parameter value based on experimental trials (Chen and Chen 2014); or (2) automatic selection, which decides the parameter value for each \(\langle aspect, context \rangle \) pair by applying the stochastic gradient descent learning algorithm proposed in Sect. 4.4. In the first two parts of the experiment, because our main goal is to investigate the best strategies for inferring the context-independent and context-dependent preferences of new users and repeated users respectively, we simply use the first strategy. In the third part, we focus on the second strategy to investigate whether our recommendation algorithm can be further improved.

5.4.1 Evaluation of preference inference for new users

As new users’ context-independent preferences can only be estimated by applying the probabilistic regression model (PRM) (see the discussion in Sect. 4.2), we primarily compare the three different contextual weighting methods (i.e., MI-based, IG-based, and CHI-based), which differ in terms of how they detect context-dependent preferences. That is, there are three variations of the method for preference inference for new users: \(\textit{PRM} + \textit{MI}\), \(\textit{PRM} + \textit{IG}\), and \(\textit{PRM} + \textit{CHI}\). The experiment results are shown in Table 7.

Table 7 Experiment results of preference inference for new users. Results marked with \(*\) are significantly better than the method being compared (\(p < 0.001\) by Student’s t Test)

First, we observe that both Pre-filter and Simpler significantly outperform Freer with respect to the two evaluation metrics. For instance, in the hotel dataset the H@5 achieved by Simpler is 0.0093 and that achieved by Pre-filter is 0.0060; these values are, respectively, 200 and 94 % higher than that achieved by Freer, which does not consider contextual information. Similar improvements are observed for the other evaluation metrics. This confirms that contextual information extracted from reviews can enhance recommendation. Moreover, the comparison between Pre-filter and Simpler shows that, in most cases, Simpler is better than Pre-filter, indicating that contextual opinions can be used to build users' aspect-level context-dependent preferences.

The results also show that PRM + MI/IG/CHI is significantly superior to Simpler with respect to all of the evaluation metrics in most conditions. For example, the improvements brought by MI, IG, and CHI over Simpler in terms of H@5 in the hotel dataset are, respectively, 109, 116 and 251 %; for MRR@5, they are 120, 126 and 227 %. This demonstrates that the accuracy of users’ aspect-level context-dependent preferences can be increased by considering the relative importance of aspect-related terms in different contexts. Among the three contextual weighting methods, CHI achieves the best performance, followed by IG and then MI. This can be explained by the way in which these methods compute the relevance of an aspect-related term to a specific context. MI (i.e., Eq. 9) tends to favor low-frequency terms, which might result in biases in the calculation of a term’s relevance. In comparison, both CHI (i.e., Eq. 11) and IG (i.e., Eq. 10) compute a term’s weight by considering all of the possible combinations of “presence” and “absence” statuses of an aspect-related term in relation to a specific context. The better performance obtained by CHI relative to IG is likely because CHI computes the dependency between an aspect-related term and a context value by directly measuring their co-occurrence frequency; this depicts the term’s relative importance more precisely and thus results in better inference accuracy.
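The contrast among the three weighting schemes is easiest to see on a 2×2 contingency table of an aspect-related term and a context value. The sketch below uses the standard text feature-selection formulations of the three scores; it stands in for Eqs. 9–11, whose exact definitions may differ in detail, and assumes all four counts are positive:

```python
import math

def mi_ig_chi(n11, n10, n01, n00):
    """Standard feature-selection scores over a 2x2 contingency table of
    term t and context value c: n11 = reviews containing t and written in
    c, n10 = t without c, n01 = c without t, n00 = neither. A stand-in
    for Eqs. 9-11; assumes all four counts are positive."""
    n = n11 + n10 + n01 + n00
    p_t, p_c = (n11 + n10) / n, (n11 + n01) / n

    # MI: log of observed vs. expected co-occurrence. A rare term with a
    # small but exclusive n11 can get a large score, hence the
    # low-frequency bias noted above.
    mi = math.log((n11 / n) / (p_t * p_c))

    def entropy(*ps):
        return -sum(p * math.log(p) for p in ps if p > 0)

    # IG: reduction in the entropy of c from observing the term's
    # presence/absence, i.e., it covers all four presence/absence cells.
    ig = (entropy(p_c, 1 - p_c)
          - p_t * entropy(n11 / (n11 + n10), n10 / (n11 + n10))
          - (1 - p_t) * entropy(n01 / (n01 + n00), n00 / (n01 + n00)))

    # CHI: normalized squared deviation of the co-occurrence counts from
    # independence, measured directly on the table.
    chi = (n * (n11 * n00 - n10 * n01) ** 2
           / ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)))
    return mi, ig, chi
```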

5.4.2 Evaluation of preference inference for repeated users

The context-independent preferences of repeated users can be learned by applying either the linear regression model (LRM) or the probabilistic regression model (PRM), and their context-dependent preferences can be learned by applying one of the MI-, IG- and CHI-based contextual weighting methods. Therefore, there are six different combinations to be tested: \(\textit{LRM} + \textit{MI}\), \(\textit{LRM} + \textit{IG}\), \(\textit{LRM} + \textit{CHI}\), \(\textit{PRM} + \textit{MI}\), \(\textit{PRM} + \textit{IG}\), and \(\textit{PRM} + \textit{CHI}\). The experiment results are reported in Tables 8 (on the hotel dataset) and 9 (on the restaurant dataset).

Table 8 Experiment results of preference inference for repeated users on the hotel dataset
Table 9 Experiment results of preference inference for repeated users on the restaurant dataset

The results are similar to those presented in Table 7. That is, of the three methods Freer, Pre-filter, and Simpler, Simpler still achieves the best performance, followed by Pre-filter and then Freer. This supports our postulation that reviews are valuable resources containing contextual opinions that can be used to depict users' aspect-level contextual preferences more accurately. In addition, all of our proposed context-dependent preference inference methods outperform Simpler, and CHI still performs the best. Comparing these results with those reported in Table 7, we find that the improvement brought by discriminating aspect-related terms (e.g., CHI over Simpler) is more pronounced for new users than for repeated users in terms of the metric H@N (\(N=5, 10, 15\)). Specifically, in the hotel dataset, the average improvement is up to 236 % for new users, but only 105 % for repeated users; in the restaurant dataset, the improvement is 295 % for new users vs. 75 % for repeated users.

For the context-independent preference inference for repeated users, PRM-based Freer, which only considers users' context-independent preferences as inferred by PRM, significantly outperforms LRM-based Freer in terms of all of the evaluation metrics in both datasets. For example, in the hotel dataset, the value achieved by PRM-based Freer is 35 % higher than that of LRM-based Freer w.r.t. H@15, and 37 % higher w.r.t. MRR@15. Furthermore, when users' context-dependent preferences are integrated, the PRM-based variations (i.e., PRM-based Pre-filter, PRM-based Simpler, and PRM + MI/IG/CHI) are significantly superior to those based on LRM in terms of most evaluation metrics. All of these results demonstrate that PRM is better at deriving repeated users' context-independent preferences, which may be because it incorporates prior knowledge into the model.

5.4.3 Evaluation of combination parameter

The results from the above two parts lead to the following conclusions: for both new users and repeated users, context-independent preferences are better estimated by applying the probabilistic regression model (PRM), and context-dependent preferences are better obtained through the contextual weighting method based on the chi-square statistic (CHI). In this part, we investigate the effectiveness of our proposed stochastic gradient descent learning algorithm, which aims to automatically determine a set of combination parameters when generating recommendations (see Sect. 4.4). Specifically, we seek to learn the parameter for each \(\langle aspect, context \rangle \) pair when combining the context-independent and context-dependent preferences via Eq. 13. There are therefore \(K \times C\) parameters to be learned, i.e., \(\{{\alpha }_{k,c}\mid 1 \le k \le K, 1\le c \le C \}\). In addition, we implement some variations of the learning method that combine the two types of preferences at different levels (i.e., holistic-level and aspect-level); the parameter shapes of the three variants are sketched in code after the list:

  • Holistic learning: this variant searches for a single holistic-level parameter \(\alpha \) manually through experimental trials, as described in a previous study (Coy et al. 2001). In other words, the combination parameter is neither aspect-specific nor context-specific, so it cannot adapt to a user's needs for different aspects of an item in different contexts. In this method, \({\alpha }_{k,c}\) in Eq. 13 is replaced with a fixed parameter \(\alpha \). We denote it as Holistic.

  • Aspect-level learning: this variant involves \(K\) parameters, i.e., \(\varvec{\alpha } = \langle {\alpha }_{1},...,{\alpha }_{K}\rangle \), in which \({\alpha }_{k}\) \((1\le k \le K)\) represents the combination parameter for aspect \(k\) in all of the contexts. For this learning, Eq. 18 is modified as \({\alpha }_{k} \leftarrow {\alpha }_{k} + {\uplambda }L\left( \left| {I}^{+} \bigcup {I}^{-} \right| /N \right) \) and Eq. 19 is modified as \({\varvec{\alpha }} \leftarrow \mathcal {H}{\varvec{\alpha }} / ||{\varvec{\alpha }} ||\). In Eq. 13, \({\alpha }_{k}\) replaces \({\alpha }_{k,c}\) for computing a review's score. This method does not consider that users' context-independent and context-dependent preferences for the same aspect can be combined in different ways in different contexts. We denote it as Aspect.

  • Aspect-context-level learning: this variant learns a parameter \({\alpha }_{k,c}\) for each \(\langle aspect, context \rangle \) pair, as described in Sect. 4.4. We denote it as Aspect-Context.
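The three variants differ only in the shape of the parameter being learned, which the following sketch makes explicit (the dimensions are illustrative):

```python
import numpy as np

K, C = 7, 4  # illustrative numbers of aspects and context values

alpha_holistic = 0.5                         # Holistic: one hand-tuned scalar
alpha_aspect = np.random.rand(K)             # Aspect: one parameter per aspect
alpha_aspect_context = np.random.rand(K, C)  # Aspect-Context: K x C parameters

# What Eq. 13 reads as the combination weight of aspect k in context c:
k, c = 2, 1
weights = (alpha_holistic,               # identical for every aspect and context
           alpha_aspect[k],              # aspect-specific, context-invariant
           alpha_aspect_context[k, c])   # specific to the <aspect, context> pair
```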

The experiment results are shown in Table 10. There are two important findings. (1) Aspect is significantly superior to Holistic in most conditions, which demonstrates the effectiveness of the proposed learning algorithm in combining the two types of user preferences via a combination parameter learned at a more fine-grained level, i.e., one parameter per aspect. For instance, the H@10 achieved by Aspect is 0.0860 in the hotel dataset, which is 24 % higher than that achieved by Holistic; in the restaurant dataset, Aspect is 27 % higher than Holistic with respect to H@10 (0.1123 vs. 0.0883). (2) Aspect-Context further outperforms Aspect. In terms of H@N (\(N=5, 10, 15\)), the average improvement brought by Aspect-Context is up to 28 % relative to Aspect and up to 61 % relative to Holistic in the hotel dataset; the improvements are respectively 24 and 55 % in the restaurant dataset. This supports our hypothesis that users' aspect-level preferences can be influenced by contexts, and that our proposed learning algorithm is capable of capturing such influences.

Table 10 Experiment results of the combination parameter identification

6 Discussion

6.1 Summary of experiment results

All of the above results lead to three main conclusions. (1) The probabilistic regression model (PRM) is suitable for deriving not only new users' but also repeated users' context-independent preferences; this can mainly be attributed to its Bayesian learning process and the prior knowledge that it can draw on when deriving such preferences. (2) For detecting users' context-dependent preferences, the contextual weighting method based on CHI outperforms not only the baselines but also the other two weighting methods based on MI and IG under most circumstances. Its advantage can be attributed to two main properties: (a) CHI considers all possible combinations of the statuses (i.e., "presence" and "absence") of an aspect-related term in relation to a context value; and (b) CHI calculates the dependency between an aspect-related term and a context value by directly measuring their co-occurrence frequency. (3) The stochastic gradient descent learning method can automatically learn the combination parameters for fusing the two types of user preferences, and hence enhances the recommendation accuracy.

6.2 Practical implications to recommendation in ubiquitous computing

In our view, this research has several practical implications for recommendation in ubiquitous computing. (1) With the aid of advanced mobile devices (e.g., smart phones, Google glass), users' current contexts (such as location, motion, and time of day) can be automatically sensed (Carmichael et al. 2005; Zimmermann et al. 2005; Cheverst et al. 2005; Hammer et al. 2015); more importantly, such contexts can be matched against the contextual preferences inferred from users' item reviews, allowing the system to provide accurate recommendations in real time. (2) In particular, we have found that reviews can be used to model users' preferences at the fine-grained aspect level, and that these preferences can then be linked to contexts to capture the multi-faceted nature of users' contextual needs for items. (3) Moreover, through experiments on hotel and restaurant datasets, we have demonstrated the practical merit of our recommendation algorithm in mobile tourism, a typical application scenario of ubiquitous computing (Hatala and Wakkary 2005; Petrelli and Not 2005).

6.3 Limitations of our current work

Our current work still has several limitations. (1) The experiment was conducted on only two datasets, which limits the generalizability of our findings to broader product domains. Moreover, the practical usefulness of our method in real life has not been tested, as the experiment was designed as an offline simulation and the approach has not been validated with online users. (2) In the experiment, we excluded short reviews and items with few reviews to ensure that each review possesses sufficient opinions and that each item has received sufficient reviews. However, since such data may also contain valuable information, our method should be improved to accommodate their special characteristics. (3) In our collected reviews, we observed that sentences like "This place is a wonderful choice for family or friends to gather, but not for a couple" offer both positive and negative opinions about an aspect in different contexts. Our current aspect-context relation identification method (see Sect. 4.1, step four), which correlates an aspect-level opinion with all of the contexts expressed in a sentence, cannot identify the negative opinion about "place" in the context couple in the above example. In addition, considering that adverbs can be used as intensifiers to strengthen or soften opinions, they could be treated differently from adjectives in the process of determining opinion orientation. (4) Our current recommendation algorithm simply averages the scores of all of an item's reviews to predict the item's interest score for the target user, irrespective of how many reviews are aggregated.

7 Conclusions and future work

In this paper, we have sought to enhance service recommender systems by leveraging users' aspect-level contextual preferences (i.e., context-dependent preferences). For this purpose, we have investigated three variations of contextual weighting methods based on different text feature selection strategies: MI, IG, and CHI. All three strategies analyze the relation between an aspect's frequency (weighted by its related terms' relative importance) and a context value. We have further derived users' context-independent preferences from reviews. In particular, to support both new users and repeated users, we investigated two regression models for deriving context-independent preferences: the linear regression model (LRM) and the probabilistic regression model (PRM). Finally, we proposed a linear-regression-based algorithm that uses a stochastic gradient descent learning procedure to automatically fuse the two types of preferences into the process of generating recommendations.

We tested the proposed method on two real-life service datasets and demonstrated that it outperforms related techniques in terms of recommendation accuracy. In summary, we have found that (1) it is helpful to correlate users' opinions with contextual factors by performing contextual review analysis; (2) the accuracy of a user's profile can be increased by combining context-dependent and context-independent preferences; and (3) aspect-related terms are important for discriminating users' aspect-level preferences in different contexts. Our work thus highlights the merit of deriving users' aspect-level contextual preferences from reviews, and the effect of our proposed linear-regression-based algorithm on improving recommendation accuracy. As mentioned above (Sect. 6.2), we believe that our algorithm can benefit recommender systems in ubiquitous computing: the system can automatically sense a user's current contexts through her/his mobile devices and then match those contexts to her/his aspect-level contextual preferences (as derived from her/his reviews of items such as hotels and restaurants) to produce accurate recommendations in real time.

In the future, we will continue to improve our approach as follows. (1) We will conduct user evaluations to empirically validate the practical benefits of our recommendation algorithm for online users. (2) We will address the limitations of our method (as discussed in Sect. 6.3). For instance, we may combine a matrix factorization model with LDA (McAuley and Leskovec 2013) to process items with few reviews, and we may improve the accuracy of aspect-context relation identification by adopting specific linguistic rules (Ding et al. 2008). We will also take the number of reviews into account when aggregating them to compute an item's prediction score. (3) It will be interesting to investigate reviewers' aspect-level comparative opinions, such as "The bed was comfortable but not as good as that in the Four Seasons Hotel" (Zhang et al. 2010). Intuitively, comparative opinions can reveal a user's preference for one item over others with regard to certain aspects; this motivates us to combine them with contextual opinions to further improve our recommendation algorithm.