Synonyms

Incremental learning; Online machine learning on streams; Stream learning

Definitions

  • Data stream algorithms process a continuous stream of data with only a limited possibility to store past records.

  • Online machine learning covers methods that update their models after observing a new event and can immediately serve predictions based on the updated model.

Overview

In this chapter, we investigate online learning based recommender algorithms that can efficiently handle nonstationary datasets. We show that online learning for recommendation is the usual rather than the exceptional task: for example, if no user history is available, we have to build a user model on the fly, based on the interactions in the live user session.

To the best of our knowledge, this is the first survey with a comprehensive overview of the ideas for recommendation over streaming data and their implementation in various distributed data stream processing systems.

The chapter is based on the notions of online learning, as introduced in the chapter “Overview of Online Machine Learning in Big Data Streams” of this Handbook.

Introduction

Recommender systems (Ricci et al. 2011) serve to predict user preferences regarding items such as music tracks (Spotify), movies (Netflix), products, books (Amazon), blogs, or microblogs (Twitter), as well as content on friends’ and personal news feeds (Facebook).

Recommenders give a clear, industry-relevant example of the requirements for online machine learning introduced in the chapter “Overview of Online Machine Learning in Big Data Streams” of this Handbook. In a typical implementation, users interact with the system by requesting recommendations and then providing feedback by clicking on some of the displayed items. In this way, the users produce a continuous stream of events that can be used for model update and immediate evaluation, for example, by click-through rate. Note that in Žliobaite et al. (2012), it is observed that adaptive learning models are still rarely deployed in industry. Recommender systems are the main exception: A special class, the session-based recommendation task, appears frequently in practice when no past user history is available and user models are built on the fly, using recent interactions in the session.

Recommender systems can be categorized by the type of information they infer about users and items. Collaborative filtering (Linden et al. 2003; Sarwar et al. 2001) builds models of past user-item interactions such as clicks, views, purchases, or ratings, while content-based filtering (Lops et al. 2011) recommends items that are similar in content, for example, share phrases in their text description. Context-aware recommenders (Adomavicius and Tuzhilin 2011) use additional information on the user and the interaction, for example, user location and weather conditions. Recent events in a user session (Koenigstein and Koren 2013) serve as a special context.

The Netflix Prize competition (Bennett and Lanning 2007) was a milestone that strongly shaped research directions in recommendation algorithms. The target of the contest was to predict the one- to five-star ratings given by users, with one part of the data used for model training and the other for evaluation. As an impact of the competition, the task now termed batch rating prediction came to dominate research results.

Recommendation models rely on the feedback provided by the user, which can be explicit, such as one- to five-star movie ratings on Netflix (Adhikari et al. 2012). However, most recommendation tasks are implicit, as the user provides no like or dislike information. Implicit feedback can be available in the form of time elapsed viewing an item or listening to a song, or in many cases, solely as a click or some other form of user interaction. In Pilászy et al. (2015), the authors claim that 99% of recommendation industry tasks are implicit.

As a main difference between recommendation and classification, classifiers usually work independently of the event whose outcome they predict. Recommender systems, on the other hand, may directly influence observations: They present a ranked top list of items (Deshpande and Karypis 2004), and the user can only provide feedback for the items on the list. Moreover, real systems process data streams where users request one or a few items at a time and get exposed to new information that may change their needs and taste when they return to the service next time. Furthermore, an online trained model may change and return completely different lists for the same user even for interactions very close in time.

By the above considerations, real recommender applications fall in the category of top item recommendation by online learning for implicit user feedback, a task that has received less attention in research so far. In this section, we show the main differences in evaluating such systems compared to both classifiers and batch systems, as well as describe the main data stream recommender algorithms.

Online recommenders seem more restricted than those that can iterate over the data set several times, and one could expect inferior quality from the online methods. By contrast, in Pálovics et al. (2014) and Frigó et al. (2017), surprisingly strong performance of online methods is measured.

As an early time-aware recommender system example, the item-based nearest neighbor (Sarwar et al. 2001) can be extended with time decay (Ding and Li 2005). Most of the early models, however, are time-consuming to compute and difficult to update from a data stream and hence need periodic batch training. Probably the first result in this area, the idea of processing transactions in chronological order to incrementally train a recommendation model, appeared in Takács et al. (2009, Section 3.5). Streaming gradient descent matrix factorization methods were also proposed in Isaacman et al. (2011) and Ali and Johnson (2011), who use Netflix and MovieLens data and evaluate by root mean square error (RMSE).

The difficulty of evaluating streaming recommenders was first mentioned in Lathia et al. (2009), although the authors evaluated models by offline training and testing split. Ideas for online evaluation metrics appeared first in Pálovics and Benczúr (2013), Vinagre et al. (2014), and Pálovics et al. (2014). In Vinagre et al. (2014), incremental algorithms are evaluated using recall. In Pálovics et al. (2014), recall is shown to have undesirable properties, and other metrics for evaluating online learning recommenders are proposed.

Finally, we note that batch distributed recommender systems were surveyed in Karydi and Margaritis (2016).

Prequential (Online) Evaluation for Recommenders

To train and evaluate a time-sensitive or online learning recommender, we can use the prequential or online evaluation framework that is described in detail for classifier evaluation in the chapter “Online Machine Learning Algorithms over Data Streams” of this Handbook. As seen in Fig. 1, online evaluation for a recommender system includes the following steps:

  1. We query the recommender for a top-k recommendation for the active user.

  2. We evaluate the list in question against the single relevant item that the user interacted with.

  3. We allow the recommender to train on the revealed user-item interaction.

Recommender Systems Over Data Streams, Fig. 1: Prequential evaluation of the online ranking prediction problem

Since we can potentially retrain the model after every new event, the recommendation for the same user may be very different even at close points in time, as seen in Fig. 1. The standard recommender evaluation settings used in research cannot be applied, since there is always only a single relevant item in the ground truth.
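As an illustration, a minimal Python sketch of this prequential loop is given below; the `recommend` and `update` methods stand for a hypothetical recommender interface and are not part of any specific library.

```python
# Minimal sketch of prequential (online) evaluation for a recommender.
# The recommend/update interface is a hypothetical assumption, not a library API.

def prequential_evaluation(model, events, k=10):
    """Replay a chronological stream of (user, item) interactions."""
    hits = 0
    total = 0
    for user, item in events:                 # events must be ordered by time
        top_k = model.recommend(user, k)      # 1. query a top-k list for the active user
        hits += 1 if item in top_k else 0     # 2. evaluate against the single relevant item
        total += 1
        model.update(user, item)              # 3. train on the revealed interaction
    return hits / total if total else 0.0     # average click-through@k over the stream
```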

In one of the possible recommender evaluation settings, the rating prediction problem, which is popular in research, we consider a user u and an item i. The actual user preference in connection with the item is expressed as a value rui, for which the system returns a prediction \(\hat {r}_{ui}\). This explicit rating can be a scale such as one to five stars for a Netflix movie, while implicit rating can be the duration of viewing a Web page in seconds. Implicit rating is binary when the only information is whether the user interacted with the item (clicked, viewed, purchased) or not. Depending on whether rui is binary or scale, the same prequential metrics, such as error rate or mean squared error (MSE), can be applied as for classification or regression. For example, in the Netflix Prize competition, the target was the square root of MSE between the predicted and actual ratings.
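For rating prediction, the prequential MSE can be maintained incrementally over the stream; the small sketch below is illustrative and assumes predictions and actual ratings arrive one by one.

```python
class RunningMSE:
    """Incrementally maintained prequential mean squared error (sketch)."""

    def __init__(self):
        self.n = 0
        self.sse = 0.0

    def add(self, predicted, actual):
        self.n += 1
        self.sse += (predicted - actual) ** 2
        return self.sse / self.n          # current prequential MSE over the stream
```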

Another possible way to evaluate recommenders is ranking prediction, where performance metrics depend on the list of displayed items. We note that given rating prediction values \(\hat {r}_{ui}\) for all i, in theory, ranking prediction can be solved by sorting the relevance score of all items. For certain models, heuristics to speed up the selection of the highest values of \(\hat {r}_{ui}\) by candidate preselection exist (Teflioudi et al. 2015).

To evaluate ranking prediction, we have to take into consideration two issues that do not exist for classifier evaluation. In the case of prequential evaluation, as shown in Fig. 1, the list for user u may change potentially after every interaction with u. As soon as u provides feedback for certain item i, we can change model parameters and the set of displayed items may change completely. Most of the batch ranking quality measures focus on the set of items consumed by the same user, under the assumption that the user is exposed to the same list of items throughout the evaluation. As this assumption does not hold, we need measures for individual user-item interactions.

Another issue regarding ranking prediction evaluation lies in a potential user-system interaction that affects quality scores. Typically, the set of items is very large, and users are only exposed to a relatively small subset, which is usually provided by the system. The form of user feedback is usually a click on one or more of these items, which can be evaluated by computing the click-through rate. Since users cannot give feedback on items outside the list, the fair comparison of two algorithms that present different sets for the user can only be possible by relying on live user interaction. This fact is known by practitioners, who use A/B testing to compare the performance of different systems. In A/B testing, the live set of users is divided into groups that are exposed to the results of the different systems.

Most traditional ranking prediction metrics, to a certain level, rely on the assumption that the same user is exposed to the same list of items, and hence the interactions of the same user can be considered to be the unit for evaluation. For online evaluation, as noted in Pálovics et al. (2014), the unit of evaluation will be a single interaction, which usually contains a single relevant item. Based on this modification, most batch metrics apply in online learning evaluation as well. Note that the metrics below apply not just in A/B testing but also in experiments with frozen data, where user feedback is not necessarily available for the items returned by a given algorithm. For example, if the item consumed by the user in the frozen data is not returned by the algorithm, the observed relevance will be 0, which may not be the case if the same algorithm is applied in an A/B test. Note that attempts to evaluate research results by A/B testing have been made in the information retrieval community (Balog et al. 2014); however, designing and implementing such experiments is cumbersome.

Next, we list several metrics for the quality of the ordered top-K list of items L = {i1, i2, …, iK} against the items E consumed by the user. We will also explain how online evaluation metrics differ from their batch counterparts. For the discussion, we mostly follow Pálovics et al. (2014).

Click-through rate is commonly used in the practice of recommender evaluation. It is the fraction of recommendation lists L that receive a click; for a single interaction, it is defined as

$$\displaystyle \begin{aligned} \mbox{Clickthrough@K} = \begin{cases} 1 & \mbox{if } E \cap L \neq \emptyset; \\ 0 & \mbox{otherwise}. \end{cases} \end{aligned} $$
(1)

For precision and recall, similar to click-through, the actual order within L is unimportant:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \mbox{Precision@K} &\displaystyle = \frac{|E \cap L|}{K},\\ \mbox{Recall@K} &\displaystyle = \frac{|E \cap L|}{|E|}. \end{array} \end{aligned} $$
(2)

For batch evaluation, E is the entire set of items with positive feedback from a given user who is exposed to the same L for each interaction. The overall batch system performance can be evaluated by averaging precision and recall over the set of users. For online evaluation, typically |E| = 1, where Precision@K is 0 or 1∕K and Recall@K is 0 or 1 depending on whether the actual item in E is listed in L or not. Precision and recall are hence identical to click-through, up to a constant. As a consequence, the properties of online precision and recall are very different from their batch counterparts. The main reason for the difference lies in the averaging procedure of prequential evaluation: We cannot merge the events of the same user; instead, we average over the set of individual interactions.
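The per-interaction versions of these set-based metrics are straightforward to compute; the sketch below assumes `ranked_list` is the ordered top-K list L and `relevant` is the set E.

```python
def clickthrough_at_k(ranked_list, relevant):
    """1 if any item of E appears in the top-K list L, 0 otherwise (Eq. 1)."""
    return 1.0 if any(i in relevant for i in ranked_list) else 0.0

def precision_at_k(ranked_list, relevant):
    """Fraction of the K displayed items that are relevant (Eq. 2)."""
    hits = sum(1 for i in ranked_list if i in relevant)
    return hits / len(ranked_list)

def recall_at_k(ranked_list, relevant):
    """Fraction of the relevant items that appear in the top-K list (Eq. 2)."""
    hits = sum(1 for i in ranked_list if i in relevant)
    return hits / len(relevant)

# In online evaluation |E| = 1, so precision is 0 or 1/K and recall is 0 or 1.
```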

Measures that consider the position of the relevant item i in L can give more refined performance indication. The first example is reciprocal rank:

$$\displaystyle \begin{aligned} RR@K = \begin{cases} 0 & \mbox{if rank}{}(i) > K; \\ \frac{\displaystyle 1}{\displaystyle \mbox{rank}(i)} & \mbox{otherwise}. \end{cases} \end{aligned} $$
(3)

Discounted cumulative gain (DCG) is defined similarly, as

$$\displaystyle \begin{aligned} \mbox{DCG@K} = \displaystyle\sum_{k=1}^K \frac{\mbox{rel}(i_k)}{\log_2(1 + k)} {} \end{aligned} $$
(4)

where rel(ik) indicates the relevance of the k-th item in the list. For the implicit task, relevance is 1 if the user interacted with the item in the evaluation set and 0 otherwise. For batch evaluation, we can consider all interactions of the same user as one unit. If we define iDCG@K, the ideal maximum possible value of DCG@K for the given user, we can obtain nDCG@K, the normalized version of DCG@K, as

$$\displaystyle \begin{aligned} \mbox{nDCG@K} = \frac{\mbox{DCG@K}}{\mbox{iDCG@K}}. {} \end{aligned} $$
(5)

Note that for online learning, there is only one relevant item, hence iDCG@K = 1. For emphasis, we usually use the name nDCG for batch and DCG for online evaluation.
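The rank-aware metrics of Eqs. (3) and (4) can likewise be computed per interaction, as in the sketch below; with a single relevant item, DCG@K coincides with nDCG@K.

```python
import math

def reciprocal_rank_at_k(ranked_list, relevant_item):
    """1/rank of the relevant item, or 0 if it is outside the top-K list (Eq. 3)."""
    if relevant_item in ranked_list:
        return 1.0 / (ranked_list.index(relevant_item) + 1)
    return 0.0

def dcg_at_k(ranked_list, relevant):
    """Discounted cumulative gain with binary relevance (Eq. 4)."""
    return sum(
        (1.0 if item in relevant else 0.0) / math.log2(1 + position)
        for position, item in enumerate(ranked_list, start=1)
    )
```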

Session-Based Recommendation

Previous items in user sessions constitute a very important context (Hidasi and Tikk 2016). In e-commerce, the same user may return next time with a completely different intent and may want to see a product category completely different from the previous session. Algorithms that rely on recent interactions of the same user are called session-based item-to-item recommenders. The user session is a special context, and it is the only information available to an item-to-item recommender. In fact, several practitioners (Koenigstein and Koren 2013; Pilászy et al. 2015) argue that most of the recommendation tasks they face come without sufficient past user history. For example, users are often reluctant to create logins and prefer to browse anonymously. Moreover, they purchase certain types of goods (e.g., expensive electronics) so rarely that their previous purchases will be insufficient to create a meaningful user profile. Whenever a long history of previous activities or purchases by the user is not available, recommenders may propose items that are similar to the most recent ones viewed in the actual user session.

Session-based recommendation can be served by very simple algorithms, most of which are inherently online. A comparison of the most important such online algorithms in terms of performance is available in Frigó et al. (2017). Data stream processing algorithms can retain items from the most recently started sessions as long as they fit in their memory. Recommendation is based on the recent items viewed by the user in the actual shopping session. For example, we can record how often users visited item i after visiting another item j. Since fast update to transition frequencies is usually possible, the method is online.
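A minimal sketch of such a transition-frequency item-to-item recommender is shown below; the `(session_id, item)` event format and the class interface are illustrative assumptions, not a specific system's API.

```python
from collections import defaultdict

class TransitionRecommender:
    """Session-based item-to-item recommender built on transition counts (sketch)."""

    def __init__(self):
        self.transitions = defaultdict(lambda: defaultdict(int))  # item j -> item i -> count
        self.last_item = {}                                       # session -> last viewed item

    def update(self, session_id, item):
        prev = self.last_item.get(session_id)
        if prev is not None:
            self.transitions[prev][item] += 1      # fast, purely incremental update
        self.last_item[session_id] = item

    def recommend(self, session_id, k=10):
        prev = self.last_item.get(session_id)
        successors = self.transitions.get(prev, {})
        return sorted(successors, key=successors.get, reverse=True)[:k]
```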

In an even simpler algorithm that is not strictly session-based, we recommend the most popular recent items. This method can be considered batch or online depending on the granularity of the item frequency measurement update. Both algorithms can be personalized if we consider the frequency of past events involving the user. If items are arranged hierarchically (e.g., music tracks by artist and genre), personal popularity and personal session data can involve the frequency of the artists or genres for recommending tracks. More session-based algorithms are described in Koenigstein and Koren (2013).
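The recent-popularity baseline can be sketched with exponentially decayed item counts; the decay factor below is an illustrative choice, and the per-event full decay is kept simple rather than efficient.

```python
from collections import defaultdict

class RecentPopularityRecommender:
    """Recommends globally popular recent items via exponentially decayed counts (sketch)."""

    def __init__(self, decay=0.999):
        self.decay = decay
        self.counts = defaultdict(float)

    def update(self, item):
        for key in self.counts:              # naive full decay; fine for a sketch
            self.counts[key] *= self.decay
        self.counts[item] += 1.0

    def recommend(self, k=10):
        return sorted(self.counts, key=self.counts.get, reverse=True)[:k]
```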

Online Matrix Factorization

Most nontrivial online recommender algorithms are based on matrix factorization (Koren et al. 2009), a popular class of collaborative filtering methods. Given the user-item utility matrix R = [rui] shown in Fig. 2, we model R by decomposing it into the two dense matrices P and Q. For a given user u, the corresponding row in P is user vector pu. Similarly, for item i, the corresponding column of Q is item vector qi. The predicted relevance of item i for user u is then

$$\displaystyle \begin{aligned} \hat{r}_{ui} = p_u q_i^T. {} \end{aligned} $$
(6)

Note that we can extend the above model by scalar terms that describe the biased behavior of the users and the items (Koren et al. 2009).

Recommender Systems Over Data Streams, Fig. 2: Utility matrix R and the matrix factorization model built from matrices P and Q

One possibility to train the model parameter matrices P and Q is gradient descent (Koren et al. 2009; Funk 2006), which can be applied to online learning as well (Pálovics et al. 2014). For a set of interactions E, we fit the model of Eq. (6) with MSE as the target function:

$$\displaystyle \begin{aligned} MSE = \frac{1}{N} \displaystyle\sum_{i=1}^N ( {\hat{r}_{u i}} - r_{u i} ) ^ 2, {} \end{aligned} $$
(7)

where rui is the actual and \({\hat {r}_{u i}}\) is the predicted rating for user u and item i and N is the current size of the data stream.

In one step of gradient descent, we fit P and Q in Eq. (6) to one of the ratings in E. Unlike in batch training, where we can use the ratings several times in any order, in online learning, E contains only the single most recent event. In other words, in online gradient descent, we fit the model to the events one by one as they arrive in the data stream.

For a given (explicit or implicit) rating rui, the steps of gradient descent are as follows. First, we compute the gradient of objective function F with respect to the model parameters:

$$\displaystyle \begin{aligned} \frac{\partial F}{\partial p_u} = - 2 (r_{ui} - \hat{r}_{ui}) q_i, \quad \frac{\partial F}{\partial q_i} = - 2 (r_{ui} - \hat{r}_{ui}) p_u. \end{aligned} $$
(8)

Next, we update the model parameters in the opposite direction of the gradient, proportionally to the learning rate η, as

$$\displaystyle \begin{aligned} \begin{array}{rcl} p_u &\displaystyle \leftarrow&\displaystyle p_u + \eta (r_{ui} - \hat{r}_{ui}) q_i, \\ q_i &\displaystyle \leftarrow&\displaystyle q_i + \eta (r_{ui} - \hat{r}_{ui}) p_u. \end{array} \end{aligned} $$

Overfitting is usually avoided by adding a regularization term in the objective function (Koren et al. 2009).
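Putting the above together, a minimal sketch of online SGD training of the factorization, including the regularization term, could look as follows; the hyperparameter values and the dictionary-based storage of user and item vectors are illustrative assumptions.

```python
import numpy as np

class OnlineMF:
    """Online matrix factorization trained by SGD, one event at a time (sketch)."""

    def __init__(self, n_factors=10, learning_rate=0.05, reg=0.01, seed=0):
        self.k, self.lr, self.reg = n_factors, learning_rate, reg
        self.rng = np.random.default_rng(seed)
        self.P, self.Q = {}, {}                  # user -> p_u, item -> q_i

    def _vec(self, table, key):
        if key not in table:                     # new users/items are added on the fly
            table[key] = 0.1 * self.rng.standard_normal(self.k)
        return table[key]

    def predict(self, user, item):
        return float(self._vec(self.P, user) @ self._vec(self.Q, item))   # Eq. (6)

    def update(self, user, item, rating):
        p, q = self._vec(self.P, user), self._vec(self.Q, item)
        err = rating - p @ q                     # r_ui - predicted rating
        self.P[user] = p + self.lr * (err * q - self.reg * p)   # regularized SGD step
        self.Q[item] = q + self.lr * (err * p - self.reg * q)
```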

In the case of implicit feedback, the known part of the utility matrix only contains elements with positive feedback. To fit a model, one requires negative feedback for training as well. Usually, such elements are selected by sampling from those that the user has not interacted with before (Rendle and Freudenthaler 2014). We can also introduce confidence values for ratings and consider lower confidence for the artificial negative events (Hu et al. 2008).
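A simple way to draw such artificial negatives is uniform sampling over items the user has not interacted with, as in the sketch below; more refined, popularity-biased sampling schemes are often preferred in practice.

```python
import random

def sample_negatives(positive_items, all_items, n_samples=5, rng=random):
    """Draw items the user has not interacted with, to be used as
    low-confidence negative feedback (uniform sampling kept simple for the sketch)."""
    negatives = []
    while len(negatives) < n_samples:
        item = rng.choice(all_items)
        if item not in positive_items:
            negatives.append(item)
    return negatives
```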

Gradient descent can also be used in a mix of batch and online learning, for example, training batch models from scratch periodically and continuing the training with online learning. We can also treat users and items differently, for example, updating user vectors more dynamically than item vectors, as first suggested by Takács et al. (2009).

Another use of online gradient descent is to combine different recommendation models (Pálovics et al. 2014). We can express the final prediction as a linear combination of the models in the ensemble, whose parameters are the linear coefficients together with the individual model parameters. In Pálovics et al. (2014), two online gradient descent methods are described, depending on whether the derivatives of the individual models are available: if they are, all parameters can be trained through the derivative of the final combined model; otherwise, the linear coefficients and the individual models are learned separately.
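A sketch of the second case, in which the combination weights are learned by online gradient descent while the base models are updated separately, is given below; the base-model `predict`/`update` interface is a hypothetical assumption.

```python
class OnlineLinearEnsemble:
    """Combines base recommender scores with weights learned online (sketch)."""

    def __init__(self, base_models, learning_rate=0.01):
        self.models = base_models
        self.weights = [1.0 / len(base_models)] * len(base_models)
        self.lr = learning_rate

    def predict(self, user, item):
        return sum(w * m.predict(user, item)
                   for w, m in zip(self.weights, self.models))

    def update(self, user, item, rating):
        scores = [m.predict(user, item) for m in self.models]
        err = rating - sum(w * s for w, s in zip(self.weights, scores))
        # gradient step on the squared error w.r.t. the combination weights
        self.weights = [w + self.lr * err * s
                        for w, s in zip(self.weights, scores)]
        for m in self.models:                 # base models are trained separately
            m.update(user, item, rating)
```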

Variants of Matrix Factorization

Several variants of matrix factorization that can be trained by gradient descent both for batch and online learning tasks have been proposed. Bayesian Personalized Ranking (Rendle et al. 2009) has top list quality as target instead of MSE. In asymmetric matrix factorization (Paterek 2007), we model the user by the sum of the item vectors the user rated in the past.
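As an illustration of the ranking-based target, a single Bayesian Personalized Ranking style SGD step on a (user, positive item, sampled negative item) triple may look as in the sketch below; the hyperparameter values are illustrative.

```python
import numpy as np

def bpr_update(p_u, q_i, q_j, lr=0.05, reg=0.01):
    """One BPR-style SGD step: push positive item i above sampled negative item j (sketch)."""
    x_uij = p_u @ (q_i - q_j)                 # score difference for the item pair
    sig = 1.0 / (1.0 + np.exp(x_uij))         # sigma(-x), gradient of ln sigma(x)
    p_u_new = p_u + lr * (sig * (q_i - q_j) - reg * p_u)
    q_i_new = q_i + lr * (sig * p_u - reg * q_i)
    q_j_new = q_j + lr * (-sig * p_u - reg * q_j)
    return p_u_new, q_i_new, q_j_new
```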

Recently, various factorization models have been developed that incorporate context information (Hidasi and Tikk 2016). Context data can be modeled by introducing a data tensor D instead of the rating matrix R. In the simplest case, the data includes a single piece of additional context information (Rendle and Schmidt-Thieme 2010): for example, music tracks can have their artist as context.

Alternating least squares (ALS) (Koren et al. 2009; Pilászy et al. 2010) is another optimization method for matrix factorization models, in which for a fixed Q, we compute the optimal P, then for a fixed P, the optimal Q, repeatedly until certain stopping criteria are met. Hidasi et al. (Hidasi and Tikk 2012; Hidasi 2014; Hidasi and Tikk 2016) introduced several variants of ALS-based optimization schemes to incorporate context information. By incremental updating, ALS can also be used for online learning (He et al. 2016).
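For reference, one half-step of ALS recomputes a user vector in closed form while the item factors are held fixed; the sketch below assumes `Q_u` stacks the vectors of the items rated by the user and `r_u` holds the corresponding ratings.

```python
import numpy as np

def als_user_step(Q_u, r_u, reg=0.1):
    """Solve (Q_u^T Q_u + reg*I) p_u = Q_u^T r_u for the user vector p_u (sketch)."""
    k = Q_u.shape[1]
    A = Q_u.T @ Q_u + reg * np.eye(k)   # regularized normal equations
    b = Q_u.T @ r_u
    return np.linalg.solve(A, b)
```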

Conclusions

Recommendation differs from classification in that there are two types of objects, users and items, and a prediction has to be made for their interaction. A practical recommender system displays a ranked list of a few items for which the user can give feedback. In an online learning system, the list shown to the same user at different times may change completely for two reasons. First, as in the prequential classifier training and evaluation setting described in detail in the chapter “Online Machine Learning Algorithms over Data Streams” of this Handbook, the list of recommendations may change because the model changes. Second, the user feedback we use for evaluation depends on the actual state of the model, since the user has no means to express interest in an item that is not displayed. Hence, for online learning evaluation, metrics that involve the notion of a volatile list have to be used.

Online learning, as introduced in the chapter “Overview of Online Machine Learning in Big Data Streams” of this Handbook, is very powerful for recommender systems because it places much more emphasis on recent events. For example, if we update models immediately for newly emerged users and items, trends are detected immediately. The power of online learning for recommendation may also result from updating user models with emphasis on recent events, which may be part of the current user session. The user session is a highly relevant context for recommendation, and most session-based methods are inherently online.

Cross-References