1 Introduction

In many real-world recommender systems, user feedback arrives continuously, at unpredictable rates and in unpredictable order, and is potentially unbounded. In large-scale systems, this feedback can arrive very fast. Building predictive models from such continuous flows of data is a problem actively studied in the field of data stream mining. Ideally, algorithms that learn from data streams should process data at least as fast as it arrives, in a single pass, while maintaining an always-available model [3]. Most incremental algorithms naturally have these properties, and are thus a viable solution.

Incremental algorithms for recommendation also treat user feedback data as a data stream, immediately incorporating new data into the recommendation model. In many – if not most – recommendation applications this is a desirable feature, since it allows the model to evolve over time. This matters because the task of a recommender system is to find the most relevant items for each user, individually. Naturally, users are human beings, whose preferences change over time. Moreover, in large-scale systems, new users and items continually enter the system. A model that is immediately updated with fresh data can adjust faster to such changes.

1.1 Related Work

Ensemble methods in machine learning are convenient techniques for improving the accuracy of algorithms, typically by combining the results of a number of weaker sub-models. Bagging [1], Boosting [4] and Stacking [13] are three well-known ensemble methods used with recommendation algorithms: boosting is explored in [2, 7, 9, 10], bagging is also studied in [7, 10], and stacking in [11]. In all of these contributions, ensemble methods work with batch learning algorithms only.

In this paper we propose online bagging for incremental recommendation algorithms designed to deal with streams of positive user feedback. To the best of our knowledge, this is the first ensemble method proposed in the literature for incremental recommender systems.

This paper is organized as follows. After this introductory section, we describe online bagging for common data mining tasks in Sect. 2. Section 3 describes our online bagging approach in recommendation problems. In Sect. 4 we present our experiments and results, along with a short discussion. Finally, we conclude in Sect. 5.

2 Online Bagging

Bagging [1] is an ensemble technique that takes a number of bootstrap samples of a dataset and trains a separate model on each of the samples. Predictions from the various sub-models are then aggregated into a final prediction. This is known to improve the performance of algorithms by reducing variance, which is especially useful with unstable algorithms that are very sensitive to small changes in the data. The diversity obtained by training several models on slightly different bootstrap samples gives more importance to the main concepts being learned, since these must be present in most bootstrap samples, and less importance to noise or irrelevant phenomena that may mislead the learning algorithm.

To obtain a bootstrap sample of a dataset of size N, we perform N trials, each sampling a random example from the dataset with replacement. Each example has probability 1/N of being sampled at each trial. The resulting sample has the same size as the original dataset; however, some examples will not be present, while others will occur multiple times. To obtain M samples, we simply repeat the process M times.
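For concreteness, the batch procedure can be sketched in a few lines of Python. This is a minimal illustration of the sampling just described; the function name and toy dataset are ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_samples(dataset, M):
    """Draw M bootstrap samples, each with the same size N as the dataset."""
    N = len(dataset)
    samples = []
    for _ in range(M):
        # N trials; at each trial every example has probability 1/N
        # of being drawn, with replacement.
        idx = rng.integers(0, N, size=N)
        samples.append([dataset[j] for j in idx])
    return samples

# Toy usage: 3 bootstrap samples of a 5-example dataset.
print(bootstrap_samples(["a", "b", "c", "d", "e"], M=3))
```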

In its original proposal [1], bagging is a batch procedure requiring \(N \times M\) sampling trials over the dataset. However, it has been shown in [8] that this can be done incrementally in a single pass if the number of examples is very large – a natural assumption when learning from data streams. Looking at the batch method above, we observe that each bootstrap sample contains K occurrences of each example, with \(K \in \{0,1,2,\ldots \}\), and:

$$\begin{aligned} P(K = k) = \binom{N}{k} \left( \frac{1}{N}\right) ^{k} \left( 1 - \frac{1}{N}\right) ^{N-k} \end{aligned}$$
(1)

In an incremental setting, one could simply initialize M sub-models – or bootstrap nodes – and use (1) to determine how many times, K, to train on each new example, redrawing K for each node. The problem is that this would still require knowing N beforehand. However, if we assume that \(N \rightarrow \infty \), the distribution of K tends to a Poisson(1) distribution, and therefore

$$\begin{aligned} P(K = k) = \frac{e^{-1}}{k!} \end{aligned}$$
(2)

which eliminates the need for any prior knowledge about the data and allows bagging to be performed in a single pass over the data.
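The resulting procedure, in the spirit of the online bagging of [8], can be sketched as follows. The `learner_factory` argument and the sub-models' `update` interface are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class OnlineBagging:
    """Online bagging [8]: each incoming example is trained k times
    on each of the M sub-models, with k drawn from Poisson(1)."""

    def __init__(self, learner_factory, M):
        self.models = [learner_factory() for _ in range(M)]

    def update(self, example):
        for model in self.models:
            # Poisson(1) replaces Binomial(N, 1/N) as N -> infinity,
            # so no prior knowledge of N is needed.
            k = rng.poisson(1.0)
            for _ in range(k):
                model.update(example)
```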

3 Online Recommendation with Bagging

To assess the potential of online bagging, we use ISGD [12], a simple online matrix factorization algorithm for positive-only data. ISGD (Algorithm 1) uses Stochastic Gradient Descent in a single pass through the data, which is convenient for data stream processing. It is designed for positive-only streams of user-item pairs (u, i), each indicating a positive interaction between user u and item i. Examples of positive interactions are users buying items in an online store, streaming music tracks from an online music streaming service, or simply visiting web pages. This form of user feedback is much more widely available than, for example, ratings data, which only exists in systems with user rating features.

Algorithm 1. ISGD

ISGD continuously updates the factor matrices A – the user factors matrix – and B – the item factors matrix – correcting the model to adapt to the incoming user-item pairs. If (u, i) occurs in the stream, then the model prediction \(\hat{R}_{ui} = A_u \cdot B_i\) should be close to 1. Top-N recommendations for any user u are obtained by sorting all items i by the ranking function \(f = |1 - \hat{R}_{ui}|\) in ascending order and taking the top N items.
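As a rough illustration of how such an algorithm operates, the following Python sketch implements an ISGD-style update and ranking. The hyperparameter names (`eta`, `lam`), the random initialization scheme, and the dictionary-backed factor rows are our own simplifications, not prescribed by [12]:

```python
import numpy as np

rng = np.random.default_rng(0)

class ISGD:
    """Minimal sketch of ISGD [12]: one SGD step per incoming (u, i)
    pair, pushing the prediction A_u . B_i towards 1."""

    def __init__(self, n_factors=10, eta=0.05, lam=0.01):
        self.k, self.eta, self.lam = n_factors, eta, lam
        self.A, self.B = {}, {}  # user and item factor rows

    def update(self, u, i):
        # New users and items get small random factor vectors.
        if u not in self.A:
            self.A[u] = rng.normal(0, 0.1, self.k)
        if i not in self.B:
            self.B[i] = rng.normal(0, 0.1, self.k)
        a, b = self.A[u], self.B[i]
        err = 1.0 - a @ b  # the target for an observed pair is 1
        self.A[u] = a + self.eta * (err * b - self.lam * a)
        self.B[i] = b + self.eta * (err * a - self.lam * b)

    def recommend(self, u, N=20):
        # Rank items by f = |1 - R_ui| in ascending order; take the top N.
        f = {i: abs(1.0 - self.A[u] @ b) for i, b in self.B.items()}
        return sorted(f, key=f.get)[:N]
```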

The online bagging approach described in Sect. 2 can be easily applied to ISGD, resulting in Algorithm 2 – BaggedISGD.

Algorithm 2. BaggedISGD

BaggedISGD learns M models on M bootstrap nodes, each of them trained with the online bootstrap sampling method described in Sect. 2. As in ISGD, the actual list of recommendations for a user u is produced by sorting items i by the function \(f = |1 - \hat{R}_{ui}|\). However, the score \(\hat{R}_{ui}\) is now the average score over all nodes:

$$\begin{aligned} \hat{R}_{ui} = \frac{\sum _{m=1}^{M} A_u^m \cdot B_i^m}{M} \end{aligned}$$
(3)

At training time, this algorithm requires at least M times the computational resources of ISGD, given its M bootstrap nodes. Recommendation also has the overhead of aggregating the M predictions from the sub-models. In our experiments, we therefore also measure update and recommendation times for several values of M.

4 Evaluation

To simulate a streaming environment we need datasets that maintain the natural order of the data points, as they were generated. Additionally, we need positive-only data, since the tested algorithm is not designed to deal with ratings. We use 4 datasets that meet these two requirements – positive-only and naturally ordered – described in Table 1. ML1M is based on the Movielens-1M movie rating dataset. To obtain YHM-6KU, we randomly sample 6000 users from the Yahoo! Music dataset. LFM-50U is a subset consisting of a random sample of 50 users taken from the Last.fm dataset. PLC-STR consists of the music streaming history taken from Palco Principal, a Portuguese social network for non-mainstream artists and fans.

All 4 datasets consist of a chronologically ordered sequence of positive user-item interactions. However, ML1M and YHM-6KU are obtained from ratings datasets. To use them as positive-only data, we retain the user-item pairs whose rating is in the top 20% of the rating scale: ratings of 5 in ML1M and ratings of 80 or more in YHM-6KU. Naturally, these datasets contain only single occurrences of each user-item pair, since users do not rate the same item more than once. PLC-STR and LFM-50U contain multiple occurrences of the same user-item pairs.
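The thresholding step can be sketched as below; the tuple layout (user, item, rating, timestamp) is our own assumption about the raw data:

```python
def to_positive_only(ratings, threshold):
    """Keep chronological order and retain only the pairs whose rating
    is in the top 20% of the scale (5 for ML1M, >= 80 for YHM-6KU)."""
    return [(u, i) for (u, i, r, t) in sorted(ratings, key=lambda x: x[3])
            if r >= threshold]
```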

Table 1. Dataset description

We run a set of experiments using the prequential approach [5], as described in [12]. Each observation in the dataset is a user-item pair (u, i) indicating a positive interaction between user u and item i. The following steps are performed in the prequential evaluation process:

  1. If u is a known user, use the current model to recommend a list of items to u; otherwise go to step 3;

  2. Score the recommended list given the observed item i;

  3. Update the model with (u, i) (optionally);

  4. Proceed to – or wait for – the next observation.

This process is applicable both to algorithms that learn incrementally and to those that learn in batch mode, which is why step 3 is marked as optional. For example, instead of performing this step, the system can store the data and periodically retrain the model in batch.
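A minimal sketch of this loop, assuming the `recommend`/`update` interface of the ISGD sketch above (the known-user test via `model.A` is tied to that sketch, not to the paper's implementation):

```python
def prequential(model, stream, cutoff=20):
    """Prequential evaluation [5]: score first, then train, one
    observation at a time. Returns the overall hit ratio (Recall@cutoff)."""
    hits = evaluated = 0
    for u, i in stream:
        if u in model.A:                      # step 1: known users only
            recs = model.recommend(u, N=cutoff)
            hits += int(i in recs)            # step 2: score is 0 or 1
            evaluated += 1
        model.update(u, i)                    # step 3: incremental update
    return hits / evaluated if evaluated else 0.0
```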

Table 2. Average performance of ISGD with and without bagging. M is the number of bootstrap nodes. The last two columns contain the average update times and the average recommendation times.

To kickstart the evaluation process we use 10% of the available data to train a base model in batch, and the remaining 90% to perform incremental training and evaluation. This initial batch training avoids cold-start problems, which are not the subject of our research.

In our setting, items with which a user has already interacted – i.e. items the user already knows – are not recommended. This has one important implication for the prequential evaluation process, specifically on datasets that contain multiple occurrences of the same user-item pair: evaluation at these points is necessarily penalized, since the observed item will not be within the recommendations. In such cases, we bypass the scoring step, but still use the observation to update the model.

We measure two dimensions of the evaluation process: accuracy and time. In the prequential process described above, we need to make a prediction and evaluate it for every new user-item pair (u, i) that arrives in the data stream. To do this, we use the current model to recommend a list of items to user u and score this list by matching it against the actually observed item i. We use a recommendation list of at most 20 items and compute Recall@C with cutoffs \(C \in \{1,5,10,20\}\), considering only the top C items in the list: the score is 1 if i is within the top C recommended items, and 0 otherwise. For example, Recall@1 only checks whether the first item in the list matches the observed item i. Since, regardless of C, there is only one item to test against the list, Recall@C at each step can only take the values \(\{0,1\}\). The overall Recall@C is obtained by averaging the scores over all steps, which in practice gives us the hit ratio; it can also be depicted using a moving average. We additionally measure the update time, in milliseconds, at every step, which can be depicted using a moving average as well.
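The moving-average smoothing used in the figures below can be sketched as follows; the fallback to the accumulated average for the first n points matches the figure captions, but the implementation itself is our own illustration:

```python
from collections import deque

def moving_average(scores, n=10_000):
    """Moving average of the 0/1 Recall@C scores, as plotted in
    Figs. 1-4; the first n points use the accumulated average."""
    window, total, out = deque(), 0.0, []
    for s in scores:
        window.append(s)
        total += s
        if len(window) > n:
            total -= window.popleft()
        out.append(total / len(window))
    return out
```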

All experiments were run on Intel Haswell 4-core machines running CentOS Linux 7 (64 bit). The algorithms and the prequential evaluation code are implemented on top of MyMediaLite [6]. The recommendation step is implemented with multi-core code – predictions from the nodes are computed in parallel.

4.1 Results

To evaluate bagging, we experiment with four numbers of bootstrap nodes, \(M \in \{8,16,32,64\}\). Table 2 summarizes the results of our experiments; its values are obtained by averaging the Recall and time measured at all prequential evaluation steps. With all datasets except YHM-6KU, bagging improves Recall, especially with \(M \ge 32\). One interesting observation is that bagging has a bigger influence at higher Recall cutoffs, which suggests that the improvements in predictive ability are typically not obtained in the top 5 recommended items.

The model update times increase approximately in proportion to the number of bootstrap nodes M, which is not surprising, since the algorithm performs the update operations once (on average) in each of the M bootstrap nodes. However, since the baseline update time is very small, this overhead is also small. The last column of Table 2 contains the recommendation time, i.e. the average time required to produce a recommendation list. The bagging algorithm needs to aggregate predictions coming from all M nodes, which is an important overhead. Results show that both the update and recommendation times increase proportionally to M; however, the recommendation step is a far more expensive operation, even when computed in parallel. For example, using \(M = 64\) with LFM-50U and YHM-6KU, recommendations take nearly two seconds on average on 4-core machines, which can reasonably be considered too slow for many applications.

A useful feature of prequential evaluation is that it also allows us to depict the evolution of Recall over time. In Figs. 1, 2, 3 and 4, we depict the evolution of Recall@C with \(C \in \{1,5,10,20\}\). This visualization reveals how the predictive ability of the algorithm evolves as the incremental learning process unfolds.

Fig. 1. Prequential evaluation of Recall@1 with ISGD with and without bagging. Lines are drawn using a moving average of Recall@1 with \(n = 10 000\). The first 10 000 points are drawn using the accumulated average.

Fig. 2. Prequential evaluation of Recall@5 with ISGD with and without bagging. Lines are drawn using a moving average of Recall@5 with \(n = 10 000\). The first 10 000 points are drawn using the accumulated average.

Fig. 3. Prequential evaluation of Recall@10 with ISGD with and without bagging. Lines are drawn using a moving average of Recall@10 with \(n = 10 000\). The first 10 000 points are drawn using the accumulated average.

Fig. 4. Prequential evaluation of Recall@20 with ISGD with and without bagging. Lines are drawn using a moving average of Recall@20 with \(n = 10 000\). The first 10 000 points are drawn using the accumulated average.

4.2 Discussion

Results in Table 2 and Figs. 1, 2, 3 and 4 show that bagging is able to improve the accuracy of ISGD, with improvements of over 35% over the baseline (see Table 2, LFM-50U and ML1M). This improvement is mainly observable at Recall cutoffs \(C \ge 5\). Given that bagging reduces variance [1], this suggests that ISGD is more stable in the top few recommendations. Another observation is that the improvements are not consistent across all datasets. With YHM-6KU, for example, bagging only slightly outperforms the baseline ISGD – and only with \(M \ge 32\) – while with PLC-STR the improvement is proportionally much higher, even with lower M.

Another observation, particularly visible in the lines plotted in Figs. 1, 2, 3 and 4, is that as we increase the number of nodes M, the marginal improvement becomes smaller. In almost all experiments, regardless of the Recall cutoff and the dataset, the improvement achieved when doubling M from 16 to 32 is larger than the improvement obtained when doubling M from 32 to 64, although the computational overhead of the latter is twice as high. The optimal number of nodes therefore depends on the desired trade-off between accuracy and computational cost. Note that with some datasets – e.g. YHM-6KU – improvements may only be obtained with a relatively large M.

It is also clear that the time overheads grow linearly with the number of bootstrap nodes. However, the overhead in model update times is not very relevant in practice, given that the baseline update times of ISGD are very low – with \(M = 64\) the highest update time stays below 400 ms. The overhead at recommendation time, when aggregating the results from the M bootstrap nodes, is more evident. Fortunately, as with most ensemble techniques, parallel processing can be trivially used to alleviate this overhead. Additionally, there may be room for code optimization or for approximate methods that require fewer or more efficient computations.

5 Conclusions

Bagging is an ensemble technique successfully used with many machine learning algorithms; however, it has not been thoroughly studied in recommendation problems, particularly with incremental algorithms. In this paper, we experiment with online bagging using an incremental matrix factorization algorithm that learns from unbounded streams of positive-only data. Our results suggest that, with manageable overheads, accuracy clearly improves – by more than 35% in some cases – especially as the number of recommended items increases. In the near future, we intend to experiment with this and other online ensemble methods on a larger number of stream-based recommendation algorithms.