1 Introduction

Recommender systems are information retrieval applications that aim to mitigate the problem of “information overload” by matching users to relevant items (Borchers et al. 1998). They have become ubiquitous on the world wide web and have found applications in many different areas, where these items can represent anything from news articles and musical artists to retail products and social media accounts. Most modern approaches to recommendation are based on some form of collaborative filtering (Ekstrand et al. 2011), a family of methods that aim to model user preferences and learn them from a dataset of user behaviour. These methods have enjoyed widespread success over the years, and are the cornerstone of modern recommender systems research. As a consequence, the quest for more effective collaborative filtering algorithms is a very active research area, where significant strides forward are being made every year. Many novel methods are based on deep and nonlinear neural networks, and the expressiveness of this model class has made them ubiquitous in the field (Liang et al. 2018; Elahi et al. 2019; Shenbin et al. 2020). Recent work casts doubt on the reproducibility of evaluation strategies that are often adopted to empirically validate research findings (Dacrema et al. 2019; Rendle 2019; Rendle et al. 2020), making it harder to conclude whether these complex model classes are what the field needs moving forward.

In a parallel line of research, the effectiveness of simpler linear models for the collaborative filtering task has been shown time and again (Ning and Karypis 2011; Levy and Jack 2013; Sedhain et al. 2016; Steck 2019b, c; Steck et al. 2020). Most notably, Embarrassingly Shallow Auto-Encoders (reversed: \(\textsc {ease}^{\textsc {r}}\)) have recently been shown to yield results that are highly competitive with the state of the art, whilst often being much easier to implement and much more efficient to compute (Steck 2019a). The closed-form solution that is available for ridge regression models is at the heart of these major advantages, as \(\textsc {ease}^{\textsc {r}}\) effectively optimises a regularised least-squares problem. Recently, \(\textsc {ease}^{\textsc {r}}\) has been extended to incorporate item metadata into two variants: \(\textsc {cease}^{\textsc {r}}\) and \(\textsc {add-ease}^{\!\textsc {r}}\) (Jeunen et al. 2020). These extensions improve the capabilities of closed-form linear models to deal with issues such as the “long tail” (very few items account for the large majority of interactions) and “cold start” (new items do not have any interactions) (Schein et al. 2002; Park and Tuzhilin 2008; Shi et al. 2014).

The main benefit of \(\textsc {ease}^{\textsc {r}}\) and its variants over competing approaches is their computational efficiency. As the core algorithm consists of a single inversion of the Gramian item-item matrix, it is often many times more efficient to compute than models relying on iterative optimisation techniques. As reported in the original paper, the algorithm can be implemented in just a few lines of Python and is typically computed in the order of minutes on several commonly used, publicly available benchmark datasets (Steck 2019a). Nevertheless, matrix inversion is known to scale poorly for large matrices, and \(\textsc {ease}^{\textsc {r}}\)’s reliance on it does inhibit its adoption in use-cases with large item catalogues. In such cases, methods that rely on gradient-based optimisation techniques are still preferable.

To make matters worse, real-world systems rarely rely on a single model that is computed once and then deployed. To make this concrete: suppose we operate a hypothetical retail website, and we wish to send out an e-mail with a top-N list of personalised recommendations to our subscribed users every few days. Naturally, the model that generates these recommendation lists should evolve over time, preferably incorporating new user-item interactions that have occurred over the past days. The importance of having such a dynamic model is threefold: (1) it will generate more novel and diverse recommendations than its static counterpart (Castells et al. 2015), (2) it will be able to combat concept drift in the data (due to shifting item popularity or seasonality trends in preferences) (Gama et al. 2014), and (3) it will have the means to handle cold-start problems when new items or new users appear (Schein et al. 2002).

Many modern digital systems generate new data at increasingly fast rates, and this is no different for our hypothetical retail website. This is important to take into account when choosing a recommendation algorithm. Models that are already expensive to compute will only see this problem exacerbated when the predominant approach is to recompute them from scratch, every few days, on ever-growing amounts of data. This puts a hard limit on how often we can update the model, and incurs a computational cost that we would like to reduce. Instead, it would be preferable to have models that can be updated with new information when it arrives, but do not require a full retraining of untouched parameters for every new batch of data that comes in. This is not an easy feat, and the field of “online recommender systems” that handle model updates more gracefully has seen much interest in recent years (Vinagre et al. 2020). More generally, the problem of “lifelong” or “continual” learning in the machine learning field deals with similar issues (Chen 2018).

In this work, we present a novel algorithm to incrementally update the state-of-the-art item-based linear model \(\textsc {ease}^{\textsc {r}}\), which is naturally extended to include recent variants that exploit side-information: \(\textsc {cease}^{\textsc {r}}\) and \(\textsc {add-ease}^{\!\textsc {r}}\). \(\textsc {ease}^{\textsc {r}}\) consists of two major computation steps: (1) the generation of the Gramian item-item matrix, and (2) the inversion of this matrix that yields the solution to the regression problem.

We propose Dynamic \(\textsc {ease}^{\textsc {r}}\) (\(\textsc {dyn-ease}^{\textsc {r}}\)), consisting of incremental update rules for these two steps that leverage the recently proposed Dynamic Index algorithm (Jeunen et al. 2019) and the well-known Woodbury matrix identity (Hager 1989), respectively. As such, \(\textsc {dyn-ease}^{\textsc {r}}\) provides a way to efficiently update an existing \(\textsc {ease}^{\textsc {r}}\)-like model without the need of recomputing the entire regression model from scratch with every data update.

A theoretical analysis of the proposed algorithm shows that the highest efficiency gains can be expected when the rank of the update to the Gramian is low, an assumption that has been widely adopted in the recommender systems literature before (Koren et al. 2009). We show how this quantity can be bounded using simple summary statistics from the new batch of data, and support our findings with empirical results. Further experiments confirm that \(\textsc {dyn-ease}^{\textsc {r}}\) is able to significantly cut down on computation time compared to iteratively retrained \(\textsc {ease}^{\textsc {r}}\), in a variety of recommendation domains. Finally, we show how we can update the model with low-rank approximations when the new batch of data itself is not low-rank; providing a tuneable trade-off between the exactness of the solution and the efficiency with which it can be kept up-to-date. Empirical observations show how this approximate variant of \(\textsc {dyn-ease}^{\textsc {r}}\) still yields highly competitive recommendation performance, with greatly improved update speed, and how the low-rank assumption can even improve on recommendation accuracy. As a result, our work broadens the space of recommendation problems to which the state-of-the-art linear model \(\textsc {ease}^{\textsc {r}}\) can efficiently be applied. To foster the reproducibility of our work, all source code for the experiments in Sect. 4 is publicly available at github.com/olivierjeunen/dynamic-easer.

The rest of this manuscript is structured as follows: Sect. 2 introduces our use-case, with mathematical notation and relevant related work; Sect. 3 introduces \(\textsc {dyn-ease}^{\textsc {r}}\) and presents a theoretical analysis of its inner workings, motivating an approximate variant; Sect. 4 presents empirical observations from a wide range of experiments and shows where \(\textsc {dyn-ease}^{\textsc {r}}\) can provide meaningful improvements, findings that are in line with what the theory suggests. Section 5 concludes our work, additionally presenting a scope for future research.

2 Background and related work

We first formalise our use-case and present relevant mathematical notation used throughout the rest of this work. We are interested in the “binary, positive-only” implicit feedback setting (Verstrepen et al. 2017), where we have access to a dataset consisting of preference indications from users in \({\mathcal {U}}\) over items in \({\mathcal {I}}\) at time \(t \in {\mathbb {N}}\), drawn from a set of interaction data \({\mathcal {P}} \subseteq {\mathcal {U}} \times {\mathcal {I}} \times {\mathbb {N}}\). Ignoring temporal information, these preferences can be represented in a binary user-item matrix \({\varvec{X}} \in \{0,1\}^{|{\mathcal {U}}|\times |{\mathcal {I}}|}\), where \({\varvec{X}}_{u,i} = 1\) if we have a click, view, purchase, etc. for user u and item i in \({\mathcal {P}}\), and \({\varvec{X}}_{u,i} = 0\) otherwise. With \({\mathcal {P}}_{t}\), we denote the set of all interactions up to time t: \(\{(u,i,t^{\prime }) \in {\mathcal {P}} | t^{\prime } < t \}\). Consequently, \({\varvec{X}}_t\) is the user-item matrix constructed from the set of interactions \({\mathcal {P}}_{t}\). We will refer to the set of all items seen by user u as \({\mathcal {I}}_u \subseteq {\mathcal {I}}\), and vice versa \({\mathcal {U}}_i \subseteq {\mathcal {U}}\) for an item i. The Gramian of the user-item matrix is defined as \({\varvec{G}} :={\varvec{X}}^{\intercal } {\varvec{X}}\); it is an item-item matrix that holds the co-occurrence count for items i and j at index \({\varvec{G}}_{i,j}\). The goal of a recommendation algorithm is to predict which zeroes in the user-item matrix \({\varvec{X}}\) should not actually be zeroes, i.e. which items would in some way “fit” the user’s tastes and would consequently make for good recommendations.
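To make this notation concrete, the following minimal sketch (our own illustration on hypothetical toy data, not code from the paper) builds a sparse user-item matrix \({\varvec{X}}\) and its Gramian \({\varvec{G}} = {\varvec{X}}^{\intercal } {\varvec{X}}\) with SciPy:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy (user, item, time) triples; indices and sizes are made up for illustration.
interactions = [(0, 0, 1), (0, 2, 3), (1, 2, 4), (2, 1, 7)]
n_users, n_items = 3, 3

rows = [u for u, i, t in interactions]
cols = [i for u, i, t in interactions]
X = csr_matrix((np.ones(len(interactions)), (rows, cols)), shape=(n_users, n_items))
X.data[:] = 1.0               # binarise in case of repeated (user, item) interactions
G = (X.T @ X).toarray()       # G[i, j] holds the co-occurrence count of items i and j
```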

In some cases, additional information about items can be available. Such “side-information” or “metadata” often comes in the form of discrete tags, which can, for example, be a release year, genre or director for a movie, an artist or genre for a song, a writer for a book, or many more. Incorporating item metadata in the modelling process can help mitigate cold-start and long-tail issues, where the preference information for a given item is limited (Schein et al. 2002; Park and Tuzhilin 2008). We will refer to the set of all such tags as the vocabulary \({\mathcal {V}}\). In a similar fashion to the user-item matrix \({\varvec{X}}\), a tag-item matrix \({\varvec{T}} \in {\mathbb {R}}^{|{\mathcal {V}}|\times |{\mathcal {I}}|}\) is constructed. Note that this matrix is real-valued, as it will often contain pre-computed values such as tf-idf weights instead of binary indicators.

In what follows, we present a brief introduction to item-based recommendation models, most notably item-knn (Sarwar et al. 2001), slim (Ning and Karypis 2011) and \(\textsc {ease}^{\textsc {r}}\) (Steck 2019a). We then additionally introduce \(\textsc {cease}^{\textsc {r}}\) and \(\textsc {add-ease}^{\!\textsc {r}}\) as extensions of \(\textsc {ease}^{\textsc {r}}\) that incorporate item side-information whilst retaining a closed-form solution (Jeunen et al. 2020), as these are most relevant to the dynamic \(\textsc {ease}^{\textsc {r}}\) algorithm we will present in Sect. 3. This section is concluded with an overview of related work in the field of incremental collaborative filtering approaches.

2.1 Item-based models, slim and \(\textsc {ease}^{\textsc {r}}\)

Item-based collaborative filtering models tackle the recommendation task by defining a conceptual similarity matrix \({\varvec{S}} \in {\mathbb {R}}^{|{\mathcal {I}}|\times |{\mathcal {I}}|}\). The score given to a potential recommendation is then computed as the sum of similarities between items in the user’s history and the item at hand:

$$\begin{aligned} \text {score}(u,i) = \sum _{j \in {\mathcal {I}}_u} {\varvec{S}}_{j,i} = ({\varvec{X}}_{u,\cdot } {\varvec{S}})_i \end{aligned}$$
(1)

Here, \({\varvec{X}}_{u,\cdot }\) denotes the uth row of \({\varvec{X}}\). Note that computing recommendation scores for all training users and all items simply consists of computing the matrix multiplication \({\varvec{X}}{\varvec{S}}\), an operation that is made more efficient when the matrix \({\varvec{S}}\) is restricted to be sparse. Scores for items i already present in the user history \({\mathcal {I}}_u\) are often ignored, and the remaining items are ranked and presented in a top-N recommendation list or slate to the user. Early seminal works defined the similarity matrix \({\varvec{S}}\) as all pairwise cosine similarities among items in the high-dimensional but sparse user-item matrix \({\varvec{X}}\) (Sarwar et al. 2001). This has since been extended to include slightly more advanced notions of similarity such as Pearson’s correlation or conditional probabilities (Deshpande and Karypis 2004). Recent work has introduced the “Dynamic Index” algorithm to incrementally compute the Gramian of \({\varvec{X}}\), additionally showing that several conventional similarity metrics such as cosine similarity or the Jaccard index can be readily computed from \({\varvec{G}}\) when it is up-to-date (Jeunen et al. 2019).
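As an illustration of Eq. 1, the sketch below (our own, with hypothetical variable names) scores all items for a single user and returns a top-N recommendation list:

```python
import numpy as np

def top_n(X_u: np.ndarray, S: np.ndarray, n: int = 10) -> np.ndarray:
    """X_u: binary history vector of one user; S: item-item similarity matrix."""
    scores = (X_u @ S).astype(float)   # score(u, i) = sum of S[j, i] over j in I_u
    scores[X_u > 0] = -np.inf          # ignore items already in the user's history
    return np.argsort(-scores)[:n]     # indices of the N highest-scoring items
```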

Methods for actually learning an optimal item-item similarity matrix have been proposed for the task of rating prediction (Koren 2008), as well as for pairwise learning from implicit feedback (Rendle et al. 2009). Ning and Karypis were the first to propose to learn a sparse weight matrix \({\varvec{S}}\) through a pointwise optimisation procedure, aptly dubbing their approach the Sparse LInear Method (slim) (Ning and Karypis 2011). slim optimises a least-squares regression model with elastic net regularisation, constrained to positive weights:

$$\begin{aligned} {\varvec{S}}^{*}&= {{\,\mathrm{arg\,min}\,}}_{{\varvec{S}}} \left\Vert {\varvec{X}} - {\varvec{X}}{\varvec{S}}\right\Vert ^{2}_{F} + \lambda _{1} \left\Vert {\varvec{S}}\right\Vert _{1} + \lambda _{2} \left\Vert {\varvec{S}}\right\Vert ^{2}_{F},\nonumber \\&\text {subject to } \text {diag}({\varvec{S}}) = 0 \text { and } {{\varvec{S}} \ge 0}. \end{aligned}$$
(2)

The restriction of the diagonal to zero avoids the trivial solution where \({\varvec{S}} = {\varvec{I}}\). Many extensions of slim have been proposed in recent years, and it has become a widely used method for the collaborative filtering task (Ning and Karypis 2012; Levy and Jack 2013; Christakopoulou and Karypis 2014; Sedhain et al. 2016; Christakopoulou and Karypis 2016; Steck 2019a, c; Steck et al. 2020; Chen et al. 2020). In practice, the slim optimisation problem is often decomposed into \(|{\mathcal {I}}|\) independent problems (one per target item). Although these can then be solved in an embarrassingly parallel fashion, the approach remains intractable for very large item catalogues. Indeed, as these methods aim to solve \(|{\mathcal {I}}|\) regression problems, their computational complexity is in the order of \({\mathcal {O}}(|{\mathcal {I}}|(|{\mathcal {I}}|-1)^{2.373})\), even when exploiting recent advances in efficient matrix multiplication and inversion (Le Gall 2014; Alman and Vassilevska W. 2021). The computational cost of the original slim approach is a known impediment to its adoption in certain use-cases; related work has reported that hyper-parameter tuning took several weeks on even medium-sized datasets (Liang et al. 2018).

Steck studied whether slim’s restriction to positive item-item weights and its \(l_{1}\)-regularisation-induced sparsity were necessary for the resulting model to remain competitive, and concluded that this was not always the case (Steck 2019a; Steck et al. 2020). The resulting Tikhonov-regularised least-squares problem can then be formalised as:

$$\begin{aligned} {\varvec{S}}^{*} = {{\,\mathrm{arg\,min}\,}}_{{\varvec{S}}} \left\Vert {\varvec{X}} - {\varvec{X}}{\varvec{S}}\right\Vert ^{2}_{F} + \lambda \left\Vert {\varvec{S}}\right\Vert ^{2}_{F} \text {, subject to } \text {diag}({\varvec{S}}) = 0. \end{aligned}$$
(3)

The main advantage of simplifying the optimisation problem at hand is that the well-known closed-form solutions for ordinary least squares (OLS) and ridge regression can now be adopted. Including the zero-diagonal constraint via Lagrange multipliers yields the Embarrassingly Shallow Auto-Encoder (\(\textsc {ease}^{\textsc {r}}\)) model:

$$\begin{aligned} \hat{{\varvec{S}}} = {\varvec{I}} - \hat{{\varvec{P}}} \cdot \text {diagMat}(\mathbf {1} \oslash \text {diag}(\hat{{\varvec{P}}})) \text {, where } \hat{{\varvec{P}}} :=({\varvec{X}}^\intercal {\varvec{X}} + \lambda {\varvec{I}})^{-1}. \end{aligned}$$
(4)

As this model consists of a single regression problem to be solved and thus a single matrix inversion to be computed, its complexity is orders of magnitude smaller than that of the original slim variants: \({\mathcal {O}}(|{\mathcal {I}}|^{2.373})\). \(\textsc {ease}^{\textsc {r}}\) no longer yields a sparse matrix, possibly making Equation 1 much less efficient to compute. Nevertheless, the author reported only a marginal performance impact when the learnt matrix is sparsified by zeroing out the weights with the smallest absolute values until the desired sparsity level is reached. As an additional advantage, \(\textsc {ease}^{\textsc {r}}\) has only a single regularisation strength hyper-parameter to tune, compared to the two needed for slim’s elastic net regularisation. We refer the interested reader to Steck (2019a, 2019b) for a full derivation of the model and additional information.
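To make the closed form in Eq. 4 concrete, the following minimal sketch (our own illustration, not the reference implementation; it assumes the dense Gramian fits in memory and uses an arbitrary regularisation strength) computes the \(\textsc {ease}^{\textsc {r}}\) weight matrix with NumPy:

```python
import numpy as np

def ease_r(X: np.ndarray, lam: float = 500.0) -> np.ndarray:
    """Closed-form EASE^R solution of Eq. (4); lam is the ridge penalty lambda."""
    G = X.T @ X                                   # item-item Gramian
    P = np.linalg.inv(G + lam * np.eye(G.shape[0]))
    S = np.eye(G.shape[0]) - P / np.diag(P)       # S = I - P . diagMat(1 / diag(P))
    np.fill_diagonal(S, 0.0)                      # diagonal is zero by construction;
    return S                                      # set explicitly as a numerical safeguard
```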

Another recent extension of the slim paradigm proposes to use Block-Diagonal-Regularisation (BDR) to obtain a block-aware item similarity model (Chen et al. 2020). The block-diagonal structure in the learnt matrix inherently represents clusters among items. As inter-block similarities are penalised, BDR has a sparsity-inducing effect that positively impacts the efficiency of the recommendation-generating process. Because the block-aware model presented by Chen et al. (2020) no longer has an analytically computable solution readily available, further comparison with their method is out of scope for the purposes of this work. The item-based paradigm and its closed-form instantiations have also recently been adapted for bandit-based recommendation use-cases (Jeunen and Goethals 2021).

2.2 Item-based models with side-information

The \(\textsc {ease}^{\textsc {r}}\) definition can be further extended to incorporate side-information in either a “collective” (\(\textsc {cease}^{\textsc {r}}\)) or “additive” (\(\textsc {add-ease}^{\textsc {r}}\)) manner (Jeunen et al. 2020). The first method, inspired by collective slim (Ning and Karypis 2012), intuitively treats discrete tags analogously to users, and re-weights their contribution to the solution of the regression problem through the diagonal weight-matrix \({\varvec{W}} \in {\mathbb {R}}^{(|{\mathcal {U}}|+|{\mathcal {V}}|)\times (|{\mathcal {U}}|+|{\mathcal {V}}|)}\):

$$\begin{aligned} {\varvec{S}}^{*}&= {{\,\mathrm{arg\,min}\,}}_{{\varvec{S}}} \left\Vert \sqrt{{\varvec{W}}} ({\varvec{X}}' - {\varvec{X}}'{\varvec{S}})\right\Vert ^{2}_{F} + \lambda \left\Vert {\varvec{S}}\right\Vert ^{2}_{F},\nonumber \\&\text {subject to } \text {diag}({\varvec{S}}) = 0 \text {, where } {\varvec{X}}' = \begin{bmatrix} {\varvec{X}} \\ {\varvec{T}} \end{bmatrix}. \end{aligned}$$
(5)

The closed-form solution is then given by Equation 6, where \(\oslash \) denotes element-wise division, \(\text {diag}(\cdot )\) extracts the diagonal from a matrix, \(\text {diagMat}(\cdot )\) generates a square diagonal matrix from a vector, and \(\mathbf {1}\) is a vector of ones.

$$\begin{aligned} \hat{{\varvec{S}}} = {\varvec{I}} - \hat{{\varvec{P}}} \cdot \text {diagMat}(\mathbf {1} \oslash \text {diag}(\hat{{\varvec{P}}})) \text {, where } \hat{{\varvec{P}}} :=({\varvec{X}}'^\intercal {\varvec{W}}{\varvec{X}}' + \lambda {\varvec{I}})^{-1} \end{aligned}$$
(6)

The second method, \(\textsc {add-ease}^{\textsc {r}}\), treats the regression problem on the user-item matrix \({\varvec{X}}\) and the one on the tag-item matrix \({\varvec{T}}\) as two fully independent problems that can be solved in parallel, combining the two resulting item-item weight matrices \({\varvec{S}}_{{\varvec{X}}}\) and \({\varvec{S}}_{{\varvec{T}}}\) additively afterwards.

$$\begin{aligned} \begin{aligned} {\varvec{S}}^{*} = \alpha&{{\,\mathrm{arg\,min}\,}}_{{\varvec{S}}_{{\varvec{X}}}}\left( \left\Vert \sqrt{{\varvec{W}}_{{\varvec{X}}}} ({\varvec{X}} - {\varvec{X}}{\varvec{S}}_{{\varvec{X}}})\right\Vert ^{2}_{F} + \lambda _{{\varvec{X}}} \left\Vert {\varvec{S}}_{{\varvec{X}}}\right\Vert ^{2}_{F}\right) \\ + (1-\alpha )&{{\,\mathrm{arg\,min}\,}}_{{\varvec{S}}_{{\varvec{T}}}}\left( \left\Vert \sqrt{{\varvec{W}}_{{\varvec{T}}}} ({\varvec{T}} - {\varvec{T}}{\varvec{S}}_{{\varvec{T}}})\right\Vert ^{2}_{F} + \lambda _{{\varvec{T}}} \left\Vert {\varvec{S}}_{{\varvec{T}}}\right\Vert ^{2}_{F}\right) ,\\&\text {subject to } \text {diag}({\varvec{S}}_{{\varvec{X}}}) = \text {diag}({\varvec{S}}_{{\varvec{T}}}) = 0. \end{aligned} \end{aligned}$$
(7)

\(\textsc {add-ease}^{\textsc {r}}\) doubles the number of parameters used by \(\textsc {ease}^{\textsc {r}}\) and \(\textsc {cease}^{\textsc {r}}\), increasing its degrees of freedom at learning time at the cost of having to solve two regression problems instead of one. Note, however, that these are fully independent and can be computed in parallel. Equation 8 shows the analytical formulas to obtain the two independent models and to combine them with a blending parameter \(0 \le \alpha \le 1\).

$$\begin{aligned}&\hat{{\varvec{S}}}_{{\varvec{X}}} = {\varvec{I}} - \hat{{\varvec{P}}}_{{\varvec{X}}} \cdot \text {diagMat}(\mathbf {1} \oslash \text {diag}(\hat{{\varvec{P}}}_{{\varvec{X}}})) \text {, where } \hat{{\varvec{P}}}_{{\varvec{X}}} :=({\varvec{X}}^\intercal {\varvec{W}}_{{\varvec{X}}}{\varvec{X}} + \lambda _{{\varvec{X}}}{\varvec{I}})^{-1} \nonumber \\&\quad \hat{{\varvec{S}}}_{{\varvec{T}}} = {\varvec{I}} - \hat{{\varvec{P}}}_{{\varvec{T}}} \cdot \text {diagMat}(\mathbf {1} \oslash \text {diag}(\hat{{\varvec{P}}}_{{\varvec{T}}})) \text {, where } \hat{{\varvec{P}}}_{{\varvec{T}}} :=({\varvec{T}}^\intercal {\varvec{W}}_{{\varvec{T}}}{\varvec{T}} + \lambda _{{\varvec{T}}}{\varvec{I}})^{-1}\nonumber \\&\quad \hat{{\varvec{S}}} = \alpha \hat{{\varvec{S}}}_{{\varvec{X}}} + (1 - \alpha )\hat{{\varvec{S}}}_{{\varvec{T}}} \end{aligned}$$
(8)

The computational complexity of \(\textsc {cease}^{\textsc {r}}\) and \(\textsc {add-ease}^{\textsc {r}}\) remains in the order of \({\mathcal {O}}(|{\mathcal {I}}|^{2.373})\), which is equivalent to that of the original \(\textsc {ease}^{\textsc {r}}\) approach. As such, these methods allow item side-information to be included into the model without a significant added cost in terms of computational complexity. The main reason for this is that we adapt the entries in the Gramian \({\varvec{G}}\), but do not alter its dimensions.
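To illustrate how little changes computationally, the sketch below (our own, assuming uniform weights \({\varvec{W}} = {\varvec{I}}\) and reusing the hypothetical ease_r helper from the earlier sketch) computes \(\textsc {cease}^{\textsc {r}}\) by simply stacking \({\varvec{X}}\) and \({\varvec{T}}\) as in Eq. 5:

```python
import numpy as np

def cease_r(X: np.ndarray, T: np.ndarray, lam: float = 500.0) -> np.ndarray:
    # Stack users and tags row-wise: the Gramian becomes X'^T X' = X^T X + T^T T,
    # so its dimensions (and hence the cost of the inversion) stay unchanged.
    X_prime = np.vstack([X, T])
    return ease_r(X_prime, lam)
```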

2.3 Incremental collaborative filtering

Collaborative filtering techniques that can be incrementally updated when new data arrives form a lively research area in their own right. Vinagre et al. (2014) propose incremental Stochastic Gradient Descent (SGD) as a way to dynamically update matrix factorisation models based on positive-only implicit feedback. Their methodology has first been extended to include negative feedback (Vinagre et al. 2015), and later to a co-factorisation model that is more complex than traditional matrix factorisation, but also leads to superior recommendation accuracy (Anyosa et al. 2018). He et al. (2016) propose an incremental optimisation procedure based on Alternating Least Squares (ALS), and also show how it can be applied to efficiently and effectively update matrix factorisation models. More recently, Ferreira et al. propose a method that personalises learning rates on a per-user basis, reporting further improvements. In contrast, our work focuses on item-based similarity models that come with closed-form solutions, as these have been shown to be highly competitive with the state of the art in many collaborative filtering use-cases.

Instead of just incorporating new data into the model, Matuszyk et al. (2018) propose to forget older data that has become obsolete, reporting significantly improved performance for collaborative filtering approaches. The dynamic \(\textsc {ease}^{\textsc {r}}\) method we propose in Sect. 3 fits perfectly into this paradigm, as it can incorporate new data just as easily as it can forget irrelevant information in a targeted manner. This type of decremental learning has the additional advantage of being able to avoid complete retraining in privacy-sensitive application areas, where specific user histories need to be removed from the model upon request.

2.4 Neural auto-encoders

The Auto-Encoder paradigm, of which \(\textsc {ease}^{\textsc {r}}\) is a specific instantiation, has gained much popularity in recent years. The Mult-VAE method proposed by Liang et al. (2018) consists of a variational auto-encoder with a multinomial likelihood, and has been a strong baseline for several years (Dacrema et al. 2019). Khawar et al. (2020) propose an architecture that first learns a grouping of items and then leverages this structure when learning the auto-encoder, reporting significant gains over the original Mult-VAE method. As these methods rely on gradient-based optimisation of often highly non-convex objective functions, they require software packages with automatic differentiation capabilities, and typically demand significant computational resources, in the form of several hours of training on machines equipped with GPUs. The methods we consider in this work are computed in the order of minutes on CPUs, and we do not include neural approaches in our comparison for this reason. Furthermore, Steck (2019a) and Dacrema et al. (2019), among others, have repeatedly shown that linear item-based models can attain highly competitive recommendation accuracy compared to neural alternatives.

3 Methodology and contributions

We have given a brief history of item-based collaborative filtering models, and have discussed why \(\textsc {ease}^{\textsc {r}}\) and its variants are computationally often more efficient than their counterparts based on slim. For very large item catalogues, however, its more than quadratic computational complexity in the number of items still becomes a very tangible issue. Because of this, the demand for an algorithm that can efficiently update \(\textsc {ease}^{\textsc {r}}\)-like models when new data arrives is still very real, and a necessity for these methods to obtain widespread adoption in practice. Recent work proposes the “Dynamic Index” algorithm as a way to incrementally update item similarities in neighbourhood-based models that adopt cosine similarity (Jeunen et al. 2019). A crucial building block of this metric and the algorithm is the efficient and incremental computation of the Gramian matrix \({\varvec{G}} = {\varvec{X}}^\intercal {\varvec{X}}\). By storing \({\varvec{G}}\) in low-overhead sparse data-structures such as inverted indices, they minimise memory overhead whilst still allowing for an amortised constant lookup time when querying \({\mathcal {I}}_u\), which is a requirement for incremental updates. From Eqs. 4, 6 and 8, it is clear that \(\textsc {ease}^{\textsc {r}}\) and its variants depend on this Gramian matrix as well. In fact, it is the only building block needed to compute the resulting item-item weight matrix \(\hat{{\varvec{S}}}\). As such, we adopt parts of the Dynamic Index algorithm proposed by Jeunen et al. to first efficiently compute and then incrementally update the Gramian matrix \({\varvec{G}}\). Once we have an up-to-date matrix \({\varvec{G}}\), we need to invert its regularised version \(({\varvec{G}} + \lambda {\varvec{I}})\) to obtain \(\hat{{\varvec{P}}}\), and the eventual model \(\hat{{\varvec{S}}}\) from that. This matrix inversion is the workhorse behind \(\textsc {ease}^{\textsc {r}}\) that takes up the large majority of the computation time, as it corresponds to solving the least-squares problem formulated in Eq. 3. Recomputing this matrix inverse from scratch every time we wish to incorporate new data into the model should thus be avoided whenever possible.

3.1 Low-rank model updates with the Woodbury matrix identity

Equation 9 shows the Woodbury matrix identity, which posits that the inverse of a rank-k correction to some \(n\times n\) matrix \({\varvec{A}}\) can be computed by performing a rank-k correction on the inverse of the original matrix (Hager 1989).

$$\begin{aligned} ({\varvec{A}} + {\varvec{U}}{\varvec{C}}{\varvec{V}})^{-1} = {\varvec{A}}^{-1} - {\varvec{A}}^{-1}{\varvec{U}}({\varvec{C}}^{-1}+{\varvec{V}}{\varvec{A}}^{-1}{\varvec{U}})^{-1}{\varvec{V}}{\varvec{A}}^{-1} \end{aligned}$$
(9)

So, given \({\varvec{A}}^{-1}\), \({\varvec{U}}\), \({\varvec{C}}\) and \({\varvec{V}}\), there is no need to recompute the inverse of the updated matrix \({\varvec{A}}\) from scratch; it is sufficient to multiply a few matrices and to compute the inverse of \(({\varvec{C}}^{-1}+{\varvec{V}}{\varvec{A}}^{-1}{\varvec{U}})\in {\mathbb {R}}^{k\times k}\). Naturally, for large n and \(k \ll n\), the efficiency gains coming from this reformulation will be most significant. Although this is not a requirement for Eq. 9 to hold, we assume \({\varvec{C}} \in {\mathbb {R}}^{k\times k}\) to be a diagonal matrix. As a result, the inversion of \({\varvec{C}}\) becomes trivial and consists of just k operations.
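As a quick numerical check (our own illustration with arbitrary toy matrices), the identity can be verified directly for a diagonal \({\varvec{C}}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
A = rng.normal(size=(n, n)) + n * np.eye(n)    # a well-conditioned n x n matrix
U = rng.normal(size=(n, k))
V = rng.normal(size=(k, n))
C = np.diag(rng.uniform(1.0, 2.0, size=k))     # diagonal C, as assumed in the text

A_inv = np.linalg.inv(A)
direct = np.linalg.inv(A + U @ C @ V)          # left-hand side of Eq. (9)
woodbury = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv
assert np.allclose(direct, woodbury)           # the two agree up to numerical precision
```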

In our setting, suppose we have an up-to-date model at a certain time t with \({\varvec{X}}_t\), \({\varvec{G}}_{t}\), \(\hat{{\varvec{P}}}_{t}\) and \(\hat{{\varvec{S}}}_{t}\). At a given time \(t+1\), suppose we have an updated user-item matrix \({\varvec{X}}_{t+1}\), and we wish to compute \({\varvec{G}}_{t+1}\), \(\hat{{\varvec{P}}}_{t+1}\) and the resulting \(\hat{{\varvec{S}}}_{t+1}\) as efficiently as possible. As we mentioned before, computing \({\varvec{G}}_{t+1}\) incrementally can be achieved easily and efficiently by adopting parts of the Dynamic Index algorithm. In fact, because of the incremental nature of the algorithm, we can easily just store the difference in the Gramian matrix instead of its entirety: \({\varvec{G}}_{{\varvec{\varDelta }}} = {\varvec{G}}_{t+1}-{\varvec{G}}_{t} = {\varvec{X}}_{t+1}^{\intercal }{\varvec{X}}_{t+1} - {\varvec{X}}_{t}^{\intercal }{\varvec{X}}_{t}\). Given a set of user-item interactions \({\mathcal {P}}_{{\varvec{\varDelta }}} \subset {\mathcal {U}}\times {\mathcal {I}}\) to include into the model and an inverted index \({\mathcal {L}}_t\) mapping users u to their histories \({\mathcal {I}}_u\), Algorithm 1 shows how to achieve this. Note that the indices holding \({\mathcal {I}}_u\) are just a sparse representation of the user-item matrix \({\varvec{X}}\) and do not require any additional memory consumption. Furthermore, Algorithm 1 is easily parallelisable through the same MapReduce-like paradigm adopted by Jeunen et al. (2019). Naturally, an efficient implementation will exploit the symmetry of the Gramian \({\varvec{G}}_{{\varvec{\varDelta }}}\) to decrease memory consumption as well as the number of increments needed at every update.

[Algorithm 1: incrementally computing the Gramian update \({\varvec{G}}_{{\varvec{\varDelta }}}\) from a new batch of interactions \({\mathcal {P}}_{{\varvec{\varDelta }}}\) and the inverted index \({\mathcal {L}}_t\)]
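The core idea behind Algorithm 1 can be sketched as follows (our own minimal Python illustration, not the authors' implementation; the dictionary `history` plays the role of the inverted index \({\mathcal {L}}_t\)):

```python
from collections import defaultdict

def gramian_delta(batch, history):
    """batch: iterable of (user, item) pairs in P_delta;
    history: dict mapping each user to the set of items they have interacted with."""
    G_delta = defaultdict(int)            # sparse dict holding (i, j) -> count delta
    for u, i in batch:
        seen = history.setdefault(u, set())
        if i in seen:
            continue                      # binary data: ignore repeated interactions
        for j in seen:                    # new co-occurrences of i with u's history
            G_delta[(i, j)] += 1
            G_delta[(j, i)] += 1
        G_delta[(i, i)] += 1              # diagonal entry: item occurrence count
        seen.add(i)
    return G_delta
```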

Now, having computed \({\varvec{G}}_{{\varvec{\varDelta }}}\), we can rewrite what we need as follows:

$$\begin{aligned} \hat{{\varvec{P}}}_{t+1} = ({\varvec{G}}_{t+1}+\lambda {\varvec{I}})^{-1} = ({\varvec{G}}_{t} + \lambda {\varvec{I}} + {\varvec{G}}_{{\varvec{\varDelta }}})^{-1}. \end{aligned}$$
(10)

The form on the right-hand side already resembles Woodbury’s formula in Eq. 9. All that remains is to decompose \({\varvec{G}}_{{\varvec{\varDelta }}} \in {\mathbb {R}}^{n\times n}\) into matrices \({\varvec{U}} \in {\mathbb {R}}^{n\times k}\), \({\varvec{C}} \in {\mathbb {R}}^{k\times k}\) and \({\varvec{V}} \in {\mathbb {R}}^{k\times n}\). As \({\varvec{G}}_{{\varvec{\varDelta }}}\) is the difference of two real symmetric matrices \({\varvec{G}}_{t+1}\) and \({\varvec{G}}_{t}\), it will always be a real symmetric matrix as well. This means that its eigenvectors can be chosen to be orthonormal, so that \({\varvec{Q}}^{\intercal } \equiv {\varvec{Q}}^{-1}\). Consequently, an eigendecomposition always exists, where k is the rank of \({\varvec{G}}_{{\varvec{\varDelta }}}\):

$$\begin{aligned} \begin{aligned} {\varvec{G}}_{{\varvec{\varDelta }}}&= {\varvec{Q}}{\varvec{\varLambda }}{\varvec{Q}}^{-1}\\&= {\varvec{Q}}{\varvec{\varLambda }}{\varvec{Q}}^{\intercal }\\&= \sum _{i = 1}^{k} {\varvec{\varLambda }}_{ii}{\varvec{Q}}_{\cdot ,i}{\varvec{Q}}_{\cdot ,i}^{\intercal }. \end{aligned} \end{aligned}$$
(11)

As such, we can plug Eq. 11 containing the eigendecomposition of \({\varvec{G}}_{{\varvec{\varDelta }}}\) into Eqs. 9 and 10 to obtain our final update rule in Eq. 12:

$$\begin{aligned} \begin{aligned} \hat{{\varvec{P}}}_{t+1}&= ({\varvec{G}}_{t} + \lambda {\varvec{I}} + {\varvec{G}}_{{\varvec{\varDelta }}})^{-1}\\&= ({\varvec{G}}_{t} + \lambda {\varvec{I}} + {\varvec{Q}}{\varvec{\varLambda }} {\varvec{Q}}^{\intercal })^{-1}\\&= \hat{{\varvec{P}}}_{t} - \hat{{\varvec{P}}}_{t}{\varvec{Q}}({\varvec{\varLambda }}^{-1} + {\varvec{Q}}^\intercal \hat{{\varvec{P}}}_{t}{\varvec{Q}})^{-1}{\varvec{Q}}^\intercal \hat{{\varvec{P}}}_{t}. \end{aligned} \end{aligned}$$
(12)

The full \(\textsc {dyn-ease}^{\textsc {r}}\) procedure is presented in Algorithm 2. If the updates to the Gramian matrix are low-rank, this procedure will be much more computationally efficient than re-computing the inverse of the entire Gramian matrix from scratch, as we will show in the following subsection. The assumption that the data-generating process behind user-item interactions is generally low-rank has been exploited far and wide in the recommender systems literature (Koren et al. 2009).

It is interesting to note that \(\textsc {ease}^{\textsc {r}}\) does not follow the low-rank assumption that motivates the popular family of latent factor models for collaborative filtering. Indeed, \(\textsc {ease}^{\textsc {r}}\) is a full-rank model, combatting overfitting with Gaussian priors on its parameters rather than reducing the dimensionality of the problem. The low-rank assumption we adopt here is on the update to the Gramian \({\varvec{G}}_{{\varvec{\varDelta }}}\), instead of the full Gramian \({\varvec{G}}\). As we will show further on, both theoretically and empirically, this assumption holds in a variety of settings.

The fact that \({\varvec{G}}_{{\varvec{\varDelta }}}\) is symmetric and will often be very sparse can be exploited when computing the eigendecomposition on line 3 of Algorithm 2, as we will show in the following subsection. Many modern software packages for scientific computing implement very efficient procedures specifically for such cases (e.g. SciPy (Virtanen et al. 2020)). Note that alternative algorithms to factorise \({\varvec{G}}_{{\varvec{\varDelta }}}\) into lower-dimensional matrices exist, often relying on randomised sampling procedures (Martinsson et al. 2011; Halko et al. 2011). These algorithms are reportedly more efficient to compute than the traditional eigendecomposition, but they are often not geared specifically towards the high-dimensional yet sparse use-case we tackle in this work, nor equipped to exploit the symmetric structure that is typical for the Gramian. As they compute two dense matrices of \({\varvec{Q}}\)’s dimensions, their improvement in computation time comes at the cost of increased memory consumption. Furthermore, these methods are often focused on approximate matrix reconstructions, whereas we are interested in an exact decomposition of the update to the Gramian. As the eigendecomposition fulfils our needs, the study of alternative factorisation methods falls outside the scope of the present work.

Throughout this section, we have focused on \(\textsc {dyn-ease}^{\textsc {r}}\) as a general extension of \(\textsc {ease}^{\textsc {r}}\). Naturally, our approach trivially extends to \(\textsc {cease}^{\textsc {r}}\), \(\textsc {add-ease}^{\textsc {r}}\) or a weight matrix \({\varvec{W}}\) different from the identity matrix \({\varvec{I}}\), as these variants only change the input to Algorithms 1 and 2, but have no impact on the procedures themselves.

[Algorithm 2: exact \(\textsc {dyn-ease}^{\textsc {r}}\), updating \(\hat{{\varvec{P}}}\) and \(\hat{{\varvec{S}}}\) via an eigendecomposition of \({\varvec{G}}_{{\varvec{\varDelta }}}\) and the Woodbury identity]
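A minimal sketch of this update (our own illustration under stated assumptions, not the authors' implementation) is given below; it assumes \({\varvec{G}}_{{\varvec{\varDelta }}}\) is available as a sparse symmetric SciPy matrix, \(\hat{{\varvec{P}}}_{t}\) as a dense array, and that k is a known upper bound on \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}})\) with \(k < |{\mathcal {I}}|\):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def dyn_ease_update(P_t: np.ndarray, G_delta, k: int) -> np.ndarray:
    """One incremental update: returns the new item-item weight matrix S_{t+1}."""
    vals, Q = eigsh(G_delta, k=k)                # eigendecomposition of the symmetric update
    keep = np.abs(vals) > 1e-10                  # drop numerically zero eigenpairs
    vals, Q = vals[keep], Q[:, keep]
    PQ = P_t @ Q                                 # reused intermediate product
    inner = np.linalg.inv(np.diag(1.0 / vals) + Q.T @ PQ)
    P_next = P_t - PQ @ inner @ PQ.T             # Woodbury update of the inverse (Eq. 12)
    S_next = np.eye(P_next.shape[0]) - P_next / np.diag(P_next)
    np.fill_diagonal(S_next, 0.0)
    return S_next
```

In practice, \(\hat{{\varvec{P}}}_{t+1}\) would also be kept for the next update; the sketch omits this for brevity.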

3.2 Computational complexity analysis of eigendecomposition

The computational complexity of \(\textsc {ease}^{\textsc {r}}\) is determined by the inversion of the Gramian, whereas the complexity of \(\textsc {dyn-ease}^{\textsc {r}}\) is dictated by that of the eigendecomposition of the update to the Gramian. The computational complexity of matrix inversion, as well as that of solving the eigen-problem of a matrix, can be reduced to that of matrix multiplication (Pan and Chen 1999; Le Gall 2014). Given a square matrix of size \(n \times n\), this is generally thought of as an \({\mathcal {O}}(n^{3})\) problem. Nevertheless, specialised methods that provide improved bounds on the exponent exist, the most recent one being \({\mathcal {O}}(n^{2.37286})\) by Alman and Vassilevska W. (2021).

In practice, more efficient algorithms can often be applied in specific cases instead of the general approach. Indeed, the inversion of a diagonal matrix consists of just n operations, and algorithms to multiply sparse matrices are often much more efficient than their dense counterparts. In what follows, we provide a brief theoretical analysis of the complexity of \(\textsc {dyn-ease}^{\textsc {r}}\), giving rise to an improved estimate of its computational complexity in practical settings. This bound explains the efficiency improvements of \(\textsc {dyn-ease}^{\textsc {r}}\) over \(\textsc {ease}^{\textsc {r}}\), and recovers the equivalence of eigendecomposition to matrix inversion in the general case.

A first important thing to note is that the Gramian is symmetric, and so is \({\varvec{G}}_{{\varvec{\varDelta }}}\). This allows us to use the iterative method proposed by Lanczos (1950) to compute its eigenvectors and eigenvalues. The core algorithm proposed by Lanczos consists of k steps (one per nonzero eigenpair), each of which in turn consists of several vector and matrix manipulations. We refer the interested reader to an excellent analysis of the Lanczos algorithm provided by Paige (1980), showing how it works and why it converges. The computational complexity of every step in the method is determined by that of a matrix-vector product between the input \({\varvec{G}}_{{\varvec{\varDelta }}}\) and an \(|{\mathcal {I}}|\)-dimensional vector. In the general case, such an operation is \({\mathcal {O}}(|{\mathcal {I}}|^{2})\). In our specific case, however, \({\varvec{G}}_{{\varvec{\varDelta }}}\) is often of an extremely sparse nature. This allows us to describe the complexity of the product as \({\mathcal {O}}(m\cdot |{\mathcal {I}}|)\), where m is the average number of nonzero values in every column of \({\varvec{G}}_{{\varvec{\varDelta }}}\). Repeating these steps for every nonzero eigenpair yields a final computational complexity of \({\mathcal {O}}(k\cdot m\cdot |{\mathcal {I}}|)\). When we wish to do a full-rank update on a dense matrix (i.e. \(k = m = |{\mathcal {I}}|\)), this recovers the computational complexity of general matrix inversion: \({\mathcal {O}}(|{\mathcal {I}}|^{3})\). In the cases where either the rank of the update is low (\(k \ll |{\mathcal {I}}|\)) or the update to the Gramian is highly sparse (\(m \ll |{\mathcal {I}}|\)), the eigendecomposition will be most efficient and, as a consequence, the performance benefits of \(\textsc {dyn-ease}^{\textsc {r}}\) over \(\textsc {ease}^{\textsc {r}}\) will be most apparent too. Note that although low-rankness and sparsity will often come in pairs in the practical settings we deal with, this does not have to be the case in general. As a counterexample: the identity matrix \({\varvec{I}}\) is highly sparse yet full-rank.
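For illustration (with assumed, purely hypothetical numbers that do not correspond to any specific dataset): for a catalogue of \(|{\mathcal {I}}| = 10^{5}\) items, a batch inducing an update of rank \(k = 500\) with on average \(m = 50\) nonzero values per column of \({\varvec{G}}_{{\varvec{\varDelta }}}\) requires in the order of \(k \cdot m \cdot |{\mathcal {I}}| = 2.5 \times 10^{9}\) operations, whereas a full re-inversion scales as \(|{\mathcal {I}}|^{2.373} \approx 7 \times 10^{11}\), a difference of more than two orders of magnitude.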

3.3 Efficient estimation and upper bounding of \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}})\)

In order to compute the eigendecomposition on line 3 of Algorithm 2, the numerical rank of \({\varvec{G}}_{{\varvec{\varDelta }}}\) would need to be known a priori. Furthermore, as we have shown, the efficiency of the update procedure is highly dependent on the assumption that this rank is much smaller than the dimensionality of the Gram-matrix itself: \(k \ll |{\mathcal {I}}|\). It is known that matrix ranks can be estimated efficiently through the use of randomised methods (Liberty et al. 2007; Ubaru and Saad 2016); when dealing with sparse and symmetric matrices, these methods tend to attain extremely efficient performance. Being able to estimate \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}})\) of course does not guarantee that this quantity will be low. In practice, however, we notice that it often is. We can see that the rank of the update \({\varvec{G}}_{{\varvec{\varDelta }}}\) depends on (1) the number of unique users in the update \({\mathcal {P}}_{{\varvec{\varDelta }}}\), denoted by \(|{\mathcal {U}}_{{\varvec{\varDelta }}}|\), and (2) the average number of items in the entire history of these users: \(\overline{|{\mathcal {I}}_{{\mathcal {U}}_{{\varvec{\varDelta }}}}|}\).

This can be intuitively seen from the fact that an index ij in the Gramian matrix represents the number of co-occurrences between the items i and j in the dataset. As such, a new user-item interaction \((u,i) \in {\mathcal {P}}_{{\varvec{\varDelta }}}\) affects \({\varvec{G}}_{i,j}, \forall j \in {\mathcal {I}}_u\).

Now, let \({\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ]}\) be the user-item matrix containing all (including historical) user-item interactions from only the users that appear in the update. This means we can rewrite the updated Gramian matrix as follows:

$$\begin{aligned} \begin{aligned} {\varvec{G}}_{t+1}&= {\varvec{G}}_{t} - {\varvec{X}}_{t}^{\intercal }{\varvec{X}}_{t} + {\varvec{X}}_{t+1}^{\intercal }{\varvec{X}}_{t+1}\\&= {\varvec{G}}_{t} - {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t}^{\intercal } {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t} + {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t+1}^{\intercal } {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t+1}. \end{aligned} \end{aligned}$$

The update then becomes: \( {\varvec{G}}_{{\varvec{\varDelta }}} = {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t+1}^{\intercal } {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t+1} - {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t}^{\intercal } {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t}.\)

Lemma 1

Given a \(|{\mathcal {U}}|\times |{\mathcal {I}}|\) user-item matrix \({\varvec{X}}\), its Gramian matrix \({\varvec{G}}\), and updates to \({\varvec{X}}\); the rank of the update of the Gramian matrix \({\varvec{G}}_{{\varvec{\varDelta }}}\) can be upper bounded by two times the number of unique, nonzero rows in \({\varvec{X}}_{{\varvec{\varDelta }}}\): \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}}) \le 2|{\mathcal {U}}_{{\varvec{\varDelta }}}|\).

Proof

As the rank of a matrix is defined as its number of linearly independent row or column vectors, a (possibly loose) upper bound for \({\text {rank}}({\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ]})\) is given by its number of nonzero rows \(|{\mathcal {U}}_{{\varvec{\varDelta }}}|\). Consequently, the rank of the Gramian matrix has the same bound: \({\text {rank}}({\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}}, \cdot ]}^{\intercal }{\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ]}) \le |{\mathcal {U}}_{{\varvec{\varDelta }}}|\). It is well known that the rank of the sum of two matrices is less than or equal to the sum of the ranks of the individual matrices. Bringing those together, we have that \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}}) \le 2|{\mathcal {U}}_{{\varvec{\varDelta }}}|\). \(\square \)

This upper bound on \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}})\) holds for any update to \({\varvec{X}}\). When the users in the update are disjoint from those in \({\varvec{X}}_{t}\), the bound can be tightened to \(|{\mathcal {U}}_{{\varvec{\varDelta }}}|\). For general-purpose use cases, it is not always feasible to ensure that users in the update do not appear with partial histories in previous iterations of the model. For specific applications such as session-based recommendation, however, it is common practice to train models on the session-item matrix, which satisfies this assumption by definition (Ludewig and Jannach 2018).

Lemma 2

Given a \(|{\mathcal {U}}|\times |{\mathcal {I}}|\) user-item matrix \({\varvec{X}}\), its Gramian matrix \({\varvec{G}}\), and updates to \({\varvec{X}}\) that only consist of adding new rows or altering previously zero-rows; the rank of the update of the Gramian matrix \({\varvec{G}}_{{\varvec{\varDelta }}}\) can be upper bounded by the number of rows being added or altered: \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}}) \le |{\mathcal {U}}_{{\varvec{\varDelta }}}|\).

Proof

When the update only pertains to new users, this means that \({\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t} = 0\), which ensures that \(\text {rank}({\varvec{G}}_{{\varvec{\varDelta }}}) = \text {rank}({\varvec{X}}_{{\varvec{\varDelta }}})\). Because \({\text {rank}}({\varvec{X}}_{{\varvec{\varDelta }}})\) is bounded by \(|{\mathcal {U}}_{{\varvec{\varDelta }}}|\) by definition, so is \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}})\): \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}}) \le |{\mathcal {U}}_{{\varvec{\varDelta }}}|\). \(\square \)

We have provided bounds for \(\text {rank}({\varvec{G}}_{{\varvec{\varDelta }}})\) by focusing on the number of users that have contributed interactions in the new batch of data that we wish to include into the model. Analogously, in some settings, it might be easier to bound the number of unique items that are being interacted with. In a news recommendation setting, for example, a new batch of data might consist of only a very limited number of items (in the order of hundreds) being read by a much higher number of users (hundreds of thousands). In this case, we can straightforwardly extend Lemmas 1 and 2 to bound the rank by the number of independent columns in \({\varvec{X}}\) as opposed to its rows. The further reasoning and results follow trivially, bounding \(\text {rank}({\varvec{G}}_{{\varvec{\varDelta }}})\) by \(2|{\mathcal {I}}_{{\varvec{\varDelta }}}|\) and \(|{\mathcal {I}}_{{\varvec{\varDelta }}}|\), respectively. Whereas the original \(\textsc {ease}^{\textsc {r}}\) approach and the need to iteratively retrain would make it a poor choice for applications with possibly vast item catalogues but smaller active item catalogues, such as catalogues of news articles, the presented upper bounds theoretically show why \(\textsc {dyn-ease}^{\textsc {r}}\) can provide an efficient updating mechanism.
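These bounds are cheap to evaluate in practice. The helper below (a hypothetical illustration of our own, not part of the released code) computes them directly from a new batch of (user, item) pairs:

```python
def rank_upper_bound(batch) -> int:
    """General bound on rank(G_delta) from Lemma 1 and its item-side analogue.
    The factor 2 can be dropped for whichever side (users or items) consists
    exclusively of previously unseen rows or columns, as in Lemma 2."""
    users = {u for u, _ in batch}
    items = {i for _, i in batch}
    return min(2 * len(users), 2 * len(items))
```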

[Algorithm 3: approximate \(\textsc {dyn-ease}^{\textsc {r}}\), updating \(\hat{{\varvec{P}}}\) with a rank-k truncated eigendecomposition of \({\varvec{G}}_{{\varvec{\varDelta }}}\)]

3.4 Approximate \(\textsc {dyn-ease}^{\textsc {r}}\) updates via truncated eigendecomposition

Naturally, the rank of the update will not always be low in general recommendation use-cases. The easiest counter-example to think of is the case where we wish to include k user-item interactions that pertain to k new and unique users as well as k unique items. This will lead to a diagonal-like structure of \({\varvec{X}}_{{\varvec{\varDelta }}}\) and \(\text {rank}({\varvec{X}}_{{\varvec{\varDelta }}}) = k\), which is problematic for large values of k. However, it is also not hard to see that incorporating such a batch of data into our model will not affect any of our personalisation capabilities. Indeed, as \(\textsc {ease}^{\textsc {r}}\) exploits signal from item co-occurrences, data where no item co-occurrences are present is practically useless, even though it is full-rank. Although this is a contrived example, it serves to illustrate that the rank of the update is not necessarily synonymous with its informational value.

In these cases, we can still resort to updating our model \(\hat{{\varvec{P}}}\) with a low-rank approximation of \({\varvec{G}}_{{\varvec{\varDelta }}}\) without hurting the performance of the updated model. Instead of computing the rank and a full eigendecomposition of the update to the Gramian, as in Algorithm 2, we can choose the rank k at which we wish to truncate, and update \(\hat{{\varvec{P}}}\) with a low-rank approximation \(\tilde{{\varvec{G}}}_{{\varvec{\varDelta }}}\) instead of the exact update. The resulting procedure is shown in Algorithm 3; it provides a tuneable trade-off between the exactness of the acquired solution and the efficiency of incremental updates.
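The truncation step can be sketched as follows (our own toy illustration with an arbitrary random matrix): keeping only the k eigenpairs of largest magnitude yields the best rank-k approximation of \({\varvec{G}}_{{\varvec{\varDelta }}}\) in the Frobenius sense, after which the Woodbury update proceeds exactly as in the exact variant.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import eigsh

n, k = 1000, 32
M = sparse_random(n, n, density=0.01, random_state=0)
G_delta = M + M.T                              # toy sparse, symmetric "update" matrix
vals, Q = eigsh(G_delta, k=k, which='LM')      # k leading eigenpairs (largest magnitude)
G_tilde = (Q * vals) @ Q.T                     # optimal rank-k approximation of G_delta
error = np.linalg.norm(G_tilde - G_delta.toarray(), 'fro')  # reconstruction error
```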

Interestingly, this type of approximate update is closely related to yet another extension of the slim paradigm: Factored Item Similarity Models (fism) (Kabbur et al. 2013). In fism, the similarity matrix \({\varvec{S}}\) is modelled as the product of two lower-dimensional latent factor matrices. The resulting low-rank model is shown to be increasingly effective as the sparsity in the user-item interactions it learns from increases, highlighting that this type of approximation does not necessarily imply a decrease in recommendation accuracy. In approximate \(\textsc {dyn-ease}^{\textsc {r}}\), we do not directly model the similarity matrix \({\varvec{S}}\) as factorised, but we update \({\varvec{S}}\) with a factorised version of the update to the Gramian \({\varvec{G}}_{{\varvec{\varDelta }}}\). Factorised models such as fism or approximate \(\textsc {dyn-ease}^{\textsc {r}}\) also bear resemblance to models that are often used in natural language processing applications. Indeed, the well-known word2vec algorithm to learn word embeddings for natural language processing applications implicitly learns to factorise a matrix holding the (shifted positive) pointwise mutual information between word-context pairs (Mikolov et al. 2013; Levy and Goldberg 2014).

Although our motivation for approximate \(\textsc {dyn-ease}^{\textsc {r}}\) is rooted in reducing the computational cost of exact \(\textsc {dyn-ease}^{\textsc {r}}\), the advantages of transitivity that come with low-rank representations can significantly impact recommendation performance as well. Imagine items \(a,b,c \in {\mathcal {I}}\) where (a, b) and (b, c) co-occur in the training data of user histories, but (a, c) does not. Full-rank \(\textsc {ease}^{\textsc {r}}\) cannot infer a correlation between a and c in such a setting, whereas low-rank models can learn a latent factor that unifies a, b and c. This explains the advantage that low-rank models have in sparse data environments. For further insights on the advantages, differences and analogies between full-rank and low-rank models, we refer the interested reader to the work of Van Balen and Goethals (2021).

As we are factorising \({\varvec{G}}_{{\varvec{\varDelta }}}\) by its truncated eigendecomposition, we are guaranteed to end up with the optimal rank-k approximation with respect to the mean squared error between \(\tilde{{\varvec{G}}}_{{\varvec{\varDelta }}}\) and \({\varvec{G}}_{{\varvec{\varDelta }}}\). Naturally, with the highly sparse nature of \({\varvec{G}}_{{\varvec{\varDelta }}}\), this optimal approximation will focus on reconstructing entries with large values, and rows or columns with many nonzero values. This corresponds to focusing on the items that occur most often in the new incoming batch of user-item interactions \({\mathcal {P}}_{\varDelta }\). Because of this, we can expect approximate \(\textsc {dyn-ease}^{\textsc {r}}\) to favour recently popular items, which can give an additional performance boost in the right application areas. Nevertheless, an in-depth discussion or validation of the efficacy of factorised \(\textsc {ease}^{\textsc {r}}\)-like models falls outside the scope of this work, as we focus on the efficiency with which the model can be updated. If the cut-off rank k is lower than the true rank of the update, approximate \(\textsc {dyn-ease}^{\textsc {r}}\) guarantees an improvement in terms of the computational complexity of the update procedure.

4 Experimental results and discussion

The goal of this section is to validate that the methods we proposed in earlier sections of this manuscript work as expected, and to investigate whether expectations grounded in theory can be substantiated with empirical observations. Concretely, the research questions we wish to answer are the following:

RQ1:

Can exact \(\textsc {dyn-ease}^{\textsc {r}}\) provide more efficient model updates in comparison with iteratively retrained \(\textsc {ease}^{\textsc {r}}\)?

RQ2:

Can our theoretical analysis on the correlation between rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) and the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\) set realistic expectations in practice?

RQ3:

Do the phenomena we describe for bounding rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) occur in real-world session-based or news recommendation datasets?

RQ4:

Can approximate \(\textsc {dyn-ease}^{\textsc {r}}\) provide a sensible trade-off between recommendation efficiency and effectiveness?

Table 1 Datasets we adopt throughout the experiments presented in this work, along with their source and summary statistics that describe the user-item interactions and their sparsity

Table 1 shows the publicly available datasets we use throughout our experiments in an attempt to provide empirical answers to the above-mentioned research questions. The well-known MovieLens dataset (Harper and Konstan 2015) consists of explicit ratings (on a 1–5 scale) that users have given to movies, along with the time of rating. We drop ratings lower than 3.5 and treat the remainder as binary preference expressions. Additionally, we only keep users and items that appear at least 3 times throughout the dataset. This type of pre-processing is common, and ensures we are left with positive preference expressions that carry enough signal for effective personalisation (Liang et al. 2018; Beel and Brunel 2019). We take the newest and largest variant of the dataset as our starting point: MovieLens-25M. Many recommender systems applications are based on shorter browsing sessions rather than full user histories that might span years (Ludewig and Jannach 2018). As laid out in Sect. 3.3, these set-ups can be especially amenable to our approach, as the adoption of these shorter sessions instead of longer user histories naturally decreases the rank of the update to the Gramian. We adopt two well-known datasets for session-based recommender systems: the YooChoose dataset, released in the context of the 2015 ACM RecSys Challenge (Ben-Shimon et al. 2015); and the RetailRocket dataset (Kaggle 2016). These datasets consist of implicit feedback (clicks) from users on retail products, and we compute the 3-core for users and items in the same manner we did for MovieLens-25M, after removing repeated user-item interactions. To validate our intuitions regarding \(\textsc {dyn-ease}^{\textsc {r}}\) and the rank of the Gramian in news recommendation setups, we use the Adressa and Microsoft News datasets (MIND) (Gulla et al. 2017; Wu et al. 2020). These datasets contain implicit feedback inferred from browsing behaviour on news websites; we pre-process them analogously to the other datasets.
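The pre-processing described above can be implemented as an iterative k-core filter; the sketch below (our own illustration with assumed column names, not the exact script in our repository) shows one common way to do so with pandas:

```python
import pandas as pd

def k_core(df: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    """Repeatedly drop users and items with fewer than k interactions until stable.
    Assumes a DataFrame with columns ['user', 'item', 'timestamp']."""
    while True:
        user_counts = df['user'].value_counts()
        item_counts = df['item'].value_counts()
        keep = (df['user'].map(user_counts) >= k) & (df['item'].map(item_counts) >= k)
        if keep.all():
            return df
        df = df[keep]
```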

Some datasets have prohibitively large item catalogues for \(\textsc {ease}^{\textsc {r}}\) to compute the inverse Gramian at once. However, the large majority of items are often at the extreme end of the so-called “long tail”, only being interacted with once or twice. We prune these items to keep the \(\textsc {ease}^{\textsc {r}}\) computation feasible but still highlight the advantages of \(\textsc {dyn-ease}^{\textsc {r}}\).

Note that these pruning operations on rare items significantly cut down computation time for \(\textsc {ease}^{\textsc {r}}\) (which depends directly on \(|{\mathcal {I}}|\)), but they do not give \(\textsc {dyn-ease}^{\textsc {r}}\) an unfair advantage. Items that appear just once in the dataset blow up the size of the Gramian, but do not significantly impact the rank of the Gramian updates. Indeed, in these situations we get that \(k \ll |{\mathcal {I}}|\), and the computational advantages of \(\textsc {dyn-ease}^{\textsc {r}}\) over \(\textsc {ease}^{\textsc {r}}\) become even more pronounced. We adopt such pruning because it is common practice and it keeps the computational needs for reproducing our experiments reasonable. The reason we do not further explore other well-known datasets, such as the Million Song Dataset (MSD) (Bertin-Mahieux et al. 2011), is that they do not include timestamps indicating when the user-item interactions occurred. Because of this, they are unsuited for evaluating a realistic scenario where models are incrementally retrained over time.

The final dataset we adopt is the SuperMarket Dataset with Implicit feedback (SMDI) introduced by Viniski et al. Because this dataset has a comparatively small item catalogue, the computation time for all \(\textsc {ease}^{\textsc {r}}\) variants is in the order of seconds and largely dominated by variance and system overhead. We adopt the SMDI dataset to study the recommendation performance of approximate \(\textsc {dyn-ease}^{\textsc {r}}\), as it exhibits a distribution shift that is largely absent in the other datasets we consider.

To foster the reproducibility of our work, all source code for the experiments we have conducted is publicly available under an open-source license at github.com/olivierjeunen/dynamic-easer/. All code is written in Python 3.7 using SciPy (Virtanen et al. 2020). Reported runtimes are wall-clock times, measured on a 14-core Intel Xeon processor. The rest of this section is structured to follow the research questions laid out above.

4.1 Efficiency of exact \(\textsc {dyn-ease}^{\textsc {r}}\) (RQ1)

To verify the runtime gains of exact \(\textsc {dyn-ease}^{\textsc {r}}\) over iteratively retrained \(\textsc {ease}^{\textsc {r}}\), we chronologically split the user-item interactions at a fixed timestamp t, yielding all user-item interactions up to t and all those after t. The Microsoft News Dataset comes with users' reading histories and clicks on shown recommendations, but the reading histories do not include timestamps. Because of this, we treat these historical interactions as “early data” included in the original \(\textsc {ease}^{\textsc {r}}\) computation, and incorporate the timestamped clicks chronologically into \(\textsc {dyn-ease}^{\textsc {r}}\) via the procedure described below.

We train an \(\textsc {ease}^{\textsc {r}}\) model on the early batch and log the runtime in seconds needed for this computation. This operation is repeated over 5 runs, and we report a 95% Gaussian confidence interval. As new incoming user-item interactions do not affect the dimension of the Gramian matrix that needs to be inverted, the runtime needed to compute \(\textsc {ease}^{\textsc {r}}\) remains fairly constant as new user-item interactions are added.
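For reference, the following is a minimal sketch of the closed-form \(\textsc {ease}^{\textsc {r}}\) computation being timed here, following the formulation of Steck (2019a); the user-item matrix and the regularisation strength below are synthetic, illustrative stand-ins rather than our experimental configuration.

```python
import time
import numpy as np
from scipy.sparse import random as sparse_random

def train_easer(X, l2=500.0):
    """Closed-form EASE^R: a dense item-item weight matrix with a zero diagonal."""
    G = (X.T @ X).toarray().astype(np.float64)    # item-item Gramian
    G[np.diag_indices_from(G)] += l2              # ridge regularisation
    P = np.linalg.inv(G)                          # the single matrix inversion
    B = P / (-np.diag(P))                         # B_ij = -P_ij / P_jj
    B[np.diag_indices_from(B)] = 0.0              # zero-diagonal constraint
    return B

# Synthetic stand-in for the "early" user-item matrix; l2 = 500 is illustrative.
X_early = sparse_random(10_000, 2_000, density=1e-3, format='csr', random_state=0)
start = time.perf_counter()
B = train_easer(X_early)
print(f"EASE^R computed in {time.perf_counter() - start:.2f}s")
```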

Over the newer batch of data, we employ a non-overlapping sliding window that chronologically generates batches of data to be included in the existing model via our proposed exact \(\textsc {dyn-ease}^{\textsc {r}}\) procedure. The width of this window \(\delta \) is varied to study its effect on the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\). Larger values of \(\delta \) imply larger update batches, which will often lead to an increase in rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)). Naturally, when \(\delta \) becomes too large, a point is reached where the overhead induced by our incremental updating method becomes prohibitive, and it becomes favourable to fully retrain the \(\textsc {ease}^{\textsc {r}}\) model. Sensible values of \(\delta \) come with a restriction: when the runtime of the model update exceeds \(\delta \), the procedure cannot keep up with incoming data in real time. We do not encounter this issue for any of the values of \(\delta \) explored in our experiments, which suggests that \(\textsc {dyn-ease}^{\textsc {r}}\) can be a good fit for various configurations.
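The update being timed rests on the eigendecomposition of the update to the Gramian combined with the Woodbury identity (cf. Sect. 3). The sketch below illustrates the core idea under simplifying assumptions and with illustrative function names; the released code implements the full \(\textsc {dyn-ease}^{\textsc {r}}\) bookkeeping, and the approximate variant of Sect. 4.4 simply caps the number of eigen-pairs k.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def dyn_update(P, G_delta, k):
    """Fold a rank-k change of the Gramian into the precomputed inverse P = (G + l2*I)^{-1}.

    G_delta is the symmetric (sparse) update to the Gramian for one window. Its
    eigendecomposition G_delta = V diag(w) V^T feeds the Woodbury identity, so the
    dominant costs scale with k = rank(G_delta) rather than with the catalogue size.
    """
    w, V = eigsh(G_delta.asfptype(), k=k)                # k eigen-pairs (largest magnitude)
    PV = P @ V                                           # |I| x k
    core = np.linalg.inv(np.diag(1.0 / w) + V.T @ PV)    # k x k system (w assumed non-zero)
    return P - PV @ core @ PV.T

def weights_from_inverse(P):
    """Re-derive the item-item weights B from an (updated) inverse P."""
    B = P / (-np.diag(P))                                # B_ij = -P_ij / P_jj
    B[np.diag_indices_from(B)] = 0.0                     # zero-diagonal constraint
    return B
```

Because the \(|{\mathcal {I}}| \times k\) products and the \(k \times k\) solve dominate, shorter windows, and hence lower rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)), translate directly into cheaper updates.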

Figure 1 visualises the resulting runtimes of the procedure laid out above on all five considered datasets. Time progresses along the x-axis, and the runtime for varying values of \(\delta \) is shown on the y-axis. The explored values of \(\delta \) differ per dataset and use-case: for the MovieLens dataset, which spans 25 years, daily updates might be sufficient; for the Adressa news recommendation dataset, which spans 3 months, more frequent 5-minute updates are more appropriate to keep up with the highly dynamic nature of the environment.

Fig. 1 Runtime results for \(\textsc {dyn-ease}^{\textsc {r}}\) updated with different intervals (sliding window width \(\delta \)), as compared to iteratively retrained \(\textsc {ease}^{\textsc {r}}\) over the final N days of the datasets. We observe that for a wide range of interval widths \(\delta \), \(\textsc {dyn-ease}^{\textsc {r}}\) can provide significant efficiency gains. When \(\delta \) becomes too large, the overhead that comes with incremental updates becomes too high and a full retrain with vanilla \(\textsc {ease}^{\textsc {r}}\) is favourable

We included values of \(\delta \) that push the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\) up to that of \(\textsc {ease}^{\textsc {r}}\) to highlight the limitations of our approach. Provided that the computing power and infrastructure are available, however, \(\delta \) can be decreased to bring \(\textsc {dyn-ease}^{\textsc {r}}\)'s runtime into the desirable range. Note that this limitation on \(\delta \) is general to online learning approaches that learn from user-item interactions, and not specific to the methods we propose in this work.

From the runtime results, we observe that our proposed method yields significant efficiency improvements over iterative model retraining, for a wide range of settings. Across all datasets, we observe a clear trend toward lower runtimes for shorter sliding windows and more frequent updates, as expected from our theoretical results.

As the MovieLens-25M dataset spans several decades, the number of new user-item interactions to be incorporated on a daily basis remains modest. Exploring lower values of \(\delta \) would therefore not provide additional insights into the performance of \(\textsc {dyn-ease}^{\textsc {r}}\). As a consequence, we obtain a clean separation between the runtimes of \(\textsc {dyn-ease}^{\textsc {r}}\) for batches of different sizes (i.e. different values of \(\delta \)).

The remaining four datasets represent session- and news-based recommendation environments, which are known to be much more fast-paced and dynamic. Because we focus on smaller sliding window lengths \(\delta \) here, we clearly see daily seasonal patterns emerging. Indeed, \(\textsc {dyn-ease}^{\textsc {r}}\) runtime peaks coincide with peaks in website traffic. As the rank of the update is typically correlated with the number of users or items in the update, this phenomenon is to be expected. It highlights that \(\textsc {dyn-ease}^{\textsc {r}}\) is able to effectively target those model parameters that need updating, and does not spend unnecessary computing cycles on unchanged parts of the model. Note that \(\delta \) does not need to be a fixed constant in real-world applications. An effective use of computing power might decrease and increase \(\delta \) during traffic peaks and valleys, respectively.

4.2 Correlating rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) and runtime of exact \(\textsc {dyn-ease}^{\textsc {r}}\) (RQ2)

The runtime of the incremental updates shown in Fig. 1 is visualised against the rank of the updates in Fig. 2. We clearly observe a strong correlation between the rank of the update to the Gramian and the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\), with a trend that is consistent over varying values of \(\delta \).

We fit a polynomial of the form \(f(x) = a\cdot x^{b} + c\) on a randomly sampled subset of 90% of the measurements, and assess its ability to predict the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\) from rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) on the remaining 10%. Table 2 shows the optimal parameters, the number of samples (runtime measurements) and the root mean squared error (RMSE) on the test sample for every dataset. Figure 2 qualitatively shows that we can predict the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\) updates with reasonable accuracy when the rank of the update is known. Combined with the bounds on this quantity laid out in Sect. 3.3, this allows us to set an expected upper bound on the computation time of our incremental updates through \(\textsc {dyn-ease}^{\textsc {r}}\). Table 2 quantitatively shows the magnitude of the errors, reinforcing the insights obtained qualitatively. Note that whereas the absolute RMSE increases for datasets with larger item catalogues, the relative error of the model remains fairly constant: a mean error of 5 seconds on a prediction of 10 seconds is not equivalent to being 5 seconds off when the order of magnitude is 1000 seconds. These empirical observations, together with the theoretical analysis presented in Sect. 3, highlight the efficiency and favourable scalability of the proposed \(\textsc {dyn-ease}^{\textsc {r}}\) procedure.
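For completeness, the following is a minimal sketch of this fitting procedure using scipy.optimize.curve_fit; the (rank, runtime) measurements below are synthetic stand-ins for the logged data.

```python
import numpy as np
from scipy.optimize import curve_fit

def runtime_model(x, a, b, c):
    # The functional form fitted in Table 2: f(x) = a * x^b + c.
    return a * np.power(x, b) + c

# Synthetic stand-ins for the logged (rank, runtime) measurements.
rng = np.random.default_rng(0)
ranks = rng.integers(10, 2_000, size=400).astype(float)
runtimes = runtime_model(ranks, 5e-4, 1.6, 2.0) + rng.normal(0.0, 1.0, size=ranks.size)

# 90/10 split: fit on the training part, report RMSE on the held-out part.
train = rng.random(ranks.size) < 0.9
(a, b, c), _ = curve_fit(runtime_model, ranks[train], runtimes[train],
                         p0=(1e-3, 1.5, 0.0), maxfev=10_000)
preds = runtime_model(ranks[~train], a, b, c)
rmse = float(np.sqrt(np.mean((preds - runtimes[~train]) ** 2)))
print(f"a = {a:.2e}, b = {b:.2f}, c = {c:.2f}, test RMSE = {rmse:.2f}s")
```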

Fig. 2 Runtime for the incremental updates from Fig. 1 plotted against the rank of the update to the Gramian matrix \({\varvec{G}}_{{\varvec{\varDelta }}}\). We observe a strong correlation between higher values of rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) and runtime, as well as a correlation between \(\delta \) and higher rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)). This result highlights that rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) is the main driving factor behind \(\textsc {dyn-ease}^{\textsc {r}}\)'s computational complexity, and that bounding it can give realistic expectations for \(\textsc {dyn-ease}^{\textsc {r}}\) efficiency in practice

Table 2 Resulting polynomial model to predict runtime from rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)), along with the root mean squared error (RMSE) it attains and the number of observations N it was fitted on

4.3 Analysing bounds for rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) (RQ3)

Figure 3 compares the rank of the incremental updates from Fig. 1 to summary statistics of the corresponding batches of user-item interactions. This visualisation assesses how tight the upper bounds laid out in Sect. 3.3 are in practice, and provides a better understanding of the underlying dynamics of every dataset.

We observe that, for both the general-purpose MovieLens-25M dataset and the session-based datasets, the user-focused bound approximates the rank of the update to the Gramian reasonably well. This is in line with our theoretical expectations, and confirms that the number of unique users in a given batch of user-item interactions is the main driving factor for rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)). We further see that the upper bound becomes looser as the number of unique users grows. This too is expected behaviour, as it becomes less likely for a new user's behaviour to be linearly independent of the other users in the batch as the batch size grows. As mentioned in Sect. 3.3, the upper bound of \(2|{\mathcal {U}}|\) could be tightened to \(|{\mathcal {U}}|\) if we did not perform a hard split on time but rather divided user sessions into a “finished” and an “ongoing” set. This phenomenon occurs naturally for the YooChoose dataset, where we clearly see that the \(2|{\mathcal {U}}|\) bound is much looser. Note that the tight bounds for MovieLens might change if the dataset included timestamps for item consumption rather than rating, as the majority of users might then watch a smaller set of current series or movies. Such a recency bias would shrink the active item catalogue, favouring the item-based bounds.

In contrast, for the news datasets the bound based on the number of unique items is much tighter, providing an almost perfect approximation in many cases. This confirms our intuition that the rank of the update in these settings is fully determined by the number of active items in the catalogue and virtually independent of the number of users or interactions in a given batch. This, in turn, makes these environments especially amenable to our \(\textsc {dyn-ease}^{\textsc {r}}\) approach.

Fig. 3 Upper bounds for the batches of incremental updates from Fig. 1 plotted against the rank of the update to the Gramian matrix \({\varvec{G}}_{{\varvec{\varDelta }}}\). We observe that different applications, implying different data characteristics, bound the rank of the update in different ways. These results confirm our expectation that, whereas the user-focused upper bound is a good approximation for the MovieLens and RetailRocket datasets, the item-focused bound is tighter in news recommendation settings. Moreover, we observe favourable behaviour for the rank of the update as a function of the number of interactions, further highlighting \(\textsc {dyn-ease}^{\textsc {r}}\)'s scalability

The number of unique users or items in a batch can thus give rise to reasonably tight upper bounds on the rank of the update in realistic scenarios on real-world datasets. The absolute number of user-item interactions \(|{\mathcal {P}}|\) provides another (impractical) bound on the rank of the update; indeed, in a worst-case scenario, every user-item interaction would pertain to a unique user and a unique item. We include the visualisation of the relation between rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) and \(|{\mathcal {P}}|\) to intuitively show that our proposed approach scales favourably with the size of the data, a property that is most appreciated in highly dynamic environments with ever-growing dataset sizes.
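The following is a minimal sketch of how the quantities plotted in Fig. 3 can be computed for a single update window, assuming illustrative names for the interaction batch and the user-item matrices before and after the update; the precise bounds are those derived in Sect. 3.3, and the batch statistics below merely serve as proxies.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def rank_and_batch_stats(batch: pd.DataFrame, X_old: csr_matrix, X_new: csr_matrix) -> dict:
    """Rank of the Gramian update for one window, next to the batch statistics of Fig. 3.

    `batch` holds the window's user-item interactions (columns 'user' and 'item');
    X_old and X_new are the user-item matrices before and after incorporating it,
    with identical item (column) indexing. Computing the exact rank densifies the
    item-item matrix, so this is only meant for offline analysis on modest catalogues.
    """
    G_delta = (X_new.T @ X_new - X_old.T @ X_old).toarray()
    return {
        'rank(G_delta)': int(np.linalg.matrix_rank(G_delta)),
        'user-based statistic (2|U|)': 2 * int(batch['user'].nunique()),
        'item-based statistic (|I| in batch)': int(batch['item'].nunique()),
        'interactions (|P|)': len(batch),
    }
```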

4.4 Efficiency and effectiveness of approximate \(\textsc {dyn-ease}^{\textsc {r}}\) (RQ4)

Finally, we wish to validate the efficiency and efficacy of approximate updates to \(\textsc {ease}^{\textsc {r}}\)-like models. Specifically, we wish to understand the trade-off between runtime and recall for models that are iteratively updated as new data comes in. We report experimental results for runtime and recommendation accuracy on both the Adressa and SMDI datasets. This experiment is not repeated on the other datasets, as they are less suited to this type of experimental evaluation. MovieLens-25M spans too long a time period, and the effect of model retraining on recommendation accuracy is insignificant. YooChoose and RetailRocket focus on shorter user sessions, which makes it hard for SW-EVAL to reach statistically significant conclusions. Lastly, the Microsoft News Dataset contains bandit feedback, which differs from the organic user-item interactions we tackle in this work. This was not an issue when evaluating the models' computational cost, but it prevents a proper evaluation of recommendation accuracy in a common manner.

To illustrate the advantages of approximate \(\textsc {dyn-ease}^{\textsc {r}}\), we make use of the Sliding Window Evaluation (SW-EVAL) technique (Jeunen et al. 2018; Jeunen 2019). We train a model on all user-item interactions that have occurred up to time t. For a fixed sliding window width \(\delta _{\text {update}}\), we periodically update the model with new incoming data, for both the exact and approximate \(\textsc {dyn-ease}^{\textsc {r}}\) variants. A concurrent sliding window of width \(\delta _{\text {eval}}\) dictates the evaluation period, in which every competing model is evaluated on its ability to predict which items users will interact with next. This experimental procedure is formalised in Algorithm 4 and sketched below. We set \(\delta _{\text {update}} = 60\)min and \(\delta _{\text {eval}}=120\)min for the Adressa dataset and evaluate over the final 24 hours, and \(\delta _{\text {update}} = 6\)h and \(\delta _{\text {eval}}=3\)d for the final 120 days of the SMDI dataset. This difference in order of magnitude keeps the overall runtime of the experiments reasonable, and ensures sufficiently large evaluation samples for statistically significant results.
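A minimal sketch of such a sliding-window evaluation loop is given below, with hypothetical model and evaluation interfaces; Algorithm 4 and the released code define the exact procedure we follow.

```python
import pandas as pd

def sw_eval(interactions, t_start, t_end, delta_update, delta_eval, models, evaluate):
    """Sketch of a sliding-window evaluation loop in the spirit of SW-EVAL.

    `interactions` is a DataFrame with a 'timestamp' column; `models` maps a name to
    an object exposing update(batch) (a no-op for the stale baseline); `evaluate`
    scores a model on the held-out window. Timestamps and deltas are assumed to be
    comparable types (e.g. pandas Timestamps and Timedeltas). Names are illustrative.
    """
    results, t = [], t_start
    while t < t_end:
        ts = interactions['timestamp']
        update_batch = interactions[(ts >= t) & (ts < t + delta_update)]
        eval_batch = interactions[(ts >= t + delta_update) & (ts < t + delta_update + delta_eval)]
        for name, model in models.items():
            model.update(update_batch)                 # exact or approximate dyn-ease^r
            results.append((t, name, evaluate(model, eval_batch)))
        t += delta_update
    return pd.DataFrame(results, columns=['t', 'model', 'score'])
```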

4.4.1 Computation time for approximate \(\textsc {dyn-ease}^{\textsc {r}}\)

Figure 4 shows the computation time for exact \(\textsc {dyn-ease}^{\textsc {r}}\), as well as for several approximate model variants with varying cut-off ranks k. In terms of runtime improvements, we observe very favourable results for approximate updates. As expected, the computational cost of \(\textsc {dyn-ease}^{\textsc {r}}\)'s updates can largely be attributed to the computation of all eigen-pairs, and limiting the rank has a significant impact on the efficiency of these updates. At cut-off rank \(k = 250\), the computational cost of the updates is decreased by a factor of 3, or 65%.
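To give an intuition for why capping the cut-off rank pays off, the snippet below times SciPy's eigsh for increasing k on a synthetic symmetric sparse matrix standing in for \({\varvec{G}}_{{\varvec{\varDelta }}}\); the sizes and density are illustrative, not those of our datasets.

```python
import time
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import eigsh

# Symmetric sparse stand-in for G_delta; sizes and density are illustrative only.
A = sparse_random(5_000, 5_000, density=1e-3, random_state=0, format='csr')
G_delta = A + A.T

for k in (5, 50, 250):
    start = time.perf_counter()
    w, V = eigsh(G_delta, k=k)        # only the k largest-magnitude eigen-pairs
    print(f"k = {k:>3}: {time.perf_counter() - start:.2f}s")
```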

As we have mentioned above, the computation time for all \(\textsc {ease}^{\textsc {r}}\) variants on the SMDI dataset is in the order of seconds and largely dominated by variance and system overhead. As a result, runtime results on this dataset do not provide significant insights, and we do not report them.

Fig. 4 Runtime results for exact and approximate \(\textsc {dyn-ease}^{\textsc {r}}\) variants, with varying cut-off ranks k. We observe a quick and steady decline in computation time needed for lower values of k, which can be attributed to less computation time spent finding eigen-pairs

Fig. 5 Recommendation accuracy results for exact and approximate \(\textsc {dyn-ease}^{\textsc {r}}\) variants, with varying cut-off ranks k. We additionally report results for a stale \(\textsc {ease}^{\textsc {r}}\) model that does not incorporate new user-item data over time. We observe that exact \(\textsc {(dyn-)ease}^{\textsc {r}}\)'s recommendation accuracy can largely be retained by approximate variants, as the recommendation accuracy measurements remain within statistical noise of one another for a large range of cut-off ranks k. For \(k = 5\), we observe a statistically significant improvement as time goes on. For \(k=1\), we observe a decrease in recommendation accuracy with respect to the exact model

Table 3 Runtime and recommendation accuracy measurements on the Adressa and SMDI datasets, for the experimental procedure laid out in Algorithm 4

4.4.2 Recommendation accuracy for approximate \(\textsc {dyn-ease}^{\textsc {r}}\)

Figure 5 visualises Recall@K for \(K \in \{1, 5, 10, 20\}\) over time on the Adressa and SMDI datasets. The SMDI dataset exhibits large variance in the number of active users over time, which heavily influences the statistical significance of some evaluation results, as they are based on too few samples. We therefore do not include evaluation results where the evaluation set consists of fewer than 100 users. The denominator for our Recall measure is \(\min (K,|{\mathcal {I}}_{u}^{\text {test}}|)\) instead of the number of held-out items \(|{\mathcal {I}}_{u}^{\text {test}}|\), to ensure that a perfect value of 1 can be attained at all cut-offs K.
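A minimal sketch of this truncated Recall@K measure, with hypothetical argument names:

```python
def recall_at_k(recommended, held_out, k):
    """Recall@K with a min(K, |I_u^test|) denominator, so a perfect score of 1 is attainable."""
    hits = sum(1 for item in recommended[:k] if item in held_out)
    return hits / min(k, len(held_out))

# Hypothetical example: two held-out items, one of them recovered in the top-2.
print(recall_at_k(recommended=[3, 7, 1], held_out={7, 42}, k=2))   # 0.5
```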

In addition to several approximate model variants, we include recommendation accuracy results for a model that is not updated over time, yielding a lower bound on the accuracy we can expect from approximate updates. On the Adressa dataset, we observe that such a stale model quickly deteriorates over time. This is to be expected, as the Adressa dataset and the news recommendation application are heavily biased toward recent items and interactions. This bias is less clear on the SMDI dataset at lower cut-offs K, but clearly manifests itself for Recall@20 near the end of the evaluation period.

For both datasets, we observe that the accuracy of approximate \(\textsc {dyn-ease}^{\textsc {r}}\) variants at high values of k is not statistically significantly different from that of exact \(\textsc {dyn-ease}^{\textsc {r}}\). This highlights that the N-fold improvement in computational cost can come at a negligible cost in recommendation accuracy, showing the advantages of approximate computation. For Adressa, we observe a statistically significant improvement over exact \(\textsc {dyn-ease}^{\textsc {r}}\) for low values of k. This can be attributed to the reasons laid out in Sect. 3: the low-rank approximation handles sparsity and transitivity, and favours recently popular items. These model characteristics are highly favourable in news recommendation settings, but might have less influence on supermarket data. Nevertheless, the results highlight that many efficient low-rank updates can yield highly competitive models compared to more costly full-rank updates.

All runtime and recommendation accuracy measurements are aggregated in Table 3, providing further insight into the trade-off between runtime and recommendation accuracy for approximate \(\textsc {dyn-ease}^{\textsc {r}}\). We denote the Recall@K measure as R@K for improved spacing. On the SMDI dataset, the differences in recommendation accuracy among exactly and approximately updated model variants were not found to be statistically significant at the \(p=0.05\) level. The differences between the stale \(\textsc {ease}^{\textsc {r}}\) model and the \(\textsc {dyn-ease}^{\textsc {r}}\) variants, however, are statistically significant.

The size of the Adressa dataset yields more statistical power, and both the differences between stale \(\textsc {ease}^{\textsc {r}}\) and \(\textsc {dyn-ease}^{\textsc {r}}\) and those between exact \(\textsc {dyn-ease}^{\textsc {r}}\) and approximate \(\textsc {dyn-ease}^{\textsc {r}}\) with \(k \in \{1,5\}\) were found to be statistically significant.

5 Conclusion

Linear item-based models are an attractive choice for many collaborative filtering tasks due to their conceptual simplicity, interpretability, and recommendation accuracy. Recent work has shown that the analytical solution that is available for ridge regression can significantly improve the scalability of such methods, with a state-of-the-art algorithm called \(\textsc {ease}^{\textsc {r}}\) (Steck 2019a). \(\textsc {ease}^{\textsc {r}}\) consists of a single matrix inversion of the Gramian. As its computational complexity depends on neither the number of users nor the number of user-item interactions in the training set, it is particularly well suited to use-cases with many users or interactions, with the sole constraint that the item catalogue remains limited in size.

When deployed in real-world applications, models often need to be periodically recomputed to incorporate new data and to account for newly available items, shifting user preferences, and general concept drift. Iteratively retraining an \(\textsc {ease}^{\textsc {r}}\)-like model from scratch puts additional strain on such systems, imposes a hard upper limit on the attainable frequency of model updates, and can drive up computational costs. This especially limits the application of \(\textsc {ease}^{\textsc {r}}\) in domains where item recency is an important factor in deciding item relevance, such as retail or news recommendation.

In this work, we propose a novel and exact updating algorithm for embarrassingly shallow auto-encoders that combines parts of the Dynamic Index algorithm (Jeunen et al. 2019) with the Woodbury matrix identity (Hager 1989): Dynamic \(\textsc {ease}^{\textsc {r}}\) (\(\textsc {dyn-ease}^{\textsc {r}}\)). We have provided a thorough theoretical analysis of our proposed approach, highlighting in which cases it can provide a considerable advantage over iteratively retrained \(\textsc {ease}^{\textsc {r}}\), and in which cases it does not. These theoretical insights are corroborated by empirical results from extensive experiments, showing that \(\textsc {dyn-ease}^{\textsc {r}}\) is well suited for efficient and effective online collaborative filtering in various real-world applications.

\(\textsc {dyn-ease}^{\textsc {r}}\) exploits the sparse and symmetric structure of the Gramian to efficiently compute the eigendecomposition of the Gramian update. When the rank of the update is large, however, this operation can still become prohibitively expensive. To mitigate this problem, we have additionally proposed an approximate \(\textsc {dyn-ease}^{\textsc {r}}\) variant that uses a low-rank approximation of the Gramian update instead of its exact decomposition. Empirical results highlight further efficiency improvements at a small cost in recommendation accuracy. Our work broadens the scope of problems for which item-based models based on ridge regression are an appropriate choice in practice. To foster the reproducibility of our work, the source code for all our experiments is publicly available under an open-source license at github.com/olivierjeunen/dynamic-easer.

A promising area for future work is to further improve \(\textsc {dyn-ease}^{\textsc {r}}\)’s computational efficiency by looking at alternative (approximate) matrix decompositions that exploit efficient random sampling (Halko et al. 2011; Martinsson et al. 2011), as the bottleneck of our current approach lies in the computation of the exact eigendecomposition of the update to the Gramian. Furthermore, we would like to explore applications of our efficient incremental updating scheme to more general multi-label regression tasks beyond the collaborative filtering use-case we tackle in this work.