1 Introduction

Recommender systems are information retrieval applications that aim to mitigate the problem of “information overload” by matching users to relevant items (Borchers et al. 1998). They have become ubiquitous on the world wide web and have found applications in many different areas, where these items can represent anything from news articles and musical artists to retail products and social media accounts. Most modern approaches to recommendation are based on some form of collaborative filtering (Ekstrand et al. 2011), a family of methods that aim to model user preferences and learn them from a dataset of user behaviour. These methods have enjoyed widespread success over the years, and are the cornerstone of modern recommender systems research. As a consequence, the quest for more effective collaborative filtering algorithms is a very active research area, where significant strides forward are being made every year. Many novel methods are based on deep and nonlinear neural networks, and the expressiveness of this model class has made them ubiquitous in the field (Liang et al. 2018; Elahi et al. 2019; Shenbin et al. 2020). Recent work casts doubt on the reproducibility of evaluation strategies that are often adopted to empirically validate research findings (Dacrema et al. 2019; Rendle 2019; Rendle et al. 2020), making it harder to conclude whether these complex model classes are what the field needs moving forward.

In a parallel line of research, the effectiveness of simpler linear models for the collaborative filtering task has been shown time and again (Ning and Karypis 2011; Levy and Jack 2013; Sedhain et al. 2016; Steck 2019b, c; Steck et al. 2020). Most notably, Embarrassingly Shallow Auto-Encoders (reversed: \(\textsc {ease}^{\textsc {r}}\)) have recently been shown to yield results that are highly competitive with the state of the art, whilst often being much easier to implement and much more efficient to compute (Steck 2019a). The closed-form solution that is available for ridge regression models is at the heart of these major advantages, as \(\textsc {ease}^{\textsc {r}}\) effectively optimises a regularised least-squares problem. Recently, \(\textsc {ease}^{\textsc {r}}\) has been extended to incorporate item metadata into two variants: \(\textsc {cease}^{\textsc {r}}\) and \(\textsc {add-ease}^{\!\textsc {r}}\) (Jeunen et al. 2020). These extensions improve the capabilities of closed-form linear models to deal with issues such as the “long tail” (very few items account for the large majority of interactions) and “cold start” (new items do not have any interactions) (Schein et al. 2002; Park and Tuzhilin 2008; Shi et al. 2014).

The main benefit of \(\textsc {ease}^{\textsc {r}}\) and its variants over competing approaches is their computational efficiency. As the core algorithm consists of a single inversion of the Gramian item-item matrix, it is often many times more efficient to compute than models relying on iterative optimisation techniques. As reported in the original paper, the algorithm can be implemented in just a few lines of Python and is typically computed in the order of minutes on several commonly used, publicly available benchmark datasets (Steck 2019a). Nevertheless, matrix inversion is known to scale poorly for large matrices, and \(\textsc {ease}^{\textsc {r}}\)’s reliance on it does inhibit its adoption in use-cases with large item catalogues. In such cases, methods that rely on gradient-based optimisation techniques are still preferable.

To make matters worse, real-world systems rarely rely on a single model that is computed once and then deployed. To make this concrete: suppose we operate a hypothetical retail website, and we wish to send out an e-mail with a top-N list of personalised recommendations to our subscribed users every few days. Naturally, the model that generates these recommendation lists should evolve over time, preferably incorporating new user-item interactions that have occurred over the past days. The importance of having such a dynamic model is threefold: (1) it will generate more novel and diverse recommendations than its static counterpart (Castells et al. 2015), (2) it will be able to combat concept drift in the data (due to shifting item popularity or seasonality trends in preferences) (Gama et al. 2014), and (3) it will have the means to handle cold-start problems when new items or new users appear (Schein et al. 2002).

Many modern digital systems generate new data at increasingly fast rates, and this is no different for our hypothetical retail website. This is important to take into account when choosing a recommendation algorithm. Models that are already expensive to compute will only see this problem exacerbated when the predominant approach is to recompute them from scratch, every few days, on ever-growing amounts of data. This puts a hard limit on how often we can update the model, and incurs a computational cost that we would like to reduce. Instead, it would be preferable to have models that can be updated with new information when it arrives, but do not require a full retraining of untouched parameters for every new batch of data that comes in. This is not an easy feat, and the field of “online recommender systems” that handle model updates more gracefully has seen much interest in recent years (Vinagre et al. 2020). More generally, the problem of “lifelong” or “continual” learning in the machine learning field deals with similar issues (Chen 2018).

In this work, we present a novel algorithm to incrementally update the state-of-the-art item-based linear model \(\textsc {ease}^{\textsc {r}}\), which is naturally extended to include recent variants that exploit side-information: \(\textsc {cease}^{\textsc {r}}\) and \(\textsc {add-ease}^{\!\textsc {r}}\). \(\textsc {ease}^{\textsc {r}}\) consists of two major computation steps: (1) the generation of the Gramian item-item matrix, and (2) the inversion of this matrix that yields the solution to the regression problem.

We propose Dynamic \(\textsc {ease}^{\textsc {r}}\) (\(\textsc {dyn-ease}^{\textsc {r}}\)), consisting of incremental update rules for these two steps that leverage the recently proposed Dynamic Index algorithm (Jeunen et al. 2019) and the well-known Woodbury matrix identity (Hager 1989), respectively. As such, \(\textsc {dyn-ease}^{\textsc {r}}\) provides a way to efficiently update an existing \(\textsc {ease}^{\textsc {r}}\)-like model without the need of recomputing the entire regression model from scratch with every data update.

A theoretical analysis of the proposed algorithm shows that the highest efficiency gains can be expected when the rank of the update to the Gramian is low, an assumption that has been widely adopted in the recommender systems literature before (Koren et al. 2009). We show how this quantity can be bounded using simple summary statistics from the new batch of data, and support our findings with empirical results. Further experiments confirm that \(\textsc {dyn-ease}^{\textsc {r}}\) is able to significantly cut down on computation time compared to iteratively retrained \(\textsc {ease}^{\textsc {r}}\), in a variety of recommendation domains. Finally, we show how we can update the model with low-rank approximations when the new batch of data itself is not low-rank; providing a tuneable trade-off between the exactness of the solution and the efficiency with which it can be kept up-to-date. Empirical observations show how this approximate variant of \(\textsc {dyn-ease}^{\textsc {r}}\) still yields highly competitive recommendation performance, with greatly improved update speed, and how the low-rank assumption can even improve on recommendation accuracy. As a result, our work broadens the space of recommendation problems to which the state-of-the-art linear model \(\textsc {ease}^{\textsc {r}}\) can efficiently be applied. To foster the reproducibility of our work, all source code for the experiments in Sect. 4 is publicly available at github.com/olivierjeunen/dynamic-easer.

The rest of this manuscript is structured as follows: Sect. 2 introduces our use-case, with mathematical notation and relevant related work; Sect. 3 introduces \(\textsc {dyn-ease}^{\textsc {r}}\) and presents a theoretical analysis of its inner workings, motivating an approximate variant; Sect. 4 presents empirical observations from a wide range of experiments and shows where \(\textsc {dyn-ease}^{\textsc {r}}\) can provide meaningful improvements, findings that are in line with what the theory suggests. Section 5 concludes our work, additionally presenting a scope for future research.

2 Background and related work

We first formalise our use-case and present relevant mathematical notation used throughout the rest of this work. We are interested in the “binary, positive-only” implicit feedback setting (Verstrepen et al. 2017), where we have access to a dataset consisting of preference indications from users in \({\mathcal {U}}\) over items in \({\mathcal {I}}\) at time \(t \in {\mathbb {N}}\), drawn from a set of interaction data \({\mathcal {P}} \subseteq {\mathcal {U}} \times {\mathcal {I}} \times {\mathbb {N}}\). Ignoring temporal information, these preferences can be represented in a binary user-item matrix \({\varvec{X}} \in \{0,1\}^{|{\mathcal {U}}|\times |{\mathcal {I}}|}\), where \({\varvec{X}}_{u,i} = 1\) if we have a click, view, purchase, etc. for user u and item i in \({\mathcal {P}}\), and \({\varvec{X}}_{u,i} = 0\) otherwise. With \({\mathcal {P}}_{t}\), we denote the set of all interactions up to time t: \(\{(u,i,t^{\prime }) \in {\mathcal {P}} | t^{\prime } < t \}\). Consequently, \({\varvec{X}}_t\) is the user-item matrix constructed from the set of interactions \({\mathcal {P}}_{t}\). We will refer to the set of all items seen by user u as \({\mathcal {I}}_u \subseteq {\mathcal {I}}\), and vice versa \({\mathcal {U}}_i \subseteq {\mathcal {U}}\) for an item i. The Gramian of the user-item matrix is defined as \({\varvec{G}} :={\varvec{X}}^{\intercal } {\varvec{X}}\); it is an item-item matrix that holds the co-occurrence count for items i and j at index \({\varvec{G}}_{i,j}\). The goal of a recommendation algorithm is to predict which zeroes in the user-item matrix \({\varvec{X}}\) should not actually be zeroes, i.e. which items would in some way “fit” the user’s tastes and would consequently make for good recommendations.
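To make this notation concrete, the following minimal sketch (our own illustration on hypothetical toy data, not code from the paper) builds a sparse user-item matrix \({\varvec{X}}\) and its Gramian \({\varvec{G}} = {\varvec{X}}^{\intercal } {\varvec{X}}\) with SciPy:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy (user, item, time) triples; indices and sizes are made up for illustration.
interactions = [(0, 0, 1), (0, 2, 3), (1, 2, 4), (2, 1, 7)]
n_users, n_items = 3, 3

rows = [u for u, i, t in interactions]
cols = [i for u, i, t in interactions]
X = csr_matrix((np.ones(len(interactions)), (rows, cols)), shape=(n_users, n_items))
X.data[:] = 1.0               # binarise in case of repeated (user, item) interactions
G = (X.T @ X).toarray()       # G[i, j] holds the co-occurrence count of items i and j
```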

In some cases, additional information about items can be available. Such “side-information” or “metadata” often comes in the form of discrete tags, which can, for example, be a release year, genre or director for a movie, an artist or genre for a song, a writer for a book, or many more. Incorporating item metadata in the modelling process can help mitigate cold-start and long-tail issues, where the preference information for a given item is limited (Schein et al. 2002; Park and Tuzhilin 2008). We will refer to the set of all such tags as the vocabulary \({\mathcal {V}}\). In a similar fashion to the user-item matrix \({\varvec{X}}\), a tag-item matrix \({\varvec{T}} \in {\mathbb {R}}^{|{\mathcal {V}}|\times |{\mathcal {I}}|}\) is constructed. Note that this matrix is real-valued, as it will often contain pre-computed values such as tf-idf weights instead of binary indicators.

In what follows, we present a brief introduction to item-based recommendation models, most notably item-knn (Sarwar et al. 2001), slim (Ning and Karypis 2011) and \(\textsc {ease}^{\textsc {r}}\) (Steck 2019a). We then additionally introduce \(\textsc {cease}^{\textsc {r}}\) and \(\textsc {add-ease}^{\!\textsc {r}}\) as extensions of \(\textsc {ease}^{\textsc {r}}\) that incorporate item side-information whilst retaining a closed-form solution (Jeunen et al. 2020), as these are most relevant to the dynamic \(\textsc {ease}^{\textsc {r}}\) algorithm we will present in Sect. 3. This section is concluded with an overview of related work in the field of incremental collaborative filtering approaches.

2.1 Item-based models, slim and \(\textsc {ease}^{\textsc {r}}\)

Item-based collaborative filtering models tackle the recommendation task by defining a conceptual similarity matrix \({\varvec{S}} \in {\mathbb {R}}^{|{\mathcal {I}}|\times |{\mathcal {I}}|}\). The score given to a potential recommendation is then computed as the sum of similarities between items in the user’s history and the item at hand:

$$\begin{aligned} \text {score}(u,i) = \sum _{j \in {\mathcal {I}}_u} {\varvec{S}}_{j,i} = ({\varvec{X}}_{u,\cdot } {\varvec{S}})_i \end{aligned}$$
(1)

Here, \({\varvec{X}}_{u,\cdot }\) denotes the uth row of \({\varvec{X}}\). Note that computing recommendation scores for all training users and all items simply consists of computing the matrix multiplication \({\varvec{X}}{\varvec{S}}\), an operation that is made more efficient when the matrix \({\varvec{S}}\) is restricted to be sparse. Scores for items i already present in the user history \({\mathcal {I}}_u\) are often ignored, and the remaining items are ranked and presented in a top-N recommendation list or slate to the user. Early seminal works defined the similarity matrix \({\varvec{S}}\) as all pairwise cosine similarities among items in the high-dimensional but sparse user-item matrix \({\varvec{X}}\) (Sarwar et al. 2001). This has since been extended to include slightly more advanced notions of similarity such as Pearson’s correlation or conditional probabilities (Deshpande and Karypis 2004). Recent work has introduced the “Dynamic Index” algorithm to incrementally compute the Gramian of \({\varvec{X}}\), additionally showing that several conventional similarity metrics such as cosine similarity or the Jaccard index can be readily computed from \({\varvec{G}}\) when it is up-to-date (Jeunen et al. 2019).
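As an illustration of Eq. 1, the sketch below (our own, with hypothetical variable names) scores all items for a single user and returns a top-N recommendation list:

```python
import numpy as np

def top_n(X_u: np.ndarray, S: np.ndarray, n: int = 10) -> np.ndarray:
    """X_u: binary history vector of one user; S: item-item similarity matrix."""
    scores = (X_u @ S).astype(float)   # score(u, i) = sum of S[j, i] over j in I_u
    scores[X_u > 0] = -np.inf          # ignore items already in the user's history
    return np.argsort(-scores)[:n]     # indices of the N highest-scoring items
```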

Methods for actually learning an optimal item-item similarity matrix have been proposed for the task of rating prediction (Koren 2008), as well as for pairwise learning from implicit feedback (Rendle et al. 2009). Ning and Karypis were the first to propose to learn a sparse weight matrix \({\varvec{S}}\) through a pointwise optimisation procedure, aptly dubbing their approach the Sparse LInear Method (slim) (Ning and Karypis 2011). slim optimises a least-squares regression model with elastic net regularisation, constrained to positive weights:

$$\begin{aligned} {\varvec{S}}^{*}&= {{\,\mathrm{arg\,min}\,}}_{{\varvec{S}}} \left\Vert {\varvec{X}} - {\varvec{X}}{\varvec{S}}\right\Vert ^{2}_{F} + \lambda _{1} \left\Vert {\varvec{S}}\right\Vert _{1} + \lambda _{2} \left\Vert {\varvec{S}}\right\Vert ^{2}_{F},\nonumber \\&\text {subject to } \text {diag}({\varvec{S}}) = 0 \text { and } {{\varvec{S}} \ge 0}. \end{aligned}$$
(2)

The restriction of the diagonal to zero avoids the trivial solution where \({\varvec{S}} = {\varvec{I}}\). Many extensions of slim have been proposed in recent years, and it has become a widely used method for the collaborative filtering task (Ning and Karypis 2012; Levy and Jack 2013; Christakopoulou and Karypis 2014; Sedhain et al. 2016; Christakopoulou and Karypis 2016; Steck 2019a, c; Steck et al. 2020; Chen et al. 2020). In practice, the slim optimisation problem is often decomposed into \(|{\mathcal {I}}|\) independent problems (one per target item). Although these can then be solved in an embarrassingly parallel fashion, the approach remains intractable for very large item catalogues. Indeed, as these methods aim to solve \(|{\mathcal {I}}|\) regression problems, their computational complexity is in the order of \({\mathcal {O}}(|{\mathcal {I}}|(|{\mathcal {I}}|-1)^{2.373})\), even when exploiting recent advances in efficient matrix multiplication and inversion (Le Gall 2014; Alman and Vassilevska W. 2021). The computational cost of the original slim approach is a known impediment to its adoption in certain use-cases; related work has reported that hyper-parameter tuning took several weeks on even medium-sized datasets (Liang et al. 2018).

Steck studied whether slim’s restriction to positive item-item weights and its \(l_{1}\)-regularisation-induced sparsity were necessary for the resulting model to remain competitive, and concluded that this was not always the case (Steck 2019a; Steck et al. 2020). The resulting Tikhonov-regularised least-squares problem can then be formalised as:

$$\begin{aligned} {\varvec{S}}^{*} = {{\,\mathrm{arg\,min}\,}}_{{\varvec{S}}} \left\Vert {\varvec{X}} - {\varvec{X}}{\varvec{S}}\right\Vert ^{2}_{F} + \lambda \left\Vert {\varvec{S}}\right\Vert ^{2}_{F} \text {, subject to } \text {diag}({\varvec{S}}) = 0. \end{aligned}$$
(3)

The main advantage of simplifying the optimisation problem at hand is that the well-known closed-form solutions for ordinary least squares (OLS) and ridge regression can now be adopted. Including the zero-diagonal constraint via Lagrange multipliers yields the Embarrassingly Shallow Auto-Encoder (\(\textsc {ease}^{\textsc {r}}\)) model:

$$\begin{aligned} \hat{{\varvec{S}}} = {\varvec{I}} - \hat{{\varvec{P}}} \cdot \text {diagMat}(\mathbf {1} \oslash \text {diag}(\hat{{\varvec{P}}})) \text {, where } \hat{{\varvec{P}}} :=({\varvec{X}}^\intercal {\varvec{X}} + \lambda {\varvec{I}})^{-1}. \end{aligned}$$
(4)

As this model consists of a single regression problem to be solved and thus a single matrix inversion to be computed, its complexity is orders of magnitude smaller than that of the original slim variants: \({\mathcal {O}}(|{\mathcal {I}}|^{2.373})\). \(\textsc {ease}^{\textsc {r}}\) no longer yields a sparse matrix, possibly making Equation 1 much less efficient to compute. Nevertheless, the author reported only a marginal performance impact when the learnt matrix is sparsified by zeroing out the weights with the smallest absolute values until the desired sparsity level is reached. As an additional advantage, \(\textsc {ease}^{\textsc {r}}\) has only a single regularisation strength hyper-parameter to tune, compared to the two needed for slim’s elastic net regularisation. We refer the interested reader to Steck (2019a, 2019b) for a full derivation of the model and additional information.
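To make the closed form in Eq. 4 concrete, the following minimal sketch (our own illustration, not the reference implementation; it assumes the dense Gramian fits in memory and uses an arbitrary regularisation strength) computes the \(\textsc {ease}^{\textsc {r}}\) weight matrix with NumPy:

```python
import numpy as np

def ease_r(X: np.ndarray, lam: float = 500.0) -> np.ndarray:
    """Closed-form EASE^R solution of Eq. (4); lam is the ridge penalty lambda."""
    G = X.T @ X                                   # item-item Gramian
    P = np.linalg.inv(G + lam * np.eye(G.shape[0]))
    S = np.eye(G.shape[0]) - P / np.diag(P)       # S = I - P . diagMat(1 / diag(P))
    np.fill_diagonal(S, 0.0)                      # diagonal is zero by construction;
    return S                                      # set explicitly as a numerical safeguard
```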

Another recent extension of the slim paradigm proposes to use Block-Diagonal-Regularisation (BDR) to obtain a block-aware item similarity model (Chen et al. 2020). The block-diagonal structure in the learnt matrix inherently represents clusters among items. As inter-block similarities are penalised, BDR has a sparsity-inducing effect that positively impacts the efficiency of the recommendation-generating process. Because the block-aware model presented by Chen et al. (2020) no longer has an analytically computable solution readily available, further comparison with their method is out of scope for the purposes of this work. The item-based paradigm and its closed-form instantiations have also recently been adapted for bandit-based recommendation use-cases (Jeunen and Goethals 2021).

2.2 Item-based models with side-information

The \(\textsc {ease}^{\textsc {r}}\) definition can be further extended to incorporate side-information in either a “collective” (\(\textsc {cease}^{\textsc {r}}\)) or “additive” (\(\textsc {add-ease}^{\textsc {r}}\)) manner (Jeunen et al. 2020). The first method, inspired by collective slim (Ning and Karypis 2012), intuitively treats discrete tags analogously to users, and re-weights their contribution to the solution of the regression problem through the diagonal weight-matrix \({\varvec{W}} \in {\mathbb {R}}^{(|{\mathcal {U}}|+|{\mathcal {V}}|)\times (|{\mathcal {U}}|+|{\mathcal {V}}|)}\):

$$\begin{aligned} {\varvec{S}}^{*}&= {{\,\mathrm{arg\,min}\,}}_{{\varvec{S}}} \left\Vert \sqrt{{\varvec{W}}} ({\varvec{X}}' - {\varvec{X}}'{\varvec{S}})\right\Vert ^{2}_{F} + \lambda \left\Vert {\varvec{S}}\right\Vert ^{2}_{F},\nonumber \\&\text {subject to } \text {diag}({\varvec{S}}) = 0 \text {, where } {\varvec{X}}' = \begin{bmatrix} {\varvec{X}} \\ {\varvec{T}} \end{bmatrix}. \end{aligned}$$
(5)

The closed-form solution is then given by Equation 6, where \(\oslash \) denotes element-wise division, \(\text {diag}(\cdot )\) extracts the diagonal from a matrix, \(\text {diagMat}(\cdot )\) generates a square diagonal matrix from a vector, and \(\mathbf {1}\) is a vector of ones.

$$\begin{aligned} \hat{{\varvec{S}}} = {\varvec{I}} - \hat{{\varvec{P}}} \cdot \text {diagMat}(\mathbf {1} \oslash \text {diag}(\hat{{\varvec{P}}})) \text {, where } \hat{{\varvec{P}}} :=({\varvec{X}}'^\intercal {\varvec{W}}{\varvec{X}}' + \lambda {\varvec{I}})^{-1} \end{aligned}$$
(6)

The second method, \(\textsc {add-ease}^{\textsc {r}}\), treats the regression problem on the user-item matrix \({\varvec{X}}\) and the one on the tag-item matrix \({\varvec{T}}\) as two fully independent problems that can be solved in parallel, combining the two resulting item-item weight matrices \({\varvec{S}}_{{\varvec{X}}}\) and \({\varvec{S}}_{{\varvec{T}}}\) additively afterwards.

$$\begin{aligned} \begin{aligned} {\varvec{S}}^{*} = \alpha&{{\,\mathrm{arg\,min}\,}}_{{\varvec{S}}_{{\varvec{X}}}}\left( \left\Vert \sqrt{{\varvec{W}}_{{\varvec{X}}}} ({\varvec{X}} - {\varvec{X}}{\varvec{S}}_{{\varvec{X}}})\right\Vert ^{2}_{F} + \lambda _{{\varvec{X}}} \left\Vert {\varvec{S}}_{{\varvec{X}}}\right\Vert ^{2}_{F}\right) \\ + (1-\alpha )&{{\,\mathrm{arg\,min}\,}}_{{\varvec{S}}_{{\varvec{T}}}}\left( \left\Vert \sqrt{{\varvec{W}}_{{\varvec{T}}}} ({\varvec{T}} - {\varvec{T}}{\varvec{S}}_{{\varvec{T}}})\right\Vert ^{2}_{F} + \lambda _{{\varvec{T}}} \left\Vert {\varvec{S}}_{{\varvec{T}}}\right\Vert ^{2}_{F}\right) ,\\&\text {subject to } \text {diag}({\varvec{S}}_{{\varvec{X}}}) = \text {diag}({\varvec{S}}_{{\varvec{T}}}) = 0. \end{aligned} \end{aligned}$$
(7)

\(\textsc {add-ease}^{\textsc {r}}\) doubles the number of parameters used by \(\textsc {ease}^{\textsc {r}}\) and \(\textsc {cease}^{\textsc {r}}\), increasing its degrees of freedom at learning time at the cost of having to solve two regression problems instead of one. Note, however, that these are fully independent and can be computed in parallel. Equation 8 shows the analytical formulas to obtain the two independent models and to combine them with a blending parameter \(0 \le \alpha \le 1\).

$$\begin{aligned}&\hat{{\varvec{S}}}_{{\varvec{X}}} = {\varvec{I}} - \hat{{\varvec{P}}}_{{\varvec{X}}} \cdot \text {diagMat}(\mathbf {1} \oslash \text {diag}(\hat{{\varvec{P}}}_{{\varvec{X}}})) \text {, where } \hat{{\varvec{P}}}_{{\varvec{X}}} :=({\varvec{X}}^\intercal {\varvec{W}}_{{\varvec{X}}}{\varvec{X}} + \lambda _{{\varvec{X}}}{\varvec{I}})^{-1} \nonumber \\&\quad \hat{{\varvec{S}}}_{{\varvec{T}}} = {\varvec{I}} - \hat{{\varvec{P}}}_{{\varvec{T}}} \cdot \text {diagMat}(\mathbf {1} \oslash \text {diag}(\hat{{\varvec{P}}}_{{\varvec{T}}})) \text {, where } \hat{{\varvec{P}}}_{{\varvec{T}}} :=({\varvec{T}}^\intercal {\varvec{W}}_{{\varvec{T}}}{\varvec{T}} + \lambda _{{\varvec{T}}}{\varvec{I}})^{-1}\nonumber \\&\quad \hat{{\varvec{S}}} = \alpha \hat{{\varvec{S}}}_{{\varvec{X}}} + (1 - \alpha )\hat{{\varvec{S}}}_{{\varvec{T}}} \end{aligned}$$
(8)

The computational complexity of \(\textsc {cease}^{\textsc {r}}\) and \(\textsc {add-ease}^{\textsc {r}}\) remains in the order of \({\mathcal {O}}(|{\mathcal {I}}|^{2.373})\), which is equivalent to that of the original \(\textsc {ease}^{\textsc {r}}\) approach. As such, these methods allow item side-information to be included into the model without a significant added cost in terms of computational complexity. The main reason for this is that we adapt the entries in the Gramian \({\varvec{G}}\), but do not alter its dimensions.
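To illustrate how little changes computationally, the sketch below (our own, assuming uniform weights \({\varvec{W}} = {\varvec{I}}\) and reusing the hypothetical ease_r helper from the earlier sketch) computes \(\textsc {cease}^{\textsc {r}}\) by simply stacking \({\varvec{X}}\) and \({\varvec{T}}\) as in Eq. 5:

```python
import numpy as np

def cease_r(X: np.ndarray, T: np.ndarray, lam: float = 500.0) -> np.ndarray:
    # Stack users and tags row-wise: the Gramian becomes X'^T X' = X^T X + T^T T,
    # so its dimensions (and hence the cost of the inversion) stay unchanged.
    X_prime = np.vstack([X, T])
    return ease_r(X_prime, lam)
```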

2.3 Incremental collaborative filtering

Collaborative filtering techniques that can be incrementally updated when new data arrives form a lively research area in their own right. Vinagre et al. (2014) propose incremental Stochastic Gradient Descent (SGD) as a way to dynamically update matrix factorisation models based on positive-only implicit feedback. Their methodology has first been extended to include negative feedback (Vinagre et al. 2015), and later to a co-factorisation model that is more complex than traditional matrix factorisation, but also leads to superior recommendation accuracy (Anyosa et al. 2018). He et al. (2016) propose an incremental optimisation procedure based on Alternating Least Squares (ALS), and also show how it can be applied to efficiently and effectively update matrix factorisation models. More recently, Ferreira et al. propose a method that personalises learning rates on a per-user basis, reporting further improvements. In contrast, our work focuses on item-based similarity models that come with closed-form solutions, as these have been shown to be highly competitive with the state of the art in many collaborative filtering use-cases.

Instead of just incorporating new data into the model, Matuszyk et al. (2018) propose to forget older data that has become obsolete, reporting significantly improved performance for collaborative filtering approaches. The dynamic \(\textsc {ease}^{\textsc {r}}\) method we propose in Sect. 3 fits perfectly into this paradigm, as it can incorporate new data just as easily as it can forget irrelevant information in a targeted manner. This type of decremental learning has the additional advantage of being able to avoid complete retraining in privacy-sensitive application areas, where specific user histories need to be removed from the model upon request.

2.4 Neural auto-encoders

The Auto-Encoder paradigm, of which \(\textsc {ease}^{\textsc {r}}\) is a specific instantiation, has gained much popularity in recent years. The Mult-VAE method proposed by Liang et al. (2018) consists of a variational auto-encoder with a multinomial likelihood, and has been a strong baseline for several years (Dacrema et al. 2019). Khawar et al. (2020) propose an architecture that first learns a grouping of items and then leverages this structure when learning the auto-encoder, reporting significant gains over the original Mult-VAE method. As these methods rely on gradient-based optimisation of often highly non-convex objective functions, they require software packages with automatic differentiation capabilities, and typically demand significant computational resources, in the form of several hours of training on machines equipped with GPUs. The methods we consider in this work are computed in the order of minutes on CPUs, and we do not include neural approaches in our comparison for this reason. Furthermore, Steck (2019a) and Dacrema et al. (2019), among others, have repeatedly shown that linear item-based models can attain highly competitive recommendation accuracy compared to neural alternatives.

3 Methodology and contributions

We have given a brief history of item-based collaborative filtering models, and have discussed why \(\textsc {ease}^{\textsc {r}}\) and its variants are computationally often more efficient than their counterparts based on slim. For very large item catalogues, however, its more than quadratic computational complexity in the number of items still becomes a very tangible issue. Because of this, the demand for an algorithm that can efficiently update \(\textsc {ease}^{\textsc {r}}\)-like models when new data arrives is still very real, and a necessity for these methods to obtain widespread adoption in practice. Recent work proposes the “Dynamic Index” algorithm as a way to incrementally update item similarities in neighbourhood-based models that adopt cosine similarity (Jeunen et al. 2019). A crucial building block of this metric and the algorithm is the efficient and incremental computation of the Gramian matrix \({\varvec{G}} = {\varvec{X}}^\intercal {\varvec{X}}\). By storing \({\varvec{G}}\) in low-overhead sparse data-structures such as inverted indices, they minimise memory overhead whilst still allowing for an amortised constant lookup time when querying \({\mathcal {I}}_u\), which is a requirement for incremental updates. From Eqs. 4, 6 and 8, it is clear that \(\textsc {ease}^{\textsc {r}}\) and its variants depend on this Gramian matrix as well. In fact, it is the only building block needed to compute the resulting item-item weight matrix \(\hat{{\varvec{S}}}\). As such, we adopt parts of the Dynamic Index algorithm proposed by Jeunen et al. to first efficiently compute and then incrementally update the Gramian matrix \({\varvec{G}}\). Once we have an up-to-date matrix \({\varvec{G}}\), we need to invert its regularised version \(({\varvec{G}} + \lambda {\varvec{I}})\) to obtain \(\hat{{\varvec{P}}}\), and the eventual model \(\hat{{\varvec{S}}}\) from that. This matrix inversion is the workhorse behind \(\textsc {ease}^{\textsc {r}}\) that takes up the large majority of the computation time, as it corresponds to solving the least-squares problem formulated in Eq. 3. Recomputing this matrix inverse from scratch every time we wish to incorporate new data into the model should thus be avoided whenever possible.

3.1 Low-rank model updates with the Woodbury matrix identity

Equation 9 shows the Woodbury matrix identity, which posits that the inverse of a rank-k correction to some \(n\times n\) matrix \({\varvec{A}}\) can be computed by performing a rank-k correction on the inverse of the original matrix (Hager 1989).

$$\begin{aligned} ({\varvec{A}} + {\varvec{U}}{\varvec{C}}{\varvec{V}})^{-1} = {\varvec{A}}^{-1} - {\varvec{A}}^{-1}{\varvec{U}}({\varvec{C}}^{-1}+{\varvec{V}}{\varvec{A}}^{-1}{\varvec{U}})^{-1}{\varvec{V}}{\varvec{A}}^{-1} \end{aligned}$$
(9)

So, given \({\varvec{A}}^{-1}\), \({\varvec{U}}\), \({\varvec{C}}\) and \({\varvec{V}}\), there is no need to recompute the inverse of the updated matrix \({\varvec{A}}\) from scratch; it is sufficient to multiply a few matrices and to compute the inverse of \(({\varvec{C}}^{-1}+{\varvec{V}}{\varvec{A}}^{-1}{\varvec{U}})\in {\mathbb {R}}^{k\times k}\). Naturally, for large n and \(k \ll n\), the efficiency gains coming from this reformulation will be most significant. Although this is not a requirement for Eq. 9 to hold, we assume \({\varvec{C}} \in {\mathbb {R}}^{k\times k}\) to be a diagonal matrix. As a result, the inversion of \({\varvec{C}}\) becomes trivial and consists of just k operations.
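As a quick numerical check (our own illustration with arbitrary toy matrices), the identity can be verified directly for a diagonal \({\varvec{C}}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
A = rng.normal(size=(n, n)) + n * np.eye(n)    # a well-conditioned n x n matrix
U = rng.normal(size=(n, k))
V = rng.normal(size=(k, n))
C = np.diag(rng.uniform(1.0, 2.0, size=k))     # diagonal C, as assumed in the text

A_inv = np.linalg.inv(A)
direct = np.linalg.inv(A + U @ C @ V)          # left-hand side of Eq. (9)
woodbury = A_inv - A_inv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U) @ V @ A_inv
assert np.allclose(direct, woodbury)           # the two agree up to numerical precision
```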

In our setting, suppose we have an up-to-date model at a certain time t with \({\varvec{X}}_t\), \({\varvec{G}}_{t}\), \(\hat{{\varvec{P}}}_{t}\) and \(\hat{{\varvec{S}}}_{t}\). At a given time \(t+1\), suppose we have an updated user-item matrix \({\varvec{X}}_{t+1}\), and we wish to compute \({\varvec{G}}_{t+1}\), \(\hat{{\varvec{P}}}_{t+1}\) and the resulting \(\hat{{\varvec{S}}}_{t+1}\) as efficiently as possible. As we mentioned before, computing \({\varvec{G}}_{t+1}\) incrementally can be achieved easily and efficiently by adopting parts of the Dynamic Index algorithm. In fact, because of the incremental nature of the algorithm, we can easily just store the difference in the Gramian matrix instead of its entirety: \({\varvec{G}}_{{\varvec{\varDelta }}} = {\varvec{G}}_{t+1}-{\varvec{G}}_{t} = {\varvec{X}}_{t+1}^{\intercal }{\varvec{X}}_{t+1} - {\varvec{X}}_{t}^{\intercal }{\varvec{X}}_{t}\). Given a set of user-item interactions \({\mathcal {P}}_{{\varvec{\varDelta }}} \subset {\mathcal {U}}\times {\mathcal {I}}\) to include into the model and an inverted index \({\mathcal {L}}_t\) mapping users u to their histories \({\mathcal {I}}_u\), Algorithm 1 shows how to achieve this. Note that the indices holding \({\mathcal {I}}_u\) are just a sparse representation of the user-item matrix \({\varvec{X}}\) and do not require any additional memory consumption. Furthermore, Algorithm 1 is easily parallelisable through the same MapReduce-like paradigm adopted by Jeunen et al. (2019). Naturally, an efficient implementation will exploit the symmetry of the Gramian \({\varvec{G}}_{{\varvec{\varDelta }}}\) to decrease memory consumption as well as the number of increments needed at every update.

[Algorithm 1: incrementally computing the Gramian update \({\varvec{G}}_{{\varvec{\varDelta }}}\) from a new batch of interactions \({\mathcal {P}}_{{\varvec{\varDelta }}}\) and the inverted index \({\mathcal {L}}_t\)]
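The core idea behind Algorithm 1 can be sketched as follows (our own minimal Python illustration, not the authors' implementation; the dictionary `history` plays the role of the inverted index \({\mathcal {L}}_t\)):

```python
from collections import defaultdict

def gramian_delta(batch, history):
    """batch: iterable of (user, item) pairs in P_delta;
    history: dict mapping each user to the set of items they have interacted with."""
    G_delta = defaultdict(int)            # sparse dict holding (i, j) -> count delta
    for u, i in batch:
        seen = history.setdefault(u, set())
        if i in seen:
            continue                      # binary data: ignore repeated interactions
        for j in seen:                    # new co-occurrences of i with u's history
            G_delta[(i, j)] += 1
            G_delta[(j, i)] += 1
        G_delta[(i, i)] += 1              # diagonal entry: item occurrence count
        seen.add(i)
    return G_delta
```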

Now, having computed \({\varvec{G}}_{{\varvec{\varDelta }}}\), we can rewrite what we need as follows:

$$\begin{aligned} \hat{{\varvec{P}}}_{t+1} = ({\varvec{G}}_{t+1}+\lambda {\varvec{I}})^{-1} = ({\varvec{G}}_{t} + \lambda {\varvec{I}} + {\varvec{G}}_{{\varvec{\varDelta }}})^{-1}. \end{aligned}$$
(10)

The form on the right-hand side already resembles Woodbury’s formula in Eq. 9. All that remains is to decompose \({\varvec{G}}_{{\varvec{\varDelta }}} \in {\mathbb {R}}^{n\times n}\) into matrices \({\varvec{U}} \in {\mathbb {R}}^{n\times k}\), \({\varvec{C}} \in {\mathbb {R}}^{k\times k}\) and \({\varvec{V}} \in {\mathbb {R}}^{k\times n}\). As \({\varvec{G}}_{{\varvec{\varDelta }}}\) is the difference of two real symmetric matrices \({\varvec{G}}_{t+1}\) and \({\varvec{G}}_{t}\), it will always be a real symmetric matrix as well. This means that its eigenvectors can be chosen to be orthonormal, so that \({\varvec{Q}}^{\intercal } \equiv {\varvec{Q}}^{-1}\). Consequently, an eigendecomposition always exists, where k is the rank of \({\varvec{G}}_{{\varvec{\varDelta }}}\):

$$\begin{aligned} \begin{aligned} {\varvec{G}}_{{\varvec{\varDelta }}}&= {\varvec{Q}}{\varvec{\varLambda }}{\varvec{Q}}^{-1}\\&= {\varvec{Q}}{\varvec{\varLambda }}{\varvec{Q}}^{\intercal }\\&= \sum _{i = 1}^{k} {\varvec{\varLambda }}_{ii}{\varvec{Q}}_{\cdot ,i}{\varvec{Q}}_{\cdot ,i}^{\intercal }. \end{aligned} \end{aligned}$$
(11)

As such, we can plug Eq. 11 containing the eigendecomposition of \({\varvec{G}}_{{\varvec{\varDelta }}}\) into Eqs. 9 and 10 to obtain our final update rule in Eq. 12:

$$\begin{aligned} \begin{aligned} \hat{{\varvec{P}}}_{t+1}&= ({\varvec{G}}_{t} + \lambda {\varvec{I}} + {\varvec{G}}_{{\varvec{\varDelta }}})^{-1}\\&= ({\varvec{G}}_{t} + \lambda {\varvec{I}} + {\varvec{Q}}{\varvec{\varLambda }} {\varvec{Q}}^{\intercal })^{-1}\\&= \hat{{\varvec{P}}}_{t} - \hat{{\varvec{P}}}_{t}{\varvec{Q}}({\varvec{\varLambda }}^{-1} + {\varvec{Q}}^\intercal \hat{{\varvec{P}}}_{t}{\varvec{Q}})^{-1}{\varvec{Q}}^\intercal \hat{{\varvec{P}}}_{t}. \end{aligned} \end{aligned}$$
(12)

The full \(\textsc {dyn-ease}^{\textsc {r}}\) procedure is presented in Algorithm 2. If the updates to the Gramian matrix are low-rank, this procedure will be much more computationally efficient than re-computing the inverse of the entire Gramian matrix from scratch, as we will show in the following subsection. The assumption that the data-generating process behind user-item interactions is generally low-rank has been exploited far and wide in the recommender systems literature (Koren et al. 2009).

It is interesting to note that \(\textsc {ease}^{\textsc {r}}\) does not follow the low-rank assumption that motivates the popular family of latent factor models for collaborative filtering. Indeed, \(\textsc {ease}^{\textsc {r}}\) is a full-rank model, combatting overfitting with Gaussian priors on its parameters rather than reducing the dimensionality of the problem. The low-rank assumption we adopt here is on the update to the Gramian \({\varvec{G}}_{{\varvec{\varDelta }}}\), instead of the full Gramian \({\varvec{G}}\). As we will show further on, both theoretically and empirically, this assumption holds in a variety of settings.

The fact that \({\varvec{G}}_{{\varvec{\varDelta }}}\) is symmetric and will often be very sparse can be exploited when computing the eigendecomposition on line 3 of Algorithm 2, as we will show in the following subsection. Many modern software packages for scientific computing implement very efficient procedures specifically for such cases (e.g. SciPy (Virtanen et al. 2020)). Note that alternative algorithms to factorise \({\varvec{G}}_{{\varvec{\varDelta }}}\) into lower-dimensional matrices exist, often relying on randomised sampling procedures (Martinsson et al. 2011; Halko et al. 2011). These algorithms are reportedly more efficient to compute than the traditional eigendecomposition, but they are often not geared specifically towards the high-dimensional yet sparse use-case we tackle in this work, nor equipped to exploit the symmetric structure that is typical for the Gramian. As they compute two dense matrices of \({\varvec{Q}}\)’s dimensions, their improvement in computation time comes at the cost of increased memory consumption. Furthermore, these methods are often focused on approximate matrix reconstructions, whereas we are interested in an exact decomposition of the update to the Gramian. As the eigendecomposition fulfils our needs, the study of alternative factorisation methods falls outside the scope of the present work.

Throughout this section, we have focused on \(\textsc {dyn-ease}^{\textsc {r}}\) as a general extension of \(\textsc {ease}^{\textsc {r}}\). Naturally, our approach trivially extends to \(\textsc {cease}^{\textsc {r}}\), \(\textsc {add-ease}^{\textsc {r}}\) or a weight matrix \({\varvec{W}}\) different from the identity matrix \({\varvec{I}}\), as these variants only change the input to Algorithms 1 and 2, but have no impact on the procedures themselves.

[Algorithm 2: exact \(\textsc {dyn-ease}^{\textsc {r}}\), updating \(\hat{{\varvec{P}}}\) and \(\hat{{\varvec{S}}}\) via an eigendecomposition of \({\varvec{G}}_{{\varvec{\varDelta }}}\) and the Woodbury identity]
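A minimal sketch of this update (our own illustration under stated assumptions, not the authors' implementation) is given below; it assumes \({\varvec{G}}_{{\varvec{\varDelta }}}\) is available as a sparse symmetric SciPy matrix, \(\hat{{\varvec{P}}}_{t}\) as a dense array, and that k is a known upper bound on \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}})\) with \(k < |{\mathcal {I}}|\):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def dyn_ease_update(P_t: np.ndarray, G_delta, k: int) -> np.ndarray:
    """One incremental update: returns the new item-item weight matrix S_{t+1}."""
    vals, Q = eigsh(G_delta, k=k)                # eigendecomposition of the symmetric update
    keep = np.abs(vals) > 1e-10                  # drop numerically zero eigenpairs
    vals, Q = vals[keep], Q[:, keep]
    PQ = P_t @ Q                                 # reused intermediate product
    inner = np.linalg.inv(np.diag(1.0 / vals) + Q.T @ PQ)
    P_next = P_t - PQ @ inner @ PQ.T             # Woodbury update of the inverse (Eq. 12)
    S_next = np.eye(P_next.shape[0]) - P_next / np.diag(P_next)
    np.fill_diagonal(S_next, 0.0)
    return S_next
```

In practice, \(\hat{{\varvec{P}}}_{t+1}\) would also be kept for the next update; the sketch omits this for brevity.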

3.2 Computational complexity analysis of eigendecomposition

The computational complexity of \(\textsc {ease}^{\textsc {r}}\) is determined by the inversion of the Gramian, whereas the complexity of \(\textsc {dyn-ease}^{\textsc {r}}\) is dictated by that of the eigendecomposition of the update to the Gramian. The computational complexity of matrix inversion, as well as that of solving the eigen-problem of a matrix, can be reduced to that of matrix multiplication (Pan and Chen 1999; Le Gall 2014). Given a square matrix of size \(n \times n\), this is generally thought of as an \({\mathcal {O}}(n^{3})\) problem. Nevertheless, specialised methods that provide improved bounds on the exponent exist, the most recent one being \({\mathcal {O}}(n^{2.37286})\) by Alman and Vassilevska W. (2021).

In practice, more efficient algorithms can often be applied in specific cases instead of the general approach. Indeed, the inversion of a diagonal matrix consists of just n operations, and algorithms to multiply sparse matrices are often much more efficient than their dense counterparts. In what follows, we provide a brief theoretical analysis of the complexity of \(\textsc {dyn-ease}^{\textsc {r}}\), giving rise to an improved estimate of its computational complexity in practical settings. This bound explains the efficiency improvements of \(\textsc {dyn-ease}^{\textsc {r}}\) over \(\textsc {ease}^{\textsc {r}}\), and recovers the equivalence of eigendecomposition to matrix inversion in the general case.

A first important thing to note is that the Gramian is symmetric, and so is \({\varvec{G}}_{{\varvec{\varDelta }}}\). This allows us to use the iterative method proposed by Lanczos (1950) to compute its eigenvectors and eigenvalues. The core algorithm proposed by Lanczos consists of k steps (one per nonzero eigenpair), each of which in turn consists of several vector and matrix manipulations. We refer the interested reader to an excellent analysis of the Lanczos algorithm provided by Paige (1980), showing how it works and why it converges. The computational complexity of every step in the method is determined by that of a matrix-vector product between the input \({\varvec{G}}_{{\varvec{\varDelta }}}\) and an \(|{\mathcal {I}}|\)-dimensional vector. In the general case, such an operation is \({\mathcal {O}}(|{\mathcal {I}}|^{2})\). In our specific case, however, \({\varvec{G}}_{{\varvec{\varDelta }}}\) is often of an extremely sparse nature. This allows us to describe the complexity of the product as \({\mathcal {O}}(m\cdot |{\mathcal {I}}|)\), where m is the average number of nonzero values in every column of \({\varvec{G}}_{{\varvec{\varDelta }}}\). Repeating these steps for every nonzero eigenpair yields a final computational complexity of \({\mathcal {O}}(k\cdot m\cdot |{\mathcal {I}}|)\). When we wish to do a full-rank update on a dense matrix (i.e. \(k = m = |{\mathcal {I}}|\)), this recovers the computational complexity of general matrix inversion: \({\mathcal {O}}(|{\mathcal {I}}|^{3})\). In the cases where either the rank of the update is low (\(k \ll |{\mathcal {I}}|\)) or the update to the Gramian is highly sparse (\(m \ll |{\mathcal {I}}|\)), the eigendecomposition will be most efficient and, as a consequence, the performance benefits of \(\textsc {dyn-ease}^{\textsc {r}}\) over \(\textsc {ease}^{\textsc {r}}\) will be most apparent too. Note that although low-rankness and sparsity will often come in pairs in the practical settings we deal with, this does not have to be the case in general. As a counterexample: the identity matrix \({\varvec{I}}\) is highly sparse yet full-rank.
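For illustration (with assumed, purely hypothetical numbers that do not correspond to any specific dataset): for a catalogue of \(|{\mathcal {I}}| = 10^{5}\) items, a batch inducing an update of rank \(k = 500\) with on average \(m = 50\) nonzero values per column of \({\varvec{G}}_{{\varvec{\varDelta }}}\) requires in the order of \(k \cdot m \cdot |{\mathcal {I}}| = 2.5 \times 10^{9}\) operations, whereas a full re-inversion scales as \(|{\mathcal {I}}|^{2.373} \approx 7 \times 10^{11}\), a difference of more than two orders of magnitude.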

3.3 Efficient estimation and upper bounding of \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}})\)

In order to compute the eigendecomposition on line 3 of Algorithm 2, the numerical rank of \({\varvec{G}}_{{\varvec{\varDelta }}}\) would need to be known a priori. Furthermore, as we have shown, the efficiency of the update procedure is highly dependent on the assumption that this rank is much smaller than the dimensionality of the Gram-matrix itself: \(k \ll |{\mathcal {I}}|\). It is known that matrix ranks can be estimated efficiently through the use of randomised methods (Liberty et al. 2007; Ubaru and Saad 2016); when dealing with sparse and symmetric matrices, these methods tend to attain extremely efficient performance. Being able to estimate \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}})\) of course does not guarantee that this quantity will be low. In practice, however, we notice that it often is. We can see that the rank of the update \({\varvec{G}}_{{\varvec{\varDelta }}}\) depends on (1) the number of unique users in the update \({\mathcal {P}}_{{\varvec{\varDelta }}}\), denoted by \(|{\mathcal {U}}_{{\varvec{\varDelta }}}|\), and (2) the average number of items in the entire history of these users: \(\overline{|{\mathcal {I}}_{{\mathcal {U}}_{{\varvec{\varDelta }}}}|}\).

This can be intuitively seen from the fact that an index ij in the Gramian matrix represents the number of co-occurrences between the items i and j in the dataset. As such, a new user-item interaction \((u,i) \in {\mathcal {P}}_{{\varvec{\varDelta }}}\) affects \({\varvec{G}}_{i,j}, \forall j \in {\mathcal {I}}_u\).

Now, let \({\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ]}\) be the user-item matrix containing all (including historical) user-item interactions from only the users that appear in the update. This means we can rewrite the updated Gramian matrix as follows:

$$\begin{aligned} \begin{aligned} {\varvec{G}}_{t+1}&= {\varvec{G}}_{t} - {\varvec{X}}_{t}^{\intercal }{\varvec{X}}_{t} + {\varvec{X}}_{t+1}^{\intercal }{\varvec{X}}_{t+1}\\&= {\varvec{G}}_{t} - {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t}^{\intercal } {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t} + {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t+1}^{\intercal } {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t+1}. \end{aligned} \end{aligned}$$

The update then becomes: \( {\varvec{G}}_{{\varvec{\varDelta }}} = {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t+1}^{\intercal } {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t+1} - {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t}^{\intercal } {\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t}.\)

Lemma 1

Given a \(|{\mathcal {U}}|\times |{\mathcal {I}}|\) user-item matrix \({\varvec{X}}\), its Gramian matrix \({\varvec{G}}\), and updates to \({\varvec{X}}\); the rank of the update of the Gramian matrix \({\varvec{G}}_{{\varvec{\varDelta }}}\) can be upper bounded by two times the number of unique, nonzero rows in \({\varvec{X}}_{{\varvec{\varDelta }}}\): \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}}) \le 2|{\mathcal {U}}_{{\varvec{\varDelta }}}|\).

Proof

As the rank of a matrix is defined as its number of linearly independent row or column vectors, a (possibly loose) upper bound for \({\text {rank}}({\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ]})\) is given by its number of nonzero rows \(|{\mathcal {U}}_{{\varvec{\varDelta }}}|\). Consequently, the rank of the Gramian matrix has the same bound: \({\text {rank}}({\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}}, \cdot ]}^{\intercal }{\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ]}) \le |{\mathcal {U}}_{{\varvec{\varDelta }}}|\). It is well known that the rank of the sum of two matrices is less than or equal to the sum of the ranks of the individual matrices. Bringing those together, we have that \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}}) \le 2|{\mathcal {U}}_{{\varvec{\varDelta }}}|\). \(\square \)

This upper bound on \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}})\) holds for any update to \({\varvec{X}}\). When the users in the update are disjoint from those in \({\varvec{X}}_{t}\), the bound can be tightened to \(|{\mathcal {U}}_{{\varvec{\varDelta }}}|\). For general-purpose use cases, it is not always feasible to ensure that users in the update do not appear with partial histories in previous iterations of the model. For specific applications such as session-based recommendation, however, it is common practice to train models on the session-item matrix, which satisfies this assumption by definition (Ludewig and Jannach 2018).

Lemma 2

Given a \(|{\mathcal {U}}|\times |{\mathcal {I}}|\) user-item matrix \({\varvec{X}}\), its Gramian matrix \({\varvec{G}}\), and updates to \({\varvec{X}}\) that only consist of adding new rows or altering previously zero-rows; the rank of the update of the Gramian matrix \({\varvec{G}}_{{\varvec{\varDelta }}}\) can be upper bounded by the number of rows being added or altered: \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}}) \le |{\mathcal {U}}_{{\varvec{\varDelta }}}|\).

Proof

When the update only pertains to new users, this means that \({\varvec{X}}_{[{\mathcal {U}}_{{\varvec{\varDelta }}},\cdot ],t} = 0\), which ensures that \(\text {rank}({\varvec{G}}_{{\varvec{\varDelta }}}) = \text {rank}({\varvec{X}}_{{\varvec{\varDelta }}})\). Because \({\text {rank}}({\varvec{X}}_{{\varvec{\varDelta }}})\) is bounded by \(|{\mathcal {U}}_{{\varvec{\varDelta }}}|\) by definition, so is \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}})\): \({\text {rank}}({\varvec{G}}_{{\varvec{\varDelta }}}) \le |{\mathcal {U}}_{{\varvec{\varDelta }}}|\). \(\square \)

We have provided bounds for \(\text {rank}({\varvec{G}}_{{\varvec{\varDelta }}})\) by focusing on the number of users that have contributed interactions in the new batch of data that we wish to include into the model. Analogously, in some settings, it might be easier to bound the number of unique items that are being interacted with. In a news recommendation setting, for example, a new batch of data might consist of only a very limited number of items (in the order of hundreds) being read by a much higher number of users (hundreds of thousands). In this case, we can straightforwardly extend Lemmas 1 and 2 to bound the rank by the number of independent columns in \({\varvec{X}}\) as opposed to its rows. The further reasoning and results follow trivially, bounding \(\text {rank}({\varvec{G}}_{{\varvec{\varDelta }}})\) by \(2|{\mathcal {I}}_{{\varvec{\varDelta }}}|\) and \(|{\mathcal {I}}_{{\varvec{\varDelta }}}|\), respectively. Whereas the original \(\textsc {ease}^{\textsc {r}}\) approach and the need to iteratively retrain would make it a poor choice for applications with possibly vast item catalogues but smaller active item catalogues, such as catalogues of news articles, the presented upper bounds theoretically show why \(\textsc {dyn-ease}^{\textsc {r}}\) can provide an efficient updating mechanism.
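These bounds are cheap to evaluate in practice. The helper below (a hypothetical illustration of our own, not part of the released code) computes them directly from a new batch of (user, item) pairs:

```python
def rank_upper_bound(batch) -> int:
    """General bound on rank(G_delta) from Lemma 1 and its item-side analogue.
    The factor 2 can be dropped for whichever side (users or items) consists
    exclusively of previously unseen rows or columns, as in Lemma 2."""
    users = {u for u, _ in batch}
    items = {i for _, i in batch}
    return min(2 * len(users), 2 * len(items))
```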

[Algorithm 3: approximate \(\textsc {dyn-ease}^{\textsc {r}}\), updating \(\hat{{\varvec{P}}}\) with a rank-k truncated eigendecomposition of \({\varvec{G}}_{{\varvec{\varDelta }}}\)]

3.4 Approximate \(\textsc {dyn-ease}^{\textsc {r}}\) updates via truncated eigendecomposition

Naturally, the rank of the update will not always be low in general recommendation use-cases. The easiest counter-example to think of is the case where we wish to include k user-item interactions that pertain to k new and unique users as well as k unique items. This will lead to a diagonal-like structure of \({\varvec{X}}_{{\varvec{\varDelta }}}\) and \(\text {rank}({\varvec{X}}_{{\varvec{\varDelta }}}) = k\), which is problematic for large values of k. However, it is also not hard to see that incorporating such a batch of data into our model will not affect any of our personalisation capabilities. Indeed, as \(\textsc {ease}^{\textsc {r}}\) exploits signal from item co-occurrences, data where no item co-occurrences are present is practically useless, even though it is full-rank. Although this is a contrived example, it serves to illustrate that the rank of the update is not necessarily synonymous with its informational value.

In these cases, we can still resort to updating our model \(\hat{{\varvec{P}}}\) with a low-rank approximation of \({\varvec{G}}_{{\varvec{\varDelta }}}\) without hurting the performance of the updated model. Instead of computing the rank and a full eigendecomposition of the update to the Gramian, as in Algorithm 2, we can choose the rank k at which we wish to truncate, and update \(\hat{{\varvec{P}}}\) with a low-rank approximation \(\tilde{{\varvec{G}}}_{{\varvec{\varDelta }}}\) instead of the exact update. The resulting procedure is shown in Algorithm 3; it provides a tuneable trade-off between the exactness of the acquired solution and the efficiency of incremental updates.
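The truncation step can be sketched as follows (our own toy illustration with an arbitrary random matrix): keeping only the k eigenpairs of largest magnitude yields the best rank-k approximation of \({\varvec{G}}_{{\varvec{\varDelta }}}\) in the Frobenius sense, after which the Woodbury update proceeds exactly as in the exact variant.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import eigsh

n, k = 1000, 32
M = sparse_random(n, n, density=0.01, random_state=0)
G_delta = M + M.T                              # toy sparse, symmetric "update" matrix
vals, Q = eigsh(G_delta, k=k, which='LM')      # k leading eigenpairs (largest magnitude)
G_tilde = (Q * vals) @ Q.T                     # optimal rank-k approximation of G_delta
error = np.linalg.norm(G_tilde - G_delta.toarray(), 'fro')  # reconstruction error
```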

Interestingly, this type of approximate update is closely related to yet another extension of the slim paradigm: Factored Item Similarity Models (fism) (Kabbur et al. 2013). In fism, the similarity matrix \({\varvec{S}}\) is modelled as the product of two lower-dimensional latent factor matrices. The resulting low-rank model is shown to be increasingly effective as the sparsity in the user-item interactions it learns from increases, highlighting that this type of approximation does not necessarily imply a decrease in recommendation accuracy. In approximate \(\textsc {dyn-ease}^{\textsc {r}}\), we do not directly model the similarity matrix \({\varvec{S}}\) as factorised, but we update \({\varvec{S}}\) with a factorised version of the update to the Gramian \({\varvec{G}}_{{\varvec{\varDelta }}}\). Factorised models such as fism or approximate \(\textsc {dyn-ease}^{\textsc {r}}\) also bear resemblance to models that are often used in natural language processing applications. Indeed, the well-known word2vec algorithm to learn word embeddings for natural language processing applications implicitly learns to factorise a matrix holding the (shifted positive) pointwise mutual information between word-context pairs (Mikolov et al. 2013; Levy and Goldberg 2014).

Although our motivation for approximate \(\textsc {dyn-ease}^{\textsc {r}}\) is rooted in reducing the computational cost of exact \(\textsc {dyn-ease}^{\textsc {r}}\), the advantages of transitivity that come with low-rank representations can significantly impact recommendation performance as well. Imagine items \(a,b,c \in {\mathcal {I}}\) where (a, b) and (b, c) co-occur in the training data of user histories, but (a, c) does not. Full-rank \(\textsc {ease}^{\textsc {r}}\) cannot infer a correlation between a and c in such a setting, whereas low-rank models can learn a latent factor that unifies a, b and c. This explains the advantage that low-rank models have in sparse data environments. For further insights on the advantages, differences and analogies between full-rank and low-rank models, we refer the interested reader to the work of Van Balen and Goethals (2021).

As we are factorising \({\varvec{G}}_{{\varvec{\varDelta }}}\) by its truncated eigendecomposition, we are guaranteed to end up with the optimal rank-k approximation with respect to the mean squared error between \(\tilde{{\varvec{G}}}_{{\varvec{\varDelta }}}\) and \({\varvec{G}}_{{\varvec{\varDelta }}}\). Naturally, with the highly sparse nature of \({\varvec{G}}_{{\varvec{\varDelta }}}\), this optimal approximation will focus on reconstructing entries with large values, and rows or columns with many nonzero values. This corresponds to focusing on the items that occur most often in the new incoming batch of user-item interactions \({\mathcal {P}}_{\varDelta }\). Because of this, we can expect approximate \(\textsc {dyn-ease}^{\textsc {r}}\) to favour recently popular items, which can give an additional performance boost in the right application areas. Nevertheless, an in-depth discussion or validation of the efficacy of factorised \(\textsc {ease}^{\textsc {r}}\)-like models falls outside the scope of this work, as we focus on the efficiency with which the model can be updated. If the cut-off rank k is lower than the true rank of the update, approximate \(\textsc {dyn-ease}^{\textsc {r}}\) guarantees an improvement in terms of the computational complexity of the update procedure.

4 Experimental results and discussion

The goal of this section is to validate that the methods we proposed in earlier sections of this manuscript work as expected, and to investigate whether expectations grounded in theory can be substantiated with empirical observations. Concretely, the research questions we wish to answer are the following:

RQ1:

Can exact \(\textsc {dyn-ease}^{\textsc {r}}\) provide more efficient model updates in comparison with iteratively retrained \(\textsc {ease}^{\textsc {r}}\)?

RQ2:

Can our theoretical analysis on the correlation between rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) and the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\) set realistic expectations in practice?

RQ3:

Do the phenomena we describe for bounding rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) occur in real-world session-based or news recommendation datasets?

RQ4:

Can approximate \(\textsc {dyn-ease}^{\textsc {r}}\) provide a sensible trade-off between recommendation efficiency and effectiveness?

Table 1 Datasets we adopt throughout the experiments presented in this work, along with their source and summary statistics that describe the user-item interactions and their sparsity

Table 1 shows the publicly available datasets we use throughout our experiments in an attempt to provide empirical answers to the above-mentioned research questions. The well-known MovieLens dataset (Harper and Konstan 2015) consists of explicit ratings (on a 1–5 scale) that users have given to movies, along with the time of rating. We drop ratings lower than 3.5 and treat the remainder as binary preference expressions. Additionally, we only keep users and items that appear at least 3 times throughout the dataset. This type of pre-processing is common, and ensures we are left with positive preference expressions that carry enough signal for effective personalisation (Liang et al. 2018; Beel and Brunel 2019). We take the newest and largest variant of the dataset as our starting point: MovieLens-25M. Many recommender systems applications are based on shorter browsing sessions rather than full user histories that might span years (Ludewig and Jannach 2018). As laid out in Sect. 3.3, these set-ups can be especially amenable to our approach, as the adoption of these shorter sessions instead of longer user histories naturally decreases the rank of the update to the Gramian. We adopt two well-known datasets for session-based recommender systems: the YooChoose dataset, released in the context of the 2015 ACM RecSys Challenge (Ben-Shimon et al. 2015); and the RetailRocket dataset (Kaggle 2016). These datasets consist of implicit feedback (clicks) from users on retail products, and we compute the 3-core for users and items in the same manner we did for MovieLens-25M, after removing repeated user-item interactions. To validate our intuitions regarding \(\textsc {dyn-ease}^{\textsc {r}}\) and the rank of the Gramian in news recommendation setups, we use the Adressa and Microsoft News datasets (MIND) (Gulla et al. 2017; Wu et al. 2020). These datasets contain implicit feedback inferred from browsing behaviour on news websites; we pre-process them analogously to the other datasets.
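The pre-processing described above can be implemented as an iterative k-core filter; the sketch below (our own illustration with assumed column names, not the exact script in our repository) shows one common way to do so with pandas:

```python
import pandas as pd

def k_core(df: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    """Repeatedly drop users and items with fewer than k interactions until stable.
    Assumes a DataFrame with columns ['user', 'item', 'timestamp']."""
    while True:
        user_counts = df['user'].value_counts()
        item_counts = df['item'].value_counts()
        keep = (df['user'].map(user_counts) >= k) & (df['item'].map(item_counts) >= k)
        if keep.all():
            return df
        df = df[keep]
```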

Some datasets have prohibitively large item catalogues for \(\textsc {ease}^{\textsc {r}}\) to compute the inverse Gramian at once. However, the large majority of items are often at the extreme end of the so-called “long tail”, only being interacted with once or twice. We prune these items to keep the \(\textsc {ease}^{\textsc {r}}\) computation feasible but still highlight the advantages of \(\textsc {dyn-ease}^{\textsc {r}}\).

Note that these pruning operations on rare items significantly cut down computation time for \(\textsc {ease}^{\textsc {r}}\) (which depends directly on \(|{\mathcal {I}}|\)), but they do not give \(\textsc {dyn-ease}^{\textsc {r}}\) an unfair advantage. Items that appear just once in the dataset blow up the size of the Gramian, but do not significantly impact the rank of the Gramian updates. Indeed, in these situations we get that \(k \ll |{\mathcal {I}}|\), and the computational advantages of \(\textsc {dyn-ease}^{\textsc {r}}\) over \(\textsc {ease}^{\textsc {r}}\) become even more pronounced. We adopt such pruning because it is common practice and it keeps the computational needs for reproducing our experiments reasonable. The reason we do not further explore other well-known datasets, such as the Million Song Dataset (MSD) (Bertin-Mahieux et al. 2011), is that they do not include timestamps indicating when the user-item interactions occurred. Because of this, they are unsuited for evaluating a realistic scenario where models are incrementally retrained over time.

The final dataset we adopt is the SuperMarket Dataset with Implicit feedback (SMDI) introduced by Viniski et al. Because this dataset has a comparatively small item catalogue, the computation time for all \(\textsc {ease}^{\textsc {r}}\) variants is in the order of seconds and largely dominated by variance and system overhead. We adopt the SMDI dataset to study the recommendation performance of approximate \(\textsc {dyn-ease}^{\textsc {r}}\), as it exhibits a distribution shift that is largely absent in the other datasets we consider.

To foster the reproducibility of our work, all source code for the experiments we have conducted is publicly available under an open-source license at github.com/olivierjeunen/dynamic-easer/. All code is written in Python 3.7 using SciPy (Virtanen et al. 2020). Reported runtimes are wall-clock times, measured on a 14-core Intel Xeon processor. The rest of this section is structured to follow the research questions laid out above.

4.1 Efficiency of exact \(\textsc {dyn-ease}^{\textsc {r}}\) (RQ1)

To verify the runtime gains of exact \(\textsc {dyn-ease}^{\textsc {r}}\) over iteratively retrained \(\textsc {ease}^{\textsc {r}}\), we chronologically split the user-item interactions at a fixed timestamp t, yielding all user-item interactions up to t and all those after t. The Microsoft News Dataset comes with users' reading histories and clicks on shown recommendations, but the reading histories do not include timestamps. Because of this, we treat these historical interactions as “early data” included in the original \(\textsc {ease}^{\textsc {r}}\) computation, and incorporate the timestamped clicks chronologically into \(\textsc {dyn-ease}^{\textsc {r}}\) via the procedure described below.

We train an \(\textsc {ease}^{\textsc {r}}\) model on the early batch and log the runtime in seconds needed for this computation. This operation is repeated over 5 runs, and we report a 95% Gaussian confidence interval. As new incoming user-item interactions do not affect the dimension of the Gramian matrix that needs to be inverted, the runtime needed to compute \(\textsc {ease}^{\textsc {r}}\) remains fairly constant as new user-item interactions are added.
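For reference, the following is a minimal sketch of the closed-form \(\textsc {ease}^{\textsc {r}}\) computation being timed here, following the formulation of Steck (2019a); the user-item matrix and the regularisation strength below are synthetic, illustrative stand-ins rather than our experimental configuration.

```python
import time
import numpy as np
from scipy.sparse import random as sparse_random

def train_easer(X, l2=500.0):
    """Closed-form EASE^R: a dense item-item weight matrix with a zero diagonal."""
    G = (X.T @ X).toarray().astype(np.float64)    # item-item Gramian
    G[np.diag_indices_from(G)] += l2              # ridge regularisation
    P = np.linalg.inv(G)                          # the single matrix inversion
    B = P / (-np.diag(P))                         # B_ij = -P_ij / P_jj
    B[np.diag_indices_from(B)] = 0.0              # zero-diagonal constraint
    return B

# Synthetic stand-in for the "early" user-item matrix; l2 = 500 is illustrative.
X_early = sparse_random(10_000, 2_000, density=1e-3, format='csr', random_state=0)
start = time.perf_counter()
B = train_easer(X_early)
print(f"EASE^R computed in {time.perf_counter() - start:.2f}s")
```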

Over the newer batch of data, we employ a non-overlapping sliding window that chronologically generates batches of data to be included in the existing model via our proposed exact \(\textsc {dyn-ease}^{\textsc {r}}\) procedure. The width of this window \(\delta \) is varied to study its effect on the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\). Larger values of \(\delta \) imply larger update batches, which will often lead to an increase in rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)). Naturally, when \(\delta \) becomes too large, a point is reached where the overhead induced by our incremental updating method becomes prohibitive, and it becomes favourable to fully retrain the \(\textsc {ease}^{\textsc {r}}\) model. Sensible values of \(\delta \) come with a restriction: when the runtime of the model update exceeds \(\delta \), the procedure cannot keep up with incoming data in real time. We do not encounter this issue for any of the values of \(\delta \) explored in our experiments, which suggests that \(\textsc {dyn-ease}^{\textsc {r}}\) can be a good fit for various configurations.
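The update being timed rests on the eigendecomposition of the update to the Gramian combined with the Woodbury identity (cf. Sect. 3). The sketch below illustrates the core idea under simplifying assumptions and with illustrative function names; the released code implements the full \(\textsc {dyn-ease}^{\textsc {r}}\) bookkeeping, and the approximate variant of Sect. 4.4 simply caps the number of eigen-pairs k.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def dyn_update(P, G_delta, k):
    """Fold a rank-k change of the Gramian into the precomputed inverse P = (G + l2*I)^{-1}.

    G_delta is the symmetric (sparse) update to the Gramian for one window. Its
    eigendecomposition G_delta = V diag(w) V^T feeds the Woodbury identity, so the
    dominant costs scale with k = rank(G_delta) rather than with the catalogue size.
    """
    w, V = eigsh(G_delta.asfptype(), k=k)                # k eigen-pairs (largest magnitude)
    PV = P @ V                                           # |I| x k
    core = np.linalg.inv(np.diag(1.0 / w) + V.T @ PV)    # k x k system (w assumed non-zero)
    return P - PV @ core @ PV.T

def weights_from_inverse(P):
    """Re-derive the item-item weights B from an (updated) inverse P."""
    B = P / (-np.diag(P))                                # B_ij = -P_ij / P_jj
    B[np.diag_indices_from(B)] = 0.0                     # zero-diagonal constraint
    return B
```

Because the \(|{\mathcal {I}}| \times k\) products and the \(k \times k\) solve dominate, shorter windows, and hence lower rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)), translate directly into cheaper updates.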

Figure 1 visualises the resulting runtimes of the procedure laid out above on all five considered datasets. Time progresses along the x-axis, and the runtime for varying values of \(\delta \) is shown on the y-axis. The explored values of \(\delta \) differ per dataset and use-case: for the MovieLens dataset, which spans 25 years, daily updates might be sufficient; for the Adressa news recommendation dataset, which spans 3 months, more frequent 5-minute updates are more appropriate to keep up with the highly dynamic nature of the environment.

Fig. 1 Runtime results for \(\textsc {dyn-ease}^{\textsc {r}}\) updated with different intervals (sliding window width \(\delta \)), as compared to iteratively retrained \(\textsc {ease}^{\textsc {r}}\) over the final N days of the datasets. We observe that for a wide range of interval widths \(\delta \), \(\textsc {dyn-ease}^{\textsc {r}}\) can provide significant efficiency gains. When \(\delta \) becomes too large, the overhead that comes with incremental updates becomes too high and a full retrain with vanilla \(\textsc {ease}^{\textsc {r}}\) is favourable

We included values of \(\delta \) that push the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\) up to that of \(\textsc {ease}^{\textsc {r}}\) to highlight the limitations of our approach. Provided that the computing power and infrastructure are available, however, \(\delta \) can be decreased to bring \(\textsc {dyn-ease}^{\textsc {r}}\)'s runtime into the desirable range. Note that this limitation on \(\delta \) is general to online learning approaches that learn from user-item interactions, and not specific to the methods we propose in this work.

From the runtime results, we observe that our proposed method yields significant efficiency improvements over iterative model retraining, for a wide range of settings. Across all datasets, we observe a clear trend toward lower runtimes for shorter sliding windows and more frequent updates, as expected from our theoretical results.

As the MovieLens-25M dataset spans several decades, the number of new user-item interactions to be incorporated on a daily basis remains modest. Exploring lower values of \(\delta \) would therefore not provide additional insights into the performance of \(\textsc {dyn-ease}^{\textsc {r}}\). As a consequence, we obtain a clean separation between the runtimes of \(\textsc {dyn-ease}^{\textsc {r}}\) for batches of different sizes (i.e. different values of \(\delta \)).

The remaining four datasets represent session- and news-based recommendation environments, which are known to be much more fast-paced and dynamic. Because we focus on smaller sliding window lengths \(\delta \) here, we clearly see daily seasonal patterns emerging. Indeed, \(\textsc {dyn-ease}^{\textsc {r}}\) runtime peaks coincide with peaks in website traffic. As the rank of the update is typically correlated with the number of users or items in the update, this phenomenon is to be expected. It highlights that \(\textsc {dyn-ease}^{\textsc {r}}\) is able to effectively target those model parameters that need updating, and does not spend unnecessary computing cycles on unchanged parts of the model. Note that \(\delta \) does not need to be a fixed constant in real-world applications. An effective use of computing power might decrease and increase \(\delta \) during traffic peaks and valleys, respectively.

4.2 Correlating rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) and runtime of exact \(\textsc {dyn-ease}^{\textsc {r}}\) (RQ2)

The runtime of the incremental updates shown in Fig. 1 is visualised against the rank of the updates in Fig. 2. We clearly observe a strong correlation between the rank of the update to the Gramian and the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\), with a trend that is consistent over varying values of \(\delta \).

We fit a polynomial of the form \(f(x) = a\cdot x^{b} + c\) on a randomly sampled subset of 90% of the measurements, and assess its ability to predict the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\) from rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) on the remaining 10%. Table 2 shows the optimal parameters, the number of samples (runtime measurements) and the root mean squared error (RMSE) on the test sample for every dataset. Figure 2 qualitatively shows that we can predict the runtime of \(\textsc {dyn-ease}^{\textsc {r}}\) updates with reasonable accuracy when the rank of the update is known. Combined with the bounds on this quantity laid out in Sect. 3.3, this allows us to set an expected upper bound on the computation time of our incremental updates through \(\textsc {dyn-ease}^{\textsc {r}}\). Table 2 quantitatively shows the magnitude of the errors, reinforcing the insights obtained qualitatively. Note that whereas the absolute RMSE increases for datasets with larger item catalogues, the relative error of the model remains fairly constant: a mean error of 5 seconds on a prediction of 10 seconds is not equivalent to being 5 seconds off when the order of magnitude is 1000 seconds. These empirical observations, together with the theoretical analysis presented in Sect. 3, highlight the efficiency and favourable scalability of the proposed \(\textsc {dyn-ease}^{\textsc {r}}\) procedure.
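For completeness, the following is a minimal sketch of this fitting procedure using scipy.optimize.curve_fit; the (rank, runtime) measurements below are synthetic stand-ins for the logged data.

```python
import numpy as np
from scipy.optimize import curve_fit

def runtime_model(x, a, b, c):
    # The functional form fitted in Table 2: f(x) = a * x^b + c.
    return a * np.power(x, b) + c

# Synthetic stand-ins for the logged (rank, runtime) measurements.
rng = np.random.default_rng(0)
ranks = rng.integers(10, 2_000, size=400).astype(float)
runtimes = runtime_model(ranks, 5e-4, 1.6, 2.0) + rng.normal(0.0, 1.0, size=ranks.size)

# 90/10 split: fit on the training part, report RMSE on the held-out part.
train = rng.random(ranks.size) < 0.9
(a, b, c), _ = curve_fit(runtime_model, ranks[train], runtimes[train],
                         p0=(1e-3, 1.5, 0.0), maxfev=10_000)
preds = runtime_model(ranks[~train], a, b, c)
rmse = float(np.sqrt(np.mean((preds - runtimes[~train]) ** 2)))
print(f"a = {a:.2e}, b = {b:.2f}, c = {c:.2f}, test RMSE = {rmse:.2f}s")
```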

Fig. 2 Runtime for the incremental updates from Fig. 1 plotted against the rank of the update to the Gramian matrix \({\varvec{G}}_{{\varvec{\varDelta }}}\). We observe a strong correlation between higher values of rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) and runtime, as well as a correlation between \(\delta \) and higher rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)). This result highlights that rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) is the main driving factor behind \(\textsc {dyn-ease}^{\textsc {r}}\)'s computational complexity, and that bounding it can give realistic expectations for \(\textsc {dyn-ease}^{\textsc {r}}\) efficiency in practice

Table 2 Resulting polynomial model to predict runtime from rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)), along with the root mean squared error (RMSE) it attains and the number of observations N it was fitted on

4.3 Analysing bounds for rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) (RQ3)

Figure 3 compares the rank of the incremental updates from Fig. 1 to summary statistics of the corresponding batches of user-item interactions. This visualisation assesses how tight the upper bounds laid out in Sect. 3.3 are in practice, and provides a better understanding of the underlying dynamics of every dataset.

We observe that, for both the general-purpose MovieLens-25M dataset and the session-based datasets, the user-focused bound approximates the rank of the update to the Gramian reasonably well. This is in line with our theoretical expectations, and confirms that the number of unique users in a given batch of user-item interactions is the main driving factor for rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)). We further see that the upper bound becomes looser as the number of unique users grows. This too is expected behaviour, as it becomes less likely for a new user's behaviour to be linearly independent of the other users in the batch as the batch size grows. As mentioned in Sect. 3.3, the upper bound of \(2|{\mathcal {U}}|\) could be tightened to \(|{\mathcal {U}}|\) if we did not perform a hard split on time but rather divided user sessions into a “finished” and an “ongoing” set. This phenomenon occurs naturally for the YooChoose dataset, where we clearly see that the \(2|{\mathcal {U}}|\) bound is much looser. Note that the tight bounds for MovieLens might change if the dataset included timestamps for item consumption rather than rating, as the majority of users might then watch a smaller set of current series or movies. Such a recency bias would shrink the active item catalogue, favouring the item-based bounds.

In contrast, for the news datasets the bound based on the number of unique items is much tighter, providing an almost perfect approximation in many cases. This confirms our intuition that the rank of the update in these settings is fully determined by the number of active items in the catalogue and virtually independent of the number of users or interactions in a given batch. This, in turn, makes these environments especially amenable to our \(\textsc {dyn-ease}^{\textsc {r}}\) approach.

Fig. 3 Upper bounds for the batches of incremental updates from Fig. 1 plotted against the rank of the update to the Gramian matrix \({\varvec{G}}_{{\varvec{\varDelta }}}\). We observe that different applications, implying different data characteristics, bound the rank of the update in different ways. These results confirm our expectation that, whereas the user-focused upper bound is a good approximation for the MovieLens and RetailRocket datasets, the item-focused bound is tighter in news recommendation settings. Moreover, we observe favourable behaviour for the rank of the update as a function of the number of interactions, further highlighting \(\textsc {dyn-ease}^{\textsc {r}}\)'s scalability

The number of unique users or items in a batch can thus give rise to reasonably tight upper bounds on the rank of the update in realistic scenarios on real-world datasets. The absolute number of user-item interactions \(|{\mathcal {P}}|\) provides another (impractical) bound on the rank of the update; indeed, in a worst-case scenario, every user-item interaction would pertain to a unique user and a unique item. We include the visualisation of the relation between rank(\({\varvec{G}}_{{\varvec{\varDelta }}}\)) and \(|{\mathcal {P}}|\) to intuitively show that our proposed approach scales favourably with the size of the data, a property that is most appreciated in highly dynamic environments with ever-growing dataset sizes.
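The following is a minimal sketch of how the quantities plotted in Fig. 3 can be computed for a single update window, assuming illustrative names for the interaction batch and the user-item matrices before and after the update; the precise bounds are those derived in Sect. 3.3, and the batch statistics below merely serve as proxies.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def rank_and_batch_stats(batch: pd.DataFrame, X_old: csr_matrix, X_new: csr_matrix) -> dict:
    """Rank of the Gramian update for one window, next to the batch statistics of Fig. 3.

    `batch` holds the window's user-item interactions (columns 'user' and 'item');
    X_old and X_new are the user-item matrices before and after incorporating it,
    with identical item (column) indexing. Computing the exact rank densifies the
    item-item matrix, so this is only meant for offline analysis on modest catalogues.
    """
    G_delta = (X_new.T @ X_new - X_old.T @ X_old).toarray()
    return {
        'rank(G_delta)': int(np.linalg.matrix_rank(G_delta)),
        'user-based statistic (2|U|)': 2 * int(batch['user'].nunique()),
        'item-based statistic (|I| in batch)': int(batch['item'].nunique()),
        'interactions (|P|)': len(batch),
    }
```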

4.4 Efficiency and effectiveness of approximate \(\textsc {dyn-ease}^{\textsc {r}}\) (RQ4)

Finally, we wish to validate the efficiency and efficacy of approximate updates to \(\textsc {ease}^{\textsc {r}}\)-like models. Specifically, we wish to understand the trade-off between runtime and recall for models that are iteratively updated as new data comes in. We report experimental results for runtime and recommendation accuracy on both the Adressa and SMDI datasets. This experiment is not repeated on the other datasets, as they are less suited to this type of experimental evaluation. MovieLens-25M spans too long a time period, and the effect of model retraining on recommendation accuracy is insignificant. YooChoose and RetailRocket focus on shorter user sessions, which makes it hard for SW-EVAL to reach statistically significant conclusions. Lastly, the Microsoft News Dataset contains bandit feedback, which differs from the organic user-item interactions we tackle in this work. This was not an issue when evaluating the models' computational cost, but it prevents a proper evaluation of recommendation accuracy in a common manner.

To illustrate the advantages of approximate \(\textsc {dyn-ease}^{\textsc {r}}\), we make use of the Sliding Window Evaluation (SW-EVAL) technique (Jeunen et al. 2018; Jeunen 2019). We train a model on all user-item interactions that have occurred up to time t. For a fixed sliding window width \(\delta _{\text {update}}\), we periodically update the model with new incoming data, for both the exact and approximate \(\textsc {dyn-ease}^{\textsc {r}}\) variants. A concurrent sliding window of width \(\delta _{\text {eval}}\) dictates the evaluation period, in which every competing model is evaluated on its ability to predict which items users will interact with next. This experimental procedure is formalised in Algorithm 4 and sketched below. We set \(\delta _{\text {update}} = 60\)min and \(\delta _{\text {eval}}=120\)min for the Adressa dataset and evaluate over the final 24 hours, and \(\delta _{\text {update}} = 6\)h and \(\delta _{\text {eval}}=3\)d for the final 120 days of the SMDI dataset. This difference in order of magnitude keeps the overall runtime of the experiments reasonable, and ensures sufficiently large evaluation samples for statistically significant results.
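A minimal sketch of such a sliding-window evaluation loop is given below, with hypothetical model and evaluation interfaces; Algorithm 4 and the released code define the exact procedure we follow.

```python
import pandas as pd

def sw_eval(interactions, t_start, t_end, delta_update, delta_eval, models, evaluate):
    """Sketch of a sliding-window evaluation loop in the spirit of SW-EVAL.

    `interactions` is a DataFrame with a 'timestamp' column; `models` maps a name to
    an object exposing update(batch) (a no-op for the stale baseline); `evaluate`
    scores a model on the held-out window. Timestamps and deltas are assumed to be
    comparable types (e.g. pandas Timestamps and Timedeltas). Names are illustrative.
    """
    results, t = [], t_start
    while t < t_end:
        ts = interactions['timestamp']
        update_batch = interactions[(ts >= t) & (ts < t + delta_update)]
        eval_batch = interactions[(ts >= t + delta_update) & (ts < t + delta_update + delta_eval)]
        for name, model in models.items():
            model.update(update_batch)                 # exact or approximate dyn-ease^r
            results.append((t, name, evaluate(model, eval_batch)))
        t += delta_update
    return pd.DataFrame(results, columns=['t', 'model', 'score'])
```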

4.4.1 Computation time for approximate \(\textsc {dyn-ease}^{\textsc {r}}\)

Figure 4 shows the computation time for exact \(\textsc {dyn-ease}^{\textsc {r}}\), as well as for several approximate model variants with varying cut-off ranks k. In terms of runtime improvements, we observe very favourable results for approximate updates. As expected, the computational cost of \(\textsc {dyn-ease}^{\textsc {r}}\)'s updates can largely be attributed to the computation of all eigen-pairs, and limiting the rank has a significant impact on the efficiency of these updates. At cut-off rank \(k = 250\), the computational cost of the updates is decreased by a factor of 3, or 65%.
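To give an intuition for why capping the cut-off rank pays off, the snippet below times SciPy's eigsh for increasing k on a synthetic symmetric sparse matrix standing in for \({\varvec{G}}_{{\varvec{\varDelta }}}\); the sizes and density are illustrative, not those of our datasets.

```python
import time
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import eigsh

# Symmetric sparse stand-in for G_delta; sizes and density are illustrative only.
A = sparse_random(5_000, 5_000, density=1e-3, random_state=0, format='csr')
G_delta = A + A.T

for k in (5, 50, 250):
    start = time.perf_counter()
    w, V = eigsh(G_delta, k=k)        # only the k largest-magnitude eigen-pairs
    print(f"k = {k:>3}: {time.perf_counter() - start:.2f}s")
```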

As we have mentioned above, the computation time for all \(\textsc {ease}^{\textsc {r}}\) variants on the SMDI dataset is in the order of seconds and largely dominated by variance and system overhead. As a result, runtime results on this dataset do not provide significant insights, and we do not report them.

Fig. 4 Runtime results for exact and approximate \(\textsc {dyn-ease}^{\textsc {r}}\) variants, with varying cut-off ranks k. We observe a quick and steady decline in computation time needed for lower values of k, which can be attributed to less computation time spent finding eigen-pairs

Fig. 5 Recommendation accuracy results for exact and approximate \(\textsc {dyn-ease}^{\textsc {r}}\) variants, with varying cut-off ranks k. We additionally report results for a stale \(\textsc {ease}^{\textsc {r}}\) model that does not incorporate new user-item data over time. We observe that exact \(\textsc {(dyn-)ease}^{\textsc {r}}\)'s recommendation accuracy can largely be retained by approximate variants, as the recommendation accuracy measurements remain within statistical noise of one another for a large range of cut-off ranks k. For \(k = 5\), we observe a statistically significant improvement as time goes on. For \(k=1\), we observe a decrease in recommendation accuracy with respect to the exact model

Table 3 Runtime and recommendation accuracy measurements on the Adressa and SMDI datasets, for the experimental procedure laid out in Algorithm 4

4.4.2 Recommendation accuracy for approximate \(\textsc {dyn-ease}^{\textsc {r}}\)

Figure 5 visualises Recall@K for \(K \in \{1, 5, 10, 20\}\) over time on the Adressa and SMDI datasets. The SMDI dataset exhibits large variance in the number of active users over time, which heavily influences the statistical significance of some evaluation results, as they are based on too few samples. We therefore do not include evaluation results where the evaluation set consists of fewer than 100 users. The denominator for our Recall measure is \(\min (K,|{\mathcal {I}}_{u}^{\text {test}}|)\) instead of the number of held-out items \(|{\mathcal {I}}_{u}^{\text {test}}|\), to ensure that a perfect value of 1 can be attained at all cut-offs K.
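A minimal sketch of this truncated Recall@K measure, with hypothetical argument names:

```python
def recall_at_k(recommended, held_out, k):
    """Recall@K with a min(K, |I_u^test|) denominator, so a perfect score of 1 is attainable."""
    hits = sum(1 for item in recommended[:k] if item in held_out)
    return hits / min(k, len(held_out))

# Hypothetical example: two held-out items, one of them recovered in the top-2.
print(recall_at_k(recommended=[3, 7, 1], held_out={7, 42}, k=2))   # 0.5
```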

In addition to several approximate model variants, we include recommendation accuracy results for a model that is not updated over time, yielding a lower bound on the accuracy we can expect from approximate updates. On the Adressa dataset, we observe that such a stale model quickly deteriorates over time. This is to be expected, as the Adressa dataset and the news recommendation application are heavily biased toward recent items and interactions. This bias is less clear on the SMDI dataset at lower cut-offs K, but clearly manifests itself for Recall@20 near the end of the evaluation period.

For both datasets, we observe that the accuracy of approximate \(\textsc {dyn-ease}^{\textsc {r}}\) variants at high values of k is not statistically significantly different from that of exact \(\textsc {dyn-ease}^{\textsc {r}}\). This highlights that the N-fold improvement in computational cost can come at a negligible cost in recommendation accuracy, showing the advantages of approximate computation. For Adressa, we observe a statistically significant improvement over exact \(\textsc {dyn-ease}^{\textsc {r}}\) for low values of k. This can be attributed to the reasons laid out in Sect. 3: the low-rank approximation handles sparsity and transitivity, and favours recently popular items. These model characteristics are highly favourable in news recommendation settings, but might have less influence on supermarket data. Nevertheless, the results highlight that many efficient low-rank updates can yield highly competitive models compared to more costly full-rank updates.

All runtime and recommendation accuracy measurements are aggregated in Table 3, providing further insight into the trade-off between runtime and recommendation accuracy for approximate \(\textsc {dyn-ease}^{\textsc {r}}\). We denote the Recall@K measure as R@K for improved spacing. On the SMDI dataset, the differences in recommendation accuracy among exactly and approximately updated model variants were not found to be statistically significant at the \(p=0.05\) level. The differences between the stale \(\textsc {ease}^{\textsc {r}}\) model and the \(\textsc {dyn-ease}^{\textsc {r}}\) variants, however, are statistically significant.

The size of the Adressa dataset yields more statistical power, and both the differences between stale \(\textsc {ease}^{\textsc {r}}\) and \(\textsc {dyn-ease}^{\textsc {r}}\) and those between exact \(\textsc {dyn-ease}^{\textsc {r}}\) and approximate \(\textsc {dyn-ease}^{\textsc {r}}\) with \(k \in \{1,5\}\) were found to be statistically significant.

5 Conclusion

Linear item-based models are an attractive choice for many collaborative filtering tasks due to their conceptual simplicity, interpretability, and recommendation accuracy. Recent work has shown that the analytical solution that is available for ridge regression can significantly improve the scalability of such methods, with a state-of-the-art algorithm called \(\textsc {ease}^{\textsc {r}}\) (Steck 2019a). \(\textsc {ease}^{\textsc {r}}\) consists of a single matrix inversion of the Gramian. As its computational complexity depends on neither the number of users nor the number of user-item interactions in the training set, it is particularly well suited to use-cases with many users or interactions, with the sole constraint that the item catalogue remains limited in size.

When deployed in real-world applications, models often need to be periodically recomputed to incorporate new data and to account for newly available items, shifting user preferences, and general concept drift. Iteratively retraining an \(\textsc {ease}^{\textsc {r}}\)-like model from scratch puts additional strain on such systems, imposes a hard upper limit on the attainable frequency of model updates, and can drive up computational costs. This especially limits the application of \(\textsc {ease}^{\textsc {r}}\) in domains where item recency is an important factor in deciding item relevance, such as retail or news recommendation.

In this work, we propose a novel and exact updating algorithm for embarrassingly shallow auto-encoders that combines parts of the Dynamic Index algorithm (Jeunen et al. 2019) with the Woodbury matrix identity (Hager 1989): Dynamic \(\textsc {ease}^{\textsc {r}}\) (\(\textsc {dyn-ease}^{\textsc {r}}\)). We have provided a thorough theoretical analysis of our proposed approach, highlighting in which cases it can provide a considerable advantage over iteratively retrained \(\textsc {ease}^{\textsc {r}}\), and in which cases it does not. These theoretical insights are corroborated by empirical results from extensive experiments, showing that \(\textsc {dyn-ease}^{\textsc {r}}\) is well suited for efficient and effective online collaborative filtering in various real-world applications.

\(\textsc {dyn-ease}^{\textsc {r}}\) exploits the sparse and symmetric structure of the Gramian to efficiently compute the eigendecomposition of the Gramian update. When the rank of the update is large, however, this operation can still become prohibitively expensive. To mitigate this problem, we have additionally proposed an approximate \(\textsc {dyn-ease}^{\textsc {r}}\) variant that uses a low-rank approximation of the Gramian update instead of its exact decomposition. Empirical results highlight further efficiency improvements at a small cost in recommendation accuracy. Our work broadens the scope of problems for which item-based models based on ridge regression are an appropriate choice in practice. To foster the reproducibility of our work, the source code for all our experiments is publicly available under an open-source license at github.com/olivierjeunen/dynamic-easer.

A promising area for future work is to further improve \(\textsc {dyn-ease}^{\textsc {r}}\)’s computational efficiency by looking at alternative (approximate) matrix decompositions that exploit efficient random sampling (Halko et al. 2011; Martinsson et al. 2011), as the bottleneck of our current approach lies in the computation of the exact eigendecomposition of the update to the Gramian. Furthermore, we would like to explore applications of our efficient incremental updating scheme to more general multi-label regression tasks beyond the collaborative filtering use-case we tackle in this work.