1 Introduction

Given a pair of nodes on a large-scale hypergraph, how can we rapidly and efficiently calculate their proximity based on random walk with restart? How useful are such proximities for data-mining applications?

A hypergraph is a data structure consisting of a set of nodes and a set of hyperedges, and each hyperedge is a set composed of any number of nodes. Note that a hypergraph where each hyperedge joins any number of nodes is a generalization of a (pairwise) graph where each edge always joins two nodes. Due to this increased expressiveness, hypergraphs are widely used to model real-world group relations, including (two or more) researchers who co-author a paper, items purchased together, and tags co-appearing in a post (Benson et al. 2018; Do et al. 2020; Lee et al. 2023; Comrie and Kleinberg 2021).

Random walk, a concept widely used for graphs, naturally applies to hypergraphs. Random walk is a stochastic process that assumes an imaginary surfer moving randomly between nodes in a graph. For instance, PageRank (Brin and Page 1998) utilizes this concept to model a random surfer navigating a web graph (i.e., a network of hyperlinks between web pages) and quantifies the importance of web pages based on the stationary distribution of the surfer’s visits. A simple hypergraph extension (Zhou et al. 2006) assumes a random surfer that repeats (a) choosing an incident hyperedge with probability proportional to edge weights and (b) choosing an incident node uniformly at random. However, its expressiveness is limited in that such a random walk can always be reduced to random walk on an undirected graph with some choice of weights (Chitra and Raphael 2019). Chitra and Raphael (2019) proposed a more expressive model that may not be reducible to random walk on any undirected graph. It assumes a random surfer who uses edge-dependent node weights (EDNW) when choosing incident nodes, and due to its expressiveness, it has been employed for clustering (Hayashi et al. 2020), product-return prediction (Li et al. 2018), object classification (Zhang et al. 2018b), anomaly detection (Lee et al. 2022), etc.

On graphs, the concept of random walk with restart (RWR) (Tong et al. 2007b) has also been widely used. RWR measures the stationary probability distribution of random walk when we assume a random surfer who restarts at a query node with a certain probability, and the distribution is naturally interpreted as the relevance of each node with respect to the query node. Due to its ability to consider multi-faceted relationships between nodes, RWR has been extensively utilized in graph mining applications including personalized ranking (Tong et al. 2007b; Jung et al. 2016), anomaly detection (Sun et al. 2005), subgraph mining (Tong et al. 2007a), graph neural networks (Gasteiger et al. 2019a), and graph augmentation (Gasteiger et al. 2019b; Lee and Jung 2023). Since RWR is typically required to be computed separately for a large number of query nodes, or even for all nodes, fast computation is indispensable. Therefore, many computation methods have been developed (Wang et al. 2019a; Hou et al. 2021), and many of them rely on preprocessing the input graph (Tong et al. 2007a; Fujiwara et al. 2012; Shin et al. 2015; Jung et al. 2017).

However, RWR on hypergraphs has been underexplored, even though an extension of RWR to hypergraphs is straightforward and has many potential applications. One potential reason is the lack of fast and scalable methods for computing it. For example, previous work (Chitra and Raphael 2019) relied on naive power iteration, which becomes impractical when dealing with a large number of query nodes on large-scale hypergraphs. Since random walk on a hypergraph consists of two-step transitions (i.e., from node to hyperedge, and from hyperedge to node), we cannot directly employ existing fast and scalable computation methods for RWR on a graph, whose transition is simply from node to node. As on graphs, RWR scores on hypergraphs vary across query nodes, and computing them for a large number of query nodes is necessary in applications. Thus, dedicated efforts are needed to develop fast computation methods that offer a low cost per query node, even at the expense of a one-time preprocessing cost.

In this work, we propose ARCHER (Adaptive RWR Computation on Hypergraphs), a fast and space-efficient framework for computing RWR on real-world hypergraphs. After formally defining RWR on hypergraphs, we develop two computation methods for it based on two different (simplified) representations of hypergraphs. These two computation methods are complementary, and they offer relative advantages on different hypergraphs. Thus, we further propose an automatic selection method that chooses one computation method based on a simple but effective goodness criterion whose computation takes a very short time compared to the total running time.

Through extensive experiments on 18 real-world hypergraphs, we substantiate the speed and space efficiency of ARCHER and its two key driving factors: (a) the complementarity between the two computation methods composing ARCHER and (b) the accuracy of its automatic selection method. In addition, we demonstrate a successful application of RWR on hypergraphs and ARCHER to anomaly detection.

Reproducibility The source code and datasets used in this paper are available at https://github.com/jaewan01/ARCHER.

The rest of the paper is organized as follows. In Sect. 2, we introduce preliminaries and related work. In Sect. 3, we describe our approaches for computing RWR on hypergraphs. In Sect. 4, we introduce an application of RWR on hypergraphs for the purpose of anomaly detection. After sharing experimental results in Sect. 5, we present conclusions and future directions in Sect. 6.

2 Preliminaries and related work

In this section, we introduce some preliminaries and related studies on random walks on (hyper-)graphs and their applications.

2.1 Notations

We describe the basic notation frequently used in this paper; the related symbols are summarized in Table 1.

(Hyper)graph A hypergraph \({G_H=(\mathcal {V}, \mathcal {E}, \omega , \gamma )}\) consists of a set \(\mathcal {V}\) of nodes, a set \(\mathcal {E}\) of hyperedges, the weight \(\omega (e)\) of hyperedge e, and the weight \(\gamma _{e}(v)\) of node v depending on hyperedge e. Each hyperedge \(e\in \mathcal {E}\) is represented by a non-empty subset of an arbitrary number of nodes, i.e., \(e \in 2^{\mathcal {V}} \setminus \{\emptyset \}\). We let \(n=\vert \mathcal {V} \vert\) and \(m=\vert \mathcal {E} \vert\) be the numbers of nodes and hyperedges, respectively. Similarly, a graph \(G=(V, E, w)\) consists of a set V of nodes, a set E of edges, and edge weights w.

Matrix representation Consider a one-to-one mapping f between \(\mathcal {V}\) and \(\{1,\cdots ,n\}\) and a one-to-one mapping g between \(\mathcal {E}\) and \(\{1,\cdots ,m\}\). For any matrix \({\textbf {X}}\in \mathbb {R}^{n \times m}\), we denote its (f(v), g(e))-th entry \({\textbf {X}}_{f(v)g(e)}\) simply by \({\textbf {X}}_{ve}\). Similarly, for any matrix \({\textbf {Y}}\in \mathbb {R}^{m \times n}\), we let \({\textbf {Y}}_{ev}\) denote \({\textbf {Y}}_{g(e)f(v)}\), and for any \({\textbf {Z}}\in \mathbb {R}^{n \times n}\), we let \({\textbf {Z}}_{vv}\) denote \({\textbf {Z}}_{f(v)f(v)}\). The matrix \({\textbf {W}} \in \mathbb {R}^{n \times m}\) is the hyperedge-weight matrix whose entry \({\textbf {W}}_{ve} = \omega (e)\) if \(v \in e\), and 0 otherwise. The matrix \({\textbf {R}} \in \mathbb {R}^{m \times n}\) is the node-weight matrix whose entry \({\textbf {R}}_{ev} = \gamma _{e}(v)\) if \(v \in e\), and 0 otherwise. In the adjacency matrix \(\textbf{A}\in \mathbb {R}^{\vert V \vert \times \vert V \vert }\) of any graph G, \(\textbf{A}_{uv} = w(e)\) if there is an edge \(e=(u,v)\) between nodes \(u\in V\) and \(v \in V\); otherwise, \(\textbf{A}_{uv} = 0\).

Table 1 Symbols

Hypergraph expansions A hypergraph can be converted into graphs using clique- and star-expansion. Clique expansion (Sun et al. 2008) constructs a graph \(G_{\mathcal {C}}=(\mathcal {V}, E_{\mathcal {C}})\) from \(G_H\) by replacing each original hyperedge with a clique composed of the nodes in the hyperedge, i.e., \(E_{\mathcal {C}}=\{ (u,v) \vert u,v \in e,e\in \mathcal {E}\}\). Notably, the adjacency matrix of \(G_{\mathcal {C}}\) has the same sparsity pattern as \({\textbf {P}}={\textbf {W}}{\textbf {R}}\), as illustrated in Fig. 1.

Star expansion (Zien et al. 1999) constructs a graph \({G_{\star }=(V_{\star }, E_{\star })}\) by aggregating nodes and hyperedges as a new set of nodes (i.e., \(V_{\star }=\mathcal {V}\) \(\cup\) \(\mathcal {E}\)), and edges are created between each pair of incident node and hyperedge (i.e., \(E_{\star }=\{ (v,e) \vert v \in e, v \in \mathcal {V}, e \in \mathcal {E}\}\)). The sparsity pattern of the adjacency matrix of \(G_{\star }\) is the same as \({\textbf {S}} = \bigl ( {\begin{matrix} {\textbf {0}} &{} {\textbf {W}}\\ {\textbf {R}} &{} {\textbf {0}} \end{matrix}}\bigr )\), as illustrated in Fig. 1.
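To make the two expansions concrete, the following sketch builds the adjacency patterns of \(G_{\mathcal {C}}\) and \(G_{\star }\) for a small hypothetical hypergraph (the node/hyperedge sets and all variable names are our illustrative choices, not from the paper):

```python
import numpy as np

# Toy hypergraph: nodes {0,1,2,3}, hyperedges e0 = {0,1,2}, e1 = {2,3}.
hyperedges = [{0, 1, 2}, {2, 3}]
n, m = 4, len(hyperedges)

# Incidence matrix B: B[v, e] = 1 iff node v belongs to hyperedge e.
B = np.zeros((n, m))
for e, members in enumerate(hyperedges):
    for v in members:
        B[v, e] = 1.0

# Clique expansion: nodes u, v are adjacent iff they share some hyperedge;
# its sparsity pattern equals that of P = W R (here, of B B^T).
clique_pattern = (B @ B.T) > 0

# Star expansion: a bipartite graph over nodes and hyperedges; its
# (n + m) x (n + m) adjacency pattern is the block matrix [[0, B], [B^T, 0]].
star_pattern = np.block([[np.zeros((n, n)), B],
                         [B.T, np.zeros((m, m))]]) > 0
```

In this example, nodes 0 and 3 share no hyperedge, so they are adjacent in neither expansion, while node 2 (a member of both hyperedges) is connected to both star nodes \(e_0\) and \(e_1\).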

2.2 Random walk with restart on graphs

We introduce the concept of random walk with restart (RWR) on a graph and existing methods for computing RWR scores.

Concept Given a graph G and a query node s, random walk with restart (RWR) aims to obtain a vector \(\varvec{r}\) of proximities from s to each node on the graph (Tong et al. 2007b). Specifically, it assumes a random surfer that starts from node s and takes one of the following actions at each step:

  • Action (1) Random walk. The surfer randomly moves to one of the neighbors from the current node with probability \(1-c\). The probability of selecting each neighbor is proportional to the edge weight between the current node and the neighbor.

  • Action (2) Restart. The surfer jumps back to the query node s with restart probability c.

The stationary probability of the surfer visiting a node u is denoted by \(\varvec{r}_{u}\). That is, the RWR score vector \(\varvec{r}\) of all nodes w.r.t. s (a.k.a. single-source RWR scores) is the unique solution of the following equation:

$$\begin{aligned} \varvec{r} = (1 - c){\tilde{\textbf{A}}^{\top }}\varvec{r} + c\varvec{q}, \end{aligned}$$
(1)

where \(\tilde{{\textbf {A}}}\) is the row-normalized adjacency matrix of G, and c is called restart probability. An RWR query is denoted by \(\varvec{q} \in \mathbb {R}^{n}\), which is a unit vector whose s-th entry is 1. The resulting RWR score vector for the query \(\varvec{q}\) is denoted by \(\varvec{r} \in \mathbb {R}^{n}\). The choice of the query node s determines a specific RWR query \(\varvec{q}\), leading to a distinct RWR score vector \(\varvec{r}\). Note that the surfer often goes back to the query node s with probability c, and thus the proximities are spatially localized around s (Nassar et al. 2015), i.e., scores of nodes tightly connected to s are high, while those of distant nodes are low.

Fig. 1 Clique- and star-expansions of an example hypergraph and their sparsity patterns

In the following paragraphs, we introduce several existing methods for exact single-source RWR calculation on graphs, with a focus on iterative methods and preprocessing methods. Note that there also exist approximate methods (Tong et al. 2007b; Wu et al. 2021; Lin et al. 2020; Wang et al. 2019b), and those for identifying only the top-k nodes with the highest scores (Hou et al. 2021; Wei et al. 2018; Wang et al. 2017).

Iterative methods This approach repeatedly updates RWR scores from the initial ones until convergence. Among various methods, power iteration has been widely utilized due to its simplicity, which is described as follows:

  • Power iteration: Page et al. (1999) utilized the power iteration method that repeats updating \(\varvec{r}\) based on the following equation:

    $$\begin{aligned} \varvec{r}^{(i)} \leftarrow (1 - c)\tilde{\textbf{A}}^{\top }\varvec{r}^{(i-1)} + c\varvec{q}, \end{aligned}$$
    (2)

    where \(\varvec{r}^{(i)}\) denotes \(\varvec{r}\) at the i-th iteration. It is repeated until \(\varvec{r}\) converges. If \(0<c<1\), \(\varvec{r}\) is guaranteed to converge to a unique solution (Langville and Meyer 2006).

Although the iterative approach does not require any computational cost for preprocessing, it exhibits expensive query processing cost (i.e., computational cost per RWR query \(\varvec{q}\)) due to the repeated matrix–vector calculation for each query.
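The update of Eq. (2) can be sketched as follows; this is a minimal dense-matrix illustration with our own variable names, not the paper's implementation:

```python
import numpy as np

def rwr_power_iteration(A, s, c=0.15, tol=1e-10, max_iter=1000):
    """Single-source RWR scores via the power iteration of Eq. (2)."""
    n = A.shape[0]
    A_tilde = A / A.sum(axis=1, keepdims=True)  # row-normalized adjacency
    q = np.zeros(n)
    q[s] = 1.0                                  # unit query vector for node s
    r = q.copy()
    for _ in range(max_iter):
        r_next = (1 - c) * A_tilde.T @ r + c * q
        if np.abs(r_next - r).sum() < tol:      # L1 convergence check
            return r_next
        r = r_next
    return r

# 4-node path graph; scores are localized around the query node s = 0:
# the vector sums to 1 and the farthest node gets the lowest score.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
r = rwr_power_iteration(A, s=0)
```

Each iteration costs one sparse matrix-vector product, and the whole loop must be repeated from scratch for every query node, which is exactly the per-query cost the preprocessing methods below avoid.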

Preprocessing methods This approach aims to quickly calculate \(\varvec{r}\) for a given query node s based on preprocessed results. From Eq. (1), we can represent the problem as solving the following linear system:

$$\begin{aligned} \left( {\textbf {I}}_{n} - (1 - c)\tilde{\textbf{A}}^{\top }\right) \varvec{r} = c\varvec{q} \quad \Leftrightarrow \quad {\textbf {H}}\varvec{r} = c\varvec{q} \quad \Leftrightarrow \quad \varvec{r} = c{\textbf {H}}^{-1}\varvec{q}, \end{aligned}$$
(3)

where \({\textbf {I}}_{n}\) is an identity matrix of size n, \({\textbf {H}} = {\textbf {I}}_{n} - (1 - c){\tilde{\textbf{A}}^{\top }} \in \mathbb {R}^{n \times n}\) is called the random-walk normalized Laplacian matrix with probability \(1-c\), and each column of \({\textbf {H}}^{-1}\) contains the RWR scores w.r.t. the corresponding query node. Note that the inverse of \({\textbf {H}}\) always exists since its transpose is a strictly diagonally dominant matrix (Horn and Johnson 2012). However, precomputing \({\textbf {H}}^{-1}\) for large graphs is impractical due to its expensive computational costs (specifically, it requires \(O(n^3)\) time and \(O(n^2)\) space). To overcome these issues, preprocessing approaches focus on precomputing intermediate sub-matrices related to \({\textbf {H}}^{-1}\) and computing RWR scores rapidly based on them. These preprocessed matrices are computed only once and can be reused for multiple query nodes, reducing the computational cost per query.
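The preprocess-once, query-many idea can be illustrated with a sparse LU factorization of \({\textbf {H}}\): the factorization is computed a single time and each subsequent query is answered by two cheap triangular solves. This is a simplified stand-in for the sub-matrix-based methods BEAR and BePI, not a description of them:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

c = 0.15
A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
n = A.shape[0]

# Row-normalize: A~ = D^{-1} A.
deg = np.asarray(A.sum(axis=1)).ravel()
A_tilde = sp.diags(1.0 / deg) @ A

# One-time preprocessing: factorize H = I_n - (1 - c) A~^T.
H = sp.identity(n, format="csc") - (1 - c) * A_tilde.T.tocsc()
lu = spla.splu(H)            # reused for every query node

def rwr_query(s):
    """Per-query cost: building q plus two triangular solves."""
    q = np.zeros(n)
    q[s] = 1.0
    return lu.solve(c * q)   # r = c H^{-1} q, per Eq. (3)

r = rwr_query(0)
```

The factorization plays the role of the preprocessed matrices: its cost is paid once, after which each of potentially many query nodes is served without re-running any iteration.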

As described later, preprocessing approaches designed for graphs can also be utilized to accelerate RWR computation on hypergraphs within our proposed framework ARCHER. In this paper, we consider the following state-of-the-art methods for computing RWR on graphs, while our framework can be used with any preprocessing-based approaches, such as (Tong et al. 2007a; Fujiwara et al. 2012):

  • BEAR: Shin et al. (2015) developed BEAR, a block elimination approach that efficiently preprocesses sub-matrices related to \({\textbf {H}}^{-1}\). For that, they utilized a node reordering technique called SlashBurn (Kang and Faloutsos 2011), which exploits the hub-and-spoke structure, to reorder and partition the matrix \({\textbf {H}}\). After that, they applied block elimination (Boyd et al. 2004) to the partitioned sub-matrices for computing \(\varvec{r}\).

  • BePI: Jung et al. (2017) proposed BePI, a scalable and memory-efficient method for computing \(\varvec{r}\). Although BEAR achieves a fast speed for computing an RWR query, its scalability for larger graphs is limited due to the high cost of the inversion of a sub-matrix inside the block elimination. To resolve the issue, they first utilized SlashBurn to reorder the matrix and then incorporated an iterative approach into the block elimination by replacing the sub-matrix inversion with an iterative linear solver.

Note that these two methods have distinct advantages. BePI is more space-efficient and thus can be applied to larger graphs, while BEAR processes each RWR query faster on small datasets. Our experimental results show that the same distinct advantages are observed also in RWR computation on hypergraphs (see Figs. 3 and 4).

Other methods to address the substantial cost of the computation of \({\textbf {H}}^{-1}\) include employing graph sparsification (i.e., reducing non-zeros of \({\textbf {H}}\)). For example, Zhang et al. (2018a) developed a spectral sparsification method for directed graphs, demonstrating a strong correlation between the RWR scores computed from the sparsified graph and those obtained from the original graph. Another approach involves approximating \({\textbf {H}}\) as an Eulerian Laplacian matrix, followed by the application of a directed Laplacian system-solving algorithm. From the obtained values, approximated values of the original solution can be derived in nearly linear time (Cohen et al. 2018, 2016). These methods differ from BEAR and BePI in that they yield an approximate solution by transforming \({\textbf {H}}\) into a computationally efficient form. We plan to explore the incorporation of such approximate computation algorithms into our framework as a part of our future research directions.

Applications RWR has been extensively utilized in diverse graph mining tasks based on node-to-node similarities on graphs. Sun et al. (2005) designed normality scores based on RWR to detect abnormal nodes in a bipartite graph. Tong et al. (2007a) used RWR to measure the goodness of a match between a query graph and a subgraph. Zhu et al. (2013) employed RWR for measuring the relevance between a query image and the data images. Jung et al. (2016, 2019) extended RWR to signed RWR in order to calculate personalized ranking scores in a signed graph. Gasteiger et al. (2019a) incorporated RWR into graph neural networks (GNNs) to prevent aggregated embeddings from being over-smoothed. The RWR score matrix has been used for augmenting static graphs (Gasteiger et al. 2019b) and dynamic graphs (Lee and Jung 2023) to improve the performance of GNNs.

2.3 Random walk on hypergraphs

In this section, we introduce several previous random walk models and other related studies on hypergraphs.

Random walk models on hypergraphs A typical random walk on a hypergraph (Zhou et al. 2006) repeats (a) selecting an incident hyperedge with probability proportional to edge weights and (b) selecting an incident node uniformly at random. Chitra and Raphael (2019) extended the concept of random walk to hypergraphs with edge-dependent node (i.e., vertex) weights (EDNW). Given a hypergraph \(G_H=(\mathcal {V}, \mathcal {E}, \omega , \gamma )\) where \(\gamma _{e}(v)\) is the weight of node v depending on edge e, the random walk on \(G_{H}\) is defined as follows:

  • Action (1-1) For the current node u, the surfer selects a hyperedge e containing node u with probability proportional to \(\omega (e)\).

  • Action (1-2) The surfer moves to node v selected from one of the nodes in the hyperedge e with probability proportional to \(\gamma _{e}(v)\).

In the above model, we set \(\gamma _{e}(u) = 0\) if \(u \notin e\). If each node has the same node weight for all of its incident hyperedges (i.e., \(\gamma _e(v) = \gamma _{e'}(v)\), \(\forall e\ne e'\in \mathcal {E}\)), it is called a hypergraph with edge-independent node weights (EINW). As described in (Chitra and Raphael 2019), a random walk on a hypergraph with EINW is equivalent to that on an undirected clique-expanded graph from the hypergraph with some choice of weights; thus, its expressiveness is limited. On the other hand, a random walk on a hypergraph with EDNW is more expressive in that it may not be equivalent to that on any undirected clique-expanded graph.

It should be noted that even when random walk (with restart) on a hypergraph can be reduced to that on a graph, computing the equivalent random walk (with restart) on the graph may not be computationally optimal. Thus, regardless of this (in)equivalence, it is useful to develop fast computation methods for random walk (with restart) on hypergraphs.

Applications Hayashi et al. (2020) employed the random walk to devise a flexible framework for clustering hypergraph data. Li et al. (2018) proposed a local graph cut algorithm using the random walk for product-return prediction on a hypergraph. Zhang et al. (2018b) utilized the random walk for dynamic hypergraph structure learning. Lee et al. (2022) proposed HashNWalk which exploits the concept of random walk for detecting anomalous hyperedges in hyperedge streams. Note that these works utilized the concept of random walks on hypergraphs with EDNW, but they did not incorporate the concept of restart. Chitra and Raphael (2019) conducted a theoretical analysis of random walks on hypergraphs with EDNW. The authors briefly mentioned that extending this concept to incorporate restart is straightforward, but they did not provide further details. In their work, they utilized RWR for ranking problems on hypergraphs by using naive power iteration, which becomes impractical when dealing with many query nodes on large-scale hypergraphs.

Fig. 2 Overview of ARCHER, which consists of (1) a star-expansion-based RWR computation method, (2) a clique-expansion-based RWR computation method, and (3) a preprocessing (including automatic selection) method

3 Proposed framework

In this section, we propose ARCHER (Adaptive RWR Computation on Hypergraphs), a novel framework for rapid and space-efficient computation of random walk with restart (RWR) scores on a hypergraph. As depicted in Fig. 2, ARCHER consists of three components: (a) star-expansion-based computation methods, (b) clique-expansion-based computation methods, and (c) an automatic selection method. In ARCHER, for a given hypergraph \(G_{H}\), one of the star-expansion-based and clique-expansion-based computation methods is automatically selected based on the number of non-zeros in their resulting matrices (Component 3 in Sect. 3.4). Then, to leverage preprocessing techniques for fast RWR computation on graphs (e.g., BEAR and BePI), the RWR problem on the hypergraph is converted into that on the star-expanded graph (Component 1 in Sect. 3.2) or that on the clique-expanded graph (Component 2 in Sect. 3.3). After that, the RWR scores with respect to (potentially a large number of) query nodes are computed rapidly by employing a preprocessing-based approach. Note that our framework, ARCHER, can be equipped with any preprocessing-based approach for RWR computation on graphs.
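The overview above states that the selection compares the numbers of non-zeros of the resulting matrices. A hypothetical sketch of such a comparison follows; the function name and the exact decision rule are our illustrative assumptions, and the precise criterion is the one given in Sect. 3.4:

```python
import numpy as np
import scipy.sparse as sp

def choose_method(B):
    """Pick the expansion whose transition matrix has fewer non-zeros.

    B : (n, m) binary node-hyperedge incidence matrix (sparse).
    Illustrative rule only; ARCHER's actual criterion is in Sect. 3.4.
    """
    nnz_star = 2 * B.nnz          # S stacks W and R (one copy of B each)
    nnz_clique = (B @ B.T).nnz    # sparsity pattern of P = W R
    return "star" if nnz_star < nnz_clique else "clique"

# One giant hyperedge over 100 nodes: the clique expansion densifies.
n, m = 100, 1
B_large = sp.csr_matrix(np.ones((n, m)))

# 99 hyperedges of size 2 (a plain path graph): the clique stays sparse.
i = np.concatenate([np.arange(99), np.arange(1, 100)])  # edge endpoints
j = np.tile(np.arange(99), 2)                           # edge ids
B_pair = sp.csr_matrix((np.ones(198), (i, j)), shape=(100, 99))
```

Intuitively, large hyperedges blow up the clique expansion quadratically, favoring the star expansion, while many small hyperedges favor the clique expansion, which is why neither method dominates across hypergraphs.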

3.1 Random walk with restart on hypergraphs

First of all, we formally describe the random walk with restart (RWR) model on hypergraphs as follows:

Definition 1

(RWR on a hypergraph) Given a hypergraph \(G_H=(\mathcal {V}, \mathcal {E}, \omega , \gamma )\) and a query node s, a random surfer starts from node s. Then, the surfer takes one of the following actions at each step:

  • Action (1) Random walk. The surfer performs the following random walk on \(G_H\) with probability \(1-c\).

    • Action (1-1) For the current node u, the random surfer selects a hyperedge e containing node u with probability proportional to \(\omega (e)\).

    • Action (1-2) The surfer moves to node v selected from one of the nodes in the hyperedge e with probability proportional to \(\gamma _{e}(v)\).

  • Action (2) Restart. The surfer jumps back to the query node s with restart probability c.

The stationary probability of the surfer visiting a node u is denoted by \(\varvec{r}_{u}\), and the RWR score vector \(\varvec{r} \in \mathbb {R}^{n \times 1}\) of all nodes w.r.t. s in \(G_{H}\) is the unique solution of the linear system:

$$\begin{aligned} \varvec{r} = \underbrace{(1-c)\tilde{\textbf{R}}^{\top }\tilde{\textbf{W}}^{\top }\varvec{r}}_{\text {Random walk}} + \underbrace{c\varvec{q},}_{\text {Restart}} \end{aligned}$$
(4)

where \(0< c < 1\) is the restart probability of a random surfer, \(\varvec{q}\) is the RWR query vector, which is the unit vector whose s-th element is 1, and \((\tilde{\textbf{W}}\tilde{\textbf{R}})^{\top }\) is the transition matrix of the random walk on \(G_{H}\) where \(\tilde{\textbf{W}}\) and \(\tilde{\textbf{R}}\) are defined as follows:

  • (Regarding Action 1-1) \(\tilde{\textbf{W}} = \textbf{D}^{-1}_{\mathcal {V}}{\textbf {W}} \in \mathbb {R}^{n \times m}\) is the row-normalized hyperedge-weight matrix where \(\tilde{\textbf{W}}^{\top }\) indicates the transition from a node to a hyperedge. \({\textbf {W}}\) is the hyperedge-weight matrix, and \({\textbf {D}}_{\mathcal {V}} = \texttt {diag}({\textbf {W}}\varvec{1}_{m})\) is the node degree diagonal matrix where \(\varvec{1}_{m} \in \mathbb {R}^{m \times 1}\) is a column vector of ones.

  • (Regarding Action 1-2) \(\tilde{\textbf{R}} = \textbf{D}^{-1}_{\mathcal {E}}{\textbf {R}} \in \mathbb {R}^{m \times n}\) is the row-normalized node-weight matrix, where \(\tilde{\textbf{R}}^{\top }\) indicates the transition from a hyperedge to a node. \({\textbf {R}}\) is the node-weight matrix, and \({\textbf {D}}_{\mathcal {E}}=\texttt {diag}({\textbf {R}}\varvec{1}_{n})\) is the hyperedge degree diagonal matrix where \(\varvec{1}_{n} \in \mathbb {R}^{n \times 1}\) is a column vector of ones.
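As a baseline, the model of Definition 1 can be simulated directly by iterating Eq. (4), applying the two-step transition as two matrix-vector products per iteration. Below is a minimal sketch on a toy hypergraph; the weights \(\omega\) and \(\gamma\) are our illustrative choices:

```python
import numpy as np

# Toy hypergraph: 4 nodes, 2 hyperedges e0 = {0,1,2}, e1 = {2,3}.
n, m, c = 4, 2, 0.15
W = np.zeros((n, m))                 # W[v, e] = omega(e) if v in e
W[[0, 1, 2], 0] = 1.0                # omega(e0) = 1
W[[2, 3], 1] = 2.0                   # omega(e1) = 2
R = np.zeros((m, n))                 # R[e, v] = gamma_e(v) if v in e
R[0, [0, 1, 2]] = [1.0, 2.0, 1.0]
R[1, [2, 3]] = [3.0, 1.0]            # node 2's weight differs per edge (EDNW)

# Row-normalize: W~ = D_V^{-1} W and R~ = D_E^{-1} R.
W_t = W / W.sum(axis=1, keepdims=True)
R_t = R / R.sum(axis=1, keepdims=True)

q = np.zeros(n)
q[0] = 1.0                           # query node s = 0
r = q.copy()
for _ in range(1000):
    # Eq. (4): node -> hyperedge -> node, as two matvecs.
    r_new = (1 - c) * R_t.T @ (W_t.T @ r) + c * q
    converged = np.abs(r_new - r).sum() < 1e-12
    r = r_new
    if converged:
        break
```

Since \(\tilde{\textbf{W}}\tilde{\textbf{R}}\) is row-stochastic, the iterate remains a probability distribution, and the score of node 3 (reachable only through hyperedge \(e_1\)) stays the lowest for this query.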

Although the RWR score vector \(\varvec{r}\) on \(G_{H}\) can be obtained by repeatedly iterating Eq. (4) based on the power iteration method, such an iterative approach is not satisfactory due to its high computational cost per query node, as discussed in Sect. 2.2. As quickly computing RWR scores for a large number of query nodes is necessary for many applications, in the following sections, we propose two RWR computation methods that provide low cost per query node by preprocessing an input hypergraph, which incurs a one-time cost.

Algorithm 1 Star-expansion-based Method for RWR on Hypergraphs

3.2 Component 1: Star-expansion-based Method

We first propose a star-expansion-based method that computes the RWR scores on the graph \(G_{\star }\) star-expanded from the hypergraph \(G_{H}\). For this purpose, we construct a new transition matrix \(\tilde{\textbf{S}}\) as follows:

$$\begin{aligned} \tilde{\textbf{S}}= \begin{bmatrix} {\textbf {0}} &{} \tilde{\textbf{W}}\\ \tilde{\textbf{R}} &{} {\textbf {0}} \\ \end{bmatrix}, \end{aligned}$$
(5)

where \(\tilde{\textbf{S}} \in \mathbb {R}^{N \times N}\) with \(N=n+m\) is also row-normalized since \(\tilde{\textbf{W}}\) and \(\tilde{\textbf{R}}\) are row-normalized. Note that the sparsity pattern of \(\tilde{\textbf{S}}\) is the same as that of \({\textbf {S}}\), which corresponds to the star-expanded graph \(G_{\star }\) of \(G_{H}\) described in Sect. 2.1 (see the example in Fig. 1).

Our star-expansion-based method aims to calculate RWR scores on the new transition matrix \(\tilde{\textbf{S}}\) through the following equation:

$$\begin{aligned} \varvec{r}_{\star } = (1 - c_{\star })\tilde{\textbf{S}}^{\top }\varvec{r}_{\star } + c_{\star }\varvec{q}_{\star }, \end{aligned}$$
(6)

where \(c_{\star }\) is a modified restart probability from c (i.e., \(c_{\star } = 1 - \sqrt{1-c}\)), \(\varvec{r}_{\star } \in \mathbb {R}^{N \times 1}\) is the RWR score vector on \(\tilde{\textbf{S}}\), and \(\varvec{q}_{\star }\) is a modified query vector from \(\varvec{q}\) defined as follows:

$$\begin{aligned} \varvec{r}_{\star }= \begin{bmatrix} \varvec{r}_{\mathcal {V}} \\ \varvec{r}_{\mathcal {E}} \end{bmatrix} \qquad \text { and } \varvec{q}_{\star } = \begin{bmatrix} \varvec{q} \\ \varvec{0} \end{bmatrix}, \end{aligned}$$

where \(\varvec{r}_{\mathcal {V}}\) and \(\varvec{r}_{\mathcal {E}}\) denote the RWR score vectors on nodes and hyperedges, respectively. Once we obtain \(\varvec{r}_{\mathcal {V}}\), the target RWR score vector \(\varvec{r}\) is easily converted from \(\varvec{r}_{\mathcal {V}}\) according to the following theorem:

Theorem 1

(Star Expansion Equality) Suppose \(\varvec{r}\) is the RWR score vector on a hypergraph \(G_{H}\) in Eq. (4), and \(\varvec{r}_{\mathcal {V}}\) is the sub-vector of \(\varvec{r}_{\star }\), which is the RWR score vector on a star-expanded graph \(G_{\star }\) of \(\tilde{\textbf{S}}\) in Eq. (6). Then, the following equality holds:

$$\begin{aligned} \varvec{r} = \frac{c}{c_{\star }}\varvec{r}_{\mathcal {V}}, \end{aligned}$$

where \(c_{\star } = 1 - \sqrt{1-c}\) is the modified restart probability, which ranges from 0 to 1 if \(0<c<1\).

Proof

We rewrite Eq. (6) using the definitions of \(\tilde{\textbf{S}}\), \(\varvec{r}_{\star }\) and \(\varvec{q}_{\star }\) as follows:

$$\begin{aligned} \begin{bmatrix} \varvec{r}_{\mathcal {V}} \\ \varvec{r}_{\mathcal {E}} \end{bmatrix} =(1-c_{\star }) \begin{bmatrix} {\textbf {0}} &{} \tilde{\textbf{R}}^{\top }\\ \tilde{\textbf{W}}^{\top } &{} {\textbf {0}}\\ \end{bmatrix} \begin{bmatrix} \varvec{r}_{\mathcal {V}} \\ \varvec{r}_{\mathcal {E}} \end{bmatrix} +c_{\star } \begin{bmatrix} \varvec{q} \\ \varvec{0} \end{bmatrix}. \end{aligned}$$

Then, \(\varvec{r}_{\mathcal {V}}\) and \(\varvec{r}_{\mathcal {E}}\) are represented as:

$$\begin{aligned} \varvec{r}_{\mathcal {V}}&= (1-c_{\star })\tilde{\textbf{R}}^{\top }\varvec{r}_{\mathcal {E}} + c_{\star }\varvec{q}, \end{aligned}$$
(7)
$$\begin{aligned} \varvec{r}_{\mathcal {E}}&= (1-c_{\star })\tilde{\textbf{W}}^{\top }\varvec{r}_{\mathcal {V}}. \end{aligned}$$
(8)

By plugging in Eq. (8) into Eq. (7), we obtain the following equation:

$$\begin{aligned} \varvec{r}_{\mathcal {V}}&= (1-c_{\star })^2\tilde{\textbf{R}}^{\top }\tilde{\textbf{W}}^{\top }\varvec{r}_{\mathcal {V}} + c_{\star }\varvec{q} \nonumber \\ \Leftrightarrow \varvec{r}_{\mathcal {V}}&= (1-c_{\star })^2\tilde{\textbf{P}}^{\top }\varvec{r}_{\mathcal {V}} + c_{\star }\varvec{q}, \end{aligned}$$
(9)

where \(\tilde{\textbf{P}} = \tilde{\textbf{W}}\tilde{\textbf{R}}\). Note that \(c_{\star } = 1 - \sqrt{1-c}\) by its definition, satisfying \((1-c_{\star })^2=(1-c)\). Then, Eq. (9) is represented as follows:

$$\begin{aligned} \varvec{r}_{\mathcal {V}}&= (1-c)\tilde{\textbf{P}}^{\top }\varvec{r}_{\mathcal {V}} + c_{\star }\varvec{q} \\ \Leftrightarrow \varvec{r}_{\mathcal {V}}&= c_{\star }\left( {\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}^{\top }\right) ^{-1}\varvec{q} \\ \Leftrightarrow \varvec{r}_{\mathcal {V}}&= c_{\star }{\textbf {H}}_{\mathcal {C}}^{-1}\varvec{q} = \frac{c_{\star }}{c}\varvec{r}, \end{aligned}$$

where \({\textbf {H}}_{\mathcal {C}} = {\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}^{\top }\), and \(\varvec{r} = c{\textbf {H}}_{\mathcal {C}}^{-1}\varvec{q}\) by Eq. (4). This proves the claim \(\varvec{r} = \frac{c}{c_{\star }}\varvec{r}_{\mathcal {V}}\). \(\square\)

Theorem 1 indicates that the RWR score vector \(\varvec{r}\) of Eq. (4) can be obtained by solving the RWR problem on the star-expanded graph in Eq. (6). Since Eq. (6) has the same mathematical form as Eq. (1), we can apply preprocessing-based approaches (e.g., BEAR and BePI), which are based on the following linear system:

$$\begin{aligned} {\textbf {H}}_{\star }\varvec{r}_{\star } = c_{\star }\varvec{q}_{\star }, \end{aligned}$$

where \({\textbf {H}}_{\star } = {\textbf {I}}_{N} - (1-c_{\star })\tilde{\textbf{S}}^{\top } \in \mathbb {R}^{N \times N}\), and \({\textbf {I}}_{N}\) is an identity matrix of size N. Note that \({\textbf {H}}_{\star }\) is invertible, as shown below, and thus the linear system on \({\textbf {H}}_{\star }\) can be solved using preprocessing methods.

Theorem 2

(Invertibility of \(\textbf{H}^{\top }_{\star }\)) If \(0< c < 1\), \({\textbf {H}}_{\star }\) is invertible.

Proof

We first show that \(\textbf{H}^{\top }_{\star }\) is strictly diagonally dominant. Note that \(\textbf{H}^{\top }_{\star } = {\textbf {I}}_{N} - (1-c_{\star })\tilde{\textbf{S}}\) by its definition, and each entry of \(\tilde{\textbf{S}}\) is non-negative. For each row i, \(|\textbf{H}^{\top }_{\star _{ii}} |= 1\) because \(\tilde{\textbf{S}}_{ii} = 0\) as shown in Eq. (5). For non-diagonal entries of the i-th row of \(\textbf{H}^{\top }_{\star }\), \(\sum _{j \ne i}|\textbf{H}^{\top }_{\star _{ij}} |= 1-c_{\star }\) since \(\tilde{\textbf{S}}\) is row-normalized. Thus, the following inequality holds for every row i:

$$\begin{aligned} \sum _{j \ne i}|\textbf{H}^{\top }_{\star _{ij}} |= 1-c_{\star } < 1 = |\textbf{H}^{\top }_{\star _{ii}} |, \end{aligned}$$

where \(0< c_{\star } < 1\) for a given c, indicating \(\textbf{H}^{\top }_{\star }\) is strictly diagonally dominant.

The strict diagonal dominance of \(\textbf{H}^{\top }_{\star }\) implies its invertibility (Horn and Johnson 2012), which, in turn, implies the invertibility of its transposed matrix, \({\textbf {H}}_{\star }\). \(\square\)

Algorithm 1 summarizes the star-expansion-based method for computing the RWR score vector \(\varvec{r}\) w.r.t. a query node s in \(G_{H}\). The algorithm consists of preprocessing and query phases, as it adopts a preprocessing-based approach (e.g., BEAR or BePI). Note that the preprocessing phase is run once, while the query phase is run for each query node. In the preprocessing phase, the method first constructs the transition matrix \(\tilde{\textbf{S}}\) (lines 2 and 3). Then, it computes \({\textbf {H}}_{\star }\) (line 4) and preprocesses it by applying a preprocessing-based approach (line 5), resulting in a set \(\varvec{\Theta }_{\star }\) of preprocessed matrices. Whenever a user submits a query node s, the query phase computes the RWR score vector \(\varvec{r}\) w.r.t. s. It first creates \(\varvec{q}_{\star }\) (line 8) and then computes \(\varvec{r}_{\star }\) (line 9) based on Eq. (6), employing the query phase of the preprocessing method with the preprocessed results \(\varvec{\Theta }_{\star }\). Based on Theorem 1, the algorithm finally computes the target RWR score vector \(\varvec{r}\) (lines 10 and 11).
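To make the preprocess/query split concrete, the sketch below mimics Algorithm 1 with SciPy, using a sparse LU factorization as a stand-in for the preprocessed matrices \(\varvec{\Theta }_{\star }\) produced by BEAR or BePI. The function names, the interface taking \(\tilde{\textbf{S}}\) and \(c_{\star }\) directly, and the omission of the final conversion from \(\varvec{r}_{\star }\) back to \(\varvec{r}\) (Theorem 1) are our simplifying assumptions, not the paper's implementation.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def preprocess_star(S_tilde, c_star):
    """Preprocessing phase: form H_star = I_N - (1 - c_star) * S~^T and
    factorize it once (sparse LU stands in for BEAR/BePI's Theta_star)."""
    N = S_tilde.shape[0]
    H_star = sp.identity(N, format="csc") - (1.0 - c_star) * S_tilde.T
    return spla.splu(sp.csc_matrix(H_star))

def query_star(theta_star, c_star, s, N):
    """Query phase: solve H_star r_star = c_star * q_star for query node s."""
    q_star = np.zeros(N)
    q_star[s] = 1.0
    return theta_star.solve(c_star * q_star)
```

On a toy star graph of one hyperedge over two nodes (N = 3), the returned scores sum to one and favor the query node, as expected for RWR.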

3.3 Component 2: Clique-expansion-based Method

We propose a clique-expansion-based method that computes the RWR scores on the graph \(G_{\mathcal {C}}\) clique-expanded from the hypergraph \(G_{H}\). It explicitly constructs the transition matrix \(\tilde{\textbf{P}}=\tilde{\textbf{W}}\tilde{\textbf{R}}\), with which Eq. (4) becomes

$$\begin{aligned} \varvec{r} = (1 - c)\tilde{\textbf{P}}^{\top }\varvec{r} + c\varvec{q}, \end{aligned}$$
(10)

where the sparsity pattern of \(\tilde{\textbf{P}}\) is the same as that of the adjacency matrix of the graph \(G_{\mathcal {C}}\) clique-expanded from \(G_{H}\), as described in Sect. 2.1 (see the example in Fig. 1).
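As an illustration (not the paper's implementation), the dense construction below builds \(\tilde{\textbf{P}}=\tilde{\textbf{W}}\tilde{\textbf{R}}\) from a list of hyperedges. The defaults \(\gamma _e(v)=1\) and \(\omega (e)=1\) correspond to the unweighted case, and we assume every node belongs to at least one hyperedge.

```python
import numpy as np

def clique_transition(edges, n, gamma=None, omega=None):
    """Build the n x n transition matrix P~ = W~ R~ of the clique-expanded
    graph from a list of hyperedges (iterables of node ids).

    omega[e] is the hyperedge weight and gamma[e][v] the edge-dependent
    node weight (EDNW); both default to 1. Dense arrays are used purely
    for illustration."""
    m = len(edges)
    W = np.zeros((n, m))  # node -> hyperedge step, proportional to omega
    R = np.zeros((m, n))  # hyperedge -> node step, proportional to gamma
    for e, nodes in enumerate(edges):
        for v in nodes:
            W[v, e] = 1.0 if omega is None else omega[e]
            R[e, v] = 1.0 if gamma is None else gamma[e][v]
    W /= W.sum(axis=1, keepdims=True)  # row-normalize; assumes no isolated nodes
    R /= R.sum(axis=1, keepdims=True)
    return W @ R
```

Note that, unlike the star-expansion transition matrix, \(\tilde{\textbf{P}}\) can have non-zero diagonal entries (self-loops), which is why the clique-expansion proof of diagonal dominance must account for \(\tilde{\textbf{P}}_{ii}\).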

Based on Eq. (10), we apply a preprocessing approach to the graph of \(\tilde{\textbf{P}}\) clique-expanded from \(G_{H}\), which solves the following linear system:

$$\begin{aligned} {\textbf {H}}_{\mathcal {C}}\varvec{r} = c\varvec{q}, \end{aligned}$$

where \({\textbf {H}}_{\mathcal {C}} = {\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}^{\top } \in \mathbb {R}^{n \times n}\) and \({\textbf {I}}_{n}\) is the identity matrix of size n. Note that \({\textbf {H}}_{\mathcal {C}}\) is also invertible, which is proven in the following theorem:

Theorem 3

(Invertibility of \({\textbf {H}}_{\mathcal {C}}\)) If \(0 < c < 1\), \({\textbf {H}}_{\mathcal {C}}\) is invertible.

Proof

We first show that \(\textbf{H}^{\top }_{\mathcal {C}}\) is strictly diagonally dominant. By its definition, \(\textbf{H}^{\top }_{\mathcal {C}} = {\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}\), and each entry of \(\tilde{\textbf{P}}\) is non-negative. For each row i, \(|(\textbf{H}^{\top }_{\mathcal {C}})_{ii}| = 1 - (1-c)\tilde{\textbf{P}}_{ii}\). Since \(\tilde{\textbf{P}}\) is row-normalized, \(\sum _{j \ne i} |\tilde{\textbf{P}}_{ij}| = 1 - \tilde{\textbf{P}}_{ii}\). Then, the following inequality holds for every row i:

$$\begin{aligned} \sum _{j \ne i} |(\textbf{H}^{\top }_{\mathcal {C}})_{ij}| = (1-c)(1 - \tilde{\textbf{P}}_{ii}) = (1 - (1-c)\tilde{\textbf{P}}_{ii}) - c < 1 - (1-c)\tilde{\textbf{P}}_{ii} = |(\textbf{H}^{\top }_{\mathcal {C}})_{ii}|, \end{aligned}$$

where \(0 < c < 1\). This indicates that \(\textbf{H}^{\top }_{\mathcal {C}}\) is strictly diagonally dominant.

The strict diagonal dominance of \(\textbf{H}^{\top }_{\mathcal {C}}\) implies its invertibility (Horn and Johnson 2012), which, in turn, implies the invertibility of its transposed matrix, \({\textbf {H}}_{\mathcal {C}}\). \(\square\)
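Both proofs hinge on strict diagonal dominance. The small numerical check below (our own illustration, not part of the paper) confirms the property for \(\textbf{H}^{\top }_{\mathcal {C}} = {\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}\) when \(0 < c < 1\), and shows that it fails at the boundary \(c = 0\).

```python
import numpy as np

def is_strictly_diag_dominant(A):
    """Row-wise strict diagonal dominance: |A_ii| > sum_{j != i} |A_ij|."""
    A = np.asarray(A)
    diag = np.abs(np.diag(A))
    off = np.abs(A).sum(axis=1) - diag
    return bool(np.all(diag > off))

# A row-stochastic P~ with non-zero diagonal entries, as in a clique expansion.
P = np.array([[0.5, 0.5, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 0.5, 0.5]])
```

With restart probability c = 0.1, every row's diagonal entry strictly exceeds the off-diagonal row sum; with c = 0, the two are equal and strictness is lost, matching the requirement \(0 < c < 1\) in Theorems 2 and 3.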

The clique-expansion-based method is summarized in Algorithm 2, which consists of preprocessing and query phases. In the preprocessing phase, it first explicitly builds the transition matrix \(\tilde{\textbf{P}} \in \mathbb {R}^{n \times n}\) (lines 2 and 3). Then, the algorithm computes the matrix \({\textbf {H}}_{\mathcal {C}}\) (line 4). By applying a preprocessing method, it processes \({\textbf {H}}_{\mathcal {C}}\) and obtains the preprocessed results \(\varvec{\Theta }_{\mathcal {C}}\) (line 5). In the query phase, it creates the RWR query vector \(\varvec{q}\) (line 8) and then computes the RWR score vector \(\varvec{r}\) by querying \(\varvec{q}\) using the preprocessed results \(\varvec{\Theta }_{\mathcal {C}}\) (line 9). The query phase is initiated whenever a user submits a query node.

The time and space complexities of both the clique- and star-expansion-based methods can be directly derived from the complexities of the underlying RWR computation methods (e.g., BEAR (Shin et al. 2015) and BePI (Jung et al. 2017)) and the definitions of clique and star expansions. When employing BePI as the RWR computation method, although the complexities involve many terms related to graph structure, the preprocessing time, space cost, and query time are, empirically, largely influenced by the number of edges after expansion, specifically \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\) in the clique- and star-expansion-based computations, respectively. For detailed empirical results, refer to Appendix E. This empirical tendency is utilized in the subsequent subsection for the automatic selection between the clique- and star-expansion-based methods.

Algorithm 2 Clique-expansion-based Method for RWR on Hypergraphs

3.4 Component 3: Automatic selection method

As described in Sects. 3.2 and 3.3, our clique- and star-expansion-based methods allow for fast RWR computation on the hypergraph \(G_{H}\) by leveraging a preprocessing approach. Interestingly, the preprocessed matrices \({\textbf {H}}_{\star }\) and \({\textbf {H}}_{\mathcal {C}}\) have very different characteristics. For example, \({\textbf {H}}_{\mathcal {C}}\) may have a large number of non-zeros because each hyperedge e is replaced with a clique over all nodes in e, which can harm scalability as the preprocessed results become dense. On the other hand, \({\textbf {H}}_{\star } \in \mathbb {R}^{N \times N}\) can be relatively sparse, but it is of large dimension. Recall that \(N = n + m\), where n and m are the numbers of nodes and hyperedges, respectively. As a result, the relative time and space required to preprocess \({\textbf {H}}_{\star }\) and \({\textbf {H}}_{\mathcal {C}}\) depend heavily on the dataset. For example, if \(n \ll m\), preprocessing the smaller matrix \({\textbf {H}}_{\mathcal {C}} \in \mathbb {R}^{n \times n}\) can be computationally advantageous even though it is dense.

Thus, we further develop an automatic selection method that chooses between the clique- and star-expansion-based methods so that the chosen one brings out the best performance of a preprocessing method. Various statistics of a hypergraph could serve as the selection criterion. Our strategy is to use the number of non-zeros of the matrix to be preprocessed, under the hypothesis that the performance of a preprocessing method is largely determined by the non-zero entries, since such a method exploits the sparsity of the preprocessed matrix for efficiency. We empirically demonstrate the effectiveness of this criterion compared to various other statistics in Sect. 5.5.

Based on this criterion, our framework selects the star-expansion-based method if the following predicate is satisfied:

$$\begin{aligned} \texttt {nnz}({\textbf {H}}_{\mathcal {C}}) > \texttt {nnz}({\textbf {H}}_{\star }) \end{aligned}$$
(11)

where \(\texttt {nnz}(\cdot )\) returns the number of non-zeros of an input matrix. Otherwise, our method selects the clique-expansion-based method. Note that counting \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\) takes a very short time compared to the total running time, which is mostly consumed by non-trivial operations, such as reordering and matrix multiplications, in the preprocessing methods (Shin et al. 2015; Jung et al. 2017). While the empirical computation time is already very small, further optimization allows us to evaluate Eq. (11) in \(O(\sum _{e \in \mathcal {E}}|e |^{2})\) time with O(n) extra space, as explained in Appendix D.
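The selection rule of Eq. (11) can be sketched directly from the sparsity patterns: \(\texttt {nnz}({\textbf {H}}_{\star })\) is the \(N\)-dimensional diagonal plus one non-zero per node-hyperedge incidence in each direction, while \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) is the diagonal of \({\textbf {I}}_{n}\) united with the clique pattern. The naive pair-set version below runs in \(O(\sum _{e \in \mathcal {E}}|e|^{2})\) time but uses more than O(n) extra space; the space-optimized counting of Appendix D is not reproduced here.

```python
def choose_expansion(edges, n):
    """Return which expansion Eq. (11) selects for a hyperedge list.

    nnz(H_star): N = n + m diagonal entries plus 2 * sum(|e|) incidences.
    nnz(H_C): the diagonal of I_n united with the clique sparsity pattern."""
    m = len(edges)
    nnz_star = (n + m) + 2 * sum(len(e) for e in edges)
    pairs = {(v, v) for v in range(n)}  # diagonal of H_C
    for e in edges:
        nodes = list(e)
        for u in nodes:
            for v in nodes:
                pairs.add((u, v))       # clique over the nodes of e
    nnz_clique = len(pairs)
    return "star" if nnz_clique > nnz_star else "clique"
```

A single large hyperedge inflates the clique pattern quadratically and pushes the choice toward the star expansion, while many small hyperedges keep the clique matrix sparse.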

Algorithm 3 ARCHER: Adaptive RWR Computation on Hypergraphs

3.5 Ultimate framework: ARCHER

By putting all of the components together, we develop ARCHER, our ultimate framework for fast computation of RWR on hypergraphs; its procedure is summarized in Algorithm 3. In the preprocessing phase, ARCHER first computes \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\) (lines 2 and 3), which are used to select either the clique- or star-expansion-based method. Based on the criterion in Eq. (11), ARCHER chooses one of the methods and executes the corresponding preprocessing step (lines 4-7) to obtain a set \(\varvec{\Theta }\) of preprocessed matrices. In the query phase, for each query node s, ARCHER performs the query step of the selected method (lines 10-13), using the preprocessed matrices \(\varvec{\Theta }\), to compute the RWR score vector \(\varvec{r}\).

It is important to note that our framework ARCHER can be equipped with any preprocessing-based RWR computation method (e.g., BEAR and BePI), which is used inside Algorithms 1 and 2, and the total time and space complexity of ARCHER depends on the chosen method. Different methods offer distinct advantages, as demonstrated empirically in Sect. 5.

4 Application to anomaly detection

In this section, we present an application of RWR scores on hypergraphs for the purpose of anomaly detection. The empirical effectiveness of this approach is demonstrated in Sect. 5.6.

Given a hypergraph, this application aims to detect anomalous hyperedges that deviate from ordinary group interactions. Inspired by Sun et al. (2005), which uses RWR for anomaly detection on graphs, we measure a normality score of a hyperedge based on relevance scores provided by hypergraph RWR. Our intuition is that if a hyperedge e is normal, the relevance scores between any pair of nodes in e should be high. Specifically, we define the normality score ns(e) as follows:

$$\begin{aligned} ns(e) = \frac{1}{\vert e \vert (\vert e \vert - 1)}\sum _{u \in e}\sum _{v \in e \setminus \{u\}}\varvec{r}_{u \rightarrow v} \end{aligned}$$

where \(\varvec{r}_{u \rightarrow v}\) is the RWR score of the node v w.r.t. the query node u. In other words, ns(e) is the average pairwise relevance score between nodes in the hyperedge e. If the normality score is low, the hyperedge is considered anomalous. Note that a fast computation method such as ARCHER is needed for this task because it requires RWR scores for many query nodes.
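Given precomputed score vectors (e.g., from ARCHER), the normality score can be sketched as follows; the interface `rwr[u]`, standing for the RWR score vector with query node u, is our assumption.

```python
def normality_score(e, rwr):
    """ns(e): average pairwise RWR relevance among the nodes of hyperedge e.

    rwr[u][v] is the RWR score of node v w.r.t. query node u;
    a low ns(e) flags the hyperedge as anomalous."""
    nodes = list(e)
    k = len(nodes)
    total = sum(rwr[u][v] for u in nodes for v in nodes if v != u)
    return total / (k * (k - 1))
```

For a two-node hyperedge, this reduces to the mean of the two directed relevance scores between its members.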

More applications We have also studied another application, the task of node retrieval, which is described in Appendix B.

5 Experiments

In this section, we evaluate the performance of ARCHER and compare it with other baselines for computing RWR on hypergraphs. We aim to answer the following questions from the experiments:

  • Q1. Preprocessing Time (Sect. 5.2). How long do ARCHER and the two computation methods composing it take for preprocessing?

  • Q2. Space Cost (Sect. 5.3). How much memory space do they require for their preprocessed results?

  • Q3. Query Time (Sect. 5.4). How quickly do they process RWR queries?

  • Q4. Automatic Selection Method (Sect. 5.5). How precisely does our automatic selection strategy decide an appropriate computation method for a given hypergraph?

  • Q5. Application to Anomaly Detection (Sect. 5.6). Can we achieve more accurate anomaly detection on hypergraphs using RWR scores, compared to existing approaches?

5.1 Experimental settings

We describe our experimental settings, including machines, methods, datasets, and parameters.

Table 2 Data statistics of real-world hypergraphs

Machines All experiments are conducted on a workstation with AMD Ryzen 9 3900X and 128GB memory.

Methods For the experiments evaluating computational performance, we compare ARCHER with always using a single expansion method (the star- or clique-expansion-based method). ARCHER and the baselines are equipped with BEAR (Shin et al. 2015) or BePI (Jung et al. 2017), state-of-the-art preprocessing methods for RWR on graphs. We also compare ARCHER with the power-iteration methods on the star- and clique-expanded graphs in Eqs. (6) and (10), respectively. We use our MATLAB implementation of the power iterations, and we use the authors' source code for BEAR and BePI, which are also implemented in MATLAB.

For anomaly detection, we consider the following two hypergraph-based anomaly-detection methods as baseline approaches:

  • LSH-A (Ranshous et al. 2017): For each incoming hyperedge, it computes an approximate frequency, i.e., the number of previous hyperedges that are similar to the new one in terms of shared nodes. This frequency is used to measure the unexpectedness of the hyperedge; intuitively, a hyperedge can be considered unexpected if it is novel and significantly different from previous hyperedges (i.e., has low frequency). The approximation is implemented efficiently over hyperedge streams using Locality Sensitive Hashing (LSH) (Rajaraman and Ullman 2011). Specifically, it computes a MinHash signature with \(k_h\) hash functions and performs LSH with b bands, then scores the hyperedge by the similarity between its signature and those of previous hyperedges. LSH-A takes \(O(k_h \vert e \vert + b + b \lceil \frac{mb}{B} \rceil + 1)\) time per hyperedge, and we use our Python implementation of LSH-A.

  • HashNWalk (Lee et al. 2022): In HashNWalk, each hyperedge is hashed into M buckets called supernodes using each of \(k_h\) hash functions. Based on the transition probability (i.e., random walks of length 1), HashNWalk calculates the proximity between supernodes. Whenever a new hyperedge emerges, HashNWalk updates the proximity between the supernodes in it and compares it with the previous proximity to compute the anomaly score of the hyperedge. In scoring, HashNWalk incorporates the hyperparameter \(\alpha\) to control the degree of emphasis placed on recent hyperedges. HashNWalk takes \(O(k_h\vert e \vert + k_h \min (M, \vert e \vert )^2)\) time per hyperedge, and we use the official implementation of HashNWalk in C++.

Datasets We conduct extensive experiments on eighteen real-world hypergraphs  (Benson et al. 2018; Sinha et al. 2015; Yin et al. 2017; Leskovec et al. 2007; Amburg et al. 2020; Chodrow et al. 2021; Fowler 2006a, b; Ni et al. 2019; McAuley and Leskovec 2013; Harper and Konstan 2015), whose statistics are summarized in Table 2. We provide details of each dataset in Appendix A. The source code and datasets used in this paper are available at https://github.com/jaewan01/ARCHER.

Parameters For the experiments of preprocessing and query costs, we set the restart probability c to 0.05, which has been widely used in previous work (Tong et al. 2007b; Shin et al. 2015; Jung et al. 2017). For BEAR, we set the hub selection ratio k of the reordering method to 0.001 as in (Shin et al. 2015). For BePI, we set the hub selection ratio k of the reordering method to 0.2, which is used for large graphs in (Jung et al. 2017) (see Footnote 1 for the usage of k). The error tolerance \(\epsilon\) for the power iteration and BePI is set to \(10^{-9}\). We set the time and memory limits for preprocessing to 12 hours and 128 GB, respectively.

We also perform careful hyperparameter tuning for the aforementioned anomaly detection methods. For LSH-A, we measure performance with different numbers of bands in LSH signatures (\(b \in \{2, 4, 8\}\)) and lengths of LSH signatures (\(l \in \{2, 4, 8\}\)) and report the best result obtained. For HashNWalk, we adopt a setting similar to that described in (Lee et al. 2022). Specifically, we set the hyperparameter \(\alpha\) in the kernel function to 0.98. Additionally, we search for the optimal values of the number of hash functions \(k_h\) and the number of buckets M in the following ranges: (a) \(k_h \in \{10, 15, 20\}\) and \(M \in \{10, 20, 30\}\) for the email-Enron dataset, (b) \(k_h \in \{8, 10, 12\}\) and \(M \in \{60, 80, 100\}\) for the senate-bills dataset, and (c) \(k_h \in \{8, 10, 12\}\) and \(M \in \{100, 150, 200\}\) for the house-bills dataset. We report the best result achieved for each dataset.

Fig. 3 Preprocessing costs for various calculation methods of RWR on hypergraphs in terms of (a) preprocessing time and (b) space usage. For preprocessing times, the error bars indicate ±1 standard deviation. However, in many datasets, the error bars are so small that they are practically invisible. As shown in the figures, using ARCHER requires up to \(137.6\times\) less preprocessing time and \(16.2\times\) less space than using always one expansion method. Results are omitted if the corresponding methods ran out of time (\(>12\) hours) or out of memory (\(>128\) GB) during preprocessing

5.2 Preprocessing time

We evaluate the performance of ARCHER in terms of preprocessing time. BEAR and BePI are used as preprocessing techniques. For each method, we report the average preprocessing time over 10 runs. Note that the iterative methods are excluded from this experiment because they do not require preprocessing. Figure 3a shows the preprocessing time of all tested methods on the 18 real-world hypergraphs.

We first compare ARCHER with BEAR and each expansion-based method with BEAR. ARCHER with BEAR preprocesses hypergraphs up to \(4.2 \times\) faster than applying BEAR to the star-expanded graph (see the results on the SB dataset) and up to \(2.4 \times\) faster than applying BEAR to the clique-expanded graph (see the results on the ML20 dataset). Note that the clique-expansion-based method with BEAR cannot preprocess medium-sized datasets such as WAL and AM because clique expansion produces a too-dense matrix that BEAR cannot handle. For larger datasets such as TW, COG, COD, and THS, all methods with BEAR fail due to their limited scalability. On the other hand, ARCHER with BePI provides better scalability for preprocessing, and it successfully preprocesses all the datasets, showing up to \(137.6 \times\) faster than applying BePI to the clique-expanded graph (see the results on the TW dataset) and up to \(4.5 \times\) faster than applying BePI to the star-expanded graph (see the results on the EEU dataset).

These results imply the complementary nature of the clique-expansion-based methods and the star-expansion-based methods. Specifically, they exhibit relative advantages on different hypergraphs, and these advantages are significant. This underscores the importance of making careful choices between the two methods, as ARCHER does. Our results regarding space cost and query time in the following subsections further reinforce the importance of this selection, although we do not repeat the same discussion within those subsections.

Fig. 4 Query time for various calculation methods of RWR on hypergraphs. The error bars indicate ±1 standard deviation. Using ARCHER takes up to \(218.8\times\) less query time than using always one expansion method. Results are omitted if the corresponding methods ran out of time (\(>12\) hours) or out of memory (\(>128\) GB) during preprocessing

5.3 Space cost

We analyze the space cost of ARCHER with each preprocessing technique compared to that of other methods. The iterative methods are excluded from the comparison because they do not produce preprocessed results beyond the size of the original data. We measure the memory size for storing preprocessed results in MB. Figure 3b shows the space cost of each method. ARCHER with BEAR uses up to \(16.2\times\) less memory than applying BEAR to the clique-expanded graph (see the results on the ML20 dataset) and up to \(9.6\times\) less memory than applying BEAR to the star-expanded graph (see the results on the SB dataset). Furthermore, the space cost of ARCHER with BePI is up to \(12.3 \times\) less than that of applying BePI to the clique-expanded graph (see the results on the TW dataset) and up to \(9.0 \times\) less than that of applying BePI to the star-expanded graph (see the results on the SB dataset).

5.4 Query time

We examine the computational efficiency of ARCHER in processing RWR queries, compared to other baselines, including the power-iteration methods. For every method, we measure the average query processing time over the same 30 query nodes. Figure 4 shows the results on the 18 real-world hypergraphs. ARCHER with BEAR answers RWR queries up to \(16.3 \times\) faster than applying BEAR to the star-expanded graph (see the results on the EEU dataset) and up to \(1.3 \times\) faster than applying BEAR to the clique-expanded graph (see the results on the ML20 dataset). For ARCHER with BePI, its query time is up to \(218.8 \times\) less than that of applying BePI to the star-expanded graph (see the results on the EEU dataset) and up to \(4.0 \times\) less than that of applying BePI to the clique-expanded graph (see the results on the AM dataset). Note that regardless of the preprocessing technique used, ARCHER significantly outperforms both power-iteration methods.

The experimental results regarding preprocessing and query costs imply that we should consider different aspects of the preprocessing methods when choosing one of them. If query cost matters more than preprocessing cost, BEAR is a good choice because its query speed is faster than that of BePI, especially on small datasets, as shown in Fig. 4. On the other hand, BePI is required for scalable RWR computation on larger hypergraphs because BEAR fails to process such hypergraphs, as discussed in Sect. 5.2.

Fig. 5 Effectiveness of various data statistics in selection between the star- and clique-expansion-based methods. BePI is used as the preprocessing approach. Each dashed line indicates the best threshold for the corresponding criterion. Note that, when using our proposed criterion, we can choose the better expansion method for 17 (out of 18) datasets, while the number of wrong choices increases when we use the other criteria

5.5 Automatic selection method

We investigate the effectiveness of the proposed criterion in our automatic selection method in ARCHER, compared to other potential criteria based on major statistics of hypergraphs, including average hyperedge size, density, and overlapness (Lee et al. 2021), which are summarized in Table 2.

To evaluate the effectiveness of each criterion, we set the ground-truth label between the star- and clique-expansion methods for each dataset depending on which expansion method leads to a shorter preprocessing time of BePI. The reason for labeling the datasets in this manner is that only BePI successfully preprocesses all the datasets, and the preprocessing time of each method is distinctly different, as shown in Fig. 3a. Our method selects the star method if \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})/\texttt {nnz}({\textbf {H}}_{\star }) > 1\); otherwise, it picks the clique method, i.e., the threshold for our suggested criterion is 1 in Fig. 5a. For the other criteria, we select the threshold values that maximize the number of correct selections. Specifically, we set the threshold for density in Fig. 5b to 1.26, the threshold for overlapness in Fig. 5c to 15.41, and the threshold for the average size of hyperedges in Fig. 5d to 3.12.

Figure 5 demonstrates that, when using our proposed criterion, we can choose the better expansion method for 17 (out of 18) datasets, while the number of wrong choices increases when we use the other criteria. This result empirically confirms our hypothesis that the performances of preprocessing methods heavily depend on the number of non-zeros in the preprocessed matrix.

Fig. 6 Correlation between two ratios on a natural logarithmic scale: (1) the ratio of \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\), and (2) the ratio of the costs of the clique- and star-expansion-based methods in terms of a preprocessing time, b space cost, and c query time. BePI is used as the preprocessing technique. We report the Pearson correlation coefficient for each scatter plot

We further analyze the correlation between two ratios: (1) the ratio of \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\), and (2) the ratio of (preprocessing, space, and query) costs of the clique- and star-expansion-based methods, across various datasets when BePI is used. In Fig. 6, we observe a strong positive correlation between the non-zero-count ratio and the cost ratio for each aspect of computation. This indicates that the cost in each aspect is closely related to the number of non-zero entries in the matrix being processed. Additionally, the positioning of data points in each plot demonstrates the effectiveness of our suggested method selection. Data points located in the lower left area indicate the correct selection of the clique method, while those in the upper right area indicate the correct selection of the star method. This demonstrates that our suggested selection approach is effective for most of the tested datasets.

Fig. 7 Anomaly detection performance on hypergraphs in terms of AUROC and MAP. RWRs using weights, especially EDNW, are more accurate than the other methods, including RWR on unweighted clique-expanded graphs

5.6 Application to anomaly detection

In this section, we evaluate the effectiveness of RWR scores in detecting anomalies on hypergraphs. Refer to Sect. 4 for a detailed procedure on how we utilize RWR scores for anomaly detection.

Settings We conduct this experiment on the EEN (email-Enron), SB (senate-bills), and HB (house-bills) datasets, which are small enough for all compared methods to terminate. As the original datasets do not contain anomalous hyperedges, we synthetically generate them by injecting unexpected hyperedges, following (Lee et al. 2022). Specifically, we randomly select a hyperedge \(e \in E\) and create a new hyperedge by replacing half of the nodes in e with random nodes. We repeat this process until t hyperedges are generated.
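A sketch of this injection scheme follows; the rounding of "half" and the handling of duplicate nodes are our assumptions, since the text does not pin them down.

```python
import random

def inject_anomalies(hyperedges, n, t, seed=0):
    """Create t unexpected hyperedges: sample an existing hyperedge and
    replace half of its nodes with nodes drawn uniformly at random."""
    rng = random.Random(seed)
    injected = []
    while len(injected) < t:
        e = list(rng.choice(hyperedges))
        rng.shuffle(e)
        kept = set(e[: len(e) - len(e) // 2])  # keep roughly half of e
        while len(kept) < len(e):              # fill up with random nodes
            kept.add(rng.randrange(n))
        injected.append(frozenset(kept))
    return injected
```

Each injected hyperedge has the same size as the sampled one, so anomalies differ in membership rather than size.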

We consider three models of RWR on hypergraphs: 1) RWR using EDNW, 2) RWR using EINW, and 3) naive RWR. Note that the datasets used in this experiment do not contain explicit weights. For EDNW, we make the weights edge-dependent by setting \(\gamma _e(v) = \bar{d}(v)^{-\beta }\), where \(\bar{d}(v)\) denotes the unweighted degree of node v. That is, we adopt the principle that, since high-degree nodes are present in multiple hyperedges, they are likely to have a reduced impact within each hyperedge. We set \(\beta =0.5\) and \(\omega (e) = 1\) (refer to Appendix C for the selection of \(\beta =0.5\)). In the case of EINW (edge-independent node weights), we set \(\gamma _e(v) = 1\) and \(\omega (e) = 1\). To check the effect of these weights, we further compare naive RWR, which computes RWR scores on the unweighted clique-expanded graph of the input hypergraph. We vary the restart probability c from 0.1 to 0.9 in steps of 0.1.
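The edge-dependent weighting described above can be sketched as follows, with \(\beta = 0.5\) as in the experiments; the function name and the list-of-dicts output format are ours.

```python
def ednw_weights(edges, beta=0.5):
    """gamma_e(v) = deg(v) ** (-beta): down-weight high-degree nodes
    within each hyperedge. The unweighted degree deg(v) counts the
    hyperedges containing v."""
    deg = {}
    for e in edges:
        for v in e:
            deg[v] = deg.get(v, 0) + 1
    return [{v: deg[v] ** (-beta) for v in e} for e in edges]
```

A node appearing in a single hyperedge keeps weight 1, while a node shared by two hyperedges is discounted to \(2^{-0.5}\) in each of them.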

We compare these RWR models with LSH-A and HashNWalk, anomaly detection methods for hypergraphs. While they are originally designed for hypergraphs with timestamps, the datasets used here are static; thus, when testing them, we assign a random timestamp to each hyperedge (specifically, we randomly order the hyperedges and use their orders as timestamps).

Results Fig. 7 demonstrates the performance of each model for detecting the anomalous (unexpected) hyperedges in terms of AUROC and MAP. As shown in the figure, RWR using EDNW performs best, implying that it is beneficial to utilize the edge-dependent node weights. Even RWR using EINW outperforms HashNWalk and LSH-A. Naive RWR performs worst as the weights of nodes and hyperedges are all disregarded.

6 Conclusion and future directions

In this work, we consider random walk with restart (RWR) on hypergraphs after formally defining it (Definition 1). Then, we propose ARCHER (Algorithm 3) for its rapid and space-efficient computation. ARCHER is composed of two RWR computation methods (Algorithms 1 and 2) that are based on clique- and star-expanded graphs, respectively, of the input hypergraph. Since their relative performance heavily depends on datasets, ARCHER is equipped with a lightweight automatic method for selecting one between them. Using 18 real-world hypergraphs, we substantiate the speed and space efficiency of ARCHER (Figs. 3 and 4), revealing that these qualities are attributed to the complementary nature of the two RWR computation methods and the accuracy of the automatic selection method (Fig. 5). In addition, we introduce anomaly detection as an application of RWR on hypergraphs and show the empirical effectiveness of RWR on it (Fig. 7).

As potential directions for future work, we intend to extend our framework to incorporate approximate RWR computation algorithms, as discussed in Sect. 2.2. Furthermore, the present automatic selection algorithm in Sect. 3.4 is grounded in empirical observations; we plan to develop theoretically grounded yet efficient selection algorithms.