1 Introduction

Given a pair of nodes on a large-scale hypergraph, how can we rapidly and efficiently calculate their proximity based on random walk with restart? How useful are such proximities for data-mining applications?

A hypergraph is a data structure consisting of a set of nodes and a set of hyperedges, and each hyperedge is a set composed of any number of nodes. Note that a hypergraph where each hyperedge joins any number of nodes is a generalization of a (pairwise) graph where each edge always joins two nodes. Due to this increased expressiveness, hypergraphs are widely used to model real-world group relations, including (two or more) researchers who co-author a paper, items purchased together, and tags co-appearing in a post (Benson et al. 2018; Do et al. 2020; Lee et al. 2023; Comrie and Kleinberg 2021).

Random walk, a concept widely used for graphs, naturally applies to hypergraphs. Random walk is a stochastic process that assumes an imaginary surfer moving randomly between nodes in a graph. For instance, PageRank (Brin and Page 1998) utilizes this concept to model a random surfer navigating a web graph (i.e., a network of hyperlinks between web pages) and quantifies the importance of web pages based on the stationary distribution of the surfer’s visits. A simple hypergraph extension (Zhou et al. 2006) assumes a random surfer that repeats (a) choosing an incident hyperedge with probability proportional to edge weights and (b) choosing an incident node uniformly at random. However, its expressiveness is limited in that such a random walk can always be reduced to random walk on an undirected graph with some choice of weights (Chitra and Raphael 2019). Chitra and Raphael (2019) proposed a more expressive model that may not be reducible to random walk on any undirected graph. It assumes a random surfer who uses edge-dependent node weights (EDNW) when choosing incident nodes, and due to its expressiveness, it has been employed for clustering (Hayashi et al. 2020), product-return prediction (Li et al. 2018), object classification (Zhang et al. 2018b), anomaly detection (Lee et al. 2022), etc.

On graphs, the concept of random walk with restart (RWR) (Tong et al. 2007b) has also been widely used. RWR measures the stationary probability distribution of random walk when we assume a random surfer who restarts at a query node with a certain probability, and the distribution is naturally interpreted as the relevance of each node with respect to the query node. Due to its ability to consider multi-faceted relationships between nodes, RWR has been extensively utilized in graph mining applications including personalized ranking (Tong et al. 2007b; Jung et al. 2016), anomaly detection (Sun et al. 2005), subgraph mining (Tong et al. 2007a), graph neural networks (Gasteiger et al. 2019a), and graph augmentation (Gasteiger et al. 2019b; Lee and Jung 2023). Since RWR is typically required to be computed separately for a large number of query nodes, or even for all nodes, fast computation is indispensable. Therefore, many computation methods have been developed (Wang et al. 2019a; Hou et al. 2021), and many of them rely on preprocessing the input graph (Tong et al. 2007a; Fujiwara et al. 2012; Shin et al. 2015; Jung et al. 2017).

However, RWR on hypergraphs has been underexplored, even though an extension of RWR to hypergraphs is straightforward and has many potential applications. One potential reason is the lack of fast and scalable methods for computing it. For example, previous work (Chitra and Raphael 2019) relied on naive power iteration, which becomes impractical when dealing with a large number of query nodes on large-scale hypergraphs. Since random walk on a hypergraph consists of two-step transitions (i.e., from node to hyperedge, and from hyperedge to node), we cannot directly employ existing fast and scalable computation methods for RWR on a graph, whose transition is simply from node to node. As on graphs, RWR scores on hypergraphs vary across query nodes, and computing them for a large number of query nodes is necessary in applications. Thus, dedicated efforts are needed to develop fast computation methods that offer a low cost per query node, even at the expense of a one-time preprocessing cost.

In this work, we propose ARCHER (Adaptive RWR Computation on Hypergraphs), a fast and space-efficient framework for computing RWR on real-world hypergraphs. After formally defining RWR on hypergraphs, we develop two computation methods for it based on two different (simplified) representations of hypergraphs. These two computation methods are complementary, and they offer relative advantages on different hypergraphs. Thus, we further propose an automatic selection method that chooses one computation method based on a simple but effective goodness criterion whose computation takes a very short time compared to the total running time.

Through extensive experiments on 18 real-world hypergraphs, we substantiate the speed and space efficiency of ARCHER and its two key driving factors: (a) the complementarity between the two computation methods composing ARCHER and (b) the accuracy of its automatic selection method. In addition, we demonstrate a successful application of RWR on hypergraphs and ARCHER to anomaly detection.

Reproducibility The source code and datasets used in this paper are available at https://github.com/jaewan01/ARCHER.

The rest of the paper is organized as follows. In Sect. 2, we introduce preliminaries and related work. In Sect. 3, we describe our approaches for computing RWR on hypergraphs. In Sect. 4, we introduce an application of RWR on hypergraphs for the purpose of anomaly detection. After sharing experimental results in Sect. 5, we present conclusions and future directions in Sect. 6.

2 Preliminaries and related work

In this section, we introduce some preliminaries and related studies on random walks on (hyper-)graphs and their applications.

2.1 Notations

We describe the basic notation frequently used in this paper; the related symbols are summarized in Table 1.

(Hyper)graph A hypergraph \({G_H=(\mathcal {V}, \mathcal {E}, \omega , \gamma )}\) consists of a set \(\mathcal {V}\) of nodes, a set \(\mathcal {E}\) of hyperedges, the weight \(\omega (e)\) of hyperedge e, and the weight \(\gamma _{e}(v)\) of node v depending on hyperedge e. Each hyperedge \(e\in \mathcal {E}\) is represented by a non-empty subset of an arbitrary number of nodes, i.e., \(e \in 2^{\mathcal {V}} \setminus \{\emptyset \}\). We let \(n=\vert \mathcal {V} \vert\) and \(m=\vert \mathcal {E} \vert\) be the numbers of nodes and hyperedges, respectively. Similarly, a graph \(G=(V, E, w)\) consists of a set V of nodes, a set E of edges, and edge weights w.

Matrix representation Consider a one-to-one mapping f between \(\mathcal {V}\) and \(\{1,\cdots ,n\}\) and a one-to-one mapping g between \(\mathcal {E}\) and \(\{1,\cdots ,m\}\). For any matrix \({\textbf {X}}\in \mathbb {R}^{n \times m}\), we denote its (f(v), g(e))-th entry \({\textbf {X}}_{f(v)g(e)}\) simply by \({\textbf {X}}_{ve}\). Similarly, for any matrix \({\textbf {Y}}\in \mathbb {R}^{m \times n}\), we let \({\textbf {Y}}_{ev}\) denote \({\textbf {Y}}_{g(e)f(v)}\), and for any \({\textbf {Z}}\in \mathbb {R}^{n \times n}\), we let \({\textbf {Z}}_{vv}\) denote \({\textbf {Z}}_{f(v)f(v)}\). The matrix \({\textbf {W}} \in \mathbb {R}^{n \times m}\) is the hyperedge-weight matrix whose entry \({\textbf {W}}_{ve} = \omega (e)\) if \(v \in e\), and 0 otherwise. The matrix \({\textbf {R}} \in \mathbb {R}^{m \times n}\) is the node-weight matrix whose entry \({\textbf {R}}_{ev} = \gamma _{e}(v)\) if \(v \in e\), and 0 otherwise. In the adjacency matrix \(\textbf{A}\in \mathbb {R}^{\vert V \vert \times \vert V \vert }\) of any graph G, \(\textbf{A}_{uv} = w(e)\) if there is an edge \(e=(u,v)\) between nodes \(u\in V\) and \(v \in V\); otherwise, \(\textbf{A}_{uv} = 0\).

Table 1 Symbols

Hypergraph expansions A hypergraph can be converted into graphs using clique- and star-expansion. Clique expansion (Sun et al. 2008) constructs a graph \(G_{\mathcal {C}}=(\mathcal {V}, E_{\mathcal {C}})\) from \(G_H\) by replacing each original hyperedge with a clique composed of the nodes in the hyperedge, i.e., \(E_{\mathcal {C}}=\{ (u,v) \vert u,v \in e,e\in \mathcal {E}\}\). Notably, the adjacency matrix of \(G_{\mathcal {C}}\) has the same sparsity pattern as \({\textbf {P}}={\textbf {W}}{\textbf {R}}\), as illustrated in Fig. 1.

Star expansion (Zien et al. 1999) constructs a graph \({G_{\star }=(V_{\star }, E_{\star })}\) by aggregating nodes and hyperedges as a new set of nodes (i.e., \(V_{\star }=\mathcal {V}\) \(\cup\) \(\mathcal {E}\)), and edges are created between each pair of incident node and hyperedge (i.e., \(E_{\star }=\{ (v,e) \vert v \in e, v \in \mathcal {V}, e \in \mathcal {E}\}\)). The sparsity pattern of the adjacency matrix of \(G_{\star }\) is the same as \({\textbf {S}} = \bigl ( {\begin{matrix} {\textbf {0}} &{} {\textbf {W}}\\ {\textbf {R}} &{} {\textbf {0}} \end{matrix}}\bigr )\), as illustrated in Fig. 1.
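To make the two expansions concrete, the following sketch builds the adjacency patterns of \(G_{\mathcal {C}}\) and \(G_{\star }\) for a small hypothetical hypergraph (the node/hyperedge sets and all variable names are our illustrative choices, not from the paper):

```python
import numpy as np

# Toy hypergraph: nodes {0,1,2,3}, hyperedges e0 = {0,1,2}, e1 = {2,3}.
hyperedges = [{0, 1, 2}, {2, 3}]
n, m = 4, len(hyperedges)

# Incidence matrix B: B[v, e] = 1 iff node v belongs to hyperedge e.
B = np.zeros((n, m))
for e, members in enumerate(hyperedges):
    for v in members:
        B[v, e] = 1.0

# Clique expansion: nodes u, v are adjacent iff they share some hyperedge;
# its sparsity pattern equals that of P = W R (here, of B B^T).
clique_pattern = (B @ B.T) > 0

# Star expansion: a bipartite graph over nodes and hyperedges; its
# (n + m) x (n + m) adjacency pattern is the block matrix [[0, B], [B^T, 0]].
star_pattern = np.block([[np.zeros((n, n)), B],
                         [B.T, np.zeros((m, m))]]) > 0
```

In this example, nodes 0 and 3 share no hyperedge, so they are adjacent in neither expansion, while node 2 (a member of both hyperedges) is connected to both star nodes \(e_0\) and \(e_1\).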

2.2 Random walk with restart on graphs

We introduce the concept of random walk with restart (RWR) on a graph and existing methods for computing RWR scores.

Concept Given a graph G and a query node s, random walk with restart (RWR) aims to obtain a vector \(\varvec{r}\) of proximities from s to each node on the graph (Tong et al. 2007b). Specifically, it assumes a random surfer that starts from node s and takes one of the following actions at each step:

  • Action (1) Random walk. The surfer randomly moves to one of the neighbors from the current node with probability \(1-c\). The probability of selecting each neighbor is proportional to the edge weight between the current node and the neighbor.

  • Action (2) Restart. The surfer jumps back to the query node s with restart probability c.

The stationary probability of the surfer visiting a node u is denoted by \(\varvec{r}_{u}\). That is, the RWR score vector \(\varvec{r}\) of all nodes w.r.t. s (a.k.a. single-source RWR scores) is the unique solution of the following equation:

$$\begin{aligned} \varvec{r} = (1 - c){\tilde{\textbf{A}}^{\top }}\varvec{r} + c\varvec{q}, \end{aligned}$$
(1)

where \(\tilde{{\textbf {A}}}\) is the row-normalized adjacency matrix of G, and c is called restart probability. An RWR query is denoted by \(\varvec{q} \in \mathbb {R}^{n}\), which is a unit vector whose s-th entry is 1. The resulting RWR score vector for the query \(\varvec{q}\) is denoted by \(\varvec{r} \in \mathbb {R}^{n}\). The choice of the query node s determines a specific RWR query \(\varvec{q}\), leading to a distinct RWR score vector \(\varvec{r}\). Note that the surfer often goes back to the query node s with probability c, and thus the proximities are spatially localized around s (Nassar et al. 2015), i.e., scores of nodes tightly connected to s are high, while those of distant nodes are low.

Fig. 1 Clique- and star-expansions of an example hypergraph and their sparsity patterns

In the following paragraphs, we introduce several existing methods for exact single-source RWR calculation on graphs, with a focus on iterative methods and preprocessing methods. Note that there also exist approximate methods (Tong et al. 2007b; Wu et al. 2021; Lin et al. 2020; Wang et al. 2019b), and those for identifying only the top-k nodes with the highest scores (Hou et al. 2021; Wei et al. 2018; Wang et al. 2017).

Iterative methods This approach repeatedly updates RWR scores from the initial ones until convergence. Among various methods, power iteration has been widely utilized due to its simplicity, which is described as follows:

  • Power iteration: Page et al. (1999) utilized the power iteration method that repeats updating \(\varvec{r}\) based on the following equation:

    $$\begin{aligned} \varvec{r}^{(i)} \leftarrow (1 - c)\tilde{\textbf{A}}^{\top }\varvec{r}^{(i-1)} + c\varvec{q}, \end{aligned}$$
    (2)

    where \(\varvec{r}^{(i)}\) denotes \(\varvec{r}\) at the i-th iteration. It is repeated until \(\varvec{r}\) converges. If \(0<c<1\), \(\varvec{r}\) is guaranteed to converge to a unique solution (Langville and Meyer 2006).

Although the iterative approach does not require any computational cost for preprocessing, it exhibits expensive query processing cost (i.e., computational cost per RWR query \(\varvec{q}\)) due to the repeated matrix–vector calculation for each query.
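The update of Eq. (2) can be sketched as follows; this is a minimal dense-matrix illustration with our own variable names, not the paper's implementation:

```python
import numpy as np

def rwr_power_iteration(A, s, c=0.15, tol=1e-10, max_iter=1000):
    """Single-source RWR scores via the power iteration of Eq. (2)."""
    n = A.shape[0]
    A_tilde = A / A.sum(axis=1, keepdims=True)  # row-normalized adjacency
    q = np.zeros(n)
    q[s] = 1.0                                  # unit query vector for node s
    r = q.copy()
    for _ in range(max_iter):
        r_next = (1 - c) * A_tilde.T @ r + c * q
        if np.abs(r_next - r).sum() < tol:      # L1 convergence check
            return r_next
        r = r_next
    return r

# 4-node path graph; scores are localized around the query node s = 0:
# the vector sums to 1 and the farthest node gets the lowest score.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
r = rwr_power_iteration(A, s=0)
```

Each iteration costs one sparse matrix-vector product, and the whole loop must be repeated from scratch for every query node, which is exactly the per-query cost the preprocessing methods below avoid.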

Preprocessing methods This approach aims to quickly calculate \(\varvec{r}\) for a given query node s based on preprocessed results. From Eq. (1), we can represent the problem as solving the following linear system:

$$\begin{aligned} \left( {\textbf {I}}_{n} - (1 - c)\tilde{\textbf{A}}^{\top }\right) \varvec{r} = c\varvec{q} \quad \Leftrightarrow \quad {\textbf {H}}\varvec{r} = c\varvec{q} \quad \Leftrightarrow \quad \varvec{r} = c{\textbf {H}}^{-1}\varvec{q}, \end{aligned}$$
(3)

where \({\textbf {I}}_{n}\) is an identity matrix of size n, \({\textbf {H}} = {\textbf {I}}_{n} - (1 - c){\tilde{\textbf{A}}^{\top }} \in \mathbb {R}^{n \times n}\) is called the random-walk normalized Laplacian matrix with probability \(1-c\), and each column of \({\textbf {H}}^{-1}\) contains the RWR scores w.r.t. the corresponding query node. Note that the inverse of \({\textbf {H}}\) always exists since its transpose is a strictly diagonally dominant matrix (Horn and Johnson 2012). However, precomputing \({\textbf {H}}^{-1}\) for large graphs is impractical due to its expensive computational costs (specifically, it requires \(O(n^3)\) time and \(O(n^2)\) space). To overcome these issues, preprocessing approaches focus on precomputing intermediate sub-matrices related to \({\textbf {H}}^{-1}\) and computing RWR scores rapidly based on them. These preprocessed matrices are computed only once and can be reused for multiple query nodes, reducing the computational cost per query.
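The preprocess-once, query-many idea can be illustrated with a sparse LU factorization of \({\textbf {H}}\): the factorization is computed a single time and each subsequent query is answered by two cheap triangular solves. This is a simplified stand-in for the sub-matrix-based methods BEAR and BePI, not a description of them:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

c = 0.15
A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
n = A.shape[0]

# Row-normalize: A~ = D^{-1} A.
deg = np.asarray(A.sum(axis=1)).ravel()
A_tilde = sp.diags(1.0 / deg) @ A

# One-time preprocessing: factorize H = I_n - (1 - c) A~^T.
H = sp.identity(n, format="csc") - (1 - c) * A_tilde.T.tocsc()
lu = spla.splu(H)            # reused for every query node

def rwr_query(s):
    """Per-query cost: building q plus two triangular solves."""
    q = np.zeros(n)
    q[s] = 1.0
    return lu.solve(c * q)   # r = c H^{-1} q, per Eq. (3)

r = rwr_query(0)
```

The factorization plays the role of the preprocessed matrices: its cost is paid once, after which each of potentially many query nodes is served without re-running any iteration.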

As described later, preprocessing approaches designed for graphs can also be utilized to accelerate RWR computation on hypergraphs within our proposed framework ARCHER. In this paper, we consider the following state-of-the-art methods for computing RWR on graphs, while our framework can be used with any preprocessing-based approaches, such as (Tong et al. 2007a; Fujiwara et al. 2012):

  • BEAR: Shin et al. (2015) developed BEAR, a block elimination approach that efficiently preprocesses sub-matrices related to \({\textbf {H}}^{-1}\). For that, they utilized a node reordering technique called SlashBurn (Kang and Faloutsos 2011), which exploits the hub-and-spoke structure, to reorder and partition the matrix \({\textbf {H}}\). After that, they applied block elimination (Boyd et al. 2004) to the partitioned sub-matrices for computing \(\varvec{r}\).

  • BePI: Jung et al. (2017) proposed BePI, a scalable and memory-efficient method for computing \(\varvec{r}\). Although BEAR achieves a fast speed for computing an RWR query, its scalability for larger graphs is limited due to the high cost of the inversion of a sub-matrix inside the block elimination. To resolve the issue, they first utilized SlashBurn to reorder the matrix and then incorporated an iterative approach into the block elimination by replacing the sub-matrix inversion with an iterative linear solver.

Note that these two methods have distinct advantages. BePI is more space-efficient and thus can be applied to larger graphs, while BEAR processes each RWR query faster on small datasets. Our experimental results show that the same distinct advantages are observed also in RWR computation on hypergraphs (see Figs. 3 and 4).

Other methods to address the substantial cost of the computation of \({\textbf {H}}^{-1}\) include employing graph sparsification (i.e., reducing non-zeros of \({\textbf {H}}\)). For example, Zhang et al. (2018a) developed a spectral sparsification method for directed graphs, demonstrating a strong correlation between the RWR scores computed from the sparsified graph and those obtained from the original graph. Another approach involves approximating \({\textbf {H}}\) as an Eulerian Laplacian matrix, followed by the application of a directed Laplacian system-solving algorithm. From the obtained values, approximated values of the original solution can be derived in nearly linear time (Cohen et al. 2018, 2016). These methods differ from BEAR and BePI in that they yield an approximate solution by transforming \({\textbf {H}}\) into a computationally efficient form. We plan to explore the incorporation of such approximate computation algorithms into our framework as a part of our future research directions.

Applications RWR has been extensively utilized in diverse graph mining tasks based on node-to-node similarities on graphs. Sun et al. (2005) designed normality scores based on RWR to detect abnormal nodes in a bipartite graph. Tong et al. (2007a) used RWR to measure the goodness of a match between a query graph and a subgraph. Zhu et al. (2013) employed RWR for measuring the relevance between a query image and the data images. Jung et al. (2016, 2019) extended RWR to signed RWR in order to calculate personalized ranking scores in a signed graph. Gasteiger et al. (2019a) incorporated RWR into graph neural networks (GNNs) to prevent aggregated embeddings from being over-smoothed. The RWR score matrix has been used for augmenting static graphs (Gasteiger et al. 2019b) and dynamic graphs (Lee and Jung 2023) to improve the performance of GNNs.

2.3 Random walk on hypergraphs

In this section, we introduce several previous random walk models and other related studies on hypergraphs.

Random walk models on hypergraphs A typical random walk on a hypergraph (Zhou et al. 2006) repeats (a) selecting an incident hyperedge with probability proportional to edge weights and (b) selecting an incident node uniformly at random. Chitra and Raphael (2019) extended the concept of random walk to hypergraphs with edge-dependent node (i.e., vertex) weights (EDNW). Given a hypergraph \(G_H=(\mathcal {V}, \mathcal {E}, \omega , \gamma )\) where \(\gamma _{e}(v)\) is the weight of node v depending on edge e, the random walk on \(G_{H}\) is defined as follows:

  • Action (1-1) For the current node u, the surfer selects a hyperedge e containing node u with probability proportional to \(\omega (e)\).

  • Action (1-2) The surfer moves to node v selected from one of the nodes in the hyperedge e with probability proportional to \(\gamma _{e}(v)\).

In the above model, we set \(\gamma _{e}(u) = 0\) if \(u \notin e\). If each node has the same node weight for all of its incident hyperedges (i.e., \(\gamma _e(v) = \gamma _{e'}(v)\), \(\forall e\ne e'\in \mathcal {E}\)), it is called a hypergraph with edge-independent node weights (EINW). As described in (Chitra and Raphael 2019), a random walk on a hypergraph with EINW is equivalent to that on an undirected clique-expanded graph from the hypergraph with some choice of weights; thus, its expressiveness is limited. On the other hand, a random walk on a hypergraph with EDNW is more expressive in that it may not be equivalent to that on any undirected clique-expanded graph.

It should be noted that even when random walk (with restart) on a hypergraph can be reduced to that on a graph, computing the equivalent random walk (with restart) on the graph may not be computationally optimal. Thus, regardless of this (in)equivalence, it is useful to develop fast computation methods for random walk (with restart) on hypergraphs.

Applications Hayashi et al. (2020) employed the random walk to devise a flexible framework for clustering hypergraph data. Li et al. (2018) proposed a local graph cut algorithm using the random walk for product-return prediction on a hypergraph. Zhang et al. (2018b) utilized the random walk for dynamic hypergraph structure learning. Lee et al. (2022) proposed HashNWalk which exploits the concept of random walk for detecting anomalous hyperedges in hyperedge streams. Note that these works utilized the concept of random walks on hypergraphs with EDNW, but they did not incorporate the concept of restart. Chitra and Raphael (2019) conducted a theoretical analysis of random walks on hypergraphs with EDNW. The authors briefly mentioned that extending this concept to incorporate restart is straightforward, but they did not provide further details. In their work, they utilized RWR for ranking problems on hypergraphs by using naive power iteration, which becomes impractical when dealing with many query nodes on large-scale hypergraphs.

Fig. 2 Overview of ARCHER, which consists of (1) a star-expansion-based RWR computation method, (2) a clique-expansion-based RWR computation method, and (3) a preprocessing (including automatic selection) method

3 Proposed framework

In this section, we propose ARCHER (Adaptive RWR Computation on Hypergraphs), a novel framework for rapid and space-efficient computation of random walk with restart (RWR) scores on a hypergraph. As depicted in Fig. 2, ARCHER consists of three components: (a) star-expansion-based computation methods, (b) clique-expansion-based computation methods, and (c) an automatic selection method. In ARCHER, for a given hypergraph \(G_{H}\), one of the star-expansion-based and clique-expansion-based computation methods is automatically selected based on the number of non-zeros in their resulting matrices (Component 3 in Sect. 3.4). Then, to leverage preprocessing techniques for fast RWR computation on graphs (e.g., BEAR and BePI), the RWR problem on the hypergraph is converted into that on the star-expanded graph (Component 1 in Sect. 3.2) or that on the clique-expanded graph (Component 2 in Sect. 3.3). After that, the RWR scores with respect to (potentially a large number of) query nodes are computed rapidly by employing a preprocessing-based approach. Note that our framework, ARCHER, can be equipped with any preprocessing-based approach for RWR computation on graphs.
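The overview above states that the selection compares the numbers of non-zeros of the resulting matrices. A hypothetical sketch of such a comparison follows; the function name and the exact decision rule are our illustrative assumptions, and the precise criterion is the one given in Sect. 3.4:

```python
import numpy as np
import scipy.sparse as sp

def choose_method(B):
    """Pick the expansion whose transition matrix has fewer non-zeros.

    B : (n, m) binary node-hyperedge incidence matrix (sparse).
    Illustrative rule only; ARCHER's actual criterion is in Sect. 3.4.
    """
    nnz_star = 2 * B.nnz          # S stacks W and R (one copy of B each)
    nnz_clique = (B @ B.T).nnz    # sparsity pattern of P = W R
    return "star" if nnz_star < nnz_clique else "clique"

# One giant hyperedge over 100 nodes: the clique expansion densifies.
n, m = 100, 1
B_large = sp.csr_matrix(np.ones((n, m)))

# 99 hyperedges of size 2 (a plain path graph): the clique stays sparse.
i = np.concatenate([np.arange(99), np.arange(1, 100)])  # edge endpoints
j = np.tile(np.arange(99), 2)                           # edge ids
B_pair = sp.csr_matrix((np.ones(198), (i, j)), shape=(100, 99))
```

Intuitively, large hyperedges blow up the clique expansion quadratically, favoring the star expansion, while many small hyperedges favor the clique expansion, which is why neither method dominates across hypergraphs.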

3.1 Random walk with restart on hypergraphs

First of all, we formally describe the random walk with restart (RWR) model on hypergraphs as follows:

Definition 1

(RWR on a hypergraph) Given a hypergraph \(G_H=(\mathcal {V}, \mathcal {E}, \omega , \gamma )\) and a query node s, a random surfer starts from node s. Then, the surfer takes one of the following actions at each step:

  • Action (1) Random walk. The surfer performs the following random walk on \(G_H\) with probability \(1-c\).

    • Action (1-1) For the current node u, the random surfer selects a hyperedge e containing node u with probability proportional to \(\omega (e)\).

    • Action (1-2) The surfer moves to node v selected from one of the nodes in the hyperedge e with probability proportional to \(\gamma _{e}(v)\).

  • Action (2) Restart. The surfer jumps back to the query node s with restart probability c.

The stationary probability of the surfer visiting a node u is denoted by \(\varvec{r}_{u}\), and the RWR score vector \(\varvec{r} \in \mathbb {R}^{n \times 1}\) of all nodes w.r.t. s in \(G_{H}\) is the unique solution of the linear system:

$$\begin{aligned} \varvec{r} = \underbrace{(1-c)\tilde{\textbf{R}}^{\top }\tilde{\textbf{W}}^{\top }\varvec{r}}_{\text {Random walk}} + \underbrace{c\varvec{q},}_{\text {Restart}} \end{aligned}$$
(4)

where \(0< c < 1\) is the restart probability of a random surfer, \(\varvec{q}\) is the RWR query vector, which is the unit vector whose s-th element is 1, and \((\tilde{\textbf{W}}\tilde{\textbf{R}})^{\top }\) is the transition matrix of the random walk on \(G_{H}\) where \(\tilde{\textbf{W}}\) and \(\tilde{\textbf{R}}\) are defined as follows:

  • (Regarding Action 1-1) \(\tilde{\textbf{W}} = \textbf{D}^{-1}_{\mathcal {V}}{\textbf {W}} \in \mathbb {R}^{n \times m}\) is the row-normalized hyperedge-weight matrix where \(\tilde{\textbf{W}}^{\top }\) indicates the transition from a node to a hyperedge. \({\textbf {W}}\) is the hyperedge-weight matrix, and \({\textbf {D}}_{\mathcal {V}} = \texttt {diag}({\textbf {W}}\varvec{1}_{m})\) is the node degree diagonal matrix where \(\varvec{1}_{m} \in \mathbb {R}^{m \times 1}\) is a column vector of ones.

  • (Regarding Action 1-2) \(\tilde{\textbf{R}} = \textbf{D}^{-1}_{\mathcal {E}}{\textbf {R}} \in \mathbb {R}^{m \times n}\) is the row-normalized node-weight matrix, where \(\tilde{\textbf{R}}^{\top }\) indicates the transition from a hyperedge to a node. \({\textbf {R}}\) is the node-weight matrix, and \({\textbf {D}}_{\mathcal {E}}=\texttt {diag}({\textbf {R}}\varvec{1}_{n})\) is the hyperedge degree diagonal matrix where \(\varvec{1}_{n} \in \mathbb {R}^{n \times 1}\) is a column vector of ones.
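As a baseline, the model of Definition 1 can be simulated directly by iterating Eq. (4), applying the two-step transition as two matrix-vector products per iteration. Below is a minimal sketch on a toy hypergraph; the weights \(\omega\) and \(\gamma\) are our illustrative choices:

```python
import numpy as np

# Toy hypergraph: 4 nodes, 2 hyperedges e0 = {0,1,2}, e1 = {2,3}.
n, m, c = 4, 2, 0.15
W = np.zeros((n, m))                 # W[v, e] = omega(e) if v in e
W[[0, 1, 2], 0] = 1.0                # omega(e0) = 1
W[[2, 3], 1] = 2.0                   # omega(e1) = 2
R = np.zeros((m, n))                 # R[e, v] = gamma_e(v) if v in e
R[0, [0, 1, 2]] = [1.0, 2.0, 1.0]
R[1, [2, 3]] = [3.0, 1.0]            # node 2's weight differs per edge (EDNW)

# Row-normalize: W~ = D_V^{-1} W and R~ = D_E^{-1} R.
W_t = W / W.sum(axis=1, keepdims=True)
R_t = R / R.sum(axis=1, keepdims=True)

q = np.zeros(n)
q[0] = 1.0                           # query node s = 0
r = q.copy()
for _ in range(1000):
    # Eq. (4): node -> hyperedge -> node, as two matvecs.
    r_new = (1 - c) * R_t.T @ (W_t.T @ r) + c * q
    converged = np.abs(r_new - r).sum() < 1e-12
    r = r_new
    if converged:
        break
```

Since \(\tilde{\textbf{W}}\tilde{\textbf{R}}\) is row-stochastic, the iterate remains a probability distribution, and the score of node 3 (reachable only through hyperedge \(e_1\)) stays the lowest for this query.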

Although the RWR score vector \(\varvec{r}\) on \(G_{H}\) can be obtained by repeatedly iterating Eq. (4) based on the power iteration method, such an iterative approach is not satisfactory due to its high computational cost per query node, as discussed in Sect. 2.2. As quickly computing RWR scores for a large number of query nodes is necessary for many applications, in the following sections, we propose two RWR computation methods that provide low cost per query node by preprocessing an input hypergraph, which incurs a one-time cost.

Algorithm 1 Star-expansion-based Method for RWR on Hypergraphs

3.2 Component 1: Star-expansion-based Method

We first propose a star-expansion-based method that computes the RWR scores on the graph \(G_{\star }\) star-expanded from the hypergraph \(G_{H}\). For this purpose, we construct a new transition matrix \(\tilde{\textbf{S}}\) as follows:

$$\begin{aligned} \tilde{\textbf{S}}= \begin{bmatrix} {\textbf {0}} &{} \tilde{\textbf{W}}\\ \tilde{\textbf{R}} &{} {\textbf {0}} \\ \end{bmatrix}, \end{aligned}$$
(5)

where \(\tilde{\textbf{S}} \in \mathbb {R}^{N \times N}\) with \(N=n+m\) is also row-normalized since \(\tilde{\textbf{W}}\) and \(\tilde{\textbf{R}}\) are row-normalized. Note that the sparsity pattern of \(\tilde{\textbf{S}}\) is the same as that of \({\textbf {S}}\), which corresponds to the star-expanded graph \(G_{\star }\) of \(G_{H}\) described in Sect. 2.1 (see the example in Fig. 1).

Our star-expansion-based method aims to calculate RWR scores on the new transition matrix \(\tilde{\textbf{S}}\) through the following equation:

$$\begin{aligned} \varvec{r}_{\star } = (1 - c_{\star })\tilde{\textbf{S}}^{\top }\varvec{r}_{\star } + c_{\star }\varvec{q}_{\star }, \end{aligned}$$
(6)

where \(c_{\star }\) is a modified restart probability from c (i.e., \(c_{\star } = 1 - \sqrt{1-c}\)), \(\varvec{r}_{\star } \in \mathbb {R}^{N \times 1}\) is the RWR score vector on \(\tilde{\textbf{S}}\), and \(\varvec{q}_{\star }\) is a modified query vector from \(\varvec{q}\) defined as follows:

$$\begin{aligned} \varvec{r}_{\star }= \begin{bmatrix} \varvec{r}_{\mathcal {V}} \\ \varvec{r}_{\mathcal {E}} \end{bmatrix} \qquad \text { and } \varvec{q}_{\star } = \begin{bmatrix} \varvec{q} \\ \varvec{0} \end{bmatrix}, \end{aligned}$$

where \(\varvec{r}_{\mathcal {V}}\) and \(\varvec{r}_{\mathcal {E}}\) denote the RWR score vectors on nodes and hyperedges, respectively. Once we obtain \(\varvec{r}_{\mathcal {V}}\), the target RWR score vector \(\varvec{r}\) is easily converted from \(\varvec{r}_{\mathcal {V}}\) according to the following theorem:

Theorem 1

(Star Expansion Equality) Suppose \(\varvec{r}\) is the RWR score vector on a hypergraph \(G_{H}\) in Eq. (4), and \(\varvec{r}_{\mathcal {V}}\) is the sub-vector of \(\varvec{r}_{\star }\), which is the RWR score vector on a star-expanded graph \(G_{\star }\) of \(\tilde{\textbf{S}}\) in Eq. (6). Then, the following equality holds:

$$\begin{aligned} \varvec{r} = \frac{c}{c_{\star }}\varvec{r}_{\mathcal {V}}, \end{aligned}$$

where \(c_{\star } = 1 - \sqrt{1-c}\) is the modified restart probability, which ranges from 0 to 1 if \(0<c<1\).

Proof

We rewrite Eq. (6) using the definitions of \(\tilde{\textbf{S}}\), \(\varvec{r}_{\star }\) and \(\varvec{q}_{\star }\) as follows:

$$\begin{aligned} \begin{bmatrix} \varvec{r}_{\mathcal {V}} \\ \varvec{r}_{\mathcal {E}} \end{bmatrix} =(1-c_{\star }) \begin{bmatrix} {\textbf {0}} &{} \tilde{\textbf{R}}^{\top }\\ \tilde{\textbf{W}}^{\top } &{} {\textbf {0}}\\ \end{bmatrix} \begin{bmatrix} \varvec{r}_{\mathcal {V}} \\ \varvec{r}_{\mathcal {E}} \end{bmatrix} +c_{\star } \begin{bmatrix} \varvec{q} \\ \varvec{0} \end{bmatrix}. \end{aligned}$$

Then, \(\varvec{r}_{\mathcal {V}}\) and \(\varvec{r}_{\mathcal {E}}\) are represented as:

$$\begin{aligned} \varvec{r}_{\mathcal {V}}&= (1-c_{\star })\tilde{\textbf{R}}^{\top }\varvec{r}_{\mathcal {E}} + c_{\star }\varvec{q}, \end{aligned}$$
(7)
$$\begin{aligned} \varvec{r}_{\mathcal {E}}&= (1-c_{\star })\tilde{\textbf{W}}^{\top }\varvec{r}_{\mathcal {V}}. \end{aligned}$$
(8)

By plugging in Eq. (8) into Eq. (7), we obtain the following equation:

$$\begin{aligned} \varvec{r}_{\mathcal {V}}&= (1-c_{\star })^2\tilde{\textbf{R}}^{\top }\tilde{\textbf{W}}^{\top }\varvec{r}_{\mathcal {V}} + c_{\star }\varvec{q} \nonumber \\ \Leftrightarrow \varvec{r}_{\mathcal {V}}&= (1-c_{\star })^2\tilde{\textbf{P}}^{\top }\varvec{r}_{\mathcal {V}} + c_{\star }\varvec{q}, \end{aligned}$$
(9)

where \(\tilde{\textbf{P}} = \tilde{\textbf{W}}\tilde{\textbf{R}}\). Note that \(c_{\star } = 1 - \sqrt{1-c}\) by its definition, satisfying \((1-c_{\star })^2=(1-c)\). Then, Eq. (9) is represented as follows:

$$\begin{aligned} \varvec{r}_{\mathcal {V}}&= (1-c)\tilde{\textbf{P}}^{\top }\varvec{r}_{\mathcal {V}} + c_{\star }\varvec{q} \\ \Leftrightarrow \varvec{r}_{\mathcal {V}}&= c_{\star }\left( {\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}^{\top }\right) ^{-1}\varvec{q} \\ \Leftrightarrow \varvec{r}_{\mathcal {V}}&= c_{\star }{\textbf {H}}_{\mathcal {C}}^{-1}\varvec{q} = \frac{c_{\star }}{c}\varvec{r}, \end{aligned}$$

where \({\textbf {H}}_{\mathcal {C}} = {\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}^{\top }\), and \(\varvec{r} = c{\textbf {H}}_{\mathcal {C}}^{-1}\varvec{q}\) by Eq. (4). This proves the claim \(\varvec{r} = \frac{c}{c_{\star }}\varvec{r}_{\mathcal {V}}\). \(\square\)

Theorem 1 indicates that the RWR score vector \(\varvec{r}\) of Eq. (4) can be obtained by solving the RWR problem on the star-expanded graph in Eq. (6). Since Eq. (6) has the same mathematical form as Eq. (1), we can apply preprocessing-based approaches (e.g., BEAR and BePI), which are based on the following linear system:

$$\begin{aligned} {\textbf {H}}_{\star }\varvec{r}_{\star } = c_{\star }\varvec{q}_{\star }, \end{aligned}$$

where \({\textbf {H}}_{\star } = {\textbf {I}}_{N} - (1-c_{\star })\tilde{\textbf{S}}^{\top } \in \mathbb {R}^{N \times N}\), and \({\textbf {I}}_{N}\) is an identity matrix of size N. Note that \({\textbf {H}}_{\star }\) is invertible, as shown below, and thus the linear system on \({\textbf {H}}_{\star }\) can be solved using preprocessing methods.

Theorem 2

(Invertibility of \(\textbf{H}^{\top }_{\star }\)) If \(0< c < 1\), \({\textbf {H}}_{\star }\) is invertible.

Proof

We first show that \(\textbf{H}^{\top }_{\star }\) is strictly diagonally dominant. Note that \(\textbf{H}^{\top }_{\star } = {\textbf {I}}_{N} - (1-c_{\star })\tilde{\textbf{S}}\) by its definition, and each entry of \(\tilde{\textbf{S}}\) is non-negative. For each row i, \(|\textbf{H}^{\top }_{\star _{ii}} |= 1\) because \(\tilde{\textbf{S}}_{ii} = 0\) as shown in Eq. (5). For non-diagonal entries of the i-th row of \(\textbf{H}^{\top }_{\star }\), \(\sum _{j \ne i}|\textbf{H}^{\top }_{\star _{ij}} |= 1-c_{\star }\) since \(\tilde{\textbf{S}}\) is row-normalized. Thus, the following inequality holds for every row i:

$$\begin{aligned} \sum _{j \ne i}|\textbf{H}^{\top }_{\star _{ij}} |= 1-c_{\star } < 1 = |\textbf{H}^{\top }_{\star _{ii}} |, \end{aligned}$$

where \(0< c_{\star } < 1\) for a given c, indicating \(\textbf{H}^{\top }_{\star }\) is strictly diagonally dominant.

The strict diagonal dominance of \(\textbf{H}^{\top }_{\star }\) implies its invertibility (Horn and Johnson 2012), which, in turn, implies the invertibility of its transposed matrix, \({\textbf {H}}_{\star }\). \(\square\)

Algorithm 1 summarizes the star-expansion-based method for computing the RWR score vector \(\varvec{r}\) w.r.t. a query node s in \(G_{H}\). The algorithm consists of preprocessing and query phases, as it adopts a preprocessing-based approach (e.g., BEAR or BePI). Note that the preprocessing phase is run once, while the query phase is run for each query node. In the preprocessing phase, the method first constructs the transition matrix \(\tilde{\textbf{S}}\) (lines 2 and 3). Then, it computes \({\textbf {H}}_{\star }\) (line 4) and preprocesses it by applying a preprocessing-based approach (line 5), resulting in a set \(\varvec{\Theta }_{\star }\) of preprocessed matrices. Whenever a user submits a query node s, the query phase computes the RWR score vector \(\varvec{r}\) w.r.t. s. It first creates \(\varvec{q}_{\star }\) (line 8) and then computes \(\varvec{r}_{\star }\) (line 9) based on Eq. (6), employing the query phase of the preprocessing method with the preprocessed results \(\varvec{\Theta }_{\star }\). Based on Theorem 1, the algorithm finally computes the target RWR score vector \(\varvec{r}\) (lines 10 and 11).
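To make the preprocess/query split concrete, the sketch below mimics Algorithm 1 with SciPy, using a sparse LU factorization as a stand-in for the preprocessed matrices \(\varvec{\Theta }_{\star }\) produced by BEAR or BePI. The function names, the interface taking \(\tilde{\textbf{S}}\) and \(c_{\star }\) directly, and the omission of the final conversion from \(\varvec{r}_{\star }\) back to \(\varvec{r}\) (Theorem 1) are our simplifying assumptions, not the paper's implementation.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def preprocess_star(S_tilde, c_star):
    """Preprocessing phase: form H_star = I_N - (1 - c_star) * S~^T and
    factorize it once (sparse LU stands in for BEAR/BePI's Theta_star)."""
    N = S_tilde.shape[0]
    H_star = sp.identity(N, format="csc") - (1.0 - c_star) * S_tilde.T
    return spla.splu(sp.csc_matrix(H_star))

def query_star(theta_star, c_star, s, N):
    """Query phase: solve H_star r_star = c_star * q_star for query node s."""
    q_star = np.zeros(N)
    q_star[s] = 1.0
    return theta_star.solve(c_star * q_star)
```

On a toy star graph of one hyperedge over two nodes (N = 3), the returned scores sum to one and favor the query node, as expected for RWR.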

3.3 Component 2: Clique-expansion-based Method

We propose a clique-expansion-based method that computes the RWR scores on the graph \(G_{\mathcal {C}}\) clique-expanded from the hypergraph \(G_{H}\). It explicitly constructs the transition matrix \(\tilde{\textbf{P}}=\tilde{\textbf{W}}\tilde{\textbf{R}}\), with which Eq. (4) becomes

$$\begin{aligned} \varvec{r} = (1 - c)\tilde{\textbf{P}}^{\top }\varvec{r} + c\varvec{q}, \end{aligned}$$
(10)

where the sparsity pattern of \(\tilde{\textbf{P}}\) is the same as that of the adjacency matrix of the graph \(G_{\mathcal {C}}\) clique-expanded from \(G_{H}\), as described in Sect. 2.1 (see the example in Fig. 1).
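As an illustration (not the paper's implementation), the dense construction below builds \(\tilde{\textbf{P}}=\tilde{\textbf{W}}\tilde{\textbf{R}}\) from a list of hyperedges. The defaults \(\gamma _e(v)=1\) and \(\omega (e)=1\) correspond to the unweighted case, and we assume every node belongs to at least one hyperedge.

```python
import numpy as np

def clique_transition(edges, n, gamma=None, omega=None):
    """Build the n x n transition matrix P~ = W~ R~ of the clique-expanded
    graph from a list of hyperedges (iterables of node ids).

    omega[e] is the hyperedge weight and gamma[e][v] the edge-dependent
    node weight (EDNW); both default to 1. Dense arrays are used purely
    for illustration."""
    m = len(edges)
    W = np.zeros((n, m))  # node -> hyperedge step, proportional to omega
    R = np.zeros((m, n))  # hyperedge -> node step, proportional to gamma
    for e, nodes in enumerate(edges):
        for v in nodes:
            W[v, e] = 1.0 if omega is None else omega[e]
            R[e, v] = 1.0 if gamma is None else gamma[e][v]
    W /= W.sum(axis=1, keepdims=True)  # row-normalize; assumes no isolated nodes
    R /= R.sum(axis=1, keepdims=True)
    return W @ R
```

Note that, unlike the star-expansion transition matrix, \(\tilde{\textbf{P}}\) can have non-zero diagonal entries (self-loops), which is why the clique-expansion proof of diagonal dominance must account for \(\tilde{\textbf{P}}_{ii}\).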

Based on Eq. (10), we apply a preprocessing approach to the graph of \(\tilde{\textbf{P}}\) clique-expanded from \(G_{H}\), which solves the following linear system:

$$\begin{aligned} {\textbf {H}}_{\mathcal {C}}\varvec{r} = c\varvec{q}, \end{aligned}$$

where \({\textbf {H}}_{\mathcal {C}} = {\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}^{\top } \in \mathbb {R}^{n \times n}\) and \({\textbf {I}}_{n}\) is the identity matrix of size n. Note that \({\textbf {H}}_{\mathcal {C}}\) is also invertible, which is proven in the following theorem:

Theorem 3

(Invertibility of \({\textbf {H}}_{\mathcal {C}}\)) If \(0 < c < 1\), \({\textbf {H}}_{\mathcal {C}}\) is invertible.

Proof

We first show that \(\textbf{H}^{\top }_{\mathcal {C}}\) is strictly diagonally dominant. By its definition, \(\textbf{H}^{\top }_{\mathcal {C}} = {\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}\), and each entry of \(\tilde{\textbf{P}}\) is non-negative. For each row i, \(|(\textbf{H}^{\top }_{\mathcal {C}})_{ii}| = 1 - (1-c)\tilde{\textbf{P}}_{ii}\). Since \(\tilde{\textbf{P}}\) is row-normalized, \(\sum _{j \ne i} |\tilde{\textbf{P}}_{ij}| = 1 - \tilde{\textbf{P}}_{ii}\). Then, the following inequality holds for every row i:

$$\begin{aligned} \sum _{j \ne i} |(\textbf{H}^{\top }_{\mathcal {C}})_{ij}| = (1-c)(1 - \tilde{\textbf{P}}_{ii}) = (1 - (1-c)\tilde{\textbf{P}}_{ii}) - c < 1 - (1-c)\tilde{\textbf{P}}_{ii} = |(\textbf{H}^{\top }_{\mathcal {C}})_{ii}|, \end{aligned}$$

where \(0 < c < 1\). This indicates that \(\textbf{H}^{\top }_{\mathcal {C}}\) is strictly diagonally dominant.

The strict diagonal dominance of \(\textbf{H}^{\top }_{\mathcal {C}}\) implies its invertibility (Horn and Johnson 2012), which, in turn, implies the invertibility of its transposed matrix, \({\textbf {H}}_{\mathcal {C}}\). \(\square\)
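Both proofs hinge on strict diagonal dominance. The small numerical check below (our own illustration, not part of the paper) confirms the property for \(\textbf{H}^{\top }_{\mathcal {C}} = {\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}\) when \(0 < c < 1\), and shows that it fails at the boundary \(c = 0\).

```python
import numpy as np

def is_strictly_diag_dominant(A):
    """Row-wise strict diagonal dominance: |A_ii| > sum_{j != i} |A_ij|."""
    A = np.asarray(A)
    diag = np.abs(np.diag(A))
    off = np.abs(A).sum(axis=1) - diag
    return bool(np.all(diag > off))

# A row-stochastic P~ with non-zero diagonal entries, as in a clique expansion.
P = np.array([[0.5, 0.5, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 0.5, 0.5]])
```

With restart probability c = 0.1, every row's diagonal entry strictly exceeds the off-diagonal row sum; with c = 0, the two are equal and strictness is lost, matching the requirement \(0 < c < 1\) in Theorems 2 and 3.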

The clique-expansion-based method is summarized in Algorithm 2, which consists of preprocessing and query phases. In the preprocessing phase, it first explicitly builds the transition matrix \(\tilde{\textbf{P}} \in \mathbb {R}^{n \times n}\) (lines 2 and 3). Then, the algorithm computes the matrix \({\textbf {H}}_{\mathcal {C}}\) (line 4). By applying a preprocessing method, it processes \({\textbf {H}}_{\mathcal {C}}\) and obtains the preprocessed results \(\varvec{\Theta }_{\mathcal {C}}\) (line 5). In the query phase, it creates the RWR query vector \(\varvec{q}\) (line 8) and then computes the RWR score vector \(\varvec{r}\) by querying \(\varvec{q}\) using the preprocessed results \(\varvec{\Theta }_{\mathcal {C}}\) (line 9). The query phase is initiated whenever a user submits a query node.

The time and space complexities of both the clique- and star-expansion-based methods can be directly derived from the complexities of the underlying RWR computation methods (e.g., BEAR (Shin et al. 2015) and BePI (Jung et al. 2017)) and the definitions of clique and star expansions. When employing BePI as the RWR computation method, although the complexities involve many terms related to graph structure, the preprocessing time, space cost, and query time are, empirically, largely influenced by the number of edges after expansion, specifically \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\) in the clique- and star-expansion-based computations, respectively. For detailed empirical results, refer to Appendix E. This empirical tendency is utilized in the subsequent subsection for the automatic selection between the clique- and star-expansion-based methods.

Algorithm 2 Clique-expansion-based Method for RWR on Hypergraphs

3.4 Component 3: Automatic selection method

As described in Sects. 3.2 and 3.3, our clique- and star-expansion-based methods allow for fast RWR computation on the hypergraph \(G_{H}\) by leveraging a preprocessing approach. Interestingly, the preprocessed matrices \({\textbf {H}}_{\star }\) and \({\textbf {H}}_{\mathcal {C}}\) have very different characteristics. For example, \({\textbf {H}}_{\mathcal {C}}\) may have a large number of non-zeros because each hyperedge e is replaced with a clique over all nodes in e, which can harm scalability as the preprocessed results become dense. On the other hand, \({\textbf {H}}_{\star } \in \mathbb {R}^{N \times N}\) can be relatively sparse, but it is of large dimension. Recall that \(N = n + m\), where n and m are the numbers of nodes and hyperedges, respectively. As a result, the relative time and space required to preprocess \({\textbf {H}}_{\star }\) and \({\textbf {H}}_{\mathcal {C}}\) depend heavily on the dataset. For example, if \(n \ll m\), preprocessing the smaller matrix \({\textbf {H}}_{\mathcal {C}} \in \mathbb {R}^{n \times n}\) can be computationally advantageous even though it is dense.

Thus, we further develop an automatic selection method that chooses between the clique- and star-expansion-based methods so that the chosen one brings out the best performance of a preprocessing method. Various statistics of a hypergraph could serve as the selection criterion. Our strategy is to use the number of non-zeros of the matrix to be preprocessed, under the hypothesis that the performance of a preprocessing method is largely determined by the non-zero entries, since such a method exploits the sparsity of the preprocessed matrix for efficiency. We empirically demonstrate the effectiveness of this criterion compared to various other statistics in Sect. 5.5.

Based on this criterion, our framework selects the star-expansion-based method if the following predicate is satisfied:

$$\begin{aligned} \texttt {nnz}({\textbf {H}}_{\mathcal {C}}) > \texttt {nnz}({\textbf {H}}_{\star }) \end{aligned}$$
(11)

where \(\texttt {nnz}(\cdot )\) returns the number of non-zeros of an input matrix. Otherwise, our method selects the clique-expansion-based method. Note that counting \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\) takes a very short time compared to the total running time, which is mostly consumed by non-trivial operations, such as reordering and matrix multiplications, in the preprocessing methods (Shin et al. 2015; Jung et al. 2017). While the empirical computation time is already very small, further optimization allows us to evaluate Eq. (11) in \(O(\sum _{e \in \mathcal {E}}|e |^{2})\) time with O(n) extra space, as explained in Appendix D.
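The selection rule of Eq. (11) can be sketched directly from the sparsity patterns: \(\texttt {nnz}({\textbf {H}}_{\star })\) is the \(N\)-dimensional diagonal plus one non-zero per node-hyperedge incidence in each direction, while \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) is the diagonal of \({\textbf {I}}_{n}\) united with the clique pattern. The naive pair-set version below runs in \(O(\sum _{e \in \mathcal {E}}|e|^{2})\) time but uses more than O(n) extra space; the space-optimized counting of Appendix D is not reproduced here.

```python
def choose_expansion(edges, n):
    """Return which expansion Eq. (11) selects for a hyperedge list.

    nnz(H_star): N = n + m diagonal entries plus 2 * sum(|e|) incidences.
    nnz(H_C): the diagonal of I_n united with the clique sparsity pattern."""
    m = len(edges)
    nnz_star = (n + m) + 2 * sum(len(e) for e in edges)
    pairs = {(v, v) for v in range(n)}  # diagonal of H_C
    for e in edges:
        nodes = list(e)
        for u in nodes:
            for v in nodes:
                pairs.add((u, v))       # clique over the nodes of e
    nnz_clique = len(pairs)
    return "star" if nnz_clique > nnz_star else "clique"
```

A single large hyperedge inflates the clique pattern quadratically and pushes the choice toward the star expansion, while many small hyperedges keep the clique matrix sparse.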

Algorithm 3 ARCHER: Adaptive RWR Computation on Hypergraphs

3.5 Ultimate framework: ARCHER

By putting all of the components together, we develop ARCHER, our ultimate framework for fast computation of RWR on hypergraphs; its procedure is summarized in Algorithm 3. In the preprocessing phase, ARCHER first computes \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\) (lines 2 and 3), which are used to select either the clique- or star-expansion-based method. Based on the criterion in Eq. (11), ARCHER chooses one of the methods and executes the corresponding preprocessing step (lines 4-7) to obtain a set \(\varvec{\Theta }\) of preprocessed matrices. In the query phase, for each query node s, ARCHER performs the query step of the selected method (lines 10-13), using the preprocessed matrices \(\varvec{\Theta }\), to compute the RWR score vector \(\varvec{r}\).

It is important to note that our framework ARCHER can be equipped with any preprocessing-based RWR computation method (e.g., BEAR and BePI), which is used inside Algorithms 1 and 2, and the total time and space complexity of ARCHER depends on the chosen method. Different methods offer distinct advantages, as demonstrated empirically in Sect. 5.

4 Application to anomaly detection

In this section, we present an application of RWR scores on hypergraphs for the purpose of anomaly detection. The empirical effectiveness of this approach is demonstrated in Sect. 5.6.

Given a hypergraph, this application aims to detect anomalous hyperedges that deviate from ordinary group interactions. Inspired by Sun et al. (2005), which uses RWR for anomaly detection on graphs, we measure a normality score of a hyperedge based on relevance scores provided by hypergraph RWR. Our intuition is that if a hyperedge e is normal, the relevance scores between any pair of nodes in e should be high. Specifically, we define the normality score ns(e) as follows:

$$\begin{aligned} ns(e) = \frac{1}{\vert e \vert (\vert e \vert - 1)}\sum _{u \in e}\sum _{v \in e \setminus \{u\}}\varvec{r}_{u \rightarrow v} \end{aligned}$$

where \(\varvec{r}_{u \rightarrow v}\) is the RWR score of the node v w.r.t. the query node u. In other words, ns(e) is the average pairwise relevance score between nodes in the hyperedge e. If the normality score is low, the hyperedge is considered anomalous. Note that a fast computation method such as ARCHER is needed for this task because it requires RWR scores for many query nodes.
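Given precomputed score vectors (e.g., from ARCHER), the normality score can be sketched as follows; the interface `rwr[u]`, standing for the RWR score vector with query node u, is our assumption.

```python
def normality_score(e, rwr):
    """ns(e): average pairwise RWR relevance among the nodes of hyperedge e.

    rwr[u][v] is the RWR score of node v w.r.t. query node u;
    a low ns(e) flags the hyperedge as anomalous."""
    nodes = list(e)
    k = len(nodes)
    total = sum(rwr[u][v] for u in nodes for v in nodes if v != u)
    return total / (k * (k - 1))
```

For a two-node hyperedge, this reduces to the mean of the two directed relevance scores between its members.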

More applications We have also studied another application, the task of node retrieval, which is described in Appendix B.

5 Experiments

In this section, we evaluate the performance of ARCHER and compare it with other baselines for computing RWR on hypergraphs. We aim to answer the following questions from the experiments:

  • Q1. Preprocessing Time (Sect. 5.2). How long do ARCHER and the two computation methods composing it take for preprocessing?

  • Q2. Space Cost (Sect. 5.3). How much memory space do they require for their preprocessed results?

  • Q3. Query Time (Sect. 5.4). How quickly do they process RWR queries?

  • Q4. Automatic Selection Method (Sect. 5.5). How precisely does our automatic selection strategy decide an appropriate computation method for a given hypergraph?

  • Q5. Application to Anomaly Detection (Sect. 5.6). Can we achieve more accurate anomaly detection on hypergraphs using RWR scores, compared to existing approaches?

5.1 Experimental settings

We describe our experimental settings, including machines, methods, datasets, and parameters.

Table 2 Data statistics of real-world hypergraphs

Machines All experiments are conducted on a workstation with AMD Ryzen 9 3900X and 128GB memory.

Methods For the experiments evaluating computational performance, we compare ARCHER with always using a single expansion method (the star- or clique-expansion-based method). ARCHER and the baselines are equipped with BEAR (Shin et al. 2015) or BePI (Jung et al. 2017), state-of-the-art preprocessing methods for RWR on graphs. We also compare ARCHER with the power-iteration methods on the star- and clique-expanded graphs in Eqs. (6) and (10), respectively. We use our MATLAB implementation of the power iterations, and we use the authors' source code for BEAR and BePI, which are also implemented in MATLAB.

For anomaly detection, we consider the following two hypergraph-based anomaly-detection methods as baseline approaches:

  • LSH-A (Ranshous et al. 2017): For each incoming hyperedge, it computes an approximate frequency, i.e., the number of previous hyperedges that are similar to the new one in terms of shared nodes. This frequency is used to measure the unexpectedness of the hyperedge; intuitively, a hyperedge can be considered unexpected if it is novel and significantly different from previous hyperedges (i.e., has low frequency). The approximation is implemented efficiently over hyperedge streams using Locality Sensitive Hashing (LSH) (Rajaraman and Ullman 2011). Specifically, it computes a MinHash signature with \(k_h\) hash functions and performs LSH with b bands, then scores the hyperedge by the similarity between its signature and those of previous hyperedges. LSH-A takes \(O(k_h \vert e \vert + b + b \lceil \frac{mb}{B} \rceil + 1)\) time per hyperedge, and we use our Python implementation of LSH-A.

  • HashNWalk (Lee et al. 2022): In HashNWalk, each hyperedge is hashed into M buckets called supernodes using each of \(k_h\) hash functions. Based on the transition probability (i.e., random walks of length 1), HashNWalk calculates the proximity between supernodes. Whenever a new hyperedge emerges, HashNWalk updates the proximity between the supernodes in it and compares it with the previous proximity to compute the anomaly score of the hyperedge. In scoring, HashNWalk incorporates the hyperparameter \(\alpha\) to control the degree of emphasis placed on recent hyperedges. HashNWalk takes \(O(k_h\vert e \vert + k_h \min (M, \vert e \vert )^2)\) time per hyperedge, and we use the official implementation of HashNWalk in C++.

Datasets We conduct extensive experiments on eighteen real-world hypergraphs  (Benson et al. 2018; Sinha et al. 2015; Yin et al. 2017; Leskovec et al. 2007; Amburg et al. 2020; Chodrow et al. 2021; Fowler 2006a, b; Ni et al. 2019; McAuley and Leskovec 2013; Harper and Konstan 2015), whose statistics are summarized in Table 2. We provide details of each dataset in Appendix A. The source code and datasets used in this paper are available at https://github.com/jaewan01/ARCHER.

Parameters For the experiments of preprocessing and query costs, we set the restart probability c to 0.05, which has been widely used in previous work (Tong et al. 2007b; Shin et al. 2015; Jung et al. 2017). For BEAR, we set the hub selection ratio k of the reordering method to 0.001 as in (Shin et al. 2015). For BePI, we set the hub selection ratio k of the reordering method to 0.2, which is used for large graphs in (Jung et al. 2017) (see Footnote 1 for the usage of k). The error tolerance \(\epsilon\) for the power iteration and BePI is set to \(10^{-9}\). We set the time and memory limits for preprocessing to 12 hours and 128 GB, respectively.

We also perform careful hyperparameter tuning for the aforementioned anomaly detection methods. For LSH-A, we measure performance with different numbers of bands in LSH signatures (\(b \in \{2, 4, 8\}\)) and lengths of LSH signatures (\(l \in \{2, 4, 8\}\)) and report the best result obtained. For HashNWalk, we adopt a setting similar to that described in (Lee et al. 2022). Specifically, we set the hyperparameter \(\alpha\) in the kernel function to 0.98. Additionally, we search for the optimal values of the number of hash functions \(k_h\) and the number of buckets M in the following ranges: (a) \(k_h \in \{10, 15, 20\}\) and \(M \in \{10, 20, 30\}\) for the email-Enron dataset, (b) \(k_h \in \{8, 10, 12\}\) and \(M \in \{60, 80, 100\}\) for the senate-bills dataset, and (c) \(k_h \in \{8, 10, 12\}\) and \(M \in \{100, 150, 200\}\) for the house-bills dataset. We report the best result achieved for each dataset.

Fig. 3 Preprocessing costs for various calculation methods of RWR on hypergraphs in terms of (a) preprocessing time and (b) space usage. For preprocessing times, the error bars indicate ±1 standard deviation. However, in many datasets, the error bars are so small that they are practically invisible. As shown in the figures, using ARCHER requires up to \(137.6\times\) less preprocessing time and \(16.2\times\) less space than using always one expansion method. Results are omitted if the corresponding methods ran out of time (\(>12\) hours) or out of memory (\(>128\) GB) during preprocessing

5.2 Preprocessing time

We evaluate the performance of ARCHER in terms of preprocessing time. BEAR and BePI are used as preprocessing techniques. For each method, we report the average preprocessing time over 10 runs. Note that the iterative methods are excluded from this experiment because they do not require preprocessing. Figure 3a shows the preprocessing time of all tested methods on the 18 real-world hypergraphs.

We first compare ARCHER with BEAR and each expansion-based method with BEAR. ARCHER with BEAR preprocesses hypergraphs up to \(4.2 \times\) faster than applying BEAR to the star-expanded graph (see the results on the SB dataset) and up to \(2.4 \times\) faster than applying BEAR to the clique-expanded graph (see the results on the ML20 dataset). Note that the clique-expansion-based method with BEAR cannot preprocess medium-sized datasets such as WAL and AM because clique expansion produces a too-dense matrix that BEAR cannot handle. For larger datasets such as TW, COG, COD, and THS, all methods with BEAR fail due to their limited scalability. On the other hand, ARCHER with BePI provides better scalability for preprocessing, and it successfully preprocesses all the datasets, showing up to \(137.6 \times\) faster than applying BePI to the clique-expanded graph (see the results on the TW dataset) and up to \(4.5 \times\) faster than applying BePI to the star-expanded graph (see the results on the EEU dataset).

These results imply the complementary nature of the clique-expansion-based methods and the star-expansion-based methods. Specifically, they exhibit relative advantages on different hypergraphs, and these advantages are significant. This underscores the importance of making careful choices between the two methods, as ARCHER does. Our results regarding space cost and query time in the following subsections further reinforce the importance of this selection, although we do not repeat the same discussion within those subsections.

Fig. 4 Query time for various calculation methods of RWR on hypergraphs. The error bars indicate ±1 standard deviation. Using ARCHER takes up to \(218.8\times\) less query time than using always one expansion method. Results are omitted if the corresponding methods ran out of time (\(>12\) hours) or out of memory (\(>128\) GB) during preprocessing

5.3 Space cost

We analyze the space cost of ARCHER with each preprocessing technique compared to that of other methods. The iterative methods are excluded from the comparison because they do not produce preprocessed results beyond the size of the original data. We measure the memory size for storing preprocessed results in MB. Figure 3b shows the space cost of each method. ARCHER with BEAR uses up to \(16.2\times\) less memory than applying BEAR to the clique-expanded graph (see the results on the ML20 dataset) and up to \(9.6\times\) less memory than applying BEAR to the star-expanded graph (see the results on the SB dataset). Furthermore, the space cost of ARCHER with BePI is up to \(12.3 \times\) less than that of applying BePI to the clique-expanded graph (see the results on the TW dataset) and up to \(9.0 \times\) less than that of applying BePI to the star-expanded graph (see the results on the SB dataset).

5.4 Query time

We examine the computational efficiency of ARCHER in processing RWR queries, compared to other baselines, including the power-iteration methods. For every method, we measure the average query processing time over the same 30 query nodes. Figure 4 shows the results on the 18 real-world hypergraphs. ARCHER with BEAR answers RWR queries up to \(16.3 \times\) faster than applying BEAR to the star-expanded graph (see the results on the EEU dataset) and up to \(1.3 \times\) faster than applying BEAR to the clique-expanded graph (see the results on the ML20 dataset). For ARCHER with BePI, its query time is up to \(218.8 \times\) less than that of applying BePI to the star-expanded graph (see the results on the EEU dataset) and up to \(4.0 \times\) less than that of applying BePI to the clique-expanded graph (see the results on the AM dataset). Note that regardless of the preprocessing technique used, ARCHER significantly outperforms both power-iteration methods.

The experimental results regarding preprocessing and query costs imply that we should consider different aspects of the preprocessing methods when choosing one of them. If query cost matters more than preprocessing cost, BEAR is a good choice because its query speed is faster than that of BePI, especially on small datasets, as shown in Fig. 4. On the other hand, BePI is required for scalable RWR computation on larger hypergraphs because BEAR fails to process such hypergraphs, as discussed in Sect. 5.2.

Fig. 5 Effectiveness of various data statistics in selection between the star- and clique-expansion-based methods. BePI is used as the preprocessing approach. Each dashed line indicates the best threshold for the corresponding criterion. Note that, when using our proposed criterion, we can choose the better expansion method for 17 (out of 18) datasets, while the number of wrong choices increases when we use the other criteria

5.5 Automatic selection method

We investigate the effectiveness of the proposed criterion in our automatic selection method in ARCHER, compared to other potential criteria based on major statistics of hypergraphs, including average hyperedge size, density, and overlapness (Lee et al. 2021), which are summarized in Table 2.

To evaluate the effectiveness of each criterion, we set the ground-truth label between the star- and clique-expansion methods for each dataset depending on which expansion method leads to a shorter preprocessing time of BePI. The reason for labeling the datasets in this manner is that only BePI successfully preprocesses all the datasets, and the preprocessing time of each method is distinctly different, as shown in Fig. 3a. Our method selects the star method if \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})/\texttt {nnz}({\textbf {H}}_{\star }) > 1\); otherwise, it picks the clique method, i.e., the threshold for our suggested criterion is 1 in Fig. 5a. For the other criteria, we select the threshold values that maximize the number of correct selections. Specifically, we set the threshold for density in Fig. 5b to 1.26, the threshold for overlapness in Fig. 5c to 15.41, and the threshold for the average size of hyperedges in Fig. 5d to 3.12.

Figure 5 demonstrates that, when using our proposed criterion, we can choose the better expansion method for 17 (out of 18) datasets, while the number of wrong choices increases when we use the other criteria. This result empirically confirms our hypothesis that the performances of preprocessing methods heavily depend on the number of non-zeros in the preprocessed matrix.

Fig. 6 Correlation between two ratios on a natural logarithmic scale: (1) the ratio of \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\), and (2) the ratio of the costs of the clique- and star-expansion-based methods in terms of a preprocessing time, b space cost, and c query time. BePI is used as the preprocessing technique. We report the Pearson correlation coefficient for each scatter plot

We further analyze the correlation between two ratios: (1) the ratio of \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\), and (2) the ratio of (preprocessing, space, and query) costs of the clique- and star-expansion-based methods, across various datasets when BePI is used. In Fig. 6, we observe a strong positive correlation between the non-zero-count ratio and the cost ratio for each aspect of computation. This indicates that the cost in each aspect is closely related to the number of non-zero entries in the matrix being processed. Additionally, the positioning of data points in each plot demonstrates the effectiveness of our suggested method selection. Data points located in the lower left area indicate the correct selection of the clique method, while those in the upper right area indicate the correct selection of the star method. This demonstrates that our suggested selection approach is effective for most of the tested datasets.

Fig. 7 Anomaly detection performance on hypergraphs in terms of AUROC and MAP. RWRs using weights, especially EDNW, are more accurate than the other methods, including RWR on unweighted clique-expanded graphs

5.6 Application to anomaly detection

In this section, we evaluate the effectiveness of RWR scores in detecting anomalies on hypergraphs. Refer to Sect. 4 for a detailed procedure on how we utilize RWR scores for anomaly detection.

Settings We conduct this experiment on the EEN (email-Enron), SB (senate-bills), and HB (house-bills) datasets, which are small enough for all compared methods to terminate. As the original datasets do not contain anomalous hyperedges, we synthetically generate them by injecting unexpected hyperedges, following (Lee et al. 2022). Specifically, we randomly select a hyperedge \(e \in E\) and create a new hyperedge by replacing half of the nodes in e with random nodes. We repeat this process until t hyperedges are generated.
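A sketch of this injection scheme follows; the rounding of "half" and the handling of duplicate nodes are our assumptions, since the text does not pin them down.

```python
import random

def inject_anomalies(hyperedges, n, t, seed=0):
    """Create t unexpected hyperedges: sample an existing hyperedge and
    replace half of its nodes with nodes drawn uniformly at random."""
    rng = random.Random(seed)
    injected = []
    while len(injected) < t:
        e = list(rng.choice(hyperedges))
        rng.shuffle(e)
        kept = set(e[: len(e) - len(e) // 2])  # keep roughly half of e
        while len(kept) < len(e):              # fill up with random nodes
            kept.add(rng.randrange(n))
        injected.append(frozenset(kept))
    return injected
```

Each injected hyperedge has the same size as the sampled one, so anomalies differ in membership rather than size.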

We consider three models of RWR on hypergraphs: 1) RWR using EDNW, 2) RWR using EINW, and 3) naive RWR. Note that the datasets used in this experiment do not contain explicit weights. For EDNW, we make the weights edge-dependent by setting \(\gamma _e(v) = \bar{d}(v)^{-\beta }\), where \(\bar{d}(v)\) denotes the unweighted degree of node v. That is, we adopt the principle that, since high-degree nodes are present in multiple hyperedges, they are likely to have a reduced impact within each hyperedge. We set \(\beta =0.5\) and \(\omega (e) = 1\) (refer to Appendix C for the selection of \(\beta =0.5\)). In the case of EINW (edge-independent node weights), we set \(\gamma _e(v) = 1\) and \(\omega (e) = 1\). To check the effect of these weights, we further compare naive RWR, which computes RWR scores on the unweighted clique-expanded graph of the input hypergraph. We vary the restart probability c from 0.1 to 0.9 in steps of 0.1.
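The edge-dependent weighting described above can be sketched as follows, with \(\beta = 0.5\) as in the experiments; the function name and the list-of-dicts output format are ours.

```python
def ednw_weights(edges, beta=0.5):
    """gamma_e(v) = deg(v) ** (-beta): down-weight high-degree nodes
    within each hyperedge. The unweighted degree deg(v) counts the
    hyperedges containing v."""
    deg = {}
    for e in edges:
        for v in e:
            deg[v] = deg.get(v, 0) + 1
    return [{v: deg[v] ** (-beta) for v in e} for e in edges]
```

A node appearing in a single hyperedge keeps weight 1, while a node shared by two hyperedges is discounted to \(2^{-0.5}\) in each of them.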

We compare these RWR models with LSH-A and HashNWalk, anomaly detection methods for hypergraphs. While they are originally designed for hypergraphs with timestamps, the datasets used here are static; thus, when testing them, we assign a random timestamp to each hyperedge (specifically, we randomly order the hyperedges and use their orders as timestamps).

Results Fig. 7 demonstrates the performance of each model for detecting the anomalous (unexpected) hyperedges in terms of AUROC and MAP. As shown in the figure, RWR using EDNW performs best, implying that it is beneficial to utilize the edge-dependent node weights. Even RWR using EINW outperforms HashNWalk and LSH-A. Naive RWR performs worst as the weights of nodes and hyperedges are all disregarded.

6 Conclusion and future directions

In this work, we consider random walk with restart (RWR) on hypergraphs after formally defining it (Definition 1). Then, we propose ARCHER (Algorithm 3) for its rapid and space-efficient computation. ARCHER is composed of two RWR computation methods (Algorithms 1 and 2) that are based on clique- and star-expanded graphs, respectively, of the input hypergraph. Since their relative performance heavily depends on datasets, ARCHER is equipped with a lightweight automatic method for selecting one between them. Using 18 real-world hypergraphs, we substantiate the speed and space efficiency of ARCHER (Figs. 3 and 4), revealing that these qualities are attributed to the complementary nature of the two RWR computation methods and the accuracy of the automatic selection method (Fig. 5). In addition, we introduce anomaly detection as an application of RWR on hypergraphs and show the empirical effectiveness of RWR on it (Fig. 7).

As potential directions for future work, we intend to extend our framework to incorporate approximate RWR computation algorithms, as discussed in Sect. 2.2. Furthermore, the present automatic selection algorithm in Sect. 3.4 is grounded in empirical observations; we plan to develop theoretically grounded yet efficient selection algorithms.