
1 Introduction

Graphs play a vital role in various disciplines, including social network analysis [12], bioinformatics [48], and computer vision [37]. The advent of Graph Neural Networks (GNNs, [23]) has significantly enhanced the analysis of these structures by capturing complex relationships between nodes in a graph. However, traditional GNNs operate within the borders of Euclidean space, which may not be sufficiently expressive for data with inherent hierarchical or complex structures. To address this, this paper delves into the realm of hyperbolic space, a Riemannian manifold demonstrated to be particularly effective for embedding hierarchical data [16, 18]. We focus on the development of hyperbolic graph neural networks (HGNNs, [6]), which leverage the unique properties of hyperbolic space to enhance the expressiveness of GNN embeddings.

The principal challenge confronted by HGNNs is their architectural design, which primarily consists of combinations of aggregation and transformation within layers. This fusion presents a unique problem, particularly the difficulty of training attention weights and manifold parameters (e.g., the curvature of the hyperbolic manifold) layer-wise in a deeply layered scheme. With these challenges in mind, we pose our initial questions. Q1: Considering that hyperbolic space slows down layer-wise attention and propagation [11, 25], how can we develop a deeply-layered attentive HGNN? Q2: How can high-order information be incorporated to benefit a deeply layered scheme? Q3: Deep GNNs suffer from embedding smoothing; how should node smoothness be measured when there is no established metric for hyperbolic smoothness, and how can over-smoothing be tackled within hyperbolic manifold constraints?

Motivated by the above questions, in this paper we propose to decouple the functions within HGNN layers so as to deal with each of them separately. Unlike traditional decoupling-GNN approaches [15, 35] that aggregate all information from the neighbors, we view information propagation as a distillation process, such that unimportant information is filtered out and significant information is weighted and contributes to the continuous variation of embeddings. More explicitly, by letting the transformation layer manifest as an encoder-decoder scheme, the aggregation layer is re-envisioned to solve a partial differential equation (Neural ODE/PDE, [8]) - essentially, the graph diffusion equation [4] in hyperbolic space, which simulates an infinitely deep HGNN with single-layer parameters. Specifically, in response to Q1, we consider the PDE reformulation and develop Hyperbolic-PDE (HPDE) solvers, which only leverage single-layer parameters. To answer Q2, we formulate the Hyperbolic Graph Diffusion Equation (HGDE), a low-high order vector flow function that can be integrated by HPDE. Tackling Q3, we first introduce a hyperbolic adaptation of Dirichlet energy and augment HGDE with a hyperbolic residual, powered by the Poincaré midpoint. The deconstruction above introduces extensive mathematical machinery, including manifold vector fields, flows, gradients, divergence, diffusivity, numerical HPDE solvers and hyperbolic residuals for bounding embedding energy decay. Through these concepts, we open new pathways to fully exploit the unique potential of hyperbolic space in the contextual analysis of graph-based data. In summary, the contributions of this paper are listed as follows.

(I) We present the geometric intuition for designing projective numerical integration methods that solve hyperbolic ODE/PDE, and examine the connection to Riemannian gradient descent methods. Focusing on fixed-grid solvers, we derive both hyperbolic generalizations of explicit schemes (Euler, Runge-Kutta) and implicit schemes (Adams-Moulton).

(II) We formulate the HGDE, which acts as the vector flow of the HPDE, and thereby induces concepts such as gradient, divergence and diffusivity within HGDE. The proposed framework is flexible and efficient for generating expressive (endowed by the depth) hyperbolic graph embeddings.

(III) We instantiate the diffusivity function as a mixed-order multi-head attention to account for both homophilic (local) and heterophilic (global) relations. Besides, we introduce a hyperbolic residual technique to benefit optimization and prevent over-smoothing.

Through extensive experiments and comparison with the state of the art on multiple real-world datasets, we show that the HGDE framework not only learns node embeddings of quality comparable to Euclidean models on non-hierarchical datasets, but also outperforms all compared hyperbolic model variants on highly-hierarchical datasets with improved efficiency and accuracy. Code is available at https://github.com/ljxw88/HyperbolicGDE.

2 Preliminaries

Riemannian Geometry and Hyperbolic Space. An n-dimensional Riemannian manifold \(\mathcal {M}\) is a topological space associated with a metric tensor g, denoted as \((\mathcal {M}, g)\), which extends curved surfaces to higher dimensions and can be locally approximated by \(\mathbb {R}^n\). At any point \(\textbf{x}\in \mathcal {M}\), the tangent space \(\mathcal {T}_\textbf{x}\mathcal {M}\cong \mathbb {R}^n\), isomorphic to Euclidean space, represents the first-order approximation of a small perturbation around \(\textbf{x}\). The Riemannian metric g on the manifold determines a smoothly varying positive definite inner product on the tangent space, enabling the definition of diverse properties, e.g., geodesic length, angles, and curvature.

The hyperbolic space \(\mathbb {H}^n\) is a smooth Riemannian manifold with constant negative sectional curvature \(\kappa < 0\). Its coordinates can be represented via various isometric models. [3] established the equivalence of hyperbolic and Euclidean geometry through the n-dimensional Poincaré ball model, an open ball \(\mathbb {D}^n_\kappa = (\mathcal {D}^n_\kappa , g^{\mathbb {D}})\) with point set \(\mathcal {D}^n_\kappa = \{\textbf{x} \in \mathbb {R}^n : \Vert \textbf{x}\Vert ^2 < -\frac{1}{\kappa }\}\) and Riemannian metric \(g^{\mathbb {D}}_\textbf{x} = (\lambda _\textbf{x}^\kappa )^2 \textbf{I}_n\), where the conformal factor \(\lambda _\textbf{x}^\kappa = \frac{2}{1+\kappa \Vert \textbf{x}\Vert ^2}\). The Poincaré metric tensor induces various geometric properties, e.g., distances \(d^\kappa _\mathbb {D}(\textbf{x}, \textbf{y})\), inner products \(\langle \textbf{u}, \textbf{v} \rangle _\textbf{x}^\kappa \), geodesics \(\gamma _{\textbf{x}\rightarrow \textbf{y}}(t)\) and more [26]. Geodesics also induce the definition of exponential and logarithmic maps [13]. At a point \(\textbf{x}\in \mathbb {D}^n_\kappa \), the exponential map \(\exp _\textbf{x}^\kappa : \mathcal {T}_\textbf{x}\mathbb {D}_\kappa ^n \rightarrow \mathbb {D}_\kappa ^n\) maps a small perturbation of \(\textbf{x}\) by \(\textbf{v}\in \mathcal {T}_\textbf{x}\mathbb {D}_\kappa ^n\) to \(\exp _\textbf{x}^\kappa (\textbf{v})\in \mathbb {D}_\kappa ^n\), so that for \(t\in [0,1]\), \(\exp _\textbf{x}^\kappa (t\textbf{v})\) is the geodesic from \(\textbf{x}\) to \(\exp _\textbf{x}^\kappa (\textbf{v})\). The logarithmic map \(\log _\textbf{x}^\kappa : \mathbb {D}_\kappa ^n \rightarrow \mathcal {T}_\textbf{x}\mathbb {D}_\kappa ^n\) is defined as the inverse of \(\exp _\textbf{x}^\kappa \). Finally, the parallel transport \(\mathcal{P}\mathcal{T}_{\textbf{x}\rightarrow \textbf{y}}: \mathcal {T}_\textbf{x}\mathbb {D}_\kappa ^n \rightarrow \mathcal {T}_\textbf{y}\mathbb {D}_\kappa ^n\) moves a tangent vector \(\textbf{v}\in \mathcal {T}_\textbf{x}\mathbb {D}_\kappa ^n\) along the geodesic to \(\mathcal {T}_\textbf{y}\mathbb {D}_\kappa ^n\) while preserving the metric tensor. For closed-form expressions of the above operations, please refer to Appendix B.
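For concreteness, the following minimal PyTorch sketch implements these Poincaré-ball operations in one common closed-form parameterization (with \(c=-\kappa >0\)); the function names and numerical clamping are our own choices, and Appendix B remains the authoritative reference for the closed forms used in the paper.

```python
import torch

# A minimal sketch of Poincare-ball primitives (c = -kappa > 0), following the
# standard Moebius-gyrovector closed forms; names and clamping are ours.
EPS = 1e-7

def mobius_add(x, y, c):
    # Moebius addition x (+)_c y.
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    xy = (x * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = (1 + 2 * c * xy + c ** 2 * x2 * y2).clamp_min(EPS)
    return num / den

def lambda_x(x, c):
    # Conformal factor lambda_x^kappa = 2 / (1 + kappa ||x||^2) = 2 / (1 - c ||x||^2).
    return 2.0 / (1 - c * (x * x).sum(-1, keepdim=True)).clamp_min(EPS)

def expmap(x, v, c):
    # exp_x^kappa(v): follow the geodesic from x with initial tangent vector v.
    sqrt_c = c ** 0.5
    vn = v.norm(dim=-1, keepdim=True).clamp_min(EPS)
    gamma = torch.tanh(sqrt_c * lambda_x(x, c) * vn / 2) * v / (sqrt_c * vn)
    return mobius_add(x, gamma, c)

def logmap(x, y, c):
    # log_x^kappa(y): tangent vector at x pointing towards y (inverse of expmap).
    sqrt_c = c ** 0.5
    u = mobius_add(-x, y, c)
    un = u.norm(dim=-1, keepdim=True).clamp_min(EPS)
    return (2.0 / (sqrt_c * lambda_x(x, c))) * torch.atanh((sqrt_c * un).clamp(max=1 - 1e-5)) * u / un

def dist(x, y, c):
    # Geodesic distance d_D^kappa(x, y).
    sqrt_c = c ** 0.5
    un = mobius_add(-x, y, c).norm(dim=-1)
    return (2.0 / sqrt_c) * torch.atanh((sqrt_c * un).clamp(max=1 - 1e-5))
```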

Diffusion Equations. The process of generating representations of individual data points through information flows can be characterized by an anisotropic diffusion process, a concept borrowed from physics used to describe heat diffusion on a Riemannian manifold. Denote the manifold as \(\mathcal {M}\), let z be a family of functions on \(\mathcal {M}\times [0, \infty )\), and let z(u, t) be the density at location \(u\in \mathcal {M}\) and time t. The general framework of diffusion equations is expressed as the PDE

$$\begin{aligned} \partial z (u,t)/\partial t = \textrm{div}( a(z(u, t)) \nabla z(u, t) ), \quad t>0 \end{aligned}$$
(1)

where \(a(\cdot )\) defines the diffusivity function controlling the diffusion strength between any location pair at time t. The gradient operator \(\nabla : \mathcal {M}\rightarrow \mathcal {T}\mathcal {M}\) describes the steepest change at point \(u\in \mathcal {M}\). \(\textrm{div}(\cdot ):\mathcal {T}\mathcal {M}\rightarrow \mathcal {M}\) is the divergence operator that summarizes the flow of the diffusivity-scaled vector field \((a(\cdot )\nabla )\). Equation (1) can be physically viewed as the variation of heat over time at location u, identical to the heat that flows through that point from the surrounding areas.

Graph Diffusion Equation. Let \(\mathcal {G}=(\mathcal {V}, \mathcal {E})\) denote an undirected graph with node set \(\mathcal {V}\) and edge set \(\mathcal {E}\). Let \(\textbf{x} = \{\textbf{x}_i\in \mathbb {R}^d\}_{i=1}^{|\mathcal {V}|}\) be the node features and \(\textbf{z}(t)\) the node embeddings at time t. The process in Eq. (1) can then be re-written as

$$\begin{aligned} \partial \textbf{z}_i (t)/\partial t = \textrm{div}( \textbf{A}(\textbf{z}(t)) \nabla \textbf{z}_i(t) ), \end{aligned}$$
(2)

where \(\textbf{A}\) is generally realised by a time-independent \(n\times n\) attention matrix [4, 5], consistent with the flow of heat flux in and out of node i. The formulation of Eq. (2) as a PDE allows leveraging the vast body of existing numerical integration methods to solve the continuous dynamics.
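As a minimal illustration, if the attention matrix is row-stochastic (an assumption in the spirit of [4]), then \(\textrm{div}(\textbf{A}\nabla \textbf{z})\) reduces to \((\textbf{A}-\textbf{I})\textbf{z}\) and Eq. (2) can be integrated with a simple explicit Euler loop; the sketch below uses names of our own choosing.

```python
import torch

def euclidean_graph_diffusion(z0, attn, tau=0.1, steps=10):
    """Explicit-Euler discretization of Eq. (2) with a fixed attention matrix.

    A sketch only: `attn` is assumed row-stochastic, so div(A grad z) reduces
    to (A - I) z.
    """
    z = z0
    for _ in range(steps):
        z = z + tau * (attn @ z - z)   # dz/dt = (A - I) z
    return z

# Toy usage: 4 nodes on a ring graph, 2 features, uniform attention over neighbors.
adj = torch.tensor([[0., 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
attn = adj / adj.sum(-1, keepdim=True)
z_T = euclidean_graph_diffusion(torch.randn(4, 2), attn)
```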

Fig. 1. (a–c) Illustration of various numerical integration methods with comparison to RGD. In each time-step, an explicit scheme calibrates the vector field within only the tangent space of time t, while an implicit scheme requires multiple tangent spaces to estimate future slopes, thus requiring parallel transport for aligning the directions of vectors in different spaces. (d) Illustration of hyperbolic interpolation method.

3 Hyperbolic Numerical Integrators

Consider the continuous form of ODE/PDE specified by a neural network parameterized by \(\theta \), expressed as

$$\begin{aligned} d \textbf{h}(t)/d t = f_{\theta }(\textbf{h}(t), t), \quad \textbf{h}(0) = \textbf{h}_0 \end{aligned}$$
(3)

where the time \(t \in [0, T]\). Equation (3) states that the rate of change of \(\textbf{h}(t)\in \mathbb {R}^n\) at each time step is given by the vector field \(f_\theta : \mathbb {R}^n\times \mathbb {R} \rightarrow \mathbb {R}^n\); it is integrated over time to obtain \(\textbf{h}(T)\). In our context, we are interested in formulating a PDE recipe that is aware of hyperbolic geometry.

Definition 1

A time-dependent manifold vector field is a mapping \(\mathcal {X}:\mathcal {M}\times \mathbb {R}\rightarrow \mathcal{T}\mathcal{M}\), which assigns each point in \(\mathcal {M}\) at t a tangent vector. The particle’s time-evolution according to \(\mathcal {X}\) is then given by the following PDE

$$\begin{aligned} d \textbf{h}(t)/d t = \mathcal {X}_\theta (\textbf{h}(t), t). \end{aligned}$$
(4)

Definition 2

A vector flow is a mapping generated by a vector field, i.e. \(\mathcal {F}\equiv \pi (\mathcal {X})\), where \(\pi :\mathcal {M}\rightarrow \mathcal {M}\) is a smooth projection of the vector field onto the manifold at its local coordinates. Vice versa, if \(\pi \) is a diffeomorphism, then \(\mathcal {X}\equiv \pi ^{-1}(\mathcal {F})\).

In hyperbolic geometry, where \(\pi \) and \(\pi ^{-1}\) are properly defined \(\exp \) and \(\log \) maps, our concern lies in the particle’s location on the manifold subsequent to integration, i.e. integrating along the path defined by the flow \(\mathcal {F}\). This can be achieved in the spirit of projective methods [17]. In the following, we derive numerical solvers for estimating the integral of the field \(\mathcal {X}\) or flow \(\mathcal {F}\) w.r.t. time t using, respectively, explicit and implicit schemes.

3.1 Hyperbolic Projective Explicit Scheme

In an explicit scheme, the state at the next time step is computed directly from the current state and its derivatives. In this part, we derive the hyperbolic generalization of the explicit scheme. To illustrate the high-level ideas, we introduce both a single-step method and a multi-step method. We also discuss the geometric intuition and the strong analogy between the one-step explicit scheme and Riemannian gradient descent (RGD).

H-Explicit Euler (HEuler). Consider a small time step \(\tau \). Iteratively, we seek an approximation for \(\textbf{h}(t + \tau )\) based on \(\textbf{h}(t)\) and vector field \(f(\cdot )\). In Euclidean space, the explicit Euler method is written as

$$\begin{aligned} \textbf{h}(t+\tau ) \approx \textbf{h}(t) + \tau f_{\theta }(\textbf{h}(t), t), \end{aligned}$$
(5)

which is a discrete version of Eq. (3). Similarly in hyperbolic space, we discretize Eq. (4), and have the stepping function formulated as

$$\begin{aligned} \textbf{h}_{\textrm{HEuler}}(t+\tau ) = \exp ^\kappa _{\textbf{h}(t)} (\tau \mathcal {X}_{\textrm{HEuler}}(t)), \end{aligned}$$
(6)

where the vector field \(\mathcal {X}\) gives the direction at time t according to flow \(\mathcal {F}_\theta ^\kappa \)

$$\begin{aligned} \mathcal {X}_{\textrm{HEuler}}(t) = \log ^\kappa _{\textbf{h}(t)}(\mathcal {F}_{\theta }^\kappa (\textbf{h}(t), t)) \in \mathcal {T}_{\textbf{h}(t)}\mathbb {D}_\kappa ^n. \end{aligned}$$
(7)

Eq. (5) signifies a transition from \(\textbf{h}(t)\) in the direction of f by a distance proportional to \(\tau \). In hyperbolic space, where \(\textbf{h}(t)\in \mathbb {D}^n_{\kappa }\) and we presume \(\mathcal {X}^{\kappa }: \mathbb {D}^n_{\kappa } \rightarrow \mathcal {T}\mathbb {D}^n_{\kappa }\), the transition follows the geodesic dictated by the direction of \(\mathcal {X}^\kappa \). Recall the definition of the exponential map: given \(x\in \mathbb {D}_\kappa \), \(\exp ^\kappa _x(v)\) takes \(v\in \mathcal {T}_x\mathbb {D}_\kappa \) and returns a point in \(\mathbb {D}_\kappa \) reached by moving from x along the geodesic determined by the tangent vector v. Thus Eqs. (6–7) can essentially be viewed as a geometric transportation of points on the manifold along the curve defined by \(\mathcal {F}\).

As visualized in Fig. 1(a), the explicit Euler method can be viewed as reversed RGD, where the direction \(\mathcal {X}_\textrm{HEuler}(t)\) plays a similar role to the Riemannian gradient \(g_t\) at \(\textbf{h}(t)\). Similar to RGD, when \((\mathcal {M}, \rho )\) is the Euclidean space \((\mathbb {R}^n, \textbf{I}_n)\), Eq. (6) converges to Eq. (5), since \(\exp _h^\kappa (v) \xrightarrow []{\kappa \rightarrow 0} h+v\). This property is useful for developing higher-order integrators.
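A minimal sketch of the HEuler stepping in Eqs. (6)–(7) is given below; it assumes the `expmap`/`logmap` Poincaré helpers sketched in Sect. 2 and an arbitrary manifold-valued flow function, both of which are our own illustrative choices.

```python
def heuler_step(h, t, tau, flow, c):
    """One hyperbolic explicit Euler step, Eqs. (6)-(7).

    Sketch reusing the `expmap`/`logmap` Poincare helpers from Sect. 2;
    `flow(h, t)` is a manifold-valued vector flow F_theta^kappa.
    """
    x = logmap(h, flow(h, t), c)      # Eq. (7): direction in the tangent space at h
    return expmap(h, tau * x, c)      # Eq. (6): move along the geodesic

def heuler_integrate(h0, flow, c, T=1.0, tau=0.1):
    # Fixed-grid integration from t = 0 to t = T.
    h, t = h0, 0.0
    while t < T:
        h = heuler_step(h, t, tau, flow, c)
        t += tau
    return h
```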

H-Runge-Kutta (HRK). With a similar geometric intuition, we derive the hyperbolic extension of the Runge-Kutta method. Define the s-order HRK stepping function

$$\begin{aligned} \textbf{h}_{\textrm{HRK}}(t + \tau ) = \exp ^\kappa _{\textbf{h}(t)} (\tau \mathcal {X}_{\textrm{HRK}}(t)), \end{aligned}$$
(8)

where the vector field is estimated by

$$\begin{aligned} &\mathcal {X}_{\textrm{HRK}}(t) = \left( \sum _{i=1}^s \phi _i \log _{\textbf{h}(t)}^\kappa (\textbf{k}_i)\right) / {\sum _{i=1}^s \phi _i}. \end{aligned}$$
(9)

In Eq. (9), the \(\textbf{k}_i\) denote the vector flows and \(\{\phi _i\}\) are coefficients determined by the order. Specifically, for the 4th-order Runge-Kutta (HRK4), we have \(\{\phi _{1\dots 4}\}=\{1, 3, 3, 1\}\), derived from the Taylor series expansion as in [8]. The vector flows \(\textbf{k}_{1\dots 4}\) are respectively formulated by

$$\begin{aligned} & \textbf{k}_1 = \textbf{h}_{\textrm{HEuler}}(t+\tau ),\qquad \text {(Eq. (6))}\\ & \textbf{k}_2 = \mathcal {F}_{\theta }^\kappa (\exp ^\kappa _{\textbf{h}(t)} (\tau \mathcal {X}_{\textbf{k}_2}), t + \tau /3), \nonumber \text { where } \mathcal {X}_{\textbf{k}_2} = \log ^\kappa _{\textbf{h}(t)}(\textbf{k}_1)/3. \nonumber \\ & \textbf{k}_3 = \mathcal {F}_{\theta }^{\kappa }(\exp ^\kappa _{\textbf{h}(t)}(\tau \mathcal {X}_{\textbf{k}_3}), t + 2\tau /3), \text { where } \mathcal {X}_{\textbf{k}_3} = \log ^\kappa _{\textbf{h}(t)}(\textbf{k}_2) - \log ^\kappa _{\textbf{h}(t)}(\textbf{k}_1)/3. \nonumber \\ & \textbf{k}_4 = \mathcal {F}_{\theta }^\kappa (\exp ^\kappa _{\textbf{h}(t)}(\tau \mathcal {X}_{\textbf{k}_4}), t+\tau ), \text { where } \mathcal {X}_{\textbf{k}_4} = \log ^\kappa _{\textbf{h}(t)}(\textbf{k}_1) - \log ^\kappa _{\textbf{h}(t)}(\textbf{k}_2) + \log ^\kappa _{\textbf{h}(t)}(\textbf{k}_3). \nonumber \end{aligned}$$
(10)

As illustrated in Fig. 1(b), this method approximates the solution to the PDE within a small interval, considering not only the derivative at the initial time (as in Eq. (5)), but also at intermediate points and the end of the interval.
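For reference, the following sketch transcribes Eqs. (8)–(10) into code, again assuming the `expmap`/`logmap` helpers from the sketch in Sect. 2; it is an illustrative rendering rather than the exact implementation.

```python
def hrk4_step(h, t, tau, flow, c):
    """One hyperbolic Runge-Kutta step, Eqs. (8)-(10), with phi = {1, 3, 3, 1}.

    Sketch assuming the `expmap`/`logmap` Poincare helpers from Sect. 2 and a
    manifold-valued flow function `flow(h, t)`.
    """
    log_h = lambda y: logmap(h, y, c)
    k1 = expmap(h, tau * log_h(flow(h, t)), c)                 # = h_HEuler(t + tau)
    x2 = log_h(k1) / 3
    k2 = flow(expmap(h, tau * x2, c), t + tau / 3)
    x3 = log_h(k2) - log_h(k1) / 3
    k3 = flow(expmap(h, tau * x3, c), t + 2 * tau / 3)
    x4 = log_h(k1) - log_h(k2) + log_h(k3)
    k4 = flow(expmap(h, tau * x4, c), t + tau)
    phi = (1.0, 3.0, 3.0, 1.0)
    x = sum(p * log_h(k) for p, k in zip(phi, (k1, k2, k3, k4))) / sum(phi)  # Eq. (9)
    return expmap(h, tau * x, c)                               # Eq. (8)
```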

3.2 Hyperbolic Projective Implicit Scheme

In an implicit scheme, the state of the next iteration is computed by incorporating its own value. This requires solving a linear system to obtain \(\textbf{h}(t+\tau )\) based on \(\textbf{h}(t)\). Below, we illustrate a hyperbolic generalization of the implicit solver.

H-Implicit Adams-Moulton (HAM). Adams numerical integration methods are introduced as families of multi-step methods. With order \(s=0\), Adams methods are identical to Euler’s method. Principally, there are two types of Adams methods, namely Adams-Bashforth (explicit) and Adams-Moulton (implicit). Our emphasis is on the latter.

The implicit nature of AM requires initializing the first several steps with a different method. We use the hyperbolic Runge-Kutta (Eq. (8)) for initialization. With the input \(\textbf{h}(t)\in \mathbb {D}_\kappa ^n\) and flow \(\mathcal {F}^\kappa \), define the warm-up

$$\begin{aligned} &\textbf{h}_{\textrm{HAM}}(i\tau ) = \textbf{h}_\textrm{HRK4}(i\tau ), \quad 0\le i < s_\textrm{min} \end{aligned}$$
(11)

where \(s_\textrm{min}\) is the minimum order. During the whole warm-up process, we maintain a queue \(\textbf{q}\) of tangent vectors and the points spanning the tangent spaces. In each time step of Eq. (11), we push \([q_0 = \mathcal {X}_{\textrm{HRK4}}(i\tau ) , q_1=\textbf{h}(i\tau ) ]\) to the head of \(\textbf{q}\). When \(\textrm{len}(\textbf{q}) \ge s_\textrm{min}\), we start the time-stepping

$$\begin{aligned} \textbf{h}_{\textrm{HAM}}(t+\tau ) = \exp ^\kappa _{\textbf{h}(t)} (\tau \mathcal {X}_{\textrm{HAM}}(t)), \end{aligned}$$
(12)

where the vector field is expressed as

$$\begin{aligned} \mathcal {X}_{\textrm{HAM}}(t) = &\phi _0\mathcal{P}\mathcal{T}_{\textbf{h}(t+\tau ) \rightarrow \textbf{h}(t)} ( \log ^\kappa _{\textbf{h}(t+\tau )} ( \mathcal {F}^\kappa _\theta ( \textbf{h}(t+\tau ), t+\tau ) ) )\nonumber \\ &+ \sum _{i=1}^s \phi _i \mathcal{P}\mathcal{T}_{\textbf{q}_{i, 1} \rightarrow \textbf{h}(t)} ( \textbf{q}_{i, 0} ). \end{aligned}$$
(13)

The order is \(s=\min (\textrm{len}(\textbf{q}), s_\textrm{max})\), and \(\{\phi _i\}\) are coefficients determined by the order, typically taken from a pre-defined look-up table. As illustrated in Fig. 1(c), since the reference points \(\textbf{h}(t)\) stored in \(\textbf{q}\) are different, the parallel transport \(\mathcal{P}\mathcal{T}\) is leveraged to align the tangent spaces of the different slopes. When \(\textbf{h}_{\textrm{HAM}}(t+\tau )\) is accepted as converged, \([\log ^\kappa _{\textbf{h}(t)}(\textbf{h}_{\textrm{HAM}}(t+\tau ) ), \textbf{h}(t) ]\) is pushed to \(\textbf{q}\) for the next iteration, and the last element is popped if \(\textrm{len}(\textbf{q})\) reaches \(s_\textrm{max}\). We refer readers to Appendix C for a detailed explanation of the algorithms.

3.3 Interpolation on Curved Space

Fixed-grid PDE solvers typically use their own internal step size \(\tau \) to advance the solution of the PDE. For a certain time step t, given \(\textbf{h}(t)\) and \(\textbf{h}(t+\tau )\), we may want to obtain the solution at time point \(t+\delta \) where \(0<\delta <\tau \). Since \(\delta \) does not lie on the grid defined by \(\{0, \tau \}\), interpolation methods are invoked to estimate \(\textbf{h}(t+\delta )\). For hyperbolic geometry, where \(\textbf{h}\in \mathbb {D}^n_\kappa \), define the interpolation

$$\begin{aligned} \textbf{h}(t+\delta ) = \exp _{\textbf{h}(t)}^\kappa \left( {\delta } \log ^\kappa _{\textbf{h}(t)}(\textbf{h}(t+\tau )) / {\tau }\right) . \end{aligned}$$
(14)

Proposition 1

(Proved in Appendix D). For any step size \(0<\delta <\tau \), the interpolation \(\textbf{h}(t+\delta )\) via Eq. (14) is on the geodesic between \(\textbf{h}(t)\) and \(\textbf{h}(t+\tau )\) on the manifold, and \(\frac{d_\mathbb {D}^\kappa (\textbf{h}(t), \textbf{h}(t+\delta ))}{ d_\mathbb {D}^\kappa (\textbf{h}(t), \textbf{h}(t+\tau )) }=\frac{\delta }{\tau }\) where \(d_\mathbb {D}^\kappa \) is the geodesic length.
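In code, the interpolation of Eq. (14) is a one-liner (a sketch assuming the `expmap`/`logmap` helpers sketched in Sect. 2):

```python
def hyperbolic_interpolate(h_t, h_next, delta, tau, c):
    """Geodesic interpolation of Eq. (14): a point a fraction delta/tau of the
    way from h(t) to h(t + tau). Sketch reusing the expmap/logmap helpers."""
    return expmap(h_t, (delta / tau) * logmap(h_t, h_next, c), c)
```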

4 Diffusing Graphs in Hyperbolic Space

4.1 Hyperbolic Graph Diffusion Equation

We study the diffusion process of graphs with node representations residing in hyperbolic geometry. Given the diffusion time \(t\in [0,T]\), embedding space \(\mathbb {D}^d_{\kappa _t}\) with learnable curvature \(\kappa _t\) at time t, node embedding \(\textbf{z}_{*}(t) \in \mathbb {D}^d_{\kappa _t}\), and \(\mathcal {C}(\cdot )\) being the correlated coordinates of a given node, we formulate the vector flow \(\mathcal {F}_\theta ^\kappa \) of the ith representation as

$$\begin{aligned} \underbrace{\exp _{\textbf{z}_i(t)}^{\kappa _t} \bigg ( \sigma \bigg [ \sum _{j\in \mathcal {C}(i)} }_\text {divergence} \underbrace{a(\textbf{z}_i(t), \textbf{z}_j(t))}_\text {diffusivity} \underbrace{ \log _{\textbf{z}_i(t)}^{\kappa _t} (\textbf{z}_j(t)) }_\text {gradient} \bigg ] \bigg ), \end{aligned}$$
(15)

where \(\sigma \) can be either the identity or a non-linear activation. With the initial state encoded by a learnable feature transformation \(\psi \), i.e. \(\textbf{z}(0) = \psi (\textbf{x})\in \mathbb {D}^d_\kappa \), the final state can be numerically estimated by our proposed HPDE integrators, i.e. \(\textbf{z}_i(T) = \textrm{HPDESolve}(\textbf{z}_i(0), \frac{\partial \textbf{z}_i(t)}{\partial t}, 0, T)\). In matrix form, the vector flow is expressed as

$$\begin{aligned} \mathcal {F}^\kappa _\theta (\textbf{z}(t), t) = \exp ^{\kappa _t}_{\textbf{z}(t)} \left( \sigma \big [ \textbf{S}(\textbf{z}(t))\nabla \textbf{z}(t) \big ]\right) , \end{aligned}$$
(16)

where \(\textbf{S}(\textbf{z}(t)) = (a(\textbf{z}_i(t),\textbf{z}_j(t)))\) is a normalized \(|\mathcal {V}|\times |\mathcal {V}|\) similarity matrix, and \((\nabla \textbf{z}(t))_{i,j} := \log _{\textbf{z}_i(t)}^{\kappa _t} (\textbf{z}_j(t))\). Below, we discuss the key ingredients of Eqs. (15) and (16).

Fig. 2. Schematic of HGDE. (a) The pipeline of our method includes hyperbolic projection, feature transformation, and the HPDE block that integrates the GDE. After that, a decoder is applied to the embeddings for specific downstream tasks. (b) The visualization of the diffusion process within the HPDE block: first, map local gradients of \(\textbf{z}_i\) onto the tangent space, calculate the diffusivity, and diverge to obtain the vector flow (green arrow), then perform one-step integration on the manifold with the guidance of continuous curvature diffusion. (c) The details of the attention-powered local-global diffusivity function. (Color figure online)

Gradient. The gradient of a function z(ut) at location u in a discrete space can be approximated as the difference between the function values at neighboring points. In graph space, let \(\textbf{z}_i\) and \(\{\textbf{z}_j\}_{j\in \mathcal {C}(i)}\), respectively, denote the target node and the correlated positions of i that can be modeled by edge connectivity or self-attention. The graph diffusion process [4, 5] treats nodes as Euclidean representations, such that the analogy of gradient operator \((\nabla \textbf{z}(t))_{i,j} : \mathbb {R}^d \rightarrow \mathbb {R}^d\) is expressed as \(\textbf{z}_j(t) - \textbf{z}_i(t)\). However, when nodes are embedded in Riemannian manifolds, the gradient of a node is no longer the difference between itself and neighboring points. Instead, we take vectors in the tangent space at \(\textbf{z}_i\) that are obtained by taking the derivative of \(\textbf{z}\) in all possible directions, i.e. \((\nabla \textbf{z}(t))_{i,j}: \mathbb {D}_\kappa ^d \rightarrow \mathcal {T}\mathbb {D}_\kappa ^d\) that can be formulated as \(\log _{\textbf{z}_i(t)}^\kappa (\textbf{z}_j(t))\). One recovers the discrete Euclidean gradient as the curvature \(\kappa \rightarrow 0\).

Diffusivity. The diffusivity scales the gradient, with either isotropic or anisotropic behavior. For graph diffusion, the isotropic form is given by the normalized adjacency matrix [23], where \(a_{i,j} = \frac{1}{\sqrt{d_i d_j}}\) if and only if \((i,j)\in \mathcal {E}\), and d denotes the node degree.

Alternatively, the anisotropic approach incorporates the attention mechanism [33] to account for the asymmetric relationship between pairs of nodes. This paper considers local, global and local-global schemes based on structure information. Define the schemes

$$ {\left\{ \begin{array}{ll} a^{\textrm{ldiff}}_{i,j} = \textrm{normalize}_{j\in \mathcal {N}(i)} \left( f_\theta \big (\textbf{z}_i(t) , \textbf{z}_j(t)\big ) \right) & \text {(local scheme)} \\ a_{i,j}^\textrm{gdiff} = \beta \textrm{normalize}_{j\in \mathcal {V}}\left( g_\phi \big (\textbf{z}_i(t) , \textbf{z}_j(t)\big ) \right) + \frac{1-\beta }{\sqrt{d_i d_j}} & \text {(global scheme)} \\ a_{i,j}^\textrm{lgdiff} = \beta \textrm{normalize}_{j\in \mathcal {V}}\left( g_\phi \big (\textbf{z}_i(t) , \textbf{z}_j(t)\big ) \right) + (1-\beta ) a^{\textrm{ldiff}}_{i,j} & \text {(local-global scheme)} \end{array}\right. } $$

where \(f_\theta \) and \(g_\phi \) are learnable functions that compute the diffusivity weight between a node pair (i, j), and \(\beta \) can be a constant or a trainable parameter that adjusts the emphasis on homophilic (local attention) and heterophilic (high-order, global attention) relations. The local attention scheme implicitly incorporates the graph information, since only neighboring elements are considered based on \(\mathcal {N}(i)\). Global attention, in contrast, neglects the graph topology and hence requires incorporating it manually.

A straightforward approach is to leverage the graph attention formulation [34], which is extended to hyperbolic space by [6], where the weights are calculated tacitly in the tangent space. An alternative is the Ollivier-Ricci Curvature (ORC) [27] attention, introduced in [38, 40] to drive message propagation. This approach is not limited by the non-Euclidean nature of the node features, as it computes attention weights via the ORC values derived from the graph topology, thus allowing adoption without leveraging the tangent space.

Propagation over high-order node pairs results in exponentially increasing complexity compared to \(f_\theta \). [36, 42] introduced a series of scalable and efficient node-level transformers. With a similar notion in hyperbolic space, we first project the embeddings onto the tangent space of the origin. Subsequently, the weights can be obtained using existing graph transformer architectures. We adopt energy-constrained transformers [36] with a sigmoid kernel, which perform well in most scenarios.

Figure 2(c) presents a high-level schematic of the diffusivity function. The implementation and algorithmic details are delegated to Appendix C.
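To make the three schemes concrete, the following dense PyTorch sketch spells out \(a^{\textrm{ldiff}}\), \(a^{\textrm{gdiff}}\) and \(a^{\textrm{lgdiff}}\); here `f_theta` and `g_phi` are stand-in score functions returning an \(N\times N\) matrix, which is our own simplification of the attention modules described above.

```python
import torch
import torch.nn.functional as F

def diffusivity(z, adj, f_theta, g_phi, beta=0.5):
    """Sketch of the local / global / local-global schemes (dense, for clarity).

    z   : [N, d] node embeddings
    adj : [N, N] binary adjacency matrix
    f_theta, g_phi : stand-in score functions returning an [N, N] matrix
    """
    neg_inf = torch.finfo(z.dtype).min
    local_scores = f_theta(z).masked_fill(adj == 0, neg_inf)
    a_local = F.softmax(local_scores, dim=-1)               # normalize over N(i)
    a_global = F.softmax(g_phi(z), dim=-1)                  # normalize over V
    deg = adj.sum(-1, keepdim=True).clamp_min(1.0)
    sym_norm = adj / torch.sqrt(deg * deg.transpose(0, 1))  # 1/sqrt(d_i d_j) on edges
    return {
        "ldiff":  a_local,
        "gdiff":  beta * a_global + (1 - beta) * sym_norm,
        "lgdiff": beta * a_global + (1 - beta) * a_local,
    }
```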

Divergence. For simplicity, we assume any \(\textbf{x}_i\in \mathbb {R}^d\) to be scalar-valued. The divergence at a point \(\textbf{z}_i\) is a measure of how much the vector field \(\mathcal {X}=\{(\nabla \textbf{z}(t))_{i,j}\}_{j\in \mathcal {C}(i)}\) is expanding or contracting at \(\textbf{z}_i\). In Euclidean space, the divergence would indeed be the sum of the components of the gradient, i.e., \(\textrm{div}_i = \sum _j (\nabla \textbf{z}(t))_{i,j}\), producing a value in \(\mathcal {T}_{\textbf{z}_i}\mathbb {D}^d \cong \mathbb {R}^d\). In our context, we are interested in how \(\textbf{z}_i\) varies on the manifold rather than in the tangent space; thus an exponential map is applied to the sum of gradients on \(\mathcal {T}_{\textbf{z}_i}\mathbb {D}^d\), giving \(\textrm{div}_i = \exp _{\textbf{z}_i(t)}(\sum _j a_{i,j}(\nabla \textbf{z}(t))_{i,j})\). This also satisfies the form of \(\mathcal {F}^\kappa _\theta \) in Definition 2, and thus can be numerically integrated through \(\textrm{HPDESolve}\).
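Putting the gradient, diffusivity and divergence together, a sparse PyTorch sketch of the node-wise flow in Eq. (15) might read as follows; it reuses the `expmap`/`logmap` helpers sketched in Sect. 2 and takes precomputed diffusivity weights, both of which are our own illustrative assumptions.

```python
import torch

def hgde_flow(z, edge_index, attn_weight, c, sigma=torch.tanh):
    """Sketch of the HGDE vector flow, Eqs. (15)-(16), on a sparse edge list.

    z           : [N, d] node embeddings on the Poincare ball
    edge_index  : [2, E] (i, j) index pairs with j in C(i)
    attn_weight : [E] diffusivity a(z_i, z_j), assumed normalized
    Reuses the expmap/logmap helpers sketched in Sect. 2.
    """
    src, dst = edge_index                                # edge (i = src, j = dst)
    grad_ij = logmap(z[src], z[dst], c)                  # gradient: log_{z_i}(z_j)
    msg = attn_weight.unsqueeze(-1) * grad_ij            # diffusivity-scaled gradient
    agg = torch.zeros_like(z).index_add_(0, src, msg)    # sum over j in C(i)
    return expmap(z, sigma(agg), c)                      # divergence: back onto the manifold
```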

Continuous Curvature Diffusion. Equation (16) implicitly guides the manifold towards its optimal geometry for embedding \(\textbf{z}(t)\), as the manifold parameter \(\kappa _t\) also accumulates gradients and is updated during backpropagation. Similar to the attention parameters \(\theta \), we let \(\kappa \) be time-independent, based on the assumption that \(\lim _{\tau \rightarrow 0}\frac{\kappa _{t+\tau } - \kappa _t}{\tau } = 0\).

4.2 Convergence of Dirichlet Energy

Definition 3

Given the node embedding \(\{\textbf{z} _i \in \mathbb {D}^{d}_\kappa \}_{i=1}^{|\mathcal {V}|}\), the hyperbolic Dirichlet energy is defined as

$$\begin{aligned} f^{\kappa }_\textrm{DE}(\textbf{z}) = \frac{1}{2} \sum _{(i,j)\in \mathcal {E}} {d_{\mathbb {D}}^\kappa } \left( \exp ^\kappa _\textbf{o}\left( \frac{\log _\textbf{o}^\kappa (\textbf{z}_i)}{\sqrt{1+d_i}}\right) ,\exp ^\kappa _\textbf{o}\left( \frac{\log _\textbf{o}^\kappa (\textbf{z}_j)}{\sqrt{1+d_j}}\right) \right) ^2, \end{aligned}$$
(17)

where \(d_{i/j}\) denotes the degree of node i/j. The distance \(d_\mathbb {D}^\kappa (\textbf{x}, \textbf{y})\) between two points \(\textbf{x}, \textbf{y}\in \mathbb {D}\) is the geodesic length; we detail the closed-form expression in Appendix B.

Definition 3 introduces a node-similarity measure to quantify over-smoothness in hyperbolic space. The \(f^{\kappa }_\textrm{DE}\) of a node representation can be viewed as a weighted sum of distances between normalized node pairs. [25, Prop. 4] proved that the hyperbolic energy \(f^{\kappa }_\textrm{DE}\) diminishes after message passing, and that multiple aggregations result in convergence towards zero energy, indicating reduced embedding expressiveness that could potentially cause over-smoothing. It is also proved in [43, Prop. 2] that over-smoothing is an intrinsic property of first-order continuous GNNs. In a continuous diffusion process, where each iteration can be viewed as a layer in HGNNs, we likewise observe a convergence of the hyperbolic Dirichlet energy of \(\textbf{z}(t)\) w.r.t. time t, as supported by Fig. 3.
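For reference, a direct transcription of Eq. (17), reusing the `expmap`/`logmap`/`dist` helpers sketched in Sect. 2 (an assumption of ours), might look as follows; it can be logged during integration to monitor energy decay.

```python
import torch

def hyperbolic_dirichlet_energy(z, edge_index, deg, c):
    """Sketch of Eq. (17) on a sparse edge list.

    z          : [N, d] node embeddings on the Poincare ball
    edge_index : [2, E] edge pairs (i, j)
    deg        : [N] float node degrees
    Reuses the expmap/logmap/dist helpers sketched in Sect. 2.
    """
    o = torch.zeros_like(z)                      # the origin of the Poincare ball
    src, dst = edge_index
    zi = expmap(o[src], logmap(o[src], z[src], c) / (1 + deg[src]).sqrt().unsqueeze(-1), c)
    zj = expmap(o[dst], logmap(o[dst], z[dst], c) / (1 + deg[dst]).sqrt().unsqueeze(-1), c)
    return 0.5 * (dist(zi, zj, c) ** 2).sum()
```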

Residual-Empowered Flow. Empirically, studies on multi-layer GNNs [15, 24] have demonstrated the efficacy of adding residual connections to the initial layer. It is also claimed in [45] that using residual connections for both the initial and previous layers can prevent the Dirichlet energy from reaching its lower energy limit, thus avoiding over-smoothing. Building upon these studies, we define the hyperbolic residual-empowered vector flow

$$\begin{aligned} \mathcal {F}^\kappa _\theta (\textbf{z}(t), t) = \mu ^\kappa _\mathbb {D}\left( \{\dot{\textbf{z}}(t), \textbf{z}(t), \textbf{z}(0)\}; \{\eta \}_{j=1}^J \right) , \end{aligned}$$
(18)

where \(\dot{\textbf{z}}(t) = \exp ^{\kappa _t}_{\textbf{z}(t)} \left( \sigma \big [ \textbf{S}(\textbf{z}(t))\nabla \textbf{z}(t) \big ]\right) \) is the manifold dynamic as in Eq. (16), \(\{\eta _j\}_{j=1}^J\) are the weight coefficients, and \(\mu _\mathbb {D}^\kappa \) is the node-wise hyperbolic averaging. We instantiate it via the Möbius Gyromidpoint [32] for its trade-off between computational cost and precision. Define

$$\begin{aligned} \mu _\mathbb {D}^\kappa (\{\textbf{z}\}_{j=1}^J; \{\eta \}_{j=1}^J) = \left( \frac{1}{2}\otimes _\kappa \left( \frac{ \sum _{j}\eta _j\lambda _{\textbf{z}^{(j)}_i}^\kappa \textbf{z}^{(j)}_i }{ \sum _j |\eta _j|(\lambda _{\textbf{z}^{(j)}_i}^\kappa -1) } \right) \right) _{i=1}^{|\mathcal {V}|}. \end{aligned}$$
(19)

This operation ensures the point-set constraint of \(\mathbb {D}\) for the residual flow, and we recover the arithmetic mean as \(\kappa \rightarrow 0\). During diffusion, Eq. (18) retains at least a portion of the initial and prior embeddings. Since the initial embedding possesses high energy, the residual connection mitigates energy degradation and retains the energy of the final iteration at the same level as the preceding iterations.
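A sketch of the gyromidpoint in Eq. (19) is given below; `mobius_scalar_mul` follows the standard Möbius scalar multiplication and `lambda_x` is the conformal-factor helper from the sketch in Sect. 2 (both our own illustrative choices, with \(c=-\kappa >0\)).

```python
import torch

def mobius_scalar_mul(r, x, c, eps=1e-7):
    # r (x)_c x = (1/sqrt(c)) tanh(r * atanh(sqrt(c) ||x||)) * x / ||x||
    sqrt_c = c ** 0.5
    xn = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(r * torch.atanh((sqrt_c * xn).clamp(max=1 - 1e-5))) * x / (sqrt_c * xn)

def gyromidpoint(zs, etas, c):
    """Sketch of the Moebius gyromidpoint, Eq. (19), applied node-wise.

    zs   : list of [N, d] point sets on the ball, e.g. {dz(t), z(t), z(0)}
    etas : list of scalar weights {eta_j}
    Reuses lambda_x from the Poincare helpers sketched in Sect. 2.
    """
    num = sum(e * lambda_x(z, c) * z for e, z in zip(etas, zs))
    den = sum(abs(e) * (lambda_x(z, c) - 1) for e, z in zip(etas, zs))
    return mobius_scalar_mul(0.5, num / den.clamp_min(1e-7), c)   # (1/2) (x)_kappa (.)
```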

5 Empirical Results

5.1 Experiment Setup

Datasets. Under the homophilic setting, we consider 5 datasets for node classification and link prediction: Disease, Airport (transductive datasets provided in [6] to investigate tree-likeness modeling), and PubMed, CiteSeer and Cora (widely used citation networks [39]), summarized in the table in Appendix A. Additionally, we report Gromov's hyperbolicity \(\delta \) given by [16] for each dataset. A graph is more hyperbolic as \(\delta \rightarrow 0\) and is a tree when \(\delta = 0\).

For heterophilic datasets, we evaluate node classification on three heterophilic graphs, Cornell, Texas and Wisconsin [29], from the WebKB collection (webpage networks). Detailed statistics are summarized in Appendix A. We use the original fixed 10-split datasets. In addition, we report the homophily level \(\mathcal {H}\) of each dataset; a sufficiently low \(\mathcal {H}\le 0.3\) indicates a more heterophilic dataset, in which most neighbours do not belong to the same class.

Baselines. We compare our models to (1) Euclidean-hyperbolic baselines, (2) discrete-continuous depth baselines and (3) heterophilic relationship baselines. For (1), we compare against feature-based models, Euclidean, and hyperbolic graph-based models. Feature-based models: without using graph structure, we feed node feature directly to MLP and HNN [14]; Euclidean graph-based models: GCN [23], GAT [34], GraphSAGE [19], and SGC [35]; Hyperbolic graph-based models: HGCN [6], \(\kappa \)GCN [1], LGCN [44] and HyboNet [10]. For (2), we compare our models on citation networks with the discrete-continuous depth models. Discrete depth: GCNII [7], C-DropEdge [20]; Discrete-decouple: HyLa-SGC [41]; Continuous depth: GDE [30], GRAND and BLEND [4, 5]. For (3), we compare to the prevalent GNNs: GCN, GAT, HGCN, HyboNet, and those optimized for heterophilic relationships: H2GCN [46], GCNII, GraphSAGE and GraphCON [31]. The test results are partially derived from the above works. For fairness, we compare to models with no more than 16 layers/iterations. Please refer to Appendix A for more details regarding the compared baselines. We detail the parameter settings for model and evaluation metric in Appendix C.

Table 1. Test accuracy (%) for node classification task.
Table 2. Test ROC AUC (%) results for link prediction task.
Table 3. Discrete-continuous depth GNN comparison.
Table 4. Heterophilic relationship GNN comparison.

5.2 Experiment Results

Euclidean-Hyperbolic Baselines. We investigate our method with different solvers at \(\tau =1\), i.e. HGDE-E (multi-step explicit integrator, HRK4) and HGDE-I (multi-step implicit integrator, HAM). The experimental results are summarized in Tables 1 and 2. (1) Our proposed models outperform previous Euclidean and hyperbolic models on four out of five datasets, suggesting that graph learning in hyperbolic space through topological diffusion is beneficial. (2) Hyperbolic models typically exhibit poor performance on datasets that are less hyperbolic (e.g., Cora), while our method surprisingly exceeds Euclidean GAT on such less hyperbolic datasets, indicating the necessity of curvature diffusion for adapting to datasets with scarce hierarchical structures and of modeling long-term dependencies via the local-global diffusivity function. (3) HGDE and other hyperbolic models achieve superior performance compared to Euclidean counterparts in link prediction, owing to the larger embedding space of hyperbolic geometry, which better preserves structural dependencies and allows for improved node arrangement. (4) HGDE-E generally outperforms HGDE-I with lower memory consumption and better precision, indicating that a larger \(\tau \) may be necessary for implicit solvers. To align with the multi-layer GNN schema (step size is analogous to depth), we employ HGDE with HRK4 (\(\tau =1\)) for further evaluation.

Discrete-Continuous Depth Baselines. In Table 3, we compare our models with discrete and continuous-depth baselines. We observe that our method with \(T\in [12, 16]\) achieves competitive results with the state-of-the-art models. Notably, HGDE models consistently outperform discrete models and continuous models with Euclidean embeddings, highlighting the benefits of utilizing hyperbolic embeddings in a continuous-depth framework. Compared to position encoding approaches (e.g., HyLa, BLEND), HGDE exhibits superior performance, indicating the feasibility of using hyperbolic space embeddings directly rather than initial position encodings. Interestingly, we find that HGDE models perform better as T increases up to 12, but slightly worse at \(T=16\). This may be due to the capacity of the Poincaré ball or potential over-smoothing. Overall, the results underscore the effectiveness of the proposed HGDE models in harnessing the power of hyperbolic space for graph data modeling.

Table 5. Evaluation on image (CIFAR/STL) and text (20News) classification (Left) and Memory & Runtime comparison (Right). \(\star \) indicates OOM.

Heterophilic Relationship Baselines. We show that HGDE is also capable of managing heterophilic relationships. In Table 4, HGDE achieves the highest scores on Texas and Cornell and a competitive score on Wisconsin. This shows that hyperbolic space is beneficial for learning hierarchical heterophilic relationships. It also reflects the flexibility of HGDE as a hyperbolic vector flow for embedding high-order structures, with our model, powered by HPDE, outperforming the other baselines on average.

Image and Text Classification. We follow the experimental setup in [36] and conduct additional experiments on the CIFAR, STL, and 20News datasets to evaluate HGDE in multiple scenarios with limited label rates. We employ the SimCLR [9] extracted embeddings as provided in [36] for image classification. For the pre-processed 20News [28] text classification dataset, we take 10 topics and regard words with \(\text {TF-IDF} > 5\) as features. For graph-based models, we use kNN to construct a graph over the input features. For HGDE (hyperbolic), we map the initial features to \(\mathbb {D}_\kappa \) via \(\exp _\textbf{o}^\kappa (\cdot )\) before the embedding process. As depicted in Table 5 (Left), HGDE consistently surpasses its competitors, including MLP, LabelProp [47], ManiReg [2], GCN-kNN, GAT-kNN, DenseGAT, and GLCN [21]. Across all datasets, HGDE outperforms the Euclidean models, underscoring its proficiency in capturing the latent hierarchical structure of image embeddings [22] and text embeddings. Furthermore, HGDE exhibits good performance compared to static graph-based baselines, e.g., GAT-kNN and GLCN, which underlines the distinct advantage of the evolving diffusivity mechanism in modeling these hierarchical structures.
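A minimal sketch of this setup (kNN graph construction plus the \(\exp _\textbf{o}^\kappa \) projection) is shown below; the value of k, the curvature, and the helper names are illustrative, and the sketch reuses the `expmap` helper from Sect. 2.

```python
import numpy as np
import torch
from sklearn.neighbors import kneighbors_graph

def build_knn_hyperbolic_inputs(features, k=10, c=1.0):
    """Sketch of the image/text setup: build a kNN graph over raw features,
    then map the features onto the Poincare ball via exp_o^kappa before
    diffusion. Reuses the expmap helper sketched in Sect. 2; k and c are
    illustrative values, not the tuned settings."""
    adj = kneighbors_graph(features.numpy(), n_neighbors=k, mode="connectivity")
    edge_index = torch.tensor(np.vstack(adj.nonzero()), dtype=torch.long)  # [2, E]
    o = torch.zeros_like(features)
    z0 = expmap(o, features, c)        # exp_o^kappa(x): project onto the ball
    return z0, edge_index
```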

5.3 Ablation Study

Efficacy of Hyperbolic Residual. Figure 3 visualizes the convergence of the hyperbolic energy through iterations. We observe that, without residuals, the averaged energy rapidly decreases to near-zero values, supporting the hypothesis that, without residual connections, the embedding can evolve to an overly smoothed state that is potentially low in expressiveness. With hyperbolic residuals, for all three integrators, the average energy decreases over the first few iterations and then stabilizes around a value above zero. This behavior is consistent across both datasets, suggesting that the system converges to a stable state with non-zero energy (Fig. 4).

Fig. 3. Hyperbolic Dirichlet energy \(f^{\kappa }_\textrm{DE}(\cdot )\) variation through t on Cora (left) and CiteSeer (right). Models are compared with different integrators, with or without hyperbolic residual.

Fig. 4. Averaged node classification performance comparison of models with different diffusivity functions on various datasets.

Fig. 5. Cora diffusivity (400 nodes sampled from \(\mathbb {D}^2_{\kappa }\) embeddings) produced by \(a^{\textrm{ldiff}}\) (left) and \(a^{\textrm{lgdiff}}\) (right); blue and red lines denote local and global attention; bolder lines indicate more attentiveness. (Color figure online)

Efficacy of Diffusivity Function. Figure 5 visualizes sampled node embeddings and their edge diffusivity on Cora. The blue edges are inherently determined by the graph structure; the red ones are determined by global attention, showing that \(a^\textrm{lgdiff}\) accounts for high-order relations. The bar chart (Fig. 4) shows the average accuracy on various datasets produced by HGDE with different diffusivity functions. We find that anisotropic approaches generally outperform the isotropic approach, suggesting the necessity of directional information in the diffusion process. Although the performance degrades on CiteSeer when using \(a^{\textrm{lgdiff}}\), there are significant improvements on the other graphs, certifying the benefit of the higher-order proximity induced by local-global diffusivity.

Parameter Efficiency. In Table 5 (Right), we provide an additional comparison of peak GPU memory usage and per-epoch running time on Cora. We test HGDE-E, where all models have a hidden dimension of 16. Our model significantly outperforms the other baselines in both training time (for \(\ge 4\) layers) and memory consumption. The memory reduction is primarily due to the utilization of sparse attention and the advantage of a weight-tied network (requiring only single-layer parameters), a nature of HPDE. The training-time efficiency is achieved by eliminating layer-wise feature transformation, implementing weight-tying, and applying scatter-based aggregation for hyperbolic representations.

6 Conclusion

We developed multiple numerical integrators for HPDE, and proposed the first hyperbolic continuous-time embedding diffusion framework – HGDE. Being capable of capturing both low and high order proximity, HGDE outperforms both Euclidean and hyperbolic baselines on various datasets. The effectiveness of HGDE was further validated by the ablation studies on hyperbolic energy and diffusivity functions. The superiority of HGDE underscores the potential of developing PDE-based non-Euclidean models.

Limitation. While HGDE presents strong performance in modeling graph data, hyperbolic spaces may not always be optimal, particularly for data without clear hierarchical structures. For instance, HGDE struggles to beat naturally Euclidean deep models (e.g. GCNII) on the non-hierarchical \(\textsc {Cora}\). Moreover, lower memory consumption and training time only speak to the efficiency rather than the scalability of HGDE: since our models are evaluated with a fixed number of parameters (which is natural for ODE-based models), increasing T is not necessarily scaling up. Future work includes addressing these limitations and exploring the scalability and generalizability of HGDE in diverse real-world settings.