1 Introduction

RoleSim is a role-based similarity measure that quantifies the closeness between two objects based on graph topology, with a proliferation of real-life applications [9, 10, 23] in, e.g., link prediction (social network), co-citation analysis (bibliometrics), motif discovery (bioinformatics), and collaborative filtering (information retrieval). It recursively follows a SimRank-like reasoning that “two nodes are assessed as role similar if they interact with automorphically equivalent sets of in-neighbors”. Intuitively, automorphically equivalent nodes in a graph are objects having similar roles that can be exchanged with minimum effect on the graph structure. Similar to the well-known measure SimRank [7], the recursive nature of RoleSim allows to capture the multi-hop neighboring structures that are automorphically equivalent in a network. Unlike SimRank that measures the similarity of two nodes from the paths connecting them, RoleSim quantifies their similarities through the paths connecting their different roles. As a result, two nodes that are disconnected each other will not be considered as dissimilar by RoleSim if they have similar roles. For evaluating similarity score s(ab) between nodes a and b, as opposed to SimRank whose similarity s(ab) takes the average similarity of all the neighboring pairs of (ab), RoleSim computes s(ab) by averaging only the similarities over the maximum bipartite matching of all the neighboring pairs of (ab). This subtle difference enables RoleSim to guarantee the automorphic equivalence, which SimRank lacks, in final scoring results. Therefore, RoleSim has been demonstrated as an effective similarity measure in many real applications. We summarize two of these applications below.

Application 1 (Similarity Search on the Web). Discovering web pages similar to a query page is an important task in information retrieval. In a Web graph, each node represents a web page, and an edge denotes a hyperlink from one page to another. RoleSim can be applied to measure the similarity of two web pages, based on the intuition that “two web pages are role-similar if they are pointed to by the automorphically equivalent sets of their in-neighboring pages”. This similarity measure produces more reliable similarity results than the SimRank model  [10].

Application 2 (Social Network De-anonymisation). Social network de-anonymisation is a method to validate the strength of anonymisation algorithms that protect a user’s privacy. RoleSim has been applied to de-anonymise node mappings based on the similarity information between a crawled network and an anonymised one. Based on the observation that “correct mappings tend to have higher similarity scores”, RoleSim iteratively evaluates pairwise node similarities between two networks, and captures the reasoning that “a pair of nodes between two networks is more likely to be a correct mapping if their neighbors are correct mappings”. RoleSim has demonstrated superior performance as compared with other existing de-anonymization algorithms  [23].

Despite its popularity in real-world applications, RoleSim has a major limitation: with the aim to achieve automorphic equivalence, its similarity score s(ab) only considers the limited information of the average similarity scores over the automorphically equivalent set (i.e., the maximum bipartite matching) of a’s and b’s in-neighboring pairs, but neglects the rest of the pairwise in-neighboring similarity information that is out of the automorphically equivalent set. Consequently, RoleSim does not always produce comprehensive similarity results because two pairs of nodes, which are not automorphically equivalent by nature, should be distinguished from each other even though the average values of their in-neighboring similarities over the set of the maximum bipartite matching are the same, as illustrated in Example 1.

Fig. 1.
figure 1

Limitation of RoleSim (RS) on a tiny web graph, where node-pairs (1, 2) and (1, 3) have the same RoleSim score (0.426) since RS aggregates only the in-neighboring pairs that are automorphically equivalent (colored in green) whose sums are the same \((0.488+0.360=0.848)\), while ignoring the remaining pairs. (Color figure online)

Example 1 (Limitation of RoleSim)

Consider the web graph G in Fig. 1, where each node denotes a web page, and each edge depicts a hyperlink from one page to another. Using RoleSim, we evaluate pairs of similarities between nodes, as partially illustrated in the ‘RS’ column of the right table. It is discerned that node-pairs (1, 2) and (1, 3) have the same RoleSim similarity values, which is not reasonable. Because node 2 and node 3 are not strictly automorphically equivalent by nature, their similarities with respect to the same query node 1, i.e.,  s(1, 2) and s(1, 3), should not be the same.

We notice that the main reason why s(1, 2) and s(1, 3) are assessed to be the same by the RoleSim model is that its similarity s(ab) considers only the average similarity scores over the maximum bipartite matching, denoted as \(M_{a,b}\), of (ab)’s in-neighboring pairs \(I_a \times I_b\), where \(I_a\) denotes the in-neighbor set of node a, and \(\times \) is the Cartesian product of two sets. Thus, the similarity information in the remaining in-neighboring pairs of (ab), i.e.,  \(I_a \times I_b-M_{a,b}\), are totally ignored. For example, if unfolding the in-neighboring pairs of (1, 2) and (1, 3) respectively, we see that, in the gray cells, \(M_{1,2}=\{(4,6),(5,7)\}\) (resp.  \(M_{1,3}=\{(4,9),(5,10)\}\)) is the maximum bipartite matching of (1, 2)’s (resp. (1, 3)’s) in-neighboring pairs \(I_1 \times I_2\) (resp.  \(I_1 \times I_3\)). The sum of the similarity values over \(M_{1,2}\) is \(0.488+0.360=0.848\), which is the same as that over \(M_{1,3}\). Thus, RoleSim cannot distinguish s(1, 2) from s(1, 3).    \(\square \)

Example 1 illustrates that, to effectively evaluate s(ab), relying only on the in-neighboring-pairs similarities in the maximum bipartite matching \(M_{a,b}\) (e.g., RoleSim) is not enough. Although RoleSim has the advantage of finding the most influential pairs \(M_{a,b}\) among all the in-neighboring pairs \(I_a \times I_b\) for achieving automorphic equivalence, it completely ignores the similarity information outside \(M_{a,b}\). For instance in Example 1, there are opportunities to take good advantage of the similarity values in the regions \(I_1 \times I_2-M_{1,2}\) and \(I_1\times I_3-M_{1,3}\) which would be helpful to distinguish s(1, 2) from s(1, 3) further when the average similarities over \(M_{1,2}\) and \(M_{1,3}\) are the same.

Contributions. Motivated by this, our main contributions are as follows:

1) We first propose a novel similarity model, RoleSim*, which accurately evaluates pairwise role similarities in a more comprehensive fashion. Compared with the existing well-known similarity models (e.g., SimRank and RoleSim), RoleSim* not only guarantees the automorphic equivalence that SimRank lacks, but also takes into consideration the pairwise similarities outside the automorphically equivalent sets that are overlooked by RoleSim. (Sect. 3.1)

2) We prove the existence and uniqueness of the RoleSim* solution, and show three key axiomatic properties of RoleSim*, i.e., symmetry, boundedness, and non-increasing monotonicity of its iterative similarity scores. (Sect. 3.2)

3) We derive an iterative formula for computing RoleSim* similarities, and a concise upper bound is obtained, which can estimate the total number of iterations required for attaining a desired accuracy. (Sects. 3.3 and 3.4)

4) We induce a distance metric based on our RoleSim* measure, and rigorously show that the RoleSim* distance metric fulfills the triangular inequality which other measures (e.g., cosine distance) lack. This implies the sum-transitivity of the RoleSim* measure. (Sect. 3.5)

5) We conduct an experimental study to validate the effectiveness of our RoleSim* model. Our empirical results show that RoleSim* achieves higher accuracy than the existing competitors (e.g., RoleSim and SimRank) while entailing comparable computational complexity bounds of RoleSim. (Sect. 4)

2 Related Work

Graph-based similarity models have been popular since SimRank measure was proposed by Jeh and Windom [7]. SimRank is a node-pair similarity measure, which follows the recursive idea that “two nodes are considered as similar if they are pointed to by similar nodes”. Since then, there have been surges of studies focusing on optimization problems to accelerate SimRank computation as the naive SimRank computing method entails quadratic time in the number of nodes. According to assumptions on data updates, recent results can be divided into static algorithms [1, 4, 5, 11, 15, 20, 24, 27, 29, 32, 37], and dynamic algorithms on evolving graphs [8, 12, 18, 22, 25, 31, 35]. According to types of queries, these results are classified into single-source SimRank [8, 11, 18, 24, 35], single-pair SimRank [6, 14], all-pairs SimRank [1, 19, 29, 30], and partial-pairs SimRank [20, 34].

There are many studies on semantic problems of pairwise similarity measures. Various SimRank-like measures have come into play, including C-Rank  [26], SimFusion  [33], P-Rank  [36], RoleSim  [9], MatchSim  [17], ASCOS  [2], SimRank*  [32], CoSimRank  [21], SemSim  [27]. Among them, RoleSim has stood out as a promising role-based similarity model, due to its elegant intuition that “if two nodes are automorphically equivalent, they should share the same role and their role similarity should be maximal”. To speed up the RoleSim computation, an approximate heuristic, named Iceberg RoleSim, was devised to prune small similarity values below a threshold.

Unlike SimRank that takes the average similarity of all the neighboring pairs of (ab), RoleSim computes s(ab) by averaging only the similarities over the maximum bipartite matching \(M_{a,b}\). However, all the similarity information not included in the matching \(M_{a,b}\) is completely ignored by RoleSim. In contrast, our RoleSim* model can effectively capture these information while guaranteeing automorphic equivalence.

There have also been a host of studies on variations of RoleSim  [3, 13, 17, 23]. Lin et al.  [17] introduced MatchSim whose similarity is defined to be the average similarity of (ab)’s maximum matched neighbors. It differs from RoleSim in that MatchSim initialises \(s_0(a,b)=1 \) if \(a=b\), and 0 otherwise, whereas RoleSim initialises all \(s_0(*,*) = 1 \). As a result, MatchSim scores do not guarantee automorphic equivalence. Li et al.  [13] proposed CentSim, a centrality based role similarity measure, which compares the centrality values of two nodes to evaluate their similarity. This measure employs several types of centralities including PageRank, Degree and Closeness for each node, and considers the weighted average of them for evaluating CentSim scores. Recently, Shao et al.  [23] introduced RoleSim++, an extension of RoleSim, which considers both incoming and outgoing neighbors in a digraph for social network de-anonymisation. It employs a novel matching algorithm, called NeighborMatch, to find matching for inner and outer neighbors, respectively. Furthermore, a threshold based version, \(\alpha \)-RoleSim++, is proposed to eliminate tiny scores for speedup further. Most recently, Chen et al.  [3] suggest a scalable model, StructSim, with an efficient BinCount matching algorithm and present a hierarchical scheme, which achieves a more efficient role similarity computation.

3 RoleSim*

3.1 RoleSim* Formulation

The central intuition underpinning RoleSim* follows a recursive concept that “two distinct nodes are assessed to be similar if they

  1. 1.

    mainly interact with the automorphically equivalent sets of in-neighbors, and

  2. 2.

    are in-linked by similar nodes that are out of automorphically equivalent sets.

The starting point for this recursion is to assign each pair of nodes a similarity score 1, meaning that initially no pairs of nodes are thought of to be more (or less) similar than others.

Notations. Before illustrating the mathematical definition to reify the RoleSim* intuition, we introduce the following notations.

Let \(G=(V,E)\) be a directed graph with a set of nodes V and a set of edges E. Let \(I_a\) be all in-neighbors of node a, and \(|I_a|\) the cardinality of the set \(I_a\). For a pair of nodes (ab) in G, we denote by \(I_a \times I_b=\{(x,y) \ | \ \forall x \in I_a \text { and } \forall y \in I_b\}\) all in-neighboring pairs of (ab), and s(ab) the RoleSim* similarity score between nodes a and b. Using \(I_a \times I_b\) and s(ab), we define a weighted complete bipartite graph, denoted by \(K_{|I_a|, |I_b|}= (I_a \cup I_b, I_a \times I_b)\), with each edge \((x,y) \in I_a \times I_b\) carrying the weight s(ab). We denoted by \(M_{a,b} \ (\subseteq I_a \times I_b)\) the maximum weighted matching in bipartite graph \(K_{|I_a|, |I_b|}\).

Example 2

Recall digraph G in Fig. 1. For nodes 1 and 2, their in-neighbors sets are \(I_1=\{4, 5\}\) and \(I_2=\{6, 7, 8\}\), respectively. The set of all in-neighboring pairs of (1, 2) is \(I_1 \times I_2=\{\mathbf {(4,6)},(4,7),(4,8),(5,6),\mathbf {(5,7)},(5,8)\}\). The maximum matching of bipartite graph \((I_1 \cup I_2, I_1 \times I_2)\) is \(M_{1,2}=\{(4,6),(5,7)\}\) (bold).    \(\square \)

Other notations frequently used throughout this paper are listed in Table 1.

Table 1. Description of main symbols

RoleSim* Formula. Based on our aforementioned intuition, we formally formulate the RoleSim* model as follows:

(1)

In Eq. (1), for every pair of nodes (ab), the set of their in-neighboring pairs, \(I_a \times I_b\), is split into two subsets: \(I_a \times I_b = M_{a,b} \cup (I_a \times I_b-M_{a,b})\). As a result, the definition of RoleSim* consists of two parts: Part 1 is the average similarity over maximum matching \(M_{a,b}\), indicating the contribution from (ab) interacting with the automorphically equivalent set, \(M_{a,b}\), of (ab)’s in-neighbors pairs. Part 2 is the average similarity over \((I_a \times I_b)-M_{a,b}\), corresponding to the contribution from (ab) being pointed to by the rest of (ab)’s in-neighbors pairs out of automorphically equivalent set \(M_{a,b}\). The relative weight of Part 1 and 2 is balanced by a user-controlled parameter \(\lambda \in [0,1]\). \(\beta \) is a damping factor between 0 and 1, which is often set to 0.6 or 0.8, implying that similarity propagation made with distant in-neighbors is penalised by an attenuation factor \(\beta \) across edges. When \(I_a\) (or \(I_b\)) \(=\emptyset \), which implies the maximum matching \(M_{a,b}=\emptyset \), we define Part 1 and Part 2 = 0 in order to avoid the denominators of the fraction in Part 1 and 2 being zeros.

Fixed-Point Iteration. To solve RoleSim* similarity s(ab) in Eq. (1), we adopt the following fixed-point iterative scheme:

$$\begin{aligned} s_{0}(a,b)&= 1 \qquad (\forall a, b)\end{aligned}$$
(2)
$$\begin{aligned} s_{k+1}(a,b)&= \beta \times \bigg ( {\frac{\lambda }{\left| {{I}_{a}} \right| +\left| {{I}_{b}} \right| -\left| {{M}_{a,b}} \right| }\sum \limits _{(x,y)\in {{M}_{a,b}}}{s}_k (x,y)} \nonumber \\+ & {} {\frac{1-\lambda }{\left| {{I}_{a}} \right| \times \left| {{I}_{b}} \right| -\left| {{M}_{a,b}} \right| }\sum \limits _{(x,y)\in ({{I}_{a}}\times {{I}_{b}})-{{M}_{a,b}}}{s}_k (x,y)} \bigg )+(1-\beta ) \end{aligned}$$
(3)

where \(s_k(a,b)\) denotes the RoleSim* score between nodes a and b at iteration k. Based on Eqs. (2) and (3), we can iteratively compute all pairs of similarity scores \(s_{k+1}(*,*)\) from those at the last iteration \(s_{k}(*,*)\). The fixed-point scheme in Eqs. (2) and (3) implies an iterative algorithm for RoleSim* computation, which requires \(O(K{|E|}^2)\) time to compute \({|V|}^2\) node-pairs for K iterations.

Threshold-Based RoleSim*. To accelerate RoleSim* computation, we notice that there are a significant number of node pairs whose iterative similarity values \(s_k(*,*)\) are very close to their convergent scores \(s(*,*)\) and thus will not change much in subsequent iterations. Hence, we propose the following threshold-based RoleSim* model, where \(\delta \) is a user-controlled threshold, which is a speed-accuracy trade-off.

$$\begin{aligned} s_{0}^{\delta }(a,b)&\,=\,&1 \nonumber \\ s_{k+1}^{\delta }(a,b)&\,=\,&{\left\{ \begin{array}{ll} s_{k}^{\delta }(a,b) \qquad \qquad \text { if } s_{k-1}^{\delta }(a,b)- s_{k}^{\delta }(a,b)<\delta \\ 1-\beta \qquad \qquad \,\,\,\,\,\text { if } s_{k}^{\delta }(a,b)<(1-\beta )+\delta \nonumber \\ \beta \times \bigg ( {\frac{\lambda }{\left| {{I}_{a}} \right| +\left| {{I}_{b}} \right| -\left| {{M}_{a,b}} \right| }\sum \limits _{(x,y)\in {{M}_{a,b}}}{s}_k^{\delta } (x,y)} \qquad \qquad \text {otherwise} \\ + {\frac{1-\lambda }{\left| {{I}_{a}} \right| \times \left| {{I}_{b}} \right| -\left| {{M}_{a,b}} \right| }\sum \limits _{(x,y)\in ({{I}_{a}}\times {{I}_{b}})-{{M}_{a,b}}}{s}_k^{\delta } (x,y)} \bigg )+(1-\beta )\nonumber \\ \end{array}\right. } \end{aligned}$$

3.2 Axiomatic Properties for RoleSim* Iterative Similarity

Based on the definition of iterative similarity \(s_{k}(a,b)\) in Eqs. (2) and (3), we next show three axiomatic properties of RoleSim*, i.e., symmetry, boundedness, and non-increasing monotonicity, based on the following theorem.

Theorem 1

The iterative RoleSim* \(\{s_{k}(a,b)\}\) in Eqs. (2) and (3) have the following key properties: for any node pair (ab) and each iteration \(k=0,1,\cdots \),

  1. 1.

    (Symmetry)    \(s_k(a,b) = s_k(b,a)\)

  2. 2.

    (Boundedness)    \(1-\beta \le s_k(a,b) \le 1\)

  3. 3.

    (Monotonicity)    \(s_{k+1}(a,b) \le s_k(a,b)\)

Proof

Please refer to [28] for a detailed proof of all theorems and lemmas.

Theorem 1 indicates that, for every iteration \(k=0,1,2,\cdots \), \(\{s_{k}(a,b)\}\) is a bounded symmetric scoring function. Moreover, as \(k \rightarrow \infty \), it can be readily verified that the exact solution s(ab) also is a bounded symmetric measure, which is similar to SimRank and RoleSim. In comparison, other measures (e.g., Hitting Time and Random Walk with Restart) are asymmetric ones.

3.3 Existence and Uniqueness

It is worth mentioning that, as opposed to SimRank whose iterative similarity is non-decreasing between 0 and 1 w.r.t.  k, RoleSim* similarity is non-increasing between \(1-\beta \) and 1. The bounded non-increasing property of RoleSim* guarantees the existence and uniqueness of its exact solution s(ab), as shown below:

Theorem 2 (Existence and Uniqueness)

There always exists a unique solution s(ab) (i.e., the exact RoleSim score) to Eqs. (2) and (3) such that the iterative RoleSim similarity \(\{s_{k}(a,b)\}\) converges to it, i.e.,  \(\lim _{k \rightarrow \infty } s_{k}(a,b) = s(a,b)\).

3.4 Accuracy Estimation

Having proved the existence and uniqueness of the exact RoleSim* solution, we are now ready to investigate the error bound of the difference between the k-th iterative similarity \(s_k(a,b)\) and exact one s(ab). In virtue of the non-increasing monotonicity of \(\{s_k(a,b)\}\), one can readily show that the exact s(ab) is the lower bound of all the iterative similarities \(\{s_k(a,b)\}\), i.e.,  \(s_k(a,b) \ge s(a,b) \ \ (\forall k)\). The following theorem further provides a concise upper bound to measure the closeness between \(s_k(a,b)\) and s(ab).

Theorem 3 (Iterative Error Bound)

For every iteration number \(k=0,1,2, \cdots \), the difference between \(s_k(a,b)\) and s(ab) is bounded by

$$\begin{aligned} s_k(a,b) - s(a,b) \le \beta ^{k+1} \qquad (\forall a,b) \end{aligned}$$
(4)

Theorem 3 derives a concise exponential upper bound for the difference between the k-th iterative similarity \(s_k(a,b)\) and exact s(ab). Combining this bound with the non-increasing monotonicity \(s_k(a,b) \ge s(a,b)\), we can obtain that the k-th iterative error \(s_k(a,b) - s(a,b)\) is between 0 and \(\beta ^{k+1}\). Moreover, Theorem 3 also implies that, given desired accuracy \(\epsilon >0\), the total number of iterations required for computing RoleSim* similarity is \(K = \lceil \log _{\beta } \epsilon \rceil \).

3.5 “Sum-Transitivity” of RoleSim* Similarity

In this section, we investigate the transitive property of the proposed RoleSim* similarity measure. Intuitively, when a similarity measure \(s(*,*)\) fulfills the transitive property, it means that, for any three nodes abc in the graph, if a is similar to b and b is similar to c, it implies that a is likely to be similar to c. The transitivity feature is useful in many real applications, e.g., for predicting and recommending links in a graph.

To study the transitive property of RoleSim*, let us induce a distance \(d(a,b):=1-s(a,b)\) from the RoleSim* measure. Due to \(s(*,*) \in [1-\beta , 1]\), the distance \(d(*,*)\) is between 0 and \(\beta \). In what follows, we will show that \(d(*,*)\) satisfies the triangular inequality, which is an indication of \(s(*,*)\) transitivity.

We first show the following lemma, which is needed for further proof of RoleSim* triangular inequality.

Lemma 1

Let \(s_k(*,*)\) be the k-th iterative RoleSim* similarity to Eqs. (2) and (3). For any three nodes abc in a graph, if \(s_k(a,b)+s_k(b,c)-s_k(a,c) \le 1\) holds at iteration k, the following inequalities holds:

$$\begin{aligned} P_1 :=\frac{\sum \limits _{(x,y)\in {{M}_{a,b}}}{{{s}_{k}}}(x,y)}{\left| {{I}_{a}} \right| +\left| {{I}_{b}} \right| -\left| {{M}_{a,b}} \right| }+\frac{\sum \limits _{(y,z)\in {{M}_{b,c}}}{{{s}_{k}}}(y,z)}{\left| {{I}_{b}} \right| +\left| {{I}_{c}} \right| -\left| {{M}_{b,c}} \right| }-\frac{\sum \limits _{(x,z)\in {{M}_{a,c}}}{{{s}_{k}}}(x,z)}{\left| {{I}_{a}} \right| +\left| {{I}_{c}} \right| -\left| {{M}_{a,c}} \right| } \le 1 \end{aligned}$$
(5)
(6)

Leveraging Lemma 1, we are now ready to show the sum-transitivity of the RoleSim* similarity distance, which is the main result in this subsection:

Theorem 4

The RoleSim* similarity s(ab) defined by Eq. (1) satisfies the following “sum-transitive” property: Let \(d(a,b):=1-s(a,b)\) be the closeness between nodes a and b. Then, for any three nodes abc in a graph, the following triangular inequality holds, i.e., 

$$\begin{aligned} d(a,b) + d(b,c) \ge d(a,c) \end{aligned}$$
(7)

4 Experimental Evaluation

4.1 Experimental Settings

Datasets. We use both real and synthetic datasets, as illustrated below:

Datasets

Abbr.

#Node-Pairs

#Nodes

#Edges

Type

Amazon

(AMZ)

25, 867, 396

5, 086

8, 970

Directed

DBLP

(DBLP)

5, 626, 384

2, 372

7, 106

Undirected

Synthetic

(SYN)

4, 000, 000

2, 000

5, 481

Undirected

  • . A co-purchasing digraph crawled from Customers Who Bought This Item Also Bought feature of AmazonFootnote 1. Each node is a product, and edge \(i \rightarrow j\) means that product j appears in the frequent co-purchasing list of i.

  • . A collaboration (undirected) graph taken from DBLP bibliography.Footnote 2 We extract a co-authorship subgraph from six top conferences in computer science (SIGMOD, VLDB, PODS, KDD, SIGIR, ICDE) during 2018–2020. If two authors (nodes) co-authored a paper, there is an edge between them.

  • . A random scale-free graph with a power-law degree distribution, generated by GenRndPowerLaw function in C++ SNAP Library.Footnote 3

All experiments are conducted on a PC with Intel Core i7-10510U 2.30GHz CPU and 16 GB RAM, using Windows 8 Professional 64-bit. Each experiment is repeated 5 times and the average is reported.

Compared Algorithms. We implemented all the following algorithms in VC++:

Models

Abbr.

Description

RoleSim*

(RS*)

Our proposed RoleSim* model in Section 3.1

SimRank

(SR)

A pairwise similarity model proposed by Jeh and Widom  [7]

MatchSim

(MS)

A similarity model relying on the matched neighbors of node pairs  [16]

RoleSim

(RS)

A model that guarantees the automorphically equivalence of nodes  [9]

RoleSim++

(RS++)

An enhanced RoleSim that considers both in- and out-neighbors  [23]

CentSim

(CS)

A similarity model that compares the centrality values of node pairs  [13]

Parameters. We use the following parameters as default: (a) damping factor \(\beta =0.8\), (b) relative weight \(\lambda =0.7\), (c) total number of iterations \(K=4\).

Semantic Evaluation. We design an unsupervised evaluation setting to quantify the effectiveness of the similarity measures in preserving self-similarity under different conditions. In particular, we study the effect of sampling the immediate neighborhood of a query point on similarity scores in RoleSim* compared with SimRank and RoleSim. Consider a single query node q. In our experiment, we create a node \(q'\) and add it to the graph. We connect \(q'\) to some proportion (\(\eta \)) of the total number of neighbors of q, and hereby refer to \(q'\) as the “sampled clone”. The similarity scores of q to all other points in the graph are computed using SimRank, RoleSim, and RoleSim*. We evaluate how much the relative similarities are preserved when different measures are used. We vary \(\eta \) in \(q'\) with step size 0.25 (and ensuring no orphaned nodes), and additionally consider \(\lambda ={0.0, 0.3, 0.5, 0.7, 1.0}\) for RoleSim*. Our results are aggregated over 20 queries on DBLP and AMZ graphs respectively, where query nodes are chosen as having high degree of neighbors.

Fig. 2.
figure 2

Effect of Sampling Ratio \((\eta )\) and Weight \((\lambda )\) on Ranking Quality (DBLP)

Fig. 3.
figure 3

Effect of Sampling Ratio \((\eta )\) and Weight \((\lambda )\) on Ranking Quality (AMZ)

4.2 Experimental Results

Semantic Accuracy. We first count the number of queries where the sampled clone \(q^{\prime }\) appears in the top-k (\(k={1,5,10}\)) similar nodes to query q for RoleSim*. Intuitively, this studies how much structural information is gleaned about a query node. Figure 2(a) presents the number of such queries out of 20 on the undirected DBLP graph, considering top-5 similarity scores. Other top-k plots are omitted, but show that with increasing k for a given sampling proportion there are more such queries even at lower \(\lambda \).

Next, we test the impact of sampling \(\eta \) and \(\lambda \) on ranking quality in RoleSim*. We plot the average ranking quality (normalized discounted cumulative gain (nDCG)), considering top-100 similar nodes of the sampled clone and comparing this to the baseline original query. We observe that the trend (with respect to \(\eta \)) seen in Fig. 2(b) and Fig. 3(b) for \(\lambda =1\) resembles that for RoleSim, and the trend for \(\lambda =0.5\) is close to that for SimRank.

Finally, we consider a fixed value of \(\lambda =0.7\) and confirm that the RoleSim* has higher ranking quality compared to SimRank and RoleSim, with respect to the average nDCG. Figure 2(c) with undirected DBLP graph shows that RoleSim* produces a more consistent nDCG even with small \(\eta \). For the directed AMZ graph in Fig. 3(c) too, RoleSim shows significant improvement at lower sampling, and the performance of SimRank is negatively affected throughout, while RoleSim* remains stable.

Table 2. Similarity rankings for “Philip S. Yu” on DBLP co-authorships data

Qualitative Case Study. Table 2 compares the similarity ranking results from three algorithms (SR, RS and RS*) for retrieving top-10 most similar authors w.r.t. query “Philip S. Yu” on DBLP. From the results, we see that the top rankings of RS* are similar to RS, highlighting its capability to effectively capture automorphic equivalent neighboring information. For instance, “Jure Leskovec” is top-ranked in RS* list. This is reasonable because he and “Philip S. Yu” have similar roles - they are both Professors in Computer Science with close research expertise (e.g., knowledge discovery, recommender systems, commonsense reasoning). However, the rankings of RS* are different from those of RS. For example, “Jure Leskovec” is ranked \({350}^{\text {th}}\) by SR, but \({4}^{\text {th}}\) by RS* and RS. This is because SimRank can only capture connected paths between two authors while ignoring their automorphic equivalent structure. “Jure Leskovec” has rare collaborations with “Philip S. Yu”, both direct and indirect, thus leading to a low SimRank score.

To evaluate RS* further, we choose two different values for \(\lambda \in \{0.6, 0.8\}\) to show how RS* ranking results are perturbed w.r.t.  \(\lambda \). From the results, we notice that, when \(\lambda \) is varied from 0.6 to 0.8, nodes with small SR scores (e.g., “Jure Leskovec”) exhibit a stable position in RS* ranking, whereas nodes having higher SR scores (e.g., “Huan Liu”) have a substantial change. This conforms with our intuition because “Huan Liu”’s collaboration with “Philip S. Yu” is closer than “Jure Leskovec”’s, and RS* is able to capture both connectivity and automorphic equivalence of two authors using a balanced weight \(\lambda \).

Fig. 4.
figure 4

Elapsed time comparison for different threshold-based RS*

Computational Time. Figure 4(a) compares the computational time of six algorithms (RS*, RS, MS, RS++, CS, SR) on various datasets (AMZ, DBLP, SYN), respectively. We notice that, on each dataset, RS* has comparable computational time to RS and MS. This implies that RS* achieves high accuracy without sacrificing running speed. In addition, RS*, RS, and MS are 2–4 times faster than RS++. This is because RS* need to find two maximum bipartite matchings for both in- and out-neighboring pairs, as opposed to RS* that involves the computation of only one matching. SR is slightly slower than RS*. This is consistent with our analysis as SR simply takes the average of all similarities of the in-neighboring pairs without the need to find the maximum bipartite matching. CS achieves the fastest speed since it simply assesses a node-pair similarity by aggregating their centrality values, thereby leading to low accuracy.

Figures 4(b) and (c) show the effect of iteration number k and threshold \(\delta \) on the running time of RS* on DBLP and AMZ, respectively. For each dataset, we vary \(\delta \) from 0 to 0.05. When \(\delta =0\), it reduces to RS* algorithm. From the results on both datasets, we discern that, for each fixed \(\delta \), the running time of threshold-based RS* increases as k grows. When \(\delta \) becomes larger, the growth rate of RS* time tends to be sublinear. For example, when \(\delta =0.05\) on DBLP, only after \(k=5\) iterations, the increasing time of threshold-based RS* has leveled off. In contrast, when \(\delta =0.01\), the time becomes steady after \(k=8\) iterations. The reason is that a higher setting of threshold \(\delta \) implies a larger number of pairs to be pruned per iteration, thus leading to the growth rate of the running time decreasing in an earlier stage during iterations.

5 Conclusion

We propose RoleSim*, a novel similarity model that guarantees automorphic equivalence and also considers neighboring similarity information beyond automorphically equivalent sets, thereby achieving better performance than both SimRank and RoleSim. We prove the existence and uniqueness of the RoleSim* solution, show that iteratively computing RoleSim* is bounded, and induce a RoleSim* distance obeying sum-transitivity of similarity scores. We also evaluate our model on DBLP, AMZ, and SYN datasets to demonstrate its superior ranking quality and comparable complexity to competitors.