1 Introduction

Management of uncertain graphs (or networks) has become an extremely important topic recently [8, 15, 16, 23, 29, 32, 35, 36, 53, 58], because graph-structured data that are uncertain or noisy by nature appear in diverse applications such as traffic networks and protein–protein interaction (PPI) networks. Uncertainty commonly arises for various reasons, such as anonymous communication data and data collected through automated sensors [3], and it is usually encoded in the graph edges. Hence, a general uncertain graph model is utilized in this paper, where each graph edge is represented by several possible weight values with associated probabilities. This graph model captures not only existence uncertainty (an edge virtually disappears if its weight is sufficiently large) but also weight uncertainty (each edge may take different weights).

Given a query node q and a data node set S, a reverse k-nearest neighbor (\(\hbox {R}k\hbox {NN}\)) query finds the set of data nodes \(s\in S\) such that q is one of s's kNN objects. \(\hbox {R}k\hbox {NN}\) search has a broad application base, including location-based services, decision support, profile-based marketing, cluster and outlier detection, traffic networks, adventure games, and molecular biology [6, 10, 18, 52]. In particular, there are two popular versions of \(\hbox {R}k\hbox {NN}\) search, i.e., the monochromatic one and the bichromatic one. The monochromatic \(\hbox {R}k\hbox {NN}\) query follows the definition above, where all objects are of the same type. In contrast, there are two distinct object types \(S_1\) and \(S_2\) in the bichromatic \(\hbox {R}k\hbox {NN}\) query [7, 22, 52]. Specifically, given two data node sets \(S_1\) (e.g., the customer set) and \(S_2\) (e.g., the restaurant competitors) and a query object q, a bichromatic \(\hbox {R}k\hbox {NN}\) query retrieves the objects \(s \in S_1\) such that q is one of their k-nearest neighbors from \(S_2\).

Example 1

\(\hbox {R}k\hbox {NN}\) search on uncertain traffic networks In traffic networks, the weight of a road segment indicates the travel time (e.g., the driving time) of this road segment. The exact road weight is very hard to obtain, due to reduced data-collection frequency (to save resources), device limitations, and wireless network traffic. In other words, because of the difficulty of measuring the exact traffic time of every road segment and the large scale of a traffic network, the road weight is usually uncertain [29]. Take the network shown in Fig. 1 as an example; assume that the nodes denote landmarks of interest on the road network, and the weights of each edge given in Fig. 1b represent the possible travel time values with their probabilities. For instance, the weight of edge \(e_{1,2}\) between nodes \(n_1\) and \(n_2\) is 2 with probability 0.3 and 3 with probability 0.7. This indicates that traversing the road segment between \(n_1\) and \(n_2\) takes 2 time units with probability \(30\%\) and 3 time units with probability \(70\%\).

Fig. 1 Illustration of an uncertain graph. a Graph structure. b Possible weights

Consider the scenario of evaluating the impact of opening a pizza restaurant at a selected location; the manager wants to examine how many residential areas would find the new pizza restaurant among their k-nearest choices [2, 14, 44], where the traveling time between a residential area and the pizza restaurant is a critical factor and should be taken into account [30]. In addition, to cut down costs when carrying out an advertising campaign, it would be profitable for a restaurant owner to send menu cards only to those customers who have his/her restaurant as one of their k-closest pizza restaurants. For instance, for a candidate pizza restaurant location at node \(n_3\) of Fig. 1, \(\hbox {R}k\hbox {NN}\) search can find all the residential buildings that take \(n_3\) as one of their k-nearest pizza restaurants on the uncertain traffic network. Note that the network also contains some irrelevant nodes, such as node \(n_1\) (e.g., shopping malls) in Fig. 1a. It is worth mentioning that the uncertain graph structure in Fig. 1 serves as a running example throughout the paper.

Example 2

\(\hbox {R}k\hbox {NN}\) search on uncertain PPI networks In protein–protein interaction (PPI) networks, nodes denote proteins, and edges represent interactions among them. The weight of the edge between a protein pair indicates the reliability of the interaction between the two proteins [28]. Since the interactions are derived through noisy and error-prone lab experiments, each edge is associated with an uncertainty value [4, 19, 36]. \(\hbox {R}k\hbox {NN}\) search can help to correlate protein essentiality with the placement of proteins in PPI networks [33]. In particular, the \(\hbox {R}k\hbox {NN}\) topology, consisting of all proteins and the edges to every protein from its \(\hbox {R}k\hbox {NNs}\) (generated from the PPI network), is used to examine the connection between essential proteins and their placement. Related results show that essential proteins are more likely to be proteins with many \(\hbox {R}k\hbox {NNs}\). Furthermore, \(\hbox {R}k\hbox {NN}\) search plays an important role in many gene ontology processes [33]. In addition, recent work employs \(\hbox {R}k\hbox {NN}\) search to identify novel inflammatory bowel disease (IBD)-related proteins via PPI networks [42]. An observed protein whose influenced proteins (i.e., \(\hbox {R}k\hbox {NNs}\)) are mostly known IBD-related proteins is statistically identified as a novel IBD-related protein. Therefore, the \(\hbox {R}k\hbox {NN}\) query with the statistical enrichment test is a great tool for identifying IBD-related proteins in order to better understand the complex disease mechanism.

\(\hbox {R}k\hbox {NN}\) retrieval on deterministic graphs has been well studied, including both the monochromatic version and the bichromatic version [52]. Nonetheless, uncertain graph processing brings new challenges because of its \(\sharp \)P-hard complexity. Thus, directly applying existing \(\hbox {R}k\hbox {NN}\) query algorithms for deterministic graphs to answer \(\hbox {UG-R}k\hbox {NN}\) queries cannot achieve good performance. Motivated by this, in this paper, we focus on the problem of \(\hbox {R}k\hbox {NN}\) search on uncertain graphs (\(\hbox {UG-R}k\hbox {NN}\) search for short), where the uncertain graph is modeled as a set of deterministic possible graphs under possible world semantics [16, 24, 29, 36, 54]. Based on the uncertain graph model, given a query node q and a data node set S on an uncertain graph, a (monochromatic) \(\hbox {UG-R}k\hbox {NN}\) query finds the set of data nodes \(s\in S\) such that q is one of s's kNN objects with probability not smaller than a specified probability threshold \(\theta \).

This paper is dedicated to the development of algorithms that can support \(\hbox {UG-R}k\hbox {NN}\) queries efficiently. Intuitively, in order to obtain the \(\hbox {UG-R}k\hbox {NN}\) query result, we can convert an \(\hbox {UG-R}k\hbox {NN}\) query into multiple \(\hbox {R}k\hbox {NN}\) queries on all the possible graphs, which can be supported by existing \(\hbox {R}k\hbox {NN}\) search algorithms. Let E, V, and \(\varOmega (e)\) be the set of edges, the set of nodes, and the set of possible weights associated with an edge e of an uncertain graph, respectively. It is not hard to see that there are in total \(\varPi _{e\in E}|\varOmega (e)|\) possible graphs w.r.t. a specified uncertain graph. If all the edges share the same number of weights w (i.e., \(\forall e \in E, |\varOmega (e)|=w\)), the total number of \(\hbox {R}k\hbox {NN}\) queries we have to process for answering an \(\hbox {UG-R}k\hbox {NN}\) query is \(w^{|E|}\), with a time complexity of \(O(w^{|E|}\cdot (|V|^2\cdot \lg |V|+|V|\cdot |E|))\), which is extremely expensive.

To make \(\hbox {UG-R}k\hbox {NN}\) query processing practical, we have to reduce the number of possible graphs (i.e., \(w^{|E|}\)) as much as possible. To this end, we propose three novel heuristics to significantly improve search performance. First, we present the graph structure pruning (GSP) heuristic to significantly reduce the value of |V| (and hence |E|) by replacing the original graph with a much smaller essential uncertain graph, thereby cutting down the number of possible graphs we have to evaluate. The effectiveness of GSP is proved analytically, as the number of nodes in the essential graph is theoretically bounded, and is demonstrated via experimental studies. Second, we present the equivalent graph removing (EGR) heuristic to further reduce the number of possible graphs we have to evaluate during \(\hbox {UG-R}k\hbox {NN}\) search, by removing equivalent possible graphs that share the same result sets with the evaluated possible graphs. Nevertheless, simply identifying those equivalent possible graphs is as costly as finding the \(\hbox {R}k\hbox {NN}\) query result on those possible graphs, as explained in Sect. 4. As a solution, we formulate the graph conditional dominance relationship to locate certain equivalent possible graphs at low cost. Third, we develop the probability bound pruning (PBP) heuristic to identify objects that are qualified or unqualified \(\hbox {UG-R}k\hbox {NN}\) query results as early as possible during the search, which helps to minimize the validation cost of data nodes. Based on the GSP, EGR, and PBP heuristics, we propose an exact algorithm called SDP to tackle \(\hbox {UG-R}k\hbox {NN}\) search.

Due to the \(\sharp \)P-hard complexity of \(\hbox {UG-R}k\hbox {NN}\) search, no exact algorithm can handle \(\hbox {UG-R}k\hbox {NN}\) search in polynomial time. In view of this, we present a novel sampling algorithm, termed TripS, to support \(\hbox {UG-R}k\hbox {NN}\) retrieval efficiently. Specifically, on top of the essential uncertain graph derived by GSP, TripS evaluates a group of sampled possible graphs using an adaptive stratified sampling technique (denoted as ASSP) to get the query result, instead of exhaustively evaluating every possible graph. As theoretically guaranteed, ASSP is unbiased, and its variance is at least as good as that of Monte Carlo sampling. TripS takes \(O(N\cdot (|E|+|V|^2 \cdot \lg |V|))\) time with N being the total number of samples. In brief, our key contributions in this paper are summarized as follows.

  • We identify and formalize the problem of \(\hbox {UG-R}k\hbox {NN}\) query processing under a general graph model with uncertain edge weights.

  • We present three heuristics, GSP, EGR, and PBP, to effectively shrink the scale of the uncertain graph and to reduce the number of possible graphs we have to evaluate during the search. Based on these heuristics, we develop an exact algorithm, i.e., SDP, for \(\hbox {UG-R}k\hbox {NN}\) retrieval.

  • We propose a novel algorithm, namely TripS, that integrates GSP with an adaptive stratified sampling algorithm, to perform \(\hbox {UG-R}k\hbox {NN}\) queries with high efficiency and theoretical guarantees. We also analyze the complexities of the presented algorithms, and extend our techniques to support bichromatic \(\hbox {UG-R}k\hbox {NN}\) search.

  • We conduct extensive experiments with both real and synthetic graphs to demonstrate the effectiveness and efficiency of our proposed heuristics and algorithms.

The rest of the paper is organized as follows. First, we present preliminaries of our work in Sect. 2. Then, we describe GSP and EGR heuristics for \(\hbox {UG-R}k\hbox {NN}\) search in Sects. 3 and 4, respectively. Our SDP and TripS algorithms are proposed in Sects. 5 and 6, respectively. We report experimental results and our findings in Sect. 7. Finally, we review related work in Sect. 8 and conclude our work with some directions for future work in Sect. 9.

2 Preliminaries

In this section, we first formalize the uncertain graph model and our \(\hbox {UG-R}k\hbox {NN}\) problem, and then, we present a baseline algorithm, denoted as Baseline, for \(\hbox {UG-R}k\hbox {NN}\) search. Table 1 lists the symbols used frequently throughout this paper.

2.1 Problem formulation

Definition 1

(Uncertain graph) An uncertain weighted graph is a quadruple \(\mathcal {G} = (V, E, \varOmega , P)\), in which V denotes a set of nodes and E represents a set of edges. Each edge e is a random variable, \(\varOmega (e)\) is the sample space of e, and \(P(x\in \varOmega (e))\) is the probability mass function of e. In other words, \(\varOmega \): \(E \rightarrow \mathcal {P}(\mathcal {R})\), i.e., \(\varOmega \) maps each edge to a set of real values; and P: \(E \rightarrow (\mathcal {R} \rightarrow [0,1])\) is a function that maps each edge to its probability mass function.

Table 1 Symbols and description

In fact, \(\varOmega (e)\) is a set of positive weights, i.e., \(\varOmega (e) (\subseteq \varOmega ) =\{\omega _1(e), \omega _2(e), \ldots , \omega _s(e)\}\), in which the \(\omega _i(e)\)s (\(i=1, 2, \ldots , s\)) are the possible positive weights of an edge e. Function P assigns each weight \(\omega _i(e)\in \varOmega (e) (\subseteq \varOmega \)) of the edge e a probability \(p(e, \omega _i(e))\), which indicates the likelihood of \(\omega _i(e)\), with \(\sum _{\omega _i(e)\in \varOmega (e)}\) \(p(e,\omega _i(e)) = 1\). As formally defined in Definition 1, an uncertain graph actually refers to a graph whose edges have one or multiple weights with respective probabilities. Note that our uncertain graph model is similar to the one proposed in [29], and we focus on the discrete random variable case for every edge weight for the following reasons: (i) real-world uncertain data do not always follow any common probability distribution, e.g., the Gaussian distribution; (ii) it is not always possible to capture accurately the probability density function (pdf) of real-world data; and (iii) sampling is often employed to obtain the distribution of random data in reality, which discretizes continuous random variables. In addition, our uncertain graph model is more general than the common model used in [16, 24, 36, 54]. Specifically, our model considers not only the existence probabilities of edges (i.e., existence uncertainty) but also allows each edge to have different weights with different probabilities (i.e., weight uncertainty).
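For concreteness, the discrete model of Definition 1 can be encoded directly as a mapping from edges to probability mass functions. The following is a minimal Python sketch of such an encoding; only the pmf of \(e_{1,2}\) comes from Fig. 1b, while the remaining edges and values are illustrative placeholders (hypothetical, not the paper's data).

```python
# Minimal sketch of the model in Definition 1: each edge e maps to its pmf
# {weight: p(e, weight)}, with probabilities summing to 1 over Omega(e).
# Only e_{1,2} uses the values of Fig. 1b; the other pmfs are placeholders.
uncertain_graph = {
    ("n1", "n2"): {2: 0.3, 3: 0.7},   # Omega(e_{1,2}) = {2, 3}, from Fig. 1b
    ("n1", "n3"): {1: 0.6, 4: 0.4},   # illustrative values
    ("n3", "n4"): {2: 0.5, 5: 0.5},   # illustrative values
}

# Sanity check: each pmf must sum to 1 over Omega(e).
for edge, pmf in uncertain_graph.items():
    assert abs(sum(pmf.values()) - 1.0) < 1e-9, edge
```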

An uncertain graph \(\mathcal {G}\) can generate a set of possible graphs (denoted as set \(\varPhi (\mathcal {G})\)), as formalized in Definition 2. Different from the uncertain graph, a possible graph is a deterministic graph with each edge e having exactly one constant weight \(\omega (e)\). It refers to a single instance of an uncertain graph \(\mathcal {G}\), with every edge assigned one and only one weight. Since the likelihood of a given weight \(\omega _i(e)\) could be very different from that of another weight \(\omega _j(e)\), the chance that one possible graph \(G_i\) appears could be very different from the presence probability of another possible graph \(G_j\). In order to capture how likely a specified possible graph may appear, we introduce the existence probability, as defined in Eq. 1. In summary, an uncertain graph \(\mathcal {G}\) is actually the union of all the possible graphs derived from \(\mathcal {G}\) with corresponding existence probabilities, i.e., \(\mathcal {G}= \{(G_i, \Pr (G_i))~|~ G_i \in \varPhi (\mathcal {G})\}\).

Fig. 2 Illustration of possible graphs for the uncertain graph in Fig. 1. a \(G_1\): \(\Pr (G_1)\) = 0.0036288. b \(G_2\): \(\Pr (G_2)\) = 0.0013608. c \(G^\mathrm{{min}}\): \(\Pr (G^\mathrm{{min}})\) = 0.0127008. d \(G^{\max }\): \(\Pr (G^{\max })\) = 0.0001512

Definition 2

(Possible graph) Given an uncertain graph \(\mathcal {G}\), a possible world of the uncertain graph (a so-called possible graph), denoted as G \(=\) (V, E, W), is an instance of \(\mathcal {G}\), where \(W: E \rightarrow \mathbb {R}\) is the weight function that assigns each edge \(e\in E\) a positive weight \(\omega _G(e)\) with \(\omega _G(e) \in \varOmega (e)\). The existence probability of a possible graph G [16, 24], denoted as \(\Pr (G)\), is derived below.

$$\begin{aligned} \Pr (G)=\prod _{ e \in E} p(e, \omega _G(e)) \end{aligned}$$
(1)

Given a set of possible graphs w.r.t. an uncertain graph \(\mathcal {G}\), we would like to point out that two graphs, denoted as \(G^\mathrm{{min}}\) and \(G^{\max }\), are special. \(G^\mathrm{{min}}\) (\(G^{\max }\)) refers to the possible graph in which every edge is assigned to its minimum (maximum) possible weight, i.e., \(\forall e\in E\), \(\forall \omega _i(e)\in \varOmega (e)\), and \(\forall G\in \varPhi (\mathcal {G})\), \(\omega _{G^\mathrm{{min}}}(e)\) \(\le \) \(\omega _G(e)\) \(\le \) \(\omega _{G^{\max }}(e)\). We plot \(G^\mathrm{{min}}\) and \(G^{\max }\) of our sample uncertain graph in Fig. 2 for ease of understanding.
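Under this model, the set \(\varPhi (\mathcal {G})\), Eq. 1, and the extreme graphs \(G^\mathrm{{min}}\) and \(G^{\max }\) admit a direct enumeration. The Python sketch below reuses the dictionary encoding above and makes the exponential blow-up of Sect. 1 explicit; the existence probabilities of all enumerated graphs sum to 1.

```python
import itertools
import math

def possible_graphs(ug):
    """Enumerate Phi(G): every assignment of one weight per edge, together
    with its existence probability Pr(G) = prod_e p(e, w_G(e)), per Eq. 1."""
    edges = list(ug)
    for combo in itertools.product(*(ug[e].items() for e in edges)):
        weights = {e: w for e, (w, _) in zip(edges, combo)}
        pr = math.prod(p for (_, p) in combo)
        yield weights, pr

def extreme_graph(ug, pick=min):
    """G^min (pick=min) or G^max (pick=max): every edge at its extreme weight."""
    return {e: pick(ug[e]) for e in ug}
```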

The reason we emphasize the possible graphs \(G^\mathrm{{min}}\) and \(G^{\max }\) is the following observation. Given a pair of nodes \(n_i\) and \(n_j\), the shortest path distance derived from \(G^\mathrm{{min}}\) (\(G^{\max }\)) provides a lower bound (an upper bound) for the shortest path distance derived from any possible graph, as stated in Lemma 1, where function \(d_G(n_i\), \(n_j)\) returns the shortest path distance between \(n_i\) and \(n_j\) in a graph G.

Lemma 1

Given an uncertain graph \(\mathcal {G}(V,E,\varOmega ,P)\), let \(G_t\) be a possible graph in \(\varPhi (\mathcal {G})\). For any pair of connected nodes \(n_i, n_j\in V\), it is guaranteed that \(d_{G^\mathrm{{min}}}(n_i,\) \(n_j)\) \(\le \) \(d_{G_t}(n_i,\) \(n_j)\) \(\le \) \(d_{G^{\max }}(n_i,n_j)\).

Proof

Let \(\textit{SP}_G(n_i, n_j)\) denote the shortest path between \(n_i\) and \(n_j\) on a graph G, and let \(d_{G'}(\textit{SP}_G(n_i, n_j))\) denote the length of this path measured with the edge weights of a graph \(G'\). Based on the property that \(\forall e\in E\) and \(\forall \omega _i(e)\in \varOmega (e)\), \(\omega _{G^\mathrm{{min}}}(e) \le \omega _{G_t}(e) \le \omega _{G^{\max }}(e)\), we have

$$\begin{aligned} d_{G^\mathrm{{min}}}(n_i, n_j)= & {} d_{G^\mathrm{{min}}}(\textit{SP}_{G^\mathrm{{min}}}(n_i, n_j)) \\\le & {} d_{G^\mathrm{{min}}}(\textit{SP}_{G_t}(n_i, n_j)) \\\le & {} d_{G_t}(\textit{SP}_{G_t}(n_i, n_j)) = d_{G_t}(n_i, n_j). \end{aligned}$$

Thus, \(d_{G^\mathrm{{min}}}(n_i,n_j) \le d_{G_t}(n_i,n_j)\) holds. Similarly, we can derive that \(d_{G_t}(n_i,\) \(n_j)\) \(\le \) \(d_{G^{\max }}(n_i,\) \(n_j)\) holds. Therefore, the proof completes. \(\square \)

Based on the uncertain graph model, we formalize \(\hbox {R}k\hbox {NN}\) search and \(\hbox {UG-R}k\hbox {NN}\) query below. Note that function \(d_G^k(s,\) S) returns the shortest path distance from s to its kth nearest node in set S on a deterministic graph G.

Definition 3

(\(\hbox {R}k\hbox {NN}\) search [52]) Given a deterministic graph \(G=(V, E, W)\), a relevant data set \(S \subseteq V\), and a query node \(q\in V\), an \(\hbox {R}k\hbox {NN}\) query returns all the objects \(s\in S\) that have q as one of their k-nearest neighbors, i.e.,

$$\begin{aligned} \mathrm {R}k\mathrm {NN}(G,~q)=\{s\in S~\mid ~d_G(q,s) \le d_G^k(s, S)\}. \end{aligned}$$
(2)
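Definition 3 translates into Dijkstra computations from q and from every \(s\in S\). The following sketch assumes the networkx library with edge weights stored in the attribute weight; following the convention suggested by the running example in Sect. 2.1, s itself (at distance 0) counts among its own kNNs in S.

```python
import networkx as nx

def rknn(G, q, S, k):
    """RkNN(G, q) per Eq. 2: all s in S with d_G(q, s) <= d_G^k(s, S)."""
    dist_q = nx.single_source_dijkstra_path_length(G, q, weight="weight")
    result = set()
    for s in S:
        if s not in dist_q:                 # s unreachable from q
            continue
        dist_s = nx.single_source_dijkstra_path_length(G, s, weight="weight")
        # d_G^k(s, S): kth smallest distance from s to a node of S
        # (s itself counts at distance 0, matching the running example).
        knn_dists = sorted(dist_s[t] for t in S if t in dist_s)
        if len(knn_dists) >= k and dist_q[s] <= knn_dists[k - 1]:
            result.add(s)
    return result
```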

Definition 4

(\(\hbox {UG-R}k\hbox {NN}\) search) Given an uncertain graph \(\mathcal {G} =\) (V, E, \(\varOmega \), P), a relevant data set \(S \subseteq V\), a user-specified parameter \(\theta \) \((0<\theta \le 1)\), and a query node \(q\in V\), an \(\hbox {UG-R}k\hbox {NN}\) query retrieves all the objects \(s\in S\) that take q as one of their k-nearest neighbors with probability at least \(\theta \). Formally,

$$\begin{aligned} \mathrm {UG}\text {-}\mathrm {R}k\mathrm {NN}(\mathcal {G},~q)=\Big \{ s\in S~\mid ~\sum _{G\in \varPhi (\mathcal {G}):\, s\in \mathrm {R}k\mathrm {NN}(G,~q)}\Pr (G)\ge \theta \Big \} \end{aligned}$$

Returning to our sample graph in Fig. 1, suppose S = \(\{n_2, n_4, n_6, n_7\}\) and the query node q is \(n_3\). Node \(n_2\) is an R2NN object of q on the possible graph \(G_1\) depicted in Fig. 2, as \(d_{G_1}(q,\) \(n_2)\) \(=\) \(d_{G_1}^2(n_2, S)\) (=3), referring to Definition 3. Similarly, nodes \(n_4\) and \(n_7\) are also R2NN objects of q on \(G_1\). In addition, given an uncertain graph, the probability of a node \(s\in S\) being an UG-R2NN object of q is the accumulated existence probability of all the possible graphs on which s is one of q's \(\hbox {R}k\hbox {NN}\) objects. Hence, for the sample uncertain graph shown in Fig. 1, the possible graph \(G_1\) contributes its existence probability 0.0036288 to the probabilities of \(n_2\), \(n_4\), and \(n_7\) being UG-R2NN objects.

The formalization of \(\hbox {UG-R}k\hbox {NN}\) search in Definition 4 follows the previous work on \(\hbox {R}k\hbox {NN}\) queries over uncertain data [5, 6, 26]. Compared with formulations based on expected distances or expected ranks [36], it has two main advantages: (i) it accurately calculates, for each object, the probability of being an answer object, which improves search accuracy; and (ii) \(\hbox {UG-R}k\hbox {NN}\) search under this definition is adaptive and flexible, since the user-specified probability threshold \(\theta \) is tunable. It is worth mentioning that our \(\hbox {UG-R}k\hbox {NN}\) definition follows important previous efforts on querying uncertain graphs [16, 24, 25, 29].

In addition, the \(\hbox {UG-R}k\hbox {NN}\) problem is \(\sharp \)P-hard. First, kNN search on uncertain graphs (UG-kNN) is a \(\sharp \)P-complete problem, as proved in [24]. Second, deciding whether a query node q belongs to the kNN objects of a data node \(o\in S\) with probability at least \(\theta \) (i.e., the decision version of UG-kNN) can be answered by issuing an \(\hbox {UG-R}k\hbox {NN}\) query at q and checking whether o appears in the result. Hence, the UG-kNN problem can be reduced to \(\hbox {UG-R}k\hbox {NN}\) search in polynomial time, and \(\hbox {UG-R}k\hbox {NN}\) search is a \(\sharp \)P-hard problem.

We would like to mention that, due to space limitations, we present \(\hbox {UG-R}k\hbox {NN}\) search mainly for one input dataset S (i.e., the monochromatic version), but the techniques proposed in this paper can be easily adjusted to support the bichromatic version, as discussed at the end of Sect. 6 and experimentally confirmed in Sect. 7. For the same reason, we assume that (i) the data nodes in S and the query node q are located on nodes (i.e., \(S\subseteq V\) and \(q\in V\)), while our techniques can directly support the case where they are located on edges, and (ii) the graph edges have independent probability mass functions, while our techniques can also be generalized to support the \(\hbox {UG-R}k\hbox {NN}\) query when the weights of a set of edges are correlated [13, 40], where the existence probability of any possible world (i.e., the joint probability mass function) can be derived by inference techniques for probabilistic graphical models (PGMs) such as the junction tree algorithm and belief propagation. In particular, the pruning rule of GSP (to be presented in Sect. 3) remains applicable no matter whether edge weights are independent or not.

Furthermore, the techniques proposed in this paper can be generalized to tackle the \(\hbox {UG-R}k\hbox {NN}\) query under the graph model with uncertain nodes (e.g., considering the presence or absence of nodes and their adjacent edges with a given probability), which has a large application base in unreliable P2P networks and uncertain RDF graphs [27]. However, we leave the exploration of \(\hbox {UG-R}k\hbox {NN}\) search under such a graph model (with uncertain nodes) to future work.

2.2 Baseline algorithm

As mentioned in Sect. 1, the \(\hbox {UG-R}k\hbox {NN}\) query can be answered by performing \(\hbox {R}k\hbox {NN}\) search on every possible graph. Next, we present this brute-force solution, denoted as Baseline, with its pseudo-code depicted in Algorithm 1. Baseline takes as inputs an uncertain graph \(\mathcal {G}\), a node set S, a query node q, and parameters k and \(\theta \), and outputs the result set \(S_G\). It maintains the probability \(\Pr (s)\) for each node \(s\in S\) to record the accumulative probability of q being one of s’s kNN objects, which is initialized to 0 (line 1). Thereafter, for every possible graph G from \(\varPhi (\mathcal {G})\), Baseline retrieves its \(\hbox {R}k\hbox {NN}\) objects using lazy-EP algorithm proposed in [52], and updates \(\Pr (s)\) of answer objects s accordingly (lines 2–5). Finally, the algorithm returns all the nodes s with \(\Pr (s)\ge \theta \), and terminates.

(Algorithm 1: pseudo-code of Baseline)

\(\hbox {R}k\hbox {NN}\) search on a deterministic graph takes \(O(|V|^2\cdot \lg |V|+|V|\cdot |E|)\) time, assuming a Dijkstra-like algorithm is adopted [16, 29]. Consequently, Baseline needs \(O(\varPi _{ e\in E}|\varOmega (e)|\cdot (|V|^2\cdot \lg |V|+|V|\cdot |E|))\) time to perform \(\hbox {UG-R}k\hbox {NN}\) search. If we assume that all the edges share the same number of distinct weights, denoted as w, Baseline takes \(O(w^{|E|}\cdot (|V|^2 \cdot \lg |V|+|V|\cdot |E|))\) time. Thus, Baseline is very inefficient, and we have to improve the efficiency of \(\hbox {UG-R}k\hbox {NN}\) retrieval for its practicability in real-life applications.
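For reference, the control flow of Algorithm 1 can be sketched by combining possible_graphs() and rknn() from the earlier sketches; the exponential loop over \(\varPhi (\mathcal {G})\) is exactly the cost analyzed above.

```python
import networkx as nx

def baseline_ug_rknn(ug, q, S, k, theta):
    """Sketch of Algorithm 1 (Baseline): exhaustive RkNN over Phi(G)."""
    pr = {s: 0.0 for s in S}                     # line 1: Pr(s) <- 0
    for weights, pr_g in possible_graphs(ug):    # lines 2-5
        G = nx.Graph()
        for (u, v), w in weights.items():
            G.add_edge(u, v, weight=w)
        for s in rknn(G, q, S, k):               # RkNN on this possible graph
            pr[s] += pr_g                        # accumulate Pr(G)
    return {s for s in S if pr[s] >= theta}      # final threshold check
```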

3 Graph structure pruning

Based on the above discussion, we understand that there are in total \(\varPi _{ e\in E}|\varOmega (e)|\) possible graphs corresponding to a given uncertain graph \(\mathcal {G}\). Given the fact that \(\varOmega (e)\) is determined by the application, we can consider reducing the value of |E|. To this end, we propose a novel pruning technique, i.e., graph structure pruning (GSP), to reduce the number of nodes (i.e., |V|) and accordingly the number of edges (i.e., |E|). Specifically, GSP serves to reduce the number of possible graphs we have to consider during \(\hbox {UG-R}k\hbox {NN}\) search. Before presenting GSP, we first introduce the concept of dominance between two deterministic graphs, formally defined in Definition 5.

Definition 5

(Graph dominance) Given two possible graphs \(G_i=(V, E, W')\) and \(G_j=(V, E, W'')\) w.r.t. an uncertain graph \(\mathcal {G}\), we say that \(G_i\) dominates \(G_j\), denoted as \(G_i \prec G_j\), iff (i) \(\forall e\in E\), \(\omega _{G_i}(e) \le \omega _{G_j}(e)\), and (ii) \(\exists e'\in E\), \(\omega _{G_i}(e') < \omega _{G_j}(e')\).

Take the possible graphs shown in Fig. 2 as an example. There is no dominance relationship between \(G_1\) and \(G_2\) due to \(\omega _{G_1}(e_{3,4}) < \omega _{G_2}(e_{3,4})\) and \(\omega _{G_1}(e_{2,4}) > \omega _{G_2}(e_{2,4})\). Graph \(G^\mathrm{{min}}\) dominates \(G_1\) since edges \(e_{2,4}\) and \(e_{4,5}\) both have smaller weights in \(G^\mathrm{{min}}\) than their weights in \(G_1\), and all the other edges have the same weights in \(G^\mathrm{{min}}\) and \(G_1\). Obviously, \(G^\mathrm{{min}}\) dominates all the other possible graphs, while \(G^{\max }\) is dominated by all the other possible graphs.
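As a sketch, the dominance test of Definition 5 over two weight assignments (edge \(\rightarrow \) weight dictionaries on the same edge set) is a direct translation of conditions (i) and (ii):

```python
def dominates(w_i, w_j):
    """G_i dominates G_j (Definition 5), for weight maps over the same edges."""
    assert w_i.keys() == w_j.keys()
    return (all(w_i[e] <= w_j[e] for e in w_i)      # condition (i)
            and any(w_i[e] < w_j[e] for e in w_i))  # condition (ii)
```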

The reason we introduce the graph dominance relationship is that we want to present two important properties on the shortest path distances between nodes in a given possible graph \(G_j\) and another possible graph \(G_i\) that dominates \(G_j\), as stated in Lemma 2.

Lemma 2

Given an uncertain graph \(\mathcal {G}\), let \(G_i, G_j \in \varPhi (\mathcal {G})\) be two possible graphs such that \(G_i \prec G_j\). It is confirmed that (i) \(\forall n_1, n_2 \in V\), their shortest path distance on \(G_i\) is bounded by that on \(G_j\), i.e., \(\forall n_1, n_2\in V\), \(d_{G_i}(n_1, n_2)\le d_{G_j}(n_1, n_2);\) and (ii) \(\forall n \in V\) and \(\forall S \subseteq V\), the shortest path distance between n and its kth nearest neighbor in S on \(G_i\) cannot exceed that on \(G_j\), i.e., \(\forall n \in V\) and \(\forall S \subseteq V\), \(d^k_{G_i}(n, S) \le d^k_{G_j}(n, S)\).

Proof

As \(G_i \prec G_j\), it is certain that \(\forall e\in E\), \(\omega _{G_i}(e) \le \omega _{G_j}(e)\), and \(\exists e'\in E\), \(\omega _{G_i}(e') < \omega _{G_j}(e')\) according to Definition 5. First, we prove statement (i) \(d_{G_i}(n_1, n_2)\le d_{G_j}(n_1, n_2)\) as follows.

$$\begin{aligned} d_{G_i}(n_1, n_2)= & {} d_{G_i}(\textit{SP}_{G_i}(n_1, n_2)) \le d_{G_i}(\textit{SP}_{G_j}(n_1, n_2)) \\\le & {} d_{G_j}(SP_{G_j}(n_1, n_2)) = d_{G_j}(n_1, n_2). \end{aligned}$$

Then, for statement (ii), we can get

$$\begin{aligned} d_{G_j}^k(n, S)= & {} \max \{d_{G_j}(n, s)~\mid ~s \in k\mathrm {NN}(G_j, n, S)\} \\\ge & {} \max \{d_{G_i}(n, s)~\mid ~s \in k\mathrm {NN}(G_j, n, S)\}\\\ge & {} d_{G_i}^k(n, S). \end{aligned}$$

Here, \(k\mathrm {NN}(G, n, S)\) denotes the set of k objects in S that are closest to n on G. Thus, we have \(d_{G_i}(n_1, n_2)\le d_{G_j}(n_1, n_2)\) and \(d^k_{G_i}(n, S) \le d^k_{G_j}(n, S)\). The proof completes. \(\square \)

Again, we use \(G_1\) and \(G^\mathrm{{min}}\) depicted in Fig. 2 as an example. As \(G^\mathrm{{min}} \prec G_1\), the distance between any two nodes on \(G^\mathrm{{min}}\) is definitely bounded by that on \(G_1\), e.g., \(d_{G^\mathrm{{min}}}(n_5, n_2) (=5) < d_{G_1}(n_5, n_2) (=7)\) and \(d_{G^\mathrm{{min}}}(n_2, n_3)=d_{G_1}(n_2, n_3)= 3\). In addition, the distance from a given node n to its kth nearest node from S in \(G^\mathrm{{min}}\) cannot exceed that in \(G_1\). Take node \(n_2\) as an example. Given \(S=\{n_2,n_4,n_6,n_7\}\), we have \(d^3_{G^\mathrm{{min}}}(n_2, S) =d_{G^\mathrm{{min}}}(n_2, n_4)\) \(=\) 3 and \(d^3_{G_1}(n_2, S) = d_{G_1}(n_2,n_4)\) \(=\) 5. It is worth noting that the kth nearest node of a specified node n in one possible graph might be different from its kth nearest node in another possible graph, although \(n_2\) has the same third nearest node (i.e., \(n_4\)) in both \(G^\mathrm{{min}}\) and \(G_1\).

Based on Lemma 2 and the fact that \(G^\mathrm{{min}}\) (\(G^{\max }\)) dominates (is dominated by) all the other possible graphs, we derive the upper bound of the distance between a given node n and its kth nearest node in any possible graph, and use this upper bound to prune away all the nodes that cannot be one of n’s kNN objects in any given possible graph, as stated in Lemma 3. Here, function Traverse(n, \(d_{G^{\max }}^k(n, S)\), \(G^\mathrm{{min}}\)) returns all the nodes in the graph \(G^\mathrm{{min}}\) that are located at most \(d_{G^{\max }}^k(n, S)\) away from n.
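Traverse amounts to a Dijkstra expansion cut off at the distance bound. A self-contained sketch over a plain adjacency-list encoding (node \(\rightarrow \) list of (neighbor, weight) pairs, a hypothetical layout):

```python
import heapq

def traverse(G, n, bound):
    """Traverse(n, bound, G): all nodes of G within shortest-path distance
    `bound` of n, via a distance-bounded Dijkstra expansion."""
    dist = {n: 0.0}
    heap = [(0.0, n)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in G.get(u, []):
            nd = d + w
            if nd <= bound and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return set(dist)                      # all reached nodes, n included
```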

Lemma 3

Given an uncertain graph \(\mathcal {G}\), let \(G \in \varPhi (\mathcal {G})\) be a possible graph, \(n \in V\) be a node, and \(S\subseteq V\) be a data node set. It is confirmed that \(d^k_{G^\mathrm{{min}}}(n, S) \le d^k_G(n, S) \le d^k_{G^{\max }}(n,S)\). Consequently, the k-nearest nodes to n are located at most \(d^k_{G^{\max }}(n,S)\) away from n in \(G^\mathrm{{min}}\), i.e., \(k\mathrm {NN}(G, n, S)\) \(\subseteq \) Traverse (n, \(d_{G^{\max }}^k(n, S)\), \(G^\mathrm{{min}})\).

Proof

We prove \(k\mathrm {NN}(G, n, S)\) \(\subseteq \) Traverse(n, \(d_{G^{\max }}^k(n, S)\), \(G^\mathrm{{min}})\) by contradiction. Assume that there is a node \(n'\in V\) such that one of its k-nearest nodes o in \(G_i \in \varPhi (\mathcal {G})\) has its distance to \(n'\) on \(G^\mathrm{{min}}\) larger than \(d_{G^{\max }}^k(n',\) S), i.e., \(\exists o \in k\)NN\((G_i, n', S)\) such that \(d_{G^\mathrm{{min}}}(n',\) o) > \(d_{G^{\max }}^k(n',\) S). On the one hand, based on Lemma 2 and the assumption that o \(\in k\)NN\((G_i\), \(n'\), S), it is certain that \(d_{G^\mathrm{{min}}}(o,\) \(n')\) \(\le \) \(d_{G_i}(o,\) \(n')\) \(\le \) \(d_{G_i}^k(n',\) S). On the other hand, according to Definition 5, we have \(G^\mathrm{{min}} \prec G \prec G^{\max }\) for any \(G \in \varPhi (\mathcal {G})-\{G^\mathrm{{min}}, G^{\max }\}\). Then, based on statement (ii) of Lemma 2, we can guarantee \(d_{G^\mathrm{{min}}}^k(n', S)\le d_{G_i}^k(n', S)\le d_{G^{\max }}^k(n', S)\). Next, we can derive that \(d_{G^\mathrm{{min}}}(o, n')\) \(\le \) \(d^k_{G^{\max }}(n',\) S), i.e., \(o\in \) Traverse(\(n'\), \(d_{G^{\max }}^k\) \((n',\) S), \(G^\mathrm{{min}}\)), which contradicts our assumption. The proof completes. \(\square \)

(Algorithm 2: pseudo-code of GSP)

Applying Lemma 3 to \(\hbox {UG-R}k\hbox {NN}\) search, we are certain that for any given node n, if a query node \(q\notin \) Traverse(n, \(d_{G^{\max }}^k(n, S)\), \(G^\mathrm{{min}}\)), then q cannot be one of n's kNN objects in any possible graph, and thus, node n cannot be an \(\hbox {UG-R}k\hbox {NN}\) object for q, as stated in Lemma 4.

Lemma 4

(Candidate result set) Given an uncertain graph \(\mathcal {G}\), a node set \(S\subseteq V\), and a query node q, let \(N_c\) be the set of nodes \(n\in V\) satisfying \(d_{G^\mathrm{{min}}}(n, q) \le d_{G^{\max }}^k(n, S)\), i.e.,

$$\begin{aligned} N_c= \{n\in V~\mid ~d_{G^\mathrm{{min}}}(n, q) \le d_{G^{\max }}^k(n, S)\}. \end{aligned}$$
(3)

Then, \(\mathrm {UG}\)-\(\mathrm {R}k\mathrm {NN}(\mathcal {G}, q)\) \(\subseteq N_c\), and nodes in \((V-N_c)\) can be safely pruned.

Proof

Assume that the above statement is invalid, and there is at least one answer object \(n'\in \hbox {UG-R}k\hbox {NN}(\mathcal {G}, q\)) such that \(n'\in \) (\(V-N_c\)) with \(d_{G^\mathrm{{min}}}(n', q)>\) \(d_{G^{\max }}^k(n', S)\). According to Lemma 2, we have \(d_{G_i}(n', q)\ge d_{G^\mathrm{{min}}}(n', q)\), and hence, \(d_{G_i}(n', q)>d_{G^{\max }}^k(n', S)\ge d_{G_i}^k(n', S)\), i.e., \(n'\) is not an answer object. Therefore, our assumption is invalid, and the proof completes. \(\square \)

Table 2 Procedure of graph structure pruning

Based on the fact that set \(N_c\) is a superset that covers all the answer objects for an \(\hbox {UG-R}k\hbox {NN}\) query, we present graph structure pruning (GSP) in Heuristic 1. GSP forms an essential uncertain graph, denoted as \(\mathcal {G}_e\), which is a subgraph of \(\mathcal {G}\) and meanwhile contains all the candidate nodes in \(N_c\) for an \(\hbox {UG-R}k\hbox {NN}\) query. The node set \(N_g\) consists of all the nodes of the essential uncertain graph; it is a superset of \(N_c\), i.e., \(N_c\subseteq N_g\), as stated in Lemma 5.

Heuristic 1

(Graph structure pruning) Given an uncertain graph \(\mathcal {G}\), a data node set \(S\subseteq V\), and a query node q, let \(N_g\) represent the set of nodes returned by Traverse\((n, d_{G^{\max }}^k(n, S), G^\mathrm{{min}})\) for the nodes \(n\in N_c\) (satisfying \(d_{G^\mathrm{{min}}}(n, q) \le d_{G^{\max }}^k(n, S)\)), i.e.,

$$\begin{aligned} N_g=\bigcup _{n\in N_c} \textit{Traverse}\big (n,~d_{G^{\max }}^k(n, S),~G^\mathrm{{min}}\big ) \end{aligned}$$
(4)

For \(\hbox {UG-R}k\hbox {NN}\) search, it only needs to consider the subgraph, namely the essential uncertain graph \(\mathcal {G}_e\), formed by \(N_g\) and corresponding edges between the nodes in \(N_g\).

Proof

To the contrary, we assume that evaluating \(\mathcal {G}_e\) is not sufficient to answer \(\hbox {UG-R}k\hbox {NN}\) search on \(\mathcal {G}\), as there is at least one object/node \(s'\in (V-N_g)\) such that (i) \(s'\) is an answer object in \(G_i \in \varPhi (\mathcal {G})\), or (ii) \(s'\) directly affects the decision on whether a data object \(o\in N_c\) is an answer object in \(G_j \in \varPhi (\mathcal {G})\) or not. First, we prove case (i) is impossible. According to Lemma 4, the answer object \(s'\in N_c\) and \(s'\) \(\in \) Traverse(\(s'\), \(d_{G^{\max }}^k(s', S)\), \(G^\mathrm{{min}}\)). Therefore, we can confirm that \(s'\in N_g\) based on Eq. 4, which contradicts our assumption that \(s'\in (V-N_g)\). Hence, case (i) is impossible. Next, we prove case (ii) is also invalid. According to Lemma 3 and Eq. 4, for each candidate object \(o\in N_c\), all its possible k-nearest nodes are included in \(N_g\), i.e., kNN(\(G_j, o, S\)) \(\subseteq \) Traverse(o, \(d_{G^{\max }}^k(o\), \(S), G^\mathrm{{min}}\)) \(\subseteq \) \(N_g\). Thus, \(\forall s'\in (V-N_g)\), it cannot change whether any object \(o\in N_c\) is an answer object or not, i.e., case (ii) is also invalid. The proof completes. \(\square \)

Lemma 5

Given node sets \(N_c\) and \(N_g\) w.r.t. a given uncertain graph \(\mathcal {G}\), it is confirmed that \(N_c \subseteq N_g\). Accordingly, \(\mathcal {G}_e\) covers all the candidates in \(N_c\) for the \(\hbox {UG-R}k\hbox {NN}\) query.

Proof

Since every node \(n\in N_c\) belongs to Traverse(n, \(d_{G^{\max }}^k(n, S)\), \(G^\mathrm{{min}}\)) (its distance to itself is 0), \(N_c\subseteq N_g\) follows directly from Eq. 4, and hence \(\mathcal {G}_e\), formed by \(N_g\), covers all the candidates in \(N_c\). \(\square \)

Heuristic 1 suggests that only an essential uncertain graph \(\mathcal {G}_e\), a subgraph of a specified uncertain graph \(\mathcal {G}\), is necessary for processing an \(\hbox {UG-R}k\hbox {NN}\) query. Thus, we present the GSP algorithm as the implementation of Heuristic 1 to form the essential uncertain graph \(\mathcal {G}_e\), with its pseudo-code listed in Algorithm 2. First, GSP initializes a min-heap H, a candidate result set \(N_c\), and a node set \(N_g\) of the essential uncertain graph, and then it gets the possible graphs \(G^\mathrm{{min}}\) and \(G^{\max }\) (lines 1–2). Note that all the entries in H correspond to nodes \(n\in V\) in the form of \(\langle n,d_{G^\mathrm{{min}}}(n,q) \rangle \), sorted in ascending order of their distances to a query node q on \(G^\mathrm{{min}}\). Next, GSP starts the network expansion to retrieve the candidate answer objects, guided by H (lines 3–11). Until H is empty, GSP de-heaps the head entry \(\langle n,d_{G^\mathrm{{min}}}(n,q) \rangle \) from H, and processes n only if n has not yet been visited. It derives the upper bound \(d^{\max }\) of n's distance to its kth nearest node in any possible graph, and compares \(d^{\max }\) with the lower bound of the distance from n to q (lines 6–7). When the lower bound does not exceed \(d^{\max }\), node n is inserted into \(N_c\) based on Lemma 4, and the nodes returned by function Traverse(\(n, d^{\max }, G^\mathrm{{min}}\)) are added to \(N_g\) based on Heuristic 1 (lines 8–9). In the sequel, GSP en-heaps all n's unvisited adjacent nodes into H for later expansion (lines 10–11). Thereafter, GSP constructs \(\mathcal {G}_e\) based on the nodes in \(N_g\), and returns both \(N_c\) and \(\mathcal {G}_e\), terminating the algorithm (lines 12–13).
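The following sketch renders the logic of GSP (Heuristic 1), reusing extreme_graph() and the networkx helpers from the earlier sketches. For simplicity, it tests every node reachable from q instead of pruning the heap-guided expansion as Algorithm 2 does, so it is a functionally equivalent but less efficient rendering.

```python
import networkx as nx

def gsp(ug, q, S, k):
    """Sketch of GSP: returns the candidate set N_c and the essential
    uncertain graph, i.e., the edges of ug with both endpoints in N_g."""
    def build(weights):
        G = nx.Graph()
        for (u, v), w in weights.items():
            G.add_edge(u, v, weight=w)
        return G

    g_min = build(extreme_graph(ug, min))
    g_max = build(extreme_graph(ug, max))
    dist_q = nx.single_source_dijkstra_path_length(g_min, q, weight="weight")
    n_c, n_g = set(), set()
    for n, d_lb in sorted(dist_q.items(), key=lambda kv: kv[1]):  # heap order
        dist_n = nx.single_source_dijkstra_path_length(g_max, n, weight="weight")
        knn = sorted(dist_n[t] for t in S if t in dist_n)
        if len(knn) < k:
            continue
        d_max = knn[k - 1]                  # d^k_{G^max}(n, S), per Lemma 3
        if d_lb <= d_max:                   # candidate test of Lemma 4
            n_c.add(n)
            reach = nx.single_source_dijkstra_path_length(
                g_min, n, cutoff=d_max, weight="weight")
            n_g |= set(reach)               # Traverse(n, d_max, G^min)
    g_e = {e: pmf for e, pmf in ug.items() if set(e) <= n_g}
    return n_c, g_e
```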

Example 3

To illustrate how GSP forms an essential uncertain graph, assume that a UG-R1NN \((k =1)\) query is issued at node \(n_3\) on the sample uncertain graph depicted in Fig. 1 with S = \(\{n_2, n_4, n_6, n_7\}\). Table 2 lists the procedure of GSP, based on the possible graphs \(G^\mathrm{{min}}\) and \(G^{\max }\) shown in Fig. 2. GSP first evaluates the query node \(n_3\) from a min-heap H, and gets \(d_{G^{\max }}^1(n_3, S)\) = 4 from \(G^{\max }\), which is larger than \(d(n_3, q)\) \((=0)\). Thus, GSP sets \(N_c\) to \(\{n_3\}\), inserts all the nodes located within distance \(d_{G^{\max }}^1(n_3, S)=4\) of \(n_3\) on \(G^\mathrm{{min}}\) into \(N_g\) \((=\{n_3\), \(n_1\), \(n_4\), \(n_2\), \(n_5\})\), and en-heaps \(n_3\)'s adjacent nodes \(\{n_1\), \(n_4\), \(n_5\}\) into H. Similarly, GSP visits the nodes \(n_1,\) \(n_4\), \(n_2\), and \(n_5\) in order (refer to Table 2). Once H is empty, the essential uncertain graph is constructed based on the node set \(N_g\) = {\(n_3\), \(n_1\), \(n_4\), \(n_2\), \(n_5\)}, and \(N_c = \{n_3, n_1\}\) forms the candidate result set for \(\hbox {UG-R}k\hbox {NN}\) search.

Since GSP forms a subgraph \(\mathcal {G}_e\) that contains fewer nodes and hence fewer edges, it helps to reduce the number of possible graphs. Take our sample graph as an example. GSP forms a \(\mathcal {G}_e\) with only 5 nodes and 6 edges, and it cuts down the number of possible graphs to \(\varPi _{e\in E'}|\varOmega (e)| = 64\) where \(E'\) is the edge set of \(\mathcal {G}_e\).

In what follows, we prove the correctness of GSP heuristic in Lemma 6 and approximate the size of an essential uncertain graph \(\mathcal {G}_e\) (in terms of \(|N_g|\)) in Lemma 7.

Lemma 6

(Correctness) Given an uncertain graph \(\mathcal {G}\) and a query node q, if a node \(n\in V\) is not visited during the traversal of \(G^\mathrm{{min}}\) in Algorithm 2, node n cannot be an answer object on any possible graph \(G_i\in \varPhi (\mathcal {G})\), that is, GSP algorithm is correct with no false negative (i.e., there are no missing answers) and no false positive (i.e., there are no false answers) for \(\hbox {UG-R}k\hbox {NN}\) search.

Proof

Assume, to the contrary, that an answer object \(n'\) on one possible graph \(G_i\in \varPhi (\mathcal {G})\) is not visited during the traversal of \(G^\mathrm{{min}}\). Since \(n'\) is not visited during the traversal of \(G^\mathrm{{min}}\), there must be at least one node n on the path \(SP_{G_i}(n', q)\) satisfying \(d_{G^\mathrm{{min}}}(n, q) > d_{G^{\max }}^k(n, S)\). According to Lemma 3, node q is not included in kNN(G, n, S) for any \(G\in \varPhi (\mathcal {G})\) (i.e., n cannot be an answer object for \(\mathcal {G}\)). Thus, we have

$$\begin{aligned} d_{G_i}(n', q)= & {} d_{G_i}(n', n) +d_{G_i}(n, q) \\> & {} d_{G_i}(n', n)+ d_{G_i}^k(n, S) \\\ge & {} \max \{d_{G_i}(n', o)~\mid ~ o\in k\mathrm {NN}(G_i, n, S)\} \\= & {} d_{G_i}^k(n', S). \end{aligned}$$

Then, we can conclude that node \(n'\) is not a real answer object on \(G_i\), and the assumption is invalid. Thus, for any unvisited node n in Algorithm 2, n cannot be an answer object on any possible graph \(G\in \varPhi (\mathcal {G})\), i.e., the GSP algorithm has no false negatives for \(\hbox {UG-R}k\hbox {NN}\) search.

On the other hand, we know that set \(N_c\) covers all the answer objects for an \(\hbox {UG-R}k\hbox {NN}\) query, as stated in Lemma 4, and all the candidate nodes \(n \in N_c\) with all their k-nearest neighbors are included in the essential uncertain graph \({\mathcal {G}_e}\) for a given uncertain graph \(\mathcal {G}\) based on Heuristic 1 and Lemma 5. Consequently, it is guaranteed that GSP has no false positive. Hence, GSP is correct for \(\hbox {UG-R}k\hbox {NN}\) retrieval, and the proof completes. \(\square \)

Lemma 7

(Effectiveness) Given an \(\hbox {UG-R}k\hbox {NN}\) query w.r.t. an uncertain graph \(\mathcal {G}\), assume that every node n contains a data object (i.e., data density d =1), and let \(\rho \) denote the maximal average degree in \(\mathcal {G}\). The cardinality of node set \(N_g\) of the essential uncertain graph \(\mathcal {G}_e\), i.e., \(|N_g|\), is bounded by \(\mathrm {min}(\frac{\rho ^{2{ k}+1}-1}{\rho -1}, ~|V|)\).

Proof

First, we prove that the k-nearest nodes (kNNs) of a query node q must appear in the nearest k layers of the graph in Fig. 3, which resembles the breadth-first search (BFS) traversal tree from q on any graph \(\mathcal {G}\). Assume, to the contrary, that a node n located outside those k layers is one of q's kNNs. Then, there must be at least k nodes o (different from q and n) on the shortest path from q to n. Thus, the shortest path distances from q to those k nodes o are certainly smaller than that between q and n, regardless of the edge weights. Hence, node n cannot be one of q's kNNs, which contradicts our assumption. Similarly, the kNNs of any node in V are within its closest k layers of the BFS traversal tree in Fig. 3. Based on Heuristic 1, \(N_g\) contains all the objects \(o'\) that may be answer objects together with all of \(o'\)'s possible nearest neighbor data nodes. Thus, \(N_g\) contains at most the first 2k layers of nodes in Fig. 3, i.e., \(|N_g| \le \sum _{i=0}^{2k} \rho ^i = \frac{\rho ^{2k+1}-1}{\rho -1}\). Considering that \(N_g \subseteq V\), we have \(|N_g| \le \mathrm {min}(\frac{\rho ^{2k+1}-1}{\rho -1}, |V|)\). The proof completes. \(\square \)

It is important to note that the larger the value of \(\frac{|V|-|N_g|}{|V|}\), the more effective GSP is. Take the real co-authorship network DBLP with 1,226,749 nodes and 5,557,805 edges as an example, and set the node degree \(\rho \) to the average node degree 4.5 and k to 2. When the data density d = 1, \(|N_g\)(DBLP)| \(\le \frac{4.5^5-1}{4.5-1}\) \(\approx \) 527. Clearly, the number of nodes in the essential uncertain graph is much smaller than the node set size of the original uncertain graph. Moreover, the lower bound of the effectiveness of the GSP heuristic on DBLP can be approximated as \(\frac{|V|-527}{|V|}\) \(\approx \) 99.96%, meaning that at least 99.96% of the nodes in DBLP can be pruned by the GSP heuristic for an \(\hbox {UG-R}k\hbox {NN}\) query.
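The bound of Lemma 7 and the resulting pruning ratio are easy to reproduce; a small helper, with the values of the DBLP example:

```python
def ng_bound(rho, k, n_nodes):
    """Upper bound on |N_g| from Lemma 7 (data density d = 1)."""
    return min((rho ** (2 * k + 1) - 1) / (rho - 1), n_nodes)

# DBLP example from the text: rho = 4.5, k = 2, |V| = 1,226,749.
bound = ng_bound(4.5, 2, 1_226_749)     # ~527
pruned = 1 - bound / 1_226_749          # ~0.9996, i.e., >= 99.96% pruned
```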

Fig. 3 Illustration for the proof of Lemma 7

It is worthwhile to mention that the derived bound in Lemma 7 holds without any assumption on the probability mass functions of edges, other than positive edge weights. Note that the positive weights guarantee that, for a given node n, the shortest path distances from n to the accessed nodes are non-decreasing as the network expansion proceeds. This indicates that all of n's k-nearest neighbors can be obtained within the k-layer extension of the network from n (if d = 1). Further, if the edges in the graph share the same weight, the \(\lceil \log _{\rho }k\rceil \)-layer extension from node n is sufficient for getting n's k nearest neighbors. In this case, \(|N_g|\) could be tightened to \(\min \)(\(\frac{\rho ^{2\lceil \log _{\rho }k\rceil +1}-1}{\rho -1}\), |V|).

On the other hand, the general assumption of positive weights is the main reason that the derived bound is most likely to be loose, possibly even reaching |V| (i.e., the original node size). Fortunately, GSP can prune away many nodes to significantly reduce both the size and the number of possible graphs we have to evaluate during \(\hbox {UG-R}k\hbox {NN}\) search, as demonstrated by our experimental results in Sect. 7.

4 Equivalent graph removing

In Sect. 3, we presented the GSP heuristic, whose purpose is to replace an uncertain graph \(\mathcal {G}\) with a much smaller essential uncertain graph \(\mathcal {G}_e\) to support \(\hbox {UG-R}k\hbox {NN}\) search efficiently. Although \(\mathcal {G}_e\) successfully reduces the node size from |V| to \(|N_g|\), the number of possible graphs we have to consider w.r.t. \(\mathcal {G}_e\) might still be large, i.e., in total \(\varPi _{e\in E'}|\varOmega (e)|\) possible graphs with \(E' = \{e_{i,j}\in E \mid n_i, n_j \in N_g\}\) (i.e., \(E'\) is the edge set of \(\mathcal {G}_e\)). In this section, we present another pruning technique, i.e., equivalent graph removing (EGR), which tries to reduce the number of possible graphs by removing equivalent possible graphs. In the sequel, we first explain the concept of the equivalent possible graph, and then we propose the new EGR heuristic based on this concept to save the evaluation of certain possible graphs.

Before we formally define the concept of the equivalent possible graph, we first present an intuitive example to explain why certain possible graphs are equivalent. To simplify our discussion, we introduce an edge set \(E^\mathrm{sp}_{G_i}\) for a given possible graph \(G_i = (V, E, W)\) \(\in \varPhi (\mathcal {G}_e)\) that consists of two subsets \(E^{\mathrm{sp}_q}_{G_i}\) and \(E^{\mathrm{sp}_n}_{G_i}\), defined in Eqs. 6 and 7, respectively. The edge set \(E^{\mathrm{sp}_q}_{G_i}\) refers to the set of edges that appear on the shortest path from q to any node in V on \(G_i\); and \(E^{\mathrm{sp}_n}_{G_i}\) denotes the set of edges that appear on the shortest paths from each node n in V to its k-nearest data nodes on graph \(G_i\) (i.e., those nodes returned by function kNN\((G_i, n, S)\)).

$$\begin{aligned} E^{\mathrm{sp}}_{G_i}~= & {} ~E^{\mathrm{sp}_q}_{G_i} \bigcup E^{\mathrm{sp}_n}_{G_i} \end{aligned}$$
(5)
$$\begin{aligned} E^{\mathrm{sp}_q}_{G_i}= & {} \bigcup _{n\in V} \{e~|~e\in \textit{SP}_{G_i}(n, q)\} \end{aligned}$$
(6)
$$\begin{aligned} E^{\mathrm{sp}_n}_{G_i}= & {} \bigcup _{n\in V} \{e~|~e\in \textit{SP}_{G_i}(n, o), o\in k\mathrm {NN}(G_i, n, S)\} \end{aligned}$$
(7)

Take the graph \(G_i\) shown in Fig. 4a as an example. When d \(=\)1 (i.e., \(S = \{n_1, n_2, n_3, n_4, n_5\}\)) and \(k =1\), we can get \(E^{\mathrm{sp}_q}_{G_i} = \{e_{1,3}\), \(e_{1,2}\), \(e_{3,4}\), \(e_{3,5}\}\) and \(E^{\mathrm{sp}_n}_{G_i} =\{e_{1,3}\), \(e_{1,2}\), \(e_{4,5}\}\). Note that an edge might belong to both \(E^{\mathrm{sp}_n}_{G_i}\) and \(E^{\mathrm{sp}_q}_{G_i}\) (e.g., edges \(e_{1,3}\) and \(e_{1,2}\)). For simplicity, we only mark them once in Fig. 4.
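Computing \(E^\mathrm{sp}_{G}\) requires one shortest-path tree per node. A sketch with networkx follows; undirected edges are normalized as sorted node tuples, and where shortest paths are not unique, the sketch simply fixes the paths networkx returns (the paper presumably fixes one shortest path per pair as well).

```python
import networkx as nx

def sp_edge_set(G, q, S, k):
    """E^sp_G = E^{sp_q}_G union E^{sp_n}_G, per Eqs. 5-7."""
    def path_edges(path):
        return {tuple(sorted(pair)) for pair in zip(path, path[1:])}

    esp = set()
    _, paths_q = nx.single_source_dijkstra(G, q, weight="weight")
    for path in paths_q.values():                # Eq. 6: edges on SP(n, q)
        esp |= path_edges(path)
    for n in G.nodes:                            # Eq. 7: edges on SP(n, kNNs)
        dist_n, paths_n = nx.single_source_dijkstra(G, n, weight="weight")
        for _, o in sorted((dist_n[t], t) for t in S if t in dist_n)[:k]:
            esp |= path_edges(paths_n[o])
    return esp
```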

Fig. 4 Illustration of equivalent possible graphs. a \(G_i\). b \(G_j\)

The reason we introduce the edge set \(E^\mathrm{sp}_{G_i}\) is that it defines the part of the graph traversed during \(\hbox {R}k\hbox {NN}\) search. Given two possible graphs \(G_i\) and \(G_j\) (e.g., \(G_i\) and \(G_j\) depicted in Fig. 4) satisfying the three conditions of \(E^\mathrm{sp}_{G_i} = E^\mathrm{sp}_{G_j}\) (as stated in Lemma 8), it is confirmed that they share the same result set for \(\hbox {R}k\hbox {NN}\) retrieval.

Lemma 8

Given two distinct possible graphs \(G_i\) and \(G_j\) w.r.t. an uncertain graph, for the \(\hbox {R}k\hbox {NN}\) query issued at a node q, it is confirmed that \(E^\mathrm{sp}_{G_i} = E^\mathrm{sp}_{G_j}\) \(\Rightarrow \) \(\mathrm {R}k\mathrm {NN}(G_i, q) = \mathrm {R}k\mathrm {NN}(G_j, q)\), where \(E^\mathrm{sp}_{G_i} = E^\mathrm{sp}_{G_j}\) holds iff (i) \(\forall e\in E^\mathrm{sp}_{G_i}, e\in E^\mathrm{sp}_{G_j}\); (ii) \(\forall e\in E^\mathrm{sp}_{G_j}, e\in E^\mathrm{sp}_{G_i}\); and (iii) \(\forall e\in E^\mathrm{sp}_{G_i} (= E^\mathrm{sp}_{G_j})\), \(\omega _{G_i}(e)=\omega _{G_j}(e)\).

Proof

For any object \(o\in G_i/G_j\), we know that all edges on \(SP_{G_i}(o, q)\)  /  \(SP_{G_j}(o, q)\) and \(SP_{G_i}(o, o')/SP_{G_j}(o, o')\) \((o'\in k\)NN\((G_i, o,\) S) / kNN\((G_j, o, S))\) are included in the set \(E^\mathrm{sp}_{G_i}/E^\mathrm{sp}_{G_j}\) based on Eq. 5. As \(E^\mathrm{sp}_{G_i}\) = \(E^\mathrm{sp}_{G_j}\), we can derive that \(d_{G_i}(o, q)\) = \(d_{G_j}(o, q)\), kNN\((G_i, o,\) S) = kNN\((G_j,\) oS), and thus, \(d_{G_i}^k(o, S)\) = \(d_{G_j}^k(o, S)\). Hence, we are certain that \(o\in \) \(\hbox {R}k\hbox {NN}\)(\(G_{i}, q)\) \(\Rightarrow \) \(o\in \) \(\hbox {R}k\hbox {NN}\)(\(G_{j}, q)\), and \(o\in \) \(\hbox {R}k\hbox {NN}\)(\(G_{j}, q)\) \(\Rightarrow \) \(o\in \) \(\hbox {R}k\hbox {NN}\)(\(G_{i}, q)\). Consequently, \(\hbox {R}k\hbox {NN}\)(\(G_i\), q) = \(\hbox {R}k\hbox {NN}\)(\(G_j\), q), and the proof completes. \(\square \)

We would like to highlight that the third condition of \(E^\mathrm{sp}_{G_j}\) \(=\) \(E^\mathrm{sp}_{G_i}\) is necessary for Lemma 8. Take the possible graphs \(G_x\) and \(G_y\) in Fig. 5 as an example; they satisfy the first two conditions of \(E^\mathrm{sp}_{G_x} = E^\mathrm{sp}_{G_y}\) (both sets include the two edges). Nevertheless, the two possible graphs \(G_x\) and \(G_y\) have different \(\hbox {R}k\hbox {NN}\) result sets, since \(\hbox {R}k\hbox {NN}\)(\(G_{x}, q)\) = \(\varnothing \) and \(\hbox {R}k\hbox {NN}\)(\(G_{y}, q) = \{n\}\). In other words, the first two conditions of \(E^\mathrm{sp}_{G_j} = E^\mathrm{sp}_{G_i}\) alone cannot guarantee identical \(\hbox {R}k\hbox {NN}\) result sets, and thus, the third condition is necessary.

Notably, given a possible graph \(G_i\) and its \(\hbox {R}k\hbox {NN}\) result set, we can safely ignore all the possible graphs \(G_j\) with \(E^\mathrm{sp}_{G_j}\) = \(E^\mathrm{sp}_{G_i}\) based on Lemma 8. Accordingly, we introduce the concept of equivalent possible graph in Definition 6. By removing all the equivalent possible graphs, we can further reduce the number of possible graphs we need to evaluate to improve query efficiency.

Definition 6

(Equivalent possible graph) Given a possible graph \(G_i \in \varPhi (\mathcal {G}_e\)), its equivalent possible graphs, denoted as \(\mathrm {D}_{G_i}\), include all the other possible graphs \(G_j \in \varPhi (\mathcal {G}_e)\) with \(E^\mathrm{sp}_{G_i} = E^\mathrm{sp}_{G_j}\), i.e., \(\mathrm {D}_{G_i} = \{G_j (\ne G_i) \in \varPhi (\mathcal {G}_e)\mid E^\mathrm{sp}_{G_j} = E^\mathrm{sp}_{G_i}\}\).

Fig. 5 Explanation for Lemma 8. a \(G_x\). b \(G_y\)

Although we understand that we can remove equivalent possible graphs \(\mathrm {D}_{G_i}\) w.r.t. a given possible graph \(G_i\), we also realize that locating the edge set \(E^\mathrm{sp}_{G_j}\) on any possible graph \(G_j\in \mathrm {D}_{G_i}\) is as expensive as finding \(\hbox {R}k\hbox {NN}\) result on \(G_j\). In other words, we are not able to save anything via identifying and then removing equivalent possible graphs. We need to find a better way to utilize the findings that certain possible graphs actually share the same \(\hbox {R}k\hbox {NN}\) result set and evaluating one of them is sufficient. That is why we introduce a stronger dominance relationship, namely conditional dominance, between two possible graphs, as formally defined in Definition 7.

Definition 7

(Graph conditional dominance) For two possible graphs \(G_i\) and \(G_j\) w.r.t. an uncertain graph \(\mathcal {G}\) = (V, E, \(\varOmega \), P), with a given set \(E_d \subseteq E\), graph \(G_i\) \(E_d\)-dominates graph \(G_j\) conditionally, denoted as \(G_i \prec _{E_d} G_j\), iff (i) \(\forall e\in E_d, \omega _{G_i}(e) = \omega _{G_j}(e)\); (ii) \(\forall e'\in (E-E_d)\), \(\omega _{G_i}(e') \le \omega _{G_j}(e')\); and (iii) \(\exists e''\in (E-E_d)\), \(\omega _{G_i}(e'') < \omega _{G_j}(e'')\).

We would like to highlight that conditional dominance relationship is stronger than the dominance relationship introduced in Definition 5. For two possible graphs \(G_i\) and \(G_j\) and a given edge set \(E_d\), if \(G_i \prec _{E_d} G_j\), then \(G_i \prec G_j\). However, the reverse is not always valid. Take the possible graphs shown in Fig. 2 as an example. \(G^\mathrm{{min}}\) dominates \(G_1\). On the other hand, it \(E_d\)-dominates \(G_1\) conditionally for \(E_d = \{e_{1,2}, e_{1,3}\}\), but not for \(E_d =\) \(\{e_{1,2}, e_{2,4}\}\). In fact, when \(E_d\) is empty, conditional dominance relationship is equivalent to dominance relationship, i.e., \(\prec _{\varnothing }\) \(\Leftrightarrow \) \(\prec \).
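As a sketch, the conditional dominance test of Definition 7 over edge \(\rightarrow \) weight dictionaries and a set \(E_d\) of edge keys:

```python
def cond_dominates(w_i, w_j, e_d):
    """G_i E_d-dominates G_j conditionally, per Definition 7."""
    rest = [e for e in w_i if e not in e_d]
    return (all(w_i[e] == w_j[e] for e in e_d)        # (i) equal on E_d
            and all(w_i[e] <= w_j[e] for e in rest)   # (ii) <= elsewhere
            and any(w_i[e] < w_j[e] for e in rest))   # (iii) strict somewhere
```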

As stated in Lemma 9, using conditional dominance relationship based on the edge set \(E^\mathrm{sp}_{G_i}\), we can locate a set of possible graphs \(\mathrm {D}^c_{G_i}\) with each \(G_j \in \mathrm {D}^c_{G_i}\) having its \(E^\mathrm{sp}_{G_j} = E^\mathrm{sp}_{G_i}\), i.e., \(\mathrm {D}^c_{G_i} \subseteq \mathrm {D}_{G_i}\). Note that locating equivalent possible graphs in \(\mathrm {D}^c_{G_i}\) does not require us to identify the edge set \(E^\mathrm{sp}_{G_j}\) of any possible graph \(G_j \in \mathrm {D}^c_{G_i}\).

Table 3 Illustration of the incremental access strategy

Lemma 9

Given a possible graph \(G_i\) with its set \(E^\mathrm{sp}_{G_i}\) (w.r.t. a query node q), let \(\mathrm {D}^c_{G_i}\) be the set of possible graphs \(G_j\) that are conditionally dominated by \(G_i\) based on \(E^\mathrm{sp}_{G_i}\), i.e., \(\mathrm {D}^c_{G_i}=\{G_j \in \varPhi (\mathcal {G}_e)\mid G_i \prec _{E^\mathrm{sp}_{G_i}} G_j \}\). It is confirmed that \(\mathrm {D}^c_{G_i} \subseteq \mathrm {D}_{G_i}\).

Proof

For any graph \(G\in \mathrm {D}^c_{G_i}\), according to the definitions of \(\mathrm {D}^c_{G_i}\) and graph conditional dominance, the edges in \(E^\mathrm{sp}_{G_i}\) share the same weights on G and \(G_i\), and the edges in \((E-E^\mathrm{sp}_{G_i})\) have weights on G that are larger than or equal to those on \(G_i\). Hence, we have \(E^\mathrm{sp}_{G}\) = \(E^\mathrm{sp}_{G_i}\) based on Eq. 5. Referring to \(\mathrm {D}_{G_i}\) formalized in Definition 6, we can confirm that graph \(G\in \mathrm {D}_{G_i}\). Thus, \(\mathrm {D}^c_{G_i} \subseteq \mathrm {D}_{G_i}\), and the proof completes. \(\square \)

Take the possible graphs \(G_i\) and \(G_j\) depicted in Fig. 4 as an example. It is clear that \(G_j\in \mathrm {D}^c_{G_i}\) and \(G_j\in \mathrm {D}_{G_i}\), and meanwhile \(G_i\in \mathrm {D}_{G_j}\) but \(G_i\notin \mathrm {D}^c_{G_j}\).

We are now ready to present the equivalent graph removing (EGR) technique in Heuristic 2. Its correctness is guaranteed by Lemmas 8 and 9. EGR enables us to skip the evaluation of certain possible graphs and thus boosts search performance.

Heuristic 2

(Equivalent graph removing) Given a possible graph \(G_i\in \varPhi (\mathcal {G}_e)\), a data node set S, and a query node q, it is confirmed that the possible graphs in \(\mathrm {D}^c_{G_i}\) have the identical \(\hbox {R}k\hbox {NN}\) result set as \(G_i\), i.e., \(\forall G_j\in \mathrm {D}^c_{G_i}\), \(\mathrm {R}k\mathrm {NN}(G_j, q) = \mathrm {R}k\mathrm {NN}(G_i, q)\). Hence, those graphs can be safely removed after evaluating \(\hbox {R}k\hbox {NN}\) search on \(G_i\), with \(\Pr (o)\) of each answer object \(o\in \mathrm {R}k\mathrm {NN}(G_i, q)\) updated accordingly.

Next, we explain how to update \(\Pr (o)\) of each answer object \(o\in \) \(\hbox {R}k\hbox {NN}(G_i, q)\), when a possible graph \(G_j \) in \(\mathrm {D}^c_{G_i}\) is removed, as mentioned in Heuristic 2. In particular, given two possible graphs \(G_i\) and \(G'_i\), set \(\mathrm {D}^c_{G_i}\) might overlap with set \(\mathrm {D}^c_{G'_i}\), i.e., \(\mathrm {D}^c_{G_i} \cap \mathrm {D}^c_{G'_i}\) \(\ne \varnothing \). This is because they both might conditionally dominate the same possible graph even though they do not conditionally dominate each other. Thus, given an answer object \(o\in \hbox {R}k\hbox {NN}(G_i, q)\), for every possible graph \(G_j\) \(\in \mathrm {D}^c_{G_i}\), the existence probability of \(G_j\) (i.e., \(\Pr (G_j)\)) is added to \(\Pr (o)\) (because \(o\in \) \(\hbox {R}k\hbox {NN}(G_j, q)\)). However, \(\Pr (G_j)\) contributes once only to \(\Pr (o)\). That is to say, given an answer object \(o\in \hbox {R}k\hbox {NN}(G_a, q)\) \(\cap \) \(\hbox {R}k\hbox {NN}(G_b, q)\) and a possible graph \(G_j \in \mathrm {D}^c_{G_a}\cap \mathrm {D}^c_{G_b}\), if \(\Pr (G_j)\) has been added to \(\Pr (o)\) during the evaluation of \(G_a\), it does not contribute to \(\Pr (o)\) again during the evaluation of \(G_b\).

It is important to note that if the possible graphs \(G\in \varPhi (\mathcal {G}_e)\) are evaluated in a random order, it is very likely that the equivalent possible graphs in \(\mathrm {D}^c_{G}\) are evaluated earlier than G, which significantly degrades the effectiveness of our presented EGR heuristic. Thus, in order to make sure that a graph \(G\in \varPhi (\mathcal {G}_e)\) is evaluated earlier than the graphs in \(\mathrm {D}^c_{G}\), we introduce the incremental access strategy (IAS) to decide the evaluation order of possible graphs in \(\varPhi (\mathcal {G}_e)\). For ease of understanding, we take the sample graph in Fig. 1 as an example to explain the IAS strategy as follows.

First of all, before evaluating any possible graph, IAS sorts the edges in a random order, e.g., edge \(e_{1, 2}\) ranked first and edge \(e_{1, 3}\) ranked second, as illustrated in the first row of Table 3. Then, it assigns all the edges their minimum weights, and visits the corresponding possible graph \(g_1\) (i.e., \(G^\mathrm{{min}}\)) first. Based on \(g_1\), it increases the weight of the first edge to generate the second possible graph for evaluation, i.e., \(g_2\) in Table 3. As edge \(e_{1, 2}\) has only two possible weights 2 and 3 (shown in Fig. 1), \(g_1\) and \(g_2\) are the only two graphs generated by changing the weight of edge \(e_{1, 2}\), with the weights of the edges ranked behind \(e_{1,2}\) all set to the minimum. Based on \(g_1\) and \(g_2\), it increases the weight of the second edge to generate two more graphs \(g_3\) and \(g_4\) for evaluation. Similarly, based on \(g_1\), \(g_2\), \(g_3\), and \(g_4\), it increases the weight of the third edge to generate another four graphs for evaluation. The above process proceeds until the last graph \(G^{\max }\) is generated and evaluated. Note that IAS guarantees that a graph \(G\in \varPhi (\mathcal {G}_e)\) is accessed before those in \(\mathrm {D}^c_{G}\), and hence, all the \(\hbox {R}k\hbox {NN}\) evaluations of the equivalent possible graphs in \(\cup _{G\in \varPhi (\mathcal {G}_e)}\mathrm {D}^c_{G}\) can be saved.
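Viewed abstractly, this order is that of a mixed-radix counter whose least significant digit is the first-ranked edge: a possible graph whose edge weights are component-wise no smaller than another's always has a larger counter value and is therefore visited later, which is exactly the guarantee IAS needs. A minimal sketch of this view (names are illustrative):

```python
from itertools import product

# Sketch of the IAS visiting order: enumerate weight assignments as a
# mixed-radix counter with the first-ranked edge as the least
# significant digit, so every graph G precedes all graphs in D^c_G.

def ias_order(edge_weights):
    """edge_weights: one sorted weight list per ranked edge.
       Yields one weight assignment (tuple) per possible graph."""
    # product() varies its *last* argument fastest, so feed the edge
    # ranked first as the last argument and reverse each assignment.
    for combo in product(*reversed(edge_weights)):
        yield tuple(reversed(combo))

# e.g. edge e_{1,2} with weights {2, 3} ranked first:
# list(ias_order([[2, 3], [1, 4]])) -> [(2,1), (3,1), (2,4), (3,4)]
```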

Algorithm 3 EGR pseudo-code
Table 4 Procedure of equivalent graph removing

Based on Heuristic 2 and the IAS strategy, we detail the implementation of EGR, with its pseudo-code presented in Algorithm 3. First, EGR empties \(\varGamma _{\mathrm {D}^c}\), the set of equivalent possible graphs (line 1). Then, it evaluates the possible graphs in (\(\varPhi (\mathcal {G}_e)-\varGamma _{\mathrm {D}^c}\)) in the order given by the IAS strategy described previously (lines 2–19). To be more specific, for each possible graph G it visits, EGR derives the existence probability \(\Pr (G)\) by Eq. 1, empties the \(\hbox {R}k\hbox {NN}\) set and the \(E^\mathrm{sp}_{G}\) set, and inserts the query node q into a heap H (lines 3–4). The evaluation on G is performed by evaluating the nodes in H. Each de-heaped node n is evaluated only once, which is the purpose of the flag n.flag. EGR marks node n by setting n.flag and inserts the edges on the shortest paths \(SP_{G}(o\in k\mathrm {NN}(G, n, S), n)\) into the \(E^\mathrm{sp}_{G}\) set (lines 5–9). If n contains a data node s satisfying \(d_G(s, q)\le d_G^k(s, S)\), s is added to the \(\hbox {R}k\hbox {NN}\) set and its probability of being an answer object, \(\Pr (s)\), is updated accordingly. In addition, if n satisfies \(d_G(n, q)\le d_G^k(n, S)\), EGR en-heaps its unmarked adjacent nodes into H, and inserts the corresponding adjacent edges into \(E^\mathrm{sp}_{G}\) (lines 10–16). The above process repeats until H is empty. Then, EGR updates \(\varGamma _{\mathrm {D}^c}\) and the probability \(\Pr (s)\) for every \(s\in \hbox {R}k\hbox {NN}\)(G, q), and visits the next possible graph (lines 17–19). Once all the possible graphs in (\(\varPhi (\mathcal {G}_e)-\varGamma _{\mathrm {D}^c}\)) have been evaluated, EGR returns the set S with the \(\hbox {R}k\hbox {NN}\) probabilities.
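To make the per-graph evaluation concrete, the following condensed sketch follows the loop above. The helpers knn (returning a node's k nearest data nodes and its kth-NN distance) and sp_edges (returning the edges on a shortest path) are hypothetical names standing in for standard Dijkstra-based primitives, and data objects are modeled simply as nodes in S.

```python
import heapq

# A condensed sketch of EGR's per-graph evaluation (not the paper's exact
# pseudo-code): traverse G from q in Dijkstra order, collect E^sp_G, and
# gather the RkNN answers on G.

def evaluate_graph(G, q, S, k, knn, sp_edges):
    """G: adjacency dict, node -> list of (neighbor, weight)."""
    rknn, e_sp, marked = set(), set(), set()
    heap = [(0.0, q)]                       # (distance from q, node)
    while heap:
        d_nq, n = heapq.heappop(heap)
        if n in marked:
            continue                        # each node is evaluated once
        marked.add(n)
        neighbors, d_k = knn(G, n, S, k)    # n's kNN data nodes, kth distance
        for o in neighbors:                 # edges on SP(o, n) enter E^sp_G
            e_sp.update(sp_edges(G, o, n))
        if n in S and d_nq <= d_k:          # n is a data node answering q
            rknn.add(n)
        if d_nq <= d_k:                     # expand only promising nodes
            for m, w in G[n]:
                e_sp.add(frozenset((n, m)))
                if m not in marked:
                    heapq.heappush(heap, (d_nq + w, m))
    return rknn, e_sp
```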

Example 4

In order to facilitate the understanding of EGR, we illustrate how to derive \(\mathrm {D}^c_{G^\mathrm{{min}}}\) for the graph \(G^\mathrm{{min}}\) depicted in Fig. 2c during UG-R1NN query processing. Table 4 lists the content of the edge set \(E^\mathrm{sp}_{G^\mathrm{{min}}}\) and that of the heap H. First, EGR de-heaps the query node \(n_3\) from H, and inserts edge \(e_{3, 4}\) into \(E^\mathrm{sp}_{G^\mathrm{{min}}}\), since it lies on the shortest path from \(n_3\) to its nearest neighbor \(s_2(n_4)\). Then, EGR en-heaps the unmarked adjacent nodes \(n_1\), \(n_4\), and \(n_5\) into H, and updates \(E^\mathrm{sp}_{G^\mathrm{{min}}}\) to \(\{e_{1, 3}, e_{3, 4}, e_{3,5}\}\). In this way, EGR continues evaluating the remaining nodes in H. Finally, it forms the set \(E^\mathrm{sp}_{G^\mathrm{{min}}} = \{e_{1,3}, e_{3,4}, e_{3,5}, e_{1,2}, e_{4,6}, e_{2,7}, e_{4,5}\}\). Thus, according to Lemma 9, \(\mathrm {D}^c_{G^\mathrm{{min}}}\) contains the 7 possible graphs on which at least one of the edges \(e_{2,4}, e_{2,6}\), and \(e_{5,8}\) has a larger weight than on \(G^\mathrm{{min}}\).

5 Exact algorithm

In this section, we propose an exact algorithm, termed SDP, for processing \(\hbox {UG-R}k\hbox {NN}\) search. The algorithm utilizes three key techniques: graph structure pruning (GSP) and equivalent graph removing (EGR), presented in the previous two sections, and probability bound pruning (PBP), discussed below. In what follows, we first present the PBP technique and then detail our SDP algorithm. As stated in Heuristic 3, PBP tries to reduce the data node validation cost for \(\hbox {UG-R}k\hbox {NN}\) retrieval.

Heuristic 3

(Probability bound pruning) Given any possible graph \(G_i \in \varPhi (\mathcal {G}_e)\), a query node q, and a specified parameter \(\theta \), let \(\overline{P}\) be the total existence probability of the remaining unvisited possible graphs that have not been removed by EGR (see Heuristic 2). For a data node \(o\in \mathrm {R}k\mathrm {NN}(G_i, q)\), if o satisfies \(\Pr _{G_i}(o) \ge \theta \) or \(\Pr _{G_i}(o)+\overline{P} < \theta \), then o can be settled (validated or invalidated) and excluded from further evaluation during \(\hbox {UG-R}k\hbox {NN}\) search. Here, \(\Pr _{G_i}(o)\) denotes the probability of o being an answer object accumulated over the possible graphs evaluated so far (up to \(G_i\)).

Proof

The proof is intuitive, and thus, it is omitted. \(\square \)

It is worth mentioning that \(\Pr _{G_i}(o)\) and \(\Pr _{G_i}(o)+\overline{P}\) are actually the lower bound and the upper bound of \(\Pr (o)\) (i.e., o’s probability to become an answer object), respectively. The fact \(\Pr _{G_i}(o)\ge \theta \) means that o is definitely an answer object, and the fact \(\Pr _{G_i}(o)+\overline{P} < \theta \) indicates that o is definitely not an answer object.
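The resulting decision rule is simple enough to state as a three-way predicate; the sketch below uses illustrative names only.

```python
# Sketch of the PBP decision (Heuristic 3): Pr_{G_i}(o) is a lower bound
# and Pr_{G_i}(o) + P_bar an upper bound of Pr(o), so o can be settled
# early whenever either bound crosses the threshold theta.

def pbp_status(pr_lower, p_bar, theta):
    if pr_lower >= theta:
        return "answer"        # definitely an answer object
    if pr_lower + p_bar < theta:
        return "pruned"        # definitely not an answer object
    return "undecided"         # keep evaluating possible graphs
```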

Our SDP algorithm for \(\hbox {UG-R}k\hbox {NN}\) query processing is depicted in Algorithm 4. To begin with, SDP initializes the \(\hbox {R}k\hbox {NN}\) probability \(\Pr (s)\) to zero for each data object \(s\in S\). Then, it utilizes the GSP heuristic to derive the essential uncertain graph \(\mathcal {G}_e\) (line 2) in order to effectively cut down the size of the uncertain graph and thus reduce the number of possible graphs. Based on \(\mathcal {G}_e\), SDP invokes EGR (i.e., Algorithm 3), and integrates the PBP heuristic to evaluate possible graphs and update \(\Pr (s)\) for every data node \(s\in S\) accordingly (line 3). Note that PBP can be easily integrated into the EGR algorithm: whenever the total existence probability \(\overline{P}\) or any \(\Pr _{G_i}(o)\) changes, PBP is used to validate/invalidate data nodes for \(\hbox {UG-R}k\hbox {NN}\) search. Finally, SDP retrieves the data objects \(s\in S\cap N_c\) with \(\Pr (s)\ge \theta \) to complete the process.

Time complexity of SDP First, the traversal of a graph takes \(O(|V|\cdot \lg |V|+|E|)\) time, assuming that a Dijkstra-like algorithm is used [16, 29]. Thus, the GSP heuristic (i.e., Algorithm 2) takes \(O(|V|^2\cdot \lg |V|+|V|\cdot |E|)\) time. Second, the time complexity of the EGR heuristic (i.e., Algorithm 3) is \(O(w^{|E|}\cdot (|V|^2\cdot \lg |V|+|V|\cdot |E|))\) in the worst case, as there are \(w^{|E|}\) possible graphs if all the edges share the same number of weights w (i.e., \(\forall e \in E\), \(|\varOmega (e)|=w\)). In addition, PBP needs \(O(w^{|E|}\cdot |V|)\) time, since the number of data objects is at most |V| for every possible graph. Consequently, in the worst case, the total time complexity of SDP is \(O(w^{|E|}\cdot (|V|^2\cdot \lg |V|+|V|\cdot |E|))\).

Although SDP significantly improves search performance compared with Baseline, its cost is still high. In order to further boost the performance of \(\hbox {UG-R}k\hbox {NN}\) search, we propose an adaptive sampling algorithm, introduced in the next section.

Algorithm 4 SDP pseudo-code

6 Sampling algorithm

Due to the \(\sharp \)P-hard complexity of \(\hbox {UG-R}k\hbox {NN}\) search as discussed in Sect. 2, no algorithm is expected to solve the \(\hbox {UG-R}k\hbox {NN}\) query exactly in polynomial time. In view of this, we present an efficient algorithm, termed TripS, which uses graph structure pruning and an adaptive stratified sampling technique for \(\hbox {UG-R}k\hbox {NN}\) search. In particular, the framework of TripS is given in Algorithm 5. It first employs GSP (Algorithm 2) to get the essential uncertain graph (line 2). Second, on top of the essential uncertain graph, TripS conducts \(\hbox {R}k\hbox {NN}\) search on every sampled possible graph, captured by the adaptive stratified sampling technique (ASSP), for \(\hbox {UG-R}k\hbox {NN}\) retrieval (line 3). Note that the final result set contains the objects having probabilities not smaller than \(\theta '\) (= \(\theta \cdot \sum _{G ~\mathrm{is ~sampled}}\Pr (G)\)). Given that the GSP heuristic has been introduced in Sect. 3, we focus only on the ASSP sampling technique in this section.

Before we present ASSP, we first explain the popular metric used to evaluate the performance of sampling algorithms. Assume that a sampling algorithm \(\hat{F}\) is an estimator of the true value F. An essential metric to evaluate the accuracy of a sampling approach is the mean squared error (MSE) [16, 24, 25], denoted as \(\mathbb {E}[(\hat{F}-F)^2]\), which measures the expected squared difference between an estimator and the true value. It can be decomposed into two parts [45], i.e., \(\mathbb {E}[(\hat{F}-F)^2]\) = \(\mathrm{Var}(\hat{F})\) + \([\mathbb {E}(\hat{F})-F]^2\) = \(\mathrm{Var}(\hat{F})\) + \(\mathrm{Bias}(\hat{F}, F)^2\). Here, \(\mathrm{Var}(\hat{F})\) represents the expected squared deviation of \(\hat{F}\) from its expectation. It is easy to see that for an unbiased estimator, the MSE is simply the variance. In other words, if the estimators are unbiased, the key criterion to discriminate among them is their variances.

We use a naive unbiased random sampling method, i.e., Monte Carlo sampling [16, 24, 25], for \(\hbox {UG-R}k\hbox {NN}\) search. Specifically, Monte Carlo first draws N possible graphs, denoted as \(G_1\), \(G_2\), \(\ldots \), \(G_N\), from \(\mathcal {G}\), and then it finds the \(\hbox {R}k\hbox {NN}\) objects on each possible graph \(G_i\). Finally, it obtains the Monte Carlo estimator by aggregating the \(\hbox {R}k\hbox {NN}\) results over the sampled possible graphs.
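For concreteness, a deliberately simplified sketch of this estimator is given below; rknn_on_graph is a placeholder for any deterministic \(\hbox {R}k\hbox {NN}\) routine, and edge weights are sampled independently according to their distributions.

```python
import random
from collections import Counter

# Naive Monte Carlo baseline: draw each edge weight independently, run an
# (assumed) exact RkNN routine on the sampled deterministic graph, and
# estimate Pr(o) as the fraction of samples in which o is an answer.

def monte_carlo_ugrknn(edges, weights, probs, q, S, k, N, rknn_on_graph):
    """weights[e] / probs[e]: possible weights of edge e and their probs."""
    hits = Counter()
    for _ in range(N):
        G = {e: random.choices(weights[e], probs[e])[0] for e in edges}
        for o in rknn_on_graph(G, q, S, k):
            hits[o] += 1
    return {o: c / N for o, c in hits.items()}   # estimated Pr(o)
```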

Algorithm 5 TripS pseudo-code

Different from the naive Monte Carlo estimator, stratified sampling is an effective and popular sampling technique for query processing [21, 24, 25]. It partitions the entire population into disjoint strata, and then represents every stratum by capturing a specified number of samples in that stratum. Its main advantage is that it captures the most representative members of a population. As a result, in the TripS algorithm, we propose a new adaptive stratified sampling technique, namely ASSP, for \(\hbox {UG-R}k\hbox {NN}\) search, which is proved to be unbiased and achieves a variance at least as good as that of Monte Carlo sampling.

Before we formally introduce the detailed ASSP technique, we first present an illustrative example to explain its basic idea. To simplify the discussion, we assume that an uncertain graph \(\mathcal {G}\) has four edges \(e_1, e_2, e_3\), and \(e_4\), and every edge has two distinct weights (denoted as \(v_a\) and \(v_b\)), i.e., \(\forall e, |\varOmega (e)| = w =2\); the weights of one edge might differ from those of another. Then, there are in total \(2^4\) possible graphs w.r.t. \(\mathcal {G}\). In ASSP, these possible graphs (i.e., the whole sample space) are divided into several sample subspaces, and a specified number of possible graphs is captured from each subspace. As shown in Table 5, assume that the 16 possible graphs \(G_1, G_2, \ldots , G_{16}\) are partitioned into 4 strata \(\varDelta _1, \varDelta _2, \varDelta _3,\) and \(\varDelta _4\) according to the weights of edges \(e_1\) and \(e_2\), and the total sample size N is specified as 8. For simplicity, we assume ASSP randomly captures 2 out of the 4 possible graphs in every stratum of Table 5. In other words, we utilize 8 sampled possible graphs to answer the \(\hbox {UG-R}k\hbox {NN}\) query approximately, instead of exploring all 16 possible graphs.

Table 5 Illustration of stratified sampling on uncertain graphs
Table 6 Illustration of partition based on \(\gamma =2\) edges
Table 7 Illustration of partition based on \(\gamma =3\) edges

Consequently, we are aware that sample allocation and stratum partition are two critical steps for ASSP. To begin with, in ASSP, we employ the proportional sample allocation (PSA) [38]. Specifically, given the total sample size N, for achieving proportional samples, we assign sample size \(N_i\) for each stratum \(\varDelta _i\) proportional to the population probability of the stratum. The population probability, denoted as \(\pi _i\), w.r.t. a subspace \(\varDelta _i\), accumulates the existence probabilities of all the possible graphs included in \(\varDelta _i\), i.e.,

$$\begin{aligned} \pi _i = \sum _{G_t\in \varDelta _i} \Pr (G_t) \end{aligned}$$
(8)

where \(\Pr (G_t)\) is the existence probability of possible graph \(G_t\). Specifically, the sample size \(N_i\) is defined by Eq. 9.

$$\begin{aligned} N_i = \frac{\sum _{G_t\in \varDelta _i} \Pr (G_t)}{\sum _{G_t\in \varPhi (\mathcal {G})} \Pr (G_t)} \cdot N = \pi _i\cdot N \end{aligned}$$
(9)

Note that the total probability of all possible graphs (included in \(\varPhi (\mathcal {G})\)) is equal to one. As a result, strata with large population probabilities are sampled heavily, whereas strata with small population probabilities receive a relatively small sample size. It is important to point out that a naive allocation method would assign the sample size \(N_i\) as \(\frac{L_i}{L}\cdot N\), where \(L_i\) is the number of possible graphs contained in stratum \(\varDelta _i\), and L is the total number of possible graphs. This naive method is inferior to the allocation we use, because it implicitly assumes that every possible graph has the same existence probability, ignoring the different probabilities of the possible graphs.
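A one-line realization of Eq. 9 is shown below; rounding is one simple way to obtain integer sample sizes (the paper does not prescribe a rounding rule).

```python
# Sketch of proportional sample allocation (PSA, Eq. 9): each stratum
# receives a share of the total sample size N proportional to its
# population probability pi_i.

def allocate_samples(stratum_probs, N):
    """stratum_probs: list of pi_i values summing to one."""
    return [round(pi * N) for pi in stratum_probs]

# e.g. allocate_samples([0.5, 0.25, 0.125, 0.125], 8) -> [4, 2, 1, 1]
```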

Now, we describe our stratum partition methods in detail. To begin with, we select \(\gamma \) edges from the edge set E of an uncertain graph \(\mathcal {G}\) to partition the space into \(w^{\gamma }\) subspaces, assuming that each edge has w distinct weights. Note that \(w^{\gamma }\) is the largest number of subspaces that can be partitioned by \(\gamma \) edges. For instance, with \(\gamma =2\) in the above example, two edges produce at most \(w^{\gamma }\) = \(2^2\) = 4 subspaces. This naive partition is simple, but it has a limitation: a small change of \(\gamma \) causes a significant (exponential) change in the number of subspaces, so it is hard to achieve a flexible sample space partition. On the other hand, given \(\gamma \) edges, the minimum number of subspaces that can be partitioned by those \(\gamma \) edges is \(\gamma \cdot (w-1)+1\).

As depicted in Tables 6 and 7, two (three) edges can partition the previous example uncertain graph into 3 (4) subspaces. Note that the notation \(\star \) indicates that the weight of an edge e is not specified, i.e., all the weights of the edge e are allowed in the corresponding space. In addition, the 3 subspaces in Table 6 indeed form a partition of the whole sample space, since they are pairwise disjoint and their union is the whole space. In other words, the \(\gamma \cdot (w-1)+1\) subspaces into which the \(\gamma \) edges divide the space must be pairwise disjoint, with their union forming the whole space. Not every collection of \(\gamma \cdot (w-1)+1\) possible graph sets forms a partition of the whole sample space. For example, instead of \(\varDelta '_1, \varDelta '_2\), and \(\varDelta '_3\) in Table 6, assume that there are three other possible graph sets (denoted as \(P_1, P_2,\) and \(P_3\)) w.r.t. \(\gamma = 2\), such that \(P_1\) is identified by \(\langle e_1(v_a), e_2(v_a)\rangle \), \(P_2\) is labeled by \(\langle e_1(v_b), e_2(\star )\rangle \), and \(P_3\) is denoted by \(\langle e_1(\star ), e_2(v_b)\rangle \). These three possible graph sets cannot form a partition of the whole sample space, because the possible graphs with edge weights \(e_1(v_b)\) and \(e_2(v_b)\) would belong to \(P_2\) and \(P_3\) simultaneously. In other words, one possible graph may be placed into two possible graph sets, which violates the disjointness of the partition.
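One valid construction of such a \(\gamma \cdot (w-1)+1\) partition, consistent with the disjointness argument above (a sketch, not necessarily the exact scheme behind Tables 6 and 7), is the following:

```python
# Sketch of a gamma*(w-1)+1 partition: one stratum fixes all selected
# edges to their first weight; for each edge i and each of its remaining
# w-1 weights v, one stratum fixes edges 0..i-1 to their first weight,
# edge i to v, and leaves later edges unspecified ('*'). The strata are
# pairwise disjoint and cover the whole sample space.

def minimal_partition(edge_weights):
    """edge_weights: list of weight lists for the gamma selected edges."""
    gamma = len(edge_weights)
    strata = [tuple(ws[0] for ws in edge_weights)]   # all-first-weight stratum
    for i, ws in enumerate(edge_weights):
        prefix = tuple(edge_weights[j][0] for j in range(i))
        for v in ws[1:]:
            strata.append(prefix + (v,) + ('*',) * (gamma - i - 1))
    return strata

# gamma = 2, w = 2 yields three disjoint strata, matching the count in
# Table 6: [(va, va), (vb, '*'), (va, vb)]
```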

As \(\gamma \) increases, it has only a linear impact on \(\gamma \cdot (w-1)+1\), the minimum number of subspaces generated. However, we also notice that the partition into only \(\gamma \cdot (w-1)+1\) subspaces limits the diversity of the sampling, since only a small group of subspaces is considered by the sampling algorithm, which is not desirable for \(\hbox {UG-R}k\hbox {NN}\) retrieval. In order to maximize the advantages of the two simple partition approaches above while minimizing their limitations, ASSP adopts two tunable parameters \(\gamma _1\) and \(\gamma _2\). It selects \((\gamma _1 + \gamma _2)\) edges such that the space is partitioned into \(w^{\gamma _1}\cdot (\gamma _2\cdot (w-1)+1)\) subspaces. Note that the partition based on \((\gamma _1 + \gamma _2)\) edges is superior to the above two partitions based on \(\gamma \) edges, which will be further confirmed by our experimental studies presented in Sect. 7.

Next, we want to emphasize that ASSP is a recursive stratified sampling algorithm. Specifically, in every recursion, ASSP chooses \((\gamma _1 + \gamma _2)\) unselected edges from the edge set E of an uncertain graph, and partitions the current sample space into \(w^{\gamma _1}\cdot (\gamma _2\cdot (w-1)+1)\) subspaces, as discussed above. It is important to note that there are at most \((\gamma _2-1)\) edges with unspecified edge weights among the selected \((\gamma _1 + \gamma _2)\) edges. For example, when \(\gamma =\gamma _2 = 2\) (3) in Tables 6 and 7, there is one edge (there are two edges) with unspecified weights among the selected \(\gamma \) edges. Without loss of generality, if an edge is selected in one recursion but its weight is still unspecified, it is regarded as an unselected edge in the next recursion. The recursion (i.e., the sample space partition) terminates when the size of E is smaller than \((\gamma _1 + \gamma _2)\), or when the allocated sample size w.r.t. the current space is smaller than a given sample size threshold. When the final smallest subspaces \(\varDelta _i\) are obtained, Monte Carlo sampling is used to capture \(\pi _i\cdot N\) possible graphs within each subspace \(\varDelta _i\) for \(\hbox {UG-R}k\hbox {NN}\) search, where N is the total sample size.

In addition, we also want to highlight that, in order to achieve more adaptive diversity, ASSP sorts all the edges of an uncertain graph with the breadth-first search (BFS) edge selection strategy, which invokes BFS starting from the query object q to order the edges; the order is maintained in an edge list \(E_s\). Hence, in every recursion, ASSP first chooses \(\gamma _1\) unselected edges from \(E_s\) and then another \(\gamma _2\) unselected edges from \(E_s\). In other words, ASSP explores the edges closer to the query node earlier. Furthermore, we decrease the value of \(\gamma _1\) and increase that of \(\gamma _2\) with the depth of the recursion, so that ASSP divides more subspaces w.r.t. the edges closer to the query.
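A minimal sketch of this BFS edge selection strategy (an adjacency-list representation is assumed):

```python
from collections import deque

# Sketch of the BFS edge selection strategy: order edges by a
# breadth-first traversal from the query node q, so that ASSP
# stratifies on edges close to q first.

def bfs_edge_order(adj, q):
    order, seen_nodes, seen_edges = [], {q}, set()
    queue = deque([q])
    while queue:
        n = queue.popleft()
        for m in adj[n]:
            e = frozenset((n, m))
            if e not in seen_edges:
                seen_edges.add(e)
                order.append(e)
            if m not in seen_nodes:
                seen_nodes.add(m)
                queue.append(m)
    return order    # the sorted edge list E_s
```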

Algorithm 6 ASSP pseudo-code

The pseudo-code of ASSP is given in Algorithm 6. It aims to find the candidate data objects with their \(\hbox {R}k\hbox {NN}\) probabilities, taking as inputs the essential uncertain graph \(\mathcal {G}_e=(V, E, W, P)\), a data node set S, a query object q, an edge list \(E_s\) that maintains the edges in E sorted by the BFS strategy from q, the string representation X of the states (i.e., the weights) of the selected edges, the sample size N, the sample size threshold \(N_t\), and the parameters k, \(\gamma _1\), and \(\gamma _2\). Specifically, if \(|E_s| < \gamma _1+\gamma _2\) or \(N < N_t\), ASSP captures N sample graphs from the sample space of the stratum denoted by X via Monte Carlo sampling, and updates the \(\hbox {R}k\hbox {NN}\) probability of every data object through the lazy-EP algorithm (lines 1–11). Otherwise, ASSP takes the first \(\gamma _1\) unselected edges and the next \(\gamma _2\) unselected edges from \(E_s\), maintained in \(E_1\) and \(E_2\), respectively (lines 13–14). It is important to note that the list \(E_s\) is static during sampling, without edge deletion or insertion. One simple way to identify the unselected edges in \(E_s\) is to compare X with \(E_s\): the first edge in \(E_s\) that does not appear in X is the first of the \(\gamma _1\) edges in the current recursion (and should be inserted into \(E_1\)). Then, for each state vector of the \(\gamma _1\) edges in \(E_1\) (there are \(w^{\gamma _1}\) state vectors in total) and each state vector of the \(\gamma _2\) edges in \(E_2\) (there are \(\gamma _2\cdot (w-1)+1\) state vectors in total), we maintain their string representation Y (line 18), and derive the corresponding existence probability \(\pi _Y\) of Y (line 19). After updating the input parameters as \(X=X+Y\), \(\gamma _1= \max \{\gamma _1-1, 0\}\), \(\gamma _2=\gamma _2+1\), and \(N=\pi _Y\cdot N\), ASSP is invoked recursively (line 21).
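The following skeleton mirrors the recursion just described in compressed form; prob_of, state_vectors, and mc_in_stratum are placeholders for the routines of lines 18–19 and 1–11, and X is modeled here as a dict from edges to fixed weights rather than the paper's string encoding.

```python
# A compressed skeleton of the ASSP recursion (Algorithm 6). The real
# algorithm enumerates the w^{gamma_1} state vectors of E_1 and the
# gamma_2*(w-1)+1 state vectors of E_2, then recurses with the adapted
# parameters gamma_1 - 1 and gamma_2 + 1 and the allocated sample size.

def assp(E_s, X, N, N_t, g1, g2, prob_of, state_vectors, mc_in_stratum):
    unselected = [e for e in E_s if e not in X]
    if len(unselected) < g1 + g2 or N < N_t:
        return mc_in_stratum(X, max(int(N), 1))   # leaf: plain Monte Carlo
    E1, E2 = unselected[:g1], unselected[g1:g1 + g2]
    results = []
    for Y in state_vectors(E1, E2):               # one Y per substratum
        pi_Y = prob_of(Y)                         # existence prob. of Y
        results.append(
            assp(E_s, {**X, **Y}, pi_Y * N, N_t,
                 max(g1 - 1, 0), g2 + 1,
                 prob_of, state_vectors, mc_in_stratum))
    return results
```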

In the last part of this section, we analyze our TripS algorithm (including the ASSP technique) and explain how to adapt our newly proposed algorithms to support the bichromatic \(\hbox {UG-R}k\hbox {NN}\) query, in order to demonstrate the flexibility and generality of our algorithms.

First, in the TripS algorithm, the ASSP sampling technique needs to estimate a population proportion [38], according to the \(\hbox {UG-R}k\hbox {NN}\) query problem in Definition 4. This is because the probability \(\Pr (o)\) of an object o being an answer object is the (probability-weighted) proportion of possible graphs G on which o is an answer object. It is guaranteed that the ASSP estimator is unbiased, and that the variance of ASSP is at least as good as that of Monte Carlo sampling, as stated in Theorems 1 and 2, respectively.

Theorem 1

The ASSP estimator, denoted as \(\hat{F}_\mathrm{ASSP}\), is unbiased, i.e., \(\mathbb {E}({\hat{F}}_\mathrm{ASSP})\) = F.

Proof

The detailed proof can be found in Appendix A. \(\square \)

Lemma 10

Let \(N_i\) be the number of samples in stratum i, \(L_i\) be the population size of stratum i, and T be the total number of strata produced by stratified sampling. The variance of ASSP, denoted as \({\hbox {Var}}({\hat{F}}_\mathrm{ASSP})\), can be derived as

$$\begin{aligned} \mathrm{{Var}}({\hat{F}}_\mathrm{ASSP})=\sum _{i=1}^{T} \pi _i^{2}\cdot \frac{L_i-N_i}{L_i-1}\cdot \frac{{\hat{F}}_\mathrm{ASSP}(1-{\hat{F}}_\mathrm{ASSP})}{N_i}. \end{aligned}$$

Proof

The detailed proof can be found in Appendix B. \(\square \)

Theorem 2

The variance of ASSP, i.e., \(\mathrm{{Var}}(\hat{F}_\mathrm{ASSP})\), is at least as good as that of Monte Carlo sampling, i.e., \(\mathrm{{Var}}(\hat{F}_\mathrm{MC})\). Formally, \(\mathrm{{Var}}(\hat{F}_\mathrm{ASSP})\le \mathrm{Var}(\hat{F}_\mathrm{MC})\).

Proof

The detailed proof can be found in Appendix C. \(\square \)

Time complexity of TripS The \(\hbox {R}k\hbox {NN}\) query on each possible graph takes \(O(|V|^2 \cdot \lg |V|+|V|\cdot |E|)\) time. Let N be the total number of samples. Since sampling N possible graphs needs at most \(O(N\cdot |E|)\) time as discussed in [24], the time complexity of ASSP is \(O(N\cdot (|V|^2 \cdot \lg |V|+|V|\cdot |E|))\). Therefore, the time complexity of the TripS algorithm is \(O(|V|^2 \cdot \lg |V|+|V|\cdot |E|)\) \(+\) \(O(N\cdot (|V|^2 \cdot \lg |V|+|V|\cdot |E|))\) = \(O(N\cdot (|V|^2\cdot \lg |V|+|V|\cdot |E|))\).

Bichromatic UG-R k NN search Bichromatic \(\hbox {UG-R}k\hbox {NN}\) query is a powerful tool in many real-life applications, including business decision and resource allocation in complex networks, as explained in Sect. 1. Thus, we further explore bichromatic \(\hbox {UG-R}k\hbox {NN}\) search, as defined in Definition 8 based on bichromatic \(\hbox {R}k\hbox {NN}\) retrieval over deterministic graphs described in Sect. 1.

Definition 8

(Bichromatic UG-R k NN query) Given an uncertain graph \(\mathcal {G}=\)(V, E, \(\varOmega \), P), two relevant data node sets \(S_1 \), \(S_2\subseteq V\), a user-specified parameter \(\theta \) (\(0<\theta \le 1\)), and a query node \(q\in V\), a bichromatic \(\hbox {UG-R}k\hbox {NN}\) query returns the objects \(s_1\in S_1\) that have q as one of their k-nearest neighbors from \(S_2\) with the probability being no smaller than \(\theta \). Formally,

$$\begin{aligned} \mathrm {UG}\text {-}\mathrm {bR}k\mathrm {NN}(\mathcal {G},~q) =\Big \{s_1\in S_1 \;\Big |\; \sum _{G\in \varPhi (\mathcal {G}):\, s_1\in \mathrm {bR}k\mathrm {NN}(G,~q)}\Pr (G) \ge \theta \Big \} \end{aligned}$$

It is worth noting that the presented SDP and TripS algorithms can be easily extended to support bichromatic \(\hbox {UG-R}k\hbox {NN}\) search by implementing the following two changes. First, let kNN\((G, n, S_2)\) denote node n's k-nearest neighbors in \(S_2\) on a graph G, and \(d_{G}^k(n, S_2)\) represent the distance from node n to its kth nearest neighbor in \(S_2\) on G. For bichromatic \(\hbox {UG-R}k\hbox {NN}\) retrieval, we only need to replace kNN(G, n, S) and the distance \(d_{G}^k(n, S)\) used in the SDP and TripS algorithms with kNN\((G, n, S_2)\) and \(d_{G}^k(n, S_2)\), respectively. Second, to answer bichromatic \(\hbox {UG-R}k\hbox {NN}\) search, the extended SDP focuses on the set \(S_1\) only, to find the objects whose probability of being answer objects is no smaller than \(\theta \). The extended SDP and TripS algorithms support bichromatic \(\hbox {UG-R}k\hbox {NN}\) queries efficiently, as demonstrated in Sect. 7.4.
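The substitution can be summarized by the following hedged sketch of the per-graph bichromatic check, where dist and knn_dist stand in for the distance primitives used by SDP and TripS:

```python
# Sketch of the bichromatic adaptation: kNN distances are taken w.r.t.
# S_2, while candidate answers are drawn from S_1 only.

def brknn_on_graph(G, q, S1, S2, k, dist, knn_dist):
    answers = set()
    for s1 in S1:
        # q answers s1 iff q is no farther than s1's k-th NN in S_2
        if dist(G, s1, q) <= knn_dist(G, s1, S2, k):
            answers.add(s1)
    return answers
```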

7 Experimental evaluation

In this section, we present a comprehensive experimental evaluation. The objectives are threefold. First, we would like to verify the efficiency of the exact SDP algorithm for \(\hbox {UG-R}k\hbox {NN}\) query, and three pruning heuristics, i.e., GSP, EGR, and PBP. Second, we want to demonstrate the high performance of TripS algorithm. Finally, we would like to evaluate the performance of our proposed algorithms for bichromatic \(\hbox {UG-R}k\hbox {NN}\) search. In what follows, we first present our experimental settings, and then, we report the experimental results and our findings. All algorithms were implemented in C++, and all experiments were conducted on an Intel Core 2 Duo 2.10GHz Virtual Machine with 16GB RAM, running Microsoft Windows 7 Professional Edition.

Table 8 Statistics of the datasets used

7.1 Experimental setup

Datasets: four real datasets, viz., North America (NA), DBLP, Facebook, and STRING, are used in our experiments, representing real uncertain traffic networks, co-authorship networks, social networks, and PPI networks, respectively, as summarized in Table 8. Note that, in order to show the popularity of the datasets, Table 8 also cites the prior work that used them.

  • NA contains 175,813 nodes and 179,179 edges, representing the locations and the road segments between two adjacent locations of North America, respectively.

  • DBLP is a weighted collaboration network with 1,226,749 nodes and 5,557,805 edges. Each node denotes an author. The edge weight is set to the Jaccard distance of the published papers of the two nodes, which is based on the intuition that the more the papers they co-authored, the closer the relationship between the two authors is.

  • Facebook includes a list of user-to-user links from the Facebook New Orleans network, in total 4,974 users and 97,754 links. Every node signifies one anonymized user identifier.

  • STRING is a prediction database derived from MINT, HPRD, BIOGRID, and INTACT, where each node represents a protein, and each edge stands for an interaction between two proteins with a confidence score that is used as the probability value in our study. In our experiments, we extract 45,340 proteins with 10,000,000 interactions.

We also generate a synthetic Erdos–Renyi (ER) random graph with 12,500 vertices and 50,000 edges. Following the popular methodology used in previous work [16, 29, 54, 59], we simulate three possible weights for every edge, with existence probabilities drawn under several typical distributions. Specifically, the weight distribution \({\hbox {dtr}}_w\) can follow the Uniform (\(U_w\) for short) distribution or the Gaussian (\(G_w\) for short) distribution, with the mean set to the original weight \(\omega \) (in the original graphs) and the variance \(\sigma ^2\) set to 0.1. For the uniform distribution with mean \(\omega \) and variance \(\sigma ^2\), the weights lie uniformly in the range (\(\omega - \sqrt{3}\sigma \), \(\omega + \sqrt{3}\sigma \)), and thus it is abbreviated as Uniform(\(\omega - \sqrt{3}\sigma \), \(\omega + \sqrt{3}\sigma \)); the Gaussian distribution with mean \(\omega \) and variance \(\sigma ^2\) = 0.1 is abbreviated as Gaussian(\(\omega \), 0.1). In addition, we use a simple normalization approach to generate the weight probabilities. It first draws the probabilities from the probability distribution \({\hbox {dtr}}_p\), either the Uniform (\(U_p\) for short) distribution or the Gaussian (\(G_p\) for short) distribution with mean 0.5 and variance 0.25, and then normalizes the weight probabilities of each edge to sum up to one. Note that another simple method is to allocate the probabilities uniformly, and more elaborate techniques can also be employed to set the probabilities. It is worth noting that, as confirmed by the experimental study later, the performance of our proposed heuristics and algorithms is not very sensitive to the probability setting.
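For reproducibility, a sketch of this generation protocol is given below; the treatment of negative Gaussian draws is our own assumption, since the paper only states that the probabilities are normalized.

```python
import math, random

# Sketch of the dataset generation protocol: three possible weights per
# edge, drawn from Uniform(w - sqrt(3)*sigma, w + sqrt(3)*sigma) or
# Gaussian(w, sigma^2), with probabilities drawn from the chosen
# distribution and normalized to sum to one.

def gen_uncertain_edge(omega, sigma=math.sqrt(0.1), dtr_w="U", dtr_p="G"):
    if dtr_w == "U":
        lo, hi = omega - math.sqrt(3) * sigma, omega + math.sqrt(3) * sigma
        weights = [random.uniform(lo, hi) for _ in range(3)]
    else:
        weights = [random.gauss(omega, sigma) for _ in range(3)]
    if dtr_p == "U":
        raw = [random.random() for _ in range(3)]
    else:
        # mean 0.5, variance 0.25 (std 0.5); abs() guards against
        # negative draws -- an assumption on our part
        raw = [abs(random.gauss(0.5, 0.5)) for _ in range(3)]
    total = sum(raw)
    return weights, [p / total for p in raw]    # normalized probabilities
```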

Algorithms: there are in total six algorithms evaluated in our experiments, as described below.

  • Baseline (i.e., Algorithm 1).

  • SDP (i.e., Algorithm 4). It employs three pruning heuristics including GSP, EGR, and PBP to get the exact \(\hbox {UG-R}k\hbox {NN}\) query result.

  • TripS (i.e., Algorithm 5). It utilizes GSP heuristic to prune the graph scale and then uses ASSP sampling to derive the query result.

  • GMC (GSP + Monte Carlo sampling). It first prunes the graph scale by our proposed GSP heuristic, and then, it derives the query result using Monte Carlo sampling based on [10].

  • GR-I (GSP + RSS-I). It employs GSP first and then uses the RSS-I technique [24, 25] to get the query result.

  • GR-II (GSP + RSS-II). It gets the query result via utilizing the GSP and RSS-II techniques [24, 25].

It is worth noting that RSS-I and RSS-II are the two special cases discussed in Sect. 6, where RSS-I partitions the whole space into \(w^{\gamma }\) subspaces, and RSS-II partitions the whole space into \(\gamma \cdot (w-1)+1\) subspaces with w being the number of different weights for an edge. In addition, we set by default k = 2, \(\theta \) = 0.5, d = 0.8, \(\gamma _1\) = 20, and \(\gamma _2\) = 30 for ASSP, \(\gamma \) = 50 for RSS-I and RSS-II, and sample threshold \(N_t\) =10 in every set of experiments. Unless mentioned otherwise, we report the average query execution time of 20 queries issued at random nodes.

Parameters: we explore several factors in our experiments, including the weight/probability distribution dtr, the graph size (i.e., the number of edges |E|), the probability threshold \(\theta \), the data density d (i.e., |S| / |V|, the ratio of the number of nodes containing data objects to the total number of nodes in the graph), the average node degree \(\rho \) (i.e., |E| / |V|, the ratio of the number of edges to the number of nodes), the query parameter k, and the sample size N. The settings of all these parameters are summarized in Table 9, where the default values are shown in bold, and \(|D_{E}|\) denotes the original number of edges of each graph dataset. In every set of experiments, we vary only one parameter, with the rest set to their defaults.

Table 9 Parameter ranges and default values

Metrics: we employ the average query time and the pruning rate as the main performance metrics. Specifically, we employ the pruning rates \(R_{\mathsf{GSP}}\), \(R_{\mathsf{EGR}}\), and \(R_{\mathsf{PBP}}\) w.r.t. Heuristic 1 (GSP), Heuristic 2 (EGR), and Heuristic 3 (PBP), respectively. Note that the three pruning rates are calculated individually, and they measure the pruning of the graph scale, the number of possible graphs, and the number of data nodes, respectively. The pruning rates are defined as follows.

  • The pruning rate \(R_{\mathsf{GSP}}\) is the ratio of the number of nodes pruned by GSP to the total number of nodes |V| in the original graph.

  • The pruning rate \(R_{\mathsf{EGR}}\) equals the ratio of the total number of equivalent graphs (defined in Definition 6), i.e., \(|\bigcup _{G}\mathrm {D}^c_{G}|\), to the number of possible graphs w.r.t. the essential uncertain graph \(\mathcal {G}_e\).

  • The pruning rate \(R_{\mathsf{PBP}}\) is the ratio of the number of data nodes identified (discarded/qualified) by PBP (denoted as Pruned(s)) to the number of data nodes |S|.

In addition to the pruning rates, for these three heuristics we also report the average execution time (in seconds) of GSP (denoted by GSP-time), the number of times EGR executes successfully (denoted by #EGR), i.e., the number of possible graphs pruned by EGR, \(|\bigcup \mathrm D^c|\), as well as the number of times PBP executes successfully (denoted by #PBP), i.e., the number of data nodes pruned by PBP, \(|{\hbox {Pruned}}(s)|\).

Fig. 6 \(\hbox {UG-R}k\hbox {NN}\) cost versus distribution on NA

Furthermore, we introduce three different metrics to measure the accuracy of sampling algorithms, including the \(F_1\)-score, the root mean square error (RMSE), and the relative variance. Specifically, the \(F_1\)-score is calculated by \(\frac{2PR}{P+R}\) where precision \(P = \frac{TP}{TP+FP}\) and recall \(R = \frac{TP}{TP+FN}\). Here, TP, FP, and FN denote the number of true positives, false positives, and false negatives for a sampling algorithm, respectively.

The RMSE metric is leveraged to measure the difference between the result derived by sampling algorithms and the real result (derived by exact algorithm SDP). The smaller the RMSE value, the smaller the difference between the result of sampling algorithm and the real result. For every query node, let \(\mathbf {z_0}\) denote the exact probability vector of the |S| data nodes to be \(\hbox {UG-R}k\hbox {NN}\) objects, and \(\mathbf {z_i}\) be the probability vector derived by a sampling algorithm at the ith time (\(i = 1, 2, \ldots , 100\)). The RMSE of the result derived by sampling algorithms w.r.t. the real result, i.e., the difference between \(\mathbf {z_0}\) and all \(\mathbf {z_i}\)s, is defined as \(\sqrt{\frac{\sum _{i=1}^{100}\sum _{j=1}^{|S|}(\mathbf {z_i}[j]-\mathbf {z_0}[j])^2}{100}}\), in which \(\mathbf {z_i}[j]\) is the jth dimensional value of \(\mathbf {z_i}\), i.e., the probability of the jth data node in S to be a \(\hbox {R}k\hbox {NN}\) object.

In addition, as mentioned in Sect. 6, the MSE metric becomes the variance for unbiased algorithms including Monte Carlo sampling and TripS. Hence, following the previous work on uncertain graphs [16, 24, 25], we employ the relative variance to further evaluate the performance of sampling algorithms, since the exact variances of the algorithms are intractable due to the \(\sharp \)P-hard complexity of \(\hbox {UG-R}k\hbox {NN}\) search. The relative variance of a sampling algorithm (denoted as \(\hat{F}\)) is defined as \(\sigma _{\hat{F}}/\sigma _{MC}\), where \(\sigma _{\hat{F}}\) and \(\sigma _{MC}\) are the variances of \(\hat{F}\) and GMC algorithms, respectively. The smaller the relative variance, the better the performance of the sampling algorithm, as compared with Monte Carlo sampling. Let \(\mathbf {z_a}\) be the average probability vector of all \(\mathbf {z_i}\)s, \(C_i\) be the cosine distance between \(\mathbf {z_i}\) and \(\mathbf {z_a}\), and \(\bar{C}\) be the mean of all \(C_i\)s. The variance \(\sigma \) is defined as \(\frac{\sum _{i=1}^{100}(C_i-\bar{C})^2}{99}\).
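Both metrics admit direct implementations; the sketch below follows the formulas above, taking the cosine distance as one minus cosine similarity and assuming non-zero vectors.

```python
import math

# Sketch of the two accuracy metrics: RMSE of the sampled probability
# vectors z_i against the exact vector z_0, and the variance of the
# cosine distances C_i between each z_i and their average z_a.

def rmse(z0, zs):
    return math.sqrt(sum(sum((zi[j] - z0[j]) ** 2 for j in range(len(z0)))
                         for zi in zs) / len(zs))

def cosine_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def variance_of_runs(zs):
    za = [sum(col) / len(zs) for col in zip(*zs)]      # average vector z_a
    cs = [cosine_dist(zi, za) for zi in zs]
    c_bar = sum(cs) / len(cs)
    return sum((c - c_bar) ** 2 for c in cs) / (len(cs) - 1)
```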

Fig. 7 \(\hbox {UG-R}k\hbox {NN}\) cost versus uncertainty of weight on NA

Fig. 8 \(\hbox {UG-R}k\hbox {NN}\) cost versus graph size. a NA. b DBLP. c Facebook. d STRING

Fig. 9 \(\hbox {UG-R}k\hbox {NN}\) cost versus k. a NA. b DBLP. c Facebook. d STRING

Fig. 10 \(\hbox {UG-R}k\hbox {NN}\) cost versus d. a NA. b DBLP. c Facebook. d STRING

7.2 Results on exact algorithms

In the first set of experiments, we evaluate the performance of SDP under different edge weight distributions \({\hbox {dtr}}_w\) and weight probability distributions \({\hbox {dtr}}_p\). The experimental results on NA graph are shown in Fig. 6, where four distribution combinations (i.e., \(U_w/U_p\), \(U_w/G_p\), \(G_w/U_p\), and \(G_w/G_p\)) are evaluated. It is observed that the execution time of SDP algorithm, as well as the execution time of GSP heuristic, is not very sensitive to weight/probability distributions. Three heuristics GSP, EGR, and PBP are effective in most cases. This is because heuristic GSP performs similarly under different weight distributions, and it is not affected by the weight probability distributions. Thus, without loss of generality, in the following experiments, the edge weights follow Uniform (\(U_w\)) distribution, and the probabilities follow Gaussian (\(G_p\)) distribution by default (i.e., \({\hbox {dtr}}_w=U_w\) and \({\hbox {dtr}}_p=G_p\)).

Moreover, in order to further verify the effect of high weight uncertainty on the algorithm performance, we generate the weights following Uniform(1, 10), Uniform(1, 100), and Uniform(1, 1000) over the NA graph, where the probability follows Gaussian(0.5, 0.25) by default. As depicted in Fig. 7, one can see that as the weight uncertainty increases, the execution time of the SDP algorithm grows remarkably. This is because the power of GSP weakens under high uncertainty, and therefore, the number of possible graphs we have to evaluate grows exponentially, resulting in considerable overhead for query processing. Notably, the sharply increasing number of possible graphs to be evaluated (as the weight uncertainty ascends) is also confirmed by the remarkable growth in the number of possible graphs pruned by EGR, while the pruning rate of EGR does not vary significantly.

The second set of experiments compares the performance of the Baseline and SDP algorithms for \(\hbox {UG-R}k\hbox {NN}\) search with various graph sizes, as depicted in Fig. 8. To vary the graph size, we extract connected subgraphs from the original graphs as the experimental graphs. The extraction follows the same method used in [29], which expands the graph from a randomly selected vertex via breadth-first traversal. Obviously, SDP outperforms Baseline significantly when the number of edges exceeds 10, and the average execution time of Baseline is around \(10^5\) s when the number of edges is 20. The reason is that the average query cost of Baseline grows exponentially as the number of edges increases, which is consistent with the complexity of Baseline, i.e., \(O(w^{|E|}\cdot (|V|^2\cdot \lg |V|+|V|\cdot |E|))\). Hence, the time cost of Baseline is unacceptable when the number of edges exceeds 20. In other words, Baseline is impractical on large graphs. In contrast, SDP is equipped with three effective heuristics, GSP, EGR, and PBP, which boost the performance of \(\hbox {UG-R}k\hbox {NN}\) search.

Another observation is that GSP and EGR become more powerful as graphs grow larger in most cases; when the graph is small, there is little room for improvement. GSP takes very little time to prune more than 99% of the nodes of the original big graphs, meaning that at most 1% of the nodes remain in the essential uncertain graphs. In addition, the power of PBP drops with the growth of the graph size in most cases. The reason is that when GSP and EGR are less effective on smaller graphs, PBP has the chance to prune more data nodes; with the increasing power of GSP and EGR, fewer data nodes remain for PBP in \(\hbox {UG-R}k\hbox {NN}\) retrieval.

The third set of experiments reports the experimental results with k varying from 1 to 8, as depicted in Fig. 9. It is worthwhile to point out that due to its high complexity and poor performance, Baseline is impractical on these graphs. Hence, we focus on the SDP algorithm only in this and the remaining sets of experiments. Evidently, the average query time of SDP ascends significantly with the growth of k. The reason is that as k grows, the number of nodes in the essential uncertain graph formed by GSP becomes larger; therefore, the number of possible graphs we have to evaluate grows exponentially. This is also consistent with our analysis presented in Lemma 7 of Sect. 3, where the pruning rate \(R_{\mathsf{GSP}}\) drops with the growth of k. When k turns large, e.g., k = 4 and 8, the average query time of the exact algorithm is extremely long (even more than \(10^6\) s), which is beyond acceptable limits. Notably, the first step of the SDP algorithm, i.e., the GSP heuristic, still completes in full. It is important to note that in Fig. 9, we only plot the results (i.e., the CPU time and the pruning rates) when the average query time is acceptable; the corresponding pruning rates of EGR and PBP are denoted by a dash (“–”) if the time cost is unacceptable.

Fig. 11 \(\hbox {UG-R}k\hbox {NN}\) cost versus \(\theta \). a NA. b DBLP

Fig. 12 \(\hbox {UG-R}k\hbox {NN}\) cost versus \(\rho \) on ER

Table 10 Average query time (in seconds) of SDP and TripS algorithms under different k values

The fourth set of experiments verifies the influence of the data density d on the efficiency of SDP. The average query time of SDP and the pruning performance of the three heuristics on different datasets are depicted in Fig. 10, with d varying from 0.7 to 1.0. Recall that the data density d is the ratio of the number of nodes containing (relevant) data objects to the total number of nodes, i.e., |S| / |V|. It is observed that the average query time drops as d grows. The reason is that there are more data objects on the graphs with the growth of d. Therefore, for a given k, the distance from every node n to its kth nearest neighbor becomes smaller, resulting in an essential uncertain graph of smaller size. Thus, the query performance improves. It is obvious that all three heuristics are consistently effective in all cases.

The fifth set of experiments investigates the impact of parameter \(\theta \) on \(\hbox {UG-R}k\hbox {NN}\) search, with the results plotted in Fig. 11. It is observed that the average query time of the algorithms is not very sensitive to \(\theta \). This is because \(\theta \) only affects the performance of PBP heuristic, which is employed in the final refinement step of \(\hbox {UG-R}k\hbox {NN}\) query. Moreover, all the heuristics demonstrate significant pruning power in all cases. Note that due to the similar performance under various \(\theta \)s, we only report the corresponding results on NA and DBLP for clarity and simplicity.

We also study the effect of the graph structure on the performance of SDP. We show the average query time of SDP and the pruning efficiency of the heuristics on the ER dataset in Fig. 12 by changing the average node degree \(\rho \). Recall that the average node degree \(\rho \) denotes the ratio of the number of edges to the number of nodes, i.e., |E| / |V|. Specifically, we set \(\rho \) to 1, 2, 3, 4, and 5, respectively, by selecting/adding the corresponding proportion of edges from every original graph (with a given node set V). Clearly, the average query time becomes larger with a higher node degree \(\rho \). This is because, for a specified number of nodes, the number of edges grows with \(\rho \), incurring a large number of possible graphs during search; the average query time increases accordingly. This phenomenon is consistent with our analysis presented in Lemma 7 of Sect. 3, where the pruning rate \(R_{\mathsf{GSP}}\) decreases with the growth of \(\rho \). Also, we again observe that all three heuristics perform efficiently in all cases.

In summary, the SDP algorithm performs well compared with Baseline, benefiting from the three effective heuristics GSP, EGR, and PBP.

7.3 Results on sampling algorithms

In this subsection, we verify the efficiency of the TripS algorithm for \(\hbox {UG-R}k\hbox {NN}\) search. First, we present the experimental results of the SDP and TripS algorithms with the sample size being 1000 under various k values in Table 10. We observe that TripS outperforms SDP by several orders of magnitude on all datasets. This is because TripS answers the \(\hbox {UG-R}k\hbox {NN}\) query approximately by sampling 1000 possible graphs (instead of evaluating all the possible graphs). The processing time increases with the growth of k, for the same reason as explained for Fig. 9. Note that in Table 10, some results of SDP are denoted by a dash (“–”), meaning that the corresponding query time of the exact SDP algorithm is unacceptably long.

Table 11 RMSE of the result from sampling algorithms w.r.t. the exact result
Fig. 13 \(\hbox {UG-R}k\hbox {NN}\) search versus N. a NA. b DBLP. c Facebook. d STRING

Second, we evaluate the accuracy of the sampling algorithms (including GMC, GR-I, GR-II, and TripS) by three metrics, i.e., the \(F_1\)-score, the root mean square error (RMSE), and the relative variance. From the experimental results, we observe that all four algorithms have identical \(F_1\)-scores on every dataset; e.g., the \(F_1\)-score is 0.696 on the DBLP dataset with \(\theta = 0\). This is because these algorithms find the same answer objects (with different probabilities of being \(\hbox {R}k\hbox {NN}\) objects) in most cases, while the probability information is not measured by the \(F_1\)-score metric. In fact, what distinguishes these algorithms is usually the probabilities of being \(\hbox {R}k\hbox {NN}\) objects, which reveals a shortcoming of the \(F_1\)-score metric for sampling algorithms. Due to this observation, we skip the experimental results w.r.t. the \(F_1\)-score metric.

Table 11 lists the RMSE between the result of sampling algorithms and the real result. Notice that TripS consistently generates a smaller RMSE than its competitors, because of the adaptive recursive stratified sampling used in TripS. In other words, TripS returns the result that is the closest to the real result.

Figure 13 shows the relative variances and the average query time of sampling algorithms with varying sample size N. Note that we use the line (bar) to represent the relative variance (query time). Clearly, under the comparable average query time, TripS consistently achieves a higher accuracy (indicated by a lower relative variance) than GMC, GR-I, and GR-II. This is because TripS shares the same time complexity with the competitors, and thus, their runtime is similar, but TripS adopts an adaptive recursive stratified sampling technique to achieve a lower relative variance. It also signifies that our ASSP technique employed in TripS is superior to its competitors including Monte Carlo sampling, RSS-I sampling, and RSS-II sampling [24, 25]. In addition, the relative variance drops (i.e., the accuracy improves) and the query time increases with the increasing sample size. The reason is that more \(\hbox {R}k\hbox {NN}\) queries need to be conducted when there are more sampled possible graphs, resulting in more expensive query cost and higher accuracy.

To sum up, TripS performs the best for \(\hbox {UG-R}k\hbox {NN}\) search. In addition, we would like to highlight that both the \(F_1\)-score metric and the RMSE metric are inapplicable if the exact algorithm for \(\hbox {UG-R}k\hbox {NN}\) search is impractical. Thus, due to the \(\sharp \)P-hard complexity of \(\hbox {UG-R}k\hbox {NN}\) query, the relative variance is a relatively appropriate metric for sampling algorithms.

7.4 Results on bichromatic UG-RkNN search

In order to verify the efficiency of the algorithms for processing bichromatic \(\hbox {UG-R}k\hbox {NN}\) queries, we first report the average query time of SDP algorithm and the pruning rates of GSP, EGR, and PBP heuristics under default settings, as listed in Table 12. As expected, similar to the performance reported in the above (monochromatic) \(\hbox {UG-R}k\hbox {NN}\) retrieval, three pruning heuristics are always effective. Furthermore, GSP plays an important role in SDP algorithm for bichromatic \(\hbox {UG-R}k\hbox {NN}\) search.

In addition, we also report the accuracy and the average query time of the GMC, GR-I, GR-II, and TripS algorithms in Tables 13 and 14, respectively. Compared with the performance of SDP shown in Table 12, it is obvious that TripS is much faster than the SDP algorithm in all cases. From Tables 13 and 14, we can conclude that TripS supports bichromatic \(\hbox {UG-R}k\hbox {NN}\) queries with high accuracy and comparable time, compared with the state-of-the-art algorithms GMC, GR-I, and GR-II. Note that in this set of experiments, the default data node set S (i.e., \(|S|=0.8\times |V|\)) is randomly partitioned into two equal-sized node sets \(S_1\) and \(S_2\) for bichromatic \(\hbox {UG-R}k\hbox {NN}\) search. Since the complexity of the bichromatic algorithms is the same as that of the (monochromatic) \(\hbox {UG-R}k\hbox {NN}\) query, we omit the experimental results for bichromatic \(\hbox {UG-R}k\hbox {NN}\) search under different parameters.

8 Related work

In this section, we review the existing work on \(\hbox {R}k\hbox {NN}\) queries and the management of uncertain graphs.

Table 12 Results on exact bichromatic \(\hbox {UG-R}k\hbox {NN}\) search
Table 13 Comparison of the accuracy (relative variance)
Table 14 Comparison of average query time (in seconds)

8.1 Reverse k-nearest neighbor queries

\(\hbox {R}k\hbox {NN}\) search has received considerable attention in the last decade, since the concept of the reverse nearest neighbor (RNN) was first introduced in [18]. Many algorithms have been proposed for answering the RNN/\(\hbox {R}k\hbox {NN}\) query and its variants in the literature [9, 11, 12, 20, 37, 41, 43, 47,48,49,50,51]. Probabilistic \(\hbox {R}k\hbox {NN}\) retrieval on uncertain data has also been extensively studied. Lian and Chen [26] study the PRNN query, which retrieves the data objects whose probabilities of being RNNs are greater than or equal to a user-specified threshold, and propose an effective pruning method, i.e., geometric pruning (GP). Cheema et al. [6] explore the problem of RNN queries on uncertain data, and develop novel pruning rules that can effectively prune the objects that cannot be answer objects. Bernecker et al. [5] develop a general framework for probabilistic reverse nearest neighbor queries on uncertain data. Xu et al. [49] explore interval reverse nearest neighbor (IRNN) queries over moving objects, which return the objects that maintain nearest neighboring relations to the moving query objects for the longest time in a given interval; IRNN thus considers the location uncertainty of moving objects.

In addition, Yiu et al. [52] investigate the \(\hbox {R}k\hbox {NN}\) query on graphs and present a fundamental lemma that can be employed to prune the search space. Safar et al. [39] show how the Voronoi diagram can be applied to spatial query processing, and in particular to RNN search. Li et al. [22] study continuous \(\hbox {R}k\hbox {NN}\) search on road networks. They utilize a new data structure, called the DLM tree, to represent the whole monitoring region of a continuous \(\hbox {R}k\hbox {NN}\) query. Cheema et al. [7] adopt a filter-and-refinement technique to solve continuous \(\hbox {R}k\hbox {NN}\) search in Euclidean spaces and on road networks, respectively. Wang et al. [46] utilize the concept of the influence zone to address the problem of continuous monitoring of \(\hbox {R}k\hbox {NN}\) queries in road networks; the influence zone is a region in the network such that a client c is an \(\hbox {R}k\hbox {NN}\) of the query if and only if it lies inside this zone.

Note that all the aforementioned methods are unsuitable for \(\hbox {UG-R}k\hbox {NN}\) search because they either only focus on the deterministic graph (network) or only consider the uncertainty of the data but not uncertain structure of the graph.

8.2 Uncertain graphs

Recently, how to manage uncertain graphs has become an increasingly important research topic. There are four main research lines on uncertain graphs: (i) queries based on shortest path distances over uncertain graphs, such as k-nearest neighbor search on uncertain graphs [36], aggregate nearest neighbor retrieval on uncertain graphs [29], and distance-constraint reachability computation on uncertain graphs [16]; (ii) pattern mining on uncertain graphs, e.g., frequent subgraph pattern mining over uncertain graphs [8, 23, 31, 59], finding (top-k) maximal cliques on uncertain graphs [32, 58], and uncertain graph data mining [17]; (iii) subgraph/supergraph (similarity) search over uncertain graphs, such as finding reliable subgraphs in probabilistic graphs [15] and subgraph/supergraph (similarity) search on uncertain graphs [53, 55,56,57]; and (iv) keyword search over uncertain graphs [27, 54]. In addition, there are some studies on the representation of uncertain graphs [1, 34] and sampling over uncertain graphs [10, 24, 25].

We want to highlight that our work is different from the above efforts on uncertain graphs, because we focus on \(\hbox {R}k\hbox {NN}\) retrieval over uncertain graphs. This is, to the best of our knowledge, the first attempt on \(\hbox {UG-R}k\hbox {NN}\) search.

9 Conclusions

In this paper, we study the problem of reverse k-nearest neighbor queries on uncertain graphs (\(\hbox {UG-R}k\hbox {NN}\) query for short). We first introduce the GSP heuristic to replace the original uncertain graph with a much smaller essential uncertain graph, which significantly reduces the size of the uncertain graph for \(\hbox {UG-R}k\hbox {NN}\) search. The effectiveness of GSP is guaranteed by a theoretical analysis, and is confirmed via our experimental study. Then, we propose the EGR heuristic to remove equivalent possible graphs and thus further reduce the number of possible graphs; a novel concept, the graph conditional dominance relationship, is introduced to enable an efficient implementation of EGR. We also present the PBP heuristic to minimize the validation cost of the data nodes. Combining these three newly presented heuristics, we propose the SDP algorithm to support exact \(\hbox {UG-R}k\hbox {NN}\) search. We further present a novel sampling algorithm, i.e., TripS, based on the adaptive stratified sampling technique ASSP and the GSP heuristic, for \(\hbox {UG-R}k\hbox {NN}\) search. ASSP is unbiased, and its variance is at least as good as that of Monte Carlo sampling. We also generalize our heuristics to tackle bichromatic \(\hbox {UG-R}k\hbox {NN}\) retrieval. Extensive experiments using both real and synthetic graphs demonstrate the efficiency of our proposed algorithms. In the future, we plan to explore \(\hbox {UG-R}k\hbox {NN}\) search under a graph model with uncertain nodes.