1 Introduction

Query processing over uncertain data has become increasingly important due to the imperfect nature of sensing devices in many applications, such as sensor network monitoring, object identification, RFID networks, location-based services (LBS), and so on [2, 6, 14]. For instance, a modern warehouse monitoring system usually utilizes RFID techniques to monitor product positions for warehouse management (see Figure 1a). However, an RFID reader can only provide the warehouse manager with a product's positional range instead of its exact position. In this case, a product can be abstracted into an uncertain object o = 〈o.r, o.PDF(x)〉, in which o.r refers to where o may appear (i.e., its uncertain region), and o.PDF(x) refers to the probability of o appearing at location x. Accordingly, if a manager wants to know which products are located in a query region, the system reports all products having a sufficiently large chance (a threshold) of falling in the given query region (see Figure 1b).

Figure 1: Uncertain data generated by RFID readers

As another example, a product promotion system often collects personal location information, analyzes the location distribution, and adjusts its marketing strategy according to the statistical results. In this scenario, a personal location is expressed by a spatial region rather than a point for privacy protection. Therefore, personal location information is also imprecise, and is likewise expressed as uncertain data.

Among the various types of uncertain data, 2-dimensional uncertain data that update infrequently are prevalent (e.g., the two examples above). Intuitively, when handling this kind of uncertain object, the key idea is to use an efficient index to manage the objects and to provide both spatial and probabilistic bounds that are as tight as possible. In addition, the index should be general, supporting multiple types of queries such as the probabilistic range query (abbreviated as prob-range query), probabilistic kNN query, k-range query, and so on. In the rest of the paper, we use the probabilistic range query as an example to explain how our proposed index works. The other query algorithms are discussed in Section 4.

A traditional prob-range query finds the objects that appear in the query region with probability at least 𝜃 (the probabilistic threshold). Since this computation involves expensive and complex integration [11, 21], a filter-refinement framework is preferable. In the filtering phase, the uncertain objects that cannot be query results are quickly filtered out without triggering the integration. For the objects that cannot be filtered, the integration has to be performed in the refinement phase to verify the answers. Since this computation is expensive, the key to optimizing a prob-range query is to avoid computing the costly integral as much as possible [21].

Several indexes have been proposed to answer prob-range queries over uncertain data. Their key idea is to pre-compute a summary of each object's PDF (short for probability density function), augment existing index techniques to organize these summaries, and then develop a group of pruning rules for filtering [30, 31]. The existing indexes can be classified into two clusters, namely PCR-based and partition-based. PCR-based indexes (e.g., U-Tree) derive from the R-Tree [21]. They propose the PCR (short for probabilistically constrained region) for summarizing the PDF of each uncertain object, and employ an R-Tree to manage these PCRs. In addition, a group of PCR-based pruning rules is developed for disqualifying objects without having to compute their appearance probabilities [21]. However, as reviewed in Section 2.1, the filtering ability of these rules is relatively weak, which may cause much unnecessary computation.

Partition-based indexes (e.g., UI-tree [30] and UD-tree [31]) employ partitioning techniques to summarize the PDF of each uncertain object, and use an R-tree to manage the partition results. Compared with PCR-based summaries, partition-based summaries provide objects with stronger pruning/validating ability. However, the space cost of these two indexes is high, which may incur expensive extra I/O. In summary, both types of indexes have room for improvement; a more efficient index with both powerful filtering ability and low space cost is desired.

In this paper, we propose a novel index called S-MRST for indexing uncertain data. In brief, S-MRST derives from the R-tree, with two main differences. First, it associates each node of S-MRST with an extra filter, called the skeleton, which supports spatial pruning. Second, we propose a novel summary called the MRST (short for multi-resolution summary tree) to capture the PDF of uncertain data. By taking the gradient of the PDF into account, the MRST provides uncertain data with strong probabilistic pruning ability at little space cost. To summarize, our proposed S-MRST has the following three main advantages.

  • Self-adaptive boundary. We propose a novel filter called the skeleton to bound uncertain data. The self-adaptive ability of the skeleton is reflected in two aspects. First, each node in S-MRST uses an irregular shape to bound its children; the skeleton can therefore adjust its shape to cater to the local distribution of objects, tightly bound the uncertain data, and enhance the spatial pruning ability. In addition, we seek the most I/O-efficient skeleton. To achieve this goal, we build a cost model w.r.t. the skeleton and find a suitable strategy for skeleton construction.

  • Gradient-based summary. We propose a novel summary called the MRST (multi-resolution summary tree) to approximate the PDF of an uncertain object. The MRST fully considers the gradient of the PDF and thus captures the features of an object's PDF more effectively. In this way, our summary has more powerful filtering ability and consumes less space. We also propose a novel algorithm to access the MRST; following a greedy strategy, it reduces the computation cost as much as possible.

  • Support for multiple queries. We show that S-MRST can support uncertain range search as well as top-k range queries, k nearest neighbor range queries, and so on. Above all, it can be regarded as a general index supporting many kinds of queries.

The rest of this paper is organized as follows: Section 2 gives the related work and the problem definition. Section 3 proposes S-MRST. Section 4 proposes the algorithms supporting various kinds of queries. Section 5 evaluates the proposed methods with extensive experiments. Section 6 concludes the paper and discusses future work.

2 Background

Table 1 summarizes the mathematical notations used in the paper.

Table 1: Summary of notations

2.1 Related work

Real-world data often exhibit uncertainty and imprecision due to the imperfect nature of sensing devices in many applications such as sensor data analysis [9, 23–27], RFID networks [10, 22], location-based services (LBS) [16], textual databases [28, 29], and so on. Therefore, various probabilistic queries have been studied over uncertain databases [6, 15], for example, probabilistic range query [7, 8], NN search [12], RNN search [3, 4, 13], skyline query [18], top-k query [20], and so on. These queries usually retrieve the uncertain objects that satisfy the query with probability greater than a threshold. Previous works usually designed specific techniques (e.g., pruning methods, indexing) to improve query efficiency. This paper focuses on indexing uncertain data to support prob-range queries. The main existing approaches can be classified into two clusters, namely PCR-based and partition-based; besides these, there are other works that can efficiently support prob-range queries. In the following subsections, we review their key ideas.

2.1.1 PCR-based index

One type of classical index is PCR-based (e.g., U-Tree and U-catalog-tree), which constructs a group of PCRs for each object based on different probabilistic thresholds 𝜃, denoted o.PCR(𝜃). To be more specific, given an object o and a probabilistic threshold 𝜃, o.PCR(𝜃) is constructed as follows: in each dimension, two lines are computed such that o has probability 𝜃 of occurring on the outer side of each line (left, right, up, or down). The rectangle bounded by these 4 lines is then used as o.PCR(𝜃) (see Figure 2a). In order to handle queries under different thresholds, a group of PCRs is pre-constructed for each object and used as the summary of the uncertain data. In addition, PCR-based indexes use a group of R-Trees to manage the PCRs corresponding to different thresholds, where the nodes of each R-Tree are constructed based on the PCRs they index. In Figure 2b, given a node E and the PCRs of o_1 to o_3, E.pcr bounds the PCRs of the objects (e.g., the dotted rectangle in Figure 2b).
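For concreteness, the following minimal C++ sketch derives a PCR(𝜃) from an object's equally weighted sample points. The type names and the quantile-based construction are our illustration under stated assumptions, not the implementation of [21].

```cpp
#include <algorithm>
#include <vector>

struct Point { double x, y; };
struct Rect  { double x_lo, y_lo, x_hi, y_hi; };

// Sketch: derive o.PCR(theta) from equally weighted sample points.
// In each dimension we take the theta- and (1-theta)-quantiles, so o
// falls outside each line with probability theta. For theta >= 0.5
// the rectangle degenerates, matching the observation in Section 3.2.1.
Rect pcr(std::vector<Point> pts, double theta) {
    auto quantile = [&](bool byX, double q) {
        std::sort(pts.begin(), pts.end(), [&](const Point& a, const Point& b) {
            return byX ? a.x < b.x : a.y < b.y;
        });
        size_t i = static_cast<size_t>(q * (pts.size() - 1));
        return byX ? pts[i].x : pts[i].y;
    };
    return Rect{quantile(true, theta),       quantile(false, theta),
                quantile(true, 1.0 - theta), quantile(false, 1.0 - theta)};
}
```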

Figure 2: Answering prob-range queries using a PCR-based index

Given a prob-range query q, the index first selects a suitable R-Tree according to q.𝜃, then searches this R-Tree, and finally employs PCR-based pruning strategies for disqualifying/validating. Back to the example in Figure 2, let q_1 to q_3 be three prob-range queries and o an uncertain object. Under q_1, o can be pruned since o.PCR(𝜃) does not intersect q_1.r. On the other hand, under q_2, o can be validated because q_2.r completely contains the left, upper, and lower borders of o.mbr as well as the line l_3. However, these pruning strategies sometimes fail. Assume o.PDF obeys a uniform distribution; because q_3.r ∩ o.PCR(𝜃) ≠ ∅, o cannot be pruned (q.𝜃 = 0.2), although Figure 2a makes it obvious that o is not a result of q_3.

In addition, a node E of U-Tree cannot tightly bound its children, especially when the node is a leaf. In Figure 2c, E is the node that contains o_1 to o_3. The dotted rectangle, i.e., E.pcr, which bounds the PCRs of o_1 to o_3, is smaller than E.mbr. However, there is still a large blank region in the dotted rectangle, which may cause frequent false alarms. For example, as shown in Figure 2c, although q_4 is contained in the PCR-based MBR (the dotted rectangle), none of the PCRs in E overlaps q_4.r. In this case, q_4 has to scan all the children of E, wasting computing resources.

2.1.2 Partition-based index

Another type of index is partition-based. Given an object o, its summary is constructed by partitioning o.r into a group of sub-regions, pre-computing the probability of o lying in each sub-region, and using the partition as the summary of o. The partition-based indexes then use an R-tree (or quad-tree) to manage the partition of each object (Figure 3).

Figure 3: Answering prob-range queries using a partition-based index

Given a prob-range query q and a partition-based index \(\mathcal {I}\), the algorithm searches \(\mathcal {I}\) and finds the objects that overlap q.r. For each such object, the pruning strategy is to compute lower and upper bounds of app(o,q). Let P_i.app denote the probability of o lying in sub-region o_i. If a sub-region o_i of o is contained in q.r, P_i.app contributes to both the lower and the upper bound of app(o,q); if o_i.r merely overlaps q.r, P_i.app contributes only to the upper bound. After accessing all the partitions of o, if lb(o,q) ≥ q.𝜃, o is a result of q; if ub(o,q) < q.𝜃, o is disqualified; otherwise, o must be refined. Here, lb(o,q) and ub(o,q) denote the probabilistic lower and upper bounds of o lying in q.r.

Although partition-based indexes have stronger pruning ability than PCR-based indexes, their space cost is higher. The reason is that the partition strategy does not reflect the gradient of the PDF. In addition, because the filtering algorithm does not consider the intrinsic properties of q.r and the sub-regions, its pruning ability also has room for improvement.

2.1.3 Others

Kalashnikov et al. [11] propose a grid-based index named U-grid for indexing uncertain objects. U-grid is a 2-layer index: the first layer provides the query with spatial information, a probability summary, and a pointer to an entry with detailed probability information. The summary probability information stored in the second layer is based on a "virtual grid", and the second layer uses this aggregate information for pruning. However, U-grid does not provide a lower bound for the query, and its storage cost is very high. Angiulli et al. [2] develop a pivot-based index for managing uncertain data in general metric spaces, which supports distance-based range queries; nevertheless, its index size is usually much larger than that of the other indexes. Agarwal et al. [1] employ the segment tree for indexing uncertain data and provide a theoretical analysis of range search over uncertain data. Zhu et al. [32] propose R-MRST for indexing uncertain data; its problem is that the nodes of R-MRST cannot provide objects with a tight bound, which causes unnecessary computation.

2.2 Problem definition

Given a multi-dimensional probabilistic object o in d-dimensional space, it is described either continuously or discretely. In the continuous case, an object has two attributes: o.r and o.PDF, where o.r is a d-dimensional uncertain region and o.PDF(x) is the probability density of o appearing at location x. In the discrete case, o is represented by a set of sampled points {x_1, x_2, …, x_m}, and o occurs at location x_i with probability x_i.p. Given a prob-range query q, we use app(o,q) to represent the likelihood of o falling in the query region q.r; it is also computed case by case. In the continuous case, app(o,q) is computed by (1), where o.r ∩ q.r denotes the intersection of o.r and q.r, and o is a result if app(o,q) ≥ q.𝜃. In the discrete case, app(o,q) is computed by (2), where n_1 is the number of sampled points of o, and n_2 is the number of sampled points falling into o.r ∩ q.r.

$$ app(o,q)=\int_{o.r \cap q.r} o.\textsf{PDF}(x)\,dx $$
(1)
$$ app(o,q)=\sum\limits_{i=1}^{n_2} o.\textsf{PDF}(x_{i}) \Big/ \sum\limits_{i=1}^{n_1} o.\textsf{PDF}(x_{i}) $$
(2)
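As a minimal illustration of the discrete case (2), the following C++ sketch computes app(o,q) from an object's weighted samples; the struct layout is our assumption.

```cpp
#include <vector>

struct Sample { double x, y, p; };    // sampled point and its probability mass
struct Rect {
    double x_lo, y_lo, x_hi, y_hi;
    bool contains(double x, double y) const {
        return x_lo <= x && x <= x_hi && y_lo <= y && y <= y_hi;
    }
};

// Discrete case of (2): the mass of the samples inside q.r divided by
// the total mass of all samples of o.
double app(const std::vector<Sample>& o, const Rect& qr) {
    double inside = 0.0, total = 0.0;
    for (const auto& s : o) {
        total += s.p;
        if (qr.contains(s.x, s.y)) inside += s.p;
    }
    return total > 0.0 ? inside / total : 0.0;
}
```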

Definition 1 (Probabilistic Range Query)

Given a set of probabilistic objects \(\mathcal {O}\) and a range query q, the probabilistic range query retrieves all uncertain objects \(o\in \mathcal {O}\) with app(o,q) ≥ q.𝜃, where q.𝜃 is the probabilistic threshold satisfying 0 ≤ q.𝜃 ≤ 1.

Problem statement

In this paper, we aim to propose an effective indexing mechanism to facilitate range query processing. The index should have strong filtering ability and low storage cost.

3 S-MRST

In this section, we first present an overview of S-MRST. We then discuss the skeleton and the MRST, which are the two primary components of S-MRST.

3.1 S-MRST overview

As shown in Figure 4, S-MRST derives from the R-Tree. One difference is that S-MRST associates each node with a group of filters, called skeletons, which handle probabilistic queries under different thresholds. Compared with an MBR, a skeleton is a flexible, arbitrary shape that can freely adjust its boundary to cater to the local distribution of a node's children. Thus, a skeleton can tightly bound a node's children (or the children's PCRs) and avoid the false-alarm problem as much as possible.

Figure 4: The framework of S-MRST

Besides the skeleton, we propose another novel structure called the MRST (short for multi-resolution summary tree) to capture the PDF of uncertain data. Compared with state-of-the-art summaries, by taking the gradient of the PDF into consideration, the MRST provides strong pruning/validating ability and also consumes less space.

Given a set of N uncertain objects, S-MRST is constructed as follows. First, an MRST is constructed and associated with each uncertain object. Second, an MBR bounding all these objects is computed and associated with the root of the S-MRST. Then, the construction proceeds by recursively partitioning each region into smaller subregions such that the number of children in each subregion is roughly the same and less than the page size (e.g., 4 KB). Finally, we build skeletons for each node.

Given a prob-range query q, the query algorithm is similar to that of the R-Tree, with two main differences: (i) given a node e, if e.mbr overlaps q.r, the algorithm first finds a suitable skeleton according to q.𝜃, and then accesses that skeleton to decide whether to prune e; (ii) given an uncertain object o, if o.mbr overlaps q.r, the algorithm accesses the MRST of o to judge whether o is a result of q.

Take the example in Figure 4. Given the prob-range query q_1 with probability threshold 0.2 and the S-MRST \(\mathcal {I}\), the query algorithm searches \(\mathcal {I}\) for objects overlapping q_1.r. When the algorithm accesses node F, although \(q_{1}.r\cap F.mbr\neq \emptyset \), we can still conclude that no object in F is a result of q_1, since q_1 does not overlap the skeleton of F. For q_2, because \(q_{2}.r\cap o.mbr\neq \emptyset \), we access the MRST of o. As discussed in Section 3.3, the lower-bound probability of o lying in q_2.r is larger than 0.2; therefore, we report o as a result of q_2.

3.2 The skeleton

In this section, we introduce a novel filter called the skeleton. We first present the concept of a single skeleton, and then propose the construction and processing algorithms in turn. To find a suitable skeleton and further optimize the algorithm, we fully consider the filtering ability of the skeleton and the corresponding side effects (e.g., extra I/O cost). In addition, we discuss constructing a group of skeletons for handling probabilistic queries under different thresholds.

3.2.1 The single skeleton

As mentioned before, the false-alarm problem of U-Tree is severe, especially when the PCRs it manages are constructed under a high threshold. For example, when q.𝜃 ≥ 0.5, the PCRs of uncertain objects degenerate to points. Inspired by this observation, we propose the skeleton as a tighter bound for uncertain data. Note that a skeleton can bound many types of objects (e.g., points, rectangles, circles, and so on); in this section, we use points as the example for discussion.

The skeleton is based on the following observation. Let e be a node and q a prob-range query with \(q.r\cap e.mbr\neq \emptyset \). If q.r does not contain any object in e, we can find an arbitrary shape that bounds e's objects and does not overlap q.r. Obviously, the filtering ability of this arbitrary shape is stronger than that of the MBR, because the arbitrary shape can adjust its boundary to cater to the local distribution of the node's children rather than using a fixed shape.

A natural question, however, is how to construct (and access) this arbitrary shape appropriately. In this section, we propose the skeleton, formally defined in Definition 2. By depicting the shape on a grid (i.e., via the boundaries of its cells), we can develop a series of algorithms supporting skeleton construction, access, and so on.

Definition 2 (Skeleton)

Let e be a node of an R-Tree and \(\mathcal {G}(m\times m)\) a grid that partitions e.mbr. The skeleton of e, denoted s(e,m), is a filter that bounds e's children with an irregular shape whose boundary is depicted by the cells of \(\mathcal {G}\).

Skeleton construction

According to Definition 2, since the skeleton is depicted by a grid, we need to initialize the grid before construction. Given a node e, we first use the grid \(\mathcal {G}(m\times m)\) to partition e.mbr. Then, we traverse all cells and label those that overlap with (or contain) e's children as 1. For simplicity, we call such cells 1-cells.

After initialization, we look for a suitable skeleton, which should satisfy two conditions: (i) its boundary is also a boundary of cells; (ii) its area should be as small as possible. In this paper, we find this boundary by greedy selection. To be more specific, we compute the number of 1-cells in each row/column and sort the rows/columns by this number. We then repeat the following operations: (i) select the row/column with the most 1-cells as one row/column of the skeleton; (ii) label the 1-cells in that row/column as accessed-1-cells; (iii) update the order of the remaining rows/columns according to the number of unaccessed 1-cells they contain. When all 1-cells have become accessed-1-cells, we use an irregular shape to bound the selected rows/columns.

Next, we explain the skeleton representation. A skeleton s(e,m) is depicted by two types of bit vectors, gbits and vbits. We first explain the gbits. In d-dimensional space (d = 2 in this section), there are d gbits; their role is to mark which rows (or columns) contain 1-cells. For example, in Figure 5, we allocate 2 gbits for s(e,m), gbit[0] and gbit[1], which mark the columns and rows containing 1-cells. From Figure 5, because three columns contain 1-cells (i.e., columns 0, 5, and 7), gbit[0] is set to 10100001. Similarly, gbit[1] is 01000000.

Figure 5: How to construct/access a skeleton

vbits are associated with the columns (or rows) that overlap with (or are included in) the closed region; accordingly, they are indexed by the gbits. For example, gbit[0] = 10100001 indicates that vbits are associated with columns 0, 5, and 7. The role of a vbit is to mark which cells are 1-cells: the bits corresponding to 1-cells are set to 1, the others to 0. Let vbit[i,j] denote the vbit indexed by gbit[i] and associated with column (or row) j. For example, since the cells in rows 0 and 2 of column 0 are 1-cells, we set vbit[0,0] to 00000101.
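The following C++ sketch shows one way the gbit/vbit encoding could be materialized for d = 2. It assumes m ≤ 64 so that each row/column fits in a 64-bit word; the struct layout is our illustration, not the paper's storage format.

```cpp
#include <cstdint>
#include <vector>

// One gbit per dimension; vbits are stored only for the rows/columns
// whose gbit is set, in gbit order, as described in the text.
struct Skeleton {
    int m = 0;                      // grid resolution
    uint64_t gbit[2] = {0, 0};      // gbit[0]: columns, gbit[1]: rows
    std::vector<uint64_t> vbit[2];  // vbit[0][k]: k-th marked column, etc.
};

Skeleton encode(const std::vector<std::vector<bool>>& oneCell, int m) {
    Skeleton s; s.m = m;
    for (int c = 0; c < m; ++c) {             // columns containing 1-cells
        uint64_t col = 0;
        for (int r = 0; r < m; ++r)
            if (oneCell[r][c]) col |= (1ULL << r);
        if (col) { s.gbit[0] |= (1ULL << c); s.vbit[0].push_back(col); }
    }
    for (int r = 0; r < m; ++r) {             // rows containing 1-cells
        uint64_t row = 0;
        for (int c = 0; c < m; ++c)
            if (oneCell[r][c]) row |= (1ULL << c);
        if (row) { s.gbit[1] |= (1ULL << r); s.vbit[1].push_back(row); }
    }
    return s;
}
```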

Skeleton processing

Having discussed skeleton construction, we now explain how to access a skeleton. Since a skeleton is depicted by a group of bit vectors (gbits and vbits), skeleton processing relies on bit operations such as AND and OR.

Specifically, given a prob-range query q and a node e with \(q.r\cap e.mbr\neq \emptyset \), we generate another d bit vectors named qbits, which mark the rows (or columns) of cells that overlap q.r. For example, in Figure 5c, because q.r overlaps columns 3, 4, and 5 (and rows 3 to 7), the corresponding qbits are {00111000, 11111000}.

After obtaining the qbits, we apply a two-level pruning strategy consisting of global pruning and local pruning. In the global pruning phase, we compare the gbits with the qbits. If \(\bigvee _{i\in [0,d-1]} qbit[i]\wedge gbit[i]=0\), e can be pruned: when qbit[i] ∧ gbit[i] = 0 for every i, no row/column overlapping q.r is associated with a vbit, so q.r cannot overlap any 1-cell. From Figure 5b, given the range query q_1 and node e, we first generate the qbits {00001110, 00111100}. We then apply global pruning: since 00001110 ∧ 10100001 = 0 and 00111100 ∧ 01000000 = 0, e is filtered by global pruning.

If e cannot be pruned by global pruning, q.r may overlap rows (columns) associated with vbits. In this case, we conduct local pruning, in which we access the corresponding vbits. Suppose gbit[i] ∧ qbit[i] ≠ 0; we then AND vbit[i,j] with qbit[(i+1) mod 2]. If qbit[(i+1) mod 2] ∧ vbit[i,j] ≠ 0, some 1-cell overlaps q.r and we cannot filter e, so we load e's children from disk to memory. Otherwise, we prune e.

From Figure 5c, given the prob-range query q_2 and the node e, the qbits of q_2 are {00111100, 11111000}. Since 11111000 ∧ 01000000 ≠ 0, q.r may overlap rows that contain 1-cells. We then apply local pruning. Since vbit[1,6] = 00010000 and 00111100 ∧ 00010000 ≠ 0, there may be a 1-cell in cell c_46 overlapping q_2. In this case, we load e's children from disk to memory.
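Continuing the encoding sketch above, the two-level test could look as follows; this is again a minimal sketch, with qbit[0]/qbit[1] masking the query's columns/rows.

```cpp
// Returns true if node e can be pruned, i.e., q.r touches no 1-cell.
bool canPrune(const Skeleton& s, const uint64_t qbit[2]) {
    // Global pruning: no marked column and no marked row intersects q.r.
    if ((qbit[0] & s.gbit[0]) == 0 && (qbit[1] & s.gbit[1]) == 0)
        return true;
    // Local pruning: every 1-cell lies in some marked column, so it is
    // enough to test each marked column's vbit against the query's rows.
    int k = 0;
    for (int c = 0; c < s.m; ++c) {
        if (((s.gbit[0] >> c) & 1ULL) == 0) continue;  // no vbit stored
        uint64_t v = s.vbit[0][k++];
        if (((qbit[0] >> c) & 1ULL) && (v & qbit[1]) != 0)
            return false;   // a 1-cell falls in q.r's columns and rows
    }
    return true;
}
```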

3.2.2 I/O-efficient skeleton

Although a skeleton provides a node's children with a tighter bound, it also consumes more space, and we must pay I/O cost to load the skeleton from disk into memory. In this section, we show how to find a balance between pruning ability and I/O cost, i.e., how to maintain strong pruning ability while avoiding excessive I/O.

Our solution is based on the following observation: given a node e and its children, if e.mbr is filled with e's children (e.g., Figure 6a), the filtering ability of the skeleton may not be strong; worse, the extra I/O cost may reduce the overall performance of S-MRST. On the contrary, if e.mbr is mostly blank (e.g., Figure 6c), the superiority of the skeleton is fully exploited. In the following, we first analyze the I/O cost and the saved cost of a skeleton, and then study the I/O-based skeleton construction algorithm.

Figure 6: When a node's skeleton should be constructed

For the I/O cost, let the block size be B, its loading cost be IO_block, and the skeleton size be |s|. The expected I/O cost of loading a skeleton from disk into memory is at most \(\frac {|s|IO_{block}}{B}\). For the saved cost, if \(q.r\cap e.mbr\neq \emptyset \), we load e's children into memory, which costs IO_block; if s(e,m) (i.e., the skeleton of e) helps us prune e, we save this IO_block. In total, the expected saved cost is p_e(m)·IO_block, where p_e(m) is the probability that e is pruned by s(e,m). The net benefit is given by (3).

$$ C_{ske}(m) =p_{e}(m)IO_{block}-\frac{|s|IO_{block}}{B} $$
(3)

We now discuss p_e(m). First, we explain the concept of b(m). Let e be a node, and \(\mathcal {G}(m\times m)\) the grid that partitions e.mbr into m×m cells. A cell that overlaps children of e is called a 1-cell; otherwise, it is a 0-cell. Adjacent 0-cells that form a rectangle constitute a grid-based blank region b(m). We can now compute p_e(m). Given b(m) and a prob-range query q with \(q.r\cap e.mbr\neq \emptyset \), if b(m).l > q.l and b(m).w > q.w, then q.r is contained in b(m) with probability \(\frac {|b(m)|-|q.r|}{|e.mbr|}\). Let \(\mathcal {B}\) = {b(m)_1, b(m)_2, …, b(m)_n} be the blank rectangles satisfying b(m)_i.l > q.l and b(m)_i.w > q.w; then q is contained in a blank region of e.mbr with probability \({\sum }_{i\leq n} \frac {\max (0,|b(m)_{i}|-|q.r|)}{|e.mbr|}\). Here, q.l and q.w are the expected length and width of the query region; since they are easy to obtain (e.g., from historical queries), we skip the details.

Lastly, we explain the I/O-efficient skeleton construction algorithm. Given a node e, we first determine whether the skeleton of e should be constructed at all, according to \(\sum \frac {\max (0,|b_{i}|-|q.r|)}{|e.mbr|}\), where b_i is a blank region in e without considering the grid, i.e., an upper bound of b(m)_i. If this sum is 0, we need not construct a skeleton for e. If it is positive, we search for a suitable m based on (3): starting from m = 1, we compute C_ske(m), then double m and repeat until |s| grows to the threshold C. Lastly, we take the m with the maximum C_ske(m), denoted m*, as the final resolution. The threshold C is discussed in the following part.
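A minimal sketch of this resolution search follows; pruneProb and skeletonBits stand for p_e(m) and |s| of the node at hand, and are assumptions of the sketch rather than functions from the paper.

```cpp
#include <functional>

// Doubles the grid resolution until the skeleton size reaches the
// per-node budget C, keeping the m that maximizes equation (3).
int pickResolution(double ioBlock, double blockSize, double budgetC,
                   const std::function<double(int)>& pruneProb,     // p_e(m)
                   const std::function<double(int)>& skeletonBits)  // |s|
{
    int best = 1;
    double bestGain = -1e300;
    for (int m = 1; skeletonBits(m) <= budgetC; m *= 2) {
        double gain = pruneProb(m) * ioBlock
                    - skeletonBits(m) * ioBlock / blockSize;        // (3)
        if (gain > bestGain) { bestGain = gain; best = m; }
    }
    return best;   // m*, the most I/O-efficient resolution
}
```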

Discussion

Because the skeleton of a node is stored in a block, the size of the skeleton should be no more than a threshold C. If a resolution m_stop makes the skeleton size exceed C, m_stop is rejected, and we need not check any resolution beyond m_stop. Therefore, among all the resolutions ranging from 1 to \(\frac {m_{stop}}{2}\), m* can be regarded as the most suitable resolution.

3.2.3 Adopting skeletons for bounding uncertain data

Since a skeleton can adjust its boundary and volume (depending on the grid resolution) to cater to the distribution of a node's children, it is also suitable for bounding objects' PCRs. Similar to U-tree, we construct a group of skeletons for each node.

Theoretically, we could construct an unlimited number of skeletons. However, the more skeletons stored for a node, the tighter the bound the node provides for its children, but also the higher the I/O cost. Thus, we should carefully select a few skeletons for construction. For example, in Figure 6, the area of the skeleton in Figure 6b is only slightly smaller than that in Figure 6a; the difference in their pruning ability is therefore small, and the skeleton in Figure 6b is not worth storing. By contrast, because the area of the skeleton in Figure 6c is much smaller than that of Figure 6a, it is worth storing.

$$ C_{ske}(m^{*}, \theta)= \frac{|s|IO_{block}}{B}-p_{e}(m^{*},\theta)IO_{block} $$
(4)
$$ Diff(\theta_{0}, \theta) = C_{ske}(m^{*}, \theta)-C_{ske}(m^{*}, \theta_{0})- \frac{|s|IO_{block}}{B} $$
(5)

We now formally introduce the skeleton-group construction algorithm. Given the set of objects \(\mathcal {O}\) in a node e, we first construct the skeleton bounding their MBRs. Because we must guarantee that one node contains at least |e| children, the total size of a node's skeletons should be no more than \(\frac {B}{|e|}\). Based on this observation, we construct the skeletons of each node in a greedy way, gradually increasing the probability threshold 𝜃 (with step length SL) and tentatively constructing s(m*,𝜃); equations (4) and (5) are used to evaluate whether a given skeleton is worth selecting.

To be more specific, we initially set the skeleton size threshold to \(\frac {B}{|e|SL}\) and check whether \(s(e,m_{1}^{*}, 0)\) can be constructed. If so, we construct it and update the skeleton size threshold to \(\frac {B^{\prime }}{|e|(SL-1)}\); otherwise, we update the threshold to \(\frac {B}{|e|(SL-1)}\). Here, \(B^{\prime }\) equals \(B-|s(e,m_{i}^{*}, \theta _{i})|\), and \(s(e,m_{i}^{*}, \theta _{i})\) is the i-th selected skeleton (i = 0 at the beginning). From then on, we repeat the above operations for each probability threshold.

Cost analysis

In the following, we analyze the skeleton construction cost. Let the object count be N and the capacity of each node be |e|. The expected size of a skeleton is \(\frac {B}{|e|SL}\), i.e., O(m²); therefore, when constructing a single skeleton, m varies from 1 to \(\lfloor \sqrt {\frac {B}{|e|SL}}\rfloor \). Accordingly, since m doubles at each step, the cost of constructing a single skeleton is \({\sum }_{i=0}^{\lfloor \log _{2}\sqrt {\frac {B}{|e|SL}}\rfloor } 4^{i}\), which is \(O(\frac {B}{|e|SL})\). When constructing a group of skeletons for a node, the total cost is bounded by \(O(SL\cdot \frac {B}{|e|SL})=O(\frac {B}{|e|})\). Because the node count is bounded by \({\sum }_{i=0}^{\log _{|e|}N-1} |e|^{i}\), the total cost of constructing S-MRST is \(O(\frac {N\times SL}{|e|}\times \frac {B}{|e|})=O(\frac {NB\times SL}{|e|^{2}})\).

3.3 The MRST

Given an object o and a query q with \(o.mbr\cap q.r\neq \emptyset \), we must access o's summary for probabilistic filtering, so the filtering ability of the summary is vitally important. Motivated by this observation, we first study state-of-the-art summaries and propose a tighter bound for probabilistic filtering. Inspired by this bound, we then develop a novel summary called the MRST for summarizing the PDF of uncertain data. Moreover, we discuss the storage and access of the MRST.

3.3.1 A tighter probabilistic bound for filtering

In this section, we develop a probabilistic bound for filtering uncertain data that fully considers the intrinsic properties of the partitions and the query region. To be more specific, given a partition \(\mathcal {P}\) = {P_1, P_2, …, P_n} of an uncertain object o and a range query q satisfying \(q.r\cap o.mbr\neq \emptyset \), we use Property 1 to compute the probabilistic lower and upper bounds (denoted lb(o,q) and ub(o,q)) of o lying in q.r. Compared with the algorithm discussed in [31], when computing lb(o,i,q) and ub(o,i,q), as shown in (6) and (7), we fully consider several intrinsic properties of P_i. Here, S(o,i,q) is the area of the intersection between q.r and P_i, P_i.ud is the maximal probability density in P_i, P_i.ld is the minimal probability density in P_i, and P_i.app is the probability of o lying in P_i.

Property 1

Given an object o and a query q, if q.r overlaps with (or contains) o's partitions \(\bigcup _{i=1}^{n_{1}}P_{i}\), then \(lb(o,q)={\sum }_{i=1}^{n_{1}} lb(o,i,q)\) and \(ub(o,q)={\sum }_{i=1}^{n_{1}} ub(o,i,q)\).

$$~ lb(o,i,q) = P_{i}.ld\times S(o,i,q) $$
(6)
$$~ ub(o,i,q) = \min(P_{i}.ud\times S(o,i,q),\; P_{i}.app) $$
(7)
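The following C++ sketch accumulates (6) and (7) over the partitions overlapped by the query; the Partition struct and the pre-computed intersection areas S are our illustration.

```cpp
#include <algorithm>
#include <vector>

struct Partition {
    double app;   // probability of o lying in this partition
    double ld;    // minimal probability density inside it
    double ud;    // maximal probability density inside it
};

// Property 1: sum the per-partition bounds (6) and (7), where S[i] is
// the intersection area S(o,i,q) of the query region with partition i.
void bounds(const std::vector<Partition>& P, const std::vector<double>& S,
            double& lb, double& ub) {
    lb = ub = 0.0;
    for (size_t i = 0; i < P.size(); ++i) {
        lb += P[i].ld * S[i];                         // equation (6)
        ub += std::min(P[i].ud * S[i], P[i].app);     // equation (7)
    }
}
```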

Discussion

Taking these intrinsic properties into consideration, our proposed probabilistic bound is tighter than that of [31]. In the extreme case where the PDF of the uncertain data obeys a uniform distribution, our bound can exactly prune/validate all objects. In the following, BPB (basic probabilistic bound) denotes the probability bound provided by state-of-the-art partition-based approaches, and IPBPB (intrinsic-property-based probability bound) denotes ours.

3.3.2 Summarizing uncertain data using multi-resolution grid

Besides its powerful filtering ability, the IPBPB inspires us to develop a new method to capture the features of a PDF. From Property 1, an outstanding summary should achieve two goals: (i) P_i.ud − P_i.ld should be small; (ii) \(|\mathcal {P}|\) should be relatively small. To achieve these goals, we employ a multi-resolution grid [5, 17, 19] for partitioning. Given an object o, its region o.r, and a sub-region P_i.r, if the probability density in P_i.r changes dramatically, P_i.r is finely partitioned; otherwise, it is coarsely partitioned. Intuitively, because the partition result reflects the gradient of the PDF, the above two goals are easy to achieve.

In the following, we formally discuss the summary construction. First, we propose the concept of the PBD (short for probability bound difference) as the criterion of partitioning. For simplicity, PBD(o) (or PBD(o,i)) denotes the probability bound difference of o (or P_i).

Definition 3 (PBD)

Given a partition P_i of an object o, PBD(o,i) = (P_i.ud − P_i.ld) × |P_i|.

Given an object o, the partitioning includes two steps, namely split and shrink. Split partitions the subregions where the probability density changes dramatically; the termination condition is that the PBD of each subregion drops below a parameter λ. We use a quad-tree to temporarily organize the split result. Take the example in Figure 7. Given an uncertain object o, we first use an MBR to bound o.r (the shaded region) and then apply the partitioning. From Figure 7g, because PBD(o,A) and PBD(o,C) are less than λ (= 0.1 in this section), we stop splitting them. We subdivide B and D into four parts each, since PBD(o,B) and PBD(o,D) are larger than λ. Figure 7b shows the result of split, and Figure 7c shows the corresponding quad-tree.

Figure 7: Constructing an MRST

After split, we enter the shrink phase, where we merge partitions whose probability densities are roughly the same. In particular, given two partitions P_i and P_j, if PBD(o,i+j) ≤ λ, we merge them. Shrink is applied in post-order: we first merge the leaf nodes within the same subtree (Figure 7d shows the intermediate result of merging), and then merge leaf nodes across subtrees. In the latter step, the leaf node with the minimal appearance probability in each subtree is selected as a candidate node (e.g., d_1, b_1, A, and C). Given two candidate nodes u and v from different subtrees, if PBD(o,u+v) ≤ λ, they are merged. According to Figure 7g, because PBD(o,C+b_2) ≤ λ, C and b_2 are merged. Figure 7f shows the final MRST.
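As a minimal sketch of the split phase, the recursion below quarters a region until its PBD drops below λ. The pbd callback, which should evaluate (ud − ld) × |r| from the object's PDF, is an assumption of the sketch, and the shrink phase is omitted.

```cpp
#include <functional>
#include <vector>

struct Region { double x_lo, y_lo, x_hi, y_hi; };

// Split phase: recursively quarter a region until pbd(r) <= lambda.
void split(const Region& r, double lambda,
           const std::function<double(const Region&)>& pbd,
           std::vector<Region>& leaves) {
    if (pbd(r) <= lambda) { leaves.push_back(r); return; }
    double mx = (r.x_lo + r.x_hi) / 2, my = (r.y_lo + r.y_hi) / 2;
    split({r.x_lo, r.y_lo, mx, my}, lambda, pbd, leaves);       // SW
    split({mx, r.y_lo, r.x_hi, my}, lambda, pbd, leaves);       // SE
    split({r.x_lo, my, mx, r.y_hi}, lambda, pbd, leaves);       // NW
    split({mx, my, r.x_hi, r.y_hi}, lambda, pbd, leaves);       // NE
}
```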

A natural question is how to find a suitable λ. We answer this question by building a cost model for the summary, which fully considers its filtering ability, access cost, and I/O cost; since this has been discussed in [31], we skip the details. We highlight that, by Theorem 1, our proposed MRST effectively reduces the space cost. Here, BPBS (basic partition-based summary) represents the summary under basic partition-based algorithms (e.g., UD-Tree and UI-Tree); \(\mathcal {P}^{m}\) and \(\mathcal {P}^{u}\) denote the MRST- and BPBS-based summaries; PBD^m(o) and PBD^u(o) are their corresponding PBDs; \(|\mathcal {P}^{m}|\) and \(|\mathcal {P}^{u}|\) denote their partition counts; and \({P_{i}^{m}}.s\) (or \({P_{j}^{u}}.s\)) denotes the area of an MRST-based (or BPBS-based) partition.

Theorem 1

Given an object o and its two partitions \(\mathcal {P}^{m}\) and \(\mathcal {P}^{u}\), if PBD^m(o) = PBD^u(o), then \(|\mathcal {P}^{m}|\leq |\mathcal {P}^{u}|\).

Proof

Let \(\mathcal {P}^{m}\) = {\({P_{1}^{m}},{P_{2}^{m}},\ldots ,{P_{n}^{m}}\)} and \(\mathcal {P}^{u}\) = {\({P_{1}^{u}},{P_{2}^{u}},\ldots ,{P_{v}^{u}}\)} be the two partitions. For any \({P_{i}^{m}}\) and \({P_{j}^{u}}\), if \({P_{i}^{m}}.s\leq {P_{j}^{u}}.s\), then PBD^m(o,i) ≤ PBD^u(o,j); conversely, if PBD^m(o,i) ≥ PBD^u(o,j), then \({P_{i}^{m}}.s\geq {P_{j}^{u}}.s\). Because PBD^m(o) = PBD^u(o) and \(\sum {P_{i}^{m}}.s=\sum {P_{j}^{u}}.s\), it follows that \(|\mathcal {P}^{m}|\leq |\mathcal {P}^{u}|\). □

3.3.3 Efficient summary storage

In this section, we explain the MRST storage. Similar to the skeleton, we use bit vectors to store the MRST of uncertain data; in this way, as discussed below, the space cost can be compressed to a relatively small scale. For example, if the PDF of the uncertain data obeys a normal distribution, each MRST costs 36 B, whereas a BPBS consumes 140 B per object.

An MRST stores three types of information to capture the PDF of an uncertain object. Given an object o and the corresponding partition \(\mathcal {P}\) = {P_1, P_2, …, P_n}, we associate each P_i with probabilistic information (e.g., P_i.app, P_i.ud, and P_i.ld), location information, and hierarchical information (i.e., the relationship between a parent and its children).

We first study the compression of the probabilistic information, where each value is compressed to a bit vector. We use 6 bits to represent P_i.app, whose domain is thus 0 to 63. For example, given a partition P_i with P_i.app = 0.2, we use the bit vector 001100 (i.e., ⌊0.2×63⌋ = 12) to express it. We then use 4-bit vectors to express P_i.ud (and P_i.ld), whose domain is 0 to 15.

We then study the compression of the location information, where each coordinate is compressed to an n-bit vector. Specifically, we bound o.r with an MBR (denoted o.mbr), use a "virtual grid" \(\mathcal {G}(n\times n)\) to partition o.mbr, and express the location of each subregion by its "virtual coordinates". For example, with 7-bit vectors, the resolution of the grid is 128×128. The bottom-left and top-right coordinates are described by cell IDs; as shown in Figure 8, the virtual coordinates of node d_1 are (64,111) and (80,127). The area information of each partition also depends on the resolution of the virtual grid. Due to space limitations, we skip the details.
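A minimal sketch of these bit-field encodings follows; densityScale, an assumed normalizer mapping densities into [0,1], is our addition.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// app uses 6 bits (0..63): e.g., packApp(0.2) == 12, i.e., 001100.
uint8_t packApp(double app) {
    return static_cast<uint8_t>(std::floor(app * 63));
}

// ud/ld use 4 bits (0..15), after normalizing the density into [0,1].
uint8_t packDensity(double d, double densityScale) {
    return static_cast<uint8_t>(std::floor(d / densityScale * 15));
}

// A coordinate becomes a cell id on a (2^n x 2^n) virtual grid over
// [lo, hi); n = 7 gives the 128x128 resolution from the text.
uint32_t packCoord(double v, double lo, double hi, int n) {
    double cellWidth = (hi - lo) / double(1u << n);
    uint32_t id = static_cast<uint32_t>((v - lo) / cellWidth);
    return std::min(id, (1u << n) - 1);   // clamp v == hi into the grid
}
```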

Figure 8: The structure of an MRST

For the hierarchical information, we use a k-bit vector to express "offset+len", which describes the relationship between a parent and its children. As shown in Figure 7, D is an internal node with two children d_1 and d_3, where the offset is 3 and len = 2; they are expressed by the bit vectors 11 and 10.

After explaining the compression strategies, we discuss the node storage format. There are three types of nodes: leaf, super-leaf, and internal nodes. Figure 8 shows the memory structure of a leaf node; the others are similar. We describe the data structure of each type as follows.

  • Leaf node: it is expressed by a tuple with five elements: 〈type, app(o,i), ub(o,i), lb(o,i), p(o,i)〉. The type field occupies 2 bits and distinguishes the node from the other node types. The other fields have been explained already.

  • Super-leaf node: it is also expressed by 〈type, app(o,i), ub(o,i), lb(o,i), p(o,i)〉. A super-leaf node is made up of a group of leaf nodes {l_1, l_2, …, l_n} in the same subtree. For example, D⁻ in Figure 8d is a super-leaf node built from d_1, d_2, and d_4: \(app(o,D^{-})={\sum }_{i\in \{1,2,4\}}app(o, d_{i})\); \(lb(o,D^{-})=\min _{i\in \{1,2,4\}}lb(o, d_{i})\); \(ub(o,D^{-})=\max _{i\in \{1,2,4\}}ub(o, d_{i})\).

  • Internal node: it is expressed by 〈type, app(o,i), off(o,i), len(o,i), p(o,i)〉, where the off(o,i) and len(o,i) fields occupy 6 bits and 2 bits respectively.

Algorithm 1 (pseudocode)

3.3.4 Accessing the summary of uncertain data

In this section, we propose Algorithm 1 to support accessing an MRST. It employs a greedy strategy: Algorithm 1 uses a field d(q,i) to determine the access order of the MRST nodes so as to terminate as early as possible. Here, d(q,i) is computed by (8). Obviously, the larger d(q,i) is, the more it contributes to ub(o,q) − lb(o,q), and the earlier the corresponding P_i should be accessed. Compared with traditional access orders such as pre-order and in-order traversal, controlling the node access order by d(q,i) computes the bound more efficiently.

$$~ d(i,q)=\min(P_{i}.ud\times S(o,i,q),\; P_{i}.app)-P_{i}.ld\times S(o,i,q) $$
(8)

Given a prob-range query q and an object o, if q.r overlaps o.mbr, we access the MRST of o to check whether o is a result of q. As shown in Algorithm 1, we first access the root of the MRST and compute lb(o,q) and ub(o,q) according to (6) and (7). If o can be neither pruned nor validated, we initialize the array L (lines 2–6). Then, Algorithm 1 repeats the following: (i) find the node e in L whose d(i,q) is maximal; (ii) access every child of e and update the bounds according to (6) and (7). If o is still not filtered, we update L: for each child i of e, if i also has children and \(q.r\cap P_{i}.r\neq \emptyset \), we insert i into L. After accessing the MRST, o is validated if lb(o,q) ≥ q.𝜃, and pruned if ub(o,q) < q.𝜃. If o can be neither pruned nor validated, we must compute the integral to check whether o is a result.
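The following C++ sketch mirrors this greedy access: nodes are expanded in decreasing order of their ub−lb gap, i.e., d(i,q) in (8). It assumes each in-memory node already carries its lb/ub terms of (6) and (7) w.r.t. the current query; the Node layout is ours.

```cpp
#include <queue>
#include <vector>

struct Node {
    double lbPart, ubPart;         // this node's terms of (6) and (7)
    std::vector<Node*> children;   // empty for leaf/super-leaf nodes
};

enum class Verdict { Result, Pruned, Refine };

Verdict accessMRST(Node* root, double theta) {
    double lb = root->lbPart, ub = root->ubPart;
    auto gap = [](Node* n) { return n->ubPart - n->lbPart; };  // d(i,q)
    auto cmp = [&](Node* a, Node* b) { return gap(a) < gap(b); };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> L(cmp);
    L.push(root);
    while (!L.empty()) {
        if (lb >= theta) return Verdict::Result;   // validated
        if (ub < theta)  return Verdict::Pruned;   // disqualified
        Node* e = L.top(); L.pop();
        if (e->children.empty()) continue;
        // Replace e's contribution by the sharper sum over its children.
        lb -= e->lbPart; ub -= e->ubPart;
        for (Node* c : e->children) {
            lb += c->lbPart; ub += c->ubPart;
            L.push(c);
        }
    }
    if (lb >= theta) return Verdict::Result;
    if (ub < theta)  return Verdict::Pruned;
    return Verdict::Refine;                        // fall back to the integral
}
```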

4 The query algorithms

In this section, we first discuss the range query algorithm over S-MRST. Given a prob-range query q, the search starts from the root of S-MRST and eliminates the entries that violate the query conditions. For each remaining entry, we retrieve its child node and perform the above process recursively until a leaf node is reached. For each encountered object o, we attempt to filter it by accessing its MRST. For each object that cannot be filtered, we use the integral in the refinement phase to check whether it is a result of q. Since the algorithm is straightforward, we skip the details.

In the following, we discuss the kNN-range query and the top-k range query over S-MRST. We highlight that S-MRST can also support many other types of queries, such as the kNN query and the top-k query; due to space limitations, we skip the details.

4.1 The top-k range query over S-MRST

In some applications, the weights of uncertain objects differ, and the system often uses a scoring function F to evaluate the weight of each uncertain object. Accordingly, returning the k objects with the highest weights to the user is sometimes preferable. In this section, we study the problem of the top-k range query over S-MRST. We first give the problem definition, and then explain the top-k range query algorithm.

4.1.1 Problem definition

We now formally define the top-k range query. For simplicity, in this section we only consider continuous uncertain data; however, our techniques can also be applied to queries over discrete uncertain data.

Definition 4 (Top-k Range Query)

Given a set of uncertain objects \(\mathcal {O}\), a scoring function F, and a range query q, q retrieves the k objects in \(\mathcal {O}\) with the highest expected value rank(o), where rank(o) = F(o) × app(q,o), and app(q,o) is the probability of o lying in the query region.

Figure 9a shows an example of a top-k range query. {A,…,D} are uncertain objects with weights {2,…,16}. For range query q, their expected values are {1, 2.1, 7, 0} respectively. Since k = 2, the query results are B and C.

Algorithm 2 (pseudocode)

Figure 9: Examples of a top-k range query and a kNN range query, w.r.t. k = 2

4.1.2 The query algorithm

In this section, we propose Algorithm 2 to efficiently support the top-k range query. Compared with the prob-range query, the main difference is that no probabilistic threshold is given for probabilistic pruning. Therefore, given a query q, a naive method is to compute the expected values of the objects overlapping q.r, sort them, and report the k objects with the highest expected values to the user.

However, the main limitation of this method is that it has to perform expensive integral calculus. To address this problem, we propose a novel algorithm. Let \(\mathcal {O}\) be the object set, and \(\mathcal {O}_{q}\) = {o_1, o_2, …, o_m} the subset of \(\mathcal {O}\) overlapping q.r. Our search algorithm consists of an initialization phase and a selection phase. In the initialization phase, if \(|\mathcal {O}_{q}|\leq k\), we use \(\mathcal {O}_{q}\) as the result set (lines 3–4). Otherwise, we push all objects contained in q.r into the candidate set C. If |C| ≥ k, we use the k-th highest weight value as the threshold η; otherwise, we set η to 0 (lines 5–10). From then on, we employ a greedy search: we sort all objects in \(\mathcal {O}_{q}\) in descending order of weight and store them in a list named L (line 11).

In the selection phase, we access every object in the list. For each object, if its weight is smaller than the threshold η, we ignore it (lines 13–14). Otherwise, we employ Algorithm 1 to compute the probabilistic lower and upper bounds of the object lying in the query region. If L[i].ub × L[i].F < η, we delete L[i]. Otherwise, we use the integral to compute the probability of the object lying in the query region and check whether L[i] can be inserted into the candidate set; if so, we insert L[i] and update η accordingly. We repeat the above operations until the list L is empty.
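A minimal C++ sketch of this selection phase follows; for brevity it returns the top-k expected values rather than object identifiers, and exactApp, which stands for the integral refinement, is an assumption.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct Cand { double F, lb, ub; };   // weight and appearance bounds

std::vector<double> topK(std::vector<Cand> L, size_t k,
                         const std::function<double(const Cand&)>& exactApp) {
    std::sort(L.begin(), L.end(),            // descending weight
              [](const Cand& a, const Cand& b) { return a.F > b.F; });
    std::vector<double> result;              // kept sorted ascending
    double eta = 0.0;                        // current k-th best rank
    for (const Cand& c : L) {
        if (c.F < eta) break;                // rank <= F: no later object wins
        if (c.ub * c.F < eta) continue;      // bound-based filtering
        double rank = exactApp(c) * c.F;     // integral refinement
        if (result.size() < k) {
            result.insert(std::lower_bound(result.begin(), result.end(), rank),
                          rank);
        } else if (rank > result.front()) {
            result.front() = rank;           // replace the current k-th best
            std::sort(result.begin(), result.end());
        }
        if (result.size() == k) eta = result.front();
    }
    return result;                           // top-k expected values
}
```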

4.2 The kNN range query algorithm over S-MRST

As depicted in [6, 7], kNN search finds the k objects nearest to the query point. However, in some cases the quality of the query results may be low: if only a few objects are near the query point, some results in the result set may be far from it, and such results may not be suitable. Based on this observation, in this section we study the problem of the kNN range query over uncertain data. We define the problem below.

4.2.1 Problem definition

In this section, we first introduce the concept of uncertain-data-based distance (UDist for short), and then define the kNN range search over S-MRST.

Algorithm 3 (pseudocode)

Definition 5 (UDist)

Given an uncertain object o = 〈o.r, o.PDF〉 and a query point q, the U-distance between q and o is computed according to (9), where dist(x,q) is the distance between the query point q and a point x ∈ o.r.

$$ UDist(o,q)=\int_{o.r} dist(x,q)\, o.\textsf{PDF}(x)\,dx $$
(9)

Definition 6 (kNN-range)

Given a set of uncertain objects \(\mathcal {O}\) and a query q = 〈r, 𝜃, k〉, let C ⊆ \(\mathcal {O}\) be the set of objects lying in the query region with probability at least q.𝜃; q returns the min(k, |C|) objects of C with the smallest UDist.

Figure 9b shows an example of a kNN-range search over uncertain data. {A,…,D} are uncertain objects. Under kNN query q, their UDists are {8, 6, 10, 16} respectively. Since k = 2, the query results are A and B.

4.2.2 The query algorithm

To support this query, we propose the kNN-range query algorithm in this section. Similar to Algorithm 2, Algorithm 3 consists of an initialization phase and a selection phase. In the initialization phase, given a kNN-range query q, we first find the set C of objects lying in the query region with probability at least q.𝜃. If |C| ≤ k, we return C.

If |C| > k, we enter the selection phase. In this phase, we maintain two lists, lb and ub, recording the UDist lower and upper bounds of each object in C, and insert the final query results into the result set \(\mathcal {R}\). If the UDist lower bound of an object is higher than the k-th value in ub, the object can be removed; conversely, if the UDist upper bound of an object is lower than the k-th value in lb, it is inserted into the result set. For the remaining objects, we search greedily: we compute the exact UDist of the \(k-|\mathcal {R}|\) objects with the lowest lower bounds according to Definition 5, which yields an upper bound on the k-th result. For each of the others, if \(\mathcal {R}[k]<lb[i]\), o_i cannot be a query result; otherwise, we compute the UDist of o_i according to Definition 5.
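The following C++ sketch captures this selection phase under the simplifying assumption that exact UDists are computed on demand via an exactUDist callback (standing for the integral in Definition 5); the candidate layout is ours.

```cpp
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

struct KCand { double lb, ub; };   // UDist bounds of a candidate in C

// Returns the indices of the k candidates with the smallest UDist.
std::vector<size_t> knnRange(const std::vector<KCand>& C, size_t k,
                             const std::function<double(size_t)>& exactUDist) {
    std::vector<size_t> order(C.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),          // ascending UDist lb
              [&](size_t a, size_t b) { return C[a].lb < C[b].lb; });
    // Refine the k candidates with the lowest lower bounds; the largest
    // refined UDist bounds the k-th result from above.
    std::vector<std::pair<double, size_t>> best;   // (exact UDist, id)
    for (size_t j = 0; j < k && j < order.size(); ++j)
        best.push_back({exactUDist(order[j]), order[j]});
    std::sort(best.begin(), best.end());
    for (size_t j = k; j < order.size(); ++j) {
        size_t i = order[j];
        if (C[i].lb >= best.back().first) break;   // cannot enter the top-k
        double d = exactUDist(i);
        if (d < best.back().first) {
            best.back() = {d, i};
            std::sort(best.begin(), best.end());
        }
    }
    std::vector<size_t> ids;
    for (const auto& p : best) ids.push_back(p.second);
    return ids;
}
```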

5 Experimental evaluation

This section experimentally evaluates the efficiency of our proposed techniques. S-MRST is compared with U-Tree and UD-Tree, which are classic PCR-based and partition-based indexes respectively. Two real spatial data sets, GPS and STOCK, are employed, containing 34 MB and 30.6 MB of points respectively. In GPS, each record is expressed by a tuple 〈latitude, longitude〉. To generate uncertain objects, we build an R-Tree \(\mathcal {I}\), using the record coordinates as sampled points and the leaf nodes of \(\mathcal {I}\) as uncertain objects; the size of each leaf node varies from 100 to 200. Accordingly, the amount of uncertain data in GPS is 173 KB. In STOCK, each record is expressed by a tuple 〈value, volume〉; uncertain objects are generated in the same way, and the amount of uncertain data in STOCK is 144 KB. In addition, three synthetic data sets, each containing 128 KB of uncertain objects, are generated. In order to evaluate the pruning ability of S-MRST, as shown in Figure 10, the boundaries of the uncertain data are fitted by several irregular shapes.

Figure 10: The shapes of the uncertain data

GPS is employed as the default data set. A workload contains 100 queries in our experiments. The region of each query is a rectangle, which varies from 500×500 to 1500×1500. In addition, we randomly choose a probabilistic threshold 𝜃 ∈ (0,1] for each query. S-MRST is implemented in C++. Experiments are run on a PC with an i3 CPU and 16 GB of memory.

5.1 Index construction

In this section, we discuss the cost of constructing S-MRST. Here, GPS and STOCK refer to the two real data sets, and FIG-A, FIG-B, and FIG-C refer to the three synthetic data sets, where the shapes of the objects in each set are described in Figure 10. In addition, P-node (or U-node) refers to the structure that stores both leaf nodes and internal nodes in UD-Tree (or U-Tree).

In Figure 11a, the storage cost of the MRST is less than that of the PCRs, because the MRST uses only a few partitions to construct the summary of the uncertain data. In Figure 11b, the storage cost of the skeleton is higher than the others, but its size is still acceptable, mainly because we use bit vectors to store the skeleton. In addition, we select the optimal skeleton by considering both the filtering ability of the skeleton and its I/O cost; thus, the space cost of the skeleton is not much higher than that of U-Tree and UD-Tree. From Figure 11c, the total space cost of S-MRST is the lowest of all, because the MRST saves a large amount of space.

Figure 11: Index size under different data sets

We then compare the index construction times of the three indexes under different data sets. As shown in Figure 12a, the construction cost of the skeleton is higher than that of both P-node and U-node, but the difference is minor, because we check whether a node's skeleton should be constructed before the skeleton construction algorithm is invoked, avoiding much unnecessary computation. From Figure 12b, the MRST construction cost is roughly the same as the summary construction cost of the other two indexes, because the self-partitioning scheme used to construct the MRST often terminates the construction early. The total construction time of S-MRST is about twice that of the other two indexes. However, S-MRST targets settings where data updates are infrequent, and under this assumption its construction time is acceptable.

Figure 12: Index construction time on different data sets

5.2 Query performance

In this section, we evaluate the query performance of the three indexes. In the first group of experiments, we evaluate S-MRST, UD-Tree [31], and U-Tree under different q.r. The query region varies from 500×500 to 1500×1500, and the other parameters (e.g., q.𝜃) take their default values. First, we evaluate the pruning/validating ability. In Figure 13, S-MRST performs the best of all, because its pruning ability is stronger than that of UD-Tree and U-Tree, enabling it to avoid most of the unnecessary accesses. In addition, since the space cost of S-MRST is lower than that of the other two, S-MRST incurs lower I/O cost than both UD-Tree and U-Tree. We also find that, as q.r increases, S-MRST is the most stable: the larger q.r is, the more objects overlap q.r, and since S-MRST provides powerful filtering, it avoids most of the integral computation.

Figure 13: Varying q.r

In the second group of experiments, we evaluate the performance of S-MRST, UD-Tree, and U-Tree when 𝜃 varies from 0.1 to 0.9, with the other parameters at their default values. The experiment content is the same as in the first group, and Figure 14a–b show the results. S-MRST performs the best of all. We also find that the candidate set of S-MRST is the smallest, since the skeleton of S-MRST has powerful filtering ability for pruning uncertain objects. Another reason is that the MRST effectively approximates the PDF of the uncertain data, and therefore has stronger filtering ability than the other indexes.

Figure 14: Varying q.𝜃

The third group of experiments evaluates the filtering ability and the computational cost of the skeleton. The query region ranges from 500×500 to 2500×2500, with the other parameters set to their defaults. Figure 15a–b show the results, from which we can see that the skeleton still performs the best of all. The reason is that the skeleton can adjust its boundary to cater to the local distribution of the uncertain data and provide objects with a tighter boundary; therefore, more objects can be filtered using the skeleton.

Figure 15: Node comparison

In the fourth group of experiments, we study the probabilistic filtering ability of the MRST. We count the number of uncertain objects that can be neither pruned nor validated, and we also report the response time. The results are shown in Figure 16. As expected, the MRST has stronger filtering ability. In Figure 16a, the computational cost based on the MRST is lower than that of the PCR, because a few partitions suffice to approximate the PDF of the uncertain data; therefore, the access cost of the MRST is the lowest of all. More importantly, the space cost of the MRST is much smaller than that of U-Tree. Overall, the MRST performs the best of all.

Figure 16: Summary filtering ability

We also evaluate how the performance of S-MRST, UD-Tree, and U-Tree is affected when supporting top-k range queries. The parameter k ranges from 10 to 200, the query region is a uniform 1000×1000 rectangle, and the other parameters take their default values. Figure 17a–b show the results, from which we can see that S-MRST performs the best, just as when supporting prob-range queries. This is expected, since we use the MRST to compute the expected-weight lower and upper bounds of candidates; the bounds generated by the MRST are much tighter than those of both U-Tree and UD-Tree, making it possible to filter most candidates.

Figure 17: Top-k range query

We then evaluate the performance of the three indexes when supporting kNN range queries, with k varying from 10 to 200. In Figure 18, we find that both the running time and the candidate count of S-MRST are the smallest of all. The reason is that the summary in S-MRST effectively captures the features of the PDF; therefore, S-MRST provides objects with a tighter UDist bound and avoids integral operations as much as possible. In addition, S-MRST is the most stable among these indexes. By contrast, the performance of UD-Tree and U-Tree is sensitive to k: as k increases, more objects are selected as candidates, and the filtering ability of the summary becomes more important to the algorithm's performance. Therefore, S-MRST performs the best of all.

Figure 18: kNN range query

We then evaluate the performance of S-MRST under different partition resolutions m and different parameters λ. The resolution m varies from 2×2 to 32×32, with the other parameters at their default values; the MRST construction algorithm is the one discussed in Section 3.3. In Figure 19, we find that the pruning ability of S-MRST grows with m. However, after m reaches 8, the pruning ability changes little: once the resolution is high enough, the area of the 1-cells closely approximates the non-blank area of each node, so the resolution has little further impact. As expected, the space cost of S-MRST increases with m. Lastly, we find that, as m increases, the running time of S-MRST first drops and then rises: the I/O cost keeps growing while the pruning ability stops improving, so the total cost eventually increases.

Figure 19: Varying resolution m

For the parameter λ, it varies from 0.01 to 0.2; the skeleton construction algorithm is the one discussed in Section 3.2.3, and the other parameters take their default values. In Figure 20, we find that the pruning ability of S-MRST gets stronger as λ increases; however, after λ reaches 0.1, the pruning ability of the MRST remains rather stable. The reason is that the pruning rule of the MRST takes many factors into consideration, so even when λ is not very small, the MRST still provides powerful pruning ability. Moreover, the MRST adjusts its partition to cater to the distribution of the probability density function; even when the partition resolution is not high, the PBD of each sub-region can still drop below λ. As expected, as λ increases, the running time of S-MRST changes little.

Figure 20: Varying parameter λ

Finally, we compare the performance of S-MRST, UD-Tree, and U-Tree under different data sets. Five data sets are employed, with default parameters. As expected, Figure 21 shows that S-MRST performs the best of all, which indicates that S-MRST can be regarded as a general framework for indexing uncertain data and is not sensitive to the distribution of the uncertain data.

Figure 21: Different data sets

6 Conclusions

In this paper, we studied the problem of indexing uncertain data. Through deep analysis, we proposed an effective indexing technique named S-MRST to manage uncertain data. S-MRST provides a very tight bound for pruning/validating the objects that overlap (or do not overlap) the query region at low cost, and it supports many types of queries. Our experiments convincingly demonstrate the efficiency of our indexing techniques. In the future, we will further study indexes that are suitable for high-dimensional uncertain data and that support frequent updates of probabilistic data.