Keywords

1 Introduction

With the rapid progress of mobile Internet technologies and the growing popularity of smart phones, a great amount of geo-information is generated at an unprecedented scale. In social network applications (e.g., Facebook or Twitter), there are a great number of users. Their personal information can be described by a set of attribute-value pairs and their geo-locations revealed by GPS. In a online-to-offline system, there are millions of users browsing products on the system, such products can be described by a set of attribute-value pairs and associated with geo-locations. In this paper, we refer to such data with both attribute-value pairs and geo-location information as geo-tagged attribute-value pair objects.

In a location-aware publish/subscribe system, subscribers subscribe their interests and publishers publish events with geo-information. This kind of systems has many real-world applications. In a location-aware targeted advertising application system, advertisers are subscribers, who specify the properties of their interested users. For example, they can have a subscription like (e.g.,“16 \(\leqslant \) age \(\leqslant \)28,hobby \(\in \){Tennis, basketball}”,“51.16145,0.14123”). The system will display corresponding advertisements on users screens. This is a well-known pushing advertising model. Users of social network systems (such as Facebook, Twitter, etc.) act as publishers. Their personal information, such as age, hobby and locations, becomes an event (e.g.,“age=20, sex=female, hobby=tennis,school=Harvard”, “51.16515,0.14123”). Advertisements can be displayed on these users screen if there is a high relevance between an event and a subscription. This kind of information pushing model is useful for online-to-offline commerce platforms, such as GrouponFootnote 1, Sellers or service providers in Groupon system are subscribers, who may want to accurately push their advertisements to potential customers by specifying both users properties and a series of their product information (e.g.,“hobby=smart-phone, item \(\in \){Iphone6s, Iphone6}, price \(\leqslant \) 499$”, “51.25643, 0.14845”) as a subscription. Users of this system are publishers. When a user click a product link, the information of this product and the users properties can become an event(e.g.,“hobby=smart-phone, item=Iphone6s, price=469”, “51.2612, 0.12545”). In such applications, only a few advertisements can be displayed due to the limited screen size.

Existing unstructured location-aware publish/subscribe systems [1, 3, 5, 8, 19, 20] can support subscriptions with geo-textual descriptions very well. For example, users of Twitter can register their interests with a geo-textual description (e.g.,“Cheapest iphone6s”, “31.4522,51.4451”). The system has to ensure a timely delivery of relevant geo-textual objects to the user. However, this kind of location-aware pub/sub systems cannot support geo-tagged attribute-value pair objects, which need a structured description to capture attributes and values. Existing structured location-aware pub/sub systems [5, 13]use a boolean expression presenting a subscription. They can efficiently retrieve all matched information. Therefore, a user may be overwhelmed. To address these issues, we propose a new type of top-k subscriptions matching with boolean expressions, referred as Location-aware Top-k Subscription Matching with Boolean Expressions(in short TSMB-loc). The latest solution proposed for top-k subscription matching with boolean expression [7] focused on fuzzy matches. We will develop a solution ensuring strict boolean semantics of expressions.

There are two challenges on developing solutions for location-aware top-k subscription matching with boolean expressions. First, how can we filter out the candidates of top-k best subscriptions from millions of subscriptions with a large amount of attributes, values and geo-locations. Second, it needs to retrieve the top-k best subscriptions over tons of candidates. Thus, an efficient and effective solution to cope with the TMSB-loc problem is necessary.

To efficiently and effectively process the TSMB-loc problem, we propose a novel \(R^t\)-tree based index structure, ranked \(R^t\)-tree (called \(RR^t\)-trees since then), by integrating the \(R^t\)-tree [3] index structure with a predicate index structure together and using a subscription partitioning scheme. When an event with location information arrives, our method can quickly retrieve its top-k best matched subscriptions. To summarize, we make the following contribution:

  • We propose a new problem, location-aware top-k subscription matching with boolean expressions(TSMB-loc).

  • We propose a new index structure, called \(RR^t\)-trees as the solution of TSMB-loc, which can efficiently retrieve top-k best subscriptions from millions of subscriptions.

  • We conduct experiments on a synthetic dataset and a real-world dataset to evaluate the performance of our proposed \(RR^t\)-trees.

The remaining of this paper is organized as follows. In Sect. 2, we overview related work. Then, we formalize the problem in Sect. 3. In Sect. 4, we propose a baseline solution by extending a boolean expression index structure and a state-of-the-art spatial index R-tree. In Sect. 5, we propose an advanced solution, the \(RR^t\)-trees index structure. In Sect. 6 we give the similarity upper bound of \(RR^t\)-trees. In Sect. 7, we present the matching algorithm of \(RR^t\)-trees. Extensive experimental results are reported in Sect. 8, and we conclude this paper in Sect. 9.

2 Related Work

This research topic is closely related to two main research branches: structured pub/sub systems and location-aware unstructured pub/sub systems. We will briefly review these two branches as follows.

Structured Pub/Sub. There are some researches on structured pub/sub with boolean expressions [2, 4, 9, 11, 12, 18, 21]. Guo et al. [5] proposed a new location-aware pub/sub system named Elaps. Elaps can continuously detect moving subscriptions of users over event streams. Hu et al. [7] proposed a \(R^I\)-tree for top-k subscription matching over structured information. They are all different to our problem since Elaps cannot maintain a top-k best matched subscriptions and \(R^I\)-tree is a partial matching in a boolean expression. To adapt to different workloads, Sadoghi and Jacobsen [10] presented BE*-tree index structure. BE*-tree allows the values of attributes to be continuous. It combines a bi-directional tree expansion mechanism with an overlap-free splitting strategy. D et al. [21] proposed Op-index, which builds an inverted index over the pivot attributes of subscriptions and developed a two-level partitioning scheme to handle subscriptions with high dimensional attributes. However, BE*-tree does not make the location dimension into consideration and Op-index can not return top-k subscriptions.

Location-Aware Unstructured Pub/Sub. There are many related researches on unstructured location-aware pub/sub over geo-textual data. [1, 5, 6, 8, 1417, 19, 20, 2224]. To study the location-aware pub/sub problem for parameterized spatio-textual subscriptions, Hu et al. [6] presented a filter-verification framework by integrating prefix filtering and spatial pruning techniques together. To efficiently filter geo-textual data, Li et al. [8] proposed \(R^t\)-tree,which loads the selected token from subscriptions into different R-tree nodes. To support ranking semantics, Yu et al. [20] make an extension of \(R^t\)-tree. These three works [6, 8, 20] only focus on geo-textual information, which cannot retrieve top-k subscription matching. Note that pub/sub focusing on geo-textual cannot support geo-tagged attribute-value pair objects. We propose \(RR^t\)-trees, which has two distinguishing features. First, \(RR^t\)-trees allows user to specify their interests in form of boolean expressions. A boolean expression is much more expressive than that of a textual content. Second, \(RR^t\)-trees focus on retrieving top-k best matched subscriptions.

3 Problem Definition

In this section we formally define the problem of location-aware top-k subscription matching with boolean expressions.

Subscription: Subscribers register their interests as subscriptions. A subscription s is consisted by three elements: s.B, s.loc, \(\alpha \), where s.B is a boolean expression to describe the interests of subscribers, s.loc is the spatial location of a subscriber, and \(\alpha \) is a parameter to balance the relative importance between spatial similarity and boolean expression similarity (we call it BE similarity since then). The boolean expression is a combination of predicates in conjunctive normal form. A predicate is a constraint specified by users to represent the relationship between an attribute and its value. A predicate contains three elements: an attribute A, an operator \(f_{op}\), and a value v. That is, p(A,\(f_{op}\),v) denotes a predicate p. The operator can be a relational operator (\(<\),\(>\),\(\leqslant \),\(\geqslant \),\(=\),\(\ne \)), or a set (\(\in \),\(\notin \)). Each predicate has a weight \(\omega _s\), where \(\sum _{i=1}^{n}\omega _{si}=1\). Thus, the subscription can be modeled as follows:

$$\begin{aligned} s: \{[(p_1,\omega _{s1})\wedge (p_2,\omega _{s2})\wedge (p_i,\omega _{si})\wedge ...\wedge (p_n,\omega _{sn})],\ \text {loc}, \alpha \} \end{aligned}$$

Event: An event e contains a collection of attribute-value pairs denoted as e.V and a geo-position denoted as e.loc. The attribute-value pairs e.V are represented in the form of conjunction of predicates with equality operator. That is, \(\nu (A,v)\) denotes an attribute-value pair \(\nu \). Each attribute-value pair has a weight \(\omega _e\), where \(\sum _{i=1}^{n}\omega _{ei}=1\). Thus, an event can be denoted as follows:

$$\begin{aligned} e:\{[(\nu _1,\omega _{e1})\wedge (\nu _2,\omega _{e2})\wedge (\nu _3,\omega _{e3})\wedge ...\wedge (\nu _n,\omega _{en})], \text {loc} \} \end{aligned}$$

The weight \(\omega _s\) is given by subscribers and is used to represent users’ preference among predicates in a subscription. The weight \(\omega _e\) signifies the relevance between value-pair and its predicate. It is generated according to the appearing frequency of the attribute-value pair in the whole dataset.

Definition 1

(Predicate Match) Given an attribute-value pair \(\nu \), for a predicate p appears in a subscription, we said that there is a predicate match if \(p.A=\nu .A\) and \(p_i(\nu _i.v)\)=true.

Definition 2

(Boolean Expression Match) A boolean expression s.B is said to match a collection of attribute-value pairs e.V if each of the predicates in s.B has a match in e.V.

Definition 3

(Similarity Function \(\phi \)) Given an event e=e:{[(\(\nu _1\), \(\omega _{e1}\))\(\wedge \) (\(\nu _2\), \(\omega _{e2}\))\(\wedge \)(\(\nu _3\), \(\omega _{e3}\))\(\wedge \)...\(\wedge \)(\(\nu _n\), \(\omega _{en}\))], loc} and a subscription s=s: {[(\(p_1\), \(\omega _{s1}\))\(\wedge \) (\(p_2\), \(\omega _{s2}\))\(\wedge \) (\(p_i\),\(\omega _{si}\))\(\wedge \)...\(\wedge \) (\(p_n\),\(\omega _{sn}\))], loc, \(\alpha \)}, we define the similarity function \(\phi (e,s)\) as follows:

$$\begin{aligned} \phi (e,s)=s.\alpha \cdot \varphi _{BE}(e,s) +(1-s.\alpha )\cdot \varphi _s(e,s) \end{aligned}$$
(1)

where \(\varphi _{BE}\) is a BE similarity function and the \(\varphi _S\) is a spatial similarity function.

$$\begin{aligned} \varphi _{BE}(e,s)=\sum _{p_j\in s,\nu _i\in e,p_j.A=\nu _i.A, p_i(\nu _i.v)=true}^{n=s.\varsigma }\omega _{sj}\cdot \omega _{ei} \end{aligned}$$
(2)

where \(s.\varsigma \) is the size of subscription s(number of predicates in a subscription).

The spatial similarity is given by:

$$\begin{aligned} \varphi _s(e,s)=1-\frac{distance(e.loc,s.loc)}{MaxDistance} \end{aligned}$$
(3)

where distance(e.loc,s.loc) is the Euclidian distance between s and e, and the MaxDistance is the maximum distance among subscriptions.

As shown in Fig. 1 for the event e={A=3(0.1)\(\wedge \) B=3(0.5)\(\wedge \) C=4(0.2)\(\wedge \) F=2(0.2),e.loc}, the boolean expression subscription \(S_1\) matches the attribute-value pairs of e according to Definition 2. However, the subscription boolean expression \(S_4\) does not match attribute-value pairs of e as there is no attribute-value pair in e that matches the predicate G\(\geqslant \)4. Based on Definition 3, the spatial similarity \(\varphi _S\)(e,\(s_1\)) is 0.35, and the BE similarity \(\varphi _{BE}\)(e,\(s_1\)) is 0.25. Therefore, the balanced similarity \(\phi \)(e,\(s_1\)) is 0.30. Similarly, since the spatial similarity \(\varphi _S\)(e,\(s_9\)) is 0.15, and the BE similarity is \(\varphi _{BE}\)(e,\(s_9\)) is 0.18, \(\phi \)(e,\(s_9\)) is 0.18. Thus, if we want to retrieve top-1 best subscription for event e, the answer is \(S_1\).

Location-Aware Top-k Subscription Matching: Given a set of subscriptions S, TSMB-loc aims to finds the top-k most relevant strict matched subscriptions \(S^k\in S\). For any subscription s\(\in \) \(S^k\) and \(s^*\) \(\in \)(S-\(S^k\)), \(\phi \)(e,\(s^*\))\(<\) \(\phi \)(e,s).

Fig. 1.
figure 1

An example of subscriptions and events

4 A Baseline Solution

In this section, we extend two state-of-art index structures (Op-index and R-tree) to cope with the TMSB-loc problem. These extensions will be used as baseline solution to evaluate our advanced solutions proposed in Sect. 5.

Op-index is a well-known pub/sub index structure for boolean expressions, which builds an inverted index structure over a pivot attributeFootnote 2 and designs a two level partitioning scheme to handle the pub/sub problem with high-dimensions attribute. Based on op-index and a well-known spatial information index structure R-tree, we can integrate op-index with R-tree together (called OPR-tree in short) to cope with the TSMB-loc problem. We first build an R-tree for all the locations of subscriptions. When the subscriptions fall into leaf nodes of the R-tree, we organize subscriptions inside each leaf node using the Op-index structure. The construction and the query processing of OPR-tree will be explained in details.

OPR-Tree Construction: For each subscription s, we retrieve its leaf node n in the R-tree using its spatial point information s.loc and then select its pivot attribute \(\delta _A\). We partition subscriptions inside each leaf node into groups according to their pivot attributes. We can present a list of subscriptions with the same pivot attribute \(\delta _A\) as a list denoted as L(n,\(\delta _A\)). Each list (e.g., L(n,\(\delta _A\))) can be further partitioned based on the operators (\(<\),\(>\),\(\leqslant \),\(\geqslant \),\(=\),\(\ne \)) of predicates. The predicates with the same operator are organized into a sub-list. A sub-list of L(n,\(\delta _A\)) can be presented as L(n,\(\delta _A\),op), where op is a specific operator. For each group of predicates L(n,\(\delta _A\),op), we use a signature segment to map the predicates by a hash function. We compute the hash value of each predicate using a hash function h(p.A) to select a bit from the signature segment and set this bit to 1. Besides, there is a collection of counter arrays, corresponding with each subscription list L(n,\(\delta _A\),op). The value of the counter array is initialed as the size of the subscriptions. For each predicate in L(n,\(\delta _A\),op), there is a pointer to point to its corresponding counter array value.

OPR-Tree Query Processing: For an event e, we search the R-tree using the spatial point information e.loc to find a corresponding leaf node n. Then we extract each candidate pivot attribute \(\nu _i\).A (\(\nu _i\) \(\in \) e.V) from the set of distinct attribute-value pairs e.V. If \(\nu _i.A\) is indeed a pivot attribute, we extract each attribute-value pair \(\nu _i\) to search the predicate lists L(n,\(\delta _A\),op) in L(n,\(\delta _A\)). For each attribute-value pair \(\nu _i\), we first calculate the hash value of a candidate attribute, i.e., h(\(\nu _i.A\)). If the corresponding bit of the signature segment is 1, we search the corresponding predicate list L(n,\(\delta _A\),op). Then, the BE similarity \(\varphi _{BE}\)(e,s) is calculated if \(p_j\)(\(\nu _i\).v=true), where \(p_j\) \(\in \) L(n,\(\delta _A\),op). If the corresponding value of the counter array goes to 0, we calculate the spatial similarity \(\varphi _S\)(e,s). The final balanced similarity \(\phi \)(e,s) is calculated and added into the temporary result set as a candidate top-k result. The upper bound of a given leaf node n to a given event e is a balanced similarity between the minimum Euclidian distance from e to n and the maximum weight of attribute-value in e.

OPR-tree partitions the subscriptions by the region of leaf nodes and organizes these subscriptions using Op-index. However, the region of a leaf node may be very small, which results in a poor pruning ability in high spatial dimensions. Therefore, OPR-tree is not very efficient. In order to avoid this problem, we proposed \(RR^t\)-trees method in the next section.

5 \(RR^t\)-trees Solution

In this section, we present a framework that integrates \(R^t\)-tree and a predicates index structure into a new index, named \(RR^t\)-tree. Based on \(RR^t\)-tree, we develops a partitioning scheme to organize subscriptions into disjointed \(RR^t\)-trees. Finally, we introduce the upper bound of efficiently and effectively filtering out top-k best subscriptions over this framework.

5.1 \(RR^t\)-tree Index Structure

As we discussed in Sect. 4, OPR-tree is not very efficient to meet TSM-loc since the poor pruning ability in spatial dimension.

R\(^t\)-tree [3] is an unstructured location-aware pub/sub index structure, which integrates so-called high-quality representative tokens selected from subscriptions into the nodes of R-tree. Based on \(R^t\)-tree, we propose a method, ranked \(R^t\)-tree (\(RR^t\)-tree) to cope with the TSMB-loc problem. The basic idea of \(RR^t\)-tree is to convert the tokens of \(R^t\)-tree into predicates of a subscription, which will be loaded into the ancestor nodes of a leaf node in which a subscription locates. Then, predicates in each node are indexed using a predicate index structure.

RR \(^{t}\) -Tree Construction: We build an R-tree for all the locations of subscriptions. For a given subscription s, we first extract the distinct predicate \(p_i\), containing its weight, where \(p_i\in s\). Then, we load \(p_i\) into different nodes at different levels, which is determined by the spatial location s.loc of the subscription s.

Considering a built R-tree, its height is H, the size of a given subscription s is s.\(\varsigma \). If s.\(\varsigma \) \(>\)H, we directly insert the last \(s.\varsigma \) - H+1 predicates. If \(s.\varsigma \) \(<\) \(H\), only the ancestors in the first \(s.\varsigma \) level contain the corresponding predicates of s. Letting \(s. p_i\) denote the i-th predicate in s, then \(s. p_i\) is loaded in an ancestor node at i-th level. For each node n on i-th level, there is a set of predicates denoted as P. We build inverted lists over the attributes of the predicates in P to organize predicates with the same attribute. To track the number of matched predicates of a subscription during an event query processing, we assign a hash map M for each subscription and initial each hash value M[s] to be 0. When a predicate \(p_i\) matches an attribute-value pair, we increase its corresponding hash value by 1. Based on the number of matched predicates M[s], we can efficiently filter the subscription. To explain this, we have the following lemmas.

Lemma 1. Consider an event e and a node n at the i-th level. If M[s]\(<\)i, s cannot be a candidate top-k result.

Proof. As s appears in the i-th level, it contains at least i predicates. For each node on path from the root to node n, s.B must have a predicate which cannot match all attribute-value pairs in event e. According Definition 2, s cannot be the candidate top-k result of e.

Lemma 2. Consider an event e and a node n at the i-th level, if M[s]\(=\)s.\(\varsigma \), s must be a candidate top-k result

Proof. If M[s]=s. \(\varsigma \), all the predicates of s.B are matched by e. According to Definition 2, s must be a candidate top-k subscription of e.

Lemma 3. Consider an event e and a node n at the i-th level, if M[s]=i\(<\)s.\(\varsigma \), and n is a leaf node, s cannot be a candidate top-k result of e.

Proof. Since n is a leaf node, we have M[s]=i=H \(<\)s.\(\varsigma \). That is, the rest s. \(\varsigma \)- H+1 predicates are loaded in the node n, which cannot match all attribute-value pairs in event e. According to Definition 2, s cannot be the candidate top-k result of e.

Lemma 4. Consider an event e and a node n at the i-th level. For any \(p_i.A\in p_i\in P\in n\), if \(p_i.A\) does not appear in e, we can directly pruning the node n.

Proof. If \(p_i.A\) does not appear in e, according to Definition 1, no predicate in node n matches any attribute-value pair. According to Definition 2, subscriptions on node n cannot be candidate top-k results of e.

Predicate Index Structure: In each node on \(RR^t\)-tree, there are a set of predicates P, a weight of each predicate, the maximum alpha value \(\alpha _{max}\) and minimum alpha value \(\alpha _{min}\). To efficiently retrieve matched predicates in P, we design an index structure for P. We index the predicates of P in two partitioning steps. In the first step, predicates are partitioned into disjointed predicate lists based on attributes as follows:

$$\begin{aligned} P=L(A_1)\cup L(A_2)\cup L(A_i)\cup ...\cup L(A_n) \end{aligned}$$
(4)

For each predicate in list \(L(A_i)\), there is a pointer to point to the number of matched predicates M[s] of its corresponding subscription. In second step, the predicates list \(L(A_i)\) is further partitioned by their operators (we only use standard operators, such as \(\le \),\(\ge \),=) into its corresponding value list \(L(A_i,op)\) as follows:

$$\begin{aligned} L(A_i)=L(A_i,\le )\cup L(A_i,=)\cup L(A_i,\ge )\cup ...\cup L(A_n,op) \end{aligned}$$
(5)

Figure 2(a) shows the \(RR^t\)-tree index structure for the subscriptions shown in Fig. 1. Figure 2(b) shows the predicate index structure for \(P_3\).

Fig. 2.
figure 2

An \(RR^t\)-tree index for the subscriptions shown in Fig. 1

5.2 \(RR^t\)-trees Index Structure

Since the number of subscriptions can be very large, it is necessary to improve the efficiency of \(RR^t\)-tree. To address this problem, we partition subscriptions according to their pivot attributes (discussed in Sect. 4) into N Footnote 3 subscription lists and organize the subscription lists using disjointed \(RR^t\)-trees. We simply name this method \(RR^t\)-trees since then. Given a set of subscriptions S, we partition these subscriptions according to their pivot attributes \(\delta _A\) and organize them using \(RR^t\)-trees as follows

$$\begin{aligned} S=RR^t-trees=RR^t-tree(\delta _{A1})\cup RR^t-tree(\delta _{A2})\cup ...\cup RR^t-tree(\delta _{An}) \end{aligned}$$
(6)

From Definitions 1 and 3, we can conclude that if an event e matches a subscription s, then all the attributes in s must appear in e. Obviously, if there is an attribute in s but not in e, e wont be matched by s. Thus, given an event e, we only consider the subscriptions whose pivot attributes appear in e. Attributes with a low frequency in a whole dataset has low probabilities to appear in subscriptions. Thus, we choose the lowest frequency attribute in a subscription as the pivot attribute.

The index structure of RR-trees for the subscriptions shown in Fig. 1 is shown in Fig. 3. According to the rule of selecting pivot attributes mentioned above, A, D, E and G are selected as a pivot attribute respectively. Given an event e in Fig. 1, subscriptions in both L(E)and L(G) don’t match e definitely.

Fig. 3.
figure 3

The index structure of \(RR^t\)-trees

6 Similarity Upper Bound of \(RR^t\)-tree and \(RR^t\)-trees Solution

After having described the \(RR^t\)-tree and \(RR^t\)-trees index structure, it is the time to introduce the upper bound of similarity.

Definition 4

(\(UB_{BE}(e,n)\)) For a given event e and a node n in \(RR^t\)-tree, the upper bound of the BE similarity \(UB_{BE}(e,n)\) is defined as follows. The BE similarity bound of a given event e to a node n is:

$$\begin{aligned} UB_{BE}\left( e,n \right) =Max\left\{ s\in n.parent\left[ \sum _{1}^{i-1}\omega _{si}\cdot \omega _{ej}+\omega _{emax}^*\cdot \left( 1-\sum _{1}^{i-1}\omega _{si} \right) \right] \right\} \end{aligned}$$
(7)

where \(\sum _{1}^{i-1}\omega _{si}\cdot \omega _{ej}\) is the total score of matched predicates of s appearing in level 1 to level i-1 where \(i>1\), and \(\omega _{emax}^*\) is the maximum weight of unmatched attribute-value pairs in e for subscription s. And \(1-\sum _{1}^{i-1}\omega _{si}\) is the remaining total weights of unmatched predicates in subscription s.

Definition 5

(\(UB_S(e,n)\)) For a given event e and a node n, the upper bound of the spatial similarity \(UB_S(e,n)\) is defined as follows:

$$\begin{aligned} UB_S(e,n)=(1-\frac{MinDistane(e.loc,n.MBR)}{MaxDistance}) \end{aligned}$$
(8)

where MaxDistance is the maximum distance between subscriptions, n.MBR it the minimum bounding rectangle of node n and MinDistance(e.locn.MBR) is the minimum Euclidian distance between e.loc and any point on n.MBR.

Definition 6

(UB(e,n)) According to Eqs. 7 and 8, for a given event e and a node n, the total upper bound UB(e,n) is defined as follows:

$$\begin{aligned} UB\left( e,n \right) =Max\left\{ \forall \alpha \in \left( \alpha _{min},\alpha _{max} \right) min\left[ 1-\alpha ,UB_{BE}\left( e,n \right) \right] +\alpha \cdot UB_S(e,n) \right\} \end{aligned}$$
(9)

where \(\alpha _{min}\), \(\alpha _{max}\) are the minimum and maximum \(\alpha \) of subscriptions in node n.

According to Definition 6, we have the following lemma:

Lemma 5. Given an event e and a node n whose MBR encloses a set of subscriptions \(S^n\). For any subscription s where \(s\) \(\in \) \(S^n\), there is:

$$\begin{aligned} \phi (e,s)<UB\left( e,n \right) \end{aligned}$$
(10)

We omit the proof due to space constraints.

7 Matching Algorithm

How can we use \(RR^t\)-tree to retrieve top-k best matched subscriptions with boolean expressions? We show the processing of retrieving top-k best matched subscriptions in Algorithm 1. We use a bound-queue to store the nodes that have not been visited in the algorithm. The nodes in the bound-queue is ordered by their UB(en) in a descending order, which is calculated according to Eq. 9 from its parent node. For the root node, the bound is 1. Given an event e, we traverse all the RR\(^t\)-tree(\(\nu _i\).A) in \(RR^t\)-trees from the root node, where \(\nu _i\) \(\in \) e. The algorithm will return candidate subscriptions for the top-k best matched subscriptions. It will stop under two situations as follows.

  • When k subscriptions are found and its minimum similarity is larger than the maximum UB(en) in the bound-queue.

  • When the bound-queue is empty.

figure a

8 Experiments

In this section, we will evaluate three indexes (OPR-tree, \(RR^t\)-tree, \(RR^t\)-trees) on synthetic and real-world datasets. All the methods are implemented in java (JDK7) and experiments are running on a machine with 3.2 GHz Intel(R) (TM) Core i5-3470 CPU and 16 GB of RAM.

8.1 Experimental Setup

Two datasets are used in our experiments, one synthetic dataset and one real-world dataset (eBay dataset), shown in Table 1. To generate the synthetic dataset, we implement a data generator, which can generate attributes, operators and values. For the set operator \(\in ,\notin \), we rewritten them into standard operator =, \(\ge \),\(\le \). For the weights of attribute-value pairs in an event are generated base on this equation \(\omega _{ei}=\frac{\nu .f}{\sum _{i=0}^{n}\nu _i.f}\), where \(\nu _i.f\) is the frequency of an attribute-value pair in the whole dataset. And n is the number of attribute-value pairs in an event. We totally generate 5M synthetic subscriptions corresponding with 10k events for event matching tests in the synthetic dataset. For the real-world dataset (eBay), we generate 10M subscriptions and 10k events based on 10k product messages, and we extract the spatial information from Twitter to generate final subscriptions and events.

Table 1. Parameters and settings

8.2 Experimental Results

In this section, we will evaluate three indexes on synthetic datasets and real-world datasets. For synthetic datasets we evaluate the performance of three indexes from different perspectives, a varied number of subscriptions and distinct attributes, varied average size of subscriptions,varied average size of events, varied parameter k and \(\alpha \).

Fig. 4.
figure 4

Varying value of k on ebay dataset

Fig. 5.
figure 5

Varying value of \(\alpha \) on eBay dataset

Fig. 6.
figure 6

Varying number of subscriptions on eBay dataset

Matching Time under Various k: The value of k is an important parameter for the TSMB-loc problem. Figure 4 shows the performance of the three methods with the increment of k on ebay dataset. We can clearly see that the average matching time of all the three methods increases with the increment of k. However, R\(R^t\)-trees achieves the best performance because of its powerful pruning ability. It is nearly 3 times faster than the next best method \(RR^t\)-tree. OPR-tree performs the worst, using the largest average matching time.

Matching Time under Various \(\alpha \) : We also conduct several experiments to investigate the impact of varying the maximum balance value \(\alpha \). Our experimental results are shown in Fig. 5. Figure 5 shows that the average matching time of all the three methods grows slowly as the value of \(\alpha \) increases. \(RR^t\)-trees still achieve the best performance over the real-world dataset.

Matching Time on Varying Number of Subscription: From Fig. 6, we can see that all the three methods are sensitive to the number of subscriptions. And \(RR^t\)-trees achieves the lowest event matching time, followed by \(RR^t\)-tree. OPR-tree performs the worst, having the highest matching time. \(RR^t\)-trees is 3.5 times faster than \(RR^t\)-tree. This is because \(RR^t\)-trees uses pivot attributes to partition subscriptions. This causes that indexing the subscriptions using \(RR^t\)-trees much more efficient than that using a single \(RR^t\)-tree. Furthermore, the upper bounds over a small number of subscriptions are easier to calculate than those over a larger scale of subscriptions, which causes the matching time of \(RR^t\)-trees grows much more smoothly than that of the other two methods.

Fig. 7.
figure 7

Average sizes of events on synthetic dataset

Fig. 8.
figure 8

Average sizes of subscriptions on ebay dataset

Fig. 9.
figure 9

Varying number of distinct attributes on synthetic dataset

Matching Time under Different Sizes of Events: Again, since the size of each event in the real-world dataset eBay is fixed, we couldnt conduct these experiments on the dataset eBay. We conduct experiments over the synthetic dataset. The experimental results are shown in Fig. 7. We can see that the average matching time of the three methods increases with the increment of the size of events. This is because the number of candidate subscriptions increases when the size of events increases. \(RR^t\)-trees scales much better than the other two methods (\(RR^t\)-tree and OPR-tree). It is about 3 times faster than the next best method \(RR^t\)-tree.

Matching Time under Different Sizes of Subscriptions: The size of subscriptions can affect the matching performance. The average event matching time on different sizes of subscriptions of the three methods on ebay dataset is reported in Fig. 8. From this figure, we can see that all the three methods are sensitive to the size of subscriptions. But, both \(RR^t\)-trees and \(RR^t\)-tree scale better than OPR-trees. It is because than both in \(RR^t\)-trees and \(RR^t\)-tree, subscriptions are pruned by each predicate on the nodes of R-tree, as we described in Sect. 5. \(RR^t\)-trees scales better than \(RR^t\)-tree because of its pruning ability of pivot attributes. However, OPR-tree only prunes subscriptions by the spatial bounds.

Matching Time under Different Numbers of Distinct Attributes: Since the number of distinct attributes is fixed in the real-world dataset eBay, we conduct experiments with various numbers of distinct attributes on our synthetic dataset. As Fig. 9 shows, when the number of attributes increases, the average matching time of both \(RR^t\)-trees and OPR-tree decreases. However, \(RR^t\)-tree performs oppositely. With the increment of number of attributes, its average matching time increases a little. \(RR^t-trees\) deceases gradually, because it generates many more narrowed partitions when the number of distinct attributes increases during partitioning subscriptions according to pivot attributes.

9 Conclusion

In this paper, we tackled the problem of location-aware top-k subscription matching, which is significant for location-aware publish/subscribe systems with a stream of geo-tagged attribute-value pair objects. Facing the challenge of efficiently and effectively delivering events to the top-k subscribers, we proposed a novel index structure called \(RR^t\)-trees, which integrates \(R^t\)-tree and a predicate index structure. In addition, we developed an efficient filtering strategy to reduce the searching space. Extensive experiments conducted in both synthetic and real-world datasets demonstrate the effectiveness and efficiency of our algorithms.