1 Introduction

The continued development of location-based social media has made geo-textual data ubiquitous. For example, Twitter, a popular micro-blogging service, enables mobile users to publish short-text posts with textual, spatial, and temporal information. Other social media platforms, such as Facebook, WeChat, and Foursquare, likewise allow mobile users to publish items with textual, spatial, and temporal information. These items can be modeled as geo-textual objects, each of which consists of a text document, a spatial coordinate, and a timestamp. As the number of geo-textual objects has skyrocketed over the past decades, it is of great importance to enable basic data analytic functionalities over massive-scale geo-textual data. In particular, similarity join, one of the most popular data analytic functionalities, has been extensively investigated by existing studies. The similarity join operation plays an important role in spatial data management, as it has a broad range of applications, including but not limited to data cleaning, data summarization, ridesharing recommendations, and spatial keyword query result authentication.

However, existing studies on similarity join have the following limitations. First, they do not consider similarity join in the geo-textual domain, where spatial, textual, and temporal similarities must all be taken into account. Second, existing approaches are designed for a static scenario. That is, the underlying data objects are regarded as a fixed collection of items. However, in many real-life applications the data objects arrive in a streaming fashion. It is therefore of great importance to enable real-time processing of similarity join over geo-textual data streams.

In this paper, we investigate the problem of similarity join over a stream of geo-textual objects. In particular, given a stream of geo-textual objects P and a set of geo-textual objects Q, we study the problem of finding all pairs of geo-textual objects (oi,oj) where oi ∈ P and oj ∈ Q and the similarity between oi and oj is no less than a similarity threshold 𝜃. Note that the update frequency of P is much higher than that of Q. We need to consider the following three aspects when measuring the similarity between two geo-textual objects: (1) spatial proximity; (2) textual relevancy; (3) temporal gap.

Efficient processing of geo-textual similarity join under the aforementioned scenario poses the following technical challenges. First, we need to define an effective similarity metric to measure the similarity between two geo-textual objects. The similarity metric should take the spatial proximity, textual relevancy, and temporal gap between two objects into consideration. Further, the similarity function is expected to be lightweight. Second, the number of geo-textual objects can be very large. We often need to process millions or even tens of millions of geo-textual objects. As such, our solution is required to be scalable. Third, in real-life applications, geo-textual data streams may have a high arrival rate. Our solution needs to meet the efficiency requirement and return real-time join results over dynamic datasets. A straightforward approach works as follows. Given a collection of query objects Q and a stream of geo-textual objects P, each time a new object on arrives in P, we calculate the similarity between on and each query object oq in Q. If the similarity between on and oq is no less than the similarity threshold 𝜃, we add the pair (on,oq) to the join result set. This approach is time consuming because the number of query objects can be very large, so it is difficult to meet both the efficiency and scalability requirements.

In this light, we develop a novel three-phase filtering approach, named tri-filtering, to process the spatio-temporal similarity join problem in a real-time fashion over geo-textual data streams. To be specific, the tri-filtering mechanism indexes the geo-textual object set that is updated less frequently and regards the other geo-textual object set as a data stream. We propose a hybrid grid indexing structure, HGI, that effectively combines the spatial, textual, and temporal information of geo-textual objects. In particular, HGI first partitions the underlying space into a set of m × m cells. Each cell maintains an inverted file and a sequence of time slots. Note that the inverted file indexes objects located in the cell on the basis of their terms, and the time slots partition the time space into a set of time intervals with equal timespan. Each time slot indexes geo-textual objects whose timestamps are within the slot. When a new geo-textual object arrives, we compute a spatial similarity upper bound between the new object and the objects in each cell. If the spatial similarity upper bound is smaller than the similarity threshold 𝜃, we may safely prune all objects indexed under the corresponding cell. If a cell cannot be pruned based on the spatial similarity upper bound, we proceed to evaluate each time slot in the cell and calculate a spatio-temporal similarity upper bound between the new object and the objects in each time slot associated with the cell. If the spatio-temporal similarity upper bound is smaller than the similarity threshold 𝜃, we may safely prune the objects indexed under the corresponding time slot. Otherwise, we proceed to evaluate each posting in the inverted file associated with the time slot of the cell.

Our proposed tri-filtering framework has the following major advantages.

  • Effectiveness: The tri-filtering framework is able to generate real-time join results for massive-scale geo-textual data. It is capable of handling geo-textual streams with high arrival rates.

  • Scalability: In real-life applications, the throughput of geo-textual data streams can be very high. Thanks to the multi-layer filtering techniques, the tri-filtering framework is able to process the similarity join problem over a large number of geo-textual objects.

  • Generalization: The tri-filtering framework is independent of the similarity metrics. It is applicable to a variety of spatial, temporal, and textual similarity functions.

Although the problem of geo-textual similarity join has been investigated by existing studies, to the best of our knowledge, existing studies are incapable of addressing the aforementioned three factors at the same time. It is worth noting that all three factors, namely effectiveness, scalability, and generalization, play important roles in geo-textual similarity join. As location-based services and social media platforms become increasingly popular, it is imperative to develop an effective, scalable, and generic real-time mechanism to answer the geo-textual similarity join problem.

To sum up, we have made the following contributions.

  • We study a new problem of real-time geo-textual similarity join, which has a broad range of applications, such as geo-textual data cleaning, geo-textual data analytics, and geo-textual data visualizations.

  • We develop a tri-filtering framework with a dedicated geo-textual object indexing structure to organize massive-scale geo-textual objects effectively. Based on the indexing structure, we propose a couple of filtering techniques and an online geo-textual object matching algorithm. With the aforementioned techniques, tri-filtering is capable of generating geo-textual similarity join results in real-time fashion.

  • We conduct extensive experiments over real-world datasets. Our experimental results show that tri-filtering is able to achieve high effectiveness and high scalability.

The remaining parts of this paper are organized as follows. Section 2 presents the definition of geo-textual objects and our real-time geo-textual similarity join problem. Section 3 introduces our proposed geo-textual similarity metric. Section 4 details our proposed solution. Section 5 reports the experimental studies. Section 6 reviews the related studies. Section 7 concludes this paper.

2 Problem statement

This section presents the definition of geo-textual objects, geo-textual similarity join, and our problem formulation.

Definition 1

Geo-textual Object. A geo-textual object is defined as a triple o = 〈ψ,ρ,t〉, where o.ψ denotes a text description, o.ρ denotes a geographical location (latitude and longitude), and o.t denotes a timestamp.

Next, we present our definition of continuous geo-textual similarity join problem.

Definition 2

Continuous Geo-Textual Similarity Join (CGTS-Join). Given a collection of geo-textual objects Q, a stream of geo-textual objects P, and a similarity threshold 𝜃, the CGTS-Join problem aims to maintain a set of geo-textual object pairs \(\mathcal {S}\) where each object pair \((o_{i},o_{j})\in \mathcal {S}\) satisfies the following conditions:

  1. oi ∈ P and oj ∈ Q;

  2. The similarity between oi and oj, denoted by Sim(oi,oj), is no less than 𝜃.

Basically, we regard object set Q as a static set and set P as a dynamic set. Each time a new object on is inserted into P, we need to update the result set \(\mathcal {S}\) with the new similar pairs that contain on.

3 Geo-textual similarity metric

In this section, we present our geo-textual similarity metric. Note that when we measure the similarity between two geo-textual objects, we need to take the following three aspects into consideration: (1) spatial proximity; (2) textual relevancy; (3) time gap. At the same time, given our application scenario, the proposed similarity metric should be cheap to evaluate. That is, the computation of the similarity between two geo-textual objects needs to have relatively low time complexity. For this purpose, we devise the following similarity function to measure the similarity between two objects oi and oj.

$$\begin{array}{@{}rcl@{}} Sim(o_{i},o_{j})&= &\alpha\times Sim_{spatial}(o_{i}.\rho,o_{j}.\rho)+\beta\times Sim_{textual}(o_{i}.\psi,o_{j}.\psi)\\&&+\gamma\times Sim_{temporal}(o_{i}.t,o_{j}.t), \end{array}$$
(1)

where Simspatial(oi.ρ,oj.ρ) denotes the spatial proximity between oi and oj, Simtextual(oi.ψ,oj.ψ) denotes the textual relevancy between oi and oj, Simtemporal(oi.t,oj.t) denotes the time gap score between oi and oj, and α, β, and γ denote the preference parameters representing the weights of spatial proximity, textual relevancy, and time gap score, respectively. Note that the preference parameters α, β, and γ should satisfy the following requirement.

$$\alpha+\beta+\gamma = 1$$

We proceed to introduce how to compute the spatial proximity, textual relevancy, and time gap score, respectively, as follows. We first present how to compute the spatial proximity between oi and oj, which is denoted by Simspatial(oi.ρ,oj.ρ) in (2).

$$Sim_{spatial}(o_{i}.\rho,o_{j}.\rho)=1-\frac{dist(o_{i}.\rho,o_{j}.\rho)}{dist_{max}},$$
(2)

where distmax denotes the maximum possible distance in the underlying space. The textual relevancy between oi and oj, denoted by Simtextual(oi.ψ,oj.ψ), is computed by (3).

$$Sim_{textual}(o_{i}.\psi,o_{j}.\psi)=\frac{|o_{i}.\psi\cap o_{j}.\psi|}{|o_{i}.\psi\cup o_{j}.\psi|}.$$
(3)

Note that, apart from the Jaccard similarity in (3), our framework may support other kinds of text similarity measures, such as cosine similarity and language models. Finally, we present how to compute the time gap score between oi and oj, which is defined by (4).

$$Sim_{temporal}(o_{i}.t,o_{j}.t)=1-\frac{|o_{i}.t-o_{j}.t|}{t_{current}-t_{earliest}},$$
(4)

where tcurrent denotes the current timestamp and tearliest denotes the earliest timestamp among the indexed geo-textual objects.
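Equations (1) to (4) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the class and function names (`GeoObject`, `sim_spatial`, and so on) are hypothetical, locations are treated as planar points with Euclidean distance, two empty term sets are given textual relevancy 0, and tcurrent is assumed to be strictly greater than tearliest.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class GeoObject:
    """Hypothetical container for a geo-textual object o = <psi, rho, t>."""
    psi: FrozenSet[str]          # set of terms
    rho: Tuple[float, float]     # location, treated as a planar point (assumption)
    t: float                     # timestamp


def sim_spatial(a: GeoObject, b: GeoObject, dist_max: float) -> float:
    # Eq. (2): one minus the distance normalised by the maximum distance.
    return 1.0 - math.dist(a.rho, b.rho) / dist_max


def sim_textual(a: GeoObject, b: GeoObject) -> float:
    # Eq. (3): Jaccard similarity; 0.0 for two empty term sets (assumption).
    union = a.psi | b.psi
    return len(a.psi & b.psi) / len(union) if union else 0.0


def sim_temporal(a: GeoObject, b: GeoObject,
                 t_current: float, t_earliest: float) -> float:
    # Eq. (4): one minus the time gap normalised by the indexed time range.
    return 1.0 - abs(a.t - b.t) / (t_current - t_earliest)


def sim(a: GeoObject, b: GeoObject, alpha: float, beta: float, gamma: float,
        dist_max: float, t_current: float, t_earliest: float) -> float:
    # Eq. (1): weighted sum, with the weights required to sum to one.
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return (alpha * sim_spatial(a, b, dist_max)
            + beta * sim_textual(a, b)
            + gamma * sim_temporal(a, b, t_current, t_earliest))
```

As a small worked case, take two objects 5 distance units apart with distmax = 10, one shared term out of three distinct terms, and a time gap of 10 over a range of 100. With α = 0.5, β = 0.3, and γ = 0.2, the score is 0.5 × 0.5 + 0.3 × 1/3 + 0.2 × 0.9, which is approximately 0.53.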

4 Continuous geo-textual similarity join

In this section, we first present the baseline solution to the CGTS-Join problem, named the Brute Force Matching (BFM) algorithm (cf. Section 4.1). Next, we present the details of our tri-filtering framework (cf. Section 4.2).

4.1 Brute force matching algorithm

This subsection presents the baseline algorithm for the CGTS-Join problem. The high-level idea is as follows. Each time a new geo-textual object on arrives, we compute the similarity score between on and each indexed object oi. If the similarity between on and oi is no less than the similarity threshold 𝜃, we generate the object pair (on,oi) and add the pair to the join result set \(\mathcal {S}\). In this way, \(\mathcal {S}\) is continuously updated as new objects arrive.

We proceed to analyze the time complexity of the BFM algorithm. Let P and Q be two sets of geo-textual objects. Assume that P is updated very frequently and Q is a static set. Each time a new object is inserted into P, we need to compute the similarity between the new object and every object in Q. As such, the time complexity of processing each new object is O(|ψ|×|Q|), where |ψ| denotes the number of terms per object and |Q| denotes the cardinality of Q. Because we need to process all of the objects in P, the overall time complexity of the BFM algorithm is O(|ψ|×|P|×|Q|).
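The per-object step of BFM can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `bfm_insert` is hypothetical, and the similarity function is passed in as a parameter so that any metric with values comparable to 𝜃 can be plugged in.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def bfm_insert(o_n: T, Q: Sequence[T], theta: float,
               sim: Callable[[T, T], float],
               S: List[Tuple[T, T]]) -> List[Tuple[T, T]]:
    """Compare the new object o_n with every object in Q (one pass over Q)."""
    for o_q in Q:
        # Keep the pair when the similarity is no less than the threshold.
        if sim(o_n, o_q) >= theta:
            S.append((o_n, o_q))
    return S
```

Calling `bfm_insert` once per arriving object reproduces the |P| × |Q| number of similarity evaluations discussed above; the |ψ| factor comes from the cost of each evaluation.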

4.2 Tri-filtering framework

The BFM algorithm has the following limitations. First, each time a new object arrives, we need to calculate the similarity between the object and all of the existing objects, which is extremely time consuming. Second, the number of existing geo-textual objects may grow as time goes by. As such, the time cost of processing a new object inevitably grows linearly with the number of indexed objects, which is unsuitable for data stream scenarios. To address the aforementioned limitations, we develop a dedicated Hybrid Grid Indexing structure, HGI, that is able to organize the spatial, textual, and temporal information of geo-textual objects in an effective manner. Based on the indexing structure, we propose pruning techniques based on similarity upper bounds. We also propose efficient object matching algorithms based on the pruning techniques.

The details of HGI will be presented in Section 4.2.1. The group filtering techniques and object matching algorithms will be detailed in Section 4.2.2.

4.2.1 Hybrid grid index

The Hybrid Grid Index, HGI, is designed for organizing geo-textual objects in an effective manner. Geo-textual objects are organized on the basis of a spatial-temporal-textual hierarchy. Given a collection of geo-textual objects, we first organize the geo-textual objects based on their spatial information. Next, we organize the geo-textual objects based on their temporal information. Finally, objects are indexed based on their textual information. We proceed to introduce how to organize geo-textual objects based on spatial information, temporal information, and textual information, respectively.

To organize the spatial information of geo-textual objects, the HGI applies a grid indexing structure, which partitions the underlying space into m × m grid cells. For each cell c, we store the subset of geo-textual objects whose locations fall within the spatial range indicated by cell c. Note that m is a system parameter, which is determined based on the spatial distribution of a particular dataset.

To organize the temporal information of geo-textual objects, the HGI uses a bucket indexing structure. Specifically, for each grid cell c, we further partition the geo-textual objects indexed under c into b buckets, where b is a system parameter that is determined based on the temporal distribution of the dataset. Each bucket B under cell c, written c.B, is associated with a timespan [ta,tb], where ta denotes the earliest possible timestamp of B and tb denotes the latest possible timestamp of B. When a new geo-textual object on arrives, if on.ρ falls in cell c and on.t falls within timespan [ta,tb], we let c.B store on.

To organize the textual information of geo-textual objects, the HGI uses inverted files. In particular, for each bucket B connected to each grid cell c, we maintain a corresponding inverted file that indexes the geo-textual objects whose locations fall in the spatial range of c and whose timestamps fall within timespan [ta,tb]. The inverted file consists of a set of inverted lists, each associated with a particular keyword. An inverted list is also called a postings list. Each postings list consists of a set of postings, and each posting is an entry for a particular geo-textual object. Figure 1 illustrates the structure of the inverted files associated with particular buckets. We see that, given a grid cell c, the geo-textual objects located within c are further partitioned into four buckets, namely B1, B2, B3, and B4. Each bucket corresponds to a particular timespan. Note that the timespans of different buckets are mutually exclusive. Each bucket is associated with an inverted file. The detail of the inverted file is illustrated by Figure 2. From Figure 2 we see that the inverted file of bucket B1 consists of a number of linked lists. Each linked list is the postings list of a keyword, which is denoted by w1, w2,..., or w6. Each linked list consists of postings. Each posting contains an object id and a pointer to the object. If the posting of object o appears in the linked list of keyword w1, then o.ψ contains w1.

Figure 1 Inverted files associated with buckets

Figure 2 Postings list of an inverted file
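The three-level organization of HGI (grid cell, then time bucket, then inverted file) can be sketched with nested dictionaries. This is a hedged illustration, not the authors' data structure: the class name `HGI` and its methods are hypothetical, the spatial extent and the time range are assumed to be known in advance, buckets are assumed to have equal width, and Python lists stand in for the linked postings lists.

```python
from collections import defaultdict
from typing import Iterable, Tuple


class HGI:
    """Sketch: grid cell -> time bucket -> term -> postings (object ids)."""

    def __init__(self, m: int, b: int,
                 bounds: Tuple[float, float, float, float],
                 t_range: Tuple[float, float]) -> None:
        self.m, self.b = m, b
        self.x0, self.y0, self.x1, self.y1 = bounds  # spatial extent (assumed known)
        self.t0, self.t1 = t_range                   # temporal extent (assumed known)
        # cells[(cx, cy)][bucket][term] -> list of object ids
        self.cells = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def cell_of(self, loc: Tuple[float, float]) -> Tuple[int, int]:
        # Map a location to its cell in the m x m grid, clamping to the border.
        cx = int((loc[0] - self.x0) / (self.x1 - self.x0) * self.m)
        cy = int((loc[1] - self.y0) / (self.y1 - self.y0) * self.m)
        return min(max(cx, 0), self.m - 1), min(max(cy, 0), self.m - 1)

    def bucket_of(self, t: float) -> int:
        # Map a timestamp to one of b equal-width buckets, clamping to the range.
        k = int((t - self.t0) / (self.t1 - self.t0) * self.b)
        return min(max(k, 0), self.b - 1)

    def insert(self, oid: str, terms: Iterable[str],
               loc: Tuple[float, float], t: float) -> None:
        # Add one posting per term to the inverted file of the object's bucket.
        inverted = self.cells[self.cell_of(loc)][self.bucket_of(t)]
        for w in terms:
            inverted[w].append(oid)
```

For instance, with m = 4 and b = 4 over the square [0,100] × [0,100] and the time range [0,40], an object at (10,10) with timestamp 5 is stored under cell (0,0), bucket 0, with one posting per keyword.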

4.2.2 Object matching with group filtering techniques

In this subsection, we present our object matching algorithm and our proposed group filtering techniques. At a high level, our object matching algorithm works as follows.

Algorithm 1 Object matching algorithm

Algorithm 1 presents the pseudo code of our object matching algorithm. The inputs are (1) the new geo-textual object on, (2) the object set Q, (3) the HGI index \(\mathcal {I}(Q)\) built over Q, (4) the similarity threshold 𝜃, and (5) the current join result set \(\mathcal {S}\). The output is the updated join result set that takes the object pairs related to on into consideration.

When a new object on arrives, we first evaluate on against the objects in each cell c. Specifically, for each cell c, we calculate the similarity upper bound between on and objects indexed by c, which is denoted by \(Sim^{s}_{ub}(o_{n},c)\) (Line 2). The value of \(Sim^{s}_{ub}(o_{n},c)\) is computed by (5).

$$Sim^{s}_{ub}(o_{n},c)=\alpha\times \left( 1-\frac{dist_{min}(o_{n}.\rho,c)}{dist_{max}}\right) + \beta + \gamma,$$
(5)

where distmin(on.ρ,c) denotes the minimum Euclidean distance between on.ρ and c. If the similarity upper bound between on and objects indexed by c is no less than the similarity threshold 𝜃 (Line 3), we proceed to evaluate on against each bucket indexed under c. To be specific, for each bucket B indexed under cell c, we calculate the similarity upper bound between on and objects indexed by bucket B, which is denoted by \(Sim^{st}_{ub}(o_{n},B)\) (Line 5). The value of \(Sim^{st}_{ub}(o_{n},B)\) is computed by (6).

$$Sim^{st}_{ub}(o_{n},B)=\alpha\times \left( 1-\frac{dist_{min}(o_{n}.\rho,c)}{dist_{max}}\right) + \gamma\times \left( 1-\frac{|o_{n}.t-B.t_{max}|}{t_{current}-t_{earliest}}\right) + \beta,$$
(6)

where B.tmax denotes the latest timestamp of objects in B. If the similarity upper bound between on and objects indexed by bucket B is no less than the similarity threshold 𝜃 (Line 6), we proceed to visit the inverted file maintained by B. In particular, for each term w in on.ψ we retrieve its corresponding postings list, which is denoted by Iw (Line 8). For each posting p in Iw we retrieve its entry and get the corresponding object oi through the entry (Line 10). After that, we compute the exact similarity between on and oi. If the similarity between on and oi is no less than 𝜃, we add object pair (on,oi) into the join result set \(\mathcal {S}\) (Lines 12–13). Finally, we return \(\mathcal {S}\) as the updated join result (Line 14).
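The matching procedure described above, with the cell-level bound (5) and the bucket-level bound (6), can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not a transcription of Algorithm 1: the names `Obj`, `dist_min`, `exact_sim`, and `match` are hypothetical; the index is passed as a plain dictionary from a cell rectangle to a list of (B.tmax, inverted file) pairs; the new object is assumed to be no earlier than B.tmax, which is what makes the temporal term in (6) a valid bound; and a `seen` set is added so that an object sharing several terms with on is evaluated only once.

```python
import math
from collections import namedtuple

# Hypothetical object record: id, frozenset of terms, (x, y) location, timestamp.
Obj = namedtuple("Obj", "oid terms loc t")


def dist_min(p, rect):
    """Minimum Euclidean distance from point p to rectangle (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    dx = max(x0 - p[0], 0.0, p[0] - x1)
    dy = max(y0 - p[1], 0.0, p[1] - y1)
    return math.hypot(dx, dy)


def exact_sim(a, b, w, dist_max, t_span):
    """Eq. (1) with Eqs. (2)-(4); t_span stands for t_current - t_earliest."""
    alpha, beta, gamma = w
    union = a.terms | b.terms
    spatial = 1.0 - math.dist(a.loc, b.loc) / dist_max
    textual = len(a.terms & b.terms) / len(union) if union else 0.0
    temporal = 1.0 - abs(a.t - b.t) / t_span
    return alpha * spatial + beta * textual + gamma * temporal


def match(o_n, index, theta, w, dist_max, t_span, S):
    """index: {cell_rect: [(bucket_t_max, {term: [Obj, ...]}), ...]}"""
    alpha, beta, gamma = w
    for rect, buckets in index.items():
        spatial = alpha * (1.0 - dist_min(o_n.loc, rect) / dist_max)
        if spatial + beta + gamma < theta:        # Eq. (5): prune the whole cell
            continue
        for t_max, inverted in buckets:
            # Assumes o_n.t >= t_max, so |o_n.t - t_max| is the smallest gap.
            temporal = gamma * (1.0 - abs(o_n.t - t_max) / t_span)
            if spatial + temporal + beta < theta:  # Eq. (6): prune the bucket
                continue
            seen = set()                           # avoid re-evaluating an object
            for term in o_n.terms:
                for o_i in inverted.get(term, ()):
                    if o_i.oid in seen:
                        continue
                    seen.add(o_i.oid)
                    if exact_sim(o_n, o_i, w, dist_max, t_span) >= theta:
                        S.append((o_n.oid, o_i.oid))
    return S
```

Both bounds follow the same reasoning: each component similarity that has not yet been examined is replaced by its maximum value of 1, and the spatial distance is replaced by the smallest distance any object in the cell could have. As in the description above, only indexed objects that share at least one term with on are retrieved from the inverted file.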

5 Experimental study

This section presents our detailed experimental studies regarding the performance of our proposal.

5.1 Baseline

Our baseline approach is presented in Section 4.1. It is basically a brute force search algorithm. Each time a new geo-textual object arrives, we compute the similarity between the new object and every existing object. The baseline approach is named as Brute Force Matching Algorithm, which is abbreviated as BFM.

5.2 Datasets

We use two real-life geo-textual datasets in our experiments, namely FQ and TE. We proceed to introduce the two datasets respectively. FQ is a real-life dataset collected from Foursquare. The dataset consists of 2 million check-ins from all over the world. Each check-in contains a spatial coordinate with latitude and longitude, and a short text description. As for TE, it is also a real-life dataset. It is collected from Twitter. It consists of 10 million tweets with geographical locations. Likewise, each tweet contains a spatial coordinate and a short text (up to 140 characters).

5.3 Experimental settings

Our parameter settings are listed in Table 1. We use BFM to denote the baseline approach (the brute force matching algorithm) and TriFilter to denote our proposed method with both the HGI indexing structure and the group filtering techniques applied.

Table 1 Parameter settings

5.4 Experimental result

Our experimental results are presented as follows.

5.4.1 Effect of the number of object keywords

The first set of experiments evaluates the effect of the number of keywords in each geo-textual object. Figure 3 shows the performance results of BFM and TriFilter on the FQ and TE datasets, respectively. We see that the processing time of both methods increases as we increase the number of object keywords. This can be explained by the fact that when the number of object keywords increases, we need to visit more terms and postings. As such, it may take more time to retrieve the terms and postings.

Figure 3 Effect of the number of object keywords

In addition, we find that the runtime of BFM increases more sharply than that of TriFilter. The reason is that TriFilter may filter out some unqualified pairs at an early stage.

5.4.2 Effect of the number of grid cells

Figure 4 demonstrates the results on FQ and TE, respectively, when we vary the number of grid cells used by TriFilter. We see that when we increase the number of grid cells from 16 to 256, the object matching runtime improves. The reason is that with more grid cells in the HGI index, objects are more likely to be filtered out in groups. As such, more unqualified objects may be pruned at an early stage without being evaluated individually. However, when we further increase the number of grid cells from 256 to 1024, we see that the runtime increases significantly. The reason is that more grid cells, in turn, increase the number of similarity upper bound computations between a new object and the group of objects in a cell. Overall, TriFilter performs best when the number of grid cells is set to 256 on both FQ and TE.

Figure 4 Effect of the number of grid cells

5.4.3 Effect of spatial weight parameter

Figure 5 illustrates the performance comparison between BFM and TriFilter when we tune the spatial weight parameter α from 0.2 to 0.8. Note that when we tune the weight parameter α, the other two weight parameters β and γ are set to be equal. For example, when α is set to be 0.2, β and γ are both set to be \(\frac {1-0.2}{2}=0.4\). For TriFilter, we see that the time cost of object matching decreases when we increase the value of α. The reason is that when we increase the value of α, spatial proximity between two objects will be weighted more. As such, indexed objects are more likely to be pruned in the first phase. As for BFM, the runtime exhibits a relatively consistent trend as we vary the value of α.

Figure 5 Effect of spatial weight α

5.4.4 Effect of textual weight parameter

Figure 6 demonstrates the performance comparison between BFM and TriFilter as we vary the textual weight parameter β from 0.2 to 0.8. Note that when we tune the weight parameter β, the other two weight parameters α and γ are set to be equal. For instance, when β is set to be 0.6, α and γ are both set to be \(\frac {1-0.6}{2}=0.2\). For TriFilter, we see that the time cost of object matching increases when we increase the value of β. The reason is that when we increase the value of β, spatial proximity between two objects will be weighted less. As such, indexed objects are less likely to be pruned in the first phase, which may induce more checking and evaluation. As for BFM, the runtime exhibits a relatively consistent trend as we vary the value of β.

Figure 6 Effect of textual weight β

5.4.5 Effect of similarity threshold

In the last set of experiments, we evaluate the object matching performance when we vary the similarity threshold 𝜃. From Figure 7 we see that the runtimes of both methods exhibit a decreasing trend as we increase the similarity threshold 𝜃. This can be explained by the fact that when the similarity threshold becomes greater, more indexed objects may be pruned because more object pairs become unqualified. Further, we notice that the runtime of TriFilter decreases more noticeably than that of BFM. The reason is that TriFilter has more group filtering mechanisms, and a higher similarity threshold makes a group of objects more likely to be filtered out.

Figure 7 Effect of similarity threshold 𝜃

6 Related work

In this section, we review related studies on geo-textual object matching and location-based similarity join.

6.1 Matching of geo-textual objects

The problem of geo-textual object matching can be classified into two categories: geo-textual data search and similarity join of geo-textual data. The problem of geo-textual search requires users to define spatial keyword queries. Each spatial keyword query may have a spatial requirement and a textual requirement. The spatial requirement may be defined as a distance threshold, a spatial proximity score, or a spatial region. The textual requirement may be defined by a set of query keywords or a keyword-based Boolean predicate. The problem of spatial keyword query processing has been extensively studied in the existing literature [1,2,3,4,5,6,7,8,9]. In particular, some studies take the semantic meaning of geo-textual objects into consideration [10, 11]. Sequential geo-textual data query processing has also been investigated in the literature on trajectory data analytics [12]. A number of benchmark studies and surveys have been published over the past few years [13,14,15].

Recently, the problem of continuous spatial keyword search has been investigated as well. Existing studies define the problem of continuous spatial keyword search in a variety of fashions, including location-based publish/subscribe [16,17,18,19,20,21,22,23,24], continuous geo-textual filtering mechanisms [25], and moving geo-textual object processing [26].

Nonetheless, the aforementioned studies define the problem as query processing problems, which cannot be directly used to handle the problem of geo-textual similarity join.

6.2 Location based similarity join

Existing studies on the problem of location-based similarity join can be classified into the following categories. The first category is geo-textual object similarity join [27, 28], which concerns the join operation between two sets of geo-textual objects. The second category is trajectory similarity join [29,30,31,32,33], which concerns the join operation between two sets of trajectory data. The third category is trajectory to object similarity join [34], which concerns the join operation between a set of trajectories and a set of geo-located objects.

However, existing similarity join studies have the following limitations. First, they do not consider the popular scenario where the dataset is updated in a continuous manner. Second, they do not consider all three major aspects of geo-textual objects: spatial proximity, textual relevancy, and temporal gap. As a result, they are inapplicable to our CGTS-Join problem.

7 Conclusions

We consider the problem of Continuous Geo-Textual Similarity Join (CGTS-Join). To this end, we define a new geo-textual similarity metric to measure the similarity between two geo-textual objects by taking spatial proximity, textual relevancy, and temporal gap into consideration. We develop a Hybrid Grid Indexing Structure (HGI) to effectively organize massive-scale geo-textual objects. We also propose a Tri-Filtering framework to answer the CGTS-Join problem. The experimental results on two real-life datasets show that the Tri-Filtering framework achieves a runtime reduction of 60%-90% compared with the baseline.