1 Introduction

Large scale geo-crowdsourcing, or peer-production Volunteered Geographical Information (VGI) [11], such as OpenStreetMap (OSM) and Wikimapia, has created high potential for establishing reliable sources of geographical information. As a prominent example, OSM accelerates the generation of massive geospatial information from community users and currently contains more than 3.7 billion geographical objects. OSM is not only used by end users, but is also adopted by companies to support map applications, location recommendation, sports watches, real estate search engines, and many other geospatial services.

OSM aims to provide two types of information about geographical objects: 1) geographical boundaries such as points, lines, and regions, and 2) annotations or tags. A tag consists of a ‘Key’ and a ‘Value’ that describe the object. Example objects include building footprints, business places, and tourist attractions. The key gives a broad class of features (for example, building or amenity) while the value details the specific feature, for example, “building=retail” or “amenity=school”.

However, many of the objects in OSM, in particular places of interest, have limited annotations. The existing tags in OSM focus on describing general geographical attributes of the real world. They do not include more detailed information or user perspectives, such as the nicknames of a famous tourist attraction or the impressions of its visitors, nor time-sensitive data, such as temporary exhibitions of a museum or popular events hosted at a place.

On the other hand, much social media data is associated with geo-locations. A recent study shows that 1.0% of tweets are geotagged in some way, and 87% of geotagged tweets contain exact coordinates (longitude, latitude); this is a major increase from the 0.23% reported in a 2010 study. Such geo-tagged tweets, if combined, could provide rich information that can potentially be associated with other geospatial data sources. For example, the work in [25] uses geo-tagged tweets as external contextual data to annotate mobile users. One natural question is: can we use such geo-tagged social media to support semantic annotations for geographical objects such as churches, museums, and tourist attractions?

In this paper, we propose to enrich OSM objects with semantic annotations by integrating and analyzing geo-tagged social media data, in particular geo-tagged tweets. This complements OSM’s objective, description-oriented annotations with a broader range of annotations, and could thus significantly improve the value of OSM for geospatial services. Figure 1 illustrates the process of annotating OSM with a list of relevant words generated from geo-tagged tweets. For example, Tower Bridge in London is annotated with its general name (TowerBridge), a popular exhibition (GlassWalkWay or GlassFloor), and many other nearby places. We propose a comprehensive framework for ranking relevant annotations (popular exhibitions, place names, or place nicknames) above non-relevant words (such as names of nearby places) extracted from tweets for places in OSM. We formalize the problem as finding a ranking function that ranks relevant social signals (e.g., words in tweets) above non-relevant ones and measures the likelihood that an annotation candidate is relevant to a given geographic object. Unlike traditional information retrieval problems, a new spatial context is introduced into the problem. Thus, our approach captures both the relevance and the locality of annotation candidates given a targeted location. As described next, major challenges exist for this spatial semantic annotation problem.

Fig. 1: Examples of using geo-tagged tweets to annotate geographical objects in OSM

One immediate challenge is integrating spatial data at large scale, including both OSM data and tweets. Capturing local signals for a given location requires spatial data integration across all relevant geospatial objects. For example, we need to search the whole social media corpus to retrieve nearby tweets for a given tourist attraction. However, both VGI and social media platforms produce data at very large scales: OSM has more than 3.7 billion geographical objects, and the number keeps increasing on a daily basis. Moreover, spatial queries, which are essential to support spatial data integration, are highly compute-intensive due to their multi-dimensional nature.

To integrate massive spatial data, we take a MapReduce based approach that partitions the (heavily skewed) space into tiles and parallelizes spatial matching queries through MapReduce. This is especially effective for supporting the heavy-duty geometric computation involved in such queries.

Another challenge is that estimating spatial locality is difficult due to the diversity of geographical object representations. In OSM, many objects are represented with boundaries, for example polygons. However, many objects only have a simple point based representation, either because of limited information or because of the small extent of the objects. For example, less than half of the churches in OSM have boundaries.

We propose two alternative methods that handle the two types of spatial objects: frequency based methods for objects with a clear boundary, and a probability based method for objects with a point based representation. For the frequency based methods, we consider all tweets contained in the boundary of an object. For the probability based method, we estimate the probability that a nearby tweet contributes to the annotation as a function of the distance between the tweet and the object, using a Kernel Density Estimation (KDE) model.

Another major challenge is the noisy nature of social media data. Social media cover a broad range of topics and contain a large amount of informal language and trivial personal words from interpersonal chatting or news retweeting, which requires carefully tuned methods to extract meaningful semantic information.

We provide multiple approaches to reduce or remove the effect of noise. For the frequency based methods, we provide multiple ways to weigh the relevance of terms for objects, based on the document corpus, tweet collections, and user collections. For the probability based method with KDE, we provide an adaptive approach that minimizes the noise effect by tuning the kernel bandwidth inversely with the word density.

While it is difficult to provide ground truth for evaluating semantic annotations, we propose two alternative approaches to validate our work. We first validate the explicitly relevant annotations against place names, for which ground truth is available, and then validate our methods with case studies, in which the relevance of annotation words is evaluated manually.

In summary, our work has three major contributions. First, we study and formalize an important problem in geo-social media analytics: integrating social media data and VGI data to derive knowledge about geographical objects. Second, we propose a comprehensive framework for annotating OSM objects using geo-tagged tweets, including frequency based methods and a probability based method. Third, we evaluate our methods on a large corpus of geo-tagged tweets and representative geographic objects from OSM, which demonstrates promising results through ground-truth comparison and case studies.

2 Related work

Geospatial services provide location based information to consumers, businesses, and governments. This industry is growing dramatically with the wide availability of cost-effective location sensing devices such as smartphones and GPS receivers. Businesses can rely on geospatial services for improved operational efficiency, targeted marketing, and smarter decision making. Consumers can benefit from geospatial services for directions and for searching places of interest.

Geospatial data

Recent years have witnessed an explosion of geospatial data, which provides promising alternative data sources to support geospatial services. While commercial map platforms such as Google Maps and Here Maps provide APIs for retrieving points of interest, there are major restrictions on public use. Location-based social networks (LBSNs) such as Yelp and Foursquare provide constrained access to their place repositories, which are themselves limited. CityGrid and certain vertical recommendation sites such as TripAdvisor also contain business locations and related customer reviews or tips, but cover very limited types of objects.

Geospatial analysis with OSM

While the data consumption of OSM mainly comes from map rendering, geocoding, and smart routing, its analytical value has yet to be fully explored. Previous OSM data analysis work mainly focuses on measuring content bias [18] or on predictive analysis such as fine-grained population estimation [5]. In this work, we integrate OSM data with geo-tagged social media for semantic annotation. Recently, Wu et al. [25] use geo-tagged tweets to annotate Twitter users. Sengstock et al. [21] extract latent geographic features from Flickr tags for general geographic knowledge discovery. Coffey et al. [8] use probabilistic topic modelling with Twitter data for semantic enrichment of mobility data recorded as trip counts.

Geosocial networking

Previous studies that bring together social media users and geographic objects mainly rely on check-in data from location-based social networks (LBSNs). Karamshuk et al. [13] utilize user mobility and the popularity of places in LBSNs for the problem of optimal retail store placement. Li et al. [16] study the common characteristics of popular venues with check-ins from Foursquare. Georgiev et al. use LBSNs to analyze event patterns [9] and the impact of the Olympic Games on local retailers [10].

Geospatial analysis with social media

Previous studies have used geo-tagged social media to support data analytics for neighborhood characteristics [20], event detection [14], geolocation [12], or spatio-temporal data mining in particular application scenarios. Most prior work analyzes geo-tagged social media at granularities no finer than the street level [15]. For example, Quercia et al. [19] use Flickr and Foursquare to examine the safety of streets. Thomee et al. [23] uncover the colloquial boundaries of locally characterizing regions. In our work, we explore geo-tagged tweets with fine-grained geographic context and extract semantic annotations for individual places of interest.

3 Overview

3.1 Problem definition

Our goal is to use geo-tagged social media data to annotate geographical objects. We first define our problem as follows. Table 1 summarizes the notation used in the paper.

Table 1 Summary of notation

3.1.1 Geographic objects

Peer-production VGI platforms, such as OSM or Wikimapia, contain a large number of geographic objects. In our problem setting, we consider two common representations of geographic objects: 1) point based geographic objects and 2) boundary based geographic objects. Point based geographic objects are a set of points \(\mathbf {P} = \{p_{1}, p_{2}, ... , p_{N_{P}}\}\), where each object \(p_{i} = [id_{p_{i}}, l_{p_{i}}]\) is represented as a single point in space with an object ID \(id_{p_{i}}\) and a (latitude, longitude) location \(l_{p_{i}} = (x_{p_{i}}, y_{p_{i}})\). Boundary based geographic objects are a set of polygons \(\mathbf {B} = \{b_{1}, b_{2}, ... , b_{N_{B}}\}\), where each object \(b_{i} = [id_{b_{i}}, L_{b_{i}}]\) is represented with an object ID and a closed polygon, i.e., an ordered list of points that delineates the boundary of the object in space. The boundary \(L_{b_{i}} = \{l_{1}, l_{2}, ... , l_{N}\}\) consists of latitude and longitude based points \(l_{k} = (x_{l_{k}}, y_{l_{k}})\).

3.1.2 Geo-tagged social media

Geo-tagged social media signals can be represented as a set of documents \(\mathbf {D} = \{d_{1}, d_{2}, ... , d_{N_{D}}\}\). Each document \(d_{j}\), for \(j = 1, 2, ... , N_{D}\), consists of a tuple \(<id_{d_{j}}, id_{u}, l_{d_{j}}, W_{d_{j}}>\), where \(id_{d_{j}}\) and \(id_{u}\) denote the ID of the document and the ID of the user who generated this content. The document location \(l_{d_{j}}\) is a single point in space represented by a latitude and longitude based position \((x_{d_{j}}, y_{d_{j}})\). \(\mathbf {W_{d_{j}}} = \{w_{1}, w_{2}, ... , w_{N_{W}}\}\) is the set of features extracted from the document \(d_{j}\). While social media provide a wide range of signals, such as images, videos, and their associated tags or metadata, our study focuses on unigrams from tweet content.
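To make these definitions concrete, the following is a minimal Python sketch of the three entity types defined above. The field names are illustrative and hypothetical, not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

LatLon = Tuple[float, float]  # (x, y) = (latitude, longitude)

@dataclass
class PointObject:            # p_i in P
    obj_id: str
    location: LatLon          # l_{p_i}

@dataclass
class BoundaryObject:         # b_i in B
    obj_id: str
    boundary: List[LatLon]    # closed polygon L_{b_i} = [l_1, ..., l_N]

@dataclass
class GeoDocument:            # d_j in D
    doc_id: str
    user_id: str
    location: LatLon          # l_{d_j}
    words: List[str]          # unigram features W_{d_j}
```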

3.1.3 The semantic annotation problem

Given a collection of geographic objects \(\mathbf {P}\) or \(\mathbf {B}\) and a spatially-sensitive social media corpus \(\mathbf {D}\), our goal is to integrate the geographic information in \(\mathbf {P}\) and \(\mathbf {B}\) with the social contextual information in \(\mathbf {D}\). With the integrated geospatial data, this work focuses on extracting semantic annotations from social signals w.r.t. fine-grained geographic objects, in the form of either a point in \(\mathbf {P}\) or a polygon in \(\mathbf {B}\). For example, restaurants and coffee shops are typically represented as points, while building footprints, parking lots, or pitches are represented as polygons. Churches, which come in a variety of scales, have no dominant form of spatial representation. The semantic annotations for a targeted geographic object are a set of relevant words \(\mathbf {A} = \{(w_{1}, s_{1}), (w_{2}, s_{2}), ... , (w_{N_{A}}, s_{N_{A}})\}\), where \(s_{i}\) (\(i = 1, 2, ..., N_{A}\)) is a score that measures the relevance of \(w_{i}\) w.r.t. the geographic object.

The semantic annotation problem can then be defined as finding a ranking function \(f(p_{i}, w_{j})\) (or \(f(b_{i}, w_{j})\)) for a word \(w_{j} \in V_{D}\) w.r.t. a given geographic object \(p_{i}\) or \(b_{i}\), where \(\mathbf {V_{D}} = \{w_{1}, w_{2}, ... , w_{N}\}\) is the vocabulary of all annotation candidates generated from the social media corpus \(\mathbf {D}\). Analogous to a typical information retrieval task, our goal is to provide a ranking function that ranks relevant annotation keywords above non-relevant ones.

3.2 Overview of methods

The key challenge of the spatial semantic annotation problem is how to measure the likelihood that a word \(w_{j}\) is relevant to a given geographic object. A unique constraint of our problem is that both the annotation candidates and the annotation targets possess a spatial context, whereas a traditional information retrieval problem ranks documents relevant to a query. Thus, our goal is to propose an applicable model that captures both the relevance and the locality of annotation words w.r.t. the targeted locations of geographic objects.

3.2.1 Spatial data integration

The spatial relevance of tweets is largely determined by their proximity to the geographical objects and by the extent and representation of those objects. For a boundary based geographic object, tweets contained in the boundary will intuitively have a higher “signal to noise ratio” than those outside the boundary. Similarly, for point based geographic objects, nearby tweets within a given distance should have higher relevance than those outside it, and the relevance may decrease with distance.

To capture the spatial locality of words, we propose to use spatial queries to cross-match tweets with geographical objects and to filter tweets based on spatial proximity: only tweets close to the geographical objects are used. Due to the massive volume of geospatial objects in OSM, the vast number of tweets, and the high computational complexity associated with spatial queries (for example, containment), such spatial queries are very expensive.

To support scalable spatial queries, we first perform skew-aware space partitioning to generate balanced tiles, and then run spatial queries for each tile in parallel through MapReduce by invoking an on-demand spatial query engine. We then normalize the query results for objects that cross tile boundaries. We extend our previous work Hadoop-GIS [1, 4, 24] to support the queries needed for this data integration. The two query scenarios are illustrated in Fig. 2.

Fig. 2: Two types of data integration between geographic objects and geo-tagged social media: (1) range search and (2) point-in-polygon

3.2.2 Frequency based semantic annotation

Once the nearby social signals are aggregated for each object, we propose two alternative families of ranking functions that rank relevant words high among the refined annotation candidates: frequency based methods and a probability based method.

Boundary represented geospatial objects normally have a larger extent than point based objects. Intuitively, a term that occurs frequently within a place is likely to be a relevant annotation, so we can count the occurrences of nearby words using the term frequency (TF) w.r.t. the targeted location.

To reduce noisy terms, we improve the frequency based method by smoothing it with a weighting factor, the inverse document frequency (IDF). IDF measures how much information a word provides by checking whether the word is common or rare across all documents. Even though tweets have limited length, IDF still gives smaller weights to very commonly occurring words. Since multiple occurrences of a term from distinct tweets or users tend to contribute more than those from a single tweet, we further propose collective tweet weighting and collective user weighting.

3.2.3 Probability based semantic annotation

For point based object representations, one issue is that aggregating nearby words requires a distance threshold. Choosing such a threshold is challenging, as different place categories may have neighborhoods of different scales. An inappropriate threshold would admit high frequency words from irrelevant tweets.

To address these limitations, we propose a probability based method that models relevance as a function of distance. We adopt Kernel Density Estimation (KDE) based methods for the ranking problem. KDE has previously been used for modeling human location [17] and for generating semantic annotations for mobility data [25]. This work focuses on modeling geo-tagged words with KDE for annotating point based geographic objects. Unlike frequency based methods, KDE models the spatial density of the word occurrences and weights words differently according to their distances. The estimated spatial density is controlled by a bandwidth parameter h, which we can analyze and set with respect to different types of annotation words or different place categories.

4 Spatial data integration

Integrating tweets with OSM requires two types of spatial queries, as shown in Fig. 2: 1) containment based queries, or point-in-polygon queries: for each boundary based geospatial object, find all tweets contained in the boundary; and 2) range searches: for each point based geospatial object, find all nearby tweets within distance d. The latter can be performed by generating a buffered circle with radius d and running a containment query. We extend our previous work Hadoop-GIS, a MapReduce based spatial query system, to support these queries.
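As an illustration of the two query types (a sketch, not the Hadoop-GIS implementation), the following uses the Shapely library and assumes tweet objects carry a hypothetical `location` attribute holding an (x, y) tuple. The range search is reduced to a containment test against a buffered circle, exactly as described above. These are naive linear scans; the actual system accelerates such queries with R-trees and MapReduce.

```python
from shapely.geometry import Point, Polygon

def range_search(center, tweets, d):
    # Range search for a point based object: build a buffered circle of
    # radius d around the object and run a containment query against it.
    circle = Point(center).buffer(d)
    return [t for t in tweets if circle.contains(Point(t.location))]

def point_in_polygon(boundary, tweets):
    # Containment query for a boundary based object: keep the tweets
    # whose coordinates fall inside the object's polygon.
    poly = Polygon(boundary)
    return [t for t in tweets if poly.contains(Point(t.location))]
```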

We propose to provide spatial data integration through MapReduce based spatial queries at large scale. MapReduce based systems have emerged as a scalable and cost-effective solution for massively parallel data processing. However, most of these systems either lack spatial query processing capabilities or have limited spatial query support. While the MapReduce model fits large scale problems nicely through key-based partitioning, spatial queries and analytics are intrinsically complex and difficult to fit into the model due to their multi-dimensional nature [3].

To support large scale spatial queries on these datasets, the following steps are performed: spatial partitioning; tile based spatial query processing with MapReduce; and result normalization or duplicate removal for boundary-crossing objects. The overall workflow is shown in Fig. 3.

Fig. 3: The workflow of MapReduce based spatial data integration of geo-tagged tweets and OSM objects

The space of OSM objects is first partitioned into balanced tiles [2, 24] using the Sort-Tile-Recursive (STR) algorithm, which orders and packs spatial objects for bulk loading when generating an R-Tree over all OSM objects. Note that minimal bounding rectangles (MBRs) are first computed for the OSM objects. The MBRs of the parent nodes of the leaves then become natural partition boundaries.

Once the partition boundaries of the OSM data are generated, MapReduce is used to match tweets to the OSM objects that contain them. First, for each partition represented by an MBR, all OSM objects and tweets falling in the partition are identified through a map function by comparing boundaries. A reduce function then matches tweets to containing OSM objects for each tile. For each tile, an R*-Tree for tweets and an R*-Tree for OSM objects are built in memory on the fly. Based on these two R*-Trees, a spatial join algorithm finds the containment relationships between OSM objects and tweets by traversing the R*-Trees [7]. Following this, an exact geometric containment test checks whether a tweet is contained not only in the MBR of an OSM object but also in the polygon of the OSM object.
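A per-tile join in the reduce step could look like the following sketch, assuming Shapely 2.x, where `STRtree.query` returns integer indices into the indexed geometries. For brevity it indexes only the tweets with a single STR-packed tree rather than building two R*-Trees as in the actual system; the index-level filter followed by the exact polygon containment test mirrors the two-phase check described above.

```python
from shapely.geometry import Point, Polygon
from shapely.strtree import STRtree

def join_tile(osm_polygons, tweets):
    """Per-tile spatial join (sketch). osm_polygons: list of
    (obj_id, Polygon); tweets: list of (tweet_id, (x, y))."""
    points = [Point(loc) for _, loc in tweets]
    tree = STRtree(points)                 # bulk-load index over tweets
    matches = []
    for obj_id, poly in osm_polygons:
        # Coarse filter via the tree, then exact containment test.
        for idx in tree.query(poly):
            if poly.contains(points[idx]):
                matches.append((obj_id, tweets[idx][0]))
    return matches
```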

After all the matching is done, another MapReduce job identifies duplicated results. Because an object on the boundary of tiles is assigned to multiple tiles during partitioning, duplicated results can occur; a sort is performed to remove them.

Note that the overhead of on-demand indexing is a very small fraction of the overall cost, while it significantly reduces the search space and yields very efficient queries. The geometric computation of containment relationships is heavy duty and takes a large portion of the total time, but it is effectively parallelized through MapReduce.

5 Frequency based semantic annotation

We start with a simple term frequency (TF) based approach that evaluates term relevance based on frequency of occurrence, and then refine it with document corpus based weighting (TF-IDF) to reduce the weights of terms that occur across many documents. As multiple occurrences of a term from distinct tweets contribute more than those from a single tweet, we further propose collective tweet weighting (TF-per-tweet-IDF). Last, since distinct users tend to provide more independent opinions, we introduce collective user weighting (TF-per-user-IDF).

5.1 Term frequency based weighting

Given a geographic object, one intuitive semantic annotation method is to rank the nearby social media signals according to their frequencies of occurrence, i.e., term frequency (TF). Formally, given a geographic boundary \(b_{i}\) or a geographic point \(p_{i}\), a containment based query \(contain(b_{i}, D)\) or a range-within-distance query \(range(p_{i}, D, \delta)\) aggregates all social media documents located within the boundary of \(b_{i}\) as \(D_{b_{i}}\), or all documents within distance \(\delta\) of \(p_{i}\) as \(D_{p_{i}}\). The TF based ranking functions \(TF(b_{i}, w_{j})\) and \(TF(p_{i}, w_{j}, \delta)\) then measure the relevance of a word \(w_{j}\) as in Eq. 1, where \(W_{b_{i}}\) and \(W_{p_{i}}\) denote the sets of features extracted from \(D_{b_{i}}\) and \(D_{p_{i}}\), respectively.

$$\begin{array}{@{}rcl@{}} TF(b_{i}, w_{j}) &=& |\{w_{j} \in W_{b_{i}}: l_{w_{j}} \in L_{b_{i}}\}|\\ TF(p_{i}, w_{j}, \delta) &=& |\{w_{j} \in W_{p_{i}}: dist(l_{p_{i}}, l_{w_{j}}) < \delta \}|, \end{array} $$
(1)

The TF based ranking function does not filter out common words, stop words, or expression words, such as “im”, “start”, and “time”, which overwhelm important terms with richer semantics. To filter such non-relevant words and boost the ranking of more important words, we use term frequency-inverse document frequency (TF-IDF) to smooth the direct term frequencies.

5.2 Document corpus based weighting

Given a large collection of documents, TF-IDF is often used to represent the relative importance or uniqueness of a term to a specific document. Intuitively, TF-IDF gives a low weight to a word that is frequent in one document but also appears across many other documents. In our application scenario, individual tweets are short, but the corpus is extremely large in its total number of documents, which provides a rich data source for smoothing the term frequencies.

Given a geo-tagged social media corpus D and a geographic object \(b_{i}\) or \(p_{i}\), the TF-IDF based ranking functions \(TFIDF(b_{i}, w_{j}, D)\) and \(TFIDF(p_{i}, w_{j}, \delta, D)\) measure the relevance of a word \(w_{j}\) as in Eq. 2, where \(W_{b_{i}}\) and \(W_{p_{i}}\) denote the sets of features extracted from \(D_{b_{i}}\) and \(D_{p_{i}}\), respectively.

$$\begin{array}{@{}rcl@{}} TFIDF(b_{i}, w_{j}, D) &=& TF(b_{i}, w_{j}) * IDF(D, w_{j})\\ TFIDF(p_{i}, w_{j}, \delta, D) &=& TF(p_{i}, w_{j}, \delta) * IDF(D, w_{j})\\ IDF(D, w_{j})&=&\log \left[\frac{N_{D}}{1 + |\{d_{k} \in D: w_{j} \in W_{d_{k}}\}|}\right], \end{array} $$
(2)
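The following sketch computes the TF-IDF ranking of Eq. 2 for the documents aggregated around one object. The helper names are hypothetical; it assumes documents expose a `words` list (as in the earlier data-model sketch) and that document frequencies over the whole corpus D have been precomputed.

```python
import math
from collections import Counter

def idf(word, corpus_doc_freq, n_docs):
    # IDF over the full corpus D (Eq. 2); corpus_doc_freq maps a word
    # to the number of documents in D that contain it.
    return math.log(n_docs / (1 + corpus_doc_freq.get(word, 0)))

def rank_tfidf(nearby_docs, corpus_doc_freq, n_docs, top_k=10):
    # Rank annotation candidates for one object from its aggregated
    # nearby documents (D_{b_i} or D_{p_i}).
    tf = Counter(w for doc in nearby_docs for w in doc.words)
    scores = {w: f * idf(w, corpus_doc_freq, n_docs) for w, f in tf.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```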

5.3 Collective tweet weighting

The collective signals from the overall social media context are effectively utilized to smooth term frequencies through inverse document frequency weighting. Our spatial data integration framework, on the other hand, generates a local context through the aggregated nearby documents, which contains additional knowledge about the relevance of a term w.r.t. the targeted objects. For example, a term mentioned in multiple tweets in the local context should imply higher relevance than a term whose multiple occurrences come from a single tweet.

We propose collective tweet weighting (TF-per-tweet-IDF) to smooth the direct term frequency by counting term occurrences per tweet, i.e., multiple mentions in a single tweet count only once. Formally, the collective weighting method is defined in Eq. 3, where \(D_{b_{i}}\) is the set of aggregated documents located within the boundary of \(b_{i}\), and \(D_{p_{i}}\) is the set of aggregated documents within distance \(\delta\) of \(p_{i}\).

$$\begin{array}{@{}rcl@{}} TF_{tweet}(b_{i}, w_{j}) &=& |\{d_{k} \in D_{b_{i}}: w_{j} \in W_{d_{k}}\}|\\ TF_{tweet}(p_{i}, w_{j}, \delta) &=& |\{d_{k} \in D_{p_{i}}: w_{j} \in W_{d_{k}}\}|\\ TF_{tweet}IDF(b_{i}, w_{j}, D) &=& TF_{tweet}(b_{i}, w_{j}) * IDF(D, w_{j})\\ TF_{tweet}IDF(p_{i}, w_{j}, \delta, D) &=& TF_{tweet}(p_{i}, w_{j}, \delta) * IDF(D, w_{j}), \end{array} $$
(3)

5.4 Collective user weighting

Additional knowledge available within the social media platform is the author information. Terms from the same user tend to be similar, while different users tend to generate more independent content. By identifying the original source of each term, we can distinguish terms coming from a single user from terms coming from diverse users.

We propose collective user weighting (TF-per-user-IDF): multiple occurrences of a term from the same user are counted only once, so the frequency of a term is the count of distinct users who mention it. Formally, it is defined in Eq. 4, where \(U_{b_{i}}\) and \(U_{p_{i}}\) are the sets of users who generated the social media documents in \(D_{b_{i}}\) and \(D_{p_{i}}\), respectively, and \(W_{u_{k}}\) denotes the set of features extracted from the documents of user \(u_{k} \in U_{b_{i}}\) or \(u_{k} \in U_{p_{i}}\).

$$\begin{array}{@{}rcl@{}} TF_{user}(b_{i}, w_{j}) &=& |\{u_{k} \in U_{b_{i}}: w_{j} \in W_{u_{k}}\}|\\ TF_{user}(p_{i}, w_{j}, \delta) &=& |\{u_{k} \in U_{p_{i}}: w_{j} \in W_{u_{k}}\}|\\ TF_{user}IDF(b_{i}, w_{j}, D) &=& TF_{user}(b_{i}, w_{j}) * IDF(D, w_{j})\\ TF_{user}IDF(p_{i}, w_{j}, \delta, D) &=& TF_{user}(p_{i}, w_{j}, \delta) * IDF(D, w_{j}), \end{array} $$
(4)
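The three counting schemes differ only in the unit of counting: occurrence, tweet, or user. A minimal sketch, assuming the same hypothetical document objects as before with `words` and `user_id` attributes:

```python
from collections import Counter

def tf_raw(nearby_docs):
    # Plain TF: every occurrence counts (Eq. 1).
    return Counter(w for d in nearby_docs for w in d.words)

def tf_per_tweet(nearby_docs):
    # Collective tweet weighting: a word counts once per tweet (Eq. 3).
    return Counter(w for d in nearby_docs for w in set(d.words))

def tf_per_user(nearby_docs):
    # Collective user weighting: a word counts once per user (Eq. 4).
    user_words = {}
    for d in nearby_docs:
        user_words.setdefault(d.user_id, set()).update(d.words)
    return Counter(w for words in user_words.values() for w in words)
```

Each count is then multiplied by the same corpus-level IDF weight of Eq. 2.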

Discussion

The frequency based methods with weighting assume that the spatial relevance of a tweet to an OSM object is certain, i.e., that a tweet is clearly contained in the object boundary. Thus, such methods work better for large objects with boundary based representations.

6 Probability based semantic annotation

For point based object representations, a circle based approximate buffer is created for spatial matching in order to associate the spatial relevance of a tweet with an OSM object. Choosing the right threshold for the buffer is challenging, as each type of place may have a very different scale of neighborhood. For example, a coffee shop has a much smaller extent than a church, and a popular landmark such as a tourist attraction near a coffee shop may generate many tweets that are irrelevant to the shop. A frequency based method no longer works here, as it treats all nearby words with the same spatial relevance regardless of their distance.

For objects with only point based representations, or objects with very small extents, the spatial relevance depends on the distance between the tweet and the object. We propose a probability based method that models the probability of the relevance of a word to a geospatial object as a function of the distance. Kernel Density Estimation (KDE) is a non-parametric method for estimating a density function from a random sample of data. Prior work has utilized KDE for modeling the spatial density of word occurrences, individual mobility data [25], and check-ins from LBSNs. Our work investigates the KDE model for annotating geographic points with the spatial probability of word occurrences.

6.1 Kernel density estimation

As mentioned earlier, applying the frequency based method to objects without sufficient spatial extent leads to a data sparsity problem and introduces more noise from nearby landmarks. The essence of the KDE based model is to estimate a spatial density from word occurrences; the counts of word occurrences are then smoothed over the continuous space by this density.

Formally, let \(\mathbf {L^{w_{j}}} = \left \{l_{1}^{w_{j}}, l_{2}^{w_{j}}, ... , l_{N}^{w_{j}}\right \}\) denote all occurrences of a word \(w_{j} \in V_{D}\), where \(V_{D}\) is the vocabulary of a geo-tagged social media corpus D. Given a two-dimensional Gaussian kernel function \(G\) and a fixed bandwidth \(h\), we propose a ranking function (KDE-fixed) for the word \(w_{j}\) w.r.t. a geographic point \(p_{i}\) as described in Eq. 5, where \(C_{h}\) is a \(2 \times 2\) covariance matrix.

$$\begin{array}{@{}rcl@{}} KDE_{fixed}(p_{i}, L^{w_{j}}, G, h) &=& \frac{1}{|L^{w_{j}}|}\sum\limits_{k = 1}^{N^{w_{j}}}G_{h}\left( l_{k}^{w_{j}}, l_{p_{i}}\right)\\ G_{h}\left(l_{k}^{w_{j}}, l_{p_{i}}\right) &=& \frac{1}{2\pi h} \exp \left[-\frac{1}{2} \left( l_{k}^{w_{j}} - l_{p_{i}}\right)^{T} \mathbf{C_{h}^{-1}}\left( l_{k}^{w_{j}} - l_{p_{i}}\right) \right]\\ C_{h} &=& \left[\begin{array}{ll} h & 0 \\ 0 & h \end{array}\right], \end{array} $$
(5)
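A direct NumPy transcription of Eq. 5 follows (a sketch, not the paper's code). It treats coordinates as planar, a reasonable approximation at the sub-kilometer scales considered here.

```python
import numpy as np

def kde_fixed(target, occurrences, h):
    # KDE-fixed score of a word w.r.t. a target point (Eq. 5).
    # target: (x, y); occurrences: array of shape (N, 2), one row per
    # word occurrence; h: fixed bandwidth (diagonal covariance h * I).
    diffs = np.asarray(occurrences) - np.asarray(target)  # l_k - l_{p_i}
    sq_dists = np.sum(diffs ** 2, axis=1)
    # Isotropic 2-D Gaussian kernel with covariance C_h = h * I.
    kernels = np.exp(-0.5 * sq_dists / h) / (2 * np.pi * h)
    return kernels.mean()                 # (1 / |L^{w_j}|) * sum of kernels
```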

Similar in spirit to collective tweet weighting, the KDE-fixed method counts word occurrences per tweet, i.e., multiple mentions of a word in a single tweet count only once. To extend the KDE-fixed method with collective user weighting, we propose an alternative method (KDE-fixed-per-user): multiple occurrences of a word from the same user are counted only once, and the centroid of these occurrences represents the location of the word.

With the KDE-fixed method, each word occurrence contributes to the overall ranking score according to its distance to the targeted point. This provides a more accurate estimation of relevance and removes the need for a boundary that encompasses nearby words. Previous work [17, 22] suggests that the choice of the bandwidth value h determines the shape of the resulting spatial density: a smaller h produces a distribution sharply peaked around the locations of word occurrences, while an inappropriately large bandwidth generates an oversmoothed estimation. In the experiments, we adjust h over a range of values for our datasets with different types of objects.

6.2 KDE with adaptive bandwidth

The above KDE-fixed method requires tuning the bandwidth, which is time consuming. Besides, the smoothing is homogeneous for all words regardless of differences in their spatial densities. For example, occurrences of the name of an iconic city landmark tend to accumulate near the landmark's address; the bandwidth in such a situation should clearly differ from that in a sparsely populated area.

In order to prevent either overfitting or oversmoothing, we take an adaptive approach (KDE-adaptive) in which the bandwidth of the KDE based ranking function is set adaptively. Given a term \(w_{j} \in V_{D}\), a customized bandwidth is generated according to the occurrence locations \(L^{w_{j}}\). Inspired by Breiman et al. [6], we set the bandwidth \(h_{j}\) to the distance between the targeted geographic point \(p_{i}\) and its k-th nearest neighbor in \(L^{w_{j}}\). The formal definition of KDE-adaptive is given in Eq. 6, where \(h_{j}\) is the Euclidean distance from \(l_{p_{i}}\) to its k-th nearest neighbor.

$$\begin{array}{@{}rcl@{}} KDE_{adaptive}(p_{i}, L^{w_{j}}, G) &=& \frac{1}{|L^{w_{j}}|}\sum\limits_{n = 1}^{N^{w_{j}}}G_{h_{j}}\left(l_{n}^{w_{j}}, l_{p_{i}}\right)\\ C_{h_{j}} &=& \left[\begin{array}{ll} h_{j} & 0 \\ 0 & h_{j} \end{array}\right], \end{array} $$
(6)
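Eq. 6 differs from Eq. 5 only in how the bandwidth is chosen. A sketch of the adaptive variant follows (hypothetical function; the guard against a zero bandwidth is our addition for numerical safety, not part of the paper's formulation):

```python
import numpy as np

def kde_adaptive(target, occurrences, k=2):
    # KDE-adaptive score (Eq. 6). The per-word bandwidth h_j is the
    # Euclidean distance from the target point to its k-th nearest
    # word occurrence, so sparse words get a larger, smoother bandwidth.
    occ = np.asarray(occurrences)
    dists = np.linalg.norm(occ - np.asarray(target), axis=1)
    h_j = np.sort(dists)[min(k - 1, len(dists) - 1)]  # k-th NN distance
    h_j = max(h_j, 1e-9)                              # avoid h_j = 0
    kernels = np.exp(-0.5 * dists ** 2 / h_j) / (2 * np.pi * h_j)
    return kernels.mean()
```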

As with the KDE-fixed method, multiple mentions of a word in a single tweet count only once. For the alternative KDE-adaptive method with collective user weighting (KDE-adaptive-per-user), multiple occurrences of a word from the same user count once, with the centroid point used as the location.

In our problem setting, noisy signals such as stop words, expression words, or spam accumulated over time could overwhelm the spatial semantics of interest. By setting the bandwidth according to the k-th nearest neighbor, the adaptive kernel approach tunes the bandwidth inversely with the word density: for a word with low density, the distance from a given object to its k-th nearest occurrence is larger than for a densely occurring word, which results in a larger bandwidth that adapts to the sparseness of the data. In our experiments, we evaluate different choices of k for the datasets.

7 Experimental evaluation

We evaluate the performance of the frequency based methods and the probability based methods in annotating multiple types of geospatial objects extracted from the UK with tweets. We also compare frequency based and probability based methods for annotating point based geographical objects. We provide both ground-truth based comparison and case studies with manual evaluation.

7.1 Datasets

We downloaded the entire set of OSM data from Planet OSM and filtered it to generate a collection of representative places from the UK. The places of interest were selected according to the tag information in the OSM data, for example railway stations, sports centres, tourism attractions, tourism museums, historic sites, cinemas and theatres, and places of worship (i.e., churches). The geo-tagged tweet corpus was collected for the period between Nov 1, 2014 and Sep 9, 2015 and contains 343,779,205 geo-tagged tweets in total. For simplicity, only English words from tweet contents are considered as annotation candidates in the experiments. The overall statistics of the datasets are summarized in Table 2.

Table 2 Description of the datasets

7.2 Experimental settings

Name detection experiment

We designed a name detection experiment to assess our proposed semantic annotation methods. While it is difficult to provide ground truth for evaluating semantic annotations in general, one special OSM tag, the name of a place, is provided by most OSM objects and can serve as ground truth. We built a ground truth dataset by extracting the subset of places whose name tags are contained in the OSM data and appear in nearby tweet contents.

Ground truth dataset generation

In detail, to build the ground truth dataset, our spatial data integration framework first cross-matches the whole geo-tagged tweet corpus with the geographical objects in OSM. For boundary based objects, the integrated corpus includes all boundaries that contain at least one tweet. For point based objects, the integrated corpus includes all points with at least one tweet detected within their buffered circle ranges. In our experiments, the radius of the buffered circle is set to 0.002 decimal degrees (up to roughly 250 meters). We then restrict the ground truth corpus to places whose names appear in the nearby tweet contents. A set of representative place categories is used in the following experiments (Table 3).

Table 3 Statistics of ground truth datasets

Evaluation metrics

Given a geographic object, the semantic annotation result is a list of relevant words sorted by the ranking scores from one of the semantic annotation methods described in Sections 5 and 6. We check whether the top K words with the highest ranking scores contain the place name. Given a collection of boundary based or point based objects in the ground truth, the name detection accuracy is the percentage of places whose names are contained in their top K annotations.
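A sketch of this metric follows. The interfaces are hypothetical: each place object carries its OSM `name` tag, and `rank_fn` is any of the ranking functions from Sections 5 and 6, returning (word, score) pairs sorted by descending score. It assumes unigram names and lowercased tokens, matching the unigram experiments.

```python
def name_detection_accuracy(places, rank_fn, k=10):
    # Fraction of places whose ground-truth name appears among the
    # top-k annotation words produced by rank_fn.
    hits = 0
    for place in places:
        top_words = [w for w, _ in rank_fn(place)[:k]]
        if place.name.lower() in top_words:
            hits += 1
    return hits / len(places)
```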

7.3 Evaluation of frequency based methods

We evaluate the frequency based methods for boundary based objects with place name detection. Figure 4a illustrates the performance of the TF and TF-IDF methods on all seven types of boundary based objects in the ground truth datasets. TF-IDF clearly outperforms TF in name detection accuracy. This result indicates that collective signals from the overall social media context can effectively smooth the direct term frequency through IDF weighting.

Fig. 4: (a) Name detection accuracy of TF and TF-IDF methods for top K results; (b-d) name detection accuracy of the TF-IDF method with boundary objects grouped by user count, tweet count, and word count, respectively

In reality, some areas are more densely populated than others. We therefore examine the performance of the TF-IDF method on places with different popularity, grouping all boundary based objects by their contained user counts, tweet counts, and word counts, respectively. As shown in Fig. 4b-d, places that contain more signals tend to have higher name detection accuracy, no matter how the places are grouped. On closer examination, however, we find that grouping by user count shows a stronger effect than grouping by word count or tweet count. This implies that the presence of more users supplies richer information for annotations.

Based on this observation, we designed the two variants of the TF-IDF method discussed in Section 5, TF-per-tweet-IDF and TF-per-user-IDF, which incorporate local signals from the aggregated nearby documents. We then use the four frequency based methods to annotate the names of places containing tweets from at least 30 users. The results in Fig. 5a demonstrate the effectiveness of user information for enhancing annotations, and indicate that a larger number of distinct users mentioning the same keyword provides stronger evidence of the keyword's relevance to the corresponding place.

Fig. 5: (a) Name detection accuracy of frequency based methods; (b) name detection accuracy of TF-IDF and TF-per-user-IDF for different place categories

We also evaluate the performance for different place categories. Figure 5b shows the name detection accuracy using the top 10 annotations for the different place categories in our ground truth dataset. TF-per-user-IDF consistently outperforms TF-IDF across almost all categories; the only exception is railway stations, where TF-IDF performs better. The overall accuracy for railway stations is also much higher than for other categories, which implies that geo-tagged tweets from stations contain more spatially dependent information and less noise. On the other hand, the accuracy for shops is very low, which implies that their nearby social context has a much lower “signal to noise ratio”.

7.4 Evaluation of probability based methods

We evaluate the probability based methods for annotating geographical points. We first compare the different KDE based methods for annotating point objects. Figure 6a shows the name detection accuracy over all point objects combined, and Fig. 6b shows the accuracy with the top 10 annotations for the different types of places in our datasets. We experiment with different parameters and compare the highest accuracies for both the KDE-fixed based methods (with bandwidth h set to 0.0001 decimal degrees) and the KDE-adaptive based methods (with the number of neighbors set to 2). The adaptive bandwidth methods clearly outperform the fixed bandwidth methods.

Fig. 6: (a) Name detection accuracy of probability based methods; (b) name detection accuracy of KDE-fixed-with-weighting and KDE-adaptive for different place categories

The probability based methods rely on the bandwidth parameter h for estimating the word density distribution. To prevent either overfitting or oversmoothing, smaller bandwidth values should be assigned to denser words and larger ones to sparse words. To better understand the influence of the bandwidth, we study its effect on accuracy. Figure 7 illustrates the accuracy trend with varying bandwidth for detecting the names of churches. We experiment on churches because this category (in our dataset) includes both tourist hotspots with many dense words and local churches with only sparse words. We observe that KDE-fixed reaches its highest accuracy with h between 0.0005 and 0.01 decimal degrees (20 meters to 1 kilometer). For KDE-adaptive, we find that the accuracy decreases when h is larger than the distance between the place and its 10th nearest neighbor.

Fig. 7: (a) Name detection accuracy with increasing bandwidth h for KDE-fixed; (b) name detection accuracy for KDE-adaptive with h set to the distance between the place and its k-th nearest geo-tagged tweet

7.5 Frequency based methods vs probability based methods

We compare the frequency based methods and the probability based methods for detecting the names of point based objects. To support point based objects with a frequency based method, an approximate buffer with a distance threshold \(\delta\) is used to identify nearby tweets. The results in Fig. 8a show that, as the range search threshold \(\delta\) decreases, the number of places whose names are detected from nearby tweets also decreases. This suggests that a smaller distance threshold \(\delta\) leads to a loss of relevant information.

Fig. 8: (a) Number of places with names detected from nearby tweets with varying range search threshold δ; (b) name detection accuracy of different semantic annotation methods for point based objects

We compare the performance of the frequency based methods versus the probability based methods for identifying the names of churches. As shown in Fig. 8b, both KDE-fixed (with h set to 0.01 decimal degrees) and KDE-adaptive (with the number of neighbors set to 3) outperform all frequency based methods.

7.6 Case studies

We also perform two case studies to evaluate the annotation results with human interpretation: one for a boundary based object (Imperial War Museum North) and the other for a point based object (Tower Bridge). We classify semantic annotation results into three categories: explicitly relevant, implicitly relevant, and non-relevant. Explicitly relevant annotations concern major characteristics of an object, for example the name and theme of a museum. Implicitly relevant annotations concern derived or minor information, for example a collection in a museum. The case studies only generate semantic annotations from unigrams. However, bigrams and trigrams may carry more semantic information, so we perform additional experiments to compare unigrams, bigrams, and trigrams, as shown in Fig. 9.

Fig. 9: (a-c) Name detection accuracy of six frequency based methods for unigrams, bigrams, and trigrams; (d) name detection accuracy of the TF-IDF method with boundary objects grouped by user count, tweet count, and word count, respectively

7.6.1 Boundary object: imperial war museum North

For Imperial War Museum North, we compare the top 20 annotations from the four frequency based methods. The existing tags in OSM (Fig. 10a) mainly contain the name and place category. Example explicitly relevant annotations include “war”, “museum”, “imperial”, and “iwmn” (the abbreviation of the full name). Implicitly relevant annotations include “architecture”, “wellingtonbomber” (hashtag for the Wellington bomber), and “gunturret” (hashtag for gun turret), which are either collections or characteristics of the Imperial War Museum. Non-relevant words include names of nearby places, such as “univeristyofmanchester” (hashtag for the University of Manchester), or the city name alone, “manchester”, which is too broad as an annotation. As shown in Fig. 10b, TF-per-user-IDF produces many more relevant annotation keywords (6 explicitly relevant and 4 implicitly relevant) than the other frequency based methods.

Fig. 10: Interpretation and evaluation of tweet based semantic annotations for Imperial War Museum North (boundary based geographical object)

7.6.2 Point object: tower bridge

For Tower Bridge, we compare the top 20 annotations from two frequency based methods and two probability based methods. Explicitly relevant annotations include “walkway”, “glasswalkway” (the hashtag for the glass walkway), and “glassfloor”, which are either famous exhibitions or features of Tower Bridge. The non-relevant words include common language and names of nearby businesses and landmarks. As shown in Fig. 11b, the non-relevant words from the frequency based methods contain more common language, while the probability based methods generate names of nearby landmarks. The KDE-adaptive method produces more relevant annotations than the KDE-fixed method; in this case study it detects one explicitly relevant word as the top result and two implicitly relevant words among the top 6 results.

Fig. 11: Interpretation and evaluation of tweet based semantic annotations for Tower Bridge (point based geographical object)

7.7 Textual feature comparison between unigrams, bigrams, and trigrams

The above experiments focus on unigrams for annotating geographic objects. To evaluate whether our results are consistent across different types of n-gram features, we extract unigrams, bigrams, or trigrams from geo-tagged tweets in the name detection experiments for boundary based places, as shown in Fig. 9. We use six frequency based methods to detect the names of places that contain tweets with at least 20 distinct unigrams, bigrams, or trigrams, and require that the n-gram names are also mentioned within the boundaries of the places. In the end, the ground truth datasets contain 3,106 place objects for the unigram experiments (Fig. 9a), 2,228 place objects for the bigram experiments (Fig. 9b), and 641 place objects for the trigram experiments (Fig. 9c).

The accuracies of the unigram experiments are higher than those of the bigram and trigram experiments for all place categories (Fig. 9d). Such results may stem from the limitations of the ground truth datasets: since places with one-word names must be excluded from the bigram and trigram experiments, the remaining place objects with at least one bigram or trigram name may contain more noise, making it harder to rank the place names among the top 10 annotations.

In contrast to the results of the unigram experiments (Fig. 9a), we also find that, among the six frequency based methods, the bigram and trigram experiments achieve their highest accuracies with the methods without document corpus based weighting, i.e., TF, TF-per-tweet, and TF-per-user (Fig. 9b-c). In general, bigrams and trigrams may contain more semantic information. These results indicate that methods should be customized to the feature type in order to explore the rich semantics of tweet content and other geo-tagged social media data.

8 Conclusion

Vast amounts of spatial big data are being generated through geo-crowdsourcing (VGI) and active users (social media). Integrating multiple sources of spatial big data can provide new insights and create new forms of value. In this paper, we present integrated spatial data analytics to support geo-tagged tweet based annotation of OpenStreetMap objects. Our spatial data integration is built on a MapReduce based spatial query engine, which makes it possible to quickly integrate large scale spatial data. We first propose frequency based methods, optimized through various weighting schemes, to annotate objects with clear boundaries, and then propose probability based methods built on KDE with adaptive bandwidth to annotate objects with point based representations. Our experiments, through ground-truth comparison and human interpretation of annotation results, demonstrate promising results.