1 Introduction

How can we geographically represent text documents? For documents strongly linked to geographical space, such as hiking blogs, a representative geographical area can be automatically generated by combining natural language processing methods with geographical processing such as spatial filtering or clustering. The resulting document representations, known as ‘document scopes’ or ‘document footprints’, are useful in downstream tasks (e.g. spatial queries), upstream tasks (e.g. improving placename disambiguation), and as an end in themselves (e.g. visualizing a text document on a mapping interface) (Monteiro et al. 2016; Purves et al. 2007; Quercini et al. 2010).

The context of our work is a project on how people describe landscapes in Switzerland. To compare landscape descriptions from different data sources (hiking blogs and Flickr photos), we generate document footprints for a web-crawled corpus of hiking blogs and use them to query and select Flickr photos by location. For this task, we aimed to generate high-precision, geographically focused footprints.

2 Data and Methods

Our corpus consisted of web documents, written as first-person narratives, that relate to ten study sites in the German-speaking region of Switzerland. Documents were collected by targeted web crawling; five texts per site were selected by manual triage, giving a final corpus of 50 documents.

Fig. 1 Processing pipeline for document footprint generation

To generate document footprints, we followed the established three-step processing pipeline (Amitay et al. 2004; Monteiro et al. 2016): (1) identifying placenames, (2) grounding placenames, and (3) generating a footprint geometry (Fig. 1). Poor placename identification is a known major source of error, which propagates downstream into the document scope (Amitay et al. 2004; Purves et al. 2007). Thus, to obtain precise footprints, we performed step 1 manually, which was feasible due to the small corpus size. For step 2, grounding, we first aggregated placenames that repeated within a study site and recorded their frequencies, then queried an API to obtain ranked results from the SwissNames3D gazetteer for each placename. For the final step, footprint generation, we experimented with two approaches: iterative filtering based on the centroid and standard deviation of our candidate points (Smith and Crane 2001), and clustering with DBSCAN to identify one main cluster and discard outliers.
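The first footprint approach, iterative filtering around the centroid, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `filter_by_centroid`, the cut-off multiplier `k = 2`, the use of the root-mean-square distance to the centroid as the "standard deviation", and the planar coordinates are all assumptions made for the example.

```python
from math import hypot, sqrt


def filter_by_centroid(points, k=2.0, max_iter=10):
    """Iteratively drop points farther than k standard distances from the centroid.

    points: list of (x, y) tuples in a projected (planar) coordinate system.
    k and max_iter are illustrative defaults, not values taken from the paper.
    """
    kept = list(points)
    for _ in range(max_iter):
        if len(kept) < 3:
            break  # too few points to estimate a meaningful spread
        cx = sum(x for x, _ in kept) / len(kept)
        cy = sum(y for _, y in kept) / len(kept)
        dists = [hypot(x - cx, y - cy) for x, y in kept]
        # Standard distance: root-mean-square distance to the centroid.
        sd = sqrt(sum(d * d for d in dists) / len(dists))
        survivors = [p for p, d in zip(kept, dists) if d <= k * sd]
        if len(survivors) == len(kept):
            break  # converged: no point was removed in this pass
        kept = survivors
    return kept


# Hypothetical candidate points: a compact group plus one distant outlier.
candidates = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5),
              (2, 1), (1, 2), (2, 2), (0, 2), (100, 100)]
focused = filter_by_centroid(candidates)
```

Because the centroid is recomputed after every pass, a single distant mis-grounded placename stops pulling the footprint towards itself once it has been removed.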

Finally, we experimented with permutations of processing decisions in order to generate footprints that best suited our requirements. Decisions included: how many candidates to retain for each placename at the grounding stage; whether to give higher priority to placenames with exactly one candidate; and whether to use the per-site frequency of placenames. We automated the entire processing pipeline, starting from the manually annotated placenames, to output ten convex hulls or bounding boxes on each run.
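One way to enumerate such permutations is a simple configuration grid. The sketch below is illustrative only: the option names (`top_k_candidates`, `prioritise_unambiguous`, `weight_by_frequency`) and the candidate values are hypothetical and are not taken from the pipeline described here.

```python
from itertools import product

# Hypothetical option names and values for the three decisions named above.
DECISIONS = {
    "top_k_candidates": [1, 3, 5],            # candidates kept per placename
    "prioritise_unambiguous": [False, True],  # favour single-candidate placenames
    "weight_by_frequency": [False, True],     # use per-site placename frequency
}


def decision_grid(decisions):
    """Yield one configuration dict per combination of option values."""
    keys = list(decisions)
    for values in product(*(decisions[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(decision_grid(DECISIONS))
```

Each configuration would then drive one run of the automated pipeline, producing one footprint per study site.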

3 Results and Conclusion

Our preliminary results showed that for placename grounding, simple approaches worked well enough for our purposes: ranked candidate placenames from SwissNames3D were sufficiently accurate that we obtained good results by retaining just the top candidate for each placename. For footprint generation, satisfactory results were obtained using both distance-based filtering and DBSCAN clustering, but the DBSCAN results were better suited to more complex geometric arrangements.
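To illustrate why density-based clustering copes with elongated or bent arrangements, the following is a minimal pure-Python sketch of DBSCAN that keeps only the largest cluster. It is not the implementation used in this work; in practice a library implementation such as scikit-learn's `DBSCAN` would be a natural choice. The parameter values (`eps = 1.5`, `min_pts = 3`), the helper name `main_cluster`, and the sample points are assumptions made for the example.

```python
from collections import Counter
from math import hypot

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbours(i):
        xi, yi = points[i]
        # The neighbourhood includes the point itself.
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep growing the cluster
    return labels


def main_cluster(points, eps, min_pts):
    """Keep only the points of the largest cluster; discard noise and the rest."""
    labels = dbscan(points, eps, min_pts)
    counts = Counter(label for label in labels if label != NOISE)
    if not counts:
        return []
    best, _ = counts.most_common(1)[0]
    return [p for p, label in zip(points, labels) if label == best]


# Hypothetical L-shaped route (e.g. along a valley) plus one distant outlier.
route = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3), (20, 20)]
kept = main_cluster(route, eps=1.5, min_pts=3)
```

The whole L-shaped chain is retained because each point is density-connected to its neighbours, whereas a purely centroid-based cut-off judges the ends of the route by their distance from the middle.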

Our footprint requirements stemmed from our downstream tasks: performing a spatial query for Flickr photos and, ultimately, comparing data sources about landscapes at our ten study sites. These task-based requirements, along with the availability of high-quality placename resources for our study area, influenced our processing decisions at every stage. Future work will fully automate the placename identification stage and will systematically measure the effects of permutations in processing decisions on the downstream tasks of querying and document comparison.