1 Introduction

Geocoding, also called toponym resolution or toponym disambiguation, is the subtask of geoparsing that disambiguates place names in text. The goal of geocoding is, given a textual mention of a location, to choose the corresponding geospatial coordinates, geospatial polygon, or entry in a geospatial database. Geocoders must handle place names (known as toponyms) that refer to more than one geographical location (e.g., Paris can refer to a town in the state of Texas in the United States, or the capital city of France), and geographical locations that may be referred to by more than one name (e.g., Leeuwarden and Ljouwert are two names for the same city in the Netherlands), as shown in Fig. 1. Geocoding plays a critical role in tasks such as tracking the evolution and emergence of infectious diseases (Hay et al., 2013), analyzing and searching documents by geography (Bhargava et al., 2017), geospatial analysis of historical events (Tateosian et al., 2017), and disaster response mechanisms (Ashktorab et al., 2014; de Bruijn et al., 2018).

Fig. 1

An illustrative example of geocoding challenges. One toponym (Paris) can refer to more than one geographical location (a town in the state of Texas in the United States or the capital city of France in Europe), and a geographical location may be referred to by more than one toponym (Leeuwarden and Ljouwert are two names for the same city in the Netherlands)

The field of geocoding, previously dominated by geographical information systems communities, has seen a recent surge in interest from the natural language processing community due to the interesting linguistic challenges this task presents. The four most recent geocoding datasets (see Table 1) were all published at venues in the ACL Anthology, and the recent ACL-SIGLEX-sponsored SemEval 2019 Task 12: Toponym Resolution in Scientific Papers (Weissenbacher et al., 2019) resulted in several new natural language processing approaches to geocoding. The field has thus changed substantially since the most recent survey of geocoding (Gritta et al., 2017), including a doubling of the number of geocoding datasets and the advent of modern neural network approaches to geocoding.

Those new to this area of research would thus benefit from a survey and critical evaluation of the currently available datasets, evaluation metrics, and geocoding algorithms. Our contributions are:

  • The first survey on geocoding to include recent deep learning approaches

  • Coverage of new geocoding datasets (which increased by 100% since 2017) and geocoding systems (which increased by 50% since 2017)

  • Discussion of new directions, such as polygon-based prediction

In the remainder of this article, we first highlight some previous geocoding surveys (Sect. 2) and explain the scope of the current survey (Sect. 3). After a brief overview of our findings (Sect. 4), we categorize the features of recent geocoding datasets (Sect. 5), compare different choices for geocoding evaluation metrics (Sect. 6), and break down the different types of features and architectures used by geocoding systems (Sect. 7). We conclude with a discussion of where the field should head next (Sect. 8).

2 Related works

An early formal survey of geocoding is Leidner (2007), which distinguished finding place names (known as geotagging or toponym recognition) from linking place names to databases (known as geocoding or toponym resolution). They found that most geocoding methods combined natural language processing techniques, such as lexical string matching or word sense matching, with geographic heuristics, such as minimizing spatial distance or maximizing population. Most geocoders studied in that thesis were rule-based.

Monteiro et al. (2016) surveyed work on predicting document-level geographic scope, which often includes mention-level geocoding as one of its steps. Most of this survey focused on the document-level task, but the geocoding section found techniques similar to those found by Leidner (2007).

Gritta et al. (2017) reviewed both geotagging and geocoding, and proposed a new dataset, WikToR. The survey portion of this article compared datasets for geoparsing, explored heuristics of rule-based and feature-based machine learning geocoders, summarized evaluation metrics, and classified common errors made by several geocoders (misspellings, case sensitivity, the challenges posed by fictional and historical text, etc.). Gritta et al. (2017) concluded that future geoparsers would need to utilize semantics and context, not just the syntax and word forms used by the geocoders of the time.

Leidner (2021) reviewed many geospatial information processing tasks, but discussed only two geocoding systems in its section on geocoding.

Geocoding research since these previous surveys has changed in several important ways, as will be described in the remainder of this article. Most notably, new datasets and evaluation metrics are enabling new polygon-based views of the problem, and deep learning methods are offering new algorithms and new approaches for geocoding.

3 Article inclusion criteria

We focus on the geocoding problem, where mentions of place names are resolved to database entries or polygons. We thus searched the Google Scholar and Semantic Scholar search engines for papers matching any of the keyword queries: geocoding, geoparsing, geolocation, toponym resolution, toponym disambiguation, or spatial information extraction. From the results, we excluded articles that described tasks other than mention-level geocoding, for example:

  • Matching an entire document or microblog post to a single location (Luo et al., 2020; Hoang & Mothe, 2018; Kumar & Singh, 2019; Lee et al., 2015; Melo & Martins, 2017), as in geographic document retrieval and classification (Gey et al., 2005; Adams & McKenzie, 2018)

  • Matching toponyms to each other within a geographical database (Santos et al., 2018)

  • Location name recognition (geotagging) (Chen et al., 2022)

We also excluded papers published before 2010 (e.g., Smith and Crane, 2001), as they have been covered thoroughly by prior surveys.

In total, we reviewed more than 60 papers and included more than 30 of them in this survey.

4 Overview of the survey

The survey is divided into three parts: geocoding datasets, geocoding evaluation metrics, and geocoding systems. In each part, we break down the relevant research to reveal the most common features shared across different research efforts and analyze the challenges and opportunities presented.

For geocoding datasets, we find that recent advances have led to an increased variety of domains, while the available geographic databases and geospatial label types have changed little. GeoNames remains the dominant geographic database, and point-based labels dominate over polygons. The availability of free polygon data on OpenStreetMap presents an opportunity to create new datasets that emphasize polygons over points.

For evaluation metrics, median error distance is preferred over mean error distance, and area under the curve of geocoding error distances (AUC) is favored over Accuracy@161 km. Yet these point-based metrics ignore the size and shape of geographic locations, while polygon-based metrics represent an opportunity to more carefully evaluate geocoding systems.

For geocoding systems, features like string matching and population are included in most systems, regardless of whether they treat the problem as ranking or classification and regardless of whether they use deep neural networks or more traditional machine learning algorithms. Variability in the selection of evaluation datasets makes direct comparison across systems difficult, but several systems have reported results on the LGL, WikToR, GeoVirus, and WOTR datasets. These results generally show that deep neural network models outperform more traditional machine learning algorithms. The neural network models typically incorporate fewer features (e.g., having only a limited notion of spatial distance), so there is an opportunity to design deep learning architectures that can incorporate such features.

The remainder of this survey elaborates on these findings in detail.

5 Geocoding datasets

Many geocoding corpora have been proposed, drawn from different domains, linking to different geographic databases, with different forms of geocoding labels, and with varying sizes in terms of both articles/messages and toponyms. Table 1 cites and summarizes these datasets, and the following sections walk through some of the dimensions over which the datasets vary.

Table 1 Summary of geocoding datasets covered by this survey, sorted by year of creation

5.1 Domains

The news domain is the most common target for geocoding corpora, covering sources like broadcast conversation, broadcast news, news magazines, and newspapers. Examples include the ACE 2005 English SpatialML Annotations (ACS), the Local Global Lexicon (LGL), CLUST, TR-NEWS, GeoVirus, GeoWebNews, and TopRes19th. Though all these datasets include news text, they vary in what toponyms are included. For example, LGL is based on local and small U.S. news sources with most toponyms smaller than a U.S. state, while GeoVirus focuses on news about global disease outbreaks and epidemics with larger, often country-level, toponyms.

Web text is also a common target for geocoding corpora. Wikipedia Toponym Retrieval (WikToR) and GeoCoDe are both based on Wikipedia pages. ACS, mentioned above, also includes newsgroup and weblog data. And social media, specifically Twitter, is the target for ZG and GeoCorpora. TUD-Loc-2013 contains a variety of webpages including news articles and blogs. These corpora vary as widely as the internet text upon which they are based. For example, GeoCoDe and WikToR include the first paragraphs of Wikipedia articles, while ZG and GeoCorpora contain Twitter messages with place names that were highly ambiguous and mostly unambiguous, respectively.

Other geocoding domains are less common, but have included areas such as historical documents and scientific journal articles. The Official Records of the War of the Rebellion (WOTR) corpus annotates historical toponyms of the U.S. Civil War. Ardanuy and Sporleder (2017) created five historical multilingual datasets based on national, regional, local, and colonial historical newspapers. CLDW contains historical writings about the English Lake District in the early seventeenth and early twentieth centuries. The SemEval-2019 Task 12 dataset is based on scientific journal papers from PubMed Central.

5.2 Geographic databases

All geocoding corpora rely on some database of geographic knowledge, sometimes also called a gazetteer or ontology. Such a database includes canonical names for places along with their geographic attributes such as latitude/longitude or geospatial polygon, and may include other information, such as population or type of place.

Most geocoding corpora have used GeoNames as their geographic database, including ACS, LGL, CLUST, ZG, WikToR, TR-NEWS, GeoCorpora, GeoVirus, GeoWebNews, and SemEval-2019-12. GeoNames is a crowdsourced database of geospatial locations, with almost 7 million entries and a variety of information such as feature type (country, city, river, mountain, etc.), population, elevation, and positions within a political geographic hierarchy. The freely available version of GeoNames contains only a (latitude, longitude) point for each location, with the polygons only available with a premium data subscription, so most corpora based on GeoNames do not use geospatial polygons.

Geocoding corpora where recognizing geospatial polygons is important have typically turned to OpenStreetMap. OpenStreetMap is another crowdsourced database of geospatial locations, which contains both (latitude, longitude) points and geospatial polygons for its locations. WOTR and GeoCoDe are based on OpenStreetMap.

Wikipedia and Unlock have also been used, although they are less common geographic databases. For example, in TopRes19th, the toponyms are annotated with links to the corresponding Wikipedia entries, which can be used to obtain the geographic coordinates of the locations.

5.3 Geospatial label types

Three different types of geospatial labels have been considered in geocoding corpora: database entries, (latitude, longitude) points, and polygons. All corpora except WOTR and GeoCoDe assign to each place name the (latitude, longitude) point that represents its geospatial center. Many of the GeoNames-based corpora (LGL, CLUST, TUD-Loc-2013, TR-NEWS, GeoCorpora, GeoWebNews, and SemEval-2019-12) also assign to each place name its GeoNames database ID. The WOTR corpus assigns to each place name a point or a polygon, and GeoCoDe assigns to each place name only a polygon. Figure 2 shows an example of a polygon annotation from GeoCoDe.

Fig. 2

The red-shaded area is the polygon label for Biancavilla, which is defined by the set of its boundary coordinates retrieved from OpenStreetMap

5.4 Challenges: geocoding datasets

While there have been significant improvements in geocoding datasets, the community has not successfully pivoted from point-based labels to the more precise representation of geographic areas as polygons. This is due primarily to the dominance of GeoNames as a geographic database. GeoNames provides polygons only for a fee, creating a barrier for individuals and organizations that would like to pursue polygon-based geocoding research.

An additional challenge is associative toponyms, such as Canadian or Russian. Associative toponyms are included in many geocoding datasets, such as LGL, GWN, and TR-News, but the geographic databases include only literal toponyms (e.g., Canada or Russia). Resolving such toponyms will thus be more difficult, especially when their demonymic forms diverge from their names (e.g., Netherlands vs. Dutch).

5.5 Opportunities: geocoding datasets

An opportunity for future research on geocoding datasets is to pivot to polygon based labels, which can more faithfully represent complex regions. OpenStreetMap, though used less widely in geocoding research to date, offers free polygon data, and thus provides an opportunity to design new polygon-based geocoding datasets that are not limited by GeoNames fees. Such datasets would allow the development of geocoding systems that better reflect the geography of the world.

Another opportunity in geocoding is to take advantage of the increased variety of domains now available, including historical documents, scientific documents, Wikipedia, and social media. Most work to date has focused on a single one of these domains, meaning there is a need to develop approaches to unify the various datasets, allowing more general and robust geocoding systems to be trained.

6 Geocoding evaluation metrics

Geocoding systems are evaluated on geocoding corpora using metrics that depend on the corpus’s geospatial label type.

6.1 Database entry correctness metrics

When the target label type is a geospatial database entry ID, common evaluation metrics for multi-class classification tasks are applied. These metrics can also be used for corpora with (latitude, longitude) point labels by breaking the globe down into a discrete grid of geospatial tiles, and treating each geospatial tile like a database entry.

Accuracy is the number of place names for which the system has predicted the correct database entry, divided by the total number of place names. Accuracy is sometimes also called Precision@1 or P@1 when there is only one correct answer (as is the case for current geocoding datasets) and when a ranking-based system is turned into a classifier by taking the top-ranked result as its prediction (the current standard for geocoding evaluation).

$$Accuracy = \frac{|\hat{U}|}{|U|}$$

where U is the set of human-annotated place names and \(\hat{U}\) is the subset of place names for which the system’s single prediction or top-ranked result is correct.
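For concreteness, this computation reduces to a few lines of code. The sketch below is illustrative only, assuming gold and predicted labels are database IDs; the function and variable names are our own, not from any particular system.

```python
def accuracy(gold_ids, predicted_ids):
    """Fraction of place names whose top-ranked prediction matches the gold database entry."""
    assert len(gold_ids) == len(predicted_ids)
    correct = sum(1 for gold, pred in zip(gold_ids, predicted_ids) if gold == pred)
    return correct / len(gold_ids)

# e.g., accuracy(["2988507", "4717560"], ["2988507", "2988507"]) == 0.5
```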

6.2 Point distance metrics

When the target label type is a (latitude, longitude) point, common evaluation metrics attempt to measure the distance between the system-predicted point and the human-annotated point.

Mean error distance calculates the mean over all predictions of the distance between each system-predicted and human-annotated point:

$$Mean\;Error\;Dist = \frac{\sum_{u \in U} dis\left(l_{s}(u), l_{h}(u)\right)}{|U|}$$

where U is the set of all human-annotated place names, \(l_s(u)\) is the system-predicted (latitude, longitude) point for place name u, \(l_h(u)\) is the human-annotated (latitude, longitude) point for place name u, and dis is the distance between the two points on the surface of the globe.

Median Error Distance is defined in a similar way to mean error distance, but takes the median of the error distances rather than the mean.
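Both metrics depend only on a great-circle distance function. Below is a minimal sketch, assuming lists of paired (latitude, longitude) tuples in degrees and a haversine approximation of dis; the helper names are ours.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0

def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (latitude, longitude) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*point_a, *point_b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def error_distances(system_points, human_points):
    return [haversine_km(s, h) for s, h in zip(system_points, human_points)]

def mean_error_distance(system_points, human_points):
    dists = error_distances(system_points, human_points)
    return sum(dists) / len(dists)

def median_error_distance(system_points, human_points):
    return median(error_distances(system_points, human_points))
```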

Accuracy@k km/miles measures the fraction of system-predicted (latitude, longitude) points that were less than k km/miles away from the human-annotated (latitude, longitude) points. Formally:

$$Acc@k = \frac{\left|\left\{u \mid u \in U \wedge dis\left(l_{s}(u), l_{h}(u)\right) \le k\right\}\right|}{|U|}$$

where U, \(l_s\), \(l_h\), and dis are defined as above, and k is a hyper-parameter. A common choice for k is 161 km \(\approx\) 100 miles (Cheng et al., 2010).
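Given the same error distances, Accuracy@k reduces to a threshold count, as in the following sketch (reusing the hypothetical haversine_km helper from the previous sketch).

```python
def accuracy_at_k(system_points, human_points, k_km=161):
    """Fraction of system predictions within k_km of the human-annotated point."""
    dists = [haversine_km(s, h) for s, h in zip(system_points, human_points)]
    return sum(1 for d in dists if d <= k_km) / len(dists)
```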

Area Under the Curve (AUC) calculates the area under the curve of the distribution of geocoding error distances. A geocoding system is better if the area under the curve is smaller. Formally:

$$AUC = \ln \frac{Actual\;Error\;Distance}{Max\;Possible\;Errors}$$

where ActualErrorDistance is the area under the curve, and MaxPossibleErrors is the farthest distance between two places on earth.
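The exact normalization of AUC varies across papers. The sketch below follows one common reading of the Jurgens et al. (2015) formulation, averaging log-scaled, sorted error distances normalized by the maximum possible error (roughly 20,039 km, half the Earth's circumference); it is an illustration under those assumptions, not a definitive reference implementation.

```python
import math

MAX_ERROR_KM = 20039  # roughly half the Earth's circumference

def geocoding_auc(error_distances_km):
    """Approximate area under the curve of log-scaled geocoding errors; smaller is better."""
    # Log-scaling weights small errors more heavily than large ones; +1 avoids log(0).
    scaled = [math.log(d + 1) / math.log(MAX_ERROR_KM) for d in sorted(error_distances_km)]
    # Rectangle-rule integral over the unit interval, giving a value in [0, 1].
    return sum(scaled) / len(scaled)
```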

Table 2 Summary of geocoding systems covered by this survey, sorted by year of creation

6.3 Polygon-based metrics

When the target label type is a polygon, evaluation metrics attempt to compare the overlap between the system-predicted polygon and the human-annotated polygon.

Polygon-based precision and recall were proposed by Laparra and Bethard (2020) based on the intersection of system-predicted and human-annotated geometries. Formally:

$$\begin{gathered} Precision = \frac{1}{|S|}\sum_{i=1}^{|S|} \frac{area\left(S_{i} \cap H_{i}\right)}{area\left(S_{i}\right)} \\ Recall = \frac{1}{|H|}\sum_{i=1}^{|H|} \frac{area\left(S_{i} \cap H_{i}\right)}{area\left(H_{i}\right)} \end{gathered}$$

where S is the set of system-predicted polygons, H is the set of human-annotated polygons, and \(S_i\) and \(H_i\) are the system-predicted and human-annotated polygons for the i-th place name.
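These area ratios can be computed directly with a computational geometry library. The sketch below uses shapely (our choice for illustration; any library with polygon intersection and area operations would do) over mention-aligned system and human polygons.

```python
from shapely.geometry import Polygon

def polygon_precision_recall(system_polys, human_polys):
    """Mean per-mention area-overlap ratios between paired system and human polygons."""
    precisions, recalls = [], []
    for s, h in zip(system_polys, human_polys):
        overlap = s.intersection(h).area
        precisions.append(overlap / s.area if s.area else 0.0)
        recalls.append(overlap / h.area if h.area else 0.0)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Toy example: two unit squares offset by half a unit give precision = recall = 0.5.
a = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
b = Polygon([(0.5, 0), (1.5, 0), (1.5, 1), (0.5, 1)])
print(polygon_precision_recall([a], [b]))
```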

6.4 Challenges: geocoding evaluation metrics

Some challenges exist with specific metrics. A challenge of using mean error distance is its sensitivity to outliers: a few locations with large errors can skew the results and obscure the accuracy of the majority of locations. For instance, Gritta et al. (2017) found that roughly 20% of the places caused most of the errors. A challenge of using Accuracy@k km/miles is that it weights small and large errors equally, which may not properly reflect the expectations of users of geocoding systems.

A challenge for all point-based evaluation metrics is that locations are not points on the globe, but regions, and thus the point-based evaluation metrics that are currently popular do a poor job of measuring the actual shapes predicted by geocoding systems.

6.5 Opportunities: geocoding evaluation metrics

For the metrics with specific challenges, alternative metrics have been defined and could be used more widely in future research. Median error distance is similar to mean error distance, but is more robust to outliers. AUC is similar to Accuracy@k km/miles, but it gives more weight to smaller errors, which are often more significant than larger errors in practical applications (Jurgens et al., 2015).

A larger opportunity in geocoding evaluation is the application of polygon-based metrics. While to date such metrics have been applied only to one polygon-based dataset, polygon-based metrics could also be applied to datasets with database entry labels. This would give credit to geocoding systems when two or more database entries are equally applicable, such as a mention of "Dallas" which is ambiguous between city and county, and where the polygons of both choices overlap. By considering the overlap of polygons, polygon-based metrics could provide a more precise evaluation of geocoding performance in such cases.

7 Geocoding systems

Table 2 summarizes the approaches of geocoders over the last decade. These models have different approaches to the prediction problem, ranging from ranking to classification to regression. They implement their predictive models with technology ranging from hand-constructed rules and heuristics, to feature-based machine-learning models, to deep learning (i.e., neural network) models that learn their own features.

7.1 Prediction types

Ranking is the most common approach to making geospatial predictions (Edinburgh Parser, TGBRW-2010, MAC-2010, IGeo, LS-2011, MG, CLAVIN, LS-2012, WISTR, GeoTxt, CMU-Geolocator, SMFCM-2015, GeoSem, CBH, SHS, DM_NLP, RS-2020, GeoNorm). For example, most rule-based systems index their geospatial database with a search system like Lucene (https://lucene.apache.org/), and query that index to produce a ranked list of candidate database entries. This ranked list may be further re-ranked based on other features such as population or proximity. The types of scores used in re-ranking include binary classification scores (MG, LS-2012, WISTR, CMU-Geolocator, CBH, SHS, DM_NLP), regression distances (MAC-2010), precision at the first position of the ranked list (SMFCM-2015), and heuristics based on information in the geospatial database (Edinburgh Parser, TGBRW-2010, IGeo, LS-2011, CLAVIN, GeoTxt).

Classification is commonly used for geospatial predictions when the Earth’s surface has been discretized into small tiles (Topocluster, CamCoder, HIS-2019, CME-2019, MLG, DeezyMatch, TR-2022, LGGeoCoder). For example, CamCoder divides the Earth’s surface into 7823 tiles, and then changes the geospatial label of each toponym to the tile containing its coordinates. CamCoder then directly predicts one of the 7823 classes for each toponym mention.
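A common way to build such a label space is to quantize (latitude, longitude) points into fixed-size cells. The sketch below shows a naive uniform 2-degree grid; the resolution and helper name are illustrative and differ from CamCoder's actual tiling.

```python
def point_to_tile(lat, lon, cell_degrees=2.0):
    """Map a (latitude, longitude) point to the index of a fixed-size grid cell."""
    rows = int(180 / cell_degrees)  # latitude bands
    cols = int(360 / cell_degrees)  # longitude bands
    row = min(int((lat + 90) / cell_degrees), rows - 1)
    col = min(int((lon + 180) / cell_degrees), cols - 1)
    return row * cols + col  # a single class id for the classifier

# Every toponym whose gold point falls in the same cell shares one class label,
# e.g., point_to_tile(48.86, 2.35) for central Paris.
```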

Regression is sometimes used for geospatial predictions when the label type is a (latitude, longitude) point or a polygon (CME-2019, LB-2020, Bi-LSTM). For example, LB-2020 predicts a set of coordinates (i.e., a polygon) by applying operations over reference geometries, where the operations take sets of coordinates as inputs and produce sets of coordinates as outputs. Regression approaches to geocoding are rare because directly predicting coordinates over the entire surface of the Earth is challenging.

7.2 Features and heuristics

All geocoding systems combine string matching (exact string matching, Levenshtein distance, etc.) with other features and/or heuristics (population, words in nearby context, etc.). Details of such features are described in this section.

String match checks whether the place name matches any names in the geospatial database (Edinburgh Parser, TGBRW-2010, MAC-2010, IGeo, LS-2011, MG, CLAVIN, GeoTxt, CMU-Geolocator, SMFCM-2015, GeoSem, CBH, SHS, DM_NLP, HIS-2019, RS-2020, DeezyMatch, TR-2022, Bi-LSTM, GeoNorm). String matching can be done exactly, or approximately with edit distance metrics like Levenshtein distance. For example, GeoTxt calculates the Levenshtein distance between the place name in the text and each candidate entry from the geospatial database, and selects the candidate with the lowest edit distance.
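To illustrate the approximate-matching step, the sketch below ranks candidate database names by Levenshtein distance. It is a self-contained toy implementation; GeoTxt's actual pipeline differs.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions, deletions, substitutions) turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

def best_candidate(mention, candidate_names):
    """Pick the database name closest to the mention by edit distance."""
    return min(candidate_names, key=lambda name: levenshtein(mention.lower(), name.lower()))

# best_candidate("Pariss", ["Paris", "Paros", "Parma"]) returns "Paris"
```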

Population looks at the size of the population associated with a candidate database entry, typically preferring more populous entries to less populous ones (Edinburgh Parser, TGBRW-2010, MAC-2010, IGeo, LS-2011, MG, LS-2012, CLAVIN, GeoTxt, CMU-Geolocator, SMFCM-2015, CBH, SHS, CamCoder, DM_NLP, GeoNorm). For example, when the Edinburgh Parser geocodes the text I love Paris, it resolves Paris to Paris, France instead of Paris, TX, U.S., since the former has a greater population in the geospatial database.

Type of place looks at the geospatial feature type (country, city, river, populated place, facility, etc.) of a candidate database entry, typically preferring the more geographically prominent ones (Edinburgh Parser, TGBRW-2010, MAC-2010, IGeo, LS-2011, MG, CLAVIN, LS-2012, GeoTxt, TRAWL, CMU-Geolocator, SMFCM-2015, GeoSem, CBH, SHS, DM_NLP, TR-2022, GeoNorm). For example, TGBRW-2010 prefers “populated places” to “facilities” such as farms and mines, when there are multiple candidate geospatial labels.

Words in the nearby context are used to disambiguate ambiguous place names (LS-2012, WISTR, CMU-Geolocator, SMFCM-2015, Topocluster, GeoSem, CBH, SHS, DM_NLP, CamCoder, CME-2019, MLG, LGGeoCoder, TR-2022, GeoNorm). Ways of using context words range from simple to complex. For example, WISTR uses a context window of 20 words on each side of the target place name, aiming to benefit from location-oriented words such as uptown and beach. In contrast, CMU-Geolocator searches for common country and state names in other nearby location expressions, using these mostly unambiguous place names to help resolve the target place name.

One sense per referent is a heuristic that assumes that all occurrences of a unique place name in the same document will refer to the same geographical database entry (Edinburgh Parser, TGBRW-2010, IGeo, LS-2011, GeoTxt, CBH, SHS, DM_NLP, GeoNorm). For example, after each time that IGeo resolves a place name to a geospatial label, it propagates the same resolution to all identical place names in the remainder of the document.

Spatial minimality is a heuristic that assumes that place names in a text tend to refer to geospatial regions that are in close spatial proximity to each other (Edinburgh Parser, TGBRW-2010, IGeo, LS-2011, CLAVIN, SPIDER, Topocluster, GeoSem, CBH, SHS, GeoNorm). For example, when IGeo geocodes the text 96 miles south of Phoenix, Arizona, just outside of Tucson, it takes Tucson as an “anchor” toponym and resolves that first to get a target region. Then for Phoenix, it selects the geospatial label that is most geographically proximate to the target region.
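A minimal sketch of this heuristic, with hypothetical helper names and reusing the haversine_km function from the Sect. 6.2 sketch, resolves an ambiguous toponym by picking the candidate closest to an already-resolved anchor:

```python
def resolve_by_spatial_minimality(candidate_points, anchor_point):
    """Pick the candidate (latitude, longitude) closest to an already-resolved anchor toponym."""
    return min(candidate_points, key=lambda point: haversine_km(point, anchor_point))

# "Phoenix" is ambiguous between Phoenix, AZ and Phoenix, NY; an anchor of Tucson, AZ
# selects the Arizona reading.
tucson = (32.22, -110.97)
phoenix_candidates = [(33.45, -112.07), (43.23, -76.30)]
print(resolve_by_spatial_minimality(phoenix_candidates, tucson))  # (33.45, -112.07)
```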

7.3 Method types

Rule-based systems use hand-crafted rules and heuristics to predict a geospatial label for a place name (Edinburgh Parser, TGBRW-2010, IGeo, LS-2011, CLAVIN, GeoTxt, HIS-2019, RS-2020, LB-2020). The rule bases range in size from 2 to more than 200 rules, and rules may be formalized in rule grammars or defined more informally and provided as code. For example, IGeo uses a rule defined via code to identify place names in comma groups (e.g., “New York, Chicago and Los Angeles”, all major cities in the U.S.), and then resolves all toponyms by applying a heuristic uniformly across the entire group. As another example, LB-2020 uses 219 synchronous grammar rules to parse a target polygon from reference polygons by constructing a tree of geometric operators (e.g., \(\textsc{Between}(p_1, p_2)\) calculates the region between geolocation polygons \(p_1\) and \(p_2\)).
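As an illustration of such a geometric operator, the sketch below gives one naive reading of a Between operation using shapely; it is our own simplification, not LB-2020's definition.

```python
from shapely.geometry import Polygon
from shapely.ops import unary_union

def between(p1: Polygon, p2: Polygon):
    """Naive Between operator: the convex hull spanning both polygons, minus the polygons themselves."""
    both = unary_union([p1, p2])
    return both.convex_hull.difference(both)

# Two unit squares separated by a gap: the result is the 1x1 region between them.
a = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
b = Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])
print(between(a, b).area)  # 1.0
```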

Feature-based machine-learning systems use many of the same features and heuristics of rule-based systems, but provide these as input to a supervised classifier that makes the prediction of a geospatial label (MAC-2010, MG, LS-2012, WISTR, CMU-Geolocator, SMFCM-2015, Topocluster, GeoSem, CBH, SHS, DM_NLP). They typically operate in a two-step rank-then-rerank framework, where first an information retrieval system produces candidate geospatial labels, then a supervised machine-learning model produces a score for each candidate, and the candidates are reranked by these scores. Classification and ranking algorithms include logistic regression (WISTR), support vector machines (MAC-2010, CMU-Geolocator), random forests (MG, LS-2012), stacked LightGBMs (DM_NLP), and LambdaMART (SMFCM-2015). For example, MAC-2010 trains a support vector machine regression model using features such as the population and the number of alternative names for each candidate.
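The following sketch outlines the rank-then-rerank pattern with a logistic-regression reranker standing in for the various classifiers listed above; the feature set, class, and function names are hypothetical simplifications.

```python
from dataclasses import dataclass
from sklearn.linear_model import LogisticRegression

@dataclass
class Candidate:
    name: str
    population: int
    retrieval_score: float  # score from the first-stage information retrieval ranker
    is_correct: bool        # gold label, available only for training data

def features(mention: str, cand: Candidate):
    return [
        cand.retrieval_score,
        cand.population,
        float(mention.lower() == cand.name.lower()),  # exact string match indicator
    ]

def train_reranker(training_pairs):
    """training_pairs: list of (mention, Candidate) tuples produced by the first-stage ranker."""
    X = [features(m, c) for m, c in training_pairs]
    y = [c.is_correct for _, c in training_pairs]
    return LogisticRegression(max_iter=1000).fit(X, y)

def rerank(model, mention, candidates):
    """Return the candidate with the highest probability of being the correct entry."""
    probs = model.predict_proba([features(mention, c) for c in candidates])[:, 1]
    return max(zip(probs, candidates), key=lambda pair: pair[0])[1]
```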

Deep learning systems often approach geocoding as a one-step classification problem by dividing the Earth’s surface into an \(N \times N\) grid, where the neural network attempts to map place names and their features to one of these \(N \times N\) categories (CamCoder, CME-2019, MLG, DeezyMatch, Bi-LSTM, LGGeoCoder, TR-2022, GeoNorm). Each system has a unique neural architecture for combining inputs to make predictions, typically based on either convolutional neural networks (CNNs) or recurrent neural networks (RNNs).
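As a schematic of this formulation (not a reimplementation of any specific published system), the sketch below maps a window of context-token ids to a distribution over grid-cell classes with a small convolutional classifier in PyTorch; all sizes are arbitrary.

```python
import torch
import torch.nn as nn

class GridCellClassifier(nn.Module):
    """Toy geocoder: embed context tokens, convolve and pool them, then classify into one of n_cells grid cells."""
    def __init__(self, vocab_size=50_000, embed_dim=128, n_cells=7823):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 256, kernel_size=3, padding=1)
        self.out = nn.Linear(256, n_cells)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)        # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values   # max-pool over the sequence
        return self.out(x)                               # logits over grid cells

logits = GridCellClassifier()(torch.randint(0, 50_000, (2, 200)))  # two 200-token contexts
print(logits.shape)  # torch.Size([2, 7823])
```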

CamCoder was the first deep-learning-based geocoder. Its lexical model uses CNNs to create vectors representing context words (a window of 200 words, location mentions excluded), location mentions (context words excluded), and the target place name. Its geospatial model produces a vector using a geospatial label’s population (from the database) as its prior probability. CamCoder concatenates the lexical and geospatial vectors for the final classification.

MLG is also a CNN-based geocoder, but it does not use population or other geospatial database information. It captures lexical features in a similar manner to CamCoder, but takes advantage of the S2 geometry (https://s2geometry.io/) to represent its geospatial output space as hierarchical grid cells from coarse to fine-grained. MLG can predict the geospatial label of a place name at multiple S2 levels, jointly maximizing the precision and the generality of its predictions.

CME-2019 and TR-2022 are RNN-based geocoders that use the HEALPix geometry (Gorski et al., 2005) to discretize the Earth’s surface. They use a long short-term memory network with pre-trained ELMo embeddings (Peters et al., 2018) or embeddings generated by pre-trained BERT (Devlin et al., 2018) to create vectors representing the place name, the local context (50 words around the place name), and the larger context (the paragraph, or 500 words around the place name). The three vectors are concatenated and used to predict both the HEALPix region class and the coordinates of that region’s centroid. This joint learning approach allows the two tasks to constrain and reinforce each other.

GeoNorm is a geocoding architecture that improves toponym resolution with a two-stage generate-and-rerank method. It first uses lexical information retrieval to propose candidate entries from a geospatial ontology, GeoNames. These candidates are then reranked by a transformer-based model that incorporates information such as population. The first stage resolves less ambiguous entities such as countries and states, while the second stage addresses more ambiguous locations, using the results of the first stage as context. This approach allows GeoNorm to achieve state-of-the-art accuracy in resolving geographic references in text.

7.4 Challenges: geocoding systems

One of the challenges in geocoding research is the lack of consistency in the evaluation datasets used by different geocoders. While the LGL, WikToR, GeoVirus, and WOTR datasets have been shared by multiple geocoders, there is still much variability in the choice of evaluation datasets. This can make it difficult to compare the performance of different geocoders and to draw meaningful conclusions from the results. We nevertheless present the partial comparison that is possible in Table 3.

Table 3 Reported results on LGL, WikToR, GeoVirus, and WOTR. For accuracy@161 km, larger is better (\(\uparrow\)). For mean error distance, smaller is better (\(\downarrow\))

The table reveals a challenge for the neural network models: they are data hungry. The gains of neural network models over prior approaches are modest on smaller datasets, such as LGL and GeoVirus, and only become large on the larger datasets, such as WikToR and WOTR. This need for large datasets may be due to the architectures themselves, or it may be a result of the simpler set of features input to neural network systems as compared to pre-neural-network systems.

7.5 Opportunities: geocoding systems

One opportunity for geocoding system research is to increase the size of the training datasets. This could be achieved by applying techniques like multi-task learning to train a single model using the variety of available geocoding datasets.

Another opportunity is to incorporate additional features into the deep learning models. For instance, document-level consistency features like one sense per referent, geospatial consistency features like spatial minimality, and additional database information beyond population were used by geocoding systems before deep learning models. Designing neural architectures that can incorporate such features could yield performance gains not possible with the current feature sets.

8 Future directions

A key direction of future research will be output representations. Many past geocoders focused on mapping place names to geospatial database entries (see column 4 of Table 2). This was convenient, enabling fast resolution by applying standard information retrieval models to propose candidate entries from the database, but was limited by the simple types of matching that information retrieval systems could perform. Modern deep learning approaches to geocoding allow more complex matching of place names to geospatial locations, but typically rely on discretizing the Earth’s surface into tiles to constrain the size of the network’s output space. For the neural networks to achieve the fine-grained level of geocoding available in geocoding databases, they may need to consider hierarchical output spaces (e.g., Kulkarni et al., 2020) or compositional output spaces (e.g., Laparra and Bethard, 2020) that can express the necessary level of detail without exploding the output space.

Another key direction of future research will be the structure and evaluation of geocoding datasets. Most existing datasets and systems treat geocoding as a problem of identifying points rather than polygons (see column 4 of Table 1 and column 5 of Table 2). Yet the vast majority of real places in geospatial databases are complex polygons (as in Fig. 2), not simple points. More polygon-based datasets are needed, especially ones like GeoCoDe (Laparra & Bethard, 2020) that include complex descriptions of locations (e.g., between the towns of Adrano and S. Maria di Licodia) and not just explicit place names (e.g., Paris). The current state-of-the-art for complex geographical description geocoding is rule-based, but more polygon-based datasets will drive algorithmic research that can improve upon these rule-based systems with some of the insights gained from deep neural network approaches to explicit place name geocoding.

Finally, geocoding evaluation is still an open research area. Future research will likely extend some of the new polygon-based evaluation metrics. For example, using polygon precision and recall would give credit to a geocoding system that predicted the GeoNames entry Nakhon Sawan even if the annotated data used the entry Changwat Nakhon Sawan, since the polygons of these two place names are nearly identical.

9 Conclusion

After surveying a decade of work on geocoding, we have identified several trends. First, combining contextual features with geospatial database information makes geocoders more powerful. Second, like much of NLP, geocoders have moved from rule-based systems to feature-based machine-learning systems to deep-learning systems. Third, the older rank-then-rerank approaches, combining information retrieval and supervised classification, are being replaced by direct classification approaches, where the Earth’s surface is discretized into many small tiles. Finally, the field of geocoding is just beginning to look beyond a point-based view of locations to a more realistic polygon-based view.