1 Introduction

Historical maps are an irreplaceable primary source of geographical and political information about the past (e.g., historical place names, landmarks, natural features, transportation networks, and war, trade, and diplomacy networks). The image processing and pattern recognition community started to develop computational methods for extracting and recognizing content from archived map images in the early 1980s (Chiang et al. 2014). With the exponential growth of available map scans in archives and on the internet, a variety of disciplines in the natural and social sciences have grown interested in using historical maps in their studies. For example, the Mappa Mundi by Fra Mauro (ca. 1450) (Fig. 1) not only contains place names but also provides “natural philosophy, description of places and people, commercial geography, history, navigation and direction of expansion, and, finally, on what we can nowadays call methodological issues. In addition, Fra Mauro’s world map also includes hundreds of images, representing cities, temples, funerary monuments, streets, and ships, as well as a scene in the lower left corner representing Earthly Paradise” (Nanetti et al. 2015).

Fig. 1. East Asia Mainland of the Mappa Mundi (ca. 1450), Fra Mauro

In many cases, historical maps are the only source that provides professionally surveyed historical geographic data. Map archives such as the U.S. Geological Survey (USGS) National Geologic Map Database,Footnote 1 USGS Topographic Maps,Footnote 2 David Rumsey Map Collection,Footnote 3 OldMapsOnline.org,Footnote 4 and the National Library of Scotland,Footnote 5 together store millions of such historical maps in either paper or scanned format. For example, between 1884 and 2006, the USGS created over 200,000 topographic maps. According to the USGS, in the United States these topographic maps “portray both natural and manmade features. These maps show and name works of nature including mountains, valleys, plains, lakes, rivers, and vegetation. They also identify the principal works of man, such as roads, boundaries, transmission lines, and major buildings.” The USGS National Geospatial Program has scanned these historical paper maps. Collectively, these publicly available scanned maps portray the evolution of the American landscape over a 125-year period.Footnote 6 Similar map series exist in many countries, e.g., the Ordnance Survey maps in the U.K. archived by the National Library of Scotland.

In the case of more recent historical maps produced using modern geospatial survey technologies (e.g., the USGS Topographic Map series, Ordnance Survey six-inch series, and other national agency series dating from the early 1800s), the detailed map data on the states of landscapes in the past are essential for understanding the causes and consequences of environmental change and support a variety of natural and social studies on topics such as cancer and environmental epidemiology, urbanization, biodiversity, landscape changes, and history. (See Gregory et al. 2015 for more examples and methodologies in historical geographic information systems.) Many of these historical maps are not georeferenced, and almost all of the maps have content that is not machine-readable. Existing map processing technologies are still limited in making a large number of historical maps fully searchable by their content because the archived documents often suffer from bleaching, blurring, and false coloring (e.g., Khotanzad and Zink 2003; Leyk and Boesch 2009; Leyk et al. 2006). The reader is referred to Chiang et al. (2014) and Chiang et al. (2016) for detailed reviews and case studies on map processing techniques and systems.

Today, a researcher can spend a great deal of time and effort searching and cross-referencing data sources to find relevant maps, and must then digitize each map to convert its content to a machine-readable format (e.g., Godfrey and Eveleth 2015; Nanetti et al. 2015). The researcher may need to search various publication repositories, map repositories, and search engines, and will often still fail to find the historical map they are looking for and have to work without it. In many cases, these historical maps exist, but locating and digitizing them requires too much effort. The result is that researchers waste time and resources and do not get as far as they could have in their work because the relevant information is not discoverable or takes too long to prepare for scientific analysis.

The challenges in working with historical maps present an enormous collaboration opportunity for the image processing and pattern recognition community to build advanced map processing technologies that transform the scientific studies currently using textual content in historical maps. It is therefore important to understand the current landscape in the broad applications of historical maps. This paper first describes the potential and current applications of historical maps in a variety of studies, including topics in natural science (Sect. 2) and social science (Sect. 3). Next, the paper describes current trends in extracting and recognizing textual content from historical maps (Sect. 4). Finally, the paper discusses the future outlook for text recognition technologies in map processing (Sect. 5).

2 Potentials and Applications of Historical Maps in Natural Science

Historical data archives (e.g., museum and herbaria collections, digital photography and newspaper archives) support a variety of scientific studies in natural science on topics such as biodiversity (e.g., Hill et al. 2009), evolutionary biology (e.g., Lavoie 2013), human disease (e.g., Yoshida et al. 2014), plant biology (Davis et al. 2015; Vellend et al. 2013), and ecology (e.g., Newbold 2010; Pyke and Ehrlich 2010). However, geolocating the historical localities mentioned in archives (e.g., Calflora Observation DatabaseFootnote 7 and the Global Biodiversity Information Facility; Samy et al. 2013) is challenging and very often a tedious manual process using historical maps. Murphey et al. (2004) reviewed the problems in georeferencing museum collections. They compared a number of geoparsing tools, including GEOLocate (Rios and Bart 2010) and BioGeomancer (Guralnick et al. 2006). Since then, a variety of advanced algorithms for geoparsing have been proposed (e.g., Leidner and Lieberman 2011), and open-source software packages (e.g., CLAVIN,Footnote 8 CLIFF (D’Ignazio et al. 2014), and the Edinburgh Geoparser (Alex et al. 2015)) are available. These algorithms and tools are widely used for geolocating places in unstructured text and are also used in spatial humanities research (e.g., Gregory et al. 2015). However, these tools need a “gold data” gazetteer to provide the location information of recognized place names, and the lack of historical reference gazetteers remains a challenge. The result is that even if the geoparsing software can correctly identify a historical name as a geolocation reference in unstructured text, the geocoordinates of the historical name are still unknown if the place name no longer exists. To locate place names that no longer exist in contemporary data sources, a researcher needs to search and cross-reference a variety of data sources such as archives of historical maps, newspapers, and photography.
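The gazetteer gap described in this paragraph can be illustrated with a minimal sketch. It assumes a hypothetical gazetteer whose entries carry a validity interval in years; the `GazetteerEntry` type, the `lookup` function, and the toy place names and coordinates are all invented for illustration and do not correspond to any of the tools cited above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GazetteerEntry:
    name: str
    lat: float
    lon: float
    valid_from: int            # first year the name is attested
    valid_to: Optional[int]    # last year attested; None means still in use


def lookup(gazetteer: List[GazetteerEntry], name: str, year: int) -> List[GazetteerEntry]:
    """Return entries whose name matches and whose validity interval covers `year`."""
    hits = []
    for entry in gazetteer:
        if entry.name.lower() != name.lower():
            continue
        still_valid = entry.valid_to is None or year <= entry.valid_to
        if entry.valid_from <= year and still_valid:
            hits.append(entry)
    return hits


# Toy data: names and coordinates are made up for illustration only.
contemporary = [GazetteerEntry("Riverton", 10.0, 20.0, 1950, None)]
historical = contemporary + [GazetteerEntry("Old Mill", 10.1, 20.2, 1880, 1930)]

# A record dated 1902 that mentions "Old Mill" cannot be resolved against the
# contemporary gazetteer, but resolves once a historical entry is available.
print(lookup(contemporary, "Old Mill", 1902))
print(lookup(historical, "Old Mill", 1902))
```

Even when a geoparser correctly flags the name, the first lookup comes back empty. Place names extracted from historical maps could supply the missing historical entries.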

For example, a data record in an online database of California herbarium specimens describes an August 16th, 1902 observation of Artemisia douglasiana (California mugwort) at the location “near Mesmer” in Los Angeles. The place name Mesmer near or within both the City and County of Los Angeles no longer exists in contemporary geographic data sources, including authoritative sources like the U.S. CensusFootnote 9 and USGS GNIS (the United States Geological Survey Geographic Names Information System)Footnote 10 and open sources, such as GeoNames,Footnote 11 OpenStreetMap,Footnote 12 and Wikipedia. Searching “Mesmer” in the GeoNames gazetteer returns an airport “Mesmer Airport” in New York and a street “Rue Mesmer” in Haiti. Neither of the results helps to geolocate the observation of California mugwort in 1902. A Google search with the keywords “Mesmer” and “Los Angeles” reveals a few interesting facts that could be helpful for geolocating Mesmer. First, the search results include a person, Louis Mesmer (1829–1900), who was a prominent businessman and the owner of the famous United States Hotel in Los Angeles. Because it was common to name locations after well-known families (e.g., Wilshire, Hancock, and Doheny in Southern California), Mesmer could be a place name in the Los Angeles area in the past. Second, the search results contain a link to a map in the Los Angeles Public Library collections showing a proposed development plan in 1924 for the “Mesmer City” in Los Angeles (Fig. 2). At the time, Mesmer City was advertised as “In the direct path of the Los Angeles’ growth toward the ocean”.Footnote 13 This map further narrows down the search space for Mesmer to somewhere near Culver City and Baldwin Hills. Together, the time and location information from the search results points to the USGS topographic map that contains Mesmer in 1901 (Fig. 3). In this case, Mesmer is geolocated, but the entire process cannot be scaled to handle thousands of records in an efficient manner.

Fig. 2. Map of the Mesmer City development

Fig. 3. The USGS historical topographic map shows the location of Mesmer. (Southern California Sheet No. 1, circa 1901)

Historical GIS (Geographic Information System) (Gregory and Ell 2007) could alleviate the problem of geolocating historical locality references by providing a platform for collecting datasets of historical place names, but the datasets are rarely available. Even when historical gazetteers are available, their spatiotemporal coverage is often sparse. For example, the U.S. Census only provides census gazetteer files for 1990, 2000, and 2010 onward. NHGIS (the National Historical Geographic Information System at the Minnesota Population Center)Footnote 14 provides historical demography data down to the census tract level but only a few place names. The Ramsay Place Names File from the State Historical Society of Missouri provides a historical gazetteer covering locations in the State of Missouri from 1928 to 1945 (Adams 1928). The website “A Vision of Britain through Time” from the GB Historical GIS at the University of PortsmouthFootnote 15 provides historical place names in Great Britain dating back to the early 19th century.

As shown in the examples of natural science studies in this section, the ability to automatically use the textual content in historical maps as a locality reference source would make it possible to transform historical data records in documents and collections into georeferenced datasets. This ability will enable natural science researchers to efficiently find, query, and analyze a variety of historical records by location.

3 Potentials and Applications of Historical Maps in Social Science

Historical maps also play an important role in social science studies. Kurashige (2013)Footnote 16 used historical census data, voting records, and precinct numbers and boundaries extracted from a 1920 map to study “who” (e.g., occupations and political parties) in Los Angeles voted for the 1920 California Alien Land Law, which discriminated against the Japanese (Fig. 4).

Fig. 4. Automatically unlocking precinct boundaries in a historical map for analyzing historical voting records with demographic datasets.

Ngo et al. (2015) used historical maps and land records to build an interactive visualization of land reclamation in Hong Kong (Fig. 5). This web toolFootnote 17 is among the top hits when searching Hong Kong land reclamation on Google.

Fig. 5. Building an interactive visualization of land reclamation in Hong Kong from historical maps.

The Spatial Sciences Institute at the University of Southern California (USC) is collaborating with an insurance company to automatically read historical Ordnance Survey maps (ca. 1900–1970) covering the entire U.K. to identify likely locations of subterranean contamination, such as factories, mines, quarries, and gas works which no longer exist and otherwise would not be known today (Fig. 6).Footnote 18

Fig. 6. Quarries and infill lands in a historical map and the contemporary satellite imagery.

In a joint effort, the USC Shoah Foundation Visual History Archive (VHA) works with the USC Spatial Sciences Institute to link historical maps to places mentioned in genocide survivor testimonies in the VHA archive. The linkages enrich the personal stories of the survivors by using the spatial and temporal context in historical maps to enable the viewers to “go back in time” to recreate the physical world of the historical experience of the survivors (Fig. 7).Footnote 19

Fig. 7. Using historical maps to identify the wedding location (historical synagogue) described by Holocaust survivor Murray Burger.

Nanetti et al. (2015) manually transcribed and georeferenced the textual content in the Mappa Mundi by Fra Mauro (ca. 1450). They use the transcribed data as a knowledge aggregator to represent the world as seen from Venice in the fifteenth century. They also plan to use the map data for automatic provenance and validation assessment of large and heterogeneous collections of other historical sources.

The example studies in this section demonstrate the power of historical maps in social science research. They also show that extracting and recognizing map content is important, but so is providing semantic annotations to the map content and linking it to other data sources, which will enable researchers to investigate complex social science problems at a scale that is not possible today.

4 Current Trends in Text Recognition from Historical Maps

Text recognition from maps is a difficult task, especially for historical maps. This is because map labels often overlap with other map features (such as road lines), do not follow a fixed orientation within a map, and can be stenciled or handwritten (Chiang et al. 2014, 2016; Nagy et al. 1997). Also, many historical scanned maps suffer from poor graphical quality due to bleaching of the original paper maps and archiving practices. This section presents a number of trends in text recognition from historical maps.

Classic Approaches.

Traditionally, approaches to text recognition from historical maps follow the classic document recognition strategy: they first analyze the map to identify the potential text areas, and then use optical character recognition (OCR) algorithms or tools (e.g., TesseractOCR)Footnote 20 to convert the detected areas into machine-readable text (Honarvar Nazari et al. 2016; Chiang and Knoblock 2014; Li et al. 2000; Pezeshk and Tutwiler 2010, 2011; Raveaux et al. 2007, 2008). This line of work has demonstrated promising results on single map sheets or map series but still does not handle large numbers and various types of historical maps because of the significant heterogeneity of historical maps.
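The first step of this two-step strategy can be sketched on a toy binary image. The sketch assumes one common heuristic, connected-component analysis with a size filter, which is not necessarily the method used in any of the cited systems. The function names and the `max_side` threshold are illustrative, and the OCR step is left out because it depends on an external tool.

```python
from collections import deque


def connected_components(grid):
    """Return bounding boxes (r0, c0, r1, c1) of 4-connected foreground (1) regions."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0 = r1 = r
            c0 = c1 = c
            while queue:
                y, x = queue.popleft()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


def text_candidates(boxes, max_side=3):
    """Keep components small enough to be characters; long ones are likely line work."""
    return [b for b in boxes
            if (b[2] - b[0] + 1) <= max_side and (b[3] - b[1] + 1) <= max_side]


image = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # a long "road" line
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 0],  # two small character-like blobs
    [0, 1, 1, 0, 0, 1, 0, 0],
]

print(text_candidates(connected_components(image)))
```

A filter this simple breaks down when a label touches a road line, because the two merge into one large component. That is one reason the heterogeneity of historical maps is hard for this family of methods.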

Crowdsourcing Approaches.

To handle the vast variety of historical map types and sources, recent work has adopted a crowdsourcing strategy. Crowdsourcing is not a new idea in document recognition and map processing, but it can be very difficult to implement and popularize. The David Rumsey Map Collection held crowdsourcing events to georeference its map collections. The New York Public Library provides semi-automatic tools for the public to extract parcel polygons from its historical insurance maps.Footnote 21 The library also noted that even with the crowdsourcing approach, semi-automatic tools are required to process its map collections in a reasonable time (Arteaga 2013). Specifically for converting textual content in historical maps using crowdsourcing, the Pelagios CommonsFootnote 22 is a notable community that provides tools and online infrastructures to facilitate annotating historical locality references in digital materials. Their tools allow semi-automatic extraction, recognition, annotation, and linking of place names in historical maps (Simon et al. 2010, 2014, 2015). These tools and the online infrastructure allow them to provide full-text searchable place data ranging from ancient times to 1500 AD and from Europe to East Asia.

To make the crowdsourcing strategy more effective and efficient, it will be necessary to build adaptive semi-automatic techniques that improve the level of automation as more maps are processed. Also, as the crowdsourcing strategy is used, approaches for cross-validating between user generated content and the “gold data” as well as recording the provenance information are required (e.g., Garijo et al. 2015).

Multi-model Approaches.

Another line of work in the recent development of text recognition from historical maps uses additional data sources as a dictionary to help correct recognition errors. While this dictionary strategy is common in OCR, compiling and effectively using a dictionary for recognizing historical map text is difficult. This is because a dictionary built using contemporary data sources does not contain place names that no longer exist. Also, without knowing the map coverage beforehand, multiple dictionary entries can match a partially recognized label. For example, a partially recognized label “Glas wo” near London could be matched to “Glasgow” when the label is actually “Glassworks”. Even when the map coordinates are known, the map text might not be at the exact location of the geographic features, depending on the cartographic labeling practice of the map. Weinman (2013) presents an approach that overcomes this challenge. His approach recognizes text labels in maps and then matches the recognized text to a gazetteer by their position patterns using a RANSAC variant called MLESAC (Torr and Zisserman 2000). He showed that this approach could automatically georeference historical maps and improve the recognition accuracy even when the gazetteer only contains 70% of the text in the test maps.
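The “Glas wo” ambiguity can be reproduced with a short sketch. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure and a haversine distance filter as a stand-in for knowing the map coverage; neither is claimed to be what the cited systems use. The dictionary is a toy, and the coordinates given for “Glassworks” are hypothetical.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def candidates(label, dictionary, min_sim=0.6, near=None, max_km=50.0):
    """Dictionary names similar to `label`, optionally restricted to a neighbourhood."""
    hits = []
    for name, (lat, lon) in dictionary.items():
        if similarity(label, name) < min_sim:
            continue
        if near is not None and haversine_km(near[0], near[1], lat, lon) > max_km:
            continue
        hits.append(name)
    return sorted(hits)


dictionary = {
    "Glasgow": (55.86, -4.25),      # approximate city coordinates
    "Glassworks": (51.51, -0.10),   # hypothetical feature near London
    "Greenwich": (51.48, 0.00),     # approximate coordinates
}
london = (51.51, -0.13)

print(candidates("Glas wo", dictionary))               # both names are plausible
print(candidates("Glas wo", dictionary, near=london))  # location removes "Glasgow"
```

String similarity alone cannot separate the two candidates, so the spatial constraint has to do the work. When the map coverage is unknown, that constraint is not available up front, which is the gap that position-pattern matching such as Weinman's is meant to close.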

In previous work, we developed a semi-automatic approach that extracts and recognizes text labels in map images in a system called Strabo (Apache Version 2 License) (Chiang and Knoblock 2014). While Strabo could achieve over 90% precision and recall in recognizing text labels in scanned contemporary maps, it could only produce 47.6% precision and 83.5% recall on well-conditioned text from historical Ordnance Survey six-inch maps (Chiang et al. 2016). The result is that very often only partial labels could be recognized from a historical map (Figs. 8(a) and (b)) and manual post-processing is required to correct the recognition results.
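For readers less familiar with these measures, the sketch below computes precision and recall over sets of label strings. It assumes exact string matching at the label level; the evaluation granularity used in the cited experiments is not restated here, so this should be read only as an illustration of the definitions. The label lists are toy data.

```python
def precision_recall(recognized, ground_truth):
    """Precision = correct / recognized; recall = correct / ground truth."""
    recognized, ground_truth = set(recognized), set(ground_truth)
    true_positives = len(recognized & ground_truth)
    precision = true_positives / len(recognized) if recognized else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


truth = ["Church", "Quarry", "Mill", "School"]
# Toy OCR output: three correct labels plus three partial or garbled fragments.
output = ["Church", "Quarry", "Mill", "Cltureh", "urch", "Sch ol"]

print(precision_recall(output, truth))
```

The example shows how spurious or partial fragments lower precision while recall stays comparatively high, the same pattern as in the historical-map figures reported above.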

Fig. 8. Matching imperfect OCR results from two map editions to improve recognition accuracy

In an effort to test higher levels of automation in text recognition from historical maps (Yu et al. 2016), we exploit the fact that geographic names for the same area found in different data sources are not independent and use geographic names in OpenStreetMap and other maps covering the same area as the “dependent” knowledge source. Given a historical map, the task at hand is to recognize all map labels in the map accurately without user intervention. First, the system queries a map repository to find all map editions covering the same area and then extracts and recognizes labels in the identified maps. Second, the system uses a fuzzy matching algorithm to match the recognized (imperfect) labels using their locations and string similarity. Finally, the system uses two million geographical names extracted from OpenStreetMap to generate an improved recognition result. For example, by matching “Cltureh” from the 1935 map to “urch” in the 1900 map, the system finds the word “Church” in the geographic names extracted from OpenStreetMap to replace “Cltureh” and “urch” in the 1935 and 1900 recognition results, respectively.
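The three steps above can be sketched in a simplified form. This is not the implementation from Yu et al. (2016). It assumes both editions are already registered to a common coordinate frame, pairs labels by location only, and uses `difflib.SequenceMatcher` as the similarity measure. The function names, the distance tolerance, the coordinates, and the four-word dictionary are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_labels(labels_a, labels_b, max_dist=10.0):
    """Pair each label in edition A with the nearest label in edition B within max_dist."""
    pairs = []
    for text_a, (xa, ya) in labels_a:
        best = None
        for text_b, (xb, yb) in labels_b:
            dist = ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5
            if dist <= max_dist and (best is None or dist < best[0]):
                best = (dist, text_b)
        if best is not None:
            pairs.append((text_a, best[1]))
    return pairs


def consensus(readings, dictionary):
    """Dictionary word with the highest total similarity to all OCR readings."""
    return max(dictionary, key=lambda word: sum(similarity(r, word) for r in readings))


# Toy OCR output: (text, (x, y)) in a shared coordinate frame (assumed).
map_1935 = [("Cltureh", (100.0, 200.0))]
map_1900 = [("urch", (103.0, 198.0)), ("Mill", (400.0, 50.0))]
names = ["Church", "Chapel", "School", "Culture"]  # stand-in for OpenStreetMap names

for reading_1935, reading_1900 in pair_labels(map_1935, map_1900):
    print(consensus([reading_1935], names))                # one edition alone
    print(consensus([reading_1935, reading_1900], names))  # both editions combined
```

With this toy dictionary, the 1935 reading on its own is closest to “Culture”, and it is the second edition that tips the result to “Church”. This is the sense in which the two editions act as dependent evidence.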

For the multi-model approaches, the current challenges include how to exploit string similarity measures between the extracted map text (which contains recognition errors) and other sources (e.g., gazetteer entries) to (1) prune the search space for finding the matching pattern efficiently and (2) use matches between the OCR text and dictionary entries to learn potential OCR errors specifically for each map type. For example, the character sequences “ni” and “in” are each commonly recognized as the single character “m” during OCR. With enough training data (matches between OCR text and dictionary entries), the algorithm should be able to learn that the OCR result “Baldwm Hills” is highly likely to be “Baldwin Hills” for a specific map type or condition.
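Challenge (2) can be sketched as follows, assuming that matched pairs of OCR text and dictionary entries are already available. The sketch aligns each pair with `difflib.SequenceMatcher`, counts the substring replacements it finds, and then tries the learned replacements on a new OCR string. The training pairs, the function names, and the small dictionary are illustrative, and a real system would presumably keep a separate table per map type.

```python
from collections import Counter
from difflib import SequenceMatcher


def learn_confusions(pairs):
    """Count (ocr_substring, true_substring) replacements in aligned training pairs."""
    counts = Counter()
    for ocr, truth in pairs:
        matcher = SequenceMatcher(None, ocr, truth, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                counts[(ocr[i1:i2], truth[j1:j2])] += 1
    return counts


def correct(ocr, confusions, dictionary):
    """Try learned replacements, most frequent first; return the first dictionary hit."""
    if ocr in dictionary:
        return ocr
    for (wrong, right), _count in confusions.most_common():
        start = ocr.find(wrong)
        while start != -1:
            candidate = ocr[:start] + right + ocr[start + len(wrong):]
            if candidate in dictionary:
                return candidate
            start = ocr.find(wrong, start + 1)
    return ocr  # no learned replacement yields a known name


# Hypothetical training matches between OCR output and dictionary entries.
training = [("Sprmg St", "Spring St"), ("Mam St", "Main St"), ("Umon", "Union")]
confusions = learn_confusions(training)
place_names = {"Baldwin Hills", "Culver City"}

print(confusions.most_common())
print(correct("Baldwm Hills", confusions, place_names))
```

The learned table records that “m” in the OCR output usually stands for “in” and sometimes for “ni”. Applying the table to “Baldwm Hills” produces a candidate that is confirmed by the dictionary.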

5 Outlooks

This paper presents studies in natural and social sciences demonstrating the opportunities for the image processing and pattern recognition community to transform conventional research practices in using historical maps. For example, a new technology that automatically generates machine-readable or machine-understandable (e.g., LinkedData (Bizer et al. 2009)) place name databases from historical maps, and does so at scale, will enable biologists to minimize the time and effort for geolocating their data records and to efficiently query and analyze historical records by location and time. These opportunities also present unique possibilities for researchers in image processing and pattern recognition to identify collaborators in other scientific domains. This type of interdisciplinary collaboration allows researchers in image processing and pattern recognition to create algorithms and applications for solving “wicked” research problems and addressing real-world challenges facing our society. Further, the paper discusses a number of trends and their challenges in text recognition from historical maps. These trends have already shown promising results in the automatic unlocking of textual content in heterogeneous historical maps. Solving the challenges in these trends will make it possible to use a large number of heterogeneous historical maps efficiently and study historical spatiotemporal datasets on a large scale.