1 Biographical Registers

Schools, professional guilds, scientific societies, and other membership organizations regularly publish biographical registers of their members. Such registers are a valuable source of personal data about groups of people. They can also strengthen social cohesion and self-esteem among people who share, e.g., a common history, interests, or other aspects of life. To name a few examples from Finland: the government has regularly published the “State Calendar” (Suomen Valtiokalenteri)Footnote 1 of prominent Finnish officials; the historical Student Register (Ylioppilasmatrikkeli)Footnote 2 1640–1852 of the University of Helsinki contains data about 18 000 early academic persons in Finland; and there is a register of 73 100 engineers and architects in FinlandFootnote 3, maintained by the labor union TEK since the 1930s. Registers are usually created while the persons listed are still alive.

Such registers typically contain short biographical entries of people that belong to some group, with perhaps a photo attached. Traditionally, such registers have been published in print, making it difficult to keep the data up-to-date. When reading an old register, a recurring problem is to find out what happened to the persons after the register was published. For example, when reading one’s old high school graduation register: what happened to the classmates afterwards?

This paper presents an overview of research underway, addressing the problem of transforming printed biographical registers into Linked Data, and enriching their contents using Named Entity Linking [2, 3]. As a concrete case study, we consider the printed register “Norssit 1867–1992. Helsingin Norssin matrikkeli”, a book of 708 pages, containing short bios of over 10 000 students and teachers of the prominent Finnish high school “Norssi”, a training school of the University of Helsinki. This school celebrates its 150th anniversary in 2017, so this is a good moment to create an enriched look back at the history of its alumni.

2 Norssi Alumni on the Semantic Web

Extracting Structure from Text. The project started by digitizing the book at the Digitization Centre of the National Library of Finland. As a result, an OCR version of the book pages in XML was obtained, including the coordinates of detected images of persons. The extracted data was then transformed into RDF, with each biographical entry extracted from the OCR text. The photos of persons were also extracted from the images of the book pages and linked with the bios. After this, a collection of regex rules and Python scripts was designed in order to (1) clean OCR errors in the data and (2) extract various pieces of information from the short bios, such as the name of the person, birth place, hobbies, and relatives mentioned. An example of a short biographical entry in the book is depicted in Fig. 1. The extracted data was then uploaded into a SPARQL endpoint of the Linked Data Finland serviceFootnote 4 [5].
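The cleaning and extraction step can be illustrated with a minimal sketch. It is not the project's actual rule set: the entry layout, the OCR substitutions, the function names `clean_ocr` and `parse_entry`, and the sample entry (a fictitious person) are all assumptions made for the example.

```python
import re

# Hypothetical OCR confusions; a real list would be derived from the scanned data.
OCR_FIXES = [
    (re.compile(r"(?<=\d)[lI](?=\d)"), "1"),  # letter l or I between digits -> 1
    (re.compile(r"(?<=\d)O(?=\d)"), "0"),     # letter O between digits -> 0
    (re.compile(r"\s+"), " "),                # collapse runs of whitespace
]

# Assumed entry layout: "Family, Given Names, s. D.M.YYYY Place. rest of the bio"
# ("s." is the usual Finnish abbreviation of "syntynyt", born).
ENTRY_RE = re.compile(
    r"^(?P<family>[^,]+),\s*(?P<given>[^,]+),\s*"
    r"s\.\s*(?P<day>\d{1,2})\.(?P<month>\d{1,2})\.(?P<year>\d{4})\s+"
    r"(?P<place>[^.]+)\."
)


def clean_ocr(text):
    """Apply each substitution rule in order and trim the result."""
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text.strip()


def parse_entry(raw):
    """Return the extracted fields as a dict, or None if the layout does not match."""
    m = ENTRY_RE.match(clean_ocr(raw))
    if m is None:
        return None
    birth_date = f"{int(m['year']):04d}-{int(m['month']):02d}-{int(m['day']):02d}"
    return {
        "family_name": m["family"].strip(),
        "given_names": m["given"].strip(),
        "birth_date": birth_date,  # ISO 8601, convenient for xsd:date literals
        "birth_place": m["place"].strip(),
    }


# Fictitious entry with a typical digit/letter OCR confusion ("19l2").
print(parse_entry("Virtanen,  Matti Juhani, s. 3.5.19l2 Helsinki. Yo 1930."))
```

Returning `None` for entries that do not match keeps unparsed bios visible, so that further rules can be added for them instead of silently emitting partial data.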

Fig. 1. A short biographical entry in the register book Norssit 1867–1992

From a data linking viewpoint, the birth date and full name of each person were known at this point, and these could be used to link the entries to, and enrich them from, the other datasets listed in Table 1. Links were created to Wikipedia, Wikidata, National Biography of FinlandFootnote 5 and its Swedish complement BLFFootnote 6, BookSampoFootnote 7 Linked Data, CultureSampoFootnote 8 portal, WarSampoFootnote 9 portal, ULANFootnote 10 authority register by The J. Paul Getty Trust, VIAFFootnote 11, and the genealogical data service GeniFootnote 12. For entity linking to databases offering a SPARQL endpoint, the tool SPARQL ARPAFootnote 13 was used. In cases where the database provides a REST API, like Wikipedia or Geni.com, a special Python script was used. A script was also used in the case of BLF, where the data was available as a CSV-formatted table.
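For the CSV case, linking on full name and birth date could be sketched as follows. This is an illustration under stated assumptions rather than the script used in the project: the column names (`id`, `family_name`, `given_names`, `birth_date`), the identifiers, and the sample rows are hypothetical.

```python
import csv
import io
import unicodedata


def link_key(family, given, birth_date):
    """Build a comparison key: normalized, case-insensitive full name plus birth date."""
    name = unicodedata.normalize("NFC", f"{family} {given}")
    return (" ".join(name.casefold().split()), birth_date)


def build_index(csv_file):
    """Index the rows of an external dataset by (name, birth date)."""
    index = {}
    for row in csv.DictReader(csv_file):
        key = link_key(row["family_name"], row["given_names"], row["birth_date"])
        index.setdefault(key, []).append(row["id"])
    return index


def link_person(person, index):
    """Return the external identifier only when exactly one candidate matches."""
    key = link_key(person["family_name"], person["given_names"], person["birth_date"])
    candidates = index.get(key, [])
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical external table; the second and third rows are deliberately ambiguous.
external = io.StringIO(
    "id,family_name,given_names,birth_date\n"
    "ext:1,Virtanen,Matti Juhani,1912-05-03\n"
    "ext:2,Korhonen,Anna,1920-01-15\n"
    "ext:3,Korhonen,Anna,1920-01-15\n"
)
index = build_index(external)
person = {"family_name": "VIRTANEN", "given_names": "Matti  Juhani", "birth_date": "1912-05-03"}
print(link_person(person, index))
```

Refusing to link when several candidates share the same key trades recall for precision; ambiguous cases can be set aside for manual review.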

Table 1. Data sources linked to the Norssit register

For example, the RDF data corresponding to Fig. 1 is presented below (with long URIs and literal values shortened for brevity by using three periods):

[Listing: RDF data of the entry in Fig. 1]

Fig. 2. Faceted search for short biographies in the alumni register Norssit 1867–1992

Application Online. Based on the RDF data, a faceted search and browsing applicationFootnote 14, depicted in Fig. 2, was created using the SPARQL Faceter tool [6]. The first column on the left contains the following facets: (1) Text search. (2) Links to the data sources listed in Table 1. (3) Family name. (4) Place of birth. (5) Year of enrollment. (6) Year of graduation. (7) Hobbies. Each box in the rows presents a person, containing the data related to the facets and the biographical entry. There are also links to the original text, the RDF data, and the book page of the register entry. Clicking the book page link shows the page of the book from which the text comes. Especially interesting is the facet and column for links to other data sources. For example, by selecting WarSampo or Wikipedia, classmates with a record in the WarSampo Second World War history portal or with a Wikipedia page can be filtered, and their corresponding pages on these external services can be found. In this way, the reading experience of the end user can be extended substantially.
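The core idea of faceted search, filtering by the selected values and showing hit counts for the remaining options, can be sketched in memory as below. The actual application runs on the SPARQL Faceter tool against the SPARQL endpoint; this Python analogue, its field names, and its sample records are assumptions for illustration only.

```python
from collections import Counter


def _values(person, facet):
    """Return the facet values of a person as a list (a facet may be multi-valued)."""
    value = person.get(facet)
    if value is None:
        return []
    return list(value) if isinstance(value, (list, set, tuple)) else [value]


def apply_facets(people, selections):
    """Keep the people whose record contains every selected facet value."""
    return [
        p for p in people
        if all(wanted in _values(p, facet) for facet, wanted in selections.items())
    ]


def facet_counts(people, facet):
    """Count how many of the given people carry each value of a facet."""
    counts = Counter()
    for person in people:
        counts.update(_values(person, facet))
    return counts


# Fictitious records with hypothetical field names.
people = [
    {"family_name": "Virtanen", "birth_place": "Helsinki", "links": ["Wikipedia", "WarSampo"]},
    {"family_name": "Korhonen", "birth_place": "Helsinki", "links": ["Wikipedia"]},
    {"family_name": "Nieminen", "birth_place": "Turku", "links": []},
]
selected = apply_facets(people, {"links": "Wikipedia", "birth_place": "Helsinki"})
print([p["family_name"] for p in selected])
print(facet_counts(selected, "links"))
```

Recomputing the counts on the filtered result is what lets a user see, before clicking, how many persons each further selection would leave.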

Prosopographical Research. Furthermore, faceted search provides the end user with a means for filtering and studying subgroups of people in the register for prosopographical research, say persons having a Wikipedia page, born in the same area, having the same education or hobbies, etc. The upper bar of the application contains link buttons to two separate pages of visualizations that include, e.g., pie charts and histograms, based on Google Charts. When filtering selections are made on the facets, as in Fig. 2, the graphics are updated automatically. For example, a pie chart depicts the distribution of the higher education degrees of the filtered alumni subgroup, a multi-bar histogram visualizes the most common professions of the filtered persons over time, and yet another graph shows the popularity of the different universities and colleges chosen by the alumni after high school.
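The aggregations behind such charts are simple group counts over the filtered subgroup. The sketch below is illustrative only: the field names (`degree`, `profession`, `graduation_year`) are hypothetical, and using the graduation decade as the time axis is an assumption, since the text does not specify it.

```python
from collections import Counter, defaultdict


def degree_shares(people):
    """Share of each higher education degree in the subgroup (pie chart input)."""
    counts = Counter(p["degree"] for p in people if p.get("degree"))
    total = sum(counts.values())
    return {degree: n / total for degree, n in counts.items()} if total else {}


def professions_by_decade(people):
    """Profession counts per graduation decade (multi-bar histogram input)."""
    table = defaultdict(Counter)
    for p in people:
        if p.get("profession") and p.get("graduation_year"):
            decade = p["graduation_year"] // 10 * 10
            table[decade][p["profession"]] += 1
    return {decade: dict(counts) for decade, counts in sorted(table.items())}


# Fictitious filtered subgroup.
subgroup = [
    {"degree": "MSc", "profession": "engineer", "graduation_year": 1931},
    {"degree": "MD", "profession": "physician", "graduation_year": 1938},
    {"degree": "MSc", "profession": "engineer", "graduation_year": 1952},
    {"degree": None, "profession": "farmer", "graduation_year": 1955},
]
print(degree_shares(subgroup))
print(professions_by_decade(subgroup))
```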

3 Related Work and Discussion

Previous work applying Linked Data technologies to biographical data includes, e.g., [7], Biography.netFootnote 15 [8], and the Semantic National Biography of Finland [4]. The conference proceedings [1] include several papers on bringing biographical data online, on analyzing biographies with computational methods, on group portraits and networks, and on visualizations. Complementing these works, the study of this paper focuses on extracting structure from printed biographical registers. Our work also emphasizes enriching the texts with external links to other biographical datasets, and faceted search and browsing of biographical data for prosopographical studies. Our work continues on, e.g., developing new models of biographical data for prosopographical research, and on finalizing and evaluating the data linking process (precision and recall) and the demonstrator.
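Precision and recall of the data linking can be computed against a manually verified gold standard of links. The sketch below assumes that each link is represented as a (person identifier, target URI) pair; the identifiers shown are hypothetical and no evaluation results are implied.

```python
def precision_recall(predicted, gold):
    """Compare predicted links with gold-standard links.

    precision = correct predicted links / all predicted links
    recall    = correct predicted links / all gold-standard links
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical links: (person id, target URI).
predicted = {("p1", "ext:1"), ("p2", "ext:2"), ("p3", "ext:9"), ("p4", "ext:7")}
gold = {("p1", "ext:1"), ("p2", "ext:2")}
print(precision_recall(predicted, gold))
```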

Our work is part of the Severi projectFootnote 16, funded mainly by Tekes. Thanks to Vanhat Norssit for funding the digitization of the register and opening the data.