Abstract
This paper proposes enriching printed biographical person registers with linked data about events that took place after the register was published. Transforming printed historical documents into structured data makes semantic search over the texts available to the reader. More importantly, the life stories of historical persons can be extended through data linking: semantic structures are extracted from the printed texts and combined with external datasets and data services. Such linking provides an enriched context for prosopographical research on the people in the register, as well as an enhanced reading experience for anyone interested in the biographies. As a concrete case study, a register covering 1867–1992 with over 10 000 alumni of the prominent Finnish high school “Norssi” was transformed into RDF, enriched by data linking, and published as a linked data service. It is provided to end users via a faceted search engine and browser for studying the lives of historical persons and for prosopographical research.
1 Biographical Registers
Schools, professional guilds, scientific societies, and other membership organizations regularly publish biographical registers of their members. Such registers are a valuable source of personal data about groups of people. They can also strengthen the social cohesion and self-esteem of people who share, e.g., a common history, interests, or other aspects of life. To name a few examples in Finland, the government has regularly published the “State Calendar” (Suomen Valtiokalenteri) of prominent Finnish officials, the historical Student Register (Ylioppilasmatrikkeli) 1640–1852 of the University of Helsinki contains data about 18 000 early academic persons in Finland, and the labor union TEK has maintained a register of 73 100 engineers and architects in Finland since the 1930s. Registers are usually created while the persons listed are still alive.
Such registers typically contain short biographical entries of people who belong to some group, perhaps with a photo attached. Traditionally, they have been published in print, which makes it difficult to keep the data up to date. When reading an old register, a recurring problem is finding out what happened to the persons after the register was published. For example, when reading one’s old high school graduation register: what happened to the classmates afterwards?
This paper presents an overview of research underway, addressing the problem of transforming printed biographical registers into Linked Data, and enriching their contents using Named Entity Linking [2, 3]. As a concrete case study, we consider the printed register “Norssit 1867–1992. Helsingin Norssin matrikkeli”, a book of 708 pages, containing short bios of over 10 000 students and teachers of the prominent Finnish high school “Norssi”, a training school of the University of Helsinki. This school celebrates its 150th anniversary in 2017, so this is a good moment to create an enriched look back at the history of its alumni.
2 Norssi Alumni on the Semantic Web
Extracting Structure from Text. The project started by digitizing the book at the Digitization Centre of the National Library of Finland. This produced an OCR version of the book pages in XML, including the coordinates of the detected images of persons. The extracted data was then transformed into RDF, with each biographical entry extracted from the OCR text. The photos of persons were also extracted from the page images and linked with the bios. After this, a collection of regex rules and Python scripts was designed in order to (1) clean OCR errors in the data and (2) extract various pieces of information from the short bios, such as the name of the person, birth place, hobbies, and relatives mentioned. An example of a short biography in the book is depicted in Fig. 1. The extracted data was then uploaded into a SPARQL endpoint of the Linked Data Finland service [5].
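The cleaning and extraction steps can be illustrated with a minimal Python sketch. It is not the project's rule set: the OCR confusions, the entry layout ("Family, Given names * d.m.yyyy Place. ... Harr: a, b."), the abbreviation "Harr" for hobbies, and the sample entry are all assumptions made for illustration, since the actual layout of the register entries is not reproduced here.

```python
import re
from typing import Optional

# Hypothetical OCR repair rules, applied in order; a real rule set would be
# derived from the errors observed in the scanned pages.
OCR_FIXES = [
    (re.compile(r"(?<=\d)[lI](?=\d)"), "1"),  # letter l or I between digits -> 1
    (re.compile(r"(?<=\d)O(?=\d)"), "0"),     # letter O between digits -> 0
    (re.compile(r"-\s*\n\s*"), ""),           # rejoin words hyphenated at a line break
    (re.compile(r"\s+"), " "),                # collapse runs of whitespace
]

# Assumed entry layout: "Family, Given names * d.m.yyyy Place. ... Harr: a, b."
ENTRY = re.compile(
    r"^(?P<family>[^,]+),\s*(?P<given>[^*]+?)\s*"
    r"\*\s*(?P<day>\d{1,2})\.(?P<month>\d{1,2})\.(?P<year>\d{4})\s+"
    r"(?P<place>[^.]+)\."
    r"(?:.*?\bHarr:\s*(?P<hobbies>[^.]+)\.)?"
)


def clean_ocr(text: str) -> str:
    """Apply the repair rules in sequence and trim the result."""
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text.strip()


def parse_entry(raw: str) -> Optional[dict]:
    """Return the extracted fields, or None if the entry does not fit the layout."""
    match = ENTRY.match(clean_ocr(raw))
    if match is None:
        return None
    hobbies = match.group("hobbies")
    return {
        "family_name": match.group("family").strip(),
        "given_names": match.group("given").strip(),
        "birth_date": "{}-{:02d}-{:02d}".format(
            match.group("year"), int(match.group("month")), int(match.group("day"))
        ),
        "birth_place": match.group("place").strip(),
        "hobbies": [h.strip() for h in hobbies.split(",")] if hobbies else [],
    }


# Invented sample entry with two typical OCR problems (19l2, line-break hyphen).
SAMPLE = "Virtanen, Matti Juhani * 12.3.19l2 Hel-\nsinki. Yo 1931. Harr: purjehdus, shakki."
record = parse_entry(SAMPLE)
```

Entries that do not fit the assumed layout return None, so they can be set aside for manual review instead of producing partial records.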
From a data linking viewpoint, the birth date and full name of each person were known at this point and could be used to enrich the data from the datasets listed in Table 1. Links were created to Wikipedia, Wikidata, the National Biography of Finland and its Swedish complement BLF, BookSampo Linked Data, the CultureSampo portal, the WarSampo portal, the ULAN authority register by The J. Paul Getty Trust, VIAF, and the genealogical data service Geni. For entity linking to databases offering a SPARQL endpoint, the tool SPARQL ARPA was used. Where the database provides a REST API, as with Wikipedia or Geni.com, a purpose-built Python script was used. The same script was also used for BLF, where the data was available as a CSV table.
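As a rough illustration of linking against a CSV table by name and birth date, consider the sketch below. It is not the script used in the project. The column names (id, family_name, given_names, birth_date), the identifiers, and the matching rule (family name, first given name, and ISO birth date must agree, and the match must be unique) are assumptions chosen for the example.

```python
import csv
import io
import unicodedata
from typing import Optional


def norm(text: str) -> str:
    """Unicode-normalise, case-fold, and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def match_key(family: str, given: str, birth_date: str) -> tuple:
    """Key on family name, first given name, and birth date (an assumed rule)."""
    first_given = norm(given).split(" ")[0]
    return (norm(family), first_given, birth_date.strip())


def build_index(csv_text: str) -> dict:
    """Map each match key to the external identifiers that share it."""
    index = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = match_key(row["family_name"], row["given_names"], row["birth_date"])
        index.setdefault(key, []).append(row["id"])
    return index


def link(person: dict, index: dict) -> Optional[str]:
    """Return the external id only when exactly one candidate matches."""
    key = match_key(person["family_name"], person["given_names"], person["birth_date"])
    candidates = index.get(key, [])
    return candidates[0] if len(candidates) == 1 else None


# Invented external table for illustration only.
SAMPLE_CSV = """id,family_name,given_names,birth_date
ext:1,Virtanen,Matti,1912-03-12
ext:2,Virtanen,Matti,1930-01-05
ext:3,Mäkinen,Anna Liisa,1925-07-30
"""

index = build_index(SAMPLE_CSV)
linked = link(
    {"family_name": "VIRTANEN", "given_names": "Matti Juhani", "birth_date": "1912-03-12"},
    index,
)
```

Requiring a unique candidate trades recall for precision: namesakes born on the same day are left unlinked instead of being guessed.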
For example, the RDF data corresponding to Fig. 1 is presented below (with long URIs and literal values shortened for brevity by using three periods):
Application Online. Based on the RDF data, a faceted search and browsing application, depicted in Fig. 2, was created using the SPARQL Faceter tool [6]. The first column on the left contains the following facets: (1) Text search. (2) Links to the data sources listed in Table 1. (3) Family name. (4) Place of birth. (5) Year of enrollment. (6) Year of graduation. (7) Hobbies. Each box in the rows presents a person and contains the data related to the facets and the biography. There are also links to the original text, the RDF data, and the book page of the register entry. Clicking the book page link shows the page of the book from which the text comes. Especially interesting is the facet and column for links to other data sources. For example, by selecting WarSampo or Wikipedia, classmates with a history in the WarSampo Second World War history portal or with a Wikipedia page can be filtered, and the corresponding pages on these external services can be found. In this way, the reading experience of the end user can be extended substantially.
Prosopographical Research. Furthermore, faceted search gives the end user a means of filtering and studying subgroups of people in the register for prosopographical research, say persons having a Wikipedia page, born in the same area, or having the same education or hobbies. The upper bar of the application contains link buttons to two separate pages of visualizations that include, e.g., pie charts and histograms, based on Google Charts. When filtering selections are made on the facets as in Fig. 2, the graphics are updated automatically. For example, a pie chart depicts the distribution of the higher education degrees of the filtered alumni subgroup, a multi-bar histogram visualizes the most common professions of the filtered persons over time, and yet another graph shows the popularity of different universities and colleges chosen by the alumni after high school.
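The logic of faceted filtering can be sketched in a few lines of Python. The sketch works on an in-memory list and is not how SPARQL Faceter is implemented (that tool operates against a SPARQL endpoint). The field names and the three sample persons are invented for illustration.

```python
from collections import Counter

# Invented sample records; the field names are assumptions.
PEOPLE = [
    {"family_name": "Virtanen", "birth_place": "Helsinki", "graduation_year": 1931,
     "hobbies": ["shakki", "purjehdus"], "links": ["Wikipedia", "WarSampo"]},
    {"family_name": "Mäkinen", "birth_place": "Turku", "graduation_year": 1943,
     "hobbies": ["shakki"], "links": ["WarSampo"]},
    {"family_name": "Lahtinen", "birth_place": "Helsinki", "graduation_year": 1950,
     "hobbies": [], "links": []},
]


def values_of(person: dict, facet: str) -> list:
    """Return the facet values of a person as a list (single values are wrapped)."""
    value = person.get(facet)
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple, set)) else [value]


def facet_search(people: list, selections: dict) -> list:
    """Keep the persons that match every selected facet value."""
    return [
        p for p in people
        if all(wanted in values_of(p, facet) for facet, wanted in selections.items())
    ]


def facet_counts(people: list, facet: str) -> Counter:
    """Count how many of the given persons carry each value of a facet."""
    counts = Counter()
    for person in people:
        counts.update(values_of(person, facet))
    return counts


# Select everyone linked to WarSampo, then see which birth places remain.
war_sampo = facet_search(PEOPLE, {"links": "WarSampo"})
remaining_places = facet_counts(war_sampo, "birth_place")
```

Recomputing the counts on the filtered result is what lets each facet show only the values still available after the current selections.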
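The aggregations behind such charts can be sketched as follows. This is a minimal illustration, not the application's code: the field names, the sample subgroup, and the choice of graduation decade as the time axis are assumptions.

```python
from collections import Counter, defaultdict


def degree_distribution(people: list) -> Counter:
    """Count persons per higher education degree (input for a pie chart)."""
    return Counter(p["degree"] for p in people if p.get("degree"))


def professions_by_decade(people: list) -> dict:
    """Count professions per graduation decade (input for a multi-bar histogram)."""
    table = defaultdict(Counter)
    for person in people:
        year = person.get("graduation_year")
        profession = person.get("profession")
        if year and profession:
            table[year // 10 * 10][profession] += 1
    return {decade: dict(counts) for decade, counts in sorted(table.items())}


# Invented filtered subgroup for illustration only.
SUBGROUP = [
    {"graduation_year": 1931, "degree": "MSc", "profession": "engineer"},
    {"graduation_year": 1938, "degree": "MSc", "profession": "teacher"},
    {"graduation_year": 1943, "degree": "MD", "profession": "physician"},
    {"graduation_year": 1947, "degree": None, "profession": "engineer"},
]

degrees = degree_distribution(SUBGROUP)
by_decade = professions_by_decade(SUBGROUP)
```

Running both functions again on each new facet selection gives the automatic chart updates described above.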
3 Related Work and Discussion
Previous work applying Linked Data technologies to biographical data includes, e.g., [7], BiographyNet [8], and the Semantic National Biography of Finland [4]. The conference proceedings [1] include several papers on bringing biographical data online, on analyzing biographies with computational methods, on group portraits and networks, and on visualizations. Complementing these works, this paper focuses on extracting structure from printed biographical registers. Our work also emphasizes enriching the texts with external links to other biographical datasets, and faceted search and browsing of biographical data for prosopographical studies. Our work continues, e.g., on developing new models of biographical data for prosopographical research, and on finalizing and evaluating the data linking process (precision and recall) and the demonstrator.
Our work is part of the Severi project, funded mainly by Tekes. Thanks to Vanhat Norssit for funding the digitization of the register and opening the data.
References
1. ter Braake, S., Fokkens, A., Sluijter, R., Declerck, T., Wandl-Vogt, E.: BD2015 Biographical Data in a Digital World 2015, CEUR Workshop Proceedings (2015). http://ceur-ws.org/Vol-1399/
2. Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. EACL 6, 9–16 (2006)
3. Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R.: Evaluating entity linking with Wikipedia. Artif. Intell. 194, 130–150 (2013)
4. Hyvönen, E., Alonen, M., Ikkala, E., Mäkelä, E.: Life stories as event-based linked data: case semantic national biography. In: Horridge, M., Rospocher, M., van Ossenbruggen, J. (eds.) Proceedings of ISWC 2014 Posters and Demonstrations Track, CEUR Workshop Proceedings, pp. 1–4 (2014). http://ceur-ws.org/Vol-1272/paper_5.pdf
5. Hyvönen, E., Tuominen, J., Alonen, M., Mäkelä, E.: Linked data Finland: a 7-star model and platform for publishing and re-using linked datasets. In: Presutti, V., Blomqvist, E., Troncy, R., Sack, H., Papadakis, I., Tordai, A. (eds.) ESWC 2014. LNCS, vol. 8798, pp. 226–230. Springer, Cham (2014). doi:10.1007/978-3-319-11955-7_24
6. Koho, M., Heino, E., Hyvönen, E.: SPARQL Faceter – client-side faceted search based on SPARQL. In: Troncy, R., Verborgh, R., Nixon, L., Kurz, T., Schlegel, K., Sande, M.V. (eds.) Joint Proceedings of the 4th International Workshop on Linked Media and the 3rd Developers Hackshop, CEUR Workshop Proceedings (2016). http://ceur-ws.org/Vol-1615/semdevPaper5.pdf
7. Larson, R.: Bringing lives to light: biography in context. Final Project Report, University of Berkeley (2010). http://metadata.berkeley.edu/Biography_Final_Report.pdf
8. Ockeloen, N., Fokkens, A., ter Braake, S., Vossen, P., De Boer, V., Schreiber, G., Legêne, S.: BiographyNet: managing provenance at multiple levels and from different perspectives. In: Groth, P., van Erp, M., Kauppinen, T., Zhao, J., Keßler, C., Pouchard, L.C., Goble, C., Gil, Y., van Ossenbruggen, J. (eds.) Proceedings of the 3rd International Workshop on Linked Science, LISC 2013, pp. 59–71. CEUR Workshop Proceedings (2013). http://ceur-ws.org/Vol-1116/paper7.pdf
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Hyvönen, E., Leskinen, P., Heino, E., Tuominen, J., Sirola, L. (2017). Reassembling and Enriching the Life Stories in Printed Biographical Registers: Norssi High School Alumni on the Semantic Web. In: Gracia, J., Bond, F., McCrae, J., Buitelaar, P., Chiarcos, C., Hellmann, S. (eds) Language, Data, and Knowledge. LDK 2017. Lecture Notes in Computer Science(), vol 10318. Springer, Cham. https://doi.org/10.1007/978-3-319-59888-8_9
DOI: https://doi.org/10.1007/978-3-319-59888-8_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59887-1
Online ISBN: 978-3-319-59888-8
eBook Packages: Computer Science, Computer Science (R0)