Abstract
This paper presents the idea and our work of extracting and reassembling a genealogical network automatically from a collection of biographies. The network can be used as a tool for network analysis of historical persons. The data has been published as Linked Data and as an interactive online service as part of the in-use data service and semantic portal BiographySampo—Finnish Biographies on the Semantic Web.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
1 Introduction
Extracting and inferring social or genealogical networks from historical documents can provide new information for biographical and prosopographical [1] research. However, genealogical data is often available only in textual form providing challenges for knowledge extraction: How to identify persons and their gender by different name forms? How to disambiguate namesakes in different times? How to extract the genealogical relations between the mentions? This paper presents a case study for extracting the explicit genealogical network implicit in the national collection of 13 144 Finnish biographiesFootnote 1. The methodological idea is to combine regular expression identification, imprecise proper name matching, gender information, and data about expected lifespans for more accurate results. The system was evaluated with promising results, and a tool was constructed, based on Linked Data, for examining the underlying network of \(\sim \)81 000 extracted basic relations “parent”, “spouse”, and “child”. On top of the Linked Data service, a new application was created for studying the networks interactively as a new part of the in-use BiographySampoFootnote 2 system [2].
Related Work. Extracting and studying biographical networks has been researched in the Six Degrees of Francis Bacon [3] project and BiographyNet [4]. Articles [5, 6] discuss extracting genealogical networks from multi-source vital records. For the large public there are many crowd-sourcing-based commercial genealogy websites, such as ancestry.com, myheritage.com, and geni.com. This paper extends our earlier papers about BiographySampo [2] and network analysis based on biographical link references into extraction of genealogical networks [7, 8], and presents an application view for studying such networks interactively.
2 Extracting Genealogical Networks from Texts
Dataset. BiographySampo is a semantic portal based on a knowledge base that has been created using natural language processing methods, linked data, and semantic web technologies. It contains 13 144 biographies of notable Finns that can be browsed through a faceted search application and using tools for Digital Humanities research. [9] In addition to the genealogical network discussed in this paper, the data been a source for reference network extraction [7, 8].
Pattern-Based Knowledge Extraction. Many biographies in the dataset include semi-formal textual descriptions of family relations of the protagonist. As an example, the description of baroness Elisabeth JärnefeltFootnote 3 is given below:
Jelizaveta Konstantinovna Clodt von Jürgensburg from year 1857 known as Järnefelt, Elisabeth S 11.1.1839 Pietari, K 3.2.1929 Helsinki.
V Baron, major general Konstantin Karlovitsh Clodt von Jürgensburg and Catharine Vigné.
P 1857– senator, governor, lieutenant general August Alexander Järnefelt S 1833, K 1896, PV bailiff Gustaf Adolf Järnefelt and Aurora Fredrika Molander.
Children: Caspar (Kasper) Woldemar S 1859, K 1941, critic, translator, Russian language teacher, painter, P Emma Ahonen; Edvard Armas S 1869, K 1958, conductor, composer, professor, P1 songstress Maikki Pakarinen, P2 songstress Olivia (Liva) Edström; Aina (Aino) S 1871, K 1969, P composer Jean Sibelius;
The semi-formal expressions here have uniformity in structure that can be used effectively for pattern-based information extraction: First, the given and family names are mentioned and after that the years of birth S and death K. The description provides information about the parents (marked with V), spouses (P), parents-in-law (PV), children, and children-in-law of the protagonist.
One major problem in knowledge extraction here is recognizing the same person, here Elisabeth Järnefelt, referenced with different names: Jelizaveta Konstantinovna Clodt von Jürgensburg, Elisabeth Clodt von Jürgensburg or most commonly Elisabeth Järnefelt. On the other hand, same names are used in families over and over again. For example, there is a case of four people with name Christian Trapp, a grandfather, a father, a sonFootnote 4, and a grandson. They cannot be distinguished without additional information about their known lifespans.
Data Processing. In our knowledge extraction pipeline, the genealogical textual description of the protagonist is first divided into the parts describing his/her parents, spouses (wife/husband distinction is not known at this point), and children. The division is based on using regular expressions matching the punctuation and the tokens V, P, PV.
The years of birth, death, or marriage are easily separated from the text sequence. To separate occupational descriptions from the proper names, we used the ARPA serviceFootnote 5 together with vocabularies of Finnish female, male, and family namesFootnote 6. The extracted names were used to reason the gender of the person, which was used to refine relations, e.g., to specify a parent as a mother or a father. For the network, the spouses were linked with the children by the known years of marriage and child birth.
To gain detailed vital information for disambiguation, we reasoned lifetime estimates, e.g., the missing years of birth of the parents based on the known birth year of their child. The estimates were constructed by first collecting the years of births of a parent and a child from the known cases in data. The distributions of parent ages at child birth are depicted in Fig. 1. To reason the ages of spouses, a similar study was performed with the result that 99% of differences between the births of a husband and wife is in the range of -18–+35 years. The more relatives with known records a person has, the more precise the estimates are.
3 Evaluation
For evaluation we randomly chose 50 biographies, and manually compared the texts with the extracted results. The test set had mentions of 170 people. We compared the data fields of person names, years of birth and death, gender, occupation, and relation type. According to our evaluation 94.5% of the extracted people records were mentioned only in a single biography; the accuracy was 97.3%. For people mentioned in multiple biographies the accuracy was 80.4%; our system could not identify all mentions referring to same people.
For an example of the extracted network, a part of genealogical network of Elisabeth JärnefeltFootnote 7 is depicted in Fig. 2. She is in the largest connected component in our network. This component contains 2694 family relation links and connects 1835 people in 250 biographies. To further enrich the web portal, the immediate family relations were used to reasonsFootnote 8 like siblings, cousins, uncles, aunts, grandparents, grandchildren, and relatives-in-law.
Notes
- 1.
https://kansallisbiografia.fi/, accessed 20 March 2019.
- 2.
Online at: http://biografiasampo.fi; cf. project homepage for further information and publications: https://seco.cs.aalto.fi/projects/biografiasampo/en/.
- 3.
- 4.
- 5.
http://seco.cs.aalto.fi/projects/dcert/, accessed: 9 March 2019.
- 6.
https://www.avoindata.fi/data/en_GB/dataset/none, accessed: 20 March 2019.
- 7.
- 8.
References
Verboven, K., Carlier, M., Dumolyn, J.: A short manual to the art of prosopography. In: Prosopography approaches and applications. A handbook. Unit for Prosopographical Research (Linacre College), pp. 35–70 (2007)
Hyvönen, E., et al.: BiographySampo – publishing and enriching biographies on the semantic web for digital humanities research. In: Hitzler, P., et al. (eds.) ESWC 2019. LNCS, vol. 11503, pp. 574–589. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-21348-0_37
Finegold, M., Otis, J., Shalizi, C., Shore, D., Wang, L., Warren, C.: Six degrees of Francis Bacon: a statistical method for reconstructing large historical social networks. Digit. Hum. Q. 10(3) (2016)
Ockeloen, N., et al.: BiographyNet: managing provenance at multiple levels and from different perspectives. In: Proceedings of the 3rd International Conference on Linked Science (LISC 2013), vol. 1116, pp. 59–71. CEUR Workshop Proceedings (2013)
Efremova, J., et al.: Multi-source entity resolution for genealogical data. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds.) Population Reconstruction, pp. 129–154. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19884-2_7
Malmi, E., Rasa, M., Gionis, A.: AncestryAI: a tool for exploring computationally inferred family trees. In: Proceedings of the 26th International Conference on World Wide Web Companion, pp. 257–261. International World Wide Web Conferences Steering Committee (2017)
Tamper, M., Leskinen, P., Apajalahti, K., Hyvönen, E.: Using biographical texts as linked data for prosopographical research and applications. In: Ioannides, M., et al. (eds.) EuroMed 2018. LNCS, vol. 11196, pp. 125–137. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01762-0_11
Tamper, M., Hyvönen, E., Leskinen, P.: Visualizing and analyzing networks of named entities in biographical dictionaries for digital humanities research. In: Proceedings of the 20th International Conference on Computational Linguistics and Intelligent Text Processing (CI-Cling 2019). Springer, April 2019, accepted
Hyvönen, E., Leskinen, P., Tamper, M., Tuominen, J., Keravuori, K.: Semantic National Biography of Finland. In: Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference (DHN 2018), vol. 2084, pp. 372–385. CEUR Workshop Proceedings (2018). http://www.ceur-ws.org/Vol-2084/short12.pdf
Acknowledgements
Thanks to Business Finland for financial support and CSC – IT Center for Science, Finland, for computational resources.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Leskinen, P., Hyvönen, E. (2019). Extracting Genealogical Networks of Linked Data from Biographical Texts. In: Hitzler, P., et al. The Semantic Web: ESWC 2019 Satellite Events. ESWC 2019. Lecture Notes in Computer Science(), vol 11762. Springer, Cham. https://doi.org/10.1007/978-3-030-32327-1_24
Download citation
DOI: https://doi.org/10.1007/978-3-030-32327-1_24
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32326-4
Online ISBN: 978-3-030-32327-1
eBook Packages: Computer ScienceComputer Science (R0)