Extracting Genealogical Networks of Linked Data from Biographical Texts

Leskinen, Petri; Hyvönen, Eero

doi:10.1007/978-3-030-32327-1_24

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11762)

Included in the following conference series:

European Semantic Web Conference

Abstract

This paper presents the idea and our work of extracting and reassembling a genealogical network automatically from a collection of biographies. The network can be used as a tool for network analysis of historical persons. The data has been published as Linked Data and as an interactive online service as part of the in-use data service and semantic portal BiographySampo—Finnish Biographies on the Semantic Web.

1 Introduction

Extracting and inferring social or genealogical networks from historical documents can provide new information for biographical and prosopographical [1] research. However, genealogical data is often available only in textual form, which poses challenges for knowledge extraction: How can persons and their gender be identified from different name forms? How can namesakes from different times be disambiguated? How can the genealogical relations between the mentions be extracted? This paper presents a case study of making explicit the genealogical network implicit in the national collection of 13 144 Finnish biographies^{Footnote 1}. The methodological idea is to combine regular expression identification, imprecise proper name matching, gender information, and data about expected lifespans for more accurate results. The system was evaluated with promising results, and a tool based on Linked Data was constructed for examining the underlying network of approximately 81 000 extracted basic relations “parent”, “spouse”, and “child”. On top of the Linked Data service, a new application was created for studying the networks interactively as a new part of the in-use BiographySampo^{Footnote 2} system [2].

Related Work. Extracting and studying biographical networks has been researched in the Six Degrees of Francis Bacon [3] project and in BiographyNet [4]. Articles [5, 6] discuss extracting genealogical networks from multi-source vital records. For the general public there are many commercial genealogy websites based on crowdsourcing, such as ancestry.com, myheritage.com, and geni.com. This paper extends our earlier papers about BiographySampo [2] and network analysis based on biographical link references [7, 8] into extraction of genealogical networks, and presents an application view for studying such networks interactively.

2 Extracting Genealogical Networks from Texts

Dataset. BiographySampo is a semantic portal based on a knowledge base that has been created using natural language processing methods, linked data, and semantic web technologies. It contains 13 144 biographies of notable Finns that can be browsed through a faceted search application and studied using tools for Digital Humanities research [9]. In addition to the genealogical network discussed in this paper, the data has been a source for reference network extraction [7, 8].

Pattern-Based Knowledge Extraction. Many biographies in the dataset include semi-formal textual descriptions of family relations of the protagonist. As an example, the description of baroness Elisabeth Järnefelt^{Footnote 3} is given below:

Jelizaveta Konstantinovna Clodt von Jürgensburg from year 1857 known as Järnefelt, Elisabeth S 11.1.1839 Pietari, K 3.2.1929 Helsinki.

V Baron, major general Konstantin Karlovitsh Clodt von Jürgensburg and Catharine Vigné.

P 1857– senator, governor, lieutenant general August Alexander Järnefelt S 1833, K 1896, PV bailiff Gustaf Adolf Järnefelt and Aurora Fredrika Molander.

Children: Caspar (Kasper) Woldemar S 1859, K 1941, critic, translator, Russian language teacher, painter, P Emma Ahonen; Edvard Armas S 1869, K 1958, conductor, composer, professor, P1 songstress Maikki Pakarinen, P2 songstress Olivia (Liva) Edström; Aina (Aino) S 1871, K 1969, P composer Jean Sibelius;

The semi-formal expressions here have uniformity in structure that can be used effectively for pattern-based information extraction: First, the given and family names are mentioned and after that the years of birth S and death K. The description provides information about the parents (marked with V), spouses (P), parents-in-law (PV), children, and children-in-law of the protagonist.

One major problem in knowledge extraction here is recognizing the same person, here Elisabeth Järnefelt, when she is referenced with different names: Jelizaveta Konstantinovna Clodt von Jürgensburg, Elisabeth Clodt von Jürgensburg, or most commonly Elisabeth Järnefelt. On the other hand, the same names are used in families over and over again. For example, there is a case of four people with the name Christian Trapp: a grandfather, a father, a son^{Footnote 4}, and a grandson. They cannot be distinguished without additional information about their known lifespans.
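The name-variant problem can be illustrated with a small sketch of imprecise name matching. This is not the matching method used in the paper, which is not specified in detail; it is an assumed token-based comparison built on Python's standard `difflib`, and the function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of Unicode letters


def token_similarity(a, b):
    """Character-level similarity of two name tokens, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def name_similarity(name_a, name_b):
    """Average, over the tokens of the shorter name, of the best match
    found among the tokens of the longer name (word order is ignored)."""
    tokens_a = WORD_RE.findall(name_a)
    tokens_b = WORD_RE.findall(name_b)
    if not tokens_a or not tokens_b:
        return 0.0
    short, long_ = sorted((tokens_a, tokens_b), key=len)
    scores = [max(token_similarity(t, u) for u in long_) for t in short]
    return sum(scores) / len(scores)


# Spelling variants of the same person score high ...
variant_score = name_similarity("Caspar Woldemar Järnefelt",
                                "Kasper Woldemar Järnefelt")
# ... but so do true namesakes, which is why names alone are not enough.
namesake_score = name_similarity("Christian Trapp", "Christian Trapp")
```

A score of this kind only ranks candidate matches. Namesakes such as the four Christian Trapps receive a perfect score, so lifespan information is still needed to separate them.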

Data Processing. In our knowledge extraction pipeline, the genealogical textual description of the protagonist is first divided into the parts describing his/her parents, spouses (wife/husband distinction is not known at this point), and children. The division is based on using regular expressions matching the punctuation and the tokens V, P, PV.
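As a minimal sketch of this division step, the following code splits the translated example above into parent, spouse, and child parts with regular expressions. The patterns and the function name `split_family_description` are assumptions made for illustration; the actual expressions used in the pipeline are not given in the paper, and the original Finnish texts may use other markers.

```python
import re

# Sample in the translated form shown above. The section markers (V, P, PV,
# "Children:") are taken from that example; real entries may vary.
SAMPLE = """\
V Baron, major general Konstantin Karlovitsh Clodt von Jürgensburg and Catharine Vigné.
P 1857– senator, governor, lieutenant general August Alexander Järnefelt S 1833, K 1896, PV bailiff Gustaf Adolf Järnefelt and Aurora Fredrika Molander.
Children: Caspar (Kasper) Woldemar S 1859, K 1941, critic, translator, Russian language teacher, painter, P Emma Ahonen; Edvard Armas S 1869, K 1958, conductor, composer, professor, P1 songstress Maikki Pakarinen, P2 songstress Olivia (Liva) Edström; Aina (Aino) S 1871, K 1969, P composer Jean Sibelius;
"""

SECTION_RE = re.compile(r"^(?P<tag>V|P\d*|Children:)\s+(?P<body>.+)$")
IN_LAW_RE = re.compile(r",\s*PV\s+")          # parents-in-law inside a spouse entry
CHILD_SPOUSE_RE = re.compile(r",\s*P\d*\s+")  # P, P1, P2, ... inside a child entry
MARRIAGE_RE = re.compile(r"^(\d{4})\s*[–-]\s*(\d{4})?\s*")
AND_RE = re.compile(r"\s+and\s+")


def split_family_description(text):
    """Split a description into parent, spouse, and child parts."""
    parts = {"parents": [], "spouses": [], "children": []}
    for line in text.splitlines():
        match = SECTION_RE.match(line.strip())
        if match is None:
            continue  # e.g. a header line with the protagonist's own names
        tag = match.group("tag")
        body = match.group("body").rstrip(" ;.")
        if tag == "V":
            parts["parents"] = AND_RE.split(body)
        elif tag == "Children:":
            for entry in body.split(";"):
                child, *spouses = CHILD_SPOUSE_RE.split(entry.strip())
                parts["children"].append({"child": child, "spouses": spouses})
        else:  # P, P1, P2, ...
            spouse, *in_laws = IN_LAW_RE.split(body, maxsplit=1)
            married = MARRIAGE_RE.match(spouse)
            parts["spouses"].append({
                "married": int(married.group(1)) if married else None,
                "spouse": spouse[married.end():] if married else spouse,
                "parents_in_law": AND_RE.split(in_laws[0]) if in_laws else [],
            })
    return parts


parts = split_family_description(SAMPLE)
```

Each extracted part still mixes titles, names, and years (for example `"Aina (Aino) S 1871, K 1969"`), so it needs further processing before it can be turned into person records.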

The years of birth, death, or marriage are easily separated from the text sequence. To separate occupational descriptions from the proper names, we used the ARPA service^{Footnote 5} together with vocabularies of Finnish female, male, and family names^{Footnote 6}. The extracted names were used to infer the gender of the person, which in turn was used to refine relations, e.g., to specify a parent as a mother or a father. For the network, the spouses were linked with the children based on the known years of marriage and of the children's births.
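The vocabulary-based part of this step can be sketched as follows. The code does not call the ARPA service; it only illustrates the idea with tiny hand-made name sets, which are assumptions standing in for the published vocabularies, and all function names are hypothetical.

```python
import re

# Tiny illustrative vocabularies (assumptions); the paper uses published
# lists of Finnish female, male, and family names instead.
FEMALE_NAMES = {"Catharine", "Aurora", "Fredrika", "Elisabeth", "Aina", "Emma"}
MALE_NAMES = {"Konstantin", "Gustaf", "Adolf", "August", "Alexander", "Jean"}

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of Unicode letters

REFINED = {
    "parent": {"female": "mother", "male": "father"},
    "spouse": {"female": "wife", "male": "husband"},
    "child": {"female": "daughter", "male": "son"},
}


def split_title_and_name(description):
    """Treat everything before the first known given name as a title."""
    for word in WORD_RE.finditer(description):
        if word.group() in FEMALE_NAMES | MALE_NAMES:
            title = description[:word.start()].strip(" ,")
            return title, description[word.start():].strip()
    return "", description.strip()


def infer_gender(name):
    """Majority vote over given names; None when the vote is undecided."""
    words = WORD_RE.findall(name)
    female = sum(w in FEMALE_NAMES for w in words)
    male = sum(w in MALE_NAMES for w in words)
    if female == male:
        return None
    return "female" if female > male else "male"


def refine_relation(relation, name):
    """Specialize e.g. 'parent' to 'mother' or 'father' when gender is known."""
    return REFINED.get(relation, {}).get(infer_gender(name), relation)


title, name = split_title_and_name("bailiff Gustaf Adolf Järnefelt")
relation = refine_relation("parent", name)
```

When no given name is recognized, the relation is left unrefined rather than guessed.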

To gain detailed vital information for disambiguation, we inferred lifetime estimates, e.g., the missing years of birth of the parents based on the known birth year of their child. The estimates were constructed by first collecting the birth years of parents and their children from the known cases in the data. The distributions of parent ages at the birth of a child are depicted in Fig. 1. To estimate the ages of spouses, a similar study was performed, with the result that 99% of the differences between the birth years of a husband and a wife are in the range from −18 to +35 years. The more relatives with known records a person has, the more precise the estimates are.
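The following sketch shows one way such estimates could be combined: each known relative yields an interval for the missing birth year, and the intervals are intersected. The parent age bounds are placeholder assumptions (the actual distributions are those of Fig. 1), the sign convention of the spouse range is an assumption, and the function names are hypothetical.

```python
# Plausible parent ages at the birth of a child. These bounds are placeholders
# (assumptions); the paper derives the actual distributions from its data.
PARENT_AGE = {"mother": (15, 50), "father": (15, 70)}
# 99% range reported in the text. Reading it as (wife's birth year minus
# husband's birth year) is an assumption about the sign convention.
SPOUSE_DIFF = (-18, 35)


def birth_range_from_child(child_birth_year, role):
    """Possible birth years of a parent, given one child's birth year."""
    youngest, oldest = PARENT_AGE[role]
    return (child_birth_year - oldest, child_birth_year - youngest)


def birth_range_from_spouse(spouse_birth_year, role):
    """Possible birth years of a 'wife' or 'husband', given the spouse's."""
    low, high = SPOUSE_DIFF
    if role == "wife":
        return (spouse_birth_year + low, spouse_birth_year + high)
    return (spouse_birth_year - high, spouse_birth_year - low)


def intersect(ranges):
    """Intersection of (low, high) intervals, or None if they are disjoint."""
    low = max(r[0] for r in ranges)
    high = min(r[1] for r in ranges)
    return (low, high) if low <= high else None


def may_be_same_person(range_a, range_b):
    """Two mentions can refer to one person only if their ranges overlap."""
    return intersect([range_a, range_b]) is not None


# A mother with children born in 1859, 1869, and 1871 and a husband born in 1833.
evidence = [birth_range_from_child(year, "mother") for year in (1859, 1869, 1871)]
evidence.append(birth_range_from_spouse(1833, "wife"))
estimate = intersect(evidence)
```

With these placeholder bounds, a single child born in 1859 gives a 35-year interval, while all four relatives together narrow it to 23 years. Mentions of namesakes whose intervals do not overlap can be kept apart.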

3 Evaluation

For evaluation, we randomly chose 50 biographies and manually compared the texts with the extracted results. The test set had mentions of 170 people. We compared the data fields of person names, years of birth and death, gender, occupation, and relation type. According to our evaluation, 94.5% of the extracted person records were mentioned in only a single biography; for these the accuracy was 97.3%. For people mentioned in multiple biographies the accuracy was 80.4%, because our system could not identify all mentions referring to the same people.

As an example of the extracted network, a part of the genealogical network of Elisabeth Järnefelt^{Footnote 7} is depicted in Fig. 2. She belongs to the largest connected component of our network. This component contains 2694 family relation links and connects 1835 people in 250 biographies. To further enrich the web portal, the immediate family relations were used to infer further relations^{Footnote 8}, such as siblings, cousins, uncles, aunts, grandparents, grandchildren, and relatives-in-law.
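How more distant relations and connected components follow from the basic "parent" and "spouse" links can be sketched in plain Python. The small graph below uses shortened names from the Järnefelt example; "Person A" and "Person B" are hypothetical placeholders for an unrelated family, and the portal itself may derive these relations differently, for instance with queries over the Linked Data.

```python
from collections import defaultdict, deque

PARENT_OF = [  # (parent, child)
    ("Konstantin", "Elisabeth"), ("Catharine", "Elisabeth"),
    ("Gustaf Adolf", "August"), ("Aurora", "August"),
    ("Elisabeth", "Kasper"), ("August", "Kasper"),
    ("Elisabeth", "Armas"), ("August", "Armas"),
    ("Elisabeth", "Aino"), ("August", "Aino"),
    ("Person A", "Person B"),  # hypothetical, unrelated family
]
SPOUSE_OF = [("Elisabeth", "August"), ("Aino", "Jean Sibelius")]

parents = defaultdict(set)
children = defaultdict(set)
for parent, child in PARENT_OF:
    parents[child].add(parent)
    children[parent].add(child)


def siblings(person):
    """People who share at least one parent with the given person."""
    found = set()
    for parent in parents.get(person, ()):
        found |= children[parent]
    found.discard(person)
    return found


def grandparents(person):
    """Parents of the person's parents."""
    return {g for p in parents.get(person, ()) for g in parents.get(p, ())}


def largest_component():
    """Largest set of people connected by parent or spouse links."""
    graph = defaultdict(set)
    for a, b in PARENT_OF + SPOUSE_OF:
        graph[a].add(b)
        graph[b].add(a)
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from an unvisited person
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best
```

Cousins, uncles, aunts, and relatives-in-law can be composed in the same way from the `parents`, `children`, and spouse links.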

Notes

1. https://kansallisbiografia.fi/, accessed 20 March 2019.
2. Online at: http://biografiasampo.fi; cf. project homepage for further information and publications: https://seco.cs.aalto.fi/projects/biografiasampo/en/.
3. http://biografiasampo.fi/henkilo/p3148.
4. http://biografiasampo.fi/henkilo/p10013.
5. http://seco.cs.aalto.fi/projects/dcert/, accessed: 9 March 2019.
6. https://www.avoindata.fi/data/en_GB/dataset/none, accessed: 20 March 2019.
7. http://biografiasampo.fi/henkilo/p3148/sukulaiset.
8. http://biografiasampo.fi/henkilo/p3148.

References

1. Verboven, K., Carlier, M., Dumolyn, J.: A short manual to the art of prosopography. In: Prosopography approaches and applications. A handbook. Unit for Prosopographical Research (Linacre College), pp. 35–70 (2007)
2. Hyvönen, E., et al.: BiographySampo – publishing and enriching biographies on the semantic web for digital humanities research. In: Hitzler, P., et al. (eds.) ESWC 2019. LNCS, vol. 11503, pp. 574–589. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-21348-0_37
3. Finegold, M., Otis, J., Shalizi, C., Shore, D., Wang, L., Warren, C.: Six degrees of Francis Bacon: a statistical method for reconstructing large historical social networks. Digit. Hum. Q. 10(3) (2016)
4. Ockeloen, N., et al.: BiographyNet: managing provenance at multiple levels and from different perspectives. In: Proceedings of the 3rd International Conference on Linked Science (LISC 2013), vol. 1116, pp. 59–71. CEUR Workshop Proceedings (2013)
5. Efremova, J., et al.: Multi-source entity resolution for genealogical data. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds.) Population Reconstruction, pp. 129–154. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19884-2_7
6. Malmi, E., Rasa, M., Gionis, A.: AncestryAI: a tool for exploring computationally inferred family trees. In: Proceedings of the 26th International Conference on World Wide Web Companion, pp. 257–261. International World Wide Web Conferences Steering Committee (2017)
7. Tamper, M., Leskinen, P., Apajalahti, K., Hyvönen, E.: Using biographical texts as linked data for prosopographical research and applications. In: Ioannides, M., et al. (eds.) EuroMed 2018. LNCS, vol. 11196, pp. 125–137. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01762-0_11
8. Tamper, M., Hyvönen, E., Leskinen, P.: Visualizing and analyzing networks of named entities in biographical dictionaries for digital humanities research. In: Proceedings of the 20th International Conference on Computational Linguistics and Intelligent Text Processing (CI-Cling 2019). Springer, April 2019, accepted
9. Hyvönen, E., Leskinen, P., Tamper, M., Tuominen, J., Keravuori, K.: Semantic National Biography of Finland. In: Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference (DHN 2018), vol. 2084, pp. 372–385. CEUR Workshop Proceedings (2018). http://www.ceur-ws.org/Vol-2084/short12.pdf

Acknowledgements

Thanks to Business Finland for financial support and CSC – IT Center for Science, Finland, for computational resources.

Author information

Authors and Affiliations

Semantic Computing Research Group (SeCo), Aalto University, Espoo, Finland
Petri Leskinen & Eero Hyvönen
HELDIG – Helsinki Centre for Digital Humanities, University of Helsinki, Helsinki, Finland
Eero Hyvönen

Corresponding author

Correspondence to Petri Leskinen.

Editor information

Editors and Affiliations

Kansas State University, Manhattan, KS, USA
Pascal Hitzler
Vienna University of Economics and Business, Vienna, Austria
Sabrina Kirrane
Linköping University, Linköping, Sweden
Olaf Hartig
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Victor de Boer
Leibniz Information Centre for Science and Technology University Library (TIB), Hannover, Germany
Maria-Esther Vidal
University of Bonn, Bonn, Germany
Maria Maleshkova
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Stefan Schlobach
Jönköping University, Jönköping, Sweden
Karl Hammar
F. Hoffmann-La Roche AG, Basel, Switzerland
Nelia Lasierra
Robert Bosch GmbH, Stuttgart, Germany
Steffen Stadtmüller
Aalborg University, Aalborg, Denmark
Katja Hose
IMEC, Ghent University, Ghent, Belgium
Ruben Verborgh

About this paper

Cite this paper

Leskinen, P., Hyvönen, E. (2019). Extracting Genealogical Networks of Linked Data from Biographical Texts. In: Hitzler, P., et al. The Semantic Web: ESWC 2019 Satellite Events. ESWC 2019. Lecture Notes in Computer Science, vol 11762. Springer, Cham. https://doi.org/10.1007/978-3-030-32327-1_24

DOI: https://doi.org/10.1007/978-3-030-32327-1_24
Published: 10 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32326-4
Online ISBN: 978-3-030-32327-1
eBook Packages: Computer Science, Computer Science (R0)
