1 Introduction

Extracting and inferring social or genealogical networks from historical documents can provide new information for biographical and prosopographical [1] research. However, genealogical data is often available only in textual form, which poses challenges for knowledge extraction: How can persons and their gender be identified from different name forms? How can namesakes from different times be disambiguated? How can the genealogical relations between the mentions be extracted? This paper presents a case study of extracting the genealogical network implicit in the national collection of 13 144 Finnish biographies (Footnote 1). The methodological idea is to combine regular expression identification, imprecise proper name matching, gender information, and data about expected lifespans for more accurate results. The system was evaluated with promising results, and a tool based on Linked Data was constructed for examining the underlying network of about 81 000 extracted basic relations of the types “parent”, “spouse”, and “child”. On top of the Linked Data service, a new application was created for studying the networks interactively as a new part of the in-use BiographySampo (Footnote 2) system [2].

Related Work. Extracting and studying biographical networks has been researched in the Six Degrees of Francis Bacon [3] project and in BiographyNet [4]. Articles [5, 6] discuss extracting genealogical networks from multi-source vital records. For the general public, there are many commercial genealogy websites based on crowdsourcing, such as ancestry.com, myheritage.com, and geni.com. This paper extends our earlier papers about BiographySampo [2] and about network analysis based on biographical link references [7, 8] to the extraction of genealogical networks, and presents an application view for studying such networks interactively.

2 Extracting Genealogical Networks from Texts

Dataset. BiographySampo is a semantic portal based on a knowledge base that has been created using natural language processing methods, linked data, and semantic web technologies. It contains 13 144 biographies of notable Finns that can be browsed through a faceted search application and studied using tools for Digital Humanities research [9]. In addition to the genealogical network discussed in this paper, the data has been a source for reference network extraction [7, 8].

Pattern-Based Knowledge Extraction. Many biographies in the dataset include semi-formal textual descriptions of family relations of the protagonist. As an example, the description of baroness Elisabeth Järnefelt (Footnote 3) is given below:

Jelizaveta Konstantinovna Clodt von Jürgensburg from year 1857 known as Järnefelt, Elisabeth S 11.1.1839 Pietari, K 3.2.1929 Helsinki.

V Baron, major general Konstantin Karlovitsh Clodt von Jürgensburg and Catharine Vigné.

P 1857– senator, governor, lieutenant general August Alexander Järnefelt S 1833, K 1896, PV bailiff Gustaf Adolf Järnefelt and Aurora Fredrika Molander.

Children: Caspar (Kasper) Woldemar S 1859, K 1941, critic, translator, Russian language teacher, painter, P Emma Ahonen; Edvard Armas S 1869, K 1958, conductor, composer, professor, P1 songstress Maikki Pakarinen, P2 songstress Olivia (Liva) Edström; Aina (Aino) S 1871, K 1969, P composer Jean Sibelius;

The semi-formal expressions here have uniformity in structure that can be used effectively for pattern-based information extraction: First, the given and family names are mentioned and after that the years of birth S and death K. The description provides information about the parents (marked with V), spouses (P), parents-in-law (PV), children, and children-in-law of the protagonist.
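As a minimal sketch of how the S and K markers could be read from a single mention, the following Python snippet pulls out the birth and death years. The regular expressions and the function name `extract_vital_years` are illustrative assumptions, not the patterns used in the actual system; they only cover the two date forms visible in the example above (a bare year, or a full d.m.yyyy date).

```python
import re

# Assumed date forms: a bare year ("S 1833") or a full date ("S 11.1.1839").
_YEAR = r"(?:\d{1,2}\.\d{1,2}\.)?(\d{4})"
BIRTH_RE = re.compile(r"\bS\s+" + _YEAR)  # S marks the birth
DEATH_RE = re.compile(r"\bK\s+" + _YEAR)  # K marks the death


def extract_vital_years(mention):
    """Return (birth_year, death_year); a missing year is None."""
    birth = BIRTH_RE.search(mention)
    death = DEATH_RE.search(mention)
    return (
        int(birth.group(1)) if birth else None,
        int(death.group(1)) if death else None,
    )


if __name__ == "__main__":
    print(extract_vital_years("August Alexander Järnefelt S 1833, K 1896"))
    print(extract_vital_years("Elisabeth S 11.1.1839 Pietari, K 3.2.1929 Helsinki."))
```

Requiring whitespace after the marker keeps surnames that start with S or K, such as Sibelius, from being read as markers.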

One major problem in knowledge extraction here is recognizing the same person, here Elisabeth Järnefelt, when she is referenced with different names: Jelizaveta Konstantinovna Clodt von Jürgensburg, Elisabeth Clodt von Jürgensburg, or most commonly Elisabeth Järnefelt. On the other hand, the same names are used in families over and over again. For example, there is a case of four people with the name Christian Trapp: a grandfather, a father, a son (Footnote 4), and a grandson. They cannot be distinguished without additional information about their known lifespans.
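The interplay of imprecise name matching and lifespan information can be illustrated with a small sketch. It is not the matching method of the system: the use of `difflib.SequenceMatcher`, the threshold of 0.85, the record layout, and the years in the usage example are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def may_be_same_person(m1, m2, threshold=0.85):
    """m1 and m2 are dicts with 'name' and optional 'birth'/'death' years.

    Two mentions are kept as candidates for the same person only if their
    names are similar and no known vital year contradicts the match.
    """
    if name_similarity(m1["name"], m2["name"]) < threshold:
        return False
    for key in ("birth", "death"):
        y1, y2 = m1.get(key), m2.get(key)
        if y1 is not None and y2 is not None and y1 != y2:
            return False
    return True


if __name__ == "__main__":
    # The years below are invented for illustration only.
    grandfather = {"name": "Christian Trapp", "birth": 1700}
    father = {"name": "Christian Trapp", "birth": 1730}
    variant = {"name": "Kristian Trapp"}  # spelling variant, no years known
    print(may_be_same_person(grandfather, father))   # namesakes, years differ
    print(may_be_same_person(grandfather, variant))  # still a candidate
```

A purely string-based comparison such as this one would not connect Elisabeth Clodt von Jürgensburg with Elisabeth Järnefelt, since the family name changes at marriage; such cases need the relational context as well.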

Data Processing. In our knowledge extraction pipeline, the genealogical textual description of the protagonist is first divided into the parts describing his/her parents, spouses (wife/husband distinction is not known at this point), and children. The division is based on using regular expressions matching the punctuation and the tokens V, P, PV.
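The division step could look roughly like the sketch below. The marker pattern, the section labels, and the function name `split_sections` are assumptions; the literal marker "Children:" follows the translated example above and may differ in the original data. Everything after the children marker is kept together, because a P token there introduces the spouse of a child rather than a spouse of the protagonist.

```python
import re

# Assumption: a marker appears at the start of a line or right after ", " or "; ".
MARKER_RE = re.compile(r"(?:^|(?<=[,;] ))(PV|P\d?|V)\s+", re.MULTILINE)
LABELS = {"V": "parents", "PV": "parents_in_law"}  # P, P1, P2, ... -> spouse


def split_sections(description):
    """Split a family description into (label, text) pairs."""
    head, found, tail = description.partition("Children:")
    sections = []
    matches = list(MARKER_RE.finditer(head))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(head)
        label = LABELS.get(m.group(1), "spouse")
        sections.append((label, head[m.end():end].strip(" ,;.\n")))
    if found:
        # Children are separated by semicolons in the example description.
        for child in tail.split(";"):
            if child.strip():
                sections.append(("child", child.strip()))
    return sections


if __name__ == "__main__":
    sample = (
        "V Baron, major general Konstantin Karlovitsh Clodt von Jürgensburg "
        "and Catharine Vigné.\n"
        "P 1857– senator, governor, lieutenant general August Alexander "
        "Järnefelt S 1833, K 1896, PV bailiff Gustaf Adolf Järnefelt and "
        "Aurora Fredrika Molander.\n"
        "Children: Caspar (Kasper) Woldemar S 1859, K 1941, P Emma Ahonen; "
        "Aina (Aino) S 1871, K 1969, P composer Jean Sibelius;"
    )
    for label, text in split_sections(sample):
        print(label, "|", text)
```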

The years of birth, death, or marriage are easily separated from the text sequence. To separate occupational descriptions from the proper names, we used the ARPA service (Footnote 5) together with vocabularies of Finnish female, male, and family names (Footnote 6). The extracted names were used to infer the gender of the person, which in turn was used to refine relations, e.g., to specify a parent as a mother or a father. For the network, the spouses were linked with the children based on the known years of marriage and child birth.
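The two refinement steps described above can be sketched as follows. The tiny name sets stand in for the full name vocabularies, the majority vote over given names is an assumed heuristic, and the rule that a child belongs to the marriage whose years cover the child's birth year is likewise an assumption; the paper does not spell out the exact rules.

```python
# Stand-in vocabularies; the real name lists are far larger.
FEMALE_NAMES = {"Elisabeth", "Catharine", "Aurora", "Aina", "Aino"}
MALE_NAMES = {"Konstantin", "August", "Gustaf", "Edvard", "Jean"}


def infer_gender(given_names):
    """Majority vote over given names; None if the vocabularies do not decide."""
    female = sum(name in FEMALE_NAMES for name in given_names)
    male = sum(name in MALE_NAMES for name in given_names)
    if female > male:
        return "female"
    if male > female:
        return "male"
    return None


def refine_parent(given_names):
    """Specialize the generic 'parent' relation when the gender is known."""
    return {"female": "mother", "male": "father"}.get(
        infer_gender(given_names), "parent"
    )


def spouse_for_child(child_birth_year, marriages):
    """marriages: list of (spouse_name, start_year, end_year or None).

    Assumed rule: the child is linked to the spouse whose marriage years
    cover the child's birth year.
    """
    for spouse, start, end in marriages:
        if start <= child_birth_year and (end is None or child_birth_year <= end):
            return spouse
    return None


if __name__ == "__main__":
    print(refine_parent(["Catharine"]))
    print(spouse_for_child(1869, [("August Alexander Järnefelt", 1857, None)]))
```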

To gain detailed vital information for disambiguation, we inferred lifetime estimates, e.g., the missing birth years of the parents based on the known birth year of their child. The estimates were constructed by first collecting the birth years of a parent and a child from the cases in the data where both are known. The distributions of parent ages at child birth are depicted in Fig. 1. To estimate the ages of spouses, a similar study was performed, with the result that 99% of the differences between the birth years of a husband and wife fall in the range from -18 to +35 years. The more relatives with known records a person has, the more precise the estimates are.
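The estimation idea can be illustrated with a short sketch: learn an age range from cases where both birth years are known, turn each child's birth year into an interval for the parent's birth year, and intersect the intervals. The function names, the symmetric trimming used to obtain the range, and all years in the usage example are assumptions made for illustration.

```python
def age_bounds(known_pairs, coverage=0.99):
    """known_pairs: list of (parent_birth_year, child_birth_year).

    Returns the (min_age, max_age) range that covers roughly the requested
    share of observed parent ages, trimming both tails equally.
    """
    ages = sorted(child - parent for parent, child in known_pairs)
    cut = int(len(ages) * (1 - coverage) / 2)
    return ages[cut], ages[len(ages) - 1 - cut]


def parent_birth_interval(child_birth_year, min_age, max_age):
    """Earliest and latest plausible birth year of a parent."""
    return child_birth_year - max_age, child_birth_year - min_age


def intersect(intervals):
    """Combine several estimates; None if they contradict each other."""
    low = max(interval[0] for interval in intervals)
    high = min(interval[1] for interval in intervals)
    return (low, high) if low <= high else None


if __name__ == "__main__":
    # Invented training pairs, for illustration only.
    pairs = [(1800, 1825), (1810, 1840), (1820, 1838), (1830, 1856)]
    min_age, max_age = age_bounds(pairs)
    estimates = [
        parent_birth_interval(year, min_age, max_age) for year in (1859, 1869)
    ]
    print(estimates, intersect(estimates))
```

Each additional relative with a known year adds another interval to intersect, which is why the estimate narrows as more records become available.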

Fig. 1. Distribution of parent ages at child birth

3 Evaluation

For the evaluation, we randomly chose 50 biographies and manually compared the texts with the extracted results. The test set had mentions of 170 people. We compared the data fields of person names, years of birth and death, gender, occupation, and relation type. According to our evaluation, 94.5% of the extracted person records were mentioned in only a single biography; for these, the accuracy was 97.3%. For people mentioned in multiple biographies, the accuracy was 80.4%; our system could not identify all mentions referring to the same people.

As an example of the extracted network, a part of the genealogical network of Elisabeth Järnefelt (Footnote 7) is depicted in Fig. 2. She belongs to the largest connected component in our network. This component contains 2694 family relation links and connects 1835 people in 250 biographies. To further enrich the web portal, the immediate family relations were used to infer further relations (Footnote 8), such as siblings, cousins, uncles, aunts, grandparents, grandchildren, and relatives-in-law.
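Deriving such further relations from the basic ones can be sketched in a few lines. The sketch covers only siblings and grandparents, takes plain (parent, child) pairs as input, and uses a hypothetical function name; the rules used in the portal are not reproduced here.

```python
from collections import defaultdict


def derive_relations(parent_of):
    """parent_of: iterable of (parent, child) pairs.

    Returns (siblings, grandparents): siblings as a set of unordered pairs,
    grandparents as a set of (grandparent, grandchild) tuples.
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    for parent, child in parent_of:
        parents[child].add(parent)
        children[parent].add(child)

    siblings, grandparents = set(), set()
    for child, child_parents in parents.items():
        for parent in child_parents:
            # People sharing at least one parent are siblings.
            for other in children[parent]:
                if other != child:
                    siblings.add(frozenset((child, other)))
            # A parent of a parent is a grandparent.
            for grandparent in parents.get(parent, ()):
                grandparents.add((grandparent, child))
    return siblings, grandparents


if __name__ == "__main__":
    links = [
        ("Gustaf Adolf", "August Alexander"),
        ("August Alexander", "Caspar"),
        ("August Alexander", "Edvard Armas"),
        ("August Alexander", "Aina"),
        ("Elisabeth", "Caspar"),
        ("Elisabeth", "Edvard Armas"),
        ("Elisabeth", "Aina"),
    ]
    sibs, grands = derive_relations(links)
    print(len(sibs), sorted(grands))
```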

Fig. 2. Genealogical network around Elisabeth Järnefelt as seen in a BiographySampo view