1 Introduction

BiographySampo (Footnote 1) is a semantic portal based on a knowledge graph that was created using natural language processing methods, linked data, and semantic web technologies [15, 35]. The graph currently contains ca. 13 100 biographical descriptions of notable Finns that can be browsed using faceted search and a variety of data-analytic tools. Of these entries, 9200 contain a short, free-text biography of the person, written by 977 professional authors. The portal has been built to help historians and scholars in biographical [33] and prosopographical research [8, 37] (Footnote 2). A major novelty of BiographySampo is that it provides the user with data-analytic and visualization tools for solving research problems in Digital Humanities (DH), based on Linked Data [9, 12].

In the biography texts, the authors mention other people they consider significant from an occupational or other relevant perspective. In our case study, the editors of the dictionary of biography at the publisher Finnish Literature Society (SKS) have changed these mentions into internal links to the corresponding articles in the dictionary, if there is one (Footnote 3). A link is typically added only once, when a person is mentioned for the first time. In the original biography collection, these links serve as a way to browse and move between the biographies.

However, many links are missing from the text. For example, there are mentions of relatives and external people who do not have a biography in the dictionary to be linked to, e.g., William Shakespeare and Richard Wagner. In addition, if a biography A mentions person B, but the biography of B was added to the collection after A was edited, it has not been possible to add the link. The explicit links between people in the biographical texts therefore create only a sparsely interlinked reference network of the biographical texts.

This paper argues that making the reference network underlying a biographical dictionary explicit can be useful in biographical and prosopographical research. The idea of using network analysis of historical people for Digital Humanities research has been suggested before in, e.g., [3, 38]. A contribution of our paper is to apply the idea to biography collections, where connections are based on entity mentions. To support the argument, we present a case study using BiographySampo in which the reference network underlying its textual biographies was extracted, enriched into a knowledge graph, and published as a linked data service, on top of which a set of tools was created for Digital Humanities research. This idea is currently also being applied to a genealogical network extracted from the same texts [20].

In the following, the underlying knowledge graph with its person and place ontologies, and the process of extracting and enriching the reference network, are first presented (Sect. 2). After this, application views for studying the networks underlying the biographical texts are presented (Sect. 3). Firstly, a network analysis tool is presented for visualizing and studying the egocentric network of a protagonist in biographical research. Secondly, this idea is generalized for prosopography, where groups of people sharing characteristics (e.g., occupation, gender, or area of living) are studied. Here the user can first separate the target group using faceted search and then visualize the group’s sociocentric network. Thirdly, when visualizing the networks, it turned out that they often include serendipitous [1] (surprising) connections between people, raising the question: why are these two people interconnected? A tool is clearly needed for explaining the connections, not only showing them. For this purpose, an application view showing the textual contexts in which the connections arise was created. Lastly, the toolset also includes an application called contextual reader [24], where the user can get information about the extracted linked entities by hovering the mouse over the mentions. After presenting the application views, the applications and the named entity extraction are evaluated (Sect. 4). In conclusion (Sect. 5), the contributions of the paper are summarized, related work is discussed, and directions for further research are suggested.

2 Extracting Named Entities from Biographical Texts

In order to build the network analysis tools, the reference analysis tools, and the contextual reader application and integrate them into BiographySampo, the existing links and named entities need to be extracted from the texts, and the underlying BiographySampo Knowledge Graph (BSKG) must be enriched accordingly. This section discusses the knowledge base, the extraction process, and the data transformations that enable the end-user applications.

Knowledge Graph. BSKG includes the biography collections (Footnote 4) of SKS, written by 977 scholars from different fields. The biographies describe the lives and achievements of historical and contemporary figures and contain numerous references to notable Finnish and foreign figures and to historical events, works (e.g., paintings, books, music, and acting), places, organizations, and dates. The graph includes 13 144 people with a biographical description, 51 200 related people mentioned in the biographies, and the 977 authors of the biographies. There are furthermore 225 000 lifetime events of the protagonists, including their births, deaths, and other biographical events. The biographical texts also contain 31 500 manually added HTML links between the biographies, which were included in the knowledge base [35]. There is also a separate graph of 4970 places, extracted from the Finnish Gazetteer of Historical Places and Maps (Hipla) and data service (Footnote 5) [14, 17]. Foreign place names were linked using the Google Maps APIs (Footnote 6). The lifetime events contain many mentions of other kinds, such as governmental or educational buildings and public places. An additional dataset of approximately 2000 resources was extracted for them from Wikidata. The data was also augmented with a list of the countries of the world and their capitals [21].

Extraction and Linking Process. The biographical texts [35] were transformed into an RDF dataset and enriched with linguistic information, totaling 120 million triples. The data can be queried from a SPARQL endpoint. This data contains manually annotated links that have been extracted from the HTML as well as links based on entity linking.

Named entity linking (NEL) tools [25, 26, 28] typically use a process that can be broken into three tasks [4, 7]: 1) named entity recognition (NER), 2) named entity disambiguation (NED), and 3) NEL. NER identifies the entities in the text, NED disambiguates them, and lastly NEL links the mentions to their meanings in ontologies or knowledge bases. Our new linking tool, Nelli, extracts and links entities from texts in a similar manner. In addition, however, it combines multiple tools for NER and NEL, in our case three. The purpose of this approach is to improve disambiguation by utilizing a voting scheme [6, 32] where each tool has a vote on the interpretation it makes for the same piece of text. The best candidate is the one with the most votes. For example, when identifying a place from the string Turku Cathedral, the tools return three answers, of which one is Turku and two are Turku Cathedral, which is the winning interpretation.
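The basic voting idea can be illustrated with a minimal Python sketch. It is not Nelli's implementation: the tool names tool_a, tool_b, and tool_c and the function majority_vote are hypothetical, and the sketch only counts how many tools propose the same string for the same piece of text.

```python
from collections import Counter


def majority_vote(interpretations):
    """Return the interpretation proposed by the most tools, with its vote count.

    `interpretations` maps a (hypothetical) tool name to the string that
    tool proposes for the same piece of text.
    """
    counts = Counter(interpretations.values())
    winner, votes = counts.most_common(1)[0]
    return winner, votes


# The Turku Cathedral example from the text: one tool answers "Turku",
# two tools answer "Turku Cathedral".
proposals = {
    "tool_a": "Turku",
    "tool_b": "Turku Cathedral",
    "tool_c": "Turku Cathedral",
}
winner, votes = majority_vote(proposals)
print(winner, votes)
```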

Nelli uses the tools FiNER, ARPA, and LINFER. FiNER (Footnote 7) is a rule-based NER tool for Finnish, and ARPA [23] is a NER and NEL tool [10, 18] that queries matches from controlled vocabularies. To supplement FiNER and ARPA, a third tool, LINFER, was implemented that utilizes the linguistic RDF data to identify named entities. The parsed linguistic data contains not only part-of-speech information but also Dependency Grammar relations. With this information, a set of rules was created to infer which proper nouns (or nouns) are most likely place or person names. With this tool, entities such as Åbo Akademin kirjasto (engl. Åbo Akademi University Library) can be identified by analyzing inflected forms and dependencies. These rules were encapsulated in LINFER to utilize the linguistic features of words and their relations.

In addition to each tool having a vote, votes can be earned for entity length, linkage, and named entity type. Longer named entities, such as place or organization names, can be difficult to identify correctly, and therefore a vote is given to the longest matching candidates. Candidates that have found a match in an ontology are also favored with a vote. Nelli furthermore has a priority order for named entity types, where votes can be added to favor some entity types over others. For example, the address Konemiehentie 2, Espoo contains a city name. In order to have the address as the top-voted candidate, it helps to give the address type a higher score than the more general location type.
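The additional votes can be sketched as a simple additive score. The sketch below is an illustration under stated assumptions, not Nelli's actual scoring: the Candidate fields, the TYPE_PRIORITY values, and the choice of one extra vote per criterion are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical type bonuses: the more specific address type earns more
# votes than the general location type.
TYPE_PRIORITY = {"address": 2, "location": 1}


@dataclass
class Candidate:
    text: str
    entity_type: str
    tool_votes: int  # number of tools that proposed this candidate
    linked: bool     # True if a match was found in an ontology


def score(candidate, longest_length):
    """Sum tool votes and the bonus votes for length, linkage, and type."""
    votes = candidate.tool_votes
    if len(candidate.text) == longest_length:
        votes += 1  # bonus for the longest matching candidate
    if candidate.linked:
        votes += 1  # bonus for a match in an ontology
    votes += TYPE_PRIORITY.get(candidate.entity_type, 0)
    return votes


def best_candidate(candidates):
    """Return the candidate with the highest total number of votes."""
    longest = max(len(c.text) for c in candidates)
    return max(candidates, key=lambda c: score(c, longest))


candidates = [
    Candidate("Espoo", "location", tool_votes=1, linked=True),
    Candidate("Konemiehentie 2, Espoo", "address", tool_votes=1, linked=False),
]
print(best_candidate(candidates).text)
```

With these made-up numbers the two candidates would tie without the type bonus; the higher priority of the address type is what makes the full address the top-voted candidate.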

Once Nelli has all the interpretations and metrics about the candidates, it calculates the votes and writes the results in Turtle format. For this extension of the original data, we used the NLP Interchange Format (NIF) (Footnote 8) [11], Dublin Core Metadata (Footnote 9), and a custom namespace (Footnote 10) to supply classes and properties that describe named entity metadata. To record the results, the application writes instances of the nbf:NamedEntity class that hold the basic information about the entity. The class has properties to describe the extracted string (nif:isString), the base form of the string (nif:lemma), its named entity type (nbf:namedEntityType), the resource to which it is linked (skos:relatedMatch), the location of the string in the text (nif:beginIndex, nif:endIndex), and the method that was used to extract the named entity (nbf:usedNeMethod). In the source dataset, the texts have been split into documents, paragraphs, sentences, and words. The word entities are given a dct:isPartOf property referring to the named entity instances they are a part of, and similarly the sentences have an nbf:hasNamedEntity property. The value of the nbf:namedEntityType property is an instance of the nbf:NamedEntityType class, which describes the named entity type. The value of the nbf:usedNeMethod property is an instance of the class nbf:NamedEntityMethod, which holds provenance information about the tools used to extract the named entity. In addition to the nbf:NamedEntity class, there is also the nbf:NamedEntityGroup class that groups the entities in each sentence based on location and possible overlap. Each group lists all its members with the property nbf:member and the top-voted entity with nbf:primary.
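The grouping of overlapping candidates can be sketched as follows. The exact grouping rule is not specified above, so this example assumes that candidates whose character spans overlap belong to the same group and that the end index is exclusive. The dictionary keys member and primary mirror nbf:member and nbf:primary, while the function name and the votes field are hypothetical.

```python
def group_overlapping(entities):
    """Group the entity candidates of one sentence by overlapping spans.

    Each entity is a dict with "begin", "end" (assumed exclusive), and
    "votes". Returns one dict per group with all members and the
    top-voted member as "primary".
    """
    groups = []
    current, current_end = [], -1
    for entity in sorted(entities, key=lambda e: (e["begin"], e["end"])):
        if current and entity["begin"] < current_end:
            # Overlaps the running group: extend it.
            current.append(entity)
            current_end = max(current_end, entity["end"])
        else:
            # No overlap: close the running group and start a new one.
            if current:
                groups.append(current)
            current, current_end = [entity], entity["end"]
    if current:
        groups.append(current)
    return [
        {"member": group, "primary": max(group, key=lambda e: e["votes"])}
        for group in groups
    ]


# Hypothetical candidates for one sentence.
entities = [
    {"string": "Turku", "begin": 0, "end": 5, "votes": 1},
    {"string": "Turku Cathedral", "begin": 0, "end": 15, "votes": 2},
    {"string": "Helsinki", "begin": 20, "end": 28, "votes": 3},
]
for group in group_overlapping(entities):
    print(group["primary"]["string"], len(group["member"]))
```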

Reference Networks. Network analysis of people [3, 38] is a set of methods that can be used to study social networks [30]. In our case, the networks were built from the HTML links and mentions of people in the biographies to create a reference network which is analogous to citation networks [34]. In a reference network, the nodes are people, and when a person A is mentioned in the biography of B, a directed edge is added from B to A. The edges are instances of the class nbf:Reference with properties for the source biography nbf:source, the mentioned person nbf:target, and the type of the reference as nbf:ManualAnnotation (for HTML links) or nbf:AutomaticAnnotation (for identified named person entities). The number of references to the target person in the source biography is declared as the value of the nbf:weight property which for manual HTML links equals one.
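As an illustration of the edge construction described above, the following Python sketch builds directed, weighted edges from two lists of (source biography, mentioned person) pairs. The function name, the dictionary layout, and the identifiers A, B, and C are hypothetical; only the edge direction, the two annotation types, and the weighting follow the description.

```python
from collections import Counter


def build_reference_edges(manual_links, automatic_mentions):
    """Build directed reference edges from source biography to mentioned person.

    manual_links: (source, target) pairs taken from HTML links.
    automatic_mentions: (source, target) pairs, one per mention found by
        entity linking.
    """
    edges = []
    # A manual HTML link always has weight one.
    for source, target in sorted(set(manual_links)):
        edges.append({"source": source, "target": target,
                      "type": "ManualAnnotation", "weight": 1})
    # For automatic annotations the weight is the number of mentions.
    for (source, target), count in sorted(Counter(automatic_mentions).items()):
        edges.append({"source": source, "target": target,
                      "type": "AutomaticAnnotation", "weight": count})
    return edges


# Person A is linked once in the biography of B; person C is mentioned twice.
manual = [("B", "A")]
automatic = [("B", "C"), ("B", "C"), ("B", "A")]
edges = build_reference_edges(manual, automatic)
for edge in edges:
    print(edge)
```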

The transformed network data can then be used in applications by querying the nodes, e.g., biographical details of people, and the edges, e.g., the links between people. Based on the data, the networks can be generated automatically for an individual or a group.

3 Applications

To test the potential of network analysis in biography and prosopography, a reference network was constructed based on HTML links in ca. 6100 of the 13 144 biographies, enhanced with additional edges extracted by Nelli from 400 biographies. This group was limited to politicians, writers, athletes, Lutherans, artists, architects, and musicians because their biographies contain long textual descriptions.

The BSKG included entities for people and places extracted from texts. For place linking, we used the YSO Places ontologyFootnote 11 of the Finnish Ontology Service Finto that contains contemporary place resources for municipalities, provinces, countries, and continents. The contemporary data was extended with the WarSampo place ontology [13] that includes historical Finnish places. A priority order was set for place and person entities so that more specific place names, for instance, have a higher score. Also, to avoid having people’s first and last names mislabeled as places, person named entities were given a higher score.

With this setup, 33 120 entities were extracted and used as a basis for four application views presented next in this section.

1. Egocentric Networks. The egocentric networks are formed from person nodes, i.e., biographical details of people, and edges, i.e., the links between people. A network is rendered in the center of the screen, centered around one person, in this case the protagonist. On the left-hand side of the user interface, there are network toggles that can be used to alter the layout of the network in the following ways: Firstly, the user can set the number of nodes to be shown, i.e., limit the size of the network to be visualized. Secondly, the user can choose to see the network built using the manual HTML links only, the automatically extracted links only, or both. In this way, the manual and automatically extracted links can be compared with each other. Thirdly, to emphasize the most significant nodes in the graph, the node size can be determined by one of the following distance and centrality measures used in network analysis: distance to the protagonist, degree, in-degree, out-degree, or PageRank [27]. Fourthly, it is possible to color the person nodes based on gender, occupational area, or distance to the protagonist. The network is generated based on the selected toggle options, and with the automatic links option the edge weight shows how frequently the person is mentioned in the text.
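The node-size measures listed above can be sketched in plain Python. This is a minimal illustration, not the portal's implementation: it assumes unweighted edges, measures distance to the protagonist while ignoring edge direction, and uses a textbook PageRank power iteration with a damping factor of 0.85. The function names and the toy node names are hypothetical. Degree is the sum of in-degree and out-degree.

```python
from collections import deque


def degrees(edges):
    """Return (in_degree, out_degree) dictionaries for directed edges."""
    in_deg, out_deg = {}, {}
    for source, target in edges:
        out_deg[source] = out_deg.get(source, 0) + 1
        in_deg[target] = in_deg.get(target, 0) + 1
        in_deg.setdefault(source, 0)
        out_deg.setdefault(target, 0)
    return in_deg, out_deg


def distances_from(ego, edges):
    """Hop distance from the ego node, ignoring edge direction (assumption)."""
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)
    dist = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling nodes spread their rank evenly."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy egocentric network with hypothetical node names.
toy_edges = [("ego", "a"), ("ego", "b"), ("a", "b"), ("c", "ego")]
in_deg, out_deg = degrees(toy_edges)
dist = distances_from("ego", toy_edges)
rank = pagerank(toy_edges)
print(max(rank, key=rank.get))
```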

Fig. 1. Kasper Järnefelt’s egocentric network where nodes are colored by occupation.

Figure 1 depicts the egocentric network of the Finnish critic, translator, and cultural person Kasper Järnefelt (1859–1941). The network shows, e.g., many links to contemporary Finnish cultural persons with a biography in the system (based on the HTML links), as well as connections to external people, such as the authors Nikolai Gogol, Henrik Ibsen, and Leo Tolstoi (based on Nelli), who do not have a biography in BiographySampo. The linkage is based on the fact that Järnefelt translated their works into Finnish. The width of an edge indicates the number of references between the biographies and is an indication of potential importance. The legend box in the upper right corner explains the color coding of the occupational areas used for the nodes. For brevity, the toggles for making the selections for the visualization are not shown in the figure.

In BiographySampo, the egocentric networks are located under the Network tab in the personal home pages of the protagonists.

2. Sociocentric Networks. The sociocentric networks are located in their own view in BiographySampo. They can be accessed from the navigation bar under the title Verkostot (engl. Networks). In this application view, the user first selects the target group she is interested in studying by using faceted search. For example, people with a similar occupation or place of birth can easily be selected in the corresponding facets.
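The two steps of this view, selecting a group with facets and then forming the group's network, can be sketched as follows. The sketch assumes that the sociocentric network contains only edges whose both endpoints belong to the selected group, which is not stated explicitly above. The function names, the facet names, and the identifiers p1 to p3 are hypothetical.

```python
def select_group(people, **facets):
    """Return the ids of people whose attributes match every selected facet."""
    return {
        person_id
        for person_id, attributes in people.items()
        if all(attributes.get(key) == value for key, value in facets.items())
    }


def induced_network(edges, group):
    """Keep only the edges whose source and target both belong to the group."""
    return [(s, t) for s, t in edges if s in group and t in group]


# Hypothetical people and reference edges.
people = {
    "p1": {"occupation": "author", "gender": "female"},
    "p2": {"occupation": "author", "gender": "male"},
    "p3": {"occupation": "politician", "gender": "male"},
}
reference_edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p1")]

authors = select_group(people, occupation="author")
print(induced_network(reference_edges, authors))
```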

Fig. 2. Minna Canth in a sociocentric network where nodes are colored by gender. (Color figure online)

An example of a group view is presented in Fig. 2. The facets are situated on the left-hand side of the screen underneath the general network analysis toggles, but are not visible here. In this case, the user has selected Finnish authors of the mid-19th century. Here, the Finnish female playwright and social activist Minna Canth gains the highest PageRank, illustrated by the size of her node. The gender of persons is indicated by red (women) or blue (men), an option selected from the toggles on the left.

Fig. 3. The references made to Minna Canth.

Fig. 4. The references made to Johan Snellman.

3. Explaining References. BiographySampo also contains an application view that explains the edges in the egocentric and sociocentric networks. This reference view can be found for each protagonist in a separate tab on their homepage. The idea is to explain edges by providing the user with the sentences in which the references to other people are made (Footnote 12). The sentences can be retrieved from the linguistic graph of the underlying SPARQL endpoint [35]. The references have been divided into two groups: 1) sentences in other biographies that make a reference to the protagonist’s biography, and 2) sentences in the protagonist’s biography that make a reference to other biographies. For example, the references to Minna Canth include sentences from the biographies of actors, writers, and playwrights who were influenced by her, whereas her own biography mainly mentions contemporary writers and artists.

In addition to listing sentences that include links, BiographySampo also has a separate statistics application view that depicts at what time a person is referenced. Here time is based on the birth year of the protagonist of the biography making the reference. The purpose of this temporal view is to show how a person is referenced through time. For example, as shown in Fig. 3, Minna Canth is frequently mentioned over a long period because, e.g., the actors and directors in the national biography have used her plays. In comparison, a person such as the 19th-century philosopher and Finnish statesman Johan Vilhelm Snellman, who had a significant role in improving the status of the Finnish language in 19th-century Finland, is, as shown in Fig. 4, mentioned frequently mostly in the biographies of his contemporaries.

This view can be used not only to identify the influences of these notable Finns in history, but also to study the edges that exist in the networks. This helps the user to see why an edge exists between two people and what kind of semantic meaning each edge holds in the network. For example, on the references page of Minna Canth, the user can see that in most cases she is mentioned because of her literary work, and by people who have acted in or directed her plays, or visited her salon to discuss and exchange thoughts on literature and ideologies, such as Darwinism.
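The temporal view described above, in which references to a person are placed on a timeline by the birth year of the referencing protagonist, can be sketched as a simple count. Binning by decade is an assumption made here for illustration, and the function name, the identifiers, and the birth years are hypothetical.

```python
from collections import Counter


def references_by_decade(target, references, birth_years):
    """Count references to `target` by the birth decade of the protagonist
    whose biography makes the reference."""
    counts = Counter()
    for source, mentioned in references:
        if mentioned == target and source in birth_years:
            decade = birth_years[source] // 10 * 10
            counts[decade] += 1
    return dict(sorted(counts.items()))


# Hypothetical protagonists with birth years, and (source, mentioned) pairs.
birth_years = {"p1": 1852, "p2": 1858, "p3": 1901}
references = [("p1", "x"), ("p2", "x"), ("p3", "x"), ("p3", "y")]
print(references_by_decade("x", references, birth_years))
```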

4. Contextual Reader. The contextual reader is yet another application of Nelli data in BiographySampo. The idea here is to show the text annotated with links to the named entities, such as people, places, and organizations. It enhances the reading experience of the user by providing contextual linked information about the named entities in the biographies when the mouse is hovered over the text. The application is shown at work in Fig. 5, where the mouse is over Nikolai Gogol.

To achieve this, Nelli was configured to link named entities to contextual background information in the BSKG and other datasets available in SPARQL endpoints, such as biographies, map services, and ontology services. This was done to interlink biographical texts to each other and to help the user to understand and learn better from the texts based on their context [22, 26].

The system is based on the CORE tool [24], where entity mentions in texts can be linked to linked data resources in real time. There, string-based semantic disambiguation is used and exactly one interpretation is always selected. In contrast, in BiographySampo the annotations are created in a pre-processing phase, which facilitates deeper analysis and disambiguation of entities; in challenging cases, multiple interpretations can be presented to the end user for final human disambiguation.

This application was integrated into the biography tab of a person’s homepage. The user can read the biography and gain more understanding through the links to people (indicated in blue color), places (green), and organizations (gray) (cf. Fig. 5). The links are also indicated by a symbol showing the type of the link. By hovering on top of an internal link (to BSKG) or an external link (to, e.g., Wikidata), as in Fig. 5 to the Russian playwright Nikolai Gogol, the user gets more information about that person (here from the Wikidata SPARQL endpoint). The place links lead to a map view to provide information related to that place and a map marking the location.
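How pre-computed annotations could be rendered as colored links can be sketched with a small Python function. This is only an illustration: the function name, the tuple layout, and the example.org URIs are hypothetical, the spans are assumed not to overlap, and the actual portal markup is not described above. Only the color choices follow the description.

```python
from html import escape

# Colors as described in the text: people blue, places green,
# organizations gray.
COLORS = {"person": "blue", "place": "green", "organization": "gray"}


def annotate(text, entities):
    """Wrap entity spans in colored links.

    entities: (begin, end, type, uri) tuples with non-overlapping spans;
    `end` is exclusive.
    """
    parts, cursor = [], 0
    for begin, end, entity_type, uri in sorted(entities):
        parts.append(escape(text[cursor:begin]))
        parts.append(
            '<a href="{}" style="color: {}">{}</a>'.format(
                escape(uri, quote=True),
                COLORS[entity_type],
                escape(text[begin:end]),
            )
        )
        cursor = end
    parts.append(escape(text[cursor:]))
    return "".join(parts)


# Hypothetical sentence and annotations.
text = "He translated Nikolai Gogol in Helsinki."
person_start = text.index("Nikolai Gogol")
place_start = text.index("Helsinki")
entities = [
    (person_start, person_start + len("Nikolai Gogol"), "person",
     "http://example.org/person/gogol"),
    (place_start, place_start + len("Helsinki"), "place",
     "http://example.org/place/helsinki"),
]
html_text = annotate(text, entities)
print(html_text)
```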

Fig. 5. Contextual Reader application used on Kasper Järnefelt’s biography. (Color figure online)

4 Assessment and Evaluation

In this section, lessons learned in developing the applications are first discussed and their usefulness assessed from an end user perspective. After this, an evaluation of the NEL tool Nelli follows.

Assessing Applications. The network analysis views have been built for individuals and for groups of people. The egocentric networks for individuals are often smaller in size and therefore facets are not included in the view for filtering out related people. However, in some cases it may be interesting to scale egocentric networks to include only occupational references or people who have lived at the same time. The basic network toggling tools are provided for both views and can be used to color the nodes by occupation or gender. Also, it is possible to compare networks based on manual and automatic links.

The reference explanation view adds textual context to the links and is in most cases a helpful tool for understanding relations between nodes. However, the view currently only shows the sentences with manual HTML links. In addition, in some cases one sentence does not have enough context for an explanation. For example, in the biography of Aale Tynni (a poet) there is a highly serendipitous (surprising) link to Tapio Rautavaara (a singer, actor, and athlete). It turns out that both of them won a gold medal in the Olympic Games in London 1948, but in different categories: Tynni in lyrics and Rautavaara in javelin throw. However, the sentence with the link does not explain this. The information is in the previous sentence, and in this case it would be useful to show more than one sentence to explain the relation.

The contextual reader application visualizes the extracted named entities in the text and adds more context through linking to BSKG and external datasets and ontologies. Currently only three types of named entities are visible in the contextual reader, but more could be added, e.g., named works of art. Also, it would be useful to add images or maps for places, as has been done in [13]. The named entities extracted from the texts are often people whom the authors of the biographies consider occupationally significant. The networks and the reference analysis reflect these choices, creating biases similarly to [38]. Reference networks in our case are not actual social networks. For example, in the biography of Jutta Urpilainen, the former Prime Minister Jyrki Katainen is mentioned because Urpilainen worked as the Minister of Finance in Katainen’s Cabinet of Finland, but in the biography of Katainen, Urpilainen is not mentioned. It is important to keep in mind that these networks only give insight into who is considered by the authors and their sources to be significant to the protagonist [15].

Evaluation of Named Entity Extraction. In order to measure the quality of Nelli in the task of identifying named entities, we inspected place and person links in 50 biographical texts. Self-references to the protagonist were ignored in the calculations because the idea was to identify information that helps the reader to better understand the person, and references to self do not add value in this task. In addition, we counted organization names containing a linked place name (e.g., The National Museum of Finland) as false positives. The linking of places and people was evaluated using precision, recall, and F1-score, as shown in Table 1; the identification of organizations was ignored in the test as this is still ongoing work.

Table 1. Results for recognizing and linking places and people.

The results in Table 1 for places and people have been counted in two ways: 1) excluding false negatives that cannot be found in the ontology (\(FN_{out}\)) and 2) including all false negatives (\(FN_{all}\)). Comparing these two counts shows how entities missing from the ontologies used impact the results for places and especially for people. The tool's success depends largely on the chosen ontology, which here contains only a few people with the same name. The overall \(F1_{all}\)-scores are nevertheless good: 74% for people and 84% for places. The precision (Precision) for places is lower than the recall, which lowers the F1-score. In comparison, the precision for people is nearly perfect. However, the recall (\(Recall_{out}\), \(Recall_{all}\)) for people is lower than the recall for places. This is because some people cannot be found in the ontology due to tool errors, incorrect data (missing maiden or married names, badly formed data labels), or problems with baseforming foreign names.
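The two ways of counting can be made concrete with a short Python sketch using the standard definitions of precision, recall, and F1-score. The function names are hypothetical, and the counts in the example are made up for illustration; they are not the figures behind Table 1.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1-score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def evaluate(tp, fp, fn_in_ontology, fn_missing_from_ontology):
    """Score twice: once with only the false negatives that exist in the
    ontology (out), once with all false negatives (all)."""
    return {
        "out": precision_recall_f1(tp, fp, fn_in_ontology),
        "all": precision_recall_f1(
            tp, fp, fn_in_ontology + fn_missing_from_ontology
        ),
    }


# Made-up counts for illustration only.
scores = evaluate(tp=80, fp=20, fn_in_ontology=10, fn_missing_from_ontology=10)
for name, (precision, recall, f1) in scores.items():
    print(name, round(precision, 2), round(recall, 2), round(f1, 2))
```

Precision is the same in both counts because false negatives do not enter it; only recall, and through it the F1-score, drops when entities missing from the ontology are included.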

The precision for places also suffers because last names are confused with place names when the person names are not identified in the text. To reduce this confusion, last names could be identified using the extracted full person names. People are often referenced in the text first with a full name and later with only the last name. By using the last names from the full names, most references could be extracted and mix-ups with place names avoided. The place recognition also often mistakes regular words, such as adjectives, for place names. For example, when the initial word of a sentence is the inflected form of the word oma (engl. own), it is interpreted as the place Oman. Adding a rule that only considers entities written with a capital letter can help to reduce these issues. However, this alone is not enough, and linguistic information can be used to filter out sentence-initial words that are not proper nouns.
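The proposed use of full names to resolve later last-name mentions can be sketched as follows. This is an illustration of the idea, not an implemented part of Nelli: the function name is hypothetical, the mentions are assumed to be given in base form (Finnish inflection is ignored), and a last name is resolved only when exactly one full name seen earlier ends with it.

```python
def resolve_surnames(mentions):
    """Map last-name-only mentions to a full name seen earlier in the text.

    mentions: mention strings in text order, assumed to be in base form.
    Returns one entry per mention: the full name, or None if a single-word
    mention cannot be resolved unambiguously.
    """
    surname_to_full = {}
    resolved = []
    for mention in mentions:
        tokens = mention.split()
        if len(tokens) > 1:
            # Full name: remember it under its last name.
            surname_to_full.setdefault(tokens[-1], set()).add(mention)
            resolved.append(mention)
        else:
            candidates = surname_to_full.get(mention, set())
            if len(candidates) == 1:
                resolved.append(next(iter(candidates)))
            else:
                resolved.append(None)
    return resolved


mentions = ["Minna Canth", "Canth", "Turku"]
print(resolve_surnames(mentions))
```

A single-word mention that resolves to a person in this way need not be offered to the place linker, while unresolved ones such as Turku remain candidates for place linking.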

5 Conclusions

In this case study, a total of 31 500 manually created links between biographies were utilized to visualize and study a reference network underlying a dictionary of biographies. In addition, the application of Nelli to the data added a total of 33 120 named entity links to the network, of which some 12 800 were for places and some 20 800 for people. This data was utilized to enrich the networks with additional references to people cataloged in the dictionary of biography and with new external nodes in the network. Nelli succeeded in identifying people with an F1-score of 74% and places with an F1-score of 84%. Four application views were added to BiographySampo to support analysis of the networks by the end user.

The selection of ontologies has a role in the success of the work. The place names in biographical texts were distinctive and easy to link to comprehensive ontologies with low granularity. It was helpful, too, that the BSKG contained only a handful of namesakes. By adding fixes to prevent the linking of adjectives and postpositions to places it is possible to increase the success rates. The disambiguation scheme enabled successful linking of person names, which prevented most of the mix-ups between people and places.

The applications presented were based on reference networks. Unlike in [3, 31, 38], the user can study the networks of different groups through facet selections and visualize the networks in a variety of ways, such as re-sizing the nodes based on their topological properties or coloring the nodes based on occupational area or gender. The networks of individuals can be studied in the egocentric network to see, for instance, the spreading of influences. The foreign influences of notable writers, politicians, and philosophers are prominent in the automatically enriched networks, and a full view of their reach can be seen through the egocentric networks. The networks, complemented with the reference view for studying the explanations for the edges, give more insight into the impact of individuals in groups. The contextual reader application enhances the reading experience by providing information about the linked entities. These applications facilitate novel, more diverse usage of BiographySampo in biographical and prosopographical research.

However, the applications also raise new questions and problems of source criticism regarding the quality of the automatically extracted content and semantic interpretation of the networks [15]. It is clear, for example, that the people selected in the dictionary do not necessarily constitute a homogeneous prosopographical group but were selected by the editors, and people mentioned in the texts reflect the decisions made by the authors and the sources they have used.

Related Work. Representing and analyzing biographical data is a new research and application field [15, 36]. Network analysis based on biographical data has been studied in [3, 19, 38], where networks were created using a variety of methods to extract named entities and their relations from text. In BiographySampo, the networks were created using the NEL approach [25, 26, 28].

The network analysis views were constructed to study individuals and groups of people. Several related works [3, 5, 19, 31, 38] and network analysis and visualization methods [27] have influenced the tools presented in this paper. The tools in BiographySampo extend traditional systems by adding user controls that can be used to scale the networks and toggle their layout. In addition, the sociocentric network analysis allows the user to use facets (such as gender, vocation, and birth and death places) to form groups of people and study their networks. To extend the network analysis tools, BiographySampo also includes a reference analysis view explaining the links. It is similar to the keywords-in-context view of KORP (Footnote 13) [2] but provides context for the edges in the network, similarly to the relationship view of LinkedJazz (Footnote 14) [31]. Unlike in LinkedJazz, the view shows all relations and how a person is referenced through time, to show how a person’s work influences his or her contemporaries and other generations of notable people. The view is constructed using text that has been transformed into RDF [35] and by querying the sentences with manually crafted links from the SPARQL service.

In order to visualize the named entities, a contextual reader application [24] was created. Similar visualizations of named entity data have been used in, e.g., DBpedia Spotlight (Footnote 15) [26] and Gate Cloud (Footnote 16) [25]. The WarSampo portal [13] and the Semantic Finlex portal [29] include contextual reader applications that have been configured to link text to ontologies in real time. These applications have influenced the creation of BiographySampo’s contextual reader. However, in our case the entities are not extracted in real time but in a preprocessing phase for more robust semantic disambiguation.