1 Introduction

Nowadays, large knowledge bases (KBs) such as DBpedia (footnote 1) and Wikidata (footnote 2) are available as linked data under open licenses. Exploiting equivalences between entities across these KBs is crucial for data-driven applications that need, for example, to obtain additional data about an entity from several KBs or to support disambiguation operations.

We conducted an observational study of the virtual graph formed by equivalence relations between entities of 8 open KBs for entities of type agent (persons, organizations) in cultural heritage (CH) data. In particular, we measured the quantity of equivalences that this graph could provide for a dataset from Europeana (footnote 3) containing references to agents in descriptions of CH objects.

This study provides insights about the equivalence links across KBs and the potential benefits of crawling this virtual equivalence graph for discovering equivalences of agents referred to in datasets. It is informative for future research and for designing innovative applications, such as that of Europeana, which seeks to acquire agent name variants/translations or extra biographical information [1].

Section 2 describes related work on linked data and equivalence graphs. Section 3 presents how the study was conducted. Section 4 details the results and their analysis. Section 5 highlights our conclusions and presents future work.

2 Related Work

The exploitation of KB equivalence links for specific applications has been addressed earlier. Beek et al. (2018) have gathered the largest dataset of owl:sameAs statements from the web of data [2]. Similarly to us, Correndo et al. (2012) have conducted a statistical and qualitative analysis of the graph of instance level equivalences, and explored their use for computing alignments at conceptual level [3].

Research on the quality of linked data equivalence statements is relevant for us. It has notably reported uses (sometimes incorrect) of owl:sameAs to represent different degrees of equivalence [4, 5, 6]. Work on linked data aggregation and cleaning [7, 8] has also revealed data quality to be a challenge at both the semantic and the syntactic level [9, 10]. Especially relevant for us, an empirical study by Asprino et al. (2019) investigated the modelling style and the general structure of linked open data, including issues for the equivalence graphs formed by interlinking [11].

Regarding CH, the creation of KBs has been a long-term practice that started much earlier than the emergence of the Semantic Web. In this domain, however, the stated equivalences between major open KBs have not been studied recently.

3 Design of the Study

We have conducted an observational study gathering the existing equivalence relations between entities across 8 KBs:

  • DBpedia - a multilingual KB created by extracting structured data from Wikipedia.

  • data.bnf.fr (BnF) - a project by the French National Library that makes available data about bibliographic entities.

  • datos.bne.es (BNE) - a KB of bibliographic data by the National Library of Spain.

  • Library of Congress Names (NAF; footnote 4) - a KB that provides authoritative data for names of persons, organizations, events, places, and titles.

  • The Union List of Artist Names (ULAN; footnote 5) - a KB containing names, relationships, notes, sources, and biographical information for artists.

  • Gemeinsame Normdatei (GND; footnote 6) - a KB for personal names, subject headings and corporate bodies, managed mainly by the German National Library.

  • Virtual International Authority File (VIAF; footnote 7) - a cooperation of OCLC with mainly national libraries, combining multiple KBs from libraries, archives and museums.

  • Wikidata - a collaborative KB hosted by the Wikimedia Foundation.

By considering the transitive closure of the resulting compound set of equivalence statements, one obtains a virtual equivalence graph with entities from all the KBs as nodes. Our study was divided into two parts.
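The transitive closure described above can be illustrated with a union-find structure that merges the two URIs of every equivalence statement into one class. This is only a minimal sketch, not the tooling used in the study: the function name `equivalence_classes`, the pair-based input format and the example identifiers are assumptions made for illustration, and the direction of each statement is ignored.

```python
from collections import defaultdict


def equivalence_classes(statements):
    """Group URIs into classes under the transitive closure of the statements.

    `statements` is an iterable of (subject_uri, object_uri) pairs.
    Assumption of this sketch: every statement is treated as symmetric.
    """
    parent = {}

    def find(uri):
        # Register unseen URIs as their own root.
        parent.setdefault(uri, uri)
        root = uri
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node directly at the root.
        while parent[uri] != root:
            parent[uri], uri = root, parent[uri]
        return root

    for subject, obj in statements:
        root_s, root_o = find(subject), find(obj)
        if root_s != root_o:
            parent[root_o] = root_s

    classes = defaultdict(set)
    for uri in parent:
        classes[find(uri)].add(uri)
    return list(classes.values())


# Hypothetical identifiers, for illustration only.
example = [
    ("dbpedia:A", "wikidata:Q1"),
    ("wikidata:Q1", "viaf:1"),
    ("gnd:9", "bnf:9"),
]
groups = equivalence_classes(example)
```

In this example, `dbpedia:A` and `viaf:1` end up in the same class although no KB links them directly, which is the effect the virtual equivalence graph relies on.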

First, we measured the number of stated equivalence relations between KBs by considering all the equivalences asserted by at least one KB, not using any additional external sources. The statements were collected preferably via SPARQL, and otherwise via a file-based RDF distribution of the KB. We collected all statements where the property was one of the following (footnote 8): owl:sameAs, skos:exactMatch, skos:closeMatch, or schema:sameAs. This selection was based on a preliminary profiling of the KBs, where we found these standard properties to be the most often used for representing equivalence.
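As an illustration of how such statements might be requested from a SPARQL endpoint, the sketch below assembles a query restricted to the four properties through a VALUES clause. The exact queries used in the study are not given here, so the query shape, the function name `build_equivalence_query` and the choice of the `http://schema.org/` namespace (some datasets use the `https` variant) are assumptions.

```python
# Full IRIs of the four equivalence properties named in the text.
# Assumption: schema.org terms are published under the http:// namespace.
EQUIVALENCE_PROPERTIES = {
    "owl:sameAs": "http://www.w3.org/2002/07/owl#sameAs",
    "skos:exactMatch": "http://www.w3.org/2004/02/skos/core#exactMatch",
    "skos:closeMatch": "http://www.w3.org/2004/02/skos/core#closeMatch",
    "schema:sameAs": "http://schema.org/sameAs",
}


def build_equivalence_query(limit=None):
    """Return a SPARQL SELECT query for statements using any of the properties."""
    values = " ".join(f"<{iri}>" for iri in EQUIVALENCE_PROPERTIES.values())
    query = (
        "SELECT ?s ?p ?o WHERE {\n"
        f"  VALUES ?p {{ {values} }}\n"
        "  ?s ?p ?o .\n"
        "}"
    )
    if limit is not None:
        # Optional cap, useful when testing against a public endpoint.
        query += f"\nLIMIT {int(limit)}"
    return query


query = build_equivalence_query(limit=10)
```

Keeping the property list in one mapping makes it straightforward to substitute KB-specific properties where a KB does not use the standard ones.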

In the second part of the study, we focused on the entity type agent, and measured the quantity of equivalences that the joint equivalence graph could provide for a dataset containing references to agents in descriptions of CH objects.

The first task was to create a set of URIs referring to agents. For this purpose, we used the APIs (footnote 9) for accessing and querying the dataset aggregated by Europeana. We located 1,164,323 unique RDF resources about agents used by the Europeana data providers (footnote 10). From these we excluded all anonymous (blank) nodes and all the URIs that contain a URI fragment appended to the URI of the CH object. These resources without a “real” identifier are likely to correspond to cases where the agent does not come from a pre-existing controlled, “authoritative” KB, but was created ad hoc for the description of the cultural object. The resulting set contains 286,090 unique agent URIs, and the majority of them belong to a KB in our study, as Table 1 shows.
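The two exclusion rules can be sketched as a simple filter. The input format (pairs of agent identifier and CH object URI), the `_:` prefix used to recognise blank nodes, and the function names are assumptions made for this illustration; the actual selection procedure may have differed in its details.

```python
from urllib.parse import urldefrag


def is_authoritative_candidate(agent_id, object_uri):
    """Return True if the agent identifier survives both exclusion rules."""
    # Rule 1: drop anonymous (blank) nodes, assumed here to be
    # serialised with the "_:" prefix.
    if agent_id.startswith("_:"):
        return False
    # Rule 2: drop URIs that are the CH object's URI plus a fragment.
    base, fragment = urldefrag(agent_id)
    if fragment and base == object_uri:
        return False
    return True


def select_agent_uris(references):
    """Collect the unique agent URIs from (agent_id, object_uri) pairs."""
    return {
        agent_id
        for agent_id, object_uri in references
        if is_authoritative_candidate(agent_id, object_uri)
    }


# Hypothetical references, for illustration only.
references = [
    ("_:b0", "http://example.org/item/1"),
    ("http://example.org/item/1#agent1", "http://example.org/item/1"),
    ("http://viaf.org/viaf/12345", "http://example.org/item/1"),
    ("http://viaf.org/viaf/12345", "http://example.org/item/2"),
    ("http://example.org/vocab#person7", "http://example.org/item/2"),
]
agent_uris = select_agent_uris(references)
```

A URI with a fragment is kept when its base differs from the CH object's URI, since the fragment alone does not indicate an ad hoc identifier.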

Table 1. Amounts of unique URIs in the set from Europeana that belong to a KB in the study

The set of agent URIs was then used to initiate a series of crawling iterations of the equivalence graph. The crawler was instructed to follow the statements with any of the four equivalence properties listed earlier in this section. It assumes that all properties are transitive, including skos:closeMatch, and that transitivity applies across all types of properties (footnote 11). In the first iteration, we directly crawled the agent URIs and gathered all the equivalence relations their KB contained for them. From the second iteration onwards, the crawler obtained equivalent agent URIs by searching in the KBs for any URI that was collected in previous crawling iterations and adding the URIs that these KBs declared to be equivalent to the original ones. At the end of each iteration, the crawler generated a report about the newly found equivalent URIs. We repeated the crawling process for newly found equivalences until the increase of URIs resulting from one iteration was negligible.
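The iterative procedure can be sketched as a frontier-based crawl. This is a simplified illustration rather than the crawler used in the study: `lookup` is a hypothetical callback standing in for the searches across the KBs, and the `min_growth` and `max_iterations` parameters are assumed values for what counts as a "negligible" increase.

```python
def crawl_equivalences(seed_uris, lookup, min_growth=0.001, max_iterations=10):
    """Crawl the equivalence graph starting from `seed_uris`.

    `lookup(uri)` is assumed to return the URIs that any KB declares
    equivalent to `uri`. Returns the equivalent URIs found and a report
    with the number of newly found URIs per iteration.
    """
    seeds = set(seed_uris)
    known = set(seeds)
    frontier = set(seeds)
    report = []
    for _ in range(max_iterations):
        found = set()
        for uri in frontier:
            found.update(lookup(uri))
        new = found - known
        # Growth is measured against the equivalent URIs gathered so far.
        previous_total = len(known) - len(seeds)
        known |= new
        report.append(len(new))
        if not new:
            break
        if previous_total and len(new) / previous_total < min_growth:
            break
        # Only newly found URIs need to be looked up in the next iteration.
        frontier = new
    return known - seeds, report


# Hypothetical link table, for illustration only.
links = {
    "e:1": ["viaf:1"],
    "viaf:1": ["wd:Q1", "e:1"],
    "wd:Q1": ["dbp:A"],
    "dbp:A": [],
}
equivalents, report = crawl_equivalences(["e:1"], lambda uri: links.get(uri, []))
```

Treating every property as transitive corresponds to following all links regardless of which property declared them, which is why `lookup` returns plain URIs without property information.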

4 Results

The study provided informative results on four aspects of the KBs and their virtual equivalence graph. Each aspect is presented in the following subsections.

4.1 Existing Equivalences Between Knowledge Bases

We made two measurements of the equivalence statements between the KBs. The first measurement considered all types of equivalences, and the second considered solely skos:closeMatch equivalences. We measured the skos:closeMatch equivalences separately because this property expresses equivalence with a degree of uncertainty, while the three others seek to capture exact equivalence, a distinction that may be important for many applications.

Table 2 presents the results considering the 4 properties for equivalence, showing the number of statements in which a KB publishes an equivalence to another KB and in which other KBs publish an equivalence to the KB being considered. The table also shows the number of KBs linked by equivalences to each KB. In total, 60,307,328 equivalences are stated in the 8 KBs.

Table 2. The amounts of equivalence statements involving each knowledge base.

The results show high interconnection between KBs. All KBs express equivalences to at least one other KB, and all KBs are the target of equivalences stated in at least one KB. An interesting observation is that 3 out of the 8 KBs are focused only on agents (VIAF, NAF and ULAN), and 2 of them, VIAF and NAF, are among the 3 most linked KBs. GND is the second most linked KB, and the most linked of the KBs that cover more than one entity type, followed by Wikidata and DBpedia.

skos:closeMatch equivalences are much less frequent than the exact equivalences and only two KBs use them: BnF and ULAN. They represent only 1.5% of the total amount of equivalences stated by BnF. ULAN applies skos:closeMatch more frequently, reaching nearly 50% of the equivalences published. Overall, 192,300 statements use the skos:closeMatch predicate, which represents only 0.3% of all the equivalences stated by the studied KBs (Table 3).

Table 3. The amounts of skos:closeMatch statements involving each knowledge base.

4.2 Crawling of the Equivalences for agent URIs

The results of the crawling iterations on the URIs of Europeana are shown in Table 4. After the 1st iteration (i.e., crawling starting from the URIs in the Europeana set alone) we found 50,112 equivalent URIs. The number of gathered equivalences increased steeply in the first 3 crawling iterations: from the 1st crawl to the 2nd it increased by 588%, and on the 3rd iteration by a further 42%. The increase was 0.76% in the 4th iteration and under 0.1% in the 5th, so we opted to analyse and report on the results up to and including the 4th iteration. Only 3 iterations were needed to collect 99% of the equivalences. Although not all KBs are directly connected by equivalences, this shows that equivalent agent instances are closely connected in the equivalence graph.
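The 99% figure can be checked against the reported growth rates with a short calculation. The sketch assumes that each percentage is the increase relative to the cumulative number of equivalences gathered before that iteration, which is an interpretation of the text and not stated explicitly; the function name `share_collected` is likewise an assumption.

```python
def share_collected(later_growth_rates):
    """Lower bound on the share already collected, given later growth rates.

    Each rate is assumed to be relative to the cumulative total gathered
    before the corresponding iteration.
    """
    total = 1.0
    for rate in later_growth_rates:
        total *= 1.0 + rate
    return 1.0 / total


# Growth reported for the 4th iteration, and the upper bound ("under 0.1%")
# reported for the 5th.
share_after_third = share_collected([0.0076, 0.001])
```

Under this reading, `share_after_third` is slightly above 0.99, which is consistent with three iterations collecting 99% of the equivalences gathered after five.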

Table 4. The results of the 4 crawling iterations of the Europeana set of agent URIs.

VIAF was the KB with the highest number of equivalent URIs found. After the 4th crawling iteration, 60.7% of the set had equivalent VIAF URIs. Wikidata had the 2nd highest number of equivalences, reaching 34.5%.

For 3 KBs, less than 10% of the set had equivalences: ULAN, BNE and GND. The lower result for ULAN was expected since it is focused on artists. GND was the KB with the most URIs in the Europeana set; its low result can therefore be explained by the fact that, for the GND URIs in the set, only equivalences to other KBs could be found. The result for BNE may also be explained by its high presence in the Europeana set.

For researchers and practitioners designing innovative systems based on agent linked data, the choice of one or more KBs will always be highly influenced by the specific domain of application. Nevertheless, the results of the study indicate that VIAF is the most linked KB, and in future work we would therefore like to further exploit its data and equivalences.

4.3 Compliance with Semantic Web Standards

One of our initial observations during the study was that Wikidata is the only KB that does not use the standard equivalence properties. In fact, in an earlier study on Wikidata’s data about CH resources [12], we observed that it uses a very limited number of the standard Semantic Web “meta-modeling” properties. During the current study, we observed that owl:sameAs is in use only for internal equivalences between Wikidata’s entities. None of skos:exactMatch, skos:closeMatch, or schema:sameAs is used.

Instead, Wikidata uses its own wdt:P2888 (exact match), and a set of properties categorized as External identifiers (footnote 12). Each of these External identifier properties represents the local identifier for a Wikidata resource within the external information space of a particular institution or dataset. The values of statements with these properties are usually not URIs, and when a local identifier can be transformed into a URI, the definition of the property contains the formatting string for deriving the URI from the local identifier (footnote 13). We have identified 159 properties of type External identifier from which a URI could be derived.
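Deriving a URI from a local identifier then amounts to a string substitution. The sketch below assumes that the formatting string marks the position of the identifier with a `$1` placeholder, which is the convention we understand Wikidata to use. The function name and the example pattern and identifier are illustrative and not taken from the study.

```python
def uri_from_external_id(formatter, local_id, placeholder="$1"):
    """Derive a URI by inserting a local identifier into a formatting string.

    Assumption: the formatting string contains `placeholder` exactly where
    the local identifier belongs.
    """
    if placeholder not in formatter:
        raise ValueError(f"formatter has no {placeholder!r} placeholder")
    return formatter.replace(placeholder, local_id)


# Illustrative pattern and identifier, not values taken from Wikidata.
uri = uri_from_external_id("http://viaf.org/viaf/$1", "12345")
```

External identifier properties whose definition offers no such formatting string cannot be turned into URIs this way, which is presumably why only a subset of them (159 in the study) could be used.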

We collected Wikidata’s equivalences via its SPARQL endpoint, so we adapted our SPARQL queries to use the corresponding Wikidata properties. We also adapted the tools for analysing the equivalence graph so that the Wikidata properties would be treated as exact equivalences.

4.4 Data Quality of the Equivalence Statements

Our study did not aim to address the quality of equivalence statements, but we did come across a problem that blocked our crawling experiment and forced us to find a solution. This problem was caused by four URIs used in 77,379 equivalence statements by VIAF, which seem plainly wrong (footnote 14). Besides establishing wrong equivalences, this problem posed difficulties for crawling the equivalence graph: it would take several (probably many) additional iterations for the number of equivalent URIs to stabilize, and very large groups of equivalent URIs would be formed. To bypass the problem, we filtered out such incorrect URIs by detecting major outliers relative to the mean number of equivalences per URI. This mean was 1.006 in VIAF, whereas each of these four URIs was present in thousands of equivalence statements. The outlier URIs were discarded when we repeated the crawling process and were therefore excluded from our study.
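A filter of this kind can be sketched by counting how often each URI occurs as the target of an equivalence statement and flagging those far above the mean. The text does not give the exact criterion used, so the `factor` threshold, the decision to count target URIs and the function names are assumptions made for illustration.

```python
from collections import Counter


def find_outlier_uris(statements, factor=1000):
    """Return target URIs whose statement count exceeds `factor` times the mean.

    `statements` is an iterable of (subject_uri, object_uri) pairs.
    The threshold `factor` is a hypothetical value.
    """
    counts = Counter(obj for _, obj in statements)
    if not counts:
        return set()
    mean = sum(counts.values()) / len(counts)
    return {uri for uri, n in counts.items() if n > factor * mean}


def drop_outliers(statements, outliers):
    """Remove every statement that points to an outlier URI."""
    return [(s, o) for s, o in statements if o not in outliers]


# Synthetic data: 5000 ordinary links plus one URI targeted 3000 times.
statements = [(f"viaf:{i}", f"ext:{i}") for i in range(5000)]
statements += [(f"viaf:x{i}", "ext:bad") for i in range(3000)]
outliers = find_outlier_uris(statements)
cleaned = drop_outliers(statements, outliers)
```

Because the mean itself is inflated by the outliers, a generous threshold is used here; with a mean close to 1 and outliers occurring thousands of times, the separation is wide enough for such a crude test.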

5 Conclusion and Future Work

The results obtained in our study confirm that the agents in KBs are highly interlinked. This high level of interlinking is in accordance with earlier studies of general owl:sameAs usage [3, 11] and the reports from the publishers of the CH KBs on the work they have carried out (footnote 15). The study also highlights that the majority of equivalences are expressed with exact equivalence predicates (like owl:sameAs), while matches with uncertainty (skos:closeMatch) are a minority of 0.3%.

Although not every KB is directly linked to all other KBs, all KBs are both a source and a target of equivalence links. Crawling of the agent URIs used in Europeana shows that only a few crawling iterations of the equivalence graph are needed to acquire a nearly complete set of equivalences from all KBs. Three iterations were enough to collect 99% of the equivalences gathered after five iterations.

VIAF is the KB with the highest number of agent equivalences, followed by Wikidata. An equivalent VIAF URI was found for 60.7% of Europeana’s agent URIs, and an equivalent Wikidata URI for 34.5%.

Future work includes the detection of possibly incorrect equivalences, since this study, like earlier research [4], has detected some quality issues in the (owl:sameAs) links. Conversely, it would be interesting to estimate recall issues, i.e. whether many new links could (and should) be created across KBs via automatic or manual alignment.