1 Introduction

Many datasets use different identifiers to refer to the same real-world entities, or may contain duplicates themselves. Automated methods for identifying and linking duplicate entities in knowledge graphs, also known as entity matching, are therefore necessary. A considerable difficulty with entity matching is that the total number of possible entity pairs is much larger than the number of actual (duplicate) entity pairs, also known as the problem of skewness [1, 10]. This extreme skewness can cause false positives to overwhelm the true positives, even for highly accurate classifiers. For this reason, many works use ranking techniques, and their associated metrics, to sort the possible entity pairs by some similarity measure, such that duplicate entity pairs are expected to appear at the top of the ranking [4, 8, 9, 11]. Other works, such as Saeedi et al. [7], perform blocking in the first stages to reduce the number of pairs that are evaluated. Furthermore, Raad et al. [5] start with a set of sameAs relations and use a community detection algorithm to associate an error degree with each sameAs relation. They show that when only taking sameAs relations with an error degree \(\le 0.4\), they achieve 100% accuracy within their random sample.

The identified set of pairs is generally required to satisfy certain structural properties, in particular transitivity. However, simply taking the transitive closure of the entity pairs identified by entity matching techniques may not work, as it can introduce many spurious entity pairs.

We propose the application of cluster editing for entity matching and set up a number of experiments to evaluate our proposal. We show that, compared to an ad-hoc enforcement of transitive closure on the identified pairs, our approach always results in fewer distinct entity pairs (i.e. higher precision) while retaining duplicate entity pairs (i.e. recall is not lowered).

The experiments are performed on semi-synthetic datasets that are generated by introducing duplicates into an existing dataset in a controlled manner. This yields a range of cluster distributions, allowing us to measure the effect of the number of clusters and of different cluster sizes.

2 Applying Cluster Editing on Matched Entities

An overview of our method is given in Fig. 1. We start with an embedding of a set of entities E, some of which may be duplicates, and use the Euclidean distance to measure their proximity (panel A of Fig. 1). For each entity \(e_i \in E\), we create k candidate pairs \((e_i,e_j)\), one for each of the k nearest neighbors \(e_j\) of \(e_i\), thereby addressing skewness by ruling out the vast majority of pairs.
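As an illustration of this candidate generation step, the sketch below uses scikit-learn's nearest-neighbor search (a minimal sketch, not the authors' implementation; the array `embeddings` and all other names are our own):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def candidate_pairs(embeddings: np.ndarray, k: int = 1) -> set[tuple[int, int]]:
    """Return unordered candidate pairs (i, j), where j is among the
    k nearest neighbors of i under the Euclidean distance."""
    # Ask for k + 1 neighbors, since each entity is returned as its own
    # nearest neighbor (at distance 0).
    nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(embeddings)
    _, indices = nn.kneighbors(embeddings)
    pairs: set[tuple[int, int]] = set()
    for i, neighbors in enumerate(indices):
        for j in map(int, neighbors):
            if j != i:                              # skip the entity itself
                pairs.add((min(i, j), max(i, j)))   # store pairs unordered
    return pairs
```

For \(k=1\) this keeps at most |E| candidate pairs, instead of the \(\binom{|E|}{2}\) possible pairs.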

The dotted lines in panel B illustrate the candidate pairs for \(k=1\). Moreover, we assume that a (small) subset of these candidate pairs is labeled by a domain expert (blue lines in panel B). The labeled pairs are used to train a probabilistic classifier. This classifier is used to determine, for each candidate pair \((e_i,e_j)\), the fitted probability \(p_{ij}\) that \(e_i\) and \(e_j\) are duplicates. Depending on the features used by the classifier, and its complexity, the fitted probabilities need not be proportional to the distance between entities. We do, however, assume that the features used by the classifier are symmetric, so that \(p_{ij}=p_{ji}\), and we can therefore regard a pair of entities as unordered. We then use a cut-off value \(\theta \) so that if \(p_{ij} > \theta \), then \(e_i\) and \(e_j\) are predicted to be duplicates (panel C). This “raw outcome” of the pairwise classifier, which represents the sameAs relation, may however violate the expected transitivity constraint. Obviously, an ad-hoc application of the transitive closure to the links predicted by the classifier never removes any links, but can only add new links (panel G). This may result in many spurious entity pairs.

A more principled method to restore transitivity is the cluster editing technique [3]. Here, we compute a weight \(w(i,j) = \log (\frac{p_{ij}}{1 - p_{ij}}) - \log (\frac{\theta }{1-\theta })\) for each pair of entities \((e_i,e_j)\) within the same connected component (regardless of whether it is a candidate pair or not), such that \(w(i,j)\) is positive if \(p_{ij} > \theta \), and negative otherwise (panel D). If \(w(i,j)\) is positive (negative), a link between i and j is provisionally assumed to be present (absent). The resulting set of links may, however, again violate the transitivity constraint. Cluster editing is used to restore transitivity by adding and/or removing links in such a way that the total score \(\sum _{(i,j)} w(i,j)x_{ij}\) is maximized, where \(x_{ij}=1\) if a link between i and j is present in the solution, and \(x_{ij}=0\) otherwise (panels E and F). More details about our method are available in [2].
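To make the weighting and editing steps concrete, the sketch below computes the weights and solves cluster editing on a single small connected component by brute force over all partitions of its nodes (a partition is equivalent to a transitively closed set of links). This is not the exact method referenced in [3]; it is only feasible for very small components, and `probabilities` and the other names are our own:

```python
import math
from itertools import combinations

def weight(p_ij: float, theta: float) -> float:
    """w(i, j): positive iff p_ij > theta (difference of log-odds)."""
    return math.log(p_ij / (1 - p_ij)) - math.log(theta / (1 - theta))

def partitions(items: list):
    """Yield every partition of `items` (Bell-number many; small inputs only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):            # add `first` to an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part                # or give `first` its own block

def cluster_edit(nodes: list, probabilities: dict, theta: float) -> list:
    """Return the clustering of `nodes` that maximizes the total score,
    i.e. the sum of w(i, j) over pairs placed in the same cluster.
    `probabilities[(i, j)]` must hold the fitted probability p_ij for every
    unordered pair of nodes in the connected component."""
    def score(part):
        return sum(weight(probabilities[(min(i, j), max(i, j))], theta)
                   for block in part for i, j in combinations(block, 2))
    return max(partitions(nodes), key=score)
```

The clusters returned by `cluster_edit` induce the edited set of links: \(x_{ij}=1\) exactly when \(e_i\) and \(e_j\) end up in the same cluster.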

Fig. 1. An overview of the entity matching process.

3 Results

The data source we use is an RDF version of Ecartico, a comprehensive collection of biographical data about, among others, painters, engravers and booksellers. These people worked in the Low Countries during the \(16^{th}\) and \(17^{th}\) centuries. This data source is actively curated, so we can be sure there are no duplicate records of persons. From the Ecartico graph, we have constructed a new graph containing all schema:Person entities that have values for the properties schema:name, schema:workLocation, and schema:hasOccupation. For each of these entities, we also copy the property-value pairs for schema:birthDate, schema:deathDate, schema:birthPlace, and schema:deathPlace when they are present. This resulted in a graph with eight properties (including rdf:type for the class schema:Person), all centered around that one RDF class.
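As a sketch of this extraction step (not the authors' code; the file names are placeholders and we assume the usual http://schema.org/ namespace), the subgraph can be built with a SPARQL CONSTRUCT query via rdflib:

```python
from rdflib import Graph

# Load the Ecartico RDF dump (file name is illustrative).
source = Graph().parse("ecartico.ttl", format="turtle")

query = """
PREFIX schema: <http://schema.org/>
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
CONSTRUCT {
  ?p rdf:type schema:Person ;
     schema:name ?name ; schema:workLocation ?loc ; schema:hasOccupation ?occ ;
     schema:birthDate ?bd ; schema:deathDate ?dd ;
     schema:birthPlace ?bp ; schema:deathPlace ?dp .
}
WHERE {
  # Required properties ...
  ?p rdf:type schema:Person ;
     schema:name ?name ; schema:workLocation ?loc ; schema:hasOccupation ?occ .
  # ... and optional ones, copied only when present.
  OPTIONAL { ?p schema:birthDate ?bd }   OPTIONAL { ?p schema:deathDate ?dd }
  OPTIONAL { ?p schema:birthPlace ?bp }  OPTIONAL { ?p schema:deathPlace ?dp }
}
"""
person_graph = source.query(query).graph      # the constructed person subgraph
person_graph.serialize("persons.ttl", format="turtle")
```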

We then introduce duplicates by uniformly sampling a percentage of entities and altering their URIs. For example, an entity with URI www.data.uu.nl/1234 may be modified to www.data.uu.nl/1234/3. We create three cluster distributions \(\mathcal {D}_{10}\), \(\mathcal {D}_{25}\) and \(\mathcal {D}_{50}\), determined by the percentage of entities that is sampled: 10, 25 or 50%. Figure 2 shows the resulting cluster distributions; with 10%, for instance, we expect most entities to be in a cluster of size 1, meaning they were not duplicated. When modifying 50% of entities, however, most entities are duplicated at least once, with the largest proportion duplicated twice. In this case non-duplicate entities are in the minority, making the entity matching problem considerably harder.
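As a simplified illustration of the URI alteration only (the sampling scheme that produces the exact cluster-size distributions of Fig. 2 is not reproduced here; the names and the suffix convention are our assumptions):

```python
import random

def duplicate_entities(uris: list[str], percentage: int, seed: int = 0) -> list[str]:
    """Uniformly sample `percentage` percent of the entity URIs and add one
    altered copy of each sampled URI, e.g.
    www.data.uu.nl/1234 -> www.data.uu.nl/1234/1 (the suffix numbers the copy)."""
    rng = random.Random(seed)
    sampled = rng.sample(uris, k=round(len(uris) * percentage / 100))
    return uris + [f"{uri}/1" for uri in sampled]
```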

Table 1 shows the results of our experiments. We denote the application of transitive closure with the subscript TC and the application of cluster editing with the subscript CE. For every value of \(\theta \in (0, 1)\) (in steps of 0.01) we compute an associated F\(\frac{1}{2}\)-score, as it is our experience that low precision has a larger negative impact (than low recall) on the performance of downstream systems such as SPARQL engines. The F\(\frac{1}{2}\)-score weights precision twice as heavily as recall. We average the F\(\frac{1}{2}\)-score over all values of \(\theta \) (100 values between 0 and 1) to denote the performance of a given combination of cluster distribution, classifier and features. We experimented with a logistic regression (LR) and a support vector machine (SVM) classifier. Both were trained using cosine similarity as the sole feature. Furthermore, we used just 100 labeled pairs to train each classifier, limiting the burden on the domain expert as much as possible.
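The metric is the standard F\(_\beta \)-score with \(\beta = \frac{1}{2}\); a minimal sketch of the per-threshold score and its average over thresholds (names are ours):

```python
def f_half(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta = 0.5 weights precision twice as heavily as recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def mean_f_half(per_theta: dict[float, tuple[float, float]]) -> float:
    """Average the F-1/2 score over all evaluated cut-offs theta, given a
    mapping theta -> (precision, recall)."""
    scores = [f_half(p, r) for p, r in per_theta.values()]
    return sum(scores) / len(scores)
```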

Fig. 2. Generated probability distributions of entity clusters of size 1 to 4 in the synthetic data. The values of each color sum to one.

Fig. 3. Left: relative number of predicted pairs vs. the actual number (red dotted line). Right: precision for transitive closure (red lines) and cluster editing (blue lines). (Color figure online)

Table 1. A comparison of the mean and maximum F\(\frac{1}{2}\)-scores, and the associated values of \(\theta \), for transitive closure (TC) and cluster editing (CE), per classifier and cluster distribution (\(\mathcal {D}_{10}, \mathcal {D}_{25}, \mathcal {D}_{50}\))

In all cases we observe that the application of cluster editing improves the resulting set of duplicates over the application of transitive closure. Furthermore, the optimal value for \(\theta \) is in most cases lower when cluster editing is applied, suggesting that a more lenient cut-off can be used while at the same time improving performance. Figure 3 (left) also shows how the transitive closure of the entity pair set tends to overestimate the number of duplicate pairs for low values of \(\theta \), where the number of predicted pairs is more than 4 times the actual number. Finally, Fig. 3 (right) shows that the cluster editing method consistently outperforms transitive closure in precision.

4 Conclusion and Future Work

In practice, entity matching methods are applied and the resulting entity pairs are used by, e.g., a reasoner in a SPARQL engine, which applies the transitive closure. This may introduce many spurious links, potentially creating large clusters of unrelated entities. We propose to apply cluster editing to create a set of links that is closed under transitivity, and show that, compared to the transitive closure, cluster editing always results in a set of duplicates that contains fewer distinct entity pairs (i.e. higher precision) while retaining duplicate entity pairs (i.e. recall is not lowered). The NP-hardness of cluster editing limits us to solving only relatively small connected components. There are, however, heuristic methods that enable larger components to be solved. These heuristic methods can effectively reduce the instance size of the problem and are fast when only a small number of edits is allowed [6]. Additionally, the Louvain community detection technique proposed by Raad et al. [5] for detecting erroneous links can be applied to the connected components formed by the candidate pairs. It would be interesting to see how it compares to other (heuristic) clustering techniques.