1 Introduction

A knowledge graph (KG) stores information in a (semi-)structured form to enable automatic data processing and interpretation. KGs are utilized in various Information Retrieval applications requiring semantic search [1, 11]. While there exist various large open-source KGs, such as YAGO-3 [25], Wikidata [38], or ConceptNet [33], they often contain orthogonal information and have their respective strengths and weaknesses. Hence, many applications require combining information from different knowledge graphs. An important subtask is identifying matching entities across several graphs, called entity alignment (EA). Recent years have witnessed substantial advances regarding the methodology, in particular involving graph neural networks (GNNs) [6, 7, 19, 28, 34,35,36,37, 40, 42, 44, 46]. Common to these approaches is that they use a set of given seed alignments and infer the remaining ones. While several benchmark datasets are equipped with alignments, acquiring them in practice is cumbersome and expensive, often requiring human annotators. To address this problem, we propose to use active learning for entity alignment. In summary, our contributions are as follows:

  • To the best of our knowledge, we are the first to propose using active learning for entity alignment in knowledge graphs. We investigate and formalize the problem, identify critical aspects, and highlight differences to the classical active learning setting for classification.

  • A special property of entity alignment is that learning focuses on information about aligned nodes. We show how to additionally utilize information about exclusive nodes in an active learning setting, which leads to significant improvements.

  • We propose several different heuristics, based upon node centrality, graph and embedding coverage, Bayesian model uncertainty, and certainty matching.

  • We thoroughly evaluate and discuss the heuristics’ empirical performance on a well-established benchmark dataset using a recent GNN-based model. Thereby, we show that state-of-the-art heuristics for classification tasks perform poorly compared to surprisingly simple node centrality based approaches.

2 Problem Setting

We study the problem of entity alignment for knowledge graphs (EA). A knowledge graph can be represented by the triple \(\mathcal {G} = (\mathcal {E}, \mathcal {R}, \mathcal {T})\), where \(\mathcal {E}\) is a set of entities, \(\mathcal {R}\) a set of relations, and \(\mathcal {T} \subseteq \mathcal {E} \times \mathcal {R} \times \mathcal {E}\) a set of triples. The alignment problem considers two such graphs \(\mathcal {G}^L, \mathcal {G}^R\) and seeks to identify entities common to both, together with their mapping. The mapping can be defined by the set of matching entity pairs \(\mathcal {A} = \{(e, e') \mid e \in \mathcal {E}^L, e' \in \mathcal {E}^R, e \equiv e' \}\), where \(\equiv \) denotes the matching relation. While some works use additional information such as attributes or entity labels, we consider solely the relational information of the graph structure. Thus, a subset of alignments \(\mathcal {A}_{train} \subseteq \mathcal {A}\) is provided, and the task is to infer the remaining alignments \(\mathcal {A}_{test} := \mathcal {A} \setminus \mathcal {A}_{train}\). With \(\mathcal {A}^L := \{e \in \mathcal {E}^L \mid \exists e' \in \mathcal {E}^R: (e, e') \in \mathcal {A}\}\) we denote the set of entities from \(\mathcal {G}^L\) which have a match in \(\mathcal {A}\), and define \(\mathcal {A}^R\) analogously. With \(\mathcal {X}^L = \mathcal {E}^L \setminus \mathcal {A}^L\) we denote the set of exclusive entities in the graph \(\mathcal {G}^L\), which occur neither in the train nor the test alignments, and define \(\mathcal {X}^R\) analogously.
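To make the notation concrete, the following minimal Python sketch spells out these sets on hypothetical toy graphs (all entity names are illustrative, not taken from any dataset):

```python
from typing import NamedTuple, Set, Tuple

class KG(NamedTuple):
    entities: Set[str]                   # E
    relations: Set[str]                  # R
    triples: Set[Tuple[str, str, str]]   # T, a subset of E x R x E

kg_left = KG(
    entities={"Berlin", "Germany", "Munich"},
    relations={"capital_of", "located_in"},
    triples={("Berlin", "capital_of", "Germany"),
             ("Munich", "located_in", "Germany")},
)
kg_right = KG(
    entities={"Berlin@de", "Deutschland", "Hamburg"},
    relations={"hauptstadt_von"},
    triples={("Berlin@de", "hauptstadt_von", "Deutschland")},
)

# The full alignment A; in practice only A_train is given.
alignment = {("Berlin", "Berlin@de"), ("Germany", "Deutschland")}

# A^L / A^R: entities with a match; X^L / X^R: exclusive entities.
aligned_left = {e for e, _ in alignment}
aligned_right = {e2 for _, e2 in alignment}
exclusive_left = kg_left.entities - aligned_left      # {"Munich"}
exclusive_right = kg_right.entities - aligned_right   # {"Hamburg"}
```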

In practice, obtaining high-quality training alignments means employing a human annotator. As knowledge graphs can become large, annotating a sufficient number of alignment pairs may require significant labeling effort and can be costly. Thus, we study strategies to select the most informative alignment labels in order to achieve higher performance with fewer labels, an approach commonly referred to as active learning. The following section surveys existing literature on active learning, with a particular focus on graphs, and highlights how our setting differs.

3 Related Work

Classical active learning approaches [31] often do not perform well in batch settings with neural network architectures. Therefore, developing active learning heuristics for neural networks is an active research area. New approaches have been proposed for image [2, 16, 18, 30, 39, 43], text [32, 45], and relational [5, 17, 23, 27, 41] data. Active learning algorithms aim to select the most informative training instances. For instance, the intuition behind uncertainty sampling [22] is that instances about which the model is unconfident carry new or not yet explored information. However, estimating the uncertainty of neural networks is not trivial, since they are often overconfident about their predictions [15]. One approach to tackle this problem is to use Monte-Carlo dropout to estimate the uncertainty for active learning heuristics [16, 27, 32]. Alternatively, [2] demonstrated that ensembles of different models lead to better uncertainty estimates and consequently better instance selection. The method described in [23] adopts a different approach and queries labels for those instances for which it is most certain that they are unlabeled. For this assessment, the authors propose an adversarial framework, where a discriminator differentiates between labeled and unlabeled data.

Geometric or density-based approaches [5, 17, 18, 30, 41, 43], on the other hand, aim to select the most representative instances. To this end, unlabeled instances are selected for labeling such that the labeled instances cover the unlabeled data in the embedding space. Other approaches estimate the informativeness of unlabeled samples using, e.g., the expected gradient length [45].

Active learning approaches with neural networks on relational data have so far been applied to node classification in homogeneous graphs [5, 17, 23, 41] and link prediction in knowledge graphs [27]. In [8, 9, 26], the authors propose active learning approaches for the graph matching problem, where the matching costs are known in advance and the goal is to minimize the assignment costs. Note that this differs from our task, where the goal is to learn meaningful representations of the entities.

4 Methodology

In this section, we introduce our proposed labeling setting and describe a data post-processing step that leverages exclusive nodes. Moreover, we propose numerous new labeling strategies: some take inspiration from existing state-of-the-art heuristics for classification, while others are newly developed based on our intuitions. Finally, we present our framework for evaluating the different heuristics.

4.1 Labeling Setting

Since we are dealing with matching KGs, where entities have meaningful labels, we assume that human annotators use these entity names for matching. Therefore, we see two different possibilities to formulate the labeling task:

  1. The system presents annotators with a possible matching pair, and they label it as True or False.

  2. The system presents annotators with a node from one of the two KGs, and the task is to find all matching nodes in the other KG.

It is easier to label a single instance in the first scenario, as it is a yes/no question. However, since each node can have more than one matching node in the other KG, \(|\mathcal {E}^L| \times |\mathcal {E}^R|\) queries are necessary to label the whole dataset. In contrast, in the second scenario, human annotators need a similar qualification, but the time spent per labeled instance increases because they have to search for possible matchings. However, the second scenario has the following advantages:

First, there are only \(|\mathcal {E}^L| + |\mathcal {E}^R|\) possible queries. Second, in both scenarios, the learning algorithm needs positive matchings to start training. Assuming \(|\mathcal {A}^L| \approx |\mathcal {A}^R| \approx |\mathcal {A}|\) and \(|\mathcal {E}^R| \approx |\mathcal {E}^L| \approx |\mathcal {E}|\), the probability of selecting a match with a random query is \(\nicefrac {|\mathcal {A}|}{|\mathcal {E}|^2}\) in the first scenario, whereas it is \(\nicefrac {|\mathcal {A}|}{|\mathcal {E}|}\) in the second. Additionally, in the second scenario, it is possible to start with simple graph-based heuristics, e.g., based on a graph centrality score like degree or betweenness. For many KGs, it is a valid assumption that the probability of having a match is higher for more central nodes. Cold-start labeling performance is especially relevant when the labeling budget is restricted. Third, the classical active learning scenario assumes that each query returns a valid label. However, for EA, the information that two nodes do not match is of limited value, since negative examples can also be obtained by negative sampling. In contrast, in the second scenario, we can use the information about missing matchings to adapt the dataset, see Sect. 4.2.
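For a rough sense of scale, with purely hypothetical round numbers (not taken from any particular dataset), assume \(|\mathcal {E}| = 15{,}000\) and \(|\mathcal {A}| = 7{,}500\). Then

$$ \frac{|\mathcal {A}|}{|\mathcal {E}|^2} = \frac{7{,}500}{15{,}000^2} \approx 3.3 \times 10^{-5} \quad \text{(first scenario)}, \qquad \frac{|\mathcal {A}|}{|\mathcal {E}|} = \frac{7{,}500}{15{,}000} = 0.5 \quad \text{(second scenario)}, $$

i.e., a random query in the second scenario is several orders of magnitude more likely to uncover a match.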

In this paper, we focus on the second scenario. However, heuristics relying on information from the matching model described in Sect. 4.3 can also be applied in the first scenario.

4.2 Dataset Adjustment

The EA task’s main motivation is either the fusion of knowledge into a single database or the exchange of information between different databases. In both cases, the primary assumption is that one KG contains information which is not available in the other. This information comes in the form of relations between aligned entities, relations with exclusive entities, or relations between exclusive entities. While larger differences between the KGs increase the value of their fusion, they also increase the difficulty of the matching process. One possibility to partially mitigate this problem is to enrich both KGs independently using link prediction and to transfer links between aligned entities in the training set [6, 23]. As this methodology only deals with missing relations between shared entities, in this work, we go a step further: since we control the labeling process, we naturally learn about exclusive nodes from the annotators. Therefore, we propose to remove the exclusive nodes from the KGs for the matching step. After the matching is finished, the exclusive nodes can be re-introduced. In the classical EA setting, where the KGs and partial alignments are already given and there is no control over dataset creation, the analogous removal of exclusive nodes is not possible: determining whether a node is exclusive or just not contained in the training alignment requires access to the test alignments, hence representing a form of test leakage.
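The adjustment itself amounts to filtering triples. A minimal sketch, assuming triples are stored as (head, relation, tail) tuples as in Sect. 2:

```python
def remove_exclusives(triples, exclusive_entities):
    """Drop all triples that touch a verified exclusive entity."""
    return {(h, r, t) for (h, r, t) in triples
            if h not in exclusive_entities and t not in exclusive_entities}

# During active learning, the set of verified exclusives grows with every
# labeled batch, so the filter is re-applied before each training phase:
# triples_left = remove_exclusives(all_triples_left, exclusives_found_so_far)
```

After matching, the removed nodes and their triples are simply added back.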

4.3 Active Learning Heuristics

The main goal of active learning approaches is to select the most informative set of examples. In our setting, each query results in either matches or verified exclusiveness, both providing new information. Nodes with an aligned node in the other KG contribute to the signal for supervised training. State-of-the-art GNN models for EA learn by aggregating the k-hop neighborhood of a node. Two matching nodes become similar during training when their aggregated neighborhoods are similar. Therefore, the centrality of identified alignments, or their coverage, is vital for performance. On the other hand, exclusive nodes improve training by making both KGs more similar. Since it is not clear from the outset what affects the final performance most, we analyze heuristics with different inductive biases.

Node Centrality – Selecting nodes with high centrality in the graph has the following effects: (a) a higher probability of a match in the opposite graph, and (b) updates for a larger number of neighbors in case of a match, or significant graph changes in case the node is exclusive. Although there is a large variety of centrality measures for graphs [10], we observed in initial experiments that they perform similarly. Therefore, in this work, we evaluate two heuristics based on a node's role in the graph. The first, the degree heuristic (denoted deg), orders nodes by their degree, and nodes with a higher degree are selected first. The second, the betweenness heuristic (betw), works analogously and relies on the betweenness centrality measure.
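A minimal sketch of both heuristics, using networkx as a stand-in implementation (the batching shown at the end is an illustrative assumption):

```python
import networkx as nx

def centrality_ranking(graph: nx.Graph, measure: str = "degree"):
    """Rank nodes by a centrality measure, most central first."""
    if measure == "degree":           # deg heuristic
        scores = dict(graph.degree())
    elif measure == "betweenness":    # betw heuristic
        scores = nx.betweenness_centrality(graph)
    else:
        raise ValueError(measure)
    return sorted(scores, key=scores.get, reverse=True)

# Querying a batch of b nodes takes the next b still-unlabeled nodes:
# batch = [n for n in centrality_ranking(g, "degree") if n in pool][:b]
```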

Graph Coverage – Real-world graphs tend to have densely connected components [12]. In this case, if nodes for labeling are selected according to some centrality measure, their neighborhoods may overlap significantly. At the same time, large portions of the graph receive no or infrequent updates. Therefore, we propose a heuristic that seeks to distribute labels across the graph. We adapt an approximate vertex cover algorithm [29] into an active learning heuristic for entity alignment: each node is initialized with a weight equal to its degree; subsequently, we select the node with the largest weight from both graphs, remove it from the candidate list, and decrease the weight of all its neighbors by one. We denote this heuristic as avc.
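A sketch of avc following the description above, applied per graph (the lazy-deletion heap is our own implementation choice, not prescribed by [29]):

```python
import heapq
import networkx as nx

def avc_order(graph: nx.Graph):
    """Yield nodes in approximate-vertex-cover order."""
    weight = dict(graph.degree())
    # Max-heap via negated weights; stale entries are skipped lazily.
    heap = [(-w, n) for n, w in weight.items()]
    heapq.heapify(heap)
    removed = set()
    while heap:
        neg_w, node = heapq.heappop(heap)
        if node in removed or -neg_w != weight[node]:
            continue  # outdated heap entry
        removed.add(node)
        yield node
        # Decrease the weight of all remaining neighbors by one.
        for nbr in graph.neighbors(node):
            if nbr not in removed:
                weight[nbr] -= 1
                heapq.heappush(heap, (-weight[nbr], nbr))
```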

Embedding Space Coverage – Embedding space coverage approaches aim to cover the parts of the embedding space containing data as well as possible. Here, we adapt the state-of-the-art coreset method [30] (denoted cs) to the EA task: we aim to represent each graph's embedding space by nodes with positive matchings. We adopt the greedy approach from [30], which in each step selects the object with the largest distance to its nearest neighbor among the already chosen items; its performance was similar to the mixed-integer program algorithm while being significantly faster. During node selection, it is not known whether nodes in the same batch have matchings or are exclusive. Therefore, in each step, each candidate node is scored according to its distance to the nearest positive matching or to the nodes already selected as potential positives in the same batch, and the node with the largest distance to the closest positive point is added to the batch.
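A sketch of this greedy farthest-first selection; the Euclidean metric and the NumPy shapes are illustrative assumptions:

```python
import numpy as np

def greedy_coreset(candidates: np.ndarray, positives: np.ndarray, b: int):
    """candidates: (n, d) embeddings; positives: (m, d); returns b indices."""
    # Distance of every candidate to its nearest known positive matching.
    dists = np.linalg.norm(
        candidates[:, None, :] - positives[None, :, :], axis=-1).min(axis=1)
    selected = []
    for _ in range(b):
        i = int(np.argmax(dists))  # farthest-first choice
        selected.append(i)
        # Treat the new pick as a potential positive within this batch.
        new_d = np.linalg.norm(candidates - candidates[i], axis=-1)
        dists = np.minimum(dists, new_d)
    return selected
```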

Embedding Space Coverage by Central Nodes – A possible disadvantage of the coreset heuristic in the context of entity alignment is that selected nodes may have low centrality and therefore affect only a small portion of the graph. Intuitively, this can happen because each next candidate is maximally distant from all nodes with positive matchings, which are expected to be more or less central. With this heuristic, we try to remedy this effect and sample nodes with high centrality in different parts of the embedding space. Therefore, in each step, we cluster the node representations of both graphs in the joint space, cf. Fig. 1. We count the already labeled nodes in each cluster and determine the number of candidates to select from that cluster, which is inversely proportional to the number of already labeled nodes in the cluster. We then use a node centrality based heuristic to select the chosen number of candidates per cluster. We denote this heuristic by esccn. A code sketch follows below.
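In the sketch, the use of k-means, the exact allocation formula, and the degree-based within-cluster ranking are assumptions filling in details the text leaves open:

```python
import numpy as np
from sklearn.cluster import KMeans

def esccn_batch(emb, degrees, labeled_mask, b, k=10):
    """emb: (n, d) joint embeddings; degrees: (n,) floats;
    labeled_mask: (n,) bools; returns up to b candidate indices."""
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(emb)
    # Per-cluster quota, inversely proportional to labeled nodes there.
    counts = np.array([(labeled_mask & (clusters == c)).sum() for c in range(k)])
    inv = 1.0 / (1.0 + counts)
    quota = np.floor(b * inv / inv.sum()).astype(int)
    batch = []
    for c in range(k):
        members = np.where((clusters == c) & ~labeled_mask)[0]
        # Within a cluster, pick the most central (highest-degree) nodes.
        ranked = members[np.argsort(-degrees[members])]
        batch.extend(ranked[:quota[c]].tolist())
    return batch[:b]
```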

Fig. 1. Schematic visualization of the esccn heuristic. The labeled nodes per cluster are counted and used to derive how many samples to draw from this cluster. Another heuristic is then used to select the specific number from the given clusters, e.g., a graph-based degree heuristic.

Uncertainty Matching – Uncertainty-based approaches are motivated by the idea that the most informative nodes are those for which the model is most uncertain about its prediction. We reformulate EA as a classification problem: the number of classes corresponds to the number of matching candidates, and we normalize the vector of similarities to the matching candidates with the softmax operation. A typical uncertainty metric for classification is the Shannon entropy of the class probability distribution, where large entropy corresponds to high uncertainty. We can employ Monte-Carlo dropout to compute a Bayesian approximation of the softmax for the entropy, similarly to [17]. However, repeatedly high entropy across multiple dropout masks indicates prediction uncertainty, i.e., the model is certain that a correct prediction is impossible. In the context of entity alignment, we expect high prediction uncertainty for the exclusive nodes, since the model may be certain about the lack of good matchings. Therefore, we opt for model uncertainty for entity alignment: model uncertainty is high if the model makes different (certain) decisions for the same instances in multiple runs [14]. We employ BALD [21] with Monte-Carlo dropout [17], which computes the difference between the entropy of the expected prediction and the expected entropy of the individual predictions. Note that a large number of classes may lead to similar entropy and BALD values across the whole dataset. To mitigate this effect, we employ a softmax temperature [20].
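A sketch of the BALD computation with Monte-Carlo dropout and softmax temperature; the logits tensor and its shape are assumptions about how the similarity scores would be collected:

```python
import numpy as np

def softmax(logits, temperature=1.0, axis=-1):
    z = logits / temperature
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, axis=-1):
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def bald_scores(logits, temperature=1.0):
    """logits: (T, n, C) from T dropout masks; returns (n,) BALD scores."""
    probs = softmax(logits, temperature)  # (T, n, C)
    mean_probs = probs.mean(axis=0)       # (n, C)
    # Mutual information: entropy of the mean prediction minus the
    # mean entropy of the individual predictions.
    return entropy(mean_probs) - entropy(probs).mean(axis=0)
```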

Fig. 2. Visualization of the scoring method of the prexp heuristic. We fit two normal distributions for matching and exclusive nodes. Each distribution models the maximum similarity these nodes have to any node in the other graph (\(s_{max}(q)\)). To assess the quality of a query q, we get its maximum similarity and evaluate \(P_{match}(s_{max}(e) \le s_{max}(q)) - P_{excl}(s_{max}(e) \ge s_{max}(q))\), i.e. the black area minus the red one. (Color figure online)

Certainty Matching – A distinctive property of EA is that the supervised learning signal is provided only by those labeled nodes that have a matching partner in the other graph. Therefore, we propose a heuristic that prefers nodes having matches in the opposite graph, named previous-experience-based (prexp). As the model is trained to yield high similarities between matching nodes, the node with maximum similarity is the most likely matching partner for a given node. Moreover, we expect that higher similarity values indicate a better match, such that we can utilize this maximum similarity as a matching score: \(s_{max}(e)=\max _{e' \in \mathcal {E}^R}similarity(e, e')\) for \(e \in \mathcal {E}^L\). Thus, we hypothesize that the distributions of the maximum similarity values of exclusive nodes and of nodes having a matching partner differ and can be used to distinguish the two categories. However, we note that the similarity distribution of already labeled nodes may differ from that of unlabeled ones, as the labeled nodes directly receive updates from the supervised loss signal. Hence, we use the historical similarity values acquired when unlabeled nodes were selected for labeling, together with the ground truth information about whether they had a match, received after labeling. Based on these, we fit two normal distributions over the maximum similarities: the first, with probability function \(P_{match}\), describes the distribution of the maximal similarity scores of nodes with matchings; analogously, \(P_{excl}\) describes the maximal similarity scores of exclusive nodes. For each entity in question q, we take its maximal similarity score to the candidates in the other graph and compute the difference \(P_{match}(s_{max}(e) \le s_{max}(q)) - P_{excl}(s_{max}(e) \ge s_{max}(q))\) as the heuristic score, cf. Fig. 2. This score is large if the maximal similarities of exclusive nodes tend to be smaller than those of nodes with matchings. We keep only entities with a score greater than a threshold t, where t is a hyperparameter. This way, we ensure that the score is used only if matching and exclusive nodes are distinguishable. If not enough entities fulfill this requirement, we use a simple fallback heuristic, e.g., degree, for the remaining nodes.
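A sketch of the prexp score; scipy's normal distribution stands in for the two fitted Gaussians, and the similarity histories are assumed to be collected during earlier labeling rounds:

```python
import numpy as np
from scipy.stats import norm

def prexp_scores(s_max, hist_match, hist_excl):
    """s_max: (n,) max similarities of candidates; hist_*: past values
    of s_max for nodes later verified as matching / exclusive."""
    p_match = norm(np.mean(hist_match), np.std(hist_match))
    p_excl = norm(np.mean(hist_excl), np.std(hist_excl))
    # P_match(s <= s_max(q)) minus P_excl(s >= s_max(q)), cf. Fig. 2.
    return p_match.cdf(s_max) - p_excl.sf(s_max)

# Nodes with score > t are queried; a fallback heuristic (e.g., degree)
# fills the batch when too few candidates pass the threshold.
```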

5 Evaluation Framework

Fig. 3. Visualization of the node categorization for \(\mathcal {E}^L = \{A, B, C\}\) and \(\mathcal {E}^R = \{D, E, F\}\). Solid lines represent training alignments, whereas dashed ones denote test alignments. Node B is the only exclusive node. All blue nodes are in the initial pool \(\mathcal {P}_0\). The red dashed nodes D and F may not be requested for labeling as they neither are exclusive nor participate in a training alignment. When node A is requested, only the alignment (A, E) is returned, and A, as well as E, become unavailable. The second training alignment (C, E) can still be obtained by requesting C. (Color figure online)

To evaluate active learning heuristics in-vitro, an alignment dataset comprising two graphs and labeled alignments is used. These alignments are split into training alignments \(\mathcal {A}_{train}\) and test alignments \(\mathcal {A}_{test}\). We employ an incremental, batch-wise, pool-based framework. At step i, there is a pool of potential queries \(\mathcal {P}_i \subseteq \left( \mathcal {E}^L \cup \mathcal {E}^R\right) \), from which a heuristic selects a fixed number of elements \(\mathcal {Q}_i \subseteq \mathcal {P}_i\), where \(b = |\mathcal {Q}_i|\) is often called the budget. These queries are then passed to an alignment oracle \(\mathfrak {O}\) simulating the labeling process and returning \( \mathfrak {O}(\mathcal {Q}_i) = (\mathcal {A}_{i}, \mathcal {X}_i^L, \mathcal {X}_i^R) \), where the first component comprises the discovered alignments \(\mathcal {A}_{i} = \{(a, a') \in \mathcal {A}_{train} \mid \{a, a'\} \cap \mathcal {Q}_i \ne \emptyset \}\), and the last two components the exclusive nodes \(\mathcal {X}_i^L = \mathcal {X}^L \cap \mathcal {Q}_i\), and \(\mathcal {X}_i^{R}\) analogously. Afterward, the labeled nodes are removed from the pool, i.e. \( \mathcal {P}_{i+1} = \mathcal {P}_{i} \setminus \left( \mathcal {A}_i^L \cup \mathcal {A}_i^R \cup \mathcal {X}_i^L \cup \mathcal {X}_i^R\right) \), where \(\mathcal {A}_i^L\) and \(\mathcal {A}_i^R\) denote the sets of left and right entities occurring in \(\mathcal {A}_i\). Note that when dealing with 1:n matchings, we remove all matches from the set of available nodes, although some of them have additional alignment partners. As each alignment edge can be retrieved using any of its endpoints, this does not pose a problem. Now, the model is trained with all alignments found so far, denoted by \(\mathcal {A}_{\le i}\), and without all exclusive nodes discovered so far, denoted by \(\mathcal {X}_{\le i}^L, \mathcal {X}_{\le i}^R\), given as

$$ \mathcal {A}_{\le i} = \bigcup _{j \le i} \mathcal {A}_j,\quad \mathcal {X}_{\le i}^L = \bigcup _{j \le i} \mathcal {X}_j^L,\quad \mathcal {X}_{\le i}^R = \bigcup _{j \le i} \mathcal {X}_j^R. $$

Following [27, 32], we do not reset the parameters but warm-start the model with the previous iteration's parameters. The pool is initialized with \( \mathcal {P}_{0}:= \mathcal {A}_{train}^L \cup \mathcal {A}_{train}^R \cup \mathcal {X}^L \cup \mathcal {X}^R. \) We exclude nodes that are contained not in the training alignments but only in the test alignments, since, in this case, either a test alignment would have to be revealed, or the node would have to be unfaithfully classified as exclusive. An illustration of the pool construction and an example query of size one is given in Fig. 3.
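A minimal sketch of the simulated oracle and the pool update, mirroring the set definitions above (data structures are illustrative):

```python
def oracle(queries, train_alignments, exclusive_left, exclusive_right):
    """Reveal alignments touched by queried nodes and verified exclusives."""
    found = {(a, b) for (a, b) in train_alignments
             if a in queries or b in queries}
    return found, exclusive_left & queries, exclusive_right & queries

def shrink_pool(pool, found, x_left, x_right):
    # Remove every endpoint of a found alignment and all verified
    # exclusives; with 1:n matchings, remaining partners stay reachable
    # via their other endpoint, so no alignment edge is lost.
    endpoints = {a for a, _ in found} | {b for _, b in found}
    return pool - endpoints - x_left - x_right
```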

6 Experiments

6.1 Setup

For evaluation, we use both subsets of the WK3l-15k dataset [7]. Similarly to [28], we extract additional entity alignments from the triple alignments. Besides using the official train-test split, we perform an additional 80-20 train-validation split shared across all runs. We additionally evaluate the transferability of hyperparameter settings: one of the challenges in active learning is that a hyperparameter search for a new dataset is impossible due to the lack of labeled data at the beginning. Therefore, for the evaluation on the second subset, en-fr, we use the best hyperparameter settings obtained on en-de and compare how consistent the results are across both subsets.

We employ a GNN-based model, GCN-Align [40], using the best settings found in [3]. To allow Monte-Carlo dropout estimation for the Bayesian heuristics, we additionally add a dropout layer between the embeddings and the GCN and vary the dropout rate. We employ a margin-based matching loss and exclude the so far identified exclusive nodes from the pool of negative samples. Following [2], we use 25 runs with different dropout masks for the Bayesian approaches. As evaluation protocol, we always retrieve 200 queries from the heuristic, update the exclusives and alignments using the oracle, and train the model for up to 4k epochs with early stopping on validation mean reciprocal rank (MRR), evaluated every 20 epochs with a patience of 200 epochs. There are different scores for the evaluation of entity alignment, assessing different performance aspects [4]. In this work, we report Hits@1 (H@1) on the test alignments, since this metric is most relevant in applications. We selected the heuristics' hyperparameters according to the AUC of the step vs. validation H@1 curve. Using the best hyperparameter configuration, we re-ran the experiments five times and report the mean and standard deviation of the results on the test set.
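For reference, a minimal sketch of the H@1 computation on the test alignments, simplified to the 1:1 case (index conventions are assumptions):

```python
import numpy as np

def hits_at_1(sim: np.ndarray, test_pairs):
    """sim: (|E^L|, |E^R|) similarity matrix; test_pairs: [(l, r), ...]."""
    hits = sum(int(np.argmax(sim[l]) == r) for l, r in test_pairs)
    return hits / len(test_pairs)
```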

Fig. 4. Performance vs. number of queries for the random baseline with different levels of dropout, and when removing exclusive nodes from message passing. Removing exclusives significantly improves the final performance.

6.2 Results

Removal of Exclusives – Figure 4 shows the test performance of the random selection baseline heuristic against the number of queries, with the standard deviation across five runs shown as shaded areas. As can be seen by comparing the two solid lines, removing exclusives is advantageous, in particular when many queries are performed, i.e., many exclusives are removed. Therefore, we focus the subsequent analysis only on the case when found exclusives are removed from the graph. Moreover, we can see that using a high dropout value of 0.5 is disadvantageous on both subsets. While a dropout value of 0.2 also hurts performance on the en-de subset, it does not have a negative influence on en-fr.

Fig. 5. Performance on test alignments vs. number of queries for different heuristics.

Table 1. Mean and standard deviation of the AUC of the number-of-queries vs. test Hits@1 curve, aggregated over five runs for each heuristic and subset. The * symbol indicates results that are significant compared to the rnd baseline according to an unequal variances t-test (Welch's t-test) with \(p < 0.01\).
Fig. 6. Performance on test alignments vs. number of queries for different heuristics. This figure shows only queries up to 2,000, i.e., the region where not many alignments have been found so far.

Comparison of Different Heuristics – Figure 5 compares the performance of the different heuristics over all steps. Since there is a large overlap between different heuristics, we additionally compute the AUC for each heuristic and report it in Table 1. From the results, we observe that our expectations about the performance of the different heuristics are mostly confirmed. Most of the heuristics perform significantly better than random sampling. Our intuitions about possible problems with coreset in the context of entity alignment are also verified: the heuristic performs consistently worse than the random sampling baseline. On the other hand, our new esccn heuristic, which also aims to cover the embedding space but uses the most central nodes instead, is one of the best performing heuristics. We also observe inferior performance of the uncertainty-based heuristic, which is comparable to the random heuristic. Note that we also evaluated the softmax entropy and maximal variation ratio heuristics from [17], and their performance was similar. Overall, we see similar patterns for both subsets: there is a set of well-performing heuristics, and their performance is very similar.

Performance in Earlier Stages – In many real-life applications, the labeling budget is limited; hence, the model performance in the first steps is of higher relevance. Therefore, in Fig. 6, we analyze the model performance in the first 2,000 queries. We observe that the esccn and betw heuristics compete for the first place, and that towards the end, they are superseded by other heuristics.

Fig. 7. Number of found training alignments vs. number of queries for different heuristics. This figure shows only queries up to 2,000, i.e., the region where not many alignments have been found so far.

Influence of Positive Matchings – In Fig. 7, we show the number of alignment pairs identified by each heuristic in the first 2,000 queries. For most heuristics, the plots look very similar to those in Fig. 6 above, which has the performance on the y-axis. In Fig. 4, we also saw that the removal of exclusive nodes affects the performance only in later iterations. Therefore, we conclude that finding nodes with matches is especially important in the early training stages.

On the whole, we conclude that node centrality based heuristics like betw are the right choice for active learning for entity alignment. They achieve performance comparable to model-based approaches and do not require access to model predictions during the labeling process. The labeling order can be precomputed and does not change, which also makes it possible to parallelize the labeling process for a fixed budget across multiple annotators, e.g., using systems such as Amazon Mechanical Turk.

7 Conclusion

In this paper, we introduced the novel task of active learning for entity alignment and discussed its differences to the classical active learning setting. Moreover, we proposed several different heuristics, both adaptations of existing heuristics used for classification and heuristics specifically designed for this particular task. In a thorough empirical analysis, we showed the strong performance of simple centrality- and graph-cover-based heuristics, while adaptations of state-of-the-art heuristics for classification showed inferior performance. For future work, we envision transferring our approaches to other graph matching problems, such as matching road networks [13] or approximating the graph edit distance [24]. Moreover, we aim to study the generalization of our findings to other datasets and models.