
1 Introduction

A knowledge base (KB) is a collection of facts about the world. This information is often stored as structured records over which inference and search engines are run to answer user questions. KBs can generally be classified into curated KBs and open KBs. A curated KB is one that is manually and collaboratively created. Examples include Freebase  [5], DBpedia  [1], Wikidata  [28] and YAGO  [26]. A curated KB models knowledge as entities and the relations among them. For example, the fact “Apple was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne” is represented by four entities “Apple Incorporated”, “Steven Paul Jobs”, “Stephen Gary Wozniak”, and “Ronald Wayne”; and the relation “founded-by” that connects the company to the three founders. Because curated KBs are manually created, they are accurate and unambiguous. In particular, each entity is given a unique id that helps distinguish different entities that share the same name (e.g., Apple (company) vs. Apple (fruit)). The drawbacks of curated KBs, however, are inefficient updates and limited scope. An open KB (OKB)  [11, 31] is constructed by collecting vast amounts of web documents and applying open information extraction (OIE) on the documents to extract assertions. Some OIE systems include TextRunner  [2], ReVerb  [10], OLLIE  [22] and ClausIE  [7]. Assertions are extracted from document sentences and are represented in the form of \(\langle subject ; relation ; object \rangle \) triples. For example,

\(A_1: \langle \)Mumbai; is the largest city of; India\(\rangle \)

\(A_2: \langle \)Bombay; is the capital of; Maharashtra\(\rangle \)

\(A_3: \langle \)Bombay; is the economic hub of; India\(\rangle \)

are three assertions extracted by ReVerb from the ClueWeb09 corpus, which consists of around 500 million English Web pages. An entity name is a string that describes an entity, e.g., “Mumbai”. An entity mention is an occurrence of an entity name in an assertion (e.g., there are two entity mentions of the entity name “Bombay” in assertions \(A_2\) and \(A_3\)). OKBs can be automatically constructed, which gives them the advantages of wider scope and being up-to-date.

An important issue with OKBs is the ambiguity among entities and relations. For example, “Mumbai” and “Bombay” are two different entity names, but they refer to the same (city) entity; “Apple”, on the other hand, can refer to a company or a fruit. In order to properly answer queries using OKBs, we need to perform entity resolution (ER), which is the process of determining the physical entity that a given entity mention refers to  [3, 4].

One approach to solving the ER problem is entity linking (EL)  [9, 12, 14, 16, 21, 23, 24, 29]. Given an entity mention with a name x, EL identifies an entity e in a curated KB that is most likely the one that x refers to. All entity mentions that are linked to e are then treated as referring to the same entity. For EL to work, an entity mention in an OKB has to have an equivalent entity in a curated KB. This obviously limits EL’s applicability. For example, the dataset provided by  [27] consists of 45,031 assertions extracted by ReVerb from the ClueWeb09 corpus. In this dataset, about 23.7% of the entity mentions cannot be linked to any Wikipedia entity.

In this paper, we study canonicalization as another approach to solving ER. In  [13], canonicalization is done by applying hierarchical agglomerative clustering (HAC) to cluster assertions (and thus the entity mentions of the assertions). The idea is to group assertions with similar entity names and context into clusters. Entity mentions of assertions that are grouped in the same cluster are then considered to refer to the same entity. A canonical form is then given to represent the subject names of these mentions. In  [27], Vashishth et al. propose the CESI algorithm, which uses assertion embeddings  [18] as features for clustering. It is shown that CESI generally outperforms the algorithms given in  [13].

The above methods follow a similar clustering framework in which a certain similarity measure is applied. In practice, it is tricky to control the similarity threshold based on which clusters are formed. For example, “Barack Obama” and “Barack Hussein Obama” are highly similar in terms of word overlap, but “Mumbai” and “Bombay” are totally different strings even though they refer to the same city, so no single threshold handles both cases well. In this paper we propose a new approach called Multi-Level Canonicalization with Embeddings (MULCE). MULCE utilizes BERT  [8], a state-of-the-art language model, and GloVe  [20], a widely used word embedding model, to construct assertion embeddings. The key difference between MULCE and existing methods is that MULCE splits the clustering process into two steps: the first step clusters assertions at a coarse granularity, and the second step refines the coarse clusters. Our experiments show that MULCE outperforms existing methods and produces high-quality canonicalization results.

The rest of the paper is organized as follows. Section 2 summarizes related work. Section 3 presents MULCE. Section 4 presents experimental results. Finally, Sect. 5 concludes the paper.

2 Related Work

In this section we briefly describe related work on three topics, namely, noun phrase clustering, entity linking, and word embeddings.

[Noun Phrase Clustering]. Noun phrase clustering solves the entity resolution problem by clustering entity mentions. Example works include ConceptResolver  [15], Resolver  [30], the method of Galárraga et al.  [13], and CESI  [27].

ConceptResolver is designed to process noun phrases extracted by the OIE system NELL  [6]. It operates in two phases. The first phase performs disambiguation under the one-sense-per-category assumption. For example, apple can be a company or a fruit, but there cannot be two companies called Apple. Entity names are type-augmented (e.g., apple becomes apple:company or apple:fruit). The second phase uses HAC to cluster entity mentions under each category.

Resolver clusters entity mentions derived from TextRunner’s Open IE triples. The similarity of two given entity mentions is based on the string similarity of their names, as well as the similarity of the relation phrases and objects of the assertions that contain the mentions. To improve HAC efficiency, some pruning techniques are applied.

Galárraga et al.  [13] proposed the blocking-clustering-merging framework. Given an OKB, a canopy is created for each word found in some entity name of the assertions. A canopy \(P_w\) is a set of assertions such that an assertion \(A_i \in P_w\) if the entity name of \(A_i\) contains the word w. For example, consider the assertion \(A: \langle \)Barack Obama; is the president of; the US\(\rangle \). The entity name “Barack Obama” induces two canopies, \(P_{ Barack }\) and \(P_{ Obama }\), and the assertion A is a member of both. The process of constructing canopies is called token blocking. HAC is then performed within each canopy. In  [13], it is shown that IDF Token Overlap is the most effective of the similarity measures evaluated. Specifically, the similarity of two entity names is given by the number of overlapping words between them, with each word weighted according to its inverse document frequency. In the following, we call this algorithm G-IDF.
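To illustrate the token-blocking idea, the following is a minimal Python sketch (with hypothetical toy assertions) that builds the canopies; it shows only the indexing step, not the full blocking-clustering-merging pipeline of  [13].

```python
from collections import defaultdict

# Toy assertions: (subject, relation, object) triples (hypothetical examples).
assertions = [
    ("Barack Obama", "is the president of", "the US"),
    ("Obama", "was born in", "Hawaii"),
    ("Michelle Obama", "is married to", "Barack Obama"),
]

# Token blocking: canopy P_w holds every assertion whose subject contains word w.
canopies = defaultdict(set)
for idx, (subject, _, _) in enumerate(assertions):
    for word in subject.lower().split():
        canopies[word].add(idx)

# The assertion about "Barack Obama" falls into both P_barack and P_obama.
print(canopies["barack"])  # {0, 2}
print(canopies["obama"])   # {0, 1, 2}
```

HAC with the IDF token overlap similarity would then be run separately inside each canopy.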

CESI  [27] adopts a comprehensive word-embedding model, which is mainly based on GloVe. Furthermore, CESI utilizes side information such as entity linking results and a paraphrase database. CESI canonicalizes an OKB in three steps: (1) acquire side information for each assertion; (2) learn assertion embeddings using the side information; (3) perform clustering to canonicalize the OKB. This comprehensive use of word embeddings and side information allows CESI to outperform earlier methods. However, since the side information requires entities to exist in a curated KB before canonicalization, one potential limitation is that CESI cannot handle emerging entities effectively.

[Entity Linking (EL)]. Entity linking is another approach to solving the ER problem. EL links entity mentions to their corresponding entities in a curated KB. Readers are referred to the survey papers  [14, 23, 29] on entity linking. Here, we describe some representative works  [9, 21, 24].

In  [9, 21], entity mentions are linked to corresponding Wikipedia pages, which is also known as Wikification. Both methods adopt a two-phase approach: candidate selection followed by ranking, with some differences in their selection and ranking strategies. For a given entity name, some candidate entity Wikipedia pages are first selected. These pages are then ranked according to their similarity with the given entity name in the ranking phase. The entity name will be linked to the highest-ranked Wikipedia page. In  [9], candidate selection is based on measuring string similarity between the entity name and the title of a Wikipedia page, with special considerations given to acronyms and aliases. During the ranking phase, candidates are ranked based on some Wikipedia features such as the rank of the page in Google search, and other advanced string similarity features such as character n-grams shared by the entity names and titles of pages. On the other hand, the method proposed in  [21] uses the anchor text of hyperlinks (the displayed text of a clickable link) for candidate selection. Given an entity name, if the name appears frequently as the displayed text of a hyperlink that links to a Wikipedia page, then the Wikipedia page is considered a candidate for the given entity name. The candidates are then ranked according to some local features or some global features. Given an entity name e, the document d from which e is extracted, and a candidate Wikipedia page p, local features measure the similarity between p and e, and the similarity between p and d. Global features model the relatedness among all the candidate pages such that the entity linking results of the names in a document are coherent.

The method proposed in  [24] links entity mentions from unstructured Web texts to an entity heterogeneous information network (HIN), such as the DBLP and IMDb networks. Entity linking is based on probabilistic models. The entity popularity model computes the popularity of an entity in an HIN based on PageRank  [19]. Given an entity mention m in a document d and an entity e in the HIN, the entity object model computes the similarity between e and m based on the texts in d and the neighbors of e in the HIN. The two models are combined and optimized using the expectation-maximization algorithm. Finally, the knowledge population algorithm enriches the HIN using the documents that contain high-confidence entity linking results, which provides more information for subsequent linking.

[Word Embeddings and Sentence Embeddings]. Word embeddings map words to a high-dimensional vector space such that two words having similar meanings have a high cosine similarity between their vectors. GloVe  [20] and word2vec  [17] are two commonly used word embedding models. Both learn embeddings based on the co-occurrence information of words in text documents. word2vec adopts a continuous skip-gram model. GloVe adopts a log-bilinear regression model; it is essentially a count-based model that learns word embeddings by building a word-word co-occurrence matrix and fitting the co-occurrence probabilities. An extension of word embedding is sentence embedding, which generates a vector for every token in a sentence and also encodes information such as token positions and co-occurrences with other tokens. Compared with word embeddings, sentence embeddings can better capture contextual information. BERT  [8] is a deep learning NLP model that produces sentence embeddings effectively. The idea of BERT is to pre-train a general language model on large amounts of text; the model can then be efficiently fine-tuned for different downstream tasks and applications. Our algorithm MULCE uses both GloVe and BERT to provide embeddings for assertions’ subject entity mentions. Specifically, we use GloVe as a word embedding technique to capture the semantics of an entity name, and then use BERT to obtain a sentence embedding that refines the meaning of an entity mention by incorporating the contextual information of sentences. More details are given in the next section.
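As a minimal illustration of the similarity computation underlying these models, the sketch below compares toy, hand-made vectors with cosine similarity; in practice the vectors would be pre-trained GloVe or BERT embeddings with tens to hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional stand-ins for pre-trained embeddings (illustrative only).
v_mumbai = np.array([0.8, 0.1, 0.3, 0.2])
v_bombay = np.array([0.7, 0.2, 0.4, 0.1])
v_apple  = np.array([0.1, 0.9, 0.0, 0.5])

print(cosine_similarity(v_mumbai, v_bombay))  # high: similar meanings
print(cosine_similarity(v_mumbai, v_apple))   # low: unrelated words
```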

3 MULCE

In this section we describe our algorithm MULCE. We first give some definitions.

Definition 1 (Assertion, subject entity name, entity mention, OKB)

An assertion is a triple of the form \(A = \langle s; r; o \rangle \), where s, r, o are the subject, relation, and object fields, respectively. A subject entity name (or simply “subject name”, or “subject”) is a string s that appears in the subject field of some assertion A. We use s(A) to denote the subject name of assertion A. An entity mention m is a pair of the form (s(A), A) for some assertion A. We use m(A) to denote the mention of assertion A; and m.s to denote the subject name s(A) of the mention. An OKB is a collection of assertions.

For example, if \(A_1= \langle \)Mumbai; is the largest city of; India\(\rangle \), then \(m(A_1)\) = (“Mumbai”, \(A_1\)), and \(s(A_1)\) = \(m(A_1).s\) = “Mumbai”. In this paper we focus on resolving subject entity names via canonicalization.

Definition 2 (Canonicalization)

Given an OKB of n assertions \(\mathcal {K} = \{A_i\}_{i=1}^n\), the problem of canonicalizing \(\mathcal {K}\) is to compute a partitional clustering \(\mathcal {C} = \{C_j\}_{j=1}^{|\mathcal {C}|}\). The clustering \(\mathcal {C}\) induces a mention-entity mapping \(\rho \). Specifically, \(\forall A \in \mathcal {K}\), if \(A \in C_j\), then \(\rho (m(A)) = e_j\), where \(e_j\) denotes a physical entity. \(e_j\) is called the canonical form of m(A).
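The definitions can be made concrete with a small sketch (hypothetical Python types, not part of the formal model) showing how a clustering induces the mention-entity mapping \(\rho \):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assertion:
    subject: str    # s(A)
    relation: str
    obj: str

# Two assertions whose subject names refer to the same physical entity.
A1 = Assertion("Mumbai", "is the largest city of", "India")
A2 = Assertion("Bombay", "is the economic hub of", "India")

# A clustering C = {C_1}; here both assertions fall into the same cluster.
clustering = {1: [A1, A2]}

# The induced mention-entity mapping rho: m(A) -> canonical form e_j.
rho = {}
for j, cluster in clustering.items():
    for A in cluster:
        rho[(A.subject, A)] = f"e_{j}"

print(rho[("Mumbai", A1)] == rho[("Bombay", A2)])  # True: same canonical form
```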

Given two subject names, we consider the following four scenarios in judging if the two names refer to the same entity.

  (i) Easy Negatives: the names are very different strings and they refer to different entities (e.g., Dwayne Johnson vs. Obama);

  (ii) Easy Positives: the names are very similar strings and they refer to the same entity (e.g., President Obama vs. Barack Obama);

  (iii) Hard Negatives: the names are similar (either semantically or string-wise) but they refer to different entities (e.g., Johnny Damon vs. Johnny Cash (string-wise) or Hong Kong vs. Macau (semantically));

  (iv) Hard Positives: the names are very different but refer to the same entity (e.g., Mumbai vs. Bombay).

The challenge lies in designing a canonicalization method that handles an OKB containing instances of all the categories listed above. We propose MULCE, a two-stage, coarse-to-fine mechanism to canonicalize an OKB. Figure 1 shows the overall workflow of MULCE. The first stage, word-level canonicalization (Sect. 3.1), produces coarse-grained clusters according to the lexical information of subject names. These coarse-grained clusters are further divided into fine-grained clusters in the second stage, sentence-level canonicalization (Sect. 3.2), where the contextual information of assertions is considered.

Fig. 1. Workflow of MULCE

3.1 Word-Level Canonicalization

The first stage of MULCE aims at discovering entity mention pairs that are easy negatives. Given such a pair (\(m_1\), \(m_2\)), the first stage attempts to put the mentions \(m_1\) and \(m_2\) into different clusters. This can be achieved by clustering assertions based on the GloVe vectors of their subject names. The power of using GloVe embeddings to extract a coarse taxonomy in the vector space has been previously demonstrated by CESI  [27]. The details of word-level canonicalization are as follows. Given an OKB containing a set of assertions (Fig. 1(a)), we compute a GloVe embedding for each subject name. For a subject name s, the embedding of s is the average of the GloVe embedding vectors of the words in s. Subject names are then clustered with complete-link HAC based on their embeddings (Fig. 1(b)), using cosine distance as the distance metric. We adopt complete-link HAC because in  [27] it is suggested that small clusters are expected in the canonicalization problem: any two subject names in one cluster should be similar, as they both refer to the same entity. After the HAC, assertions are partitioned into coarse clusters according to the clustering of their subject names (Fig. 1(c)). The focus of this stage is to construct a coarse clustering that correctly splits entity mentions of easy-negative instances while keeping those of positive instances in the same clusters. In particular, we allow some coarse clusters to contain mentions of hard-negative instances; these instances are further processed in the next stage.
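To make the first stage concrete, the following is a minimal Python sketch of word-level canonicalization. It assumes pre-trained GloVe vectors are available as a dictionary `glove` mapping words to NumPy arrays, and the threshold value is illustrative only (the actual thresholds are tuned on validation data; see Sect. 4.1).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def subject_embedding(subject, glove, dim=300):
    """Average the GloVe vectors of the words in a subject name."""
    vecs = [glove[w] for w in subject.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def word_level_clusters(subjects, glove, threshold=0.3):
    """Complete-link HAC over subject-name embeddings with cosine distance."""
    X = np.vstack([subject_embedding(s, glove) for s in subjects])
    dists = pdist(X, metric="cosine")        # condensed pairwise distance matrix
    Z = linkage(dists, method="complete")    # complete-link HAC
    labels = fcluster(Z, t=threshold, criterion="distance")
    coarse = {}
    for subj, label in zip(subjects, labels):
        coarse.setdefault(label, []).append(subj)
    return list(coarse.values())             # coarse clusters of subject names
```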

3.2 Sentence-Level Canonicalization

The second stage of MULCE focuses on separating entity mentions of hard-negative instances within each cluster obtained in the first stage, while keeping those of hard-positive instances together in the same cluster. This is achieved by performing sentence-level canonicalization using the other fields of the assertions as context information of the subject entity mentions. MULCE applies BERT to encode the semantic information of an assertion (which includes all three fields of a triple, namely, subject, relation, and object). We note that directly performing HAC over the sentence embeddings of individual assertions may not provide satisfactory results: with the simple sentence structures of triples, the information encoded in one single sentence embedding may be too limited and biased. Hence, given a subject name s, MULCE uses the mean of the sentence embeddings of all assertions sharing the subject name s as the representation of s.

Based on this idea, we propose the following sentence-level canonicalization. We first compute a BERT embedding for each subject name. For each word in an assertion, we obtain a token-level pre-trained BERT vector using bert-as-service. For a subject name s, we collect all assertions \(A_i\) = \(\langle s_i; p_i; o_i \rangle \) with \(s_i = s\) and embed each assertion using BERT. We then extract the token-level BERT vector \(v_i\) corresponding to the subject \(s_i\), taken from the final hidden state of the BERT encoder. The BERT embedding vector of the subject name s is computed by averaging all the \(v_i\)’s. Even though we only extract the embedding vectors corresponding to subjects in this step, we consider this step sentence-level because the information contained in the rest of the sentence (i.e., the relation and the object) is leveraged when generating the subject token embeddings.
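The paper obtains token-level vectors via bert-as-service; the sketch below uses the Hugging Face transformers library instead, purely as an illustrative stand-in. It embeds every assertion that shares a subject name, extracts the subject-token vectors from the final hidden state, and averages them.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def subject_bert_vector(subject, assertions):
    """Average the subject-token vectors (final hidden state) over all
    assertions <s; r; o> whose subject equals `subject`."""
    vecs = []
    for s, r, o in assertions:
        if s != subject:
            continue
        enc = tokenizer(f"{s} {r} {o}", return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]    # (num_tokens, 768)
        n_subj = len(tokenizer.tokenize(s))               # wordpieces of the subject
        vecs.append(hidden[1:1 + n_subj].mean(dim=0))     # skip [CLS] at position 0
    return torch.stack(vecs).mean(dim=0).numpy()          # embedding of subject name
```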

After the BERT embeddings of subjects are computed (Fig. 1(d)), we use single-link HAC to cluster the BERT embeddings within each coarse cluster. The resulting clusters are called fine clusters (Fig. 1(e)). Finally, we construct an output mapping that resolves assertions’ subject names. Specifically, two assertions \(A_i\) and \(A_j\) are considered to refer to the same physical entity if and only if the BERT embeddings of their subject names, \(s_i\) and \(s_j\), co-locate in the same cluster (Fig. 1(f)).
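A sketch of the fine-clustering step follows, assuming `bert_vectors` maps each subject name in a coarse cluster to the averaged BERT vector computed above (as a NumPy array); the threshold is again illustrative and would be tuned on validation data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def sentence_level_clusters(coarse_cluster, bert_vectors, threshold=0.2):
    """Split one coarse cluster of subject names into fine clusters by
    single-link HAC over their BERT embeddings (cosine distance)."""
    if len(coarse_cluster) == 1:
        return [coarse_cluster]
    X = np.vstack([bert_vectors[s] for s in coarse_cluster])
    Z = linkage(pdist(X, metric="cosine"), method="single")
    labels = fcluster(Z, t=threshold, criterion="distance")
    fine = {}
    for subj, label in zip(coarse_cluster, labels):
        fine.setdefault(label, []).append(subj)
    return list(fine.values())

# Subject names in the same fine cluster are mapped to one canonical entity.
```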

4 Experiments

We conduct experiments to evaluate the performance of the canonicalization algorithms. First, we provide details of the datasets used in the experiments in Sect. 4.1. This is followed by a summary of the evaluation metrics in Sect. 4.2. Finally, Sect. 4.3 reports experimental results and discussions.

4.1 Datasets

In the experiments, we use two real-world datasets to evaluate MULCE and other state-of-the-art algorithms. Both datasets consist of sampled assertions that are extracted from the ClueWeb09 corpus using ReVerb  [10]. We call this collection of assertions ReVerb OKB. Table 1 summarizes the statistics of the two datasets. We briefly describe them below.

Table 1. Dataset statistics

Ambiguous Dataset: The Ambiguous dataset was created by Galárraga et al.  [13]. First, 150 entities that have at least two different names in the ReVerb OKB were sampled; we call this set the sampled entity set \(E_s\). An entity e is called a homonym entity if (1) e has the same name as some other entity \(e' \in E_s\) and (2) e and \(e'\) refer to different physical entities. Assertions whose subjects refer to sampled entities or homonym entities are collected into the Ambiguous dataset. Intuitively, the entities mentioned in these assertions are ambiguous, as the same name can refer to different physical entities. We use the Ambiguous dataset released by the authors of CESI  [27], as it contains additional information that is necessary to run CESI.

ReVerb45K Dataset: The ReVerb45K dataset was provided by  [27]. Similar to Ambiguous, assertions in ReVerb45K have subjects referring to entities with at least two different names. ReVerb45K is larger (more assertions) and sparser (lower assertion-to-entity ratio) compared with Ambiguous.

Both datasets are split into validation sets and test sets by  [27]. We use the validation sets to determine the HAC clustering thresholds. For MULCE, which involves two levels of clustering, we use grid search on the validation sets to find the optimal thresholds. The ground truth is obtained by linking subject names to Wikipedia pages using the Stanford CoreNLP entity linker  [25]. Assertions whose subject names are linked to the same Wikipedia page are considered to have their subjects referring to the same physical entity. We observe that, compared with  [12], which was adopted in the evaluation of many previous works, the Stanford CoreNLP entity linker achieves a higher precision but a lower recall; that is, mostly only high-confidence linking results are produced. If an assertion A in a dataset (Ambiguous or ReVerb45K) has a subject name that cannot be linked to a Wikipedia page, A is excluded from the performance evaluation, since its ground truth entity cannot be determined by its Wiki-linkage.
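The grid search over the two clustering thresholds can be sketched as follows; `run_mulce` and `evaluate_f1` are hypothetical callables standing in for the full MULCE pipeline and the chosen validation metric, and the candidate ranges are illustrative.

```python
import numpy as np

def tune_thresholds(validation_okb, run_mulce, evaluate_f1):
    """Grid search over the word-level and sentence-level HAC thresholds.
    `run_mulce` and `evaluate_f1` are caller-supplied (hypothetical) callables."""
    best = (None, None, -1.0)
    for t_word in np.arange(0.05, 0.95, 0.05):
        for t_sent in np.arange(0.05, 0.95, 0.05):
            clusters = run_mulce(validation_okb, t_word, t_sent)
            f1 = evaluate_f1(clusters)
            if f1 > best[2]:
                best = (t_word, t_sent, f1)
    return best  # (best word-level threshold, best sentence-level threshold, best F1)
```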

4.2 Evaluation Metrics

We follow  [13, 27] and evaluate clustering results with macro, micro, and pairwise scores. Specifically, let \(\mathcal {C}\) be the clustering produced by a canonicalization algorithm, G be the gold standard clustering (i.e., the set of clusters formed according to the ground truth), and n be the number of assertions.

Macro Analysis: We define macro precision (\(P_{ macro }\)) as the fraction of clusters in \(\mathcal {C}\) whose assertions all have subjects linked to the same wiki entity in the ground truth. Macro recall (\(R_{ macro }\)) is the fraction of wiki entities whose assertions are all assigned to the same cluster by a canonicalization method.

Micro Analysis: Micro precision measures the purity of the clusters, assuming that the most frequent ground truth entity in a cluster is the correct entity of that cluster. More formally, \(P_ micro (\mathcal {C},G)=\frac{1}{n}\sum _{c\in \mathcal {C}}\max _{g \in G}|c\cap g|\). Micro recall is defined symmetrically as \(R_ micro (\mathcal {C},G) = P_ micro (G,\mathcal {C})\). It measures the fraction of assertions assigned to the correct cluster, assuming that for each entity e, the correct cluster is the one that contains the most assertions whose subjects’ ground truth entity is e.

Pairwise Analysis: We say that two assertions in a cluster are a “hit” if their subjects refer to the same ground truth entity. Pairwise precision is defined as \(P_ pairwise (\mathcal {C},G)=\frac{\sum _{c\in \mathcal {C}}\# hits_c }{\sum _{c\in \mathcal {C}}\# pairs_c }\), where \(\# pairs_c =\left( {\begin{array}{c}|c|\\ 2\end{array}}\right) \) is the number of pairs in a cluster c. Pairwise recall is defined similarly as \(R_ pairwise (\mathcal {C},G)=\frac{\sum _{c\in \mathcal {C}}\# hits_c }{\sum _{g\in G}\# pairs_g }\).

For each analysis, we also report the F1 score.
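For reference, the micro and pairwise scores can be computed as in the following sketch, where the predicted clustering C and the gold clustering G are lists of sets of assertion identifiers; this follows the definitions above and is not the authors’ evaluation script.

```python
from math import comb

def micro_precision(C, G, n):
    # Purity: each cluster is credited with its largest overlap with a gold cluster.
    return sum(max(len(c & g) for g in G) for c in C) / n

def micro_recall(C, G, n):
    return micro_precision(G, C, n)  # symmetric definition

def pairwise_scores(C, G):
    def hits(clusters, gold):
        # Pairs within a cluster whose two members share a ground truth entity.
        return sum(comb(len(c & g), 2) for c in clusters for g in gold)
    pairs_C = sum(comb(len(c), 2) for c in C)
    pairs_G = sum(comb(len(g), 2) for g in G)
    precision = hits(C, G) / pairs_C if pairs_C else 0.0
    recall = hits(C, G) / pairs_G if pairs_G else 0.0
    return precision, recall
```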

4.3 Results and Discussions

We compare MULCE against G-IDF and CESI, which are described in Sect. 2. We also conduct experiments to evaluate the following two ablated versions of MULCE:

  • Word-Level Canonicalization: Canonicalization is done by clustering subjects’ GloVe embeddings.

  • Sentence-Level Canonicalization: BERT embeddings of subjects are clustered without forming “coarse clusters”.

Table 2. Performance comparison using Ambiguous and ReVerb45K datasets

Table 2 presents the results. Overall, MULCE achieves the highest F1 scores across both datasets. We further make the following observations.

G-IDF clusters assertions based on whether their subject names share some uncommon words. Therefore, it requires a corpus that is large enough to provide accurate estimates of words’ document frequencies. Another disadvantage of G-IDF is that it does not utilize the relation or the object fields of assertions in clustering, even though those fields provide valuable contextual information. Moreover, G-IDF only takes subject names as strings without considering their semantics. Nevertheless, G-IDF achieves high recall scores. This shows that G-IDF is good at identifying subject names of the same entity, although it lacks the ability to distinguish string-wise similar but semantically different entities.

CESI uses side information (see Sect. 2) to learn assertion embeddings. The embeddings provide semantic information about subject names. CESI has high micro recall and pairwise recall scores for the ReVerb45K dataset. This demonstrates its ability to identify literally different subject names that refer to the same physical entity. The shortcoming of CESI is that it does not work well for the hard-negative cases, i.e., subject names that are very similar but in fact refer to different entities. MULCE, with its multi-level framework, tackles these highly similar cases using sentence-level canonicalization.

Ablation Analysis. We conduct an ablation analysis by applying only word-level canonicalization or applying only sentence-level canonicalization. For both datasets, we see that word-level canonicalization performs better in recall scores, and sentence-level canonicalization performs better in precision scores. The result is expected because it follows our design principle. We identify subjects with similar meanings first in word-level canonicalization (larger and coarser clusters, high recall), and then further split and refine the clusters in sentence-level canonicalization (smaller and finer clusters, high precision). When both word-level and sentence-level canonicalization are employed, i.e., using MULCE, we register the highest F1 scores. This demonstrates that MULCE’s two-level clustering method is highly effective.

Case Study. We further illustrate the effectiveness of the algorithms by a case study. The following assertions are extracted from the ReVerb45K dataset.

  • \(A_1: \langle \)Mumbai; is the largest city of; India\(\rangle \)

  • \(A_2: \langle \)Bombay; is the economic hub of; India\(\rangle \)

  • \(A_3: \langle \)Hong Kong; is a special administrative region of; China\(\rangle \)

  • \(A_4: \langle \)Macau; is a special administrative region of; China\(\rangle \)

Note that the subjects of \(A_1\) and \(A_2\) refer to the same city and hence the assertions should be put in the same cluster. These two assertions test whether an algorithm can distinguish the same entity with different names (hard positives). Assertions \(A_3\) and \(A_4\) have different entities as subjects (Hong Kong and Macau), but these entities are highly semantically similar (hard negatives).

We observe that G-IDF separates the four assertions into different clusters. This is because the similarity function used, namely IDF token overlap, is purely word-based and the four subject names share no common words. CESI correctly puts \(A_1\) and \(A_2\) in the same cluster. This demonstrates that word embeddings and side information provide enough clues for the algorithm to infer semantic similarity even though the names share no words. However, CESI incorrectly puts \(A_3\) and \(A_4\) in the same cluster. This shows that CESI has problems handling hard-negative cases. Our method, MULCE, correctly handles all four assertions. In word-level canonicalization, \(A_1\) and \(A_2\) are grouped into one coarse cluster, and \(A_3\) and \(A_4\) are grouped into another coarse cluster. These clusters are then processed by sentence-level canonicalization, where \(A_3\) and \(A_4\) are separated into two clusters while \(A_1\) and \(A_2\) remain in the same cluster. Through this case study, we demonstrate that MULCE is capable of properly handling hard-positive and hard-negative cases.

5 Conclusion

In this paper, we studied the problem of OKB canonicalization. We proposed MULCE, a two-stage canonicalization framework. Word-level canonicalization produces coarse clusters in which subject names with similar meanings are grouped together. These coarse clusters are further divided into fine clusters by sentence-level canonicalization, where BERT embeddings capture the information of the relation and object of an assertion. Our experiments demonstrated that MULCE outperforms state-of-the-art methods on two datasets. We also conducted an ablation study to show that combining word-level and sentence-level canonicalization is effective.