1 Introduction

Many studies have investigated knowledge graphs (KGs) as databases for dialogue systems [14, 19, 20, 21, 24]. A KG is represented as a set of triples \((e_s,r,e_o)\), where \(e_s\) is a subject entity, \(r\) is a relation, and \(e_o\) is an object entity. KGs can flexibly represent the relations between two entities. However, it is practically impossible for a KG to cover every triple that holds in the real world.

We can estimate the missing triples in a KG using knowledge graph completion (KGC) [2, 8], which can be utilized to generate the response sentences of a dialogue system [7]. However, the more triples that are missing, the lower the KGC performance becomes. To improve the KGC performance, a KG can be augmented using a different external database, as exemplified in Fig. 1. Increasing the number of relations per entity through this augmentation raises the KGC performance.

A crucial problem in this augmentation is that entity names often differ between an existing KG and an external database. We call such different names with identical meanings orthographic variants. For example, “chocolate cake” is often abbreviated to “choco cake” in Japanese.

We identify entities whose meanings are identical and merge them, as shown in the “chocolate cake” example in Fig. 1. In our study, entity identification refers to associating two entities that have the same meaning. If such entities are successfully merged, more relations can be added between existing entities, which will improve the KGC performance.

Our proposed entity identification uses the similarity of feature vectors generated by BERT [5] while considering graph information. We evaluated its effectiveness by the KGC performance obtained after augmenting a KG with entity identification.

Fig. 1. Augmentation of KG using different databases

2 Related Work

Although some studies have addressed KG augmentation or construction, most did not take orthographic variants into account [1, 4, 9, 22]. Meng et al. [10] constructed a KG from Chinese literature by merging orthographic variants using a Word2vec [11] model trained on the original literature. However, the KG and the external database considered in our study have no original literature on which to train such a model.

Ikeda et al. [6] and Saito et al. [13] used language models to remove Japanese orthographic variants without KGs. Turson et al. [17] also studied a similar method for Uighur. Unlike our study, these works assume the availability of sufficient documents for training models.

Zhang et al. [23] and Sun et al. [15] input entity or triple information to language models to perform NLP tasks. However, both works assume that the original KG has enough relations between its entities.

Fig. 2. Augmentation details with entity identification

3 Entity Identification Based on Graph Information

3.1 Augmentation with Entity Identification

Figure 2 shows the augmentation of a KG using entity identification, which is performed on entities \(e_s\) and \(e_o\) in the triples \((e_s, r, e_o)\) of an external database used for augmentation. The entity identification module outputs the most similar entities in the existing KG, \(\hat{e}_s\) and \(\hat{e}_o\), and their similarity scores. If a similarity score is larger than or equal to a threshold \(\theta \), the corresponding entity in the original triple is replaced with \(\hat{e}_s\) or \(\hat{e}_o\), and the replaced triple is added to the KG. We did not use triples with unreplaced entities for augmentation because they may degrade the KGC performance.
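This procedure can be sketched as follows, assuming a function identify that stands in for the entity identification module of Sect. 3.2 (the function names and signatures here are illustrative, not from the paper):

```python
# Minimal sketch of the augmentation step (Sect. 3.1); names are illustrative.
# `identify` stands in for the entity identification module: given an entity
# from the external database, it returns the most similar existing-KG entity
# and their similarity score.

THETA = 0.4  # similarity threshold; the paper's validated value (Sect. 4.1)

def augment(kg_triples, external_triples, identify):
    """Add external triples to the KG after replacing both of their entities."""
    augmented = list(kg_triples)
    for e_s, r, e_o in external_triples:
        s_hat, s_score = identify(e_s)
        o_hat, o_score = identify(e_o)
        # Triples with an unreplaced entity (score below the threshold)
        # are discarded because they may degrade the KGC performance.
        if s_score >= THETA and o_score >= THETA:
            augmented.append((s_hat, r, o_hat))
    return augmented
```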

3.2 Feature Vectors of Entities with BERT Considering Graph Information

Entity identification calculates the cosine similarity between the feature vectors of the entities in the KG and those in the external database. The KG entity with the largest cosine similarity is identified as the most similar. The feature vectors are computed with graph information using BERT.
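A minimal sketch of this lookup, assuming the feature vectors are precomputed as NumPy arrays (our own formulation):

```python
import numpy as np

def most_similar(query_vec, kg_names, kg_vecs):
    """Return the KG entity whose feature vector has the largest cosine
    similarity to query_vec, together with that similarity score.

    kg_names: list of KG entity names.
    kg_vecs:  (num_entities, dim) matrix of KG entity feature vectors.
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = kg_vecs / np.linalg.norm(kg_vecs, axis=1, keepdims=True)
    sims = m @ q  # cosine similarity to every KG entity
    best = int(np.argmax(sims))
    return kg_names[best], float(sims[best])
```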

We use the name of each entity and the triples containing it as the BERT input. Figure 3 shows an example. The triples are grouped by relation, and the resulting sentences are connected by [SEP] tokens. When computing the feature vector of the entity “chocolate cake,” the input is “[CLS] chocolate cake [SEP] ingredients are egg and chocolate [SEP] superclass is dessert [SEP]” based on the graph structure. A [CLS] token is always placed at the beginning of the BERT input.
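The input string could be assembled as below; the sentence templates follow the Fig. 3 example, and the exact verbalization rules are our assumption:

```python
def build_bert_input(entity, triples):
    """Assemble the BERT input from an entity name and the triples containing
    it, grouping triples by relation and joining sentences with [SEP]."""
    by_relation = {}
    for e_s, r, e_o in triples:
        if e_s == entity:  # here we only use triples where the entity is the subject
            by_relation.setdefault(r, []).append(e_o)
    sentences = []
    for r, objects in by_relation.items():
        verb = "is" if len(objects) == 1 else "are"
        sentences.append(f"{r} {verb} {' and '.join(objects)}")
    return "[CLS] " + " [SEP] ".join([entity] + sentences) + " [SEP]"

# build_bert_input("chocolate cake",
#                  [("chocolate cake", "ingredients", "egg"),
#                   ("chocolate cake", "ingredients", "chocolate"),
#                   ("chocolate cake", "superclass", "dessert")])
# -> "[CLS] chocolate cake [SEP] ingredients are egg and chocolate
#     [SEP] superclass is dessert [SEP]"
```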

Mean-pooling was applied to the sequence of output vectors from BERT, and the pooled vector serves as the feature vector of each entity. In addition, we normalized each feature vector by subtracting the mean of all the feature vectors from it to improve the KGC performance after augmentation.
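A sketch of the pooling and centering steps, using the Hugging Face Transformers library (our choice of toolkit and checkpoint, not specified in the paper):

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# The checkpoint name below is our assumption; the paper only states that a
# pre-trained Japanese BERT was fine-tuned (see Footnote 2).
NAME = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModel.from_pretrained(NAME)

def feature_vector(text):
    """Mean-pool BERT's output token vectors into one entity feature vector.
    The input already contains [CLS]/[SEP] markers (Fig. 3), so the
    tokenizer's own special-token insertion is disabled."""
    inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    return hidden.mean(dim=0).numpy()

def center(feature_vectors):
    """Subtract the mean of all feature vectors from each one (Sect. 3.2)."""
    stacked = np.stack(feature_vectors)
    return stacked - stacked.mean(axis=0, keepdims=True)
```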

Fig. 3. Input format for BERT based on graph information

4 Experiments and Evaluations

4.1 Settings

We used a food subgraph from Wikidata [18] as the original KG. We extracted a portion of it for use as test and validation data, and the remaining graph was used as the augmentation target. The target data had 14,454 triples, the validation data 242, and the test data 243. They contained 8,423 entities and 110 kinds of relations.

We used Rakuten Recipe from the Rakuten public data (see Footnote 1) as the external database. It has about 800,000 recipes. The entities to be augmented came from the names of dishes and the ingredients in the recipes.

We used TransE [3] and RotatE [16] as the KGC models. Using the validation data, we set the embedding dimension to 300 for both models. For each triple \((e_s,r,e_o)\) in the test data, we evaluated the performance of predicting either \(e_s\) or \(e_o\), chosen at random. Hits@N (\(N=1, 10\)) and mean reciprocal rank (MRR) were used as the evaluation metrics. The BERT model was fine-tuned from a pre-trained model for Japanese (see Footnote 2), with hyperparameters based on a previous paper [12]. Threshold \(\theta \) (Fig. 2) was experimentally set to 0.4 using the validation data.
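For reference, the metrics can be computed as below from the rank of the correct entity in each prediction (these are the standard definitions, not code from the paper):

```python
def hits_at_n(ranks, n):
    """Fraction of test cases whose correct entity is ranked within the top n."""
    return sum(r <= n for r in ranks) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank of the correct entity (ranks are 1-indexed)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# e.g. ranks = [1, 3, 12] -> hits_at_n(ranks, 10) == 2/3 and mrr(ranks) ≈ 0.472
```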

We set two baselines. One was “EditDist-based,” in which similarity scores were computed by subtracting the normalized edit distance between the entity names from 1; the edit distance was normalized by the length of the longer entity name. In this baseline, each entity name was treated as a string of characters representing its Japanese pronunciation, and entity pairs with similarity scores over 0.9 were regarded as identical. The other baseline was “BERT” without graphs, i.e., only each entity name was used to compute its feature vector with BERT.
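A sketch of the EditDist-based similarity over pronunciation strings (the Levenshtein formulation is our assumption; the paper only specifies the normalization):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def editdist_similarity(name_a, name_b):
    """1 minus the edit distance normalized by the longer name's length."""
    longer = max(len(name_a), len(name_b)) or 1  # guard against empty names
    return 1.0 - levenshtein(name_a, name_b) / longer

# Pairs with editdist_similarity(...) > 0.9 are regarded as identical.
```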

4.2 Results and Discussion

Table 1 shows the KGC performance and the number of triples of the augmented KG for each method. Our proposed method is “BERT+graph.”

Our BERT+graph method outperformed the other methods on every metric, most notably the BERT baseline, and it augmented fewer triples than the BERT baseline. This result indicates that the graph information filtered out triples that do not contribute to the KGC performance and thus had a positive impact on it.

Comparing the methods, the improvement from the BERT baseline to our BERT+graph method exceeded that from the EditDist-based baseline to the BERT baseline on all the metrics, which further confirms the effectiveness of the graph information.

Table 1. KGC performance and number of triples of augmented KG for each method

Table 2 shows some examples of similarity scores. “Target entity” is an entity of Rakuten Recipe, and “Existing entity” is an entity of the food subgraph. We include the Japanese entity names with their pronunciations in parentheses. Our BERT+graph method computed more appropriate similarity scores. For example, its scores were high for similar pairs, such as 酢イカ (vinegared squid) and イカ (squid), and low for dissimilar pairs, such as 酢イカ (vinegared squid) and スイカ (watermelon) or かき揚げ (vegetable tempura) and カキ (oyster). On the other hand, even for a pair denoting exactly the same thing, such as そば (soba) and 蕎麦 (soba), the score was lower than that of a merely similar pair, such as そば (soba) and かけそば (kakesoba), where kakesoba is a kind of soba. Even so, the KG augmentation improved the KGC performance, as demonstrated in Table 1, because erroneous merges of dissimilar pairs, such as 酢イカ (vinegared squid) and スイカ (watermelon), were prevented, which mitigated the negative effects.

Table 2. Examples of entity identification results

5 Conclusion

We augmented a KG with entity identification based on graph information and evaluated its effectiveness by the KGC performance after augmentation. Our experimental results indicate that the proposed method outperformed the two baselines. In the future, we will verify whether our proposed method remains effective with other KGs and external databases.