1 Introduction

As a milestone in the field of NLP, substantial work has shown that pre-trained language models (PLMs) trained with self-supervised learning on large-scale corpora acquire rich semantic knowledge that facilitates a variety of downstream NLP tasks [54]. Manning [1] explains that PLMs learn the meaning of words because meaning can be regarded as a network of connections between linguistic forms, and PLMs have observed enough such connections to understand words. For example, PLMs understand the term “Washington, D.C.” from sentences such as “Washington, D.C. is the capital city and federal district of the United States” and “Washington, D.C. is located on the east bank of the Potomac River”. As a result, PLMs significantly improve the performance of most NLP tasks through such learning.

Today, it has become a consensus to use PLMs, through fine-tuning or prompting, as the backbone of downstream tasks. Since the initial PLMs BERT [20] and GPT [49] were proposed, the PLM community has flourished, and various PLMs have subsequently been proposed to address different needs and tasks. For example, some works propose knowledge-distilled models such as DistilBERT [48] and TinyBERT [47] to reduce the number of parameters in PLMs and to speed up training and inference. Other works present knowledge-enhanced models such as ERNIE [35] and KEPLER [36] to address knowledge-driven downstream tasks. However, there is a lack of a comprehensive study of the application of these PLMs to an important NLP sub-task, namely knowledge graph question answering (KGQA).

KGQA aims to find answers to natural language questions from a knowledge graph (KG), which typically stores structured knowledge in the form of triples denoted as (subject, relation, object). Studying how various PLMs apply to KGQA is valuable for the following reasons.

1. The general domain KGQA is difficult to use in practice because of efficiency issues. The huge size of a general domain KG leads to large training and inference times for KGQA systems. Some KGQA works limit the search range in the KG to reduce complexity [40,41,42], yet the training time for a well-performing KGQA model [40] still exceeds two weeks. Without this search limitation, the model [40] would take a few months to train, as noted in [41]. Several works have attempted to reduce complexity by optimizing KGQA approaches [43,44,45], but at the expense of performance. In recent KGQA systems, PLMs have become a regular component due to their clear gains in performance. However, applying powerful but large PLMs further increases the difficulty of using KGQA systems in practice. Therefore, it is necessary to explore the trade-off between performance and efficiency of PLMs on KGQA.

2. KGQA is a knowledge-intensive task and involves several common NLP subtasks, such as mention detection, entity disambiguation and relation detection, as shown in Figure 1. Substantial works tackle these subtasks with PLMs as the cornerstone, and with success [28,29,30,31,32]. Moreover, several works utilize structured knowledge to enhance PLMs beyond self-supervised training on large-scale corpora, with significant improvements on the mention detection [33] and relation detection [34,35,36,37] tasks. Nevertheless, there is a lack of work comparing these subtasks comprehensively from the perspective of PLMs. Therefore, exploring the use of PLMs for KGQA is also instructive for these subtasks.

Fig. 1 The general structure of KGQA

This work aims to comprehensively evaluate the overall performance of various PLMs on KGQA. We not only examine the accuracy and efficiency of KGQA systems based on different PLMs, but also study their scalability. Specifically, we design four KGs of increasing size to explore how these KGQA systems vary with KG scale. Three classes of nine PLMs are used for evaluation, including the common PLMs Bert [20], Roberta [21], XLnet [50] and Gpt2 [49], the lightweight PLMs ALbert [22], DistilBert [48] and DistilRoberta [48], and the knowledge-enhanced PLMs Luke [33] and Kepler [36]. As the common models serve as the backbones of the lightweight and knowledge-enhanced PLMs, we follow [54] to further classify them according to pre-training task categories, namely masked language modeling (i.e., Bert and Roberta), language modeling (i.e., Gpt2), and permuted language modeling (i.e., XLnet). The investigation focuses on the simple but common simple KGQA task, in which a question can be answered by a single triple in the KG. Moreover, we summarize two basic KGQA frameworks from previous works for the experiments. These two frameworks are vanilla, containing no additional neural network modules except PLMs and simple linear layers, which allows us to focus on comparing PLMs rather than KGQA approaches with complex neural network modules. We also conduct experiments to compare these fine-tuned PLMs with ChatGPT under zero-shot settings on the KGQA task.

In summary, our main contributions are as follows.

1. To the best of our knowledge, this is the first attempt to comprehensively study the overall performance of various PLMs on KGQA tasks. For this purpose, we summarize two basic KGQA frameworks from popular simple KGQA approaches to exclude the interference of complex neural network modules. We implement 18 KGQA systems based on these two basic frameworks using a total of nine PLMs. Further, we propose three KGQA benchmarks based on the popular SimpleQuestions benchmark; together with the original benchmark, these four benchmarks have linearly increasing KG scales.

2. We conduct comprehensive experiments to evaluate the overall results of all implemented KGQA systems on all benchmarks. We present detailed analyses of overall accuracy, efficiency and scalability from the perspective of different PLMs and KGQA frameworks. In addition, we analyze the performance of the sub-modules of the KGQA systems to investigate the impact of the different PLMs and frameworks on these subtasks. We also compare these fine-tuned PLMs with ChatGPT under zero-shot settings on three KGQA datasets.

3. We find that knowledge-distilled lightweight PLMs and knowledge-enhanced PLMs are promising for KGQA, which motivates us to explore practical KGQA systems in this direction in the future. Besides, we observe that ChatGPT performs well on KGQA tasks, though some limitations remain. Section 7 summarizes all the important findings. Our PLM-based KGQA frameworks provide new strong baselines for simple KGQA. We have released the code and benchmarks as publicly accessible resources to support the future development of the KGQA community.

The rest of the article is structured as follows. In Section 2, we introduce related works on simple KGQA and PLM applications on KGQA. In Section 3, we present the preliminary knowledge of this work. In Section 4, we summarize the existing simple KGQA methods and describe the two summarized basic KGQA frameworks in detail. In Section 5, we introduce the four benchmarks and evaluation metrics. In Section 6, we describe the experimental results and analysis. Finally, Section 7 concludes this work and introduces future work.

2 Related works

2.1 Simple knowledge graph question answering

Knowledge graph question answering (KGQA) aims to find answers to natural language questions from a knowledge graph (KG). Simple KGQA means that a natural language question can be answered by a single triple in the KG. The two mainstream branches of current KGQA methods are information retrieval (IR) and semantic parsing (SP) [51,52,53]. The former attempts to retrieve answers directly from a subgraph centred on the topic entity and then models answer features for ranking. The latter trains a semantic parser to transform the question into an intermediate logical form and then executes it against the KG. In simple KGQA, IR methods employ various neural networks to score the similarity between the question and each candidate fact in the subgraph and then find the best match, following the process of retrieving a question-specific subgraph and then ranking the facts in it. Bordes et al. [24] used a memory network to encode questions and facts into the same vector space and score their similarity. Dai et al. [15] formulated the task as a two-step conditional probability estimation problem and adopted a BiGRU network as the encoder. Yu et al. [7] designed two independent hierarchical residual BiLSTMs to represent questions and relations at different granularities. Yin et al. [11] used two independent models, a character-level CNN and a word-level CNN with attentive max-pooling. Lukovnikov et al. [9] proposed an end-to-end word/character-level encoding network for ranking subject-relation pairs and retrieving relevant facts. In simple KGQA, SP methods reduce to a classification model because only a relation or a predicate needs to be generated. Ture and Jojic [6] employed a two-layer BiGRU model as the classifier. Petrochuk and Zettlemoyer [5] used a BiLSTM to classify relations and achieved state-of-the-art performance. Mohammed et al. [2] adopted only simple neural networks (i.e. LSTMs and GRUs) or non-neural models (i.e. CRFs). In Section 4, we refer to the IR and SP methods as the retrieval-and-ranking-based method and the classification-based method, respectively, to show their differences more clearly.

2.2 PLM-based methods for KGQA

Pre-trained language models have been widely used for various downstream tasks, including KGQA, due to the powerful representation capabilities learned from large-scale text corpora. For IR methods, PLMs provide a way to model unstructured text and structured KG information in a unified semantic space, which facilitates question-specific subgraph reasoning. Zhang et al. [59] trained a PLM-based path retriever to retrieve question-related relations hop by hop. At each step, the retriever ranks the top-k relations based on the question and the relations selected in the previous step. Hu et al. [58] introduced PLMs to help align questions and paths in a step-wise reasoning manner, combining explicit text semantic matching and implicit KG structure matching. Luo et al. [14] proposed a BERT-based model to preserve the original question-fact interaction information and reduce the semantic gap. For SP methods, PLMs significantly improve the understanding of questions, especially complex ones. Lukovnikov et al. [3] made the first attempt to use PLMs as classifiers to predict relations, with a significant performance improvement over shallow neural networks; they also demonstrated the greater advantage of PLMs on limited training data. Some works [60, 61] used PLMs to directly generate executable programs based on a given question and other relevant KG information. The substantial improvements in model performance demonstrate the effectiveness of these usages of PLMs. However, few KGQA works have taken the efficiency of PLMs into account. This is crucial for KGQA, which is inherently difficult to apply in practice.

3 Preliminaries

In this section, we introduce the definition of the simple KGQA task (Section 3.1) and large-scale pre-trained language models (PLMs) (Section 3.2).

3.1 Task definition

This work focuses on evaluating PLMs on simple knowledge graph question answering, where a natural language question can be answered by a single triple in the KG. Simple questions are frequently queried in search engines and question-answering bots: the 100 most frequently asked questions on the Google search engine in 2021 are simple questions, and most of them can be answered by a KGQA system.

For ease of understanding, we define some notation used in this paper. Formally, a knowledge graph (KG) is a collection of subject-relation-object triples, denoted by \(\mathbb {G}=\left\{ (s, r, o)\vert s,o \in \mathbb {E}, r \in \mathbb {R} \right\} \), where \((s, r, o)\) denotes that relation r holds between subject s and object o, and \(\mathbb {E}\) and \(\mathbb {R}\) denote the entity set and relation set, respectively. Given the KG \(\mathbb {G}\), KGQA aims to answer a natural language question \(Q = \left\{ w_1, w_2, ..., w_n \right\} \), given as a sequence of words, with the answers \(\mathcal {A}_q \subset \mathbb {E}\). For the simple KGQA task, the answers are directly connected to the topic entity, and a KGQA system is trained using a dataset \(D = \left\{ Q, \left\langle s, r\right\rangle \right\} \), where \(\left\langle s, r\right\rangle \) refers to a subject-relation pair.

In the inference stage, given a natural language question Q, “The film Forrest Gump is directed by who?” as shown in Figure 1, the KGQA system answers this question with the answer Robert Zemeckis, which is retrieved by the subject-relation pair \(\left\langle \textit{Forrest Gump, directed\_by} \right\rangle \) in the KG \(\mathbb {G}\).
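To make the notation concrete, the following minimal Python sketch (illustrative only; the triples and names are toy examples, not the benchmark KG) stores a KG as (s, r, o) triples and answers a question from a predicted subject-relation pair:

```python
from collections import defaultdict

class TripleStore:
    """A toy KG indexed from (subject, relation) to objects, mirroring (s, r, o)."""
    def __init__(self, triples):
        self.index = defaultdict(set)
        for s, r, o in triples:
            self.index[(s, r)].add(o)

    def answer(self, subject, relation):
        """Return the answer set A_q for a predicted <subject, relation> pair."""
        return self.index.get((subject, relation), set())

kg = TripleStore([
    ("Forrest Gump", "directed_by", "Robert Zemeckis"),
    ("Forrest Gump", "starring", "Tom Hanks"),
])
print(kg.answer("Forrest Gump", "directed_by"))  # {'Robert Zemeckis'}
```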

3.2 Large-scale pretrained language models

Neural network-based language models represent everything through vectors of real numbers. They learn better representations on a large corpus by back-propagating from the language model prediction task to the word representations. Early language models trained shallow networks to capture the semantic meaning of words, such as Word2Vec [25] and GloVe [26]. However, they suffer from the drawback of not being able to represent polysemous words differently in different contexts.

Since the introduction of the Transformer [17], it has become feasible to train deep neural models for NLP tasks. The Transformer is a more expressive model for word sequences than the simple neural networks explored previously. One of its main ideas is the attention mechanism, by which the representation at one position is computed as a weighted combination of representations from other positions. With the Transformer as the architecture, various PLMs trained on large-scale corpora, such as BERT [20] and RoBERTa [21], have been proposed with language model learning as the objective. Large-scale PLMs with hundreds of millions of parameters can learn polysemous words as well as factual knowledge from contextual semantics. Furthermore, numerous works have proposed using structured knowledge to enhance PLMs, such as KEPLER [36], and others have used distillation techniques to reduce the number of PLM parameters, such as DistilBERT [48]. Most of these build on the primary PLMs, i.e., BERT and RoBERTa. A single large-scale pre-trained language model can be deployed for many specific NLP tasks with only a small amount of further adaptation; the standard approach is to fine-tune the model with a small amount of additional supervised learning. By fine-tuning large-scale PLMs, their rich linguistic knowledge yields strong performance on downstream NLP tasks.

Recently, ChatGPT, released by OpenAI, has gained huge attention from the NLP community and many other fields. ChatGPT is fine-tuned from the GPT-3.5 series of models through reinforcement learning from human feedback [65]. Several works [62,63,64] have shown that ChatGPT demonstrates powerful capabilities on many NLP tasks, but testing on knowledge-intensive downstream tasks is lacking. This work aims to explore the practicability of various PLMs on a knowledge-intensive downstream task, i.e., knowledge graph question answering, to help researchers select the appropriate PLM according to their needs. Unfortunately, ChatGPT currently supports only limited access methods and request quotas, which limits our testing. We leave more work on ChatGPT for the future.

4 Two basic KGQA frameworks

4.1 Summary of the framework

To analyze the practicality of PLMs applied to KGQA, we summarize several simple KGQA approaches and propose two basic KGQA frameworks for evaluation: a classification-based KGQA framework (KGQAcl) and a retrieval and ranking-based KGQA framework (KGQArr). Previous works [2,3,4,5,6] belonging to KGQAcl designed various deep neural networks to encode the question and then map the question vector to the KG relation dictionary. Previous works [7,8,9,10,11,12,13,14,15,16] belonging to KGQArr first retrieved the adjacent (one-hop) relations of the linked entities and then designed new network architectures or introduced contextual information to rank these relations. Some works also propose techniques such as using the relation detection model to rerank entities [7] or adopting a joint training strategy [10, 13] to improve performance, but these make the KGQA framework more complex. Our frameworks do not consider these approaches, as improving the performance of KGQA is not the purpose of this work.

Both basic frameworks consist of four modules: (1) Mention Detection, (2) Entity Disambiguation, (3) Relation Detection and (4) Answer Query. The main difference between the two frameworks is the relation detection module. For KGQArr, this module ranks candidate relations (i.e. information retrieval). For KGQAcl, this module maps the question intent to KG relations (i.e. semantic parsing). Mention detection and entity disambiguation together are also regarded as the two steps of the entity linking task. Existing studies on KGQA typically treat entity linking as an individual task to be handled in advance [53]. While KGQAcl usually treats entity linking and relation detection as separate modules, KGQArr considers the whole process as a pipeline, with relation detection coming after entity linking.

4.1.1 Mention detection

Given a natural language question, the model first finds the mention representing the entity’s name in the question. Previous works usually treated mention detection as a named entity recognition task and employed various models such as RNNs, CNNs and their variants [2, 5, 6, 12] or BERT [14] to solve it. Other works regard it as a span detection task [3] or adopt a CNN-LSTM encoder-decoder to generate entities directly [4].

4.1.2 Entity disambiguation

The detected mentions are used to collect candidate entities, which are then ranked. Several works employ n-gram heuristics [2, 3, 6] to collect entities efficiently, after which methods such as character similarity [2,3,4] and TF-IDF scores [6] are employed for entity disambiguation. The authors of [5] disambiguate candidate entities simply by the scores of their connected relations. Our frameworks adopt this simple method.

4.1.3 Relation detection

This module aims to obtain the correct KG relation corresponding to the question. We summarize the two mainstream approaches, viz., KGQAcl and KGQArr, in this work. The former is based on the idea of classification and maps questions directly into the KG relation dictionary, independently of the previous modules. Previous works use various models such as RNNs [6], LSTMs [2, 4, 5] and BERT [3] to encode the question sequence, which is then classified into KG relation categories. The latter can be regarded as a similarity matching task, which uses the linked entities to retrieve a set of candidate relations and then selects the one with the highest similarity to the question. Various models [7, 8, 14, 15, 24], attention mechanisms [11, 12] and external features such as context [16] and type [10] have been designed to enhance performance.

4.1.4 Answer query

Candidate entities and candidate relations with scores are combined into pairs to query the KG for answers. The combination with the highest weighted sum of entity and relation scores is considered the correct pair [2,3,4,5,6]. Some works [9, 10, 12, 13] jointly train entity disambiguation and relation detection to select the pair with the highest model score. However, this approach cannot be implemented in our PLM-based frameworks due to GPU memory limitations, so we do not consider it, as its excessive hardware requirements make it difficult to deploy in practice.

Note that the two basic KGQA frameworks we summarize are vanilla and contain only PLMs and simple linear layers, which allows us to focus on comparing PLMs. Except for the answer query module, all modules are implemented based on PLMs. In addition, the modules are identical in both frameworks except for relation detection. Next, we describe these two PLM-based KGQA frameworks in detail.

4.2 The basic classification-based KGQA framework

Fig. 2 The basic classification-based KGQA framework

The basic Classification-based KGQA framework (KGQAcl) is shown in Figure 2. It consists of the four modules described in Section 4.1, namely Mention Detection, Entity Disambiguation, Relation Detection and Answer Query.

4.2.1 Mention detection

Given a question Q, the goal of mention detection is to identify the subject mention m. For instance, the subject mention of the question in Figure 2 is “washington”. We treat this task as a standard PLM-based named entity recognition task. The question sequence is encoded by the PLM and then fed into a linear classification layer, which assigns a label to each word in the sequence: B for the beginning of a mention, I for the inside of a mention, and O for non-mention. As PLMs adopt different tokenization methods, we only annotate the first token of each word and fill in the rest using the special character \(\left\langle pad \right\rangle \).
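A minimal sketch of this module is given below, assuming the Hugging Face transformers library; the checkpoint name bert-base-uncased stands in for any of the nine PLMs, and the ignore index -100 plays the role of the \(\left\langle pad \right\rangle \) filler described above:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO tagging with a PLM encoder plus a linear token-classification head.
LABELS = {"O": 0, "B": 1, "I": 2}
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(LABELS))

words = ["where", "is", "washington", "locate"]
word_labels = ["O", "O", "B", "O"]          # gold BIO tag per word

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
labels, prev_word = [], None
for word_id in enc.word_ids():
    if word_id is None or word_id == prev_word:
        labels.append(-100)                  # special tokens and non-first sub-tokens are ignored,
    else:                                    # standing in for the <pad> filler in the paper
        labels.append(LABELS[word_labels[word_id]])
    prev_word = word_id

out = model(**enc, labels=torch.tensor([labels]))
out.loss.backward()                          # standard token-classification fine-tuning step
```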

4.2.2 Entity disambiguation

The mention m representing the entity name is used to link to the grounded nodes in the KG. We pre-generate an inverted index that maps mentions to entities. We use m to look up the corresponding KG entities in this inverted index, which are regarded as the candidate entities \(E_c\). For instance, we obtain a set of candidate entities for the mention “washington”, including the capital of the United States “Washington, D.C.”, the state “Washington” and the person “George Washington”. Besides, the adjacent relations \(r_{e_i}\) of each \(e_i\in E_c\), retrieved from the KG, are used for disambiguation. For example, the adjacent relations of the person “George Washington” are “\(born\_in\)”, “\(died\_in\)”, “\(founded\_organisation\)”, etc.

Various PLMs are employed to score each entity \(e_i\in E_c\) by \(S_{e_i}=\mathtt {g_{PLM}}(Q\vert e_i \vert r_{e_i})\), where \(\vert \) denotes concatenation and \(\mathtt {g_{PLM}}()\) represents a PLM encoder. The loss function of the PLM-based entity disambiguation model is:

$$\begin{aligned} {\mathcal {L_{ED}} = -\log P(y=e_k),} \end{aligned}$$
(1)
$$\begin{aligned} {P(y=e_k) = \frac{e^{S_{e_k}}}{e^{S_{e_k}}+\sum _{j=1}^{N}e^{S_{e_j}}},} \end{aligned}$$
(2)

where \(e_k\) denotes the gold entity, N is the number of negative samples, \(e_j\) represents a negative entity and \(P(y=e_k)\) is the probability of \(e_k\). In addition, we adopt a simple linguistic approach, fuzzy string matching, to initially rank \(E_c\) and select more challenging negative entities to train the model: \(E_c\) is initially ranked according to the Levenshtein distance between the entity name and the mention m.

The entity set \(E_c\) with scores is obtained at the inference stage.
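The following sketch illustrates the scoring and loss of (1)-(2), assuming the Hugging Face transformers library; the checkpoint name, separator symbol and candidate list are illustrative, with the gold entity placed first and hard negatives drawn from the Levenshtein-ranked candidates:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class EntityScorer(nn.Module):
    """Scores the concatenation question | entity | adjacent relations with a PLM."""
    def __init__(self, plm_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(plm_name)
        self.encoder = AutoModel.from_pretrained(plm_name)
        self.score = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, question, entity, relations):
        text = f"{question} | {entity} | {' '.join(relations)}"
        enc = self.tokenizer(text, return_tensors="pt", truncation=True)
        h = self.encoder(**enc).last_hidden_state[:, 0]   # [CLS] representation
        return self.score(h).squeeze(-1)

scorer = EntityScorer()
question = "where is washington locate"
candidates = [  # gold entity first, then hard negatives (e.g. ranked by Levenshtein distance)
    ("Washington, D.C.", ["located_in", "capital_of"]),
    ("George Washington", ["born_in", "died_in"]),
    ("Washington (state)", ["located_in", "part_of"]),
]
scores = torch.cat([scorer(question, e, rels) for e, rels in candidates])
loss = -torch.log_softmax(scores, dim=0)[0]               # (1)-(2) with the gold entity at index 0
loss.backward()
```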

4.2.3 Relation detection

Relation detection is a PLM-based classification task in this framework. Since simple KGQA only considers one-hop relations, a question Q corresponds to exactly one relation in the KG. The model aims to map Q to a KG relation \(r\in \mathbb {R}\). For instance, the question “where is washington locate” in Figure 2 corresponds to the relation “located_in” in the KG. Specifically, a PLM encodes the question sequence to obtain the vector h, which is then fed into a linear classification layer to obtain the probability distribution over relations. The goal of the model is to minimize:

$$\begin{aligned} {\mathcal {L_{RD}} = -\log P(y=\hat{r}\vert Q) ,} \end{aligned}$$
(3)
$$\begin{aligned} P(y=\hat{r}\vert Q) = \frac{e^{h_{\hat{r}}}}{\sum _{j=1}^{M}e^{h_{r_j}}}, \end{aligned}$$
(4)

where \(\hat{r}\) refers to the gold relation, \(P(y=\hat{r}\vert Q)\) represents the probability of \(\hat{r}\) and M denotes the number of relation categories. Finally, the relation set \(R_c\) with scores is obtained.
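A minimal sketch of this classification formulation, assuming the Hugging Face transformers library; the relation inventory and checkpoint name are illustrative and stand in for the full KG relation set and the PLM under study:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

RELATIONS = ["located_in", "directed_by", "born_in"]      # in practice, the full KG relation set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=len(RELATIONS))

enc = tokenizer("where is washington locate", return_tensors="pt")
gold = torch.tensor([RELATIONS.index("located_in")])
out = model(**enc, labels=gold)                            # cross-entropy over relation categories, as in (3)-(4)
out.loss.backward()

# scored candidate relations R_c (the paper keeps the Top-5)
topk = out.logits.softmax(-1).topk(k=min(5, len(RELATIONS)))
```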

4.2.4 Answer query

This module does not involve any neural networks; it queries the answer in the KG using entity-relation pairs. Given the set of candidate entities \(E_c\) and the set of candidate relations \(R_c\) obtained by the entity disambiguation and relation detection modules, we combine them into (e, r) pairs to be queried in the KG, where \(e \in E_c\) and \(r \in R_c\). We rank each (e, r) pair by the weighted sum of its component scores, i.e., the entity disambiguation score and the relation detection score. The score of an (e, r) pair is

$$\begin{aligned} S_{(e, r)} = \lambda S_e + (1-\lambda )S_r , \end{aligned}$$
(5)

where \(\lambda \in (0, 1)\) is tuned according to the results on the validation set, and \(S_e\) and \(S_r\) are the normalized entity score and relation score, respectively.

Note that an (e, r) pair may be invalid because the combination does not exist in the KG. We remove these pairs by querying and verifying them in the KG. In addition, entity popularity is used to further prune pairs with the same score. In our work, popularity is derived from FACC1 and the degree of the entities.
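The following sketch summarizes the answer query step of (5); the weight, candidate scores and popularity values are illustrative, and \(\lambda \) would in practice be tuned on the validation set:

```python
# Combine scored candidate entities and relations, prune pairs absent from the KG,
# and break ties between equal scores by entity popularity.
def rank_pairs(cand_entities, cand_relations, kg_index, popularity, lam=0.6):
    pairs = []
    for e, s_e in cand_entities:           # (entity, normalized ED score)
        for r, s_r in cand_relations:      # (relation, normalized RD score)
            if (e, r) not in kg_index:     # remove combinations that do not exist in the KG
                continue
            score = lam * s_e + (1 - lam) * s_r
            pairs.append((score, popularity.get(e, 0), e, r))
    return sorted(pairs, reverse=True)

kg_index = {("Washington, D.C.", "located_in")}
best = rank_pairs(
    cand_entities=[("Washington, D.C.", 0.9), ("George Washington", 0.8)],
    cand_relations=[("located_in", 0.7), ("born_in", 0.2)],
    kg_index=kg_index,
    popularity={"Washington, D.C.": 3},
)
print(best[0])   # highest-scoring valid (e, r) pair
```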

4.3 The basic retrieval and ranking-based KGQA framework

The basic Retrieval and Ranking-based KGQA framework (KGQArr) is shown in Figure 3. It is a pipeline structure consisting of four modules, of which mention detection, entity disambiguation and answer query are identical to those of KGQAcl; the frameworks differ only in the relation detection module.

Fig. 3 The basic retrieval and ranking-based KGQA framework

Different from KGQAcl, relation detection in KGQArr selects the relation in the candidate set \(R_c\) that has the highest semantic similarity to the question pattern p. \(R_c\) consists of all adjacent relations of the candidate entities \(E_c\) retrieved from the KG. The question pattern p is obtained by replacing the mention m in the question Q with a special token \(\left\langle e \right\rangle \).

Following Sentence-BERT [27], we employ two parameter-sharing PLMs to encode the question pattern and the relations, respectively. This bi-encoder setup significantly improves efficiency compared to cross-encoding. For each relation \(r_i \in R_c\), we compute the similarity score \(Score(p, r_i)\). The final predicted relation \(\hat{r}\) is given by:

$$\begin{aligned} \hat{r}= \textrm{argmax}_{r_{i}\in R_{c}}Score(p,r_{i}), \end{aligned}$$
(6)
$$\begin{aligned} Score(p,r_{i})=\textrm{cos}(Pool(h_p),Pool(h_{r_i})), \end{aligned}$$
(7)

where \(h_p\) and \(h_{r_i}\) are both obtained from the PLMs, and Pool() refers to the pooling layer. During training, we adopt the hinge loss to maximize the margin between the gold relation \(r^+\) and the negative relations \(r^-\) in \(R_c\).

$$\begin{aligned} \mathcal {L_{RD}}=\sum _{i=1}^{k}\texttt{max}\left\{ 0,\gamma -Score(p,r^+)+Score(p,r_{i}^-) \right\} , \end{aligned}$$
(8)

where \(\gamma \) is a constant margin and k is the number of negative relations. As in the KGQAcl framework, the candidate relations \(R_c\) with scores and the candidate entities \(E_c\) with scores are fed into the answer query module together.
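A minimal sketch of this bi-encoder relation detection with the hinge loss of (8), assuming the Hugging Face transformers library; the checkpoint, mean pooling and margin value are illustrative choices:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# A single PLM (shared parameters) encodes the question pattern and each candidate
# relation separately; cosine similarity ranks the relations.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    return encoder(**enc).last_hidden_state.mean(dim=1)   # simple mean pooling

pattern = "where is <e> locate"                            # mention replaced by the token <e>
gold, negatives = "located_in", ["born_in", "capital_of"]

h_p = embed(pattern)
score_pos = torch.cosine_similarity(h_p, embed(gold))
margin = 0.5
losses = [torch.clamp(margin - score_pos + torch.cosine_similarity(h_p, embed(r)), min=0)
          for r in negatives]
loss = torch.cat(losses).sum()                             # hinge loss of (8)
loss.backward()
```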

5 Benchmarks

In this section, we describe the four benchmarks used in the experiments and the method for constructing them (Section 5.1). In addition, we introduce the accuracy and efficiency evaluation metrics, as well as the method for evaluating the scalability of PLM-based KGQA systems (Section 5.2).

5.1 Construction of the benchmarks

We conduct experiments on four benchmarks. Apart from the popular simple KGQA benchmark SimpleQuestions [24], we construct three more benchmarks to explore the scalability of PLMs on KGQA. In particular, we increase the scale of the original KG of SimpleQuestions and propose three KGQA benchmarks to investigate how the performance of PLMs changes as the KG size increases. Note that the question-answering datasets of the four benchmarks are identical.

(a) The original SimpleQuestions with a small-scale KG (SQs) [24]. The original benchmark contains more than 100,000 questions, split into train/validation/test sets with a 7/1/2 ratio. The KG resource of this benchmark is FB2M, denoted as \(\mathbb {G}_{S}\), which contains 2M entities and 6.7K relations. Some previous works pre-pruned the KG to fit their methods because they assumed all questions were known in advance. We do not preprocess the KG in the experiments with any of the compared models.

(b) SimpleQuestions with a large-scale KG (SQl). Benchmark SQl requires retrieving a triple from a large-scale KG to answer questions. To construct the KG of SQl, denoted as \(\mathbb {G}_{L}\), we retrieve all one-hop triples of the entities in FACC1 from the Freebase dump. FACC1 provides the common names and the popularity of the entities. We then merge \(\mathbb {G}_{FACC1}\) with \(\mathbb {G}_{S}\) to obtain \(\mathbb {G}_L\): \(\mathbb {G}_L = \mathbb {G}_{FACC1} \cup \mathbb {G}_{S}\). \(\mathbb {G}_L\) contains 108M entities, 12.7K relations and 292M triples, and completely covers \(\mathbb {G}_{S}\).

Additionally, we construct two more benchmarks with KGs denoted \(\mathbb {G}_{M-A}\) and \(\mathbb {G}_{M-B}\), respectively. Their numbers of triples lie between those of \(\mathbb {G}_{S}\) and \(\mathbb {G}_{L}\), so that the number of triples grows roughly uniformly across the four KGs.

(c) SimpleQuestions with medium-scale KG\(_A\) (SQm-a). The KG \(\mathbb {G}_{M-A}\) of SQm-a includes 61M entities, 11.1K relations and 105M triples, which also completely covers \(\mathbb {G}_{S}\).

(d) SimpleQuestions with medium-scale KG\(_B\) (SQm-b). The KG \(\mathbb {G}_{M-B}\) of SQm-b includes 90M entities, 12.2K relations and 202M triples, which also completely covers \(\mathbb {G}_{S}\).

The overall comparison of the KGs of the four benchmarks is shown in Table 1. Apart from the number of entities, relations and triples of each benchmark's KG, we also report the average degree, i.e., the average number of adjacent relations of the entities appearing in the SimpleQuestions dataset. The average degree reflects the difficulty of the benchmark to some extent.

Table 1 The overall comparison of the KG for the four benchmarks
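The following sketch illustrates the bookkeeping behind Table 1, namely merging triple sets and computing the average degree of the question entities; the toy triples are illustrative only:

```python
def merge_kgs(kg_a, kg_b):
    """Union of two triple sets, e.g. G_L = G_FACC1 ∪ G_S."""
    return kg_a | kg_b

def average_degree(triples, question_entities):
    """Average number of adjacent relations of the entities in the question dataset."""
    adjacent = {}
    for s, r, o in triples:
        adjacent.setdefault(s, set()).add(r)
    degrees = [len(adjacent.get(e, set())) for e in question_entities]
    return sum(degrees) / max(len(degrees), 1)

g_s = {("Forrest Gump", "directed_by", "Robert Zemeckis")}
g_facc1 = {("Forrest Gump", "starring", "Tom Hanks")}
g_l = merge_kgs(g_facc1, g_s)
print(average_degree(g_l, {"Forrest Gump"}))   # 2.0
```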

5.2 Evaluation metrics

We evaluate the overall performance of KGQA in terms of accuracy and efficiency. The accuracy metric follows the common evaluation method for SimpleQuestions, where we calculate the accuracy of the inferred (s, r) pairs. A prediction is correct only if it matches the ground truth in both subject \(\hat{s}\) and relation \(\hat{r}\):

$$\begin{aligned} accuracy= \frac{\sum _{i=1}^N1_{[(\hat{s}_i,\hat{r}_i)=(s_i,r_i)]}}{N}, \end{aligned}$$
(9)

where N refers to the number of questions.
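A minimal sketch of this exact-match metric over (s, r) pairs; the example pairs are illustrative:

```python
def pair_accuracy(predicted, gold):
    """Fraction of questions whose predicted (subject, relation) pair exactly matches the gold pair."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

print(pair_accuracy([("Forrest Gump", "directed_by")],
                    [("Forrest Gump", "directed_by")]))   # 1.0
```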

We calculate the average training and test time of all KGQA systems to evaluate their efficiency. For a fair comparison, we set the same batch size and number of negative samples for each PLM within the same basic framework.

Additionally, we define the Variation in Accuracy (VA) and the Variation in average test Time (VT) to evaluate the scalability of a KGQA system.

Definition 1

(VA) The VA of a KGQA system on a benchmark KG\(_x\) represents the gap between the accuracy of the system on KG\(_x\) and its accuracy on the small-scale KG benchmark (SQs): \(\textrm{VA} = accuracy_{\textsc {KGs}}-accuracy_{KG_x}\), where KG\(_x\) represents one of our four benchmarks and \(\textsc {KGs}\) denotes the small-scale KG benchmark. A higher VA means that the system performs worse in terms of scalability.

Definition 2

(VT) The VT of a KGQA system on a benchmark KG\(_x\) represents the change in the average test time of the system on KG\(_x\) compared to its average test time on the small-scale KG benchmark (SQs): \(\textrm{VT} = time_{KG_x}-time_{\textsc {KGs}}\), where KG\(_x\) represents one of our four benchmarks. A higher VT means that the system performs worse in terms of scalability.

6 Experiments

In this section, we first present all the PLM-based KGQA systems (Section 6.1) and the experimental setup (Section 6.2). We then show the overall experimental results and discuss them in light of three research questions (Section 6.3). We further explore the sub-modules of KGQA and discuss them according to three new research questions (Section 6.4). Besides, we evaluate all systems on two other KGQA datasets apart from the SimpleQuestions family (Section 6.5). Finally, we compare the performance of ChatGPT and the other PLMs on the SimpleQuestions, WebQuestionSP and FreebaseQA datasets (Section 6.6).

6.1 All KGQA systems based on PLMs

In this work, 18 KGQA systems (9 PLMs × 2 basic KGQA frameworks) were implemented for evaluation. Three classes of nine PLMs were used, including the common large-scale PLMs Bert, Roberta, XLnet and Gpt2, the lightweight PLMs ALbert, DistilBert and DistilRoberta, and the knowledge-enhanced PLMs Luke and Kepler. As the common models serve as the backbones of the lightweight and knowledge-enhanced PLMs, we follow [54] to further classify the common PLMs according to pre-training task categories, namely Masked Language Modeling (MLM, i.e., Bert and Roberta), Language Modeling (LM, i.e., Gpt2), and Permuted Language Modeling (PeLM, i.e., XLnet). The parameter counts of these PLMs are shown in Table 2.

Table 2 Parameters of various PLMs

BERT BERT is the most representative pre-trained language model that uses the encoder of the deep Transformer as its backbone. BERT uses Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) as self-supervised tasks for pretraining.

RoBERTa RoBERTa has almost the same architecture as BERT but differs in parameter settings and training objectives. RoBERTa removes the NSP loss and uses dynamic MLM masking instead of the static mask used in BERT, and is trained on a larger corpus with longer sequences.

GPT2 Unlike BERT and RoBERTa, which are masked language models, GPT2 is an autoregressive language model predicting one token at a time from left to right (i.e. LM). GPT2 is often used for natural language generation, whereas BERT and RoBERTa are mainly used for natural language understanding.

XLNET XLNET is known as a permuted language model [54]. Unlike GPT2, which cannot utilize the backward context, XLNET adopts a new objective called Permutation Language Modeling (PeLM), enabling the model to take advantage of both forward and backward contexts.

ALBERT ALBERT is a lite version of BERT. All its Transformer blocks share parameters and its embedding matrix is decomposed into two smaller matrices, so ALBERT has far fewer parameters than BERT. Instead of NSP, ALBERT predicts the order of two consecutive text segments.

DistilBERT DistilBERT is a distilled version of BERT that is pre-trained on the same corpus in a self-supervised manner, using the BERT model as a teacher. This means that it only pre-trains on raw texts, with no humans labeling them in any way.

DistilRoBERTa DistilRoBERTa is a distilled version of RoBERTa. It follows the same training procedure as DistilBERT.

LUKE LUKE is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism. The entity-aware self-attention mechanism is an extension of the self-attention mechanism of the Transformer and considers the types of words or entities when computing attention scores.

KEPLER KEPLER is a unified model for knowledge embedding (KE) and PLM representation. It encodes textual entity descriptions with a PLM and then optimizes KE and language modeling objectives jointly.

6.2 Experimental setup

All KGQA systems were trained on an NVIDIA GeForce RTX 2080 Ti. We performed a grid search for all KGQA systems, choosing the hyperparameter configuration that achieves the highest final accuracy. We adopted an early stopping strategy during training and set the patience to 3. Because the PLM-based entity disambiguation model is very large, the batch size can only be set to 1. However, this may cause some PLMs to struggle to converge, so we used gradient accumulation and increased the number of accumulation steps instead of increasing the batch size. Note that the batch size multiplied by the number of gradient accumulation steps is the same for all PLMs in the same sub-module task, to ensure a fair comparison of their training times. In addition, the KGQA systems output the top-50 results for entity disambiguation and the top-5 results for relation detection to combine (subject, relation) pairs.
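The gradient accumulation strategy can be sketched as follows; the tiny linear model and random data stand in for a PLM-based module, and the accumulation step count is illustrative:

```python
import torch
import torch.nn as nn

# With a physical batch size of 1, gradients are accumulated over several steps so that
# batch_size * accumulation_steps is identical across PLMs.
model = nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
data = [(torch.randn(1, 8), torch.randint(0, 2, (1,))) for _ in range(32)]

accumulation_steps = 16                      # effective batch size = 1 * 16
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = criterion(model(x), y) / accumulation_steps   # scale so the update matches a larger batch
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```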

6.3 Overall results and discussions

In this section, we show the overall results of the 18 KGQA systems based on various PLMs in terms of accuracy and efficiency (i.e., average training time and average testing time). Based on these experimental results, we discuss three questions: (1) Which PLMs have the best accuracy or efficiency? (2) What are the differences in accuracy and efficiency between the two basic KGQA frameworks? (3) How scalable are the various PLMs, i.e., how do their accuracy and efficiency vary as the size of the KG increases?

6.3.1 Discussion on the accuracy of KGQA systems

The accuracy of all the studied PLM-based KGQA frameworks on all benchmarks is summarised in Table 3.

The bold numbers in Table 3 represent the highest accuracy among all KGQA systems on each benchmark. Roberta and the two knowledge-enhanced PLMs Luke and Kepler achieve the best accuracy results. Both Luke and Kepler use Roberta as the backbone for knowledge augmentation. The results demonstrate the strong performance of Roberta and show that knowledge enhancement is beneficial for a knowledge-intensive task such as KGQA. Luke and Kepler perform better on the small-scale KG benchmarks, while Roberta performs better on the large-scale KG benchmarks. This may be because the additional knowledge-enhancing pre-training objectives affect the robustness of the underlying model. Comparing the performance of the same PLM under the two basic frameworks, we find that the KGQArr framework significantly outperforms the KGQAcl framework on the small-scale KG benchmark SQs. However, as the KG size increases, the accuracy of KGQArr falls below that of KGQAcl. We investigate the reason for this in Section 6.4 by analyzing the performance variation of their sub-modules. In addition, we note that XLnet and Gpt2 are worse than the other PLMs in terms of accuracy in almost all settings, even compared with the three lightweight PLMs. In particular, the Gpt2-based KGQArr system performs extremely poorly on all benchmarks. We believe this is influenced by the modelling paradigm of the PLMs. All PLMs are trained in the auto-encoding manner (i.e. MLM) except XLnet and Gpt2: Gpt2 is trained in the auto-regressive manner (i.e. LM), while XLnet combines the ideas of auto-encoding and auto-regressive modeling (i.e. PeLM). Auto-encoding modelling is well suited to natural language understanding (NLU) tasks, while auto-regressive modelling is better suited to natural language generation (NLG) tasks. Therefore, XLnet and Gpt2 perform poorly on KGQA because traditional approaches treat it as an NLU task. Some recent work [46] converts KGQA into an NLG task, and we will explore this approach in the future.

Table 3 Overall accuracy (%) of different PLMs-based KGQA systems on four benchmarks

We investigated the scalability of the KGQA systems with the two metrics VA and VT defined in Section 5.2. As shown in Figure 4, the VA of all PLM-based KGQA systems under both basic frameworks shows an increasing trend, indicating that scalability gradually worsens as the KG size increases. We exclude Gpt2 from this analysis due to its very poor accuracy. Under both the KGQAcl and KGQArr frameworks, XLnet exhibits the worst scalability, especially on the benchmark SQl with the largest KG. In addition, the knowledge-enhanced PLMs Luke and Kepler perform worse than the other PLMs in terms of scalability on the larger-scale KG benchmarks (SQm-b and SQl). In contrast, the lightweight PLMs ALbert, DistilBert and DistilRoberta are more robust to KG scale variations and scale better. We further analyze which sub-modules in the frameworks primarily affect scalability in Section 6.4.

Fig. 4 Scalability of all KGQAcl frameworks (a) and KGQArr frameworks (b) in terms of accuracy variation

Fig. 5 Scalability of all KGQAcl frameworks (a) and KGQArr frameworks (b) in terms of average test time variation

6.3.2 Discussion on the efficiency of KGQA systems

The efficiency of all the studied PLM-based KGQA frameworks on all benchmarks is summarised in Table 4. We set the same patience for all KGQA systems so that each model was trained to convergence. Due to the different convergence rates of the different PLMs, the variation in training time does not coincide with the variation in testing time.

As shown in Table 4, the two lightweight PLMs DistilBert and DistilRoberta exhibit the highest efficiency in training and testing. DistilBert is up to 3.1x faster than XLnet in training (749.7 ms vs. 2325.5 ms) and up to 3.2x faster in testing (26.3 ms vs. 85.3 ms). The other lightweight model, ALbert, has the fewest parameters, but its efficiency shows no advantage over the other PLMs. Therefore, knowledge distillation is an effective way to improve efficiency, whereas the matrix parameter-sharing strategy only reduces GPU memory consumption, with no improvement in efficiency. In addition, the time consumption of all PLMs tends to increase as the size of the KG increases. Comparing the efficiency of the two basic frameworks for the same PLM, KGQArr is always more time-consuming than KGQAcl. According to the analysis in Section 6.3.1, the KGQArr framework is more accurate than the KGQAcl framework only on the small-scale KG benchmark SQs. Therefore, the KGQAcl framework is a better choice for large-scale KG benchmarks.

The efficiency scalability of all PLMs under both frameworks is shown in Figure 5. All PLMs show an increasing trend in VT as the KG size increases. Among them, DistilRoberta has the best scalability, as it has the smallest VT on all benchmarks. The other distilled PLM, DistilBert, also shows good scalability. Section 6.3.1 also shows that DistilRoberta and DistilBert achieve accuracy equivalent to the other PLMs on large-scale KGs. These findings indicate that the two knowledge-distilled PLMs have excellent scalability, and that knowledge distillation is a promising approach for applying PLMs to KGQA.

Table 4 Efficiency of all PLMs-based KGQA frameworks on four benchmarks

6.3.3 Summary and new research questions

Some important conclusions can be drawn from the above discussion. Roberta, Luke and Kepler perform best in terms of overall accuracy. Nevertheless, Luke and Kepler have slightly poorer scalability, with greater variation in accuracy as the KG size increases. The two lightweight PLMs DistilBert and DistilRoberta exhibit the best scalability in both accuracy and efficiency: their accuracy on large-scale KGs matches that of the other PLMs, and their inference time is up to 3.3x faster. For the KGQA frameworks, the KGQArr framework is significantly less efficient than the KGQAcl framework. Furthermore, KGQArr-based systems are more accurate than KGQAcl-based systems only at small KG scales; as the KG size increases, the KGQAcl-based systems gradually outperform the KGQArr-based systems, indicating the poor scalability of the KGQArr framework.

These findings lead us to explore the following questions further. (1) Which sub-modules of the KGQA system are primarily responsible for the differences in accuracy and efficiency? (2) Which sub-modules are most susceptible to variation in KG size? (3) Why does the KGQArr framework scale worse? We explore these three questions by examining the performance of the sub-modules of all KGQA systems.

6.4 Study of the KGQA sub-modules

6.4.1 Results and discussion on KGQA sub-modules

In this section, we compare the sub-module performance of each KGQA system to explore the primary factors influencing accuracy and efficiency for each PLM. Additionally, we compare the two basic KGQA frameworks to explore the reasons for their large differences.

Tables 5, 6 and 7 show the overall results for Mention Detection (MD), Entity Disambiguation (ED) and Relation Detection (RD), respectively. For efficiency, we only compare the average test time. We do not analyze Answer Query further, as it does not involve PLMs. The MD results are not affected by the KGQA framework or the benchmark; they depend only on the PLM. Table 5 shows that all PLMs except Gpt2 have similar accuracy and efficiency on MD, indicating that Gpt2, based on auto-regressive modelling (i.e. LM), is not good at NER tasks. Bert has the highest F1 score but poor efficiency. Notably, Bert's distilled version DistilBert nearly doubles efficiency with only a slight performance penalty.

The ED module is the same in both frameworks. As shown in Table 6, Roberta exhibits the best accuracy, and DistilBert and DistilRoberta have the shortest test times. It is worth noting that the accuracy and efficiency of all PLMs on the ED task are greatly affected by the KG size. This is because, as the KG size increases, the number of candidate entities and the degree of entities increase, as shown in Table 1. However, as the KG grows larger, the marginal impact on accuracy becomes smaller. XLnet shows the most severe decrease in accuracy (a 27.59% drop), and ALbert shows the largest increase in test time (a 50.3 ms increase).

Table 5 Results of mention detection of all PLMs-based KGQA systems
Table 6 Results of entity disambiguation of all PLMs-based KGQA systems on four benchmarks
Table 7 Results of relation detection of all PLMs-based KGQA systems on four benchmarks

As shown in Table 7, there are significant differences in RD performance between the two KGQA frameworks, which lead to the differences in their final accuracy and efficiency. All rows in Table 7 show that KGQAcl is more efficient than KGQArr because KGQArr needs to encode all candidate relations as well as the question to calculate similarity, whereas KGQAcl only needs to encode the question. This is also why the increase in KG size significantly affects the accuracy and efficiency of KGQArr but does not affect KGQAcl. Although KGQArr is significantly more accurate than KGQAcl on the small-scale KG benchmark SQs, the former is less scalable than the latter. In addition, the knowledge-enhanced PLMs Luke and Kepler show the highest accuracy, which indicates the effectiveness of the knowledge-enhancement approach.

In general, both the ED and RD modules significantly impact the final accuracy. ED and RD under the KGQArr framework have the primary effect on the final efficiency, and they are most susceptible to changes in KG size. KGQArr scales worse than KGQAcl because of their different approaches to RD.

6.4.2 Entity disambiguation using the vanilla method

The analysis in Section 6.4.1 shows that PLM-based entity disambiguation takes up the most time in the whole KGQA system. Given the high computational complexity of PLMs, we attempt to solve entity disambiguation using a vanilla method without any neural networks. We use only a simple linguistic approach, fuzzy matching, to rank all candidate entities (as mentioned in Section 4.2.2). Specifically, we rank all candidate entities according to the Levenshtein distance between the entity name and the subject mention.
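A minimal sketch of this vanilla ranking; the Levenshtein implementation and candidate names are illustrative:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

mention = "washington"
candidates = ["Washington, D.C.", "George Washington", "Washington (state)"]
ranked = sorted(candidates, key=lambda e: levenshtein(mention, e.lower()))
print(ranked)   # candidates ordered by string similarity to the mention
```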

Figure 6 shows the comparison between the vanilla method and the PLMs on entity disambiguation on all benchmarks, and their impact on the final accuracy. Compared to the vanilla method, all PLMs significantly improve entity disambiguation performance on all benchmarks (Van ED vs. ED). However, the improvements in final accuracy are not as significant in most cases (Van Acc vs. Acc). This is because the answer query module performs a weighted combination of scored candidate entities and scored candidate relations, which also screens out ambiguous entities to some extent. It is worth noting that XLnet improves entity disambiguation on the large-scale KG benchmark SQl (Figure 6(c)), yet it is worse than the vanilla method in terms of final accuracy. This is because the large-scale KG contains too many noisy relations, leading the XLnet-based entity disambiguation model to assign lower scores to ambiguous entities that the answer query module could otherwise filter out. In addition, Figure 7 compares the efficiency of the vanilla method and the PLMs on entity disambiguation; the vanilla method takes much less time than the PLMs. Therefore, PLM-based entity disambiguation is time-costly and yields only a limited improvement in the final accuracy of KGQA. More importantly, the XLnet-based entity disambiguation model can even reduce the final accuracy on the large-scale KG benchmark SQl.

Fig. 6 Comparison of top-1 recall of entity disambiguation and final accuracy for the various PLM-based methods (a-i) and the vanilla method. Van ED and Van Acc denote entity disambiguation and the whole KGQA system using the vanilla method; ED and Acc denote entity disambiguation and the whole KGQA system using the PLM

Fig. 7 The average test time of the various PLM-based methods compared to the vanilla method for entity disambiguation on all benchmarks

Table 8 Overall accuracy and efficiency of different PLMs-based KGQA systems on WebQuestionSP (WB) and FreebaseQA (FBQ)

6.5 Validation beyond the simple questions benchmarks

In addition to the four benchmarks of the SimpleQuestions family (Section 5.1), we evaluate the accuracy and efficiency of all systems on the WebQuestionSP [57] and FreebaseQA [56] datasets. Both datasets adopt the large-scale KG Freebase as their resource and include a high proportion of simple questions (71.3% in WebQuestionSP and 66.4% in FreebaseQA). As these two datasets also include numerous questions with multi-hop paths or multiple constraints, such as “What character did Natalie Portman play in Star Wars?”, we followed [55] to pre-process them. Specifically, we kept only simple questions that can be answered by a single triple and whose entities and predicates are within FB2M.

Table 8 shows the overall accuracy and efficiency results for all systems on WebQuestionSP and FreebaseQA. Based on these results, we reach conclusions similar to those of the experiments on SimpleQuestions. Roberta and Luke perform best. Gpt2 performs the worst, especially under the KGQArr framework. Almost all PLMs under the KGQArr framework achieve higher accuracy than under the KGQAcl framework but are more time-consuming. The two distillation-based PLMs, DistilBert and DistilRoberta, are far more efficient than the other PLMs. Furthermore, all systems performed poorly on FreebaseQA, with even the best, Luke, achieving only 42.08% accuracy. After an error analysis, we found that FreebaseQA contains many mislabelled and unanswerable questions.

6.6 ChatGPT for zero-shot KGQA

We conducted experiments to compare the performance of ChatGPT and the other PLMs on 300 questions sampled from SimpleQuestions, WebQuestionSP and FreebaseQA. Note that ChatGPT was evaluated under the zero-shot KGQA setting, while the other PLMs were fine-tuned on the training set in their better-performing framework (the KGQAcl framework for Gpt2 and the KGQArr framework for the other PLMs). The input to ChatGPT consists of the instruction (“Please answer the given question based on the context. The answers should be factual answers.”) and the question, inspired by [62]. After reading the entire input, the model generates the answer as a piece of text. For each question, the answers generated by ChatGPT were evaluated and cross-validated by two professionals with reference to the gold answers.
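The prompting setup can be sketched as follows; the instruction text follows the description above, while the API call assumes the legacy openai-python chat interface and may need adjusting for newer SDK versions:

```python
import openai  # assumes openai<1.0 and that openai.api_key has been set

INSTRUCTION = ("Please answer the given question based on the context. "
               "The answers should be factual answers.")

def ask_chatgpt(question):
    """Zero-shot prompt: instruction followed by the question, answered as free text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{INSTRUCTION}\n{question}"}],
    )
    return response["choices"][0]["message"]["content"]

# answer = ask_chatgpt("The film Forrest Gump is directed by who?")
```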

Table 9 Accuracy (%) of ChatGPT, KGQAcl framework with GPT2 and KGQArr frameworks with the other PLMs on SimpleQuestions (SQ), WebQuestionSP (WB) and FreebaseQA (FBQ)

Table 9 shows that ChatGPT outperforms the other PLMs by up to 12% on WebQuestionSP and surpasses them by up to 54% on FreebaseQA. However, ChatGPT performs poorly on SimpleQuestions, with an accuracy of only 29.3%. We speculate that the discrepancy is caused by the different construction methods of these datasets. WebQuestionSP was derived from the Google Suggest API, and FreebaseQA was scraped from trivia and quiz-league websites, which are still accessible. In contrast, SimpleQuestions was constructed by humans based on Freebase triples. Therefore, it is possible that ChatGPT has seen these questions or related texts due to its extremely large training corpus.

Table 10 Six error types of ChatGPT on SQ

We further categorize the error cases of ChatGPT on SimpleQuestions, as shown in Table 10. We treat Enumeration-type answers, which account for 38.2% of errors, as a category of error because it is difficult to verify that all enumerated items are correct. Note that when the enumerated items include the gold answer, we consider the answer correct; that is, for the Type 1 (Enumeration) errors, ChatGPT's answer does not contain the gold answer. Even if we regarded all questions of Type 1 as correct, the accuracy of ChatGPT on SimpleQuestions would be 56.3%, still significantly inferior to the other PLMs. Wrong Answers (26.9%) indicate that ChatGPT's answer differs from the gold answer. Besides, we noticed that ChatGPT even generates incorrect facts, also known as the hallucination problem [63]. For example, in the second example in Table 10, the politician and revolutionary is actually Felix Dzerzhinsky rather than Ivan Dzerzhinsky. In addition, 22.6% of the error cases are due to a lack of knowledge about the subject entity (Lack of Knowledge), and 10.0% are due to a lack of additional information to disambiguate the subject entity (Ambiguous Entities). 1.4% of the errors are due to ChatGPT misunderstanding the semantics of the question (Misunderstanding), and 0.9% are caused by poor quality of the question itself (Dataset Problem). These cases demonstrate that ChatGPT may generate factual errors and still lacks extensive factual knowledge, since many subject entities cannot be identified.

7 Conclusion and future works

Due to the improved performance of PLMs on most NLP tasks, it has become a consensus to use PLMs as the backbone for solving NLP tasks. In this paper, we investigate the application of PLMs to a knowledge-intensive task, namely knowledge graph question answering. We conduct comprehensive experiments to explore the accuracy and efficiency of PLMs on KGQA, as well as the scalability of PLMs as the KG size increases. In addition, we compare the performance of ChatGPT and the other PLMs on three KGQA datasets. We present a detailed analysis of these experimental results and draw some important conclusions regarding the use of PLMs in KGQA.

1. Roberta and the knowledge-enhanced PLMs Luke and Kepler achieve the highest accuracy on the KGQA task. Luke and Kepler perform better on the small-scale KG benchmarks, while Roberta performs better on the large-scale KG benchmarks.

2. The lightweight PLMs DistilBert and DistilRoberta, built with knowledge distillation, significantly improve efficiency but have lower accuracy than the other PLMs on the small-scale KG benchmarks. However, they exhibit the best scalability: as the KG size increases, the gap between their accuracy and that of the other PLMs gradually disappears.

3. The accuracy of XLnet with permuted language modelling and of Gpt2 with language modelling is worse than that of the PLMs with masked language modelling, especially for the Gpt2-based KGQArr framework.

4. The combined overall accuracy and efficiency results of KGQA show that PLM-based entity disambiguation has no clear advantage over fuzzy-matching-based entity disambiguation. Although the former is significantly better than the latter on the entity disambiguation task itself, the gap in final accuracy between the two resulting KGQA systems is insignificant because the answer query module also has the ability to disambiguate.

5. ChatGPT shows superior performance on zero-shot WebQuestionSP and FreebaseQA, even significantly outperforming the other fine-tuned PLMs. We speculate that this is because ChatGPT has seen a similar corpus during training, as it performs extremely poorly on the manually constructed SimpleQuestions. The error-case analysis suggests that ChatGPT may generate answers with incorrect facts and still lacks knowledge, since many subject entities cannot be identified.

Further, we examine the overall results of the various PLMs on the subtasks of KGQA and obtain similar conclusions. Roberta and Bert exhibit the best performance on the mention detection and entity disambiguation tasks, while the knowledge-enhanced PLMs Luke and Kepler show strong capabilities on the relation detection task. DistilBert and DistilRoberta have a clear efficiency advantage and perform well on all tasks except entity disambiguation, where they are slightly inferior. In addition, we find that the KGQArr-based systems are significantly less efficient than the KGQAcl-based systems. Furthermore, the KGQArr-based systems are more accurate than the KGQAcl-based systems only when the KG scale is small; as the KG scale increases, the former gradually falls behind the latter, which indicates the poor scalability of the KGQArr framework.

In future work, we will extend the proposed simple KGQA frameworks to multi-hop complex KGQA. We will also continue to investigate the application of knowledge distillation and knowledge-enhanced PLMs in KGQA, as our experiments show them to be promising. In addition, we will follow up on artificial general intelligence models like ChatGPT and test them more thoroughly, especially in terms of efficiency.