Introduction

As the volume of patents continues to grow, patent examiners confront extensive patent databases and must meticulously identify relevant citations to support their examinations (Alcácer & Gittelman, 2006; Kuhn, 2011). A proficient citation recommendation system not only saves examiners' time but also improves their understanding of pertinent prior art, thereby elevating the scientific robustness and dependability of patent examinations (Choi et al., 2022a, 2022b). Moreover, precise citation recommendation plays a pivotal role in fostering knowledge dissemination, thereby stimulating innovation and technological advancement (Färber & Jatowt, 2020).

Unlike scientific and technical papers, which often cite many references to support research backgrounds or theories, patent citations have unique motives and purposes (Brooks, 1985; Meyer, 2000). In addition to the applicant's self-reported references, patent citations primarily come from the patent examiner's review of technically relevant "comparison documents" for the patented innovation (Lin et al., 2016). To enhance the standardization and rigor of patent grants and to prevent applicants from intentionally omitting citations of technically similar literature to bolster their chances of approval, patent examiners must thoroughly examine all technically analogous literature associated with a patent application (Zhao & Wen, 2017). Examiner citations therefore reflect knowledge linkage better than applicant citations, and patent citations that are technologically relevant to the text of the pending patent can be obtained through the examiner's evaluation (Chen, 2017; Wada, 2018).

Although models applying various methods for patent citation recommendation (Chen et al., 2020; Yin et al., 2024) are now available, these models still ignore the importance of examiner citations. As an important means of protecting intellectual property, the patent examination process is strict and careful (China National Intellectual Property Office (CNIPO), 2010). In this regard, existing recommendation systems have yet to fully exploit the more reliable patent citation information labeled by professional patent examiners. Patent examiners are responsible for determining the patentability of pending patents, and thus examiner citation datasets may be more credible and applicable than the applicant citations used in previous studies (Choi et al., 2022a, 2022b; Fu et al., 2015; Lu et al., 2020).

Patent citations are established upon technical citation contexts, encompassing factors such as semantic similarity, technical relevance, or legal protection existing between the pending patent and the cited patent (Meyer, 2000). Therefore, to gain a comprehensive understanding of the context of patent citations, it is essential to consider not only the textual data of patents but also to integrate metadata that represents the technical domain. However, prior studies have primarily focused on identifying semantically similar patents (An et al., 2021; Arts et al., 2018; Zhang et al., 2016) or relied solely on textual similarity for citation recommendation (Chen et al., 2020; Yin et al., 2024). These approaches might not have adequately adapted to the diverse citation recommendation needs of patent examiners.

Hence, the primary focus of this paper is to leverage knowledge graphs to enhance the accuracy and efficiency of citation recommendations in the patent examination process. Through the integration of structured information, entities, relationships, and patent-specific semantic details, the objective is to offer patent examiners more contextually relevant and precise recommendations. Our code and data are available at https://github.com/xinyu0610/Citation-Recommendation-Model-for-Patent-Examiners.

The main contributions of this study can be described as follows:

  (i) We propose PK-Bert, utilizing a knowledge graph with semantic information and an improved Transformer framework, to identify patent examiner citation relationships.

  (ii) We collect crucial patent attributes to construct the Patent-info knowledge graph, comprising a total of 215,089 patent triples.

  (iii) We introduce a novel approach to embed patent information and investigate the impact of various attributes on examiner citation recommendations through ablation experiments.

The remainder of this manuscript is organized as follows. We review previous literature in “Related works” section and “Methodology” section describes the methods and models used in this study. “Experiment” section shows the data collection, knowledge graph construction, patent examiner citation recommendation experiments, and the ablation study. Finally, the discussion and conclusion are presented in “Conclusion and future works” section.

Related works

In this section, we review and summarize the process and influencing factors of patent examination and further discuss the methods related to patent citation recommendation and knowledge graph enhancement techniques.

Patent examination process

Patent examination is an approval process a patent must go through before it can be granted. According to the Patent Law of China, the examination and approval process of a patent application for an invention includes five stages: acceptance, preliminary examination, publication, substantive examination, and authorization. A utility model or design patent application does not undergo early publication and substantive examination in the examination and approval process. It has only three stages: acceptance, preliminary examination, and authorization (CNIPO, 2020). The examination flowchart is shown in Fig. 1.

Fig. 1 Simplified patent examination process for Chinese invention patents

In the patent examination process, each patent requires a minimum of two complete examinations by an examiner to evaluate criteria such as novelty and inventiveness. The typical approach to assessing prominent substantive features, the so-called 'three-step method', is provided in the Guidelines for Patent Examination 2010 (CNIPO, 2010), as follows:

  (i) Determine the closest prior art.

  (ii) Determine the distinguishing technical features of the invention and the technical problem solved by the invention.

  (iii) Determine whether the claimed invention is obvious to a person skilled in the art.

The United States Patent and Trademark Office (USPTO), the European Patent Office (EPO) and the Japan Patent Office (JPO) have similar guidelines (EPO, 2023; JPO, 2015; USPTO, 2020). Patent examination is time-consuming under stringent examination requirements. In areas requiring timely technology protection, such as artificial intelligence, the average examination period is 32.81 months, which negatively impacts the protection and dissemination of new technologies (Ou et al., 2022).

The main factors influencing the length of patent examination are patent characteristics, quality, and value indicators, as well as determinants affecting the complexity of the examination task (Harhoff & Wagner, 2009). The examination period is also influenced by the patent agent, the number of priority claims, the length of the central claims, the number of application pages, and the examiner (Tong et al., 2018). Patent examination quality depends on examiner experience, ability, time allocated per decision, other incentives, and examiner characteristics (DeGrazia et al., 2021). Increasing examiner workload can lead to systematic bias in their decisions: examiners who need more time to conduct prior art searches are inclined to grant patents, and a large workload reduces examination quality (Kim & Oh, 2017). However, improvements to patent search systems can significantly reduce the frequency of appeals against examiners' rejection and grant decisions, ultimately leading to shorter examination times (Yamauchi & Nagaoka, 2015).

Therefore, better methods of recommending patent citations at the examination stage can improve the efficiency of examination.

Patent citation recommendation

The purpose of patent citation recommendation is primarily to help patent applicants, researchers, and examiners better understand and evaluate a patent's technical background, prior art, and innovation. Depending on the target audience, such systems fall into three directions: patent retrieval systems for applicants (Shalaby & Zadrozny, 2019), patent analysis systems for researchers (Krestel et al., 2021), and patent recommendation systems for patent examiners (Fu et al., 2015).

For most users, access to a list of citable patents begins with a retrieval system. Patent retrieval systems, such as Google Patents and the Derwent Innovations Index, make recommendations based on the similarity of patent texts. Company-oriented patent recommendation systems consider the fit between a company's needs and patent technology, recommending potentially transferable patent documents by matching the two (Chen & Deng, 2023; Lee & Sohn, 2021). The vector space model (VSM), based on text mining techniques, is the most commonly used approach, incorporating patent keyword analysis to calculate the VSM-weighted similarity of the text elements of a patent (Arts et al., 2018; Zhang et al., 2016). Incorporating semantic information into the traditional model can effectively improve the technical similarity computed between patents; specific methods include setting different weight values to reflect the semantic differences of words at different positions (Arts et al., 2021). Assessing the similarity among patents requires a rigorous computational analysis grounded in identifying entities within patent documents and an in-depth exploration of the semantic relations interconnecting these entities (An et al., 2021; Hain et al., 2022; Teng et al., 2024; Wang & Liu, 2022; Wang et al., 2019).

Building on retrieval systems, patent citation recommendation systems give more consideration to the subject-matter or technical similarity between patents. Modelling the heterogeneous relationships between patents around subject features extracts deep patent features, and the relevance of the recommended patents is significantly better than that of keyword-based methods and the standard topic model (Chen et al., 2020). Considering multi-topic information in citation recommendation can also yield richer patent citation lists than similarity methods (Yin et al., 2024). It is worth mentioning that the model proposed by Choi et al. (2022a, 2022b) employs a two-stage structure: candidates are first selected based on textual information and pre-trained CPC embeddings, then re-ranked by a trained deep learning model combined with examiner citation information. The proposed model and dataset help researchers understand the ins and outs of technical citations and better accomplish the citation recommendation task. Furthermore, in our previous study (Lu et al., 2020), we defined the concept of technical similarity between patents by categorizing patent citation relationships into cited patents and similar but uncited patents, based on the premise that patent citations are associated with knowledge. In that approach, experiments with deep learning models demonstrated that technical differences remain between patent citations and the similar patents pushed by recommender systems.

Based on the above research, we believe that it is feasible and effective to construct a citation recommendation model for patent examiners. The model can learn more in-depth information about the patent technology, enabling it to better distinguish among similar patents recommended by the retrieval system. This capability proves effective in reducing the time examiners need for patent retrieval.

Knowledge graph enhancement methods

A knowledge graph systematically portrays information comprising entities, relationships, and semantic descriptions. Entities encompass real-world objects and abstract concepts, and relationships signify the connections between entities. Semantic descriptions of entities and their relationships incorporate well-defined types and properties, each carrying a specific meaning (Ji et al., 2022). In addition to enabling the visualization of relationships between entities, knowledge graphs are used as external knowledge links in natural language processing. Knowledge graphs find applications in text augmentation by leveraging structured information to enhance and enrich textual content (Wu et al., 2022). By incorporating entities, relationships, and semantic details, knowledge graphs contribute to a more comprehensive understanding of the context, enabling improved content generation and information enrichment (Shi et al., 2023). This approach facilitates creating more contextually relevant and semantically meaningful text, enhancing the overall quality and depth of textual content across various domains, from natural language processing to information retrieval and content generation systems (Dietz et al., 2018; Ridho et al., 2020).

Knowledge graphs play a significant role in the patent recommendation by leveraging structured information to enhance the relevance and efficiency of the recommendation process. By incorporating entities, relationships, and semantic details from a vast patent corpus, knowledge graphs enable a more nuanced understanding of the technological landscape. This enhanced understanding allows for identifying relevant patents based on their semantic connections, improving the accuracy of patent recommendations (Deng & Ma, 2022; Xiao et al., 2023).

Through integrating knowledge graphs, patent recommendation systems can effectively consider the intricate relationships between patents, categories, inventors, and technical concepts. This approach helps provide more precise and targeted recommendations and better interpret the recommendation results (Chen & Deng, 2023).

Therefore, in this paper, we explore the factors affecting patent examiner citation recommendation. We aim to construct a more accurate examiner citation recommendation model by developing a patent knowledge graph and incorporating patent attributes and relationships into the text to enhance semantic information.

Methodology

We propose to use a knowledge graph with semantic information and the high-performing Transformer framework (Vaswani et al., 2017) to enhance the ability of Bert (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) to identify patent citation relationships. We extend the generic K-Bert (Enabling Language Representation with Knowledge Graph) (Liu et al., 2020) and introduce PK-Bert for patent examiner citation recommendation. By utilizing a patent knowledge graph, PK-Bert can seamlessly integrate domain-specific knowledge into the model, leading to better understanding and generation of language. The overall architecture flowchart of PK-Bert is shown in Fig. 2.

Fig. 2 Data preparation and model structure

Knowledge embedding layer

The knowledge embedding layer is commonly used to integrate external knowledge sources, such as knowledge graphs or ontologies, into a neural network model. This allows PK-Bert to represent words and concepts more effectively, considering their relationships and dependencies. In the context of patent analysis, the knowledge embedding layer embeds knowledge entities into the patent text and transforms the patent claim text into a sentence tree. We first define the input to our model as a patent claim text, represented as a sentence \(s\) composed of n words \({w}_{i}\), i.e., \(s=\{{w}_{0}, {w}_{1}, {w}_{2}, \dots , {w}_{n}\}\). We then define a knowledge graph \(K=\{({w}_{i},{ r}_{ij}, {w}_{ij}), \dots ,({w}_{i},{ r}_{ik}, {w}_{ik})\}\), where each (\({w}_{i},{ r}_{ij}, {w}_{ij}\)) is a triplet in which \({w}_{i}\) and \({w}_{ij}\) are entities of the knowledge graph and \({r}_{ij}\) is the relationship between them. Finally, for the input patent sentence \(s\), the knowledge graph \(K\) is embedded through two steps, knowledge query and knowledge injection, to form the final patent sentence tree \({S}_{tree}\), whose structure is shown in Fig. 3. In the knowledge query step, if a word \({w}_{i}\) matches a knowledge graph entity, the corresponding triple \(({w}_{i},{ r}_{i1}, {w}_{i1})\) is retrieved as a branch of \({w}_{i}\). Since a word \({w}_{i}\) may appear in k different triples of the knowledge graph, the knowledge injection step keeps only the branches [\(\left({w}_{i},{ r}_{i1}, {w}_{i1}\right), \dots , ({w}_{i},{ r}_{ik}, {w}_{ik})\)] of \({w}_{i}\) and does not run a further knowledge query on \({w}_{ik}\). That is, each word \({w}_{i}\) can have more than one tree branch, but the depth of each branch is at most one.
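
To make the two steps concrete, the following minimal sketch builds a one-hop sentence tree; the dictionary-based triple store, whitespace tokenization, and branch cap are illustrative assumptions, not the released implementation.

```python
# A minimal sketch of the knowledge-query and knowledge-injection steps.
from typing import Dict, List, Tuple

KG = Dict[str, List[Tuple[str, str]]]  # entity -> [(relation, tail entity), ...]

def build_sentence_tree(words: List[str], kg: KG, max_branches: int = 2):
    """Attach at most max_branches one-hop branches to each word.

    Each tree node is (word, branches); branch depth is fixed at one,
    so tail entities are never queried again.
    """
    tree = []
    for w in words:
        branches = kg.get(w, [])[:max_branches]  # knowledge query
        tree.append((w, branches))               # knowledge injection
    return tree

kg: KG = {"language": [("Synonym", "Linguistic information")]}
sentence = "Information query system based on natural language".split()
tree = build_sentence_tree(sentence, kg)
# The node for "language" now carries the (Synonym, Linguistic information) branch.
```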

Fig. 3 The structure of the sentence tree \({S}_{tree}\)

Seeing position layer

Since adding knowledge graph entities may change both the word order and the semantic logic of the original sentence, addressing the resulting knowledge noise is critical. The embedding of knowledge also requires solving the problem of spatial consistency between the knowledge vector and the original text vector. This study adopts the seeing layer of the generic K-Bert model, which constructs a visible position representation of the patent sentence tree to overcome the semantic confusion introduced by incorporating the knowledge graph.

Take "Information query system based on natural language" as an example. Its sentence tree structure is shown in Fig. 4. The actual position indices are marked in black, but since the position of a branch should stay closely associated with its corresponding word, the adapted position indices are marked in red.

Fig. 4 The positional index of each node in the sample sentence tree is used to guide the injection of knowledge entities

We build the corresponding seeing position matrix from the sample sentence tree. According to the black position numbers in Fig. 4, the "Synonym" at position 5 and the "Linguistic information" at position 6 are associated only with "natural" and "language" at positions 3 and 4. Therefore, positions 5 and 6 are visible to positions 3, 4, 5, and 6 (marked in blue) and invisible to positions 0, 1, 2, 7, 8, 9, 10, and 11 (marked in grey). Following this logic, we eventually obtain the 0–1 visible position matrix shown in Fig. 5.
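
Continuing the earlier sketch, the visibility rule can be written as follows. One simplification is assumed for brevity: each branch hangs from a single trunk word, whereas the example entity spans two words.

```python
import numpy as np

def seeing_matrix(tree):
    """Flatten a sentence tree and build the 0-1 seeing position matrix.

    Trunk tokens see all other trunk tokens; a branch token sees only
    its host word and sibling branch tokens, mirroring the blue/grey
    pattern of Fig. 5. One token per word/entity, for readability.
    """
    tokens, owner, trunk = [], [], set()
    for word, branches in tree:
        host = len(tokens)
        trunk.add(host)
        tokens.append(word); owner.append(host)
        for relation, tail in branches:
            for tok in (relation, tail):
                tokens.append(tok); owner.append(host)

    n = len(tokens)
    visible = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if (i in trunk and j in trunk) or owner[i] == owner[j]:
                visible[i, j] = 1
    return tokens, visible
```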

Fig. 5 Seeing position matrix for the sample sentence tree

Word embedding layer

The word embedding layer transforms the sentence tree output by the knowledge embedding layer into word embeddings. Following the embedding representation of the Bert model, the word embedding here consists of token embedding, adapted-position embedding, and segment embedding. Since the sentence structure is altered after incorporating knowledge, the crucial point of this layer is to preserve the original structural information while transforming the modified sentence tree into a word sequence. Taking two patent texts \(A:\{{w}_{00}, {w}_{00{\prime}}, {w}_{01}, {w}_{02}\}\) and \(B:\{{w}_{10}, {w}_{11}, {w}_{11{\prime}}, {w}_{12}\}\) as the input example of the patent examiner citation recommendation model, the final patent text input vector is \(X= token\_emb + adapted\_position\_emb + segment\_emb\), as shown in Fig. 6.

Fig. 6 Word embedding of the patent text input sample

For token embedding, to obtain a suitable word sequence, the word embedding layer reorders the words in the sentence tree: the supplementary knowledge words of each branch are inserted after the corresponding original word, and the following words are shifted backward in order. In addition, some special flag tokens are added to the input of the Bert model. The [CLS] flag is placed at the beginning of the input, and its representation vector serves the classification task; the [SEP] flag separates the two input sentences. For the two input patent texts A and B above, the token embedding is {[CLS], \({w}_{00}, {w}_{00{\prime}}, {w}_{01}, {w}_{02}\), [SEP], \({w}_{10}, {w}_{11}, {w}_{11{\prime}}, {w}_{12}\), [SEP]}.

For adapted-position embedding, as described in the seeing position layer above, the sentence tree incorporating the knowledge graph has two kinds of position indices, actual and adapted. Here we choose the adapted position index, because an embedded knowledge entity is related only to its associated words, so its position should follow those words. In the input example, for patent A, \({w}_{00{\prime}}\) is the knowledge embedding word of \({w}_{00}\), while \({w}_{01}\) is the next word after \({w}_{00}\) in the original sentence order, so the adapted position indices of \({w}_{00{\prime}}\) and \({w}_{01}\) are the same.

Segment embedding is used to distinguish the sentences when the input contains multiple sentences. When two input sentences are used for the semantic matching task, the segment embedding of the above input example is {1, 1, …, 1, 2, 2, …, 2}.
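
The three index sequences can be derived together, as in the sketch below. It reuses the tree format from the earlier sketches and assumes single-token entities; in this layout the next trunk word reuses the first branch position, so \({w}_{00{\prime}}\) and \({w}_{01}\) in the running example share an index.

```python
def pair_indices(tree_a, tree_b):
    """Token, adapted-position, and segment indices for a patent pair,
    following the layout of Fig. 6 (illustrative, not the released code)."""
    tokens, positions, segments = ["[CLS]"], [0], [1]

    def append(tree, seg, start):
        pos = start
        for word, branches in tree:
            tokens.append(word); positions.append(pos); segments.append(seg)
            branch_pos = pos
            for relation, tail in branches:
                for tok in (relation, tail):
                    branch_pos += 1
                    tokens.append(tok); positions.append(branch_pos); segments.append(seg)
            pos += 1  # the trunk advances as if no branch had been inserted
        tokens.append("[SEP]"); positions.append(pos); segments.append(seg)
        return pos + 1

    nxt = append(tree_a, 1, 1)
    append(tree_b, 2, nxt)
    return tokens, positions, segments
```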

Encoder layer

The encoder layer combines the Bert model, with its strong encoding performance, with general neural network models such as convolutional neural networks (CNN) and gated recurrent units (GRU) to obtain the corresponding Bert, BertCNN, and BertGRU encoders. In this section, we use Universal Encoder Representations (UER) (Zhao et al., 2019) to implement the splicing between different neural networks.

The Bert encoder mainly consists of a multi-headed self-attention mechanism, a fully connected feedforward layer, and residual connections with normalization. First, we transform the patent text incorporating the knowledge graph into a vector matrix through word embedding. Then, in the self-attention layer, to express deeper semantics, different linear transformations of the text vector matrix X are performed to obtain the corresponding Q, K, and V, as shown in Eqs. (1)–(3).

$$Q=Linear\left(X\right)=X*{W}^{Q}$$
(1)
$$K=Linear\left(X\right)=X*{W}^{K}$$
(2)
$$V=Linear\left(X\right)=X*{W}^{V}$$
(3)

Further, to learn different aspects of the text features, PK-Bert combines the original multi-headed attention output with the influence of the seeing position layer, giving the final attention formula in Eq. (4) below. Q, K, and V are the matrices obtained from the linear transformations above, M is the mask derived from the seeing position matrix (visible pairs contribute 0 and invisible pairs negative infinity, so their attention weights vanish after the softmax), and \({d}_{k}\) is the scaling factor used to offset the effect of the dot product.

$$Attention=softmax\left(\frac{Q{K}^{T} + M}{\sqrt{{d}_{k}}}\right)*V$$
(4)
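
A single-head NumPy sketch of Eqs. (1)–(4) follows; the conversion of the 0–1 seeing matrix into a 0/negative-infinity mask is an assumption consistent with the description above, and the actual model applies this inside a multi-headed Transformer.

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv, visible):
    """Single-head self-attention with the seeing-position mask, Eq. (4).

    X is the (n, d) embedded text matrix; visible is the 0-1 seeing
    matrix. Invisible pairs receive -inf so that softmax assigns them
    zero weight; the diagonal is always visible, keeping rows finite.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # Eqs. (1)-(3)
    d_k = Q.shape[-1]
    M = np.where(visible == 1, 0.0, -np.inf)        # mask from the seeing layer
    scores = (Q @ K.T + M) / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```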

Since the format of the vector after the Bert encoder is "[CLS]{Patent A text vector}[SEP]{Patent B text vector}[SEP]", we select the vector corresponding to [CLS] in the last layer as the output vector of the patent text pair.

As for BertCNN, after the Bert encoder, a CNN encoder performs convolution and pooling on the \(n\times k\)-dimensional text vector matrix. Convolution kernels of size \({h}_{i}\times k\), where h1, h2, and h3 are the heights of three different kernels, k is the kernel width (consistent with the word embedding dimensionality), and m is the number of kernels of each size, are employed to extract semantic features from various aspects of the text. The convolution slides a window over the matrix, and the ReLU activation function is applied to each window to obtain feature vectors. The resulting feature maps are then max-pooled, selecting the maximum value of each feature to form a pooled output; this reduces model complexity and converts the convolutional output into a fixed-length input for the classification layer. Finally, the max-pooled vectors are vertically concatenated to create the final splicing vector, denoted as \({c}_{max}\), the output vector characterizing the patent text pair.
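
This convolution-and-pooling head can be summarized in a short PyTorch sketch; the kernel heights and channel count below are illustrative placeholders, not the paper's tuned settings.

```python
import torch
import torch.nn as nn

class CNNHead(nn.Module):
    """Sketch of the BertCNN head over the (n, k) Bert output matrix:
    three kernel heights, ReLU, max-over-time pooling, and vertical
    concatenation into the splicing vector c_max."""

    def __init__(self, k=768, heights=(2, 3, 4), m=64):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv2d(1, m, (h, k)) for h in heights)

    def forward(self, bert_out):                   # (batch, n, k)
        x = bert_out.unsqueeze(1)                  # add a channel dimension
        pooled = [torch.relu(conv(x)).squeeze(3).max(dim=2).values
                  for conv in self.convs]          # each (batch, m)
        return torch.cat(pooled, dim=1)            # c_max: (batch, 3 * m)
```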

Like the BertCNN encoder, the BertGRU encoder obtains the \(n\times k\)-dimensional vector matrix \(V\) for the patent text pair through the Bert encoder. We use a bidirectional GRU for context encoding to optimize computational efficiency while maintaining effectiveness. At each time step t, the update and reset gates filter and update the hidden state based on the input \({X}_{t}\); the candidate state is obtained by applying the tanh function to the filtered previous hidden state \({h}_{t-1}\), and the final hidden state \({h}_{t}{\prime}\) at time t is obtained through weighted summation. The bidirectional GRU output \({h}_{t}\) is the vertical concatenation of the forward \(\overrightarrow{{h}_{t}}\) and backward \(\overleftarrow{{h}_{t}}\) encoding outputs. Given that the last position of the bidirectional GRU encoding encapsulates the semantic information of the sequence, it is selected as the output vector characterizing the patent text pair.
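
A matching sketch of the BertGRU head is given below; the hidden size is an assumption, and taking the last time step follows the description above.

```python
import torch.nn as nn

class GRUHead(nn.Module):
    """Sketch of the BertGRU head: a bidirectional GRU over the Bert
    token vectors, taking the last position's concatenated forward and
    backward states as the pair representation."""

    def __init__(self, k=768, hidden=256):
        super().__init__()
        self.gru = nn.GRU(k, hidden, batch_first=True, bidirectional=True)

    def forward(self, bert_out):                   # (batch, n, k)
        out, _ = self.gru(bert_out)                # (batch, n, 2 * hidden)
        return out[:, -1, :]                       # last time step, both directions
```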

Prediction layer

After encoding, the vectors are stitched by a fully connected layer to obtain the final splicing vector. We then output the probability of each of the two classes with a Softmax classifier and take the class with the maximum probability as the predicted class.

The fully connected layer maps the features learned in the encoder layer to a one-dimensional equal-length vector space to facilitate the subsequent classification. The output vector of the fully connected layer is then normalized over the two classes by the Softmax layer, as shown in Eqs. (5) and (6).

$${z}_{i}={w}_{i}*x+{b}_{i}$$
(5)
$${y}_{i}= \frac{{e}^{{z}_{i}}}{\sum_{j=1}^{2}{e}^{{z}_{j}}}$$
(6)

Experiment

This section presents experiments that evaluate PK-Bert against competing models. Specifically, we aim to answer the following evaluation questions:

  • Question 1: Does fusing knowledge graphs improve the performance of patent examiner citation recommendations?

  • Question 2: Which part of the patent knowledge graph is more helpful for examiner citation recommendation?

  • Question 3: Can models incorporating knowledge graphs achieve better results using earlier training data on the latest test set?

Data collection and knowledge graph construction

Data collection

We obtained Chinese patent data in two fields from Google Patents. Each patent carries classification numbers that are examined for accuracy. When searching for a patent, one can use either keywords or classification numbers, so we collected data both ways to construct datasets for model validation. This approach covers the common search cases and yields data with distinct characteristics. Furthermore, considering the input length constraints of the Bert model, we opted for the shorter patent abstracts as input text.

In our dataset, each patent is linked to relevant patents through citations, being cited by others, and thematic similarity. Data collection begins with keyword searches on the Google Patents website to compile comprehensive patent lists. We then systematically access the details page for each patent, where Google Patents provides three lists: citations, cited by, and similar patents. The similarity list is generated by the Google Patents system, and in both the citation and cited-by lists, examiner-provided citations are distinguished. Through meticulous data cleaning, we verified that there is no overlap between the citation and similar lists. We curated patents meeting our criteria from these lists, forming pairs as follows: Patent A-Patent B-Label 0 (similar but uncited) and Patent A-Patent C-Label 1 (examiner citation).
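
The pairing logic can be sketched as follows; the field names stand in for the scraped detail-page lists and are hypothetical, not the released crawler's schema.

```python
def build_pairs(patent):
    """Label patent pairs from one Google Patents detail page.

    `patent` is assumed to carry an id, an examiner-citation list, and a
    similar-patent list (illustrative field names). The overlap guard is
    redundant after cleaning but keeps the sketch safe.
    """
    examiner_cited = set(patent["examiner_citations"])
    pairs = [(patent["id"], p, 0)                  # similar but uncited
             for p in patent["similar"] if p not in examiner_cited]
    pairs += [(patent["id"], p, 1)                 # examiner citation
              for p in examiner_cited]
    return pairs
```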

The first dataset comprises 10,030 patent abstract texts acquired using the keyword 'natural language processing', while the second consists of 17,586 patent abstract texts obtained by filtering patents whose IPC number is set to 'H04'. The final experimental datasets of this paper are the Google Patent Pair of Natural Language Processing (GPair-NLP) and the Google Patent Pair of the H04 classification (GPair-H04).

We adopt a quantitative screening approach to guarantee both balance and completeness in the dataset. During the quantitative screening phase, illustrated with GPair-NLP, we initially amassed around 10,000 patents, yielding approximately 50,000 pairs of patent relationships. After further statistical analysis and data cleaning, we excluded patents with only a single relationship type. Post-cleaning, we noted a slight surplus of similar patent pairs over examiner-cited pairs. Consequently, in the second round of data collection, we concentrated solely on patents whose number of examiner-cited pairs exceeded the number of similar pairs, continuing until the two relationship types were balanced.

The data are split into training, validation, and test sets at a ratio of 8:1:1. The basic statistics are shown in Table 1, and the distribution of patent publication dates within the datasets is illustrated in Fig. 7.

Table 1 Experimental dataset situation
Fig. 7 Distribution of patent publication time in the dataset

Patent knowledge graph construction

In the knowledge graph embedding part, we choose CnDBpedia (Xu et al., 2017) as the knowledge graph for generic domains. CnDBpedia is a knowledge graph based on the DBpedia ontology and covers a wide range of domains, including geography, history, culture, and science. It contains over 4.5 million entities and 5.1 million resource description framework (RDF) triples, making it one of the largest knowledge graphs in the world. CnDBpedia is constructed by extracting structured information from Chinese Wikipedia, and it is continuously updated to reflect changes in the underlying Wikipedia pages.

To compare the advantages and disadvantages of generic and domain knowledge graphs for the semantic representation of patents, we also construct a comprehensive knowledge graph centered on patent information. Employing the patent number as a unique identifier, we gathered five essential attributes (patent title, IPC classification, inventors, assignee, and publication date) to form the foundation of our knowledge graph. During the knowledge graph embedding phase, to facilitate seamless embedding, we prepend each original patent text with its corresponding patent number. The constructed Patent-info knowledge graph adheres to a (Patent Number, Attribute, Attribute Value) format, ensuring that, with the patent number as an index, the corresponding patent information can be effectively embedded. The number of entities of each type is shown in Table 2.
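
A sketch of triple construction under this format is given below; the record field names are illustrative assumptions about the scraped data, not a documented schema.

```python
def patent_triples(record):
    """Emit (Patent Number, Attribute, Attribute Value) triples for the
    Patent-info knowledge graph. Prepending the patent number to each
    abstract lets the knowledge-query step retrieve these triples by an
    exact match on the number.
    """
    pid = record["publication_number"]
    attributes = ["title", "ipc", "inventors", "assignee", "publication_date"]
    return [(pid, attr, record[attr]) for attr in attributes if record.get(attr)]
```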

Table 2 The number of entities of each type

Experimental model settings

Experimental model introduction

Our experiments include the following models.

  (i) Text CNN/GRU (Lu et al., 2020): A Siamese network structure first encodes each of the two patents with the same CNN/GRU, then outputs an equal-length structural vector after dot-product and element-wise subtraction operations. Finally, the resulting text representation vector is input to Softmax to predict the result of the patent pair.

  (ii) Bert/BertCNN/BertGRU: The Bert model based on Transformer encoding is applied to the patent text semantic matching task. Special flags such as [CLS] and [SEP] in the Bert model are used to process the patent pair text; the encoding vector of the patent pair text is obtained and finally input to the Softmax classifier to predict the result of the patent pair.

  (iii) K-Bert/K-BertCNN/K-BertGRU: Knowledge embedding employs the Bert model with CnDBpedia.

  (iv) PK-Bert/PK-BertCNN/PK-BertGRU: Knowledge embedding employs the Bert model with Patent-info.

Experimental setup and model training

A neural network model usually contains many parameters. The model's internal parameters, such as weights, are updated automatically during training, whereas the initialization parameters must be set manually beforehand. Based on pre-experiments and prior experience, the main parameters, their descriptions, and the settings used in our experiments are listed in Appendix A.

Patent examiner citation recommendation

Model evaluation (question 1)

Since the number of samples per category in our patent datasets differs, we use the macro-averaged precision, recall, and F1 value as evaluation metrics to offset the influence of unequal category shares. The experimental results of all models on the two datasets are shown in Table 3, with the best results marked in bold.
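
For reference, macro-averaged metrics can be computed as in the following sketch using scikit-learn; this is a standard recipe, not necessarily the authors' evaluation script.

```python
from sklearn.metrics import precision_recall_fscore_support

def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1: per-class scores are
    computed first and then averaged with equal weight per class, so
    the larger label cannot dominate the evaluation.
    """
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return p, r, f1
```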

Table 3 Experimental results on the GPair-NLP and GPair-H04

The K-Bert series models consistently outperform the baseline Bert model across all evaluated metrics (precision, recall, and F1 scores). This indicates that integrating knowledge graphs within the Bert series enhances examiner citation recommendation task performance.

Notably, the Patent-info knowledge graph contains a modest entity count of 103,793, significantly fewer than the 4,597,165 entities present in CnDBpedia; the former’s quantity is merely 2% of the latter. However, models incorporating the Patent-info knowledge graph demonstrate superior performance to those using CnDBpedia. This trend emphasizes the importance of leveraging patent-specific information for practical examiner citation recommendations within the patent domain.

Patent data often include sequences of technical terms and concepts that benefit from the sequential processing capabilities of the GRU architecture. GRU excels at capturing the nuanced relationships and dependencies between words and phrases within the patent text, adequately representing the sequential nature inherent in patent information. Conversely, CNN is better suited to capturing hierarchical features and patterns but may be less effective at handling the sequential dependencies prevalent in patent texts.

In the context of examiner citation recommendation, where the chronological order of information is crucial, the GRU architecture is advantageous. It excels in understanding the context of prior citations and the sequence of information in patents, making it well-suited for this task.

To further analyze the experimental results, we present the per-category results for the two datasets in Tables 4 and 5, with the best results marked in bold.

Table 4 Experimental results by category on the GPair-NLP
Table 5 Experimental results by category on the GPair-H04

Models exhibit consistently higher recall and F1 scores for Label 1 across both datasets. This trend can be attributed to the nature of the task: Label 1 represents the positive instances, i.e., examiner citations. The emphasis on correctly identifying and recommending relevant citations aligns with the task's objective, leading to higher recall and F1 scores for Label 1. The model's capacity to effectively capture and recommend relevant citations contributes to this superior performance.

Despite the higher scores for Label 1, the evaluation metrics for Label 0 do not exhibit imbalance, ensuring a reliable overall assessment of the model's performance. The absence of significant disparities in precision, recall, and F1 scores between the two labels indicates that the model maintains a balanced approach in handling positive and negative instances. This balance is crucial for ensuring that the model's recommendations are not skewed towards a particular label, thus enhancing the reliability of the overall evaluation results.

Ablation study (question 2)

To gain deeper insights into the factors that exert the most significant influence on the task of examiner citation recommendation within the patent knowledge graph, we conducted ablation experiments using the K-Bert model as our foundational framework. The experimental results are presented in Table 6, offering a comprehensive view of the effects of various factors on the performance of the citation recommendation task.

Table 6 Experimental results on the GPair-NLP and GPair-H04

Removing the publication date and title improves model performance, indicating that these elements might not be crucial and could even introduce noise. In contrast, excluding inventor details has a substantial impact, underlining their pivotal role in the model's recommendation capabilities. Assignees also have some impact on the experimental results, and patent classification, although less influential, still plays a role.

Patents from the same inventor tend to concentrate on specific technological features, rendering them highly likely to exhibit technical similarities. For patent examiners, this signifies that patents from a joint inventor are particularly relevant due to their heightened probability of sharing similar technologies. Moreover, it is common for inventors to utilize technologies akin to their previously published patents, strategically refraining from citing their work to enhance the likelihood of patent authorization. Additionally, assignees, often corporate entities, tend to focus on acquiring patents within a specific domain. Consequently, patents held by the same assignee are more likely to cover a wide range of product technologies. Lastly, patents classified under the same IPC code generally share similar technological aspects, increasing the likelihood of citation by examiners.

Assessing models on latest patents (question 3)

This part of our study investigates whether models equipped with knowledge graphs perform better when trained on earlier data and tested on the latest artificial intelligence patents. We randomly collected 1,869 patents published after November 2023 and captured their examiner citations and similar-patent lists to form the test set. The test set details are shown in Table 7, and the distribution of publication times in the dataset is shown in Fig. 8.

Table 7 Latest test dataset details
Fig. 8 Distribution of publication time in the latest test set

As can be seen in Fig. 8, the test patents are concentrated in 2023, while the newest patents in the training datasets were published in 2022, with most published between 2014 and 2018. There is no intersection between the training and test sets. The experimental results are shown in Table 8, with the best results marked in bold.

Table 8 Latest patent dataset experimental results

Based on the results of the experiments conducted on the latest patent dataset, we can draw several observations regarding recall values:

  (i) Text-CNN and Text-GRU models show lower recall values (52.40% and 55.10%, respectively). Their convolutional and recurrent structures may not capture the intricate relationships within the patent text as effectively as knowledge-enhanced models.

  (ii) Bert, BertCNN, and BertGRU models demonstrate improved recall values, with BertGRU being slightly lower. Bert-based models leverage contextual information effectively but may face challenges in capturing nuanced patterns within patent texts.

  (iii) PK-Bert models consistently outperform the other models in recall. This indicates that incorporating patent-specific knowledge enhances the models' ability to identify relevant citations, resulting in higher recall, and underscores the importance of utilizing domain-specific knowledge graphs to improve citation recommendations.

Conclusion and future works

Our research utilizes knowledge graphs to augment examiners' citation recommendations. Notably, despite the small number of entities in the Patent-info knowledge graph (only 2% of CnDBpedia's), models using it consistently outperform those using CnDBpedia, highlighting the value of patent-specific information for practical recommendations and emphasizing the importance of domain-specific knowledge graphs. The experiments on the latest patents show that knowledge-enhanced models, especially those incorporating Patent-info, hold a sustained advantage in recall. This affirms the efficacy of domain-specific knowledge graphs and provides valuable insights for patent examiners seeking more granular and context-aware citation recommendations.

However, it is essential to recognize some limitations of our study. First, the depth of our analysis is limited by the absence of a publicly available dataset of Chinese patent examiner citations, which prevents us from gaining a more comprehensive understanding of Chinese patent recommendation results and their specific categories.

Second, our study does not cover patent examiner citations from different countries, nor does it analyze patents belonging to the same family. In addition, while it has shown its value, the Patent-info knowledge graph has limitations: it mainly consists of structured attributes and lacks deeper semantic information.

Lastly, after incorporating the knowledge graph, our recommendation model provides a certain degree of interpretability. However, its outputs remain general, whereas patent examination requires a detailed reasoning process.

Potential enhancements to the knowledge graph involve integrating additional semantic features to further improve the model's comprehension of patent-related information. The use of graph neural networks for embedding is under consideration for the knowledge embedding component (Choi et al., 2022a, 2022b; Choi & Yoon, 2022). Additionally, taking into account the features of citations made by examiners at different national patent offices will help enhance the interpretability of the model.