Keywords

1 Introduction

For any researchers who want to engage in researching and developing new innovative technologies in a scientific domain, the first step is to understand the evolution of core technologies and to find the key point. Therefore, we need to extract core technology among countless documents before we start to study. Patent documents, recording innovation completely and detailly, are regarded as the foundation of technology, products, and processes. According to the World Intellectual Property Organization, 90% of the world’s technologies are registered in patent, so make use of patent information effectively will contribute greatly to master technology and creation. Additionally, analyzing core patents is the most efficient way to use patent information. Patent documents contain references and a large number of relationships of citation constitute a patent citation network, which can be analyzed to find core patents. In general, innovations are based on the existing research, as well as the citation network demonstrates an inheritance relationship according to the idea of technological evolution, thus, every patent can be regarded as a piece of a fragment of knowledge. The main paths in the network are the backbone of technological progress, which include more core patents.

The study of references, as one of the means to trace patterns of technological advancements, has been growingly used among scientific publications. As a result, many methods have been proposed to exact and represent the evolution patterns of ideas, including algorithmic historiography, knowledgeflow, and influence trajectories [1].However, those means are considered too complex to understand. Thus, new ways have been designed to find critical subnetworks as a summary of core patens. Main path analysis (MPA), originally proposed by Hummon and Doreian [2], has been applied in various domains to extract main paths. Compared with historical and descriptive methods used before, it enriches the engineering perspective for technological research and makes a significant contribution to the identification of core patents and extraction of the mainstream [3].

Although applied by many studies of citation networks, MPA is not always a proper method to exploit the main paths of patent citation networks. Through experiments, we have discovered the number of patent references can reach hundreds or even thousands incredibly, which will cause the main paths to concentrate on those patents because MPA includes a greedy algorithm. Once the citation network contains nodes with large references, the result will be not accurate enough through MPA.

PageRank, an algorithm proposed by Google to rank webpages, can evaluate the quality of webpages. It assigns relative importance and authority scores to each webpage through hyperlinks, which can be regarded as the basis for webpage ranking. Absolutely, the algorithm is still applied in most search engines. Inspired by that, we use the idea of PageRank to calculate the importance of the patents, then traverse paths of citation network to extract main paths. The results show obvious improvements in the average cited frequency and coverage ability of core patents founded by our methods.

2 Related Work

2.1 Main Path Analysis

The goal of the MPA is to find main paths. A main path of citation network is a connected subnetwork of important citation arcs connecting a number of origin nodes (typically, historical files) to destination nodes (typically, recent publications) [4]. Citation arcs, if the arcs sit on many paths between origin nodes and destination nodes, are considered as have evolutionary importance to a certain filed, thus it is appended to the main paths of the corresponding citation network. Briefly speaking, main paths approximately demonstrate development trajectories among the major ideas.

First step of the MPA is to weight citation arcs based on some distinctive features. In terms of knowledge transfer, the weight of citation arc (u, v) is measured by the number of search paths passing through (u, v) between a set of sources and destinations. Regularly, Search Path Count (SPC) and Search Path Node Pair (SPNP) are used to weight citation arcs [5].

SPC measures the importance of the arcs in the network by counting the number of times the connection between two adjacent nodes traversed by all paths of the corresponding network. Let \({N}^{-}(m)\) denotes the path from the source point s to the point m, and \({N}^{+}(n)\) denotes path from node n to destination t, then for any path from s to t passes arc (m, n), the value of the arc π is [6]:

$$ {\uppi } = {\upsigma } \cdot ({\text{m}},{\text{n}}) \cdot {\mathrm{\tau}},$$
(1)

where σ denotes all paths from s to m, τ denotes all paths from n to t. The mathematical expression of the weight N (m, n) of the citation arc is shown in Eq. 2:

$$ {\text{N}}\left( {{\rm{m}},{\rm{n}}} \right) = N^{ - } \left( m \right) \times N^{ + } (n) $$
(2)
$$ N^{ - } (m) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {m = s} \hfill \\ {\sum\nolimits_{{n:nRm}} N^{ - } \left( m \right)} \hfill & {others} \hfill \\ \end{array} } \right. $$
(3)
$$ N^{ + } (m) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {m = t} \hfill \\ {\sum\nolimits_{{n:nRm}} N^{ + } \left( v \right)} \hfill & {others} \hfill \\ \end{array} } \right. $$
(4)

where nRm denotes m cites n.

After obtaining the weighting directed graph, for extracting main paths, MPA then use deep first method and greedy algorithm to get paths consisting of arcs with highest traversal weight from sources and destinations.

Since SPC is the most typical algorithm for weighting network, in the comparative experiment we also use SPC to weight arcs.

2.2 PageRank

The reason why a page has a link to another one is because the later has more authoritative and reliable content. The basic idea of PageRank to rank webpages is to calculate number of links to the webpages. [7] They proposed a concept of PR value represent importance of one page. Furthermore, the value of each page is related to the number of external links that link to its own and their PR value. Therefore, the method reduces impact of link number on weight. For example, webpage w1 is linked by a meaningful webpage and webpage w2 is linked by many webpages with poor information, because of the different PR value among webpages link to w1 and w2, the PR value of w1 may large than w2. Formula for PageRank is as follow:

$$\begin{gathered} {\text{PR}}\left( {\text{u}} \right) = {\text{d}}\sum\nolimits_{{v \in B_{u} }} {\frac{{PR(v)}}{{L(v)}}} + \frac{{1 - d}}{n} \hfill \\ \hfill \\ \end{gathered},$$
(5)

where \({B}_{u}\) denotes a collection of all webpages link to webpage u, v denotes a webpage belonging to \({B}_{u}\), L(v) denotes the number of external links of webpage v, d denotes damping factor, n denotes the number of all the webpage in this network.

3 Datasets

In this paper, we use “Etching the Insulating Layers” and “Encapsulations” in chip manufacturing as keywords to draw the core patent technology trajectories [8] for two patent sets respectively.

3.1 Datasets

As a first step, we build patent datasets respectively. The datasets are extracted from USPTO(United States Patent and Trademark Office) [9], which including all the US patents from July 31,1790 to present as well as full searchable text. When searching related patents, USPTO allows patents to be retrieved by International Patent Classification (IPC). The IPC strategy selects patents by consulting a broad range of secondary sources and searching in International Patent Classification Table [10]. Then, we can confirm query statements “ICL/H01L21/311” and “ICL/H01L21/02 AND ICL/H01L21/56” to retrieval related patents. The partial table of IPC is shown in Table 1.

Table 1. Internet patent classification table

Then, we analyze the number of patent references, and remove the patents with unreasonable citations from datasets. When it comes to patents related to “Etching the Insulating Layers”, we obtain 5496 patents with application time span from 1975 to 2020. By exploiting distribution of patent references we remove patents with number of citations more than 300, the citations of those patents are incredible, which can cause unnecessary calculations. Figure 1 shows the distribution of the patent citations, after removing patents with unreasonable references, there are 5448 patents left. In same way, there are 2921 patents build dataset of patents related to “Encapsulations”.

Fig. 1.
figure 1

Distribution of the citations of patents related to “Etching the Insulating Layers”

3.2 Improved Method

In the present article, we use PageRank [10] to improve weighting algorithm for MPA and give patent nodes value directly, instead of using the method of giving value of arcs, which helps to reduce the effects of unreasonable references to main paths.

Supposing there are two citation networks like Fig. 2. Left network shows relationship of single reference, node A references node B (knowledge flow from B to A) and node B references node C. According to SPC, it’s obvious that weight of arcs BA and CB are both 1. Comparingly, right network demonstrates node B cites many other nodes. Although relationship of BA in both networks are same, with the number of node B references increases the value of BA will raise correspondingly according to MPA. The number of paths from source nodes to the destination nodes pass BA are increasing. So, if node B has unreasonable references, the mistake references can amplify the weight of arc BA. Due to the main paths is the path consist of largest citation arcs, the arc BA in later network has greater probability be selected to the main path and node B be regarded as core patent.

Fig. 2.
figure 2

Two cases of patent citation

In fact, a network composed of thousands of nodes is much more complicated than the case of graph with only two or three nodes. Normally, about five thousand patents and their citations can build a network with even millions of paths, while, any operation for nodes will cause huge effect on the network, especially for weight of arcs. Table 2 shows the impact of changes of references in patent node ‘3951843’ on the network.

Table 2. Changes in network when the number of patent citations changes

Therefore, to prevent unreasonable references from distracting the main paths, weighting algorithm should be improved. In order to weighting more reasonable, before we calculate value of patents, we assume each node that not be referenced has same initial value I, and the other nodes has initial value i in the network [11]. When a patent has N citations, it divides its value into N parts, and gives a piece of part to every patent it references. In this way, patent will give lower values to other nodes cited once it has many references, comparingly, patent will transfer all value itself to the node cited if it has just one citation. Then we can calculate the real value of each patent in the network, which we call PR value. The expression of calculating PR value is as follow:

$$\rm{PR(patent) = }\left\{ {\begin{array}{*{20}l} {\sum\nolimits_{{v \in U}} {\frac{{h(v)}}{{L(v)}}} } \hfill & {indegree\;of\;the{{~}}\;patent\;not\;equals\;to{{~}}\;0} \hfill \\ I \hfill & {indegree\;of\;the{{~}}\;patent\;equals\;to\;{\text{~}}0} \hfill \\ \end{array} } \right.,$$
(6)

where PR denotes PR value of current patent, v denotes patent that cites current patent, U denotes a collection of patents, h(v) refers to the initial value of a patent. While, in our method, we no longer assign weight to arc between nodes, but give value to patent nodes directly, which means that, when calculating the weight of the patents in the network, we have to build network according to relationship of reference. When extracting main paths according to direction of knowledge flow. The Algorithm of SPC and improved weighting algorithm is shown in Table 3 and Table 4 respectively.

Table 3. Algorithm of SPC
Table 4. Algorithm of improved method

After getting directed graph, we follow the idea of MPA and traverse all paths to find the paths with largest PR value, that is main paths.

3.3 Evaluation

For better measure importance of the main paths, we propose some indicators. A patent citation network often contains more subnetworks, which reflect different branches in the development of this domain. When the degree values of all nodes of a subnet are greater than or equal to K value, the subnet is called K-core graph. Because the cited frequency of patents is equal to out-degree value of corresponding nodes in citation network, the K value here is considered as out-degree value. The larger the K value is, the larger the average out-degree value of the subnet is, and the more nodes the subnet can affect, so the subnet is more important to the citation network. Consequently, for a collection of nodes, calculating the number of nodes in different K-core graphs of a network can reflect the importance of the collection to it. We can measure importance of main paths that extracted by MPA and method we improved respectively through this way. Since out-degree value can reflect cited situation of node in network, so average out-degree value of main paths also reasonable indicator for measure importance of main paths.

K-core decomposition is regular method help find core subgraph (the largest connected subgraph) in network [12]. In K-core graph, all the edges including nodes whose degree value less than k also been removed from network. 1-core net represents original graph, and (K + 1)-core net is subgraph of K-core graph. We can also observe changes by measuring graph density, average degree, number of connected blocks, average clustering coefficient and other indicators besides number of nodes and edges.

Graph density of a simple graph is defined as the ratio of the number of edges with the maximum possible edges. Since patent citation network has no mutual references, the more nodes the network has, the lower density of the net will have. Because the new node cannot refer to all previous nodes in citation network. For directed simple graphs, the density is:

(7)

where E is the number of edges, V is number of nodes.

Six Degrees of Separation is a conjecture in the field of mathematics, which point out that it spends only 6 steps to connect any two people in the world on average. In other words, if each individual build link to their friends, link relationship will cover all the individuals after 6 times traversal. Based on the Six Degrees of Separation, we can measure the coverage ability of core patents to judge the importance. In patent citation network, if node A refers to node B, then node B has an impact on node A. We can consider that node B connects to node A. Because core patents have more influence than other patents, so they can connect to most patent nodes faster than others. Under the same conditions, we can measure importance of two sets of core patents by comparing the number of other patent nodes the core patents node can connect to in citation network.

4 Experiment Result

First of all, we build patent citation networks according to the related patents of “Etching the Insulating Layers” and “encapsulation” technologies. Among them, the former graph has 40605 nodes and 77426 edges; the later graph has 21344 nodes and 45865 edges. “Etching the Insulating Layers” related patent references network is shown in Fig. 3. Different color nodes denote different out-degree value.

Fig. 3.
figure 3

Citation network of patents related to “Etching the Insulating Layers” (part)

Secondly, we calculate main paths through our method and MPA respectively and we will analyze those patents below.

Figure 4 shows the main paths of patents related to “Etching the Insulating Layers” found by our method and MPA.

Fig. 4.
figure 4

Main path calculated by MPA and our method, the above is extracted by our method, the below is extracted by MPA. The patent on the left is granted before the patent on right

4.1 K-core Graph

We decompose graph through K-core decomposition according to set k = 10, 15, 20, 25, 30, and then analyze connected subgraphs by calculating features mentioned above, preparing for the experiment of counting number of patents existing in k-core graph. Table 5 shows the relevant data of the k-core graph of the patents related to “Etching the Insulating Layers”.

Table 5. Data of k-core graph of the patents related to “Etching the Insulating Layers”

Obviously, the numbers of nodes and edges are decreasing sharply as the K value of the network increases, which shows scale of net is becoming smaller and smaller. Additionally, when K value of the network increases, the density of graph becomes greater, which means that nodes remaining in K-core graph with high average cited frequency-cy, so they have higher reference value than nodes being removed. With the K value increasing, the number of connect blocks is decreasing, which means that the technologies represented by remaining connect blocks have more reference value than others.

In next experiment, we set k = 1, 5, 10, 15, 20, 25, 30 to decompose patent citation networks of two technologies respectively. Then we calculate number of core patents that existing in different K-core graphs of citation network for two technologies, the core patents were extracted by our method and MPA respectively. The result demonstrates our method has superior performance than MPA. The result as is shown in Fig. 5. In the experiment of patents related to “Etching the Insulating Layers”, the number of core patents that remaining in different subgraphs through our method is greater than that through MPA, and all of the subgraphs with high average cited frequency. In the experiment of patents related to “Encapsulation”, our method also performs better than MPA.

Fig. 5.
figure 5

Situations of core patents in different k-core graphs of two citation networks

4.2 Average Cited Frequency and Coverage

In the following experiments, the results show core patents found by our method are worthier.

We calculate the average cited frequency of core patent sets extracted by the two methods. Result of experiment of patents related to “Etching the Insulating Layers” shows that the value of average cited frequency (average out-degree value) of core patents extracted by our method and MPA are 17.66 and 11.61, as well as, the number in experiment of patents related to “Encapsulations” are 30.9 and 20.61 respectively. A patent with high cited frequency shows this patent is worthy of references [12]. So, a set of patents with high value of average cited frequency is more valuable than the set with lower number. The results prove main paths founded by our method show better performance. The detailed data is shown in Table 6.

Table 6. The results of experiments of two patent citations

In order to test the influence of core patent nodes, we calculate the number of patent nodes that can be affected by two sets of core patent nodes under the same situation. When the traversal time equals to 1, we calculate the number of patent nodes that reference the core patents, because these patents are directly affected by the core patents. At the same time, we call the collection of affected nodes 1-cluster. When the traversal time equals to 2, we calculate the number of patents that reference the patents that affected by the core patents directly, which are indirectly affected by the core patents. In this case, we call the collection of affected nodes 2-cluster, and so on [13].

In experiment of patents related to “Etching the Insulating Layers”, the numbers of patents affected by core patents, which extracted by our method, are 319 and 795 respectively. In the experiment of patents related to “Encapsulation”, the numbers are 748 and 894 respectively. All of the results are greater than number in MPA. Figure 6 shows the situation of nodes affected by core patents when traversal time equals to 3, core patents in left figure are extracted by MPA and core patents in right figure are extracted by our method. Affected nodes are marked with red color, unaffected nodes are marked with black color. Obviously, the core patents in the right figure have stronger influence[14].

Fig. 6.
figure 6

Situation of 3-cluster of MPA and our method (Color figure online)

4.3 Discussion Main Paths and Related Domains

In main paths of patents related to “Etching the Insulating Layers”, the result contains 18 patents extracted from 5448 patents, whose time span is from 1975 to 2018. And technologies of core patents include “semiconductor preparation method”, “thin film etching process”, “high-density plasms etching” and other methods. In the result of patents related to “encapsulation”, main paths contain 12 patents extracted from 2921 patents and the time span is from 1970 to 2019. Those patents include technologies of “Solder interconnection structure on organic substrates”, “Semiconductor device and method of forming micro interconnect structures”, “BAG packaging” and so on. After consulting relevant literature, those patents can better descript development of the domain respectively. Table 7 demonstrates part of the information of core patents in “Etching the Insulating Layers”.

Table 7. Information of core patents related to “Etching the Insulating Layers”

5 Conclusion

After experiments, patents extracted by improved method show better performance than MPA in average value of cited frequency and ability of coverage. Based on PageRank, our method reduces the impact of unreasonable references of patents and solves the problems caused by over citations of patents.

Certainly, there are still defect in our method, for example, our method doesn’t take damping factor into consideration. Researchers cannot consult information endlessly during they creation, and the factor can help us approach the real situation. So, there should be another parameter in the algorithm to represent this constraint. In the following research, we plan to conduct more experiments in more fields to find reasonable damping factor, and improve our method further.