
1 Introduction

Patent claims are widely regarded as a valuable source for detecting technological change and gaining technological insight (Campbell 1983; Ernst 1997; WIPO 2004). As an important part of the unstructured segments of a patent document, claims hold explicit information and implicit knowledge revealing technological concepts, topics, and related R&D activities in concise but precise language (Xie and Miyazaki 2013; WIPO 2002). Manually conducting content analysis on massive patent documents is very time-consuming and laborious; meanwhile, one of the fundamental changes to R&D management research in recent years is access to extremely powerful information techniques and vast amounts of digital and textual data (Daim et al. 2011). In particular, for efficient patent analysis, automatic approaches that assist domain experts and decision makers in discovering and understanding large volumes of patent documents have drawn increasing attention and remain in great demand (Abbas et al. 2014).

Much effort has been devoted to revealing latent knowledge from the textual data of patent documents. Watts and Porter (1997) suggested an approach to investigate terminological trends by tracking the historical change of keywords. Yoon and Park (2005) presented a keyword-based morphology study to identify the detailed configurations of promising technologies. Zhang et al. (2014) introduced a term clumping approach based on principal components analysis to explore keywords and main phrases in abstracts of scientific literature. In addition, text analytics has already been applied in technology intelligence applications such as TrendPerceptor (Yoon and Kim 2012), Techpioneer (Yoon 2008), VantagePoint (Zhu and Porter 2002), and Aureka (Trippe 2003) to determine hidden concepts and relationships, where clustering, classification, and mapping techniques are used to support further content analysis of technological documents. However, before most of these applications can be applied, several sets of keywords usually need to be defined in advance, which still derive from the opinions and knowledge of domain experts. Moreover, the outcomes of most traditional text mining techniques are ranked lists of single keywords, yet these words alone are usually too general or misleading to indicate a concept, especially when polysemous words actually describe different themes (Tseng et al. 2007).

To overcome the above-mentioned limitations, this research proposes a topic change identification approach based on a well-known topic modeling technique, latent Dirichlet allocation (LDA). Unsupervised topic modeling is applied to vast amounts of target patent claims, providing a corpus structure with minimal human intervention. No preset classification or keyword list is required, and the results are discovered in a completely unsupervised way. In addition, instead of using single terms, topics are represented by probability distributions over words. The actual semantic meaning of a topic can be delivered in this way, and at the same time, polysemous words that actually depict different concepts can be separated. After revealing topics from patent sub-collections of different years, a topic change model is presented to identify topic changes over time. Finally, to demonstrate the performance of the proposed approach, patents published from 2009 to 2013 in the United States Patent and Trademark Office (USPTO) with Australia as their assignee country are selected for a case study. The experimental result demonstrates that the proposed approach is able to provide machine-identified topic changes automatically without any preset keywords. The outcomes of the approach can then be used to support R&D management.

This paper is organized as follows: the next section reviews related research developments by introducing patent data in tech mining and latent Dirichlet allocation. The Methodology section describes the proposed topic change identification approach step by step. The Case Study section carries out experiments using USPTO patents to demonstrate the proposed approach in a real patent analysis context. Conclusions and future work are addressed in the last section.

2 Literature Review

2.1 Patent Data in Tech Mining

Patent documents are composed of structured information and unstructured descriptions of inventions. Analytical approaches based on the structured data of patents, such as issue date, inventors, assignees, or International Patent Classification, have played the major role in both theoretical and practical research to gain insight into technology development in a certain area (Lai and Wu 2005; Sheikh et al. 2011; Nishijima et al. 2013). However, the unstructured data in patent documents, such as abstracts, claims, and descriptions, usually contain much more abundant information than the structured sections, since they convey significant characteristics, detailed functionalities, or major contributions of technologies. Therefore, over the last decade there has been great interest in applying text mining techniques to conduct tech mining and set domain analysts free from studying and understanding massive amounts of technological content (Tseng et al. 2007; Camus and Brancaleon 2003; Porter 2005).

Among all the unstructured segments of a patent file, patent claims embody all the important technical features of an invention, using the most essential technological terms to define the protection sought (Tong and Frame 1994). On one hand, they reveal the core inventive topics and the major technological scope of a patent; on the other hand, claims are written in concise but precise language, which makes them the best resource for identifying technological topics and facilitating patent document analysis (Xie and Miyazaki 2013; WIPO 2002; Yang and Soo 2012; Novelli 2014).

A patent claim usually consists of three parts: a preamble that serves as an introductory section reciting the primary purpose, function, or properties; a transition phrase, such as comprising, having, including, or consisting of; and a body that contains the elements or steps that together describe the invention (Yang and Soo 2012; USPTO 2012; Sheldon 1995). This research utilizes patent claims as the main source for topic change analysis. Among the patent databases of different countries, the United States Patent and Trademark Office (USPTO) database is the most widely used, because patents submitted in other countries are often simultaneously submitted in the United States (USPTO 2015).
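To make the three-part structure concrete, the sketch below splits a claim string on its transition phrase. This is purely illustrative and not part of the proposed approach; the transition-phrase list and the `split_claim` helper are assumptions.

```python
# Illustrative sketch only: split a claim into preamble / transition / body.
# The transition-phrase list is a small assumed subset, not exhaustive.
import re

TRANSITIONS = re.compile(
    r"\b(comprising|consisting(?:\s+essentially)?\s+of|having|including)\b",
    re.IGNORECASE,
)

def split_claim(claim: str):
    m = TRANSITIONS.search(claim)
    if m is None:
        return claim.strip(), None, None     # no transition phrase found
    return (
        claim[: m.start()].strip(),          # preamble
        m.group(0),                          # transition phrase
        claim[m.end():].strip(" :"),         # body
    )

# split_claim("A printhead assembly comprising: a nozzle; and an ink chamber.")
# -> ("A printhead assembly", "comprising", "a nozzle; and an ink chamber.")
```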

2.2 Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) (Blei et al. 2003) is a probabilistic model that estimates the properties of multinomial observations by unsupervised learning. It estimates the latent semantic topics hidden in large archives of documents and calculates the probabilities with which documents belong to different topics. In practice, LDA has been used as an efficient tool for topic discovery and analysis. For example, Griffiths and Steyvers (2004) applied LDA-based topic modeling to discover the hot topics covered by papers in the Proceedings of the National Academy of Sciences of the United States of America (PNAS); Yang et al. (2013) proposed a topic expertise model (TEM) based on LDA to jointly model topics and expertise for community question answering (CQA) with Stack Overflow data; Kim and Oh (2011) proposed an LDA-based framework to identify important topics and their meaningful structure within news archives on the Web.

The graphical model of LDA is presented in Fig. 11.1, showing three rectangular plates, where \(D\) denotes the overall documents in a corpus, \(K\) indicates the number of topics for \(D\), and \(N_{d}\) stands for the number of terms in the dth document of document collection \(D\). Each node in the figure stands for a random variable in the generative process of LDA, while the plates indicate replication. In the left part of the figure, \(\vec{\vartheta }_{d}\) stands for the topic proportions of the dth document. For document d, the topic assignments are \(Z_{d}\), where \(Z_{d,n}\) indicates the topic assignment of the nth word in the dth document. On the right of the figure, the topics themselves are illustrated by \(\vec{\varphi }_{1:K}\), where each \(\vec{\varphi }_{k}\) is a distribution over the vocabulary. All of the unshaded circles indicate hidden nodes. The shaded circles, on the contrary, are observable nodes, where \(W_{d,n}\) stands for the nth word in document \(d\). Finally, \(\alpha\) and \(\beta\) are two hyperparameters that determine the amount of smoothing applied to the topic distributions for each document and the word distributions for each topic (Blei et al. 2003; Steyvers and Griffiths 2007; Blei 2012; Heinrich 2005).

Fig. 11.1 The graphical model of latent Dirichlet allocation
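As a concrete reading of the figure, the following minimal sketch simulates the generative process with numpy; the corpus sizes K, D, V, and N_d are arbitrary illustrative values, not settings from this research.

```python
# A minimal simulation of the LDA generative process in Fig. 11.1.
# K, D, V, and N_d below are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(0)
K, D, V, N_d = 5, 100, 1000, 50      # topics, documents, vocabulary, words/doc
alpha, beta = 0.5, 0.1               # Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)       # topic-word distributions
docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))      # topic proportions of doc d
    z = rng.choice(K, size=N_d, p=theta_d)          # topic assignment z_{d,n}
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # observed words w_{d,n}
    docs.append(w)
```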

The parameters of LDA need to be estimated by an iterative approach. Among existing approaches, Gibbs sampling is one of the most commonly used. It is an approximate inference algorithm based on Markov chain Monte Carlo (MCMC) and has been widely used to estimate the assignment of words to topics from observed data (Griffiths and Steyvers 2004; Noel and Peterson 2014; Lukins et al. 2010). Because Gibbs sampling is initialized randomly, each execution of LDA produces slightly different topic estimations, even with exactly the same input and parameter settings; on the whole, however, the results of different runs do not change much.
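For reference, the following is a compact sketch of a collapsed Gibbs sampler built from the standard update of Griffiths and Steyvers (2004); it is illustrative rather than the exact implementation used in this research. The random initialization of the topic assignments z is the source of the run-to-run variation just described.

```python
# Sketch of collapsed Gibbs sampling for LDA (standard update of
# Griffiths and Steyvers 2004); not this chapter's exact implementation.
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.5, beta=0.1, iters=2000, seed=None):
    """docs: list of word-id lists. Returns doc-topic and topic-word estimates."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))                       # doc-topic counts
    nkv = np.zeros((K, V))                       # topic-word counts
    nk = np.zeros(K)                             # total words per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random init
    for d, doc in enumerate(docs):               # seed the count tables
        for n, v in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]                      # remove current assignment
                ndk[d, k] -= 1; nkv[k, v] -= 1; nk[k] -= 1
                # full conditional p(z_{d,n} = k | rest), up to a constant
                p = (ndk[d] + alpha) * (nkv[:, v] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum()) # resample
                z[d][n] = k
                ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkv + beta) / (nkv + beta).sum(axis=1, keepdims=True)
    return theta, phi
```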

3 Methodology

This section explains the details of our proposed topic change identification approach. The framework is given first; each detailed step is illustrated subsequently.

3.1 Framework

The overall framework of our proposed topic change identification approach is shown in Fig. 11.2. First, users initiate a search statement to declare their analytic requirements and retrieve a group of target patents from the USPTO database. The patent ID, title, claims, issue time, assignees, United States Patent Classification (USPC), and other information of the target patents are then crawled into a database for further analysis. To identify topic changes over time, the whole patent collection is first divided into several sub-collections labeled with their corresponding issue years. Subsequently, for each sub-collection, the patent claims and titles, which embody the essential technical terms, and the USPC, which provides a general understanding of the domain classification, are extracted separately from the target patent database. The two plates in the figure indicate replication.

Fig. 11.2 The framework of the proposed topic change identification approach

After data segmentation and cleaning, the textual data composed of claims and titles are passed through a series of word exclusion modules to filter out the most common function words, high-frequency words that commonly appear in patent claims, and academic words with vague and general meanings. The prepared text is then passed to the topic modeling module. Meanwhile, the USPC information of the corresponding patents is extracted to assist the final topic determination. As mentioned, the randomness introduced by the initialization of the sampling affects the final result of LDA. To acquire the most reliable topics for the corpus, we utilize the USPC as a measurement to evaluate the results of \(m\) experiments. Patents are clustered by both their USPC and their topic proportions. The final topic modeling result is the trial that produces the clusters most similar to the USPC clustering outcome. Finally, with all the topics estimated from the patent sub-collections of different years, topic changes over time can be identified and presented to users.

3.2 Patent Corpus Text Cleaning

Patent claims are a special kind of textual data containing plenty of technical terms, specific words serving as transition phrases, and numerous academic words that describe invention outcomes. Among all the terms that a claim may contain, only the technical terms provide the most meaningful information reflecting technological topics and innovations. Therefore, for each sub-collection of our patent corpus, as shown in Fig. 11.3, before modeling topics with LDA, in addition to removing all punctuation, numbers, and HTML fragments left by webpage crawling, we utilize three modules to remove general words from the corpus as follows:

Fig. 11.3 Relationships between sub-collections and topics

  • Stop words such as the, that, and these;

  • High-frequency words in patent claims such as claimed, comprising, and invention;

  • General academic words such as research, approach, and data.

The stop words list we applied comes from an information retrieval resources page at Stanford University (David et al. 2004); the phrases commonly used in patent claims are summarized from a Transitional Phrase page on Wikipedia (2014); the general academic words list is provided by the University of Nottingham, from which we select the top 100 most frequent academic words and remove them from the final corpus (Haywood 2003; Zhang et al. 2014).
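A minimal sketch of these exclusion modules follows; the three word sets shown are short illustrative stand-ins for the full Stanford, Wikipedia, and Nottingham lists cited above.

```python
# Sketch of the text cleaning pipeline. The word sets are illustrative
# stand-ins for the Stanford stop list, the Wikipedia transitional
# phrases, and the Nottingham academic word list.
import re

STOP_WORDS = {"the", "that", "these", "a", "of", "and", "to"}
PATENT_WORDS = {"claimed", "comprising", "invention", "wherein", "said"}
ACADEMIC_WORDS = {"research", "approach", "data", "method", "system"}
EXCLUDE = STOP_WORDS | PATENT_WORDS | ACADEMIC_WORDS

def clean_claim(text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML fragments
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # drop punctuation and numbers
    tokens = text.lower().split()
    return [t for t in tokens if t not in EXCLUDE]

# clean_claim("1. A printhead, <b>comprising</b>: a nozzle ...")
# -> ["printhead", "nozzle", ...]
```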

3.3 Topic Modeling

LDA defines a concept by a probability distribution over words instead of a single term, delivering better semantic meaning of the topic and, at the same time, allowing for polysemy. It is therefore well suited to "understanding" the content of large corpora such as emails, news, scientific papers, and our main data source here, patent claims. After removing all commonly used words from the corpus, we utilize LDA to generate several groups of topics for each patent sub-collection of the corpus, each sub-collection labeled by its corresponding issue year. In a sub-collection, the claims and title of each patent constitute one document, so the number of documents equals the number of patents; the USPC and other structured information are stored separately in a single file to assist further topic determination. All the textual documents in the corpus are viewed as mixtures of a number of topics, and each topic is a distribution over the vocabulary. Here, we present the global topics as \(\vec{P}_{1:t} = (\vec{P}_{1} ,\vec{P}_{2} , \ldots ,\vec{P}_{i} , \ldots ,\vec{P}_{t} )\), where \(\vec{P}_{i}\) stands for the topics of the ith sub-collection of the corpus. The relationship between sub-collections and topics is illustrated in Fig. 11.4.

Fig. 11.4 Relationships between sub-collections and topics

Since we know nothing about the word distributions composing the topics or the topic distributions composing the documents, assumptions need to be made first to determine the parameters \(K, \alpha, \beta\) of LDA. According to previous research, the hyperparameters \(\alpha, \beta\) of the Dirichlet distributions in LDA have a smoothing effect on the multinomial parameters; the lower the values of \(\alpha\) and \(\beta\), the more decisive the topic associations will be (Heinrich 2005). This research sets \(\alpha = 0.5\) and \(\beta = 0.1\), values commonly used in LDA applications (Koltcov et al. 2014). As for \(K\), a higher value produces finer-grained topics but increases the processing time significantly. Therefore, during implementation, \(K\) needs to be decided case by case, balancing user requirements against time consumption. Different parameter settings may improve modeling performance, but optimizing these parameters is beyond the scope of this paper.
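One possible realization of these settings is sketched below with the third-party lda package, which implements collapsed Gibbs sampling; the package choice and the input file name are assumptions, since the chapter does not name its implementation.

```python
# Sketch: fitting LDA to one patent sub-collection with the `lda` package
# (collapsed Gibbs sampling). The input file is a hypothetical placeholder.
import numpy as np
import lda

X = np.load("claims_2009_doc_term.npy")     # hypothetical D x V count matrix
model = lda.LDA(n_topics=10, n_iter=2000, alpha=0.5, eta=0.1)
model.fit(X.astype(np.int64))

theta = model.doc_topic_    # D x K document-topic proportions
phi = model.topic_word_     # K x V topic-word distributions
```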

3.4 Final Topics Determination

We then apply Gibbs sampling to infer the needed distributions in LDA. Since the initial values of the variables are determined randomly in Gibbs sampling, the outputs of LDA over multiple experiments on the same corpus differ slightly. To make the final topic modeling estimation as reliable as possible, evaluation criteria are needed to finalize the topics. In this research, we select the USPC as the criterion. As a predefined classification hierarchy built on domain expert judgments, the USPC provides a general understanding of the technical domain of a patent. Patents covering similar topics are usually assigned to the same main USPC, so we use the main USPC to judge which estimation is closer to the actual topic structure.

For a sub-collection of the corpus, multiple LDA experiments produce a number of topic distribution matrices, each indicating the topic distribution proportions of the patent documents in the corresponding trial. As shown in the approach framework, Fig. 11.2, there are \(m\) experiments for every sub-collection; after each run, the patents in the sub-collection are clustered by their calculated topic distributions using a hierarchical clustering approach (Steinbach et al. 2000). Meanwhile, the same group of patents is also clustered by USPC information. The closer the two clustering results are, the more reliable the topic modeling result is.
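A minimal sketch of this clustering step using scipy is given below; the Ward linkage and the number of clusters are illustrative assumptions, as Steinbach et al. (2000) compare several clustering variants without fixing one here.

```python
# Sketch: hierarchical clustering of patents by their topic proportions.
# Ward linkage and n_clusters=10 are illustrative assumptions.
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_by_topics(theta, n_clusters=10):
    """theta: D x K doc-topic matrix; returns a cluster label per patent."""
    Z = linkage(theta, method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```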

Specifically, the values of the Jaccard, Fowlkes & Mallows, and F1 indices over the \(m\) experiments are used to measure the similarity of the two clustering results, one by topics and the other by USPC. The three indices are defined as follows (Halkidi et al. 2001):

$$J = a/(a + b + c),$$
(11.1)
$${\text{FM}} = a/\sqrt {m_{1} \cdot m_{2} } ,$$
(11.2)
$$F_{\beta } = \frac{{\left( {\beta^{2} + 1} \right) \cdot r_{1} \cdot r_{2} }}{{\beta^{2} \cdot r_{1} + r_{2} }},$$
(11.3)

where \(J\) stands for the Jaccard coefficient, FM for the Fowlkes & Mallows index, and \(F_{\beta }\) for the F-measure (F1 when \(\beta = 1\)). In addition, \(m_{1} = a + b\), \(m_{2} = a + c\), \(r_{1} = a/\left( {a + b} \right)\), and \(r_{2} = a/\left( {a + c} \right)\), where \(a\) represents the number of patent pairs that belong to the same cluster of topics and to the same USPC in our case, \(b\) is the number of pairs that belong to the same cluster of topics but to different USPCs, and \(c\) is the number of pairs that belong to different clusters of topics but to the same USPC. The topic modeling result that provides the highest index values is the optimal one.
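The pair counts and indices above can be computed directly from the two label vectors, as in this sketch; labels_topic and labels_uspc are assumed arrays giving each patent's topic-based cluster and main USPC.

```python
# Sketch: pair-counting comparison of two clusterings of the same patents,
# yielding the Jaccard, Fowlkes & Mallows, and F-measure of (11.1)-(11.3).
from itertools import combinations

def clustering_indices(labels_topic, labels_uspc, beta=1.0):
    a = b = c = 0
    for i, j in combinations(range(len(labels_topic)), 2):
        same_t = labels_topic[i] == labels_topic[j]
        same_u = labels_uspc[i] == labels_uspc[j]
        if same_t and same_u:
            a += 1          # same topic cluster, same USPC
        elif same_t:
            b += 1          # same topic cluster, different USPC
        elif same_u:
            c += 1          # different topic cluster, same USPC
    r1, r2 = a / (a + b), a / (a + c)
    J = a / (a + b + c)                                   # Eq. (11.1)
    FM = a / ((a + b) * (a + c)) ** 0.5                   # Eq. (11.2)
    F = (beta**2 + 1) * r1 * r2 / (beta**2 * r1 + r2)     # Eq. (11.3)
    return J, FM, F
```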

3.5 Topic Change Identification

After locating the final topics and words underlying the sub-collections of our corpus, we are able to identify topic changes over time. As shown in Fig. 11.5, we compare two groups of topics derived from different corpus sub-collections, calculating the similarity of words between each topic in \(\vec{P}_{i}\) and all the topics in \(\vec{P}_{i - 1}\) in a traversal way. If two topics from different sub-collections contain approximately the same group of words, we regard them as one topic evolving from year to year. However, if the majority of words comprising the two topics are very different, we regard them as two different topics. Finally, for the document sub-collection of year \(i\), if no similar topic can be matched in the previous year, year \(i - 1\), the unmatched topic in the later year can be seen as newly important, meaning it became more prominent in year \(i\). A sketch of this traversal comparison is given after Fig. 11.5.

Fig. 11.5 Topic change identification model
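The sketch below illustrates the traversal comparison; topics are represented as word-to-probability dictionaries, and the 0.5 overlap threshold over the top 10 words is an illustrative assumption rather than a value given in this research.

```python
# Sketch: match each topic of year i against all topics of year i-1 by
# overlap of their top-N words; unmatched topics are flagged as new.
# The threshold of 0.5 is an illustrative assumption.

def top_words(topic, n=10):
    """topic: dict mapping word -> probability; return its n top words."""
    return set(sorted(topic, key=topic.get, reverse=True)[:n])

def new_topics(topics_prev, topics_curr, n=10, threshold=0.5):
    unmatched = []
    for t in topics_curr:
        overlaps = [
            len(top_words(t, n) & top_words(p, n)) / n for p in topics_prev
        ]
        if max(overlaps) < threshold:   # no similar topic in year i-1
            unmatched.append(t)         # -> newly important in year i
    return unmatched
```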

3.6 Topic-Based Trend Estimation

If a topic is already identified as evolving from year to year, then besides discovering how its detailed content changes with the above model, we can also use the topic distribution matrix to generate a historical topic-based trend and forecast the future trend. As an important part of the LDA outcomes, the topic distribution matrix \(\vec{\vartheta }\) estimates how all the topics are distributed over the document collection. Each row of the matrix sums to 1, while the column sums differ: the larger the sum of a column, the more important the corresponding topic is. Since patents are issued along a timeline, adding up the elements of a column associated with patents published in the same time interval (month or year) yields the weight of the topic in that time frame. We thus obtain a temporal-weight matrix that reveals the importance of selected topics in different months or years.
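The construction can be written compactly as follows; theta is the document-topic matrix and issue_periods is an assumed array mapping each patent to its time interval.

```python
# Sketch: building the temporal-weight matrix from the doc-topic matrix.
import numpy as np

def temporal_weights(theta, issue_periods):
    """theta: D x K doc-topic matrix; issue_periods: length-D array giving
    each patent's time interval (e.g. months since January 2009).
    Returns a T x K matrix of topic weights per interval."""
    periods = np.unique(issue_periods)
    return np.array([theta[issue_periods == p].sum(axis=0) for p in periods])
```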

Once the temporal-weight matrix is obtained, we fit the weight changes in a least-squares sense to estimate the general trend of the target topics. The temporal-weight values of each topic are fitted to a univariate quadratic polynomial, \(y = ax^{2} + bx + c\), where \(y\) stands for the topic weight and \(x\) represents time. We utilize the coefficients \(a\) and \(b\) to measure the developing trends of topics, since \(a\) controls the speed of increase (or decrease) of the quadratic function and \(-b/2a\) locates the axis of symmetry. For instance, if the coefficient \(a\) is positive and the axis of symmetry lies to the left of the y-axis, we consider the corresponding topic to have a growing trend, where the greater \(a\) is, the faster the growth.
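A minimal sketch of this fit and trend rule, using numpy.polyfit:

```python
# Sketch: least-squares quadratic fit of one topic's temporal weights,
# with the growth rule above (a > 0 and axis of symmetry left of x = 0).
import numpy as np

def topic_trend(weights):
    x = np.arange(len(weights))              # time index, e.g. month number
    a, b, c = np.polyfit(x, weights, deg=2)  # y = a x^2 + b x + c
    axis = -b / (2 * a)                      # axis of symmetry
    growing = (a > 0) and (axis < 0)         # upward trend over the window
    return a, b, axis, growing
```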

4 Case Study

4.1 Data Collection

To demonstrate the performance of our proposed approach, patents published from 2009 to 2013 in the USPTO (http://www.uspto.gov/) with Australia as their assignee country are selected for a case study. There are 7071 target patents covering 343 different main USPCs. Their patent IDs, titles, issue times, inventors, assignees, United States Patent Classification (USPC), International Patent Classification (IPC), and, most importantly, their claims are crawled from the USPTO and placed in a database for further processing. The claims and title of each patent constitute one document in our corpus, which totals 7071 documents. The whole document collection is then divided into five sub-collections to present the technological features and essential terms of inventions by Australian assignees in the past five years. The number of documents published each year from 2009 to 2013, together with the term number and USPC number of each corresponding sub-collection, is shown in Table 11.1. Although the number of documents has declined since 2011, the term number kept rising, which implies that the average complexity of patent claim descriptions has increased over the recent three years. We also observe that the number of USPCs in 2010 grew visibly, suggesting that a group of new topics may have appeared in 2010 compared with 2009.

Table 11.1 The number of documents, terms, and USPC of patents published each year

4.2 Topic Set Determination

Before topic modeling, as mentioned, a number of parameters need to be set, including the number of topics \(K\) and the \(\alpha, \beta\) of the Dirichlet distributions. In the case study, we applied \(K = 10\) with model hyperparameters \(\alpha = 0.5, \beta = 0.1\) to our target documents, to balance topical granularity, ease of understanding, and processing speed. Ten topics describe the essential technological content and features for each year, and every topic is presented by the 10 words given the highest probability by that topic.

The Fowlkes & Mallows (FM), Jaccard (J), and F1 indices are calculated after clustering the patents by both topic assignment and main USPC information. For each year, five runs (\(m = 5\)) were performed, each with 2000 iterations of Gibbs sampling. The detailed index values of the five experiments are listed in Table 11.2, where we can observe directly that the 3rd experiment (E3) for the 2009 sub-collection, the 5th experiment (E5) for 2010, the 4th experiment (E4) for 2011, the 2nd experiment (E2) for 2012, and the 3rd experiment (E3) for 2013 have the largest values of all three indices among all trials. We believe these models fit the observations best, and the topics and parameters provided by these five trials constitute our final topic modeling result.

Table 11.2 Indexes information for the final chosen experiment result

Since no preset classification or domain knowledge assistance is needed, the topic modeling results are discovered in an unsupervised way. In the past five years, patents owned by Australian assignees cover several important technological topics, such as print head and nozzle, alkyl compound, pressure apparatus, and antibody sequence. The more topic words taken into consideration to describe a topic, the clearer and more specific the topical semantic meaning will be. The topics for each year are presented as follows; the order of the topics is random, and the numbers following the words are the probability values of the corresponding topic words. Details of all the topics, the top 10 ranked words and their corresponding probabilities, are shown in Table 11.3 in the Appendix.

Table 11.3 The top 10 ranked words of all the topics from years 2009 to 2013 and their corresponding probabilities
  • The topics of year 2009 include printhead (0.0418) cartridge (0.0353), image (0.0217) device (0.0244), ink (0.0442) nozzle (0.0334), composition (0.0095) material (0.0065), portion (0.0246) assembly (0.0132), roller (0.0142) device (0.0122), alkyl (0.0109) compound (0.0183) formula (0.0111), computer (0.0079) gaming (0.0088), signal (0.0278) sensor (0.0108), and antibody (0.0379) sequence (0.0220).

  • The topics of year 2010 contain portion (0.0217) assembly (0.0090), light (0.0131)/optical (0.0104) device (0.0104), ink (0.0518) printhead (0.0476), layer (0.0101) material (0.0144), computer (0.0191) memory (0.0253) plurality (0.0161), coded (0.0252) device (0.0269), antibody (0.0117) sequence (0.0172), pressure (0.0164) apparatus (0.0370), alkyl (0.0096) compound (0.0184), and electrode (0.0146) system (0.0175).

  • The topics of year 2011 include layer (0.0166) material (0.0188), portion (0.0260) assembly (0.0202), ink (0.0579) printhead (0.0457), acid (0.0201) sequence (0.0234), alkyl (0.0142) compound (0.0159), pressure (0.0161) apparatus (0.0226), light (0.0133) device (0.0114), image (0.0170) print (0.0449), coded (0.0211) device (0.0207), and plurality (0.0084) apparatus (0.0096).

  • The topics of year 2012 cover configured (0.0165) signal (0.0325), fluid (0.0209) chamber (0.0145), portion (0.0240) assembly (0.0213), gaming (0.0513) system (0.0205), light (0.0145) lens (0.0067), signal (0.0104) sensor (0.0093), layer (0.0119) material (0.0196), portion (0.0164) apparatus (0.0101), computer (0.0202) memory (0.0150), and acid (0.0151) sequence (0.0162).

  • The topics of year 2013 comprise portion (0.0200) assembly (0.0122), gaming (0.0451) controller (0.0226), configured (0.0181) signal (0.0206), cushion (0.0345) mask (0.0287), acid (0.0167) sequence (0.0158), wireless (0.0132) signal (0.0092) sensor (0.0109), layer (0.0120) material (0.0135), optical (0.0095) lens (0.0098), message (0.0103) system (0.0272), and alkyl (0.0132) compound (0.0160).

4.3 Topic Change Identification

After discovering the main topics underlying the patent claims of each year's document collection, we use the topic change model to identify the topic variation from 2009 to 2013. For the groups of topics associated with two consecutive years, we conduct a traversal comparison between the topics of the later year and those of the previous year. Topics that contain very similar words are considered the same topic undergoing innovation, while topics that cannot be matched to any existing ones count as new topics. Figure 11.6 illustrates the important topics that arose each year after 2009, presenting the top 10 words for each topic using Pajek (Batagelj and Mrvar 2004).

Fig. 11.6 Topics that became newly important in each year of 2010–2013 and the topmost frequent words of each topic

In 2010, four new topics appeared compared with 2009: layer material, relating to metal and polymer composition; electrode device; computer memory; and alkyl compound. In 2011, one newly important topic appeared: pressure apparatus. In 2012, two new topics were introduced compared with the previous year: light lens and gaming system/controller. Finally, in 2013, a computer system topic related to vehicles and messages appeared as a new theme. All the topics above were identified without the assistance of preset domain knowledge. The detailed words of these new topics and their corresponding probabilities are highlighted in boldface in Table 11.3 of the Appendix.

4.4 Topic-Based Trend Estimation

As mentioned, the proposed approach can also discover how the detailed content of a certain topic evolves from year to year and forecast the topic-based trend from historical status. In the case study, the topic antibody fragment/sequence is chosen as an example. As shown in Fig. 11.7, the word distribution composing the topic develops over time. In 2009, human and peptide were in the top words list, but afterward the emphasis of the topic moved to plant, amino acid, nucleic acid, and polypeptide. The word "acid," rather than "antibody," ranked higher from 2010 to 2013, meaning it had a larger probability of belonging to this topic as time went on. This variation in the content of the topic may suggest that, in this area, the key point of technological research and development has shifted to amino/nucleic acid sequences.

Fig. 11.7 An example of the topic "antibody" evolving over time

To estimate the topic-based trend of this topic, we generate its temporal-weight matrix with one month as the time interval. Each element in the matrix presents the weight of the topic in the corresponding time frame, from January 2009 to December 2013. We fit the weight changes in a least-squares sense to estimate the general trend of the target topic. Figure 11.8 shows the final result of the topic-based trend estimation for the theme "antibody." We can observe directly that this topic has an upward trend. The significance of this topic kept growing continuously, from which we learn that research and patenting on the topic of antibody increased over the past five years, and the importance of this topic has the potential to keep growing in the future.

Fig. 11.8 An example of the topic-based trend estimation of the theme "antibody"

5 Conclusion and Future Work

This paper proposed an unsupervised topic change identification approach for patent mining using latent Dirichlet allocation. Patent claims, which embody the most significant technological terms, were chosen as the main textual data source of our research. To improve the use of LDA for patent topic extraction, we utilized the USPC as a measurement over different estimations to select the optimal topic model. Machine-identified topics were then placed into a topic change model to locate topic variation over time. Since there is no need to define any keywords in advance and all topics are identified automatically in an unsupervised way, this approach sets domain experts and analysts free from reading, understanding, and summarizing massive technical documents and records. Finally, a case study using USPTO patents published during 2009–2013 with Australia as their assignee country was presented. The experimental results demonstrate that the proposed approach can be used as an automatic tool to extract topics and identify topic changes from a large volume of patent documents. From the application perspective, the discovered topic variations can be utilized to assist further decision making in R&D management, especially for newly created innovative enterprises, for example, to provide a full understanding of the topic structure of a certain industry or to seek technological opportunities.

As patents and other technological indicators are generated and accumulated at an increasing rate, approaches for automatically identifying topic changes using data mining and machine learning methods will continue to gain emphasis. In future work, we will keep focusing on locating topic changes associated with more meaningful temporal segmentations, such as trend-turning intervals (Chen et al. 2015), to identify and analyze the context that contributes to the trend changes of patenting activities.