1 Introduction

A patent can be briefly summarized as a contract between an inventor and a government entity that prevents others from using or profiting from an invention for a fixed period of time; in return, the inventor discloses the invention to the public for the common good. Patents are complex technical and legal documents that exhaustively detail the ideas and scope of the corresponding inventions. The United States Patent and Trademark Office (USPTO) is the government agency responsible for issuing and maintaining patents in the United States. The first U.S. patent was issued in 1790 [26], and since then the USPTO has issued over 9 million patents (Footnote 1). Hundreds of thousands of patent applications are submitted to the USPTO each year, and the data suggest that this number will continue to rise: in 2015 there were 589,410 utility patent applications, compared to 490,226 in 2010 and 390,733 in 2005 [22]. A patent application undergoing review must meet the novelty criterion in order to be published; that is, it must be original with respect to past inventions. Hence, a patentability search is usually performed by inventors, lawyers, and patent examiners to determine whether the patent constitutes a novel invention. This manual process can be a strain on time and resources because of the scientific expertise needed to verify patentability. A patent classification system (PCS) is vital for organizing and maintaining the vast collection of patents for later lookup and retrieval. We first provide a brief dissection of a typical patent document before describing the various PCSs and formalizing the problem of assigning PCS codes to a patent. Since we are primarily concerned with patents in a U.S. context, mentions of patents in the remainder of this paper implicitly refer to U.S. patents unless otherwise stated.

A patent consists of several sections including the title, abstract, description, and claims. From our observation, the title and abstract are consistent in length with those of a typical research paper, while the description is much larger in detail and scope. The claims section is unique in that it describes the invention in units of “innovation”; each claim corresponds to a novel aspect of the invention and is conveyed in nuanced legal terminology. According to Tong et al. [20], the claims of a patent can actually be considered a collection of separate inventions; together they can be used to determine the true measure of a patent. A patent document additionally contains structured bibliographical information such as inventors, lawyers, publication date, application date, application number, and technology class assignments. Technology class assignments are available for each of the three PCSs: the U.S. Patent Classification (USPC) system, the Cooperative Patent Classification (CPC) system, and the International Patent Classification (IPC) system. The USPC had been the official PCS used and maintained by the USPTO since the first patent was issued. On January 1, 2013, however, it was replaced by the CPC system in a joint effort by the USPTO and the European Patent Office (EPO) to promote patent document compatibility at the international level [24]. The CPC is intended to be a more detailed extension of the IPC system (Footnote 2). As part of the transition, all U.S. patent documents dating back to 1836 have been retroactively annotated with CPC codes using “an electronic concordance system” [18].

Table 1. Overview and example of CPC hierarchical taxonomy. The label count at each level is computed from the January 2015 version of the CPC scheme [23].

In CPC, the classification terms/labels (CPC codes) are organized in a taxonomy – a tree in which each child label is a more specific classification of its parent label; that is, there is an IS-A relation between a label and its parent. A single patent can be manually assigned one or more labels corresponding to leaf nodes of the taxonomy by a patent examiner. There are five levels of classification: section, class, subclass, group, and subgroup. As of January 2015, there are 9 sections, 127 classes, 654 subclasses, 10,633 groups, and 254,794 subgroups in the CPC scheme [23]. An example of a leaf label and its parent labels can be observed in Table 1. Since CPC is an extension of IPC, many of the characteristics of CPC described here can be similarly observed in IPC.

In this paper, we propose a supervised machine learning system for the classification of patents according to the newly implemented CPC system. Based on our literature review, ours is, to our knowledge, the first automatic patent classification effort targeting CPC. Our system exploits the hierarchical nature of the CPC taxonomy as well as citation records (Footnote 3). CPC codes appear in order of how adequately a code represents the invention [27]; here, we neglect this ordering and treat the problem as a multi-label classification problem. That is, our prediction is a set of labels, a task that is more challenging and comprehensive than past work focused on predicting a single “main” IPC label for each test patent. This is because many inventions span multiple technological domains, and this level of nuance cannot be captured by a single CPC code. As in past work that deals with the older IPC system, we collapse all CPC labels of a patent to their subclass representations and make predictions at the subclass level (row 3 of Table 1). We use real-world patent documents published by the USPTO in the years 2010 and 2011 to train, tune, and evaluate our proposed models. Moreover, we publicly release the dataset (Footnote 4) used in training and evaluating our system to stimulate further research in the area.

2 Related Work and Background

Fall et al. [4] explored the task of patent classification for IPC using various supervised algorithms such as support vector machines (SVM), naïve Bayes, and k-nearest neighbors (k-NN). Their experiments used the title, the first 300 words, and the claims as the predictive scope for feature extraction. The proposed system does not attempt to predict the correct set of labels but rather produces an exhaustive label ranking for each patent, on which custom “precision” metrics are used to evaluate performance. For instance, the prediction for a test document is deemed correct if the top-ranked predicted label matches the first label of the patent. The authors also conducted experiments with variants of this metric, such as whether any of the top-3 predicted labels matches the first label of the patent or whether the top-ranked prediction appears anywhere in the ground-truth list of IPC labels. Their study concluded that SVM was superior to the other methods under similar conditions.

Liu and Shih [11] proposed a hybrid system for USPC classification using patent network analysis in addition to traditional content-based features. Their approach first constructs a network graph with patents, technology classes, inventors, and assignees as nodes. Edges indicate connectivity, and edge weights are computed using a custom relation metric. A prediction for a test patent is made by looking at its nearest neighbors based on a relevance measure determined by the constructed patent network. Li et al. [10] exploit patent citation records by proposing a graph kernel that captures informative features of the network. Richter and MacFarlane [14] showed that in some cases using features based on bibliographical metadata can improve classification performance. Other studies [2, 8] found that exploiting the semantic structure of a patent, such as its individual claims, can result in similar gains. Automatic patent classification in the literature has primarily focused on IPC [2,3,4] or USPC [10, 11], and we observe k-NN to be a popular approach in many proposed systems [8, 11, 14]. When targeting IPC, classification is typically performed at the class or subclass level. This choice is motivated by the fact that labels at the subclass level are fairly static over time, while group and subgroup labels are more likely to undergo revision with each update of the PCS [4].

3 Datasets

The dataset used in our experiments consists of utility patent documents published by the USPTO in 2010 and 2011, not including pending patent applications. The 2010 and 2011 datasets contain 215,787 and 221,206 patent documents, respectively. Specifics about the training and test set splits for the supervised experiments are outlined in Sect. 5. The patent documents are freely available in HTML format on the USPTO website, although not in a readily machine-processable format. As outlined earlier, each patent contains the text fields title, abstract, description, and claims. Furthermore, each document is annotated with a set of one or more CPC labels. From the dataset, we counted 613 unique CPC subclasses, which constitute the range of candidate labels for this predictive task.

Fig. 1. Overview of CPC subclass level statistics within the 2010 and 2011 datasets for (a) CPC subclasses along the x-axis in order of label frequency (only those above the 99th percentile are labeled) and (b) the frequency of the number of CPC subclasses that appear in a document.

The distribution of CPC subclasses is skewed, with some codes such as G06F (Electrical Data Processing), H01L (Semiconductor Devices), and H04L (Transmission of Digital Information) dominating the patent space with 5.9%, 4.6%, and 4.1% of document assignments, respectively, as seen in Fig. 1(a). The distribution of the number of CPC subclass assignments per document is likewise skewed, such that the average number of labels per patent is only 1.76. Indeed, approximately 60% of documents contain only a single subclass, while some outlier documents have as many as 21 subclasses, as shown in Fig. 1(b).

4 Methods: Label Scoring, Reranking, and Thresholding

As indicated earlier, the CPC taxonomy is hierarchical and takes the form of a tree, with each label existing as a non-root node in the tree; henceforth, we refer to nodes and labels interchangeably. Since we are concerned with classification at the subclass (third) level of the hierarchy, nodes at the subclass level are considered leaf nodes in our experiments (Footnote 5). However, in practice it is possible to choose any target depth d in the hierarchy as the level at which labels are trained (and predictions are made), essentially treating d as the leaf-node level. The following notation is used in the formulation of our methodology. Let \(L^i\) be the set of all labels at level i in the CPC tree. For a leaf node \(c \in L^d\), we define \([c]_i\) as the ancestor node of c at level i such that \([c]_i \in L^i\) and \(i \le d\). Given the tree structure, a node has a unique ancestor at each level above it. As a special case, we have \(c = [c]_i\) if \(i = d\).
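As a concrete illustration of the \([c]_i\) notation, the section, class, and subclass ancestors of a CPC label can be read off as prefixes of its code string. The helper below is our own sketch (the function name is illustrative) and only covers the first three levels:

```python
def ancestors(cpc_code, d=3):
    """Ancestor labels [c]_1, ..., [c]_d of a CPC code for the first three levels.

    For example, ancestors("G06F17/30", d=3) -> ["G", "G06", "G06F"]
    (section, class, subclass). Group/subgroup levels are not handled here.
    """
    prefix_lengths = [1, 3, 4]  # character prefixes for section, class, subclass
    return [cpc_code[:p] for p in prefix_lengths[:d]]
```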

Fig. 2. The pipeline of the proposed framework. The core framework produces scores for each patent while the ranking and cut-off components jointly make a label-set prediction based on these scores.

Basic multi-label pipeline. Figure 2 presents the high-level skeleton of our approach, a pipeline setup that is quite common in multi-label classification scenarios [28]: binary classifiers, one per label, are first applied based on n-gram features. Besides the scores from these classifiers, additional domain-specific features that are not directly related to the lexical n-gram features of the text are also used to score the labels. Subsequently, the scores are used to rank all labels for a given input instance. Given sparsity concerns, the default threshold of the binary classifiers might not be suitable for infrequent labels. Instead, a so-called cut-off or calibration method [5, Sect. 4.4] is used to partition the ranked labels, and the top few labels forming one of the partitions are taken as the final prediction. This approach has been used in obtaining medical subject headings for biomedical research articles [12, 15] and in assigning diagnosis codes to electronic medical records [7]. We elaborate on each of the components in Fig. 2 in the rest of this section, with a focus on the novel CPC- and patent-specific variations we introduce in this paper. But first, we outline the configuration of the label-specific binary classifiers.

Lexical features for binary classifiers. For the textual features, we extract unigram and bigram features from the title, abstract, and description fields. We found that including textual features from the claims field tends to have a neutral or negative impact on performance while drastically enlarging the feature space. We suspect that unigram and bigram features are unable to adequately capture the nuances of the legal language and style of the itemized claims section. To further reduce the feature space, we apply a lowercase transformation prior to tokenization, remove 320 popular English stop words, and ignore terms that occur in fewer than ten documents in the entire corpus. We apply the well-known tf-idf transformation [16] to the word counts to produce the final document-feature matrix. In total, there are nearly 6.8 million n-gram features in the final document-feature matrix.
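As an illustration, this feature-extraction step can be approximated with scikit-learn; the toy texts, built-in stop-word list, and minimum document frequency below are stand-ins for the exact settings described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for patent texts: in practice each string would be the
# concatenated title, abstract, and description of one patent (claims excluded).
texts = [
    "semiconductor device with improved gate structure",
    "method for transmitting digital information over a network",
]

vectorizer = TfidfVectorizer(
    lowercase=True,        # lowercase before tokenization
    stop_words="english",  # stand-in for the 320-word stop list used in the paper
    ngram_range=(1, 2),    # unigram and bigram features
    min_df=1,              # the paper drops terms occurring in fewer than ten documents
)
X = vectorizer.fit_transform(texts)  # sparse tf-idf document-feature matrix
```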

Training label-specific classifiers. Our system uses a top-down approach [17] in which a local binary classifier is trained for each label in the taxonomy (including the non-leaf nodes). Let \(f_c\) be the classifier associated with the label \(c \in L^1 \cup \ldots \cup L^d\). The classifier \(f_c\) is trained on a set of positive and negative examples, such that the positive example set \([M_c]^+\) includes only documents that are labeled with c and the negative example set \([M_c]^-\) is a set of patents that are not labeled with c. Since there are typically many more negative examples than positive examples for a given label, we use only a random subsample of negative examples such that \(\left| {[M_c]^+}\right| = \left| {[M_c]^-}\right| \). This under-sampling of the majority class is a well-known idea that fares well on imbalanced datasets [25]. A simple logistic regression (LR) model is used for each classifier, which expresses its output as a probability estimate in the range [0, 1].
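A minimal sketch of this per-label training scheme with negative under-sampling, assuming the tf-idf matrix X from above and, for each label, the row indices of its positive documents (the function and variable names are our own illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def train_label_classifier(X, positive_idx):
    """Train one binary LR classifier f_c with |[M_c]^-| == |[M_c]^+|."""
    positive_idx = np.asarray(positive_idx)
    negative_pool = np.setdiff1d(np.arange(X.shape[0]), positive_idx)
    # Randomly under-sample the (much larger) negative class.
    negative_idx = rng.choice(negative_pool, size=len(positive_idx), replace=False)
    idx = np.concatenate([positive_idx, negative_idx])
    y = np.concatenate([np.ones(len(positive_idx)), np.zeros(len(negative_idx))])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[idx], y)
    return clf  # clf.predict_proba(x)[:, 1] yields the probability estimate f_c(x)
```

One such classifier would be trained per label at every level of the taxonomy, e.g. `classifiers = {c: train_label_classifier(X, pos_idx[c]) for c in labels}` for a hypothetical mapping `pos_idx`.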

4.1 Label Scoring

Since the classification task is multi-label, the prediction for some patent x is necessarily a subset of all possible labels at level d in the CPC hierarchy. A natural approach is to rank the labels based on some scoring function with respect to x, then truncate the ranked list after some cut-off k such that only the top k labels appear in the predicted set. We define three such scoring functions to be used as the basis for our label ranking methods: leaf-level score, hierarchical multiplicative-path score, and citation score. Let \(L^i(x)\) be the set of labels at level i assigned to patent x. Each scoring function takes two parameters: input patent x and a leaf-node label \(c \in L^d\).

The leaf-level score

$$\begin{aligned} S_L(c, x) = f_c (x) \end{aligned}$$
(1)

is simply the probability \(f_c (x)\) output by the binary classifier for c at level d in the hierarchy. Since the problem is one of mandatory leaf-node prediction, using this score alone for label ranking is referred to as flat classification in the literature [17]. The multiplicative-path score is the geometric mean (Footnote 6) of the probability outputs of the classifiers along the path from the leaf-node label to the root, or more formally,

$$\begin{aligned} S_M(c, x) = \left( \prod _{i=1}^{d} f_{[c]_i} (x) \right) ^ {1/d}. \end{aligned}$$
(2)

Both \(S_M(c, x)\) and \(S_L(c, x)\) produce a real number in [0, 1]. Next, we define the citation score that is specific to the patent domain. Let R(x) be the set of patents cited by x. For a given patent x and a candidate label c, the raw citation score

$$\begin{aligned} S_R(c, x) = \left| {\{t: t \in R(x) \wedge c \in L^d (t) \}}\right| , \end{aligned}$$
(3)

which intuitively counts the number of cited patents that are also assigned our candidate code c. This is not unlike the k-NN approach, where the code sets of similar instances are exploited to score candidate codes for an input instance. However, in this work, we simply use the cited patents as neighbors, drastically reducing the notorious test-time inefficiency of nearest neighbor models. The citation score above is simply a count, but we linearly rescale it across all leaf-level labels such that we obtain a value in the [0, 1] range via

$$\begin{aligned} S_R'(c,x) = \frac{S_R(c,x) - \min {Z(x)}}{\max {Z(x)} - \min {Z(x)}} \end{aligned}$$
(4)

where \(Z(x) = \{ S_R(c,x) : c \in L^d\}\).
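To make the scoring functions concrete, the following sketch computes \(S_L\), \(S_M\), and \(S_R'\) for a single patent, assuming per-label binary classifiers keyed by label (as trained above), an `ancestors` helper returning \([c]_1, \ldots, [c]_d\), and the CPC subclass sets of the cited patents; all function and variable names are illustrative:

```python
import numpy as np

def leaf_score(classifiers, c, x):
    # S_L(c, x): probability output of the leaf-level classifier f_c.
    # x is assumed to be a single feature row of shape (1, n_features).
    return classifiers[c].predict_proba(x)[0, 1]

def multiplicative_path_score(classifiers, c, x):
    # S_M(c, x): geometric mean of f_{[c]_i}(x) along the path from root to leaf.
    probs = [classifiers[a].predict_proba(x)[0, 1] for a in ancestors(c)]
    return float(np.prod(probs)) ** (1.0 / len(probs))

def citation_scores(cited_label_sets, leaf_labels):
    # S_R(c, x): number of cited patents annotated with c, then min-max rescaled
    # over all leaf labels to obtain S_R'(c, x) in [0, 1].
    raw = np.array([sum(c in s for s in cited_label_sets) for c in leaf_labels],
                   dtype=float)
    span = raw.max() - raw.min()
    return (raw - raw.min()) / span if span > 0 else np.zeros_like(raw)
```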

4.2 Label (Re)ranking

The three scoring functions in Eqs. (1), (2), and (4) offer evidence of the relevance of a label c for a test instance x. At this point, we do not know how much each of them contributes to an effective ranking model. Thus, to leverage them, we combine them into a final score – for the purpose of label ranking – using two approaches:

1. The first approach involves a grid search [6] over weights \(p_1 \ge 0\) and \(p_2 \ge 0\), where \(0 \le p_1 + p_2 \le 1\), for the combined scoring function

    $$\begin{aligned} S(c,x) = p_1 \cdot S_L(c,x) + p_2 \cdot S_M(c,x) + (1- p_1 - p_2) \cdot S_R'(c,x), \end{aligned}$$
    (5)

    that maximizes the micro F1-score over a validation dataset. Note that the coefficients of the three constituent scoring functions in Eq. (5) sum to 1; hence, given that each score is also in [0, 1], this ensures \(S(c, x) \in [0,1]\). A minimal code sketch of this grid search appears after this list.

2. The second approach uses what is known in the literature as stacking [9], a popular meta-learning trick. Here, we train a new binary LR classifier for each label at level d using the scores \(S_L\), \(S_M\), and \(S_R'\) computed for each patent example in the validation set as features.
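The grid search of the first approach can be sketched as follows, assuming precomputed score matrices with one row per validation patent and one column per leaf label, a binary ground-truth indicator matrix, and some `predict_sets` helper that turns the combined scores into label-set predictions (the step size and helper are illustrative assumptions, not the exact configuration used):

```python
import numpy as np
from sklearn.metrics import f1_score

def combined_scores(S_L, S_M, S_R, p1, p2):
    # Eq. (5): convex combination of the three score matrices.
    return p1 * S_L + p2 * S_M + (1.0 - p1 - p2) * S_R

def grid_search_weights(S_L, S_M, S_R, Y_true, predict_sets, step=0.05):
    """Return (p1, p2) maximizing micro-F1 on the validation set."""
    best = (0.0, 0.0, -1.0)
    for p1 in np.arange(0.0, 1.0 + 1e-9, step):
        for p2 in np.arange(0.0, 1.0 - p1 + 1e-9, step):
            S = combined_scores(S_L, S_M, S_R, p1, p2)
            Y_pred = predict_sets(S)  # ranking plus a cut-off method (Sect. 4.3)
            score = f1_score(Y_true, Y_pred, average="micro")
            if score > best[2]:
                best = (p1, p2, score)
    return best[:2]
```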

4.3 Label Cut-Off/Thresholding

As mentioned earlier, once we have a final ranking among leaf labels for some patent x, we need to choose the optimal cut-off k tailored specifically to x. Thus only the top k labels are included in the final prediction. One simple approach is to train a linear regression model to predict the number of labels based on core n-gram features [12] and a few select domain-specific features. We found that adding length-based features such as character-count and word-count of text fields resulted in poor performance. Instead, the following features were selected given a patent x: the number of claims made by x, the number of inventors associated with x, and descriptive statistics based on the distribution of known label counts associated with the set of citations R(x). We refer to this approach as the linear cut-off method.
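A minimal version of the linear cut-off predictor might look as follows; the exact feature construction (which descriptive statistics are included) is our assumption, and the toy training rows are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def cutoff_features(n_claims, n_inventors, cited_label_counts):
    # Descriptive statistics over the label counts of the cited patents R(x).
    counts = np.asarray(cited_label_counts or [0.0], dtype=float)
    return [n_claims, n_inventors, counts.mean(), counts.std(), counts.max()]

# Toy training rows: one feature vector per patent and its true number of labels.
F = np.array([cutoff_features(12, 2, [1, 2, 1]), cutoff_features(30, 5, [3, 4])])
y = np.array([1, 3])
cutoff_model = LinearRegression().fit(F, y)
k = max(1, round(float(cutoff_model.predict(F[:1])[0])))  # predicted cut-off
```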

The linear cut-off approach serves as a reasonable baseline. However, the skewed distribution of label counts presents a source of difficulty. Approximately 60.95% and 26.39% of patents in the 2010 dataset have one and two labels, respectively. The remaining 12.66% of patents are exceptions in that they have more than two labels, with some patents having as many as 21 labels. A simple linear regression approach tends to overestimate the label count; as such, we consider a more sophisticated version of this method using a two-level, top-down learning model. In the first level, we train a binary classifier to predict whether, for some test patent, the case is common (one or two labels) or an exception (three or more labels). If it is predicted to be common, a second binary classifier predicts whether the number of labels is one or two. If it is predicted to be an exception, we instead use a linear regression model to predict a real-valued count, which is rounded to the nearest natural number and used as the outcome of the label-count predictor. We refer to this approach as the tree cut-off method; it is inspired by the so-called “hurdle” process in regression methods for count data [1].
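A hedged sketch of the tree cut-off method, reusing the cut-off features above; the class structure and clamping below are illustrative choices rather than the exact implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

class TreeCutoff:
    """Two-level, 'hurdle'-style label-count predictor (illustrative sketch)."""

    def fit(self, F, y):
        # F: one cut-off feature row per training patent; y: its true label count.
        common = y <= 2                                          # common vs. exception
        self.level1 = LogisticRegression().fit(F, common)
        self.level2 = LogisticRegression().fit(F[common], y[common] == 1)  # one vs. two
        self.regressor = LinearRegression().fit(F[~common], y[~common])    # exceptions
        return self

    def predict_one(self, f):
        f = np.asarray(f, dtype=float).reshape(1, -1)
        if self.level1.predict(f)[0]:                            # predicted common case
            return 1 if self.level2.predict(f)[0] else 2
        # Exception case: regress the count and round to the nearest natural number.
        return max(1, round(float(self.regressor.predict(f)[0])))
```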

Finally, we propose a more involved cut-off method that does not rely on supervised learning. This alternative method, referred to as selective cut-off, consists of two steps. (1) Recall our final scoring function S(c, x) from Eq. (5) used to rank labels before applying the cut-off. Suppose the label ranking is \(c_{i_1}, \ldots , c_{i_{\left| L^d\right| }}\) with \(i_j \in \{1, \ldots , |L^d|\}\); we choose the cut-off k as

$$\begin{aligned} \underset{k \in \{1,\ldots ,\left| L^d\right| -1\}}{\text {argmax}} S(c_{i_k},x) - S(c_{i_{k+1}},x) . \end{aligned}$$

This has the effect of choosing the cut-off at the greatest “drop-off point” in the ranking with respect to the score. (2) Performing the first step alone is prone to overestimating the actual k, so additional pruning is required. Suppose the top k labels are \(c_{i_1}, \ldots , c_{i_k}\). As a second step, we further remove a label from this ranking if its rank is greater than the average label count over all patents (in the training data) to which it is assigned. Let \(A_j\) be the mean label count over all patents in the training set with label \(c_{i_j}\) for some \(1 \le j \le k\). We remove \(c_{i_j}\) from the final list if \(j > A_j\). Intuitively, for example, a label that typically appears in label sets of size 3 on average is unlikely to be a correct assignment when it appears at rank 4 or beyond on unseen examples. In a way, this leverages the thematic aspects of patents and how certain themes typically lend themselves to a narrow or broad set of patent codes.
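A minimal sketch of the selective cut-off, assuming a dictionary of final scores S(c, x) for a patent and a precomputed mapping from each label to its mean training label count (both names illustrative); at least two candidate labels are assumed:

```python
import numpy as np

def selective_cutoff(scores, avg_label_count):
    """scores: dict label -> S(c, x); avg_label_count: dict label -> mean count A."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    vals = np.array([scores[c] for c in ranked])
    # Step 1: cut at the largest drop between consecutive ranked scores.
    k = int(np.argmax(vals[:-1] - vals[1:])) + 1
    # Step 2: prune labels whose rank exceeds their average training label count.
    return [c for j, c in enumerate(ranked[:k], start=1) if j <= avg_label_count.get(c, j)]
```

For example, with scores {G06F: 0.90, H04L: 0.85, A61B: 0.20}, the largest drop occurs after the second label, so k = 2; H04L would then be pruned only if its mean training label count were below 2.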

5 Experiments and Results

For our experiments, we optimized hyperparameters (from Eq. (5)) on the 2010 dataset using a split of 70% and 30% for training and development, respectively. We use this 2010 dataset exclusively for tuning hyperparameters. We evaluate the variants of our system by training and testing on the 2011 dataset (with the 70%–30% split) using the hyperparameters optimized on the 2010 dataset to simulate the prediction process – learn on existing data to predict for future patents. Since we are framing it as a multi-label problem with class imbalance concerns, we measure classification performance based on the popular micro-averaged F1 [21] measure. We evaluate all combinations of the proposed ranking and cut-off methods on top of the core framework (as described in Sect. 4). For each variant of our system, we perform 30 experiments each with a random train-test split of the 2011 dataset and record the micro-averaged F1/Precision/Recall as shown in Table 2. The macro-averaged F1 measure that gives equal importance to all labels regardless of their frequency is additionally recorded in Table 3.
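The micro- and macro-averaged measures can be computed with standard multi-label tooling once the true and predicted label sets are binarized into indicator matrices; a minimal sketch with toy label sets:

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score, precision_score, recall_score

true_sets = [{"G06F"}, {"H01L", "H04L"}, {"G06F", "H04L"}]  # toy ground-truth label sets
pred_sets = [{"G06F"}, {"H01L"}, {"G06F", "H04L"}]          # toy predicted label sets

mlb = MultiLabelBinarizer()
Y_true = mlb.fit_transform(true_sets)   # binary indicator matrix (documents x labels)
Y_pred = mlb.transform(pred_sets)

micro_f1 = f1_score(Y_true, Y_pred, average="micro")
macro_f1 = f1_score(Y_true, Y_pred, average="macro")
micro_p = precision_score(Y_true, Y_pred, average="micro")
micro_r = recall_score(Y_true, Y_pred, average="micro")
```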

Table 2. Results comparing variants of our method in micro-averaged F1
Table 3. Results comparing variants of our method in macro-averaged F1

Among the proposed approaches, the combination of grid-based ranking and selective cut-off achieved a micro-F1 of almost 70%, the best performance (row 3 of Table 2). Based on non-overlapping confidence intervals, we also see that the differences in performance between this combination and the other variants are statistically significant. From Table 2, it is clear that the stacked ranking approach performs poorly compared to the grid ranking approach; it is approximately 15 points worse in micro-F1 owing to its low recall. Among variants that use the grid ranking method, the selective cut-off approach achieves superior precision with only a minor dip in recall. It can also be observed that, in terms of macro-F1, the grid ranking approach remains far superior; with grid ranking, the linear cut-off has a higher average macro-F1 than the other cut-off methods, but the gains are not statistically significant. Moreover, we note that the macro-F1 score is greater than the micro-F1 score for each respective method combination, which may suggest that the system generally performs better on low-frequency labels and worse on popular ones [19]. This is a counterintuitive outcome and needs further examination as part of our future work. However, it could also be that some infrequent labels are very specific to certain esoteric domains in which the language used in the patents is highly distinctive.

6 Conclusion

In this paper, we proposed an automated framework for the classification of patents under the newly implemented CPC system. Our system exploits the CPC taxonomy and citation records in addition to textual content. We evaluated and compared variants of the proposed system on patents published in 2010 and 2011. As a takeaway, we demonstrated that the proposed framework with grid ranking (based on three different scoring functions) and the selective cut-off method outperforms the other variants of the system. In this work, we used logistic regression as the base classifier since it is fast and uses relatively few parameters. In future work, we propose to incorporate the information in the CPC hierarchy into recurrent and convolutional neural networks for multi-label text classification trained with the cross-entropy loss [13].