Introduction

Patent classification is one of the first tasks performed by the experts of patent offices when analyzing a patent application to register a new invention. Classification consists of assigning a set of category codes to the document, based on its content, ensuring in this way that patents with similar topics or technological areas are grouped under the same codes. Accurate classification of patent applications is vital for the interoperability between different patent offices, and for conducting several tasks such as reliable patent search, management and retrieval (Zhang et al. 2015; Abbas et al. 2014); extracting relevant content (Härtinger and Clarke 2015; Rodriguez-Esteban and Bundschus 2016; Wang et al. 2014); and investigating technology characteristics (Arts et al. 2018; Noh et al. 2015).

Nonetheless, the number of patent application filings is constantly growing, overwhelming the experts who analyze the documents. For example, the United States Patent and Trademark Office (USPTO) received 629,647 patent applications in 2015, whilst the European Patent Office (EPO) received approximately 296,227 patent applications in 2016.

In addition to the sheer amount of data, patent data have particularities that play a role in the classification. First, the categories are organized as a hierarchy, and this hierarchical structure is large and complex (containing thousands of categories in a tree-like structure). Second, patents are in turn complex and lengthy documents, spanning several pages and divided into sections. Third, patents are usually labelled with several categories at the same time, meaning they encompass different technological areas. If we additionally consider that experts are costly and vary in capabilities when performing the classification, it is clear that reliable and efficient automatic methods are needed to help in the patent classification task.

The automated patent classification task has been tackled over the last decade as a text classification problem using several methods and approaches (Benzineb and Guyot 2011; Gomez and Moens 2014). Nevertheless, despite the research done, it remains an open problem with several unsolved issues regarding the generally low accuracy obtained (Benzineb and Guyot 2011; D’hondt et al. 2017; Fall and Benzineb 2002; Gomez and Moens 2014). In this paper, we aim to contribute towards a better understanding of the problem by proposing a methodology to systematically study the effect on classification of three patent data properties in combination with two general text classification details. The properties we study are the use of: the different sections to extract content from patents, the different codes assigned to each patent, and the hierarchy of categories. The two classification details we consider are: the way to represent documents and the base classifier.

Following our methodology, we train and test optimized classification models using different combinations of options for the mentioned properties. We then compare the results of applying the models over two standard patent datasets in English and German. We consider classification accuracy and computational efficiency, and conduct statistical tests to assess the validity of the comparisons. Additionally, we show comparisons with other works in the literature, considering the methodologies, the models and the results, and we conduct a statistical characterization of the datasets.

We make five contributions to the problem of patent classification. (1) A methodology that takes into account several relevant patent data properties and classification details for the study of the problem. (2) A thorough systematic analysis of the effect on classification results of different models built using the methodology as combinations of patent data properties. (3) The consideration of two important aspects of the problem: the optimization of hyper-parameters in the base classifiers and the language independence of the models. (4) The introduction of a hierarchical model that trains local classifiers and computes the final classification as a weighted linear combination of the decisions along the hierarchy. (5) The use of our findings as a guideline for the patent classification task, such that other researchers can consider them when deciding which options are more suitable for implementation or for testing their hypotheses, weighing classification accuracy against computational efficiency. Some combinations of options build models that are slow to train but produce good classification results, while others produce slightly worse results at a fraction of the training time.

In the following sections, we first describe the patent data properties (“Patent data properties” section). Next, we review the relevant related works in the literature (“Relevant related works” section). We then describe our experimental methodology in detail (“Experimental methodology” section), with the options for each property and classification detail and the experimental setup. Afterwards, we present the results (“Experimental analysis” section). We conclude our work in the “Conclusions” section with an overall discussion and possible future research directions.

Patent data properties

The first property of patent data is that the categories used to classify patents are organized as a tree-like hierarchical structure. There exist several structures used by different patent offices, but the most widely used and globally agreed upon is the International Patent Classification (IPC), used by more than 100 countries, with around 95% of all existing patents classified according to it. The World Intellectual Property Organization (WIPO) manages and updates the IPC annually, with IPC2018.01 being the current version.

Every category in the IPC has a code and a title name. The IPC divides all technological fields into eight sections, designated by capital letters from A to H. Each section is subdivided into classes, labeled by the section code followed by two digits (e.g. H01). Each class is divided into subclasses, labeled by the class code followed by a capital letter (e.g. H01F). Each subclass is broken down into main groups, labeled by the subclass code followed by one to three digits, an oblique stroke and the number 00 (e.g. H01F 1/00). Subgroups form subdivisions under the main groups. Each subgroup code includes the main group code, but replaces the last two digits by a number other than 00 (e.g. H01F 1/01). Subgroups are ordered in the structure as if their numbers were decimals of the number before the oblique stroke. For example, 1/036 is to be found after 1/03 and before 1/04. Below the subgroup level, the hierarchy is organized using dots preceding the title of the category (e.g. H01F 1/03.), where each dot represents a level down. An example of a sequence of category codes along the different levels of the IPC is shown in Table 1. The total number of categories per level in the IPC is shown in Table 2.

Table 1 Example of a sequence of codes in the IPC
Table 2 Number of categories in each level of the IPC (version IPC2018.01)

From a mathematical point of view, the IPC hierarchy is a directed rooted tree graph, with every code indicating a node. The nodes are connected by directed edges indicating PARENT–OF relationships, where each node has exactly one parent, meaning two nodes are connected by exactly one path. Figure 1 shows a portion of the IPC represented as a tree graph. The root node is at level 0 and is not shown.
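To make the structure concrete, the following minimal sketch (hypothetical code, not from the paper’s released implementation) shows how a full IPC code such as H01F 1/036 decomposes into its unique path of ancestor nodes, one per level of the tree:

```python
# Hypothetical helper: derive the ancestor path of an IPC code, one node per
# level (section, class, subclass, main group), exploiting the fact that each
# prefix of the code identifies the parent nodes in the tree.
def ipc_ancestors(code: str) -> list[str]:
    section = code[0]                                  # e.g. "H"
    ipc_class = code[:3]                               # e.g. "H01"
    subclass = code[:4]                                # e.g. "H01F"
    group_number = code.split(" ")[1].split("/")[0]    # e.g. "1"
    main_group = f"{subclass} {group_number}/00"       # e.g. "H01F 1/00"
    return [section, ipc_class, subclass, main_group]

print(ipc_ancestors("H01F 1/036"))
# ['H', 'H01', 'H01F', 'H01F 1/00']
```

Since each node has exactly one parent, this prefix decomposition recovers the single path connecting a subgroup to the root.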

Fig. 1

Example of a portion of the IPC hierarchy starting at level 1, section B. The root node is level 0 (not shown)

The second property of patent data is that patents are complex documents (Zhang et al. 2015) and present differences with respect to other documents that are classified automatically, such as news, emails or web pages. Patents are long documents of several pages; their content is governed by legal agreements and is therefore semi-structured (divided into sections with well-defined paragraphs); and they are written in a formal language, with many technical words and sometimes fuzzy sentences (in order to avoid infringing other patents or to extend the scope of the invention). The structure of a patent is important because it provides different input information to a classification model. The content of a patent is generally organized in the following sections (Fall and Benzineb 2002; Benzineb and Guyot 2011; Lupu and Hanbury 2013; Gomez and Moens 2014):

  • Title: indicates a descriptive name of the patent.

  • Bibliographical data: contains the number of the patent, the names of the inventor and the applicant, and sometimes the citations to other patents and documents.

  • Abstract: includes a brief description of the invention presented in the patent.

  • Description: contains a detailed description of the invention, including prior work, related technologies and examples.

  • Claims: explain the legal scope of the invention and the application fields for which the patent is sought.

Most of the sections contain pure text, but it is also frequent to find images, graphics and links in a patent. In this work, we focus exclusively on the textual content, since it is the largest component of patents, and several other elements of the content are often described or explained using text.

The third property of patent data is that patents can have more than one category code assigned to them, meaning they encompass several technological areas. The first code assigned by the experts corresponds to the most relevant category (main code). Secondary codes correspond to other categories that are relevant for the patent, but carry no specific order of relevance. From a machine learning perspective, the task is thus a multi-label problem (Tsoumakas et al. 2010).
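As an illustration of the two labelling regimes this property induces (training with all codes versus only the main code, compared later in our experiments), the following sketch uses scikit-learn’s MultiLabelBinarizer; the example codes are hypothetical:

```python
# Minimal sketch of the multi-label vs. single-label target encodings.
from sklearn.preprocessing import MultiLabelBinarizer

# Main code first, then secondary codes (hypothetical examples).
patent_codes = [["H01F", "C08K"],
                ["H01F"],
                ["C08K", "B29C", "H01F"]]

mlb = MultiLabelBinarizer()
Y_all = mlb.fit_transform(patent_codes)        # binary indicator per code
y_main = [codes[0] for codes in patent_codes]  # only the main code
print(mlb.classes_)   # ['B29C' 'C08K' 'H01F']
print(Y_all)          # [[0 1 1] [0 0 1] [1 1 1]]
```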

Relevant related works

There are several works on the automated classification of patents, starting with some surveys about the task (Fall and Benzineb 2002; Krier and Zaccà 2002; Benzineb and Guyot 2011; Gomez and Moens 2014) where some issues are highlighted, such as model accuracy, scalability, use of the hierarchy of categories, patent sections to use, and document representations.

Fall et al. introduced the WIPO-alpha and WIPO-de datasets in Fall et al. (2003, 2004), respectively, where they performed a comparison of several base classifiers: NB, KNN, SVM, SNoW (sparse network of winnows) and LLSF (linear least squares fit), using different patent sections independently. They also introduced a set of performance metrics to evaluate the task. In Seneviratne et al. (2015) and Tikk et al. (2005), the authors presented several hierarchical models, using several base classifiers (such as SVM, KNN and HITEC), patent sections and patent codes, which they evaluated on the full WIPO datasets with the same performance metrics as defined by Fall et al. We compare one of our models with the results of these works.

There are other works that have used the WIPO datasets (especially WIPO-alpha) for experimentation, many of them focusing on kernel classifier methods that take the hierarchical structure into account (Bi and Kwok 2014; Cai and Hofmann 2004; Chen and Chang 2012; Rousu et al. 2006; Tsochantaridis et al. 2005; Zhang 2014). However, most of these works focus on assigning a single code to a patent, only conduct experiments on a subset of the dataset (normally on Section D of the IPC), which makes it unclear whether they are scalable to the full dataset, or focus on a specific task, such as preferential classification.

Some works have used models based on neural networks, such as back-propagation (Trappey et al. 2006) and Winnow (D’hondt et al. 2017; Koster et al. 2003), using different features such as phrases and deep learning word representations. The authors found that word features produce better results than other features in most cases.

Over the years, the NTCIR workshops have organized several patent classification tasks (Iwayama et al. 2005, 2007; Kim and Choi 2007; Nanba et al. 2008, 2010) to classify Japanese patents by F-terms, or English research publications in the IPC, based on training with patent data. In Li and Shawe-Taylor (2007), the authors presented a series of SVM-based methods for cross-lingual patent classification between English and Japanese using the NTCIR-3 dataset.

The CLEF-IP tracks included a task on patent classification (Piroi 2010, 2011). The tasks provided a collection of over 1 million patents as training data to classify a test set of around 3000 patents at the subclass level of the IPC. Some of the best models were based on the Winnow classifier (Guyot et al. 2010; Verberne et al. 2010; Verberne and D’hondt 2011), using the abstract section and word and triplet features. In Giachanou and Salampasis (2014) and Giachanou et al. (2015), the authors used the CLEF-IP 2011 dataset to evaluate a series of methods based on information retrieval for patent classification, but used a subset of the test set of only 300 patents.

Some other works have used the WIPO, NTCIR, CLEF-IP or other patent datasets, but treated them as general text/graph datasets for several forms of classification. These works had goals different from the particularities of patent classification, such as testing the efficiency and/or scalability of their methods in general text classification (Gomez and Moens 2014), hierarchical classification (Wang et al. 2014), or node classification in graphs (Dallachiesa et al. 2014); testing methods for extreme machine learning (Wang et al. 2014) or dimensionality reduction (Shalaby et al. 2014); quantifying the existence of concept drift in data (D’hondt et al. 2014); or classifying the data into user-defined hierarchies (Zhu et al. 2015). Most of these works used only the title, abstract or claims section from patents, used general accuracy and macro- and/or micro-F1 as performance metrics, did not mention which patent codes they used, and considered the hierarchy or not depending on the problem they were studying.

Experimental methodology

Figure 2 shows a graphical depiction of our experimental methodology, which consists of several sequential phases composed of several steps or options.

Fig. 2

General description of the experimental methodology

We start with a collection of patents, and the first phase of the methodology is preprocessing. The first step of preprocessing consists in choosing the section (or combination of sections) from which to extract information: title (T), inventors (I), abstract (A), claims (C), short description (S) (first 30 lines of the description) and long (full) description (L). The second step consists in tokenizing the content to extract word features as sequences of letters, numbers and hyphens (the latter to capture chemical compounds) and converting each word to lowercase. The third and fourth steps consist in removing words that carry little information. By default, we remove stop words and words appearing in fewer than five training documents to form a vocabulary.
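The following minimal sketch illustrates this preprocessing under stated assumptions (the stop-word list shown is an illustrative stand-in for the one actually used):

```python
# Tokenize as runs of letters, digits and hyphens (lowercased), drop stop
# words, and keep only words appearing in at least five training documents.
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9-]+")
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}  # illustrative subset

def tokenize(text: str) -> list[str]:
    return [t for t in TOKEN.findall(text.lower()) if t not in STOP_WORDS]

def build_vocabulary(training_docs: list[str], min_df: int = 5) -> set[str]:
    doc_freq = Counter()
    for doc in training_docs:
        doc_freq.update(set(tokenize(doc)))  # document frequency, not raw counts
    return {w for w, df in doc_freq.items() if df >= min_df}
```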

The second phase of the methodology is the document representation. This phase consists in indexing and transforming all the patent documents to a vector space representation using the vocabulary extracted in the previous phase. We considered four weighting options when vectorizing the patents: term frequency (tf), term frequency–inverse document frequency (tf–idf), entropy and word2vec (w2v), and we normalized the document vectors using the L2 norm. The w2v features were computed using the Gensim module for Python by training a skip-gram model (Mikolov et al. 2013) over the training part of each dataset (see next subsection), extracting word embeddings of 300 dimensions. The final document representation was computed by averaging all the embeddings corresponding to its words.
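As an illustration, the w2v document representation can be sketched as follows (assuming Gensim 4.x and NumPy; hyper-parameters other than the 300 dimensions and the skip-gram architecture are library defaults, not taken from the paper):

```python
# Train a skip-gram word2vec model on the tokenized training patents, then
# represent each document as the L2-normalized average of its word embeddings.
import numpy as np
from gensim.models import Word2Vec

def w2v_document_vectors(tokenized_docs: list[list[str]]) -> np.ndarray:
    model = Word2Vec(sentences=tokenized_docs, vector_size=300, sg=1, min_count=5)
    doc_vectors = []
    for doc in tokenized_docs:
        known = [model.wv[w] for w in doc if w in model.wv]
        vec = np.mean(known, axis=0) if known else np.zeros(300)
        norm = np.linalg.norm(vec)
        doc_vectors.append(vec / norm if norm > 0 else vec)  # L2 normalization
    return np.stack(doc_vectors)
```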

We decided to use word features because several works have pointed out that these features outperform other, more complex representations in several classification/prediction tasks (Cinar et al. 2015; Basile et al. 2017), including patent classification (D’hondt et al. 2017).

In the third phase of the methodology, for training we can choose to include all the codes assigned to a patent or only the main code, especially considering that multiple concurrent labels in a dataset could confuse a classification system (Fall et al. 2004; Tsoumakas et al. 2010).

The fourth phase of the methodology consists in deciding whether or not to use the hierarchical structure of the IPC. When the hierarchy is ignored (flat approach), for training we take all the categories at a defined level (section, class, subclass, main group), aggregate all the patents from the categories below, and build a single large multi-label classifier (Fig. 3a). During testing, the flat approach takes a test patent and returns all the categories from the defined level as a list ranked by probability.

To use the hierarchy (hierarchical approach), we introduce a model that trains a local multi-label classifier at each category that contains children categories, aggregating the patents from the children categories in the local classifier (Fig. 3b). During testing, our hierarchical approach takes a test patent and assigns codes from top to bottom in the hierarchy, starting at the section level. Each local classifier returns its local categories as a list ranked by probability, and from each list the model takes the top three. It then goes down only into the selected categories. When going down, it adds the corresponding probabilities for the assigned categories, weighted by an importance factor for the category level. When classifying down to the main group level, we obtain 81 (\(3\times 3\times 3\times 3\)) selected codes. Each code has a final probability P given by:

$$\begin{aligned} P=aP_\mathrm{section}+bP_\mathrm{class}+cP_\mathrm{subclass}+dP_\mathrm{maingroup} \end{aligned}$$
(1)

where \(P_\mathrm{level}\) is the code probability at the given level and a, b, c and d are the level importance factors. These factors can modify the distribution of errors along the hierarchy. In our case, we set the importance factors to 1/4 for all levels. Thus, the final classification is a weighted linear combination of the decisions along the hierarchy.
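The prediction rule can be sketched as follows (hypothetical code: the mapping local_classifiers from a node to its ranked child predictions is assumed to exist):

```python
# Top-down hierarchical prediction implementing Eq. (1): keep the top three
# children per level and score each path by the weighted sum of probabilities.
WEIGHTS = [0.25, 0.25, 0.25, 0.25]  # importance factors a, b, c, d

def classify(doc, local_classifiers, root="ROOT", n_levels=4, beam=3):
    paths = [([], 0.0)]                        # (codes so far, accumulated P)
    for level in range(n_levels):
        expanded = []
        for codes, score in paths:
            node = codes[-1] if codes else root
            ranked = local_classifiers[node](doc)          # [(child, prob), ...]
            for child, prob in sorted(ranked, key=lambda cp: -cp[1])[:beam]:
                expanded.append((codes + [child], score + WEIGHTS[level] * prob))
        paths = expanded                       # 3, 9, 27, then 81 candidates
    return sorted(paths, key=lambda cs: -cs[1])            # ranked by final P
```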

Fig. 3

Approaches of classification with multi-class classifiers in dashed squares. In the flat approach (a), the model predicts the local categories; in the hierarchical approach (b), the model predicts the children categories

In the fifth phase of the methodology, we choose which base classifier to use for building the local or global classifiers. We considered four options: Multinomial Naive Bayes (NB), a probabilistic classifier; K-Nearest Neighbors (KNN), an instance-based classifier; and linear Support Vector Machines (SVM) and Logistic Regression (LR), two discriminative classifiers. We used the implementation from WEKA (Hall et al. 2009) for NB, Liblinear (Fan et al. 2008) for SVM and LR, and for KNN a proprietary implementation that takes advantage of the sparseness of the vector representation. For the models using SVM and LR we use L2 loss and L2 regularization, respectively, and for KNN we use 1 − cosine similarity as the distance metric.
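For illustration only, roughly equivalent scikit-learn configurations would look as follows (these are stand-ins; as noted above, the actual experiments used WEKA, Liblinear and a custom sparse KNN implementation):

```python
# Hypothetical scikit-learn counterparts of the four base classifiers.
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

base_classifiers = {
    "NB": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=10, metric="cosine"),  # 1 - cosine sim.
    "SVM": LinearSVC(C=1.0, loss="squared_hinge"),  # L2 loss
    "LR": LogisticRegression(C=1.0, penalty="l2"),  # L2 regularization
}
```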

During the training phase, our methodology performs a fivefold cross-validation over the training set for the models using KNN, SVM or LR, to look for the optimal number K of neighbors or the optimal soft margin parameter C, respectively. We considered the values \(K \in \{1,5,10,20\}\) and \(C \in \{0.1,1,10,100\}\), and used the top metric (see below) as the optimality criterion. It is worth mentioning that most of the works in the literature on patent classification do not perform a hyper-parameter optimization, but commonly take the default values from the implementation.
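A minimal sketch of this search, assuming scikit-learn (in the actual setup the scoring function is the top metric rather than plain accuracy):

```python
# Fivefold cross-validated grid search over the soft margin parameter C;
# for KNN the grid would instead be {"n_neighbors": [1, 5, 10, 20]}.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

search = GridSearchCV(LogisticRegression(penalty="l2"),
                      param_grid={"C": [0.1, 1, 10, 100]},
                      cv=5)
# search.fit(X_train, y_train)
# search.best_params_["C"]  -> optimal C under the chosen criterion
```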

For our experiments we trained a diversity of models using different combinations of the previously mentioned options: patent section to extract information from, document representation, patent codes to use for training, use or not of the IPC hierarchy, and the base classifier. During the test phase, for each test patent, the flat models output all the possible categories, whilst the hierarchical models output a list of 81 codes, in both cases ranked by probability. For comparison purposes, we chose the three top codes per test patent and evaluated each model using the performance metrics defined in Fall et al. (2003). The top metric compares the top predicted code with the main code of the test patent. The three metric compares the top three predicted codes with the main code of the test patent. The all metric compares the top predicted code with all the codes assigned to the test patent. To validate the results, we performed paired McNemar’s tests between each pair of models, considering a significance level of \(\alpha =0.01\) and using the Holm–Bonferroni method to correct for the number of comparisons.
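The three metrics can be sketched as simple per-patent checks (our own paraphrase of the definitions in Fall et al. 2003):

```python
# ranked: predicted codes sorted by probability; main: the main code of the
# test patent; all_codes: the full set of codes assigned to the test patent.
def top_metric(ranked, main):       # top prediction vs. main code
    return ranked[0] == main

def three_metric(ranked, main):     # top three predictions vs. main code
    return main in ranked[:3]

def all_metric(ranked, all_codes):  # top prediction vs. any assigned code
    return ranked[0] in all_codes
```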

All the code implementing the methodology was written in Java and Python and will be released upon acceptance. We conducted all the experiments on a desktop Linux PC with a 3.4 GHz Core i7 processor and 16 GB of RAM.

Datasets description

For our experiments we use the standard patent collections introduced by Fall and Benzineb (2002) and Fall et al. (2004): WIPO-alpha and WIPO-de. The WIPO-alpha collection consists of 75,250 patent documents in English. The WIPO-de collection consists of 110,826 patent documents in German, provided by the German Patent Office and extracted from the DEPAROM collection. Both datasets are stored in XML format and are already split into standard training and test sets. Patents in both datasets include the sections: title, abstract (not present in the WIPO-de dataset), claims and the full description.

Table 3 shows the total number of IPC categories present in the WIPO-alpha and WIPO-de datasets, split per level of the IPC. We considered all the codes assigned to each patent for these statistics. The numbers in this table are consistent with those in Table 2; therefore, the datasets are a good sub-sample of the whole IPC structure. Table 3 also shows the minimum, maximum, average and standard deviation of the number of category codes assigned to each patent. The statistics are similar for both datasets, with around 95% of the patents containing at most five codes.

Table 3 Number of categories per level of the IPC, and number of codes per patent in the WIPO-alpha and WIPO-de patent datasets

Table 4 breaks down the number of codes for the training and test parts of each patent collection, and shows how the patents are distributed over the categories. The table shows that there are more categories in the training part than in the test part of both datasets. The Max/Min and Ent columns contain information about the degree of skewness of the category distributions. The former shows the ratio between the largest and smallest categories; higher ratios indicate more skewed category distributions. The latter shows the Shannon entropy values; higher entropy values imply more uncertainty in the distribution. From the statistics, we observe that the distributions are largely skewed and have a high degree of uncertainty (especially at the class and subclass levels). Figure 4a, b presents the distributions of patents per category at the main group level for both patent collections. We observe here the skewness of the distributions: for WIPO-alpha, 56% of the categories contain at most 10 patents, while only 4% contain more than 100 patents. Similar distributions have been observed in other hierarchical structures (Gomez and Moens 2012).
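For reference, the two skewness statistics can be computed as follows (a minimal sketch given the per-category patent counts):

```python
# Max/Min ratio and Shannon entropy of a patents-over-categories distribution.
import math

def skew_stats(category_counts: list[int]) -> tuple[float, float]:
    total = sum(category_counts)
    max_min = max(category_counts) / min(category_counts)
    probs = [c / total for c in category_counts]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return max_min, entropy
```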

Table 4 Statistics of documents over categories in the WIPO-alpha and WIPO-de datasets
Fig. 4

Distributions of patents over categories (a, b) and of words over patents (c, d) for the WIPO-alpha and WIPO-de datasets

For the experiments, we first extracted the content from each section independently: title, inventors, abstract, claims, short description, and full description; we also extracted two combinations of all the sections, one with the short description (TIACS) and one with the long one (TIACL). The TIACL combination corresponds to the full content of the patent. The TIACS and TIACL combinations for WIPO-de do not include the abstract section (it is not present in the documents of that dataset). Table 5 shows statistics on the number of (unique and total) words in the different extracted sections/combinations of both datasets. The full description is by far the largest individual section, followed by the claims. These two sections dominate the TIACL and TIACS combinations, respectively. However, for these sections there is a high degree of repetition for some words, as indicated by the number of unique words.

Figure 4c, d shows the distribution of words in the TIACL combination of both datasets. As in other large document collections, the word distribution in WIPO-alpha and WIPO-de follows Zipf’s law, with many words appearing in few documents and few words appearing in many documents. The number of words appearing in only one document (not included in the plots) is 1,758,164 in WIPO-alpha and 1,691,632 in WIPO-de. This can produce very large, uninformative vocabularies, and thus filtering uncommon words is recommended. Table 6 shows the vocabulary size for each section/combination of both datasets (extracted from the training part).

Table 5 Statistics on number of words in each section and combination from the WIPO-alpha and WIPO-de datasets
Table 6 Number of words in the vocabulary of each section of the WIPO-alpha and WIPO-de datasets

Experimental analysis

Use of the hierarchy

In our first experiment we compare the effect of using or not using the IPC hierarchy, and show the results in Table 7. For this, we use the WIPO-alpha dataset and the following setup: only the abstract section, a tf document representation, all the codes from patents and the four base classifiers. The first four rows in the table correspond to flat models and the last four to hierarchical models. Each row corresponds to one base classifier, with the optimal parameter (K or C) in parentheses. The columns show the results per level of the hierarchy (using the three performance metrics), and the training and test times (training time includes the cross-validation for finding the optimal parameter). We indicate in bold the best results for every metric and level. A superscript A indicates values that are not significantly different from the best value in a column.

Table 7 Performance comparison of the flat and the hierarchical approaches with the WIPO-alpha dataset

In this table, we observe that with both the flat and the hierarchical approach, the best values are obtained by models that use LR with the same optimal parameter C. At the class and subclass levels, the hierarchical approach obtains better or similar values than the flat approach, but at the main group level, the flat approach obtains better results. This is consistent with findings for other hierarchical structures, such as web documents (Bennett and Nguyen 2009), since in the hierarchical approach the errors are propagated and accumulated from top to bottom, and there is less information to discriminate with because of the use of local classifiers. When analyzing the classification performance of the individual base classifiers, the models can be ranked in both approaches from best to worst as: LR, SVM, NB and KNN. Regarding training/test times for the base classifiers, NB presents almost no difference between the flat and hierarchical approaches, whilst for KNN the test time is similar in both approaches, but the flat approach is twice as fast as the hierarchical one for training. For KNN, the training time corresponds to the cross-validation to find the optimal K. Most of the time in KNN is spent calculating distances, and in the hierarchical approach, due to the set of local classifiers, it computes more distances than in the flat approach. Finally, SVM and LR show the biggest difference in training time between approaches, with hierarchical models being trained around 45 times faster than flat models, whilst their test times are comparable between approaches. This makes evident the efficiency advantage of training local classifiers.

A first finding of this experiment is that the hierarchical approach to patent classification allows a faster training of models, but the flat approach creates models with a better accuracy at the bottom level of the hierarchy. We can then decide whether we want better computational efficiency or higher accuracy. A second finding is that our hierarchical approach, which builds local classifiers and computes the classification as a weighted linear combination of the decisions along the hierarchy, is well suited for the problem. A third finding is that LR is a consistently good base classifier with both the flat and hierarchical approaches.

Document representations and patent codes

In the second experiment, we compare the effect of using different document representations and of using all the codes from patents or only the main code during training. In this case, for efficiency reasons, we chose to build only hierarchical models. Tables 8 and 9 show the results of this experiment for WIPO-alpha and WIPO-de, respectively. For WIPO-alpha we use only the abstract section from patents; for WIPO-de we use only the claims section. The first 12 rows of each table correspond to the use of all the codes from patents, and the last 12 to the use of only the main code. Each row corresponds to one base classifier using a given document representation, with the optimal parameter in parentheses, and a superscript A indicating values that are not significantly different from the best value in a column.

Table 8 Performance comparison of different document representations and use of all codes or only main code with the WIPO-alpha dataset
Table 9 Performance comparison of different document representations and use of all codes or only main code with the WIPO-de dataset

Regarding document representations, in both datasets we observe that the tf–idf and entropy representations generally produce similar classification results, whether using all the codes or only the main code, and better results than tf. Both representations also generally yield better results than w2v, except with WIPO-de for some metrics when combined with LR. Models using tf–idf are generally the fastest to train, since entropy requires more processing time for transforming documents, and the base classifiers require more computation to fit models with tf and word2vec. In the case of w2v, the training times are much higher than for the other representations (except when combined with NB) because of the dense representation. The test times are similar for the same base classifier regardless of the document representation used, except when using word2vec together with KNN.

Concerning the use of all the codes or only the main code for training, in both datasets we observe that the use of all codes generally produces slightly better classification results, especially at the subclass and main group levels. The better performance of using all the codes at the bottom levels arises because there are more patents per category at those levels, whilst there is less chance of overlap between categories, since patents with several codes are distributed among more categories, and those categories can be far apart in the hierarchy. With respect to training time, models using only the main code are trained two to five times faster than models trained using all the codes, whilst their test times are similar.

Regarding the base classifier, with the WIPO-alpha dataset the best classification results come from models using LR, whether in combination with all the codes or only the main code. With WIPO-de, it is less clear whether there is a dominant base classifier, but LR still performs on top. The more competitive results among base classifiers in this case arise because there is more information in the claims section, which helps to build more robust classifiers.

A first finding of this experiment is that using either tf–idf or entropy for document representation is more convenient than tf or word2vec, since they produce better classification results; tf–idf should be preferred since its training and test processes are faster. A second finding is that using all the codes from patents to train models can produce slightly better results in some cases, but using only the main code produces good results at a fraction of the training time. We can then decide whether we want a slightly higher accuracy at a higher computational cost. A third finding is that LR is a consistently good classification model across document representations and uses of patent codes.

Using different patent sections

In the third experiment, we measure the effect of using the different patent sections/combinations for extracting information. For efficiency reasons, for this experiment we build models using the hierarchical approach, a tf–idf representation and only the main code from patents. Tables 10 and 11 show the results of this experiment for WIPO-alpha and WIPO-de, respectively. Each row corresponds to one base classifier using a given extracted section/combination, with the optimal parameter in parentheses and a superscript A indicating values that are not significantly different from the best value in a column.

Table 10 Performance comparison of different extracted patent sections/combinations with the WIPO-alpha dataset
Table 11 Performance comparison of different extracted patent sections/combinations with the WIPO-de dataset

In this experiment we observe two general trends in both datasets: the more information a section/combination contains (see Table 5), the better the classification results of the models that use it; and the more information a section has, the more expensive it is to train and test a model, with larger differences observed in training times. There are two exceptions to the first trend: first, using the title section produces better classification results than using the inventors section, and second, using the short description produces better results than using the claims section. We think these two cases are due to two different issues, both related to how the words are distributed over documents and categories. In the first case, an inventor is usually associated with a very small number of patents, usually in the same technological field. Nevertheless, there is the issue of name ambiguity: different inventors from different fields share the same names, and since there are many inventors and they can be associated with practically any category, a combination of inventor names as a whole is not a good descriptor of a specific category. In the case of the claims section, it uses a combination of technical words describing the invention and legal words describing the scope of the patent. The legal words are general concepts that are shared by many patents from different categories, and do not carry much information about a specific category. On the contrary, the short description consists of technical words describing the invention specifically, and these words are more associated with specific categories. Given these issues, an interesting research direction would be to apply methods that manipulate word features to increase the cohesion between documents from the same category, while at the same time increasing the separability between categories (Gomez and Moens 2010).

Regarding the base classifiers, we also observe a general trend in classification results for both datasets. Independently of the section used, the classifiers sorted in descending order of performance are LR, SVM, KNN and NB, although there are some cases where the order of SVM and LR is switched.

A first finding of this experiment is that, contrary to what other works in the literature have concluded, the full patent description contains relevant and discriminative information to build robust classification models, since using it (or a combination that includes it) produces the best classification results. A second finding is that the short description can be considered a better way to summarize a patent than the abstract section, since the former produces better classification results for all the metrics, with all the base classifiers, at all the levels. This section should be preferred over the abstract when building general patent classification models.

A general finding from all the experiments is that there are no major differences in the models’ classification performance with respect to the patents’ language, and the existing differences could be due to WIPO-de having fewer patents in the test set than WIPO-alpha. This is an indication that the classification models are language independent, and that their performance is more associated with the distribution of word features over documents and categories. A second general finding is that the discriminative classifiers (LR and SVM) tend to perform consistently better than the probabilistic classifier (NB) and the instance-based classifier (KNN). A final finding is that hyper-parameter optimization should be preferred over using the default parameter values to build consistent models.

Comparison with other methods in the literature

Considering the previous findings, we created a model that uses our hierarchical approach, a tf–idf representation, only the main code from patents, LR as the base classifier and the full description to extract the content. This model is fast to train because of the hierarchical approach and the tf–idf representation, includes enough information from the full description to build robust discriminative classifiers, and uses a base classifier that performs consistently well. In this way, it is expected to reach good classification results at a moderate training cost.

In Table 12 we compare the results of our model with those obtained by Fall and Benzineb (2002) and Seneviratne et al. (2015) (1 and 2 in the table) using WIPO-alpha, Fall et al. (2004) (3 in the table) using WIPO-de, and Tikk et al. (2005) (4 in the table) using both datasets. In the table we present the best results from the mentioned works.

Table 12 Performance comparison of most relevant works with results obtained by a model created with our methodology

The results in (1) and (3) were obtained following the same methodology: a flat approach, stemmed words, a tf representation and only the main category from patents. The authors experimented with a set of base classifiers: multinomial NB, KNN, linear SVM and SNoW in (1), and the same set without SNoW but with LLSF in (3); and with different patent sections: title, abstract, claims, full description and the first 300 words of the full description. They did not optimize the parameters of the classifiers, taking a value of \({K}=30\) for KNN and not mentioning the value of C. For the SVM classifier they selected a subset of 20,000 words in (1) and 50,000 in (3) using information gain, and limited the number of documents to 500 patents per class. Finally, they trained independent models at the class and subclass levels. In both works, the authors concluded that using the first 300 words from the description is the best option, whilst using the full patent description introduces noise and reduces the models’ performance. The results from (1) correspond to different classifiers; at the class level, in order of the metrics, these are SVM, NB and NB, and at the subclass level, SVM, KNN and SVM; whilst all the results from (3) correspond to LLSF.

The results in (2) were obtained with a modified KNN method using document signatures with a binary representation and a width of 4096 bits. The authors used a hierarchical approach and, similar to (1), a value of \({K}=30\) and the title or the first 300 words from the description to extract the content. They do not mention whether they used all the patent codes or only the main one. They concluded that the first 300 words produce the best results.

The results in (4) were obtained with HITEC (a back-propagation-based model) with the following setup: a hierarchical approach, stemmed word features and a frequency-based feature selection, eliminating words appearing in fewer than two patents or in more than 25% of patents. The authors experimented with document representations, combinations of patent sections and the use of all the codes or only the main code. The best results were obtained using an entropy representation, all the codes from patents, and a combination of the title, inventors and abstract. An additional conclusion they reached is that using only the main code, the full description and a tf–idf representation produces poor results.

The results obtained with our model are up to 10% better than the ones in (1) and (2) for WIPO-alpha. The improvement comes mainly from the selected patent section, the (optimized) base classifier and the document representation. When comparing with the results in (4), we observe that there is little difference in performance with respect to our model. The largest differences are at the subclass and main group levels with the three metric. The difference in performance seems to come from the different base classifiers and patent sections used. Additionally, according to our previous experiments, the flat approach in our model could potentially produce better results, but with an increase in computational cost.

With the WIPO-de dataset, the results of our model are between 1% and 4% better than the ones in (3) and (4) for several metrics at several levels, with the exception of the three metric at the subclass and main group levels. The differences seem to come from the selected patent section and base classifier.

Our findings partially contradict what other authors concluded. They indicate that the full description contains relevant and discriminative information for training classification models, that the tf–idf representation performs very similarly to entropy, and that the use of only the main code can perform similarly to using all the codes at some levels of the hierarchy, with the advantage of speeding up the training process.

Our experiments show that analyzing the data properties of the patent classification problem is important to determine the appropriate model depending on the goal, considering classification accuracy and computational efficiency. Some options help to build models that are slow to train but produce good classification results, while others produce slightly worse results at a fraction of the training time. Thus, the best options should be selected following an adequate methodology.

Conclusions

In this paper, we have presented a methodology to conduct a systematic experimental study on automated patent classification, in which we analyzed the effect on classification performance of three patent data properties in combination with two general text classification details. The properties we studied were the use of: the different sections to extract content from patents, the different codes assigned to each patent, and the hierarchy of categories. The two classification details we considered were: the way to represent documents and the base classifier. In our methodology, we trained and tested a diversity of models using different combinations of options for the mentioned properties. We compared the results of the models over two standard patent datasets, considering classification accuracy and computational efficiency. Additionally, we showed comparisons with other works in the literature, considering the methodologies, the models and the results.

We can draw several conclusions from our analysis. First, the flat approach produces better results than the hierarchical approach at the lowest level, but it is more expensive to train, especially for discriminative models such as SVM and LR. Second, the use of all the codes assigned to patents can produce better results at the lowest level than using only the main code, but it also increases the training time of a model. Third, the tf–idf and entropy document representations produce similar results, with both producing better results than the tf and w2v representations, whilst tf–idf is computed faster than entropy. Fourth, contrary to what several researchers claimed in previous works, the full description is a good source of discriminative information and yields some of the highest results in comparison with other individual sections. The disadvantage of using this section is the time and memory required for training a model, because of the large number of features. Nevertheless, the short description can be a good summary of the patent, and even if that section misses some relevant features, this could be alleviated by combining it with other sections, such as the abstract or the claims. Fifth, the discriminative classifiers (SVM and LR) perform better than the probabilistic (NB) and instance-based (KNN) classifiers. Sixth, it is important to perform hyper-parameter optimization to optimize performance. Finally, the models are language independent, and depend more on how the words are distributed over documents and categories.

When comparing the results of a model created with our methodology with the results from some reference works, we observed several details. First, there is additional confirmation that tf–idf and entropy perform better than tf and w2v. Second, discriminative classifiers (SVM, LR, back-propagation and LLSF) produce the top results and outperform probabilistic and instance-based classifiers. Third, the hierarchical approach seems to be a better option than the flat approach up to a certain level of the hierarchy, especially when using the full description and a tf–idf representation. Finally, our model produced results generally better than, or at least as good as, those of the other works. This means that an appropriate choice of options for the patent data properties is important to obtain a good classification performance, and that the best options should be selected following an appropriate methodology.

It is clear from the results obtained in this work, as well as from other works, that the automated classification of patents is still an open problem. The results at the lowest level of the hierarchy are still too low to be considered acceptable in a practical setting. Possible research directions include using other features besides word features, such as sentences or topic model representations, in order to include more semantic information from the textual content. Some works have already tried using phrases (D’hondt et al. 2013, 2014; Verberne et al. 2010; Verberne and D’hondt 2011), but the performance obtained is similar to or even worse than using word features. We thus believe further research is necessary. Another direction could be the study of code propagation between documents that are closely related in the hierarchical structure (Rossi et al. 2016). Finally, it would also be interesting to study feature selection methods (Lamirel et al. 2015) that find the features that are highly associated with specific categories, maximizing the intra-document similarity and minimizing the inter-category similarity (Gomez and Moens 2010).