
1 Introduction

zbMATH (see Footnote 1) classified more than 135k articles in 2019 using the Mathematics Subject Classification (MSC) scheme [6]. With more than 6,600 MSC codes, this classification task requires significant in-depth knowledge of various sub-fields of mathematics to determine fitting MSC codes for each article. In short, the classification procedure of zbMATH and Mathematical Reviews (MR) is two-fold. First, all articles are pre-classified into one of 63 primary subjects, ranging from general topics in mathematics (00) through integral equations (45) to mathematics education (97). In a second step, subject editors assign fine-grained MSC codes in their area of expertise, among other purposes to match potential reviewers.

The automated assignment of MSC labels was analyzed by Rehurek and Sojka [9] in 2008 on the DML-CZ [13] and NUMDAM [3] full-text corpora. They report a micro-averaged \(F_1\) score of 81% for their public corpus. In 2013, Barthel, Tönnies, and Balke performed automated subject classification for parts of the zbMATH corpus [2]. They criticized the micro-averaged \(F_1\) measure, especially when the average is computed only over the best performing classes; nevertheless, they report a micro-averaged \(F_1\) score of \(67.1\%\) for the zbMATH corpus. They suggested training classifiers for a precision of \(95\%\) and assigning MSC class labels in a semi-automated recommendation setup, and they proposed measuring the human baseline (inter-annotator agreement) for the classification task. Moreover, they found that combining mathematical expressions with textual features substantially improves the \(F_1\) score for certain MSC classes. In 2014, Schöneberg and Sperber [11] implemented a method that combined formulae and text using an adapted part-of-speech tagging approach. Their paper reported a sufficient precision of \({>}.75\), but did not state the recall. The proposed method was implemented and is currently being used, especially to pre-classify general journals [7], together with additional information such as references. For a majority of journals, coarse- and fine-grained codes can be found by statistically analyzing the MSC codes of referenced documents matched within the zbMATH corpus. The editors of zbMATH hypothesize that this reference-based method outperforms the algorithm developed by Schöneberg and Sperber. Confirming or rejecting this hypothesis was one motivation for this project.

The positive effect of mathematical features is confirmed by Suzuki and Fujii [15], who measured classification performance on an arXiv and a MathOverflow dataset. In contrast, Scharpf et al. [10] could not measure a significant improvement in classification accuracy on the arXiv dataset when incorporating mathematical identifiers. In their experiments, Scharpf et al. evaluated numerous machine learning methods, extending [4, 14] in terms of accuracy and run-time performance, and found that complex, compute-intensive neural networks do not significantly improve the classification performance.

In this paper, we focus on the coarse-grained classification of the primary MSC subject number (pMSCn) and explore how current machine learning approaches can be employed to automate this process. In particular, we compare the current state-of-the-art technology [10] with a part-of-speech (POS) preprocessing based system from 2014 that was customized for the application in zbMATH [11].

We define the following research questions:

  1. Which evaluation metrics are most useful to assess the classifications?

  2. Do mathematical formulae as part of the text improve the classifications?

  3. Does POS preprocessing [11] improve the accuracy of classifications?

  4. Which features are most important for accurate classification?

  5. How well do automated methods perform in comparison to a human baseline?

Fig. 1. Workflow overview.

2 Method

To investigate the given set of problems, we first created test and training datasets. We then examined the different pMSCn encodings, trained our models, and evaluated the results, cf. Fig. 1.

2.1 Generation of a Test and Training Dataset

Filter Current High Quality Articles: The zbMATH database has assigned MSC codes to more than 3.6 M articles. However, the way in which mathematical articles are written has changed over the last century, and the classification of historic articles is not something we aim to investigate in this article. The first MSC was created in 1990 and has since been updated every ten years (2000, 2010, and 2020) [5]. With each update, automated rewrite rules are applied to map the codes from the old MSC version to the new one, which entails a loss of accuracy of the class labels. To obtain a coherent and high-quality dataset for training and testing, we focused on the more recent articles from 2000 to 2019, which were classified using MSC version 2010, and we only considered selected journals (see Footnote 2). Additionally, we restricted our selection to English articles and limited ourselves to abstracts rather than reviews of articles. To be able to compare reference-based methods with methods using text and title, we only selected articles with at least one reference that could be matched to another article. In addition, we excluded articles that were not yet published and processed. The list of articles is available from our website: https://automsceval.formulasearchengine.com.
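For illustration, such a filter could be expressed along the following lines; this is a minimal sketch assuming a pandas DataFrame whose column names (year, msc_version, language, matched_refs, status) are hypothetical and do not reflect the actual zbMATH schema.

```python
import pandas as pd

# Hypothetical export of zbMATH metadata; column names are illustrative only.
articles = pd.read_csv("zbmath_articles.csv")

selected = articles[
    articles["year"].between(2000, 2019)      # recent articles only
    & (articles["msc_version"] == 2010)       # classified with MSC version 2010
    & (articles["language"] == "English")     # English articles only
    & (articles["matched_refs"] >= 1)         # at least one matched reference
    & (articles["status"] == "published")     # already published and processed
]
```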

Splitting into Test and Training Sets: After applying the filter criteria mentioned above, we split the resulting list of 442,382 articles into a test and a training set. For the test set, we aimed to measure the bias of our zbMATH classification labels. Therefore, we used the articles for which we knew the classification labels assigned by the MR service, from a previous research project [1], as the test set. The resulting test set consisted of \(n = 32,230\) articles, and the training set contained 410,152 articles. To ensure that this selection did not introduce additional bias, we also computed the standard ten-fold cross-validation, cf. Sect. 3.

Definition of the Article Data Format: To allow for reproducibility, we created a dedicated dataset from our article selection, which we aim to share with other researchers. Currently, however, legal restrictions apply, and the dataset cannot yet be provided for anonymous download. We can nevertheless grant access for research purposes, as done in the past [2]. Each of the 442,382 articles in the dataset contains the following fields:

  • de. An eight-digit ID of the document (see Footnote 3).

  • labels. The actual MSC codes (see Footnote 3).

  • title. The English title of the document, with LaTeX macros for mathematical language [12].

  • text. The text of the abstract with LaTeX macros.

  • mscs. A comma-separated list of MSC codes generated from the references.

These five fields were provided as CSV files to the algorithms. The mscs field was generated as follows: for each reference in the document, we looked up the MSC codes of the referenced article. For example, if a certain document contains the references \(A,B,C\) that are also documents in zbMATH, and the MSC codes of \(A,B,C\) are \(a_{1}\) and \(a_{2}\), \(b_{1}\), and \(c_{1} - c_{3}\), respectively, then the field mscs reads \(a_{1}a_{2},b_{1},c_{1}c_{2}c_{3}\).
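As an illustration, the mscs field could be assembled as in the following sketch; the lookup dictionary ref_mscs is hypothetical and stands in for the actual zbMATH reference matching.

```python
def build_mscs_field(reference_ids, ref_mscs):
    """Concatenate the MSC codes of each matched reference.

    Codes belonging to the same reference are concatenated directly,
    and references are separated by commas, as in the example above.
    """
    groups = []
    for ref_id in reference_ids:
        codes = ref_mscs.get(ref_id, [])
        if codes:                              # skip references without MSC codes
            groups.append("".join(codes))
    return ",".join(groups)


# Example mirroring the text: references A, B, C with codes a1 a2, b1, c1-c3.
ref_mscs = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1", "c2", "c3"]}
print(build_mscs_field(["A", "B", "C"], ref_mscs))   # -> "a1a2,b1,c1c2c3"
```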

After training, we required each of our tested algorithms to return the following fields in CSV format for the test sets:

  • de (integer). Eight-digit ID of the document.

  • method (char(5)). Five-letter ID of the run.

  • pos (integer). Position in the result list.

  • coarse (integer). Coarse-grained MSC subject number.

  • fine (char(5), optional). Fine-grained MSC code.

  • score (numeric, optional). Self-confidence of the algorithm about the result.

We ensured that the fields de, method and pos form a primary key, i.e., no two entries in the result can have the same combination of values. Note that for the current multi-class classification problem, pos is always 1, since only the primary MSC subject number is considered.
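A minimal sketch of the expected result format and of a check for the (de, method, pos) primary-key constraint; the example values and the file name are illustrative only.

```python
import pandas as pd

# Two illustrative result rows in the required output format.
results = pd.DataFrame([
    {"de": 12345678, "method": "titer", "pos": 1, "coarse": 68, "fine": "68T50", "score": 0.83},
    {"de": 87654321, "method": "titer", "pos": 1, "coarse": 35, "fine": None,    "score": 0.61},
])

# de, method, and pos must form a primary key, i.e. no duplicate combinations.
assert not results.duplicated(subset=["de", "method", "pos"]).any()

results.to_csv("results_titer.csv", index=False)
```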

2.2 Definition of Evaluation Metrics

While the assignment of all MSC codes to each article is a multi-label classification task, the assignment of the primary MSC subject, which we investigate in this paper, is only a multi-class classification problem. With \(k = 63\) classes, the probability \(P_{i} = \frac{c_{i}}{n}\) of randomly choosing the correct class \(i\) of size \(c_{i}\) is rather low. Moreover, the dataset is not balanced. The entropy \(H = - \sum _{i = 1}^{k}P_{i}\log P_{i}\) can be used to measure the imbalance via \(\widehat{H} = \frac{H}{\log k}\), i.e., by normalizing it to the maximum entropy \(\log k\).
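A small helper, shown here as a sketch, computes this normalized entropy directly from the class counts (natural logarithms, consistent with the value reported below):

```python
import numpy as np


def normalized_entropy(class_counts):
    """Return H / log(k) for the sizes c_1, ..., c_k of the non-empty classes."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()            # class probabilities P_i = c_i / n
    h = -np.sum(p * np.log(p))           # entropy H
    return h / np.log(len(counts))       # normalized by the maximum entropy log k
```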

To take into account the imbalance of the dataset, we used weighted versions of precision \(p\), recall \(r\), and the \(F_{1}\) measure \(f\). In particular, the weighted precision is \(p = \frac{\sum _{i = 1}^{k}c_{i}p_{i}}{n}\), where \(p_{i}\) denotes the precision of class \(i\); \(r\) and \(f\) are defined analogously.
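In Scikit-learn, these class-size-weighted scores correspond to average="weighted", as in the following sketch with illustrative label lists:

```python
from sklearn.metrics import precision_recall_fscore_support

# Illustrative labels: y_true are the zbMATH pMSCn, y_pred the predicted ones.
y_true = [68, 35, 5, 68, 81]
y_pred = [68, 35, 68, 68, 83]

# average="weighted" weights each class score by its support c_i,
# matching the weighted p, r, and f used in this paper.
p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
```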

In the test set, no entries for the pMSCn 97 (Mathematics education) were included, thus

$$\begin{aligned} \widehat{H} = \frac{H}{\log k} = \frac{3.44}{\log 62} = .83 \end{aligned}$$

Moreover, we eliminated the effect of classes with only a few samples by disregarding all classes with fewer than 200 entries. While pMSCn with few samples have little effect on the average metrics, their individual values are distracting in plots and data tables. Choosing 200 as the minimum evaluation class size reduces the number of effective classes to \(k = 37\), which has only a minor effect on the normalized entropy, raising it to \(\widehat{H} = .85\). The chosen value of 200 can be adjusted interactively in the dynamic result figures we made available online (see Footnote 4). Additionally, the individual values of \(P_{i}\) used to calculate \(H\) are given in column p of the table on that page. As one can verify in the online version of the figures, the impact of the choice of the minimum class size is insignificant.
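The corresponding restriction of the evaluation to classes with at least 200 test samples can be sketched as follows:

```python
from collections import Counter


def restrict_to_large_classes(y_true, y_pred, min_size=200):
    """Drop all samples whose true class has fewer than min_size test entries."""
    counts = Counter(y_true)
    kept = {c for c, n in counts.items() if n >= min_size}
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in kept]
    return [t for t, _ in pairs], [p for _, p in pairs]
```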

2.3 Selection of Methods to Evaluate

In this paper, we compare 12 different methods for (automatically) determining the primary MSC subject in the test dataset:

  • zb1 Reference MSC subject numbers from zbMATH.

  • mr1 Reference MSC subject numbers from MR.

  • titer Following recent research performed on the arXiv dataset [10], we chose a machine learning method with a good trade-off between speed and performance (a sketch of this setup is shown after this list). We combined the title, abstract text, and reference mscs of the articles via string concatenation. We encoded these string sources using the TfidfVectorizer of the Scikit-learn (see Footnote 5) Python package. We did not alter the utf-8 encoding and did not perform accent stripping or other character normalization methods, with the exception of lower-casing. Furthermore, we used the word analyzer without a custom stop word list, selecting tokens of two or more alphanumeric characters, processing unigrams, and ignoring punctuation. The resulting vectors consisted of float64 entries with l2-normalized output rows. This encoder was fitted on the training set and subsequently used to transform (vectorize) the sources from the test set. As classifier, we chose a lightweight LogisticRegression classifier from Scikit-learn. We employed the l2 penalty norm with a \(10^{- 4}\) tolerance stopping criterion and a regularization constant of 1.0. Furthermore, we allowed intercept constant addition and scaling, but set no class weights or custom random state seed. We fitted the classifier using the lbfgs (Limited-memory BFGS) solver with at most 100 iterations until convergence. These choices were based on a previous study in which we clustered arXiv articles.

  • refs Same as titer, but using only the mscs as input (see Footnote 6).

  • titls Same as titer, but using only the title as input (see Footnote 6).

  • texts Same as titer, but using only the text as input (see Footnote 6).

  • tite Same as titer, but without using the mscs as input (see Footnote 6).

  • tiref Same as titer, but without using the abstract text as input (see Footnote 6).

  • teref Same as titer, but without using the title as input (see Footnote 6).

  • ref1 We used a simple SQL script to suggest the most frequent primary MSC subject based on the mscs input (see the sketch after this list). This method is currently used in production to estimate the primary MSC subject.

  • uT1 We adjusted the Java program posLingue [11] to read from the new training and test sets. We did not perform new training, however, and instead reused the model that was trained in 2014. For this run, we removed all mathematical formulae from the title and the abstract text to generate a baseline.

  • uM1 The same as uT1, but in this instance we included the formulae. We slightly adjusted the formula detection mechanism, since the way in which formulae are written in zbMATH had changed [12]. This method is currently used in production for articles that do not have references with resolvable mscs.
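The following sketch illustrates the titer setup and the idea behind the ref1 majority vote. It mirrors the parameter choices described above (most of which coincide with the Scikit-learn defaults) but is not the actual production code; the helper names, the use of the DataFrame columns from Sect. 2.1, and the tie-breaking rule in the majority vote are assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def make_titer_classifier():
    """TF-IDF encoding followed by logistic regression, as described for titer."""
    vectorizer = TfidfVectorizer(
        lowercase=True,                   # lower-casing only, no accent stripping
        analyzer="word",                  # word tokens of two or more alphanumeric characters
        token_pattern=r"(?u)\b\w\w+\b",
        ngram_range=(1, 1),               # unigrams
        norm="l2",                        # l2-normalized float64 output rows
    )
    classifier = LogisticRegression(
        penalty="l2",                     # l2 penalty norm
        tol=1e-4,                         # tolerance stopping criterion
        C=1.0,                            # regularization constant
        fit_intercept=True,               # intercept constant addition and scaling
        solver="lbfgs",                   # Limited-memory BFGS
        max_iter=100,                     # at most 100 iterations until convergence
    )
    return make_pipeline(vectorizer, classifier)


def titer_features(df: pd.DataFrame) -> pd.Series:
    """Concatenate title, abstract text, and reference mscs into one string source."""
    return df["title"] + " " + df["text"] + " " + df["mscs"]


def ref1_majority_vote(reference_codes) -> int:
    """Naive ref1 baseline: the most frequent coarse MSC subject among the
    MSC codes of the matched references (a Python rendering of the idea
    behind the SQL script; the tie-breaking here is arbitrary)."""
    coarse = [int(code[:2]) for code in reference_codes]   # first two digits = pMSCn
    return max(set(coarse), key=coarse.count)


# Hypothetical usage with the training and test DataFrames from Sect. 2.1
# (the column "primary_msc" holding the pMSCn labels is an assumption):
# model = make_titer_classifier()
# model.fit(titer_features(train_df), train_df["primary_msc"])
# predictions = model.predict(titer_features(test_df))
# ref1_majority_vote(["68T50", "68Q32", "05C80"])  # -> 68
```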

3 Evaluation and Discussion

After executing each of the methods described in the previous section, we calculated the precision \(p\), recall \(r\), and \(F_{1}\) score \(f\) for each method, cf. Table 1. Overall, we found that the results are similar regardless of whether we used zbMATH or MR as the baseline in our evaluation. Therefore, we use zbMATH as the reference for the remainder of the paper. All data, including the test results using MR as the baseline, are available from: https://automsceval.formulasearchengine.com.

Table 1. Precision p, recall r and \(F_1\)-measure f with regard to the baseline zb1 (left) and mr1 (right).
Fig. 2. Mathematical symbols in title and abstract text do not improve the classification quality (method uT1 = left bar; method uM1 = right bar).

Effect of Mathematical Expressions and Part-of-Speech Tags: By filtering out all mathematical expressions in the current production method (uT1), in contrast to uM1, we could assess the impact of mathematical expressions on classification quality. We found that the overall \(F_{1}\) score without mathematical expressions, \(f_{uT1} = 64.5\%\), is slightly higher than the score with mathematical expressions, \(f_{uM1} = 64.4\%\). Here, the main effect is an increase in recall from \(63.9\%\) to \(64.2\%\). Additionally, a class-wise investigation showed that uT1 outperformed uM1 for most classes, cf. Fig. 2. Exceptions are pMSCn 46 (Functional analysis) and 17 (Nonassociative rings and algebras), where the inclusion of math tags raised the \(F_{1}\) score slightly.

We evaluated the effect of part-of-speech (POS) tagging by comparing tite with uM1: \(f_{\mathrm {tite}} = .713\) clearly outperformed \(f_{uM1} = .64\). This held true for all MSC subjects, cf. Fig. 3. We also modified posLingue to output the POS-tagged text, used this text as input, and retrained the Scikit-learn classifier (run tite2). However, this method did not lead to better results than tite.

Fig. 3. Part-of-speech tagging for mathematics does not improve the classification quality (method uM1 = left bar; method tite = right bar).

Effect of Features and Human Baseline: The newly developed method [10] works best in a combined approach, titer, that uses title, abstract text, and references: \(f_{\mathrm {titer}} = 77.3\%\). This method performs significantly better than methods that omit any one of these features. The best performing single-feature method was refs (\(f_{\mathrm {refs}} = 74.6\%\)), followed by texts (\(f_{\mathrm {texts}} = 69.9\%\)) and titls (\(f_{\mathrm {titls}} = 62.3\%\)). Thus, automatically generating the MSC subject while including the references appears to be a very valuable strategy. This also becomes evident when comparing the scores of approaches that consider only two features. The approaches that excluded the title (teref, \(f_{\mathrm {teref}} = 77\%\)) or the abstract text (tiref, \(f_{\mathrm {tiref}} = 76\%\)) performed notably better than the approach that excluded the reference mscs (tite, \(f_{\mathrm {tite}} = 71.3\%\)). However, it is also worth pointing out that the naive reference-based method ref1 (\(f_{\mathrm {ref1}} = 65.2\%\)), which is currently being used in production, still performs more poorly than tite, even though the latter ignores references. In conclusion, training a machine learning algorithm that weights all information from the fine-grained MSC codes is clearly better than a majority vote over the references, cf. Fig. 4.

Fig. 4. The machine learning method (refs, left) clearly outperforms the current production method (ref1, right), both using references as the only source for classification.

Fig. 5. For many pMSCn, the best automatic method (titer, right) gets close to the performance of the human baseline (mr1, left).

Even the best performing machine learning algorithm, titer with \(f_{\mathrm {titer}} = 77.3\%\), is worse than the classification by human experts from MR, the other mathematics publication reviewing service, which yields a baseline of \(f_{mr1} = 81.2\%\). However, there is no ground truth that would allow us to determine which of the primary MSC subjects, from MR or from zbMATH, are truly correct. Assigning a two-digit label to mathematical research papers, which often cover overlapping themes and topics within mathematics, remains a challenge even for humans, who struggle to conclusively label publications as belonging to only a single class. While expert agreement is very high for some classes, e.g., \(89.1\%\) for class 20, for other classes, such as 82, agreement is only at \(47.6\%\) in terms of the \(F_{1}\) score, cf. Fig. 5. These discrepancies reflect the intrinsic problem that mathematics cannot be fully captured by a hierarchical classification system. The differences in classification between the two reviewing services likely also reflect an emphasis on different facets of evolving research, which often derives from differences in reviewing culture.

We also investigated the bias introduced by the non-random split into training and test sets. Performing ten-fold cross-validation on the entire dataset yielded \(f_{titer,10} = .776\) with a standard deviation of \(\sigma _{titer,10} = .002\). Thus, the test set selection does not introduce a significant bias.
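A sketch of this check, assuming the pipeline from the earlier snippet; using the weighted \(F_1\) scorer is our assumption about the exact metric.

```python
from sklearn.model_selection import cross_val_score


def tenfold_weighted_f1(pipeline, X, y):
    """Ten-fold cross-validation with the weighted F1 score.

    pipeline: e.g. make_titer_classifier() from the earlier sketch,
    X: concatenated title/abstract/mscs strings, y: primary MSC subjects.
    """
    scores = cross_val_score(pipeline, X, y, cv=10, scoring="f1_weighted")
    return scores.mean(), scores.std()
```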

Fig. 6. Confusion matrix for titer.

Fig. 7. Precision-recall curve for titer.

After having discussed the strengths and weaknesses of the individual methods tested, we now discuss how the currently best-performing method, titer, can be improved. One standard tool for analyzing misclassifications is a confusion matrix, cf. Fig. 6. In this matrix, large off-diagonal elements indicate that two classes are often confused by the classification algorithm. The x-axis shows the true labels, while the y-axis shows the predicted labels. The most frequent error of titer was that 68 (Computer science) was classified as 05 (Combinatorics). Moreover, 81 (Quantum theory) and 83 (Relativity and gravitational theory) were often mixed up.
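The most frequent confusions can be extracted from such a matrix, for example with the following sketch:

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def most_confused_pairs(y_true, y_pred, top=3):
    """Return the largest off-diagonal entries as (count, true pMSCn, predicted pMSCn)."""
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    np.fill_diagonal(cm, 0)                   # keep misclassifications only
    pairs = [
        (cm[i, j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if cm[i, j] > 0
    ]
    return sorted(pairs, reverse=True)[:top]  # e.g. (count, 68, 5) for 68 classified as 05
```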

In general, however, the number of misclassifications was small, and there was no immediate action, short of involving a human expert, that could be taken to avoid these specific misclassifications.

Since titer outperforms both the text-based and the reference-based methods currently used in zbMATH, we decided to develop a RESTful API that wraps our trained model into a service. We use Python's FastAPI under Uvicorn to handle higher loads. Our system is available as a Docker container and can thus be scaled on demand. To simplify development and testing, we provide a static HTML page as a micro UI, which we call AutoMSC. This UI lists not only the most likely primary MSC subjects but also the less likely MSC subjects. We expect that our UI can support human experts, especially whenever the most likely MSC subject seems unsuitable. The result is displayed as a pie chart, cf. Fig. 8, at https://automscbackend.formulasearchengine.com. To use the system in practice, an interface to the citation-matching component of zbMATH would be desirable, so that the actual references can be pasted rather than the MSC subjects extracted from them. Moreover, the precision-recall curve for titer (Fig. 7) suggests that one can also select a threshold for falling back to manual classification. For instance, if one requires a precision as high as that of the human classifications by MR, one would need to consider only suggestions with a score \({>}0.5\). This would automatically classify \(86.2\%\) of the 135k articles that are annually classified by subject experts at zbMATH/MR and thus significantly reduce the number of articles that humans must manually examine, without a loss of classification quality. This is something we might develop in the future.
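A minimal sketch of such a service wrapper, assuming a pickled Scikit-learn pipeline; the endpoint name, the request fields, and the model file name are hypothetical and do not describe the actual AutoMSC API. The service would be run under Uvicorn (e.g. uvicorn automsc_service:app).

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical path to the trained titer pipeline from the earlier sketch.
with open("titer_model.pkl", "rb") as f:
    model = pickle.load(f)


class Article(BaseModel):
    title: str
    text: str
    mscs: str = ""        # reference MSC codes; may be empty


@app.post("/classify")    # hypothetical endpoint name
def classify(article: Article, top: int = 5):
    """Return the most likely primary MSC subjects with their self-confidence scores."""
    source = f"{article.title} {article.text} {article.mscs}"
    probabilities = model.predict_proba([source])[0]
    ranked = sorted(zip(model.classes_, probabilities), key=lambda x: -x[1])[:top]
    # Returning the scores allows a caller to fall back to manual
    # classification below a chosen threshold (e.g. score <= 0.5).
    return [{"coarse": int(c), "score": float(p)} for c, p in ranked]
```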

4 Conclusion and Future Work

Returning to our research questions, we summarize our findings as follows: First, we asked which metrics are best suited to assess classification quality. We demonstrated that the classification quality for the primary MSC subject can be evaluated with classical information retrieval metrics such as precision, recall, and \(F_{1}\) score. We share the observation of Barthel, Tönnies, and Balke [2] that the averages do not reflect the performance of outliers, cf. Figs. 1, 2, 3 and 4. However, for our methods, the difference between the best and worst performing classes was significantly smaller than reported by [2].

Second, we wanted to find out whether taking into account the mathematical formulae contained in publications could improve the accuracy of classifications. In accordance with [10], we did not find evidence that mathematical expressions improve pMSCn classification. However, we did not evaluate advanced encodings of mathematical formulae. This will be a subject of future work, cf. Fig. 1.

Third, we evaluated the effect of POS preprocessing [11] and found that modern machine learning methods do not benefit from the POS-tagging based model developed by [11], cf. Fig. 2.

Fourth, we evaluated which features are most important for an accurate classification. We conclude that the references have the highest predictive power, followed by the abstract text and the title.

Finally, we evaluated the performance of automated methods in comparison to a human baseline. We found that our best performing method achieves an \(F_1\) score of 77.2%. The manual classification is significantly better for most classes, cf. Fig. 4. However, the self-reported score can be used to reduce the manual classification effort by 86.2% without a loss in classification quality.

In the future, we plan to extend our automated methods to predict full MSC codes. Moreover, we would like to be able to assign pMSCn to document sections, since we realize that some research simply does not fit into one of the classes. We also plan to extend the application domain to other mathematical research artifacts, such as blog posts, software, or dataset descriptions. As a next step, we plan to generate pMSCn from authors using the same methods we applied to references. We speculate that authors will have a high impact on the classification, since authors often publish in the same field. For this purpose, we are leveraging our prior research on affiliation disambiguation, which could be used as a fallback method for junior authors who have not yet established a track record. Another extension is a better combination of the different features. Especially when researching full MSC code generation, we will need a different encoding for the MSC codes from references and authors. However, this new encoding requires more main memory for training the model and cannot be done on a standard laptop. Thereafter, we will re-investigate the impact of mathematical formulae, since the inherently combined representation of text and formulae was not successful.

Fig. 8. Classification frontend.

Our work represents a further step in the automation of Mathematics Subject Classification and can thus support reviewing services such as zbMATH or Mathematical Reviews. For accessible exploration, we have made the best-performing approaches available in our AutoMSC implementation and have shared our code on our website. We envision that other application domains that require accurate labeling of publications with their respective Mathematics Subject Classification, for example research paper recommendation systems or reviewer recommendation systems, will also benefit from this work. AutoMSC delivers results comparable to human experts in the first stage of MSC labeling, without requiring manual labor or trained experts. In the future, zbMATH will use our new method for all journals that previously employed the method by Schöneberg and Sperber [11], introduced in 2014.