1 Introduction

Historians and philosophers of mathematics are developing an ever-increasing interest in the role that diagrams play in mathematical reasoning and research practice [2,3,4,5,6]. This line of research has been highly successful in unearthing the multi-faceted and complex roles which diagrams play in mathematics; and yet the philosophical study of mathematical diagrams still largely lacks quantitative data providing vital background information for the qualitative investigation of selected cases. Recently, a quantitative approach has led to new insights into the development of the use of diagrams over the twentieth century [9]. Among the findings of that study is an apparent ‘valley’ in the use of diagrams, which seems to coincide with the rise of Bourbaki-style formalistic styles in mathematics during the mid-20th century [8, 9].

Despite their obvious interest to the historian and the philosopher, even these quantitative studies are based on a sample from only three journals and only include volumes at five-year intervals. Judging from these investigations, the major limiting factor in large-scale quantitative investigations of diagrams is the huge amount of manual labour required to identify and code diagrams by hand. Thus, to substantiate and expand the quantitative approach, an automated procedure is required to count (and subsequently classify and analyse) diagrams in mathematical texts. To this end, recent developments in machine learning may be able to lend a hand to the historian and the philosopher of mathematical practice.

In this paper, we report on our construction of a machine learning system for automated detection of mathematical diagrams. Without providing the system with any definition of a mathematical diagram, we trained an object detector by feeding it instances of diagrams from a (relatively) small set of mathematical papers. Upon iterated training, our detector was able to predict diagrams outside its training base with a (to us) surprising accuracy and precision.

We open the paper by describing how we trained the system, and we report basic measurements of its accuracy. In the final section of the paper, we discuss how an automatic diagram detector may contribute to our philosophical and historical understanding of mathematics. There, we argue that the existence of such a system opens a variety of new philosophical research questions concerning the role and diversity of mathematical diagrams which it has hitherto not been feasible to pursue.

2 Methods

Any object detector involves a number of crucial choices: which model (and implementation) to use, and how to build a good training set for the task at hand. We chose to build our diagram detector on one of the well-known existing object-detection models based on region-based convolutional neural networks, known as Fast R-CNN [7], implemented in the Keras framework and publicly available [1]. And we chose to build our training set from diagrams found in the volumes of the Journal für die reine und angewandte Mathematik, colloquially known as Crelle’s Journal after its first editor. The volumes of Crelle’s Journal published from its inception in 1826 until 1998 are available at the SUB Göttinger Digitalisierungszentrum, providing us with more than 130,000 pages of mathematical text spanning the twentieth century and more. What we will refer to as the object detector or the model is thus the implementation of the framework plus a given, very large matrix of weights (approximately 100 MB) representing the parameters of the model.
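
As a concrete illustration of this division between framework and weights, the following minimal Python sketch shows how we conceive of running the detector over a directory of scanned pages. The model object and its predict method are hypothetical simplifications standing in for the publicly available keras-frcnn code [1], not its actual interface.

    from pathlib import Path

    def run_detector(model, page_dir, score_threshold=0.8):
        # 'model' stands for the framework implementation loaded with one
        # particular set of weights; different weight files give different models.
        # model.predict is a hypothetical wrapper, not the keras-frcnn API.
        detections = {}
        for page in sorted(Path(page_dir).glob("*.png")):
            boxes = [b for b in model.predict(str(page)) if b[4] >= score_threshold]
            if boxes:
                detections[page.name] = boxes  # boxes as (x1, y1, x2, y2, score)
        return detections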

It is a real feat of the training process that we need not give a single exhaustive definition of a mathematical diagram, as such a definition is notoriously difficult to come up with. Standard definitions include aspects such as 1. being essentially two-dimensional [10], and possibly 2. being intended to provide certain types of cognitive aid in mathematical reasoning [8]. Mathematical practice, however, does not follow such rules consistently. Matrices, for instance, are generally not considered to be diagrams although they are two-dimensional, whereas Dynkin diagrams are considered to be diagrams even in cases where they are one-dimensional. For pragmatic reasons we combined these criteria in our code-book and considered (roughly speaking) a diagram to be a two-dimensional representation generally considered to be a diagram by mathematicians.

As is always the case with supervised machine learning, the quality of the detector depends on the quality of the training set. Thus, the practice-near definition of mathematical diagrams enters our detector through the code-books used in tagging the training set.

During training, the detector went through a number of iterations, refining models through exposure to both true positives and false positives (see Fig. 1). Training by true positives provides the detector with (ideally) varied examples of what counts as a mathematical diagram. This input is produced by human tagging of diagrams in selected parts of the corpus. For each iteration we selected a subset of the corpus \(X_i\), found all the pages on which diagrams occur, and identified the rectangles bounding the diagrams \(P_i\). To balance the identification of diagrams by ruling out false positives (here called background), we implemented a bootstrapping mechanism sometimes referred to as negative mining: if we let the model perform predictions on all pages in \(X_i\) for which there is no true positive identified in \(P_i\), we know that any box identified as a diagram is a false positive. These boxes, collected as \(N_i\), can then be fed into the training of the next model as background. Thus, the training of a model builds upon the weights of the previous model together with sets of boxes of true and false positives.
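
The negative-mining step can be summarised in a short sketch, assuming a model object with a hypothetical predict method returning bounding boxes; the data structures are our simplifications, not the actual pipeline code.

    def mine_negatives(model, pages_X_i, true_positives_P_i):
        # Pages of X_i with no hand-tagged diagram cannot contain a true positive,
        # so every box the current model predicts there is a false positive.
        # These boxes form N_i, fed to the next round of training as background.
        negatives_N_i = {}
        for page in pages_X_i:
            if true_positives_P_i.get(page):  # skip pages with hand-tagged diagrams
                continue
            boxes = model.predict(page)       # hypothetical detector interface
            if boxes:
                negatives_N_i[page] = boxes
        return negatives_N_i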

Fig. 1. Illustration of the process of training and validating the models. Boxes are corpora, ellipses are models, circles are sets of boxes (either green true positives or red false positives), the diamond is active learning, yellow indicates human interaction, blue indicates prediction. H is hand-tagged, and the comparison of B and H amounts to the validation process discussed below. (Color figure online)

After we obtained Model 3, results were sufficiently good that we could apply a different method of training, a variant of the process known as active learning, in which predictions made by the model are fed to an oracle (a human) who classifies them as true or false positives. Running Model 3 on the entire corpus from Crelle’s Journal (all 130,000 pages) produced 8,700 predicted boxes, which were inspected, labeled, and corrected where needed by a human agent. Together with the previously tagged true positives, these provided the training set for Model 4, which is the present culmination of our training process.
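
In outline, one round of this active-learning procedure looks as follows; the oracle callable stands for the human agent, and the interfaces are again our hypothetical simplifications rather than the actual pipeline code.

    def active_learning_round(model, corpus_pages, oracle):
        # The model proposes boxes over the whole corpus; the human oracle
        # inspects each proposal, correcting its bounds or rejecting it outright.
        # Confirmed (possibly corrected) boxes join the next model's training set.
        confirmed, rejected = {}, {}
        for page in corpus_pages:
            for box in model.predict(page):      # hypothetical interface
                verdict = oracle(page, box)      # corrected box, or None if false positive
                if verdict is None:
                    rejected.setdefault(page, []).append(box)
                else:
                    confirmed.setdefault(page, []).append(verdict)
        return confirmed, rejected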

3 Results

The model was implemented under Linux Ubuntu 18.04 and run on a computer with a CPU and an NVIDIA GPU. Run-time was a real bottleneck, both in training new iterations of the model and in running predictions on large corpora of texts. As the system is small and somewhat dated, this could be mitigated by using more modern and larger hardware.

When we ran predictions with Model 3 on the remaining corpus from Crelle’s Journal, which was not used in training Model 3, we were quite surprised at the rates of true positives and true negatives; in other words, the detector was surprisingly reliable at predicting diagrams precisely when they were indeed present (see Fig. 2).

Fig. 2. One example of running the detector (Model 3) on the part of the corpus from Crelle’s Journal not used in training the model. For this particular page, it correctly identified two true positives.

However, we also encountered all the kinds of mistakes that we would expect: false positives, false negatives, and wrong partitionings. We found various types of false positives, i.e. predictions which do not correspond to diagrams (see Fig. 3). These included library stamps and indented multi-line formulae, but also some kinds of matrices and continued fractions which could rightly be considered diagrams on many definitions [10]. We also found various types of false negatives (see Fig. 4), in particular some triangular commutative diagrams which were not identified as diagrams by the detector. Another special kind of false negative came from tableaux pages with many diagrams, especially when the bounding rectangles of different diagrams overlap; this is thought to be a side-effect of the model chosen. Furthermore, we found instances where the detector would identify sub-rectangles of a diagram as independent diagrams (see Fig. 5).

After training our models, and to assess their quality, we ran the detector against a baseline of 677 hand-tagged articles from three journals (Bulletin of the AMS, Acta Mathematica and Annals of Mathematics) which are outside the training set and were tagged for another project [9]. These articles spanned 23,500 pages and contained a total of 5,271 diagrams. Different measures exist for evaluating this type of machine classification, and the best choice of measure depends on the concerns of the application. To measure the performance of our detector on such an asymmetric set (many more negatives than positives, and a higher price on false negatives than on false positives), we chose to balance recall (R) and precision (P) through the F1-score:

\[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{P \cdot R}{P + R} \tag{1}
\]

If no true positives are found in an article, the F1-score is undefined for that article. As can be seen from the equations, R measures how many of the actual diagrams are picked up and classified correctly, whereas P measures the degree to which the diagrams identified are indeed true diagrams.
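
To make the scoring concrete, here is a minimal Python sketch of a per-article F1 computation, assuming predictions are matched to hand-tagged boxes by intersection-over-union. The 0.5 matching threshold and the greedy matching are our assumptions, since the matching rule is not spelled out above.

    def iou(a, b):
        # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    def article_f1(predicted, ground_truth, iou_threshold=0.5):
        # Greedily match each predicted box to an unmatched hand-tagged box.
        unmatched = list(ground_truth)
        tp = 0
        for p in predicted:
            match = next((g for g in unmatched if iou(p, g) >= iou_threshold), None)
            if match is not None:
                unmatched.remove(match)
                tp += 1
        fp = len(predicted) - tp   # predictions with no matching diagram
        fn = len(unmatched)        # diagrams the detector missed
        if tp == 0:
            return None            # F1 undefined when no true positives are found
        P = tp / (tp + fp)
        R = tp / (tp + fn)
        return 2 * P * R / (P + R)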

Fig. 3. Examples of false positives and wrong partitionings produced by running the detector (Model 3) on the part of the corpus from Crelle’s Journal on which it was not trained.

Fig. 4. Examples of false negatives produced by the detector (Model 3) on the part of the corpus from Crelle’s Journal on which it was not trained.

Fig. 5. Examples of wrong partitionings from running the detector (Model 3) on the corpus from Crelle’s Journal on which it was not trained.

The F1-score for Model 4 against the entire baseline corpus was found to be 0.90777, a significant improvement over the 0.7198 achieved by Model 3. This is a very good score, given that Model 4 was trained on a relatively small set of tagged images and tested against a corpus from different mathematical and typographical traditions. It also shows that Model 4 succeeded in eliminating many of the false predictions made by Model 3.

4 Discussion

Our efforts to build a mathematical diagram detector have been successful to such a degree that we now have a tool that can provide large-scale quantitative background for historical and philosophical investigations of the use of diagrams in mathematics. This background is important for several reasons. With the detector (and its subsequent improvements) it is possible to build large corpora of diagrams spanning many journals, periods, and sub-disciplines. This will allow a more grounded approach to the investigation of the function of diagrams, as large samples that better represent the diversity in the types and uses of diagrams can easily be accessed.

Furthermore, mathematicians do not use diagrams (and other representations) only as a way to convey mathematical content. Diagrams and other representations also play a major role in the heuristic phases of mathematical work and during idea and concept development. Consequently, changes in the frequency and type of the diagrams being published not only reflect aesthetic and stylistic preferences, but may also indicate underlying changes in cognitive style and epistemic values among the practitioners. A precise understanding of the changes in diagram use over time or between different sub-disciplines of mathematics is thus not only of interest in and of itself, but may also be used to identify especially interesting periods or publications for further historical or philosophical investigation of the role of diagrams.

Finally, the fact that it is at all possible to build and train a model capable of detecting mathematical diagrams is, in itself, an interesting philosophical result. As pointed out above, it is quite easy to point to many different examples of mathematical diagrams, but difficult to give a clear definition of the concept in terms of necessary and sufficient conditions. Despite this difficulty, the detector is largely capable of mirroring human judgement concerning whether or not something is a diagram (and some of the ‘mistakes’ made by earlier iterations of the detector even reflect the inconsistencies of the concept, as when it classified a continued fraction as a diagram). Although a full explication of the concept is beyond us, the prototypes embedded in the examples we provided to the detector seem to be strong enough for a reasonably clear concept to emerge from its behaviour.