1 Introduction

Information and communication technologies are now widely used in research and development. As a result, it has become possible to draw on the entire corpus of accumulated scientific knowledge when conducting new research. Such use requires a set of technologies that ensure optimal management of available knowledge, effective access to it, and the sharing and reuse of new kinds of knowledge structures. In mathematics, considerable experience in using electronic mathematical content has been accumulated through various projects for creating mathematical digital libraries.

Since the inception of the first scientific information systems, mathematicians have been involved in the full cycle of software product development, from idea to implementation. Well-known examples are the open source system TeX [1] and the commercial systems Wolfram Mathematica and WolframAlpha, developed under the leadership of Stephen Wolfram according to his principles of computational knowledge theory [2, 3]. Tools for mathematical content management are developed with the help of communities of mathematicians, e.g. MathJax, supported by the American Mathematical Society, the information system Math-Net.Ru, developed at the Steklov Mathematical Institute of the Russian Academy of Sciences [4], and the collection of publicly available preprints arXiv.org (https://arxiv.org/).

The main challenges of mathematical knowledge management (MKM) are discussed in [5,6,7,8]. In [9] we discuss the most urgent tasks: modeling representations of mathematical knowledge; presentation formats; authoring languages and tools; creating repositories of formalized mathematics and mathematical digital libraries; mathematical search and retrieval; implementing math assistants, tutoring and assessment systems; developing collaboration tools for mathematics; creating new tools for detecting repurposed material, including plagiarism of others’ work and self-plagiarism; creating interactive documents; developing deduction systems. Solving these tasks requires the formalization of mathematical statements and proofs. While mathematics is full of formalisms, there is currently no widely accepted formalism for computer mathematics.

At present, one of the largest formal mathematical libraries is that of Mizar (http://www.mizar.org/): a collection of papers written in the formal language of the Mizar system, containing definitions, theorems, and proofs [10, 11]. Mizar is one of the pioneering systems for the formalization of mathematics and still has an active user community. The project has been in constant development since 1973.

Important results have been obtained concerning the level of formalization in the representation of mathematical articles. For this purpose, languages for presenting mathematical texts, specialized formal languages, and conversion software have been developed [12,13,14,15,16]. These technologies are also used to construct mathematical ontologies and to create semantic search services [6, 9, 17]. Effective communication requires a shared conceptualization as well as a sharable vocabulary. Ontologies that satisfy this requirement are described in [6, 9, 18].

In mathematical journals, an important part of the search service is finding fragments of formulas. For example, such a service is implemented in the digital repository of the Lobachevskii Journal of Mathematics (http://ljm.kpfu.ru/). For this purpose, we convert documents to XML format and formulas to MathML notation [12]. These and many other mathematical projects paved the way for a new idea, the creation of the World Digital Mathematical Library (WDML).

A special type of information system called a “digital ecosystem” has now emerged [19]. The Digital Ecosystem is forming as the Information Technology, Telecommunications, and Media and Entertainment industries converge, users evolve from mere consumers to active participants, and governments face policy and regulatory challenges [20]. Research on adapting the digital ecosystem model to scientific and educational fields is described in [21,22,23,24].

This paper is devoted to the development of the digital ecosystem OntoMath, whose task is mathematical knowledge management in digital scientific collections. This ecosystem consists of a semantic publishing platform, which builds a semantic representation of collections of mathematical articles, together with a set of ontologies and mathematical knowledge management services.

This article is an extension of the report “Mathematical Knowledge Management: Ontological Models and Digital Technology” presented at the conference “Data Analytics and Management in Data Intensive Domains” (DAMDID) [17]. We have expanded the description of technologies for constructing digital mathematical libraries and presented the object paradigm of mathematical knowledge representation. We have also described the new ontology-based tools for semantic search and knowledge management that we have developed.

The paper is organized as follows. In Sect. 2, we consider the problems related to managing the content of digital mathematical libraries. In Sect. 3, we present the object paradigm of representing mathematical knowledge. In Sect. 4, we present the architecture and tools of the OntoMath ecosystem.

2 Digital Mathematics Libraries

At present, research activities in mathematics rely on modern information technologies (cloud, semantic, etc.). These technologies are used in the research of distributed scientific teams, in the preparation and dissemination of mathematical knowledge in electronic form, and in the formation of mathematical digital libraries and the intelligent processing of their content. Special attention is given to creating a common mathematical information space by integrating existing digital mathematical libraries (DMLs) and organizing new ones. The largest projects are “All-Russian Mathematical Portal Math-Net.RU” (http://www.mathnet.ru/), “Centre de diffusion de revues académiques mathématiques” (CEDRAM, http://www.cedram.org/), “Czech Digital Mathematics Library” (DML-CZ, http://dml.cz/), “The Polish Digital Mathematics Library” (DML-PL, http://pldml.icm.edu.pl/pldml/), “Göttinger DigitalisierungsZentrum” (GDZ, http://gdz.sub.uni-goettingen.de/gdz/), “Numérisation de documents anciens mathématiques” (NUMDAM, http://www.numdam.org/), Zentralblatt MATH (https://zbmath.org/), EMIS ELibM (http://www.emis.de/elibm/), and “Bulgarian Digital Mathematics Library” (BulDML, http://sci-gems.math.bas.bg/jspui/) (see, e.g., [25,26,27]). Mathematical content is also presented in multidisciplinary digital libraries, for example, JSTOR (http://www.jstor.org/) and eLIBRARY (http://elibrary.ru/defaultx.asp).

This class also includes the scientific publishing platforms of Elsevier (https://www.elsevier.com/), Springer (http://springerlink.bibliotecabuap.elogim.com/), and Pleiades Publishing (http://pleiades.online/ru/publishers/), as well as systems that support scientific journals, for example, Elpub (http://elpub.ru/).

The realization and development of digital mathematical libraries involve the development of special tools and continuous improvement of their functionality. An example is Open Journal Systems (OJS, https://pkp.sfu.ca/ojs/). This platform is used in many projects, in particular in the Lobachevskii Journal of Mathematics (http://ljm.kpfu.ru/), one of the first digital mathematical journals. Intelligent information processing tools have been introduced in the practice of this journal since 1998 [12, 28,29,30,31]. In particular, we performed automated MathML markup of the articles of this journal [12, 32]. Paper [33] presents a system of services for the automated processing of large collections of scientific documents. These services verify that documents comply with the accepted rules for forming collections and convert them to the established formats; they perform structural analysis of documents and metadata extraction, and integrate the documents into the scientific information space. The system automatically performs a set of operations that cannot be carried out in acceptable time by traditional manual processing of electronic content. It is designed for large collections of scientific documents.

The idea of creating a World Digital Mathematical Library (WDML) arose in 2002. The initial aim of this project was to digitize the entire body of mathematical literature (both modern and historical), link it to the present literature, and make it clickable (see [25, 34,35,36,37]). As noted in [35], the success of this project and its future impact on mathematics, science, and education could be the most significant event since the invention of scientific journals, and the project could become a prototype for a new model of scientific and technical cooperation, a new paradigm for future science in an electronically connected world. At the same time, the implementation of such a large project will inevitably raise a series of problems. These problems and ways to overcome them were analyzed in [38]. In particular, one of the recommendations was to develop and coordinate local DML projects (see [26, 38]).

Basic plans for the construction of the WDML were discussed in 2014–2015 by various mathematical communities and set out in a number of documents (see [39, 40]). In particular, it was noted that the next step in the development of the WDML project would be building knowledge-based information networks from the content of mathematical publications. Many research groups of mathematicians all over the world took part in the discussion of these ideas, including our group at Kazan Federal University. In February 2016, the Wolfram Foundation, the Fields Institute, and the IMU/CEIC working group for the creation of a World Digital Mathematics Library organized the Semantic Representation of Mathematical Knowledge Workshop at the Fields Institute (Toronto, Ontario) (https://www.fields.utoronto.ca/programs/scientific/15-16/semantic/). Our report at this workshop was devoted to models and software solutions for the semantic representation of mathematical knowledge [41]. These results are consistent with the general ideology of the WDML project with regard to the semantic representation and processing of mathematical knowledge, and they are a strategic direction of our group’s research. In particular, they are connected with the construction of the OntoMath ecosystem, which is described below.

3 Moving Towards the Object Paradigm of Mathematical Knowledge Representation

E-libraries, as collections of electronic documents, provide document search by bibliographic descriptions and thematic classification codes, as well as full-text keyword search within the documents. A full-text index is the main mechanism for text search.

The global WDML initiative specifies key areas related both to the organizational efforts of the international mathematical community, including publishers of mathematical literature, and to research and technology efforts aimed at developing existing technologies and introducing new (semantic) technologies for representing and processing mathematical content. These semantic technologies include the following features:

  • Aggregation of different ontologies, indexes, and other resources created by the mathematical community, with broad access for extending and editing them;

  • Improving access to mathematical publications, not only for searching and browsing, but also for annotating, navigating, linking to other sources, computing with data, visualizing data, and so on.

The move towards the representation of the internal structure of mathematical knowledge creates a new paradigm of representation. The focus of representation has shifted to the selection of elements (classes) and their relationships, which allows researchers to create various network conceptual frameworks (e.g. the citation graph, the graph of mathematical concepts, etc.). Classification of mathematical objects and organization of the relevant repositories provide new computing capabilities for data processing such as extraction and processing of formulas, finding similar papers and so on.

The WDML project is focused on an object-based system for organizing and storing mathematical knowledge. Unlike traditional electronic mathematical libraries, in which the unit of storage in the database is an electronic document, it is proposed to represent the mathematical knowledge of document collections as a specially organized repository of mathematical objects.

One of the key ideas is to develop classes of objects for the adequate description and study of mathematical content. In a mathematical document it is easy enough to identify a set of basic classes of mathematical objects (sequences, functions, transformations, identities, symbols, formulas, theorems, statements, etc.). As noted in the WDML project, one of the most important tasks is to build a list of mathematical objects in different areas of mathematics.

Standard classes of mathematical objects are theorems, axioms, proofs, mathematical definitions, etc. Important elements of the object model are semantic links (relations) between the elements. To build a document object model, it is proposed to use modern Semantic Web technologies. This representation of mathematical knowledge requires the development of new management tools suited to mathematical knowledge (aggregation tools, semantic search, formula search, and identification of similar objects) [36, 39, 40].

Let us consider the key mathematical knowledge management tools. Aggregation tools provide automatic collection of objects that meet certain criteria, as well as automatic replenishment of object lists. Object lists can be built according to different criteria, depending on the target application. For example, a useful list is a list of objects of a given domain (for example, a list of all known theorems of group theory), or a list of objects of a particular class (e.g. theorems) related to the study of the mathematical properties of a given object (e.g. the geometric object “triangle”). These lists make it possible to create custom search indexes that accumulate mathematical knowledge.

Navigation tools (together with search tools) make it possible to navigate to target objects within a document. For example, a classical task is to find a given mathematical object and its properties, or to search for this object together with other mathematical objects related to it by certain mathematical equations. Another important task is to find a given mathematical object and the scientific articles on this subject. In contrast to keyword search, object search can take the semantics of links into account and thereby improve search results.

For example, using the object properties for a given mathematical object (e.g., “Sobolev space”) it is possible to find and view relevant information about such properties as its mathematical definition, educational literature, context-related objects and others.

Semantic search is a method of information retrieval that determines the relevance of a document to a query semantically rather than syntactically. Semantic search in the object repository is organized by following semantic links, which make it possible to find objects by their description (an implicit reference to the object) as well as by given object properties. For example, the following query is classified as an implicit reference to an object: “Find all the theorems, the proof of which uses Fermat’s theorem.”
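The query above can be read as a traversal of semantic links in the object repository. The following minimal sketch assumes a hypothetical repository in which each theorem records the results that its proof uses directly; the theorem names, the `PROOF_USES` table, and the function `theorems_using` are illustrative and are not part of WDML.

```python
from collections import deque

# Hypothetical object repository: theorem -> results its proof uses directly.
PROOF_USES = {
    "Fermat's theorem": [],
    "Theorem A": ["Fermat's theorem"],
    "Theorem B": ["Theorem A"],
    "Theorem C": ["Fermat's theorem", "Theorem B"],
    "Theorem D": [],
}


def theorems_using(target, proof_uses, transitive=False):
    """Return theorems whose proof uses `target` directly (or transitively)."""
    # Invert the "uses" links so that we can walk from a result to its dependents.
    used_by = {}
    for theorem, deps in proof_uses.items():
        for dep in set(deps):
            used_by.setdefault(dep, []).append(theorem)

    queue = deque(used_by.get(target, []))
    seen = set(queue)
    found = []
    while queue:
        theorem = queue.popleft()
        found.append(theorem)
        if transitive:
            # Also follow links from theorems that depend on this one.
            for dependent in used_by.get(theorem, []):
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
    return sorted(found)


print(theorems_using("Fermat's theorem", PROOF_USES))
print(theorems_using("Fermat's theorem", PROOF_USES, transitive=True))
```

With `transitive=False` only direct uses are returned; with `transitive=True` the search also follows chains of dependent theorems, which is one possible reading of the query.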

Formula search: this tool searches for mathematical formulas and additional information about them (such as the name of the formula, the list of scientific and educational publications, etc.). Formula search queries can, in general, take different forms: for example, a text query about the variables (“Find a formula connecting the area of a circle and the length of its circumference”), a computational query (“Find a formula equivalent to the formula F”), or a text query about the mathematical object connected with the formula (“Find a proof of Euler’s formula”).

Identification tools are designed to identify identical objects that are referred to by different names and with different notations.

Thus, the main purpose of the WDML is to unite digital versions of all mathematical repositories, including both contemporary sources and sources that have become historical, on a new conceptual basis and to provide intelligent information retrieval and data processing [39, 40].

At the same time, new ways of discovering objects of scientific knowledge directly through the web, as well as tools and services for creating and sharing new types of knowledge structures, are becoming more popular in the scientific community. In the context of Linked Data and the Semantic Web, these tools and services can be used to create collaboration graphs, which are used, for example, to calculate the collaboration distance between authors and to search for similar documents. These facilities open up new possibilities for fine-tuning search and browsing (see, e.g., [42]). Many authors (e.g. [6, 9, 18]) highlight the importance of developing new domain ontologies, in particular in mathematics, because traditional bibliographic classification is no longer sufficient. A deeper representation is needed, one that contains more detailed descriptions and takes different points of view into account.
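The collaboration distance mentioned above can be illustrated with a short sketch. It assumes that the collaboration graph is an undirected co-authorship graph and that the distance is the length of the shortest chain of co-authors, computed by breadth-first search; the author names and the `PAPERS` list are hypothetical.

```python
from collections import deque

# Hypothetical input: one list of author names per paper.
PAPERS = [
    ["Author A", "Author B"],
    ["Author B", "Author C"],
    ["Author C", "Author D"],
    ["Author E"],
]


def collaboration_graph(papers):
    """Build an undirected co-authorship graph: author -> set of co-authors."""
    graph = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set()).update(
                other for other in authors if other != author
            )
    return graph


def collaboration_distance(graph, source, target):
    """Length of the shortest co-authorship chain, or None if there is none."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        author = queue.popleft()
        for coauthor in graph[author]:
            if coauthor not in dist:
                dist[coauthor] = dist[author] + 1
                if coauthor == target:
                    return dist[coauthor]
                queue.append(coauthor)
    return None


graph = collaboration_graph(PAPERS)
print(collaboration_distance(graph, "Author A", "Author D"))
```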

4 OntoMath Digital Ecosystem

4.1 General Description

OntoMath is a digital ecosystem of ontologies, textual analytics tools, and applications for mathematical knowledge management. This system consists of the following components:

  • Mocassin, an ontology of structural elements of mathematical scholarly papers;

  • OntoMathPRO, an ontology of mathematical knowledge concepts;

  • Semantic publishing platform;

  • Semantic formula search service;

  • Recommender system.

We briefly describe these basic elements of the architecture of the OntoMath digital ecosystem (Fig. 1).

Fig. 1. OntoMath ecosystem architecture

The core component of the OntoMath ecosystem is its semantic publishing platform. It builds a Linked Open Data (LOD) representation for a collection of mathematical articles in LaTeX. The generated mathematical dataset includes metadata, the logical structure of documents, terminology, and mathematical formulas. Article metadata, the logical structure of documents, and terminology are expressed in terms of the AKT Portal, Mocassin, and OntoMathPRO ontologies, respectively. The Mocassin ontology, in turn, is built on the SALT Document Ontology, which is an ontology of the rhetorical structure of scholarly publications. The Mocassin and OntoMathPRO ontologies are parts of the OntoMath ecosystem, whereas SALT is an external ontology. Two applications are built using the semantic publishing platform: a semantic formula search service and a recommender system.

As any digital ecosystem, OntoMath has components that are used for sociotechnical purposes. Such components are ontologies and semantic publishing platforms. They can be used by mathematicians and software systems developers.

4.2 Semantic Publishing Platform

As mentioned above, the semantic publishing platform, which constitutes the core of the OntoMath ecosystem, builds an LOD representation for a given collection of mathematical articles in LaTeX [43, 44]. Its main features are:

  • Indexing mathematical articles in LaTeX-format as LOD-compatible RDF-data;

  • Extracting articles’ metadata in terms of AKT Portal Ontology;

  • Mining the document logical structure using our ontology of structural elements of mathematical papers;

  • Eliciting instances of mathematical entities as the concepts of OntoMathPRO ontology;

  • Connecting the extracted textual instances to symbolic expressions and formulas in the mathematical notation;

  • Linking the published data to existing LOD datasets.

The developed technology has the following features:

  • The mathematical RDF dataset is built from a collection of mathematical articles in Russian;

  • The resulting RDF dataset includes metadata as well as specific semantic knowledge, such as the knowledge obtained by special processing of mathematical formulas (binding the textual definitions of variables to the symbols of variables in formulas), instances of the OntoMathPRO ontology, and the structural elements of the mathematical articles.

  • Semantic annotation of mathematical texts based on Mocassin and OntoMathPRO ontologies.

  • The MathLang Document Rhetorical (DRa) Ontology [45] enables one to interpret the elements of document structure using mathematical rhetorical roles that are similar to the ones defined in the statement level of OMDoc ontology. This semantics focuses on formalizing proof skeletons for generation of proof checker templates.

4.3 Ontologies

Mocassin (https://code.google.com/archive/p/mocassin/) is an ontology intended for annotating the logical structure of a mathematical document [43, 46]. This ontology extends the SALT Document Ontology by defining concepts and relations specific to mathematical documents. The Mocassin ontology represents a mathematical document as a set of interconnected segments. It is designed in the OWL2/RDFS languages [47], which provide expressive power as well as theoretical and practical means of inference.

The Mocassin ontology uses 15 concepts, such as Document segment, Claim, Definition, Proposition, Example, Axiom, Theorem, Lemma, Proof, Equation, and others. The ontology defines relations between segments such as dependsOn, exemplifies, hasConsequence, hasSegment, proves, and refersTo.
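To illustrate how a document annotated in this way can be queried, the following sketch stores segments and their relations as plain subject-predicate-object triples. Only the concept and relation names come from the description above; the segment identifiers, the sample document, and the helper functions are hypothetical, and the sketch does not use the actual Mocassin vocabulary URIs.

```python
# Hypothetical annotated document: segment id -> Mocassin concept.
SEGMENT_TYPES = {
    "seg1": "Definition",
    "seg2": "Lemma",
    "seg3": "Proof",
    "seg4": "Theorem",
    "seg5": "Proof",
    "seg6": "Example",
}

# Relations between segments as (subject, predicate, object) triples.
RELATIONS = [
    ("seg2", "dependsOn", "seg1"),
    ("seg3", "proves", "seg2"),
    ("seg4", "dependsOn", "seg2"),
    ("seg5", "proves", "seg4"),
    ("seg5", "refersTo", "seg2"),
    ("seg6", "exemplifies", "seg1"),
]


def subjects(relations, predicate, obj):
    """Segments that stand in relation `predicate` to the segment `obj`."""
    return [s for s, p, o in relations if p == predicate and o == obj]


def dependencies(relations, segment):
    """Segments that `segment` depends on, following dependsOn transitively."""
    result, stack = [], [segment]
    while stack:
        current = stack.pop()
        for s, p, o in relations:
            if s == current and p == "dependsOn" and o not in result:
                result.append(o)
                stack.append(o)
    return result


# Which segment proves the theorem, and what does the theorem rest on?
print(subjects(RELATIONS, "proves", "seg4"))
print([(seg, SEGMENT_TYPES[seg]) for seg in dependencies(RELATIONS, "seg4")])
```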

OntoMathPRO (http://ontomathpro.org/) is an ontology of mathematical knowledge [9, 48, 49]. Its concepts are organized into two taxonomies:

  • Hierarchy of areas of mathematics: Logics, Set theory, Geometry, including its sub-fields, such as Differential Geometry, and so on;

  • Hierarchy of mathematical objects such as a set, function, integral, elementary event, Lagrange polynomial, etc.

This ontology defines the following relations:

  • Taxonomic relation (for example, “Lambda matrix” is a “Matrix”);

  • Logical dependency (for example, “Christoffel Symbol” is defined by “Connectedness”);

  • Associative relation between objects (for example, “Chebyshev Iterative Method” see also “Numerical Solution of Linear Equation Systems”);

  • Belongingness of objects to fields of mathematics (for example, “Barycentric Coordinates” belongs to “Metric Geometry”);

  • Associative relation between problems and methods (for example, “System of linear equations” is solved by “Gaussian elimination method”).

Each concept description has Russian and English labels, a textual definition, relations with other concepts, and links to external terminologies such as DBpedia and ScienceWISE.

4.4 Applications

OntoMath Formula Search Engine is a semantic search service that uses the semantic representation of mathematical documents built by the Semantic Publishing Platform [48]. It implements a new kind of search by the names of variables using the OntoMath ontology. A variable in a formula is a symbol that denotes a mathematical object. Mathematical symbols can denote numbers (constants), variables, operations, functions, punctuation, grouping, and other aspects of logical syntax. Specific branches and applications of mathematics usually have their own naming conventions for variables. However, nonstandard variable names may be used in some formulas. OntoMath Formula Search Engine finds the mathematical formulas that contain a given mathematical object regardless of the name of the variable that denotes it. For example, if we would like to find a formula that contains a mathematical object (e.g. the curvature), the service will find all the formulas that include this object, even if they use different names for the variable. Using inference, the service can find formulas containing not only the given object but also the objects below it in the ontology hierarchy. For example, when searching for formulas that contain a polygon, OntoMath Formula Search can also find formulas that contain its descendants in the hierarchy (e.g. the triangle, the parallelogram, the trapezium, the hexagon, and others). OntoMath Formula Search also allows the search to be restricted to a chosen part of the document, for example, only to definitions or only to theorems. These search functions distinguish OntoMath Formula Search Engine from popular search services such as (uni)quation, Springer LaTeX search, Wikipedia formula search, and Wolfram formula search. These services have great potential, including robustness to variable renaming and to expression transformation. However, they are syntactic and look for formulas that match a predetermined formula pattern.
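The behaviour described above (search by the object a variable denotes, expansion down the hierarchy, and restriction to a document segment) can be sketched as follows. This is a simplified illustration, not the implementation of OntoMath Formula Search Engine: the hierarchy fragment, the formula index, and the function names are hypothetical, and the binding of variables to concepts is assumed to have been produced beforehand.

```python
# Hypothetical fragment of an object hierarchy: child -> parent.
PARENT = {
    "triangle": "polygon",
    "parallelogram": "polygon",
    "trapezium": "polygon",
    "hexagon": "polygon",
    "polygon": "geometric figure",
    "circle": "geometric figure",
}

# Hypothetical formula index: variables are bound to the concepts they denote.
FORMULAS = [
    {"id": "f1", "tex": r"k = \frac{1}{R}", "segment": "Definition",
     "variables": {"k": "curvature", "R": "radius of curvature"}},
    {"id": "f2", "tex": r"\kappa = \left|\frac{d\mathbf{T}}{ds}\right|",
     "segment": "Definition", "variables": {r"\kappa": "curvature"}},
    {"id": "f3", "tex": r"S(T) = \frac{1}{2} a h", "segment": "Theorem",
     "variables": {"T": "triangle"}},
    {"id": "f4", "tex": r"S(P) = a h", "segment": "Example",
     "variables": {"P": "parallelogram"}},
    {"id": "f5", "tex": r"S(C) = \pi r^2", "segment": "Theorem",
     "variables": {"C": "circle"}},
]


def descendants(concept, parent):
    """The concept itself plus every concept below it in the hierarchy."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for child, par in parent.items():
            if par in result and child not in result:
                result.add(child)
                changed = True
    return result


def search_formulas(concept, formulas, parent, segment=None):
    """Ids of formulas with a variable denoting `concept` or a descendant."""
    targets = descendants(concept, parent)
    hits = []
    for formula in formulas:
        if segment is not None and formula["segment"] != segment:
            continue  # restrict the search to the requested segment type
        if targets & set(formula["variables"].values()):
            hits.append(formula["id"])
    return hits


print(search_formulas("curvature", FORMULAS, PARENT))
print(search_formulas("polygon", FORMULAS, PARENT))
print(search_formulas("polygon", FORMULAS, PARENT, segment="Theorem"))
```

Because matching is done on the concept and not on the symbol, the formulas that write the curvature as `k` and as `\kappa` are both found by the same query.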

We have implemented two applications for mathematical formula search: syntactic search of formulas in MathML and semantic ontology-based search.

The syntactical search leverages formula description from documents formatted in TeX. Our algorithm [12] transforms formulas from TeX format to MathML format. We set up an information retrieval system prototype for a collection of articles in Lobachevskii Journal of Mathematics. For the end-user, the query input interface supports a convenient syntax. The search results include highlighted occurrences of formulas as well as document metadata.

OntoMath Recommender System

As an application of the OntoMath ecosystem, we have developed a recommender system for a collection of physical and mathematical documents. One of the main functions of this system is the creation of a list of related documents (see [50]). Traditionally, the list of related documents is based on the keywords given by the authors, as well as on the bibliographic references available in the documents.

This approach has several disadvantages:

  • A list of keywords may be missing or incomplete;

  • A keyword may be ambiguous;

  • The hierarchy of concepts is not taken into account;

  • An article may use terminology in different languages.

Thus, the created recommender system has the following features:

  • It takes into account the professional profile of a particular user;

  • It forms different recommendations for different scenarios of work with the system (referee, user being introduced in the topic, etc.);

  • It assigns different weights to different concepts. Thus, for a scientific review, concepts denoting areas of mathematics are more important than those related to mathematical objects. At the same time, for a beginning researcher survey papers containing notions from different areas of mathematics and references to original works are more important.

The recommender system’s workflow consists of the following steps:

  • Ontology-based keyword extraction;

  • Semantic representation of an electronic collection of mathematical papers;

  • Calculation of the measure of thematic proximity between documents by using this representation;

  • Building a list of recommended papers.
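The last three steps of this workflow can be sketched as follows. The sketch assumes that ontology-based keyword extraction has already produced concept counts for each paper, and it uses cosine similarity over weighted concept vectors as the proximity measure; the measure itself is not specified above, so this choice, together with the sample data, the scenario weights, and the function names, is an assumption made for illustration.

```python
import math

# Hypothetical kinds of concepts: area of mathematics or mathematical object.
CONCEPT_KIND = {
    "Differential Geometry": "area",
    "Linear Algebra": "area",
    "Christoffel Symbol": "object",
    "Barycentric Coordinates": "object",
    "Matrix": "object",
}

# Hypothetical weights per usage scenario (areas count more for a referee).
SCENARIO_WEIGHTS = {
    "referee": {"area": 2.0, "object": 1.0},
    "uniform": {"area": 1.0, "object": 1.0},
}

# Hypothetical output of keyword extraction: paper -> concept counts.
DOCS = {
    "p1": {"Differential Geometry": 1, "Christoffel Symbol": 2},
    "p2": {"Differential Geometry": 1, "Barycentric Coordinates": 1},
    "p3": {"Linear Algebra": 1, "Matrix": 2},
    "p4": {"Differential Geometry": 1, "Christoffel Symbol": 1},
}


def weighted_vector(concept_counts, weights):
    """Semantic representation of a paper as a weighted concept vector."""
    return {c: n * weights[CONCEPT_KIND[c]] for c, n in concept_counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(doc_id, docs, scenario, top_n=3):
    """Papers most similar to `doc_id`, most similar first."""
    weights = SCENARIO_WEIGHTS[scenario]
    query = weighted_vector(docs[doc_id], weights)
    scored = [
        (cosine(query, weighted_vector(counts, weights)), other)
        for other, counts in docs.items()
        if other != doc_id
    ]
    # Sort by decreasing similarity; break ties by paper id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored[:top_n] if score > 0]


print(recommend("p1", DOCS, "uniform"))
print(recommend("p1", DOCS, "referee"))
```

Changing the scenario changes the weights and therefore the similarity scores, which is how different recommendations for different scenarios could be obtained in this simplified setting.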

5 Conclusion

The ideas, approaches, and results discussed in this paper concern mathematical knowledge management technologies based on specialized ontologies in the field of mathematics. These solutions form the basis of the specialized digital ecosystem OntoMath, which consists of a set of ontologies, text analytics tools, and applications for managing mathematical knowledge. The studies are in line with the project to create a World Digital Mathematical Library, whose objective is to design a distributed system of interconnected repositories of digitized versions of mathematical documents.

The future of the OntoMath ecosystem is related to the development of new services for semantic text analytics and the management of mathematical knowledge. The developed technologies are to be evaluated on the digital mathematical collections of Kazan Federal University.

The present work continues the research in the field of mathematical knowledge management that the authors have carried out since 1998 with the support of grants from the Russian Foundation for Basic Research, Kazan Federal University, and the Academy of Sciences of the Tatarstan Republic.