1 GOALS AND OBJECTIVES OF CDSSK

The concept of the Common Digital Space of Scientific Knowledge (CDSSK) as a structured information environment reflecting scientific achievements was proposed in 2019 [1]. Previously, numerous general considerations had been expressed concerning the need for a common information space [2, 3] or a digital knowledge space [4]. The goals, functions, and architecture of CDSSK were discussed at the All-Russia scientific conference “Common Digital Space of Knowledge: Problems and Solutions” [5], in which several dozen specialists from research and educational institutions, scientific libraries, and museums took part.

A major goal of CDSSK is information support of scientific research. Earlier, in the era of the printing press, there were special services (industry-specific centers and scientific information departments in research institutions and academic libraries) that provided scientists and specialists with relevant information. Now the development of network technologies and the replacement of printed publications with digital ones have given rise to the thesis that “there is everything on the Internet,” and, as a result, the task of information support of scientific research has been transferred to the scientists themselves. Consequently, researchers spend much of their time and intellectual effort not on scientific work, but on becoming acquainted with information, much of which is not relevant to their research interests. On the other hand, this task cannot be omitted, since a lack of information may lead to duplication of research and to time wasted on results previously obtained by other researchers.

The creation of CDSSK containing reliable and nonduplicate factual and documentary structured scientific information on various scientific subjects will alleviate, to a large extent, the problem of information support of scientific research. However, this is just one of the goals of CDSSK.

The overall goal of CDSSK is to create and maintain a digital information environment that is necessary for solving a set of social development tasks providing:

• information support for scientific research;

• support of educational processes, from secondary school (closely related to popularization of science) to postgraduate studies (directly related to information support of science);

• popularization of science (development of motivation for science and appropriate education, approval of funding of science by society);

• preservation of scientific knowledge;

• monitoring and management of science.

Thus, CDSSK should perform scientific-information, educational, general cultural, and management functions.

Accordingly, CDSSK should contain elements targeting the following categories of users:

— researchers, who should be provided with multidimensional retrospective and current information (filtered according to various criteria ensuring its validity and novelty) on scientific subjects of interest;

— schoolchildren and students, who should receive time-tested and reliable basic information of various levels; this information should include facts with references to teaching materials, classical textbooks (which should also be selected based on independent criteria), and digital models of phenomena and events;

— specialists, namely, analysts and representatives of management structures analyzing the state and trends of development of various fields of science;

— the general public, who should be acquainted with the most interesting results obtained in various fields of science, as well as with the history of scientific discoveries and their authors;

— specialists and amateurs interested in the history of science and its creators.

A distinctive feature of CDSSK is that its software shell, using the capabilities of artificial intelligence, including natural language analysis, should handle a wide range of polythematic queries that do not necessarily contain terms explicitly present in object-specific metadata reflected by CDSSK. For example, the system should respond to queries of the following type: “What archaeological finds were discovered in the Urals in the 19th century?” Given this query, the system should return descriptions of all archaeological objects found in various areas of the Urals from 1801 to 1900. Note that information on an individual object may contain only the specific location of its discovery, while the conclusion that this place belongs to the Urals should be derived from an automatic analysis of the links of archaeological objects to objects of other categories (in this case, relevant to geography and time).
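The inference described above can be illustrated as a transitive walk over “be contained in” links between location objects. The following is a minimal sketch only: the place names, object identifiers, years, and function names are hypothetical and are not taken from CDSSK.

```python
# Hypothetical "be contained in" links between location objects.
CONTAINED_IN = {
    "Site A": "District X",
    "District X": "Urals",
    "Site B": "Region Y",
}

# Hypothetical archaeological objects: (identifier, place of discovery, year).
FINDS = [
    ("find-1", "Site A", 1850),
    ("find-2", "Site B", 1870),
    ("find-3", "Site A", 1905),
]


def is_within(place, region):
    """Follow 'be contained in' links upward until the region is reached."""
    seen = set()
    while place is not None and place not in seen:
        if place == region:
            return True
        seen.add(place)  # guards against accidental cycles in the links
        place = CONTAINED_IN.get(place)
    return False


def finds_in(region, first_year, last_year):
    """Return identifiers of finds located in the region within the year range."""
    return [
        name
        for name, place, year in FINDS
        if is_within(place, region) and first_year <= year <= last_year
    ]


result = finds_in("Urals", 1801, 1900)
```

Here the object records store only the specific discovery site; membership in the larger region is derived from the links, as the text requires.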

The result of a query to CDSSK should be not only an unambiguous reliable answer to a factual question, but also an opportunity to learn about the source of this answer and related heterogeneous scientific resources.

In this context, CDSSK is regarded as a scientific-purpose integrator of state information systems (such as the Great Russian Encyclopedia [6], the National Electronic Library [7], the Russian Scientific Citation Index (RSCI) [8], and the State Catalog of Geographical Names [9]) with industry-specific scientific information systems, digital libraries, registers, etc.

The most active research related to the practical construction of CDSSK has been carried out at the Joint Supercomputer Center of the Russian Academy of Sciences, which is a branch of the Scientific Research Institute for System Analysis of the Russian Academy of Sciences. Specifically, the CDSSK architecture has been developed, the structure of its ontology has been proposed, and research has been conducted related to the implementation of specific solutions developed as applied, for example, to the digital library “Scientific Heritage of Russia” (SHR) and its information funds [10].

2 STRUCTURE OF CDSSK

The main components of CDSSK and of its thematic subspaces are the ontology and the content. Following the ontology model proposed in [11], CDSSK includes a set of subspaces corresponding to different fields of science and constructed according to unified principles based on ontology standards used in the Semantic Web environment [12–15].

CDSSK is represented in the form of a five-level hierarchical structure: CDSSK, subspaces, classes of objects, attributes of objects of a class, and values of attributes.

The content of CDSSK is based on objects representing a set of structured multidimensional data reflecting information about a physical entity (e.g., about a specific person, a specific book, a museum object, etc.), a scientific concept (e.g., the Pythagorean theorem, Maxwell’s equations, corpus of Japanese texts, etc.), an event, etc.

Each object of CDSSK is characterized by a set of attributes (properties), their values, and relations to other objects. The list of attributes of an object is specified according to the role it plays in solving tasks within CDSSK.

A class is a collection of objects having a given set of attributes.

A subspace is a collection of classes of objects. CDSSK includes a universal subspace and a set of thematic subspaces.

The universal subspace contains auxiliary classes of objects and subject classes of objects of multidisciplinary character. Auxiliary classes are used to describe scientific objects of all subspaces and determine rules for forming attributes of objects of subject classes and relations between them. They include units of measurement, universal classification systems, data formats, etc. The subject classes of the universal subspace include general scientific events, world-class scholars and publications, etc.

A thematic subspace (e.g., the subspaces “mathematics,” “computer science,” “space research,” etc.) contains elements directly related to the given scientific direction and relations to elements of the universal and other thematic subspaces.
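The five-level hierarchy described above (space, subspaces, classes, attributes, values) can be pictured with a few nested containers. This is a minimal sketch, assuming plain in-memory dictionaries; the type and method names are hypothetical and do not reflect an actual CDSSK implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectClass:
    name: str
    prefix: str
    # attribute name -> list of values (levels four and five)
    attributes: dict = field(default_factory=dict)


@dataclass
class Subspace:
    name: str
    prefix: str
    # class prefix -> ObjectClass (level three)
    classes: dict = field(default_factory=dict)


@dataclass
class Space:
    # subspace prefix -> Subspace (level two)
    subspaces: dict = field(default_factory=dict)

    def add_value(self, sub_prefix, class_prefix, attribute, value):
        """Append a value to an attribute of a class and return all its values."""
        cls = self.subspaces[sub_prefix].classes[class_prefix]
        cls.attributes.setdefault(attribute, []).append(value)
        return cls.attributes[attribute]


space = Space()
universal = Subspace("Universal", "UN")
universal.classes["UNPS"] = ObjectClass("persons", "UNPS")
space.subspaces["UN"] = universal
values = space.add_value("UN", "UNPS", "surname", "Savin")
```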

The unity of the space is ensured by relations between pairs of objects and values of their attributes.

The CDSSK relations are divided into three types: universal, quasi-universal, and specific.

Relations can be simple or compound. Simple relations contain (in terms of RDF triples [16]) indications of the subject, the object, and (optionally, depending on the particular form of relation) the relation value. The values of compound relations can contain “embeddings,” i.e., have their own attributes and their values; the number of embeddings is not limited and is determined by the directory of the given relation.

Universal relations are simple. They indicate only the fact of a relation between elements and do not depend on the classes of the objects they relate. This type includes relations such as “equivalent,” “intersect,” “contain,” and “be contained in.”

Relations of this type are widely used in subject thesauri and in establishing correspondences between elements of classification systems. In CDSSK they are additionally used, for example, to indicate the hierarchy of institution departments, different names of institutions, different spellings of a person’s first name and surname, etc.

Quasi-universal relations link subjects of different classes to objects of a given class. They can be simple or compound. The list of quasi-universal relations can be supplemented as the CDSSK develops and new elements are added. Examples of quasi-universal relations are references to encyclopedic articles linked to a person, institution, event, scientific discovery, etc.; and time characteristics indicating the beginning or end of a process or event.

Specific relations are established between subjects and objects of given classes. They can be simple or compound. The number and form of specific relations are specified when ontologies of particular classes are formed. In contrast to universal relations, which are static, and to quasi-universal relations, the set of which grows rather slowly, the list of specific relations is dynamic, since it is determined by the development of the CDSSK and the tasks that arise. Specific relations have attributes, whose values, in turn, can have their own attributes with values. For example, the relation “form of relationship of a person with an institution” has attributes including “staffer.” The value of this attribute is the position name, which can optionally have one or two attributes: the dates of starting and ending work in this position. Specific relations include links of a person to a publication, which can take the values “author,” “translator,” “artist,” “compiler,” etc. Depending on the CDSSK objectives, the relation between a discovery and its publication can also be treated as specific. This relation can recommend a publication to different user groups; it can take the following values: “date of the first publication,” “recommended scientific monograph,” “recommended textbook for university students,” “recommended textbook for school students,” and “recommended popular-science publication.”
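The nesting of attributes in the “staffer” example above can be pictured as a small tree. The sketch below is illustrative only: the position name, the dates, the placeholder subject and object, and the dictionary layout are assumptions, not the CDSSK storage format.

```python
# Hypothetical compound relation: each attribute carries a value and,
# optionally, its own nested attributes ("embeddings").
relation = {
    "relation": "form of relationship of a person with an institution",
    "subject": "person P",        # placeholder, not a real URN
    "object": "institution I",    # placeholder, not a real URN
    "attributes": {
        "staffer": {
            "value": "senior researcher",   # position name (assumed)
            "attributes": {
                "start": {"value": "2001"},  # assumed date
                "end": {"value": "2010"},    # assumed date
            },
        }
    },
}


def depth(node):
    """Count the levels of nested attributes below a node."""
    children = node.get("attributes", {})
    if not children:
        return 0
    return 1 + max(depth(child) for child in children.values())


levels = depth(relation)
```

A simple relation would have depth zero in this picture, while the “staffer” case has two levels: the position and its dates.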

Information on the structure of all CDSSK elements is reflected in directories, which have a unified structure for each of six types of elements (subspace, class of objects, relations of three types, and attributes of objects/relations).

Particular values of attributes and relations for each object are contained in dictionaries, information on which is described in corresponding directories.

Directories of each hierarchical level can have a different number of components determining particular elements of this form. An exception is the top-level basic directory named CDSSK, which has nine components. The first six describe the structure of directories of elements, and the last three describe the structure of dictionaries of attributes, objects, and relations.

3 FORMALIZED DESCRIPTION OF CDSSK ONTOLOGY

All CDSSK elements, including directories and dictionaries, have unique structured names (URN). Mnemonics of URN formation have been proposed that allow new attributes of objects and relations between them to be easily added to CDSSK.

Components of the CDSSK directory have the following form:

CDSSK.1: Structure of the directory of subspaces.

Name (URN) of the directory of subspaces: SUBS.

Components:

subspace name;

prefix (two characters);

description (explanatory text).

Examples:

SUBS.1: Universal; UN; subspace containing classes of objects not related directly to a particular scientific topic, including universal reference data.

SUBS.2: Computer science; 20; the subspace includes objects related to the scientific direction “computer science.”
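An entry of the directory of subspaces, as written in the examples above, can be split into its components mechanically. The following is a minimal sketch, assuming that the URN is separated by the first colon and the components by semicolons; the type and function names are hypothetical.

```python
from typing import NamedTuple


class SubspaceEntry(NamedTuple):
    urn: str
    name: str
    prefix: str
    description: str


def parse_subs(line):
    """Split 'URN: name; prefix; description' into its components."""
    urn, _, rest = line.partition(":")
    # Split at most twice so that semicolons inside the description survive.
    name, prefix, description = (part.strip() for part in rest.split(";", 2))
    if len(prefix) != 2:
        raise ValueError("the subspace prefix must have two characters")
    return SubspaceEntry(urn.strip(), name, prefix, description)


entry = parse_subs(
    "SUBS.1: Universal; UN; subspace containing classes of objects "
    "not related directly to a particular scientific topic"
)
```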

CDSSK.2: Structure of the directory of classes.

Name (URN) of the directory of classes: Class.

Elements:

name;

prefix (UNab for universal and <PR>ab for local, where <PR> is the thematic subspace prefix consisting of two characters and ab denotes two arbitrary alphanumeric characters);

URN of the directory of attributes;

description (explanatory text).

Examples:

Class 1: persons; UN; UNPS; A_UNPS; information on persons related to scientific research.

Class 2: polythematic databases; UN; UNBD; A_UNBD; encyclopedias, databases of persons, institutions, documentary databases, resource catalogs, digital libraries.

Class 3: Formats; UN; UNFT; A_UNFT; representation formats for attributes of objects (number, time, date, text, etc.).

CDSSK.3: Structure of the directory of attributes.

Name (URN) of the directory of attributes is formed as A_<class prefix>.

Elements:

attribute name;

format of representation of attribute values (URN of the corresponding element of the directory of objects of the class “Formats of data”);

URN of the dictionary of attribute values (formed as N_<URN of the attribute>);

the type of the dictionary of attribute values (S for static and D for dynamic);

additional information (explanatory text).

Examples:

A_UNFT.1: type of data presentation; ; N_A_UNFT.1; S; selected from the dictionary at the input of an object

A_UNFT.2: format type; ; N_A_UNFT.2; S; selected from the dictionary

A_UNFT.3: mandatory (r) or optional (f) attribute value; ; N_A_UNFT.3; S;

A_UNFT.4: unique (u) or multiple (m) attribute value; ; N_A_UNFT.4; S;

A_UNPS.1: person’s surname; UNFT.i; N_A_UNPS.1; D; the surname is selected from the dictionary; if it is not found, it is input and checked for equivalence to other spellings.

A_UNPS.8: qualification (academic degree); UNFT.j; N_A_UNPS.8; S; selected from the dictionary

A_UNBD.4: resource URL; UNFT.10; N_A_UNBD.4; D;
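The naming mnemonics visible in the examples above (A_ for a directory of attributes, N_ for a dictionary of values, and a dot-separated index for an element) can be written as a few string helpers. This is a sketch of the pattern as it appears in the examples; the function names are hypothetical.

```python
def attribute_directory_urn(class_prefix):
    """URN of the directory of attributes of a class, e.g. A_UNPS."""
    return f"A_{class_prefix}"


def attribute_urn(class_prefix, index):
    """URN of one attribute in that directory, e.g. A_UNPS.1."""
    return f"{attribute_directory_urn(class_prefix)}.{index}"


def value_dictionary_urn(attr_urn):
    """URN of the dictionary of values of an attribute, e.g. N_A_UNPS.1."""
    return f"N_{attr_urn}"


def value_urn(attr_urn, index):
    """URN of one value in that dictionary, e.g. N_A_UNFT.2.6."""
    return f"{value_dictionary_urn(attr_urn)}.{index}"


surname_attribute = attribute_urn("UNPS", 1)
surname_dictionary = value_dictionary_urn(surname_attribute)
format_value = value_urn(attribute_urn("UNFT", 2), 6)
```

Because every name is derived from the class prefix and an index, a new attribute or value only extends an index and never requires renaming existing elements.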

CDSSK.4: Structure of the directory of universal relations.

Name (URN) of the directory: REUN.

Elements:

name;

URN value of the data format dictionary determining the form of representing the given relation;

description of the relation.

Examples:

REUN.1: Equivalence; N_A_UNFT.2.6; used to denote identical attributes or relations (different spellings of surnames and first names, translated versions of a publication, different names of an institution, synonymous terms, etc.).

REUN.3: Contains; N_A_UNFT.2.6; collection with respect to its constituent articles, continent with respect to the countries it contains, institution with respect to its divisions, etc.

CDSSK.5: Structure of the directory of quasi-universal relations.

Name (URN): REQU.

Elements:

name;

prefix of an object class;

need for the dictionary of values (Y/N);

URN of the dictionary of values (if the preceding attribute is equal to Y);

URN of an element of the data format dictionary determining the form of representing the given relation;

relation description.

Example:

REQU.1: Beginning of the time interval; UNTC; N; ; N_A_UNFT.2.6; value is determined by the URN of an element of the dictionary of UNTC time values, referred to in the value of the relation attribute.

Here, UNTC is the prefix of the universal class “time characteristics” and N_A_UNFT.2.6 is the URN of an element of the dictionary of format form values (see below) describing the structure of the given relation form.

CDSSK.6: Structure of the directory of specific relations.

Name (URN) of the directory: RESP.

The directory has a header of six attributes, which, in the case of a compound relation, is supplemented with three-attribute blocks describing the hierarchy of the relation values.

Elements:

relation name;

prefix of the class of a subject;

prefix of the class of an object;

format of relation representation (URN of the value of the corresponding element in the dictionary N_A_UNFT.2);

URN of directory of attributes of the relation;

the number of values of attributes having subordinated relations at the next level (zero or an integer n).

If the sixth element is nonzero, then n blocks of second-level relations are added:

URN of an element of the dictionary of attribute values having a subordinated relation;

URN of the directory of attributes of a subordinated relation;

the number of subordinated relations of the next level (0 – k).

The next n – 1 blocks contain information on second-level relations; they are followed by blocks describing relations of the next levels.
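The layout described above (a six-element header followed by three-element blocks) can be read with a short parser. The sketch below assumes semicolon-separated fields with the leading URN already removed; the function name and the keys of the returned dictionary are hypothetical.

```python
def parse_resp(record):
    """Split a specific-relation record into its header and three-field blocks."""
    fields = [part.strip() for part in record.split(";")]
    if len(fields) < 6 or (len(fields) - 6) % 3 != 0:
        raise ValueError("expected six header fields plus three-field blocks")
    name, subject_class, object_class, fmt, attr_directory, count = fields[:6]
    n = int(count)  # number of second-level blocks announced by the header
    blocks = [tuple(fields[i:i + 3]) for i in range(6, len(fields), 3)]
    if len(blocks) < n:
        raise ValueError("fewer blocks than the header announces")
    return {
        "name": name,
        "subject_class": subject_class,
        "object_class": object_class,
        "format": fmt,
        "attribute_directory": attr_directory,
        "second_level": blocks[:n],   # the first n blocks
        "deeper_levels": blocks[n:],  # blocks of the following levels
    }


simple = parse_resp(
    "person's identifier in the database; UNPS; UNBD; N_A_UNFT.2.7; A_RESP.3; 0"
)
```

For a simple relation the sixth element is zero, so no blocks follow the header.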

Example:

RESP.3: person’s identifier in the database; UNPS; UNBD; N_A_UNFT.2.7; A_RESP.3; 0

Here, UNBD is the prefix of the universal class “databases” and N_A_UNFT.2.7 is an element of the dictionary of values of the class “formats” (see below).

The directory of relation attributes A_RESP.3 has two elements. The first determines the database to which the relation is established (this database should be described as an object of the class UNBD). The dictionary N_A_UNBD.1 contains the names of the databases from which the necessary information is selected; if information on a database is not available, its attributes are input into the corresponding dictionaries from the directory of attributes. The second element of the directory determines the identifier of a particular person in this database. The identifier is written in the dictionary N_A_RESP.3, which is specific to this relation. Both dictionaries can be supplemented with data as the CDSSK content is formed.

A_RESP.3.1: name of a database; ; N_A_UNBD.1; D;

A_RESP.3.2: index of a person in a database; ; N_A_RESP.3; D;

An example of a multilevel relation is a relation of a person to an institution (RESP.4), which can take the values of “staffer,” “sponsor,” “shareholder,” etc. If a person is (was) an employee of an organization, it is necessary to indicate his or her positions. For each position, in turn, the start and end of the term should be indicated. The structure of this relation is written as an element of the RESP.4 directory, but we do not describe it here, because it would require a large number of definitions of attribute directories and dictionaries of their values, which would significantly overload the article.

CDSSK.7: Structure of a dictionary of attribute values for objects and relations.

URN of a dictionary has the form N_<URN of an attribute>. The dictionary has one element, namely, the value of the format whose URN is given in the corresponding element of the attribute directory. For example,

N_A_UNBD.1.2: Great Russian Encyclopedia.

N_A_UNBD.4.2: https://bigenc.ru/.

N_A_UNFT.2.6: simple relation URNc of the first type between objects, attributes, or values O1 and O2 having URNO1 and URNO2, respectively, of the form <URNc>:<URNO1><URNO2>, where URNc is the URN of a particular relation.

N_A_UNFT.2.7: simple relation URNc of the second type between objects O1 and O2 having URNO1 and URNO2, respectively; the relation takes a value presented under the name URNd in the corresponding dictionary; the relation format has the form <URNc>: <URNO1><URNO2>=<URNd>

CDSSK.8: Structure of dictionaries of relations.

Elements of a relation dictionary have, as URN, the directory name for this relation and an index separated by a dot from the former.

Each element of a dictionary contains information on the relation between a pair of URN of particular objects or URN of attribute values represented in the format indicated in the corresponding relation directory.

For example, information that the Савин surname (with URN=N_A_UNPS.1.3) is equivalent to the Savin value (with URN=N_A_UNPS.1.53) is written as the following element (if the preceding k elements of the relation dictionary “equivalent” have been previously filled):

REUN.1.k+1: <N_A_UNPS.1.3><N_A_UNPS.1.53>

Information that a journal with URN=UNPB.i contains an article with URN=UNPB.j is represented as the dictionary element

REUN.3.n: <UNPB.i><UNPB.j>

Information that a particular person (the set of data about whom is presented in the element of the person dictionary UNPS.k) has the NNN identifier in the RSCI is represented in the form

RESP.3.m: <UNPS.k><UNBD.p>=<N_A_RESP.3.q>,

where UNBD.p contains information about RSCI and N_A_RESP.3.q is the NNN identifier.
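The two element formats used in the examples above, with and without a relation value, can be recognized by a single pattern. The following is a minimal sketch; the regular expression, the function name, and the concrete index 7 in the usage line are assumptions made for illustration.

```python
import re

# <URN of element>: <URN of subject><URN of object>[=<URN of value>]
ELEMENT = re.compile(
    r"^\s*([^:\s]+)\s*:"          # element URN before the colon
    r"\s*<\s*([^<>]+?)\s*>"       # subject URN
    r"\s*<\s*([^<>]+?)\s*>"       # object URN
    r"(?:\s*=\s*<\s*([^<>]+?)\s*>)?\s*$"  # optional relation value
)


def parse_element(text):
    """Return the URN, subject, object, and optional value of a dictionary element."""
    match = ELEMENT.match(text)
    if match is None:
        raise ValueError(f"not a relation dictionary element: {text!r}")
    urn, subject, obj, value = match.groups()
    return {"urn": urn, "subject": subject, "object": obj, "value": value}


equivalence = parse_element("REUN.1.7: <N_A_UNPS.1.3><N_A_UNPS.1.53>")
identifier = parse_element("RESP.3.m:<UNPS.k><UNBD.p>=<N_A_RESP.3.q>")
```

An element of the first type yields no value, while an element of the second type carries the URN of the value after the equals sign.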

CDSSK.9: Structure of dictionaries of objects.

The dictionary name coincides with the URN of the class containing a given object. A dictionary element is the list of URN of elements of the attribute and relation dictionaries corresponding to the given object. In particular, this list should contain the URN of the values of attributes declared mandatory, which is controlled during the data input using the corresponding element of the format dictionary. Before giving an example of an attribute dictionary, we consider an example of formalizing the description of a specific relation between a person and a publication.

Each digital object in CDSSK belongs to only one class and represents a set of values of its characterizing attributes and relations to other objects, which is a component of the object dictionary for the given class.

The proposed structure of the CDSSK content description is such that new subspaces can be added to the space and new forms of objects, attributes, and relations can be added to subspaces. Unification and structuring of elements significantly simplify and speed up the computer processing of data. Introduction of the object class “Formats” allows us to construct a typical set of algorithms for formal-logical data control and to automatically choose the necessary ones in every particular case.

The components of all dynamic dictionaries are formed automatically during the input of data in the CDSSK. This can be implemented as a software procedure if content is imported from an external structure: the batch data input application program processes directories of attributes and relations and writes elements in corresponding dictionaries. If objects are input by an operator, then a dialog program works, which, based on processing the directories, asks the operator to form the values of attributes of the input object and its relations to other objects. For each attribute and each relation, the operator should select values available in the CDSSK or enter new values. Input data is controlled automatically using data formats available in corresponding directories.
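The “select an available value or enter a new one” step described above amounts to interning values in a dynamic dictionary. The sketch below is a hypothetical in-memory illustration; the class name, its methods, and the resulting element indices are assumptions and do not correspond to actual CDSSK content.

```python
class ValueDictionary:
    """A dynamic dictionary that stores each value once and assigns it a URN."""

    def __init__(self, urn):
        self.urn = urn
        self._values = []           # values in order of first input
        self._index_by_value = {}   # value -> 1-based index

    def intern(self, value):
        """Return the URN of the value, adding it only if it is not yet present."""
        index = self._index_by_value.get(value)
        if index is None:
            self._values.append(value)
            index = len(self._values)
            self._index_by_value[value] = index
        return f"{self.urn}.{index}"

    def lookup(self, element_urn):
        """Translate an element URN back into the stored value."""
        index = int(element_urn.rsplit(".", 1)[1])
        return self._values[index - 1]


surnames = ValueDictionary("N_A_UNPS.1")
first = surnames.intern("Savin")
again = surnames.intern("Savin")     # existing value, same URN
second = surnames.intern("Ivanov")   # new value, next index
```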

All values of object attributes are stored in a single copy in separate dictionaries. In processing a query that includes specified values of attributes and relations, at the first step of the search algorithm, these values are replaced by their URN (or a group of URN, taking into account the Boolean logic of the query). As a result, further processing takes place at the level of structured URN, which significantly speeds up the process of finding the desired objects and navigating over related resources. The search results are elements of object and relation dictionaries that contain the URN of query elements. To visualize search results, the URN names are transformed back into meaningful information based on directories whose names are rigidly linked to dictionaries.
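The three steps described above (replace values by URN, match at the URN level, translate the URN back for display) can be sketched as follows. The data, the URN indices, and the function names are hypothetical, and the Boolean logic is reduced to one AND group and one OR group for brevity.

```python
# Hypothetical value dictionaries flattened into one lookup table.
VALUE_TO_URN = {
    "Savin": "N_A_UNPS.1.1",
    "Ivanov": "N_A_UNPS.1.2",
    "doctor of sciences": "N_A_UNPS.8.1",
}
URN_TO_VALUE = {urn: value for value, urn in VALUE_TO_URN.items()}

# Hypothetical object dictionary: object URN -> URN of its attribute values.
OBJECTS = {
    "UNPS.1": {"N_A_UNPS.1.1", "N_A_UNPS.8.1"},
    "UNPS.2": {"N_A_UNPS.1.2", "N_A_UNPS.8.1"},
}


def search(all_of=(), any_of=()):
    """Find objects holding every value in all_of and, if given, one of any_of."""
    # Step 1: replace query values by their URN (unknown values raise KeyError).
    required = {VALUE_TO_URN[value] for value in all_of}
    alternatives = {VALUE_TO_URN[value] for value in any_of}
    # Step 2: match purely at the URN level.
    hits = []
    for object_urn, urns in sorted(OBJECTS.items()):
        if required <= urns and (not alternatives or alternatives & urns):
            hits.append(object_urn)
    return hits


def render(object_urn):
    """Step 3: translate the URN of an object's values back into text."""
    return sorted(URN_TO_VALUE[urn] for urn in OBJECTS[object_urn])


doctors = search(all_of=["doctor of sciences"])
doctor_savin = search(all_of=["doctor of sciences"], any_of=["Savin"])
shown = render(doctor_savin[0])
```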

Work is under way to model the proposed CDSSK ontology structure on the example of the universal subspace. Currently, 12 auxiliary and 10 subject classes of objects have been identified.

The auxiliary classes of objects include

• data formats;

• universal classification systems;

• scientific directions;

• locations (spatial characteristics);

• time characteristics, dates;

• quantitative characteristics;

• qualitative characteristics;

• universal constants;

• units of measurement;

• languages;

• groups of persons;

• collections;

• numerical values.

The subject classes include

• persons;

• publications;

• qualification works;

• documents;

• museum objects;

• images and multimedia objects;

• events;

• institutions;

• polythematic databases, resource catalogs;

• awards, grants.

In most cases, it is clear from the class name what its objects are. However, in the presence of the class “Classification systems,” the class “Scientific directions” requires some explanation. In CDSSK this class is needed to describe processes associated with complex research as formulated in documents related to science planning. Obviously, each object of this class is connected with one or several “Regulations” from the class “Documents” by a specific relation of the type “First formulated,” as well as with a set of objects of the class “Universal classification systems,” namely, specific sections of the SCSTI, UDC, HAC, etc. The latter is necessary for an integrated analysis of research results in a particular scientific field.

The class “Groups of persons” has been introduced to indicate target categories of users for scientific resources (for example, textbooks for students of a specific specialty, popular-science publications for high school students, etc.).

For each class, corresponding directories and elements of static dictionaries are formed. Specific relations between objects of different classes and values of these relations are defined.

Below are several examples of such relations.

A simple specific relation of a person to a publication having multiple attributes (one person can play several roles):

—author;

—editor;

—compiler;

—translator;

—artist;

—personal data;

—copyright owner.

A simple specific relation of a person to a museum object having multiple attributes:

—author (creator);

—collector (for natural-science collections);

—owner;

—donor.

A compound quasi-universal relation of a publication to a classification system; it consists of two components: URN of the classification system name and URN of the index corresponding to the publication.

A simple specific relation between a discovery and its publication with attributes:

first publication date;

recommended textbook for university students;

recommended popular-science publications.

As individual elements are developed, they are implemented in the SHR digital library as a model of a CDSSK fragment. In the current version of SHR (http://e-heritage.ru), search for publications, persons, and museum objects corresponding to specified values of the above-described relations has been implemented within the advanced search option.

4 CONCLUSIONS

The creation of CDSSK includes the construction of its ontology and ontologies of thematic subspaces and requires coordinated efforts made by leading information technology professionals, representatives of various scientific fields, and holders of information resources.