1 Introduction

Today’s scholarly communication is a document-centred process and, as such, rather inefficient. Scientists spend considerable time finding, reading and reproducing research results from PDF files consisting of static text, tables, and figures. The explosion in the number of published articles [12] aggravates this situation further: it gets harder and harder to stay on top of current research, that is, to find relevant works, compare and reproduce them and, later on, to make one’s own contribution known for its quality.

Some of the available infrastructures in the research ecosystem already use knowledge graphs (KGs) to enhance their services. Academic search engines, for instance, employ metadata-based graph structures such as the Microsoft Academic Knowledge Graph [24] or the Literature Graph [3], which link research articles based on citations, shared authors, venues and keywords.

Recently, initiatives have promoted the usage of KGs in science communication, but on a deeper, semantic level [4, 32, 37, 48, 51, 54]. They envision the transformation of the dominant document-centred knowledge exchange into knowledge-based information flows by representing and expressing knowledge through semantically rich, interlinked KGs. Indeed, they argue that a shared structured representation of scientific knowledge has the potential to alleviate some of science communication’s current issues: relevant research could be easier to find, comparison tables could be compiled automatically, and one’s own insights could be rapidly placed in the current ecosystem. Such a powerful data structure could also encourage the interconnection of research artefacts such as datasets and source code much more than current approaches (like DOI references etc.), allowing for easier reproducibility and comparison. To come closer to the vision of knowledge-based information flows, research articles should be enriched and interconnected through machine-interpretable semantic content. Jaradeh et al.’s study [37] indicates that authors are also willing to contribute structured descriptions of their research articles.

The work of a researcher is manifold, but current proposals usually focus on a specific use case (e.g. the above-named examples focus on enhancing academic search). In this paper, we provide a detailed analysis of common work tasks in a scientist’s daily life and analyse (a) how they could be supported by an Open Research Knowledge Graph (ORKG), (b) which requirements result for the design of (b1) the KG and (b2) the surrounding system, and (c) how different use cases overlap in their requirements and can benefit from each other. Our analysis is guided by the following research questions:

  1. What functionalities should be provided by ORKG interfaces?
    (a) Which user interfaces are necessary?
    (b) Which machine interfaces are necessary?

  2. What requirements can be defined for the underlying ontologies?
    (a) Which granularity of information representation is needed?
    (b) To what degree is domain specialisation needed?

  3. What requirements can be defined for the instance data?
    (a) Which approaches (human vs. machine) are suitable to populate the KG?
    (b) Which coverage of research artefacts is necessary for the instance data?
    (c) Which quality is necessary for the instance data?

We follow the design science research (DSR) methodology [33]. In this study, we focus on the first phase of DSR and conduct a requirements analysis. The objective is to chart necessary (and desirable) requirements for successful KG-based science communication and, consequently, to provide a map for future research.

The remainder of the paper is organised as follows. Section 2 summarises related work on research knowledge graphs, scientific ontologies and methods for KG construction. The requirements analysis is presented in Sect. 3, while Sect. 4 discusses implications and possible approaches for ORKG construction. Finally, Sect. 5 concludes the requirements analysis and outlines areas of future work.

2 Related Work

This section provides a brief overview of (a) existing research KGs, (b) ontologies representing scholarly knowledge, and (c) approaches for KG construction.

2.1 Research Knowledge Graphs

Academic search engines (e.g. Google Scholar, Microsoft Academic, SemanticScholar) exploit graph structures such as the Microsoft Academic Knowledge Graph  [24], SciGraph  [68], or the Literature Graph  [3]. These graphs interlink research articles through metadata, e.g. citations, authors, affiliations, grants, journals, or keywords.

To help reproduce research results, initiatives such as Research Graph [2], Research Objects [7] and OpenAIRE [48] interlink research articles with research artefacts such as datasets, source code, software, and presentation videos. Scholarly Link Exchange (Scholix) [16] aims to create a standardised ecosystem to collect and exchange links between research artefacts and literature.

Some approaches were proposed to interlink articles at a more semantic level: Paperswithcode.com is a community-driven effort to link machine learning articles with tasks, source code and evaluation results to construct leaderboards. Ammar et al.  [3] interlink entity mentions in abstracts with DBpedia  [43] and Unified Medical Language System (UMLS)  [10], and Cohan et al.  [17] extend the citation graph with semantic citation intents (e.g. cites as background or as used method).

Various scholarly applications benefit from semantic content representation, e.g. academic search engines by exploiting general-purpose KGs  [67], and graph-based research paper recommendation systems  [8] by utilising citation graphs and mentioned genes. However, the coverage of science-specific concepts in general-purpose KGs is rather low [3], e.g. the task “geolocation estimation of photos” from Computer Vision is neither present in Wikipedia nor in CSO (Computer Science Ontology)  [59].

2.2 Scientific Ontologies

Various ontologies have been proposed to model metadata such as bibliographic resources and citations  [53]. Iniesta and Corcho  [58] reviewed ontologies to describe scholarly articles. In the following, we describe some ontologies that conceptualise the semantic content in research articles.

Several ontologies focus on rhetorical  [19, 30, 66] (e.g. Background, Methods, Results, Conclusion), argumentative  [45, 63] (e.g. claims, contrastive and comparative statements about other work) or activity-based [54] (e.g. sequence of research activities) aspects and elements of research articles. Others describe scholarly knowledge with interlinked entities such as problem, method, theory, statement  [15, 32], or focus on the main research findings and characteristics of research articles described in surveys with concepts such as problems, approaches, implementations, and evaluations  [25, 64].

There are various domain-specific ontologies, for instance, mathematics  [42] (e.g. definitions, assertions, proofs) and machine learning  [40, 49] (e.g. dataset, metric, model, experiment). The EXPeriments Ontology (EXPO) is a core ontology for scientific experiments conceptualising experimental design, methodology, and results  [61].

Taxonomies for domain-specific research areas support the characterisation and exploration of a research field. Salatino et al. [59] provide an overview, e.g. Medical Subject Headings (MeSH), Physics Subject Headings (PhySH), Computer Science Ontology (CSO). Gene Ontology [1] and Chemical Entities of Biological Interest (ChEBI) [21] are KGs for genes and molecular entities.

2.3 Construction of Knowledge Graphs

Automatic Construction from Text: Petasis et al. [55] provide a review on ontology learning, that is, ontology creation from text, while Lubani et al. [47] review ontology population systems. Pujara and Singh [56] provide an overview of the tasks involved in KG population: (a) knowledge extraction, which extracts a graph from text by means of entity extraction and relation extraction, and (b) graph construction, which cleans and completes the extracted graph, as it is usually ambiguous, incomplete and inconsistent. Coreference resolution [46] clusters different mentions of the same entity, and entity linking [41] maps them to entities in the KG. For taxonomy population, Salatino et al. [59] provide an overview of methods based on rule-based natural language processing (NLP), clustering and statistical methods. In particular, the Computer Science Ontology (CSO) has been populated automatically from research articles [59].

Information Extraction from Scientific Text: Nasar et al.  [50] provide a survey about scientific information extraction. Beltagy et al.  [9] present benchmarks for several datasets.

There are datasets which are annotated at sentence level for several domains, e.g. biomedical  [22, 38], computer graphics  [28], computer science  [18], chemistry and computational linguistics  [63]. They focus either on the rhetorical structure in abstracts [18, 22, 38] or full articles [28, 45], or on the argumentative structure of full articles  [63]. The datasets differentiate between five and twelve concept classes (e.g. Background, Objective, Results). On abstracts and full articles machine learning approaches achieve an F1 score of 83–92%  [18] or 51–80%  [28, 44], respectively.

More recent corpora, annotated at phrasal level, aim at constructing a fine-grained KG from scholarly abstracts with the tasks of concept extraction [5, 13, 31, 46], relation extraction [5, 29, 46], and coreference resolution [46]. They cover several domains, e.g. computational linguistics [29, 31]; computer science, material sciences, and physics [5]; machine learning [46]; or a set of ten scientific, technical and medical domains [13]. The datasets differentiate between four to seven concept classes (like Task, Method, Tool) and between two to seven relation types (like used-for, part-of, evaluate-for). Concept extraction, coreference resolution and relation extraction achieve an F1 score of 45–89% [5, 9, 13], 48% [46] and 28–50% [5, 29, 46], respectively, and the inter-coder agreement is 60–76% [5, 13, 46], 68% [46] and 60–90% [5, 29, 31, 46], respectively. This indicates that these tasks are difficult not only for machines but also for humans.

Manual Curation: WikiData [65] is one of the most popular KGs with semantically structured, encyclopaedic knowledge curated manually by a community. As of March 2020, WikiData comprises 80M entities curated by almost 25,000 active contributors. The community also maintains a taxonomy of categories and “infoboxes” which define common properties of certain entity types. Paperswithcode.com is a further community-driven effort to interlink machine learning articles with tasks, source code and evaluation results. KGs such as Gene Ontology [1] or WordNet [26] are curated by domain experts. Research article submission portals such as easychair.org require the submitter to provide machine-readable metadata. Librarians and publishers tag new articles with keywords and subjects [68]. Virtual research environments enable the execution of data analysis on interoperable infrastructure and store the data and results in KGs [62].

3 Requirements Analysis

As the discussion of related work reveals, existing research KGs focus on specific use cases (e.g. improve search engines, help to reproduce research results) and mainly manage metadata and research artefacts about articles. We envision a KG in which research articles are interlinked through a deep semantic representation of their content to enable further use cases. In the following, we formulate the problem statement and describe our research method. This motivates our use case analysis in Sect. 3.1, from which we derive requirements for an ORKG.

Problem Statement: Scholarly knowledge is very heterogeneous and diverse. Therefore, an ontology that conceptualises scholarly knowledge comprehensively does not exist (and is unlikely ever to exist). Besides, due to the complexity of the task, the population of comprehensive ontologies requires domain and ontology experts. Current automatic approaches can only populate rather simple ontologies and achieve moderate accuracy (see Sect. 2.3). On the one hand, we desire an ontology that can comprehensively capture scholarly knowledge and instance data with high quality and coverage. On the other hand, we are faced with a “knowledge acquisition bottleneck”.

Research Method: To illuminate the above problem statement we perform a requirements analysis. We follow the design science research (DSR) methodology  [14, 35]. The requirements analysis is a central phase in DSR, as it is the basis for design decisions and selection of methods to construct effective solutions systematically  [14]. DSR’s objective in general is the innovative, rigorous and relevant design of information systems for solving important business problems or the improvement of existing solutions  [14, 33]. To elicit requirements, we studied guidelines for systematic literature reviews  [27, 39, 52] and interviewed members of the ORKG team at TIB (https://projects.tib.eu/orkg/project/team/), who are software engineers and researchers in the field of computer science and environmental sciences. Based on the requirements, we elaborate possible approaches to construct an ORKG, which were identified through a literature review (see Sect. 2.3). To verify our assumptions on the presented requirements and approaches, ORKG team members reviewed them.

3.1 Overview of the Use Cases

We define functional requirements with use cases  [11]. A use case describes the interaction between a user and the system from the user’s perspective to achieve a certain goal. As a motivating scenario it also guides the design of a supporting ontology  [20].

Fig. 1. UML use case diagram for the main use cases between the actor researcher, an Open Research Knowledge Graph (ORKG), and external systems.

There are many use cases (e.g. literature reviews, plagiarism detection, peer reviewer suggestion) and several stakeholders (e.g. researchers, librarians, peer reviewers, practitioners) that may benefit from an ORKG. In this study, we focus on use cases that support researchers in (a) conducting literature reviews, (b) obtaining a deep understanding of a research article and (c) reproducing research results. A full discussion of all possible use cases of graph-based knowledge management systems in the research environment is far beyond the scope of this article. With the chosen focus, we hope to cover the most frequent, literature-oriented tasks of scientists. Figure 1 depicts the main identified use cases, which are described briefly in the following. Please note that we focus on how semantic content, rather than further metadata, can improve these use cases.

Get Research Field Overview: Survey articles provide an overview of a particular research field, e.g. a certain research problem or a family of approaches. The results in such surveys are sometimes summarised in structured and comparative tables (an approach usually followed in domains such as computer science, but not as systematically practised in other fields). However, once survey articles are published they are no longer updated. Moreover, they usually represent only the perspective of the authors, i.e. very few researchers in the field. To help researchers obtain an up-to-date overview of a research field, the system should maintain such surveys in a structured way, and allow for dynamics and evolution. A researcher interested in such an overview should be able to search or browse the desired research field. Then, the system should provide related articles and available overviews, e.g. in a table or a leaderboard chart. While the user interface shows tables, leaderboards, or other visual representations, the backend should represent the information semantically so that overlaps in conceptualisations between research problems or fields can be exploited.

Find Related Work: Finding relevant research articles is a daily core activity of researchers. It should be possible to pose queries for related work that express either fine-grained or broad search intents. Systems should preferably support natural language queries as approached by semantic search and question answering engines [6]. The system has to return a set of relevant articles.

Assess Relevance: Given a set of relevant articles, the researcher has to assess whether the articles match the criteria of interest. Usually researchers skim through the title and abstract. Sometimes, the introduction and conclusions have to be considered. However, this is usually cumbersome and time-consuming. Presenting the researcher with only the most important zones in the article in a structured way can speed up this process. This includes, for instance, text passages that describe the problem tackled in the research work, the employed methods or materials, or the yielded results. Also, faceted drill-down methods based on the properties of semantic descriptions of research approaches will empower researchers to quickly filter and zoom into the most relevant literature.

Extract Relevant Information: To tackle a particular research question, the researcher has to extract relevant information from research articles. Such information is usually compiled in written text or comparison tables in a related work section or survey articles. For instance, for the question “Which datasets exist for scientific sentence classification?”, a researcher who focuses on a new annotation study could be interested in (a) domains covered by the dataset and (b) the inter-coder agreement. Another researcher who pursues the same question with a focus on machine learning could be interested in (c) evaluation results and (d) feature types used. The system should support the researcher with tailored information extraction from a set of research articles: (1) the researcher defines a data extraction form as proposed in systematic literature reviews [39] (e.g. the above fields (a)–(d)) and (2) the system presents the extracted information for the corresponding data extraction form and articles in a table.
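The two steps of this use case can be illustrated with a minimal Python sketch. It is not part of the proposed system: the names ExtractionForm and build_comparison_table, the article identifiers and all values are hypothetical and chosen only for illustration. The sketch assumes that the KG already holds structured statements per article and leaves missing values empty.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ExtractionForm:
    """A user-defined data extraction form: a question plus the fields of interest."""
    question: str
    fields: List[str]


def build_comparison_table(form: ExtractionForm,
                           articles: Dict[str, Dict[str, Any]]) -> List[List[Any]]:
    """Return a header row followed by one row per article.

    `articles` maps an article identifier to the structured statements
    available for it; a field without a statement is reported as None.
    """
    header = ["article"] + form.fields
    rows = [[article_id] + [statements.get(name) for name in form.fields]
            for article_id, statements in articles.items()]
    return [header] + rows


# Hypothetical instance data, invented purely for illustration.
articles = {
    "article-A": {"domains": "biomedicine", "inter_coder_agreement": 0.71},
    "article-B": {"domains": "computer science", "f1_score": 0.83},
}
form = ExtractionForm(
    question="Which datasets exist for scientific sentence classification?",
    fields=["domains", "inter_coder_agreement"],
)
table = build_comparison_table(form, articles)
for row in table:
    print(row)
```

A second researcher with a machine learning focus would define a different form over the same articles, e.g. with the fields for evaluation results and feature types, and obtain a different table from the same instance data.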

Get Recommended Articles: When the researcher focuses on a particular article, further related articles should be recommended by the system, for instance, articles that address the same research problem or apply similar methods.

Obtain Deep Understanding: The system should help the researcher to obtain a deep understanding of a research article (e.g. equations, algorithms, diagrams, datasets). For this purpose, the system should interlink the article with artefacts such as conference videos, presentations, source code, datasets, etc., and visualise the artefacts appropriately. Text passages can also be interlinked, e.g. with method explanations in Wikipedia or with source code snippets implementing algorithms or equations described in the article.

Reproduce Results: The system should provide the researcher with links to all artefacts necessary to reproduce research results, e.g. datasets, source code, virtual research environments, materials describing the study, etc. Further, the system should maintain semantic descriptions of domain-specific and standardised evaluation protocols and guidelines.

3.2 Knowledge Graph Requirements

Table 1. Requirements and approaches for the main use cases. The upper part describes the minimum requirements for the ontology (domain specialisation and granularity) and the instance data (coverage and quality). The bottom part provides possible approaches for manual, automatic and semi-automatic curation of the KG for the respective use cases. “X” indicates that the approach is suitable for the use case while “(x)” means that the approach is only appropriate with human supervision. The left part (delimited by the vertical triple line) groups use cases suitable for manual, and the right side for automatic approaches. Vertical double lines group use cases with similar requirements.

The non-functional requirements for the respective use cases are discussed in the light of the following dimensions.

  1. Domain specialisation of the ontology: How domain-specific should the concepts be in the ontology? Various ontologies (e.g. [13, 54]) propose domain-independent concepts (e.g. Process, Method, Material). In contrast, Klampanos et al. [40] present a very domain-specific ontology for artificial neural networks.

  2. Granularity of the ontology: Which granularity is required to conceptualise scholarly knowledge? For instance, the annotation schemes for scientific corpora (see Sect. 2.3) have a rather low granularity, as they do not have more than 10 classes and 10 relation types. In contrast, various ontologies (e.g. [32, 54]) with more than 20–35 classes and over 20–70 relations and properties are fine-grained and have a relatively high granularity.

  3. Coverage of the instance data: Given an ontology, to which extent do all possible instances in all research articles have to be represented in the KG? For instance, given an ontology with a class “Task”, the instance data for that ontology would have a high coverage if all tasks mentioned in all research articles are present.

  4. Quality of the instance data: Given an ontology, which quality is necessary for the corresponding instances? In a KG with high quality all present instances must conform to the ontology and reflect the content of the research articles properly, e.g. an article is correctly assigned to the task addressed in the article, the F1 score in the evaluation results is correctly extracted, etc.
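To make the last two dimensions more tangible, the following Python sketch computes them for a toy example. Reading coverage as a recall-like ratio and quality as a precision-like ratio is an assumption of this sketch and not a definition given above; conformance to the ontology, which is also part of quality, is not modelled. All instances are hypothetical.

```python
from typing import Set, Tuple

# An instance is modelled as (article, ontology class, value).
Instance = Tuple[str, str, str]


def coverage(kg: Set[Instance], mentioned: Set[Instance]) -> float:
    """Share of the instances mentioned in the articles that are present in the KG."""
    if not mentioned:
        return 1.0
    return len(kg & mentioned) / len(mentioned)


def quality(kg: Set[Instance], correct: Set[Instance]) -> float:
    """Share of the instances in the KG that reflect the articles correctly."""
    if not kg:
        return 1.0
    return len(kg & correct) / len(kg)


# Hypothetical gold standard: all tasks actually mentioned in two articles.
gold = {
    ("article-A", "Task", "sentence classification"),
    ("article-A", "Task", "entity linking"),
    ("article-B", "Task", "relation extraction"),
    ("article-B", "Task", "coreference resolution"),
}
# Hypothetical KG content: two correct instances and one wrong assignment.
kg = {
    ("article-A", "Task", "sentence classification"),
    ("article-B", "Task", "relation extraction"),
    ("article-B", "Task", "question answering"),
}

print(coverage(kg, gold))  # 2 of 4 mentioned tasks are present
print(quality(kg, gold))   # 2 of 3 stored instances are correct
```

Under this reading, a use case that tolerates missing information accepts a low coverage value, whereas a use case that must not show wrong values needs a quality value close to 1.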

Next, we discuss the seven main use cases with regard to the required level of ontology domain specialisation and granularity, as well as coverage and quality of instance data. Table 1 summarises the requirements for the use cases along the four dimensions on an ordinal scale. Use cases are grouped together when they have (1) similar justifications for the requirements, and (2) a high overlap in ontology concepts and instances.

Extract Relevant Information and Get Research Field Overview: The information to be extracted from relevant research articles for a data extraction form is very heterogeneous and depends highly on the intent of the researcher and the research questions. Thus, the ontology has to be domain-specific and fine-grained to offer all possible kinds of desirable information. In addition, the provided information has to be of high quality, e.g. a provided F1 score of an evaluation result must not be wrong. However, missing information for certain questions in the KG may be tolerable for a researcher.

Obtain Deep Understanding and Reproduce Results: The provided information for these use cases has to be of high quality (e.g. accurate links to datasets, source code, videos, articles, research infrastructures). The ontology for representing default artefacts can be rather domain-independent (e.g. Scholix [16]). However, the semantic representation of evaluation protocols requires domain-dependent ontologies (e.g. EXPO [61]). Missing information is tolerable for these use cases.

Find Related Work and Get Recommended Articles: When searching for related work, it is essential not to miss relevant articles. Previous studies revealed that more than half of search queries in academic search engines refer to scientific entities  [67] and the coverage of scientific entities in KGs is rather low  [3]. Despite the low coverage, Xiong et al.   [67] could improve the ranking of search results by exploiting KGs. Hence, the instance data for the “find related work” use case should have high coverage with fine-grained scientific entities. However, semantic search engines employ latent representations of KGs and text (e.g. graph and word embeddings)  [6]. Since a non-perfect ranking of the search results is tolerable for a researcher, lower quality of the instance data is acceptable. Furthermore, due to latent feature representations, the ontology can be kept rather simple and domain-independent. For instance, the STM corpus  [13] proposes four domain-independent concepts. Graph- and content-based research paper recommendation systems  [8] have similar requirements since they also leverage latent feature representations, require fine-grained scientific entities, and non-perfect recommendations are tolerable.

Assess Relevance: To help the researcher to assess the relevance of an article according to her needs, the system should highlight the most essential zones in the article to get a quick overview. The coverage and quality of the presented information must not be too low, as otherwise the user acceptance may suffer. However, it can be suboptimal, since it is acceptable for a researcher when some of the highlighted information is not essential or when some important information is missing. The ontology to represent essential information should be rather domain-specific and quite simple (cf. ontologies for scientific sentence classification in Sect. 2.3).

4 Implications for ORKG Construction

In this section, we discuss the implications for the design and construction of an ORKG and outline possible approaches, which are mapped to the use cases in Table 1. Based on the discussion in the previous section, we can subdivide the use cases into two groups: (1) requiring high quality and high domain specialisation with only low requirements on the coverage (left side in Table 1), and (2) requiring high coverage with rather low requirements on the quality and domain specialisation (right side in Table 1). The first group requires manual approaches while the second group could be accomplished with fully automatic approaches. However, manually curated data can also support use cases with automatic approaches, and vice versa. Besides, automatic approaches can complement manual approaches by providing suggestions in user interfaces.

Fig. 2. Conceptual meta-model in UML for templates and interface design for an external template-based information extractor.

4.1 Manual Approaches

Ontology Design: The first group of use cases requires rather domain-specific and fine-grained ontologies. We suggest developing new ontologies or reusing existing ones that fit the respective use case and the specific domain (e.g. EXPO [61] for experiments). Moreover, appropriate and simple user interfaces are necessary for efficient and easy population.

However, such ontologies can evolve with the help of the community, as demonstrated by WikiData and Wikipedia with “infoboxes” (see Sect. 2.3). Therefore, the system should enable the maintenance of templates, which are pre-defined and very specific forms consisting of fields with certain types (see Fig. 2). For instance, to automatically generate leaderboards for machine learning tasks, a template would have the fields Task, Model, Dataset and Score. For articles that provide such results, a curator can then fill in these fields in a user interface generated from the template. Such an approach is also called meta-modelling [11], as the meta-model for templates enables the definition of concrete templates, which are then instantiated for articles.
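The three levels of this meta-modelling approach (meta-model, concrete template, instance per article) can be sketched in Python as follows. The class names TemplateField, Template and TemplateInstance, the leaderboard function and all example values are hypothetical; the sketch is not the meta-model of Fig. 2, and it assumes that a higher score is better.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class TemplateField:
    """Meta-model level: a named field with an expected value type."""
    name: str
    value_type: type


@dataclass(frozen=True)
class Template:
    """Meta-model level: a template is a named collection of typed fields."""
    name: str
    fields: Tuple[TemplateField, ...]


@dataclass
class TemplateInstance:
    """Instance level: a template filled in for one article."""
    template: Template
    article: str
    values: Dict[str, Any]

    def __post_init__(self) -> None:
        # Reject instances that do not match the template definition.
        expected = {f.name: f.value_type for f in self.template.fields}
        if set(self.values) != set(expected):
            raise ValueError(f"fields must be exactly {sorted(expected)}")
        for name, value in self.values.items():
            if not isinstance(value, expected[name]):
                raise TypeError(f"{name} must be of type {expected[name].__name__}")


# Template level: the leaderboard template with the four fields named above.
LEADERBOARD = Template("Leaderboard", (
    TemplateField("Task", str), TemplateField("Model", str),
    TemplateField("Dataset", str), TemplateField("Score", float),
))


def leaderboard(instances: List[TemplateInstance], task: str,
                dataset: str) -> List[TemplateInstance]:
    """Rank all instances for one task and dataset by descending score."""
    rows = [i for i in instances
            if i.values["Task"] == task and i.values["Dataset"] == dataset]
    return sorted(rows, key=lambda i: i.values["Score"], reverse=True)


# Hypothetical curated instances, invented purely for illustration.
instances = [
    TemplateInstance(LEADERBOARD, "article-A", {
        "Task": "sentence classification", "Model": "Model-X",
        "Dataset": "Dataset-1", "Score": 0.81}),
    TemplateInstance(LEADERBOARD, "article-B", {
        "Task": "sentence classification", "Model": "Model-Y",
        "Dataset": "Dataset-1", "Score": 0.86}),
    TemplateInstance(LEADERBOARD, "article-C", {
        "Task": "sentence classification", "Model": "Model-Z",
        "Dataset": "Dataset-2", "Score": 0.90}),
]
ranked = leaderboard(instances, "sentence classification", "Dataset-1")
for entry in ranked:
    print(entry.article, entry.values["Model"], entry.values["Score"])
```

Because the template is data rather than code, the community could add or change templates without changes to the system, and a form-based user interface could be generated from the field definitions.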

Knowledge Graph Population: Several user interfaces are required to enable manual population: (1) populating semantic content for a research article by (1a) choosing relevant templates or ontologies and (1b) filling in the values; (2) managing terminology (e.g. domain-specific research fields); (3) maintaining research field overviews by (3a) assigning relevant research articles to the research field, (3b) defining corresponding templates and (3c) filling in the templates for the relevant research articles.

Further, the system should also provide APIs to enable population by third-party applications, e.g. (i) submission portals such as easychair.org during submission of an article; (ii) authoring tools such as overleaf.com during writing; (iii) virtual research environments  [62] to store evaluation results and links to datasets and source code during experimenting and data analysis.

To encourage crowd-sourced content, we see the following options: (a) top-down enforcement via submission portals and publishers; (b) incentive models: Researchers want their articles to be cited; semantic content helps other researchers to find, explore and understand an article; (c) provide public acknowledgements for curators.

4.2 (Semi-)automatic Approaches

The second group of use cases requires a high coverage, while a rather low quality and domain specialisation are acceptable. For these use cases, rather simple and domain-independent ontologies should be developed or reused.

Various approaches can be used to populate an ORKG (semi-)automatically. Methods for entity and relation extraction (see Sect. 2.3) can help to populate fine-grained KGs with high coverage, and entity linking approaches can link mentions in text with entities. For cross-modal linking, Singh et al. [60] propose an approach to detect URLs to datasets in research articles automatically, while the Scientific Software Explorer [34] interlinks text passages in research articles with code fragments. To extract relevant information at sentence level, approaches for sentence classification in scientific text can be applied (see Sect. 2.3). To help the curator fill in templates semi-automatically, template-based extraction can (1) suggest relevant templates for a research article and (2) pre-fill fields of templates with appropriate values. For pre-filling, approaches such as natural language inference, as used in leaderboard construction [36], or end-to-end question answering [23, 57] can be employed.

Further, the system should allow external information extractors to be plugged in, which are developed for certain scientific domains to extract specific types of information. For instance, as depicted in Fig. 2, an external template information extractor has to implement an interface with three methods. This enables the system (1) to filter relevant template extractors for an article and (2) to extract field values from an article.
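Such a plug-in interface could look like the following Python sketch. The three method names (template_name, is_relevant, extract) are hypothetical stand-ins, since the method names of Fig. 2 are not given in the text, and the regular-expression extractor is only a toy placeholder for a real information extraction model.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class TemplateExtractor(ABC):
    """Hypothetical interface that an external extractor would implement."""

    @abstractmethod
    def template_name(self) -> str:
        """Name of the template this extractor can fill."""

    @abstractmethod
    def is_relevant(self, article_text: str) -> bool:
        """Whether the template is likely to apply to the article."""

    @abstractmethod
    def extract(self, article_text: str) -> Dict[str, Optional[str]]:
        """Suggested field values; None marks a field that was not found."""


class F1ScoreExtractor(TemplateExtractor):
    """Toy extractor that fills a Score field from phrases like 'F1 score of 83%'."""

    _PATTERN = re.compile(r"F1 score of (\d+(?:\.\d+)?)\s*%")

    def template_name(self) -> str:
        return "Leaderboard"

    def is_relevant(self, article_text: str) -> bool:
        return "F1 score" in article_text

    def extract(self, article_text: str) -> Dict[str, Optional[str]]:
        match = self._PATTERN.search(article_text)
        return {"Score": match.group(1) if match else None}


def suggest_values(extractors: List[TemplateExtractor],
                   article_text: str) -> Dict[str, Dict[str, Optional[str]]]:
    """(1) Filter the relevant extractors, then (2) extract field values."""
    return {e.template_name(): e.extract(article_text)
            for e in extractors if e.is_relevant(article_text)}


suggestions = suggest_values(
    [F1ScoreExtractor()],
    "Our model achieves an F1 score of 83% on abstracts.",
)
print(suggestions)
```

In a semi-automatic workflow, such suggestions would only pre-fill the template form, and the curator would confirm or correct them before they enter the KG.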

5 Conclusions

In this paper, we have presented a requirements analysis for an Open Research Knowledge Graph (ORKG). An ORKG should represent the content of research articles in a semantic way to enhance or enable a wide range of use cases. We identified literature-related core tasks of a researcher that can be supported by an ORKG and formulated them as use cases. For each use case, we discussed specificities and requirements for the underlying ontology and the instance data. In particular, we identified two groups of use cases: (1) the first group requires high-quality instance data and rather fine-grained, domain-specific ontologies, but with moderate coverage; (2) the second group requires a high coverage, but the ontologies can be kept rather simple and domain-independent, and a moderate quality of the instance data is sufficient. Based on the requirements, we have described possible manual and semi-automatic approaches (necessary for the first group), and automatic approaches (appropriate for the second group) for KG construction. In particular, we propose a framework with lightweight ontologies that can evolve by community curation. Further, we have described the interdependence with external systems, user interfaces, and APIs for third-party applications to populate an ORKG.

The results of our work aim to provide a holistic view of the requirements for an ORKG and be a guideline for further research. The suggested approaches have to be refined, implemented and evaluated in an iterative and incremental process (see www.orkg.org for the current progress). Additionally, our paper can serve as a foundation for a discussion on ORKG requirements with other researchers and practitioners.