1 Introduction

The open science movement works towards the general availability of scientific insight and is considered one answer to the so-called “reproducibility crisis” [2]. Scientific results are often generated by a combination of software, data, and parameters, all of which contribute to the final result (and its interpretation). The complexity of these elements can hardly be described in a single article, and the publication alone often does not allow the achieved results to be fully reproduced. Work towards consistent reproducibility of scientific results has to tackle three main tasks: (a) motivating researchers to reproduce past results; (b) developing novel ways for the integrated presentation of scientific results; and (c) developing tools which allow for the exploration of existing scientific works.

The work at hand focuses on the two latter objectives. It presents a tool which facilitates the examination of existing research involving software through the joint exploration of a scientific article and the respective source code. The prototype allows the exploration of both in one interface, as well as the semi-automatic creation of semantic relations between them. The tool is complemented by basic visualisations. This kind of work is related to two research areas which have been active for decades: (a) automatic code analysis, and (b) automatic analysis of scientific publications. Solutions for automatic code analysis aim at generating textual documentation [7], summarising code [8], or generating visualisations [4]. Also common is the generation of formal code models using semantic technologies [1] or logical constructs, as realised in tools such as JTransformer. While there is much work on linking code to other (textual) resources (e.g. traceability [3]) or to documentation [4], on the automatic understanding of scientific publications [5], and on linking publications with software and archiving them [6], there has been little work so far on joint analytics of scientific software and publications [9].

2 SciSoftX: Scientific Software Explorer

The Scientific Software Explorer provides researchers with functionalities for exploring external article-software ensembles and/or annotating their own work for better comprehensibility. Its final version will provide functionalities such as (a) manual annotation of article-software relations, (b) semi-automatic discovery of relations, and (c) visualisations for relation exploration.

Fig. 1. Main window of the GUI: linked code references are highlighted in colour. (Colour figure online)

2.1 Functionality

SciSoftX allows the user to open and simultaneously view a software project and a publication (Fig. 1). Parsing and processing of source code is realised using ANTLR (ANother Tool for Language Recognition), which supports most of the relevant programming languages, while publications are processed via PDF.js. The user can manually link code identifiers to relevant locations in the publication. When the user moves the mouse over a linked identifier in the publication, a tooltip shows the relevant source code positions.

Automatic Discovery of Code Identifiers and Snippets: At the current stage, the tool contains a basic method for the detection of code-relevant text snippets, which relies on the common convention of setting code elements in monospace fonts. The identifiers found in this way are used to search the code model produced by ANTLR; if an identifier matches several code elements, the matches are disambiguated based on vicinity. In a random sample of 24 articles from computer science, the monospace-based linker was able to correctly detect 89.9% of the links annotated by a human expert.
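The monospace-based linking step can be illustrated with a minimal sketch. It is not the SciSoftX implementation: the `Span` and `Declaration` types, the list of font-name hints, and the function names are hypothetical, and "vicinity" is interpreted here, as an assumption, as preferring a declaration from the package of the previously linked identifier.

```python
from dataclasses import dataclass

# Hypothetical font-name fragments treated as monospace (assumption, not from the paper).
MONOSPACE_HINTS = ("mono", "courier", "consolas", "typewriter", "cmtt")


@dataclass(frozen=True)
class Span:
    text: str  # text of one run extracted from the publication
    font: str  # font name reported for that run
    page: int  # page on which the run occurs


@dataclass(frozen=True)
class Declaration:
    name: str     # identifier as declared in the source code
    package: str  # enclosing package of the declaration


def is_monospace(font: str) -> bool:
    """Return True if the font name contains one of the monospace hints."""
    lowered = font.lower()
    return any(hint in lowered for hint in MONOSPACE_HINTS)


def link_identifiers(spans, declarations):
    """Link monospace spans to declarations with the same name.

    Ambiguous names are resolved by an assumed vicinity heuristic: prefer the
    package of the previously linked identifier, else take the first match.
    """
    index = {}
    for decl in declarations:
        index.setdefault(decl.name, []).append(decl)

    links = []
    previous_package = None
    for span in spans:
        if not is_monospace(span.font):
            continue
        matches = index.get(span.text.strip(), [])
        if not matches:
            continue
        chosen = next(
            (d for d in matches if d.package == previous_package), matches[0]
        )
        links.append((span, chosen))
        previous_package = chosen.package
    return links


spans = [
    Span("the method", "Times-Roman", 1),
    Span("Parser", "CourierNew", 1),
    Span("run", "CMTT10", 2),
]
declarations = [
    Declaration("Parser", "core"),
    Declaration("run", "util"),
    Declaration("run", "core"),
]
links = link_identifiers(spans, declarations)
```

In this toy input, `run` is declared in two packages; the sketch picks the one in `core` because the preceding linked identifier, `Parser`, lives there. A real linker would obtain the declarations from the ANTLR code model and the font information from the PDF renderer.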

Manual Annotation of Links: To facilitate exchange between scientists, the tool also allows for the manual annotation of resources. In a step-wise process, the user marks an article snippet, selects the corresponding code elements, and annotates the established link with one of the pre-defined labels. The created set of links can be exported to an XML format and imported by an interested reader.
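The export and import of links can be sketched as a simple XML round trip. The element and attribute names below (`links`, `link`, `article`, `code`, `label`, `page`, `file`, `line`) are hypothetical; the paper does not specify the actual SciSoftX schema.

```python
import xml.etree.ElementTree as ET


def export_links(links):
    """Serialise a list of link dictionaries to an XML string (assumed schema)."""
    root = ET.Element("links")
    for link in links:
        element = ET.SubElement(root, "link", label=link["label"])
        article = ET.SubElement(element, "article", page=str(link["page"]))
        article.text = link["snippet"]
        code = ET.SubElement(
            element, "code", file=link["file"], line=str(link["line"])
        )
        code.text = link["identifier"]
    return ET.tostring(root, encoding="unicode")


def import_links(xml_text):
    """Parse an XML string produced by export_links back into dictionaries."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.findall("link"):
        article = element.find("article")
        code = element.find("code")
        links.append(
            {
                "label": element.get("label"),
                "page": int(article.get("page")),
                "snippet": article.text,
                "file": code.get("file"),
                "line": int(code.get("line")),
                "identifier": code.text,
            }
        )
    return links


# Hypothetical example link; the label value is illustrative only.
example_links = [
    {
        "label": "implements",
        "page": 3,
        "snippet": "the tokenizer",
        "file": "src/Tokenizer.java",
        "line": 42,
        "identifier": "Tokenizer",
    }
]
xml_text = export_links(example_links)
restored_links = import_links(xml_text)
```

A reader who receives the exported file can call `import_links` to restore the same set of annotations that the author created.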

Fig. 2. Graph-based view on connections between software and publication. Red nodes: mentions in publication; blue nodes: source code packages. (Colour figure online)

Visualisations: Graph-based visualisations illustrate relations between software and publication on different levels of abstraction. Figure 2 shows an example displaying the connections at the package (software) and page (publication) level.
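As an illustration of the package/page level of abstraction, the following sketch aggregates individual links into weighted edges between pages and packages. The input format (tuples of page number, package name, and identifier) and the function name are assumptions made for this example, not the data model of the tool.

```python
from collections import Counter


def page_package_edges(links):
    """Aggregate identifier-level links into (page, package) edges.

    The edge weight counts how many links connect a page to a package.
    """
    edges = Counter()
    for page, package, _identifier in links:
        edges[(page, package)] += 1
    return edges


# Hypothetical links: (page in publication, package in software, identifier)
example = [
    (1, "core", "Parser"),
    (1, "core", "Lexer"),
    (2, "core", "Parser"),
    (2, "util", "run"),
]
edges = page_package_edges(example)
pages = sorted({page for page, _ in edges})
packages = sorted({package for _, package in edges})
```

The resulting `pages` and `packages` correspond to the two node types of a bipartite graph, and `edges` gives the connections between them; finer levels of abstraction could be obtained by grouping on classes or methods instead of packages.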

2.2 Use Cases

Use Case 1 – Reader-side: A researcher reads a publication that refers to a piece of software and then tries to understand the structure and rationale of the software. This time-consuming task can be supported by the automatic creation of links between the textual description and the actual source code, and by the visualisations provided by SciSoftX. The user can click on nodes in the visualisation or on text elements that are highlighted in the publication to explore the implementation details, discover additional parameters, and understand the relevant code parts step by step. Furthermore, it is possible to manually add and save useful information and metadata, which can help future users to explore the software.

Use Case 2 – Author-side: Paper authors can use SciSoftX to ensure their software is easily understood, e.g. in a reviewing process or for re-use. To this end, they use the manual and automatic methods to annotate the semantic relations between their paper and the underlying software, and publish the annotations. The visualisation of cross-modal relations can help the authors (and the reviewers) to decide whether all relevant code parts and parameters are covered by the publication. In this way, the tool helps to evaluate the quality of the software description in a paper.

3 Conclusion

Reproducibility is one of the major issues of today’s scientific landscape. In this paper, we have reported on work in progress on an analytics tool that allows users to explore relations between scientific software and publications. To date, the tool features simple mechanisms for detecting links between software and publications, which serve as a proof of concept. Future work will explore (a) more powerful infrastructures for code analysis, and (b) more sophisticated means for text/image analysis, e.g. mapping diagrams and formulas to source code.