
1 Introduction

The Open Research Knowledge Graph (ORKG) [3] digital library treats scholarly content digitalization as a distributed, decentralized, and collaborative knowledge creation process. This process can be supported by automated semantification modules, developed in a continuous cycle as autonomously maintained AI micro-services. To this end, this paper presents ORKG-assays, an AI-based semantification micro-service trained on data structured according to the BioAssay Ontology (BAO) and integrated in the ORKG for the rapid assimilation of digitalized biological assays (bioassays). With ORKG-assays, the Life Sciences become the first domain supported by an automated semantification micro-service in the ORKG. To our knowledge, it also fosters the development of the first end-to-end bioassay digitalization workflow in the scholarly community at large.

The ORKG-assays micro-service workflow involves four steps: (1) querying a bioassay depository for a depositor's unstructured or semi-structured assays, where raw bioassay data are commonly obtained from PubChem [12], a major depository of bioassays from various research institutes; (2) semantifying the assay via the ORKG-assays AI clustering model; (3) linking the depositor-provided assay cross-references to their scientific articles; and (4) integrating the bioassay semantic graph in the ORKG. Programmed in Python, ORKG-assays provides web-based and programmatic tools for semantifying bioassay texts. Once entered in the ORKG, a semantified bioassay is editable via user-friendly frontend interfaces, surveyable via tabulations [11] or 2-D chart visualizations, and queryable for various scientific semantic ORKG relationships. The ORKG-assays AI clustering method achieves F1 scores above 80% and was chosen after testing diverse methods, including the state-of-the-art bidirectional transformer-based SciBERT model discussed in prior work [1].

In summary, ORKG-assays offers an accurate and pragmatic semantification model. It relieves scientists of the unrealistic expectation that they semantify their bioassays from scratch, and instead gives them the lighter role of curating automatic annotations. The pace at which novel bioassays are being submitted suggests that we have only begun to explore the scope of possible assay formats and technologies for interrogating complex biological systems. This data domain in particular thus promises long-term opportunities for application discovery, many of which remain untapped. Furthermore, if methods like the one we demonstrate drastically reduce the time required to semantify data in other scholarly domains as well, digitalization can realistically be advocated as a standard part of the publication process.

2 Bioassay Digitalization in the ORKG

We now discuss the implementation of ORKG-assays with respect to the KG Lifecycle requirements, which consist of the graph creation, hosting, curation, and deployment modules. The ORKG-assays micro-service belongs to an early stage of graph creation, i.e. generating the graph itself. Thus, while the normalization of variously formatted graph data handled by the graph creation module is beyond the scope of ORKG-assays, the service does address extracting assay texts from heterogeneous bioassay depositories, each with different file formats, and generating a BAO-based structured graph. The end-to-end ORKG-assays semantification pipeline, implemented as a micro-service, is discussed below.

Data Preparation. This step relies on public access to an assay depository's querying mechanism. PubChem, reported to have over 1 million assays [8], can be queried for its bioassays via a public REST API, and some assays have depositor-provided cross-references to scientific articles in PubMed. Depending on the depositor, the data may be returned in JSON, XML, or CSV. We implemented a specific pipeline for "The Scripps Research Molecular Screening Center", whose query responses were returned in JSON and which reported nearly 1,600 bioassays. To prepare the data, the sections containing the bioassay description had to be located in the JSON response and their text extracted. The text was merged from two separate parts, viz. the assay overview and the assay protocol summary. We noted that this parsing heuristic can be applied to most depositor responses, although there may be some exceptions.

Automated Clustering-Based Semantification. Traditionally, AI-based scholarly KG construction is addressed by recognizing entities and relations in scientific articles, framed as sequence labeling and classification objectives [5, 6, 7, 9, 10]. We instead address bioassay semantification with a clustering objective. We chose clustering based on our corpus observation that bioassays with similar text descriptions are semantified with similar sets of logical statements. Thus, the bioassays can be clustered based on their text descriptions, and each cluster can be collectively semantified by the labels of the trained cluster. While entity and relation classification are sound strategies, they would be unnecessarily complex and time-consuming for the problem at hand. We refer the reader to our prior work [2], which contrasts a classification objective with a clustering objective for bioassay semantification.
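To make the idea of semantifying a whole cluster at once concrete, the sketch below derives one statement set per cluster from the annotations of its training members. The aggregation rule (keep statements that occur in at least a given share of the members) and the names `cluster_statements` and `min_share` are assumptions made for illustration; the text above does not specify how cluster labels are derived. The predicate and object strings are illustrative BAO-style labels.

```python
from collections import Counter, defaultdict


def cluster_statements(assignments, annotations, min_share=0.5):
    """Return, per cluster, the statements shared by enough of its members.

    assignments: dict mapping assay id -> cluster id
    annotations: dict mapping assay id -> set of (predicate, object) pairs
    min_share:   hypothetical threshold; fraction of members that must
                 carry a statement for the cluster to inherit it
    """
    members = defaultdict(list)
    for assay_id, cluster_id in assignments.items():
        members[cluster_id].append(assay_id)

    labels = {}
    for cluster_id, assay_ids in members.items():
        counts = Counter(s for a in assay_ids for s in set(annotations[a]))
        labels[cluster_id] = {
            s for s, c in counts.items() if c / len(assay_ids) >= min_share
        }
    return labels


assignments = {"assay-1": 0, "assay-2": 0, "assay-3": 1}
annotations = {
    "assay-1": {("has assay format", "cell-based format"),
                ("has detection method", "luminescence")},
    "assay-2": {("has assay format", "cell-based format")},
    "assay-3": {("has assay format", "biochemical format")},
}
labels = cluster_statements(assignments, annotations, min_share=0.6)

# A new assay assigned to cluster 0 would inherit labels[0] as its
# candidate statements, to be accepted or rejected by a curator.
print(labels[0])
```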

The final semantification function in ORKG-assays was arrived at through experimentation. We tested two vectorizations of the bioassay text, TF-IDF and SciBERT [4], to find the best representation for K-means clustering, using the elbow method to select the value of K. While the TF-IDF vectorizer is fitted on a training collection of assays, the SciBERT embeddings are obtained directly from the pretrained model as 768-dimensional vectors. The results are shown in Table 1. Direct TF-IDF vectorization of the bioassay text outperforms SciBERT, which is pretrained on scholarly articles, at 0.83 F1 vs. 0.77 F1, and does so with fewer clusters (450 vs. 550).
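The TF-IDF plus K-means setup with an elbow-based choice of K can be sketched in plain Python as below. This is a toy illustration under stated assumptions, not the code behind Table 1: the whitespace tokenizer, the smoothed IDF formula, the random initialization, and the "largest second difference" elbow rule are all choices made here for brevity, and the example texts are invented. In practice a library implementation would normally be used.

```python
import math
import random
from collections import Counter


def tfidf_vectors(texts):
    """Fit a simple TF-IDF model on texts; return L2-normalized dense vectors."""
    docs = [t.lower().split() for t in texts]
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log((1 + len(docs)) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=50, seed=0):
    """Lloyd's algorithm; requires k <= len(vectors). Returns labels, centroids, inertia."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]

    def nearest(v):
        return min(range(k), key=lambda c: sq_dist(v, centroids[c]))

    for _ in range(iters):
        labels = [nearest(v) for v in vectors]
        updated = []
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            # An empty cluster keeps its previous centroid.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(v) for v in vectors]
    inertia = sum(sq_dist(v, centroids[l]) for v, l in zip(vectors, labels))
    return labels, centroids, inertia


def elbow_k(ks, inertias):
    """Pick the k where the inertia curve bends most (largest second difference)."""
    best_k, best_bend = ks[0], float("-inf")
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


texts = [
    "luciferase reporter cell assay",
    "luciferase reporter cell assay signal",
    "cell luciferase signal readout",
    "kinase binding inhibitor assay",
    "kinase inhibitor dose response",
    "binding inhibitor dose response curve",
]
vectors = tfidf_vectors(texts)
ks = list(range(1, 6))
inertias = [kmeans(vectors, k)[2] for k in ks]
print("chosen K:", elbow_k(ks, inertias))
```

With a real corpus the candidate range for K would be far larger than in this toy run, given the 450 to 550 clusters reported above.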

Table 1. Semantification results by K-means clustering of vectorized bioassays

Fig. 1. ORKG frontend screens for user curation of an automatically semantified bioassay.

Building the Knowledge Graph. We leverage the ORKG to convert our structured annotations to a KG. The PubMed metadata of the assay's article is fetched first, after which the digitalized bioassay is added as research contributions of the paper via the ORKG KG-building functions.

Data Workflows. 1. Add Paper Wizard. In the ORKG frontend, as shown in Fig. 1, the user can add an assay by clicking the 'Add Bioassay' button. The assay is automatically semantified, and the result is shown on a screen with checkboxes that let the user accept or reject each statement. On clicking 'Insert Data', all selected statements and the user provenance are added to the ORKG. 2. Bulk Import via REST API. To ingest data in bulk, iterative calls can be made to the ORKG REST API, each carrying a JSON object that encapsulates the article metadata and the structured bioassay as contributions. This process is depicted in Figs. 2 and 3.
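The bulk import loop can be sketched as follows. The payload layout and all field names here are hypothetical placeholders, not the ORKG schema; the actual endpoint and JSON structure should be taken from the ORKG REST API documentation. The sketch only builds and serializes one JSON object per article and deliberately stops short of issuing any HTTP request.

```python
import json


def build_paper_payload(metadata, statements):
    """Wrap article metadata and one semantified bioassay in a JSON-ready dict.

    Assumption: field names ("paper", "contributions", "statements", ...)
    are illustrative only and do not reflect the real ORKG request schema.
    """
    return {
        "paper": {
            "title": metadata["title"],
            "doi": metadata.get("doi"),
            "contributions": [
                {
                    "name": "Bioassay",
                    "statements": [
                        {"predicate": p, "object": o}
                        for p, o in sorted(statements)
                    ],
                }
            ],
        }
    }


def bulk_payloads(records):
    """Yield one serialized payload per (metadata, statements) record."""
    for metadata, statements in records:
        yield json.dumps(build_paper_payload(metadata, statements))


records = [
    (
        {"title": "Example article", "doi": None},
        {("has assay format", "cell-based format")},
    ),
]
for body in bulk_payloads(records):
    # Each body would be sent in one call to the ORKG REST API.
    print(body)
```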

Fig. 2. End-to-end ORKG-assays semantification pipeline, which practically realizes the digitalization of digitized data involving data sources, data retrieval, an annotation service, and resulting triple statements.

Fig. 3. Conversion of an unstructured bioassay to its equivalent digitalized representation, as finally presented in the ORKG frontend (https://tinyurl.com/orkg-assay).

3 Conclusion

We presented ORKG-assays, an end-to-end digitalization workflow for unstructured descriptions of bioassays within a next-generation digital library, the ORKG. Its supplementary information is released online at https://github.com/jd-coderepos/bioassays-ie. The hybrid design of ORKG-assays integrates automated and manual semantification methods so that they complement each other: pure machine learning on its own tends to be insufficiently accurate, and expecting scientists to find the time to semantify their assays from scratch is unrealistic.

Bioassays are highly diverse and are therefore a non-trivial semantification domain. They pose challenges for standardizing and integrating the data, where the goal is to maximize their scientific and ultimately their public health impact as assay screening results are carried forward into drug development programs with intelligent machine assistance. The current coronavirus pandemic sheds critical light on the need to advance the drug development research lifecycle, for which bioassays are crucial, and lends credence to our choice of domain for semantification research. In this respect, the ORKG will not serve as a mere mirror of other bioassay depositories. It will itself be a unique application of a highly structured, science-wide knowledge graph of scholarly contributions that incorporates semantified bioassays as well.