1 Introduction

In recent years, the number of scientific articles available online has grown substantially, with particularly rapid growth over the past decade [34]. As a result, extracting new scholarly information has become a major challenge for researchers. Given the large volume of publications, researchers and others interested in a field often struggle to navigate the vast amount of information available to them efficiently [11]. New systems and tasks are introduced in each article as scientific communities expand and evolve, and various methodologies are compared. Manual search and analysis of the scientific literature is time-consuming and error-prone. Despite advances in search engines, detecting new technologies and their relation to previous ones remains hard [1], and search engines can return overwhelming results that may not be relevant or accurate [8]. Scientific paper recommendation systems, in turn, often require access to a researcher’s personal data and browsing history, raising privacy concerns for some users [63]. Traditional databases alone are insufficient for such recommendation systems, as they struggle to handle unstructured data and lack the capability to effectively combine information from external sources [49].

A knowledge graph (KG), in contrast, provides a structured and comprehensive repository of scientific information [19] that can be easily accessed and analyzed by intelligent algorithms. Intelligent algorithms are therefore needed to extract and organize scientific information into such a knowledge graph, facilitating the quick identification of new technologies and tasks by researchers. Information extraction (IE), such as identifying scientific entities and their relations, is essential for organizing the data into knowledge bases, including KGs. A KG represents knowledge as a graph of entities and their relations, enabling researchers to quickly identify new technologies and tasks by analyzing the connections between entities in the graph. Extracting the contributions of research articles is crucial for building such a KG. Additionally, a KG can help assess novelty and concepts by providing a structured representation of existing knowledge and identifying gaps or missing links in the graph. This helps researchers identify new and unique connections between concepts, leading to the discovery of novel ideas and approaches. Novelty refers to new, original, or previously unknown ideas, whereas concepts that have not been captured may simply be existing ideas that are overlooked or not given enough attention. There is therefore an increasing demand for systems that extract and organize scientific information from scientific articles and automatically build the KG.

However, building a high-quality scientific knowledge graph (SKG) raises several challenges and limitations. First, scientific data are highly heterogeneous, distributed, and often incomplete, making them difficult to integrate and represent in a unified graph structure. Second, extracting knowledge from unstructured text, such as scientific publications, requires advanced NLP techniques and domain-specific ontologies. Representing and linking entities and relations in the KG requires careful design and curation to ensure accuracy and consistency. Despite these challenges, there have been significant efforts in recent years to build SKGs, such as Semantic Scholar [71], the Microsoft Academic Graph [73], and the CORD-19 [78] SKG. A limitation of these KGs is the absence of a contribution graph, which would enable the identification of the specific contributions made by research articles. As the number of research publications increases, it becomes crucial to extract contributing sentences from scholarly articles and design KGs that efficiently represent this knowledge. One such effort is NLPContributionGraph (NCG) [21], an annotation scheme for describing academic contributions in NLP articles; the NCG corpus is annotated using this scheme. Its objective is to automate the annotation of scientific papers in order to create scholarly contribution graphs across NLP domains. The NCG dataset is annotated for four different tasks: (1) extracting contribution sentences that express the significant contributions of a research article, (2) extracting phrases from the contribution sentences, (3) classifying contribution sentences into information units (IUs), and (4) extracting triplets. The dataset is annotated in this same sequence, providing a useful resource for these tasks. The main objective of this work is to extract contributions from scientific articles and to extract scientific terms and relations from these contributions. These terms and relations are used to build the SKG, which allows machines to navigate prior knowledge in the literature, make meaningful comparisons, and understand the novelty of a new research article. The NCG challenges serve as the basis for this paper [22]. To address these challenges, we use the SciBERT [10] deep learning model.

In this paper, we propose deep learning-based approaches to solve four problems, viz. contribution sentence identification, phrase extraction, information unit classification, and triplet extraction. For the first problem, we propose a neural network-based technique for automatically identifying contribution sentences in research articles, realized as a multitasking deep neural network architecture named ContriSci. Multitask learning can help address the issue of limited training data by leveraging data from related tasks to improve performance on the primary task [59]. We implement the following two scaffold tasks for the ContriSci model: (1) section identification and (2) citance classification. Section identification refers to identifying the section headings or labels in a document. The goal is to automatically recognize the hierarchical structure of a document and to identify headings such as introduction, methods, results, experiments, and abstract. The task is often approached as a classification problem, where the model is trained on labeled examples to predict the section label of each sentence. We use the ACL Anthology Sentence Corpus (AASC) dataset to train the section identification scaffold task. Citance classification is the task of classifying research statements as either citances or non-citances: citances are statements that reference previously published work, while non-citances do not. In our research, we use citance classification to identify and analyze citances in a large corpus of scientific articles; this task is trained on the SciCite dataset [18]. We use a BERT-CRF [67] model to extract phrases from the contribution sentences. Because the neural network model cannot be adequately trained with only the 6,093 training sentences in the NCG dataset, we use two additional datasets, SciERC [44] and SciClaim [47]. The NCG dataset contains annotations for various IUs, namely ablation analysis, approach, baselines, experimental setup, experiments, hyperparameters, model, research problem, results, task, dataset, and code. We classify the sentences into the IUs using a BERT-based multi-class classifier [88]. Inspired by Liu et al. [39], we reorganize the triplets into five categories, namely A, B, C, D, and E, where each category is defined by similar syntactic or semantic properties, allowing for more efficient and accurate extraction of relevant triplets. We generate all possible candidate triplets and implement BERT-based classifiers for triplets of types A, B, C, and D. For type E triplets, we use a rule-based approach. Our main contributions can be summarized as follows:

  1.

    We propose a multitasking system for the identification of contribution statements from research articles with state-of-the-art results. This system can automatically identify and extract the contribution statements from research articles, which can help researchers quickly understand the main contributions of the paper.

  2.

    We build a BERT-CRF-based system for phrase extraction from contribution statements. Our approach exhibits reduced complexity in comparison with existing models. The system can accurately extract phrases related to contributions from the identified contribution statements.

  3.

    We develop a multi-class BERT-based classifier for information unit classification with state-of-the-art results. The system can classify contribution statements into different information units.

  4.

    We develop a BERT-based system for triplet extraction. The system can organize phrases into triplets with state-of-the-art results.

  5.

    We propose a pipelined-based system for triplet extraction for building the KG with state-of-the-art results. The proposed system can automatically extract triplets to build the KG using the extracted information.

We structure the rest of the paper as follows. Section 2 provides a detailed description of the dataset used in our research, and related work is discussed in Sect. 3. We define the problem in Sect. 4; it is divided into four parts: contribution sentence identification, phrase extraction, information unit classification, and triplet extraction. Information unit classification is a subtask of triplet extraction and plays a vital role in extracting relevant triplets, so both tasks are discussed jointly in the subsequent sections. In Sect. 5, we explain the dataset pre-processing steps taken to ensure the quality of our data. Section 6 presents the system overview in detail. We compare the performance of our proposed model with the baseline model and analyze the results, along with dataset annotation anomalies, in Sect. 7. Finally, in Sect. 8, we conclude our findings and provide directions for future research.

2 Dataset description

We use the NLPContributionGraph (NCG) [21] dataset. The dataset is publicly available in three sets, i.e., a training set, a trial set, and a test set. The corpus contains two plain text formats for each article: (1) plain text obtained by converting the PDF with the GROBID parser and (2) a tokenized form of the sentences produced with Stanza [58]. The dataset is annotated at three distinct levels, as shown in Fig. 1: (1) contribution sentences, (2) scientific terms and predicates from the contribution sentences, and (3) triplets, viz. (subject, predicate, object). These triplets are organized into two levels of knowledge [23]. At the top level, there is a placeholder called Contribution. Underneath it, there are twelve IUs, encompassing ablation analysis, approach, baselines, experimental setup, experiments, hyperparameters, model, research problem, results, task, dataset, and code. The contributions of a scholarly article are categorized under at least three IU nodes, determined by their relevance to the article. The first triplet of each IU includes the Contribution subject; we classify such triplets as type E. Figure 1 shows an example of triplets belonging to the ExperimentalSetup IU. Moreover, D’Souza et al. [22] present five general annotation guidelines for identifying contribution sentences in the NCG scheme: (1) identify sentences that describe or indicate the contribution of the paper, such as introducing a new method or achieving a breakthrough result; (2) focus on the main contribution of the paper, which is often stated in the introduction or abstract; (3) annotate sentences that provide evidence or support for the main contribution, such as experiments, results, or analysis; (4) avoid annotating sentences that describe background knowledge or unrelated information; and (5) consider the context and purpose of the paper when identifying contribution sentences, as the contribution may vary depending on the research question or goal. By following these guidelines, annotators can consistently and systematically identify and annotate contribution sentences. The annotation scheme is evaluated on a dataset of 200 articles purposefully selected from the ACL Anthology, with each of the five NLP tasks represented equally by 40 articles. To ensure the quality of the annotations, two annotators independently annotated every sentence in the dataset, and disagreements were resolved through adjudication by a third annotator. The authors calculate the inter-annotator agreement using Cohen’s Kappa [70] and obtain a substantial agreement of 0.75 for sentence-level annotation, indicating that the annotation scheme is reliable. The overall objective of these tasks is to build a KG. The structure of the dataset is as follows (a minimal loading sketch is given after the list):

Fig. 1: Structure of NCG dataset

  1.

    The sentence.txt file contains the index numbers of the contribution sentences.

  2.

    The entities.txt file contains phrases with paper id, starting index, and end index of the phrases.

  3.

    The Grobit-out.txt file contains the plain text of the article.

  4.

    The Stanza-out.txt file contains articles’ sentences in tokenized form with sentence numbers.

  5.

    The triplet folder contains information unit-wise triplets of the papers.

  6.

    The info-unit folder contains a .json file of information units, each containing respective contribution sentences.
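As a minimal illustration of this layout, the following sketch reads the tokenized sentences and the contribution-sentence indices for one paper folder; the file names follow the list above, and the 0-based indexing convention is an assumption that should be checked against the released corpus.

    # Minimal sketch: read one NCG paper folder using the layout listed above.
    from pathlib import Path

    def load_ncg_paper(paper_dir):
        paper_dir = Path(paper_dir)
        # Tokenized sentences produced by Stanza, one sentence per line.
        sentences = (paper_dir / "Stanza-out.txt").read_text(encoding="utf-8").splitlines()
        # Indices of the contribution sentences, one integer per line (assumed 0-based).
        contrib = {int(line) for line in
                   (paper_dir / "sentence.txt").read_text(encoding="utf-8").split()}
        labels = [1 if i in contrib else 0 for i in range(len(sentences))]
        return sentences, labels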

Tables 1, 2, and 3 show the dataset statistics for the contribution sentences, phrases, and triplets, respectively. Due to the limited number of instances in the training set, we combine the trial set with the training set for training the deep learning models. Consequently, our training set encompasses both the original training and trial sets, while the test set is used to evaluate the model’s performance. We create a validation set by randomly selecting 10% of the samples from the training set. In Tables 1 and 2, the columns Avg. Length and Max. Length refer to the average and maximum sentence length in terms of the number of tokens; these metrics are calculated by counting the number of tokens in each sentence and then averaging or taking the maximum across all sentences in the dataset. We also count the number of sentences per section per document; on average, a section contains approximately 10–15 sentences.

Table 1 Data statistics of NCG corpus for contribution sentences (CS) and non-contribution sentences (NCS)
Table 2 Data statistics of NCG corpus for phrases
Table 3 Data statistics of NCG corpus for triplets

The evaluation of systems in this task is conducted in three distinct phases. The first, Evaluation Phase 1, focuses on the end-to-end pipeline of the KG building task, testing the systems’ ability to construct a KG comprehensively. The second, Evaluation Phase 2, is divided into two parts: Part 1 focuses solely on the systems’ capacity to extract phrases and organize them into triples, while Part 2 tests the systems exclusively on their ability to form triples. Together, these evaluation phases thoroughly assess the systems’ performance on different aspects of the KG building task.

3 Related work

In this section, we discuss the previous work on identifying contribution sentences, phrase extraction, and triplet extraction. We also survey some of the literature on multitasking techniques.

3.1 Contribution sentence identification

The problem of contribution sentence identification has received little attention in the literature. Brack et al. [14] propose a list of generic scientific concepts (such as process, method, material, and data) identified through a rigorous annotation procedure. This set of concepts is used to annotate a corpus of scientific abstracts from ten different fields of knowledge. Furthermore, they apply active learning [89] to select the best instances from diverse data domains. The experimental results indicate that non-experts can reach significant agreement after consulting with domain specialists, and their baseline system achieves a high F1 score. As part of SemEval 2021 Task 11: NLPContributionGraph [23], Shailabh et al. [64] construct a system for a research-paper-contributions-focused KG over the NLP literature; in the first sub-task, identifying contribution sentences from research articles, a bidirectional LSTM (BiLSTM) is stacked on top of SciBERT layers. Liu et al. [39] develop a BERT-based classifier for classifying contribution sentences and also include position features in the classifier. Their system came second in the Phase 1 evaluation and first in both parts of the Phase 2 evaluation, and produced the best overall results after a submission error in Phase 1 was corrected. Ma et al. [45] employ a BERT-based system to identify contribution sentences. They utilize a pre-trained BERT model to generate 768-dimensional word embeddings for each word in the sentence and classify sentences using the embedding of the first token (i.e., [CLS]) of each sentence. By leveraging the semantic information captured by the [CLS] token, their model successfully identifies contribution sentences based on their meaning and context. Zhang et al. [84] introduce a framework for extracting sentences that leverages the sentence context and section heading as additional features and uses BERT as a binary classifier to determine whether a sentence provides contribution information. Since the majority of sentences in an annotated article carry no contribution information, they use an under-sampling strategy: the ratio of positive to negative samples is fixed to an integer for each batch during training to guarantee that the model does not overfit the negative samples. Martin et al. [50] propose a multi-class sentence classification model with 13 classes, where each of the 12 IUs represents a class, and fine-tune the DeBERTa [29] base model on sentences from the training dataset to build this classifier. Arora et al. [5] propose a BERT-based classification model to identify the contributing sentences in a research publication. Their approach utilizes the BERT pre-trained weights, which support sequences of up to 512 word pieces, and addresses the imbalance between contribution and non-contribution sentences by filtering out most of the non-contribution sentences with simple bi-gram filtering. Their model achieves promising results in identifying contribution sentences accurately.

Our proposed approach is similar to previous work, such as Shailabh et al. [64], Liu et al. [39], Ma et al. [45], Brack et al. [14], Zhang et al. [84], Martin et al. [50], and Arora et al. [5], which also develop a deep learning BERT-based model for sentence identification. However, our approach distinguishes itself by introducing a unique problem setting. In addition to sentence identification, we incorporate multitasking by including section identification and citance classification as supporting tasks. By leveraging these additional tasks, we aim to enhance the accuracy of contribution sentence identification in scholarly articles. This comprehensive approach allows us to address the challenges in identifying and extracting meaningful sentences more effectively.

3.2 Phrase extraction

There are several notable works that focus on extracting phrases from research articles. Liu et al. [39] present a BERT-CRF model to recognize and characterize relevant phrases in contribution sentences. Shailabh et al. [64] use a combination of SciBERT, BiLSTM [30], and CRF for phrase extraction from contribution sentences. Zhang et al. [84] present a BERT-based model; they train 10 models by 10-fold cross-validation and use a voting scheme to extract the phrases from contribution sentences. Ma et al. [45] use a pre-trained BERT model with softmax and argmax top layers, which are shared across all tokens. Martin et al. [50] train a feature-based Maximum Entropy Markov Model (MEMM) to predict scientific terms in the contribution sentences. Zhu et al. [90] present a BiLSTM model with a CRF layer on top to predict the labels. Wang et al. [75] present PTR, a phrase-based topical ranking method for phrase extraction in scientific publications. Zhang et al. [86] propose a deep recurrent neural network (RNN) [56] model that combines keyword and context information. Alzaidy et al. [3] propose a model that jointly exploits the complementary strengths of CRF layers, which capture label dependencies, and BiLSTM networks, which capture the hidden semantics in text. Sahrawat et al. [61] feed contextual embeddings into a BiLSTM-CRF model using the BIO labeling scheme.

Phrase extraction has also been studied in other domains, including news [72], meeting transcripts [38], and web text [26, 86]. Phrase extraction techniques can be divided into two categories, i.e., supervised learning [55] and unsupervised learning [17]. In the supervised setting, a classification model is trained, sometimes with heuristic rules, to predict candidate phrases; features such as TF-IDF, position, and other resource-based features are also used for this task [31, 51, 81]. The unsupervised setting is usually formalized as a ranking problem [36], where phrases are ranked based on TF-IDF [28, 38, 87], term informativeness [79], or graph-based ranking [52, 72]. One approach in graph-based ranking is to create a graph whose nodes represent the phrases or sentences in the text and then rank the nodes by their importance. Several kinds of information have been incorporated into the graph to improve performance, i.e., topic information [12, 13], semantic information from knowledge bases [66, 83], and pre-trained word embeddings [76, 48]. Gupta et al. [27] propose an approach for describing a research work in terms of focus, application domain, and the techniques used: FOCUS is a research article’s main contribution; TECHNIQUE is a research approach or instrument, such as expectation-maximization and conditional random fields; DOMAIN is the application domain of an article, such as Machine Translation [57] and Natural Language Inference [46]. They use semantic patterns to classify text from the abstract into the above categories. The purpose of extracting these concepts from scientific publications is to examine application domains, the strategies used to solve domain challenges, and the focus of scientific papers in a community.

Our model shares similarities with the models proposed by Shailabh et al. [64], Zhang et al. [84], and Ma et al. [45], all of which utilize a BERT-based model for phrase extraction. Similarly, Zhu et al. [90] and Alzaidy et al. [3] propose models that use a CRF layer for phrase extraction, as ours does. However, the key difference between our proposed model and the models of Shailabh et al. [64], Zhang et al. [84], and Ma et al. [45] lies in the top layer of the BERT-based model. Gupta et al. [27] extract key aspects from articles by matching semantic extraction patterns, learned using bootstrapping [2], to the dependency trees of sentences in an article’s abstract. Our model is closest to that of Liu et al. [39]: both use BERT to encode the input sentences and a CRF to extract relevant phrases. However, our model differs in that we train the BERT-CRF model on multiple datasets, namely NCG, SciClaim [47], and SciERC [44], whereas Liu et al. [39] train their model only on the NCG dataset. Another difference is that Liu et al. [39] use an ensemble of 96 models for phrase extraction. Overall, our proposed model offers a simpler yet effective approach to phrase extraction compared with the models mentioned above.

3.3 Triplet extraction

Several studies focus on extracting triplets from research articles. Rusu et al. [60] present a method for extracting triplets from English sentences: four well-known English syntactic parsers are used to generate parse trees, and triplets are then extracted from the parse trees using parser-dependent approaches. Jivani et al. [33] present a method for extracting multiple subject-object relations from natural language input, including one or more subjects, predicates, and objects; the parse tree and the dependencies generated by the Stanford Parser [80] are used to extract this information from a given sentence. Jaiswal et al. [32] present an algorithm with a modified approach for extracting various triplets from text using the Treebank structure and the dependencies generated by the Stanford parser on sentences. KG-BERT [32] uses the BERT language model and the entity and relation of a triplet to compute its score. Liu et al. [39] categorize triplets into different types according to their composition and use a separate BERT-based binary classifier for each type. Shailabh et al. [64] develop a rule-based methodology for extracting triplets: a SciBERT-BiLSTM-based binary classifier identifies the predicates, and the preceding phrase is assigned as the subject while the subsequent phrase is designated as the object of the respective triplet. Zhang et al. [84] and Lin et al. [37] propose a similar two-step procedure of triplet generation followed by classification: candidate triplets are generated from combinations of all serial phrases and then classified using a BERT-based model. Martin et al. [50] propose a rule-based approach for triplet extraction using part-of-speech tags and the order of occurrence of phrases in a sentence. Ma et al. [45] classify phrases into subject, predicate, and object roles using a multi-label classifier and organize the resulting subjects, predicates, and objects into triplets in an iterative manner.

In contrast to the above-mentioned existing models, our approach involves a more in-depth analysis of the dataset and incorporates several modifications to enhance the performance of our model. We exclude from the training set the IUs that contain \(< 2\%\) of the contribution sentences, which improves the quality of our dataset, and we then separate the overlapping sentences.

3.4 Multitask learning

Multitask learning (MTL) is now widely used in NLP to take advantage of the interconnections between related tasks. Caruana et al. [16] propose MTL as an inductive transfer method that improves generalization by employing the domain information contained in the training signals of related tasks as an inductive bias. This is achieved by learning tasks in parallel with a shared representation: what is learned for one task can contribute to the learning of other tasks. Liu et al. [40] suggest three alternative RNN models for sharing information, in which all related tasks are combined into a single system that is jointly trained. The first model has a single shared layer for all tasks. The second employs multiple layers for the various tasks, with each layer able to read data from the other layers. The third creates a shared layer for all tasks and assigns one specialized layer to each task. In addition, the authors design a gating mechanism that allows the model to use shared data selectively, and the complete system is trained jointly on all tasks. Liu et al. [41] propose an adversarial MTL framework that incorporates orthogonality constraints in an adversarial multitask setting, where the shared and private feature spaces are inherently disjoint. They create a general shared-private learning architecture to model text sequences and suggest two approaches to prevent interference between the shared and private latent feature spaces: adversarial training and the imposition of orthogonality constraints. Adversarial training ensures that the shared feature space contains only common, task-invariant information, while the orthogonality constraint removes redundant features from both the private and shared spaces. To incorporate knowledge about citations from the structure of scientific papers, Cohan et al. [18] offer a neural MTL framework with two auxiliary tasks as structural scaffolds to improve citation intent prediction: (1) predicting the section title in which the citation appears and (2) predicting whether a sentence requires a citation. Unlike the primary objective of citation intent prediction, collecting large amounts of training data for the scaffold tasks is simple because the labels arise naturally during the writing process, so no manual annotation is required. They show that the proposed neural scaffold model outperforms existing approaches by a wide margin on two datasets and classify citation intents based on the structural information of research articles. We take advantage of the concept of multitask learning and apply it to automatically identify contribution sentences in research papers.

4 Problem definition

The problem is divided into four parts. Formally, given a scientific document D consisting of a list of n sentences \(D = [S_1, S_2,\ldots , S_n]\), the problems are defined as:

  1.

    Contribution sentence identification Contribution sentences are the sentences that describe the contributions of the research article. We classify the sentences of a given document D into contribution \(C = [c_1, c_2,\ldots , c_m]\) and non-contribution classes, where \(c_i\) denotes the ith contribution sentence and m is the total number of contribution sentences in the given document.

  2.

    Phrase extraction Let \(C = [c_1, c_2,\ldots , c_m]\) be the list of contribution sentences of a given document D. We extract the list of phrases \(P = [p_1, p_2,\ldots , p_t]\) from C, where each \(p_k\) is a phrase extracted from a contribution sentence \(c_i\) and t is the total number of phrases in the contribution sentences C of document D. This is posed as a sequence labeling problem in which the task is to identify whether a phrase denotes a scientific term or a predicate phrase. We use the BIO tagging format to tag the tokens. BIO (Beginning, Inside, Outside) is a common annotation scheme used in NLP to label entities in text; it indicates the position of each token within an entity (e.g., a person, organization, or location).

  3.

    Information unit classification IUs serve as a way to categorize and organize the contribution triplets based on the content and context of the research article. The annotated contribution triplets of each scholarly article are categorized into three or more IUs: ablation analysis, approach, baselines, experimental setup, experiments, hyperparameters, model, research problem, results, task, dataset, and code. These IUs contain triplets from multiple sentences, stored in the .json file format. Our goal is to classify contribution sentences into their respective IUs, which enables us to categorize the triplets effectively. IU classification is a crucial step for the triplet extraction task, as it helps us extract relevant triplets. Using the sentence identifiers of the contribution sentences, we assign a triplet to an information unit whenever its sentence identifier matches that of the corresponding contribution sentence.

  4.

    Triplet extraction In this task, we form triplets from the phrases extracted from the contribution sentences and classify them into one of the IUs, denoted as \(U = [u_1,\ldots , u_r,\ldots , u_x]\), where \(u_r\) represents one of the twelve information units and x denotes the total number of information units in the given document. For each \(u_r\), there is a triplet set \(T^r = [(su_1^r, pr_1^r, ob_1^r ), (su_2^r, pr_2^r, ob_2^r ), \ldots , (su_j^r, pr_j^r, ob_j^r ), \ldots , (su_O^r, pr_O^r, ob_O^r)]\), where \((su_j^r, pr_j^r, ob_j^r )\) is the jth triplet in \(u_r\), representing the subject, predicate, and object, respectively, and O is the total number of triplets in \(u_r\). The triplet statements are organized under their information unit.

For example, consider a contribution sentence from the Introduction IU in the Machine Translation domain.

  1.

    Contribution sentence: The NMT typically consists of two sub-neural networks.

  2.

    Phrase extraction: The(O) NMT(B) typically(O) consists(B) of(I) two(B) sub-neural(I) networks(I).

    Phrases:

    NMT (B)

    consists of (B, I)

    two sub-neural networks (B, I, I)

  3.

    Triplets extraction:

    Information Unit: Introduction

    Triplets NMT (subject), consists of (predicate), two sub-neural networks (object)

5 Dataset pre-processing

The NCG dataset includes three separate files for each paper: (i) the original paper in PDF format; (ii) a plain text representation of the PDF obtained by parsing it with the Grobid PDF parser; and (iii) an additional text file containing the paper’s tokenized sentences, generated using Stanza [58]. This tokenization breaks each sentence into its constituent parts, making it easier for machine processing. In the Stanza file, each sentence is assigned a unique sentence number, which facilitates accurate tracking of the contribution status of each sentence. We analyzed the title sentences and found that, out of 287 title sentences, 284 (98.9%) are contributing sentences, while the remaining 3 (1%) are split into two parts and annotated as non-contributing. The following pre-processing steps are performed on the combined Stanza files of the training set; they are not applied to the test set.

5.1 Combining incomplete sentences in the stanza file

We observe that the Stanza files in the dataset contain many incomplete sentences that do not provide proper context, and the baseline model [64] fails to identify these incomplete sentences. For example, under the topic Paraphrase Generation, paper number 0, the following two lines are incorrectly terminated because special symbols are replaced by ’?’ in the Stanza file.

  1.

    164: The Critical Difference ( CD ) for Nemenyi test depends upon the given ?

  2.

    165: ( confidence level, which is 0.05 in our case ) for average ranks and N ( number of tested datasets ).

We join sentences that terminate with any of the following symbols: “?”, “;”, “?:”, “,”. Some sentences also break at a citation; for example, in the Stanza file of Natural Language Inference paper number 58:

  1.

    63. Chen et al.

  2.

    64. propose using a bilinear term similarity function to calculate attention scores with pre-trained word embeddings.

We also combine these types of sentences. We conduct manual sentence verification to identify and correct any instances of incorrect sentence combinations.
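A simplified sketch of this joining heuristic is given below; the symbol list follows the text above, and the citation-stub pattern is an approximation of our rules.

    # Sketch of the sentence-joining heuristic of Sect. 5.1 (simplified):
    # a sentence ending with one of the listed symbols, or with a dangling
    # citation stub such as "Chen et al.", is merged with the next sentence.
    import re

    BROKEN_ENDINGS = ("?", ";", "?:", ",")
    CITATION_STUB = re.compile(r"\bet al\.$")

    def join_incomplete(sentences):
        merged = []
        for sent in sentences:
            prev = merged[-1].rstrip() if merged else ""
            if prev and (prev.endswith(BROKEN_ENDINGS) or CITATION_STUB.search(prev)):
                merged[-1] = prev + " " + sent.lstrip()
            else:
                merged.append(sent)
        return merged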

5.2 Extraction of main section and sub-section titles

We label the NCG dataset for section identification by extracting the name of the section to which each sentence belongs. Additionally, we extract the subtitles of the sentences to provide extra context. Rule-based heuristics, implemented over the Grobid and Stanza files, are used for their identification; a simplified sketch of these heuristics is given after the list.

  1.

    If the sentence length is \(\le 4\) tokens and it contains a substring such as Abstract, Introduction, Related Work, Experiment, Implementation, or Background, it is treated as the main heading of a section.

  2.

    Statements succeeding blank lines in the Grobid files are recognized as potential section titles and subtitles. The following conditions are then examined for these sentences:

    (a)

      Based on our analysis of the dataset, a sentence is treated as a main heading if its length is \(< 10\) tokens, it shares a substring (of length \(> 2\)) with the paper title, and it does not end with an English stop word.

    (b)

      If the preceding criterion (2a) is not satisfied, the sentence is considered a subheading; such sentences are recognized as subheadings if they neither terminate with a stopword such as by, as, in, and, or that, nor consist only of digits and periods such as “2.” or “4.1.”.
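These heuristics can be sketched as follows; the keyword and stopword lists and the title-overlap test are simplified approximations of our rules, applied to candidate lines that follow blank lines in the Grobid output.

    # Simplified sketch of the heading heuristics of Sect. 5.2.
    import re

    SECTION_KEYWORDS = {"abstract", "introduction", "related work", "experiment",
                        "implementation", "background"}
    TRAILING_STOPWORDS = {"by", "as", "in", "and", "that"}

    def classify_heading(sentence, paper_title):
        tokens = sentence.split()
        lowered = sentence.lower()
        # Rule 1: short sentence containing a known section keyword -> main heading.
        if len(tokens) <= 4 and any(k in lowered for k in SECTION_KEYWORDS):
            return "main"
        # Rule 2a: short sentence sharing a substring (> 2 chars) with the paper
        # title and not ending in a stopword -> main heading.
        shares_title = any(w.lower() in lowered for w in paper_title.split() if len(w) > 2)
        ends_in_stopword = bool(tokens) and tokens[-1].lower() in TRAILING_STOPWORDS
        if len(tokens) < 10 and shares_title and not ends_in_stopword:
            return "main"
        # Rule 2b: otherwise a subheading, unless it ends with a stopword or
        # consists only of digits and periods (e.g., "2.", "4.1.").
        if not ends_in_stopword and not re.fullmatch(r"[\d.]+", sentence.strip()):
            return "sub"
        return "none"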

5.3 Extracting previous and next sentence

In addition to the current sentence and its subsection heading, we concatenate the previous and following sentences to the input representation to provide more context to the model. The previous sentence is left blank if the current sentence is the first sentence of the subsection; similarly, if it is the last sentence of the subsection, the following sentence is left blank.
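A minimal sketch of the resulting input representation (the ‘#’-separated format used later by ContriSci) is shown below; the subsection-boundary handling is simplified to document boundaries.

    # Sketch of building the ContriSci input string for the idx-th sentence:
    # Current Sentence + # + Subheading + # + Previous Sentence + # + Next Sentence.
    def build_input(sentences, idx, subheading):
        prev_sent = sentences[idx - 1] if idx > 0 else ""        # blank at a boundary
        next_sent = sentences[idx + 1] if idx + 1 < len(sentences) else ""
        return " # ".join([sentences[idx], subheading, prev_sent, next_sent])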

6 System overview

In this section, we describe the methodologies for contribution sentence identification, phrase extraction, information unit classification, and triplet extraction, and show the architectural diagram for each task. Among the various versions of BERT, such as SciBERT [10] and DistilBERT [62], we choose SciBERT for each task because it has been pre-trained specifically on scientific text, which is the type of text we deal with here. Additionally, SciBERT has been shown to outperform general-purpose BERT models on several NLP tasks related to scientific text, such as citation intent classification [18] and scientific named entity recognition [69]. Therefore, we believe that SciBERT is a suitable choice for our tasks of contribution sentence identification, phrase extraction, information unit classification, and triplet extraction. DistilBERT is a smaller and faster version of BERT, but it may not provide the same level of performance as SciBERT on scientific text [6].

6.1 Contribution sentences identification

We build a multitask model (ContriSci) to determine whether a sentence of a research article describes a contribution. After pre-processing, we analyze the dataset again.

6.1.1 Data analysis after pre-processing

After pre-processing, we analyze the training set and find that the number of contribution sentences distributed across sections (such as Related Work, Background, Previous Work, Future Work, Conclusion, or Discussion) is negligible. Therefore, we remove the sentences belonging to these sections from the dataset.

  1.

    Analysis of contribution sentences in sections Figure 2a shows the distribution of all the sentences, and Fig. 2b shows the distribution of contribution sentences across sections. The Experiment section consists of most of the contribution sentences, followed by the Result and Introduction sections. Around 20% of sentences in the Experiment section are contribution sentences, whereas only 5% of sentences in the Method section are contribution sentences. This asymmetric distribution aids us in identifying contribution sentences.

  2.

    Analysis of contribution sentences having citations We analyze the cited sentences in the training set; the analysis is shown in Fig. 3a and b. The dataset contains 46,980 non-cited sentences and 538 cited sentences. Only 109 of the 538 cited sentences are contribution sentences, accounting for less than 2% of the total contribution sentences. As a result, the number of cited contribution sentences is negligible across the dataset.

Fig. 2: Analysis of the NCG training dataset. The training set is the combination of the original training and trial sets provided with the dataset. Panels a and b show the distribution of all sentences and of contribution sentences across sections

Fig. 3: Analysis of the NCG (training + trial) set. Panels a and b show the analysis of the cited and non-cited sentences

6.1.2 Data for scaffold tasks

We make use of the additional data for enhancing the weight of the scaffold tasks.

  • Section Identification: We utilize the ACL Anthology Sentence Corpus (AASC) dataset for section identification. The corpus originates from the scientific domain and constitutes a substantial collection of sentences covering ten labeled sections, with over 2 million annotated instances. We use a subset of this corpus containing 75K samples across five sections with the following class distribution: abstract (8.6%), introduction (20.3%), result (20.9%), background (7.3%), and method (16.1%). Since there is no experiment section in the AASC dataset and the distribution of contribution sentences in the result and experiment sections is the same, we merge the section label of these sentences into the result section, as we are only interested in the distribution of the sentences for the scaffold task. In this dataset, the average sentence length is 20 tokens.

  • Citance Classification: The SciCite [18] dataset is used for the citance classification scaffold task. This dataset contains 73K sentences from publications. Sentences containing citances are categorized as positive instances, while those without citances are categorized as negative instances. The ratio of positive-to-negative samples is approximately 1:6.

6.1.3 Methodology

We propose a BERT-based multitask learning (ContriSci) model for extracting contribution sentences from the research articles. This multitask model has two scaffold tasks related to the structure of research articles. These scaffold tasks help the main task in identifying contribution sentences. We train both the scaffold tasks with NCG data as well as the additional data, i.e., ACL Anthology Sentence Corpus (AASC) and SciCite [18] dataset.

Fig. 4: Architecture of ContriSci. The ContriSci model aims to identify contribution sentences: the main task predicts contribution sentences, and the two scaffold tasks predict the section title (section identification) and the presence of a citation in the sentence (citance classification)

  1.

    ContriSci model Figure 4 shows the architecture of the ContriSci model. All the tasks share the SciBERT [10] layer. We present inputs to the model in two ways. If the sentence belongs to the NCG dataset, it is in the form of Current Sentence + # + Subheading + # + Previous Sentence + # + Next Sentence, whereas, if the sentence belongs to the scaffold dataset, it is a single sentence. We tokenize the input and set the maximum length to 256. If the length exceeds the maximum length, we truncate those inputs from the right-hand side to the maximum length and add padding tokens to the shorter sentences to match the maximum length.

    Section Identification: The first scaffold task is the identification of section headings of the sentences. The semantic structure and distribution of contribution sentences vary across different sections. In the Introduction section, contribution sentences primarily outline the research problem, while the Experiments section’s sentences describe the methodology used in the article. Section identification aids the identification of contribution sentences by learning differences in linguistic patterns of sentences across sections.

    Citance Classification: The second scaffold task is to classify whether or not a sentence includes a citation. The primary purpose of this scaffold task is to distinguish between cited and non-cited sentences. In a research article, about 5% of the sentences contain citations [24], and the number of cited contribution sentences among them is minimal. This information is quite useful for distinguishing between contribution and non-contribution sentences.

  2.

    ContriSci architecture description MTL [16] enhances the performance compared to the single-task setting by transferring knowledge from the related tasks. It has some parameters shared across all the tasks. In our ContriSci model, \(T_0\) represents the main task in multitask learning, accompanied by (n-1) supplementary tasks \(T_i\), where n \(=\) 3. We utilize a task-specific Multi-Layer Perceptron (MLP) layer for each task, with a Softmax layer positioned on top. Specifically, we take the SciBERT [CLS] output vector x and input it into n MLPs, followed by softmax to calculate the prediction probabilities for each task.

    $$\begin{aligned} y^{(i)} = softmax(MLP^{(i)}(x)) \end{aligned}$$
    (1)

    We focus on the main task output, denoted as \(y^{(0)}\). The remaining output \((y^{(1)}, y^{(2)})\) is exclusively utilized during training to enhance the performance of the main task.

  3.

    Training procedure We train the multitask model following the steps outlined in Algorithm 1. Each dataset (NCG, SciCite, and AASC) has its own data loader, and batches are selected sequentially from the three data loaders in a 3:5:5 ratio, corresponding to the size of each dataset. We cannot build batches randomly because each task-specific layer must be trained with its respective dataset. Batches from the NCG dataset are used to train both the main task and the scaffold tasks, whereas batches from the AASC dataset train only the section identification task and batches from the SciCite dataset train only the citance classification task. A sketch of one training step is given after Algorithm 1.

  4.

    Loss We use the categorical weighted cross-entropy loss for each of the tasks. Cross-entropy is defined as:

    $$\begin{aligned} L = - \sum _{i=0}^{n-1} w_i t_i \log (P_i) \end{aligned}$$
    (2)

    where n is the number of classes, \(w_i\) is the class weight, \(t_i\) is the ground-truth label, and \(P_i\) is the softmax probability of the \(i\)th class. Thus, the overall loss function is a linear combination of the per-task losses, defined by:

    $$\begin{aligned} L = L_0 + \lambda _1 * L_1 + \lambda _2 * L_2 \end{aligned}$$
    (3)

    Here, \(L_0\), \(L_1\), and \(L_2\) are the losses for the main task, the section identification task, and the citance classification task, respectively, and \(\lambda _1\) and \(\lambda _2\) are hyperparameters. Each class is assigned equal weight for the \(L_1\) loss, since each label in the AASC dataset has an equal number of examples. The distribution of cited and non-cited sentences in the SciCite dataset is 1:6, so for the \(L_2\) loss we set the class weights as the inverse ratio of the number of examples in each class of the citance classification scaffold dataset. Finally, for the main loss \(L_0\), we treat the class weights as a hyperparameter.

Algorithm 1: Multi-task learning for contribution sentence identification
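The following sketch illustrates one training step of Algorithm 1 under our assumptions about tensor names and shapes; whether the scaffold-only batches are scaled by \(\lambda _1\) and \(\lambda _2\) is also our assumption. Batches are drawn from the NCG, AASC, and SciCite loaders in a 3:5:5 ratio, and the loss terms of Eqs. (2)-(3) that are applied depend on the batch source.

    # Sketch of one training step of Algorithm 1 (names and shapes are ours).
    import torch.nn.functional as F

    def training_step(model, batch, source, w_main, w_sec, w_cit,
                      lambda1=0.18, lambda2=0.09):
        out = model(batch["input_ids"], batch["attention_mask"])  # dict of task logits
        if source == "ncg":
            # NCG batches carry labels for the main task and both scaffolds (Eq. 3).
            loss = (F.cross_entropy(out["main"], batch["y_main"], weight=w_main)
                    + lambda1 * F.cross_entropy(out["section"], batch["y_sec"], weight=w_sec)
                    + lambda2 * F.cross_entropy(out["citance"], batch["y_cit"], weight=w_cit))
        elif source == "aasc":
            loss = lambda1 * F.cross_entropy(out["section"], batch["y_sec"], weight=w_sec)
        else:  # "scicite"
            loss = lambda2 * F.cross_entropy(out["citance"], batch["y_cit"], weight=w_cit)
        return loss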

6.2 Phrase extraction

In this task, we extract relevant phrases from the contribution sentences, which are essential for extracting triplets. This can be challenging, as it requires identifying and extracting phrases that accurately denote entities and their relations. The NCG training set contains phrases of different lengths, and its 6,093 training sentences are insufficient to train the neural network model. Hence, we use two additional datasets, viz. SciERC [44] and SciClaim [47]. We utilize the BERT-CRF model to extract phrases from the contribution sentences; the architecture of the phrase extraction model is shown in Fig. 5. The model receives input sentences from the NCG dataset and the additional datasets.

  1.

    SciERC dataset The SciERC [44] dataset includes annotations for scientific entities, their relations, and coreference clusters. It covers 500 scientific abstracts taken from 12 AI conference proceedings in four AI communities in the Semantic Scholar Corpus. SciERC expands the existing datasets of scientific articles from SemEval 2017 [7] and SemEval 2018 [25] by adding entity and relation types. The SciERC annotations are at the paragraph level; we split each paragraph into sentences while keeping the labels unchanged and manually check the correctness of the resulting sentences. The entire dataset comprises 2,382 sentences and 5,238 phrases, i.e., 2.19 phrases per sentence on average.

  2.

    SciClaim dataset The SciClaim dataset is annotated by Magnusson et al. [47] and comprises 12,738 annotations across 901 sentences. These sentences are expert-identified claims in SBS [4] papers, causal language in PubMed [82] papers, and claims and causal language heuristically discovered in CORD-19 [74] abstracts. The SciClaim dataset is labeled at the sentence level using the BIO annotation scheme to identify the beginning, inside, and outside of entities within each sentence. Due to accessibility constraints, we can use only a subset of 512 sentences for our experiments. The dataset contains 3,498 phrases, with an average of 6.19 phrases per sentence.

Fig. 5: Proposed phrase extraction architecture

6.2.1 Methodology

We use the BERT-CRF [68] model to extract the phrases from the contribution sentences; Fig. 5 shows the proposed model. The BERT-based encoder provides efficient contextual representations of the sentences, and the CRF layer is used to identify the scientific terms in the contribution sentences. The CRF layer leverages information from the surrounding context to assign labels to the tokens within a sequence.

$$\begin{aligned} y^{(i)} = CRF(FC^{(i)}(x)) \end{aligned}$$
(4)

The fully connected layer is between the BERT output and the CRF layer, where x is the input and y is the model’s output.
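A minimal sketch of this SciBERT-CRF tagger is given below, using HuggingFace Transformers and the pytorch-crf package; this stack is assumed for illustration, and details of our implementation such as the Tanh activation are omitted.

    # Minimal sketch of the SciBERT-CRF tagger of Eq. (4).
    import torch.nn as nn
    from transformers import AutoModel
    from torchcrf import CRF

    class PhraseTagger(nn.Module):
        def __init__(self, num_tags, dropout=0.2):
            super().__init__()
            self.encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
            self.dropout = nn.Dropout(dropout)
            self.fc = nn.Linear(self.encoder.config.hidden_size, num_tags)  # FC in Eq. (4)
            self.crf = CRF(num_tags, batch_first=True)

        def forward(self, input_ids, attention_mask, tags=None):
            hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
            emissions = self.fc(self.dropout(hidden))
            mask = attention_mask.bool()
            if tags is not None:
                return -self.crf(emissions, tags, mask=mask)   # negative log-likelihood loss
            return self.crf.decode(emissions, mask=mask)       # best BIO tag sequence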

6.3 Information unit classification and triplet extraction

Here, we classify contribution sentences into one of the IUs. To form triplets, we first separate predicate phrases from non-predicate phrases. We then generate all possible candidate triplets and classify them as valid or invalid according to their respective type, A, B, C, or D; rules are applied for type E triplets.

6.3.1 IU classification

This task aims to categorize the contribution sentences into one of the IUs. In the NCG dataset, contribution sentences are annotated with IUs, which capture the most important information units of scientific papers related to NLP tasks. Figure 6 illustrates the multi-class classification model for IU classification.

Fig. 6: Multi-class classification architecture for IU

Before analysis During our analysis of the NCG training set, we notice that certain IUs are highly infrequent and therefore contain a low number of contribution sentences, as shown in Fig. 7. To enhance the effectiveness of our IU classification model, we exclude IUs that comprise \(< 2\%\) of the contribution sentences in the training set; as a result, we eliminate the Task, Code, and Dataset IUs from consideration. For the remaining nine IUs, we initially train a BERT sequence multi-class classifier with nine classes and achieve an F1 score of 82.02% on the test dataset. We also observe that sentences of the Code IU can be identified with a simple rule-based approach, as they are characterized by the presence of a Uniform Resource Locator (URL).

Fig. 7: Distribution of contribution sentences in information units

After analysis After analyzing the results, we find that the IUs Experimental-Setup and Hyperparameters have many overlapping sentences. In the NCG training set, no single paper contains both the Experimental-Setup and Hyperparameters information units; the decision of which unit to choose is made at the document level. Therefore, we merge these two labels under the name hyper-setup and train the multi-class classifier with eight classes, namely research-problem, model, approach, experiments, results, hyper-setup, baselines, and ablation-analysis. After this classification, we train another BERT-based binary classifier to separate the sentences labeled hyper-setup into experimental-setup and hyperparameters; the input to this binary classifier is the entire paragraph instead of individual sentences. We achieve an F1 score of 78.82% with this binary classifier. After classifying the overlapping sentences with the binary classifier, we achieve an overall F1 score of 84.52%, a 2.5% improvement of the IU classification model.
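The label handling described above can be sketched as follows; the label strings and the URL pattern are assumptions for illustration.

    # Sketch of the IU label handling: merge Experimental-Setup and Hyperparameters
    # into hyper-setup for the 8-class classifier, and detect Code sentences by URL.
    import re

    MERGED_CLASSES = ["research-problem", "model", "approach", "experiments",
                      "results", "hyper-setup", "baselines", "ablation-analysis"]
    URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

    def merge_label(label):
        return "hyper-setup" if label in ("experimental-setup", "hyperparameters") else label

    def is_code_sentence(sentence):
        # Code IU sentences are identified by the presence of a URL.
        return bool(URL_PATTERN.search(sentence))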

6.3.2 Predicate classification

To generate triplets from sentences, we need to distinguish between subject, object, and predicate phrases. The predicate is unique within a triplet, and on this basis it can be distinguished from the subject and object. To classify predicates, we build a BERT-based binary classifier, where one indicates a predicate and zero a non-predicate. An F1 score of 93% is achieved when only the phrases are input to the binary classifier; introducing the full sentence with a phrase marker results in an even higher F1 score of 98%. An example of an input sentence with a phrase marker is as follows:

We also present a version of our model that uses a \(<<\) character LSTM \(>>\), which performs better than other lexical representations even if word embeddings are removed from the model.
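For illustration, the marker-based input can be built with a helper of the following form; the token offsets and marker symbols follow the example above, and the function itself is our assumption.

    # Sketch: wrap a candidate phrase (token span [start, end)) with << >> markers
    # before feeding the sentence to the predicate classifier.
    def mark_phrase(tokens, start, end):
        return " ".join(tokens[:start] + ["<<"] + tokens[start:end] + [">>"] + tokens[end:])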

6.3.3 Triplet extraction

The phrase extraction model yields discrete phrases, which we then have to organize into triplets (subject, predicate, object). Liu et al. [39] categorize the triplets according to their composition in order to gain a deeper understanding of their properties. The triplet categorization is shown in Table 4. We explain each type of triplet with an example below; a sketch of the candidate generation step follows the list.

Table 4 The triplet types, their respective examples, and the frequency of each triplet type in the dataset. BERT-based binary classifiers are used to address types A to D, while rules are used to address type E. Here S, P, and O represent the subject, predicate, and object, respectively
  1.

    Type A In type A, the subject, predicate, and object all occur in the same sentence, and these triplets are the most frequent among all triplet types. We generate all possible type A candidates and train a classifier on them. If there are m predicates and n subjects/objects, the total number of triplets generated per sentence is \(m \times n \times (n-1)/2\). In total, 157K type A candidates are generated from the NCG training set. The following is an example input for the type A classifier:

    All models are implemented using [[ TensorFlow 3 ]] and \(<<\) trained on \(>>\) the [[ SQUAD training set ]] using the ADAM optimizer with a mini-batch size of 4 and trained using 10 asynchronous training threads on a single machine.

  2.

    Type B In type B triplets, the subject and object come from the sentence and ’has’ is used as the predicate. We generate all possible type B candidates in which the subject precedes the object in the sentence. If there are n subjects/objects, \(n (n-1)/2\) triplets are generated per sentence. The following is an example input for the type B classifier:

    We can see that our \(<<\) transfer learning approach \(>>\) [[ consistently improved ]] over the non-transfer results.

  3.

    Type C In a type C triplet, the sentence is linked to its IU, so the subject is always the information unit, while the predicate and object come from the sentence. If a sentence has m predicates and n objects, the total number of triplets generated per sentence is \(m \times n\). The following is an example input for the type C classifier:

    [[ approach ]]: Our method implicitly \(<<\) uses \(>>\) a [[ differential context ]] obtained through supporting and contrasting exemplars to obtain a differentiable embedding.

  4.

    Type D Type D triplets are similar to type C: the subject is always an information unit, but the predicate does not come from the sentence either; instead, the non-sentence predicate word ’has’ is used. If a sentence has n objects, n triplets are generated per sentence. The following is an example input for the type D classifier:

    [[ hyperparameters ]]: Parameter optimization is performed with [[ mini batch stochastic gradient descent ( SGD ) ]] with batch size 10 and momentum 0.9.
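To make the candidate counts above concrete, the following sketch enumerates candidate triplets for types A to D from the phrases of a single sentence; the function names are illustrative, and the generated candidates would subsequently be validated by the corresponding type-specific binary classifiers.

```python
from itertools import combinations

def type_a_candidates(predicates, arguments):
    # arguments: subject/object phrases in sentence order; predicates: predicate phrases.
    # For each predicate and each argument pair ordered by position we form (S, P, O),
    # giving m * n * (n - 1) / 2 candidates per sentence.
    return [(s, p, o) for p in predicates for s, o in combinations(arguments, 2)]

def type_b_candidates(arguments):
    # The subject precedes the object in the sentence and the predicate is the
    # implicit 'has', giving n * (n - 1) / 2 candidates per sentence.
    return [(s, "has", o) for s, o in combinations(arguments, 2)]

def type_c_candidates(info_unit, predicates, objects):
    # The information unit is always the subject; predicate and object come from
    # the sentence, giving m * n candidates per sentence.
    return [(info_unit, p, o) for p in predicates for o in objects]

def type_d_candidates(info_unit, objects):
    # The information unit is the subject and 'has' the predicate,
    # giving n candidates per sentence.
    return [(info_unit, "has", o) for o in objects]
```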

For triplet types A, B, C, and D, we construct and validate a separate classifier for each type. Type E triplets are extracted with a rule-based approach. In this type, the subject is always Contribution, and the predicate is one of has research problem, Code, or has. If the predicate is has research problem, the triplet belongs to the Research Problem IU, and the object is the phrase extracted from the sentence belonging to that IU. If the predicate is Code, the triplet belongs to the Code IU, and the object is the URL extracted from the sentence. If the predicate is has, the triplet links the paper to an IU: for example, if a paper has at least one sentence belonging to the Results IU, the triplet (Contribution || has || Results) is added for that paper, signifying the presence of that IU in the paper. Using these rules, we achieve an F1 score of 1.00 for type E triplets. Approximately 3% of the triplets in the dataset span more than one sentence (cross-sentence triplets); our work does not cover these triplets.
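A minimal sketch of the type E rules is given below, assuming that sentences have already been assigned to IUs and that phrases and URLs have been extracted upstream; the input dictionary format is an illustrative assumption.

```python
def type_e_triplets(paper):
    # paper: dict mapping IU name -> list of sentence-level extractions, e.g.
    # {"Research Problem": [{"phrase": "..."}], "Code": [{"url": "..."}], "Results": [...]}
    triplets = []
    for phrase_info in paper.get("Research Problem", []):
        triplets.append(("Contribution", "has research problem", phrase_info["phrase"]))
    for code_info in paper.get("Code", []):
        triplets.append(("Contribution", "Code", code_info["url"]))
    # Link the paper to every IU that has at least one contribution sentence.
    for iu, sentences in paper.items():
        if sentences:
            triplets.append(("Contribution", "has", iu))
    return triplets
```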

7 Evaluation

In this section, we discuss the experiment setup, baseline model, results, error analysis, annotation anomalies, and ablation analysis.

7.1 Experimental setup

We implement our proposed ContriSci model in the PyTorch [9] framework using SciBERT (with its WordPiece vocabulary, scivocab), \(allenai/scibert\_scivocab\_uncased\).Footnote 8 Each task has its own multilayer perceptron (MLP) [85] head, consisting of a fully connected layer with 768 neurons followed by a classification layer. The batch size is set to 16, and the AdamW [43] optimizer is used to train the model. We use the Tanh and Softmax activation functions [65] on the fully connected and classification layers, respectively. In trial experiments with epochs \(=\) 2, 3, and 4, we found that the best validation F1 score is achieved with epochs \(=\) 2 and that further training leads to overfitting, so we fix epochs \(=\) 2 while tuning the remaining hyperparameters. We tune the following five hyperparameters: the learning rate (\(\alpha \)), \(\lambda _1\), \(\lambda _2\), dropout, and class weights. We vary \(\alpha \) in the range [1e-6, 2e-5] and the dropout [54] over {0.1, 0.2}. \(\lambda _1\), the weight of the section identification loss, and \(\lambda _2\), the weight of the citance classification loss, are each varied between [0, 0.3]. In the main task, the class weights are varied in the range [0.5, 0.88]. Because the number of hyperparameter combinations is large, we apply a random search algorithm. The ContriSci model performs best with \(\lambda _1 = 0.18\), \(\alpha = 1\hbox {e}-5\), \(\lambda _2 = 0.09\), dropout \(=\) 0.2, and class weights \(=\) 0.75. We obtain the best result when the input is ordered as (Current Sentence + # + Subheading + # + Previous Sentence + # + Next Sentence).
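The following sketch illustrates this random search over the five ContriSci hyperparameters; the train_and_eval callback and the number of trials are placeholders for the actual two-epoch training runs.

```python
import random

# Search space for the five ContriSci hyperparameters described above.
SEARCH_SPACE = {
    "lr":           lambda: random.uniform(1e-6, 2e-5),   # alpha
    "lambda_1":     lambda: random.uniform(0.0, 0.3),     # section identification loss weight
    "lambda_2":     lambda: random.uniform(0.0, 0.3),     # citance classification loss weight
    "dropout":      lambda: random.choice([0.1, 0.2]),
    "class_weight": lambda: random.uniform(0.5, 0.88),    # class weight in the main loss
}

def sample_config():
    return {name: draw() for name, draw in SEARCH_SPACE.items()}

def random_search(train_and_eval, n_trials=20):
    # train_and_eval(config) -> validation F1; a placeholder for the two-epoch
    # training run, whose combined objective is
    # L = L_main + lambda_1 * L_section + lambda_2 * L_citance.
    best_f1, best_cfg = -1.0, None
    for _ in range(n_trials):
        cfg = sample_config()
        f1 = train_and_eval(cfg)
        if f1 > best_f1:
            best_f1, best_cfg = f1, cfg
    return best_cfg, best_f1
```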

We implement the SciBERT-CRF-based phrase extraction model in the PyTorch framework, initializing the SciBERT layers with the pretrained weights of allenai/scibert_scivocab_uncased. On top of SciBERT, we add fully connected layers followed by a CRF layer, whose weights are initialized randomly. The batch size is set to 1. We train the model for five epochs with the AdamW [43] optimizer, the Tanh activation function, and a linear learning-rate scheduler. We tune the learning rates and dropout with a grid search: the SciBERT learning rate is varied over {5e-6, 1e-5, 2e-5, 5e-5}, the learning rate of the remaining layers over {1e-4, 2e-4}, and the dropout over {0.1, 0.2}. The best model uses a SciBERT learning rate of 2e-5, a learning rate of 1e-4 for the remaining layers, and a dropout of 0.2.
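A minimal sketch of such a SciBERT-CRF tagger is shown below, assuming a BILOU-style tag set and the pytorch-crf package; the specific CRF implementation and head layout are assumptions for illustration.

```python
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf (assumed library choice)

class SciBertCrfTagger(nn.Module):
    """Sketch: SciBERT encoder, fully connected layers, and a CRF on top,
    tagging tokens with BILOU phrase-boundary labels."""
    def __init__(self, num_tags, dropout=0.2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
        hidden = self.encoder.config.hidden_size  # 768
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(hidden, hidden),
                                  nn.Tanh(), nn.Linear(hidden, num_tags))
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden_states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.head(hidden_states)
        mask = attention_mask.bool()
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood loss
        return self.crf.decode(emissions, mask=mask)      # best BILOU tag sequences
```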

We use the SciBERT-for-sequence-classificationFootnote 9 model for IU classification and triplet extraction. In each model, we use the Tanh activation function between layers and the Softmax activation function in the final layer. We train each model for ten epochs with a learning rate of 1e-5 and a dropout of 0.1, using the AdamW optimizer and a polynomial decay [42] scheduler with 500 warmup steps and a decay power of 0.5.
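The optimizer and scheduler setup described above can be sketched as follows; the number of training steps and the number of labels are illustrative placeholders.

```python
import torch
from transformers import (AutoModelForSequenceClassification,
                          get_polynomial_decay_schedule_with_warmup)

# Sketch of the optimizer/scheduler configuration described above.
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=9)  # e.g. the IU classes; triplet-type classifiers use num_labels=2
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
num_training_steps = 10 * 1000  # assumption: 10 epochs x (steps per epoch)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer, num_warmup_steps=500,
    num_training_steps=num_training_steps, power=0.5)
```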

7.2 Baseline

In this section, we explore the top-performing models for each task. We compare our proposed models with these models.

7.2.1 ContriSci model

We compare our model with the baselines proposed by Shailabh et al. [64] and Liu et al. [39]. Shailabh et al. [64] use pre-trained SciBERT with a BiLSTM [30] as a sentence-level binary classifier: sentences from the Stanza-parsed file are fed into SciBERT, and its last layer is passed to stacked BiLSTM layers. Liu et al. [39] present a SciBERT-based binary sentence classifier with additional features capturing sentence characteristics, including the topmost and innermost section headers and the position of the sentence in the article.

7.2.2 Phrase extraction

We compare our model with the baseline model of Shailabh et al. [64]. They add a BiLSTM layer on top of SciBERT, place a CRF layer on top of the SciBERT+BiLSTM stack, and use the BILOU scheme to mark phrase boundaries.

7.2.3 Information units classification and triplet extraction

Liu et al. [39] use a BERT-based multi-class classifier for the IU classification task. They merge the two special pairs (Model vs. Approach and Experimental-Setup vs. Hyperparameters) in the multi-class classifier and, after classification, use lexical rules to differentiate between the units of each pair. For triplet extraction, Liu et al. [39] classify triplets into different types depending on whether and how their components are expressed in the text, and then validate each type with independent BERT-based classifiers and a rule-based approach. Instead of developing traditional neural open information extraction (OIE) architectures for triplet extraction, the ECNUICA team [37] construct potential triplets using manually developed rules and train a binary classifier to distinguish positive candidates from negative ones. Zhang et al. [84] form all feasible triplet candidates from the classified scientific phrases and develop a BERT-based binary classifier that labels candidates as true or false. The classifier downsamples negative candidates by producing them artificially, using random replacement (RR) of one of the arguments of an actual triplet with a false argument and random selection (RS) of triplets whose arguments do not form a valid pair with one another. They also use an adversarial training approach.

7.3 Results and analysis

We compare the performance of our proposed models to the SemEval 2021 results.

We evaluate the performance of our models and compare it with the results of existing models on the CodaLab leaderboard system.Footnote 10 Our team name on the leaderboard is IITP. To ensure robustness, we conduct significance tests that assess how well the models generalize. Using the scores of five separate runs, we apply Welch’s t test [77] at a significance level of 5% (0.05). For imbalanced classification datasets with unequal group sample sizes, Welch’s t test is a better fit than other significance tests [53]. We also test the normality of the data, which is a prerequisite for this test. The objective is to demonstrate that the improved F1 score of our proposed approach is not a random occurrence but statistically significant.
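A minimal sketch of this significance-testing procedure with SciPy is shown below; the listed F1 scores are illustrative values, not our actual experimental results.

```python
from scipy import stats

# F1 scores from five separate runs of the proposed model and a baseline
# (illustrative numbers only).
proposed = [64.1, 64.5, 63.9, 64.3, 64.2]
baseline = [59.2, 59.6, 59.1, 59.5, 59.4]

# Check normality first (a prerequisite noted above), then apply Welch's t test,
# i.e. a two-sample t test that does not assume equal variances.
print(stats.shapiro(proposed), stats.shapiro(baseline))
t_stat, p_value = stats.ttest_ind(proposed, baseline, equal_var=False)
significant = p_value < 0.05
```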

7.3.1 ContriSci model

In Table 5, the first three results are taken from the SemEval 2021 leaderboard. The first is the SciBERT+BiLSTM model proposed by Shailabh et al. [64], which adds a BiLSTM layer on top of SciBERT and reaches a best F1 score of 46.80%. The second is SciBERT with positional features [39], which achieves an F1 score of 57.27%, and the third model achieves 59.41%. Our multitask model surpasses the previous state of the art by 4.8%, achieving an F1 score of 64.21%. Figure 8a shows the precision, recall, and F1 score of the main task, i.e., identification of contribution sentences, across different sections. In the NCG training dataset, most paper titles are contribution sentences, and hence the title section has a recall of 1.0. The sentences in the Method section are highly skewed (1:19) toward the non-contribution class, which may be one of the underlying causes of our model’s difficulty in discriminating between contribution and non-contribution sentences in this section. Figure 8b shows the results for cited sentences, which reach a higher F1 score of around 0.7; the citance classification scaffold task therefore plays an important role. Compared to the existing models, the improvements are statistically significant with p-values of 0.024, 0.015, and 0.009.

Table 5 Results on the NCG test set. The table also compares the proposed ContriSci model with the top-performing models reported in the SemEval 2021 competition. Here, P: Precision and R: Recall. Results are computed using the gold-standard annotations available for the CS identification task. The reported results are statistically significant \((p<0.05)\) based on a t test [35]
Fig. 8

a Evaluation results w.r.t F1, Precision, Recall for section-wise contribution sentence identification in ContriSci model. b Evaluation results w.r.t. F1, Precision, Recall for the identification of contributed cited sentences in ContriSci model

7.3.2 Phrases extraction

We utilize a BERT-CRF model to extract scientific phrases and their relations. The NCG dataset contains a total of 6,093 training samples; additionally, we incorporate the SciERC and SciClaim datasets from the NLP domain as extra training data. We achieve an F1 score of 77.47%, while the best model scores 78.57%, as presented in Table 6, leaving us only 1.1% behind. Compared with the existing models, ours is far less complex: for instance, the model of Liu et al. [39], regarded as the top phrase extraction model, uses an ensemble of 96 models, whereas we use a single SciBERT [10] model coupled with additional datasets. We do not conduct significance tests here because our phrase extraction model does not outperform the best model.

Table 6 Results of the phrase extraction model, computed using the gold-standard annotations available for the phrase extraction task

7.3.3 Information units classification and triplet extraction

First, for the triplet extraction model, we classify predicates using a BERT-based binary classifier, which achieves an F1 score of 98%. For IU classification, we use a BERT-based multi-class classifier. As shown in Table 7, our IU classifier achieves an F1 score of 84.52%, 2.03% ahead of the existing best model. The improvements over existing models are statistically significant with p-values of 0.029, 0.018, and 0.015.

Table 7 Results of IU classification model. Results of the task using gold-standard annotations available for IU classification task. The reported results are found to be statistically significant \((p<0.05)\) based on a t test [35]
Table 8 Results of Triplet Extraction (IU + Triplets) model. Results of the task using gold-standard annotations available for Triplet Extraction task. The reported results are found to be statistically significant \((p<0.05)\) based on a t test [35]

Table 8 compares the triplet extraction performance. The first result in the table is UIUC BioNLP [39], which divides the triplets into six categories and creates a separate BERT-based classifier for each triplet type, achieving an F1 score of 61.29%. The second result is ECNUICA [37], which forms triplet candidates based on the order of the scientific terms in the sentences and then classifies candidates as true or false with a BERT-based binary classifier, achieving an F1 score of 44.73%. The third result is ITNLP [84], which forms all possible triplets from the classified scientific terms using rule-based heuristics and then classifies them with a BERT-based classifier, achieving an F1 score of 40.82%. Our proposed triplet extraction model outperforms all these models with an F1 score of 62.71%. UIUC BioNLP [39] applies a BERT-based multi-class classifier for IU classification; one drawback of their model is the overlap between the (experimental-setup vs. hyperparameters) and (model vs. approach) pairs. We explicitly handle such overlapping sentences: we first classify the IUs with an 8-class BERT-based multi-class classifier in which hyper-setup is a merged label for the experimental-setup and hyperparameters pair, and we then reclassify the hyper-setup sentences with a BERT-based binary classifier that takes the enclosing paragraph as input. Our classifier achieves an F1 score of 84.52%, an improvement of 2.03% over the previous state-of-the-art performance. Compared to competitive models, the results are statistically significant with p-values of 0.037, 0.039, and 0.023.
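A minimal sketch of this two-stage decision is given below; the classifier callables are placeholders for the trained multi-class and binary models.

```python
def classify_iu(sentence, paragraph, multi_clf, hyper_setup_clf):
    # Sketch of the two-stage IU decision described above. multi_clf returns one
    # of the merged 8-class labels; hyper_setup_clf is a binary classifier that
    # splits the merged 'hyper-setup' label using the surrounding paragraph.
    label = multi_clf(sentence)
    if label == "hyper-setup":
        label = "Hyperparameters" if hyper_setup_clf(paragraph) == 1 else "Experimental-Setup"
    return label
```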

Figure 9 shows the confusion matrix of the IU classification. There is considerable overlap between the approach and model IUs because most of these sentences belong to the Introduction section of the paper and hence have a similar semantic structure. Another pair (experiments vs. results) also shows overlap, with 59% of the sentences belonging to the experiments IU predicted as results. We tried to resolve these issues with independent binary classifiers for both pairs, but their performance was worse than that of the multi-class classifier, so we kept the multi-class classification model. Although our model classifies 9 classes, we compare our proposed model fairly with the existing results. For Code IU sentences, we use a rule-based approach that handles them successfully. It is also worth noting that Dataset IU sentences are misclassified as the model IU, as shown in Fig. 9; despite this, our model continues to outperform the established benchmarks. Furthermore, the absence of the Task IU from the test set is reflected in the confusion matrix. We believe our model’s performance remains comparable with the existing models, given the availability of conclusive results for twelve individual IUs.

Fig. 9

Confusion matrix for IU classification

7.3.4 Pipeline results

In the pipeline, the four models are connected sequentially, and only the positive examples from each model are passed as input to the next, as sketched below. Table 9 compares our end-to-end performance with the inter-annotator agreement (IAA) [20] and the model of Liu et al. [39] on each subtask. Although our system’s performance for contribution sentence identification is lower than human performance (64.21% vs. 67.44% F1), for phrase extraction our model outperforms the previous best of Liu et al. [39] (51.30% vs. 46.41% F1) in the pipeline setting. For IU classification, our model outperforms the human annotators (80.00% vs. 79.73% F1), and triplet extraction achieves the best results compared to all existing models (34.63% vs. 22.28% F1). Our overall pipeline F1 score of 57.54% is also 4% ahead of the IAA. We further assess the impact of imbalanced document distribution in two sub-tests by selecting 10 evenly distributed articles and 10 unevenly distributed articles, containing 2,330 and 2,387 sentences, respectively. Testing the model on both sets, the difference in F1 score is only 0.005, which is negligible; we therefore conclude that the imbalanced distribution does not significantly affect the model’s performance.
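The end-to-end pipeline can be sketched as follows; the four model callables are placeholders for the trained components, and only positive outputs flow from one stage to the next.

```python
def run_pipeline(article_sentences, contrib_model, phrase_model, iu_model, triplet_model):
    # Sketch of the four-stage pipeline described above.
    contrib_sentences = [s for s in article_sentences if contrib_model(s)]   # stage 1
    phrases = {s: phrase_model(s) for s in contrib_sentences}                # stage 2
    info_units = {s: iu_model(s) for s in contrib_sentences}                 # stage 3
    triplets = []
    for s in contrib_sentences:                                              # stage 4
        triplets.extend(triplet_model(s, phrases[s], info_units[s]))
    return contrib_sentences, phrases, info_units, triplets
```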

Table 9 Results of end-to-end model. IAA: inter-annotator agreement
Table 10 Error analysis in our proposed model (ContriSci)

7.4 Error analysis

Although our proposed models outperform existing approaches, they still have some errors. This section is dedicated to discussing the specific errors encountered in our proposed methods.

Table 11 The scaffold tasks are analyzed separately and compared with the proposed model. Here, PM: Proposed Model, CC: Citance Classification, SI: Section Identification

7.4.1 Error analysis in ContriSci model

Table 10 shows sentences misclassified by the ContriSci model. Among these examples, 1 and 2 are false negatives, while 3 and 4 are false positives. Example 1 is a contribution sentence that is erroneously classified as non-contribution. The maximum input sequence length for SciBERT is 512 tokens, but we use a maximum of 256 tokens so that the model fits in the available GPU memory. This choice may affect the model’s ability to learn from longer sentences: in our experiments, the proposed model does not learn effectively from long sentences, which we attribute to this sequence length limitation. While it is a known limitation of SciBERT, we acknowledge that it may impact performance on tasks that require understanding longer sentences. The sentence in example 1 has about 100 tokens and is split by periods into clauses that contain few scientific words, and those words are not relevant to the model proposed in the respective paper. In example 2, our model cannot correctly classify short sentences, usually subheadings, because these subtitles lack contribution-related contextual information. In example 3, the sentence is a false positive: it contains the phrase 'we propose' and is therefore declared a contribution sentence, yet it conveys no essential information about the paper, so it is actually a non-contribution sentence. Example 4 is also a non-contribution sentence; it is misclassified because it gives information about the model even though its contribution to the article is minor. The proposed model does not judge well whether a sentence’s contribution to the article is large or small.

In Table 11, we analyze the scaffold tasks separately. In the first example, both the basic SciBERT model and the multitask models with a single scaffold incorrectly predict non-contribution, whereas our proposed model makes the correct prediction. The basic SciBERT model also struggles with the second example. In the third and fourth examples, one of the single-scaffold models makes an incorrect prediction. In general, a model with the section identification scaffold performs better on sentences with more numerical information, as evident from the second and third examples. Our proposed ContriSci model correctly classifies all the examples shown in Table 11.

7.4.2 Error analysis in phrase extraction model

The most frequent error of our proposed model is merging two phrases into one, as seen in example 1 of Table 12. The second error, also illustrated in Table 12, occurs when the model is unsure whether to include leading pronouns and adverbs in a single phrase, so the phrase is split into two parts. Furthermore, the model does not learn which kinds of parenthesized information belong to a phrase and which do not: in example 2 of Table 12, our model predicts the text in parentheses as part of a phrase.

Table 12 Error analysis in our proposed model (phrase extraction)
Table 13 Error analysis in proposed model (Triplet Extraction)

7.4.3 Error analysis in triplet extraction model

In Table 13, examples 3 and 4 are a false positive and a false negative, respectively. After analyzing both sentences, we find that the model does not learn the semantics appropriately; it is confused by the phrase 'instead of' in these sentences. In example 5, a false negative, our model fails to recognize the relation between a subject and an object separated by words such as which, are, is, that, and can. For every training sentence, we generate all possible triplet combinations, yielding consecutive and non-consecutive triplets, of which 97% of the non-consecutive triplets are invalid. Hence, a significant majority of the non-consecutive triplets fall into the false-negative category.

Table 14 Annotation Anomalies in NCG Training Set

7.5 Annotation anomalies

In this section, we explore the anomalies found in the annotation of the NCG dataset.

7.5.1 Annotation anomalies in contribution sentences

While analyzing the NCG training set, we identify numerous annotation anomalies in the contribution sentence identification data. Table 14 describes some of these anomalies.

Table 15 Phrases Annotation Anomalies in NCG Training Set

7.5.2 Annotation anomalies in phrases

Table 15 shows two sentences that contain identical phrases; however, the phrase in the first sentence is annotated without a citation, while the phrase in the second sentence is annotated with one. This inconsistency within the NCG dataset highlights a notable annotation issue. Table 15 also shows another type of anomaly in the NCG training set. Conventionally, the predicate acts as the relation between subject and object, so the expected order of a triplet is subject, then predicate, then object. The instance in Table 15, extracted from paper number 69 of the Natural Language Inference domain in the NCG training set, does not adhere to this standard order; the annotated label index of the predicate is also shown in Table 15. In total, 99 such sentences are found in the NCG training set, in which the order is predicate, subject, and then object. By contrast, the triplet order for this sentence should be as in example 2.

7.6 Ablation analysis

In this section, we present a discussion of the ablation analysis conducted on our proposed models.

7.6.1 Analysis of ContriSci model

We report ablation studies of our proposed model for contribution sentence identification on the NCG test set. In the NCG training set, 93.91% of the training input sequences are shorter than 128 tokens, while 99.8% are shorter than 256 tokens. Table 16 shows the results of the ContriSci model with maximum input lengths of 128 and 256 tokens. When training, we conduct experiments both with and without the surrounding sentences as well as without the scaffold tasks. With a maximum length of 128 tokens, the model using surrounding sentences achieves an F1 score of 58.16%, while the model without them obtains 55.76%, a notable difference of 2.40%. Both scaffold tasks, section identification and citance classification, boost the model’s performance, and there is a slight further improvement when the maximum input length is set to 256 tokens.

Table 16 Performance of ContriSci model along with individual scaffold tasks when the model is trained with 128 and 256 tokens, respectively

7.6.2 Ablation analysis of phrase extraction

Table 17 presents the ablation analysis of our phrase extraction model with additional datasets. Using only the NCG dataset, we achieve an F1 score of 76.28%. Incorporating the SciClaim or the SciERC dataset yields F1 scores of 76.70% and 76.72%, respectively. Recall is higher with the SciClaim dataset (79.18%) than with SciERC, whereas Precision is higher with the SciERC dataset (75.62%), indicating a complementary relation between the two datasets. When both additional datasets are combined, we achieve the highest F1 score of 77.47%.

Table 17 Performance of Phrase Extraction model along with individual dataset
Fig. 10

Type-wise (A-D) triplet results

7.6.3 Ablation analysis of triplet extraction

We analyze the triplets IU-wise, as presented in Table 18, and type-wise, as illustrated in Fig. 10. All type E triplets, which belong to the Research Problem and Code IUs, are extracted by the rule-based approach and achieve an F1 score of 100%. The Experiments IU has 1,273 triplets in the test dataset, distributed across types as (A \(=\) 546, B \(=\) 284, C \(=\) 3, D \(=\) 4, E \(=\) 32, Others \(=\) 404). Approximately 31.74% of these triplets do not align with any specific category and remain unextracted. Triplets of type B make up 22.30% of the total and exhibit the lowest performance, as indicated in Fig. 10. For these reasons, our triplet result for the Experiments IU is 59.09%. In 77% of type B triplets, the subject is immediately followed by the object (without any phrase in between); as a result, our model struggles to capture longer-range relationships between subject and object.

Table 18 The Triplets IU-wise Results

8 Conclusion and future work

In this paper, we propose a pipeline neural network model for extracting NLP contributions from scientific articles. The proposed model is divided into four tasks: (1) identification of contribution sentences, (2) phrase extraction from contribution sentences, (3) IU classification, and (4) triplet extraction. We introduce a multitask architecture with two supporting tasks for the identification of contribution sentences, section identification and citance classification; this multitask model achieves an F1 score of 64.21%. We utilize a BERT-CRF model for the phrase extraction task and achieve an F1 score of 77.47%. To classify the IUs, we propose a BERT-based multi-class classifier, complemented by a binary classifier that distinguishes between the hyperparameters and experimental-setup IUs; our IU classification model achieves an F1 score of 84.52%. Finally, for triplet extraction, we achieve an F1 score of 62.71%. We obtain state-of-the-art results on the contribution sentence identification, IU classification, and triplet extraction tasks, as well as in the end-to-end pipeline, where we achieve an F1 score of 57.54%. Our phrase extraction model does not yet outperform the best existing model, and we plan to improve it in future work. Another promising direction is to add new scaffold tasks to our ContriSci model for further performance improvement. Moreover, we will improve the IU classifier by better handling sentences that conflict between the Model and Approach IUs.