Introduction

In the last few decades, the advent of computers and later the World Wide Web (WWW) has changed human civilization dramatically. We now live in a world overloaded with data and information. This information overload poses new challenges to human intellect and thereby creates opportunities for innovation. The WWW has also affected the overall growth of scientific literature. According to a study by Price (1961), the amount of research data doubles every ten to fifteen years. Additional resources (Mudrak 2016; NSF 2018) indicate that around 2.2 million new scientific articles were published in 2016. Major reasons for this rapid growth include the increased number of publication venues, online digital libraries and the ease of acquiring scientific literature, facilities that were not available in the pre-digital age. According to a report issued by the International Association of Scientific, Technical and Medical Publishers, the number of publishing scientists increases by 4–5% annually, and as of 2014 there were around 28,100 peer-reviewed scholarly journals in English (Ware and Mabe 2015).

This increase in scientific content poses significant challenges for researchers who want to determine the state of the art in their field of interest. To perform a literature review, literature is first acquired from a variety of relevant research repositories. The acquired results are then filtered through manual analysis, and the findings from the relevant scientific articles are consolidated to determine the state of the art of the desired field. This process of performing a systematic literature review is of utmost importance for researchers, as it supports gap analysis and identifies room for innovation. At the same time, it is a very time-consuming, cumbersome and laborious task. According to one systematic literature review guideline, a quality review can take up to one year (Morin 2017). Another study reports that a systematic literature review can take up to 186 weeks with single or multiple human resources (Borah et al. 2017).

To provide researchers with basic filters, many research organizations and scientific publishers such as ACM, IEEE and Springer provide digital research repositories. These libraries offer search filters that ease querying across millions of research articles. They employ metadata extracted from scientific articles to provide various search facilities; metadata extraction therefore saves researchers' time during literature acquisition. The next step of a literature review is to read and consolidate the findings from the acquired literature. This step requires going through a bulk of scientific articles to determine the state of the art in a specific domain of interest. From a researcher's point of view, this whole process is of utmost importance but time-consuming, laborious and cumbersome.

In the light of the above points, it is evident that automated analysis of research papers will eventually aid researchers. The pertinent question is how potential information can be automatically extracted from scientific articles. A whole domain, Information Extraction (IE), is dedicated to extracting potential information nuggets from data. IE is mainly concerned with extracting structured data from unstructured or semi-structured data. It is widely used across multiple domains; for example, in medical sciences IE is applied to extract patients' information, their previous medical history, causes and respective cures (Harkema et al. 2005). IE draws on concepts and techniques from Machine Learning, Natural Language Processing (NLP), Text Mining (TM) and Information Retrieval (IR). Various research studies describe the state of the art in the domain of IE (Simoes et al. 2009; Sirsat et al. 2014).

The survey presented in Simoes et al. (2009) focuses on categorizing the IE tasks reported in the literature and the techniques used to perform them. It categorizes IE tasks into five major classes: segmentation, classification, association, normalization and co-reference resolution. Segmentation refers to splitting the data into atomic segments such as tokens. Classification deals with assigning each segment to a suitable class, called an entity. According to Simoes et al. (2009), the major techniques employed for classification include Hidden Markov Models (HMM) and Maximum Entropy Markov Models (MEMM). Association focuses on extracting relations between related entities; major algorithms used for this task include context-free grammars, MEMM and Conditional Random Fields (CRF). Normalization and co-reference resolution are less generic, as they require domain-specific information. Normalization transforms different representations of the same entity into a single canonical form, usually via human-designed conversion rules and regular expressions. Co-reference resolution identifies text fragments that refer to the same real-world entity.

Among the IE tasks mentioned in Simoes et al. (2009), the classification task is usually referred to as Named Entity Recognition and Classification (NERC). NERC is a sub-problem of IE that deals with the extraction of named entities (NEs) while taking the surrounding context into consideration: it recognizes named entities and then classifies them into rhetorical categories. It is of utmost importance for other IE, NLP and TM tasks, including relation extraction, event detection, question answering and machine translation. Table 1 presents the NEs that can be extracted from the following short paragraph.

Table 1 A sample NERC/IE task

Valencia is on her way to Wal-Mart super-store in Austin. She is asked to bring couple of coffee bags. Her nephews from Valencia are waiting for her arrival.

In this example, Valencia is a person name in the opening sentence of the paragraph, whereas in the last sentence it is a geographical location. Thus, NERC recognizes the sense of an entity based on its surrounding context.
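As a concrete illustration of such context-sensitive recognition, the following minimal sketch runs an off-the-shelf NER model over the example paragraph. It assumes the spaCy library with its small English model (en_core_web_sm) is installed and is not tied to any particular study covered in this survey.

```python
# Minimal sketch of context-dependent NER with spaCy (assumes the library and
# the en_core_web_sm model are installed; not part of any surveyed system).
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Valencia is on her way to Wal-Mart super-store in Austin. "
        "She is asked to bring couple of coffee bags. "
        "Her nephews from Valencia are waiting for her arrival.")

for ent in nlp(text).ents:
    # A well-trained model should tag the first "Valencia" as a person-like
    # entity and the last one as a geographical location (GPE).
    print(ent.text, ent.label_)
```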

Multiple survey studies present the current progress in the domain of NERC (Kanya and Ravi 2012; Palshikar 2013; Patil et al. 2016; Sharnagat 2014). These surveys classify the NER literature along various dimensions: some focus on the employed approaches, namely rule-based and machine-learning oriented solutions, whereas others classify studies by the language of the underlying resources. Most of this literature concerns NERC on news datasets and well-formatted English text, where the primary task is to identify person names, locations and organizations. Such annotated benchmark datasets are available in a variety of languages, including English, Spanish, Arabic and Chinese.

In addition to surveys of conventional NERC problems, there are surveys covering NERC applied to medical scientific articles. In recent years, many developments have been made in medical sciences, genetics and other biological domains (Abdelmagid et al. 2014; Duck et al. 2016; Shickel et al. 2017). A major reason for this rapid development is the availability of formal ontologies, extensive corpora and lexicons. These language resources, together with sophisticated rule-based and machine-learning approaches, are employed to extract entities that are often related to genomics, gene relations, proteins and molecular information. Such surveys focus on bio-specific entities and hence are not generic in nature.

Research literature is increasing exponentially across various disciplines, so there is a need to consolidate the findings made so far regarding information extraction from this ever-growing scientific literature. The emphasis of this paper is on studies that are applicable to a wide range of domains. Therefore, developments in bio-specific entity extraction are not included in the current survey; however, studies that extract generic insights from medical datasets are included.

To present a survey focused on generic IE from scientific literature, the current work reviews ongoing advancements against the two major information constituents of a scientific article explained above: its metadata and its body. To the best of our knowledge, no comprehensive survey presents such insights for scholarly literature. Although comparative studies evaluate the performance of various information extractors for scientific articles, these studies focus on developed tools and are more inclined towards practical aspects.

In the light of the above points, a survey presenting state-of-the-art advancements along with open areas carries great importance. Therefore, the current work compiles and analyzes research and applications of the NERC task of IE applied to research papers, with respect to both metadata and the article body. The study covers the major datasets of scientific articles, the evaluation results reported on these datasets, and the approaches employed to perform IE from scientific articles. As the survey focuses on extraction of general insights from articles, it does not describe the tools or techniques used to pre-process article content; preprocessing techniques required to convert the input into feature vectors are therefore not part of the current study.

This survey aims to assist researchers interested in recent advancements and in an overview of automatic IE from scientific articles. It further highlights open research areas and future prospects in this domain. Since metadata and insights from full text can include many sub-fields, the study provides detailed, field-level results rather than reporting average results only. Results against coarse-grained fields provide better insight into current gaps in the literature by indicating which specific fields currently perform worse than the rest. Hence, this study should be very helpful for researchers interested in mining scientific literature.

The rest of this study is organized as follows. The “Methodology” section describes the methodology used to conduct this study and briefly explains the primary classification of the literature, followed by the widely used evaluation metrics of the domain in the “Evaluation metrics” section. The “Metadata extraction” and “Key-insights extraction” sections describe the state of the art in metadata and key-insights extraction from scientific articles, respectively. Finally, the “Conclusion and future work” section provides the overall conclusion and future prospects, with the bibliography presented in “References”.

Methodology

To conduct this study, a literature review was first performed to determine the state of the art of the domain. For this purpose, two well-known research repositories, ACM and IEEE, were used to retrieve relevant papers published up to 2017. Seed words most relevant to scientific literature were first identified by exploring synonyms and related words. Both repositories were then queried with the identified seed words within publication titles only. All queries were made via the advanced search options; in the case of ACM, the ACM Guide to Computing Literature bibliographic database was used for wider coverage.

The querying mechanism enforced the presence of all words in the titles of the retrieved articles, i.e. an AND operation was performed among the query strings, and double quotes ensured that a whole phrase appears together in the title. Among the seed words, “research article” returned a huge number of results. When these results were analyzed, considerable noise was observed in the form of conference proceedings whose names end with “Research Articles”, and some publishers used similar words for proceedings names. To avoid such results, the corresponding query was refined by adding more fields. After filtering such records, around 200 results were acquired for the “research article” query from ACM. Statistics on the initial results for each seed word are given in Table 2.

Table 2 Stats against initial queries

The acquired results were then manually filtered based on their relevance and categorized into major classes. This categorization was made after reading only the titles of the scientific articles. The tentative count of articles in each category is also mentioned:

1. Information Extraction (~ 80)
2. Recommender Systems (~ 45)
3. Classification and Clustering (~ 20)
4. Summarization (~ 20)
5. Citation Analysis (~ 50)
6. Structural studies (~ 40)

A brief overview of the overall methodology followed in this study is presented in Fig. 1. After categorization of the acquired literature, the articles on information extraction were studied and further categorized into two types: metadata and key-insights. The state-of-the-art approaches and datasets for each category were then determined. Based on this process, the research findings of this study are consolidated to present the current state of the domain.

Fig. 1 Overall flow of study

Many researchers have contributed to extracting information from scientific articles. A scientific article generally consists of two major constructs, metadata and full-body text; therefore, existing research can be broadly classified into two categories:

1. Metadata Extraction
2. Key-insights Extraction

Metadata Extraction: The semi-structured format of scientific articles can be exploited to automatically extract metadata. This information holds great importance in the context of digital research repositories and includes the title of a scientific article, its authors, the publication venue, the date of publication and keywords. In addition, metadata within citations carries immense importance, especially in the domain of Scientometrics. As metadata can be used to perform a variety of other tasks, including article recommendation and citation analysis, the current study compiles and presents research progress in this area.

Key-Insights Extraction: Apart from the structured parts, the full text of a research paper has its own importance. A researcher can have various research questions in mind while reading a scientific article, including:

1. Problem addressed in a scientific article
2. Domain of a research study
3. Methodology/Algorithms/Processes used to address the problem
4. Datasets used to conduct the experiments
5. Tools used to perform the experiments
6. Evaluation measures to gauge performance
7. Results achieved in a research study
8. Limitations of a research study
9. Future extensions

Automatic extraction of such insights can provide substantial ease to researchers performing literature reviews. In addition, if these insights are extracted from a bulk of scientific data, literature gaps can be identified efficiently. Hence, this study covers ongoing advancements towards automatic key-insights extraction from scientific articles.

Evaluation metrics

A very important aspect of measuring progress within any research area is its evaluation. Owing to their importance, this section briefly describes the evaluation metrics employed in the reported IE literature. An IE system is usually evaluated by comparing the extracted information with a gold-standard dataset. Gold-standard datasets are mostly annotated by humans and serve as the ground truth. The major evaluation metrics are Precision, Recall, F-measure and Accuracy. Precision measures how much of the extracted information is correct, whereas Recall measures how much of the correct information is extracted. Usually, a confusion matrix is constructed to calculate the various evaluation measures for a classification problem. Table 3 shows a confusion matrix for a binary classification problem; the concept extends to multiple classes.

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
(1)
$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
(2)

where FP is regarded as a type-1 error and FN as a type-2 error. An increase in FP decreases precision, whereas an increase in FN decreases recall. To take both measures into account, the F-score, the weighted harmonic mean of precision and recall, is widely used.

Table 3 A confusion matrix for two class problem
$$\text{F-score} = \left(1 + \beta^{2}\right)\frac{\text{Precision} \times \text{Recall}}{\beta^{2} \times \text{Precision} + \text{Recall}}$$
(3)
$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FN} + \text{FP} + \text{TN}}$$
(4)
$$\text{Error-rate} = \frac{\text{FP} + \text{FN}}{\text{TP} + \text{FN} + \text{FP} + \text{TN}}$$
(5)

Equation 3 allows researchers to weight precision and recall according to their information needs. For \(\beta = 1\), the equation gives equal weight to precision and recall and is usually termed the F-measure, balanced F-score or F1-score. Accuracy represents the ratio of correct results to the total results generated by the system, as shown in Eq. 4. Error-rate represents the ratio of incorrect results produced by the algorithm to the total results, as shown in Eq. 5.
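The following small helper, written as an illustrative sketch rather than taken from any surveyed system, computes Eqs. 1–5 directly from the counts of a binary confusion matrix.

```python
# Illustrative computation of Eqs. 1-5 from binary confusion-matrix counts.
def evaluation_metrics(tp, fp, fn, tn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    error_rate = (fp + fn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy, error_rate

# Example counts (hypothetical): 80 TP, 10 FP, 20 FN, 90 TN.
print(evaluation_metrics(80, 10, 20, 90))
```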

The “Metadata extraction” and “Key-insights extraction” sections briefly explain the current state of the art regarding metadata and key-insights extraction from research articles. All evaluation results reported in this study are taken from the respective research articles and are presented in percentages. The F-score reported throughout the study is the balanced F-score. Some studies report both token-level and field-level evaluation measures. Token-level measures are based on the number of individual word tokens that are correctly assigned to their label class. Field-level scores are based on the number of fields that are classified correctly as a whole, where a field can contain multiple tokens; thus, field-level scores give no partial credit for a subset of correct token-level predictions. In all tables of evaluation measures, Precision, Recall, F-measure and Accuracy are abbreviated as Prec., Rec., F1 and Acc. respectively.
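The difference between the two granularities can be made concrete with a small hypothetical example: a three-token author field in which one token is mis-labelled receives partial credit at the token level but no credit at the field level.

```python
# Hypothetical example: token-level vs. field-level scoring for one "author"
# field consisting of three tokens, one of which is mis-labelled.
gold = ["author", "author", "author"]
pred = ["author", "author", "title"]

token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # 2/3: partial credit
field_correct = int(gold == pred)                                     # 0: whole field must match
print(token_accuracy, field_correct)
```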

Metadata extraction

NISO (2004) broadly classifies metadata into three types: descriptive, structural and administrative. Descriptive metadata, such as title, author and keywords, is used for discovery and identification, which in turn supports finding and searching tasks. Structural metadata helps in determining how a paper is organized; for example, an outline of a paper gives insight into its structure. Administrative metadata provides information for resource management, such as file type and creation date.

In the context of research articles, metadata is usually descriptive in nature and holds great importance. It provides a brief overview of a scientific article through information such as its title, authors and bibliography. Researchers therefore tend to decide the relevance of a paper to their domain of interest based on metadata such as the title, abstract, references, authors, citing articles and affiliations. Digital research repositories also make use of metadata to support literature acquisition for the research community: they aid researchers with intelligent search tools that filter by keywords, authors, organizations, publication venues and other metadata fields. In addition, this information can be used to recommend articles (Haruna et al. 2017; Knoth et al. 2017).

Further, by extracting citation-level metadata, one can provide statistical information about an article's citation count and popularity over time. Citation-level metadata extraction is also very useful in the domain of Scientometrics (Alam et al. 2017; Insights 2013). Table 4 presents the NEs that can be extracted from the following reference strings.

Table 4 A sample NERC task from references
REF # 1:

Ramadge, P., & Wonham, W. (1989). The control of discrete event systems. Proceedings of the IEEE, 77 (1), 81–98

REF # 2:

W. H. Enright. Improving the efficiency of matrix operations in the numerical solution of stiff ordinary differential equations. ACM Trans. Math. Softw., 4(2), 127–136, June 1978

Figure 2, on the other hand, presents a sample header-level metadata extraction task from Wang and Chai (2018). Header-level metadata extraction deals with the identification and extraction of the title, authors, affiliations, emails, publication venue, DOI, keywords, abstract and other related fields, usually from the title page of a scientific article. In the figure, the title, authors and their respective affiliations are recognized from the title page of a scientific article.

Fig. 2 Sample header metadata extraction

In the light of the above points, it is evident that metadata extraction carries great importance for many research-oriented tasks. Given the wide variety of reporting styles across journals, conferences and technical reports, and the wide variety of citation formats, both header-level and citation-level metadata extraction are quite challenging. In the remainder of this section, the major datasets for metadata extraction are discussed first, followed by the approaches widely used to solve this problem.

Datasets

There are three widely used datasets, namely CORA, FLUX-CiM and UMASS, developed in 1999–2000, 2007 and 2013 respectively. The CORA dataset is split into two parts: one focuses on document header metadata, the other on metadata extraction from citation strings. The other two datasets also focus on metadata extraction from citations.

The CORA dataset consists of computer science articles. The widely used CORA-Header dataset for document header metadata extraction was presented in Seymore et al. (1999). It has fifteen (15) fields, explained in Table 5, and comprises 935 records in total, with 500 training records and 435 testing records. The CORA-Reference dataset (McCallum et al. 2000) contains 500 references in total, of which 350 are usually used for training and the remaining 150 for testing; it contains thirteen (13) fields. Tables 5 and 6 compile the attributes of the CORA header and reference datasets respectively.

Table 5 Information against CORA Header dataset
Table 6 Information against CORA reference dataset

The FLUX-CiM dataset consists of articles from several domains: Computer Science (CS), Health Sciences (HS) and Social Sciences (SS). The CS dataset contains 300 reference strings, each segmented into ten fields. The HS dataset contains 2000 reference strings, developed from PubMed Central data, with each reference string segmented into six fields (Cortez et al. 2007). The SS dataset shares the same fields as the HS dataset and is constructed from the Scielo Digital Library. The mapping of entities between CORA and FLUX-CiM is presented in Table 7. FLUX-CiM differs from CORA mainly in variety, as it includes citations from HS and SS as well; on the other hand, it does not cover all the fields present in CORA.

Table 7 Mapping of CORA fields against FLUX-CiM

The UMASS dataset, consisting of bibliography information from 5000 research papers, was presented in 2013 (Anzaroot and Mccallum 2013). It consists of citations from 5000 articles on arXiv, evenly distributed across four major domains: physics, mathematics, computer science and quantitative biology. The dataset comprises a variety of formats and styles, including journal pre-prints, conference papers and technical reports. Each citation string is labeled hierarchically, with both coarse-grained and fine-grained labeled segments, presented in Tables 8 and 9 respectively.

Table 8 UMass dataset coarse-grained entities
Table 9 UMASS dataset fine-grained fields

Approaches

Over the last decades, many researchers have contributed to the domain of IE from research papers. Multiple machine learning and NLP techniques are used to extract metadata from scientific literature. The widely used techniques include rule-based systems and machine learning systems; among the machine learning techniques, Markov models, conditional random fields and support vector machines are used most frequently. The following sections describe the developments in metadata extraction for each technique.

Rule-based approaches

Rule-based systems rely on a set of predefined instructions that specify how to extract the desired information from data. In the context of metadata extraction, many researchers have used rule-based approaches based on text structure and layout. The study reported in Klink et al. (2000) uses rules that rely on textual and geometrical features. It focuses on extracting the following entities from an article's metadata: abstract-body, abstract-heading, affiliation, biography, caption, drop-cap, highlight, keyword-body, keyword-heading, membership, page-number, pseudo code, publication-info, reader-service, synopsis and text-body. The rule base can be applied to multiple domains, and the study claims reasonable results when the rules are combined with fuzzy matching. Results are evaluated on 979 journal pages from the University of Washington corpus.

Metadata extraction from research articles in PostScript format is reported in Giuffrida et al. (2000). This study employs a knowledge base of rules for various metadata fields including title, authors, affiliations, author-affiliation mapping and table of contents. The knowledge base makes use of visual and spatial knowledge, combined with fuzzy logic, to identify these metadata entities; for example, rules such as “the title is usually in a big font at the start of the text” or “the title should appear above the abstract section” are used to extract metadata. To demonstrate the effectiveness of the proposed approach, a dataset of one hundred articles is used, of which 70% are conference articles and the rest journal articles and technical reports. The accuracies of the proposed approach are reported in Table 10.
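A minimal sketch of such a layout-driven rule is shown below; the line-based input representation (text with font size and vertical position) is an assumption made here for illustration and does not reproduce the knowledge base of the cited systems.

```python
# Sketch of a layout rule in the spirit of "the title is usually in the largest
# font near the top of the first page"; the input format is an assumption.
def extract_title(lines, top_limit=200):
    """lines: dicts with 'text', 'font_size' and 'y' (distance from page top)."""
    top_region = [ln for ln in lines if ln["y"] < top_limit]
    if not top_region:
        return None
    return max(top_region, key=lambda ln: ln["font_size"])["text"]

sample_page = [
    {"text": "Journal of Hypothetical Studies, Vol. 1", "font_size": 9,  "y": 20},
    {"text": "A Survey of Metadata Extraction",         "font_size": 18, "y": 80},
    {"text": "Jane Doe, John Roe",                      "font_size": 11, "y": 120},
]
print(extract_title(sample_page))  # -> "A Survey of Metadata Extraction"
```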

Table 10 Accuracies against Post-Script (Giuffrida et al. 2000) format, OCR system (Mao et al. 2004) and Template Matching framework (Huang et al. 2006)

The study reported in Mao et al. (2004) makes use of OCR to identify the respective metadata spans. It presents a dynamic feature update system that generates and improves features, where these features are geometrical as well as contextual and include font size, font type and bounding box. The distribution of these features is computed from OCR data and saved for each journal's style. A feature generation algorithm then employs various string-matching algorithms to extract the feature vectors. Feature vectors learnt over previous issues of the same journal and/or other journals are applied to extract information from current issues, and these features are then used in a rule-based system to extract metadata. To evaluate the proposed system, the title pages of 309 medical research articles are used. These are scanned images from two medical journals, and the dataset includes various article types such as short papers and correspondences. Results are evaluated on 166 title pages of the Indian Journal of Experimental Biology and 143 pages from the Journal of Clinical Monitoring and Computing, both scanned medical journals. The experimental results show that using multiple journal issues for feature learning yields better results than using one issue. The best labeling accuracies of this study are presented in Table 10.

The study proposed in Huang et al. (2006) makes use of template matching to extract header metadata, including title, authors, authors' affiliations, abstract and keywords. By analyzing four widely used publication styles, namely Springer Lecture Notes in Computer Science (LNCS), Elsevier, ACM and IEEE JNS, the authors propose a template that can carry the various fields of these publication styles. A finite state automaton is then used to perform the template matching. Results are evaluated on 400 sampled articles from ACM, IEEE, Springer LNCS and Elsevier. As shown in Table 10, title extraction accuracy is the highest, while affiliation extraction accuracy is the lowest.

A hierarchical template-based citation metadata extraction for scholarly publications is presented in Day et al. (2007). It uses a hierarchical knowledge representation framework that extracts important concepts from natural language texts. To cover the major domain-specific constructs, the proposed framework, named INFOMAP, consists of domain-specific concepts along with related sub-concepts, relevant categories, attributes and actions. This information helps in maintaining relationships between concepts and ultimately turns the knowledge base into a taxonomy. Using this taxonomy, INFOMAP classifies citation strings into concepts and their related concepts. A powerful feature of the framework is its ability to represent and match complicated template structures. The proposed framework is evaluated on a self-generated dataset of 160,000 citations covering six major citation styles: APA, IEEE, ACM, ISR, MISQ and JMIS. Results are given in Table 11, with an overall average accuracy of 92.39%.

Table 11 Accuracies against INFOMAP (Day et al. 2007)

A template-based metadata extraction architecture is presented in Flynn et al. (2007). This work focuses on processing various types of data, including data from government agencies, laboratories and universities. PDFs containing either scanned images or text are taken as input. Data from Defense Technical Information Center (DTIC) and National Aeronautics and Space Administration (NASA) reports is used in the study. The DTIC dataset usually contains Report Document Page (RDP) forms, so the major emphasis of the proposed architecture is the processing of form-based and non-form-based data. Owing to the layout of RDPs, templates are a very suitable choice for them. For inputs containing no RDP forms, a non-form-based process first converts the input into XML format. Results against form-based inputs show high precision and recall, whereas the accuracy achieved for non-form-based metadata extraction is 66% and 64% for DTIC and NASA reports respectively.

An unsupervised system for metadata extraction named FLUX-CiM is proposed in Cortez et al. (2007, 2009). This approach differs from existing rule-based/knowledge-based systems in that it automatically creates the knowledge base from existing metadata records. To validate the approach, several datasets are constructed. The first consists of Computer Science articles and contains 300 reference strings, each segmented into ten classes: Author, Title, Journal, Date, Pages, Conference (Book-title), Place (Location), Publisher, Number and Volume. The second consists of medical articles and contains 2000 reference strings, each segmented into six fields: Author, Title, Journal, Date, Pages and Volume (Cortez et al. 2007). Another dataset of Social Sciences articles is also constructed. Both the health sciences and social sciences datasets carry uniform citation formats; they are therefore referred to as organized and are relatively simpler to deal with. The automatic construction of the knowledge base is handled using existing data, e.g. for CORA the BibTeX entries corresponding to the training set were parsed and included in the knowledge base. The field-level Precision/Recall/F-measure of the proposed unsupervised approach on the developed dataset is presented in Table 12. One interesting claim of the authors, backed by experiment, is that directly adding extracted entities into the knowledge base can further improve results, as knowledge-base size affects overall performance. As future directions, the authors suggest learning implicit styling and improved matching functions to distinguish between similar entities such as author names and editor names.
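The core idea of building the knowledge base from existing records and using it to label new citation segments can be sketched as follows; the toy term counts and the simple additive scoring are illustrative assumptions and greatly simplify the matching functions used by FLUX-CiM.

```python
# Rough sketch of knowledge-base driven labelling of citation segments; the
# toy counts and scoring are assumptions, not the actual FLUX-CiM functions.
from collections import Counter, defaultdict

knowledge_base = defaultdict(Counter)
for field, terms in [("author",  ["ramadge", "wonham", "enright"]),
                     ("journal", ["proceedings", "ieee", "acm", "trans"]),
                     ("date",    ["1989", "1978"])]:
    knowledge_base[field].update(terms)   # in practice, built from parsed BibTeX records

def guess_field(segment):
    tokens = segment.lower().replace(",", " ").split()
    scores = {f: sum(counts[t] for t in tokens) for f, counts in knowledge_base.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(guess_field("Ramadge, P., & Wonham, W."))   # -> author
print(guess_field("Proceedings of the IEEE"))     # -> journal
```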

Table 12 Evaluation measures against FLUX-CIM various datasets

In addition, experiments are conducted to compare the proposed approach with CRF, which provides state-of-the-art results among statistical modeling techniques. For this comparison, the CORA dataset is used as the computer science dataset besides the self-constructed social and health sciences datasets. The F-scores of the proposed approach and CRF on the various datasets are presented in Table 13.

Table 13 F1-score against FLUX-CiM (Cortez et al. 2009), CRF and Template Extraction (TE) (Guo and Jin 2011)

Text formatting information is also used in Groza et al. (2009) to extract title, authors, sections and references from research articles in PDF format. The study first carries out a pilot study to determine habits, beliefs and opinions regarding metadata reporting in research articles. In the light of the learnt insights, heuristics and rules are then prepared that exploit formatting and font styling features. The proposed approach has two major modules, first-page content extraction and full-text content extraction: the former deals with extraction of the title, abstract and author names, while the latter extracts section information and references. Evaluation is performed on 1203 documents following the ACM or Springer LNCS format; notably, all selected articles were correctly parsed from PDF format. Results show an F-measure greater than 90% for all entities. Analyzing Springer and ACM separately, extraction on Springer LNCS outperforms ACM due to less variation. The study proposes several feature-oriented mathematical functions to extract metadata from scientific articles published in PDF format, and the authors present two major applications of the proposed system: a metadata extraction web service and a personal research assistant. The evaluation metrics of this study are reported in Table 14.

Table 14 Evaluation measures against Groza et al. (2009) and Adefowoke Ojokoh et al. (2009)

The methodology used in Adefowoke Ojokoh et al. (2009) combines keyword-based segmentation and pattern matching techniques (regular expressions) to extract general metadata such as title, table of contents and abstract from documents. The approach was tested on a dataset of forty theses using precision, recall, accuracy and F-measure; the results are presented in Table 14.

Another study in this regard, presented in Guo and Jin (2011), employs a knowledge base and template extraction. Initially, templates are constructed from the formatting of citations; a total of 576 templates are created covering various reference styles. In addition, a knowledge base carrying the names of authors, venues and publishers is populated. This knowledge base is used to determine which class a particular input element belongs to. After obtaining a preliminary idea of the possible and most likely classes of the input elements of a citation, template matching is performed using the most similar template in the light of the extracted insights. Once the elements are extracted, the metadata knowledge base is queried again to check whether it has records for the input citation; if a record exists, the results from the knowledge base are returned, as they are more accurate. Incorporating the knowledge base thus helps in improving the overall results. The proposed approach is evaluated on 97 computer science journal and conference articles from IEEE and ACM. Table 13 shows the accuracies of the extracted fields. This approach is not robust enough to handle articles with complex structures.

Another template-based approach is proposed in Chen et al. (2012). It treats a citation string as text data carrying the fields to be extracted along with delimiters. The study extracts seven attributes from a citation string: Author, Title, Venue, Volume, Issue, Page and Date; the Venue field is later post-processed to identify journal, book-title and tech-report. The proposed approach has three major modules: a canonicalization algorithm, template database construction and query processing. To identify structural elements in a citation string, a rule-based algorithm, termed the canonicalization algorithm, employs various heuristics and makes use of patterns and reserved words to retain structural information in a contextual string. This information is later used in the template-extraction module to define templates and in the query-processing module to search templates based on the structured citation. The algorithm is evaluated on three datasets, INFOMAP, CORA and FLUX-CiM; the evaluation metrics for each dataset are shown in Table 15.
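The flavour of such a canonicalization step can be conveyed with a simplified sketch that splits a citation on punctuation delimiters and flags segments containing reserved words; the reserved-word list and the splitting pattern are illustrative assumptions, not the actual rules of Chen et al. (2012).

```python
# Simplified sketch of delimiter- and reserved-word-based canonicalization; the
# reserved-word list and splitting pattern are illustrative assumptions.
import re

RESERVED = {"proceedings", "journal", "trans", "vol", "no", "pp"}

def canonicalize(citation):
    segments = [s.strip() for s in re.split(r"[.,;]", citation) if s.strip()]
    return [(seg, "reserved" if RESERVED & set(seg.lower().split()) else "content")
            for seg in segments]

for segment, kind in canonicalize(
        "Ramadge, P., & Wonham, W. (1989). The control of discrete event systems. "
        "Proceedings of the IEEE, 77 (1), 81-98"):
    print(kind, "->", segment)
```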

Table 15 Evaluation Measures against template-matching approach (Chen et al. 2012)

Rule-based systems tend to perform very well thanks to manual effort and human observation, but they have obvious disadvantages. They are less adaptable than machine-learning based systems due to their dependence on text formatting, text location and graphical attributes of text. Rule formation is itself a laborious and time-consuming task. Complexity makes rules powerful, but consequently rule processing becomes expensive, and the overall time complexity of the system rapidly increases with the number of rules, as concluded in Klink et al. (2000).

Machine-learning based approaches

The following sections compile the major approaches that employ machine learning to perform metadata extraction from scientific articles.

Hidden Markov model

The Hidden Markov Model (HMM) has strong statistical foundations, is robust in nature and is efficient to develop; its major weakness is its reliance on training data. It is widely used across many domains, including speech recognition (Juang and Rabiner 1991) and other machine learning problems. In the current domain of interest, HMM is used along with multiple state-merging options in Seymore et al. (1999). That study makes use of a distantly labeled dataset (BibTeX) to improve the accuracy of the HMM model and primarily deals with extraction of the CORA header entities. Tested on manually tagged data along with the BibTeX collection, it achieves 92.9% accuracy over all header classes, including 97.8% for Title and 97.2% for Authors. Detailed results for each field are given in Table 16.

Table 16 Accuracy against CORA dataset in Seymore et al. (1999) and McCallum et al. (2000)

HMMs are also used in the development of the CORA system proposed in McCallum et al. (2000). This system serves as an Internet portal for computer science articles, providing features such as searching and identification of metadata entities from scientific articles; the proposed approach is generic enough to apply to other Internet portals. In the developed system, one HMM identifies fields such as author, title and affiliation from the paper header, and a second HMM extracts metadata from references. With respect to HMMs for IE, the primary focus of the study is learning the parameters and transition structures from labeled and unlabeled text. The study shows that distant supervision tends to improve the results, whereas parameter estimation using forward Baum–Welch (Baum 1972) degrades performance; one primary reason may be that the Baum–Welch algorithm tends to get stuck in local maxima and is therefore sensitive to the initial parameter settings. Here, distant supervision refers to the incorporation of data annotated for another purpose, such as BibTeX, which carries marked authors for an article but does not carry all the required fields. Error rate and accuracy against the various fields are presented in Table 16.

The research study carried out in Hetzner (2008) employs an HMM by means of the Viterbi algorithm and string manipulation methods. To improve performance, separate sets of cue-words are constructed that are good indicators of the fields to be extracted; results are evaluated on the CORA dataset. A similar approach, also focused on citation metadata extraction with HMMs, is proposed in Ni and Xu (2009). It makes use of the Baum–Welch (BW) algorithm to learn the weights of HMM transitions and forms multiple states for the potential information to be extracted from a citation. This HMM-BW model is evaluated against the existing HMM model (Hetzner 2008) as well as CRF (Peng and McCallum 2006). Table 17 presents the evaluation measures of the aforementioned HMM models.

Table 17 Evaluation measures against various HMM models

Another study, proposed in Cui (2009), uses an HMM with text blocks, rather than words, as the basis of the Viterbi algorithm (Forney 1973), along with heuristics for email, phone numbers, keywords and web addresses. The fields extracted are title, author, address, affiliation, email, web, phone, date, abstract and keyword. The model is trained on 800 headers and tested on 135 headers, with precision and recall above 85% for all fields. This study is extended in Cui and Chen (2010) to improve the Viterbi algorithm in the HMM model. It exploits the observation that the transition probability between the same states within a line is far greater than across lines, and employs location-based information to further improve the results of the Viterbi model. As the existing dataset does not contain location information, a new dataset of 458 articles from VLDB conferences was constructed to train the location-based model. Table 18 presents the evaluation measures with and without the location-based heuristics.

Table 18 Evaluation measures against Cui and Chen (2010)

Tri-gram HMMs are employed in Ojokoh et al. (2011) to extract citation metadata. Twenty features, including full-stop, comma, capital letter and all-numbers, are used as the emission vocabulary to improve the model. To further improve the results, shrinkage is employed, a technique usually used to handle sparse transition data while training HMMs. Results are evaluated on the CORA and FLUX-CiM datasets. The effect of data size on the model is also studied using the FLUX-CiM dataset: with increasing data, F-score and recall tend to decrease whereas precision increases; moreover, one-third of the dataset already achieves 98% accuracy, and adding further data brings only minimal gains. The results are shown in Table 19. A comparison is also made with an existing bi-gram HMM study (Yin et al. 2004) that employed a similar idea of shrinkage but used bi-grams for network training; that study used a self-created evaluation dataset of 713 citation strings obtained from 250 scientific articles. The tri-gram model (Ojokoh et al. 2011) is also evaluated on the self-annotated data of the bi-gram model (Yin et al. 2004), referred to as the ManCreat dataset. The evaluation metrics of both the bi-gram and tri-gram models on the ManCreat dataset are presented in Table 20.

Table 19 Evaluation Measures against Ojokoh et al. (2011) on CORA and FLUX-CiM datasets
Table 20 Evaluation measures against Ojokoh et al. (2011) and Yin et al. (2004) on ManCreat dataset

An HMM computes a probability distribution over possible label sequences and then selects the best one. Its parameters are trained to maximize the joint likelihood of the training examples, which requires enumerating all possible observation sequences; as a result, long-range dependencies and interacting features cannot be represented in this model. HMMs were the pioneering statistical models applied to sequence-oriented problems and laid the foundation for improved models such as Maximum Entropy Markov Models and Conditional Random Fields.
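As background for the HMM-based studies above, the sketch below shows a compact Viterbi decoder that assigns the most likely label sequence to a sequence of observed tokens; the two-state model and its probabilities are invented for illustration and are not the parameters learned in any of the cited works.

```python
# Compact Viterbi decoding over a toy two-state HMM; all probabilities here are
# invented for illustration, not learned from CORA or any other dataset.
def viterbi(observations, states, start_p, trans_p, emit_p):
    # Each cell stores (probability of best path ending in this state, that path).
    layer = {s: (start_p[s] * emit_p[s].get(observations[0], 1e-9), [s]) for s in states}
    for obs in observations[1:]:
        new_layer = {}
        for s in states:
            prob, path = max(
                (layer[prev][0] * trans_p[prev][s] * emit_p[s].get(obs, 1e-9), layer[prev][1])
                for prev in states)
            new_layer[s] = (prob, path + [s])
        layer = new_layer
    return max(layer.values())[1]

states = ["title", "author"]
start_p = {"title": 0.8, "author": 0.2}
trans_p = {"title": {"title": 0.7, "author": 0.3}, "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"survey": 0.4, "extraction": 0.4}, "author": {"jane": 0.5, "doe": 0.4}}
print(viterbi(["survey", "extraction", "jane"], states, start_p, trans_p, emit_p))
# -> ['title', 'title', 'author']
```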

Conditional random fields

The Conditional Random Field (CRF) is a statistical model with the ability to also incorporate the effect of neighboring observations. CRFs are currently used as an alternative to HMMs in named entity recognition, pattern matching and other machine learning problems, and many researchers have applied them to IE from research papers.
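Before turning to the individual studies, the following sketch illustrates linear-chain CRF sequence labelling of citation tokens; it assumes the third-party sklearn-crfsuite package, and the single toy training sequence and hand-picked features stand in for a properly annotated corpus such as CORA.

```python
# Sketch of linear-chain CRF labelling of citation tokens; assumes the
# sklearn-crfsuite package, and the toy data stands in for a real corpus.
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sequences = [["Ramadge", "P", "1989", "The", "control", "of", "discrete", "event", "systems"]]
labels = [["author", "author", "date", "title", "title", "title", "title", "title", "title"]]

X = [[token_features(seq, i) for i in range(len(seq))] for seq in sequences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))   # predicted labels for the (training) sequence
```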

The research study reported in Peng and McCallum (2004) uses CRFs with Gaussian priors, regularization and hyperbolic priors to extract metadata fields including author, affiliation, address, note, email, date, abstract, introduction, phone, keywords, web, degree, publication number and page; CRFs are also used for citation metadata extraction. Applied to a standard benchmark dataset, this technique reduces the error in average F-score and the word error rate by 36% and 78% respectively, compared with the previous best SVM results (Han et al. 2003), with an average F-score of 93.9% and an overall accuracy of 98.3%. The study uses the CORA header and reference datasets for evaluation. An extension of this work, presented in Peng and McCallum (2006), provides a mechanism to exploit co-referent citations using CiteSeerX 2007, which reduces the error rate by 6–14% on self-annotated datasets tagged with co-reference information. The extended study also develops another dataset consisting of 450 headers that contains font information, used as a feature to improve the identification of field boundaries; the scientific articles were randomly selected from 8000 articles crawled from various Internet sources. To train the model, 300 records were used and the remaining 150 were used for testing. Table 21 shows the results on CORA and on the self-annotated dataset carrying font information.

Table 21 Evaluation Measures against Peng and McCallum (2006) against CORA and self-annotated dataset

The study presented in Yu and Fan (2007) applies CRFs to extract metadata from Chinese research papers. It uses three types of features: local features describing character specifics, layout features carrying information about word occurrence, and external features from lexicons such as family names and location names. A comparison with HMMs shows that CRFs perform better in both languages. For English, the CORA header and reference datasets are used for evaluation; for Chinese, a dataset was constructed from the China National Knowledge Infrastructure, with 600 headers and 1500 references. The same six fields are selected for the experiments in both languages. Results for the header and reference datasets in both languages are presented in Table 22.

Table 22 Evaluation Results in Yu and Fan (2007)

Another study employing CRFs for metadata extraction is presented in Councill et al. (2008). This study is a pioneering contribution in the open-source domain and provides automatic reference string extraction followed by segmentation into multiple classes. It also addresses extraction of citation context, i.e. the areas/sentences that correspond to a citing article. The framework combines CRFs and heuristics: heuristics are used primarily for extraction and identification of reference strings and citation contexts, while the CRF segments reference strings into categories. The model's performance is evaluated in experiments on the CORA, CiteSeer and FLUX-CiM datasets, where the CiteSeer dataset consists of 200 reference strings randomly sampled from the millions available in the CiteSeer system. The results on the various datasets are presented in Table 23. The proposed system is integrated into the CiteSeer system.

Table 23 Evaluation measures against Councill et al. (2008) using various datasets

The study presented in Anzaroot and Mccallum (2013) also uses CRFs, to provide baseline results on the developed UMASS dataset. It discusses the limitations of conventional CRFs in making predictions due to the Markov assumption, and future work is therefore directed towards improved CRF models. In addition, the presented dataset is to be revised and extended over time, as an increase in tagged data eventually improves the accuracy of machine learning systems. Baseline results against the fine-grained dataset, with both field-level and token-level evaluation, are presented in Table 24. Other studies focused on improving the underlying CRF models to cover more global context include Anzaroot et al. (2014) and Vilnis et al. (2015); these studies discuss citation extraction as an application of improved CRF models on the UMASS dataset.

Table 24 Base line Results against Anzaroot and Mccallum (2013) using CRF

Another approach using CRFs for information extraction makes use of the Particle Swarm Optimization algorithm (Kennedy and Eberhart 1995) to evaluate optimal values while keeping evolution in context. The approach uses an optimized version of Particle Swarm Optimization that avoids local convergence by using the iterative likelihood ratio as the stopping criterion (Shuxin et al. 2013). It improves the results of the existing CRF-based studies of Peng and McCallum (2004, 2006), with an average F-score of 93.9% and an accuracy of 98.3%. Detailed results are presented in Table 25.

Table 25 Results against Shuxin et al. (2013) using optimized Particle Swarm Optimization algorithm

Another study employing CRFs for metadata extraction is presented in Souza et al. (2014), which proposes a two-layer CRF model. It considers the first page of a research article, as it carries the potential header metadata. The first layer identifies the larger components of the article text that may contain metadata: header, title, author information, body and footnote. The header usually holds important information about the conference/journal in which the paper was published; the title class represents the title of the paper; author information contains data about the authors, such as name, affiliation and email. As the body class contains no useful data for metadata extraction, it is not processed further. Footnotes, on the other hand, usually contain information about the publisher and conference, and additional author information such as email and affiliation. Hence, a second CRF layer was created for the header, author information and footnote components; this extra layer extracts the actual metadata and allows section-specific features to be defined. Results are evaluated on 100 papers; the dataset and corpus are freely available on GitHub. Forty of these papers belong to an existing study focused on extracting structural content from papers, presented in Kan et al. (2010), which used a single-layer CRF. F1-score results for the initial 40 papers from the existing study and for all 100 papers are presented in Table 26.

Table 26 F1-scores against Souza et al. (2014) using two-layer CRF model

Another study (Cuong et al. 2015) focuses on improving conventional CRF results by introducing higher-order semi-CRFs. These models can represent transitions between variable-length segments of a sequence, giving them more power than traditional linear-chain CRFs. The proposed approach is applied to a variety of problems, including extraction of author names and authors' affiliations as well as citation metadata extraction from scientific articles. The experiments use the ParsCit dataset, with linear-chain CRFs as the baseline and first-order, second-order and third-order semi-Markov CRFs. The results, shown in Table 27, indicate that second-order semi-CRFs give better results than the rest.

Table 27 F1-scores against Cuong et al. (2015)

CRFs currently give state-of-the-art results for metadata extraction tasks and largely overcome the limitations of HMMs. One potential drawback is that they are computationally very expensive. They are nevertheless the most widely used statistical model for sequence labeling tasks.

Support vector machines

The Support Vector Machine (SVM) (Cortes and Vapnik 1995) is another technique widely used in the literature for automatic metadata extraction. It is primarily a supervised learning technique generally used for classification and regression.

The research study in Han et al. (2003) used SVMs to extract structured metadata from scientific literature. It applies SVM classifiers in two major classification steps. The first is line classification, performed using word- and line-specific features including word position, line number and capitalized words; it extracts the main features that in turn help in classification. The classified lines are then passed to a second SVM classifier that performs chunk classification, applied only to multi-line data; it classifies multi-line data into the respective categories and makes use of boundary heuristics and punctuation marks. The evaluation is performed on the CORA header dataset and the results are presented in Table 28.
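A rough sketch of the line-classification stage is given below using scikit-learn; the bag-of-words features and the four example header lines are illustrative assumptions and are far simpler than the word- and line-specific features of Han et al. (2003).

```python
# Rough sketch of SVM-based header line classification with scikit-learn; the
# example lines, labels and bag-of-words features are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

lines = [
    "A Survey of Metadata Extraction from Scientific Articles",
    "Jane Doe, Department of Computer Science, Example University",
    "Abstract This paper surveys approaches to metadata extraction",
    "John Roe, Institute of Information Systems",
]
labels = ["title", "author", "abstract", "author"]

vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(lines)
classifier = SVC(kernel="linear").fit(X, labels)

print(classifier.predict(vectorizer.transform(["Mary Major, School of Computing"])))
```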

Table 28 Results against Han et al. (2003) and CRIS system (Kovačević et al. 2011)

A research study proposed in Kovačević et al. (2011) makes use of SVM classifiers to extract eight metadata fields: title, authors, affiliation, address, email, abstract, keywords and publication note. The study employed SVM in a variety of configurations, comparing results when a single classifier is used for all fields against results when a separate classifier is used for each category. It also includes experiments with several classifiers, namely decision trees, k-nearest neighbors, Naïve Bayes and SVM, and concludes that the best results are achieved when eight separate SVM classifiers are used, one per category, yielding F-scores above 85% for all categories except keywords. In addition, it differs from existing techniques in that it considers the actual text of PDF files along with font and styling information, whereas the techniques proposed in Peng and McCallum (2004, 2006), Shuxin et al. (2013) and Seymore et al. (1999) make use of text only. Results are reported on a self-annotated corpus of 100 computer science articles from the domain of automatic term recognition and are shown in Table 28.
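
To make the two-stage idea of Han et al. (2003) concrete, the sketch below shows a line-classification step of the same flavor, assuming a scikit-learn pipeline; the features, training lines and labels are hypothetical stand-ins, not the original study's configuration.

```python
# A rough sketch of SVM-based line classification for header metadata,
# assuming scikit-learn; each line is represented by simple word- and
# position-based features and mapped to one metadata category.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def line_features(line, line_number):
    words = line.split()
    return {
        "line_number": line_number,
        "num_words": len(words),
        "num_capitalized": sum(w[:1].isupper() for w in words),
        "has_email": any("@" in w for w in words),
        "has_digit": any(any(c.isdigit() for c in w) for w in words),
    }

# Toy training data: (line text, line number) pairs with gold classes.
train_lines = [("Deep Parsing of Scientific Text", 0),
               ("John Doe and Jane Roe", 1),
               ("Department of CS, Some University", 2),
               ("john@uni.edu", 3)]
train_labels = ["title", "author", "affiliation", "email"]

X = [line_features(text, n) for text, n in train_lines]
clf = make_pipeline(DictVectorizer(sparse=False), LinearSVC())
clf.fit(X, train_labels)
print(clf.predict([line_features("jane@uni.edu", 3)]))  # e.g. ['email']
```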

Others

There are several studies that either use a hybrid approach to perform metadata extraction or make use of techniques that do not fall under the aforementioned sections. This sub-section compiles such studies to provide a brief overview of other ongoing advancements.

The study performed in Marinai (2009) makes use of a multi-layer perceptron to extract metadata information from PDF scientific articles by exploiting visual and layout features of the text. The proposed approach performs low-level image processing to extract graphical features from the initial pages of PDF articles, as these pages tend to carry most of the metadata. Furthermore, the DBLP indexing engine is incorporated to improve author extraction. The tool is developed using the Greenstone package development library and is focused on extraction of the document title, authors and related information. To evaluate the tool, eighty (80) articles from two conference proceedings, ICDAR and GREC, with double- and single-column formats respectively, were selected. The tool was later incorporated into the Greenstone packages, and results show that substantial work is still required to improve extraction quality.

The TeamBeam algorithm presented in Kern et al. (2012) makes use of a maximum entropy Markov model to extract metadata information from PDF articles. TeamBeam uses a variety of features and heuristics to identify metadata fields. The procedure consists of three steps: the first classifies text blocks to identify the major blocks carrying metadata; the second performs token-level classification of the text contained in those blocks; the final step extracts metadata using both the block-level and token-level classification information. The study performs extensive experimentation with three datasets: Mendeley, E-prints and PubMed. In addition, classification performance of various algorithms is presented, and a series of experiments examines the impact of increased training data on overall extraction performance. The E-prints dataset contains 2542 entries, while Mendeley and PubMed contain 20,672 and 19,581 entries respectively. The three datasets differ from each other in layout and formatting styles, and this information is exploited as the primary feature set in the proposed approach. Metadata extraction results for various fields using the TeamBeam algorithm are presented in Table 29.

Table 29 Results against TeamBeam (Kern et al. 2012)

The study carried out in Tkaczyk et al. (2015) focuses on automatic metadata extraction by means of various machine learning constructs; the resulting system is named CERMINE. It divides the task into multiple independent modules: layout analysis, content extraction, metadata classification and bibliography extraction. Layout analysis deals with character reading, page segmentation and reading-order preservation. Content extraction deals with feature extraction for zone identification, i.e. deciding whether a particular piece of text belongs to the metadata, body, bibliography or other class; using these features, an SVM classifier is trained to perform the primary zone classification. Metadata classification further divides the metadata zones into pre-determined classes such as authors, affiliations etc. by means of SVM, complemented by a rule-based approach. The final phase, bibliography extraction, has two sub-modules, namely reference string extraction and reference parsing: reference string extraction separates individual references using K-means clustering, while reference parsing extracts metadata from individual references using CRF. Various datasets are used to evaluate the individual modules, and a comparative analysis is presented against other freely available metadata extractors, including ParsCit (Councill et al. 2008), GROBID (Lopez 2009) and PDFX (Constantin et al. 2013). The compiled results show that CERMINE overall outperforms existing solutions. The results reported in Table 30 are evaluated on a dataset of selected articles from PubMed Central (PMC). This tool was the top performer in the Semantic Publishing 2015 challenge for contextual information extraction (SemPub2015 2015).

Table 30 Results against CERMINE system (Tkaczyk et al. 2015) and other existing solutions on PMC data

The study presented in An et al. (2017) makes use of a deep neural network to extract citation metadata, employing a deep learning model together with CRF. This hybrid neural-CRF approach currently gives state-of-the-art results in general information extraction tasks as well (Huang et al. 2015; Ma and Hovy 2016; Strubell et al. 2017; Lee 2017). In this study, a bi-directional LSTM (Britz 2015) is used as the deep learning model, with 100-dimensional GloVe word embeddings at the input layer that are later fine-tuned. As deep learning models require large training datasets, the model is trained on a self-generated dataset of 50,000 citations. These citations belong to various domains including computer science, physics and philosophy, with a total of twenty-four (24) fine-grained fields that closely match those of the UMASS dataset. Table 31 shows the performance of the proposed model on the UMASS dataset without any fine-tuning in the first column, followed by results achieved when the deep learning model is trained on the UMASS dataset. The final column contains the baseline results of the UMASS study, which employs CRF only.

Table 31 Results using Bi-LSTM-CRF (An et al. 2017) framework against UMASS dataset
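
The sketch below outlines a Bi-LSTM-CRF tagger of the kind discussed above, assuming PyTorch with the third-party pytorch-crf package for the CRF layer; the vocabulary size, tag set, dimensions and training loop are illustrative placeholders rather than the configuration of An et al. (2017).

```python
# A compact, illustrative Bi-LSTM-CRF tagger for citation tokens.
# Assumes PyTorch and the pytorch-crf package; all sizes are toy values.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_tags)   # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)    # transition scores + Viterbi

    def loss(self, tokens, tags):
        emissions = self.proj(self.lstm(self.emb(tokens))[0])
        return -self.crf(emissions, tags)             # negative log-likelihood

    def predict(self, tokens):
        emissions = self.proj(self.lstm(self.emb(tokens))[0])
        return self.crf.decode(emissions)             # best tag sequence per citation

# Toy usage: one citation of five token ids, tagged with five tag ids.
model = BiLSTMCRF(vocab_size=1000, num_tags=6)
tokens = torch.randint(0, 1000, (1, 5))
tags = torch.randint(0, 6, (1, 5))
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
optim.zero_grad(); model.loss(tokens, tags).backward(); optim.step()
print(model.predict(tokens))
```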

In addition to these studies, there exist multiple tools dedicated to metadata extraction from research articles, citations or both (Beel et al. 2013; Councill et al. 2008; Lopez 2009; Zahedi and Haustein 2017). Comparisons of these tools are reported in the literature (Atdağ and Labatut 2013; Granitzer et al. 2012). Currently, CERMINE and GROBID are both actively developed and outperform the other tools. A list of tools along with their primary algorithms/mechanisms and available links is provided in Table 32. According to a recent comparison of various tools including GROBID, CERMINE, ParsCit, ScienceParse and PDFSSA4MET presented in Tkaczyk et al. (2018), GROBID gives the best results, followed by CERMINE and ParsCit.

Table 32 Tools for metadata extraction

Conclusion

In the light of the literature reviewed on metadata extraction from scientific articles, a comprehensive summary is presented in Table 33. The Reference field in the table header denotes the respective research study. Type indicates which kind of information is extracted, i.e. whether the study performs header metadata extraction or citation metadata extraction. Format refers to the input format required by the proposed methodology, e.g. PDF or plain text. Approach refers to the algorithm(s) applied to perform the extraction. Features/Improvement refers to the major distinctive contributions or features incorporated in the study to improve performance. Dataset names the dataset used for evaluation. The No. field gives the total number of metadata fields extracted in the respective study. Lastly, Metric lists the evaluation measure(s) used to report results, where A, P, R, F and E represent Accuracy, Precision, Recall, F1-score and Error rate respectively.

Table 33 Summary of Metadata Extraction articles

In the light of Tables 32 and 33, it is evident that most studies employ CRF to perform metadata extraction. Initially, linear-chain CRFs were mainly used, but recent trends show the application of higher order CRFs to add flexibility and further improve results. Many studies have adopted various improvements to existing implementations: in the case of CRF, higher order Markov chains are developed to model segments of variable length, while performance gains over basic HMMs are achieved by using higher order n-grams. Other improvements include smoothing techniques, improved error functions and optimization algorithms.

Besides algorithmic improvements, studies have also reported improved performance by employing various features of the input data. Among the major features listed in Table 33, word features refer to properties of a word itself, including its content, character length and casing; line features include the number of words in a line and the total line length in characters and words; spatial features refer to the location of a particular field in the text; formatting features include font styling and font size information; external features refer to the incorporation of external lexicons; neighbor features incorporate neighborhood information through contextual words or distance; and numeric features capture whether a word is a number or an alphanumeric sequence.
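
The sketch below illustrates how these feature families can be combined into a single feature dictionary per token; the token and line dictionaries, thresholds and lexicon are hypothetical stand-ins for the output of a PDF pre-processing step.

```python
# A sketch of combining word, numeric, line, spatial, formatting, external
# and neighbour features into one dictionary per token. Input structures
# are invented for illustration.
def build_features(token, line, name_lexicon):
    word = token["text"]
    words = line["words"]
    i = token["index"]
    return {
        # word features
        "lower": word.lower(),
        "length": len(word),
        "init_cap": word[:1].isupper(),
        # numeric features
        "is_digit": word.isdigit(),
        "is_alphanumeric": word.isalnum() and not word.isalpha() and not word.isdigit(),
        # line features
        "line_word_count": len(words),
        "line_char_length": sum(len(w) for w in words),
        # spatial features (coarse vertical position on the page)
        "page_region": "header" if line["y"] < 150 else "body",
        # formatting features
        "font_size": token["font_size"],
        "bold": token["bold"],
        # external lexicon feature
        "in_name_lexicon": word.lower() in name_lexicon,
        # neighbour features
        "prev_word": words[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
    }

line = {"y": 80, "words": ["Jane", "Roe"]}
token = {"text": "Jane", "index": 0, "font_size": 11, "bold": False}
print(build_features(token, line, {"jane", "john"}))
```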

Among the primary challenges in metadata extraction is the information loss that occurs while converting input from one format to another. Many PDF-to-text conversion libraries introduce errors during conversion, and these errors affect the performance of the extraction task, as described in Kern et al. (2012). Pre-processing techniques for transforming scientific literature from PDF to text are not covered in this study, although they are a crucial part of all studies dealing with the PDF format. On the other hand, studies employing OCR to identify blocks from the visual format tend to perform very well and usually exploit layout and font styling information to improve results.

In addition to the various tools and research studies, the semantic publishing challenge (ceurws/lod 2014) has been introduced, which deals with the extraction of various types of insights from scholarly data, including quality analysis, metadata extraction and interlinking of information. A recent study (Dimou et al. 2017) compares several semantic publishing challenges and analyzes current trends across them. It further consolidates insights gained from conducting these challenges and aims to improve the quality of organized challenges and workshops by applying lessons learnt from previous editions, including feedback incorporation, dataset updates and evolution of tasks.

As the scientific community has been contributing to this domain for many years, there now exists a variety of open-source platforms that assist in automatically extracting this information from scientific articles. These systems currently suffer from layout and formatting issues, primarily due to format conversion. A recent comparison study (Tkaczyk et al. 2018) shows that, among the various open-source extractors, GROBID, CERMINE and ParsCit present the best results.

Key-insights extraction

In the scope of the current paper, key-insights refer to any valuable information enclosed within a research paper's text that can be beneficial for researchers, i.e. the potential information nuggets contained in a scientific article. In the literature, a wide range of terminology refers to similar concepts: Augenstein et al. (2017) regard the task of key-insights extraction as information extraction, while QasemiZadeh and Schumann (2016) call a similar concept term recognition and classification. Other names include typed entity recognition, entity recognition, entity extraction, core scientific concepts and argumentative zoning (Liakata et al. 2010; Tateisi et al. 2016). Examples of key-insights include the underlying methodology or technique used, evaluation criteria, results, future work and limitations. These insights, if automatically retrieved, provide a researcher with a clear and concise picture of a research paper, which can be very fruitful for researchers who have to go through a large number of papers to understand what is going on in their research domain. Table 34 presents the key-insights extracted from the following passage, taken from Nasar and Jaffry (2018). Sentence-level insights are color-coded within the passage, where red represents Aim, green represents Goal and blue represents Extension.

Table 34 Phrase-level key-insights

Decisions and beliefs of human beings about surroundings and their environments are affected by their trust on other agents they are communicating with. Hence, in this study, primary aim is to extend computational model of SA presented in [2] to trust-based SA using ABM and PBM techniques. Keeping this in view, key goal of current research is to analyze the proposed model with both computational modeling paradigms i.e. ABM and PBM, along with a comparative analysis on the basis of their dynamics. Rest of the paper describes related background, outlines methodology opted to build the system that is an extension to a previous model proposed in [2], briefly explains the conducted experiments and respective results, followed by conclusion and future directions.

If such information is automatically extracted from scientific articles, it can aid a variety of applications including automated literature review, trend analysis and personalized research assistance. Thus, the rest of this section presents progress in this area: it first highlights the major datasets available and then the state-of-the-art approaches employed to perform key-insights extraction from scientific articles.

Datasets

Datasets for key-insights extraction can be classified into two major classes: sentence-level and phrase-level. Multiple datasets exist for sentence-level key-insights extraction, but the majority of the work belongs to the domain of medical sciences. In addition, two types of insights are annotated. The first concerns potential named entities, i.e. concepts such as domain, results, technique etc. The second concerns relations between entities: for example, when a technique or algorithm is applied to solve a particular task, a relation of application between a TECHNIQUE and a TASK can be established, namely Apply(TECHNIQUE, TASK); similarly, results achieved against various evaluation measures can be expressed as relations, e.g. Result(F-measure, 98). Very few studies, however, focus on relation extraction between entities in scientific articles. As relations are usually expressed between core concepts, phrase-level datasets can be extended with relation information, whereas sentence-level datasets cannot be used for this purpose because a sentence itself is composed of multiple entities.

Teufel and Moens (2002) applied the concept of argumentative zoning to summarize scientific articles. This annotation scheme was further extended in Teufel et al. (2009) with improved granularity: all existing tags except TEXTUAL are further divided into multiple categories, so the improved scheme contains a total of fifteen rhetorical classes for sentence-level key-insights extraction across full-length articles. The scheme is used to annotate articles from the chemistry and computational linguistics domains. Results show that this annotation scheme can also be used by non-experts; this was established by having an expert, a semi-expert and a non-expert annotate articles and then calculating the agreement between them.

Research on extracting sentence-level insights from the full text of articles was performed as part of the ART project (Liakata 2009). This project formed the basis of a semantic annotation project (Liakata 2010) focused on the semantic annotation of scientific articles, whose applications have been studied in the domains of life sciences and cancer research (Guo et al. 2011; Liakata et al. 2012). Another notable full-length, sentence-level key-insights dataset is the Dr. Inventor framework (Ronzano and Saggion 2015), which contains a total of forty articles from the computer graphics domain only. In addition to these full-length article sets, many studies on sentence-level key-insights extraction focus on abstracts.

A recent and diverse study in this regard is Multi-label Argumentative Zoning for English Abstracts (MAZEA) (Dayrell et al. 2012). The study used a total of 645 abstracts from Physical Sciences and Engineering (PE) and 690 abstracts from Life and Health Sciences (LH). Existing datasets for sentence-level key-insights tend to assign a sentence to a single category; the primary contribution of this study is that it allows multiple labels to be assigned to a single sentence. The respective dataset is publicly available. Widely used sentence-level annotation schemes applied to both abstract-only and full-length articles are presented in Table 35.

Table 35 Annotation tags against Sentence-level Datasets

As far as entity-level datasets are concerned, progress in this direction has been made recently. The pioneering study in this regard (Gupta and Manning 2011) comprises 475 abstracts from the ACL anthology. Another project, Term Entity Recognition (QasemiZadeh and Schumann 2016), is intended to perform task and entity recognition on the ACL anthology corpus; its dataset (Handschuh and QasemiZadeh 2014) consists of three hundred annotated abstracts from the ACL paper collection, with publication years ranging from 1965 to 2006.

The Entity and Relation Extraction project (Tateisi et al. 2016) focuses on phrase-level entity extraction from Japanese as well as English scientific articles. For English, it uses a total of 400 abstracts, of which 250 belong to the ACL anthology corpus and the remaining 150 to the ACM digital library; out of the 250 ACL abstracts, 100 are randomly selected from the Gupta-Manning dataset (Gupta and Manning 2011). Entities used in this project are inspired by the Information Artifact Ontology (IAO) (IAO 2015). The study further extends the dataset by annotating relation information, with a total of twenty distinct relations annotated in the underlying dataset. A base study on this dataset was carried out in Tateisi et al. (2014) with three primitive entities; it focused only on Japanese articles and dealt with sixteen distinct relation types.

The ScienceIE project was organized as part of Semantic Evaluation (SemEval) in 2017, where SemEval is an ongoing series of evaluations of computational semantic analysis systems, usually held on a yearly basis. The ScienceIE project (Augenstein et al. 2017) is a collaborative effort among various universities, focused on the annotation of scientific articles from three major domains: material sciences, physical sciences and computer sciences. The data consist of 500 passages selected from open-access scientific publications available in the ScienceDirect research repository. The annotated dataset includes three entity types, namely Task, Process and Material, as well as two primitive relations, "synonym-of" and "hyponym-of".

The "synonym-of" relation is used to deal with abbreviations. For example, in the sentence "This study is related to Information Extraction (IE)", if "Information Extraction" is assigned a class, a "synonym-of" relation should be expressed between "Information Extraction" and "IE"; this helps in linking different mentions of the same concept. The "hyponym-of" relation describes a hierarchy of concepts. For example, in the sentence "Apple is a fruit", apple is a hyponym of fruit; similarly, in the context of a scientific article, the sentence "NER is a sub-task of IE" makes NER a hyponym of IE.
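
The sketch below shows one simple way such annotations can be represented programmatically; the tuple layout and identifiers are hypothetical and merely mirror the entity and relation types named above.

```python
# A small, illustrative representation of ScienceIE-style entities and
# relations; labels and character offsets refer to the toy sentence below.
from collections import namedtuple

Entity = namedtuple("Entity", "id label start end text")
Relation = namedtuple("Relation", "label arg1 arg2")

sentence = "NER is a sub-task of IE"
entities = [
    Entity("T1", "Task", 0, 3, "NER"),
    Entity("T2", "Task", 21, 23, "IE"),
]
relations = [Relation("hyponym-of", "T1", "T2")]  # NER is a kind of IE

lookup = {e.id: e.text for e in entities}
for rel in relations:
    print(f"{lookup[rel.arg1]} --{rel.label}--> {lookup[rel.arg2]}")
```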

Another ongoing effort in the direction of phrase-level key-insights is the project of the Information Retrieval Group at Iowa State University (Projects | ISU Information Retrieval Group 2017), which concerns automatic extraction of information from scientific articles with a primary focus on animal studies. Some phrase-level datasets, along with the entities they cover and descriptions of these entities, are presented in Table 36.

Table 36 Annotation tags against Phrase-level Datasets

Most of these datasets have been developed recently, so there is as yet no substantial progress in applying algorithms to them. It is worth noting that, in the domain of biology, multiple resources and databases help in identifying genes, proteins, diseases etc., and a variety of datasets focus on the annotation of bio-centered entities such as gene–gene interactions and protein identification. Consequently, multiple studies focus on biology-oriented information extraction from scientific articles (Friedman et al. 2001; Hirschman et al. 2005; Li et al. 2015) exploiting this available information. The focus of the current review is on general phrase-level insights that are applicable and useful across domains, such as Problem, Domain, Process and Result; hence, studies on bio-specific information extraction are not included.

Approaches

Over the past years, many researchers have contributed to the domain of information extraction from research papers, using multiple machine learning and NLP techniques to extract key-insights from scientific literature. For sentence-level key-insights extraction, many studies use rule-based approaches, and many machine learning approaches are also applied, including Bayesian classifiers, CRFs and SVMs. Owing to the unavailability of benchmark datasets for phrase-level insights in past years, there has been little development in that regard; the majority of approaches for phrase-level insights extraction use rule-based methods and CRFs on self-generated datasets.

Rule-based approaches

A research study carried out in Hanyurwimfura et al. (2012) takes into account the abstract and conclusion text along with some assumptions regarding the position of sentences within these two sections. It relies mainly on a rule-based approach; examples of the heuristics used include cue words such as 'results', 'experiments' and 'evaluation' to indicate the results of a research article, and phrases such as 'this paper' and 'our approach' to indicate its main idea. The title of the study as well as its authors are also extracted using simple heuristics. The experiment was conducted on 200 papers in groups of 40, resulting in 89.4% precision and 91.2% recall. In addition, a survey was conducted on 20 papers, in which 20 readers manually evaluated the extracted information; it received an average rating of 7.75 on a scale of 0–10, with 10 being the highest.
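
The sketch below gives a minimal flavor of such cue-phrase heuristics in Python; the cue lists are illustrative and are not the exact rules of Hanyurwimfura et al. (2012).

```python
# A minimal rule-based sentence classifier using cue phrases.
import re

CUES = {
    "result": re.compile(r"\b(results?|experiments?|evaluation)\b", re.I),
    "main_idea": re.compile(r"\b(this paper|our approach|we propose)\b", re.I),
}

def classify_sentence(sentence):
    hits = [label for label, pattern in CUES.items() if pattern.search(sentence)]
    return hits or ["other"]

abstract = [
    "This paper presents a rule-based extractor for research articles.",
    "Experiments on 200 papers show high precision and recall.",
]
for s in abstract:
    print(classify_sentence(s), "->", s)
```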

Another study in this regard (Gupta and Manning 2011) extracts focus and technique along with domain from scientific articles. Pattern matching and dependency trees of sentences are used, along with seed rules, to identify focus, technique and domain, and more patterns are subsequently identified using a bootstrapping approach. After extraction of the focus, technique and domain concepts, LDA clustering (Blei et al. 2001) is performed to find topics. The ACL anthology dataset (Bird et al. 2008) is used for evaluation; four hundred and seventy-four abstracts were hand-labeled for testing, and the approach resulted in high recall but low precision.

The research study proposed in Houngb and Mercer (2012) primarily focuses on technique extraction from biology journals. Initially, phrases containing method-mention terms such as algorithm, technique and method are extracted, and rules are formulated to extract such sentences from the text and identify the respective techniques used. Machine learning is also employed, using word, POS, word-shape (capitalized, starts with a capital letter, all lower case, all upper case, mixed case), word-position (start of sentence, end of sentence, not at beginning, not at end), token prefixes, token suffixes and bigrams as features for a CRF. Results are evaluated on two self-generated datasets: the first (dataset 1) explicitly mentions the method and consists of 918 sentences, whereas the second (dataset 2) consists of 211 sentences and does not contain a method keyword. Each dataset contains pairs of sentences for every entry, where the first sentence carries the method while the other carries its potential usage. These sentences are tokenized and converted into the BIO tagging format for phrase-level method-mention extraction. The rule-based system, evaluated on dataset 1, achieves precision/recall/F-measure of 85.40/100/91.89, while the CRF-based machine learning system, evaluated on dataset 2, achieves 81.8/75.00/78.26.
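
The sketch below illustrates the BIO conversion step mentioned above: given a sentence and the span of a method mention, each token receives a B-METHOD, I-METHOD or O tag; whitespace tokenization is a simplification of the original pipeline.

```python
# Convert a sentence plus a known method mention into BIO-tagged tokens.
def to_bio(sentence, mention):
    tokens = sentence.split()
    mention_tokens = mention.split()
    tags = ["O"] * len(tokens)
    for i in range(len(tokens) - len(mention_tokens) + 1):
        if tokens[i:i + len(mention_tokens)] == mention_tokens:
            tags[i] = "B-METHOD"
            for j in range(1, len(mention_tokens)):
                tags[i + j] = "I-METHOD"
            break
    return list(zip(tokens, tags))

print(to_bio("We align the sequences using the Smith-Waterman algorithm",
             "Smith-Waterman algorithm"))
```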

Machine-learning based approaches

The following section compiles the major approaches that employ machine learning to perform key-insights extraction from scientific articles.

Naïve Bayes

The pioneering study on sentence-based classification is presented in Teufel and Moens (2002). It uses Naïve Bayes classification to classify sentences into categories such as aim, contrast, basis and background. To evaluate the system, a total of eighty conference articles from the computational linguistics domain are annotated. Two types of evaluation are performed: one deals with rhetorical classification using Naïve Bayes, while the other is a relevance-based evaluation that measures how relevant, according to human judges, the extracted results are. Tags, their descriptions and the respective evaluation measures are presented in Table 37.

Table 37 Evaluation measure using NB in Teufel and Moens (2002)

A sentence-level key-insights extraction study in the medical sciences is proposed in Ruch et al. (2007). It uses a Naïve Bayes classifier to classify abstract sentences into four categories: purpose, methods, results and conclusion. Results show an F-score of 85. The dataset used for evaluation comprises 12,000 abstracts from MEDLINE that carry implicit tags for these four categories.

To extract domains from research articles, the study presented in Lakhanpal et al. (2015) makes use of preposition disambiguation, relying on rules based on the prepositions in a sentence. Following these rules, phrases are identified and later classified using Naïve Bayes classification. Results show 90% precision and 91% recall when applied to ACM SIGKDD (1995) papers from 2010–2014.
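
The sketch below shows Naïve Bayes sentence classification of the kind used in these studies, assuming scikit-learn; the training sentences and labels are toy placeholders rather than any of the annotated corpora described above.

```python
# A compact Naive Bayes classifier over sentences, using bag-of-words and
# bigram counts as features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = [
    "The aim of this paper is to classify rhetorical zones.",
    "In contrast to previous work, we use full articles.",
    "Our approach builds on the annotation scheme of earlier studies.",
    "Prior research has focused mainly on abstracts.",
]
labels = ["aim", "contrast", "basis", "background"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(sentences, labels)
print(model.predict(["This paper aims to extract key insights."]))  # e.g. ['aim']
```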

Hidden Markov model

The study carried out in Lin et al. (2006) uses an HMM to assign rhetorical categories to sentences. The study focuses on medical abstracts, which generally follow the pattern of Introduction, Method, Result and Conclusion. Latent Discriminative Analysis (LDA) is also employed to further improve performance. Multiple experiments are performed, with the HMM combined with LDA performing best on abstracts selected from MEDLINE. The evaluation measures for the best approach are presented in Table 38.

Table 38 Evaluation measures against Lin et al. (2006)

Another study employing HMM (Wu et al. 2006) extracts move structures, which refer to categories of functional roles such as Background, Purpose, Method, Result and Conclusion. A total of 709 sentences belonging to 106 abstracts from CiteSeer are tagged. The study exploits move constructs and collocation information to improve the HMM model, and the approach achieves a best precision of 80.54.
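
The sketch below shows, in miniature, how an HMM assigns rhetorical states to a sequence of sentences via Viterbi decoding; the transition and emission values are invented for illustration and are not estimated from the corpora used in the studies above.

```python
# A toy Viterbi decoder over four rhetorical states; probabilities are
# invented placeholders rather than corpus-estimated parameters.
import numpy as np

states = ["INTRO", "METHOD", "RESULT", "CONCL"]
start = np.log([0.85, 0.05, 0.05, 0.05])
trans = np.log([[0.5, 0.4, 0.05, 0.05],     # INTRO ->
                [0.05, 0.5, 0.4, 0.05],     # METHOD ->
                [0.05, 0.05, 0.5, 0.4],     # RESULT ->
                [0.05, 0.05, 0.05, 0.85]])  # CONCL ->
# Emission log-probabilities of each observed sentence under each state,
# e.g. produced by a per-state language model over the sentence's words.
emis = np.log([[0.7, 0.1, 0.1, 0.1],
               [0.1, 0.7, 0.1, 0.1],
               [0.1, 0.1, 0.7, 0.1],
               [0.1, 0.1, 0.1, 0.7]])

def viterbi(start, trans, emis):
    n_obs, n_states = emis.shape
    score = start + emis[0]
    back = np.zeros((n_obs, n_states), dtype=int)
    for t in range(1, n_obs):
        cand = score[:, None] + trans + emis[t]   # cand[i, j]: prev i -> cur j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n_obs - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(start, trans, emis))  # e.g. ['INTRO', 'METHOD', 'RESULT', 'CONCL']
```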

Conditional random fields

The work presented in Hirohata et al. (2008) focuses on the extraction of section-related information from article abstracts. It uses CRFs to assign abstract sentences to major sections: Objective, Method, Result and Conclusion. To develop the model, a corpus of 51,000 abstracts is compiled, consisting of abstracts that carry exactly these four section labels. The proposed method achieved 95.5% per-sentence accuracy and 68.8% per-abstract accuracy.

A research study proposed in Kondo et al. (2009) analyzes research paper titles to identify the underlying technique and research field of the respective paper. To extract the desired fields, cue words are first identified using rule-based approaches and then searched for in paper titles; this helps in identifying the paper's goal, underlying methodology and major topic or research field. A CRF is then used, with the word's POS and indicators of a word being a method, goal or head word as features, to classify the identified words into their respective classes. Experiments were performed on Japanese and English literature, resulting in 82.5% precision and 81.6% recall for Japanese research papers and 73.5% precision and 78% recall for English literature.

The study presented in Lin et al. (2010) also uses CRF to extract metadata as well as key-insights from medical articles. The metadata, regarded in this study as formulaic author metadata, includes author name, email and institution. For key-insights, the study extracts entities in the full text that convey information about the nature of the study. For training and subsequent evaluation, a gold set is prepared by annotating 185 open-access PubMed articles. This article set covers studies published from 2008 to 2009 and strictly consists of research articles, excluding reviews, case studies, editorials and perspectives. Annotators were provided with a Rich Text Format (RTF) file, generated by processing the HTML version of each article, along with basic annotation guidelines. Results show that CRF is very effective in determining formulaic author metadata, with an average F-score of 89.9%, whereas key-insights extraction shows relatively poor performance with a 26.1% F-measure, as shown in Table 39.

Table 39 Evaluation measures against Lin et al. (2010)

Another study that performs both sentence-level and phrase-level KIE from articles is carried out in Kovačević et al. (2012). It makes use of various features to perform extraction. First, sentence-level extraction is performed using an annotation scheme and categories similar to those of Teufel and Moens (2002). After this primary classification, sentences of the OWN category are further sub-divided into results, solution and other categories, and the sentences of the solution category are then annotated to extract phrase-level concepts including method, task, tools and resources; this classification is performed by means of CRF. The evaluation metrics for these insights are presented in Table 40. The study experimented rigorously with various features, and the results show that all categories except resources perform best when all features, namely lexical, syntactic, citation and frequency features, are incorporated.

Table 40 Evaluation metrics against Kovačević et al. (2012)

Support vector machines

A relevant study focused on sentence extraction from scientific articles is presented in Guo et al. (2010). It performs a comparative analysis of three annotation schemes for sentence-level key-insights extraction: section names (Hirohata et al. 2008), argumentative zones (AZ) and core scientific concepts (CoreSC). The latter two schemes are associated with the ART project (Liakata 2009), a pioneering project for sentence-level key-insights extraction from full-text scientific articles in the medical sciences. The proposed approach uses Naïve Bayes and SVM classifiers to perform IE from abstracts only. Results show that SVM outperforms the Naïve Bayes classifier, as shown in Table 41.

Table 41 F-measures against various annotation schemes in Guo et al. (2010)

An SVM-based solution to extract sentence-level key-insights is presented in Ronzano and Saggion (2015). A linear kernel was used for training. As data, 40 computer graphics papers from the Dr. Inventor Rhetorically Annotated Corpus (Fisas et al. 2015), containing a total of 8877 sentences, were used; the annotation categories for sentences are almost the same as those of the ART project. All sentences of the corpus have been manually annotated by three annotators, with an inter-annotator agreement of 65.67%. The proposed SVM model takes into account both lexical and syntactic features to model each sentence, and the Java-based machine learning library Weka 2.0 is used to perform all tasks related to rhetorical sentence classification. The model achieved an F1-score of 76.4 under tenfold cross-validation.
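
The sketch below shows SVM-based rhetorical sentence classification with cross-validated evaluation, assuming scikit-learn; plain TF-IDF unigrams stand in for the richer lexical and syntactic features used by Ronzano and Saggion (2015), and threefold rather than tenfold cross-validation is used only because of the tiny toy corpus.

```python
# SVM sentence classification with cross-validated macro F1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

sentences = [
    "We propose a novel kernel for sentence classification.",
    "Our method extends support vector machines with tree kernels.",
    "The proposed approach uses a linear kernel over lexical features.",
    "Results show a clear improvement over the baseline.",
    "The experiments confirm higher F1 scores on the test corpus.",
    "Evaluation demonstrates the benefit of syntactic features.",
]
labels = ["approach", "approach", "approach", "outcome", "outcome", "outcome"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(model, sentences, labels, cv=3, scoring="f1_macro")
print(scores.mean())
```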

Others

There are several studies that either use a hybrid approach to perform key-insights extraction or make use of techniques that do not fall under the above sections. This sub-section highlights such studies.

The baseline results for the Typed Entity and Relation Extraction project (Tateisi et al. 2016) are calculated using the joint modeling approach presented in Miwa and Sasaki (2014). This approach uses a table to represent entities and relations jointly: cells are filled using a history-based approach in which every cell is assigned a label. To map the problem onto a sequence, the table is first transformed into a one-dimensional form using a static ordering, and preceding cell assignments are taken into account when labeling subsequent cells in order to avoid illegal assignments. A margin-based structured learning approach is used to learn the weights, and multiple training algorithms are employed, including Perceptron, AdaGrad and SVM (Chang and Yih 2013; Collins 2002; Duchi et al. 2011; Mejer and Crammer 2010); these weights guide the mapping of entities and relations into the table. The dataset contains a total of 400 articles from ACM and ACL, of which 100 belong to the Gupta-Manning dataset (Gupta and Manning 2011). Results of 10-fold cross-validation on 250 randomly selected articles excluding Gupta-Manning, as well as results on the Gupta-Manning dataset only, are reported in Table 42. The annotated dataset of Japanese and English scientific articles is publicly available.

Table 42 Results against Tateisi et al. (2016)
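
The sketch below illustrates the table representation described above, with entity labels on the diagonal and relation labels off the diagonal, flattened by a static ordering; the labels and ordering are illustrative, not the exact scheme of Miwa and Sasaki (2014).

```python
# A toy entity/relation table: diagonal cells hold entity labels,
# off-diagonal cells hold relation labels, and the upper triangle is
# flattened into one decoding sequence by a fixed (static) ordering.
tokens = ["CRF", "is", "applied", "to", "segmentation"]
n = len(tokens)

table = [["-" for _ in range(n)] for _ in range(n)]
table[0][0] = "TECHNIQUE"
table[4][4] = "TASK"
table[0][4] = "APPLY"          # Apply(TECHNIQUE, TASK)

# Static ordering: walk the upper triangle row by row so every cell has a
# fixed position in the one-dimensional assignment sequence.
sequence = [((i, j), table[i][j]) for i in range(n) for j in range(i, n)]
for (i, j), label in sequence:
    if label != "-":
        print(f"cell({i},{j}) -> {label}")
```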

The ScienceIE project was conducted as a SemEval task in 2017, and a competition was held on the developed dataset. The competition has three evaluation scenarios: the first focuses on information extraction when only the plain text of a scientific article is provided; the second provides key-phrases in addition to the plain text; and the third provides partial information regarding key-phrases along with their rhetorical class, i.e. Task, Process and Material. Various groups participated in the competition. Hybrid models of recurrent neural networks with CRF performed best in the first scenario, with a maximum F-measure of 43; a lexical-feature-based SVM model achieved the maximum F-measure of 64 in the second scenario; and a convolutional neural network based approach performed best in the third scenario, with an F-measure of 64. Details of the overall evaluation and sub-tasks involved, along with the dataset, are publicly available.

Conclusion

In the light of the literature reviewed on KIE from scientific articles, a comprehensive summary is presented in Table 43. The Reference entry in the table header denotes the respective research study. Level indicates which type of information is extracted, i.e. phrase-level (Phr) or sentence-level (Sen). Origin refers to the sections of an article used as data, including abstract (abs), conclusion (Con), keywords (KW), full article (FA) etc. Approach refers to the algorithm(s) applied to perform the extraction. Domain refers to the area of study selected for evaluation, e.g. computer science, physical sciences etc. Size refers to the total number of articles/abstracts included in a study; exceptions marked with an asterisk (*) give the number of sentences instead. Entities indicate the types of key-insights extracted in a study. Lastly, Metric lists the evaluation measure(s) used to report results, where A, P, R, F and SE represent Accuracy, Precision, Recall, F1-score and Subjective Evaluation respectively. In the Domain column, BL, CL, CS, HS, PS, MS, MeS, BM and CV represent biology, computational linguistics, computer science, health sciences, physical sciences, material sciences, medical sciences, biomedicine and computer vision respectively.

Table 43 Summary against key-insights extraction from scientific articles

In the light of Table 43, it is evident that the majority of work has been reported on abstracts only. The primary reason for the limited number of studies on full-text articles may be the complexity of the annotation task: as the number of entities grows, the time needed to annotate a full-text scientific article can grow dramatically. Even for abstracts, fine-grained annotation can take a lot of time, as reported in Augenstein et al. (2017), due to the subjectivity of the classes at hand; this time can be reduced by using crisp annotation guidelines. Although the "Datasets" section points to recent contributions regarding annotation guidelines and datasets for KIE, progress is yet to be made in performing phrase-level KIE on full-text scientific articles. Table 44 compiles all open-source datasets along with their details.

Table 44 Available datasets for KIE

Conclusion and future work

This study is focused on determining the state of the art regarding the potential information that can be extracted from scientific articles. Since a scientific article follows a semi-structured format, the information to be extracted is broadly classified, on the basis of this structure, into two major categories, namely metadata extraction (ME) and key-insights extraction (KIE). ME from scientific articles refers to the identification and extraction of metadata elements such as title, authors and affiliations. For ME, multiple datasets exist that vary in terms of article sources, publication venues, data size and granularity of fields. On these datasets, multiple approaches are applied, including rule-based approaches and machine learning approaches such as HMM, CRF and SVM. Among these, CRF tends to outperform the other approaches, with reported F-measures of more than 0.95. Deep learning approaches are not yet widely employed for ME, even though hybrid deep learning frameworks perform very well and currently govern the state of the art in general information extraction tasks; the application of deep learning frameworks and their hybrid versions is therefore an open area in the context of ME. Apart from the various techniques and datasets, a variety of open-source tools aid in automatic extraction of metadata entities from research articles' headers as well as bibliographies. One of the primary challenges in ME is to minimize the information loss that occurs while converting a scientific article from one format to another.

As far as KIE is concerned, insights to be extracted fall into two broad classes, namely sentence-level key-insights and phrase-level key-insights. Sentence-level KIE focuses on the classification of sentences into pre-defined categories based on the insights they carry. Widely used approaches for sentence-level KIE include rule-based approaches, Bayesian classification, SVM and CRFs. The majority of work on sentence-level KIE is based on medical studies. Although a few studies perform sentence-level KIE on full-length articles, most developments in this area are based on article abstracts only.

Phrase-level KIE, on the other hand, is focused on the extraction of phrases carrying potential information, such as Problem, Domain, Technique and Results. Most work on phrase-level KIE is reported on self-created datasets that are not publicly available, and the guidelines and inter-annotator agreements used while developing these datasets are also not reported. In addition, various other limitations were found in existing studies, including insufficient dissemination of achieved results, insufficient description of the methodology used to perform the task, ambiguity in the description of the corpus used for evaluation and a lack of cross-validation across techniques when reporting results (Houngb and Mercer 2012; Kavila and Rani 2016).

Regarding available datasets for phrase-level key-insights, researchers have been working for several years to create benchmark datasets. A wide variety of key-insights is annotated in the currently available datasets: some are specific to a domain such as computational linguistics, while others are generic and cover a variety of disciplines. One major limitation of existing phrase-level annotated datasets is that they consist of single passages only. Another challenge is the lack of crisp definitions for the various key-insights, which gives rise to subjective notions of the phrase-level key-insights identified across datasets; to minimize individual biases, the respective definitions should be crisp and clear. Hence, the primary open research task with regard to phrase-level key-insights datasets is the identification of the specific concepts or key-insights to be extracted from scientific articles. Once these are identified, the next question is to devise the criteria that determine whether a particular phrase is a key-insight; these criteria will in turn support the development of annotation guidelines, and once such guidelines exist, the next major contribution would be to prepare a dataset in their light.

Additionally, as the majority of phrase-level datasets have been developed only recently, a great deal of work is still required to efficiently extract the potential information insights and, subsequently, the relations between extracted conceptual insights. In scientific articles, a relation can express the application of a technique to solve a problem, results achieved against various evaluation measures, and so on. This information can serve multiple purposes, such as ontology construction and question answering systems; hence, dataset preparation and algorithm development for relation extraction (RE) are open research areas as well. Other open research questions include the analysis and application of state-of-the-art IE approaches on existing datasets, which would further reveal the advantages and pitfalls of existing techniques. CRF is generally regarded as the state-of-the-art statistical technique for ME, but after the identification of its limitations on one of the datasets, several research studies were carried out to address those limitations (Anzaroot et al. 2014; Vilnis et al. 2015). Similarly, for KIE, analyzing the primary reasons behind the results achieved by existing solutions, followed by ways to improve them and mitigate the identified challenges, remains an open research area.

Regarding the primary limitations of the current survey, it only covers articles focused on extracting generic insights from scientific articles; articles focused on domain-specific key-insights extraction are therefore not covered. Furthermore, pre-processing techniques applied to convert data from one format to another, as well as to generate textual, layout and formatting features, are not part of this study.