Introduction

In the last few decades, the advent of computers and later the World Wide Web (WWW) has changed human civilization dramatically. We now live in a world overloaded with data and information. This information overload poses new challenges to human intellect and thereby creates opportunities for innovation. The WWW has also affected the overall growth of scientific literature. According to a study by Price (1961), the amount of research data doubles every ten to fifteen years. Additional resources (Mudrak 2016; NSF 2018) indicate that around 2.2 million new scientific articles were published in 2016. Major reasons for this rapid growth include the increased number of publication venues, online digital libraries and the ease of acquiring scientific literature, facilities that were not available in the pre-digital age. According to a report issued by the International Association of Scientific, Technical and Medical Publishers, the number of publishing scientists increases by 4–5% annually, and as of 2014 there were around 28,100 peer-reviewed scholarly journals in English (Ware and Mabe 2015).

This increase in scientific content poses significant challenges for researchers who want to determine the state of the art in their field of interest. To perform a literature review, literature is first acquired from a variety of relevant research repositories. The acquired results are then filtered through manual analysis, and the findings from the relevant scientific articles are consolidated to determine the state of the art of the desired field. This process of performing a systematic literature review is of utmost importance for researchers, as it supports gap analysis and identifies room for innovation. At the same time, it is a very time-consuming, cumbersome and laborious task. According to one systematic literature review guideline, a quality review can take up to one year (Morin 2017). Another study reports that a systematic literature review can take up to 186 weeks with single or multiple human resources (Borah et al. 2017).

To provide researchers with basic filters, many research organizations and scientific publishers such as ACM, IEEE and Springer provide digital research repositories. These libraries offer search filters that ease querying across millions of research articles. They employ metadata extracted from scientific articles to provide various search facilities; metadata extraction therefore saves researchers' time during literature acquisition. The next step of a literature review is to read and consolidate the findings from the acquired literature. This step requires going through a bulk of scientific articles to determine the state of the art in a specific domain of interest. From a researcher's point of view, this whole process is of utmost importance but time-consuming, laborious and cumbersome.

In the light of the above points, it is evident that automated analysis of research papers will eventually aid researchers. The pertinent question is how potential information can be automatically extracted from scientific articles. A whole domain, Information Extraction (IE), is dedicated to extracting potential information nuggets from data. IE is mainly concerned with extracting structured data from unstructured or semi-structured data. It is widely used across multiple domains; for example, in medical sciences IE is applied to extract patients' information, their previous medical history, causes and respective cures (Harkema et al. 2005). IE draws on concepts and techniques from Machine Learning, Natural Language Processing (NLP), Text Mining (TM) and Information Retrieval (IR). Various research studies describe the state of the art in the domain of IE (Simoes et al. 2009; Sirsat et al. 2014).

The survey presented in Simoes et al. (2009) focuses on categorizing the IE tasks reported in the literature and the techniques used to perform them. It categorizes IE tasks into five major classes: segmentation, classification, association, normalization and co-reference resolution. Segmentation refers to splitting the data into atomic segments such as tokens. Classification deals with assigning each segment to a suitable class, called an entity. According to Simoes et al. (2009), the major techniques employed for classification include Hidden Markov Models (HMM) and Maximum Entropy Markov Models (MEMM). Association focuses on extracting relations between related entities; major algorithms used for this task include context-free grammars, MEMM and Conditional Random Fields (CRF). Normalization and co-reference resolution are less generic, as they require domain-specific information. Normalization transforms different representations of the same entity into a single canonical form, usually via human-designed conversion rules and regular expressions. Co-reference resolution identifies text fragments that refer to the same real-world entity.

Among the IE tasks mentioned in Simoes et al. (2009), the classification task is usually referred to as Named Entity Recognition and Classification (NERC). NERC is a sub-problem of IE that deals with the extraction of named entities (NEs) while taking the surrounding context into consideration: it recognizes named entities and then classifies them into rhetorical categories. It is of utmost importance for other IE, NLP and TM tasks, including relation extraction, event detection, question answering and machine translation. Table 1 presents the NEs that can be extracted from the following short paragraph.

Table 1 A sample NERC/IE task

Valencia is on her way to Wal-Mart super-store in Austin. She is asked to bring couple of coffee bags. Her nephews from Valencia are waiting for her arrival.

In this example, Valencia is a person name in the opening sentence of the paragraph, whereas in the last sentence it is a geographical location. Thus, NERC recognizes the sense of an entity based on its surrounding context.
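As a concrete illustration of such context-sensitive recognition, the following minimal sketch runs an off-the-shelf NER model over the example paragraph. It assumes the spaCy library with its small English model (en_core_web_sm) is installed and is not tied to any particular study covered in this survey.

```python
# Minimal sketch of context-dependent NER with spaCy (assumes the library and
# the en_core_web_sm model are installed; not part of any surveyed system).
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Valencia is on her way to Wal-Mart super-store in Austin. "
        "She is asked to bring couple of coffee bags. "
        "Her nephews from Valencia are waiting for her arrival.")

for ent in nlp(text).ents:
    # A well-trained model should tag the first "Valencia" as a person-like
    # entity and the last one as a geographical location (GPE).
    print(ent.text, ent.label_)
```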

Multiple survey studies present the current progress in the domain of NERC (Kanya and Ravi 2012; Palshikar 2013; Patil et al. 2016; Sharnagat 2014). These surveys classify the NER literature along various dimensions: some focus on the employed approaches, namely rule-based and machine-learning oriented solutions, whereas others classify studies by the language of the underlying resources. Most of this literature concerns NERC on news datasets and well-formatted English text, where the primary task is to identify person names, locations and organizations. Such annotated benchmark datasets are available in a variety of languages, including English, Spanish, Arabic and Chinese.

In addition to surveys of conventional NERC problems, there are surveys covering NERC applied to medical scientific articles. In recent years, many developments have been made in medical sciences, genetics and other biological domains (Abdelmagid et al. 2014; Duck et al. 2016; Shickel et al. 2017). A major reason for this rapid development is the availability of formal ontologies, extensive corpora and lexicons. These language resources, together with sophisticated rule-based and machine-learning approaches, are employed to extract entities that are often related to genomics, gene relations, proteins and molecular information. Such surveys focus on bio-specific entities and hence are not generic in nature.

Research literature is increasing exponentially across various disciplines, so there is a need to consolidate the findings made so far regarding information extraction from this ever-growing scientific literature. The emphasis of this paper is on studies that are applicable to a wide range of domains. Therefore, developments in bio-specific entity extraction are not included in the current survey; however, studies that extract generic insights from medical datasets are included.

To present a survey focused on generic IE from scientific literature, the current work reviews ongoing advancements against the two major information constituents of a scientific article explained above: its metadata and its body. To the best of our knowledge, no comprehensive survey presents such insights for scholarly literature. Although comparative studies evaluate the performance of various information extractors for scientific articles, these studies focus on developed tools and are more inclined towards practical aspects.

In the light of the above points, a survey presenting state-of-the-art advancements along with open areas carries great importance. Therefore, the current work compiles and analyzes research and applications of the NERC task of IE applied to research papers, with respect to both metadata and the article body. The study covers the major datasets of scientific articles, the evaluation results reported on these datasets, and the approaches employed to perform IE from scientific articles. As the survey focuses on extraction of general insights from articles, it does not describe the tools or techniques used to pre-process article content; preprocessing techniques required to convert the input into feature vectors are therefore not part of the current study.

This survey aims to assist researchers interested in recent advancements and in an overview of automatic IE from scientific articles. It further highlights open research areas and future prospects in this domain. Since metadata and insights from full text can include many sub-fields, the study provides detailed, field-level results rather than reporting average results only. Results against coarse-grained fields provide better insight into current gaps in the literature by indicating which specific fields currently perform worse than the rest. Hence, this study should be very helpful for researchers interested in mining scientific literature.

The rest of this study is organized as follows. The “Methodology” section describes the methodology used to conduct this study and briefly explains the primary classification of the literature, followed by the widely used evaluation metrics of the domain in the “Evaluation metrics” section. The “Metadata extraction” and “Key-insights extraction” sections describe the state of the art in metadata and key-insights extraction from scientific articles, respectively. Finally, the “Conclusion and future work” section provides the overall conclusion and future prospects, with the bibliography presented in “References”.

Methodology

To conduct this study, a literature review was first performed to determine the state of the art of the domain. For this purpose, two well-known research repositories, ACM and IEEE, were used to retrieve relevant papers published up to 2017. Seed words most relevant to scientific literature were first identified by exploring synonyms and related words. Both repositories were then queried with the identified seed words within publication titles only. All queries were made via the advanced search options; in the case of ACM, the ACM Guide to Computing Literature bibliographic database was used for wider coverage.

The querying mechanism enforced the presence of all words in the titles of the retrieved articles, i.e. an AND operation was performed among the query strings, and double quotes ensured that a whole phrase appears together in the title. Among the seed words, “research article” returned a huge number of results. When these results were analyzed, considerable noise was observed in the form of conference proceedings whose names end with “Research Articles”, and some publishers used similar words for proceedings names. To avoid such results, the corresponding query was refined by adding more fields. After filtering such records, around 200 results were acquired for the “research article” query from ACM. Statistics on the initial results for each seed word are given in Table 2.

Table 2 Stats against initial queries

The acquired results were then manually filtered based on their relevance and categorized into major classes. This categorization was made after reading only the titles of the scientific articles. The tentative count of articles in each category is also mentioned:

1. Information Extraction (~ 80)
2. Recommender Systems (~ 45)
3. Classification and Clustering (~ 20)
4. Summarization (~ 20)
5. Citation Analysis (~ 50)
6. Structural studies (~ 40)

A brief overview of the overall methodology followed in this study is presented in Fig. 1. After categorization of the acquired literature, the articles on information extraction were studied and further categorized into two types: metadata and key-insights. The state-of-the-art approaches and datasets for each category were then determined. Based on this process, the research findings of this study are consolidated to present the current state of the domain.

Fig. 1 Overall flow of study

Many researchers have contributed to extracting information from scientific articles. A scientific article generally consists of two major constructs, metadata and full-body text; therefore, existing research can be broadly classified into two categories:

1. Metadata Extraction
2. Key-insights Extraction

Metadata Extraction: The semi-structured format of scientific articles can be exploited to automatically extract metadata. This information holds great importance in the context of digital research repositories and includes the title of a scientific article, its authors, the publication venue, the date of publication and keywords. In addition, metadata within citations carries immense importance, especially in the domain of Scientometrics. As metadata can be used to perform a variety of other tasks, including article recommendation and citation analysis, the current study compiles and presents research progress in this area.

Key-Insights Extraction: Apart from the structured parts, the full text of a research paper has its own importance. A researcher can have various research questions in mind while reading a scientific article, including:

1. Problem addressed in a scientific article
2. Domain of a research study
3. Methodology/Algorithms/Processes used to address the problem
4. Datasets used to conduct the experiments
5. Tools used to perform the experiments
6. Evaluation measures to gauge performance
7. Results achieved in a research study
8. Limitations of a research study
9. Future extensions

Automatic extraction of such insights can provide substantial ease to researchers performing literature reviews. In addition, if these insights are extracted from a bulk of scientific data, literature gaps can be identified efficiently. Hence, this study covers ongoing advancements towards automatic key-insights extraction from scientific articles.

Evaluation metrics

A very important aspect of measuring progress within any research area is its evaluation. Owing to their importance, this section briefly describes the evaluation metrics employed in the reported IE literature. An IE system is usually evaluated by comparing the extracted information with a gold-standard dataset. Gold-standard datasets are mostly annotated by humans and serve as the ground truth. The major evaluation metrics are Precision, Recall, F-measure and Accuracy. Precision measures how much of the extracted information is correct, whereas Recall measures how much of the correct information is extracted. Usually, a confusion matrix is constructed to calculate the various evaluation measures for a classification problem. Table 3 shows a confusion matrix for a binary classification problem; the concept extends to multiple classes.

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
(1)
$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
(2)

where FP is regarded as a type-1 error and FN as a type-2 error. An increase in FP decreases precision, whereas an increase in FN decreases recall. To take both measures into account, the F-score, the weighted harmonic mean of precision and recall, is widely used.

Table 3 A confusion matrix for two class problem
$$\text{F-score} = \left(1 + \beta^{2}\right)\frac{\text{Precision} \times \text{Recall}}{\beta^{2} \times \text{Precision} + \text{Recall}}$$
(3)
$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FN} + \text{FP} + \text{TN}}$$
(4)
$$\text{Error-rate} = \frac{\text{FP} + \text{FN}}{\text{TP} + \text{FN} + \text{FP} + \text{TN}}$$
(5)

Equation 3 allows researchers to weight precision and recall according to their information needs. For \(\beta = 1\), the equation gives equal weight to precision and recall and is usually termed the F-measure, balanced F-score or F1-score. Accuracy represents the ratio of correct results to the total results generated by the system, as shown in Eq. 4. Error-rate represents the ratio of incorrect results produced by the algorithm to the total results, as shown in Eq. 5.
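The following small helper, written as an illustrative sketch rather than taken from any surveyed system, computes Eqs. 1–5 directly from the counts of a binary confusion matrix.

```python
# Illustrative computation of Eqs. 1-5 from binary confusion-matrix counts.
def evaluation_metrics(tp, fp, fn, tn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    error_rate = (fp + fn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy, error_rate

# Example counts (hypothetical): 80 TP, 10 FP, 20 FN, 90 TN.
print(evaluation_metrics(80, 10, 20, 90))
```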

The “Metadata extraction” and “Key-insights extraction” sections briefly explain the current state of the art regarding metadata and key-insights extraction from research articles. All evaluation results reported in this study are taken from the respective research articles and are presented in percentages. The F-score reported throughout the study is the balanced F-score. Some studies report both token-level and field-level evaluation measures. Token-level measures are based on the number of individual word tokens that are correctly assigned to their label class. Field-level scores are based on the number of fields that are classified correctly as a whole, where a field can contain multiple tokens; thus, field-level scores give no partial credit for a subset of correct token-level predictions. In all tables of evaluation measures, Precision, Recall, F-measure and Accuracy are abbreviated as Prec., Rec., F1 and Acc. respectively.
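The difference between the two granularities can be made concrete with a small hypothetical example: a three-token author field in which one token is mis-labelled receives partial credit at the token level but no credit at the field level.

```python
# Hypothetical example: token-level vs. field-level scoring for one "author"
# field consisting of three tokens, one of which is mis-labelled.
gold = ["author", "author", "author"]
pred = ["author", "author", "title"]

token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # 2/3: partial credit
field_correct = int(gold == pred)                                     # 0: whole field must match
print(token_accuracy, field_correct)
```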

Metadata extraction

NISO (2004) broadly classifies metadata into three types: descriptive, structural and administrative. Descriptive metadata, such as title, author and keywords, is used for discovery and identification, which in turn supports finding and searching tasks. Structural metadata helps in determining how a paper is organized; for example, an outline of a paper gives insight into its structure. Administrative metadata provides information for resource management, such as file type and creation date.

In the context of research articles, metadata is usually descriptive in nature and holds great importance. It provides a brief overview of a scientific article through information such as its title, authors and bibliography. Researchers therefore tend to decide the relevance of a paper to their domain of interest based on metadata such as the title, abstract, references, authors, citing articles and affiliations. Digital research repositories also make use of metadata to support literature acquisition for the research community: they aid researchers with intelligent search tools that filter by keywords, authors, organizations, publication venues and other metadata fields. In addition, this information can be used to recommend articles (Haruna et al. 2017; Knoth et al. 2017).

Further, by extracting citation-level metadata, one can provide statistical information about an article's citation count and popularity over time. Citation-level metadata extraction is also very useful in the domain of Scientometrics (Alam et al. 2017; Insights 2013). Table 4 presents the NEs that can be extracted from the following reference strings.

Table 4 A sample NERC task from references
REF # 1:

Ramadge, P., & Wonham, W. (1989). The control of discrete event systems. Proceedings of the IEEE, 77 (1), 81–98

REF # 2:

W. H. Enright. Improving the efficiency of matrix operations in the numerical solution of stiff ordinary differential equations. ACM Trans. Math. Softw., 4(2), 127–136, June 1978

Figure 2, on the other hand, presents a sample header-level metadata extraction task from Wang and Chai (2018). Header-level metadata extraction deals with the identification and extraction of the title, authors, affiliations, emails, publication venue, DOI, keywords, abstract and other related fields, usually from the title page of a scientific article. In the figure, the title, authors and their respective affiliations are recognized from the title page of a scientific article.

Fig. 2 Sample header metadata extraction

In the light of the above points, it is evident that metadata extraction carries great importance for many research-oriented tasks. Given the wide variety of reporting styles across journals, conferences and technical reports, and the wide variety of citation formats, both header-level and citation-level metadata extraction are quite challenging. In the remainder of this section, the major datasets for metadata extraction are discussed first, followed by the approaches widely used to solve this problem.

Datasets

There are three widely used datasets, namely CORA, FLUX-CiM and UMASS, developed in 1999–2000, 2007 and 2013 respectively. The CORA dataset is split into two parts: one focuses on document header metadata, the other on metadata extraction from citation strings. The other two datasets also focus on metadata extraction from citations.

The CORA dataset consists of computer science articles. The widely used CORA-Header dataset for document header metadata extraction was presented in Seymore et al. (1999). It has fifteen (15) fields, explained in Table 5, and comprises 935 records in total, with 500 training records and 435 testing records. The CORA-Reference dataset (McCallum et al. 2000) contains 500 references in total, of which 350 are usually used for training and the remaining 150 for testing; it contains thirteen (13) fields. Tables 5 and 6 compile the attributes of the CORA header and reference datasets respectively.

Table 5 Information against CORA Header dataset
Table 6 Information against CORA reference dataset

The FLUX-CiM dataset consists of articles from several domains: Computer Science (CS), Health Sciences (HS) and Social Sciences (SS). The CS dataset contains 300 reference strings, each segmented into ten fields. The HS dataset contains 2000 reference strings, developed from PubMed Central data, with each reference string segmented into six fields (Cortez et al. 2007). The SS dataset shares the same fields as the HS dataset and is constructed from the Scielo Digital Library. The mapping of entities between CORA and FLUX-CiM is presented in Table 7. FLUX-CiM differs from CORA mainly in variety, as it includes citations from HS and SS as well; on the other hand, it does not cover all the fields present in CORA.

Table 7 Mapping of CORA fields against FLUX-CiM

The UMASS dataset, consisting of bibliography information from 5000 research papers, was presented in 2013 (Anzaroot and Mccallum 2013). It consists of citations from 5000 articles on arXiv, evenly distributed across four major domains: physics, mathematics, computer science and quantitative biology. The dataset comprises a variety of formats and styles, including journal pre-prints, conference papers and technical reports. Each citation string is labeled hierarchically, with both coarse-grained and fine-grained labeled segments, presented in Tables 8 and 9 respectively.

Table 8 UMass dataset coarse-grained entities
Table 9 UMASS dataset fine-grained fields

Approaches

Over the last decades, many researchers have contributed to the domain of IE from research papers. Multiple machine learning and NLP techniques are used to extract metadata from scientific literature. The widely used techniques include rule-based systems and machine learning systems; among the machine learning techniques, Markov models, conditional random fields and support vector machines are used most frequently. The following sections describe the developments in metadata extraction for each technique.

Rule-based approaches

Rule-based systems rely on a set of predefined instructions that specify how to extract the desired information from data. In the context of metadata extraction, many researchers have used rule-based approaches based on text structure and layout. The study reported in Klink et al. (2000) uses rules that rely on textual and geometrical features. It focuses on extracting the following entities from an article's metadata: abstract-body, abstract-heading, affiliation, biography, caption, drop-cap, highlight, keyword-body, keyword-heading, membership, page-number, pseudo code, publication-info, reader-service, synopsis and text-body. The rule base can be applied to multiple domains, and the study claims reasonable results when the rules are combined with fuzzy matching. Results are evaluated on 979 journal pages from the University of Washington corpus.

Metadata extraction from research articles in PostScript format is reported in Giuffrida et al. (2000). This study employs a knowledge base of rules for various metadata fields including title, authors, affiliations, author-affiliation mapping and table of contents. The knowledge base makes use of visual and spatial knowledge, combined with fuzzy logic, to identify these metadata entities; for example, rules such as “the title is usually in a big font at the start of the text” or “the title should appear above the abstract section” are used to extract metadata. To demonstrate the effectiveness of the proposed approach, a dataset of one hundred articles is used, of which 70% are conference articles and the rest journal articles and technical reports. The accuracies of the proposed approach are reported in Table 10.
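A minimal sketch of such a layout-driven rule is shown below; the line-based input representation (text with font size and vertical position) is an assumption made here for illustration and does not reproduce the knowledge base of the cited systems.

```python
# Sketch of a layout rule in the spirit of "the title is usually in the largest
# font near the top of the first page"; the input format is an assumption.
def extract_title(lines, top_limit=200):
    """lines: dicts with 'text', 'font_size' and 'y' (distance from page top)."""
    top_region = [ln for ln in lines if ln["y"] < top_limit]
    if not top_region:
        return None
    return max(top_region, key=lambda ln: ln["font_size"])["text"]

sample_page = [
    {"text": "Journal of Hypothetical Studies, Vol. 1", "font_size": 9,  "y": 20},
    {"text": "A Survey of Metadata Extraction",         "font_size": 18, "y": 80},
    {"text": "Jane Doe, John Roe",                      "font_size": 11, "y": 120},
]
print(extract_title(sample_page))  # -> "A Survey of Metadata Extraction"
```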

Table 10 Accuracies against Post-Script (Giuffrida et al. 2000) format, OCR system (Mao et al. 2004) and Template Matching framework (Huang et al. 2006)

The study reported in Mao et al. (2004) makes use of OCR to identify the respective metadata spans. It presents a dynamic feature update system that generates and improves features, where these features are geometrical as well as contextual and include font size, font type and bounding box. The distribution of these features is computed from OCR data and saved for each journal's style. A feature generation algorithm then employs various string-matching algorithms to extract the feature vectors. Feature vectors learnt over previous issues of the same journal and/or other journals are applied to extract information from current issues, and these features are then used in a rule-based system to extract metadata. To evaluate the proposed system, the title pages of 309 medical research articles are used. These are scanned images from two medical journals, and the dataset includes various article types such as short papers and correspondences. Results are evaluated on 166 title pages of the Indian Journal of Experimental Biology and 143 pages from the Journal of Clinical Monitoring and Computing, both scanned medical journals. The experimental results show that using multiple journal issues for feature learning yields better results than using one issue. The best labeling accuracies of this study are presented in Table 10.

The study proposed in Huang et al. (2006) makes use of template matching to extract header metadata, including title, authors, authors' affiliations, abstract and keywords. By analyzing four widely used publication styles, namely Springer Lecture Notes in Computer Science (LNCS), Elsevier, ACM and IEEE JNS, the authors propose a template that can carry the various fields of these publication styles. A finite state automaton is then used to perform the template matching. Results are evaluated on 400 sampled articles from ACM, IEEE, Springer LNCS and Elsevier. As shown in Table 10, title extraction accuracy is the highest, while affiliation extraction accuracy is the lowest.

A hierarchical template-based citation metadata extraction for scholarly publications is presented in Day et al. (2007). It uses a hierarchical knowledge representation framework that extracts important concepts from natural language texts. To cover the major domain-specific constructs, the proposed framework, named INFOMAP, consists of domain-specific concepts along with related sub-concepts, relevant categories, attributes and actions. This information helps in maintaining relationships between concepts and ultimately turns the knowledge base into a taxonomy. Using this taxonomy, INFOMAP classifies citation strings into concepts and their related concepts. A powerful feature of the framework is its ability to represent and match complicated template structures. The proposed framework is evaluated on a self-generated dataset of 160,000 citations covering six major citation styles: APA, IEEE, ACM, ISR, MISQ and JMIS. Results are given in Table 11, with an overall average accuracy of 92.39%.

Table 11 Accuracies against INFOMAP (Day et al. 2007)

A template-based metadata extraction architecture is presented in Flynn et al. (2007). This work focuses on processing various types of data, including data from government agencies, laboratories and universities. PDFs containing either scanned images or text are taken as input. Data from Defense Technical Information Center (DTIC) and National Aeronautics and Space Administration (NASA) reports is used in the study. The DTIC dataset usually contains Report Document Page (RDP) forms, so the major emphasis of the proposed architecture is the processing of form-based and non-form-based data. Owing to the layout of RDPs, templates are a very suitable choice for them. For inputs containing no RDP forms, a non-form-based process first converts the input into XML format. Results against form-based inputs show high precision and recall, whereas the accuracy achieved for non-form-based metadata extraction is 66% and 64% for DTIC and NASA reports respectively.

An unsupervised system for metadata extraction named FLUX-CiM is proposed in Cortez et al. (2007, 2009). This approach differs from existing rule-based/knowledge-based systems in that it automatically creates the knowledge base from existing metadata records. To validate the approach, several datasets are constructed. The first consists of Computer Science articles and contains 300 reference strings, each segmented into ten classes: Author, Title, Journal, Date, Pages, Conference (Book-title), Place (Location), Publisher, Number and Volume. The second consists of medical articles and contains 2000 reference strings, each segmented into six fields: Author, Title, Journal, Date, Pages and Volume (Cortez et al. 2007). Another dataset of Social Sciences articles is also constructed. Both the health sciences and social sciences datasets carry uniform citation formats; they are therefore referred to as organized and are relatively simpler to deal with. The automatic construction of the knowledge base is handled using existing data, e.g. for CORA the BibTeX entries corresponding to the training set were parsed and included in the knowledge base. The field-level Precision/Recall/F-measure of the proposed unsupervised approach on the developed dataset is presented in Table 12. One interesting claim of the authors, backed by experiment, is that directly adding extracted entities into the knowledge base can further improve results, as knowledge-base size affects overall performance. As future directions, the authors suggest learning implicit styling and improved matching functions to distinguish between similar entities such as author names and editor names.
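The core idea of building the knowledge base from existing records and using it to label new citation segments can be sketched as follows; the toy term counts and the simple additive scoring are illustrative assumptions and greatly simplify the matching functions used by FLUX-CiM.

```python
# Rough sketch of knowledge-base driven labelling of citation segments; the
# toy counts and scoring are assumptions, not the actual FLUX-CiM functions.
from collections import Counter, defaultdict

knowledge_base = defaultdict(Counter)
for field, terms in [("author",  ["ramadge", "wonham", "enright"]),
                     ("journal", ["proceedings", "ieee", "acm", "trans"]),
                     ("date",    ["1989", "1978"])]:
    knowledge_base[field].update(terms)   # in practice, built from parsed BibTeX records

def guess_field(segment):
    tokens = segment.lower().replace(",", " ").split()
    scores = {f: sum(counts[t] for t in tokens) for f, counts in knowledge_base.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(guess_field("Ramadge, P., & Wonham, W."))   # -> author
print(guess_field("Proceedings of the IEEE"))     # -> journal
```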

Table 12 Evaluation measures against FLUX-CIM various datasets

In addition, experiments are conducted to compare the proposed approach with CRF, which provides state-of-the-art results among statistical modeling techniques. For this comparison, the CORA dataset is used as the computer science dataset besides the self-constructed social and health sciences datasets. The F-scores of the proposed approach and CRF on the various datasets are presented in Table 13.

Table 13 F1-score against FLUX-CiM (Cortez et al. 2009), CRF and Template Extraction (TE) (Guo and Jin 2011)

Text formatting information is also used in Groza et al. (2009) to extract title, authors, sections and references from research articles in PDF format. The study first carries out a pilot study to determine habits, beliefs and opinions regarding metadata reporting in research articles. In the light of the learnt insights, heuristics and rules are then prepared that exploit formatting and font styling features. The proposed approach has two major modules, first-page content extraction and full-text content extraction: the former deals with extraction of the title, abstract and author names, while the latter extracts section information and references. Evaluation is performed on 1203 documents following the ACM or Springer LNCS format; notably, all selected articles were correctly parsed from PDF format. Results show an F-measure greater than 90% for all entities. Analyzing Springer and ACM separately, extraction on Springer LNCS outperforms ACM due to less variation. The study proposes several feature-oriented mathematical functions to extract metadata from scientific articles published in PDF format, and the authors present two major applications of the proposed system: a metadata extraction web service and a personal research assistant. The evaluation metrics of this study are reported in Table 14.

Table 14 Evaluation measures against Groza et al. (2009) and Adefowoke Ojokoh et al. (2009)

The methodology used in Adefowoke Ojokoh et al. (2009) combines keyword-based segmentation and pattern matching techniques (regular expressions) to extract general metadata such as title, table of contents and abstract from documents. The approach was tested on a dataset of forty theses using precision, recall, accuracy and F-measure; the results are presented in Table 14.

Another study in this regard, presented in Guo and Jin (2011), employs a knowledge base and template extraction. Initially, templates are constructed from the formatting of citations; a total of 576 templates are created covering various reference styles. In addition, a knowledge base carrying the names of authors, venues and publishers is populated. This knowledge base is used to determine which class a particular input element belongs to. After obtaining a preliminary idea of the possible and most likely classes of the input elements of a citation, template matching is performed using the most similar template in the light of the extracted insights. Once the elements are extracted, the metadata knowledge base is queried again to check whether it has records for the input citation; if a record exists, the results from the knowledge base are returned, as they are more accurate. Incorporating the knowledge base thus helps in improving the overall results. The proposed approach is evaluated on 97 computer science journal and conference articles from IEEE and ACM. Table 13 shows the accuracies of the extracted fields. This approach is not robust enough to handle articles with complex structures.

Another template-based approach is proposed in Chen et al. (2012). It treats a citation string as text data carrying the fields to be extracted along with delimiters. The study extracts seven attributes from a citation string: Author, Title, Venue, Volume, Issue, Page and Date; the Venue field is later post-processed to identify journal, book-title and tech-report. The proposed approach has three major modules: a canonicalization algorithm, template database construction and query processing. To identify structural elements in a citation string, a rule-based algorithm, termed the canonicalization algorithm, employs various heuristics and makes use of patterns and reserved words to retain structural information in a contextual string. This information is later used in the template-extraction module to define templates and in the query-processing module to search templates based on the structured citation. The algorithm is evaluated on three datasets, INFOMAP, CORA and FLUX-CiM; the evaluation metrics for each dataset are shown in Table 15.
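The flavour of such a canonicalization step can be conveyed with a simplified sketch that splits a citation on punctuation delimiters and flags segments containing reserved words; the reserved-word list and the splitting pattern are illustrative assumptions, not the actual rules of Chen et al. (2012).

```python
# Simplified sketch of delimiter- and reserved-word-based canonicalization; the
# reserved-word list and splitting pattern are illustrative assumptions.
import re

RESERVED = {"proceedings", "journal", "trans", "vol", "no", "pp"}

def canonicalize(citation):
    segments = [s.strip() for s in re.split(r"[.,;]", citation) if s.strip()]
    return [(seg, "reserved" if RESERVED & set(seg.lower().split()) else "content")
            for seg in segments]

for segment, kind in canonicalize(
        "Ramadge, P., & Wonham, W. (1989). The control of discrete event systems. "
        "Proceedings of the IEEE, 77 (1), 81-98"):
    print(kind, "->", segment)
```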

Table 15 Evaluation Measures against template-matching approach (Chen et al. 2012)

Rule-based systems tend to perform very well thanks to manual effort and human observation, but they have obvious disadvantages. They are less adaptable than machine-learning based systems due to their dependence on text formatting, text location and graphical attributes of text. Rule formation is itself a laborious and time-consuming task. Complexity makes rules powerful, but consequently rule processing becomes expensive, and the overall time complexity of the system rapidly increases with the number of rules, as concluded in Klink et al. (2000).

Machine-learning based approaches

The following sections compile the major approaches that employ machine learning to perform metadata extraction from scientific articles.

Hidden Markov model

The Hidden Markov Model (HMM) has strong statistical foundations, is robust in nature and is efficient to develop; its major weakness is its reliance on training data. It is widely used across many domains, including speech recognition (Juang and Rabiner 1991) and other machine learning problems. In the current domain of interest, HMM is used along with multiple state-merging options in Seymore et al. (1999). That study makes use of a distantly labeled dataset (BibTeX) to improve the accuracy of the HMM model and primarily deals with extraction of the CORA header entities. Tested on manually tagged data along with the BibTeX collection, it achieves 92.9% accuracy over all header classes, including 97.8% for Title and 97.2% for Authors. Detailed results for each field are given in Table 16.

Table 16 Accuracy against CORA dataset in Seymore et al. (1999) and McCallum et al. (2000)

HMMs are also used in the development of the CORA system proposed in McCallum et al. (2000). This system serves as an Internet portal for computer science articles, providing features such as searching and identification of metadata entities from scientific articles; the proposed approach is generic enough to apply to other Internet portals. In the developed system, one HMM identifies fields such as author, title and affiliation from the paper header, and a second HMM extracts metadata from references. With respect to HMMs for IE, the primary focus of the study is learning the parameters and transition structures from labeled and unlabeled text. The study shows that distant supervision tends to improve the results, whereas parameter estimation using forward Baum–Welch (Baum 1972) degrades performance; one primary reason may be that the Baum–Welch algorithm tends to get stuck in local maxima and is therefore sensitive to the initial parameter settings. Here, distant supervision refers to the incorporation of data annotated for another purpose, such as BibTeX, which carries marked authors for an article but does not carry all the required fields. Error rate and accuracy against the various fields are presented in Table 16.

The research study carried out in Hetzner (2008) employs an HMM by means of the Viterbi algorithm and string manipulation methods. To improve performance, separate sets of cue-words are constructed that are good indicators of the fields to be extracted; results are evaluated on the CORA dataset. A similar approach, also focused on citation metadata extraction with HMMs, is proposed in Ni and Xu (2009). It makes use of the Baum–Welch (BW) algorithm to learn the weights of HMM transitions and forms multiple states for the potential information to be extracted from a citation. This HMM-BW model is evaluated against the existing HMM model (Hetzner 2008) as well as CRF (Peng and McCallum 2006). Table 17 presents the evaluation measures of the aforementioned HMM models.

Table 17 Evaluation measures against various HMM models

Another study, proposed in Cui (2009), uses an HMM with text blocks, rather than words, as the basis of the Viterbi algorithm (Forney 1973), along with heuristics for email, phone numbers, keywords and web addresses. The fields extracted are title, author, address, affiliation, email, web, phone, date, abstract and keyword. The model is trained on 800 headers and tested on 135 headers, with precision and recall above 85% for all fields. This study is extended in Cui and Chen (2010) to improve the Viterbi algorithm in the HMM model. It exploits the observation that the transition probability between the same states within a line is far greater than across lines, and employs location-based information to further improve the results of the Viterbi model. As the existing dataset does not contain location information, a new dataset of 458 articles from VLDB conferences was constructed to train the location-based model. Table 18 presents the evaluation measures with and without the location-based heuristics.

Table 18 Evaluation measures against Cui and Chen (2010)

Tri-gram HMMs are employed in Ojokoh et al. (2011) to extract citation metadata. Twenty features, including full-stop, comma, capital letter and all-numbers, are used as the emission vocabulary to improve the model. To further improve the results, shrinkage is employed, a technique usually used to handle sparse transition data while training HMMs. Results are evaluated on the CORA and FLUX-CiM datasets. The effect of data size on the model is also studied using the FLUX-CiM dataset: with increasing data, F-score and recall tend to decrease whereas precision increases; moreover, one-third of the dataset already achieves 98% accuracy, and adding further data brings only minimal gains. The results are shown in Table 19. A comparison is also made with an existing bi-gram HMM study (Yin et al. 2004) that employed a similar idea of shrinkage but used bi-grams for network training; that study used a self-created evaluation dataset of 713 citation strings obtained from 250 scientific articles. The tri-gram model (Ojokoh et al. 2011) is also evaluated on the self-annotated data of the bi-gram model (Yin et al. 2004), referred to as the ManCreat dataset. The evaluation metrics of both the bi-gram and tri-gram models on the ManCreat dataset are presented in Table 20.

Table 19 Evaluation Measures against Ojokoh et al. (2011) on CORA and FLUX-CiM datasets
Table 20 Evaluation measures against Ojokoh et al. (2011) and Yin et al. (2004) on ManCreat dataset

An HMM computes a probability distribution over possible label sequences and then selects the best one. Its parameters are trained to maximize the joint likelihood of the training examples, which requires enumerating all possible observation sequences; as a result, long-range dependencies and interacting features cannot be represented in this model. HMMs were the pioneering statistical models applied to sequence-oriented problems and laid the foundation for improved models such as Maximum Entropy Markov Models and Conditional Random Fields.
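As background for the HMM-based studies above, the sketch below shows a compact Viterbi decoder that assigns the most likely label sequence to a sequence of observed tokens; the two-state model and its probabilities are invented for illustration and are not the parameters learned in any of the cited works.

```python
# Compact Viterbi decoding over a toy two-state HMM; all probabilities here are
# invented for illustration, not learned from CORA or any other dataset.
def viterbi(observations, states, start_p, trans_p, emit_p):
    # Each cell stores (probability of best path ending in this state, that path).
    layer = {s: (start_p[s] * emit_p[s].get(observations[0], 1e-9), [s]) for s in states}
    for obs in observations[1:]:
        new_layer = {}
        for s in states:
            prob, path = max(
                (layer[prev][0] * trans_p[prev][s] * emit_p[s].get(obs, 1e-9), layer[prev][1])
                for prev in states)
            new_layer[s] = (prob, path + [s])
        layer = new_layer
    return max(layer.values())[1]

states = ["title", "author"]
start_p = {"title": 0.8, "author": 0.2}
trans_p = {"title": {"title": 0.7, "author": 0.3}, "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"survey": 0.4, "extraction": 0.4}, "author": {"jane": 0.5, "doe": 0.4}}
print(viterbi(["survey", "extraction", "jane"], states, start_p, trans_p, emit_p))
# -> ['title', 'title', 'author']
```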

Conditional random fields

The Conditional Random Field (CRF) is a statistical model with the ability to also incorporate the effect of neighboring observations. CRFs are currently used as an alternative to HMMs in named entity recognition, pattern matching and other machine learning problems, and many researchers have applied them to IE from research papers.
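Before turning to the individual studies, the following sketch illustrates linear-chain CRF sequence labelling of citation tokens; it assumes the third-party sklearn-crfsuite package, and the single toy training sequence and hand-picked features stand in for a properly annotated corpus such as CORA.

```python
# Sketch of linear-chain CRF labelling of citation tokens; assumes the
# sklearn-crfsuite package, and the toy data stands in for a real corpus.
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sequences = [["Ramadge", "P", "1989", "The", "control", "of", "discrete", "event", "systems"]]
labels = [["author", "author", "date", "title", "title", "title", "title", "title", "title"]]

X = [[token_features(seq, i) for i in range(len(seq))] for seq in sequences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))   # predicted labels for the (training) sequence
```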

The research study reported in Peng and McCallum (2004) uses CRFs with Gaussian priors, regularization and hyperbolic priors to extract metadata fields including author, affiliation, address, note, email, date, abstract, introduction, phone, keywords, web, degree, publication number and page; CRFs are also used for citation metadata extraction. Applied to a standard benchmark dataset, this technique reduces the error in average F-score and the word error rate by 36% and 78% respectively, compared with the previous best SVM results (Han et al. 2003), with an average F-score of 93.9% and an overall accuracy of 98.3%. The study uses the CORA header and reference datasets for evaluation. An extension of this work, presented in Peng and McCallum (2006), provides a mechanism to exploit co-referent citations using CiteSeerX 2007, which reduces the error rate by 6–14% on self-annotated datasets tagged with co-reference information. The extended study also develops another dataset consisting of 450 headers that contains font information, used as a feature to improve the identification of field boundaries; the scientific articles were randomly selected from 8000 articles crawled from various Internet sources. To train the model, 300 records were used and the remaining 150 were used for testing. Table 21 shows the results on CORA and on the self-annotated dataset carrying font information.

Table 21 Evaluation Measures against Peng and McCallum (2006) against CORA and self-annotated dataset

The study presented in Yu and Fan (2007) applies CRFs to extract metadata from Chinese research papers. It uses three types of features: local features describing character specifics, layout features carrying information about word occurrence, and external features from lexicons such as family names and location names. A comparison with HMMs shows that CRFs perform better in both languages. For English, the CORA header and reference datasets are used for evaluation; for Chinese, a dataset was constructed from the China National Knowledge Infrastructure, with 600 headers and 1500 references. The same six fields are selected for the experiments in both languages. Results for the header and reference datasets in both languages are presented in Table 22.

Table 22 Evaluation Results in Yu and Fan (2007)

Another study employing CRFs for metadata extraction is presented in Councill et al. (2008). This study is a pioneering contribution in the open-source domain and provides automatic reference string extraction followed by segmentation into multiple classes. It also addresses extraction of citation context, i.e. the areas/sentences that correspond to a citing article. The framework combines CRFs and heuristics: heuristics are used primarily for extraction and identification of reference strings and citation contexts, while the CRF segments reference strings into categories. The model's performance is evaluated in experiments on the CORA, CiteSeer and FLUX-CiM datasets, where the CiteSeer dataset consists of 200 reference strings randomly sampled from the millions available in the CiteSeer system. The results on the various datasets are presented in Table 23. The proposed system is integrated into the CiteSeer system.

Table 23 Evaluation measures against Councill et al. (2008) using various datasets

The study presented in Anzaroot and Mccallum (2013) also uses CRFs, to provide baseline results on the developed UMASS dataset. It discusses the limitations of conventional CRFs in making predictions due to the Markov assumption, and future work is therefore directed towards improved CRF models. In addition, the presented dataset is to be revised and extended over time, as an increase in tagged data eventually improves the accuracy of machine learning systems. Baseline results against the fine-grained dataset, with both field-level and token-level evaluation, are presented in Table 24. Other studies focused on improving the underlying CRF models to cover more global context include Anzaroot et al. (2014) and Vilnis et al. (2015); these studies discuss citation extraction as an application of improved CRF models on the UMASS dataset.

Table 24 Base line Results against Anzaroot and Mccallum (2013) using CRF

Another approach using CRFs for information extraction makes use of the Particle Swarm Optimization algorithm (Kennedy and Eberhart 1995) to evaluate optimal values while keeping evolution in context. The approach uses an optimized version of Particle Swarm Optimization that avoids local convergence by using the iterative likelihood ratio as the stopping criterion (Shuxin et al. 2013). It improves the results of the existing CRF-based studies of Peng and McCallum (2004, 2006), with an average F-score of 93.9% and an accuracy of 98.3%. Detailed results are presented in Table 25.

Table 25 Results against Shuxin et al. (2013) using optimized Particle Swarm Optimization algorithm

Another study employing CRFs for metadata extraction is presented in Souza et al. (2014), which proposes a two-layer CRF model. It considers the first page of a research article, as it carries the potential header metadata. The first layer identifies the larger components of the article text that may contain metadata: header, title, author information, body and footnote. The header usually holds important information about the conference/journal in which the paper was published; the title class represents the title of the paper; author information contains data about the authors, such as name, affiliation and email. As the body class contains no useful data for metadata extraction, it is not processed further. Footnotes, on the other hand, usually contain information about the publisher and conference, and additional author information such as email and affiliation. Hence, a second CRF layer was created for the header, author information and footnote components; this extra layer extracts the actual metadata and allows section-specific features to be defined. Results are evaluated on 100 papers; the dataset and corpus are freely available on GitHub. Forty of these papers belong to an existing study focused on extracting structural content from papers, presented in Kan et al. (2010), which used a single-layer CRF. F1-score results for the initial 40 papers from the existing study and for all 100 papers are presented in Table 26.

Table 26 F1-scores against Souza et al. (2014) using two-layer CRF model

Another study (Cuong et al. 2015) focuses on improving conventional CRF results by introducing higher-order semi-CRFs. These models can represent transitions between variable-length segments of a sequence, giving them more power than traditional linear-chain CRFs. The proposed approach is applied to a variety of problems, including extraction of author names and authors' affiliations as well as citation metadata extraction from scientific articles. The experiments use the ParsCit dataset, with linear-chain CRFs as the baseline and first-order, second-order and third-order semi-Markov CRFs. The results, shown in Table 27, indicate that second-order semi-CRFs give better results than the rest.

Table 27 F1-scores against Cuong et al. (2015)

CRFs currently give state-of-the-art results for metadata extraction tasks and largely overcome the limitations of HMMs. One potential drawback is that they are computationally very expensive. They are nevertheless the most widely used statistical model for sequence labeling tasks.

Support vector machines

The Support Vector Machine (SVM) (Cortes and Vapnik 1995) is another technique widely used in the literature for automatic metadata extraction. It is primarily a supervised learning technique generally used for classification and regression.

The research study in Han et al. (2003) used SVMs to extract structured metadata from scientific literature. It applies SVM classifiers in two major classification steps. The first is line classification, performed using word- and line-specific features including word position, line number and capitalized words; it extracts the main features that in turn help in classification. The classified lines are then passed to a second SVM classifier that performs chunk classification, applied only to multi-line data; it classifies multi-line data into the respective categories and makes use of boundary heuristics and punctuation marks. The evaluation is performed on the CORA header dataset and the results are presented in Table 28.
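A rough sketch of the line-classification stage is given below using scikit-learn; the bag-of-words features and the four example header lines are illustrative assumptions and are far simpler than the word- and line-specific features of Han et al. (2003).

```python
# Rough sketch of SVM-based header line classification with scikit-learn; the
# example lines, labels and bag-of-words features are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

lines = [
    "A Survey of Metadata Extraction from Scientific Articles",
    "Jane Doe, Department of Computer Science, Example University",
    "Abstract This paper surveys approaches to metadata extraction",
    "John Roe, Institute of Information Systems",
]
labels = ["title", "author", "abstract", "author"]

vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(lines)
classifier = SVC(kernel="linear").fit(X, labels)

print(classifier.predict(vectorizer.transform(["Mary Major, School of Computing"])))
```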

Table 28 Results against Han et al. (2003) and CRIS system (Kovačević et al. 2011)

A research study proposed in Kovačević et al. (2011) makes use of SVM classifiers to extract eight metadata fields: title, authors, affiliation, address, email, abstract, keywords and publication note. The study employed SVM in a variety of configurations, comparing results when a single classifier is used for all fields against results when a separate classifier is used for each category. It also includes experiments with several classifiers, namely decision trees, k-nearest neighbors, Naïve Bayes and SVM, and concludes that the best results are achieved when eight separate SVM classifiers are used, one per category, yielding F-scores above 85% for all categories except keywords. In addition, it differs from existing techniques in that it considers the actual text of PDF files along with font and styling information, whereas the techniques proposed in Peng and McCallum (2004, 2006), Shuxin et al. (2013) and Seymore et al. (1999) make use of text only. Results are reported on a self-annotated corpus of 100 computer science articles from the domain of automatic term recognition and are shown in Table 28.
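
To make the two-stage idea of Han et al. (2003) concrete, the sketch below shows a line-classification step of the same flavor, assuming a scikit-learn pipeline; the features, training lines and labels are hypothetical stand-ins, not the original study's configuration.

```python
# A rough sketch of SVM-based line classification for header metadata,
# assuming scikit-learn; each line is represented by simple word- and
# position-based features and mapped to one metadata category.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def line_features(line, line_number):
    words = line.split()
    return {
        "line_number": line_number,
        "num_words": len(words),
        "num_capitalized": sum(w[:1].isupper() for w in words),
        "has_email": any("@" in w for w in words),
        "has_digit": any(any(c.isdigit() for c in w) for w in words),
    }

# Toy training data: (line text, line number) pairs with gold classes.
train_lines = [("Deep Parsing of Scientific Text", 0),
               ("John Doe and Jane Roe", 1),
               ("Department of CS, Some University", 2),
               ("john@uni.edu", 3)]
train_labels = ["title", "author", "affiliation", "email"]

X = [line_features(text, n) for text, n in train_lines]
clf = make_pipeline(DictVectorizer(sparse=False), LinearSVC())
clf.fit(X, train_labels)
print(clf.predict([line_features("jane@uni.edu", 3)]))  # e.g. ['email']
```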

Others

There are several studies that either use a hybrid approach to perform metadata extraction or make use of techniques that do not fall under the aforementioned sections. This sub-section compiles such studies to provide a brief overview of other ongoing advancements.

The study performed in Marinai (2009) makes use of a multi-layer perceptron to extract metadata information from PDF scientific articles by exploiting visual and layout features of the text. The proposed approach performs low-level image processing to extract graphical features from the initial pages of PDF articles, as these pages tend to carry most of the metadata. Furthermore, the DBLP indexing engine is incorporated to improve author extraction. The tool is developed using the Greenstone package development library and is focused on extraction of the document title, authors and related information. To evaluate the tool, eighty (80) articles from two conference proceedings, ICDAR and GREC, with double- and single-column formats respectively, were selected. The tool was later incorporated into the Greenstone packages, and results show that substantial work is still required to improve extraction quality.

The TeamBeam algorithm presented in Kern et al. (2012) makes use of a maximum entropy Markov model to extract metadata information from PDF articles. TeamBeam uses a variety of features and heuristics to identify metadata fields. The procedure consists of three steps: the first classifies text blocks to identify the major blocks carrying metadata; the second performs token-level classification of the text contained in those blocks; the final step extracts metadata using both the block-level and token-level classification information. The study performs extensive experimentation with three datasets: Mendeley, E-prints and PubMed. In addition, classification performance of various algorithms is presented, and a series of experiments examines the impact of increased training data on overall extraction performance. The E-prints dataset contains 2542 entries, while Mendeley and PubMed contain 20,672 and 19,581 entries respectively. The three datasets differ from each other in layout and formatting styles, and this information is exploited as the primary feature set in the proposed approach. Metadata extraction results for various fields using the TeamBeam algorithm are presented in Table 29.

Table 29 Results against TeamBeam (Kern et al. 2012)

The study carried out in Tkaczyk et al. (2015) focuses on automatic metadata extraction by means of various machine learning constructs; the resulting system is named CERMINE. It divides the task into multiple independent modules: layout analysis, content extraction, metadata classification and bibliography extraction. Layout analysis deals with character reading, page segmentation and reading-order preservation. Content extraction deals with feature extraction for zone identification, i.e. deciding whether a particular piece of text belongs to the metadata, body, bibliography or other class; using these features, an SVM classifier is trained to perform the primary zone classification. Metadata classification further divides the metadata zones into pre-determined classes such as authors, affiliations etc. by means of SVM, complemented by a rule-based approach. The final phase, bibliography extraction, has two sub-modules, namely reference string extraction and reference parsing: reference string extraction separates individual references using K-means clustering, while reference parsing extracts metadata from individual references using CRF. Various datasets are used to evaluate the individual modules, and a comparative analysis is presented against other freely available metadata extractors, including ParsCit (Councill et al. 2008), GROBID (Lopez 2009) and PDFX (Constantin et al. 2013). The compiled results show that CERMINE overall outperforms existing solutions. The results reported in Table 30 are evaluated on a dataset of selected articles from PubMed Central (PMC). This tool was the top performer in the Semantic Publishing 2015 challenge for contextual information extraction (SemPub2015 2015).

Table 30 Results against CERMINE system (Tkaczyk et al. 2015) and other existing solutions on PMC data

The study presented in An et al. (2017) makes use of a deep neural network to extract citation metadata, employing a deep learning model together with CRF. This hybrid neural-CRF approach currently gives state-of-the-art results in general information extraction tasks as well (Huang et al. 2015; Ma and Hovy 2016; Strubell et al. 2017; Lee 2017). In this study, a bi-directional LSTM (Britz 2015) is used as the deep learning model, with 100-dimensional GloVe word embeddings at the input layer that are later fine-tuned. As deep learning models require large training datasets, the model is trained on a self-generated dataset of 50,000 citations. These citations belong to various domains including computer science, physics and philosophy, with a total of twenty-four (24) fine-grained fields that closely match those of the UMASS dataset. Table 31 shows the performance of the proposed model on the UMASS dataset without any fine-tuning in the first column, followed by results achieved when the deep learning model is trained on the UMASS dataset. The final column contains the baseline results of the UMASS study, which employs CRF only.

Table 31 Results using Bi-LSTM-CRF (An et al. 2017) framework against UMASS dataset
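
The sketch below outlines a Bi-LSTM-CRF tagger of the kind discussed above, assuming PyTorch with the third-party pytorch-crf package for the CRF layer; the vocabulary size, tag set, dimensions and training loop are illustrative placeholders rather than the configuration of An et al. (2017).

```python
# A compact, illustrative Bi-LSTM-CRF tagger for citation tokens.
# Assumes PyTorch and the pytorch-crf package; all sizes are toy values.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_tags)   # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)    # transition scores + Viterbi

    def loss(self, tokens, tags):
        emissions = self.proj(self.lstm(self.emb(tokens))[0])
        return -self.crf(emissions, tags)             # negative log-likelihood

    def predict(self, tokens):
        emissions = self.proj(self.lstm(self.emb(tokens))[0])
        return self.crf.decode(emissions)             # best tag sequence per citation

# Toy usage: one citation of five token ids, tagged with five tag ids.
model = BiLSTMCRF(vocab_size=1000, num_tags=6)
tokens = torch.randint(0, 1000, (1, 5))
tags = torch.randint(0, 6, (1, 5))
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
optim.zero_grad(); model.loss(tokens, tags).backward(); optim.step()
print(model.predict(tokens))
```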

In addition to these studies, there exist multiple tools dedicated to metadata extraction from research articles, citations or both (Beel et al. 2013; Councill et al. 2008; Lopez 2009; Zahedi and Haustein 2017). Comparisons of these tools are reported in the literature (Atdağ and Labatut 2013; Granitzer et al. 2012). Currently, CERMINE and GROBID are both actively developed and outperform the other tools. A list of tools along with their primary algorithms/mechanisms and available links is provided in Table 32. According to a recent comparison of various tools including GROBID, CERMINE, ParsCit, ScienceParse and PDFSSA4MET presented in Tkaczyk et al. (2018), GROBID gives the best results, followed by CERMINE and ParsCit.

Table 32 Tools for metadata extraction

Conclusion

In the light of the literature reviewed on metadata extraction from scientific articles, a comprehensive summary is presented in Table 33. The Reference field in the table header denotes the respective research study. Type indicates which kind of information is extracted, i.e. whether the study performs header metadata extraction or citation metadata extraction. Format refers to the input format required by the proposed methodology, e.g. PDF or plain text. Approach refers to the algorithm(s) applied to perform the extraction. Features/Improvement refers to the major distinctive contributions or features incorporated in the study to improve performance. Dataset names the dataset used for evaluation. The No. field gives the total number of metadata fields extracted in the respective study. Lastly, Metric lists the evaluation measure(s) used to report results, where A, P, R, F and E represent Accuracy, Precision, Recall, F1-score and Error rate respectively.

Table 33 Summary of Metadata Extraction articles

In the light of Tables 32 and 33, it is evident that most studies employ CRF to perform metadata extraction. Initially, linear-chain CRFs were mainly used, but recent trends show the application of higher order CRFs to add flexibility and further improve results. Many studies have adopted various improvements to existing implementations: in the case of CRF, higher order Markov chains are developed to model segments of variable length, while performance gains over basic HMMs are achieved by using higher order n-grams. Other improvements include smoothing techniques, improved error functions and optimization algorithms.

Besides algorithmic improvements, studies have also reported improved performance by employing various features of the input data. Among the major features listed in Table 33, word features refer to properties of a word itself, including its content, character length and casing; line features include the number of words in a line and the total line length in characters and words; spatial features refer to the location of a particular field in the text; formatting features include font styling and font size information; external features refer to the incorporation of external lexicons; neighbor features incorporate neighborhood information through contextual words or distance; and numeric features capture whether a word is a number or an alphanumeric sequence.
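
The sketch below illustrates how these feature families can be combined into a single feature dictionary per token; the token and line dictionaries, thresholds and lexicon are hypothetical stand-ins for the output of a PDF pre-processing step.

```python
# A sketch of combining word, numeric, line, spatial, formatting, external
# and neighbour features into one dictionary per token. Input structures
# are invented for illustration.
def build_features(token, line, name_lexicon):
    word = token["text"]
    words = line["words"]
    i = token["index"]
    return {
        # word features
        "lower": word.lower(),
        "length": len(word),
        "init_cap": word[:1].isupper(),
        # numeric features
        "is_digit": word.isdigit(),
        "is_alphanumeric": word.isalnum() and not word.isalpha() and not word.isdigit(),
        # line features
        "line_word_count": len(words),
        "line_char_length": sum(len(w) for w in words),
        # spatial features (coarse vertical position on the page)
        "page_region": "header" if line["y"] < 150 else "body",
        # formatting features
        "font_size": token["font_size"],
        "bold": token["bold"],
        # external lexicon feature
        "in_name_lexicon": word.lower() in name_lexicon,
        # neighbour features
        "prev_word": words[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
    }

line = {"y": 80, "words": ["Jane", "Roe"]}
token = {"text": "Jane", "index": 0, "font_size": 11, "bold": False}
print(build_features(token, line, {"jane", "john"}))
```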

Among the primary challenges in metadata extraction is the information loss that occurs while converting input from one format to another. Many PDF-to-text conversion libraries introduce errors during conversion, and these errors affect the performance of the extraction task, as described in Kern et al. (2012). Pre-processing techniques for transforming scientific literature from PDF to text are not covered in this study, although they are a crucial part of all studies dealing with the PDF format. On the other hand, studies employing OCR to identify blocks from the visual format tend to perform very well and usually exploit layout and font styling information to improve results.

In addition to the various tools and research studies, the semantic publishing challenge (ceurws/lod 2014) has been introduced, which deals with the extraction of various types of insights from scholarly data, including quality analysis, metadata extraction and interlinking of information. A recent study (Dimou et al. 2017) compares several semantic publishing challenges and analyzes current trends across them. It further consolidates insights gained from conducting these challenges and aims to improve the quality of organized challenges and workshops by applying lessons learnt from previous editions, including feedback incorporation, dataset updates and evolution of tasks.

As the scientific community has been contributing to this domain for many years, there now exists a variety of open-source platforms that assist in automatically extracting this information from scientific articles. These systems currently suffer from layout and formatting issues, primarily due to format conversion. A recent comparison study (Tkaczyk et al. 2018) shows that, among the various open-source extractors, GROBID, CERMINE and ParsCit present the best results.

Key-insights extraction

In the scope of the current paper, key-insights refer to any valuable information enclosed within a research paper's text that can be beneficial for researchers, i.e. the potential information nuggets contained in a scientific article. In the literature, a wide range of terminology refers to similar concepts: Augenstein et al. (2017) regard the task of key-insights extraction as information extraction, while QasemiZadeh and Schumann (2016) call a similar concept term recognition and classification. Other names include typed entity recognition, entity recognition, entity extraction, core scientific concepts and argumentative zoning (Liakata et al. 2010; Tateisi et al. 2016). Examples of key-insights include the underlying methodology or technique used, evaluation criteria, results, future work and limitations. These insights, if automatically retrieved, provide a researcher with a clear and concise picture of a research paper, which can be very fruitful for researchers who have to go through a large number of papers to understand what is going on in their research domain. Table 34 presents the key-insights extracted from the following passage, taken from Nasar and Jaffry (2018). Sentence-level insights are color-coded within the passage, where red represents Aim, green represents Goal and blue represents Extension.

Table 34 Phrase-level key-insights

Decisions and beliefs of human beings about surroundings and their environments are affected by their trust on other agents they are communicating with. Hence, in this study, primary aim is to extend computational model of SA presented in [2] to trust-based SA using ABM and PBM techniques. Keeping this in view, key goal of current research is to analyze the proposed model with both computational modeling paradigms i.e. ABM and PBM, along with a comparative analysis on the basis of their dynamics. Rest of the paper describes related background, outlines methodology opted to build the system that is an extension to a previous model proposed in [2], briefly explains the conducted experiments and respective results, followed by conclusion and future directions.

If such information is automatically extracted from scientific articles, it can aid a variety of applications including automated literature review, trend analysis and personalized research assistance. Thus, the rest of this section presents progress in this area: it first highlights the major datasets available and then the state-of-the-art approaches employed to perform key-insights extraction from scientific articles.

Datasets

Datasets for key-insights extraction can be classified into two major classes: sentence-level and phrase-level. Multiple datasets exist for sentence-level key-insights extraction, but the majority of the work belongs to the domain of medical sciences. In addition, two types of insights are annotated. The first concerns potential named entities, i.e. concepts such as domain, results, technique etc. The second concerns relations between entities: for example, when a technique or algorithm is applied to solve a particular task, a relation of application between a TECHNIQUE and a TASK can be established, namely Apply(TECHNIQUE, TASK); similarly, results achieved against various evaluation measures can be expressed as relations, e.g. Result(F-measure, 98). Very few studies, however, focus on relation extraction between entities in scientific articles. As relations are usually expressed between core concepts, phrase-level datasets can be extended with relation information, whereas sentence-level datasets cannot be used for this purpose because a sentence itself is composed of multiple entities.

Teufel and Moens (2002) applied the concept of argumentative zoning to summarize scientific articles. This annotation scheme was further extended in Teufel et al. (2009) with improved granularity: all existing tags except TEXTUAL are further divided into multiple categories, so the improved scheme contains a total of fifteen rhetorical classes for sentence-level key-insights extraction across full-length articles. The scheme is used to annotate articles from the chemistry and computational linguistics domains. Results show that this annotation scheme can also be used by non-experts; this was established by having an expert, a semi-expert and a non-expert annotate articles and then calculating the agreement between them.

Research on extracting sentence-level insights from the full text of articles was performed as part of the ART project (Liakata 2009). This project formed the basis of a semantic annotation project (Liakata 2010) focused on the semantic annotation of scientific articles, whose applications have been studied in the domains of life sciences and cancer research (Guo et al. 2011; Liakata et al. 2012). Another notable full-length, sentence-level key-insights dataset is the Dr. Inventor framework (Ronzano and Saggion 2015), which contains a total of forty articles from the computer graphics domain only. In addition to these full-length article sets, many studies on sentence-level key-insights extraction focus on abstracts.

A recent and diverse study in this regard is Multi-label Argumentative Zoning for English Abstracts (MAZEA) (Dayrell et al. 2012). The study used a total of 645 abstracts from Physical Sciences and Engineering (PE) and 690 abstracts from Life and Health Sciences (LH). Existing datasets for sentence-level key-insights tend to assign a sentence to a single category; the primary contribution of this study is that it allows multiple labels to be assigned to a single sentence. The respective dataset is publicly available. Widely used sentence-level annotation schemes applied to both abstract-only and full-length articles are presented in Table 35.

Table 35 Annotation tags against Sentence-level Datasets

As far as entity-level datasets are concerned, progress in this direction has been made recently. The pioneering study in this regard (Gupta and Manning 2011) comprises 475 abstracts from the ACL anthology. Another project, Term Entity Recognition (QasemiZadeh and Schumann 2016), is intended to perform task and entity recognition on the ACL anthology corpus; its dataset (Handschuh and QasemiZadeh 2014) consists of three hundred annotated abstracts from the ACL paper collection, with publication years ranging from 1965 to 2006.

The Entity and Relation Extraction project (Tateisi et al. 2016) focuses on phrase-level entity extraction from Japanese as well as English scientific articles. For English, it uses a total of 400 abstracts, of which 250 belong to the ACL anthology corpus and the remaining 150 to the ACM digital library; out of the 250 ACL abstracts, 100 are randomly selected from the Gupta-Manning dataset (Gupta and Manning 2011). Entities used in this project are inspired by the Information Artifact Ontology (IAO) (IAO 2015). The study further extends the dataset by annotating relation information, with a total of twenty distinct relations annotated in the underlying dataset. A base study on this dataset was carried out in Tateisi et al. (2014) with three primitive entities; it focused only on Japanese articles and dealt with sixteen distinct relation types.

The ScienceIE project was organized as part of Semantic Evaluation (SemEval) in 2017, where SemEval is an ongoing series of evaluations of computational semantic analysis systems, usually held on a yearly basis. The ScienceIE project (Augenstein et al. 2017) is a collaborative effort among various universities, focused on the annotation of scientific articles from three major domains: material sciences, physical sciences and computer sciences. The data consist of 500 passages selected from open-access scientific publications available in the ScienceDirect research repository. The annotated dataset includes three entity types, namely Task, Process and Material, as well as two primitive relations, "synonym-of" and "hyponym-of".

The "synonym-of" relation is used to deal with abbreviations. For example, in the sentence "This study is related to Information Extraction (IE)", if "Information Extraction" is assigned a class, a "synonym-of" relation should be expressed between "Information Extraction" and "IE"; this helps in linking different mentions of the same concept. The "hyponym-of" relation describes a hierarchy of concepts. For example, in the sentence "Apple is a fruit", apple is a hyponym of fruit; similarly, in the context of a scientific article, the sentence "NER is a sub-task of IE" makes NER a hyponym of IE.
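
The sketch below shows one simple way such annotations can be represented programmatically; the tuple layout and identifiers are hypothetical and merely mirror the entity and relation types named above.

```python
# A small, illustrative representation of ScienceIE-style entities and
# relations; labels and character offsets refer to the toy sentence below.
from collections import namedtuple

Entity = namedtuple("Entity", "id label start end text")
Relation = namedtuple("Relation", "label arg1 arg2")

sentence = "NER is a sub-task of IE"
entities = [
    Entity("T1", "Task", 0, 3, "NER"),
    Entity("T2", "Task", 21, 23, "IE"),
]
relations = [Relation("hyponym-of", "T1", "T2")]  # NER is a kind of IE

lookup = {e.id: e.text for e in entities}
for rel in relations:
    print(f"{lookup[rel.arg1]} --{rel.label}--> {lookup[rel.arg2]}")
```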

Another ongoing effort in the direction of phrase-level key-insights is the project of the Information Retrieval Group at Iowa State University (Projects | ISU Information Retrieval Group 2017), which concerns automatic extraction of information from scientific articles with a primary focus on animal studies. Some phrase-level datasets, along with the entities they cover and descriptions of these entities, are presented in Table 36.

Table 36 Annotation tags against Phrase-level Datasets

Most of these datasets have been developed recently, so there is as yet no substantial progress in applying algorithms to them. It is worth noting that, in the domain of biology, multiple resources and databases help in identifying genes, proteins, diseases etc., and a variety of datasets focus on the annotation of bio-centered entities such as gene–gene interactions and protein identification. Consequently, multiple studies focus on biology-oriented information extraction from scientific articles (Friedman et al. 2001; Hirschman et al. 2005; Li et al. 2015) exploiting this available information. The focus of the current review is on general phrase-level insights that are applicable and useful across domains, such as Problem, Domain, Process and Result; hence, studies on bio-specific information extraction are not included.

Approaches

Over the past years, many researchers have contributed to the domain of information extraction from research papers, using multiple machine learning and NLP techniques to extract key-insights from scientific literature. For sentence-level key-insights extraction, many studies use rule-based approaches, and many machine learning approaches are also applied, including Bayesian classifiers, CRFs and SVMs. Owing to the unavailability of benchmark datasets for phrase-level insights in past years, there has been little development in that regard; the majority of approaches for phrase-level insights extraction use rule-based methods and CRFs on self-generated datasets.

Rule-based approaches

A research study carried out in Hanyurwimfura et al. (2012) takes into account the abstract and conclusion text along with some assumptions regarding the position of sentences within these two sections. It relies mainly on a rule-based approach; examples of the heuristics used include cue words such as 'results', 'experiments' and 'evaluation' to indicate the results of a research article, and phrases such as 'this paper' and 'our approach' to indicate its main idea. The title of the study as well as its authors are also extracted using simple heuristics. The experiment was conducted on 200 papers in groups of 40, resulting in 89.4% precision and 91.2% recall. In addition, a survey was conducted on 20 papers, in which 20 readers manually evaluated the extracted information; it received an average rating of 7.75 on a scale of 0–10, with 10 being the highest.
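
The sketch below gives a minimal flavor of such cue-phrase heuristics in Python; the cue lists are illustrative and are not the exact rules of Hanyurwimfura et al. (2012).

```python
# A minimal rule-based sentence classifier using cue phrases.
import re

CUES = {
    "result": re.compile(r"\b(results?|experiments?|evaluation)\b", re.I),
    "main_idea": re.compile(r"\b(this paper|our approach|we propose)\b", re.I),
}

def classify_sentence(sentence):
    hits = [label for label, pattern in CUES.items() if pattern.search(sentence)]
    return hits or ["other"]

abstract = [
    "This paper presents a rule-based extractor for research articles.",
    "Experiments on 200 papers show high precision and recall.",
]
for s in abstract:
    print(classify_sentence(s), "->", s)
```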

Another study in this regard (Gupta and Manning 2011) extracts focus and technique along with domain from scientific articles. Pattern matching and dependency trees of sentences are used, along with seed rules, to identify focus, technique and domain, and more patterns are subsequently identified using a bootstrapping approach. After extraction of the focus, technique and domain concepts, LDA clustering (Blei et al. 2001) is performed to find topics. The ACL anthology dataset (Bird et al. 2008) is used for evaluation; four hundred and seventy-four abstracts were hand-labeled for testing, and the approach resulted in high recall but low precision.

The research study proposed in Houngb and Mercer (2012) primarily focuses on technique extraction from biology journals. Initially, phrases containing method-mention terms such as algorithm, technique and method are extracted, and rules are formulated to extract such sentences from the text and identify the respective techniques used. Machine learning is also employed, using word, POS, word-shape (capitalized, starts with a capital letter, all lower case, all upper case, mixed case), word-position (start of sentence, end of sentence, not at beginning, not at end), token prefixes, token suffixes and bigrams as features for a CRF. Results are evaluated on two self-generated datasets: the first (dataset 1) explicitly mentions the method and consists of 918 sentences, whereas the second (dataset 2) consists of 211 sentences and does not contain a method keyword. Each dataset contains pairs of sentences for every entry, where the first sentence carries the method while the other carries its potential usage. These sentences are tokenized and converted into the BIO tagging format for phrase-level method-mention extraction. The rule-based system, evaluated on dataset 1, achieves precision/recall/F-measure of 85.40/100/91.89, while the CRF-based machine learning system, evaluated on dataset 2, achieves 81.8/75.00/78.26.
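
The sketch below illustrates the BIO conversion step mentioned above: given a sentence and the span of a method mention, each token receives a B-METHOD, I-METHOD or O tag; whitespace tokenization is a simplification of the original pipeline.

```python
# Convert a sentence plus a known method mention into BIO-tagged tokens.
def to_bio(sentence, mention):
    tokens = sentence.split()
    mention_tokens = mention.split()
    tags = ["O"] * len(tokens)
    for i in range(len(tokens) - len(mention_tokens) + 1):
        if tokens[i:i + len(mention_tokens)] == mention_tokens:
            tags[i] = "B-METHOD"
            for j in range(1, len(mention_tokens)):
                tags[i + j] = "I-METHOD"
            break
    return list(zip(tokens, tags))

print(to_bio("We align the sequences using the Smith-Waterman algorithm",
             "Smith-Waterman algorithm"))
```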

Machine-learning based approaches

The following section compiles the major approaches that employ machine learning to perform key-insights extraction from scientific articles.

Naïve Bayes

The pioneering study on sentence-based classification is presented in Teufel and Moens (2002). It uses Naïve Bayes classification to classify sentences into categories such as aim, contrast, basis and background. To evaluate the system, a total of eighty conference articles from the computational linguistics domain are annotated. Two types of evaluation are performed: one deals with rhetorical classification using Naïve Bayes, while the other is a relevance-based evaluation that measures how relevant, according to human judges, the extracted results are. Tags, their descriptions and the respective evaluation measures are presented in Table 37.

Table 37 Evaluation measure using NB in Teufel and Moens (2002)

A sentence-level key-insights extraction study in the medical sciences is proposed in Ruch et al. (2007). It uses a Naïve Bayes classifier to classify abstract sentences into four categories: purpose, methods, results and conclusion. Results show an F-score of 85. The dataset used for evaluation comprises 12,000 abstracts from MEDLINE that carry implicit tags for these four categories.

To extract domains from research articles, the study presented in Lakhanpal et al. (2015) makes use of preposition disambiguation, relying on rules based on the prepositions in a sentence. Following these rules, phrases are identified and later classified using Naïve Bayes classification. Results show 90% precision and 91% recall when applied to ACM SIGKDD (1995) papers from 2010–2014.
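
The sketch below shows Naïve Bayes sentence classification of the kind used in these studies, assuming scikit-learn; the training sentences and labels are toy placeholders rather than any of the annotated corpora described above.

```python
# A compact Naive Bayes classifier over sentences, using bag-of-words and
# bigram counts as features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = [
    "The aim of this paper is to classify rhetorical zones.",
    "In contrast to previous work, we use full articles.",
    "Our approach builds on the annotation scheme of earlier studies.",
    "Prior research has focused mainly on abstracts.",
]
labels = ["aim", "contrast", "basis", "background"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(sentences, labels)
print(model.predict(["This paper aims to extract key insights."]))  # e.g. ['aim']
```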

Hidden Markov model

The study carried out in Lin et al. (2006) uses an HMM to assign rhetorical categories to sentences. The study focuses on medical abstracts, which generally follow the pattern of Introduction, Method, Result and Conclusion. Latent Discriminative Analysis (LDA) is also employed to further improve performance. Multiple experiments are performed, with the HMM combined with LDA performing best on abstracts selected from MEDLINE. The evaluation measures for the best approach are presented in Table 38.

Table 38 Evaluation measures against Lin et al. (2006)

Another study employing HMM (Wu et al. 2006) extracts move structures, which refer to categories of functional roles such as Background, Purpose, Method, Result and Conclusion. A total of 709 sentences belonging to 106 abstracts from CiteSeer are tagged. The study exploits move constructs and collocation information to improve the HMM model, and the approach achieves a best precision of 80.54.
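
The sketch below shows, in miniature, how an HMM assigns rhetorical states to a sequence of sentences via Viterbi decoding; the transition and emission values are invented for illustration and are not estimated from the corpora used in the studies above.

```python
# A toy Viterbi decoder over four rhetorical states; probabilities are
# invented placeholders rather than corpus-estimated parameters.
import numpy as np

states = ["INTRO", "METHOD", "RESULT", "CONCL"]
start = np.log([0.85, 0.05, 0.05, 0.05])
trans = np.log([[0.5, 0.4, 0.05, 0.05],     # INTRO ->
                [0.05, 0.5, 0.4, 0.05],     # METHOD ->
                [0.05, 0.05, 0.5, 0.4],     # RESULT ->
                [0.05, 0.05, 0.05, 0.85]])  # CONCL ->
# Emission log-probabilities of each observed sentence under each state,
# e.g. produced by a per-state language model over the sentence's words.
emis = np.log([[0.7, 0.1, 0.1, 0.1],
               [0.1, 0.7, 0.1, 0.1],
               [0.1, 0.1, 0.7, 0.1],
               [0.1, 0.1, 0.1, 0.7]])

def viterbi(start, trans, emis):
    n_obs, n_states = emis.shape
    score = start + emis[0]
    back = np.zeros((n_obs, n_states), dtype=int)
    for t in range(1, n_obs):
        cand = score[:, None] + trans + emis[t]   # cand[i, j]: prev i -> cur j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n_obs - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(start, trans, emis))  # e.g. ['INTRO', 'METHOD', 'RESULT', 'CONCL']
```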

Conditional random fields

The work presented in Hirohata et al. (2008) focuses on the extraction of section-related information from article abstracts. It uses CRFs to assign abstract sentences to major sections: Objective, Method, Result and Conclusion. To develop the model, a corpus of 51,000 abstracts is compiled, consisting of abstracts that carry exactly these four section labels. The proposed method achieved 95.5% per-sentence accuracy and 68.8% per-abstract accuracy.

A research study proposed in Kondo et al. (2009) analyzes research paper titles to identify the underlying technique and research field of the respective paper. To extract the desired fields, cue words are first identified using rule-based approaches and then searched for in paper titles; this helps in identifying the paper's goal, underlying methodology and major topic or research field. A CRF is then used, with the word's POS and indicators of a word being a method, goal or head word as features, to classify the identified words into their respective classes. Experiments were performed on Japanese and English literature, resulting in 82.5% precision and 81.6% recall for Japanese research papers and 73.5% precision and 78% recall for English literature.

The study presented in Lin et al. (2010) also uses CRF to extract metadata as well as key-insights from medical articles. The metadata, regarded in this study as formulaic author metadata, includes author name, email and institution. For key-insights, the study extracts entities in the full text that convey information about the nature of the study. For training and subsequent evaluation, a gold set is prepared by annotating 185 open-access PubMed articles. This article set covers studies published from 2008 to 2009 and strictly consists of research articles, excluding reviews, case studies, editorials and perspectives. Annotators were provided with a Rich Text Format (RTF) file, generated by processing the HTML version of each article, along with basic annotation guidelines. Results show that CRF is very effective in determining formulaic author metadata, with an average F-score of 89.9%, whereas key-insights extraction shows relatively poor performance with a 26.1% F-measure, as shown in Table 39.

Table 39 Evaluation measures against Lin et al. (2010)

Another study that performs both sentence-level and phrase-level KIE from articles is carried out in Kovačević et al. (2012). It makes use of various features to perform extraction. First, sentence-level extraction is performed using an annotation scheme and categories similar to those of Teufel and Moens (2002). After this primary classification, sentences of the OWN category are further sub-divided into results, solution and other categories, and the sentences of the solution category are then annotated to extract phrase-level concepts including method, task, tools and resources; this classification is performed by means of CRF. The evaluation metrics for these insights are presented in Table 40. The study experimented rigorously with various features, and the results show that all categories except resources perform best when all features, namely lexical, syntactic, citation and frequency features, are incorporated.

Table 40 Evaluation metrics against Kovačević et al. (2012)

Support vector machines

A relevant study focused on sentence extraction from scientific articles is presented in Guo et al. (2010). It performs a comparative analysis of three annotation schemes for sentence-level key-insights extraction: section names (Hirohata et al. 2008), argumentative zones (AZ) and core scientific concepts (CoreSC). The latter two schemes are associated with the ART project (Liakata 2009), a pioneering project for sentence-level key-insights extraction from full-text scientific articles in the medical sciences. The proposed approach uses Naïve Bayes and SVM classifiers to perform IE from abstracts only. Results show that SVM outperforms the Naïve Bayes classifier, as shown in Table 41.

Table 41 F-measures against various annotation schemes in Guo et al. (2010)

An SVM-based solution to extract sentence-level key-insights is presented in Ronzano and Saggion (2015). A linear kernel was used for training. As data, 40 computer graphics papers from the Dr. Inventor Rhetorically Annotated Corpus (Fisas et al. 2015), containing a total of 8877 sentences, were used; the annotation categories for sentences are almost the same as those of the ART project. All sentences of the corpus have been manually annotated by three annotators, with an inter-annotator agreement of 65.67%. The proposed SVM model takes into account both lexical and syntactic features to model each sentence, and the Java-based machine learning library Weka 2.0 is used to perform all tasks related to rhetorical sentence classification. The model achieved an F1-score of 76.4 under tenfold cross-validation.
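
The sketch below shows SVM-based rhetorical sentence classification with cross-validated evaluation, assuming scikit-learn; plain TF-IDF unigrams stand in for the richer lexical and syntactic features used by Ronzano and Saggion (2015), and threefold rather than tenfold cross-validation is used only because of the tiny toy corpus.

```python
# SVM sentence classification with cross-validated macro F1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

sentences = [
    "We propose a novel kernel for sentence classification.",
    "Our method extends support vector machines with tree kernels.",
    "The proposed approach uses a linear kernel over lexical features.",
    "Results show a clear improvement over the baseline.",
    "The experiments confirm higher F1 scores on the test corpus.",
    "Evaluation demonstrates the benefit of syntactic features.",
]
labels = ["approach", "approach", "approach", "outcome", "outcome", "outcome"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(model, sentences, labels, cv=3, scoring="f1_macro")
print(scores.mean())
```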

Others

There are several studies that either use a hybrid approach to perform key-insights extraction or make use of techniques that do not fall under the above sections. This sub-section highlights such studies.

The baseline results for the Typed Entity and Relation Extraction project (Tateisi et al. 2016) are calculated using the joint modeling approach presented in Miwa and Sasaki (2014). This approach uses a table to represent entities and relations jointly: cells are filled using a history-based approach in which every cell is assigned a label. To map the problem onto a sequence, the table is first transformed into a one-dimensional form using a static ordering, and preceding cell assignments are taken into account when labeling subsequent cells in order to avoid illegal assignments. A margin-based structured learning approach is used to learn the weights, and multiple training algorithms are employed, including Perceptron, AdaGrad and SVM (Chang and Yih 2013; Collins 2002; Duchi et al. 2011; Mejer and Crammer 2010); these weights guide the mapping of entities and relations into the table. The dataset contains a total of 400 articles from ACM and ACL, of which 100 belong to the Gupta-Manning dataset (Gupta and Manning 2011). Results of 10-fold cross-validation on 250 randomly selected articles excluding Gupta-Manning, as well as results on the Gupta-Manning dataset only, are reported in Table 42. The annotated dataset of Japanese and English scientific articles is publicly available.

Table 42 Results against Tateisi et al. (2016)
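
The sketch below illustrates the table representation described above, with entity labels on the diagonal and relation labels off the diagonal, flattened by a static ordering; the labels and ordering are illustrative, not the exact scheme of Miwa and Sasaki (2014).

```python
# A toy entity/relation table: diagonal cells hold entity labels,
# off-diagonal cells hold relation labels, and the upper triangle is
# flattened into one decoding sequence by a fixed (static) ordering.
tokens = ["CRF", "is", "applied", "to", "segmentation"]
n = len(tokens)

table = [["-" for _ in range(n)] for _ in range(n)]
table[0][0] = "TECHNIQUE"
table[4][4] = "TASK"
table[0][4] = "APPLY"          # Apply(TECHNIQUE, TASK)

# Static ordering: walk the upper triangle row by row so every cell has a
# fixed position in the one-dimensional assignment sequence.
sequence = [((i, j), table[i][j]) for i in range(n) for j in range(i, n)]
for (i, j), label in sequence:
    if label != "-":
        print(f"cell({i},{j}) -> {label}")
```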

The ScienceIE project was conducted as a SemEval task in 2017, and a competition was held on the developed dataset. The competition has three evaluation scenarios: the first focuses on information extraction when only the plain text of a scientific article is provided; the second provides key-phrases in addition to the plain text; and the third provides partial information regarding key-phrases along with their rhetorical class, i.e. Task, Process and Material. Various groups participated in the competition. Hybrid models of recurrent neural networks with CRF performed best in the first scenario, with a maximum F-measure of 43; a lexical-feature-based SVM model achieved the maximum F-measure of 64 in the second scenario; and a convolutional neural network based approach performed best in the third scenario, with an F-measure of 64. Details of the overall evaluation and sub-tasks involved, along with the dataset, are publicly available.

Conclusion

In the light of the literature reviewed on KIE from scientific articles, a comprehensive summary is presented in Table 43. The Reference entry in the table header denotes the respective research study. Level indicates which type of information is extracted, i.e. phrase-level (Phr) or sentence-level (Sen). Origin refers to the sections of an article used as data, including abstract (abs), conclusion (Con), keywords (KW), full article (FA) etc. Approach refers to the algorithm(s) applied to perform the extraction. Domain refers to the area of study selected for evaluation, e.g. computer science, physical sciences etc. Size refers to the total number of articles/abstracts included in a study; exceptions marked with an asterisk (*) give the number of sentences instead. Entities indicate the types of key-insights extracted in a study. Lastly, Metric lists the evaluation measure(s) used to report results, where A, P, R, F and SE represent Accuracy, Precision, Recall, F1-score and Subjective Evaluation respectively. In the Domain column, BL, CL, CS, HS, PS, MS, MeS, BM and CV represent biology, computational linguistics, computer science, health sciences, physical sciences, material sciences, medical sciences, biomedicine and computer vision respectively.

Table 43 Summary against key-insights extraction from scientific articles

In the light of Table 43, it is evident that the majority of work has been reported on abstracts only. The primary reason for the limited number of studies on full-text articles may be the complexity of the annotation task: as the number of entities grows, the time needed to annotate a full-text scientific article can grow dramatically. Even for abstracts, fine-grained annotation can take a lot of time, as reported in Augenstein et al. (2017), due to the subjectivity of the classes at hand; this time can be reduced by using crisp annotation guidelines. Although the "Datasets" section points to recent contributions regarding annotation guidelines and datasets for KIE, progress is yet to be made in performing phrase-level KIE on full-text scientific articles. Table 44 compiles all open-source datasets along with their details.

Table 44 Available datasets for KIE

Conclusion and future work

This study is focused on determining the state of the art regarding the potential information that can be extracted from scientific articles. Since a scientific article follows a semi-structured format, the information to be extracted is broadly classified, on the basis of this structure, into two major categories, namely metadata extraction (ME) and key-insights extraction (KIE). ME from scientific articles refers to the identification and extraction of metadata elements such as title, authors and affiliations. For ME, multiple datasets exist that vary in terms of article sources, publication venues, data size and granularity of fields. On these datasets, multiple approaches are applied, including rule-based approaches and machine learning approaches such as HMM, CRF and SVM. Among these, CRF tends to outperform the other approaches, with reported F-measures of more than 0.95. Deep learning approaches are not yet widely employed for ME, even though hybrid deep learning frameworks perform very well and currently govern the state of the art in general information extraction tasks; the application of deep learning frameworks and their hybrid versions is therefore an open area in the context of ME. Apart from the various techniques and datasets, a variety of open-source tools aid in automatic extraction of metadata entities from research articles' headers as well as bibliographies. One of the primary challenges in ME is to minimize the information loss that occurs while converting a scientific article from one format to another.

As far as KIE is concerned, insights to be extracted fall into two broad classes, namely sentence-level key-insights and phrase-level key-insights. Sentence-level KIE focuses on the classification of sentences into pre-defined categories based on the insights they carry. Widely used approaches for sentence-level KIE include rule-based approaches, Bayesian classification, SVM and CRFs. The majority of work on sentence-level KIE is based on medical studies. Although a few studies perform sentence-level KIE on full-length articles, most developments in this area are based on article abstracts only.

Phrase-level KIE, on the other hand, is focused on the extraction of phrases carrying potential information, such as Problem, Domain, Technique and Results. Most work on phrase-level KIE is reported on self-created datasets that are not publicly available, and the guidelines and inter-annotator agreements used while developing these datasets are also not reported. In addition, various other limitations were found in existing studies, including insufficient dissemination of achieved results, insufficient description of the methodology used to perform the task, ambiguity in the description of the corpus used for evaluation and a lack of cross-validation across techniques when reporting results (Houngb and Mercer 2012; Kavila and Rani 2016).

Regarding available datasets for phrase-level key-insights, researchers have been working for several years to create benchmark datasets. A wide variety of key-insights is annotated in the currently available datasets: some are specific to a domain such as computational linguistics, while others are generic and cover a variety of disciplines. One major limitation of existing phrase-level annotated datasets is that they consist of single passages only. Another challenge is the lack of crisp definitions for the various key-insights, which gives rise to subjective notions of the phrase-level key-insights identified across datasets; to minimize individual biases, the respective definitions should be crisp and clear. Hence, the primary open research task with regard to phrase-level key-insights datasets is the identification of the specific concepts or key-insights to be extracted from scientific articles. Once these are identified, the next question is to devise the criteria that determine whether a particular phrase is a key-insight; these criteria will in turn support the development of annotation guidelines, and once such guidelines exist, the next major contribution would be to prepare a dataset in their light.

Additionally, as the majority of phrase-level datasets have been developed only recently, a great deal of work is still required to efficiently extract the potential information insights and, subsequently, the relations between extracted conceptual insights. In scientific articles, a relation can express the application of a technique to solve a problem, results achieved against various evaluation measures, and so on. This information can serve multiple purposes, such as ontology construction and question answering systems; hence, dataset preparation and algorithm development for relation extraction (RE) are open research areas as well. Other open research questions include the analysis and application of state-of-the-art IE approaches on existing datasets, which would further reveal the advantages and pitfalls of existing techniques. CRF is generally regarded as the state-of-the-art statistical technique for ME, but after the identification of its limitations on one of the datasets, several research studies were carried out to address those limitations (Anzaroot et al. 2014; Vilnis et al. 2015). Similarly, for KIE, analyzing the primary reasons behind the results achieved by existing solutions, followed by ways to improve them and mitigate the identified challenges, remains an open research area.

Regarding the primary limitations of the current survey, it only covers articles focused on extracting generic insights from scientific articles; articles focused on domain-specific key-insights extraction are therefore not covered. Furthermore, pre-processing techniques applied to convert data from one format to another, as well as to generate textual, layout and formatting features, are not part of this study.