Natural language processing algorithms for domain-specific data extraction in material science: Reseractor

Gupta, Antrakrate; Mittal, Divyansh; Goel, Ojsi; Jha, Shikhar Krishn

doi:10.1007/s10853-024-09980-z

Computation & theory
Published: 25 July 2024

Volume 59, pages 13856–13872, (2024)
Journal of Materials Science

Abstract

Many tools and web engines can now find journal articles among billions of research papers on millions of topics across different databases, but their high degree of generalizability often comes at the cost of specificity. Scientific work needs a tool that extracts data from selected resources for domain-specific tasks. Current algorithms and generalized tools lack this specificity and are prone to errors when analysing an eclectically selected bundle of specific documents. This work addresses the need for such a tool, one that uses the user's input keywords and phrases to find relevant information in bundles of articles from the web. Reseractor is based on a customized algorithm, Whitespace, working in synergy with the output of open-access tools for document image analysis and focused domain data extraction using NLP. The tool is designed for the material science domain and can adopt various generalized and scientific corpora as layers. Tested on two different bundles of papers, it gives an accuracy of 81.12%, a recall of 78.38% and a precision of 84.06%. Because the algorithms are simple and directly applicable, users from other domains can supply their own corpora and remodel the tool for their purposes. This work meets the need for domain-specific experimental data extracted into organized, structured databases for future computational researchers.
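For readers unfamiliar with the three figures quoted above, the following is a minimal sketch of how accuracy, precision and recall are conventionally computed from confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration; they are not taken from the article, and the authors' exact evaluation procedure is not described in this preview.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision and recall from confusion-matrix counts.

    tp: relevant items correctly extracted (true positives)
    fp: irrelevant items wrongly extracted (false positives)
    fn: relevant items missed (false negatives)
    tn: irrelevant items correctly ignored (true negatives)
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one count must be non-zero")
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing was extracted
    # or when there was nothing relevant to find.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts, for illustration only (not the article's data).
example = classification_metrics(tp=8, fp=2, fn=2, tn=8)
```

Precision answers "how much of what was extracted is relevant", while recall answers "how much of the relevant material was extracted", which is why the two can differ as they do in the reported results.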

Data availability

Pseudocode for the Reseractor tool is available in the supplementary document (Online Resource 1, Section S3). The code is available in a GitHub repository [31], and the relevant data used in this work are available from the corresponding author upon reasonable request by email or via the website https://home.iitk.ac.in/~skjha/.

References

1. Choudhary K, Kelley ML (2023) ChemNLP: a natural language processing based library for materials chemistry text data. arXiv:2209.08203
2. OpenAI. (n.d.). ChatGPT: a model interacting in a conversational way, trained on more human feedback. Retrieved from https://openai.com/blog/chatgpt
3. PDF.ai: a model interacting in a conversational way, trained on more human feedback for the user uploaded pdf. Retrieved from https://pdf.ai/
4. Google LLC. (n.d.). Google Scholar. Retrieved from https://scholar.google.com/
5. Consensus. https://consensus.app/
6. National Center for Biotechnology Information. (n.d.). PubMed. Retrieved from https://pubmed.ncbi.nlm.nih.gov/
7. Clarivate Analytics. (n.d.). Web of Science. https://clarivate.com/products/web-of-science/
8. Crossref. https://www.crossref.org/
9. Elicit. https://elicit.com/
10. QuillBot. (n.d.). Free paraphrasing tool - Best Article Rewriter. https://quillbot.com/
11. Grammarly. (n.d.). Writing suggestions across all your favorite websites. https://www.grammarly.com/
12. Olivetti EA, Cole JM, Kim E et al (2020) Data-driven materials research enabled by natural language processing and information extraction. Appl Phys Rev 7:041317. https://doi.org/10.1063/5.0021106
13. Smith R (2007) An Overview of the Tesseract OCR Engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2. IEEE, Curitiba, Parana, Brazil, pp 629–633. https://doi.org/10.1109/ICDAR.2007.4376991
14. Google Vision API. https://cloud.google.com/vision/docs/apis
15. Shen Z, Zhang R, Dell M et al (2021) LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. arXiv:2103.15348
16. Gao X, Tan R, Li G (2020) Research on text mining of material science based on natural language processing. IOP Conf Ser Mater Sci Eng 768:072094. https://doi.org/10.1088/1757-899X/768/7/072094
17. Kay A (2007) Tesseract: an open-source optical character recognition engine. Linux J 2007(159):2
18. Semantic Scholar. https://www.semanticscholar.org/
19. Research Gate. https://www.researchgate.net/
20. Beltagy I, Lo K, Cohan A (2019) SciBERT: a pretrained language model for scientific text. arXiv:1903.10676
21. Raabe D Glossary of materials science
22. Gupta T, Zaki M, Krishnan NMA, Mausam (2022) MatSciBERT: a materials domain language model for text mining and information extraction. Npj Comput Mater 8:102. https://doi.org/10.1038/s41524-022-00784-w
23. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
24. Bilal M, Almazroi AA (2023) Effectiveness of fine-tuned BERT model in classification of helpful and unhelpful online customer reviews. Electron Commer Res 23:2737–2757. https://doi.org/10.1007/s10660-022-09560-w
25. Loshchilov I, Hutter F (2019) Decoupled weight decay regularization
26. Tshitoyan V, Dagdelen J, Weston L et al (2019) Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571:95–98. https://doi.org/10.1038/s41586-019-1335-8
27. Shah PK, Perez-Iratxeta C, Bork P, Andrade MA (2003) Information extraction from full text scientific articles: where are the keywords? BMC Bioinformatics 4:1–9. https://doi.org/10.1186/1471-2105-4-20
28. Dalianis H (2018) Evaluation metrics and evaluation. Clinical Text Mining. Springer International Publishing, Cham, pp 45–53. https://doi.org/10.1007/978-3-319-78503-5_6
29. Ramakrishnan C, Patnia A, Hovy E, Burns GA (2012) Layout-aware text extraction from full-text PDF of scientific articles. Source Code Biol Med 7:1–10. https://doi.org/10.1186/1751-0473-7-7
30. Chaurasia N, Jha SK, Sangal S (2023) A novel training methodology for phase segmentation of steel microstructures using a deep learning algorithm. Materialia 30:101803. https://doi.org/10.1016/j.mtla.2023.101803
31. Reseractor tool. https://github.com/ShikharJha/Reseractor

Acknowledgements

We thank the Ministry of Education, Government of India, for supporting this work under the Prime Minister Research Fellowship (PMRF) awarded to the author. We thank the Department of Materials Science and Engineering, IIT Kanpur, for their facilities and staff support. We would also like to thank the developers of Layout Parser and Tesseract for making their tools open access, which allowed us to use our algorithm in synergy with theirs and contribute to document image analysis methods and domain-specific natural language processing.

Author information

Authors and Affiliations

Department of Materials Science and Engineering, Indian Institute of Technology, Kanpur, India
Antrakrate Gupta, Divyansh Mittal, Ojsi Goel & Shikhar Krishn Jha

Contributions

Antrakrate Gupta contributed to conceptualization, methodology (equal), formal analysis (equal), writing, investigation (equal), visualization (equal), and funding acquisition (equal). Divyansh Mittal contributed to coding, methodology (supporting), formal analysis (equal), writing (supporting), visualization (equal), and software. Ojsi Goel contributed to coding and methodology for the Whitespace algorithm proposed above (equal). Shikhar Krishn Jha contributed to conceptualization, methodology (equal), writing (review and editing), project administration, funding acquisition (equal), and supervision.

Corresponding author

Correspondence to Shikhar Krishn Jha.

Ethics declarations

Conflict of interest

The authors have no conflicts to disclose.

Supplementary Information

The video “Reseractor_tool_video.mpg” is available online with the article. It shows the preliminary working of the tool developed from the algorithm described in the article.

Additional information

Handling Editor: Ghanshyam Pilania.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 6863 KB)

Supplementary file2 (MPG 14032 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gupta, A., Mittal, D., Goel, O. et al. Natural language processing algorithms for domain-specific data extraction in material science: Reseractor. J Mater Sci 59, 13856–13872 (2024). https://doi.org/10.1007/s10853-024-09980-z

Received: 16 March 2024
Accepted: 03 July 2024
Published: 25 July 2024
Issue Date: August 2024
DOI: https://doi.org/10.1007/s10853-024-09980-z
