Abstract
Many tools and web search engines are now trained to find journal articles among billions of research papers on millions of topics across different databases. Their high degree of generalizability often comes at the cost of specificity. Scientific pursuits need a tool that extracts data from selected resources for domain-specific tasks. Current algorithms and generalized tools lack this specificity and are prone to errors when analysing an eclectically selected bundle of specific documents. The current work addresses this need with Reseractor, a tool that uses keywords and phrases supplied by the user to find relevant information in bundles of articles from the web. Reseractor is based on a customized algorithm, Whitespace, working in synergy with the output of open-access tools for document image analysis and with focused domain data extraction using NLP. The tool is designed for the material science domain and can adopt various generalized and scientific corpora as layers. Tested on two different bundles of papers, it gives an accuracy of 81.12%, a recall of 78.38% and a precision of 84.06%. Because the algorithms are simple and directly applicable, users from other domains can supply their own corpora and remodel the tool for their purpose. The current work fulfills the need for domain-specific extraction of experimental data into organized and structured databases for upcoming computational researchers.
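The accuracy, recall and precision quoted in the abstract follow the usual confusion-matrix definitions. The sketch below is only an illustration of how such figures are computed: the `ConfusionCounts` class, the function names and the counts are hypothetical and are not taken from the Reseractor code or its evaluation data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical tally of extraction outcomes against a manual reference."""
    tp: int  # relevant items the tool extracted
    fp: int  # irrelevant items the tool extracted
    fn: int  # relevant items the tool missed
    tn: int  # irrelevant items the tool correctly left out


def precision(c: ConfusionCounts) -> float:
    """Fraction of extracted items that are relevant: TP / (TP + FP)."""
    extracted = c.tp + c.fp
    return c.tp / extracted if extracted else 0.0


def recall(c: ConfusionCounts) -> float:
    """Fraction of relevant items that were extracted: TP / (TP + FN)."""
    relevant = c.tp + c.fn
    return c.tp / relevant if relevant else 0.0


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all decisions that were correct: (TP + TN) / total."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Illustrative counts only; these are NOT the paper's data.
counts = ConfusionCounts(tp=6, fp=2, fn=4, tn=8)

print(f"precision = {precision(counts):.2%}")
print(f"recall    = {recall(counts):.2%}")
print(f"accuracy  = {accuracy(counts):.2%}")
```

A precision higher than recall, as reported in the abstract, means that what the tool extracts is mostly relevant, while a somewhat larger share of relevant items is missed.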
Data availability
Pseudocode for the tool, Reseractor, is available in the supplementary document (Online Resource 1: Section S3). The code is uploaded to a GitHub repository [31], and the relevant data used in this work are available from the corresponding author upon reasonable request by mail or through the website https://home.iitk.ac.in/~skjha/.
References
Choudhary K, Kelley ML (2023) ChemNLP: a natural language processing based library for materials chemistry text data. arXiv:2209.08203
OpenAI. (n.d.). ChatGPT — a model interacting in a conversational way, trained on more human feedback. Retrieved from https://openai.com/blog/chatgpt
PDF.ai — a model interacting in a conversational way, trained on more human feedback for the user uploaded pdf. Retrieved from https://pdf.ai/
Google LLC. (n.d.). Google Scholar. Retrieved from https://scholar.google.com/
Consensus. https://consensus.app/
National Center for Biotechnology Information. (n.d.). PubMed. Retrieved from https://pubmed.ncbi.nlm.nih.gov/
Clarivate Analytics. (n.d.). Web of Science. https://clarivate.com/products/web-of-science/
Crossref. https://www.crossref.org/
Elicit. https://elicit.com/
QuillBot. (n.d.). Free paraphrasing tool - Best Article Rewriter. https://quillbot.com/
Grammarly. (n.d.). Writing suggestions across all your favorite websites. https://www.grammarly.com/
Olivetti EA, Cole JM, Kim E et al (2020) Data-driven materials research enabled by natural language processing and information extraction. Appl Phys Rev 7:041317. https://doi.org/10.1063/5.0021106
Smith R (2007) An Overview of the Tesseract OCR Engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2. IEEE, Curitiba, Parana, Brazil, pp 629–633. https://doi.org/10.1109/ICDAR.2007.4376991
Google Vision API. https://cloud.google.com/vision/docs/apis
Shen Z, Zhang R, Dell M, et al (2021) LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. arXiv:2103.15348
Gao X, Tan R, Li G (2020) Research on text mining of material science based on natural language processing. IOP Conf Ser Mater Sci Eng 768:072094. https://doi.org/10.1088/1757-899X/768/7/072094
Kay A (2007) Tesseract: an open-source optical character recognition engine. Linux J 2007(159):2
Semantic Scholar. https://www.semanticscholar.org/
Research Gate. https://www.researchgate.net/
Beltagy I, Lo K, Cohan A (2019) SciBERT: a pretrained language model for scientific text. arXiv:1903.10676
Raabe D Glossary of materials science
Gupta T, Zaki M, Krishnan NMA, Mausam (2022) MatSciBERT: a materials domain language model for text mining and information extraction. Npj Comput Mater 8:102. https://doi.org/10.1038/s41524-022-00784-w
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Bilal M, Almazroi AA (2023) Effectiveness of fine-tuned BERT model in classification of helpful and unhelpful online customer reviews. Electron Commer Res 23:2737–2757. https://doi.org/10.1007/s10660-022-09560-w
Loshchilov I, Hutter F (2019) Decoupled weight decay regularization
Tshitoyan V, Dagdelen J, Weston L et al (2019) Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571:95–98. https://doi.org/10.1038/s41586-019-1335-8
Shah PK, Perez-Iratxeta C, Bork P, Andrade MA (2003) Information extraction from full text scientific articles: where are the keywords? BMC Bioinformatics 4:1–9. https://doi.org/10.1186/1471-2105-4-20
Dalianis H (2018) Evaluation metrics and evaluation. Clinical Text Mining. Springer International Publishing, Cham, pp 45–53. https://doi.org/10.1007/978-3-319-78503-5_6
Ramakrishnan C, Patnia A, Hovy E, Burns GA (2012) Layout-aware text extraction from full-text PDF of scientific articles. Source Code Biol Med 7:1–10. https://doi.org/10.1186/1751-0473-7-7
Chaurasia N, Jha SK, Sangal S (2023) A novel training methodology for phase segmentation of steel microstructures using a deep learning algorithm. Materialia 30:101803. https://doi.org/10.1016/j.mtla.2023.101803
Reseractor tool. https://github.com/ShikharJha/Reseractor
Acknowledgements
We thank the Ministry of Education, Government of India, for supporting this work under the Prime Minister Research Fellowship (PMRF) endowed to the author. We thank the Department of Materials Science and Engineering, IIT Kanpur, for their facilities and staff support. We also thank the developers of Layout Parser and Tesseract for providing their tools as open access, which allowed us to use our algorithm in synergy with theirs and to contribute to document image analysis methods and domain-specific natural language processing.
Author information
Contributions
Antrakrate Gupta contributed to conceptualization, methodology (equal), formal analysis (equal), writing, investigation (equal), visualization (equal), and funding acquisition (equal). Divyansh Mittal contributed to coding, methodology (supporting), formal analysis (equal), writing (supporting), visualization (equal), and software. Ojsi Goel contributed to coding and methodology for the Whitespace algorithm proposed above (equal). Shikhar Krishn Jha* contributed to conceptualization, methodology (equal), writing (review and editing), project administration, funding acquisition (equal), and supervision.
Ethics declarations
Conflict of interest
The authors have no conflicts to disclose.
Supplementary Information
The video “Reseractor_tool_video.mpg” is available online with the article. It shows the preliminary working of the developed tool, which is based on the algorithm described in the article.
Additional information
Handling Editor: Ghanshyam Pilania.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Below is the link to the electronic supplementary material.
Supplementary file 2 (MPG 14032 KB)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gupta, A., Mittal, D., Goel, O. et al. Natural language processing algorithms for domain-specific data extraction in material science: Reseractor. J Mater Sci 59, 13856–13872 (2024). https://doi.org/10.1007/s10853-024-09980-z
DOI: https://doi.org/10.1007/s10853-024-09980-z