Natural language processing algorithms for domain-specific data extraction in material science: Reseractor

  • Computation & theory
Journal of Materials Science

Abstract

Many tools and web engines are trained to find journal articles among billions of research papers on millions of topics across different databases. Their high degree of generalizability often comes at the cost of specificity. Scientific work needs a tool that extracts data from selected resources to perform domain-specific tasks. Current algorithms and generalized tools lack this specificity and are prone to errors when analysing a bundle of specific documents selected from varied sources. This work addresses that need with Reseractor, a tool that relies on keywords and phrases supplied by the user to find relevant information in bundles of articles from the web. Reseractor is based on a customized algorithm, Whitespace, which works in synergy with the output of open-access tools for document image analysis and performs focused domain data extraction using NLP. The current tool is designed for the material science domain and can adopt various generalized and scientific corpora as layers. Tested on two sets of different bundles of papers, it gives an accuracy of 81.12%, a recall of 78.38%, and a precision of 84.06%. Because the algorithms are simple and directly applicable, users from other domains can use their own corpora in the algorithms and remodel the tool for their purpose. The work fulfills the need for domain-specific experimental data extraction, with the extracted data stored in organized and structured databases for upcoming computational researchers.
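
The accuracy, recall, and precision quoted above follow the standard confusion-matrix definitions. The sketch below is a minimal Python illustration and not the authors' evaluation code. The `ConfusionCounts` class and the counts passed to it are hypothetical: they were chosen only because they round to the reported percentages, and the abstract does not state the actual counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of extraction outcomes (hypothetical container, not from the paper)."""
    tp: int  # relevant items the tool extracted
    fp: int  # irrelevant items the tool extracted
    fn: int  # relevant items the tool missed
    tn: int  # irrelevant items the tool correctly left out


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of raising when a metric is undefined (empty denominator).
    return numerator / denominator if denominator else 0.0


def precision(c: ConfusionCounts) -> float:
    # Fraction of extracted items that are actually relevant.
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    # Fraction of relevant items that were extracted.
    return _ratio(c.tp, c.tp + c.fn)


def accuracy(c: ConfusionCounts) -> float:
    # Fraction of all decisions (extract or skip) that were correct.
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)


# Illustrative counts only: an assumption, not data from the article.
example = ConfusionCounts(tp=58, fp=11, fn=16, tn=58)

print(f"precision = {100 * precision(example):.2f}%")
print(f"recall    = {100 * recall(example):.2f}%")
print(f"accuracy  = {100 * accuracy(example):.2f}%")
```

Precision penalizes irrelevant extractions and recall penalizes missed ones, so reporting both alongside accuracy shows whether a keyword-driven extractor errs toward over-extraction or under-extraction.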

Data availability

Pseudocode for the Reseractor tool is available in the supplementary document (Online resource 1: section S3). The code is available in a GitHub repository [31], and the relevant data used in this work are available from the corresponding author upon reasonable request by e-mail or through the website https://home.iitk.ac.in/~skjha/.

References

  1. Choudhary K, Kelley ML (2023) ChemNLP: a natural language processing based library for materials chemistry text data. arXiv:2209.08203

  2. OpenAI. (n.d.). ChatGPT — a model interacting in a conversational way, trained on more human feedback. Retrieved from https://openai.com/blog/chatgpt

  3. PDF.ai — a model interacting in a conversational way, trained on more human feedback for the user uploaded pdf. Retrieved from https://pdf.ai/

  4. Google LLC. (n.d.). Google Scholar. Retrieved from https://scholar.google.com/

  5. Consensus. https://consensus.app/

  6. National Center for Biotechnology Information. (n.d.). PubMed. Retrieved from https://pubmed.ncbi.nlm.nih.gov/

  7. Clarivate Analytics. (n.d.). Web of Science. https://clarivate.com/products/web-of-science/

  8. Crossref. https://www.crossref.org/

  9. Elicit. https://elicit.com/

  10. QuillBot. (n.d.). Free paraphrasing tool - Best Article Rewriter. https://quillbot.com/

  11. Grammarly. (n.d.). Writing suggestions across all your favorite websites. https://www.grammarly.com/

  12. Olivetti EA, Cole JM, Kim E et al (2020) Data-driven materials research enabled by natural language processing and information extraction. Appl Phys Rev 7:041317. https://doi.org/10.1063/5.0021106

  13. Smith R (2007) An Overview of the Tesseract OCR Engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2. IEEE, Curitiba, Parana, Brazil, pp 629–633. https://doi.org/10.1109/ICDAR.2007.4376991

  14. Google Vision API. https://cloud.google.com/vision/docs/apis

  15. Shen Z, Zhang R, Dell M, et al (2021) LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. arXiv:2103.15348

  16. Gao X, Tan R, Li G (2020) Research on text mining of material science based on natural language processing. IOP Conf Ser Mater Sci Eng 768:072094. https://doi.org/10.1088/1757-899X/768/7/072094

  17. Kay A (2007) Tesseract: an open-source optical character recognition engine. Linux J 2007(159):2

  18. Semantic Scholar. https://www.semanticscholar.org/

  19. Research Gate. https://www.researchgate.net/

  20. Beltagy I, Lo K, Cohan A (2019) SciBERT: a pretrained language model for scientific text. arXiv:1903.10676

  21. Raabe D Glossary of materials science

  22. Gupta T, Zaki M, Krishnan NMA, Mausam (2022) MatSciBERT: a materials domain language model for text mining and information extraction. Npj Comput Mater 8:102. https://doi.org/10.1038/s41524-022-00784-w

  23. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

  24. Bilal M, Almazroi AA (2023) Effectiveness of fine-tuned BERT model in classification of helpful and unhelpful online customer reviews. Electron Commer Res 23:2737–2757. https://doi.org/10.1007/s10660-022-09560-w

  25. Loshchilov I, Hutter F (2019) Decoupled weight decay regularization

  26. Tshitoyan V, Dagdelen J, Weston L et al (2019) Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571:95–98. https://doi.org/10.1038/s41586-019-1335-8

  27. Shah PK, Perez-Iratxeta C, Bork P, Andrade MA (2003) Information extraction from full text scientific articles: where are the keywords? BMC Bioinformatics 4:1–9. https://doi.org/10.1186/1471-2105-4-20

  28. Dalianis H (2018) Evaluation metrics and evaluation. Clinical Text Mining. Springer International Publishing, Cham, pp 45–53. https://doi.org/10.1007/978-3-319-78503-5_6

  29. Ramakrishnan C, Patnia A, Hovy E, Burns GA (2012) Layout-aware text extraction from full-text PDF of scientific articles. Source Code Biol Med 7:1–10. https://doi.org/10.1186/1751-0473-7-7

  30. Chaurasia N, Jha SK, Sangal S (2023) A novel training methodology for phase segmentation of steel microstructures using a deep learning algorithm. Materialia 30:101803. https://doi.org/10.1016/j.mtla.2023.101803

  31. Reseractor tool. https://github.com/ShikharJha/Reseractor

Acknowledgements

We thank the Ministry of Education, Government of India, for supporting this work under the Prime Minister Research Fellowship (PMRF) awarded to the author. We thank the Department of Materials Science and Engineering, IIT Kanpur, for its facilities and staff support. We also thank the developers of Layout Parser and Tesseract for providing their tools as open access, which allowed us to use our algorithm in synergy with theirs and contribute to document image analysis methods and domain-specific natural language processing.

Author information

Contributions

Antrakrate Gupta contributed to conceptualization, methodology (equal), formal analysis (equal), writing, investigation (equal), visualization (equal), and funding acquisition (equal). Divyansh Mittal contributed to coding, methodology (supporting), formal analysis (equal), writing (supporting), visualization (equal), and software. Ojsi Goel contributed to coding and methodology for the Whitespace algorithm proposed in the article (equal). Shikhar Krishn Jha (corresponding author) contributed to conceptualization, methodology (equal), writing (review and editing), project administration, funding acquisition (equal), and supervision.

Corresponding author

Correspondence to Shikhar Krishn Jha.

Ethics declarations

Conflict of interest

The authors have no conflicts to disclose.

Supplementary Information

The video “Reseractor_tool_video.mpg” is available online with the article. It shows the preliminary working of the developed tool, which is based on the algorithm described in the article.

Additional information

Handling Editor: Ghanshyam Pilania.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 6863 KB)

Supplementary file2 (MPG 14032 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gupta, A., Mittal, D., Goel, O. et al. Natural language processing algorithms for domain-specific data extraction in material science: Reseractor. J Mater Sci 59, 13856–13872 (2024). https://doi.org/10.1007/s10853-024-09980-z

  • DOI: https://doi.org/10.1007/s10853-024-09980-z
