Abstract
The continued development of the Linked Data Web depends on advances in the underlying extraction mechanisms. This is of particular interest for the scientific publishing domain, where most data sets are currently created manually. In this article, we present a Machine Learning pipeline that enables the automatic extraction of heading metadata (e.g., title and authors) from scientific publications. The experimental evaluation shows that our solution handles any type of publication format very well and improves on the average extraction performance of the state of the art by around 4%, in addition to showing increased versatility. Finally, we propose a flexible Linked Data-driven mechanism for both refining and linking the automatically extracted metadata.
Cite this article
Groza, T., Grimnes, G.A., Handschuh, S. et al. From raw publications to Linked Data. Knowl Inf Syst 34, 1–21 (2013). https://doi.org/10.1007/s10115-011-0473-6
DOI: https://doi.org/10.1007/s10115-011-0473-6