Abstract
Defining alternative processing techniques for business documents inevitably runs up against long-standing issues that stem from the unstructured nature of most business-related information. In particular, increasingly refined methods for automated data extraction have been investigated over the years. The latest frontier in this sense is Semantic Role Labeling (SRL), which extracts relevant information purely on the basis of the overall meaning of sentences. This is carried out by mapping specific situations described in the text onto more general scenarios (semantic frames). FrameNet originated as a semantic frame repository built by applying SRL techniques to large textual corpora, but adapting it to languages other than English has proven difficult. In this paper, we introduce a new implementation of SRL for information extraction, called Verb-Based SRL (VBSRL). VBSRL relies on a different conceptual theory from the field of natural language understanding, one that is language-independent and gives verbs a far more prominent role in abstracting from real-life situations.
Notes
- 1.
- 2.
- 3. See https://framenet.icsi.berkeley.edu/fndrupal/framenets_in_other_languages for a complete summary of all ongoing projects.
- 4. In the following, all references and examples written in Italian are reported in italics, with the corresponding English translation in regular type.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Scannapieco, S., Ponza, A., Tomazzoli, C. (2022). VBSRL: A Semantic Frame-Based Approach for Data Extraction from Unstructured Business Documents. In: Arai, K. (eds) Intelligent Computing. Lecture Notes in Networks and Systems, vol 283. Springer, Cham. https://doi.org/10.1007/978-3-030-80119-9_68
DOI: https://doi.org/10.1007/978-3-030-80119-9_68
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80118-2
Online ISBN: 978-3-030-80119-9
eBook Packages: Intelligent Technologies and Robotics (R0)