Abstract
This paper describes the creation of an annotated corpus that supports the extraction of information from Classics-related texts, in particular canonical citations, that is, references to ancient sources. The corpus is multilingual and contains approximately 30,000 tokens of POS-tagged, cleanly transcribed text drawn from L’Année Philologique. The named entities needed to capture such citations were annotated using an annotation scheme devised specifically for this task.
The contribution of the paper is twofold. First, it describes how the corpus was created using Active Annotation, an approach that combines automatic and manual annotation to optimize the human resources required to create a corpus. Second, it evaluates the performance of an NER classifier based on Conditional Random Fields, using the created corpus as training and test set; the results obtained with three different feature sets are compared and discussed.
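The selection step at the heart of an Active Annotation loop can be illustrated with a minimal sketch. The abstract does not specify how candidates are chosen, so everything here is an assumption made for illustration: the least-confidence criterion, the function names, the label names `REF` and `O`, and the toy stand-in for a trained tagger. The idea shown is only that the model's least confident sentences are routed to a human annotator, while the rest can be labeled automatically.

```python
from typing import Callable, Dict, List, Sequence

Sentence = List[str]
# Hypothetical model interface: one {label: probability} mapping per token.
Marginals = List[Dict[str, float]]


def sentence_confidence(marginals: Marginals) -> float:
    """Return the lowest per-token top-label probability in a sentence."""
    if not marginals:
        return 1.0
    return min(max(dist.values()) for dist in marginals)


def select_for_annotation(
    pool: Sequence[Sentence],
    predict_marginals: Callable[[Sentence], Marginals],
    batch_size: int,
) -> List[int]:
    """Pick indices of the sentences the model is least confident about."""
    scored = [
        (sentence_confidence(predict_marginals(sentence)), index)
        for index, sentence in enumerate(pool)
    ]
    scored.sort()  # least confident first; ties fall back to pool order
    return [index for _, index in scored[:batch_size]]


def toy_marginals(sentence: Sentence) -> Marginals:
    """Stand-in for a trained tagger (assumption): unsure about tokens with digits."""
    marginals = []
    for token in sentence:
        if any(ch.isdigit() for ch in token):
            marginals.append({"REF": 0.55, "O": 0.45})
        else:
            marginals.append({"REF": 0.05, "O": 0.95})
    return marginals


# Illustrative unlabeled pool; the sentences are invented for this sketch.
pool = [
    ["Homer", "and", "the", "epic", "tradition"],
    ["cf.", "Hom.", "Il.", "1.1-10"],
    ["Thuc.", "2.34", "is", "discussed"],
]

# Sentences at these indices would go to a human annotator in this round.
picked = select_for_annotation(pool, toy_marginals, batch_size=2)
```

In a full loop, the manually corrected sentences would be added to the training set, the classifier retrained, and the selection repeated until the annotation budget is spent.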
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Romanello, M. (2013). Creating an Annotated Corpus for Extracting Canonical Citations from Classics-Related Texts by Using Active Annotation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_6
DOI: https://doi.org/10.1007/978-3-642-37247-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer Science (R0)