Creating an Annotated Corpus for Extracting Canonical Citations from Classics-Related Texts by Using Active Annotation

Romanello, Matteo

doi:10.1007/978-3-642-37247-6_6

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7816)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This paper describes the creation of an annotated corpus that supports the extraction of information, particularly canonical citations (that is, references to ancient sources), from Classics-related texts. The corpus is multilingual and contains approximately 30,000 tokens of POS-tagged, cleanly transcribed text drawn from L’Année Philologique. The named entities needed to capture such citations were annotated in the corpus using an annotation scheme devised specifically for this task.

The contribution of the paper is twofold. First, it describes how the corpus was created using Active Annotation, an approach that combines automatic and manual annotation to optimize the human effort required to create a corpus. Second, it evaluates the performance of an NER classifier based on Conditional Random Fields, using the created corpus as training and test set; the results obtained with three different feature sets are compared and discussed.
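The selection step behind an approach such as Active Annotation can be illustrated with a minimal sketch. It assumes that a classifier trained on the data annotated so far returns, for every token of an unlabeled sentence, a probability distribution over labels, and that the sentences the classifier is least certain about are handed to the human annotator. The abstract does not state which selection criterion was actually used, so the mean token entropy below is only one common choice (uncertainty sampling). The function names, the label names AUTHOR and WORK, and the probabilities are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# One token's label distribution: label -> probability (assumed to sum to 1).
TokenMarginals = Dict[str, float]


def token_entropy(marginals: TokenMarginals) -> float:
    """Shannon entropy (in bits) of one token's label distribution."""
    return -sum(p * math.log2(p) for p in marginals.values() if p > 0.0)


def sentence_uncertainty(sentence: Sequence[TokenMarginals]) -> float:
    """Mean token entropy of a sentence; 0.0 for an empty sentence."""
    if not sentence:
        return 0.0
    return sum(token_entropy(m) for m in sentence) / len(sentence)


def select_for_annotation(
    pool: Dict[str, Sequence[TokenMarginals]], k: int
) -> List[str]:
    """Return the ids of the k most uncertain sentences, most uncertain first."""
    ranked = sorted(
        pool, key=lambda sid: sentence_uncertainty(pool[sid]), reverse=True
    )
    return ranked[:k]


# Hypothetical marginals for three unlabeled two-token sentences.
pool = {
    "s1": [{"O": 0.99, "AUTHOR": 0.01}, {"O": 0.98, "WORK": 0.02}],
    "s2": [{"O": 0.50, "AUTHOR": 0.50}, {"O": 0.50, "WORK": 0.50}],
    "s3": [{"O": 0.90, "AUTHOR": 0.10}, {"O": 0.80, "WORK": 0.20}],
}

# The classifier is least sure about s2, then s3, so these go to the annotator.
print(select_for_annotation(pool, k=2))  # ['s2', 's3']
```

In a full loop, the sentences selected this way would be corrected manually, added to the training set, and the classifier retrained before the next selection round. That loop is omitted here because it depends on the concrete classifier.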

Author information

Authors and Affiliations

Department of Digital Humanities, King’s College London, 26-29 Drury Lane, London, WC2B 5RL, UK
Matteo Romanello

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico D.F., Mexico
Alexander Gelbukh

About this paper

Cite this paper

Romanello, M. (2013). Creating an Annotated Corpus for Extracting Canonical Citations from Classics-Related Texts by Using Active Annotation. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_6

DOI: https://doi.org/10.1007/978-3-642-37247-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer Science, Computer Science (R0)
