Abstract
This article surveys linguistic annotation in corpora and corpus linguistics. We first define the concept of ‘corpus’ as a radial category and then, in Sect. 2, discuss a variety of kinds of information for which corpora are annotated and that are exploited in contemporary corpus linguistics. Section 3 then exemplifies many current formats of annotation with an eye to highlighting both the diversity of formats currently available and the emergence of XML annotation as, for now, the most widespread form of annotation. Section 4 summarizes and concludes with desiderata for future developments.
Notes
- 1.
A reviewer points out that most corpora are in English and are thus by default Unicode-compliant, since English orthographic characters use the ASCII subset of Unicode.
- 2.
A reviewer points out that entry of IPA characters is still difficult on some computers, although software like IPA Palette (http://www.blugs.com/IPA/) makes this task easier than it has been.
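The encoding point in Note 1 can be checked directly. The following is a minimal sketch, assuming Python 3 and UTF-8 as the Unicode encoding in question; the function names and sample strings are hypothetical and chosen only for illustration. It shows that text restricted to ASCII characters yields identical bytes under both encodings, whereas IPA characters such as those mentioned in Note 2 fall outside ASCII and need more than one byte each.

```python
from typing import List, Tuple


def is_ascii_compatible(text: str) -> bool:
    """Return True if `text` encodes to the same bytes in ASCII and UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # At least one character lies outside the ASCII range.
        return False


def utf8_lengths(text: str) -> List[Tuple[str, int]]:
    """Pair each character with the number of bytes UTF-8 needs for it."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# Hypothetical samples: plain English orthography vs. an IPA transcription.
english_sample = "the corpus"
ipa_sample = "ðə"

print(is_ascii_compatible(english_sample))
print(is_ascii_compatible(ipa_sample))
print(utf8_lengths(ipa_sample))
```

A corpus file for which such a check fails is still valid Unicode; it simply cannot be read safely by tools that assume a single-byte ASCII encoding.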
References
Aijmer, K.: Parallel and comparable corpora. In: Lüdeling, A., Kytö, M. (eds.) Corpus Linguistics: An International Handbook, pp. 275–292. Walter de Gruyter, Berlin (2008)
Aldezabal, I., Aranzabe, M.J., Arriola, J.M., Díaz de Ilarraza, A.: Syntactic annotation in the Reference Corpus for the Processing of Basque (EPEC): theoretical and practical issues. Corpus Linguist. Linguist. Theory 5(2), 241–269 (2009)
Anthony, L.: AntConc: a freeware concordance program for Windows, Macintosh OS X, and Linux. http://www.antlab.sci.waseda.ac.jp/antconc_index.html (2014)
Archer, D., Wilson, A., Rayson, P.: Introduction to the USAS Category System. Lancaster University, Lancaster. http://ucrel.lancs.ac.uk/usas/usas%20guide.pdf (2002)
Archer, D., Culpeper, J., Davies, M.: Pragmatic annotation. In: Lüdeling, A., Kytö, M. (eds.) Corpus Linguistics: An International Handbook, pp. 613–642. Walter de Gruyter, Berlin (2008)
Bard, E.G., Sotillo, C., Anderson, A.H., Thompson, H.S., Taylor, M.M.: The DCIEM map task corpus: spontaneous dialogue under sleep deprivation and drug treatment. Speech Commun. 20(1/2), 71–84 (1996)
Beal, J.C., Corrigan, K.P., Moisl, H.L. (eds.): Creating and Digitizing Language Corpora. Vol. 1: Synchronic databases. Palgrave Macmillan, Houndmills (2007a)
Beal, J.C., Corrigan, K.P., Moisl, H.L. (eds.): Creating and Digitizing Language Corpora. Vol. 2: Diachronic databases. Palgrave Macmillan, Houndmills (2007b)
Berez, A.L., Gries, S.T.: Correlates to middle marking in Dena’ina iterative verbs. Int. J. Am. Linguist. 76(1), 145–165 (2010)
Bird, S., Liberman, M.: A formal framework for linguistic annotation. Speech Commun. 33(1–2), 23–60 (2001)
Brants, S., Dipper, S., Eisenberg, P., Hansen, S., König, E., Lezius, W., Rohrer, C., Smith, G., Uszkoreit, H.: TIGER: linguistic interpretation of a German Corpus. J. Lang. Comput. 2004(2), 597–620 (2004)
Carletta, J., McKelvie, D., Isard, A., Mengel, A., Klein, M., Møller, M.B.: A generic approach to software support for linguistic annotation using XML. In: Sampson, G., McCarthy, D. (eds.) Corpus Linguistics: Readings in a Widening Discipline, pp. 449–459. Continuum, London (2004)
Cox, C.: Corpus linguistics and language documentation: challenges for collaboration. In: Newman, J., Baayen, R.H., Rice, S. (eds.) Corpus-based Studies in Language Use, Language Learning, and Language Documentation, pp. 239–264. Rodopi, Amsterdam (2011)
Czaykowska-Higgins, E.: Research models, community engagement, and linguistic fieldwork: reflections on working with Canadian Indigenous communities. Lang. Doc. Conserv. 3(1), 15–50 (2009)
Dagneaux, E., Denness, S., Granger, S.: Computer-aided error analysis. System 26, 163–174 (1998)
DGS-Korpus Sign Language Corpora Survey. http://www.sign-lang.uni-hamburg.de/dgs-korpus/index.php/sl-corpora.html. Accessed 20 Sept 2013
Díaz-Negrillo, A.: A fine-grained error tagger for English learner corpora. Unpublished Ph.D. thesis, University of Jaén (2007)
Du Bois, J.W., Cumming, S., Schuetze-Coburn, S., Paolino, D. (eds.): Discourse Transcription. University of California, Santa Barbara (1992). (Santa Barbara Papers in Linguistics, vol. 4)
Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
Fellbaum, C., Grabowski, J., Landes, S., Baumann, A.: Matching words to senses in WordNet: Naïve versus expert differentiation. In: Fellbaum, C. (ed.) WordNet: An Electronic Lexical Database, pp. 217–239. MIT Press, Cambridge (1998)
Fillmore, C.J.: Frame semantics and the nature of language. Ann. New York Acad. Sci. Conf. Origin Dev. Lang. Speech 280, 20–32 (1976)
Fitschen, A., Gupta, P.: Lemmatising and morphological tagging. In: Lüdeling, A., Kytö, M. (eds.) Corpus Linguistics: An International Handbook, vol. 1, pp. 552–564. Walter de Gruyter, Berlin (2008)
Gahl, S.: The “Up” Corpus: A corpus of speech samples across adulthood. Corpus Linguistics and Linguistic Theory
Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L., Zue, V.: TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, Philadelphia (1993)
Garside, R., Fligelstone, S., Botley, S.: Discourse annotation: anaphoric relations in corpora. In: Garside, R., Leech, G., McEnery, T. (eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora, pp. 66–84. Longman, London (1997)
Garside, R., Leech, G., McEnery, T. (eds.): Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London (1997)
Gilquin, G., Gries, S.T.: Corpora and experimental methods: a state-of-the-art review. Corpus Linguist. Linguist. Theory 5(1), 1–26 (2009)
Godfrey, J.J., Holliman, E.: Switchboard-1 Release 2. Linguistic Data Consortium, Philadelphia (1997)
Granger, S., Dagneaux, E., Meunier, F. (eds.): The International Corpus of Learner English. Handbook and CD-ROM. Presses Universitaires de Louvain, Louvain-la-Neuve (2002)
Gries, S.T.: Corpus-based methods and cognitive semantics: the many meanings of to run. In: Gries, S.T., Stefanowitsch, A. (eds.) Corpora in Cognitive Linguistics: Corpus-based Approaches to Syntax and Lexis, pp. 57–99. Mouton de Gruyter, Berlin (2006)
Gries, S.T.: Data in construction grammar. In: Trousdale, G., Hoffmann, T. (eds.) The Oxford Handbook of Construction Grammar, pp. 93–108. Oxford University Press, Oxford (2013)
Hanke, T.: HamNoSys - representing sign language data in language resources and language processing contexts. In: Streiter, O., Chiara, C. (eds.): Proceedings of the Workshop Representation and Processing of Sign Languages, LREC 2004, pp. 1–6. ELRA, Paris (2004)
Hirschman, L., Chinchor, N.A.: MUC-7 Coreference Task Definition. http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/co_task.html (1997). Version 3.0
Hunston, S.: Corpora in Applied Linguistics. Cambridge University Press, Cambridge (2002)
Ide, N.: Corpus encoding standard: SGML guidelines for encoding linguistic corpora. Proceedings of LREC 1998, 463–470 (1998)
Iruskieta, M., Diaz de Ilarraza, A., Lersundi, M.: Establishing criteria for RST-based discourse segmentation and annotation for texts in Basque. Corpus Linguistics and Linguistic Theory
Jefferson, G.: Sequential aspects of storytelling in conversation. In: Schenkein, J. (ed.) Studies in the Organization of Conversational Interaction, pp. 219–248. Academic Press, New York (1978)
Jefferson, G.: Issues in the transcription of naturally-occurring talk: caricature versus capturing pronunciation particulars. Tilburg Papers in Language and Literature 34 (1983a)
Jefferson, G.: An Exercise in the Transcription and Analysis of Laughter. Tilburg Papers in Language and Literature 34. Tilburg University, Tilburg (1983b)
Jefferson, G.: An exercise in the transcription and analysis of laughter. In: van Dijk, T.A. (ed.) Handbook of Discourse Analysis, vol. III, pp. 25–34. Academic Press, New York (1985)
Jefferson, G.: A case of transcriptional stereotyping. J. Pragmat. 26(2), 159–170 (1996)
Johnston, T.: Auslan Corpus Annotation Guidelines. Macquarie University, Sydney (2013)
Jorgensen, J.: The psychological reality of word senses. J. Psycholinguist. Res. 19(3), 167–190 (1990)
Jun, S.-A. (ed.): Prosodic Typology: The Phonology of Intonation and Phrasing. Oxford University Press, Oxford (2005)
Kendon, A.: Gesture: Visible Action as Utterance. Cambridge University Press, Cambridge (2004)
Kilgarriff, A.: I don’t believe in word senses. Comput. Humanit. 31(2), 91–113 (1997)
Kipp, M., Neff, M., Albrecht, I.: An annotation scheme for conversational gesture: how to economically capture timing and form. Lang. Resour. Eval. 41(3/4), 325–339 (2007)
Koehn, P.: Europarl: a Parallel Corpus for Statistical Machine Translation. University of Edinburgh, MT Summit (2005)
Lücking, A., Bergmann, K., Hahn, F., Kopp, S., Rieser, H.: The Bielefeld Speech and Gesture Alignment Corpus (SaGA). In: Proceedings of the LREC 2010 Workshop: Multimodal Corpora-Advances in Capturing, Coding and Analyzing Multimodality, pp. 92–98 (2010)
Leech, G.: Adding linguistic annotation. In: Wynne, M. (ed.) Developing Linguistic Corpora: A Guide to Good Practice, pp. 17–29. Oxbow, Oxford (2005)
Leech, G., McEnery, T., Wynne, M.: Further levels of annotation. In: Garside, R., Leech, G., McEnery, T. (eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora, pp. 85–101. Longman, London (1997)
Lu, H.-C.: An annotated Taiwanese learners’ Corpus of Spanish. CATE. Corpus Linguist. Linguist. Theory 6(2), 297–300 (2010)
Lüdeling, A., Kytö, M. (eds.): Corpus Linguistics: An International Handbook, vol. 1. Walter de Gruyter, Berlin (2008)
MacWhinney, B.: The expanding horizons of corpus analysis. In: Newman, J., Baayen, R.H., Rice, S. (eds.) Corpus-based Studies in Language Use, Language Learning, and Language Documentation, pp. 178–212. Rodopi, Amsterdam (2011)
Marcus, M.P., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)
Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands. EUDICO Linguistic Annotator (ELAN). http://tla.mpi.nl/tools/tla-tools/elan/ (2014)
McEnery, T., Ostler, N.: A new agenda for corpus linguistics - working with all of the world’s languages. Lit. Linguist. Comput. 15(4), 403–419 (2000)
McEnery, T., Xiao, R., Tono, Y.: Corpus-based Language Studies: An Advanced Resource Book. Routledge, London (2006)
Mitkov, R.: Corpora for anaphora and coreference resolution. In: Lüdeling, A., Kytö, M. (eds.) Corpus Linguistics: An International Handbook, vol. 1, pp. 579–598. Walter de Gruyter, Berlin (2008)
Müller, C.: Redebegleitende Gesten: Kulturgeschichte – Theorie – Sprachvergleich, vol. 1 of Körper – Kultur – Kommunikation. Berlin, Berlin (1998)
Nelson, G., Wallis, S., Aarts, B.: Exploring Natural Language: Working with the British Component of the International Corpus of English. John Benjamins, Amsterdam (2002)
Oostdijk, N., Boves, L.: Preprocessing speech corpora. In: Lüdeling, A., Kytö, M. (eds.) Corpus Linguistics: An International Handbook, vol. 1, pp. 642–663. Walter de Gruyter, Berlin (2008)
Ostler, N.: Corpora of less studied languages. In: Lüdeling, A., Kytö, M. (eds.) Corpus Linguistics: An International Handbook, vol. 1, pp. 457–483. Walter de Gruyter, Berlin (2008)
Palmer, M., Gildea, D., Kingsbury, P.: The proposition bank: an annotated corpus of semantic roles. Comput. Linguist. 31(1), 71–105 (2005)
Pellard, T.: Ōgami (Miyako Ryukyuan). In: Shimoji, M., Pellard, T. (eds.) An Introduction to Ryukyuan Languages, pp. 113–166. Research Institute for Languages and Cultures of Asia and Africa, Tokyo (2010)
Pierrehumbert, J.: The Phonology and Phonetics of English Intonation. Unpublished Ph.D. Dissertation, MIT (1980)
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., Webber, B.: The Penn Discourse Treebank 2.0. In: Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008) (2008)
Pustejovsky, J., et al.: The TimeBank corpus. Proc. Corpus Linguist. 2003, 647–656 (2003)
Rayson, P., Stevenson, M.: Sense and semantic tagging. In: Lüdeling, A., Kytö, M. (eds.) Corpus Linguistics: An International Handbook, pp. 564–579. Walter de Gruyter, Berlin (2008)
Rice, K.: Ethical issues in linguistic fieldwork. In: Thieberger, N. (ed.) Oxford Handbook of Linguistic Fieldwork, pp. 407–429. Oxford University Press, Oxford (2012)
van Rooy, B., Schäfer, L.: The effect of learner errors on POS tag errors during automatic POS tagging. S. Afr. Linguist. Appl. Lang. Studies 20(4), 325–335 (2002)
Roy, D.: New horizons in the study of child language acquisition. In: Proceedings of Interspeech, Brighton, England (2009)
Rühlemann, C., O’Donnell, M.B.: Introducing a corpus of conversational stories: construction and annotation of the Narrative Corpus and interim results. Corpus Linguistics and Linguistic Theory
Sacks, H., Schegloff, E.A., Jefferson, G.: A simplest systematics for the organization of turn-taking for conversation. Language 50(4), 696–735 (1974)
Santorini, B.: Part-of-Speech Tagging Guidelines for the Penn Treebank Project. 3rd revision, 2nd printing. ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz (1990)
Schegloff, E.A.: Sequence Organization in Interaction. Cambridge University Press, Cambridge (2007)
Schmid, H.: Tokenizing and part-of-speech tagging. In: Lüdeling, A., Kytö, M. (eds.) Corpus Linguistics: An International Handbook, vol. 1, pp. 527–551. Walter de Gruyter, Berlin (2008)
Sloetjes, H., Wittenburg, P.: Annotation by category - ELAN and ISO DCR. In: Proceedings of the LREC (2008)
Streeck, J.: Depicting by gesture. Gesture 8(3), 285–301 (2008)
Tagliamonte, S.: Representing real language: consistency, trade-offs, and thinking ahead! In: Beal, J.C., Corrigan, K.P., Moisl, H.L. (eds.) Creating and Digitizing Language Corpora, vol. 1: Synchronic Databases, pp. 205–240. Palgrave Macmillan, Houndmills (2007)
Taylor, A., Marcus, M.P., Santorini, B.: The Penn Treebank: an overview. Text, Speech Lang. Technol. 20, 5–22 (2003)
The British National Corpus, version 3 (BNC XML Edition). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk/ (2007)
Thieberger, N., Berez, A.L.: Linguistic data management. In: Thieberger, N. (ed.) Oxford Handbook of Linguistic Fieldwork, pp. 90–118. Oxford University Press, Oxford (2012)
Thompson, H.S., McKelvie, D.: Hyperlink semantics for standoff markup of read-only documents. In: Proceedings of the SGML Europe (1997). http://www.ltg.ed.ac.uk/~ht/sgmleu97.html
University of Hamburg. iLex – a tool for sign language lexicography and corpus analysis. (2014) http://www.sign-lang.uni-hamburg.de/ilex/
Woodbury, A.: Language documentation. In: Austin, P.K., Sallabank, J. (eds.) The Cambridge Handbook of Endangered Languages, pp. 159–186. Cambridge University Press, Cambridge (2011)
Xiao, R.: Theory-driven corpus research: using corpora to inform aspect theory. In: Lüdeling, A., Kytö, M. (eds.) Corpus Linguistics: An International Handbook, vol. 2, pp. 987–1008. Walter de Gruyter, Berlin (2008)
Zinsmeister, H., Hinrichs, E., Kübler, S., Witt, A.: Linguistically annotated corpora: quality assurance, reusability and sustainability. In: Lüdeling, A., Kytö, M. (eds.) Corpus Linguistics: An International Handbook, vol. 1, pp. 759–776. Walter de Gruyter, Berlin (2008)
© 2017 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Gries, S.T., Berez, A.L. (2017). Linguistic Annotation in/for Corpus Linguistics. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_15
DOI: https://doi.org/10.1007/978-94-024-0881-2_15
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences (R0)