Abstract
This case study describes the creation of the Manually Annotated Sub-Corpus (MASC), a 500,000-word subset of the Open American National Corpus (OANC). The corpus includes primary data from a balanced selection of 19 written and spoken genres, all of which is annotated for almost 20 varieties of linguistic phenomena at all levels. All annotations are either hand-validated or manually produced. MASC is unique in that it is fully open and free for any use, including commercial use.
Notes
- 1.
The consortium members who contributed texts to the ANC are Oxford University Press, Cambridge University Press, Langenscheidt Publishers, and the Microsoft Corporation.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
To date, we have collected over five million words of college essays and fiction contributed by college students.
- 9.
- 10.
For this reason, we were unable to include a million words of contributed data from the ACL Anthology in the ANC.
- 11.
- 12.
- 13.
Defined in ISO/IEC 10646.
- 14.
The ANC maintains a GATE plugin repository, which includes import and export modules for annotated documents in GrAF (see Sect. 2.4), at http://www.anc.org/tools/gate/gate-update-site.xml.
- 15.
- 16.
Some of these modules were developed or improved by students at Vassar College, who did the analysis and JAPE rule-writing as a term project for an advanced undergraduate course on Computational Linguistics.
- 17.
General Architecture for Text Engineering; http://gate.ac.uk.
- 18.
The contents of the ANC First Release are described at http://www.anc.org/FirstRelease/.
- 19.
- 20.
Available at http://www.anc.org/data/oanc/contributed-annotations/.
- 21.
- 22.
NSF CRI 0708952.
- 23.
- 24.
MASC includes about 5 K of the 10 K LU corpus, eliminating non-English and translated texts as well as texts that are not free of usage and redistribution restrictions. See https://catalog.ldc.upenn.edu/LDC2009T10.
- 25.
The list does not include WordNet sense annotations because they are not applied to full texts.
- 26.
- 27.
Primarily, the students were Cognitive Science majors with a Linguistics emphasis. Over the four years of the project, sixteen different students worked on validation.
- 28.
All of the MASC project’s annotation guidelines are accessible from http://www.anc.org/wiki/#AnnotationValidation.
- 29.
- 30.
Sense and frame element annotations were handled separately; see chapter “Semantic Annotation of MASC”, in this volume.
- 31.
We created a post-processing JAPE script that modifies the default ANNIE tokenization slightly.
- 32.
Several years ago, the PTB project changed its tokenization, which originally did not break hyphenated words, because of difficulties with cases such as “New York-based” encountered in the Unified Linguistic Annotation project (see https://catalog.ldc.upenn.edu/LDC2009T07). However, breaking hyphenated words makes it impossible to tag the hyphenated word as a single adjective. Keeping hyphenated words intact so that they can be tagged as adjectives was deemed preferable, despite the need to manually correct tokenizations such as New+York-based.
- 33.
- 34.
- 35.
Because of the unexpected difficulty of correcting the ANNIE tags by this method, the first release of the full MASC (version 3.0.0) did not contain the tags corrected from the PTB data; instead, its tags had been post-processed with JAPE scripts to correct systematic errors.
- 36.
- 37.
- 38.
- 39.
For a comprehensive overview of GrAF and its headers, see [17].
- 40.
- 41.
- 42.
Available from https://pypi.python.org/pypi/graf-python/0.3.0.
- 43.
- 44.
- 45.
- 46.
The ANNIS implementation for accessing MASC annotations is available from http://www.anc.org/software/annis.
- 47.
- 48.
- 49.
Note that GrAF is a “true” standoff format, as opposed to hybrid standoff formats as described in chapter “Designing Annotation Schemes: From Model to Representation” in this volume.
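The standoff idea mentioned in the notes above can be illustrated with a minimal Python sketch. It is not GrAF's serialization and does not use the graf-python API; the `Annotation` class, the layer names, and the regular expressions are hypothetical and chosen only for illustration. The sketch keeps the primary text untouched and stores each annotation as a separate record pointing at character offsets, so two conflicting tokenizations of “New York-based” can coexist over the same text.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A standoff annotation: it points into the primary text by offsets
    and never modifies the text itself. (Hypothetical class, not GrAF.)"""
    layer: str
    label: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    features: dict = field(default_factory=dict)


def annotate(primary, layer, pattern):
    """Create one token annotation per regex match in the primary text."""
    return [Annotation(layer, "tok", m.start(), m.end())
            for m in re.finditer(pattern, primary)]


def text_of(primary, ann):
    """Recover the annotated substring from the unchanged primary text."""
    return primary[ann.start:ann.end]


PRIMARY = "She is a New York-based writer."

# Layer 1 keeps hyphenated words whole; layer 2 splits at the hyphen.
keep = annotate(PRIMARY, "tok-keep-hyphen",
                r"[A-Za-z]+(?:-[A-Za-z]+)*|[^\sA-Za-z]")
split = annotate(PRIMARY, "tok-split-hyphen",
                 r"[A-Za-z]+|[^\sA-Za-z]")

print([text_of(PRIMARY, a) for a in keep])
print([text_of(PRIMARY, a) for a in split])
```

Because both layers refer to the same offsets in the same primary text, neither tokenization has to be chosen at storage time; a consumer can select or merge layers later.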
References
Baker, C.F., Fellbaum, C.: WordNet and FrameNet as complementary resources for annotation. In: Proceedings of the Third Linguistic Annotation Workshop, pp. 125–129. Association for Computational Linguistics, Suntec, Singapore (2009). http://www.aclweb.org/anthology/W/W09/W09-3021
Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: Proceedings of the 17th International Conference on Computational Linguistics, vol.1, pp. 86–90. Association for Computational Linguistics, Stroudsburg, PA, USA (1998)
Blumtritt, J., Bouda, P., Rau, F.: Poio API and GraF-XML: a radical stand-off approach in language documentation and language typology. In: Proceedings of Balisage: The Markup Conference 2013, Balisage Series on Markup Technologies, vol. 10, Montreal, Canada (2013). doi:10.4242/BalisageVol10.Bouda01
Chiarcos, C., Hellmann, S., Nordhoff, S.: Linking linguistic resources: examples from the Open Linguistics Working Group. In: C. Chiarcos, S. Nordhoff, S. Hellmann (eds.) Linked Data in Linguistics, pp. 201–216. Springer, Heidelberg (2012)
Chiarcos, C., Ritz, J., Stede, M.: By all these lovely Tokens... Merging conflicting tokenizations. Lang. Res. Eval. 46(1), 53–74 (2012)
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: a framework and graphical development environment for robust NLP tools and applications. In: Proceedings of ACL’02 (2002)
Dridan, R., Oepen, S.: Tokenization: returning to a long solved problem–a survey, contrastive experiment, recommendations, and toolkit. In: ACL (2), pp. 378–382. The Association for Computational Linguistics (2012)
Fellbaum, C., Baker, C.: Aligning verbs in WordNet and FrameNet. Linguistics (to appear)
Ferrucci, D., Lally, A.: UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Lang. Eng. 10(3–4), 327–348 (2004). doi:10.1017/S1351324904003523
Fillmore, C.J., Jurafsky, D., Ide, N., Macleod, C.: An American National Corpus: a proposal. In: Proceedings of the First Annual Conference on Language Resources and Evaluation, pp. 965–969. European Language Resources Association, Paris (1998)
Fokkens, A., van Erp, M., Postma, M., Pedersen, T., Vossen, P., Freire, N.: Offspring from reproduction problems: What replication failure teaches us. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers), pp. 1691–1701. Association for Computational Linguistics, Sofia, Bulgaria (2013)
Ide, N.: An open linguistic infrastructure for annotated corpora. In: I. Gurevych, J. Kim (eds.) The People Web Meets NLP: Collaboratively Constructed Language Resources, pp. 263–284. Springer, Heidelberg (2013)
Ide, N., Romary, L.: International standard for a linguistic annotation framework. Natural Lang. Eng. 10(3–4), 211–225 (2004). doi:10.1017/S135132490400350X
Ide, N., Romary, L.: Representing linguistic corpora and their annotations. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006) (2006)
Ide, N., Suderman, K.: Integrating linguistic resources: the American National Corpus model. In: Proceedings of the Fifth Language Resources and Evaluation Conference (LREC). Genoa, Italy (2006)
Ide, N., Suderman, K.: GrAF: a graph-based format for linguistic annotations. In: Proceedings of the Linguistic Annotation Workshop, pp. 1–8. Association for Computational Linguistics, Prague, Czech Republic (2007). http://www.aclweb.org/anthology/W/W07/W07-1501
Ide, N., Suderman, K.: The Linguistic Annotation Framework: a Standard for Annotation Interchange and Merging. Language Resources and Evaluation (2014)
Ide, N., Bonhomme, P., Romary, L.: XCES: an XML-based encoding standard for linguistic corpora. In: Proceedings of the Second International Language Resources and Evaluation Conference. European Language Resources Association, Paris (2000)
Ide, N., Reppen, R., Suderman, K.: The American National Corpus: more than the web can provide. In: Proceedings of the Third Language Resources and Evaluation Conference, pp. 839–844. Las Palmas (2002)
Ide, N., Suderman, K., Simms, B.: ANC2Go: a web application for customized corpus creation. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC). European Language Resources Association, Valletta, Malta (2010)
ISO: Language Resource Management - Linguistic Annotation Framework. ISO 24612 (2012)
Kilgarriff, A.: Googleology is bad science. Comput. Linguist. 33(1) (2007)
Kremer, G., Erk, K., Padó, S., Thater, S.: What substitutes tell us – analysis of an “all-words” lexical substitution corpus. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden (2014)
Macleod, C., Grishman, R., Meyers, A., Barrett, L., Reeves, R.: Nomlex: a lexicon of nominalizations. Proc. Euralex 98, 187–193 (1998)
Mann, W.C., Thompson, S.A.: Rhetorical structure theory: description and construction of text structures. In: Kempen, G. (ed.) Natural Language Generation: New Results in Artificial Intelligence, Psychology, and Linguistics, pp. 85–95. Nijhoff, Dordrecht (1987)
Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)
Moro, A., Navigli, R., Tucci, F.M., Passonneau, R.J.: Annotating the MASC corpus with BabelNet. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA), Reykjavik, Iceland (2014)
Navigli, R., Ponzetto, S.P.: BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell. 193, 217–250 (2012)
Neumann, A., Ide, N., Stede, M.: Importing MASC into the ANNIS linguistic database: a case study of mapping GrAF. In: Proceedings of the Seventh Linguistic Annotation Workshop (LAW), pp. 98–102. Sofia, Bulgaria (2013)
Pradhan, S.S., Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., Weischedel, R.: OntoNotes: a unified relational semantic representation. In: ICSC ’07: Proceedings of the International Conference on Semantic Computing, pp. 517–526. IEEE Computer Society, Washington, DC, USA (2007). http://dx.doi.org/10.1109/ICSC.2007.67
Copyright information
© 2017 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Ide, N. (2017). Case Study: The Manually Annotated Sub-Corpus. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_19
DOI: https://doi.org/10.1007/978-94-024-0881-2_19
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences (R0)