Abstract
The development of complex and consistent linguistic annotations over large and varied corpora requires an approach which allows for the incremental improvement of existing annotations by encoding all manual effort in such a way that its value is preserved and enhanced even as the resource is improved over time. This manual effort includes both annotation design and disambiguation; in the case of syntactico-semantic annotations, the former can be encoded in a machine-readable grammar and the latter as a series of decisions made at a level of granularity which supports both efficient human disambiguation and later machine re-use of the individual decisions. The general approach can be applied beyond syntactico-semantic annotation to any annotation project where the design of the representations can be encoded as a grammar, and thus we frame our methodological discussion in terms of incremental improvement, with syntactico-semantic annotations as a case study.
Notes
- 1.
A grammar is never complete, however, and new texts always hold the promise of new linguistic phenomena to investigate. The ability to process the text with a grammar encoding the existing analyses makes it much easier to discover those which are not yet covered by the grammar, even as they become ever less frequent. One example of this method of discovering previously unrecognized phenomena is presented in Flickinger and Wasow [17].
- 2.
This example is an adaptation of a sentence that appears in the WSJ portion of the PTB, as well as in the much smaller Cross-Framework Parser Evaluation Shared Task (PEST) corpus discussed by Ivanova, Oepen, Øvrelid, and Flickinger [27].
- 3.
For a thorough introduction to Minimal Recursion Semantics and its integration into the ERG for purposes of compositionality, see [13].
- 4.
An example of a syntactic construction contributing semantic information is the one that licenses determinerless or ‘bare’ noun phrases and inserts a quantifier elementary predication.
- 5.
- 6.
More precisely, the Redwoods Treebank stores two classes of discriminants for each sentence: those manually selected by the annotator, and the remainder, which can be inferred from the manual choices. The inferred discriminants generally add to the robustness of the annotations by offering redundant sources of disambiguation, but this redundancy can get in the way of some kinds of grammar changes. Hence the annotation update machinery can restrict the set of old discriminants to the manually selected ones in those instances where applying the full set of discriminants results in the rejection of all new analyses. Fortunately, this restriction often leads to successful disambiguation even after significant changes to the grammar, because it ignores inferred discriminants that were previously redundant but are now inconsistent with the current state of the grammar.
- 7.
And, naturally, the contrast of approaches is not at all black-and-white, as there are bound to be elements of data preparation or guiding annotators through automated analysis (e.g. tagging and syntactic parsing) in most contemporary annotation work.
- 8.
The contemporaneous development of two initiatives in grammar-based treebanking is not entirely coincidental, as the original Redwoods tree selection tool was developed by Rob Malouf, prior to his joining the Alpino team at Groningen.
- 9.
More recent work at Groningen has focused on annotated resources that combine syntactic and semantic representations, this time for English, in the form of the Groningen Meaning Bank [4]. This work, however, does not build on either a precision hand-crafted grammar or a discriminant-based treebanking strategy, so it is of less direct relevance here.
- 10.
A correct analysis will also be lacking for sentences containing authored errors, for example from careless editing or typographical mistakes or second-language interference, but in the current Redwoods corpora such errors are not frequent enough to affect the present discussion.
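The fallback described in note 6 can be illustrated with a minimal sketch. It is not the Redwoods update machinery itself: the `Decision` class, the reduction of each analysis to a set of discriminant names, and the discriminant names in the example are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Assumption: an analysis is reduced to the set of discriminants it exhibits.
Analysis = FrozenSet[str]


@dataclass(frozen=True)
class Decision:
    """One stored discriminant decision (hypothetical representation)."""
    discriminant: str  # a property that an analysis either has or lacks
    accepted: bool     # True: analysis must have it; False: must lack it
    manual: bool       # True if chosen by the annotator, False if inferred


def filter_analyses(analyses: List[Analysis],
                    decisions: List[Decision]) -> List[Analysis]:
    """Keep only the analyses consistent with every decision."""
    return [
        a for a in analyses
        if all((d.discriminant in a) == d.accepted for d in decisions)
    ]


def update_annotation(new_analyses: List[Analysis],
                      old_decisions: List[Decision]
                      ) -> Tuple[List[Analysis], str]:
    """Re-apply old decisions to analyses produced by a revised grammar.

    First try the full set (manual plus inferred). If that rejects every
    new analysis, retry with the manually selected decisions only.
    """
    survivors = filter_analyses(new_analyses, old_decisions)
    if survivors:
        return survivors, "all"
    manual_only = [d for d in old_decisions if d.manual]
    return filter_analyses(new_analyses, manual_only), "manual-only"


# Hypothetical example: the revised grammar no longer produces "old-rule",
# so the inferred decision that required it is now inconsistent.
old_decisions = [
    Decision("pp-attach-verb", accepted=True, manual=True),
    Decision("pp-attach-noun", accepted=False, manual=False),
    Decision("old-rule", accepted=True, manual=False),
]
new_analyses = [
    frozenset({"pp-attach-verb", "new-rule"}),
    frozenset({"pp-attach-noun", "new-rule"}),
]
survivors, mode = update_annotation(new_analyses, old_decisions)
```

In this example the full decision set rejects both new analyses, so the sketch falls back to the single manual decision, which still selects the verb-attachment analysis.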
References
Abney, S.P.: Stochastic attribute-value grammars. Comput. Ling. 23, 597–618 (1997)
Adolphs, P., Oepen, S., Callmeier, U., Crysmann, B., Flickinger, D., Kiefer, B.: Some fine points of hybrid natural language parsing. In: Proceedings of the 6th International Conference on Language Resources and Evaluation. Marrakech, Morocco (2008)
Alshawi, H. (ed.): The Core Language Engine. MIT Press, Cambridge (1992)
Basile, V., Bos, J., Evang, K., Venhuizen, N.: UGroningen. Negation detection with discourse representation structures. In: Proceedings of the 1st Joint Conference on Lexical and Computational Semantics, pp. 301–309. Montréal, Canada (2012)
Bender, E.M.: Grammar engineering for linguistic hypothesis testing. In: Gaylord, N., Palmer, A., Ponvert, E. (eds.) Proceedings of the Texas Linguistics Society X Conference. Computational Linguistics for Less-studied Languages, pp. 16–36. CSLI Publications, Stanford (2008)
Bender, E.M., Flickinger, D., Oepen, S., Zhang, Y.: Parser evaluation over local and non-local deep dependencies in a large corpus. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 397–408. Edinburgh, Scotland, UK (2011)
Bender, E.M., Flickinger, D., Oepen, S., Packard, W., Copestake, A.: Layers of interpretation: on grammar and compositionality. In: Proceedings of the 11th International Conference on Computational Semantics. London (2015)
Bond, F., Fujita, S., Hashimoto, C., Kasahara, K., Nariyama, S., Nichols, E., Amano, S.: The Hinoki Treebank. A treebank for text understanding. In: Proceedings of the 1st International Joint Conference on Natural Language Processing, pp. 158–167. Hainan Island, China (2004)
Bouma, G., van Noord, G., Malouf, R.: Alpino. Wide-coverage computational analysis of Dutch. In: Daelemans, W., Sima'an, K., Veenstra, J., Zavrel, J. (eds.) Computational Linguistics in the Netherlands, pp. 45–59. Rodopi, Amsterdam (2001)
Branco, A., Costa, F., Silva, J., Silveira, S., Castro, S., Avelãs, M., Graça, J.: Developing a deep linguistic databank supporting a collection of treebanks. The CINTIL DeepGramBank. In: Proceedings of the 7th International Conference on Language Resources and Evaluation. Valletta, Malta (2010)
Brants, S., Dipper, S., Hansen, S., Lezius, W., Smith, G.: The TIGER treebank. In: Proceedings of the First Workshop on Treebanks and Linguistic Theories, Sozopol (2002)
Carter, D.: The TreeBanker. A tool for supervised training of parsed corpora. In: Proceedings of the Workshop on Computational Environments for Grammar Development and Linguistic Engineering, pp. 9–15. Madrid, Spain (1997)
Copestake, A., Flickinger, D., Pollard, C., Sag, I.A.: Minimal Recursion Semantics. An introduction. Res. Lang. Comput. 3(4), 281–332 (2005)
Dipper, S.: Grammar-based corpus annotation. In: Proceedings of the Workshop on Linguistically Interpreted Corpora, pp. 56–64. Luxembourg, Luxembourg (2000)
Flickinger, D.: On building a more efficient grammar by exploiting types. Nat. Lang. Eng. 6(1), 15–28 (2000) (Eds: Flickinger, D., Oepen, S., Tsujii, J., Uszkoreit, H.)
Flickinger, D.: Accuracy vs. robustness in grammar engineering. In: Bender, E.M., Arnold, J.E. (eds.) Language from a Cognitive Perspective: Grammar, Usage, and Processing, pp. 31–50. CSLI Publications, Stanford (2011)
Flickinger, D., Wasow, T.: A corpus-driven analysis of the Do-Be construction. In: Hofmeister, P., Norcliffe, E. (eds.) The Core and the Periphery: Data-driven Perspectives on Syntax Inspired by Ivan A. Sag, pp. 35–64. CSLI Publications, Stanford (2013)
Flickinger, D., Oepen, S., Ytrestøl, G.: WikiWoods. Syntacto-semantic annotation for English Wikipedia. In: Proceedings of the 7th International Conference on Language Resources and Evaluation. Valletta, Malta (2010)
Flickinger, D., Kordoni, V., Zhang, Y., Branco, A., Simov, K., Osenova, P., Castro, S.: ParDeepBank. Multiple parallel deep treebanking. In: Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, pp. 97–108. Edições Colibri, Lisbon (2012)
Flickinger, D., Zhang, Y., Kordoni, V.: DeepBank. A dynamically annotated treebank of the Wall Street Journal. In: Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, pp. 85–96. Edições Colibri, Lisbon (2012)
Fokkens, A., Bender, E.M.: Time travel in grammar engineering. Using a metagrammar to broaden the search space. In: Duchier, D., Parmentier, Y. (eds.) Proceedings of the ESSLLI Workshop on High-Level Methodologies in Grammar Engineering, pp. 105–116. Düsseldorf, Germany (2013)
Fujita, S., Bond, F., Tanaka, T., Oepen, S.: Exploiting semantic information for HPSG parse selection. Res. Lang. Comput. 8(1), 1–22 (2010)
Gawron, J.M., King, J., Lamping, J., Loebner, E., Paulson, E.A., Pullum, G.K., Wasow, T.: Processing English with a Generalized Phrase Structure Grammar. In: Proceedings of the 20th Meeting of the Association for Computational Linguistics, pp. 74–81. Toronto, Ontario, Canada (1982)
Hajič, J.: Building a syntactically annotated corpus. The Prague Dependency Treebank. In: Issues of Valency and Meaning, pp. 106–132. Karolinum, Prague (1998)
Hemphill, C.T., Godfrey, J.J., Doddington, G.R.: The ATIS spoken language systems pilot corpus. In: Proceedings of the DARPA Speech and Natural Language Workshop, pp. 96–101 (1990)
Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., Weischedel, R.: OntoNotes: The 90% solution. In: Proceedings of Human Language Technologies: The 2006 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short papers, pp. 57–60. New York City, USA (2006)
Ivanova, A., Oepen, S., Øvrelid, L., Flickinger, D.: Who did what to whom? A contrastive study of syntacto-semantic dependencies. In: Proceedings of the Sixth Linguistic Annotation Workshop, pp. 2–11. Jeju, Republic of Korea (2012)
Johnson, M., Geman, S., Canon, S., Chi, Z., Riezler, S.: Estimators for stochastic ‘unification-based’ grammars. In: Proceedings of the 37th Meeting of the Association for Computational Linguistics, pp. 535–541. College Park, USA (1999)
Kingsbury, P., Palmer, M.: From TreeBank to PropBank. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, pp. 1989–1993. Las Palmas, Spain (2002)
Losnegaard, G.S., Lyse, G.I., Thunes, M., Rosén, V., Smedt, K.D., Dyvik, H., Meurer, P.: What we have learned from Sofie. Extending lexical and grammatical coverage in an LFG parsebank. In: Proceedings of the META-RESEARCH Workshop on Advanced Treebanking at LREC2012, pp. 69–76. Istanbul, Turkey (2012)
MacKinlay, A., Dridan, R., Flickinger, D., Oepen, S., Baldwin, T.: Using external treebanks to filter parse forests for parse selection and treebanking. In: Proceedings of the 2011 International Joint Conference on Natural Language Processing, pp. 246–254. Chiang Mai, Thailand (2011)
Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: The Penn Treebank. Comput. Ling. 19, 313–330 (1993)
Marimon, M., Fisas, B., Bel, N., Villegas, M., Vivaldi, J., Torner, S., Villegas, M.: The IULA Treebank. In: Proceedings of the 8th International Conference on Language Resources and Evaluation, pp. 1920–1926. Istanbul, Turkey (2012)
Oepen, S., Flickinger, D.P.: Towards systematic grammar profiling. Test suite technology ten years after. Comput. Speech Lang. 12(4), 411–436 (1998)
Oepen, S., Lønning, J.T.: Discriminant-based MRS banking. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, pp. 1250–1255. Genoa, Italy (2006)
Oepen, S., Flickinger, D., Toutanova, K., Manning, C.D.: LinGO Redwoods. A rich and dynamic treebank for HPSG. Res. Lang. Comput. 2(4), 575–596 (2004)
Packard, W.: Full forest treebanking. Unpublished master’s thesis, University of Washington (2015)
Pollard, C., Sag, I.A.: Information-based syntax and semantics. Volume 1: Fundamentals. CSLI Publications, Stanford (1987)
Pollard, C., Sag, I.A.: Head-Driven Phrase Structure Grammar. The University of Chicago Press, Chicago (1994)
Pozen, Z.: Using lexical and compositional semantics to improve HPSG parse selection. Unpublished master’s thesis, University of Washington (2013)
Rimell, L., Clark, S., Steedman, M.: Unbounded dependency recovery for parser evaluation. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 813–821. Singapore (2009)
Rosén, V., Meurer, P., De Smedt, K.: Designing and implementing discriminants for LFG grammars. In: Butt, M., King, T.H. (eds.) Proceedings of the 12th International LFG Conference. Stanford, USA (2007)
Song, S., Bender, E.M.: Individual constraints for information structure. In: Müller, S. (ed.) Proceedings of the 19th International Conference on Head-Driven Phrase Structure Grammar, pp. 330–348. CSLI Publications, Stanford, CA, USA (2012)
van der Beek, L., Bouma, G., Malouf, R., van Noord, G.: The Alpino dependency treebank. In: Theune, M., Nijholt, A., Hondorp, H. (eds.) Computational Linguistics in the Netherlands 2001. Selected papers from the twelfth CLIN meeting. Rodopi, Amsterdam (2002)
Zhang, Y., Krieger, H.-U.: Large-scale corpus-driven PCFG approximation of an HPSG. In: Proceedings of the 12th International Conference on Parsing Technologies, pp. 198–208. Dublin, Ireland (2011)
Zhang, Y., Wang, R.: Cross-domain dependency parsing using a deep linguistic grammar. In: Proceedings of the 47th Meeting of the Association for Computational Linguistics, pp. 378–386. Suntec, Singapore (2009)
Copyright information
© 2017 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Flickinger, D., Oepen, S., Bender, E.M. (2017). Sustainable Development and Refinement of Complex Linguistic Annotations at Scale. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_14
DOI: https://doi.org/10.1007/978-94-024-0881-2_14
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences (R0)