Sustainable Development and Refinement of Complex Linguistic Annotations at Scale

Handbook of Linguistic Annotation

Abstract

The development of complex and consistent linguistic annotations over large and varied corpora requires an approach which allows for the incremental improvement of existing annotations by encoding all manual effort in such a way that its value is preserved and enhanced even as the resource is improved over time. This manual effort includes both annotation design and disambiguation; in the case of syntactico-semantic annotations, the former can be encoded in a machine-readable grammar and the latter as a series of decisions made at a level of granularity which supports both efficient human disambiguation and later machine re-use of the individual decisions. The general approach can be applied beyond syntactico-semantic annotation to any annotation project where the design of the representations can be encoded as a grammar, and thus we frame our methodological discussion in terms of incremental improvement, with syntactico-semantic annotations as a case study.

Notes

  1.

    A grammar is never complete, however, and new texts always hold the promise of new linguistic phenomena to investigate. The ability to process the text with a grammar encoding the existing analyses makes it much easier to discover those which are not yet covered by the grammar, even as they become ever less frequent. One example of this method of discovering previously unrecognized phenomena is presented in Flickinger and Wasow [17].

  2.

    This example is an adaptation of a sentence that appears in the WSJ portion of the PTB, as well as in the much smaller Cross-Framework Parser Evaluation Shared Task (PEST) corpus discussed by Ivanova, Oepen, Øvrelid, and Flickinger [27].

  3.

    For a thorough introduction to Minimal Recursion Semantics and its integration into the ERG for purposes of compositionality, see [13].

  4.

    An example of a syntactic construction contributing semantic information is the one that licenses determinerless or ‘bare’ noun phrases and inserts a quantifier elementary predication.

  5.

    Indeed, the interaction of phenomena is often a primary source of evidence for or against specific analyses (see [5, 21]).

  6.

    More precisely, the Redwoods Treebank stores two classes of discriminants for each sentence: those manually selected by the annotator, and the rest, which can be inferred from the manual choices. These inferred discriminants generally add to the robustness of the annotations by offering redundant sources of disambiguation, but this redundancy can get in the way of some kinds of grammar changes. Hence the annotation update machinery includes the ability to restrict the set of old discriminants to only the manually selected ones in those instances where applying the full set of discriminants results in the rejection of all new analyses. Fortunately, this restriction often leads to successful disambiguation even after significant changes to the grammar, because it ignores inferred discriminants that were previously redundant but are now inconsistent with the current state of the grammar.

  7.

    And, naturally, the contrast of approaches is not at all black-and-white, as there are bound to be elements of data preparation or guiding annotators through automated analysis (e.g. tagging and syntactic parsing) in most contemporary annotation work.

  8.

    The contemporaneous development of two initiatives in grammar-based treebanking is not entirely coincidental, as the original Redwoods tree selection tool was developed by Rob Malouf, prior to his joining the Alpino team at Groningen.

  9.

    More recent work at Groningen has focused on annotated resources that combine syntactic and semantic representations, this time for English, in the form of the Groningen Meaning Bank [4]. This work, however, does not build on either a precision hand-crafted grammar or a discriminant-based treebanking strategy, so it is of less direct relevance here.

  10.

    A correct analysis will also be lacking for sentences containing authored errors, for example from careless editing or typographical mistakes or second-language interference, but in the current Redwoods corpora such errors are not frequent enough to affect the present discussion.
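The fallback described in note 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Redwoods update machinery: the `Discriminant` class, the modelling of an analysis as a set of property labels, the function names, and the labels such as `pp-attaches-to-verb` are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Discriminant:
    prop: str     # hypothetical label for a property of an analysis
    keep: bool    # True: analysis must have prop; False: must lack it
    manual: bool  # True if selected by the annotator, False if inferred


Analysis = FrozenSet[str]  # an analysis modelled as a set of property labels


def filter_analyses(analyses: List[Analysis],
                    discriminants: List[Discriminant]) -> List[Analysis]:
    """Keep the analyses consistent with every given discriminant."""
    return [a for a in analyses
            if all((d.prop in a) == d.keep for d in discriminants)]


def update(analyses: List[Analysis],
           discriminants: List[Discriminant]) -> List[Analysis]:
    """Apply all stored discriminants; if nothing survives, retry with manual ones only."""
    survivors = filter_analyses(analyses, discriminants)
    if survivors:
        return survivors
    manual_only = [d for d in discriminants if d.manual]
    return filter_analyses(analyses, manual_only)


# Stored decisions from an earlier grammar version (illustrative data).
stored = [
    Discriminant("pp-attaches-to-verb", keep=True, manual=True),
    Discriminant("np-rule-old", keep=True, manual=False),  # inferred, now stale
]
# Analyses produced by a revised grammar in which the old rule no longer exists.
new_analyses = [
    frozenset({"pp-attaches-to-verb", "np-rule-new"}),
    frozenset({"pp-attaches-to-noun", "np-rule-new"}),
]
result = update(new_analyses, stored)
```

In this toy case the full discriminant set rejects both new analyses, because the inferred discriminant refers to a rule that the revised grammar no longer uses. Restricting the set to the single manual decision leaves exactly one analysis.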

References

  1. Abney, S.P.: Stochastic attribute-value grammars. Comput. Ling. 23, 597–618 (1997)

  2. Adolphs, P., Oepen, S., Callmeier, U., Crysmann, B., Flickinger, D., Kiefer, B.: Some fine points of hybrid natural language parsing. In: Proceedings of the 6th International Conference on Language Resources and Evaluation. Marrakech, Morocco (2008)

  3. Alshawi, H. (ed.): The Core Language Engine. MIT Press, Cambridge (1992)

  4. Basile, V., Bos, J., Evang, K., Venhuizen, N.: UGroningen. Negation detection with discourse representation structures. In: Proceedings of the 1st Joint Conference on Lexical and Computational Semantics, pp. 301–309. Montréal, Canada (2012)

  5. Bender, E.M.: Grammar engineering for linguistic hypothesis testing. In: Gaylord, N., Palmer, A., Ponvert, E. (eds.) Proceedings of the Texas Linguistics Society X Conference. Computational Linguistics for Less-studied Languages, pp. 16–36. CSLI Publications, Stanford (2008)

  6. Bender, E.M., Flickinger, D., Oepen, S., Zhang, Y.: Parser evaluation over local and non-local deep dependencies in a large corpus. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 397–408. Edinburgh, Scotland, UK (2011)

  7. Bender, E.M., Flickinger, D., Oepen, S., Packard, W., Copestake, A.: Layers of interpretation: on grammar and compositionality. In: Proceedings of the 11th International Conference on Computational Semantics. London (2015)

  8. Bond, F., Fujita, S., Hashimoto, C., Kasahara, K., Nariyama, S., Nichols, E., Amano, S.: The Hinoki Treebank. A treebank for text understanding. In: Proceedings of the 1st International Joint Conference on Natural Language Processing, pp. 158–167. Hainan Island, China (2004)

  9. Bouma, G., van Noord, G., Malouf, R.: Alpino. Wide-coverage computational analysis of Dutch. In: Daelemans, W., Sima'an, K., Veenstra, J., Zavrel, J. (eds.) Computational Linguistics in the Netherlands, pp. 45–59. Rodopi, Amsterdam (2001)

  10. Branco, A., Costa, F., Silva, J., Silveira, S., Castro, S., Avelãs, M., Graça, J.: Developing a deep linguistic databank supporting a collection of treebanks. The CINTIL DeepGramBank. In: Proceedings of the 7th International Conference on Language Resources and Evaluation. Valletta, Malta (2010)

  11. Brants, S., Dipper, S., Hansen, S., Lezius, W., Smith, G.: The TIGER treebank. In: Proceedings of the First Workshop on Treebanks and Linguistic Theories, Sozopol (2002)

  12. Carter, D.: The TreeBanker. A tool for supervised training of parsed corpora. In: Proceedings of the Workshop on Computational Environments for Grammar Development and Linguistic Engineering, pp. 9–15. Madrid, Spain (1997)

  13. Copestake, A., Flickinger, D., Pollard, C., Sag, I.A.: Minimal Recursion Semantics. An introduction. Res. Lang. Comput. 3(4), 281–332 (2005)

  14. Dipper, S.: Grammar-based corpus annotation. In: Proceedings of the Workshop on Linguistically Interpreted Corpora, pp. 56–64. Luxembourg, Luxembourg (2000)

  15. Flickinger, D.: On building a more efficient grammar by exploiting types. Nat. Lang. Eng. 6(1), 15–28 (2000) (Eds: Flickinger, D., Oepen, S., Tsujii, J., Uszkoreit, H.)

  16. Flickinger, D.: Accuracy vs. robustness in grammar engineering. In: Bender, E.M., Arnold, J.E. (eds.) Language from a Cognitive Perspective: Grammar, Usage, and Processing, pp. 31–50. CSLI Publications, Stanford (2011)

  17. Flickinger, D., Wasow, T.: A corpus-driven analysis of the Do-Be construction. In: Hofmeister, P., Norcliffe, E. (eds.) The Core and the Periphery: Data-driven Perspectives on Syntax Inspired by Ivan A. Sag, pp. 35–64. CSLI Publications, Stanford (2013)

  18. Flickinger, D., Oepen, S., Ytrestøl, G.: WikiWoods. Syntacto-semantic annotation for English Wikipedia. In: Proceedings of the 7th International Conference on Language Resources and Evaluation. Valletta, Malta (2010)

  19. Flickinger, D., Kordoni, V., Zhang, Y., Branco, A., Simov, K., Osenova, P., Castro, S.: ParDeepBank. Multiple parallel deep treebanking. In: Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, pp. 97–108. Edições Colibri, Lisbon (2012)

  20. Flickinger, D., Zhang, Y., Kordoni, V.: DeepBank. A dynamically annotated treebank of the Wall Street Journal. In: Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, pp. 85–96. Edições Colibri, Lisbon (2012)

  21. Fokkens, A., Bender, E.M.: Time travel in grammar engineering. Using a metagrammar to broaden the search space. In: Duchier, D., Parmentier, Y. (eds.) Proceedings of the ESSLLI Workshop on High-Level Methodologies in Grammar Engineering, pp. 105–116. Düsseldorf, Germany (2013)

  22. Fujita, S., Bond, F., Tanaka, T., Oepen, S.: Exploiting semantic information for HPSG parse selection. Res. Lang. Comput. 8(1), 1–22 (2010)

  23. Gawron, J.M., King, J., Lamping, J., Loebner, E., Paulson, E.A., Pullum, G.K., Wasow, T.: Processing English with a Generalized Phrase Structure Grammar. In: Proceedings of the 20th Meeting of the Association for Computational Linguistics, pp. 74–81. Toronto, Ontario, Canada (1982)

  24. Hajič, J.: Building a syntactically annotated corpus. The Prague Dependency Treebank. In: Issues of Valency and Meaning, pp. 106–132. Karolinum, Prague (1998)

  25. Hemphill, C.T., Godfrey, J.J., Doddington, G.R.: The ATIS spoken language systems pilot corpus. In: Proceedings of the DARPA Speech and Natural Language Workshop, pp. 96–101 (1990)

  26. Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., Weischedel, R.: OntoNotes: The 90% solution. In: Proceedings of Human Language Technologies: The 2006 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short papers, pp. 57–60. New York City, USA (2006)

  27. Ivanova, A., Oepen, S., Øvrelid, L., Flickinger, D.: Who did what to whom? A contrastive study of syntacto-semantic dependencies. In: Proceedings of the Sixth Linguistic Annotation Workshop, pp. 2–11. Jeju, Republic of Korea (2012)

  28. Johnson, M., Geman, S., Canon, S., Chi, Z., Riezler, S.: Estimators for stochastic ‘unification-based’ grammars. In: Proceedings of the 37th Meeting of the Association for Computational Linguistics, pp. 535–541. College Park, USA (1999)

  29. Kingsbury, P., Palmer, M.: From TreeBank to PropBank. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, pp. 1989–1993. Las Palmas, Spain (2002)

  30. Losnegaard, G.S., Lyse, G.I., Thunes, M., Rosén, V., Smedt, K.D., Dyvik, H., Meurer, P.: What we have learned from Sofie. Extending lexical and grammatical coverage in an LFG parsebank. In: Proceedings of the META-RESEARCH Workshop on Advanced Treebanking at LREC2012, pp. 69–76. Istanbul, Turkey (2012)

  31. MacKinlay, A., Dridan, R., Flickinger, D., Oepen, S., Baldwin, T.: Using external treebanks to filter parse forests for parse selection and treebanking. In: Proceedings of the 2011 International Joint Conference on Natural Language Processing, pp. 246–254. Chiang Mai, Thailand (2011)

  32. Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: The Penn Treebank. Comput. Ling. 19, 313–330 (1993)

  33. Marimon, M., Fisas, B., Bel, N., Villegas, M., Vivaldi, J., Torner, S., Villegas, M.: The IULA Treebank. In: Proceedings of the 8th International Conference on Language Resources and Evaluation, pp. 1920–1926. Istanbul, Turkey (2012)

  34. Oepen, S., Flickinger, D.P.: Towards systematic grammar profiling. Test suite technology ten years after. Comput. Speech Lang. 12(4), 411–436 (1998)

  35. Oepen, S., Lønning, J.T.: Discriminant-based MRS banking. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, pp. 1250–1255. Genoa, Italy (2006)

  36. Oepen, S., Flickinger, D., Toutanova, K., Manning, C.D.: LinGO Redwoods. A rich and dynamic treebank for HPSG. Res. Lang. Comput. 2(4), 575–596 (2004)

  37. Packard, W.: Full forest treebanking. Unpublished master’s thesis, University of Washington (2015)

  38. Pollard, C., Sag, I.A.: Information-based syntax and semantics. Volume 1: Fundamentals. CSLI Publications, Stanford (1987)

  39. Pollard, C., Sag, I.A.: Head-Driven Phrase Structure Grammar. The University of Chicago Press, Chicago (1994)

  40. Pozen, Z.: Using lexical and compositional semantics to improve HPSG parse selection. Unpublished master’s thesis, University of Washington (2013)

  41. Rimell, L., Clark, S., Steedman, M.: Unbounded dependency recovery for parser evaluation. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 813–821. Singapore (2009)

  42. Rosén, V., Meurer, P., De Smedt, K.: Designing and implementing discriminants for LFG grammars. In: Butt, M., King, T.H. (eds.) Proceedings of the 12th International LFG Conference. Stanford, USA (2007)

  43. Song, S., Bender, E.M.: Individual constraints for information structure. In: Müller, S. (ed.) Proceedings of the 19th International Conference on Head-Driven Phrase Structure Grammar, pp. 330–348. CSLI Publications, Stanford, CA, USA (2012)

  44. van der Beek, L., Bouma, G., Malouf, R., van Noord, G.: The Alpino dependency treebank. In: Theune, M., Nijholt, A., Hondorp, H. (eds.) Computational Linguistics in the Netherlands 2001. Selected papers from the twelfth CLIN meeting. Rodopi, Amsterdam (2002)

  45. Zhang, Y., Krieger, H.-U.: Large-scale corpus-driven PCFG approximation of an HPSG. In: Proceedings of the 12th International Conference on Parsing Technologies, pp. 198–208. Dublin, Ireland (2011)

  46. Zhang, Y., Wang, R.: Cross-domain dependency parsing using a deep linguistic grammar. In: Proceedings of the 47th Meeting of the Association for Computational Linguistics, pp. 378–386. Suntec, Singapore (2009)

Author information

Corresponding author

Correspondence to Dan Flickinger.

Copyright information

© 2017 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Flickinger, D., Oepen, S., Bender, E.M. (2017). Sustainable Development and Refinement of Complex Linguistic Annotations at Scale. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_14

  • DOI: https://doi.org/10.1007/978-94-024-0881-2_14

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-0879-9

  • Online ISBN: 978-94-024-0881-2

  • eBook Packages: Social Sciences (R0)
