Sustainable Development and Refinement of Complex Linguistic Annotations at Scale

Flickinger, Dan; Oepen, Stephan; Bender, Emily M.

doi:10.1007/978-94-024-0881-2_14

Abstract

The development of complex and consistent linguistic annotations over large and varied corpora requires an approach which allows for the incremental improvement of existing annotations by encoding all manual effort in such a way that its value is preserved and enhanced even as the resource is improved over time. This manual effort includes both annotation design and disambiguation; in the case of syntactico-semantic annotations, the former can be encoded in a machine-readable grammar and the latter as a series of decisions made at a level of granularity which supports both efficient human disambiguation and later machine re-use of the individual decisions. The general approach can be applied beyond syntactico-semantic annotation to any annotation project where the design of the representations can be encoded as a grammar, and thus we frame our methodological discussion in terms of incremental improvement, with syntactico-semantic annotations as a case study.

Notes

1. A grammar is never complete, however, and new texts always hold the promise of new linguistic phenomena to investigate. The ability to process the text with a grammar encoding the existing analyses makes it much easier to discover those which are not yet covered by the grammar, even as they become ever less frequent. One example of this method of discovering previously unrecognized phenomena is presented in Flickinger and Wasow [17].
2. This example is an adaptation of a sentence that appears in the WSJ portion of the PTB, as well as in the much smaller Cross-Framework Parser Evaluation Shared Task (PEST) corpus discussed by Ivanova, Oepen, Øvrelid, and Flickinger [27].
3. For a thorough introduction to Minimal Recursion Semantics and its integration into the ERG for purposes of compositionality, see [13].
4. An example of a syntactic construction contributing semantic information is the one that licenses determinerless or ‘bare’ noun phrases and inserts a quantifier elementary predication.
5. Indeed, the interaction of phenomena is often a primary source of evidence for or against specific analyses (see [5, 21]).
6. More precisely, the Redwoods Treebank stores for each sentence two classes of discriminants: those manually selected by the annotator, and the rest, which can be inferred from the manual choices. These inferred discriminants generally add to the robustness of the annotations, offering redundant sources of disambiguation, but this redundancy can get in the way of some kinds of grammar changes. Hence the annotation update machinery includes the ability to restrict the set of old discriminants to only manually selected ones, in those instances where applying the full set of discriminants results in the rejection of all new analyses. Happily, this restriction often leads to successful disambiguation even given significant changes to the grammar, by ignoring inferred discriminants that were previously redundant but are now in fact inconsistent with the current state of the grammar.
7. And, naturally, the contrast of approaches is not at all black-and-white, as there are bound to be elements of data preparation or guiding annotators through automated analysis (e.g. tagging and syntactic parsing) in most contemporary annotation work.
8. The contemporaneous development of two initiatives in grammar-based treebanking is not entirely coincidental, as the original Redwoods tree selection tool was developed by Rob Malouf, prior to his joining the Alpino team at Groningen.
9. More recent work at Groningen has focused on annotated resources that combine syntactic and semantic representations, this time for English, in the form of the Groningen Meaning Bank [4]. This work, however, does not build on either a precision hand-crafted grammar or a discriminant-based treebanking strategy, so it is of less direct relevance here.
10. A correct analysis will also be lacking for sentences containing authored errors, for example from careless editing or typographical mistakes or second-language interference, but in the current Redwoods corpora such errors are not frequent enough to affect the present discussion.
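
The fallback described in note 6 (apply all stored discriminants first, and retry with only the manually selected ones if every new analysis is rejected) can be illustrated with a minimal Python sketch. This is an illustration under simplifying assumptions, not the Redwoods tooling itself: each analysis is modelled as a set of property labels, and the names `Discriminant`, `filter_analyses`, `update_annotation`, and all property labels are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

Analysis = FrozenSet[str]  # assumption: an analysis is a set of property labels


@dataclass(frozen=True)
class Discriminant:
    prop: str     # hypothetical label for a property an analysis may have
    accept: bool  # True: analysis must have the property; False: must not
    manual: bool  # True if chosen by the annotator, False if inferred


def filter_analyses(analyses: List[Analysis],
                    discriminants: List[Discriminant]) -> List[Analysis]:
    """Keep the analyses that are consistent with every discriminant."""
    return [a for a in analyses
            if all((d.prop in a) == d.accept for d in discriminants)]


def update_annotation(new_analyses: List[Analysis],
                      old_discriminants: List[Discriminant]) -> List[Analysis]:
    """Re-apply stored decisions to analyses from a revised grammar."""
    survivors = filter_analyses(new_analyses, old_discriminants)
    if survivors:
        return survivors
    # Every new analysis was rejected: retry with manual decisions only,
    # ignoring inferred discriminants that may now be inconsistent.
    manual_only = [d for d in old_discriminants if d.manual]
    return filter_analyses(new_analyses, manual_only)


# Two analyses from a revised grammar in which "rule-v1" no longer exists.
a1 = frozenset({"pp-attaches-to-verb", "rule-v2"})
a2 = frozenset({"pp-attaches-to-noun", "rule-v2"})
old = [
    Discriminant("pp-attaches-to-verb", accept=True, manual=True),
    Discriminant("rule-v1", accept=True, manual=False),  # inferred, now stale
]
result = update_annotation([a1, a2], old)
```

In this toy case the inferred discriminant refers to a property that the revised grammar no longer produces, so the full set rejects both analyses, while the single manual decision still selects `a1`.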

References

Abney, S.P.: Stochastic attribute-value grammars. Comput. Ling. 23, 597–618 (1997)
Adolphs, P., Oepen, S., Callmeier, U., Crysmann, B., Flickinger, D., Kiefer, B.: Some fine points of hybrid natural language parsing. In: Proceedings of the 6th International Conference on Language Resources and Evaluation. Marrakech, Morocco (2008)
Alshawi, H. (ed.): The Core Language Engine. MIT Press, Cambridge (1992)
Basile, V., Bos, J., Evang, K., Venhuizen, N.: UGroningen. Negation detection with discourse representation structures. In: Proceedings of the 1st Joint Conference on Lexical and Computational Semantics, pp. 301–309. Montréal, Canada (2012)
Bender, E.M.: Grammar engineering for linguistic hypothesis testing. In: Gaylord, N., Palmer, A., Ponvert, E. (eds.) Proceedings of the Texas Linguistics Society X Conference. Computational Linguistics for Less-studied Languages, pp. 16–36. CSLI Publications, Stanford (2008)
Bender, E. M., Flickinger, D., Oepen, S., Zhang, Y.: Parser evaluation over local and non-local deep dependencies in a large corpus. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 397–408. Edinburgh, Scotland, UK (2011)
Bender, E. M., Flickinger, D., Oepen, S., Packard, W., Copestake, A.: Layers of interpretation: on grammar and compositionality. In: Proceedings of the 11th International Conference on Computational Semantics. London (2015)
Bond, F., Fujita, S., Hashimoto, C., Kasahara, K., Nariyama, S., Nichols, E., Amano, S.: The Hinoki Treebank. A treebank for text understanding. In: Proceedings of the 1st International Joint Conference on Natural Language Processing, pp. 158–167. Hainan Island, China (2004)
Bouma, G., van Noord, G., Malouf, R.: Alpino. Wide-coverage computational analysis of Dutch. In: Daelemans, W., Sima'an, K., Veenstra, J., Zavrel, J. (eds.) Computational Linguistics in the Netherlands, pp. 45–59. Rodopi, Amsterdam (2001)
Branco, A., Costa, F., Silva, J., Silveira, S., Castro, S., Avelãs, M., Graça, J.: Developing a deep linguistic databank supporting a collection of treebanks. The CINTIL DeepGramBank. In: Proceedings of the 7th International Conference on Language Resources and Evaluation. Valletta, Malta (2010)
Brants, S., Dipper, S., Hansen, S., Lezius, W., Smith, G.: The TIGER treebank. In: Proceedings of the First Workshop on Treebanks and Linguistic Theories, Sozopol (2002)
Carter, D.: The TreeBanker. A tool for supervised training of parsed corpora. In: Proceedings of the Workshop on Computational Environments for Grammar Development and Linguistic Engineering, pp. 9–15. Madrid, Spain (1997)
Copestake, A., Flickinger, D., Pollard, C., Sag, I.A.: Minimal Recursion Semantics. An introduction. Res. Lang. Comput. 3(4), 281–332 (2005)
Dipper, S.: Grammar-based corpus annotation. In: Proceedings of the Workshop on Linguistically Interpreted Corpora, pp. 56–64. Luxembourg, Luxembourg (2000)
Flickinger, D.: On building a more efficient grammar by exploiting types. Nat. Lang. Eng. 6(1), 15–28 (2000) (Eds: Flickinger, D., Oepen, S., Tsujii, J., Uszkoreit, H.)
Flickinger, D.: Accuracy vs. robustness in grammar engineering. In: Bender, E.M., Arnold, J.E. (eds.) Language from a Cognitive Perspective: Grammar, Usage, and Processing, pp. 31–50. CSLI Publications, Stanford (2011)
Flickinger, D., Wasow, T.: A corpus-driven analysis of the Do-Be construction. In: Hofmeister, P., Norcliffe, E. (eds.) The Core and the Periphery: Data-driven Perspectives on Syntax Inspired by Ivan A. Sag, pp. 35–64. CSLI Publications, Stanford (2013)
Flickinger, D., Oepen, S., Ytrestøl, G.: WikiWoods. Syntacto-semantic annotation for English Wikipedia. In: Proceedings of the 7th International Conference on Language Resources and Evaluation. Valletta, Malta (2010)
Flickinger, D., Kordoni, V., Zhang, Y., Branco, A., Simov, K., Osenova, P., Castro, S.: ParDeepBank. Multiple parallel deep treebanking. In: Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, pp. 97–108. Edições Colibri, Lisbon (2012)
Flickinger, D., Zhang, Y., Kordoni, V.: DeepBank. A dynamically annotated treebank of the Wall Street Journal. In: Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, pp. 85–96. Edições Colibri, Lisbon (2012)
Fokkens, A., Bender, E.M.: Time travel in grammar engineering. Using a metagrammar to broaden the search space. In: Duchier, D., Parmentier, Y. (eds.) Proceedings of the ESSLLI Workshop on High-Level Methodologies in Grammar Engineering, pp. 105–116. Düsseldorf, Germany (2013)
Fujita, S., Bond, F., Tanaka, T., Oepen, S.: Exploiting semantic information for HPSG parse selection. Res. Lang. Comput. 8(1), 1–22 (2010)
Gawron, J.M., King, J., Lamping, J., Loebner, E., Paulson, E.A., Pullum, G. K., Wasow, T.: Processing English with a Generalized Phrase Structure Grammar. In: Proceedings of the 20th Meeting of the Association for Computational Linguistics, pp. 74–81. Toronto, Ontario, Canada (1982)
Hajič, J.: Building a syntactically annotated corpus. The Prague Dependency Treebank. In: Issues of Valency and Meaning, pp. 106–132. Karolinum, Prague (1998)
Hemphill, C.T., Godfrey, J.J., Doddington, G.R.: The ATIS spoken language systems pilot corpus. In: Proceedings of the DARPA Speech and Natural Language Workshop, pp. 96–101. (1990)
Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., Weischedel, R.: Ontonotes: The 90% solution. In: Proceedings of Human Language Technologies: The 2006 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short papers, pp. 57–60. New York City, USA (2006)
Ivanova, A., Oepen, S., Øvrelid, L., Flickinger, D.: Who did what to whom? A contrastive study of syntacto-semantic dependencies. In: Proceedings of the Sixth Linguistic Annotation Workshop, pp. 2–11. Jeju, Republic of Korea (2012)
Johnson, M., Geman, S., Canon, S., Chi, Z., Riezler, S.: Estimators for stochastic ‘unification-based’ grammars. In: Proceedings of the 37th Meeting of the Association for Computational Linguistics, pp. 535–541. College Park, USA (1999)
Kingsbury, P., Palmer, M.: From TreeBank to PropBank. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, pp. 1989–1993. Las Palmas, Spain (2002)
Losnegaard, G.S., Lyse, G.I., Thunes, M., Rosén, V., Smedt, K.D., Dyvik, H., Meurer, P.: What we have learned from Sofie. Extending lexical and grammatical coverage in an LFG parsebank. In: Proceedings of the META-RESEARCH Workshop on Advanced Treebanking at LREC2012, pp. 69–76. Istanbul, Turkey (2012)
MacKinlay, A., Dridan, R., Flickinger, D., Oepen, S., Baldwin, T.: Using external treebanks to filter parse forests for parse selection and treebanking. In: Proceedings of the 2011 International Joint Conference on Natural Language Processing, pp. 246–254. Chiang Mai, Thailand (2011)
Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: The Penn Treebank. Comput. Ling. 19, 313–330 (1993)
Marimon, M., Fisas, B., Bel, N., Villegas, M., Vivaldi, J., Torner, S., Villegas, M.: The IULA Treebank. In: Proceedings of the 8th International Conference on Language Resources and Evaluation, pp. 1920–1926. Istanbul, Turkey (2012)
Oepen, S., Flickinger, D.P.: Towards systematic grammar profiling. Test suite technology ten years after. Comput. Speech Lang. 12(4), 411–436 (1998)
Oepen, S., Lønning, J.T.: Discriminant-based MRS banking. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, pp. 1250–1255. Genoa, Italy (2006)
Oepen, S., Flickinger, D., Toutanova, K., Manning, C. D.: LinGO Redwoods. A rich and dynamic treebank for HPSG. Res. Lang. Comput. 2(4), 575–596 (2004)
Packard, W.: Full forest treebanking. Unpublished master’s thesis, University of Washington (2015)
Pollard, C., Sag, I.A.: Information-based syntax and semantics. Volume 1: Fundamentals. CSLI Publications, Stanford (1987)
Pollard, C., Sag, I.A.: Head-Driven Phrase Structure Grammar. The University of Chicago Press, Chicago (1994)
Pozen, Z.: Using lexical and compositional semantics to improve HPSG parse selection. Unpublished master’s thesis, University of Washington (2013)
Rimell, L., Clark, S., Steedman, M.: Unbounded dependency recovery for parser evaluation. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 813–821. Singapore (2009)
Rosén, V., Meurer, P., De Smedt, K.: Designing and implementing discriminants for LFG grammars. In: Butt, M., King, T.H. (eds.) Proceedings of the 12th International LFG Conference. Stanford, USA (2007)
Song, S., Bender, E.M.: Individual constraints for information structure. In: Müller, S. (ed.) Proceedings of the 19th International Conference on Head-Driven Phrase Structure Grammar, pp. 330–348. CSLI Publications, Stanford, CA, USA (2012)
van der Beek, L., Bouma, G., Malouf, R., van Noord, G.: The Alpino dependency treebank. In: Theune, M., Nijholt, A., Hondorp, H. (eds.) Computational Linguistics in the Netherlands 2001. Selected papers from the twelfth CLIN meeting. Rodopi, Amsterdam (2002)
Zhang, Y., Krieger, H.-U.: Large-scale corpus-driven PCFG approximation of an HPSG. In: Proceedings of the 12th International Conference on Parsing Technologies, pp. 198–208. Dublin, Ireland (2011)
Zhang, Y., Wang, R.: Cross-domain dependency parsing using a deep linguistic grammar. In: Proceedings of the 47th Meeting of the Association for Computational Linguistics, pp. 378–386. Suntec, Singapore (2009)

Author information

Authors and Affiliations

Stanford University, Stanford, USA
Dan Flickinger
University of Oslo, Oslo, Norway
Stephan Oepen
University of Washington, Seattle, USA
Emily M. Bender

Corresponding author

Correspondence to Dan Flickinger.

Editor information

Editors and Affiliations

Department of Computer Science, Vassar College, Poughkeepsie, New York, USA
Nancy Ide
Department of Computer Science, Volen Center for Complex Systems, Brandeis University, Waltham, Massachusetts, USA
James Pustejovsky

About this chapter

Cite this chapter

Flickinger, D., Oepen, S., Bender, E.M. (2017). Sustainable Development and Refinement of Complex Linguistic Annotations at Scale. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_14

DOI: https://doi.org/10.1007/978-94-024-0881-2_14
Published: 17 June 2017
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences (R0)
