The Prague Dependency Treebank

A Three-Level Annotation Scenario

  • Chapter
Treebanks

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 20))

Abstract

The availability of annotated data, with annotation as rich and “deep” as possible, is desirable for any new development. Textual data are used in the so-called training phase of various empirical methods that address a range of problems in computational linguistics. While many methods use texts in their plain (or raw) form, in most cases for so-called unsupervised training, more accurate results may be obtained if annotated corpora are available. Data annotation is itself a complex task. While morphologically annotated corpora (pioneered by Henry Kučera in the 1960s) are now available for English and other languages, syntactically annotated corpora are rare. Inspired by the Penn Treebank, the most widely used syntactically annotated corpus of English, we decided to develop a similarly sized corpus of Czech with a rich annotation scheme.

Copyright information

© 2003 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Böhmová, A., Hajič, J., Hajičová, E., Hladká, B. (2003). The Prague Dependency Treebank. In: Abeillé, A. (ed.) Treebanks. Text, Speech and Language Technology, vol 20. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0201-1_7

  • DOI: https://doi.org/10.1007/978-94-010-0201-1_7

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-1-4020-1335-5

  • Online ISBN: 978-94-010-0201-1

  • eBook Packages: Springer Book Archive
