Abstract
The availability of annotated data, with annotation as rich and “deep” as possible, is desirable for any new development. Textual data are used in the so-called training phase of various empirical methods that address problems in the field of computational linguistics. While many methods use texts in their plain (or raw) form, in most cases for so-called unsupervised training, more accurate results may be obtained if annotated corpora are available. Data annotation is itself a complex task. While morphologically annotated corpora (pioneered by Henry Kučera in the 1960s) are now available for English and other languages, syntactically annotated corpora are rare. Inspired by the Penn Treebank, the most widely used syntactically annotated corpus of English, we decided to develop a similarly sized corpus of Czech with a rich annotation scheme.
References
Bémová Alla, Buráňová Eva, Hajič Jan, Kárník Jiří, Pajas Petr, Panevová Jarmila, Urešová Zdeňka, Jan Štěpánek. (1997). Anotace na analytické rovině - příručka pro anotátory [Annotation on the Analytical Level - Annotator’s Guidelines], Technical Report #4 (draft), ÚFAL MFF UK, Prague, Czech Republic (in Czech).
Chen Keh-Jiann et al. (2003). Sinica Treebank, this volume.
Collins, Michael. (1997). Statistical Parser Based on Bigram Lexical Dependencies. In Proceedings of the 35th Annual Meeting of the ACL/EACL’97, p. 16–23, Madrid, Spain.
Collins, Michael, Hajič Jan, Brill Eric, Ramshaw Lance, Christopher Tillmann. (1999). A Statistical Parser of Czech. In Proceedings of 37th ACL’99, p. 505–512, University of Maryland, College Park, June 22-25.
Czech National Corpus (CNC). http://ucnk.ff.cuni.cz.
Hajič, Jan. (1998). Building a Syntactically Annotated Corpus: The Prague Dependency Treebank. In Issues of Valency and Meaning. Studies in Honor of Jarmila Panevová, ed. Eva Hajičová, p. 106–132, Karolinum, Charles University Press, Prague, Czech Republic.
Hajič, Jan. (in press). Disambiguation of Rich Inflection (Computational Morphology of Czech). Charles University Press — Karolinum.
Hajič, Jan, Brill Eric, Collins Michael, Hladká Barbora, Jones Douglas, Kuo Cynthia, Ramshaw Lance, Schwartz Oren, Tillmann Christopher, Daniel Zeman. (1998). Core Natural Language Processing Technology Applicable to Multiple Languages: Workshop98 Final Report for the 1998 Language Engineering Workshop for Students and Professionals: Integrating Research and Education, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, Research Note 37.
Hajič, Jan, Eva Hajičová. (1997). Syntactic Tagging in the Prague Tree Bank. In Proceedings of the Second European Seminar “Language Applications for a Multilingual Europe” (ed. by R. Marcinkeviciene and N. Volz), p. 55–68, Kaunas.
Hajič, Jan, Barbora Hladká. (1997). Probabilistic and Rule-Based Tagger of an Inflective Language — a Comparison. In Proceedings of the 5th Conference on Applied Natural Language Processing, p. 111–118, Washington, USA.
Hajič, Jan, Barbora Hladká. (1998). Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In Proceedings of COLING-ACL Conference, p. 483–490, Montreal, Canada.
Hajičová, Eva. (2000). Dependency-Based Underlying-Structure Tagging of a Very Large Corpus, TAL, 41-1, p. 47–66.
Hajičová, Eva, Panevová Jarmila, Petr Sgall. (1998). Language Resources Need Annotations To Make Them Really Reusable: The Prague Dependency Treebank. In Proceedings of the First International Conference on Language Resources & Evaluation. Granada, Spain, p. 713–718.
Křen, Michal. (1996). GRAPH editor. MSc. Thesis, Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic.
Marcus M. P., Kim G., Marcinkiewicz M. A. et al. (1994). The Penn Treebank: Annotating Predicate Argument Structure. In Proceedings of the ARPA Human Language Technology Workshop. San Francisco: Morgan Kaufmann.
Marcus M. P., Santorini Beatrice, Marcinkiewicz M. A. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2), 313–330.
Palmer, M., Dang, H.T., J. Rosenzweig. (2000). Sense Tagging the Penn Treebank. In: Proceedings of LREC’00, Athens, Greece.
Panevová, Jarmila. (1980). Formy a funkce ve stavbě české věty [Forms and functions in the structure of the Czech sentence], Prague: Academia.
Prague Dependency Treebank (PDT). http://ufal.ms.mff.cuni.cz/pdt/pdt.html.
Sgall, Petr. (1967). Generativní popis jazyka a česká deklinace [Generative Description of Language and Czech Declension]. Academia, Prague, Czech Republic.
Sgall, Petr, Hajičová Eva, Jarmila Panevová. (1986). The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. Reidel Publishing Company, Dordrecht, Netherlands, Academia, Prague, Czech Republic.
Šmilauer, Vladimír. (1969). Novočeská skladba [Syntax of Contemporary Czech], 3rd ed., SPN, Prague, Czech Republic.
Copyright information
© 2003 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Böhmová, A., Hajič, J., Hajičová, E., Hladká, B. (2003). The Prague Dependency Treebank. In: Abeillé, A. (ed.) Treebanks. Text, Speech and Language Technology, vol 20. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0201-1_7
DOI: https://doi.org/10.1007/978-94-010-0201-1_7
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-1335-5
Online ISBN: 978-94-010-0201-1
eBook Packages: Springer Book Archive