Abstract
This chapter presents the MULTEXT-East language resources, a multilingual dataset for language engineering research that focuses on the morphosyntactic level of linguistic description. The dataset comprises EAGLES-based morphosyntactic specifications, morphosyntactic lexicons, and annotated multilingual corpora. The parallel corpus, George Orwell's novel “1984”, is sentence-aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML according to the Text Encoding Initiative Guidelines, TEI P5, and cover 16 languages: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. The dataset is extensively documented and freely available for research purposes. This case study recounts the history of the development of the MULTEXT-East resources, presents their encoding and components, discusses related work, and offers some conclusions.
Appendix 1. Examples of Annotated Text from Orwell’s “1984”
Copyright information
© 2017 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Erjavec, T. (2017). MULTEXT-East. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_17
DOI: https://doi.org/10.1007/978-94-024-0881-2_17
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences (R0)