MULTEXT-East

Erjavec, Tomaž

doi:10.1007/978-94-024-0881-2_17

Abstract

The chapter presents the MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The MULTEXT-East dataset includes the EAGLES-based morphosyntactic specifications, morphosyntactic lexicons, and annotated multilingual corpora. The parallel corpus, the novel “1984” by George Orwell, is sentence-aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, using the Text Encoding Initiative Guidelines, TEI P5, and cover 16 languages: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. The dataset is extensively documented and freely available for research purposes. This case study gives a history of the development of the MULTEXT-East resources, presents their encoding and components, discusses related work, and gives some conclusions.
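As a rough illustration of what working with TEI-encoded, morphosyntactically annotated text can look like, the following minimal Python sketch reads word tokens together with their lemmas and morphosyntactic descriptions (MSDs). It rests on assumptions not stated in the abstract: that each token is a TEI `w` element carrying a `lemma` attribute and an `ana` attribute pointing to an MSD, and that the sample sentence and its tags are invented for illustration rather than copied from the corpus. The function name `read_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace; elements in a TEI document are assumed to live here.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample sentence: the element layout and the MSD tags below are
# illustrative assumptions, not an excerpt from the MULTEXT-East corpus.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0" xml:id="demo.1">
  <w lemma="it" ana="#Pp3ns">It</w>
  <w lemma="be" ana="#Vmis3s">was</w>
  <w lemma="cold" ana="#Afp">cold</w>
  <c>.</c>
</s>"""


def read_tokens(xml_text):
    """Return (word form, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        # The ana attribute is assumed to be a pointer such as "#Pp3ns";
        # drop the leading "#" to obtain the bare MSD string.
        msd = w.get("ana", "").lstrip("#")
        tokens.append((w.text, w.get("lemma"), msd))
    return tokens


tokens = read_tokens(SAMPLE)
for form, lemma, msd in tokens:
    print(f"{form}\t{lemma}\t{msd}")
```

Punctuation (`c` elements in this sketch) is skipped on purpose, since only word tokens are assumed to carry lemmas and MSDs. The actual element and attribute names should be checked against the schema distributed with the resources.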

Author information

Authors and Affiliations

Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000, Ljubljana, Slovenia
Tomaž Erjavec

Corresponding author

Correspondence to Tomaž Erjavec.

Editor information

Editors and Affiliations

Department of Computer Science, Vassar College, Poughkeepsie, New York, USA
Nancy Ide
Department of Computer Science, Volen Center for Complex Systems, Brandeis University, Waltham, Massachusetts, USA
James Pustejovsky

Appendix 1. Examples of Annotated Text from Orwell’s “1984”

About this chapter

Cite this chapter

Erjavec, T. (2017). MULTEXT-East. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_17

DOI: https://doi.org/10.1007/978-94-024-0881-2_17
Published: 17 June 2017
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences (R0)
