Abstract
Cognates are words of the same origin that belong to distinct languages. The problem of automatic identification of cognates arises in language reconstruction and bitext-related tasks. The evidence of cognation may come from various information sources, such as phonetic similarity, semantic similarity, and recurrent sound correspondences. I discuss ways of defining the measures of the various types of similarity and propose a method of combining them into an integrated cognate identification program. The new method requires no manual parameter tuning and performs well when tested on the Indoeuropean and Algonquian lexical data.
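The idea of scoring a word pair from more than one information source can be illustrated with a minimal Python sketch. It is not the method proposed in the chapter, which the abstract does not specify. It assumes a normalized edit distance over spellings as a crude stand-in for phonetic similarity, a word-overlap score over glosses as a stand-in for semantic similarity, and a fixed interpolation weight. All function names and the `weight` parameter are hypothetical, and the fixed weight is exactly the kind of hand-set parameter that the proposed method is said to avoid.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a rough orthographic proxy for phonetic similarity."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def gloss_similarity(g1: str, g2: str) -> float:
    """Jaccard overlap of gloss words; a rough proxy for semantic similarity."""
    s1, s2 = set(g1.lower().split()), set(g2.lower().split())
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)


def combined_score(w1: str, g1: str, w2: str, g2: str, weight: float = 0.5) -> float:
    """Linear interpolation of the two scores; `weight` is an assumed, hand-set value."""
    return weight * surface_similarity(w1, w2) + (1.0 - weight) * gloss_similarity(g1, g2)


if __name__ == "__main__":
    # English "night" and German "Nacht", both glossed as "night".
    print(combined_score("night", "night", "nacht", "night"))
```

Ranking all cross-language word pairs by such a score and inspecting the top of the list is one simple way to use it. Evidence from recurrent sound correspondences would need a further component that this sketch leaves out.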
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kondrak, G. (2004). Combining Evidence in Cognate Identification. In: Tawfik, A.Y., Goodwin, S.D. (eds) Advances in Artificial Intelligence. Canadian AI 2004. Lecture Notes in Computer Science, vol 3060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24840-8_4
DOI: https://doi.org/10.1007/978-3-540-24840-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22004-6
Online ISBN: 978-3-540-24840-8
eBook Packages: Springer Book Archive