Abstract
There is tremendous interest in mining the abundant user-generated content on the web. Many analysis techniques are language-dependent and rely on accurate language identification as a building block. Existing research on language identification has focused on very ‘clean’, editorially managed corpora, on a limited number of languages, and on relatively large documents. These are not the characteristics of the content found in, say, Twitter or Facebook postings, which are short and riddled with vernacular.
In this paper, we propose an automated, unsupervised, scalable solution based on publicly available data. To this end, we thoroughly evaluate the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus. We also conduct a large-scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language (model) profile size, and the number of languages tested. We then show the value of using Wikipedia to train a language identifier directly applicable to Twitter. Finally, we augment the language models and customize them to Twitter by combining our Wikipedia models with location information from tweets. This method provides a massive amount of automatically labeled data that acts as a bootstrapping mechanism, which we empirically show boosts the accuracy of the models.
With this work, we provide the mining community with a guide and a publicly available tool [1] for language identification on web and social data.
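To make the notions of a language (model) profile and profile size concrete, the following is a minimal sketch of rank-order character n-gram identification in the style of Cavnar and Trenkle, which appears in the reference list. It is an illustration only and is not claimed to be the method or tool evaluated in the paper. The function names, the `profile_size` and `n_max` parameters, and the two toy training sentences (standing in for Wikipedia text) are all assumptions made for this example.

```python
from collections import Counter


def ngram_profile(text, n_max=3, profile_size=300):
    """Map the most frequent character n-grams (lengths 1..n_max) to their rank."""
    # Normalize case and whitespace, and pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:profile_size])}


def out_of_place(doc_profile, lang_profile, profile_size):
    """Sum of rank differences; n-grams absent from the language profile
    receive the maximum penalty (profile_size)."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else profile_size
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles, n_max=3, profile_size=300):
    """Return the language whose profile is closest to the document profile."""
    doc = ngram_profile(text, n_max, profile_size)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc, lang_profiles[lang], profile_size),
    )


# Hypothetical toy training data; a real system would use far larger corpora.
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog "
          "sleeps in the sun while the fox runs away",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego el "
          "perro duerme al sol mientras el zorro huye",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAIN.items()}

if __name__ == "__main__":
    print(identify("the dog is in the sun", PROFILES))
    print(identify("el perro duerme al sol", PROFILES))
```

In this sketch, `profile_size` caps how many ranked n-grams each language keeps, and the length of the input string plays the role of document size, the two quantities whose effect on accuracy the abstract says the study measures.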
References
Automatic language identification tool, http://research.microsoft.com/lid/
Aggarwal, C.C. (ed.): Social Network Data Analytics. Springer (2011)
Bergsma, S., McNamee, P., Bagdouri, M., Fink, C., Wilson, T.: Language identification for creating language-specific Twitter collections. In: Proc. Second Workshop on Language in Social Media, pp. 65–74 (2012)
Carter, S., Weerkamp, W., Tsagkias, E.: Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation Journal (2013)
Cavnar, W.: Using an n-gram-based document representation with a vector processing retrieval model. NIST Special Publication SP, pp. 269–269 (1995)
Cavnar, W., Trenkle, J.: N-gram-based text categorization. In: SIDAIR (1994)
Damashek, M.: Gauging similarity with n-grams: Language-independent categorization of text. Science 267(5199), 843–848 (1995)
Dunning, T.: Statistical identification of language. Technical Report MCCS-94-273, New Mexico State University (1994)
Grothe, L., De Luca, E.W., Nürnberger, A.: A comparative study on language identification methods. In: Proc. of LREC (2008)
Lopez, A.: Statistical machine translation. ACM Comput. Surv. 40(3) (2008)
Lui, M., Baldwin, T.: langid.py: An off-the-shelf language identification tool. In: Proc. of ACL (2012)
Majliš, M.: Yet another language identifier. In: EACL 2012, p. 46 (2012)
McNamee, P.: Language identification: A solved problem suitable for undergraduate instruction. Journal of Computing Sciences in Colleges 20(3) (2005)
Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), 1–135 (2007)
Spearman, C.: The proof and measurement of association between two things. The American Journal of Psychology 15(1), 72–101 (1904)
Spearman, C.: Footrule for measuring correlation. The British Journal of Psychology 2(1), 89–108 (1906)
Tromp, E., Pechenizkiy, M.: Graph-based n-gram language identification on short texts. In: Proc. of BENELEARN (2011)
Vatanen, T., Väyrynen, J., Virpioja, S.: Language identification of short text segments with n-gram models. In: Proc. of LREC (2010)
Wasserman, L.: All of statistics. Springer (2004)
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Goldszmidt, M., Najork, M., Paparizos, S. (2013). Boot-Strapping Language Identifiers for Short Colloquial Postings. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science(), vol 8189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40991-2_7
DOI: https://doi.org/10.1007/978-3-642-40991-2_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40990-5
Online ISBN: 978-3-642-40991-2
eBook Packages: Computer Science, Computer Science (R0)