Boot-Strapping Language Identifiers for Short Colloquial Postings

Goldszmidt, Moises; Najork, Marc; Paparizos, Stelios

doi:10.1007/978-3-642-40991-2_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8189)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

There is tremendous interest in mining the abundant user-generated content on the web. Many analysis techniques are language dependent and rely on accurate language identification as a building block. Although there is already research on language identification, it has focused on very ‘clean’, editorially managed corpora, on a limited number of languages, and on relatively large documents. These are not the characteristics of the content found in, say, Twitter or Facebook postings, which are short and riddled with vernacular.

In this paper, we propose an automated, unsupervised, scalable solution based on publicly available data. To this end, we thoroughly evaluate the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus, and we conduct a large-scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language (model) profile size, and the number of languages tested. Then, we show the value of using Wikipedia to train a language identifier that is directly applicable to Twitter. Finally, we augment the language models and customize them to Twitter by combining our Wikipedia models with location information from tweets. This method provides a massive amount of automatically labeled data that acts as a bootstrapping mechanism, which we show empirically boosts the accuracy of the models.

With this work we provide a guide and a publicly available tool [1] to the mining community for language identification on web and social data.
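
To make the pipeline described above concrete, the following is a minimal sketch and not the authors' implementation. It assumes a rank-ordered character n-gram profile compared with an out-of-place distance, in the spirit of the n-gram text categorization work listed in the references; the abstract does not say which of the evaluated algorithms performs best. The bootstrapping step assumes a simple agreement rule (keep a tweet only when the model's guess matches the dominant language of the tweet's country), which is a hypothetical stand-in for the paper's actual combination of models and location. All function names, the `country_language` mapping, and the default profile size of 300 are illustrative assumptions.

```python
from collections import Counter


def ngram_profile(text, n_max=3, size=300):
    """Map the `size` most frequent character n-grams (n = 1..n_max) to their rank."""
    # Normalize case and whitespace, and pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # Most frequent first; ties are broken alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: rank for rank, gram in enumerate(ranked[:size])}


def out_of_place(doc_profile, lang_profile, penalty=300):
    """Sum of rank differences; n-grams absent from the language profile cost `penalty`."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, models, n_max=3, size=300):
    """Return the language whose profile is closest to the profile of `text`."""
    doc_profile = ngram_profile(text, n_max, size)
    return min(models, key=lambda lang: out_of_place(doc_profile, models[lang], size))


def bootstrap_labels(tweets, models, country_language):
    """Automatically label tweets, given as (text, country) pairs.

    Assumed agreement rule (hypothetical): a tweet is kept as a training
    example only if the model's guess equals the dominant language of the
    country the tweet was sent from.
    """
    labeled = []
    for text, country in tweets:
        guess = identify(text, models)
        if country_language.get(country) == guess:
            labeled.append((text, guess))
    return labeled
```

In this sketch, `models` would be a dictionary such as `{"en": ngram_profile(english_text), "es": ngram_profile(spanish_text)}` built from Wikipedia text per language. The pairs returned by `bootstrap_labels` could then be appended to the training text of the matching language and the profiles rebuilt, which is one plausible reading of how automatically labeled tweets customize the models to Twitter.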

References

1. Automatic language identification tool, http://research.microsoft.com/lid/
2. Aggarwal, C.C. (ed.): Social Network Data Analytics. Springer (2011)
3. Bergsma, S., McNamee, P., Bagdouri, M., Fink, C., Wilson, T.: Language identification for creating language-specific Twitter collections. In: Proc. Second Workshop on Language in Social Media, pp. 65–74 (2012)
4. Carter, S., Weerkamp, W., Tsagkias, E.: Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation Journal (2013)
5. Cavnar, W.: Using an n-gram-based document representation with a vector processing retrieval model, pp. 269–269. NIST Special Publication SP (1995)
6. Cavnar, W., Trenkle, J.: N-gram-based text categorization. In: SDAIR (1994)
7. Damashek, M.: Gauging similarity with n-grams: Language-independent categorization of text. Science 267(5199), 843–848 (1995)
8. Dunning, T.: Statistical identification of language. Technical Report MCCS-94-273, New Mexico State University (1994)
9. Grothe, L., Luca, W.D., Nurnberger, A.: A comparative study on language identification methods. In: Proc. of LREC (2008)
10. Lopez, A.: Statistical machine translation. ACM Comput. Surv. 40(3) (2008)
11. Lui, M., Baldwin, T.: langid.py: An off-the-shelf language identification tool. In: Proc. of ACL (2012)
12. Majliš, M.: Yet another language identifier. In: EACL 2012, p. 46 (2012)
13. McNamee, P.: Language identification: A solved problem suitable for undergraduate instruction. Journal of Computing Sciences in Colleges 20(3) (2005)
14. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), 1–135 (2007)
15. Spearman, C.: The proof and measurement of association between two things. The American Journal of Psychology 15(1), 72–101 (1904)
16. Spearman, C.: Footrule for measuring correlation. The British Journal of Psychology 2(1), 89–108 (1906)
17. Tromp, E., Pechenizkiy, M.: Graph-based n-gram language identification on short texts. In: Proc. of BENELEARN (2011)
18. Vatanen, T., Vayrynen, J., Virpioja, S.: Language identification of short text segments with n-gram models. In: Proc. of LREC (2010)
19. Wasserman, L.: All of statistics. Springer (2004)

Author information

Authors and Affiliations

Microsoft Research, Mountain View, CA, 94043, USA
Moises Goldszmidt, Marc Najork & Stelios Paparizos

Editor information

Editors and Affiliations

Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001, Leuven, Belgium
Hendrik Blockeel
Fraunhofer IAIS, Department of Knowledge Discovery, Schloss Birlinghoven, University of Bonn, 53754, Sankt Augustin, Germany
Kristian Kersting
LIACS, Universiteit Leiden, Niels Bohrweg 1, 2333 CA, Leiden, The Netherlands
Siegfried Nijssen
Department of Computer Science and Engineering, Czech Technical University, Technicka 2, 16627, Prague 6, Czech Republic
Filip Železný

About this paper

Cite this paper

Goldszmidt, M., Najork, M., Paparizos, S. (2013). Boot-Strapping Language Identifiers for Short Colloquial Postings. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science, vol 8189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40991-2_7

DOI: https://doi.org/10.1007/978-3-642-40991-2_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40990-5
Online ISBN: 978-3-642-40991-2
eBook Packages: Computer Science, Computer Science (R0)
