Abstract
Despite the large number of diachronic studies across languages, the methodology for investigating language change has evolved little over the last fifty years. Following trends in other fields, we argue in this paper for adopting a machine learning approach in diachronic studies, which could offer more efficient analysis of a large number of features and easier comparison of results across genres, languages and language varieties. We suggest using statistical tests as an initial feature selection step in an approach that takes the F-measure of the classification algorithms as a measure of the extent of diachronic change. Furthermore, we compare classification performance after feature selection by statistical tests with performance after feature selection by the CfsSubsetEval attribute selection algorithm. The experiments were conducted on the British part of the ‘Brown family’ of corpora, the largest existing diachronic collection of corpora of 20th-century written English, using 23 different stylistic features. The results demonstrate that using statistical tests for feature selection can significantly increase the accuracy of the classification algorithms.
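The general idea of the approach can be illustrated with a minimal sketch. It is not the experimental setup of the paper: the abstract does not name the statistical tests or the classifiers, so the sketch assumes Welch's t statistic with an arbitrary threshold of 2.0 for feature selection, a simple nearest-centroid classifier, and synthetic data with four hypothetical features, of which only the first changes between the two periods. All function and variable names are illustrative.

```python
import math
import random
from statistics import mean, variance

rng = random.Random(0)  # fixed seed so the synthetic data are reproducible


def make_texts(n, shift):
    """Synthetic 'texts' with 4 features; only feature 0 differs by period (assumption)."""
    return [[rng.gauss(shift, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(3)]
            for _ in range(n)]


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def select_features(early, late, threshold=2.0):
    """Keep the features whose |t| between the two periods exceeds the threshold."""
    keep = []
    for j in range(len(early[0])):
        t = welch_t([row[j] for row in early], [row[j] for row in late])
        if abs(t) > threshold:
            keep.append(j)
    return keep


def centroid(rows, idx):
    """Mean vector of the selected features."""
    return [mean([row[j] for row in rows]) for j in idx]


def predict(row, c_early, c_late, idx):
    """Nearest-centroid label: 0 = early period, 1 = late period."""
    d_early = sum((row[j] - c) ** 2 for j, c in zip(idx, c_early))
    d_late = sum((row[j] - c) ** 2 for j, c in zip(idx, c_late))
    return 0 if d_early <= d_late else 1


def f_measure(gold, pred, positive=1):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two periods, split into training and test portions.
early, late = make_texts(60, 0.0), make_texts(60, 3.0)
train_early, test_early = early[:40], early[40:]
train_late, test_late = late[:40], late[40:]

# Step 1: statistical test on the training data selects the features.
selected = select_features(train_early, train_late)

# Step 2: classify held-out texts using only the selected features.
c_early = centroid(train_early, selected)
c_late = centroid(train_late, selected)
gold = [0] * len(test_early) + [1] * len(test_late)
pred = [predict(row, c_early, c_late, selected) for row in test_early + test_late]

# Step 3: the F-measure serves as a proxy for the extent of change.
score = f_measure(gold, pred)
print("selected features:", selected)
print("F-measure:", round(score, 3))
```

In this reading, a higher F-measure means the two periods are easier to tell apart on the selected features, which is how classification performance can stand in for the extent of diachronic change.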
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Štajner, S., Evans, R. (2013). Can Statistical Tests Be Used for Feature Selection in Diachronic Text Classification? In: Dediu, AH., Martín-Vide, C., Mitkov, R., Truthe, B. (eds) Statistical Language and Speech Processing. SLSP 2013. Lecture Notes in Computer Science, vol 7978. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39593-2_24
DOI: https://doi.org/10.1007/978-3-642-39593-2_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39592-5
Online ISBN: 978-3-642-39593-2
eBook Packages: Computer Science (R0)