Abstract
This paper investigates stylistic changes in a set of Portuguese historical texts ranging from the 17th to the early 20th century and presents a supervised method to classify them per century. Four stylistic features – average sentence length (ASL), average word length (AWL), lexical density (LD), and lexical richness (LR) – were automatically extracted for each sub-corpus. The initial analysis of diachronic changes in these four features revealed that the texts written in the 17th and 18th centuries have similar AWL, LD and LR, which differ significantly from those in the texts written in the 19th and 20th centuries. This information was later used in automatic classification of texts per century, leading to an F-Measure of 0.92.
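As a rough illustration of the four features named above, the sketch below computes them for a single text. The abstract does not give the formulas, so the definitions used here are assumptions: ASL as tokens per sentence, AWL as characters per token, LD as unique word forms over total tokens, and LR as unique lemmas over total tokens. The function name `stylistic_features` and the `lemma_of` lookup are hypothetical; the lookup stands in for a real lemmatiser, and the sentence splitting and tokenisation are deliberately naive.

```python
import re


def stylistic_features(text, lemma_of=None):
    """Return ASL, AWL, LD and LR for one text (assumed definitions)."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Tokens are runs of Unicode letters, lower-cased (digits and '_' excluded).
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    if not sentences or not tokens:
        raise ValueError("text must contain at least one word")

    # Hypothetical lemma lookup; unknown tokens fall back to their surface form.
    lemma_of = lemma_of or {}
    lemmas = {lemma_of.get(t, t) for t in tokens}

    return {
        "ASL": len(tokens) / len(sentences),               # tokens per sentence
        "AWL": sum(len(t) for t in tokens) / len(tokens),  # characters per token
        "LD": len(set(tokens)) / len(tokens),              # assumed: types / tokens
        "LR": len(lemmas) / len(tokens),                   # assumed: lemmas / tokens
    }


if __name__ == "__main__":
    toy_lemmas = {"gatos": "gato", "dorme": "dormir", "dormem": "dormir", "os": "o"}
    print(stylistic_features("O gato dorme. Os gatos dormem muito.", toy_lemmas))
```

Computed once per text, such feature vectors could then be passed to any standard supervised classifier, with the century as the class label.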
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Štajner, S., Zampieri, M. (2013). Stylistic Changes for Temporal Text Classification. In: Habernal, I., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2013. Lecture Notes in Computer Science, vol 8082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40585-3_65
DOI: https://doi.org/10.1007/978-3-642-40585-3_65
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40584-6
Online ISBN: 978-3-642-40585-3
eBook Packages: Computer Science (R0)