Abstract
Although term weighting is commonly used to improve text classification performance, it may not give consistent results when the data distribution is imbalanced. This paper presents a probability-based term weighting approach that addresses different aspects of the class imbalance problem in text classification. We propose two term evaluation functions, PNF and \(PNF^2\), which can produce more influential weights on imbalanced data sets. These functions determine the significance of a term in association with a particular category. This is a crucial point: on the one hand, a frequent term is more important than a rare term in a particular category according to the feature selection approach; on the other hand, a rare term is no less important than a frequent term under the idf assumption of traditional term weighting. Incorporating these two approaches at the same time is the main idea that makes the proposed functions superior to other weighting methods. Experiments on two popular benchmarks (Reuters-21578 and WebKB) demonstrate that the probability-based term weighting approach yields more consistent results than the other methods on imbalanced data sets.
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Naderalvojoud, B., Sezer, E.A., Ucan, A. (2015). Imbalanced Text Categorization Based on Positive and Negative Term Weighting Approach. In: Král, P., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2015. Lecture Notes in Computer Science, vol 9302. Springer, Cham. https://doi.org/10.1007/978-3-319-24033-6_37
DOI: https://doi.org/10.1007/978-3-319-24033-6_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24032-9
Online ISBN: 978-3-319-24033-6
eBook Packages: Computer Science, Computer Science (R0)