Abstract
Feature selection in text classification reduces the dimensionality of the vector space model. This lowers the computational cost of model training and improves classification quality by eliminating noisy features. This paper examines a modified pointwise mutual information-based method for feature selection (mPMI-based feature selection) in text classification. The proposed approach overcomes the perceived shortcomings of the PMI feature selection measure. The experimental results are summarized and analyzed to compare the proposed approach with other feature selection approaches across different classifiers and datasets. The results confirm that, for a small number of selected features, mPMI-based feature selection either performs comparably to the other approaches or significantly improves text classification performance.
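For orientation, the following is a minimal sketch of feature selection with the standard (unmodified) PMI measure, PMI(t, c) = log(P(t, c) / (P(t) P(c))), estimated from document counts. It is not the mPMI method of this paper, whose definition is not given in this section. The function names `pmi_scores` and `select_top_k`, the toy corpus, and the choice to score each term by its maximum PMI over classes are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def pmi_scores(docs, labels):
    """Score each term by its maximum standard PMI over all classes.

    docs   -- list of token lists (one list per document)
    labels -- list of class labels, aligned with docs
    Probabilities are estimated from document-level presence counts.
    """
    n_docs = len(docs)
    class_count = Counter(labels)        # documents per class
    term_count = Counter()               # documents containing each term
    joint_count = defaultdict(Counter)   # joint_count[term][cls]
    for tokens, cls in zip(docs, labels):
        for term in set(tokens):         # count a term once per document
            term_count[term] += 1
            joint_count[term][cls] += 1

    scores = {}
    for term, per_class in joint_count.items():
        best = -math.inf
        for cls, n_tc in per_class.items():
            # PMI(t, c) = log( P(t, c) / (P(t) * P(c)) )
            pmi = math.log(n_tc * n_docs / (term_count[term] * class_count[cls]))
            best = max(best, pmi)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus, for illustration only.
docs = [
    ["cheap", "pills", "offer"],
    ["meeting", "agenda"],
    ["cheap", "offer", "meeting"],
    ["agenda", "notes"],
]
labels = ["spam", "ham", "spam", "ham"]

scores = pmi_scores(docs, labels)
selected = select_top_k(scores, 2)
```

In this toy corpus, "pills" (one document) receives the same score as "cheap" (two documents). This bias toward rare terms is a commonly cited weakness of plain PMI; whether it is the shortcoming the proposed modification targets is not stated in this section.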
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Georgieva-Trifonova, T. (2022). Modified Pointwise Mutual Information-Based Feature Selection for Text Classification. In: Arai, K. (eds) Proceedings of the Future Technologies Conference (FTC) 2021, Volume 2. FTC 2021. Lecture Notes in Networks and Systems, vol 359. Springer, Cham. https://doi.org/10.1007/978-3-030-89880-9_26
DOI: https://doi.org/10.1007/978-3-030-89880-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89879-3
Online ISBN: 978-3-030-89880-9
eBook Packages: Intelligent Technologies and Robotics (R0)