Modified Pointwise Mutual Information-Based Feature Selection for Text Classification

Georgieva-Trifonova, Tsvetanka

doi:10.1007/978-3-030-89880-9_26

Tsvetanka Georgieva-Trifonova ORCID: orcid.org/0000-0002-5997-2344

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 359)

Included in the following conference series:

Proceedings of the Future Technologies Conference

Abstract

Feature selection in text classification is applied to reduce the dimensionality of the vector space model. As a result, computational costs during model training are reduced, and classification quality improves because noisy features are eliminated. This paper examines a modified pointwise mutual information-based method for feature selection (mPMI-based feature selection) in text classification. The proposed approach overcomes the perceived shortcomings of the PMI feature selection measure. The experimental results are summarized and analyzed to compare the proposed approach with other feature selection approaches across different classifiers and datasets. The results confirm that, for a small number of selected features, mPMI-based feature selection performs comparably to the other approaches or leads to a significant improvement in text classification performance.
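For orientation, the baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of standard PMI-based feature selection, PMI(t, c) = log(P(t, c) / (P(t) P(c))), estimated from document counts and aggregated per term by taking the maximum over classes. It is not the mPMI measure proposed in the paper, whose definition does not appear in the abstract. The function names, the max-over-classes aggregation, and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter


def pmi_scores(docs, labels):
    """Score each term by its maximum PMI over classes.

    Assumption: probabilities are estimated from document counts,
    so a term counts at most once per document.
    """
    n_docs = len(docs)
    class_count = Counter(labels)   # number of documents per class
    term_count = Counter()          # number of documents containing a term
    joint_count = Counter()         # documents of a class containing a term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_count[term] += 1
            joint_count[(term, label)] += 1

    scores = {}
    for (term, label), n_tc in joint_count.items():
        # PMI(t, c) = log( P(t, c) / (P(t) * P(c)) ) with count-based estimates
        pmi = math.log((n_tc * n_docs) / (term_count[term] * class_count[label]))
        scores[term] = max(scores.get(term, float("-inf")), pmi)
    return scores


def select_top_k(docs, labels, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    scores = pmi_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, not taken from the paper's datasets.
docs = [
    ["good", "phone"],
    ["good", "service"],
    ["bad", "phone"],
    ["bad", "delivery"],
]
labels = ["pos", "pos", "neg", "neg"]
scores = pmi_scores(docs, labels)
```

In this toy corpus, "service" occurs in a single document yet receives the same score as "good", which occurs in every positive document. This tendency to favor low-frequency terms is a commonly cited weakness of plain PMI; whether it is the shortcoming the paper addresses cannot be determined from the abstract alone.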

Author information

Authors and Affiliations

“St. Cyril and St. Methodius” University of Veliko Tarnovo, Veliko Tarnovo, Bulgaria
Tsvetanka Georgieva-Trifonova

Corresponding author

Correspondence to Tsvetanka Georgieva-Trifonova.

Editor information

Editors and Affiliations

Faculty of Science and Engineering, Saga University, Saga, Japan
Kohei Arai

About this paper

Cite this paper

Georgieva-Trifonova, T. (2022). Modified Pointwise Mutual Information-Based Feature Selection for Text Classification. In: Arai, K. (eds) Proceedings of the Future Technologies Conference (FTC) 2021, Volume 2. FTC 2021. Lecture Notes in Networks and Systems, vol 359. Springer, Cham. https://doi.org/10.1007/978-3-030-89880-9_26

DOI: https://doi.org/10.1007/978-3-030-89880-9_26
Published: 04 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89879-3
Online ISBN: 978-3-030-89880-9
eBook Packages: Intelligent Technologies and Robotics (R0)
