Abstract
The aim of this work is to build a generic document clustering model that automatically groups related documents together. The model combines unsupervised and supervised learning and assumes no prior knowledge of the given domain. No manual effort is required to create the training set; instead, the proposed model generates the training documents automatically and then uses them to categorize text documents. The process is broadly divided into two steps. In step 1, an initial classification is performed in an unsupervised way: the K-means algorithm is applied to the unlabeled documents to prepare the training dataset. Each text document is represented as a feature vector in which the extracted keywords serve as features, and selected representative documents serve as the initial centroids. In step 2, a supervised classifier is trained on the documents categorized in step 1. A Naive Bayes classifier, a statistical text classifier that uses word frequencies as features, is used for this purpose.
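The two steps can be illustrated with a minimal, standard-library-only Python sketch. It is not the chapter's implementation: the toy corpus, the function names, the use of raw term counts over the whole vocabulary in place of keyword extraction, the squared Euclidean distance, and the add-one (Laplace) smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return [w for w in text.lower().split() if w.isalpha()]

def to_vector(text, vocab):
    # Term-count vector; stands in for the keyword features (assumption).
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, seed_indices, max_iter=20):
    # Step 1: K-means, with chosen representative documents as initial centroids.
    centroids = [[float(x) for x in vectors[i]] for i in seed_indices]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def train_naive_bayes(docs, labels, vocab):
    # Step 2: multinomial Naive Bayes on word frequencies, add-one smoothing.
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, lab in zip(docs, labels) if lab == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in tokenize(d))
        total = sum(counts[w] for w in vocab) + len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return log_prior, log_likelihood

def classify(text, model):
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokenize(text) if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical unlabeled corpus; documents 0 and 1 act as the representatives.
docs = [
    "football match goal team",
    "election vote parliament minister",
    "team goal striker football",
    "minister vote election campaign",
    "match team striker goal",
    "parliament campaign vote minister",
]
seed_indices = [0, 1]

vocab = sorted({w for d in docs for w in tokenize(d)})
vectors = [to_vector(d, vocab) for d in docs]

labels = kmeans(vectors, seed_indices)           # automatically generated labels
model = train_naive_bayes(docs, labels, vocab)   # classifier trained on them

print(labels)                                    # [0, 1, 0, 1, 0, 1]
print(classify("striker goal football", model))  # 0
print(classify("vote campaign", model))          # 1
```

The cluster labels produced in step 1 are the only supervision the classifier receives, so the quality of the final categorization depends on how well the representative documents separate the clusters.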
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Patra, R. (2021). Automated Document Categorization Model. In: Das, S., Das, S., Dey, N., Hassanien, AE. (eds) Machine Learning Algorithms for Industrial Applications. Studies in Computational Intelligence, vol 907. Springer, Cham. https://doi.org/10.1007/978-3-030-50641-4_2
DOI: https://doi.org/10.1007/978-3-030-50641-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-50640-7
Online ISBN: 978-3-030-50641-4
eBook Packages: Intelligent Technologies and Robotics (R0)