Abstract
This article presents a novel text categorization framework based on the Support Vector Machine (SVM) and the Similarity Measure for Text Processing (SMTP). The performance of an SVM depends mainly on the choice of kernel function and soft-margin parameter C. To reduce the impact of both choices, we develop a framework called SVM-SMTP, which uses the SMTP measure in place of the optimal separating hyperplane as the categorization decision function. To assess the efficacy of the SVM-SMTP framework, we carried out experiments on two publicly available datasets: Reuters-21578 and 20-NewsGroups. We compared the results of SVM-SMTP with those obtained using four other similarity measures: Euclidean, cosine, correlation, and Jaccard. The experimental results show that the SVM-SMTP framework outperforms the other similarity measures in terms of categorization accuracy.
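The abstract does not spell out the decision rule, so the following is only a minimal sketch of how a similarity-based decision function of this kind might look. It assumes the SMTP definition as usually attributed to Lin, Jiang and Lee (2014): features present in both documents earn a Gaussian-shaped reward, features present in only one incur a penalty λ, and features absent from both are ignored. The function names, the mean-similarity rule over each category's support vectors, and the toy vectors are all hypothetical illustrations, not the authors' implementation.

```python
import math


def smtp_similarity(d1, d2, sigma, lam=1.0):
    """SMTP similarity between two equal-length, non-negative term-weight vectors.

    sigma[j] is assumed to be the (positive) standard deviation of feature j
    over the training set; lam is the presence/absence mismatch penalty.
    """
    num = 0.0  # running sum of per-feature contributions
    den = 0.0  # number of features present in at least one document
    for x, y, s in zip(d1, d2, sigma):
        if x == 0 and y == 0:
            continue  # absent from both documents: ignored
        den += 1.0
        if x * y > 0:
            # present in both: reward shrinks as the weights diverge
            num += 0.5 * (1.0 + math.exp(-((x - y) / s) ** 2))
        else:
            # present in exactly one document: fixed penalty
            num -= lam
    if den == 0:
        return 0.0  # assumption: two empty documents are treated as dissimilar
    return (num / den + lam) / (1.0 + lam)


def classify(doc, support_vectors, sigma, lam=1.0):
    """Return the label whose support vectors are, on average, most similar to doc.

    support_vectors maps each category label to a list of vectors. Averaging
    per category is an assumed decision rule, chosen for illustration.
    """
    best_label, best_score = None, -1.0
    for label, vectors in support_vectors.items():
        score = sum(smtp_similarity(doc, v, sigma, lam) for v in vectors) / len(vectors)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: four term weights per document, unit standard deviations.
sigma = [1.0, 1.0, 1.0, 1.0]
support_vectors = {
    "sports": [[2.0, 1.0, 0.0, 0.0]],
    "finance": [[0.0, 0.0, 3.0, 1.0]],
}
predicted = classify([1.0, 1.0, 0.0, 0.0], support_vectors, sigma)
```

In practice the per-category support vectors would come from a trained SVM (for example, the `support_vectors_` attribute of scikit-learn's `SVC`) rather than being written by hand. Because only the support vectors are used at decision time, the kernel and C influence the result only through which training points are retained.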
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Revanasiddappa, M.B., Harish, B.S. (2019). A New Framework to Categorize Text Documents Using SMTP Measure. In: Ray, K., Sharma, T., Rawat, S., Saini, R., Bandyopadhyay, A. (eds) Soft Computing: Theories and Applications. Advances in Intelligent Systems and Computing, vol 742. Springer, Singapore. https://doi.org/10.1007/978-981-13-0589-4_48
DOI: https://doi.org/10.1007/978-981-13-0589-4_48
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0588-7
Online ISBN: 978-981-13-0589-4
eBook Packages: Intelligent Technologies and Robotics (R0)