Abstract
Automatic text categorization is the task of assigning free-text documents to one or more predefined categories based on their content. The classical method for computing text similarity is to calculate the cosine of the angle between document vectors. To improve categorization performance, this paper puts forward a new algorithm that computes text similarity based on standard deviation. Experiments on Chinese text documents show the validity and feasibility of the standard deviation-based algorithm.
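As an illustration of the classical baseline mentioned in the abstract, the following is a minimal sketch of cosine similarity between two documents. It assumes raw term-frequency vectors built from pre-tokenized text; the function name `cosine_similarity` and the choice of term weighting are assumptions made for this example, not details taken from the paper. The standard deviation-based measure is not specified in the abstract, so it is not reproduced here.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two term-frequency vectors.

    doc_a and doc_b are lists of tokens (assumed to be pre-tokenized).
    """
    tf_a = Counter(doc_a)
    tf_b = Counter(doc_b)

    # Dot product: only terms present in both documents contribute.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())

    # Euclidean norms of the two term-frequency vectors.
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))

    # An empty document has no direction; treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    d1 = ["text", "similarity", "cosine", "text"]
    d2 = ["text", "categorization", "cosine"]
    print(cosine_similarity(d1, d2))
```

With non-negative term frequencies the value lies between 0 (no shared terms) and 1 (vectors pointing in the same direction), so a document can be assigned to the category whose representative vector scores highest.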
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, T., Guo, J. (2005). Text Similarity Computing Based on Standard Deviation. In: Huang, DS., Zhang, XP., Huang, GB. (eds) Advances in Intelligent Computing. ICIC 2005. Lecture Notes in Computer Science, vol 3644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11538059_48
DOI: https://doi.org/10.1007/11538059_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28226-6
Online ISBN: 978-3-540-31902-3
eBook Packages: Computer Science, Computer Science (R0)