Abstract
Text similarity is a metric for estimating the degree to which two or more texts match. In most cases it is calculated with the Vector Space Model (VSM). However, VSM is ill-suited to complex tasks because of its high dimensionality and computational complexity. For news, calculating the similarity of two texts is crucial for ascertaining whether two reports describe the same event or the same type of information. An analysis of news reports shows that five basic elements, “when”, “where”, “what”, “why”, and “who”, distinguish one report from another. Based on these features, this study proposes a method for calculating the similarity of news texts. The proposed method integrates the influence of the five news feature words into the evaluation of text similarity, which largely avoids the problems of text interference and low computational efficiency. The method has four steps: extraction of the news elements, classification of these elements, calculation of the similarity, and comparison with methods from the literature. Experimental results suggest that the proposed method outperforms the vector space cosine coefficient method, the Jaccard coefficient method, and the entropy method in terms of time complexity and computational accuracy.
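To make the comparison concrete, the following minimal Python sketch shows the two standard baselines named in the abstract (cosine coefficient over term-frequency vectors and the Jaccard coefficient) next to one possible element-wise score. The abstract does not give the paper's formula, so `element_similarity`, its equal default weights, and the use of per-element Jaccard overlap are illustrative assumptions, not the authors' method. Extraction and classification of the news elements are assumed to have been done already.

```python
import math
from collections import Counter

# The five news elements named in the abstract.
ELEMENTS = ("when", "where", "what", "why", "who")


def cosine_similarity(tokens_a, tokens_b):
    """Cosine coefficient between two bag-of-words term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard coefficient: shared distinct terms over all distinct terms."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def element_similarity(news_a, news_b, weights=None):
    """Hypothetical score: weighted average of per-element Jaccard overlap.

    news_a and news_b map each element name to a list of feature words
    that were already extracted and classified (assumed input format).
    The equal default weights are an assumption, not taken from the paper.
    """
    if weights is None:
        weights = {e: 1.0 / len(ELEMENTS) for e in ELEMENTS}
    total = sum(weights[e] for e in ELEMENTS)
    score = sum(
        weights[e] * jaccard_similarity(news_a.get(e, []), news_b.get(e, []))
        for e in ELEMENTS
    )
    return score / total
```

Because each report is reduced to five short word lists, a score of this kind compares only a handful of small sets instead of a vector over the whole vocabulary, which is one plausible reading of the efficiency claim in the abstract.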
Acknowledgments
This work is supported by the National Natural Science Foundation of China (61462054, 61363044) and the Science and Technology Plan Projects of Yunnan Province (2015FB135).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, H., Ye, J., Hou, Z., Fan, L. (2019). Fusion News Elements of News Text Similarity Calculation. In: Deng, K., Yu, Z., Patnaik, S., Wang, J. (eds) Recent Developments in Mechatronics and Intelligent Robotics. ICMIR 2018. Advances in Intelligent Systems and Computing, vol 856. Springer, Cham. https://doi.org/10.1007/978-3-030-00214-5_66
DOI: https://doi.org/10.1007/978-3-030-00214-5_66
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00213-8
Online ISBN: 978-3-030-00214-5
eBook Packages: Intelligent Technologies and Robotics (R0)