Abstract
Text similarity algorithms are widely used in many fields, such as copy detection, text classification, machine translation, intelligent question answering systems and natural language processing. The vector space model algorithm, which is currently the most commonly used, does not adequately consider semantic features, and the accuracy of its similarity results can be further improved. This paper proposes a text similarity computation method that combines HowNet with the vector space model. The computation is divided into two levels. At the word level, word-similarity calculation based on HowNet prevents the loss of semantic information. At the text level, text-similarity calculation by the vector space model ensures the integrity of the information expressed in the texts. This paper designs a news text classification experiment based on the KNN algorithm, with data obtained from part of the Chinese news in the Sogou corpora. Experimental results show that the proposed method is more accurate than the traditional vector space model algorithm.
1 Introduction
In daily life, people use text to express and transmit information, so information retrieval technology and data mining depend on the processing of text information [1]. Text similarity computation is an effective and direct method for resource acquisition and analysis. Generally, text similarity measures the degree of association between two texts: the larger the similarity value, the higher the correlation between the two texts, and vice versa. Salton et al. proposed the vector space model (VSM), which maps a text into a high-dimensional space vector [2]. In this way, the mathematical computation can be carried out easily, and the computation process is simplified. However, semantic information is often lost in the mapping because of the linear independence assumption of the vector space model, which affects the accuracy of the final results.
This paper introduces text preprocessing and the vector space model, and proposes a text similarity computation method based on HowNet and the vector space model, aiming at the problem of missing semantic information.
2 Vector Space Model
2.1 Chinese Text Preprocessing
Compared with English text, Chinese text has no spaces to separate words, so Chinese word segmentation technology is needed to split a Chinese text into a series of individual words. Some words contribute very little to the meaning of a text yet appear many times; these are known as stop words [3]. They add no value to the meaning of the text but increase the dimension of the text representation, and with it the computational complexity and time overhead, so they need to be removed.
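As a sketch, the stop-word removal step can be implemented as a simple filter over an already-segmented token list. In practice a segmenter such as jieba produces the tokens; the stop-word list below is purely illustrative, not the one used in the paper.

```python
# Illustrative stop-word list (a real system loads a full list from a file).
STOP_WORDS = {"的", "了", "是", "在", "和"}

def preprocess(tokens):
    """Remove stop words from a list of already-segmented tokens."""
    return [t for t in tokens if t not in STOP_WORDS]

# Example: "的" and "是" are dropped as stop words.
print(preprocess(["我", "的", "新闻", "是", "体育", "新闻"]))
```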
2.2 Vector Space Model
The vector space model is simple: it maps a text into a space vector and uses that vector to represent the text information. Using the tools of vector mathematics, we can directly compute the similarity between two text vectors, measured by the cosine of the angle between them.
The vector space model rests on a basic assumption: a feature item that represents text content is related only to the number of times it appears, not to its position or order [4]. The model is composed of features and their weights. Features usually refer to the non-redundant words left after text preprocessing, and each feature corresponds to a weight that describes its importance in the text. When computing the similarity of texts, the cosine theorem is generally used; the formula is as follows:
\( {\text{Sim}}\left( {{\text{D}}_{1} ,{\text{D}}_{2} } \right) = \cos \theta = \frac{{\sum\nolimits_{k = 1}^{n} {W_{1k} \times W_{2k} } }}{{\sqrt {\sum\nolimits_{k = 1}^{n} {W_{1k}^{2} } } \times \sqrt {\sum\nolimits_{k = 1}^{n} {W_{2k}^{2} } } }} \)  (1)

\( {\text{D}}_{1} \) and \( {\text{D}}_{2} \) represent two texts, and \( {\text{W}}_{ik} \) represents the weight of the k-th feature in the i-th text.
We use the TF-IDF method to calculate the weight. The formula is as follows:
\( {\text{W}}_{ik} = q \times \log \left( {\frac{N}{n} + \alpha } \right) \)  (2)

In Eq. (2), q stands for the number of times the feature \( {\text{T}}_{k} \) appears in the text \( {\text{D}}_{i} \), N is the total number of texts, n is the number of texts containing \( {\text{T}}_{k} \), and \( \alpha \) is an empirical constant generally set to 0.01.
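A minimal sketch of the TF-IDF weighting and cosine similarity described above, assuming the log-based weight form of Eq. (2) and weight vectors already aligned on the same feature dimensions (the function names are ours):

```python
import math

def tfidf_weight(q, N, n, alpha=0.01):
    """Eq. (2): weight of a feature with term frequency q, corpus size N,
    document frequency n, and empirical constant alpha."""
    return q * math.log(N / n + alpha)

def cosine_similarity(w1, w2):
    """Eq. (1): cosine of the angle between two weight vectors that share
    the same feature dimensions."""
    num = sum(a * b for a, b in zip(w1, w2))
    den = math.sqrt(sum(a * a for a in w1)) * math.sqrt(sum(b * b for b in w2))
    return num / den if den else 0.0
```

Identical vectors give a similarity of 1, and orthogonal vectors (no shared features) give 0.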
3 Text Similarity Computation
3.1 HowNet
Concepts and sememes are the foundation of HowNet [5]. A concept is a description of a word, and each word can have several concepts, which are generally expressed in a knowledge description language made up of sememes. HowNet uses one or more sememes to describe a concept, and a forest model to describe the relationships between sememes [6].
There are ten trees in the sememe forest. The relationships between sememes are complicated and are usually derived from upper and lower (hypernym-hyponym) relations; the resulting structure is called the sememe hierarchy tree, shown in Fig. 1 [7]. In addition to sememes, the description of a concept also uses a variety of semantic symbols for assistance.
3.2 The Similarity of Words
According to the distance between two sememes on the sememe hierarchy tree, the sememe similarity is calculated by Eq. (3):

\( {\text{Sim}}\left( {p_{1} ,p_{2} } \right) = \frac{\alpha }{{d + \alpha }} \)  (3)

Here \( p_{1} \) and \( p_{2} \) are the two sememes, and d is the distance between them on the sememe hierarchy tree; if they are on different trees, d approaches infinity. \( \alpha \) is an adjustable parameter whose value is 1.6.
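Under the assumptions above, Eq. (3) is a one-liner; we let d be infinite when the two sememes lie on different trees, which drives the similarity to 0 (the function name is ours):

```python
def sememe_similarity(d, alpha=1.6):
    """Eq. (3): similarity of two sememes at distance d on the sememe
    hierarchy tree; pass d = float('inf') when they lie on different trees."""
    return alpha / (d + alpha)
```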
Equation (3) considers only the upper and lower relations, ignoring the depth of the sememes, their common father nodes, and antonymy between sememes, so it needs to be improved.
\( dep\left( {p_{1} } \right) \) is the depth of sememe \( p_{1} \) in the tree, and \( dep\,(t) \) is the total depth of the tree. \( dis(p_{1} ,p_{2} ) \) is the distance between the two sememes, and \( c\,(p_{1} ,p_{2} ) \) is the number of common father nodes of \( p_{1} \) and \( p_{2} \). \( a\,(p_{1} ,p_{2} ) \) indicates whether there is antonymy on the path between \( p_{1} \) and \( p_{2} \): if there is, its value is 1; otherwise, it is 0. \( \alpha \) is an adjustable parameter whose value is 1.6.
Knowledge description language is made up of independent sememe description, relation sememe description and symbol sememe description (Table 1).
In general, if two specific words are the same, their similarity is 1; otherwise, it is 0.
For the independent sememe description, pair off sememes or specific words alternately: select the most similar pair, record its similarity value, then repeat on the remainder until one side is exhausted, discarding the remaining sememes and specific words. Finally, take the arithmetic average of the recorded maxima and record it as \( {\text{sim}}_{1} \).
For the relation sememe description, pair off sememes or specific words with the same relation in the same way, repeatedly selecting and recording the maximum similarity. Finally, take the arithmetic average and record it as \( {\text{sim}}_{2} \).
For the symbol sememe description, the processing is the same as for the relation sememe description, except that the symbol rather than the relation is compared; record the arithmetic average as \( {\text{sim}}_{3} \).
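The pairing procedure used for \( {\text{sim}}_{1} \), \( {\text{sim}}_{2} \) and \( {\text{sim}}_{3} \) can be sketched as a greedy matching that repeatedly records the maximum similarity among the remaining pairs (a sketch under our naming; the item-level similarity function is passed in):

```python
def greedy_pair_average(items1, items2, sim):
    """Repeatedly pick the most similar unmatched pair of description items,
    record its similarity, and remove both items; stop when one side is
    exhausted and return the arithmetic mean of the recorded maxima."""
    a, b = list(items1), list(items2)
    scores = []
    while a and b:
        i, j, best = max(
            ((i, j, sim(x, y)) for i, x in enumerate(a) for j, y in enumerate(b)),
            key=lambda t: t[2],
        )
        scores.append(best)
        a.pop(i)  # leftover items on the longer side are discarded
        b.pop(j)
    return sum(scores) / len(scores) if scores else 0.0
```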
The formula for calculating the similarity between concepts \( C_{1} \) and \( C_{2} \) is as follows:
\( {\text{Sim}}\left( {C_{1} ,C_{2} } \right) = \beta_{1} {\text{sim}}_{1} + \beta_{2} {\text{sim}}_{2} + \beta_{3} {\text{sim}}_{3} \)

\( \beta_{i} \;(1 \le i \le 3) \) are adjustable parameters that satisfy \( \beta_{1} + \beta_{2} + \beta_{3} = 1,\;\beta_{1} \ge \beta_{2} \ge \beta_{3} \). General experience gives \( \beta_{1} = 0.7 ,\beta_{2} = 0.17 ,\beta_{3} = 0.13 \).
Assume that the word \( W_{1} \) has m concepts \( C_{11} ,\,C_{12} , \ldots ,C_{1m} \) and the word \( W_{2} \) has n concepts \( C_{21} ,\,C_{22} , \ldots ,\,C_{{2{\text{n}}}} \). The similarity between \( W_{1} \) and \( W_{2} \), expressed as \( {\text{Sim}}\,\left( {W_{1} ,W_{2} } \right) \), is the maximum similarity over all pairs of their concepts:

\( {\text{Sim}}\left( {W_{1} ,W_{2} } \right) = \mathop {\max }\limits_{i = 1 \ldots m,\;j = 1 \ldots n} {\text{Sim}}\left( {C_{1i} ,C_{2j} } \right) \)
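The two combination steps of Sect. 3.2 — weighting the three description similarities, and maximizing over all concept pairs of two words — can be sketched as follows (parameter and function names are ours):

```python
def concept_similarity(sim1, sim2, sim3, betas=(0.7, 0.17, 0.13)):
    """Weighted combination of the independent, relation and symbol
    description similarities, with beta1 >= beta2 >= beta3 summing to 1."""
    return betas[0] * sim1 + betas[1] * sim2 + betas[2] * sim3

def word_similarity(concepts1, concepts2, concept_sim):
    """Similarity of two words: the maximum similarity over all pairs
    of their concepts."""
    return max(concept_sim(c1, c2) for c1 in concepts1 for c2 in concepts2)
```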
3.3 The Similarity of Texts
The Chinese text preprocessing for semantic-based text similarity computation is the same as for the VSM-based computation. Assume the two Chinese texts after preprocessing are expressed as \( {\text{D}}_{1} = ({\text{T}}_{11} ,\,{\text{W}}_{11} ,\,{\text{T}}_{12} ,\,{\text{W}}_{12} , \ldots ,{\text{T}}_{{1{\text{m}}}} ,\,{\text{W}}_{{1{\text{m}}}} ) \) and \( {\text{D}}_{2} = ({\text{T}}_{21} ,\,{\text{W}}_{21} ,\,{\text{T}}_{22} ,\,{\text{W}}_{22} , \ldots ,{\text{T}}_{{2{\text{n}}}} ,\,{\text{W}}_{{2{\text{n}}}} ) \), where \( {\text{T}}_{11} ,\,{\text{T}}_{12} , \ldots ,{\text{T}}_{{1{\text{m}}}} \) are the feature terms of text \( {\text{D}}_{1} \) and \( {\text{W}}_{11} ,\,{\text{W}}_{12} , \ldots ,{\text{W}}_{{1{\text{m}}}} \) are the weights of those feature terms.
Use Eq. (7) to calculate the similarity of each pair of feature items between the two texts, and construct the feature similarity matrix \( {\text{A}}\,\left( {a_{ij} = {\text{Sim}}\left( {T_{{1{\text{i}}}} ,T_{{2{\text{j}}}} } \right)} \right) \). Find the largest element of A and compare it with an adjustable threshold, set to 0.7 in this paper. If the element is larger than the threshold, record its row and column and exclude that row and column from further matching, keeping the indices of the remaining elements in A unchanged. Repeat this step until no remaining element exceeds the threshold or no element is left in A.
Treat the feature-item pairs whose similarity is larger than the threshold as the same dimension, and use Eq. (1) to calculate the text similarity.
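The matrix-matching step above can be sketched as a greedy search for the largest above-threshold entry, excluding matched rows and columns rather than physically deleting them, so indices stay stable (a sketch under our naming):

```python
def match_features(sim_matrix, threshold=0.7):
    """Greedy matching on the feature similarity matrix A: repeatedly take
    the largest remaining entry above the threshold, record its (row, col),
    and exclude that row and column from further matching."""
    used_rows, used_cols, pairs = set(), set(), []
    while True:
        best = None
        for i, row in enumerate(sim_matrix):
            if i in used_rows:
                continue
            for j, v in enumerate(row):
                if j in used_cols:
                    continue
                if v > threshold and (best is None or v > best[2]):
                    best = (i, j, v)
        if best is None:  # no remaining entry exceeds the threshold
            return pairs
        used_rows.add(best[0])
        used_cols.add(best[1])
        pairs.append((best[0], best[1]))
```

Each returned pair marks two feature items to be merged into the same dimension before applying Eq. (1).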
4 Experimental Results and Analysis
Using part of the Chinese news corpus of Sogou, we design a news text classification experiment based on the KNN algorithm, with K set to 7. The data cover five fields: electronics, sports, tourism, education and military. There are 1000 texts in total, 200 in each field; for each field, 150 texts form the training set and 50 the test set (Table 2).
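For illustration, the KNN classification used in the experiment can be sketched as a majority vote among the K most similar training texts (a minimal sketch; in the experiment, the similarity function is the text similarity computed above):

```python
from collections import Counter

def knn_classify(query_vec, training, sim, k=7):
    """Classify a query text by majority vote among the k training texts
    most similar to it; `training` is a list of (vector, label) pairs."""
    neighbors = sorted(training, key=lambda t: sim(query_vec, t[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```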
The classification results show that the two semantic algorithms are much better than the traditional VSM-based algorithm, and the algorithm proposed in this paper performs better than the other semantic-based algorithm.
5 Conclusions
This paper discusses VSM-based text similarity computation and semantic-based text similarity computation, and improves the method for computing the similarity between words. Finally, a KNN-based text classification experiment is designed to analyze the advantages and disadvantages of the two methods. In the course of the study, we found a problem worth further research: taking the order of words into account, to avoid losing part of the useful text information.
References
1. Jin Xiqian (2009). Research on Semantic Based Chinese Text Similarity Algorithm. Doctoral dissertation, Zhejiang University of Technology.
2. Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
3. Liu Xiaojun, Zhao Dong, & Yao Weidong (2007). A two-factor similarity algorithm for Chinese text search. Computer Simulation, 24(12), 312–314.
4. Chen Feihong (2011). Research on Chinese Text Similarity Algorithm Based on Vector Space Model. Doctoral dissertation, University of Electronic Science and Technology of China.
5. Kuai Yuanyuan (2014). Research on semantic based text similarity algorithm. Computer CD Software and Applications, (9), 302–303.
6. Liu Qun, & Li Sujian (2002). Word similarity computing based on HowNet. Computational Linguistics and Chinese Language Processing.
7. Fan Hongyi, & Zhang Yangsen (2014). A method for semantic similarity of words based on HowNet. Journal of Beijing Information Science and Technology University: Natural Science Edition, (4), 42–45.
© 2017 Springer Nature Singapore Pte Ltd.
Liu, Y., Li, Z. (2017). Semantic Based Text Similarity Computation. In: Zhao, P., Ouyang, Y., Xu, M., Yang, L., Ouyang, Y. (eds) Advanced Graphic Communications and Media Technologies. PPMT 2016. Lecture Notes in Electrical Engineering, vol 417. Springer, Singapore. https://doi.org/10.1007/978-981-10-3530-2_43