
1 Introduction

In daily life, people use text to express and transmit information, so information retrieval and data mining technologies depend on the processing of text [1]. Text similarity computation is an effective and direct method for resource acquisition and analysis. Generally, text similarity measures the degree of association between two texts: the larger the similarity value, the more closely the two texts are related, and vice versa. Salton et al. proposed the vector space model (VSM), which maps a text into a high-dimensional space vector [2]. In this way, the mathematical computation can be carried out easily and the computation process is simplified. However, semantic information is often lost in the mapping because of the linear-independence assumption of the vector space model, which affects the accuracy of the final results.

This paper introduces text preprocessing and the vector space model, and then proposes a text similarity computation method based on HowNet and the vector space model, aiming at the problem of missing semantic information.

2 Vector Space Model

2.1 Chinese Text Preprocessing

Compared with English text, Chinese text has no spaces to separate words. Chinese word segmentation is a technique that segments Chinese text into a series of individual words. Some words contribute very little to the meaning of the text yet appear many times; they are known as stop words [3]. These words add no value to the meaning of the text but increase the dimension of the text representation, which raises the computational complexity and time overhead, so they need to be removed.
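As an illustration of this preprocessing step, the sketch below segments a Chinese text and removes stop words. It assumes the jieba segmenter and a user-supplied stop-word file (stopwords.txt); both are illustrative choices rather than tools prescribed by this paper.

```python
# Minimal Chinese preprocessing sketch: word segmentation plus stop-word removal.
import jieba  # assumed segmenter; any Chinese segmenter could be substituted


def load_stopwords(path="stopwords.txt"):
    # The stop-word list is supplied by the user; one word per line is assumed.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def preprocess(text, stopwords):
    # Segment the raw text into words, then drop stop words and whitespace tokens.
    return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]
```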

2.2 Vector Space Model

The vector space model is simple: it maps a text into a space vector and uses that vector to represent the text information. With elementary vector mathematics, we can directly compute the similarity between two text vectors, measured by the cosine of the angle between them.

The vector space model rests on a basic assumption: the ability of a feature item to represent the text content is related only to the number of times it appears, not to its position or order [4]. The model consists of features and their weights. Features usually refer to the non-redundant words left after text preprocessing, and each feature corresponds to a weight that describes how important the feature is in the text. When computing text similarity, the cosine theorem is generally used; the formula is as follows:

$$ {\text{Sim}}\left( {{\text{D}}_{1} ,{\text{D}}_{2} } \right) = \frac{{\sum\nolimits_{k = 1}^{n} {{\text{W}}_{1k} *{\text{W}}_{2k} } }}{{\sqrt {\sum\nolimits_{k = 1}^{n} {{\text{W}}_{1k}^{2} } } *\sqrt {\sum\nolimits_{k = 1}^{n} {{\text{W}}_{2k}^{2} } } }} $$
(1)

\( {\text{D}}_{1} \) and \( {\text{D}}_{2} \) represent two texts, and \( {\text{W}}_{ik} \) represents the weight of the k-th feature in the i-th text.
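As a direct illustration of Eq. (1), the sketch below computes the cosine similarity of two weight vectors that are assumed to be aligned on the same feature dimensions.

```python
import math


def cosine_similarity(w1, w2):
    # w1, w2: weight vectors of the two texts over the same feature dimensions.
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)  # Eq. (1)
```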

We use the TF-IDF method to calculate the weight. The formula is as follows:

$$ W_{ik} = {\text{TF}}_{(i,k)} * {\text{IDF}}_{(i,k)} = q * \log \left( {\frac{N}{n} + \alpha } \right) $$
(2)

In Eq. (2), q stands for the number of times that feature \( {\text{T}}_{k} \) appears in text \( {\text{D}}_{i} \), N is the total number of texts, n is the number of texts containing feature \( {\text{T}}_{k} \), and \( \alpha \) is an empirical constant whose typical value is 0.01.
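Equation (2) translates directly into code; the counts q, N, and n are assumed to have been gathered from the corpus beforehand.

```python
import math


def tfidf_weight(q, N, n, alpha=0.01):
    # q: occurrences of feature T_k in text D_i; N: total number of texts;
    # n: number of texts containing T_k; alpha: empirical constant (Eq. 2).
    return q * math.log(N / n + alpha)
```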

3 Text Similarity Computation

3.1 HowNet

Concepts and sememes are the foundation of HowNet [5]. A concept is a description of a word, and each word can have several concepts, which are generally described by a knowledge description language built from sememes. HowNet uses one or more sememes to describe a concept, and a forest model is used to describe the relationships between sememes [6].

There are ten trees in the sememe forest. The relationships between sememes are complicated and are usually derived from the upper-lower (hypernym-hyponym) relations. This structure is called the sememe hierarchy tree, shown in Fig. 1 [7]. Besides sememes, the description of a concept also uses a variety of semantic symbols as auxiliaries.

Fig. 1 Sememe hierarchy tree

3.2 The Similarity of Words

According to the distance between two sememes on the sememe hierarchy tree, the sememe similarity is calculated by Eq. (3).

$$ {\text{Sim}}\left( {p_{1}, p_{2} } \right) = \frac{\alpha }{d + \alpha } $$
(3)

The two sememes are \( p_{1} \) and \( p_{2} \), and d is the distance between them on the sememe hierarchy tree. If they are on different trees, d tends to infinity. \( \alpha \) is an adjustable parameter whose value is 1.6.
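Equation (3) is likewise a one-line computation; here the distance d is assumed to be obtained from the sememe hierarchy tree elsewhere, with float('inf') used when the two sememes lie on different trees.

```python
def sememe_similarity_basic(d, alpha=1.6):
    # Eq. (3): similarity decays with the distance d between two sememes;
    # d = float('inf') (different trees) yields a similarity of 0.
    return alpha / (d + alpha)
```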

Equation (3) considers only the upper-lower relations and ignores the depth of the sememes, their common parent nodes, and antonymy between sememes, so it needs to be improved.

$$ {\text{Sim}}\left( {p_{1} ,p_{2} } \right) = [e^{{dis\left( {p_{1} ,p_{2} } \right)^{ - 1} }} ]^{{a(p_{1} ,p_{2} )}} \frac{{\alpha \cdot c\,(p_{1} ,p_{2} )}}{{\alpha \cdot c\,(p_{1} ,p_{2} ) + (1 - \frac{{dep\left( {p_{1} ,p_{2} } \right)}}{dep\,(t)} + 0.01) \cdot dis(p_{1} ,p_{2} )}} $$
(4)
$$ dep\left( {p_{1} ,p_{2} } \right) = dep\left( {p_{1} } \right) + dep(p_{2} ) $$
(5)

\( dep\left( {p_{1} } \right) \) is the depth of sememe \( p_{1} \) in the tree, and \( dep\,(t) \) is the total depth of the tree. \( dis(p_{1} ,p_{2} ) \) is the distance between the two sememes, and \( c\,(p_{1} ,p_{2} ) \) is the number of common parent nodes of \( p_{1} \) and \( p_{2} \). \( a\,(p_{1} ,p_{2} ) \) indicates whether there is an antonymous relation on the path between \( p_{1} \) and \( p_{2} \): if there is, the value is 1; otherwise, the value is 0. \( \alpha \) is an adjustable parameter whose value is 1.6.
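A sketch of the improved measure in Eqs. (4) and (5) follows; the distance dis, the common-parent count c, the antonymy flag a, and the sememe depths are assumed to be supplied by a HowNet hierarchy lookup that is not shown here.

```python
import math


def sememe_similarity(dis, c, a, dep1, dep2, dep_t, alpha=1.6):
    # dis:  distance between p1 and p2 on the sememe hierarchy tree
    # c:    number of common parent nodes of p1 and p2
    # a:    1 if an antonymous relation lies on the path between p1 and p2, else 0
    # dep1, dep2: depths of p1 and p2; dep_t: total depth of the tree
    if dis == 0:
        return 1.0  # assumption: identical sememes are treated as fully similar
    dep = dep1 + dep2                          # Eq. (5)
    factor = (math.e ** (1.0 / dis)) ** a      # [e^(dis^-1)]^a in Eq. (4)
    denom = alpha * c + (1 - dep / dep_t + 0.01) * dis
    return factor * (alpha * c) / denom        # Eq. (4)
```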

The knowledge description language consists of independent sememe descriptions, relation sememe descriptions, and symbol sememe descriptions (Table 1).

Table 1 Sememe descriptions

In general, if two specific words are the same, their similarity is 1; otherwise, the similarity is 0.

For the independent sememe descriptions, pair up the sememes or specific words of the two concepts: select the pair with the maximum similarity and record its value, then repeatedly select the most similar pair from the remainder until one side is exhausted, discarding the remaining sememes and specific words. Finally, take the arithmetic average of the recorded values and record it as \( {\text{sim}}_{1} \).

For the relation sememe descriptions, if the relations are the same, pair up the sememes or specific words in the same way, repeatedly selecting the maximum similarity and recording its value. Take the arithmetic average at the end and record it as \( {\text{sim}}_{2} \).

For the symbol sememe descriptions, the processing is the same as for the relation sememe descriptions, except that the symbol rather than the relation is compared; record the arithmetic average as \( {\text{sim}}_{3} \).
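The greedy pairing used for sim1, sim2, and sim3 can be sketched as a single helper; sim_fn is an assumed function that returns the similarity of two sememes or specific words (1 or 0 for specific words), and for the relation and symbol descriptions the inputs would first be grouped by matching relation or symbol.

```python
def greedy_average_similarity(items1, items2, sim_fn):
    # Repeatedly pick the most similar remaining pair, record its similarity,
    # and remove both members until one side is exhausted; leftover items are
    # discarded. Returns the arithmetic mean of the recorded values.
    items1, items2 = list(items1), list(items2)
    recorded = []
    while items1 and items2:
        best = max(((sim_fn(x, y), x, y) for x in items1 for y in items2),
                   key=lambda t: t[0])
        recorded.append(best[0])
        items1.remove(best[1])
        items2.remove(best[2])
    return sum(recorded) / len(recorded) if recorded else 0.0
```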

The formula for calculating the similarity between concepts \( C_{1} \) and \( C_{2} \) is as follows:

$$ {\text{Sim}}(C_{1} ,C_{2} ) = \sum\limits_{i = 1}^{3} {\beta_{i} } \prod\limits_{j = 1}^{i} {sim_{j} } $$
(6)

\( \beta_{i} \;(1 \le i \le 3) \) are adjustable parameters that satisfy \( \beta_{1} + \beta_{2} + \beta_{3} = 1,\;\beta_{1} \ge \beta_{2} \ge \beta_{3} \). Empirical values are \( \beta_{1} = 0.7 ,\beta_{2} = 0.17 ,\beta_{3} = 0.13 \).
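Equation (6) combines the three partial similarities through a cumulative product over sim_j; a minimal sketch:

```python
def concept_similarity(sim1, sim2, sim3, betas=(0.7, 0.17, 0.13)):
    # Eq. (6): Sim(C1, C2) = sum_i beta_i * prod_{j<=i} sim_j
    total, prod = 0.0, 1.0
    for beta, s in zip(betas, (sim1, sim2, sim3)):
        prod *= s
        total += beta * prod
    return total
```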

Assume that word \( W_{1} \) has m concepts \( C_{11} ,\,C_{12} , \ldots ,C_{1m} \) and word \( W_{2} \) has n concepts \( C_{21} ,\,C_{22} , \ldots ,\,C_{2n} \). The similarity between words \( W_{1} \) and \( W_{2} \) is expressed as \( {\text{Sim}}\,\left( {W_{1} ,W_{2} } \right) \).

$$ {\text{Sim}}\,\left( {W_{1} ,W_{2} } \right) = { \hbox{max} }_{i = 1 \ldots m,\,j = 1 \ldots n} \,{\text{SIM}}\,(C_{1i} ,C_{2j} ) $$
(7)
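Equation (7) takes the maximum similarity over all concept pairs; the sketch below assumes concepts_of(word) returns the HowNet concepts of a word and concept_sim implements Eq. (6).

```python
def word_similarity(word1, word2, concepts_of, concept_sim):
    # Eq. (7): the similarity of two words is the maximum similarity
    # over all pairs of their concepts.
    return max(concept_sim(c1, c2)
               for c1 in concepts_of(word1)
               for c2 in concepts_of(word2))
```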

3.3 The Similarity of Texts

The Chinese text preprocessing for semantic-based text similarity computation is the same as for the VSM-based computation. Assume two Chinese texts after preprocessing are expressed as \( {\text{D}}_{1} = ({\text{T}}_{11} ,\,{\text{W}}_{11} ,\,{\text{T}}_{12} ,\,{\text{W}}_{12} , \ldots ,{\text{T}}_{1m} ,\,{\text{W}}_{1m} ) \) and \( {\text{D}}_{2} = ({\text{T}}_{21} ,\,{\text{W}}_{21} ,\,{\text{T}}_{22} ,\,{\text{W}}_{22} , \ldots ,{\text{T}}_{2n} ,\,{\text{W}}_{2n} ) \), where \( {\text{T}}_{11} ,\,{\text{T}}_{12} , \ldots ,{\text{T}}_{1m} \) are the feature terms of text \( {\text{D}}_{1} \) and \( {\text{W}}_{11} ,\,{\text{W}}_{12} , \ldots ,{\text{W}}_{1m} \) are the weights of those feature terms.

Use Eq. (7) to calculate the similarity of each pair of feature items between the two texts, then construct the feature-similarity matrix \( {\text{A}}\,\left( {a_{ij} = {\text{Sim}}\left( {T_{1i} ,T_{2j} } \right)} \right) \). In matrix A, find the largest element and compare it with an adjustable threshold, whose value in this paper is 0.7. If the element is larger than the threshold, remove its row and column from matrix A (keeping the indices of the remaining elements unchanged) and record that row and column. Repeat this step until no element of A exceeds the threshold or A is empty.

Treat the feature items whose similarity is larger than the threshold as the same dimension, and use Eq. (1) to calculate the text similarity, as sketched below.
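A minimal sketch of this matching step, assuming word_sim implements Eq. (7) and the feature and weight lists come from the preprocessing described above:

```python
import math


def semantic_text_similarity(feats1, w1, feats2, w2, word_sim, threshold=0.7):
    # feats1/feats2: feature terms of the two texts; w1/w2: their weights.
    # 1. Feature-similarity matrix A with a_ij = Sim(T_1i, T_2j) (Eq. 7).
    A = [[word_sim(t1, t2) for t2 in feats2] for t1 in feats1]
    # 2. Greedily match the most similar feature pairs above the threshold.
    matches, used_rows, used_cols = [], set(), set()
    while True:
        best, bi, bj = threshold, -1, -1
        for i in range(len(feats1)):
            for j in range(len(feats2)):
                if i not in used_rows and j not in used_cols and A[i][j] > best:
                    best, bi, bj = A[i][j], i, j
        if bi < 0:
            break
        matches.append((bi, bj))
        used_rows.add(bi)
        used_cols.add(bj)
    # 3. Matched features share a dimension; unmatched ones keep their own.
    v1 = [w1[i] for i, _ in matches]
    v2 = [w2[j] for _, j in matches]
    v1 += [w1[i] for i in range(len(feats1)) if i not in used_rows]
    v2 += [0.0] * (len(feats1) - len(matches))
    v1 += [0.0] * (len(feats2) - len(matches))
    v2 += [w2[j] for j in range(len(feats2)) if j not in used_cols]
    # 4. Cosine similarity of the aligned vectors (Eq. 1).
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0
```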

4 Experimental Results and Analysis

Using part of the Sogou Chinese news corpus, we design a news text classification experiment based on the KNN algorithm with K set to 7. The data covers five domains: electronics, sports, tourism, education, and military. There are 1000 texts in total, 200 per domain; in each domain, 150 texts are used for training and 50 for testing (Table 2).

Table 2 The classification accuracy of the experiment
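A sketch of the classification step used in the experiment: a KNN classifier with K = 7 that treats the text similarity as the nearness measure. The similarity function and the labelled training texts are assumed inputs; this is an illustration, not the exact experimental code.

```python
from collections import Counter


def knn_classify(test_text, train_texts, train_labels, sim_fn, k=7):
    # Rank training texts by similarity to the test text, take the k most
    # similar, and assign the majority label among them.
    ranked = sorted(zip(train_texts, train_labels),
                    key=lambda pair: sim_fn(test_text, pair[0]),
                    reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]
```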

The classification results show that the two semantic-based algorithms perform much better than the traditional VSM-based algorithm. Moreover, the algorithm proposed in this paper is more efficient than the other semantic-based algorithm.

5 Conclusions

This paper mainly discusses VSM-based and semantic-based text similarity computation, and improves the method of computing the similarity between words. Finally, a KNN-based text classification experiment is designed to analyze the advantages and disadvantages of the two text similarity computation methods. In the course of the study, we found a problem worth further research: taking the order of words into account so as not to lose part of the useful text information.