
1 Introduction

In daily life, people use text to express and transmit information, so information retrieval and data mining technologies depend on the processing of text [1]. Text similarity computation is an effective and direct method for resource acquisition and analysis. Generally, text similarity measures the degree of association between two texts: the larger the similarity value, the more closely the two texts are related, and vice versa. Salton et al. proposed the vector space model (VSM), which maps a text into a high-dimensional space vector [2]. In this way, the mathematical computation can be carried out easily and the computation process is simplified. However, semantic information is often lost in the mapping because of the linear-independence assumption of the vector space model, which affects the accuracy of the final results.

This paper introduces text preprocessing and the vector space model, and then proposes a text similarity computation method based on HowNet and the vector space model, aiming at the problem of missing semantic information.

2 Vector Space Model

2.1 Chinese Text Preprocessing

Compared with English text, Chinese text has no spaces to separate words. Chinese word segmentation is a technique that segments Chinese text into a series of individual words. Some words contribute very little to the meaning of the text yet appear many times; they are known as stop words [3]. These words add no value to the meaning of the text but increase the dimension of the text representation, which raises the computational complexity and time overhead, so they need to be removed.
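As an illustration of this preprocessing step, the sketch below segments a Chinese text and removes stop words. It assumes the jieba segmenter and a user-supplied stop-word file (stopwords.txt); both are illustrative choices rather than tools prescribed by this paper.

```python
# Minimal Chinese preprocessing sketch: word segmentation plus stop-word removal.
import jieba  # assumed segmenter; any Chinese segmenter could be substituted


def load_stopwords(path="stopwords.txt"):
    # The stop-word list is supplied by the user; one word per line is assumed.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def preprocess(text, stopwords):
    # Segment the raw text into words, then drop stop words and whitespace tokens.
    return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]
```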

2.2 Vector Space Model

The vector space model is simple: it maps a text into a space vector and uses that vector to represent the text information. With elementary vector mathematics, we can directly compute the similarity between two text vectors, measured by the cosine of the angle between them.

The vector space model rests on a basic assumption: the ability of a feature item to represent the text content is related only to the number of times it appears, not to its position or order [4]. The model consists of features and their weights. Features usually refer to the non-redundant words left after text preprocessing, and each feature corresponds to a weight that describes how important the feature is in the text. When computing text similarity, the cosine theorem is generally used; the formula is as follows:

$$ {\text{Sim}}\left( {{\text{D}}_{1} ,{\text{D}}_{2} } \right) = \frac{{\sum\nolimits_{k = 1}^{n} {{\text{W}}_{1k} *{\text{W}}_{2k} } }}{{\sqrt {\sum\nolimits_{k = 1}^{n} {{\text{W}}_{1k}^{2} } } *\sqrt {\sum\nolimits_{k = 1}^{n} {{\text{W}}_{2k}^{2} } } }} $$
(1)

\( {\text{D}}_{1} \) and \( {\text{D}}_{2} \) represent two texts, and \( {\text{W}}_{ik} \) represents the weight of the k-th feature in the i-th text.
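As a direct illustration of Eq. (1), the sketch below computes the cosine similarity of two weight vectors that are assumed to be aligned on the same feature dimensions.

```python
import math


def cosine_similarity(w1, w2):
    # w1, w2: weight vectors of the two texts over the same feature dimensions.
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)  # Eq. (1)
```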

We use the TF-IDF method to calculate the weight. The formula is as follows:

$$ W_{ik} = {\text{TF}}_{(i,k)} * {\text{IDF}}_{(i,k)} = q * \log \left( {\frac{N}{n} + \alpha } \right) $$
(2)

In Eq. (2), q stands for the number of times that feature \( {\text{T}}_{k} \) appears in text \( {\text{D}}_{i} \), N is the total number of texts, n is the number of texts containing feature \( {\text{T}}_{k} \), and \( \alpha \) is an empirical constant whose typical value is 0.01.
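Equation (2) translates directly into code; the counts q, N, and n are assumed to have been gathered from the corpus beforehand.

```python
import math


def tfidf_weight(q, N, n, alpha=0.01):
    # q: occurrences of feature T_k in text D_i; N: total number of texts;
    # n: number of texts containing T_k; alpha: empirical constant (Eq. 2).
    return q * math.log(N / n + alpha)
```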

3 Text Similarity Computation

3.1 HowNet

Concepts and sememes are the foundation of HowNet [5]. A concept is a description of a word, and each word can have several concepts, which are generally described by a knowledge description language built from sememes. HowNet uses one or more sememes to describe a concept, and a forest model is used to describe the relationships between sememes [6].

There are ten trees in the sememe forest. The relationships between sememes are complicated and are usually derived from the upper-lower (hypernym-hyponym) relations. This structure is called the sememe hierarchy tree, shown in Fig. 1 [7]. Besides sememes, the description of a concept also uses a variety of semantic symbols as auxiliaries.

Fig. 1 Sememe hierarchy tree

3.2 The Similarity of Words

According to the distance between two sememes on the sememe hierarchy tree, the sememe similarity is calculated by Eq. (3).

$$ {\text{Sim}}\left( {p_{1}, p_{2} } \right) = \frac{\alpha }{d + \alpha } $$
(3)

The two sememes are \( p_{1} \) and \( p_{2} \), and d is the distance between them on the sememe hierarchy tree. If they are on different trees, d tends to infinity. \( \alpha \) is an adjustable parameter whose value is 1.6.
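Equation (3) is likewise a one-line computation; here the distance d is assumed to be obtained from the sememe hierarchy tree elsewhere, with float('inf') used when the two sememes lie on different trees.

```python
def sememe_similarity_basic(d, alpha=1.6):
    # Eq. (3): similarity decays with the distance d between two sememes;
    # d = float('inf') (different trees) yields a similarity of 0.
    return alpha / (d + alpha)
```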

Equation (3) considers only the upper-lower relations and ignores the depth of the sememes, their common parent nodes, and antonymy between sememes, so it needs to be improved.

$$ {\text{Sim}}\left( {p_{1} ,p_{2} } \right) = [e^{{dis\left( {p_{1} ,p_{2} } \right)^{ - 1} }} ]^{{a(p_{1} ,p_{2} )}} \frac{{\alpha \cdot c\,(p_{1} ,p_{2} )}}{{\alpha \cdot c\,(p_{1} ,p_{2} ) + (1 - \frac{{dep\left( {p_{1} ,p_{2} } \right)}}{dep\,(t)} + 0.01) \cdot dis(p_{1} ,p_{2} )}} $$
(4)
$$ dep\left( {p_{1} ,p_{2} } \right) = dep\left( {p_{1} } \right) + dep(p_{2} ) $$
(5)

\( dep\left( {p_{1} } \right) \) is the depth of sememe \( p_{1} \) in the tree, and \( dep\,(t) \) is the total depth of the tree. \( dis(p_{1} ,p_{2} ) \) is the distance between the two sememes, and \( c\,(p_{1} ,p_{2} ) \) is the number of common parent nodes of \( p_{1} \) and \( p_{2} \). \( a\,(p_{1} ,p_{2} ) \) indicates whether there is an antonymous relation on the path between \( p_{1} \) and \( p_{2} \): if there is, the value is 1; otherwise, the value is 0. \( \alpha \) is an adjustable parameter whose value is 1.6.
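A sketch of the improved measure in Eqs. (4) and (5) follows; the distance dis, the common-parent count c, the antonymy flag a, and the sememe depths are assumed to be supplied by a HowNet hierarchy lookup that is not shown here.

```python
import math


def sememe_similarity(dis, c, a, dep1, dep2, dep_t, alpha=1.6):
    # dis:  distance between p1 and p2 on the sememe hierarchy tree
    # c:    number of common parent nodes of p1 and p2
    # a:    1 if an antonymous relation lies on the path between p1 and p2, else 0
    # dep1, dep2: depths of p1 and p2; dep_t: total depth of the tree
    if dis == 0:
        return 1.0  # assumption: identical sememes are treated as fully similar
    dep = dep1 + dep2                          # Eq. (5)
    factor = (math.e ** (1.0 / dis)) ** a      # [e^(dis^-1)]^a in Eq. (4)
    denom = alpha * c + (1 - dep / dep_t + 0.01) * dis
    return factor * (alpha * c) / denom        # Eq. (4)
```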

The knowledge description language consists of independent sememe descriptions, relation sememe descriptions, and symbol sememe descriptions (Table 1).

Table 1 Sememe descriptions

In general, if two specific words are the same, their similarity is 1; otherwise, the similarity is 0.

For the independent sememe descriptions, pair up the sememes or specific words of the two concepts: select the pair with the maximum similarity and record its value, then repeatedly select the most similar pair from the remainder until one side is exhausted, discarding the remaining sememes and specific words. Finally, take the arithmetic average of the recorded values and record it as \( {\text{sim}}_{1} \).

For the relation sememe descriptions, if the relations are the same, pair up the sememes or specific words in the same way, repeatedly selecting the maximum similarity and recording its value. Take the arithmetic average at the end and record it as \( {\text{sim}}_{2} \).

For the symbol sememe descriptions, the processing is the same as for the relation sememe descriptions, except that the symbol rather than the relation is compared; record the arithmetic average as \( {\text{sim}}_{3} \).
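The greedy pairing used for sim1, sim2, and sim3 can be sketched as a single helper; sim_fn is an assumed function that returns the similarity of two sememes or specific words (1 or 0 for specific words), and for the relation and symbol descriptions the inputs would first be grouped by matching relation or symbol.

```python
def greedy_average_similarity(items1, items2, sim_fn):
    # Repeatedly pick the most similar remaining pair, record its similarity,
    # and remove both members until one side is exhausted; leftover items are
    # discarded. Returns the arithmetic mean of the recorded values.
    items1, items2 = list(items1), list(items2)
    recorded = []
    while items1 and items2:
        best = max(((sim_fn(x, y), x, y) for x in items1 for y in items2),
                   key=lambda t: t[0])
        recorded.append(best[0])
        items1.remove(best[1])
        items2.remove(best[2])
    return sum(recorded) / len(recorded) if recorded else 0.0
```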

The formula for calculating the similarity between concepts \( C_{1} \) and \( C_{2} \) is as follows:

$$ {\text{Sim}}(C_{1} ,C_{2} ) = \sum\limits_{i = 1}^{3} {\beta_{i} } \prod\limits_{j = 1}^{i} {sim_{j} } $$
(6)

\( \beta_{i} \;(1 \le i \le 3) \) are adjustable parameters that satisfy \( \beta_{1} + \beta_{2} + \beta_{3} = 1,\;\beta_{1} \ge \beta_{2} \ge \beta_{3} \). Empirical values are \( \beta_{1} = 0.7 ,\beta_{2} = 0.17 ,\beta_{3} = 0.13 \).
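Equation (6) combines the three partial similarities through a cumulative product over sim_j; a minimal sketch:

```python
def concept_similarity(sim1, sim2, sim3, betas=(0.7, 0.17, 0.13)):
    # Eq. (6): Sim(C1, C2) = sum_i beta_i * prod_{j<=i} sim_j
    total, prod = 0.0, 1.0
    for beta, s in zip(betas, (sim1, sim2, sim3)):
        prod *= s
        total += beta * prod
    return total
```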

Assume that word \( W_{1} \) has m concepts \( C_{11} ,\,C_{12} , \ldots ,C_{1m} \) and word \( W_{2} \) has n concepts \( C_{21} ,\,C_{22} , \ldots ,\,C_{2n} \). The similarity between words \( W_{1} \) and \( W_{2} \) is expressed as \( {\text{Sim}}\,\left( {W_{1} ,W_{2} } \right) \).

$$ {\text{Sim}}\,\left( {W_{1} ,W_{2} } \right) = { \hbox{max} }_{i = 1 \ldots m,\,j = 1 \ldots n} \,{\text{SIM}}\,(C_{1i} ,C_{2j} ) $$
(7)
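Equation (7) takes the maximum similarity over all concept pairs; the sketch below assumes concepts_of(word) returns the HowNet concepts of a word and concept_sim implements Eq. (6).

```python
def word_similarity(word1, word2, concepts_of, concept_sim):
    # Eq. (7): the similarity of two words is the maximum similarity
    # over all pairs of their concepts.
    return max(concept_sim(c1, c2)
               for c1 in concepts_of(word1)
               for c2 in concepts_of(word2))
```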

3.3 The Similarity of Texts

The Chinese text preprocessing for semantic-based text similarity computation is the same as for the VSM-based computation. Assume two Chinese texts after preprocessing are expressed as \( {\text{D}}_{1} = ({\text{T}}_{11} ,\,{\text{W}}_{11} ,\,{\text{T}}_{12} ,\,{\text{W}}_{12} , \ldots ,{\text{T}}_{1m} ,\,{\text{W}}_{1m} ) \) and \( {\text{D}}_{2} = ({\text{T}}_{21} ,\,{\text{W}}_{21} ,\,{\text{T}}_{22} ,\,{\text{W}}_{22} , \ldots ,{\text{T}}_{2n} ,\,{\text{W}}_{2n} ) \), where \( {\text{T}}_{11} ,\,{\text{T}}_{12} , \ldots ,{\text{T}}_{1m} \) are the feature terms of text \( {\text{D}}_{1} \) and \( {\text{W}}_{11} ,\,{\text{W}}_{12} , \ldots ,{\text{W}}_{1m} \) are the weights of those feature terms.

Use Eq. (7) to calculate the similarity of each pair of feature items between the two texts, then construct the feature-similarity matrix \( {\text{A}}\,\left( {a_{ij} = {\text{Sim}}\left( {T_{1i} ,T_{2j} } \right)} \right) \). In matrix A, find the largest element and compare it with an adjustable threshold, whose value in this paper is 0.7. If the element is larger than the threshold, remove its row and column from matrix A (keeping the indices of the remaining elements unchanged) and record that row and column. Repeat this step until no element of A exceeds the threshold or A is empty.

Treat the feature items whose similarity is larger than the threshold as the same dimension, and use Eq. (1) to calculate the text similarity, as sketched below.
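A minimal sketch of this matching step, assuming word_sim implements Eq. (7) and the feature and weight lists come from the preprocessing described above:

```python
import math


def semantic_text_similarity(feats1, w1, feats2, w2, word_sim, threshold=0.7):
    # feats1/feats2: feature terms of the two texts; w1/w2: their weights.
    # 1. Feature-similarity matrix A with a_ij = Sim(T_1i, T_2j) (Eq. 7).
    A = [[word_sim(t1, t2) for t2 in feats2] for t1 in feats1]
    # 2. Greedily match the most similar feature pairs above the threshold.
    matches, used_rows, used_cols = [], set(), set()
    while True:
        best, bi, bj = threshold, -1, -1
        for i in range(len(feats1)):
            for j in range(len(feats2)):
                if i not in used_rows and j not in used_cols and A[i][j] > best:
                    best, bi, bj = A[i][j], i, j
        if bi < 0:
            break
        matches.append((bi, bj))
        used_rows.add(bi)
        used_cols.add(bj)
    # 3. Matched features share a dimension; unmatched ones keep their own.
    v1 = [w1[i] for i, _ in matches]
    v2 = [w2[j] for _, j in matches]
    v1 += [w1[i] for i in range(len(feats1)) if i not in used_rows]
    v2 += [0.0] * (len(feats1) - len(matches))
    v1 += [0.0] * (len(feats2) - len(matches))
    v2 += [w2[j] for j in range(len(feats2)) if j not in used_cols]
    # 4. Cosine similarity of the aligned vectors (Eq. 1).
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0
```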

4 Experimental Results and Analysis

Using part of the Sogou Chinese news corpus, we design a news text classification experiment based on the KNN algorithm with K set to 7. The data covers five domains: electronics, sports, tourism, education, and military. There are 1000 texts in total, 200 per domain; in each domain, 150 texts are used for training and 50 for testing (Table 2).

Table 2 The classification accuracy of the experiment
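A sketch of the classification step used in the experiment: a KNN classifier with K = 7 that treats the text similarity as the nearness measure. The similarity function and the labelled training texts are assumed inputs; this is an illustration, not the exact experimental code.

```python
from collections import Counter


def knn_classify(test_text, train_texts, train_labels, sim_fn, k=7):
    # Rank training texts by similarity to the test text, take the k most
    # similar, and assign the majority label among them.
    ranked = sorted(zip(train_texts, train_labels),
                    key=lambda pair: sim_fn(test_text, pair[0]),
                    reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]
```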

The classification results show that the two semantic-based algorithms perform much better than the traditional VSM-based algorithm. Moreover, the algorithm proposed in this paper is more efficient than the other semantic-based algorithm.

5 Conclusions

This paper mainly discusses VSM-based and semantic-based text similarity computation, and improves the method of computing the similarity between words. Finally, a KNN-based text classification experiment is designed to analyze the advantages and disadvantages of the two text similarity computation methods. In the course of the study, we found a problem worth further research: taking the order of words into account so as not to lose part of the useful text information.