Keywords

1 Introduction

The web provides important and even exclusive resources for geographic information retrieval and knowledge discovery [1]. At the same time, geo-entity relations are commonly used to describe the locations of entities and geographical phenomena, which are crucial for building geographic knowledge systems [2]. To better understand the geographic semantics embedded in rich web texts, there is a pressing need for robust and effective solutions for geo-entity relation extraction.

The frequently used supervised learning methods, which perform well on specified static texts, behave poorly when extracting geo-entity relations from web texts [3]. Firstly, building massive patterns or corpora is expensive and training models is time-consuming, so massive web texts cannot be processed in real time with supervised methods [4]. Secondly, web texts may cover various domains with strong heterogeneity, leading to poor portability of trained models [5]. Thirdly, the dynamic nature of web texts constantly generates additional relation types that cannot be captured by predefined patterns and pre-trained models [6]. Unsupervised learning methods have attracted more attention in the field of web text mining because they do not need large-scale patterns and corpora. Additionally, they can be used to explore additional relations, which makes them more suitable for dynamic text mining [7].

Keywords play an important role in relation recognition with unsupervised learning methods because they provide rich clues for describing the relations between entities [8]. Unsupervised methods regard keyword extraction as a ranking task and take the top-ranked terms as keywords [9]. The existing keyword extraction methods for relation recognition are mainly based on frequency statistics. These methods rest on the hypothesis that a large number of redundant terms imply the relations for a specific entity pair. However, this hypothesis is not appropriate for extracting keywords for geo-entities, for the following reasons. Firstly, according to our experiments, a specific geo-entity pair rarely co-occurs in one sentence [10]. Besides, the number of terms in the context of a specific geo-entity pair is very limited, which makes the terms rather sparse. Secondly, synonymy exacerbates the problem of sparseness [11]. Thirdly, there is a strong correlation between the types of geo-entities and the terms [12]. For example, “flow into” can only describe the relation between water bodies, not buildings. However, this correlation does not apply to semantic relations, which are not restricted by the type of the geo-entity pair. Therefore, frequency statistics alone can hardly distinguish keywords from other terms and do not work well for recognizing geo-entity relations with a sparse distribution. Besides, languages differ in word segmentation, part-of-speech (POS) tagging and syntactic analysis, which greatly influences keyword extraction. Compared with English, a character-based language like Chinese needs a different keyword extraction strategy for geo-entity relations.

This paper focuses on how to extract keywords from mass Chinese web texts for recognizing geo-entity relations with extremely sparse distribution. Our contributions are as follows:

  (1) We propose a context enhanced method to reduce the term sparseness in keyword extraction. To the best of our knowledge, this is the first time the sparse distribution of geo-entity relations has been addressed in the field of geo-entity relation recognition. We also show that sparseness reduction is essential for generating high-quality keywords and for achieving unsupervised recognition of sparse geo-entity relations.

  (2) In order to reveal the specific characteristics of the given web texts and to deal with heterogeneous web texts, we use feature selection and weight statistics to increase the distinctions between the terms in a context. Different from the frequency-based methods, we additionally explore multiple lexical features in real time and dynamically adjust their weights.

  (3) Our method significantly outperforms the compared algorithms (DF and Entropy) and is able to discover additional keywords, which makes it appropriate for dynamic text mining.

The remainder of this paper is organized as follows. A context enhanced methodology of keyword extraction for sparse geo-entity relations is presented in Sect. 2. The experiments and discussion are presented in Sect. 3. Conclusions are drawn in Sect. 4.

2 Methodology

2.1 Definitions

  • Input: Chinese texts crawled from assigned websites. One piece of texts is shown below.

  • Output: a set of keywords for geo-entity pairs.

  • Geo-entity pair \( (e_{1}, e_{2}) \): two geo-related entities co-occurring in one sentence. The first geo-entity appearing in a sentence is paired with each of the other geo-entities in the same sentence. For example, (中关村Zhongguancun, 海淀区Haidian District), (中关村Zhongguancun, 北京大学Peking University) and (中关村Zhongguancun, 清华大学Tsinghua University) are geo-entity pairs in the first sentence.

  • Geo-entity relation r: a state of connectedness between geo-entities, divided into two types, spatial relations and semantic relations. Spatial relations consist of topological, directional and distance relations, such as “within”, “south” and “10 kilometres”. Semantic relations are “hypernym”, “hyponym”, “equal”, to name a few. Both of them can be represented as a set of facts of the form \( (e_{1}, r, e_{2}) \). Examples of facts are (中关村, 相邻, 北京大学) and (中关村, 别名, 中国的硅谷), which are (Zhongguancun, adjacent, Peking University) and (Zhongguancun, alias, China’s Silicon Valley) in English.

  • Term t: a phrase or a word with semantic information in a sentence, such as “位于”, “科技中心”(“be located in”, “technology hub” in English) and so on.

  • Context c: all terms existing before, between and after the specified geo-entity pair in a sentence except for other geo-entities in the same sentence, with the stop words filtered. The stop words are function words, such as “被”, “并”, “都” (“by”, “and”, “both” in English) and so on. For example, the context of (中关村, 海淀区), (中关村, 北京大学) and (中关村, 清华大学) contains 2 terms, (位于, 邻近) which are (be located in, proximity) in English.

  • Keyword k: the terms picked out from context as indicators in relation expressions. For example, the term “proximity” picked out from the context (be located in, proximity) is a keyword revealing the topological relation “adjacent” for the geo-entity pair (Zhongguancun, Peking University) (Table 1).

    Table 1. Examples of geo-entity pairs and corresponding keywords.
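The definitions of geo-entity pair and context above can be illustrated with a minimal Python sketch. The token list is a hypothetical segmentation chosen to agree with the examples in this section (the original sample text is not reproduced here), and the stop-word and punctuation sets, as well as the function names, are assumptions made for illustration only.

```python
from typing import List, Set, Tuple

# Assumed stop words and punctuation; the paper's actual lists are not given.
STOP_WORDS: Set[str] = {"被", "并", "都", "和"}
PUNCTUATION: Set[str] = {"，", "。"}


def geo_entity_pairs(tokens: List[str], geo_entities: Set[str]) -> List[Tuple[str, str]]:
    """Pair the first geo-entity of a sentence with every later geo-entity."""
    found = [tok for tok in tokens if tok in geo_entities]
    if len(found) < 2:
        return []
    first = found[0]
    return [(first, other) for other in found[1:]]


def context(tokens: List[str], geo_entities: Set[str]) -> List[str]:
    """Terms of the sentence with geo-entities, stop words and punctuation removed."""
    return [
        tok for tok in tokens
        if tok not in geo_entities and tok not in STOP_WORDS and tok not in PUNCTUATION
    ]


# Hypothetical segmented sentence and geo-entity set.
tokens = ["中关村", "位于", "海淀区", "，", "邻近", "北京大学", "和", "清华大学", "。"]
entities = {"中关村", "海淀区", "北京大学", "清华大学"}

pairs = geo_entity_pairs(tokens, entities)
shared_context = context(tokens, entities)
```

Because the context excludes every geo-entity in the sentence, all pairs from one sentence share the same context in this sketch, which matches the two-term context (位于, 邻近) described above.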

2.2 Sparseness Reduction

The terms in the context of a specific geo-entity pair are usually sparse. Merging the contexts of geo-entity pairs with the same type reduces the sparseness of terms in one context. This requires a fine-grained mapping table connecting types to geo-entities. In this paper, an online Chinese encyclopaedia, Baidu Baike (see footnote 1), is used to obtain the type labels of each geo-entity. Similar to Wikipedia, Baidu Baike attaches multiple type labels to each entry, ranked by importance. For example, the entry “Beijing” has 4 type labels, “municipality”, “ancient capital”, “China” and “first-tier city”.

The process of sparseness reduction for terms is shown in Fig. 1. Firstly, we search for the geo-entities in Baidu Baike one by one and obtain the corresponding type labels. Secondly, all type labels of the specified geo-entity are assessed by their orders and frequencies, and the most important label is picked out as the geo-entity type. After every geo-entity has been assigned its type, the type of a geo-entity pair \( (e_{x}, e_{y}) \) is given by \( T_{xy} = \langle type_{e_{x}}, type_{e_{y}} \rangle \). Thirdly, we merge the contexts of geo-entity pairs with the same type, which increases the number of terms in a context. This process enriches the information used to extract keywords for geo-entity pairs. Moreover, the semantics of terms are fused with the help of the synonym dictionary CiLin (see footnote 2) to further reduce the sparseness of terms.

Fig. 1. Sparseness reduction for terms in contexts.
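The merging step can be sketched in Python as follows. This is only an illustration under stated assumptions: the entity types are taken as already chosen (the label-scoring rule based on order and frequency is not specified here), the synonym table is a plain dictionary standing in for CiLin, and all data and names (`merge_contexts`, `entity_type`, `synonym_head`) are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def merge_contexts(
    pair_contexts: Dict[Pair, List[str]],
    entity_type: Dict[str, str],
    synonym_head: Dict[str, str],
) -> Dict[Pair, Counter]:
    """Merge the contexts of all geo-entity pairs that share the same type pair.

    Each term is first mapped to a representative synonym (if one is known),
    so that synonymous terms are counted together.
    """
    merged: Dict[Pair, Counter] = defaultdict(Counter)
    for (e1, e2), terms in pair_contexts.items():
        type_pair = (entity_type[e1], entity_type[e2])
        for term in terms:
            merged[type_pair][synonym_head.get(term, term)] += 1
    return dict(merged)


# Toy data, invented for illustration.
entity_type = {"Min River": "river", "Tuo River": "river", "Sichuan Province": "province"}
synonym_head = {"runs through": "flows through"}
pair_contexts = {
    ("Min River", "Sichuan Province"): ["flows through"],
    ("Tuo River", "Sichuan Province"): ["runs through", "tributary"],
}

merged = merge_contexts(pair_contexts, entity_type, synonym_head)
```

In this toy example, two contexts with one and two terms become a single context for the type pair (river, province) in which “flows through” occurs twice, which is the kind of densification the section describes.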

2.3 Corpus Generation

A large-scale corpus is needed to select effective features for keyword extraction. It is generated automatically with two well-known frequency-based statistical methods, namely DF (Domain Frequency) and Entropy. Both methods are used to extract keywords from the entire set of web texts, and the intersection of the two resulting keyword sets forms the corpus for feature selection. DF is given in formula (1), and Entropy in formulas (2)–(3).

$$ DF_{t} = \frac{f_{t,T_{i}}}{\sum\limits_{j = 1}^{N} f_{t,T_{j}}} \tag{1} $$
$$ S_{i,j} = \exp \left( \frac{\ln 0.5}{\overline{D}} \times D_{i,j} \right) \tag{2} $$
$$ Entropy_{t} = \begin{cases} - \sum\limits_{i = 1}^{N} \sum\limits_{j = 1}^{N} \left( S_{i,j} \log S_{i,j} + (1 - S_{i,j}) \log (1 - S_{i,j}) \right), & 0 < S_{i,j} < 1 \\ 0, & \text{otherwise} \end{cases} \tag{3} $$

In formula (1), \( f_{t,T_{i}} \) denotes the frequency of term t appearing in the contexts of geo-entity pairs with the type \( T_{i} \in TS \). TS is the type set of geo-entity pairs with the size of N. In formula (2), \( S_{i,j} \) denotes the similarity between the contexts \( p_{i} \) and \( p_{j} \), which is measured by the average distance \( \overline{D} \) of all contexts and the distance \( D_{i,j} \) between \( p_{i} \) and \( p_{j} \) after removing the term t from all contexts. Formula (3) denotes the entropy of term t measured by \( S_{i,j} \).
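Formulas (1)–(3) can be sketched in Python as below. The distance metric is not specified in the text, so this sketch assumes the Euclidean distance between term-count vectors and takes the mean distance over all unordered context pairs; both choices, together with the function names and the toy data, are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def domain_frequency(term: str, type_i: Tuple[str, str],
                     freq: Dict[Tuple[str, str], Counter]) -> float:
    """Formula (1): frequency of `term` under type_i over its frequency under all types."""
    total = sum(counts.get(term, 0) for counts in freq.values())
    return freq[type_i].get(term, 0) / total if total else 0.0


def similarity(d_ij: float, mean_d: float) -> float:
    """Formula (2): equals 0.5 when d_ij is exactly the mean distance."""
    return math.exp(math.log(0.5) / mean_d * d_ij)


def term_entropy(term: str, contexts: List[List[str]]) -> float:
    """Formula (3), computed after removing `term` from every context."""
    vocab = sorted({t for ctx in contexts for t in ctx if t != term})
    if not vocab or len(contexts) < 2:
        return 0.0
    vectors = [[ctx.count(t) for t in vocab] for ctx in contexts]
    # Assumption: Euclidean distance between term-count vectors.
    dists = [math.dist(vectors[i], vectors[j])
             for i, j in combinations(range(len(vectors)), 2)]
    mean_d = sum(dists) / len(dists)
    if mean_d == 0:
        return 0.0
    total = 0.0
    for d in dists:
        s = similarity(d, mean_d)
        if 0 < s < 1:
            # Factor 2: the double sum visits (i, j) and (j, i); i == j gives S = 1.
            total -= 2 * (s * math.log(s) + (1 - s) * math.log(1 - s))
    return total


# Toy data, invented for illustration.
freq = {("river", "province"): Counter({"flow": 3}),
        ("city", "province"): Counter({"flow": 1})}
df_flow = domain_frequency("flow", ("river", "province"), freq)
entropy_flow = term_entropy("flow", [["flow", "into"], ["flow", "through"], ["located", "in"]])
```

The constant ln 0.5 in formula (2) normalizes the similarity so that two contexts at the average distance have similarity 0.5, where the per-pair entropy term is largest.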

2.4 Feature Selection

Feature selection is crucial for keyword extraction. It has been shown to have a positive effect on classification accuracy [13], and it reveals the nature of keywords from multiple perspectives instead of the single aspect of term frequency. Taking the sample text in Sect. 2.1 as an example, the selected features are defined as follows.

  (1) The POS of the term (noun, verb, preposition or others), e.g., the POS of “邻近” is verb in Chinese, with the meaning ‘be close to’.

  (2) The length of the term, measured by the number of characters, e.g., the length of “邻近” is 2, which means “邻近” has 2 characters.

  (3) The location of the term (left of \( e_{1} \), between \( e_{1} \) and \( e_{2} \), or right of \( e_{2} \)), e.g., the location of “邻近” is between the geo-entity pair (\( e_{1} \) = 中关村, \( e_{2} \) = 北京大学).

  (4) The previous term just before \( e_{1} \), e.g., the previous term just before \( e_{1} \) = 中关村 is null.

  (5) The next term just after \( e_{1} \), e.g., the next term just after \( e_{1} \) = 中关村 is “位于”.

  (6) The previous term just before \( e_{2} \), e.g., the previous term just before \( e_{2} \) = 北京大学 is “邻近”.

  (7) The next term just after \( e_{2} \), e.g., the next term just after \( e_{2} \) = 北京大学 is “和”.

  (8) The distance between the term and \( e_{1} \), e.g., the distance between “邻近” and \( e_{1} \) = 中关村 is 3. Note that the distances in features (8)–(11) are measured by the number of elements after word segmentation.

  (9) The distance between the term and \( e_{2} \), e.g., the distance between “邻近” and \( e_{2} \) = 北京大学 is 0.

  (10) The distance between the term and the head of the sentence, e.g., the distance between “邻近” and the head of the sentence is 4.

  (11) The distance between the term and the tail of the sentence, e.g., the distance between “邻近” and the tail of the sentence is 4.
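The eleven features above can be computed from a segmented sentence roughly as follows. This is a sketch, not the original implementation: the token list is an assumed segmentation chosen so that the feature values agree with the examples in the list, the POS tags are hypothetical, and the distance convention (number of elements strictly between two positions) is inferred from those examples.

```python
from typing import Dict, List, Optional, Union

Feature = Union[str, int, None]


def extract_features(tokens: List[str], pos_tags: List[str],
                     term_idx: int, e1_idx: int, e2_idx: int) -> Dict[str, Feature]:
    """Compute features (1)-(11) for the term at term_idx, assuming e1_idx < e2_idx."""

    def neighbour(i: int) -> Optional[str]:
        # Returns None (the paper's "null") when the position is outside the sentence.
        return tokens[i] if 0 <= i < len(tokens) else None

    if term_idx < e1_idx:
        location = "left"
    elif term_idx < e2_idx:
        location = "between"
    else:
        location = "right"

    return {
        "pos": pos_tags[term_idx],                   # (1)
        "length": len(tokens[term_idx]),             # (2) number of characters
        "location": location,                        # (3)
        "prev_e1": neighbour(e1_idx - 1),            # (4)
        "next_e1": neighbour(e1_idx + 1),            # (5)
        "prev_e2": neighbour(e2_idx - 1),            # (6)
        "next_e2": neighbour(e2_idx + 1),            # (7)
        "dis_e1": abs(term_idx - e1_idx) - 1,        # (8) elements between term and e1
        "dis_e2": abs(term_idx - e2_idx) - 1,        # (9) elements between term and e2
        "dis_head": term_idx,                        # (10) elements before the term
        "dis_tail": len(tokens) - 1 - term_idx,      # (11) elements after the term
    }


# Hypothetical segmentation and POS tags (v = verb, ns = place name, c = conjunction, w = punctuation).
tokens = ["中关村", "位于", "海淀区", "，", "邻近", "北京大学", "和", "清华大学", "。"]
pos_tags = ["ns", "v", "ns", "w", "v", "ns", "c", "ns", "w"]

features = extract_features(tokens, pos_tags, term_idx=4, e1_idx=0, e2_idx=5)
```

With this assumed segmentation, the term “邻近” paired with (中关村, 北京大学) yields the same values as the examples in features (2)–(11).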

2.5 Term Assessing

After selecting features, the process of term assessing is conducted. It considers the influence of the length, POS, location and distance of the terms, as shown in formulas (4)–(8). The weights of these lexical features are determined statistically from the credible results of the two frequency-based statistical methods, and they change with the input texts in real time.

$$ wgt^{(t)} = \theta_{LEN} \times (\theta_{POS} + \theta_{LOC} + \theta_{DIS}) \tag{4} $$
$$ \theta_{LEN} = \begin{cases} 1, & \min < length(t_{pos}) < \max \\ 0, & \text{otherwise} \end{cases} \tag{5} $$
$$ \theta_{POS} = p(t_{pos}) \tag{6} $$
$$ \theta_{LOC} = \begin{cases} p(t_{loc} | tp(e_{1})) \\ p(t_{loc} | tn(e_{1})) \\ p(t_{loc} | tp(e_{2})) \\ p(t_{loc} | tn(e_{2})) \end{cases} \tag{7} $$
$$ \theta_{DIS} = \begin{cases} p(dis(e_{1}) | t_{loc}) \\ p(dis(e_{2}) | t_{loc}) \\ p(dis(head) | t_{loc}) \\ p(dis(tail) | t_{loc}) \end{cases} \tag{8} $$

In formula (4), \( wgt^{(t)} \) denotes the weight of term t for the specified geo-entity pair, considering the importance of length \( \theta_{LEN} \), part-of-speech \( \theta_{POS} \), location \( \theta_{LOC} \) and distance \( \theta_{DIS} \). Formula (5) denotes the weight of the length of term t, which depends on the POS of t (\( t_{pos} \)). Each POS has its own valid range of lengths, and \( wgt^{(t)} \) equals zero if the length of t with \( t_{pos} \) is outside this range. Formula (6) denotes the weight of POS, which is the probability that the POS of t, namely \( t_{pos} \), is equal to the specific part-of-speech. Formula (7) denotes the weight of relative location, which depends on the previous and next terms of the geo-entities. \( t_{loc} \) denotes the relative location of term t, which can be left, between or right. \( tp(e_{1}) \) denotes the previous term of \( e_{1} \), and \( tn(e_{1}) \) denotes the next term of \( e_{1} \). For example, \( p(t_{loc} = between | tp(e_{1})) \) denotes the probability that a term t located between \( e_{1} \) and \( e_{2} \) is the keyword when the previous term of \( e_{1} \) is a specific term. Formula (8) denotes the weight of distance, which depends on the location of the term. \( dis(e_{1}) \) denotes the distance between t and \( e_{1} \), \( dis(e_{2}) \) the distance between t and \( e_{2} \), \( dis(head) \) the distance between t and the head of the sentence, and \( dis(tail) \) the distance between t and the tail of the sentence. For example, \( p(dis(e_{1}) | t_{loc} = between) \) denotes the probability that a term t with a definite distance to \( e_{1} \) is the keyword when t is located between \( e_{1} \) and \( e_{2} \).

All terms in contexts are assessed by formula (4) and ranked in descending order. After ranking, a local ordered list of terms is generated for each geo-entity pair, which indicates the decreasing importance of the terms for geo-entity relation expression. The most important term is picked out as the keyword of the specified geo-entity pair.
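The weighting and ranking steps might be sketched as follows. Formulas (7) and (8) list four conditional probabilities each without stating how they are combined, so this sketch assumes their arithmetic mean; that choice, the layout of the probability tables, and every number below are assumptions for illustration rather than the estimates used in the experiments.

```python
from typing import Dict, List, Tuple

LOC_KEYS = ("prev_e1", "next_e1", "prev_e2", "next_e2")
DIS_KEYS = ("dis_e1", "dis_e2", "dis_head", "dis_tail")


def term_weight(features: dict,
                length_range: Dict[str, Tuple[int, int]],
                p_pos: Dict[str, float],
                p_loc: Dict[tuple, float],
                p_dis: Dict[tuple, float]) -> float:
    """Formula (4): wgt = theta_LEN * (theta_POS + theta_LOC + theta_DIS)."""
    # Formula (5): 1 if the length lies strictly inside the valid range of its POS.
    lo, hi = length_range.get(features["pos"], (0, 0))
    theta_len = 1.0 if lo < features["length"] < hi else 0.0
    # Formula (6): probability that a keyword has this POS.
    theta_pos = p_pos.get(features["pos"], 0.0)
    loc = features["location"]
    # Formulas (7) and (8): mean of the four conditional probabilities (assumption).
    theta_loc = sum(p_loc.get((loc, k, features[k]), 0.0) for k in LOC_KEYS) / len(LOC_KEYS)
    theta_dis = sum(p_dis.get((k, features[k], loc), 0.0) for k in DIS_KEYS) / len(DIS_KEYS)
    return theta_len * (theta_pos + theta_loc + theta_dis)


def pick_keywords(weights: Dict[str, float]) -> List[str]:
    """Return every term that reaches the maximal weight (ties give several keywords)."""
    if not weights:
        return []
    best = max(weights.values())
    return [term for term, w in weights.items() if w == best]


# Hypothetical feature values and probability tables.
features = {"pos": "v", "length": 2, "location": "between",
            "prev_e1": None, "next_e1": "位于", "prev_e2": "邻近", "next_e2": "和",
            "dis_e1": 3, "dis_e2": 0, "dis_head": 4, "dis_tail": 4}
p_pos = {"v": 0.5}
p_loc = {("between", "prev_e2", "邻近"): 0.8}
p_dis = {("dis_e2", 0, "between"): 0.4}

weight = term_weight(features, {"v": (0, 5)}, p_pos, p_loc, p_dis)
keywords = pick_keywords({"邻近": weight, "位于": 0.3})
```

Because \( \theta_{LEN} \) multiplies the whole sum, a term whose length falls outside the valid range for its POS receives weight zero regardless of its other features.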

3 Experiments

3.1 Dataset

All the articles on Chinese national geography are crawled from the Encyclopaedia of China (see footnote 3), with 2.3 million words in total. These articles describe the geographic, cultural and historical knowledge of toponyms, which provides rich information for geo-entity relation extraction. The articles are pre-processed using GATE (see footnote 4), and 31,065 geo-entity pairs are generated. The pairs are randomly divided into 3 groups to check the robustness of the proposed method.

3.2 Baselines

The proposed method is compared with DF and Entropy. Specifically, the DF method extends the classic TF-IDF by using the frequency of the terms in the contexts of type-specific entity pairs, which favors specific relational terms over generic ones. The Entropy method converts each context to a vector of terms and assesses the discriminating power of each term based on information theory, which provides useful heuristic information for keyword extraction.

3.3 Metrics

Because the total number of keywords in the experimental data is unknown, recall cannot be computed, and we only use the precision defined in formula (9). Cnt(right set) denotes the number of extracted keywords that are correct. Cnt(result set) denotes the total number of keywords in the results.

$$ Precision = \frac{Cnt(right\ set)}{Cnt(result\ set)} \tag{9} $$

We randomly sample part of the data from the results and have them evaluated manually by two people. The consistency of their annotations is measured by the kappa coefficient (κ) in formula (10), where \( P_{0} \) denotes the relative annotation agreement between the two people and \( P_{e} \) denotes the hypothetical probability of chance agreement. If κ > 0.8, the annotations are accepted and the mean precision of the two evaluations is calculated. Otherwise, the evaluation is conducted again.

$$ \kappa = \frac{P_{0} - P_{e}}{1 - P_{e}} \tag{10} $$
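Formulas (9) and (10) can be computed as in the following sketch. The annotation lists are invented for illustration (1 = keyword judged correct, 0 = judged incorrect), and the function names are hypothetical. \( P_{e} \) is estimated in the usual way for Cohen's kappa, from the product of the two annotators' marginal label frequencies.

```python
from typing import Hashable, Sequence


def precision(n_correct: int, n_results: int) -> float:
    """Formula (9): share of extracted keywords that are correct."""
    return n_correct / n_results if n_results else 0.0


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Formula (10): kappa = (P0 - Pe) / (1 - Pe) for two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty sample")
    n = len(labels_a)
    # P0: observed proportion of items on which the annotators agree.
    p0 = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: chance agreement from each annotator's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    pe = sum((list(labels_a).count(c) / n) * (list(labels_b).count(c) / n)
             for c in categories)
    if pe == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p0 - pe) / (1 - pe)


# Invented annotations for four sampled keywords.
annotator_a = [1, 1, 0, 0]
annotator_b = [1, 1, 0, 1]

kappa = cohen_kappa(annotator_a, annotator_b)
mean_precision = (precision(sum(annotator_a), 4) + precision(sum(annotator_b), 4)) / 2
```

In this toy sample the annotators agree on three of four items (\( P_{0} = 0.75 \)) with \( P_{e} = 0.5 \), so κ = 0.5, which would fall below the 0.8 acceptance threshold and trigger a new round of evaluation.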

3.4 Results

3.4.1 Keyword Extraction

We apply the proposed method to the first group as an example. The results are shown in Fig. 2. The terms in each context are ranked in descending order of importance, and the one with the maximal weight is picked out as the keyword for each geo-entity pair. Note that some geo-entity pairs have multiple keywords because multiple terms in one context have equal weights. For example, the geo-entity pair (Zhejiang Province, Qiandao Lake) has the keywords “artificial-lake” and “reservoir”.

Fig. 2. Examples of extracted keywords in the first group of data.

3.4.2 Additional Keywords

Compared with the corpus, some new geo-entity pairs and keywords are extracted in each group, as shown in Fig. 3. On the horizontal axis, pair denotes geo-entity pairs, type(kw) denotes the number of keyword types, and the index numbers correspond to the groups. The vertical axis denotes how many new objects are extracted. For example, in the first group, 35.4 % additional geo-entity pairs and 31.3 % additional keyword types are generated with our method.

Fig. 3. Additional geo-entity pairs and keywords.

The percentage of additional geo-entity pairs extracted is almost the same for the three methods. However, DF finds the largest number of new keyword types (56.6 % on average over the three groups), while Entropy misses most of the keywords.

3.4.3 Precision

The extracted keywords are evaluated manually and the kappa coefficient κ is calculated. The additional objects are evaluated to assess how well a keyword extraction method adapts to data not covered by the corpus. 100 additional geo-entity pairs with additional types of keywords are sampled randomly from the results and added to the evaluation set. Two people then simultaneously check whether each extracted keyword in the evaluation set is a relational term of the specific geo-entity pair. The kappa coefficient κ of 0.83 indicates high agreement and supports the validity of the evaluation.

Table 2 shows the precision for geo-entity pairs with additional types of keywords (new(kw), in short) and the mean precision over all results, which contain both the existing keywords and the newly discovered ones (AVG, in short). The proposed method achieves an average precision of 85.5 %, which is about 41 % and 36 % higher than DF and Entropy, respectively. More importantly, the precision of new types of keywords extracted with the proposed method is 60.3 %, surpassing DF and Entropy by 28 % and 33 %, respectively. Although the DF method obtains the largest number of new types of keywords (shown in Fig. 3), its precision on them is low (31.7 %). Moreover, the Entropy method misses most of the keywords and has the lowest precision.

Table 2. The precision of three methods for all additional extracted keywords (%)

3.4.4 Discussion

As mentioned in Sect. 1, the frequency-based methods for keyword extraction are derived from TF-IDF and Entropy. TF-IDF rests on the premise that entity relations appear frequently in massive texts, and Entropy depends on the hypothesis that the relational terms used to describe a specific relation appear more often than others. Both TF-IDF and Entropy assess the importance of terms by frequency statistics. Unfortunately, there is usually no significant frequency difference between keywords and other terms because the keywords are sparsely distributed. Thus, it is difficult to distinguish the keywords within contexts using frequency-based methods. Therefore, TF-IDF (including DF) and Entropy do not perform well in keyword extraction for sparse geo-entities, especially on the additional types of keywords.

In contrast, we extract keywords not only with the term frequency but also with lexical features that reveal the specific characteristics of the given texts. Besides, reliability is maintained by combining the types of geo-entities with the lexical features, which produces a large number of higher-quality keywords for sparse geo-entity relations. Moreover, our method can discover additional keywords from the original web texts, which is a step forward compared with supervised learning methods.

However, there are still two kinds of keywords that we cannot deal with effectively. (1) Keywords with semantic constraints. Sometimes relations depend on temporal, spatial or semantic constraints, so they no longer fit the format of the triplet. For example, the sentence “艾比湖蒙古语称为艾比淖尔(Aibi Lake is called Ebi Bur in Mongol).” expresses the fact (Aibi Lake, alias in Mongol, Ebi Bur). Our method can extract the keyword “be called”, which carries the meaning of “alias”, but misses the semantic constraint. More features, such as grammatical structure and semantic coherence, should be considered when dealing with keywords with semantic constraints. Besides, dependency parsing is also an effective solution for completing the relation expression. (2) Implicit keywords. A sentence may imply a relation between two geo-entities even though the keywords describing this relation do not appear in the sentence. For example, the sentence “The water resources of Min River are 13.32 million kilowatts, accounting for 18.85 % of the water resources of Sichuan Province” describes a topological relation (Min River; Sichuan Province; within), but there are no terms meaning “within” in the sentence. Geometric information from geographical knowledge bases (such as Geonames and OpenStreetMap) would be beneficial for extracting implicit spatial keywords.

Note that the main contribution of this study is to alleviate the influence of context sparseness. The proposed method addresses this problem with the help of a fine-grained mapping table and an open synonym dictionary. Because the language only influences the feature selection and the weights of the features, language-specific features should be selected when the context enhanced method is applied to other languages.

4 Conclusion

This paper proposed a context enhanced method to extract keywords from massive web texts for recognizing geo-entity relations with sparse distributions. We adopt two strategies to reduce the sparseness of terms in contexts. The first is a fine-grained type table used to merge the contexts and thereby increase the number of terms, and the second is semantic fusion, which reduces the sparseness of terms in all contexts. Moreover, we consider global and local features by introducing the length, part-of-speech, position and distance of terms to improve the performance. The experiments demonstrate that the proposed method enhances the ability to discover keywords of geo-entity relations with sparse distributions. The method also generates a large number of additional keywords, which is helpful for realizing unsupervised learning methods for geo-entity relation recognition.