Introduction

Keyword extraction is an important process in domain knowledge analysis. Keywords can be regarded as a generalization of the knowledge in the full text of a publication and help readers quickly grasp the core idea, technique, or methodology. However, when the literature of a certain field accumulates to a large amount, selecting the most representative keywords of the domain becomes a hard problem (Su and Lee 2010). A number of studies have investigated keyword extraction, proposing methods such as network-based analysis (Newman 2008) and selection of the top-n high-frequency keywords (Zhao and Wang 2010). These methods, however, still focus on high-frequency keywords; even methods based on network characteristics have been shown to be highly correlated with frequency-based results. For domain knowledge analysis, methods relying on high-frequency words may not be sufficient, because high-frequency words are often general and can hardly describe the distinguishing details and boundaries of domain knowledge. In contrast, low-frequency keywords might be related to new and innovative concepts emerging in a field (Quoniam et al. 1998). Therefore, locating low-frequency keywords can help identify dynamic areas, especially in sub-domains. Thus, analyzing domain knowledge with low-frequency keywords rather than high-frequency words has become a research hotspot.

Motivated by the idea that low-frequency keywords can also be important, especially in domain knowledge analysis, the Term Frequency-Keyword Active Index (TF-KAI) method (Chen and Xiao 2016) was proposed. This method can be regarded as an extension of Term Frequency-Inverse Document Frequency (TF-IDF) (Salton and Buckley 1988), which measures the popularity and discrimination capacity of words in target documents and background documents: TF measures the popularity and IDF measures the discrimination between the target documents and the background documents. In keyword analysis, the analysis targets are keyword lists, in which a keyword is distributed far more sparsely than in traditional full-text documents; thus the probability that a keyword also appears in the background documents is much lower. The TF-KAI method introduces the Active Index (AI) to enlarge the discrimination factor, so TF-KAI can identify representative keywords more effectively than the traditional TF-IDF method.

The TF-KAI method uses keyword rankings as the basis for importance evaluation; thus the frequencies of a keyword in the current domain literature and the background literature collections determine how important the keyword is in the domain. Keywords of similar meaning (synonyms), however, can be written in different forms. Take the term “slope collapse” from the field of geographic natural hazards as an example. In many cases, “slope collapse” can be written as “slope disintegration”, “hill collapse”, or “gully slump”. Similarly, in the remote sensing domain, night-time light has been used to study the power outages caused by rainstorms. In this area, “night-time lights” can also be written as “nightlight” or “nocturnal light”. Therefore, the semantic meanings behind keywords must be considered; otherwise, keywords with similar meanings could be counted separately (Wang et al. 2012), very different keyword lists could be obtained, and the analysis would be affected. Keywords with similar meanings (synonyms) must be merged before evaluating the importance of any specific keyword.

Synonym and semantic similarity problems have been research foci in the Natural Language Processing (NLP) field for a long time. Many mature methodologies, such as corpus-based and knowledge-based processing, have been proposed to deal with these problems. However, in much keyword analysis in the bibliometric field, semantic similarity problems have not been addressed with automatic methods; although authors have noted the disadvantages stemming from this issue (Yang et al. 2016), synonym problems are still often solved manually. For example, the bibliometric analysis software CiteSpace (Chen 2006) asks users to manually select words for an alias list to merge synonyms. Manual synonym merging often gives high-quality results, but it requires expert knowledge of the domain. Moreover, manual processing is often overwhelming when the amount of literature is large. Existing automatic methods for merging synonyms in keyword analysis use static knowledge databases and do not account for the dynamic keyword contexts that might affect the semantic meanings of the keywords. Therefore, a method to handle and depict the dynamic meaning of words in domain-specific contexts is needed.

To deal with the lack of domain-specific synonym merging and the inaccuracy stemming from using word frequencies as the measure of a term's importance, we introduce the Google Word2Vec model (Mikolov et al. 2013b), which uses word contexts to model the semantic meaning of a word when merging synonyms. Specifically, we propose the use of “semantic units” to represent the results of Word2Vec-based synonym merges. A semantic unit is a collection of keywords in which the similarity value between every pair of keywords surpasses a certain threshold. In addition, we adopt the popularity and discrimination measurements of TF-KAI and extend them into the Semantic Frequency-Semantic Active Index (SF-SAI) to obtain the most representative keywords. To verify the effectiveness of the proposed methods, we selected the “natural hazard” topical literature as the background corpus and the “geographical natural hazard” topical literature as the target domain field. After merging synonyms and obtaining the semantic units, we determined representative keywords using the SF-SAI method. By comparing with the original TF-KAI results qualitatively and quantitatively, we demonstrate the advantages of the proposed SF-SAI method.

The rest of this paper is organized as follows: “Related work” section introduces related studies, “Data and materials” section describes the experimental dataset and collection dates. “Methodology” section introduces the proposed methodology. “Results analysis” section presents experimental results, discusses the domain analysis results and the advantages and disadvantages of the proposed method. “Conclusions” section draws some conclusions.

Related work

Identifying representative keywords using frequency method

Term Frequency and Term Frequency-Inverse Document Frequency (TF-IDF)

Term frequency (TF) is the most common method for quantitative analysis of the literature. Highly frequent keywords can be regarded as indicators of research hot spots. Frequency can indicate the popularity of a term or keyword; however, in the domain keyword extraction task, the TF method alone is not effective, because it cannot discriminate between general and domain-specific words in the keyword list. TF-IDF extends TF to resolve this issue (Salton and Buckley 1988).

$${\text{TF-IDF}} = n\left( {i,j} \right) \times \log \left( {\frac{{n\left( {\text{all}} \right)}}{{n\left( {i,{\text{all}}} \right)}}} \right)$$
(1)

where \(n\left( {i,j} \right)\) stands for the TF of word \(i\) in corpus \(j\). The term \(\log \left( {n\left( {\text{all}} \right)/n\left( {i,{\text{all}}} \right)} \right)\) stands for the IDF; specifically, \(n\left( {\text{all}} \right)\) is the count of all documents and \(n\left( {i,{\text{all}}} \right)\) is the count of documents that contain word \(i\).

TF-IDF combines popularity and discrimination measurements; through the IDF process most irrelevant words are quickly eliminated. This effect of IDF is strengthened when the count of a keyword decreases in the original body of documents, or background corpus. In keyword extraction tasks, however, the results of the TF and TF-IDF methods behave very similarly. TF-IDF does not provide accurate domain-specific results, because the processed documents differ from the traditional full-text documents: a word has a relatively high probability of appearing in both a target document and the background documents (the “inverse documents” in Inverse Document Frequency) when full texts are processed, but a much lower probability when the documents consist of keyword lists, as in our case. Therefore, the IDF component has little effect and the discrimination ability is essentially absent. The Term Frequency-Keyword Active Index resolves this issue by introducing a new discrimination factor, the Keyword Active Index.

Term Frequency-Keyword Active Index (TF-KAI)

The Active Index (AI) is a concept used to describe whether a country/region has a comparative advantage in a particular field according to its share of total world publications (Chen et al. 2015). AI > 1 means that the country/region emphasizes a given domain compared with its average research level, and AI < 1 means that the country/region pays less attention to the field than its average research level; thus the research interest of the country/region can be depicted. Borrowing this idea, Chen et al. used the AI to describe the research interest of certain domains and introduced the KAI, instead of the IDF, to provide the discrimination between the domain and the background.

$${\text{TF-KAI}}\,\sim\,n\left( {i,j} \right)^{2}\,\times\,\frac{{n\left( {\text{all}} \right)}}{{n\left( {i,{\text{all}}} \right)}}$$
(2)

Here, the TF-KAI method eliminates the IDF log computation and thus greatly enhances the discrimination factor. In Chen and Xiao (2016), the top 97 keywords of three methods, TF, TF-IDF, and TF-KAI, were selected as the analysis target; 58 unique keywords were identified by the TF-KAI method. These keywords do not overlap with those from TF and TF-IDF and turn out to be more representative of domain knowledge in the digital library field against the background of information science. TF-KAI thus provides a simple and efficient solution for keyword extraction. However, it does not consider the relatedness among keywords.
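For illustration, the following minimal Python sketch computes the scores of formulas (1) and (2) from toy keyword counts; the function names and numbers are ours rather than from the cited work, and a natural logarithm is assumed.

```python
import math

def tf_idf(n_ij, n_all, n_i_all):
    # formula (1): term frequency weighted by the inverse document frequency
    return n_ij * math.log(n_all / n_i_all)

def tf_kai(n_ij, n_all, n_i_all):
    # formula (2): the log is dropped and TF is squared, which amplifies
    # the discrimination factor
    return (n_ij ** 2) * (n_all / n_i_all)

# toy numbers: a keyword appearing 12 times in the domain keyword lists,
# with 10384 background documents of which 60 contain the keyword
print(tf_idf(12, 10384, 60))   # ~61.8
print(tf_kai(12, 10384, 60))   # ~24921.6
```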

Identifying representative keywords using network methods

Network-based methods are the most common methods that consider relatedness between keywords. They identify keywords through the graph characteristics they inherit from co-word occurrence networks. A co-word network takes keywords as nodes and co-occurrences of keywords as the edges connecting the nodes; this network structure can be leveraged to evaluate the keyword nodes. Many mature metrics measure keyword behaviors in complex co-word networks. For example, node centrality metrics such as betweenness centrality and eigenvector centrality can be used to measure the importance of keywords in the network (Borgatti 2005). High node centrality indicates that a keyword appears frequently in the text; high betweenness centrality signifies that a keyword plays a connecting role between different sub-networks. Through an analysis of different patterns in a co-word network constructed from keyword lists, meaningful keywords extracted from the literature can be discriminated (Ding et al. 2001). However, most of these centrality metrics are highly correlated with the frequency-based methods. Moreover, the computational load is high, and the semantic meanings behind the keywords are not considered. Therefore, semantics-based methods are needed.
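As a concrete (assumed) illustration of this family of methods, the sketch below builds a small co-word network with networkx and ranks keywords by betweenness centrality; the keyword lists are toy data, not the datasets used in the cited studies.

```python
import itertools
import networkx as nx

# toy keyword lists, one list per publication record
keyword_lists = [
    ["flood risk", "gis", "vulnerability"],
    ["gis", "landslide", "vulnerability"],
    ["flood risk", "remote sensing", "gis"],
]

G = nx.Graph()
for keywords in keyword_lists:
    # every pair of keywords co-occurring in one record becomes an edge
    for a, b in itertools.combinations(sorted(set(keywords)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# rank keywords by betweenness centrality in the co-word network
centrality = nx.betweenness_centrality(G)
for kw, score in sorted(centrality.items(), key=lambda x: -x[1]):
    print(kw, round(score, 3))
```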

Identifying synonyms using semantic similarity measurement

Semantic similarity methods have been applied in Natural Language Processing (NLP), Artificial Intelligence, Cognitive Science, and Psychology. These similarity-measuring methods can be categorized into two main types: knowledge-based and corpus-based methods.

Before 2013, the most common semantic similarity methods were knowledge-based, such as WordNet (Miller 1995), a well-known human-curated lexical database. With the static, pre-built structures stored in the WordNet lexical database, semantic similarity can be measured through path-based, information content-based, feature-based, and hybrid measures (Meng et al. 2013). WordNet builds upon the relatedness among words, including synonyms, hyponyms, meronyms, hypernyms, and holonyms. It provides a general language ontology enabling highly accurate similarity results. There is a limitation, however: because the knowledge base is pre-built and static, many newly emerging words that appear in the information explosion over the internet are not included in the knowledge base.

Corpus-based methods address this situation by modeling semantic meanings from an existing corpus. For example, Latent Semantic Analysis (LSA) and Pointwise Mutual Information (PMI) are classical similarity measures derived from existing corpora (Mihalcea et al. 2006); however, the computational load is often high. The Word2Vec model is also corpus-based but extends the n-gram linguistic model (Mikolov et al. 2013b), which can help determine the semantic distance between two words without supervised information. Word2Vec is also an extension of the work on the neural-probability language model (NLM) developed by Huang et al. (2012), but it computes word vectors more efficiently. Word2Vec models semantic meaning based on the relations between words and their surrounding context words. The Continuous Bag of Words (CBOW) and Skip-gram models (Mikolov et al. 2013a) are supported in the Word2Vec model. More specifically, the CBOW model predicts target words using the preceding and subsequent context words, while the Skip-gram model predicts the words surrounding a target word. With these model implementations, Word2Vec exhibits high efficiency in semantic similarity testing with relatively high accuracy. Although the accuracy of Word2Vec-based results is slightly lower than that of knowledge-based similarity measurements, Word2Vec offers higher recall. Because of these characteristics, Word2Vec has been widely adopted in many similarity-measure-related studies (Handler 2014).

All these similarity measures have been studied and applied in various applications, but the synonym problem in the keyword analysis field has only a limited set of solutions. For example, Wang dealt with the synonym problem in a co-word analysis by using a thesaurus to merge keywords with similar meanings (Wang et al. 2012). Feng improved co-word analysis results by applying ontological concept mapping (Feng et al. 2017). The methods using a thesaurus and ontological concept mapping are knowledge-based; the accuracy of synonym identification can be assured, but details such as newly emerging terms or low-frequency keywords, which are very common in research articles, might be missed. Because knowledge-based similarity computation relies on expert knowledge and historical records, these approaches cannot deal with synonym problems among low-frequency keywords. Therefore, we choose the Word2Vec model, a corpus-based method, to dynamically model the semantic meanings behind low-frequency keywords, lessening the impact of synonym problems.

Data and materials

We collected our experimental data from a well-known scientific database: the core collections of the Web of Science (WoS). Because our team's background is in geographic information science, we chose the familiar field of natural hazards as the background. Natural hazards can have an enormous impact on the living conditions and economic development of countries and districts. Research on or related to this topic can be conducted from many different vantage points, including urban planning, government policy, and economic development planning. Therefore, many research goals are clustered in this area for solving problems and supporting decision-making. Geographic natural hazards are related to geological structures and vegetation coverage. Many natural hazards, however, are not so closely related to geography, such as climate change, greenhouse effects, and extreme weather. Thus it is meaningful to filter the background information out of the domain corpus to identify ways in which geography-related natural-hazard research differs from general research on natural hazards.

To refine the search conditions, we restricted the search to the Social Science Citation Index (SSCI) and Science Citation Index-Expanded (SCIE) and to English-language articles. We also set the search time range to “1985–2016”. For the natural-hazard background publication corpus, we set the topic words to “natural hazard”. For the corresponding domain publication corpus, we set the topic words to “natural hazard and geography”. Finally, we obtained a background corpus of 10,384 records and a domain corpus of 614 records. The search date was 2017-01-01.

The basic descriptive statistics of the dataset appear in Table 1. The keyword count is the number of unique keywords, while the accumulated keyword count is the total number of keyword occurrences, including duplicates. Thus, we can tell that most keywords have low frequencies, most appearing no more than twice. All the words in the abstracts were used as input to the Word2Vec model; these plain-text words constituted the corpus for building up the semantic space. More details about this procedure are discussed in the “Methodology” section of this paper. The words in the keyword collection were evaluated by the original TF-KAI and our proposed SF-SAI methods.

Table 1 Geography domain corpus and background corpus of natural hazard

Methodology

In this paper, we applied the ideas behind the TF-IDF method to express popularity and discrimination between a domain and a background field, thus highlighting the domain-specific characteristics. The focus of this paper is to elucidate an approach that extends keyword frequency statistics and manual word disambiguation toward automatic semantic unit generation, presented as the workflow in Fig. 1. Specifically, we define a “semantic unit” as a collection of keywords in which every pair of keywords has a similarity value exceeding a user-defined similarity threshold.

Fig. 1

Workflow of the publication keyword extraction

In Fig. 1, the part outside the dashed rectangle was described in the previous work of Chen and Xiao (2016). The TF-KAI method can achieve better results than TF and TF-IDF when finding domain-specific keywords. However, as argued above, the similarity and ambiguity of words should be considered for a more accurate and thorough analysis, so we introduce the word-embedding model to express contextual semantic information. In this paper, three extraction methods, TF, TF-IDF, and TF-KAI, are extended to Semantic Frequency (SF), Semantic Frequency-Semantic Inverse Document Frequency (SF-SIDF), and Semantic Frequency-Semantic Active Index (SF-SAI), respectively. The details of how these metrics are computed are as follows:

Computing a Semantic Frequency (SF) value

In the TF method, term frequency is expressed as \(n\left( {i,j} \right)\). When extending to SF, a keyword frequency is replaced with the frequency of a semantic unit. We use the words in the domain abstracts as the corpus to generate the Word2Vec model; Word2Vec provides extraction at a more granular word level. In a co-word network, a keyword can be represented as a vector of 0s and 1s, such as (\(0,0,0,1,1,1,0, \ldots 1,1)\), also called a “one-hot” representation, which produces a very large and sparse matrix. The mathematical vector generated by Word2Vec is much denser, represented as (\(0.54, 0.38, 0.34, 0.21, 0.02, \ldots 0.34, 0.37)\), also called a “distributed representation”. Thus, in the word vector space, keywords can be mapped to computable mathematical vectors. More specifically, Word2Vec works with the plain text of the abstracts, as illustrated in Fig. 2.

Fig. 2

The Word2Vec working process

As Fig. 2 shows, the inputs to Word2Vec are word sequences generated from the plain text extracted from the literature records. The plain text of the abstracts can be obtained from files downloaded from the Web of Science (WoS) database. Considering computing efficiency and similarity-modeling accuracy, we selected the Skip-gram model rather than the CBOW model for training; the two are often used interchangeably in practice, as they represent different modeling strategies but either can train a Word2Vec model. Each author keyword can then be mapped to a point in a 100-dimensional semantic space using the Word2Vec model, represented as a 100-dimensional mathematical vector. Using cosine similarity, the similarity between two words or multi-word terms can be computed; this computation is illustrated in Fig. 3.
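A minimal sketch of this training step, assuming the gensim (4.x) Word2Vec API and a tiny placeholder corpus in place of the tokenized abstracts:

```python
from gensim.models import Word2Vec

# placeholder for the tokenized, stemmed abstract texts
sentences = [
    ["geograph_inform_system", "support", "flood", "risk", "map"],
    ["gis", "landslid", "suscept", "assess", "flood"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # 100-dimensional semantic space, as used in the paper
    sg=1,             # 1 = Skip-gram; 0 would select CBOW
    window=5,
    min_count=1,
    workers=4,
)

# cosine similarity between two terms in the learned semantic space
print(model.wv.similarity("gis", "geograph_inform_system"))
```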

Fig. 3

The Cosine similarity diagram for two-dimensional vectors

Figure 3 shows two vectors, \(a\left( {x_{1} , y_{1} } \right)\) and \(b\left( {x_{2} ,y_{2} } \right)\) in a two-dimensional space. The similarity will be computed as illustrated in formula (3):

$$\cos \theta = \cos \left( {a, b} \right) = \frac{a \cdot b}{{\left| {\left| a \right|} \right|\left| {\left| b \right|} \right|}} = \frac{{x_{1} x_{2} + y_{1} y_{2} }}{{\sqrt {x_{1}^{2} + y_{1}^{2} } \times \sqrt {x_{2}^{2} + y_{2}^{2} } }}$$
(3)

When the dimensionality is extended to higher levels, such as 100 dimensions in our case, vectors a and b become \(a\left( {a_{1} ,a_{2} ,a_{3} \ldots a_{n} } \right)\) and \(b\left( {b_{1} ,b_{2} ,b_{3} \ldots b_{n} } \right)\), and the corresponding equation can be written as formula (4):

$$\cos \theta = \cos \left( {a,b} \right) = \frac{{\mathop \sum \nolimits_{1}^{n} \left( {a_{i} \times b_{i} } \right)}}{{\sqrt {\mathop \sum \nolimits_{1}^{n} a_{i}^{2} } \times \sqrt {\mathop \sum \nolimits_{1}^{n} b_{i}^{2} } }}$$
(4)
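Formula (4) translates directly into a few lines of numpy; the vectors below are illustrative placeholders rather than real word vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    # formula (4): dot product divided by the product of the vector norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.54, 0.38, 0.34, 0.21])
b = np.array([0.50, 0.40, 0.30, 0.25])
print(cosine_similarity(a, b))  # values close to 1 indicate similar meaning
```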

Through this computation, the similarity between any two word vectors can be obtained. Note that the similarity value \(\cos \theta\) lies in \(\left[ {0,1} \right]\): 0 means that there is no semantic overlap between the two words, and 1 means that the two words have the same semantic meaning. In this paper, we use an empirically chosen similarity threshold; two words whose similarity value exceeds the threshold are regarded as belonging to the same “semantic unit”. Therefore SF can be written as \(n\left( {i_{\text{cluster}}, j} \right)\); more specifically, \(n\left( {i_{\text{cluster}}, j} \right)\) is described by formula (5):

$$n\left( {i_{\text{cluster}} ,j} \right) = \sum \left\{ {{\text{keyword}}\_{\text{freq}}\left( k \right)|\cos \left( {i,k} \right) > t, \;k \in j, \;j \in {\text{t}}\_{\text{corpus}}} \right\}$$
(5)

where \(k\) stands for any keyword in document \(j\) of “t_corpus”, the target corpus, and \(t\) stands for the empirically set similarity threshold. Thus, SF is the summed frequency of all keywords whose semantic similarity with keyword \(i\) exceeds the threshold \(t\).
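The following sketch shows one reading of formula (5): the SF of keyword i sums the frequencies of all target-corpus keywords whose similarity with i exceeds t. The helper names and the similarity function are hypothetical.

```python
def semantic_frequency(i, keyword_freq, similarity, t=0.97):
    """SF of keyword i per formula (5): summed frequency of all target-corpus
    keywords whose cosine similarity with i exceeds the threshold t.
    keyword_freq maps keyword -> frequency; similarity(i, k) returns cosine."""
    return sum(freq for k, freq in keyword_freq.items() if similarity(i, k) > t)

# toy usage with invented similarity values
freqs = {"flood_hazard": 5, "flood_event": 3, "drought": 7}
sims = {("flood_hazard", "flood_hazard"): 1.0,
        ("flood_hazard", "flood_event"): 0.98,
        ("flood_hazard", "drought"): 0.80}
print(semantic_frequency("flood_hazard", freqs,
                         lambda i, k: sims[(i, k)]))   # 5 + 3 = 8
```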

Computing SF-SIDF and SF-SAI values

In the TF-IDF method, IDF measures how frequently a word appears in the inverse documents, or the background corpus in this paper, as seen in formula (1). Similarly, when extended to the semantic method, TF-IDF becomes SF-SIDF, as seen in formula (7). When computing the SIDF, we use the generated Word2Vec model: whenever a document contains a word whose similarity with the semantic unit in the SF results is higher than the threshold t, the count \(n\left( {i_{\text{sim}} ,{\text{all}}} \right)\) of the SIDF increases by one. More specifically, \(n\left( {i_{\text{sim}} ,{\text{all}}} \right)\) can be written as formula (6):

$$n\left( {i_{\text{sim}} ,{\text{all}}} \right) = \sum \left\{ {{\text{document}}\_{\text{freq}}\left( k \right)|\cos \left( {i,k} \right) > t, \;k \in {\text{all}}, \;{\text{all}} \in {\text{b}}\_{\text{corpus}}} \right\}$$
(6)

where \(k\) is also an arbitrary keyword, but in the background corpus; “all” stands for the documents of the background corpus; “b_corpus” stands for the background corpus; and \(n\left( {i_{\text{sim}} ,{\text{all}}} \right)\) stands for the number of documents that contain the word or similar words. Thus the semantic inverse document frequency can be written as \(\log \left( {n\left( {\text{all}} \right)/n\left( {i_{\text{sim}} ,{\text{all}}} \right)} \right)\), and SF-SIDF can be written as formula (7):

$${\text{SF-SIDF}} = n\left( {i_{\text{cluster}} ,j} \right) \times \log \left( {\frac{{n \left( {\text{all}} \right)}}{{n \left( {i_{\text{sim}} ,{\text{all}}} \right)}}} \right)$$
(7)

The KAI value describes how active a keyword is in the domain and can highlight domain preferences. When extending TF-KAI to SF-SAI, the process uses the similarity computation based on the Word2Vec model generated from the domain corpus. SF-SAI can be written as formula (8):

$${\text{SF-SAI}} = n\left( {i_{\text{cluster}} ,j} \right)^{2} \times \frac{{n\left( {\text{all}} \right)}}{{n\left( {i_{\text{sim}} ,{\text{all}}} \right)}}$$
(8)
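Under the same assumptions, formulas (7) and (8) can be sketched as follows, reusing a semantic frequency value such as the one computed by the semantic_frequency sketch above and a background document count obtained with the same threshold.

```python
import math

def sf_sidf(sf_value, n_all, n_i_sim_all):
    # formula (7): semantic frequency times the semantic inverse document frequency
    return sf_value * math.log(n_all / n_i_sim_all)

def sf_sai(sf_value, n_all, n_i_sim_all):
    # formula (8): squared semantic frequency times the semantic active index term
    return (sf_value ** 2) * (n_all / n_i_sim_all)

# toy numbers: SF = 8, 10384 background documents, 120 of them containing
# a word similar to the semantic unit (threshold 0.97)
print(sf_sidf(8, 10384, 120))   # ~35.7
print(sf_sai(8, 10384, 120))    # ~5538.1
```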

Note that the semantic similarity threshold t is set to 0.97 based on experience from multiple experiments, which yields relatively accurate semantic similarity. The Word2Vec modeling process is implemented with Python packages for NLP and machine learning, including NLTK (Bird 2006) and Gensim.
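The stemmed, underscore-joined keyword forms that appear in the results (e.g. “geograph_inform_system”) suggest preprocessing along the following lines; this is an assumed sketch using NLTK's Porter stemmer, not the authors' exact pipeline.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalize_keyword(keyword):
    # lowercase, split into tokens, stem each token, join with underscores
    return "_".join(stemmer.stem(tok) for tok in keyword.lower().split())

print(normalize_keyword("Geographic Information System"))  # geograph_inform_system
```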

Results analysis

Limitation in TF-KAI results

The keywords extracted by TF, TF-IDF, and TF-KAI are listed in “Appendix 1”. As a routine for selecting domain keywords, most keyword analyses take fewer than 100 keywords for the analysis task (Chen and Xiao 2016); among the top 99 keywords selected by all three methods, 33 keywords overlap. The TF and TF-IDF results share 89 keywords; the TF-KAI and TF-IDF results share 40. TF-KAI identifies 59 keywords that differ from both the TF and TF-IDF results, a result similar to that reported in Chen and Xiao (2016). It is evident that the TF-KAI method is more effective for identifying domain-specific keywords than the TF or TF-IDF method.

The TF-KAI method indeed provided comparatively better results in our experiment. However, the limitation of this method in neglecting the semantics behind the keywords is visible. From “Appendix 1”, we can tell that the top three keywords in the TF-KAI list are “geograph_inform_system_gis”, “geograph_inform_system”, and “gis”, which represent the same or similar meanings but are written in different forms. Therefore, the semantic meanings must be considered. More synonym examples are collected in Table 2.

Table 2 Exemplar synonyms in TF-KAI results

From Table 2, we can tell that some keywords have similar meanings but were considered separately rather than as a semantic unit, resulting in very different rankings. Some of the keywords appeared only once, but they could not be ignored, as they also stand for very closely related research directions. In addition, some keywords like “geograph_inform_system” can also be written as “geograph_inform_system_gis”, and both expressions have a relatively high frequency. In this case, merging these two keywords with similar meanings makes the resulting semantic unit rank much higher.

To analyze the results of TF-KAI more intuitively, we adopted word embedding to generate the heat map depicted in Fig. 4. The points scattered on the map are the 59 unique TF-KAI keywords that differ from the TF and TF-IDF results, each point representing one keyword. The closer the data points are, the more similar the semantics of the corresponding words. As the default word embedding is a 1 × 100 dimensional vector, dimensionality reduction with the t-SNE method (Der Maaten and Hinton 2008) generates two-dimensional data vectors. A total of 1868 word vectors are used to generate the heat map via kernel density estimation (Rosenblatt 1956). As the keywords are not evenly distributed in the semantic space, the degree of clustering varies. A deeper color means a higher probability of finding a similar keyword in that area. Overlapping of different keywords indicates high semantic similarity among those words, suggesting the inefficiency of the TF-KAI keyword extraction process: semantically overlapping keywords should be merged into independent semantic units. Therefore, it is necessary to take the semantic meanings of keywords into consideration.
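A plausible sketch of this visualization step, assuming scikit-learn's t-SNE and scipy's Gaussian kernel density estimation (placeholder vectors stand in for the trained embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE
from scipy.stats import gaussian_kde

# placeholder for the (n_words, 100) embedding matrix from the Word2Vec model
word_vectors = np.random.rand(1868, 100)

# reduce the 100-dimensional embeddings to two dimensions with t-SNE
coords = TSNE(n_components=2, random_state=0).fit_transform(word_vectors)

# kernel density estimation over the 2-D points gives the heat-map surface
kde = gaussian_kde(coords.T)
density = kde(coords.T)   # estimated density at each keyword position
```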

Fig. 4

Semantic density map of the 59 unique TF-KAI keywords among all domain keywords

Semantic based results (SF, SF-SIDF, and SF-SAI)

Semantics are significant and cannot be neglected in bibliometric and scientometric analysis. Here we apply the proposed SF, SF-SIDF, and SF-SAI methods to extract domain-specific keywords.

Generating SF results

It is important to set the granularity of the abstraction level of the semantic meanings: too large or too small a granularity creates interpretation difficulties. Many domain-specific keywords are distributed in low-density areas, which is in line with the assumption that domain-specific keywords are often very unique. In this paper, we use the similarity threshold to control the clustering granularity of the semantic units. In Table 3, the most similar keywords for five exemplar keywords are collected.

Table 3 Examples of the most similar keywords for five exemplar keywords

From the table, we can tell that the similarity computation can reveal the semantic meaning to some extent. However, because of their semantic characteristics, keywords are not evenly distributed in the semantic space: some keywords tend to have more similar keywords and others do not. Therefore, setting different thresholds leads to different numbers of similar keywords, resulting in semantic units of different sizes. Table 4 lists the top semantic units under different similarity thresholds (ranging from 0.95 to 0.99).

Table 4 The top five semantic units with different similarity thresholds by SF ranking

Through experimentation with different similarity thresholds, we found that a threshold of 0.97 best represents word-level semantic meanings, although word-level meaning is a relatively vague concept that is hard to measure quantitatively.
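The effect of the threshold on unit granularity can be illustrated with a toy sweep; the similarity values below are invented for demonstration only.

```python
# invented cosine similarities between a seed keyword and its neighbours
toy_similarities = {
    "flood_hazard": 0.990, "flood_event": 0.975, "inund": 0.960,
    "rainstorm": 0.952, "drought": 0.800,
}

for t in (0.95, 0.96, 0.97, 0.98, 0.99):
    unit = [k for k, sim in toy_similarities.items() if sim > t]
    print(t, len(unit), unit)   # lower thresholds absorb more keywords
```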

Generating SF-SIDF and SF-SAI results

Background information must be discriminated from domain information prior to generating the SF-SIDF and SF-SAI results. The discrimination factor in the TF-IDF method is the IDF value, computed as \(\log \left( {n\left( {\text{all}} \right)/n\left( {i,{\text{all}}} \right)} \right)\). In the semantic-based methods, the discrimination factor, the SIDF value, is computed as \(\log \left( {n\left( {\text{all}} \right)/n\left( {i_{\text{sim}} ,{\text{all}}} \right)} \right)\). The \(n\left( {i_{\text{sim}} ,{\text{all}}} \right)\) value does not count exactly the same semantic unit as the SF function \(n\left( {i_{\text{cluster}}, j} \right)\), but rather the keywords in the background corpus that are similar to or belong to the semantic unit, because the background corpus contains many more keywords, many of which differ from the keywords in the semantic units generated from the domain corpus. We can tell from Table 2 that keywords in the semantic units have various similarity values with keywords from the background corpus. Finding similar words in a background corpus relies on word-level meaning; therefore, we set the similarity threshold to a rigid value and obtained similar keywords with a threshold of 0.97. The SF-SIDF and SF-SAI results obtained with this setting are shown in “Appendix 2”. Based on the semantic units, we built a ranking list of 1355 semantic units in total. To examine the functionality of the discrimination factor, we also computed the correlations between the SF and SF-SIDF results and between the SF and SF-SAI results.

As Fig. 5 illustrates, SF-SIDF has a very high correlation with the SF results (R² = 0.5057); thus SIDF does little to discriminate the representative keywords of the current corpus from the background corpus. The SF-SAI results, on the other hand, show a low correlation with SF (R² = 0.1972). Therefore, the semantics-based SF-SAI method produces more effective results than SF-SIDF when identifying representative keywords.
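Assuming a simple linear fit, the reported R² values amount to the squared Pearson correlation between the two score lists; the numbers below are placeholders, not the paper's data.

```python
import numpy as np
from scipy.stats import pearsonr

sf      = np.array([12.0, 7.0, 5.0, 3.0, 2.0])    # placeholder SF scores
sf_sidf = np.array([30.1, 18.4, 16.0, 6.2, 5.9])  # placeholder SF-SIDF scores

r, _ = pearsonr(sf, sf_sidf)
print(r ** 2)   # coefficient of determination of the linear fit
```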

Fig. 5

Semantic unit frequency correlated to the SF-SIDF (left) and SF-SAI (right)

Qualitative analysis of the TF-KAI and SF-SAI results

To examine the performance of our proposed methods, we collected and compared the keyword results produced by TF-KAI and the proposed SF-SAI. Because the TF-KAI method is regarded as effective in domain analysis, we use the TF-KAI results as the baseline. We regard the keywords belonging to one semantic unit as representing the same meaning, and we selected a random keyword from each “semantic unit” to compare with the keywords in the list generated by TF-KAI. Examining the results, we find that 66 keywords in the TF-KAI list are included in the SF-SAI list and that 61 semantic units contain TF-KAI keywords. We regard these 61 semantic units as overlapping with TF-KAI; because some of the TF-KAI words are clustered into the same semantic units in the SF-SAI results, 66 keywords are regarded as overlapping with SF-SAI. Given this relatively high overlap rate, we can conclude that SF-SAI covers essentially the results that TF-KAI provides.

Whether the SF-SAI results better represent the domain knowledge depends on whether the unique part of the SF-SAI results is better than the unique part of the TF-KAI results. We collected all 137 keywords, including 38 unique semantic units, 33 unique keywords, and 66 overlapping keywords, as shown in “Appendix 3”. In line with the tradition in keyword analysis, we explored the structure of domain knowledge using keyword clusters extracted through hierarchical clustering, as shown in Fig. 6.
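A sketch of this clustering step, assuming scipy's agglomerative clustering over the keyword embedding vectors (placeholder data; the real input would be the 137 keyword vectors):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# placeholder for the 137 keyword embedding vectors (100-dimensional)
keyword_vectors = np.random.rand(137, 100)

# agglomerative clustering with cosine distance, cut into seven clusters
Z = linkage(keyword_vectors, method="average", metric="cosine")
labels = fcluster(Z, t=7, criterion="maxclust")
print(labels[:10])   # cluster label of the first ten keywords
```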

Fig. 6

The hierarchical clustering results for all 137 keywords. The 33 TF-KAI keywords, 38 SF-SAI keywords, and 66 overlapping keywords can be found in “Appendix 3”

Table 5 lists the clustering results obtained by hierarchical clustering, as visualized in Fig. 6. The keyword clusters are listed from top to bottom as in the dendrogram, from cluster 1 to cluster 7.

Table 5 Keyword clusters generated by the TF-KAI and SF-SAI methods

We analyzed the corresponding seven clusters, demonstrating how SF-SAI more effectively illustrates the knowledge structure of a specific domain:

Cluster 1 contains keywords related to the topic of natural hazards and primary school education. The unique TF-KAI keywords include author names such as “dewey_john” and “white_gilbert”. These are indeed important scholars in geographic research and natural hazard research. However, the TF-KAI method considers only string-level uniqueness: author names are rarely used as keywords, yet TF-KAI selected them, given their uniqueness, rather than the research topics behind the author names. In contrast, the SF-SAI results yield keywords related to locations where geographical disasters have occurred, such as “istanbul_turkey”, “gargano_promontori_southern_itali”, and “guangdong_provinc”.

Cluster 2 includes topics related to human issues caused by geographical natural hazards. The TF-KAI results contain human-related keywords like “human_settlement” and “environment_justice”, but these terms are not directly specific to geographic natural hazards. The SF-SAI results, however, bring in keywords that are more related to disasters. The term “social_media” found in the SF-SAI results reflects the popularity of these tools for understanding human reactions to disasters. The term “claim_payout” seen in the SF-SAI results is a frequently mentioned keyword used in disaster management scenarios.

Cluster 3 is related to urban planning issues associated with disasters. The TF-KAI results contain terms such as “european_polici” and “participatori_plan”, which are related to urban planning and management but are hardly connected to natural hazard scenarios. Moreover, other TF-KAI keywords are also general terms like “decis_tree_dt”, “multi_criteria_decis_analysi_mcda”, and “support_vector_machin_svm”; they were most likely selected because of string-level rather than semantic-level uniqueness. SF-SAI produced more relevant results, including topics like “urban_terror”, “land_us”, “sustain_tourism”, and “participatori_gis”. Specifically, “participatori_gis” refers to a public-participation approach in GIS that enables planners to collect information and suggestions from local citizens, which can help disaster management and relief.

Cluster 4 contains keywords related to risk assessment. It is a small cluster: each of TF-KAI and SF-SAI contributes two unique keywords, and these keywords are similar, yet there are some differences. Expressions such as “multi_criteria_method” produced by TF-KAI are very general phrases and may not sufficiently reflect the geographic natural hazard research focus. The SF-SAI keywords place more emphasis on the disasters themselves, such as “multi_risk_assess” and “multi_hazard_zone”.

Cluster 5 is related to hazard evaluation and prediction approaches. The unique TF-KAI keywords contain the terms “cross_applic”, “multicriteria_analysi”, “parallel_comput”, “fuzzi_relat”, and “probabl_model”. These reflect a limited range of geographical characteristics; they are general terms that can be applied in all natural-hazard-related research, rather than specifically geographic natural hazards. The expressions “flash_flood”, “wenchuan_earthquak”, “wildland_fire_risk”, “hurrican_mitc”, and “DEA” generated by SF-SAI, however, are more specific to hazard-related topics and may be more interesting to domain experts.

Cluster 6 reveals keywords on the vulnerability topic in multiple disaster scenarios. “eman_coeffici” relates to human health hazards from radiation exposure and is not very relevant to geographical disasters. Such terms are selected because of their special linguistic form, but they have limited semantic relevance to geographical natural hazard topics. In contrast, the SF-SAI keywords offer more details on vulnerability indices, which are more representative of geographical natural hazard research.

Cluster 7 contains keywords related to disaster risk evaluation. In this cluster, the TF-KAI and SF-SAI methods yielded similar keywords. However, the expression “landslid_monitor” found in the TF-KAI results is a broad concept and may not be sufficient to capture the research focuses of domain experts. The term “disast_risk_index” seen in the SF-SAI results, however, is a more concrete index used to quantitatively evaluate the possibility of disaster events, and would likely be of interest to many domain experts.

To understand the results produced by TF-KAI and SF-SAI more intuitively, we also present them in semantic density maps, as illustrated in Figs. 7 and 8. In the “land_use” area of the SF-SAI map, the TF-KAI map shows the keyword “cross_application”, which is a general term. Where the “participate_GIS” area appears in the SF-SAI map, the corresponding area of the TF-KAI map shows “parallel_computing”. In this regard, participatory GIS refers to GIS applications that incorporate public involvement, a set of information collection methods that support rapid information updates in disaster areas. The term “parallel_computing” is useful in geographic natural hazard research, but it is a much more general term referring to data processing applicable to all natural hazard datasets. In the densest semantic area in the lower part of Fig. 8, SF-SAI offers the keywords “claim_payout” and “vulnerabilidad_de_la_infraestructura”. Correspondingly, the TF-KAI method produces “decis_tree_dt” and “multi_criteria_decis_analysi_mcda”, terms that have limited domain-specific representative capability.

Fig. 7

The SF-SAI unique keywords on a semantic density map (compared with TF-KAI)

Fig. 8

The TF-KAI unique keywords on a semantic density map (compared with SF-SAI)

Quantitative experiments

Through the analysis above, we can only understand the advantages of SF-SAI subjectively. To measure these advantages quantitatively, we conducted a blind test similar to that reported in Chen and Xiao (2016). Experts were asked to evaluate whether the extracted keywords are representative of the domain knowledge. The probability of a unique keyword being identified by the experts as representative is described as the percentage of identified keywords among all the TF-KAI or SF-SAI unique keywords. As shown in Table 6, each SF-SAI semantic unit provides one keyword; together with the 33 unique TF-KAI keywords, 61 keywords in total were selected as the test materials, as listed in “Appendix 4”.

Table 6 Results of the blind-testing by experts

From the result table, we can tell that the unique SF-SAI keywords are more often judged representative than the unique TF-KAI keywords. Therefore, we can conclude that the efficiency of selecting domain-specific keywords is improved.

The number of selected keywords also affects the estimation results for the proposed method. For example, selecting the top 100 versus the top ten semantic units will surely lead to different estimates of SF-SAI performance. Therefore, we set up an experiment to see how the number of selected keywords affects the estimation results. As in the previous experiment, we use the TF-KAI results as the baseline. We performed ten comparisons, from the top ten to the top 99 units of the TF-KAI and SF-SAI lists; the overlapping counts and proportions are illustrated in Fig. 9.
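The overlap counts in Fig. 9 amount to intersecting the two top-k lists for growing k; a toy version of that computation is sketched below (the ranked lists are invented).

```python
def overlap_curve(list_a, list_b, ks):
    # count of shared entries between the two top-k lists for each k
    return [(k, len(set(list_a[:k]) & set(list_b[:k]))) for k in ks]

# toy ranked lists standing in for the TF-KAI and SF-SAI rankings
tf_kai_top = ["gis", "flood_risk", "landslid", "vulner", "social_media"]
sf_sai_top = ["gis", "landslid", "claim_payout", "flood_risk", "land_us"]
print(overlap_curve(tf_kai_top, sf_sai_top, ks=[2, 4, 5]))
```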

Fig. 9

The overlapping units of SF-SAI compared with TF-KAI, ranging from the top ten to the top 99

As in the domain keyword analysis, no more than 100 keywords were selected. Comparing the TF-KAI and SF-SAI methods, we find that when fewer than 50 units are selected, the overlap percentage of TF-KAI and SF-SAI keywords is relatively low. From 50 to 99 units, the overlap percentage increases at a steady rate. When 99 units are selected, 61 units overlap and the overlap percentage reaches 0.616; SF-SAI and TF-KAI are highly correlated.

To eliminate possible bias from selecting only one sample of the top 99 keywords, we conducted additional blind tests with different sample sizes, the top 20, top 40, top 60, and top 80 keywords as identified by the SF-SAI and TF-KAI methods, to see whether SF-SAI is also more effective than TF-KAI at different sample sizes. Table 7 shows the experimental results with different sample sizes.

Table 7 Results of the blind test with different sample sizes

From Table 7, we can tell that the sampling size affects the evaluation of TF-KAI and SF-SAI, but only slightly. When the sample size is smaller, the average ratio of SF-SAI keywords considered representative by our experts is lower. The TF-KAI results, however, are affected by the sampling size differently: when the sample size was changed from 40 to 60 keywords, the identification ratio decreased to 0.038 because the number of overlapping keywords increased. Overall, however, we can still conclude that SF-SAI gives more complete results.

Discussion

The advantages of semantic-based over frequency-based methods

The advantage of the semantics-based methods can be summarized as improved accuracy: more of the keywords identified by the semantic methods are regarded as representative of the domain knowledge. Words of the same or similar meaning but in different forms (synonyms) are merged into the same semantic units. TF-KAI is regarded as efficient at finding domain keywords; 61 semantic units in the SF-SAI results share 66 overlapping keywords out of the total 99 TF-KAI keywords, and the remaining 39 unique semantic units of the SF-SAI results are demonstrated to represent the domain knowledge more efficiently, which indicates that SF-SAI achieves similar or better performance than the TF-KAI method.

The reason for the better performance of SF-SAI is that we take word context into consideration by introducing the Word2Vec model and properly setting the similarity threshold. Synonyms among the keywords are merged into semantic units by relatedness and similarity. Note that the results of the SF-SAI method are sensitive to the similarity threshold, which can be decided by experts according to the analysis purpose. In our paper, we obtained the semantic units by setting the similarity threshold and merging condition to a strict value of “> 0.97”. In the extreme case, SF-SAI equals the TF-KAI method when the similarity threshold is set to “= 1”. If experts wish to understand more general domain concepts, they can lower the similarity threshold to generate semantic units at different levels of detail.

The effectiveness of the proposed methods is also related to the special organization of the data source. The traditional data source processed with TF-IDF is a full-text dataset. In full-text datasets, different documents share the same high-frequency words such as “the” or “a”; therefore, the TF-IDF method can efficiently remove such meaningless words. In our paper and in Chen and Xiao (2016), however, the corpus is built from keyword collections. The keywords stand for domain knowledge, are often rarely used in other scientific domains, and frequently take multi-word forms, all of which decrease the rate at which a word appears in both the domain and the background corpus. Thus, it is not easy to find the same or similar words in the background documents, and the discriminating factor of IDF is ineffective. The SAI method improves the situation by enhancing the discrimination between the domain and the background.

In addition to more effective keyword selection, SF-SAI also provides additional similar words within the semantic units, which can be helpful for domain analysis. Keywords are abstractions of the corresponding knowledge; a single keyword can be hard to interpret, because the related information is not clear. The final purpose of domain keyword extraction is domain knowledge analysis, and if the relationships between keywords are not clear, interpretation of the domain knowledge may not be effective enough. For example, the keywords “spatial_heterogen” and “spatial_homogen” are not strict synonyms but are closely related to each other, and both are important concepts in spatial statistical analysis. Such relatedness information cannot be provided by frequency-based methods like TF-KAI. Therefore, the semantics-based SF-SAI method supports more effective interpretation than a purely frequency-based method.

The limitations of the proposed semantics-based methods

The semantics-based methods have limitations stemming from the vagueness of semantics and the corpus size. Some semantic units in our methods contain keywords that might not have exactly the same meanings. As the semantic units are generated by the similarity threshold, their granularity is hard to fix. Though many of the semantic units are not strict synonyms, they do have relatively high relatedness, and as the experimental results show, efficiency can still be assured in the domain knowledge analysis. Therefore, one of the limitations stems from the vagueness of semantics and the similarity-threshold approach.

In addition, the number of selected keywords can also affect the analysis results. In our paper, we found that different numbers of selected keywords result in different overlap rates between the SF-SAI and TF-KAI results. Thus, the ranking is an important factor in domain keyword analysis: words ranked below the cutoff (such as the top 100, top 40, or top 20 keywords) are ignored in the analysis. Therefore, future work could incorporate the ranking into evaluating the performance of the extraction methods.

Conclusions

In this paper, we propose a new method that introduces a word-embedding model for domain keyword extraction, extending the existing TF, TF-IDF, and TF-KAI domain-specific keyword methods with semantic measurements into SF, SF-SIDF, and SF-SAI. A case study using a dataset derived from the geographic natural hazard literature demonstrates that the proposed methods improve the quality of the keyword extraction results. We compared the TF-KAI and SF-SAI results, finding that the SF-SAI results better represent the domain knowledge and can extract domain-specific keywords more effectively.

Domain-specific knowledge is a relative concept and is present in every domain. In our experiments, the extracted domain keywords from the geographic natural hazard field are a mixture of terms from many different fields, including computer science, geographical information science, geophysics, and cartography. These keywords are also often associated with governmental tasks, including urban planning, disaster response, and other public-interest issues, which require varied expert knowledge. Therefore, these keywords relate to diverse sets of background information, especially when experts from different backgrounds evaluate them. Understanding background and domain-specific characteristics increases understanding of domain development and can help researchers identify potential focuses and directions in their work. Therefore, the proposed SF-SAI method can support a wide range of disciplinary or topical analyses.

The improvements we introduce to existing methods tackle the semantic vagueness in the natural language of the literature. The limitations also result partly from the vagueness of semantic meanings. In this paper, the similarity threshold is set empirically and thus lacks a quantitative criterion. Therefore, our future work will consider ways to set the similarity threshold automatically and build an evaluation approach to determine whether the semantic unit granularity is suitable for various analytical demands.