
1 Introduction

Text mining aims to extract high-quality information from large volumes of text. However, the sheer volume of text data and the difficulty of capturing the valuable knowledge hidden in it pose great challenges for many text mining tasks. Efficient, high-quality text mining algorithms are therefore in demand, and effective document representation and accurate estimation of semantic relatedness become increasingly crucial. Traditional approaches to document representation are mostly based on the Vector Space Model (VSM) or the Bag of Words (BOW) model, which treats a document as an unordered collection of words and records only document-level statistical information (e.g., document frequency, inverse document frequency). Because it does not capture the semantics of texts, the VSM shows inherent limitations for certain tasks, especially fine-grained information discovery applications such as mining relationships between concepts: it computes relatedness between words based only on statistical information collected from the documents themselves. This leads to great semantic loss, because terms that do not literally appear in the text cannot be taken into consideration.

Our previous work [1] introduced a special case of text mining focusing on detecting semantic relationships between two concepts across documents, which we refer to as Concept Chain Queries (CCQ). A concept chain query involving concepts A and B has the following meaning: find the most plausible relationship between concept A and concept B, assuming that one or more instances of both concepts occur in the corpus, but not necessarily in the same document. For example, both may be football lovers, but mentioned in different documents. However, the previous solution was built on the VSM assumption over the document collection alone, which limited the scope of the discovered results. For instance, “Albert Gore” is closely related to “George W. Bush”, since the two men together produced the most controversial presidential election in 2000, the only time in American history that the Supreme Court has determined the outcome of a presidential election. However, “Albert Gore” cannot be identified as a concept relevant to “George W. Bush” if it does not occur in the document collection over which the concept chain queries are performed. Furthermore, the semantic relatedness between concepts computed in [1] is measured solely by statistical information gathered from the corpus, such as term frequency (TF) and inverse document frequency (IDF), with no background knowledge incorporated.

In this work, we present a new approach that addresses the above problems by utilizing background knowledge to provide a better semantic representation of any text and a more appropriate estimation of semantic relatedness between concepts. This is accomplished by leveraging Wikipedia, currently the world’s largest human-built encyclopedia. Specifically, in addition to inspecting the given documents, we sift through the articles and anchor texts of Wikipedia, attempting to integrate relevant background knowledge for the topics being searched. Our algorithm is motivated by the Explicit Semantic Analysis (ESA) technique [3], which maps a given text or concept to a conceptual vector space spanned by all Wikipedia articles, so that rich background knowledge can be integrated into the semantic representation of that text or concept. Here we adapt and improve the ESA model in two dimensions. First, we attempt to identify only the most relevant concepts generated by ESA for topic representation and relatedness computation, by introducing a series of heuristic steps for noise reduction. Second, we go one step further and take into account anchor texts inside relevant Wikipedia articles. This is based on the observation that the anchor texts within an article are usually highly relevant to it; therefore, if an article is identified as relevant to our search topic, it is highly likely that its anchors are topic-relevant as well. To validate the proposed techniques, a significant number of queries covering different scenarios were conducted. The results show that, by incorporating Wikipedia knowledge, the concepts most relevant to the given topics are ranked in the top positions.

Our contributions can be summarized as follows: (1) a new Wiki-enabled cross-document knowledge discovery framework has been proposed and implemented, which effectively complements the existing information contained in the document collection and provides a more comprehensive knowledge representation and mining framework supporting various query scenarios; (2) effective noise filtering is provided through a series of heuristic strategies for noise reduction, which further increases the reliability of the overall knowledge encoded; (3) to the best of our knowledge, little work has considered ESA as an effective aid in cross-document knowledge discovery; in this work, built on the traditional BOW representation for corpus content analysis, the ESA technique is integrated into the discovery process, and a better estimation of semantic relatedness is provided by combining various sources of evidence from Wikipedia such as article content and anchor texts (we envision this integration would also benefit related tasks such as question answering and cross-document summarization); (4) the proposed approach presents a new perspective on alleviating the semantic loss caused by using only the Vector Space Model (VSM) at the corpus level, by incorporating relevant background knowledge from Wikipedia; (5) in addition to uncovering “what relationships might exist between two topics of interest”, our method explores another dimension of the analysis by generating evidence trails from Wikipedia to interpret the nature of the potential concept relationships.

The remainder of this paper is organized as follows: Sect. 2 describes related work. Section 3 introduces concept chain queries. In Sect. 4, we present our proposed method of utilizing Wikipedia knowledge to answer concept chain queries. Experimental results are presented and analyzed in Sect. 5, followed by the conclusion and future work.

2 Related Work

Mining semantic relationships/associations between concepts from text is important for inferring new knowledge and detecting new trends. Building on the discovery framework established by Swanson and Smalheiser [4], Srinivasan proposed the open and closed text mining algorithms [2] to automatically discover interesting concepts from MEDLINE. There has also been work on discovering connections between concepts across documents using social network graphs, where nodes represent documents and links represent connections (typically URL links) between documents; however, much of the work on social network analysis has focused on special problems, such as detecting communities [7, 11]. Our previous work [1] introduced Concept Chain Queries (CCQ), a special case of text mining focusing on detecting cross-document links between concepts in document collections, motivated by Srinivasan’s closed text mining algorithm [2]. Specifically, the proposed solution generated concept chains based on the Bag of Words (BOW) representation of the text corpus and extended the technique in [2] by considering multiple levels of interesting concepts instead of just one level as in the original method. Each document in [1] was represented as a vector containing all the words appearing in the relevant text snippets in the corpus, without taking any auxiliary knowledge into consideration. In this new solution, in addition to corpus-level content analysis, we further examine the potential of integrating the Explicit Semantic Analysis (ESA) technique [3], which effectively incorporates more comprehensive knowledge from Wikipedia, to better serve this task. There have been many earlier efforts, as discussed in [31], to add semantics to traditional VSM-based text processing. Deerwester [32] introduced Latent Semantic Indexing (LSI) for the automatic identification of concepts using singular value decomposition. However, it has been found that LSI can rarely improve the strong baseline established by SVM [5, 35, 36]. This is part of our motivation for integrating ESA in this work.

WordNet, a lexical database for the English language [18], has been widely used to overcome the limitations of the VSM in text retrieval [19], document clustering [20, 21] and document categorization [22, 23]. For example, Hotho et al. [6] utilized WordNet to improve the VSM text representation, and Scott et al. [9] proposed a new representation of text based on WordNet hypernyms. These WordNet-based approaches were shown to alleviate the problems of the BOW model, but they suffer from relatively limited coverage compared to Wikipedia, the world’s largest knowledge base to date. Gurevych et al. [24] used Wikipedia to integrate semantic relatedness into the information retrieval process, and Müller et al. [25] used Wikipedia in domain-specific information retrieval. Gabrilovich et al. [5] applied machine learning techniques to Wikipedia and proposed a new method to enrich document representation from this huge knowledge repository. Specifically, they built a feature generator to identify the most relevant Wikipedia articles for each document, and then used the concepts corresponding to these articles to create new features. As claimed in [5], one advantage of using Wikipedia over the Open Directory Project (ODP) is that Wikipedia articles are much cleaner than typical Web pages and mostly qualify as standard written English. However, without proper feature selection strategies, the feature generator still introduces a large number of noise concepts. Another concern is the challenge of efficiently processing large-scale data. Bonifati and Cuzzocrea [28] presented a novel technique to fragment large XML documents using structural constraints such as size, tree-width, and tree-depth. Cuzzocrea et al. [29] used the K-means clustering algorithm to fragment very large XML data warehouses at scale. Cuzzocrea and Bertino [30] proposed a framework for efficiently processing distributed collections of XML documents. In this work, we import the XML dump of Wikipedia into a relational database and build multi-level indices on the database to support efficient queries against the Wiki data.

Serving as an integral part of information retrieval and natural language processing, semantic similarity estimation between words has gained increasing attention over the past years. Various web resources have been considered for this purpose [14–16]. Rinaldi [34] proposed a metric to compute the semantic relatedness between words based on a semantic network built from ontological information. Bollegala [17] developed an automatic method for semantic similarity calculation using page counts and text snippets returned by a Web search engine. Gabrilovich et al. [3] presented Explicit Semantic Analysis (ESA), a novel method for fine-grained semantic representation of unrestricted natural language texts. Using this approach, the meaning of any text can be represented as a weighted vector of Wikipedia-based concepts (articles), called an interpretation vector [3]. Gabrilovich et al. [3] also discussed the problem that the vector may contain noise concepts, especially for text fragments containing multi-word phrases (e.g., multi-word names like George Bush). Our proposed solution is motivated by this work; to tackle the above problems, we have developed a sequence of heuristic strategies to filter out irrelevant concepts and clean the vector. Another interesting work is an application of ESA in a cross-lingual information retrieval setting to allow retrieval across languages [8]. In that effort the authors performed article selection to filter out irrelevant Wikipedia articles (concepts); however, we observed that the selection process resulted in the loss of many dimensions in the subsequent mapping process, whereas in our proposed approach article selection is postponed until the two semantic profiles have been merged, so that semantic loss can be reduced to a minimum. Furthermore, in comparison to [13, 27], we also tap into another valuable information resource, the Wikipedia anchor texts, along with articles to provide a better estimation of semantic relatedness.

3 Concept Chain Queries

As described earlier, a concept chain query (CCQ) attempts to detect links between two concepts (e.g., two person names) across documents. A concept chain query involving concept A and concept B intends to find the best path linking concept A to concept B. The paths found stand for potential conceptual connections between them. Figure 1 gives an example of a CCQ, where the query pair is “Nashiri :: Nairobi attack”. Since “Nashiri” co-occurs with “Jihad Mohammad Ali al Makki” in the same sentence in Document 1, and “Nairobi attack” co-occurs with “Jihad Mohammad Ali al Makki” in the same sentence in Document 2, “Nashiri” and “Nairobi attack” can be linked through the concept “Jihad Mohammad Ali al Makki”.

Fig. 1. A concept chain example for the query “Nashiri :: Nairobi attack”

3.1 Semantic Profile for Topic Representation

A semantic profile is essentially a set of concepts that together represent the corresponding topic. To further differentiate between the concepts, semantic types (ontological information) are employed in profile generation. The concept mapping process is basically a two-step task: (1) we extract concepts from the document collection using Semantex [10]; (2) the extracted concepts are mapped to the counterterrorism domain ontology [1]. Table 1 illustrates part of the semantic type–concept mappings.

Table 1. Semantic type - concept mapping

Thus each profile is defined as a vector composed of a number of semantic types.

$$ profile(T) = \{ ST_1, ST_2, \ldots, ST_n \} $$
(1)

where $ST_i$ represents a semantic type to which concepts appearing in the topic-related text snippets belong. We use the sentence as the window size to measure the relevance of co-occurring concepts to the topic term. Under this representation, each semantic type is in turn represented as an additional-level vector composed of a number of terms that belong to this semantic type.

$$ ST_i = \{ w_{i,1} m_1, w_{i,2} m_2, \ldots, w_{i,n} m_n \} $$
(2)

where $m_j$ represents a concept belonging to semantic type $ST_i$, and $w_{i,j}$ represents its weight in the context of $ST_i$ and sentence-level closeness. When generating the profile, we replace each semantic type in (1) with (2).

In (2), to compute the weight of each concept, we employ a variation of TF*IDF weighting scheme and then normalize the weights:

$$ w_{i,j} = s_{i,j} / highest(s_{i,l}) $$
(3)

where $l = 1, 2, \ldots, r$ and there are $r$ concepts in total for $ST_i$; $s_{i,j} = df_{i,j} \cdot \log(N/df_j)$, where $N$ is the number of sentences in the collection, $df_j$ is the number of sentences in which concept $m_j$ occurs, and $df_{i,j}$ is the number of sentences in which topic $T$ and concept $m_j$ co-occur, with $m_j$ belonging to semantic type $ST_i$. Using the above three formulae, we can build the profile representing any given topic.
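To make the profile construction concrete, the Python sketch below implements Eqs. (1)–(3) under simplifying assumptions: sentences are given as sets of already-extracted concepts, and the concept-to-semantic-type mapping is a plain dictionary (both are illustrative stand-ins for the Semantex/ontology pipeline, not the system’s actual implementation).

```python
import math
from collections import defaultdict

def build_profile(topic, sentences, concept_types):
    """Build the semantic profile of `topic` following Eqs. (1)-(3).

    sentences     -- list of sets, each set holding the concepts of one sentence
    concept_types -- dict mapping a concept to its semantic type
    Returns {semantic_type: {concept: normalized_weight}}.
    """
    N = len(sentences)
    df = defaultdict(int)     # df_j: number of sentences containing concept m_j
    co_df = defaultdict(int)  # df_{i,j}: sentences where the topic and m_j co-occur
    for sent in sentences:
        for c in sent:
            df[c] += 1
        if topic in sent:
            for c in sent:
                if c != topic:
                    co_df[c] += 1

    # s_{i,j} = df_{i,j} * log(N / df_j), grouped by semantic type
    scores = defaultdict(dict)
    for c, co in co_df.items():
        st = concept_types.get(c)
        if st is not None and df[c] > 0:
            scores[st][c] = co * math.log(N / df[c])

    # Eq. (3): normalize by the highest score within each semantic type
    profile = {}
    for st, cs in scores.items():
        top = max(cs.values())
        profile[st] = {c: (s / top if top > 0 else s) for c, s in cs.items()}
    return profile
```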

3.2 Concept Chain Generation

We adapt Srinivasan’s closed discovery algorithm [2] to build concept chains for any two given topics. Each generated concept chain reveals a plausible path from concept A to concept C (where A and C are the two given topics of interest). The algorithm for generating concept chains connecting A to C consists of the following three steps (a minimal code sketch follows the list).

1. Conduct independent searches for A and C. Build the A and C profiles. Call these profiles AP and CP respectively.

2. Compute a B profile (BP) composed of the terms in common between AP and CP. The weight of a concept in BP is the sum of its weights in AP and CP. This is the first level of intermediate potential concepts.

3. Expand the concept chain using the created BP profile together with the topics to build additional levels of intermediate concept lists, which (i) connect the topics to each concept in the BP profile at the sentence level within each semantic type, and (ii) normalize and rank them (as detailed in Sect. 3.1).
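A minimal sketch of these steps is given below; it reuses the hypothetical build_profile() helper from Sect. 3.1 and only indicates, in a comment, how further levels would be expanded.

```python
def closed_discovery(topic_a, topic_c, sentences, concept_types):
    """Steps 1-2 of the adapted closed-discovery algorithm: build the A and C
    profiles independently, then intersect them to obtain the BP profile."""
    ap = build_profile(topic_a, sentences, concept_types)
    cp = build_profile(topic_c, sentences, concept_types)

    bp = {}
    for st in ap.keys() & cp.keys():
        shared = ap[st].keys() & cp[st].keys()
        if shared:
            # the weight of a concept in BP is the sum of its weights in AP and CP
            bp[st] = {c: ap[st][c] + cp[st][c] for c in shared}
    return ap, bp, cp

# Step 3 (chain expansion) repeats the same profile construction around each
# retained concept b in BP to connect A -> b and b -> C at the sentence level,
# then normalizes and ranks the resulting concept lists as in Sect. 3.1.
```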

4 Wikipedia as an Information Resource

Wikipedia is currently the largest human-built encyclopedia in the world. It contained over 5,000,000 articles as of April 05, 2011, and is maintained by over 100,000 contributors from all over the world. As of February 2013, there are editions of Wikipedia in 285 languages. Knowledge in Wikipedia ranges from psychology, mathematics and physics to social science and the humanities. To utilize Wikipedia knowledge to complement the existing information contained in the document collection, two important information resources, Wikipedia article content and anchor texts, are considered. Specifically, appropriate content and link analysis is performed on the Wikipedia data, and the mined relevant knowledge is used to further improve our query model and semantic relatedness estimation module.

4.1 Semantic Relatedness Measures

Semantic relatedness indicates the degree to which words are associated via any type of semantic relationship, such as synonymy, meronymy, hyponymy, hypernymy, functional, associative and other types [37]. Measures for computing semantic relatedness between concepts can generally be grouped into four classes [33]: path length based measures, which use the length of the path connecting concepts in a taxonomy to measure their closeness; information content based measures, which rely on the shared information content between concepts; feature based measures, which exploit the common characteristics of concepts; and hybrid measures, which combine the previous three. The similarity measures defined in this work can be viewed as an extension of the information content based measures.

4.2 Article Content Analysis

For content analysis, we adapt the Explicit Semantic Analysis (ESA) technique proposed by Gabrilovich et al. [3] as our underlying content-based measure for identifying Wikipedia articles relevant to the given topics of interest. In ESA, each term (e.g., a topic of interest) is represented by a concept vector containing the concepts (Wikipedia articles) relevant to it along with their association strengths, and each text fragment can likewise be mapped to a weighted vector of Wikipedia concepts called an interpretation vector. Computing the semantic relatedness between any two text fragments can therefore be naturally transformed into computing the Cosine similarity between their interpretation vectors.

Using the ESA method, each article in Wikipedia is treated as a Wikipedia concept (the title of an article is used as a representative concept to represent the article content), and each document is represented by an interpretation vector containing related Wikipedia concepts (articles) with regard to this document. Formally, a document d can be represented as follows:

$$ \phi(d) = \langle as(d,a_1), \ldots, as(d,a_n) \rangle $$
(4)

where $as(d,a_i)$ denotes the association strength between document $d$ and Wikipedia article $a_i$. Suppose $d$ is spanned by all words appearing in it, i.e., $d = \langle w_1, w_2, \ldots, w_j \rangle$; then the association strength $as(d,a_i)$ is computed by the following function:

$$ as(d,a_i) = \sum_{w_j \in d} tf_d(w_j) \cdot tfidf_{a_i}(w_j) $$
(5)

where $tf_d(w_j)$ is the occurrence frequency of word $w_j$ in document $d$, and $tfidf_{a_i}(w_j)$ is the TF*IDF value of word $w_j$ in Wikipedia article $a_i$. As a result, the vector for a document is a list of real values indicating the association strength of the document with respect to Wikipedia articles. By using efficient indexing strategies such as single-pass in-memory indexing, the computational cost of building these vectors can be reduced to within 200–300 ms. In concept chain queries, the topic input is always a single concept (a single term or phrase), so Eq. (5) can be simplified as below, since $tf_d(w_j)$ always equals 1:

$$ as(d,a_i) = \sum_{w_j \in d} tfidf_{a_i}(w_j) $$
(6)
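The sketch below illustrates Eqs. (4)–(6), assuming a precomputed inverted index that maps each word to the Wikipedia articles containing it together with the word’s TF*IDF weight in each article (the index layout and names are illustrative; in our system such indices are built over the imported Wikipedia database).

```python
from collections import Counter, defaultdict

def interpretation_vector(text, inverted_index, top_k=500):
    """Map a text (or a single topic concept) to a weighted vector of
    Wikipedia articles, following Eqs. (4)-(6).

    inverted_index -- assumed precomputed: {word: {article_title: tfidf}}
    Returns the top_k (article, association_strength) pairs.
    """
    tf = Counter(text.lower().split())          # tf_d(w_j); all 1s for a single concept
    scores = defaultdict(float)
    for word, freq in tf.items():
        for article, tfidf in inverted_index.get(word, {}).items():
            scores[article] += freq * tfidf     # Eq. (5); reduces to Eq. (6) when freq == 1
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```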

As discussed above, the original ESA method is prone to introducing noise concepts, especially when dealing with multi-word phrases. For example, when the input is Angelina Jolie, the generated interpretation vector contains a fair number of noise concepts such as Eudocia Angelina, who was the queen consort of Stephen II Nemanjić of Serbia from 1196 to 1198. This Wikipedia concept (article) is selected and ranked high in the interpretation vector because the term Angelina occurs many times in the article “Eudocia Angelina”, but the article is obviously irrelevant to the given topic Angelina Jolie.

In order to make the interpretation vector more precise and relevant to the topic, we have developed a sequence of heuristics to clean the vector. Basically, we use a modified Levenshtein Distance algorithm to measure the relevance of the given topic to each Wikipedia concept generated in the interpretation vector. Instead of using allowable edit operations of a single character to measure the similarity between two strings as in the original Levenshtein Distance algorithm, we view a single word as a unit for edit operations, and thus the adapted algorithm can be used to compute the similarity between any two text snippets. The heuristic steps used to remove noise concepts are illustrated in Fig. 2.

Fig. 2. The interpretation vector cleaning procedure
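As an illustration of the word-level edit distance underlying the cleaning heuristics, a minimal sketch follows; whitespace tokenization is an assumption, and the full Fig. 2 procedure applies further filtering on top of this distance.

```python
def word_level_levenshtein(a, b):
    """Levenshtein distance with whole words (not characters) as the edit unit."""
    x, y = a.lower().split(), b.lower().split()
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a word
                          d[i][j - 1] + 1,         # insert a word
                          d[i - 1][j - 1] + cost)  # substitute a word
    return d[m][n]

# e.g. word_level_levenshtein("Angelina Jolie", "Eudocia Angelina") == 2, while
# word_level_levenshtein("Angelina Jolie", "Angelina Jolie filmography") == 1,
# so article titles far from the topic can be filtered out or down-weighted.
```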

4.3 Article Link Analysis

Anchor texts, another valuable information resource provided by Wikipedia in addition to the textual content of articles, imply rich hidden associations between different Wikipedia concepts. For example, the Wikipedia article on “Osama bin Laden” mentions, as anchor texts, a great number of potential terrorists related to him and terrorism events he was involved in. Therefore, by inspecting the anchor texts in each relevant Wikipedia article, we are able to find a fair number of interesting concepts related to the topic. Figure 3 shows part of the anchors in the article “Osama bin Laden”.

Fig. 3. Wikipedia anchors related to “Osama bin Laden”

We assume that two concepts (articles) sharing similar anchors may be closer to each other in terms of semantic relatedness. As discussed earlier, given a topic of interest, we can represent it as an interpretation vector containing the relevant Wikipedia articles using the ESA method. Each Wikipedia article can in turn be represented by the anchors appearing in it. Therefore, we can build an additional vector, called the anchor vector, on top of the interpretation vector produced for a given search topic. Similarly, we can approach the semantic relatedness between two topics from another perspective by calculating the Cosine score of the two anchor vectors built for them.

Formally, suppose the interpretation vector for a topic $T_i$ is $V_i = \langle article_1, article_2, \ldots, article_m \rangle$, where each element of $V_i$ represents a Wikipedia article relevant to $T_i$; then the topic $T_i$ can be further represented as an Anchor Vector (AV) as follows.

$$ AV(T_i) = \langle \langle w_{i,1,1}\,anchor_{1,1}, w_{i,2,1}\,anchor_{2,1}, \ldots \rangle, \ldots, \langle w_{i,1,m}\,anchor_{1,m}, w_{i,2,m}\,anchor_{2,m}, \ldots \rangle \rangle $$
(7)

where $anchor_{x,y}$ represents the anchor text $anchor_x$ appearing in article $y$ of $V_i$, and $w_{i,x,y}$ is the weight of $anchor_{x,y}$. To calculate $w_{i,x,y}$, we count the number of sub-vectors within $AV(T_i)$ in which $anchor_{x,y}$ appears, and then normalize it:

$$ w_{i,x,y} = \frac{w_{i,x,y}}{highest(w_{i,d,y})} $$
(8)

where $d = 1, 2, \ldots, r$ and there are $r$ anchors in Wikipedia in total. Therefore, the semantic relatedness between two topics of interest can be estimated as follows:

$$ Sim(T_i, T_j) = Cosine(AV(T_i), AV(T_j)) $$
(9)
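A sketch of Eqs. (7)–(9) in Python follows; article_anchors is an assumed precomputed mapping from article titles to the anchor texts they contain, and the interpretation vector is taken as the list of (article, score) pairs produced by ESA.

```python
import math
from collections import Counter

def anchor_vector(interpretation_vec, article_anchors):
    """Eqs. (7)-(8): collect the anchors of the articles in an interpretation
    vector and weight each anchor by the normalized number of sub-vectors
    (articles) it appears in."""
    counts = Counter()
    for article, _score in interpretation_vec:
        counts.update(article_anchors.get(article, set()))
    top = max(counts.values()) if counts else 1
    return {anchor: c / top for anchor, c in counts.items()}

def cosine(u, v):
    """Eq. (9): Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```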

4.4 Integrating Wikipedia Knowledge into Concept Chain Queries

Given the advantages of using Wikipedia as an effective information aid for semantic representation, we integrate the knowledge derived from Wikipedia into our concept chain queries. Specifically, we build interpretation vectors (using our adapted ESA method) and anchor vectors (using the method described in Sect. 4.3) for the two given topics and for each intermediate concept in the merged BP profile, and then compute the Cosine similarities between the topics and each concept in the BP profile using the corresponding interpretation vectors and anchor vectors, respectively. The final ranking is an integrated scheme considering the following three types of similarity.

Corpus-level TF*IDF-based Similarity.

As the most widely used document representation, the BOW representation has demonstrated its advantages. It is simple to compute and sticks strictly to the terms occurring in the document, thereby preventing outside noise concepts that do not appear in the document from flowing into the feature space of the representation. Given these benefits, a variation of the TF*IDF weighting scheme under the BOW representation is incorporated into our final ranking to capture corpus-level statistical information. We call this the TF*IDF-based similarity.

ESA-based Similarity.

Unlike the BOW model, ESA makes use of knowledge outside the documents themselves to compute semantic relatedness, which compensates well for the semantic loss resulting from the BOW technique. The relatedness between two concepts in ESA is computed using their corresponding interpretation vectors containing related concepts derived from Wikipedia. In the context of concept chain queries, we compute the Cosine similarity between the interpretation vectors of topic A and each concept $V_i$ in the intermediate BP profile, as well as between topic C and each concept $V_i$, and take the average of the two Cosine similarities as the overall similarity for each concept $V_i$ in BP. We call this the ESA-based similarity.

Anchor-based Similarity.

Anchor texts serve as another important information aid in our algorithms, providing highly relevant concepts for the given topics by exploiting the descriptive and contextual information of relevant Wikipedia articles. As with the ESA-based similarity computed for topic A (or C) and each concept $V_i$ in the intermediate BP profile using the interpretation vectors, here anchor vectors are used to measure concept closeness. We refer to this type of similarity as the Anchor-based similarity.

Integrating TF*IDF-based Similarity, ESA-based Similarity and Anchor-based Similarity into the Final Ranking.

The TF*IDF-based similarity, ESA-based similarity and Anchor-based similarity are finally combined to form a final ranking for concepts generated in the intermediate profiles:

$$ S_{overall} = (1 - \lambda_{1} - \lambda_{2} )S_{TFIDF} + \lambda_{1} S_{ESA} + \lambda_{2} S_{anchor} $$
(10)

where $\lambda_1$ and $\lambda_2$ are two tuning parameters that can be adjusted according to the preference among the three similarity schemes in the experiments. $S_{TFIDF}$ refers to the TF*IDF-based similarity, $S_{ESA}$ to the ESA-based similarity, and $S_{anchor}$ to the Anchor-based similarity.
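The combination in Eq. (10) amounts to the following; the default λ values mirror the best-performing setting reported in Sect. 5.3 and are illustrative, not fixed.

```python
def overall_similarity(s_tfidf, s_esa, s_anchor, lam1=0.4, lam2=0.3):
    """Eq. (10): weighted combination of the three similarity components."""
    return (1 - lam1 - lam2) * s_tfidf + lam1 * s_esa + lam2 * s_anchor

def rank_bp_concepts(bp_scores, lam1=0.4, lam2=0.3):
    """Rank BP concepts by the combined score.

    bp_scores -- {concept: (s_tfidf, s_esa, s_anchor)} for one semantic type
    """
    return sorted(bp_scores.items(),
                  key=lambda kv: overall_similarity(*kv[1], lam1, lam2),
                  reverse=True)
```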

4.5 Annotating Semantic Relationships Between Concepts

In addition to answering “what relationships might exist between two topics?”, we go one step further to collect relevant text snippets extracted from multiple Wikipedia articles in which the discovered chains appear. This is in fact a multi-document summary that explains the plausible relationship between topics with intensive knowledge derived from Wikipedia. For example, given the query pair “Bin Laden” and “Abdel-Rahman”, one of the discovered concept chains is Bin Laden → Azzam → Abdel-Rahman. Our goal now is to find supporting evidence that interprets how “Bin Laden” is linked to “Abdel-Rahman” through “Azzam” in the space of Wikipedia. We consider this the chain-focused sentence retrieval problem and decompose it into the following two subtasks.

Chain-Relevant Article Retrieval.

This subtask takes a generated concept chain as input and attempts to find relevant Wikipedia articles for it. One important criterion is that the identified Wikipedia articles should be relevant to the whole chain (i.e., relevant to the given topics at the end points of the chain as well as to the intervening concepts), not just to an individual segment of the chain. To achieve this, we (1) build the corresponding interpretation vectors for all of the concepts appearing in the chain; (2) perform noise removal using the cleaning procedure described in Fig. 2; (3) construct a ranked list of Wikipedia articles by intersecting the resulting interpretation vectors, with each article weighted using formula 6; (4) follow similar steps to construct a ranked list of anchors (note that an anchor also represents a Wikipedia article), with each anchor weighted using formula 8. The articles represented by the concepts in the two ranked lists are treated as chain-relevant articles.
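A sketch of this subtask is shown below; build_vector stands for the ESA mapping of Sect. 4.2 and clean_vector for the Fig. 2 noise-removal procedure, both passed in as callables since their concrete implementations are not spelled out here.

```python
def chain_relevant_articles(chain, build_vector, clean_vector=lambda c, v: v):
    """Find Wikipedia articles relevant to a whole concept chain.

    chain        -- list of concepts, e.g. ["Bin Laden", "Azzam", "Abdel-Rahman"]
    build_vector -- callable returning {article_title: weight} for a concept
    clean_vector -- callable implementing the noise-removal heuristics
                    (identity by default; a hypothetical placeholder)
    """
    vectors = [clean_vector(c, dict(build_vector(c))) for c in chain]
    # keep only articles present in every cleaned interpretation vector ...
    shared = set.intersection(*(set(v) for v in vectors))
    # ... and rank them by their summed weights across the chain concepts
    return sorted(shared, key=lambda a: sum(v[a] for v in vectors), reverse=True)
```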

Chain-Focused Sentence Retrieval.

This step inspects the content of each article generated from the previous step and extracts sentences that explain each segment of the chain. For example, the chain Bin Laden → Azzam → Abdel-Rahman is composed of two segments: Bin Laden → Azzam and Azzam → Abdel-Rahman. For the segment Bin Laden → Azzam, sentences where “Bin Laden” and “Azzam” co-occur will be extracted as supporting evidence for this partial chain. Figure 4 shows the generated evidence trail for this example.

Fig. 4. Evidence trail generated from Wikipedia articles for the concept chain Bin Laden → Azzam → Abdel-Rahman
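The sentence-retrieval step can be sketched as follows, assuming the chain-relevant articles have already been retrieved as a {title: plain text} mapping; the regex-based sentence splitting is a deliberate simplification of the system’s sentence segmentation.

```python
import re

def evidence_trail(chain, article_texts):
    """For every adjacent pair on the chain, collect sentences from the
    retrieved articles in which both concepts co-occur.

    article_texts -- {article_title: plain text of the article}
    Returns {(left, right): [(article_title, sentence), ...]}.
    """
    trail = {}
    for left, right in zip(chain, chain[1:]):
        hits = []
        for title, text in article_texts.items():
            for sent in re.split(r"(?<=[.!?])\s+", text):
                if left.lower() in sent.lower() and right.lower() in sent.lower():
                    hits.append((title, sent.strip()))
        trail[(left, right)] = hits
    return trail
```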

4.6 The New Mining Model

To summarize, the new model of answering concept chain queries consists of two sequential steps as shown in Figs. 5 and 6. Figure 5 illustrates the first step which discovers potential relationships between two given topics from the given document collection without background knowledge incorporated. Figure 6 details how Wikipedia knowledge is integrated into this discovery process and facilitates better estimation of semantic relatedness between concepts. Also, we go one step further and require the response to be a set of Wikipedia text snippets (i.e. evidence trail) in which the discovered concept chain occurs. This may assist a user with the second dimension of the analysis process, i.e. when the user has to peruse the documents to figure out the nature of the relationship underlying a suggested chain.

Fig. 5. The new model of answering concept chain queries: component-1

Fig. 6. The new model of answering concept chain queries: component-2

5 Empirical Evaluation

A challenging task for the evaluation was constructing an evaluation data set, since there are no standard data sets available for quantitatively evaluating concept chains. We performed our evaluation using the 9/11 counterterrorism corpus. The Wikipedia snapshot used in the experiments was dumped on April 05, 2011.

5.1 Processing Wikipedia Dumps

As an open source project, the entire content of Wikipedia is easily obtainable. All the information in Wikipedia is available in the form of database dumps that are released periodically, from several days to several weeks apart. The version used in this work was released on April 05, 2011; it is split into 15 compressed XML files occupying a total of 29.5 GB after decompression, and contains articles, templates, image descriptions, and primary meta-pages. We used MWDumper [12] to import the XML dumps into our MediaWiki database, and after the parsing process we identified 5,553,542 articles.
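The import in this work relied on MWDumper and a MediaWiki/MySQL backend; as a lighter-weight illustration only, a pages/articles dump can also be streamed with Python’s standard library (the helper below is a sketch and assumes a single bzip2-compressed XML file).

```python
import bz2
import xml.etree.ElementTree as ET

def iter_pages(dump_path):
    """Yield (title, wikitext) pairs from a Wikipedia XML dump without loading
    the whole file into memory."""
    with bz2.open(dump_path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag.rsplit("}", 1)[-1] == "page":    # strip the XML namespace
                title, text = None, ""
                for child in elem.iter():
                    tag = child.tag.rsplit("}", 1)[-1]
                    if tag == "title":
                        title = child.text
                    elif tag == "text":
                        text = child.text or ""
                if title is not None:
                    yield title, text
                elem.clear()                              # free the processed subtree
```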

5.2 Evaluation Data

We performed concept chain queries on the 9/11 counterterrorism corpus. This involves processing a large open source document collection pertaining to the 9/11 attack, including the publicly available 9/11 Commission Report. The report consists of the Executive Summary, the Preface, 13 chapters, the Appendix and the Notes. Each of these was considered a separate document, resulting in 337 documents. The whole collection was processed using Semantex [10], and concepts were extracted and selected as shown in Table 1. Query pairs covering various scenarios (e.g., ranging from popular entities to rare entities) were selected by the assessors and used as our evaluation data. We selected chains of lengths ranging from 1 to 4 in terms of the number of associations. The chains were selected following the same procedure as in [26], described as follows:

1. We ran queries with various pairs of topics: in the counterterrorism corpus, the topics were mostly named entities.

2. For each topic pair, the relevant paragraphs for either topic were then manually inspected: we selected those where there was a logical connection between the two topics.

3. After achieving agreement among all annotators, we then generated the concept chains for these topic pairs (and paragraphs) as evaluation data.

The above process generated 37 chains in the 9/11 corpus, which are used as truth chains in the subsequent experiments.

5.3 Experimental Results

Parameter Settings.

As mentioned in Sect. 4.4, a combination of the TF*IDF-based similarity, the ESA-based similarity and the Anchor-based similarity is used to rank the links detected by our system. $\lambda_1$ and $\lambda_2$ in Eq. (10) are two parameters that need to be tuned so that the generated similarity between two concepts best matches the judgements of our assessors. To accomplish this, we first built a set of training data composed of 10 query pairs randomly selected from the evaluation set, and then generated BP profiles for each of them using our proposed method. In each BP profile, we selected the top 5 concepts (links) within each semantic type and compared their rankings with the assessors’ judgements. The values of $\lambda_1$ and $\lambda_2$ were tuned in the range [0.1, 1]. Specifically, we set $\lambda_2 = 0$ or $\lambda_1 = 0$ to evaluate the contribution of each individual part (the ESA-based similarity or the Anchor-based similarity) in the final weighting scheme. When $\lambda_1 \neq 0$ and $\lambda_2 \neq 0$, the best performance was obtained with $\lambda_1 = 0.4$ and $\lambda_2 = 0.3$. These settings were also used in our later experiments.

Query Results.

Before proceeding to the evaluation of the proposed model, we first conducted an experiment to demonstrate the improved performance of our adapted ESA method over the original ESA. We selected 10 concepts that we have good knowledge about, as shown in Table 2, and built the interpretation vectors for each of them using the original ESA and our adapted ESA, respectively. We calculated the averaged precision, defined below, to measure the performance of the two approaches.

Table 2. Ten concepts used for the interpretation vector construction
$$ aveP = \left( \sum_{i=1}^{N} \frac{concepts\;found\;and\;relevant}{total\;concepts\;found} \right) / N $$
(11)

where N is the number of concepts under consideration. The results are illustrated in Fig. 7, where the X-axis indicates the number of concepts kept in each interpretation vector and the Y-axis indicates the averaged precision. Our adapted ESA clearly improves over the original ESA in identifying topic-related Wikipedia concepts.

Fig. 7. The averaged precision of the generated interpretation vectors using the original ESA and the adapted ESA, based on the data in Table 2
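For reference, Eq. (11) corresponds to the following computation, assuming the assessors’ relevance judgements are available as one set of relevant concepts per topic:

```python
def averaged_precision(found_by_topic, relevant_by_topic):
    """Eq. (11): mean, over the N topics, of the fraction of concepts found
    for a topic that were judged relevant."""
    ratios = []
    for topic, found in found_by_topic.items():
        if not found:
            continue
        relevant = relevant_by_topic.get(topic, set())
        ratios.append(sum(1 for c in found if c in relevant) / len(found))
    return sum(ratios) / len(ratios) if ratios else 0.0
```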

Table 3 shows the top 15 concepts generated in the interpretation vectors for 4 sample concepts. For example, for “Lewinsky Scandal”, the top 15 concepts in the interpretation vector built using our adapted ESA include most of the people involved in this event in addition to Clinton and Lewinsky themselves, such as Linda Tripp, who secretly recorded Lewinsky’s confidential phone calls about her relationship with Clinton, and Betty Currie, Clinton’s personal secretary, well known in the scandal for handling gifts given to Lewinsky by Clinton. By contrast, most of the top concepts identified using the original ESA represent irrelevant events.

Table 3. Top 15 concepts in the sample interpretation vectors using the adapted ESA and the original ESA

To further evaluate the performance of the original ESA and the adapted ESA in semantic profile generation, we selected 10 query pairs, as shown in Table 4, and generated semantic profiles serving as linking concepts (i.e., BP profiles) by selecting the common concepts appearing in the two interpretation vectors built for the two given topics. Each concept in the semantic profile was weighted using the original ESA and our adapted ESA, respectively. We again calculated the averaged precision to measure the percentage of relevant concepts in the generated profiles. The results are shown in Fig. 8, where the X-axis indicates the number of concepts kept in each generated semantic profile and the Y-axis indicates the averaged precision. For BP-level semantic profile generation, our adapted ESA again performs much better than the original ESA.

Table 4. Ten query pairs used for the semantic profile generation comparison
Fig. 8. The averaged precision of the intermediate semantic profile (BP profile) generation using the original ESA and the adapted ESA, based on the data in Table 4

Table 5 below shows the top 10 concepts generated in the semantic profiles for two sample query pairs: “George Bush :: Al Gore” and “Michael Jordan :: Charles Barkley”. For the query pair “Michael Jordan :: Charles Barkley”, the top 10 interlinking concepts identified by our adapted ESA include the persons and events most relevant to both of them; noise concepts are successfully removed from the semantic profile, and key interlinking concepts are boosted to higher positions, such as “I_May_Be_Wrong_but_I_Doubt_It”, a memoir in which Charles Barkley recounts some of his memorable experiences, including his involvement with Michael Jordan as a member of the “Dream Team”, and “1993_NBA_Finals”, the championship round of a historic season in which Michael Jordan led the Chicago Bulls against the Phoenix Suns led by Charles Barkley. By contrast, the original ESA failed to rank some very important related concepts highly.

Table 5. Top 10 concepts in the sample semantic profiles generated by the adapted ESA and the original ESA

For concept chain queries, we also evaluated the proposed model on generating chains of various lengths, using the precision ratio defined below.

$$ precision\; = \;\frac{concept\;chains\;found\;and\;correct}{total\;concept\;chains\;found} $$
(12)

Figures 9, 10, 11 and 12 compare the search results of the various models. We implemented a competitive baseline (Srinivasan’s ‘closed’ discovery algorithm) in which only the corpus-level TF*IDF-based statistical information is considered. In the four figures, the X-axis indicates the number of concepts kept in each semantic type in the search results ($S_N$ means the top $N$ are kept) and the Y-axis indicates the precision. The search performance is significantly improved by the integration of Wikipedia knowledge, and the best performance is observed when both the Wiki article content and the anchor texts are involved.

Fig. 9. Search results of chains of length 1

Fig. 10. Search results of chains of length 2

Fig. 11. Search results of chains of length 3

Fig. 12. Search results of chains of length 4

We further used the 37 truth chains described above to measure how well the baseline model and the various Wiki-enabled models detect these chains. In Fig. 13, the X-axis has the same meaning as in Figs. 9, 10, 11 and 12, and the Y-axis denotes the percentage of the 37 truth chains found by the different models. The results also match our expectation: the largest percentage of truth chains is retrieved when both article content and anchor texts from Wikipedia are incorporated into the query process.

Fig. 13. Comparison of search results using the 37 truth chains

Table 6 shows the evidence trails generated for concept chains discovered from Wikipedia. Note that a generated evidence trail is not necessarily drawn from a single Wikipedia article, but can be assembled from knowledge held across articles. For example, for the concept chain Betty Ong → September 11 → Mohamed Atta, supporting sentences were extracted from two different Wikipedia articles, “Flight attendant” and “American Airlines Flight 11”. Our proposed model successfully found “Flight attendant” and “American Airlines Flight 11” as two highly relevant Wikipedia articles with regard to “Betty Ong”, who was a flight attendant on board “American Airlines Flight 11” when it was hijacked and flown into the North Tower of the World Trade Center, and “Mohamed Atta”, who was one of the ringleaders of the “September 11” attacks and crashed “American Airlines Flight 11” into the World Trade Center.

Table 6. Evidence trails generated from Wikipedia

6 Conclusion and Future Work

This paper proposes a new solution for improving cross-document knowledge discovery through concept chain queries, which focus on detecting semantic relationships between concepts across documents. We incorporate relevant Wikipedia knowledge into the search process, which effectively complements the existing knowledge in document collections and further improves search quality and coverage. In addition, a better measure for estimating semantic relatedness between terms is devised by integrating various evidence resources from Wikipedia. Experimental results demonstrate the effectiveness of the proposed approach and show its advantage in alleviating the semantic loss caused by using only the Vector Space Model (VSM) at the corpus level.

Future directions include exploring other potential resources provided by Wikipedia to further improve query processing, such as infobox information, the categories that relevant Wiki articles belong to, and the underlying category hierarchy. These valuable information resources may be combined with our defined semantic types to further contribute to ontology modeling. Since Wikipedia is a cross-language knowledge base, we also plan to explore the utilization of Wikipedia knowledge in a cross-lingual setting to better serve different query purposes.