1 Introduction

The volume of text documents is growing day by day due to the extensive growth of the internet. This motivates the need for efficient machine learning and information retrieval algorithms for text mining. An important step in document classification and clustering is finding the similarity or closeness between documents; hence, document similarity is a useful method in the field of information retrieval. Existing approaches represent documents using feature vectors, and similarity is calculated using TF-IDF (term frequency-inverse document frequency) weights and the vector space model. These approaches do not work well for finding semantic, meaningful similarity between documents, and such similarity measures also suffer from sparsity. Major issues with existing approaches are that dissimilar documents may share the same terms and that the meaning of a term is not considered.

As text data increases day by day, it becomes highly dimensional while still carrying semantic information. It is therefore essential to identify the core semantics, which generally represent the theme of a document.

The core idea of this paper is to use concepts instead of terms to capture the topics of documents by creating a concept-based document representation model. Concepts are units of knowledge [1], each with a unique meaning. There are three major benefits of using concepts in document similarity: concepts are less redundant, ambiguity is accounted for in concepts, and semantic relations between concepts can be used to compute document similarity.

A number of semantic approaches have been proposed to handle the problem of document similarity. In ontology-based approaches, WordNet and Wikipedia are the most commonly used ontologies for finding the semantic relations of words, and redundancy and ambiguity problems are addressed using them. However, several challenges still exist for semantic similarity computation: (i) synonymy and polysemy, (ii) extraction of core semantics from a text document, and (iii) the high dimensionality of terms in text. These challenges have been addressed by various researchers, but existing solutions have weaknesses: the theme of the document is not represented, and concept-representing terms enlarge the word space without improving similarity performance.

This paper focuses on computing the similarity between documents using concepts and their relations.

The contributions of the presented work are as follows:

  (i)

    A modified similarity measure based on WordNet synsets, WordNet glosses and Wikipedia context is proposed. Previous work has shown that the structural information in WordNet can improve similarity measures, but the impact of adding textual information from WordNet synsets and Wikipedia has not yet been extensively researched. In this paper we explore whether combining WordNet synsets, WordNet glosses and Wikipedia context can give a more accurate similarity computation.

  (ii)

    A weighted conceptual graph of co-occurring terms is proposed for representing a text document. Existing document representation models neglect the association of terms with the concepts related to the document. The proposed graph-based document model captures valuable knowledge based on the relationship between terms and concepts. It also captures the semantic relationships between words and more accurately represents the theme of the document.

  (iii)

    We introduce an inverted index mechanism for indexing the association of concepts with terms. Although inverted indexes have been used extensively in information retrieval, their potential for associating concepts with terms in document similarity has not been fully investigated. We have observed that it is a very effective data structure for this application. Concepts are disambiguated using the Wu-Palmer semantic similarity measure, and a concept score is then calculated based on each concept's association with other concepts.

  (iv)

    Experimental results show that the proposed method estimates similarity better than the traditional bag-of-words approach.

The paper is organized as follows. Related work is explained in Sect. 2. The proposed work is described in Sect. 3, including the detailed phases of graph construction, concept extraction, concept disambiguation and concept weighting. Graph similarity is described in Sect. 4, performance analysis is presented in Sect. 5, and the conclusion is stated in Sect. 6.

2 Related work

Document collection is represented as

$$ D = \{ d_{1}, d_{2}, \ldots, d_{m} \} $$
(1)

where m is the number of documents in the collection.

Each document vector is represented as a set of keywords

$$ d_{i} = \{ k_{i1}, k_{i2}, \ldots, k_{in} \} $$
(2)

where n is the number of index terms.

Hence, the document set is defined as

$$ D = \sum_{i = 1}^{m} \sum_{j = 1}^{n} d_{ij} $$
(3)

The basic approaches to document similarity work on features which capture important characteristics of the data.

  1.

    Text-based similarity: Text-based document similarity [2] is measured by comparing the words of two documents as features. The vector space model uses cosine similarity to compute document similarity.

  2.

    Co-occurrence-based similarity: A document similarity method has been proposed in which documents are represented by a set of candidate keywords [3]. Initial similarities are computed between documents and keywords, and the weights of keywords with respect to documents are computed using cosine similarity. This system suffers from semantic ambiguity.

    Traditional algorithms do not consider the semantic relationships among words [4], so they cannot accurately represent the meaning of documents [5].

  3.

    Semantic similarity: When different terms are used to describe the same concept in different documents, text-based retrieval may give inaccurate and incomplete results. Semantic similarity states that two keywords are related because they share some aspects of their meaning.

Semantic similarity is a useful notion, and measuring the semantic similarity between two keywords is a challenging task. It is computed using the following approaches:

  (i)

    Ontology based similarity

Ontologies have been of great interest to the semantic similarity research community because they offer a structured and unambiguous representation of knowledge in the form of concepts connected by semantic relations. These structures help to assess the degree of semantic proximity between terms.

To overcome the problems of redundancy and ambiguity, semantic information from ontologies such as WordNet [6,7,8,9,10] and Wikipedia [11,12,13] has been widely used to improve the quality of similarity.

WordNet is a lexical knowledge base containing information about synonyms, hypernyms and meronyms. It is richer in content than other machine-readable dictionaries and thesauri, is easy to use from external applications, and is freely available.

Wikipedia is a multilingual, web-based, freely available encyclopedia, constructed in a collaborative effort of voluntary contributors. Articles in Wikipedia form a heavily interlinked knowledge base enriched with a category system emerging from collaborative tagging, which constitutes a thesaurus. Wikipedia thus contains a rich body of lexical semantic information.

Ontology-based similarity is carried out using the following approaches:

  1.

    Edge-counting approach: Ontologies can be seen as a directed graph in which concepts are related by taxonomic (is-a) links. A limitation of this approach is that only the minimum path between concepts in the taxonomy is considered.

  2.

    Feature-based measure: Similarity between concepts is assessed as a function of their properties. This approach exploits more semantic knowledge by evaluating the common and differing features of the compared concepts. Its limitation is that the feature weighting parameters need to be tuned for each ontology.

  (ii)

    Corpus based similarity

One way to overcome the weakness of text similarity is to leverage information derived from a large corpus, which is often the Web. Semantic similarity between web documents is calculated on the basis of page counts and snippets retrieved using a web search engine. Lexical patterns are extracted from the text snippets and clustered to identify the different patterns that describe the same semantic relation. This automatic method, based on lexical pattern generation, works on named entities, which is not possible using manually built ontologies, but it depends on search engine performance.

4. Conceptual similarity: Concept-based document similarity judges the similarity of documents based on the concepts they represent [14]. Concept-based similarity is found using manually built thesauri, term co-occurrence data, extraction of latent word relationships, or concepts extracted from the corpus. Concept-based document similarity is characterized using three parameters:

  1.

    Concept representation

  2.

    Mapping method

  3.

    Use in information retrieval

Concept extraction is also used in the generation of n-grams for matching two documents [15] using Wikipedia information.

Concept relations should be used to quantify the relevance of a concept with respect to a term. Tingting et al. [5] proposed a method to find a conceptual representation using WordNet.

However, some important words that are not included in the WordNet lexicon [1] are not considered as concepts for similarity evaluation. Several challenges also remain, such as synonymy and polysemy, high dimensionality, and extracting core semantics from texts. Hence, there is a need to extract concepts using world knowledge and rank them. The number of extracted concepts is large, so the feature space must be reduced. In this paper we quantify the closeness between concepts and integrate it into the document similarity measure, so documents need not share the same words or concepts for their similarity to be judged. There is also a need for a structured representation of texts that combines fine-grained semantic relations. A conceptual graph model is well suited to describing the relationship between terms and concepts, and the edges of such semantic graphs should be weighted so as to capture the degree of association between concepts. Once the graph is constructed, the graph similarity methods proposed by researchers can be applied to compute document similarity, with performance that compares well with state-of-the-art methods.

3 Document similarity

Document similarity measures are important components of many text analysis tasks such as information retrieval, document classification and document clustering. Traditional measures compute the surface word overlap between documents and ignore deeper conceptual connections.

A novel method is proposed which uses an ontology-based approach and integrates WordNet and Wikipedia to measure semantic similarity between documents. Figure 1 illustrates the proposed system architecture.

Fig. 1 System architecture

First, the document is preprocessed using NLP pre-processing steps: stop-word removal, stemming, and POS tagging using the OpenNLP model. We apply the co-reference resolution proposed in [16]; co-reference resolution finds associations of feature terms or mentions in the discourse that refer to the same entity. The noun terms (NN, NNS, NNPS, and NNP) are extracted by the pre-processing model. The term frequency of each noun term is calculated and a score is assigned to each term. The number of times two terms occur together is used to define the edge weight. A graph is then constructed from the weighted terms based on their occurrence in the document. Graph construction uses a coexistence-based representation and is described in Sect. 3.1.

Once the quality terms are extracted, we query ontologies such as Wikipedia and WordNet with the terms, and related concepts are extracted. An inverted index mechanism is used for indexing concepts with terms. Concept extraction is explained in detail in Sect. 3.2.

Not all extracted concepts are useful; hence, to choose the most relevant concepts, the relatedness between concepts is calculated. Semantic similarity is used to disambiguate the concepts, using the Wu-Palmer semantic similarity measure, and concepts that are at least 90% similar are chosen for the given terms. Weights are assigned to the concepts according to their associated terms; concept weighting is described in Sect. 3.3 and concept scoring in Sect. 3.4. Finally, graph similarity is applied to the constructed concept graphs to compute the similarity score, as described in Sect. 4.

3.1 Graph construction

A text document can be represented as a graph in many ways: nodes denote features, and edges represent relationships between nodes [17,18,19,20,21].

For a given document, its graph representation is defined such that each vertex corresponds to a unique term of the document and each edge represents coexistence [22] between terms within the document. The edge weight is the number of times the two terms occur together in the document.

Let \( T = \{ (t_{1},f_{1}), (t_{2},f_{2}), \ldots, (t_{p},f_{p}) \} \), where \( t_{i} \in \{ NN, NNP, NNS, NNPS \} \) and \( f_{i} \) is the term frequency of \( t_{i} \). The graph is \( G = (V,E) \), where \( V = \{ (t_{1},f_{1}), (t_{2},f_{2}), \ldots, (t_{p},f_{p}) \} \) and \( E = \{ (e_{1},w_{1}), \ldots, (e_{r},w_{r}) \} \). Term \( t_{1} \) is a neighbour of \( t_{2} \) if \( t_{1} \) occurs together with \( t_{2} \); the number of times they occur together in the given document is the edge weight. An edge is therefore represented as

$$ e_{1} = \{ (t_{1}, t_{2}), w_{1} \} $$

Example

Doc 1: Information retrieval has changed recently. Data mining is essential for data search. Information mining is part of data mining.

$$ {\text{V}} = \{ ({\text{information}}, 2),({\text{mining}}, 3),({\text{data}}, 3),({\text{retrieval}}, 1),({\text{search}}, 1)\} $$

The graph and adjacency matrix constructed for this example are shown in Fig. 2.

Fig. 2 Text representation using coexistence graph
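As a rough illustration of this construction, the following Python sketch builds the term vertices and co-occurrence edges for the example Doc 1. The hand-written noun list, the simple tokenization, and the use of sentence-level co-occurrence as the edge-weight window are simplifying assumptions, since the paper's pipeline relies on OpenNLP POS tagging and co-reference resolution.

```python
from collections import Counter
from itertools import combinations

def build_coexistence_graph(sentences, noun_terms):
    """Build a weighted coexistence graph: vertices are (term, frequency)
    pairs, edge weights count how often two terms appear in the same
    sentence (the co-occurrence window assumed here)."""
    term_freq = Counter()
    edge_weight = Counter()
    for sentence in sentences:
        tokens = [t.lower().strip(".,") for t in sentence.split()]
        terms = [t for t in tokens if t in noun_terms]
        term_freq.update(terms)
        for t1, t2 in combinations(sorted(set(terms)), 2):
            edge_weight[(t1, t2)] += 1
    return dict(term_freq), dict(edge_weight)

# Doc 1 from the running example; in the paper the noun terms come from
# OpenNLP POS tagging (NN, NNS, NNP, NNPS), here they are listed by hand.
doc1 = [
    "Information retrieval has changed recently.",
    "Data mining is essential for data search.",
    "Information mining is part of data mining.",
]
nouns = {"information", "retrieval", "data", "mining", "search"}
vertices, edges = build_coexistence_graph(doc1, nouns)
print(vertices)  # {'information': 2, 'retrieval': 1, 'data': 3, 'mining': 3, 'search': 1}
print(edges)     # e.g. {('data', 'mining'): 2, ('information', 'mining'): 1, ...}
```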

3.2 Concept extraction

The goal of our system is to provide a meaningful representation of a document; hence, concepts play an important role in representing the document. Terms are taken as input to this module, and WordNet and Wikipedia are queried for related concepts.

WordNet is the largest lexical English database and is used widely across the world. It provides the senses of words and relations between words. However, WordNet alone is insufficient, as it does not cover all words. Wikipedia is currently the largest knowledge repository on the web.

Most researchers have used Wikipedia or WordNet individually as the knowledge base for extracting concepts. In this work, the advantages of both ontologies are considered, and we use both systems to extract appropriate concepts.

Definition 1

(WordNet synset + WordNet gloss + Wikipedia context) Let the WordNet synsets of \( t_{i} \) be \( S_{i} = \{ S_{i1}, S_{i2}, \ldots, S_{il} \} \), where \( l \) is the number of senses of \( t_{i} \) and \( t_{i} \in T \). Each synset in WordNet has a gloss [23] associated with it that contains one or more definitions and some examples; let \( Synset\text{-}Gloss(S_{ij}) \) be the definitions and examples of \( S_{ij} \). Let \( Wikipedia\text{-}context(t_{i}) \) be the union of the concepts occurring in the top paragraph of the Wikipedia article for \( t_{i} \). Then the extracted concept set \( C_{i} \) is defined as

$$ C_{i} = S_{i} \cup Gloss(t_{i}) \cup \left\{ \sum_{i = 1}^{p} \sum_{j = 1}^{l} Synset\text{-}Gloss(S_{ij}) \right\} \cup \left\{ Wikipedia\text{-}context(t_{i}) \right\} $$
(4)

Formula (4) represents concept extraction. Table 1 shows an example of Definition 1.

Table 1 An example of concept description
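A hedged sketch of how Eq. (4) could be realised with NLTK's WordNet interface is shown below. Representing a synset by its first lemma, tokenizing glosses by whitespace, and passing the Wikipedia lead paragraph in as a plain string (rather than fetching it) are all assumptions, since the paper does not specify the exact APIs used.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def extract_concepts(term, wikipedia_context=""):
    """Rough counterpart of Eq. (4): WordNet synsets of the term, words of
    each sense's gloss and examples, plus words from a caller-supplied
    Wikipedia lead paragraph (fetching it is out of scope here)."""
    concepts = set()
    for synset in wn.synsets(term, pos=wn.NOUN):
        concepts.add(synset.name().split(".")[0])      # e.g. 'retrieval'
        gloss = synset.definition() + " " + " ".join(synset.examples())
        concepts.update(w.lower().strip(".,;()") for w in gloss.split())
    concepts.update(w.lower().strip(".,;()") for w in wikipedia_context.split())
    concepts.discard("")
    return concepts

print(sorted(extract_concepts("retrieval"))[:10])
```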

The extracted concepts are added to the graph. An example of a concept graph is shown in Fig. 3. For efficiency and easier access, inverted indexing is used.

Fig. 3 Sample graph representation with concepts

Inverted indexes are the most basic and commonly used data structure in the field of information retrieval. An inverted index is used to avoid a lengthy sequential scan through each term of a document and to improve the run-time performance of retrieval.

An entry for each of the extracted concepts is stored in a structure called the index. For each concept, a pointer references a structure called a posting list. The posting list contains the terms of the document that are related to the concept; in Fig. 4, the posting list contains both the terms and their term frequencies. The concepts are indexed and the terms are appended to the file. The inverted index file provides fast access to the associated terms.

Fig. 4 a Input document. b Extracted concepts for the terms (output of a). c Inverted index constructed for the concepts (output of b)

Figure 4a–c shows the steps in the creation of the inverted index. The concepts are indexed, and the initial weight assigned to each concept is 0. The steps are given in Algorithm 1.

Algorithm 1 (figure a)
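Since Algorithm 1 appears only as a figure, the following minimal sketch reconstructs its indexing step under the assumption that each posting list stores (term, frequency) pairs and that concept weights start at 0, as stated above; the example mapping of terms to concepts is hypothetical.

```python
from collections import defaultdict

def build_inverted_index(term_freq, term_concepts):
    """Map each concept to a posting list of (term, frequency) pairs.
    Concept weights start at 0, as in Algorithm 1."""
    index = defaultdict(lambda: {"weight": 0.0, "postings": []})
    for term, concepts in term_concepts.items():
        for concept in concepts:
            index[concept]["postings"].append((term, term_freq[term]))
    return dict(index)

# hypothetical term-to-concept mapping for three terms of a document
term_freq = {"information": 2, "data": 3, "retrieval": 1}
term_concepts = {
    "information": {"information"},
    "data": {"information", "collection"},
    "retrieval": {"information", "recovery"},
}
index = build_inverted_index(term_freq, term_concepts)
print(index["information"]["postings"])
# [('information', 2), ('data', 3), ('retrieval', 1)]
```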

In order to extract the relevant concepts for a term, the relatedness between concepts is calculated: if the concepts of a given document are highly related to each other, they are more related to the theme of the document. Finding similarity between concepts requires a semantic similarity measure, and such measures are used in many applications in information retrieval research. Semantic similarity can be calculated using information-content-based or structure-based measures; the scores of sample concept pairs under different measures are given in Table 2. The information-content-based approach relies on the probability of occurrence of a word in the synsets in which it appears, and it suffers from the sparse-data problem. The structure-based approach uses the hierarchical structure of WordNet together with path length, combining the length of the path with the depth of the concepts in a weighted and non-linear manner; it does not suffer from the sparse-data problem. Hence, the Wu-Palmer [24] method is used for disambiguating concepts in this paper; it is described in Definition 2. In order to obtain quality concepts, only concepts whose representing vertices are at least 90% similar are retained. An example of the integrated graph of terms and concepts is shown in Fig. 5.

Table 2 Semantic relatedness using different measures (WP: Wu-Palmer, LCH: Leacock and Chodorow, Path: shortest path, Lin: Lin, Res: Resnik, JCN: Jiang and Conrath)
Fig. 5 Concept association scores based on the Wu-Palmer method for the example given in Fig. 2

Definition 2

Wu-Palmer semantic similarity computation: let \( N_{1} \) and \( N_{2} \) be the numbers of IS-A links from \( C_{1} \) and \( C_{2} \), respectively, to their most specific common ancestor concept \( C \), and let \( H \) be the number of IS-A links from \( C \) to the root of the taxonomy. Then

$$ Sim_{Wu\text{-}Palmer}(C_{1}, C_{2}) = \frac{2H}{N_{1} + N_{2} + 2H} $$
(5)
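For WordNet synsets this measure is available directly in NLTK, whose `wup_similarity` implements the same depth-based formula, so a small sketch of the 90% filtering rule mentioned above could look as follows; mapping the extracted concepts onto noun synsets is an assumption here.

```python
from nltk.corpus import wordnet as wn

def wu_palmer(synset1, synset2):
    """Wu-Palmer similarity of Eq. (5); NLTK's wup_similarity implements
    the same depth-based formula over the WordNet taxonomy."""
    return synset1.wup_similarity(synset2)

print(wu_palmer(wn.synset("dog.n.01"), wn.synset("cat.n.01")))  # roughly 0.86

def keep_concept(candidate, other_concepts, threshold=0.9):
    """Retain a candidate synset only if it is at least `threshold`
    similar to some other concept of the document (the 90% rule)."""
    return any(wu_palmer(candidate, other) >= threshold
               for other in other_concepts)
```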

3.3 Concept weight calculation

The disambiguated nouns may increase the dimensionality of the feature space, so we need a way to reduce the dimensionality and keep only the core concepts. One concept may be associated with many terms in a document, and we use this property to assign a weight to each concept: the importance of word senses within a given document should be considered, so weights are assigned to concepts based on their association with terms. The weight calculation is given in Definition 3.

The inverted index file holds the concepts and their associated terms, and the weights associated with the terms are used to derive the weight of a concept. For example, if the terms information, data and retrieval are associated with the concept "information", then information (concept) = 0.8 + 0.3 + 0.54 = 1.64. The example with the inverted index is shown in Fig. 6.

Fig. 6 Inverted index example where information (concept) = 0.8 + 0.3 + 0.54 = 1.64

Definition 3

(Concept weight) Given a document d, let \( \{ (t_{1},f_{1}), \ldots, (t_{p},f_{p}) \} \) be its terms with frequencies and \( \{ (c_{1},w_{1}), \ldots, (c_{q},w_{q}) \} \) be its disambiguated concepts in C. Given the mapping \( f{:}\,T \to C \), for terms \( t_{k} \) and \( t_{m} \) \( (t_{k}, t_{m} \in T) \) that map to \( c_{i} \), the weight of \( c_{i} \) is computed by

$$ w_{i} = f_{k} + f_{m} $$
(6)
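A small sketch of this computation, generalised to any number of terms whose posting list contains the concept (Eq. (6) writes out the two-term case), is:

```python
def concept_weight(postings):
    """Eq. (6): weight of a concept as the sum of the weights of the
    terms in its posting list (two terms in Eq. (6), any number here)."""
    return sum(weight for _term, weight in postings)

postings = [("information", 0.8), ("data", 0.3), ("retrieval", 0.54)]
print(round(concept_weight(postings), 2))  # 1.64, as in Fig. 6
```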

3.4 Concept score calculation

In the previous step we calculated concept weights based on the terms associated with each concept. The concepts are also semantically related to each other, and to obtain quality concepts they need to be ranked based on their association with other concepts: a concept is important if it is linked to other concepts of higher weight. Hence, the concept score is calculated from the weights derived from the terms and from the association scores, as a weighted average over the concept's neighbouring nodes. The formula is given in Definition 4 and an example in Fig. 7.

Fig. 7 Concept score calculation

Definition 4

For a given document d, let \( \{ (t_{1},f_{1}), \ldots, (t_{p},f_{p}) \} \) be the terms with frequencies and \( \{ (c_{1},w_{1}), \ldots, (c_{q},w_{q}) \} \) be the disambiguated concepts of document \( d \). Then

$$ Score(c_{i}) = w_{i} \times \frac{\sum_{j = 1}^{l} w_{j} \times c_{j}}{\sum_{j = 1}^{l} c_{j}} $$
(7)

A large value of \( Score\left( {c_{i} } \right) \) indicates that \( c_{i} \) is a conceptually important concept in a document.
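One reading of Eq. (7) is that \( c_{j} \) in the sums stands for the association score between \( c_{i} \) and its j-th neighbouring concept; under that assumption, a minimal sketch with hypothetical values is:

```python
def concept_score(own_weight, neighbours):
    """Eq. (7): Score(c_i) = w_i * (sum_j w_j * a_j) / (sum_j a_j), where
    (w_j, a_j) are the weight and association score of each neighbouring
    concept (treating c_j in the sums as the association score)."""
    numerator = sum(w_j * a_j for w_j, a_j in neighbours)
    denominator = sum(a_j for _w_j, a_j in neighbours)
    return own_weight * numerator / denominator if denominator else 0.0

# hypothetical concept of weight 1.64 with two weighted neighbours
print(round(concept_score(1.64, [(1.2, 0.9), (0.7, 0.8)]), 3))
```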

4 Graph similarity

Many researchers have worked on computing similarity between graphs that represent documents. Euclidean distance is a commonly used method for finding similarity based on vertex labels. Schumacher et al. [14] proposed graph similarity using graph edit distance and graph isomorphism. Weighted graph similarity for the application of document similarity is a relatively unexplored area. As a weighted graph is used for modeling documents, we explore the use of vertex cosine similarity [25] for finding the similarity between two documents. The formulas are given in Eqs. (8) and (9).

$$ Label\text{-}distance(v_{i}, v_{j}) = \begin{cases} 1, & \text{if } v_{i} = v_{j} \\ 0, & \text{if } v_{i} \ne v_{j} \end{cases} $$
(8)
$$ Vertex\text{-}Similarity(v_{i}, v_{j}) = \frac{\left| common\text{-}neighbours(v_{i}, v_{j}) \right|}{\sqrt{deg(v_{i}) \times deg(v_{j})}} $$
(9)
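A minimal sketch of Eq. (9) over two coexistence graphs stored as adjacency dictionaries is given below; matching vertices purely by their labels, as in Eq. (8), is the simplifying assumption here.

```python
import math

def vertex_similarity(graph1, graph2, v_i, v_j):
    """Eq. (9): common neighbours (matched by label, Eq. (8)) of v_i in
    graph1 and v_j in graph2, normalised by the geometric mean of degrees.
    Graphs are stored as {vertex_label: set(neighbour_labels)}."""
    if v_i not in graph1 or v_j not in graph2:
        return 0.0
    common = graph1[v_i] & graph2[v_j]
    deg_i, deg_j = len(graph1[v_i]), len(graph2[v_j])
    if deg_i == 0 or deg_j == 0:
        return 0.0
    return len(common) / math.sqrt(deg_i * deg_j)

g1 = {"information": {"mining", "retrieval", "data"}, "mining": {"information", "data"}}
g2 = {"information": {"data", "search"}, "data": {"information"}}
print(round(vertex_similarity(g1, g2, "information", "information"), 3))  # 0.408
```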

5 Performance analysis

In this section, we evaluate our system on two different datasets, compare the results with the traditional vector space model, and discuss the results.

5.1 Dataset

Performance comparison is done on two self-generated datasets (one with four categories and the other with 8 categories, with 100 documents in each category) and on the 20 Newsgroups dataset. The 20 Newsgroups dataset is widely used in existing work on document similarity; for this analysis, data from five of its categories is considered. The self-generated datasets are described in Table 3.

Table 3 Description of self-constructed dataset

5.2 Evaluation metrics

In the experimental analysis, similarity performance is evaluated using precision, recall and F-measure; the F-measure combines the information of precision and recall. A confusion matrix of successful and unsuccessful retrievals is constructed and the performance is analyzed.

Cluster quality is evaluated using purity. Purity assigns each cluster to its most frequent actual class and measures the fraction of documents assigned correctly. It is calculated using the following formula

$$ purity(W, C) = \frac{1}{N} \sum_{k} \max_{j} |W_{k} \cap C_{j}| $$
(10)

where \( W = \{ W_{1}, W_{2}, \ldots, W_{K} \} \) is the set of clusters, \( C = \{ C_{1}, C_{2}, \ldots, C_{J} \} \) is the set of classes, and N is the number of documents.
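A small sketch of Eq. (10), assuming the clustering output and the true classes are given as parallel label lists, is:

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Eq. (10): for each cluster take the count of its most frequent
    true class, sum over clusters, divide by the number of documents N."""
    members = {}
    for cluster, true_class in zip(cluster_labels, class_labels):
        members.setdefault(cluster, []).append(true_class)
    majority = sum(Counter(classes).most_common(1)[0][1]
                   for classes in members.values())
    return majority / len(class_labels)

print(purity([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"]))  # 0.8
```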

The correlation between two documents is calculated using the Matthews correlation coefficient (MCC), a metric used for assessing the quality of predicted values against observed values. As a metric for comparing documents, it takes into account both the correctness and the incorrectness (true/false positives/negatives) of the comparison, allowing for a more balanced score among different comparisons.

$$ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$
(11)

where TP = true positives, TN = true negatives, FP = false positives and FN = false negatives.
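Eq. (11) can be computed directly from the confusion-matrix counts; a minimal sketch with hypothetical counts is:

```python
import math

def mcc(tp, tn, fp, fn):
    """Eq. (11): Matthews correlation coefficient from the confusion
    matrix of a document-comparison experiment."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# hypothetical confusion counts for one category
print(round(mcc(tp=40, tn=45, fp=5, fn=10), 3))  # about 0.70
```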

5.3 Results and analysis

Self-constructed dataset 1 consists mostly of semantically related documents, so the terms in the documents are not necessarily similar. Table 4 shows a detailed comparison with the vector space model. The vector space model considers only term similarity, so our system improves on it by taking concepts into consideration. Table 5 describes the performance on self-generated dataset 2, compared over 2 categories, 3 categories and 9 categories; we found the performance to be consistent across all of them. Even though this dataset contains mostly related documents, our system improves performance on all metrics.

Table 4 Similarity performance using self-generated dataset 1
Table 5 Similarity performance using self-generated dataset 2

Figure 8 shows the precision and recall of the concept-based similarity method (CS) in comparison with the vector space model. The precision and recall of the proposed system range from 0.8 to 1.0, while those of traditional TF-IDF range from 0.4 to 0.8.

Fig. 8 Precision and recall of the concept-based similarity method (CS) (on self-generated dataset 1 and the 20 Newsgroups dataset, respectively) in comparison with TF-IDF

Figure 9 shows that the ROC area under the curve for the concept-based method is 1. The performance of these algorithms in different false-positive regions also differs from that of the traditional TF-IDF scheme: at a low false-positive point, i.e. FPR = 0.12, concept-based similarity performs better than term-based similarity. This can be clearly observed from the ROC curves.

Fig. 9 ROC curves of similarity using different methods on the 20 Newsgroups dataset

The similarity of the proposed system is compared with two BOW and two concept-based state-of-the-art algorithms. The analysis, shown in Table 6, indicates that the proposed work achieves high performance on the 20 Newsgroups dataset.

Table 6 Similarity performance on the 20 Newsgroups dataset

The application of document similarity is also evaluated using clustering. Self-generated datasets 1 and 2 are used for document clustering. For self-generated dataset 1, the performance on clusters 1 and 2 is the same, but on clusters 3 and 4 the performance is improved. Figure 10 shows the clustering performance using the purity measure on self-generated dataset 1.

Fig. 10 Clustering performance on the self-generated dataset. PS proposed system, VSM vector space model

Table 3 shows the detailed analysis for self-generated dataset 2. The performance of our system is better than the baseline across all categories of the dataset.

Self-generated dataset 2 is also analyzed to find the correlation between documents using MCC. The analysis indicates an improvement in correlation using concept-based similarity across all categories, as shown in Fig. 11.

Fig. 11 MCC performance on self-generated dataset 2

6 Conclusions

In this study, the important problem of document similarity is addressed, and concept-based similarity using a graph model is proposed. The paper presents a methodology for concept extraction using WordNet synsets, WordNet glosses and Wikipedia context, and represents the structural information using a conceptual graph. An inverted index is constructed to maintain the association of concepts with terms. The proposed approach uses a WordNet-based concept disambiguation method, and a concept score is calculated to reduce the dimensionality of the feature space. The coexistence graph is refined by integrating the concepts.

Simple graph properties, vertex degree and adjacency, are used to calculate vertex similarity and thereby measure the similarity between two graphs.

Four important problems are addressed in this paper: appropriate concept extraction, handling disambiguation between terms, representing the conceptual description using a graph, and similarity computation. Experimental results on three real datasets show that the proposed approach outperforms term-based similarity.

Future work on this topic can use fuzzy graphs that represent both vertex and edge weights. Another interesting direction for further work is the use of a more efficient graph similarity measure.