1 Introduction

Today the world has become web dependent. With the boom of the Internet, the World Wide Web now contains billions of textual documents. Search engines are not yet smart enough to extract accurate knowledge from high-dimensional domains such as text and the web, which creates an urgent need for effective clustering of high-dimensional data.

Many traditional approaches have been proposed and developed to analyze high-dimensional data. Text clustering is one of the best mechanisms for identifying similarity between documents. However, most clustering approaches depend on factors such as term frequency, document frequency, feature selection, and support vector machines (SVMs), and there is still uncertainty when processing highly dimensional data.

This research focuses mainly on improving text categorization over text document clusters. The proposed TRI and SETC boost text categorization by providing semantically enriched document clusters. The primary goal is to measure the most frequent terms occurring in any text document cluster with our proposed metric, the Term Rank Identifier (TRI). For these frequent terms, semantic relations are computed with WordNet tools. The basic idea behind frequent-term selection is to reduce the high dimensionality of the data. The secondary goal is to apply our proposed text clustering algorithm, Semantically Enriched Terms Clustering (SETC), to cluster the documents measured by TRI.

2 Related Work

Major text clustering algorithms fall into two categories: hierarchical and partitioning methods. Agglomerative hierarchical clustering (AHC) algorithms initially treat each document as a cluster, use different kinds of distance functions to compute the similarity between all pairs of clusters, and then merge the closest pair [1]. Partitioning algorithms, on the other hand, initially consider the whole database a single cluster. Based on a heuristic function, they select a cluster to split, and the split step is repeated until the desired number of clusters is obtained. These two categories are compared in [2].

The FTC algorithm introduced in [3] used the frequent word sets shared between documents to measure their closeness in text clustering. The FIHC algorithm proposed in [4] went further in this direction: it measures the cohesiveness of a cluster directly by using frequent word sets, so that documents in the same cluster are expected to share more frequent word sets than documents in different clusters. FIHC uses frequent word sets to construct clusters and organize them into a topic hierarchy. Since frequent word sequences can represent a document well, clustering text documents based on frequent word sequences is meaningful. The idea of using word sequences for text clustering was proposed with STC in [5]; however, STC does not reduce the high dimensionality of the text documents, so its complexity is quite high for large text databases.

The sequential aspect of word occurrences in documents should not be ignored if information retrieval performance is to improve [6]. The authors of [6] proposed using maximal frequent word sequences, i.e., frequent word sequences not contained in any longer frequent word sequence. In view of all the text clustering algorithms discussed above, we propose TRI and SETC.

2.1 Traditional Text Categorization Measures

2.1.1 χ2 Statistics

In text mining for information retrieval, the χ2 statistic is frequently used to measure term frequencies and term-category dependencies. This is done by counting the co-occurrences of terms and categories and listing them in contingency tables (Table 1). Suppose that a corpus contains n labeled documents falling into m categories. After stop-word removal and stemming, the distinct terms are extracted from the corpus.

Table 1 General notation of 2 × 2 contingency table

For the χ2 term-category dependency test, we consider two hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis states that the two variables, term and category, are independent of each other; the alternative hypothesis states that there is some dependency between them.

The general formula to calculate the dependency is

$$ \chi^{2} = \sum_{i=1}^{k} \frac{(O_{i} - E_{i})^{2}}{E_{i}} $$
(1)

where

  • Oi—the observed frequency in the ith cell of the table.

  • Ei—the expected frequency in the ith cell of the table.

The degrees of freedom are (r − 1)(c − 1), where r is the number of rows and c is the number of columns.
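To make the test concrete, the following Python sketch computes the χ2 statistic of Eq. (1) from a 2 × 2 contingency table; the counts and the helper name are our own illustrative assumptions.

```python
# A minimal sketch of the chi-square test in Eq. (1), assuming a 2 x 2
# contingency table of observed counts (term present/absent vs. category
# member/non-member). Counts and names are hypothetical.

def chi_square(observed):
    """Compute chi-square from a 2 x 2 table of observed frequencies O_i."""
    n = sum(sum(row) for row in observed)          # total documents
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # expected frequency E_i under the independence hypothesis
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts: 40 of 50 in-category documents contain the term,
# while only 10 of 150 out-of-category documents contain it.
table = [[40, 10],
         [10, 140]]
print(round(chi_square(table), 2))  # large value at 1 df -> dependent
```

A value far above the critical χ2 threshold at one degree of freedom would lead us to reject the null hypothesis of term-category independence.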

2.2 Term Rank Identifier (TRI)

In our exploration, we found that χ2 does not fully exploit all the information provided by the term-category independence test. The problem is that it identifies only positive term-category dependencies based on frequent words. In view of this, we propose a new term-category dependency measure, denoted TRI, which identifies highly related terms based on their frequencies; each term is assigned a rank and is categorized by its semantics.

Example 1

Suppose a database D consists of 5 documents, D = {d1, d2, d3, d4, d5}, categorized into three categories c1 = {d1, d2, d5}, c2 = {d1, d2, d4} and c3 = {d3}, and we observe four different terms t1, t2, t3 and t4.

This example is represented in Table 2. Observe that the term t1 occurs in all documents except d2 and d5, while the term t2, even with rank 2, occurs only in documents d1 and d2. By analyzing the occurrences of the different terms in this way, we conclude that term-category frequency alone is not adequate in all cases. Our proposed metric, the Term Rank Identifier (TRI), therefore also measures the semantic relatedness (Table 3) of each term in every document.

Table 2 Term-ranking based on their frequencies
Table 3 Calculating semantically related terms

From Table 3 we can say that the terms t1, t2 and t3 are semantically related to every category. Compared to c3, the categories c1 and c2 consist of highly related terms. We can therefore determine that the documents of c1 = {d1, d2, d5} and c2 = {d1, d2, d4} contain similar information, and these documents are clustered by our proposed Semantically Enriched Terms Clustering (SETC) algorithm.
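As an illustration of the frequency-ranking step behind TRI, the following Python fragment ranks terms per category from a toy term-document table modeled on Example 1; all names and counts are our own, and the semantic-relatedness step (Table 3) is sketched later with WordNet. This is not the authors' implementation.

```python
# An illustrative ranking step behind TRI, using a toy term-document
# table modeled on Example 1 / Table 2. All names and counts are ours.
from collections import Counter

docs = {
    "d1": ["t1", "t2", "t3"],
    "d2": ["t2", "t3"],
    "d3": ["t4"],
    "d4": ["t1", "t3"],
    "d5": ["t1"],
}
categories = {"c1": ["d1", "d2", "d5"],
              "c2": ["d1", "d2", "d4"],
              "c3": ["d3"]}

def rank_terms(member_docs):
    """Rank terms by total frequency over the given documents (rank 1 = most frequent)."""
    counts = Counter(t for d in member_docs for t in docs[d])
    return {term: rank
            for rank, (term, _) in enumerate(counts.most_common(), start=1)}

for c, members in categories.items():
    print(c, rank_terms(members))
```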

3 Proposed Text Clustering Algorithm

3.1 Overview of Text Clustering

In many traditional text clustering algorithms, text documents are represented using the vector space model [7]. In this model, each document d is considered a vector in the term space and is represented by its term-frequency (TF) vector. Normally, several preprocessing steps, including stop-word removal and stemming, are applied to the documents. A widely used refinement of this model is to weight each term by its inverse document frequency (IDF) [8] in the corpus.
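As a concrete illustration of the TF-IDF weighted vector space model, the sketch below uses scikit-learn's TfidfVectorizer on a toy corpus; the corpus and settings are assumptions for illustration, and stemming would require an extra preprocessing step.

```python
# A small illustration of the TF-IDF weighted vector space model using
# scikit-learn's TfidfVectorizer; the toy corpus is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "text clustering groups similar documents",
    "frequent word sets measure document similarity",
    "clustering documents by frequent word sets",
]
vectorizer = TfidfVectorizer(stop_words="english")  # stop-word removal built in
X = vectorizer.fit_transform(corpus)                # one TF-IDF row per document
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```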

For the problem of clustering text documents, different criterion functions are available. The most commonly used is the cosine function [8], which measures the similarity between two documents as the correlation between the vectors representing them.

For two documents di and dj, the similarity can be calculated as

$$ \cos(d_{i}, d_{j}) = \frac{d_{i} \cdot d_{j}}{\left\| d_{i} \right\| \left\| d_{j} \right\|} $$
(2)

where · represents the vector dot product and \( \left\| d_{i} \right\| \) denotes the length of vector di. The cosine value is 1 when two documents are identical and 0 when they have nothing in common; a larger cosine value indicates that the two documents share more terms and are more similar. The K-means algorithm is very popular for clustering a data set into k clusters. If the dataset contains n documents d1, d2, …, dn, then clustering is the optimization process of grouping them into k clusters so that the global criterion function is either minimized or maximized:

$$ \sum_{j=1}^{k} \sum_{i=1}^{n} f(d_{i}, \mathrm{Cen}_{j}) $$
(3)

where Cenj represents the centroid of cluster cj, for j = 1, …, k, and f(di, Cenj) is the clustering criterion function for a document di and a centroid Cenj. When the cosine function is used, each document is assigned to the cluster with the most similar centroid, and the global criterion function is maximized as a result.
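A minimal sketch of this criterion follows. With L2-normalized document vectors, standard Euclidean k-means is a common stand-in for maximizing the average cosine similarity to the centroids, since for unit vectors ||x − c||² = 2 − 2 x·c; the random vectors below are stand-ins for real TF-IDF document vectors.

```python
# A sketch of the criterion in Eq. (3) with the cosine function: for
# L2-normalized vectors, Euclidean k-means approximates maximizing the
# average cosine similarity to the (normalized) centroids.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = normalize(rng.random((20, 50)))   # 20 toy unit-length document vectors

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

# cosine of each document with its own (normalized) centroid,
# i.e. the f(d_i, Cen_j) terms summed by the criterion function
centroids = normalize(km.cluster_centers_)
print(labels)
print(np.sum(X * centroids[labels], axis=1).round(2))
```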

3.2 Semantically Enriched Terms Clustering (SETC)

In the previous section we described how our proposed metric TRI identifies semantically highly related terms. The semantic relatedness is calculated with the help of WordNet 3.0, a lexical semantic analyzer used to calculate the synonyms and estimated relative frequencies of given terms.

Algorithm:

The objective of the algorithm is to generate semantically highly related terms; an illustrative Python sketch of these steps follows the list below.

  • Input: a set of different text documents and WordNet 3.0 for semantics.

  • Output: categorized class labels, from which taxonomies are generated.

  1. Step 1:

    Given a collection of text documents D = {d1, d2, d3, d4, d5}, find the unigrams, bigrams, trigrams and multigrams of every document.

    • Unigram—a frequently occurring single word

    • Bigram—a frequently occurring sequence of 2 words

    • Trigram—a frequently occurring sequence of 3 words

    • Multigram—a frequently occurring sequence of 4 or more words.

  2. Step 2:

    Assign a rank to each term based on its relative frequency in a single document or in clustered documents.

    $$ \text{Rank} = \text{Term Frequency (TF)}, \quad \text{Min\_Support} = 2 $$
  3. Step 3:

    Identify the semantic relationship between the terms by using the lexical semantic analyzer WordNet 3.0:

    $$ \text{Sem\_Rel}(\text{Terms}) = \text{Synonyms or Estimated Relative Frequency} $$
  4. Step 4:

    Categorize the semantically enriched terms into different categories by assigning class labels.

  5. Step 5:

    Construct taxonomies from the generated class labels.
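The following Python sketch illustrates Steps 1–4 under stated assumptions: it extracts uni-, bi- and trigrams, ranks terms meeting Min_Support = 2 by frequency, and treats two terms as semantically related when their WordNet synsets overlap, a simple stand-in for the synonym/estimated-relative-frequency check. It is not the authors' implementation and requires nltk plus its 'wordnet' data.

```python
# An illustrative sketch of Steps 1-4 (not the authors' implementation).
# Requires: pip install nltk, then nltk.download('wordnet').
from collections import Counter
from nltk.corpus import wordnet

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rank_terms(documents, min_support=2):
    """Steps 1-2: count uni-, bi- and trigrams; keep those meeting min_support."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for n in (1, 2, 3):
            counts.update(ngrams(tokens, n))
    frequent = {t: c for t, c in counts.items() if c >= min_support}
    return sorted(frequent, key=frequent.get, reverse=True)   # rank order

def semantically_related(term_a, term_b):
    """Step 3: related if the terms share at least one WordNet synset."""
    return bool(set(wordnet.synsets(term_a)) & set(wordnet.synsets(term_b)))

docs = ["car engine repair", "automobile engine service", "flower garden care"]
print(rank_terms(docs))                           # ['engine']
print(semantically_related("car", "automobile"))  # True: shared synset
```

Grouping the frequent terms by this relation and attaching a class label to each group (Step 4) then yields the taxonomy of Step 5.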

Primarily, we consider a single document d1, measure the term-category dependency, identify the frequent terms, and assign ranks to these terms based on their frequencies in that particular document d1. Next, the semantic relatedness between the terms is measured with our metric TRI, and the terms are categorized according to synonymy and expected relative frequencies with the help of the WordNet 3.0 lexical semantic analyzer. Each document d2, …, dn is categorized in the same way with our proposed metric TRI.

Later, our proposed Semantically Enriched Terms Clustering (SETC) algorithm clusters all the documents into k clusters. Our method differs from the traditional K-Means and K-Medoids partitioning algorithms, which cluster based on the means of the data objects and centroid values. Compared to these traditional algorithms, our SETC algorithm with the TRI metric outperforms them and improves the accuracy of text categorization by focusing on term semantics.

4 Experimental Results

In this section, we compare our proposed metric with existing measures such as χ2 statistics (Table 4) and observe that our metric TRI identifies semantically highly related terms effectively.

Table 4 Performance comparisons between χ2 statistics and TRI

The performance of our integrated approach is compared with traditional, familiar clustering algorithms such as K-Means, K-Medoids and TCFS, applied to datasets such as 20 Newsgroups, Reuters, PubMed and Wordsink; we observe that SETC with TRI produces good results. The statistics are shown in Table 5.

Table 5 Performance comparisons of SETC with other clustering methods

Figure 1 shows the performance improvements of our proposed algorithm compared with traditional, well-known clustering algorithms.

Fig. 1 Performance improvements of SETC with different clustering algorithms

5 Conclusion

In this paper, we introduced a new metric named the Term Rank Identifier (TRI), which identifies highly related terms based on their synonyms and expected relative frequencies. A comparison was made on real data sets against available measures such as χ2 statistics and GSS coefficients, and we observed that TRI performs well. We also proposed a text clustering algorithm named Semantically Enriched Terms Clustering (SETC), which is integrated with TRI. Our proposed SETC algorithm was compared with other clustering and feature selection algorithms such as K-Means, K-Medoids, and TCFS with CHIR. The experimental results show that SETC outperforms them in clustering accuracy on different data sets.

In future work, we will enhance the text categorization and clustering capabilities by proposing additional measures that are independent of the scope of the cluster. We also plan to build ontologies automatically by introducing NLP lexical analyzers.