1 Introduction

Today the world has become web dependent. With the boom of the Internet, the World Wide Web now contains billions of textual documents. Search engines are not yet smart enough to extract accurate knowledge from high-dimensional domains such as text and the web, which creates an urgent need for effective clustering of high-dimensional data.

Many traditional approaches have been proposed and developed to analyze high-dimensional data. Text clustering is one of the best mechanisms for identifying similarity between documents. However, most clustering approaches depend on factors such as term frequency, document frequency, feature selection, and support vector machines (SVMs), and there is still uncertainty when processing highly dimensional data.

This research focuses mainly on improving text categorization over text document clusters. The proposed TRI and SETC boost text categorization by providing semantically enriched document clusters. The primary goal is to measure the most frequent terms occurring in any text document cluster with our proposed metric, the Term Rank Identifier (TRI). For these frequent terms, semantic relations are computed with WordNet tools. The basic idea behind frequent-term selection is to reduce the high dimensionality of the data. The secondary goal is to apply our proposed text clustering algorithm, Semantically Enriched Terms Clustering (SETC), to cluster the documents measured by TRI.

2 Related Work

Major text clustering algorithms fall into two categories: hierarchical and partitioning methods. Agglomerative hierarchical clustering (AHC) algorithms initially treat each document as a cluster, use different kinds of distance functions to compute the similarity between all pairs of clusters, and then merge the closest pair [1]. Partitioning algorithms, on the other hand, initially consider the whole database a single cluster. Based on a heuristic function, they select a cluster to split, and the split step is repeated until the desired number of clusters is obtained. These two categories are compared in [2].

The FTC algorithm introduced in [3] used the frequent word sets shared between documents to measure their closeness in text clustering. The FIHC algorithm proposed in [4] went further in this direction: it measures the cohesiveness of a cluster directly by using frequent word sets, so that documents in the same cluster are expected to share more frequent word sets than documents in different clusters. FIHC uses frequent word sets to construct clusters and organize them into a topic hierarchy. Since frequent word sequences can represent a document well, clustering text documents based on frequent word sequences is meaningful. The idea of using word sequences for text clustering was proposed with STC in [5]; however, STC does not reduce the high dimensionality of the text documents, so its complexity is quite high for large text databases.

The sequential aspect of word occurrences in documents should not be ignored if information retrieval performance is to improve [6]. The authors of [6] proposed using maximal frequent word sequences, i.e., frequent word sequences not contained in any longer frequent word sequence. In view of all the text clustering algorithms discussed above, we propose TRI and SETC.

2.1 Traditional Text Categorization Measures

2.1.1 χ2 Statistics

In text mining for information retrieval, the χ2 statistic is frequently used to measure term frequencies and term-category dependencies. This is done by counting the co-occurrences of terms and categories and listing them in contingency tables (Table 1). Suppose that a corpus contains n labeled documents falling into m categories. After stop-word removal and stemming, the distinct terms are extracted from the corpus.

Table 1 General notation of 2 × 2 contingency table

For the χ2 term-category dependency test, we consider two hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis states that the two variables, term and category, are independent of each other; the alternative hypothesis states that there is some dependency between them.

The general formula to calculate the dependency is

$$ \chi^{2} = \sum_{i=1}^{k} \frac{(O_{i} - E_{i})^{2}}{E_{i}} $$
(1)

where

  • Oi—the observed frequency in the ith cell of the table.

  • Ei—the expected frequency in the ith cell of the table.

The degrees of freedom are (r − 1)(c − 1), where r is the number of rows and c is the number of columns.
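To make the test concrete, the following Python sketch computes the χ2 statistic of Eq. (1) from a 2 × 2 contingency table; the counts and the helper name are our own illustrative assumptions.

```python
# A minimal sketch of the chi-square test in Eq. (1), assuming a 2 x 2
# contingency table of observed counts (term present/absent vs. category
# member/non-member). Counts and names are hypothetical.

def chi_square(observed):
    """Compute chi-square from a 2 x 2 table of observed frequencies O_i."""
    n = sum(sum(row) for row in observed)          # total documents
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # expected frequency E_i under the independence hypothesis
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts: 40 of 50 in-category documents contain the term,
# while only 10 of 150 out-of-category documents contain it.
table = [[40, 10],
         [10, 140]]
print(round(chi_square(table), 2))  # large value at 1 df -> dependent
```

A value far above the critical χ2 threshold at one degree of freedom would lead us to reject the null hypothesis of term-category independence.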

2.2 Term Rank Identifier (TRI)

In our exploration, we found that χ2 does not fully exploit all the information provided by the term-category independence test. The problem is that it identifies only positive term-category dependencies based on frequent words. In view of this, we propose a new term-category dependency measure, denoted TRI, which identifies highly related terms based on their frequencies; each term is assigned a rank and is categorized by its semantics.

Example 1

Suppose a database D consists of 5 documents, D = {d1, d2, d3, d4, d5}, categorized into three categories c1 = {d1, d2, d5}, c2 = {d1, d2, d4} and c3 = {d3}, and we observe four different terms t1, t2, t3 and t4.

This example is represented in Table 2. Observe that the term t1 occurs in all documents except d2 and d5, while the term t2, even with rank 2, occurs only in documents d1 and d2. By analyzing the occurrences of the different terms in this way, we conclude that term-category frequency alone is not adequate in all cases. Our proposed metric, the Term Rank Identifier (TRI), therefore also measures the semantic relatedness (Table 3) of each term in every document.

Table 2 Term-ranking based on their frequencies
Table 3 Calculating semantically related terms

From Table 3 we can say that the terms t1, t2 and t3 are semantically related to every category. Compared to c3, the categories c1 and c2 consist of highly related terms. We can therefore determine that the documents of c1 = {d1, d2, d5} and c2 = {d1, d2, d4} contain similar information, and these documents are clustered by our proposed Semantically Enriched Terms Clustering (SETC) algorithm.
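As an illustration of the frequency-ranking step behind TRI, the following Python fragment ranks terms per category from a toy term-document table modeled on Example 1; all names and counts are our own, and the semantic-relatedness step (Table 3) is sketched later with WordNet. This is not the authors' implementation.

```python
# An illustrative ranking step behind TRI, using a toy term-document
# table modeled on Example 1 / Table 2. All names and counts are ours.
from collections import Counter

docs = {
    "d1": ["t1", "t2", "t3"],
    "d2": ["t2", "t3"],
    "d3": ["t4"],
    "d4": ["t1", "t3"],
    "d5": ["t1"],
}
categories = {"c1": ["d1", "d2", "d5"],
              "c2": ["d1", "d2", "d4"],
              "c3": ["d3"]}

def rank_terms(member_docs):
    """Rank terms by total frequency over the given documents (rank 1 = most frequent)."""
    counts = Counter(t for d in member_docs for t in docs[d])
    return {term: rank
            for rank, (term, _) in enumerate(counts.most_common(), start=1)}

for c, members in categories.items():
    print(c, rank_terms(members))
```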

3 Proposed Text Clustering Algorithm

3.1 Overview of Text Clustering

In many traditional text clustering algorithms, text documents are represented using the vector space model [7]. In this model, each document d is considered a vector in the term space and is represented by its term-frequency (TF) vector. Normally, several preprocessing steps, including stop-word removal and stemming, are applied to the documents. A widely used refinement of this model is to weight each term by its inverse document frequency (IDF) [8] in the corpus.
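As a concrete illustration of the TF-IDF weighted vector space model, the sketch below uses scikit-learn's TfidfVectorizer on a toy corpus; the corpus and settings are assumptions for illustration, and stemming would require an extra preprocessing step.

```python
# A small illustration of the TF-IDF weighted vector space model using
# scikit-learn's TfidfVectorizer; the toy corpus is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "text clustering groups similar documents",
    "frequent word sets measure document similarity",
    "clustering documents by frequent word sets",
]
vectorizer = TfidfVectorizer(stop_words="english")  # stop-word removal built in
X = vectorizer.fit_transform(corpus)                # one TF-IDF row per document
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```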

For the problem of clustering text documents, different criterion functions are available. The most commonly used is the cosine function [8], which measures the similarity between two documents as the correlation between the vectors representing them.

For two documents di and dj, the similarity can be calculated as

$$ \cos(d_{i}, d_{j}) = \frac{d_{i} \cdot d_{j}}{\left\| d_{i} \right\| \left\| d_{j} \right\|} $$
(2)

where · represents the vector dot product and \( \left\| d_{i} \right\| \) denotes the length of vector di. The cosine value is 1 when two documents are identical and 0 when they have nothing in common; a larger cosine value indicates that the two documents share more terms and are more similar. The K-means algorithm is very popular for clustering a data set into k clusters. If the dataset contains n documents d1, d2, …, dn, then clustering is the optimization process of grouping them into k clusters so that the global criterion function is either minimized or maximized:

$$ \sum_{j=1}^{k} \sum_{i=1}^{n} f(d_{i}, \mathrm{Cen}_{j}) $$
(3)

where Cenj represents the centroid of cluster cj, for j = 1, …, k, and f(di, Cenj) is the clustering criterion function for a document di and a centroid Cenj. When the cosine function is used, each document is assigned to the cluster with the most similar centroid, and the global criterion function is maximized as a result.
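A minimal sketch of this criterion follows. With L2-normalized document vectors, standard Euclidean k-means is a common stand-in for maximizing the average cosine similarity to the centroids, since for unit vectors ||x − c||² = 2 − 2 x·c; the random vectors below are stand-ins for real TF-IDF document vectors.

```python
# A sketch of the criterion in Eq. (3) with the cosine function: for
# L2-normalized vectors, Euclidean k-means approximates maximizing the
# average cosine similarity to the (normalized) centroids.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = normalize(rng.random((20, 50)))   # 20 toy unit-length document vectors

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

# cosine of each document with its own (normalized) centroid,
# i.e. the f(d_i, Cen_j) terms summed by the criterion function
centroids = normalize(km.cluster_centers_)
print(labels)
print(np.sum(X * centroids[labels], axis=1).round(2))
```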

3.2 Semantically Enriched Terms Clustering (SETC)

In the previous section we described how our proposed metric TRI identifies semantically highly related terms. The semantic relatedness is calculated with the help of WordNet 3.0, a lexical semantic analyzer used to calculate the synonyms and estimated relative frequencies of given terms.

Algorithm:

The objective of the algorithm is to generate semantically highly related terms; an illustrative Python sketch of these steps follows the list below.

  • Input: a set of different text documents and WordNet 3.0 for semantics.

  • Output: categorized class labels, from which taxonomies are generated.

  1. Step 1:

    Given a collection of text documents D = {d1, d2, d3, d4, d5}, find the unigrams, bigrams, trigrams and multigrams of every document.

    • Unigram—a frequently occurring single word

    • Bigram—a frequently occurring sequence of 2 words

    • Trigram—a frequently occurring sequence of 3 words

    • Multigram—a frequently occurring sequence of 4 or more words.

  2. Step 2:

    Assign a rank to each term based on its relative frequency in a single document or in clustered documents.

    $$ \text{Rank} = \text{Term Frequency (TF)}, \quad \text{Min\_Support} = 2 $$
  3. Step 3:

    Identify the semantic relationship between the terms by using the lexical semantic analyzer WordNet 3.0:

    $$ \text{Sem\_Rel}(\text{Terms}) = \text{Synonyms or Estimated Relative Frequency} $$
  4. Step 4:

    Categorize the semantically enriched terms into different categories by assigning class labels.

  5. Step 5:

    Construct taxonomies from the generated class labels.
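The following Python sketch illustrates Steps 1–4 under stated assumptions: it extracts uni-, bi- and trigrams, ranks terms meeting Min_Support = 2 by frequency, and treats two terms as semantically related when their WordNet synsets overlap, a simple stand-in for the synonym/estimated-relative-frequency check. It is not the authors' implementation and requires nltk plus its 'wordnet' data.

```python
# An illustrative sketch of Steps 1-4 (not the authors' implementation).
# Requires: pip install nltk, then nltk.download('wordnet').
from collections import Counter
from nltk.corpus import wordnet

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rank_terms(documents, min_support=2):
    """Steps 1-2: count uni-, bi- and trigrams; keep those meeting min_support."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for n in (1, 2, 3):
            counts.update(ngrams(tokens, n))
    frequent = {t: c for t, c in counts.items() if c >= min_support}
    return sorted(frequent, key=frequent.get, reverse=True)   # rank order

def semantically_related(term_a, term_b):
    """Step 3: related if the terms share at least one WordNet synset."""
    return bool(set(wordnet.synsets(term_a)) & set(wordnet.synsets(term_b)))

docs = ["car engine repair", "automobile engine service", "flower garden care"]
print(rank_terms(docs))                           # ['engine']
print(semantically_related("car", "automobile"))  # True: shared synset
```

Grouping the frequent terms by this relation and attaching a class label to each group (Step 4) then yields the taxonomy of Step 5.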

Primarily, we consider a single document d1, measure the term-category dependency, identify the frequent terms, and assign ranks to these terms based on their frequencies in that particular document d1. Next, the semantic relatedness between the terms is measured with our metric TRI, and the terms are categorized according to synonymy and expected relative frequencies with the help of the WordNet 3.0 lexical semantic analyzer. Each document d2, …, dn is categorized in the same way with our proposed metric TRI.

Later, our proposed Semantically Enriched Terms Clustering (SETC) algorithm clusters all the documents into k clusters. Our method differs from the traditional K-Means and K-Medoids partitioning algorithms, which cluster based on the means of the data objects and centroid values. Compared to these traditional algorithms, our SETC algorithm with the TRI metric outperforms them and improves the accuracy of text categorization by focusing on term semantics.

4 Experimental Results

In this section, we compare our proposed metric with existing measures such as χ2 statistics (Table 4) and observe that our metric TRI identifies semantically highly related terms effectively.

Table 4 Performance comparisons between χ2 statistics and TRI

The performance of our integrated approach is compared with traditional, familiar clustering algorithms such as K-Means, K-Medoids and TCFS, applied to datasets such as 20 Newsgroups, Reuters, PubMed and Wordsink; we observe that SETC with TRI produces good results. The statistics are shown in Table 5.

Table 5 Performance comparisons of SETC with other clustering methods

Figure 1 shows the performance improvements of our proposed algorithm compared with traditional, well-known clustering algorithms.

Fig. 1 Performance improvements of SETC with different clustering algorithms

5 Conclusion

In this paper, we introduced a new metric named the Term Rank Identifier (TRI), which identifies highly related terms based on their synonyms and expected relative frequencies. A comparison was made on real data sets against available measures such as χ2 statistics and GSS coefficients, and we observed that TRI performs well. We also proposed a text clustering algorithm named Semantically Enriched Terms Clustering (SETC), which is integrated with TRI. Our proposed SETC algorithm was compared with other clustering and feature selection algorithms such as K-Means, K-Medoids, and TCFS with CHIR. The experimental results show that SETC outperforms them in clustering accuracy on different data sets.

In future work, we will enhance the text categorization and clustering capabilities by proposing additional measures that are independent of the scope of the cluster. We also plan to build ontologies automatically by introducing NLP lexical analyzers.