Representative Based Document Clustering

Banerjee, Arko; Pujari, Arun K.

doi:10.1007/978-3-319-07353-8_47

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 27)

Abstract

In this paper we propose a novel approach to document clustering built on a representative-based document similarity model. The model treats a document as an ordered sequence of words and partitions it into chunks in order to capture proximity information between words. Chunks are subsequences of a document that have low internal entropy and high boundary entropy. A chunk can be a phrase, a word, or part of a word. We implement a linear-time unsupervised algorithm that segments a sequence of words into chunks. Chunks that occur frequently are taken as representatives of the document set. The representative-based document similarity model, which consists of a term-document matrix built over the representatives, is a compact representation of the vector space model and improves the quality of document clustering over traditional methods.
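
The abstract does not spell out the segmentation algorithm, so the following is only a minimal sketch of two ideas it names, not the authors' implementation. It assumes boundary entropy means the entropy of the token that follows an n-gram, and it assumes a chunk counts as "frequent" when it appears in at least `min_docs` documents. The function names, the `min_docs` threshold, and the pre-segmented input are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def boundary_entropy(tokens, n=1):
    """Entropy (in bits) of the token that follows each n-gram in `tokens`.

    Assumption: a high value suggests a likely chunk boundary after the n-gram.
    """
    followers = defaultdict(Counter)
    for i in range(len(tokens) - n):
        followers[tuple(tokens[i:i + n])][tokens[i + n]] += 1

    entropy = {}
    for gram, counts in followers.items():
        total = sum(counts.values())
        entropy[gram] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def representative_matrix(chunked_docs, min_docs=2):
    """Build a document-by-representative count matrix.

    `chunked_docs` is a list of documents, each already segmented into chunks.
    Assumption: representatives are chunks found in at least `min_docs` documents.
    """
    doc_freq = Counter(chunk for doc in chunked_docs for chunk in set(doc))
    reps = sorted(chunk for chunk, df in doc_freq.items() if df >= min_docs)

    matrix = []
    for doc in chunked_docs:
        counts = Counter(doc)
        matrix.append([counts[rep] for rep in reps])
    return reps, matrix
```

Because only frequent chunks become columns, the resulting matrix is narrower than a full term-document matrix over every distinct word, which is the sense in which the abstract calls the model compact.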

Author information

Authors and Affiliations

College of Engineering and Management, Kolaghat, WB, India
Arko Banerjee
University of Hyderabad, Hyderabad, Andhra Pradesh, India
Arun K. Pujari

Corresponding author

Correspondence to Arko Banerjee.

Editor information

Editors and Affiliations

Indian Statistical Institute, Machine Intelligence Unit, Kolkata, India
Malay Kumar Kundu
Dept. of Computer Science and Engineering, National Institute of Technology Rourkela, Rourkela, India
Durga Prasad Mohapatra
Dept. of Electronics and Tele-Communication Engineering, Jadavpur University Artificial Intelligence Laboratory, Kolkata, India
Amit Konar
Dept. of Computer Science and Engineering, St. Thomas' College of Engineering & Technology, Kidderpore, West Bengal, India
Aruna Chakraborty

About this paper

Cite this paper

Banerjee, A., Pujari, A.K. (2014). Representative Based Document Clustering. In: Kumar Kundu, M., Mohapatra, D., Konar, A., Chakraborty, A. (eds) Advanced Computing, Networking and Informatics- Volume 1. Smart Innovation, Systems and Technologies, vol 27. Springer, Cham. https://doi.org/10.1007/978-3-319-07353-8_47

DOI: https://doi.org/10.1007/978-3-319-07353-8_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07352-1
Online ISBN: 978-3-319-07353-8
eBook Packages: Engineering, Engineering (R0)
