Vector Space Models for Search and Cluster Mining

Kobayashi, Mei; Aono, Masaki

doi:10.1007/978-1-4757-4305-0_5

Abstract

This chapter consists of two parts: a review of search and cluster mining algorithms based on vector space modeling, followed by a description of a prototype search and cluster mining system. In the review, we consider Latent Semantic Indexing (LSI), a method based on the Singular Value Decomposition (SVD) of the document-attribute matrix, and Principal Component Analysis (PCA) of the document vector covariance matrix. In the second part, we present novel techniques for mining major and minor clusters from massive databases, based on enhancements of LSI and PCA, and for automatically labeling clusters based on their document contents. Most mining systems are designed to find major clusters and often fail to report information on smaller, minor clusters. Minor cluster identification is important in many business applications, such as detection of credit card fraud, profile analysis, and scientific data analysis. Another novel feature of our method is the recognition and preservation of naturally occurring overlaps among clusters. Cluster overlap analysis is important for multiperspective analysis of databases. Results from implementation studies with a prototype system using over 100,000 news articles demonstrate the effectiveness of the search and clustering engines.
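The two reductions named in the abstract can be illustrated on a toy example. The sketch below is not the authors' system: the six terms, five documents, and the query are invented for illustration, and power iteration with deflation stands in for a real SVD or eigenvalue solver so that the example needs only the Python standard library. It relies on the fact that the right singular vectors of the document-attribute matrix A are the eigenvectors of A^T A (the LSI basis), while the principal components are the eigenvectors of the covariance matrix of the centered document vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0

def gram(rows):
    """Return X^T X for a matrix X given as a list of rows."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]

def top_eigenvectors(S, k, iters=500):
    """Top-k eigenvectors of a symmetric positive semidefinite matrix.

    Uses power iteration with deflation; assumes S has rank >= k and
    well-separated leading eigenvalues (true for the toy data below).
    """
    n = len(S)
    S = [row[:] for row in S]          # work on a copy
    basis = []
    for j in range(k):
        v = [1.0 + 0.1 * (i + j) for i in range(n)]   # deterministic start
        for _ in range(iters):
            w = [dot(row, v) for row in S]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in S])      # Rayleigh quotient
        basis.append(v)
        for r in range(n):                            # deflate: S -= lam * v v^T
            for c in range(n):
                S[r][c] -= lam * v[r] * v[c]
    return basis

def project(vec, basis):
    return [dot(vec, b) for b in basis]

# Hypothetical toy data: rows are documents, columns are term counts.
TERMS = ["matrix", "vector", "eigenvalue", "fraud", "credit", "card"]
DOCS = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],   # a "math" document that never mentions "eigenvalue"
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
]
QUERY = [0, 0, 1, 0, 0, 0]             # the single term "eigenvalue"

# Plain vector space model: cosine similarity in the full term space.
raw_scores = [cosine(QUERY, d) for d in DOCS]

# LSI-style search: project documents and query onto the top-2 right
# singular vectors of the document-attribute matrix, then compare.
lsi_basis = top_eigenvectors(gram(DOCS), k=2)
q_lsi = project(QUERY, lsi_basis)
lsi_scores = [cosine(q_lsi, project(d, lsi_basis)) for d in DOCS]

# PCA: same eigen-solver applied to the covariance of centered documents.
means = [sum(col) / len(DOCS) for col in zip(*DOCS)]
centered = [[x - m for x, m in zip(d, means)] for d in DOCS]
cov = [[c / len(DOCS) for c in row] for row in gram(centered)]
pc1 = top_eigenvectors(cov, k=1)[0]
pca_coords = [dot(d, pc1) for d in centered]

print("raw cosine:", [round(s, 3) for s in raw_scores])
print("LSI cosine:", [round(s, 3) for s in lsi_scores])
print("PCA coord :", [round(c, 3) for c in pca_coords])
```

In this toy setting the second document shares no term with the query, so its full-space cosine score is zero, yet it scores highly in the reduced space because it lies in the same latent direction as the other "math" documents. The first principal component separates the two groups by sign. On a real collection one would use a sparse SVD or Lanczos-type solver instead of dense power iteration.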

Editor information

Michael W. Berry, Department of Computer Science, University of Tennessee, 203 Claxton Complex, Knoxville, TN 37996-3450, USA

About this chapter

Cite this chapter

Kobayashi, M., Aono, M. (2004). Vector Space Models for Search and Cluster Mining. In: Berry, M.W. (eds) Survey of Text Mining. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-4305-0_5

DOI: https://doi.org/10.1007/978-1-4757-4305-0_5
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-3057-6
Online ISBN: 978-1-4757-4305-0
eBook Packages: Springer Book Archive
