An Index Structure for Data Mining and Clustering

Wang, Xiong; Wang, Jason T. L.; Lin, King-Ip; Shasha, Dennis; Shapiro, Bruce A.; Zhang, Kaizhong

doi:10.1007/s101150050009

Original Paper
Published: June 2000

Volume 2, pages 161–184 (2000)

Knowledge and Information Systems

Xiong Wang¹, Jason T. L. Wang¹, King-Ip Lin², Dennis Shasha³, Bruce A. Shapiro⁴ & Kaizhong Zhang⁵

154 Accesses
39 Citations

Abstract.

In this paper we present an index structure, called MetricMap, that takes a set of objects and a distance metric and then maps those objects to a k-dimensional space in such a way that the distances among objects are approximately preserved. The index structure is a useful tool for clustering and visualization in data-intensive applications, because it replaces expensive distance calculations by sum-of-square calculations. This can make clustering in large databases with expensive distance metrics practical. We compare the index structure with another data mining index structure, FastMap, recently proposed by Faloutsos and Lin, according to two criteria: relative error and clustering accuracy. For relative error, we show that (i) FastMap gives a lower relative error than MetricMap for Euclidean distances, (ii) MetricMap gives a lower relative error than FastMap for non-Euclidean distances (i.e., general distance metrics), and (iii) combining the two reduces the error yet further. A similar result is obtained when comparing the accuracy of clustering. These results hold for different data sizes. The main qualitative conclusion is that these two index structures capture complementary information about distance metrics and therefore can be used together to great benefit. The net effect is that multi-day computations can be done in minutes.
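
The relative-error criterion can be made concrete with a minimal sketch. The abstract does not give the paper's exact definition, so the formula used here (the mean of |embedded − original| / original over all object pairs) is an assumption, and the names `mean_relative_error`, `euclidean`, and `hamming` are hypothetical. The toy coordinates are made up for illustration and are not the output of MetricMap or FastMap.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two k-dimensional points (a sum-of-squares calculation)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_relative_error(objects, coords, dist):
    """Average |embedded - original| / original over all pairs with original > 0.

    objects: list of original objects
    coords:  list of k-dimensional points; coords[i] is the image of objects[i]
    dist:    the original (possibly expensive) distance function

    Assumed definition of relative error, not taken from the paper.
    """
    errors = []
    for i, j in combinations(range(len(objects)), 2):
        original = dist(objects[i], objects[j])
        if original == 0:
            continue  # skip identical objects to avoid dividing by zero
        embedded = euclidean(coords[i], coords[j])
        errors.append(abs(embedded - original) / original)
    return sum(errors) / len(errors) if errors else 0.0


def hamming(s, t):
    """Toy stand-in for an expensive metric: count of mismatched characters."""
    return sum(c1 != c2 for c1, c2 in zip(s, t))


# Made-up 2-dimensional coordinates standing in for an embedding's output.
words = ["aaaa", "aaab", "bbbb"]
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
error = mean_relative_error(words, points, hamming)
```

Once the coordinates exist, every later comparison costs only one `euclidean` call, which is the saving the abstract describes. The original metric is needed only to build the embedding and to measure how well it preserves distances.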

Author information

Authors and Affiliations

Department of Computer and Information Science, New Jersey Institute of Technology, Newark, NJ, USA
Xiong Wang & Jason T. L. Wang
Department of Mathematical Sciences, University of Memphis, Memphis, TN, USA
King-Ip Lin
Courant Institute of Mathematical Sciences, New York University, New York, USA
Dennis Shasha
Laboratory of Experimental and Computational Biology, National Cancer Institute, Frederick, MD, USA
Bruce A. Shapiro
Department of Computer Science, The University of Western Ontario, London, Ontario, Canada
Kaizhong Zhang

Additional information

Received February 1998 / Revised July 1999 / Accepted in revised form September 1999

About this article

Cite this article

Wang, X., Wang, J., Lin, KI. et al. An Index Structure for Data Mining and Clustering. Knowledge and Information Systems 2, 161–184 (2000). https://doi.org/10.1007/s101150050009

Issue Date: June 2000
DOI: https://doi.org/10.1007/s101150050009

Keywords: Biomedical applications; Data engineering; Distance metrics; Knowledge discovery; Visualization
