On data mining, compression, and Kolmogorov complexity

Faloutsos, Christos; Megalooikonomou, Vasileios

doi:10.1007/s10618-006-0057-3

On data mining, compression, and Kolmogorov complexity

Published: 27 January 2007

Volume 15, pages 3–20, (2007)
Cite this article

Download PDF

Access provided by CONRICYT-eBooks

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

On data mining, compression, and Kolmogorov complexity

Download PDF

Christos Faloutsos¹ &
Vasileios Megalooikonomou²

2030 Accesses
50 Citations
6 Altmetric
Explore all metrics

Abstract

Will we ever have a theory of data mining analogous to the relational algebra in databases? Why do we have so many clearly different clustering algorithms? Could data mining be automated? We show that the answer to all these questions is negative, because data mining is closely related to compression and Kolmogorov complexity; and the latter is undecidable. Therefore, data mining will always be an art, where our goal will be to find better models (patterns) that fit our datasets as best as possible.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Barnsley M (1988). Fractals everywhere. Academic Press Inc, San Diego, CA
MATH Google Scholar
Barnsley MF, Sloan AD (1988) A better way to compress images. Byte, January 1988. pp 215–223
Bentley JL, Sleator DD, Tarjan RE, Wei VK (1986) A locally adaptive data compression scheme. CACM, 29(4):320–330
MATH Google Scholar
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth Inc., Belmont, CA, USA
MATH Google Scholar
Chakrabarti D, Papadimitriou S, Modha DS, Faloutsos C (2004) Fully automatic cross-associations. In: KDD Conference, Seattle, WA, pp 79–88
Codd EF (1971) A database sublanguage founded on the relational calculus. In: SIGFIDET Workshop, pp 35–68
Cover TM, Thomas JA (1991) Elements of information theory. John Wiley and Sons
Damerau FJ (1964) A technique for computer detection and correction of spelling errors. CACM, March 1964. 7(3):171–176
Google Scholar
Elias P (1975) Universal codeword sets and representations of integers. IEEE Trans Inf Theory IT-21:194–203
Article MATH Google Scholar
Eschera MA, Fu KS (1984) A graph distance measure for image analysis. IEEE Trans Syst Man Cybern 14:398–407
Google Scholar
Gersho A, Gray RM (1992) Vector quantization and signal compression. Kluwer Academic Publishers
Hamerly G, Elkan C (2003) Learning the k in k-means. In: Proceedings of the NIPS
Keogh E (2002) Exact indexing of dynamic time warping. Int. conf. on very large data bases, 406–417
Keogh EJ, Lonardi S, Ratanamahatana C(Ann) (2004) Towards parameter-free data mining. In: KDD Conference, Seattle, WA, pp 206–215
Kumaraswamy K, Megalooikonomou V, Faloutsos C (2004) Fractal dimension and vector quantization. Inform Process Lett 91(3):107–113
Article Google Scholar
Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H (2001) An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17(2):149–154
Article Google Scholar
Li M, Chen X, Li X, Ma B, Vitanyi P (2003) The similarity metric. In: Proceedings of the fourteenth annual ACM-SIAM symposium on discrete algorithms, pp 863–872
Li M, Vitanyi P (1997) An Introduction to Kolmogorov complexity and its applications, 2nd edn. Springer Verlag, New York
MATH Google Scholar
Mackenzie D (1999) New clues to why size equals destiny. Science, 284(5420):1607–1609; 10.1126/science.284.5420.1607
Google Scholar
Megalooikonomou V (1997) Kolmogorov incompressibility method in formal proofs: a critical survey. Technical Report TR CS-97-01, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County
Megalooikonomou V, Kontos D, Danglemaier J, Javadi A, Bakic PA, Maidment ADA (2006) A representation and classification scheme for tree-like structures in medical images: An application on branching pattern analysis of ductal trees in x-ray galactograms. In: Proceedings of the SPIE conference on medical imaging, San Diego, CA, vol 6144, March 2006, pp 497–505
Megalooikonomou V, Wang Q, Li G, Faloutsos C, (2005) A Multiresolution Symbolic Representation of Time Series. In: Proceedings of the 21st IEEE international conference on data engineering (ICDE05), Tokyo, Japan, Apr. 2005, pp 668–679
Mehta M, Agrawal R, and Rissanen J (1996) SLIQ: a fast scalable classifier for data mining. In: Proceedings of the 1996 Int. Conference on Extending Database Technology (EDBT’96), March 1996. Avignon, France, pp 18–32
Mitchell T (1997) Machine Learning. McGraw Hill
Martin-Löf P (1966) The definition of random sequences. Inform Contr 9:602–619
Article Google Scholar
Papadimitriou S, Kitagawa H, Gibbons P, Faloutsos C (2003) LOCI: Fast outlier detection using the local correlation integral. ICDE, March 2003, pp 315–326
Pelleg D, Moore AW (2000) X-means: extending k-means with efficient estimation of the number of clusters. In: ICML ’00: Proceedings of the seventeenth international conference on machine learning, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc, pp 727–734
Quinlan J (1993) C4.5: Programs for machine learning. Morgan Kauffman
Rabiner L, Juang B-H (1993) Fundamentals of speech recognition. Prentice Hall
Ramakrishnan R, Gehrke J (2002) Database management systems, 3rd edn. McGraw-Hill
Schroeder M (1991) Fractals, chaos, power laws: minutes from an infinite paradise. W.H. Freeman and Company, New York
MATH Google Scholar
Yi B-K, Jagadish HV, Faloutsos C (1998) Efficient retrieval of similar time sequences under time warping. ICDE 98, Feb. 23–27
Google Scholar
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD, may 1996, pp 103–114
Zhu L, Rao A, Zhang A (2002) Theory of keyblock-based image retrieval. ACM Trans Inform Syst 20(2):224–257
Article Google Scholar
Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate encoding. IEEE Trans Inform Theory IT-24:530–536
Article MATH Google Scholar
Zobel J, Moffat A, Sacks-Davis R (1992) An efficient indexing technique for full-text database systems. VLDB, Aug. 23–27, pp 352–362

Download references

Author information

Authors and Affiliations

School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, 15213-3891, USA
Christos Faloutsos
Department of Computer and Information Sciences, Temple University, 314 Wachman Hall, 1805 N. Broad Street, Philadelphia, PA, 19122, USA
Vasileios Megalooikonomou

Authors

Christos Faloutsos
View author publications
You can also search for this author in PubMed Google Scholar
Vasileios Megalooikonomou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Christos Faloutsos.

Additional information

Communicated by Geoffrey Webb.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Faloutsos, C., Megalooikonomou, V. On data mining, compression, and Kolmogorov complexity. Data Min Knowl Disc 15, 3–20 (2007). https://doi.org/10.1007/s10618-006-0057-3

Download citation

Received: 19 May 2006
Accepted: 20 September 2006
Published: 27 January 2007
Issue Date: August 2007
DOI: https://doi.org/10.1007/s10618-006-0057-3

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

On data mining, compression, and Kolmogorov complexity

Abstract

Article PDF

Similar content being viewed by others

Mining and Using Sets of Patterns through Compression

Abstract Representations and Generalized Frequent Pattern Discovery

Kolmogorov Complexity in Perspective Part II: Classification, Information Processing and Duality

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

On data mining, compression, and Kolmogorov complexity

Abstract

Article PDF

Similar content being viewed by others

Mining and Using Sets of Patterns through Compression

Abstract Representations and Generalized Frequent Pattern Discovery

Kolmogorov Complexity in Perspective Part II: Classification, Information Processing and Duality

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation