Abstract
Will we ever have a theory of data mining analogous to the relational algebra in databases? Why do we have so many clearly different clustering algorithms? Could data mining be automated? We show that the answer to all these questions is negative, because data mining is closely related to compression and Kolmogorov complexity; and the latter is undecidable. Therefore, data mining will always be an art, where our goal will be to find better models (patterns) that fit our datasets as best as possible.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Barnsley M (1988). Fractals everywhere. Academic Press Inc, San Diego, CA
Barnsley MF, Sloan AD (1988) A better way to compress images. Byte, January 1988. pp 215–223
Bentley JL, Sleator DD, Tarjan RE, Wei VK (1986) A locally adaptive data compression scheme. CACM, 29(4):320–330
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth Inc., Belmont, CA, USA
Chakrabarti D, Papadimitriou S, Modha DS, Faloutsos C (2004) Fully automatic cross-associations. In: KDD Conference, Seattle, WA, pp 79–88
Codd EF (1971) A database sublanguage founded on the relational calculus. In: SIGFIDET Workshop, pp 35–68
Cover TM, Thomas JA (1991) Elements of information theory. John Wiley and Sons
Damerau FJ (1964) A technique for computer detection and correction of spelling errors. CACM, March 1964. 7(3):171–176
Elias P (1975) Universal codeword sets and representations of integers. IEEE Trans Inf Theory IT-21:194–203
Eschera MA, Fu KS (1984) A graph distance measure for image analysis. IEEE Trans Syst Man Cybern 14:398–407
Gersho A, Gray RM (1992) Vector quantization and signal compression. Kluwer Academic Publishers
Hamerly G, Elkan C (2003) Learning the k in k-means. In: Proceedings of the NIPS
Keogh E (2002) Exact indexing of dynamic time warping. Int. conf. on very large data bases, 406–417
Keogh EJ, Lonardi S, Ratanamahatana C(Ann) (2004) Towards parameter-free data mining. In: KDD Conference, Seattle, WA, pp 206–215
Kumaraswamy K, Megalooikonomou V, Faloutsos C (2004) Fractal dimension and vector quantization. Inform Process Lett 91(3):107–113
Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H (2001) An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17(2):149–154
Li M, Chen X, Li X, Ma B, Vitanyi P (2003) The similarity metric. In: Proceedings of the fourteenth annual ACM-SIAM symposium on discrete algorithms, pp 863–872
Li M, Vitanyi P (1997) An Introduction to Kolmogorov complexity and its applications, 2nd edn. Springer Verlag, New York
Mackenzie D (1999) New clues to why size equals destiny. Science, 284(5420):1607–1609; 10.1126/science.284.5420.1607
Megalooikonomou V (1997) Kolmogorov incompressibility method in formal proofs: a critical survey. Technical Report TR CS-97-01, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County
Megalooikonomou V, Kontos D, Danglemaier J, Javadi A, Bakic PA, Maidment ADA (2006) A representation and classification scheme for tree-like structures in medical images: An application on branching pattern analysis of ductal trees in x-ray galactograms. In: Proceedings of the SPIE conference on medical imaging, San Diego, CA, vol 6144, March 2006, pp 497–505
Megalooikonomou V, Wang Q, Li G, Faloutsos C, (2005) A Multiresolution Symbolic Representation of Time Series. In: Proceedings of the 21st IEEE international conference on data engineering (ICDE05), Tokyo, Japan, Apr. 2005, pp 668–679
Mehta M, Agrawal R, and Rissanen J (1996) SLIQ: a fast scalable classifier for data mining. In: Proceedings of the 1996 Int. Conference on Extending Database Technology (EDBT’96), March 1996. Avignon, France, pp 18–32
Mitchell T (1997) Machine Learning. McGraw Hill
Martin-Löf P (1966) The definition of random sequences. Inform Contr 9:602–619
Papadimitriou S, Kitagawa H, Gibbons P, Faloutsos C (2003) LOCI: Fast outlier detection using the local correlation integral. ICDE, March 2003, pp 315–326
Pelleg D, Moore AW (2000) X-means: extending k-means with efficient estimation of the number of clusters. In: ICML ’00: Proceedings of the seventeenth international conference on machine learning, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc, pp 727–734
Quinlan J (1993) C4.5: Programs for machine learning. Morgan Kauffman
Rabiner L, Juang B-H (1993) Fundamentals of speech recognition. Prentice Hall
Ramakrishnan R, Gehrke J (2002) Database management systems, 3rd edn. McGraw-Hill
Schroeder M (1991) Fractals, chaos, power laws: minutes from an infinite paradise. W.H. Freeman and Company, New York
Yi B-K, Jagadish HV, Faloutsos C (1998) Efficient retrieval of similar time sequences under time warping. ICDE 98, Feb. 23–27
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD, may 1996, pp 103–114
Zhu L, Rao A, Zhang A (2002) Theory of keyblock-based image retrieval. ACM Trans Inform Syst 20(2):224–257
Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate encoding. IEEE Trans Inform Theory IT-24:530–536
Zobel J, Moffat A, Sacks-Davis R (1992) An efficient indexing technique for full-text database systems. VLDB, Aug. 23–27, pp 352–362
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Geoffrey Webb.
Rights and permissions
About this article
Cite this article
Faloutsos, C., Megalooikonomou, V. On data mining, compression, and Kolmogorov complexity. Data Min Knowl Disc 15, 3–20 (2007). https://doi.org/10.1007/s10618-006-0057-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-006-0057-3