Abstract
Instance selection in supervised machine learning, often referred to as data reduction, aims at deciding which instances from the training set should be retained for further use during the learning process. Instance selection can improve the generalization properties of the learning model, shorten the learning process, and help in scaling up to large data sources. This paper proposes a cluster-based instance selection approach in which the learning process is executed by a team of agents, and discusses its four variants. The basic assumption is that instance selection is carried out after the training data have been grouped into clusters. To validate the proposed approach and to investigate the influence of the clustering method on classification quality, a computational experiment has been carried out.
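The core idea — group the training data into clusters, then select representative instances within each cluster — can be sketched as follows. This is an illustrative reconstruction, not the paper's algorithm: it uses plain k-means and keeps only the instance nearest each centroid, whereas the paper studies several clustering methods and an agent-based (A-Team) selection process.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every instance to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def select_instances(X, k=3):
    """Cluster-based reduction: keep, per cluster, the instance
    closest to the cluster centroid."""
    labels, centroids = kmeans(X, k)
    kept = []
    for j in range(k):
        members = np.where(labels == j)[0]
        if len(members) == 0:
            continue
        d = np.linalg.norm(X[members] - centroids[j], axis=1)
        kept.append(int(members[d.argmin()]))
    return sorted(kept)

# Toy training set with three well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [5.0, 5.0], [5.1, 5.0],
              [9.0, 0.0]])
kept = select_instances(X, k=3)
print(kept)  # a small subset of instance indices, at most one per cluster
```

In the paper's setting the reduced set would then be handed to a classifier trained by the agent team; here the selection rule (nearest-to-centroid) is only one of many possible prototype choices.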
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Czarnowski, I. Cluster-based instance selection for machine classification. Knowl Inf Syst 30, 113–133 (2012). https://doi.org/10.1007/s10115-010-0375-z