Abstract
Instance selection in supervised machine learning, often referred to as data reduction, aims at deciding which instances from the training set should be retained for further use during the learning process. Instance selection can improve the generalization properties of the learning model, shorten the learning process, and help in scaling up to large data sources. This paper proposes a cluster-based instance selection approach in which the learning process is executed by a team of agents, and discusses its four variants. The basic assumption is that instance selection is carried out after the training data have been grouped into clusters. To validate the proposed approach and to investigate the influence of the clustering method on classification quality, a computational experiment has been carried out.
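The core idea — group the training data into clusters, then select representative instances within each cluster — can be sketched as follows. This is an illustrative reconstruction, not the paper's algorithm: it uses plain k-means and keeps only the instance nearest each centroid, whereas the paper studies several clustering methods and an agent-based (A-Team) selection process.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every instance to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def select_instances(X, k=3):
    """Cluster-based reduction: keep, per cluster, the instance
    closest to the cluster centroid."""
    labels, centroids = kmeans(X, k)
    kept = []
    for j in range(k):
        members = np.where(labels == j)[0]
        if len(members) == 0:
            continue
        d = np.linalg.norm(X[members] - centroids[j], axis=1)
        kept.append(int(members[d.argmin()]))
    return sorted(kept)

# Toy training set with three well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [5.0, 5.0], [5.1, 5.0],
              [9.0, 0.0]])
kept = select_instances(X, k=3)
print(kept)  # a small subset of instance indices, at most one per cluster
```

In the paper's setting the reduced set would then be handed to a classifier trained by the agent team; here the selection rule (nearest-to-centroid) is only one of many possible prototype choices.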
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Czarnowski, I. Cluster-based instance selection for machine classification. Knowl Inf Syst 30, 113–133 (2012). https://doi.org/10.1007/s10115-010-0375-z