Abstract
Instance selection aims to find a representative portion of data that serves the same purpose as the whole dataset. In this chapter we propose a novel procedure for instance selection based on hypertuples, a generalization of traditional database tuples. The procedure has two tasks: building a model and selecting instances based on the model. For the first task, we propose merging data tuples while ensuring that certain criteria are satisfied. This merge operation results in a set of hypertuples which, under certain conditions, serves as a model of the original data. We identify two types of criteria for instance selection: preserving classification structure and maximizing the density of hypertuples. For the first criterion we propose a formalism that leads to a unique solution, the least E-set, together with algorithms for finding this unique solution exactly and for finding a compromise solution efficiently. For the second criterion we propose a new measure of density that is normalized and quantized, and that applies to both numerical and categorical data. Using this measure of density, we then propose a hill-climbing algorithm that can efficiently find a quasi-optimal ("quasi-densest") set of hypertuples.
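To illustrate the model-building task, the sketch below merges data tuples into hypertuples, representing each attribute as an interval (numeric) or a value set (categorical). It uses a simple greedy loop that accepts a merge only when the resulting hypertuple covers no instance of another class, which is one way to preserve classification structure; the chapter's actual E-set formalism, density measure, and hill-climbing search are more refined, so all names and the acceptance test here are illustrative assumptions.

```python
from itertools import combinations

def lift(x):
    """Turn a plain data tuple into a degenerate hypertuple:
    a point interval per numeric value, a singleton set per category."""
    return tuple((v, v) if isinstance(v, (int, float)) else frozenset([v])
                 for v in x)

def merge(h1, h2):
    """Merge two hypertuples attribute-wise: interval hull for numeric
    attributes, set union for categorical ones."""
    out = []
    for a, b in zip(h1, h2):
        if isinstance(a, tuple):            # numeric: (lo, hi) interval
            out.append((min(a[0], b[0]), max(a[1], b[1])))
        else:                               # categorical: frozenset
            out.append(a | b)
    return tuple(out)

def covers(h, x):
    """True if plain data tuple x falls inside hypertuple h."""
    for a, v in zip(h, x):
        if isinstance(a, tuple):
            if not (a[0] <= v <= a[1]):
                return False
        elif v not in a:
            return False
    return True

def build_model(data, labels):
    """Greedily merge same-class hypertuples as long as the merged
    hypertuple covers no instance of any other class (an illustrative
    stand-in for the chapter's criteria)."""
    hts = [(lift(x), y) for x, y in zip(data, labels)]
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(hts)), 2):
            (hi, yi), (hj, yj) = hts[i], hts[j]
            if yi != yj:
                continue
            m = merge(hi, hj)
            # reject merges that swallow instances of another class
            if any(covers(m, x) for x, y in zip(data, labels) if y != yi):
                continue
            hts[i] = (m, yi)
            del hts[j]
            changed = True
            break
    return hts

data = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
labels = ['a', 'a', 'a', 'b', 'b']
model = build_model(data, labels)
```

On this toy one-attribute dataset the five tuples collapse into two hypertuples, one interval per class, so the model is far smaller than the data while keeping the classes separated.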
Having a model of the data, we can generate a set of representative instances, the second task of the procedure. We propose calculating the centers of the hypertuples in the model and taking these centers as the representative instances of the original data. To use the selected instances for classification, we adopt a nearest neighbor (NN) approach.
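The selection step can be sketched as follows: take the center of each hypertuple and classify new instances by 1-NN against those centers. The midpoint rule for intervals and the arbitrary-member rule for categorical sets are assumptions for illustration; the chapter's exact center definition and distance may differ.

```python
# Illustrative hypertuples (intervals for numeric attributes, frozensets
# for categorical ones), each paired with its class label, e.g. as
# produced by a merge procedure.
model = [((( 1.0,  3.0),), 'a'),
         (((10.0, 11.0),), 'b')]

def center(h):
    """Midpoint for a numeric interval; for a categorical value set,
    an arbitrary member (the chapter's exact rule may differ)."""
    return tuple((a[0] + a[1]) / 2 if isinstance(a, tuple) else next(iter(a))
                 for a in h)

def nn_classify(prototypes, x):
    """1-NN over the selected centers: absolute difference for numeric
    attributes, 0/1 mismatch for categorical ones."""
    def dist(c):
        return sum(abs(a - b) if isinstance(a, float) else float(a != b)
                   for a, b in zip(c, x))
    return min(prototypes, key=lambda p: dist(p[0]))[1]

# The centers serve as the representative instances of the data.
prototypes = [(center(h), y) for h, y in model]
```

For example, `nn_classify(prototypes, (2.5,))` returns `'a'` because 2.5 lies closest to the center 2.0 of the first hypertuple.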
Experiments on real-world public data show that, when used with the proposed NN classifier, the selected instances are not only representative but even outperform C5 in some cases.
Copyright information
© 2001 Springer Science+Business Media Dordrecht
Cite this chapter
Wang, H. (2001). Instance Selection Based on Hypertuples. In: Liu, H., Motoda, H. (eds) Instance Selection and Construction for Data Mining. The Springer International Series in Engineering and Computer Science, vol 608. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3359-4_14
DOI: https://doi.org/10.1007/978-1-4757-3359-4_14
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-4861-8
Online ISBN: 978-1-4757-3359-4