Abstract
Multirelational classification aims to discover useful patterns across multiple interconnected tables (relations) in a relational database. Many traditional learning techniques, the so-called propositional algorithms, assume a single table or flat file as input. Existing multirelational classification approaches either “upgrade” mature propositional learning methods to handle relational representations or extensively “flatten” multiple tables into a single flat file, which is then processed by propositional algorithms. This article reports a multiple view strategy for mining relational databases that requires neither “upgrading” nor “flattening”. Our approach learns from multiple views (feature sets) of a relational database and then integrates the information acquired by the individual view learners to construct a final model. Our empirical studies show that the method compares favorably with the classifiers induced by the majority of multirelational mining systems in terms of both accuracy and running time. The paper explores the implications of this finding for multirelational research and applications. In addition, the method has practical significance: it is appropriate for directly mining many real-world databases.
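To make the idea of learning from several views and then integrating the view learners more concrete, the following is a minimal sketch, not the method evaluated in the article. The toy relations (`customers`, `orders`), the frequency-table learner, and the averaging rule used to combine views are all assumptions chosen for illustration; the abstract does not specify how views are constructed or how their outputs are integrated.

```python
from collections import Counter, defaultdict

# Hypothetical toy database: a target relation and one related relation
# joined by a foreign key. All names and values are illustrative only.
customers = [  # (customer_id, age_band, label)
    (1, "young", "good"),
    (2, "young", "bad"),
    (3, "old", "good"),
    (4, "old", "good"),
]
orders = [  # (order_id, customer_id, status)
    (10, 1, "paid"), (11, 1, "paid"),
    (12, 2, "late"), (13, 2, "late"),
    (14, 3, "paid"),
    (15, 4, "late"), (16, 4, "paid"),
]

def train_view(rows):
    """Fit a frequency table: feature value -> Counter of class labels."""
    table = defaultdict(Counter)
    for value, label in rows:
        table[value][label] += 1
    return table

def view_distribution(table, values, labels):
    """Average the label frequencies over a bag of feature values."""
    dist = {label: 0.0 for label in labels}
    seen = [v for v in values if v in table]  # ignore unseen values
    for v in seen:
        total = sum(table[v].values())
        for label in labels:
            dist[label] += table[v][label] / total / len(seen)
    return dist

# Each view is one relation's features with the class label propagated
# along the foreign key; no tables are flattened into a single file.
label_of = {cid: label for cid, _, label in customers}
labels = sorted(set(label_of.values()))
view_customers = train_view((age, label) for _, age, label in customers)
view_orders = train_view((status, label_of[cid]) for _, cid, status in orders)

def predict(age_band, order_statuses):
    """Combine the two view learners by averaging (an assumed rule)."""
    d1 = view_distribution(view_customers, [age_band], labels)
    d2 = view_distribution(view_orders, order_statuses, labels)
    combined = {label: (d1[label] + d2[label]) / 2 for label in labels}
    return max(labels, key=combined.get)

print(predict("young", ["late"]))
print(predict("old", ["paid", "paid"]))
```

Averaging is only one possible integration step; a meta-learner trained on the view learners' outputs would fit the same structure by replacing the body of `predict`.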
Cite this article
Guo, H., Viktor, H.L. Multirelational classification: a multiple view approach. Knowl Inf Syst 17, 287–312 (2008). https://doi.org/10.1007/s10115-008-0127-5
DOI: https://doi.org/10.1007/s10115-008-0127-5