Abstract
Identifier attributes (very high-dimensional categorical attributes such as particular product ids or people's names) are rarely incorporated in statistical modeling. However, they can play an important role in relational modeling: it may be informative to have communicated with a particular set of people or to have purchased a particular set of products. A key limitation of existing relational modeling techniques is how they aggregate bags (multisets) of values from related entities. The aggregations used by existing methods are simple summaries of the distributions of features of related entities, e.g., MEAN, MODE, SUM, or COUNT. This paper's main contribution is the introduction of aggregation operators that capture more information about the value distributions by storing meta-data about value distributions and referencing this meta-data when aggregating, for example by computing class-conditional distributional distances. Such aggregations are particularly important for aggregating values from high-dimensional categorical attributes, for which the simple aggregates provide little information. In the first half of the paper we provide general guidelines for designing aggregation operators, introduce the new aggregators in the context of the relational learning system ACORA (Automated Construction of Relational Attributes), and provide theoretical justification. We also conjecture special properties of identifier attributes, e.g., that they proxy for unobserved attributes and for information deeper in the relationship network. In the second half of the paper we provide extensive empirical evidence that the distribution-based aggregators do indeed facilitate modeling with high-dimensional categorical attributes, and evidence in support of the aforementioned conjectures.
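To make the contrast between simple aggregates and distribution-based aggregates concrete, the following is a minimal Python sketch, not ACORA's implementation. It assumes each target entity has a bag of identifier values (hypothetical product ids), that the stored meta-data are per-class relative frequencies of those values in the training bags, and that Euclidean distance is the distributional distance. The function names and the toy data are illustrative assumptions only.

```python
from collections import Counter
from math import sqrt


def mode_and_count(bag):
    """Simple aggregates of a bag: its most common value and its size."""
    counts = Counter(bag)
    return counts.most_common(1)[0][0], len(bag)


def class_conditional_reference(bags, labels):
    """Stored meta-data (assumed form): for each class, the relative
    frequency of every value over all training bags of that class."""
    totals = {}
    for bag, label in zip(bags, labels):
        totals.setdefault(label, Counter()).update(bag)
    refs = {}
    for label, counts in totals.items():
        n = sum(counts.values())
        refs[label] = {value: c / n for value, c in counts.items()}
    return refs


def euclidean_distance(bag, reference):
    """Distance between a bag's value distribution and a reference one.
    Euclidean distance is an assumption; other distances would also fit."""
    counts = Counter(bag)
    n = len(bag)
    keys = set(counts) | set(reference)
    return sqrt(sum(
        ((counts.get(k, 0) / n if n else 0.0) - reference.get(k, 0.0)) ** 2
        for k in keys
    ))


def distance_features(bag, refs):
    """One numeric feature per class: distance to that class's reference."""
    return {label: euclidean_distance(bag, ref) for label, ref in refs.items()}


# Toy training data: bags of purchased product ids with binary class labels.
train_bags = [["p1", "p2"], ["p1", "p3"], ["p7", "p8"], ["p8", "p9"]]
train_labels = [1, 1, 0, 0]
refs = class_conditional_reference(train_bags, train_labels)

new_bag = ["p1", "p2", "p3"]
simple = mode_and_count(new_bag)          # says little about the class
features = distance_features(new_bag, refs)
```

In this toy setting the MODE of the new bag is an arbitrary identifier and COUNT is just the bag size, whereas the two distance features place the bag much closer to the class-1 reference distribution than to the class-0 one, and can be passed to any propositional learner as ordinary numeric attributes.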
Additional information
Editors: Hendrik Blockeel, David Jensen and Stefan Kramer
An erratum to this article is available at http://dx.doi.org/10.1007/s10994-006-8633-8.
About this article
Cite this article
Perlich, C., Provost, F. Distribution-based aggregation for relational learning with identifier attributes. Mach Learn 62, 65–105 (2006). https://doi.org/10.1007/s10994-006-6064-1
DOI: https://doi.org/10.1007/s10994-006-6064-1