Distribution-based aggregation for relational learning with identifier attributes

Perlich, Claudia; Provost, Foster

doi:10.1007/s10994-006-6064-1

Published: 27 January 2006

Machine Learning, Volume 62, pages 65–105 (2006)

Claudia Perlich¹ &
Foster Provost²

An Erratum to this article was published on 01 May 2006

Abstract

Identifier attributes (very high-dimensional categorical attributes such as particular product ids or people's names) are rarely incorporated in statistical modeling. However, they can play an important role in relational modeling: it may be informative to have communicated with a particular set of people or to have purchased a particular set of products. A key limitation of existing relational modeling techniques is how they aggregate bags (multisets) of values from related entities. The aggregations used by existing methods are simple summaries of the distributions of features of related entities: e.g., MEAN, MODE, SUM, or COUNT. This paper's main contribution is the introduction of aggregation operators that capture more information about the value distributions, by storing meta-data about value distributions and referencing this meta-data when aggregating, for example by computing class-conditional distributional distances. Such aggregations are particularly important for aggregating values from high-dimensional categorical attributes, for which the simple aggregates provide little information. In the first half of the paper we provide general guidelines for designing aggregation operators, introduce the new aggregators in the context of the relational learning system ACORA (Automated Construction of Relational Attributes), and provide theoretical justification. We also conjecture special properties of identifier attributes, e.g., that they proxy for unobserved attributes and for information deeper in the relationship network. In the second half of the paper we provide extensive empirical evidence that the distribution-based aggregators do facilitate modeling with high-dimensional categorical attributes, as well as evidence in support of the aforementioned conjectures.
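
To make the idea of a class-conditional distributional distance concrete, the following is a minimal sketch and not the ACORA implementation. It assumes each target entity comes with a bag of identifier values (here, hypothetical product ids), that one reference distribution per class is estimated from labeled training bags, and that Euclidean distance is the distance measure; the function names, the toy data, and the choice of distance are all illustrative assumptions.

```python
from collections import Counter
import math


def normalize(bag):
    """Turn a bag (multiset) of identifier values into a relative-frequency dict."""
    if not bag:
        return {}
    counter = Counter(bag)
    total = sum(counter.values())
    return {value: n / total for value, n in counter.items()}


def class_conditional_references(bags, labels):
    """Pool the training bags of each class and store one reference
    distribution per class (the 'meta-data' consulted at aggregation time)."""
    pooled = {}
    for bag, label in zip(bags, labels):
        pooled.setdefault(label, []).extend(bag)
    return {label: normalize(values) for label, values in pooled.items()}


def euclidean(p, q):
    """Euclidean distance between two sparse distributions stored as dicts."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def distance_features(bag, refs):
    """Aggregate one bag into a numeric feature per class: the distance
    between the bag's own distribution and that class's reference."""
    p = normalize(bag)
    return {label: euclidean(p, ref) for label, ref in refs.items()}


# Hypothetical toy data: bags of purchased product ids with binary labels.
train_bags = [["p1", "p2", "p1"], ["p1", "p3"], ["p4", "p5"], ["p5", "p5", "p6"]]
train_labels = [1, 1, 0, 0]

refs = class_conditional_references(train_bags, train_labels)
features = distance_features(["p1", "p2"], refs)
print(features)  # one distance per class; the smaller value marks the closer class
```

Unlike MODE or COUNT, the resulting features stay informative even when almost every identifier value is rare, because each bag is compared against a whole distribution rather than reduced to a single summary value. In this sketch the reference distributions should be estimated only from training data, since they are built from the class labels.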



Author information

Authors and Affiliations

IBM T.J. Watson Research Center, PO Box 704, Yorktown Heights, NY, 10598
Claudia Perlich
New York University, New York, NY
Foster Provost

Corresponding author

Correspondence to Claudia Perlich.

Additional information

Editors: Hendrik Blockeel, David Jensen and Stefan Kramer

An erratum to this article is available at http://dx.doi.org/10.1007/s10994-006-8633-8.

About this article

Cite this article

Perlich, C., Provost, F. Distribution-based aggregation for relational learning with identifier attributes. Mach Learn 62, 65–105 (2006). https://doi.org/10.1007/s10994-006-6064-1

Received: 11 August 2004
Revised: 22 February 2005
Accepted: 05 July 2005
Published: 27 January 2006
Issue Date: February 2006
DOI: https://doi.org/10.1007/s10994-006-6064-1
