Abstract
Many domains in the field of Inductive Logic Programming (ILP) involve highly unbalanced data. Performance in these domains is commonly measured with precision and recall rather than accuracy alone. The goal of our research is to find new approaches within ILP that are particularly suited to large, highly skewed domains. We propose Gleaner, a randomized search method that collects good clauses from a broad spectrum of points along the recall dimension in recall-precision curves and employs an “at least L of these K clauses” thresholding method to combine sets of selected clauses. Our research focuses on Multi-Slot Information Extraction (IE), a task that typically involves many more negative examples than positive examples. We cast this problem as a relational domain, using two large testbeds that involve extracting important relations from the abstracts of biomedical journal articles. We compare Gleaner to ensembles of standard theories learned by Aleph, finding that Gleaner produces comparable test-set results in a fraction of the training time.
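The “at least L of these K clauses” thresholding can be illustrated with a minimal Python sketch. It is not Gleaner itself: the clauses here are hypothetical boolean tests over toy feature dictionaries rather than learned first-order clauses, and the names `at_least_l_of_k`, `precision_recall`, and `pr_curve` are invented for illustration. The convention of reporting precision as 1.0 when nothing is predicted positive is also an assumption.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in for a learned clause: a boolean test on one example.
Clause = Callable[[Dict[str, bool]], bool]


def at_least_l_of_k(clauses: Sequence[Clause], example: Dict[str, bool], l: int) -> bool:
    """Label the example positive when at least l of the clauses cover it."""
    matches = sum(1 for clause in clauses if clause(example))
    return matches >= l


def precision_recall(predictions: Sequence[bool], labels: Sequence[bool]) -> Tuple[float, float]:
    """Compute precision and recall from parallel prediction and label lists."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    # Assumed convention: precision is 1.0 when no example is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(clauses: Sequence[Clause],
             examples: Sequence[Dict[str, bool]],
             labels: Sequence[bool]) -> List[Tuple[int, float, float]]:
    """Sweep l from 1 to K and record one (l, precision, recall) point per threshold."""
    points = []
    for l in range(1, len(clauses) + 1):
        predictions = [at_least_l_of_k(clauses, ex, l) for ex in examples]
        precision, recall = precision_recall(predictions, labels)
        points.append((l, precision, recall))
    return points


# Toy data: two positives and three negatives, with three hypothetical clauses.
examples = [
    {"a": True, "b": True, "c": True},
    {"a": True, "b": True, "c": False},
    {"a": True, "b": False, "c": False},
    {"a": False, "b": False, "c": True},
    {"a": False, "b": False, "c": False},
]
labels = [True, True, False, False, False]
clauses = [lambda e: e["a"], lambda e: e["b"], lambda e: e["c"]]

curve = pr_curve(clauses, examples, labels)
for l, precision, recall in curve:
    print(f"L={l}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, raising L makes the combined classifier stricter: L=1 covers every positive but admits two negatives, while L=3 admits no negatives but misses one positive. Sweeping L therefore traces out different points in recall-precision space from a single set of K clauses.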
Editor: Rui Camacho
Cite this article
Goadrich, M., Oliphant, L. & Shavlik, J. Gleaner: Creating ensembles of first-order clauses to improve recall-precision curves. Mach Learn 64, 231–261 (2006). https://doi.org/10.1007/s10994-006-8958-3
DOI: https://doi.org/10.1007/s10994-006-8958-3