Effective feature construction by maximum common subgraph sampling

Schietgat, Leander; Costa, Fabrizio; Ramon, Jan; De Raedt, Luc

doi:10.1007/s10994-010-5193-8

Published: 20 July 2010

Machine Learning, volume 83, pages 137–161 (2011)

Abstract

The standard approach to feature construction and predictive learning in molecular datasets is to employ computationally expensive graph mining techniques and to bias the feature search exploration using frequency or correlation measures. These features are then typically employed in predictive models that can be constructed using, for example, SVMs or decision trees. We take a different approach: rather than mining for all optimal local patterns, we extract features from the set of pairwise maximum common subgraphs. The maximum common subgraphs are computed under the block-and-bridge-preserving subgraph isomorphism from the outerplanar examples in polynomial time. We empirically observe a significant increase in predictive performance when using maximum common subgraph features instead of correlated local patterns on 60 benchmark datasets from NCI. Moreover, we show that when we randomly sample the pairs of graphs from which to extract the maximum common subgraphs, we obtain a smaller set of features that still allows the same predictive performance as methods that exhaustively enumerate all possible patterns. The sampling strategy turns out to be a very good compromise between a slight decrease in predictive performance (although still remaining comparable with state-of-the-art methods) and a significant runtime reduction (two orders of magnitude on a popular medium-sized chemoinformatics dataset). This suggests that maximum common subgraphs are interesting and meaningful features.
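
The pair-sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the functions `toy_common_part` and `toy_contains` are hypothetical stand-ins that treat a graph as a set of labeled edges, so the "common subgraph" is just the shared edge labels and "containment" is set inclusion. A real pipeline would replace them with a maximum common subgraph routine and a matching test under the block-and-bridge-preserving subgraph isomorphism. All names, the toy data, and the `n_pairs` and `seed` parameters are assumptions made for illustration.

```python
import random
from itertools import combinations


def toy_common_part(g1, g2):
    """Hypothetical stand-in for an MCS routine: the shared labeled edges."""
    return frozenset(g1 & g2)


def toy_contains(graph, pattern):
    """Hypothetical stand-in for a subgraph test: edge-set inclusion."""
    return pattern <= graph


def sample_pair_features(graphs, n_pairs, common=toy_common_part, seed=0):
    """Sample graph pairs at random and collect their distinct common parts."""
    rng = random.Random(seed)
    all_pairs = list(combinations(range(len(graphs)), 2))
    chosen = rng.sample(all_pairs, min(n_pairs, len(all_pairs)))
    features, seen = [], set()
    for i, j in chosen:
        pattern = common(graphs[i], graphs[j])
        # Skip empty patterns and patterns already produced by another pair.
        if pattern and pattern not in seen:
            seen.add(pattern)
            features.append(pattern)
    return features


def to_feature_vectors(graphs, features, contains=toy_contains):
    """Binary representation: entry is 1 if the graph contains the feature."""
    return [[int(contains(g, f)) for f in features] for g in graphs]


# Toy "molecules", each given as a set of labeled bonds (illustrative only).
graphs = [
    frozenset({"C-C", "C=O", "C-N"}),
    frozenset({"C-C", "C=O"}),
    frozenset({"C-N", "N-H"}),
]
features = sample_pair_features(graphs, n_pairs=3)
vectors = to_feature_vectors(graphs, features)
```

The resulting binary vectors could then be passed to any standard learner, such as the SVMs or decision trees mentioned above. Lowering `n_pairs` trades feature coverage for runtime, which is the compromise the abstract reports on.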

Author information

Authors and Affiliations

Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001, Leuven, Belgium
Leander Schietgat, Fabrizio Costa, Jan Ramon & Luc De Raedt

Corresponding author

Correspondence to Leander Schietgat.

Additional information

Editors: Hendrik Blockeel, Karsten Borgwardt, and Xifeng Yan.

About this article

Cite this article

Schietgat, L., Costa, F., Ramon, J. et al. Effective feature construction by maximum common subgraph sampling. Mach Learn 83, 137–161 (2011). https://doi.org/10.1007/s10994-010-5193-8

Received: 20 September 2009
Revised: 15 February 2010
Accepted: 12 May 2010
Published: 20 July 2010
Issue Date: May 2011
DOI: https://doi.org/10.1007/s10994-010-5193-8
