Abstract
The standard approach to feature construction and predictive learning in molecular datasets is to employ computationally expensive graph mining techniques and to bias the feature search exploration using frequency or correlation measures. These features are then typically employed in predictive models that can be constructed using, for example, SVMs or decision trees. We take a different approach: rather than mining for all optimal local patterns, we extract features from the set of pairwise maximum common subgraphs. The maximum common subgraphs are computed under the block-and-bridge-preserving subgraph isomorphism from the outerplanar examples in polynomial time. We empirically observe a significant increase in predictive performance when using maximum common subgraph features instead of correlated local patterns on 60 benchmark datasets from NCI. Moreover, we show that when we randomly sample the pairs of graphs from which to extract the maximum common subgraphs, we obtain a smaller set of features that still allows the same predictive performance as methods that exhaustively enumerate all possible patterns. The sampling strategy turns out to be a very good compromise between a slight decrease in predictive performance (although still remaining comparable with state-of-the-art methods) and a significant runtime reduction (two orders of magnitude on a popular medium size chemoinformatics dataset). This suggests that maximum common subgraphs are interesting and meaningful features.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Ben-David, S., Eiron, N., & Simon, H. U. (2002). Limitations of learning via embeddings in Euclidean half spaces. Journal of Machine Learning Research, 3, 441–461.
Bringmann, B., Zimmermann, A., Raedt, L. D., & Nijssen, S. (2006). Don’t be afraid of simpler patterns. In Proceedings of the tenth European conference on principles and practice of knowledge discovery in databases (pp. 55–66).
Bunke, H., & Shearer, K. (1998). A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19, 255–259.
Ceroni, A., Costa, F., & Frasconi, P. (2007). Classification of small molecules by two- and three-dimensional decomposition kernels. Bioinformatics, 23(16), 2038–2045.
Chaoji, V., Al Hasan, M., Salem, S., Besson, J., & Zaki, J. M. (2008). Origami: a novel and effective approach for mining representative orthogonal graph patterns. Statistical Analysis and Data Mining, 1(2), 67–84.
De Raedt, L. (2008). Logical and relational learning. Berlin: Springer.
De Raedt, L., & Ramon, J. (2009). Deriving distance metrics from generality relations. Pattern Recognition Letters, 30(3), 187–191.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Deshpande, M., Kuramochi, M., Wale, N., & Karypis, G. (2005). Frequent substructure-based approaches for classifying chemical compounds. IEEE Transactions on Knowledge and Data Engineering, 17(8), 1036–1050.
Diestel, R. (2000). Graph theory. Berlin: Springer.
Garey, M. R., & Johnson, D. (1979). Computers and intractability: a guide to the theory of NP-completeness. New York: Freeman.
Gärtner, T. (2005). Kernels for structured data. PhD thesis, University of Bonn, Germany.
Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103–123.
He, H., & Singh, A. K. (2006). Graphrank: statistical modeling and mining of significant subgraphs in the feature space. In ICDM ’06: proceedings of the sixth international conference on data mining, Washington, DC, USA (pp. 885–890). Las Alamitos: IEEE Comput. Soc.
Horváth, T., Gärtner, T., & Wrobel, S. (2004). Cyclic pattern kernels for predictive graph mining. In KDD ’04: proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 158–167).
Horváth, T., Ramon, J., & Wrobel, S. (2006). Frequent subgraph mining in outerplanar graphs. In Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia, PA, August 2006, pp. 197–206.
Joachims, T. (2002). Learning to classify text using support vector machines: methods, theory, and algorithms. Berlin: Springer.
Karunaratne, T., & Boström, H. (2006). Learning to classify structured data by graph propositionalization. In Proceedings of the second IASTED international conference on computational intelligence (pp. 393–398).
Kramer, S., De Raedt, L., & Helma, C. (2001). Molecular feature mining in HIV data. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, KDD-01 (pp. 136–143). New York: ACM.
Kramer, S., Lavrač, N., & Flach, P. (2001). Propositionalization approaches to relational data mining. In S. Džeroski & N. Lavrač (Eds.), Relational data mining (pp. 262–291). Berlin: Springer.
Munkres, J. (1957). Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1), 32–38.
Plotkin, G. (1971). A further note on inductive generalization. In Machine intelligence (Vol. 6, pp. 101–124). Edinburgh: Edinburgh University Press.
Provost, F., & Fawcett, T. (1998). Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In Proceedings of the third international conference on knowledge discovery and data mining (pp. 43–48). Menlo Park: AAAI Press.
Raymond, J., & Willett, P. (2002). Maximum common subgraph isomorphism algorithms for the matching of chemical structures. Journal of Computer-Aided Molecular Design, 16, 521–533.
Schietgat, L., Ramon, J., Bruynooghe, M., & Blockeel, H. (2008). An efficiently computable graph-based metric for the classification of small molecules. In Lecture notes in artificial intelligence : Vol. 5255. Proceedings of the eleventh international conference on discovery science (pp. 197–209). Berlin: Springer.
Sebag, M. (1997). Distance induction in first order logic. In N. Lavrač & S. Džeroski (Eds.), Lecture notes in artificial intelligence : Vol. 1297. Proceedings of the seventh international workshop on inductive logic programming (pp. 264–272). Berlin: Springer.
Swamidass, S. J., Chen, J., Bruand, J., Phung, P., Ralaivola, L., & Baldi, P. (2005). Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics, 21(suppl_1), 359–368.
Wale, N., Watson, I., & Karypis, G. (2008). Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14, 347–375.
Watanabe, S. (1960). Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1), 66–82.
Willett, P. (2006). Similarity-based virtual screening using 2D fingerprints. Drug Discovery Today, 11(23/24), 1046–1051.
Yan, X., & Han, J. (2002). gSpan: Graph-based substructure pattern mining. In Proceedings of the 2002 IEEE international conference on data mining, ICDM 2002, Japan (pp. 721–724). Las Alamitos: IEEE Comput. Soc.
Author information
Authors and Affiliations
Corresponding author
Additional information
Editors: Hendrik Blockeel, Karsten Borgwardt, and Xifeng Yan.
Rights and permissions
About this article
Cite this article
Schietgat, L., Costa, F., Ramon, J. et al. Effective feature construction by maximum common subgraph sampling. Mach Learn 83, 137–161 (2011). https://doi.org/10.1007/s10994-010-5193-8
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-010-5193-8