Summary
Using graphs with unique node labels reduces the maximum common subgraph problem, which is NP-complete in general, to one that can be solved in polynomial time. The maximum common subgraph is useful for defining a graph distance measure: the larger the maximum common subgraph of two graphs, the more similar the graphs are and the smaller the distance between them, and vice versa. With a computationally practical method of determining distances between graphs, we are no longer limited to simpler vector representations in machine learning applications. We can apply well-known algorithms, such as k-means clustering and k-nearest neighbors classification, directly to data represented by graphs, losing none of the inherent structural information. We demonstrate the benefits of the additional information retained in a graph-based data model for web content mining applications. We introduce several graph representations for capturing web document information and present examples of our experimental results, which compare favorably with traditional vector methods.
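As a rough illustration of why unique node labels make the computation cheap: when every label appears at most once per graph, the maximum common subgraph can be obtained by intersecting the node sets and the edge sets, with no search over node correspondences. The sketch below is not the chapter's implementation. The class and function names are hypothetical, and two choices are assumptions that the summary does not state: graph size is taken as the number of nodes plus the number of edges, and the distance is normalized by the size of the larger graph.

```python
from collections import Counter


class LabeledGraph:
    """Directed graph whose nodes are identified by unique labels (e.g. terms)."""

    def __init__(self, nodes, edges):
        self.nodes = frozenset(nodes)
        self.edges = frozenset(edges)  # (source_label, target_label) pairs
        for src, dst in self.edges:
            if src not in self.nodes or dst not in self.nodes:
                raise ValueError(f"edge ({src!r}, {dst!r}) uses an unknown node")

    def size(self):
        # Assumption: graph size counts nodes plus edges.
        return len(self.nodes) + len(self.edges)


def max_common_subgraph(g1, g2):
    # With unique labels, a shared label identifies the same node in both
    # graphs, so set intersection is enough.
    return LabeledGraph(g1.nodes & g2.nodes, g1.edges & g2.edges)


def mcs_distance(g1, g2):
    # Assumption: distance = 1 - |mcs| / max(|g1|, |g2|).
    largest = max(g1.size(), g2.size())
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - max_common_subgraph(g1, g2).size() / largest


def knn_classify(query, training, k=1):
    """Majority vote among the k training graphs closest to `query`.

    `training` is a list of (graph, class_label) pairs.
    """
    nearest = sorted(training, key=lambda item: mcs_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy documents: nodes are terms, edges link a term to the term that follows it.
doc_a = LabeledGraph({"graph", "distance", "web"},
                     {("graph", "distance"), ("distance", "web")})
doc_b = LabeledGraph({"graph", "distance", "mining"},
                     {("graph", "distance"), ("distance", "mining")})
doc_c = LabeledGraph({"stock", "market"}, {("stock", "market")})

training = [(doc_a, "computing"), (doc_c, "finance")]
predicted = knn_classify(doc_b, training, k=1)
```

Here `doc_a` and `doc_b` share two nodes and one edge, so under the stated assumptions their distance is 1 - 3/5, while `doc_b` and `doc_c` share nothing and are at distance 1. The nearest-neighbor vote therefore assigns `doc_b` to the class of `doc_a`. With hash-based sets, each distance computation takes time roughly linear in the number of nodes and edges.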
Copyright information
© 2006 Springer Verlag London Limited
About this chapter
Cite this chapter
Schenker, A., Bunke, H., Last, M., Kandel, A. (2006). Polynomial Time Complexity Graph Distance Computation for Web Content Mining. In: Basu, M., Ho, T.K. (eds) Data Complexity in Pattern Recognition. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/978-1-84628-172-3_10
DOI: https://doi.org/10.1007/978-1-84628-172-3_10
Publisher Name: Springer, London
Print ISBN: 978-1-84628-171-6
Online ISBN: 978-1-84628-172-3
eBook Packages: Computer Science (R0)