Polynomial Time Complexity Graph Distance Computation for Web Content Mining

Schenker, Adam; Bunke, Horst; Last, Mark; Kandel, Abraham

doi:10.1007/978-1-84628-172-3_10

Part of the book series: Advanced Information and Knowledge Processing (AI&KP)

Summary

Using graphs with unique node labels reduces the complexity of the maximum common subgraph problem, which is NP-complete in general, to polynomial time. Calculating the maximum common subgraph is useful for creating a graph distance measure, since graphs become more similar (and thus have less distance between them) as their maximum common subgraph becomes larger, and vice versa. With a computationally practical method of determining distances between graphs, we are no longer limited to using simpler vector representations for machine learning applications. We can run well-known algorithms, such as k-means clustering and k-nearest neighbors classification, directly on data represented by graphs, losing none of the inherent structural information. We demonstrate the benefits of the additional information retained in a graph-based data model for web content mining applications. We introduce several graph representations for capturing web document information and present some examples of our experimental results, which compare favorably with traditional vector methods.
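To make the idea in the summary concrete, here is a minimal Python sketch; it is an illustration, not the chapter's implementation. It assumes that each node label (for example, a term in a web document) appears at most once per graph, so the maximum common subgraph reduces to intersecting the node sets and the edge sets. It also assumes the normalized distance 1 - |mcs(G1, G2)| / max(|G1|, |G2|) with |G| counted as nodes plus edges; that size convention, and the helper names `make_graph`, `mcs`, `mcs_distance`, and `knn_classify`, are hypothetical choices made for this example.

```python
from collections import Counter


def make_graph(edges, nodes=()):
    """Build a graph as (node_set, edge_set). Node labels are assumed unique."""
    edge_set = frozenset((u, v) for u, v in edges)  # directed edges
    node_set = frozenset(nodes) | {n for e in edge_set for n in e}
    return node_set, edge_set


def mcs(g1, g2):
    """Maximum common subgraph under unique node labels.

    Because a label identifies a node, no search over node mappings is
    needed: the result is the intersection of node sets and edge sets.
    """
    return g1[0] & g2[0], g1[1] & g2[1]


def size(g):
    """Graph size, assumed here to be number of nodes plus number of edges."""
    return len(g[0]) + len(g[1])


def mcs_distance(g1, g2):
    """Distance in [0, 1]: 0 for identical graphs, 1 for nothing in common."""
    denom = max(size(g1), size(g2))
    if denom == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - size(mcs(g1, g2)) / denom


def knn_classify(query, labeled_graphs, k=3):
    """k-nearest neighbors vote over (graph, label) pairs using mcs_distance."""
    nearest = sorted(labeled_graphs, key=lambda item: mcs_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two small term graphs that share two nodes and one edge.
doc_a = make_graph([("graph", "distance"), ("distance", "web")])
doc_b = make_graph([("graph", "distance"), ("distance", "mining")])
print(mcs_distance(doc_a, doc_b))
```

With hash-based sets, each intersection takes time roughly linear in the size of the smaller set, which is where the polynomial bound comes from in this sketch. The classifier only needs a pairwise distance, so it works on the graphs directly without first converting them to vectors.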

Author information

Authors and Affiliations

University of South Florida, Tampa, FL, 33620, USA
Adam Schenker & Abraham Kandel
University of Bern, CH-3012, Bern, Switzerland
Horst Bunke
Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
Mark Last

Editor information

Editors and Affiliations

Electrical Engineering Department, City College, City University of New York, USA
Mitra Basu PhD
Bell Laboratories, Lucent Technologies, New Jersey, USA
Tin Kam Ho BBA, MS, PhD

About this chapter

Cite this chapter

Schenker, A., Bunke, H., Last, M., Kandel, A. (2006). Polynomial Time Complexity Graph Distance Computation for Web Content Mining. In: Basu, M., Ho, T.K. (eds) Data Complexity in Pattern Recognition. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/978-1-84628-172-3_10

DOI: https://doi.org/10.1007/978-1-84628-172-3_10
Publisher Name: Springer, London
Print ISBN: 978-1-84628-171-6
Online ISBN: 978-1-84628-172-3
eBook Packages: Computer Science, Computer Science (R0)
