Detecting Tables in HTML Documents

Wang, Yalin; Hu, Jianying

doi:10.1007/3-540-45869-7_29

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2423)

Included in the following conference series:

International Workshop on Document Analysis Systems

Abstract

The table is a commonly used presentation scheme for describing relational information. Table understanding on the web has many potential applications, including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although tables in HTML documents are generally marked as <table> elements, a <table> element does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is the detection of genuine tables. In our earlier work we designed a basic rule-based algorithm to detect genuine tables in major news and corporate home pages as part of a web content filtering system. In this paper we investigate a machine learning based approach that is trainable and thus can be automatically generalized to any domain. Various features reflecting the layout as well as the content characteristics of tables are explored. The system is tested on a large database consisting of 1,393 HTML files collected from hundreds of different websites in various domains and containing over 10,000 leaf <table> elements. Experiments were conducted using the cross validation method. The machine learning based approach outperformed the rule-based system and achieved an F-measure of 95.88%.
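As an illustration of two terms used in the abstract, the following minimal Python sketch counts leaf <table> elements and computes an F-measure. It is not the authors' implementation. It assumes that a "leaf" <table> is one containing no nested <table>, and that the F-measure is the usual harmonic mean of precision and recall; the class name `LeafTableCounter`, the function `f_measure`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LeafTableCounter(HTMLParser):
    """Count <table> elements and those with no nested <table> (assumed 'leaf')."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one flag per open <table>: has it seen a nested table?
        self.total = 0    # all <table> start tags seen
        self.leaves = 0   # closed tables that contained no nested table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                self.stack[-1] = True  # the enclosing table is not a leaf
            self.stack.append(False)
            self.total += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            has_nested = self.stack.pop()
            if not has_nested:
                self.leaves += 1


def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical markup: one layout table wrapping two inner tables.
html = """
<table><tr><td>
  <table><tr><td>a</td><td>b</td></tr></table>
  <table><tr><td>c</td></tr></table>
</td></tr></table>
"""
counter = LeafTableCounter()
counter.feed(html)
counter.close()
print(counter.total, counter.leaves)
print(f_measure(8, 2, 2))
```

In this sketch the outer table is treated as a layout wrapper and only the two inner tables count as leaves. Whether a leaf table is a genuine relational table is the classification problem the paper addresses, which this sketch does not attempt.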

Author information

Authors and Affiliations

Dept. of Electrical Engineering, Univ. of Washington, 98195, Seattle, WA, US
Yalin Wang
Avaya Labs Research, 233 Mount Airy Road, 07920, Basking Ridge, NJ, US
Jianying Hu

Editor information

Editors and Affiliations

Bell Labs, Lucent Technologies, 600 Mountain Avenue, 07974, Murray Hill, NJ, USA
Daniel Lopresti
Avaya Labs Research, 233 Mount Airy Road, 07920, Basking Ridge, NJ, USA
Jianying Hu & Ramanujan Kashi

About this paper

Cite this paper

Wang, Y., Hu, J. (2002). Detecting Tables in HTML Documents. In: Lopresti, D., Hu, J., Kashi, R. (eds) Document Analysis Systems V. DAS 2002. Lecture Notes in Computer Science, vol 2423. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45869-7_29

DOI: https://doi.org/10.1007/3-540-45869-7_29
Published: 09 August 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44068-0
Online ISBN: 978-3-540-45869-2
eBook Packages: Springer Book Archive
