Bootstrapping Information Extraction from Semi-structured Web Pages

Carlson, Andrew; Schafer, Charles

doi:10.1007/978-3-540-87479-9_31


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5211)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases


Abstract

We consider the problem of extracting structured records from semi-structured web pages with no human supervision required for each target web site. Previous work on this problem has either required significant human effort for each target site or used brittle heuristics to identify semantic data types. Our method only requires annotation for a few pages from a few sites in the target domain. Thus, after a tiny investment of human effort, our method allows automatic extraction from potentially thousands of other sites within the same domain. Our approach extends previous methods for detecting data fields in semi-structured web pages by matching those fields to domain schema columns using robust models of data values and contexts. Annotating 2–5 pages for 4–6 web sites yields an extraction accuracy of 83.8% on job offer sites and 91.1% on vacation rental sites. These results significantly outperform a baseline approach.
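The schema-matching step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it models data values only (the paper also uses contexts), and the token-shape features, the cosine scoring, and every name in it (`value_features`, `match_fields`, the sample columns and fields) are assumptions chosen for illustration. A few annotated values per schema column stand in for the hand-labeled pages, and each detected field on a new site is assigned to the most similar column.

```python
import math
import re
from collections import Counter


def value_features(values):
    """Pool coarse token features over all values of one field (hypothetical features)."""
    counts = Counter()
    for value in values:
        for token in re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", value):
            if token.isdigit():
                counts["<num>"] += 1          # collapse all numbers into one feature
            elif token.isalpha():
                counts[token.lower()] += 1    # keep words, case-folded
            else:
                counts[token] += 1            # keep punctuation such as "$"
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match_fields(schema_examples, detected_fields):
    """Assign each detected field to the schema column with the most similar values."""
    column_models = {col: value_features(vals) for col, vals in schema_examples.items()}
    mapping = {}
    for field_id, values in detected_fields.items():
        feats = value_features(values)
        scores = {col: cosine(feats, model) for col, model in column_models.items()}
        mapping[field_id] = max(scores, key=scores.get)
    return mapping


# Values annotated on a few training sites (made-up sample data).
schema_examples = {
    "price": ["$120 per night", "$95 per night", "$210 per night"],
    "bedrooms": ["2 bedrooms", "3 bedrooms", "1 bedroom"],
}
# Unlabeled fields detected on a new site (made-up sample data).
detected_fields = {
    "field_0": ["4 bedrooms", "2 bedrooms"],
    "field_1": ["$150 per night", "$80 per night"],
}
print(match_fields(schema_examples, detected_fields))
```

Choosing the single best-scoring column per field is the simplest possible decision rule; a real system would likely also need a way to leave a field unmatched when no column scores well.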



Author information

Authors and Affiliations

Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Andrew Carlson
Google, Inc., 4720 Forbes Avenue, Pittsburgh, PA 15213, USA
Charles Schafer


Editor information

Walter Daelemans, Bart Goethals, Katharina Morik


About this paper

Cite this paper

Carlson, A., Schafer, C. (2008). Bootstrapping Information Extraction from Semi-structured Web Pages. In: Daelemans, W., Goethals, B., Morik, K. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2008. Lecture Notes in Computer Science, vol 5211. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87479-9_31


DOI: https://doi.org/10.1007/978-3-540-87479-9_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87478-2
Online ISBN: 978-3-540-87479-9
eBook Packages: Computer Science, Computer Science (R0)
