Abstract
We consider the problem of extracting structured records from semi-structured web pages without requiring human supervision for each target web site. Previous work on this problem has either required significant human effort for each target site or relied on brittle heuristics to identify semantic data types. Our method requires annotations for only a few pages from a few sites in the target domain. Thus, after a small investment of human effort, it allows automatic extraction from potentially thousands of other sites within the same domain. Our approach extends previous methods for detecting data fields in semi-structured web pages by matching those fields to domain schema columns using robust models of data values and contexts. Annotating 2–5 pages for each of 4–6 web sites yields an extraction accuracy of 83.8% on job offer sites and 91.1% on vacation rental sites. These results significantly outperform a baseline approach.
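To make the field-to-column matching idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a detected field and each schema column are both represented by a bag of tokens drawn from their values, and that similarity is measured with Jensen-Shannon divergence. The function names (`token_distribution`, `js_divergence`, `best_column`) and the example data are hypothetical. The abstract also mentions models of context, which this sketch does not cover.

```python
import math
from collections import Counter


def token_distribution(values):
    """Relative frequency of lowercase, whitespace-separated tokens."""
    counts = Counter(tok for v in values for tok in v.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    def kl_to_mixture(a):
        # KL divergence from `a` to the 50/50 mixture of p and q.
        total = 0.0
        for tok, pa in a.items():
            m = 0.5 * (p.get(tok, 0.0) + q.get(tok, 0.0))
            total += pa * math.log2(pa / m)
        return total
    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def best_column(field_values, schema_examples):
    """Return the schema column whose annotated values look most like the field.

    `schema_examples` maps a column name to values annotated on training sites
    (an assumed data layout, chosen for illustration only).
    """
    field_dist = token_distribution(field_values)
    scores = {
        col: js_divergence(field_dist, token_distribution(vals))
        for col, vals in schema_examples.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical annotated values from a few training sites.
schema_examples = {
    "location": ["Seattle WA", "Austin TX", "Boston MA"],
    "salary": ["$50,000 per year", "$72,000 per year"],
}
# Values of one detected field on a new, unannotated site.
field_values = ["Portland OR", "Seattle WA", "Denver CO"]

column, scores = best_column(field_values, schema_examples)
print(column, scores)
```

A lower divergence means the field's values resemble the column's annotated values more closely. A practical system would likely combine several such signals rather than rely on token overlap alone.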
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Carlson, A., Schafer, C. (2008). Bootstrapping Information Extraction from Semi-structured Web Pages. In: Daelemans, W., Goethals, B., Morik, K. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2008. Lecture Notes in Computer Science(), vol 5211. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87479-9_31
DOI: https://doi.org/10.1007/978-3-540-87479-9_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87478-2
Online ISBN: 978-3-540-87479-9
eBook Packages: Computer Science, Computer Science (R0)