Abstract
The XPath query language offers a standard for information extraction from HTML documents. To this end, the DOM tree representation, which models the hierarchical structure of the document, is typically used. One of the key aspects of HTML is the separation of data from the structure that is used to represent it. As a consequence, data extraction algorithms usually fail to identify data if the structure of a document changes. This paper investigates how a set of tabular-oriented XPath queries can be adapted so that it copes with modifications in the DOM tree of an HTML document. The basic idea is that data which has already been extracted in the past can be used to reconstruct XPath queries that retrieve the same data from a different DOM tree. Experimental results show the accuracy of our method.
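The basic idea can be illustrated with a minimal sketch. It is not the authors' algorithm, only a simplified illustration under stated assumptions: the document is well-formed markup that Python's standard `xml.etree.ElementTree` can parse, previously extracted values are matched by exact text equality, and the names `build_paths`, `regenerate_queries`, and `NEW_DOC` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical changed document: the table is now wrapped in a <div>,
# so an old query such as /html/body[1]/table[1]/tr[1]/td[1] no longer matches.
NEW_DOC = (
    "<html><body><div><table>"
    "<tr><td>Alice</td><td>30</td></tr>"
    "<tr><td>Bob</td><td>25</td></tr>"
    "</table></div></body></html>"
)

def build_paths(root):
    """Map every element to an absolute, position-indexed XPath string."""
    paths = {root: "/" + root.tag}

    def walk(node):
        counts = {}  # per-tag sibling counter, gives the 1-based XPath index
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            paths[child] = f"{paths[node]}/{child.tag}[{counts[child.tag]}]"
            walk(child)

    walk(root)
    return paths

def regenerate_queries(root, known_values):
    """Return {value: xpath} for each known value found in the new tree.

    Assumption: a value is located by exact match on the element's
    stripped text; only the first match in document order is kept.
    """
    found = {}
    for elem, path in build_paths(root).items():
        text = (elem.text or "").strip()
        if text in known_values and text not in found:
            found[text] = path
    return found

if __name__ == "__main__":
    tree_root = ET.fromstring(NEW_DOC)
    # Values assumed to have been extracted from an earlier version of the page.
    for value, xpath in regenerate_queries(tree_root, {"Alice", "25"}).items():
        print(value, "->", xpath)
```

Exact matching is the simplest possible choice here. A practical system would likely need approximate matching and a way to generalize the recovered paths into queries over whole rows or columns, which this sketch does not attempt.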
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
De Mol, R., Bronselaer, A., Nielandt, J., De Tré, G. (2015). Data Driven XPath Generation. In: Angelov, P., et al. Intelligent Systems'2014. Advances in Intelligent Systems and Computing, vol 322. Springer, Cham. https://doi.org/10.1007/978-3-319-11313-5_50
DOI: https://doi.org/10.1007/978-3-319-11313-5_50
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11312-8
Online ISBN: 978-3-319-11313-5
eBook Packages: Engineering (R0)