Data Driven XPath Generation

De Mol, Robin; Bronselaer, Antoon; Nielandt, Joachim; De Tré, Guy

doi:10.1007/978-3-319-11313-5_50

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 322)

Abstract

The XPath query language offers a standard for information extraction from HTML documents. To this end, the DOM tree representation, which models the hierarchical structure of the document, is typically used. One of the key aspects of HTML is the separation of data and the structure that is used to represent it. As a consequence, data extraction algorithms usually fail to identify data if the structure of a document is changed. This paper investigates how a set of tabular-oriented XPath queries can be adapted so that it copes with modifications in the DOM tree of an HTML document. The basic idea is that data that has already been extracted in the past can be used to reconstruct XPath queries that retrieve the same data from a different DOM tree. Experimental results show the accuracy of our method.
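
The basic idea can be illustrated with a small sketch. It is not the algorithm from the paper, whose details the abstract does not give. It only shows the general principle under simplifying assumptions: the page is well-formed enough to be parsed by Python's xml.etree.ElementTree, a previously extracted value is found again by exact text match, and the regenerated query is a plain index-based path. The helper names path_to and regenerate_xpaths and the two sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET


def path_to(root, target):
    """Return an absolute, index-based XPath from root to target, or None."""
    if root is target:
        return "/" + root.tag

    def walk(node, prefix):
        counts = {}  # per-tag position among the children of this node
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            step = f"{prefix}/{child.tag}[{counts[child.tag]}]"
            if child is target:
                return step
            found = walk(child, step)
            if found:
                return found
        return None

    return walk(root, "/" + root.tag)


def regenerate_xpaths(root, known_values):
    """Map each previously extracted value to the XPath of the first element
    in the new tree whose stripped text equals it (assumption: exact match)."""
    result = {}
    for value in known_values:
        for elem in root.iter():
            if (elem.text or "").strip() == value:
                result[value] = path_to(root, elem)
                break
    return result


# Hypothetical old and new versions of the same page: the table was wrapped
# in a <div> and an extra leading column was added.
OLD_PAGE = (
    "<html><body><table>"
    "<tr><td>Ghent</td><td>Belgium</td></tr>"
    "</table></body></html>"
)
NEW_PAGE = (
    "<html><body><div><table>"
    "<tr><td>1</td><td>Ghent</td><td>Belgium</td></tr>"
    "</table></div></body></html>"
)

old_root = ET.fromstring(OLD_PAGE)
new_root = ET.fromstring(NEW_PAGE)

# A query written for the old structure (relative form, as ElementTree requires).
stale_query = "./body/table/tr/td[1]"
old_value = old_root.find(stale_query).text   # works on the old page
stale_hit = new_root.find(stale_query)        # no match on the new page

# Use the data extracted in the past to rebuild queries for the new tree.
xpaths = regenerate_xpaths(new_root, ["Ghent", "Belgium"])
for value, xpath in xpaths.items():
    print(value, "->", xpath)
```

The stale query stops matching as soon as the table is wrapped in another element, whereas the values extracted earlier still locate the cells and yield fresh paths. A realistic approach would also have to handle values that have changed or that occur more than once, which this sketch ignores.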

Author information

Authors and Affiliations

Dept. of Telecommunications and Information Processing, Ghent University, Sint-Pietersnieuwstraat 41, B-9000, Ghent, Belgium
Robin De Mol, Antoon Bronselaer, Joachim Nielandt & Guy De Tré

Corresponding author

Correspondence to Robin De Mol.

Editor information

Editors and Affiliations

School of Computing and Communications, Lancaster University, Lancaster, United Kingdom
P. Angelov
Institute of Biophysics and Biomedical Engineering, Bulgarian Academy of Sciences, Sofia, Bulgaria
K.T. Atanassov
Intelligent Systems Department, Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria
L. Doukovska
University of Chemical Technology and Metallurgy, Sofia, Bulgaria
M. Hadjiski
University of Library Studies and IT (ULSIT), Sofia, Bulgaria
V. Jotsov
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
J. Kacprzyk
Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Auckland, New Zealand
N. Kasabov
Intelligent Systems Laboratory, Prof. Assen Zlatarov University Faculty of Technical Sciences, Bourgas, Bulgaria
S. Sotirov
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
E. Szmidt
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
S. Zadrożny

About this paper

Cite this paper

De Mol, R., Bronselaer, A., Nielandt, J., De Tré, G. (2015). Data Driven XPath Generation. In: Angelov, P., et al. Intelligent Systems'2014. Advances in Intelligent Systems and Computing, vol 322. Springer, Cham. https://doi.org/10.1007/978-3-319-11313-5_50

DOI: https://doi.org/10.1007/978-3-319-11313-5_50
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11312-8
Online ISBN: 978-3-319-11313-5
eBook Packages: Engineering, Engineering (R0)
