The Lixto Project: Exploring New Frontiers of Web Data Extraction

Carme, Julien; Ceresna, Michal; Frölich, Oliver; Gottlob, Georg; Hassan, Tamir; Herzog, Marcus; Holzinger, Wolfgang; Krüpl, Bernhard

doi:10.1007/11788911_1

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4042)

Included in the following conference series:

British National Conference on Databases

Abstract

The Lixto project is an ongoing research effort in the area of Web data extraction. The project originally set out to develop a logic-based extraction language and a tool for visually defining extraction programs from sample Web pages, but its scope has been extended over time. Today, new issues are being investigated: employing learning algorithms to define extraction programs, automatically extracting data from Web pages with a table-centric visual appearance, and extracting from alternative document formats such as PDF.

This work is funded in part by the Austrian Federal Ministry for Transport, Innovation and Technology under the FIT-IT Semantic Systems program.

Author information

Authors and Affiliations

Database and Artificial Intelligence Group, Vienna University of Technology, Favoritenstraße 9-11, A-1040, Wien, Austria
Julien Carme, Michal Ceresna, Oliver Frölich, Tamir Hassan, Marcus Herzog, Wolfgang Holzinger & Bernhard Krüpl
Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford, OX1 3QD, United Kingdom
Georg Gottlob

Editor information

Editors and Affiliations

School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, BT7 1NN, Belfast, UK
David A. Bell
School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, BT7 1NN, Belfast, UK
Jun Hong

About this paper

Cite this paper

Carme, J. et al. (2006). The Lixto Project: Exploring New Frontiers of Web Data Extraction. In: Bell, D.A., Hong, J. (eds) Flexible and Efficient Information Handling. BNCOD 2006. Lecture Notes in Computer Science, vol 4042. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11788911_1

DOI: https://doi.org/10.1007/11788911_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35969-2
Online ISBN: 978-3-540-35971-5
eBook Packages: Computer Science (R0)
