Abstract
The Lixto project is an ongoing research effort in the area of Web data extraction. The project originally set out to develop a logic-based extraction language and a tool for visually defining extraction programs from sample Web pages, but its scope has been extended over time. Today, new issues are being investigated, such as employing learning algorithms to define extraction programs, automatically extracting data from Web pages that feature a table-centric visual appearance, and extracting from alternative document formats such as PDF.
This work is funded in part by the Austrian Federal Ministry for Transport, Innovation and Technology under the FIT-IT Semantic Systems program.
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Carme, J. et al. (2006). The Lixto Project: Exploring New Frontiers of Web Data Extraction. In: Bell, D.A., Hong, J. (eds) Flexible and Efficient Information Handling. BNCOD 2006. Lecture Notes in Computer Science, vol 4042. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11788911_1
DOI: https://doi.org/10.1007/11788911_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35969-2
Online ISBN: 978-3-540-35971-5
eBook Packages: Computer Science (R0)