Abstract
Recently, there has been increased interest in the extraction of structured data from the Web (both the “Surface” Web and the “Hidden” Web). In this paper we focus on the automatic extraction of Web lists. Although this task has been studied extensively, existing approaches assume that a list is wholly contained in a single Web page. They do not consider that many websites spread a listing over several Web pages and show only a partial view on each of them. Similar to databases, where a view can represent a subset of the data contained in a table, these websites split a logical list into multiple views (view lists). Automatic extraction of logical lists is an open problem. To tackle this issue, we propose an unsupervised and domain-independent algorithm for logical list extraction. Experimental results on real-life, data-intensive websites confirm the effectiveness of our approach.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Lanotte, P.F., Fumarola, F., Ceci, M., Scarpino, A., Torelli, M.D., Malerba, D. (2014). Automatic Extraction of Logical Web Lists. In: Andreasen, T., Christiansen, H., Cubero, JC., Raś, Z.W. (eds) Foundations of Intelligent Systems. ISMIS 2014. Lecture Notes in Computer Science, vol 8502. Springer, Cham. https://doi.org/10.1007/978-3-319-08326-1_37
DOI: https://doi.org/10.1007/978-3-319-08326-1_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08325-4
Online ISBN: 978-3-319-08326-1
eBook Packages: Computer Science, Computer Science (R0)