Abstract
The web is a large source of information, both structured and unstructured, and there is a growing need to extract information efficiently from its many unstructured sources. Information extraction addresses this need: it aims to automatically derive structured information from unstructured, distributed resources on the web, and it can be approached in several ways. Web page segmentation, in which a web page is broken down into semantically related parts, is one of the most significant of these techniques, and it too has a variety of approaches. In this paper, information extraction is first explored, discussed, and reviewed. Second, web page segmentation and its various approaches are revisited and compared. Third, the phases of vision-based web page segmentation are presented and reviewed, along with a flowchart. Finally, results and conclusions are presented, along with future work.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Malhotra, P., Malik, S.K. (2019). Web Page Segmentation Towards Information Extraction for Web Semantics. In: Bhattacharyya, S., Hassanien, A., Gupta, D., Khanna, A., Pan, I. (eds) International Conference on Innovative Computing and Communications. Lecture Notes in Networks and Systems, vol 56. Springer, Singapore. https://doi.org/10.1007/978-981-13-2354-6_45
DOI: https://doi.org/10.1007/978-981-13-2354-6_45
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2353-9
Online ISBN: 978-981-13-2354-6
eBook Packages: Intelligent Technologies and Robotics (R0)