Abstract
Automated techniques can help extract information from the Web. This paper proposes a new semi-automatic approach, based on the maximum entropy segmental Markov model, for extracting structured data from Web pages. It is motivated by two ideas: modeling the sequences that embed structured data, rather than their context, to reduce the number of training Web pages required; and preventing the generation of overly specific or overly general models from the training data. Experimental results show that this approach outperforms Stalker when only one training Web page is provided.
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mengel, S., Jing, Y. (2009). Extracting Structured Data from Web Pages with Maximum Entropy Segmental Markov Model. In: Vossen, G., Long, D.D.E., Yu, J.X. (eds) Web Information Systems Engineering - WISE 2009. WISE 2009. Lecture Notes in Computer Science, vol 5802. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04409-0_25
DOI: https://doi.org/10.1007/978-3-642-04409-0_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04408-3
Online ISBN: 978-3-642-04409-0
eBook Packages: Computer Science, Computer Science (R0)