Abstract
A significant and ever-increasing amount of data is accessible only by filling out HTML forms to query an underlying Web data source. While this is most welcome from a user perspective (queries are relatively easy and precise) and from a data management perspective (static pages need not be maintained and databases can be accessed directly), automated agents must face the challenge of obtaining the data behind forms. In principle an agent can obtain all the data behind a form by multiple submissions of the form filled out in all possible ways, but efficiency concerns lead us to consider alternatives. We investigate these alternatives and show that we can estimate the amount of remaining data (if any) after a small number of submissions and that we can heuristically select a reasonably minimal number of submissions to maximize the coverage of the data. Experimental results show that these statistical predictions are appropriate and useful.
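The cost of the exhaustive strategy mentioned above can be illustrated with a small sketch. It is not the statistical procedure the paper evaluates. It only shows that the number of possible fill-outs is the product of the choices per field, and that tracking how many unseen records each submission returns gives a crude signal for stopping early. The names `all_fillouts`, `count_fillouts`, `harvest`, the `patience` threshold, and the toy data source are all assumptions made for illustration.

```python
from itertools import product

def all_fillouts(fields):
    """Yield every combination of field values as a dict (exhaustive strategy)."""
    names = sorted(fields)
    for values in product(*(fields[n] for n in names)):
        yield dict(zip(names, values))

def count_fillouts(fields):
    """Number of submissions the exhaustive strategy would need."""
    total = 1
    for choices in fields.values():
        total *= len(choices)
    return total

def harvest(fields, submit, patience=3):
    """Submit fill-outs in order and stop after `patience` consecutive
    submissions that return no unseen record (a hypothetical stopping rule).
    Returns (set of records, number of submissions used)."""
    seen = set()
    stale = 0
    used = 0
    for query in all_fillouts(fields):
        used += 1
        new = set(submit(query)) - seen
        seen |= new
        stale = 0 if new else stale + 1
        if stale >= patience:
            break
    return seen, used

# Hypothetical toy source standing in for a database behind a form.
# An empty string means "leave this field unconstrained".
RECORDS = [("ford", "red"), ("ford", "blue"), ("honda", "red")]
FIELDS = {"make": ["", "ford", "honda"], "color": ["", "red", "blue"]}

def submit(query):
    """Return the records matching the filled-in fields."""
    return [
        (make, color)
        for make, color in RECORDS
        if query["make"] in ("", make) and query["color"] in ("", color)
    ]

if __name__ == "__main__":
    records, used = harvest(FIELDS, submit)
    print(count_fillouts(FIELDS), "possible fill-outs;", used, "submitted;",
          len(records), "records retrieved")
```

With realistic forms the product in `count_fillouts` grows quickly, which is why a stopping rule or a principled estimate of the remaining data matters. The fixed `patience` threshold here is a simple stand-in and offers none of the statistical guarantees the abstract refers to.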
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liddle, S.W., Embley, D.W., Scott, D.T., Yau, S.H. (2003). Extracting Data behind Web Forms. In: Olivé, A., Yoshikawa, M., Yu, E.S.K. (eds) Advanced Conceptual Modeling Techniques. ER 2002. Lecture Notes in Computer Science, vol 2784. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45275-1_35
DOI: https://doi.org/10.1007/978-3-540-45275-1_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20255-4
Online ISBN: 978-3-540-45275-1
eBook Packages: Springer Book Archive