Extracting Data behind Web Forms

  • Conference paper
Advanced Conceptual Modeling Techniques (ER 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2784)

Abstract

A significant and ever-increasing amount of data is accessible only by filling out HTML forms to query an underlying Web data source. While this is most welcome from a user perspective (queries are relatively easy and precise) and from a data management perspective (static pages need not be maintained and databases can be accessed directly), automated agents must face the challenge of obtaining the data behind forms. In principle, an agent can obtain all the data behind a form by multiple submissions of the form filled out in all possible ways, but efficiency concerns lead us to consider alternatives. We investigate these alternatives and show that we can estimate the amount of remaining data (if any) after a small number of submissions and that we can heuristically select a reasonably minimal number of submissions to maximize the coverage of the data. Experimental results show that these statistical predictions are appropriate and useful.
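
The efficiency concern with exhaustive submission can be illustrated with a minimal sketch. It is not the authors' method; it only enumerates every way of filling out a form whose fields each offer a finite set of choices. The field names and values are hypothetical, invented for this illustration, and the number of submissions is simply the product of the number of choices per field.

```python
from itertools import product
from math import prod

# Hypothetical form: these field names and values are assumptions for illustration only.
form_fields = {
    "make": ["any", "ford", "honda", "toyota"],
    "body": ["any", "sedan", "truck"],
    "year": ["any", "2000", "2001", "2002"],
}


def count_fillings(fields):
    """Return the number of distinct ways to fill out a form with finite-choice fields."""
    return prod(len(values) for values in fields.values())


def all_fillings(fields):
    """Yield every combination of field values as a dict, one dict per submission."""
    names = list(fields)
    for combo in product(*(fields[name] for name in names)):
        yield dict(zip(names, combo))


total = count_fillings(form_fields)
fillings = list(all_fillings(form_fields))
print(f"{total} submissions needed to cover every combination")
```

Even this small three-field form requires 4 × 3 × 4 = 48 submissions, and each additional field multiplies the total, which is why the abstract turns to sampling a small number of submissions instead.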

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liddle, S.W., Embley, D.W., Scott, D.T., Yau, S.H. (2003). Extracting Data behind Web Forms. In: Olivé, A., Yoshikawa, M., Yu, E.S.K. (eds) Advanced Conceptual Modeling Techniques. ER 2002. Lecture Notes in Computer Science, vol 2784. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45275-1_35

  • DOI: https://doi.org/10.1007/978-3-540-45275-1_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20255-4

  • Online ISBN: 978-3-540-45275-1

  • eBook Packages: Springer Book Archive
