Abstract
The number of deep web databases has grown dramatically over the last decade, spawning a growing interest in the automated discovery of interesting relationships among them. Unlike the “surface” web of static pages, deep web databases provide data through a web-based query interface and account for a huge portion of all web content. This paper presents a novel source-biased approach to efficiently discover interesting relationships among web-enabled databases on the deep web. Our approach supports a relationship-centric view over a collection of deep web databases through source-biased database analysis and exploration. The approach has three unique features. First, we develop source-biased probing techniques, which determine in very few interactions whether a target database is relevant to the source database by probing the target with very precise probes. Second, we introduce source-biased relevance metrics to evaluate the relevance of the deep web databases discovered, to identify interesting types of source-biased relationships for a collection of deep web databases, and to rank them accordingly. The source-biased relationships discovered not only provide value-added metadata for each deep web database but can also directly support personalized relationship-centric queries. Third, we develop a performance optimization using source-biased probing with focal terms to further improve the effectiveness of the basic source-biased model. A prototype system is designed for crawling, probing, and supporting relationship-centric queries over deep web databases using the source-biased approach. Our experiments evaluate the effectiveness of the proposed source-biased analysis and discovery model, showing that the source-biased approach outperforms query-biased probing and unbiased probing.
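The abstract does not spell out the probing algorithm or the relevance metrics, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes that a database can be summarized by term frequencies, that probes are the most frequent terms of the source summary, and that relevance can be approximated by cosine similarity between the source summary and the summary of the documents the target returns. All names (`summarize`, `source_biased_probe`, `make_search`, the toy documents) are hypothetical.

```python
import math
from collections import Counter


def summarize(docs):
    """Build a term-frequency summary from a list of text documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two term-frequency summaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def source_biased_probe(source_summary, search_target, num_probes=2):
    """Probe a target with the source's top terms and summarize what comes back.

    `search_target` stands in for the target's query interface: it takes one
    term and returns the matching documents (an assumption of this sketch).
    """
    probes = [term for term, _ in source_summary.most_common(num_probes)]
    retrieved, seen = [], set()
    for term in probes:
        for doc in search_target(term):
            if doc not in seen:  # count each returned document once
                seen.add(doc)
                retrieved.append(doc)
    return summarize(retrieved)


def make_search(docs):
    """Toy keyword search standing in for a deep web query interface."""
    def search(term):
        return [d for d in docs if term in d.lower().split()]
    return search


# Toy data: a life-science source and two candidate targets.
source = summarize(["gene protein sequence", "protein structure gene",
                    "gene expression protein"])
target_a = make_search(["protein folding gene", "gene therapy trial",
                        "stock market news"])
target_b = make_search(["stock market news", "football match score"])

score_a = cosine(source, source_biased_probe(source, target_a))
score_b = cosine(source, source_biased_probe(source, target_b))
print(f"target A: {score_a:.2f}, target B: {score_b:.2f}")
```

In this toy setup the probes drawn from the source retrieve documents only from the first target, so it scores higher than the second, which returns nothing. A real system would need stop-word removal, stemming, and a relevance metric chosen to match the paper's definitions.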
Cite this article
Caverlee, J., Liu, L. & Rocco, D. Discovering Interesting Relationships among Deep Web Databases: A Source-Biased Approach. World Wide Web 9, 585–622 (2006). https://doi.org/10.1007/s11280-006-0227-7
DOI: https://doi.org/10.1007/s11280-006-0227-7