Discovering Interesting Relationships among Deep Web Databases: A Source-Biased Approach

Caverlee, James; Liu, Ling; Rocco, Daniel

doi:10.1007/s11280-006-0227-7

Published: 16 January 2007

Volume 9, pages 585–622, (2006)
World Wide Web

James Caverlee¹,
Ling Liu¹ &
Daniel Rocco¹

Abstract

The number of deep web databases has grown dramatically over the last decade, spawning growing interest in the automated discovery of interesting relationships among them. Unlike the “surface” web of static pages, deep web databases provide data through a web-based query interface and account for a huge portion of all web content. This paper presents a novel source-biased approach to efficiently discovering interesting relationships among web-enabled databases on the deep web. Our approach supports a relationship-centric view over a collection of deep web databases through source-biased database analysis and exploration. It has three unique features. First, we develop source-biased probing techniques, which determine in very few interactions whether a target database is relevant to the source database by probing the target with very precise probes. Second, we introduce source-biased relevance metrics to evaluate the relevance of the deep web databases discovered, to identify interesting types of source-biased relationships for a collection of deep web databases, and to rank them accordingly. The source-biased relationships discovered not only provide value-added metadata for each deep web database but can also directly support personalized relationship-centric queries. Third, we develop a performance optimization, source-biased probing with focal terms, to further improve the effectiveness of the basic source-biased model. A prototype system is designed for crawling, probing, and supporting relationship-centric queries over deep web databases using the source-biased approach. Our experiments evaluate the effectiveness of the proposed source-biased analysis and discovery model, showing that the source-biased approach outperforms query-biased probing and unbiased probing.
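The probing idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's definition: the names `summarize`, `source_biased_probe`, `cosine`, and `make_search` are hypothetical, the database summary is a plain term-frequency count, probes are simply the source's most frequent terms, and cosine similarity stands in for a source-biased relevance metric. The target database is modeled as an in-memory keyword search.

```python
import math
from collections import Counter


def summarize(docs):
    """Build a term-frequency summary (bag of words) from a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def make_search(docs):
    """Return a toy keyword-search interface over docs (stand-in for a query form)."""
    def search(term):
        return [d for d in docs if term in d.lower().split()]
    return search


def source_biased_probe(source_summary, target_search, num_probes=3):
    """Probe the target with the source's top terms; summarize the documents returned."""
    probes = [term for term, _ in source_summary.most_common(num_probes)]
    sampled, seen = [], set()
    for term in probes:
        for doc in target_search(term):
            if doc not in seen:  # count each returned document once
                seen.add(doc)
                sampled.append(doc)
    return summarize(sampled)


def cosine(a, b):
    """Cosine similarity between two term-frequency summaries (assumed metric)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Toy data: one source database and two candidate targets.
source = summarize(["gene protein sequence", "protein gene expression", "gene mutation"])
target_a = make_search(["protein folding gene", "cancer gene therapy", "football score"])
target_b = make_search(["football score", "tennis match"])

score_a = cosine(source, source_biased_probe(source, target_a))
score_b = cosine(source, source_biased_probe(source, target_b))
```

In this toy setup the target that shares vocabulary with the source receives a positive score, while the unrelated target returns nothing for the source's probes and scores zero. Ranking targets by such a score is one simple way to read the "rank them accordingly" step.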

Author information

Authors and Affiliations

Georgia Institute of Technology, College of Computing, 801 Atlantic Drive, Atlanta, GA, 30332, USA
James Caverlee, Ling Liu & Daniel Rocco

Corresponding author

Correspondence to James Caverlee.

About this article

Cite this article

Caverlee, J., Liu, L. & Rocco, D. Discovering Interesting Relationships among Deep Web Databases: A Source-Biased Approach. World Wide Web 9, 585–622 (2006). https://doi.org/10.1007/s11280-006-0227-7

Received: 08 July 2005
Revised: 11 November 2005
Accepted: 14 July 2006
Published: 16 January 2007
Issue Date: December 2006
DOI: https://doi.org/10.1007/s11280-006-0227-7