A Probabilistic Approach for Distillation and Ranking of Web Pages

Greco, Gianluigi; Greco, Sergio; Zumpano, Ester

doi:10.1023/A:1013883717655

Published: September 2001

World Wide Web, Volume 4, pages 189–207 (2001)

Abstract

Many recent papers have investigated more effective and efficient algorithms for search engines. Traditional search engines rank results using textual information only and, as several works have shown, such rankings are not very useful for extracting relevant information. Current research instead takes a new approach, called topic distillation, whose main task is to find relevant documents using a different similarity criterion: the retrieved documents are those related to the query topic, whether or not they contain the query string. Current algorithms for topic distillation first compute a base set containing all the relevant pages and then obtain the authoritative pages by applying an iterative procedure. In this paper, we present a different approach, which computes the authoritative pages by analyzing the structure of the base set. The technique applies a statistical approach to the co-citation matrix of the base set to find the most co-cited pages, and it combines link analysis with an evaluation of page content. Several experiments have shown the validity of our approach.
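
To make the notion of a co-citation matrix concrete, the following is a minimal sketch in Python. It is not the statistical procedure of the paper, which the abstract does not specify. It only builds a co-citation matrix for a small base set and ranks pages by how often they are co-cited. The function names, the toy link data, and the simple sum-of-co-citations score are all assumptions made for illustration.

```python
from itertools import combinations


def cocitation_matrix(links, pages):
    """Build C, where C[i][j] counts the pages linking to both pages[i] and pages[j].

    `links` maps a source page to the pages it links to (hypothetical input format).
    The diagonal entry C[i][i] is the number of pages that link to pages[i].
    """
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    matrix = [[0] * n for _ in range(n)]
    for source, targets in links.items():
        # Keep only targets inside the base set, ignoring self-links and duplicates.
        cited = sorted({index[t] for t in targets if t in index and t != source})
        for i in cited:
            matrix[i][i] += 1
        for i, j in combinations(cited, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1
    return matrix


def most_cocited(links, pages, top=3):
    """Rank pages by their total co-citation count with the other pages.

    This score is an illustrative assumption, not the paper's ranking function.
    Ties are broken alphabetically so the result is deterministic.
    """
    matrix = cocitation_matrix(links, pages)
    n = len(pages)
    scores = {
        page: sum(matrix[i][j] for j in range(n) if j != i)
        for i, page in enumerate(pages)
    }
    return sorted(pages, key=lambda page: (-scores[page], page))[:top]


# Toy base set: three candidate pages cited by three hypothetical hub pages.
example_pages = ["a", "b", "c"]
example_links = {
    "h1": ["a", "b"],
    "h2": ["a", "b", "c"],
    "h3": ["a", "c"],
}
```

With this toy data, `cocitation_matrix(example_links, example_pages)` gives `[[3, 2, 2], [2, 2, 1], [2, 1, 2]]`, and `most_cocited(example_links, example_pages, top=2)` gives `["a", "b"]`. Page `a` comes first because it is co-cited with each of the other pages twice.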

Author information

Authors and Affiliations

DEIS, Università della Calabria, 87030, Rende, Italy
Gianluigi Greco, Sergio Greco & Ester Zumpano

About this article

Cite this article

Greco, G., Greco, S. & Zumpano, E. A Probabilistic Approach for Distillation and Ranking of Web Pages. World Wide Web 4, 189–207 (2001). https://doi.org/10.1023/A:1013883717655