Abstract
Many recent papers have investigated more effective and efficient algorithms for search engines. Traditional search engines rank results using textual information only and, as shown by several works, this makes them poorly suited to extracting relevant information. Recent research instead takes a new approach, called Topic Distillation, whose main task is finding relevant documents using a different similarity criterion: retrieved documents are those related to the query topic, but they do not necessarily contain the query string. Current algorithms for topic distillation first compute a base set containing all the relevant pages and then obtain the authoritative pages by applying an iterative procedure. In this paper, we present a different approach, which computes the authoritative pages by analyzing the structure of the base set. The technique applies a statistical approach to the co-citation matrix of the base set to find the most co-cited pages, and combines link analysis with an evaluation of page content. Several experiments have shown the validity of our approach.
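As a rough illustration of the co-citation idea mentioned above, the following minimal sketch counts, for every pair of pages in a small base set, how many pages link to both. It is not the statistical procedure of the paper, which the abstract does not specify; the function names `cocitation_matrix` and `most_cocited`, the edge-list input format, and the ranking by off-diagonal row sum are all assumptions made for illustration.

```python
def cocitation_matrix(links, n):
    """Build an n x n co-citation matrix from (source, target) link pairs.

    C[i][j] is the number of pages that link to both page i and page j;
    the diagonal C[i][i] is the number of distinct pages linking to i.
    (Hypothetical helper, not taken from the paper.)
    """
    # Collect the distinct targets cited by each source page.
    cited_by_source = [set() for _ in range(n)]
    for src, dst in links:
        if src != dst:  # assumption: self-links are ignored
            cited_by_source[src].add(dst)

    C = [[0] * n for _ in range(n)]
    for targets in cited_by_source:
        for i in targets:
            for j in targets:
                C[i][j] += 1
    return C


def most_cocited(C, k):
    """Return the k pages with the largest total co-citation with other pages.

    Assumption: pages are scored by their off-diagonal row sum, with ties
    broken by page index.
    """
    scores = [sum(row) - row[i] for i, row in enumerate(C)]
    return sorted(range(len(C)), key=lambda i: (-scores[i], i))[:k]


# Toy base set: pages 0 and 1 both cite pages 2 and 3; page 4 cites only page 3.
links = [(0, 2), (0, 3), (1, 2), (1, 3), (4, 3)]
C = cocitation_matrix(links, 5)
top = most_cocited(C, 2)
```

In this toy example, pages 2 and 3 are co-cited by two pages each, so they come out as the most co-cited pair. A fuller treatment would combine such link-based scores with the page content evaluation the abstract describes.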
Cite this article
Greco, G., Greco, S. & Zumpano, E. A Probabilistic Approach for Distillation and Ranking of Web Pages. World Wide Web 4, 189–207 (2001). https://doi.org/10.1023/A:1013883717655
DOI: https://doi.org/10.1023/A:1013883717655