Abstract
“Live” content, such as weblogs, Wikipedia, and news, is now ubiquitous on the Internet, and providing users with relevant content in a timely manner has become a challenging problem. Unlike Web search technologies and RSS feed/reader applications, this paper envisions a personalized full-text content filtering and dissemination system in a highly distributed environment such as a Distributed Hash Table (DHT) based Peer-to-Peer (P2P) network. Users subscribe to content of interest by specifying input keywords and thresholds as filters, and content is then disseminated to the users whose filters it satisfies. In the literature, full-text document publishing in DHTs has long suffered from the high cost of forwarding a document to the home nodes of all of its distinct terms. The problem is aggravated by the fact that a document contains a large number of distinct terms (typically tens or thousands of terms per document). In this paper, we propose a set of novel techniques that overcome this high forwarding cost by carefully selecting a very small number of meaningful terms (or key features) among the candidate terms inside each document. Next, to reduce the average hop count per forwarding, we further prune irrelevant documents along the forwarding path. Experiments based on two real query logs and two real data sets demonstrate the effectiveness of our solution.
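To make the forwarding cost concrete, the following minimal Python sketch contrasts forwarding a document to the home nodes of all its distinct terms with forwarding it only for a few selected terms. It is an illustration under stated assumptions, not the method of the paper: the abstract does not specify how terms are selected or how thresholds are evaluated, so the frequency-based `select_terms` heuristic, the keyword-fraction rule in `matches`, the `NUM_NODES` value, and all function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Iterable, List, Set

NUM_NODES = 64  # assumed number of DHT nodes, chosen only for illustration


def home_node(term: str, num_nodes: int = NUM_NODES) -> int:
    """Map a term to a node id by hashing it (a simplified DHT lookup)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def select_terms(doc_terms: List[str], k: int) -> List[str]:
    """Pick the k most frequent distinct terms.

    This is a stand-in heuristic; the paper's selection technique is
    not described in the abstract.
    """
    counts = Counter(doc_terms)
    # Descending frequency, ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:k]


def forwarding_targets(terms: Iterable[str]) -> Set[int]:
    """Home nodes that would receive the document for the given terms."""
    return {home_node(t) for t in terms}


def matches(doc_terms: List[str], keywords: List[str], threshold: float) -> bool:
    """Assumed filter semantics: the fraction of the filter's keywords
    that appear in the document must reach the threshold."""
    present = set(doc_terms)
    hits = sum(1 for kw in keywords if kw in present)
    return hits / len(keywords) >= threshold


doc = "dht peer dht filter content dht filter news".split()
all_targets = forwarding_targets(set(doc))          # baseline: every distinct term
few_targets = forwarding_targets(select_terms(doc, 2))  # only selected terms
print(len(all_targets), len(few_targets))
```

In this toy setting the baseline contacts up to one node per distinct term, while the selective variant contacts at most `k` nodes; the difficulty the paper addresses is choosing those few terms so that subscribers still receive the documents their filters match.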
Cite this article
Rao, W., Chen, L. & Fu, A.WC. STAIRS: Towards efficient full-text filtering and dissemination in DHT environments. The VLDB Journal 20, 793–817 (2011). https://doi.org/10.1007/s00778-011-0224-z
DOI: https://doi.org/10.1007/s00778-011-0224-z