Abstract
This paper presents a probabilistic relational modelling (implementation) of the major probabilistic retrieval models. Such a high-level implementation is useful since it supports the ranking of any object, it allows for the reasoning across structured and unstructured data, and it gives the software (knowledge) engineer control over ranking and thus supports customisation. The contributions of this paper include the specification of probabilistic SQL (PSQL) and probabilistic relational algebra (PRA), a new relational operator for probability estimation (the relational Bayes), the probabilistic relational modelling of retrieval models, a comparison of modelling retrieval with traditional SQL versus modelling retrieval with PSQL, and a comparison of the performance of probability estimation with traditional SQL versus PSQL. The main findings are that the PSQL/PRA paradigm allows for the description of advanced retrieval models, is suitable for solving large-scale retrieval tasks, and outperforms traditional SQL in terms of abstraction and performance regarding probability estimation.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Abiteboul, S., Agrawal, R., Bernstein, P., Carey, M., Ceri, S., Croft, B., DeWitt, D., Franklin, M., Garcia-Molina, H., Gawlick, D., Gray, J., Haas, L., Halevy, A., Hellerstein, J., Ioannidis, Y., Kersten, M., Pazzani, M., Lesk, M., Maier, D., Naughton, J., Schek, H., Sellis, T., Silberschatz, A., Stonebraker, M., Snodgrass, R., Ullman, J., Weikum, G., Widom, J., Zdonik, S.: The lowell database research self assessment (2003)
Agrawal, S., Chaudhuri, S., Das, G.: Dbxplorer: a system for keyword-based search over relational databases. In: ICDE, pp. 5–16 (2002)
Agrawal, S., Chaudhuri, S., Das, G., Gionis, A.: Automated ranking of database query results. In: CIDR (2003)
Azzam, H., Roelleke, T.: Efficient processing of ontological queries. In: 2nd VLDB Workshop on Ontologies-based Techniques for Databases and Information Systems, Seoul (2006)
Amati G. and van Rijsbergen C.J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4): 357–389
Bosc P., Galibourg M. and Hamon G. (1988). Fuzzy querying with sql: extensions and implementation aspects. Fuzzy Sets Syst. 28(3): 333–349
Barbara D., Garcia-Molina H. and Porter D. (1990). A probabilistic relational data model. In: Bancilhon, F., Thanos, C., and Tsichrizis, D. (eds) Advances in Database Technology – EDBT ’90, pp 60–74. Springer, Berlin
Barbara D., Garcia-Molina H. and Porter D. (1992). The management of probabilistic data. IEEE Trans. Knowl. Data Eng. 4(5): 487–502
Berger, A., Lafferty, J.: Information retrieval as statistical translation. In: SIGIR (ed.) SIGIR ’99 Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, pp. 222–229, ACM, New York (1999)
Bosc, P., Pivert, O.: Fuzzy queries and relational databases. In: Proceedings of the 1994 ACM Symposium on Applied Computing, pp. 170–174 ACM Press, New York (1994)
Chaudhuri, S., Das, G., Hristidis, V., Weikum, G.: Probabilistic ranking of database query results. In: VLDB pp. 888–899 (2004)
Chaudhuri S., Das G., Hristidis V. and Weikum G. (2006). Probabilistic information retrieval approach for ranking of database query results. ACM Trans. Database Syst. 31(3): 1134–1168
Croft W.B. and Harper D.J. (1979). Using probabilistic models of document retrieval without relevance information. J. Doc. 35: 285–295
Cavallo, R., Pittarelli, M.: The theory of probabilistic databases. In: Proceedings of the 13th International Conference on Very Large Databases, pp. 71–81 Morgan Kaufman, Los Altos (1987)
Chaudhuri, S., Ramakrishnan, R., Weikum, G.: Integrating db and ir technologies: What is the sound of one hand clapping? In: CIDR, pp. 1–12 (2005)
Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. In: VLDB, pp. 864–875 (2004)
Dalvi, N.N., Suciu, D.: Answering queries from statistics and probabilistic views. In: VLDB, pp. 805–816 (2005)
de Vries, A., Roelleke, T.: Relevance information: a loss of entropy but a gain for idf? In: ACM SIGIR, Salvador, Brazil (2005)
Ercegovac, V., DeWitt, D.J., Ramakrishnan, R.: The texture benchmark: Measuring performance of text queries on a relational dbms. In: VLDB, pp. 313–324 (2005)
Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems. Addison-Wesley, Reading (2000)
Frommholz I. and Fuhr N. (2006). Probabilistic, object-oriented logics for annotation-based retrieval in digital libraries. In: Marchionini, G., Nelson, M.L. and Marshall, C.C. (eds) JCDL., pp 55–64. ACM, New York
Fagin R., Lotem A. and Naor M. (2003). Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614–656
Fuhr, N., Roelleke, T.: A probabilistic NF2 relational algebra for integrated information retrieval and database systems. In: Tanik, M.M., Bastani, F.B., Gibson, D., Fielding, P.J. (eds.) Proceedings of the 2nd World Conference on Integrated Design and Process Technology (IDPT), Society for Design and Process Science (SDPS), Austin, pp. 17–30 (1996)
Fuhr N. and Rölleke T. (1997). A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Info. Syst. 14(1): 32–66
Forst, J.F., Tombros, A., Roelleke, T.: solving the enterprise trec task with probabilistic data models. In: Proceedings of TREC 2006, pp. xx–yy (2006)
Fuhr N. (1990). A probabilistic framework for vague queries and imprecise information in databases. In: McLeod, D., Sacks-Davis, R. and Schek, H. (eds) Proceedings of the 16th International Conference on Very Large Databases., pp 696–707. Morgan Kaufman, Los Altos
Fuhr N. (1995). Probabilistic datalog—a logic for powerful retrieval methods. In: Fox, E.A., Ingwersen, P. and Fidel, R. (eds) Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 282–290. ACM, New York
Grabs, T., Böhm, K., Schek, H.-J.: Powerdb-ir - information retrieval on top of a database cluster. In: CIKM, pp. 411–418 (2001)
Grabs T., Böhm K. and Schek H.-J. (2004). Powerdb-ir - scalable information retrieval and storage with a cluster of databases. Knowl. Inf. Syst. 6(4): 465–505
Grossman D.A. and Frieder O. (1998). Information Retrieval: Algorithms and Heuristics. Kluwer, Massachusetts
Grossman D.A. and Frieder O. (2004). Information Retrieval. Algorithms and Heuristics, 2nd edn. vol. 15 of The Information Retrieval Series. Springer, Berlin Heidelberg
Grossman D.A., Frieder O., Holmes D.O. and Roberts D.C. (1997). Integrating structured data and text: a relational approach. JASIS 48(2): 122–132
Hristidis, V., Gravano, L., Papakonstantinou, Y.: Efficient ir-style keyword search over relational databases. In: VLDB, pp. 850–861 (2003)
Hiemstra D. (2000). A probabilistic justification for using tf.idf term weighting in information retrieval. Int. J. Digit. Libr. 3(2): 131–139
Hristidis, V., Papakonstantinou, Y.: Discover: keyword search in relational databases. In: VLDB, pp. 670–681 (2002)
Li, C., Chang, K.C.-C., Ilyas, I.F., Song, S.: Ranksql: Query algebra and optimization for relational top-k queries. In: SIGMOD Conference, pp. 131–142 (2005)
Lee, S.K.: An extended relational database model for uncertain and imprecise information. In: Proceedings of the 18th VLDB Conference, pp. 211–220 Morgan Kaufman, Los Altos (1992)
Lakshmanan L.V.S., Leone N., Ross R. and Subrahmanian V.S. (1997). Probview: a flexible probabilistic database system. ACM Trans. Database Syst. 22(3): 419–469
Lalmas, M., Roelleke, T.: Modelling vague content and structure querying in XML retrieval with a probabilistic object-relational framework. In: Proceedings of the 6th International Conference on Flexible Query Answering Systems (FQAS), LNCS, Lyon, Springer, Berlin (2004)
Lalmas M., Roelleke T. and Fuhr N. (2002). Intelligent hypermedia retrieval. In: Szczepaniak, P.S., Segovia, F., and Zadeh, L.A. (eds) Intelligent Exploration of the Web, pp. Springer, Heidelberg
Lafferty, J., Zhai, C.: Probabilistic Relevance Models Based on Document and Query Generation, chap. 1. Kluwer (2002)
Macleod I.A. (1991). Text retrieval and the relational model. J. Am. Soc. Inf. Sci. 42(3): 155–165
Maron M.E. and Kuhns J.L. (1960). On relevance, probabilistic indexing and information retrieval. J. ACM 7: 216–244
Motro A. (1988). Vague: a user interface to relational databases that permits vague queries. ACM Trans. Off. Inf. Syst. 6(3): 187–214
Motro A. (1990). Accommodating imprecision in database systems: Issues and solutions. Sigmod rec. 19(4): 69
Niemi T. and Järvelin K. (1995). A straightforward NF2 relational interface with applications in information retrieval. Inf. Process. Manage. 31(2): 215–231
Ponte J.M. and Croft W.B. (1998). A language modeling approach to information retrieval. In: Bruce Croft, W., Moffat, A., van Rijsbergen, C.J., Wilkinson, R. and Zobel, J. (eds) Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 275–281. ACM, New York
Pfeifer, U., Fuhr, N.: Efficient processing of vague queries using a data stream approach. In: Proceedings of the 18th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 189–198, New York (1995)
Roelleke T., Lübeck R. and Kazai G. (2001). The HySpirit retrieval platform, demonstration. In: Croft, B., Harper, D.J., Kraft, D.H., and Zobel, J. (eds) Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New Orleans
Robertson, S.E.: Term frequency and term value. In: SIGIR, pp. 22–29 (1981)
Robertson S.E. (2004). Understanding inverse document frequency: On theoretical arguments for idf. J. Doc. 60: 503–520
Roelleke T. (1999). POOL: Probabilistic Object-Oriented Logical Representation and Retrieval of Complex Objects. Shaker Verlag, Aachen Dissertation
Roelleke, T.: A frequency-based and a Poisson-based probability of being informative. In: Callan, J., Cormarck, G., Clarke, C., Hawking, D., Smeaton, A. (eds.) Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 227–234 (2003)
Ross, J.G.R., Subrahmanian, V.S.: Probabilistic aggregates. In: 13th International Symposium on Methodologies for Intelligent Systems (ISMIS), Lyon, Foundations of Intelligent Systems. Springer, Heidelberg (2002)
Robertson S.E. and Sparck Jones K. (1976). Relevance weighting of search terms. J. Am. Soc. Inf. Sci. 27: 129–146
Roelleke, T., Wang, J.: A parallel derivation of probabilistic information retrieval models. In: ACM SIGIR (2006)
Robertson S.E., Walker S. and Hancock-Beaulieu M.M. (1995). Large test collection experiments on an operational interactive system: Okapi at TREC. Inf. Process. Manage. 31: 345–360
Suciu, D., Dalvi, N.N.: Foundations of probabilistic answers to queries. In: SIGMOD Conference, p. 963 (2005)
Silberschatz, A., Korth, H.F., Sudarshan, S.: Database Systems Concepts, 4th Edn. McGraw-Hill Higher Education (2002)
Schek, H.-J., Pistor, P.: Data structures for an integrated database management and information retrieval system. In: Proceedings of the 8th International Conference on Very Large Data Bases, pp. 197–207, Morgan Kaufman, Los Altos (1982)
Turtle H. and Croft W.B. (1990). Inference networks for document retrieval. In: Vidick, J.-L. (eds) Proceedings of the 13th International Conference on Research and Development in Information Retrieval., pp 1–24. ACM, New York
Theobald, M., Weikum, G., Schenkel, R.: Top-k query evaluation withprobabilistic guarantees. In: VLDB, pp. 648–659 (2004)
van Rijsbergen C.J. (1986). A non-classical logic for information retrieval. Comput. J. 29(6): 481–485
Wong S.K.M. and Yao Y.Y. (1995). On modeling information retrieval with probabilistic inference. ACM Trans. Inf. Sys. 13(1): 38–68
Yu C.T., Lam K. and Salton G. (1982). Term weighting in information retrieval using the term precision model. J. ACM 29(1): 152–170
Zhai, C.X., Lafferty, J.D.: Two-stage language models for information retrieval. In: SIGIR, pp. 49–56 (2002)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Roelleke, T., Wu, H., Wang, J. et al. Modelling retrieval models in a probabilistic relational algebra with a new operator: the relational Bayes. The VLDB Journal 17, 5–37 (2008). https://doi.org/10.1007/s00778-007-0073-y
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-007-0073-y