Abstract
The Matrix Framework is a recent proposal by Information Retrieval (IR) researchers to flexibly represent information retrieval models and concepts in a single multi-dimensional array framework. We provide computational support for exactly this framework with the array database system SRAM (Sparse Relational Array Mapping), that works on top of a DBMS. Information retrieval models can be specified in its comprehension-based array query language, in a way that directly corresponds to the underlying mathematical formulas. SRAM efficiently stores sparse arrays in (compressed) relational tables and translates and optimizes array queries into relational queries. In this work, we describe a number of array query optimization rules. To demonstrate their effect on text retrieval, we apply them in the TREC TeraByte track (TREC-TB) efficiency task, using the Okapi BM25 model as our example. It turns out that these optimization rules enable SRAM to automatically translate the BM25 array queries into the relational equivalent of inverted list processing including compression, score materialization and quantization, such as employed by custom-built IR systems. The use of the high-performance MonetDB/X100 relational backend, that provides transparent database compression, allows the system to achieve very fast response times with good precision and low resource usage.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Ailamaki, A., DeWitt, D., Hill, M., Skounakis, M.: Weaving relations for cache performance. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 169–180, Rome (2001)
Anh V.N. and Moffat A. (2005). Inverted index compression using word-aligned binary codes. Inf. Retr. 8(1): 151–166
Anh, V.N., Moffat, A.: Simplified similarity scoring using term ranks. In: Proceedings of the International Conference on Information Retrieval (ACM SIGIR), pp. 226–233, Salvador (2005)
van Ballegooij, A., de~Vries, A., Kersten, M.L.: RAM: Array processing over a relational DBMS. Tech. Rep. INS-R0301, CWI (2003)
Barroso L.A., Dean J. and Holzle U. (2003). Web search for a planet: the google cluster architecture. IEEE Mio 23(2): 22–28
Baumann, P.: A database array algebra for spatio-temporal data and beyond. In: Next Generation Information Technologies and Systems, pp. 76–93 (1999)
Boncz, P., Zukowski, M., Nes, N.: MonetDB/X100: Hyper-pipelining query execution. In: Proceedings of the Conference of Innovative Database Research (CIDR), pp. 225–237, Asilomar (2005)
Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: Proceedings of the Conference of Information and Knowledge Management (CIKM), pp. 426–434, New Orleans (2003)
Buneman P., Libkin L., Suciu D., Tannen V. and Wong L. (1994). Comprehension syntax. SIGMOD Record 23(1): 87–96
Chakravarthy U.S., Grant J. and Minker J. (1990). Logic-based approach to semantic query optimization. ACM Trans. Database Syst. 15(2): 162–207
Clarke, C.L.A., Craswell, N., Soboroff, I.: Overview of the TREC 2004 terabyte track. In: Proceedings of the Text Retrieval Conference (TREC), Gaithersburg (2004)
Clarke, C.L.A., Scholer, F., Soboroff, I.: The TREC 2005 terabyte track. In: Proceedings of the Text Retrieval Conference (TREC), Gaithersburg (2005)
Demmel, J., Dongarra, J., Ruhe, A., van~der Vorst, H.: Templates for the solution of algebraic eigenvalue problems: a practical guide. Society for Industrial and Applied Mathematics, Philadelphia (2000)
Eisenberg A., Melton J., Kulkarni K., Michels J.E. and Zemke F. (2004). Sql:2003 has been published. SIGMOD Rec. 33(1): 119–126
Furtado, P., Baumann, P.: Storage of multidimensional arrays based on arbitrary tiling. In: Proc. of the 15th International Conference on Data Engineering, ICDE99, pp. 408–489 (1999)
Galindo-Legaria C. and Rosenthal A. (1997). Outerjoin simplification and reordering for query optimization. ACM Trans. Database Syst. 22(1): 43–74
Goldstein, J., Ramakrishnan, R., Shaft, U.: Compressing relations and indexes. In: Proceedings of the International Conference on Data Engineering (IEEE ICDE), Orlando (1998)
Golub G.H. and Loan C.F.V. (1983). Matrix Computations. Johns Hopkins University Press, Baltimore
Grabs T., Bhoem K. and Schek H.J. (2004). PowerDB-IR: scalable information retrieval and storage with a cluster of databases. Knowl. Inf. Systems 6(4): 465–505
Graefe G. (1994). Volcano—an extensible and parallel query evaluation system. IEEE TKDE 6(1): 120–135
Grossman D.A., Frieder O., Holmes D.O. and Roberts D.C. (1997). Integrating structured data and text: a relational approach. JASIS 48(2): 122–132
Hiemstra, D.: A linguistically motivated probabilistic model of information retrieval. In: ECDL, pp. 569–584 (1998)
Howe, B., Maier, D.: Algebraic manipulation of scientific datasets. In: VLDB, pp. 924–935 (2004)
Huffman, D.: A method for construction of minimum redundancy codes. In: Proc. of the IRE, vol.~40, pp. 1098–1101 (1952)
Kotlyar, V., Pingali, K., Stodghill, P.: A relational approach to the compilation of sparse matrix programs. In: Euro-Par, pp. 318–327 (1997)
Lerner, A., Shasha, D.: Aquery: query language for ordered data, optimization techniques, and experiments. In: VLDB, pp. 345–356 (2003)
Libkin, L., Machlin, R., Wong, L.: A query language for multidimensional arrays: Design, implementation, and optimization techniques. In: Proceedings of ACM SIGMOD International Conference on Managing Data, pp. 228–239. ACM Press (1996)
Mahesh, K., Kud, J., Dixen, P.: Oracle at TREC 8: a lexical approach. In: Proceedings of the Text Retrieval Conference (TREC), Gaithersburg (1999)
Maier, D., Vance, B.: A call to order. In: Proceedings of the Twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 25–28, 1993, Washington, pp. 1–16. ACM Press (1993)
Ponte, J., Croft, W.: A language modelling approach to information retrieval. In: Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’98) (1998)
Porter M.F. (1980). An algorithm for suffix stripping. Program 14(3): 130–137
Robertson S.E. and Jones K.S. (1976). Relevance weighting of search terms. J. Am. Soc. Inf. Sci. 27(3): 129–146
Robertson, S.E., Walker, S., Beaulieu, M.: Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. In: Proceedings of the Text Retrieval Conference (TREC), pp. 143–167. Gaithersburg (1998)
Roelleke T., Tsikrika T. and Kazai G. (2005). A general matrix framework for modelling information retrieval. IP&M 42(1): 4–30
Steinbrunn, M., Moerkotte, G., Kemper, A.: Optimizing join orders. University of Passau, Tech. rep. (1993)
Trotman A. (2003). Compressing inverted files. Inf. Retr. 6(1): 5–19
Turtle H. and Flood J. (1995). Query evaluation: strategies and optimizations. Inf. Proces. Manage. 31(6): 831–850
Witten I.H., Moffat A. and Bell T.C. (1999). Managing gigabytes Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco
Zukowski, M., Heman, S., Nes, N., Boncz, P.: Super-scalar RAM-CPU cache compression. In: Proceedings of the International Conference on Data Engineering (IEEE ICDE), Atlanta (2006)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Cornacchia, R., Héman, S., Zukowski, M. et al. Flexible and efficient IR using array databases. The VLDB Journal 17, 151–168 (2008). https://doi.org/10.1007/s00778-007-0071-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-007-0071-0