Flexible and efficient IR using array databases

Cornacchia, Roberto; Héman, Sándor; Zukowski, Marcin; de Vries, Arjen P.; Boncz, Peter

doi:10.1007/s00778-007-0071-0

Flexible and efficient IR using array databases

Special Issue Paper
Open access
Published: 29 September 2007

Volume 17, pages 151–168, (2008)
Cite this article

Download PDF

You have full access to this open access article

The VLDB Journal Aims and scope Submit manuscript

Flexible and efficient IR using array databases

Download PDF

Roberto Cornacchia¹,
Sándor Héman¹,
Marcin Zukowski¹,
Arjen P. de Vries¹ &
…
Peter Boncz¹

836 Accesses
36 Citations
3 Altmetric
Explore all metrics

Abstract

The Matrix Framework is a recent proposal by Information Retrieval (IR) researchers to flexibly represent information retrieval models and concepts in a single multi-dimensional array framework. We provide computational support for exactly this framework with the array database system SRAM (Sparse Relational Array Mapping), that works on top of a DBMS. Information retrieval models can be specified in its comprehension-based array query language, in a way that directly corresponds to the underlying mathematical formulas. SRAM efficiently stores sparse arrays in (compressed) relational tables and translates and optimizes array queries into relational queries. In this work, we describe a number of array query optimization rules. To demonstrate their effect on text retrieval, we apply them in the TREC TeraByte track (TREC-TB) efficiency task, using the Okapi BM25 model as our example. It turns out that these optimization rules enable SRAM to automatically translate the BM25 array queries into the relational equivalent of inverted list processing including compression, score materialization and quantization, such as employed by custom-built IR systems. The use of the high-performance MonetDB/X100 relational backend, that provides transparent database compression, allows the system to achieve very fast response times with good precision and low resource usage.

Article PDF

Strategies for array data retrieval from a relational back-end based on access patterns

Article Open access 30 March 2020

Towards a General Array Database Benchmark: Measuring Storage Access

Hierarchical Bitmap Indexing for Range Queries on Multidimensional Arrays

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Ailamaki, A., DeWitt, D., Hill, M., Skounakis, M.: Weaving relations for cache performance. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 169–180, Rome (2001)
Anh V.N. and Moffat A. (2005). Inverted index compression using word-aligned binary codes. Inf. Retr. 8(1): 151–166
Article Google Scholar
Anh, V.N., Moffat, A.: Simplified similarity scoring using term ranks. In: Proceedings of the International Conference on Information Retrieval (ACM SIGIR), pp. 226–233, Salvador (2005)
van Ballegooij, A., de~Vries, A., Kersten, M.L.: RAM: Array processing over a relational DBMS. Tech. Rep. INS-R0301, CWI (2003)
Barroso L.A., Dean J. and Holzle U. (2003). Web search for a planet: the google cluster architecture. IEEE Mio 23(2): 22–28
Article Google Scholar
Baumann, P.: A database array algebra for spatio-temporal data and beyond. In: Next Generation Information Technologies and Systems, pp. 76–93 (1999)
Boncz, P., Zukowski, M., Nes, N.: MonetDB/X100: Hyper-pipelining query execution. In: Proceedings of the Conference of Innovative Database Research (CIDR), pp. 225–237, Asilomar (2005)
Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evaluation using a two-level retrieval process. In: Proceedings of the Conference of Information and Knowledge Management (CIKM), pp. 426–434, New Orleans (2003)
Buneman P., Libkin L., Suciu D., Tannen V. and Wong L. (1994). Comprehension syntax. SIGMOD Record 23(1): 87–96
Article Google Scholar
Chakravarthy U.S., Grant J. and Minker J. (1990). Logic-based approach to semantic query optimization. ACM Trans. Database Syst. 15(2): 162–207
Article Google Scholar
Clarke, C.L.A., Craswell, N., Soboroff, I.: Overview of the TREC 2004 terabyte track. In: Proceedings of the Text Retrieval Conference (TREC), Gaithersburg (2004)
Clarke, C.L.A., Scholer, F., Soboroff, I.: The TREC 2005 terabyte track. In: Proceedings of the Text Retrieval Conference (TREC), Gaithersburg (2005)
Demmel, J., Dongarra, J., Ruhe, A., van~der Vorst, H.: Templates for the solution of algebraic eigenvalue problems: a practical guide. Society for Industrial and Applied Mathematics, Philadelphia (2000)
Eisenberg A., Melton J., Kulkarni K., Michels J.E. and Zemke F. (2004). Sql:2003 has been published. SIGMOD Rec. 33(1): 119–126
Article Google Scholar
Furtado, P., Baumann, P.: Storage of multidimensional arrays based on arbitrary tiling. In: Proc. of the 15th International Conference on Data Engineering, ICDE99, pp. 408–489 (1999)
Galindo-Legaria C. and Rosenthal A. (1997). Outerjoin simplification and reordering for query optimization. ACM Trans. Database Syst. 22(1): 43–74
Article Google Scholar
Goldstein, J., Ramakrishnan, R., Shaft, U.: Compressing relations and indexes. In: Proceedings of the International Conference on Data Engineering (IEEE ICDE), Orlando (1998)
Golub G.H. and Loan C.F.V. (1983). Matrix Computations. Johns Hopkins University Press, Baltimore
MATH Google Scholar
Grabs T., Bhoem K. and Schek H.J. (2004). PowerDB-IR: scalable information retrieval and storage with a cluster of databases. Knowl. Inf. Systems 6(4): 465–505
Google Scholar
Graefe G. (1994). Volcano—an extensible and parallel query evaluation system. IEEE TKDE 6(1): 120–135
Google Scholar
Grossman D.A., Frieder O., Holmes D.O. and Roberts D.C. (1997). Integrating structured data and text: a relational approach. JASIS 48(2): 122–132
Article Google Scholar
Hiemstra, D.: A linguistically motivated probabilistic model of information retrieval. In: ECDL, pp. 569–584 (1998)
Howe, B., Maier, D.: Algebraic manipulation of scientific datasets. In: VLDB, pp. 924–935 (2004)
Huffman, D.: A method for construction of minimum redundancy codes. In: Proc. of the IRE, vol.~40, pp. 1098–1101 (1952)
Kotlyar, V., Pingali, K., Stodghill, P.: A relational approach to the compilation of sparse matrix programs. In: Euro-Par, pp. 318–327 (1997)
Lerner, A., Shasha, D.: Aquery: query language for ordered data, optimization techniques, and experiments. In: VLDB, pp. 345–356 (2003)
Libkin, L., Machlin, R., Wong, L.: A query language for multidimensional arrays: Design, implementation, and optimization techniques. In: Proceedings of ACM SIGMOD International Conference on Managing Data, pp. 228–239. ACM Press (1996)
Mahesh, K., Kud, J., Dixen, P.: Oracle at TREC 8: a lexical approach. In: Proceedings of the Text Retrieval Conference (TREC), Gaithersburg (1999)
Maier, D., Vance, B.: A call to order. In: Proceedings of the Twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 25–28, 1993, Washington, pp. 1–16. ACM Press (1993)
Ponte, J., Croft, W.: A language modelling approach to information retrieval. In: Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’98) (1998)
Porter M.F. (1980). An algorithm for suffix stripping. Program 14(3): 130–137
Google Scholar
Robertson S.E. and Jones K.S. (1976). Relevance weighting of search terms. J. Am. Soc. Inf. Sci. 27(3): 129–146
Article Google Scholar
Robertson, S.E., Walker, S., Beaulieu, M.: Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. In: Proceedings of the Text Retrieval Conference (TREC), pp. 143–167. Gaithersburg (1998)
Roelleke T., Tsikrika T. and Kazai G. (2005). A general matrix framework for modelling information retrieval. IP&M 42(1): 4–30
Google Scholar
Steinbrunn, M., Moerkotte, G., Kemper, A.: Optimizing join orders. University of Passau, Tech. rep. (1993)
Trotman A. (2003). Compressing inverted files. Inf. Retr. 6(1): 5–19
Article Google Scholar
Turtle H. and Flood J. (1995). Query evaluation: strategies and optimizations. Inf. Proces. Manage. 31(6): 831–850
Article Google Scholar
Witten I.H., Moffat A. and Bell T.C. (1999). Managing gigabytes Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco
Google Scholar
Zukowski, M., Heman, S., Nes, N., Boncz, P.: Super-scalar RAM-CPU cache compression. In: Proceedings of the International Conference on Data Engineering (IEEE ICDE), Atlanta (2006)

Download references

Author information

Authors and Affiliations

CWI, Kruislaan 413, 1098 SJ, Amsterdam, The Netherlands
Roberto Cornacchia, Sándor Héman, Marcin Zukowski, Arjen P. de Vries & Peter Boncz

Authors

Roberto Cornacchia
View author publications
You can also search for this author in PubMed Google Scholar
Sándor Héman
View author publications
You can also search for this author in PubMed Google Scholar
Marcin Zukowski
View author publications
You can also search for this author in PubMed Google Scholar
Arjen P. de Vries
View author publications
You can also search for this author in PubMed Google Scholar
Peter Boncz
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Sándor Héman.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Reprints and permissions

About this article

Cite this article

Cornacchia, R., Héman, S., Zukowski, M. et al. Flexible and efficient IR using array databases. The VLDB Journal 17, 151–168 (2008). https://doi.org/10.1007/s00778-007-0071-0

Download citation

Received: 18 September 2006
Revised: 18 May 2007
Accepted: 24 July 2007
Published: 29 September 2007
Issue Date: January 2008
DOI: https://doi.org/10.1007/s00778-007-0071-0

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Flexible and efficient IR using array databases

Abstract

Article PDF

Similar content being viewed by others

Strategies for array data retrieval from a relational back-end based on access patterns

Towards a General Array Database Benchmark: Measuring Storage Access

Hierarchical Bitmap Indexing for Range Queries on Multidimensional Arrays

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Flexible and efficient IR using array databases

Abstract

Article PDF

Similar content being viewed by others

Strategies for array data retrieval from a relational back-end based on access patterns

Towards a General Array Database Benchmark: Measuring Storage Access

Hierarchical Bitmap Indexing for Range Queries on Multidimensional Arrays

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation