Abstract
Very recently, the unexpected combination of data structures and machine learning has led to the development of a new research area called learned data structures. Their distinguishing trait is the ability to reveal and exploit patterns and trends in the input data, thus achieving better time and space efficiency than previously known data structures. The goal of this chapter is to provide the first comprehensive survey of these results and to stimulate further research in this promising area.
Notes
- 1. We assume that the universe \(\mathcal {U}\) is a range of reals because of the arithmetic operations required by the models. This works for any kind of key that can be mapped to the reals in an order-preserving way, such as integers or strings.
- 2. Private correspondence with the authors clarified that only linear models were used in the hierarchy due to their superior performance.
- 3. Note that a B+-tree also supports insertions and deletions, which require a more complex structure that is both slower and less space-efficient than other static tree-based indexes.
- 4. The authors suggest organising the buffer “like a hash table”. Even though hashing simplifies buffer modifications, we lose the ability to do predecessor or range queries efficiently.
- 5. The authors do not discuss the details of this step, but we assume that it is performed in \(O(\log \tilde{m})\) time via a binary search on the \(\tilde{m}\) sorted key-model pairs.
- 6. We remark that MurmurHash is a well-known hash function which can be implemented in a few dozen lines of C code.
- 7. It has to be noted that for 5K elements a Bloom filter of \({\approx }6\) KB and \(k=7\) hash functions would have sufficed to achieve 1% false positives. Similarly, a Bloom filter of \({\approx }9\) KB and \(k=10\) hash functions would have sufficed to achieve 0.1% false positives.
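The order-preserving mapping mentioned in note 1 can be illustrated with a small sketch (Python; the function name and the fixed width are ours, chosen for illustration): a short byte string is padded and read as a big-endian integer, so that comparing the integers is the same as comparing the strings lexicographically. Integers, being reals already, need no such mapping.

```python
def key_to_real(key: bytes, width: int = 8) -> int:
    """Map a byte string of at most `width` bytes to an integer whose
    numeric order matches the lexicographic order of the keys."""
    if len(key) > width:
        raise ValueError("key too long for the chosen width")
    # Pad with zero bytes on the right, then read as a big-endian integer.
    return int.from_bytes(key.ljust(width, b"\x00"), "big")

keys = [b"apple", b"banana", b"kiwi"]
mapped = [key_to_real(k) for k in keys]
assert mapped == sorted(mapped)  # lexicographic order is preserved
```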
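The binary search conjectured in note 5 can be sketched as follows (Python; `boundaries` and `models` are hypothetical stand-ins for the \(\tilde{m}\) sorted key-model pairs): the model responsible for a key is the one whose starting key is the rightmost boundary not exceeding the key.

```python
import bisect

# Hypothetical data: the first key covered by each model, in sorted order,
# paired with the (stand-in) models themselves.
boundaries = [0, 100, 250, 600]
models = ["model_0", "model_1", "model_2", "model_3"]

def model_for(key: int) -> str:
    """Return the model responsible for `key`, found via binary search
    in O(log m~) time over the sorted boundaries."""
    i = bisect.bisect_right(boundaries, key) - 1
    return models[max(i, 0)]

assert model_for(120) == "model_1"   # 100 <= 120 < 250
assert model_for(600) == "model_3"
```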
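To make note 6 concrete, here is a sketch of the 32-bit MurmurHash3 variant, transcribed in Python rather than C for brevity (the explicit masking emulates C's 32-bit unsigned arithmetic); a production implementation would use the original C code.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3, x86 32-bit variant."""
    M = 0xFFFFFFFF
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & M
    nblocks = len(data) // 4
    # Body: mix each 4-byte little-endian block into the hash state.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i : 4 * i + 4], "little")
        k = (k * c1) & M
        k = ((k << 15) | (k >> 17)) & M  # rotate left by 15
        k = (k * c2) & M
        h ^= k
        h = ((h << 13) | (h >> 19)) & M  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & M
    # Tail: mix the remaining 1-3 bytes, if any.
    tail = data[4 * nblocks :]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & M
        k = ((k << 15) | (k >> 17)) & M
        k = (k * c2) & M
        h ^= k
    # Finalisation: avalanche the bits of the state.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & M
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & M
    h ^= h >> 16
    return h
```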
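The figures in note 7 follow from the standard Bloom filter formulas \(m = -n \ln p / (\ln 2)^2\) and \(k = (m/n) \ln 2\), which give the optimal number of bits and hash functions for \(n\) keys at false-positive rate \(p\). A quick check in Python:

```python
import math

def bloom_parameters(n: int, p: float) -> tuple:
    """Optimal number of bits m and hash functions k for a Bloom filter
    holding n keys with target false-positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

m, k = bloom_parameters(5000, 0.01)   # roughly 5.9 KB of bits, k = 7
m, k = bloom_parameters(5000, 0.001)  # roughly 8.8 KB of bits, k = 10
```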
Acknowledgements
Part of this work has been supported by the Italian MIUR PRIN project “Multicriteria Data Structures and Algorithms: from compressed to learned indexes, and beyond” (Prot. 2017WR7SHH), by Regione Toscana (under POR FSE 2014/2020), and by the European Integrated Infrastructure for Social Mining and Big Data Analytics (SoBigData++, Grant Agreement # 871042).
Copyright information
© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Ferragina, P., Vinciguerra, G. (2020). Learned Data Structures. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds) Recent Trends in Learning From Data. Studies in Computational Intelligence, vol 896. Springer, Cham. https://doi.org/10.1007/978-3-030-43883-8_2
Print ISBN: 978-3-030-43882-1
Online ISBN: 978-3-030-43883-8