Abstract
Perhaps the most basic synopsis of a data stream is a sample of elements from the stream. A key benefit of such a sample is its flexibility: the sample can serve as input to a wide variety of analytical procedures and can be reduced further to provide many additional data synopses. If, in particular, the sample is collected using random sampling techniques, then the sample can form a basis for statistical inference about the contents of the stream. This chapter surveys some basic sampling and inference techniques for data streams. We focus on general methods for materializing a sample; later chapters provide specialized sampling methods for specific analytic tasks.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
W.G. Cochran, Sampling Techniques, 3rd edn. (Wiley, New York, 1977)
L. Kish, Survey Sampling (Wiley, New York, 1965)
C.E. Särndal, B. Swensson, J. Wretman, Model Assisted Survey Sampling (Springer, New York, 1992)
M.E. Thompson, Theory of Sample Surveys (Chapman & Hall, London, 1997)
S.K. Thompson, Sampling (Wiley, New York, 2002)
B. Efron, R.J. Tibshirani, An Introduction to the Bootstrap (Chapman & Hall, New York, 1993)
R. Gemulla, W. Lehner, Deferred maintenance of disk-based random samples, in Proc. EDBT. Lecture Notes in Computer Science (Springer, Berlin, 2006), pp. 423–441
A. Pol, C.M. Jermaine, S. Arumugam, Maintaining very large random samples using the geometric file. VLDB J. 17(5), 997–1018 (2008)
P.G. Brown, P.J. Haas, Techniques for warehousing of sample data, in Proc. 22nd ICDE (2006)
P.J. Haas, The need for speed: speeding up DB2 using sampling. IDUG Solut. J. 10, 32–34 (2003)
S. Chaudhuri, R. Motwani, V.R. Narasayya, Random sampling for histogram construction: how much is enough? in Proc. ACM SIGMOD (1998), pp. 436–447
D. DeWitt, J.F. Naughton, D.A. Schneider, S. Seshadri, Practical skew handling algorithms for parallel joins, in Proc. 19th VLDB (1992), pp. 27–40
P.J. Haas, C. König, A bi-level Bernoulli scheme for database sampling, in Proc. ACM SIGMOD (2004), pp. 275–286
W. Hou, G. Ozsoyoglu, B. Taneja, Statistical estimators for relational algebra expressions, in Proc. Seventh PODS (1988), pp. 276–287
P.J. Haas, J.M. Hellerstein, Ripple joins for online aggregation, in Proc. ACM SIGMOD (1999), pp. 287–298
S. Acharya, P. Gibbons, V. Poosala, S. Ramaswamy, Join synopses for approximate query answering, in Proc. ACM SIGMOD (1999), pp. 275–286
S. Acharya, P. Gibbons, V. Poosala, Congressional samples for approximate answering of group-by queries, in Proc. ACM SIGMOD (2000), pp. 487–498
S. Chaudhuri, G. Das, M. Datar, R. Motwani, V.R. Narasayya, Overcoming limitations of sampling for aggregation queries, in Proc. Seventeenth ICDE (2001), pp. 534–542
S. Chaudhuri, R. Motwani, V.R. Narasayya, On random sampling over joins, in Proc. ACM SIGMOD (1999), pp. 263–274
S. Ganguly, P.B. Gibbons, Y. Matias, A. Silberschatz, Bifocal sampling for skew-resistant join size estimation, in Proc. ACM SIGMOD (1996), pp. 271–281
V. Ganti, M.L. Lee, R. Ramakrishnan, ICICLES: self-tuning samples for approximate query answering, in Proc. 26th VLDB (2000), pp. 176–187
P.J. Haas, A.N. Swami, Sampling-based selectivity estimation using augmented frequent value statistics, in Proc. Eleventh ICDE (1995), pp. 522–531
C. Jermaine, Robust estimation with sampling and approximate pre-aggregation, in Proc. 29th VLDB (2003), pp. 886–897
W. Hou, G. Ozsoyoglu, B. Taneja, Processing aggregate relational queries with hard time constraints, in Proc. ACM SIGMOD (1989), pp. 68–77
F. Olken, D. Rotem, Simple random sampling from relational databases, in Proc. 12th VLDB (1986), pp. 160–169
F. Olken, D. Rotem, Random sampling from \(B^{+}\) trees, in Proc. 15th VLDB (1989), pp. 269–277
F. Olken, D. Rotem, Maintenance of materialized views of sampling queries, in Proc. Eighth ICDE (1992), pp. 632–641
F. Olken, D. Rotem, Sampling from spatial databases, in Proc. Ninth ICDE (1993), pp. 199–208
F. Olken, D. Rotem, P. Xu, Random sampling from hash files, in Proc. ACM SIGMOD (1990), pp. 375–386
P.J. Haas, J.F. Naughton, S. Seshadri, A.N. Swami, Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci. 52, 550–569 (1996)
P.J. Haas, J.F. Naughton, A.N. Swami, On the relative cost of sampling for join selectivity estimation, in Proc. Thirteenth PODS (1994), pp. 14–24
P.J. Haas, A.N. Swami, Sequential sampling procedures for query size estimation, in Proc. ACM SIGMOD (1992), pp. 1–11
W. Hou, G. Ozsoyoglu, E. Dogdu, Error-constrained COUNT query evaluation in relational databases, in Proc. ACM SIGMOD (1991), pp. 278–287
R.J. Lipton, J.F. Naughton, Query size estimation by adaptive sampling, in Proc. Ninth PODS (1990), pp. 40–46
R.J. Lipton, J.F. Naughton, D.A. Schneider, Practical selectivity estimation through adaptive sampling, in Proc. ACM SIGMOD (1990), pp. 1–11
R.J. Lipton, J.F. Naughton, D.A. Schneider, S. Seshadri, Efficient sampling strategies for relational database operations. Theor. Comput. Sci. 116, 195–226 (1993)
K.D. Seppi, J.W. Barnes, C.N. Morris, A Bayesian approach to database query optimization. ORSA J. Comput. 5, 410–419 (1993)
J.M. Hellerstein, P.J. Haas, H.J. Wang, Online aggregation, in Proc. ACM SIGMOD (1997), pp. 171–182
C. Jermaine, A. Dobra, S. Arumugam, S. Joshi, A. Pol, A disk-based join with probabilistic guarantees, in Proc. ACM SIGMOD (2005)
C.M. Jermaine, S. Arumugam, A. Pol, A. Dobra, Scalable approximate query processing with the DBO engine, in Proc. ACM SIGMOD (2007), pp. 725–736
C. Jermaine, A. Dobra, A. Pol, S. Joshi, Online estimation for subset-based SQL queries, in Proc. 31st VLDB (2005), pp. 745–756
G. Luo, C. Ellman, P.J. Haas, J.F. Naughton, A scalable hash ripple join algorithm, in Proc. ACM SIGMOD (2002), pp. 252–262
A. Pol, C. Jermaine, Relational confidence bounds are easy with the bootstrap, in Proc. ACM SIGMOD (2005)
P.G. Brown, P.J. Haas, BHUNT: automatic discovery of fuzzy algebraic constraints in relational data, in Proc. 29th VLDB (2003), pp. 668–679
I.F. Ilyas, V. Markl, P.J. Haas, P.G. Brown, A. Aboulnaga, CORDS: automatic discovery of correlations and soft functional dependencies, in Proc. ACM SIGMOD (2004), pp. 647–658
P. Brown, P. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, Y. Sismanis, Toward automated large-scale information integration and discovery, in Data Management in a Connected World, ed. by T. Härder, W. Lehner (Springer, New York, 2005)
M. Charikar, S. Chaudhuri, R. Motwani, V.R. Narasayya, Towards estimation error guarantees for distinct values, in Proc. Nineteenth PODS (2000), pp. 268–279
P.J. Haas, L. Stokes, Estimating the number of classes in a finite population. J. Am. Stat. Assoc. 93, 1475–1487 (1998)
M. Wu, C. Jermaine, A Bayesian method for guessing the extreme values in a data set, in Proc. 33rd VLDB (2007), pp. 471–482
P. Billingsley, Probability and Measure, 2nd edn. (Wiley, New York, 1986)
A.M. Law, Simulation Modeling and Analysis, 4th edn. (McGraw-Hill, New York, 2007)
D.E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms (Addison-Wesley, Reading, 1969)
A.I. McLeod, D.R. Bellhouse, A convenient algorithm for drawing a simple random sample. Appl. Stat. 32, 182–184 (1983)
J.S. Vitter, Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 37–57 (1985)
J.S. Vitter, Faster methods for random sampling. Commun. ACM 27, 703–718 (1984)
M.T. Chao, A general purpose unequal probability sampling plan. Biometrika 69, 653–656 (1982)
H. Brönnimann, B. Chen, M. Dash, P.J. Haas, Y. Qiao, P. Scheuermann, Efficient data reduction methods for on-line association rule discovery, in Data Mining: Next Generation Challenges and Future Directions, ed. by H. Kargupta, A. Joshi, K. Sivakumar, Y. Yesha (AAAI Press, Menlo Park, 2004)
B. Babcock, M. Datar, R. Motwani, Sampling from a moving window over streaming data, in Proc. 13th SODA (2002), pp. 633–634
T. Hagerup, C. Rub, A guided tour of Chernoff bounds. Inf. Process. Lett. 33, 305–308 (1990)
W. Feller, An Introduction to Probability Theory and Its Applications, 3rd edn., vol. 1 (Wiley, New York, 1968)
R. Gemulla, W. Lehner, P.J. Haas, Maintaining bounded-size sample synopses of evolving datasets. VLDB J. 17(2), 173–202 (2008)
R. Gemulla, W. Lehner, P.J. Haas, Maintaining Bernoulli samples over evolving multisets, in Proc. Twenty Sixth PODS (2007), pp. 93–102
G. Cormode, S. Muthukrishnan, I. Rozenbaum, Summarizing and mining inverse distributions on data streams via dynamic inverse sampling, in Proc. 31st VLDB (2005), pp. 25–36
G. Frahling, P. Indyk, C. Sohler, Sampling in dynamic data streams and applications, in Proc. 21st ACM Symp. Comput. Geom. (2005), pp. 142–149
P.B. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, in Proc. 27th VLDB (2001), pp. 541–550
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Haas, P.J. (2016). Data-Stream Sampling: Basic Techniques and Results. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_2
Download citation
DOI: https://doi.org/10.1007/978-3-540-28608-0_2
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28607-3
Online ISBN: 978-3-540-28608-0
eBook Packages: Computer ScienceComputer Science (R0)