Abstract
In many streaming scenarios, we need to measure and quantify the data that is seen. For example, we may want to measure the number of distinct IP addresses seen over the course of a day, compute the difference between incoming and outgoing transactions in a database system or measure the overall activity in a sensor network. In all of these examples, data can be modeled as massive, dynamic vectors. Here, we are interested in the well-known and widely used \(L_{p}\) norms of such vectors. These encompass the familiar Euclidean (root of sum of squares) and Manhattan (sum of absolute values) norms. We show how such \(L_{p}\) norms can be efficiently estimated for massive vectors presented in the streaming model. This is achieved by making succinct sketches of the data, which can be used as synopses of the vectors they summarize.
Access provided by Autonomous University of Puebla. Download to read the full chapter text
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
N. Ailon, B. Chazelle, Approximate nearest neighbors and the fast Johnson–Lindenstrauss transform, in Proceedings of the ACM Symposium on Theory of Computing (2006)
N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, in Proceedings of the ACM Symposium on Theory of Computing (1996), pp. 20–29. Journal version in J. Comput. Syst. Sci. 58, 137–147 (1999)
Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, L. Trevisian, Counting distinct elements in a data stream, in Proceedings of RANDOM 2002 (2002), pp. 1–10
B. Brinkman, M. Charikar, On the impossibility of dimensionality reduction in \(L_{1}\), in IEEE Conference on Foundations of Computer Science (2003), pp. 514–523
A. Chakrabarti, G. Cormode, A. McGregor, A near-optimal algorithm for computing the entropy of a stream, in Proceedings of ACM-SIAM Symposium on Discrete Algorithms (2007)
J.M. Chambers, C.L. Mallows, B.W. Stuck, A method for simulating stable random variables. J. Am. Stat. Assoc. 71(354), 340–344 (1976)
M. Charikar, K. Chen, M. Farach-Colton, Finding frequent items in data streams, in Procedings of the International Colloquium on Automata, Languages and Programming (ICALP) (2002), pp. 693–703
M.S. Charikar, Similarity estimation techniques from rounding algorithms, in Proceedings of the ACM Symposium on Theory of Computing (2002), pp. 380–388
G. Cormode, Sequence distance embeddings. PhD thesis, University of Warwick (2003)
G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan, Comparing data streams using Hamming norms, in Proceedings of the International Conference on Very Large Data Bases (2002), pp. 335–345. Journal version in IEEE Trans. Knowl. Data Eng. 15(3), 529–541 (2003)
G. Cormode, P. Indyk, N. Koudas, S. Muthukrishnan, Fast mining of tabular data via approximate distance computations, in Proceedings of the International Conference on Data Engineering (2002), pp. 605–616
G. Cormode, S. Muthukrishnan, The string edit distance matching problem with moves, in Proceedings of ACM–SIAM Symposium on Discrete Algorithms (2002), pp. 667–676
G. Cormode, S. Muthukrishnan, Estimating dominance norms of multiple data streams, in Proceedings of the European Symposium on Algorithms (ESA). LNCS, vol. 2838 (2003)
G. Cormode, S. Muthukrishnan, S.C. Ṣahinalp, Permutation editing and matching via embeddings, in Proceedings of 28th International Colloquium on Automata, Languages and Programming, vol. 2076 (2001), pp. 481–492
C. Cranor, T. Johnson, O. Spatscheck, V. Shkapenyuk, Gigascope: a stream database for network applications, in Proceedings of ACM SIGMOD International Conference on Management of Data (2003), pp. 647–651
M. Datar, N. Immorlica, P. Indyk, V.S. Mirrokni, Locality-sensitive hashing scheme based on \(p\)-stable distributions, in Symposium on Computational Geometry (2004)
C. Estan, G. Varghese, M. Fisk, Bitmap algorithms for counting active flows on high speed links, in Proceedings of the Internet Measurement Conference (2003), pp. 153–166
J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan, An approximate \(L_{1}\)-difference algorithm for massive data streams, in IEEE Conference on Foundations of Computer Science (1999), pp. 501–511
P. Flajolet, G.N. Martin, Probabilistic counting, in IEEE Conference on Foundations of Computer Science (1983), pp. 76–82. Journal version in J. Comput. Syst. Sci. 31, 182–209 (1985)
J. Fong, M. Strauss, An approximate \(L_{p}\)-difference algorithm for massive data streams, in Symposium on Theoretical Aspects of Computer Science (STACS) (2000), pp. 193–204
S. Ganguly, B. Lakshminath, Estimating entropy over data streams, in Proceedings of the European Symposium on Algorithms (ESA) (2006)
M. Garofalakis, A. Kumar, Correlating XML data streams using tree-edit distance embeddings, in Proceedings of ACM Principles of Database Systems (2003), pp. 143–154
P. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, in Proceedings of the International Conference on Very Large Data Bases (2001), pp. 541–550
P. Gibbons, S. Tirthapura, Estimating simple functions on the union of data streams, in ACM Symposium on Parallel Algorithms and Architectures (SPAA) (2001), pp. 281–290
A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss, Fast, small-space algorithms for approximate histogram maintenance, in Proceedings of the ACM Symposium on Theory of Computing (2002), pp. 389–398
P. Indyk, Stable distributions, pseudorandom generators, embeddings and data stream computation, in IEEE Conference on Foundations of Computer Science (2000), pp. 189–197
P. Indyk, Algorithmic aspects of geometric embeddings (invited tutorial), in IEEE Conference on Foundations of Computer Science (2001), pp. 10–35
P. Indyk, Algorithms for dynamic geometric problems over data streams, in Proceedings of the ACM Symposium on Theory of Computing (2004)
P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in Proceedings of the ACM Symposium on Theory of Computing (1998), pp. 604–613
W.B. Johnson, J. Lindenstrauss, Extensions of Lipschitz mapping into Hilbert space. Contemp. Math. 26, 189–206 (1984)
P. Li, Very sparse stable random projections, estimators and tail bounds for stable random projections. Technical report (2006). arXiv:cs.DS/0611114
P. Li, T. Hastie, K.W. Church, Nonlinear estimators and tail bounds for dimension reduction in \(L_{1}\) using Cauchy random projections. J. Mach. Learn. Res. (2007)
J. Matoušek, Lectures on Discrete Geometry (Springer, Berlin, 2002)
R. Motwani, P. Raghavan, Randomized Algorithms (Cambridge University Press, Cambridge, 1995)
S. Muthukrishnan, Data streams: algorithms and applications, in Proceedings of ACM–SIAM Symposium on Discrete Algorithms (2003)
A. Naor, G. Schechtman, Planar earthmover is not in \(L_{1}\), in IEEE Conference on Foundations of Computer Science (2006)
N. Nisan, Pseudorandom generators for space-bounded computation. Combinatorica 12, 449–461 (1992)
J. Nolan, Stable distributions. Available from http://academic2.american.edu/~jpnolan/stable/chap1.ps
M. Thorup, Y. Zhang, Tabulation based 4-universal hashing with applications to second moment estimation, in Proceedings of ACM-SIAM Symposium on Discrete Algorithms (2004), pp. 615–624
V.V. Uchaikin, V.M. Zolotarev, Chance and Stability: Stable Distributions and Their Applications (VSP, Utrecht, 1999)
D. Woodruff, Optimal space lower bounds for all frequency moments, in Proceedings of ACM–SIAM Symposium on Discrete Algorithms (2004), pp. 167–175
V.M. Zolotarev, One Dimensional Stable Distributions. Translations of Mathematical Monographs, vol. 65 (Am. Math. Soc., Providence, 1983)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Cormode, G., Indyk, P. (2016). Stable Distributions in Streaming Computations. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_14
Download citation
DOI: https://doi.org/10.1007/978-3-540-28608-0_14
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28607-3
Online ISBN: 978-3-540-28608-0
eBook Packages: Computer ScienceComputer Science (R0)