Abstract
We study an infinite urn scheme with probabilities corresponding to a power function. Urns here represent words from an infinitely large vocabulary. We propose asymptotically normal estimators of the exponent of the power function. The estimators use the number of different elements and a few similar statistics. If we use only one of these statistics, we need to know the asymptotics of a normalizing constant (a function of a parameter), and all the estimators are implicit in this case. If we use two statistics, then the estimators are explicit, but their rates of convergence are lower than those of the estimators with a known normalizing constant.
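The setting can be illustrated with a small simulation. The sketch below is not the authors' estimator: it assumes urn probabilities \(p_i\) that behave like a constant times \(i^{-1/\theta}\) with \(0 < \theta < 1\), and it uses the classical fact from Karlin's framework that the ratio of the number of urns with exactly one ball to the number of non-empty urns tends to \(\theta\). The function names `sample_urn`, `occupancy_statistics` and `ratio_estimate` are hypothetical and introduced only for illustration.

```python
import random
from collections import Counter


def sample_urn(theta, rng):
    """Draw an urn index i >= 1 with P(X >= i) = i ** (-(1 - theta) / theta).

    Assumption for this sketch: this gives p_i of order i ** (-1 / theta),
    a power law over infinitely many urns (no truncation of the vocabulary).
    """
    u = 1.0 - rng.random()  # uniform on (0, 1], so the power below is finite
    return int(u ** (-theta / (1.0 - theta)))


def occupancy_statistics(n, theta, seed=0):
    """Throw n balls; return (non-empty urns, urns with exactly one ball)."""
    rng = random.Random(seed)
    counts = Counter(sample_urn(theta, rng) for _ in range(n))
    distinct = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return distinct, singletons


def ratio_estimate(n, theta, seed=0):
    """Explicit two-statistic estimate of theta: singletons / distinct."""
    distinct, singletons = occupancy_statistics(n, theta, seed)
    return singletons / distinct


if __name__ == "__main__":
    print(ratio_estimate(100_000, 0.5, seed=1))
```

With `theta = 0.5` and a sample of 100,000 balls, the printed ratio should lie in the neighbourhood of 0.5. The slow decay of its fluctuations is in line with the remark that explicit two-statistic estimators converge at a lower rate.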
References
Bahadur, R.R. (1960). On the number of distinct values in a large sample from an infinite discrete distribution. Proceedings of the National Institute of Sciences of India 26A, Supp II, 67–75.
Barbour, A.D. (2009). Univariate approximations in the infinite occupancy scheme. Alea 6, 415–433.
Barbour, A.D. and Gnedin, A.V. (2009). Small counts in the infinite occupancy scheme. Electron. J. Probab. 14, 365–384.
Ben-Hamou, A., Boucheron, S. and Gassiat, E. (2016). Pattern coding meets censoring: (almost) adaptive coding on countable alphabets. arXiv:1608.08367.
Ben-Hamou, A., Boucheron, S. and Ohannessian, M.I. (2017). Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli 23, 249–287.
Bogachev, L.V., Gnedin, A.V. and Yakubovich, Y.V. (2008). On the variance of the number of occupied boxes. Adv. Appl. Math. 40, 401–432.
Boonta, S. and Neammanee, K. (2007). Bounds on random infinite urn model. Bull. Malays. Math. Sci. Soc. Second Series 30.2, 121–128.
Chebunin, M.G. (2014). Estimation of parameters of probabilistic models which is based on the number of different elements in a sample. Sib. Zh. Ind. Mat. 17:3, 135–147. (in Russian).
Chebunin, M. and Kovalevskii, A. (2016). Functional central limit theorems for certain statistics in an infinite urn scheme. Statist. Probab. Lett. 119, 344–348.
Durieu, O. and Wang, Y. (2016). From infinite urn schemes to decompositions of self-similar Gaussian processes. Electron. J. Probab. 21, 43.
Dutko, M. (1989). Central limit theorems for infinite urn models. Ann. Probab. 17, 1255–1263.
Gnedin, A., Hansen, B. and Pitman, J. (2007). Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws. Probab. Surv. 4, 146–171.
Grubel, R. and Hitczenko, P. (2009). Gaps in discrete random samples. J. Appl. Probab. 46, 1038–1051.
Heaps, H.S. (1978). Information retrieval, computational and theoretical aspects. Academic Press.
Herdan, G. (1960). Type-token mathematics. The Hague, Mouton.
Hwang, H.-K. and Janson, S. (2008). Local limit theorems for finite and infinite urn models. Ann. Probab. 36, 992–1022.
Karlin, S. (1967). Central limit theorems for certain infinite urn schemes. J. Math. Mech. 17, 373–401.
Key, E.S. (1992). Rare Numbers. J. Theor. Probab. 5, 375–389.
Key, E.S. (1996). Divergence rates for the number of rare numbers. J. Theor. Probab. 9, 413–428.
Khmaladze, E.V. (2011). Convergence properties in certain occupancy problems including the Karlin-Rouault law. J. Appl. Probab. 48, 1095–1113.
Mandelbrot, B. (1965). Information theory and psycholinguistics. In Scientific psychology (B.B. Wolman and E. Nagel, eds.). Basic Books.
Muratov, A. and Zuyev, S. (2016). Bit flipping and time to recover. J. Appl. Probab. 53, 650–666.
Nicholls, P.T. (1987). Estimation of Zipf parameters. J. Am. Soc. Inf. Sci. 38, 443–445.
Ohannessian, M.I. and Dahleh, M.A. (2012). Rare probability estimation under regularly varying heavy tails. In Proceedings of the 25th Annual Conference on Learning Theory, PMLR, pp. 23:21.1–21.24.
Petersen, A.M., Tenenbaum, J.N., Havlin, S., Stanley, H.E. and Perc, M. (2012). Languages cool as they expand: allometric scaling and the decreasing need for new words. Scientific Reports 2, Article No. 943.
Zakrevskaya, N.S. and Kovalevskii, A.P. (2001). One-parameter probabilistic models of text statistics. Sib. Zh. Ind. Mat. 4:2, 142–153. (in Russian).
Zipf, G.K. (1949). Human behavior and the principle of least effort. University Press, Cambridge.
Acknowledgments
Our research was partially supported by RFBR grant 17-01-00683 and by the program of fundamental scientific research of the SB RAS No. I.1.3., project No. 0314-2016-0008.
Appendix: Functional Central Limit Theorem
For t ∈ [0, 1] and k ≥ 1, let
Theorem 4.
Let us assume that (1.2) holds and ν ≥ 1 is an integer. Then the random process \( \left ((Y^{*}_{n,1}(t), Y_{n,1}(t),\ldots , Y_{n,\nu }(t)), 0 \leq t \leq 1 \right ) \) converges weakly in the uniform metric in D(0, 1) to a (ν + 1)-dimensional Gaussian process with continuous sample paths, zero expectation and covariance function \((c_{ij}(\tau ,t))_{i,j = 0}^{\nu }\),
\(c_{ji}(t, \tau ) = c_{ij}(\tau , t)\).
Proof.
Theorem 3 of Chebunin and Kovalevskii (2016) states the weak convergence of the vector random process \( \left ((Y^{*}_{n,1}(t), \ldots , Y^{*}_{n,\nu }(t)), 0 \leq t \leq 1 \right ) \) in the uniform metric in D(0, 1) to a (ν + 1)-dimensional Gaussian process with continuous sample paths, zero expectation and covariance function \((c^{*}_{ij}(\tau ,t))_{i,j = 0}^{\nu }\).
The main focus of that paper was to prove tightness of the components \((Y^{*}_{n,i}(t), 0 \leq t \leq 1 )\) by Poissonization and by constructing an appropriate inequality for covariances.
As \(Y_{n,i}(t)=Y^{*}_{n,i}(t)-Y^{*}_{n,i-1}(t)\), we obtain tightness of the components \((Y_{n,i}(t), 0 \leq t \leq 1 )\) and calculate \(c_{ij}(\tau , t)\) by the formulas
The proof is complete. □
The limiting (ν + 1)-dimensional Gaussian process is self-similar with Hurst parameter H = θ/2 < 1/2. Its first component coincides in distribution with the first component of the limiting process in Theorem 1 in Durieu and Wang (2016).
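For reference, the self-similarity property can be written out explicitly. The notation \(Z = (Z_{0}, \ldots , Z_{\nu })\) for the limiting process is introduced here only for illustration, and the usual definition of self-similarity for vector processes (a common exponent for all components) is assumed. Under that assumption, for every \(0 < a \leq 1\),

\[ \left (Z(at), 0 \leq t \leq 1 \right ) \overset{d}{=} \left (a^{H} Z(t), 0 \leq t \leq 1 \right ), \qquad H = \theta /2, \]

so that the covariance function scales as

\[ c_{ij}(a\tau , at) = a^{2H} c_{ij}(\tau , t) = a^{\theta } c_{ij}(\tau , t), \qquad 0 \leq \tau , t \leq 1. \]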
We need the following corollary to calculate the limiting variance in Theorem 2.
Corollary 3.
Under the assumptions of Theorem 4, the random vector \((Y^{*}_{n,1}(1), Y_{n,1}(1))\) converges weakly to a normal vector with zero mean and covariance matrix
Cite this article
Chebunin, M., Kovalevskii, A. Asymptotically Normal Estimators for Zipf’s Law. Sankhya A 81, 482–492 (2019). https://doi.org/10.1007/s13171-018-0135-9
DOI: https://doi.org/10.1007/s13171-018-0135-9