Abstract
The statistical extraction of multiwords (n-grams) from natural language corpora is challenged by computationally heavy search and indexing, which can be improved by low-error prediction of the n-gram frequency distributions. For different n-gram sizes (\(n\!\ge \!1\)), we model the sizes of the groups of equal-frequency n-grams at the low frequencies, \(k=1, 2,\ldots \), by predicting the influence of the corpus size upon the Zipf’s law exponent and the n-gram group size. The average relative errors of the model predictions, from 1-grams up to 6-grams, are near \(4\%\) for English and French corpora ranging from 62 million to 8.6 billion words.
Acknowledgements: FCT MCTES, NOVA LINCS UIDB/04516/2020, and Carlos Gonçalves.
1 Introduction
Relevant Expressions (RE) are semantically meaningful n-grams (\(n\ge 1\)), such as “oceanography” or “oil crisis”, useful in document classification [15] and other n-gram applications. However, most word sequences in a corpus are not relevant. Statistical RE extraction from texts, e.g. [7, 18], measures the cohesion among the n-grams within each distinct multiword; its performance benefits from predicting the n-gram frequency distributions. Low-frequency n-grams account for a significant proportion of the distinct n-grams in a text, as well as of the RE. Assuming, for language L and n-gram size n, a finite vocabulary V(L, n) in each temporal epoch [9, 16, 17], we model the influence of the corpus size upon the sizes W(k) of the equal-frequency (k) n-gram groups, for \(n\!\ge \!1\), especially at low frequencies. We present results (and a comparison with a Poisson-based model) for English and French Wikipedia corpora (up to 8.6 Gw), for \(1\!\le \!n\le 6\). We then discuss the background, the model, the results and conclusions.
2 Background
Zipf’s law [20] is a good approximation to the word frequency distribution, deviating from real data at high and low frequencies. More accurate approximations remain open issues [1, 2, 6, 8, 11, 12, 13, 14, 19]. Low-frequency words, as well as multiwords, are often ignored. Most studies use truncated corpora data [5, 8], with some exceptions [17]. In models such as [2, 3], the probability of a word occurring k times is given by a power law \(k^{-\gamma }\) corrected for the influence of the corpus size, but they do not consider other n-gram sizes, unlike e.g. [16].
3 The Model
Successive model refinements are shown: \(W_{z}(k)\), from Zipf’s Law; \(W_{\alpha _{d}}\!(k,\!C)\) for corpus size dependence; and \(W^{*}\!(k,C)\) for scaling adjustments.
3.1 \(W_{z}(k)\): The Size of the Frequency Levels from Zipf’s Law
By Zipf’s Law [20], the number of occurrences of the rth most frequent word in a corpus with D distinct words is
\(\alpha \) is a constant \(\sim \! 1\); r is the word rank (\(1\!\le \!r\!\le \!D\)). (1) also applies to n-grams of sizes \(n\!>\!1\), with \(\alpha \) dependent on n (for simplicity, \(\alpha \) replaces \(\alpha (n)\)). The relative frequency of the most frequent n-gram (\(r\!=\!1\)) for each n shows small fluctuations around a value taken as an approximation to its occurrence probability, \(p_{1}\); the absolute frequency is \(f(1)\approx p_{1}\! \cdot C\). Thus \(\ln (f(r))\) would decrease linearly, with slope \(-\alpha \), as \(\ln (r)\) increases. Real distributions deviate from straight lines and show, at their higher ranks, groups of equal-frequency n-grams. \(W\!(k)\) is defined based on Zipf’s law [4, 16]. For a level with frequency k, with its lowest (\(r_{l_{k}}\)) and highest (\(r_{h_{k}}\)) n-gram ranks: \(f(r_{l_{k}})\!=\! f(r_{h_{k}})\!=\!k\) and \(W_{z}(k)=r_{h_{k}}-r_{l_{k}}+1\). The model assumes a minimum observed frequency of 1, i.e. \(f(r_{l_{1}})= f(r_{h_{1}})\!=\!1\) and \(r_{h_{1}}\!=\!D\), and applies only to the higher ranks / lower frequencies where adjacent levels (\(r_{l_{k}}\!=\!r_{h_{k+1}}\!+\!1\)) have consecutive integer frequencies: \(f(r_{h_{k+1}})\!=\! f(r_{h_{k}})+1\).
Then, (2) is obtained, with constant \(\alpha _{z}\).
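The equation bodies are not reproduced here, but the construction above determines the likely shape of (2): with Zipf’s law \(f(r)=f(1)\cdot r^{-\alpha }\) and \(f(D)=1\), the rank at which the frequency crosses k is \(r\approx (f(1)/k)^{1/\alpha }\), so \(W_{z}(k)\approx D\,(k^{-1/\alpha _{z}}-(k+1)^{-1/\alpha _{z}})\). A minimal numerical sketch under that assumption (the values of D and \(\alpha _{z}\) are illustrative, not the paper’s fits):

```python
def w_z(k: int, D: int, alpha: float) -> float:
    """Approximate size of frequency level k under Zipf's law.

    Assumes f(r) = f(1) * r**(-alpha) with f(D) = 1, hence f(1) = D**alpha,
    and level boundaries at the ranks where f(r) crosses k and k + 1.
    """
    f1 = D ** alpha
    return f1 ** (1.0 / alpha) * (k ** (-1.0 / alpha) - (k + 1) ** (-1.0 / alpha))

# Illustrative values, not the paper's fitted parameters.
D, alpha_z = 1_000_000, 1.0
levels = {k: w_z(k, D, alpha_z) for k in range(1, 6)}
# For alpha = 1, W_z(1) = D / 2: half the distinct words occur exactly
# once, matching Booth's observation for low-frequency words [4].
```

At \(k=1\) this recovers the hapax count; for larger k the level sizes decay roughly as \(k^{-(1+1/\alpha )}\).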
For predicting D in a corpus of size C, we use (3), following [16], which shows good agreement with real corpora. For language L and n-gram size n, V(L, n) is the finite vocabulary size; \(K_1\), \(K_2\) are positive constants. If V is assumed infinite, (3) reduces to Heaps’ law.
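Equation (3) itself is not shown above; as a hedged sketch, any saturating form that approaches the finite vocabulary V(L, n) for large C and reduces to Heaps’ law \(D=K_{1}C^{K_{2}}\) as \(V\rightarrow \infty \) captures the stated behaviour. The exponential form below is an illustrative assumption, not necessarily the exact expression of [16]:

```python
import math

def distinct_ngrams(C: float, V: float, K1: float, K2: float) -> float:
    """Predicted number D of distinct n-grams in a corpus of C words.

    Saturating growth toward the finite vocabulary V; for small corpora
    (or V -> infinity) it behaves like Heaps' law, D ~ K1 * C**K2.
    NOTE: the exponential saturation is an illustrative assumption.
    """
    return V * (1.0 - math.exp(-K1 * C ** K2 / V))

# Illustrative constants, not fitted values from Table 1.
V, K1, K2 = 5e6, 40.0, 0.75
heaps = lambda C: K1 * C ** K2  # the infinite-vocabulary limit
```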
3.2 An Analytical Model for the Dependence of \(\alpha \) on Corpus Size
Empirically, \(\alpha _{z}\) is shown to depend on the corpus size. Thus, we consider \(\alpha \) in (1) as a function \(\alpha (C,r)\) of the corpus size and the n-gram rank r:
where \(1\!\le \!r\!\le \!D\), and \(f_{c}(1)\) and \(f_{c}(r)\) are the frequencies, respectively, of the most frequent n-gram and the rth ranked n-gram, in a corpus of size C. In (2) \(\alpha \) is obtained, for each corpus size, by fitting \(W_{z}(1)\) to the empirical level size \(W_{obs}(1)\) (for \(k\!=\!1\)). For that level, \(r_{h_{1}}\!=\!D(C,L,n)\) (denoted D or \(D_{c}\)), and \(f_{c}(D_{c})\!=\!1\), so \(\ln (f_{c}(r))\!=\!0\) in (4) for \(r=D_{c}\). Let \(\alpha (C,D_{c})\) (denoted \(\alpha _{d}(C)\)), be the \(\alpha \) value at rank D. Let \(C_{1}\) be the size of a reference corpus:
The \(2^\mathrm{nd}\) term on the right-hand side of (5) (denoted \( Ref_{c_{1}} \)) is fixed, as it depends only on \(f_{c}(1)\!=\!C_{1}\!\cdot \!p_{1}\) (\(p_{1}\) being the occurrence probability of the most frequent n-gram) and on \(D_{c_{1}}\) from (3). Using Table 1 (Sect. 4.2) and tuning \(\alpha _{d}(C_{1})\) by fitting, for \(C_{1}\), the \(W_{z}(1)\) from (2) to the observed \(W_{obs}(1)\), we find \(\alpha _{d}(C_{1})\) and \(D_{c_{1}}\). Given \(\alpha _{d}(C_{1})\) and \({Ref_{c_{1}}}\), (5) then predicts \(\alpha _{d}(C)\) for a corpus of size C, and \(W_{z}(k)\) (2) leads to \(W_{\alpha _{d}}\!(k,\!C)\) (6), where \(\alpha _{d}(C)\) replaces \(\alpha _{z}\):
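Since (4) and (5) are not reproduced above, the prediction chain can only be sketched. From (4) with \(f_{c}(D_{c})\!=\!1\), \(\alpha _{d}(C)=\ln (f_{c}(1))/\ln (D_{c})=\ln (p_{1}C)/\ln (D_{c})\), and one reconstruction of (5) consistent with the text is \(\alpha _{d}(C)\approx \alpha _{d}(C_{1})+\ln (p_{1}C)/\ln (D_{c})-Ref_{c_{1}}\). A hypothetical sketch (the Heaps-like D estimator and its constants are placeholders):

```python
import math

def alpha_d(C: float, alpha_d_c1: float, p1: float, C1: float,
            D_of=lambda C: 40.0 * C ** 0.75) -> float:
    """Predict the Zipf exponent at rank D for a corpus of size C.

    One reconstruction consistent with the text: alpha_d(C) =
    alpha_d(C1) + ln(p1*C)/ln(D(C)) - Ref_c1, where the reference term
    Ref_c1 = ln(p1*C1)/ln(D(C1)) is fixed for the reference corpus C1.
    D_of is a placeholder distinct-n-gram estimator standing in for (3).
    """
    ref_c1 = math.log(p1 * C1) / math.log(D_of(C1))
    return alpha_d_c1 + math.log(p1 * C) / math.log(D_of(C)) - ref_c1

# Reference values for English 1-grams taken from Sect. 4.2:
# alpha_d(C1) = 1.1595, p1 = 0.06704, C1 = 1.1 Gw.
a1 = alpha_d(1.1e9, 1.1595, 0.06704, 1.1e9)  # equals alpha_d(C1) at C = C1
```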
3.3 \(W^{*}\!(k,C)\): The Dependence of Level Size on Corpus Size
The frequency level size depends on the frequency k and the corpus size C. Firstly, for a given corpus size C, \(\alpha _{z}\) in (2) is tuned for the best fit of \(W_{z}(1)\) to \(W_{obs}(1)\). Apart from the \(W_{obs}(k)\) fluctuations (Fig. 1a), the deviation between \(W_{obs}(k)\) and \(W_{z}(k)\), closely proportional to \(\ln (k)\), motivates the improvement in (7) (Fig. 1a).
\(\beta \) is a constant for each n, obtained from the best fit of W(k) to \(W_{obs}(k)\) for a given corpus. Secondly, for different corpus sizes, Fig. 1b shows the \(W_{obs}(k)\) curves as a function of k; they seem parallel, but a detailed analysis shows otherwise. If, for each \(\ln (W_{obs}(k,C^{*}))\) of the three smaller corpora \(C^{*}\), an offset equal to \(\ln (W_{obs}(1,C))-\ln (W_{obs}(1,C^{*}))\) is added (\(C\!=\!8.6\) Gw being the largest corpus), the resulting curves (omitted due to lack of space) do not coincide, as they should if they were parallel in Fig. 1b. The gap between the curves is proportional to \(\ln (k)\), and, for each \(\ln (k)\) value, the distance in \(\ln (W_{obs}(k))\) between corpora of sizes C and \(C_{1}\) is proportional to \(\log _{2}(C/C_{1})\). The distance between the \(\ln (W(k))\) curves of any two corpora C and \(C_{1}\) is thus approximated by (8), with \(\delta \) constant for each n. Combining (6), (7) and (8) leads to the final model, \(W^{*}\!(k,C)\), given by (9):
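Equations (7)-(9) are omitted above; the two corrections they describe are multiplicative in W and logarithmic in k, suggesting the reconstruction \(W^{*}\!(k,C)\approx W_{\alpha _{d}}\!(k,C)\cdot k^{\beta +\delta \log _{2}(C/C_{1})}\). A hedged sketch under that assumption (all parameter values are illustrative):

```python
import math

def w_zipf(k: int, D: float, alpha: float) -> float:
    """Level size under pure Zipf's law, as sketched for Sect. 3.1."""
    return D * (k ** (-1.0 / alpha) - (k + 1) ** (-1.0 / alpha))

def w_star(k: int, C: float, C1: float, D: float, alpha_d: float,
           beta: float, delta: float) -> float:
    """One reconstruction of the final model W*(k, C) of eq. (9).

    Corrections: k**beta for the ln(k)-proportional deviation (eq. (7))
    and k**(delta * log2(C/C1)) for the corpus-size gap (eq. (8)).
    """
    return w_zipf(k, D, alpha_d) * k ** (beta + delta * math.log2(C / C1))

# Illustrative call; D, alpha_d, beta, delta are NOT the paper's fits.
w1 = w_star(1, 2.2e9, 1.1e9, 1e7, 1.16, -0.05, 0.01)
```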
4 Results and Discussion
4.1 The Poisson-Zipf Model
In the \(W_{P}(k,C)\) model of [17], given by (10), an n-gram ranked r occurs, in a corpus of size C, a number of times following a Poisson distribution [10] with \(\lambda _r\!=\!f(r)\) given by Zipf’s law. \(W(0)\!=\!W_{P}(0,C)\) is the estimated number of unseen n-grams in the corpus, and \(D\!=\!V\!-W\!(0)\) for n-gram vocabulary size V.
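As a sketch of the Poisson-Zipf construction (the vocabulary size, exponent and \(p_{1}\) below are illustrative, not the paper’s values), each rank r contributes the Poisson probability of being observed exactly k times, with mean \(\lambda _{r}=f(r)=p_{1}Cr^{-\alpha }\), and \(W_{P}(k,C)\) sums these contributions over the V ranks:

```python
import math

def w_poisson_zipf(k: int, C: float, V: int, alpha: float, p1: float) -> float:
    """Expected number of n-grams observed exactly k times (Poisson-Zipf).

    Each rank r has Poisson mean lambda_r = f(r) = p1 * C * r**(-alpha)
    (Zipf's law); the level size is the sum of per-rank probabilities.
    """
    total = 0.0
    for r in range(1, V + 1):
        lam = p1 * C * r ** (-alpha)
        total += math.exp(-lam) * lam ** k / math.factorial(k)
    return total

# Small illustrative setting; W(0) estimates the unseen n-grams, so
# D = V - W(0) distinct n-grams are expected in the corpus.
V, C, alpha, p1 = 10_000, 1e5, 1.0, 0.07
unseen = w_poisson_zipf(0, C, V, alpha, p1)
expected_distinct = V - unseen
```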
4.2 Comparison of Results
Complete corpora were built from documents randomly extracted from the English and French Wikipedia. To evaluate the dependence on corpus size, corpus sizes were successively doubled (Table 2). A space was added between the words and each of the following characters: . All inflected word forms were kept.
The Model Calculations. (I) To calculate D(C; L, n) in (3), parameters \(K_1\), \(K_2\) and V(L, n) were found for each language L and n-gram size n (Table 1, also showing the \(\beta \) and \(\delta \) values used in (9)). The V(L, n) value is an estimate of the vocabulary size, such that further increasing it does not significantly reduce the relative error \(((E-O)/O)\cdot 100\%\) between an estimated value (E) and the corresponding observed value (O). Pairs (\(K_1\), \(K_2\)) were found leading to the lowest possible relative error for a selected pair of corpora with sizes close to the lowest and highest corpus sizes in the considered range for each language. (II) To evaluate the relative errors, (9) was applied with k such that the observed level sizes of consecutive frequency levels k and \(k\!+\!1\) are monotonically decreasing, \(W_{obs}(k,C)\!>\!W_{obs}(k\!+\!1,C)\). This avoids the non-monotonic regions of the observed \(\ln (W(k))\) curve (Fig. 1a). We considered a basic set of k values, \(K\!=\!\{1,2,3,4,5,6,7,8,16,32,64,128\}\), constrained (to ensure \(\ln (W(k))\) monotonicity) depending on the corpus size C: for \(C\!<\!250\) Mw, we used \(k\!\le \!16\); for \(C\!<\!1\) Gw, \(k\!\le \!32\); the full K set was used only for \(C\!>\!4\) Gw. We selected corpora of sizes (\(C_{1}\)) 1.1 Gw (English) and 808 Mw (French). (III) The \(\alpha _{d}(C_{1})\) values for n-gram sizes from 1 to 6 are: (English) 1.1595, 1.02029, 0.88825, 0.82532, 0.8117, 0.8027; (French) 1.158825, 1.0203, 0.86605, 0.84275, 0.80818, 0.7569. The empirical values of \(p_1\) for n-gram sizes from 1 to 6 are: (English) 0.06704, 0.03250, 0.0062557, 0.0023395, 0.0017908, 0.0014424; (French) 0.07818, 0.037976, 0.004685, 0.0036897, 0.001971, 0.00072944. (IV) To run \(W_{P}\!(k,C)\), the \(\alpha \) values leading to the lowest relative errors are, for n-gram sizes from 1 to 6: (English) 1.17, 1.02, 0.891, 0.827, 0.814, 0.812; (French) 1.156, 1.01, 0.884, 0.842, 0.806, 0.759.
Table 2 presents the relative errors for the predictions of the frequency level sizes. For each n-gram size, the left column refers to \(W^{*}\!(k,C)\) and the right one to \(W_{P}\!(k,C)\). For each pair (corpus size, n-gram size), it shows the average relative error for the K set used: \(AvgErr(K)=\frac{1}{\Vert K\Vert }\sum _{k\in K}Err(k)\), where \(Err(k)\!=|\frac{W\!(k,C)-W_{obs}\!(k,C)}{W_{obs}\!(k,C)}|\). The average relative errors for \(W^{*}\!(k,C)\) are much lower than for \(W_{P}(k,C)\), which assumes an ideal Zipf’s Law. The line Avg shows the average value of each column over the full range of corpus sizes; the errors are of the same magnitude across the range of n-gram sizes for \(W^{*}\!(k,C)\), but show significant variations in the Avg values for \(W_{P}(k,C)\). The global relative error is the average of the Avg values over the range of n-gram sizes, being around \(4\%\) for \(W^{*}\!(k,C)\). Thus, the \(W^{*}\!(k,C)\) curves (omitted due to lack of space) closely follow the forms of the \(W_{obs}(k,C)\) curves in Fig. 1.
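The evaluation metric above is straightforward to reproduce; a minimal sketch with hypothetical level sizes (the numbers are for illustration only, not from Table 2):

```python
def avg_err(predicted: dict, observed: dict, K: list) -> float:
    """Average relative error over the frequency set K, as in Table 2.

    Err(k) = |W(k, C) - W_obs(k, C)| / W_obs(k, C), averaged over k in K.
    """
    return sum(abs(predicted[k] - observed[k]) / observed[k] for k in K) / len(K)

# Hypothetical observed and predicted level sizes.
obs = {1: 1000.0, 2: 400.0, 4: 150.0}
pred = {1: 960.0, 2: 420.0, 4: 150.0}
err = avg_err(pred, obs, [1, 2, 4])  # (0.04 + 0.05 + 0.0) / 3 = 0.03
```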
5 Conclusions
Estimating n-gram frequency distributions is useful in statistically-based n-gram applications. The proposed model estimates the sizes W(k, C) of the equal-frequency (k) n-gram groups in a corpus of size C, for the low-frequency n-grams. It applies uniformly to different n-gram sizes \(n\!\ge \!1\) and languages, assuming a finite language n-gram vocabulary. It models the dependence of the Zipf’s Law exponent and of W(k, C) on C, agreeing well with n-gram frequency data from unigrams up to hexagrams, from real un-truncated English and French corpora of millions to billions of words. Evaluation on larger corpora is planned.
References
Ausloos, M., Cerqueti, R.: A universal rank-size law. PLoS ONE 11(11) (2016)
Balasubrahmanyan, V.K., Naranan, S.: Algorithmic information, complexity and Zipf’s law. Glottometrics 4, 1–26 (2002)
Bernhardsson, S., da Rocha, L.E.C., Minnhagen, P.: The meta book and size-dependent properties of written language. New J. Phys. 11(12), 123015 (2009)
Booth, A.D.: A law of occurrences for words of low frequency. Inf. Control 10(4), 386–393 (1967)
Brants, T., Popat, A.C., Xu, P., Och, F.J., Dean, J.: Large language models in machine translation. In: Joint Conference on Empirical Methods in NLP and Computational Natural Language Learning, pp. 858–867. ACL (2007)
Cancho, R.F., Solé, R.V.: Two regimes in the frequency of words and the origins of complex lexicons: Zipf’s law revisited*. J. Quant. Linguist. 8(3), 165–173 (2001)
Dias, G.: Multiword unit hybrid extraction. In: ACL Workshop on Multiword Expressions, vol. 18, pp. 41–48. ACL (2003)
Gerlach, M., Altmann, E.G.: Stochastic model for the vocabulary growth in natural languages. Phys. Rev. X 3, 021006 (2013)
Goncalves, C., Silva, J.F., Cunha, J.C.: n-gram cache performance in statistical extraction of relevant terms in large corpora. In: Rodrigues, J.M.F., et al. (eds.) ICCS 2019. LNCS, vol. 11537, pp. 75–88. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22741-8_6
Haight, F.A.: Handbook of the Poisson Distribution. John Wiley & Sons, New York (1967)
Lü, L., Zhang, Z.K., Zhou, T.: Deviation of Zipf’s and Heaps’ laws in human languages with limited dictionary sizes. Sci. Rep. 3(1082), 1–7 (2013)
Mandelbrot, B.: On the theory of word frequencies and on related Markovian models of discourse. In: Structure of Language and its Mathematical Aspects (1953)
Mitzenmacher, M.: A brief history of generative models for power law and lognormal distributions. Internet Math. 1(2), 226–251 (2003)
Piantadosi, S.T.: Zipf’s word frequency law in natural language: a critical review and future directions. Psychonomic Bull. Rev. 21, 1112–1130 (2014)
Silva, J., Mexia, J., Coelho, A., Lopes, G.: Document clustering and cluster topic extraction in multilingual corpora. In: Proceedings 2001 IEEE International Conference on Data Mining, pp. 513–520 (2001)
Silva, J.F., Cunha, J.C.: An empirical model for n-gram frequency distribution in large corpora. In: Lauw, H.W., et al. (eds.) PAKDD 2020. LNCS (LNAI), vol. 12085, pp. 840–851. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-47436-2_63
Silva, J.F., Gonçalves, C., Cunha, J.C.: A theoretical model for n-gram distribution in big data corpora. In: 2016 IEEE International Conference on Big Data, pp. 134–141 (2016)
da Silva, J.F., Dias, G., Guilloré, S., Pereira Lopes, J.G.: Using LocalMaxs algorithm for the extraction of contiguous and non-contiguous multiword lexical units. In: Barahona, P., Alferes, J.J. (eds.) EPIA 1999. LNCS (LNAI), vol. 1695, pp. 113–132. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48159-1_9
Simon, H.: On a class of skew distribution functions. Biometrika 42(3/4), 425–440 (1955)
Zipf, G.K.: Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge (1949)
© 2021 Springer Nature Switzerland AG
Silva, J.F., Cunha, J.C. (2021). A Model for Predicting n-gram Frequency Distribution in Large Corpora. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2021. Lecture Notes in Computer Science, vol. 12742. Springer, Cham. https://doi.org/10.1007/978-3-030-77961-0_55
Print ISBN: 978-3-030-77960-3
Online ISBN: 978-3-030-77961-0