1 Introduction

Relevant Expressions (RE) are semantically meaningful n-grams (\(n\ge 1\)), such as “oceanography” or “oil crisis”, useful in document classification [15] and other n-gram applications. However, most word sequences in a corpus are not relevant. Statistical RE extraction from texts, e.g. [7, 18], measures the cohesion among the n-grams within each distinct multiword; its performance benefits from predicting the n-gram frequency distributions. Low frequency n-grams account for a significant proportion of the distinct n-grams in a text, as well as of the RE. Assuming, for language L and n-gram size n, a finite vocabulary V(L, n) in each temporal epoch [9, 16, 17], we model the influence of the corpus size upon the sizes W(k) of the equal-frequency (k) n-gram groups, for \(n\!\ge \!1\), especially for the low frequencies. We present results for English and French Wikipedia corpora (up to 8.6 Gw) and \(1\!\le \!n\!\le \!6\), and compare them to a Poisson-based model. The background, the model, the results and the conclusions are discussed in the following sections.

2 Background

Zipf’s law [20] is a good approximation to the word frequency distribution, but it deviates from real data at the high and low frequencies. More accurate approximations remain an open issue [1, 2, 6, 8, 11, 12, 13, 14, 19]. Low frequency words are often ignored, as are multiwords. Most studies use truncated corpora data [5, 8], with some exceptions [17]. In models such as [2, 3], the probability of a word occurring k times is given by a power law \(k^{-\gamma }\) corrected for the influence of the corpus size, but they do not consider other n-gram sizes, unlike e.g. [16].

3 The Model

The model is presented through successive refinements: \(W_{z}(k)\), derived from Zipf’s Law; \(W_{\alpha _{d}}\!(k,\!C)\), adding the corpus size dependence; and \(W^{*}\!(k,C)\), adding scaling adjustments.

3.1 \(W_{z}(k)\): The Size of the Frequency Levels from Zipf’s Law

By Zipf’s Law [20], the number of occurrences of the rth most frequent word in a corpus with D distinct words is

$$\begin{aligned} f(r) = f(1)\cdot r^{-\alpha } , \end{aligned}$$
(1)

\(\alpha \) is a constant \(\sim \! 1\); r is the word rank (\(1\!\le \!r\!\le \!D\)). (1) also applies to n-grams of sizes \(n\!>\!1\), with \(\alpha \) dependent on n (for simplicity, \(\alpha \) replaces \(\alpha (n)\)). The relative frequency of the most frequent n-gram (\(r\!=\!1\)) for each n shows small fluctuations around a stable value, which we take as an approximation to its occurrence probability \(p_{1}\); thus the absolute frequency \(f(1)\approx p_{1}\! \cdot C\). By (1), \(\ln (f(r))\) would decrease linearly, with slope \(-\alpha \), as \(\ln (r)\) increases. Real distributions deviate from straight lines and show, at their higher ranks, groups of equal-frequency n-grams. \(W\!(k)\) is defined based on Zipf’s law [4, 16]. For a level with frequency k, with lowest (\(r_{l_{k}}\)) and highest (\(r_{h_{k}}\)) n-gram ranks: \(f(r_{l_{k}})\!=\! f(r_{h_{k}})\!=\!k\) and \(W_{z}(k)=r_{h_{k}}-r_{l_{k}}+1\). The model assumes a minimum observed frequency of 1 (\(f(r_{l_{1}})= f(r_{h_{1}})\!=\!1\); \(r_{h_{1}}\!=\!D\)) and only applies to the higher ranks / lower frequencies, where adjacent levels (\(r_{l_{k}}\!=\!r_{h_{k+1}}\!+\!1\)) have consecutive integer frequencies: \(f(r_{h_{k+1}})\!=\! f(r_{h_{k}})+1\).

Then, (2) is obtained, with constant \(\alpha _{z}\).

$$\begin{aligned} W_{z}(k)=\left( \frac{1}{D^{\alpha _{z}}} + \frac{k-1}{f(1)} \right) ^{-\frac{1}{\alpha _{z}}} - \left( \frac{1}{D^{\alpha _{z}}} + \frac{k}{f(1)} \right) ^{-\frac{1}{\alpha _{z}}}. \end{aligned}$$
(2)
$$\begin{aligned} D(C;L,n)=\frac{V(L,n)}{1+ (K_2\cdot C)^{-K_1}} . \end{aligned}$$
(3)

For predicting D in a corpus of size C, we use (3), following [16], which shows good agreement with real corpora. For language L and n-gram size n, V(L, n) is the finite vocabulary size; \(K_1\) and \(K_2\) are positive constants. If V is assumed infinite, (3) reduces to Heaps’ law.
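As an illustration, (2) and (3) translate directly into code. The following Python sketch is a minimal rendition (the function names are ours, and the V, \(K_1\), \(K_2\) values below are hypothetical placeholders rather than the fitted values of Table 1; \(p_{1}\) and \(\alpha \) are the English 1-gram values of Sect. 4.2):

```python
def vocab_growth(C, V, K1, K2):
    """D(C, L, n) from (3): the number of distinct n-grams in a corpus of
    C words, approaching the finite vocabulary size V = V(L, n) as C grows."""
    return V / (1.0 + (K2 * C) ** (-K1))

def level_size_zipf(k, D, f1, alpha_z):
    """W_z(k) from (2): the number of distinct n-grams occurring exactly k
    times, under an ideal Zipf distribution with exponent alpha_z."""
    low = (1.0 / D ** alpha_z + (k - 1.0) / f1) ** (-1.0 / alpha_z)
    high = (1.0 / D ** alpha_z + k / f1) ** (-1.0 / alpha_z)
    return low - high

# Hypothetical parameters for English 1-grams (V, K1, K2 are placeholders):
C = 1.1e9                                   # corpus size in words
D = vocab_growth(C, V=5e8, K1=0.7, K2=1e-6)
print(level_size_zipf(1, D, f1=0.06704 * C, alpha_z=1.1595))
```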

3.2 An Analytical Model for the Dependence of \(\alpha \) on Corpus Size

Empirically, \(\alpha _{z}\) is shown to depend on the corpus size. We therefore consider \(\alpha \) in (1) as a function \(\alpha (C,r)\) of the corpus size and the n-gram rank r:

$$\begin{aligned} \alpha (C,r)=\frac{\ln (f_{c}(1))-\ln (f_{c}(r))}{\ln (r)} , \end{aligned}$$
(4)

where \(1\!\le \!r\!\le \!D\), and \(f_{c}(1)\) and \(f_{c}(r)\) are the frequencies, respectively, of the most frequent n-gram and of the rth ranked n-gram, in a corpus of size C. In (2), \(\alpha \) is obtained, for each corpus size, by fitting \(W_{z}(1)\) to the empirical level size \(W_{obs}(1)\) (for \(k\!=\!1\)). For that level, \(r_{h_{1}}\!=\!D(C,L,n)\) (denoted D or \(D_{c}\)) and \(f_{c}(D_{c})\!=\!1\), so \(\ln (f_{c}(r))\!=\!0\) in (4) for \(r=D_{c}\). Let \(\alpha (C,D_{c})\), denoted \(\alpha _{d}(C)\), be the \(\alpha \) value at rank D, and let \(C_{1}\) be the size of a reference corpus:

$$\begin{aligned} \alpha _{d}(C)-\alpha _{d}(C_{1})\!=\! \frac{\ln (f_{c}(1)) }{\ln (D_{c})}- Ref_{c_{1}} . \end{aligned}$$
(5)

The second term on the right-hand side of (5), denoted \(Ref_{c_{1}}\), is fixed: it depends only on \(f_{c_{1}}(1)\!=\!C_{1}\!\cdot \!p_{1}\) (\(p_{1}\) being the occurrence probability of the most frequent n-gram) and on \(D_{c_{1}}\) from (3). Using Table 1 (Sect. 4.2) and tuning \(\alpha _{d}(C_{1})\) by fitting, for \(C_{1}\), the \(W_{z}(1)\) from (2) to the observed \(W_{obs}(1)\), we obtain \(\alpha _{d}(C_{1})\) and \(D_{c_{1}}\). Given \(\alpha _{d}(C_{1})\) and \(Ref_{c_{1}}\), (5) predicts \(\alpha _{d}(C)\) for a corpus of size C, and replacing \(\alpha _{z}\) by \(\alpha _{d}(C)\) in (2) leads to \(W_{\alpha _{d}}\!(k,\!C)\), (6):

$$\begin{aligned} W_{\alpha _{d}}\!(k,\!C)\!=\!\left( \!\frac{1}{D_{c}^{\alpha _{d}(C)}}\!+\!\frac{k-1}{f_{c}(1)}\!\right) ^{\!\!-\frac{1}{\alpha _{d}(C)}}\!-\left( \!\frac{1}{D_{c}^{\alpha _{d}(C)}}\!+\!\frac{k}{f_{c}(1)}\!\right) ^{\!\!-\frac{1}{\alpha _{d}(C)}} . \end{aligned}$$
(6)
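A sketch of this step, continuing the previous code (the function names are ours; `alpha_d_C1` is assumed to have been fitted on the reference corpus as described above):

```python
import math

def alpha_d(C, C1, alpha_d_C1, p1, V, K1, K2):
    """alpha_d(C) predicted by (5), given alpha_d(C1) fitted on a reference
    corpus of size C1; Ref_{c1} = ln(f_{c1}(1)) / ln(D_{c1})."""
    ref_c1 = math.log(p1 * C1) / math.log(vocab_growth(C1, V, K1, K2))
    term_c = math.log(p1 * C) / math.log(vocab_growth(C, V, K1, K2))
    return alpha_d_C1 + term_c - ref_c1

def level_size_alpha_d(k, C, C1, alpha_d_C1, p1, V, K1, K2):
    """W_{alpha_d}(k, C) from (6): equation (2) with alpha_d(C) in place of
    alpha_z, and with D and f(1) evaluated at corpus size C."""
    a = alpha_d(C, C1, alpha_d_C1, p1, V, K1, K2)
    return level_size_zipf(k, vocab_growth(C, V, K1, K2), p1 * C, a)
```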

3.3 \(W^{*}\!(k,C)\): The Dependence of Level Size on Corpus Size

The frequency level size depends on the frequency k and on the corpus size C. Firstly, for a corpus of size C, \(\alpha _{z}\) in (2) is tuned to best fit \(W_{z}(1)\) to \(W_{obs}(1)\). Apart from the \(W_{obs}(k)\) fluctuations, the deviation between \(W_{obs}(k)\) and \(W_{z}(k)\), closely proportional to \(\ln (k)\), suggests the improvement given by (7) (Fig. 1a).

$$\begin{aligned} W_{adjusted}(k) = W_{z}(k)\cdot k^\beta . \end{aligned}$$
(7)
Fig. 1. a) 1-gram equal-frequency level size W(k) vs k (log-log scale) – observed and estimated by (2) and (7), from a 1.1 Gw English corpus; b) Observed 3-gram level size values, \(W_{obs}(k)\) vs k (log-log scale), for different English corpora sizes.

\(\beta \) is a constant for each n, obtained from the best fit of W(k) to \(W_{obs}(k)\) for a given corpus. Secondly, across different corpus sizes, Fig. 1b shows the \(W_{obs}(k)\) curves as a function of k; they seem parallel, but a detailed analysis shows otherwise. If, for each of the three smaller corpora \(C^{*}\), an offset equal to \(\ln (W_{obs}(1,C))-\ln (W_{obs}(1,C^{*}))\) is added to \(\ln (W_{obs}(k,C^{*}))\) (\(C\!=\!8.6\) Gw being the largest corpus), the resulting curves (omitted due to lack of space) do not coincide, as they would if they were parallel in Fig. 1b. The gap between the curves is proportional to \(\ln (k)\), and, for each \(\ln (k)\) value, the distance in \(\ln (W_{obs}(k))\) between corpora of sizes C and \(C_{1}\) is proportional to \(\log _{2}(C/C_{1})\). Thus the distance between the \(\ln (W(k))\) curves of any two corpora C and \(C_{1}\) is approximated by (8), with \(\delta \) constant for each n. Combining (6), (7) and (8) leads to the final model \(W^{*}\!(k,C)\), (9):

$$\begin{aligned} \varDelta =\delta \cdot \ln (\frac{C}{C_{1}})\cdot \ln (k) \end{aligned}$$
(8)
$$\begin{aligned} W^{*}\!(k,C)\!=\!W_{\alpha _{d}}\!(k,C)\cdot k^{\beta +\delta \cdot \ln (\frac{C}{C_{1}})} . \end{aligned}$$
(9)
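Under the same assumptions as the previous sketches, (9) is then a one-line correction on top of (6); the \(\beta \) and \(\delta \) values would be taken from Table 1:

```python
import math

def level_size_final(k, C, C1, alpha_d_C1, p1, V, K1, K2, beta, delta):
    """W*(k, C) from (9): W_{alpha_d}(k, C) scaled by k raised to
    beta + delta * ln(C / C1), merging the corrections (7) and (8)."""
    w = level_size_alpha_d(k, C, C1, alpha_d_C1, p1, V, K1, K2)
    return w * k ** (beta + delta * math.log(C / C1))
```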

4 Results and Discussion

4.1 The Poisson-Zipf Model

In the \(W_{P}(k,C)\) model of [17], given by (10), an n-gram ranked r occurs, in a corpus of size C, a number of times following a Poisson distribution [10] with \(\lambda _r\!=\!f(r)\) given by Zipf’s Law. \(W(0)\!=\!W_{P}(0,C)\) is the estimated number of unseen n-grams in the corpus, and \(D\!=\!V\!-\!W\!(0)\) for n-gram vocabulary size V.

$$\begin{aligned} W_{P}\!(k,C)\!&=\!\sum _{r=1}^{r=V}\frac{(p_{1}\!\cdot \!C\!\cdot \!r^{-\alpha })^k\cdot e^{-p_{1}\cdot C \cdot r^{-\alpha }}}{k!} \approx \int _{1}^{V}\frac{(p_{1}\!\cdot \!C\!\cdot \!r^{-\alpha })^k\cdot e^{-p_{1}\cdot C \cdot r^{-\alpha }}}{k!} dr \nonumber \\&\approx \!\frac{(p_{1}\!\cdot \!C)^{1/\alpha }}{\alpha \!\cdot \!k!}\!\cdot \!\left[ \varGamma (k\!-\!\frac{1}{\alpha },\frac{p_{1}\!\cdot \!C}{V^\alpha })-\varGamma (k\!-\!\frac{1}{\alpha },p_{1}\!\cdot \!C)\right] \end{aligned}$$
(10)
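For reproduction purposes, (10) can also be evaluated directly from its defining sum, side-stepping the incomplete gamma functions; a minimal sketch, assuming NumPy and SciPy are available (the summation is chunked so that large vocabularies V do not exhaust memory):

```python
import numpy as np
from scipy.stats import poisson

def level_size_poisson_zipf(k, C, p1, alpha, V, chunk=10**7):
    """W_P(k, C) from (10), by direct summation: rank r contributes the
    Poisson probability of exactly k occurrences, with mean
    lambda_r = p1 * C * r**(-alpha) given by Zipf's Law."""
    total = 0.0
    for start in range(1, int(V) + 1, chunk):
        r = np.arange(start, min(start + chunk, int(V) + 1), dtype=float)
        total += poisson.pmf(k, p1 * C * r ** (-alpha)).sum()
    return total
```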

4.2 Comparison of Results

Complete corpora were built from documents randomly extracted from the English and French Wikipedia. For evaluating the corpus size dependence, their sizes were successively doubled (Table 2). A space was added between words and each character of a fixed set of punctuation and special characters. All inflected word forms were kept.
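A minimal sketch of this tokenization step (the exact character set is not reproduced here, so the `chars` default below is a hypothetical example):

```python
import re

def separate_punctuation(text, chars=".,;:!?()"):
    """Surround each character of `chars` with spaces, so punctuation marks
    become separate tokens; `chars` is an assumed, illustrative set."""
    return re.sub("([" + re.escape(chars) + "])", r" \1 ", text)

# separate_punctuation("an oil crisis, 1973.") leaves words untouched and
# only splits off the listed characters.
```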

The Model Calculations. (I) To calculate D(C, L, n) in (3), the parameters \(K_1\), \(K_2\) and V(L, n) were found for each language L and n-gram size n (Table 1, which also shows the \(\beta \) and \(\delta \) values used in (9)). The V(L, n) value is an estimate of the vocabulary size such that further increasing it does not significantly reduce the relative error \(((E-O)/O)\cdot 100\%\) between an estimated value (E) and the corresponding observed value (O). The pairs (\(K_1\), \(K_2\)) leading to the lowest possible relative error were found for a selected pair of corpora with sizes close to the lowest and highest corpora sizes in the range considered for each language. (II) To evaluate the relative errors, (9) was applied for the k values such that the observed sizes of consecutive frequency levels k and \(k\!+\!1\) are monotonically decreasing, \(W_{obs}(k,C)\!>\!W_{obs}(k\!+\!1,C)\). This avoids the non-monotonic regions of the observed \(\ln (W(k))\) curve (Fig. 1a). We considered a basic set of k values, \(K\!=\!\{1,2,3,4,5,6,7,8,16,32,64,128\}\), constrained (to ensure \(\ln (W(k))\) monotony) depending on the corpus size C: for \(C\!<\!250\) Mw, we used \(k\!\le \!16\); for \(C\!<\!1\) Gw, \(k\!\le \!32\); the full K set was used only for \(C\!>\!4\) Gw. We selected reference corpora of sizes (\(C_{1}\)) 1.1 Gw (English) and 808 Mw (French). (III) The \(\alpha _{d}(C_{1})\) values for n-gram sizes from 1 to 6 are: (English) 1.1595, 1.02029, 0.88825, 0.82532, 0.8117, 0.8027; (French) 1.158825, 1.0203, 0.86605, 0.84275, 0.80818, 0.7569. The empirical \(p_1\) values for n-gram sizes from 1 to 6 are: (English) 0.06704, 0.03250, 0.0062557, 0.0023395, 0.0017908, 0.0014424; (French) 0.07818, 0.037976, 0.004685, 0.0036897, 0.001971, 0.00072944. (IV) To run \(W_{P}\!(k,C)\), the \(\alpha \) values leading to the lowest relative errors are, for n-gram sizes from 1 to 6: (English) 1.17, 1.02, 0.891, 0.827, 0.814, 0.812; (French) 1.156, 1.01, 0.884, 0.842, 0.806, 0.759.
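Step (II) can be sketched as follows (`W_obs` is assumed to be a precomputed mapping from k to the observed level size for the corpus at hand; the \(k\le 64\) cap between 1 Gw and 4 Gw is our reading of the constraints above, not stated in the text):

```python
def usable_k_values(W_obs, C, base=(1, 2, 3, 4, 5, 6, 7, 8, 16, 32, 64, 128)):
    """Constrain the basic K set as in step (II): cap k by the corpus size
    C (in words), then keep only the k values where the observed level
    sizes still decrease, W_obs(k) > W_obs(k + 1)."""
    if C < 250e6:
        cap = 16
    elif C < 1e9:
        cap = 32
    elif C <= 4e9:
        cap = 64   # assumed: the text only states the full set needs C > 4 Gw
    else:
        cap = 128
    return [k for k in base
            if k <= cap and W_obs.get(k, 0) > W_obs.get(k + 1, 0)]
```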

Table 1. Parameter values \(K_1\), \(K_2\) and vocabulary sizes V(L, n) used in D(C, L, n), (3), and the \(\beta \) and \(\delta \) values used in \(W^{*}\!(k,\!C)\), (9).

Table 2 presents the relative errors of the predictions of the frequency level sizes. For each n-gram size, the left column refers to \(W^{*}\!(k,C)\) and the right one to \(W_{P}\!(k,C)\). For each pair (corpus size, n-gram size), it shows the average relative error over the K set used: \(AvgErr(K)=\frac{1}{\Vert K\Vert }\sum _{k\in K}Err(k)\), where \(Err(k)\!=|\frac{W\!(k,C)-W_{obs}\!(k,C)}{W_{obs}\!(k,C)}|\). The average relative errors for \(W^{*}\!(k,C)\) are much lower than those for \(W_{P}(k,C)\), which assumes an ideal Zipf’s Law. The line Avg shows the average value of each column over the full range of corpora sizes, with errors of the same magnitude across the range of n-gram sizes for \(W^{*}\!(k,C)\), but with significant variations in the Avg values for \(W_{P}(k,C)\). The global relative error, the average of the Avg values over the range of n-gram sizes, is around \(4\%\) for \(W^{*}\!(k,C)\). Thus, the \(W^{*}\!(k,C)\) curves (omitted due to lack of space) closely follow the shapes of the \(W_{obs}(k,C)\) curves in Fig. 1.
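The error metric itself is direct to compute; a sketch matching the definition above, where `W_model` is any of the level-size functions sketched earlier with its parameters fixed:

```python
def avg_relative_error(W_model, W_obs, ks):
    """AvgErr(K) in percent: the mean over k in K of
    |W(k, C) - W_obs(k, C)| / W_obs(k, C)."""
    errs = [abs(W_model(k) - W_obs[k]) / W_obs[k] for k in ks]
    return 100.0 * sum(errs) / len(errs)
```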

Table 2. Average relative error (\(\%\)) for the predictions of the n-gram frequency level sizes obtained by \(W^{*}\!(k,C)\), (9), (left col.), and \(W_{P}\!(k,C)\), (10), (right col.). Each cell in the table gives an average relative error over a subset of k values within the set K considered for that cell, as described in the text.

5 Conclusions

Estimating n-gram frequency distributions is useful in statistically-based n-gram applications. The proposed model estimates the sizes W(k, C) of the equal-frequency (k) n-gram groups in a corpus of size C, for the low frequency n-grams. It applies uniformly to different n-gram sizes \(n\!\ge \!1\) and languages, assuming a finite language n-gram vocabulary. It models the dependence of Zipf’s Law exponent and of W(k, C) on C, agreeing well with n-gram frequency data, from unigrams up to hexagrams, from real un-truncated English and French corpora ranging from millions to billions of words. Evaluation on larger corpora is planned.