1 Introduction

Relevant Expressions (RE) are semantically meaningful n-grams (\(n\ge 1\)), such as “oceanography” or “oil crisis”, useful in document classification [15] and other n-gram applications. However, most word sequences in a corpus are not relevant. Statistical RE extraction from texts, e.g. [7, 18], measures the cohesion among the n-grams within each distinct multiword; its performance benefits from predicting the n-gram frequency distributions. Low frequency n-grams account for a significant proportion of the distinct n-grams in a text, as well as of the RE. Assuming, for language L and n-gram size n, a finite vocabulary V(L, n) in each temporal epoch [9, 16, 17], we model the influence of the corpus size upon the sizes W(k) of the equal-frequency (k) n-gram groups, for \(n\!\ge \!1\), especially for the low frequencies. We present results for English and French Wikipedia corpora (up to 8.6 Gw) and \(1\!\le \!n\!\le \!6\), and compare them to a Poisson-based model. The background, the model, the results and the conclusions are discussed in the following sections.

2 Background

Zipf’s law [20] is a good approximation to the word frequency distribution, but it deviates from real data at the high and low frequencies. More accurate approximations remain an open issue [1, 2, 6, 8, 11, 12, 13, 14, 19]. Low frequency words are often ignored, as are multiwords. Most studies use truncated corpora data [5, 8], with some exceptions [17]. In models such as [2, 3], the probability of a word occurring k times is given by a power law \(k^{-\gamma }\) corrected for the influence of the corpus size, but they do not consider other n-gram sizes, unlike e.g. [16].

3 The Model

The model is presented through successive refinements: \(W_{z}(k)\), derived from Zipf’s Law; \(W_{\alpha _{d}}\!(k,\!C)\), adding the corpus size dependence; and \(W^{*}\!(k,C)\), adding scaling adjustments.

3.1 \(W_{z}(k)\): The Size of the Frequency Levels from Zipf’s Law

By Zipf’s Law [20], the number of occurrences of the rth most frequent word in a corpus with D distinct words is

$$\begin{aligned} f(r) = f(1)\cdot r^{-\alpha } , \end{aligned}$$
(1)

\(\alpha \) is a constant \(\sim \! 1\); r is the word rank (\(1\!\le \!r\!\le \!D\)). (1) also applies to n-grams of sizes \(n\!>\!1\), with \(\alpha \) dependent on n (for simplicity, \(\alpha \) replaces \(\alpha (n)\)). The relative frequency of the most frequent n-gram (\(r\!=\!1\)) for each n shows small fluctuations around a stable value, which we take as an approximation to its occurrence probability \(p_{1}\); thus the absolute frequency \(f(1)\approx p_{1}\! \cdot C\). By (1), \(\ln (f(r))\) would decrease linearly, with slope \(-\alpha \), as \(\ln (r)\) increases. Real distributions deviate from straight lines and show, at their higher ranks, groups of equal-frequency n-grams. \(W\!(k)\) is defined based on Zipf’s law [4, 16]. For a level with frequency k, with lowest (\(r_{l_{k}}\)) and highest (\(r_{h_{k}}\)) n-gram ranks: \(f(r_{l_{k}})\!=\! f(r_{h_{k}})\!=\!k\) and \(W_{z}(k)=r_{h_{k}}-r_{l_{k}}+1\). The model assumes a minimum observed frequency of 1 (\(f(r_{l_{1}})= f(r_{h_{1}})\!=\!1\); \(r_{h_{1}}\!=\!D\)) and only applies to the higher ranks / lower frequencies, where adjacent levels (\(r_{l_{k}}\!=\!r_{h_{k+1}}\!+\!1\)) have consecutive integer frequencies: \(f(r_{h_{k+1}})\!=\! f(r_{h_{k}})+1\).

Then, (2) is obtained, with constant \(\alpha _{z}\).

$$\begin{aligned} W_{z}(k)=\left( \frac{1}{D^{\alpha _{z}}} + \frac{k-1}{f(1)} \right) ^{-\frac{1}{\alpha _{z}}} - \left( \frac{1}{D^{\alpha _{z}}} + \frac{k}{f(1)} \right) ^{-\frac{1}{\alpha _{z}}}. \end{aligned}$$
(2)
$$\begin{aligned} D(C;L,n)=\frac{V(L,n)}{1+ (K_2\cdot C)^{-K_1}} . \end{aligned}$$
(3)

For predicting D in a corpus of size C, we use (3), following [16], which shows good agreement with real corpora. For language L and n-gram size n, V(L, n) is the finite vocabulary size; \(K_1\) and \(K_2\) are positive constants. If V is assumed infinite, (3) reduces to Heaps’ law.
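As an illustration, (2) and (3) translate directly into code. The following Python sketch is a minimal rendition (the function names are ours, and the V, \(K_1\), \(K_2\) values below are hypothetical placeholders rather than the fitted values of Table 1; \(p_{1}\) and \(\alpha \) are the English 1-gram values of Sect. 4.2):

```python
def vocab_growth(C, V, K1, K2):
    """D(C, L, n) from (3): the number of distinct n-grams in a corpus of
    C words, approaching the finite vocabulary size V = V(L, n) as C grows."""
    return V / (1.0 + (K2 * C) ** (-K1))

def level_size_zipf(k, D, f1, alpha_z):
    """W_z(k) from (2): the number of distinct n-grams occurring exactly k
    times, under an ideal Zipf distribution with exponent alpha_z."""
    low = (1.0 / D ** alpha_z + (k - 1.0) / f1) ** (-1.0 / alpha_z)
    high = (1.0 / D ** alpha_z + k / f1) ** (-1.0 / alpha_z)
    return low - high

# Hypothetical parameters for English 1-grams (V, K1, K2 are placeholders):
C = 1.1e9                                   # corpus size in words
D = vocab_growth(C, V=5e8, K1=0.7, K2=1e-6)
print(level_size_zipf(1, D, f1=0.06704 * C, alpha_z=1.1595))
```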

3.2 An Analytical Model for the Dependence of \(\alpha \) on Corpus Size

Empirically, \(\alpha _{z}\) is shown to depend on the corpus size. We therefore consider \(\alpha \) in (1) as a function \(\alpha (C,r)\) of the corpus size and the n-gram rank r:

$$\begin{aligned} \alpha (C,r)=\frac{\ln (f_{c}(1))-\ln (f_{c}(r))}{\ln (r)} , \end{aligned}$$
(4)

where \(1\!\le \!r\!\le \!D\), and \(f_{c}(1)\) and \(f_{c}(r)\) are the frequencies, respectively, of the most frequent n-gram and of the rth ranked n-gram, in a corpus of size C. In (2), \(\alpha \) is obtained, for each corpus size, by fitting \(W_{z}(1)\) to the empirical level size \(W_{obs}(1)\) (for \(k\!=\!1\)). For that level, \(r_{h_{1}}\!=\!D(C,L,n)\) (denoted D or \(D_{c}\)) and \(f_{c}(D_{c})\!=\!1\), so \(\ln (f_{c}(r))\!=\!0\) in (4) for \(r=D_{c}\). Let \(\alpha (C,D_{c})\), denoted \(\alpha _{d}(C)\), be the \(\alpha \) value at rank D, and let \(C_{1}\) be the size of a reference corpus:

$$\begin{aligned} \alpha _{d}(C)-\alpha _{d}(C_{1})\!=\! \frac{\ln (f_{c}(1)) }{\ln (D_{c})}- Ref_{c_{1}} . \end{aligned}$$
(5)

The second term on the right-hand side of (5), denoted \(Ref_{c_{1}}\), is fixed: it depends only on \(f_{c_{1}}(1)\!=\!C_{1}\!\cdot \!p_{1}\) (\(p_{1}\) being the occurrence probability of the most frequent n-gram) and on \(D_{c_{1}}\) from (3). Using Table 1 (Sect. 4.2) and tuning \(\alpha _{d}(C_{1})\) by fitting, for \(C_{1}\), the \(W_{z}(1)\) from (2) to the observed \(W_{obs}(1)\), we obtain \(\alpha _{d}(C_{1})\) and \(D_{c_{1}}\). Given \(\alpha _{d}(C_{1})\) and \(Ref_{c_{1}}\), (5) predicts \(\alpha _{d}(C)\) for a corpus of size C, and replacing \(\alpha _{z}\) by \(\alpha _{d}(C)\) in (2) leads to \(W_{\alpha _{d}}\!(k,\!C)\), (6):

$$\begin{aligned} W_{\alpha _{d}}\!(k,\!C)\!=\!\left( \!\frac{1}{D_{c}^{\alpha _{d}(C)}}\!+\!\frac{k-1}{f_{c}(1)}\!\right) ^{\!\!-\frac{1}{\alpha _{d}(C)}}\!-\left( \!\frac{1}{D_{c}^{\alpha _{d}(C)}}\!+\!\frac{k}{f_{c}(1)}\!\right) ^{\!\!-\frac{1}{\alpha _{d}(C)}} . \end{aligned}$$
(6)
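A sketch of this step, continuing the previous code (the function names are ours; `alpha_d_C1` is assumed to have been fitted on the reference corpus as described above):

```python
import math

def alpha_d(C, C1, alpha_d_C1, p1, V, K1, K2):
    """alpha_d(C) predicted by (5), given alpha_d(C1) fitted on a reference
    corpus of size C1; Ref_{c1} = ln(f_{c1}(1)) / ln(D_{c1})."""
    ref_c1 = math.log(p1 * C1) / math.log(vocab_growth(C1, V, K1, K2))
    term_c = math.log(p1 * C) / math.log(vocab_growth(C, V, K1, K2))
    return alpha_d_C1 + term_c - ref_c1

def level_size_alpha_d(k, C, C1, alpha_d_C1, p1, V, K1, K2):
    """W_{alpha_d}(k, C) from (6): equation (2) with alpha_d(C) in place of
    alpha_z, and with D and f(1) evaluated at corpus size C."""
    a = alpha_d(C, C1, alpha_d_C1, p1, V, K1, K2)
    return level_size_zipf(k, vocab_growth(C, V, K1, K2), p1 * C, a)
```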

3.3 \(W^{*}\!(k,C)\): The Dependence of Level Size on Corpus Size

The frequency level size depends on the frequency k and on the corpus size C. Firstly, for a corpus of size C, \(\alpha _{z}\) in (2) is tuned to best fit \(W_{z}(1)\) to \(W_{obs}(1)\). Apart from the \(W_{obs}(k)\) fluctuations, the deviation between \(W_{obs}(k)\) and \(W_{z}(k)\), closely proportional to \(\ln (k)\), suggests the improvement given by (7) (Fig. 1a).

$$\begin{aligned} W_{adjusted}(k) = W_{z}(k)\cdot k^\beta . \end{aligned}$$
(7)
Fig. 1. a) 1-gram equal-frequency level size W(k) vs k (log-log scale) – observed and estimated by (2) and (7), from a 1.1 Gw English corpus; b) Observed 3-gram level size values, \(W_{obs}(k)\) vs k (log-log scale), for different English corpora sizes.

\(\beta \) is a constant for each n, obtained from the best fit of W(k) to \(W_{obs}(k)\) for a given corpus. Secondly, across different corpus sizes, Fig. 1b shows the \(W_{obs}(k)\) curves as a function of k; they seem parallel, but a detailed analysis shows otherwise. If, for each of the three smaller corpora \(C^{*}\), an offset equal to \(\ln (W_{obs}(1,C))-\ln (W_{obs}(1,C^{*}))\) is added to \(\ln (W_{obs}(k,C^{*}))\) (\(C\!=\!8.6\) Gw being the largest corpus), the resulting curves (omitted due to lack of space) do not coincide, as they would if they were parallel in Fig. 1b. The gap between the curves is proportional to \(\ln (k)\), and, for each \(\ln (k)\) value, the distance in \(\ln (W_{obs}(k))\) between corpora of sizes C and \(C_{1}\) is proportional to \(\log _{2}(C/C_{1})\). Thus the distance between the \(\ln (W(k))\) curves of any two corpora C and \(C_{1}\) is approximated by (8), with \(\delta \) constant for each n. Combining (6), (7) and (8) leads to the final model \(W^{*}\!(k,C)\), (9):

$$\begin{aligned} \varDelta =\delta \cdot \ln (\frac{C}{C_{1}})\cdot \ln (k) \end{aligned}$$
(8)
$$\begin{aligned} W^{*}\!(k,C)\!=\!W_{\alpha _{d}}\!(k,C)\cdot k^{\beta +\delta \cdot \ln (\frac{C}{C_{1}})} . \end{aligned}$$
(9)
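Under the same assumptions as the previous sketches, (9) is then a one-line correction on top of (6); the \(\beta \) and \(\delta \) values would be taken from Table 1:

```python
import math

def level_size_final(k, C, C1, alpha_d_C1, p1, V, K1, K2, beta, delta):
    """W*(k, C) from (9): W_{alpha_d}(k, C) scaled by k raised to
    beta + delta * ln(C / C1), merging the corrections (7) and (8)."""
    w = level_size_alpha_d(k, C, C1, alpha_d_C1, p1, V, K1, K2)
    return w * k ** (beta + delta * math.log(C / C1))
```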

4 Results and Discussion

4.1 The Poisson-Zipf Model

In the \(W_{P}(k,C)\) model of [17], given by (10), an n-gram ranked r occurs, in a corpus of size C, a number of times following a Poisson distribution [10] with \(\lambda _r\!=\!f(r)\) given by Zipf’s Law. \(W(0)\!=\!W_{P}(0,C)\) is the estimated number of unseen n-grams in the corpus, and \(D\!=\!V\!-\!W\!(0)\) for n-gram vocabulary size V.

$$\begin{aligned} W_{P}\!(k,C)\!&=\!\sum _{r=1}^{r=V}\frac{(p_{1}\!\cdot \!C\!\cdot \!r^{-\alpha })^k\cdot e^{-p_{1}\cdot C \cdot r^{-\alpha }}}{k!} \approx \int _{1}^{V}\frac{(p_{1}\!\cdot \!C\!\cdot \!r^{-\alpha })^k\cdot e^{-p_{1}\cdot C \cdot r^{-\alpha }}}{k!} dr \nonumber \\&\approx \!\frac{(p_{1}\!\cdot \!C)^{1/\alpha }}{\alpha \!\cdot \!k!}\!\cdot \!\left[ \varGamma (k\!-\!\frac{1}{\alpha },\frac{p_{1}\!\cdot \!C}{V^\alpha })-\varGamma (k\!-\!\frac{1}{\alpha },p_{1}\!\cdot \!C)\right] \end{aligned}$$
(10)
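For reproduction purposes, (10) can also be evaluated directly from its defining sum, side-stepping the incomplete gamma functions; a minimal sketch, assuming NumPy and SciPy are available (the summation is chunked so that large vocabularies V do not exhaust memory):

```python
import numpy as np
from scipy.stats import poisson

def level_size_poisson_zipf(k, C, p1, alpha, V, chunk=10**7):
    """W_P(k, C) from (10), by direct summation: rank r contributes the
    Poisson probability of exactly k occurrences, with mean
    lambda_r = p1 * C * r**(-alpha) given by Zipf's Law."""
    total = 0.0
    for start in range(1, int(V) + 1, chunk):
        r = np.arange(start, min(start + chunk, int(V) + 1), dtype=float)
        total += poisson.pmf(k, p1 * C * r ** (-alpha)).sum()
    return total
```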

4.2 Comparison of Results

Complete corpora were built from documents randomly extracted from the English and French Wikipedia. For evaluating the corpus size dependence, their sizes were successively doubled (Table 2). A space was added between words and each character of a fixed set of punctuation and special characters. All inflected word forms were kept.
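A minimal sketch of this tokenization step (the exact character set is not reproduced here, so the `chars` default below is a hypothetical example):

```python
import re

def separate_punctuation(text, chars=".,;:!?()"):
    """Surround each character of `chars` with spaces, so punctuation marks
    become separate tokens; `chars` is an assumed, illustrative set."""
    return re.sub("([" + re.escape(chars) + "])", r" \1 ", text)

# separate_punctuation("an oil crisis, 1973.") leaves words untouched and
# only splits off the listed characters.
```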

The Model Calculations. (I) To calculate D(C, L, n) in (3), the parameters \(K_1\), \(K_2\) and V(L, n) were found for each language L and n-gram size n (Table 1, which also shows the \(\beta \) and \(\delta \) values used in (9)). The V(L, n) value is an estimate of the vocabulary size such that further increasing it does not significantly reduce the relative error \(((E-O)/O)\cdot 100\%\) between an estimated value (E) and the corresponding observed value (O). The pairs (\(K_1\), \(K_2\)) leading to the lowest possible relative error were found for a selected pair of corpora with sizes close to the lowest and highest corpora sizes in the range considered for each language. (II) To evaluate the relative errors, (9) was applied for the k values such that the observed sizes of consecutive frequency levels k and \(k\!+\!1\) are monotonically decreasing, \(W_{obs}(k,C)\!>\!W_{obs}(k\!+\!1,C)\). This avoids the non-monotonic regions of the observed \(\ln (W(k))\) curve (Fig. 1a). We considered a basic set of k values, \(K\!=\!\{1,2,3,4,5,6,7,8,16,32,64,128\}\), constrained (to ensure \(\ln (W(k))\) monotony) depending on the corpus size C: for \(C\!<\!250\) Mw, we used \(k\!\le \!16\); for \(C\!<\!1\) Gw, \(k\!\le \!32\); the full K set was used only for \(C\!>\!4\) Gw. We selected reference corpora of sizes (\(C_{1}\)) 1.1 Gw (English) and 808 Mw (French). (III) The \(\alpha _{d}(C_{1})\) values for n-gram sizes from 1 to 6 are: (English) 1.1595, 1.02029, 0.88825, 0.82532, 0.8117, 0.8027; (French) 1.158825, 1.0203, 0.86605, 0.84275, 0.80818, 0.7569. The empirical \(p_1\) values for n-gram sizes from 1 to 6 are: (English) 0.06704, 0.03250, 0.0062557, 0.0023395, 0.0017908, 0.0014424; (French) 0.07818, 0.037976, 0.004685, 0.0036897, 0.001971, 0.00072944. (IV) To run \(W_{P}\!(k,C)\), the \(\alpha \) values leading to the lowest relative errors are, for n-gram sizes from 1 to 6: (English) 1.17, 1.02, 0.891, 0.827, 0.814, 0.812; (French) 1.156, 1.01, 0.884, 0.842, 0.806, 0.759.
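Step (II) can be sketched as follows (`W_obs` is assumed to be a precomputed mapping from k to the observed level size for the corpus at hand; the \(k\le 64\) cap between 1 Gw and 4 Gw is our reading of the constraints above, not stated in the text):

```python
def usable_k_values(W_obs, C, base=(1, 2, 3, 4, 5, 6, 7, 8, 16, 32, 64, 128)):
    """Constrain the basic K set as in step (II): cap k by the corpus size
    C (in words), then keep only the k values where the observed level
    sizes still decrease, W_obs(k) > W_obs(k + 1)."""
    if C < 250e6:
        cap = 16
    elif C < 1e9:
        cap = 32
    elif C <= 4e9:
        cap = 64   # assumed: the text only states the full set needs C > 4 Gw
    else:
        cap = 128
    return [k for k in base
            if k <= cap and W_obs.get(k, 0) > W_obs.get(k + 1, 0)]
```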

Table 1. Parameter values \(K_1\), \(K_2\) and vocabulary sizes V(L, n) used in D(C, L, n), (3), and the \(\beta \) and \(\delta \) values used in \(W^{*}\!(k,\!C)\), (9).

Table 2 presents the relative errors of the predictions of the frequency level sizes. For each n-gram size, the left column refers to \(W^{*}\!(k,C)\) and the right one to \(W_{P}\!(k,C)\). For each pair (corpus size, n-gram size), it shows the average relative error over the K set used: \(AvgErr(K)=\frac{1}{\Vert K\Vert }\sum _{k\in K}Err(k)\), where \(Err(k)\!=|\frac{W\!(k,C)-W_{obs}\!(k,C)}{W_{obs}\!(k,C)}|\). The average relative errors for \(W^{*}\!(k,C)\) are much lower than those for \(W_{P}(k,C)\), which assumes an ideal Zipf’s Law. The line Avg shows the average value of each column over the full range of corpora sizes, with errors of the same magnitude across the range of n-gram sizes for \(W^{*}\!(k,C)\), but with significant variations in the Avg values for \(W_{P}(k,C)\). The global relative error, the average of the Avg values over the range of n-gram sizes, is around \(4\%\) for \(W^{*}\!(k,C)\). Thus, the \(W^{*}\!(k,C)\) curves (omitted due to lack of space) closely follow the shapes of the \(W_{obs}(k,C)\) curves in Fig. 1.
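The error metric itself is direct to compute; a sketch matching the definition above, where `W_model` is any of the level-size functions sketched earlier with its parameters fixed:

```python
def avg_relative_error(W_model, W_obs, ks):
    """AvgErr(K) in percent: the mean over k in K of
    |W(k, C) - W_obs(k, C)| / W_obs(k, C)."""
    errs = [abs(W_model(k) - W_obs[k]) / W_obs[k] for k in ks]
    return 100.0 * sum(errs) / len(errs)
```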

Table 2. Average relative error (\(\%\)) for the predictions of the n-gram frequency level sizes obtained by \(W^{*}\!(k,C)\), (9), (left col.), and \(W_{P}\!(k,C)\), (10), (right col.). Each cell in the table gives an average relative error over a subset of k values within the set K considered for that cell, as described in the text.

5 Conclusions

Estimating n-gram frequency distributions is useful in statistically-based n-gram applications. The proposed model estimates the sizes W(k, C) of the equal-frequency (k) n-gram groups in a corpus of size C, for the low frequency n-grams. It applies uniformly to different n-gram sizes \(n\!\ge \!1\) and languages, assuming a finite language n-gram vocabulary. It models the dependence of Zipf’s Law exponent and of W(k, C) on C, agreeing well with n-gram frequency data, from unigrams up to hexagrams, from real un-truncated English and French corpora ranging from millions to billions of words. Evaluation on larger corpora is planned.