
1 Introduction

1.1 Entropy

Randomness Sources. Important computer applications, such as generating cryptographic keys, building countermeasures against side-channel attacks, or gambling, demand randomness of excellent quality, that is, uniformly or almost uniformly distributed sequences of bits. Unfortunately, in practice we cannot generate pure randomness. Even the best physical sources of randomness produce bits that are slightly biased or correlated. Sources which provide some (but not the maximal) amount of randomness are called weak sources. In practice, randomness can be gathered from physical phenomena (such as radiation [hot], photon transmission, thermal noise [BP99], atmospheric noise [ran], or jitter) or from human-device interaction (such as timing of I/O disk and network events [dev], keystrokes or mouse movements [pgp], shaking accelerometers in mobile phones [VSH11], and other ideas). Such raw randomness must be post-processed before use, in order to eliminate bias and correlations between individual bits. While this task can be easily achieved by general-purpose tools called randomness extractors [BST03], the main problem lies in evaluating the quality of a random source: one needs to ensure that enough randomness has been collected, depending on the chosen post-processing technique. This is the major concern in designing so-called true random number generators (TRNGs), which combine randomness sources with post-processing algorithms to generate high-quality random output from underlying weak sources. The design of a typical TRNG is illustrated in Fig. 1 below [BST03].

Fig. 1.

True Random Number Generators. The scheme illustrates the typical design, whose building blocks are: (a) an entropy source, (b) a harvesting mechanism, and (c) a post-processor (extractor). The main issue is how to ensure that enough raw bits have been collected.

Quantifying Randomness in Theory. From a theoretical point of view, one evaluates the randomness within a known probability distribution using the notion of entropy; different entropy definitions are used depending on the context. In information theory the most widely used notion is Shannon entropy, which quantifies the encoding length of a given distribution. For any discrete random variable X, its Shannon entropy equals

$$\begin{aligned} H(X) = -\sum _{x}\Pr [X=x]\log \Pr [X=x]. \end{aligned}$$
(1)

In turn, cryptographers use the more conservative notion called min-entropy, which quantifies unpredictability (in particular, min-entropy provides a bound on how much randomness can be extracted). The min-entropy of a probability distribution X is defined as

$$\begin{aligned} H_{\infty }(X) = \min _x \log \left( 1/\Pr [X=x]\right) . \end{aligned}$$
(2)

In general, there is a large gap between these two measures: the min-entropy of an n-bit string may be only O(1) whereas its Shannon entropy is as big as \(\varOmega (n)\).
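To see this gap concretely, consider a standard example (not taken from this paper): a source over n-bit strings that outputs the all-zero string with probability 1/2 and a uniform string otherwise. A short sketch computing both entropies for n = 32:

```python
import math

n = 32
# Mixture source (an illustrative example, not from the paper): with probability
# 1/2 output the all-zero n-bit string, otherwise output a uniform n-bit string.
p_zero  = 0.5 + 0.5 * 2.0**-n      # probability of the all-zero string
p_other = 0.5 * 2.0**-n            # probability of each of the other 2^n - 1 strings

# Shannon entropy: one heavy point plus 2^n - 1 equally light points.
shannon = -p_zero * math.log2(p_zero) - (2**n - 1) * p_other * math.log2(p_other)
# Min-entropy: determined solely by the most likely outcome.
min_ent = -math.log2(p_zero)

print(f"Shannon entropy ~ {shannon:.2f} bits  (Omega(n): roughly n/2)")
print(f"min-entropy     ~ {min_ent:.3f} bits  (O(1): roughly 1)")
```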

Quantifying Randomness in Practice. From a practical point of view, it often happens that not only is the distribution of X unknown, but it may even be hard to fit a theoretical model with some degrees of freedom (for example, knowing that X comes from a particular parametrized family of distributions without knowing the concrete parameters). This is in particular the case when randomness is gathered from several different sources (as in the Linux random number generator). In such cases we have no understanding of the underlying physical process and cannot conclude anything about its characteristics (like the entropy), which would be the recommended approach [KMT15]. Instead, we can only observe output samples of the source, treating it as a black-box device. In this setting we need an entropy estimator, which guesses the entropy of an unknown distribution X based on its samples.
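As a purely illustrative sketch of what such a black-box estimator can look like, the snippet below implements the naive plug-in estimator on an assumed biased-coin source (the bias and sample size are arbitrary choices); real evaluations rely on more careful estimators such as those cited below.

```python
from collections import Counter
import math
import random

def plugin_shannon_entropy(samples):
    """Naive plug-in estimate of Shannon entropy (bits per sample): replace the
    unknown distribution by the empirical frequencies of the observed samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A stand-in black-box source: a biased bit with Pr[1] = 0.1 (arbitrary choice).
random.seed(0)
samples = [1 if random.random() < 0.1 else 0 for _ in range(100_000)]
true_h = -(0.1 * math.log2(0.1) + 0.9 * math.log2(0.9))
print(f"estimated H ~ {plugin_shannon_entropy(samples):.4f} bits/sample")
print(f"true H      = {true_h:.4f} bits/sample")
```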

1.2 Entropy Estimating

Motivations for Entropy Estimating. We have already seen that we need entropy estimators to evaluate the quality of a source and thus the quality of the extractor output. Entropy estimating is also motivated by the fact that many parties may be interested in evaluating the source entropy in the context of TRNGs [BL05, KMT15]: (a) designers, when they fail to fit a good model to the source, (b) testing labs, when verifying the quality claimed by manufacturers, and (c) developers, especially when working with multiple sources.

Shannon Entropy or Min-Entropy? Technically speaking, extractable randomness is quantified in terms of min-entropy, not Shannon entropy. However, there are two reasons why it actually makes sense to work with Shannon entropy:

  (a) Shannon entropy is easier to reliably estimate.

  (b) For memoryless sources, Shannon entropy and min-entropy are comparable.

Regarding (a), we note that to estimate Shannon entropy one can use fairly general source models based on Markov chains, like the Maurer-Coron tests [Cor99], or measures based on mutual information [YSK13]. Also, Shannon entropy is much easier (and more efficient) to estimate in an online manner, where the source distribution may change over time. Such estimators are an active research area and find important applications not only in cryptography [BL05, LPR11] but also in learning, data mining, and network anomaly detection [HJW15].

To discuss (b), recall that a memoryless source (also called stateless) is a source which produces consecutive samples independently. While this is a restriction, it is often assumed as a part of the model by practitioners working on random number generators (cf. [LRSV12, BKMS09, BL05, DG07]) or enforced under some circumstances (the so-called certification mode, which enforces fresh samples, see [BL05, DG07]). An important result follows from a more general fact called the Asymptotic Equipartition Property (AEP): for a stateless source, the min-entropy rate (min-entropy per sample) is close to its Shannon entropy per sample (conditionally, with probability almost 1), for a large number of samples.

AEP and Extracting from Memoryless Shannon Sources. We start with the following general fact, which follows easily from the Weak Law of Large Numbers.

Theorem 1

(Asymptotic Equipartition Property). The min-entropy per bit in a sequence \(X_1,\ldots ,X_n\) of i.i.d. samples from X, for large n and with high probability, is close to the Shannon entropy of X. More precisely,

$$\begin{aligned} -\frac{1}{n} \log P_{X_1,\ldots ,X_n}(x_1,\ldots ,x_n) \longrightarrow H(X) \end{aligned}$$
(3)

where the convergence holds in probability (over \((x_1,\ldots ,x_n)\leftarrow X_1,\ldots ,X_n\)).

Intuitively, the AEP simply means that, conditionally with large probability, the product of many independent copies is flat and its min-entropy approaches Shannon entropy.
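A small simulation (with an arbitrarily chosen bias p = 0.1) illustrates this convergence:

```python
import math
import random

random.seed(1)
p = 0.1                                                   # arbitrary example bias
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))      # Shannon entropy of one bit

def empirical_rate(n):
    """Draw x_1..x_n i.i.d. and return -(1/n) log2 P_{X^n}(x_1..x_n)."""
    ones = sum(1 for _ in range(n) if random.random() < p)
    log_prob = ones * math.log2(p) + (n - ones) * math.log2(1 - p)
    return -log_prob / n

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}:  -(1/n) log P = {empirical_rate(n):.4f}   (H(X) = {H:.4f})")
```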

Thus, the AEP is a bridge connecting the heuristic use of Shannon entropy as a measure of extractable randomness (practice) and the provable security (randomness extractors theory). The best known quantitative form of Eq. (3) appears in [Hol06].

Theorem 2

(Quantitative Asymptotic Equipartition Property [Hol06]). Let \(X^{n} = X_1,\ldots ,X_n\) be a sequence of i.i.d. samples from a distribution X of Shannon entropy k. Then the sequence \((X_1,\ldots ,X_n)\) is \(\epsilon \)-close in the variational distance to a distribution of min entropy \(kn - O\left( \sqrt{kn\log (1/\epsilon ) }\right) \).

Now we restate the same result in the language of randomness extractors.

Corollary 1

(Extracting from Memoryless Shannon Sources, Lower Bound). In particular, in the above setting, one can extract at least

$$\begin{aligned} m = kn - O\left( \sqrt{kn\log (1/\epsilon ) }\right) -2\log (1/\epsilon ) \end{aligned}$$
(4)

bits which are \(\epsilon \)-close to uniform (e.g. using pairwise independent hash functions [HILL99] as an extractor). Since in most settings we have \(\epsilon \gg 2^{-kn}\), we extract

$$\begin{aligned} m = {H}(X^{n}) - O\left( \sqrt{ {H}(X^{n}) \log (1/\epsilon )} \right) \end{aligned}$$
(5)

bits, that is, we extract all the Shannon entropy except for \(O\left( \sqrt{kn\log (1/\epsilon )} \right) \) bits.
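For a rough feel of Eq. (4), the sketch below evaluates the bound with the unspecified constant under \(O(\cdot )\) set to 1 (an arbitrary illustrative choice), for a source of entropy rate k = 0.125 per sample and \(\epsilon = 2^{-80}\):

```python
import math

def extractable_bits(k, n, eps, c=1.0):
    """Lower bound of Eq. (4) with the unspecified O(.) constant set to c.
    c = 1 is an arbitrary illustrative choice, not a value from the paper."""
    kn = k * n
    log_eps = math.log2(1.0 / eps)
    return kn - c * math.sqrt(kn * log_eps) - 2 * log_eps

k, eps = 0.125, 2.0**-80          # entropy per sample, target distance from uniform
for n in (1_000, 10_000, 100_000):
    m = extractable_bits(k, n, eps)
    # A negative value means the bound is vacuous for that n.
    print(f"n={n:>7}:  H(X^n) = {k*n:8.0f} bits,  extract >= {m:8.0f} bits")
```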

1.3 Problem Statement

We have already seen that in the case of many independent copies the amount of extractable bits asymptotically approaches the Shannon entropy. We note that some works, including works on entropy estimating [BL05, LPR11], suggest using a simplified (asymptotic) version of Eq. (5), namely

$$\begin{aligned} m\approx {H}(X^{n}) \end{aligned}$$
(6)

bits \(\epsilon \)-close to uniform, ignoring the lower-order term \(O\left( \left( {H}(X^{n}) \log (1/\epsilon ) \right) ^{\frac{1}{2}} \right) \). The question which naturally arises is how much we lose by this approximation. Is it safe to assume (heuristically) that equality (6) holds in practical parameter regimes?

Question: What is the exact error of the AEP heuristic Eq. (6)?

We restate this problem as the task of finding upper bounds on the extraction rate for memoryless Shannon entropy sources.

Question (Reformulated): Suppose that we have a source which produces i.i.d. samples \(X_1,X_2,\ldots \), each of Shannon entropy k. How many almost uniform bits can be extracted from n such samples?

This question is well motivated, as no upper bounds matching the lower bound of Theorem 2 have been known so far (though some other works [Hol11] also address lower bounds), and because of the popularity of the AEP heuristic (6).

1.4 Our Results and Applications

The Tight No-Go Result. We answer the posed question, showing that the convergence rate in Eq. (3) given in Theorem 2 is optimal.

Theorem 3

(An Upper Bound on the Extraction Rate from Shannon Sources). From any sequence of i.i.d. binary random variables \(X^{n}=X_1,\ldots ,X_n\), no extractor can get more than

$$\begin{aligned} m = H(X^{n}) - \varTheta (\sqrt{H(X^{n}) \log (1/\epsilon )}) \end{aligned}$$
(7)

bits which are \(\epsilon \)-close (in the variation distance) to uniform. This matches the lower bound in [Hol06] (the constant under \(\varTheta (\cdot )\) depends on the source X).

Corollary 2

(A Significant Entropy Loss in the AEP Heuristic Estimate). In the above setting, the gap between the Shannon entropy and the number of extractable bits \(\epsilon \)-close to uniform is at least \(\varTheta (\sqrt{\log (1/\epsilon )kn})\). In particular, for the recommended security level (\(\epsilon =2^{-80}\)) we obtain a loss of \(kn - m \approx \sqrt{80 k n}\) bits, no matter what extractor we use.

An Application to TRNGs: Not To Overestimate Security. Imagine a mobile phone where the accelerometer is used as an entropy source. Such a source was studied in [LPR11], and its Shannon entropy rate was estimated to be roughly 0.125 per bit. The recommended security level for almost random bits is \(\epsilon = 2^{-80}\). According to the heuristic (6) we need roughly \(n=128/0.125 = 1024\) samples to extract a 128-bit key. However, taking into account the true error term in Theorem 3, we see that we need at least \(n\approx 2214\) samples!
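The arithmetic behind these two numbers can be reproduced as follows (again taking the hidden \(\varTheta (\cdot )\) constant to be 1, so the figures are illustrative rather than exact):

```python
import math

rate, target, log_eps = 0.125, 128, 80     # bits/sample, key length, log2(1/eps)

# Heuristic (6): m ~ H(X^n) = rate * n, hence n = target / rate.
n_heuristic = target / rate

# Corrected bound: solve rate*n - sqrt(rate*n*log_eps) = target for n.
# Substituting h = rate*n gives a quadratic in sqrt(h).
s = (math.sqrt(log_eps) + math.sqrt(log_eps + 4 * target)) / 2   # s = sqrt(h)
n_corrected = s * s / rate

print(f"heuristic : n ~ {n_heuristic:.0f} samples")
print(f"corrected : n ~ {n_corrected:.0f} samples")
```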

An Application to the AEP: Convergence Speed. Let \(X^{n}=X_1,X_2,\ldots ,X_n\) be a sequence of i.i.d. bit random variables. By the standard AEP we know that with probability \(1-\epsilon \) over \(x\leftarrow X^n\) we have \(P_{X^n}(x) \leqslant 2^{-n H(X_1)+ O\left( \sqrt{nH(X_1)\log (1/\epsilon )}\right) }\). Our result implies that for any event E of probability \(1-\epsilon \), for some \(x\in E\) we have \(P_{X^n}(x) \geqslant 2^{-n H(X_1)+\varOmega \left( \sqrt{nH(X_1)\log (1/\epsilon )}\right) }\). This proves that the error term for the convergence is really \(\varTheta \left( \sqrt{nH(X_1)\log (1/\epsilon )}\right) \).

1.5 Organization

The remainder of the paper is structured as follows. In Sect. 2 we give some basic facts and auxiliary technical results that will be used later. The proof of the main result, that is Theorem 3, is given in Sect. 3. Finally, Sect. 4 concludes the work.

2 Preliminaries

2.1 Basic Definitions

The most popular way of measuring how close two distributions are is the statistical distance.

Definition 1

(Statistical Distance). The statistical (or total variation) distance between two distributions X and Y is defined as

$$\begin{aligned} \mathrm {SD}\left( X;Y\right) = \frac{1}{2}\sum _{x}\left| \Pr [X=x]-\Pr [Y=x]\right| . \end{aligned}$$
(8)

If \(\mathrm {SD}\left( X;Y\right) \leqslant \epsilon \), we simply say that X and Y are \(\epsilon \)-close.
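A direct transcription of this definition, for distributions given explicitly as probability tables, may look as follows:

```python
def statistical_distance(P, Q):
    """Statistical (total variation) distance between two distributions given as
    dicts mapping outcomes to probabilities."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(x, 0.0) - Q.get(x, 0.0)) for x in support)

# Example: a slightly biased bit vs. a uniform bit -- they are 0.1-close.
print(statistical_distance({0: 0.6, 1: 0.4}, {0: 0.5, 1: 0.5}))   # 0.1
```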

Below we recall the definitions of Shannon entropy and min-entropy. All logarithms are taken to base 2.

Definition 2

(Shannon Entropy). The Shannon Entropy of a distribution X equals \(H(X) = -\sum _{x}\Pr [X=x]\log \Pr [X=x]\).

Definition 3

(Min Entropy). The min-entropy of a distribution X equals \(H_\mathrm {\infty }(X) = \min _{x}\log \left( 1/\Pr [X=x]\right) \).

2.2 Extractors

Extractors are functions which transform inputs having some required amount of min-entropy into an almost uniform string of known length. To extract from every high-entropy source (that is, to have a general-purpose extractor), one needs to allow extractors to use a small amount of auxiliary randomness, which can be “reinvested”, like a catalyst in chemistry. One also has to accept some small deviation of the output from uniform (small enough to be acceptable for almost every application) and some entropy loss [RTS00]. Good extractors, simple, provably secure and widely used in practice, are obtained from universal hash families [CW79]. We refer the reader to [Sha11] for a survey.
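As a minimal sketch of the hashing approach (not a production implementation), the family \(h_A(x) = Ax\) over GF(2), with a uniformly random \(m\times n\) seed matrix A, is 2-universal and can serve as an extractor:

```python
import secrets

def hash_extract(x_bits, seed_rows, m):
    """Extract m bits from an n-bit input with the 2-universal family
    h_A(x) = A.x over GF(2); the seed is a random m x n binary matrix given as
    one n-bit integer per row.  A minimal sketch, not a production extractor."""
    x = int(x_bits, 2)
    out = 0
    for row in seed_rows[:m]:
        out = (out << 1) | (bin(row & x).count("1") & 1)   # <row, x> mod 2
    return format(out, f"0{m}b")

n, m = 64, 16
seed = [secrets.randbits(n) for _ in range(m)]      # public, uniformly random seed
raw = format(secrets.randbits(n), f"0{n}b")         # stand-in for the weak-source output
print(hash_extract(raw, seed, m))
```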

Definition 4

(Randomness Extractors). A function \(\mathsf {Ext}:\{0,1\}^n\times \{0,1\}^d\rightarrow \{0,1\}^m\) is a \((k,\epsilon )\)-extractor if

$$\mathrm {SD}\left( \mathsf {Ext}(X,U_d),U_d; U_{m+d}\right) \leqslant \epsilon $$

for any distribution X of min-entropy at least k, where \(U_d\) is an independent uniform d-bit string.

Definition 5

(Extractable Entropy, [RW04]). We say that X has k extractable bits within distance \(\epsilon \), denoted \(H_\mathrm {ext}^{\epsilon }(X)\geqslant k\), if for some randomized function \(\mathrm {Ext}\) we have \(\mathrm {SD}\left( \mathrm {Ext}(X,S),S;\ U_k,S\right) \leqslant \epsilon \), where \(U_k\) is a uniform k-bit string and S is an independent uniform string (called the seed).

The so-called Leftover Hash Lemma [HILL88] ensures that \(H_\mathrm {ext}^{\epsilon }(X)\geqslant H_\mathrm {\infty }(X)-2\log (1/\epsilon ).\)

2.3 Technical Facts

Our proof uses the following characterization of “extractable” distributions.

Theorem 4

(An Upper Bound on Extractable Entropy, [RW04]). If \(H_\mathrm {ext}^{\epsilon }(X) \geqslant k\) then X is \(\epsilon \)-close to Y such that \(H_\mathrm {\infty }(Y) \geqslant k\).

The second important fact we use is the sharp bound on binomial tails.

Theorem 5

(Tight Binomial Tails [McK]). Let B(n, p) be a sum of n independent Bernoulli trials, each with success probability p, and let \(q=1-p\). Then for \(\gamma \leqslant \frac{3}{4}q\) we have

$$\begin{aligned} \Pr \left[ B(n,p) \geqslant pn + \gamma n\right] = Q\left( \sqrt{\frac{n\gamma ^2}{pq}} \right) \cdot \psi \left( p,q,n,\gamma \right) \end{aligned}$$
(9)

where the error term satisfies

$$\begin{aligned} \psi \left( p,q,n,\gamma \right) = \exp \left( \frac{n\gamma ^2}{2pq} -n\mathrm {KL}\left( p+\gamma \parallel p\right) + \frac{1}{2}\log \left( \frac{p+\gamma }{p}\cdot \frac{q}{q-\gamma }\right) + O_{p,q}\left( n^{-\frac{1}{2}}\right) \right) \end{aligned}$$
(10)

where \(\mathrm {KL}\left( a \parallel b\right) = a\log (a/b) + (1-a)\log ((1-a)/(1-b))\) is the Kullback-Leibler divergence, and Q is the complement of the cumulative distribution function of the standard normal distribution.
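The following numerical sketch (with arbitrarily chosen parameters) compares the exact binomial tail against the leading exponential term \(\exp \left( -n\mathrm {KL}\left( p+\gamma \parallel p\right) \right) \) computed in nats; the factors \(Q(\cdot )\) and \(\psi (\cdot )\) account for the remaining polynomial gap.

```python
import math

def log_binom(n, k):
    """log of the binomial coefficient C(n, k), via lgamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def binom_tail(n, p, threshold):
    """Exact Pr[B(n, p) >= threshold]."""
    return sum(math.exp(log_binom(n, k) + k * math.log(p) + (n - k) * math.log(1 - p))
               for k in range(threshold, n + 1))

def kl(a, b):
    """KL(a || b) for Bernoulli parameters, in nats."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

n, p, gamma = 2000, 0.3, 0.03                      # arbitrary example parameters
exact   = binom_tail(n, p, int((p + gamma) * n))
leading = math.exp(-n * kl(p + gamma, p))          # leading exponential term only
print(f"exact tail          = {exact:.3e}")
print(f"exp(-n KL(p+g||p))  = {leading:.3e}")      # same exponent, polynomial-factor gap
```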

3 Proof of Theorem 3

3.1 Characterizing Extractable Entropy

We state the following fact with an explanation in Fig. 2.

Lemma 1

(An Upper Bound on the Extractable Entropy). Let X be a distribution. Then for every distribution Y which is \(\epsilon \)-close to X, we have \(H_{\infty }(Y) \leqslant -\log t\), where t satisfies

$$\begin{aligned} \sum _x \max (\mathbf {P}_X(x)-t,0) = \epsilon . \end{aligned}$$
(11)

Proof

(of Lemma 1). The proof follows easily from the observation that the optimal mass rearrangement (the one which maximizes \(H_\mathrm {\infty }(Y)\)) is to decrease the probability mass at the heaviest points. Indeed, we have to find the optimal value of the following optimization program:

$$\begin{aligned} \begin{array}{ll} {{\mathrm{\mathrm {maximize}}}}&{}{H}_{\infty }(Y) \\ \text { s.t. } &{} \mathrm {SD}(X;Y) \leqslant \epsilon \end{array} \end{aligned}$$
(12)

Note that we can write \(P_Y(x) = P_X(x) + \epsilon (x)\), where \(\sum _{x}\epsilon (x) = 0\) and \(\sum _{x}|\epsilon (x)|\leqslant 2\epsilon \), transforming program (12) into

$$\begin{aligned} \begin{array}{ll} {{\mathrm{\mathrm {maximize}}}}&{}\min _{x} \log \frac{1}{P_X(x)+\epsilon (x)} \\ \text { s.t. } &{} \sum _{x}\epsilon (x) = 0,\ \sum _{x}\left| \epsilon (x)\right| \leqslant 2\epsilon \end{array} \end{aligned}$$
(13)

where the optimization runs over the numbers \(\epsilon (x)\). Since \(u\rightarrow \log u^{-1}\) is a decreasing function for \(u\in (0,1)\), the program has the same maximizer as the solution of

$$\begin{aligned} \begin{array}{ll} {{\mathrm{\mathrm {minimize}}}}&{}\max _{x} \left( P_X(x)+\epsilon (x)\right) \\ \text { s.t. } &{} \sum _{x}\epsilon (x) = 0,\ \sum _{x}\left| \epsilon (x)\right| \leqslant 2\epsilon \end{array} \end{aligned}$$
(14)

For the set \(S = \{x: \epsilon (x) < 0\}\), we claim that the optimal solution satisfies \(P_X(x)+\epsilon (x) = \mathrm {const}\) on \(x\in S\). Indeed, otherwise we have

$$P_X(x_1)+\epsilon (x_1) < P_X(x)+\epsilon (x) \leqslant P_X(x_2)+\epsilon (x_2) $$

for some \(x_1,x_2 \in S\) and every \(x \in S\setminus \{x_1\}\). Then replacing \(\epsilon (x_1),\epsilon (x_2)\) by \(\frac{\epsilon (x_1)+\epsilon (x_2)}{2}\) increases the objective keeping the constraint, a contradiction. Thus \(P_X(x)+\epsilon (x) = t_0\) whenever \(\epsilon (x) < 0\). Similarly, we prove that \(P_X(x)+\epsilon (x) \leqslant t_0\) for any x. Going back to Eq. (13), it suffices to observe that we have

$$\min _x\left( \log \frac{1}{P_X(x)+\epsilon (x)} \right) \leqslant -\log t_0.$$

Finally, since \(-\epsilon (x) = P_X(x)-t_0\) when \(x\in S\), we get \(\epsilon \geqslant -\sum _{x\in S}\epsilon (x) = \sum _{x}\max (P_X(x)-t_0,0)\). In particular, \(t_0 \geqslant t\), which gives \(-\log t_0 \leqslant -\log t\), and the result follows.       \(\square \)

Fig. 2.

The Entropy Smoothing Problem. For a given probability density function, we want to cut a total mass of up to \(\epsilon \) above a possibly highest threshold (dotted red) and rearrange it (green), keeping the resulting upper bound on the probability mass as small as possible.
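The smoothing computation behind Lemma 1 can also be carried out numerically. The sketch below solves Eq. (11) for a toy distribution, 8 i.i.d. bits with \(\Pr [X=1]=0.75\) (an arbitrary example), and reports the resulting upper bound \(-\log t\):

```python
import math

def extractable_entropy_upper_bound(probs, eps):
    """Solve Eq. (11): find t with sum_x max(P(x) - t, 0) = eps, and return
    -log2(t), which by Lemma 1 and Theorem 4 upper-bounds H_ext^eps(X)."""
    p = sorted(probs, reverse=True)
    prefix = 0.0
    for k in range(1, len(p) + 1):
        prefix += p[k - 1]
        t = (prefix - eps) / k          # candidate threshold using the top k masses
        if t >= (p[k] if k < len(p) else 0.0):
            return -math.log2(t)
    raise ValueError("eps is too large for this distribution")

# Toy example: X^8 = 8 i.i.d. bits with Pr[1] = 0.75, smoothing budget eps = 0.01.
pb = 0.75
probs = [pb**w * (1 - pb)**(8 - w) for w in range(9) for _ in range(math.comb(8, w))]
shannon = -sum(q * math.log2(q) for q in probs)
print(f"H(X^8)                     = {shannon:.3f} bits")
print(f"upper bound on H_ext^0.01  = {extractable_entropy_upper_bound(probs, 0.01):.3f} bits")
```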

Without loss of generality, we assume from now on that \(X\in \{0,1\}\) with \(\Pr [X=1]=p\) and \(q=1-p\). Define \(X^n = \left( X_1,\ldots ,X_n\right) \). For any \(x\in \{0,1\}^n\), writing \(\left\| x\right\| \) for its Hamming weight, we have

$$\begin{aligned} \Pr [X^n=x] = p^{\left\| x\right\| }q^{n-\left\| x\right\| }. \end{aligned}$$
(15)

According to the last lemma and Theorem 4, we have

$$\begin{aligned} H_\mathrm {ext}^{\epsilon }\left( X^n\right) \leqslant -\log t \end{aligned}$$
(16)

where \(t\in (0,1)\) is such that

$$\begin{aligned} \sum _{x}\max \left( \mathbf {P}_{X^n}(x)-t, 0 \right) = \epsilon . \end{aligned}$$
(17)

From now on we assume that

$$\begin{aligned} t = p^{p n+\gamma n}q^{q n-\gamma n}. \end{aligned}$$
(18)

3.2 Determining the Threshold t

The next key observation is that t is actually small and can be neglected; that is, we can simply cut at the \((1-\epsilon )\)-quantile. This is stated in the lemma below.

Lemma 2

(Replacing the Threshold by the Quantile). Let \(x_0\in \{0,1\}^n\) be a point such that \(\Vert x_0\Vert = pn + \gamma n\). Then we have

$$\begin{aligned} \sum _{x:\ \Vert x\Vert \geqslant \Vert x_0\Vert } \max \left( \mathbf {P}_{X^n}(x) - \mathbf {P}_{X^{n}}(x_0), 0 \right) \geqslant \frac{1}{2}\sum _{x:\ \Vert x\Vert \geqslant \Vert x_0\Vert } \mathbf {P}_{X^n}(x) \end{aligned}$$
(19)

To prove the lemma, note that from Theorem 5 it follows that setting

$$\begin{aligned} \gamma ' = \gamma + n^{-1}\log \left( \frac{p}{q}\right) \end{aligned}$$
(20)

we obtain

$$\begin{aligned} \sum _{j\geqslant pn + \gamma ' n} \left( {\begin{array}{c}n\\ j\end{array}}\right) \geqslant \frac{3}{4}\cdot \sum _{j\geqslant pn + \gamma n} \left( {\begin{array}{c}n\\ j\end{array}}\right) \end{aligned}$$
(21)

when \(\gamma \) is sufficiently small compared to p and q (formally, this is justified by calculating the derivative with respect to \(\gamma \) and noticing that it is bigger by at most a factor of \(1+\frac{\gamma }{\sqrt{npq}}\)). But we also have

$$\begin{aligned} p^{j}q^{n-j} \geqslant 2\cdot p^{(p+\gamma )n}q^{(q-\gamma )n}\quad \text {for } j\geqslant pn + \gamma ' n \end{aligned}$$
(22)

Therefore,

$$\begin{aligned} \sum _{j\geqslant pn + \gamma n} \left( {\begin{array}{c}n\\ j\end{array}}\right) p^{j}q^{n-j}&\geqslant \sum _{ j\geqslant pn + \gamma ' n} \left( {\begin{array}{c}n\\ j\end{array}}\right) p^{j}q^{n-j} \nonumber \\&\geqslant 2 \cdot p^{(p+\gamma )n}q^{(q-\gamma )n} \cdot \sum _{ j\geqslant pn + \gamma ' n} \left( {\begin{array}{c}n\\ j\end{array}}\right) \nonumber \\&\geqslant 2\cdot \frac{3}{4} \cdot p^{(p+\gamma )n}q^{(q-\gamma )n} \cdot \sum _{ j\geqslant pn + \gamma n} \left( {\begin{array}{c}n\\ j\end{array}}\right) \end{aligned}$$
(23)

which finishes the proof.

3.3 Putting This All Together

Now, by combining Lemmas 1 and 2 and the estimate \(Q(x) \approx x^{-1}\exp (-x^2/2)\) for \(x\gg 0\) we obtain

$$\begin{aligned} \epsilon \geqslant \exp \left( -n\mathrm {KL}\left( p+\gamma \parallel p\right) - \log \left( \frac{n\gamma ^2}{2pq} \right) +O_{p,q}(1)\right) \end{aligned}$$
(24)

which, because of the Taylor expansion \(\mathrm {KL}\left( p+\gamma \parallel p\right) = \frac{\gamma ^2}{2pq}+ O_{p,q}(\gamma ^3)\), gives us

$$\begin{aligned} \gamma \geqslant \varOmega \left( \sqrt{\frac{\log (1/\epsilon )}{pq n}} \right) \end{aligned}$$
(25)

Setting \(\gamma = c\cdot \sqrt{\frac{\log (1/\epsilon )}{pq n}} \) with a sufficiently big constant c, we obtain the claimed result.
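For completeness, here is the omitted final step, assuming without loss of generality that \(p > q\) (for \(p=q=\frac{1}{2}\) the source is already uniform, which is consistent with the constant under \(\varTheta (\cdot )\) depending on X). Combining Eqs. (16) and (18),

$$\begin{aligned} H_\mathrm {ext}^{\epsilon }\left( X^n\right) \leqslant -\log t&= -(pn+\gamma n)\log p - (qn-\gamma n)\log q \\&= H(X^{n}) - \gamma n\log (p/q) \\&= H(X^{n}) - \varTheta \left( \sqrt{H(X^{n})\log (1/\epsilon )}\right) , \end{aligned}$$

where the last equality substitutes the above choice of \(\gamma \), uses \(H(X^{n}) = nH(X)\), and absorbs all factors depending only on p and q into \(\varTheta (\cdot )\).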

4 Conclusion

We have shown an upper bound on the amount of random bits that can be extracted from a Shannon entropy source. Even in the most favourable case, that is for independent bits, the gap between the Shannon entropy and the amount of randomness that can be extracted is significant: in practical settings, the Shannon entropy may even be 2 times bigger than the extractable entropy. We conclude that the error term in the AEP needs to be taken into account when extracting from memoryless Shannon sources.