
1 Introduction

1.1 Entropy Smoothing

Security Based on Statistical Closeness. In most cryptographic applications, probability distributions that are close enough in the variational (statistical) distance are considered indistinguishable. Informally, they have similar cryptographic “quality” when used as randomness sources (randomness extraction) [15] or as secure keys (in the context of key derivation [1, 6]).

Entropy Notions do not See Statistical Closeness. Unfortunately, standard entropy notions (including min-entropy and collision entropy, which are widely used as randomness measures in cryptography) are not robust with respect to small probability perturbations. Consider the AES cipher with a 256-bit key which is \(\epsilon = 2^{-80}\)-close to uniform. While such a key is considered secure nowadays, it may happen that it has no more than 81 bits of min-entropy (more precisely, fix \(x_0\in \{0,1\}^{256}\) and consider the key X which equals \(x_0\) with probability \(2^{-256}+2^{-80}\) and spreads the remaining mass uniformly over \(\{0,1\}^{256}\setminus \{x_0\}\)). This is a mismatch with respect to our intuitive understanding of min-entropy as a measure of how many almost random bits can be extracted.
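As a quick numeric sanity check of this example, here is a minimal Python sketch; the closed-form expressions below simply restate the spiked distribution from the example.

```python
import math

n_bits = 256          # key length from the example
eps = 2.0 ** -80      # size of the probability spike

# Spiked key: one value x0 gets 2^-256 + 2^-80, the remaining mass is spread
# uniformly over the other 2^256 - 1 values.
p_spike = 2.0 ** -n_bits + eps

# Min-entropy only looks at the largest probability.
print("min-entropy of the spiked key:", round(-math.log2(p_spike), 2), "bits")  # ~80

# Statistical distance to uniform: half the total absolute deviation.  The spike
# exceeds 2^-256 by eps and the same mass eps is missing from the other values.
sd_to_uniform = 0.5 * (eps + eps)
print("distance to uniform: 2 **", round(math.log2(sd_to_uniform)))             # -80
```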

Smooth Entropy takes Probability Perturbations into Account. To fix the issue described above, the concept of smooth min-entropy has been proposed [4, 15]. Smooth entropy is defined as the maximal possible entropy within a certain distance to a given distribution. More precisely, for a given entropy notion \(H(\cdot )\) (which is usually Renyi entropy, see Sect. 2 for its formal definition) we define the \(\epsilon \)-smooth entropy of X as the value of the following optimization program

$$\begin{aligned} \begin{array}{ll} \mathrm {maximize}\quad &{H}(Y) \\ \text {s.t.}\quad &\mathrm {SD}(X;Y) \leqslant \epsilon \end{array} \end{aligned}$$
(1)

where \(\mathrm {SD}()\) stands for the statistical (variational) distance (see Sect. 2 for a formal definition). This definition is well-suited for cryptographic applications, because it no longer depends on negligible variations of the probability distribution. In particular, setting \(\epsilon =2^{-80}\) for our AES example we obtain the “correct” result of 256 bits of (smooth) entropy.

Importance of Smooth Entropy. Smooth Renyi entropy, formally introduced by Renner and Wolf in [15], has found many applications including privacy amplification [13, 15, 19], information reconciliation [15] and quantum information theory [17, 18]. The technique of perturbing a distribution to gain more entropy was actually known before: for example, entropy smoothing is implicitly used to prove the Asymptotic Equipartition Property [10] or, more concretely, in the construction of a pseudorandom generator from one-way functions [7, 9]. However, the simple question

Question 1

How does the shape of optimal Y depend on X?

has not been fully understood so far. In this paper we answer Question 1 by explicitly characterizing the shape of Y depending on X, and give some applications of the derived characterization.

1.2 Related Works and Our Contribution

Related Works. The problem of finding the optimal shape for Y has been addressed in [13, 15]. The authors argued intuitively that for min-entropy (which is a special case of Renyi entropy, particularly useful in randomness extraction) the optimal solution cuts down the biggest probabilities of X.

Our Contribution. We show that this characterization is not true, and that the problem is more subtle: the optimal solution uses a threshold cut, not a quantile cut (see Fig. 1).

Fig. 1. Our result: the optimal shape for entropy smoothing

The precise answer to Question 1 is given in Theorem 1. We provide an intuitive explanation, as a three-step algorithm, in Fig. 2.

Theorem 1

(Optimal Renyi entropy smoothing). Let \(\alpha >1\) be fixed, X be an arbitrary distribution over a finite set and \(\epsilon \in (0,1)\). Let \(t\in (0,1)\) be such that

$$\begin{aligned} \sum _{x} \max \left( \mathbf {P}_X(x) - t,0 \right) = \epsilon , \end{aligned}$$
(2)

and Y be distributed according to

$$\begin{aligned} \mathbf {P}_{Y}(x) = \frac{ \min \left( t,\mathbf {P}_X(x) \right) }{1-\epsilon }. \end{aligned}$$
(3)

Then Y is nearly optimal, that is we have

$$\begin{aligned} \mathrm {SD}(X;Y) \leqslant \epsilon \end{aligned}$$
(4)

and

$$\begin{aligned} \mathbf {H}_{\alpha }(Y) \leqslant \mathbf {H}_{\alpha }^{\epsilon }(X) \leqslant \mathbf {H}_{\alpha }(Y) + \frac{\alpha }{\alpha -1}\log \left( \frac{1}{1-\epsilon }\right) \end{aligned}$$
(5)
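The construction of Theorem 1 is straightforward to implement. The sketch below finds the threshold t of Eq. (2) by bisection and builds Y as in Eq. (3); the toy distribution, the parameter values and the bisection routine are illustrative choices, not part of the theorem.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha > 1, in bits."""
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

def smooth(p, eps, iters=200):
    """Find t with sum(max(p - t, 0)) = eps by bisection (Eq. 2), then
    return Y with P_Y(x) = min(t, P_X(x)) / (1 - eps) (Eq. 3)."""
    lo, hi = 0.0, float(p.max())
    for _ in range(iters):
        t = (lo + hi) / 2.0
        if np.maximum(p - t, 0.0).sum() > eps:
            lo = t      # too much mass above t: raise the threshold
        else:
            hi = t      # too little mass above t: lower the threshold
    return np.minimum(p, t) / (1.0 - eps), t

p = np.array([0.4, 0.2, 0.15, 0.1, 0.08, 0.07])   # toy distribution
eps, alpha = 0.05, 2.0

y, t = smooth(p, eps)
bound = renyi_entropy(y, alpha) + alpha / (alpha - 1) * np.log2(1 / (1 - eps))
print("t =", round(t, 4), " SD(X;Y) =", round(0.5 * np.abs(p - y).sum(), 4))  # <= eps
print("H_a(Y) =", round(renyi_entropy(y, alpha), 4),
      " upper bound on smooth H_a(X) =", round(bound, 4))
```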

Corollary 1

(Tightness of Theorem 1). Note that for fixed \(\alpha > 1\) we have \(\frac{\alpha }{\alpha -1}\log \left( \frac{1}{1-\epsilon }\right) = O(\epsilon )\). Thus, our solution differs from the ideal one by only a negligible additive term in the entropy amount, which is good enough for almost all applications.

Fig. 2. Our result: details of optimal entropy smoothing

1.3 Tight No-Go Results for Extracting from Stateless Shannon Sources

A stateless source (also called memoryless) is a source which produces consecutive samples independently. While this is a restriction, it is often assumed by practitioners working on random number generators (cf. [2, 3, 5, 11]) and argued to be reasonable under some circumstances (the so-called restart mode, which enforces fresh samples, see [3, 5]). An important result is obtained from a more general fact called the Asymptotic Equipartition Property (AEP). Namely, for a stateless source the min-entropy rate (min-entropy per sample) is close to the Shannon entropy rate. The convergence holds in probability, for a large number of samples.

A variant of the AEP: the min-entropy per sample in a sequence \(X_1,\ldots ,X_n\) of i.i.d. samples from X converges, when \(n\rightarrow \infty \), to the Shannon entropy of X. More precisely

$$\begin{aligned} \frac{-\log \mathbf {P}_{X_1,\ldots ,X_n}(\cdot )}{n} \overset{\text {in probability}}{\longrightarrow } H(X), \end{aligned}$$
(6)

where the probability is taken over \(X_1,\ldots ,X_n\).

Thus, the AEP is a bridge connecting the heuristic use of Shannon entropy as a measure of extractable randomness (practice) with provable security (the theory of randomness extractors). The best known quantitative form of Eq. (6) appears in [9].

Lemma 1

(Asymptotic Equipartition Property [9]). Let \(X_1,\ldots ,X_n\) be i.i.d. samples from a distribution X of Shannon entropy k. Then the sequence \((X_1,\ldots ,X_n)\) is \(\epsilon \)-close to a distribution of min entropy \(kn - O\left( \sqrt{kn\log (1/\epsilon ) }\right) \).

Corollary 2

In particular, one can extract \(kn - O\left( \sqrt{kn\log (1/\epsilon ) }\right) -2\log (1/\epsilon )\) bits which are \(\epsilon \)-close to uniform (e.g. using pairwise independent hash functions [8] as an extractor).
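To get a feeling for the magnitudes, the following sketch evaluates the estimate of Corollary 2 for an assumed source; the values of k and n and the constant c hidden under the \(O(\cdot )\) are placeholders chosen only for illustration.

```python
import math

k = 0.5            # Shannon entropy per sample, in bits (hypothetical source)
n = 10 ** 6        # number of samples
eps = 2.0 ** -80   # target distance from uniform
c = 1.0            # placeholder for the constant hidden under the O(.)

aep_loss = c * math.sqrt(k * n * math.log2(1.0 / eps))
extractor_loss = 2 * math.log2(1.0 / eps)

print("Shannon entropy of the sequence :", round(k * n), "bits")
print("AEP smoothing loss (order)      :", round(aep_loss), "bits")
print("extractor loss 2*log(1/eps)     :", round(extractor_loss), "bits")
print("extractable, eps-close to unif. :", round(k * n - aep_loss - extractor_loss), "bits")
```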

Based on Theorem 1 we reprove the following result, which matches the bound in [9]. Our result can be understood as a lower bound on the convergence speed in the Asymptotic Equipartition Property given in Lemma 1.

Theorem 2

(An upper bound on the extraction rate [16]). Let \(X_1,X_2,\ldots \) be a sequence of i.i.d. random variables, each of Shannon entropy k. Then from the sequence \((X_1,X_2,\ldots ,X_n)\) no extractor can get more than

$$\begin{aligned} N = kn - \varTheta (\sqrt{kn\log (1/\epsilon )}) \end{aligned}$$
(7)

bits which are \(\epsilon \)-close (in the variation distance) to uniform (the constant under \(\varTheta (\cdot )\) depends on the source).

Remark 1

(The bound is tight for most settings). Since from N bits of min-entropy we can extract at least \(N-2\log (1/\epsilon )\) bits \(\epsilon \)-close to uniform, and since in most cases \(\log (1/\epsilon ) = o(kn)\), the bound in Theorem 2 matches the achievable rate of Corollary 2 up to the constant hidden under \(\varTheta (\cdot )\).

From Theorem 2 we conclude that the error in Eq. (6) is significant and has to be taken into account no matter what the extractor is. It is worth noting that our separation between Shannon entropy and extractable entropy holds in the most favorable case, when the bits are independent.

Corollary 3

(A significant error in the heuristic estimate). In the above setting, the gap between the Shannon entropy and the number of extractable bits \(\epsilon \)-close to uniform is at least \(\varTheta \left( \sqrt{kn\log (1/\epsilon )}\right) \). In particular, for the recommended security level (\(\epsilon =2^{-80}\)) we obtain a loss of \(kn - N \approx \sqrt{80 k n}\) bits, no matter which extractor we use.
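A quick numeric illustration of Corollary 3, with the source-dependent constant omitted and one bit of Shannon entropy per sample assumed purely for concreteness: the absolute loss grows like \(\sqrt{80kn}\) even though it vanishes relative to kn.

```python
import math

eps = 2.0 ** -80   # recommended security level
k = 1.0            # assume one bit of Shannon entropy per sample (illustration only)

for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    gap = math.sqrt(math.log2(1.0 / eps) * k * n)   # ~ sqrt(80*k*n), constants omitted
    print(f"n = {n:>7}: Shannon entropy {k * n:>9.0f} bits, "
          f"loss ~ {gap:6.0f} bits ({100 * gap / (k * n):5.2f}% of the total)")
```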

1.4 Organization

Notions we use, as well as some auxiliary technical facts, are explained in Sect. 2. We prove our main result, that is Theorem 1, in Sect. 3. The proof of Theorem 2 appears in Sect. 4.

2 Preliminaries

2.1 Basic Definitions

The most popular way of measuring how close two distributions are is the statistical distance.

Definition 1

(Statistical Distance). The statistical (or total variation) distance of two distributions X, Y over the same finite set is defined as

$$\begin{aligned} \mathrm {SD}\left( X;Y\right) = \frac{1}{2}\sum _{x}\left| \Pr [X=x]-\Pr [Y=x]\right| \end{aligned}$$
(8)

If \(\mathrm {SD}(X;Y)\leqslant \epsilon \), we also say that X and Y are \(\epsilon \)-close.
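Definition 1 translates directly into code (note the \(\tfrac{1}{2}\) normalization used throughout the paper); the two toy distributions below are arbitrary.

```python
def statistical_distance(p, q):
    """Statistical (total variation) distance of two distributions given as
    dictionaries mapping outcomes to probabilities (missing keys count as 0)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

X = {"a": 0.5, "b": 0.3, "c": 0.2}
Y = {"a": 0.4, "b": 0.4, "c": 0.2}
print(statistical_distance(X, Y))   # 0.1, i.e. X and Y are 0.1-close
```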

Below we recall the definition of Renyi entropy of order \(\alpha \). Logarithms are taken to base 2.

Definition 2

(Renyi Entropy). The Renyi entropy of order \(\alpha \) of a distribution X equals \(\mathbf {H}_{\alpha }(X) = \frac{1}{1-\alpha }\log \left( \sum _{x} \Pr [X=x]^{\alpha } \right) \).

Taking the limits \(\alpha \rightarrow 1\) and \(\alpha \rightarrow \infty \) we recover two important notions: Shannon entropy and min entropy.

Definition 3

(Shannon Entropy). The Shannon Entropy of a distribution X equals \(H(X) = -\sum _{x}\Pr [X=x]\log \Pr [X=x]\).

Definition 4

(Min Entropy). The min entropy of a distribution X equals \(H_\mathrm {\infty }(X) = -\max _{x}\log \Pr [X=x]\).
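Definitions 2-4 are equally direct to code up. The sketch below, on an arbitrary toy distribution, also illustrates that the Renyi entropy decreases in \(\alpha \) and interpolates between the Shannon entropy (\(\alpha \rightarrow 1\)) and the min entropy (\(\alpha \rightarrow \infty \)).

```python
import math

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha != 1), in bits."""
    return math.log2(sum(px ** alpha for px in p)) / (1.0 - alpha)

def shannon_entropy(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

def min_entropy(p):
    return -math.log2(max(p))

p = [0.5, 0.25, 0.125, 0.125]                    # toy distribution
print("Shannon      :", shannon_entropy(p))      # limit alpha -> 1
for a in (1.01, 2, 10, 100):
    print(f"Renyi a={a:<5}:", renyi_entropy(p, a))
print("min entropy  :", min_entropy(p))          # limit alpha -> infinity
```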

Smooth Renyi Entropy is defined as the value of the program (1).

Definition 5

(Smooth Renyi Entropy, [4]). The \(\epsilon \)-smooth Renyi entropy of order \(\alpha \) of a distribution X equals \({H}^{\epsilon }_{\alpha }(X) = \max _{Y} {H}_{\alpha }(Y)\) where the maximum is taken over Y satisfying the constraint \(\mathrm {SD}(X;Y)\leqslant \epsilon \).

Definition 6

(Extractable Entropy, [14]). We say that X has k extractable bits within distance \(\epsilon \), denoted \(H_\mathrm {ext}^{\epsilon }(X)\geqslant k\), if for some randomized function \(\mathrm {Ext}\) we have \(\mathrm {SD}\left( (\mathrm {Ext}(X,S),S); (U_k,S)\right) \leqslant \epsilon \), where \(U_k\) is a uniform k-bit string and S is an independent uniform string.

2.2 Technical Facts

We will need the following simple fact about convex functions.

Proposition 1

Let f be a strictly convex differentiable real-valued function and \(x<y\). Then for any \(\delta > 0\) we have

$$\begin{aligned} f(x)-f(x-\delta ) < f(y)-f(y-\delta ). \end{aligned}$$

Our proof uses the following characterization of “extractable” distributions.

Theorem 3

(An Upper Bound on Extractable Entropy, [14]). If \(H_\mathrm {ext}^{\epsilon }(X) \geqslant k\) then X is \(\epsilon \)-close to Y such that \(H_\mathrm {\infty }(Y) \geqslant k\).

Another important fact we use is the sharp bound on binomial tails.

Theorem 4

(Tight Binomial Tails [12]). Let B(n,p) be a sum of n independent Bernoulli trials, each with success probability p, and let \(q=1-p\). Then for \(\gamma \leqslant \frac{3}{4}q\) we have

$$\begin{aligned} \Pr \left[ B(n,p) \geqslant pn + \gamma n\right] = Q\left( \sqrt{\frac{n\gamma ^2}{pq}} \right) \cdot \psi \left( p,q,n,\gamma \right) \end{aligned}$$
(9)

where the error term satisfies

$$\begin{aligned}&\psi \left( p,q,n,\gamma \right) = \nonumber \\&\ \ \exp \left( \frac{n\gamma ^2}{2pq} -n\mathrm {KL}\left( p+\gamma \parallel p\right) + \frac{1}{2}\log \left( \frac{p+\gamma }{p}\cdot \frac{q}{q-\gamma }\right) + O_{p,q}\left( n^{-\frac{1}{2}}\right) \right) \end{aligned}$$
(10)

where \(\mathrm {KL}\left( a \parallel b\right) = a\log (a/b) + (1-a)\log ((1-a)/(1-b))\) is the Kullback-Leibler divergence, and Q is the complement of the cumulative distribution function of the standard normal distribution.
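The leading behaviour of Theorem 4 can be checked numerically. The sketch below compares the exact upper binomial tail with the Gaussian factor \(Q\left( \sqrt{n\gamma ^2/(pq)}\right) \) alone, ignoring the correction \(\psi \); the parameters are illustrative and only rough agreement is expected.

```python
from math import comb, erfc, sqrt

n, p, gamma = 1000, 0.3, 0.03      # illustrative parameters (gamma <= 3q/4)
q = 1.0 - p
k0 = round((p + gamma) * n)        # threshold pn + gamma*n

# Exact upper tail P[B(n,p) >= (p+gamma)n] versus the Gaussian factor alone;
# the correction psi(...) of Eq. (10) is ignored here.
exact = sum(comb(n, j) * p ** j * q ** (n - j) for j in range(k0, n + 1))
Q = lambda x: 0.5 * erfc(x / sqrt(2.0))          # complementary standard normal CDF
gaussian = Q(sqrt(n * gamma ** 2 / (p * q)))

print(f"exact tail               = {exact:.3e}")
print(f"Q(sqrt(n*gamma^2/(p*q))) = {gaussian:.3e}")
```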

3 Proof of Theorem 1

We start by rewriting Eq. (1) in the following way

$$\begin{aligned} \begin{aligned}&\mathrm {minimize}&\left( \sum _{x} (\mu (x) +\epsilon (x) )^{\alpha } \right) ^{\frac{1}{\alpha -1}}{\quad } \\&\text { s.t. }&\left\{ \begin{array}{l} \sum _{x}\epsilon (x) = 0 \\ \sum _{x}|\epsilon (x)| = 2\epsilon \\ \forall x\quad 0 \leqslant \mu (x)+\epsilon (x) \leqslant 1 \end{array} \right. \end{aligned} \end{aligned}$$
(11)

where for the sake of clarity we write \(\mu \) instead of \(\mathbf {P}_X\).

Claim

Let \(\epsilon (x)\) be optimal for Eq. (11). Define \(S^{+} = \{x:\ \epsilon (x) < 0\}\). Then \(\mu (x) + \epsilon (x) = \mu (y) + \epsilon (y)\) for all \(x,y \in S^{+}\). We prove this claim by a mass-shifting argument.

Proof

(of Claim). Suppose that

$$\begin{aligned} \epsilon (x_1),\epsilon (x_2) < 0 \text { and }\mu (x_1)+\epsilon (x_1) > \mu (x_2) +\epsilon (x_2)>0 \end{aligned}$$
(12)

for two different points \(x_1,x_2\) (the statement is trivially true when there is only one point). Take a number \(\delta \) such that

$$\begin{aligned} 0 < \delta < \min \left( -\epsilon (x_2), \frac{\left( \mu (x_1) +\epsilon (x_1) \right) - \left( \mu (x_2)+\epsilon (x_2) \right) }{2} \right) \end{aligned}$$
(13)

and modify \(\mu \) by shifting mass \(\delta \) from \(x_1\) to \(x_2\) in the following way

$$\begin{aligned} \epsilon '(x) = \left\{ \begin{array}{rl} \epsilon (x), &{}\quad x\not \in \{x_1,x_2\} \\ \epsilon (x) -\delta , &{} \quad x=x_1 \\ \epsilon (x) +\delta , &{} \quad x=x_2 \\ \end{array} \right. \end{aligned}$$

that is, shifting mass from the bigger of the two points to the smaller one. Note that from Eqs. (11) and (13) it follows that the constraints in (11) are satisfied with \(\epsilon (x)\) replaced by \(\epsilon '(x)\). Let Y be a random variable distributed according to \(\mathbf {P}_Y(x) = \mu (x)+\epsilon (x)\) and let \(Y'\) be distributed as \(\mathbf {P}_{Y'}(x) = \mu (x)+\epsilon '(x)\). Note that we have

$$\begin{aligned} \sum _{x} (\mu (x) +\epsilon '(x) )^{\alpha }&- \sum _{x} (\mu (x) +\epsilon (x) )^{\alpha } = \\&\left( \left( \mathbf {P}_Y(x_2)+\delta \right) ^{\alpha } - \left( \mathbf {P}_Y(x_2)\right) ^{\alpha }\right) - \left( \left( \mathbf {P}_Y(x_1)\right) ^{\alpha }-\left( \mathbf {P}_Y(x_1)-\delta \right) ^{\alpha }\right) \end{aligned}$$

Note that we have

$$\begin{aligned} \mathbf {P}_Y(x_2)<\mathbf {P}_Y(x_2)+\delta < \mathbf {P}_Y(x_1)-\delta < \mathbf {P}_Y(x_1) \end{aligned}$$

by Eqs. (12) and (13). Now from Proposition 1 applied to \(f(u) = u^{\alpha }\), \(x=\mathbf {P}_Y(x_2)+\delta \), \(y=\mathbf {P}_Y(x_1)\) and \(\delta \) (here we also use the assumption \(\alpha > 1\)), it follows that

$$\begin{aligned} \sum _{x} (\mu (x) +\epsilon '(x) )^{\alpha } - \sum _{x} (\mu (x) +\epsilon (x) )^{\alpha } <0 \end{aligned}$$

which means \(\mathbf {H}_{\alpha }(Y') > \mathbf {H}_{\alpha }(Y)\). In other words, Y is not optimal.    \(\square \)
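The mass-shifting step can also be verified on toy numbers: moving \(\delta \) from the larger of two cut points to the smaller one strictly decreases \(\sum _x \mathbf {P}_Y(x)^{\alpha }\) for \(\alpha > 1\), hence strictly increases the Renyi entropy (the values below are arbitrary).

```python
alpha, delta = 2.0, 0.02
y = [0.40, 0.30, 0.20, 0.10]       # P_Y with two unequal "cut" levels y[0] > y[1]

before = sum(p ** alpha for p in y)
y_shifted = [y[0] - delta, y[1] + delta] + y[2:]   # move delta from y[0] to y[1]
after = sum(p ** alpha for p in y_shifted)

print(before, after, after < before)   # True: the shifted Y has larger Renyi entropy
```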

Let \(Y^{*}\) denote an optimal solution of Eq. (1). By the last claim it is clear that there is a number \(t\in (0,1)\) such that the set \(S^{+}=\{x:\ \mathbf {P}_{Y^{*}}(x) < \mu (x) \}\) is contained in \(\{x:\ \mu (x) \geqslant t\}\) and that \(\mathbf {P}_{Y^{*}}(x) \geqslant t\) for \(x\in S^{+}\). Therefore

$$\begin{aligned} \sum _{x} \left( \mathbf {P}_{Y^{*}}(x) \right) ^{\alpha }&\geqslant \#\left\{ x:\ \mu (x) \geqslant t \right\} \cdot t^{\alpha } + \sum _{x:\ \mu (x) < t} (\mu (x))^{\alpha } \nonumber \\&= \sum _{x} \min \left( \mu (x),t\right) ^{\alpha } \nonumber \\&\geqslant (1-\epsilon )^{\alpha } \sum _{x} \left( \mathbf {P}_{Y}(x) \right) ^{\alpha } \end{aligned}$$
(14)

which, since \(\mathbf {H}^{\epsilon }_{\alpha }(X) = \mathbf {H}_{\alpha }(Y^{*})\), proves the second inequality in Eq. (5). To prove the first inequality in Eq. (5) note that

$$\begin{aligned} \mathrm {SD}(X;Y) = \sum _{x:\ \mathbf {P}_X(x) > \mathbf {P}_Y(x)} (\mathbf {P}_X(x)-\mathbf {P}_Y(x) ) = \sum _{x:\ \mu (x) > t/(1-\epsilon ) }\left( \mu (x)-\frac{t}{1-\epsilon }\right) . \end{aligned}$$

Since \(\frac{t}{1-\epsilon } > t\) we have

$$\begin{aligned} \mathrm {SD}(X;Y) = \sum _{x:\ \mu (x) > t/(1-\epsilon ) }\left( \mu (x)-\frac{t}{1-\epsilon }\right) \leqslant \sum _{x:\ \mu (x) > t }\left( \mu (x)-t\right) \end{aligned}$$

and therefore by Eq. (2) we obtain

$$\begin{aligned} \mathrm {SD}(X;Y) \leqslant \sum _{x:\ \mu (x) > t}\left( \mu (x)-t\right) = \epsilon . \end{aligned}$$

which finishes the proof.

4 Proof of Theorem 2

4.1 Characterizing Extractable Entropy

We state the following fact with an explanation in Fig. 3.

Lemma 2

(Upper bound on the extractable entropy). Let X be a distribution. Then for every distribution Y which is \(\epsilon \)-close to X, we have \(H_{\infty }(Y) \leqslant -\log t\), where t satisfies

$$\begin{aligned} \sum _x \max (\mathbf {P}_X(x)-t,0) = \epsilon . \end{aligned}$$
(15)

The proof follows from Theorem 1, applied with \(\alpha \rightarrow \infty \).

Fig. 3. The entropy smoothing problem: for a given probability density function, we want to cut a total mass of up to \(\epsilon \) above a threshold chosen as high as possible (in dotted red) and rearrange it (in green), so as to keep the resulting upper bound on the probabilities as small as possible (Color figure online)

Without loss of generality, we assume from now on that \(X\in \{0,1\}\) with \(\Pr [X=1]=p\) and \(q=1-p\); relabeling the two symbols if necessary, we may also assume \(p\geqslant q\). Define \(X^n = \left( X_1,\ldots ,X_n\right) \). For any \(x\in \{0,1\}^n\) we have

$$\begin{aligned} \Pr [X^n=x] = p^{\left\| x\right\| }q^{n-\left\| x\right\| }. \end{aligned}$$
(16)

According to the last lemma and Theorem 3, we have

$$\begin{aligned} H_\mathrm {ext}^{\epsilon }\left( X^n\right) \leqslant -\log t \end{aligned}$$
(17)

where

$$\begin{aligned} \sum _{x}\max \left( \mathbf {P}_{X^n}(x)-t, 0 \right) = \epsilon . \end{aligned}$$
(18)

From now we assume that

$$\begin{aligned} t = p^{p n+\gamma n}q^{q n-\gamma n}. \end{aligned}$$
(19)
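A short sketch of the resulting cap on the extractable entropy, Eqs. (17)-(19), for an assumed binary source (the values of p, n and \(\gamma \) are illustrative only):

```python
import math

p, n, gamma = 0.7, 10_000, 0.01       # illustrative binary source, p > q
q = 1.0 - p

H = -(p * math.log2(p) + q * math.log2(q))     # Shannon entropy per sample
log_t = (p + gamma) * n * math.log2(p) + (q - gamma) * n * math.log2(q)

# Simple algebra gives -log t = n*H(X) - gamma*n*log2(p/q): the cap on the
# extractable entropy falls short of the Shannon entropy by gamma*n*log2(p/q) bits.
print(f"Shannon entropy of X^n : {H * n:10.1f} bits")
print(f"-log t (cap of Eq. 17) : {-log_t:10.1f} bits")
print(f"gap gamma*n*log2(p/q)  : {gamma * n * math.log2(p / q):10.1f} bits")
```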

4.2 Determining the Threshold t

The next key observation is that the contribution of the threshold t is small and can be neglected; that is, up to a constant factor we can simply cut at the \((1-\epsilon )\)-quantile. This is stated in the lemma below.

Lemma 3

(Replacing the threshold by the quantile). Let \(x_0\in \{0,1\}^n\) be a point such that \(\Vert x_0\Vert = pn + \gamma n\). Then we have

$$\begin{aligned} \sum _{x:\ \Vert x\Vert \geqslant \Vert x_0\Vert } \max \left( \mathbf {P}_{X^n}(x) - \mathbf {P}_{X^{n}}(x_0), 0 \right) \geqslant \frac{1}{2}\sum _{x:\ \Vert x\Vert \geqslant \Vert x_0\Vert } \mathbf {P}_{X^n}(x) \end{aligned}$$
(20)

To prove the lemma, note that from Theorem 4 it follows that setting

$$\begin{aligned} \gamma ' = \gamma + n^{-1}\log \left( \frac{p}{q}\right) \end{aligned}$$
(21)

we obtain

$$\begin{aligned} \sum _{j\geqslant pn + \gamma ' n} \left( {\begin{array}{c}n\\ j\end{array}}\right) \geqslant \frac{3}{4}\cdot \sum _{j\geqslant pn + \gamma n} \left( {\begin{array}{c}n\\ j\end{array}}\right) \end{aligned}$$
(22)

when \(\gamma \) is sufficiently small compared to p and q (formally this is justified by calculating the derivative with respect to \(\gamma \) and noticing that it is bigger by at most a factor of \(1+\frac{\gamma }{\sqrt{npq}}\)). But we also have

$$\begin{aligned} p^{j}q^{n-j} \geqslant 2\cdot p^{(p+\gamma )n}q^{(q-\gamma )n}\quad \text {for } j\geqslant \gamma ' n \end{aligned}$$
(23)

Therefore,

$$\begin{aligned} \sum _{j\geqslant pn + \gamma n} \left( {\begin{array}{c}n\\ j\end{array}}\right) p^{j}q^{n-j}&\geqslant \sum _{ j\geqslant pn + \gamma ' n} \left( {\begin{array}{c}n\\ j\end{array}}\right) p^{j}q^{n-j} \nonumber \\&\geqslant 2 \cdot p^{(p+\gamma )n}q^{(q-\gamma )n} \cdot \sum _{ j\geqslant pn + \gamma ' n} \left( {\begin{array}{c}n\\ j\end{array}}\right) \nonumber \\&\geqslant 2\cdot \frac{3}{4} \cdot p^{(p+\gamma )n}q^{(q-\gamma )n} \cdot \sum _{ j\geqslant pn + \gamma n} \left( {\begin{array}{c}n\\ j\end{array}}\right) \end{aligned}$$
(24)

which finishes the proof.
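For small n the inequality of Lemma 3 can be checked by brute force (illustrative parameters, with \(p > q\)):

```python
from math import comb

n, p, gamma = 200, 0.7, 0.05          # illustrative binary source, p > q
q = 1.0 - p
k0 = round((p + gamma) * n)           # ||x_0|| = (p + gamma) * n
t = p ** k0 * q ** (n - k0)           # P_{X^n}(x_0)

# Left- and right-hand sides of Eq. (20), grouped by the Hamming weight j.
lhs = sum(comb(n, j) * max(p ** j * q ** (n - j) - t, 0.0) for j in range(k0, n + 1))
rhs = 0.5 * sum(comb(n, j) * p ** j * q ** (n - j) for j in range(k0, n + 1))

print(f"mass above the threshold: {lhs:.4e}")
print(f"half of the upper tail  : {rhs:.4e}")
print("Eq. (20) holds:", lhs >= rhs)
```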

4.3 Putting This All Together

Now, by combining Lemmas 2 and 3 and the estimate \(Q(x) \approx x^{-1}\exp (-x^2/2)\), valid for large x, we obtain

$$\begin{aligned} \epsilon \geqslant \exp \left( -n\mathrm {KL}\left( p+\gamma \parallel p\right) - \log \left( \frac{n\gamma ^2}{2pq} \right) +O_{p,q}(1)\right) \end{aligned}$$
(25)

which, because of the Taylor expansion \(\mathrm {KL}\left( p+\gamma \parallel p\right) = \frac{\gamma ^2}{2pq}+ O_{p,q}(\gamma ^3)\), gives us

$$\begin{aligned} \gamma \geqslant \varOmega \left( \sqrt{\frac{\log (1/\epsilon )}{pq n}} \right) \end{aligned}$$
(26)

Setting \(\gamma = c\cdot \sqrt{\frac{\log (1/\epsilon )}{pq n}} \), with sufficiently big c, we obtain the claimed result.