1 Introduction

The main objective of this paper is to expand the available techniques for proving information complexity lower bounds for communication problems. Let \(f:\mathcal{X}\times \mathcal{Y}\rightarrow \{0,1\}\) be a function, and \(\mu \) be a distribution on \(\mathcal{X}\times \mathcal{Y}\). Informally, the information complexity of f is the least amount of information that Alice and Bob need to exchange on average to compute f(x, y) using a randomized communication protocol if initially x is given to Alice, y is given to Bob, and \((x,y)\sim \mu \). Note that information here is measured in the Shannon sense, and the amount of information may be much smaller than the number of bits exchanged. Thus the randomized communication complexity of f is an upper bound on its information complexity, but may not be a lower bound.

Within communication complexity, information complexity was first introduced in the context of direct sum theorems for randomized communication complexity [2, 7, 9]. These techniques have also been used in the related direction of direct product theorems [12, 16, 19, 20]. A direct sum theorem in a computational model states that the amount of resources needed to perform k independent tasks is roughly k times the amount of resources c needed for computing a single task. A direct product theorem, which is a stronger statement, asserts that any attempt to solve k independent tasks using o(kc) resources must result in an exponentially small success probability \(2^{-\Omega (k)}\).

The direct sum line of work [2, 5, 11, 14] has eventually led to a tight connection (equality) between amortized communication complexity and information complexity. Thus proving lower bounds on the communication complexity of k copies of f for a growing k is equivalent to proving lower bounds on the information complexity of f. In particular, if f satisfies \(IC(f)=\Omega (CC(f))\), i.e. its information cost is asymptotically as large as its communication complexity, then a strong direct sum theorem holds for f. In addition to the intrinsic interest of understanding the amount of information exchange that needs to be involved in computing f, direct sum theorems motivate the development of techniques for proving lower bounds on the information complexity of functions.

Another important motivation for seeking lower bounds on the information complexity of functions stems from understanding the limits of security in two-party computation. In a celebrated result Ben-Or et al. [3] (see also [1]) showed how a multi-party computation (with three or more parties) may be carried out in a way that reveals no information to the participants except for the computation’s output. The protocol relies heavily on the use of random bits that are shared between some, but not all, parties. Such a resource can clearly not exist in the two-party setting. While it can be shown that perfect information security is unattainable by two-party protocols [6, 8], quantitatively it is not clear just how much information the parties must “leak” to each other to compute f. The quantitative answer depends on the model in which the leakage occurs, and whether quantum computation is allowed [15]. Information complexity answers this question in the strongest possible sense for classical protocols: the parties are allowed to use private randomness to help them “hide” their information, and the information revealed is measured on average. Thus an information complexity lower bound of I on a problem implies that the average (as opposed to worst-case) amount of information revealed to the parties is at least I.

As mentioned above, the information complexity is always upper bounded by the communication complexity of f. The converse is not known to be true. Moreover, lower bound techniques for communication complexity do not readily translate into lower bound techniques for information complexity. The key difference is that a low-information protocol is not limited in the amount of communication it uses, and thus rectangle-based communication bounds do not readily convert into information lower bounds. No general technique has been known to yield sharp information complexity lower bounds. A linear lower bound on the communication complexity of the disjointness function has been shown in [22]. An information-theoretic proof of this lower bound [7] can be adapted to prove a linear information lower bound on disjointness [5]. One general technique for obtaining (weak) information complexity lower bounds was introduced in [5], where it has been shown that any function that has I bits of information complexity, has communication complexity bounded by \(2^{O(I)}\). This immediately implies that the information complexity of a function f is at least the log of its communication complexity (\(IC(f)\ge \Omega (\log (CC(f)))\)). In fact, this result easily follows from the stronger result we prove below (Theorem 3.1).

1.1 Our Results

In this paper we prove that the discrepancy method—a general communication complexity lower bound technique—generalizes to information complexity. The discrepancy of f with respect to a distribution \(\mu \) on inputs, denoted \(Disc_\mu (f)\), measures how “unbalanced” f can get on any rectangle, where the balancedness is measured with respect to \(\mu \):

$$\begin{aligned} Disc_{\mu }(f)\mathop {=}\limits ^{def}\max _{\text {rectangles }R} \bigg | \Pr _\mu [f(x,y) = 0 \wedge (x,y)\in R] - \Pr _\mu [f(x,y) = 1 \wedge (x,y)\in R] \bigg |. \end{aligned}$$
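For intuition, on tiny domains the maximum over rectangles can be evaluated by brute force, which serves as a sanity check on the definition. The following sketch (in Python; all names are ours, and the exhaustive enumeration is only feasible for very small \(\mathcal{X}\), \(\mathcal{Y}\)) computes \(Disc_{\mu }(f)\) exactly:

```python
from itertools import product

def discrepancy(f, mu, X, Y):
    """Brute-force Disc_mu(f): maximize |Pr[f=0 on R] - Pr[f=1 on R]|
    over all rectangles R = S x T."""
    best = 0.0
    for smask in range(1 << len(X)):            # subsets S of X
        S = [x for i, x in enumerate(X) if smask >> i & 1]
        for tmask in range(1 << len(Y)):        # subsets T of Y
            T = [y for j, y in enumerate(Y) if tmask >> j & 1]
            bal = sum(mu[(x, y)] * (1 if f(x, y) == 0 else -1)
                      for x in S for y in T)
            best = max(best, abs(bal))
    return best

# one-bit XOR under the uniform distribution
X = Y = [0, 1]
mu = {(x, y): 0.25 for x, y in product(X, Y)}
d = discrepancy(lambda x, y: x ^ y, mu, X, Y)   # = 0.25
```

For one-bit XOR every two-cell rectangle is perfectly balanced, so the maximum 1/4 is attained by a single-cell rectangle; low discrepancy of this kind is exactly what makes a function hard for the bound below.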

A well-known lower bound (see e.g. [18]) asserts that the distributional communication complexity of f, denoted \(D^\mu _{1/2-\epsilon }(f)\), when required to predict f with advantage \(\epsilon \) over a random guess (with respect to \(\mu \)), is bounded from below by \(\Omega (\log (1/Disc_{\mu }(f)))\):

$$\begin{aligned} D^\mu _{1/2-\epsilon }(f) \ge \log (2\epsilon /Disc_{\mu }(f)). \end{aligned}$$

Note that the lower bound holds even if we are merely trying to get an advantage of \(\varepsilon =\sqrt{Disc_{\mu }(f)}\) over random guessing in computing f. We prove that the information complexity of computing f with probability 9/10 with respect to \(\mu \) is also bounded from below by \(\Omega (\log (1/Disc_{\mu }(f)))\).

Theorem 1.1

Let \(f{:} \mathcal{X}\times \mathcal{Y}\rightarrow \{0,1\}\) be a Boolean function and let \(\mu \) be any probability distribution on \(\mathcal{X}\times \mathcal{Y}\). Then

$$\begin{aligned} {\mathsf {IC}_{\mu }(f,1/10)} \ge \Omega (\log (1/Disc_{\mu }(f))). \end{aligned}$$

Remark 1.2

The choice of 9/10 is somewhat arbitrary. For randomized worst-case protocols, we may replace the success probability with \(1/2 + \delta \) for a constant \(\delta \), since repeating the protocol a constant number of times (\(O(1/\delta ^2)\) repetitions) and taking the majority vote would yield the aforementioned success rate, while the information cost of the repeated protocol differs only by a constant factor from the original one. In particular, using the prior-free information cost [5] this implies \( {\mathsf {IC}\left( f,1/2-\delta \right) } \ge \Omega (\delta ^2 \log (1/Disc_{\mu }(f))).\)
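The amplification step in Remark 1.2 is standard majority voting. A minimal numerical check (ours, not part of the formal development) computes the exact success probability of the majority of k independent runs:

```python
from math import comb

def majority_success(p, k):
    """Exact probability that the majority vote of k independent runs,
    each correct with probability p, is correct (k odd)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range((k + 1) // 2, k + 1))

# per-run advantage delta = 0.1: about 1/delta^2 repetitions already
# push the success probability well above 9/10
print(majority_success(0.5 + 0.1, 101))
```

With advantage 0.1 per run, 101 repetitions give success probability around 0.97, consistent with the \(O(1/\delta ^2)\) bound in the remark.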

In particular, Theorem 1.1 implies a linear lower bound on the information complexity of the inner product function \(IP(x,y)=\sum _{i=1}^n x_i y_i \mod 2\), and on a random Boolean function \(f_r{:}\{0,1\}^n \times \{0,1\}^n \rightarrow \{0,1\}\), expanding the (limited) list of functions for which nontrivial information-complexity lower bounds are known:

Corollary 1.3

The information complexity \({\mathsf {IC}_{uniform}(IP,1/10)}\) of IP(x, y) is \(\Omega (n)\). The information complexity \({\mathsf {IC}_{uniform}(f_r,1/10)}\) of a random function \(f_r\) is \(\Omega (n)\), except with probability \(2^{-\Omega (n)}\).

We study the communication and information complexity of the Greater-Than function (\(GT_n\)) on numbers of length n. This is a very well-studied problem [18, 21, 24]. Only very recently was the tight lower bound of \(\Omega ({\log n})\) in the public-coin probabilistic model given by Viola [25]. We show that the discrepancy of the \(GT_n\) function is \(O(1/\sqrt{n})\):

Lemma 1.4

There exists a distribution \(\mu _n\) on \(\mathcal {X}\times \mathcal {Y}\) such that the discrepancy of \(GT_n\) with respect to \(\mu _n\) satisfies \(\; Disc_{\mu _n}(GT_n)< 20/\sqrt{n}.\)

We defer the proof to the “Appendix 4”. Lemma 1.4 provides an alternative (arguably simpler) proof of Viola’s [25] lower bound on the communication complexity of \(GT_n\). By Theorem 1.1, Lemma 1.4 immediately implies a lower bound on the information complexity of \(GT_n\):

Corollary 1.5

\({\mathsf {IC}_{\mu _n}(GT_n,1/10)} = \Omega (\log {n})\)

This settles the information complexity of the GT function, since this problem can be solved by a randomized protocol with \(O(\log {n})\) communication (see [18]). This lower bound is particularly interesting since it demonstrates the first tight information-complexity lower bound for a natural function that is not linear.

The key technical idea in the proof of Theorem 1.1 is a new simulation procedure that allows us to convert any protocol that has information cost I into a (two-round) protocol that has communication complexity O(I) and succeeds with probability \(>1/2 + 2^{-O(I)}\), yielding a \(2^{-O(I)}\) advantage over random guessing. Combined with the discrepancy lower bound for communication complexity, this proves Theorem 1.1.

Our approach, namely, the observation that any nontrivial correlation with the function’s output requires \(\Omega (\log (1/Disc_\mu f))\) communication, was used before by Viola and Wigderson [26], who observed that the discrepancy of a function is essentially equivalent to the maximum correlation obtainable by a 2-bit protocol. We show here that a protocol with information cost I can be converted into one whose success probability is at least \(1/2 + 2^{-O(I)}\) (see also Remark 3.6 in Sect. 3).

1.2 Comparison and Connections to Prior Results

The most relevant prior work is an article by Lee et al. [20]. Improving on an earlier work of Shaltiel [23], Lee et al. show a direct product theorem for discrepancy, proving that the discrepancy of \(f^{\otimes k}\)—the k-wise XOR of a function f with itself—behaves as \(Disc(f)^{\Omega (k)}\). This implies in particular that the communication complexity of \(f^{\otimes k}\) scales at least as \(\Omega (k\cdot \log (1/Disc(f)))\). Using the fact that the limit as \(k\rightarrow \infty \) of the amortized communication complexity of f is equal to the information cost of f [4], the result of Lee et al. “almost” implies the bound of Theorem 1.1. Unfortunately, the amortized communication complexity in the sense of [4] is the amortized cost of k copies of f, where each copy is allowed to err with some probability (say 1/10). Generally speaking, this task is much easier than computing the XOR (which requires all copies to be evaluated correctly with high probability). Thus the lower bound that follows from Lee et al. applies to a more difficult problem, and does not imply the information complexity lower bound.

Another generic approach one may try to take is to use compression results such as [2] to derive information cost lower bounds from communication complexity lower bounds. The logic of such a proof would go as follows: “Suppose there was an information-complexity-I protocol \(\pi \) for f; then if one could compress it into a low-communication protocol, one would get a contradiction to the communication complexity lower bound for f”. Unfortunately, all known compression results compress \(\pi \) into a protocol \(\pi '\) whose communication complexity depends not only on I but also on \(CC(\pi )\). Even for the external information complexity (which is always greater than the internal information complexity), the bound obtained in [2] is of the form \(I_{ext}(\pi )\cdot \mathrm{polylog} (CC(\pi ))\). Thus compression results of this type cannot rule out protocols that have low information complexity but a very high (e.g. exponential) communication complexity.

Our result can be viewed as a weak compression result for protocols, where a protocol for computing f that conveys I bits of information is converted into a protocol that uses O(I) bits of communication and gives an advantage of \(2^{-O(I)}\) in computing f. This strengthens the result in [5], where a compression to \(2^{O(I)}\) bits of communication has been shown. It should be noted that compression to O(I) bits of communication and constant (say 2/3) success probability (as opposed to a small advantage over random guessing) has recently been proven impossible by the breakthrough work of [10], who showed that \(2^{\Omega (I)}\) bits is the best one can hope for.

In a very recent breakthrough work of Kerenidis et al. [17], our main protocol played an important role in the proof that almost all known lower bound techniques for communication complexity (and not just discrepancy) apply to information complexity. The results of [17] shed more light on the information complexity of many communication problems, and the question of whether interactive communication can be compressed.

2 Preliminaries

In an effort to make this paper as self-contained as possible, we provide some background on information theory and communication complexity, which is essential to proving our results. For further details and a more thorough treatment of these subjects see [4] and references therein.

Notation We reserve capital letters for random variables and distributions, calligraphic letters for sets, and small letters for elements of sets. Throughout this paper, we often use the notation |b to denote conditioning on the event \(B=b\). Thus A|b is shorthand for \(A|B=b\).

We use the standard notion of statistical/total variation distance between two distributions.

Definition 2.1

Let D and F be two random variables taking values in a set \(\mathcal {S}\). Their statistical distance is \(| D-F | \mathop {=}\limits ^{def}\max _{\mathcal {T} \subseteq \mathcal {S}}(|\Pr [D \in \mathcal {T}] - \Pr [F \in \mathcal {T}]|) = \frac{1}{2} \sum _{s \in \mathcal {S}}|\Pr [D=s]-\Pr [F=s]|\).
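The two expressions in Definition 2.1 always coincide, since the maximizing \(\mathcal {T}\) is the set of points where D outweighs F. A quick numerical check (example distributions ours):

```python
def tv_max_subset(D, F, S):
    # max over subsets T of |Pr[D in T] - Pr[F in T]| is attained by
    # T = {s : D(s) > F(s)}
    return sum(D[s] - F[s] for s in S if D[s] > F[s])

def tv_half_l1(D, F, S):
    # the half-L1 form of the same quantity
    return 0.5 * sum(abs(D[s] - F[s]) for s in S)

S = ['a', 'b', 'c']
D = {'a': 0.5, 'b': 0.3, 'c': 0.2}
F = {'a': 0.2, 'b': 0.3, 'c': 0.5}
# both formulas give 0.3
assert abs(tv_max_subset(D, F, S) - tv_half_l1(D, F, S)) < 1e-12
```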

2.1 Information Theory

Definition 2.2

(Entropy) The entropy of a random variable X is \(H(X) \mathop {=}\limits ^{def}\sum _x \Pr [X=x] \log (1/\Pr [X=x]).\) The conditional entropy H(X|Y) is defined as \({\mathbf{E}}_{y \in _{R}Y}\left[ H(X|Y=y) \right] \).

Definition 2.3

(Mutual Information) The mutual information between two random variables A and B, denoted I(A; B), is defined to be the quantity \(H(A) - H(A|B) = H(B) - H(B|A).\) The conditional mutual information I(A; B|C) is \(H(A|C) - H(A|BC)\).
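These quantities are straightforward to compute for explicit distributions. A small sketch (ours) uses the identity \(I(A;B)=H(A)+H(B)-H(A,B)\), which follows from Definition 2.3:

```python
from math import log2

def H(p):
    """Shannon entropy of a distribution given as a list of probabilities."""
    return -sum(q * log2(q) for q in p if q > 0)

def mutual_information(joint):
    """I(A;B) = H(A) + H(B) - H(A,B) for a joint distribution joint[a][b]."""
    pa = [sum(row) for row in joint]          # marginal of A
    pb = [sum(col) for col in zip(*joint)]    # marginal of B
    pab = [q for row in joint for q in row]   # the joint, flattened
    return H(pa) + H(pb) - H(pab)

# independent fair bits carry no information about each other:
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
# a bit fully determines its copy:
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```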

We also use the notion of divergence (also known as Kullback-Leibler distance or relative entropy), which is a different way to measure the distance between two distributions:

Definition 2.4

(Divergence) The informational divergence between two distributions is

$$\begin{aligned}\mathbf{D} \left( A ||B \right) \mathop {=}\limits ^{def}\sum _x A(x) \log (A(x)/B(x)).\end{aligned}$$

Proposition 2.5

Let A, B, C be random variables in the same probability space. For every a in the support of A and c in the support of C, let \(B_a\) denote \(B|A=a\) and \(B_{ac}\) denote \(B|A=a,C=c\). Then \(I(A;B | C) = {\mathbf E}_{a,c\in _R A,C} [\mathbf{D} \left( B_{ac} ||B_c \right) ]\).
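Proposition 2.5 expresses conditional mutual information as an expected divergence. Its unconditional analogue, \(I(A;B) = {\mathbf E}_{a\in _R A}[\mathbf{D} \left( B_a||B \right) ]\), can be checked numerically on any explicit joint distribution (the example below is ours):

```python
from math import log2

joint = [[0.4, 0.1], [0.2, 0.3]]        # p(a, b)
pa = [sum(row) for row in joint]        # marginal of A
pb = [sum(col) for col in zip(*joint)]  # marginal of B

# I(A;B) computed directly from the joint
mi = sum(p * log2(p / (pa[a] * pb[b]))
         for a, row in enumerate(joint) for b, p in enumerate(row) if p > 0)

# E_a [ D( B|A=a || B ) ]
exp_div = sum(pa[a] * sum((p / pa[a]) * log2((p / pa[a]) / pb[b])
                          for b, p in enumerate(row) if p > 0)
              for a, row in enumerate(joint))

assert abs(mi - exp_div) < 1e-12  # the two expressions agree
```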

2.2 Communication Complexity

We use the standard definitions of the computational model defined in [27]. For complete details see “Appendix 1”.

Given a communication protocol \(\pi \), \(\pi (x,y)\) denotes the concatenation of the public randomness with all the messages that are sent during the execution of \(\pi \). We call this the transcript of the protocol. When referring to the random variable denoting the transcript, rather than a specific transcript, we will use the notation \(\Pi (x,y)\)—or simply \(\Pi \) when x and y are clear from the context, thus \(\pi (x,y)\in _{R}\Pi (x,y)\). When the inputs are themselves random variables (X, Y), we will denote the transcript by \(\Pi (X,Y)\), or just \(\Pi \).

Definition 2.6

(Communication Complexity notation) For a function \(f: \mathcal {X} \times \mathcal {Y} \rightarrow \mathbb {Z}_K\), a distribution \(\mu \) supported on \(\mathcal {X} \times \mathcal {Y}\), and a parameter \(\epsilon > 0\), \(D^\mu _{\epsilon }(f)\) denotes the communication complexity of the cheapest deterministic protocol computing f on inputs sampled according to \(\mu \) with error \(\epsilon \).

Definition 2.7

(Combinatorial Rectangle) A rectangle in \(\mathcal {X} \times \mathcal {Y}\) is a subset \(R\subseteq \mathcal {X} \times \mathcal {Y}\) which satisfies \( (x_1,y_1) \in R \text { and } (x_2,y_2) \in R \Longrightarrow (x_1,y_2) \in R.\)

2.3 Information + Communication: The Information Cost of a Protocol

The following quantity, which is implicit in [7] and was explicitly defined in [2], is the central notion of this paper.

Definition 2.8

The (internal) information cost of a protocol \(\pi \) over inputs drawn from a distribution \(\mu \) on \(\mathcal{X}\times \mathcal{Y}\), is given by:

$$\begin{aligned} {\mathsf {IC}_{\mu }(\pi )}:=I(\Pi ;X|Y)+I(\Pi ;Y|X). \end{aligned}$$

Intuitively, Definition 2.8 captures how much the two parties learn about each other’s inputs from the execution transcript of the protocol \(\pi \). The first term captures what the second player learns about X from \(\Pi \)—the mutual information between the input X and the transcript \(\Pi \) given the input Y. Similarly, the second term captures what the first player learns about Y from \(\Pi \).

Note that the information of a protocol \(\pi \) depends on the prior distribution \(\mu \), as the mutual information between the transcript \(\Pi \) and the inputs depends on the prior distribution on the inputs. To give an extreme example, if \(\mu \) is a singleton distribution, i.e. one with \(\mu (\{(x,y)\}) = 1\) for some \((x,y)\in \mathcal{X}\times \mathcal{Y}\), then \({\mathsf {IC}_{\mu }(\pi )}=0\) for all possible \(\pi \), as no protocol can reveal anything to the players about the inputs that they do not already know a-priori. Similarly, \({\mathsf {IC}_{\mu }(\pi )}=0\) if \(\mathcal{X}=\mathcal{Y}\) and \(\mu \) is supported on the diagonal \(\{(x,x):x\in \mathcal{X}\}\). As expected, one can show that the communication cost \(\mathsf {CC}(\pi )\) of \(\pi \) is an upper bound on its information cost over any distribution \(\mu \):
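The prior dependence just described is easy to see concretely. For the one-round protocol in which Alice announces her bit X, the transcript is \(\Pi = X\), so \({\mathsf {IC}_{\mu }(\pi )} = I(X;X|Y) + I(X;Y|X) = H(X|Y)\). A minimal sketch (function names ours):

```python
from math import log2

def cond_entropy(joint):
    """H(X|Y) for a joint distribution joint[x][y]."""
    h = 0.0
    py = [sum(col) for col in zip(*joint)]   # marginal of Y
    for y, q in enumerate(py):
        for x in range(len(joint)):
            p = joint[x][y]
            if p > 0:
                h += p * log2(q / p)
    return h

# Transcript Pi = X, so IC_mu(pi) = H(X|Y) + 0 for this protocol.
uniform  = [[0.25, 0.25], [0.25, 0.25]]  # independent uniform bits
diagonal = [[0.5, 0.0], [0.0, 0.5]]      # mu supported on the diagonal
print(cond_entropy(uniform))    # 1.0: the protocol reveals a full bit
print(cond_entropy(diagonal))   # 0.0: Bob already knows X
```

The same protocol reveals one bit under the product prior and nothing under the diagonal prior, matching the discussion above.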

Lemma 2.9

[4] For any distribution \(\mu \), \({\mathsf {IC}_{\mu }(\pi )}\le \mathsf {CC}(\pi )\).

On the other hand, as noted in the introduction, the converse need not hold. In fact, by [4], getting a strict inequality in Lemma 2.9 is equivalent to violating the direct sum conjecture for randomized communication complexity.

As one might expect, the information cost of a function f with respect to \(\mu \) and error \(\rho \) is the least amount of information that needs to be revealed by a protocol computing f with error \(\le \rho \):

$$\begin{aligned} {\mathsf {IC}_{\mu }(f,\rho )}:= \inf _{\pi :~ {\mathbf P}_\mu [\pi (x,y)\ne f(x,y)]\le \rho } {\mathsf {IC}_{\mu }(\pi )}. \end{aligned}$$

The (prior-free) information cost was defined in [5] as the minimum amount of information that a worst-case error-\(\rho \) randomized protocol can reveal against all possible prior distributions.

$$\begin{aligned} {\mathsf {IC}\left( f,\rho \right) } := \inf _{\pi \text { is a protocol with } {\mathbf P}[\pi (x,y)\ne f(x,y)]\le \rho \text { for all } (x,y)} \max _\mu ~~ {\mathsf {IC}_{\mu }(\pi )}. \end{aligned}$$

This information cost matches the amortized randomized communication complexity of f [5]. It is clear that lower bounds on \({\mathsf {IC}_{\mu }(f,\rho )}\) for any distribution \(\mu \) also apply to \({\mathsf {IC}\left( f,\rho \right) } \).

3 Proof of Theorem 1.1

To establish the correctness of Theorem 1.1, we prove the following theorem, which is the central result of this paper:

Theorem 3.1

Suppose that \({\mathsf {IC}_{\mu }(f,1/10)} = I_\mu \). Then there exists a protocol \(\pi '\) such that

  • \(\mathsf {CC}(\pi ') = O(I_\mu )\).

  • \({\mathbf P}_{(x,y)\sim \mu }[\pi '(x,y) = f(x,y)] \ge 1/2 + 2^{-O(I_\mu )}\)

We first show how Theorem 1.1 follows from Theorem 3.1:

Proof of Theorem 1.1

Let \(f, \mu \) be as in Theorem 1.1, and let \({\mathsf {IC}_{\mu }(f,1/10)} = I_\mu \). By Theorem 3.1, there exists a protocol \(\pi '\) computing f with error probability \(1/2 - 2^{-O(I_\mu )}\) using \(O(I_\mu )\) bits of communication. Applying the discrepancy lower bound for communication complexity we obtain

$$\begin{aligned} O(I_\mu ) \ge D^\mu _{1/2-2^{-O(I_\mu )}}(f) \ge \log (2\cdot 2^{-O(I_\mu )}/Disc_{\mu }(f)) \end{aligned}$$
(1)

which after rearranging gives \(I_\mu \ge \Omega (\log (1/Disc_{\mu }(f)))\), as desired. \(\square \)

We turn to prove Theorem 3.1. The main step is the following sampling lemma.

Lemma 3.2

Let \(\mu \) be any distribution over a universe \(\mathcal{U}\) and let \(I\ge 0\) be a parameter that is known to both A and B. Further, let \(\nu _A\) and \(\nu _B\) be two distributions over \(\mathcal{U}\) such that \(\mathbf{D} \left( \mu ||\nu _A \right) \le I\) and \(\mathbf{D} \left( \mu ||\nu _B \right) \le I\). The players are each given a pair of real functions \((p_A,q_A)\), \((p_B,q_B)\), \(p_A,q_A,p_B,q_B:\mathcal{U}\rightarrow [0,1]\) such that for all \(x\in \mathcal{U}\), \(\mu (x) = p_A(x)\cdot p_B(x)\), \(\nu _A(x)=p_A(x)\cdot q_A(x)\), and \(\nu _B(x)=p_B(x)\cdot q_B(x)\). Then there is a (two-round) sampling protocol \(\Pi _1=\Pi _1(p_A, p_B, q_A, q_B,I)\) which has the following properties:

  1. At the end of the protocol, the players either declare that the protocol “fails”, or output \(x_A\in \mathcal{U}\) and \(x_B\in \mathcal{U}\), respectively (“success”).

  2. Let \({\mathcal {S}}\) be the event that the players output “success”. Then \({\mathcal {S}}\Rightarrow x_A = x_B\), and \(0.9 \cdot 2^{-50{(I+1)}} \le \Pr [{\mathcal {S}}] \le 2^{-50(I+1)}\).

  3. If \(\mu _1\) is the distribution of \(x_A\) conditioned on \({\mathcal {S}}\), then \(|\mu -\mu _1|<2/9\).

Furthermore, \(\Pi _1\) can be “compressed” to a protocol \(\Pi _2\) such that \(\mathsf {CC}(\Pi _2) = 211 I + 1\), whereas \(|\Pi _1 - \Pi _2| \le 2^{-59 I}\) (by an abuse of notation, here we identify \(\Pi _i\) with the random variable representing its output).

We will use the following technical fact about the divergence of distributions, which to the best of our knowledge first appeared in the substate theorem of [13].

Claim 3.3

([13], Claim 5.1 in [5]) Suppose that \(\mathbf{D} \left( \mu ||\nu \right) \le I\). Let \(\varepsilon \) be any parameter. Then \( \mu \left\{ x:~2^{(I+1)/\varepsilon }\cdot \nu (x) <\mu (x)\right\} < \varepsilon . \)

For completeness, we repeat the proof in “Appendix 2”.
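Claim 3.3 can be spot-checked numerically: compute \(I = \mathbf{D} \left( \mu ||\nu \right) \) and verify that the \(\mu \)-mass of points where \(\nu \) is undervalued by more than a \(2^{(I+1)/\varepsilon }\) factor stays below \(\varepsilon \) (example distributions ours):

```python
from math import log2

mu = {'x1': 0.5, 'x2': 0.5}
nu = {'x1': 0.9, 'x2': 0.1}

I = sum(p * log2(p / nu[x]) for x, p in mu.items())  # D(mu || nu)
eps = 0.25
thresh = 2 ** ((I + 1) / eps)

# mass of points where 2^{(I+1)/eps} * nu(x) < mu(x)
bad_mass = sum(p for x, p in mu.items() if thresh * nu[x] < p)
assert bad_mass < eps  # as Claim 3.3 guarantees
```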

Proof of Lemma 3.2

Throughout the execution of \(\Pi _1\), Alice and Bob interpret their shared random tape as a source of points \((x_i,\alpha _i,\beta _i)\) uniformly distributed in \(\mathcal{U}\times [0,2^{50(I+1)}] \times [0,2^{50(I+1)}]\). Alice and Bob consider the first \(T= |\mathcal{U}|\cdot 2^{100(I+1)}\cdot 60I\) such points. Their goal will be to discover the first index \(\tau \) such that \(\alpha _\tau \le p_A(x_\tau )\) and \(\beta _\tau \le p_B(x_\tau )\) (where they wish to find it using a minimal amount of communication, even if they are most likely to fail). First, we note that the probability that an index t satisfies \(\alpha _t\le p_A(x_t)\) and \(\beta _t \le p_B(x_t)\) is exactly \(1/|\mathcal{U}|2^{50(I+1)}2^{50(I+1)} = 1/|\mathcal{U}|2^{100(I+1)}\). Hence the probability that \(\tau >T\) (i.e. that \(x_\tau \) is not among the T points considered) is bounded by

$$\begin{aligned} \left( 1-1/|\mathcal{U}|2^{100(I+1)}\right) ^{T} < e^{-T/|\mathcal{U}|2^{100(I+1)}} = e^{-60I} < 2^{-60I} \end{aligned}$$
(2)

Denote by \(\mathcal{A}\) the following set of indices \(\mathcal{A}:= \{i\le T:~\alpha _i\le p_A(x_i) \text { and }\beta _i\le 2^{50(I+1)}\cdot q_A(x_i)\}\), the set of potential candidates for \(\tau \) from A’s viewpoint. Similarly, denote \(\mathcal{B}:= \{i\le T:~\alpha _i\le 2^{50(I+1)}\cdot q_B(x_i) \text { and }\beta _i\le p_B(x_i)\}\).

The protocol \(\Pi _1\) is very simple. Alice takes her bet on the first element \(a = x_i \in \mathcal{A}\) and sends it to Bob. Bob outputs a only if (it just so happens that) \(\beta _i \le p_B(a)\).

We turn to analyze \(\Pi _1\). Denote the set of “Good” elements by

$$\begin{aligned} {\mathcal {G}}\mathop {=}\limits ^{def}\{ x:~2^{50(I+1)}\cdot \nu _A(x) \ge \mu (x) \; \text { and } \; 2^{50(I+1)}\cdot \nu _B(x) \ge \mu (x)\}. \end{aligned}$$

Then by Claim 3.3, \(\mu ({\mathcal {G}}) \ge 48/50 = 24/25\). The following proposition asserts that if the protocol succeeds, the output of \(\Pi _1\) has the “correct” distribution on elements in \({\mathcal {G}}\).

Proposition 3.4

Assume \(\mathcal{A}\) is nonempty. Then for any \(x_i \in \mathcal{U}\), the probability that \(\Pi _1\) outputs \(x_i\) is at most \(\mu (x_i)\cdot 2^{-50(I+1)}\). If \(x_i \in {\mathcal {G}}\), then this probability is exactly \(\mu (x_i)\cdot 2^{-50(I+1)}\).

Proof

Note that if \(\mathcal{A}\) is nonempty, then for any \(x_i\in \mathcal{U}\), the probability that \(x_i\) is the first element in \(\mathcal{A}\) (i.e, \(a = x_i\)) is \(p_A(x_i)q_A(x_i) = \nu _A(x_i)\). By construction, the probability that \(\beta _i\le p_B(a)\) is \(\min \{p_B(x_i)/(2^{50{(I+1)}}q_A(x_i)), 1\}\), and thus

$$\begin{aligned} \Pr [\Pi _1 \text { outputs } x_i] \le p_A(x_i)q_A(x_i)\cdot \frac{p_B(x_i)}{2^{50(I+1)}q_A(x_i)} = \mu (x_i)\cdot 2^{-50(I+1)}. \end{aligned}$$

On the other hand, if \(x_i \in {\mathcal {G}}\), then we know that \( p_B(x_i)/q_A(x_i) = \mu (x_i)/\nu _A(x_i) \le 2^{50(I+1)}, \) and so the probability that \(\beta _i\le p_B(a)\) is exactly \(p_B(x_i)/(2^{50{(I+1)}}q_A(x_i))\). Since \(\Pr [\Pi _1 \text { outputs } x_i] = \Pr [a = x_i]\Pr [\beta _i\le p_B(a)]\) (assuming \(\mathcal{A}\) is nonempty), we conclude that:

$$\begin{aligned} x_i \in {\mathcal {G}}\Longrightarrow \Pr [\Pi _1 \text { outputs } x_i] = p_A(x_i)q_A(x_i)\cdot \frac{p_B(x_i)}{2^{50(I+1)}q_A(x_i)} = \mu (x_i)\cdot 2^{-50(I+1)}. \end{aligned}$$

\(\square \)
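The computation in Proposition 3.4 can be replayed exactly on a toy instance. In the sketch below (all numbers ours) the constant \(2^{50(I+1)}\) is replaced by a small C, subject only to \(C\cdot \nu _A(x)\ge \mu (x)\) so that every element is “good”; the conditional output distribution then equals \(\mu \) exactly:

```python
U = [0, 1, 2]
p_A = [1.0, 0.6, 0.4]; p_B = [0.5, 0.5, 0.5]   # mu(x) = p_A(x) * p_B(x)
q_A = [0.4, 0.5, 0.75]                          # nu_A(x) = p_A(x) * q_A(x)
mu   = [p_A[x] * p_B[x] for x in U]             # (0.5, 0.3, 0.2)
nu_A = [p_A[x] * q_A[x] for x in U]             # (0.4, 0.3, 0.3)
C = 2.0   # stands in for 2^{50(I+1)}; here C * nu_A(x) >= mu(x) for all x

# Alice's candidate a is distributed according to nu_A; given a = x,
# Bob accepts with probability min(p_B(x) / (C * q_A(x)), 1).
out = [nu_A[x] * min(p_B[x] / (C * q_A[x]), 1.0) for x in U]

# Proposition 3.4: Pr[protocol outputs x] = mu(x) / C on good elements
assert all(abs(out[x] - mu[x] / C) < 1e-12 for x in U)
success = sum(out)                    # Pr[S] = 1/C here
cond = [o / success for o in out]     # conditioned on success: exactly mu
```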

We are now ready to estimate the success probability of the protocol.

Proposition 3.5

Let \({\mathcal {S}}\) denote the event that \(\mathcal{A}\ne \emptyset \) and \(a \in \mathcal{B}\) (i.e., that the protocol succeeds). Then

$$\begin{aligned} 0.9 \cdot 2^{-50{(I+1)}} \le \Pr [{\mathcal {S}}] \le 2^{-50(I+1)}. \end{aligned}$$

Proof

Using Proposition 3.4, we have that

$$\begin{aligned} \Pr [{\mathcal {S}}] &\le {\mathbf P}[a\in \mathcal{B}\; |\; \mathcal{A}\ne \emptyset ] = \sum _{i\in \mathcal{U}}\Pr [a = x_i]\Pr [\beta _i\le p_B(a)] \\ &\le \sum _{i\in \mathcal{U}} \mu (x_i)\cdot 2^{-50(I+1)} = 2^{-50(I+1)} \end{aligned}$$
(3)

For the lower bound, we have

$$\begin{aligned} \Pr [{\mathcal {S}}] &\ge \Pr [\beta _i\le p_B(a) \; |\; \mathcal{A}\ne \emptyset ]\cdot \Pr [\mathcal{A}\ne \emptyset ] \\ &\ge (1-2^{-60I})\bigg (\sum _{i\in \mathcal{U}}\Pr [a = x_i]\Pr [\beta _i\le p_B(a)]\bigg ) \\ &\ge (1-2^{-60I})\bigg (\sum _{i\in {\mathcal {G}}}\Pr [a = x_i]\Pr [\beta _i\le p_B(a)]\bigg ) \\ &= (1-2^{-60I})\bigg ( 2^{-50(I+1)}\sum _{i\in {\mathcal {G}}} \mu (x_i) \bigg ) = (1-2^{-60I})\bigg ( 2^{-50(I+1)}\mu ({\mathcal {G}}) \bigg ) \\ &\ge \frac{24}{25}(1-2^{-60I})2^{-50{(I+1)}} \ge 0.9 \cdot 2^{-50{(I+1)}} \end{aligned}$$
(4)

where the equality follows again from Proposition 3.4. This proves the second claim of Lemma 3.2. \(\square \)

The following claim asserts that if \({\mathcal {S}}\) occurs, then the distribution of a is indeed close to \(\mu \).

Claim 4

Let \(\mu _1\) be the distribution of \(a|{\mathcal {S}}\). Then \(|\mu _1 - \mu | \le 2/9\).

Proof

The claim follows directly from Proposition 3.5. We defer the proof to the “Appendix 3”.

We turn to the “Furthermore” part of Lemma 3.2. The protocol \(\Pi _1\) satisfies the premises of the lemma, except it has a high communication cost. This is due to the fact that Alice explicitly sends a to Bob. To reduce the communication, Alice will instead send O(I) random hash values of a, and Bob will add corresponding consistency constraints to his set of candidates. The final protocol \(\Pi _2\) is given in Figure 1.

Fig. 1

The sampling protocol \(\Pi _2\) from Lemma 3.2

Let \(\mathcal{E}\) denote the event that in step 3 of the protocol, Bob finds an element \(x_i\ne a\) (that is, the event that the protocol outputs “success” but \(x_A \ne x_B\)). We upper bound the probability of \(\mathcal{E}\). Given \(a\in \mathcal{A}\) and \(x_i\in \mathcal{B}\) such that \(a\ne x_i\), the probability (over possible choices of the hash functions) that \(h_j(a)=h_j(x_i)\) for \(j=1,\ldots ,d\) is \(2^{-d} \le 2^{-211 I}\). For any t, \({\mathbf P}[t\in \mathcal{B}] \le \frac{1}{|\mathcal{U}|}\sum _{x_i\in \mathcal{U}}p_B(x_i)q_B(x_i)\cdot 2^{50(I+1)} = \frac{1}{|\mathcal{U}|}\sum _{x_i\in \mathcal{U}}\nu _B(x_i)\cdot 2^{50(I+1)} = 2^{50(I+1)}/|\mathcal{U}|\). Thus, by a union bound we have

$$\begin{aligned} {\mathbf P}[\mathcal{E}] &\le {\mathbf P}[\exists x_i\in \mathcal{B}\; s.t. \; x_i\ne a \; \wedge \; h_j(a)=h_j(x_i) \; \forall \; j=1,\ldots ,d] \\ &\le T\cdot 2^{50(I+1)}\cdot 2^{-d}/|\mathcal{U}| = 2^{150(I+1)}\cdot 60 I \cdot 2^{-211 I} \ll 2^{-60 I}. \end{aligned}$$
(5)

By a slight abuse of notation, let \(\Pi _2\) be the distribution of \(\Pi _2\)’s output. Similarly, denote by \(\Pi _1\) the distribution of the output of protocol \(\Pi _1\). Note that if \(\mathcal{E}\) does not occur, then the outcome of the execution of \(\Pi _2\) is identical to the outcome of \(\Pi _1\). Since \({\mathbf P}[\mathcal{E}] \le 2^{-60 I}\), we have

$$\begin{aligned} |\Pi _2 - \Pi _1| \le \Pr [\mathcal{E}] \cdot |[\Pi _2|\mathcal{E}]-[\Pi _1|\mathcal{E}]| \le 2\cdot 2^{-60I} \le 2^{-59 I} \end{aligned}$$

which finishes the proof of the lemma. \(\square \)
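The hashing step can be illustrated with a small simulation. The sketch below is ours: we use SHA-256 with per-index salts as a stand-in for the random hash functions \(h_1,\ldots ,h_d\), and a far smaller candidate set than T. It shows Bob filtering his candidates down to Alice's element using only d transmitted bits:

```python
import hashlib

def hash_bits(s, salts):
    """d hash bits of string s, one bit per salt (a stand-in for h_1..h_d)."""
    return [hashlib.sha256((salt + s).encode()).digest()[0] & 1 for salt in salts]

d = 40                                  # the paper uses d = 211*I bits
salts = ['h%d:' % j for j in range(d)]  # shared randomness selecting the hashes

a = 'transcript-417'                                     # Alice's element a
candidates = ['transcript-%d' % i for i in range(1000)]  # Bob's candidate set
alice_msg = hash_bits(a, salts)                          # the d bits Alice sends

matches = [c for c in candidates if hash_bits(c, salts) == alice_msg]
# a wrong candidate survives all d checks with probability about 2^{-40},
# so with overwhelming probability only a itself remains
assert matches == [a]
```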

Remark 3.6

The communication cost of the sampling protocol \(\Pi _2\) can be reduced from \(O(I_\mu )\) to O(1) (more precisely, to only two bits) in the following way: Instead of having Alice compute the hash values privately and send them to Bob in steps 2 and 3 of the protocol, the players can use their shared randomness to sample \(d=O(I_\mu )\) random hash values \(h_1(b_1),\ldots ,h_d(b_d)\) (where the \(b_i\)’s are random independent strings in \(\mathcal{U}\)), and Alice will only send Bob a single bit indicating whether those hash values match the hashing of her string a (i.e., \(h_i(b_i) = h_i(a)\) for all \(i\in [d]\)). In step 4 Bob will act as before, albeit comparing the hashes of his candidate b to the random hashes \(h_i(b_i)\), and output success (“1”) if the hashes match. Note that this modification incurs an additional loss of \(2^{-d}=2^{-O(I_\mu )}\) in the success probability of the protocol (as this is the probability that \(h_i(b_i) = h_i(a)\) for all \(i\in [d]\)), but since the success probability we are shooting for is already of the order \(2^{-O(I_\mu )}\), we can afford this loss. This modification was discovered in [17].

With Lemma 3.2 in hand, we are now ready to prove Theorem 3.1.

Proof of Theorem 3.1

Let \(\pi \) be a protocol that realizes the value \(I_\mu :={\mathsf {IC}_{\mu }(f,1/10)}\). In other words, \(\pi \) has an error rate of at most 1/10 and information cost of at most \(I_\mu \) with respect to \(\mu \). Denote by \(\pi _{xy}\) the random variable representing the transcript of \(\pi \) given the inputs (x, y), and by \(\pi _x\) (resp. \(\pi _y\)) the transcript conditioned on only the input x (resp. y). We denote by \(\pi _{XY}\) the transcript when (X, Y) are themselves a pair of random variables. By Proposition 2.5, we know that

$$\begin{aligned} I_\mu = I(X; \pi _{XY}|Y)+I(Y; \pi _{XY}|X) = {\mathbf E}_{(x,y)\sim \mu } [\mathbf{D} \left( \pi _{xy} ||\pi _x \right) +\mathbf{D} \left( \pi _{xy} ||\pi _y \right) ]. \end{aligned}$$
(6)
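As a sanity check, the identity (6) can be verified numerically on a toy example, computing the left-hand side via conditional entropies and the right-hand side via KL divergences. The distribution \(\mu\) and the conditional transcript distributions below are arbitrary illustrative choices, not taken from the paper.

```python
from math import log2

# Toy setting: X, Y in {0,1}, transcripts in {0,1,2}.
mu = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
pi = {(0, 0): [0.7, 0.2, 0.1], (0, 1): [0.3, 0.3, 0.4],
      (1, 0): [0.2, 0.5, 0.3], (1, 1): [0.1, 0.1, 0.8]}

def kl(p, q):
    """KL divergence D(p || q) in bits."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

def H(p):
    """Shannon entropy in bits."""
    return -sum(a * log2(a) for a in p if a > 0)

def pi_given_x(x):
    """pi_x: transcript distribution given only x, i.e. E_{y~mu(.|x)} pi_xy."""
    px = mu[(x, 0)] + mu[(x, 1)]
    return [sum(mu[(x, y)] / px * pi[(x, y)][t] for y in (0, 1)) for t in range(3)]

def pi_given_y(y):
    py = mu[(0, y)] + mu[(1, y)]
    return [sum(mu[(x, y)] / py * pi[(x, y)][t] for x in (0, 1)) for t in range(3)]

# LHS via entropies: I(X;T|Y) + I(Y;T|X) = H(T|X) + H(T|Y) - 2*H(T|XY).
H_T_XY = sum(mu[xy] * H(pi[xy]) for xy in mu)
H_T_X = sum((mu[(x, 0)] + mu[(x, 1)]) * H(pi_given_x(x)) for x in (0, 1))
H_T_Y = sum((mu[(0, y)] + mu[(1, y)]) * H(pi_given_y(y)) for y in (0, 1))
lhs = H_T_X + H_T_Y - 2 * H_T_XY

# RHS of (6): E_{(x,y)~mu}[ D(pi_xy || pi_x) + D(pi_xy || pi_y) ].
rhs = sum(mu[(x, y)] * (kl(pi[(x, y)], pi_given_x(x)) + kl(pi[(x, y)], pi_given_y(y)))
          for (x, y) in mu)
```

The two quantities agree exactly (up to floating-point error), since \(\pi_x\) and \(\pi_y\) are precisely the conditional marginals of the transcript.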

Let us now run the sampling algorithm \(\Pi _1\) from Lemma 3.2, with the distribution \(\mu \) of the lemma taken to be \(\pi _{xy}\), the distributions \(\nu _A\) and \(\nu _B\) taken to be \(\pi _x\) and \(\pi _y\) respectively, and I taken to be \(20 I_\mu \).

At each node v of the protocol tree that is owned by player X, let \(p_0(v)\) and \(p_1(v)=1-p_0(v)\) denote the probabilities that the next bit sent by X is 0 and 1, respectively. For nodes owned by player Y, let \(q_0(v)\) and \(q_1(v)=1-q_0(v)\) denote the probabilities that the next bit sent by Y is 0 and 1, respectively, as estimated by player X given the input x. For each leaf \(\ell \), let \(p_X(\ell )\) be the product of the values \(p_b(v)\) over the nodes owned by X along the path from the root to \(\ell \), and let \(q_X(\ell )\) be the product of the values \(q_b(v)\) over the nodes owned by Y along that path. The values \(p_Y(\ell )\) and \(q_Y(\ell )\) are defined similarly, with the roles of the players reversed. For each \(\ell \) we have \({\mathbf P}[\pi _{xy}=\ell ] = p_X(\ell )\cdot p_Y(\ell )\), \({\mathbf P}[\pi _{x} = \ell ] = p_X(\ell )\cdot q_X(\ell )\), and \({\mathbf P}[\pi _{y} = \ell ] = p_Y(\ell )\cdot q_Y(\ell )\). Thus we can apply Lemma 3.2 to obtain the following protocol \(\pi '\) for computing f:

  • If \(\Pi _1\) fails, we return a random unbiased coin flip.

  • If \(\Pi _1\) succeeds, we return the final bit of the transcript sample T. Denote this bit by \(T_{out}\).
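The leaf-probability factorizations used above to apply Lemma 3.2 can be checked on a toy two-round protocol tree, where X speaks first and Y second; all numbers below are illustrative assumptions.

```python
# Toy two-round protocol tree: X sends bit b1, then Y sends bit b2.
mu = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p = {0: [0.6, 0.4], 1: [0.3, 0.7]}           # p[x][b1]: X's move given x
q = {0: [[0.8, 0.2], [0.5, 0.5]],            # q[y][b1][b2]: Y's move given y, b1
     1: [[0.1, 0.9], [0.4, 0.6]]}

leaves = [(b1, b2) for b1 in (0, 1) for b2 in (0, 1)]

def pi_xy(x, y, leaf):
    b1, b2 = leaf
    return p[x][b1] * q[y][b1][b2]           # = p_X(leaf) * p_Y(leaf)

def pi_x(x, leaf):
    """Transcript distribution given only x: average pi_xy over y ~ mu(.|x)."""
    px = mu[(x, 0)] + mu[(x, 1)]
    return sum(mu[(x, y)] / px * pi_xy(x, y, leaf) for y in (0, 1))

def q_X(x, leaf):
    """X's estimate of Y's move; before the Y-node only X has spoken,
    so the posterior on y is simply mu(.|x)."""
    b1, b2 = leaf
    px = mu[(x, 0)] + mu[(x, 1)]
    return sum(mu[(x, y)] / px * q[y][b1][b2] for y in (0, 1))

# Check P[pi_x = leaf] = p_X(leaf) * q_X(leaf), and that pi_xy is a distribution.
checks = [abs(pi_x(x, l) - p[x][l[0]] * q_X(x, l)) for x in (0, 1) for l in leaves]
total = sum(pi_xy(0, 1, l) for l in leaves)
```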

To prove the correctness of the protocol, let \(\mathcal {Z}\) denote the event that both \(\mathbf{D} \left( \pi _{xy} ||\pi _x \right) \le 20 I_\mu \) and \(\mathbf{D} \left( \pi _{xy} ||\pi _y \right) \le 20 I_\mu \). By (6) and Markov's inequality, \({\mathbf P}[\mathcal {Z}] \ge 19/20\) (where the probability is taken with respect to \(\mu \)). Denote by \(\delta \) the probability that \(\Pi _1\) succeeds. By the assertions of Lemma 3.2, \(\delta \ge 0.9\cdot 2^{-50(I+1)}\). Furthermore, if \(\Pi _1\) succeeds, then \(|T-\pi _{xy}|\le 2/9\), which in particular implies that \({\mathbf P}[T_{out} = \pi _{out}] \ge 7/9\). Finally, \({\mathbf P}[ \pi _{out} = f(x,y)] \ge 9/10\), since \(\pi \) has error at most 1/10 with respect to \(\mu \). Now let \(\mathcal{W}\) denote the indicator variable whose value is 1 iff \(\pi '(x,y) = f(x,y)\). Putting the above together,

$$\begin{aligned} {\mathbf E}[\mathcal{W}\; | \; \mathcal {Z}] \ge (1-\delta )\cdot \frac{1}{2} + \delta \cdot \left( \frac{7}{9}-\frac{1}{10}\right) > \frac{1}{2} + \delta \cdot \frac{1}{6} > \frac{1}{2} + \frac{1}{8}\cdot 2^{-50(I+1)}.\quad \end{aligned}$$
(7)
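The rational arithmetic behind (7) can be checked exactly, as a quick sanity check:

```python
from fractions import Fraction as F

# If Pi_1 succeeds, T is 2/9-close to pi_xy and pi errs with probability at
# most 1/10, so the output is correct with probability at least
# 7/9 - 1/10 = 61/90.
good = F(7, 9) - F(1, 10)

# E[W|Z] = (1-delta)/2 + delta*good = 1/2 + delta*(good - 1/2),
# and good - 1/2 = 16/90, which exceeds 1/6 = 15/90.
advantage = good - F(1, 2)

# With delta >= (9/10)*2^{-50(I+1)}: delta/6 >= (3/20)*t > (1/8)*t,
# where t = 2^{-50(I+1)}.
delta_coeff = F(9, 10) * F(1, 6)
```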

On the other hand, note that by Lemma 3.2 the probability that \(\Pi _1\) succeeds is at most \(2^{-50(I+1)}\) (no matter how large \(\mathbf{D} \left( \pi _{xy} ||\pi _x \right) \) and \(\mathbf{D} \left( \pi _{xy} ||\pi _y \right) \) are!), and so \( {\mathbf E}[\mathcal{W}\; | \; \lnot \mathcal {Z}] \ge (1-2^{-50(I+1)})/2. \)

Hence we conclude that

$$\begin{aligned} {\mathbf E}[\mathcal{W}] &= {\mathbf E}[\mathcal{W}\; | \; \mathcal {Z}]\cdot {\mathbf P}[\mathcal {Z}] + {\mathbf E}[\mathcal{W}\; | \; \lnot \mathcal {Z}]\cdot {\mathbf P}[\lnot \mathcal {Z}] \\ &\ge \left( \frac{1}{2} + \frac{1}{8}\cdot 2^{-50(I+1)}\right) \cdot \frac{19}{20} + \left( 1-2^{-50(I+1)}\right) \cdot \frac{1}{2}\cdot \frac{1}{20} \\ &\ge \frac{1}{2} + \frac{1}{12}\cdot 2^{-50(I+1)} > \frac{1}{2} + \frac{1}{12}\cdot 2^{-1000(I_\mu +1)}. \end{aligned}$$
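A quick sanity check of the last display, writing \(t = 2^{-50(I+1)}\) and expanding symbolically with exact rationals:

```python
from fractions import Fraction as F

# Expand (1/2 + t/8)*(19/20) + (1 - t)*(1/2)*(1/20) in t = 2^{-50(I+1)}.
const = F(1, 2) * F(19, 20) + F(1, 2) * F(1, 20)   # t-free terms: 19/40 + 1/40
coeff = F(1, 8) * F(19, 20) - F(1, 2) * F(1, 20)   # coefficient of t

# The constant part is exactly 1/2 and the t-coefficient is 3/32 > 1/12.
# Finally, since I = 20*I_mu, 50*(I+1) = 1000*I_mu + 50 < 1000*(I_mu + 1),
# so t = 2^{-50(I+1)} > 2^{-1000(I_mu+1)}, giving the strict final inequality.
```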

Finally, Lemma 3.2 asserts that \(|\Pi _1 - \Pi _2| < 2^{-59 I}\). Thus if we replace \(\Pi _1\) by \(\Pi _2\) in the execution of protocol \(\pi '\), the success probability decreases by at most \(2^{-59 I}\ll \frac{1}{12}\cdot 2^{-50(I+1)}\). Furthermore, the amount of communication used by \(\pi '\) is now

$$\begin{aligned} 211 I = 4220 I_\mu = O(I_\mu ).\end{aligned}$$

Hence we conclude that with this modification, \(\pi '\) has the following properties:

  • \(\mathsf {CC}(\pi ') = 4220\cdot I_\mu \);

  • \({\mathbf P}_{(x,y)\sim \mu }[\pi '(x,y) = f(x,y)] \ge 1/2 + 2^{-1000(I_\mu +1)-4}\);

which completes the proof. \(\square \)

Remark 3.7

Using similar techniques, Braverman [5] previously showed that any function f whose information complexity is I has communication cost at most \(2^{O(I)}\), thus implying that \(IC(f)\ge \Omega (\log (CC(f)))\). We note that this result can be easily derived (up to constant factors) from Theorem 3.1. Indeed, running the “compressed” protocol \(2^{O(I)}\log (1/\epsilon )\) times independently and taking a majority vote guarantees an error of at most \(\epsilon \) (by a standard Chernoff bound), with communication \(O(I)\cdot 2^{O(I)}\log (1/\epsilon ) = 2^{O(I)}\log (1/\epsilon )\). Thus, our result is strictly stronger than the former one.
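The majority-vote amplification can be sketched numerically. The Hoeffding-based repetition count below is a standard derivation (majority of n trials with per-trial advantage \(\gamma\) errs with probability at most \(e^{-2n\gamma^2}\)); the concrete parameters are illustrative assumptions. With \(\gamma = 2^{-O(I)}\) this gives \(n = 2^{O(I)}\log(1/\epsilon)\) repetitions, as claimed.

```python
from math import comb, ceil, log

def majority_success(n, p):
    """Exact probability that the majority of n independent trials (each
    correct with probability p, n odd) is correct."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def repetitions(gamma, eps):
    """Hoeffding bound: the majority errs with probability at most
    exp(-2*n*gamma^2); solve exp(-2*n*gamma^2) <= eps for n, made odd."""
    n = ceil(log(1 / eps) / (2 * gamma ** 2))
    return n if n % 2 == 1 else n + 1
```

For example, a per-trial advantage of 0.1 and target error 0.05 require 151 repetitions, and the exact binomial computation confirms the amplified success probability exceeds 0.95.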