An early version of this work appeared as TR16-015 of ECCC. The original version reproduced some text from the author’s lecture notes on property testing [8], which were later used as a basis for his book [9]. The current revision is quite minimal, except for the correction of various typos and a significant elaboration of the Appendix.

1 Introduction

Inspired by Diakonikolas and Kane [5], we present, for every fixed distribution D over [n], a simple reduction of the problem of testing whether an unknown distribution over [n] equals D to the problem of testing whether an unknown distribution over [n] equals the uniform distribution over [n]. Specifically, we reduce \(\epsilon \)-testing of equality to D to \(\epsilon /3\)-testing of equality to the uniform distribution over [6n], denoted \(U_{6n}\).

Hence, the sample (resp., time) complexity of testing equality to D, with respect to the proximity parameter \(\epsilon \), is at most the sample (resp., time) complexity of testing equality to \(U_{6n}\) with respect to the proximity parameter \(\epsilon /3\). Since optimal bounds were known for both problems (cf., e.g., [1, 2, 4, 12, 14, 17]), our reduction yields no new bounds. Still, it provides a simple way of obtaining testers for equality to arbitrary fixed distributions from testers for the uniform distribution.

The Setting at a Glance. For any fixed distribution D over [n], we consider the problem of \(\epsilon \)-testing equality to D, where the tester is given samples drawn from an unknown distribution X and is required to distinguish the case that \(X\equiv D\) from the case that X is \(\epsilon \)-far from D, where the distance is the standard statistical distance. The sample complexity of this testing problem depends on D, and is viewed as a function of n and \(\epsilon \). We write \(D\subseteq [n]\) to denote that D ranges over [n].

Wishing to present reductions between such problems, we need to spell out what we mean by such a reduction. Confining ourselves to problems of testing equality to fixed distributions, we use a very stringent notion of a reduction, which we call reduction via a filter. Specifically, we say that \(\epsilon \)-testing equality to \(D\subseteq [n]\) reduces to \(\epsilon '\)-testing equality to \(D'\subseteq [n']\) if there exists a randomized process F (called a filter) that maps [n] to \([n']\) such that the distribution D is mapped to the distribution \(D'\) and any distribution that is \(\epsilon \)-far from D is mapped to a distribution that is \(\epsilon '\)-far from \(D'\), where we say that F maps the distribution X to the distribution Y if \(Y\equiv F(X)\).

Note that the foregoing is a very stringent notion of reduction between distribution testing problems: Under this notion, a tester T for \(\epsilon \)-testing equality to D is derived by invoking an \(\epsilon '\)-tester \(T'\) for equality to \(D'\) and providing \(T'\) with the sample \(F(i_1),...,F(i_s)\), where \(i_1,...,i_s\) is the sample provided to T. Still, our main result can be stated as follows.

Theorem 1.1

(completeness of testing equality to \(U_n\)): For every distribution D over [n] and every \(\epsilon >0\), it holds that \(\epsilon \)-testing equality to D reduces to \(\epsilon /3\)-testing equality to \(U_{6n}\), where \(U_m\) denotes the uniform distribution over [m]. Furthermore, the same reduction F can be used for all \(\epsilon >0\).

Hence, the sample complexity of \(\epsilon \)-testing equality to D is upper-bounded by the sample complexity of \(\epsilon /3\)-testing equality to \(U_{6n}\). We mention that in some cases, testing equality to D can be easier than testing equality to \(U_n\); such natural cases include grained distributions (see below). (A general study of the dependence on D of the complexity of testing equality to D was undertaken in [17].)

Our Reduction at a Glance. We decouple the reduction asserted in Theorem 1.1 into two steps. In the first step, we assume that the distribution D has a probability function q that ranges over multiples of 1/m, for some parameter \(m\in \mathbb N\); that is, \(m\cdot q(i)\) is a non-negative integer (for every i). We call such a distribution m-grained, and reduce testing equality to any fixed m-grained distribution to testing equality to the uniform distribution over [m]. This reduction maps i to an element selected uniformly at random in a set \(S_i\) of size \(m\cdot q(i)\) such that the \(S_i\)’s are disjoint. Clearly, this reduction maps the distribution q to the uniform distribution over m fixed elements, and it can be verified that this randomized mapping preserves distances between distributions.

Since every distribution over [n] is \(\epsilon /2\)-close to an \(O(n/\epsilon )\)-grained distribution, it stands to reason that the general case can be reduced to the grained case. This is indeed true, but the reduction is less obvious than the treatment of the grained case. Actually, we shall use a different “graining” procedure (than the one alluded to above), which yields a better result (i.e., a reduction to the case of O(n)-grained distributions rather than to the case of \(O(n/\epsilon )\)-grained distributions). Specifically, we present a reduction of \(\epsilon \)-testing equality to any distribution \(D\subseteq [n]\) to \(\epsilon /3\)-testing equality to some 6n-grained distribution \(D'\), where \(D'\) depends only on D. This reduction is described next.

Letting \(q:[n]\rightarrow [0,1]\) denote the probability function of D, the reduction maps \(i\in [n]\) to itself with probability \(\frac{{\lfloor 6n\,\cdot \, q(i)\rfloor }/6n}{q(i)}\), and otherwise maps i to \(n+1\). This description suffices when \(q(i)\ge 1/2n\) for every \(i\in [n]\), since in this case \(\frac{{\lfloor 6n\,\cdot \, q(i)\rfloor }/6n}{q(i)}\ge \frac{2}{3}\), and in order to guarantee this condition (i.e., \(q(i)\ge 1/2n\) for every \(i\in [n]\)) we use a preliminary reduction that maps \(i\in [n]\) to itself with probability 1/2 and maps it uniformly to [n] otherwise. This preliminary reduction cuts the distance between distributions by a factor of two, and it can be shown that the main randomized mapping preserves distances between distributions up to a constant factor (of 2/3).

History, Credits, and an Acknowledgement. The study of testing properties of distributions was initiated by Batu, Fortnow, Rubinfeld, Smith and White [2].Footnote 1 Testers of sample complexity \(\mathrm{poly}(1/\epsilon )\cdot {\sqrt{n}}\) for equality to \(U_n\) and for equality to an arbitrary distribution D over [n] were presented by Goldreich and Ron [12] and Batu et al. [1], respectively, where the presentation in [12] is only implicit.Footnote 2 The tight lower and upper bounds of \(\varTheta ({\sqrt{n}}/\epsilon ^2)\) on the sample complexity of both problems were presented in [4, 14, 17] (see also [5, 6]). For a general survey of the area, the interested reader is referred to Canonne [3].

As stated upfront, our reductions are inspired by Diakonikolas and Kane [5], who presented a unified approach for deriving optimal testers for various properties of distributions (and pairs of distributions) via reductions to testing the equality of two unknown distributions that have small \(\mathcal{L}_2\)-norm. We note that our reduction from testing equality to grained distributions to testing equality to the uniform distribution is implicit in [6].

Lastly, we wish to thank Ilias Diakonikolas for numerous email discussions, which were extremely helpful in many ways.

Organization. In Sect. 2 we recall the basic context and define the restricted notion of a reduction used in this work. The core of this work is presented in Sect. 3, where we prove Theorem 1.1. In Sect. 4 we briefly consider the problem of testing whether an unknown distribution is grained, leaving an open problem. The appendix addresses a side issue that arises in Sect. 4.

2 Preliminaries

We consider discrete probability distributions. Such distributions have a finite support, which we assume to be a subset of [n] for some \(n\in \mathbb N\), where the support of a distribution is the set of elements assigned positive probability mass. We represent such distributions either by random variables, like X, that are assigned values in [n] (indicated by writing \(X\in [n]\)), or by probability functions like \(p:[n]\rightarrow [0,1]\) that satisfy \(\sum _{i\in [n]}p(i)=1\). These two representations correspond via \(p(i)=\mathbf{Pr}[X\!=\!i]\). At times, we also refer to distributions as such, and denote them by D. (Distributions over other finite sets can be treated analogously, but in such a case we may provide the tester with a description of the set; indeed, n serves as a concise description of [n].)

Recall that the study of “distribution testing” refers to testing properties of distributions. That is, the object being tested is a distribution, and the property it is tested for is a property of distributions (equiv., a set of distributions). The tester itself is given samples from the distribution and is required to distinguish the case that the distribution has the property from the case that the distribution is far from having the property, where the distance between distributions is defined as the total variation distance between them (a.k.a. the statistical difference). That is, X and Y are said to be \(\epsilon \)-close if

$$\begin{aligned} \frac{1}{2}\cdot \sum _{i}\Bigl |\mathbf{Pr}[X\!=\!i]-\mathbf{Pr}[Y\!=\!i]\Bigr |\le \epsilon , \end{aligned}$$
(1)

and otherwise they are deemed \(\epsilon \)-far. With this definition in place, we are ready to recall the standard definition of testing distributions.
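As a concrete companion to Eq. (1), the following Python sketch computes the total variation distance between two distributions represented as dictionaries mapping elements to probabilities (the function name is ours, not the paper's):

```python
def tv_distance(p, q):
    """Total variation distance, as in Eq. (1).
    Distributions are dicts mapping elements to their probabilities;
    elements missing from a dict are assigned probability 0."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(i, 0.0) - q.get(i, 0.0)) for i in support)
```

For example, the distributions (0.7, 0.3) and (0.5, 0.5) over {1, 2} are at distance 0.2.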

Definition 2.1

(testing properties of distributions): Let \(\mathcal{D}=\{\mathcal{D}_n\}_{n\in \mathbb N}\) be a property of distributions and \(s:\mathbb N\times (0,1]\rightarrow \mathbb N\). A tester, denoted T, of sample complexity s for the property \(\mathcal D\) is a probabilistic machine that, on input parameters n and \(\epsilon \), and a sequence of \(s(n,\epsilon )\) samples drawn from an unknown distribution \(X\in [n]\), satisfies the following two conditions.

  1.

    The tester accepts distributions that belong to \(\mathcal D\): If X is in \(\mathcal{D}_n\), then

    $$\mathbf{Pr}_{i_1,...,i_s\sim X}[T(n,\epsilon ;i_1,...,i_s)\!=\!1]\ge 2/3,$$

    where \(s=s(n,\epsilon )\) and \(i_1,...,i_s\) are drawn independently from the distribution X.

  2.

    The tester rejects distributions that are far from \(\mathcal D\): If X is \(\epsilon \)-far from any distribution in \(\mathcal{D}_n\) (i.e., X is \(\epsilon \)-far from \(\mathcal D\)), then

    $$\mathbf{Pr}_{i_1,...,i_s\sim X}[T(n,\epsilon ;i_1,...,i_s)\!=\!0]\ge 2/3,$$

    where \(s=s(n,\epsilon )\) and \(i_1,...,i_s\) are as in the previous item.

Our focus is on “singleton” properties; that is, the property is \(\{D_n\}_{n\in \mathbb N}\), where \(D_n\) is a fixed distribution over [n]. Note that n fully specifies the distribution \(D_n\), and we do not consider the complexity of obtaining an explicit description of \(D_n\) from n. For the sake of simplicity, we will consider a generic n and omit it from the notation (i.e., use D rather than \(D_n\)). Furthermore, we refer to \(\epsilon \)-testers derived by setting the proximity parameter to \(\epsilon \). Nevertheless, all testers discussed here are actually uniform with respect to the proximity parameter \(\epsilon \) (and also with respect to n, assuming that they have already derived or obtained an explicit description of \(D_n\)).

Confining ourselves to problems of testing equality to distributions, we formally restate the notion of a reduction used in the introduction. In fact, we explicitly refer to the randomized mapping at the heart of the reduction, and also define a stronger (i.e., uniform over \(\epsilon \)) notion of a reduction that captures the furthermore part of Theorem 1.1.

Definition 2.2

(reductions via filters): We say that a randomized process F, called a filter, reduces \(\epsilon \)-testing equality to \(D\subseteq [n]\) to \(\epsilon '\)-testing equality to \(D'\subseteq [n']\) if the following two conditions hold:

  1.

    The filter F maps the distribution D to the distribution \(D'\); that is, \(p'(i)=\sum _{j}p(j)\cdot \mathbf{Pr}[F(j)\!=\!i]\), where p and \(p'\) denote the probability functions of D and \(D'\), respectively.

  2.

    The filter F maps any distribution that is \(\epsilon \)-far from D to a distribution that is \(\epsilon '\)-far from \(D'\); that is, if q is \(\epsilon \)-far from D, then \(q'(i)\,{\mathop {=}\limits ^\mathrm{def}}\,\sum _{j}q(j)\cdot \mathbf{Pr}[F(j)\!=\!i]\) is \(\epsilon '\)-far from \(D'\).

We say that F reduces testing equality to \(D\subseteq [n]\) to testing equality to \(D'\subseteq [n']\) if, for some constant c and every \(\epsilon >0\), it holds that F reduces \(\epsilon \)-testing equality to D to \(\epsilon /c\)-testing equality to \(D'\).

Recall that we say that F maps the distribution X to the distribution Y if Y and F(X) are identically distributed (i.e., \(Y\equiv F(X)\)), where we view the distributions as random variables. We stress that if F is invoked t times on the same i, then the t outcomes are (identically and) independently distributed. Hence, a sequence of samples drawn independently from a distribution X is mapped to a sequence of samples drawn independently from the distribution F(X).
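The pushforward of a probability function by a filter, i.e., \(q'(i)=\sum _{j}q(j)\cdot \mathbf{Pr}[F(j)\!=\!i]\) from Definition 2.2, can be computed explicitly when the filter is given as a stochastic kernel. The following Python sketch (with hypothetical names; the paper specifies filters as sampling procedures, not kernels) illustrates this computation using exact rationals:

```python
from fractions import Fraction

def pushforward(p, F):
    """Probability function of F(X) when X ~ p, as in Definition 2.2:
    q'(i) = sum_j p(j) * Pr[F(j) = i].  Here the filter F is represented
    explicitly as a dict j -> {i: Pr[F(j) = i]} (a stochastic kernel)."""
    out = {}
    for j, pj in p.items():
        for i, prob in F[j].items():
            out[i] = out.get(i, Fraction(0)) + pj * prob
    return out
```

For instance, if F(1) = 1 deterministically while F(2) is uniform over {1, 2}, then the uniform distribution on {1, 2} is mapped to (3/4, 1/4).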

Note (added in revision): As stated in the introduction, Definition 2.2 captures a natural but stringent notion of a reduction. First, note that this notion extends to reducing testing any set of distributions \(\mathcal D\) to testing the set \(\mathcal{D}'\) (by requiring that F maps any distribution in \(\mathcal D\) to some distribution in \(\mathcal{D}'\) while mapping any distribution that is \(\epsilon \)-far from \(\mathcal D\) to a distribution that is \(\epsilon '\)-far from \(\mathcal{D}'\)). However, more general definitions may allow the tester of \(\mathcal D\) to use the sample provided to it in arbitrary ways and invoke the tester of \(\mathcal{D}'\) on an arbitrary sample as long as it distinguishes distributions in \(\mathcal D\) from distributions that are \(\epsilon \)-far from \(\mathcal D\). While such general definitions are analogous to Cook-reductions, Definition 2.2 seems analogous to a very restricted (i.e., “local”) notion of a Karp-reduction.

3 The Reduction

Recall that testing equality to a fixed distribution D means testing the property \(\{D\}\); that is, testing whether an unknown distribution equals the fixed distribution D. For any distribution D over [n], we present a reduction of the task of \(\epsilon \)-testing \(\{D\}\) to the task of \(\epsilon /3\)-testing the uniform distribution over [6n].

3.1 Overview

We decouple the reduction into two steps. In the first step, we assume that the distribution D has a probability function q that ranges over multiples of 1/m, for some parameter \(m\in \mathbb N\); that is, \(m\,\cdot \, q(i)\) is a non-negative integer (for every i). We call such a distribution m-grained, and reduce testing equality to any fixed m-grained distribution to testing uniformity (over [m]). Next, in the second step, we reduce testing equality to any distribution over [n] to testing equality to some 6n-grained distribution.

Definition 3.1

(grained distributions): A probability distribution over [n] having a probability function \(q:[n]\rightarrow [0,1]\) is m-grained if q ranges over multiples of 1/m; that is, if for every \(i\in [n]\) there exists a non-negative integer \(m_i\) such that \(q(i)=m_i/m\).

Note that the uniform distribution over [n] is n-grained, and it is the only n-grained distribution having support [n]. Furthermore, if a distribution D results from applying some deterministic process to the uniform distribution over [m], then D is m-grained. On the other hand, any m-grained distribution must have support size at most m.
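Checking m-grainedness is straightforward when the probability function is given exactly; the following Python sketch (the function name is ours) uses exact rationals to avoid floating-point issues:

```python
from fractions import Fraction

def is_grained(q, m):
    """Is the distribution q (a dict i -> exact Fraction summing to 1)
    m-grained, i.e., is every q(i) a non-negative multiple of 1/m?"""
    return all((Fraction(v) * m).denominator == 1 for v in q.values())
```

For example, the uniform distribution over [n] is n-grained, and the distribution (1/2, 1/4, 1/4) is 4-grained but not 3-grained.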

3.2 Testing Equality to a Fixed Grained Distribution

Fixing any m-grained distribution (represented by a probability function) \(q:[n]\rightarrow \{j/m:j\in \mathbb N\cup \{0\}\}\), we consider a randomized transformation (or “filter”), denoted \(F_q\), that maps the support of q to \(S=\{{\langle {i,j}\rangle }:i\!\in \![n]\wedge j\!\in \![m_i]\}\), where \(m_i=m\cdot q(i)\). (We stress that, as with any randomized process, invoking the filter several times on the same input yields independently and identically distributed outcomes.) Specifically, for every i in the support of q, we map i uniformly to \(S_i=\{{\langle {i,j}\rangle }:j\!\in \![m_i]\}\); that is, \(F_q(i)\) is uniformly distributed over \(S_i\). If i is outside the support of q (i.e., \(q(i)=0\)), then we map it to \({\langle {i,0}\rangle }\). Note that \(|S|=\sum _{i\in [n]}m_i=\sum _{i\in [n]}m\cdot q(i)=m\). The key observations about this filter are:

  1.

    The filter \(F_q\) maps q to a uniform distribution: If Y is distributed according to q, then \(F_q(Y)\) is distributed uniformly over S; that is, for every \({\langle {i,j}\rangle }\in S\), it holds that

    $$\begin{aligned} \mathbf{Pr}[F_q(Y)={\langle {i,j}\rangle }]&= \mathbf{Pr}[Y=i] \cdot \mathbf{Pr}[F_q(i)={\langle {i,j}\rangle }] \\&= q(i) \cdot \frac{1}{m_i} \\&= \frac{m_i}{m} \cdot \frac{1}{m_i} \end{aligned}$$

    which equals \(1/m=1/|S|\).

  2.

    The filter preserves the variation distance between distributions: The total variation distance between \(F_q(X)\) and \(F_q(X')\) equals the total variation distance between X and \(X'\). This holds since, for \(S'=S\cup \{{\langle {i,0}\rangle }:i\in [n]\}\), we have

    $$\begin{aligned}&\sum _{{\langle {i,j}\rangle }\in S'} \Bigl |\mathbf{Pr}[F_q(X)={\langle {i,j}\rangle }]-\mathbf{Pr}[F_q(X')={\langle {i,j}\rangle }] \Bigr | \\&\;\;= \sum _{{\langle {i,j}\rangle }\in S'} \Bigl |\mathbf{Pr}[X=i] \cdot \mathbf{Pr}[F_q(i)={\langle {i,j}\rangle }] -\mathbf{Pr}[X'=i] \cdot \mathbf{Pr}[F_q(i)={\langle {i,j}\rangle }] \Bigr | \\&\;\;= \sum _{{\langle {i,j}\rangle }\in S'} \mathbf{Pr}[F_q(i)={\langle {i,j}\rangle }]\cdot \Bigl |\mathbf{Pr}[X=i]-\mathbf{Pr}[X'=i] \Bigr | \\&\;\;= \sum _{i\in [n]} \Bigl |\mathbf{Pr}[X=i]-\mathbf{Pr}[X'=i] \Bigr |. \end{aligned}$$

    Indeed, this is a generic statement that applies to any filter that maps i to a random variable \(Z_i\), which only depends on i, such that the supports of the different \(Z_i\)’s are disjoint.

Observing that knowledge of q allows one to implement \(F_q\) as well as to map S to [m], we obtain the following reduction.

Algorithm 3.2

(reducing testing equality to m-grained distributions to testing uniformity over [m]): Let D be an m-grained distribution with probability function \(q:[n]\rightarrow \{j/m:j\in \mathbb N\cup \{0\}\}\). On input \((n,\epsilon ;i_1,...,i_s)\), where \(i_1,...,i_s\in [n]\) are samples drawn according to an unknown distribution \(p:[n]\rightarrow [0,1]\), invoke an \(\epsilon \)-tester for uniformity over [m] by providing it with the input \((m,\epsilon ;i'_1,...,i'_s)\) such that for every \(k\in [s]\) the sample \(i'_k\) is generated as follows:

  1.

    Generate \({\langle {i_k,j_k}\rangle }\leftarrow F_q(i_k)\).

    Recall that if \(m_{i_k}\,{\mathop {=}\limits ^\mathrm{def}}\, m\cdot q(i_k)>0\), then \(j_k\) is selected uniformly in \([m_{i_k}]\), and otherwise \(j_k\leftarrow 0\). We stress that if \(F_q\) is invoked t times on the same i, then the t outcomes are (identically and) independently distributed. Hence, the s samples drawn independently from p are mapped to s samples drawn independently from \(p'\) such that \(p'({\langle {i,j}\rangle })=p(i)/m_i\) if \(j\in [m_i]\) and \(p'({\langle {i,0}\rangle })=p(i)\) if \(m_i=0\).

  2.

    If \(j_k\in [m_{i_k}]\), then \({\langle {i_k,j_k}\rangle }\in S\) is mapped to its rank in S (according to a fixed order of S), where \(S=\{{\langle {i,j}\rangle }:i\!\in \![n]\wedge j\!\in \![m_i]\}\), and otherwise \({\langle {i_k,j_k}\rangle }\not \in S\) is mapped to \(m+1\).

(Alternatively, the reduction may just reject if any of the \(j_k\)’s equals 0.)Footnote 3
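The steps of Algorithm 3.2 (applying \(F_q\) and then mapping S to [m] by ranks) can be sketched as follows. This is an illustration under the assumption that q is given as exact rationals; the names (e.g., make_Fq, sizes) are ours, and we fix the order of S by sorting the i's:

```python
import random
from fractions import Fraction

def make_Fq(q, m):
    """Sketch of the filter F_q of Algorithm 3.2, composed with the
    rank map from S to [m].  q maps elements of [n] to exact Fractions
    that are multiples of 1/m.  Each i in the support is mapped uniformly
    into its block S_i, whose elements are identified with their ranks in
    S = [m]; an i with q(i) = 0 is mapped to m+1 (standing for <i, 0>)."""
    sizes = {i: int(Fraction(v) * m) for i, v in q.items()}  # m_i = m*q(i)
    start, offset = {}, 0
    for i in sorted(sizes):          # fix an order of S by sorting the i's
        start[i] = offset
        offset += sizes[i]
    assert offset == m               # sum_i m_i = m for an m-grained q

    def Fq(i):
        if sizes.get(i, 0) > 0:
            return start[i] + random.randrange(sizes[i]) + 1  # rank in [m]
        return m + 1                 # i outside the support of q
    return Fq
```

For the 4-grained distribution (1/2, 1/4, 1/4), element 1 is mapped uniformly to {1, 2}, while elements 2 and 3 are mapped to 3 and 4, respectively.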

The foregoing description presumes that the tester for uniform distributions over [m] also operates well when given arbitrary distributions (which may have a support that is not a subset of [m]).Footnote 4 However, any tester for uniformity can be easily extended to do so (see discussion in Sect. 3.4). In any case, we get

Proposition 3.3

(Algorithm 3.2 as a reduction): The filter \(F_q\) used in Algorithm 3.2 reduces \(\epsilon \)-testing equality to an m-grained distribution D (over [n]) to \(\epsilon \)-testing equality to the uniform distribution over [m], where the distributions tested in the latter case are over \([m+1]\). Furthermore, if the support of D equals [n] (i.e., \(q(i)>0\) for every \(i\in [n]\)), which may happen only if \(m\ge n\), then the reduction is to testing whether a distribution over [m] is uniform on [m].

Using any of the known uniformity tests that have sample complexity \(O({\sqrt{n}}/\epsilon ^2)\),Footnote 5 we obtain—

Corollary 3.4

(testing equality to m-grained distributions): For any fixed m-grained distribution D, the property \(\{D\}\) can be \(\epsilon \)-tested in sample complexity \(O({\sqrt{m}}/\epsilon ^2)\).

The foregoing tester for equality to grained distributions seems to be of independent interest that extends beyond its use towards testing equality to arbitrary distributions.

3.3 From Arbitrary Distributions to Grained Ones

We now turn to the problem of testing equality to an arbitrary known distribution, represented by \(q:[n]\rightarrow [0,1]\). The basic idea is to round all probabilities to multiples of \(\gamma /n\), for an error parameter \(\gamma \) (which will be a small constant). Of course, this rounding should be performed so that the sum of probabilities equals 1. For example, we may use a randomized filter that, on input i, outputs i with probability \(\frac{m_i\,\cdot \,\gamma /n}{q(i)}\), where \(m_i={\lfloor q(i)\,\cdot \, n/\gamma \rfloor }\), and outputs \(n+1\) otherwise. Hence, if i is distributed according to p, then the output of this filter will be i with probability \(\frac{\gamma m_i/n}{q(i)}\cdot p(i)\). This works well if \(\gamma m_i/n\approx q(i)\), which is the case if \(q(i)\gg \gamma /n\) (equiv., \(m_i\gg 1\)), but may run into trouble otherwise.

For starters, we note that if \(q(i)=0\), then \(\frac{\gamma m_i/n}{q(i)}\) is undefined, and replacing it by either 0 or 1 will not do. More generally, suppose that \(q(i)\in (0,\gamma /n)\) (e.g., \(q(i)=0.4\gamma /n\)). In this case, setting \(m_i=0\) means that the filter is oblivious of the probability assigned to this i, and does not distinguish distributions that agree on \(\{i:q(i)\ge \gamma /n\}\) but greatly differ on \(\{i:q(i)<\gamma /n\}\), which means that it does not distinguish the distribution associated with q from some distributions that are \(0.1\gamma \)-far from it.Footnote 6 Hence, we modify the basic idea so as to avoid this problem.

Specifically, we first use a filter that averages the input distribution p with the uniform distribution, and so guarantees that all elements occur with probability at least 1/2n, while preserving distances between different input distributions (up to a factor of two). Only then do we apply the foregoing proposed filter (which outputs i with probability \(\frac{m_i\,\cdot \,\gamma /n}{q(i)}\), where \(m_i={\lfloor q(i)\cdot n/\gamma \rfloor }\), and outputs \(n+1\) otherwise). Details follow.

  1.

    We first use a filter \(F'\) that, on input \(i\in [n]\), outputs i with probability 1/2, and outputs an element that is uniformly distributed in [n] otherwise. Hence, if i is distributed according to the distribution p, then \(F'(i)\) is distributed according to \(p'=F'{({p})}\) such that

    $$\begin{aligned} p'(i)=\frac{1}{2}\cdot p(i)+\frac{1}{2}\cdot \frac{1}{n}. \end{aligned}$$
    (2)

    (Indeed, we denote by \(F'{({p})}\) the probability function of the distribution obtained by selecting i according to the probability function p and outputting \(F'(i)\).)

    Let \(q'=F'{({q})}\); that is, \(q'(i)=0.5\cdot q(i)+(1/2n)\ge 1/2n\).

  2.

    Next, we apply a filter \(F''_{q'}\), which is related to the filter \(F_q\) used in Algorithm 3.2. Letting \(m_i={\lfloor q'(i)\cdot n/\gamma \rfloor }\), on input \(i\in [n]\), the filter outputs i with probability \(\frac{m_i\,\cdot \,\gamma /n}{q'(i)}\), and outputs \(n+1\) otherwise (i.e., with probability \(1-\frac{m_i\gamma /n}{q'(i)}\)).

    Note that \(\frac{m_i\gamma /n}{q'(i)}\le 1\), since \(m_i\le q'(i)\cdot n/\gamma \). On the other hand, recalling that \(q'(i)\ge 1/2n\) and observing that \(m_i\cdot \gamma /n > ((q'(i)\cdot n/\gamma )-1)\cdot \gamma /n = q'(i)-(\gamma /n)\), it follows that \(\frac{m_i\gamma /n}{q'(i)} > \frac{q'(i)\,-\,(\gamma /n)}{q'(i)} \ge 1-2\gamma \), since \(q'(i)\ge 1/2n\).

    Now, if i is distributed according to the distribution \(p'\), then \(F''_{q'}(i)\) is distributed according to \(p'':[n+1]\rightarrow [0,1]\) such that, for every \(i\in [n]\), it holds that

    $$\begin{aligned} p''(i)=p'(i)\cdot \frac{m_i\cdot \gamma /n}{q'(i)} \end{aligned}$$
    (3)

    and \(p''(n+1)=1-\sum _{i\in [n]}p''(i)\).

    Let \(q''\) denote the probability function related (by \(F''_{q'}\)) to \(q'\). Then, for every \(i\in [n]\), it holds that \(q''(i)=q'(i)\cdot \frac{m_i\gamma /n}{q'(i)}=m_i\,\cdot \,\gamma /n \in \{j\cdot \gamma /n:j\in \mathbb N\cup \{0\}\}\) and \(q''(n+1)=1-\sum _{i\in [n]}m_i\cdot \gamma /n<\gamma \), since \(m\,{\mathop {=}\limits ^\mathrm{def}}\,\sum _{i\in [n]}m_i > \sum _{i\in [n]}((n/\gamma )\cdot q'(i)-1) = (n/\gamma )-n\).

    Note that if \(n/\gamma \) is an integer, then \(q''\) is \(n/\gamma \)-grained. Furthermore, if \(m=n/\gamma \), which happens if and only if \(q'(i)=m_i\cdot \gamma /n\) for every \(i\in [n]\) (i.e., \(q'\) is itself \(n/\gamma \)-grained), then \(q''\) has support [n], and otherwise it has support \([n+1]\).
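The two filters above can be sketched in Python as follows, assuming q is given as exact rationals over \([n]=\{1,\ldots ,n\}\). The function and variable names are ours; the returned \(q''\) is the grained target distribution:

```python
import math
import random
from fractions import Fraction

def make_filters(q, gamma=Fraction(1, 6)):
    """Sketch of the two filters of this subsection.  q maps [n] to exact
    Fractions; gamma is the graining constant (1/6 in the sequel).
    F1 realizes Eq. (2): average the input with the uniform distribution.
    F2 realizes Eq. (3): keep i with probability (m_i * gamma/n) / q'(i),
    and otherwise output n+1.  Also returned is q'' = F2(q')."""
    n = len(q)
    qprime = {i: Fraction(q[i]) / 2 + Fraction(1, 2 * n) for i in q}
    sizes = {i: math.floor(qprime[i] * n / gamma) for i in q}   # m_i

    def F1(i):
        return i if random.random() < 0.5 else random.randint(1, n)

    def F2(i):
        keep = (sizes[i] * gamma / n) / qprime[i]   # lies in (1-2*gamma, 1]
        return i if random.random() < keep else n + 1

    qpp = {i: sizes[i] * gamma / n for i in q}      # q''(i) = m_i * gamma/n
    qpp[n + 1] = 1 - sum(qpp.values())              # leftover mass, < gamma
    return F1, F2, qpp
```

For example, with \(\gamma =1/6\) and q = (1/5, 4/5) over [2], we get q' = (7/20, 13/20), hence \(m_1=4\) and \(m_2=7\), and q'' = (1/3, 7/12, 1/12), which is indeed 12-grained (i.e., 6n-grained).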

Combining these two filters, we obtain the desired reduction.

Algorithm 3.5

(reducing testing equality to a general distribution to testing equality to an O(n)-grained distribution): Let D be an arbitrary distribution with probability function \(q:[n]\rightarrow [0,1]\), and let T be an \(\epsilon '\)-tester for equality to m-grained distributions having sample complexity \(s(m,\epsilon ')\). On input \((n,\epsilon ;i_1,...,i_s)\), where \(i_1,...,i_s\in [n]\) are \(s=s(O(n),\epsilon /3)\) samples drawn according to an unknown distribution p, the tester proceeds as follows:

  1.

    It produces an s-long sequence \((i''_1,...,i''_s)\) by applying \(F''_{F'{({q})}}\circ F'\) to \((i_1,...,i_s)\), where \(F'\) and \(F''_{q'}\) are as in Eqs. (2) and (3); that is, for every \(k\in [s]\), it produces \(i'_k\leftarrow F'(i_k)\) and \(i''_k\leftarrow F''_{F'{({q})}}(i'_k)\).

    (Recall that \(F''_{q'}\) depends on a universal constant \(\gamma \), which we shall set to 1/6.)

  2.

    It invokes the \(\epsilon /3\)-tester T for equality to \(q''\) while providing it with the sequence \((i''_1,...,i''_s)\). Note that this is a sequence over \([n+1]\).

Using the notations as in Eqs. (2) and (3), we first observe that the total variation distance between \(p'=F'{({p})}\) and \(q'=F'{({q})}\) is half the total variation distance between p and q (since \(p'(i)=0.5\cdot p(i)\,+\,(1/2n)\) and ditto for \(q'\)). Next, we observe that the total variation distance between \(p''=F''_{q'}{({p'})}\) and \(q''=F''_{q'}{({q'})}\) is lower-bounded by a constant fraction of the total variation distance between \(p'\) and \(q'\). To see this, let X and Y be distributed according to \(p'\) and \(q'\), respectively, and observe that

$$\begin{aligned} \sum _{i\in [n]}\Bigl |\mathbf{Pr}[F''_{q'}(X)=i]-\mathbf{Pr}[F''_{q'}(Y)=i] \Bigr |&= \sum _{i\in [n]} \left| p'(i)\cdot \frac{m_i\gamma /n}{q'(i)} -q'(i)\cdot \frac{m_i\gamma /n}{q'(i)} \right| \\&= \sum _{i\in [n]} \frac{m_i\gamma /n}{q'(i)}\cdot \left| p'(i)-q'(i)\right| \\&\ge \min _{i\in [n]}\left\{ \frac{m_i\gamma /n}{q'(i)}\right\} \cdot \sum _{i\in [n]}\left| p'(i)-q'(i)\right| . \end{aligned}$$

As stated above, recalling that \(q'(i)\ge 1/2n\) and \(m_i={\lfloor (n/\gamma )\cdot q'(i)\rfloor }>(n/\gamma )\cdot q'(i)-1\), it follows that

$$\frac{m_i\gamma /n}{q'(i)} > \frac{((n/\gamma )\cdot q'(i)-1)\cdot \gamma /n}{q'(i)} = 1-\frac{\gamma /n}{q'(i)} \ge 1-\frac{\gamma /n}{1/2n} = 1-2\gamma .$$

Hence, if p is \(\epsilon \)-far from q, then \(p'\) is \(\epsilon /2\)-far from \(q'\), and \(p''\) is \(\epsilon /3\)-far from \(q''\), where we use \(\gamma \le 1/6\). On the other hand, if \(p=q\), then \(p''=q''\). Noting that \(q''\) is an \(n/\gamma \)-grained distribution, provided that \(n/\gamma \) is an integer (as is the case for \(\gamma =1/6\)), we complete the analysis of the reduction. Hence,

Proposition 3.6

(Algorithm 3.5 as a reduction): The filter \(F''_{F'{({q})}}\circ F'\) used in Algorithm 3.5 reduces \(\epsilon \)-testing equality to any fixed distribution D (over [n]) to \(\epsilon /3\)-testing equality to a 6n-grained distribution over \([n']\), where \(n'\in \{n,n+1\}\) depends on q.Footnote 7 Furthermore, the support of \(F''_{F'{({q})}}\circ F'{({q})}\) equals \([n']\).

Hence, the sample complexity of \(\epsilon \)-testing equality to arbitrary distributions over [n] equals the sample complexity of \(\epsilon /3\)-testing equality to 6n-grained distributions (which is essentially a special case).

Digest. One difference between the filter underlying Algorithm 3.2 and the one underlying Algorithm 3.5 is that the former preserves the exact distance between distributions, whereas the latter preserves distances only up to a constant factor. The difference is reflected in the fact that the first filter maps the different i’s to distributions of disjoint support, whereas the second filter (which is composed of the filters of Eqs. (2) and (3)) maps different i’s to distributions of non-disjoint support. (Specifically, the filter of Eq. (2) maps every \(i\in [n]\) to a distribution that assigns each \(i'\in [n]\) probability at least 1/2n, whereas the filter of Eq. (3) typically maps each \(i\in [n]\) to a distribution with a support that contains the element \(n+1\).)

3.4 From Arbitrary Distributions to the Uniform One

Combining the reductions stated in Propositions 3.3 and 3.6, we obtain a proof of Theorem 1.1.

Theorem 3.7

(Theorem 1.1, restated): For every probability function \(q:[n]\rightarrow [0,1]\) the filter \(F_{q''}\circ F''_{F'{({q})}}\circ F'\), where \(q''=F''_{F'{({q})}}\circ F'{({q})}\) is as in Algorithm 3.5 and \(F_{q''}\) is as in Algorithm 3.2, reduces \(\epsilon \)-testing equality to q to \(\epsilon /3\)-testing equality to the uniform distribution over [6n].

Proof:

First, setting \(\gamma =1/6\) and using the filter \(F''_{F'{({q})}}\circ F'\), we reduce the problem of \(\epsilon \)-testing equality to q to the problem of \(\epsilon /3\)-testing equality to the 6n-grained distribution \(q''\), while noting that the distribution \(q''\) has support \([n']\), where \(n'\in \{n,n+1\}\) (depending on q). Note that the latter assertion relies on the furthermore part of Proposition 3.6. Next, using the furthermore part of Proposition 3.3, we note that \(F_{q''}\) reduces \(\epsilon /3\)-testing equality to \(q''\) to \(\epsilon /3\)-testing equality to the uniform distribution over [6n].    \(\blacksquare \)

Observe that the proof of Theorem 3.7 avoids the problem discussed right after the presentation of Algorithm 3.2, which refers to the fact that testing equality to an m-grained distribution over [n] is reduced to testing whether distributions over \([m']\) are uniform over [m], where in some cases \(m'\in [m,m\,+\,n]\) rather than \(m'=m\). These bad cases arise when the support of the m-grained distribution is a strict subset of [n] (see Footnote 4), and they were avoided above because we applied the filter of Algorithm 3.2 to distributions \(q'':[n']\rightarrow [0,1]\) that have support \([n']\). Nevertheless, it is nice to have a reduction from the general case of “testing uniformity” to the special case, where the general case refers to testing whether distributions over [n] are uniform over [m], for any n and m, and the special case mandates that \(m=n\). Such a reduction is provided next.

Theorem 3.8

(testing uniform distributions, a reduction between two versions): There exists a simple filter that maps \(U_m\) to \(U_{2m}\), while mapping any distribution X that is \(\epsilon \)-far from \(U_m\) to a distribution over [2m] that is \(\epsilon /2\)-far from \(U_{2m}\). We stress that X is not necessarily distributed over [m] and remind the reader that \(U_n\) denotes the uniform distribution over [n].

Thus, for every n and m, this filter reduces \(\epsilon \)-testing whether distributions over [n] are uniform over [m] to \(\epsilon /2\)-testing whether distributions over [2m] are uniform over [2m].

Proof:

The filter, denoted F, maps \(i\in [m]\) uniformly at random to an element in \(\{i,m+i\}\), while mapping any \(i\not \in [m]\) uniformly at random to an element in [m]. Observe that any distribution over [n] is mapped to a distribution over [2m], and that \(F(U_m)\equiv U_{2m}\). Note that F does not necessarily preserve distances between arbitrary distributions over [n] (e.g., both the uniform distribution over [2m] and the uniform distribution over \([m]\cup [2m+1,3m]\) are mapped to the same distribution), but (as shown next) F preserves distances to the relevant uniform distributions up to a constant factor. Specifically, note that

$$\begin{aligned} \sum _{i\in [m+1,2m]}\Bigl |\mathbf{Pr}[F(X)\!=\!i]-\mathbf{Pr}[U_{2m}\!=\!i]\Bigr |= & {} \sum _{i\in [m]}\left| \mathbf{Pr}[X\!=\!i]\cdot \frac{1}{2}-\frac{1}{2m}\right| \\= & {} \frac{1}{2}\cdot \sum _{i\in [m]}\Bigl |\mathbf{Pr}[X\!=\!i]-\mathbf{Pr}[U_{m}\!=\!i]\Bigr | \end{aligned}$$

and

$$\begin{aligned} \sum _{i\in [m]}\Bigl |\mathbf{Pr}[F(X)\!=\!i]-\mathbf{Pr}[U_{2m}\!=\!i]\Bigr |\ge & {} \mathbf{Pr}\left[ F(X)\in [m]\right] -\mathbf{Pr}\left[ U_{2m}\in [m]\right] \\= & {} \left( \mathbf{Pr}[X\in [m]]\cdot \frac{1}{2}+\mathbf{Pr}[X\not \in [m]]\right) -\frac{1}{2} \\= & {} \frac{1}{2}\cdot \mathbf{Pr}[X\not \in [m]] \\= & {} \frac{1}{2}\cdot \sum _{i\not \in [m]} \Bigl |\mathbf{Pr}[X\!=\!i]-\mathbf{Pr}[U_{m}\!=\!i]\Bigr |. \end{aligned}$$

Hence, the total variation distance between F(X) and \(U_{2m}\) is at least half the total variation distance between X and \(U_{m}\).    \(\blacksquare \)
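To make the foregoing concrete, here is a minimal Python sketch (the names `push_forward` and `tv` are ours, not the paper's) that computes the exact pushforward of a distribution under this filter, rather than sampling from it. It lets one check on small examples that \(U_m\) is mapped to \(U_{2m}\) and that the distance to the relevant uniform distribution shrinks by at most a factor of two.

```python
def push_forward(p, m):
    """Exact distribution of F(X) over [2m], given the distribution p of X
    (a dict mapping elements of [n] to probabilities).  F maps i in [m]
    uniformly to {i, m+i}, and i outside [m] uniformly into [m]."""
    q = {j: 0.0 for j in range(1, 2 * m + 1)}
    for i, pi in p.items():
        if 1 <= i <= m:
            q[i] += pi / 2          # stay at i with probability 1/2
            q[m + i] += pi / 2      # move to m+i with probability 1/2
        else:
            for j in range(1, m + 1):
                q[j] += pi / m      # spread uniformly over [m]
    return q

def tv(p, q, support):
    """Total variation (statistical) distance between p and q."""
    return 0.5 * sum(abs(p.get(i, 0.0) - q.get(i, 0.0)) for i in support)
```

For instance, a point mass on an element outside [m], which is at distance 1 from \(U_m\), is pushed to the uniform distribution over [m], which is at distance exactly 1/2 from \(U_{2m}\); this matches the factor-of-two loss in the theorem.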

4 On Testing Whether a Distribution Is Grained

A natural question that arises from the interest in grained distributions refers to the complexity of testing whether an unknown distribution is grained. Specifically, given n and m (and a proximity parameter \(\epsilon \)), how many samples are required in order to determine whether an unknown distribution X over [n] is m-grained or \(\epsilon \)-far from any m-grained distribution?
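Recall that a distribution is m-grained if each probability it assigns is an integer multiple of 1/m. The following Python sketch checks this condition and computes the statistical distance to the nearest m-grained distribution by largest-remainder rounding; the rounding scheme and the function names are our own illustration, not part of the paper.

```python
from math import floor

def is_m_grained(p, m, tol=1e-9):
    """p (a list of probabilities) is m-grained iff every probability
    is an integer multiple of 1/m (up to floating-point tolerance)."""
    return all(abs(pi * m - round(pi * m)) <= tol for pi in p)

def dist_to_m_grained(p, m):
    """Statistical distance from p to the nearest m-grained distribution:
    floor every m*p(i), then hand the leftover units of 1/m to the
    entries with the largest remainders (largest-remainder rounding)."""
    scaled = [pi * m for pi in p]
    base = [floor(s) for s in scaled]
    leftover = m - sum(base)  # units of 1/m still to distribute
    by_remainder = sorted(range(len(p)),
                          key=lambda i: scaled[i] - base[i], reverse=True)
    for i in by_remainder[:leftover]:
        base[i] += 1
    return 0.5 * sum(abs(pi - ci / m) for pi, ci in zip(p, base))
```

For example, the distribution (0.4, 0.6) is 5-grained but not 2-grained, and its distance to the nearest 2-grained distribution, namely (0.5, 0.5), is 0.1.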

This question can be partially answered by invoking the results of Valiant and Valiant [16]. Specifically, for an upper bound we use their “learning up to relabelling” algorithm, which may be viewed as a learner of histograms (which is what it actually does). Recall that the histogram of the probability function p is defined as the multiset \(\{p(i):i\in [n]\}\) (equiv., as the set of pairs \(\{(v,m):m=|\{i\!\in \![n]:p(i)\!=\!v\}|>0\}\)).
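The histogram just defined is straightforward to compute; the following tiny Python illustration (our own, not from [16]) uses the second, equivalent formulation as a set of (value, multiplicity) pairs.

```python
from collections import Counter

def histogram(p):
    """Histogram of a probability function p, given as a list of its
    values: the pairs (v, multiplicity) for each value v occurring in p,
    returned here as a sorted list for a canonical representation."""
    return sorted(Counter(p).items())
```

Being a function of the multiset of values only, the histogram is invariant under relabelling: permuting the entries of p leaves it unchanged, which is exactly why it suffices for testing label-invariant properties.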

Theorem 4.1

(learning the histogram [16, Thm. 1]):Footnote 8 There exists an \(O(\epsilon ^{-2}\cdot n/\log n)\) time algorithm that, on input \(n,\epsilon \) and \(O(\epsilon ^{-2}\cdot n/\log n)\) samples drawn from an unknown distribution \(p:[n]\rightarrow [0,1]\), outputs, with probability \(1-1/\mathrm{poly}(n)\), a histogram of a distribution that is \(\epsilon \)-close to p.

The implication of this result for testing any label-invariant property of distributions is immediate, where a property of distributions \(\mathcal D\) is called label-invariant if for every distribution \(p:[n]\rightarrow [0,1]\) in \(\mathcal D\) and every permutation \(\pi :[n]\rightarrow [n]\) it holds that \(p\circ \pi \) is in \(\mathcal D\). In our case, the tester consists of employing the algorithm of Theorem 4.1 with proximity parameter \(\epsilon /2\) and accepting if and only if the output fits a histogram of a distribution that is \(\epsilon /2\)-close to being m-grained. The same holds with respect to estimating the distance from the set of m-grained distributions (which can be captured as a special case of label-invariant properties). Hence, we get

Corollary 4.2

(testing whether a distribution is grained): For every \(n,m\in \mathbb N\), the set of m-grained distributions over [n] has a tester of sample complexity \(O(\epsilon ^{-2}\cdot n/\log n)\). Furthermore, the distance of an unknown distribution to the set of m-grained distributions over [n] can be approximated up to an additive error of \(\epsilon \) using the same number of samples.

We comment that it seems that, using the techniques of [16], one can reduce the complexity to \(O(\epsilon ^{-2}\cdot n'/\log n')\), where \(n'=\min (n,m)\). (For the case of testing, this is shown in the Appendix, using a reduction.) On the other hand, for \(m\in [\varOmega (n),O(n)]\), the above distance approximator is optimal, whereas it makes no sense to consider \(m>n/\epsilon \) (since any distribution over [n] is \(\epsilon \)-close to being \(n/\epsilon \)-grained). The negative result follows from the corresponding result of Valiant and Valiant [16].

Theorem 4.3

(optimality of Theorem 4.1 [16, Thm. 2]):Footnote 9 For every sufficiently small \(\epsilon >0\), there exist two distributions \(p_1,p_2:[n]\rightarrow [0,1]\) that are indistinguishable by any label-invariant algorithm that takes \(O(\epsilon ^{-1}n/\log n)\) samples although \(p_1\) is \(\epsilon \)-close to the uniform distribution over [n] and \(p_2\) is \(\epsilon \)-close to the uniform distribution over some set of n/2 elements.

Let us spell out that, in the current context, an algorithm A is called label-invariant if for every permutation \(\pi :[n]\rightarrow [n]\) and every sample \(i_1,...,i_s\), it holds that \(A(n,\epsilon ;i_1,...,i_s)\equiv A(n,\epsilon ;\pi (i_1),...,\pi (i_s))\). Indeed, when estimating the distance to a label-invariant property, we may assume (w.l.o.g.) that the algorithm is label-invariant. Combining Theorem 4.3 with the latter fact, we get—

Corollary 4.4

(optimality of Corollary 4.2): For any \(m\in [\varOmega (n),O(n)]\), estimating the distance to the set of m-grained distributions over [n] up to a sufficiently small additive constant requires \(\varOmega (n/\log n)\) samples.

Similarly, tolerant testing (cf. [15]) in the sense of distinguishing distributions that are \(\epsilon _1\)-close to being m-grained from distributions that are \(\epsilon _2\)-far from being m-grained requires \(\varOmega (n/\log n)\) samples, for any constant \(\epsilon _2\in (0,1/(2\cdot {\lfloor 2m/n\rfloor }))\) and \(\epsilon _1\in (0,\epsilon _2)\).

Proof:

The case of \(m=n/2\) follows by invoking Theorem 4.3, while observing that \(p_2\) is \(\epsilon \)-close to being m-grained, whereas \(p_1\) is \(\epsilon \)-close to a distribution (i.e., \(U_{2m}\)) that is \((0.499-\epsilon )\)-far from being m-grained,Footnote 10 where \(p_1,p_2\) and \(\epsilon \) are as in Theorem 4.3. Hence, distinguishing the distributions \(p_2\) and \(p_1\) (in a label-invariant manner) is reducible to \((0.499-2\epsilon )\)-testing the set of distributions that are \(\epsilon \)-close to being m-grained, which implies that the latter task has sample complexity \(\varOmega (n/\log n)\).

The case of \(m<n/2\) is reduced to the case of \(m=n/2\), by resetting \(n\leftarrow 2m\). This yields a lower bound of \(\varOmega (m/\log m)\). Using the hypothesis \(m=\varOmega (n)\), we derive the desired lower bound of \(\varOmega (n/\log n)\).

For \(m>n/2\) (equiv., \(n<2m\)), we show a reduction of the distinguishing task underlying Theorem 4.3 to the testing problem at hand. Specifically, let \(t={\lceil 2m/n\rceil }\) and \(m'\,{\mathop {=}\limits ^\mathrm{def}}\,{\lfloor m/t\rfloor }\), and note that \(t\in [2,O(1)]\) and \(m'\in [\varOmega (n),n/2]\) (by the various hypotheses).Footnote 11 We assume for simplicity that \(m'=m/t\) (equiv., t divides m).Footnote 12 Now, consider a randomized filter, denoted \(F_{m,t}:[n]\rightarrow [n]\), that maps each \(i\in [m']\) to \(m'+i\) with probability 1/t and otherwise maps it to itself, but always maps \(i\in [n]\setminus [m']\) to \(i-m'\). Then:

  • \(F_{m,t}\) maps the uniform distribution over \([m']\) to an m-grained distribution, since \(q'_2(j)\,{\mathop {=}\limits ^\mathrm{def}}\,\mathbf{Pr}[F_{m,t}(U_{m'})\!=\!j]\) equals \(\mathbf{Pr}[U_{m'}\!=\!j]\cdot \frac{t\,-\,1}{t} = \frac{1}{m'}\cdot \frac{t\,-\,1}{t} = \frac{t\,-\,1}{m}\) if \(j\in [m']\) and equals \(\mathbf{Pr}[U_{m'}\!=\!j\,-\,m']\cdot \frac{1}{t}=\frac{1}{m}\) if \(j\in [m'+1,2m']\).

  • \(F_{m,t}\) maps the uniform distribution over \([2m']\) to a distribution that is (0.999/2t)-far from being m-grained, since \(q'_1(j)\,{\mathop {=}\limits ^\mathrm{def}}\,\mathbf{Pr}[F_{m,t}(U_{2m'})\!=\!j]\) equals \(\mathbf{Pr}[U_{2m'}\!=\!j+m']+\mathbf{Pr}[U_{2m'}\!=\!j]\cdot \frac{t\,-\,1}{t} =\frac{t\,+\,t\,-\,1}{2m't} =\frac{2t\,-\,1}{2m}\) if \(j\in [m']\) and equals \(\mathbf{Pr}[U_{2m'}\!=\!j\,-\,m']\cdot \frac{1}{t}=\frac{1}{2m}\) if \(j\in [m'+1,2m']\).

Applying the filter \(F_{m,t}\) to the distributions \(p_1\) and \(p_2\) of Theorem 4.3 (while setting \(n=2m'=2\cdot m/t\)), we obtain distributions \(p'_2\) and \(p'_1\) such that \(p'_2\) is \(\epsilon \)-close to \(q'_2\), which is m-grained, whereas \(p'_1\) is \(\epsilon \)-close to \(q'_1\), which is (0.999/2t)-far from being m-grained, since filters can only decrease the distance between distributions. Hence, distinguishing the distributions \(p_2\) and \(p_1\) (over [2m/t]) is reducible to \(((0.999/2t)-2\epsilon )\)-testing the set of distributions that are \(\epsilon \)-close to being m-grained, which implies that the latter task has sample complexity \(\varOmega ((2m/t)/\log (2m/t))\). (The claim follows by recalling that \(1/t=\varOmega (1)\), since \(m=O(n)\).)    \(\blacksquare \)
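The computations in the two bullets can be replayed exactly in code. Below is a small Python sketch of the pushforward under \(F_{m,t}\) (the function name is ours); on the toy setting \(m'=3\), \(t=2\) (so \(m=6\)), it confirms that \(U_{m'}\) is pushed to an m-grained distribution while \(U_{2m'}\) is pushed to the distribution with values \((2t-1)/2m\) and 1/2m.

```python
def push_forward_Fmt(p, m_prime, t):
    """Exact pushforward of p (a dict over [2*m_prime]) under the filter
    F_{m,t} of the proof: i in [m'] moves to m'+i with probability 1/t
    and stays put otherwise, while i outside [m'] always moves to i-m'."""
    q = {j: 0.0 for j in range(1, 2 * m_prime + 1)}
    for i, pi in p.items():
        if 1 <= i <= m_prime:
            q[i] += pi * (t - 1) / t        # stay with probability (t-1)/t
            q[m_prime + i] += pi / t        # shift with probability 1/t
        else:
            q[i - m_prime] += pi            # deterministic shift down
    return q
```

The exact-pushforward form is convenient here precisely because the proof argues about the images of distributions, not about individual samples.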

Open Problems. Note that Corollary 4.4 does not refer to testing, but rather to distance approximation, and there are natural cases in which the complexity of testing a property of distributions is significantly lower than the corresponding distance approximation task (cf. [12] versus [16]). Hence, we ask—

Open Problem 4.5

(the sample complexity of testing whether a distribution is grained): For any m and n, what is the sample complexity of testing the property that consists of all m-grained distributions over [n]?

This question can be generalized to properties that allow m to reside in some predetermined set M, where the most natural case is that M is an interval, say of the form \([m',2m']\).

Open Problem 4.6

(Problem 4.5, generalized): For any finite set \(M\subset \mathbb N\) and \(n\in \mathbb N\), what is the sample complexity of testing the property that consists of all distributions over [n] that are each m-grained for some \(m\in M\)?