1 Introduction

Inferring information about the probability distribution that underlies a data sample is an essential question in Statistics, and one that has ramifications in every field of the natural sciences and quantitative research. In many situations, it is natural to assume that this data exhibits some simple structure because of known properties of the origin of the data, and in fact these assumptions are crucial in making the problem tractable. Such assumptions translate as constraints on the probability distribution—e.g., it is supposed to be Gaussian, or to meet a smoothness or “fat tail” condition (see e.g., [36, 39, 48]).

As a result, the problem of deciding whether a distribution possesses such a structural property has been widely investigated both in theory and practice, in the context of shape restricted inference [8, 47] and model selection [41]. Here, it is guaranteed or thought that the unknown distribution satisfies a shape constraint, such as having a monotone or log-concave probability density function [7, 25, 46, 54]. From a different perspective, a recent line of work in Theoretical Computer Science, originating from the papers of Batu et al. [10, 11, 35] has also been tackling similar questions in the setting of property testing (see [13, 43,44,45] for surveys on this field). This very active area has seen a spate of results and breakthroughs over the past decade, culminating in very efficient (both sample and time-wise) algorithms for a wide range of distribution testing problems [1, 2, 9, 18, 24, 28, 34]. In many cases, this led to a tight characterization of the number of samples required for these tasks as well as the development of new tools and techniques, drawing connections to learning and information theory [26, 50, 51, 53].

In this paper, we focus on the following general property testing problem: given a class (property) of distributions \(\mathcal {P}\) and sample access to an arbitrary distribution D, one must distinguish between the case that (a) \(D\in \mathcal {P}\), versus (b) \(\|{D}-{D}^{\prime }\|_{1} >\varepsilon \) for all \(D^{\prime }\in \mathcal {P}\) (i.e., D is either in the class, or far from it). While many of the previous works have focused on the testing of specific properties of distributions or obtained algorithms and lower bounds on a case-by-case basis, an emerging trend in distribution testing is to design general frameworks that can be applied to several property testing problems [27, 28, 49, 51]. This direction, the testing analog of a similar movement in distribution learning [4, 15,16,17], aims at abstracting the minimal assumptions that are shared by a large variety of problems, and giving algorithms that can be used for any of these problems. In this work, we make significant progress in this direction by providing a unified framework for the question of testing various properties of probability distributions. More specifically, we describe a generic technique to obtain upper bounds on the sample complexity of this question, which applies to a broad range of structured classes. Our technique yields sample near-optimal and computationally efficient testers for a wide range of distribution families. Conversely, we also develop a general approach to prove lower bounds on these sample complexities, and use it to derive tight or nearly tight bounds for many of these classes.

Related Work

Batu et al. [12] initiated the study of efficient property testers for monotonicity and obtained (nearly) matching upper and lower bounds for this problem, while [2] later considered testing the class of Poisson Binomial Distributions and settled the sample complexity of this problem (up to the precise dependence on ε). Indyk, Levi, and Rubinfeld [37], focusing on distributions that are piecewise constant on t intervals (“t-histograms”), described a \(\tilde {O}(\sqrt {tn}/\varepsilon ^{5})\)-sample algorithm for testing membership to this class. Another body of work by [9, 12], and [24] shows how assumptions on the shape of the distributions can lead to significantly more efficient algorithms. They describe such improvements in the case of identity and closeness testing as well as for entropy estimation, under monotonicity or k-modality constraints. Specifically, Batu et al. show in [12] how to obtain a \(O\left ({\log ^{3} n/\varepsilon ^{3}}\right )\)-sample tester for closeness in this setting, in stark contrast to the \({\Omega }\left ({{n}^{2/3}}\right )\) general lower bound. Daskalakis et al. [24] later gave \({O}(\sqrt {\log n})\) and \({O}({\log ^{2/3} n})\)-sample testing algorithms for testing respectively identity and closeness of monotone distributions, and obtained similar results for k-modal distributions. Finally, we briefly mention two related results, due respectively to [9] and [22]. The first one states that for the task of getting a multiplicative estimate of the entropy of a distribution, assuming monotonicity enables exponential savings in sample complexity: \(O\left ({\log ^{6} n}\right )\), instead of \(\Omega (n^{c})\) for the general case. The second describes how to test if an unknown k-modal distribution is in fact monotone, using only \(O(k/\varepsilon ^{2})\) samples.
Note that the latter line of work differs from ours in that it presupposes the distributions satisfy some structural property, and uses this knowledge to test something else about the distribution; while we are given a priori arbitrary distributions, and must check whether the structural property holds. Except for the properties of monotonicity and being a PBD, nothing was previously known on testing the shape restricted properties that we study.

Moreover, for the specific problems of identity and closeness testing (see Footnote 1), recent results of [27, 28] describe a general algorithm which applies to a large range of shape or structural constraints, and yields optimal identity testers for classes of distributions that satisfy them. We observe that while the question they answer can be cast as a specialized instance of membership testing, our results are incomparable to theirs, both because of the distinction above (testing with versus testing for structure) and as the structural assumptions they rely on are fundamentally different from ours.

Concurrent and Followup Work

Independently and concurrently to this work, Acharya, Daskalakis, and Kamath [3] obtained a sample near-optimal efficient algorithm for testing log-concavity, as well as sample-optimal algorithms for testing the classes of monotone, unimodal, and monotone hazard rate distributions (along with matching lower bounds on the sample complexity of these tasks). Their work builds on ideas from [2] and their techniques are orthogonal to ours: namely, while at some level both works follow a “testing-by-learning” paradigm, theirs relies on first learning in the (more stringent) \(\chi ^{2}\) distance, then applying a testing algorithm which is robust to some amount of noise (i.e., tolerant testing) in this \(\chi ^{2}\) sense (as opposed to noise in an \(\ell _{1}\) sense, which is known to be impossible without a near-linear number of samples [50]).

Subsequent to the publication of the conference version of this work, [14] improved on both [37] and our results for the specific class of t-histograms, providing nearly tight upper and lower bounds on testing membership to this class. Specifically, it obtains an upper bound of \(\tilde {O}(\sqrt {n}/\varepsilon ^{2} + t/\varepsilon ^{3})\), complemented with an \({\Omega }(\sqrt {n}/\varepsilon ^{2}+t/(\varepsilon \log t))\) lower bound on the sample complexity.

Building on our work, Fischer, Lachish, and Vasudev recently generalized in [33] our approach and algorithm to the conditional sampling model of [19, 20], obtaining analogues of our testing results in this different setting of distribution testing where the algorithm is allowed to condition the samples it receives on subsets of the domain of its choosing. In the “standard” sampling setting, [33] additionally provides an alternative to the first subroutine of our testing algorithm: this yields a simpler and non-recursive algorithm, with a factor \(\log n\) shaved off at the price of a worse dependency on the distance parameter ε. (Namely, their sample complexity is dominated by \(O(\sqrt {nL}\log ^{2}(1/\varepsilon )/{\varepsilon ^{5}})\), to be compared to the \(O(\sqrt {nL}\log n/{\varepsilon ^{3}})\) term of Theorem 3.3.)

1.1 Results and Techniques

Upper Bounds A natural way to tackle our membership testing problem would be to first learn the unknown distribution D as if it satisfied the property, before checking if the hypothesis obtained is indeed both close to the original distribution and to the property. Taking advantage of the purported structure, the first step could presumably be conducted with a small number of samples; things break down, however, in the second step. Indeed, most approximation results leading to the improved learning algorithms one would apply in the first stage only provide very weak guarantees, that is, in the \(\ell _{1}\) sense only. For this reason, they lack the robustness that would be required for the second part, where it becomes necessary to perform tolerant testing between the hypothesis and D, a task that would then entail a number of samples almost linear in the domain size. To overcome this difficulty, we need to move away from these global \(\ell _{1}\) closeness results and instead work with stronger requirements, this time in \(\ell _{2}\) norm.

At the core of our approach is an idea of Batu et al. [12], who show that monotone distributions can be well-approximated (in a certain technical sense) by piecewise constant densities on a suitable interval partition of the domain, and leverage this fact to reduce monotonicity testing to uniformity testing on each interval of this partition. While the argument of [12] is tailored specifically for the setting of monotonicity testing, we are able to abstract the key ingredients, and obtain a generic membership tester that applies to a wide range of distribution families. In more detail, we provide a testing algorithm which applies to any class of distributions which admit succinct approximate decompositions, that is, each distribution in the class can be well-approximated (in a strong \(\ell _{2}\) sense) by piecewise constant densities on a small number of intervals (we hereafter refer to this approximation property, formally defined in Definition 3.1, as (Succinctness); and extend the notation to apply to any class \(\mathcal {C}\) of distributions for which all \(D\in \mathcal {C}\) satisfy (Succinctness)). Crucially, the algorithm does not care about how these decompositions can be obtained: for the purpose of testing these structural properties we only need to establish their existence. Specific examples are given in the corollaries below. Our main algorithmic result, informally stated (see Theorem 3.3 for a detailed formal statement), is as follows:

Theorem 1.1 (Main Theorem)

There exists an algorithm TestSplittable which, given sampling access to an unknown distribution D over [n] and parameter ε ∈ (0,1], can distinguish with probability 2/3 between (a) \(D\in \mathcal {P}\) versus (b) \(\ell _{1}({D}, \mathcal {P}) > \varepsilon \), for any property \(\mathcal {P}\) that satisfies the above natural structural criterion (Succinctness). Moreover, for many such properties this algorithm is computationally efficient, and its sample complexity is optimal (up to logarithmic factors and the exact dependence on ε).

We then instantiate this result to obtain “out-of-the-box” computationally efficient testers for several classes of distributions, by showing that they satisfy the premise of our theorem (the definition of these classes is given in Section 2.1):

Corollary 1.2

The algorithm TestSplittable can test the classes of monotone, unimodal, log-concave, concave, convex, and monotone hazard rate (MHR) distributions, with \(\tilde {O}({\sqrt {n}/\varepsilon ^{7/2}})\) samples.

Corollary 1.3

The algorithm TestSplittable can test the class of t-modal distributions, with \(\tilde {O}({\sqrt {tn}/\varepsilon ^{7/2}})\) samples.

Corollary 1.4

The algorithm TestSplittable can test the classes of t-histograms and t-piecewise degree-d distributions, with \(\tilde {O}({\sqrt {tn}/\varepsilon ^{3}})\) and \(\tilde {O}({\sqrt {t(d+1)n}/\varepsilon ^{7/2} + t(d+1)/\varepsilon ^{3}})\) samples respectively.

Corollary 1.5

The algorithm TestSplittable can test the classes of Binomial and Poisson Binomial Distributions, with \(\tilde {O}({{n}^{1/4}/\varepsilon ^{7/2}})\) samples.

We remark that the aforementioned sample upper bounds are information-theoretically near-optimal in the domain size n (up to logarithmic factors). See Table 1 and the following subsection for the corresponding lower bounds. We did not attempt to optimize the dependence on the parameter ε, though a more careful analysis might lead to such improvements.

Table 1 Summary of results

We stress that prior to our work, no non-trivial testing bound was known for most of these classes: specifically, our nearly-tight bounds for t-modal with t > 1, log-concave, concave, convex, MHR, and piecewise polynomial distributions are new. Moreover, although a few of our applications were known in the literature (the \(\tilde {O}\left (\sqrt {n}/{\varepsilon }^{6} \right )\) upper and \({\Omega \left (\sqrt {n}/{\varepsilon }^{2} \right )}\) lower bounds on testing monotonicity can be found in [12], while the \({\Theta }\left ({n^{1/4}}\right )\) sample complexity of testing PBDs was recently given (see Footnote 2) in [2], and the task of testing t-histograms is considered in [37]), the crux here is that we are able to derive them in a unified way, by applying the same generic algorithm to all these different distribution families. We note that our upper bound for t-histograms (Corollary 1.4) also significantly improves on the previous \(\tilde {O}\left (\sqrt {tn}/{\varepsilon }^{5} \right )\)-sample tester with regard to the dependence on the proximity parameter ε. In addition to its generality, our framework yields much cleaner and conceptually simpler proofs of the upper and lower bounds from [2].

Lower Bounds

To complement our upper bounds, we give a generic framework for proving lower bounds against testing classes of distributions. In more detail, we describe how to reduce, under a mild assumption on the property \(\mathcal {C}\), the problem of testing membership to \(\mathcal {C}\) (“does \(D\in \mathcal {C}\)?”) to testing identity to \(D^{\ast }\) (“does \(D = D^{\ast }\)?”), for any explicit distribution \(D^{\ast }\) in \(\mathcal {C}\). While these two problems need not in general be related (see Footnote 3), we show that our reduction-based approach applies to a large number of natural properties, and obtain lower bounds that nearly match our upper bounds for all of them. Moreover, this lets us derive a simple proof of the lower bound of [2] on testing the class of PBDs. The reader is referred to Theorem 6.1 for the formal statement of our reduction-based lower bound theorem. In this section, we state the concrete corollaries we obtain for specific structured distribution families:

Corollary 1.6

Testing log-concavity, convexity, concavity, MHR, unimodality, t-modality, t-histograms, and t-piecewise degree-d distributions each require \({\Omega \left ({\sqrt {n}}/{{\varepsilon }^{2}} \right )}\) samples (the last three for \(t = o(\sqrt {n})\) and \(t(d+1) = o(\sqrt {n})\), respectively), for any \(\varepsilon \geq 1/n^{O(1)}\).

Corollary 1.7

Testing the classes of Binomial and Poisson Binomial Distributions each require \({\Omega \left ({n^{1/4}}/{{\varepsilon }^{2}} \right )}\) samples, for any \(\varepsilon \geq 1/n^{O(1)}\).

Corollary 1.8

There exist absolute constants c > 0 and \(\varepsilon _{0} > 0\) such that testing the class of k-SIIRV distributions requires \({\Omega }\left (k^{1/2}n^{1/4}/{\varepsilon }^{2} \right )\) samples, for any \(k={o\left (n^{c} \right )}\) and \({1}/{n^{O(1)}} \leq {\varepsilon } \leq {\varepsilon }_{0}\).

Tolerant Testing

Using our techniques, we also establish nearly tight upper and lower bounds on tolerant testing (see Footnote 4) for shape restrictions. Similarly, our upper and lower bounds are matching as a function of the domain size. More specifically, we give a simple generic upper bound approach (namely, a learning followed by tolerant testing algorithm). Our tolerant testing lower bounds follow the same reduction-based approach as in the non-tolerant case. In more detail, our results are as follows (see Sections 6 and 7):

Corollary 1.9

Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and t-modality can be performed with \(O\left (\frac {1}{(\varepsilon _{2}-\varepsilon _{1})^{2}}\frac {n}{\log n} \right )\) samples, for \(\varepsilon _{2} \geq C\varepsilon _{1}\) (where C > 2 is an absolute constant).

Corollary 1.10

Tolerant testing of the classes of Binomial and Poisson Binomial Distributions can be performed with \(O\left (\frac {1}{(\varepsilon _{2}-\varepsilon _{1})^{2}}\frac {\sqrt {n\log ({1}/{\varepsilon _{1}})}}{\log n} \right )\) samples, for \(\varepsilon _{2} \geq C\varepsilon _{1}\) (where C > 2 is an absolute constant).

Corollary 1.11

Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and t-modality each require \({\Omega \left (\frac {1}{({\varepsilon }_{2}-{\varepsilon }_{1})}\frac {n}{\log n} \right )}\) samples (the latter for t = o(n)).

Corollary 1.12

Tolerant testing of the classes of Binomial and Poisson Binomial Distributions each require \({\Omega \left (\frac {1}{({\varepsilon }_{2}-{\varepsilon }_{1})}\frac {\sqrt {n}}{\log n} \right )}\) samples.

On the Scope of our Results

We point out that our main theorem is likely to apply to many other classes of structured distributions, due to the mild structural assumptions it requires. However, we did not attempt here to be comprehensive; but rather to illustrate the generality of our approach. Moreover, for all properties considered in this paper the generic upper and lower bounds we derive through our methods turn out to be optimal up to at most polylogarithmic factors (with regard to the support size). The reader is referred to Table 1 for a summary of our results and related work.

1.2 Organization of the Paper

We start by giving the necessary background and definitions in Section 2, before turning to our main result, the proof of Theorem 1.1 (our general testing algorithm), in Section 3. In Section 4, we establish the necessary structural theorems for each class of distributions considered, enabling us to derive the upper bounds of Table 1. Section 5 introduces a slight modification of our algorithm which yields stronger testing results for classes of distributions with small effective support, and uses it to derive Corollary 1.5, our upper bound for Poisson Binomial distributions. Next, Section 6 contains the details of our lower bound methodology, and of its applications to the classes of Table 1. Finally, Section 6.2 is concerned with the extension of this methodology to tolerant testing, of which Section 7 describes a generic upper bound counterpart.

2 Notation and Preliminaries

2.1 Definitions

We give here the formal descriptions of the classes of distributions involved in this work, starting with that of monotone distributions.

Definition 2.1 (monotone)

A distribution D over [n] is monotone (non-increasing) if its probability mass function (pmf) satisfies \(D(1) \geq {D}(2) \geq {\dots } \geq {D}(n)\).

A natural generalization of the class \({\mathcal {M}}\) of monotone distributions is the set of t-modal distributions, i.e. distributions whose pmf can go “up and down” or “down and up” up to t times (see Footnote 5):

Definition 2.2 (t-modal)

Fix any distribution D over [n], and integer t. D is said to have t modes if there exists a sequence \(i_{0} < {\dots } < i_{t+1}\) such that either \((-1)^{j} D(i_{j}) < (-1)^{j} D(i_{j+1})\) for all \(0 \leq j \leq t\), or \((-1)^{j} D(i_{j}) > (-1)^{j} D(i_{j+1})\) for all \(0 \leq j \leq t\). We call D t-modal if it has at most t modes, and write \({\mathcal {M}_{t}}\) for the class of all t-modal distributions (omitting the dependence on n). The particular case of t = 1 corresponds to the set \(\mathcal {M}_{1}\) of unimodal distributions.
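To make Definition 2.2 concrete, the following minimal sketch counts the modes of an explicitly given pmf. It relies on the observation that the largest t for which an alternating sequence exists equals the number of direction changes among the nonzero consecutive differences of the pmf. The names `count_modes` and `is_t_modal` are hypothetical, the pmf is a 0-indexed Python list, and a constant pmf is treated as having 0 modes by convention.

```python
from typing import Sequence


def count_modes(pmf: Sequence[float]) -> int:
    """Count direction changes of the pmf, ignoring flat steps.

    Assumption: pmf[i] stores D(i + 1); comparisons are exact, so pass
    exact values (e.g. fractions) if rounding could matter.
    """
    signs = []
    for prev, curr in zip(pmf, pmf[1:]):
        if curr > prev:
            signs.append(1)
        elif curr < prev:
            signs.append(-1)
    # Each switch between "going up" and "going down" adds one mode.
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def is_t_modal(pmf: Sequence[float], t: int) -> bool:
    """True if the pmf has at most t modes in the sense sketched above."""
    return count_modes(pmf) <= t


if __name__ == "__main__":
    print(count_modes([0.4, 0.3, 0.2, 0.1]))       # monotone: 0
    print(count_modes([0.1, 0.2, 0.4, 0.2, 0.1]))  # unimodal: 1
```

This only checks membership for a fully known pmf; it says nothing about testing from samples, which is the subject of the paper.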

Definition 2.3 (Log-concave)

A distribution D over [n] is said to be log-concave if it satisfies the following conditions: (i) for any \(1 \leq i < j < k \leq n\) such that D(i)D(k) > 0, D(j) > 0; and (ii) for all 1 < k < n, \(D(k)^{2} \geq D(k-1)D(k+1)\). We write \({\mathcal {L}}\) for the class of all log-concave distributions (omitting the dependence on n).

Definition 2.4 (Concave and Convex)

A distribution D over [n] is said to be concave if it satisfies the following conditions: (i) for any \(1 \leq i < j < k \leq n\) such that D(i)D(k) > 0, D(j) > 0; and (ii) for all 1 < k < n such that D(k − 1)D(k + 1) > 0, 2D(k) ≥ D(k − 1) + D(k + 1); it is convex if the reverse inequality holds in (ii). We write \({\mathcal {K}^-}\) (resp. \({\mathcal {K}^+}\)) for the class of all concave (resp. convex) distributions (omitting the dependence on n).

It is not hard to see that convex and concave distributions are unimodal; moreover, every concave distribution is also log-concave, i.e. \(\mathcal {K}^{-}\subseteq \mathcal {L}\). Note that in both Definition 2.3 and Definition 2.4, condition (i) is equivalent to enforcing that the distribution be supported on an interval.
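As an illustration of Definitions 2.3 and 2.4, here is a small sketch that checks log-concavity and concavity of an explicitly given pmf. Condition (i) is implemented as "the support is an interval", following the remark above. The function names are hypothetical and the pmf is a 0-indexed list; exact rationals are used in the example so that equalities in condition (ii) are not affected by rounding.

```python
from fractions import Fraction
from typing import Sequence


def support_is_interval(pmf: Sequence) -> bool:
    """Condition (i): no zero-mass point between two positive-mass points."""
    support = [i for i, p in enumerate(pmf) if p > 0]
    return not support or support[-1] - support[0] + 1 == len(support)


def is_log_concave(pmf: Sequence) -> bool:
    """Conditions (i) and D(k)^2 >= D(k-1) D(k+1) for all interior k."""
    interior = range(1, len(pmf) - 1)
    return support_is_interval(pmf) and all(
        pmf[k] ** 2 >= pmf[k - 1] * pmf[k + 1] for k in interior
    )


def is_concave(pmf: Sequence) -> bool:
    """Conditions (i) and 2 D(k) >= D(k-1) + D(k+1) when both neighbours are positive."""
    interior = range(1, len(pmf) - 1)
    return support_is_interval(pmf) and all(
        2 * pmf[k] >= pmf[k - 1] + pmf[k + 1]
        for k in interior
        if pmf[k - 1] * pmf[k + 1] > 0
    )


if __name__ == "__main__":
    # A truncated geometric pmf: log-concave (with equality in (ii)) but not concave.
    geometric = [Fraction(8, 15), Fraction(4, 15), Fraction(2, 15), Fraction(1, 15)]
    print(is_log_concave(geometric), is_concave(geometric))
```

The geometric example shows that the inclusion \(\mathcal {K}^{-}\subseteq \mathcal {L}\) can be strict.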

Definition 2.5 (Monotone Hazard Rate)

A distribution D over [n] is said to have monotone hazard rate (MHR) if its hazard rate \(H(i)\overset {\text {def}}{=} \frac {{D}(i)}{{\sum }_{j=i}^{n} {D}(j)}\) is a non-decreasing function. We write \({\mathcal {MHR}}\) for the class of all MHR distributions (omitting the dependence on n).
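The hazard rate of Definition 2.5 can be computed in a single backward pass over the pmf, accumulating the tail sum \({\sum }_{j=i}^{n} D(j)\). The sketch below does this for an explicitly given pmf; the names `hazard_rate` and `is_mhr` are hypothetical, and skipping the points where the tail mass is zero (where H is undefined) is an assumption of this sketch rather than something the definition specifies.

```python
from fractions import Fraction
from typing import List, Optional, Sequence


def hazard_rate(pmf: Sequence) -> List[Optional[Fraction]]:
    """Return H(i) = D(i) / sum_{j >= i} D(j), or None where the tail mass is zero."""
    rates: List[Optional[Fraction]] = [None] * len(pmf)
    tail = 0
    for i in range(len(pmf) - 1, -1, -1):
        tail += pmf[i]
        if tail > 0:
            rates[i] = pmf[i] / tail
    return rates


def is_mhr(pmf: Sequence) -> bool:
    """True if the hazard rate is non-decreasing wherever it is defined."""
    rates = [h for h in hazard_rate(pmf) if h is not None]
    return all(a <= b for a, b in zip(rates, rates[1:]))


if __name__ == "__main__":
    uniform = [Fraction(1, 4)] * 4
    print(hazard_rate(uniform))  # 1/4, 1/3, 1/2, 1
    print(is_mhr(uniform))       # True
```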

It is known that every log-concave distribution is both unimodal and MHR (see e.g. [6, Proposition 10]), and that monotone distributions are MHR. Two other classes of distributions have elicited significant interest in the context of density estimation, those of histograms (piecewise constant) and piecewise polynomial densities:

Definition 2.6 (Piecewise Polynomials [16])

A distribution D over [n] is said to be a t-piecewise degree-d distribution if there is a partition of [n] into t disjoint intervals \(I_{1},\dots ,I_{t}\) such that \(D(i) = p_{j}(i)\) for all \(i\in I_{j}\), where each of \(p_{1},{\dots },p_{t}\) is a univariate polynomial of degree at most d. We write \({\mathcal {P}_{t,d}}\) for the class of all t-piecewise degree-d distributions (omitting the dependence on n). (We note that t-piecewise degree-0 distributions are also commonly referred to as t-histograms, and write \({\mathcal {H}_{t}}\) for \(\mathcal {P}_{t,0}\).)

Finally, we recall the definition of the two following classes, which both extend the family of Binomial distributions \(\mathcal {BIN}_{n}\): the first, by removing the need for each of the independent Bernoulli summands to share the same bias parameter.

Definition 2.7

A random variable X is said to follow a Poisson Binomial Distribution (with parameter \(n\in {\mathbb {N}}\)) if it can be written as \(X={\sum }_{k=1}^{n} X_{k}\), where \(X_{1},\dots ,X_{n}\) are independent, not necessarily identically distributed Bernoulli random variables. We denote by \(\mathcal {PBD}_{n}\) the class of all such Poisson Binomial Distributions.
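For intuition, the pmf of a Poisson Binomial Distribution can be computed exactly from the biases of its Bernoulli summands by adding one summand at a time (a standard convolution argument). The sketch below is illustrative only; `pbd_pmf` and its `biases` argument are hypothetical names, and the returned list is indexed by the value of X in \(\{0,1,\dots ,n\}\).

```python
from typing import List, Sequence


def pbd_pmf(biases: Sequence[float]) -> List[float]:
    """Return [Pr[X = 0], ..., Pr[X = n]] for X = X_1 + ... + X_n,
    where X_k is Bernoulli with Pr[X_k = 1] = biases[k - 1]."""
    pmf = [1.0]  # the empty sum is 0 with probability 1
    for p in biases:
        nxt = [0.0] * (len(pmf) + 1)
        for value, mass in enumerate(pmf):
            nxt[value] += mass * (1.0 - p)  # new summand equals 0
            nxt[value + 1] += mass * p      # new summand equals 1
        pmf = nxt
    return pmf


if __name__ == "__main__":
    print(pbd_pmf([0.5, 0.5]))       # Binomial(2, 1/2): [0.25, 0.5, 0.25]
    print(pbd_pmf([0.1, 0.7, 0.3]))  # a genuinely non-identical case
```

When all biases coincide, this recovers the Binomial distribution, which is the sense in which \(\mathcal {PBD}_{n}\) extends \(\mathcal {BIN}_{n}\).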

It is not hard to show that Poisson Binomial Distributions are in particular log-concave. One can generalize even further, by allowing each random variable of the summation to be integer-valued:

Definition 2.8

Fix any k ≥ 0. We say a random variable X is a k-Sum of Independent Integer Random Variables (k-SIIRV) with parameter \(n\in {\mathbb {N}}\) if it can be written as \(X={\sum }_{j=1}^{n} X_{j}\), where \(X_{1},\dots ,X_{n}\) are independent, not necessarily identically distributed random variables taking value in \(\{0,1,\dots ,k-1\}\). We denote by \({k}\text {-}{\mathcal {SIIRV}}_{n}\) the class of all such k-SIIRVs.

2.2 Tools from Previous Work

We first restate a result of Batu et al. relating closeness to uniformity in \(\ell _{2}\) and \(\ell _{1}\) norms to “overall flatness” of the probability mass function, and which will be one of the ingredients of the proof of Theorem 1.1:

Lemma 2.9 ([10, 11])

Let D be a distribution on a domain S. (a) If \(\max _{i\in S} {D}(i) \leq (1+{\varepsilon })\min _{i\in S} {D}(i)\), then \(\|{D}\|_{2}^{2} \leq (1+\varepsilon ^{2})/\left \lvert {S} \right \rvert \). (b) If \(\|{D}\|_{2}^{2} \leq (1+\varepsilon ^{2})/\left \lvert {S} \right \rvert \), then \(\|{D}-{\mathcal {U}}_{S}\|_{1} \leq \varepsilon \).
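A quick numerical sanity check of Lemma 2.9 on a small example may help fix the quantities involved. The sketch below reads the two norms in the lemma as the squared \(\ell _{2}\) norm of D and the \(\ell _{1}\) distance to the uniform distribution on S, as the preceding paragraph indicates; the helper names are hypothetical and the example pmf is arbitrary.

```python
from typing import Sequence


def l2_norm_squared(pmf: Sequence[float]) -> float:
    """Squared l2 norm of the pmf, i.e. the sum of D(i)^2."""
    return sum(p * p for p in pmf)


def l1_to_uniform(pmf: Sequence[float]) -> float:
    """l1 distance between the pmf and the uniform distribution on its domain."""
    size = len(pmf)
    return sum(abs(p - 1.0 / size) for p in pmf)


if __name__ == "__main__":
    pmf = [0.3, 0.25, 0.25, 0.2]
    eps = 0.5  # max(pmf) = 0.3 <= (1 + eps) * min(pmf) = 0.3
    bound = (1 + eps ** 2) / len(pmf)
    print(l2_norm_squared(pmf) <= bound)  # premise of (b), conclusion of (a)
    print(l1_to_uniform(pmf) <= eps)      # conclusion of (b)
```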

To check condition (b) above we shall rely on the following, which one can derive from the techniques in [28] and whose proof we defer to Appendix A:

Theorem 2.10 (Adapted from [28, Theorem 11])

There exists an algorithm Check-Small-\(\ell _{2}\) which, given parameters ε, δ ∈ (0, 1) and \(c\cdot {\sqrt {\left \lvert {I} \right \rvert }}/{\varepsilon ^{2}} \log (1/\delta )\) independent samples from a distribution D over I (for some absolute constant c > 0), outputs either yes or no, and satisfies the following.

  • If \({\lVert {{D}-{\mathcal {U}}_{I}}}{\rVert }_2 > {{\varepsilon }}/{\sqrt {\left \lvert I \right \rvert }}\), then the algorithm outputs no with probability at least 1 − δ;

  • If \({\lVert {{D}-{\mathcal {U}}_{I}}{\rVert }}_2 \leq {{\varepsilon }}/{2\sqrt {\left \lvert I \right \rvert }}\), then the algorithm outputs yes with probability at least 1 − δ.

Finally, we will also rely on a classical result from probability, the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality, restated below:

Theorem 2.11 ([32, 40])

Let D be a distribution over [n]. Given m independent samples \(x_{1},{\dots } ,x_{m}\) from D, define the empirical distribution \(\hat {{D}}\) as follows:

$$\hat{{D}}(i)\overset{\text{def}}{=} \frac{\left\lvert \left\{\; j\in[m] \;\colon\; x_{j}=i\; \right\} \right\rvert }{m}, \quad i\in[n]. $$

Then, for all ε > 0, \(\Pr \!\left [\, {\lVert {D} - \hat {{D}}{\rVert }_{\text {Kol}}} > {\varepsilon } \, \right ] \leq 2e^{-2m{\varepsilon }^{2}}\), where \({\lVert {{\cdot } - {\cdot }}{\rVert }_{\text {Kol}}}\) denotes the Kolmogorov distance (i.e., the \(\ell _{\infty }\) distance between cumulative distribution functions).

In particular, this implies that \({O\left (1/{\varepsilon }^{2} \right )}\) samples suffice to learn a distribution up to ε in Kolmogorov distance.
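To see where the \(O(1/\varepsilon ^{2})\) bound comes from, one can solve \(2e^{-2m\varepsilon ^{2}} \leq \delta \) for m, which gives \(m \geq \ln (2/\delta )/(2\varepsilon ^{2})\). The sketch below computes this sample size and the Kolmogorov distance between two explicit pmfs; the failure probability `delta` and both function names are assumptions introduced for illustration.

```python
import math
from typing import Sequence


def dkw_sample_size(eps: float, delta: float) -> int:
    """Smallest integer m with 2 * exp(-2 * m * eps^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def kolmogorov_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Largest absolute gap between the two cumulative distribution functions."""
    cdf_p = cdf_q = gap = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        gap = max(gap, abs(cdf_p - cdf_q))
    return gap


if __name__ == "__main__":
    m = dkw_sample_size(eps=0.1, delta=1 / 3)
    print(m, 2 * math.exp(-2 * m * 0.1 ** 2) <= 1 / 3)
    print(kolmogorov_distance([0.5, 0.5], [0.25, 0.75]))  # 0.25
```

Note that the resulting m does not depend on the domain size n, in contrast to learning in \(\ell _{1}\) distance.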

3 The General Algorithm

In this section, we obtain our main result, restated below:

Theorem 1.1 (Main Theorem)

There exists an algorithm TestSplittable which, given sampling access to an unknown distribution D over [n] and parameter ε ∈ (0, 1], can distinguish with probability 2/3 between (a) \(D\in \mathcal {P}\) versus (b) \(\ell _{1}({D}, \mathcal {P}) > \varepsilon \), for any property \(\mathcal {P}\) that satisfies the above natural structural criterion (Succinctness). Moreover, for many such properties this algorithm is computationally efficient, and its sample complexity is optimal (up to logarithmic factors and the exact dependence on ε).

Intuition

Before diving into the proof of this theorem, we first provide a high-level description of the argument. The algorithm proceeds in 3 stages: the first, the decomposition step, attempts to recursively construct a partition of the domain in a small number of intervals, with a very strong guarantee. If the decomposition succeeds, then the unknown distribution D will be close (in \(\ell _{1}\) distance) to its “flattening” on the partition; while if it fails (too many intervals have to be created), this serves as evidence that D does not belong to the class and we can reject. The second stage, the approximation step, then learns this flattening of the distribution, which can be done with few samples since by construction we do not have many intervals. The last stage, the projection step, is purely computational: we verify that the flattening we have learned is indeed close to the class \(\mathcal {C}\). If all three stages succeed, then by the triangle inequality it must be the case that D is close to \(\mathcal {C}\); and by the structural assumption on the class, if \(D\in \mathcal {C}\) then it will admit succinct enough partitions, and all three stages will go through.
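The notion of “flattening” used above can be sketched as follows for an explicitly known pmf: on each interval of the partition, the pmf is replaced by its average over that interval, so that the mass of every interval is preserved. This is only an illustration of the object being learned in the approximation step, not the algorithm itself (which only has sample access to D). The names `flatten` and `l1_distance`, and the representation of a partition as 0-indexed inclusive `(start, end)` pairs, are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # 0-indexed, inclusive endpoints


def flatten(pmf: Sequence[float], partition: Sequence[Interval]) -> List[float]:
    """Replace the pmf on each interval by the average mass over that interval."""
    flat = [0.0] * len(pmf)
    for start, end in partition:
        average = sum(pmf[start:end + 1]) / (end - start + 1)
        for i in range(start, end + 1):
            flat[i] = average
    return flat


def l1_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of absolute pointwise differences between two pmfs."""
    return sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    pmf = [0.5, 0.25, 0.125, 0.125]
    coarse = flatten(pmf, [(0, 1), (2, 3)])
    print(coarse)                    # [0.375, 0.375, 0.125, 0.125]
    print(l1_distance(pmf, coarse))  # 0.25
```

A finer partition, here `[(0, 0), (1, 1), (2, 3)]`, leaves this particular pmf unchanged, which illustrates why a good partition makes the flattening close to D.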

Turning to the proof, we start by defining formally the “structural criterion” we shall rely on, before describing the algorithm at the heart of our result in Section 3.1. (We note that a modification of this algorithm will be described in Section 5, and will allow us to derive Corollary 1.5.)

Definition 3.1 (Decompositions)

Let γ, ζ > 0 and L = L(γ, ζ, n) ≥ 1. A class of distributions \({\mathcal {C}}\) on [n] is said to be (γ, ζ, L)-decomposable if for every \(D\in {\mathcal {C}}\) there exists \(\ell \leq L\) and a partition \(\mathcal {I}(\gamma ,\zeta ,{D})=(I_{1},\dots ,I_{\ell })\) of the interval [1,n] such that, for all \(j\in [\ell ]\), one of the following holds:

  (i) \(D(I_{j}) \leq \frac {\zeta }{L}\); or

  (ii) \(\displaystyle \max _{i \in I_{j}} {D}(i)\leq (1+\gamma )\cdot \min _{i \in I_{j}} {D}(i)\).

Further, if \(\mathcal {I}(\gamma ,\zeta ,{D})\) is dyadic (i.e., each \(I_{k}\) is of the form \([j\cdot 2^{i}+1,(j+1)\cdot 2^{i}]\) for some integers i, j, corresponding to the leaves of a recursive bisection of [n]), then \(\mathcal {C}\) is said to be (γ, ζ, L)-splittable.
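The two conditions of Definition 3.1 are easy to verify for a given pmf and a given candidate partition. The sketch below does exactly that: it checks that the partition has at most L intervals and that each interval is either light (condition (i)) or nearly flat (condition (ii)). It does not search for a partition, and the definition itself only asks that one exist. The name `satisfies_decomposition` and the 0-indexed `(start, end)` interval representation are assumptions made for illustration.

```python
from typing import Sequence, Tuple

Interval = Tuple[int, int]  # 0-indexed, inclusive endpoints


def satisfies_decomposition(
    pmf: Sequence[float],
    partition: Sequence[Interval],
    gamma: float,
    zeta: float,
    L: int,
) -> bool:
    """Check the conditions of a (gamma, zeta, L)-decomposition on one partition."""
    if len(partition) > L:
        return False
    for start, end in partition:
        block = pmf[start:end + 1]
        is_light = sum(block) <= zeta / L                    # condition (i)
        is_flat = max(block) <= (1 + gamma) * min(block)     # condition (ii)
        if not (is_light or is_flat):
            return False
    return True


if __name__ == "__main__":
    pmf = [0.5, 0.25, 0.125, 0.125]
    fine = [(0, 0), (1, 1), (2, 3)]
    print(satisfies_decomposition(pmf, fine, gamma=0.1, zeta=0.1, L=3))      # True
    print(satisfies_decomposition(pmf, [(0, 3)], gamma=0.1, zeta=0.1, L=3))  # False
```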

Lemma 3.2

If \({\mathcal {C}}\) is (γ, ζ, L)-decomposable, then it is \((\gamma , \zeta , L^{\prime })\) -splittable for \(L^{\prime }(\gamma ,\zeta ,n) = O(\log n)\cdot L(\gamma ,\frac {\zeta }{2(\log n+1)},n)\) .

Proof

We will begin by proving a claim that for every partition \(\mathcal {I}=\{I_{1},I_{2},...,I_{L}\}\) of the interval [1,n] into L intervals, there exists a refinement of that partition which consists of at most \(L\cdot O(\log n)\) dyadic intervals. So, it suffices to prove that every interval \([a,b]\subseteq [1,n]\) can be partitioned in at most \({O\left ({\log n} \right )}\) dyadic intervals. Indeed, let \(\ell \) be the largest integer such that \(2^{\ell }\leq \frac {b-a}{2}\) and let m be the smallest integer such that \(m\cdot 2^{\ell }\geq a\). It follows that \(m\cdot 2^{\ell }\leq a+\frac {b-a}{2}=\frac {a+b}{2}\) and \((m+1)\cdot 2^{\ell }\leq b\). So, the interval \(I=[m\cdot 2^{\ell }+1,(m+1)\cdot 2^{\ell }]\) is fully contained in [a, b] and has size at least \(\frac {b-a}{4}\).

We will also use the fact that, for every \(\ell ^{\prime }\leq \ell \),

$$ m\cdot 2^{\ell}=m\cdot 2^{\ell-\ell^{\prime}}\cdot 2^{\ell^{\prime}}=m^{\prime} \cdot 2^{\ell^{\prime}} $$
(1)

Now consider the following procedure: Starting from the right (resp. left) side of the interval I, we add the largest dyadic interval which is adjacent to it and fully contained in [a, b], and recurse until we cover the whole interval \([(m+1)\cdot 2^{\ell }+1,b]\) (resp. \([a,m\cdot 2^{\ell }]\)). Clearly, at the end of this procedure, the whole interval [a, b] is covered by dyadic intervals. It remains to show that the procedure takes \({O\left ({\log n} \right )}\) steps. Indeed, using (1), we can see that at least half of the remaining left or right interval is covered in each step (except maybe for the first 2 steps where it is at least a quarter). Thus, the procedure will take at most \(2\log n +2={O\left ({\log n} \right )}\) steps in total. From the above, we can see that each of the L intervals of the partition \(\mathcal {I}\) can be covered with \({O\left ({\log n} \right )}\) dyadic intervals, which completes the proof of the claim.

In order to complete the proof of the lemma, notice that the two conditions in Definition 3.1 are closed under taking subsets: the second is immediately verified, while for the first we have, for any of the “new” intervals I, that \(D(I) \leq \frac {\zeta }{L} \leq \frac {\zeta \cdot (2\log n+2)}{L^{\prime }}\). □
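The claim at the heart of this proof, that any interval \([a,b]\subseteq [1,n]\) splits into \(O(\log n)\) dyadic intervals, can be illustrated with a short sketch. Note that the code uses a standard left-to-right greedy variant (always take the largest aligned dyadic block that fits) rather than the middle-out procedure described in the proof; both yield logarithmically many pieces. The names `dyadic_cover` and `is_dyadic` are hypothetical, and intervals are 1-indexed and inclusive as in the text.

```python
from typing import List, Tuple


def is_dyadic(lo: int, hi: int) -> bool:
    """True if [lo, hi] has the form [j * 2^i + 1, (j + 1) * 2^i]."""
    size = hi - lo + 1
    return size > 0 and size & (size - 1) == 0 and (lo - 1) % size == 0


def dyadic_cover(a: int, b: int) -> List[Tuple[int, int]]:
    """Partition [a, b] (with 1 <= a <= b) into consecutive dyadic intervals."""
    pieces = []
    start = a - 1  # the next piece will be [start + 1, start + size]
    while start < b:
        size = 1
        # Double the block while it stays aligned and inside [a, b].
        while start % (2 * size) == 0 and start + 2 * size <= b:
            size *= 2
        pieces.append((start + 1, start + size))
        start += size
    return pieces


if __name__ == "__main__":
    print(dyadic_cover(3, 10))  # [(3, 4), (5, 8), (9, 10)]
    print(dyadic_cover(1, 8))   # [(1, 8)]
```

Applying such a cover to each of the L intervals of a decomposition gives a dyadic refinement with \(L\cdot O(\log n)\) intervals, which is the count used in Lemma 3.2.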

3.1 The Algorithm

Theorem 1.1, and with it Corollary 1.2, Corollary 1.3, and Corollary 1.4 will follow from the theorem below, combined with the structural theorems from Section 4:

Theorem 3.3

Let \({\mathcal {C}}\) be a class of distributions over [n]for which the following holds.

  1. 1.

    \({\mathcal {C}}\) is (γ, ζ, L(γ, ζ, n))-splittable;

  2. 2.

    there exists a procedure \(\textsc {ProjectionDist}_{\mathcal {C}}\) which, given as input a parameter α ∈ (0, 1) and the explicit description of a distribution D over [n], returns yes if the distance \(\ell _{1}({D},\mathcal {C})\) to \(\mathcal {C}\) is at most α/10, and no if \(\ell _{1}({D},\mathcal {C}) \geq 9\alpha /10\) (and either yes or no otherwise).

Then the algorithm TestSplittable (Algorithm 1) is a \({O\left (\max \left ({\sqrt {nL}}\log n/{{\varepsilon }^{3}}, L/{{\varepsilon }^{2}}\right ) \right )}\)-sample tester for \({\mathcal {C}}\), for L = L(O(ε),O(ε),n). (Moreover, if \(\textsc {ProjectionDist}_{\mathcal {C}}\) is computationally efficient, then so is TestSplittable.)

[Algorithm 1: TestSplittable]

3.2 Proof of Theorem 3.3

We now give the proof of our main result (Theorem 3.3), first analyzing the sample complexity of Algorithm 1 before arguing its correctness. For the latter, we will need the following simple fact from [37], restated below:

Fact 3.4 ([37, Fact 1])

Let D be a distribution over [n], and δ ∈ (0, 1]. Given \(m\geq C\cdot \frac {\log \frac {n}{\delta }}{\eta }\) independent samples from D (for some absolute constant C > 0), with probability at least 1 − δ we have that, for every interval \(I\subseteq [n]\):

  1. (i)

    if \(D(I) \geq \frac {\eta }{4}\), then \(\frac {{D}(I)}{2} \leq \frac {m_{I}}{m} \leq \frac {3{D}(I)}{2}\) ;

  2. (ii)

    if \(\frac {m_{I}}{m} \geq \frac {\eta }{2}\), then \(D(I) > \frac {\eta }{4}\) ;

  3. (iii)

    if \(\frac {m_{I}}{m} < \frac {\eta }{2}\), then D(I) < η;

where \(m_{I} \overset {\text {def}}{=} \left \lvert \left \{\; j\in [m] \;\colon \; x_{j} \in I \; \right \} \right \rvert \) is the number of the samples falling into I.

3.3 Sample Complexity

The sample complexity is immediate, and comes from Steps 4 and 20. The total number of samples is

$$m+{O\left( \frac{\ell}{{\varepsilon}^{2}} \right)} = {O\left( \frac{\sqrt{\left\lvert I \right\rvert\cdot L}}{{\varepsilon}^{3}} \log \left\lvert I \right\rvert + \frac{L}{{\varepsilon}}\log \left\lvert I \right\rvert + \frac{L}{{\varepsilon}^{2}} \right)} = {O\left( \frac{\sqrt{\left\lvert I \right\rvert\cdot L}}{{\varepsilon}^{3}}\log \left\lvert I \right\rvert + \frac{L}{{\varepsilon}^{2}} \right)}\;. $$
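One way to see why the middle term can be dropped in the last equality, under the assumptions (not spelled out above) that \(L\leq \left \lvert I \right \rvert \) and \(\varepsilon \leq 1\), is the following chain of inequalities:

$$ \frac{L}{\varepsilon}\log \left\lvert I \right\rvert = \frac{\sqrt{L\cdot L}}{\varepsilon}\log \left\lvert I \right\rvert \leq \frac{\sqrt{\left\lvert I \right\rvert\cdot L}}{\varepsilon}\log \left\lvert I \right\rvert \leq \frac{\sqrt{\left\lvert I \right\rvert\cdot L}}{\varepsilon^{3}}\log \left\lvert I \right\rvert\;. $$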

3.4 Correctness

Say an interval I considered during the execution of the “Decomposition” step is heavy if \(m_{I}\) is big enough on Step 7, and light otherwise; and let \(\mathcal {H}\) and \(\mathcal {L}\) denote the sets of heavy and light intervals respectively. By choice of m, we can assume that with probability at least 9/10 the guarantees of Fact 3.4 hold simultaneously for all intervals considered. We hereafter condition on this event.

We first argue that if the algorithm does not reject in Step 13, then with probability at least 9/10 we have \(\|{D}-{\Phi }({D},\mathcal {I})\|_{1} \leq \varepsilon /20\) (where \({\Phi }({D},\mathcal {I})\) denotes the flattening of D over the partition \(\mathcal {I}\)). Indeed, we can write

$$\begin{array}{@{}rcl@{}} {\lVert{{D}-{\Phi}({D},\mathcal{I})}{\rVert}}_1 &=& \sum\limits_{k\colon I_{k} \in \mathcal{L} } {D}(I_{k})\cdot {\lVert{{D}_{I_{k}} - {\mathcal{U}}_{I_{k}}}{\rVert}}_1 + \sum\limits_{k\colon I_{k} \in \mathcal{H}} {D}(I_{k})\cdot {\lVert{{D}_{I_{k}} - {\mathcal{U}}_{I_{k}}}{\rVert}}_1 \\ &\leq& 2\sum\limits_{k\colon I_{k} \in \mathcal{L} } {D}(I_{k}) + \sum\limits_{k\colon I_{k} \in \mathcal{H} } {D}(I_{k})\cdot {\lVert{{D}_{I_{k}} - {\mathcal{U}}_{I_{k}}}{\rVert}}_1\;. \end{array} $$

Let us bound the two terms separately.

  • If \(I^{\prime } \in \mathcal {H}\), then by our choice of threshold we can apply Lemma 2.10 with \(\delta =\frac {1}{10L}\); conditioning on all of the (at most L) events happening, which overall fails with probability at most 1/10 by a union bound, we get

    $${\lVert{{D}_{I^{\prime}}}{\rVert}}_2^{2} = {\lVert{{D}_{I^{\prime}}-{\mathcal{U}}_{I^{\prime}}}{\rVert}}_2^{2} + \frac{1}{\left\lvert I^{\prime} \right\rvert} \leq \left( 1+\frac{{\varepsilon}^{2}}{1600} \right) \frac{1}{\left\lvert I^{\prime} \right\rvert} $$

    as Check-Small-\(\ell _{2}\) returned yes; and by Lemma 2.9 this implies \({\lVert {{D}_{I^{\prime }} - {\mathcal {U}}_{I^{\prime }}}{\rVert }}_1\leq {\varepsilon }/40\).

  • If \(I^{\prime } \in \mathcal {L}\), then we claim that \(D(I^{\prime }) \leq \max (\kappa , 2c\cdot \frac {\sqrt {\left \lvert I^{\prime } \right \rvert }}{m{\varepsilon }^{2}}\log \frac {1}{\delta } )\). Clearly, this is true if \(D(I^{\prime }) \leq \kappa \), so it only remains to show that \(D(I^{\prime }) \leq 2c\cdot \frac {\sqrt {\left \lvert I^{\prime } \right \rvert }}{m{\varepsilon }^{2}}\log \frac {1}{\delta }\). But this follows from Fact 3.4 (i), as if we had \(D(I^{\prime }) > 2c\cdot \frac {\sqrt {\left \lvert I^{\prime } \right \rvert }}{m{\varepsilon }^{2}}\log \frac {1}{\delta }\) then \(m_{I^{\prime }}\) would have been big enough, and \(I^{\prime }\notin \mathcal {L}\). Overall,

    $$\sum\limits_{I^{\prime} \in \mathcal{L} } {D}(I^{\prime}) \leq \sum\limits_{I^{\prime} \in \mathcal{L} } \left( \kappa + 2c\cdot\frac{\sqrt{\left\lvert {I^{\prime}} \right\rvert}}{m\varepsilon^{2}}\log\frac{1}{\delta} \right) \leq L\kappa + 2\sum\limits_{I^{\prime} \in \mathcal{L} } c\cdot\frac{\sqrt{\left\lvert I^{\prime} \right\rvert}}{m{\varepsilon}^{2}}\log\frac{1}{\delta} \leq \frac{{\varepsilon}}{160}\left( 1+ \sum\limits_{I^{\prime} \in \mathcal{L} } \sqrt{\frac{\left\lvert I^{\prime} \right\rvert}{\left\lvert I \right\rvert L}}\right) \leq \frac{{\varepsilon}}{80} $$

    for a sufficiently big choice of constant C > 0 in the definition of m; where we first used that \(\left \lvert {\mathcal {L}} \right \rvert \leq L\), and then that \({\sum }_{I^{\prime } \in \mathcal {L} } \sqrt {\frac {\left \lvert {I^{\prime }} \right \rvert }{\left \lvert {I} \right \rvert }}\leq \sqrt {L}\) by Jensen’s inequality.

Putting it together, this yields

$$\begin{array}{@{}rcl@{}} {\lVert{{D}-{\Phi}({D},\mathcal{I})}{\rVert}}_1 \leq 2\cdot \frac{{\varepsilon}}{80} + \frac{{\varepsilon}}{40} \sum\limits_{ I^{\prime} \in \mathcal{H} } {D}(I^{\prime}) \leq {\varepsilon}/40+{\varepsilon}/40 = {\varepsilon}/20. \end{array} $$
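The decomposition of \(\|{D}-{\Phi }({D},\mathcal {I})\|_{1}\) into per-interval contributions used at the start of this argument can be checked numerically. The sketch below is only an illustration: the helper names (`flatten`, `l1`, `l1_via_conditionals`) and the representation of a partition as half-open index ranges are assumptions of this example, not part of Algorithm 1.

```python
def flatten(D, partition):
    """Flattening Phi(D, I): spread the mass D(I_k) uniformly over each I_k.
    D is a list of probabilities; partition is a list of half-open (start, end)."""
    out = [0.0] * len(D)
    for s, e in partition:
        mass = sum(D[s:e])
        for i in range(s, e):
            out[i] = mass / (e - s)
    return out


def l1(p, q):
    """l1 distance between two vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(p, q))


def l1_via_conditionals(D, partition):
    """Sum over k of D(I_k) * || D_{I_k} - U_{I_k} ||_1, where D_{I_k} is the
    conditional distribution on I_k and U_{I_k} is uniform on I_k."""
    total = 0.0
    for s, e in partition:
        mass = sum(D[s:e])
        if mass == 0:
            continue  # an empty-mass interval contributes nothing
        cond = [x / mass for x in D[s:e]]
        unif = [1.0 / (e - s)] * (e - s)
        total += mass * l1(cond, unif)
    return total


if __name__ == "__main__":
    D = [0.4, 0.1, 0.2, 0.3]
    part = [(0, 2), (2, 4)]
    print(l1(D, flatten(D, part)), l1_via_conditionals(D, part))
```

Both quantities agree (up to floating-point error), which is exactly the first equality in the display bounding \(\|{D}-{\Phi }({D},\mathcal {I})\|_{1}\).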
Soundness.

By contrapositive, we argue that if the test returns accept, then (with probability at least 2/3) D is ε-close to \(\mathcal {C}\). Indeed, conditioning on \(\tilde {{D}}\) being ε/20-close to \({\Phi }({D},\mathcal {I})\), we get by the triangle inequality that

$$\begin{array}{@{}rcl@{}} {\lVert{{D}-{\mathcal{C}}}{\rVert}}_1 &\leq& {\lVert{{D}-{\Phi}({D},\mathcal{I})}{\rVert}}_1 + {\lVert{\Phi({D},\mathcal{I}) - \tilde{{D}}}{\rVert}}_1 + {\operatorname{dist}\!\left( {\tilde{{D}}, \mathcal{C}}\right)} \\ &\leq& \frac{{\varepsilon}}{20} + \frac{{\varepsilon}}{20}+\frac{9{\varepsilon}}{10} = {\varepsilon}. \end{array} $$

Overall, this happens except with probability at most 1/10 + 1/10 + 1/10 < 1/3.

Completeness.

Assume \(D\in {\mathcal {C}}\). Then the choice of γ and L ensures the existence of a good dyadic partition \(\mathcal {I}(\gamma ,\gamma ,{D})\) in the sense of Definition 3.1. For any I in this partition for which (i) holds (\(D(I) \leq \frac {\gamma }{L} < \frac {\kappa }{2}\)), I will have \(\frac {m_{I}}{m} < \kappa \) and be kept as a “light leaf” (by the contrapositive of Fact 3.4 (ii)). For the other ones, (ii) holds: let I be one of these (at most L) intervals.

  • If \(m_{I}\) is too small on Step 7, then I is kept as “light leaf.”

  • Otherwise, by our choice of constants we can use Lemma 2.9 and apply Lemma 2.10 with \(\delta =\frac {1}{10L}\); conditioning on all of the (at most L) events happening, which overall fails with probability at most 1/10 by a union bound, Check-Small-\(\ell _{2}\) will output yes, as

    $${\lVert{{D}_{I}-{\mathcal{U}}_{I}}{\rVert}}_2^{2} = {\lVert{{D}_{I}}{\rVert}}_2^{2} - \frac{1}{\left\lvert I \right\rvert} \leq \left( 1+\frac{{\varepsilon}^{2}}{6400} \right) \frac{1}{\left\lvert I \right\rvert} - \frac{1}{\left\lvert I \right\rvert} = \frac{{\varepsilon}^{2}}{6400\left\lvert I \right\rvert} $$

    and I is kept as “flat leaf.”

Therefore, as \(\mathcal {I}(\gamma ,\gamma ,{D})\) is dyadic the Decomposition stage is guaranteed to stop within at most L splits (in the worst case, it goes on until \(\mathcal {I}(\gamma ,\gamma ,{D})\) is considered, at which point it succeeds; see Footnote 6). Thus Step 13 passes, and the algorithm reaches the Approximation stage. By the foregoing discussion, this implies \({\Phi }({D},\mathcal {I})\) is ε/20-close to D (and hence to \({\mathcal {C}}\)); \(\tilde {{D}}\) is then (except with probability at most 1/10) \((\frac {{\varepsilon }}{20}+\frac {{\varepsilon }}{20}=\frac {{\varepsilon }}{10})\)-close to \({\mathcal {C}}\), and the algorithm returns accept.

4 Structural Theorems

In this section, we show that a wide range of natural distribution families are succinctly decomposable, and provide efficient projection algorithms for each class.

4.1 Existence of Structural Decompositions

Theorem 4.1 (Monotonicity)

For all γ, ζ > 0, the class \({\mathcal {M}}\) of monotone distributions on [n] is (γ, ζ, L)-splittable for \(L \overset {\text {def}}{=} {O\left ({\frac {\log ^{2} \frac {n}{\zeta }}{\gamma }} \right )}\).

Note that this proof can already be found in [12, Theorem 10], interwoven with the analysis of their algorithm. For the sake of being self-contained, we reproduce the structural part of their argument, removing its algorithmic aspects:

Proof of Theorem 4.1

We define the partition \(\mathcal {I}\) recursively as follows: \(\mathcal {I}^{(0)}=([1,n])\), and for j ≥ 0 the partition \(\mathcal {I}^{(j+1)}\) is obtained from \(\mathcal {I}^{(j)}=(I_{1}^{(j)},\dots ,I_{\ell _{j}}^{(j)})\) by going over the \(I^{(j)}_{i}=[a^{(j)}_{i}, b^{(j)}_{i}]\) in order, and:

  1. (a)

    if \(D(I^{(j)}_{i})\leq \frac {\zeta }{L}\), then \(I^{(j)}_{i}\) is added as element of \(\mathcal {I}^{(j+1)}\) (“marked as leaf”);

  2. (b)

    else, if \(D(a^{(j)}_{i}) \leq (1+\gamma ){D}(b^{(j)}_{i})\), then \(I^{(j)}_{i}\) is added as element of \(\mathcal {I}^{(j+1)}\) (“marked as leaf”);

  3. (c)

    otherwise, bisect \(I^{(j)}\) into \(I^{(j)}_{\mathrm {L}}\), \(I^{(j)}_{\mathrm {R}}\) (with \(\left \lvert {I^{(j)}_{\mathrm {L}}} \right \rvert =\left \lceil {\left \lvert {I^{(j)}} \right \rvert /2} \right \rceil \)) and add both \(I^{(j)}_{\mathrm {L}}\) and \(I^{(j)}_{\mathrm {R}}\) as elements of \(\mathcal {I}^{(j+1)}\).

and repeat until convergence (that is, until the last item is not applied for any of the intervals). Clearly, this process is well-defined, and will eventually terminate (as \((\ell _{j})_{j}\) is a non-decreasing sequence of natural numbers, upper bounded by n). Let \(\mathcal {I}=(I_{1},\dots ,I_{\ell })\) (with \(I_{i}=[a_{i},a_{i+1})\)) be its outcome, so that the \(I_{i}\)’s are consecutive intervals all satisfying either (a) or (b). As (b) clearly implies (ii), we only need to show that \(\ell \leq L\); for this purpose, we shall leverage as in [12] the fact that D is monotone to bound the number of recursion steps.

The recursion above defines a complete binary tree (with the leaves being the intervals satisfying (a) or (b), and the internal nodes the other ones). Let t be the number of recursion steps the process goes through before converging to \(\mathcal {I}\) (height of the tree); as mentioned above, we have \(t\leq \log n\) (as we start with an interval of size n, and the length is halved at each step). Observe further that if at any point an interval \(I^{(j)}_{i}=[a^{(j)}_{i}, b^{(j)}_{i}]\) has \(D(a^{(j)}_{i}) \leq \frac {\zeta }{nL}\), then it immediately (as well as all the \(I^{(j)}_{k}\)’s for \(k\geq i\) by monotonicity) satisfies (a) and is no longer split (“becomes a leaf”). So at any \(j\leq t\), the number of intervals \(i_{j}\) for which neither (a) nor (b) holds must satisfy

$$1 \geq {D}(a^{(j)}_{1}) > (1+\gamma){D}(a^{(j)}_{2}) > (1+\gamma)^{2}{D}(a^{(j)}_{3}) > {\dots} >(1+\gamma)^{i_{j}-1}{D}(a^{(j)}_{i_{j}}) \geq (1+\gamma)^{i_{j}-1}\frac{\zeta}{nL} $$

where \(a^{(j)}_{k}\) denotes the beginning of the k-th interval (again we use monotonicity to argue that the extrema were reached at the ends of each interval), so that \(i_{j} \leq 1+\frac {\log \frac {nL}{\zeta }}{\log (1+\gamma )}\). In particular, the total number of internal nodes is then

$$\sum\limits_{j=1}^{t} i_{j} \leq t\cdot\left( 1+\frac{\log\frac{nL}{\zeta}}{\log(1+\gamma)}\right) \leq \frac{2\log^{2} \frac{n}{\zeta}}{\log(1+\gamma)} \leq L\;.$$

This implies the same bound on the number of leaves. □
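The recursive construction in this proof can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: D is given explicitly as a non-increasing list of probabilities, intervals are half-open index ranges, and the name `decompose_monotone` as well as the parameter values in the usage example are hypothetical choices, with L passed in as a number rather than derived from the bound of Theorem 4.1.

```python
def decompose_monotone(D, gamma, zeta, L):
    """Recursive bisection from the proof of Theorem 4.1 (sketch).
    D: non-increasing list of probabilities (assumption of this example).
    Returns the leaves as half-open (start, end) ranges, in left-to-right order."""
    leaves = []
    stack = [(0, len(D))]
    while stack:
        s, e = stack.pop()
        light = sum(D[s:e]) <= zeta / L              # rule (a): small total mass
        flat = D[s] <= (1 + gamma) * D[e - 1]        # rule (b): endpoints within 1 + gamma
        if light or flat:
            leaves.append((s, e))
        else:
            mid = s + (e - s + 1) // 2               # rule (c): left half has ceil(|I| / 2) points
            stack.append((mid, e))                   # right half is processed after
            stack.append((s, mid))                   # the left half
    return leaves


if __name__ == "__main__":
    n = 256
    w = [1.0 / (i + 1) for i in range(n)]
    total = sum(w)
    D = [x / total for x in w]
    print(len(decompose_monotone(D, gamma=0.5, zeta=0.1, L=50)))
```

A singleton interval always satisfies rule (b), so the loop terminates; every returned leaf satisfies (a) or (b) by construction.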

Corollary 4.2 (Unimodality)

For all γ, ζ > 0, the class \(\mathcal {M}_{1}\) of unimodal distributions on [n] is (γ, ζ, L)-decomposable for \(L \overset {\text {def}}{=} {O\left ({\frac {\log ^{2} \frac {n}{\zeta }}{\gamma }} \right )}\).

Proof

For any \(D\in \mathcal {M}_{1}\), [n] can be partitioned in two intervals I, J such that \(D_{I}\), \(D_{J}\) are either monotone non-increasing or non-decreasing. Applying Theorem 4.1 to \(D_{I}\) and \(D_{J}\) and taking the union of both partitions yields a (no longer necessarily dyadic) partition of [n]. □

The same argument yields an analogous statement for t-modal distributions:

Corollary 4.3 (t-modality)

For any t ≥ 1 and all γ, ζ > 0, the class \({\mathcal {M}_{t}}\) of t-modal distributions on [n] is (γ, ζ, L)-decomposable for \(L \overset {\text {def}}{=} {O\left ({\frac {t\log ^{2} \frac {n}{\zeta }}{\gamma }} \right )}\).

Corollary 4.4 (Log-concavity, concavity and convexity)

For all γ, ζ > 0, the classes \({\mathcal {L}}\), \({\mathcal {K}^-}\) and \({\mathcal {K}^+}\) of log-concave, concave and convex distributions on [n] are (γ, ζ, L)-decomposable for \(L \overset {\text {def}}{=} {O\left ({\frac {\log ^{2} \frac {n}{\zeta }}{\gamma }} \right )}\).

Proof

This is directly implied by Corollary 4.2, recalling that log-concave, concave and convex distributions are unimodal. □

Theorem 4.5 (Monotone Hazard Rate)

For all γ, ζ > 0, the class \({\mathcal {MHR}}\) of MHR distributions on [n] is (γ, ζ, L)-decomposable for \(L \overset {\text {def}}{=} {O\left ({\frac {\log \frac {n}{\zeta }}{\gamma }} \right )}\).

Proof

This follows from adapting the proof of [15], which establishes that every MHR distribution can be approximated in \(\ell _{1}\) distance by a \({O\left ({\log (n/\varepsilon )/\varepsilon } \right )}\)-histogram. For completeness, we reproduce their argument, suitably modified to our purposes, in Appendix B. □

Theorem 4.6 (Piecewise Polynomials)

For all γ, ζ > 0, t, d ≥ 0, the class \({\mathcal {P}_{t,d}}\) of t-piecewise degree-d distributions on [n] is (γ, ζ, L)-decomposable for \(L \overset {\text {def}}{=} O\left ({\frac {t(d+1)}{\gamma }\log ^{2} \frac {n}{\zeta }} \right )\). (Moreover, for the class of t-histograms \({\mathcal {H}_{t}}\) (d = 0) one can take L = t.)

Proof

The last part of the statement is obvious, so we focus on the first claim. Observing that each of the t pieces of a distribution \(D\in {\mathcal {P}_{t,d}}\) can be subdivided in at most d + 1 intervals on which D is monotone (being degree-d polynomial on each such piece), we obtain a partition of [n] into at most t(d + 1) intervals. D being monotone on each of them, we can apply an argument almost identical to that of Theorem 4.1 to argue that each interval can be further split into \(O(\log ^{2} n/\gamma )\) subintervals, yielding a good decomposition with \(O(t(d+1)\log ^{2} ({n}/{\zeta })/\gamma )\) pieces. □

4.2 Projection Step: Computing the Distances

This section contains details of the distance estimation procedures for these classes, required in the last stage of Algorithm 1. (Note that some of these results are phrased in terms of distance approximation, as estimating the distance \(\ell _{1}({D},\mathcal {C})\) to sufficient accuracy in particular yields an algorithm for this stage.)

We focus in this section on achieving the sample complexities stated in Corollary 1.2, Corollary 1.3, and Corollary 1.4—that is, our procedures will not require any additional sample from the distribution. While almost all the distance estimation procedures we give in this section are efficient, running in time polynomial in all the parameters or even with only a polylogarithmic dependence on n, there are two exceptions—namely, the procedures for monotone hazard rate (Lemma 4.9) and log-concave (Lemma 4.10) distributions. We do describe computationally efficient procedures for these two cases as well in Section 4.2.1, at a modest additive cost in the sample complexity (that is, these more efficient procedures will require some additional samples from the distribution).

Lemma 4.7 (Monotonicity [12, Lemma 8])

There exists a procedure \(\textsc {ProjectionDist}_{{\mathcal {M}}}\) that, on input n as well as the full (succinct) specification of an \(\ell \)-histogram D on [n], computes the (exact) distance \(\ell _{1}({D},\mathcal {M})\) in time \(\text {poly}(\ell )\).

A straightforward modification of the algorithm above (e.g., by adapting the underlying linear program to take as input the location \(m\in [\ell ]\) of the mode of the distribution; then trying all possibilities, running the subroutine \(\ell \) times and picking the minimum value) results in a similar claim for unimodal distributions:

Lemma 4.8 (Unimodality)

There exists a procedure \(\textsc {ProjectionDist}_{\mathcal {M}_{1}}\) that, on input n as well as the full (succinct) specification of an \(\ell \)-histogram D on [n], computes the (exact) distance \(\ell _{1}({D},{{\mathcal {M}}_{1}})\) in time \(\text {poly}(\ell )\).

A similar result can easily be obtained for the class of t-modal distributions as well, with a \(\text {poly}(\ell ,t)\)-time algorithm based on a combination of dynamic and linear programming. Analogous statements hold for the classes of concave and convex distributions \({\mathcal {K}^+}, {\mathcal {K}^-}\), also based on linear programming (specifically, on running \({O\left (n^{2} \right )}\) different linear programs, one for each possible support \([a,b]\subseteq [n]\), and taking the minimum over them).

Lemma 4.9 (MHR)

There exists a (non-efficient) procedure \(\textsc {ProjectionDist}_{{\mathcal {MHR}}}\) that, on input n, ε, as well as the full specification of a distribution D on [n], distinguishes between \(\ell _{1}({D},{\mathcal {MHR}}) \leq \varepsilon \) and \(\ell _{1}({D},{\mathcal {MHR}})>2\varepsilon \) in time \(2^{\tilde {O}_{\varepsilon }(n)}\).

Lemma 4.10 (Log-concavity)

There exists a (non-efficient) procedure \(\textsc {ProjectionDist}_{{\mathcal {L}}}\) that, on input n, ε, as well as the full specification of a distribution D on [n], distinguishes between \(\ell _{1}({D},\mathcal {L}) \leq \varepsilon \) and \(\ell _{1}({D},\mathcal {L})>2\varepsilon \) in time \(2^{\tilde {O}_{\varepsilon }(n)}\).

Proof of Lemma 4.9 and Lemma 4.10

We here give a naive algorithm for these two problems, based on an exhaustive search over a (huge) ε-cover \(\mathcal {S}\) of distributions over [n]. Essentially, \(\mathcal {S}\) contains all possible distributions whose probabilities \(p_{1},\dots ,p_{n}\) are of the form \(j\varepsilon /n\), for \(j\in \{0,\dots ,n/\varepsilon \}\) (so that \(\left \lvert {\mathcal {S}} \right \rvert = {O\left ({(n/\varepsilon )^{n}} \right )}\)). It is not hard to see that this indeed defines an ε-cover of the set of all distributions, and moreover that it can be computed in time \(\text {poly}(\left \lvert {\mathcal {S}} \right \rvert )\). To approximate the distance from an explicit distribution D to the class \(\mathcal {C}\) (either \({\mathcal {MHR}}\) or \(\mathcal {L}\)), it is enough to go over every element S of \(\mathcal {S}\), checking (this time, efficiently) if \(\|S-D\|_{1}\leq \varepsilon \) and if there is a distribution \(P\in \mathcal {C}\) close to S (this time, pointwise, that is \(\left \lvert {P(i)-S(i)} \right \rvert \leq \varepsilon /n\) for all i), which also implies \(\|S-P\|_{1}\leq \varepsilon \) and thus \(\|P-D\|_{1}\leq 2\varepsilon \). The test for pointwise closeness can be done by checking feasibility of a linear program with variables corresponding to the logarithm of probabilities, i.e. \(x_{i} \equiv \ln P(i)\). Indeed, this formulation allows us to rephrase the log-concave and MHR constraints as linear constraints, and pointwise approximation is simply enforcing that \(\ln (S(i)-\varepsilon /n) \leq x_{i} \leq \ln (S(i)+\varepsilon /n)\) for all i. At the end of this enumeration, the procedure accepts if and only if for some S both \(\|S-D\|_{1}\leq \varepsilon \) and the corresponding linear program was feasible. □
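To make the “logarithmic formulation” concrete in the log-concave case, here is a sketch of the feasibility program for a fixed element S of the cover. It assumes the standard characterization of a discrete log-concave distribution with full support, namely \(P(i)^{2}\geq P(i-1)P(i+1)\) for all interior points i, and that \(S(i)>\varepsilon /n\) so that all logarithms are defined; the MHR case is not covered by this sketch.

$$ \text{find } x\in \mathbb{R}^{n} \quad \text{such that}\quad \begin{cases} 2x_{i} \geq x_{i-1}+x_{i+1}, & 1<i<n,\\ \ln \left( S(i)-\frac{\varepsilon}{n}\right) \leq x_{i} \leq \ln \left( S(i)+\frac{\varepsilon}{n}\right), & 1\leq i\leq n. \end{cases} $$

Every constraint is linear in the variables \(x_{i}=\ln P(i)\), which is what makes the feasibility check efficient for each individual S.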

Lemma 4.11 (Piecewise Polynomials)

There exists a procedure \(\textsc {ProjectionDist}_{{\mathcal {P}_{t,d}}}\) that, on input n as well as the full specification of an \(\ell \)-histogram D on [n], computes an approximation Δ of the distance \(\ell _{1}({D},{\mathcal {P}_{t,d}})\) such that \(\ell _{1}({D},{\mathcal {P}_{t,d}}) \leq {\Delta } \leq 3\ell _{1}({D},{\mathcal {P}_{t,d}})+\varepsilon \), and runs in time \({O\left ({n^{3}} \right )}\cdot \text {poly}(\ell ,t,d,\frac {1}{\varepsilon })\).

Moreover, for the special case of t-histograms (d = 0) there exists a procedure \(\textsc {ProjectionDist}_{{\mathcal {H}_{t}}}\), which, given inputs as above, computes an approximation Δ of the distance \(\ell _{1}({D},{\mathcal {H}_{t}})\) such that \(\ell _{1}({D},{\mathcal {H}_{t}}) \leq {\Delta } \leq 4\ell _{1}({D},{\mathcal {H}_{t}})+{\varepsilon }\), and runs in time \(\text {poly}(\ell ,t,\frac {1}{{\varepsilon }})\), independent of n.

Proof

We begin with \(\textsc {ProjectionDist}_{{\mathcal {H}_{t}}}\). Fix any distribution D on [n]. Given any explicit partition of [n] into intervals \(\mathcal {I}=(I_{1},\dots ,I_{t})\), one can easily show that \(\|{{D} - {\Phi }({D},\mathcal {I})\|_{1}} \leq 2{\textsc {opt}}_{\mathcal {I}}\), where \({\textsc {opt}}_{\mathcal {I}}\) is the optimal distance of D to any histogram on \(\mathcal {I}\) (recall that we write \({\Phi }({D},\mathcal {I})\) for the flattening of D over the partition \(\mathcal {I}\)). To get a 2-approximation of \(\ell _{1}({D},{\mathcal {H}_{t}})\), it thus suffices to find the minimum, over all possible partitionings \(\mathcal {I}\) of [n] into t intervals, of the quantity \(\|{{D} - {\Phi }({D},\mathcal {I})\|_{1}}\) (which itself can be computed in time \(T=O(\min (t\ell ,n))\)). By a simple dynamic programming approach, this can be performed in time \({O\left ({t n^{2} \cdot T} \right )}\). The quadratic dependence on n, which follows from allowing the endpoints of the t intervals to be at any point of the domain, is however far from optimal and can be reduced to \((t/\varepsilon )^{2}\), as we show below.

For η > 0, define an η-granular decomposition of a distribution D over [n] to be a partition of [n] into \(s={O\left ({1/\eta } \right )}\) intervals \(J_{1},\dots ,J_{s}\) such that each interval \(J_{i}\) is either a singleton or satisfies \(D(J_{i})\leq \eta \). (Note that if D is a known \(\ell \)-histogram, one can compute an η-granular decomposition of D in time \({O\left ({\ell /\eta } \right )}\) in a greedy fashion.)
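A greedy construction of an η-granular decomposition might look as follows. This is a sketch under simplifying assumptions: D is given as an explicit probability vector (so the scan takes time O(n) rather than the \(O(\ell /\eta )\) achievable on a succinct histogram), intervals are half-open index ranges, and the name `granular_decomposition` is hypothetical.

```python
def granular_decomposition(D, eta):
    """Greedy eta-granular decomposition (sketch): every returned interval
    (start, end), half-open, is either a singleton or has mass at most eta."""
    parts = []
    start, mass = 0, 0.0           # current light interval is [start, i)
    for i, p in enumerate(D):
        if p > eta:
            # heavy point: close the current interval and emit a singleton
            if i > start:
                parts.append((start, i))
            parts.append((i, i + 1))
            start, mass = i + 1, 0.0
        elif mass + p > eta:
            # adding i would exceed eta: close the interval, start a new one at i
            parts.append((start, i))
            start, mass = i, p
        else:
            mass += p
    if start < len(D):
        parts.append((start, len(D)))
    return parts


if __name__ == "__main__":
    D = [0.5, 0.05, 0.05, 0.05, 0.05, 0.3]
    print(granular_decomposition(D, eta=0.1))
```

The number of intervals is O(1/η): there are fewer than 1/η heavy singletons, and within a run of light intervals any two consecutive ones have combined mass greater than η.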

Claim 4.12

Let D be a distribution over [n], and \(\mathcal {J} = (J_{1},\dots ,J_{s})\) be an η-granular decomposition of D (with \(s\geq t\)). Then, there exists a partition of [n] into t intervals \(\mathcal {I}=(I_{1},\dots ,I_{t})\) and a t-histogram H on \(\mathcal {I}\) such that \(\|{{D} - H}\|_{1} \leq 2\ell _{1}({D},{\mathcal {H}_{t}})+2t\eta \), and \(\mathcal {I}\) is a coarsening of \(\mathcal {J}\).

Before proving it, we describe how this will enable us to get the desired time complexity for \(\textsc {ProjectionDist}_{{\mathcal {H}_{t}}}\). Phrased differently, the claim above allows us to run our dynamic program using the \({O\left ({1/\eta } \right )}\) endpoints of the \({O\left ({1/\eta } \right )}\) intervals of \(\mathcal {J}\) instead of the n points of the domain, paying only an additive error \(O(t\eta )\). Setting \(\eta =\frac {\varepsilon }{4t}\), the guarantee for \(\textsc {ProjectionDist}_{{\mathcal {H}_{t}}}\) follows.
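The dynamic program referred to here can be sketched as follows. The code minimizes \(\|{D}-{\Phi }({D},\mathcal {I})\|_{1}\) over partitions into at most t intervals whose endpoints are restricted to a given set of candidate points; passing all of {0, ..., n} recovers the quadratic-in-n version, while passing the endpoints of a granular decomposition gives the faster one. The names `flatten_cost` and `best_flattening_cost`, the `points` parameter, and the explicit-vector representation of D are assumptions of this illustration, and no attempt is made to match the running time stated in the text.

```python
def flatten_cost(D, s, e):
    """l1 cost of replacing D on the half-open range [s, e) by its average."""
    avg = sum(D[s:e]) / (e - s)
    return sum(abs(x - avg) for x in D[s:e])


def best_flattening_cost(D, t, points=None):
    """Minimum of ||D - Phi(D, I)||_1 over partitions I of [0, n) into at most
    t intervals whose endpoints all lie in `points` (default: every index)."""
    n = len(D)
    pts = sorted(set(points if points is not None else range(n + 1)) | {0, n})
    m = len(pts)
    inf = float("inf")
    # best[k][j]: cheapest way to cover [0, pts[j]) with exactly k intervals
    best = [[inf] * m for _ in range(t + 1)]
    best[0][0] = 0.0
    for k in range(1, t + 1):
        for j in range(1, m):
            for i in range(j):
                if best[k - 1][i] < inf:
                    cost = best[k - 1][i] + flatten_cost(D, pts[i], pts[j])
                    if cost < best[k][j]:
                        best[k][j] = cost
    return min(best[k][m - 1] for k in range(1, t + 1))


if __name__ == "__main__":
    D = [0.1, 0.1, 0.3, 0.3, 0.1, 0.1]   # a 3-histogram
    print(best_flattening_cost(D, 3), best_flattening_cost(D, 2))
```

With s candidate points the table has O(ts) entries, each filled in by scanning O(s) predecessors, which is where the quadratic dependence on the number of allowed endpoints comes from.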

Proof of Claim 4.12

Let \(\mathcal {J} =(J_{1},\dots ,J_{s})\) be an η-granular decomposition of D, and \(H^{\ast }\in {\mathcal {H}_{t}}\) be a histogram achieving \({\textsc {opt}}=\ell _{1}({D},{\mathcal {H}_{t}})\). Denote further by \(\mathcal {I^{\ast }} = (I^{\ast }_{1},\dots ,I^{\ast }_{t})\) the partition of [n] corresponding to \(H^{\ast }\). Consider now the \(r\leq t\) endpoints of the \(I^{\ast }_{i}\)’s that do not fall on one of the endpoints of the \(J_{i}\)’s: let \(J_{i_{1}},\dots ,J_{i_{r}}\) be the respective intervals in which they fall (in particular, these cannot be singleton intervals), and \(S=\cup _{j=1}^{r} J_{i_{j}}\) their union. By definition of η-granularity, \(D(S)\leq t\eta \), and it follows that \(H^{\ast }(S)\leq t\eta + \frac {1}{2}{\textsc {opt}}\). We define H from \(H^{\ast }\) in two stages: first, we obtain a (sub)distribution \(H^{\prime }\) by modifying \(H^{\ast }\) on S, setting for each \(x\in J_{i_{j}}\) the value of \(H^{\prime }\) to be the minimum value (among the two options) that \(H^{\ast }\) takes on \(J_{i_{j}}\). \(H^{\prime }\) is thus a t-histogram, and the endpoints of its intervals are endpoints of \(\mathcal {J}\) as desired; but it may not sum to one. However, by construction we have that \(H^{\prime }([n]) \geq 1-H^{\ast }(S) \geq 1-t\eta - \frac {1}{2}{\textsc {opt}}\). Using this, we can finally define our t-histogram distribution H as the renormalization of \(H^{\prime }\). It is easy to check that H is a valid t-histogram on a coarsening of \(\mathcal {J}\), and

$${\lVert{{D}-H}{\rVert}}_1 \leq {\lVert{{D}-H^{\prime}}{\rVert}}_1 + (1-H^{\prime}([n])) \leq {\lVert{{D}-H^{\ast}}{\rVert}}_1 + {\lVert{H^{\ast}-H^{\prime}}{\rVert}}_1 + t\eta + \frac{1}{2}{\textsc{opt}} \leq 2{\textsc{opt}} + 2t\eta $$

as stated. □

Turning now to \(\textsc {ProjectionDist}_{{\mathcal {P}_{t,d}}}\), we apply the same initial dynamic programming approach, which will result in a running time of \({O\left ({n^{2}t\cdot T} \right )}\), where T is the time required to estimate (to sufficient accuracy) the distance of a given (sub)distribution over an interval I onto the space \({\mathcal {P}_{d}}\) of degree-d polynomials. Specifically, we will invoke the following result, adapted from [16] to our setting:

Theorem 4.13

Let p be an \(\ell \)-histogram over [−1,1). There is an algorithm ProjectSinglePoly(d, η) which runs in time \(\text {poly}(\ell ,d+1,1/\eta )\), and outputs a degree-d polynomial q which defines a pdf over [−1,1) such that \({\lVert {p-q}{\rVert }}_1 \leq 3 \ell _{1}(p,{\mathcal {P}_{d}}) + O(\eta )\).

The proof of this modification of [16, Theorem 9] is deferred to Appendix C. Applying it as a blackbox with η set to \({O\left ({\varepsilon /t} \right )}\) and noting that computing the \(\ell _{1}\) distance to our explicit distribution on a given interval of the degree-d polynomial returned incurs an additional \({O\left ({n} \right )}\) factor, we obtain the claimed guarantee and running time. □

4.2.1 Computationally Efficient Procedures for Log-Concave and MHR Distributions

We now describe how to obtain efficient testing for the classes \({\mathcal {L}}\) and \({\mathcal {MHR}}\), that is, how to obtain polynomial-time distance estimation procedures for these two classes, unlike the ones described in the previous section. At a very high level, the idea is in both cases to write down a linear program on variables related logarithmically to the probabilities we are searching for, as enforcing the log-concave and MHR constraints on these new variables can be done linearly. The catch now becomes the \(\ell _{1}\) objective function (and, to a lesser extent, the fact that the probabilities must sum to one), which is now highly non-linear.

The first insight is to leverage the structure of log-concave (resp. monotone hazard rate) distributions to express this objective as slightly stronger constraints, specifically pointwise (1 ± ε)-multiplicative closeness, much easier to enforce in our “logarithmic formulation.” Even so, doing this naively fails, essentially because of a too weak distance guarantee between our explicit histogram \(\hat {{D}}\) and the unknown distribution we are trying to find: in the completeness case, we are only promised ε-closeness in \(\ell _{1}\), while we would also require good additive pointwise closeness of the order \(\varepsilon ^{2}\) or \(\varepsilon ^{3}\).

The second insight is thus to observe that we “almost” have this for free: indeed, if we do not reject in the first stage of the testing algorithm, we do obtain an explicit k-histogram \(\hat {{D}}\) with the guarantee that D is ε-close to the distribution P to test. However, we also implicitly have another distribution \(\hat {{D}}^{\prime }\) that is \(\sqrt {\varepsilon /k}\)-close to P in Kolmogorov distance: as in the recursive descent we take enough samples to use the DKW inequality (Theorem 2.11) with this parameter, i.e. an additive overhead of \({O\left ({k/\varepsilon } \right )}\) samples (on top of the \(\tilde {O}(\sqrt {kn}/\varepsilon ^{7/2})\)). If we are willing to increase this overhead by just a small amount, that is to take \(\tilde {O}\left ({ \max (k/\varepsilon , 1/\varepsilon ^{4}) } \right )\), we can guarantee that \(\hat {{D}}^{\prime }\) be also \(\tilde {O}\left ({\varepsilon ^{2}} \right )\)-close to P in Kolmogorov distance.

Combining these ideas yields the following distance estimation lemmas:

Lemma 4.14 (Monotone Hazard Rate)

There exists a procedure \(\textsc {ProjectionDist}_{{\mathcal {MHR}}}^{\ast }\) that, on input n as well as the full specification of a k-histogram distribution D on [n] and of an \(\ell \)-histogram distribution \(D^{\prime }\) on [n], runs in time poly(n, 1/ε), and satisfies the following.

  • If there is \(P\in {\mathcal {MHR}}\) such that \({\lVert {{D}-P}{\rVert }}_1 \leq {\varepsilon }\) and \({\lVert {D}^{\prime } - P{\rVert }_{\text {Kol}}} \leq {\varepsilon }^{3}\), then the procedure returns yes;

  • If \(\ell _{1}({D},{\mathcal {MHR}}) > 100{\varepsilon }\), then the procedure returns no.

Lemma 4.15 (Log-Concavity)

There exists a procedure \(\textsc {ProjectionDist}_{{\mathcal {L}}}^{\ast }\) that, on input n as well as the full specifications of a k-histogram distribution D on [n] and an \(\ell \)-histogram distribution \(D^{\prime }\) on [n], runs in time \(\text {poly}(n,k,\ell ,1/\varepsilon )\), and satisfies the following.

  • If there is \(P\in {\mathcal {L}}\) such that \({\lVert {{D}-P}{\rVert }}_1\leq {\varepsilon }\) and \({\lVert {D}^{\prime } - P{\rVert }_{\text {Kol}}}\leq \frac {\varepsilon ^{2}}{\log ^{2}(1/\varepsilon )}\), then the procedure returns yes;

  • If \(\ell _{1}({D},{\mathcal {L}}) \geq 100{\varepsilon }\), then the procedure returns no.

The proofs of these two lemmas are quite technical and deferred to Appendix C. With these in hand, a simple modification of our main algorithm (specifically, setting \(m = \tilde {O}(\max ({\sqrt {L\left \lvert {I} \right \rvert }}/{\varepsilon ^{3}}, {L}/{\varepsilon ^{2}}, {1}/{\varepsilon ^{c}} ) )\) for c either 4 or 6 instead of \(\tilde {O}(\max ({\sqrt {L\left \lvert {I}\right \rvert }}/{\varepsilon ^{3}}, {L}/{\varepsilon ^{2}} ) )\), to get the desired Kolmogorov distance guarantee; and providing the empirical histogram defined by these m samples along to the distance estimation procedure) suffices to obtain the following counterpart to Corollary 1.2:

Corollary 4.16

The algorithm TestSplittable, after this modification, can efficiently test the classes of log-concave and monotone hazard rate (MHR) distributions, with respectively \(\tilde {O}\left ({\sqrt {n}/\varepsilon ^{7/2} + 1/\varepsilon ^{4}}\right )\) and \(\tilde {O}\left ({\sqrt {n}/\varepsilon ^{7/2} + 1/\varepsilon ^{6}}\right )\) samples.

We observe that Lemma 4.14 and Lemma 4.15 actually imply efficient proper learning algorithms for the classes of respectively MHR and log-concave distributions, with sample complexity \({O\left ({1/\varepsilon ^{6}} \right )}\) and \({O\left ({1/\varepsilon ^{4}} \right )}\). Along with analogous subroutines of [3], these were the first proper learning algorithms (albeit with suboptimal sample complexity) for these classes. (Subsequent work of Diakonikolas, Kane, and Stewart [30] recently obtained, through a completely different approach, a sample-optimal and efficient learning algorithm for the class of log-concave distributions which is both proper and agnostic.)

5 Going Further: Reducing the Support Size

The general approach we have been following so far gives, out-of-the-box, an efficient testing algorithm with sample complexity \(\tilde {O}\left (\sqrt {n} \right )\) for a large range of properties. However, for some classes \({\mathcal {P}}\) this sample complexity can be reduced much further by exploiting, in a preprocessing step, the good concentration guarantees of distributions in \({\mathcal {P}}\).

As a motivating example, consider the class of Poisson Binomial Distributions (PBD). It is well-known (see e.g. [38, Section 2]) that PBDs are unimodal, and more specifically that \(\mathcal {PBD}_{n}\subseteq \mathcal {L}\subseteq {{\mathcal {M}}_{1}}\). Therefore, using our generic framework we can test Poisson Binomial Distributions with \(\tilde {O}\left ({\sqrt {n}} \right )\) samples. This is, however, far from optimal: as shown in [2], a sample complexity of \({\Theta \left ({n^{1/4}} \right )}\) is both necessary and sufficient. The reason our general algorithm ends up making quadratically too many queries can be explained as follows. PBDs are tightly concentrated around their expectation, so that they “morally” live on a support of size \(m={O\left ({\sqrt {n}} \right )}\). Yet, instead of testing them on this very small support, in the above we still consider the entire range [n], and thus end up paying a dependence \(\sqrt {n}\) – instead of \(\sqrt {m}\).

If we could use that observation to first reduce the domain to the effective support of the distribution, then we could call our testing algorithm on this reduced domain of size \({O\left ({\sqrt {n}} \right )}\). In the rest of this section, we formalize and develop this idea, and in Section 5.2 will obtain as a direct application a \(\tilde {O}\left ({n^{1/4}} \right )\)-query testing algorithm for \(\mathcal {PBD}_{n}\).

Definition 5.1

Given ε > 0, the ε-effective support of a distribution D is the smallest interval I such that D(I) ≥ 1 − ε.

The last definition we shall require is that of the conditioned distributions of a class \({\mathcal {C}}\):

Definition 5.2

For any class of distributions \({\mathcal {C}}\) over [n], define the set of conditioned distributions of \(\mathcal {C}\) (with respect to ε > 0 and interval \(I\subseteq [n]\)) as \(\mathcal {C}^{\varepsilon ,I}\overset {\text {def}}{=} \left \{ {{D}_{I}} \;\colon \; {{D}\in \mathcal {C}, {D}(I) \geq 1-\varepsilon } \right \} \).
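To make Definitions 5.1 and 5.2 concrete, the following is a minimal Python sketch, assuming the distribution is given explicitly as a list of probabilities indexed from 0 (rather than over [n]). The function names `effective_support` and `condition` are illustrative and not part of the paper; when several shortest intervals exist, the leftmost one is returned, and a small floating-point tolerance is an added assumption.

```python
from typing import List, Tuple

TOL = 1e-12  # assumed floating-point slack; not part of the definition


def effective_support(pmf: List[float], eps: float) -> Tuple[int, int]:
    """Return (lo, hi), a shortest interval with pmf[lo..hi] mass >= 1 - eps.

    Two-pointer scan: for each right endpoint hi, advance lo as far as
    possible while the interval [lo, hi] keeps mass at least 1 - eps.
    """
    n = len(pmf)
    target = 1.0 - eps - TOL
    best = (0, n - 1)
    lo, mass = 0, 0.0
    for hi in range(n):
        mass += pmf[hi]
        while lo < hi and mass - pmf[lo] >= target:
            mass -= pmf[lo]
            lo += 1
        if mass >= target and hi - lo < best[1] - best[0]:
            best = (lo, hi)
    return best


def condition(pmf: List[float], lo: int, hi: int) -> List[float]:
    """Return the conditional distribution D_I for I = [lo, hi] (inclusive)."""
    mass = sum(pmf[lo:hi + 1])
    if mass <= 0:
        raise ValueError("interval has zero mass")
    return [p / mass for p in pmf[lo:hi + 1]]
```

For instance, on the probability vector (0.05, 0.1, 0.7, 0.1, 0.05) with ε = 0.1, the middle three points carry mass 0.9 and form the ε-effective support; conditioning on them rescales their probabilities by 1/0.9.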

Finally, we will require the following simple result:

Lemma 5.3

Let D be a distribution over [n], and \(I\subseteq [n]\) an interval such that \(D(I) \geq 1- \frac {{\varepsilon }}{10}\). Then,

  • If \(D\in {\mathcal {C}}\), then \(D_{I}\in {\mathcal {C}}^{\frac {{\varepsilon }}{10},I}\) ;

  • If \(\ell _{1}({D},{\mathcal {C}}) > {\varepsilon }\), then \(\ell _{1}({D}_{I}, {\mathcal {C}}^{\frac {{\varepsilon }}{10},I}) > \frac {7{\varepsilon }}{10}\) .

Proof

The first item is obvious. As for the second, let \(P\in {\mathcal {C}}\) be any distribution with \(P(I)\geq 1-\frac {{\varepsilon }}{10}\). By assumption, \({\lVert {{D}-P}{\rVert }}_1 > {\varepsilon }\); but we have, writing α = 1/10,

$$\begin{array}{@{}rcl@{}} {\lVert{{D}_{I}-P_{I}}{\rVert}}_1 &=&\sum\limits_{i\in I}\left\lvert \frac{{D}(i)}{{D}(I)} - \frac{P(i)}{P(I)} \right\rvert = \frac{1}{{D}(I)}\sum\limits_{i\in I}\left\lvert {D}(i) - P(i) + P(i)\left( 1- \frac{{D}(I)}{P(I)}\right) \right\rvert \\ &\geq& \frac{1}{{D}(I)}\left( \sum\limits_{i\in I}\left\lvert {D}(i) - P(i) \right\rvert - \left\lvert 1- \frac{{D}(I)}{P(I)} \right\rvert \sum\limits_{i\in I} P(i) \right)\\ &=& \frac{1}{{D}(I)}\left( \sum\limits_{i\in I}\left\lvert {D}(i) - P(i) \right\rvert - \left\lvert P(I)- {D}(I) \right\rvert \right) \geq \frac{1}{{D}(I)}\left( \sum\limits_{i\in I}\left\lvert {D}(i) - P(i) \right\rvert - \alpha{\varepsilon} \right)\\ &\geq& \frac{1}{{D}(I)}\left( {\lVert{{D}-P}{\rVert}}_1 - \sum\limits_{i\notin I} \left\lvert {D}(i) - P(i) \right\rvert - \alpha{\varepsilon} \right) \geq \frac{1}{{D}(I)}\left( {\lVert{{D}-P}{\rVert}}_1 - 3\alpha{\varepsilon} \right) \\ &>& (1- 3\alpha){\varepsilon} = \frac{7}{10}{\varepsilon}. \end{array} $$

We now proceed to state and prove our result—namely, efficient testing of structured classes of distributions with nice concentration properties.

Theorem 5.4

Let \({\mathcal {C}}\) be a class of distributions over [n] for which the following holds.

  1. there is a function M(⋅,⋅) such that each \(D\in {\mathcal {C}}\) has ε-effective support of size at most M(n, ε);

  2. for every ε ∈ [0, 1] and interval \(I\subseteq [n]\), \({\mathcal {C}}^{{\varepsilon },I}\) is (γ, ζ, L)-splittable;

  3. there exists an efficient procedure \(\textsc {ProjectionDist}_{{\mathcal {C}}^{{\varepsilon },I}}\) which, given as input the explicit description of a distribution D over [n] and interval \(I\subseteq [n]\), computes the distance \(\ell _{1}({D}_{I},\mathcal {C}^{\varepsilon ,I})\).

Then, the algorithm TestEffectiveSplittable (Algorithm 2) is a \({O\left (\max \left (\frac {1}{{\varepsilon }^{3}} \sqrt {m\ell } \log m, \frac {\ell }{{\varepsilon }^{2}}\right ) \right )}\)-sample tester for \({\mathcal {C}}\), where \(m=M(n,\frac {{\varepsilon }}{60})\) and \(\ell =L(\frac {{\varepsilon }}{1200},\frac {{\varepsilon }}{1200}, m)\).

5.1 Proof of Theorem 5.4

By the choice of m and the DKW inequality, with probability at least 23/24 the estimate \(\hat {{D}}\) satisfies \({\lVert {D} - \hat {{D}}{\rVert }_{\text {Kol}}} \leq \frac {{\varepsilon }}{60}\). Conditioning on that from now on, we get that \(D(I) \geq \hat {{D}}(I) - \frac {{\varepsilon }}{30} \geq 1-\frac {{\varepsilon }}{10}\). Furthermore, denoting by j and k the two inner endpoints of J and K in Steps 4 and 5, we have \(D(J\cup \{j+1\}) \geq \hat {{D}}(J\cup \{j+1\}) - \frac {{\varepsilon }}{60} > \frac {{\varepsilon }}{60}\) (similarly for \(D(K\cup \{k-1\})\)), so that I has size at most σ + 1, where σ is the \(\frac {{\varepsilon }}{60}\)-effective support size of D.

Finally, note that since \(D(I) = {\Omega \left (1 \right )}\) by our conditioning, the simulation of samples by rejection sampling will succeed with probability at least 23/24 and the algorithm will not output fail.

[Algorithm 2: TestEffectiveSplittable]

Sample Complexity

The sample complexity is the sum of the \({O\left (1/{\varepsilon }^{2} \right )}\) in Step 3 and the \({O\left (q \right )}\) in Step 11. From Theorem 1.1 and the choice of I, this latter quantity is \({O\left ({ \max \left (\frac {1}{\varepsilon ^{3}} \sqrt {m\ell } \log m, \frac {\ell }{\varepsilon ^{2}}\right ) } \right )}\) where \(m = M(n,\frac {\varepsilon }{60})\) and \(\ell =L(\frac {\varepsilon }{1200},\frac {\varepsilon }{1200}, M(n,\frac {\varepsilon }{60}))\).

Correctness

If \(D\in {\mathcal {C}}\), then by the setting of τ (set to be an upper bound on the \(\frac {\varepsilon }{60}\)-effective support size of any distribution in \(\mathcal {C}\)) the algorithm will go beyond Step 6. The call to TestSplittable will then end up in the algorithm returning accept in Step 12, with probability at least 2/3 by Lemma 5.3, Theorem 1.1 and our choice of parameters.

Similarly, if D is ε-far from \({\mathcal {C}}\), then either its effective support is too large (and then the test on Step 6 fails), or the main tester will detect that its conditional distribution on I is \(\frac {7\varepsilon }{10}\)-far from \(\mathcal {C}\) and output reject in Step 12.

Overall, in either case the algorithm is correct except with probability at most 1/24 + 1/24 + 1/3 = 5/12 (by a union bound). Repeating constantly many times and outputting the majority vote brings the probability of failure down to 1/3.

5.2 Application: Testing Poisson Binomial Distributions

In this section, we illustrate the use of our generic two-stage approach to test the class of Poisson Binomial Distributions. Specifically, we prove the following result:

Corollary 5.5

The class of Poisson Binomial Distributions can be tested with \(\tilde {O}\left ({n}^{1/4}/{\varepsilon }^{7/2} \right ) + \tilde {O}\left (\log ^{2} n/\varepsilon ^{3} \right )\) samples, using Algorithm 2.

This is a direct consequence of Theorem 5.4 and the lemmas below. The first one states that, indeed, PBDs have small effective support:

Fact 5.6

For any ε > 0, a PBD has ε-effective support of size \({O\left (\sqrt {n\log (1/{\varepsilon })} \right )}\).

Proof

By an additive Chernoff Bound, any random variable X following a Poisson Binomial Distribution has \(\Pr \!\left [\, {\left \lvert {X-\mathbb {E} X} \right \rvert > \gamma n}\, \right ] \leq 2e^{-2\gamma ^{2}n}\). Taking \(\gamma \overset {\text {def}}{=}\sqrt {\frac {1}{2n}\ln \frac {2}{\varepsilon }}\), we get that \(\Pr \!\left [\, {X\in I}\, \right ] \geq 1-\varepsilon \), where \(I\overset {\text {def}}{=} [\mathbb {E} X-\sqrt {\frac {n}{2}\ln \frac {2}{\varepsilon }}, \mathbb {E} X+\sqrt {\frac {n}{2}\ln \frac {2}{\varepsilon }}]\). □
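As a numerical sanity check of Fact 5.6, the sketch below computes the probability vector of a PBD from its parameters with the standard quadratic-time dynamic program, and then measures the mass of the interval I from the proof. The function names are illustrative, and the choice of example parameters is ours.

```python
import math


def pbd_pmf(ps):
    """pmf[k] = Pr[X = k] for X a sum of independent Bernoulli(p_i).

    Standard O(n^2) dynamic program: add one Bernoulli variable at a time.
    """
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)   # the new variable equals 0
            nxt[k + 1] += mass * p     # the new variable equals 1
        pmf = nxt
    return pmf


def chernoff_interval(ps, eps):
    """Integer points of [E X - r, E X + r] with r = sqrt((n/2) ln(2/eps))."""
    n = len(ps)
    mean = sum(ps)
    radius = math.sqrt(n / 2 * math.log(2 / eps))
    lo = max(0, math.ceil(mean - radius))
    hi = min(n, math.floor(mean + radius))
    return lo, hi
```

For example, with n = 100 parameters all equal to 1/2 and ε = 0.1, the interval has radius about 12.2 around the mean 50, and its mass under the resulting pmf is at least 1 − ε, as the Chernoff bound guarantees.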

It is clear that if \(D\in {\mathcal {PBD}}_{n}\) (and therefore is unimodal), then for any interval \(I\subseteq [n]\) the conditional distribution \(D_{I}\) is still unimodal, and thus the class of conditioned PBDs \(\mathcal {PBD}_{n}^{\varepsilon ,I}\overset {\text {def}}{=} \left \{ {{D}_{I}} \;\colon \; {{D}\in \mathcal {PBD}_{n}, {D}(I) \geq 1-\varepsilon } \right \}\) falls under Corollary 4.2. The last piece we need to apply our generic testing framework is the existence of an algorithm to compute the distance between an (explicit) distribution and the class of conditioned PBDs. This is provided by our next lemma:

Claim 5.7

There exists a procedure \(\textsc {ProjectionDist}_{{\mathcal {PBD}}_{n}^{{\varepsilon },I}}\) that, on input n and ε ∈ [0, 1], \(I\subseteq [n]\) as well as the full specification of a distribution D on [n], computes a value τ such that \(\tau \in [1\pm 2\varepsilon ] \cdot \ell _{1}({D},\mathcal {PBD}_{n}^{\varepsilon ,I}) \pm \frac {\varepsilon }{100}\), in time \(n^{2} \left ({1/\varepsilon } \right )^{{O\left ({\log {1/\varepsilon }} \right )}}\) .

Proof

The goal is to find a γ = Θ(ε)-approximation of the minimum value of \({\sum }_{i\in I}\left \lvert \frac {P(i)}{P(I)} - \frac {{D}(i)}{{D}(I)} \right \rvert \), subject to \(P(I)={\sum }_{i\in I} P(i) \geq 1-{\varepsilon }\) and \(P\in {\mathcal {PBD}}_{n}\). We first note that, given the parameters \(n \in {\mathbb {N}}\) and \(p_{1},\dots ,p_{n}\in [0, 1]\) of a PBD P, the vector of (n + 1) probabilities \(P(0),\dots ,P(n)\) can be obtained in time \({O\left (n^{2} \right )}\) by dynamic programming. Therefore, computing the \(\ell _{1}\) distance between D and any PBD with known parameters can be done efficiently. To conclude, we invoke a result of Diakonikolas, Kane, and Stewart, that guarantees the existence of a succinct (proper) cover of \({\mathcal {PBD}}_{n}\):

Theorem 5.8 ([31, Theorem 4] (rephrased))

For all n, γ > 0, there exists a set \(\mathcal {S}_{\gamma } \subseteq \mathcal {PBD}_{n}\) such that:

  (i) \(\mathcal {S}_{\gamma }\) is a γ-cover of \({\mathcal {PBD}}_{n}\); that is, for all \(D \in {\mathcal {PBD}}_{n}\) there exists some \(D^{\prime } \in \mathcal {S}_{\gamma }\) such that \(\|{{D}-{D}^{\prime }}\|_{1} \leq \gamma \);

  (ii) \(\left \lvert \mathcal {S}_{\gamma } \right \rvert \leq n \left ({1/\gamma }\right )^{{O\left (\log {1/\gamma } \right )}}\);

  (iii) \(\mathcal {S}_{\gamma }\) can be computed in time \(n\left ({1/\gamma } \right )^{{O\left (\log {1/\gamma } \right )}}\);

and each \(D\in \mathcal {S}_{\gamma }\) is explicitly described by its set of parameters.

We further observe that the factor n in both the size of the cover and running time can be easily removed in our case, as we know a good approximation of the support size of the candidate PBDs. (That is, we only need to enumerate over a subset of the cover of [31], that of the PBDs with effective support compatible with our distribution D.)

Set \(\gamma \overset {\text {def}}{=}\frac {{\varepsilon }}{250}\). Fix \(P\in {\mathcal {PBD}}_{n}\) such that P(I) ≥ 1 − ε, and \(Q\in \mathcal {S}_{\gamma }\) such that \({\lVert {P-Q}{\rVert }}_1\leq \gamma \). In particular, it is easy to see via the correspondence between \(\ell _{1}\) and total variation distance that \(\left \lvert P(I)-Q(I) \right \rvert \leq \gamma /2\). By a calculation similar to that of Lemma 5.3, we have

$$\begin{array}{@{}rcl@{}} {\lVert{P_{I}-Q_{I}}{\rVert}}_1 &=& \sum\limits_{i\in I}\left\lvert \frac{P(i)}{P(I)} - \frac{Q(i)}{Q(I)} \right\rvert = \sum\limits_{i\in I}\left\lvert \frac{P(i)}{P(I)} - \frac{Q(i)}{P(I)} + Q(i)\left( \frac{1}{P(I)} - \frac{1}{Q(I)} \right) \right\rvert \\ &=& \sum\limits_{i\in I}\left\lvert \frac{P(i)}{P(I)} - \frac{Q(i)}{P(I)} \right\rvert \pm \sum\limits_{i\in I} Q(i)\left\lvert \frac{1}{P(I)} - \frac{1}{Q(I)} \right\rvert\\ &=& \frac{1}{P(I)}\left( \sum\limits_{i\in I}\left\lvert P(i) - Q(i) \right\rvert \pm \left\lvert P(I)-Q(I) \right\rvert\right) \\ &=& \frac{1}{P(I)}\left( \sum\limits_{i\in I}\left\lvert P(i) - Q(i) \right\rvert \pm \frac{\gamma}{2}\right) = \frac{1}{P(I)}\left( {\lVert{ P - Q }{\rVert}}_1 \pm \frac{5\gamma}{2}\right) \\ &\in& [ {\lVert{ P - Q }{\rVert}}_1 - {5\gamma}/{2}, (1+2{\varepsilon})\left( {\lVert{ P - Q }{\rVert}}_1 + {5\gamma}/{2} \right) ] \end{array} $$

where we used the fact that \({\sum }_{i\notin I}\left \lvert P(i) - Q(i) \right \rvert = 2\left ({\sum }_{i\notin I\colon P(i) > Q(i)} (P(i)-Q(i))\right ) + P(I)-Q(I) \in [-2\gamma ,2\gamma ]\). By the triangle inequality, this implies that the minimum of \({\lVert {P_{I}-{D}_{I}}{\rVert }}_1\) over the distributions P of \(\mathcal {S}_{{\varepsilon }}\) with P(I) ≥ 1 − (ε + γ/2) will be within an additive \({O\left ({\varepsilon } \right )}\) of \(\ell _{1}({D},{\mathcal {PBD}}_{n}^{{\varepsilon },I})\). The fact that the former can be found (by enumerating over the cover of size \(\left ({1/{\varepsilon }} \right )^{{O\left (\log {1/{\varepsilon }} \right )}}\) by the above discussion, and for each distribution in the cover computing the vector of probabilities and the distance to D) in time \(O(n^{2})\cdot \left \lvert S_{{\varepsilon }} \right \rvert =n^{2} \cdot \left ({1/{\varepsilon }} \right )^{{O\left (\log {1/{\varepsilon }} \right )}}\) concludes the proof. □

As previously mentioned, this approximation guarantee for \(\ell _{1}({D},{\mathcal {PBD}}_{n}^{{\varepsilon },I})\) is sufficient for the purpose of Algorithm 1.
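The enumeration step in this proof can be sketched as below. This is an illustration only: the argument `cover` is a hypothetical stand-in for the cover of [31], given here directly as a list of explicit probability vectors (its construction is not implemented), and indices start at 0.

```python
def conditioned(pmf, lo, hi):
    """Return (P_I, P(I)) for the inclusive interval I = [lo, hi]."""
    mass = sum(pmf[lo:hi + 1])
    if mass <= 0:
        return None, 0.0
    return [p / mass for p in pmf[lo:hi + 1]], mass


def projection_dist(d_pmf, lo, hi, eps, gamma, cover):
    """Minimum of ||P_I - D_I||_1 over cover elements P with enough mass on I.

    Only candidates with P(I) >= 1 - (eps + gamma/2) are considered, as in
    the proof. Returns None if no candidate qualifies.
    """
    d_cond, d_mass = conditioned(d_pmf, lo, hi)
    if d_cond is None:
        raise ValueError("D has zero mass on the interval")
    best = None
    for p_pmf in cover:
        p_cond, p_mass = conditioned(p_pmf, lo, hi)
        if p_cond is None or p_mass < 1 - (eps + gamma / 2):
            continue
        dist = sum(abs(a - b) for a, b in zip(p_cond, d_cond))
        if best is None or dist < best:
            best = dist
    return best
```

The running time is the size of the candidate list times the cost of one distance computation, which mirrors the accounting in the proof once the probability vector of each candidate has been computed.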

Proof of Corollary 5.5

Combining the above, we invoke Theorem 5.4 with \(M(n,{\varepsilon })=O(\sqrt {n\log (1/{\varepsilon })} )\) (Fact 5.6) and \(L(\gamma ,\zeta ,m)=O\left (\frac {1}{\gamma }\log ^{2} \frac {m}{\zeta } \right )\) (Corollary 4.2). This yields the claimed sample complexity; finally, the efficiency is a direct consequence of Claim 5.7. □

6 Lower Bounds

6.1 Reduction-Based Lower Bound Approach

We now turn to proving converses to our positive results—namely, that many of the upper bounds we obtain cannot be significantly improved upon. As in our algorithmic approach, we describe for this purpose a generic framework for obtaining lower bounds.

In order to state our results, we will require the usual definition of agnostic learning. Recall that an algorithm is said to be a semi-agnostic learner for a class \(\mathcal {C}\) if it satisfies the following. Given sample access to an arbitrary distribution D and parameter ε, it outputs a hypothesis \(\hat {{D}}\) which (with high probability) does “almost as well as it gets”:

$${\lVert{{D} - \hat{{D}}}{\rVert}}_1 \leq c\cdot{\textsc{opt}}_{{\mathcal{C}},{D}} + {O\left( {\varepsilon} \right)} $$

where \({\textsc {opt}}_{{\mathcal {C}},{D}}\overset {\text {def}}{=} \inf _{{D}^{\prime }\in {\mathcal {C}}} \ell _{1}({D}^{\prime },{D})\), and c ≥ 1 is some absolute constant (if c = 1, the learner is said to be agnostic).

High-Level Idea

The motivation for our result is the observation of [12] that “monotonicity is at least as hard as uniformity.” Unfortunately, their specific argument does not generalize easily to other classes of distributions, making it impossible to extend it readily. The starting point of our approach is to observe that while uniformity testing is hard in general, it becomes very easy under the promise that the distribution is monotone, or even only close to monotone (namely, \({O\left ({1/\varepsilon ^{2}} \right )}\) samples suffice; see Footnote 7). This can give an alternate proof of the lower bound for monotonicity testing, via a different reduction: first, test if the unknown distribution is monotone; if it is, test whether it is uniform, now assuming closeness to monotone.

More generally, this idea applies to any class \({\mathcal {C}}\) which (a) contains the uniform distribution, and (b) for which we have a \({o\left (\sqrt {n} \right )}\)-sample agnostic learner \({\mathcal {L}}\), as follows. Assuming we have a tester \({\mathcal {T}}\) for \({\mathcal {C}}\) with sample complexity \({o\left (\sqrt {n} \right )}\), define a uniformity tester as below.

  • test if \(D\in {\mathcal {C}}\) using \({\mathcal {T}}\); if not, reject (as \({\mathcal {U}}\in {\mathcal {C}}\), D cannot be uniform);

  • otherwise, agnostically learn D with \({\mathcal {L}}\) (since D is close to \({\mathcal {C}}\)), and obtain hypothesis \(\hat {{D}}\);

  • check offline if \(\hat {D}\) is close to uniform.

By assumption, \({\mathcal {T}}\) and \({\mathcal {L}}\) each use \({o\left (\sqrt {n} \right )}\) samples, so does the whole process; but this contradicts the lower bound of [11, 42] on uniformity testing. Hence, \({\mathcal {T}}\) must use \({\Omega \left ({\sqrt {n}} \right )}\) samples.

This “testing-by-narrowing” reduction argument can be further extended to properties other than uniformity, as we show below:

Theorem 6.1

Let \({\mathcal {C}}\) be a class of distributions over [n] for which the following holds:

  (i) there exists a semi-agnostic learner \({\mathcal {L}}\) for \({\mathcal {C}}\), with sample complexity \(q_{L}(n,{\varepsilon },\delta )\) and “agnostic constant” c;

  (ii) there exists a subclass \({\mathcal {C}}_{\text {Hard}}\subseteq {\mathcal {C}}\) such that testing \({\mathcal {C}}_{\text {Hard}}\) requires \(q_{H}(n,{\varepsilon })\) samples.

Suppose further that \(q_{L}(n,{\varepsilon }, 1/6)={o\left (q_{H}(n,{\varepsilon }) \right )}\). Then, any tester for \({\mathcal {C}}\) must use \({\Omega \left ({q_{H}(n,\varepsilon )} \right )}\) samples.

Proof

The above theorem relies on the reduction outlined above, which we rigorously detail here. Assuming \({\mathcal {C}}\), \(\mathcal {C}_{\text {Hard}}\), \({\mathcal {L}}\) as above (with semi-agnostic constant c ≥ 1), and a tester \({\mathcal {T}}\) for \(\mathcal {C}\) with sample complexity \(q_{T}(n,{\varepsilon })\), we define a tester \({\mathcal {T}}_{\text {Hard}}\) for \(\mathcal {C}_{\text {Hard}}\). On input ε ∈ (0, 1] and given sample access to a distribution D on [n], \({\mathcal {T}}_{\text {Hard}}\) acts as follows:

  • call \({\mathcal {T}}\) with parameters n, \(\frac {{\varepsilon }^{\prime }}{c}\) (where \({\varepsilon }^{\prime }\overset {\text {def}}{=}\frac {{\varepsilon }}{3}\)) and failure probability 1/6, to \(\frac {\varepsilon ^{\prime }}{c}\)-test if \(D\in \mathcal {C}\). If not, reject.

  • otherwise, agnostically learn a hypothesis \(\hat {{D}}\) for D, with \({\mathcal {L}}\) called with parameters n, \(\varepsilon ^{\prime }\) and failure probability 1/6;

  • check offline if \(\hat {{D}}\) is \({\varepsilon }^{\prime }\)-close to \({\mathcal {C}}_{\text {Hard}}\), accept if and only if this is the case.

We condition on both calls (to \({\mathcal {T}}\) and \({\mathcal {L}}\)) to be successful, which overall happens with probability at least 2/3 by a union bound. The completeness is immediate: if \(D\in \mathcal {C}_{\text {Hard}}\subseteq \mathcal {C}\), \({\mathcal {T}}\) accepts, and the hypothesis \(\hat {{D}}\) satisfies \(\|{\hat {{D}}-{D}}\|_{1} \leq \varepsilon ^{\prime }\). Therefore, \(\ell _{1}(\hat {{D}},\mathcal {C}_{\text {Hard}}) \leq \varepsilon ^{\prime }\), and \({\mathcal {T}}_{\text {Hard}}\) accepts.

For the soundness, we proceed by contrapositive. Suppose \({\mathcal {T}}_{\text {Hard}}\) accepts; it means that each step was successful. In particular, \(\ell _{1}({D},{\mathcal {C}})\leq {{\varepsilon }^{\prime }}/{c}\), so that the hypothesis output by the agnostic learner satisfies \({\lVert {\hat {{D}}-{D}}{\rVert }}_1 \leq c\cdot {\textsc {opt}}+{\varepsilon }^{\prime }\leq 2{\varepsilon }^{\prime }\). In turn, since the last step passed and by a triangle inequality we get, as claimed, \(\ell _{1}({D}, {\mathcal {C}}_{\text {Hard}}) \leq 2\varepsilon ^{\prime } + \ell _{1}(\hat {{D}},{\mathcal {C}}_{\text {Hard}}) \leq 3{\varepsilon }^{\prime } = {\varepsilon }\).

Observing that the overall sample complexity is \(q_{T}(n,\frac {{\varepsilon }^{\prime }}{c})+q_{L}(n,{\varepsilon }^{\prime }, \frac {1}{6}) = q_{T}(n,\frac {\varepsilon ^{\prime }}{c})+{o\left ({q_{H}(n,\varepsilon ^{\prime })} \right )}\) concludes the proof. □
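The three-step reduction used in this proof can be written as a higher-order function. The sketch below is only illustrative: `test_c`, `learn_c` and `dist_to_hard` are hypothetical callables standing for the tester \({\mathcal {T}}\), the semi-agnostic learner \({\mathcal {L}}\), and the offline computation of the distance of a hypothesis to \({\mathcal {C}}_{\text {Hard}}\); their signatures are our assumption.

```python
def make_hard_tester(test_c, learn_c, dist_to_hard, c):
    """Build a tester for C_Hard from a tester and a learner for C.

    test_c(sample, n, eps, delta)  -> bool   (hypothetical tester T for C)
    learn_c(sample, n, eps, delta) -> pmf    (hypothetical learner L for C)
    dist_to_hard(pmf)              -> float  (offline distance to C_Hard)
    c is the "agnostic constant" of the learner.
    """
    def tester_hard(sample, n, eps):
        eps_prime = eps / 3
        # Step 1: test membership in C with parameter eps'/c.
        if not test_c(sample, n, eps_prime / c, 1 / 6):
            return False
        # Step 2: learn a hypothesis, now that D is close to C.
        hypothesis = learn_c(sample, n, eps_prime, 1 / 6)
        # Step 3: offline check, using no further samples.
        return dist_to_hard(hypothesis) <= eps_prime

    return tester_hard
```

The sample cost of `tester_hard` is that of one call to `test_c` plus one call to `learn_c`, matching the sum of sample complexities in the proof.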

Taking \({\mathcal {C}}_{\text {Hard}}\) to be the singleton consisting of the uniform distribution, and from the semi-agnostic learners of [15, 16] (each with sample complexity either poly(1/ε) or \(\text {poly}(\log n,1/{\varepsilon })\)), we obtain the following (see Footnote 8):

Corollary 1.6

Testing log-concavity, convexity, concavity, MHR, unimodality, t-modality, t-histograms, and t-piecewise degree-d distributions each requires \({\Omega \left ({\sqrt {n}}/{{\varepsilon }^{2}} \right )}\) samples (the last three for \(t = o(\sqrt {n})\) and \(t(d+1) = o(\sqrt {n})\), respectively), for any \({\varepsilon } \geq 1/n^{O(1)}\).

Similarly, we can use another result of [23] which shows how to agnostically learn Poisson Binomial Distributions with \(\tilde {O}\left (1/{\varepsilon }^{2} \right )\) samples (see Footnote 9). Taking \({\mathcal {C}}_{\text {Hard}}\) to be the single \({\operatorname {Bin}\!\left (n, 1/2 \right )}\) distribution (along with the testing lower bound of [53]), this yields the following:

Corollary 1.7

Testing the classes of Binomial and Poisson Binomial Distributions each requires \({\Omega \left ({n^{1/4}}/{{\varepsilon }^{2}} \right )}\) samples, for any \({\varepsilon } \geq 1/n^{O(1)}\).

Finally, we derive a lower bound on testing k-SIIRVs from the agnostic learner of [21] (which has sample complexity poly(k,1/ε), independent of n):

Corollary 1.8

There exist absolute constants c > 0 and \({\varepsilon }_{0} > 0\) such that testing the class of k-SIIRV distributions requires \({\Omega }\left (k^{1/2}n^{1/4}/{\varepsilon }^{2} \right )\) samples, for any \(k={o\left (n^{c} \right )}\) and \({1}/{n^{O(1)}} \leq {\varepsilon } \leq {\varepsilon }_{0}\).

Proof of Corollary 1.8

To prove this result, it is enough by Theorem 6.1 to exhibit a particular k-SIIRV S such that testing identity to S requires this many samples. Moreover, from [53] this last part amounts to proving that the (truncated) 2/3-norm \(\lVert S^{-\max }_{-{\varepsilon }}{\rVert }_{2/3}\) of S is \({\Omega \left (k^{1/2}n^{1/4} \right )}\) (for every \({\varepsilon } \in (0,{\varepsilon }_{0})\), for some small \({\varepsilon }_{0} > 0\)). Our hard instance S is defined as the distribution of \(X_{1}+\dots +X_{n}\), where the \(X_{i}\)’s are independent integer random variables uniform on \(\{0,\dots ,k-1\}\) (in particular, for k = 2 we get a \({\operatorname {Bin}\!\left (n, 1/2 \right )}\) distribution). It is straightforward to verify that \(\mathbb {E}{S} = \frac {n(k-1)}{2}\) and \(\sigma ^{2}\overset {\text {def}}{=}\text {Var}~S = \frac {(k^{2}-1)n}{12} = {\Theta \left (k^{2} n \right )}\); moreover, S is log-concave (as the convolution of n uniform distributions). From this last point, we get that (i) the maximum probability of S, attained at its mode, is \({\lVert {S}{\rVert }}_{\infty }={\Theta \left (1/\sigma \right )}\); and (ii) for every j in an interval I of length 2σ centered at this mode, \(S(j) \geq {\Omega \left ({\lVert {S}{\rVert }}_{\infty } \right )}\) (see e.g. [29, Lemma 5.7] for the latter point). Define now \({\varepsilon }_{0}\) as an absolute constant such that \(2{\varepsilon }_{0} \leq {S}(I) = {\Omega \left (1 \right )}\).

We want to lower bound \(\lVert S^{-\max }_{-{\varepsilon }}{\rVert }_{2/3}\), for \({\varepsilon } \leq {\varepsilon }_{0}\); as by the above the “\(-\max \)” part can only change the value by \({\lVert {{S}}{\rVert }}_{\infty }=o(1)\), we can ignore it. Turning to the “\(-{\varepsilon }\)” part, i.e. the removal of the ε probability mass of the elements with smallest probability, note that this can only result in zeroing out at most \(\frac {\varepsilon }{{S}(I)}\left \lvert {I} \right \rvert \leq \frac {1}{2}\left \lvert {I} \right \rvert \) elements in I (call these \(J_{\varepsilon }\subseteq I\)). From this, we obtain that

$$\lVert S^{-\max}_{-{\varepsilon}}{\rVert}_{2/3}\geq \left( \sum\limits_{j\in I\setminus J_{{\varepsilon}}} S(j)^{2/3}\right)^{3/2} \geq \left( \frac{1}{2}\cdot 2\sigma\cdot {\Omega\left( 1/\sigma \right)}^{2/3}\right)^{3/2} ={\Omega\left( \sigma^{1/2} \right)} = {\Omega\left( k^{1/2}n^{1/4} \right)} $$

which concludes the proof. □
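The moment formulas and the order of magnitude of the 2/3-norm of S can be checked numerically. The sketch below is a sanity check rather than part of the argument: it computes the untruncated quantity \((\sum _{j} S(j)^{2/3})^{3/2}\), ignoring both the “\(-\max \)” and the “\(-{\varepsilon }\)” truncations, and the function names are ours.

```python
import math


def siirv_pmf(n, k):
    """pmf of X_1 + ... + X_n with X_i i.i.d. uniform on {0, ..., k-1}."""
    pmf = [1.0]
    for _ in range(n):
        nxt = [0.0] * (len(pmf) + k - 1)
        for j, mass in enumerate(pmf):
            for x in range(k):
                nxt[j + x] += mass / k
        pmf = nxt
    return pmf


def two_thirds_norm(pmf):
    """Untruncated 2/3-quasinorm: (sum_j p_j^(2/3))^(3/2)."""
    return sum(p ** (2 / 3) for p in pmf) ** 1.5


def mean_and_variance(pmf):
    mean = sum(j * p for j, p in enumerate(pmf))
    var = sum((j - mean) ** 2 * p for j, p in enumerate(pmf))
    return mean, var
```

For n = 50 and k = 4, the mean and variance agree with n(k − 1)/2 = 75 and (k² − 1)n/12 = 62.5, and the ratio of the 2/3-norm to \(\sigma ^{1/2}\) is a moderate constant, consistent with the \({\Omega \left (\sigma ^{1/2} \right )}\) bound.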

6.2 Tolerant Testing

The lower bound framework from the previous section carries over to tolerant testing as well, resulting in this analogue to Theorem 6.1:

Theorem 6.2

Let \({\mathcal {C}}\) be a class of distributions over [n] for which the following holds:

  (i) there exists a semi-agnostic learner \({\mathcal {L}}\) for \({\mathcal {C}}\), with sample complexity \(q_{L}(n,{\varepsilon },\delta )\) and “agnostic constant” c;

  (ii) there exists a subclass \({\mathcal {C}}_{\text {Hard}}\subseteq {\mathcal {C}}\) such that tolerant testing \(\mathcal {C}_{\text {Hard}}\) requires \(q_{H}(n,{\varepsilon }_{1},{\varepsilon }_{2})\) samples for some parameters \({\varepsilon }_{2} > (4c+1){\varepsilon }_{1}\).

Suppose further that \(q_{L}(n,{\varepsilon }_{2}-{\varepsilon }_{1}, 1/10)={o\left (q_{H}(n,{\varepsilon }_{1},{\varepsilon }_{2}) \right )}\). Then, any tolerant tester for \(\mathcal {C}\) must use \({\Omega \left ({q_{H}(n,\varepsilon _{1},\varepsilon _{2})} \right )}\) samples (for some explicit parameters \(\varepsilon ^{\prime }_{1},\varepsilon ^{\prime }_{2}\)).

Proof

The argument follows the same ideas as for Theorem 6.1, up to the details of the parameters. Assuming \(\mathcal {C}\), \(\mathcal {C}_{\text {Hard}}\), \({\mathcal {L}}\) as above (with semi-agnostic constant c ≥ 1), and a tolerant tester \({\mathcal {T}}\) for \(\mathcal {C}\) with sample complexity \(q_{T}(n,{\varepsilon }_{1},{\varepsilon }_{2})\), we define a tolerant tester \({\mathcal {T}}_{\text {Hard}}\) for \(\mathcal {C}_{\text {Hard}}\). On input \(0 < {\varepsilon }_{1} < {\varepsilon }_{2} \leq 1\) with \({\varepsilon }_{2} > (4c+1){\varepsilon }_{1}\), and given sample access to a distribution D on [n], \({\mathcal {T}}_{\text {Hard}}\) acts as follows. After setting \(\varepsilon ^{\prime }_{1}\overset {\text {def}}{=}\frac {\varepsilon _{2}-\varepsilon _{1}}{4}\), \(\varepsilon ^{\prime }_{2}\overset {\text {def}}{=}\frac {\varepsilon _{2}-\varepsilon _{1}}{2}\), \(\varepsilon ^{\prime } \overset {\text {def}}{=} \frac {\varepsilon _{2}-\varepsilon _{1}}{16}\) and \(\tau \overset {\text {def}}{=} \frac {6\varepsilon _{2}+10\varepsilon _{1}}{16}\),

  • call \({\mathcal {T}}\) with parameters n, \(\frac {{\varepsilon }^{\prime }_{1}}{c}\), \(\frac {{\varepsilon }^{\prime }_{2}}{c}\) and failure probability 1/6, to tolerantly test if \(D\in \mathcal {C}\). If \(\ell _{1}({D},\mathcal {C}) > \varepsilon ^{\prime }_{2}/c\), reject.

  • otherwise, agnostically learn a hypothesis \(\hat {{D}}\) for D, with \({\mathcal {L}}\) called with parameters n, \(\varepsilon ^{\prime }\) and failure probability 1/6;

  • check offline if \(\hat {{D}}\) is τ-close to \({\mathcal {C}}_{\text {Hard}}\), accept if and only if this is the case.

We condition on both calls (to \({\mathcal {T}}\) and \({\mathcal {L}}\)) to be successful, which overall happens with probability at least 2/3 by a union bound. We first argue completeness: assume \(\ell _{1}({D},{\mathcal {C}}_{\text {Hard}}) \leq {\varepsilon }_{1}\). This implies \(\ell _{1}({D},{\mathcal {C}}) \leq {\varepsilon }_{1}\), so that \({\mathcal {T}}\) accepts as \({\varepsilon }_{1} \leq {\varepsilon }^{\prime }_{1}/c\) (which is the case because \({\varepsilon }_{2} > (4c+1){\varepsilon }_{1}\)). Thus, the hypothesis \(\hat {{D}}\) satisfies \({\lVert {\hat {{D}}-{D}}{\rVert }}_1 \leq c\cdot {\varepsilon }^{\prime }_{1}/c + {\varepsilon }^{\prime } = {\varepsilon }^{\prime }_{1} + {\varepsilon }^{\prime }\). Therefore, \(\ell _{1}(\hat {{D}},{\mathcal {C}}_{\text {Hard}}) \leq {\lVert {\hat {{D}}-{D}}{\rVert }}_1 + \ell _{1}({D},{\mathcal {C}}_{\text {Hard}}) \leq {\varepsilon }^{\prime }_{1} + {\varepsilon }^{\prime } + {\varepsilon }_{1} < \tau \), and \({\mathcal {T}}_{\text {Hard}}\) accepts.

For the soundness, we again proceed by contrapositive. Suppose \({\mathcal {T}}_{\text {Hard}}\) accepts; it means that each step was successful. In particular, \(\ell _{1}({D},{\mathcal {C}})\leq {{\varepsilon }^{\prime }_{2}}/{c}\), so that the hypothesis output by the agnostic learner satisfies \({\lVert {\hat {{D}}-{D}}{\rVert }}_1 \leq c\cdot {\textsc {opt}}+{\varepsilon }^{\prime }\leq {\varepsilon }^{\prime }_{2} + {\varepsilon }^{\prime }\). In turn, since the last step passed and by a triangle inequality we get, as claimed, \(\ell _{1}({D}, {\mathcal {C}}_{\text {Hard}}) \leq {\varepsilon }^{\prime }_{2} + {\varepsilon }^{\prime } + \ell _{1}(\hat {{D}},{\mathcal {C}}_{\text {Hard}}) \leq {\varepsilon }^{\prime }_{2} + {\varepsilon }^{\prime } + \tau < {\varepsilon }_{2}\).

Observing that the overall sample complexity is \(q_{T}(n,\frac {\varepsilon ^{\prime }_{1}}{c},\frac {\varepsilon ^{\prime }_{2}}{c})+q_{L}(n,\varepsilon ^{\prime }, \frac {1}{10}) = q_{T}(n,\frac {\varepsilon ^{\prime }}{c})+{o\left ({q_{H}(n,\varepsilon ^{\prime })} \right )}\) concludes the proof. □
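The parameter choices in this proof can be verified with exact rational arithmetic. The following sketch recomputes \(\varepsilon ^{\prime }_{1}\), \(\varepsilon ^{\prime }_{2}\), \(\varepsilon ^{\prime }\) and τ from \(\varepsilon _{1},\varepsilon _{2}\) and checks the three inequalities used in the completeness and soundness arguments; the function names are ours.

```python
from fractions import Fraction


def tolerant_params(eps1, eps2):
    """Return (eps1', eps2', eps', tau) as set in the proof of Theorem 6.2."""
    gap = eps2 - eps1
    return gap / 4, gap / 2, gap / 16, (6 * eps2 + 10 * eps1) / 16


def check_reduction(eps1, eps2, c):
    """Check the inequalities of the proof for given eps1, eps2, c >= 1."""
    if not eps2 > (4 * c + 1) * eps1:
        raise ValueError("requires eps2 > (4c + 1) * eps1")
    e1p, e2p, ep, tau = tolerant_params(eps1, eps2)
    tester_accepts = eps1 <= e1p / c        # T accepts eps1-close inputs
    completeness = e1p + ep + eps1 < tau    # hypothesis is tau-close to C_Hard
    soundness = e2p + ep + tau < eps2       # acceptance implies eps2-closeness
    return tester_accepts and completeness and soundness
```

In exact terms, \(\tau - (\varepsilon ^{\prime }_{1} + \varepsilon ^{\prime } + \varepsilon _{1}) = \frac {\varepsilon _{2}-\varepsilon _{1}}{16}\) and \(\varepsilon _{2} - (\varepsilon ^{\prime }_{2} + \varepsilon ^{\prime } + \tau ) = \frac {\varepsilon _{2}-\varepsilon _{1}}{16}\), so both strict inequalities hold with the same slack.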

As before, we instantiate the general theorem to obtain specific lower bounds for tolerant testing of the classes we covered in this paper. That is, taking \({\mathcal {C}}_{\text {Hard}}\) to be the singleton consisting of the uniform distribution (combined with the tolerant testing lower bound of [50] (restated in Theorem D.3), which states that tolerant testing of uniformity over [n] requires \({\Omega \left (\frac {n}{\log n} \right )}\) samples), and again from the semi-agnostic learners of [15, 16] (each with sample complexity either poly(1/ε) or \(\text {poly}(\log n,1/{\varepsilon })\)), we obtain the following:

Corollary 1.11

Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and t-modality can be performed with \(O\left (\frac {1}{(\varepsilon _{2}-\varepsilon _{1})^{2}}\frac {n}{\log n} \right )\) samples, for \(\varepsilon _{2} \geq C\varepsilon _{1}\) (where C > 2 is an absolute constant).

Similarly, we again turn to the class of Poisson Binomial Distributions, for which we can invoke as before the \(\tilde {O}\left (1/{\varepsilon }^{2} \right )\)-sample agnostic learner of [23]. As before, we would like to choose for \({\mathcal {C}}_{\text {Hard}}\) the single \({\operatorname {Bin}\!\left (n, 1/2 \right )}\) distribution; however, as no tolerant testing lower bound for this distribution exists—to the best of our knowledge—in the literature, we first need to establish the lower bound we will rely upon:

Theorem 6.3

There exists an absolute constant \({\varepsilon }_{0} > 0\) such that the following holds. Any algorithm which, given sampling access to an unknown distribution D on Ω and parameter \({\varepsilon } \in (0, {\varepsilon }_{0})\), distinguishes with probability at least 2/3 between (i) \({\lVert {{D}-{\operatorname {Bin}\!\left (n, 1/2 \right )}}{\rVert }}_1 \leq {\varepsilon }\) and (ii) \({\lVert {{D}-{\operatorname {Bin}\!\left (n, 1/2 \right )}}{\rVert }}_1 \geq 100{\varepsilon }\) must use \({\Omega \left (\frac {1}{{\varepsilon }}\frac {\sqrt {n}}{\log n} \right )}\) samples.

The proof relies on a reduction from tolerant testing of uniformity, drawing on a result of Valiant and Valiant [50]; for the sake of conciseness, the details are deferred to Appendix D. With Theorem 6.3 in hand, we can apply Theorem 6.2 to obtain the desired lower bound:

Corollary 1.12

Tolerant testing of the classes of Binomial and Poisson Binomial Distributions each requires \(\Omega \left (\frac {1}{(\varepsilon _{2}-\varepsilon _{1})}\frac {\sqrt {n}}{\log n} \right )\) samples.

We observe that both Corollary 1.11 and Corollary 1.12 are tight (with regard to the dependence on n), as proven in the next section (Section 7).

7 A Generic Tolerant Testing Upper Bound

To conclude this work, we address the question of tolerant testing of distribution classes. In the same spirit as before, we focus on describing a generic approach to obtain such bounds, in a clean conceptual manner. The most general form of the result we prove in this section is stated below; we then instantiate it to match the lower bounds from Section 6.2:

Theorem 7.1

Let \({\mathcal {C}}\) be a class of distributions over [n] for which the following holds:

  1. (i)

    there exists a semi-agnostic learner \({\mathcal {L}}\) for \({\mathcal {C}}\), with sample complexity \(q_{L}(n,\varepsilon ,\delta )\) and “agnostic constant” c;

  2. (ii)

    for any η ∈ [0, 1], every distribution in \({\mathcal {C}}\) has η-effective support of size at most M(n, η).

Then, there exists an algorithm that, for any fixed κ > 1 and on input \(\varepsilon _{1},\varepsilon _{2} \in (0, 1)\) such that \(\varepsilon _{2} \geq C\varepsilon _{1}\), has the following guarantee (where C > 2 depends on c and κ only). The algorithm takes \({O\left ({\frac {1}{(\varepsilon _{2}-\varepsilon _{1})^{2}}\frac {m}{\log m}} \right )} + q_{L}(n,\frac {\varepsilon _{2}-\varepsilon _{1}}{\kappa }, \frac {1}{10})\) samples (where \(m = M(n, \varepsilon _{1})\)), and with probability at least 2/3 distinguishes between (a) \(\ell _{1}({D},\mathcal {C}) \leq \varepsilon _{1}\) and (b) \(\ell _{1}({D},\mathcal {C}) > \varepsilon _{2}\). (Moreover, one can take \(C=1+(5c+6)\frac {\kappa }{\kappa -1}\).)

Corollary 1.9

Tolerant testing of log-concavity, convexity, concavity, MHR, unimodality, and t-modality can be performed with \(O\left (\frac {1}{(\varepsilon _{2}-\varepsilon _{1})^{2}}\frac {n}{\log n} \right )\) samples, for \(\varepsilon _{2} \geq C\varepsilon _{1}\) (where C > 2 is an absolute constant).

Applying now the theorem with \(M(n,{\varepsilon })=\sqrt {n\log (1/{\varepsilon })}\) (as per Corollary 5.5), we obtain an improved upper bound for Binomial and Poisson Binomial distributions:

Corollary 1.10

Tolerant testing of the classes of Binomial and Poisson Binomial Distributions can be performed with \(O\left (\frac {1}{(\varepsilon _{2}-\varepsilon _{1})^{2}}\frac {\sqrt {n\log ({1}/{\varepsilon _{1}})}}{\log n} \right )\) samples, for \(\varepsilon _{2} \geq C\varepsilon _{1}\) (where C > 2 is an absolute constant).
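To get a feel for how the two instantiations differ, the following minimal Python sketch evaluates the leading term \(\frac {1}{(\varepsilon _{2}-\varepsilon _{1})^{2}}\frac {m}{\log m}\) of Theorem 7.1 for the two effective-support bounds used above. It ignores the hidden constant and the learner's sample complexity, and all function names are illustrative assumptions rather than part of the paper.

```python
import math


def leading_sample_term(m: float, eps1: float, eps2: float) -> float:
    """Leading term m / ((eps2 - eps1)^2 * log m) of Theorem 7.1.

    The constant hidden in the O(.) and the learner's sample
    complexity q_L are deliberately left out of this sketch.
    """
    if not (0 < eps1 < eps2 < 1):
        raise ValueError("need 0 < eps1 < eps2 < 1")
    if m <= 1:
        raise ValueError("need m > 1 so that log m > 0")
    return m / ((eps2 - eps1) ** 2 * math.log(m))


def support_trivial(n: int, eta: float) -> float:
    """Trivial effective-support bound M(n, eta) = n."""
    return float(n)


def support_binomial(n: int, eta: float) -> float:
    """Bound M(n, eta) = sqrt(n log(1/eta)) used for (Poisson) Binomials."""
    return math.sqrt(n * math.log(1.0 / eta))


if __name__ == "__main__":
    n, eps1, eps2 = 10**6, 0.01, 0.5
    for name, bound in (("trivial", support_trivial),
                        ("binomial", support_binomial)):
        m = bound(n, eps1)
        print(name, round(leading_sample_term(m, eps1, eps2)))
```

Note that the sketch divides by \(\log m\) as in Theorem 7.1, whereas Corollary 1.10 is stated with \(\log n\); the two agree up to constant factors when m is polynomial in n.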

High-Level Idea

Somewhat similar to the lower bound framework developed in Section 6, the gist of the approach is to reduce the problem of tolerant testing membership of D to the class \(\mathcal {C}\) to that of tolerant testing identity to a known distribution, namely the distribution \(\hat {{D}}\) obtained after trying to agnostically learn D. Intuitively, an agnostic learner for \(\mathcal {C}\) should result in a good enough hypothesis \(\hat {{D}}\) (i.e., \(\hat {{D}}\) close enough to both D and \(\mathcal {C}\)) when D is \(\varepsilon _{1}\)-close to \(\mathcal {C}\); but output a \(\hat {{D}}\) that is significantly far from either D or \(\mathcal {C}\) when D is \(\varepsilon _{2}\)-far from \(\mathcal {C}\), sufficiently so for us to be able to tell. Besides the many technical details one has to control for the parameters to work out, one key element is the use of a tolerant testing algorithm for closeness of two distributions due to [52], whose (tight) sample complexity scales as \(n/\log n\) for a domain of size n. In order to get the right dependence on the effective support (required in particular for Corollary 1.10), we have to perform a first test to identify the effective support of the distribution and check its size, in order to only call this tolerant closeness testing algorithm on this much smaller subset. (This additional preprocessing step itself has to be carefully done, and comes at the price of a slightly worse constant C = C(c, κ) in the statement of the theorem.)
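The preprocessing step mentioned above amounts to finding a shortest interval that carries almost all of the probability mass of an explicitly known distribution. A minimal sketch of one way to do this, using a two-pointer scan, is given below; the function name and the 0-indexed convention are assumptions made for illustration, and the paper does not prescribe a particular procedure.

```python
from typing import List, Optional, Tuple


def smallest_heavy_interval(probs: List[float],
                            threshold: float) -> Optional[Tuple[int, int]]:
    """Return (lo, hi), 0-indexed and inclusive, of a shortest interval
    whose total mass strictly exceeds `threshold`, or None if none exists.

    Because all entries are nonnegative, the best left endpoint only
    moves rightwards as the right endpoint grows, so one linear scan
    with two pointers suffices.
    """
    best: Optional[Tuple[int, int]] = None
    lo = 0
    mass = 0.0
    for hi, p in enumerate(probs):
        mass += p
        # Drop elements on the left while the interval stays heavy enough.
        while lo < hi and mass - probs[lo] > threshold:
            mass -= probs[lo]
            lo += 1
        if mass > threshold and (best is None or hi - lo < best[1] - best[0]):
            best = (lo, hi)
    return best


if __name__ == "__main__":
    empirical = [0.05, 0.05, 0.4, 0.4, 0.05, 0.05]
    print(smallest_heavy_interval(empirical, 0.75))
```

The tester would then compare the length of the returned interval with \(M(n,\varepsilon _{1})\) and reject if it is larger.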

7.1 Proof of Theorem 7.1

As described in the preceding section, the algorithm will rely on the ability to perform tolerant testing of equivalence between two unknown distributions (over some known domain of size m). This is ensured by an algorithm of Valiant and Valiant, restated below:

Theorem 7.2 ([52, Theorem 3 and 4])

There exists an algorithm \(\mathcal {E}\) which, given sampling access to two unknown distributions \(D_{1}, D_{2}\) over [m], satisfies the following. On input ε ∈ (0, 1], it takes \(O(\frac {1}{{\varepsilon }^{2}}\frac {m}{\log m} )\) samples from \(D_{1}\) and \(D_{2}\), and outputs a value Δ such that \(\lvert {\lVert {{D}_{1}-{D}_{2}}{\rVert }}_1-{\Delta } \rvert \leq {\varepsilon }\) with probability 1 − 1/poly(m). (Furthermore, \(\mathcal {E}\) runs in time poly(m).)

For the proof, we will also need this fact, similar to Lemma 5.3, which relates the distance of two distributions to that of their conditional distributions on a subset of the domain:

Fact 7.3

Let D and P be distributions over [n], and \(I\subseteq [n]\) an interval such that D(I) ≥ 1 − α and P(I) ≥ 1 − β. Then,

  • \({\lVert {{D}_{I} - P_{I}}{\rVert }}_1 \leq \frac {3}{2}\frac {{\lVert {{D} - P}{\rVert }}_1}{{D}(I)} \leq 3{\lVert {{D} - P}{\rVert }}_1\) (the last inequality for \(\alpha \leq \frac {1}{2}\) ); and

  • \({\lVert {{D}_{I} - P_{I}}{\rVert }}_1 \geq {\lVert {{D} - P}{\rVert }}_1 - 2(\alpha +\beta )\).

Proof

To establish the first item, write:

$$\begin{array}{@{}rcl@{}} {\lVert{{D}_{I} - P_{I}}{\rVert}}_1 &=&\sum\limits_{i\in I}\left\lvert \frac{{D}(i)}{{D}(I)} - \frac{P(i)}{P(I)} \right\rvert = \frac{1}{{D}(I)}\sum\limits_{i\in I}\left\lvert {D}(i) - P(i) + P(i)\left( 1- \frac{{D}(I)}{P(I)}\right) \right\rvert \\ &\leq& \frac{1}{{D}(I)}\left( \sum\limits_{i\in I}\left\lvert {D}(i) - P(i) \right\rvert + \left\lvert 1- \frac{{D}(I)}{P(I)} \right\rvert \sum\limits_{i\in I} P(i) \right)\\ &=& \frac{1}{{D}(I)}\left( \sum\limits_{i\in I}\left\lvert {D}(i) - P(i) \right\rvert + \left\lvert P(I)- {D}(I) \right\rvert \right) \leq \frac{1}{{D}(I)}\left( \sum\limits_{i\in I}\left\lvert {D}(i) - P(i) \right\rvert + \frac{1}{2}{\lVert{{D}-P}{\rVert}}_1 \right)\\ &\leq& \frac{1}{{D}(I)} \cdot \frac{3}{2}{\lVert{{D}-P}{\rVert}}_1 \end{array} $$

where we used the fact that \(\left \lvert P(I)- {D}(I) \right \rvert \leq {\operatorname {d_{\text {TV}}}\!\left ({{D}, P}\right )} = \frac {1}{2}{\lVert {{D}-P}{\rVert }}_1\). Turning now to the second item, we have:

$$\begin{array}{@{}rcl@{}} {\lVert{{D}_{I} - P_{I}}{\rVert}}_1 &=& \frac{1}{{D}(I)}\sum\limits_{i\in I}\left\lvert {D}(i) - P(i) + P(i)\left( 1- \frac{{D}(I)}{P(I)}\right) \right\rvert \geq \frac{1}{{D}(I)}\left( \sum\limits_{i\in I}\left\lvert {D}(i) - P(i) \right\rvert - \left\lvert 1- \frac{{D}(I)}{P(I)} \right\rvert \sum\limits_{i\in I} P(i) \right)\\ &=& \frac{1}{{D}(I)}\left( \sum\limits_{i\in I}\left\lvert {D}(i) - P(i) \right\rvert - \left\lvert P(I)- {D}(I) \right\rvert \right) \geq \frac{1}{{D}(I)}\left( \sum\limits_{i\in I}\left\lvert {D}(i) - P(i) \right\rvert - (\alpha+\beta) \right)\\ &\geq& \frac{1}{{D}(I)}\left( {\lVert{{D}-P}{\rVert}}_1 - \sum\limits_{i\notin I} \left\lvert {D}(i) - P(i) \right\rvert - (\alpha+\beta) \right) \geq \frac{1}{{D}(I)}\left( {\lVert{{D}-P}{\rVert}}_1 - 2(\alpha+\beta) \right) \\ &\geq& {\lVert{{D}-P}{\rVert}}_1 - 2(\alpha+\beta). \end{array} $$
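As a sanity check on Fact 7.3, the sketch below evaluates both inequalities numerically on explicit distributions, taking α = 1 − D(I) and β = 1 − P(I). The helper names and the small floating-point tolerance are assumptions made for illustration; this is a numerical illustration, not a substitute for the proof.

```python
import random
from typing import List, Tuple


def l1_distance(p: List[float], q: List[float]) -> float:
    """l1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def condition_on(p: List[float], lo: int, hi: int) -> Tuple[List[float], float]:
    """Conditional distribution of p on {lo, ..., hi} and the mass p(I)."""
    mass = sum(p[lo:hi + 1])
    return [x / mass for x in p[lo:hi + 1]], mass


def random_distribution(n: int, rng: random.Random) -> List[float]:
    """Random distribution over n points with strictly positive entries."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def check_fact_7_3(d: List[float], p: List[float],
                   lo: int, hi: int, tol: float = 1e-12) -> bool:
    """Check both items of Fact 7.3 for the interval I = {lo, ..., hi}."""
    d_cond, d_mass = condition_on(d, lo, hi)
    p_cond, p_mass = condition_on(p, lo, hi)
    alpha, beta = 1.0 - d_mass, 1.0 - p_mass
    dist = l1_distance(d, p)
    cond_dist = l1_distance(d_cond, p_cond)
    upper_ok = cond_dist <= 1.5 * dist / d_mass + tol
    lower_ok = cond_dist >= dist - 2.0 * (alpha + beta) - tol
    return upper_ok and lower_ok


if __name__ == "__main__":
    rng = random.Random(0)
    d, p = random_distribution(20, rng), random_distribution(20, rng)
    print(check_fact_7_3(d, p, 3, 16))
```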

With these two ingredients, we are in position to establish our theorem:

Proof of Theorem 7.1

The algorithm proceeds as follows, where we set \({\varepsilon }\overset {\text {def}}{=}\frac {{\varepsilon }_{2}-{\varepsilon }_{1}}{17\kappa }\), \(\theta \overset {\text {def}}{=}{\varepsilon }_{2} - ((6+c)\varepsilon _{1}+11\varepsilon )\), and \(\tau \overset {\text {def}}{=}\frac {(3+c)\varepsilon _{1}+5\varepsilon }{2}\):

  1. (1)

    using \(O(\frac {1}{{\varepsilon }^{2}})\) samples, get (with probability at least 1 − 1/10, by Theorem 2.11) a distribution \(\tilde {{D}}\) \(\frac {\varepsilon }{2}\)-close to D in Kolmogorov distance; and let \(I\subseteq [n]\) be the smallest interval such that \(\tilde {{D}}(I) > 1-\frac {3}{2}\varepsilon _{1}-\varepsilon \). Output reject if \(\left \lvert {I} \right \rvert > M(n,\varepsilon _{1})\).

  2. (2)

    invoke \({\mathcal {L}}\) on D with parameters ε and failure probability \(\frac {1}{10}\), to obtain a hypothesis \(\hat {{D}}\);

  3. (3)

    call \(\mathcal {E}\) (from Theorem 7.2) on \({D}_{I}\), \(\hat {{D}}_{I}\) with parameter \(\frac {\varepsilon }{6}\) to get an estimate \(\hat {\Delta }\) of \(\|{{D}_{I}-\hat {{D}}_{I}}\|_{1}\);

  4. (4)

    output reject if \(\hat {{D}}(I) < 1-\tau \);

  5. (5)

    compute “offline” (an estimate accurate within ε of) \(\ell _{1}(\hat {{D}},\mathcal {C})\), denoted Δ;

  6. (6)

    output reject if \({\Delta }+\hat {\Delta } > \theta \), and output accept otherwise.

The claimed sample complexity is immediate from Steps (2) and (3), along with Theorem 7.2. Turning to correctness, we condition on both subroutines meeting their guarantee (i.e., \(\|{{D}-\hat {{D}}}\|_{1} \leq c\cdot {\textsc {opt}}+\varepsilon \) and \(\|{{D}_{I}-\hat {{D}}_{I}}\|_{1} \in [\hat {\Delta } - \varepsilon ,\hat {\Delta } + \varepsilon ]\)), which happens with probability at least 8/10 − 1/poly(n) ≥ 3/4 by a union bound.

Completeness

If \(\ell _{1}({D},{\mathcal {C}}) \leq {\varepsilon }_{1}\), then D is \(\varepsilon _{1}\)-close to some \(P\in {\mathcal {C}}\), for which there exists an interval \(J\subseteq [n]\) of size at most \(M(n, \varepsilon _{1})\) such that \(P(J) \geq 1-\varepsilon _{1}\). It follows that \(D(J) \geq 1-\frac {3}{2}\varepsilon _{1}\) (since \(\left \lvert {{D}(J)-P(J)} \right \rvert \leq \frac {\varepsilon _{1}}{2}\)) and \(\tilde {{D}}(J) \geq 1-\frac {3}{2}\varepsilon _{1}-2\cdot \frac {\varepsilon }{2}\), which establishes the existence of a good interval I to be found (so Step (1) does not end with reject). Additionally, \(\|{{D}-\hat {{D}}}\|_{1} \leq c\cdot \varepsilon _{1}+\varepsilon \) and by the triangle inequality this implies \(\ell _{1}(\hat {{D}},\mathcal {C}) \leq (1+c)\varepsilon _{1}+\varepsilon \).

Moreover, as \(D(I) \geq \tilde {{D}}(I) - 2\cdot \frac {{\varepsilon }}{2} \geq 1-\frac {3}{2}{\varepsilon }_{1}-2{\varepsilon }\) and \(\left \lvert \hat {{D}}(I) - {D}(I) \right \rvert \leq \frac {1}{2}{\lVert {{D}-\hat {{D}}}{\rVert }}_1\), we do have

$$\hat{{D}}(I) \geq 1-\frac{3}{2}{\varepsilon}_{1}-2{\varepsilon} - \frac{c{\varepsilon}_{1}}{2}-\frac{{\varepsilon}}{2} = 1-\tau $$

and the algorithm does not reject in Step (4). To conclude, one has by Fact 7.3 that

$${\lVert{{D}_{I}-\hat{{D}}_{I}}{\rVert}}_1 \leq \frac{3}{2}\frac{{\lVert{{D}-\hat{{D}}}{\rVert}}_1}{{D}(I)} \leq \frac{3}{2}\frac{(c{\varepsilon}_{1}+{\varepsilon})}{1-\frac{3}{2}{\varepsilon}_{1}-2{\varepsilon}} \leq 3(c{\varepsilon}_{1}+{\varepsilon})\quad (\text{for}~{\varepsilon}_{1} < 1/4, ~\text{as}~{\varepsilon} < 1/17) $$

Therefore, \({\Delta }+\hat {\Delta } \leq \ell _{1}(\hat {{D}},{\mathcal {C}})+{\varepsilon } + {\lVert {{D}_{I}-\hat {{D}}_{I}}{\rVert }}_1+{\varepsilon } \leq (4c+1){\varepsilon }_{1}+6{\varepsilon } \leq {\varepsilon }_{2} - ((6+c){\varepsilon }_{1}+11{\varepsilon }) = \theta \) (the last inequality by the assumption on \(\varepsilon _{2}, \varepsilon _{1}\)), and the tester accepts.
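The last inequality is where the constant \(C=1+(5c+6)\frac {\kappa }{\kappa -1}\) of Theorem 7.1 comes from: substituting \(17\varepsilon = (\varepsilon _{2}-\varepsilon _{1})/\kappa \), the condition \((4c+1)\varepsilon _{1}+6\varepsilon \leq \theta \) rearranges to \(\varepsilon _{2} \geq C\varepsilon _{1}\). The following sketch checks this with exact rational arithmetic; the function names are illustrative assumptions.

```python
from fractions import Fraction


def tolerance_constant(c: Fraction, kappa: Fraction) -> Fraction:
    """C = 1 + (5c + 6) * kappa / (kappa - 1), as in Theorem 7.1."""
    return 1 + (5 * c + 6) * kappa / (kappa - 1)


def completeness_slack(eps1: Fraction, eps2: Fraction,
                       c: Fraction, kappa: Fraction) -> Fraction:
    """theta - ((4c + 1) * eps1 + 6 * eps); nonnegative iff the tester's
    acceptance threshold dominates the completeness bound."""
    eps = (eps2 - eps1) / (17 * kappa)
    theta = eps2 - ((6 + c) * eps1 + 11 * eps)
    return theta - ((4 * c + 1) * eps1 + 6 * eps)


if __name__ == "__main__":
    c, kappa = Fraction(2), Fraction(2)
    big_c = tolerance_constant(c, kappa)
    eps1 = Fraction(1, 1000)
    # The slack vanishes exactly at eps2 = C * eps1.
    print(big_c, completeness_slack(eps1, big_c * eps1, c, kappa))
```

With c = κ = 2 this gives C = 33, and the slack changes sign exactly at \(\varepsilon _{2} = 33\,\varepsilon _{1}\).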

Soundness

If \(\ell _{1}({D},{\mathcal {C}}) > {\varepsilon }_{2}\), then we must have \({\lVert {{D}-\hat {{D}}}{\rVert }}_1 + \ell _{1}(\hat {{D}},\mathcal {C}) > \varepsilon _{2}\). If the algorithm does not already reject in Step (4), then \(\hat {{D}}(I) \geq 1-\tau \). But, by Fact 7.3,

$$\begin{array}{@{}rcl@{}} {\lVert{{D}_{I}-\hat{{D}}_{I}}{\rVert}}_1 &\geq& {\lVert{ {D} - \hat{{D}} }{\rVert}}_1 - 2({D}(I^{c}) + \hat{{D}}(I^{c})) \geq {\lVert{ {D} - \hat{{D}} }{\rVert}}_1 - 2\left( \frac{3}{2}\varepsilon_{1} + 2\varepsilon+\tau\right) \\ &=& {\lVert{ {D} - \hat{{D}} }{\rVert}}_1 - ((6+c){\varepsilon}_{1}+9{\varepsilon}) \end{array} $$

we then have \({\lVert {{D}_{I}-\hat {{D}}_{I}}{\rVert }}_1 + \ell _{1}(\hat {{D}},{\mathcal {C}}) > {\varepsilon }_{2} - ((6+c){\varepsilon }_{1}+9{\varepsilon })\). This implies \({\Delta }+\hat {\Delta } > {\varepsilon }_{2} - ((6+c){\varepsilon }_{1}+9{\varepsilon }) -2{\varepsilon } = {\varepsilon }_{2} - ((6+c){\varepsilon }_{1}+11{\varepsilon }) = \theta \), and the tester rejects. Finally, the testing algorithm defined above is computationally efficient as long as both the learning algorithm (Step (2)) and the estimation procedure (Step (5)) are. □