1 Introduction

Multiple hypothesis testing simply refers to any instance in which more than one null hypothesis is tested simultaneously. While this problem is pervasive throughout all empirical work in economics, we focus on the analysis of data from experiments in economics. In this setting, different null hypotheses arise naturally for at least three different reasons: when there are multiple outcomes of interest and it is desired to determine on which of these outcomes a treatment has an effect; when the effect of a treatment may be heterogeneous in that it varies across subgroups defined by observed characteristics (e.g., gender or age) and it is desired to determine for which of these subgroups a treatment has an effect; and finally when there are multiple treatments of interest and it is desired to determine which treatments have an effect relative to either the control or relative to each of the other treatments.

Testing multiple null hypotheses for each of these three reasons is ubiquitous in the analysis of experimental data. Anderson (2008), for example, reports that 84% of experiments published from 2004 to 2006 in a set of social sciences field journals examine five or more outcomes simultaneously and 61% examine ten or more outcomes simultaneously. Specific examples include many studies of early childhood interventions, such as the Abecedarian and Perry pre-school programs, which collected data on a large variety of outcomes pertaining to educational attainment, employment, and criminal behavior, among others. Similarly, Fink et al. (2014) report that 76% of field experiments published in leading economics journals examine multiple subgroups and 29% examine ten or more subgroups. Specific examples include analyses of how the effects of competition may vary by gender (Gneezy et al. 2003; Niederle and Vesterlund 2007; Flory et al. 2015b) or age (Sutter and Glätzle-Rützler 2014; Flory et al. 2015a). Multiple treatments are also commonplace in experiments. For instance, the recent economics literature has studied how different incentive schemes affect a variety of outcomes including worker productivity (Hossain and List 2012), child food choice and consumption (List and Samek 2015), and educational performance (Levitt et al. 2012).

With a few exceptions, some of which we note below, it is uncommon for the analyses of these data to account for multiple hypothesis testing. As a result, the probability of a false rejection may be much higher than desired. To illustrate this point, consider testing N null hypotheses simultaneously. Suppose that for each null hypothesis a p value is available whose distribution is uniform on the unit interval when the corresponding null hypothesis is true. Suppose further that all null hypotheses are true and that the p values are independent. In this case, if we were to test each null hypothesis in the usual way at level \(\alpha \in (0,1)\), then the probability of one or more false rejections equals \(1 - (1 - \alpha )^N\), which may be much greater than \(\alpha\) and in fact tends rapidly to one as N increases. For instance, with \(\alpha = 0.05\), it equals 0.226 when \(N = 5\), 0.401 when \(N = 10\), and 0.994 when \(N = 100\). In order to control the probability of a false rejection, it is therefore important to account appropriately for the multiplicity of null hypotheses being tested.
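The arithmetic behind these figures is easy to verify; the short Python check below (ours, not part of the original analysis) reproduces the numbers quoted above.

```python
# Probability of at least one false rejection when N independent true null
# hypotheses are each tested at level alpha: 1 - (1 - alpha)^N.
def familywise_error_rate(alpha, n_tests):
    return 1 - (1 - alpha) ** n_tests

for n in (5, 10, 100):
    # prints 0.226, 0.401, 0.994 for alpha = 0.05
    print(n, round(familywise_error_rate(0.05, n), 3))
```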

In this paper, we provide a bootstrap-based procedure for testing these null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units. Formally, we establish our results by applying the general results in Romano and Wolf (2010). In particular, we show under weak assumptions that our procedure (1) asymptotically controls the familywise error rate—the probability of one or more false rejections—and (2) is asymptotically balanced in that the marginal probabilities of rejecting the true null hypotheses are approximately equal to one another in large samples. Importantly, by incorporating information about dependence that is ignored by classical multiple testing procedures, such as the Bonferroni (1935) and Holm (1979) corrections, our procedure has much greater ability to detect truly false null hypotheses. In the presence of multiple treatments, we additionally show how to exploit logical restrictions across null hypotheses to further improve power. See Remark 3.7 for further discussion of this point.

As mentioned previously, it is uncommon in the experimental economics literature for authors to account for the multiplicity of null hypotheses being tested. Some notable exceptions include Kling et al. (2007), who use a more restrictive resampling-based multiple testing procedure due to Westfall and Young (1993), and Anderson (2008), Heckman et al. (2010), Heckman et al. (2011), and Lee and Shaikh (2014), who combine randomization methods with results in Romano and Wolf (2005) to construct multiple testing procedures with finite-sample validity for testing a more restrictive family of null hypotheses. Perhaps most importantly, none of these papers consider null hypotheses emerging due to multiple treatments, which, as noted above, is a very common occurrence in experiments in economics.

The remainder of our paper is organized as follows. In Sect. 2, we introduce our setup and notation as well as the assumptions under which we will establish the validity of our multiple testing procedure. Section 3 describes our multiple testing procedure and establishes its validity. In Sect. 4, we apply our methodology to data originally presented in Karlan and List (2007), who study the economics of charity by measuring, among other things, the effectiveness of a matching grant on charitable giving. Section 5 concludes. Proofs of all results can be found in the Appendix.

2 Setup and notation

For \(k\in {\mathcal {K}}\), let \(Y_{i,k}\) denote the (observed) kth outcome of interest for the ith unit, \(D_{i}\) denote treatment status for the ith unit, and \(Z_{i}\) denote observed, baseline covariates for the ith unit. Further denote by \({\mathcal {D}}\) and \({\mathcal {Z}}\) the supports of \(D_{i}\) and \(Z_{i}\), respectively. For \(d\in {\mathcal {D}}\), let \(Y_{i,k}(d)\) be the kth potential outcome for the ith unit if treatment status were (possibly counterfactually) set equal to d. As usual, the kth observed outcome and kth potential outcome are related to treatment status by the relationship

$$Y_{i,k}=\sum _{d\in {\mathcal {D}}}Y_{i,k}(d)I\{D_{i}=d\}.$$

It is useful to introduce the shorthand notation \(Y_{i}=(Y_{i,k}:k\in {\mathcal {K}})\) and \(Y_{i}(d)=(Y_{i,k}(d):k\in {\mathcal {K}})\). We assume that \(((Y_{i}(d):d\in {\mathcal {D}}),D_{i},Z_{i}),i=1,\ldots ,n\) are i.i.d. with distribution \(Q\in \varOmega\), where our requirements on \(\varOmega\) are specified below. It follows that the observed data \((Y_{i},D_{i},Z_{i}),i=1,\ldots ,n\) are i.i.d. with distribution \(P=P(Q)\). Denote by \({\hat{P}}_{n}\) the empirical distribution of the observed data.

The family of null hypotheses of interest is indexed by

$$s\in {\mathcal {S}}\subseteq \{(d,d',z,k):d\in {\mathcal {D}},d'\in {\mathcal {D}},z\in {\mathcal {Z}},k\in {\mathcal {K}}\}.$$

For each \(s\in {\mathcal {S}}\), define

$$\omega _{s}=\{Q\in \varOmega :E_{Q}\left[ Y_{i,k}(d)-Y_{i,k}(d')|Z_{i}=z\right] =0\}.$$

Using this notation, the family of null hypotheses of interest is given by

$$H_{s}:Q\in \omega _{s}\text { for }s\in {\mathcal {S}}.$$
(1)

In other words, the sth null hypothesis specifies that the average effect of treatment d on the kth outcome of interest for the subpopulation where \(Z_{i}=z\) equals the average effect of treatment \(d'\) on the kth outcome of interest for the subpopulation where \(Z_{i}=z\). For later use, denote by \({\mathcal {S}}_{0}(Q)\) the subset of \({\mathcal {S}}\) corresponding to true null hypotheses, i.e.,

$${\mathcal {S}}_{0}(Q)=\{s\in {\mathcal {S}}:Q\in \omega _{s}\}.$$

Our goal is to construct a procedure for testing these null hypotheses in a way that ensures asymptotic control of the familywise error rate for each \(Q\in \varOmega\). More precisely, we require for each \(Q\in \varOmega\) that

$$\limsup _{n\rightarrow \infty }FWER_{Q}\le \alpha$$
(2)

for a pre-specified value of \(\alpha \in (0,1)\), where

$$FWER_{Q}=Q\{\text {reject any }H_{s}\text { with }s\in {\mathcal {S}}_{0}(Q)\}.$$
(3)

The notation \(FWER_{Q}\) is intended to reflect the fact that the quantity on the right-hand side of (3) is the familywise error rate computed under Q. We additionally require that the testing procedure is “balanced” in that for each \(Q\in \varOmega\),

$$\lim _{n\rightarrow \infty }Q\{\text {reject }H_{s}\}=\lim _{n\rightarrow \infty }Q\{\text {reject }H_{s'}\}\text { for any }s\text { and }s'\text { in }{\mathcal {S}}_{0}(Q).$$
(4)

We impose the requirement of “balance” to avoid situations where some (true) null hypotheses may be more likely to be rejected than other (true) null hypotheses for reasons that are viewed as undesirable, such as some outcomes taking on much larger values than other outcomes.

We now describe our main requirements on \(\varOmega\). The assumptions make use of the notation

$$\begin{aligned} \mu _{k|d,z}(Q)&=E_{Q}[Y_{i,k}(d)|D_{i}=d,Z_{i}=z]\\ \sigma _{k|d,z}^{2}(Q)&=\text {Var}_{Q}[Y_{i,k}(d)|D_{i}=d,Z_{i}=z]. \end{aligned}$$

Assumption 2.1

For each \(Q\in \varOmega\),

$$(Y_{i}(d):d\in {\mathcal {D}})\perp \!\!\! \perp D_{i}|Z_{i}$$

under Q.

Assumption 2.2

For each \(Q\in \varOmega\), \(k\in {\mathcal {K}}\), \(d\in {\mathcal {D}}\) and \(z\in {\mathcal {Z}}\),

$$0<\sigma _{k|d,z}^{2}(Q)=\text {Var}_{Q}[Y_{i,k}(d)|D_{i}=d,Z_{i}=z]<\infty .$$

Assumption 2.3

For each \(Q\in \varOmega\), there is \(\epsilon >0\) such that

$$Q\{D_{i}=d,Z_{i}=z\}>\epsilon$$
(5)

for all \(d\in {\mathcal {D}}\) and \(z\in {\mathcal {Z}}\).

Assumption 2.1 simply requires that treatment status was randomly assigned. Assumption 2.2 is a mild non-degeneracy requirement. Assumption 2.3 simply requires that both \(D_{i}\) and \(Z_{i}\) are discrete random variables (with finite supports).

Remark 2.1

Note that we have assumed in particular that treatment status \(D_{i},i=1,\ldots ,n\) is i.i.d. While this assumption accommodates situations in which treatment status is assigned according to simple random sampling, it does not accommodate more complicated treatment assignment rules, such as those in which treatment status is assigned in order to “balance” baseline covariates among the subsets of individuals with different treatment status. For a discussion of such treatment assignment rules and the implications for inference about the average treatment effect, see Bugni et al. (2015).□

Remark 2.2

When \({\mathcal {S}}\) is very large, requiring control of the familywise error rate may significantly limit the ability to detect genuinely false null hypotheses. For this reason, it may be desirable in such situations to relax control of the familywise error rate in favor of generalized error rates that penalize false rejections less severely. Examples of such error rates include: the m-familywise error rate, defined to be the probability of m or more false rejections; the tail probability of the false discovery proportion, defined to be the probability that the fraction of false rejections among all rejections (understood to be zero if there are no rejections at all) exceeds a pre-specified value; and the false discovery rate, defined to be the expected value of the false discovery proportion. Resampling-based control of the m-familywise error rate and the tail probability of the false discovery proportion is discussed in Romano et al. (2008b) and Romano and Wolf (2010). For procedures based only on (multiplicity-unadjusted) p values, see Lehmann and Romano (2005) and Romano and Shaikh (2006a, b). For resampling-based control of the false discovery rate, see Romano et al. (2008a).□

3 A stepwise multiple testing procedure

In this section, we describe a stepwise multiple testing procedure for testing (1) in a way that satisfies (2) and (4) for any \(Q\in \varOmega\). In order to do so, we first require some additional notation. To this end, define the “unbalanced” test statistic for \(H_{s}\),

$$T_{s,n}=\sqrt{n}\left| \frac{1}{n_{d,z}}\sum _{1\le i\le n:D_{i}=d,Z_{i}=z}Y_{i,k}-\frac{1}{n_{d',z}}\sum _{1\le i\le n:D_{i}=d',Z_{i}=z}Y_{i,k}\right| ,$$
(6)

and its re-centered version

$${\tilde{T}}_{s,n}(P)=\sqrt{n}\left| \frac{1}{n_{d,z}}\sum _{1\le i\le n:D_{i}=d,Z_{i}=z}(Y_{i,k}-{\tilde{\mu }}_{k|d,z}(P))-\frac{1}{n_{d',z}}\sum _{1\le i\le n:D_{i}=d',Z_{i}=z}(Y_{i,k}-{\tilde{\mu }}_{k|d',z}(P))\right| ,$$
(7)

where

$${\tilde{\mu }}_{k|d,z}(P)=E_{P}[Y_{i,k}|D_{i}=d,Z_{i}=z].$$

Next, for \(s\in {\mathcal {S}}\), define

$$J_{n}(x,s,P)=P\left\{ {\tilde{T}}_{s,n}(P)\le x\right\}.$$

Note that \(J_{n}(x,s,P)\) is simply the distribution of (6) when \(H_s\) is true. In order to achieve “balance,” rather than reject \(H_{s}\) for large values of \(T_{s,n}\), we reject \(H_{s}\) for large values of

$$J_{n}(T_{s,n},s,{\hat{P}}_{n}).$$
(8)

Note that (8) is simply one minus a (multiplicity-unadjusted) bootstrap p value for testing \(H_{s}\) based on \(T_{s,n}\). Finally, for \({\mathcal {S}}'\subseteq {\mathcal {S}}\), let

$$L_{n}(x,{\mathcal {S}}',P)=P\left\{ \max _{s\in {\mathcal {S}}'}J_{n}({\tilde{T}}_{s,n}(P),s,P)\le x\right\} .$$

Note that \(L_{n}(x,{\mathcal {S}}',P)\) is the distribution of the maximum of (8) over \(s \in {\mathcal {S}}'\) when \(H_s\) is true for all \(s \in {\mathcal {S}}'\). Using this notation, we may describe our proposed stepwise multiple testing procedure as follows:

Algorithm 3.1

  • Step 0. Set \({\mathcal {S}}_{1}={\mathcal {S}}\).

  •               \(\vdots\)

  • Step j. If \({\mathcal {S}}_{j}=\emptyset\) or

    $$\max _{s\in {\mathcal {S}}_{j}}J_{n}(T_{s,n},s,{\hat{P}}_{n})\le L_{n}^{-1}(1-\alpha ,{\mathcal {S}}_{j},{\hat{P}}_{n}),$$

    then stop. Otherwise, reject any \(H_{s}\) with \(J_{n}(T_{s,n},s,{\hat{P}}_{n})>L_{n}^{-1}(1-\alpha ,{\mathcal {S}}_{j},{\hat{P}}_{n})\), set

    $${\mathcal {S}}_{j+1}=\{s\in {\mathcal {S}}_{j}:J_{n}(T_{s,n},s,{\hat{P}}_{n})\le L_{n}^{-1}(1-\alpha ,{\mathcal {S}}_{j},{\hat{P}}_{n})\},$$

    and continue to the next step.

  •         \(\vdots\)
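To make the mechanics of Algorithm 3.1 concrete, the following Python sketch implements the step-down loop. It assumes the bootstrap quantities are precomputed: `j_obs[s]` approximates \(J_{n}(T_{s,n},s,{\hat{P}}_{n})\) and `j_boot[b, s]` is the analogous quantity evaluated at the b-th bootstrap draw of the re-centered statistic, so that the critical value at each step is the \(1-\alpha\) quantile of the maximum over the active set. All names are ours and the code is illustrative, not the authors' implementation.

```python
import numpy as np

# A sketch (ours) of the step-down loop in Algorithm 3.1. Inputs are assumed
# precomputed: j_obs[s] approximates J_n(T_{s,n}, s, P_hat), and j_boot[b, s]
# is the same quantity evaluated at the b-th bootstrap draw of the
# re-centered statistic.
def stepdown_reject(j_obs, j_boot, alpha=0.05):
    """Return a boolean array marking which hypotheses are rejected."""
    j_obs = np.asarray(j_obs, dtype=float)
    active = np.ones(len(j_obs), dtype=bool)    # Step 0: S_1 = S
    rejected = np.zeros(len(j_obs), dtype=bool)
    while active.any():
        # Critical value: (1 - alpha) quantile of the max over the active set,
        # i.e., an approximation to L_n^{-1}(1 - alpha, S_j, P_hat).
        crit = np.quantile(j_boot[:, active].max(axis=1), 1 - alpha)
        newly = active & (j_obs > crit)
        if not newly.any():
            break                               # stop: nothing exceeds the cutoff
        rejected |= newly
        active &= ~newly                        # S_{j+1}: surviving hypotheses
    return rejected
```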

The following theorem describes the asymptotic behavior of our proposed multiple testing procedure.

Theorem 3.1

Consider the procedure for testing (1) given by Algorithm 3.1. Under Assumptions 2.1–2.3, Algorithm 3.1 satisfies (2) and (4) for any \(Q\in \varOmega\).

Remark 3.1

If \({\mathcal {S}}=\{s\}\), i.e., \({\mathcal {S}}\) is a singleton, then the familywise error rate is simply the usual probability of a Type I error. Hence, Algorithm 3.1 provides asymptotic control of the probability of a Type I error. In this case, Algorithm 3.1 is equivalent to the usual bootstrap test of \(H_{s}\), i.e., the test that rejects \(H_{s}\) whenever \(T_{s,n}>J_{n}^{-1}(1-\alpha ,s,{\hat{P}}_{n})\).□

Remark 3.2

As noted above, \({\hat{p}}_{s,n}=1-J_{n}(T_{s,n},s,{\hat{P}}_{n})\) may be interpreted as a bootstrap p value for testing \(H_{s}\). Indeed, for any \(Q\in \omega _{s}\), it is possible to show that

$$\limsup _{n\rightarrow \infty }Q\{{\hat{p}}_{s,n}\le u\}\le u$$

for any \(0<u<1\). A crude solution to the multiplicity problem would therefore be to apply a Bonferroni or Holm correction to these p values. By replacing \(L_{n}^{-1}(1-\alpha ,{\mathcal {S}}_{j},{\hat{P}}_{n})\) with a suitable choice of critical value, it is possible to describe both the Bonferroni and Holm corrections in terms of Algorithm 3.1. The Bonferroni correction may be obtained by applying Algorithm 3.1 with \(1 - \frac{\alpha }{|{\mathcal {S}}|}\) in place of \(L_{n}^{-1}(1-\alpha ,{\mathcal {S}}_{j},{\hat{P}}_{n})\), whereas the Holm correction, first described in Holm (1979), may be obtained by applying Algorithm 3.1 with \(1 - \frac{\alpha }{|{\mathcal {S}}_j|}\) in place of \(L_{n}^{-1}(1-\alpha ,{\mathcal {S}}_{j},{\hat{P}}_{n})\). Such approaches would indeed satisfy (2), as desired, but implicitly rely upon a “least favorable” dependence structure among the p values. To the extent that the true dependence structure differs from this “least favorable” one, improvements may be possible. Algorithm 3.1 uses the bootstrap to implicitly incorporate information about the dependence structure when deciding which null hypotheses to reject. In fact, Algorithm 3.1 will always reject at least as many null hypotheses as these procedures.□
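For comparison, the Bonferroni and Holm corrections mentioned in this remark are simple enough to state directly in code; the sketch below (ours) computes multiplicity-adjusted p values from a list of unadjusted ones.

```python
# Bonferroni and Holm corrections (ours, for comparison): both turn a list of
# multiplicity-unadjusted p values into multiplicity-adjusted ones.
def bonferroni(pvals):
    """Multiply every p value by the number of hypotheses, capped at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]

def holm(pvals):
    """Multiply the j-th smallest p value by (n - j + 1), enforcing
    monotonicity of the adjusted p values in the original ordering."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adj = [0.0] * n
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, pvals[i] * (n - rank)))
        adj[i] = running_max
    return adj
```

An adjusted p value below \(\alpha\) then corresponds to a rejection at level \(\alpha\); because Holm's multipliers shrink with the rank, its adjusted p values are never larger than Bonferroni's.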

Remark 3.3

Implementation of Algorithm 3.1 typically requires approximating the quantities \(J_{n}(x,s,{\hat{P}}_{n})\) and \(L_{n}(x,{\mathcal {S}}',{\hat{P}}_{n})\) using simulation. As noted by Romano and Wolf (2010), doing so does not require nested bootstrap simulations. To explain further, for \(b=1,\ldots ,B\), draw a sample of size n from \({\hat{P}}_{n}\) and denote by \({\tilde{T}}_{s,n}^{*,b}({\hat{P}}_{n})\) the quantity \({\tilde{T}}_{s,n}(P)\) using the bth resample and \({\hat{P}}_{n}\) as an estimate of P. Then, \(J_{n}(x,s,{\hat{P}}_{n})\) may be approximated as

$${\hat{J}}_{n}(x,s,{\hat{P}}_{n})=\frac{1}{B}\sum _{1\le b\le B}I\{{\tilde{T}}_{s,n}^{*,b}({\hat{P}}_{n})\le x\}$$

and \(L_{n}(x,{\mathcal {S}}',{\hat{P}}_{n})\) may be approximated as

$${\hat{L}}_{n}(x,{\mathcal {S}}',{\hat{P}}_{n})=\frac{1}{B}\sum _{1\le b\le B}I\left\{ \max _{s \in {\mathcal {S}}'}{\hat{J}}_{n}({\tilde{T}}_{s,n}^{*,b}({\hat{P}}_{n}),s,{\hat{P}}_{n})\le x\right\} .$$

In particular, the same set of bootstrap resamples may be used in the two approximations.□
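A minimal Python sketch of this remark, under the assumption that the B re-centered bootstrap statistics have been collected in a matrix `t_boot[b, s]` (our notation): both \({\hat{J}}_{n}\) and the critical value based on \({\hat{L}}_{n}\) are then computed from that single set of resamples, with no nested simulation.

```python
import numpy as np

# A sketch of Remark 3.3 (names ours): t_boot[b, s] holds the re-centered
# statistic for hypothesis s computed from the b-th bootstrap resample.
def j_hat(x, s, t_boot):
    """Empirical bootstrap CDF: fraction of resamples with statistic <= x."""
    return np.mean(t_boot[:, s] <= x)

def l_hat_crit(t_boot, subset, alpha):
    """(1 - alpha) quantile of max_s J_hat evaluated at each resample's own
    draws, i.e., an approximation to L_n^{-1}(1 - alpha, S', P_hat)."""
    B = t_boot.shape[0]
    # Apply the marginal bootstrap CDF to each bootstrap draw itself,
    # reusing the same B resamples that define J_hat.
    j_of_draws = np.array([[j_hat(t_boot[b, s], s, t_boot) for s in subset]
                           for b in range(B)])
    return np.quantile(j_of_draws.max(axis=1), 1 - alpha)
```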

Remark 3.4

In terms of higher-order asymptotic properties, it is often desirable to studentize, i.e., to replace \(T_{s,n}\) and \({\tilde{T}}_{s,n}(P)\), respectively, with

$$\begin{aligned} T_{s,n}^{\mathrm{stud}} &= \frac{T_{s,n}}{\sqrt{n\cdot \left( \frac{{\tilde{\sigma }}_{k|d,z}^{2}({\hat{P}}_{n})}{n_{d,z}}+\frac{{\tilde{\sigma }}_{k|d',z}^{2}({\hat{P}}_{n})}{n_{d',z}}\right) }}\\ {\tilde{T}}_{s,n}^{\mathrm{stud}}(P) &= \frac{{\tilde{T}}_{s,n}(P)}{\sqrt{n\cdot \left( \frac{{\tilde{\sigma }}_{k|d,z}^{2}({\hat{P}}_{n})}{n_{d,z}}+\frac{{\tilde{\sigma }}_{k|d',z}^{2}({\hat{P}}_{n})}{n_{d',z}}\right) }}, \end{aligned}$$

where

$${\tilde{\sigma }}_{k|d,z}^{2}(P)=\text {Var}_{P}[Y_{i,k}|D_{i}=d,Z_{i}=z].$$

Theorem 3.1 continues to hold with these changes.□

Remark 3.5

In some cases, it may be of interest to consider one-sided null hypotheses, e.g., \(H_{s}^{-}:Q\in \omega _{s}^{-}\), where

$$\omega _{s}^{-}=\{Q\in \varOmega :E_{Q}[Y_{i,k}(d)-Y_{i,k}(d')|Z_{i}=z]\le 0\}.$$
(9)

In this case, it suffices simply to replace \(T_{s,n}\) and \({\tilde{T}}_{s,n}(P)\) with \(T_{s,n}^{-}\) and \({\tilde{T}}_{s,n}^{-}(P)\), which are defined as in (6) and (7), respectively, but without the absolute values. An analogous modification can be made for null hypotheses \(H_{s}^{+}:Q\in \omega _{s}^{+}\), where \(\omega _{s}^{+}\) is defined as in (9), but with the inequality reversed.□

Remark 3.6

Note that a multiplicity-adjusted p value for \(H_{s}\), \({\hat{p}}_{s,n}^{\mathrm{adj}}\), may be computed simply as the smallest value of \(\alpha\) for which \(H_{s}\) is rejected in Algorithm 3.1.□

Remark 3.7

It is possible to improve Algorithm 3.1 by exploiting transitivity (i.e., \(\mu _{k|d,z}(Q)=\mu _{k|d',z}(Q)\) and \(\mu _{k|d',z}(Q)=\mu _{k|d'',z}(Q)\) implies that \(\mu _{k|d,z}(Q)=\mu _{k|d'',z}(Q)\)). To this end, for \({\mathcal {S}}'\subseteq {\mathcal {S}}\), define

$${\mathbb {S}}({\mathcal {S}}')=\{{\mathcal {S}}''\subseteq {\mathcal {S}}':\exists ~Q\in \varOmega \text { s.t. }{\mathcal {S}}''={\mathcal {S}}_{0}(Q)\}$$

and replace \(L_{n}^{-1}(1-\alpha ,{\mathcal {S}}_{j},{\hat{P}}_{n})\) in Algorithm 3.1 with

$$\max _{\mathcal {{\tilde{S}}}\in {\mathbb {S}}({\mathcal {S}}_{j})}L_{n}^{-1}(1-\alpha ,\mathcal {{\tilde{S}}},{\hat{P}}_{n}).$$

With this modification to Algorithm 3.1, Theorem 3.1 remains valid. Note that this modification is only non-trivial when there are more than two treatments and may be computationally prohibitive when there are more than a few treatments.□

Remark 3.8

Note that we only require that the familywise error rate is asymptotically no greater than \(\alpha\) for each \(Q\in \varOmega\). By appropriately strengthening the assumptions of Theorem 3.1, it is possible to show that Algorithm 3.1 satisfies

$$\limsup _{n\rightarrow \infty }\sup _{Q\in \varOmega }FWER_{Q}\le \alpha .$$

In particular, it suffices to replace Assumption 2.2 with a mild uniform integrability requirement and require in Assumption 2.3 that there exists \(\epsilon >0\) for which (5) holds for all \(Q\in \varOmega\), \(d\in {\mathcal {D}}\) and \(z\in {\mathcal {Z}}\). Relevant results for establishing this claim can be found in Romano and Shaikh (2012), Bhattacharya et al. (2012), and Machado et al. (2013). □

4 Empirical applications

In this section, we apply our methodology to data originally presented in Karlan and List (2007), who use direct mail solicitations targeted to previous donors of a nonprofit organization to study the effectiveness of a matching grant on charitable giving. The sample includes all 50,083 individuals who had given to the organization at least once since 1991. Each individual was independently assigned with probability two-thirds to a treatment group (resulting in 33,396, or 67 percent of the sample, being treated) and with probability one-third to a control group (resulting in 16,687 subjects, or 33 percent of the sample, being untreated). Individuals in the treatment group were offered independently and with equal probability one of 36 possible matching grants whose terms varied along three dimensions: three possible values for the price ratio of the match, four possible values for the maximum size of the matching gift across all donations, and three possible values for the suggested donation amount. The possible values for the price ratio of the match were $1:$1, $2:$1, and $3:$1. Here, an $X:$1 ratio means that for every dollar the individual donates, the matching donor also contributes $X. Hence, the charity receives $X+1 for every $1 the individual donates (subject to the maximum size of the matching gift across all donations). The possible values for the maximum matching grant amount were $25,000, $50,000, $100,000, and “unstated.” The possible values for the (individual-specific) suggested donation amounts were the individual’s highest previous contribution, 1.25 times the highest previous contribution, and 1.50 times the highest previous contribution.

In the following three subsections, we first consider testing families of null hypotheses that emerge in this application due to multiple outcomes alone, multiple subgroups alone and multiple treatments alone. In the final subsection, we then consider testing the family of null hypotheses that emerges by combining all three considerations at the same time. In each case, we consider inference based on Theorem 3.1 using the studentized test statistics described in Remark 3.4. We also compare our results with those obtained using the classical Bonferroni and Holm multiple testing procedures. Stata and Matlab code used to produce these results can be found at the following address: https://github.com/seidelj/mht.

4.1 Multiple outcomes

Four outcomes of interest in Karlan and List (2007) are the response rate, dollars given not including the matching amount, dollars given including the matching amount, and the change in the amount given (not including the matching amount). A more detailed description of these variables can be found in Karlan and List (2007). Table 1 displays, for each of these four outcomes of interest, the following five quantities: the difference in means between treated and untreated groups, a (multiplicity-unadjusted) p value computed using Remark 3.1, a (multiplicity-adjusted) p value computed using Theorem 3.1, a (multiplicity-adjusted) p value obtained by applying Bonferroni to the (multiplicity-unadjusted) p values, and a (multiplicity-adjusted) p value obtained by applying Holm to the (multiplicity-unadjusted) p values. Following Remark 3.2, the (multiplicity-adjusted) p values obtained by applying Bonferroni can be calculated simply by multiplying the (multiplicity-unadjusted) p values by the total number of hypotheses in Table 1. Similarly, the (multiplicity-adjusted) p values obtained by applying Holm can be calculated by multiplying the smallest (multiplicity-unadjusted) p value (corresponding in this case to response rate) by the total number of hypotheses in Table 1, multiplying the second smallest (multiplicity-unadjusted) p value (corresponding in this case to dollars given including match) by one less than the total number of hypotheses in Table 1, and continuing in this fashion until multiplying the largest (multiplicity-unadjusted) p value (corresponding in this case to amount change) by one.

Before adjusting for the multiplicity of null hypotheses being tested, we find that the treatment has an effect on the response rate, dollars given not including the matching amount, and dollars given including the matching amount at the 5% significance level. Here, by treatment, we mean receiving any of the 36 possible matching grants. After adjusting for the multiplicity of null hypotheses being tested, however, we find that the effect of the treatment on dollars given not including the matching amount is no longer significant at the 5% significance level—instead, it is only significant at the 10% significance level. By comparing the last three columns in Table 1, we additionally see that the p values obtained by applying Theorem 3.1 are an improvement upon those obtained by applying Bonferroni or Holm.

4.2 Multiple subgroups

Four subgroups of interest in Karlan and List (2007) are red county in a red state, blue county in a red state, red county in a blue state, and blue county in a blue state. Red states are defined as states that voted for George W. Bush in the 2004 Presidential election and blue states are defined as states that voted for John Kerry in the same election. Red and blue counties are defined analogously. In this subsection, we examine how the effect of the treatment on the response rate varies across these subgroups. Table 2 displays for each of the four subgroups of interest, the same five quantities found in Table 1. Note that 105 out of the 50,083 individuals in our dataset do not have complete subgroup information. We treat these 105 individuals as a subgroup of no interest for our analysis.

Before adjusting for the multiplicity of null hypotheses being tested, we find that the treatment has an effect on two of the four subgroups at the 10% significance level. As before, here, by treatment, we mean receiving any of the 36 possible matching grants. After adjusting for the multiplicity of null hypotheses being tested, however, we find that the treatment only has an effect on one subgroup at the same significance level. By comparing the last three columns in Table 2, we again see that the p values obtained by applying Theorem 3.1 are an improvement upon those obtained by applying Bonferroni or Holm.

4.3 Multiple treatments

We now consider null hypotheses that emerge due to multiple treatments. We define three treatments corresponding to different values for the price ratio of the match: $1:$1, $2:$1, and $3:$1. Each treatment is understood to mean any of the 12 possible treatments with the same value for the price ratio of the match. We focus on dollars given not including the matching amount as the outcome of interest.

We first consider testing three null hypotheses corresponding to comparing each treatment with the control group. Table 3 displays for each of these three null hypotheses the same five quantities found in Table 1. Before adjusting for the multiplicity of null hypotheses being tested, we find that the treatment $2:$1 has an effect at the 5% significance level on the outcome of interest. After adjusting for the multiplicity of null hypotheses being tested, however, we find that this effect is no longer significant even at the 10% significance level. By comparing the last three columns in Table 3, we again see that the p values obtained by applying Theorem 3.1 are an improvement upon those obtained by applying Bonferroni or Holm.

Next, we consider testing the six null hypotheses corresponding to all pairwise comparisons across the three treatments and the control group. Table 4 displays for each of these six null hypotheses the same five quantities found in Table 1. The results are qualitatively similar to those described above. Table 4 also displays a sixth quantity corresponding to the improvement in p values described in Remark 3.7 obtained by exploiting the logical restrictions among null hypotheses when there are multiple treatments. While in this case the improvement does not lead to any additional rejections of null hypotheses, we see that the difference in p values can in some cases be large. Note that this column is omitted from the previous table because Remark 3.7 results in no further improvements when solely comparing each treatment with the control group.

4.4 Multiple outcomes, subgroups, and treatments

More often than not, it is desired to test null hypotheses stemming from all three of the considerations above simultaneously: multiple outcomes, multiple subgroups, and multiple treatments. In this subsection, we consider the four outcome variables described in Sect. 4.1, the four subgroups described in Sect. 4.2, and the three treatments described in Sect. 4.3. Here, we only consider comparing each treatment with the control group. As a result, there are 48 null hypotheses of interest.

For each of the 48 null hypotheses, Table 5 displays the same five quantities found in Table 1. Before adjusting for the multiplicity of null hypotheses being tested, we reject 21 null hypotheses at the 10% significance level. After adjusting for the multiplicity of null hypotheses being tested, however, we find that only 9 null hypotheses are rejected at the same significance level. It is worth noting that 7 of these 9 null hypotheses are related to the same outcome—dollars given including the matching amount. By comparing the last three columns in Table 5, we again see that the p values obtained by applying Theorem 3.1 are an improvement upon those obtained by applying Bonferroni or Holm.

5 Conclusion

In this paper, we have developed a procedure for testing simultaneously null hypotheses that emerge naturally when analyzing data from experiments because of some combination of the presence of multiple outcomes of interest, multiple subgroups of interest, or multiple treatments. Using the general results in Romano and Wolf (2010), we have shown that our approach applies under weak assumptions to experiments in which individuals are assigned to treatments and control using simple random sampling. Notably, we have shown not only that our procedure has greater power than classical multiple testing procedures like Bonferroni and Holm, but also how further improvements can be obtained in the presence of multiple treatments by exploiting the logical restrictions among null hypotheses. We have applied our methodology to data originally presented in Karlan and List (2007), who studied the effectiveness of a matching grant on charitable giving.

As we have argued in the introduction, it is commonplace to consider multiple null hypotheses when analyzing experimental data for one or more of the reasons mentioned above. It is, however, uncommon to account correctly for the multiplicity of null hypotheses under consideration, and, as a result, the probability of a false rejection may be much higher than desired. This failure to adjust inference procedures is almost certainly related to the “credibility” and “reproducibility” crises that plague not just the social sciences, but the sciences more generally. See, for example, Jennions and Moller (2002), Ioannidis (2005), Nosek et al. (2012), Bettis (2012), Maniadis et al. (2014) and Camerer et al. (2016). We believe the adoption of testing procedures like the one described in this paper will help address these concerns and, with this in mind, advocate that researchers at the very least report multiplicity-adjusted results alongside conventional multiplicity-unadjusted results. In some cases, such as when evaluating existing studies, it may be more convenient to compute a Bonferroni or Holm correction, which only requires knowledge of conventional multiplicity-unadjusted p values, rather than apply the methodology in this paper.