Stratification refers to dividing a population into groups, called strata, such that pairs of population units within the same stratum are deemed more similar (homogeneous) than pairs from different strata. The strata are mutually exclusive (non-overlapping) and exhaustive of the population. Clearly, sufficient information on each population unit must be available before the population can be divided into strata.

The primary reason for dividing a population into strata is to make use of the strata in drawing a sample. For example, instead of drawing a simple random sample of size n from the population, one may draw a simple random sample of size \({n}_{h}\) from stratum h, for each of the L strata, where \(n = {n}_{1} + \cdots + {n}_{L}.\) The sample selection for any stratum is done independently of the other strata. The stratum sample sizes \({n}_{h}\) are often chosen proportional to the number of population units in stratum h, but other allocations of the stratum samples may be preferred in specific situations.

There are two major reasons for drawing a stratified sample instead of an unstratified one:

  1. Such samples are generally more efficient (in the sense that estimates have smaller variances) than samples that do not use stratification. There are exceptions, primarily when the strata are far from homogeneous with respect to the variable being estimated.

  2. The sample sizes are controlled (rather than random) for the population strata. This means, in particular, that one may guarantee adequate sample size for estimates that depend only on certain strata. For instance, if men and women are in separate strata, one can assure the sample size for estimates for men and for women.

Estimation Under Simple Random Sampling Within Strata

The independence of the sample selection by strata allows for straightforward variance calculation when simple random sampling is employed within strata. Let \({Y }_{T}\) denote the population total for a variable Y for which an estimate is sought. Let \({N}_{h}\) and \({n}_{h}\) denote respectively the population size and sample size for stratum h. Let, moreover, \({Y }_{hj}\) and \({y}_{hi}\) denote respectively the Y-value of the jth population element and of the ith sample element in stratum h. With the stratum population and sample means

$${\overline{Y }}_{h} = \frac{1} {{N}_{h}}{ \sum \nolimits }_{j=1}^{{N}_{h} }{Y }_{hj}\mbox{ and }{\overline{y}}_{h} = \frac{1} {{n}_{h}}{ \sum \nolimits }_{i=1}^{{n}_{h} }{y}_{hi},$$

define the stratum population and sample variances

$${S}_{h}^{2} = \frac{1} {{N}_{h} - 1}{\sum \limits _{j=1}^{{N}_{h} }}{({Y }_{hj}-{\overline{Y }}_{h})}^{2}\mbox{ and }{s}_{ h}^{2} = \frac{1} {{n}_{h} - 1}{\sum \limits _{i=1}^{{n}_{h} }}{({y}_{hi}-{\overline{y}}_{h})}^{2}.$$

We estimate \({Y }_{T}\) by \(\hat{y}\) where \(\hat{y} ={ \sum \limits_{{h=1}}^{L}{N}_{h}}{\overline{y}}_{h}.\) The variance of \(\hat{y}\) is

$$V (\hat{y}) ={ \sum \nolimits }_{h=1}^{L}\frac{{N}_{h}^{2}} {{n}_{h}} (1 - {n}_{h}/{N}_{h}){S}_{h}^{2}$$

and the variance is estimated by

$$\hat{V }(\hat{y}) ={ \sum \nolimits }_{h=1}^{L}\frac{{N}_{h}^{2}} {{n}_{h}} (1 - {n}_{h}/{N}_{h}){s}_{h}^{2}.$$

Similarly, the population mean \(\overline{Y } = {Y }_{T}/N,\) where \(N ={ \sum \nolimits }_{h=1}^{L}{N}_{h}\) is the size of the population, is estimated by \(\hat{y}/N\) and its variance by \(\hat{V }(\hat{y})/{N}^{2}.\)
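The estimators above can be sketched in a few lines of Python. This is a minimal illustration, not a survey package: the function name `stratified_total` and the data are made up, and simple random sampling without replacement within each stratum is assumed.

```python
from statistics import mean, variance


def stratified_total(samples, pop_sizes):
    """Return (y_hat, v_hat) under SRS without replacement within strata.

    samples   : list of lists; samples[h] holds the sampled y-values of stratum h
    pop_sizes : list of stratum population sizes N_h
    (Both names are hypothetical, chosen for this sketch.)
    """
    y_hat = 0.0
    v_hat = 0.0
    for y_h, N_h in zip(samples, pop_sizes):
        n_h = len(y_h)
        if n_h < 2:
            raise ValueError("need at least two sampled units per stratum")
        y_hat += N_h * mean(y_h)
        s2_h = variance(y_h)  # sample variance with divisor n_h - 1
        v_hat += (N_h ** 2 / n_h) * (1 - n_h / N_h) * s2_h
    return y_hat, v_hat


# Made-up data: two strata with N_1 = 30 and N_2 = 20.
samples = [[2.0, 4.0, 6.0], [10.0, 14.0]]
pop_sizes = [30, 20]
y_hat, v_hat = stratified_total(samples, pop_sizes)

# Estimated population mean and its estimated variance.
N = sum(pop_sizes)
mean_hat, mean_var_hat = y_hat / N, v_hat / N ** 2
```

For these made-up values the estimated total is 360 with estimated variance 2520.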

Allocation of Sample Sizes to Strata Under Simple Random Sampling within Strata

For a total sample size of n and given values of \({S}_{h}\), the question arises of how to allocate the sample to the strata; that is, how should one choose the \({n}_{h}\), \(h = 1,\ldots ,L\), so that \(n = {n}_{1} + \cdots + {n}_{L}\) and \(V (\hat{y})\) is minimized? This is a straightforward constrained minimization problem (solved with Lagrange multipliers) that yields the solution:

$${n}_{h} = \frac{n{N}_{h}{S}_{h}} {{\sum \limits_{{k=1}}^{L}}{N}_{k}{S}_{k}}$$

Note that, as one would expect, the more variability in a stratum (larger \({S}_{h}\)), the larger the relative sample size in that stratum. This method of determining the stratum sample sizes is termed Neyman allocation in view of the seminal paper on stratified sampling by Neyman (1934).
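As a small numerical illustration of Neyman allocation, the sketch below computes the real-valued \({n}_{h}\) from the formula. The function name and the inputs are hypothetical; rounding to integers and capping at \({N}_{h}\) are left out.

```python
def neyman_allocation(n, pop_sizes, std_devs):
    """Real-valued Neyman allocation n_h = n * N_h * S_h / sum_k(N_k * S_k)."""
    weights = [N_h * S_h for N_h, S_h in zip(pop_sizes, std_devs)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Made-up inputs: three strata, total sample size 100.
alloc = neyman_allocation(100, pop_sizes=[500, 300, 200], std_devs=[2.0, 5.0, 10.0])
```

Here the smallest stratum has the largest standard deviation and receives the largest share of the sample (about 44 of the 100 units).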

Sometimes the strata are not equally costly to sample. For example, there may be additional travel costs in sampling a rural geographically-determined stratum over an urban one. If it costs \({C}_{h}\) to sample a unit in stratum h, then the allocation

$${n}_{h} = \frac{n{N}_{h}{S}_{h}/\sqrt{{C}_{h}}} {{\sum }_{k=1}^{L}{N}_{k}{S}_{k}/\sqrt{{C}_{k}}}$$

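is best in two senses: it minimizes \(V (\hat{y})\) subject to fixed total cost (a fixed budget) \({C}_{T} = {C}_{1}{n}_{1} + \cdots + {C}_{L}{n}_{L}\) and it minimizes \({C}_{T}\) subject to fixed \(V (\hat{y}).\) In either case the total sample size n is determined by the constraint.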
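The fixed-budget case can be sketched as follows. The sketch assumes that the total cost is the sum over strata of \({C}_{h}{n}_{h}\) with no fixed overhead, so the proportionality constant follows from setting that sum equal to the budget. The function name and the numbers are hypothetical.

```python
from math import sqrt


def cost_allocation(budget, pop_sizes, std_devs, unit_costs):
    """Real-valued n_h proportional to N_h * S_h / sqrt(C_h), scaled to spend the budget.

    Assumes total cost = sum_h C_h * n_h (an assumption of this sketch).
    """
    # Cost of the allocation per unit of the proportionality constant k.
    cost_per_k = sum(N * S * sqrt(C) for N, S, C in zip(pop_sizes, std_devs, unit_costs))
    k = budget / cost_per_k
    return [k * N * S / sqrt(C) for N, S, C in zip(pop_sizes, std_devs, unit_costs)]


# Made-up inputs: the second stratum is four times as costly per unit.
unit_costs = [1.0, 4.0]
n_h = cost_allocation(1000.0, pop_sizes=[500, 300], std_devs=[2.0, 5.0], unit_costs=unit_costs)
spent = sum(C * n for C, n in zip(unit_costs, n_h))
```

With these inputs the allocation is 250 and 187.5 units, which spends the budget of 1000 exactly.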

These allocations assume that the \({S}_{h}\), \(h = 1,\ldots ,L\), are known. In practice, rough estimates, perhaps based on a similar previous survey, will serve. The same comment applies to the costs for the cost-based allocation.

In the absence of any prior information, even approximate, the simple proportional allocation \({n}_{h} = n{N}_{h}/N\) is often used. In this case, the estimator \(\hat{y}\) has a particularly simple form

$$\begin{array}{rcl} \hat{y}& =& {\sum \limits _{h=1}^{L}}{N}_{h}{\overline{y}}_{h} ={ \sum \limits _{h=1}^{L}}\frac{{N}_{h}}{{n}_{h}} { \sum \limits _{i=1}^{{n}_{h}} }{y}_{hi} ={ \sum \limits_{h=1}^{L}} \frac{{N}_{h}} {(n{N}_{h}/N)}{\sum \limits_{i=1}^{{n}_{h} }}{y}_{hi} \\ & =& \frac{N} {n} {\sum \limits_{h=1}^{L}}{ \sum \limits _{i=1}^{{n}_{h} }}{y}_{hi}\end{array}$$

Therefore \(\hat{y}\) is just the sum of the sample values expanded by \(N/n\). In many surveys a wide variety of quantities are estimated and their within-stratum variability may differ, so proportional allocation may be employed as a compromise.
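A quick numerical check of this identity, using made-up data in which the proportional allocation happens to give whole numbers (the names are hypothetical):

```python
from statistics import mean

pop_sizes = [60, 40]          # N_h
n = 10                        # total sample size
N = sum(pop_sizes)

# Proportional allocation n_h = n * N_h / N gives 6 and 4 here.
prop_alloc = [n * N_h / N for N_h in pop_sizes]

# Made-up samples of exactly those sizes.
samples = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40]]

# Stratified estimator: sum over strata of N_h times the stratum sample mean.
stratified = sum(N_h * mean(y_h) for N_h, y_h in zip(pop_sizes, samples))

# Expansion form: N/n times the sum of all sampled values.
expansion = (N / n) * sum(sum(y_h) for y_h in samples)
```

Both expressions give 1210 for these values. When \(n{N}_{h}/N\) is not an integer, the rounded allocation is only approximately proportional and the two forms agree only approximately.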

Unbiased estimation requires at least one sample selection per stratum. Unbiased variance estimation requires at least two selections per stratum.

Stratum Boundaries

Sometimes stratification is based on a small number of discrete categories like gender or race. Other times, one may have data on a variable that can be regarded as continuous and that is closely related to the variable one wants to estimate from the sample. For example, one may want to estimate the output of factories based on strata defined by the number of workers at the factory. One stratum might be all factories with 75–100 workers. In this case, 75 and 100 are said to be the stratum boundaries. How should these boundaries be chosen?

One method that has been shown to be good is the cumulative square root of frequencies method developed by Dalenius and Hodges (1957): Start by assuming (in our example) that the factories have been divided into a rather large number of categories based on the numbers of workers, numbered from fewest workers to the most workers. If \({f}_{k}\) is the number of factories in category k, calculate \({Q}_{k} = \sqrt{{f}_{1}} + \cdots + \sqrt{{f}_{k}}.\) Divide the factories into strata so that the differences between the \({Q}_{k}\) at adjacent stratum boundary points are as equal as possible.
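One simple way to carry out this rule is sketched below: compute the cumulative \({Q}_{k}\) and place each cut at the category whose \({Q}_{k}\) is closest to an equally spaced target. The nearest-target rule, the function name, and the frequencies are assumptions of this sketch; other ways of making the differences "as equal as possible" are conceivable.

```python
from itertools import accumulate
from math import sqrt


def cum_sqrt_f_cuts(freqs, n_strata):
    """Return (Q, cuts); cuts[j] is the index of the last category in stratum j.

    freqs    : frequencies f_k of the ordered categories
    n_strata : desired number of strata L
    With coarse categories two targets can select the same index; that case
    is not handled in this sketch.
    """
    Q = list(accumulate(sqrt(f) for f in freqs))
    step = Q[-1] / n_strata
    cuts = []
    for j in range(1, n_strata):
        target = j * step
        # Category whose cumulative value is nearest the equally spaced target.
        k = min(range(len(Q)), key=lambda i: abs(Q[i] - target))
        cuts.append(k)
    return Q, cuts


# Made-up frequencies for seven size categories, smallest factories first.
Q, cuts = cum_sqrt_f_cuts([100, 64, 36, 16, 9, 4, 1], n_strata=3)
```

For these frequencies the cumulative values are 10, 18, 24, 28, 31, 33, 34 and the cuts fall after the first and third categories, so the three strata consist of category 1, categories 2 to 3, and categories 4 to 7.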

More recently, Lavallée and Hidiroglou (1988) developed an iterative procedure especially designed for skewed populations.

Variance Estimation for Stratified Samples

For simple estimators and stratified sampling, direct formulas are available to calculate variance estimates. These formulas are tailored to the specific estimator whose variance is sought. General purpose variance estimators have been developed, however, that allow one to estimate variances for a wide class of estimators using a single procedure. See Wolter (2007) and Shao and Tu (1995) for a complete discussion of these procedures.

Balanced half-sample replication (or balanced repeated replication) has been developed as a variance estimation procedure when two primary sampling units (PSUs) are selected from each stratum. There may be additional sampling within each PSU, so the sample design may be complex. The variance estimation is based on half-sample replicates, each replicate consisting of one PSU from each stratum. The pattern that determines which PSU to choose from each stratum for a particular replicate is based on a special kind of matrix, called a Hadamard matrix.
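The following sketch illustrates the idea for the simplest case, an estimated total. It makes several assumptions that are not in the text: the Hadamard matrix is built by the Sylvester doubling construction (which only yields orders that are powers of two), each PSU is summarized by its weighted contribution to the full-sample total, and the variance is taken as the average squared deviation of the replicate estimates from the full-sample estimate. All names are hypothetical.

```python
def sylvester_hadamard(order):
    """Hadamard matrix of the given order (a power of two) via Sylvester doubling."""
    H = [[1]]
    while len(H) < order:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H


def brr_variance(psu_totals):
    """Return (full-sample total, BRR variance estimate).

    psu_totals[h] = (t_h1, t_h2): weighted contributions of the two PSUs in
    stratum h, so the full-sample total is the sum of all of them.
    """
    L = len(psu_totals)
    order = 1
    while order <= L:  # need order > L so the all-ones first column can be skipped
        order *= 2
    H = sylvester_hadamard(order)
    full = sum(t1 + t2 for t1, t2 in psu_totals)
    sq_dev = 0.0
    for row in H:
        # Half-sample: keep one PSU per stratum and double its contribution.
        rep = sum(2 * (t1 if row[h + 1] == 1 else t2)
                  for h, (t1, t2) in enumerate(psu_totals))
        sq_dev += (rep - full) ** 2
    return full, sq_dev / order


# Made-up weighted PSU totals for three strata.
full, v_brr = brr_variance([(10, 14), (20, 26), (5, 3)])
```

For a total, full balance makes the cross-stratum terms cancel, so the result equals the sum of squared within-stratum PSU differences (16 + 36 + 4 = 56 here). The practical value of the method lies in nonlinear estimators, where the same replicates are reused.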

A form of the jackknife method (see Jackknife) is also widely employed with two-PSU-per-stratum sample designs (although it can be extended to other designs). This jackknife method is also based on forming replicates, but each replicate is tied to a specific stratum: it contains only one of the two PSUs from that stratum, together with both PSUs from all other strata.
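A minimal sketch of this jackknife for an estimated total follows. It assumes one replicate per stratum, in which the retained PSU's weighted contribution is doubled to compensate for the dropped PSU, and it takes the variance estimate as the sum of squared deviations of the replicate estimates from the full-sample estimate. These conventions and all names are assumptions of the sketch.

```python
def jackknife_variance(psu_totals, keep=0):
    """Return (full-sample total, jackknife variance estimate).

    psu_totals[h] = (t_h1, t_h2): weighted contributions of the two PSUs in
    stratum h. keep selects which PSU (0 or 1) is retained in the replicate
    for its own stratum.
    """
    full = sum(t1 + t2 for t1, t2 in psu_totals)
    var = 0.0
    for pair in psu_totals:
        # Replicate for this stratum: drop one PSU, double the other,
        # and leave every other stratum untouched.
        rep = full - sum(pair) + 2 * pair[keep]
        var += (rep - full) ** 2
    return full, var


# Made-up weighted PSU totals for three strata.
full_jk, v_jk = jackknife_variance([(10, 14), (20, 26), (5, 3)])
```

For a total the result does not depend on which PSU is retained, since each deviation is plus or minus the within-stratum difference.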

Various forms of the bootstrap method (see Bootstrap Methods) have been employed in recent years as general variance estimation methods for stratified sampling.

Although not as generic, the Taylor series (or linearization) method is a powerful technique for estimating variances in complex samples.

Stratified Sampling with Maximal Overlap (Keyfitzing)

Sometimes it is worthwhile to select a stratified sample in a manner that maximizes overlap with another stratified sample, subject to the constraint that the probabilities of selection are the ones desired. For example, cost savings may arise if a new stratified sample is similar to a previous one, yet births, deaths, and migration in the population may preclude it being exactly the same. Keyfitz (1951) developed a method to deal with this problem, so it is often called Keyfitzing. More recent researchers have extended the method to more general situations.

Stratification in Two Phases

It may be clearly desirable to stratify on a certain characteristic that is not available on the sampling frame (the list of units from which the sample is selected). For example, in travel surveys one would likely want to stratify on household type (e.g., single adult head of household or adult couple with children) but this information is usually not provided on an address list. One solution is to first conduct a large, relatively inexpensive first phase of the survey for the sole purpose of obtaining the information needed to stratify. This information is then employed in the stratification of the second phase of the survey. This process is called two-phase sampling or double sampling.

Let \({n}_{h}^{I}\) be the size of the first-phase sample that lies in stratum h and let \({n}^{I} = {n}_{1}^{I} + \cdots + {n}_{L}^{I}\) be the first-phase sample size. At the second phase, \({n}_{h}^{II}\) units with Y-values \({y}_{h1},\ldots ,{y}_{h{n}_{h}^{II}}\) are sampled in stratum h. Then one can estimate \({Y }_{T}\) by

$$\tilde{y} = N{\sum \limits _{h=1}^{L}}\frac{{n}_{h}^{I}} {{n}^{I}} { \sum \limits _{i=1}^{{n}_{h}^{II}} } \frac{{y}_{hi}} {{n}_{h}^{II}}$$

Approximate variance formulas can also be given. See, e.g., Raj and Chandhok (1998) or Scheaffer et al. (2006). Because the n h I are random, the usual (one-phase) variance formulas would underestimate the variance.
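The point estimator \(\tilde{y}\) is easy to compute directly; the sketch below does only that and does not attempt the variance. The function name and data are made up.

```python
from statistics import mean


def two_phase_total(N, first_phase_counts, second_phase_samples):
    """Two-phase estimate of the total: N * sum_h (n_h^I / n^I) * (second-phase mean in h).

    first_phase_counts[h]   : number of first-phase units classified into stratum h
    second_phase_samples[h] : y-values observed at the second phase in stratum h
    """
    n_first = sum(first_phase_counts)
    return N * sum((c / n_first) * mean(y_h)
                   for c, y_h in zip(first_phase_counts, second_phase_samples))


# Made-up data: 100 first-phase units split 60/40, small second-phase samples.
y_tilde = two_phase_total(1000, [60, 40], [[2, 4], [10, 20, 30]])
```

Here the first-phase proportions 0.6 and 0.4 stand in for the unknown \({N}_{h}/N\), giving an estimate of 1000 × (0.6 × 3 + 0.4 × 20) = 9800.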

Poststratification

After a sample has been selected and the data collected, sometimes the estimation procedures of stratification can be employed even if the sample selection was for an unstratified design. An important requirement is that the population proportions \({N}_{h}/N\) must be known, at least approximately. If so, then

$$\hat{y} = N{\sum \limits _{h=1}^{L}}\frac{{N}_{h}} {N} {\sum \limits _{i=1}^{{n}_{h} }}\frac{{y}_{hi}} {{n}_{h}} = N{\sum \limits _{h=1}^{L}}\frac{{N}_{h}} {N}{ \overline{y}}_{h}$$

is an improved estimate of the population total. The usual variance estimator \(\hat{V }(\,\hat{y}),\) however, is no longer valid as it does not account for the randomness of the \({n}_{h}\). More complicated variance estimators can be developed for this purpose.
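As an illustration, the sketch below poststratifies an unstratified sample whose units carry a stratum label observed at data collection, and compares the result with the plain expansion estimate. The labels, population counts, and function name are made up, and no variance is computed.

```python
from collections import defaultdict
from statistics import mean


def poststratified_total(values, labels, pop_sizes):
    """Return sum_h N_h * (sample mean of the units that fell in poststratum h).

    values    : sampled y-values
    labels    : poststratum label of each sampled unit
    pop_sizes : dict mapping each label to its known population count N_h
    """
    groups = defaultdict(list)
    for y, g in zip(values, labels):
        groups[g].append(y)
    empty = [g for g in pop_sizes if g not in groups]
    if empty:
        raise ValueError(f"no sampled units in poststrata: {empty}")
    return sum(N_h * mean(groups[g]) for g, N_h in pop_sizes.items())


# Made-up sample of five units from a population of 1000.
values = [1, 3, 10, 20, 30]
labels = ["a", "a", "b", "b", "b"]
pop_sizes = {"a": 600, "b": 400}

y_post = poststratified_total(values, labels, pop_sizes)
y_expansion = sum(pop_sizes.values()) * mean(values)  # unstratified estimate
```

In this made-up sample, poststratum "b" is over-represented (3 of 5 units versus 40% of the population), so the expansion estimate of 12800 is pulled upward, while the poststratified estimate is 9200.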

Another reason to employ poststratification is to reduce bias due to nonresponse.

Controlled Selection

Controlled selection is a sample selection method that is related to stratified sampling but differs in that independent selections are not made from the cells (“strata”). The method was introduced by Goodman and Kish (1950). For an example of controlled selection, imagine a two-dimensional array of cells of population units, say of industrial classification categories by geographic areas. All population units lie in exactly one cell, analogous to strata. The sample size is not large enough for there to be the two selections per cell needed for unbiased variance estimation if the selections were independent by cell. Under controlled selection, only certain balanced patterns of cell combinations can be selected. When properly carried out, this is a valid probability selection technique.

About the Author

Dr. Michael P. Cohen is Senior Consultant to the National Opinion Research Center and Adjunct Professor, Department of Statistics, George Mason University. He was President of the Washington Statistical Society (2007–2008), and of the Washington Academy of Sciences (2003–2004). He served as Assistant Director for Survey Programs of the U.S. Bureau of Transportation Statistics (2002–2006). He is a Fellow of the American Statistical Association, the American Educational Research Association, and the Washington Academy of Sciences. He is an Elected Member of the International Statistical Institute and Sigma Xi and a Senior Member of the American Society for Quality. Dr. Cohen has over 60 professional publications. He served as an Associate Editor, Journal of the American Statistical Association, Applications and Case Studies Section (2004–2006). He has been an Associate Editor of the Journal of Official Statistics since 2003. He is the Guest Problem Editor of the Journal of Recreational Mathematics for 2009–2010.

Cross References

Balanced Sampling

Jackknife

Multistage Sampling

Sampling From Finite Populations

Simple Random Sample