Abstract
Good causal inference requires good measurement; even the most thoughtfully designed research can be derailed by noisy data. Because policy scholars are often interested in public opinion as a key dependent or independent variable, paying careful attention to the sources of measurement error from surveys is an essential step toward detecting causation. This chapter introduces multilevel regression and poststratification (MRP), a method for adjusting public opinion estimates to account for observed imbalances between the survey sample and population of interest. It covers the history of MRP, recent advances, an example analysis with code, and concludes with a discussion of best practices and limitations of the approach.
By the end of this chapter, you will be able to:
- Explain the motivation for MRP and the circumstances under which it is appropriate to implement.
- Describe the two steps in producing MRP estimates: model fitting and poststratification.
- Generate MRP estimates by adapting the provided sample code.
- Implement more sophisticated variants of MRP, including stacked regression and poststratification (SRP) or multilevel regression and synthetic poststratification (MrsP) where appropriate.
5.1 Introduction
The book you are reading is a testament to the “credibility revolution” in the social sciences (Angrist & Pischke, 2010), a wide-ranging effort spanning multiple disciplines to develop credible, design-based approaches to causal inference. It is difficult to overstate the influence this revolution has had on empirical social science, and the increasing emphasis that policymakers place on informing policy with good research design is a welcome trend.
But as the ongoing replication crisis in experimental psychology (Button et al., 2013) has made clear, good research design alone is insufficient to yield good science. After all, double-blind randomized controlled trials are the “gold standard” of credible causal inference, but small sample sizes and noisy measurement have created a situation where many published effect estimates fail to replicate upon further scrutiny (Loken & Gelman, 2017). To confidently detect causation, one needs both good research design and good measurement.
Often policy researchers are interested in public opinion on some issue, either as an independent or dependent variable. But the surveys we use to measure public opinion are frequently unrepresentative in some important way. Perhaps their respondents come from a convenience sample (Wang et al., 2015), or non-response bias skews an otherwise random sample. Or perhaps the data is representative of some larger population (e.g., a country-level random sample) but contains too few observations to make inferences about a subgroup of interest. Even the largest US public opinion surveys do not have enough respondents to make reliable inferences about lower-level political entities like states or municipalities. Conclusions drawn from low-frequency observations – even in a large sample survey – can be wildly misleading (Ansolabehere et al., 2015).
This presents a challenge for researchers: how to take unrepresentative survey data and adjust it so that it is useful for our particular research question. In this chapter, I will demonstrate a method called Multilevel Regression and Poststratification (MRP). Using this approach, the researcher first constructs a model of public opinion (multilevel regression) and then reweights the model’s predictions based on the observed characteristics of the population of interest (poststratification). In the sections that follow, I will describe this approach in detail, accompanied by replication code in the R statistical language.
As we will see, the accuracy of our MRP estimates depends critically on whether the first-stage model makes good out-of-sample predictions. The best first-stage models are regularized (Gelman, 2018) to avoid both over- and underfitting to the survey data. Regularized ensemble models (Ornstein, 2020) with group-level predictors tend to produce the best estimates, especially when trained on large survey datasets.
5.2 How It Works
MRP was first introduced by Gelman and Little (1997), and in the subsequent decades, it has helped address a diverse set of research questions in political science. These range from generating election forecasts using unrepresentative survey data (Wang et al., 2015) to assessing the responsiveness of state (Lax & Phillips, 2012) and local policymakers (Tausanovitch & Warshaw, 2014) to their constituents’ policy preferences.
To demonstrate how the method works, the next section will introduce a running example drawn from the Cooperative Election Study (Schaffner et al., 2021), a 50,000+ respondent study of voters in the United States. The 2020 wave of the study includes a question asking respondents whether they support a policy that would “decrease the number of police on the street by 10 percent, and increase funding for other public services.” Since police reform is a policy issue on which US local governments have a significant amount of autonomy, it would be useful to know how opinions on this issue vary from place to place without having to conduct separate, costly surveys in each area.
The problem is that even a survey as large as CES has relatively few respondents in some small areas of interest. If we wanted to know, for example, what voters in Detroit thought about police reform, a survey of 50,000 people randomly sampled from across the United States will have, on average, only 100 people from Detroit. Estimates from such a small sample will not be very precise. And more importantly, those 100 people are unlikely to be representative of the population of Detroit, since the survey was designed to be representative of the country at large.
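To make the precision problem concrete, here is a rough worked calculation. It assumes simple random sampling and true support near 50% (both assumptions for illustration; neither figure comes from the CES). With $n = 100$ respondents, the 95% margin of error is about ten percentage points:

```latex
\[
\text{MoE}_{95\%} \approx 1.96 \sqrt{\frac{p(1-p)}{n}}
  = 1.96 \sqrt{\frac{0.5 \times 0.5}{100}}
  = 1.96 \times 0.05
  \approx 0.098
\]
```

And this calculation addresses only sampling noise; it says nothing about the second problem, that the 100 respondents may not be representative of the city.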
The core insight of the MRP approach is that we can use similar respondents from similar areas – e.g., Cleveland or Chicago or Pittsburgh – to improve our inferences about public opinion in Detroit. The way we do so is to first fit a statistical model of public opinion, using both individual-level predictors (e.g., race, age, gender, education) and group-level predictors (e.g., median income, population density) from our survey dataset. Then, we reweight the predictions of the model to match the observed demographics and characteristics of Detroit. In this way, we get the most out of the information contained in our survey and produce a better estimate of what Detroit residents think than our small sample from Detroit alone could produce.
5.3 Running Example
To help demonstrate this process, we will draw a small random sample from the CES survey and, using that sample alone, attempt to estimate public opinion on police reform in each US state. In this way, we can evaluate the accuracy of our MRP estimates and explore how various refinements to the method improve predictive accuracy. This approach mirrors Buttice and Highton (2013), who use disaggregated responses from a large-scale US survey of voters as their target estimand to evaluate MRP’s performance. The Cooperative Election Study data is publicly available (Schaffner et al., 2021), and we’ll be using a tidied version of the dataset created by the R/cleanup-ces-2020.R script (see Note 1).
This tidied version of the data only includes the 33 states with at least 500 respondents. First, let’s plot the percent of CES respondents who supported “defunding” the police (see Note 2) by state.
Oregon is the only state where a majority of respondents supported this policy proposal. And note that Fig. 5.1 likely overstates the percent of the total population that support such a policy, since self-identified Democrats are overrepresented in the CES sample. But nevertheless, these population-level parameters will be a useful target to evaluate the performance of our MRP estimates.
5.3.1 Draw a Sample
Suppose that we did not have access to the entire CES dataset, but only to a random sample of 1,000 respondents. How good of a job can we do at estimating those state-level means?
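The disaggregation exercise can be sketched in base R as follows. This is a minimal illustration on simulated data, not the chapter's replication code: the data frame `ces`, the four states, their sizes, and their support rates are all made up so that the block runs on its own.

```r
set.seed(42)

# Simulated stand-in for the full survey (hypothetical; not the CES data).
# One state is deliberately tiny, mimicking the imbalance described above.
states  <- c("AR", "MO", "OR", "TX")
n_state <- c(AR = 40, MO = 960, OR = 1000, TX = 8000)
p_state <- c(AR = 0.35, MO = 0.38, OR = 0.52, TX = 0.36)

ces <- data.frame(state = rep(states, times = n_state),
                  stringsAsFactors = FALSE)
ces$defund <- rbinom(nrow(ces), size = 1, prob = p_state[ces$state])

# "True" state-level means, computed from the full simulated dataset
truth <- tapply(ces$defund, ces$state, mean)

# Draw a random sample of 1,000 respondents and disaggregate by state
samp <- ces[sample(nrow(ces), size = 1000), ]
samp$state <- factor(samp$state, levels = states)  # keep states with zero respondents
n_resp   <- table(samp$state)
estimate <- tapply(samp$defund, samp$state, mean)  # NA if a state has no respondents

data.frame(state       = states,
           n           = as.integer(n_resp),
           sample_mean = as.numeric(estimate[states]),
           truth       = as.numeric(truth[states]))
```

The smallest state contributes only a handful of respondents to the sample, so its sample mean can land far from the truth.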
In a sample with only 1,000 respondents, there are several states with very few (or no) respondents. Notice, for example, that this sample includes only four respondents from Arkansas, of whom zero support reducing police budgets. Simply disaggregating and taking sample means is unlikely to yield good estimates, as you can see by comparing those sample means against the truth (Fig. 5.2).
These are clearly poor estimates of state-level public opinion. The four respondents from Arkansas simply do not give us enough information to adequately measure public opinion in that state. But one of the key insights behind MRP is that the respondents from Arkansas are not the only respondents who can give us information about Arkansas! There are other respondents in, for example, Missouri, who are similar to Arkansas residents on their observed characteristics. If we can determine the characteristics that predict support for police reform using the entire survey sample, then we can use those predictions – combined with demographic information about Arkansans – to generate better estimates. The trick, in essence, is that our estimate for Arkansas will be borrowing information from similar respondents in other states.
The method proceeds in three steps.
5.3.1.1 Step 1: Fit a Model
First, we fit a model of our outcome, using observed characteristics of the survey respondents as predictors. To demonstrate, let’s fit a simple logistic regression model including only four demographic predictors: gender, education, race, and age.
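A first-stage model of this kind might look like the sketch below. The sample is simulated, and the variable names, category labels, and effect sizes are assumptions chosen for illustration rather than the CES coding.

```r
set.seed(1)
n <- 1000

# Simulated survey sample (hypothetical categories and effects, not CES data)
samp <- data.frame(
  gender = sample(c("female", "male"), n, replace = TRUE),
  educ   = sample(c("no college", "college"), n, replace = TRUE),
  race   = sample(c("black", "hispanic", "other", "white"), n, replace = TRUE),
  age    = sample(18:90, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Assumed data-generating process: support declines with age and is
# higher among college graduates
eta <- 0.5 - 0.03 * (samp$age - 18) + 0.4 * (samp$educ == "college")
samp$defund <- rbinom(n, 1, plogis(eta))

# First-stage model: logistic regression on four demographic predictors
model1 <- glm(defund ~ gender + educ + race + age,
              family = binomial, data = samp)

coef(summary(model1))
```

Note that this model contains no geographic information at all, so any variation in its poststratified estimates across states comes entirely from differences in demographic composition.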
5.3.1.2 Step 2: Construct the Poststratification Frame
The poststratification stage requires the researcher to know (or estimate) the joint frequency distribution of predictor variables in each state. This information is stored in a “poststratification frame,” a matrix where each row is a unique combination of characteristics, along with the observed frequency of that combination. Often, one constructs this frequency distribution from Census micro-data (Lax & Phillips, 2009). For our demonstration, I will compute it directly from the CES.
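As a rough sketch, a poststratification frame can be built from individual-level records with a cross-tabulation. The `pop` data frame below is simulated and stands in for whatever microdata is available (Census records or, as in this chapter, the full survey); its columns and proportions are hypothetical.

```r
set.seed(2)
n <- 5000

# Simulated population microdata (hypothetical; stands in for Census or CES records)
pop <- data.frame(
  state  = sample(c("AR", "MO", "OR"), n, replace = TRUE, prob = c(0.2, 0.4, 0.4)),
  gender = sample(c("female", "male"), n, replace = TRUE),
  educ   = sample(c("no college", "college"), n, replace = TRUE, prob = c(0.65, 0.35)),
  stringsAsFactors = FALSE
)

# One row per unique combination of characteristics, with its frequency n
psframe <- as.data.frame(table(pop), responseName = "n",
                         stringsAsFactors = FALSE)

# Each cell's share of its state's population
psframe$share <- psframe$n / ave(psframe$n, psframe$state, FUN = sum)

psframe[order(psframe$state), ]
```

Using `table()` keeps cells with zero observed frequency in the frame, which is harmless because they receive zero weight at the poststratification stage.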
5.3.1.3 Step 3: Predict and Poststratify
With the model and poststratification frame in hand, the final step is to generate frequency-weighted predictions of public opinion. For each cell in the poststratification frame, append the model’s predicted probability of supporting police defunding.
Then, the poststratified estimates are the frequency-weighted means of those predictions.
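The arithmetic of this final step is simple enough to check by hand. The toy frame below already has predictions appended; its cells, frequencies, and predicted probabilities are invented for illustration.

```r
# Toy poststratification frame with predictions already appended.
# All numbers are made up for illustration.
psframe <- data.frame(
  state = c("AR", "AR", "AR", "OR", "OR", "OR"),
  cell  = c("18-29", "30-64", "65+", "18-29", "30-64", "65+"),
  n     = c(100, 300, 100, 200, 200, 100),        # cell frequencies
  pred  = c(0.60, 0.40, 0.20, 0.70, 0.50, 0.30),  # model-predicted support
  stringsAsFactors = FALSE
)

# Poststratified estimate: frequency-weighted mean of predictions within state
num <- tapply(psframe$n * psframe$pred, psframe$state, sum)
den <- tapply(psframe$n, psframe$state, sum)
mrp <- num / den
mrp
# AR: (60 + 120 + 20) / 500 = 0.40
# OR: (140 + 100 + 30) / 500 = 0.54
```

Both toy states share the same cell-level pattern of declining support with age, but the second state's younger population and higher predicted support in every cell yield a higher poststratified estimate.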
Let’s see how these estimates compare with the known values (Fig. 5.3).
These estimates, though still imperfectly correlated with the truth, are much better than the previous estimates from disaggregation. Notice, in particular, that the estimate for Arkansas went from 0% to roughly 39%, reflecting the significant improvement that comes from using more information than the four Arkansans in our sample can provide.
But we can still do better. In the following sections, I will show how successive improvements to the first-stage model can yield more reliable poststratified estimates.
5.3.2 Beware Overfitting
A common instinct among social scientists building models is to take a “kitchen sink” approach, including as many explanatory variables as possible (Achen, 2005). This is counterproductive when the objective is out-of-sample predictive accuracy. To illustrate, let’s estimate a model with a separate intercept term for each state – a “fixed effects” model. Because our sample contains several states with very few observations, these state-specific intercepts will be overfit to sampling variability (Fig. 5.4).
These poststratified estimates perform about as well as the disaggregated estimates from Fig. 5.2. Because each state’s intercept is estimated separately, the overfit model forgoes the advantages of “partial pooling” (Park et al., 2004), that is, borrowing information from respondents in other states. Note that the estimate for Arkansas is once again 0%.
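The mechanics behind that 0% estimate can be reproduced in a few lines. The sketch below uses simulated data with hypothetical state sizes and support rates, and a model with state intercepts only: four respondents from one state, none of whom support the policy.

```r
set.seed(3)

# Simulated sample (hypothetical): one state has only four respondents,
# none of whom support the policy
samp <- data.frame(
  state = c(rep("AR", 4), rep("MO", 200), rep("OR", 200)),
  stringsAsFactors = FALSE
)
samp$defund <- c(rep(0, 4), rbinom(200, 1, 0.38), rbinom(200, 1, 0.52))

# "Fixed effects" model: a separate intercept for each state.
# glm() warns about fitted probabilities of 0 or 1; suppressed here
# because that separation is exactly the point of the example.
fe <- suppressWarnings(
  glm(defund ~ 0 + state, family = binomial, data = samp)
)

new_states <- data.frame(state = c("AR", "MO", "OR"), stringsAsFactors = FALSE)
pred <- setNames(predict(fe, newdata = new_states, type = "response"),
                 new_states$state)
round(pred, 3)
```

Each state's prediction simply reproduces its own sample mean, so the four-respondent state is predicted at essentially zero no matter what respondents elsewhere say.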
5.3.3 Partial Pooling
A better approach is to estimate a multilevel model (alternatively known as “varying intercepts” or “random effects” model), including group-level covariates. In the model below, I estimate varying intercepts by US Census division, including the state’s 2020 Democratic vote share as a covariate. The result is a marked improvement over Fig. 5.3 (particularly for West Coast states like Oregon, Washington, and California) (Fig. 5.5).
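One way to write down a model of this form is shown below. The notation is mine, and I assume (for illustration) that the individual-level predictors $\mathbf{x}_i$ are the same four demographics as before; $v_{s[i]}$ is the 2020 Democratic vote share in respondent $i$'s state and $d[i]$ indexes the respondent's Census division.

```latex
\[
\Pr(y_i = 1) = \operatorname{logit}^{-1}\!\left(
  \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta}
  + \gamma\, v_{s[i]} + \alpha_{d[i]}
\right),
\qquad
\alpha_d \sim \mathcal{N}\!\left(0, \sigma_\alpha^2\right)
\]
```

The normal prior on the division intercepts is what produces partial pooling: divisions with little data are shrunk toward the overall mean, while the vote-share term $\gamma\, v_{s[i]}$ lets states within a division differ in a structured way.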
5.3.4 Sample Size Is Critical
MRP’s performance depends heavily on the quality and size of the researcher’s survey sample. Up to now, we’ve been working with a random sample of 1,000 respondents, and though the resulting estimates are better than the raw sample means, their performance has been somewhat underwhelming. Suppose instead we had a sample of 5,000 respondents (Fig. 5.6).
Now MRP really shines. With more observations, the first-stage model can better predict opinions of out-of-sample respondents, which dramatically improves the poststratified estimates.
5.3.5 Stacked Regression and Poststratification (SRP)
Ultimately, the accuracy of one’s poststratified estimates depends on the out-of-sample predictive performance of the first-stage model. As we’ve seen above, the challenge is to thread the needle between overfitting and underfitting. Several recent papers (Bisbee, 2019; Broniecki et al., 2022; Ornstein, 2020) have shown that approaches from machine learning can help to automate this process, particularly with large survey samples.
In the code below, I’ll demonstrate how an ensemble of models – using the same set of predictors but different methods for combining them into predictions – can yield superior performance to a single multilevel regression model. In particular, I will fit a “stacked regression” (Breiman, 1996), which makes predictions based on a weighted average of multiple models, where the weights are assigned by cross-validated prediction performance (van der Laan et al., 2007). The literature on ensemble models is extensive, but for good entry points, I recommend Breiman (1996), Breiman (2001), and Montgomery et al. (2012) (Fig. 5.7).
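To convey the mechanics, here is a deliberately simplified base-R sketch of stacking. It is not the chapter's replication code and does not use any ensemble package: the data are simulated, there are only two base learners (both logistic regressions), and the single blending weight is found by one-dimensional optimization of cross-validated squared error rather than the constrained regression used in full implementations.

```r
set.seed(4)
n <- 2000

# Simulated survey (hypothetical): the effect of age differs by education,
# so a model with an interaction should earn weight in the ensemble
d <- data.frame(age = runif(n, 18, 90), college = rbinom(n, 1, 0.4))
d$y <- rbinom(n, 1, plogis(0.2 - 0.02 * (d$age - 50) +
                             0.05 * d$college * (d$age - 50)))

# Two base learners
formulas <- list(additive    = y ~ age + college,
                 interaction = y ~ age * college)

# Step 1: out-of-fold predictions from each base learner (5-fold CV)
K <- 5
fold <- sample(rep(1:K, length.out = n))
cv_pred <- matrix(NA_real_, nrow = n, ncol = length(formulas),
                  dimnames = list(NULL, names(formulas)))
for (k in 1:K) {
  train <- d[fold != k, ]
  test  <- d[fold == k, ]
  for (m in names(formulas)) {
    fit <- glm(formulas[[m]], family = binomial, data = train)
    cv_pred[fold == k, m] <- predict(fit, newdata = test, type = "response")
  }
}

# Step 2: choose the weight w in [0, 1] that minimizes the cross-validated
# squared error of the blended prediction
cv_loss <- function(w) {
  blend <- w * cv_pred[, "additive"] + (1 - w) * cv_pred[, "interaction"]
  mean((d$y - blend)^2)
}
w <- optimize(cv_loss, interval = c(0, 1))$minimum
weights <- c(additive = w, interaction = 1 - w)

# Step 3: refit the base learners on all data; the ensemble prediction is
# the weighted average of their predictions
fits <- lapply(formulas, function(f) glm(f, family = binomial, data = d))
stack_predict <- function(newdata) {
  p <- sapply(fits, predict, newdata = newdata, type = "response")
  as.numeric(p %*% weights)
}

weights
head(stack_predict(d))
```

Because the weights are chosen on out-of-fold predictions, a base learner that merely memorizes the training data earns little weight. In practice one would include more diverse learners, since two logistic regressions leave the ensemble little to choose between.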
The performance gains in Fig. 5.7 reflect the improvement that comes from modeling “deep interactions” in the predictors of public opinion (Ghitza & Gelman, 2013). If, for example, income better predicts partisanship in some states but not in others (Gelman et al., 2007), then a model that captures that moderating effect will produce better poststratified estimates than one that does not. Machine learning techniques like random forest (Breiman, 2001) are especially useful for automatically detecting and representing such deep interactions, and stacked regression and poststratification (SRP) tends to outperform MRP in simulations, particularly for training data with large sample size (Ornstein, 2020).
5.3.6 Synthetic Poststratification
Researchers rarely have access to the entire joint distribution of individual-level covariates. This can be limiting, since there may be a variable that one would like to include in the first-stage model but cannot because it is not in the poststratification frame. Leemann and Wasserfallen (2017) suggest an extension of MRP, which they (delightfully) dub “Multilevel Regression and Synthetic Poststratification” (MrsP). Lacking the full joint distribution of covariates for poststratification, one can instead create a synthetic poststratification frame by assuming that additional covariates are statistically independent of one another. So long as the first-stage model is linear additive, this approach yields the same predictions as if you knew the true joint distribution (see Note 3)! And even if the first-stage model is not linear additive, simulations suggest that the improved performance from additional predictors tends to overcome the error introduced in the poststratification stage.
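The intuition for the linear additive case follows from linearity of expectation. As a two-covariate sketch in my own notation (not the formal proof cited in the notes), suppose the first-stage prediction is $\hat{y}(x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. The poststratified estimate under the true joint distribution $P(x_1, x_2)$ is then:

```latex
\begin{align*}
\hat{\theta}
  &= \sum_{x_1, x_2} P(x_1, x_2)\,\bigl(\beta_0 + \beta_1 x_1 + \beta_2 x_2\bigr) \\
  &= \beta_0 + \beta_1 \sum_{x_1} P(x_1)\, x_1 + \beta_2 \sum_{x_2} P(x_2)\, x_2
\end{align*}
```

The final expression depends only on the marginal distributions $P(x_1)$ and $P(x_2)$, so replacing the joint distribution with the product $P(x_1)P(x_2)$ leaves the estimate unchanged. Once the model includes interactions or a nonlinear link, this equality no longer holds exactly.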
Here are some CES covariates that we might want to include in our model of police reform:
- How important is religion to the respondent?
- Whether the respondent lives in an urban, rural, or suburban area.
- Whether the respondent or a member of the respondent’s family is a military veteran.
- Whether the respondent owns or rents their home.
- Is the respondent the parent or guardian of a child under the age of 18?
These variables are likely to be useful predictors of opinion about police reform, and the first-stage model could be improved by including them. But there is no dataset (that I know of) that would allow us to compute a state-level joint probability distribution over every one of them. Instead, we would typically only know the marginal distributions of each covariate (e.g., the percent of a state’s residents that are military households or the percent that live in urban areas). So a synthetic poststratification approach may prove helpful.
To create a synthetic poststratification frame, we create a set of marginal probability distributions and multiply them together (see Note 4).
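In base R, that multiplication might be sketched as follows for a single state. The three marginal distributions are invented for illustration, and the code is a hand-rolled sketch rather than the convenience function mentioned in the notes.

```r
# Hypothetical marginal distributions for one state (made-up shares)
urban   <- c(urban = 0.30, suburban = 0.45, rural = 0.25)
veteran <- c(yes = 0.10, no = 0.90)
renter  <- c(own = 0.65, rent = 0.35)

# All combinations of the three covariates
synth <- expand.grid(urban   = names(urban),
                     veteran = names(veteran),
                     renter  = names(renter),
                     stringsAsFactors = FALSE)

# Under the independence assumption, the joint probability of each cell
# is the product of its marginal probabilities
synth$prob <- unname(urban[synth$urban] *
                       veteran[synth$veteran] *
                       renter[synth$renter])

synth
```

By construction the cell probabilities sum to one and reproduce each marginal exactly. When a true joint distribution is known for some covariates, the same idea applies: multiply each known cell frequency by the marginal shares of the additional covariates.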
Then, poststratify as normal using the synthetic poststratification frame (Fig. 5.8).
5.3.7 Best Performing
As a final demonstration, suppose we had access to the entire joint distribution over those covariates, and our first-stage model was a Super Learner ensemble. This combination yields the best-performing estimates yet (Fig. 5.9).
The results shown in Fig. 5.9 reflect all the gains from a larger sample size, ensemble modeling, and a full set of individual-level and group-level predictors.
5.4 Conclusion
For policy researchers interested in public opinion, MRP and its various refinements offer a useful approach to get the most out of survey data. The results I’ve presented in this chapter suggest a few lessons to keep in mind when applying MRP to one’s own research.
First, be wary of first-stage models that are underfit or overfit to the survey data. As we saw in Fig. 5.3, MRP estimates with too few predictors tend to over-shrink toward the grand mean (see Note 5). Using such estimates to inform subsequent causal inference would understate the differences between regions. Conversely, models that are overfit to survey data (e.g., Fig. 5.4) will tend to exaggerate regional differences.
Second, new techniques like synthetic poststratification and stacked regression can help researchers manage the trade-off between underfitting and overfitting. Synthetic poststratification allows for the inclusion of more relevant predictors, and regularized ensemble models help ensure that the predictions are not overfit to noisy survey samples. The best estimates often come from combining these two approaches.
Finally, recall that the most significant performance gains in our demonstration came not from more sophisticated modeling techniques, but from more data. As we saw in Fig. 5.6, working with a larger survey yielded greater improvements than any tinkering around with the first-stage modeling choices. MRP is not a panacea, and one should be skeptical of estimates produced from small-sample surveys, regardless of how they are operationalized.
In the code above, I emphasize “do-it-yourself” approaches to MRP – fitting a model, building a poststratification frame, and producing estimates separately. But there are now a number of R packages available with useful functions to help ease the process. In particular, I would encourage curious readers to explore the autoMrP package (Broniecki et al., 2022), which implements the ensemble modeling approach described above and performs quite well in simulations when compared to existing packages.
Further Suggested Readings
- McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. Boca Raton: Taylor and Francis, CRC Press. (particularly chapter 13).
- Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2021. Regression and Other Stories. Cambridge, United Kingdom: Cambridge University Press. (particularly chapter 17).
Review Questions
1. What other individual-level or group-level variables might be useful to include in the first-stage model of opinion on police reform, if they were available?
2. Why is regularization crucial for constructing good first-stage MRP models?
3. What are the benefits and potential downsides of using a synthetic poststratification frame?
Notes
1. All replication code and data are available on a public repository (https://github.com/joeornstein/mrp-chapter). Throughout, I will use R functions from the “tidyverse” (Wickham et al., 2019) to make the code more human readable.
2. Obviously that phrase means different things to different people. In this case, we’ll stick with the CES proposed policy of reducing police staffing by 10% and diverting those expenditures to other priorities.
3. See Ornstein (2020) Appendix A for mathematical proof.
4. The SRP package contains a convenience function for this operation (see the vignette for more information).
5. In the limit, a first-stage model with zero predictors would yield identical poststratified estimates for each state, equal to the survey sample mean.
References
Achen, C. H. (2005). Let’s put garbage-can regressions and garbage-can probits where they belong. Conflict Management and Peace Science, 22(4), 327–339. https://doi.org/10.1080/07388940500339167
Angrist, J. D., & Pischke, J.-S. (2010). The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2), 3–30. https://doi.org/10.1257/jep.24.2.3
Ansolabehere, S., Luks, S., & Schaffner, B. F. (2015). The perils of cherry picking low frequency events in large sample surveys. Electoral Studies, 40(December), 409–410. https://doi.org/10.1016/j.electstud.2015.07.002
Bisbee, J. (2019). BARP: Improving mister P using Bayesian additive regression trees. American Political Science Review, 113(4), 1060–1065. https://doi.org/10.1017/S0003055419000480
Breiman, L. (1996). Stacked regressions. Machine Learning, 24, 49–64. https://doi.org/10.1007/BF00117832
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Broniecki, P., Leemann, L., & Wüest, R. (2022). Improved multilevel regression with poststratification through machine learning (autoMrP). The Journal of Politics, 84(1). https://doi.org/10.1086/714777
Buttice, M. K., & Highton, B. (2013). How does multilevel regression and poststratification perform with conventional National Surveys? Political Analysis, 21(4), 449–467. https://doi.org/10.1093/pan/mpt017
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475
Gelman, A. (2018, May 19). Regularized prediction and poststratification (The Generalization of Mister p). Statistical Modeling, Causal Inference, and Social Science (Blog). https://statmodeling.stat.columbia.edu/2018/05/19/
Gelman, A., & Little, T. C. (1997). Poststratification into many categories using hierarchical logistic regression. Survey Methodology, 23(2), 127–135.
Gelman, A., Shor, B., Bafumi, J., & Park, D. (2007). Rich state, poor state, red state, blue state: What’s the matter with Connecticut? Quarterly Journal of Political Science, 2(June 2006), 345–367. https://doi.org/10.1561/100.00006026
Ghitza, Y., & Gelman, A. (2013). Deep interactions with MRP: Election turnout and voting patterns among small electoral subgroups. American Journal of Political Science, 57(3), 762–776. https://doi.org/10.1111/ajps.12004
Lax, J. R., & Phillips, J. H. (2009). How should we estimate public opinion in the states? American Journal of Political Science, 53(1), 107–121. https://doi.org/10.1111/j.1540-5907.2008.00360.x
Lax, J. R., & Phillips, J. H. (2012). The democratic deficit in the states. American Journal of Political Science, 56(1), 148–166. https://doi.org/10.1111/j.1540-5907.2011
Leemann, L., & Wasserfallen, F. (2017). Extending the use and prediction precision of subnational public opinion estimation. American Journal of Political Science, 61(4), 1003–1022.
Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585. https://doi.org/10.1126/science.aal3618
Montgomery, J. M., Hollenbach, F., & Ward, M. D. (2012). Improving predictions using ensemble Bayesian model averaging. Political Analysis, 20(3), 271–291.
Ornstein, J. T. (2020). Stacked regression and Poststratification. Political Analysis, 28(2), 293–301. https://doi.org/10.1017/pan.2019.43
Park, D. K., Gelman, A., & Bafumi, J. (2004). Bayesian multilevel estimation with poststratification: State-level estimates from national polls. Political Analysis, 12(4), 375–385. https://doi.org/10.1093/pan/mph024
Schaffner, B., Ansolabehere, S., & Luks, S. (2021). Cooperative election study common content, 2020. https://doi.org/10.7910/DVN/E9N6PH
Tausanovitch, C., & Warshaw, C. (2014). Representation in municipal government. The American Political Science Review, 108(03), 605–641. https://doi.org/10.1017/S0003055414000318
van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1).
Wang, W., Rothschild, D., Goel, S., & Gelman, A. (2015). Forecasting elections with non-representative polls. International Journal of Forecasting, 31(3), 980–991. https://doi.org/10.1016/j.ijforecast.2014.06.001
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., et al. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this chapter
Cite this chapter
Ornstein, J.T. (2023). Getting the Most Out of Surveys: Multilevel Regression and Poststratification. In: Damonte, A., Negri, F. (eds) Causality in Policy Studies. Texts in Quantitative Political Analysis. Springer, Cham. https://doi.org/10.1007/978-3-031-12982-7_5
DOI: https://doi.org/10.1007/978-3-031-12982-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-12981-0
Online ISBN: 978-3-031-12982-7
eBook Packages: Political Science and International Studies (R0)