Prevalence studies of disordered gambling have been conducted in many countries over three decades (Williams et al. 2012). In consequence there is widespread consensus that gambling disorder, at least of a level of severity warranting clinical intervention, is a relatively rare mental disorder, though one that has become more common in many jurisdictions as a result of more widespread gambling opportunities. Based on the application of econometric methods for identification and control of sample selection bias, we question this consensus, concluding that prevalence of gambling problems in the general population is likely to be significantly larger than generally thought. The issues identified here apply, in principle, to every disorder for which prevalence is estimated using surveys based on psychiatric screening instruments, and not just gambling disorder.

Scholarly research consistently finds high shares of commercial gambling revenue to be derived from proportions of populations that are much smaller than the large proportion who occasionally or frequently gamble. For the United States 15% of revenue derives from 0.5% of the population, for Canada 23% derives from 4.2% of the population, for Australia 33% derives from 2.1% of the population, and for New Zealand 19% derives from just 1.3% of the population. Footnote 1 It is primarily among the ranks of these high-spending gamblers that one finds those who currently have, or are at greatest risk for having, clinically diagnosable gambling problems. And the largest share of casino gambling floor revenue now derives from electronic slot and poker machines, which are strikingly characterized as constituting an Addiction by Design by Schüll (2012).

We examine in detail the manner in which the population prevalence of disordered gambling has been estimated by psychologists and psychiatric researchers. Surveys of disordered gambling have traditionally used screens designed to detect individuals who engage in gambling activity that might lead them to clinically “present” and meet criteria for diagnosis of a psychiatric disorder. This is a valid scientific goal for the design, calibration and application of such surveys, although it is not the only possible goal or the most interesting for broader public health assessments.Footnote 2 We reconsider the manner in which inferences about gambling problems in the general population are made based on these surveys. We suggest that there are different kinds of inferences possible than have traditionally been emphasized, and that there is a recurring, major sample selection bias that has not been accounted for. When that bias is corrected we infer significantly greater prevalence of gambling disorders, and notably fewer comorbidities with other mental health problems than are typically reported. Thus we contribute to isolating gambling disorder as a partly discrete public health problem to which policies can be specifically targeted and their efficiency evaluated.

Most of the inferences that have been drawn based on analysis of prevalence surveys have concerned general population prevalence, socio-demographic correlates, and comorbidities. They have typically focused on the binary classification of individuals as “disordered,” “pathological” or “problem” gamblers, or not, where these terms are defined either directly or approximately in terms of DSM-IV (American Psychiatric Association 1994) clinical criteria. Since the classification of the condition under “Substance-Related and Addictive Disorders” in DSM 5 (American Psychiatric Association 2013), it has become standard usage to refer to it as “Gambling Disorder”.

Henceforth, where we refer to the clinical phenomenon ex cathedra we will follow DSM 5 and use “gambling disorder” (GD). We will refer to a representative person who has acquired the condition as a “disordered gambler” (DG). Where we refer to previous work set in clinical contexts that used either “pathological” or “problem” gambling without distinguishing them, or intending that they be distinguished (for example, in some work applying the Problem Gambling Severity Index (PGSI) (Ferris and Wynne 2001)), we will anachronistically use the terms “gambling disorder” and “disordered gambler.” Where we are referring to a context in which “problem gambling” and “pathological gambling” are distinguished, with the former denoting a pre-clinical or warning state for the latter, we retain the distinction and use these older terms. Finally, when we talk about harmful consequences of gambling outside the clinical context we use “gambling problems” as a non-technical term of everyday English.

An original goal of DSM 5 was to shift focus away from categorical classifications emphasized in DSM-III and DSM-IV (e.g., “pathological/non-pathological”) to continuous measures, understood as probing continua between normal and disordered functioning. However, the American Psychiatric Association ultimately decided to defer this ambition, and GD continues to be clinically regarded as a pathology from which a person either suffers or does not. We implicitly consider that classification, but expand the analysis to include the range of gambling problems as an ordered hierarchy. Our interest is in the latent continuum of gambling problems, as a complement to studying binary classifications with thresholds.Footnote 3 This interest corresponds to our ultimate focus, as economists concerned with the general impact on welfare of gambling and public health policy, on evaluating the severity of all problems associated with gambling, which include but are not limited to the form of addiction that DSM 5 labels as GD.Footnote 4

In Sect. 1 we reconsider inferences from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) in the United States. The first wave of NESARC was conducted in 2001 and 2002, and had a sample of 43,093 individuals.Footnote 5 The instrument for measuring gambling problems was based on the DSM-IV criteria, and DSM-IV criteria were likewise used for the instruments measuring other major psychiatric disorders.Footnote 6

The most significant statistical issues arise from the difficulty of drawing inferences about GD prevalence and comorbidity when one attempts to account for the sample selection bias of “trigger,” “gateway” or “diagnostic stem” questions. Such questions ask whether a respondent has ever gambled more frequently than some threshold rate or number of occasions, and/or whether they have ever gambled away more than some threshold amount of money on any single occasion. Only respondents who report meeting the relevant thresholds are asked the remaining gambling screen questions. A main motivation for use of trigger questions is not to irritate respondents by asking them about gambling problems after they have effectively said that they are not regular gamblers, or perhaps not gamblers at all. This motivation is particularly easy to appreciate in the case of surveys such as the NESARC, which address multiple potential disorders; the surveyor does not want to risk reduced cooperation on other survey modules by annoying respondents about gambling problems they (apparently) manifestly do not have.

The potential for sample selection bias arises when there is some systematic factor explaining why someone might not want to participate in the full set of questions, and therefore deliberately or subconsciously selects out of that full set by answering a certain way in response to the trigger question.Footnote 7 Sometimes this potential leads to no difference in inferences from the observed sample: for instance, if respondents want to spend more time in a face-to-face interview with more attractive interviewers, and the attractiveness level of interviewers is random, there will be no a priori reason to expect an effect on inferences about gambling risks. On the other hand, if someone wants to hide their gambling problems, they might reasonably choose to lie in response to the trigger question. Indeed, hiding gambling problems is explicit in one of the criteria used in the full set of questions for determining the extent to which someone is at risk for GD or should be classified as a DG! There are no perfect statistical methods to correct for this bias, but the bias appears to be significant in the case of several major, influential surveys of gambling problems that used trigger questions. We therefore take some time in Sect. 2 to review the rationale for these trigger questions, and note the vigorous rhetoric sometimes used to defend them. We suspect that the strength of these defenses is thought to be justified by an expectation that they have no effect on inference, and the efficiency gains in the time needed to conduct surveys that are apparent from their use.

Section 3 draws some conclusions, including recommendations for future survey design and analysis.

Estimates for the United States from NESARC

Comorbidities

The prevailing view is that GD typically co-occurs with a variety of other mental disorders. Petry et al. (2005, p. 564) evaluated this using NESARC data and concluded that GD is “highly comorbid with substance use, mood, anxiety, and personality disorders, suggesting that treatment for one condition should involve assessment and possible concomitant treatment for comorbid conditions.” Panel A of Table 1 replicates their methods and essentially obtains the same results, using a logistic specification.Footnote 8 All calculations with the NESARC correct for the complex sampling design.Footnote 9 In each row the independent binary variable is whether the respondent is defined as having the indicated psychiatric disorder or not.Footnote 10 Petry et al. (2005) examine the risk of being what we would now call a DG, and this is the sole risk level used in Table 1. Each of the odds ratio (OR) estimates in Panel A is much greater than 1, and statistically significantly so: in every case the lower bound of the 95% confidence interval is well above 1.
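
As a reading aid for the OR and confidence interval language used here, the following minimal sketch shows how an odds ratio and its 95% interval are conventionally obtained from a logistic regression coefficient and its standard error. The coefficient and standard error are hypothetical values chosen for illustration, not estimates from Table 1, and the function name is ours. With NESARC the standard error would need to reflect the complex sampling design, which this sketch does not attempt.

```python
import math

def odds_ratio_ci(coef, se, z=1.96):
    """Convert a logit coefficient and its standard error into
    (odds ratio, lower 95% bound, upper 95% bound)."""
    return (math.exp(coef),
            math.exp(coef - z * se),
            math.exp(coef + z * se))

# Hypothetical inputs, for illustration only (not NESARC estimates).
or_hat, lower, upper = odds_ratio_ci(coef=1.2, se=0.25)

# The OR is "statistically significantly greater than 1" at the 5% level
# when the lower bound of the interval exceeds 1.
significantly_above_one = lower > 1.0
```

An OR whose interval includes 1, as reported later for anxiety in Panel B, would fail this check.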

Table 1 Logistic odds ratio calculations for pathological gambling

These analyses of comorbidities examine “total effects” rather than “marginal effects.” We say that one has measured the total effect of some secondary psychiatric disorder X on the focus disorder Y when there are no controls for the presence of other psychiatric disorders A, B, C … etc. The marginal effect of psychiatric disorder X is measured when one controls for the presence of other psychiatric disorders. Both types of effects can be of interest for public health and clinical purposes, and answer different questions.Footnote 11 The total effect answers a question along these lines: “If all I know about a group of people is that they abuse alcohol, how likely is it that they are also DGs?” Another total effect question might be, “If all I know about a group of people is that they are chronically depressed, how likely is it that they are also DGs?” Assume, as is the case, that both total effects are positive and statistically significant. The marginal effect answers a different question, of the following kind: “If I know that people abuse alcohol and/or are chronically depressed, what is the incremental correlation of each disorder with their also suffering from GD?” It could be that the incremental correlation of alcohol abuse is low or non-existent and the incremental correlation of chronic depression is high. These particular correlations suggest, but of course do not prove, that there is causality from chronic depression to GD, none from alcohol abuse to GD, and some from chronic depression to alcohol abuse (or vice versa).Footnote 12 If this suggestion is correct, it has direct implications for treatment for GD. We would argue that marginal effects are closer to what we want to learn about from evaluation of general population surveys, at least for purposes of designing and choosing public health interventions, than total effects.
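
The distinction can be illustrated with a small numerical sketch. The counts below are invented for illustration and are not NESARC data; they are constructed so that depression raises the rate of GD, alcohol abuse is more common among the depressed, and alcohol abuse has no association with GD once depression is held fixed. Stratifying on depression stands in here for the logistic regression with controls used in Table 1, a simplification we adopt only to keep the example self-contained.

```python
# Hypothetical counts, invented for illustration only (not NESARC data).
# Key: (depressed, abuses_alcohol) -> (number with GD, number without GD)
cells = {
    (1, 1): (100, 900),
    (1, 0): (100, 900),
    (0, 1): (10, 990),
    (0, 0): (70, 6930),
}

def odds_ratio(exposed, unexposed):
    """Odds ratio from two (cases, non_cases) pairs."""
    (a, b), (c, d) = exposed, unexposed
    return (a * d) / (b * c)

def pooled(alcohol):
    """Collapse over depression status for a given alcohol status."""
    cases = sum(v[0] for k, v in cells.items() if k[1] == alcohol)
    non_cases = sum(v[1] for k, v in cells.items() if k[1] == alcohol)
    return cases, non_cases

# Total effect: alcohol abuse vs. GD, ignoring depression.
total_or = odds_ratio(pooled(1), pooled(0))

# Marginal effect: alcohol abuse vs. GD within each depression stratum.
marginal_ors = {dep: odds_ratio(cells[(dep, 1)], cells[(dep, 0)])
                for dep in (0, 1)}
```

In this constructed example the total OR for alcohol abuse is about 2.68, while the OR within each depression stratum is exactly 1: all of the apparent association runs through depression.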

Panel B of Table 1 shows the estimates of comorbidities, focusing on marginal effects and the implied OR. We use the same econometric specification as Panel A, for comparability. The point estimates are much closer to 1 than the total effects, as are the lower bounds of the 95% confidence intervals. In one case, the comorbidity of anxiety and GD, the OR is not statistically significantly different from 1. The upper bounds of the 95% confidence intervals of the marginal effects in Panel B are all well below the lower bounds of the 95% confidence intervals of the corresponding total effects in Panel A.

Panels C and D of Table 1 show comparable estimates of total and marginal effects if one includes a long list of socio-economic and socio-demographic covariates.Footnote 13 There is a slight lowering of most of the OR compared to Panels A and B, respectively, but no significant change from the conclusions drawn from Panels A and B.

Sample Selection

Panel E of Table 1 lists additional covariates from the logistic model estimated to obtain the marginal effects in Panels C and D. To informally motivate the concern with sample selection bias, focus on the OR estimates shown in bold in Panel E. Imagine we encounter men, Blacks, those separated by divorce or death, people living in the West, those without a college or graduate degree, and those with a personal income over $70k at the time of the survey. The values of these OR estimates, and their statistical significance, tell us that respondents with these characteristics are more likely to be DGs. Now suppose we encounter respondents with these characteristics who happened not to respond affirmatively to the trigger question. Without knowing their responses to the trigger question we would be inclined to suspect them of some greater-than-baseline risk of GD, ceteris paribus. The only reason they are not so classified is that their response to the trigger question led to them being assumed to have no current or past gambling problems and, therefore, no risk of GD. This involves two fallacious inferences: first, that no one who says “no” to the trigger question has any current or past gambling problems, and, second, that there are no other potential indicators of risk. We can easily imagine some degree of sample selection bias if the responses to the trigger question are correlated with the characteristics that constitute these additional indicators.Footnote 14 This is loose and informal, since it is based on a “chicken and egg” fallacy: we are looking at estimates that ignore this sample selection correction to motivate the possibility of sample selection bias. But as long as we check this with appropriate methods, this motivation is acceptable.

The sample selection models developed by Heckman (1976, 1979) meet this need. They require the researcher to specify a sample selection process, characterizing which respondents appear in the main survey and which do not. Typically this is a simple binary matter, so one can specify this process with a probit model. In our case the sample selection consists of some trigger questions we examine in a moment; if the respondent passes these, they are admitted to the main survey and asked the DSM criteria questions. The Heckman approach also requires a model of the data generation process in the main survey. In our case this might consist of a binary choice statistical model explaining whether someone meets the DSM threshold for being potentially classified as a DG.

In the original setting studied by Heckman (1976, 1979) the main data generating process of interest, and potentially subject to sample selection bias, had a dependent variable that was continuous, and the specification was Ordinary Least Squares. In our case, at least initially, the main data generating process underlying the classification as a DG is binary, and the same ideas carry over: Van de Ven and Van Praag (1981) is the first application of sample selection to a probit specification of the behavior of interest, and Lee (1983) and Maddala (1983) provide general expositions.

One important assumption in the standard sample selection model is to specify some structure for the errors of the two equations, the sample selection equation and the main survey question. If both equations are modeled with probit specifications, for example, the natural first assumption is that the errors are bivariate normal.Footnote 15 We assume instead a flexible semi-nonparametric (SNP) approach due to Gallant and Nychka (1987), applied to the sample selection model by De Luca and Perotti (2011). This SNP approach approximates the bivariate density function of the errors by a Hermite polynomial expansion.Footnote 16
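
For readers who want the structure spelled out, the two-equation setup can be sketched as follows. The notation is ours and is not taken from the source: z collects the covariates of the selection equation, x those of the main equation, and the SNP density is written in the general Gallant and Nychka form as we understand it, with θ a normalizing constant. The parametric likelihood shown is the standard probit with sample selection.

```latex
% Illustrative notation (an assumption of this sketch, not the paper's own).
% Selection equation: respondent i passes the trigger question when s_i = 1.
% Outcome equation: y_i (meets the DG threshold) is observed only if s_i = 1.
\[
\begin{aligned}
s_i^{*} &= z_i'\gamma + u_i, &\qquad s_i &= \mathbf{1}[\,s_i^{*} > 0\,],\\
y_i^{*} &= x_i'\beta + \varepsilon_i, &\qquad y_i &= \mathbf{1}[\,y_i^{*} > 0\,].
\end{aligned}
\]

% Parametric baseline: standard bivariate normal errors with correlation \rho.
% \rho = 0 means selection is ignorable for the outcome equation.
\[
\ln L = \sum_{s_i = 0} \ln \Phi(-z_i'\gamma)
      + \sum_{s_i = 1,\; y_i = 1} \ln \Phi_2(x_i'\beta,\; z_i'\gamma;\; \rho)
      + \sum_{s_i = 1,\; y_i = 0} \ln \Phi_2(-x_i'\beta,\; z_i'\gamma;\; -\rho)
\]

% SNP alternative: replace the bivariate normal density of (u, \varepsilon)
% with a squared polynomial times a product of standard normal densities.
\[
f(u, \varepsilon) = \frac{1}{\theta}\, \tau_R(u, \varepsilon)^{2}\, \phi(u)\, \phi(\varepsilon),
\qquad
\tau_R(u, \varepsilon) = \sum_{h=0}^{R_1} \sum_{k=0}^{R_2} \tau_{hk}\, u^{h} \varepsilon^{k}
\]
```

Exclusion restrictions correspond to variables that appear in z but not in x.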

Another important assumption in the sample selection model, said to be “good for identification,” is that there are variables that explain sample selection but that a priori do not explain the main outcome. In many expositions one sees the comment that in the absence of these “exclusion restrictions” the sample selection model is “problematic.” Finding such variables is often a major empirical challenge, since it can be hard to exclude a variable from potentially affecting the main outcome of interest while including it as likely to affect sample selection. In epidemiology, for instance, a spirited defenceFootnote 17 of the use of sample selection corrections to estimates of HIV prevalence in Bärnighausen et al. (2011a) came from Bärnighausen et al. (2011b) on the grounds that they had access to an ideal exclusion restriction: the identity of the survey interviewer. We agree that this exclusion restriction is an attractive and reasonably general one, but it is not universally applicable.

What is particularly “problematic” in the absence of a priori convincing exclusion restrictions is that one must rely on having the right econometric specification if the sample selection model is to correct for sample selection bias. This specification in turn refers to the specification of the two equations as probit models, and specifically to the assumed bivariate normality of errors.Footnote 18 The importance of having the right specification of the error distribution also applies even when one does have exclusion restrictions.

As it happens, there are ways to construct exclusion restrictions in NESARC that have some a priori credibility. For instance, we know the day of the week on which the interview was conducted, and can condition on Friday, Saturday or Sunday interviews as potentially generating differential response. We also know how many trigger questions for other disorders a subject had answered affirmatively by the time the gambling trigger questions were asked, as one measure of how much time and “patience” had been taken up by that stage of the interview. Additional characteristics of the individual are available from baseline questions, and can be used to identify the sample selection equation. But such exclusion restrictions do not always arise in other surveys of gambling, even major epidemiological surveys. In general we recommend survey methods that do not require these sorts of tradeoffs (between finding attractive exclusion restrictions and reliance on the assumed stochastic structure for identification), but with existing surveys some tradeoffs are often needed.

To set the stage for the evaluation of sample selection corrections, Table 2 and Fig. 1 show the estimated OR between GD and other psychiatric disorders when using the SNP approach rather than the parametric logistic specification. The total comorbidity estimates in Panel A of Table 2 are comparable to those in Panel C of Table 1; similarly, the marginal comorbidity estimates in Panel B of Table 2 are comparable to those in Panel D of Table 1. With the SNP approach the marginal comorbidities are not quite as close to 1 as with the parametric model. Nevertheless, the same qualitative conclusions about the relationship of total and marginal comorbidity still apply.

Table 2 Semi-nonparametric odds ratio calculations for pathological gambling
Fig. 1

Source: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC)

Comorbidity of gambling disorder and other psychiatric disorders. Estimated odds ratios using semi-nonparametric ordered response model.

Figure 2 shows marginal effects of comorbidities when one undertakes sample selection corrections.Footnote 19 The covariates used for this exercise are the same full set used in “model 3” of Petry et al. (2005), and are used for both equations.Footnote 20 In addition, for the sample selection equation, we used a set of 29 variables reflecting recent events in the life of the respondent (e.g., family deaths or illness, job layoff, change in job, problems with neighbors or friends, criminal problems), height and weight, days of the week for the interview, and the number of previous trigger questions answered affirmatively. The variables reflecting life events only referred to the last year or last few months prior to the interview, and we are examining GD incidence across the lifetime frame. Table 3 presents detailed estimates: for now, focus on Panel C, which shows OR with respect to the GD risk level. The effect of sample selection corrections is clear: the OR estimates are generally much lower. The estimated correlation between the two equations in the selection model, a measure of the importance of sample selection corrections, is −0.19.

Fig. 2

Source: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC)

Effect of sample selection correction on estimates of comorbidity marginal effects with gambling disorder. Estimated odds ratios using semi-nonparametric ordered response model.

Table 3 Semi-nonparametric odds ratio calculations for gambling risks

The Hierarchy of Gambling Disorders

We turn to the hierarchy of gambling disorders, and inferences about general population prevalence. For example, the PGSI classifies samples into the categories “Non-Gambler,” “Low Risk for Problem Gambling,” “Moderate Risk for Problem Gambling,” and “Problem Gambler.”Footnote 21 Previous statistical evaluations of these hierarchies have not, to our knowledge, formally recognized the ordered nature of the categories used in standard survey screens, which are derived directly from clinical screens. When several categories are ordered there are appropriate estimation procedures that use this information. The most popular are ordered probit models in which a latent index is estimated with “cut points” to identify the categories. We employ a SNP version of this type of ordered response model, developed by Stewart (2004) and extended by De Luca and Perotti (2011) to allow for sample selection corrections. We classify respondents into four categories. Non-Indicated individuals meet no DSM-IV criteria or were not asked about them; Weakly Indicated individuals meet 1 or 2 DSM-IV criteria; and Moderately Indicated individuals meet 3 or 4 DSM-IV criteria.Footnote 22 For the fourth category we retain the terminology used in the NESARC, and refer to individuals who meet 5 or more DSM-IV criteria as Pathological Gamblers.
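
The four-way classification just described can be written out as a short sketch. The function name and the use of None to mark a respondent who was screened out by the trigger question are our own conventions for illustration; the upper limit of ten reflects the number of DSM-IV criteria for pathological gambling.

```python
def classify(n_criteria):
    """Map a count of endorsed DSM-IV criteria to one of the four
    ordered categories. None means the respondent was never asked
    the criteria questions (screened out by the trigger question)."""
    if n_criteria is None or n_criteria == 0:
        return "Non-Indicated"
    if not 0 < n_criteria <= 10:
        raise ValueError("DSM-IV criteria count must be between 0 and 10")
    if n_criteria <= 2:
        return "Weakly Indicated"
    if n_criteria <= 4:
        return "Moderately Indicated"
    return "Pathological Gambler"

# A screened-out respondent and one who endorsed no criteria are
# indistinguishable in the uncorrected data.
labels = [classify(n) for n in (None, 0, 1, 3, 5)]
```

The first branch makes the source of potential bias visible: anyone screened out is recorded exactly as if they had answered every criterion question in the negative.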

Figures 3 and 4 report estimates from a SNP ordered response model that ignores sample selection and estimates that correct for it. We use the estimates from these models to predict the fraction of the population in each of our four categories above. As a control, it is useful to note that the fractions of the population from the raw data found in each DSM-IV response number “bin” are recovered by the estimated ordered response model when we do not correct for sample selection: 94.6% Non-Indicated, 4.0% Weakly Indicated, 0.9% Moderately Indicated, and 0.4% Pathological Gamblers. Hence we know that the base statistical model we have estimated is not biased relative to the raw data, as we have binned it. These base predictions are referred to as the Uncorrected predictions in Figs. 3 and 4. We therefore find a common result, that the prevalence of Pathological Gambling is around 0.4%. To the extent that our “Moderately Indicated” individuals are taken to approximately correspond to what some researchers (e.g., Pietrzak et al. 2007; Alegria et al. 2009; Nower et al. 2013) categorize as sub-clinical “Problem Gamblers,” the sum of the two most troubled categories produces a figure of 1.3%, familiar from much of the GD prevalence literature. The Corrected predictions, allowing for sample selection biases, are again dramatic. The fraction of Weakly Indicated increases from 4.0 to 8.3%, the fraction of Moderately Indicated increases from 0.9 to 3.9%, and the fraction of Pathological Gamblers increases from 0.4 to 3.8%. Hence prevalence of Pathological Gamblers plus Moderately Indicated is 7.7% when sample selection bias is corrected, compared to 1.3% when no correction is applied.

Fig. 3

Source: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC)

Predicted prevalence of gambling risk with and without sample selection correction. Estimated probabilities using semi-nonparametric ordered response model.

Fig. 4

Statistical significance of sample selection corrections for gambling risk. 100 predicted marginal probabilities from each model, for each individual, reflecting covariance of estimates

It is worth stressing that this result obtains not simply because the sample selection model predicts that more people will get through the gateway of the trigger question, although it does predict that. The observed fraction being selected by their responses to that question is 27%, and the predicted fraction from the sample selection model who would have been selected if they answered the trigger question accurately (according to the empirical specification) is 58%.Footnote 23 The issue is also a matter of which profile of subjects is predicted to be selected. The sample selection model predicts more of the types of people predicted to flag more DSM criteria, and fewer of the type of people predicted to flag fewer DSM criteria. Thus sample selection is, as emphasized by Heckman (1976, 1979), fundamentally an issue about allowing for unobserved heterogeneity.Footnote 24

Figure 4 displays the distribution of predictions, with and without sample selection corrections, as well as indicators of the statistical significance of the effect of sample selection. Consider the top left panel in Fig. 4, for the “Non-Indicated” category of gambling risk. The Uncorrected distribution of predictions reflects the results of simulating 100 random draws for each NESARC respondent from the predicted marginal probability of Non-Indicated, using the estimated SNP ordered probit model. Each random draw is from a normal distribution whose mean is the point estimate of the marginal probability for that subject, and whose standard deviation is the standard error of that point estimate, again for that subject. Thus the 100 random draws for each subject reflect individual-specific predictions, taking into account the statistical uncertainty of the prediction. The Corrected distribution of predictions is generated similarly, using the estimated SNP ordered probit model allowing for sample selection. Since there are 43,093 respondents to NESARC, each of the kernel densities in Fig. 4 reflects 4,309,300 predictions.
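
The simulation step described above can be sketched as follows. The three (point estimate, standard error) pairs are hypothetical stand-ins for respondent-level predictions, and the function name and seed are ours; the actual exercise uses the estimated model's predictions for all 43,093 respondents. As in the description, draws come from a normal distribution, so an individual draw is not constrained to lie in the unit interval.

```python
import random

def simulate_predictions(estimates, n_draws=100, seed=0):
    """For each respondent, draw n_draws values from a normal distribution
    centered on the predicted marginal probability, with standard deviation
    equal to the standard error of that prediction."""
    rng = random.Random(seed)
    draws = []
    for point_estimate, std_error in estimates:
        draws.extend(rng.gauss(point_estimate, std_error)
                     for _ in range(n_draws))
    return draws

# Hypothetical respondent-level predictions (not NESARC estimates).
estimates = [(0.95, 0.01), (0.90, 0.02), (0.85, 0.03)]
draws = simulate_predictions(estimates)

# Size of the pooled sample behind each kernel density in the full exercise.
n_predictions_full = 43_093 * 100
```

The pooled draws are what a kernel density estimator would then be applied to, once for the Uncorrected model and once for the Corrected model.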

These densities in Fig. 4 allow one to see the average effects shown in Fig. 3, such as the decrease in predicted Non-Indicated respondents from 0.946 to 0.839, but also to visualize the precision of this difference. A t test for each NESARC respondent generates a p value for the hypothesis that the predicted marginal probability is the same with and without sample selection corrections. The 90th, 95th and 99th percentiles of this distribution of 43,093 p values are tabulated in the top-left panel of Fig. 4. We find that the predicted decrease in Non-Indicated is statistically significant, in the sense that the 99th percentile of these p values is 0.001 or lower.Footnote 25 Similarly, the average predicted increases in the Weakly Indicated, Moderately Indicated and Pathological Gambler categories (Fig. 3) are also statistically significant, with the 99th percentile of p values again being 0.001 or lower in each case (Fig. 4).

Figure 5 shows a decomposition of the processes underlying the sample selection correction, to better understand the logic. For each category of gambling problem or risk, it displays the conditional probability of being classified in that category depending on whether the subject is predicted to be “selected out” or “selected in” by the trigger question. For instance, if someone is predicted not to be selected in, the probability of them being classified as Weakly Indicated is 0.142; if that person is predicted to be selected in, the probability of them being classified as Weakly Indicated is 0.040. Since the predicted probability of being selected in is 0.580, this implies that the weighted probability of being in the Weakly Indicated bin is [0.580 × 0.040] + [(1 − 0.580) × 0.142] = 0.083, which is the value shown in Fig. 3 for being Weakly Indicated with sample selection correction.
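
The weighted-probability calculation above is an application of the law of total probability, and can be checked with a few lines. The three inputs are the values quoted in the text for the Weakly Indicated category; the function name is ours.

```python
def corrected_probability(p_selected_in, p_if_in, p_if_out):
    """Weight the two conditional probabilities by the predicted
    probability of being selected in by the trigger question."""
    return p_selected_in * p_if_in + (1 - p_selected_in) * p_if_out

# Values quoted in the text for the Weakly Indicated category.
p_weak = corrected_probability(p_selected_in=0.580,
                               p_if_in=0.040,
                               p_if_out=0.142)
p_weak_rounded = round(p_weak, 3)
```

The unrounded value is approximately 0.0828, which rounds to the 0.083 reported for Weakly Indicated with sample selection correction.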

Fig. 5

Source: National Epidemiologic Survey on Alcohol and Related Conditions (NESARC)

Predicted probability of gambling risk conditional on sample selection or not. Predicted probability of sample selection = 0.58. Estimated probabilities using semi-nonparametric ordered response model.

Table 3 shows the predicted OR with respect to other psychiatric disorders for each category of the gambling hierarchy model with and without sample selection corrections. For each category of gambling problem or risk the OR for each disorder is much smaller when corrections are made for sample selection. Again, the upper bound of the 95% confidence interval with sample selection corrections is always below the lower bound of the same confidence interval without sample selection corrections.

Sample Selection Bias and Gambling Survey Screens

Survey screens have been traditionally designed to provisionally identify individuals who are likely to meet clinical criteria for GD. This has various implications for the design and format of the survey questions, which have evolved over time. Here we evaluate some of the issues that flow from that design objective as those relate to the use of trigger questions, ending with constructive suggestions to mitigate the sample selection biases such questions generate.

The Evolution of Trigger Questions

The history of the South Oaks Gambling Screen (SOGS) provides an important exemplar of these origins and concerns. The initial stages of the development of the instrument involved South Oaks Hospital patients already admitted for some alcohol or drug dependency, and were prompted by knowledge from previous clinical treatment of the correlations between these addictions and gambling problems (Lesieur et al. 1986). In the initial pilots of screen designs, if “the patient denied any gambling, he or she was not interviewed further” (Lesieur and Blume 1987, p. 1185). On the other hand, later care and conversations might reveal that some deception had occurred, in which case the patient was re-interviewed (ibid.). The pilot questions, and the subsequent finalized SOGS, were directly motivated by the criteria stipulated in DSM-III (American Psychiatric Association 1987), albeit with modifications to focus less on late stage, “desperation phase,” symptoms.Footnote 26

The final instrument, presented in Lesieur and Blume (1987; Appendix 1), was cross-validated by being given to 213 members of Gamblers Anonymous, 384 university students, and 152 hospital employees. The logic of this cross-validation was that the first group are self-identified as having gambling problems, while the last two groups were presumptively expected not to be DGs. Hence the detection of GD propensities of 98%, 5% and 1.3%, respectively, by the SOGS response scores was viewed as providing evidence of 2% false negatives, 5% tentative false positives, and 1.3% tentative false positives, respectively.

The clinical origins of SOGS did not mean that it automatically translated into an ideal epidemiological instrument, and indeed it was subsequently largely supplanted in that use by other instruments, such as the PGSI, thought to be more accurate. An important early warning was raised by one of the SOGS authors, Lesieur (1994), who carefully noted how seemingly minor changes in sampling procedures and question wording might completely change the interpretation, and claims of validity, of the instrument.

An important exception to the emphasis on clinical objectives for GD survey instruments is offered by Currie et al. (2009), who argue that many gamblers who report no occurrent or historical gambling problem might be “at risk” in a broader public health sense. That is, someone identified in a survey as having no gambling problems might have a heightened propensity to engage in other behaviors that predict vulnerability to GD, and for that reason might be of interest to public health forecasting.

There is no mention of a trigger question in the first epidemiological applications of SOGS in the United States reported by Volberg and Steadman (1988, 1989), or in the revised SOGS surveys for New Zealand reported by Abbott and Volberg (1996). One of the first surveys to have used a trigger question appears to be Dickerson et al. (1996). Since then, the use of trigger questions has become standard, particularly in large-scale epidemiological surveys, as the review of national prevalence studies by Williams et al. (2012) shows. There are continuing debates about the nature of those trigger questions, but they generally concern whether participants should be asked about their gambling over lifetime or only past-year time frames. There is also critical discussion about whether monetary loss thresholds should figure in questions. Stone et al. (2015) emphasize these issues, while also signaling awareness of potential sample selection bias introduced by use of trigger questions, but do not address measures to explicitly correct for it.

A somewhat aggressive defense of trigger questions is provided by the Australian Productivity Commission (1999; volume 3, page F14):

The [Australian] National Gambling Survey did not administer the SOGS to all respondents – indeed there are good reasons why gambling surveys do not ask the problem gambling screen of all participants:

  • questions about what people do when they gamble are clearly of no relevance to non gamblers. In the National Gambling Survey, respondents were classified as a non gambler only after they had answered ‘no’ to thirteen separate questions about whether they had participated in any of twelve specified gambling activities and an ‘any other’ gambling category. Hence, this detail of questioning should reliably identify a genuine non gambler.

  • a problem gambling screen is of little or no relevance to infrequent gamblers because their gambling is very unlikely to be associated with problematic behaviour; but

  • it is most appropriate to administer a problem gambling screen to those respondents whose gambling has a greater likelihood of giving rise to problems.

Indeed, as the NORC [National Opinion Research Center] study (Gerstein et al. 1999) noted:

We chose to use these “filter” questions in the national survey after our pretesting indicated that nongamblers and very infrequent gamblers grew impatient with repeated questions about gambling-related problems (p. 19).

For these reasons, the problem gambling instrument was administered only to that subset of gamblers considered most likely to experience problems related to their gambling – all ‘regular’ gamblers as defined by filter 2 and ‘big spending’ and other non-regular gamblers captured by filter 3.

We would rephrase the last sentence as follows: for these reasons, the GD screen used by Gerstein et al. (1999) was administered only to that subset of gamblers considered most likely on the basis of ex ante theory to experience gambling problems. As we discuss below, best-practice survey design should bring prior theory to bear, but for the purpose of gathering data that contribute to modeling sample exclusions, rather than as a basis for filtering out some information altogether.

The form of the trigger question is raised as an issue by Volberg and Williams (2012, p. 9) as follows:

A final important methodological variation that is known to have a significant impact on problem gambling prevalence rates concerns the threshold for administering problem gambling questions. Engaging in any gambling in the past year is a common criterion used to administer questions about problem gambling. However, Williams and Volberg (2009, 2010) found that this criterion results in too many false positives on problem gambling screening instruments (as assessed by subsequent clinical assessment). These false positives can be significantly reduced by (a) using a higher threshold for the designation of problem gambling (i.e., CPGIFootnote 27 5+ versus CPGI 3+); and/or (b) requiring a minimal frequency of gambling in the past year (i.e., at least 10 times on some format) before administering problem gambling screens; and/or (c) resolving these cases of inconsistent gambling behaviour by automatically asking people to explain the discrepancy between their problem gambling classification in the absence of significant gambling behaviour, or intensive gambling involvement in the absence of reports of problems.

Indeed, Williams and Volberg (2009) conducted a careful evaluation of three survey administration features, and report disturbing effects on inferred GD prevalence:

  • they found that just referring in the introduction to a “gambling survey” rather than a “health and recreation survey” caused a 133% increase in estimated GD prevalence;

  • using face-to-face interviews rather than telephone interviews led to a 55% increase; and

  • using a trigger question with a cutoff of C$300 in annual gambling losses, compared to the trigger of any gambling in the past year, would have implied a 42% decrease.

The conjectured rationale for the first effect is that gamblers like taking gambling surveys, which economists regard as a classic sample selection effect. The second effect is simply demographic, and would be easy to correct with the right sample weights in the population: men respond more to one mode of interview than women, and men gamble much more than women. No explanation for the final effect is offered, although Williams and Volberg (2009, p. 112) note that one of their subjects who was in this category revealed an interesting issue:

There was one individual with a CPGI score of 12 despite not reporting any past year gambling. It is interesting to note that this person reported having a history of problem gambling prior to the past 12 months, which may have influenced his responses to the CPGI past year questions.

This subject, it seems, had gambling under control in the year before the survey, but based on earlier history might be conjectured to still be vulnerable to GD under certain conditions. Such a fact might not be clinically important at point of presentation, but should be relevant to public health forecasting, or to regulatory officials deciding whether to license new gaming facilities.
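The demographic correction mentioned above for the interview-mode effect can be illustrated with a simple post-stratification calculation. The sketch below is only an illustration under stated assumptions: the respondent counts, case counts, and population shares are invented for the example and are not taken from Williams and Volberg (2009) or any other survey, and the function names are hypothetical.

```python
# Post-stratification sketch. All numbers are hypothetical and chosen
# only to show the mechanics of reweighting to known population shares.

def raw_prevalence(sample_counts, sample_cases):
    """Unweighted prevalence: total cases over total respondents."""
    return sum(sample_cases.values()) / sum(sample_counts.values())


def poststratified_prevalence(sample_counts, sample_cases, population_shares):
    """Weight each stratum's within-sample rate by its population share."""
    total = 0.0
    for stratum, share in population_shares.items():
        stratum_rate = sample_cases[stratum] / sample_counts[stratum]
        total += share * stratum_rate
    return total


# Assumed interview mode that over-recruits men (70% of respondents)
# relative to an assumed population share of 50%.
sample_counts = {"men": 700, "women": 300}
sample_cases = {"men": 35, "women": 6}        # 5% and 2% within-stratum rates
population_shares = {"men": 0.5, "women": 0.5}

raw = raw_prevalence(sample_counts, sample_cases)
adjusted = poststratified_prevalence(sample_counts, sample_cases, population_shares)

print(f"raw prevalence:      {raw:.3f}")
print(f"adjusted prevalence: {adjusted:.3f}")
```

In this invented example the unweighted estimate overstates prevalence because the higher-prevalence stratum is over-represented. Reweighting of this kind addresses only selection on observed characteristics; it does not address the selection on unobservables that motivates the corrections discussed in this paper.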

Possible ambiguity of some threshold questions also raises sampling concerns. Blaszczynski et al. (1997) cite evidence suggesting ambiguity in interpreting the question “How much money do you spend on gambling?” Across five case study vignettes considered by their subjects the most popular interpretation was the net amount of money spent in a session. But other subjects interpreted the same vignette in terms of initial stake, turnover, or even just losses, and some gave random responses disconnected from the information. Blaszczynski et al. (1997; p. 249ff.) suggest

that the most relevant estimate of gambling expenditure is net expenditure. […] It is recommended that future prevalence studies provide adequate instructions on how to calculate the net expenditure by drawing subjects’ attention to the difference between amounts invested and the residual at the conclusion of each session. It is suggested that wins reinvested during particular individual sessions should be ignored.

A similar issue was examined by Wood and Williams (2007), who evaluated 12 different ways of asking this question, and concluded (p. 72) that, “In general, retrospective estimates of gambling expenditures appear unreliable.” To be sure, some ways of asking the question elicited more reliable responses, by some sensible metrics. And it does not follow that other forms of detecting a gambling threshold suffer the same ambiguities. For instance, asking if someone has gambled five times in the past year may be easier than asking them to tell you how much they spent on gambling in the past year, or even if they recall losing a certain amount of money in any one day in the past year.

There is widespread recognition of the difficulty of asking “how much money have you lost” questions. Some DGs erase prior losses within a gambling session from cognitive book-keeping as soon as they win; Rachlin (1990, 2000) and Rachlin et al. (2015) argue that this is one of the characteristics that distinguishes DGs from self-controlled gamblers. Concern with this issue led Sharp et al. (2012), in a South African prevalence study, to pose the question as follows:

Thinking about the last time you participated in [ASK FOR EACH GAME EVER PLAYED, FROM A PREVIOUS QUESTION], approximately how much money would you say you staked on that occasion – that is the total amount in rands you put down to bet on that activity during that whole evening or day, not the amount you won and not the amount you ended up with at the end? Please take your time to think carefully about this.

They found that subjects identified as DGs based on their PGSI scores tended to take significantly longer to answer this question than people who reported regular gambling but did not score in the GD range.Footnote 28

Mitigating the Effects of Trigger Questions

How might one mitigate some of the effects on prevalence estimates of survey screens that use trigger questions, whatever the form of the question?

First, if possible one could design surveys that do not naively assume that trigger questions lead to no sample selection bias, and indeed we have done that in Denmark as a result of the concerns identified here (see Harrison et al. 2018). In this study, questions based on two different loss threshold quanta were asked of respondents at the end of the survey that was administered to all participants. This allowed analysis to compare the actual estimation of GD prevalence, across all levels in the hierarchy of risk, with the hypothetical estimates that would have been generated had those who failed to meet one threshold or the other been assigned to a “Non-Gambler” or “No Risk” category due to being excluded from further screening. We recognize that this can only be done if a few specific psychiatric disorders are the focus of the survey, given limitations on time needed for subject responses.
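The comparison described above, between prevalence estimated from a screen given to everyone and the hypothetical prevalence had a loss threshold been used as a trigger, can be sketched as follows. This is a minimal descriptive illustration only: the respondent records, the category labels, and the threshold of 300 are invented for the example, and the sketch does not implement the econometric sample selection correction used in the analysis.

```python
# Counterfactual trigger-question sketch with hypothetical respondents.
# Each record holds a reported annual loss and the risk category that a
# screen administered to ALL respondents would assign.

respondents = [
    {"annual_loss": 0,    "category": "no risk"},
    {"annual_loss": 0,    "category": "no risk"},
    {"annual_loss": 50,   "category": "no risk"},
    {"annual_loss": 100,  "category": "low risk"},
    {"annual_loss": 150,  "category": "moderate risk"},
    {"annual_loss": 400,  "category": "no risk"},
    {"annual_loss": 500,  "category": "low risk"},
    {"annual_loss": 800,  "category": "moderate risk"},
    {"annual_loss": 2000, "category": "problem"},
    {"annual_loss": 5000, "category": "problem"},
]


def apply_trigger(records, loss_threshold):
    """Categories as they would be recorded if respondents below the
    threshold were never screened and were assigned 'no risk'."""
    return [
        r["category"] if r["annual_loss"] >= loss_threshold else "no risk"
        for r in records
    ]


def at_risk_share(categories):
    """Share of respondents in any category other than 'no risk'."""
    return sum(c != "no risk" for c in categories) / len(categories)


full_categories = [r["category"] for r in respondents]
triggered_categories = apply_trigger(respondents, loss_threshold=300)

full_share = at_risk_share(full_categories)
triggered_share = at_risk_share(triggered_categories)

print(f"at risk, everyone screened:       {full_share:.2f}")
print(f"at risk, trigger at loss >= 300:  {triggered_share:.2f}")
```

The gap between the two shares in this toy data arises entirely from respondents who fall below the threshold yet would have scored as at risk had they been screened, which is the group that a trigger question removes from view.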

Several surveys have come close to this ideal, by employing extremely “light” trigger questions that only exclude from the sample those that have never engaged in any gambling over some period, including the mere purchase of a lottery ticket. One example is the British Gambling Prevalence Survey (BGPS) of 2010, which asked PGSI and DSM-IV questions for 73% of their entire sample of 7756: see Wardle et al. (2011). Figure 6 shows that although there are differences in prevalence when correcting for sample selection bias using the DSM-IV-based screen, they are not statistically significant or even quantitatively significant for policy purposes.Footnote 29 The analysis of the BGPS also demonstrates that the sample selection correction does not always increase the fraction of the population predicted to be at risk: in this case that fraction drops from 5.0 to 4.8% with correction.

Fig. 6 Predicted prevalence of gambling disorders in the U.K., measured with the DSM-IV screen, with and without sample selection correction. Estimated probabilities using semi-nonparametric ordered response model. Source: British Gambling Prevalence Survey of 2010. Fraction at any risk level changes from 0.050 to 0.048 with correction.

Another example of the value of asking an extremely light trigger question is the first wave of the Victorian Gambling Survey (VGS) of 2008, which asked PGSI questions for 75% of their entire sample of 15,000: see Billi et al. (2014, 2015) and Stone et al. (2015).Footnote 30 Again, Fig. 7 shows that although there are economically significant differences in prevalence overall when correcting for sample selection bias using the PGSI, the differences for the most important categories of Moderately Indicated and Pathological GamblerFootnote 31 are not statistically significantFootnote 32 or quantitatively significant for policy purposes. Of course, if the policy objective is to identify demographic slices that might be at risk, one would need to go beneath these aggregate population prevalence estimates to know if there is a sample selection bias.

Fig. 7 Predicted prevalence of gambling disorders in Victoria (Australia), measured with the PGSI screen, with and without sample selection correction. Estimated probabilities using semi-nonparametric ordered response model. Source: Wave 1 of the Victorian Gambling Survey of 2008. Fraction at any risk level changes from 0.088 to 0.116 with correction.

On the other hand, the same VGS illustrates the risks of using additional threshold questions in order to reduce respondent time during the interview.Footnote 33 Starting with the 75% of the sample that had gambled at all in the last 12 months, these surveyors added 1057 individuals who had gambled before then, to arrive at a lifetime sample of gamblers of 81%. But they then employed an additional pre-screening procedure when applying the National Opinion Research Center DSM (NODS) screen of Gerstein et al. (1999) to measure lifetime gambling prevalence. This procedure asks 5 questions about gambling behavior, and only follows up with the additional questions of the full NODS instrument if someone responds affirmatively to one of those 5 pre-screening questions. This procedure drops the VGS sample by 11,075, so that we end up with only 8.5% of the full sample being evaluated with the NODS instrument. Unfortunately, this procedure is not statistically innocent: as Fig. 8 shows, it leads to a statistically and quantitatively significant sample selection bias when inferring lifetime prevalence.Footnote 34 The fraction of the Victorian adult population that is indicated as being at risk of GD jumps from 5.6 to 10.5% after correcting for sample selection bias, and the fraction of DGs increases from 0.9 to 2.3%.

Fig. 8 Predicted prevalence of gambling disorders in Victoria (Australia), measured with the NODS screen, with and without sample selection correction. Estimated probabilities using semi-nonparametric ordered response model. Source: Wave 1 of the Victorian Gambling Survey of 2008. Fraction at any risk level changes from 0.056 to 0.105 with correction.

Second, one can design surveys that use criteria for GD that do not rely on historical gambling experience to measure whether someone is at risk, as is the focus of virtually every trigger question we find in the literature. To the extent that one has a theoretically motivated structural model of the nexus of causal factors for GD, some of which may be present in the absence of gambling opportunities in a person’s environment, one can gain information about a respondent’s risk for developing GD that is independent of any historical gambling behavior meeting a threshold of frequency or financial loss. Note that in such modeling, the sense of “risk” is prospective, in contrast to the current risk of misdiagnosis that is operationalized in the PGSI and in DSM-based screens. The Focal Adult Gambling Screen (FLAGS) designed by Schellinck et al. (2015a, b) is an instance of such a screen, which has been used by industry analysts around the world, but had not been deployed in any prevalence study prior to its use by Harrison et al. (2018) in Denmark. The value of prospective risk forecasting for policy around gambling facility licensing should be obvious, and this policy goal naturally complements the point we are making about using trigger questions that are not reliant on past gambling experience. When considering jurisdictions that have had bans on certain forms of gambling, or where the transactions costs of engaging in gambling have changed, it is quite possible that someone exhibits traits that would lead them to be at risk of gambling problems in different circumstances than they have experienced.Footnote 35 Mitigation of sample selection bias and enhanced policy guidance might thus be achieved by the same research design strategy.

Third, where there is a need for some sort of trigger question or questions to avoid taking too much time in surveys, one can build in random treatments to make it easier to identify sample selection bias. These treatments would be conditions that affect the likelihood of someone participating in a full survey, or engaging in deception due to sensitivity around a question. An example of the former would be financial incentives for participating in surveys, of the kind employed in some surveys and experiments.Footnote 36 An example of the latter would be any one of a myriad of survey techniques for using “randomized response” methods to ensure that subjects are not revealing with certainty some sensitive information in response to a question.Footnote 37
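To make the “randomized response” idea concrete, the sketch below uses one classic design, Warner's, chosen here purely for illustration and not because any survey discussed in this paper used it. A private randomizing device directs each respondent, with probability p, to answer the sensitive statement (“I have the trait”) and otherwise its complement (“I do not have the trait”), so that no single answer reveals the respondent's status. If π is the true prevalence, the expected share of “yes” answers is λ = pπ + (1 − p)(1 − π), which gives the estimator π̂ = (λ − (1 − p)) / (2p − 1) for p ≠ 0.5. The function names, the prevalence of 0.10, and p = 0.7 are assumptions made for the example.

```python
import random


def warner_estimate(yes_share, p_sensitive):
    """Estimate prevalence from the observed share of 'yes' answers under
    Warner's randomized response design (requires p_sensitive != 0.5)."""
    if abs(2 * p_sensitive - 1) < 1e-12:
        raise ValueError("p_sensitive must differ from 0.5")
    return (yes_share - (1 - p_sensitive)) / (2 * p_sensitive - 1)


def simulate_survey(true_prevalence, p_sensitive, n, seed=0):
    """Simulate n truthful respondents and return the Warner estimate."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        has_trait = rng.random() < true_prevalence
        asked_sensitive = rng.random() < p_sensitive
        # Truthful answer to whichever statement the device selected.
        answer_yes = has_trait if asked_sensitive else not has_trait
        yes += answer_yes
    return warner_estimate(yes / n, p_sensitive)


# Hypothetical design: true prevalence 0.10, device selects the
# sensitive statement 70% of the time.
estimate = simulate_survey(true_prevalence=0.10, p_sensitive=0.7, n=100_000, seed=1)
print(f"estimated prevalence: {estimate:.3f}")
```

The protection offered to respondents comes at a cost in precision: the closer p is to 0.5, the less any answer reveals and the noisier the estimate becomes, and in small samples the estimate can fall outside the unit interval.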

Fourth, one could recruit subjects from an Administrative Registry, so that one can again better control for sample selection biases by knowing characteristics of all of those recruited, whether or not they agree to participate. This is not a general option, since few non-Scandinavian countries have general registries, although recruiting from a Census may suffice if access to characteristics of the individuals recruited is possible.

Conclusions

Measurement of the population prevalence of the risk of gambling problems, and of the psychiatric disorders with which they are correlated, plays a critical role in public health policy and in policy around the licensing of casinos and other gambling facilities. It should make a substantive difference to policy assessment and forecasts of consequence whether the fraction “at risk” of being current DGs is 1.3% or 7.7%, and that is the pure effect of allowing for sample selection bias in the application of a major, widely-cited, and conventional survey of the U.S. population. In fact, one conjecture as to why investigation of GD prevalence was dropped from follow-up waves of the NESARC in the United States and from the Mental Health module of the Canadian Community Health Survey is that the uncorrected population prevalence was too low to justify resources and interview time asking the questions. Hence it becomes “settled” belief that the prevalence of GD is tiny, and no data are ever then collected to question that belief. It would represent a shameful failure of linkage between best research practice and best policy design if these statistical biases in measurement of population prevalence drove substantive decisions concerning the regulation of gambling or the allocation of resources toward the treatment of GD.

One immediate substantive implication of our findings is to ask if comparable biases distort inferences about other psychiatric disorders, since prevalence surveys for every major psychiatric disorder typically use comparable trigger questions.Footnote 38 Although it is conceivable that the distortion could be in any direction, our a priori expectation would be that the distortions lead to understatements of prevalence across the board, given the sensitive nature of the trigger questions.