Bayesian theories of cognition offer a unified formal framework for cognitive science (Tenenbaum et al., 2011) that has had remarkable explanatory successes across domains, including in perception (e.g. Kersten et al., 2004), memory (e.g. Anderson, 1991), language (e.g. Xu & Tenenbaum, 2007), and reasoning (e.g. Lu et al., 2012). At the heart of the Bayesian project is the idea that cognition is fundamentally probabilistic: that people reason according to subjective degrees of belief which follow the laws of probability and, in particular, that they are revised in light of evidence according to Bayes’ Rule. It is somewhat embarrassing then, that these theories have often been accused of failing to describe human “beliefs” of the simple and everyday sort, such as beliefs like “it will rain tomorrow”, “vaccines are safe”, or “this politician is trustworthy” (Chater et al., 2020).

Trouble starts as soon as we attempt to measure beliefs. According to Bayesian theories of cognition and epistemology (Jaynes, 2003), the degree to which people believe in various propositions, or their credences, should reflect subjective mental probabilities. So, asking people to express beliefs in terms of probability seems only natural.

Unfortunately, people’s explicit probability judgments routinely violate the most basic axioms of probability theory. For example, human probability judgments often exhibit the “conjunction fallacy”: people will often judge the conjunction of two events (e.g. “Tom Brady likes football and miniature horses”) as being more probable than one of the events in isolation (e.g. “Tom Brady likes miniature horses”), a plain and flagrant violation of probability theory (Tversky & Kahneman, 1983). Other demonstrations of the incoherence of probability judgments include disjunction fallacies, subadditivity or “unpacking” effects (Tversky & Koehler, 1994), and a number of others (for an accessible review, see Kahneman, 2013). Altogether, these findings have led many researchers to abandon the notion that degrees of belief are represented as probabilities.

Recently, however, two groups of researchers have proposed theories of human probability judgments that account for biases in these judgments while maintaining that mental credences are fundamentally probabilistic (Costello & Watts, 2014; Zhu et al., 2020). Both of these theories build on the increasingly popular notion that a variety of human reasoning tasks are accomplished by a limited process of mental “sampling” from a probabilistic mental model (see also Chater et al., 2020; Dasgupta et al., 2017) (Footnote 1).

Two Probabilistic Theories of Probability Judgment

Costello & Watts (2014, 2016, 2018) have proposed a theory of probability judgment they call the “Probability Theory plus Noise” theory (PT+N). In the PT+N model, mental “samples” are drawn from a probabilistic mental model of events and are then “read” with noise, so that with some probability d positive examples will be read as negative and negative examples will be read as positive. The end products are probability judgments reflecting probabilistic credences perturbed by noise. In their model, the probability that a mental sample for an event A is read as A is the probability that the sample truly is A, P(A), and that it is correctly read (1 − d), plus the probability that the sample is not A, 1 − P(A), and that it is incorrectly read (d), or:

$$ \begin{array}{@{}rcl@{}} P(\text{read as A}) &=& (1-d)P(A) + d(1-P(A))\\ &=& (1-2d)P(A) + d \end{array} $$
(1)

Thus under the simplest form of the PT+N model, the expected value of probability judgments is:

$$ E[\hat{P}_{PT+N}(A)] = (1-2d)P(A) + d $$
(2)

By assumption, a maximum of 50% of samples can be misread on average, so d is a number in the range [0,1/2]. The overall consequence of the sample-reading noise is to shrink probability estimates toward .50 in proportion to d. The PT+N theory provides a unified account of a wide variety of biases in probability judgment that were previously attributed to different types of heuristics, as well as novel biases identified based on the model’s predictions (Costello & Watts, 2014, 2016, 2017, 2018). For example, the PT+N theory offers an explanation for many instances of “conservatism” (Costello & Watts, 2014)—people’s tendency to shy away from extreme probability judgments near 0 and 1, even when strong evidence warrants such judgments (e.g. Edwards, 1968; Erev et al., 1994).
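To make the noisy-reading process concrete, the following minimal simulation (a sketch, not the authors' code; the parameter values are arbitrary) draws mental samples, flips each read with probability d, and recovers the expected value in Eq. 2:

```python
import numpy as np

rng = np.random.default_rng(0)

def ptn_judgment(p_true, d, n_samples, rng):
    """Simulate one PT+N probability judgment: draw mental samples,
    flip each read with probability d, report the proportion read as positive."""
    samples = rng.random(n_samples) < p_true   # true sample outcomes
    flips = rng.random(n_samples) < d          # which reads are corrupted
    read = np.where(flips, ~samples, samples)  # noisy readout
    return read.mean()

p_true, d, n = 0.9, 0.2, 10
judgments = [ptn_judgment(p_true, d, n, rng) for _ in range(20_000)]
print(np.mean(judgments))   # ~ (1 - 2*d)*p_true + d = 0.74
```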

Meanwhile, Zhu et al. (2020) have proposed a Bayesian model of probability judgment they call the “Bayesian Sampler”. Under this model, probability judgment is itself seen as a process of Bayesian inference. To judge the probability of an event, a limited number of samples are again drawn from a mental model of the event. Then, those “observed” samples are integrated with a prior over probabilities to produce a probability judgment. This prior takes the form of a symmetric Beta distribution, Beta(β,β). After observing S(A) successes and N − S(A) failures, the posterior over probabilities is distributed Beta(β + S(A), β + N − S(A)). Zhu et al. (2020) assume that people report the mean of this posterior distribution. For any Beta distribution \(x \sim Beta(a,b)\), \(E[x] = \frac {a}{a+b}\). So, the probability estimate is a linear function of S(A), N, and β:

$$ \hat{P}_{BS}(A) = \frac{S(A)}{N+2\beta} + \frac{\beta}{N+2\beta} $$
(3)

The expected value of the estimate can then be written in terms of the expected number of successes, or P(A) ⋅ N. Under the simplest version of the Bayesian Sampler model, this gives the following formula:

$$ E[\hat{P}_{BS}(A)] = \frac{N}{N+2\beta}P(A) + \frac{\beta}{N+2\beta} $$
(4)

Like the PT+N model, the Bayesian Sampler model accounts for a wide array of biases in probability judgments, including the novel biases identified by Costello and Watts (Costello and Watts, 2014, 2016). In fact, important equivalencies can be drawn between the two models. Zhu et al. (2020) show that the N and β parameters of their model can be related to the d parameter of the PT+N model via the following bridging formula:

$$ d = \frac{\beta}{N+2\beta} $$
(5)

Thus, in many cases the effect of a Bayesian prior is identical to the effect of noise in the PT+N model (at least in expectation). A caveat is that the Bayesian Sampler theory restricts the parameterization of the equivalent d parameter compared to the PT+N model. Whereas PT+N assumes d ∈ [0,1/2], the Bayesian Sampler theory assumes an uninformative prior parameter β ∈ [0,1], which in turn restricts the equivalent d to the range [0,1/3]. But beyond this subtle difference of parameterization, there is a larger difference in interpretation: rather than merely perturbing people’s probability judgments, this prior can be seen as regularizing these judgments away from extreme values. Zhu et al. (2020) argue that such regularization can be adaptive in cases where only a small number of mental samples can be drawn. For instance, consider someone estimating the probability that they can swim across a lake, outrun an animal, or win a hand of poker: if a mental simulation of these events produces two samples indicating success, one might conclude these are all certain victories and thereby be too willing to assume risk. A regularizing prior pushes these estimates away from extremes, thereby promoting better decision-making when mental samples are sparse. However, this hedging comes at the cost of systematic incoherence and biases.
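The following sketch (again with arbitrary, illustrative parameter values) simulates the Bayesian Sampler's judgment process and checks numerically that its expectation matches the PT+N form under the bridging condition in Eq. 5:

```python
import numpy as np

rng = np.random.default_rng(1)

def bayesian_sampler_judgment(p_true, n_samples, beta, rng):
    """Draw mental samples, then report the mean of a
    Beta(beta + successes, beta + failures) posterior."""
    successes = rng.binomial(n_samples, p_true)
    return (successes + beta) / (n_samples + 2 * beta)

p_true, n, beta = 0.9, 10, 1.0
judgments = [bayesian_sampler_judgment(p_true, n, beta, rng) for _ in range(20_000)]

d_equiv = beta / (n + 2 * beta)               # bridging condition (Eq. 5)
print(np.mean(judgments))                     # ~ N/(N+2b)*p + b/(N+2b) = 0.833
print((1 - 2 * d_equiv) * p_true + d_equiv)   # same value via the PT+N form
```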

Differentiating Between the Models

The models’ predictions can be distinguished on two levels: First, the models have distinct accounts of conditional probability judgments that make different predictions in terms of expected values. Second, the models present different process-level accounts of probability judgment that entail different predictions about the shape of the distribution of responses across trials.

Different Accounts of Conditional Probability Judgments

By explaining the incoherence of human probability judgments using coherent mental probabilities, both models have the potential to rescue the larger project of Bayesian cognitive science as applied to everyday beliefs (Chater et al., 2020). However, the two models diverge substantially in their treatment of conditional probability judgments. Bayesian cognitive theories are fundamentally theories of inductive reasoning: Bayes’ rule describes how existing beliefs should be updated conditional on the observation of different kinds of evidence. So, treatment of the conditioning of beliefs is at the heart of these theories.

According to the Bayesian sampler model, conditioning is something that happens in the mental model of the events, not as part of the process of rendering probability judgments. By not assigning any special status to conditional probability judgments, the Bayesian Sampler theory fits neatly into the larger project of Bayesian cognitive science: probability judgments are simply another judgment process applied to the outputs of other (ideally Bayesian) mental models (Chater et al., 2020).

In contrast, the PT+N model presents a constructive account of conditional probability judgments that is fundamentally non-Bayesian (Costello and Watts, 2016). According to the PT+N model, conditional probabilities P(A|B) are estimated by a two-stage sampling procedure: first both events A and B are sampled with noise, and then a second noisy process computes the ratio of the events read as A and B over events read as B. Schematically, the estimated probability can be written as:

$$ \begin{array}{@{}rcl@{}} P_{e}(A|B) &=& P(\textit{read as A}| \textit{read as B}) \\ &=& P(\textit{read as A}|B)P(B|\textit{read as B}) \\&&+ P(\textit{read as A}| \neg B)P(\neg B|\textit{read as B}) \end{array} $$
(6)

Substituting terms according to the PT+N model and simplifying yields the following equation for conditional probability estimates:

$$ P_{e}(A|B) = \frac{(1-2d)^{2}P(A \land B) + d(1-2d)\big(P(A)+P(B)\big)+d^{2}}{(1-2d)P(B)+d} $$
(7)
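To see how the accounts diverge, the sketch below computes the expected conditional estimate under the PT+N two-stage formula (Eq. 7) and under a Bayesian-Sampler-style account in which conditioning happens in the mental model and only the resulting judgment is shrunk. The helper functions and probability values are illustrative, not taken from the paper:

```python
def ptn_conditional(p_ab, p_a, p_b, d):
    """Expected conditional estimate under the PT+N two-stage account (Eq. 7)."""
    num = (1 - 2*d)**2 * p_ab + d*(1 - 2*d)*(p_a + p_b) + d**2
    return num / ((1 - 2*d)*p_b + d)

def bs_conditional(p_ab, p_b, n, beta):
    """Expected conditional estimate if conditioning happens in the mental
    model and only the judgment is shrunk (Bayesian Sampler account)."""
    return n/(n + 2*beta) * (p_ab / p_b) + beta/(n + 2*beta)

# Illustrative values; with the bridging d = beta/(n + 2*beta) the two
# accounts agree for simple probabilities but diverge for conditionals.
p_a, p_b, p_ab = 0.4, 0.5, 0.3
n, beta = 10, 1.0
d = beta / (n + 2*beta)
print(ptn_conditional(p_ab, p_a, p_b, d))   # ~ 0.556
print(bs_conditional(p_ab, p_b, n, beta))   # ~ 0.583
```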

This non-Bayesian account of conditional probability judgments separates the PT+N theory quite fundamentally from the Bayesian Sampler and the larger project of Bayesian cognitive science.

Different Process-Level Accounts and Predicted Response Distributions

Although the models’ predictions for unconditional probability judgments are identical in expectation (as seen via the bridging condition), the models posit different psychological processes underlying those judgments: sample-reading noise in the PT+N model and Bayesian inference in the Bayesian Sampler model. These process-level differences imply different predictions about the distributions of people’s judgments.

The models make qualitatively different distributional predictions on two fronts. First, the Bayesian Sampler predicts a clear relationship between the degree to which responses are shrunk toward .50 and the trial-by-trial variability in those responses. In both models, the amount of variability in trial-level responses is related to the number of mental samples drawn, N. In the Bayesian Sampler model, assuming β is relatively small, N should also help to determine the degree to which responses are shrunk toward .50. In contrast, in the PT+N model the variance across responses and degree of shrinkage are reasonably considered to be independent. Second, because the Bayesian Sampler describes a process of adjustment after the sampling process, in which people report the mean of their mental posterior over probabilities, the model also predicts a truncation of the response distribution in proportion to β and N (Chater et al., 2020; Sundh et al., 2021). That is, even when zero positive or negative samples are drawn, the mean of the posterior is drawn away from extreme responses of zero and one.
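A small Monte Carlo sketch (illustrative parameter values; not the authors' code) makes the contrast concrete: under the Bayesian Sampler, reducing N increases both shrinkage and trial-to-trial variability, whereas under PT+N the shrinkage is fixed by d regardless of N:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(model, p_true, n, d=None, beta=None, reps=50_000):
    """Monte Carlo mean and SD of trial-level judgments under each account."""
    if model == "ptn":                    # noisy reading of n samples
        q = (1 - 2*d)*p_true + d
        est = rng.binomial(n, q, reps) / n
    else:                                 # Bayesian adjustment of n samples
        s = rng.binomial(n, p_true, reps)
        est = (s + beta) / (n + 2*beta)
    return est.mean(), est.std()

p = 0.9
# Bayesian Sampler: fewer samples => more shrinkage AND more variability
print(simulate("bs", p, n=4, beta=1))
print(simulate("bs", p, n=20, beta=1))
# PT+N: shrinkage (d) stays fixed while variability (via n) changes
print(simulate("ptn", p, n=4, d=0.2))
print(simulate("ptn", p, n=20, d=0.2))
```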

Modeling the distributions of raw responses holds clear promise for disentangling the models. However, there are at least three challenges to directly modeling raw human response data. First, both models are, strictly speaking, discrete and so make a limited set of discrete predictions while assigning zero probability to responses outside that set. Second, and similarly, the truncation in the Bayesian Sampler model also assigns zero probability to responses beyond the truncated range. And third, from a cursory glance it is clear that a majority of human responses are rounded to some unknown degree, with most seemingly rounded to the nearest 5 or 10%. Given the combination of these factors, both models are likely to receive a posterior probability of zero if fit directly to raw human data. I return to these challenges and my approaches to addressing them in the results.

Prior Comparisons of the Models

Comparison of Participant-Level Query-Averaged Responses

Zhu et al. (2020) compared their Bayesian Sampler model against Costello and Watts’ (2014, 2016, 2017, 2018) PT+N model as explanations for human probability judgments in two experiments. Unfortunately, their results were somewhat equivocal.

Zhu et al. (2020) measured participants’ judgments for each query (e.g. “what is the probability that it will be rainy”) on three repeated trials. Their primary quantitative analysis fit the models separately to participants’ average response to each query (averaged over three trials). These analyses compare human responses to the models’ predictions in expectation. After fitting, Bayesian Information Criterion (BIC) values were computed for each participant, which were then used to approximate the posterior probability of each model for each participant, assuming a uniform prior. The researchers found that a preponderance of participants’ responses were best captured by the Bayesian Sampler model. However, a substantial number of participants were instead better fit by the PT+N model.
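The standard BIC-based approximation underlying such comparisons can be sketched as follows (this is the textbook formula under a uniform model prior, not necessarily the authors' exact computation; the BIC values shown are hypothetical):

```python
import numpy as np

def bic_model_posteriors(bics):
    """Approximate posterior model probabilities from BIC values, assuming a
    uniform prior over models: p(M_k | data) is proportional to exp(-BIC_k / 2)."""
    bics = np.asarray(bics, dtype=float)
    log_w = -0.5 * (bics - bics.min())   # subtract the minimum for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

# Hypothetical per-participant BICs for [Bayesian Sampler, PT+N]
print(bic_model_posteriors([112.3, 118.9]))   # ~ [0.96, 0.04]
```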

Given that these models propose quite basic psychological processes, we might expect the same process to be shared across all people. However, the authors do not report the overall posterior probability of each model if one model is assumed to explain all participants’ responses. Such a comparison with these methods would likely be limited in a few ways. First, as the authors note (Zhu et al., 2020), BIC cannot fully account for the differences in the competing models’ complexity (see also Piantadosi, 2018). Further, their “unpooled” analysis likely exaggerates the complexity of the models overall and may therefore affect comparisons between them. In contrast, hierarchical models with partial pooling offer a middle ground between ignoring individual variation and allowing all parameters to vary freely, accounting for heterogeneity without over-penalizing models when heterogeneity is low.

Comparison of Distributional Model Predictions

Rather than computing query-level averages across trials for each participant, examining the models’ distributional predictions requires modeling participants’ raw trial-by-trial responses. As mentioned above, this presents substantial challenges. To address these, Zhu et al. (2020) estimated discrete versions of the Bayesian Sampler and PT+N models by minimizing the Wasserstein distance between participants’ raw responses and model predictions. The use of Wasserstein distance rather than a proper likelihood-based measure of model fit helped to minimize issues created by rounding and out-of-support responses, which could otherwise lead both models to assign probability zero to many observations.
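A minimal sketch of this kind of fit criterion uses SciPy's one-dimensional Wasserstein distance between observed responses and model-simulated responses; the observed values and parameters here are hypothetical, and the actual fitting procedure in Zhu et al. (2020) may differ in its details:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)

def ptn_simulated_responses(p_true, d, n, reps, rng):
    """Simulate discrete PT+N responses for one query."""
    q = (1 - 2*d)*p_true + d
    return rng.binomial(n, q, reps) / n

# Hypothetical observed responses for one query (on a 0-1 scale)
observed = np.array([0.7, 0.75, 0.8, 0.7, 0.9, 0.65])
simulated = ptn_simulated_responses(p_true=0.8, d=0.1, n=10, reps=5_000, rng=rng)

# Distance between the empirical and model-implied response distributions;
# fitting would minimize this quantity over the model's parameters.
print(wasserstein_distance(observed, simulated))
```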

Still, the results of this analysis were largely inconclusive with respect to differentiating the models (Zhu et al., 2020). Specifically, the quality of the fit for each model depended heavily on the maximum number of samples assumed to be possible. For small numbers of samples, the Bayesian Sampler model was clearly superior, but for larger numbers of samples, the PT+N model was found to better fit the data. Presumably, this is because with small numbers of samples the PT+N model is extremely constrained in the distinct discrete responses it can predict. For small N, both models predict that only a limited set of distinct values are possible. However, whereas the size of that set is the same for each model, in the Bayesian Sampler the continuous β parameter can shift exactly what those discrete values are, providing it much greater flexibility. Yet, this additional flexibility goes unpenalized in comparisons based on Wasserstein distance.

In later work, Sundh et al. (2021) examined the distributional properties of participants’ responses using indirect means, by regressing the variance of participants’ responses across trials on their mean response for each query type. Their findings suggest that earlier fits with Wasserstein distance may have produced biased results (Sundh et al., 2021). They reported evidence for the truncation of responses and a correlation between variance and shrinkage parameter estimates across participants. However, their analysis did not enforce that the underlying probabilities driving participants’ judgments be coherent (simply estimating the true probability as the mean across trials), nor did they evaluate how frequently participants gave out-of-support judgments that would be inconsistent with the Bayesian Sampler theory.

The Present Work

Here, I cast both the Bayesian Sampler and PT+N models into a Bayesian data analysis framework that may permit a more decisive comparison. Two sets of analyses compared different aspects of the models’ predictions: First, the models were examined based on their predictions in expectation. This analysis allows for a test of their different accounts of conditional probability judgments and their parameterizations (i.e. the restriction of d to [0,1/2] versus [0,1/3]). To preview the results, these analyses revealed the Bayesian Sampler’s account of probability judgments to be superior, but could not distinguish between the different psychological processes proposed by the two models. A second set of analyses examined the models’ distributional predictions to test their process-level accounts of probability judgments. As will be described, modeling response distributions directly presents a number of challenges, and so this second set of analyses required some minor additions and modifications to the models to permit their fitting to experimental data.

Both sets of analyses are supported by the use of a Bayesian framework. First, Bayesian analyses allow issues of model complexity to be addressed through comparisons of model fit based on modern information criteria, such as Pareto-smoothed importance sampling approximate leave-one-out cross-validation (PSIS-LOO; Gelman et al., 2014; Vehtari et al., 2017) (Footnote 2). Second, the Bayesian framework supports straightforward implementation of hierarchical versions of these models, allowing information about model parameters to be shared across participants and resulting in potential improvements to out-of-sample prediction, reductions in model complexity, and a more realistic test of the models. Finally, a Bayesian framework also supports new extensions of these models to directly model participants’ trial-level responses while accounting for rounding and out-of-distribution response errors. These extensions allow for principled probabilistic tests of the distributional predictions of the models.

Methods

Data Selection

Zhu et al. (2020) conducted two experiments to compare the PT+N and Bayesian Sampler theories. These experiments asked participants to judge the probability of different events in various combinations. Following prior work by Costello and Watts (e.g. 2016, 2018), both experiments focused on the everyday events of different kinds of weather.

Experiment 1 asked about the events [icy, frosty] and [normal, typical] (e.g. “what is the probability that the weather in England is normal and not typical?”). The authors’ goal was to ask about highly correlated events, but the events used are perhaps nearly perfectly correlated. Because the terms used to describe these events are nearly synonymous, there is a concern about the interpretation of the statements evaluated in this experiment. This is especially clear, as the authors note, for disjunctive query trials such as “normal or typical,” where “or typical” might not be read as a disjunction but rather an elaborative clause. In light of these concerns, I excluded the disjunctive trials from Experiment 1 from my analyses.

Experiment 2 focused on more moderately correlated events, [cold, rainy] and [windy, cloudy], that do not admit these misinterpretations. In addition, a third experimental condition asking about [warm, snowy] was also included in the experiment, but was dropped from the analyses reported in the paper. Exploring the raw responses from this condition reveals a substantial fraction of “zero” and “one” responses for certain trials. This may reflect a different response process than was intended. For instance, some participants may have engaged in deductive reasoning to judge that it is not possible for the weather to at once be warm and snowy, and therefore responded with zero—failing to properly consider that it is possible (at least logically) for it to be warm and snowy at different times within the same day. Given these potentially aberrant responses, I followed Zhu et al. (2020) in ignoring data from this condition.

Modeling Results: Participant-Level Query-Averaged Responses

This first set of analyses compares the models’ ability to capture participants’ probability judgments in expectation, averaged over the three blocks in which they made judgments about each query. These analyses test the first points of differentiation between the models: their different predictions with respect to conditional probability judgments and their specific parameterizations.

I implemented several variants of the Bayesian Sampler and PT+N models in a Bayesian framework. These models were implemented in the probabilistic programming language NumPyro. All code and results are available as supplemental materials (https://github.com/derekpowell/bayesian-sampler).

Bayesian Implementation of Participant-Level Query-Averaged Response Models

The PT+N model defines expected probability judgments (Pe) as:

$$ \begin{array}{@{}rcl@{}} P_{e}(A) &=& (1-2d)P(A) + d \\ P_{e}(A\land B) &=& (1-2d^{\prime})P(A \land B)+d^{\prime} \\ P_{e}(A\lor B) &=& (1-2d^{\prime})P(A \lor B)+d^{\prime} \\ P_{e}(A|B) &=& \frac{(1-2d)^{2}P(A \land B) + d(1-2d)\big(P(A)+P(B)\big)+d^{2}}{(1-2d)P(B)+d}\\ \end{array} $$
(8)

In contrast, the Bayesian Sampler model defines expected probability judgments as:

$$ \begin{array}{@{}rcl@{}} P_{e}(A) &=& \frac{N}{N + 2 \beta}P(A) + \frac{\beta}{N+2 \beta} \\ P_{e}(A \land B) &=& \frac{N^{\prime}}{N^{\prime} + 2 \beta}P(A \land B) + \frac{\beta}{N^{\prime}+2 \beta} \\ P_{e}(A \lor B) &=& \frac{N^{\prime}}{N^{\prime} + 2 \beta}P(A \lor B) + \frac{\beta}{N^{\prime}+2 \beta} \\ P_{e}(A|B) &=& \frac{N}{N + 2 \beta}P(A|B) + \frac{\beta}{N+2 \beta} \end{array} $$
(9)

Fixing d and \(d^{\prime }\) or N and \(N^{\prime }\) equal yields the “simple” variant of each of the models, which treat conjunctive and disjunctive probability judgments identically to simple probability judgments.

Notice that for each model the probability judgments depend on underlying subjective probabilities, derived from a mental sampling process. These subjective probabilities are unobserved, and must be estimated as latent variables. Here, they are represented with a four-dimensional Dirichlet distribution for each subject, \(\vec {\theta }\), representing the probabilities of the elementary events (A ∧ B, ¬A ∧ B, A ∧ ¬B, ¬A ∧ ¬B).
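For concreteness, the sketch below shows how each query's subjective probability can be derived from a single coherent \(\vec{\theta}\); the ordering of the elementary events and the numerical values are illustrative:

```python
import numpy as np

# theta holds subjective probabilities of the elementary events,
# in the order given in the text: (A∧B, ¬A∧B, A∧¬B, ¬A∧¬B)
theta = np.array([0.30, 0.20, 0.10, 0.40])

def query_prob(theta, query):
    """Derive the subjective probability for each query type from theta."""
    p_ab, p_notab, p_anotb, p_notanotb = theta
    p_a = p_ab + p_anotb
    p_b = p_ab + p_notab
    if query == "A":         return p_a
    if query == "B":         return p_b
    if query == "A and B":   return p_ab
    if query == "A or B":    return p_a + p_b - p_ab
    if query == "A given B": return p_ab / p_b
    raise ValueError(query)

for q in ["A", "B", "A and B", "A or B", "A given B"]:
    print(q, query_prob(theta, q))
```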

Zhu et al. (2020) implement completely unpooled models with separate d, \(d^{\prime }\), N, \(N^{\prime }\), and β parameters for each participant. Although hierarchical models with partial pooling might be expected to better account for the data and offer a better test of the models, for consistency and comparison with Zhu et al.’s (2020) analyses, I first estimated implementations of these unpooled models. Figure 1 displays the translation of the PT+N model into the Bayesian framework, along with a plate diagram representing the dependencies among parameters.

Fig. 1 Complex unpooled PT+N model diagram and formula specifications. Circular nodes are parameters, shaded nodes are observations, and squared nodes are deterministic functions of parameters. Plates signify values defined for i trials, j participants, and k conditions

The function fPT+N computes the expected probability estimate using the underlying subjective probability computed from \(\vec {\theta }\) and the query, the noise parameters d and \(d^{\prime }\), and the relevant equation as defined by the PT+N theory (see supplemental materials for implementation details). Prior predictive checks were conducted for all models to select priors that would be uninformative or minimally informative on the scale of the model parameters d and \(d^{\prime }\) (Footnote 3).

Recall that Zhu et al. (2020) identified a bridging condition relating β and N in the Bayesian Sampler model to the d parameter of the PT+N model. To support direct comparisons of the models, I parameterize the Bayesian Sampler model according to the implied d and \(d^{\prime }\), rather than directly according to its β, N, and \(N^{\prime }\) parameters (Footnote 4). I constrain d to [0,1/3] for the Bayesian Sampler model to reflect the assumption that β ∈ [0,1]. This allows the same priors to be used for the corresponding Bayesian Sampler and PT+N models, simplifying their comparison.

The Bayesian Sampler model is therefore identical to the PT+N model save for the changes to μijk, d, and \(d^{\prime }\) shown below:

$$ \begin{array}{@{}rcl@{}} \mu_{ijk} &=& f_{BS}(\overrightarrow{\theta_{jk}}, x_{ijk}, d_{j}, d^{\prime}_{j}) \\ d_{j} &=& \frac{1}{3} \ \text{logistic}(\delta_{j}) \\ d_{j}^{\prime} &=& \frac{1}{3} \ \text{logistic}\big(\delta_{j} + \exp({\Delta}\delta_{j})\big) \end{array} $$
(10)

where the function fBS computes the expected probability estimate as prescribed by the Bayesian Sampler theory.
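A schematic NumPyro sketch of this parameterization for a single subject is shown below. It assumes a simple Normal observation model purely for illustration; the actual likelihood and the full set of queries follow the specifications in Figs. 1 and 2 and the supplemental code:

```python
from jax.scipy.special import expit
import numpyro
import numpyro.distributions as dist

def bs_subject_model(p_query, response=None):
    """Schematic single-subject sketch of the implied-d parameterization (Eq. 10).
    p_query: array of latent subjective probabilities for each query, treated
    here as known for brevity (in the full model they come from a latent
    Dirichlet theta)."""
    delta = numpyro.sample("delta", dist.Normal(0.0, 1.0))
    d = numpyro.deterministic("d", expit(delta) / 3.0)   # d constrained to [0, 1/3]

    mu = (1.0 - 2.0 * d) * p_query + d                   # expected judgments
    sigma = numpyro.sample("sigma", dist.HalfNormal(0.1))
    numpyro.sample("y", dist.Normal(mu, sigma), obs=response)
```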

Hierarchical Implementations of the Models

Both of these models can also be implemented as hierarchical models with partial pooling for the d and \(d^{\prime }\) parameters (implicitly, for N and \(N^{\prime }\) in the case of the Bayesian Sampler). This partial pooling can help to regularize parameter estimates and improve out-of-sample predictive performance. In addition, partial pooling effectively reduces model complexity, and could support more realistic comparison between the “simple” and “complex” variants of the models. Figure 2 displays the translation of a hierarchical implementation of the Bayesian Sampler model into the Bayesian framework, along with a plate diagram representing the dependencies among parameters. For ease of interpretation, the centered parameterization is shown below, although the actual models used a non-centered parameterization to improve sampling efficiency (Papaspiliopoulos et al., 2007).

Fig. 2 Hierarchical complex Bayesian Sampler model diagram and formula specifications. Circular nodes are parameters, shaded nodes are observations, and squared nodes are deterministic functions of parameters. Plates signify values defined for i trials, j participants, and k conditions
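For reference, a minimal sketch of the non-centered parameterization for the subject-level δ parameters (the names and prior scales are illustrative):

```python
import numpyro
import numpyro.distributions as dist

def hierarchical_delta(n_subj):
    """Non-centered parameterization of subject-level delta parameters:
    delta_j = mu + sigma * z_j with z_j ~ Normal(0, 1), which is equivalent
    to delta_j ~ Normal(mu, sigma) but typically samples more efficiently."""
    mu = numpyro.sample("mu_delta", dist.Normal(0.0, 1.0))
    sigma = numpyro.sample("sigma_delta", dist.HalfNormal(1.0))
    with numpyro.plate("subject", n_subj):
        z = numpyro.sample("z_delta", dist.Normal(0.0, 1.0))
    return numpyro.deterministic("delta", mu + sigma * z)
```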

Finally, I also explored fitting a hierarchical version of the Bayesian Sampler model that allowed values of β > 1. Restricting β to [0,1] restricts the prior distribution of the Bayesian Sampler to the class of “ignorance priors” (Zhu et al., 2020). However, it is also possible that people bring informative priors to the probability judgment task. Indeed, Zhu et al. (2020) acknowledge there are situations where an informative prior may be warranted (see e.g. Fennell and Baddeley, 2012). If β is unrestricted, allowed to fall in the domain \([0, \infty)\), then the Bayesian Sampler model becomes more flexible, allowing for equivalent “noise” levels in the same [0,1/2] range as the PT+N model. That is, through the bridging condition, the implied d approaches 1/2 in the limit as \(\beta \to \infty \). Though it would seem a more fundamental change, this same model may also be seen as a version of the PT+N theory that jettisons its two-stage process of conditional probability judgment. Thus, fitting this additional unrestricted model allows for a complete comparison of the models along both of their differing dimensions.

Simulation studies verified that the complex hierarchical PT+N and Bayesian Sampler models can correctly and unbiasedly recover parameters from simulated data (see Supplemental Materials).

Model Comparison

I fit each of the models specified above to data from Zhu et al.’s (2020) Experiments 1 and 2 and estimated the expected log predictive density with PSIS-LOO (\(\widehat {\text {elpd}}_{\text {LOO}}\)) for each combination. Compared with BIC, \(\widehat {\text {elpd}}_{\text {LOO}}\) offers a more sophisticated account of model complexity and is more appropriate in the “\({\mathscr{M}}\)-open” case: situations where we do not know whether any of the models being compared is the “true” model (Vehtari et al., 2019). Model posteriors were estimated using the NumPyro (Phan et al., 2019) implementation of the No-U-Turn Sampler (NUTS) for Hamiltonian Markov chain Monte Carlo (MCMC). For each model, four MCMC chains of 2000 iterations were sampled after 2000 iterations of warmup, and all passed convergence tests according to \(\hat {R}\) (see Gelman et al., 2014). Figure 3 below displays the estimated differences in \(\widehat {\text {elpd}}_{\text {LOO}}\) scores for each of the models as compared to the best-scoring model.
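A minimal sketch of this fitting and scoring pipeline is shown below; it assumes ArviZ is used to compute PSIS-LOO (the text does not specify the tool) and omits the model-specific arguments:

```python
import numpyro
from numpyro.infer import MCMC, NUTS
from jax import random
import arviz as az

numpyro.set_host_device_count(4)   # allow the four chains to run in parallel

def fit_and_score(model, *model_args, **model_kwargs):
    """Fit a NumPyro model with NUTS and compute PSIS-LOO elpd with ArviZ.
    Settings mirror those described in the text (4 chains, 2000 warmup and
    2000 sampling iterations); az.loo requires pointwise log-likelihoods,
    which az.from_numpyro extracts from the model's observed sites."""
    mcmc = MCMC(NUTS(model), num_warmup=2000, num_samples=2000, num_chains=4)
    mcmc.run(random.PRNGKey(0), *model_args, **model_kwargs)
    idata = az.from_numpyro(mcmc)
    return az.loo(idata), idata

# Comparing several fitted models (illustrative):
# az.compare({"bs_complex": idata_bs, "ptn_complex": idata_ptn}, ic="loo")
```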

Fig. 3 Model comparison results for data from Experiments 1 and 2. Error bars indicate two standard errors of the estimates. Typically, a difference of greater than two standard errors is taken as clear evidence that the better-scoring model is superior (Sivula et al., 2020)

Data from Experiment 1 favor “complex” variants of the Bayesian Sampler model compared with the “simple” variants and all versions of the PT+N model (greater values of \(\widehat {\text {elpd}}_{\text {LOO}}\) are better). As shown in Fig. 3, the best-scoring model is an unrestricted variant of the Bayesian Sampler that allows for people to bring informative priors to the probability judgment task (i.e. allowing \(\beta \in [0, \infty)\)). Data from Experiment 2 more decisively reveal a single winning model: the hierarchical “unrestricted” implementation of the Bayesian Sampler model allowing for informative priors.

This unrestricted Bayesian Sampler model differs from the PT+N model only in its treatment of conditional probability judgments, and so, from its superior fit, we can infer that the Bayesian Sampler theory provides a better account of human conditional probability judgments.

Figure 4 (top) shows the posterior distributions of the population-level d and \(d^{\prime }\) parameters inferred from the unrestricted Bayesian Sampler model. In Experiment 2, population-level estimates of \(d^{\prime }\) are greater than 1/3, as are a substantial number of participant-level estimates for d (37 of 83), as shown in Fig. 5. These values fall outside the range implied by the assumption of “ignorance priors” in the Bayesian Sampler model. Parameters fit to the data from Experiment 1 are more consistent with this assumption, although a substantial proportion of individual participants’ d and \(d^{\prime }\) estimates also lie outside this range (11 of 59 for d, 18 of 59 for \(d^{\prime }\)). The clear differences in the d and \(d^{\prime }\) estimated across experiments suggest that the mental sampling processes producing estimates vary across the different conditions, either in terms of the number of samples that are drawn, the noise in reading those samples, or the form of the prior distribution assumed by participants in each context.

Fig. 4 Posterior density of population-level d and \(d^{\prime }\) parameters estimated from the unrestricted hierarchical Bayesian Sampler model for data from Experiments 1 and 2. Dashed line indicates theoretical maximum values for the Bayesian Sampler model with uninformative priors

Fig. 5 Participant-level estimated d and d′ values across Experiments 1 and 2. Error bars indicate 95% CIs

Zhu et al. (2020) demonstrated that the Bayesian Sampler model can capture a set of probabilistic identities developed by Costello and Watts (2016, 2018) that reflect some of the incoherence in people’s probability judgments. Following the design of the present experiments, these identities involve combinations of probability estimates for two events A and B that should all be equal to zero according to probability theory. Under the Bayesian Sampler and PT+N theories, however, some of these identities should be zero, but others are allowed to take on other values. Figure 6 shows the average prediction of the winning model against the average observed value for each identity. Consistent with prior findings, this model captures these identities quite closely.

Fig. 6 Average model-predicted and observed values for the 18 identities. Note that the Bayesian Sampler but not the PT+N model is capable of predicting non-zero values for identities Z10 through Z13. Error bars represent 95% CI. In Experiment 1, like participants’ responses, the model’s estimates are very slightly positive for {icy, frosty} and very slightly negative for {normal, typical}. This pattern replicates the qualitative pattern reported by Zhu and colleagues

Finally, it is worth noting that the best of these models provide quite strong overall fits to the data, not just for the group-level query averages, but also for individual participants’ query averages, as seen from the correlations between predicted and observed responses in Table 1. Figure 7 shows the correlation between participants’ responses across all trials and the best-performing model’s predictions.

Table 1 Bayesian model comparison results, with the best-scoring model in boldface

Fig. 7 Posterior predictions from the best-fitting model and participants’ responses in Experiments 1 and 2

Results: Raw Trial-Level Response Distributions

The best-fitting model capturing participants’ responses in expectation can be seen either as a variant of the Bayesian Sampler theory allowing for informative priors or as a variant of the PT+N theory without any special treatment of conditional versus unconditional probability judgments. Thus, these prior analyses have not decisively adjudicated between the different process-level psychological theories behind the models. To evaluate the Bayesian Sampler and PT+N theories’ competing accounts of the psychological processes behind probability judgments, a second set of analyses focused on the distribution of participants’ trial-by-trial responses was conducted.

Recall that, in contrast to a noise-based account, the Bayesian Sampler theory predicts both a truncation of the response distribution and a correlation between the degree of shrinkage in probability estimates and the trial-to-trial variability of those estimates (assuming β is relatively small). Because the Bayesian Sampler applies a Bayesian adjustment after sampling, it predicts probability judgments will always lie between d and 1 − d, even when zero positive or negative mental samples are drawn (see Eq. 12 and the bridging condition, Eq. 5). First, it bears noting that the truncated response distribution implied by the Bayesian Sampler model appears at odds with the raw response data: participants’ responses frequently lie outside the range implied by the best estimates of their d parameters (41% in Experiment 1 and 60% in Experiment 2).

Yet comparing the distributional predictions of the models more rigorously poses three challenges: (1) the discrete nature of the models suggests a limited set of allowable responses, assigning zero probability to all others, (2) Bayesian adjustment implies truncation of the support of the response distribution, again assigning zero probability to other response values, and (3) participants routinely round their responses, complicating both of the previous issues.

In the following set of analyses I attempt to lay out a set of reasonable assumptions that permit participants’ trial-by-trial responses to be modeled and used to compare the theories’ predictions. To do so, I first extend the models so as to render them fully continuous in their latent space and then marry them with a specific model of response errors. Thus the models compared in the following analyses are not identical to those originally proposed by Zhu et al. (2020) and Costello and Watts (2014). However, they do provide implementations of the theories’ process-level accounts, and thus a means to test the distributional predictions of models based on Bayesian adjustment against models based on sampling noise.

Continuous Extensions of the Models

Under both models, the variability of people’s responses trial-to-trial is driven by the number of mental samples drawn: more mental samples produce less-variable responses. However, if the number of samples is considered to be a truly discrete quantity, then only a limited number of discrete responses are possible. As Zhu et al. (2020) note, this is somewhat implausible on its face and their later work has abandoned this assumption (Zhu et al., 2021). At the same time, from a pragmatic perspective it is highly desirable that all latent parameters within the models be continuous rather than discrete. The models would be far more tractable to fit if, rather than including a latent Binomial variable representing the discrete number of samples drawn, we could instead model a continuous proportion of samples using, for instance, a Beta distribution.

Zhu et al. (2020) introduce the possibility of an “autocorrelated” Bayesian Sampler model under which samples are assumed to be autocorrelated (ideas advanced further in Zhu et al., 2021). Because autocorrelated samples provide less information than i.i.d. samples, they should be weighted accordingly when computing probability estimates. The idea is that people actually draw N autocorrelated samples that approximate some smaller effective number of i.i.d. samples (Neff). Assuming the actual number of autocorrelated samples drawn, N, is allowed to vary somewhat noisily, a model based on autocorrelated samples would no longer be limited to predicting a discrete set of possible responses. To approximate this as part of a wholly continuous model, I model the proportions calculated from such a hypothetical autocorrelated sampling process using a Beta distribution.

Mixture Modeling: Rounding and Contaminants

Creating continuous extensions of the models makes their estimation more tractable. However, a specific model of participants’ response processes and errors is still needed to capture rounded and out-of-support responses. To address these challenges, I implemented variants of the PT+N and Bayesian Sampler models within discrete mixture models allowing for varying rounding policies as well as “contaminant” responses generated by noise processes outside the models.

First, it is clear that participants rounded a majority of their responses. This sort of rounding can be modeled by a categorical distribution across the discrete possible rounded responses. Each rounded response category corresponds to a pair of cut points, a and b, with the probability of the categorical response defined by the cumulative distribution function of the underlying latent distribution (B).

$$ P([a,b)) = B(b, \mu N, (1-\mu)N) - B(a, \mu N, (1-\mu)N) $$
(11)

As participants were allowed to respond freely with whole numbers from 0 to 100, the exact rounding policy for each response is unknown. Nevertheless, these rounding policies can be estimated via mixture modeling. For simplicity, rounding to the nearest 5% was enforced for all responses. Then, the probabilities of these categorical responses were computed for 21 and 11 categories (corresponding to rounding to the nearest 5% and 10%, respectively). These probabilities were combined along with a uniform probability representing “contaminants” according to mixing probabilities ϕ, distributed with a Dirichlet prior (see Appendix for further implementation details).
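One way to compute these categorical probabilities from the latent Beta distribution is sketched below; the cut-point convention and function names are illustrative, and the combination with the 10%-rounding and contaminant components follows the Appendix:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def rounded_response_probs(mu, n, step=0.05):
    """Probability of each rounded response category (cf. Eq. 11):
    integrate the latent Beta(mu*n, (1-mu)*n) density between the
    cut points that round to each allowable response."""
    responses = np.linspace(0.0, 1.0, int(round(1 / step)) + 1)  # 0, .05, ..., 1
    lo = np.clip(responses - step / 2, 0.0, 1.0)
    hi = np.clip(responses + step / 2, 0.0, 1.0)
    a, b = mu * n, (1 - mu) * n
    return beta_dist.cdf(hi, a, b) - beta_dist.cdf(lo, a, b)

probs = rounded_response_probs(mu=0.74, n=10)
print(probs.sum())   # ~ 1.0; the full mixture adds 10%-rounding and contaminant components
```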

The Bayesian Sampler model predicts a truncated range of possible responses given β and effective N (and consequently, implied d). Modeling these different rounding processes allows for at least some out-of-bounds responses to be accounted for by rounding processes (e.g. when an allowable response of .14 is rounded to the out-of-bounds value of .10). However, some responses still cannot be accounted for by the model. Instead, these responses are treated as “contaminants” generated by a random response process. Modeling “contaminant” response processes allows the Bayesian Sampler model to be fit in the presence of true outliers. Identifying the estimated proportion of “contaminant” responses can also provide a check on the models: if a model can only be fit by assuming a large proportion of contaminant responses, this suggests it is likely not a good model of human behavior.

Trial-Level Noise-Based Model

Compared to the query-averaged model, the trial-level noise-based model adds two features: mixture components for rounding and contaminants, and subject-level varying N in place of a fixed K parameter. This model’s implementation and the implementation of its mixture components are depicted in Fig. 8. Note that N is allowed to vary independently of d, allowing for independence between response shrinkage and variability.

Fig. 8 Hierarchical complex trial-level noise-based model diagram and formula specifications. ZNB and fNB are functions that compute the probability of each categorical response and the expected proportion of read-out mental samples given underlying mental probabilities and sample-reading noise. See Appendix for further descriptions of these details

Trial-Level Bayesian Sampler Model

Approximating an autocorrelated sampling process using a Beta distribution and starting from the original Bayesian Sampler model,

$$ \hat{P}_{BS}(A) = \frac{S(A)}{N+2\beta} + \frac{\beta}{N+2\beta} $$
(12)

we can replace the number of successes S(A) (which is binomially distributed) with the quantity ρ(A)N, where ρ(A) represents the Beta-distributed sample proportion generated by the autocorrelated sampling process outlined above.

$$ \hat{P}_{BS}(A) = \frac{\rho(A)N}{N+2\beta} + \frac{\beta}{N+2\beta} $$
(13)
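Since ρ(A) lies in [0,1], Eq. 13 makes the predicted truncation explicit:

$$ \hat{P}_{BS}(A) \in \left[\frac{\beta}{N+2\beta},\ \frac{N+\beta}{N+2\beta}\right] = \left[d,\ 1-d\right], \quad \text{where } d = \frac{\beta}{N+2\beta}. $$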

It is then plain that \(\hat {P}_{BS}(A)\) is a linear transformation of ρ(A), and therefore follows a transformed Beta distribution (see Appendix for derivation). Figure 9 diagrams the entire Bayesian Sampler model, now parameterized in terms of β, N, and \(N^{\prime }\).

Fig. 9 Hierarchical complex trial-level Bayesian Sampler model diagram and formula specifications. ZBS and f0 are functions that compute the probability of each categorical response and the expected proportion of mental samples given underlying mental probabilities before Bayesian adjustment. See Appendix for further descriptions of these details

This model assumes that the number of samples drawn on each trial is fixed at N or \(N^{\prime }\) accordingly (modulo the uncertainty about these parameters). However, it also seems reasonable to imagine that the number of samples drawn in fact varies from trial to trial. For the Bayesian Sampler model, this could substantially impact the model’s fit. As the number of samples drawn affects the truncation of the response distribution, this may allow the model to capture some responses that would otherwise be treated as contaminants. To capture this, the model can be given one additional extension allowing the number of samples to vary, by adding a new parameter Ntrial. This parameter multiplies the number of effective samples as a fraction of each individual participant’s average number of samples drawn, e.g. so that a participant might sometimes draw 1.5× or 2× the number of effective samples they typically draw. The appropriate amount of variation in Ntrial is constrained to be fairly small, but is estimated hierarchically: I assume \(\log(N_{trial}) \sim N(0,\sigma _{N_{trial}})\) and \(\log(\sigma _{N_{trial}}) \sim N(-1, .3)\).
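A schematic NumPyro fragment for this extension (the variable names are illustrative, not those of the supplemental code) might look like:

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def trial_level_n(n_subject, n_trials):
    """Sketch of the trial-varying sample-size extension: each trial's effective N
    is the subject's N multiplied by a lognormal factor whose spread is itself
    estimated hierarchically (log(N_trial) ~ N(0, sigma), log(sigma) ~ N(-1, .3))."""
    log_sigma = numpyro.sample("log_sigma_ntrial", dist.Normal(-1.0, 0.3))
    with numpyro.plate("trial", n_trials):
        log_mult = numpyro.sample("log_n_trial", dist.Normal(0.0, jnp.exp(log_sigma)))
    return n_subject * jnp.exp(log_mult)
```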

Model Comparison: Raw Trial-Level Response Models

Prior to fitting the models, response data were rounded to the nearest 5%. Nearly all responses (Exp. 1: 93%; Exp. 2: 89%) were already divisible by five, and this rounding was necessary to speed model fitting. Even so, estimating model posteriors for the trial-level mixture models using MCMC proved intractable. Instead, model posteriors were estimated using Stochastic Variational Inference (SVI) in NumPyro (Phan et al., 2019) with a multivariate Normal guide (e.g. Kucukelbir et al., 2015). Estimating each model posterior took between approximately 20 and 90 minutes on an Nvidia V100 GPU. Simulation studies verified that this estimation approach could reliably recover parameters from simulated data (see Supplemental Materials).
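A minimal sketch of this estimation setup in NumPyro is given below; the optimizer, learning rate, and number of steps are illustrative choices rather than the settings actually used:

```python
from numpyro.infer import SVI, Trace_ELBO
from numpyro.infer.autoguide import AutoMultivariateNormal
from numpyro.optim import Adam
from jax import random

def fit_svi(model, *model_args, num_steps=50_000, **model_kwargs):
    """Approximate the posterior with stochastic variational inference,
    using a multivariate Normal guide as described in the text."""
    guide = AutoMultivariateNormal(model)
    svi = SVI(model, guide, Adam(1e-3), Trace_ELBO())
    result = svi.run(random.PRNGKey(0), num_steps, *model_args, **model_kwargs)
    return guide, result.params
```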

Table 2 shows the scores of each model fit to the trial-level data in Experiments 1 and 2. From the quantitative model comparison it is clear that the noise-based model is superior, with substantially better \(\widehat {\text {elpd}}_{\text {LOO}}\) scores than both of the competing Bayesian Sampler implementations in both experiments.

Table 2 Bayesian model comparison results for trial-level models, with the best-scoring model in boldface

The models’ performance can be better understood by examining the two main features on which the models’ predictions depart: truncation of the response distribution and shrinkage-dependent variance of responses.

The non-varying Bayesian Sampler model fits quite poorly and is estimated to have a large proportion of “contaminant” responses (43% in Experiment 1, 26% in Experiment 2). This is likely because the data lack the truncation of the response distribution that the model predicts. As with the trial-averaged models, the estimates of implied d are quite high, which would predict substantial truncation. As mentioned earlier, many of participants’ responses fall outside the range implied by their most likely implied d values.

Allowing the effective number of samples drawn to vary trial-by-trial improves the fit of the Bayesian Sampler model substantially and somewhat decreases the estimated proportion of contaminant responses (30% in Experiment 1, 28% in Experiment 2). Nevertheless, these results still indicate inferior fit compared with the noise-based model, which has substantially better \(\widehat {\text {elpd}}_{\text {LOO}}\) scores and attributes far fewer responses to the contaminant process (11% in Experiment 1, 4% in Experiment 2).

Second, we can examine the shrinkage-variance relationship. In the noise-based model, N and d were allowed to vary freely. But if these quantities are actually correlated, as predicted by a Bayesian adjustment model, then we should nevertheless expect to see a correlation between the subject-level estimates of these parameters. Figure 10 shows scatterplots relating these estimates. Although there is a slight negative correlation in Experiment 1, it is driven by only a handful of participants with extreme values. In Experiment 2 the correlation appears, if anything, to be positive rather than negative. These findings again run counter to the predictions of the Bayesian Sampler theory.

Fig. 10 Scatterplots showing the relationships between subject-level d and N estimates from the trial-level noise-based model for Experiments 1 and 2. The figure for Experiment 1 excludes some outlier participants who gave repetitive responses that resulted in abnormally high N values. Error bars indicate 95% CI

Comparing the d values inferred from the trial-level model, we see they are similar to, but generally smaller than, those estimated in the trial-averaged model. It is unclear exactly why this is, though it could owe at least in part to the mixture component preventing “contaminant” responses from affecting the estimates of these parameters.

Discussion

Fit to the average of participants’ responses over blocks, there is a single clear winner among the competing models: a model without any special treatment of conditional probability (à la the Bayesian Sampler model) that allows for an implied d parameter in [0,.5]. This model could be interpreted either as a variant of the Bayesian Sampler without restriction on its β parameter, or as a variant of the PT+N model that removes its account of conditional probability judgments.

In either case, these findings make clear that the Bayesian Sampler theory provides a superior account of conditional probability judgments in this task. In keeping with the larger theoretical framework of Bayesian cognitive science, the Bayesian Sampler theory assumes that subjective probabilities underlie people’s probability judgments, and that conditional probability judgments are produced by Bayesian conditioning occurring in their mental models of the events in question, rather than as arising from the probability judgment process (Chater et al., 2020; Zhu et al., 2020).

At the psychological process level, the Bayesian adjustment process hypothesized by the Bayesian Sampler model makes two clear predictions about the distribution of participants’ responses. First, it implies a truncated range of possible responses. Second, assuming that β is constrained to be a relatively small value, then under the Bayesian Sampler model, N influences both the degree to which responses are shrunk toward .50 and the variability of those responses trial-to-trial. Thus, participants’ inferred N values should be correlated with the noise in their trial-level responses. In contrast, under the noise-based account of the PT+N model, there is no truncation of responses and no predicted correlation between shrinkage and response variability.

The distributions of participants’ responses are more consistent with the PT+N theory’s account of sampling noise than with the Bayesian adjustment implied by the Bayesian Sampler theory, as neither of the Bayesian Sampler’s predictions appears to be borne out by the data. First, participants’ responses frequently fall outside the truncated range implied by the parameters estimated under the Bayesian Sampler model. Fit to the raw data, this requires treating an unreasonably large proportion of responses as “contaminants”. Second, the degree of shrinkage in participants’ responses and the variability in those responses are not correlated in the ways predicted by the Bayesian Sampler theory.

In the end, I find the best overall account of participants’ probability judgments is a modification of the PT+N theory without its two-stage process of conditional probability judgments. Like the PT+N theory, this model accounts for distortions in probability judgments via a process of noisy sampling. But like the Bayesian Sampler theory, under this theory, conditional probability judgments are produced by a process of conditioning in the mental model of the events, rather than as part of the mental sampling process itself.

Outside of probability judgments, Bayesian conditioning is a key aspect of the sorts of mental models imagined by Bayesian cognitive scientists, where cognitive models are conditioned on information as part of learning, prediction, and inference. Within this framework, it seems only natural to imagine that a similar conditioning process would subserve probability judgments, rather than conditional probability judgments being made by a distinct two-stage sampling process.

To illustrate the distinction, consider being asked to judge the conditional probability that it will rain tomorrow in London given it rained today in London. Now, compare that with first being told that it rained today in London, and then being asked to make the simple probability judgment that it will rain tomorrow in London. The original PT+N theory would draw a distinction between the two tasks: The first task would invoke the two-stage sampling process, whereas presumably the second would involve some change in the mental model to reflect learning about the day’s weather followed by only a single stage of sampling. In contrast, the present findings suggest these tasks would invoke the same set of mental processes—conditioning of the mental model followed by the drawing of samples from that model.

Remaining Questions and Limitations

Despite the model’s quantitative success, some more qualitative questions remain. First, the plausibility of the parameter values inferred from the model bears consideration. Many participants’ estimated d and \(d^{\prime }\) parameters were quite high—potentially against the spirit of the original PT+N model. This model bounds d at .50 in principle, but a sample-reading process with such high error rates may or may not be plausible. In prior work, simulations have often assumed values of d around .10 (e.g. Costello & Watts, 2017; Howe & Costello, 2020). Further research examining what factors might affect the mental sampling and reading processes (e.g. task complexity, distractions, prior experience) might help to shed light on the most plausible range of d values in different contexts.

High estimates of d parameters might also call into question arguments for the rational utility of such a process. Zhu and colleagues argue that the regularizing effect of Bayesian adjustment should be seen as adaptive. They also consider that “noise” might provide an algorithmic-level solution to the computational-level goals defined by the Bayesian Sampler (Zhu et al., 2020). Even high implied d values might be consistent with rational inference in cases where the number of effective mental samples is very low. For instance, a Beta(2,2) prior is only modestly informative, but could produce d = .40 if N = 1. However, in cases where more samples are drawn, high d values would correspond to potentially inflexible and suboptimal priors. From the results, it is clear there are some individuals with both relatively high d and N values, which may press somewhat against the rational justification for the desirability of sampling noise. For instance, some participants are inferred to have N ≈ 20 and d ≈ .30, implying β ≈ 15.

Finally, as noted, the comparison of these models using trial-level data rests on a number of elaborating assumptions made to support fitting of the models. It should be recognized that different assumptions may have produced different results, and other error processes remain possible. Although other, indirect analysis approaches might be designed to avoid these concerns (e.g. Sundh et al., 2021), ultimately it seems crucial that cognitive models at some point be fit to the actual human behaviors of interest. An important direction for future elaborations of sampling-based theories is the development of more rigorous theories of realistic mental sampling processes, including details of their initialization, autocorrelation, and amortization (Gershman and Goodman, 2016).

Conclusions

Probability judgments have proven a fruitful testing ground for sampling-based theories of cognition. But the implications of sampling-based models like the Bayesian Sampler and PT+N theory go well beyond the probability judgment task itself: these models have the potential to extend the success of Bayesian theories of cognition to develop a probabilistic science of everyday beliefs. Under such an account, beliefs are not explicitly represented, stored, or even computed as probabilities, but rather are emergent properties of mental models generating probabilistic samples (Chater et al., 2020; Sanborn & Chater, 2016).

Nevertheless, we might still use the logic of Bayesian models to understand the operation of these beliefs and how they respond to evidence. Indeed, by representing the “true” subjective probabilities as a latent variable in the models used here, Bayesian data analysis allows those underlying credences to be inferred. Future research could explore how estimates of people’s credences might be made more reliable, and how inferences about these mental probabilities might be integrated with other Bayesian models of reasoning (e.g. Franke et al., 2016; Griffiths & Tenenbaum, 2006; Jern et al., 2014). For instance, people’s responses in various reasoning tasks are often explicitly related to inferred subjective mental probabilities, so accounting for biases in those reports may permit more rigorous model testing. One particularly promising direction could be to integrate these models with formal models of belief revision, which might then shed new light on these fundamental cognitive processes (e.g. Cook & Lewandowsky, 2016; Jern et al., 2014; Powell, 2022; Powell et al., 2018).