Children face a difficult problem in learning about the world. There is much to learn and little time in which to learn it. In this context, the benefits of social learning are self-evident. Self-directed strategies are slow and cannot be used to acquire some knowledge (e.g. language). It is quicker to call upon the knowledge of others. However, people do not always produce reliable data. A person may have inaccurate knowledge or may wish to deceive. Thus, it is necessary for people to trust informants and their information selectively (Koenig and Harris 2005; Pasquini et al. 2007; Corriveau and Harris 2009; Corriveau et al. 2009; Chen et al. 2012). This sort of selective trust in informants and their information is referred to as epistemic trust.

Research has identified informant and contextual features that cause children to trust informants and their information differently. Children trust more accurate informants (Koenig and Harris 2005; Pasquini et al. 2007). Children are less likely to ask informants who mislabel common objects for future information than informants who label common objects correctly (Koenig and Harris 2005), and children’s preference for accurate over inaccurate informants increases with the relative accuracy between informants (Pasquini et al. 2007). Additionally, children have been shown to trust information from groups of informants over dissenters (Corriveau et al. 2009; Chen et al. 2012) and to prefer familiar informants (Corriveau and Harris 2009), informants with the same native accent (Kinzler et al. 2011), informants of the same gender (Taylor 2013) and more attractive informants (Bascandziev and Harris 2014).

Research has also shown that children’s epistemic trust develops. Older children seem to allocate their trust more flexibly than younger children (Koenig and Harris 2005; Pasquini et al. 2007; Corriveau and Harris 2009). The literature typically explains this development in terms of changes in the ability to monitor who is knowledgeable (Pasquini et al. 2007; Corriveau et al. 2009; Corriveau and Harris 2009).[Footnote 1] Others have broadly argued that trust is rational (Sobel and Kushnir 2013). An adjacent literature indicates changes in the ability to reason about deception (Couillard and Woodward 1999; Mascaro and Sperber 2009).

Shafto et al. (2011) proposed a probabilistic model that formalizes epistemic trust as inferences about informants’ knowledgeability (versus unknowledgeability) and helpfulness (versus deception) (see also Eaves and Shafto 2012; Butterfield et al. 2008). The computational model was fit to three studies to investigate possible explanations for developmental changes in behavior. Contrary to the aforementioned qualitative accounts that attribute developmental changes to children’s improving ability to monitor informants’ knowledge, the results showed that the behavioral differences between three- and four-year-olds are primarily a result of a change in children’s representation of informants’ helpfulness. Three-year-olds’ data were better explained by a model that only reasons about informants’ knowledgeability, whereas four-year-olds’ data were better explained by a model that reasons about both knowledge and helpfulness. Although provocative, these results are limited by the reliance on a small subset of the literature.

It would be desirable to use the computational model to generate a more integrative theoretical account of the literature on development of epistemic trust. Indeed, the model in principle should apply to findings across the literature. However, the research questions and methods used in epistemic trust research are heterogeneous. In addition to variations in age, researchers have investigated experimental features such as the modes through which informants communicate (e.g. verbal testimony, pointing, gaze), the experimental paradigm (e.g. forced-choice, looking time), and culture. Shafto et al. (2011) focused on a small subset of the overall literature to ensure homogeneity of tasks and ages that would allow all experiments to be explained with a simple, unified explanation, but this necessarily limits the explanatory power of the theory. Any integrative theory must deal with not only heterogeneity of tasks and ages, but correlations between task and age. Methods that work for very young children—such as looking time—do not work for older children, and vice versa. Indeed, the correlation between task and age, and the interpretation problems it poses, are a general problem for integrative theories of cognitive development.

In this paper we introduce a method for conducting integrative, model-driven analysis of heterogeneous experiments and apply it to the construction of an integrative account of the development of epistemic trust. The approach is based on two components: a domain-specific model of epistemic trust (Shafto et al. 2011) and a domain-general approach for integrative analysis (Mansinghka et al. Accepted pending revision; Shafto et al. 2014). The model of epistemic trust is used to parameterize the conditions of heterogeneous experiments, that is, to translate the experimental results into model parameters. Along with each parameterization, we document the methodological details of each condition (mean age, experimental paradigm, communication mode, etc.). The collection of conditions, each translated into a set of model parameters and experimental features, comprises the input into the integrative analysis. The integrative analysis infers a joint probability distribution over all relevant experimental features and model parameter values. The resulting joint distribution allows querying of conditional distributions over parameters and experimental features. From these conditional distributions we gain the ability to ask and answer fundamental questions about how features of conditions such as task and age are related to the variables in the model, e.g., how do children’s beliefs about informants’ helpfulness change from age 18 months, to 3 years, to 4.5 years? Or how are pointing versus verbal testimony reflected in children’s beliefs about helpfulness?

We begin by discussing the heterogeneity in the epistemic trust literature. We then discuss the model of epistemic trust, followed by our approach to aggregating parameterized results. We then detail our methods and results, and conclude by discussing broader implications of this approach for epistemic trust and broader theories in cognitive development.

Heterogeneity in studies on the development of epistemic trust

The epistemic trust literature, as defined in terms of the scope of the computational model (see Eaves and Shafto 2012), is composed of many literatures, each of which is interested in how learners trust informants differently in different contexts. The set of encompassed literature includes the selective trust, deception, informant expertise, and pedagogy literatures. Each of these literatures has its own conceptual, methodological, and age conventions. In this section, we briefly review each literature in turn to offer a sense of the heterogeneity of the conceptual and methodological landscape.

The selective trust literature recounts people’s different trust in informants driven by inferences about their epistemic states. As an example, Koenig and Harris (2005) proposed that children monitor the accuracy of informants and use prior accuracy information when choosing between and learning from informants. Preschool-aged children observed two informants label common objects (chair, ball, etc). One informant labeled all four objects correctly and the other labeled all four objects incorrectly. After three of these accuracy trials, unfamiliar objects were placed before the informants. The child was then asked which informant she would like to ask for the novel object’s label (ask trial) or, after having observed each informant provide a label, was asked to choose a label (endorse trial). Four-year-old children asked and endorsed the accurate informant most often. This result demonstrates that children’s preferences for specific informants and their information are influenced by informants’ accuracy. A number of other studies have reproduced this result and have shown that a single inaccuracy can shape children’s informant preferences (Fitneva and Dunfield 2010) and that children take into account not only whether an informant has been accurate or inaccurate but the relative accuracy between informants (Pasquini et al. 2007) and the magnitude of informants’ errors (Einav and Robinson 2010). Even infants appear to learn differently from reliable and unreliable informants (Tummeltshammer et al. 2014) and are surprised when informants mislabel common objects (Koenig and Echols 2003). The selective trust literature also indicates that children prefer informants who are part of a consensus (Corriveau et al. 2009; Chen et al. 2012), and who are more familiar (Corriveau and Harris 2009) (e.g. their preschool teacher over a stranger).
Another, closely related, line of research indicates that children may choose informants based on their superficial, non-epistemic, qualities such as their gender (Taylor 2013), their attractiveness (Bascandziev and Harris 2014), and accent (Kinzler et al. 2011). Research also suggests that selective trust is modulated by cultural factors. For example, children of different cultures are differently likely to accept seemingly unreliable information from a consensus (DiYanni and Kelemen 2008).

The deception literature recounts people’s different trust in informants driven by inferences about knowledgeable informants’ helpfulness. The deception literature is vast, addressing issues related to false belief, sarcasm and more. Here we consider only the simplest case, which is most closely related to tasks described above: informants who are knowledgeable but nonetheless provide inaccurate information. Research indicates that three-year-olds have difficulty handling deceptive data compared with older children (Couillard and Woodward 1999; Mascaro and Sperber 2009). For example, three-year-olds, but not four-year-olds, are repeatedly fooled by an informant who, for ten trials, indicates by way of pointing the one of two cups under which no prize is hidden (Couillard and Woodward 1999). In addition to age, reasoning about deception varies with communicative mode. The same study found that children’s ability to choose the correct cup was improved if the informant indicated cups by placing markers on them rather than pointing at them.

The above studies focus on cases where the informant’s testimony provides information about their trustworthiness. It is common to experience cases where an informant’s trustworthiness is implied by social decree, as is the case with expertise. Research has investigated the development of trust in experts by pitting two informants labeled as experts in contrasting domains against each other. Children begin to correctly attribute domain knowledge fairly early, at about age four (Lutz and Keil 2002; Aguiar et al. 2012), and these abilities improve as children learn more about how knowledge domains are organized (Danovitch and Keil 2004; Keil et al. 2008). Four-year-olds, but not three-year-olds, more often endorse novel object labels from informants who demonstrate accurate knowledge of those objects’ functions and internal properties (Sobel and Corriveau 2010). Additionally, preschoolers hold a domain-general view of ignorance and a domain-specific view of expertise (Koenig and Jaswal 2011) and more often endorse information from nice non-experts than information from mean experts (Landrum et al. 2013).

The pedagogy literature recounts people’s different learning from informants when informants are assumed to be helpful and knowledgeable. For example, the Natural Pedagogy theory (Csibra and Gergely 2006; Gergely et al. 2007) asserts that children have a strong, in-born belief that all informants are helpful and knowledgeable and that relaxing this belief is a primary task in early development. Hence, the pedagogy literature looks at how children make different inferences about the world given data from teachers than they do given unintentional data (Bonawitz et al. 2014) or given data generated by self-directed strategies (Shafto et al. 2012; Shafto et al. 2012). Recent research has demonstrated that children can identify when these assumptions do not apply and use self-directed means to fill in gaps left by poorly-performing pedagogues (Gweon et al. 2014).

The different literatures employ different methods on different age groups. Trust-in-testimony research primarily focuses on two age groups: infants up to 18 months, and preschoolers from three to four years. Studies with preschoolers typically employ forced-choice paradigms, asking children which informants they prefer or what information they believe; and research on infants is carried out using looking-time and simple motor paradigms, observing which informants or actions infants are surprised by or which actions they imitate. Deception research typically focuses on three- and four-year-olds, but research into more subtle questions goes on well beyond those ages. Expertise research focuses on children old enough to allow the use of language to inform children about informants’ expertise. Pedagogy researchers seek to evaluate children as young as possible, using ostensive cues such as gaze to cue trust. The epistemic trust literature is broad and the age groups investigated and methods employed are highly variable. To create an account of the development of epistemic trust we must not only account for performance across ages, but across fundamentally different tasks and phenomena.

Modeling epistemic trust

Leveraging theoretical work on the teleological stance (Gergely and Csibra 2003; Dennett 1989; Baker et al. 2009), Shafto et al. (2011) proposed a computational model of epistemic trust that in principle applies to all of these phenomena. The model explains epistemic trust in terms of inferences about informants’ knowledgeability and helpfulness (Eaves and Shafto 2012; Shafto et al. 2011; Landrum et al. 2015). A trustworthy informant must both possess accurate knowledge about the world (be knowledgeable), and be willing and able to share his or her knowledge (be helpful). Knowledgeable informants may not act consistently with their knowledge through lack of communicative skill or malicious intent; helpful informants may hold misconceptions, which may lead them to produce inaccurate information.

The model is represented as a Bayesian Network (Pearl 2000; Spirtes et al. 1993): a set of variables (nodes) causally linked by probabilistic relationships (edges). Edges link parent nodes to their child nodes. Figure 1a shows a graphical representation of the learner’s model of how informants choose data. Informants’ beliefs, b, about the world, w, are determined by their knowledgeability, k, about the world. Knowledgeable informants’ beliefs align with the true state of the world; unknowledgeable informants’ beliefs are determined randomly. An unknowledgeable informant’s beliefs may follow a uniform distribution, corresponding to a completely random guess, or may follow a distribution that allows some beliefs to be less likely. For example, given the animal, lion, an informant should be less likely to guess the label car than to guess the label tiger.

Fig. 1 A graphical representation of the epistemic trust model. Informants’ beliefs, b, about the world, w, are determined by their knowledgeability, k. Informants’ actions, a, are determined by their beliefs and their helpfulness, h. Actions on the world result in effects, e. 𝜃_k and 𝜃_h represent individual informants’ probability of being knowledgeable and helpful, respectively. 𝜃s have beta distribution priors that represent expectations about informants in general. (a) A representation of the intentional stance (Dennett 1989) in which beliefs and desires, in this case to help or not, lead to actions. (b) Single-informant model. (c) Multi-informant model for reasoning about groups of informants. Note that beta priors on knowledgeability and helpfulness and the true state of the world, w, are shared across informants. Arrows and nodes are color-coded for clarity.

Informants’ actions, a, are determined by their beliefs, b, and their helpfulness, h. Helpful selection of evidence is modeled using the pedagogical sampling model in Shafto et al. (2012). Helpful informants act to induce their own beliefs in learners; unhelpful informants act to induce other beliefs in learners. This is captured by the recursive equations:

$$\begin{array}{@{}rcl@{}} P_{\text{learner}}(b|a) &\propto& P_{\text{informant}}(a|b)P(b), \end{array} $$
(1)
$$\begin{array}{@{}rcl@{}} P_{\text{informant}}(a|b) &\propto& \left\{\begin{array}{ll} P_{\text{learner}}(b|a) &\text{ if helpful} \\ 1-P_{\text{learner}}(b|a) &\text{ if not helpful.} \end{array}\right. \end{array} $$
(2)

Informants’ actions are selected conditional on their beliefs about the world. Because informants only control the action that they choose, they must consider all the possible effects of their actions. The effects are thus marginalized (summed) out. Equation 2 captures the idea that actions are selected purposefully, with a goal (helping or deceiving), based on the informant’s beliefs. Actions on the world result in effects e. The effect is determined by the true state of the world, w, and the action, a. In word learning, we do not model an effect, for unless the speaker is a wizard or has uttered some extraordinarily breathy statement, words do not themselves elicit observable effects from the world.

Prior distributions are placed on informants’ helpfulness and knowledgeability, corresponding to learners’ beliefs about individual informants and informants in general,

$$\begin{array}{@{}rcl@{}} h|\theta_{h} &\sim& \text{Bernoulli}(\theta_{h}) \end{array} $$
(3)
$$\begin{array}{@{}rcl@{}} \theta_{h} &\sim& \text{beta}(\alpha_{h}, \beta_{h}) \end{array} $$
(4)

and similarly for knowledgeability,

$$\begin{array}{@{}rcl@{}} k|\theta_{k} &\sim& \text{Bernoulli}(\theta_{k}) \end{array} $$
(5)
$$\begin{array}{@{}rcl@{}} \theta_{k} &\sim& \text{beta}(\alpha_{k}, \beta_{k}). \end{array} $$
(6)

The values of h and k are determined by flips of 𝜃-weighted coins. The 𝜃s are drawn from beta distributions. These beta distributions leave the model with four free parameters: α_k, β_k, α_h, and β_h. We use the standard beta distribution parametrization, beta(α,β), which distributes probability according to the function

$$ f(x|\alpha,\beta) = \frac{x^{\alpha-1}\left(1-x\right)^{\beta-1}}{\text{B}\left(\alpha,\beta\right)}. $$
(7)

where B(⋅,⋅) is the beta function.
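As a rough illustration of Eqs. 3 to 7, the following Python sketch evaluates the beta density and draws one informant's 𝜃 followed by per-demonstration values of h. The function names and the example hyperparameters (α = 2, β = 1) are illustrative assumptions, not values taken from the paper.

```python
import math
import random

def beta_pdf(x, alpha, beta):
    """Density of beta(alpha, beta) at x, following Eq. 7."""
    # B(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)
    beta_fn = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / beta_fn

def sample_informant(alpha, beta, n_demonstrations, rng):
    """Draw theta once (Eq. 4), then flip a theta-weighted coin per demonstration (Eq. 3)."""
    theta = rng.betavariate(alpha, beta)
    flips = [rng.random() < theta for _ in range(n_demonstrations)]
    return theta, flips

rng = random.Random(0)  # fixed seed, used only for reproducibility
# Hypothetical hyperparameters: a prior that leans toward helpful informants.
theta_h, helpful_flips = sample_informant(alpha=2.0, beta=1.0, n_demonstrations=5, rng=rng)
```

The same two functions would apply unchanged to knowledgeability (Eqs. 5 and 6) by passing α_k and β_k instead.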

Beta distributions represent the distribution of people and each informant is a draw of 𝜃 from that distribution (see Fig. 1b). 𝜃 values persist across multiple demonstrations by a single informant. Keeping these rules in mind, we can link several single-informant graphs by their beta priors and by the state of the world to form a group demonstration (see Fig. 1c). We can also link a number of single-informant graphs by 𝜃_k and 𝜃_h to form successive demonstrations from a single informant. For multiple demonstrations, we need not (necessarily) link the state of the world; the state of the world is free to change from demonstration to demonstration. We can link graphs in both ways simultaneously to form successive group demonstrations.

Modeling word learning

Epistemic trust studies generally follow a similar setup. Children are introduced to one or more informants from whom they receive differing data (experience) in familiarization trials. Children must then choose to accept or reject information from the informant(s). For example, a child may be introduced to two informants and then observe that one informant labels common objects incorrectly while the other labels them correctly (accuracy trials). The child may then be presented with a novel object and asked which informant he or she would like to ask for the object’s label (an ask trial), or similarly after having observed both informants label, the child may then be asked to label the object (an endorse trial). Here we discuss the process by which we model these studies.

To begin, we must make some assumptions about the world. We arbitrarily assume that at any given labeling trial there are four reasonable labels.[Footnote 2] That is, |W|=4 and hence there are four possible beliefs, |B|=4. In word learning, each action is a label and so the number of actions (labels) is equivalent to the number of world states and number of possible beliefs, |A|=|W|=|B|=4. We assume that the states of the world are distributed with uniform probability. No word is a priori more likely than any other:

$$ P(W) = \frac{1}{|W|}. $$
(8)

These assumptions result in the following relationship between the world and informants’ knowledgeability and beliefs: knowledgeable informants’ beliefs match the true label, w, while naive informants guess at random, uniformly from among the possible labels. The probability that an informant’s belief aligns with the true state of the world is

$$ P(b=w|k) = \left\{\begin{array}{ll} 1, & \text{if }k=\text{knowledgeable}\\ 1/\left|W\right|, & \text{otherwise} \end{array}\right.. $$
(9)

As for which labels informants utter, helpful informants shall always utter the label they believe to be correct and unhelpful informants shall always utter a label they believe not to be correct,

$$ P(a|h,b) = \left\{\begin{array}{ll} 1, & \text{if }a=b\text{ and }h=\text{helpful}\\ \frac{1}{\left|W\right|-1}, & \text{if }a \not= b\text{ and }h=\text{unhelpful}\\ 0, & \text{otherwise} \end{array}\right.. $$
(10)

Again, we focus on actions and ignore effects in word learning demonstrations.

Though there are four attribute combinations based on helpfulness and knowledgeability, this formalization captures three distinct types of informant behavior. Knowledgeable and helpful informants always label correctly because they know the correct label and want the learner to know. Knowledgeable but unhelpful informants always label incorrectly because they know the correct label and do not want the learner to know. Unknowledgeable informants, regardless of whether they are helpful, may or may not label correctly because unknowledgeable informants must guess labels for objects. Unknowledgeable but helpful informants produce correct labels when they guess the correct label. Unknowledgeable and unhelpful informants produce the correct label when they guess the incorrect label and choose to produce the correct label as a foil. Thus it is difficult to determine whether an unknowledgeable informant is helpful.
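The three behavioral types can be checked by direct enumeration. The sketch below, a minimal illustration with hypothetical function names, combines Eq. 9 and Eq. 10 to compute the probability that an informant utters the true label for each combination of knowledgeability and helpfulness, using exact fractions.

```python
from fractions import Fraction

def p_belief(b, w, knowledgeable, n_labels):
    """Eq. 9: knowledgeable informants believe the truth; others guess uniformly."""
    if knowledgeable:
        return 1 if b == w else 0
    return Fraction(1, n_labels)

def p_action(a, b, helpful, n_labels):
    """Eq. 10: helpful informants utter their belief; unhelpful ones utter any other label."""
    if helpful:
        return 1 if a == b else 0
    return Fraction(1, n_labels - 1) if a != b else 0

def p_correct(knowledgeable, helpful, n_labels=4, w=0):
    """Probability that the uttered label equals the true label w, summing over beliefs."""
    return sum(
        p_belief(b, w, knowledgeable, n_labels) * p_action(w, b, helpful, n_labels)
        for b in range(n_labels)
    )

for k in (True, False):
    for h in (True, False):
        print(f"knowledgeable={k}, helpful={h}: P(correct) = {p_correct(k, h)}")
```

With |W| = 4, both unknowledgeable types label correctly with the same probability, which is why accuracy alone says little about whether an unknowledgeable informant is helpful.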

In familiarization trials, the model must leverage what it knows about the world to learn about informants. In accuracy trials, informants label common objects, thus the true state of the world is known. The model can then estimate the probability with which the informant is helpful and knowledgeable.[Footnote 3] This means learning the joint probability distribution for 𝜃_k and 𝜃_h given a and w, p(𝜃_k,𝜃_h|a,w).
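One simple way to approximate this joint distribution is on a grid. The sketch below is an assumption-laden illustration rather than the procedure used in the paper: it reduces each accuracy trial to a correct or incorrect label, treats h and k as redrawn from the informant's 𝜃s on every demonstration, and uses uniform beta(1, 1) priors as placeholder hyperparameters.

```python
def p_correct(theta_k, theta_h, n_labels=4):
    """P(label is correct | thetas): knowledgeable and helpful, or an unknowledgeable lucky guess."""
    return theta_k * theta_h + (1.0 - theta_k) / n_labels

def grid_posterior(n_correct, n_incorrect, prior_k=(1.0, 1.0), prior_h=(1.0, 1.0), grid_size=50):
    """Normalized posterior over (theta_k, theta_h) evaluated at grid-cell midpoints."""
    points = [(i + 0.5) / grid_size for i in range(grid_size)]
    weights = {}
    for theta_k in points:
        for theta_h in points:
            # Unnormalized beta priors (hyperparameters here are placeholders).
            prior = (theta_k ** (prior_k[0] - 1) * (1 - theta_k) ** (prior_k[1] - 1)
                     * theta_h ** (prior_h[0] - 1) * (1 - theta_h) ** (prior_h[1] - 1))
            p = p_correct(theta_k, theta_h)
            weights[(theta_k, theta_h)] = prior * p ** n_correct * (1 - p) ** n_incorrect
    total = sum(weights.values())
    return {point: weight / total for point, weight in weights.items()}

def posterior_means(posterior):
    """Posterior means of theta_k and theta_h."""
    mean_k = sum(tk * prob for (tk, _), prob in posterior.items())
    mean_h = sum(th * prob for (_, th), prob in posterior.items())
    return mean_k, mean_h

accurate = posterior_means(grid_posterior(n_correct=3, n_incorrect=0))
inaccurate = posterior_means(grid_posterior(n_correct=0, n_incorrect=3))
```

Under these assumptions, three correct labels shift the posterior toward higher 𝜃_h than three incorrect labels do.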

During test (ask and endorse) trials, the model must use what it has learned about the informant to learn about the world. Ask and endorse questions may seem superficially similar, but there are in fact important differences. Framed in a probabilistic context, the endorse problem is to determine the probability of each informant’s label being correct given what is known about informants in general (prior parameters) and past experience, ξ, with informant i:

$$\begin{array}{@{}rcl@{}} P(endorse_{i}) &\propto& \sum\limits_{w}P(w=a|a,\alpha,\beta,\xi) \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} &=& \!\sum\limits_{w,h,k}\iint_{\theta} P(w=a|a,h,k)P(h,k|\theta)P(\theta|\alpha,\beta, \xi) d\theta. \end{array} $$
(12)

where, for notational simplicity, we collapse similar variables and parameters such that 𝜃={𝜃_k,𝜃_h}, α={α_k,α_h}, and β={β_k,β_h}. The probability of endorsing informant 1 over informant 2 is

$$ P(endorse_{1,2}) = \frac{P(endorse_{1})}{P(endorse_{1})+P(endorse_{2})}. $$
(13)
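A minimal sketch of the endorse computation follows. It assumes that the posterior over 𝜃 for each informant is available as a list of weighted (𝜃_k, 𝜃_h) points, and that, with a uniform prior over labels, the probability that an uttered label is true equals the probability that the informant labels correctly. All names and the example posteriors are hypothetical.

```python
def p_label_correct(theta_k, theta_h, n_labels=4):
    """Marginalize h and k given the thetas (the inner sum of Eq. 12)."""
    return theta_k * theta_h + (1.0 - theta_k) / n_labels

def endorse_score(weighted_thetas, n_labels=4):
    """Average P(label correct) over a weighted posterior (the integral of Eq. 12)."""
    total_weight = sum(weight for _, _, weight in weighted_thetas)
    return sum(
        weight * p_label_correct(theta_k, theta_h, n_labels)
        for theta_k, theta_h, weight in weighted_thetas
    ) / total_weight

def p_endorse_first(score_1, score_2):
    """Eq. 13: probability of endorsing informant 1 over informant 2."""
    return score_1 / (score_1 + score_2)

# Hypothetical point-mass posteriors for a previously accurate and inaccurate informant.
accurate_informant = [(0.9, 0.9, 1.0)]
inaccurate_informant = [(0.9, 0.1, 1.0)]
choice = p_endorse_first(endorse_score(accurate_informant),
                         endorse_score(inaccurate_informant))
```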

It is less obvious how to formalize the ask question. The question again is “who would you like to choose for information.” Because one may ask an informant for a variety of reasons (e.g., because they are consistently wrong, or because one wants to assess their knowledge), formalizing this question is challenging. Due to its ambiguity, we avoid modeling the ask question where we can, and where we cannot, we adopt the simple assumption that children choose to ask informants who are more likely to label correctly. That is,

$$ P(ask) \propto \sum\limits_{w,a}P(w=a|a,\alpha,\beta,\xi). $$
(14)

Inferring the label of an object given an informant’s utterance is reminiscent of the referential communication and language pragmatics literature (Grice et al. 1975; Frank and Goodman 2012). However, language pragmatics rely on the assumption that speakers are cooperative; the epistemic trust model does not require such a constraint to learn.

Previous developmental findings

Previous work employed this model to investigate possible explanations for developmental changes in epistemic trust. The full model and a model based on reasoning about knowledge alone were compared by searching for the parameters that best fit children’s behavior in three experiments (Shafto et al. 2011; Eaves and Shafto 2012). The results indicated that the knowledge-only model fit three-year-olds’ behavior better while the full model better fit four-year-olds’. These results are consistent with a developmental change in children’s ability to reason about helpfulness.

The import of the previous modeling is limited by the fact that the model was only applied to three experiments from the literature. To broaden the scope, it is necessary to account for a wider set of heterogeneous studies. As a computational theory, the basic claim of the model is that it can explain epistemic trust behavior. That is, the model should be able to parameterize each result (locate the result in model space). Thus, it is reasonable to expect the model to predict results across the domain, regardless of the method by which they were obtained. However, it is unreasonable to expect that all methodological details are irrelevant to how the model will fit. Certain tasks may focus more on knowledge while others may focus more on helpfulness. Similarly, different methods of communicating—speaking, pointing, and marking—may elicit different degrees of trust based on past experience. While the model should explain behavior across these variations, how it explains it vis-a-vis the parameters can be expected to vary to some degree.

Of course, parameterizing the model individually in terms of each condition may not be ideal. Many free parameters raise concerns about reducing generality through over-fitting. Systematic similarities and differences among the features of experiments, such as age or communication mode, may be used as a bottom-up source of constraint on the variation in parameters. Moreover, the degree of association between the experimental features and model parameters may provide a means by which we may quantify differences in methodology or across development. We present a method for automatically identifying such similarities and differences in the next section.

Aggregation of parameterized results via cross-categorization

How might we draw inferences about commonalities and differences among a collection of parameterized results? There are a number of possibilities, but one especially flexible and therefore attractive approach is cross-categorization. Cross-categorization (CrossCat) is a Bayesian non-parametric method for estimating the full joint probability density over tabular data (Mansinghka et al. Accepted pending revision; Shafto et al. 2014). It simultaneously estimates dependence among variables and, among dependent variables, estimates dependence among rows. For current purposes, cross-categorization represents a method by which we can determine the probability of dependencies between the individual model parameters—such as helpfulness—and features of conditions—such as age—given a table composed of a parametrization of the results together with features of the condition. CrossCat is a more flexible tool than standard statistical approaches, such as various forms of regression, which force the user to identify which variables drive changes in others. Our goal is to learn which variables drive what kind of changes in which model parameters under what circumstances. CrossCat provides a platform to do so while seamlessly handling missing and heterogeneous data.

CrossCat is a generalization of an infinite mixture model (IMM; for more information on IMMs, see Teh et al. 2006; Rasmussen 2000; MacEachern and Müller 1998; Neal 2000; Anderson 1991) in which features’ assignments to views and objects’ assignments to categories within views are each inferred. Thus, CrossCat behaves as a hierarchical mixture model, where instead of assuming that there is only a single explanation for the variability over the rows, there are potentially many ways of organizing, and thus explaining, the rows.

CrossCat explains a data table in terms of two main structural components: a partitioning of features (columns) into views and, for each view, a partitioning of objects (rows) into categories. The view assignment, Z, assigns the F features (columns) to |V| views. The assignment of categories, V, contains |V| partitions of the objects (rows), V_0, V_1, ..., V_{|V|−1}, such that each partition V_v assigns the N rows to categories for the collection of features in view v. Each view models the variation in the features of that view as a mixture (those looking for a detailed treatment of cross-categorization are referred to Mansinghka et al., Accepted pending revision).
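To make the two-level structure concrete, the following sketch draws one cross-categorization state: a partition of columns into views and, for each view, a partition of rows into categories. It assumes Chinese restaurant process priors on both partitions with a shared concentration parameter, which the text above does not specify, and it is not the interface of any CrossCat implementation; all names are hypothetical.

```python
import random

def crp_partition(n_items, alpha, rng):
    """Sequentially assign items to clusters under a Chinese restaurant process."""
    assignments, counts = [], []
    for _ in range(n_items):
        # Existing clusters are weighted by size; a new cluster is weighted by alpha.
        weights = counts + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(counts):
            counts.append(0)
        counts[choice] += 1
        assignments.append(choice)
    return assignments

def sample_crosscat_state(n_rows, n_cols, alpha=1.0, seed=0):
    """Return (Z, V): column-to-view assignments and one row partition per view."""
    rng = random.Random(seed)
    column_views = crp_partition(n_cols, alpha, rng)
    n_views = max(column_views) + 1
    row_categories = [crp_partition(n_rows, alpha, rng) for _ in range(n_views)]
    return column_views, row_categories

# Example: 10 conditions (rows) by 6 features (columns, e.g. age, paradigm, model parameters).
Z, V = sample_crosscat_state(n_rows=10, n_cols=6)
```

Two columns that share a view index in Z are treated as dependent in that state; columns in different views are treated as independent.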

Each cross-categorization state (or sample) represents core elements of probability. The partition of features into views instantiates an inference about whether each possible pair of variables is dependent or independent. Modeling views as mixtures allows the model to identify relationships that are much more general than simple linearity. The model, therefore, allows one to generically ask key questions of interest without strong assumptions such as linearity or Gaussianity that can lead to interpretation problems. In the case of epistemic trust, for example, are age and helpfulness dependent? What is the form of that dependence? Which experiments can be explained by a common set of parameters and which require different parameters?

The data in each feature are modeled by a data-appropriate statistical model. Conjugate models are typically chosen for efficiency. For example, continuous data are modeled using a Normal distribution with a Normal-Gamma prior (Murphy 2007; Fink 1994), while categorical data are modeled using a Multinomial distribution with a symmetric Dirichlet prior. Many other data types can be instantiated in this framework by implementing conjugate, semi-conjugate, and non-conjugate models, as appropriate. The hyperparameters for priors are inferred to facilitate efficient inference. This produces an unusually flexible model suited to a wide variety of different types of data.

Consider a data table where each row represents a condition of an experiment and each column represents a feature of interest (experiment features or model parameters). The views would represent whether, for example, age were dependent on the helpfulness parameters by placing those features in the same view or in different views. Similarly, given a collection of samples, we could query conditional distributions to answer questions about the relationship between features. For example, we could check our previous results by asking about the relationship between age and biases toward believing informants are helpful. In this way, we use Bayesian inference to free the model from the specifics of individual studies and allow for the formulation of a general model that considers many possible hypotheses.

Method

The method consists of three steps that link a domain model (the epistemic trust model) with an analysis model (CrossCat). The domain model is used to approximate the parameter distribution for each study, and the analysis model is used to identify trends in the parameters induced by different studies. The general method is as follows: First, select studies that can be straightforwardly modeled with the epistemic trust model. Second, for each condition of each study, search for sets of model parameters that cause the epistemic trust model to fit the experimental data well. Third, construct and analyze a CrossCat table in which each row comprises the model parameters and experimental features of one modeled condition.

We begin by explaining the process by which studies were selected and how we determined which studies were suitable for modeling. We then describe the procedure used to search for well-fitting parameter sets. Last, we exhaustively discuss the procedure by which each study was modeled and how the model accounted for the experimental results.

Study inclusion criteria

We include for analysis studies that the epistemic trust model can capture with no extension, or simple extension by way of existing, off-the-shelf models. In previous research (Shafto et al. 2011), we focused on modeling three selective trust strategies (relative accuracy (Pasquini et al. 2007), familiarity (Corriveau and Harris 2009), and consensus (Corriveau et al. 2009)) each of which employed the ask-endorse, forced-choice paradigm and in which informants communicated either by way of verbal testimony (Pasquini et al. 2007; Corriveau and Harris 2009) or pointing (Corriveau et al. 2009). Different communication modes do not require extensions to the model to capture; it will be an empirical question as to how they differ in terms of the model parameters. Different paradigms necessitate minimal modifications, e.g., we model looking time as proportionate to the inverse of the probability of the event looked at. Thus, the inclusion criteria mainly focus on the informant- and information-selection strategies investigated.

There are a variety of strategies that would require involved modifications to the model and were thus omitted (see Table 1). For example, consider studies that investigate effects of domain expertise (Koenig and Jaswal 2011). Capturing these phenomena would be quite natural within our general framework; however, expertise would require the model to be extended to capture how children believe knowledge is distributed among people. This would require multiple assumptions, and therefore expertise studies are excluded from analyses. Another group of studies uses verbal testimony by an experimenter or an additional informant to provide information about the informants. For example, some studies employ methods in which experimenters explicitly tell participants that an informant is “very mean” (Landrum et al. 2013), “a big liar” (Mascaro and Sperber 2009), or “a dog expert” (Koenig and Jaswal 2011). Others have used verbal testimony about one’s own beliefs, e.g., “I don’t know” (Sabbagh et al. 2003; Sobel and Corriveau 2010; Buchsbaum et al. 2012). Capturing the semantics of these verbal statements would require additional parameters, and these studies are therefore omitted.

Table 1 A list of studies excluded from analyses and the reason for exclusion

We also exclude studies that investigate informant-selection strategies driven by informants’ superficial qualities. For example, studies have investigated whether informants’ attractiveness (Bascandziev and Harris 2014), gender (Taylor 2013), or accent (Kinzler et al. 2011) affect epistemic trust. While it is possible that learners attribute different knowledgeability or helpfulness to informants with certain superficial features, these features are not direct demonstrations of informants’ data generation capabilities. To model how learners would learn, say, that someone dressed in a t-shirt is less trustworthy than someone dressed in a suit (McDonald and Ma 2015) would require making assumptions about, and simulating, the types of life experiences that lead learners to acquire such biases; or worse, would require building the result of the experiment into the model.

Eight new studies met our inclusion criteria. There were three additional studies that investigated informant accuracy (Koenig and Echols 2003; Fitneva and Dunfield 2010; Koenig and Harris 2005). These studies differ in the amount of experience they provide learners—ranging from one to twelve instances of accuracy or inaccuracy—and the method employed (looking-time and forced choice).

Two studies interleaved feedback between learners’ guesses (Couillard and Woodward 1999; Tummeltshammer et al. 2014). These included a study in which the informant was repeatedly, implausibly incorrect (i.e. deceptive) (Couillard and Woodward 1999) and a study in which trust was measured in looking time (Tummeltshammer et al. 2014), as opposed to the standard forced-choice, ask-endorse approach. To model these, we updated the model’s beliefs about the knowledgeability and helpfulness of the informant by conditioning on the feedback, w, between trials.

In addition to the consensus experiment (Corriveau et al. 2009) modeled in our previous work, we model two additional studies investigating consensus (Chen et al. 2012; DiYanni and Kelemen 2008). These are modeled as in Shafto et al. (2011), by simply considering the probability of agreement and disagreement among more than one informant.

Finally, we include a study that investigates the effect of error magnitude on epistemic trust (Einav and Robinson 2010). This study investigated the degree of the error, and thus required extending the model with a notion of semantic similarity. We employ an existing psychological model of semantic relatedness (Griffiths et al. 2007; Collins and Loftus 1975). This extension allows the epistemic trust model to assess both errors and their degree. Concepts that are closer in a semantic network are more similar and errors between similar concepts are more reasonable.

These studies are heterogeneous in terms of their features: the ages of the children, the communication mode, and the experimental paradigm. Ages span 8 months to adult. Communication modes include verbal testimony, pointing, gaze, and use of markers. Paradigms include forced-choice (ask and/or endorse), and looking time. Table 2 lists the set of studies included—a total of 11 studies comprising 24 conditions.

Table 2 List of study conditions included in analyses divided into conditions

Data preparation and model fitting

We divided the 11 studies into analysis units, which we refer to as conditions. For example, an experiment which separately reported results for three- and four-year-olds consists of two conditions. This resulted in 24 total conditions. We fit the model parameters by searching for the parameters that best reproduced the data.

Our choice of search method is dictated by the complexity of the inference problem and the heterogeneity of the studies we model. Often the distribution of an informant’s helpfulness and knowledgeability cannot be calculated analytically and must be approximated. Exact calculation of probabilities requires enumerating over each unknown variable. In the case of Einav and Robinson (2010), enumerating over possible beliefs and the binary values of helpfulness and knowledgeability for four labeling trials leaves more than \(10^{17}\) terms to evaluate. We approximate probabilities using Monte Carlo simulation (see Appendix 1). In simpler situations, one may employ direct fit methods that search for local error minima by traversing the path of steepest descent. These methods require calculating the gradient of the probability space with respect to the parameters. For the same reason we cannot calculate the probabilities exactly, we cannot calculate their gradients exactly. Grid search is an alternative technique in which a finite grid of search points is placed over the parameter space and the target function is evaluated at each point. We employ a randomized version of grid search, random search (Bergstra and Bengio 2012), in which random points in the parameter space are evaluated. In practice, grid search and random search perform similarly with respect to error, but random search offers additional flexibility in that it more easily allows us to exploit knowledge of which areas of the parameter space require more thorough search.

The random search procedure we applied involved generating a large number of parameter sets, running the model for each experiment for each parameter set, and calculating the errors between the model prediction and the empirical data. We generated 4000 parameter sets from independent exponential distributions with mean 5. That is, each parameter in the parameter set \(\{\alpha_k, \beta_k, \alpha_h, \beta_h\}\) was drawn from \(\text{Exp}(\frac{1}{5})\). We chose this specific parameter-generating distribution because it assigns higher probability to lower-valued parameters but also represents higher values. Higher parameter values are more robust; small changes in high-valued parameters affect the model results less than small changes in low-valued parameters. Note that we focus only on the full, four-parameter model because previous research demonstrates that a knowledge-only model, which does not account for variable helpfulness, fails to account for development (Shafto et al. 2011).
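The sampling and ranking steps of such a random search can be sketched as follows. This is a minimal illustration and not the code used for the analyses: the function names are ours, and `toy_error` is a hypothetical stand-in for running the epistemic trust model and scoring its predictions against a condition's data.

```python
import random

# The four parameters named in the text; the identifiers are illustrative.
PARAM_NAMES = ("alpha_k", "beta_k", "alpha_h", "beta_h")


def sample_parameter_sets(n_sets=4000, mean=5.0, seed=0):
    """Draw n_sets parameter sets, each parameter i.i.d. exponential with the given mean."""
    rng = random.Random(seed)
    rate = 1.0 / mean  # expovariate takes the rate, i.e. 1 / mean
    return [
        {name: rng.expovariate(rate) for name in PARAM_NAMES}
        for _ in range(n_sets)
    ]


def rank_parameter_sets(parameter_sets, error_fn):
    """Return the parameter sets sorted from lowest to highest error."""
    return sorted(parameter_sets, key=error_fn)


def toy_error(params):
    """Hypothetical error function; a real one would run the model on a condition."""
    return abs(params["alpha_k"] - 1.0)


candidates = sample_parameter_sets()
best_five = rank_parameter_sets(candidates, toy_error)[:5]
```

Because every candidate is evaluated independently, the same 4000 draws can be reused across all conditions, with only the error function changing from one condition to the next.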

We searched for parameters that minimized the summed relative error of each experiment rather than the parameters that maximize probability because the studies report different measures (e.g. proportions of participants and looking times). The relative error of two values, \(a\) and \(b \neq 0\), is the absolute value of one minus their ratio, \(|1 - a/b|\). If \(a/b\) is 1 then \(a = b\).Footnote 4 We use relative error rather than squared or absolute error because experiments’ dependent measures are not always identically scaled. One experiment may report the proportion of children who asked a particular informant for information while another may report the number of seconds an infant looked at an informant. We use relative error so that error is calculated similarly regardless of the result metric employed by the study. We use the sum of errors so that the error of each data point (each bar in a bar chart) carries equal weight; as a consequence, an experiment with more bars contributes more to the quantity being minimized.
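The error measure can be written out in a few lines. The sketch below assumes, for illustration only, that \(a\) is the model prediction and \(b\) the observed value; the text does not fix which is which, and the function names are ours.

```python
def relative_error(a, b):
    """Relative error |1 - a/b| of a with respect to a nonzero reference b."""
    if b == 0:
        raise ValueError("relative error is undefined for b == 0")
    return abs(1.0 - a / b)


def summed_relative_error(predictions, observations):
    """Sum the relative error over paired data points (one pair per bar)."""
    if len(predictions) != len(observations):
        raise ValueError("predictions and observations must have equal length")
    return sum(relative_error(a, b) for a, b in zip(predictions, observations))


# A proportion (0.6 vs 0.8) and a looking time in seconds (12 vs 10)
# are scored on the same scale despite their different units.
example_total = summed_relative_error([0.6, 12.0], [0.8, 10.0])
```

In this example the proportion contributes an error of 0.25 and the looking time an error of 0.2, so neither measure dominates simply because of its units.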

To construct the cross-categorization table, we took the five best-fitting parameter sets for each condition (Footnote 5) and arranged them in a table. Each row represented a single parameter set for a condition and was augmented with demographic features of the experiment. These features included the mean age of participants, communication mode, culture, and experimental paradigm. Thus, each column was a parameter or a demographic or experimental feature of interest (see Table 3).

Table 3 Structure of the prepared table used during cross-categorization

For ease of interpretation, we converted the model’s \(\alpha\) and \(\beta\) parameters on knowledgeability and helpfulness to strength and balance (Kemp et al. 2007). The strength (\(s\)) and balance (\(b\)) parameterization of the beta distribution is \(s = \alpha + \beta\) and \(b = \frac{\alpha}{\alpha + \beta}\). Balance corresponds exactly to the mean of the beta distribution and takes on values in the open interval \((0, 1)\). For example, a balance parameter on knowledgeability, \(b_k\), closer to 1 means that the learner believes that informants are, in general, knowledgeable while a \(b_k\) closer to 0 implies that the learner believes that informants are, in general, unknowledgeable. Strength roughly corresponds to the invariance in beliefs and lies in the interval \((0, \infty)\). For example, a very high value of \(s_k\), the strength parameter on knowledgeability, implies a very strong belief that all people are the same: either all knowledgeable or all unknowledgeable, as determined by \(b_k\).
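The conversion and its inverse follow directly from the two definitions. A minimal sketch (function names are ours):

```python
def to_strength_balance(alpha, beta):
    """Convert beta-distribution parameters to (strength, balance)."""
    strength = alpha + beta
    balance = alpha / strength  # equals the mean of Beta(alpha, beta)
    return strength, balance


def to_alpha_beta(strength, balance):
    """Invert the conversion: recover (alpha, beta) from (strength, balance)."""
    return strength * balance, strength * (1.0 - balance)


# Beta(2, 6): strength 8, balance 0.25 (a belief leaning toward "not knowledgeable").
s_k, b_k = to_strength_balance(2.0, 6.0)
```

Two priors with the same balance but different strengths have the same mean; the stronger one simply expresses that mean with more certainty.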

Modeling individual studies

In this section we explain the procedure by which each study used for analyses was modeled and how the model captures each empirical result. This section is intended not only for those who wish to reproduce our procedure but also for those who seek an intuitive understanding of how the model works.

For each study we display results given the best-fitting parameters, and when possible, we display standard error bars given those parameters. As we have discussed, the heterogeneous nature of the literature forces individual fitting. The approach we take is distinct from the standard modeling approach, in which a model’s validity is measured by its fit and the validity of the fit is measured in terms of whether its parameter values make intuitive sense and whether it cross-validates. In the approach we take, these concerns have no influence on the analysis. The model’s ability to fit the results of individual studies is not our primary interest but a necessary precondition for aggregating results: to include a study in analyses, the model must be able to account for its results. Our goal is to look at trends in regions of fit in which the model captures experimental results, regardless of where they are in parameter space, and to determine if these trends have implications for development.

Accuracy

Koenig and Harris’ (2005) study on children’s preference to ask for and endorse information from accurate sources is a seminal work in the trust-in-testimony literature. For three trials children observed two informants label common objects, e.g., a ball and a cup. One informant labeled each object correctly and the other labeled each object incorrectly. After these accuracy or familiarization trials, a novel object was placed before the informants. The child was either invited to choose the informant whom she would like to ask for the label (ask trial) or after having observed each informant provide his own label, the child was invited to label the object herself (endorse trials).

This study maps easily to inference in the epistemic trust model. We have only to account for data that does or does not match the state of the world. Participants observed novel informants, thus there is no need to account for a prior bias that one informant should be more likely than the other to label correctly. Additionally, each informant’s incorrect answers are equally incorrect (labeling a ball as a shoe is just as foolish as labeling a cup as a dog); therefore there is no need to account for the relative magnitude of errors, which we account for in a later section. Endorse questions are modeled as described in the section on modeling word learning.

During accuracy trials, children learn about their informants. The model is concerned with learning the probability distribution defining each informant’s tendency toward or away from helpfulness and knowledgeability given the state of the world (the object) and the label uttered by the informant. This means collecting information about k and h given w and a.

We see the model results alongside the experimental results (Koenig and Harris 2005, Experiment 1) in Fig. 2. For both age groups, the model prefers to endorse the label provided by the accurate speaker. The model infers that an informant who always labels accurately is likely knowledgeable and helpful and that an informant who always labels inaccurately is not. In fact, an informant who repeatedly labels incorrectly is assumed to be knowledgeable and unhelpful, that is, deceptive. An unknowledgeable and helpful informant will sometimes produce the correct label by guessing, because an informant chooses a label from a fixed set of labels of which only one (or a few) is correct.

Fig. 2
figure 2

Model simulation results for Koenig and Harris (2005). The y-axis represents the proportion of children who endorsed the answer given by the accurate informant, or for the model, the probability of endorsing the accurate informant

This preference for more accurate informants has been documented after even a single encounter (Fitneva and Dunfield 2010). In Fitneva and Dunfield (2010) children were shown an image and told a corresponding story. A sticky note occluded part of each image. The child asked two informants (children on a computer screen) what was under the card. The two informants answered differently. The sticky note was removed, revealing that one informant had been correct and the other had been incorrect. The procedure was then repeated but the child was allowed to ask only one informant. For this study we modeled ask questions. The results, averaged over three experiments, can be seen in Fig. 3.Footnote 6

Fig. 3
figure 3

Model simulation results for Fitneva and Dunfield (2010). The y-axis represents the proportion of children who asked the previously accurate informant, or for the model, the probability of asking the accurate informant

We see that the model captures people’s preference for the accurate informant as well as an increasing preference with age. A theme in the literature is that the speed with which people update their beliefs about informants given data increases with age.

Relative accuracy

Informants are not deterministic. They are not always correct or always incorrect; they provide information with some amount of noise. Pasquini et al. (2007) extended the paradigm of Koenig and Harris (2005) to account for variable levels of relative accuracy between informants. Children were introduced to two informants who labeled four common objects with variable accuracy. Informants labeled 100 %, 75 %, 25 %, or 0 % of the objects accurately, corresponding to four, three, one, and zero of four objects correctly labeled, respectively. There were four conditions: 100 % vs 0 % accurate, 100 % vs 25 % accurate, 75 % vs 0 % accurate, and 75 % vs 25 % accurate. For example, in the 100 % vs 25 % accurate condition, the child observed one informant label each object correctly and the other label only one of the four objects correctly. After accuracy trials, a novel object was placed before the child, who then participated in ask and endorse trials.

The model shows a preference for the more accurate informant (Fig. 4). We see a tiered effect in both three-year-olds’ behavior and model prediction. In previous research, we found that three-year-olds’ behavior is best represented by a model with a strong bias toward believing all informants are helpful (Shafto et al. 2011). This means that the model predicts three-year-olds’ inferences about informants primarily based on knowledgeability. Informants are either knowledgeable or not. An informant who always labels correctly is knowledgeable; all other informants are not. This makes it difficult for the model to produce a gradation between the different accuracy levels.

Fig. 4
figure 4

Model simulation results for Pasquini et al. (2007). a Three-year-olds. b Four-year-olds. The y-axis represents the proportion of children who endorsed the answer given by the accurate informant, or for the model, the probability of endorsing the accurate informant. Error bars represent standard error

The model predictions show a rather different trend for four-year-olds. The results closely follow the data, plateauing where there is a 75 % difference in relative accuracy between informants.

Familiarity

Corriveau and Harris (2009) investigated the interaction between familiarity and accuracy. For their study, Corriveau and Harris (2009) chose children’s preschool teachers to play the role of familiar informants. Familiarity is formalized as prior experience. In this case specifically, because the familiar informants were teachers (not tricky uncles), we modeled familiarity as experience demonstrating helpfulness and knowledgeability. This manifests mathematically as an altered prior. This manipulation is straightforward to implement as a beta distribution posterior update. As a demonstration, assume that we have witnessed an informant be helpful twenty times and unhelpful once. Given a base prior of \(\text{Beta}(\alpha_h, \beta_h)\), the posterior distribution is simply \(\text{Beta}(\alpha_h + 20, \beta_h + 1)\). We used this procedure for both knowledgeability and helpfulness. The result is a strong bias that requires more data to override than the presumably weaker bias for an unfamiliar informant.
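The familiarity prior amounts to a standard conjugate beta update, which can be sketched as below. The base prior of Beta(1, 1) is an arbitrary choice made only for this illustration, and the function names are ours.

```python
def beta_update(alpha, beta, n_positive, n_negative):
    """Posterior beta parameters after observing counts of each binary outcome."""
    return alpha + n_positive, beta + n_negative


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Illustrative base prior on helpfulness (assumed, not taken from the fitted model).
alpha_h, beta_h = 1.0, 1.0

# Twenty helpful and one unhelpful observation, as in the demonstration in the text.
familiar_alpha_h, familiar_beta_h = beta_update(alpha_h, beta_h, 20, 1)

prior_mean = beta_mean(alpha_h, beta_h)                        # 0.5
familiar_mean = beta_mean(familiar_alpha_h, familiar_beta_h)   # 21 / 23
```

Under this assumed base prior, the familiar informant starts with a much higher expected helpfulness and a larger total count, which is why a few inaccurate labels move the belief less than they would for a novel informant.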

Before any familiarization or accuracy trials, children were given ask and endorse questions to gauge their natural preference for the familiar informant (pretest). Children were then given four familiar object labeling trials in which the familiar informant labeled each object accurately and the novel informant labeled each object inaccurately (familiar 100 %) or in which the converse occurred (novel 100 %). If children hold a more biased belief that their teacher is helpful and knowledgeable, they should prefer to ask and endorse their teacher at pretest. Observing the teacher label common objects correctly should reinforce this bias and observing her labeling them incorrectly should work to relax or reverse the bias.

We see in Fig. 5 that the model captures trends across several ages but fails to capture the sharp reversal made by five-year-olds when the familiar informant labels inaccurately in the novel 100 % condition. A possible reason for this is that, to minimize complexity, we have applied the same familiar prior for each age group. It is reasonable to assume that children of different ages have different experiences with their teachers or handle familiarity in a more flexible way. Whether this holds true is a question for future research.

Fig. 5
figure 5

Model simulation results for Corriveau and Harris (2009). a Three-year-olds. b Four-year-olds. c Five-year-olds. The y-axis represents the proportion of children who endorsed the answer given by the familiar informant, or for the model, the probability of endorsing the familiar informant. Error bars represent standard error

Consensus

Corriveau et al. (2009) looked at children’s preferences for members of a group over rogue dissenters. For four trials, three novel objects were laid out before a group of four informants. On each trial an experimenter asked “Which is the [novel object label]?”, after which each informant pointed simultaneously to an object. Three informants pointed to the same object and the other pointed to a different object. On each trial the same informants agreed and the same informant dissented. It is important to emphasize that informants testified through pointing rather than vocalization. We did not model points differently than verbal communication. After these group (pretest) trials, children observed two of the informants, one from the agreeing group and the dissenter, label additional novel objects (test trials). Children again chose the object that they believed corresponded to the label.

We see the model results in Fig. 6a, b. Because the objects were novel, children could not leverage their knowledge of the world to learn about informants. However, the fact that children learned from a group of informants labeling the same objects provides extra power not only for learning about novel objects but learning about informants as well. In the case of a group consensus we can exploit informant dynamics. In general, it is unlikely for multiple independent informants to repeatedly converge on the same object unless they are both helpful and knowledgeable. This leads logically to the conclusion that our dissenter is either unknowledgeable, unhelpful, or both; and that the agreeing informants are pointing at the correct object.

Fig. 6
figure 6

Model simulation results for Corriveau et al. (2009) and Chen et al. (2012). a (Corriveau et al. 2009), three-year-olds. b (Corriveau et al. 2009), four-year-olds. c (Chen et al. 2012) Younger and older groups

As a simple illustration of why this is so, let us categorize informants into two groups: reliable and unreliable. Further assume that reliable informants always point to the correct object and that unreliable informants point uniformly at random. We assume that informants are reliable and unreliable with equal probability. Given three objects to choose from, the probability that three reliable informants converge on the same object is 1, while the probability that three unreliable informants converge on the same object is \(\binom{3}{1}\left(\frac{1}{3}\right)^{3} = \frac{1}{9}\). The probability that unreliable informants converge on the same answer for four trials is then \(\left(\frac{1}{9}\right)^{4} = \frac{1}{6561}\).
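These two probabilities can be checked by brute-force enumeration over every way three informants could point at three objects. The sketch below covers only this simplified reliable/unreliable illustration, not the full model; the function name is ours.

```python
from fractions import Fraction
from itertools import product


def prob_all_agree(n_informants=3, n_objects=3):
    """Probability that informants pointing uniformly at random all pick the same object."""
    outcomes = list(product(range(n_objects), repeat=n_informants))
    agreeing = sum(1 for outcome in outcomes if len(set(outcome)) == 1)
    return Fraction(agreeing, len(outcomes))


one_trial = prob_all_agree()   # 3 agreeing outcomes out of 27
four_trials = one_trial ** 4   # trials are assumed independent
```

Exact fractions are used so that the results can be compared directly with the values in the text, 1/9 for a single trial and 1/6561 for four trials.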

Things are not so black and white in the model, so this effect is softened. In the model, informants are not so neatly categorized as reliable and unreliable. There are different degrees and sources of unreliability that bring about different types of unreliability, e.g. the difference in behavior between unknowledgeable and unhelpful informants. This additional uncertainty is reflected in the results by a less distinct preference to choose with the group at pretest and the informant from the group at test. Additionally, the certainty of these inferences is dependent to an extent on prior beliefs about informants. The higher the prior toward knowledgeability and helpfulness, the higher the probability that agreeing informants are knowledgeable, helpful, and correct. This of course assumes uniform probability over labels. It is possible that there may be some wrong belief with a high prior probability that unknowledgeable informants could converge on (for example, in the time of Christopher Columbus it was common knowledge that the Earth was flat).

We also modeled the results of Chen et al. (2012), which reproduced the pretest (group) trials of Corriveau et al. (2009) with different age groups. The modeling procedure was identical. The results can be seen in Fig. 6c. Again, the model captures a bias toward choosing with the group, which appears to increase with age.

Culture

It is not enough to demonstrate that a model fits data; the model should fail to capture results outside of its scope. Here we demonstrate how our epistemic trust model fails to account for non-epistemic, cultural behavior.

DiYanni and Kelemen (2008) looked at culture effects in children’s deferring to consensus. Children observed three informants choose a tool to crush a cookie. The tool was either functionally affordant (hard plastic) or non-affordant (a mass of plush, fuzzy balls). Each of the three informants had a cookie in front of them. The first informant selected the affordant tool and tapped the cookie twice with it then repeated the procedure with the non-affordant tool. The cookie remained intact. The informant then held the non-affordant tool and said “This is the one I would need”. This process was repeated with the other two informants. Children were then asked which tool would be best for crushing the cookie. A similar condition was conducted but with a single informant. The hypothesis was that children in both culture groups would similarly reject the advice of a single informant claiming that the non-affordant tool was best, but that for cultural, not epistemic, reasons Asian-American children would be less likely to dissent from the group. For modeling purposes we treat this task as equivalent to labeling. The effect of each tool is the same (the cookie remains intact) and can be ignored. Informants explicitly label the non-affordant tool as “the one I would need”, which we interpret as a novel object labeling task in which one of the objects is “the best for crushing cookies”. Children’s bias for the affordant tool plays a major role and so we modeled the bias based on previous research using the same tools in which “[…]89 % of 3-4-year-olds choose to use the Functionally-Affordant tool over the Non-Affordant tool to crush a cookie when both tools are modeled with equal intention” (DiYanni and Kelemen 2008; DiYanni et al. 2015). The prior probability on w was left uniform because both tools are equally novel, but \(P(b \mid k, w)\) was altered such that an unknowledgeable informant should guess the affordant tool was best 89 % of the time.

Both groups of children were equally likely to dismiss the advice of a single informant, but Caucasian-American children more often rejected the advice of the consensus than did the Asian-American children. DiYanni and Kelemen (2008) suggest that this result stems from a cultural stigma with respect to deviancy in the Asian community. The model can only venture to capture these results as modified prior beliefs (see Fig. 7).

Fig. 7
figure 7

Model simulation results for DiYanni and Kelemen (2008). a Caucasian children. b Asian Children. Error bars represent standard error

The model captures Caucasian-American children’s disagreement with both the single informant and the group but fundamentally fails to capture Asian-American children’s agreement with the group. The study noted that Asian-American children’s conformity is likely a symptom of their avoiding appearing deviant (DiYanni and Kelemen 2008), not of an epistemic goal.

It is important that the model fails to capture this result because the result is non-epistemic. This result illustrates that the model has limitations; it cannot explain all patterns of results. It is likely that group membership studies do not capture differential learning but simply the effect of social norms. Other research would suggest that children have no difficulty in appeasing a group of seemingly unreliable informants, but do not allow it to affect their learning. Corriveau and Harris (2010) demonstrated that though children may appear to defer to a group whose consensus violates their own perceptions (in the study, the group agreed a shorter line was longer than a longer line), children rely on their own perceptions when solving a pragmatic task. Though children agreed with the group that a shorter line was the longest, children then used the longest line to construct an adequate bridge to help a bunny cross a gap.

Deceptive pointing and marking

In Couillard and Woodward’s (1999) study on children’s interpretation of deceptive points, a child plays a game of Two Cup Monte with an informant. Behind a screen, the informant hides a sticker under one of two cups. The screen is taken away and the informant points to one of the cups. Children’s job is to choose the cup under which the sticker is hidden. Each time children choose correctly, they get to keep the sticker. This procedure repeats for ten trials. On each trial the experimenter indicates the empty cup. We assume that a point acts as a label and we assume that the informant is knowledgeable because children observe the informant place the sticker (though they do not observe under which cup). The knowledgeability bias is applied to the prior. Children receive feedback after each trial. The experiment is iterative. Each trial consists of an endorse question (choose to endorse or reject the informant’s testimony) and a subsequent familiarization demonstration in which the child is given information regarding the veracity of the informant’s testimony. Because the bias toward knowledgeability has been strongly influenced by the informant’s hiding the sticker, children must reason primarily about helpfulness. The informant knows the location of the sticker but does not want learners to know. Children at three years and three months of age were more often fooled by the informant than children closer to four years of age.

The experiment was repeated with a markers condition in which the informant placed a marker to indicate a cup rather than pointing to it. Younger children were far more likely to choose the correct cup in the markers condition. We make no fundamentally different modeling assumptions to capture this result, but allow it to manifest as an alternate parameter set.

Figure 8 shows the proportion of children who chose the correct cup (the cup not indicated by the informant) averaged across the first four and last four trials. We see that the model captures the rate of learning. At each trial the learner is given extra information about the informant which it uses to learn about the world. The informant is reliably inaccurate. An informant who repeatedly labels incorrectly is likely deceptive. Because a deceptive informant never labels correctly, the model infers that the opposite cup is more likely. Younger children have a stronger belief that informants are helpful. A stronger belief requires more data to overcome, thus we see that younger children more often choose with the informant, though they choose with the informant less as trials progress.
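The claim that a stronger belief requires more data to overcome can be illustrated with a conjugate Beta-Bernoulli update. The sketch below is a deliberately simplified stand-in and not the full epistemic trust model: it assumes that the belief in helpfulness is a Beta distribution whose mean is a balance value and whose concentration is a strength value, and that every deceptive trial counts as one unambiguous observation of unhelpfulness. The function name and the parameter values are hypothetical.

```python
def helpful_belief(balance, strength, unhelpful_trials):
    """Posterior mean of P(helpful) after observing only unhelpful trials.

    Assumed prior: Beta(balance * strength, (1 - balance) * strength).
    Each unhelpful trial adds one pseudo-count against helpfulness.
    """
    alpha = balance * strength
    beta = (1.0 - balance) * strength + unhelpful_trials
    return alpha / (alpha + beta)


# Hypothetical parameter values, chosen only for illustration.
# Both learners start with the same mean belief (0.9) but differ in strength.
weak_prior = [helpful_belief(0.9, 2.0, t) for t in range(11)]
strong_prior = [helpful_belief(0.9, 20.0, t) for t in range(11)]

for trial, (weak, strong) in enumerate(zip(weak_prior, strong_prior)):
    print(f"trial {trial:2d}: weak prior {weak:.2f}, strong prior {strong:.2f}")
```

Under these assumptions both learners lower their belief in helpfulness on every trial, but the learner with the stronger prior does so more slowly, which mirrors the slower learning attributed to younger children.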

Fig. 8

Model simulation results for Couillard and Woodward (1999). a Points. b Markers. The x-axis shows the trial number collapsed into blocks. The y-axis displays the proportion of children who choose the cup opposite the cup indicated by the informant, or for the model, the probability that the marker is in the cup opposite the cup indicated by the informant

Error magnitude

Einav and Robinson (2010) looked at the effect of error magnitude on children’s informant preferences. For example, labeling a lion as a tiger is a smaller magnitude error than labeling a lion as a mouse or a clock. The structure of the study was nearly identical to that of Pasquini et al. (2007). Children observed two informants label common animals for four trials. On each trial after the first, both informants labeled incorrectly but one informant produced higher magnitude errors. For example, given the labels “dog”, “tiger”, “horse”, and “butterfly”, the more accurate informant provided the labels “dog”, “lion”, “cow”, and “bee”, while the less accurate informant either provided the labels “dog”, “mouse”, “fish”, and “cat” (animal-animal condition) or “dog”, “clock”, “fork”, and “car” (animal-object condition).

Some words are more prevalent than others. If one were asked to provide a word starting with the letter ‘A’, one may be more likely to respond ‘Apple’ than ‘Appendectomy’. To capture that some labels are more inappropriate in response to certain cues, we must formalize a meaningful relationship between words. Griffiths et al. (2007) had success using semantic networks and pagerank (Page et al. 1999; Sloman et al. 1998).

The lexicon can be organized into a network where associated words share links. We can represent a network containing n words as an n×n matrix L where L_ij is 1 if there is a link from word j to word i and 0 otherwise. Pagerank captures that important words have more incoming links and that importance travels along these links. Pagerank is thus recursively defined: important nodes have more links incoming from important nodes. If M is a matrix where M_ij is the total proportion of importance that travels through L_ij, then

$$ \mathbf{M}_{ij} = \mathbf{L}_{ij} \left/ \sum\limits_{k=1}^{n} \mathbf{L}_{kj}\right., $$
(15)

and Pagerank is the solution for r in the recursive equation,

$$ \mathbf{r} = \mathbf{M}\mathbf{r}. $$
(16)
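Equations 15 and 16 can be sketched in a few lines of plain Python. The example below is illustrative only: the three-word link matrix is hypothetical, the fixed point is found by simple power iteration, and no damping factor is applied (library implementations of pagerank typically add one). It assumes every word has at least one outgoing link, so that no column of L sums to zero.

```python
def column_normalize(L):
    """Equation 15: M[i][j] = L[i][j] / sum_k L[k][j]."""
    n = len(L)
    col_sums = [sum(L[k][j] for k in range(n)) for j in range(n)]
    return [[L[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def pagerank(M, iterations=200):
    """Equation 16: approximate the solution of r = M r by power iteration."""
    n = len(M)
    r = [1.0 / n] * n  # start from a uniform importance vector
    for _ in range(iterations):
        r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


# Hypothetical toy network: L[i][j] = 1 if there is a link from word j to word i.
# Word 0 links to words 1 and 2, word 1 links to word 2, word 2 links to word 0.
L = [
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]
M = column_normalize(L)
r = pagerank(M)
print([round(value, 3) for value in r])
```

Because each column of M sums to one, the entries of r keep summing to one across iterations and can be read directly as a prior over cues, p(cue).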

Now that we have defined a prior probability distribution on cues, p(cue), we must define a sampling distribution (likelihood) for labels given cues, p(label|cue), which is exactly P(b|k,w):Footnote 7 the probability of an unknowledgeable informant believing a particular label given the cue, w. For this we apply the idea of spreading-activation (Collins and Loftus 1975), in which activation, which is directly analogous to importance, flows from node to node in the network. We can construct an activation-based sampling distribution by assuming that the probability of a label given a cue is determined by the minimal path length from the cue to the label in the network. That is, the closer the label is to a cue in a network, the higher its probability. More formally, if we assume that activation decays at the same rate across every edge, then for the set of edges, D, that defines the minimal path from cue to label, the probability of label given cue is,

$$ P(\text{label}|\text{cue}) \propto \gamma^{|D|}, $$
(17)

where |D| is the number of links in the path (|D|=0 if label=cue) and γ∈[0,1] is a decay constant capturing that activation decreases as a function of distance. We arbitrarily chose γ=0.5, which corresponds to losing half of the signal at each jump. This formalization of the belief probabilities implies that low-magnitude errors are most indicative of a helpful, unknowledgeable informant while high-magnitude errors are most indicative of unhelpful informants. A knowledgeable informant knows the correct label, an unknowledgeable informant is likely to guess a close label; in both cases, unhelpful informants will choose a label to lead learners away from their own beliefs: a label distant from the true label or distant from a close label.
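A minimal sketch of Equation 17 follows. The association network is a hypothetical toy example rather than the free-association data, minimal path lengths are found with a breadth-first search, and labels that cannot be reached from the cue are assumed to receive probability zero. The function names are illustrative.

```python
from collections import deque


def path_lengths(links, cue):
    """Breadth-first search: minimal number of links |D| from the cue to each reachable word."""
    dist = {cue: 0}
    queue = deque([cue])
    while queue:
        word = queue.popleft()
        for neighbor in links.get(word, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[word] + 1
                queue.append(neighbor)
    return dist


def label_given_cue(links, cue, gamma=0.5):
    """Equation 17: P(label | cue) proportional to gamma ** |D|, normalized over reachable labels."""
    weights = {label: gamma ** d for label, d in path_lengths(links, cue).items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


# Hypothetical directed association network (word -> associated words).
links = {
    "lion": ["tiger", "zoo"],
    "tiger": ["lion", "cat"],
    "zoo": ["cage"],
    "cat": ["mouse"],
    "cage": [],
    "mouse": [],
}

p = label_given_cue(links, "lion")
for label, prob in sorted(p.items(), key=lambda item: -item[1]):
    print(f"{label:6s} {prob:.3f}")
```

In this toy network "tiger" is one link from "lion" and "mouse" is three links away, so the low-magnitude error receives four times the probability of the high-magnitude error when γ=0.5.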

The network used here was constructed from the University of South Florida free association norms database (Nelson et al. 2004), which comprises free associations for 5019 cue words. We only included words that were both cues and responses, leaving 4870 words. Links were created from targets to responses. We used the Python package NetworkX (Hagberg et al. 2008) to construct the network, find minimal paths, and calculate pagerank. This allowed us to model the study using the exact words used in the study rather than word analogs as we did in the previous studies. For example, given this model we can ask for the probability that an informant is knowledgeable and helpful given that she labeled a lion as a tiger, P(k,h|a=tiger,w=lion), instead of asking about label indices, P(k,h|a=0,w=1), or simply whether a label does not match the true state of the world, P(k,h|a≠w). It should be noted that the free-association database records responses given to text cues and not visual cues, which were used in the study.

The experimental results (see Fig. 9) indicate that four- and five-year-olds do not exhibit a preference for either informant, but six- and seven-year-olds prefer informants who produce lower-magnitude errors. Higher magnitude errors are a better indication of naivety or unhelpfulness than lower magnitude errors. Unknowledgeable, helpful informants should guess a label close to the target and then produce a label that is close to the guessed label.

Fig. 9

Model simulation results for Einav and Robinson (2010). a Four- and five-year-olds. b Six- and seven-year-olds. The x-axis displays the accuracy condition. The y-axis shows the proportion of children who endorsed the answer given by the lower-magnitude-error informant, or for the model, the probability of endorsing the lower-magnitude-error informant

Looking time

The epistemic trust model is easily adapted to account for looking time paradigms. The primary hurdle is the mapping from probability to looking time. We assume that the time spent looking at an event is inversely proportional to the probability of that event. We are aware of recent work suggesting that looking time follows a U-shaped function whereby infants look longer at moderately improbable events and less at extremely probable or improbable events (Kidd et al. 2012). Recent work has successfully modeled this phenomenon (Piantadosi et al. 2014), but adopting this model requires more than doubling the number of free parameters in our model, which we believe adds unjustifiable complexity.

We model Koenig and Echols (2003, Study 1) in which 18-month-olds observe novel informants label common objects, displayed on a screen, either correctly (true labels condition) or incorrectly (false labels condition) for twelve trials. At each trial the number of seconds infants looked at the informant, the object, and their parents (on whose lap they sat) was recorded. We model only the time spent looking at the informant because the model most fluidly produces the probability of an informant producing a specific label given a specific target. Koenig and Echols (2003) report the mean looking time over trials. We report the mean inverse probability scaled arbitrarily. It is important to note that the parameter fit for this particular experiment was achieved by minimizing the error of the proportion difference between the time spent looking at each informant in both the accurate and inaccurate conditions. For example, if infants in the true labels condition looked at the informant for an average of 4 seconds and infants in the false labels condition looked at the informant for an average of 7.5 seconds, the proportion difference is 7.5/4=1.875. If the mean inverse probabilities for the true and false labels conditions are 1.2 and 3.8, respectively, then the relative error is |1−(3.8/1.2)/(7.5/4)|=0.69. We use this method because we are interested only in the trend from one condition to the other; we make no attempt to find the scaling constant that maps inverse probability to seconds. In this way, we can capture the trend without adding complexity.
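The relative error used for this fit can be written as a short function. The sketch below simply reproduces the worked example from the paragraph above; the function and argument names are illustrative and not taken from the original analysis code.

```python
def relative_error(model_true, model_false, looking_true, looking_false):
    """Relative error between the model's and the infants' false/true ratios.

    model_*   : mean inverse probability in the true and false labels conditions
    looking_* : mean looking time in seconds in the true and false labels conditions
    """
    model_ratio = model_false / model_true
    data_ratio = looking_false / looking_true
    return abs(1.0 - model_ratio / data_ratio)


# Values from the worked example in the text.
error = relative_error(model_true=1.2, model_false=3.8,
                       looking_true=4.0, looking_false=7.5)
print(round(error, 2))
```

Because only ratios enter the calculation, multiplying both model values by any constant leaves the error unchanged, which is why no scaling constant from inverse probability to seconds is needed.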

Apart from the looking-time modifications, the rest of the workings are identical to those we used to model Pasquini et al. (2007). The results can be seen in Fig. 10. We plot seconds beside inverse probability arbitrarily scaled. The model captures that an informant labeling common objects correctly is less surprising than an informant labeling common objects incorrectly.

Fig. 10

Model simulation results for Koenig and Echols (2003). The y-axis shows the mean time in seconds infants spent looking at the informant across trials and, for the model, the mean inverse probability of the informant’s actions across trials

Gaze following

Tummeltshammer et al. (2014, Experiment 1) investigated 8-month-olds’ learning from informants using a gaze-following paradigm. The researchers employed eye-tracking technology to record infants’ eye movements in response to gazes made by reliable and unreliable faces. For each face type, infants participated in four blocks of four familiarization trials. In each trial, a woman’s head appeared in the center of a black screen. In each of the four corners of the screen were empty boxes (squares). At the beginning of each trial the head looked at the infant, said “Wow, look!”, and turned to look at one of the four corners, at which time an animal noise sounded and its respective animal appeared in one of the boxes. Reliable faces always preemptively looked at the box in which the animal appeared and unreliable faces preemptively looked at the box in which the animal appeared only 25% of the time. Each square had a distinct animal and the heads only ever looked at two of the four boxes; that is, there were two boxes in which an animal never appeared and which were never looked at. After familiarization trials, infants participated in two different kinds of target trials: test and generalization. On test trials, the head looked at a box it had previously looked at. After a short delay an animal sound played but no animal appeared; instead, the corner boxes flashed. The same procedure repeated for generalization trials but the head looked at one of the boxes it had never looked at before. The hypothesis, in both cases, was that if such young infants are sensitive to informant reliability, infants who observed the reliable head should be more likely to follow its gaze. In both target trial types, infants looked at the box indicated by the reliable informant far more than the other boxes. Infants looked at the box indicated by the unreliable informant at chance.

From a modeling standpoint this study was difficult to capture, not because there is something about it that is inherently difficult to capture, but because the publication does not provide sufficient information to account for all the relevant details.Footnote 8 Before the experiment began, infants participated in a number of calibration trials during which objects appeared in the corners and center of the screen. It is possible that these trials affected infants’ beliefs about where objects should appear on the screen and hence their learning during familiarization. As an illustration: assume that during calibration infants cumulatively observe ten objects appear in each of the four corners. We capture the likelihood of an object appearing in a given corner with a multinomial distribution with a Jeffreys prior,

$$ P(\text{corner}) \sim \text{Dirichlet}\left(\frac{1}{2},\frac{1}{2},\frac{1}{2},\frac{1}{2}\right), $$
(18)

which is the probabilistic way of establishing a loose, uniform belief that objects are equally likely to appear in any of the four squares. After calibration and posterior probability updates we have

$$ P(\text{corner}) \sim \text{Dirichlet}\left(\frac{1}{2}+10,\frac{1}{2}+10,\frac{1}{2}+10,\frac{1}{2}+10\right), $$
(19)

which amounts to a very rigid uniform belief and which slows future updating; that is, each subsequent observation has less effect on the predictive probability of a specific event. Assuming that infants update their beliefs about objects and corners on each trial, an infant who receives the above calibration trials will attribute a predictive probability of 0.362 to an object appearing in one of the two never-before-indicated boxes on generalization trials, whereas an infant with no calibration trials would attribute a probability of only 0.056 to the same event. We ignored this sort of posterior updating because the study provides insufficient data and, as we have demonstrated, subtle differences in calibration assumptions can lead to dramatically different results. We assume that infants held a uniform probability over objects to corners for the duration of the experiment. It should be noted that there was a qualitative difference in infants’ behavior in the two target trials that could be explained by updating beliefs about objects and corners. It appears that infants followed the reliable head’s gaze to the cued box more in generalization trials than they did in test trials and followed the unreliable head’s gaze less in generalization trials than they did in test trials (see Fig. 11). If infants are looking for the box with the animal and an unreliable informant looks toward a box in which an animal has never appeared, infants should look less because at baseline it is unlikely for an animal to appear there. A reliable head’s gaze, to an extent, overrides the low prior probability of an animal appearing in that corner.
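The two predictive probabilities quoted above follow from the standard Dirichlet-multinomial posterior predictive, in which each corner's probability is its prior pseudo-count plus its observed count, divided by the total. The sketch below assumes that the 16 familiarization trials (four blocks of four) place animals only in the two indicated boxes; the even 8/8 split is an assumption for illustration and does not affect the result, since only the total number of familiarization observations matters for the two never-indicated boxes.

```python
def predictive(prior_counts, observed_counts, boxes):
    """Posterior predictive probability that the next object appears in any of `boxes`."""
    posterior = [p + o for p, o in zip(prior_counts, observed_counts)]
    return sum(posterior[b] for b in boxes) / sum(posterior)


jeffreys = [0.5, 0.5, 0.5, 0.5]      # Jeffreys prior over the four corners
calibration = [10, 10, 10, 10]       # ten calibration objects per corner, as in the illustration
familiarization = [8, 8, 0, 0]       # assumed split of 16 trials over the two indicated boxes
never_indicated = [2, 3]             # indices of the two boxes never looked at

after_calibration = [p + c for p, c in zip(jeffreys, calibration)]
with_calibration = predictive(after_calibration, familiarization, never_indicated)
without_calibration = predictive(jeffreys, familiarization, never_indicated)

print(round(with_calibration, 3), round(without_calibration, 3))
```

With calibration the probability is 21/58, and without it 1/18, which match the values of 0.362 and 0.056 reported in the text.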

Fig. 11

Model simulation results for Tummeltshammer et al. (2014). Error bars represent standard error

Another issue is trial ordering. Just as beliefs about corners and objects propagate across trials, so too do beliefs about informants. The study was conducted using a between-subjects design. The order of the boxes in which the animals appeared and, we assume, the order of the trials during which the unreliable informant looked at the correct object were counterbalanced. It is computationally intractable to average over many orderings for an experiment of so many trials, and because we do not have the exact trial orders of each participant, we cannot use approximation methods to capture individuals’ behaviors (e.g. win-stay, lose-shift (Bonawitz et al. 2011)). We modeled each condition, reliable and unreliable, separately and chose a single order for the unreliable condition (that the face looked at the correct box on the second trial of each block).

Infants’ likelihood of looking at the box indicated by the face was modeled using the same process as modeling an endorse trial. The infant should expect an animal to appear in the box indicated by the face if the face is likely to correctly label (via its gaze) that box as “the box that is going to have the animal in it”. In Fig. 11 we report the model results.Footnote 9

We see that the model captures infants’ preference to follow the reliable face’s gaze and to look other than where the unreliable face gazes. Again, there is a qualitative (though not statistically significant) difference in the results for the test and generalization trials for unreliable faces. Infants seem to look uniformly in the test trials (Fig. 11a) and seem to look other than where the unreliable face looks in generalization trials (Fig. 11b). Because we have ignored posterior updating with respect to object locations, these two target trials are indistinguishable to the model.

Results

An especially novel aspect of this work is in integrating results across experiments. We proceed by conducting an analysis using CrossCat. Once CrossCat has inferred a joint probability distribution over the data table, the bulk of the work is done; we need only ask CrossCat what it has learned. We refer readers who are more familiar with significance testing and who may wonder why we chose not to use an analogous significance test to Appendix 3 for a concrete example.

The first question relates to dependence among the variables. Previous research has debated what explains changes in children’s behavior with age: changes in reasoning about knowledgeability, or changes in reasoning about helpfulness? Figure 12 (left) presents a dependence probability matrix where each row and column entry, (i,j), represents the probability that variables i and j share a dependence (for details on calculating dependence probability and conditional distributions under CrossCat, see Appendix 2 or Mansinghka et al. (Accepted pending revision)). Pairs of variables for which changes in one tend to be associated with changes in the other are said to be dependent. As a reference point, the expected dependence probability (before effects of the data), derived from the CRP with parameter α where α∼Exp(1) is roughly 0.596 (for the full derivation of this quantity see Appendix 2). The dependence probability matrix is used as a way to explore which variables have interesting relationships. The higher the dependence probability between two variables, the more likely it is that the variables are mutually predictive. Because CrossCat learns a joint distribution over the entire dataset, we can try to predict any variable using any other variable but if the dependence probability between those variables is low, the two variables may not hold much information about each other; and if the dependence probability between two variables is zero, they have zero mutual information. The dependence probability matrix gives us a way to quickly determine which variables are likely to have interesting relationships that warrant more in-depth exploration.
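The reference value of roughly 0.596 can be checked numerically. The sketch below assumes that, before seeing data, two variables are dependent exactly when a CRP with parameter α assigns them to the same group, which happens with probability 1/(1+α); averaging over α∼Exp(1) gives the integral of exp(−α)/(1+α) over α ≥ 0. This is a reconstruction under that assumption and not the derivation given in Appendix 2; the function name, integration bound, and step count are illustrative.

```python
import math


def expected_dependence(upper=40.0, steps=200000):
    """Midpoint-rule estimate of the integral of exp(-a) / (1 + a) for a in [0, upper].

    exp(-a) is the Exp(1) density of the CRP parameter a, and 1 / (1 + a) is the
    assumed probability that two variables fall in the same CRP group.
    The tail beyond `upper` is negligible for upper = 40.
    """
    width = upper / steps
    total = 0.0
    for i in range(steps):
        a = (i + 0.5) * width
        total += math.exp(-a) / (1.0 + a)
    return total * width


prior_dependence = expected_dependence()
print(round(prior_dependence, 3))
```

Dependence probabilities well above this baseline suggest that the data support a relationship between two variables, while values well below it suggest evidence against one.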

Fig. 12

The dependence probability matrix resulting from cross-categorization. Each cell, [i,j], of the table represents the probability of dependence between columns i and j. Probability is represented by shade. The lighter the shade, the lower the probability of dependence. Numerical dependence probability values are displayed in their respective cells

In our model, the dependence probability between columns is generally high. The lower right-hand area of the matrix shows that the strength parameter for helpfulness and age are highly dependent and that both helpfulness parameters are highly dependent with communication mode. In contrast, both knowledgeability parameters show minimal evidence for dependence with age. Thus, the model indicates that age-related changes in behavior on epistemic trust tasks are related to changes in children’s reasoning about helpfulness.

Because the dependence probability matrix suggests a dependency exists between the helpfulness variables and age, we may investigate the form of these dependencies. How does children’s reasoning about helpfulness change with age? We can form predictions about one variable based on different values of a second variable. To investigate the relationship between age and helpfulness we compute the distributions for the strength and balance parameters on helpfulness given a set of age groups, i.e. P(s_h|age={1.5,3.5,4.5,5.5}) and P(b_h|age={1.5,3.5,4.5,5.5}). The resulting distributions are multimodal, so we display the full distributions rather than report standard summary statistics, which are largely useless in this case. For example, the mean and variance of the data are sufficient to summarize normally-distributed data because a single normal distribution is parametrized in terms of a mean and a variance, but they are not sufficient to describe data from a mixture of many normal distributions.
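To make the point about summary statistics concrete, the sketch below compares a single normal distribution with an equal-weight mixture of two normals that has exactly the same mean and variance. The specific components, N(−2, 1) and N(2, 1), are hypothetical and chosen only for illustration; they are unrelated to the fitted parameter distributions.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x):
    """Equal-weight mixture of N(-2, 1) and N(2, 1): mean 0, variance 1 + 2**2 = 5."""
    return 0.5 * normal_pdf(x, -2.0, 1.0) + 0.5 * normal_pdf(x, 2.0, 1.0)


# A single normal with the same mean (0) and variance (5) as the mixture.
single_at_mean = normal_pdf(0.0, 0.0, 5.0)
mixture_at_mean = mixture_pdf(0.0)
mixture_at_mode = mixture_pdf(2.0)

print(f"single normal at 0: {single_at_mean:.3f}")
print(f"mixture at 0:       {mixture_at_mean:.3f}")
print(f"mixture at 2:       {mixture_at_mode:.3f}")
```

The two distributions share a mean and a variance, yet the single normal places its peak at the mean while the mixture has a trough there and peaks near ±2, so reporting only the mean and variance would hide the bimodality.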

Figure 13a, b shows the results for balance and strength, respectively. The mass of balance for helpfulness (see Fig. 13a) for 18-month-olds rests heavily toward 1, indicating that the model explains their behavior via an assumption that people are in general helpful. From 18 months through 5.5 years there is a shift through a more uniform (flat) distribution to a peak at a more neutral position. This suggests that the data are explained by an increasing belief that not everyone is helpful.

Fig. 13

The conditional probability distributions of helpfulness and knowledgeability parameters given age. The distributions for ages 1.5 (blue), 3.5 (green), 4.5 (red), and 5.5 (teal). a Averaged conditional probability distribution of helpfulness’s balance parameter. b Averaged conditional probability distribution of helpfulness’s strength parameter. c Averaged conditional probability distribution of knowledgeability’s balance parameter. d Averaged conditional probability distribution of knowledgeability’s strength parameter

We see a similar trend in the strength of helpfulness. Younger ages have higher mean strengths, which, together with the balance parameter result, indicates more rigid beliefs that everyone is helpful. With age, the strength relaxes to a lower value. Lower strength indicates greater flexibility, indicating a non-rigid belief that people are either helpful or not. Thus, the model captures younger children’s behavior by attributing higher, more rigid prior biases toward helpfulness.

We calculated similar distributions for knowledgeability parameters but saw no marked age differences (see Fig. 13c and d). The shapes of the distributions for each age group are essentially the same, suggesting no evidence for developmental changes in reasoning about knowledgeability.

The dependence probability matrix (Fig. 12) showed that communication mode was dependent with the helpfulness parameters. Previous empirical research has observed differences in behavior based on different communication modes. For example, Couillard and Woodward (1999) found that children who received communication in the form of marker placement were less susceptible to informants’ misinformation than those who were communicated to through finger points (Jaswal and Neely (2006) found similar results exploring different communication modes). Querying the helpfulness parameter distributions given different communication modes allows investigation of how the model captures differences across communication modes. Figure 14 shows the conditional distribution of helpfulness parameters based on each communication mode. The results show that the model explains behavior resulting from communication using markers differently than the other communication modes. Marker placement (in green) is captured with a bimodal distribution and further investigation reveals that each mode corresponds to an age group (see Fig. 14c). The high-balance mode corresponds to three-year-olds and the low-balance mode corresponds to four-year-olds. The other communication modes induce more unimodal distributions. This is broadly consistent with the idea that labeling, pointing, and gaze are ostensive cues that may be strongly associated with helpful communication (Gergely et al. 2007; Topal et al. 2008). However, given that this result is based on a single study (Couillard and Woodward 1999), some caution is warranted in this interpretation of the differences in epistemic trust using ostensive and non-ostensive cues.

Fig. 14

The marginal conditional distribution of helpfulness’s balance distribution (a) and strength distribution (b) given each communication mode. In blue: verbal; in green: marker placement; in red: pointing; in teal: gaze. c The distributions of helpfulness’s balance parameter given that the informant communicated via marking, for 3.3-year-olds (solid line) and 4.3-year-olds (dashed line)

Discussion

The model predicts that development is driven by changes in children’s understanding of helpfulness in part because we have modeled studies that explicitly demonstrate the development of the understanding of helpfulness (e.g. Couillard and Woodward 1999). Couillard and Woodward (1999) provided children with demonstrations of an informant behaving inconsistently with her knowledge, which is only possible in the epistemic trust model if helpfulness is represented. An informant who knows that the sticker is under cup A, but indicates cup B, must not be helpful in conveying her knowledge. The more flexibly children represent helpfulness, the quicker they can learn to choose the opposite cup.

Younger children’s slower updating in response to inaccurate labels may also be attributed to a lack of understanding of variable helpfulness. Older children update their trust more quickly than younger children. A four-year-old who observes an accurate label from informant A but an inaccurate label from informant B is more likely to prefer informant A than a three-year-old (cf. Koenig and Harris 2005; Pasquini et al. 2007; Fitneva and Dunfield 2010). In the epistemic trust model, helpfulness is a more predictive informant attribute than knowledgeability. This means that knowing only about informants’ helpfulness provides more information about the veracity of their testimony than knowing only about their knowledgeability. Given two informants with unknown knowledgeability, a known unhelpful informant will produce correct labels less often than a known helpful informant. Assuming that there are n possible labels for an object and that the probability of guessing the correct label is 1/n, the helpful informant will produce the correct label n+1 times more often than the unhelpful informant. Under the same assumptions, but not knowing the informants’ helpfulness, the known knowledgeable informant will produce the correct label only n/2 times more often than the known unknowledgeable informant. Thus knowing an informant’s helpfulness reduces one’s surprise at the outcome of a label more so than knowing an informant’s knowledgeability.Footnote 10
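The n+1 and n/2 ratios can be reproduced with exact arithmetic under a few explicit assumptions, which are ours and may differ in detail from the derivation in the footnote: an unknown attribute is treated as equally likely to be present or absent, a helpful informant reports the label she believes, an unhelpful informant chooses uniformly among the labels she does not believe, and an unknowledgeable informant's belief is a uniform guess. The function names are illustrative.

```python
from fractions import Fraction


def p_correct(knowledgeable, helpful, n):
    """Probability of producing the correct label among n labels (n >= 2)."""
    guess = Fraction(1, n)  # chance that an unknowledgeable informant's belief is correct
    if helpful:
        # Helpful informants report what they believe.
        return Fraction(1) if knowledgeable else guess
    if knowledgeable:
        # Knows the true label and deliberately avoids it.
        return Fraction(0)
    # Believes a wrong label with probability (n - 1) / n, then picks uniformly
    # among the n - 1 labels she does not believe, one of which is correct.
    return (1 - guess) * Fraction(1, n - 1)


def ratios(n):
    """Return (helpful / unhelpful, knowledgeable / unknowledgeable) rates of correct labels."""
    half = Fraction(1, 2)  # assumed 50/50 uncertainty about the unknown attribute
    helpful = half * p_correct(True, True, n) + half * p_correct(False, True, n)
    unhelpful = half * p_correct(True, False, n) + half * p_correct(False, False, n)
    knowledgeable = half * p_correct(True, True, n) + half * p_correct(True, False, n)
    unknowledgeable = half * p_correct(False, True, n) + half * p_correct(False, False, n)
    return helpful / unhelpful, knowledgeable / unknowledgeable


for n in (2, 4, 10):
    print(n, ratios(n))
```

Under these assumptions the helpfulness ratio grows as n+1 and the knowledgeability ratio as n/2, so for any n ≥ 2 knowing helpfulness separates the informants more sharply than knowing knowledgeability.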

Any predictions made by the model will reflect these properties. The model indicates that younger children represent helpfulness, but are highly biased to believe that all informants are helpful. This implies that children can learn that informants can act in ways inconsistent with their model (relax their biases); thus the more a child observes informants acting unhelpfully, the better that child should perform on helpfulness-oriented tasks. This leads to the prediction that a child with more experience with unhelpful informants should perform better on epistemic trust tasks. For example, younger children who attend preschool or daycare, or have older siblings should perform similarly to older children who spend more of their time around only their caregivers. This suggests that researchers should collect more demographic information and conduct analyses grouped by experiential variables rather than age.

General Discussion

Research in cognitive development routinely emphasizes the importance of other people in learning about the world. While a considerable amount of research has investigated the bases on which children decide epistemic trust, precise theories of the basic phenomenon and how it develops have been limited. Researchers have interpreted their results in terms of updating beliefs about informants’ knowledge (Pasquini et al. 2007; Corriveau et al. 2009; Corriveau and Harris 2009), theorists have discussed whether epistemic trust is rational (Sobel and Kushnir 2013), and philosophers have formalized accounts based on reasoning about informants’ knowledgeability only (Bovens and Hartmann 2004).

More recently, computational (Shafto and Goodman 2008; Butterfield et al. 2008), theoretical (Sperber et al. 2010), and empirical accounts (Mascaro and Sperber 2009; Heyman and Legare 2013; Koenig and Stephens 2014) have proposed that a complete theory of epistemic trust requires reasoning about both informants’ knowledgeability and intent. Shafto et al. (2011) proposed a computational model and applied it to three studies from the literature, finding that an account based on knowledge and intent best explained four-year-olds’ behavior. They also found that there were developmental changes in reasoning, and that these changes were in reasoning about intent rather than knowledgeability. However, the import of this evidence is limited by the need to limit consideration to only three studies, which ensured uniformity in methods, ages, etc. necessary for the model fitting.

We have proposed a computational framework for integrating results from heterogeneous studies and used it to model the development of epistemic trust. The framework is based on parameterizing results in model space and analyzing the parametrized results alongside demographic features of the studies, allowing heterogeneous studies to be included and the heterogeneity to be analyzed without requiring arbitrary assumptions from the analyst as to how to partition the data. Our results confirm and quantify previous arguments claiming that reasoning about both knowledgeability and intent plays a role in epistemic trust and that developmental differences are attributable to changes in reasoning about informants’ intent. Reasoning about informants’ knowledgeability is found to be relatively constant. Our results extend previous findings to a much broader age range (18 months to 5.5 years) and quantify gradual change in reasoning about informants’ intent across that time period. Our results also extend previous findings by explaining why different modes of communicating used in experiments lead to different results. Consistent with previous theoretical accounts (Csibra and Gergely 2009) and empirical observations (Couillard and Woodward 1999), different modes of communication induce different expectations about how the data are selected.

Our approach represents a proposed solution to a vexing problem in cognitive development: developing coherent theoretical accounts that explain changes in behavior over time despite the confounded relationship between age and methodology. Standard practice in cognitive development circumvents this problem by focusing on identifying the youngest age at which children can succeed on a conceptual problem. This avoids the problem of covariance between age and task by prioritizing methods that apply at the youngest ages. However, this approach limits the relevance of resulting theory by prioritizing questions of competence over questions of performance.

Instead of focusing only on tasks that demonstrate competence at the youngest ages, we used the computational theory to parametrize the complete set of results that are explainable with that theory. We then used computational tools to make explicit the relationship between the model’s parametrization and the demographics of the experiment. This approach formalizes developmental theorizing in a way that supports inferences about the youngest ages at which children may succeed on a task, as well as relationships across behavior on different tasks, at different ages.

While this approach provides a more comprehensive, computationally precise account of the development of epistemic trust, there are limitations. Most notably, we have considered 11 studies from the literature. Although the epistemic trust literature is, in principle, much larger, including more studies would have required additional assumptions and/or free parameters. The evidence is too sparse to constrain these choices. Currently, much of the focus of epistemic trust research is on documenting new paradigms that cause children to allocate trust differently. The method we have outlined will be most informative given more systematic analyses of phenomena, in which studies are more mutually informative, e.g. paradigms that are slight adjustments of other paradigms or that investigate interactions between paradigms. Furthermore, empirical research focusing on quantitative, as well as qualitative, results would provide richer data for testing computational theories of epistemic trust on cognitive development.

Theoretical and empirical accounts of cognitive development emphasize the important role of other people in children’s learning about the world. We have proposed a computational theory and an approach for integrating results across heterogeneous methods and ages. The results indicate developmental changes in reasoning about informants’ intent and differences across tasks. Although we believe our approach to be the most precise and comprehensive account of the development of epistemic trust, there are many ways in which it is likely too simple to explain the richness of development. Continued empirical research is necessary toward the goal of developing a complete computational theory of the development of epistemic trust. Developmentalists are vital to this effort and can contribute in two ways. First, by filling gaps in the literature by reproducing existing results in different age and culture groups, and by extending existing paradigms to account for more nuanced phenomena (much in the same way Pasquini et al. (2007) did for Koenig and Harris (2005)). And second, by experimentally evaluating the model assumptions.