1 Introduction

In his most recent book, Becoming Human, Michael Tomasello (2019a) spells out the cognitive differences between humans and apes and constructs an evolutionary narrative about how humans became what they are today. The book is organized into two large parts, the first about the ontogeny of cognition, the second about the ontogeny of sociality. Each part contains four chapters, in each of which Tomasello discusses the developmental pathway of a specific feature of human thought, and how the same or analogous features manifest in apes. In many chapters Tomasello briefly comments on cross-cultural psychological studies investigating these capacities in non-WEIRD cultures (Henrich et al. 2010). The goal of these cross-cultural comparisons is usually to collect evidence in favor of the feature’s universality, despite superficial differences in its manifestations. The quest for human universals is consistent with Tomasello’s commitment to articulate how our species evolved and how human infants today, not just in Peru or Romania, but across the globe, develop over time.

We entirely agree with Tomasello’s main idea that shared intentionality and a “cooperative infrastructure” underlying human interaction are useful concepts to define the uniqueness of human conduct and cognition. In this critical essay, we home in on three aspects of this account that we believe could be improved. These revolve around the content and timing of key developmental steps that are argued to occur in infancy, at age 3, and at age 4 to 5. Our challenges raise vital questions about the enduring coherence of Tomasello’s otherwise persuasive account of human ontogeny. Before turning to the critique, it is worth remarking on the biggest changes and additions Tomasello has made to the theoretical framework that guides his research.

2 What’s New in Tomasello’s Account of Human Cognition?

Readers familiar with Tomasello’s recent work will recognize a recapitulation, with rich new details, of the broad outlines of his theory: a “9-month revolution” that initiates “joint intentionality” in the form of joint attention and cooperatively structured communication, followed by the emergence of “collective intentionality” at age 3, when children become group-minded and commit to a “we” that is greater than “me” (we > me), with normative protests ensuing if the commitment is broken. In the past, the upper boundary of the temporal window Tomasello has been concerned with was typically 4 to 5 years, when children have formed a full-blown theory of mind, acquired the concept of perspective and become normative agents. Tomasello now expands the window to include children entering the “age of reason” at 6 to 7 years. We learn that late preschoolers and first-graders become “reasonable” and “responsible” agents who make argumentative moves in the space of reasons. They can now be held accountable for their actions, just as they, too, hold others accountable.

The “normative turn” in Tomasello’s theorizing is part of a general trend (sometimes critically referred to as “raising the bar”) of trying to pin down human uniqueness in ever higher strata of mental and social functioning, in response to the surprising skills apes have demonstrated as the field of comparative psychology has advanced. These skills, which most scholars, including Tomasello, did not initially believe apes possessed, encompass basic forms of perspective-taking, (non-recursive) mind reading, (non-social) self-regulation, and acts of cooperation and helping in specific contexts. In brief, apes and perhaps some other animals can track and imagine others’ mental states. They can opportunistically use others to achieve their individual goals and occasionally help others achieve theirs. However, apes do all this without the cooperative infrastructure that underlies human interaction and without the “dual level structure” of sharedness (by way of having a joint goal) and individuality (different perspectives/roles).

Another novelty is the attention Tomasello pays to capacities for self-regulation and their development. He did not engage much with this topic in the past, perhaps because it is dominated by cognitive-scientific debates about working memory, cognitive load, and other things that fall outside of his scope of interest. But self-regulation and self-guidance were problems of central importance to Vygotsky (and thereby to Tomasello), who thought that these competencies result from the internalization of societal roles and rules that the child acquired in interaction with others. As Tomasello notes, “uniquely human forms of cognition and sociality emerge in human ontogeny through, and only through, species-unique forms of sociocultural activity” (6). In Vygotskian manner, he argues that children age 3 and older can act as part of a ‘we’ that requires them to subject any individual preferences and inclinations to the group’s plans and decisions.

Also for the first time, Tomasello takes a developmental approach not just to human cognition but to ape cognition. With new data from developmental comparative psychology, he informs us about the age of onset of apes’ various cognitive capacities where this information exists (it is sparse so far). The new attention paid to ontogenetic change in apes may reflect a recognition of a major problem pervading comparative psychology. By comparing human infants and children with juvenile or grown apes, this field has operated with an in-built confound of species and chronological age. Distancing oneself from this method acknowledges the importance of maturation and learning in species other than the human one and of shedding the legacy of the discredited ‘biogenetic law’ according to which ontogenetic stages (e.g., 1-year-old humans’ physical cognition) replicate phylogenetic stages (e.g., the physical cognition of a common ancestor of Pan and Homo). Tracing the developmental pathways of apes’ cognitive capacities is thus a laudable task that helps to counteract what may be a systematic methodological flaw in the comparative study of cognition.

Also new is the explicitness with which Tomasello takes a stance in the nature-nurture debate by marking out his account as “transactional”. Tomasello rightly rejects the opposition of explanations by nature versus nurture or biology versus culture. Instead, he chooses a “third way” by identifying capacities like joint attention, recursive mind-reading, etc., as “maturational”. This means that these capacities emerge naturally at a certain time in ontogeny on the condition that specific socio-cultural experiences occur. This has two implications. First, these capacities cannot be trained or learned prior to maturation; e.g., joint attention cannot be brought about in a 6-month-old, even when ample opportunities are provided. Second, without the necessary experiences, these capacities atrophy much like the muscles of our eyes atrophy if we don’t use vision.

Another welcome development in Tomasello’s theorizing is a stronger endorsement of a “transformative”, as opposed to an “additive”, account of shared intentionality. On the additive account, shared intentionality is a capacity that has been added to foundational psychological capacities without exerting influence on this foundation. It is the cherry on top. On the transformative account, shared intentionality shapes the entire gestalt of human cognition. While Tomasello oscillated between these two types of account in earlier writings (Kern and Moll 2017), he is now firmer in his conviction that shared intentionality shapes “children’s cognitive and social psychology across the board” (8). (Tomasello still sometimes hesitates to fully embrace the transformative account in all its implications, such as when he calls shared intentionality the “extra something” (Tomasello 2019b, 2) that makes humans different.)

In the following critique, we will address issues at three key developmental stages of Tomasello’s theory of ontogeny. The first concerns the onset of second personhood. Although Tomasello recognizes that infants by age 9 to 12 months can share experiences with others in joint attention and comprehend others’ communicative acts as designed “for you”, he argues that children only engage second-personally with others around age 3. We will argue that although infants may not yet make and acknowledge claims on one another’s conduct and will, and thus may not meet Darwall’s (2006) criteria for the second person standpoint, they nonetheless expect mutual engagement and seek second-personal exchanges with others. The next problem we will discuss concerns the most central feature of 3-year-olds’ “collective intentionality”-revolution: their “group-mindedness”. It is our hunch that some prerequisites for the group-mindedness Tomasello must have in mind given its role in revolutionizing Homo sapiens’ sociality around 150,000YA may not yet be in place by age 3. These include a clear delineation of in- and out-group and the capacity for inter-group competition. The third and final point is that we find the conceptual change occurring between 4 and 5 years of age to be downplayed because it does not coincide with either of the two cognitive revolutions. But the age window of roughly 4 to 5 is generally considered a watershed where children acquire the capacity to “confront” different points of view and come to acknowledge the possibility of misrepresentation (Moll et al. 2013; Perner et al. 2003). Tomasello mainly strives to explain 3-year-olds’ known struggles in this area with their newly acquired collective intentionality instead of focusing on a positive account of older preschoolers’ cognitive advancements.

3 Critique

3.1 Infants Are Second Persons, Too

Tomasello argues that children only start relating to others as second persons once they have reached preschool age. In his words: “One- and 2-year-old toddlers are not yet second-personal agents [our emphasis], nor do they perceive or treat others in that way. But they are working on it. One- and 2-year-old toddlers are in the process of constructing with adult partners a sense of I-you-we, a sense of other individuals as equally deserving cooperative partners (who see me the same way)” (192). It is only by age 3 or so that “participation in joint intentional activities creates the conditions for what moral philosophers call second-personal relationships, based on respect, commitment, accountability/responsibility, and fairness” (192).

Tomasello thus ties second personhood to particular interpersonal and discursive attitudes and practices analyzed by Strawson (2008) and Darwall (2006), such as rebuking others for their wrongdoings, feeling the weight of guilt on one’s shoulders and apologizing for one’s misbehavior, experiencing moral indignation or acknowledging the binding force of joint commitments. And he is right: A one-year-old pulling her mother’s hair or scratching her face does not apologize or feel guilty for what she did because she finds nothing wrong with her behavior. When an infant cries after falling from an elevated surface she expresses pain and the need for comfort, but demands no explanation from her caregiver for why she neglected her responsibility to prevent the harm. We believe, however, that demanding such discursive practices sets the bar too high. There are other, earlier, indicators that infants seek reciprocal or mutual exchanges and prefer standing in “I-Thou” rather than “I-It” relations with others (Buber 1970). These earlier interactions, we believe, deserve to be titled second-personal. By two to three months of age, infants start smiling and cooing at others and exchanging affect in “primary intersubjectivity” (Reddy 2008; Trevarthen 1993). As the still-face paradigm shows, young infants get upset when their second-personal address is ignored (Brazelton et al. 2008). Clearly, infants in these situations do not relate to others as third persons, from a distanced, observational standpoint. Instead, they expect reciprocity and mutual recognition. They want to see eye to eye. These, we believe, are reasons to agree with Buber (1970) that the “I-Thou” relation is primary in humans and present from early on in ontogeny, before age 3. Similarly, MacMurray (1999) argues that humans are “persons in relation” from infancy onward because they are part of a web of interrelated agents with whom they communicate and to which they turn for care and attention. A “person in relation” is nothing other than a second person.

One might disagree and argue that using the term of the second person is premature at this dyadic level of mutual involvement because child and other do not yet form a “we” that is united by some shared object of interest or attention. Such a “we-mode” of interaction, however, is present by the time infants engage triadically with others in joint attention at around 9 to 12 months. Why then can’t we grant infants, at least once they have joint intentionality and thus joint attention, the status of second persons? Tomasello himself and colleagues stress that infants in joint attention experience objects together with others as a “we”. They manifest their understanding of acting as part of such a “we” when exchanging the “knowing look” or “knowing smile” (Carpenter and Liebal 2011) in bouts of joint attention. Tomasello himself argues that in cooperative communication, infants understand that “you” intend for “me” to know something (p. 148). And on p. 15 he writes “The partners in joint agency relate to one another dyadically, second-personally [our emphasis] in face-to-face interaction; over time they create with one another shared experiences, the common ground on which their collaborative efforts may rely.” Evidently, Tomasello himself consistently invokes the first person plural and second person singular in his phenomenological descriptions of joint attention in infancy. And, we think, rightly so. We see no compelling reason why infants should be denied the status of second persons. It seems arbitrary to demand that such status be earned by competent participation in a “we” that is not constituted by a particular “you and I” (say, father and infant) sharing experiences in joint attention—the “we” of joint intentionality—but in a “we” that is constituted by a larger group of agents, say, a tribe, of which individuals identify as members—the “we” of collective intentionality. Gaining the competence to partake in “language games” of making and acknowledging moral claims and demands on others (Darwall 2006) is certainly an expression of one’s second-personal stance, but we believe it is not a necessary one. Granting infants the status of second persons is also, as we have seen above, consistent with Tomasello’s own rich descriptions of infants’ psychology in joint attention.

3.2 Limits of 3-Year-Olds’ Group-Mindedness and Capacities for (Intergroup) Competition

Throughout Becoming Human, Tomasello makes claims such as the following:

“children trust pedagogical communication and generalize it to new items because they see its generic formulation as coming from the cultural knowledge of the social group, with the instructor acting as a kind of authoritative representative” (150–151)

and

“Given that the voice of instruction is the ‘objective’ voice of the culture, the child is, in an important sense, self-regulating in terms of the normative understandings and standards of the cultural group as she understands them” (152).

We wish to emphasize that there is a difference between on the one hand recognizing that someone speaks in the form of the impersonal “one”, as in “one doesn’t do it like that” (or, more colloquially “you don’t do it like that”), and on the other hand interpreting such formulations as reflecting the views of a specific social or cultural group (“We, here, don’t do it like that” or “We from tribe x don’t do it like that”). While there is strong evidence to suggest that children understand judgments and practices to be objective, general, and normative, as Tomasello points out, we do not necessarily agree that children understand these judgments and practices as tethered to a particular cultural group. It has repeatedly been shown and argued that children generally trust adults for the truth and regard their demonstrations and generic statements as conveying objective and general knowledge (Harris 2012; Moll 2020; Moll and Kern 2020). If mother shows daughter how to prepare a potato, daughter generalizes the method of preparation beyond the particular potato and instance of preparing it and takes it to be “the right way” of preparing potatoes. But this does not imply that the child takes the source of this knowledge to be “our”, as opposed to some other, foreign, culture.

It is correct that studies suggest an early onset of in-group favoritism, and in some cases even negativity toward out-group members. For example, young children preferentially benefit, imitate, help, and seek positively biased information about members of their in-group (Aboud 2003; Buttelmann et al. 2013; Over 2018; Over et al. 2018). But experimentally constructed “minimal groups” with their clearly delineated memberships do not pose the same challenge as real-life groups, with their fluid and not always visible boundaries. Children may not always be aware of those groups as groups and their own membership in them. It is our hunch that they often presuppose knowledge and practices to be shared by others without recognizing them as belonging to a group. Let’s give an example. Say you grow up in a household in which one family member infrequently and irregularly interjects words from a language other than, but similar to, the one spoken by all others at all times. When entering school, you use these pieces of vocabulary with your classmates only to find them scratching their heads and asking you what you mean. It is through their lack of understanding that you find out that the phrases originate in a group of language speakers you do not even know existed. This example serves to show that groups and their memberships aren’t always visible to young children, and the degree to which they identify with them varies. This observation agrees with social-developmental studies showing that a child’s group attitudes are more experience-dependent than age-dependent, with contingent factors such as particular group membership and the group’s emphasis of in-group/out-group distinctions impacting the child’s group-mindedness (Nesdale 2004; Nesdale et al. 2005).

Another potential problem we see with Tomasello’s characterization of collective intentionality pertains to the aspect of (intergroup) competition. This aspect must be a central feature of 3-year-olds’ collective intentionality because Tomasello argues that 1) at the corresponding evolutionary juncture 150,000YA, Homo sapiens not only formed tight-knit social groups but also engaged in hostile competition with other tribal groups in fights over valuable resources, and 2) ontogenetic stages mirror evolutionary stages. He writes, “Mirroring the phylogenetic sequence, this maturational process unfolds in two basic steps: first is the emergence of joint intentionality […] and second is the emergence of collective intentionality at around three years of age” (p. 8). If my group wants to outcompete another group, e.g., during a hunt of large game, we not only have to coordinate our efforts within the group, but also need to collectively prevent the other group from succeeding, e.g., by disturbing or undermining their plans. But work by Priewasser et al. (2013) has shown that systematic competitiveness is a higher-level cognitive achievement that exceeds 3-year-olds’ capacities. Children were instructed to string beads on a bead-stick and try to complete the task before a competitor could. The authors found that children under the age of 4 to 5 did not understand that, in order to win, they had to use beads from a pile that would diminish their competitor’s access to beads, thus making it impossible for their opponent to fill his bead-stick before the child could complete hers. Instead, 3-year-olds solely focused on their own action, using beads from a pile that did not affect their competitor’s resources. The findings corresponded with a survey of the age recommendations given for competitive board games. Board games affording strategic competition undergo stringent tests for age appropriateness, and age 4 was consistently reported as the lower age limit (Priewasser and Perner 2017). These findings are consistent with social-developmental research suggesting that young children are more occupied with their in-group orientation than with any competition with or active exclusion of outgroup members (Nesdale 1999; Nesdale and Flesser 2001).

The upshot is that children younger than 4 to 5 years do not comprehend that the desirability of an outcome depends on one’s perspective or group membership. They do not grasp that what is good for us (the stag killed over here, by us) is bad for them and vice versa. A specific outcome is either objectively good or bad. This supports not only Perner and colleagues’ teleology account of human action understanding (Perner and Roessler 2012) but also Tomasello’s own idea that 3-year-olds construct and adhere to an objective perspective on the world. While we see this part of Tomasello’s account of collective intentionality confirmed, the part of the thesis that postulates intergroup competition in young children remains unconfirmed. The empirical work reviewed above suggests that 3-year-olds are not maturationally ready to engage in systematic competition.

We propose a version of collective intentionality in which 3-year-olds grasp the objective and normative character of rules and practices without thereby cognizing themselves as members of a particular cultural group. These children take offense when a task x is performed incorrectly because they assume that there is an objectively correct way of doing x, and they trust that they were shown how to x “the right way”. The standard of correctness is held by all those who know how to x, without this knowledge being exclusively possessed by a particular cultural group. On this account, children correct others’ defective performances because they know how the task is objectively performed—and that’s all there is to it as far as their representation of the situation goes. They need not ground the “oughts” and the knowledge they have of them in a particular culture or group.

This characterization of 3-year-olds’ norm-governed actions—one that does not see the basis of these actions in a tribal, “us vs. them” cognition—is not only in better agreement with social-developmental work on young children’s group attitudes, but would also help to remove a tension in Tomasello’s picture of an alleged U-shaped development of belief understanding, to which we now turn.

3.3 What Happened to the Social-Cognitive Revolution at Age 4 to 5?

Most scholars of cognitive development regard the age of 4 to 5 years as a crucial turning point at which children come to understand the subjective and aspectual nature of people’s perspectives on objects. They now can not only take but also “confront” different perspectives, acknowledge the clash between reality and someone’s misrepresentation of it, between what something appears to be at first sight and what it really is, or between two different visual perspectives and so on (Doherty and Perner in press; Moll et al. 2013). Although Tomasello generally acknowledges the developmental significance of this time period (“the child comes to some new way of understanding her own thoughts and beliefs at around four years of age” (Tomasello 1999, 176)), it does not play as big a role in his current account. It is our sense that the conceptual revolution at age 4 to 5 had to make way for the ‘collective-intentionality’ revolution at age 3 because the former fits less well into the theoretical framework of shared intentionality than does the latter. Two big phylogenetic switches are postulated for the evolution of shared intentionality, and so two corresponding ontogenetic switches must be assumed. First, the advent of triadic communication and cooperative activities 2MYA or at 9 to 12 months, respectively (“joint intentionality”), and second, the formation of internally cohesive and externally antagonistic groups about 150,000YA or at 3 years, respectively (“collective intentionality”). But how do we explain the cognitive advances between 4 and 5 years? Tomasello’s explanatory strategy is to not give a positive account of this process, but instead to account for 3-year-olds’ well-documented failures in theory of mind with their newly acquired skills in collective intentionality.

“The classic tasks of false belief could, in principle, be solved by a similar method of focusing only on the agent and what she has and has not experienced (and how this might affect her behavior). But this would not explain why 3-year-olds systematically fail the classic tasks by consistently choosing the location where the object really is. If they were just tracking the actor’s knowledge states, like infants, they should pass. Our hypothesis is that this mistake actually represents conceptual progress in that it emanates from an emerging conceptualization of an objective perspective on the situation [our emphasis]—how it really is, independent of any individual’s subjective perspective. As this understanding is just emerging, 3-year-olds apply it too widely, assuming that people guide their search for things by an objective perspective (that is, there is a “pull of the real”; see Perner and Roessler 2012). This assumption makes sense because children are frequently exposed to situations in which an adult knows something that they have not seen her learn; for example, their mother often knows what happened at a friend’s house even though she was not there. 3-year-olds’ confusion is only exacerbated by the fact that they have a cooperative bias that may lead them to take the experimenter’s question about where the agent will look as a question about where he should look (Helming et al. 2014). Eventually, 4-year-olds come to see the conflict (she believes it’s here when it’s really there), so to be successful they have to coordinate these perspectives” (73).

On the following page, the author refers to the development of false belief understanding as U-shaped:

“infants succeed based on tracking the experience of others. Three-year-olds fail as they begin to be able to take an objective perspective on things, which leads them to default to this objective perspective. Four-year-olds succeed as they learn to coordinate subjective and objective perspectives” (74)

If false belief understanding is U-shaped, as Tomasello claims, then what stands out and needs to be explained is the “dip” of the U. This is where 3-year-olds stand. What causes their nose-dive in theory of mind situations, according to Tomasello, is their newly gained capacity to construct “objective perspectives” or a “view from nowhere”. These skills temporarily corrupt young children’s capacity to imagine and track others’ epistemic states—capacities that human infants share with the apes (see also Tomasello 2018). Four-year-olds regain this capacity through perspective-shifting discourse as their language skills further progress. Three-year-olds’ low performance in theory of mind tasks thus reflects a temporary loss or masking of early-emerging and basic mind-reading skills caused, like “growing pains”, by their increased intellectual capacity for conceptualizing an objective perspective (72).

This story understates the actual cognitive progress made between 4 and 5 years. False belief understanding is not U-shaped but progresses in steps with increasing degrees of explicitness. There is no intermediary phase in which an infantile knowledge of belief is lost and later regained. The picture of a U-shape only comes about when we inadvertently shift attention from so-called “indirect” (mostly non-verbal) tests, which test for an implicit understanding of beliefs or belief-like states, to “direct” or standard tests, which test for an explicit knowledge of beliefs. In direct tests, 2-year-olds are just as incompetent as 3-year-olds, giving responses based on reality instead of the protagonist’s representation of it. In this regard, 3-year-olds do not stand out by suffering a bias toward objective reality; 2-year-olds have it, too. On indirect tests, 3-year-olds, rather than falling behind, outperform 2-year-olds. By age 3, they not only look but also act in anticipation of what a misinformed agent will do (Garnham and Perner 2001), and they can anticipate how the agent will feel (surprised and disappointed) when encountering unexpected reality (Moll et al. 2016). Notice also that 3-year-olds are not bewitched by the objective facts in these kinds of tests: against what Tomasello would predict with 3-year-olds’ alleged fixation on the objective perspective (the pull of the real), they manage to look and act away from the object’s location and correctly anticipate the agent’s misguided behavior. Neither 2-year-olds’ failure in standard false belief tests, nor 3-year-olds’ advanced performance on indirect tests is consistent with the proposal of a U-shaped function, which declares a drop in capacity and performance between ages 2 and 3. Consequently, 4-year-olds do not “regain” a capacity that was once in place and then vanished. They acquire a new capacity that far exceeds what infants were ever capable of doing: for the first time, they can explain, justify, and predict actions on the basis of agents’ misrepresenting the world. Tomasello on some level acknowledges this when he states that 4-year-olds learn to separate objective and subjective perspectives, and that perspective-shifting discourse, with two agents each articulating for one another how they view a scene or event, fosters this development (pp. 78–79).

From what we laid out above, it follows that 3-year-olds do not suffer a temporary loss of a prior capacity. They retain their implicit grasp of beliefs and progress toward an ever more explicit understanding of the possibility of misrepresentation. What also is not quite convincing is Tomasello’s list of heterogeneous explanations for why 3-year-olds struggle with theory of mind problems. First, he claims that 3-year-olds suffer a “cooperative bias” that leads them to inform the agent where her belongings actually are. But it is unclear why such a bias would not also sabotage their performance on indirect tests and why younger and older children would not show the same bias. Second, there is the idea that children might interpret the test question normatively as “Where should Maxi look for his chocolate?” Third, Tomasello posits an “objectivity bias” according to which children cannot resist the pull of the objective facts. Fourth, children might think that the agent has updated her belief in accord with reality, because people often know things they did not first-personally witness. It seems that Tomasello tries hard to suggest that what is in need of explaining is 3-year-olds’ struggles, not 4-year-olds’ growth of intellect. But his various suggestions do not cohere into an argument or unified explanation; in fact, some of them are incompatible. If children assume that the agent already brought her belief in line with the facts, then why should they feel the need to cooperatively inform her about the truth?

Taken together, these considerations serve to show that Tomasello downplays the conceptual change in children’s thinking between ages 4 and 5, and that he does so because this change does not quite fit into his two-step model of becoming human, which squarely focuses on skills of shared intentionality. To make his theoretical framework fit, he portrays 3-year-olds’ not-yet-fully developed understanding of the mind as progress in their understanding of normativity and objectivity. This, however, does not fully solve the puzzle of what accounts for the conceptual leap children take between 4 and 5 years.