Introduction

Since the 1950s, artificial intelligence (AI) has been the “intellectual heart” (Boden 2006) of cognitive science. With the computer metaphor of the mind, the view that minds are best understood as computational devices and cognition as system-internal information processing has gained popularity.Footnote 1 In the contemporary debate about how to best understand cognition, we can find two prominent proponents of this view: those committed to Good Old-Fashioned AI (GOFAI) argue that cognition is essentially rule-based computation over symbolic representations, while connectionists claim it is computation over sub-symbolic or distributed representations that is not (or at least not explicitly) rule-based. Despite their differences, both views exemplify classic cognitivism.

Classic Cognitivism (COG): The mind is basically an intracranial information processing system manipulating (sub-)symbolic representations; cognition essentially is this computational process.

Advocates of COG have traditionally maintained that cognition is primarily an “offline” exercise of “mental gymnastics” (Chemero 2009). Although this exercise may (but need not) involve a physical body and is often part of an elaborate interaction with the external world (think of calculating the change you expect to receive in a shop), these facts are best taken to be of secondary importance. COG proponents thus emphasize that offline processing involves internal representations, which are not bound to the current features of the agent’s body or her environment and are hence “decoupled”.Footnote 2

Classic Cognitivism has had serious ramifications for the study of social cognition. For several decades, research on how we understand others has been conducted under the heading of “mindreading”. Mindreading, as Baron-Cohen (2001) puts it, is the ability to “infer the full range of mental states (beliefs, desires, intentions, imagination, emotions, etc.) that cause action” (p. 174) and use these inferences to predict and explain the behavior of others. There are two dominant approaches to explaining mindreading: theory theory (TT) and simulation theory (ST).

According to TT, mindreading is enabled by a folk psychological theory that specifies how mental states (in particular beliefs and desires) interrelate and give rise to intentions and actions. There are various versions of TT, for example “modular TT” (Fodor 1992; Leslie et al. 2005), “scientific TT” (Gopnik and Meltzoff 1997), “model TT” (Maibom 2003; Godfrey-Smith 2005) and “external TT” (Braddon-Mitchell and Jackson 2007). ST is usually portrayed as the main rival of TT because it denies that we need a folk psychological theory to understand others. Instead, ST claims mindreading involves putting ourselves in someone else’s shoes by simulating their mental states while adjusting for the relevant differences. Like TT, ST has been developed in several directions, among them “explicit” ST (Heal 1986; Goldman 1989), “radical” ST (Gordon 1986, 2008), and “implicit” ST (Hurley 2008; Gallese 2005). Nowadays, most proponents of TT and ST favor hybrid models that accommodate both theorizing and simulation.Footnote 3 However, they still consider offline mindreading to be the central explanandum of social cognition.

In recent years, there has been growing criticism of TT and ST. Several authors have argued against the importance of mindreading for social cognition and challenged the classic cognitivism underneath (e.g., Hutto 2004, 2008; Gallese 2005; Gallagher 2004, 2005, 2007; Ratcliffe 2006, 2007; Fuchs and De Jaegher 2009). Many of these critics have promoted the importance of the agent’s body and the surrounding environment for cognitive processes under the heading of enactivism. Proponents of embodied, embedded, and even extended cognition already assign functional or implementational roles to the agent’s extra-cranial body and her environment; however, enactivism goes beyond that: its proponents claim that cognition is enacted rather than residing somewhere in the agent’s head or the world. In particular, they have claimed, cognition is not mere manipulation of (sub-)symbolic representations but an interactive process of sense-making, that is, a relational activity between agent and environment. The term “enactivism” can be traced back to “The Embodied Mind”, where Varela, Thompson and Rosch (1991) used it to designate a new way of thinking about the mind. Complaining that classic cognitivists left out “what it means to be human” (p. xv), they advocated focusing on the experiential nature of cognition as it arises from dynamic sensorimotor coupling and reciprocal determination between organism and environment. This characterization is still prominent:

According to the enactive approach in cognitive science, cognition is grounded on the sense-making activity of autonomous agents—beings that actively generate and sustain themselves, and thereby enact or bring forth their own domains of meaning and value […]. (Thompson and Stapleton 2009, p. 23)

Although the enactive approach is sometimes celebrated as a “new paradigm” for cognitive science (e.g., Stewart et al. 2011), it is far from being unified. Rather, as we will show below (“Dynamic Embodied Cognition” section), enactivism turns out to be an umbrella under which a range of related approaches find shelter. For now, however, it suffices to describe enactivism as follows:

Enactive Cognition (ENAC): Rather than a representational process, cognition is a process of sense-making that emerges from the dynamic online interaction or ‘coupling’ between autonomous agents and the environment in which they are embedded.

ENAC conceives of cognition as a process of sense-making emerging from the dynamic online interaction between agents and environment. As such, it can conveniently be modeled by dynamical systems theory (DST). According to DST, cognitive systems are best characterized as sets of differential equations the variables of which dynamically change their values over time, i.e. they evolve, in accordance with a set of dynamical laws (e.g., Port and Van Gelder 1995; Spivey 2007; Van Gelder 1995, 1998). Cognition thus is best understood as state–space evolution within a multidimensional dynamical system where the number of dimensions is given by the number of evolving variables.Footnote 4
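A minimal sketch may help to make the DST picture concrete. It assumes a toy system with just two evolving variables, one standing for the agent and one for the environment, whose rates of change depend on each other through a single coupling parameter. The equations, the function names `step` and `trajectory`, and all parameter values are illustrative assumptions and are not taken from the DST literature cited above.

```python
# Illustrative sketch only: a toy two-variable dynamical system.
# The equations and parameter values are assumptions made for this example.

def step(agent, env, coupling, dt=0.01):
    """Advance the coupled system by one forward-Euler step."""
    # Each variable decays on its own and is pulled by the other variable
    # with a strength set by `coupling`.
    d_agent = -agent + coupling * env
    d_env = -env + coupling * agent
    return agent + dt * d_agent, env + dt * d_env


def trajectory(agent0, env0, coupling, steps=1000, dt=0.01):
    """Return the visited (agent, env) states, i.e. a path through state space."""
    states = [(agent0, env0)]
    for _ in range(steps):
        states.append(step(*states[-1], coupling, dt))
    return states


# With coupling, a change in the agent variable spreads to the environment
# variable; with coupling set to zero the two variables evolve independently.
coupled = trajectory(1.0, 0.0, coupling=0.5)
uncoupled = trajectory(1.0, 0.0, coupling=0.0)
```

Here the state space has two dimensions because there are two evolving variables; a richer model would simply add further variables and equations.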

In this article, we investigate the merits of an enactive conception of cognition for the contemporary debate about social cognition. If ENAC is to be a genuine alternative to COG, it should be able to bridge what De Jaegher and Froese (2009) have called the “cognitive gap”, i.e. provide us with a convincing account of those higher forms of cognition that have traditionally been the focus of its cognitivist opponents. In the next section, we show that, at least when it comes to social cognition, current articulations of enactivism are not yet up to the task. This is because they (a) do not pay sufficient attention to the role of offline processing or decoupling for social cognition, and (b) overemphasize phenomenology at the cost of embodiment. The main challenge for any enactive account of social cognition, so we argue, will be to acknowledge the importance of both coupled (online) and decoupled (offline) processes in advanced but also basic forms of social cognition. To meet this challenge, we develop a dynamic embodied view of (social) cognition (“Dynamic Embodied Cognition” section). We illustrate the fruitfulness of this view by applying it to recent findings on “implicit” false belief understanding (“The development of false belief understanding” section).

The cognitive gap

The enactive approach and the appeal to phenomenology

When the basic principles of the enactive approach are applied to social cognition, an agent’s interaction with the world—including other agents—is transformed into what De Jaegher and Di Paolo (2007) call “participatory sense-making”: “the coordination of intentional activity in interaction, whereby individual sense-making processes are affected and new domains of social sense-making can be generated that were not available to each individual on her own” (pp. 12–13; cf. Fuchs and De Jaegher 2009; De Jaegher et al. 2010). Thus, from an enactive point of view, the proper unit of analysis in social cognition is not the individual agent or the individual brain, but the coupled system as a whole; i.e. the participants, their dynamic interactions, and the context in which these interactions take place. What is characteristic of such a dynamic “coupled systems” view of social cognition is its emphasis on reciprocal interaction and recurrent feedback loops. It is precisely the active process of engaging with others that constitutes social cognition. What is appealing about the enactive view is the possibility to account for our everyday interactions without having to invoke the mindreading procedures so cherished by proponents of TT and ST. In other words, the enactive approach promises to explain social cognition in a direct and non-representational way, without folk psychological theories or simulation routines.

Proponents of ENAC frequently draw on phenomenological considerations in their criticism of COG (e.g., Varela et al. 1991; Gallagher and Zahavi 2008). Gallagher (2007), for example, has made the following objection against Goldman’s claim that simulation is the “fundamental, default procedure” (2006, p. 175) of how we understand others. If simulation routines are indeed employed in a frequent and explicit fashion, then we should be aware of the different steps that we go through as we consciously simulate the other’s mental states. However, according to Gallagher, when I interact with others and try to understand them, “there is no experiential evidence that I use such conscious (imaginative, introspective) simulation routines” (2007, p. 65).

In response, Goldman (2006) has argued that the appeal to phenomenology is problematic because phenomenology is “incapable of supporting weighty theses”, hard to agree upon and “hotly disputed” (p. 249). Some have taken this criticism a step further. For instance, Spaulding (2010) assumes that neither the process of mindreading (theorizing or simulating) nor its product (the explanation or prediction of behavior) need be consciously accessible or phenomenologically transparent. Consequently, she claims that “the fallibility of phenomenology is one reason to doubt Gallagher’s phenomenological argument. The total irrelevance of phenomenology is another” (p. 131).

For reasons of space we will not discuss the role of phenomenology in the debate on social cognition in detail. What seems to be important at this point is this: it is unclear how the appeal to phenomenology can yield conclusive answers to questions about the frequency or pervasiveness (i.e. it being the default mode) of mindreading—be it theorizing or simulation. This rather seems to be an empirical question.

Spaulding’s criticism of enactivism

Despite the problems with phenomenological reflections, ENAC supporters typically use them as guidance in their approach to (social) cognition. Gallagher (2007), for instance, concludes that we usually do not need to consult a folk psychological theory or run a complicated simulation routine because phenomenology shows that in our everyday interactions we directly perceive and respond to meanings in the other’s action. The immediate and responsive “online” nature of our social interactions is frequently explicated in terms of basic embodied practices (e.g., intentionality detection, shared attention, the perception of meaning and emotion in movement and posture). Proponents of ENAC argue that there are basically two ways in which such practices are primary to the “offline” modes of mindreading advocated by TT and ST. First, online social interaction involves abilities that are developmentally more fundamental. Call this the argument from developmental primacy. Second, online social interaction is primary to offline social interaction in the sense that it continues to characterize most of our participatory sense-making, and remains the default or pervasive mode of how we understand others (Gallagher 2001, 2005). This argument depends on the appeal to phenomenology.

To make her case against what she calls “Embodied Cognition” (EC), her label for ENAC, Spaulding (2010) argues that both arguments are untenable. She rejects the phenomenological argument against the pervasiveness of mindreading because it fails to establish that mindreading occurs only in rare circumstances (see the previous paragraph). Her criticism of the argument from developmental primacy relies heavily on a distinction between two versions of this claim: according to the strong version, children are completely incapable of any social interaction before they master the basics of mindreading; according to the weaker version, children’s social understanding is limited (but not nonexistent) before they learn to robustly attribute mental states to others. Spaulding argues that the stronger version is obviously false, and it is hard to disagree with her on this point given the social interaction even newborns can engage in. However, Spaulding thinks the weaker version (which is the view actually defended in the literature) is also false. This is because even proponents of COG can happily acknowledge the existence of certain low-level mechanisms ontogenetically primary to “proper” mindreading abilities, namely the “precursors” to a full-blown Theory of Mind (cf. Currie 2008). On this line of argument, the central difference between COG and ENAC seems to boil down to a different interpretation of low-level early mindreading abilities: while liberal proponents of COG see them merely as a transition stage (this is Spaulding’s view), proponents of ENAC see them as the primary mode of social interaction even in adulthood.

While we agree with many of Spaulding’s observations, we think that her conclusion is unsatisfactory because it leaves intact a fundamental dichotomy between COG and ENAC. This is not a problem as such, but it seems to result in a strictly separate treatment of mindreading and non-mindreading abilities (the “precursors”) that fails to address the bigger picture and might not be adequate given the development of social cognition (see “The development of false belief understanding” section). In the remainder of this section, we will show that proponents of the enactive approach in fact face a similar problem, in the sense that they neglect to give a convincing explanation of their opponents’ target explanandum.

The cognitive gap

Although the arguments from phenomenology are often used to criticize the pervasiveness of mindreading, its enactive wielders usually do not deny that we sometimes do engage in specialized TT or ST procedures. The point is that this happens only when our everyday agent–agent interactions fail or break down. Thus, they say, “such specialized cognitive approaches do not characterize our primary or everyday encounters with others” (Gallagher 2004, p. 202).

However, quarrels about what is and what is not the primary mode of everyday social interaction seem to distract us from what is really at stake in the debate. On the one hand, there obviously are cases of social cognition (e.g., gaze following, joint attention, imitation) that are carried out online and for which there is a straightforward enactivist interpretation—one that may be cognitively less demanding since it does not rely on building up costly internal representations and therefore is evolutionarily sensible.Footnote 5 On the other hand, there clearly are also cases of social cognition that are carried out offline and for which classical ST/TT accounts seem much more plausible (e.g., think of explicit forms of first-order and second-order false belief understanding). The problem is not simply that mindreaders deny the former and enactivists deny the latter. Rather, it seems both camps just (over-)emphasize the importance of offline and online social cognition, respectively.

This dichotomy does not further the debate on social cognition. Instead, it results in a divide and conquer strategy with mindreaders trying to account for advanced social abilities in terms of higher-order offline representational abilities, and enactivists focusing on more elementary forms of social interactions in terms of dynamic online couplings between agents. This is not only unsatisfactory for those in favor of mindreading, as we suggested above, but also for promoters of enactivism; for each neglects what the other focuses on.

ENAC has been celebrated as a revolutionary new paradigm replacing an outdated COG. However, it cannot quite keep this promise and offer a genuine alternative to COG as long as it ignores the traditional explanandum of COG. That is, in order to replace COG, ENAC must not only account for dynamic online cognition but also for the “mindreading” abilities that have traditionally been the focus of TT and ST. This is so even if such abilities are only recruited in exceptional situations. Perhaps the biggest, and as yet unresolved, challenge for enactivism is to bridge this “cognitive gap” and provide us with a convincing account of offline social cognitive capacities. Thus, ENAC has to show how “an explanatory framework that accounts for basic biological processes can be systematically extended to incorporate the highest reaches of human cognition” (De Jaegher and Froese 2009, p. 439).Footnote 6 Problematically, though, phenomenological arguments tend to obscure this point. Hence, there are two serious challenges for ENAC: (1) restrict the emphasis on phenomenology to get a clear view of the cognitive gap, and (2) bridge the cognitive gap.

If ENAC is to bridge the cognitive gap, it must not only acknowledge advanced offline social cognitive abilities but also explain these “decoupled” modes of social cognition without falling back into the cognitivist camp. That is, the enactivist has to tell a story about how offline (i.e. decoupled) social cognition is grounded in and emerges from online (i.e. coupled) interaction.

To conclude our diagnosis: neither COG nor ENAC has been successful in providing a convincing account of both online and offline forms of cognitive processing. It hence seems fruitful to aim at a unified theoretical framework that solves the stalemate between ENAC and COG and integrates online and offline processes into a coherent story of how cognition can best be understood. This is our aim for the remainder of this article.

Dynamic Embodied Cognition

Two versions of enactivism

We already mentioned that enactivism is far from being a fully developed research program; it still lacks a unifying theoretical, conceptual, and methodological characterization. Without delving into too much detail, it will be important to at least distinguish between two readings of ENAC:Footnote 7

Broad Conception of Enactivism (ENACb): Cognitive Science is concerned with the question “What is an agent with a mind?”; cognition is the relational process of sense-making emergent from the agent’s dynamic coupling to her surroundings, an autonomous process that self-sustains the agent.

and

Narrow Conception of Enactivism (ENACn): Cognitive Science tries to characterize the nature of perception and perceptual experience; cognition is grounded in exploratory activities giving rise to perceptual consciousness.

ENACb is a generic view that focuses on autopoietic (i.e. autonomously self-organizing) systems and how they are maintained. Its defenders include, among others, Thompson (2005, 2007) and Maturana and Varela (1980). Proponents of ENACn, on the other hand, put much more emphasis on how we become conscious of our immediate surroundings. They reject the idea that our experiences result from manipulating detailed internal representations. Instead, they suggest a skill-based view according to which agents use a certain skill set to pick up details from the environment as needed, something that is familiar from embodied and embedded approaches to cognition (e.g., Clark 1989). What sets ENACn apart from these views, however, is its express focus on conscious experiences, particularly perceptual consciousness. ENACn is famously associated with O’Regan and Noë’s work on visual perception (O’Regan and Noë 2001; Noë 2004). They argue that perception is essentially skillful exploratory activity: mere stimulation, i.e. sensory input, is insufficient for us to perceive and become aware of what lies in front of us; instead, practical knowledge (so-called sensorimotor contingencies) relating our action to changes in stimulation is required to make sense of the input.

Studies like O’Regan and Noë’s are, however, not unequivocally perceived as evidence for cognition being enactive. On the one hand, they are also cited to support embodied and embedded views of cognition; on the other, proponents of ENACb may deny research within the narrower framework to properly count as enactivism because it does not offer as radical an alternative to COG as does the broad conception. If we consider the historical development of both positions, it becomes clear why there is this divide within the enactivist camp: while ENACb is commonly read as descending from ecological psychology, ENACn may be conceived of as situatedness taken to the extreme—either as an intense version of the extended view (Robbins and Aydede 2009) or a combination of both extreme embeddedness and extreme embodiment (Rupert 2009 seems to assume something like this). In both cases, ENACn descends from COG since situated views are probably best understood as attempts to liberate COG by adding environmental and bodily features to the classical representation-based internal processing. What sets apart ENACn from COG, however, is its tendency to focus on active online rather than internal offline processing. Nevertheless, ENACn bears a strong cognitivist heritage: it still conceives of cognition as a computational process—albeit one that does involve things other than internal (sub-)symbolic representations—and hence simply “waters down” (Chemero 2009, p. 30) COG. Read this way, ENACn cannot bridge the cognitive gap but will be doomed to stay on the cognitivist side. Therefore, opting for a narrow conception of ENAC will not do the trick if one aims to bridge the cognitive gap.

ENACb rejects COG’s central tenets, i.e. the computational nature of cognition and the importance of some kind of representational structure. In return, it directs all its attention towards dynamic coupled interactions between agents and their environments, which is accompanied by an almost complete neglect of offline processes. This in turn results in a failure to address the cognitive gap. Opting for a broad conception of ENAC thus will not enable us to bridge the cognitive gap either. As illustrated in “The cognitive gap” section, the emphasis on the experiential nature of cognitive processing in both versions of ENAC obscures these issues. In order to develop a view that can bridge the cognitive gap, we need to not only restrict the emphasis on phenomenology but also revise ENACb’s and ENACn’s conception of an agent’s dynamic coupling to her surroundings and bring (at least some) capacities for offline processing back into the picture. This will also require us to reconsider the role of representations in cognitive processing.

Dynamic Embodied Cognition

This section is devoted to developing a view that explicitly takes the problems with ENACb and ENACn on board: Dynamic Embodied Cognition (DEC). DEC can be seen as an attempt to offer a framework that spans the whole spectrum from COG to ENAC and offers a way of bridging the cognitive gap. Since our starting point is an embodied view of cognition, DEC may also be interpreted as a dynamicist version of embodiment.

What sets DEC apart from enactivism is the recognition that an agent can, to a certain degree, decouple from her environment and is not entirely dependent upon direct online interaction. On the one hand, as we pointed out above, ENACn proponents tend to hold that all cognition is grounded in exploratory interaction with the environment. Although DEC does not deny an essential role for these activities, and in so far is in tune with ENACn, it does not take these exploratory activities to be all there is to cognition (DEC explicitly allows for decoupled processes to play a substantial role as well) and hence takes a crucial step away from ENACn. Supporters of ENACb, on the other hand, tend to see dynamic coupling between agent and environment as ultimately replacing the representational structures postulated by their cognitivist opponents. In fact, a lot of effort has been put into getting rid of the notion of “representation” altogether (e.g., Gallagher 2008; Garzón 2008; Hutto 2008, 2011). However, an integrative theoretical framework will have to deal with cases where cognitive processing, at least to some extent, takes place independently of the agent’s environment. Sometimes an agent simply cannot rely on what is right there, and she will have to appeal to some kind of stand-in for what is currently absent. This may be particularly relevant in social contexts where agents do not always have direct access to the mental states (e.g., beliefs, desires) of others. Furthermore, an agent may also purposefully distance herself from her immediate surroundings so as not to act automatically upon particular affordances. She may instead consider other ways to respond to the affordances the environment offers, contemplate various possibilities for action towards a certain goal, etc. That is, the agent may engage in “offline” processing, while controlling for or suppressing “online” processing, to achieve the best result. Focusing on direct coupling (online processing) alone neglects the agent’s ability to generate new and more advantageous conditions for relating to her environment by engaging in offline processing.Footnote 8

What is characteristic of offline processing is the manipulation of information that is currently unavailable from the environment (but typically has been present on previous occasions), and therefore has to be internally represented. We contrast this type of processing with online processing, which does not require recourse to such internal representations but instead manipulates what is currently present in the environment and readily available to the agent. The distinction we draw between the information used for online and offline processing parallels Shannon’s (1993) distinction between “presentations” and “representations”. Presentations are what is currently within the agent’s grasp and can be readily used in forms of processing in which the agent is directly coupled to her environment. Representations, on the other hand, are not directly bound to what lies in front of the agent (although they otherwise bear characteristics similar to those of presentations); they are decoupled from the environment, although they typically arise from or are grounded in enaction before being taken offline, and are thus suitable for offline processing.Footnote 9 The agent may engage in both types of processing differentially depending on the environment she is situated in, and the internal processing resources at her disposal. When the agent is able to represent the information missing from the environment (e.g., based on memory), she may be able to achieve what under other circumstances she could accomplish by online processing alone, or even engage in more sophisticated tasks unsuited for online processing.
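The contrast between presentations and representations can be sketched in a few lines of code. The sketch assumes a deliberately crude picture in which the environment is a dictionary of currently available features and the agent keeps stand-ins for features it has encountered before. The class `Agent`, the method `perceive`, and the example feature are hypothetical names introduced only for illustration; nothing here is meant as a model of actual cognitive processing.

```python
# Illustrative sketch only: all names are hypothetical.

class Agent:
    def __init__(self):
        # Internal stand-ins for features met on earlier occasions.
        self.representations = {}

    def perceive(self, environment, feature):
        """Return (value, mode) for a feature, preferring what is present."""
        if feature in environment:
            # Online: the feature is a presentation, readily available.
            value = environment[feature]
            # Keep a stand-in so the feature can be used later when absent.
            self.representations[feature] = value
            return value, "online"
        if feature in self.representations:
            # Offline: the feature is absent, so a stored stand-in is used.
            return self.representations[feature], "offline"
        # Never encountered and not present: nothing to process.
        return None, "unavailable"


agent = Agent()
first = agent.perceive({}, "cup")                    # nothing present, nothing stored
second = agent.perceive({"cup": "on table"}, "cup")  # coupled to the environment
third = agent.perceive({}, "cup")                    # decoupled, relies on the stand-in
```

The third call illustrates the point made above: once the information has been encountered online, the agent can fall back on an internal stand-in when the environment no longer supplies it.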

Decoupling, as we see it, is a matter of reducing direct effects of environmental stimulation and opening up possibilities for internally regulated behavior. This enables the agent to inhibit reflex-like automatic responses to external triggers (presentations), and elicit actions in the absence of an external trigger, for example by recruiting internal representations in scenarios that do not provide the presentations triggering coupled online processing. To illustrate this, think of the walking reflex: as soon as their soles touch a flat surface, newborns attempt to walk by placing one foot in front of the other. The infant’s behavior is purely automatic. It cannot be inhibited: the environmental input cannot be ignored, and the newborn will engage in walking movements even when lying on her back with her feet pointing upward. Nor can walking behavior be triggered without a flat surface under the infant’s soles. This seems to be a perfectly fine example of direct dynamic coupling between agent and environment. The infant does not have the capacity to walk independently of what touches the soles of her feet; her behavior thus is not decoupled, it is not independent of environmental stimulation. At about 8 weeks of age, though, the infant acquires the ability to inhibit the walking reflex: she learns to decouple her behavior, and the walking reflex disappears as an automatic response. Towards the end of the first year, the infant acquires the ability to walk, that is, she becomes able to initiate walking as voluntary behavior. Walking is now fully decoupled and has become (at least to some extent) independent of what kind of surface touches her soles (see, e.g., Adolph et al. 2003 for a more detailed discussion).

This is a good example of how decoupling provides an agent with a certain degree of autonomy towards the presence of bodily and environmental features. The child acquires the capacity to (a) inhibit automatic behavioral responses (walking motion) vis-à-vis environmental features (a flat surface touching her soles) and (b) elicit that behavior (formerly coupled to the presence of a certain stimulus) independently of environmental stimulation. Analogous observations can be made in the development of social cognition. For example, several researchers report that young infants are like “imitation machines” in the sense that they copy and reproduce virtually all human actions they see in their surroundings (Carpenter et al. 1998; Tomasello 1999). Although these infants cannot help but respond to what they perceive, they do have the capacity to correct and improve their imitations. Moreover, with age they come to master forms of imitation which are beyond the scope of younger infants. By 18 months, for instance, infants can complete unfinished actions of agents they observe, such as pulling apart miniature dumbbells (Meltzoff 1995; Meltzoff and Brooks 2001). In this respect, imitation in infants quickly develops into more than a mechanical reflex-like response: it starts to involve a kind of responsiveness that is best characterized as a dynamic process of coupling and decoupling. Although imitative behavior is triggered by the perception of other agents, it is not fully determined by this action–perception coupling but is (at least partly) independent of, and thus decoupled or autonomous from, what is going on around the agent.

In the enactivist literature, the emphasis on autonomy is prominent. An autonomous agent is sometimes defined as an autopoietic system that (a) generates its own identity and the particular conditions by which it can relate to its environment (the system has an intrinsic teleology and constitutes its own purposeful and goal-directed existence), and (b) makes sense of the affordances offered by the environment in relation to its particular way of realizing and preserving its identity (the system has a capacity for self-monitoring and appropriate regulation, and can adapt itself to changes in the environment). This definition is demanding and it is not unequivocally shared among enactivists (see Weber and Varela 2002; Froese and Di Paolo 2009, and Barandiaran et al. 2009 for further discussions of autonomy). For current purposes, since we are not concerned with issues of self-organization and self-sustainment, it suffices to use a simpler, less demanding understanding of autonomy that still captures the effects of decoupling. “Autonomy” as we will use it simply refers to a certain degree of independence of environmental stimulation and the ability to inhibit and elicit actions that are typically directly coupled to the perception of certain environmental features and/or the affordances they offer.Footnote 10

The fact that our weaker definition of autonomy allows for degrees of independence is crucial, for there certainly seem to be different extents to which systems are autonomous or decoupled from their environments.Footnote 11 For instance, an agent may be able to inhibit automatic responses without being able to initiate certain actions in the absence of environmental input. This makes her more autonomous than an agent who is unable to inhibit automatic responses, but less autonomous than an agent who can additionally elicit actions in the absence of environmental stimulation. Importantly, degrees of autonomy and decoupling are always relative to a given environment. The more an agent’s action becomes independent of specific environmental features, the more she decouples from the environment and the more autonomous she becomes. However, a significant change in the environment can have a dramatic impact on the agent’s autonomy and her abilities for offline processing. For example, healthy and mature human beings will generally have little difficulty initiating walking on a firm surface. However, put them in a radically different environment, e.g., on the bottom of the ocean or the surface of the moon, and things start to change rapidly, at least unless means for dealing with these new environments (such as special equipment) are provided.

The coupling and decoupling relations between agent and environment advocated by DEC are dynamic in the sense that they are a matter of degree and never an end in themselves. On the contrary, decoupling mainly serves recoupling: by taking certain processes offline, decoupling allows for novel ways of relating to the environment that provide the agent with more and better tools to act on certain affordances or create new ones. This, in turn, enables the agent to re-enter into online processing. The dynamic interplay of decoupled and coupled processes may be used for optimization of cognitive processing. The obvious (evolutionary) advantage of this cognitive flexibility and autonomy from environmental stimulation is that agents become less dependent upon, and gain new ways of relating to, their direct surroundings, including (in the case of social cognition) their conspecifics. Thus, the development from online to offline forms of cognitive processing is to be seen as one of progressive independence from environmental and bodily features, giving the agent autonomy but at the same time demanding more cognitive work. Against the background of developmental studies, this is a plausible story of how the developing organism progressively acquires decoupled abilities and thus engages in offline cognition, while online cognition will still be used throughout adulthood where feasible. We will illustrate these points in our discussion of empirical studies on social cognition (“The development of false belief understanding” section).

If the recourse to offline processing advocated by DEC is useful, the strict non-representationalism built into ENACb is too strong. At the same time, however, COG’s strict representationalism also seems too demanding for the kind of offline processing we have in mind.Footnote 12 Therefore, we propose a weaker reading of “representation” according to which a representation can be any kind of stand-in for another item that will typically be best understood as grounded in enaction but subsequently taken offline. In this sense, there does not have to be a direct mapping between internal stand-ins and external items; the reading from the barometer represents the current air pressure just as a certain neural activation pattern in the brain represents face perception and the word “cat” represents the concept of a cat.

This more liberal conception of representations (a) still picks out something very different from Shannon’s presentations, and (b) does not ipso facto exclude (sub-)symbolic representations but instead widens the scope of what counts as a representation. As such, it allows us to take a middle course between COG, ENACn, and ENACb. We criticized ENACb’s attempt to discard representational structures altogether in the light of the cognitive gap. If we aim to bridge that gap, we take it, some form of representational processing will be required. In order to achieve this, DEC allows for decoupling from the agent’s environment. Our understanding of “autonomy” as independence of environmental features is, although inspired by ENACb, different from the standard enactivist reading. This marks an attachment point between DEC and ENACb although at the same time it clearly sets them apart. We already pointed out that, due to its cognitivist heritage, ENACn is in principle compatible with the computational nature of cognitive processing. DEC reconciles COG and ENACn by leaving the assumption of cognition being some form of computational process untouched and allowing immediate presentations and (sub-)symbolic representations of environmental features as well as other kinds of stand-ins to be recruited for this process.Footnote 13 Rather than pure enactment of the world, we therefore advocate a mixture of enactive and representational processing that may best be captured within DST.Footnote 14

Drawing together the threads of the current section, we can define Dynamic Embodied Cognition as follows:

Dynamic Embodied Cognition (DEC): Cognition is (developmentally) grounded in the agent’s coupling to her surroundings. This coupling is dynamic insofar as it allows agents to (a) rely on direct online cognitive processes, or (b) decouple from their environment and engage in offline cognitive processing. Online processing is ‘cheap’ and efficient as it allows the agent to avoid building up ‘costly’ internal representations which would require additional processing resources; however, it comes at the cost of limited and inflexible responsiveness. Offline processing provides agents with more flexibility (autonomy) regarding their direct environment, but is also cognitively more demanding. For a system to engage in decoupled offline processing, it has to (a) inhibit automatic behavioral responses to environmental features and (b) be able to elicit the behavior formerly coupled to the presence of a certain stimulus independently of environmental stimulation.

Having removed the obscuring emphasis on phenomenology characteristic of ENAC, we are now in a position to recognize that bridging the cognitive gap will require us to acknowledge both online and offline processes within a unified theoretical framework. In order to achieve this, DEC allows for a dynamic interplay of coupled and decoupled processes. This, in turn, requires a more liberal notion of representation than advocated by COG. However, (sub-)symbolic representational processing is still an option within DEC, just a very specific one that may have only limited application—perhaps in later development or higher-order cognition. This way, the traditional explananda of COG do not fall out of the spectrum and the cognitive gap can be bridged.

Dynamical Systems Theory

If DEC is right and cognition really is best understood as the dynamic interplay of enaction and decoupled representational processing, what kind of architecture could be used to capture these facts and model cognitive systems? We here propose that Dynamical Systems Theory may be our best bet.

DST models cognitive systems as sets of differential equations the variables of which dynamically change their values over time. Over time, the dynamical system evolves through a multidimensional state space. The resulting trajectories through state space may be said to represent (in a weak sense) the system’s state.Footnote 15 “Computation” in a dynamical system no longer refers to discrete steps of symbol manipulation but should be conceived of as transforming one kind of information into another where this transformation may occur by dynamic evolution through a system’s states.

We agree with Spivey (2007) that the “traditional information-processing approach (borrowed from the early days of computing theory), […] place[s] too much emphasis on easily labeled static representations that are claimed to be computed at intermittently stable periods of time” (p. 4). Yet, we also think that, at least for offline processes, there has to be some kind of “mediating stand-in […] in between sensory stimulation and physical action” (ibid., p. 2). Our continued use of the term “representation” for these stand-ins serves to ease the “intellectual transition” (ibid.) from the classic information-processing framework of COG to a DST framework to model cognitive processing. The advantage of using DST as a modeling framework is that it provides the tools to capture not only highly dynamic features of a system but also (although this is not the typical area of application for dynamical systems) static ones: introducing more static variables introduces stability and independence of, e.g., environmental factors.

Within DST, the decoupling we described above as a transition from online to offline processes will be modeled by reducing the number of dimensions of the dynamical system in question. That is, in the decoupled system, fewer variables will dynamically change their values over time while more variables get assigned fixed values making the system more static. This can be achieved by, e.g., identifying linear relations between two variables such that one of them can be substituted with the product of a scaling factor and the other variable. Similarly, one may assign approximate values for certain variables if, say, their values change within a very restricted interval. Such mathematical procedures result in a reduction of dimensions of the dynamical system.Footnote 16 Taking this line of reasoning a bit further, appropriately constrained dynamical systems can be used as models of classical cognitivist systems (see Garzón 2008 for a similar proposal). Likewise, recoupling can be conceptualized as adding degrees of freedom to an existing dynamical system thus making it more flexible and dynamic. Put slightly differently, the thought is this: dynamical systems have high dimensionality, where each dimension corresponds to a variable the value of which evolves through time. Reducing the number of unknowns (i.e. variables with changing values) hence reduces the number of dimensions of space through which a dynamical system evolves; the lower the dimensionality, the less dynamical (and more rigid) the system.
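The two reduction procedures just described can be made concrete with a toy example. The following Python sketch is not a model of any cognitive system and is not drawn from the literature cited here; the equations, the variable names x, y and e, the scaling factor K and the fixed value E_FIXED are hypothetical choices made purely for illustration. The "coupled" system has three variables that all change over time. Its equations are constructed so that y = 2x holds along the whole trajectory and e stays within a narrow band around 1. The "decoupled" system substitutes K * x for y and assigns e a fixed value, leaving a single evolving variable.

```python
from typing import Callable, Dict, List

State = Dict[str, float]


def integrate(deriv: Callable[[State], State], start: State,
              dt: float = 0.01, steps: int = 1000) -> List[State]:
    """Forward-Euler integration; returns one state per time step."""
    state = dict(start)
    trajectory = [dict(state)]
    for _ in range(steps):
        change = deriv(state)
        state = {name: value + dt * change[name] for name, value in state.items()}
        trajectory.append(dict(state))
    return trajectory


def full_system(s: State) -> State:
    """Hypothetical 'coupled' system: three variables evolve over time."""
    return {
        "x": -3.0 * s["x"] + s["y"] + s["e"],
        "y": -6.0 * s["x"] + 2.0 * s["y"] + 2.0 * s["e"],  # dy/dt = 2 * dx/dt, so y = 2x is preserved
        "e": -0.5 * (s["e"] - 1.0),  # e relaxes towards 1 and never leaves [1.0, 1.02]
    }


K = 2.0        # scaling factor of the linear relation y = K * x (assumed)
E_FIXED = 1.0  # approximate constant value assigned to e (assumed)


def reduced_system(s: State) -> State:
    """'Decoupled' system: y replaced by K * x, e held fixed; only x evolves."""
    return {"x": -3.0 * s["x"] + K * s["x"] + E_FIXED}


full = integrate(full_system, {"x": 0.0, "y": 0.0, "e": 1.02})
reduced = integrate(reduced_system, {"x": 0.0})

# Largest difference between the two descriptions of x over the whole run.
max_gap = max(abs(f["x"] - r["x"]) for f, r in zip(full, reduced))
print(f"dimensions: {len(full[0])} -> {len(reduced[0])}, max gap in x: {max_gap:.4f}")
```

In this sketch the one-dimensional system tracks the x-trajectory of the three-dimensional one closely, at the price of no longer responding to changes in e. This mirrors, in a very simplified way, the trade-off between reduced dimensionality and reduced responsiveness to the environment described above.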

Hence, applied in the right way, DST shows how there can be a unifying understanding of seemingly very different (i.e. online and offline) processes; and such an understanding is what DEC offers. Although the story DEC tells about decoupling is not the mathematical process of constraining a dynamical system, mathematical constraining may be an adequate model of what happens when a cognitive system decouples from environmental features. Since both static and dynamic aspects of a system (important for modeling decoupled offline and coupled online processes, respectively) can be captured using the same mathematical vocabulary within DST, it comes in handy for modeling cognitive systems as characterized in an integrative framework like DEC.

As we hope has become clear throughout this section, DEC respects both enactive online processing accounts and classicist offline processing accounts of cognition. While it agrees with enactivism that cognition is best modeled using DST, DEC shifts the focus from experiential features prevalent in both ENACn and ENACb towards a focus on dynamicism. At the same time, DEC significantly imports from embodied and embedded views of cognition. These bear a substantial cognitivist heritage insofar as they conceive of cognition as a computational process; yet, they also emphasize the importance of bodily and environmental features for at least some cases of (online) cognitive processing. While advocates of ENACn and ENACb typically deny the importance of offline processing completely, DEC acknowledges the whole spectrum from pure online to pure offline processing and thus offers a way to bridge the cognitive gap. More than that, it assigns an essential role to decoupling and recoupling in both cognitive processing and development. Since this makes for a stark contrast to enactivism, we locate ourselves in the embodied cognition rather than enactivist camp. In the following section, we will use recent developmental findings on false belief understanding in order to illustrate how DEC allows us to resolve the disagreement between COG and ENAC concerning the superiority of offline versus online forms of social cognition.

The development of false belief understanding

Recent findings on false belief understanding

Proponents of TT and ST frequently appeal to developmental psychology to support their views on social cognition. In particular, the “elicited response” false belief test (ER-FBT), in which infants are asked to give an explicit prediction of another agent’s behavior on the basis of her false belief, has been a very popular choice. In the unexpected location ER-FBT, for example, children observe a protagonist who sees an object being placed in a certain location (Wimmer and Perner 1983; Baron-Cohen et al. 1985). The protagonist leaves, and the object is moved. When the protagonist returns, she mistakenly believes the object is still in its initial location. At this point, the children are asked to (verbally) predict where the protagonist will look for the object. Test results show that 3-year-olds typically give a wrong answer to this question, while 4-year-olds answer correctly. Many researchers have therefore concluded that false belief understanding does not emerge until 4 years of age (Flavell 2004; Sodian 2005; see Wellman 2002 for a review, and Wellman et al. 2001 for a meta-analysis). The controversy between TT and ST has mainly centered on the question of whether these findings are indicative of the 4-year-olds’ mastery of a folk psychology (in the fashion of TT) or rather a set of simulation routines (appealed to by ST).

Meanwhile, advocates of enactivism have argued that the results of ER-FBTs support neither TT nor ST, because the test is not representative of the full scope of social cognitive abilities. Gallagher (2004), for example, claims that ER-FBTs are designed to capture a set of very specialized cognitive abilities, which “put us in an observational mode and do not capture the fuller picture of how we understand other people” (p. 204). This line of reasoning has also been taken by developmental psychologists more widely. For instance, Bloom and German (2000) argue that the ER-FBT is an “ingenious, but very difficult task that taps (only) one aspect of people’s understanding of the minds of others” (p. 30).

The ER-FBT indeed places strong demands on children’s cognitive abilities (Bloom and German 2000; Carlson and Moses 2001). Recent investigations of false belief understanding have attempted to reduce these demands in order to see whether children might be capable of false belief understanding at an earlier age. Employing violation of expectation, anticipatory looking, and active helping paradigms, these “spontaneous response” false belief tests (SR-FBTs) no longer require children to explicitly state the protagonist’s belief. Instead, their false belief understanding is inferred from the behavior they spontaneously produce (Baillargeon et al. 2010). On the basis of these studies it has been claimed that false belief understanding emerges much earlier; it has been reported in 25-month-olds (Southgate et al. 2007), 15-month-olds (Onishi and Baillargeon 2005), 13-month-olds (Surian et al. 2007), and even 7-month-olds (Kovács et al. 2010). This early manifestation of false belief understanding has been called “implicit” (Ruffman et al. 2001), because the infants are not explicitly aware of the false belief of the protagonist. Despite the fact that their behavioral responses indicate sensitivity to the protagonist’s false belief, however, they answer incorrectly when asked for a direct statement of it (Clements and Perner 1994).

Whether or not the SR-FBT findings should indeed be interpreted in terms of false belief has been hotly debated (Perner and Ruffman 2005; Ruffman and Perner 2005; Csibra and Southgate 2006; Sirois and Jackson 2007; Herschbach 2008). On the one hand, ENAC proponents argue that we can explain the SR-FBT results without assuming genuine false belief understanding on the part of the infant (e.g., Hutto 2011). However, such interpretations tend to ignore the role of offline processing and do not address how explicit false belief understanding enters into the picture in later developmental stages. On the other hand, COG proponents claim that SR-FBT results are proof of genuine false belief understanding (Surian et al. 2007; Baillargeon et al. 2010). This, however, seems to give rise to a developmental paradox: if young infants already understand false belief, then why do they fail the ER-FBT?

By applying DEC to false belief understanding, we think that some of these problems can be avoided. In what follows, we show that rephrasing the issue in terms of offline and online processes provides us with an explanation of false belief understanding as an “implicit” ability that starts out as grounded in basic online processes, albeit already partly decoupled, and develops into a more sophisticated “explicit” ability that relies on offline processes to a much larger extent.

False belief understanding and decoupling

Most accounts of false belief understanding assume that infants have a default tendency to attribute their own (true) beliefs to other agents (Leslie et al. 2004) or to respond on the basis of their own knowledge (Birch and Bloom 2007; Carlson and Moses 2001; Mitchell 1996; Russell 1996). In order to pass the ER-FBT, so it is thought, infants have to be capable of decoupling, i.e., of taking their own reality-congruent perspective offline (Scott and Baillargeon 2009).

Baillargeon et al. (2010), for instance, argue that this decoupling ability is precisely what makes the difference between the task demands of ER- versus SR-FBTs: while the SR-FBT only requires false belief representation, the ER-FBT additionally involves response inhibition (when selecting a response, children must inhibit any prepotent tendency to answer the test question based on their own knowledge), and response selection (when asked the test question, children must access their representation of the other’s false belief to select a response).

However, this seems to ignore the fact that the SR-FBT involves offline processing as well, albeit of a less demanding kind. Several SR-FBTs (Onishi and Baillargeon 2005; Surian et al. 2007; Song and Baillargeon 2008; Träuble et al. 2010; Kovács et al. 2010) show that very young infants already understand that the visual perspective of another agent can be different from their own. For example, Southgate et al. (2007) employed an anticipatory looking SR-FBT in which 25-month-olds observed how a protagonist witnessed a puppet bear who hid a ball in one of two boxes. Then the protagonist got distracted and turned away from the scene. Meanwhile, the bear removed the ball from its original hiding place. Southgate et al. (2007) found that most 25-month-olds correctly anticipated the protagonist’s behavior (i.e. where she would search for the ball on her return) and looked at the location where she falsely believed the ball to be hidden.

Although the experiment by Southgate et al. (2007) does not require infants to deal explicitly with differences in belief, it does require them to process differences between the visual information available to themselves and the visual information available to the other agent. This can only be accomplished offline, since the other’s visual information is not directly available to the infant and needs to be represented by her. Therefore, already this SR-FBT involves a capacity for decoupling from one’s own online processing of visual information and processing offline a representation of the visual information accessible to another agent. Yet, the roles of decoupling and offline processing are still limited. The infant largely relies on online visual information, and only has to process offline the other agent’s representation of the location of a single object. More difficult versions of the SR-FBT place stronger demands on offline processing. For example, Song and Baillargeon (2008) conducted an experiment in which infants had to represent the visual representation of another agent with respect to both the location and the identity of two objects. Among SR-FBTs, we can thus distinguish between more or less demanding versions requiring more or less decoupling respectively.

Crucially, the ER-FBT takes this decoupling further. First, it requires the infant to deal with more abstract task elements (e.g., a cartoon or a short story instead of the performing real-life agents and objects featuring in the SR-FBT) that have to be processed independently of the infant’s direct environment (like a desk). This makes the ER-FBT more difficult than its more “pragmatically natural” counterpart, as Bloom and German (2000) have pointed out.

Second, the ER-FBT requires infants to not only represent but meta-represent. That is, infants have to (a) process offline the other agent’s propositional attitude towards the object (i.e., her false belief about its location), and (b) come up with a verbal prediction of the agent’s behavior on the basis of her belief. The decoupling required in (a) has been demonstrated to place increasing demands on executive functioning: several studies have found robust correlations between ER-FBT performance and response inhibition (Perner and Lang 1999; Cole and Mitchell 2000; Carlson and Moses 2001), and ER-FBT performance and working memory (Carlson et al. 2002; Hala et al. 2003; Perner et al. 2002). Concerning (b), note that verbal interaction between infant and experimenter has been reported to contribute to infants’ difficulties with the ER-FBT (cf. Southgate et al. 2007). Indeed, many studies have found strong correlations between linguistic competence and ER-FBT performance (Dunn et al. 1991; Astington and Jenkins 1999; Gale et al. 1996; De Villiers and De Villiers 2000; Watson et al. 2002; Farrar and Maag 2002). A number of hypotheses have been advanced to explain why children have more difficulty with FBTs that involve linguistic interaction. Some researchers propose that children need to master its semantics (Moore et al. 1990), whereas others argue that what is required is getting a handle on its syntactic structure (Hale and Tager-Flusberg 2003; Lohmann and Tomasello 2003). DEC is not committed to any of these hypotheses in particular. However, note that it is very well possible that the ER-FBT requires a stronger form of decoupling precisely because it involves language. There is much more to be said about how executive function, memory, linguistic and other abilities scaffold the development of false belief understanding. This, however, will have to wait for another occasion.

Given the SR- and ER-FBT findings, some have argued that we should consider the existence of two systems for false belief understanding. Apperly and Butterfill (2009), for example, postulate a cognitively efficient but inflexible “minimal” Theory of Mind to explain the SR-FBT results, and a flexible but cognitively demanding “full-blown” Theory of Mind to explain the ER-FBT findings. Such a proposal fits well with DEC’s emphasis on the development of offline processing as cognitively demanding, but also providing greater flexibility. However, the exact relation between the two systems suggested by Apperly and Butterfill remains unclear: rather than offering a unified account, they suggest keeping whatever explanations we already have by making room for each in a separate system. Applying DEC may be enlightening in this context: it suggests that online mindreading abilities are not simply a precursor of (and supported by a different system than) a full-blown Theory of Mind but continue to be used where possible. Whether or not this is correct is, of course, an empirical question we cannot settle here (but see Aschersleben et al. 2008, Sodian 2010 for discussions about the relationship between the abilities involved in SR- and ER-FBTs).

We hope to have shown in this section that (a) even very basic social capacities such as implicit false belief understanding already involve decoupling, (b) once DEC is adopted, there is no cognitive gap between implicit and explicit false belief understanding since both can be understood in terms of different degrees of decoupling, and (c) the ER-FBT involves stronger offline processing than the SR-FBT. By acknowledging the importance of offline processing, DEC is able, unlike ENAC, to give an account of false belief development “all the way up”. At the same time, DEC’s focus on the interplay between online and offline processing allows it to avoid a number of potential problems for standard COG “top-down” analyses of this phenomenon in terms of strong belief representations.

One may object, of course, that we do not actually provide a full dynamical story, a concrete dynamical system, for false belief understanding here. Neither have we shown what exactly the variables in the coupled system are that get constrained once the system decouples. This, however, is simply not possible as long as we do not have a specific dynamical model to look at and know what its variables stand for. It has not been our intention to provide such a dynamical model here. Rather, our aim has been to draw attention to the potential that DST bears as a modeling tool, and to argue that it squares well with the philosophical view of cognition that we have developed in the previous sections.

As a bit of consolation, though, let us point to a dynamical model for the A-not-B task proposed by Smith and Thelen (2003). In this task, infants are required to track the location of a toy the experimenter puts in one of two locations several times in a row. On the critical trial, the experimenter then chooses the other location. Infants up to about 12 months of age continue to look for the toy in the first location on that trial, despite having observed the experimenter putting the toy in the second location. Although not a false belief task, the A-not-B task may also be interpreted as testing for a basic form of decoupling, since infants are required to inhibit the reaching behavior trained in previous trials in order to conform to the new situation and successfully retrieve the toy from the new location. If the transition from a stage where infants fail to do so to a stage where they successfully decouple can adequately be modeled by DST, then it seems plausible to assume this also works for the development of false belief understanding. Future research might be able to provide a concrete dynamical model for (the development of) false belief understanding. Although we are in no position to do that here, we may, drawing on our above considerations of dimensionality of more or less constrained dynamical systems, speculate that coupled, fluid and flexible systems primarily supporting online cognition will best be modeled by high-dimensional dynamical systems with many variables. Whereas systems with some decoupling (those passing the SR-FBT) may best be modeled by assigning some of these variables fixed values, highly decoupled systems (those passing the ER-FBT) may best be modeled by assigning many or most variables fixed values, resulting in heavily reduced dimensionality. Further research will be required to examine the extent to which the theoretical framework we proposed here can be used to account for the full range of socio-cognitive abilities.