Introduction

Starting from the 1970s, recordings from single cells in the hippocampal formation of the rodent brain resulted in the discovery of cells supporting spatial representations. The first of these to be discovered were place cells — hippocampal neurons that increase their firing rates when the animal is at a specific location (O'Keefe and Dostrovsky 1971). The ‘preferred’ location of a place cell is represented by an area of increased activity called its place field. Another group of spatial cells, called head direction cells, signals the direction in which the animal is facing. Head direction cells were first identified in the rat postsubiculum and later in several other areas of the brain (Taube et al. 1990). Finally, grid cells — found in the rodent entorhinal cortex — produce a grid-like firing pattern, with varying spatial frequency and phase across cells, as the animal travels over long distances (Hafting et al. 2005). Populations of grid cells have been shown to provide a neural odometry function based on self-motion cues (Fiete et al. 2008; Rowland et al. 2016). Thus, grid cells, head direction cells and place cells together support a positioning and navigation system based on the measurement of the animal’s own movements (Moser et al. 2015). The underlying process is sometimes called path integration, as it relies on the integration of acceleration and velocity signals to keep track of the animal’s position and orientation (Etienne 1992). Path integration inherently accumulates errors, causing the position estimate to drift. Mammals are thought to use landmarks in the environment to correct path integration errors and ‘recalibrate’ their navigation system (Etienne and Jeffery 2004; Etienne et al. 1996).
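To make the drift-and-recalibration argument concrete, the following minimal sketch (a toy illustration, not a biological model; the noise level and reset interval are arbitrary assumptions) integrates noisy velocity samples in two dimensions and optionally resets the estimate at a 'landmark':

```python
import numpy as np

rng = np.random.default_rng(0)

def path_integrate(true_velocities, noise_sd=0.05, landmark_every=None):
    """Dead-reckon position from noisy velocity samples (dt = 1).

    Each step adds Gaussian noise to the velocity estimate, so the error
    of the integrated position grows with distance travelled. If
    `landmark_every` is set, the estimate is reset to the true position
    at that interval, mimicking landmark-based recalibration.
    """
    true_pos = np.zeros(2)
    est_pos = np.zeros(2)
    errors = []
    for t, v in enumerate(true_velocities, start=1):
        true_pos = true_pos + v
        est_pos = est_pos + v + rng.normal(0.0, noise_sd, size=2)
        if landmark_every and t % landmark_every == 0:
            est_pos = true_pos.copy()          # 'recalibrate' at a landmark
        errors.append(np.linalg.norm(est_pos - true_pos))
    return np.array(errors)

velocities = rng.normal(0.0, 1.0, size=(1000, 2))
print("final error, no landmarks:", path_integrate(velocities)[-1])
print("final error, landmarks   :", path_integrate(velocities, landmark_every=50)[-1])
```

Running the sketch shows the unbounded growth of error under pure dead reckoning and the bounded error when periodic landmark resets are available.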

The discovery of place cells, head direction cells and grid cells supported the long-standing idea of a cognitive map — an allocentric spatial representation of the environment in the mammalian brain (Tolman 1948; O'Keefe and Nadel 1978). In an allocentric representation, locations and relationships are expressed relative to an external coordinate system or reference. In contrast, egocentric representations are expressed relative to the observer and its sensors (Klatzky 1998). Animals experience external stimuli in egocentric coordinates, and how these are referenced and converted into an allocentric representation remains an open question (Burgess 2006; Filimon 2015; Marchette et al. 2014; Epstein 2008).

Further studies on spatial representations revealed a wider network of spatially responsive cells, including, for example, border cells and boundary vector cells, which respond specifically to boundaries in the environment (Moser et al. 2015; Lever et al. 2009; Savelli et al. 2008). In this network, grid cells in the medial entorhinal cortex (MEC) are considered to provide the primary inputs triggering the location-based responses of place cells. However, several studies have shown that place cell responses are also affected by external sensory information and behavior (Muller 1996; Moser et al. 2015; Eichenbaum et al. 1999). In the literature, such information came to be known as context or contextual inputs, while the information provided by grid cells is referred to as metric, owing to the underlying translational and angular measurements. The lateral entorhinal cortex (LEC) is considered to provide contextual inputs to the hippocampus, as LEC neurons are modulated by olfactory and visual information as opposed to the spatial modulation of MEC neurons (Witter et al. 2017; Scaplen et al. 2017). Furthermore, the LEC seems to encode not only the existence of objects and landmarks, but also their egocentric locations (Knierim et al. 2014).

The influence of contextual inputs on place cell responses is more complex than a simple resetting of the path integration system using landmarks. Several experiments discussed in the next section point to a more elaborate mechanism and demonstrate place cell responses that do not correlate directly with the movement of landmarks or other changes in the environment. Overall, the complexity of place cell firing supports the idea of a spatial representation that is “more than space” and more like rich memories and experiences that are connected to a place (Eichenbaum et al. 1999). However, the exact nature of contextual inputs and their contribution to place cell responses are not fully understood (Jeffery 2007).

In this paper, I argue that attentional effects along the ventral visual pathway and in the parahippocampal region can explain non-spatial and unpredictable place cell responses. I further suggest that attentional control provides a feedback mechanism by simultaneously guiding the retrieval of spatial memories (attention to memory) and new sensory experience (attention to stimuli), thereby contributing to the stability of spatial representations despite the richness and variability of the sensory environment. For the purpose of the current discussion, I specifically focus on vision as it provides the richest sensory modality for humans and its underlying attention mechanisms are extensively studied in humans and non-human primates.

In the next section, I summarize the findings on non-spatial place cell phenomena and the models proposed to explain them. Then, I briefly discuss attention mechanisms affecting the collection and processing of visual information with reference to the extensive literature on visual attention and memory. Finally, I review the evidence of attentional effects on areas processing and relaying visual information to spatial cells and develop a conceptual model for attention-based integration of visual context in spatial representations. I conclude with a discussion on future directions and potential experimental studies.

Non-spatial place cell responses and remapping

In their early experiments, O’Keefe and colleagues already reported non-spatial and unpredictable responses of place cells (O'Keefe 1976; O'Keefe and Conway 1978; Muller 1996). For example, some place cells could have place fields in multiple environments; some fired in the presence of one or two cues, while others were affected by more cues in more complex ways. On the other hand, significant landmarks in an environment could be changed or removed without a large effect on place cell responses (O'Keefe and Conway 1978; Muller and Kubie 1987). One of the criteria determining the presence of a place field at a certain location was behavioral relevance. For example, place fields changed with the introduction of obstacles or changing boundaries in an environment, and they accumulated around reward or goal locations (Muller and Kubie 1987; O'Keefe and Burgess 1996). Other behavioral effects were observed in a virtual corridor experiment, where the direction of travel and anticipation of goal locations changed place cell responses (Battaglia et al. 2004). In another experiment, four rats developed significantly different place fields for the same environment, further demonstrating the variability and individuality of place cell responses (Skaggs and McNaughton 1998).

Changes in place cell responses usually result in a complete or partial reconfiguration of their place fields, called remapping (Muller and Kubie 1987). In a complete remapping, groups of interconnected cells (place cell assemblies) can switch from one set of place fields to another, indicating that these cells participate in multiple representations (Jeffery 2011; Moser et al. 2015; Jeffery 2007). On the other hand, partial remapping, where some cells change their place fields and others do not, remains more challenging to explain.

Despite the apparent influence of the environment, place cells do not depend on sensory inputs to function. Place fields are sustained in the dark and without any visual or olfactory cues, and they can occur almost instantaneously in a novel environment. However, they are initially not stable and gradually improve with experience in a new environment (Wilson and McNaughton 1993; Kim et al. 2020), indicating a temporal accumulation of sensory information contributing to spatial representations.

Several efforts have been made to model and explain the complex influence of sensory context on place cell responses. Skaggs and McNaughton described a conjunction of spatial and non-spatial information to explain apparent partial remapping of place cell responses. According to their conjunctive coding idea, hippocampal cells receive inputs from a spatial map layer as well as a context layer, such that changes in the set of active units in the context layer give rise to different activation patterns in the hippocampal units (Skaggs and McNaughton 1998). More recently, Hayman and Jeffery proposed the contextual gating model, in which they simulated the combination of inputs from the MEC and the LEC at the dendrites of place cells (Jeffery 2007, 2011; Hayman and Jeffery 2008; Jeffery et al. 2004). A key aspect of contextual gating is that place cells continue to function in the absence of contextual inputs, but these inputs play a ‘gating’ role when they are present. In another modeling study, Burgess and colleagues focused on combining the inherently different coordinate frames of contextual and metric inputs. They proposed a model of spatial memory in which egocentric representations (e.g., egocentric directions to landmarks) in the parietal cortex are transformed into allocentric representations in the medial temporal lobe, with the retrosplenial cortex (RSC) acting as a transformation circuit (Byrne et al. 2007; Bicanski and Burgess 2018). In humans, the role of the RSC and the parahippocampal place area (PPA) (Epstein and Kanwisher 1998) in encoding ‘spatial layout’ relative to environmental features has been demonstrated experimentally using fMRI (Epstein 2008; Marchette et al. 2014). Similarly, in tracing studies, the RSC was shown to be the likely ‘bridge’ between the parietal cortex and the hippocampus (Wilber et al. 2015).
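The conjunctive coding and contextual gating ideas can be summarized in a toy computation. The sketch below is only illustrative: the unit counts, random weights, threshold and the multiplicative gating rule are my simplifying assumptions, not parameters taken from the cited models. A ‘place cell’ here fires when its spatial (MEC-like) drive is strong enough, and contextual (LEC-like) input, when present, gates which cells respond:

```python
import numpy as np

rng = np.random.default_rng(1)

N_PLACE, N_SPATIAL, N_CONTEXT = 20, 50, 10
W_spatial = rng.random((N_PLACE, N_SPATIAL))   # MEC-like metric inputs
W_context = rng.random((N_PLACE, N_CONTEXT))   # LEC-like contextual inputs

def place_activity(spatial_input, context_input=None, threshold=0.6):
    """Toy contextual gating: metric drive is required, context modulates it.

    With no context (e.g., darkness) cells fire from metric drive alone;
    when context is present it multiplicatively gates the metric drive,
    so a change of context can remap which cells are active.
    """
    drive = W_spatial @ spatial_input
    drive = drive / drive.max()
    if context_input is not None:
        gate = W_context @ context_input
        drive = drive * (gate / gate.max())
    return (drive > threshold).astype(int)

location = rng.random(N_SPATIAL)          # the same physical location throughout
context_a = rng.random(N_CONTEXT)         # one set of contextual cues
context_b = rng.random(N_CONTEXT)         # a changed set of cues

print("no context:", place_activity(location).sum(), "cells active")
print("context A :", place_activity(location, context_a).sum(), "cells active")
print("context B :", place_activity(location, context_b).sum(), "cells active")
```

In this caricature, the same location yields different (and partially overlapping) active subsets under different contexts, while activity persists when context is absent, which is the qualitative behavior the gating account is meant to capture.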

The conceptual model proposed in this paper is consistent with previous proposals in terms of the joint and variable contributions of metric and contextual information. However, it further introduces attention as a key controlling function in multiple stages upstream of the hippocampus, modulating sensory inputs based on task goals, as well as the animal’s memory and expectations from its perceived location.

Attention

Attention is one of the fundamental processes of the brain, playing a critical role in perception, cognition and action (Johnson and Proctor 2004). It can be defined as “the process by which organisms select a subset of available information upon which to focus enhanced processing and integration” (Ward 2008). Attention involves mechanisms for orienting sensory receptors toward relevant stimuli, searching for a target, and filtering information (including both attenuation of unattended stimuli and enhancement of attended stimuli). Research on attention has grown substantially beyond its early philosophical foundations, with multiple theories explaining its underlying mechanisms (Johnson and Proctor 2004; Ward 2008; Broadbent 1958; Treisman 1964; Wolfe 1994; Posner 1980). In the next section, I review attentional effects in vision as a key sensory contribution to spatial representations.

Visual attention

Mammals have advanced eyes and wide visual fields, giving the impression of an almost complete perception of one’s visual environment at all times. While this may be possible in certain simple settings with few and highly salient stimuli, in most real-world situations, visual information is too rich and cluttered to be processed in its full detail across the visual field. Therefore, complex physical and mental attention mechanisms focus processing resources on important and task-relevant information, while avoiding irrelevant information and distractors (Carrasco 2011).

Attention plays a key role in how we perceive our visual environment. As an example, consider two individuals walking the exact same path in a city for the first time. At the end of their tour, if we could analyze their mental representations of the city, the results would be significantly different due to the different visual, auditory and other sensory information that they attended to during their experience. If both individuals were asked to draw and annotate a map of their path on paper, each map would have different levels of detail and accuracy in different parts, and the two individuals’ actual memory or experience of the city would be very different. Furthermore, the way they could recall different places would also be different based on how, or using which features and landmarks, these places were originally registered. Therefore, if some of these landmarks were changed or moved at a later time, the effect of these changes on the two individuals would again be different based on whether or not they had registered these landmarks in the first place.

Due to the complexity of such experiences and the underlying attention mechanisms, “where we look” and “what we see” have been key questions that kept researchers busy for more than half a century (Chun and Wolfe 2005; Schütz et al. 2011; Schall 2013; Carrasco 2011; Egeth and Yantis 1997; Yarbus 1967). Early models of attention were based on the idea of information filtering or attenuation of irrelevant information (Treisman 1964; Broadbent 1958). Later, motivated by physiological evidence of functional specialization in the visual cortex (Zeki 1976), Treisman’s feature integration theory (FIT) proposed a first stage of parallel feature processing followed by a second stage of combination via serial attention to target objects (Treisman and Gelade 1980). Duncan and Humphreys extended the FIT by describing attention as a competition between visual inputs to access limited perceptual resources based on target–nontarget similarity (Duncan and Humphreys 1989). For example, in search tasks, the difficulty and efficiency of the task depended on the similarity between the search target and the stimuli. In subsequent models, Wolfe and colleagues extended the idea of parallel preattentive processing of features to guide attention toward candidate targets (Wolfe 1994, 2007). Desimone and Duncan further developed the competition idea into a biased competition for processing resources, where bottom-up and top-down attention criteria are used to bias the competition toward task-relevant and important stimuli (Desimone and Duncan 1995). The biased competition idea is supported by more recent fMRI studies, which suggest that maintaining a search item in working memory enhances the processing of the matching visual input, while other non-relevant or “accessory” items hamper task performance (Peters et al. 2012).

Attention is usually categorized into different types based on how it is driven and deployed. Attention is called bottom-up or exogenous when it is driven by highly salient visual features in the environment (such as a sudden flash of light or a sudden motion), and top-down or endogenous when it is self-generated based on interests and task-related goals of the individual (such as searching for an object with a specific shape or color) (Chun and Wolfe 2005; Carrasco 2011; Egeth and Yantis 1997). Recently, effects of past viewing experience or selection history have also been identified as a separate mechanism, possibly operating via visual statistical learning, and distinct from goal-driven effects (Awh et al. 2012; Theeuwes et al. 2022; Theeuwes 2019).

Low-level visual features (e.g., colors, edges, brightness, motion), discontinuities and irregularities, as well as complete objects or faces, can all attract attention (sometimes referred to as feature-based and object-based attention). While a sudden onset of motion or light can be hard to suppress, other salient features, such as color, brightness, and lines, can be ignored voluntarily, for example, based on the search task and their relationship to the search target (Chun and Wolfe 2005). When attention results in a location-based sampling of the visual environment around us, it is called spatial attention. Spatial attention can be deployed physically, by moving the eyes and the head toward the target (overt attention), or mentally, by shifting attention to a peripheral region or object while the eyes remain fixated elsewhere (covert attention).

Finally, attention can also be directed internally, to memories, rather than to external stimuli. This reflective form of attention (also called attention to memory) allows selective retrieval of memories of past experiences (Ciaramelli et al. 2008; Long et al. 2018; Cabeza et al. 2008).

Are place cells attentive?

If attention can be defined as a bias toward certain stimuli based on relevance and goals, then the preceding account of place cell responses shows several characteristics of attention-based behavior. For example, place fields seem to be influenced by the animal’s goals and activities in an environment, which essentially provide the underlying criteria or modalities of attention. Place fields accumulate around environmental landmarks and behaviorally relevant locations, indicating a bias toward registering these types of cues. They also seem to be individualized based on past experience, such that complete firing patterns can be recalled or triggered without any change in the environment or the animal’s position, indicating a memory-based attentional state.

Attentional modulation of hippocampal representations has also been demonstrated in experimental studies. Fenton et al. showed that the overdispersion of place fields was reduced when rats were required to focus on a subset of stimuli in a spatial task (such as selectively using distal features for navigation) (Fenton et al. 2010). Kentros and colleagues analyzed place field stability in mice and suggested that the long-term stability of place fields depends on the behavioral relevance of the spatial context, and that attention-like cognitive processes play a role in establishing this link (Kentros et al. 2004; Muzzio et al. 2009). More recently, and in line with the above observations, Scaplen et al. showed that visual cues were processed differently in the hippocampus based on their relevance as navigational landmarks or as context (Scaplen et al. 2014). Such relevance was determined by the nature of the cue (e.g., a distal landmark or a small object on the floor), as well as its salience and the animal’s experience. Further experiments identified the LEC as the locus of this processing (Scaplen et al. 2017).

These results show that attentional effects are present at all levels during the collection and processing of sensory information and, eventually, in the modulation of place cell responses. It follows that, especially in more complex environments, spatial memories must be influenced by the subject’s interests and goals in the environment (or parts of it) as much as by the environment itself.

Motivated by these observations, an attentional account of spatial representations may explain complex place cell responses and the interaction between attention and long-term spatial memories. In the next two sections, I follow visual information from the retina along the ventral visual pathway and into the MTL, reviewing at each step the evidence of attentional effects on visual context contributing to spatial representations in the hippocampus. The first section below will focus on spatial attention and the spatiotemporal sampling of the visual context. The next section will review attentional effects in the primary (V1) and the extrastriate areas of the visual cortex and the inferior temporal (IT) cortex, as well as the parahippocampal cortex (in primates). In both sections, I will gradually incorporate these findings into a conceptual model describing the notion of attentional control in spatial representations.

Spatiotemporal sampling of the visual context

On the primate retina, the density of daytime photoreceptor cells (cones) is significantly higher in the central fovea region than in the periphery (Curcio et al. 1990). This requires the eyes to be oriented so as to capture task-relevant information at higher resolution, resulting in rapid eye movements called saccades (Yarbus 1967). Between saccades, the eyes are fixated on the attended target (although not completely motionless), allowing detailed processing of the retinal image. Despite frequent eye movements and the resulting changes in the retinal image, we perceive a clear and stable image of the world without any smear or motion blur. Interestingly, perception of both brief static stimuli and moving stimuli of low spatial frequency is possible during a saccade (Castet et al. 2002; Binda 2018; Burr and Ross 1982). However, complex neural mechanisms, including selective suppression of motion detection, predictive remapping of receptive fields (or attentional pointers, as proposed in Cavanagh 2010), and transsaccadic memory, are employed to achieve visual stability across saccades (see reviews on the neural mechanisms of visual stability: Melcher 2011; Wurtz 2008).

The fact that saccadic eye movements are guided by attention has been established in several psychophysical and neurobiological experiments. In his early experiments with cueing tasks, Posner demonstrated that each saccade is preceded by a covert shift of attention to the saccade target, followed by the saccade and fixation (Posner 1980). Later, the link between attention and saccades was established in dual-task experiments measuring both saccadic and perceptual performance. In these tasks, perceptual performance increased when the cued saccade target coincided with the location of the stimulus (Kowler et al. 1995; Deubel and Schneider 1996). More recently, Castet et al. investigated the dynamics of attentional deployment and observed a build-up of perceptual performance at the saccade target prior to a saccade (Castet et al. 2006). In neurobiological studies, key areas involved in controlling eye movements in the primate brain, such as the frontal eye fields (FEF) in the prefrontal cortex (Schall 2004, 2013), the lateral intraparietal area (LIP) in the parietal cortex (Schall 2013; Gottlieb 2007) and the superior colliculus (SC) in the midbrain (Matsumoto et al. 2018), have been shown to be influenced by attention through feature tuning and the adjustment of receptive fields during spatial attention. There is growing evidence that “spatial attention shares spatial maps with saccade control centers in areas like SC, LIP, FEF” (Cavanagh 2010) and that the activity in these maps not only points to the next saccade target but also enhances the processing of information (e.g., detection of different features) from the saccade target (see reviews on the neurobiology of attention: Corbetta and Shulman 2002; Moore 2006).

In free viewing conditions, spatial attention can be drawn to salient image features, such as color, brightness, and motion, representing a bottom-up (exogenous) control mechanism. Saliency-based computational models building on the feature integration theory and Koch and Ullman’s early models have been widely popular in the attention and computer vision literature (Koch and Ullman 1985). However, purely bottom-up saliency correlates poorly with real human fixations unless modified by some top-down heuristics (Schütz et al. 2011; Boccignone et al. 2019; Tatler et al. 2011). It is now widely accepted that top-down attention criteria (or attentional sets) play a key role in guiding eye movements. Furthermore, during targeted search or other everyday tasks, eye movements are guided almost exclusively by top-down (endogenous) criteria in a highly purposive and task-oriented manner (Chun and Wolfe 2005; Schütz et al. 2011; Triesch et al. 2003; Land and Hayhoe 2001). In this case, bottom-up saliency (if present) may act as a distraction from the task, but allows the individual to detect unexpected and highly salient stimuli. The purposive nature of eye movements is also supported by the recently proposed role of visuospatial working memory (VSWM) in guiding saccades and maintaining object and location continuity across saccades (Van der Stigchel and Hollingworth 2018). Similarly, visuospatial memory retrieval has been shown to be facilitated when subjects looked inside a square congruent with the spatial arrangement of the to-be-recalled objects during the encoding phase (Johansson and Johansson 2014).
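To make the interplay of bottom-up saliency and top-down attentional sets concrete, the sketch below selects fixation targets from a combined priority map. The linear combination, the random stand-in maps and the mixing weight are simplifying assumptions for illustration, not a claim about how cortical priority maps are actually computed:

```python
import numpy as np

rng = np.random.default_rng(2)

H, W = 32, 32
bottom_up = rng.random((H, W))                  # stand-in bottom-up saliency map
feature_maps = {"red": rng.random((H, W)),      # stand-in feature channels
                "vertical": rng.random((H, W))}

def next_fixation(attentional_set, alpha=0.2):
    """Combine bottom-up saliency with a top-down attentional set.

    `attentional_set` weights the feature channels relevant to the task;
    `alpha` controls how much raw saliency still contributes. Returns the
    (row, col) of the highest-priority location, i.e., the next fixation.
    """
    top_down = sum(w * feature_maps[f] for f, w in attentional_set.items())
    priority = alpha * bottom_up + (1 - alpha) * top_down
    return np.unravel_index(np.argmax(priority), priority.shape)

# The same scene yields different fixation sequences under different task sets.
print("searching red things     :", next_fixation({"red": 1.0, "vertical": 0.0}))
print("searching vertical things:", next_fixation({"red": 0.0, "vertical": 1.0}))
```

The point of the example is simply that, for an unchanged scene (and unchanged bottom-up saliency), different attentional sets produce different fixation targets, which is the assumption carried forward into Fig. 1.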

Spatial attention can also be deployed covertly, without eye movements, by voluntarily shifting attention to peripheral targets while maintaining fixation at a central point (Carrasco 2011). As discussed above, attention is deployed to saccade targets before the eyes move. Shifts of attention (without a saccade) can also serve other goals, such as evaluating potential saccade targets or monitoring the environment. Therefore, any fixation target is essentially viewed and attended to at least once in the periphery before it is fixated. Despite this, studies on visual memory show that the reduced receptor density (and hence image resolution) in the periphery and the transient nature of raw sensory information significantly limit the contribution of peripheral vision to long-term scene representation (Henderson and Castelhano 2005; Hollingworth 2006). Unless fixated, only high-level information (e.g., shape structure, parts, orientation) survives a saccade, and only for a very short time (De Graef and Verfaillie 2002; Aagten-Murphy and Bays 2018; Deubel et al. 2002; Henderson and Castelhano 2005). Precise visual information from the entire visual field is highly volatile and lasts for up to a few hundred milliseconds (generally 80 to 300 ms), while a more durable but limited visual short-term memory (VSTM) can hold abstract information for durations on the order of seconds (see Hollingworth 2006 for a review of visual memory systems).

Peripheral vision plays a role in guiding attention and may accelerate the processing of the saccade target (known as the transsaccadic preview effect). Recent reviews point to an intricate relationship, in which foveal and peripheral vision serve opposing goals of focused, detailed processing vs. texture-like perception of a larger visual field, but are otherwise intertwined to optimize each goal (Stewart et al. 2020). Nevertheless, the high acuity and contrast sensitivity of the fovea mean that our understanding and long-term memory of a scene rely primarily on fixations and the processing of the attended target (Rensink et al. 1997; Hollingworth 2006; Hollingworth and Henderson 2002; Irwin and Andrews 1996; Stewart et al. 2020). A recent fMRI study provides evidence for a causal relationship between the number of fixations and memory recall, further supporting the role of the attended target in visual memory tasks (Fehlmann et al. 2020).

The prominence of the foveal image on the retina and the guidance of eye movements under top-down attentional control suggest a spatiotemporal and largely task-oriented sampling of the visual context during visual exploration. A model of scene perception and memory supporting this idea was described by Hollingworth and Henderson. Reviewing theories of visual short-term memory (VSTM) and object files (Kahneman et al. 1992), they describe a model in which high-level abstract visual representations of attended objects are accumulated from each fixation, indexed to a spatial position and consolidated into long-term memory (LTM) (Hollingworth and Henderson 2002; Henderson and Hollingworth 2003). Their model supports the notion of abstract visual information being retained across saccades and transferred to LTM under attentional control.
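A simple data-structure reading of that account is sketched below: abstract object codes from fixations accumulate in a capacity-limited short-term store, indexed by position, and are later consolidated into a long-term store. The class, its capacity and the displacement rule are my own schematic assumptions, not code or parameters from the cited work:

```python
from dataclasses import dataclass, field

@dataclass
class SceneMemory:
    """Accumulate abstract object codes from fixations, indexed by position."""
    vstm: dict = field(default_factory=dict)   # volatile, capacity-limited store
    ltm: dict = field(default_factory=dict)    # durable long-term store
    vstm_capacity: int = 4

    def fixate(self, position, object_code):
        # Only the attended (fixated) object receives an abstract code.
        self.vstm[position] = object_code
        if len(self.vstm) > self.vstm_capacity:      # oldest entry is displaced
            self.vstm.pop(next(iter(self.vstm)))

    def consolidate(self):
        # Surviving short-term entries are transferred to long-term memory.
        self.ltm.update(self.vstm)
        self.vstm.clear()

memory = SceneMemory()
for pos, obj in [((3, 5), "lamp"), ((8, 1), "door"), ((2, 9), "plant")]:
    memory.fixate(pos, obj)
memory.consolidate()
print(memory.ltm)   # {(3, 5): 'lamp', (8, 1): 'door', (2, 9): 'plant'}
```

The essential point carried into the model below is the bookkeeping itself: what ends up in long-term memory is determined by which objects were attended and where they were located, not by the full retinal input.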

Overall, these results are consistent with the gradual stabilization of place fields in a new environment (Wilson and McNaughton 1993; Kim et al. 2020), and they suggest that this process takes place in a task-oriented manner and largely based on attended targets.

To summarize the effects of early spatial attention mechanisms on spatial representations, the illustration in Fig. 1 shows an animal visiting the same physical location twice with different sets of top-down, task-oriented attention criteria. During each ‘episode,’ although the environment, the animal’s position, and bottom-up saliency remain the same, saccades are guided differently due to the two attentional sets (shown by two separate sequences of fixations). This results in two different streams of retinal information being fed into the visual pathways and eventually into the spatial representations. Considering the significance of the fovea in the formation of long-term visual memories, this information is simplified as a stream of foveal images (1, 2, 3,…) guided by the corresponding attention criteria in each episode. At each fixation, head direction and the saccade vector further provide directions to attended targets and their spatial relationships in egocentric coordinates. This information is assumed to be provided via the parietal cortex (PC) and the retrosplenial cortex (RSC) as discussed above (Wilber et al. 2015; Marchette et al. 2014; Epstein 2008). Note that the model equally accounts for covert spatial attention, where attended information would be collected from the periphery albeit at a significantly lower spatial resolution.

Fig. 1 Spatial attention guides fixations (1, 2, 3,…) and covert attention, resulting in spatiotemporal sampling of visual information and directions to attended objects based on top-down, task-oriented attention criteria in each episode

Attentional tuning and modulation in visual areas and the parahippocampal cortex

Visual information from the retina is relayed to the visual cortex via the lateral geniculate nucleus (LGN) of the thalamus. Besides being a relay center, LGN has been shown to be affected by attention so that neural responses to attended stimuli are enhanced, while responses to unattended stimuli are attenuated (O'Connor et al. 2002). In the visual cortex, a large part of early processing and feature detection continues to be dedicated to the fovea due to the topographic organization of the primary visual cortex (area V1). V1 neurons have small receptive fields and respond to simple features such as lines of different orientation, spatial frequency and color. Along the ventral visual pathway, the processing continues with areas V2, V4 and the inferior temporal (IT) cortex, where the receptive fields of cells get bigger and the features they respond to become more high-level (Kandel and Mason 1995; Zeki 1976).

Attentional effects in the visual cortex are traditionally studied by controlling covert attention to stimuli which fall either inside or outside of a cell’s receptive field (Moran and Desimone 1985). In primates, the modulatory effects of top-down attention have been observed in all areas of the visual pathways during both overt and covert shifts of attention (Motter 1993; Reynolds and Chelazzi 2004; Freiwald and Kanwisher 2004).

In the primary visual cortex of the mouse, V1 neurons were found to be modulated by the SC, potentially reflecting bottom-up saliency computations in this area, which play a role in directing overt attention (Ahmadlou et al. 2018). In primates, electrophysiological studies demonstrated increased responses of V1 neurons when stimuli in their receptive fields were attended (Motter 1993; McAdams and Reid 2005). By stimulating the FEF while recording responses from V4 in monkeys, Moore and Armstrong further demonstrated that the responses of V4 neurons to stimuli in their receptive fields changed depending on whether a saccade was being prepared toward the receptive field location (Moore and Armstrong 2003).

The nature and magnitude of attentional effects change as visual information propagates through the visual areas (Freiwald and Kanwisher 2004). For example, Muller and colleagues found that attentional modulation changed along the visual pathway from purely bottom-up effects in V1 to purely top-down control in V2 and then a combined bottom-up and top-down influence in V4 (Melloni et al. 2012).

Top-down attentional effects are particularly interesting for spatial representations as they encode task- and memory-based strategies. Recordings from macaque V4 neurons showed that the knowledge of the target in a detection task created an attentional effect which modulated their response to a stimulus array (Chelazzi et al. 2001). A similar experiment showed the same effect for macaque IT neurons, with both findings supporting the biased competition model and the top-down, memory-guided attentional effects in these areas (Chelazzi et al. 1998). Studies using event-related potentials (ERP) showed that IT neurons maintained their firing rates between the cue and search during a cued search task, leading to the hypothesis that the IT cortex is one of the sites that encode top-down attentional sets (Woodman et al. 2013; Conway 2018).

As the last visual area in the ventral visual pathway, the IT cortex plays a critical role in object recognition and memory. Due to their large receptive fields, which always include the fovea, IT neurons can respond to stimuli as complex as complete objects or faces (Miller et al. 1991; Desimone 1996). This leads to higher-level and more meaningful attention criteria being observed in this area. For example, similar to Chelazzi’s experiments with macaque monkeys, Peters et al. studied attentional effects in the fusiform face area (FFA) of the IT cortex and the parahippocampal place area (PPA) of the parahippocampal cortex. Human subjects viewed the same image under two different tasks: searching for a house or searching for a face. fMRI images showed increased activation of the PPA and the FFA, respectively, during the two tasks. As the actual stimuli remained unchanged between the two conditions, the study concluded that the activity of PPA and FFA neurons is modulated by the corresponding attention criteria (Peters et al. 2012).

The projections between the IT cortex and the MTL in primates indicate that visual information flows via the dorsal and ventral TE areas of the IT cortex and through the perirhinal and entorhinal cortices before reaching the hippocampus (Saleem et al. 2000; Buffalo et al. 2006; Ungerleider et al. 1998). Here, the perirhinal cortex acts as the interface between the ventral visual pathway and the MTL, receiving strong connections from both the IT cortex and the primate parahippocampal cortex, and relaying these inputs to the hippocampus via the entorhinal cortex (Buffalo et al. 2006).

Based on the demonstrated effects of top-down attention throughout the ventral visual pathway, up to the IT cortex, it can be expected that the encoding of information in the hippocampal formation is similarly modulated by attention. In a recent fMRI study, Aly and Turk-Browne found evidence of such attentional modulation in the hippocampus. In a series of experiments, they first identified the attentional states of the hippocampus during a cued search task for paintings and rooms. Then, they correlated the hippocampal activity to these templates during another task requiring top-down attention to similar targets. Finally, they tested the subject’s memory performance on task-relevant items from the second search task and found that memory performance was better when the attentional state of the hippocampus during encoding matched the information that was being encoded, suggesting that the hippocampus maintained a stronger representation of attended information (Aly and Turk-Browne 2016). Similarly, Scaplen et al. showed that LEC inactivation affects subsequent hippocampus activity in different ways depending on the relevance of visual cues for navigation (Scaplen et al. 2017), suggesting that LEC selectively (rather than uniformly) modulates contextual inputs to the hippocampal place cells.

Combining the above evidence in the current model, Fig. 2 shows a simplified illustration of attention-based integration of visual context into spatial representations. Here, the ‘visual context’ is assumed to consist simply of one green and one blue object, represented by their geometric and color features and having similar bottom-up saliency. The subject is at a specific location (x,y) with bearings θ and φ to the two objects. A place cell assembly is assumed to be tuned to a specific combination of metric and contextual cues. In the two episodes shown, the subject arrives at (x,y) with two different top-down attentional sets (e.g., searching for green rectangular objects or for blue triangular objects). The scene is spatiotemporally sampled based on these attention criteria, and the resulting retinal information is relayed via the LGN to the visual cortex (areas V1, V2, V4), the IT cortex, and the parahippocampal cortex, where feature-based attention further modulates visual processing differently for the two episodes. This highly modulated contextual information is then combined with metric inputs from the grid cell network within the medial entorhinal cortex (MEC) and with ego/allocentric directions to attended objects, possibly computed within the retrosplenial cortex (RSC). Consequently, place cell assemblies respond to the same location and visual context differently based on the subject’s interests and task goals and the associated attentional tuning in upstream areas.

Fig. 2 Simplified illustration of the proposed attention-based model of metric-contextual integration. Place cell assemblies respond to the same location and visual context (consisting of one green and one blue object) differently based on the attentional tuning in upstream areas

Note that the above illustration is simplified in a number of ways. First, potential attentional tuning in the hippocampal formation (e.g., in MEC, LEC, or place cells themselves) is not discussed. Secondly, only an instantaneous response is shown and both the encoding phase and the temporal accumulation of visual information across multiple fixations or multiple visits to the same environment are ignored. These topics are discussed below in the conclusion.
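The flow in Fig. 2 can also be summarized as a toy end-to-end pipeline. Everything in the sketch below (the feature dimensions, the gain model of feature-based attention, the concatenation of metric and contextual vectors, and the random readout weights) is an illustrative assumption made for the example, not a specification of the circuits involved:

```python
import numpy as np

rng = np.random.default_rng(3)

def contextual_input(objects, attentional_set):
    """Feature-based attention modeled as a gain on object features (an assumption)."""
    total = np.zeros(4)
    for features in objects.values():
        gain = 1.0 + np.dot(features, attentional_set)   # attended features boosted
        total += gain * features
    return total

def place_assembly_response(metric, context, W):
    """A place cell assembly tuned to a conjunction of metric and contextual cues."""
    combined = np.concatenate([metric, context])
    return W @ combined

# One location (x, y) with bearings theta and phi to a green and a blue object.
metric = np.array([0.4, 0.7, np.cos(0.6), np.sin(1.9)])    # (x, y, theta, phi) stand-in
objects = {"green_rect": np.array([1.0, 0.0, 1.0, 0.0]),   # [green, blue, rect, tri]
           "blue_tri":   np.array([0.0, 1.0, 0.0, 1.0])}
W = rng.random((10, 8))                                     # 10 'place cells'

search_green = np.array([1.0, 0.0, 1.0, 0.0])
search_blue  = np.array([0.0, 1.0, 0.0, 1.0])

r1 = place_assembly_response(metric, contextual_input(objects, search_green), W)
r2 = place_assembly_response(metric, contextual_input(objects, search_blue), W)
print("most active cell, episode 1:", int(np.argmax(r1)))
print("most active cell, episode 2:", int(np.argmax(r2)))
```

With identical metric input and identical objects, the two episodes produce different assembly activity purely because the attentional set changes the gain on the contextual features, which is the qualitative behavior the figure is meant to convey.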

Reflective attention as a mechanism to stabilize spatial representations

In the model developed so far, I have suggested that attentional control of sensory information may cause different spatial representations to be associated with the same physical space. However, in real-world situations, small changes in the environment (such as motion, occlusions, lighting and visibility) can dramatically change the perceived image. These changes can affect feature detection and spatial attention mechanisms, causing significantly different looking behavior. Additionally, the subject’s interests and goals in a place can change from one episode to another, resulting in different top-down attention criteria. Under such variations, and given the influence of sensory information, how can an animal maintain a stable representation of a particular environment over multiple visits and recognize it as the same place?

One possible mechanism (e.g., in contextual gating) is the weighting of contextual and metric cues based on their reliability. Such a mechanism can be enhanced by the above attentional effects, whereby certain features of the environment are selected to drive spatial representations while others are ignored or suppressed. This idea is consistent with recent observations that spatially relevant objects are processed differently from non-relevant objects (Scaplen et al. 2014), as well as with the reports of attentional modulation of hippocampal representations discussed earlier (Fenton et al. 2010; Kentros et al. 2004; Muzzio et al. 2009). However, attention-based encoding of place memories does not completely solve the stability problem. Variations in top-down attention criteria may still result in significant changes in attentional bias, causing contextual inputs to differ substantially between two or more episodes.
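One way to read the 'weighting by reliability' idea is as standard inverse-variance cue combination. The snippet below shows the generic statistical version of that principle, offered only as an illustration; it is not a claim about the mechanism used by the hippocampal system, and the example numbers are arbitrary:

```python
import numpy as np

def combine_cues(estimates, variances):
    """Inverse-variance weighting: more reliable cues dominate the combined estimate."""
    weights = 1.0 / np.asarray(variances, dtype=float)
    weights /= weights.sum()
    return weights @ np.asarray(estimates, dtype=float), weights

# e.g., a drifting path-integration estimate vs. a sharper landmark-based estimate
position, weights = combine_cues(estimates=[2.0, 2.6], variances=[0.5, 0.1])
print(position, weights)   # the low-variance landmark cue receives most of the weight
```

Under this reading, attention would act upstream of the weighting, by determining which contextual cues are sampled and encoded well enough to be reliable in the first place.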

In their model of scene perception and memory, Hollingworth and Henderson reported a similar problem from a visual perception perspective. They considered that subsequent episodes with the same scene could involve the retrieval of previous LTM representations consisting of a scene map and object codes, but how the correct scene map was retrieved from the previous episode remained an open question (Hollingworth and Henderson 2002). Such a mechanism is clearly required to prevent spatial representations from diverging between old and new episodes due to lack of correspondence of sensory information.

To answer this question, I speculate that reflective attention might play a stabilizing role, where spatial memories (including sensory information) are not entirely retrieved, but attentively recalled based on an initial perception of where the animal thinks it is (e.g., based on initial self-motion and visual/olfactory cues). These memories of an initially perceived location then set attention criteria and guide spatial attention in a way to collect further contextual information to verify or update the animal’s initial perception. In other words, the initial perception of the environment triggers selective retrieval of previous spatial memories, which then influence sensory perception through attention, thereby gradually consolidating old memories and new sensory experiences related to the same place in a feedback loop.

Figure 3 illustrates this idea. Here, sensory information is first sampled via spatial attention. This information is processed through the visual pathway under attentional control to eventually form episodic memories, which are also encoded and emphasized by attention. Later, in a new episode, these memories are recalled partly and selectively based on initial sensory experiences and current task goals. This in turn sets an expectation about the sensory context of the space, which contributes, together with task goals, to attentional tuning that guides further sensory exploration. Therefore, a feedback loop is created to resolve conflicts between where we think we are and where we actually are, and to gracefully merge earlier experiences and metric cues with new sensory information.

Fig. 3 Initial perception of a place may trigger selective retrieval of previous experiences, setting expectations and attentional sets for new sensory exploration, thereby gradually stabilizing perception in a feedback loop

Note that, although visual context is attentively sampled and encoded in spatial memories, such sampling is based on the attentional sets and saliency maps of past episodes. With experience (multiple episodes or visits to the same place), such memories can be expected to approach a complete representation (possibly still with significant and surprising gaps that we sometimes experience in everyday life). Conversely, the retrieval process employs attention to memory with attentional sets reflecting current goals and using bottom-up saliency from the current state of the environment. Since the same attentional sets and saliency maps are used for perception in the current episode, the proposed mechanism helps to reduce conflict between what is perceived and what is remembered, while also updating memories when there is a real conflict not caused by sampling (i.e., a real change in the environment between the old and new episodes).
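The proposed feedback loop can be written out schematically to make the sequence of steps explicit. The sketch below is purely a skeleton: the five callables are placeholders for the processes discussed above (initial percept, selective retrieval, attentional tuning, guided sampling, and updating), and none of them corresponds to an established computation:

```python
def spatial_episode(environment, goals, long_term_memory, n_fixations,
                    initial_percept, retrieve_memories, set_attention,
                    sample, update):
    """Schematic feedback loop: perceive -> recall -> attend -> sample -> update.

    All callables are placeholders for the processes described in the text;
    this function only fixes the order in which they interact.
    """
    percept = initial_percept(environment)                  # where do I think I am?
    recalled = retrieve_memories(long_term_memory, percept, goals)
    for _ in range(n_fixations):
        attention_set = set_attention(goals, recalled, percept)
        observation = sample(environment, attention_set)     # attention-guided exploration
        percept, recalled = update(percept, recalled, observation)
    long_term_memory.consolidate(percept, recalled)           # merge old and new experience
    return percept
```

The skeleton highlights the claim of this section: retrieval feeds the attentional set, and the attentional set in turn determines what is sampled, so memory and perception converge on a shared representation of the place rather than diverging.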

The idea of past memories being attentively retrieved and used to set expectations for sensory exploration is supported by a number of studies. First, the ability of spatial cells to generate memory-based responses was demonstrated by Miller et al., who recorded place-responsive cells in human patients during virtual navigation and later during the retrieval of navigation-related memories without actual navigation. They found that the same cells were activated during episodic memory retrieval, and that the firing pattern was similar to the activity observed during the initial encoding of the locations (Miller et al. 2013). Secondly, Ciaramelli and Cabeza showed that such retrieval of episodic memories is indeed under attentional control. In their attention to memory (AtoM) model, they proposed that the dorsal and ventral parietal cortices (DPC and VPC) are associated with attention-based retrieval of memories, for top-down (goal-driven) and bottom-up (memory-driven) attention to memory, respectively (Ciaramelli et al. 2008; Cabeza et al. 2008). Combined with Aly and Turk-Browne’s results (Aly and Turk-Browne 2016) showing attentional modulation in the hippocampus during the encoding of episodic memories, it can be argued that spatial representations are both attentively encoded and retrieved. Finally, to complete the proposed feedback mechanism, it was recently shown that V1 neurons in the mouse visual cortex shifted their responses according to the animal’s perception of distance traveled (Fournier et al. 2020). The same study found that hippocampal CA1 and V1 neurons could both be used to decode the animal’s position, and that the associated decoding errors were highly correlated. The authors provided two possible explanations, one based on the same self-motion cues feeding both areas and the other based on a feedback signal from the hippocampus to V1, possibly via the RSC.

Finally, the anterior cingulate cortex (ACC), implicated in the control of attention, decision making and conflict resolution, might provide further top-down attentional control signals. Strong connections between the ACC and the entorhinal cortex, together with the error detection function of the ACC, are thought to support error-driven learning (Ku et al. 2021; Calderazzo et al. 2021). The ACC might play a similar role during the encoding of spatial memories, supporting the comparison of memories to new sensory experiences proposed above.

Conclusion and future work

This paper presented an account of spatial representation based on attentional control of associated contextual and sensory inputs. I started with the widely accepted notion of spatial representations being both metric and sensory in nature. Then, I reviewed the evidence of attentional modulation in several visual and MTL areas and incorporated these into a conceptual model where sensory information is integrated into long-term spatial representations under attentional control, thereby creating task-oriented episodic memories of space. I further speculated that attention could play a role in stabilizing such representations by attentively retrieving past experiences and using these memories to guide new sensory exploration.

Further development of the model can provide more detailed information on the interaction of specific brain areas involved in controlling attention and in encoding and retrieving spatial memories. The nature and loci of attentional sets or memory representations guiding attention require further investigation. These are so far considered to be embedded in connection weights and in responsiveness, sensitivity and receptive fields of individual cells. Recent accounts of time- and experience-dependent responses in the hippocampal formation (e.g., in the LEC) provide new insights into the representation of time (Tsao et al. 2013, 2018; Eichenbaum 2017). These results can help to further develop the temporal aspects of the model to include gradual improvement and consolidation of old and new experiences beyond the initial ideas discussed above.

Finally, experimental verification of the proposed model requires simultaneous control of the sensory environment and the subject’s attention while a spatial task is carried out. Visual attention mechanisms in humans and other primates have been studied extensively; however, similar studies in other mammals are scarce. Conversely, spatial representations are generally studied in rodents owing to the possibility of electrophysiological measurements in freely moving animals. Species-specific aspects make it difficult to translate and combine findings from these two areas, as attempted in this paper. Although rodents do not possess a retinal fovea like primates and are considered to have limited visual abilities, a recent study found that mice do have improved visual resolution in the frontal and slightly upper part of their visual field (termed a fovea to indicate a cortical specialization), mainly due to a lower scatter of single-cell receptive fields corresponding to this region in their visual cortex (van Beest et al. 2021). Mice have also been shown to use visual selective attention and eye movements to explore their environment in a manner similar to primates (Wang and Krauzlis 2018; Meyer et al. 2020). These studies suggest that there may be opportunities in the future to study attentional effects at the neural level in rodent navigation experiments. In humans and non-human primates, advances in virtual environments and fMRI technology can facilitate experiments on visual attention and spatial representations.