1 Introduction

A theoretical pillar of vision science in the information-processing tradition is that perception involves unconscious inference.Footnote 1 The classic support for this pillar is that, since retinal inputs underdetermine their distal causes, visual perception must be the conclusion of a process that starts with premises representing both the sensory input and previous knowledge about the visible world. Call this the argument from underdetermination. In more contemporary forms this argument has often been grounded in applications of Bayesian models in vision science, and debate has then centered on how these models should be interpreted.Footnote 2 However, whatever form the argument takes, whether it goes through depends on showing that the visual processes that "solve" the underdetermination problem have qualities that are typically considered distinctive of cognition (Hatfield, 2002).

The fixation on underdetermination invites the impression that the argument and the pillar stand or fall together: if the argument tumbles, little remains to prop up the idea that visual processing involves unconscious inference. In what follows I offer an alternative, and more stable, base for the pillar. Besides underdetermination, another foundational challenge for the visual system is to track invariant features of the environment even though nothing we encounter is ever seen exactly the same way twice. This invariance problem is clearest in the case of object recognition, which requires representing objects as the same across transformations of viewpoint (DiCarlo et al., 2012). As I argue, this problem has many features that are diagnostic of inductive inference. In turn, standard explanations of object recognition posit unconscious processes in the visual system that overcome this problem. Therefore, the fact that unconscious processes are posited to explain how the visual system solves an induction problem shows that some aspects of visual processing involve unconscious inference. Call this the argument from invariance. In what follows I develop and defend this argument and conclude that it is better able to bear the weight of a key theoretical pillar of vision science.

The paper is structured as follows. In Sect. 2, I lay out my argumentative strategy for defending the pillar. In Sect. 3, I make a case for moving beyond the argument from underdetermination and its Bayesian variant. In Sect. 4, I present the argument from invariance using object recognition as a case study. In Sect. 5, I address some potential challenges. Section 6 concludes the paper.

2 Laying the groundwork for the argument

Plausibly a (theoretical) inference is a "reasoned change in view": we start with some beliefs and, through deliberation, end up revising what we believe, or perhaps how strongly we believe it (Boghossian, 2014; Harman, 1986; Kiefer, 2017). In Fig. 1A we cannot see what birds are nesting, but, knowing why elevated nests are constructed on piles in waterways, we infer that they are probably ospreys. However this pattern of deliberation manifests itself, it easily qualifies as a case of inference in the above sense. It is also at least partially inductive, since prior knowledge is being exploited. Contrast Fig. 1B, where we immediately see the bird as an osprey. It is this second sort of case that is the focus of research on object recognition and that I wish to show also involves a process similar to inductive inference. But before doing so, clarity is needed on two fronts: first, on how I see the dialectic around unconscious inference; and second, on the argumentative strategy I intend to adopt.

Fig. 1 Two views of ospreys. A Two ospreys on an elevated nest. B An osprey in flight

2.1 Framing the dialectic

Often it is claimed that what is at issue is whether unconscious visual processing is literally, as opposed to metaphorically, inferential (Hatfield, 2002; Kiefer, 2017; Orlandi, 2014). There are two issues with this framing.

First, visual processing may involve unconscious inference in some senses but not in others. On the one hand, one might mean that seeing simply is thinking, though operating swiftly and outside of awareness. Historically, Ibn Al-Haytham (c. 965–1040)—and later Von Helmholtz (1867)—had this sense of unconscious inference in mind (Hatfield, 2002).Footnote 3 Contemporary vision scientists do not. Instead, they standardly posit unconscious information-processing that is proprietary to the visual system itself (e.g. Rock, 1983). On the other hand, given the historical connection between computation and deduction, the very idea of information-processing can perhaps be seen as a vindication of at least the spirit of earlier theorists. In that case, in the context of explaining visual processing, working vision scientists may treat "unconscious inference" as synonymous with "unconscious information-processing". One may insist that the relevant notion is intermediate between these alternatives, but it is not obvious that, once we move away from them, there is a single, privileged sense of unconscious inference that can be distilled, as opposed to a multitude of plausible candidates.

Second, even if some aspect of unconscious visual processing is literally inferential, highlighting this fact may be explanatorily superfluous, or worse, misleading, as it accentuates similarities between seeing and thinking when it is the differences that may matter. This sentiment is well expressed by Kanizsa (1985, pp. 27–28):

...the main problem of a theory of this kind, in my opinion, is that of not being able to suggest any advance, because it bears the risk of extinguishing the desire of investigating phenomena for which it has always ready a prefabricated explanation. From this point of view it is preferable to focus on the differences between seeing and thinking, because these, by indicating the possibility that the two classes of phenomena obey to different rules can set us on the road of discovering these rules.

In light of these two issues, my discussion will not focus on whether appeals to unconscious inference are literal, but on whether there are commonalities between seeing and thinking that are explanatory when it comes to particular visual phenomena (Kanizsa, 1985; Pylyshyn, 1999; Rock, 1983). One can treat these commonalities, collectively, as an explication, or operationalization, of a scientifically useful sense of unconscious inference, but the overlap with our commonsense intuitions about inference will only be partial.Footnote 4 Here is one straightforward way in which positing unconscious inference would be explanatory: the visual phenomenon itself has features that we consider diagnostic of some kind of inference. If that were the case, then the commonalities between seeing and thinking reflected in an explanation would be a natural consequence of what is being explained. Such a strategy has two parts: first, showing that the phenomenon has some of the diagnostic features of an induction problem; and second, showing that the unconscious processes that are posited to explain the phenomenon meet plausible requirements for an inferential solution to the problem. I elaborate on this strategy below.

2.2 Characterizing unconscious inductive inference

As typically understood, a “problem” for the visual system is a mapping from sensory input to perceptual output that is not yet understood. Somehow the visual system achieves this mapping and we would like to explain how. When might such a problem be similar in form to induction? I take the following to be a relatively uncontroversial description of a kind of inductive inference: a deliberative process that involves generalizing from past experience to form beliefs about a present circumstance. Figure 1A presents a case that is inductive in this way. This description also points to three diagnostic features that suffice for an operationalization of when an information-processing problem can be considered inductive. First, it is diachronic in that it concerns how we overtly represent the present in light of the past, and might change what we believe accordingly. Second, it is empirical in that it particularly concerns our representation of past experiences or acquired knowledge, which we bring to bear in drawing conclusions about present circumstances. Finally, it is extrapolative in the sense that when we generalize from past experience to novel circumstances there will typically be various ways in which the past and present circumstances differ. In what follows, I will call a phenomenon that has all these features an induction problem for the visual system.Footnote 5

A “solution” to an information-processing problem is an explanation of how the visual system maps the target sensory inputs to the outputs. If the problem is an inductive one, then there will be further requirements regarding the representations that are posited and the process that maps these representations to the resulting percept (cf. Hatfield, 2002).Footnote 6

Regarding the representations, one must be a representation of the environment derived from the sensory input, while others must reflect prior knowledge gained from experience. These representations provide the equivalent of premises, the contents of which stand in a relation of evidential support to the concluding representation. Looking at Fig. 1A, one not only represents the present state of affairs but also retrieves other representations about the local fauna and the intended purpose of elevated nests in waterways. Together, these provide the evidence for the hypothesis that the birds are ospreys. These are furthermore mental representations, and a reasonable expectation is that an unconscious solution to an induction problem will trade in them as well. Following previous discussions of unconscious inference (Mole & Zhao, 2016; Orlandi, 2014), I will assume that it suffices for a state of the visual system to be a mental representation if it has content that is: (i) distal, in the sense of being about properties of the external environment; and (ii) robust, in the sense that the content stays the same even when it is tokened in the absence of what it represents (Fodor, 1990).Footnote 7

Regarding the process, there are two requirements. The first is that the visual system transitions between the mental representations in a way that is plausibly inferential; that is, given a representation of the present sensory environment, there is a transition to a percept of the visible world in light of information afforded by representations related to past experiences. It is common to characterize inferential transitions as rule-following. While in paradigmatic cases of conscious deliberation this may require that an agent "takes" the premises to support the conclusion (Boghossian, 2014), others have suggested that, even in the case of cognition, deliberation can operate swiftly, automatically, and outside of conscious awareness (Quilty-Dunn & Mandelbaum, 2018; Wright, 2014). When unconscious in this way, it is simply a matter of our cognitive architecture that, given that the premises are represented, the conclusion is reliably represented as well. Following Quilty-Dunn and Mandelbaum (2018), I will call such operations "bare inferential transitions". The first requirement, then, is that a visual process involves some form of bare inferential transition from the representation of the sensory input, along with prior knowledge, to a concluding percept.Footnote 8

Second, the process must recruit some kind of long-term memory store, whereby information gained from prior experience is recorded, and can be retrieved for comparison with the present sensory input. Appealing to memory in this way is arguably latent within the very idea of inductive inference (Aggelopoulos , 2015; Fodor & Pylyshyn , 1981). For making an inference about the present from the past requires being able to represent the past in light of the present. In Fig. 1A, one cannot conjecture that the birds are ospreys without first retrieving from memory information about the nests and different birds that live in the area in order to generate the hypothesis.

To summarize, an information-processing phenomenon presents an induction problem for the visual system if it has the following diagnostic features: it is diachronic, empirical, and extrapolative. If an unconscious visual process that is posited to solve (i.e. explain) this problem also satisfies the above requirements on representation and process, then it follows that the induction problem is solved by some aspect of unconscious visual processing that has many commonalities with inductive inference. In this sense, explaining the phenomenon will require positing a form of unconscious inference. Of course, this is not the only route by which one may show that some aspect of unconscious visual processing is inferential. A phenomenon may fail to satisfy the conditions I have laid out, yet be inferential in some other sense. For example, it may still qualify as a form of unconscious deductive or abductive inference, though in such cases similar requirements on mental representation and inferential transitions would still apply. Similarly, the characterization of induction I have offered presumes a kind of learning process: that we acquire information about the world and extrapolate from that information to novel circumstances. Thus, it rules out the possibility of wholly innate forms of unconscious inductive inference, in so far as what is innate is not learned, though presumably, for any kind of information-processing, some aspect of it must be innate.Footnote 9 While these considerations highlight the limited scope of my strategy, they are a natural consequence of the fact that the underdetermination problem is typically characterized as involving inductive inference.

3 Undermining the argument from underdetermination

Given the groundwork laid down, how does the argument from underdetermination hold up? In this section I make the case that the underdetermination problem is a poor fit for the argumentative strategy described above. More specifically, it is not an induction problem for the visual system because it is not inherently diachronic. Therefore, if explaining how the visual system solves variants of the underdetermination problem involves positing a form of unconscious inductive inference, it is not because the sensory input underdetermines its distal cause. Furthermore, the same issue also arises for Bayesian variants of the argument.

3.1 Underdetermination is not (obviously) an induction problem

The allure of the argument from underdetermination derives from the fact that the problem it is constructed from seems to cry out for explanations that appeal to prior knowledge, and therefore inductive inference of some kind (Hatfield, 2002). The problem, recall, is that sensory inputs underdetermine their distal causes, so some other factors must also contribute to the determination of a stable percept of the world. Yet it does not immediately follow, simply from this description, that these other factors include prior knowledge. For that to be the case one would minimally need to show that underdetermination has the diagnostic features of an induction problem: that it is diachronic, empirical, and extrapolative. These features are connected. If a problem is synchronic, and only involves representing information from the present environment, then it does not obviously require extrapolating from past experiences. In which case, one may then doubt whether explaining the phenomenon will require positing unconscious inductive inference at all.

There is good reason to think the underdetermination problem is synchronic, as illustrated by the common example of "shape from shading" (Ramachandran, 1988). In Fig. 2A, horizontally-aligned linear contrast gradients are enveloped by circular contours. These gradients are ambiguous cues to 3D shape since they can be caused by concave surfaces illuminated from below, convex surfaces illuminated from above, or an infinity of further illumination and surface shape combinations (Freeman, 1994; Wagemans et al., 2010). Yet we clearly see those with higher luminance at the top as convex, unlike those with higher luminance at the bottom, which appear as concave dimples. So the visual system appears to make an "assumption" about the typical direction of surface illumination. One possibility is that this assumption is overtly represented and recruited by an inferential process. However, a common alternative explanation is that the visual system may internalize, without representing, environmental regularities via natural constraints on its organization.Footnote 10 For example, in their classic theory of edge detection Marr and Hildreth (1980) proposed that the retina respects a "spatial coincidence assumption" such that the outputs of different spatial frequency filters with similar receptive fields are combined, since edges tend to cause illumination changes at multiple spatial frequencies. However, the retina does not represent this assumption.Footnote 11 Similarly, the assumption that illumination comes from above has been described as a natural constraint on how the visual system parses surface shape (e.g. Burge, 2010; Orlandi, 2016).
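To make the flavor of this non-representational strategy concrete, here is a minimal sketch (my own toy illustration, not Marr and Hildreth's actual algorithm) in which candidate edges are kept only where zero-crossings of a Laplacian-of-Gaussian filter coincide across spatial scales; the "assumption" is built into how the filter outputs are combined rather than being represented anywhere:

```python
import numpy as np
from scipy import ndimage

# Toy illustration of the "spatial coincidence" idea: edges are marked only where
# zero-crossings of a Laplacian-of-Gaussian filter coincide across spatial scales.

def zero_crossings(response):
    """Binary map of sign changes in a filtered image."""
    zc = np.zeros_like(response, dtype=bool)
    zc[:-1, :] |= np.signbit(response[:-1, :]) != np.signbit(response[1:, :])
    zc[:, :-1] |= np.signbit(response[:, :-1]) != np.signbit(response[:, 1:])
    return zc

# Toy image: a bright square on a dark background, plus noise.
image = np.zeros((64, 64))
image[16:48, 16:48] = 1.0
image += np.random.default_rng(0).normal(0, 0.05, image.shape)

scales = [1.0, 2.0, 4.0]  # filter widths (sigma, in pixels)
maps = [zero_crossings(ndimage.gaussian_laplace(image, sigma=s)) for s in scales]

# "Spatial coincidence": keep only locations flagged at every scale.
edges = np.logical_and.reduce(maps)
print("Edge pixels surviving coincidence across scales:", int(edges.sum()))
```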

Fig. 2 Circular contours filled with A horizontally and B vertically aligned linear contrast gradients

Adjudicating between these alternative interpretations of the light-from-above assumption requires also taking stock of other facets of the phenomenon of shape from shading. Here are two of them. First, the light-from-above assumption is relatively weak and easily overridden by other cues from shading or shadow (Morgenstern et al., 2011); lighting diffuseness (Morgenstern et al., 2014); the presence of specular highlights (Adams & Elder, 2014); and the shape of the bounding contour (Todorović, 2014). So only in special cases like Fig. 2A does the assumption appear to play an outsized role (Wagemans et al., 2010). Second, even in these special cases non-visual cues are still essential for determining which direction is "above". In particular, the assumption is not constant when the body is rotated so that the gravitational and visual frames of reference are teased apart (Adams, 2008; Barnett-Cowan et al., 2018; Jenkin et al., 2004). To experience the effect of reference frame for oneself, simply tilt your head to the left or right until it is horizontal, and which stimuli in Fig. 2B appear dimpled will alternate.

The importance of these two facets is that they reveal how shape from shading may be best characterized as a synchronic phenomenon in which multiple visual and non-visual cues are combined to guess at the shape of illuminated surfaces. Several constraints no doubt govern how these inputs are combined, but attention to the details of the phenomenon makes the inferential characterization of the light-from-above assumption increasingly untenable. While I am inclined to think this holds, in general, for how the visual system solves all versions of the underdetermination problem (cf. Burge, 2010; Orlandi, 2014), for present purposes what is important is that the structure of the underdetermination problem itself is equally compatible with such explanations. Whether a visual phenomenon has the diagnostic features of an induction problem will depend less on the fact that there is underdetermination of the input and more on further facets of the phenomenon in question.

3.2 Inferential interpretations of Bayesian models are underdetermined

The use of Bayesian modeling in vision science is commonly framed as a vindication of the idea that the visual system carries out unconscious inference to solve the underdetermination problem (Rescorla , 2015).Footnote 12 Given its popularity, it is worth considering whether this Bayesian variant of the argument better fits the strategy I have proposed.

Bayesian decision theory is a formal framework for modeling decision-making under uncertainty (Berger, 1985). Central to the framework is the notion of subjective probability, or credence, which is a quantitative estimate of the degree of belief of an agent. The framework specifies norms for how an agent ought to (optimally) assign credences to hypotheses given the evidence available. The most familiar norm is Bayes' theorem, which expresses the conditional probability P(h|e), or the probability of the hypothesis h being true given the evidence e, as proportional to the product of the unconditional probability of h being true, P(h), and the likelihood of e given the truth of h, P(e|h). A separate norm is conditionalization, which governs how credences should change with new evidence; that is, upon being presented with e we should update P(h) with P(h|e), or replace the prior probability of the hypothesis with the posterior probability.
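In symbols, the two norms just described can be written as follows, where the first equation is Bayes' theorem and the second is conditionalization on newly acquired evidence e:

\[
P(h \mid e) = \frac{P(e \mid h)\,P(h)}{P(e)} \propto P(e \mid h)\,P(h), \qquad P_{\text{new}}(h) = P(h \mid e).
\]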

Although these norms are distinct (and justified separately), researchers in cognitive science have developed sophisticated models of a wide range of phenomena using both of these norms (Rescorla, 2021), including many visual phenomena (Knill & Richards, 1996; Yuille & Kersten, 2006). Among them is shape-from-shading, where the light-from-above assumption has been formalized as a prior; that is, the credence for the hypothesis that the illumination of a surface is directed from above is greater than for alternative hypotheses about lighting direction. For example, based on behavioral performance across multiple illumination conditions, some studies suggest that the highest credence may actually be for illumination from above-left (Mamassian & Goutcher, 2001; Sun & Perona, 1998).
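To illustrate how such a prior would enter the norms above, here is a toy sketch (my own illustration with made-up numbers, not a model from the cited studies) in which a credence distribution over lighting directions, peaked above-left, is combined with a hypothetical shading likelihood to yield a posterior:

```python
import numpy as np

# Toy posterior over lighting direction. The prior is a circular bump centred
# above-left (-30 deg from vertical); the likelihood favours a hypothetical
# shading cue consistent with light from slightly above-right (+10 deg).

directions = np.linspace(-180, 180, 361)  # degrees; 0 = directly above

def circular_bump(x_deg, mu_deg, kappa):
    """Unnormalized von Mises-style bump centred at mu_deg."""
    return np.exp(kappa * np.cos(np.radians(x_deg - mu_deg)))

prior = circular_bump(directions, mu_deg=-30, kappa=2.0)      # above-left prior
prior /= prior.sum()

likelihood = circular_bump(directions, mu_deg=10, kappa=4.0)  # hypothetical shading cue

posterior = prior * likelihood
posterior /= posterior.sum()

print("Most probable lighting direction (deg):", directions[int(np.argmax(posterior))])
```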

When made more explicit, the Bayesian variant is grounded in realist interpretations of Bayesian models according to which they are “approximately true” descriptions of the visual system; in other words, visual processing assigns credences to hypotheses in a manner that conforms to the Bayesian norms of reasoning (Rescorla , 2015; Rescorla , 2021). The alternative instrumentalist interpretation treats Bayesian models as merely predictively useful (Block , 2018; Colombo & Seriès , 2012). According to realists, it is the very success of Bayesian modeling that justifies positing credal states in visual processing. Given the strategy I have adopted, these states must be mental representations with the appropriate content if it is to follow that realism about such models warrants positing unconscious inference in the sense I have articulated. It is far from obvious that that is the case, at least without further argument (Orlandi , 2016). However, a more fundamental issue is that Bayesian models are not inherently models of information-processing in the first place.

Inferences are a kind of process: we deliberate from certain premises to conclusions, like guessing that the birds in Fig. 1A are ospreys. However, Bayesian models are not necessarily considered process models. Users of the framework are explicit about this (e.g. Griffiths et al., 2010), as Bayesian models are frequently described as a (rational) aspect of Marr's (1982) computational theory, which is a specification of what function a system is trying to carry out, and why (Ritchie, 2019; Shagrir, 2010). As a normative framework, Bayesian modeling provides possible constraints on the problem a system is trying to solve and its ideal solution, but the mapping from candidate processes to that problem and its ideal solution is many to one (Knill & Richards, 1996; Griffiths et al., 2010; Lake et al., 2017). In this way, Bayesian models underdetermine the form of the process that may conform to Bayesian norms. This fact, and the connection to Marr's computational theory, has been used to argue in favor of instrumentalism about Bayesian modeling (Colombo & Seriès, 2012). However, what I think it shows is that which interpretation of Bayesian models we adopt once more depends on the contours of the phenomenon being explained.

This latter point is well illustrated by so-called "rational process models", which specify algorithms that approximate a process that carries out operations over credal states (Griffiths et al., 2015). For example, Shi et al. (2010) used exemplar models of category learning to carry out importance sampling (a form of approximate Bayesian decision-making), where events remembered from the past act as samples from the prior. In their study this approach was applied to psychological tasks, including the number game, where, having been told that a set of natural numbers belongs to a category, participants must guess the probability that a particular number is also included (Tenenbaum & Griffiths, 2001). In this case the model is used to describe an inferential process, but that is because playing the number game, as a phenomenon, requires deliberation.
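For concreteness, here is a minimal sketch of how importance sampling can approximate judgments in the number game (a simplified toy, not the exemplar model of Shi et al.): hypotheses sampled from the prior stand in for retrieved memories, and each is weighted by the likelihood of the observed examples under the size principle:

```python
import random

# Toy importance sampling for a simplified "number game" over the numbers 1..100.
random.seed(0)
N = 100

HYPOTHESES = (
    [("even numbers", {n for n in range(1, N + 1) if n % 2 == 0}),
     ("odd numbers", {n for n in range(1, N + 1) if n % 2 == 1}),
     ("powers of two", {2 ** k for k in range(1, 7)})]
    + [(f"multiples of {m}", {n for n in range(1, N + 1) if n % m == 0})
       for m in range(3, 11)]
    + [(f"numbers {a}-{a + 9}", set(range(a, a + 10))) for a in range(1, 92, 10)]
)

def likelihood(examples, extension):
    # Size principle: each example is assumed drawn uniformly from the concept.
    if not examples <= extension:
        return 0.0
    return (1.0 / len(extension)) ** len(examples)

examples = {16, 8, 2, 64}
samples = [random.choice(HYPOTHESES) for _ in range(5000)]   # draws from a uniform prior
weights = [likelihood(examples, ext) for _, ext in samples]  # importance weights
total = sum(weights) or 1.0

for query in (32, 10):
    p = sum(w for (_, ext), w in zip(samples, weights) if query in ext) / total
    print(f"P({query} in concept | examples) ~ {p:.3f}")
```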

What then of Bayesian solutions to the underdetermination problem? These are often interpreted in non-inferential terms as reflecting natural constraints (Knill et al., 1996). In the case of shape-from-shading, the light-from-above prior is one of them.Footnote 13 To insist otherwise requires some evidence that the phenomenon also has the features of an induction problem. For example, in defense of inferential realism about the light-from-above prior, Rescorla (2015, 2021) points to the fact that the prior can be altered, as suggested by the results of Adams et al. (2004). In their study, visual and haptic stimulation was manipulated to suggest a shift in illumination direction, resulting in a change in credences, which also impacted performance on judging which side of a bar was lighter. Showing that the visual system can be recalibrated at a time, and that this change is preserved when performing another task, suggests a phenomenon that is diachronic, though without further evidence this is consistent with the hypothesis that natural constraints are flexible. But whether or not Rescorla's argument succeeds, note that it has little to do with Bayesian modeling as such, but instead depends on features of the phenomenon beyond the platitude that the sensory input underdetermines its distal cause.

In summary, appealing to Bayesian modeling does not avoid the defect in the argument from underdetermination; if anything, it only further emphasizes the flaw: underdetermination does not present an inductive inference problem for the visual system. The foregoing is not a decisive blow against the argument from underdetermination. However, it does suffice for motivating the search for an alternative aspect of visual processing that may involve unconscious inference. The invariance problem offers such an alternative.

4 The argument from invariance

I understand "visual object recognition" as the process of applying mental representations (for a category or individual identity) in the visual system to label the objects that we see (DiCarlo et al., 2012; Riesenhuber & Poggio, 2000). Of course we visually recognize many other things that preoccupy vision scientists, including shapes, colors, scenes, and materials. Appropriately adapted, my argument may apply to those cases as well. The argument is also not entirely new. Ibn Al-Haytham took recognition as his paradigm example of unconscious inference (Sabra, 1978), so the argument can also be thought of as a vindication of his theory.Footnote 14 In this section I first illustrate the importance of the invariance problem to explaining object recognition and why it is distinct from the underdetermination problem. I then detail why the invariance problem has all the diagnostic features of an inductive inference problem and why the general form of the proposed solutions in vision science meets the requirements on representation and process for unconscious inference.

4.1 What is the invariance problem?

The invariance problem is this: we never see objects in the distal world under identical viewing conditions, yet visual perception is largely invariant to identity-preserving transformations of visual input. One may have seen ospreys in the past, but on each new encounter the viewing conditions will be different: the daylight illumination, the viewing distance, orientation, and bodily configuration will all differ. Yet, across this multitude of dimensions of change, what is seen remains the same, even though how it is seen does not.Footnote 15 In virtually all discussions of object recognition the invariance problem is the central explanandum:

The recognition of visual objects is a fundamental, frequently performed cognitive task with two essential requirements, invariance and specificity. For example, we can recognize a specific face among many, despite changes in viewpoint, scale, illumination or expression. The brain performs this and similar object recognition and detection tasks fast and well. But how? (Riesenhuber & Poggio, 1999, p. 1019)

Visual object recognition is an extremely difficult computational problem. The core problem is that each object in the world can cast an infinite number of different 2-D images onto the retina as the object’s position, pose, lighting, and background vary relative to the viewer ...Yet the brain solves this problem effortlessly. (Pinto et al., 2008, p. 1)

Invariance is a central problem in vision: How do we recognize an object or scene to be the same ...across changes in view, size, lighting, configuration, and, even, in the case of category invariance, exemplars? (Gauthier & Tarr, 2016, p. 378)

In so far as object recognition is a fundamental aspect of how we see the world, if the underdetermination problem presents a general challenge for our explanations of visual processing, then the invariance problem does as well (Rust & Stocker, 2010). Both problems relate to how the visual system ultimately generates a stable percept given that there is no one-to-one mapping between visual inputs and their distal causes. In this respect, they can both be thought of as involving sensory uncertainty. However, the problems are conceptually distinct, in at least two respects.

First, they reflect different aspects of visual perception. Solving the underdetermination problem allows us to see a stable and coherent world, or concerns seeing, while solving the invariance problem allows us to make sense of what we see in this world, or concerns seeing-as. For example, the former relates to how the visual system arrives at a representation of the object in Fig. 1B that is stable and determinate despite the ambiguity in the input; the latter relates to how we recognize it as an osprey despite never having seen the photo before. Second, they differ in structure. The underdetermination problem relates to how the visual system generates a single stable percept given that any single proximal sensory input is compatible with an infinity of different distal causes (Fig. 3A). In contrast, the invariance problem relates to how we manage to represent the same distal cause even though, across viewing conditions, it can produce a near infinity of different proximal sensory inputs (Fig. 3B).

Fig. 3 Two mappings between sensory inputs and their distal causes, after Rust and Stocker (2010, Fig. 2). A A single sensory input underdetermined by possible distal causes. B A single distal cause that can produce multiple possible sensory inputs

These differences are important for avoiding two possible confusions. The first is that the underdetermination problem is often associated with perceptual constancies, the fact that perception of shape, color, or size tends to be relatively stable across changes in the sensory input (Cohen, 2015). Although "constancy" and "invariance" are sometimes used interchangeably, constancies primarily relate to phenomena of seeing, not seeing-as. For example, we tend to see the color of surfaces as being the same despite often drastic differences in incidental illumination; however, the information-processing challenge presented by color constancy is one of underdetermination: that we are able to discriminate color even though illumination and reflectance are confounded in the input (Foster, 2011).Footnote 16 The second possible source of confusion is that underdetermination is also present when we are trying to recognize what we see. For example, because of the importance of spatial frequency tuning to face perception, one stimulus manipulation is to convolve face images filtered at different spatial frequencies with different levels of noise, which allows for parametric variation in the discriminability of face stimuli (e.g. Harmon, 1973; Näsänen, 1999). With such a manipulation, the fact that we see the face images as more or less ambiguous (depending on the amount of noise added) entails that the visual system has already made a guess with respect to the ambiguity relevant to the underdetermination problem.

Explanations of visual processing must ultimately take stock of both the underdetermination and invariance problems (Rust & Stocker, 2010). But it should now be clear that they are distinct challenges, and only the latter reveals a form of unconscious inference in the visual system, as I will now show.

4.2 Solving the invariance problem requires unconscious inference

Following the groundwork laid down earlier, the argument from invariance proceeds in two stages: first, the invariance problem has all the diagnostic features of an induction problem; second, the type of unconscious process that is posited to explain how the visual system solves the invariance problem exhibits the required commonalities for unconscious inference. Therefore, the fact that an unconscious process in the visual system solves an induction problem gives us good reason to characterize object recognition as involving a form of unconscious inference. Let us go through each stage in turn.

The form of the invariance problem suggests it has all three diagnostic features for an induction problem. First, the invariance problem is diachronic, since it is defined in terms of how we perceive objects as being the same (in terms of identity or category) across different viewing conditions separated in time. To explain the phenomenon requires accounting for how information from past and present viewings are related to generate a current visual representation of an object (e.g. as an osprey). Second, the problem is empirical, since it concerns information acquired from past visual experience with objects. To explain the phenomenon requires accounting for how this past information is represented and recruited in the present. Third, the problem is extrapolative, since our present viewing conditions are always constitutively different from those of past experiences, even when encountering the same individual object. Thus, perhaps most fundamentally, explaining the phenomenon requires accounting for how we are able to generalize from unlike circumstances across identity-preserving transformations of viewpoint.

That the invariance problem has all three of these diagnostic features is also illustrated by object recognition tasks that focus on how we form representations of novel objects. Here are two classic paradigms. In the first, subjects are presented with different novel stimuli from a restricted set of viewpoints, such as orientations in depth, and they are then tested to see how their performance generalizes to novel viewpoints of the objects not presented during training (Tarr & Pinker , 1989). In the second, subjects are similarly trained on novel objects, “Greebles”, but in a way that involves focusing on local diagnostic features before generalizing the knowledge to new individual Greebles, with subsequent stages of the task intended to achieve expert level performance (Gauthier & Tarr , 1997). These two paradigms have been used extensively to investigate (respectively) how the representations we build up to recognize objects are influenced by viewpoint and how visual expertise might help explain how we see familiar categories like faces. For present purposes, what is notable is that they presuppose that object recognition exhibits the three diagnostic features of an induction problem. For it is the very nature of these tasks to investigate how participants generalize from past training experiences to novel test ones.Footnote 17 Thus, they also make clear that the invariance problem is a kind of induction problem for the visual system.

Next, consider that all information-processing explanations of object recognition take on the same form: a representation of a perceived object and its visible properties (e.g. an object in the sky) is built up through the processing stages of the visual system and compared to those of individual identities or object categories (e.g. stored representations for different bird species). This matching process, as I will call it, occurs automatically and is generally considered inherent to the visual system (DiCarlo et al. , 2012; Gauthier & Tarr , 2016; Riesenhuber & Poggio , 2000). The details are often murky as to how the matching process results in a single percept that attributes the relevant label to an object in the environment. So explanations that posit a matching process should not be considered a complete explanation of object recognition. Still, the matching process at the heart of these explanations has the required commonalities indicative of unconscious inference.

First off, the process includes types of representations with the right contents: one represents information about the object we presently see and its visible properties, and the other stored information about the appearance properties of different object identities and categories. In both cases it is also plausible that these are mental representations, as their content is both distal and robust. First, distalness is often defined in terms of invariance, as the representation of aspects of our environment that remain the same across, and are distinct from, proximal sensory inputs (Burge, 2010; Mole & Zhao, 2016; Orlandi, 2014). The synchronic representation of an object at a time, under particular viewing conditions, is generally thought to occur at the later stages of the visual processing hierarchy, in which increasing levels of specificity and invariance in representational content occur (DiCarlo et al., 2012). So to the extent that greater invariance in content entails greater distality, and object recognition trades in visual representations that exhibit the most wide-ranging invariance, the content is distal as well. In turn, our representations of object category or identity are certainly considered distal in so far as they must subserve generalization to novel circumstances. Second, the phenomenon of object recognition itself is typically used to illustrate the robustness of content via examples of misrepresentation (Fodor, 1990). Indeed, both types of representations that feature in the matching process have content that appears to be robust in the requisite way: if we mistake a goshawk for an osprey, the categorical representation for the class of ospreys is mistakenly tokened, but this could be because of misperception of crucial distinguishing features such as wing shape or plumage.

Next, the matching process is also rule-following and recruits memory. First, in so far as matching is, by hypothesis, a kind of information-processing, it will be rule-following in the minimal sense that computation (in general terms) involves rules defined over representational states of some kind (Piccinini & Scarantino, 2011). As has just been argued, the states in question are also plausibly mental representations, and so the matching process would seem to conform to the idea of a bare inferential transition: given some criteria for what determines a match, if they are satisfied, then the relevant mental label will tend to be applied to the object in the environment that is being represented. Second, the details described so far suggest that positing some kind of memory store for the mental representations of different labels (for individuals or categories) is unavoidable, for inherent in the idea of the matching process is that such labels exist based on past experience and can be applied in novel input conditions.

All these representational and processing commonalities with inference can be illustrated by considering a "geometric" way of characterizing the matching process in terms of a neural population code, which has become prevalent in visual neuroscience (DiCarlo & Cox, 2007; DiCarlo et al., 2012). Under this construal, the representation of an object at a time, in terms of its visually discernible properties, is encoded as a point in a multi-dimensional visual feature space (as implemented in patterns of neural activity). In turn, representations for category or identity make up distinct regions in this space. The matching process is then a result of applying a decision rule to the new encoding in the space, in a way similar to machine learning classifiers. For example, a particular point in the space may have never been tokened before, but if it is located within a region that constitutes the representation of a familiar category (e.g. osprey), then the visual system attributes the property of being an osprey to the object. The process is rule-following and memory-involving because the transition from the tokening of a point in the encoding space to the labeling of a stimulus, based on the representation that subsumes the point in the space, is a reliable, rule-following one, and the encoding space itself is a kind of long-term memory store for representations of previously encountered object types.
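As a toy illustration of this geometric construal (a minimal sketch with made-up feature values and labels, not a model of cortical processing), stored labeled encodings play the role of a long-term memory store and a simple nearest-neighbor decision rule plays the role of the matching process:

```python
import numpy as np

# Toy "geometric" matching process: stimuli are encoded as points in a feature
# space, stored labeled encodings act as a memory of past encounters, and a
# nearest-neighbor decision rule assigns the label of the closest stored exemplar.

rng = np.random.default_rng(0)

def encode(stimulus, noise=0.1):
    """Map a stimulus to a point in feature space; no two viewings are identical."""
    return np.asarray(stimulus, dtype=float) + rng.normal(0.0, noise, size=len(stimulus))

# "Memory": encodings of previously seen exemplars, each with a category label.
memory = [
    (encode([2.0, 8.0, 1.0]), "osprey"),
    (encode([2.2, 7.6, 1.1]), "osprey"),
    (encode([6.0, 3.0, 4.0]), "goshawk"),
    (encode([5.8, 3.3, 4.2]), "goshawk"),
]

def match(point, memory):
    """Bare 'inferential transition': label the point by its nearest stored exemplar."""
    distances = [np.linalg.norm(point - stored) for stored, _ in memory]
    return memory[int(np.argmin(distances))][1]

# A novel viewing: a point never tokened before, falling within the "osprey" region.
new_point = encode([2.1, 7.8, 0.9])
print("Recognized as:", match(new_point, memory))
```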

This geometrical construal, while increasingly pervasive, is not uncontroversial. In particular, for it to manifest the commonalities with inference in the way just described, minimally, regions of a state space must be possible vehicles for content, and decision-rule operations defined over those spaces must suffice for a form of bare inferential transition (Gärdenfors, 2004; Shea, 2007).Footnote 18 However, this construal suffices to show one way in which the matching process could be realized, granting these auxiliary assumptions. As it happens, this geometric characterization also comports with the theory of Ibn Al-Haytham, who claimed that the reason one recognizes the osprey in Fig. 1B is that its visible properties are more similar to ospreys we have seen in the past than to other birds. In this way his theory conforms to an intuitive characterization of a "nearest neighbor" classifier over a feature space (Pelillo, 2014). So in a theoretically substantive way, the matching process posited by some modern theories of the neural basis of object recognition also conforms closely to the form of unconscious inference first proposed by Ibn Al-Haytham.

To recap, the invariance problem for object recognition exhibits key diagnostic features of an induction problem. Furthermore, although I have only presented the general form of the explanations of object recognition, the matching process central to these explanations exhibits several key commonalities with inductive thinking. Thus, in an explanatorily substantive way, they involve positing a form of unconscious inference.

5 Challenging the argument

Having built up the argument from invariance, in this section I consider some ways to bring it down. The first concerns the ubiquity of the matching process in explanations of object recognition; the second concerns whether the requirements for unconscious inference that I have identified have indeed been satisfied; and the third concerns the status of object recognition as a perceptual phenomenon.Footnote 19

5.1 How ubiquitous is the matching process?

For the argument from invariance to succeed, positing a matching process must be both fundamental and widespread in explanations of object recognition. One may doubt I have provided sufficient evidence of this. Note that it is not enough to offer the platitude that the matching process features in our “best” theories or insist one or two curated studies are representative—as is arguably the case with the Bayesian variant of the argument from underdetermination (e.g. Rescorla, 2015). Instead, to address this concern I briefly review three debates about object recognition that have exercised vision scientists. In each case, the core understanding of the phenomenon, and how it is to be explained in terms of a matching process, is largely agreed upon.

The first debate concerns the format of representations for 3D shape recognition. Early theories posited viewpoint-independent structural descriptions of object shape built from volumetric primitives, which are compared to structural descriptions of objects stored in long-term memory (Biederman, 1987; Marr & Nishihara, 1978). Later, image-based theories were proposed, according to which new viewings of objects are compared to stored representations of objects from previously experienced or canonical viewpoints (e.g. Bülthoff & Edelman, 1992; Cutzu & Edelman, 1994). The debate between these theories centered on whether, in certain theoretically relevant conditions, recognition performance in fact varied with viewpoint (Biederman & Gerhardstein, 1993; Hayward & Tarr, 1997; Tarr & Bülthoff, 1995). Although the debate has fizzled (Hayward, 2003; Stankiewicz, 2003), regardless of whether the representations recruited during object recognition are structural descriptions or image-based, a form of matching process is a fundamental posit of both types of theories, since the debate concerns the format of the representations that are being matched.

The second debate concerns the domain-specificity of category-selective areas in human visual cortex. One of the first areas discovered was the fusiform face area (FFA), identified on the basis of increases in fMRI BOLD signal amplitude in response to faces relative to other stimuli like scenes (Kanwisher et al., 1997; McCarthy et al., 1997). Some studies found that visual expertise for novel (e.g. Greebles) or familiar (e.g. birds or cars) object categories also induced greater BOLD responses in FFA (Gauthier et al., 1999; Gauthier et al., 2000; Xu, 2005). Thus, debate centered on whether FFA has a domain-specific specialization for representing faces or a domain-general specialization for visual expertise (Bukach et al., 2006; Kanwisher & Yovel, 2006). At present, interest in the expertise hypothesis as a rival explanation has somewhat dissipated, and FFA is typically identified as part of a larger network of cortical face areas (for a review, see Duchaine & Yovel, 2015). However, under either hypothesis, FFA is assumed to play some role in building up representations for the matching process at the heart of recognition. At issue is whether this role is exclusive to representing faces or extends to other expertly learned categories as well.

The last debate concerns the use of deep neural networks (DNNs) as models of visual processing.Footnote 20 The interest in DNNs initially came from their (near) human-like performance on image classification tasks (Krizhevsky et al. , 2012; LeCun et al. , 2015), and the apparent similarity between the representations in layers of DNNs trained on these tasks and neural activity in category-selective visual cortex of primates (Cadieu et al. , 2014; Khaligh-Razavi & Kriegeskorte , 2014). Based on such findings, DNNs continue to be used as models of visual processing, and object recognition in particular (Kriegeskorte , 2015; Lindsay , 2020). However, many findings suggest tempering such enthusiasm (Serre , 2019). Classification performance can be disrupted by “adversarial” examples that fool DNNs into incorrectly labeling images that look nothing like the assigned category (Brendel et al. , 2017; Goodfellow et al. , 2014), while the extent of the correspondence between network representations and the brain is a source of ongoing study (Rajalingham et al. , 2018; Xu & Vaziri-Pashkam , 2021). Despite ongoing disagreement about how the networks should be interpreted or utilized (Saxe et al. , 2020), there is agreement about the form of the underlying matching process as one that requires comparing incoming signals to stored category labels.

Taken as a whole, then, all three of these debates provide strong evidence that the matching process is a fundamental and widespread posit of research on object recognition. To the extent that the matching process satisfies the conditions for unconscious inference that have been articulated, it follows that a form of unconscious inference is similarly a ubiquitous posit in explanations of the phenomenon.

5.2 What does unconscious inference require?

Another way to challenge the argument from invariance is to raise doubts that the requirements for unconscious inference that I laid out are in fact satisfied by the matching process at the heart of explanations of object recognition. I will consider two objections of this sort.

The first relates to rule-following, which some proponents of unconscious inference claim requires representation of the inferential rules themselves (e.g. Rock, 1983). Many have questioned whether visual processing involves rule-representation of this form. As mentioned earlier, a common explanatory approach is to posit natural constraints, which preclude the need for positing rule-representations and, it has been claimed, unconscious inference as well (Burge, 2010; Orlandi, 2014). For example, since neural networks do not overtly represent rules and yet approximate visual processing, some have claimed that they thereby undermine the case for unconscious inference (Hatfield, 2002, p. 136; Orlandi, 2014, pp. 46–49). Thus, if rule-representation is also necessary for rule-following, then my argument is at best incomplete, and at worst, demolished.

I have three replies. First, as many have pointed out, rule-representation cannot be a necessary condition for all inferences on pain of infinite regress (Boghossian, 2014; Carroll, 1895; Fodor, 1987; Quilty-Dunn & Mandelbaum, 2018). Briefly: if inference always requires representing a rule linking premises (e.g. modus ponens), then in order to follow that rule, a second-order rule is required that references the first rule, but that second-order rule must itself feature in a third-order rule, and so on. So even when it comes to the sort of deliberation that typifies cognition, rule-following must exclude rule-representation at some level. Second, rule-representation is plausibly connected to the idea that, when deliberating, we consciously "take" the premises to support the conclusion (Boghossian, 2014). In so far as unconscious inference, by definition, precludes such awareness, rule-representation is ill-motivated as a requirement.Footnote 21 Third, the idea of natural constraints in the visual system is wholly compatible with the idea of bare inferential transitions that I have been relying on (Quilty-Dunn & Mandelbaum, 2018; Wright, 2014). Indeed, natural constraints may provide an articulation of how the visual system could carry out such transitions.Footnote 22 For these reasons, the argument from invariance is not beholden to the requirement that rule-following presupposes rule-representation.

The second objection relates to the conditions for mental representations. Earlier I argued, on general grounds, that the matching process compares forms of mental representation. In order to show that this process is fundamental and pervasive, above I pointed to several debates about object recognition. However, I did not provide evidence that these more specific lines of research posit internal states with content that is both distal and robust. In fact, there is good reason to think they may not. For example, one may reasonably doubt that image-based theories of 3D shape recognition posit states that represent the distal environment as opposed to proximal image features. Furthermore, even if they do posit mental representations, one would still need to show that they satisfy Ramsey's (2007) "job description challenge" for mental representation: that it is in virtue of the posited states being mental representations that they are able to play their explanatory roles. Thus, I have not yet shown that, across these debates, positing mental representations is either widespread, or necessary, to the explanations that have been proffered.

In reply I would distinguish between two issues. The first is whether positing mental representations is necessary for the explanation of a visual phenomenon. I have already made arguments that this is the case. The second is the role of mental representation in interpreting particular theories or models of the phenomenon, given a prior commitment to the explanations requiring mental representation. Regarding this issue, it is important to keep in mind that the theories and models I canvassed may be incomplete, abstract away from important details, or simply be incorrect, but none of these alternatives would, by itself, give us reason to doubt that object recognition involves mental representation. For example, image-based theories, as models of 3D shape recognition, may reflect the incomplete and abstract form of our theorizing. Thus, the fact that I have not shown that mental representations are ubiquitously explicitly posited, or that the job description challenge is consistently met, does not undermine the argument.

5.3 Is object recognition really perceptual?

At this point, one may agree that object recognition involves unconscious inference, but question whether the conclusion is interesting. Object recognition is, after all, part of what is sometimes called visual cognition. So it should come as little surprise that seeing an osprey as such involves unconscious inference, as it reflects us learning and remembering from experience (Hatfield, 2002). How much standing we should give this concern depends in part on the extent to which object recognition should be considered a perceptual phenomenon in the first place. For example, Thomas Reid, who perhaps first clearly distinguished sensation and perception, considered forms of acquired perception, like object recognition, to be just as perceptual as other aspects of seeing (Copenhaver, 2010). Still, so far I have assumed that object recognition is principally a perceptual phenomenon, without argument. Furthermore, object recognition has increasingly featured as a test case for determining how perception and cognition should be distinguished (Firestone & Scholl, 2016; Mandelbaum, 2018). While I do not intend to take a stand on that debate here, below I offer three arguments in favor of the claim that object recognition is perceptual, before addressing some reasons one might reject it.

Why think object recognition is indeed perceptual? First, prima facie the experience of object recognition appears perceptual, as illustrated by classic demonstrations of the contrast between seeing and seeing-as. In two-tone "Mooney" images, once we parse the image we do not simply see a coherent scene; we see the central object as a dalmatian (Fig. 4A). When we look at the "do-it-yourself" object of Biederman (1987), we clearly see an object, but what is missing is that we see it as anything in particular (Fig. 4B). For bistable images like Rubin's vase (Rubin, 1915), beyond the figure-ground reversal, what objects we see also switches (Fig. 4C). A similar effect occurs when we look at the infamous Duck-Rabbit popularized by Wittgenstein (1953), except there is no change in figure-ground assignment (Fig. 4D). Denying that object recognition is perceptual requires explaining away these experiences, rather than simply taking them at face value.

Fig. 4 Four illustrations of the perceptual nature of object recognition. A Two-tone "Mooney" image. B Biederman's "do-it-yourself" object. C Rubin's vase. D Wittgenstein's Duck-Rabbit

Second, a key feature of object recognition is that it often occurs quickly—\(\sim \)120 ms, or about as long as it takes for visual signals to reach category-selective areas of visual cortex (DiCarlo et al., 2012). This point is illustrated by human and primate studies using time-series (cellular recordings or M/EEG) decoding methods, in which information about stimulus category latent in neural signal patterns reaches peak discriminability around 150 ms post-stimulus onset (Carlson et al., 2013; Hung et al., 2005; Isik et al., 2014). While decoding results do not license direct conclusions about representational content being reflected in neural signals (Ritchie et al., 2019), these results are just a sample of many lines of converging evidence that suggest object recognition is often subserved by a rapid feedforward pass of information through the visual system.
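To illustrate the logic of such decoding analyses, here is a toy sketch on synthetic data (my own illustration, not a reanalysis of any cited study): at each time point a simple classifier is trained on simulated response patterns from two stimulus categories, and its cross-validated accuracy traces when category information becomes available:

```python
import numpy as np

# Toy time-resolved decoding on synthetic data: the simulated category effect
# ramps up and peaks around 150 ms; a nearest-centroid classifier is trained and
# cross-validated at each time point to trace when that information is decodable.

rng = np.random.default_rng(1)
n_trials, n_channels, n_times = 80, 32, 60
times_ms = np.linspace(0, 600, n_times)
labels = np.repeat([0, 1], n_trials // 2)

signal = np.exp(-((times_ms - 150) ** 2) / (2 * 60 ** 2))  # category effect over time
pattern = 0.3 * rng.normal(size=n_channels)                 # category-specific spatial pattern
data = rng.normal(size=(n_trials, n_channels, n_times))
data += labels[:, None, None] * pattern[None, :, None] * signal[None, None, :]

def nearest_centroid_accuracy(X, y, n_folds=5):
    """Cross-validated accuracy of a nearest-centroid classifier at one time point."""
    idx = rng.permutation(len(y))
    correct = 0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        c0 = X[train][y[train] == 0].mean(axis=0)
        c1 = X[train][y[train] == 1].mean(axis=0)
        pred = (np.linalg.norm(X[fold] - c1, axis=1)
                < np.linalg.norm(X[fold] - c0, axis=1)).astype(int)
        correct += int((pred == y[fold]).sum())
    return correct / len(y)

accuracy = [nearest_centroid_accuracy(data[:, :, t], labels) for t in range(n_times)]
peak = int(np.argmax(accuracy))
print(f"Peak decoding accuracy {accuracy[peak]:.2f} at ~{times_ms[peak]:.0f} ms")
```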

Finally, the perceptual status of object recognition depends in part on how one delineates perception from cognition (if at all). As with the notion of inference, there are multiple measures by which we might draw the line. Still, one proposal is that mental representations are perceptual when they are stimulus-dependent in the sense that they have the function of being causally sustained when the visual system is presented with proximal sensory inputs (Beck, 2018). Such a condition helps to distinguish how we represent Fig. 1A and B differently. In both cases the representations have demonstrative and attributive elements: we attribute the property of being an osprey to that thing before us. But only in the case of Fig. 1B is the attributive element stimulus-dependent, because we are also attributing the appearance of an osprey, which will be constrained by some particular sensory inputs. In contrast, in Fig. 1A we have a perceptually-determined demonstrative thought that we are looking at ospreys, and may be correct in this belief, but the objects on the nest do not look like ospreys when seen from afar. So by one plausible measure, object recognition is clearly perceptual.Footnote 23

What about reasons for denying that object recognition is perceptual? Let us consider two of them. First, if categorization suffices for a form of conceptualization, then if object recognition is perceptual it follows that perception recruits concepts (Mandelbaum , 2018; Prinz , 2002). Many have rejected this claim on the grounds that perception excludes seeing-as (Block , 2014; Burge , 2010). One way to frame this dispute is in terms of the matching process itself and whether it is carried out by the visual system or not; if it is, then perception involves conceptualization; if not, then object recognition is not perceptual (Mandelbaum , 2018). I reject the first entailment. It is far from clear that representations of the appearance of a type of object, which are the type of representations stored and recruited during recognition, qualify as concepts. For example, even if regions of a neural encoding space can be vehicles for mental representation, and the matching process follows the geometric characterization described earlier, it is not obvious that representations in a space that encodes information about object appearances are concepts (cf. Gauker, 2017). Thus, without further premises, my argument is consistent with the possibility of non-conceptual seeing-as.

Second, if memory is considered inherently cognitive, then object recognition is not perceptual. Firestone and Scholl (2016) seem to make this claim to address results of studies that purport to show that perception can be cognitively penetrated. For example, the fact that the memory component of recognition can be influenced by information that an observer is consciously aware of, such as the name of an object making it easier for a corresponding stimulus to break through to awareness during continuous flash suppression, has been interpreted as showing that perception is cognitively penetrable (Lupyan & Ward, 2013). However, such results at most show that perception is not informationally encapsulated, which need not be treated as a requirement for distinguishing perception and cognition (Beck, 2018; Ogilvie & Carruthers, 2016).Footnote 24 For object recognition to be cognitively mediated in a substantive sense would presumably require some level of cognitive control over the matching process itself; it would require a capacity to override the strong stimulus-dependence that determines our representation of the osprey in Fig. 1B, by convincing ourselves we are looking at (say) an eagle instead. So priming effects on recognition can be dismissed as evidence of cognitive penetration without also denying that recognition is wholly a perceptual phenomenon.

6 Conclusion

The idea that visual information-processing involves unconscious inference is one of the theoretical pillars of much of vision science. I have attempted to provide a novel basis of support for this pillar. We began the present discussion by laying the groundwork for an argumentative strategy focused on whether positing a form of unconscious inference plays a role in explaining particular aspects of visual processing. This was used to evaluate the most influential argument in favor of unconscious inference, the argument from underdetermination. This well-known argument, even under a Bayesian guise, is ill-suited to the proposed strategy. In its place, an alternative argument centered on the invariance problem for visual object recognition was constructed. As I have shown, explaining how the visual system overcomes the invariance problem reveals important commonalities between perception and thought. Identifying these commonalities helps us recognize why vision is inferential.Footnote 25