Introduction

Visual exploration of a complex scene involves a series of saccades and fixations, which shift our attention between specific objects or informative features within the scene and allow detailed analysis and identification of the scene (Biederman 1987; Henderson and Hollingworth 1999). Two aspects of eye movements are important when studying gaze control during scene perception: where fixations tend to be directed (fixation position) and how long they typically remain there (fixation duration; Henderson 2003). Although human saccadic eye movements show a variety of stereotypic patterns while inspecting visual scenes (Yarbus 1967), the frequency and size of saccades can be modulated by the cognitive demand and characteristics of the observed scene (Salthouse et al. 1981; Jacobs 1986; Pollatsek et al. 1986; Epelboim et al. 1995; Hooge and Erkelens 1998; Andrews and Coppola 1999). For example, longer fixations are normally associated with difficult words in reading tasks (Pollatsek et al. 1986) and with decreased discriminability of the target in visual search tasks (Jacobs 1986; Hooge and Erkelens 1998), and natural scenes generate shorter fixations and larger saccades than simple pattern images in free viewing tasks (Andrews and Coppola 1999).

As faces can provide visual information about an individual’s gender, age and familiarity, and their expressions offer significant cues to intention and mental state (Bruce and Young 1998; Emery 2000), the ability to recognize these cues and to respond accordingly plays an important role in the social life of higher primates (Andrew 1963; Anderson 1998). It has been argued that face perception involves a unique cognitive process compared with non-face object or scene perception. For example, psychophysical studies have observed a more detrimental effect of inversion on the recognition of faces than of non-face objects or scenes (face inversion effect; e.g. Yin 1969; Valentine 1988; Rossion and Gauthier 2002), a visual preference for face-like stimuli in human neonates (Johnson and Morton 1991; see also Turati et al. 2002), and selective impairments of face and object recognition in neurological patients (prosopagnosia and visual agnosia; e.g. Sergent and Signoret 1992; Farah 1996; Moscovitch et al. 1997). Recordings of human event-related potentials showed a different topography for face (including human and animal faces) and non-face object or scene stimuli in the N170 time window (e.g. Bentin et al. 1996; Itier and Taylor 2004; Rousselet et al. 2004). Electrophysiology and brain imaging studies further suggested a distinct neuroanatomical region in the cerebral cortex associated with the cortical processing of faces (face-selective neurons in monkey inferotemporal cortex, fusiform face area in human cortex; e.g. Sergent et al. 1992; McCarthy et al. 1997; Tanaka 1997; Tsao et al. 2003). However, this view has recently been challenged by some brain imaging studies suggesting that faces are processed by a domain-general system for fine-grained, exemplar-level object perception, but probably at a different level of recognition or with a different degree of perceptual expertise (Gauthier et al. 1999, 2000; Tarr and Cheng 2003).

It is not clear, however, whether inspection of face and non-face scenes, which have different image characteristics and may involve different cognitive processes (i.e. different cortical processes, different levels of recognition or different degrees of perceptual expertise), can influence the patterns of visuomotor activity. To examine this issue, we compared monkeys’ saccadic eye movements when they freely viewed face and natural scene images. Familiar scenes sampled from the monkeys’ daily environment were also used to examine the potential influence of the familiarity of natural scene images. This exploratory project is important not only for increasing our understanding of the relation between the category of real world stimuli and the organization of goal-directed eye movements in non-human primates, but also for comparison with findings from humans, as the behavior and neurophysiology of monkeys comprise the most significant model for the advancement of research into human brain function. We observed that the face images tended to generate longer fixations compared with the natural scene images, and these longer fixations were associated with the context of facial features.

Methods

Subjects

Three male adult rhesus monkeys (Macaca mulatta, 4.5–6.0 kg) were trained to fixate a small fixation point (FP) for several seconds in a dimming fixation detection task. To make eye movement recordings, a scleral eye coil and head restraint were implanted under aseptic conditions (Guo and Benson 1998). All procedures complied with the “Principles of laboratory animal care” (NIH publication no. 86-23, revised 1985) and UK Home Office regulations.

Stimuli and apparatus

Digitized gray scale images were presented through a VSG 2/3 graphics system (Cambridge Research Systems) and displayed on a high frequency non-interlaced gamma-corrected color monitor (6.0 cd/m² background luminance, 110 Hz frame rate, Sony GDM-F500T9) with a resolution of 1,024×768 pixels. At a viewing distance of 57 cm the monitor subtended a visual angle of 40×30°.

Four different classes of images were used as stimuli (see examples in Figs. 1a, 6a): (1) 20 neutral monkey (Macaca mulatta) face images, (2) 20 natural scene images (including buildings, landscape, trees and plants etc.), (3) 15 familiar natural scene images which were taken from monkeys’ daily environment, (4) 10 scrambled monkey face images. The scrambled images were generated by dividing each complete face image into a 4×4 matrix and randomly rearranging the parts (Guo et al. 2003). By doing so, most of the local facial features (eyes, nose and mouth) were kept intact and recognizable, but the global structure of the face was disrupted. All images were in sharp focus at all depths of field, and were gamma-corrected and displayed once in a random order at the center of the screen with a resolution of 512×512 pixels (20×20°).
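The scrambling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function name `scramble_image`, the `grid` and `seed` parameters, and the representation of an image as a list of pixel rows are all hypothetical.

```python
import random

def scramble_image(image, grid=4, seed=None):
    """Split a square image (list of pixel rows) into grid x grid tiles and shuffle them."""
    size = len(image)
    if size % grid != 0 or any(len(row) != size for row in image):
        raise ValueError("image must be square with a side divisible by grid")
    tile = size // grid
    # Top-left corner of every tile, in row-major order.
    origins = [(r * tile, c * tile) for r in range(grid) for c in range(grid)]
    # A shuffled copy decides which source tile lands on each destination tile.
    sources = origins[:]
    random.Random(seed).shuffle(sources)
    out = [[0] * size for _ in range(size)]
    for (dr, dc), (sr, sc) in zip(origins, sources):
        for y in range(tile):
            # Copy one row of the source tile; pixels inside a tile keep their layout.
            out[dr + y][dc:dc + tile] = image[sr + y][sc:sc + tile]
    return out
```

Because pixels are only moved in whole tiles, features that fit inside one tile stay intact while the global arrangement is disrupted. For a 512×512 image and a 4×4 grid, each tile is 128×128 pixels.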

Fig. 1
a Examples of static gray scale face, familiar scene and natural scene images used in the recording. The white dots within the images indicate the position of each fixation sampled during the image presentation. b, c Number of fixations (b) and fixation duration (c) measured while viewing face, familiar scene and natural scene images. Error bars indicate the standard error of the mean

Fig. 2
The average local contrast around the fixations while viewing face, familiar scene and natural scene images (white columns), and the average local contrast from random samples in face, familiar scene and natural scene images (25 samples per image, gray columns). Error bars indicate the standard error of the mean

During the experiments the monkey sat in a primate chair with head restrained, and viewed the display binocularly. To calibrate eye movement signals, a small red FP (0.2° diameter, 7.8 cd/m² luminance) was displayed randomly at one of 25 positions (5×5 matrix) across the monitor. The distance between adjacent FP positions was 5°. The monkey was trained to follow the FP and maintain fixation for 1 s. After the calibration procedure, each trial started with an FP displayed at the center of the monitor. If the monkey maintained fixation for 500 ms, the FP disappeared and an image was presented for 20 s. During the presentation, the monkeys passively viewed the images. No reinforcement was given during this procedure, nor were the animals trained on any other task with these stimuli, which could have potentially affected the structure of their behavior. It was considered that with their lack of training, and in the absence of instrumental responding, their behavior should be as natural as possible.

Eye movement recordings and analysis

Horizontal and vertical eye positions were measured using an 18-inch cubic scleral search coil assembly with 6 min arc sensitivity (CNC Engineering). Eye movement signals were amplified and sampled at 500 Hz through a CED1401 plus digital interface (Cambridge Electronic Design). Software developed in Matlab computed horizontal and vertical eye displacement signals as a function of time to determine eye velocity and position. Fixation locations and durations were then extracted from the raw eye tracking data using velocity (less than 0.2° eye displacement at a velocity of less than 20°/s) and duration (greater than 50 ms) criteria (Guo et al. 2003).
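The velocity and duration criteria can be illustrated with a short Python sketch. It is not the Matlab software used in the study. The function name `detect_fixations` and its parameters are hypothetical, and measuring the 0.2° displacement limit from the first sample of each fixation is an assumption, since the text does not state the reference point.

```python
import math

def detect_fixations(x, y, rate_hz=500.0, max_speed=20.0, max_shift=0.2, min_dur_ms=50.0):
    """Return (mean_x, mean_y, duration_ms) for each detected fixation.

    x, y are eye positions in degrees, sampled at rate_hz.
    """
    dt = 1.0 / rate_hz
    fixations = []
    start = None  # index of the first sample of the current candidate fixation

    def close(end):
        # Keep samples [start, end) only if they last longer than min_dur_ms.
        duration_ms = (end - start) * dt * 1000.0
        if duration_ms > min_dur_ms:
            n = end - start
            fixations.append((sum(x[start:end]) / n, sum(y[start:end]) / n, duration_ms))

    for i in range(1, len(x)):
        # Sample-to-sample speed in degrees per second.
        speed = math.hypot(x[i] - x[i - 1], y[i] - y[i - 1]) / dt
        if start is None:
            if speed < max_speed:
                start = i - 1
        else:
            # Assumption: displacement is measured from the first sample of the fixation.
            shift = math.hypot(x[i] - x[start], y[i] - y[start])
            if speed >= max_speed or shift >= max_shift:
                close(i)
                start = None
    if start is not None:
        close(len(x))
    return fixations
```

At 500 Hz each sample spans 2 ms, so the 50 ms duration criterion corresponds to runs of more than 25 samples.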

As the main experimental design comprised three levels of image category (faces vs natural scenes vs familiar scenes), one-way repeated analysis of variance (ANOVA) was carried out after pooling the data from the three monkeys. Appropriate post-hoc testing of differences between levels of image category (Tukey’s least significant procedure) was also carried out following detection of significant overall variance ratios.
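For readers unfamiliar with the F ratio, the sketch below computes an ordinary between-groups one-way ANOVA in plain Python. It illustrates the statistic only and is not the authors' analysis code; the function name `one_way_anova` is hypothetical, and a repeated-measures design would partition the variance differently.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples, one per category."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variability of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variability of observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within
```

If each image viewed by each monkey counts as one observation (an assumption), the 20 face, 15 familiar scene and 20 natural scene images give 55 × 3 = 165 observations and 165 − 3 = 162 within-group degrees of freedom, which matches the F (2,162) reported for the number of fixations.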

Results

The gray scale face and natural scene images appeared equally salient to the monkeys. No difference was observed in the number of fixations across the image categories (ANOVA, F (2,162)=0.5, P=0.61; Fig. 1b). During the entire 20-s presentation, the three monkeys made 24.73±1.51 (Mean ± SEM), 24.82±1.69 and 22.82±1.58 fixations on the face, familiar scene and natural scene images, respectively.

The fixation durations were influenced by the image categories. Although frequency distribution analysis showed that the monkeys made frequent short fixations (peak around 200 ms) while viewing the images (Guo et al. 2003), the faces tended to generate longer fixations (ANOVA, F (2,3975)=35.7, P=4.29E−16; post-hoc test, face vs familiar scene: P=7.91E−13, face vs natural scene: P=1.71E−11; Fig. 1c). In contrast, the familiar scenes and natural scenes had indistinguishable fixation durations (post-hoc test, P=0.66). The mean fixation durations were 317±8 (Mean ± SEM), 249±5 and 253±5 ms for face, familiar scene and natural scene images, respectively. The conclusion also holds for the median fixation durations, which are less sensitive to the skewed distributions of fixation durations (e.g. Fig. 3b in Guo et al. 2003). The median fixation durations were 222, 205 and 200 ms for face, familiar scene and natural scene images, respectively.

Fig. 3
Dependence of fixation duration on local luminance contrast in face, familiar scene and natural scene images

Inspection of a natural scene is accompanied by a series of fixations directed towards important and informative scene regions. Recent studies observed higher local luminance contrast and lower local two-point correlation for fixated scene patches than for unfixated patches (Reinagel and Zador 1999; Krieger et al. 2000; Parkhurst and Niebur 2003), suggesting that local image statistics, such as luminance contrast, are major contributors to the saliency map for overt attention (Parkhurst et al. 2002). To examine whether the differences in fixation durations for the three classes of images were due to differences in the physical properties and statistics of the fixated image regions, we calculated local luminance contrasts around individual fixations in the different images. The local contrast is a measure of the variability of intensity within an image patch, and is defined as the standard deviation of the luminance within a square image patch divided by the mean intensity of the whole image (Reinagel and Zador 1999; Einhäuser and König 2003). The size of the square region was chosen to be 2°×2° (±1° around the fixation), which roughly covers the spatial extent of the fovea. While the average fixation duration in the face images was longer than that in the familiar scenes (Fig. 1c), the average local contrast around the fixations in the face images (0.2568±0.0034) was not significantly different from that in the familiar scenes (0.2539±0.0038; t test, P>0.05; Fig. 2). However, the average local contrast around the fixations in the natural scene images (0.3512±0.0061) was higher than that in the face and familiar scene images (ANOVA, F (2,3975)=157.11, P=2.63E−66). This is due to the physical properties of the natural scene images, as the average local contrast from random samples in the natural scenes (25 samples per image) was also proportionally higher than that in the face and familiar scene images (ANOVA, F (2,1372)=113.02, P=3.67E−46; Fig. 2).
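The local contrast definition can be written out as a minimal Python sketch. The function name `local_contrast`, the use of the population standard deviation, the rounding of the patch half-width to whole pixels, and the clipping of patches at the image border are all assumptions made for illustration. The scale of 25.6 pixels per degree follows from the 512-pixel images spanning 20°.

```python
import statistics

PIXELS_PER_DEGREE = 512 / 20  # 512-pixel images spanning 20 degrees

def local_contrast(image, fx, fy, half_width_deg=1.0):
    """Standard deviation of luminance in the patch around pixel (fx, fy),
    divided by the mean luminance of the whole image."""
    half = round(half_width_deg * PIXELS_PER_DEGREE)
    rows, cols = len(image), len(image[0])
    # Assumption: patches are clipped at the image border.
    top, bottom = max(0, fy - half), min(rows, fy + half + 1)
    left, right = max(0, fx - half), min(cols, fx + half + 1)
    patch = [v for row in image[top:bottom] for v in row[left:right]]
    image_mean = statistics.fmean(v for row in image for v in row)
    # Assumption: population standard deviation (divide by N, not N - 1).
    return statistics.pstdev(patch) / image_mean
```

Calling the function with `half_width_deg=0.5` or `half_width_deg=1.5` gives the smaller and larger patch sizes (1°×1° and 3°×3°) under the same assumptions.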

For each individual fixation sampled while viewing face, familiar scene and natural scene images, we further plotted its duration against its local contrast (Fig. 3). In agreement with a previous study of human subjects (Einhäuser and König 2003), over all images and all subjects, we found no correlation between local contrast and fixation duration (r=0.00005, 0.0007 and 0.0002 for face, familiar scene and natural scene images, respectively). This also holds true for the local contrasts calculated using a smaller (1°×1°) or larger (3°×3°) spatial scale around the fixations (r<0.001 for all images). This analysis shows that the local luminance contrast was unlikely to be related to the differences in the fixation durations while viewing face, familiar scene and natural scene images.
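The correlation coefficient r reported here is assumed to be the Pearson coefficient, which is not stated explicitly in the text. A minimal Python sketch, with the hypothetical name `pearson_r`, is:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences,
    for example local contrasts and fixation durations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)
```

Values of r near zero, as reported for all three image classes, mean that knowing the local contrast of a fixated patch says almost nothing about how long the fixation lasts.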

As the measurement of local contrast is insensitive to the spatial organization of intensities within an image patch, we also employed a two-point correlation function, which calculates the correlation between the point at the center of each fixation and a point within the local neighborhood of the fixation (±1° around the fixation in this study), to quantify the correlation in intensity between pairs of pixels in the image patch (Reinagel and Zador 1999). The mean and covariance of the correlation matrices over the fixations within individual face, familiar scene and natural scene images were calculated and further averaged over each class of images and over subjects (Cootes and Taylor 1992; Cootes et al. 1992). Figure 4 shows the mean of correlations for each class of images. In general, correlation is a function of the distance between image points (pixels). The local image structures around the fixations in the natural scene images seemed to be less correlated than those in the face images.
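One possible reading of the two-point correlation analysis is sketched below in Python: for every pixel offset within the patch, the intensity at that offset is correlated with the intensity at the patch center across a set of fixation patches. The function name `two_point_correlation` and the patch representation are hypothetical, and the authors' exact procedure may differ.

```python
import math

def two_point_correlation(patches):
    """Correlate the centre pixel with every pixel offset, across fixation patches.

    patches: equally sized square patches (lists of rows) with odd side length,
    each centred on one fixation. Returns a map of Pearson r values, one per offset.
    """
    n, side = len(patches), len(patches[0])
    mid = side // 2
    centre = [p[mid][mid] for p in patches]
    c_mean = sum(centre) / n
    c_dev = [c - c_mean for c in centre]
    c_ss = sum(d * d for d in c_dev)
    corr = [[0.0] * side for _ in range(side)]
    for r in range(side):
        for c in range(side):
            vals = [p[r][c] for p in patches]
            v_mean = sum(vals) / n
            v_dev = [v - v_mean for v in vals]
            v_ss = sum(d * d for d in v_dev)
            denom = math.sqrt(c_ss * v_ss)
            # A pixel that is constant across patches has no defined correlation; use 0.0.
            corr[r][c] = sum(a * b for a, b in zip(c_dev, v_dev)) / denom if denom else 0.0
    return corr
```

In such a map the center always has a correlation of 1, and the rate at which values fall off with distance from the center reflects how spatially redundant the fixated image regions are.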

Fig. 4
The mean of correlations over two-degree image patches around the fixations in face (left), familiar scene (middle) and natural scene images (right). In the far right side of the figure, a scale is presented to indicate the brightness with the corresponding correlation values

To further quantify the variations of correlations for each class of images, eigenvalues and eigenvectors of the covariance matrix were computed to analyze the principal components of our correlation data over each class (Kreyszig 1999). The Mahalanobis (weighted) distances between the mean of each class and the means of the other classes were finally calculated to determine whether different classes overlapped with each other or were separated from each other (Cootes and Taylor 1992; Cootes et al. 1992). Figure 5 shows the distribution of our data for these three classes of images by considering the first two important modes (components) of variation. The distribution function was assumed to be a multidimensional Gaussian function whose variances correspond to the eigenvalues of the covariance of the correlation data. These Gaussian functions were considered in a feature space obtained by applying the Hotelling transform to our data (Cootes and Taylor 1992; Cootes et al. 1992; Kreyszig 1999). This analysis shows a clear difference in spatial correlations between fixations sampled from the face and natural scene images. The local image structures are more spatially correlated in the face images. However, this difference in local spatial correlations between the face and natural scene images is unlikely to be related to the difference in fixation durations while viewing the face and natural scene images. Compared with the face images, the correlations between nearby pixels were weak in the natural scene images, indicating a rich structure on a small spatial scale in the natural scene images. Therefore the natural scene images are statistically less redundant (Field 1987; Ruderman and Bialek 1994; Simoncelli and Olshausen 2001), and consequently should attract longer fixation durations for the purpose of foveal analysis, rather than the shorter fixation durations that we observed in the recording. However, the relationship between fixation duration and the local spatial structure of the stimulus may well be task dependent. For example, the natural scene images could attract longer fixation durations in a search task compared with the free viewing task we employed in this experiment. Nevertheless, our observation suggests that the fixation duration is dependent upon not only simple local properties like contrast and spatial correlation, but also some complex features like informativeness.
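The Mahalanobis distance used to compare classes can be illustrated for the two-dimensional case (the first two modes of variation). The Python sketch below is an assumption-laden illustration rather than the authors' implementation: the names `mean_and_cov_2d` and `mahalanobis_2d` are hypothetical, and weighting the distance by the covariance of a single reference class is an assumption.

```python
import math

def mean_and_cov_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of a list of (x, y) feature points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(point, mean, cov):
    """Distance from point to mean, weighted by the inverse of the 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse written out.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)
```

Passing the mean of one class as `point` and the mean and covariance of another class as `mean` and `cov` gives the separation between the two classes in units of the reference class's spread, which is comparable to the standard deviation units on the axes of Fig. 5.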

Fig. 5
Two-dimensional Gaussian functions in feature space corresponding to face, familiar scene and natural scene images. The two axes correspond to the two most important variations in the covariance matrix, and the units of the axes indicate standard deviations of modes

While viewing the faces, the monkeys’ fixation was mainly directed to the principal local facial features, even with the scrambled faces (see examples in Figs. 1a, 6a; Guo et al. 2003). To investigate whether the longer fixations on facial features are dependent upon their spatial configurations, we compared the durations of fixations on eyes, nose, mouth and facial contours (including hairlines) within normal and scrambled face images (Fig. 6a). While the fixations on eyes, nose and mouth had the same durations between normal and scrambled faces (paired t test, P>0.05), the mean duration of fixations on facial contours of normal faces (302±12 ms) was longer than that of scrambled faces (282±20 ms) (paired t test, P=0.03).

Fig. 6
a Comparison of the durations of fixations on facial contours, eyes, nose and mouth region within normal and scrambled face images. Error bars indicate standard error of the mean. The top graphs are examples of normal and scrambled face images used in the experiment. b The change of the fixation durations with increasing fixation sequence at the eyes and facial contours within normal face

We further compared the durations of each of the first seven fixations on the eyes and facial contours within normal face images (this number was chosen as it represented the maximum number of fixations within the region for some images, Fig. 6b). While the fixation durations on the eyes were the same with changing fixation sequence (ANOVA, F (6,268)=0.85, P=0.53), the duration of fixations on the facial contours increased gradually at the later stage of fixation (ANOVA, F (6,214)=3.75, P=0.001). There was no significant change of the fixation durations on the same regions within scrambled faces with increasing fixation sequence (ANOVA, eyes: F (6,98)=1.25, P=0.29; facial contours: F (6,115)=0.67, P=0.68).

Discussion

In the present study, we compared the patterns of saccadic eye movements while monkeys freely viewed face and natural scene images (including familiar and novel natural scenes). The face and natural scene images appeared equally salient to the monkeys. They attracted a similar number of fixations during the image presentation. However, viewing of the faces was accompanied by longer fixations compared with the natural scenes. This difference in fixation durations across different classes of images is unlikely to be related to the differences in local physical properties and statistics of these images, as demonstrated by the analysis of local luminance contrast (standard deviation of intensity in a fixation patch, Figs. 2, 3) and the local two-point correlation function (intensity of the fixated point and nearby points, Figs. 4, 5) across the different classes of images. Comparison between familiar and novel natural scenes showed that these two classes of natural images attracted similar fixation durations (Fig. 1). Because our familiar scenes were ‘artificial’ man-made scenes sampled from the monkeys’ daily environment, and novel natural scenes included both ‘artificial’ scenes (e.g. buildings) and ‘natural’ scenes (e.g. plants), it is difficult to exclude the potential influence of the ‘naturalness’ of scenes on fixation duration without further detailed examination with a larger sample size. However, as our analysis also revealed that the fixation durations sampled from novel ‘natural’ scenes (253±7 ms) were not significantly different from those sampled from novel ‘artificial’ scenes (248±11 ms) (t test, P=0.61), it is unlikely that the potential interaction between familiarity and ‘naturalness’ of the tested scenes could fully account for our observation of a difference in fixation durations between face and natural scene images.

Detailed examination of facial configurations further revealed that the longer fixations on facial contours appeared to be dependent upon the arrangement of these contours into a coherent and recognizable object, namely a face. The duration of the fixations on the same facial contours in the scrambled face images was significantly shorter (Fig. 6). These results suggest that face and natural scene images may generate different patterns of visuomotor activity. The extra fixation duration on faces may be correlated with the detailed analysis of facial features.

It is believed that oculomotor strategies are closely linked with cognitive demand (Epelboim et al. 1995), and the fixation duration has been correlated with the amount of information being processed during foveal analysis (Moffit 1980). Longer fixations are usually associated with extra cognitive demand, informative visual information at the fixated region, and/or display complexity (Salthouse et al. 1981; Jacobs 1986; Hooge and Erkelens 1998). For example, individual fixation durations are longer during scene memorization than during search (Henderson et al. 1999), longer for semantically informative than for uninformative objects within the scene (Henderson and Hollingworth 1999), and longer when the image at fixation is reduced in contrast or partially obscured by a noise mask (van Diepen 1995).

One of the major differences between face and natural scene images is that faces have inherent social significance. They are behaviorally relevant visual stimuli for primates, which provide essential information about an individual’s gender, age, familiarity, intention and mental state (e.g. Bruce and Young 1998; Emery 2000). When viewing a complex scene containing faces, the highest portion of human fixations is directed to the faces (Yarbus 1967). The local facial features, such as eyes, are not just simple geometric patterns or objects. They also contain significant social communicative signals. Like humans, monkeys are heavily reliant on facial signals for social communication. Based on facial cues alone, they are readily able to respond appropriately to the expressions of other individuals (Mendelson et al. 1982), and to recognize and discriminate the faces of familiar and unfamiliar individuals (Rosenfeld and van Hoesen 1979; Parr et al. 2000). Their visual system also appears to be tuned to the informative facial features (Guo et al. 2003). They showed a preferential interest in the major local facial features while viewing faces, reflected in a high density of fixations and longer fixation durations. As local image complexity around the fixations is unlikely to account for the differences in fixation durations between the face and natural scene images (Figs. 2, 3, 4, 5), the extra duration of fixations for the faces may be correlated with the extra cognitive demand (i.e. “configural process”) which involves detailed analysis of local facial features and perceiving relations among the facial features, and therefore may be important for the acquisition and processing of facial cues, such as identity, expression and gaze direction (Maurer et al. 2002). However, from the present data it is difficult to see how the social relevance of the faces could affect the fixation durations, as we only tested neutral face images in a free viewing task in this experiment. In future studies it will be interesting to systematically manipulate social relevance over controlled sets of face images and/or cognitive demand, and to investigate the relations among social perception, cognitive demand and patterns of saccadic eye movements.

Interestingly, the facial configuration did not appear to have a significant influence on individual fixation durations. Indeed, the durations of fixations on major local facial features, such as eyes, nose and mouth, were not different between normal and scrambled faces (Fig. 6). This suggests that the longer fixations on the faces are mainly correlated with the analysis of the local facial features rather than the precise facial configuration. However, the disruption of facial configuration (i.e. inverted or scrambled faces) can significantly reduce the number of fixations compared with the normal upright faces (Guo et al. 2003). Taking these observations together, it seems that the number of fixations rather than the duration of fixations plays a more crucial role in the process of face inspection.

When tested with the scrambled face images, the durations of fixations on the facial contours (including hairlines) were slightly decreased (Fig. 6). For a normal upright face, the facial contour provides essential facial metric information which is critical for face perception and recognition (Burton et al. 1993; Perrett et al. 1994; Fellous 1997). Indeed, the responses of face-selective neurons in anterior inferotemporal cortex of macaques are correlated with dimensions relating the hairline to other facial points, such as eyes, in face discrimination tasks (Young and Yamane 1992). In our study, the observed longer fixations on the facial contours within the intact faces may be correlated with the analysis of the properties of facial dimensions, and this process may require extra fixation time.