1 Introduction

In the Phong reflection model, the diffuse and specular reflections vary with the positions of the observer and the lights, but the ambient light is constant. Under this assumption, the interreflections between rendered objects are lost. Adding ambient occlusion (AO) to vary the ambient light creates convincing soft shadows that, combined with direct lighting, give realistic images [1, Sect. 9.2]. The AO technique is faster than full global illumination solutions; however, it still demands considerable resources to achieve high-quality renderings. An approximation of this technique, called screen space ambient occlusion (SSAO) [15], simulates local occlusions in real time. However, the accuracy of the approximation strongly depends on the sampling density, which in turn limits its applications. This is a particular drawback in game engines, in which the rendering time spent on AO computation should take only a fraction of the frame time, because other calculations determine the quality of the gameplay. Thus, further acceleration of the SSAO computation is a desirable goal with a significant impact on the overall quality of real-time computer graphics.

In this work we present a gaze-dependent screen space ambient occlusion technique in which information about the observer’s viewing direction is employed to vary the accuracy of the occlusion factors. The screen region surrounding the observer’s gaze position is rendered with maximum precision, which decreases gradually towards the parafoveal and peripheral regions. This solution exploits the directional characteristic of the human visual system (HVS). People see high-frequency details only within a small viewing angle subtending 2–3\(^\circ \) of the field of view. In this range, people see with a resolution of up to 60 cycles per angular degree, but at a 20-degree viewing angle this sensitivity is reduced as much as tenfold [8].

The accuracy of the occlusion factors is determined by the number of samples used to calculate them. In practice, fewer samples result in ragged edges of the SSAO shadows. However, these aliasing artifacts are barely visible to humans in the peripheral regions of vision. In other words, the image deterioration caused by low sampling in the SSAO technique can be clearly visible only in high-frequency regions. The number of samples in SSAO can therefore be gradually reduced with distance from the gaze point, which significantly speeds up rendering. We use an eye tracker to capture the observer’s gaze point. Sampling is then reduced with eccentricity (the deviation from the visual axis) along the curve determined by the gaze-dependent contrast sensitivity function (GD-CSF) [5]. This perceptual function models the loss of contrast sensitivity with eccentricity and can be used to determine the maximum spatial frequency visible to humans at an arbitrary viewing angle.

Section 2 gives background information on the screen space ambient occlusion technique and outlines previous work. Section 3 focuses on our gaze-dependent extension of SSAO and shows how sampling can be reduced without noticeable image deterioration. Section 4 presents the results of the perceptual experiment in which we evaluate the visibility of the resulting image deteriorations.

2 Background and Previous Work

Screen space ambient occlusion (SSAO) is a rendering technique that approximates global illumination in real time [15]. For every pixel on the screen, the depths of the surrounding pixels are analyzed to compute the amount of occlusion, which is proportional to the depth difference between the current pixel and a sampled pixel.

Fig. 1. Map of occlusion shadows generated by the normal-oriented hemisphere SSAO technique.

We implemented the normal-oriented hemisphere SSAO technique [20]. The hemisphere is oriented along the surface normal at the pixel. The samples from the hemisphere are projected into screen space to obtain their coordinates in the depth buffer. If a sample position lies behind the stored depth (i.e. inside the geometry), it contributes to the occlusion factor. The procedure is repeated for every pixel in the image to generate a map of occlusion factors (also called occlusion shadows; see the example in Fig. 1). This map forms a characteristic shadowing of the ambient light, which is visible as high-frequency information in characteristic regions of the scene (e.g. at corners, close to complex objects, etc.). The shadows are blended with the pixel colors computed with the Phong lighting equation; the spatial frequency of the shadows is often higher than the variability of the Phong ambient shading.
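The per-pixel test can be summarized in a short sketch. The following Python/NumPy fragment is our illustration, not the authors’ implementation; the kernel layout, the bias value, the project helper, and the depth convention (larger values are farther from the camera) are assumptions:

```python
import numpy as np

def occlusion_factor(p, n, kernel, depth_buffer, project, radius=0.5, bias=0.025):
    """Normal-oriented hemisphere SSAO test for a single pixel (sketch).

    p            -- view-space position of the pixel
    n            -- unit surface normal at the pixel (view space)
    kernel       -- sample offsets inside the unit hemisphere around +z
    depth_buffer -- 2D array of view-space depths (larger = farther; an assumption)
    project      -- hypothetical helper mapping a view-space point to
                    integer pixel coordinates and its view-space depth
    """
    # Build a tangent basis so the hemisphere is oriented along the normal.
    up = np.array([0.0, 0.0, 1.0]) if abs(n[2]) < 0.99 else np.array([1.0, 0.0, 0.0])
    t = np.cross(up, n); t /= np.linalg.norm(t)
    tbn = np.column_stack((t, np.cross(n, t), n))

    occluded = 0
    for s in kernel:
        sample = p + radius * (tbn @ s)       # view-space sample position
        x, y, sample_depth = project(sample)  # screen coordinates + sample depth
        if 0 <= x < depth_buffer.shape[1] and 0 <= y < depth_buffer.shape[0]:
            # The sample is inside geometry when the surface stored in the
            # depth buffer is closer to the camera than the sample itself.
            if depth_buffer[y, x] + bias < sample_depth:
                occluded += 1
    return occluded / len(kernel)             # 0 = fully open, 1 = fully occluded
```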

Fig. 2. Quality of the occlusion shadows in relation to the number of samples. The image deterioration is especially visible in the insets.

In SSAO the computations are performed in screen space rather than by tracing new rays in 3-dimensional space, as is done in the original ambient occlusion (AO) technique [23]. The AO algorithm generates better results than SSAO, but its complexity prevents the use of AO in game engines and, more generally, in real-time computer graphics. Even the gaze-dependent extension of AO proposed in [9] renders only 1–2 frames per second at full GPU load.

The SSAO method introduces a number of visible artifacts, such as z-fighting caused by the limited resolution of the Z-buffer, or unrealistic darkening of objects resulting from an arbitrarily chosen sampling radius around the pixel (see details in [3, 7, 15]). The most problematic artifacts, however, are the noise and banding in the occlusion shadows caused by too small a number of samples (see examples in Fig. 2). To avoid visible noise, hundreds of samples per pixel would have to be generated. This is too expensive for game engines, which require a trade-off between accuracy and computational complexity. In practice, the number of samples is reduced to 32 and combined with bilateral filtering, which is still time consuming. We noticed that taking fewer than these 32 samples, even with low-pass filtering, results in a perceivable quality deterioration of the ambient occlusion shadows and should not be done in practical applications.

In the following section we propose a technique that reduces the number of samples in the peripheral region of vision without visible degradation of image quality. This type of image synthesis is called foveated rendering. Foveated rendering was proposed to accelerate ray casting by Murphy et al. [16]. Günter et al. [6] presented a rendering engine that generates three low-resolution images corresponding to different fields of view. The wide-angle images are then magnified and combined with the non-scaled image of the area surrounding the gaze point. The number of processed pixels can thus be reduced by a factor of 10–15, while keeping the deterioration of image quality invisible to the observer. Another foveated rendering technique, proposed by Stengel et al. [19], reduces shading complexity in deferred shading [1]. The spatial sampling is constant for the whole image, but the material shaders are simplified for peripheral pixels. According to the authors, this technique reduces the shading time by up to 80%. Foveated rendering has also been proposed for real-time tone mapping [10, 13].

3 Gaze-Dependent Rendering of SSAO

In the gaze-dependent SSAO technique, high-frequency spatial sampling is performed only in the region of interest. The farther from the gaze point, the less detailed the rendered ambient occlusion factor, which saves computation time, while the use of the eye tracker leaves the observer with the impression that the sampling is fully detailed everywhere. An outline of our gaze-dependent SSAO system is presented in Fig. 3. The observer’s gaze position on the display screen is captured by the eye tracker (see Sect. 3.3). At the same time, the 3D scene is rendered using the Phong lighting model. The ambient occlusion is then calculated using a varying number of samples (see Sect. 3.2); the sampling frequency depends on the angular distance between a pixel and the gaze point (see Sect. 3.1). Finally, the occlusion shadows are blended with the color image and displayed in real time on the screen.

Fig. 3. Gaze-dependent screen space ambient occlusion rendering system.

Fig. 4. Left: the most recognizable stimulus frequency as a function of eccentricity (the dashed line shows the maximum frequency reproducible by our display). Right: region-of-interest mask for an image of 1920\(\,\times \,\)1080 pixel resolution (gaze position at (1000, 500)); brighter areas depict higher frequencies visible to the HVS. The white spot surrounding the gaze position shows the area in which the maximum resolution of the display is reached.

3.1 Gaze-Dependent Contrast Sensitivity Function

The fundamental relationship describing the behavior of the human visual system is the contrast sensitivity function (CSF) [2]. It shows the dependence of the threshold contrast on the spatial frequency of the stimulus. At frequencies of about 4 cpd (cycles per degree), people are most sensitive to contrast, i.e. they can see a pattern even when the brightness differences between its individual motifs are slight. The CSF can be used, for example, to compress an image better by removing the high-frequency details that would not be seen by humans.

An extension of the CSF, called the gaze-dependent CSF (GD-CSF), is measured for stimuli observed at various viewing angles. Following Peli et al. [5], we model the threshold contrast \(C_t\) for spatial frequency f at an eccentricity E with the equation:

$$\begin{aligned} C_t(E,f) = C_t(0,f)\,\exp (kfE), \end{aligned}$$
(1)

where k determines how fast the sensitivity drops off with eccentricity (reported k values range from 0.030 to 0.057), and \(C_t(0,f)\) is the threshold contrast for foveal vision (the reciprocal of the foveal CSF).
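As a worked example, Eq. 1 can be evaluated directly. In the sketch below, the foveal threshold model and the chosen k are placeholders within the range reported above, not fitted values:

```python
import numpy as np

def contrast_threshold(E, f, ct0, k=0.045):
    """Eq. 1: threshold contrast at eccentricity E [deg] for frequency f [cpd].
    ct0 is a foveal threshold model (inverse CSF); k lies in [0.030, 0.057]."""
    return ct0(f) * np.exp(k * f * E)

# Toy foveal model for illustration only (not a fitted CSF):
foveal = lambda f: 0.01
print(contrast_threshold(E=10.0, f=4.0, ct0=foveal))  # threshold rises about 6x at 10 deg
```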

Based on GD-CSF, for a range of eccentricities, the most recognizable stimulus frequency can be modeled by the equation [21]:

$$\begin{aligned} f_c(E) = \frac{E_1 E_2}{E_2 + E}, \end{aligned}$$
(2)

where \(f_c\) denotes the cut-off spatial frequency (above this frequency the observer cannot identify the pattern), \(E_1 = 43.1\) is the foveal cut-off frequency, and \(E_2 = 3.118\) is the retinal eccentricity at which the cut-off frequency drops to half its foveal maximum (see details in [22]). The plot of this function is presented in Fig. 4 (left). An example region-of-interest mask computed for our display based on the above formula is presented in the right image of Fig. 4. Applying this mask, one can sample an image with varying frequency, generating fewer samples for the peripheral regions of vision.
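A minimal sketch of Eq. 2 and of the region-of-interest mask from Fig. 4 (right) is given below; the constants follow the text, while the small-angle conversion deg_per_px is an assumption made for readability:

```python
import numpy as np

E1 = 43.1   # foveal cut-off frequency [cpd]
E2 = 3.118  # half-resolution eccentricity [deg]

def cutoff_frequency(E):
    """Eq. 2: highest spatial frequency [cpd] identifiable at eccentricity E [deg]."""
    return E1 * E2 / (E2 + E)

def roi_mask(width, height, gaze, deg_per_px):
    """Per-pixel cut-off frequency map (cf. Fig. 4, right); brighter = higher f_c.
    deg_per_px approximates the visual angle covered by one pixel."""
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    ecc = np.hypot(xs - gaze[0], ys - gaze[1]) * deg_per_px
    return cutoff_frequency(ecc)
```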

3.2 Region-of-Interest Sampling

In the SSAO technique, a set of samples located in a hemisphere oriented along the surface normal at the pixel is analyzed (see Sect. 2 for details). For pixels distant from the gaze point, we reduce the number of samples in the hemisphere according to Eq. 2. For each pixel in the image, the eccentricity E, expressed in degrees of viewing angle, is calculated. This transformation must take into account the position of the gaze point as well as the physical dimensions of the display, its resolution, and the viewing distance. The resulting frequency \(f_c(E)\) is normalized to \({<}0,1{>}\) and mapped to a number of samples ranging from 2 to 32 (a sketch of this mapping is given below). Example ambient occlusion maps generated for a varying number of samples are presented in Fig. 5.
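Under the display geometry given in Sect. 4.4 (51 cm screen width, 65 cm viewing distance, 1280\(\,\times \,\)720 rendering resolution), the pixel-to-eccentricity conversion and the mapping to a sample budget could look as follows (reusing cutoff_frequency from the previous sketch); the linear mapping to 2–32 samples is our reading of the text:

```python
import numpy as np

def eccentricity_deg(pixel, gaze, screen_w_cm=51.0, res_x=1280, view_dist_cm=65.0):
    """Angular distance [deg] between a pixel and the gaze point.
    Square pixels are assumed; the geometry values follow Sect. 4.4."""
    cm_per_px = screen_w_cm / res_x
    dist_cm = np.hypot(pixel[0] - gaze[0], pixel[1] - gaze[1]) * cm_per_px
    return np.degrees(np.arctan2(dist_cm, view_dist_cm))

def sample_count(E, n_min=2, n_max=32):
    """Map the normalized cut-off frequency at eccentricity E to an SSAO
    sample count in [n_min, n_max]; the linear mapping is an assumption."""
    f = cutoff_frequency(E) / cutoff_frequency(0.0)  # normalized to <0,1>
    return int(round(n_min + f * (n_max - n_min)))
```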

Fig. 5. The occlusion shadows with gaze-dependent reduction of the sampling frequency. The blue arrows point to the locations of the gaze points. The deterioration of the shadows is clearly visible in the areas farther from the gaze points. (Color figure online)

3.3 Eye Tracking

Accuracy of the eye tracking plays a crucial role in our GD-SSAO setup, because even small deviations from the actual gaze position can make the peripheral image deteriorations visible to the observer. The eye tracker captures the gaze position indicated by the momentary location of the pupil centre [11]. These data must be filtered, because saccadic movements of the eye make the gaze position unstable [4]. A typical filtration is based on fixation algorithms, which analyze the velocity and/or dispersion of the gaze points and estimate the average gaze position over a time window [17]. However, fixation techniques are also prone to accuracy errors and cannot be used directly in our system because of the flickering they generate [12]. We found that temporal pooling of the fixation points generates satisfactory results (a sketch is given below). In our setup a 250 Hz eye tracker is used, whose frequency allows averaging 4 gaze point locations per frame. For persons “incompatible” with the eye tracker (i.e. with a significant calibration error), we increase the size of the high-frequency sampling area by scaling the \(f_c(E)\) value (multiplying it by a number greater than one). This solution eliminates the visible flickering of the occlusion shadows, but it also reduces the achieved rendering speed-up.
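The pooling can be as simple as a running mean over the most recent fixation samples. The sketch below uses a window of 4 samples, matching the 250 Hz tracker feeding one rendered frame; the plain arithmetic mean is our illustration, not necessarily the exact filter used:

```python
from collections import deque

class GazeFilter:
    """Temporal pooling of fixation points: the reported gaze position is the
    mean of the most recent samples (4 per frame with a 250 Hz tracker)."""
    def __init__(self, window=4):
        self.samples = deque(maxlen=window)

    def push(self, x, y):
        self.samples.append((float(x), float(y)))

    def gaze(self):
        if not self.samples:
            return None  # no tracker data yet
        n = len(self.samples)
        return (sum(s[0] for s in self.samples) / n,
                sum(s[1] for s in self.samples) / n)
```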

4 Experimental Evaluation

The main goal of the experiment was to evaluate how the reduction of the SSAO sampling affects the quality of the rendered images. We wanted to test whether the peripheral image deteriorations are visible to observers. In this section we also report the performance gain achieved by the reduced SSAO sampling.

4.1 Stimuli

In the experiment we used the Stanford dragon model enclosed in a 5-wall box, and the Sibenik cathedral scene. Three camera poses were selected for each scene, resulting in 6 different images (see selected shots in Fig. 8). The camera poses were static, because we noticed that animation focuses the observer’s attention on object movements rather than on the quality of the ambient occlusion effect. Please note that this assumption leads to more conservative results: the image deteriorations should be even less visible for dynamic images because of the visual masking effect.

4.2 Procedure

We asked the observers to carefully watch two images presented one by one on the screen in random order. One of these images was rendered with full frame SSAO using 32 samples per pixel (we call it the reference image). The second image was rendered using our GD-SSAO technique with the gaze point captured by the eye tracker. Each image was presented for 10 s, and after this time the observer was asked to assess the image quality on a Likert scale ranging from 0 (significantly deteriorated image) to 10 (excellent quality). This procedure was repeated twice for the 6 pairs of images, resulting in 24 evaluated images (2 scenes x 3 camera poses x 2 repetitions x 2 images in a pair).

4.3 Participants

The experiment was performed by 9 observers (aged between 21 and 24, 7 males and 2 females). All of them had normal or corrected-to-normal vision. No session took longer than 10 min. The participants were naïve about the purpose of the experiment. The eye tracker was calibrated at the beginning of each session. The observers did not know whether it was used while watching a given image.

4.4 Apparatus and Performance Tests

The experiment was conducted in a darkened room. Observers sat in front of a 22-in. LCD display with screen dimensions of 51\(\,\times \,\)28.5 cm and a native resolution of 1920\(\,\times \,\)1080 pixels. To achieve a rendering framerate of 30 Hz for full frame SSAO (32 samples per pixel), we reduced the image resolution to 1280\(\,\times \,\)720 pixels. The distance from the observer’s eyes to the display screen was restricted to 65 cm with a chin rest. We used an SMI RED250 [18] eye tracker working with an accuracy close to 1\(^\circ \). Our GD-SSAO renderer ran on a PC equipped with a 2.66 GHz Intel Xeon W3520 CPU, 8 GB of RAM, Windows 7 64-bit OS, and an NVIDIA Quadro 4000 graphics card.

For full frame SSAO, our system was able to render 30 fps (frames per second) for the Sibenik scene and 35.5 fps for the Stanford Dragon. With GD-SSAO, the performance increased to an average frame rate of 47.7 fps for Sibenik and 54.7 fps for the Stanford Dragon (1.59-times and 1.54-times acceleration, respectively).

Please note that the frame rate depends on the location of the gaze point, because sampling in image regions corresponding to complex scene geometry is more demanding than for flat regions. For the Stanford Dragon scene, higher frame rates were achieved when the observer looked at the corners of the box, because the dragon was then sampled with a lower frequency. For this scene the frame rate varied from 48.8 fps to 63.2 fps. For the Sibenik scene, it varied from 40.4 fps when the observer looked at the centre of the screen to 52.3 fps for the top-right corner.

It is also worth noting that the acceleration achieved by the GD-SSAO technique depends on the resolution of the rendered image. Due to hardware limitations, we had to reduce this resolution to 1280\(\,\times \,\)720 pixels. For full HD or 4K displays, the performance boost will be correspondingly greater.

Fig. 6. Results of the quality evaluation for individual observers (left) and scene shots (right). The dashed horizontal line depicts the average DMOS value. The error bars show the standard error of the mean. (Color figure online)

Fig. 7. Ranking graph illustrating the lack of a statistically significant difference between the tested scenes.

Fig. 8. Images used in the experiment. The left column presents images rendered with Phong shading. Ambient occlusion was added in the middle column. The right column shows the corresponding ambient occlusion maps.

4.5 Quality Evaluation

To evaluate whether the peripheral image deterioration was visible to observers, we calculated the difference mean opinion score (DMOS) as the difference between the scores given to the reference full frame SSAO rendering and to GD-SSAO with the eye tracker. A score of zero would suggest that observers did not see any difference between the techniques, while DMOS = 10 would mean full disagreement with the GD-SSAO quality. The DMOS computed from the results of our experiment, averaged over all observers and all pairs of images, equals 2.25 (std = 2.04), which suggests that observers noticed the quality deterioration when the eye tracker was used, but that this deterioration was negligible.
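For clarity, the score aggregation reduces to a per-pair difference of opinions; a minimal sketch (the sample standard deviation is our choice of estimator):

```python
import numpy as np

def dmos(ref_scores, gd_scores):
    """DMOS: mean difference between the scores given to the reference
    full-frame SSAO images and to the GD-SSAO images."""
    d = np.asarray(ref_scores, dtype=float) - np.asarray(gd_scores, dtype=float)
    return d.mean(), d.std(ddof=1)  # compare with 2.25 (std = 2.04) above
```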

Figure 6 shows the DMOS scores for individual observers (left) and individual scene shots (right). The variation of the scores could suggest that opinions differ between observers and between scenes. Therefore, we performed a multiple-comparison test, which identifies statistical differences in ranking tests. Following [14], the results of this analysis are presented as a ranking of the mean DMOS scores for the tested scenes (see Fig. 7). The scenes are ordered according to increasing DMOS value, with the smallest DMOS on the left. The percentages indicate the probability that an average observer will choose the scene on the right as better than the scene on the left. If the line connecting two scenes is red and dashed, there is no statistically significant difference between this pair of scenes. Probabilities close to 50% usually result in a lack of statistical significance. For higher probabilities the dashed lines would be replaced with blue lines but, as can be seen in Fig. 7, the multiple-comparison test confirms the lack of a statistically significant difference for all tested scenes.

5 Conclusion

We proposed a novel variant of the SSAO technique in which a rendering speed-up is achieved by varying the sampling of the ambient occlusion shadows. The results of the conducted experiments show that people experience only a slight deterioration of image quality in comparison to full frame SSAO. We argue that this deterioration is caused by the temporal lag of the eye tracking rather than by the reduced sampling. In future work we plan to repeat the experiment using faster rendering hardware and a higher image resolution.