1 Introduction

Virtual reality (VR) has been heralded as a powerful technology for representing and simulating architectural spaces, as its immersive sense of presence supposedly emulates the natural experience of space (Schuemie et al. 2001). However, several authors have observed inaccuracies in the perception of virtual environments, the most frequent being that spatial dimensions tend to be perceived as smaller (Interrante et al. 2008; Jones et al. 2011; Loomis and Knapp 2003; Messing and Durgin 2005; Murgia and Sharkey 2009; Renner et al. 2013a, b). These distortions pose a serious threat to the validity of VR to represent architectural spaces in a precise and reliable way and therefore call into question its use in architectural applications that require veridical simulations, such as design validation by architects or clients.

The factors behind these distortions are not fully understood. Common theories tend to focus on technical aspects, such as hardware limitations and software errors, or on human factors. In this study, we explore an alternative, design-oriented approach: we examine how virtual environments are modeled and whether their design characteristics might influence the accuracy of spatial perception.

We started with the observation that in physical environments the perception of dimensions relies on, among other factors, visual cues provided by the spatial context. When the availability of these visual cues is restricted, the probability of distortions in the perception increases. Consequently, assuming that the subject performs the same processes of spatial perception in an immersive VR experience as in a real space, we hypothesize that virtual environments modeled with a higher availability of meaningful visual cues will lead to a more accurate perception of spatial dimensions. In an experiment, we compare the accuracy of spatial dimension estimation in real and virtual environments with different availabilities of visual cues visualized using a VR device.

2 Background

Underestimations of spatial dimensions in virtual environments, especially when using head-mounted displays (HMD), are widely reported in the literature. An extensive review, with a focus on egocentric dimensions (i.e., the distance between a point and the observer), can be found in Renner et al. (2013a, b). According to their review, studies report that, in general, the estimated dimensions in virtual environments are about 74% of the actual modeled dimensions.

There is no conclusive explanation for these distortions. The most common theories point to hardware limitations and/or software errors. The restricted field of view of HMDs and the physical constraints that helmets or goggles impose on head movement are two of the most studied factors. However, studies show diverse and contradictory results, and therefore, these factors cannot be asserted as the main cause of distortions (Creem-Regehr et al. 2005; Knapp and Loomis 2004; Willemsen et al. 2009). In fact, similar distortions have been found in non-HMD systems, such as cave automatic virtual environments (CAVE) and stereoscopic systems (Ng et al. 2016; Bruder et al. 2015, 2016; Lin et al. 2015; Piryankova et al. 2013; Marsh et al. 2014).

Uncorrected geometric distortions are also frequently claimed as a source of inaccuracies (Bruder et al. 2012; Kellner et al. 2012; Steinicke et al. 2011), as well as visual effects (Cidota et al. 2016; Langbehn et al. 2016), but it is unlikely that these would explain all inaccuracies as correction algorithms have been developed and yet distortions still occur. The impact of an incorrect interpupillary distance (IPD) is subject to debate, as researchers have found conflicting evidence regarding the extent of its influence (Renner et al. 2013a, b; Robinett and Rolland 1992; Kellner et al. 2012; Willemsen et al. 2008).

Human factors and interindividual differences have also consistently been mentioned as potential sources of distortions. To date, gender, age and height have not been found to have a significant influence (Murgia and Sharkey 2009). On the other hand, studies suggest that experience with VR, spatial perception skills and physical qualities might have an influence, yet it is unlikely that personal differences are the sole source of error, and therefore, they should be considered only as a secondary factor.

While most studies tend to look for the causes of this phenomenon in the technical characteristics of the VR devices and/or the psychophysiological attributes of the subjects, substantially fewer studies have considered the design of the VR models as a possible source of distortions.

In physical environments, the perception of dimensions relies on the availability of visual cues gathered from the environment (Howard 2012; Cutting and Vishton 1995). Several studies have shown that when these visual cues are limited, the accuracy of spatial perception declines (Kunnapas 1968; Philbeck and Loomis 1997), even in cases when these contextual cues appear to be irrelevant to estimating the distance between two targets (Lappin et al. 2006; Witt et al. 2007). Hence, it seems plausible that the lack of sufficient and meaningful visual cues might affect the accuracy of spatial perception in virtual environments in a manner similar to the way it does in real environments. Yet only a few studies have directly addressed this hypothesis.

The most common visual cues are binocular disparity, motion parallax, perspective, texture gradient, occlusion, relative size, relative density, lighting and shading. Binocular disparity (i.e., the difference of vision between the left and right eyes) is probably the most important cue (Proffitt and Caudek 2002) and is present in any HMD system when an adequate IPD is set for each viewer. Motion parallax is considered a weak cue for distance estimation beyond 2 meters (Philbeck and Loomis 1997; Cutting and Vishton 1995), with no substantial influence in virtual environments (Jones et al. 2011). Perspective cues and texture gradient were found useful in improving distance perception (Surdick et al. 1997; Thomas et al. 2002; Sinai et al. 1998). Similarly, Kenyon et al. (2007), Luo et al. (2009), and Murgia and Sharkey (2009) showed that a more complex visual context (i.e., including relative density and relative size visual cues) helped to improve depth perception. However, adding familiar objects does not improve distance estimation in virtual environments (Interrante et al. 2008; Armbrüster et al. 2008) or in real environments (Beall et al. 1995). Lighting was studied by Tai (2012), who found a direct relationship between lower luminance contrast and longer distance estimates.

In architectural spaces, visual cues exist in the form of building elements, natural elements or common objects such as furniture or fixtures. For centuries, architects have used these elements to facilitate or alter spatial perception, from the use of forced perspective in the Renaissance and Baroque eras (Sinisgalli 2012) to the use of trompe-l’œil decorative effects or altered-size furniture in commercial stores to intentionally make spaces look different. Numerous studies have investigated these techniques, finding a direct relationship between the use of furniture (Kaye and Murray 1982; Imamoglu 1970, 1973; Luria et al. 1967; von Castell et al. 2014), lighting (Oberfeld et al. 2010) and other visual elements (Gäbling 1970; Stamps 2010; Serpa and Muhar 1996) and the perception of size and spaciousness of architectural rooms.

The availability of visual cues is a concept that differs from, but is related to and sometimes confused with, the graphic quality. While the availability of visual cues refers to the existence of meaningful content in the visual space, the graphic quality refers to the level of degradation of a video or image in a digital display compared to an ideal. In this regard, some authors have explored the hypothesis that low quality graphics might not provide an adequate sense of presence for an accurate spatial perception. Research results are contradictory. While Kunz et al. (2009) found that higher quality graphics improved distance perception, Willemsen and Gooch (2002) and Thompson et al. (2004) found no evidence supporting this idea.

In summary, the literature shows that the availability of visual cues might be a factor in explaining the inaccuracies in dimension estimation in virtual environments, but also that this theory has been substantially less investigated than others and therefore requires a reconsideration with a more detailed experimental setup and improved technical equipment.

3 Methodology

An experiment was designed to compare the accuracy of egocentric and exocentric dimension estimation in real environments and in virtual environments modeled with different availability of visual cues, visualized using an HMD VR device.

Twenty-seven participants, selected by convenience, were told they would be participating in a study on the use of VR in architectural visualization, with no particular mention of dimension estimation. All participants were graduate students and/or family members, with ages ranging 22–51 years and diverse academic backgrounds. All participants had normal (20/20) or corrected-to-normal vision, and none had previous experience with VR technologies.

After an individual calibration of the HMD and a general introduction to its use, the participants were asked to observe, with the naked eye, the physical room they were located in and, using the HMD, three virtual rooms, all in random order. For each case, they were asked to describe the room’s configuration, lighting attributes and/or material textures and to estimate the following dimensions: overall length and width of the room (exocentric dimensions), and the distances between them and the nearest and furthest walls (egocentric dimensions). The reason for asking about general spatial properties in addition to the dimensions was to avoid revealing to the participants the purpose of the study and thereby triggering unwanted cognitive calculations in the estimation process. The physical room served to measure the ability of each individual to estimate dimensions in a physical environment, as a control scenario for later comparisons.

All three virtual rooms were fictitious, with different proportions although similar area (37–45 m²), and modeled with different availability of visual cues, as shown in Fig. 1 and Table 1. Room A was very abstract, with no visual cues other than the binocular vision and motion parallax, which are provided by default by the HMD stereoscopic vision and head tracking features. Room B included some material properties (i.e., texture gradient in walls and floor) and a couple of generic orthogonal and parallel boxes with no indication of size, which were intended to give a sense of perspective. Room C additionally included several pieces of occluding furniture, fine-grained material textures, lighting fixtures that cast shadows and familiar-size objects (boxes were replaced by standard furniture), defining a setup that incorporated all visual cues studied.

Fig. 1 From top to bottom, virtual models of room A (low availability of visual cues), room B (medium availability of visual cues) and room C (high availability of visual cues), showing a standard standing modeling viewpoint. In the experiment, the view height was calibrated individually for each participant

Table 1 Availability of visual cues for each virtual room

To ensure consistency, the avatars for all participants were positioned in the same spot, as if they were seated in a virtual chair, which was mimicked in the physical setup (Fig. 2). Participants were not allowed to walk (physically or virtually), but were permitted to rotate and move their head and torso freely. Since participants were seated, the varying eye height was taken into account both when creating the models and when calibrating the HMD individually for each participant. The interpupillary distance (IPD) was also individually calibrated.

This stationary configuration, while uncommon for many VR applications, is widely used in commercial applications in architecture, especially for simple tasks or when there is a risk of dizziness or motion sickness for users unfamiliar with the technology (as in this case). For example, fixed-point VR visualizations are common in real estate showrooms, where users only need to grasp a sense of the size, proportions and design style of spaces. Free navigation is certainly more powerful and is therefore typically used by professionals for more complex tasks such as wayfinding, accessibility studies, constructability reviews or dynamically exploring the spatial qualities of designs. Free navigation, however, also involves other issues that may affect distance perception, such as the influence of navigation speed or the role of kinetic sensing from moving body parts, which were avoided in this study by choosing a stationary setup.

Because of the location of the avatar and sizes of virtual rooms, a wide range of distances had to be estimated by the participants. Egocentric distances ranged from 0.80 to 6.80 m, and exocentric distances (room dimensions) ranged from 4.50 to 8.50 m.

Finally, participants were requested to comment on their experience with the VR for the visualization of architectural virtual environments.

The VR hardware used was an Oculus Rift™ DK2 HMD, which has a resolution of 960 × 1080 per eye, a 100° horizontal field of view, a 75 Hz refresh rate and inertial and positional head tracking. Previous studies with the Oculus Rift have had auspicious results with the equipment (Creem-Regehr et al. 2015; Young et al. 2014; Andrus et al. 2014). No headphones or joysticks were used. The 3D architectural model was built and rendered using the Unreal Engine 4 game engine and then exported as a stand-alone application.

One important methodological consideration is the difficulty in measuring distance perception. Since spatial comprehension is a psychological process that cannot be directly observed, researchers are forced to use indirect methods that might be biased. Direct verbal estimation (i.e., asking the observer to verbally state the distance in some familiar unit) is the most common method, although it has been questioned by some authors because of the influence that cognitive processes (i.e., deductive calculations) might have over the proper perceptual processes (Loomis and Knapp 2003). Other methods that have been used include comparing the distance to a point in relation to a reference, estimating the midpoint between two points, walking blindfolded to a previously visualized point and estimating the walking time between two points (Rieser et al. 1990; Loomis and Philbeck 2008; Kuhl et al. 2006). Nevertheless, after reviewing dozens of studies, Renner et al. (2013a, b) concluded that distance distortions in virtual environments are consistent regardless of the method used. In this study, both direct and indirect verbal estimation techniques were used.

Fig. 2 General setup of the experiment

4 Results

The participants’ responses are analyzed and expressed both as absolute errors and as normalized or relative errors. Absolute errors (\(\Delta\)) are defined as the absolute discrepancy between estimated (\(\hat{d}\)) and true distances (\(d\)), expressed in meters:

$$\Delta = \left| {\hat{d} - d} \right|$$

Normalized or relative errors (\(\varepsilon\)) are defined as the difference of estimated (\(\hat{d}\)) and true distances (\(d\)), over the true distance:

$$\varepsilon = \frac{{\hat{d} - d}}{d}$$

The normalized error is interpreted as the proportion of underestimation or overestimation relative to the actual distance. An ε closer to 0 denotes a more accurate estimate, while a negative value indicates underestimation and a positive value indicates overestimation. To enhance clarity, in this paper underestimations are preceded by a “−” sign, while overestimations are denoted with a “+” sign (e.g., − 0.02 vs. + 0.02). While the absolute error is useful to understand the magnitude of errors, the normalized error allows for comparisons between cases. The total error, either absolute (\(\Delta_{\text{total}}\)) or normalized (\(\varepsilon_{\text{total}}\)), is defined as the mean of the errors in egocentric dimensions and exocentric dimensions.
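The two error definitions above can be illustrated with a minimal Python sketch. The function names and the sample distances are hypothetical and are not taken from the study's data. The total error is computed here as the mean of the mean egocentric and mean exocentric errors, which is an assumption about how the averaging was done; with two dimensions of each kind, as in this experiment, it coincides with the mean over all four errors.

```python
from statistics import mean


def absolute_error(estimated, true):
    """Absolute error in meters: |d_hat - d|."""
    return abs(estimated - true)


def normalized_error(estimated, true):
    """Signed relative error (d_hat - d) / d; negative means underestimation."""
    return (estimated - true) / true


def total_error(ego_errors, exo_errors):
    """Mean of the mean egocentric and mean exocentric errors (assumed averaging)."""
    return mean([mean(ego_errors), mean(exo_errors)])


# Hypothetical (estimated, true) distances in meters for one participant.
ego_pairs = [(1.5, 2.0), (5.0, 5.0)]    # nearest and furthest wall
exo_pairs = [(6.0, 8.0), (3.75, 5.0)]   # room length and width

ego = [normalized_error(e, t) for e, t in ego_pairs]
exo = [normalized_error(e, t) for e, t in exo_pairs]

print(ego)                    # [-0.25, 0.0]
print(exo)                    # [-0.25, -0.25]
print(total_error(ego, exo))  # -0.1875
```

In this made-up case the participant underestimates three of the four dimensions by 25%, giving a total normalized error of about − 0.19.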

Figures 3, 4, 5 and 6 illustrate the results from this experiment for all participants and all cases. A summary of results is shown in Table 2.

Fig. 3 Normalized error (ε) in physical room (control)

Fig. 4 Normalized error (ε) in room A

Fig. 5 Normalized error (ε) in room B

Fig. 6 Normalized error (ε) in room C

Table 2 Summary of results

4.1 Dimension estimation in physical environment (control)

In the physical environment (control), the mean total error is + 0.08 (s = 0.26). Errors in exocentric (+ 0.05) and egocentric dimensions (+ 0.10) are overestimations. Although the means of normalized errors appear to be low, the high standard deviation and mean absolute errors reveal that people are not very good at estimating dimensions even in physical environments. The high correlation between exocentric and egocentric errors (R² = 0.83) indicates that people tend to be consistent in overestimating or underestimating dimensions. Women were slightly more accurate than men (\(\bar{\varepsilon }_{\text{total(w)}} = - \,0.02\) vs. \(\bar{\varepsilon }_{\text{total(m)}} = + \,0.14\)).
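As an illustration of how a correlation between exocentric and egocentric errors can be quantified, the sketch below computes R² as the squared Pearson correlation coefficient. This is an assumption: the text does not state how R² was obtained. The function name and the sample values are hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Hypothetical per-participant mean normalized errors (not the study's data).
exo_errors = [-0.30, -0.10, 0.05, 0.20, 0.40]
ego_errors = [-0.25, -0.05, 0.00, 0.30, 0.35]

print(r_squared(exo_errors, ego_errors))
```

A value close to 1 would indicate that participants who overestimate (or underestimate) one kind of dimension tend to do the same with the other.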

4.2 Dimension estimation in virtual environments

Table 2 shows an evident decrease in mean normalized errors as the availability of visual cues increases: − 0.20 for room A, − 0.17 for room B and − 0.10 for room C. The standard deviation is relatively consistent in all cases. This decrease in the magnitude of errors is also apparent in Figs. 4, 5 and 6: the distribution of responses in room A is shifted toward the negative quadrants, showing that most people tend to underestimate dimensions significantly, while the distribution of responses in room C looks much more balanced around 0, similar to the control case. The mean absolute errors confirm the trend: 1.69 m for room A, 1.44 m for room B and 1.40 m for room C. However, in all cases, including room C, the estimation errors are higher than in the control case.

There is an appreciable difference between egocentric and exocentric dimension estimates. While in egocentric dimensions there is a clear decrease in both normalized and absolute errors, in exocentric dimensions the tendency is much less defined (Fig. 7).

Fig. 7 Mean normalized error for exocentric and egocentric dimensions and mean totals, for all cases. Dimensions in all virtual environments were underestimated, while dimensions in the control case were slightly overestimated. Estimations for rooms B and C are highly consistent, but not so for room A. Room C shows a magnitude of error similar to that of the control case

Remarkably, all results, both in egocentric and in exocentric dimensions for all three rooms, show that dimensions were underestimated, while in the control case (physical room) all dimensions were overestimated.

Between egocentric and exocentric errors, the correlation R² is very similar to that of the control case (0.87 for room A, 0.73 for room B and 0.83 for room C), indicating that people are also consistent in either overestimating or underestimating dimensions in virtual environments. There are no observable differences between men and women, for either larger or shorter dimensions.

A one-way analysis of variance (ANOVA) was conducted to compare the effect of the availability of visual cues on the accuracy of dimension estimation in the three virtual rooms and the control scenario. The effect was statistically significant for the total normalized error [F(3,104) = 6.38, p < 0.01]. Analyzing egocentric and exocentric errors separately reveals a much higher significance for the former:

$${\text{ANOVA}}\;\varepsilon_{\text{ego}} : \, \left[ {F(3,104) = 11.55, \, p < 0.001} \right]$$
$${\text{ANOVA}}\;\varepsilon_{\text{exo}} : \, \left[ {F(3,104) = 2.91, \, p < 0.04} \right].$$
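For readers unfamiliar with the test, the following Python sketch shows how a one-way ANOVA F statistic and its degrees of freedom are computed. The data are synthetic: they are drawn from normal distributions loosely based on the reported means, with a standard deviation of 0.26 assumed for every condition, and they are not the study's measurements. The function name is hypothetical. With four conditions and 27 observations each, the degrees of freedom are (4 − 1, 108 − 4) = (3, 104), matching the values reported above. The p value is omitted because it requires the F distribution, which the standard library does not provide.

```python
import random


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Synthetic normalized errors: control, room A, room B, room C (assumed values).
rng = random.Random(0)
assumed_means = [0.08, -0.20, -0.17, -0.10]
groups = [[rng.gauss(mu, 0.26) for _ in range(27)] for mu in assumed_means]

f_stat, df_b, df_w = one_way_anova(groups)
print(f"F({df_b},{df_w}) = {f_stat:.2f}")
```

Because the values are synthetic, the printed F statistic is not expected to reproduce the figures in the text; only the degrees of freedom carry over.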

Post hoc comparisons for the total error using Tukey’s honest significant difference (HSD) test revealed that the control was significantly different from room A (p < 0.01) and room B (p < 0.01). However, there were no statistically significant differences between the control and room C. For the egocentric error, the Tukey HSD test also shows significant differences between the control and room C (p < 0.05), and between rooms A and C (p < 0.01).

5 Discussion

Taken together, the results of this experiment suggest that the availability of visual cues has a significant effect (p < 0.01) on the accuracy of dimension estimation when using a VR HMD device, especially for egocentric dimensions (p < 0.001). The decrease in total error seems to indicate that the greater the availability of visual cues, the greater the level of accuracy in the estimations. The largest significant differences are found between the control room and rooms A and B (low and medium availability of visual cues).

It is worth noting that results also show that the overall estimation of egocentric dimensions in all three virtual environments is \(\bar{x} = - \,0.19\), about 19% underestimation, which is highly consistent with what has been observed before in the literature.
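To compare this value with review figures stated as a percentage of the modeled dimension, the normalized error can be restated as a ratio of estimated to true distance. The following is a direct rearrangement of the definition of \(\varepsilon\) given in the Results and introduces no new data:

$$\frac{{\hat{d}}}{d} = 1 + \varepsilon \quad \Rightarrow \quad \frac{{\hat{d}}}{d} = 1 - 0.19 = 0.81$$

That is, egocentric estimates in the virtual rooms were on average about 81% of the modeled distances. By the same rearrangement, the figure of about 74% cited in the Background corresponds to \(\varepsilon = 0.74 - 1 = - \,0.26\).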

Although these results seem to confirm the hypothesis of this study, there are reasons to evaluate their implications with caution.

First, the sample for this study has statistical power limitations. Both its small size (N = 27) and the fact that it was a convenience sample (as opposed to a random sample) call into question its suitability to adequately represent a normal population. Nevertheless, the data were previously checked using the Shapiro–Wilk normality test, showing a W = 0.97, which is indicative of a normal distribution.

Second, in this study we focused on the combined effect of several visual cues integrated in the same model, since this is a more natural representation of how visual cues exist in physical spaces. However, this approach has the disadvantage that the greater effect of one specific cue could mask the lesser effect of another cue. Also, a cue might have a misleading role that could distort the results. These relative differences or unwanted effects could be detected with a sensitivity analysis, which in this case is precluded by the small sample.

Third, the within-subject design of the experiment (i.e., the same group of people is tested in all conditions, including the control) entails some carryover effects that may affect the results, including practice and fatigue. To help reduce fatigue (especially visual fatigue and dizziness), the application time of the experiment was very short (< 5 min), and to help prevent participants from focusing on their ability to estimate dimensions (and improving through conscious practice), questions about dimensions were camouflaged among other general questions.

Fourth, the experiment was conducted using a desktop PC with a graphics card (GPU) slightly inferior to the recommendations of the manufacturer. Although the virtual models were optimized to have a maximum latency of 25 ms with 75 fps, which is similar to other experiments in the literature, and no participant showed any signs of dizziness or vision discomfort, the barely noticeable motion blur with this equipment may have affected participants in an unknown way.

Fifth, participants were seated and not allowed to walk (virtually or physically) during the experiment. This stationary setup is a less familiar way to explore and visualize a room than walking around, where other factors (e.g., navigation/walking speed, dynamic perspective, etc.) also contribute to distance perception. Consequently, the results of this study must be interpreted only for stationary contexts and cannot be generalized to free navigation VR applications.

Finally, a factor not initially considered in the design of the experiment was the time allowed to each participant to estimate dimensions. This became clear when some participants took almost a minute to estimate a dimension, while others did so in just a few seconds, which suggests that different participants presumably used different calculation methods.

Given all these points, the results might be viewed as suggestive evidence, but in no case conclusive, and therefore, the subject deserves further exploration in deeper and broader studies.

Some possible directions for future studies are: (a) conducting a similar study with a larger random sample; (b) using highly photorealistic models built with spherical 3D photographic cameras; (c) using models with a varying number of visual cues so as to obtain continuous numeric data that allow for bivariate correlation analysis and fine sensitivity analyses; or (d) building a virtual environment that allows participants to interact with the space by walking or moving objects.

6 Conclusions

Taken together, the results of this experiment suggest that the availability of visual cues has a significant effect (p < 0.01) on the accuracy of dimension estimation when using a VR HMD device, especially for egocentric dimensions (p < 0.001). The decrease in total error seems to indicate that the greater the availability of visual cues, the greater the level of accuracy in the estimates of dimensions.

However, the results are not conclusive and must be taken with caution. A small sample size, inherent disadvantages of the experimental design, and technical limitations in its implementation are some factors that suggest this study should be viewed as a pilot experiment for an eventual second experiment conducted with a larger sample and more rigorous methodology.

Based on the evidence produced by this study, there is a clear need to study how virtual environments are modeled, as the availability of visual cues seems to be one factor influencing the accuracy of dimensional estimates. In the long term, this study contributes to a better understanding of how architectural environments should be modeled for use in applications where accurate and reliable simulations are required.