1 Introduction

Three-dimensional (3D) content can provide an immersive experience, allowing for a more natural and realistic user experience than what is possible with conventional two-dimensional (2D) videos. The first appearance of 3D videos can be traced back to 1903, when the Lumière brothers showed the first 3D short film at the Paris World Fair [11]. Although these early films were relatively successful, it was not until the 1950s that the movie industry started investing in 3D technologies. Despite this initial success, 3D videos failed to live up to expectations because of several technical problems.

In recent decades, with the help of image processing algorithms and advanced filming and display techniques, 3D video technology has improved considerably [1, 15, 17]. The use of 3D content in applications like TV, gaming, and teleconferencing has increased considerably, producing new real-time 3D video communication applications with enhanced interactive capabilities and better user quality of experience (QoE). An active topic of research has been the optimization of the parameters of the acquisition, representation, transmission, and display technologies, with the goal of saving resources (e.g. bitrate, bandwidth). However, the level of acceptability and popularity of 3D video applications is strongly correlated with the service reliability and the quality of the content provided to the user [16]. More specifically, the QoE of a 3D video is defined as a combination of several perceptual attributes, which include the overall spatial-temporal picture quality (2D quality), the comfort level, the naturalness, and the realism of the 3D content [13, 14].

Currently, the most accurate way to determine the quality of 3D or 2D videos is to perform psychophysical experiments with human subjects. Experimenters first create a set of videos with varying parameters (Hypothetical Reference Circuits – HRCs) that reflect typical conditions found in applications. Then, human observers are asked to rate specific attributes of each video [6]. These experiments are often designed with the goal of understanding how one or more attributes of the video content affect the user experience. Therefore, quality databases containing sets of videos with varying attributes (and, consequently, degradations), together with their corresponding quality scores, are constantly used in 3D technology research.

Currently, there are a few publicly available 3D video quality databases. Among the most popular are COSPAD1 [18], MMSPG-EPFL [7], the Trinity College Dublin 3D Post-Production Database (TCD-3DPPD) [3], and 3DVCL@FER [4]. The 3D videos in 3DVCL@FER, a superset of the COSPAD1 database, contain compression, packet loss, resizing, and temporal distortions, along with the corresponding quality, depth, and comfort subjective scores. In this database, the only 3D impairments are vertical and horizontal disparities, added during post-processing. On the other hand, the MMSPG-EPFL database has real acquisition degradations, generated by changing the camera distances and alignments during acquisition. Nevertheless, the scenes corresponding to a particular HRC were obtained by filming (sequentially) the scene with different camera parameters. This way, the content of the videos corresponding to the same scene (source) is not exactly the same, which makes it more difficult to objectively compare the different camera settings. Finally, the TCD-3DPPD database includes examples of several broadcasting production scenarios, captured at different locations including a studio with controlled lighting and indoor and outdoor sites. This database contains examples of degradations that are commonly present in real-world scenarios, but it has limitations in terms of source material. Also, it does not include the quality scores corresponding to each test sequence.

To the best of our knowledge, there is no public quality database that contains 3D videos with typical stereoscopic acquisition degradations, such as the keystone effect and the cardboard effect. There is also no copyright-free, high-quality 3D-rendered content, similar to what is widely used in the cinema industry, available for download. Therefore, the first goal of this work is to build such a database. With this goal, we created a set of high-quality original Computer Graphics Imaging (CGI) scenes. Then, we used different parameters to render 3D videos containing different types of stereoscopic degradations. The proposed database (UnB-3D) is publicly available for download on our laboratory's website [5]. Researchers can use this database to test objective quality metrics or to propose post-processing algorithms that correct the presented artifacts. Our second goal is to analyze the impact of these stereoscopic degradations on the user quality of experience. To accomplish this goal, we performed a psychophysical experiment in which participants rated the depth quality and comfort of the created set of 3D test sequences. These subjective responses can be used to guide content creators in reducing production costs without compromising the quality of experience.

The paper is organized as follows. In Section 2, we explain how the 3D videos and the degradations were created. In Section 3, we detail the experimental methodology, while in Section 4 we present a statistical analysis of the subjective data collected during the experiment. Finally, in Section 5 we draw our conclusions.

2 Generation of 3D scenes

We manually created five CGI 3D scenes using 3D models available on the 'DigitalTutors' website. The 'Maya 2014' software was used to create and render the scenes, 'RealFlow 2013' was used to simulate physical movements, and scripts in Python and C# were used to introduce stereoscopic degradations. Figure 1 depicts sample frames of the SRC (source) 3D scenes. Each SRC sequence corresponds to a 10-second 3D video at 24 frames per second (fps), with a resolution of 1280×720 pixels.

Fig. 1 Sample frames of the five SRC 3D video sequences

Table 1 shows the Temporal Information (TI), the Spatial Information (SI), and the depth of each SRC scene. SI and TI are computed as described in the ITU-R BT.1788 Recommendation [9]. To compute the SI of a 3D scene, we first process each frame with a Sobel filter, compute the standard deviation of the filtered frame, and take the maximum value among all frames. To compute the TI of a scene, we take the differences between consecutive frames, compute the standard deviation of these frame differences, and then take the maximum value. All operations are carried out using only the luminance component, normalized between 0 and 1. The depth (D) is defined as the size of the view frustum, which is given by the distance between the near and far planes of the visualization volume. The value of D, in pixels, denotes the amount of disparity. Therefore, a value of 255, which represents 20% of the image width, indicates that the scene has 255 perceived depth levels, typically on an interval of [−X, Y], depending on the zero-parallax position and configuration.

Table 1 Temporal Information (TI), Spatial Information (SI), and depth (D) of the 5 SRC 3D scenes [9]
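
For concreteness, the following is a minimal sketch of this SI/TI computation in Python; the function names and the assumption that `frames` is a list of luminance arrays normalized to [0, 1] are ours, not part of the Recommendation:

```python
import numpy as np
from scipy import ndimage

def spatial_information(frames):
    """SI: maximum over frames of the standard deviation of the
    Sobel-filtered luminance plane."""
    si_values = []
    for frame in frames:  # frame: 2D luminance array in [0, 1]
        grad = np.hypot(ndimage.sobel(frame, axis=0),
                        ndimage.sobel(frame, axis=1))
        si_values.append(grad.std())
    return max(si_values)

def temporal_information(frames):
    """TI: maximum over time of the standard deviation of
    consecutive frame differences."""
    diffs = [frames[i] - frames[i - 1] for i in range(1, len(frames))]
    return max(d.std() for d in diffs)
```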

For each SRC sequence, we generated 10 HRCs, which are listed in Table 2. Each entry contains the disparity, represented as a percentage of the image width, and a normalized depth value. Negative values correspond to content in front of the screen (pop-up effect) and positive values to content beyond the screen. HRC01 corresponds to a 2D rendering, i.e. the scene was rendered using only one camera. This does not correspond to using only the left or right view, but to positioning a single camera on the central optical axis and then performing the rendering process. It is important to point out that, since these scenes are 2D, their depth is conveyed only by 2D cues.

Table 2 Description of the HRCs, with respect to translation, zero parallax position, normalized depth and disparity range (in image width)

HRC02, HRC03, and HRC04 were rendered using an off-axis 3D camera with three different zero-parallax positions, which vary according to the camera focal distance. The three parallax positions were center (HRC02), back (HRC03), and front (HRC04). These configurations control the pop-up effect created by the negative parallax, which is strongest in HRC03. HRC04 scenes have a positive parallax with objects behind the screen, which reduces this effect, and HRC02 scenes have an equal amount of content in front of and behind the screen. These conditions should not affect depth quality, since the perceived depth does not change. We included these HRCs because we wanted to know if the comfort of a 3D scene varies for different zero-parallax positions.

HRC05 and HRC06 were created by horizontally shifting the image. Horizontal image translation (HIT) is an electronic process that is commonly used to shift the left-eye and right-eye images horizontally, modifying the stereoscopic characteristics of the 3D content after capture. Most commonly, HIT is used to alter the zero-parallax setting (ZPS) to compensate for stereo window violations or for excessive positive or negative parallax in the source material. In our test conditions, HRC05 was generated by applying a positive HIT to HRC03 (increasing the negative disparity), while HRC06 was generated by applying a negative HIT to HRC04 (increasing the positive disparity). These HRCs either contained cropping or introduced vertical black bars on the borders of the image. We added these HRCs to the test set to analyze whether the excessive disparity, or cropping, caused by this type of post-processing affects the depth quality and comfort of the scene.
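
As an illustration, the sketch below applies a HIT to a stereo pair with NumPy. The sign convention (a positive shift moves the views apart) and the black-bar fill are our assumptions for illustration, not a description of the exact scripts used to build the database:

```python
import numpy as np

def shift_view(img, s):
    """Shift an image horizontally by s pixels, filling the vacated
    columns with black (the vertical bars mentioned above)."""
    out = np.zeros_like(img)
    if s > 0:
        out[:, s:] = img[:, :-s]
    elif s < 0:
        out[:, :s] = img[:, -s:]
    else:
        out[:] = img
    return out

def horizontal_image_translation(left, right, shift):
    """Move the two views apart (shift > 0) or together (shift < 0),
    changing the zero-parallax setting of the pair."""
    return shift_view(left, shift), shift_view(right, -shift)
```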

HRC07 and HRC08 correspond to 3D renderings using a toe-in camera configuration. When compared to HRC07, HRC08 scenes have an excess of vertical disparity, which was introduced by a vertical translation of the image by 20 pixels. These scenes contain keystone degradations [19], which are illustrated in Fig. 2. Although keystone distortions are more visible at the image borders, their severity is proportional to the convergence distance and focal length [12]. Notice that the slightly different plane positions of the toe-in cameras, marked in red, cause a difference in the vertical position of homologous points in the different views (vertical parallax). In general, vertical parallax causes visual discomfort and is therefore undesirable.

Fig. 2 Illustration of the keystone distortion, which is characterized by a vertical parallax and is more visible in the borders of the image

The HRC09 scenes correspond to a 3D rendering implemented using a simple Depth-Image-Based Rendering (DIBR) algorithm. The inputs to the DIBR algorithm were the HRC01 scene (i.e. a 2D scene) and its corresponding depth map. HRC09, labeled as fake 3D, produces image warping and plane curvature degradations [19], as illustrated in Fig. 3. The depth plane distortion is a direct consequence of the DIBR process, which causes an unnatural horizontal parallax, producing a misrepresentation of the relative object distances on the screen. Notice that object borders appear warped because of the lack of depth information for the occluded regions.

Fig. 3 Illustration of a depth plane curvature distortion
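
The sketch below shows what such a simple DIBR step can look like in Python. It is a minimal illustration under our own assumptions (normalized depth map, symmetric pixel shifts, no hole filling), not the exact algorithm used to render HRC09; the absence of hole filling is precisely what exposes the warping artifacts at object borders:

```python
import numpy as np

def simple_dibr(image, depth, max_disparity):
    """Synthesize left/right views from a 2D frame plus its depth map by
    shifting each pixel horizontally in proportion to its depth value.
    `depth` is assumed normalized to [0, 1]; disoccluded regions are
    left black (no hole filling)."""
    h, w = depth.shape
    left = np.zeros_like(image)
    right = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            d = int(round(max_disparity * depth[y, x]))
            xl, xr = x + d // 2, x - (d - d // 2)  # split shift between views
            if 0 <= xl < w:
                left[y, xl] = image[y, x]
            if 0 <= xr < w:
                right[y, xr] = image[y, x]
    return left, right
```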

According to Lebreton et al., depth perception is the capacity of perceiving the spatial organization of the elements in a scene, including the distances among them [10]. Depth quality, on the other hand, corresponds to the naturalness or the precision of the depth information. A stereoscopic distortion that illustrates depth quality is the cardboard effect, which is characterized by a poor representation of the binocular depth cues. HRC10 scenes were produced by quantizing the depth map and then applying the same procedure used to generate the fake 3D; as a result, HRC10 scenes may contain cardboard degradations. Figure 4 depicts two depth map images: the original one (left) and its quantized version (right). Although depth can be perceived in both, in the quantized version the discrete layers of the map break the continuity of the perceived depth, which may cause an incorrect identification of distances [2].

Fig. 4 Illustration of the cardboard degradation: (left) original depth map and (right) quantized depth map
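
A minimal sketch of this quantization step, assuming a depth map normalized to [0, 1] and a hypothetical number of layers:

```python
import numpy as np

def quantize_depth(depth, levels=4):
    """Collapse a continuous depth map into `levels` discrete layers,
    reproducing the flattened 'cardboard' look: within each layer all
    pixels share the same depth, so objects appear as flat cut-outs."""
    return np.round(depth * (levels - 1)) / (levels - 1)
```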

Our database has features that are not available in other databases. In general, it is very difficult to extend the number of HRCs in 3D quality databases because the 3D scenes need to be recorded with the same lighting conditions and object placement. But, since our database was built using CGI techniques, its set of 3D artifacts can easily be extended by including more acquisition or processing techniques. Also, in our database, the disparity maps can be obtained easily and with 100% accuracy, simply by capturing the renderer values. Table 3 summarizes the differences, in terms of HRCs, between the proposed database and two of the most popular 3D databases.

Table 3 Comparison of UnB-3D database HRCs against 3DVCL@FER and MMSPG-EPFL

3 Experimental methodology

The psychophysical experiment was performed in a sound-proof room. A PC was used to play the 3D videos on a 3D display. Participants wore 3D glasses to watch the videos and entered their scores using a tablet. Table 4 shows the technical specifications of the equipment used in the experiment. We used the Nvidia 3DVision 2 technology because, when paired with a 144 Hz monitor, the accompanying 72 Hz glasses are able to show the 24 fps 3D sequences of our database without stuttering. With this setup, there is little crosstalk (ghosting) degradation, in which the left view is partially perceived by the right eye and vice-versa [2]. At the beginning of the experiment, participants were told that if they felt any discomfort (e.g. nausea or headaches) during the experimental session, they should stop the experiment and call the experimenter.

Table 4 Technical specifications of the equipment and physical environment of the psychophysical experiment

The experiment was divided into the following stages: (1) stereoscopic test, (2) training session, (3) experimental session, and (4) interview. In the stereoscopic test stage, the experimenter checked the participant's ability to recognize and perceive stereoscopic content using the equipment (3D display and glasses). This test consisted of showing a pair of noisy images (i.e. the left and right views of a 3D image) to the participants. When seen individually, the two images looked like randomly generated noise. But, when the two images were fused and displayed on a 3D monitor, viewers wearing 3D glasses would see a cube in the negative parallax plane. After viewing the fused content, participants were asked to choose one of the following answers: 'I saw a noisy image', 'I saw a cube', 'I saw a circle', and 'I saw a triangle'.
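
This kind of stimulus is essentially a random-dot stereogram. The sketch below generates one in Python with a square (rather than a cube) floating in front of the screen; all parameter values are illustrative assumptions, not the stimulus actually used in the test:

```python
import numpy as np

def random_dot_stereogram(size=256, square=80, disparity=8, seed=0):
    """Left/right random-dot pair hiding a square that appears in front
    of the screen (negative/crossed parallax) when the views are fused."""
    rng = np.random.default_rng(seed)
    left = rng.integers(0, 2, (size, size)).astype(np.uint8) * 255
    right = left.copy()
    y0 = x0 = (size - square) // 2
    # Shift the hidden region to the left in the right view only.
    right[y0:y0 + square, x0 - disparity:x0 - disparity + square] = \
        left[y0:y0 + square, x0:x0 + square]
    # Refill the uncovered strip with fresh, uncorrelated dots.
    right[y0:y0 + square, x0 + square - disparity:x0 + square] = \
        rng.integers(0, 2, (square, disparity)).astype(np.uint8) * 255
    return left, right
```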

The training session was divided into three parts. In the first part, participants watched a 3D video that contained one main object with varying zero-parallax depth positions. This video had 240 frames and was played at 24 fps. The goal was to help the participant differentiate the object's different depth positions in the 3D space (with respect to the monitor). At the beginning of the video, the position of the main object in the scene was set at zero parallax. As the video played, the position of this object changed slightly with each frame. If the participant was not able to perceive these depth position differences, the video was replayed until he/she could differentiate them.

In the second part of the training stage, participants viewed a 3D image with different levels of (3D depth) comfort. For a single SRC (original 3D content), participants first watched a 2D representation, then a sequence of several stereoscopic renderings with different horizontal parallax separations. Figure 5 depicts an example of two anaglyph renderings, for clarification purposes only. Next, participants watched the corresponding fake-3D version and its cardboard variation. The goal here was to help the participant differentiate the variations in 3D, using an image instead of a video. Again, if the participant was not able to distinguish these examples, the sequences were replayed until he/she was able to perceive the differences.

Fig. 5 Anaglyph rendering of two timestamps of the training set
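
For reference, a red-cyan anaglyph like the one in Fig. 5 can be composed from a stereo pair with a few lines of NumPy (a simple channel-mixing sketch assuming 8-bit RGB arrays; not the rendering pipeline used in the experiment):

```python
import numpy as np

def red_cyan_anaglyph(left_rgb, right_rgb):
    """Red channel from the left view, green and blue (cyan) from the
    right view; viewed with red-cyan glasses, each eye sees one view."""
    out = right_rgb.copy()
    out[..., 0] = left_rgb[..., 0]
    return out
```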

In the third part of the training session, the participant was introduced to the scoring interface. The goal here was to increase familiarity and prevent potential usability problems. As mentioned earlier, the participant used a tablet to score the depth quality and the comfort of the displayed scene. After each sequence was played, the interface containing the scoring scales was displayed on the tablet. Figure 6 shows these two scales, which range from one (lowest) to five (highest). After entering his/her answers, the participant pressed 'next' and a message appeared on the tablet asking him/her to look at the (main) monitor, where the next video was displayed. This procedure was repeated until all videos had been played.

It took, on average, 30 minutes for each participant to complete the experiment. The experimental session was fully automated and no intervention from the experimenter was needed. During the experimental session, the 50 test sequences were shown to each participant in random order, and the participant was asked to rate their depth quality and comfort. A Single Stimulus (SS) [8] methodology was used in the experiment, which means that each test stimulus was displayed by itself, without a reference stimulus. We chose this methodology because we wanted the participant's first impression, mimicking a real movie theater scenario, where observers cannot replay a scene. Between the test sequences, a gray screen was displayed on the main monitor and an alert text on the tablet interface prompted the participant to score the content. Finally, at the interview stage, the participant was asked two questions: "Did you feel any discomfort?" and "Was your perception of 3D quality modified by this experiment?".

Fig. 6 Comfort and Depth Quality scales used in the experimental session

4 Experimental results

Thirty-three subjects participated in the experiment, but one participant was eliminated because he did not pass the stereoscopic test. Of the remaining 32 subjects, 25 were male and 7 were female. For all subjects, we collected their age and interpupillary distance (IPD), i.e. the distance between the centers of the pupils of the eyes. The IPD differs from person to person, and this value affects the way users perceive depth. In our experiment, the IPD was measured using a standard ruler placed against the observer's forehead. Interviews with the participants revealed that some had mild discomfort symptoms, while 3 had stronger reactions (e.g. headaches and ocular stress), mainly caused by pre-existing health issues.

As mentioned earlier, each participant rated the depth quality and the comfort of the 3D sequences. To analyze the quality and comfort data collected during the experiment, we computed the average (over all participants) of these two scores. More specifically, the Mean Depth Quality Score (MDQS) was computed using the following equation:

$$ \text{MDQS}(v) = \frac{1}{N} \sum\limits_{j=1}^{N} \text{DQS}(v,j), $$
(1)

where N is the total number of participants and DQS(v, j) is the depth quality score given by the j-th participant to the v-th 3D video sequence. We also computed the confidence interval for all MDQS values, as given by:

$$ \text{MDQS}(v) \pm 1.96 \cdot \frac{\sigma(\text{DQS}(v))}{\sqrt{N}}, $$
(2)

where σ(DQS(v)) is the standard deviation, across participants, of the DQS values given to the v-th video sequence.

Similarly, we computed the Mean Comfort Score (MCS) values using the following equation:

$$ \text{MCS}(v) = \frac{1}{N} \sum\limits_{j=1}^{N} \text{CS}(v,j), $$
(3)

where N is the total number of participants and CS(v, j) is the comfort score given by the j-th participant to the v-th 3D video sequence. We also computed the confidence interval for all MCS values, as given by:

$$ \text{MCS}(v) \pm 1.96 \cdot \frac{\sigma(\text{CS}(v))}{\sqrt{N}}, $$
(4)

where σ(CS(v)) is the standard deviation, across participants, of the CS values given to the v-th video sequence.
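
In code, both means and their 95% confidence intervals reduce to the same small helper (a sketch; the variable names are ours):

```python
import numpy as np

def mean_score_with_ci(scores):
    """Mean opinion score and 95% confidence interval for one video,
    given one rating per participant (DQS or CS values)."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half_width = 1.96 * scores.std(ddof=1) / np.sqrt(scores.size)
    return mean, (mean - half_width, mean + half_width)
```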

Next, we present a statistical analysis of the subjective scores collected during the experiment, including an analysis of the main effects impacting comfort and depth quality. We also present perceptual models that estimate depth quality and comfort, taking into consideration intrinsic characteristics of the videos, such as depth, spatial information and temporal information.

4.1 Basic analysis of MDQS and MCS values

Figure 7 depicts the histogram of the DQS (in orange) and CS (in blue) values given by all participants to the complete dataset. This graph shows that, in general, participants rated the sequences with high comfort and quality scores (≥ 4). To check the internal consistency of the experiment, we computed Cronbach's alpha for the depth quality and comfort scores, obtaining alpha values of 0.905 and 0.889, respectively. Both values indicate excellent internal consistency.

Fig. 7 Depth Quality Score (DQS) and Comfort Score (CS) distribution for the experiment
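
Cronbach's alpha can be computed directly from a participants-by-videos rating matrix; the sketch below follows the standard formula (our own illustrative implementation):

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a (participants x videos) rating matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                      # number of items (videos)
    item_vars = ratings.var(axis=0, ddof=1)   # per-video score variance
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)
```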

Figure 8 shows the MDQS values averaged across the source contents (SRCs) and grouped according to the HRC (see Table 2). The graph also shows the confidence intervals of these averages. Although the average MDQS value corresponding to HRC01 (2D) was smaller than the value corresponding to HRC02 (3D), the difference between these two values was not statistically significant (t-test, P > 0.05). In fact, a one-way repeated measures Analysis of Variance (ANOVA) test showed that most MDQS differences present in Fig. 8 were not statistically significant. This means that subjects had difficulties differentiating the HRCs. A more careful analysis of the data showed that only a sub-group of the subjects exhibited this behavior: subjects who were less familiar with 3D content. More specifically, the behavior of subjects familiar with 3D content (experts) was different from that of subjects with no familiarity with 3D content (non-experts).

Fig. 8 Average MDQS across all SRCs, grouped by the HRCs (see Table 2)

To better understand this phenomenon, we used the k-means algorithm to split the subjects into two groups: experts and non-experts. As expected, the k-means grouping reflected the division of subjects according to their reported familiarity. Figure 9 shows the average MDQS across all SRCs, grouped by 3D familiarity (experts and non-experts) and by HRC (see Table 2). Notice that, in most cases, as expected, the MDQS differences were not statistically significant for the non-experts group. But, for the experts group, there was a clear difference between the MDQS values corresponding to the 2D and 3D HRCs.

Fig. 9 Average MDQS across all SRCs, grouped by the expertise (familiarity) and HRCs (see Table 2)
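
A sketch of such a split with scikit-learn, assuming each subject is represented by his/her vector of per-video scores (the exact features used for clustering are our assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

def split_subjects(score_matrix, seed=0):
    """Cluster subjects into two groups from their rating patterns.
    score_matrix: (n_subjects x n_videos) array of DQS values."""
    km = KMeans(n_clusters=2, n_init=10, random_state=seed)
    return km.fit_predict(score_matrix)  # 0/1 group label per subject
```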

Figure 10 shows the average MCS across all SRCs, grouped by 3D familiarity (experts and non-experts) and by HRC (see Table 2). Notice that the average MCS values were similar for the experts and non-experts groups. Also, it is clear from the graph that 2D scenes (HRC01) were rated as the most comfortable, followed by cardboard (HRC10) and fake-3D (HRC09) scenes. The 3D scene with vertical disparity, HRC07, obtained the smallest average MCS value. This result shows that the keystone effect has a very negative impact on the observers' comfort.

Fig. 10 Average MCS across all SRCs, grouped by the expertise (familiarity) and HRCs (see Table 2)

Figure 11 shows a plot of the MCS for each SRC, averaged over the HRCs without artefacts (HRC02-HRC06), versus the corresponding temporal information (TI). Notice that there is a negative correlation between comfort and the amount of movement of the objects in the scene. In fact, the Pearson Correlation Coefficient (PCC) between the SRC TI (see Table 1) and the average MCS value (across all HRCs) is -0.83. Notice that the SRCs that received the highest MCS values are 'station' and 'lemons', which have low TI. In contrast, the SRCs that received the lowest MCS values are 'spaceship' and 'robot', which have high TI values, and 'fight', which, although it does not have a high TI value, has several scene cuts and fast disparity changes. This high correlation is explained by the fact that it takes more effort for the brain to fuse two picture frames when there is a high level of temporal information in the scene.

Fig. 11 Temporal Information effects on MCS

Our data also show that there is a correlation between content and comfort scores. Figure 12 depicts the average MDQS and MCS across all HRCs, grouped by SRC. In this graph, the results are again separated by 3D familiarity: expert (E) and non-expert (NE) groups. Notice that there is indeed a relatively high positive correlation between depth quality and comfort across contents (PCC = 0.68). For example, the SRC sequence 'lemons' has the highest MDQS and, at the same time, one of the highest MCS values.

Fig. 12 MDQS and MCS values grouped by content (SRC)

4.2 Statistical analysis of main effects and interaction terms

We performed a statistical analysis of the main effects and interactions affecting both the MDQS and MCS values. We present first the tests performed on the MDQS values and, then, the tests performed on the MCS values. The first test was a Mixed ANOVA, with 3D familiarity as the between-subjects factor, MDQS as the dependent variable, and HRC as the independent (within-subjects) variable. Results showed that there is a statistically significant interaction between 3D familiarity and HRC for the variable MDQS, with F(4,84) = 39.72, P < 0.05, partial η2 = 0.732.
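
In Python, such a test can be run, for example, with the pingouin package. The listing below is a sketch with synthetic data and hypothetical column names, not the original analysis scripts:

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Synthetic long-format data with the assumed layout: one row per
# (subject, HRC) pair, holding the subject's mean DQS for that HRC.
rng = np.random.default_rng(0)
subjects = range(32)
hrcs = [f'HRC{i:02d}' for i in range(1, 11)]
df = pd.DataFrame([{'subject': s,
                    'familiarity': 'expert' if s < 16 else 'non-expert',
                    'hrc': h,
                    'dqs': rng.uniform(1, 5)}
                   for s in subjects for h in hrcs])

# Mixed ANOVA: within-subjects factor 'hrc', between-subjects 'familiarity'
aov = pg.mixed_anova(data=df, dv='dqs', within='hrc',
                     subject='subject', between='familiarity')
print(aov[['Source', 'F', 'p-unc', 'np2']])  # np2 = partial eta squared
```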

Next, to analyze more carefully the effect of the individual HRCs (instead of the groupings) on the MDQS, we performed a one-way repeated measures ANOVA, with Bonferroni post-hoc analysis, applied individually to the experts and non-experts groups. Naturally, before performing the ANOVA, we tested the normality and sphericity of our data. A Shapiro-Wilk test of normality showed that our data have a normal distribution. Nevertheless, Mauchly's test of sphericity indicated that the assumption of sphericity was violated (χ2 = 124.603, P < 0.0005), i.e. the F-statistic is positively biased, rendering it invalid and increasing the risk of a Type I error. To correct this issue, we used the Greenhouse-Geisser procedure to determine to what degree the sphericity assumption had been violated (𝜖 = 0.673) and corrected the degrees of freedom (df) of the F-distribution accordingly. It should be noted that it is not uncommon to find that sphericity has been violated.
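
These checks and the corrected repeated measures ANOVA map onto SciPy and pingouin as follows (again a sketch, reusing the hypothetical df from the previous listing):

```python
from scipy import stats
import pingouin as pg

# Shapiro-Wilk normality test on the scores
w_stat, p_normal = stats.shapiro(df['dqs'])

# Mauchly's test of sphericity
spher = pg.sphericity(df, dv='dqs', within='hrc', subject='subject')

# Repeated measures ANOVA; correction=True applies the
# Greenhouse-Geisser correction to the degrees of freedom
aov = pg.rm_anova(data=df, dv='dqs', within='hrc',
                  subject='subject', correction=True)

# Bonferroni-corrected post-hoc pairwise comparisons between HRCs
post = pg.pairwise_tests(data=df, dv='dqs', within='hrc',
                         subject='subject', padjust='bonf')
```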

For the experts group, results showed that there were statistically significant differences between the MDQS values of 3D scenes (HRC02-HRC08) and the MDQS values of 2D (HRC01) and fake-3D scenes (HRC09-HRC10) (P < .05). Therefore, it is safe to say that 3D scenes had a higher depth quality than 2D and fake-3D scenes. It is worth mentioning that there were no statistically significant differences between the MDQS values of 2D and fake-3D scenes, which implies that the fake-3D and 2D scenes had similar depth quality. Moreover, no statistically significant differences were found between the depth quality scores corresponding to the different 3D camera configurations (HRC02-HRC08). This result was expected, since the amount of depth in these scenes was the same.

For the non-experts group, we detected statistically significant differences only between the MDQS of the fake-3D and cardboard scenes (HRC09 and HRC10, respectively) and the rest of the HRCs (HRC01-HRC08) (P < .05). In other words, non-experts had difficulties identifying differences in depth quality, even between 2D scenes and 3D scenes. On the other hand, they were clearly able to notice the artefacts in the fake-3D and cardboard configurations and, therefore, rated these scenes as having a worse depth quality than the rest of the HRCs, including the 2D scenes (HRC01).

Then, we performed a Mixed ANOVA, with 3D familiarity as the between-subjects factor, MCS as the dependent variable, and HRC as the independent variable. Results showed that, for MCS, there was no statistically significant interaction between 3D familiarity and HRC, with F(4,84) = 15.68, P = 0.22, partial η2 = 0.132. So, we analyzed the averages of the MCS values across all SRCs, using a one-way repeated measures ANOVA with Bonferroni post-hoc analysis, considering all individuals (experts and non-experts). Results showed that the MCS values of HRC01 are statistically significantly different (P < .05) from the MCS values of all other HRCs. In other words, 2D scenes were perceived as having the highest comfort level. However, there was no difference in comfort among the different camera configurations (HRC02-HRC06), which means that participants did not prefer one type of parallax position. The differences between the average MCS values corresponding to HRC07 and those corresponding to all other HRCs are statistically significant (P < .05), meaning that increasing the vertical parallax causes a decrease in comfort. The exceptions are the MCS values for HRC09 and HRC10, which only show statistical significance against HRC01.

Figures 13 and 14 show the marginal means of the MCS and MDQS values, respectively, versus the HRCs. For a better understanding of the interactions, the values corresponding to each content (SRC) are shown in different colors. We can see that, even with the same 3D parameters, each scene has a different distribution of values, indicating that content has an effect on quality and comfort. To analyze the influence of content on comfort and quality, we performed a two-way repeated measures ANOVA, first with MCS and then with MDQS as the dependent variable, and HRC and SRC as independent variables. In both tests (for MCS and MDQS), all main effects were statistically significant (P < .05). For MCS, simple effects between the scenes 'lemons' and 'spaceship' were not statistically significant, which is probably due to their temporal and spatial similarities. But, for all other SRCs, simple effects were statistically significant. This means that the content (SRC) influences the perception of comfort, along with the HRC parameters. With respect to MDQS, simple effects were statistically significant for all SRCs. This shows that content also influenced depth quality.

Fig. 13 MCS estimated means against HRCs

Fig. 14 MDQS estimated means against HRCs

To understand whether we are measuring two different variables, and to analyze whether subjects can differentiate quality from comfort, a two-way repeated measures ANOVA was executed to determine if the differences between the quality (DQS) and comfort (CS) scores were statistically significant. In this case, the null hypothesis corresponds to the case in which there is no statistical difference between the two variables (comfort and depth quality). Results showed that there are statistically significant differences between depth quality and comfort scores, with F(2,16) = 30.462, P < 0.001, partial η2 = 0.278. A simple main effects (one-way ANOVA) test showed that the differences between depth quality and comfort scores were not statistically significant for HRC04, HRC05, and HRC06, while the differences for HRC01, HRC02, HRC03, HRC07, HRC09, and HRC10 were statistically significant. In the case of HRC08, the differences fell just short of statistical significance.

4.3 Perceptual depth quality and comfort models

With the goal of modeling the MDQS from the depth (D), spatial information (SI), and temporal information (TI) scene characteristics, we fitted a complete linear model, considering all possible combinations and interaction terms of these three variables. The regression used 50 data points, one for each test sequence (SRC-HRC combination), each corresponding to the average of the depth quality scores for that specific sequence. The depth value was taken from the normalized values in Table 2. The fit returned a complex model with several terms, but several of these terms turned out not to be statistically significant. Therefore, we refitted the data, considering only the statistically significant terms, which resulted in the following model:

$$ \text{PDQS}= a_{0} +a_{1} \cdot \text{SI}\cdot \text{D} + a_{2} \cdot \text{TI}\cdot \text{D} + a_{3} \cdot \text{TI} + a_{4} \cdot \text{TI}\cdot \text{SI}, $$
(5)

where D is the depth, SI is the spatial information, and TI is the temporal information. The assumptions of linearity, independence of errors, homoscedasticity, absence of unusual points, and normality of residuals were all met. The fit was statistically significant, with the following parameters: F(4,45) = 36.620, P < .0005, adjusted R2 = .744, and multiple correlation coefficient R = 0.875.
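
This kind of restricted fit can be reproduced, for example, with statsmodels, where ':' in the formula denotes an interaction term. The file name and column names below are our assumptions, shown only to make the model concrete:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file: one row per test sequence with its average MDQS
# and the SI, TI, and normalized depth (D) of the sequence.
df = pd.read_csv('unb3d_mdqs.csv')  # assumed columns: mdqs, si, ti, d

# Matches Eq. (5): intercept a0 plus SI*D, TI*D, TI, and TI*SI terms
model = smf.ols('mdqs ~ si:d + ti:d + ti + ti:si', data=df).fit()
print(model.summary())  # coefficients a0..a4, R^2, F-statistic
```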

Table 5 displays the regression coefficients, standard errors, and significance of the fit. Figure 15 shows the residual plot. The intercept term a0 in (5) can be associated with the 2D depth quality, given that purely 2D videos also have depth cues. Notice that the coefficient of the TI term (a3) has a positive value, which suggests that motion affects depth quality positively. Nevertheless, the coefficients corresponding to the interaction terms TI × D (a2) and TI × SI (a4) are negative. This suggests that when motion is combined with depth and spatial information, the overall depth quality decreases. On the other hand, the coefficient of the interaction term SI × D (a1) has a high positive value, which suggests that the combination of depth and spatial information increases depth quality.

Table 5 Fitting parameters of the multiple linear regression analysis performed on the MDQS data
Fig. 15 Normal P-P plot of MDQS standardized residuals

We fitted a similar model to the MCS data. The regression again used 50 data points, one for each test sequence, each corresponding to the average of the comfort scores for that specific sequence. As before, we fitted the complete linear model with all combinations of terms and their interactions. Then we kept only the statistically significant terms, which resulted in the following model:

$$ \text{PCS}= a_{0} +a_{1} \cdot \text{TI}\cdot \text{D} + a_{2} \cdot \frac{\text{SI}}{\text{TI}} + a_{3} \cdot \text{TI}. $$
(6)

The fit returned the following parameters: F(3,46) = 18.530, P < .0005, adjusted R2 = .518, and multiple correlation coefficient R = 0.740. Table 6 displays the regression coefficients and standard errors of the fit. Figure 16 shows the residual plot. Again, TI and D have a large impact on comfort. As expected, a 2D scene, with zero depth, yields higher PCS values because the term weighted by a1 vanishes. The other terms show that TI also contributes to the model. In the presence of depth, the best comfort is achieved for a lower TI associated with a higher SI. In other words, participants gave lower scores to scenes with a lot of movement and greater depth.
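
The MCS model can be fitted in the same way; the SI/TI ratio enters the formula through patsy's I() operator (again a sketch with assumed file and column names):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('unb3d_mcs.csv')  # assumed columns: mcs, si, ti, d

# Matches Eq. (6): TI*D interaction, SI/TI ratio, and a TI main effect
model = smf.ols('mcs ~ ti:d + I(si / ti) + ti', data=df).fit()
print(model.params)
```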

Table 6 Multiple regression analysis on MCS
Fig. 16 Normal P-P plot of MCS standardized residuals

5 Conclusions

In this paper, we described a new database for the subjective assessment of 3D videos that contains a set of typical stereoscopic distortions. The UnB-3D database was created with Computer Graphics Imaging (CGI) scenes, using different rendering parameters to generate different stereoscopic degradations. UnB-3D is flexible: the same assets can be used to produce a broader variety of multimedia applications, such as multi-view and 360° video content. The database is also extensible, since it is possible to include more camera configurations, lighting conditions, disparity configurations, etc., to create more HRCs for further subjective experiments.

To analyze the impact that stereoscopic degradations have on the quality of experience, we performed a psychophysical experiment in which participants rated the depth quality and the comfort of this set of 3D scenes. We noticed that subjects with no familiarity with 3D content (non-experts) had difficulties differentiating the HRCs. Participants familiar with 3D content (experts), on the other hand, were able to differentiate the depth quality and comfort of the different HRCs. Results also showed a negative correlation between comfort and the temporal information of the scene, indicating that movement, when associated with depth, decreases the user's comfort. The fitted models indicated that quality correlates strongly with content characteristics (SI and TI) rather than with the 3D parameters other than depth. Post-processing effects that increase the horizontal disparity affected neither the perceived 3D quality nor the comfort, but increasing the vertical disparity produced higher discomfort. Participants did not unanimously agree that scenes with a predominantly positive parallax were more comfortable than scenes with a negative parallax. Fake-3D scenes could be clearly distinguished from true 3D ones, but they were more comfortable.

Future work consists of using the UnB-3D meta files to create 360° videos and analyzing their impact on perceived quality and comfort. We also plan to create more HRCs that address 2D depth cues, because in this experiment we realized that they have a great impact on perceived depth, since even the 2D scenes did not receive low MDQS values.