
1 Introduction

Virtual Reality (VR) has attracted substantial attention from industry and research communities due to its ability to provide users with a stereoscopic and immersive experience through Head-Mounted Displays (HMDs) [6, 8]. Omnidirectional Videos (ODVs), a.k.a. \(360^{\circ }\) videos, panoramic videos, or spherical videos, have emerged as a significant form of VR content. By wearing a VR HMD and adjusting their head orientation, users can explore the audio-visual content in any direction. This immersive experience of simulating real-world scenes has contributed to the popularity of ODVs in various application fields, including medicine, education, advertising, and tourism.

Compared to traditional videos, ultra-high-definition ODVs contain more scene information and multi-channel audio, which greatly increases the data volume. Due to this huge amount of data, playback stalling and quality switching caused by network delays and fluctuations often occur during transmission, which degrades the quality of ODVs and further affects their QoE. Moreover, ODVs may also suffer from distortions introduced during capturing or displaying, which further decreases the QoE. Therefore, to provide users with a smooth viewing experience, it is important to monitor the quality of ODVs during capturing, encoding/decoding, transmission, etc., and to perform optimization accordingly.

In the past few decades, many objective quality assessment methods have been proposed for traditional planar videos [19, 23], and some recent works have also explored audio-visual video quality assessment [21]. Recently, with the popularity of VR, many studies have addressed omnidirectional image quality assessment [3, 24] and omnidirectional video quality assessment [13]. However, most omnidirectional video quality assessment research focuses only on the single-mode visual signal; few works have investigated the multimodal quality assessment of ODVs incorporating audio information. As an important part of ODVs, spatial audio may strongly influence perceptual quality, so it is necessary to conduct in-depth research on the audio-visual quality assessment of omnidirectional videos.

In this paper, we make three contributions to the omnidirectional audio-visual quality assessment (OAVQA) field. Firstly, we construct a large-scale omnidirectional audio-visual quality assessment dataset to address the scarcity of such datasets. We collected 15 high-quality reference omnidirectional audio-visual (A/V) sequences and generated 375 distorted ODVs degraded from them. Subsequently, 22 subjects were recruited to participate in the subjective quality assessment experiment, and the audio-visual quality ratings of the reference and distorted videos were collected. Secondly, we design three baseline methods for full-reference OAVQA. The baseline models first utilize existing state-of-the-art single-mode audio and video quality assessment methods to predict the audio and video quality of ODVs, respectively, and then apply different multimodal fusion strategies to fuse the A/V predictions and obtain the overall quality of the ODVs. Thirdly, we compare and analyze the prediction performance of these models on our dataset and establish a new benchmark for future studies on OAVQA.

2 Related Work

2.1 Omnidirectional Video Quality Assessment Dataset

Table 1 provides an overview of several existing omnidirectional video quality assessment datasets. It can be observed that most of the existing ODV quality assessment datasets lack spatial audio information and mainly focus on visual distortions, while audio distortions have rarely been considered.

2.2 Quality Assessment Models

Omnidirectional Video Quality Assessment. As a common storage format of ODVs, the equirectangular projection (ERP) suffers from severe stretching near the poles. To address this problem, Yu et al. [31] proposed the spherical PSNR (S-PSNR), which computes the error over a set of uniformly distributed sampling points on the sphere, whose corresponding positions on the projection plane are obtained through the respective mapping formulas. Sun et al. proposed the Weighted-to-Spherically-uniform PSNR (WS-PSNR) [25], which operates directly in the original projection format and applies different stretching weights according to the mapping method. Anwar et al. [1] established an ODV quality assessment model using Bayesian inference and evaluated the impact of buffering on users' perceptual quality at different bitrates. Fan et al. [10] built an ODV dataset containing various distortions such as compression artifacts and quality switching, and then used machine learning methods to establish VQA models.
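To make the latitude weighting explicit, the following is a minimal sketch of WS-PSNR for a single ERP-format luma frame, assuming 8-bit values stored as NumPy arrays; it only illustrates the per-row cosine weighting and is not the reference implementation of [25].

```python
import numpy as np

def ws_psnr(ref: np.ndarray, dist: np.ndarray, max_val: float = 255.0) -> float:
    """Weighted-to-Spherically-uniform PSNR of one ERP-format luma frame."""
    assert ref.shape == dist.shape and ref.ndim == 2
    h, w = ref.shape
    # Latitude-dependent weights: rows near the poles are over-sampled by
    # the ERP mapping, so they receive smaller weights.
    weights = np.cos((np.arange(h) + 0.5 - h / 2) * np.pi / h)
    weights = np.tile(weights[:, None], (1, w))
    err = (ref.astype(np.float64) - dist.astype(np.float64)) ** 2
    wmse = np.sum(weights * err) / np.sum(weights)
    return 10.0 * np.log10(max_val ** 2 / wmse)
```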

Table 1. An overview of omnidirectional video quality assessment datasets. “Mute” means mute audio and “ambisonics” indicates spatial audio. SI and TI represent spatial information and temporal information, respectively. QP indicates quantization parameter and CRF means constant rate factor, which is used to control the video bitrate.

Omnidirectional Audio-Visual Quality Assessment. Although spatial audio is an important part of ODVs, its influence on perceptual quality has rarely been studied. Zhang et al. [33] presented a quality assessment methodology for audio-visual multimedia in a virtual reality environment: they built a panoramic audio-visual dataset and fed quality factors representing different distortions into a neural network. Fela et al. [14] utilized PSNR and its variants designed for ODVs, i.e., WS-PSNR, CPP-PSNR, and S-PSNR [25, 31, 32], as quality scores and studied perceptual audio-visual quality prediction based on the fusion of these scores [13]. Four machine learning models, including multiple linear regression, decision tree, random forest, and support vector machine (SVM), were tested.

3 Omnidirectional Audio-Visual Quality Assessment Dataset (OAVQAD)

3.1 Reference and Distorted Contents

We first captured 162 ODVs covering different scenes using a professional VR camera, the Insta360 Pro2. Then, we selected 15 high-quality ODVs from the collection as the reference videos in our OAVQAD and utilized FFmpeg to clip each of them to 6 s. Each ODV has a resolution of 8K (7680 \(\times \) 3840) in equirectangular projection (ERP) format with a frame rate of 29.97 fps. All ODVs contain first-order ambisonics (FOA) with a 48,000 Hz audio sampling rate and four audio channels. The audio and video formats are shown in Table 2. The ODV contents include a cappella chorus, shopping, guitar playing, restaurant ordering, etc. Figure 1 shows the ERP format previews of the 15 selected reference ODVs.

Fig. 1. ERP format previews of the 15 reference ODVs used in our OAVQAD.

Table 2. Omnidirectional audio and video format parameters.

We utilized advanced audio coding (AAC) provided by FFmpeg 4.4 as the audio encoding method and used constant bitrate (CBR) mode to set the audio bitrate to 64 Kbps, 32 Kbps, and 16 Kbps, respectively, thereby generating three levels of perceptually well-separated audio compression distortion. Then, we chose HEVC, provided by the FFmpeg libx265 encoder, as the video encoding method, and for each source video we applied three compression levels, i.e., 32, 37, and 42 in constant rate factor (CRF) mode. Besides, we also set the video resolution to three levels: 4K (3840\(\times \)1920), 2K (1920\(\times \)960), and 1K (1080\(\times \)540). Moreover, in order to adapt to a wider range of application scenarios, we further enriched the distortion types by adding three more [5, 7], namely noise, blur, and stalling, and generated distorted ODVs with various levels of these distortions. To summarize, we applied 25 distortion conditions to 15 reference ODVs, resulting in a total of 375 (15 \(\times \) 25) distorted ODVs.
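As an illustration, the following sketch shows how one compression condition could be generated with FFmpeg through Python; the file names are placeholders and the exact command lines used to build OAVQAD may differ.

```python
import subprocess

SRC = "reference_8k.mp4"  # placeholder path to one reference ODV

def encode(crf: int, audio_kbps: int, out_path: str) -> None:
    """Re-encode a reference ODV with HEVC in CRF mode and AAC in CBR mode."""
    cmd = [
        "ffmpeg", "-y", "-i", SRC,
        "-c:v", "libx265", "-crf", str(crf),      # video compression level
        "-c:a", "aac", "-b:a", f"{audio_kbps}k",  # audio bitrate level
        out_path,
    ]
    subprocess.run(cmd, check=True)

# e.g., the strongest video/audio compression pair described above
encode(crf=42, audio_kbps=16, out_path="dist_crf42_aac16k.mp4")
```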

3.2 Subjective Experiment Methodology

Experiment Apparatus. Since the subjective experiment needed to be conducted in an immersive VR environment, we used an HTC Vive Pro Eye as the HMD to present the ODVs and collect subjective quality ratings. The subjective experiment platform used to play the 8K ODVs and perform the scoring interaction was built based on Unity 1.1.0, as shown in Fig. 2.

Fig. 2. Demonstration of the subjective experiment interface based on the Unity platform.

Experiment Procedure. The subjective experiment was conducted in a subjective study room at a university. A total of 22 subjects (14 males and 8 females) were invited to participate. The subjects were between 20 and 28 years old (mean 22.62, variance 5.23) and were all graduate and undergraduate students. All subjects had normal or corrected-to-normal vision and hearing. In the experiment, subjects first received guidance on the use of the VR equipment, including the HMD and controllers. Then a training session was conducted to familiarize them with the user interface as well as the general range and types of distortions. In the testing session, subjects watched 390 ODVs and gave perceptual scores for the overall A/V quality. The order of the test videos was randomized for each subject to avoid bias.

3.3 Subjective Data Processing and Analysis

We followed the subjective data processing method recommended by ITU [2, 4] to perform outlier detection and subject rejection. None of the 22 subjects was identified as an outlier or rejected. We converted the raw scores into Z-scores, rescaled them to the range [0, 100], and averaged them across subjects to obtain the final mean opinion scores (MOSs), which are formulated as follows:

$$\begin{aligned} z_{i j} =\frac{r_{i j}-\mu _i}{\sigma _i}, \quad z_{i j}^{\prime }=\frac{100\left( z_{i j}+3\right) }{6}, \end{aligned}$$
(1)
$$\begin{aligned} \text {MOS}_j =\frac{1}{N} \sum _{i=1}^N z_{i j}^{\prime }, \end{aligned}$$
(2)

where \(r_{ij}\) is the original score of the i-th subject on the j-th sequence, \(\mu _i\) and \(\sigma _i\) are the mean and standard deviation of the ratings given by subject i, and N is the total number of subjects. Figure 3 shows the histogram of the MOS distribution over the entire database, indicating that the perceptual quality scores are widely distributed in the \(\left[ 0,100\right] \) interval, basically covering every score segment, and roughly following a normal distribution. This manifests that the perceptual quality distribution conforms to our expectations and that the distortion settings are reasonable.
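For clarity, a small sketch of Eqs. (1) and (2) is given below; `ratings` is a hypothetical subjects-by-sequences array of raw scores.

```python
import numpy as np

def compute_mos(ratings: np.ndarray) -> np.ndarray:
    """ratings: (N subjects, M sequences) raw scores -> MOS per sequence."""
    mu = ratings.mean(axis=1, keepdims=True)            # per-subject mean
    sigma = ratings.std(axis=1, ddof=1, keepdims=True)  # per-subject std
    z = (ratings - mu) / sigma                          # Eq. (1), z-score
    z = 100.0 * (z + 3.0) / 6.0                         # Eq. (1), rescale to [0, 100]
    return z.mean(axis=0)                               # Eq. (2), MOS
```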

Fig. 3. Histogram of MOS distribution in the database.

4 Objective Omnidirectional Audio-Visual Quality Assessment

4.1 Single-Mode Models

Many video and audio quality assessment methods have been proposed separately in previous studies. These algorithms, which only predict the quality of single-modal audio or video signals, are referred to as single-mode quality assessment methods. We first utilize existing state-of-the-art single-mode quality assessment methods to predict the omnidirectional video and audio quality, respectively. Since both the single-mode AQA and VQA prediction scores characterize one aspect of the distortion severity of a distorted video, it is reasonable to directly use the single-mode models to predict the overall audio-visual quality score of the ODVs.

The well-known single-mode assessment models adopted in this paper are introduced as follows:

  • Video: VMAF [19], SSIM [28], MS-SSIM [29], VIFP [23], FSIM [34], GMSD [30], WS-PSNR [25], CPP-PSNR [32], S-PSNR [31].

  • Audio: PEAQ [27], STOI [26], VISQOL [16], LLR [17], SNR [17], segSNR [15].

4.2 Weighted-Product Fusion

A single-mode audio or visual quality assessment metric can only characterize one quality aspect and thus cannot fully represent the overall subjective perceptual quality of an ODV. Therefore, it is important to use an appropriate multimodal fusion method to predict the A/V quality of ODVs. The simplest fusion method is to directly multiply the quality scores of a VQA model and an AQA model to obtain the overall quality score of an ODV.

However, in human audio-visual perception, the video and audio modalities often carry different importance, and people may pay more attention to visual quality. The weighted product can balance the influence of the two modalities by assigning a different weight to each of them, so it is a better choice for score fusion than direct multiplication. The weighted product can be formulated as

$$\begin{aligned} Q_{a v}=\hat{Q}_v^w \cdot \hat{Q}_a^{1-w}, \end{aligned}$$
(3)

where \(\hat{Q}_a\) and \(\hat{Q}_v\) are the normalized audio and video scores, and w and \(1-w\) represent the weights of the video and audio quality respectively, with \(0 \le w \le 1\). \(\hat{Q}_a\) and \(\hat{Q}_v\) are calculated by \(\hat{Q}_a=\frac{Q_a-Q_{a_{\min }}}{Q_{a_{\max }}-Q_{a_{\min }}}\) and \(\hat{Q}_v=\frac{Q_v-Q_{v_{\min }}}{Q_{v_{\max }}-Q_{v_{\min }}}\), where \(Q_{a_{\min }}\), \(Q_{a_{\max }}\), \(Q_{v_{\min }}\) and \(Q_{v_{\max }}\) bound \(Q_a\) and \(Q_v\) respectively. This normalization is necessary because the score ranges of the video and audio quality assessment models may differ, so the product can only be computed after the scores are appropriately scaled. The optimal weight depends on the single-mode A/V quality models used, and we vary w from 0 to 1 in 0.05 steps to find the optimal value.
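A minimal sketch of this fusion and of the grid search over w is given below; using SRCC against the MOS as the selection criterion is an illustrative assumption.

```python
import numpy as np
from scipy.stats import spearmanr

def minmax(x: np.ndarray) -> np.ndarray:
    """Min-max normalize a score vector into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def weighted_product_fusion(q_v, q_a, mos):
    """Grid-search the visual weight w of Eq. (3) in 0.05 steps."""
    qv_n, qa_n = minmax(q_v), minmax(q_a)
    best_w, best_srcc = 0.0, -1.0
    for w in np.arange(0.0, 1.0001, 0.05):
        q_av = (qv_n ** w) * (qa_n ** (1.0 - w))  # Eq. (3)
        srcc, _ = spearmanr(q_av, mos)
        if srcc > best_srcc:
            best_w, best_srcc = w, srcc
    return best_w, best_srcc
```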

4.3 Support Vector Regression Fusion

Since Support Vector Regression (SVR) is a commonly used machine learning algorithm for establishing nonlinear relationships between inputs and outputs, we also utilize the SVR method to integrate the quality prediction scores of single-mode models

$$\begin{aligned} Q_{a v}=\textit{SVR}(Q_v, Q_a), \end{aligned}$$
(4)

where \(Q_v\) and \(Q_a\) represent the quality prediction scores of video and audio, respectively, and \(Q_{av}\) denotes the fused A/V quality score. In this case, the SVR uses the single-mode quality scores predicted by traditional AQA and VQA algorithms as the inputs, and the subjective quality scores (i.e., MOSs) as the labels for training the regression function.

Table 3. Video and audio quality prediction algorithms and their corresponding feature types.

The performance of SVR fusion methods can be further improved by substituting scores with quality-aware feature vectors \(\textbf{f}_v\) and \(\textbf{f}_a\), which can be either hand-crafted features or extracted features from existing popular AQA and VQA models. In this way, we can better fuse video and audio quality prediction results by fully utilizing the quality features of audio and video, thereby improving the performance of the entire model. This feature-based fusion method can be expressed as:

$$\begin{aligned} Q_{a v}=\textit{SVR}(\textbf{f}_v, \textbf{f}_a). \end{aligned}$$
(5)

The video and audio quality-aware feature vectors used here are extracted from some existing AQA and VQA models, which are summarized in Table 3.
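The following is a minimal scikit-learn sketch of both SVR variants; the RBF kernel and hyperparameters (\(\gamma = 0.05\), C = 1024) follow the settings reported in Sect. 5.3, while the variable names and shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

def fit_score_fusion(q_v: np.ndarray, q_a: np.ndarray, mos: np.ndarray) -> SVR:
    """Eq. (4): regress MOS from the two single-mode quality scores."""
    X = np.stack([q_v, q_a], axis=1)            # shape (n_videos, 2)
    return SVR(kernel="rbf", gamma=0.05, C=1024).fit(X, mos)

def fit_feature_fusion(f_v: np.ndarray, f_a: np.ndarray, mos: np.ndarray) -> SVR:
    """Eq. (5): regress MOS from concatenated quality-aware feature vectors."""
    X = np.concatenate([f_v, f_a], axis=1)      # shape (n_videos, d_v + d_a)
    return SVR(kernel="rbf", gamma=0.05, C=1024).fit(X, mos)
```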

5 Experiment Validation

5.1 Evaluation of Single-Mode Models

We test different single-mode quality assessment models (6 audio models and 9 video models) on our omnidirectional AVQA dataset to analyze the effectiveness of single-mode quality models. Experimental results are illustrated in Fig. 4.

For AQA models, STOI, VISQOL, SNR, and segSNR yield relatively good performance on our database, among which STOI achieves both the highest SRCC and the highest PLCC. Most VQA models show similar performance, and none of them can predict the A/V quality effectively, with SRCC and PLCC below 0.6. The above analysis shows that most single-mode quality assessment models perform poorly on our OAVQAD, indicating the necessity of fusing single-mode quality prediction results for more accurate OAVQA.
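For reference, the correlation criteria can be computed as in the short sketch below; `pred` and `mos` are hypothetical arrays of objective predictions and subjective MOSs, and the nonlinear logistic mapping that is often applied before computing PLCC is omitted here.

```python
from scipy.stats import pearsonr, spearmanr

def evaluate(pred, mos):
    """Rank-order (SRCC) and linear (PLCC) correlation with the MOS."""
    srcc, _ = spearmanr(pred, mos)
    plcc, _ = pearsonr(pred, mos)
    return srcc, plcc
```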

Fig. 4. Performances of single-mode models on overall audio-visual quality prediction.

5.2 Evaluation of Weighted-Product Fusion

For the weighted-product fusion methods, we randomly divide the dataset into an 80% training set and a 20% test set. All distorted ODVs derived from the same reference ODV are placed in the same set to ensure that the video contents of the two sets are completely separated.
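A small sketch of such a content-separated split is shown below, assuming each distorted ODV is tagged with the index of its reference content; the use of scikit-learn's GroupShuffleSplit is an illustrative choice, not necessarily the procedure used in our experiments.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

content_ids = np.repeat(np.arange(15), 25)   # 15 contents x 25 distortion conditions
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(np.zeros(375), groups=content_ids))
# All 25 distorted versions of one reference ODV fall on the same side.
```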

In the weighted-product fusion, a total of 54 (9 video models \(\times \) 6 audio models) weighted-product quality fusion models are generated. In order to normalize the prediction scores of the single-mode quality prediction models, the following normalization functions are used: \( Q_{\textit{VMAF}}^{'}=\frac{1}{100} Q_{\textit{VMAF}}\), \(Q_{\textit{WS-PSNR}}^{'}=\frac{1}{29}(Q_{\textit{WS-PSNR}}-23)\), \(Q_{\textit{S-PSNR}}^{'}=\frac{1}{29} (Q_{\textit{S-PSNR}}-23)\), \(Q_{\textit{CPP-PSNR}}^{'}=\frac{1}{29}(Q_{\textit{CPP-PSNR}}-23)\), \(Q_{\textit{GMSD}}^{'}=1-\frac{1}{0.26} Q_{\textit{GMSD}}\), \(Q_{\textit{PEAQ}}^{'}=1+\frac{1}{3.5} (Q_{\textit{PEAQ}}-0.21)\), \(Q_{\textit{LLR}}^{'}=1-\frac{1}{1.2-0.7} (|Q_{\textit{LLR}}|-0.7)\), \(Q_{\textit{SNR}}^{'}=\frac{1}{20} Q_{\textit{SNR}}\), \(Q_{\textit{segSNR}}^{'}=\frac{1}{35+2} (Q_{\textit{segSNR}}+2)\). The prediction scores of the other models are already bounded in \(\left[ 0,1\right] \), so no further normalization is needed.
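These rescalings can be written compactly as in the following sketch, which is a direct transcription of the formulas above.

```python
# Linear rescalings of the raw single-mode scores into roughly [0, 1].
normalize = {
    "VMAF":     lambda q: q / 100.0,
    "WS-PSNR":  lambda q: (q - 23.0) / 29.0,
    "S-PSNR":   lambda q: (q - 23.0) / 29.0,
    "CPP-PSNR": lambda q: (q - 23.0) / 29.0,
    "GMSD":     lambda q: 1.0 - q / 0.26,
    "PEAQ":     lambda q: 1.0 + (q - 0.21) / 3.5,
    "LLR":      lambda q: 1.0 - (abs(q) - 0.7) / (1.2 - 0.7),
    "SNR":      lambda q: q / 20.0,
    "segSNR":   lambda q: (q + 2.0) / (35.0 + 2.0),
}
```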

Table 4. Performances of weighted-product fusion-based A/V quality models. The top 3 models are in bold.
Table 5. Performances of SVR fusion-based A/V quality models. The top 3 models in terms of each metric are in bold.

Table 4 shows the performance of the weighted-product fusion models. Among these methods, the models fusing the VQA algorithms VMAF, MS-SSIM, and GMSD with the AQA algorithms STOI, VISQOL, and SNR show relatively better performance. The model combining GMSD and STOI achieves the best performance in terms of SRCC. In addition, with the same AQA component, fusion models using different VQA components perform similarly, which manifests that the AQA component has a larger impact on the performance of the fusion models. Moreover, the mean optimal weight for the visual modality over the 54 weighted-product models is 0.7231, suggesting that the visual modality has a greater impact on QoE than the audio modality.

5.3 Evaluation of SVR Fusion

SVR fusion includes two methods: score-based fusion and feature-based fusion. A total of 108 (9 video models \(\times \) 6 audio models \(\times \) 2 SVR conditions) models are tested, and the normalization step is no longer required. In the SVR fusion models, the radial basis function (RBF) is selected as the kernel function, the kernel parameter \(\gamma \) is 0.05, and the penalty factor C is 1024. Table 5 shows the performance of the SVR fusion models.

It can be observed that the quality score-based SVR fusion models achieve performance similar to the weighted-product fusion models, while the quality feature-based SVR fusion models achieve much better performance than the other two methods. The models combining the AQA components PEAQ, STOI, and VISQOL with the VQA components VIFP and GMSD perform relatively better.

Figure 5 demonstrates the performance improvement obtained by each single-mode AQA and VQA model, which further confirms the above observation. The performance improvement of each single-mode model is calculated by averaging the SRCC improvements over all combinations of this model with the models from the other perceptual mode. It can be observed that only the VISQOL and VIFP models gain performance improvements when weighted-product fusion is replaced by score-based SVR fusion, suggesting that weighted-product fusion is generally the more effective choice at the score level. Furthermore, Fig. 5 also illustrates that it is more efficient to decompose the single-mode VQA and AQA scores into ODVs' quality features. The feature-based regression models achieve different degrees of performance improvement for different VQA and AQA combinations, among which PEAQ achieves a significant improvement of nearly 50%. Some models, e.g., STOI, LLR, SNR, and segSNR, gain only a small improvement from feature extraction; we speculate that these algorithms are not easy to decompose into quality-aware features.

Fig. 5. Performance improvements in terms of SRCC introduced by replacing weighted-product fusion with quality score-based SVR fusion, and decomposing quality models into features during SVR fusion.

6 Conclusion

In this work, we construct an informative omnidirectional audio-visual quality assessment dataset, which involves 390 omnidirectional videos with ambisonics and the corresponding perceptual scores collected from 22 participants in an immersive environment. Based on our dataset, we design three types of baseline AVQA models which combine AQA and VQA models via two multimodal fusion methods to predict the quality scores of ODVs. Moreover, quantitative analyses of the performance of these models are conducted to evaluate the predictive effectiveness of different objective models. The experimental results on our dataset show that SVR fusion based on quality-aware features achieves the best performance. Our dataset, objective baseline methods, and established benchmark can greatly facilitate further research on dataset design and algorithm improvement for OAVQA.