
1 Introduction

Virtual Reality (VR) has attracted substantial attention from industry and research communities due to its ability to provide users with a stereoscopic and immersive experience through Head-Mounted Displays (HMDs) [6, 8]. Omnidirectional Videos (ODVs), a.k.a. \(360^{\circ }\) videos, panoramic videos, or spherical videos, have emerged as a significant form of VR content. By wearing a VR HMD and adjusting their head orientation, users can explore the audio-visual content in any direction. This immersive experience of simulating real-world scenes has contributed to the popularity of ODVs in various application fields, including medicine, education, advertising, and tourism.

Compared to traditional videos, ultra-high-definition ODVs contain more scene information and multi-channel audio, which greatly increases the data volume. Due to this huge amount of data, playback stalling and quality switching caused by network delays and fluctuations often occur during transmission, which degrades the quality of ODVs and further affects their QoE. Moreover, ODVs may also suffer from distortions introduced during capturing or displaying, which further decreases the QoE. Therefore, to provide users with a smooth viewing experience, it is important to monitor the quality of ODVs during capturing, encoding/decoding, transmission, etc., and to perform optimization accordingly.

In the past few decades, many objective quality assessment methods have been proposed for traditional planar videos [19, 23], and some recent works have also explored audio-visual video quality assessment [21]. Recently, with the popularity of VR, many studies have addressed omnidirectional image quality assessment [3, 24] and omnidirectional video quality assessment [13]. However, most omnidirectional video quality assessment research focuses only on the single-mode visual signal; few works have investigated the multimodal quality assessment of ODVs incorporating audio information. As an important part of ODVs, spatial audio may strongly influence perceptual quality, so it is necessary to conduct in-depth research on the audio-visual quality assessment of omnidirectional videos.

In this paper, we make three contributions to the omnidirectional audio-visual quality assessment (OAVQA) field. Firstly, we construct a large-scale omnidirectional audio-visual quality assessment dataset to address the scarcity of such datasets. We collected 15 high-quality reference omnidirectional audio-visual (A/V) sequences and generated 375 distorted ODVs degraded from them. Subsequently, 22 subjects were recruited to participate in the subjective quality assessment experiment, and the audio-visual quality ratings of the reference and distorted videos were collected. Secondly, we design three baseline methods for full-reference OAVQA. The baseline models first utilize existing state-of-the-art single-mode audio and video quality assessment methods to predict the audio and video quality of ODVs, respectively, and then apply different multimodal fusion strategies to fuse the A/V predictions and obtain the overall quality of the ODVs. Thirdly, we compare and analyze the prediction performance of these models on our dataset and establish a new benchmark for future studies on OAVQA.

2 Related Work

2.1 Omnidirectional Video Quality Assessment Dataset

Table 1 provides an overview of several existing omnidirectional video quality assessment datasets. It can be observed that most of the existing ODV quality assessment datasets lack spatial audio information and mainly focus on visual distortions, while audio distortions have rarely been considered.

2.2 Quality Assessment Models

Omnidirectional Video Quality Assessment. As a common storage format of ODVs, the equirectangular projection (ERP) suffers from severe stretching near the poles. To address this problem, Yu et al. [31] proposed the spherical PSNR (S-PSNR), which computes the error over a set of uniformly distributed sampling points on the sphere, whose corresponding positions on the projection plane are obtained through the respective mapping formulas. Sun et al. proposed the Weighted-to-Spherically-uniform PSNR (WS-PSNR) [25], which operates directly in the original projection format and applies different stretching weights according to the mapping method. Anwar et al. [1] established an ODV quality assessment model using Bayesian inference and evaluated the impact of buffering on users' perceptual quality at different bitrates. Fan et al. [10] built an ODV dataset containing various distortions such as compression artifacts and quality switching, and then used machine learning methods to establish VQA models.
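To make the latitude weighting explicit, the following is a minimal sketch of WS-PSNR for a single ERP-format luma frame, assuming 8-bit values stored as NumPy arrays; it only illustrates the per-row cosine weighting and is not the reference implementation of [25].

```python
import numpy as np

def ws_psnr(ref: np.ndarray, dist: np.ndarray, max_val: float = 255.0) -> float:
    """Weighted-to-Spherically-uniform PSNR of one ERP-format luma frame."""
    assert ref.shape == dist.shape and ref.ndim == 2
    h, w = ref.shape
    # Latitude-dependent weights: rows near the poles are over-sampled by
    # the ERP mapping, so they receive smaller weights.
    weights = np.cos((np.arange(h) + 0.5 - h / 2) * np.pi / h)
    weights = np.tile(weights[:, None], (1, w))
    err = (ref.astype(np.float64) - dist.astype(np.float64)) ** 2
    wmse = np.sum(weights * err) / np.sum(weights)
    return 10.0 * np.log10(max_val ** 2 / wmse)
```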

Table 1. An overview of omnidirectional video quality assessment datasets. “Mute” means mute audio and “ambisonics” indicates spatial audio. SI and TI represent spatial information and temporal information, respectively. QP indicates quantization parameter and CRF means constant rate factor, which is used to control the video bitrate.

Omnidirectional Audio-Visual Quality Assessment. Although spatial audio is an important part of ODVs, its influence on perceptual quality has rarely been studied. Zhang et al. [33] presented a quality assessment methodology for audio-visual multimedia in a virtual reality environment: they built a panoramic audio-visual dataset and fed quality factors representing different distortions into a neural network. Fela et al. [14] utilized PSNR and its variants designed for ODVs, i.e., WS-PSNR, CPP-PSNR, and S-PSNR [25, 31, 32], as quality scores and studied perceptual audio-visual quality prediction based on the fusion of these scores [13]. Four machine learning models, including multiple linear regression, decision tree, random forest, and support vector machine (SVM), were tested.

3 Omnidirectional Audio-Visual Quality Assessment Dataset (OAVQAD)

3.1 Reference and Distorted Contents

We first captured 162 ODVs covering different scenes using a professional VR camera, the Insta360 Pro2. Then, we selected 15 high-quality ODVs from the collection as the reference videos in our OAVQAD and utilized FFmpeg to clip each of them to 6 s. Each ODV has a resolution of 8K (7680 \(\times \) 3840) in equirectangular projection (ERP) format with a frame rate of 29.97 fps. All ODVs contain first-order ambisonics (FOA) with a 48,000 Hz audio sampling rate and four audio channels. The audio and video formats are shown in Table 2. The ODV contents include a cappella chorus, shopping, guitar playing, restaurant ordering, etc. Figure 1 shows the ERP format previews of the 15 selected reference ODVs.

Fig. 1. ERP format previews of the 15 reference ODVs used in our OAVQAD.

Table 2. Omnidirectional audio and video format parameters.

We utilized advanced audio coding (AAC) provided by FFmpeg 4.4 as the audio encoding method and used constant bitrate (CBR) mode to set the audio bitrate to 64 Kbps, 32 Kbps, and 16 Kbps, respectively, thereby generating three levels of perceptually well-separated audio compression distortion. Then, we chose HEVC, provided by the FFmpeg libx265 encoder, as the video encoding method, and for each source video we applied three compression levels, i.e., 32, 37, and 42 in constant rate factor (CRF) mode. Besides, we also set the video resolution to three levels: 4K (3840\(\times \)1920), 2K (1920\(\times \)960), and 1K (1080\(\times \)540). Moreover, in order to adapt to a wider range of application scenarios, we further enriched the distortion types by adding three more [5, 7], namely noise, blur, and stalling, and generated distorted ODVs with various levels of these distortions. To summarize, we applied 25 distortion conditions to 15 reference ODVs, resulting in a total of 375 (15 \(\times \) 25) distorted ODVs.
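As an illustration, the following sketch shows how one compression condition could be generated with FFmpeg through Python; the file names are placeholders and the exact command lines used to build OAVQAD may differ.

```python
import subprocess

SRC = "reference_8k.mp4"  # placeholder path to one reference ODV

def encode(crf: int, audio_kbps: int, out_path: str) -> None:
    """Re-encode a reference ODV with HEVC in CRF mode and AAC in CBR mode."""
    cmd = [
        "ffmpeg", "-y", "-i", SRC,
        "-c:v", "libx265", "-crf", str(crf),      # video compression level
        "-c:a", "aac", "-b:a", f"{audio_kbps}k",  # audio bitrate level
        out_path,
    ]
    subprocess.run(cmd, check=True)

# e.g., the strongest video/audio compression pair described above
encode(crf=42, audio_kbps=16, out_path="dist_crf42_aac16k.mp4")
```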

3.2 Subjective Experiment Methodology

Experiment Apparatus. Since the subjective experiment needed to be conducted in an immersive VR environment, we used an HTC Vive Pro Eye as the HMD to present the ODVs and collect subjective quality ratings. The subjective experiment platform used to play the 8K ODVs and perform the scoring interaction was built based on Unity 1.1.0, as shown in Fig. 2.

Fig. 2. Demonstration of the subjective experiment interface based on the Unity platform.

Experiment Procedure. The subjective experiment was conducted in a subjective study room at a university. A total of 22 subjects (14 males and 8 females) were invited to participate. The subjects were between 20 and 28 years old (mean 22.62, variance 5.23) and were all graduate and undergraduate students. All subjects had normal or corrected-to-normal vision and hearing. In the experiment, subjects first received guidance on the use of the VR equipment, including the HMD and controllers. Then a training session was conducted to familiarize them with the user interface as well as the general range and types of distortions. In the testing session, subjects watched 390 ODVs and gave perceptual scores for the overall A/V quality. The order of the test videos was randomized for each subject to avoid bias.

3.3 Subjective Data Processing and Analysis

We followed the subjective data processing method recommended by ITU [2, 4] to perform outlier detection and subject rejection. None of the 22 subjects was identified as an outlier or rejected. We converted the raw scores into Z-scores, rescaled them to the range [0, 100], and averaged them across subjects to obtain the final mean opinion scores (MOSs), which are formulated as follows:

$$\begin{aligned} z_{i j} =\frac{r_{i j}-\mu _i}{\sigma _i}, \quad z_{i j}^{\prime }=\frac{100\left( z_{i j}+3\right) }{6}, \end{aligned}$$
(1)
$$\begin{aligned} \text {MOS}_j =\frac{1}{N} \sum _{i=1}^N z_{i j}^{\prime }, \end{aligned}$$
(2)

where \(r_{ij}\) is the original score of the i-th subject on the j-th sequence, \(\mu _i\) and \(\sigma _i\) are the mean and standard deviation of the ratings given by subject i, and N is the total number of subjects. Figure 3 shows the histogram of the MOS distribution over the entire database, indicating that the perceptual quality scores are widely distributed in the \(\left[ 0,100\right] \) interval, basically covering every score segment, and roughly following a normal distribution. This manifests that the perceptual quality distribution conforms to our expectations and that the distortion settings are reasonable.
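For clarity, a small sketch of Eqs. (1) and (2) is given below; `ratings` is a hypothetical subjects-by-sequences array of raw scores.

```python
import numpy as np

def compute_mos(ratings: np.ndarray) -> np.ndarray:
    """ratings: (N subjects, M sequences) raw scores -> MOS per sequence."""
    mu = ratings.mean(axis=1, keepdims=True)            # per-subject mean
    sigma = ratings.std(axis=1, ddof=1, keepdims=True)  # per-subject std
    z = (ratings - mu) / sigma                          # Eq. (1), z-score
    z = 100.0 * (z + 3.0) / 6.0                         # Eq. (1), rescale to [0, 100]
    return z.mean(axis=0)                               # Eq. (2), MOS
```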

Fig. 3. Histogram of MOS distribution in the database.

4 Objective Omnidirectional Audio-Visual Quality Assessment

4.1 Single-Mode Models

Many video and audio quality assessment methods have been proposed separately in previous studies. These algorithms, which only predict the quality of single-modal audio or video signals, are referred to as single-mode quality assessment methods. We first utilize existing state-of-the-art single-mode quality assessment methods to predict the omnidirectional video and audio quality, respectively. Since both the single-mode AQA and VQA prediction scores characterize one aspect of the distortion severity of a distorted video, it is reasonable to directly use the single-mode models to predict the overall audio-visual quality score of the ODVs.

The well-known single-mode assessment models adopted in this paper are introduced as follows:

  • Video: VMAF [19], SSIM [28], MS-SSIM [29], VIFP [23], FSIM [34], GMSD [30], WS-PSNR [25], CPP-PSNR [32], S-PSNR [31].

  • Audio: PEAQ [27], STOI [26], VISQOL [16], LLR [17], SNR [17], segSNR [15].

4.2 Weighted-Product Fusion

A single-mode audio or visual quality assessment metric can only characterize one quality aspect and thus cannot fully represent the overall subjective perceptual quality of an ODV. Therefore, it is important to use an appropriate multimodal fusion method to predict the A/V quality of ODVs. The simplest fusion method is to directly multiply the quality scores of a VQA model and an AQA model to obtain the overall quality score of an ODV.

However, in human audio-visual perception, the video and audio modalities often carry different importance, and people may pay more attention to visual quality. The weighted product can balance the influence of the two modalities by assigning a different weight to each of them, so it is a better choice for score fusion than direct multiplication. The weighted product can be formulated as

$$\begin{aligned} Q_{a v}=\hat{Q}_v^w \cdot \hat{Q}_a^{1-w}, \end{aligned}$$
(3)

where \(\hat{Q}_a\) and \(\hat{Q}_v\) are the normalized audio and video scores, and w and \(1-w\) represent the weights of the video and audio quality respectively, with \(0 \le w \le 1\). \(\hat{Q}_a\) and \(\hat{Q}_v\) are calculated by \(\hat{Q}_a=\frac{Q_a-Q_{a_{\min }}}{Q_{a_{\max }}-Q_{a_{\min }}}\) and \(\hat{Q}_v=\frac{Q_v-Q_{v_{\min }}}{Q_{v_{\max }}-Q_{v_{\min }}}\), where \(Q_{a_{\min }}\), \(Q_{a_{\max }}\), \(Q_{v_{\min }}\) and \(Q_{v_{\max }}\) bound \(Q_a\) and \(Q_v\) respectively. This normalization is necessary because the score ranges of the video and audio quality assessment models may differ, so the product can only be computed after the scores are appropriately scaled. The optimal weight depends on the single-mode A/V quality models used, and we vary w from 0 to 1 in 0.05 steps to find the optimal value.
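A minimal sketch of this fusion and of the grid search over w is given below; using SRCC against the MOS as the selection criterion is an illustrative assumption.

```python
import numpy as np
from scipy.stats import spearmanr

def minmax(x: np.ndarray) -> np.ndarray:
    """Min-max normalize a score vector into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def weighted_product_fusion(q_v, q_a, mos):
    """Grid-search the visual weight w of Eq. (3) in 0.05 steps."""
    qv_n, qa_n = minmax(q_v), minmax(q_a)
    best_w, best_srcc = 0.0, -1.0
    for w in np.arange(0.0, 1.0001, 0.05):
        q_av = (qv_n ** w) * (qa_n ** (1.0 - w))  # Eq. (3)
        srcc, _ = spearmanr(q_av, mos)
        if srcc > best_srcc:
            best_w, best_srcc = w, srcc
    return best_w, best_srcc
```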

4.3 Support Vector Regression Fusion

Since Support Vector Regression (SVR) is a commonly used machine learning algorithm for establishing nonlinear relationships between inputs and outputs, we also utilize the SVR method to integrate the quality prediction scores of single-mode models

$$\begin{aligned} Q_{a v}=\textit{SVR}(Q_v, Q_a), \end{aligned}$$
(4)

where \(Q_v\) and \(Q_a\) represent the quality prediction scores of video and audio, respectively, and \(Q_{av}\) denotes the fused A/V quality score. In this case, the SVR uses the single-mode quality scores predicted by traditional AQA and VQA algorithms as the inputs, and the subjective quality scores (i.e., MOSs) as the labels for training the regression function.

Table 3. Video and audio quality prediction algorithms and their corresponding feature types.

The performance of SVR fusion methods can be further improved by substituting scores with quality-aware feature vectors \(\textbf{f}_v\) and \(\textbf{f}_a\), which can be either hand-crafted features or extracted features from existing popular AQA and VQA models. In this way, we can better fuse video and audio quality prediction results by fully utilizing the quality features of audio and video, thereby improving the performance of the entire model. This feature-based fusion method can be expressed as:

$$\begin{aligned} Q_{a v}=\textit{SVR}(\textbf{f}_v, \textbf{f}_a). \end{aligned}$$
(5)

The video and audio quality-aware feature vectors used here are extracted from some existing AQA and VQA models, which are summarized in Table 3.
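The following is a minimal scikit-learn sketch of both SVR variants; the RBF kernel and hyperparameters (\(\gamma = 0.05\), C = 1024) follow the settings reported in Sect. 5.3, while the variable names and shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

def fit_score_fusion(q_v: np.ndarray, q_a: np.ndarray, mos: np.ndarray) -> SVR:
    """Eq. (4): regress MOS from the two single-mode quality scores."""
    X = np.stack([q_v, q_a], axis=1)            # shape (n_videos, 2)
    return SVR(kernel="rbf", gamma=0.05, C=1024).fit(X, mos)

def fit_feature_fusion(f_v: np.ndarray, f_a: np.ndarray, mos: np.ndarray) -> SVR:
    """Eq. (5): regress MOS from concatenated quality-aware feature vectors."""
    X = np.concatenate([f_v, f_a], axis=1)      # shape (n_videos, d_v + d_a)
    return SVR(kernel="rbf", gamma=0.05, C=1024).fit(X, mos)
```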

5 Experiment Validation

5.1 Evaluation of Single-Mode Models

We test different single-mode quality assessment models (6 audio models and 9 video models) on our omnidirectional AVQA dataset to analyze the effectiveness of single-mode quality models. Experimental results are illustrated in Fig. 4.

For AQA models, STOI, VISQOL, SNR, and segSNR yield relatively good performance on our database, among which STOI achieves both the highest SRCC and the highest PLCC. Most VQA models show similar performance, and none of them can predict the A/V quality effectively, with SRCC and PLCC below 0.6. The above analysis shows that most single-mode quality assessment models perform poorly on our OAVQAD, indicating the necessity of fusing single-mode quality prediction results for more accurate OAVQA.
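For reference, the correlation criteria can be computed as in the short sketch below; `pred` and `mos` are hypothetical arrays of objective predictions and subjective MOSs, and the nonlinear logistic mapping that is often applied before computing PLCC is omitted here.

```python
from scipy.stats import pearsonr, spearmanr

def evaluate(pred, mos):
    """Rank-order (SRCC) and linear (PLCC) correlation with the MOS."""
    srcc, _ = spearmanr(pred, mos)
    plcc, _ = pearsonr(pred, mos)
    return srcc, plcc
```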

Fig. 4. Performances of single-mode models on overall audio-visual quality prediction.

5.2 Evaluation of Weighted-Product Fusion

For the weighted-product fusion methods, we randomly divide the dataset into an 80% training set and a 20% test set. All distorted ODVs derived from the same reference ODV are placed in the same set to ensure that the video contents of the two sets are completely separated.
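A small sketch of such a content-separated split is shown below, assuming each distorted ODV is tagged with the index of its reference content; the use of scikit-learn's GroupShuffleSplit is an illustrative choice, not necessarily the procedure used in our experiments.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

content_ids = np.repeat(np.arange(15), 25)   # 15 contents x 25 distortion conditions
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(np.zeros(375), groups=content_ids))
# All 25 distorted versions of one reference ODV fall on the same side.
```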

In the weighted-product fusion, a total of 54 (9 video models \(\times \) 6 audio models) weighted-product quality fusion models are generated. In order to normalize the prediction scores of the single-mode quality prediction models, the following normalization functions are used: \( Q_{\textit{VMAF}}^{'}=\frac{1}{100} Q_{\textit{VMAF}}\), \(Q_{\textit{WS-PSNR}}^{'}=\frac{1}{29}(Q_{\textit{WS-PSNR}}-23)\), \(Q_{\textit{S-PSNR}}^{'}=\frac{1}{29} (Q_{\textit{S-PSNR}}-23)\), \(Q_{\textit{CPP-PSNR}}^{'}=\frac{1}{29}(Q_{\textit{CPP-PSNR}}-23)\), \(Q_{\textit{GMSD}}^{'}=1-\frac{1}{0.26} Q_{\textit{GMSD}}\), \(Q_{\textit{PEAQ}}^{'}=1+\frac{1}{3.5} (Q_{\textit{PEAQ}}-0.21)\), \(Q_{\textit{LLR}}^{'}=1-\frac{1}{1.2-0.7} (|Q_{\textit{LLR}}|-0.7)\), \(Q_{\textit{SNR}}^{'}=\frac{1}{20} Q_{\textit{SNR}}\), \(Q_{\textit{segSNR}}^{'}=\frac{1}{35+2} (Q_{\textit{segSNR}}+2)\). The prediction scores of the other models are already bounded in \(\left[ 0,1\right] \), so no further normalization is needed.
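These rescalings can be written compactly as in the following sketch, which is a direct transcription of the formulas above.

```python
# Linear rescalings of the raw single-mode scores into roughly [0, 1].
normalize = {
    "VMAF":     lambda q: q / 100.0,
    "WS-PSNR":  lambda q: (q - 23.0) / 29.0,
    "S-PSNR":   lambda q: (q - 23.0) / 29.0,
    "CPP-PSNR": lambda q: (q - 23.0) / 29.0,
    "GMSD":     lambda q: 1.0 - q / 0.26,
    "PEAQ":     lambda q: 1.0 + (q - 0.21) / 3.5,
    "LLR":      lambda q: 1.0 - (abs(q) - 0.7) / (1.2 - 0.7),
    "SNR":      lambda q: q / 20.0,
    "segSNR":   lambda q: (q + 2.0) / (35.0 + 2.0),
}
```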

Table 4. Performances of weighted-product fusion-based A/V quality models. The top 3 models are in bold.
Table 5. Performances of SVR fusion-based A/V quality models. The top 3 models in terms of each metric are in bold.

Table 4 shows the performance of the weighted-product fusion models. Among these methods, the models fusing the VQA algorithms VMAF, MS-SSIM, and GMSD with the AQA algorithms STOI, VISQOL, and SNR show relatively better performance. The model combining GMSD and STOI achieves the best performance in terms of SRCC. In addition, with the same AQA component, fusion models using different VQA components perform similarly, which manifests that the AQA component has a larger impact on the performance of the fusion models. Moreover, the mean optimal weight for the visual modality over the 54 weighted-product models is 0.7231, suggesting that the visual modality has a greater impact on QoE than the audio modality.

5.3 Evaluation of SVR Fusion

SVR fusion includes two methods: score-based fusion and feature-based fusion. A total of 108 (9 video models \(\times \) 6 audio models \(\times \) 2 SVR conditions) models are tested, and the normalization step is no longer required. In the SVR fusion models, the radial basis function (RBF) is selected as the kernel function, the kernel parameter \(\gamma \) is 0.05, and the penalty factor C is 1024. Table 5 shows the performance of the SVR fusion models.

It can be observed that the quality score-based SVR fusion models achieve performance similar to the weighted-product fusion models, while the quality feature-based SVR fusion models achieve much better performance than the other two methods. The models combining the AQA components PEAQ, STOI, and VISQOL with the VQA components VIFP and GMSD perform relatively better.

Figure 5 demonstrates the performance improvement obtained by each single-mode AQA and VQA model, which further confirms the above observation. The performance improvement of each single-mode model is calculated by averaging the SRCC improvements over all combinations of this model with the models from the other perceptual mode. It can be observed that only the VISQOL and VIFP models gain performance improvements when weighted-product fusion is replaced by score-based SVR fusion, suggesting that weighted-product fusion is generally the more effective choice at the score level. Furthermore, Fig. 5 also illustrates that it is more efficient to decompose the single-mode VQA and AQA scores into ODVs' quality features. The feature-based regression models achieve different degrees of performance improvement for different VQA and AQA combinations, among which PEAQ achieves a significant improvement of nearly 50%. Some models, e.g., STOI, LLR, SNR, and segSNR, gain only a small improvement from feature extraction; we speculate that these algorithms are not easy to decompose into quality-aware features.

Fig. 5. Performance improvements in terms of SRCC introduced by replacing weighted-product fusion with quality score-based SVR fusion, and decomposing quality models into features during SVR fusion.

6 Conclusion

In this work, we construct an informative omnidirectional audio-visual quality assessment dataset, which involves 390 omnidirectional videos with ambisonics and the corresponding perceptual scores collected from 22 participants in an immersive environment. Based on our dataset, we design three types of baseline AVQA models which combine AQA and VQA models via two multimodal fusion methods to predict the quality scores of ODVs. Moreover, quantitative analyses of the performance of these models are conducted to evaluate the predictive effectiveness of different objective models. The experimental results on our dataset show that SVR fusion based on quality-aware features achieves the best performance. Our dataset, objective baseline methods, and established benchmark can greatly facilitate further research on dataset design and algorithm improvement for OAVQA.