1 Introduction

With the rapid development of Internet technology, media of all kinds fills people’s daily lives, and images and videos are shared and transmitted more frequently than ever, accompanied by a growing demand for high quality. During compression and uploading, images and videos captured by non-professional devices are prone to damage and quality degradation, such as packet loss, blur, and Gaussian noise. Video quality assessment (VQA) therefore has wide applications in many fields, including image compression, video coding, and video surveillance.

As network bandwidth and display technology continue to advance, \(360^{\circ }\) videos have emerged as a novel and increasingly popular media format. The biggest differences between \(360^{\circ }\) video and traditional 2D video lie in stitching and projection. Stitching merges the images shot by cameras at different angles onto a \(360^{\circ }\) sphere, while projection maps the video from the sphere onto a plane to meet the needs of encoding, decoding, and transmission. Because different cameras are involved, the stitching process often introduces uneven lighting, ghosting, and other distortions. During projection, pixels are stretched and deformed when the spherical content is mapped to the plane, resulting in loss of information and distortion of the image [1].

Owing to their omnidirectional content and the projection process, \(360^{\circ }\) videos typically carry a huge amount of information, and the deformation varies with image position. Conventional feature extraction methods therefore cannot adapt to the distortion and deformation caused by projection, and the large computational load and complexity introduced by stitching make ordinary video quality assessment algorithms hard to apply. Effectively quantifying the quality degradation of \(360^{\circ }\) video and evaluating its quality is thus of great significance and practical value for \(360^{\circ }\) video processing research.

To address these problems, this paper proposes a quality assessment framework based on saliency-guided viewport extraction for \(360^{\circ }\) video, in which a saliency-based algorithm is designed to extract the most representative viewport. Combined with a state-of-the-art saliency prediction algorithm [2], excellent results are achieved on the \(360^{\circ }\) video quality assessment task. The main contributions can be summarized as follows:

  • To simulate how human eyes perceive the quality of regions of interest when watching \(360^{\circ }\) videos, panoramic convolution is adopted to predict the saliency of video frames, and an algorithm is designed to extract the viewport content that optimally represents the video for quality assessment.

  • We design a quality assessment network based on the attention mechanism. The network mainly uses a three-dimensional convolutional neural network (3D CNN) to learn the spatiotemporal features of the video. Spatial and channel attention modules are designed to improve the network’s ability to learn informative features while keeping it lightweight.

  • The proposed attention-based quality assessment network performs well not only on \(360^{\circ }\) video quality assessment tasks but also on 2D video quality assessment tasks.

2 Related works

2.1 VQA for 2D video

For 2D video, some quality assessment algorithms based on traditional hand-crafted features have achieved excellent results on user-generated content. Considering the temporal effects of the human visual system, some image quality assessment (IQA) methods can be extended into VQA methods [3,4,5,6,7]. These methods employ different strategies to capture distortions and variations in dynamic video content. Some utilize motion compensation [8], while others address the perceptual effects of motion artifacts and quantify distortions through optical flow statistics [9]. AFViQ [10] leverages perceptual visual mechanisms by introducing an enhanced foveal imaging model that generates the perceived video representation. He et al. [11] utilize the 3D discrete cosine transform (DCT) to analyze energy and frequency distributions. Tu et al. [12] proposed the VIDEVAL algorithm, which constructs an initial feature set by selecting features from existing best-performing VQA models and uses machine learning to learn the important ones. ChipQA [13] and the approach proposed by Wu et al. [14] both track motion trajectories to extract features of video quality degradation. VMAF [15] is a video quality assessment tool launched by Netflix, designed to address situations where traditional indicators fail to reflect the diverse scenes and features in videos.

As deep learning methods have matured, many deep learning-based algorithms perform well in quality assessment tasks [16,17,18]. VSFA [19] leverages a pre-trained network for content-dependency modeling and a recurrent network for modeling the temporal memory effect. TLVQM [20] extracts low-complexity features from full video sequences and high-complexity features from representative frames for perceptual quality modeling. Inspired by the PaQ-2-PiQ (from patch to picture) algorithm, Ying et al. [21] proposed the PatchVQ algorithm; based on the relationship between video patches and video clips, 3D convolution is adopted to extract spatiotemporal features from a spatiotemporal pooling layer for learning. RIR-net [22] makes creative use of a recurrent neural network for quality assessment; its framework better matches how the human visual system (HVS) perceives distorted video quality. Most deep learning-based methods treat videos as stacks of static images and apply pre-trained 2D CNN models to frame-level quality assessment. However, this strategy performs poorly in terms of motion sensitivity, as motion information is simply ignored. Because video quality is highly correlated with motion between successive frames, the full-reference method proposed by Xu et al. [23], which applies a 3D CNN to learn spatiotemporal features, made progress in handling compression artifacts and the spatiotemporal continuity of video frames.

2.2 Quality assessment for \(360^{\circ }\) video/image

VQA for \(360^{\circ }\) video. Existing \(360^{\circ }\) video quality assessment methods generally fall into two categories: traditional methods and learning-based methods. Traditional methods are improvements on 2D quality assessment algorithms. For example, the weighted-to-spherically-uniform PSNR (WS-PSNR) [24] and the area-weighted spherical PSNR (AW-SPSNR) [25] both use weight distributions to balance the non-uniform sampling density when calculating the PSNR, so that the non-uniformity of spherical content mapped to a plane is taken into account. Yu et al. [26] proposed S-PSNR, which resamples the original frame with a set of uniformly distributed points on the sphere. Zakharchenko et al. [27] proposed a quality assessment metric based on the Craster parabolic projection (CPP), named CPP-PSNR. Xu et al. [28] introduced a content-based perceptual PSNR (CP-PSNR) approach that calculates a weighted PSNR on the original \(360^{\circ }\) frame. Gao et al. [29] proposed a method that incorporates spatial and temporal considerations, evaluating distortions at the eye-fixation level and providing an effective way to integrate existing spatial video quality assessment metrics. All of the methods mentioned above obtain quality scores by calculating the PSNR of \(360^{\circ }\) video frames, but they do not achieve high accuracy. Yang et al. [30] proposed BP-QAVR, which uses a region-of-interest (ROI) map to calculate multi-level quality factors. Jiang et al. [31] proposed TB-VMAF, which utilizes elliptical projection, inverse projection, and bilinear interpolation to transform planar tiles onto a sphere; by incorporating user head and eye movements to generate a tile-weighted map, they optimized the Video Multimethod Assessment Fusion (VMAF) method and achieved excellent results.

On the other hand, deep learning-based algorithms have achieved great success in \(360^{\circ }\) video quality assessment. Li et al. [32, 33] proposed the viewport-based convolutional neural network (V-CNN) for the full-reference \(360^{\circ }\) VQA task. V-CNN has two stages, VP-net (Stage I) for extracting potential viewports and VQ-net (Stage II) for computing the quality score, as well as two auxiliary tasks, the prediction of potential viewports and the prediction of viewport saliency, both of which achieve good performance. Researchers have also achieved excellent results with no-reference VQA algorithms. Meng et al. [34], building on the observation that users show very consistent saliency preferences when consuming \(360^{\circ }\) content, designed a method that combines the quality of salient viewports with the quality of quickly scanned regions of \(360^{\circ }\) video. In NR-OVQA [35], the \(360^{\circ }\) video is first projected by cube map projection (CMP) onto six equal-area 2D videos that are treated as inputs to a CNN; two-stream CNN models are then built to model and learn spatial and temporal quality features.

IQA for \(360^{\circ }\) images. In recent years, numerous deep learning-based IQA methods for \(360^{\circ }\) images have emerged. Many approaches utilize visual and positional features of \(360^{\circ }\) images for quality prediction, yielding promising results [36,37,38]. Sun et al. [39] proposed MC360IQA, which projects each \(360^{\circ }\) image into six viewport images and uses a multi-channel feature extraction model to learn the viewport feature representation. Xu et al. [40] proposed a viewport-oriented graph convolution network (VGCN) that consists of two branches: one branch utilizes viewports to calculate local quality scores, while the other uses DB-CNN to estimate the global quality score. It achieves remarkable performance.

2.3 Saliency models on \(360^{\circ }\) video/image

Saliency Models on \(360^{\circ }\) Image. In recent years, researchers have focused on saliency prediction for \(360^{\circ }\) images. In [41, 42], the authors analyzed participants’ gaze behavior through eye-tracking experiments on \(360^{\circ }\) images. They improved the saliency model by incorporating specific gaze biases to adapt to this type of content and adjusted the weights of head-movement and eye-movement data in their methods.

Saliency Models on \(360^{\circ }\) Video. To predict the behavior of participants in head-mounted displays (HMDs), many researchers have proposed methods for predicting eye movements (EM) and head movements (HM), which greatly aid saliency prediction for \(360^{\circ }\) content [43, 44]. Xu et al. [45] proposed a deep reinforcement learning method that predicts the area viewers are most likely to focus on. Additionally, various methods [46, 47] utilize deep neural networks (DNNs) to predict the scan paths of head and eye movements. Martin et al. [2] proposed a method that uses a panoramic convolutional network to predict \(360^{\circ }\) video saliency.

3 Methodology

Fig. 1 The framework of our proposed method; the output channels of each convolution layer are denoted

We design a quality assessment framework based on saliency prediction and viewport extraction. The framework of our proposed method is shown in Fig. 1 and can be divided into three parts: the video frame saliency prediction module, the optimal viewport positioning module, and the quality assessment network.

3.1 Frame saliency-guided viewport extraction

Previous studies have shown that image quality is highly related to visual saliency [48] and that users exhibit consistent saliency preferences when consuming \(360^{\circ }\) content [49]. Therefore, the video frame saliency prediction algorithm adopted in this paper is the network proposed in [2]. This approach introduces a panoramic convolutional network that learns feature relationships in a simple, non-distorted space, inspired by the approach proposed in [50]. For each point p on the sphere with latitude \(\phi \in [\frac{-\pi }{2}, \frac{\pi }{2}]\) and longitude \(\theta \in [-\pi , \pi ]\), there exists a tangent plane P located at p whose coordinates \((x, y) \in P\) are related to points on the sphere by the gnomonic projection. With this network, the distortion caused by projecting panoramic video frames no longer needs to be considered, so both global and local spatial information is preserved.

The network uses a U-net-like structure composed of four encoder layers and four decoder layers. Each encoder layer consists of two panoramic blocks, and each decoder layer consists of three panoramic blocks. The structure of the panoramic block is illustrated in Fig. 2. To preserve the details of the final output image, features of different resolutions on the encoder path are connected to the corresponding features on the decoder path through skip connections. The model was trained on a publicly available panoramic image saliency dataset [51]. It generates a saliency map from a single video frame and outperforms other advanced panoramic saliency prediction methods.

Fig. 2 The structure of the panoramic block

Due to the temporal and spatial continuity of video, and unlike \(360^{\circ }\) images, users cannot view multiple viewports of the same frame while watching \(360^{\circ }\) videos on HMD devices, which makes evaluating video quality difficult. In continuous \(360^{\circ }\) video, although users can look in all directions in an HMD by rotating their heads, the human eyes can capture quality degradation in at most one viewport of a given frame. Figure 3 illustrates the distortion in the same region in equirectangular projection (ERP) format and in the viewport. The ERP format deforms the distorted area to a certain degree due to projection distortion, whereas the viewport is more representative of the content viewed in an HMD; therefore, using the viewport image for distortion evaluation is more appropriate. To simulate the process by which the human eye evaluates video quality, designing a correct viewport selection method thus becomes crucial.

Fig. 3 The distortion difference between the ERP format and the viewport

Fig. 4 Viewports generated using the gnomonic projection: the 14 viewport centers on the sphere and the corresponding points on the sphere surface and the tangent plane

Inspired by this behavior, the optimal viewport positioning module (OVPM) is introduced to extract the viewport that optimally represents the video frame. For each \(360^{\circ }\) video, we sample \(n (0 < n \le N - 2)\) video clips at regular intervals, where N is the total number of frames. Each clip consists of three frames: the center frame \(I_t\) and its adjacent frames \(\{ I_{t-1}, I_{t+1}\}\). A total of \(3\times n\) frames are sent into the module for saliency prediction, which generates video frame saliency maps in ERP format. We set the horizontal and vertical field-of-view (FOV) angles of each viewport to \(90^{\circ }\), as shown in Fig. 4, and viewports are extracted following CMP and the gnomonic projection. To extract a viewport, the viewport center point \(p_\mathrm{{ERP}}^0 = (x^0, y^0)\) of the saliency image in ERP format is projected back onto the sphere as \(p_{s}^0 = (\phi ^0, \theta ^0)\); since the FOV angle is fixed when the user is viewing \(360^{\circ }\) content, each viewport can be represented as a projection of the tangent plane centered on \(p^0 = (\phi ^0, \theta ^0)\). The saliency image in ERP format is thus first mapped back onto the sphere. The conversion between the ERP domain and the sphere domain is as follows:

$$\begin{aligned} \left\{ \begin{array}{l} \theta =\frac{2 \pi u}{W}-\pi \\ \phi =\frac{-\pi v}{H}+\frac{\pi }{2} \end{array}\right. \end{aligned}$$
(1)

where H and W are the height and width of the ERP frame, and (u, v) are the pixel coordinates in the ERP domain. \(\phi \in [\frac{-\pi }{2}, \frac{\pi }{2}]\) and \(\theta \in [-\pi , \pi ]\) are the latitude and longitude on the sphere. Based on the gnomonic projection and the FOV angle, the coordinates in the sphere domain are projected onto the tangent plane at the viewport’s central point to determine the corresponding viewport. The conversion between the tangent plane and the sphere domain is as follows:

$$\begin{aligned} \begin{aligned} x(\phi , \theta )&=\frac{\cos \phi \sin \left( \theta -\theta _{\Pi _0}\right) }{\sin \phi _{\Pi _0} \sin \phi +\cos \phi _{\Pi _0} \cos \phi \cos \left( \theta -\theta _{\Pi _0}\right) } \\ y(\phi , \theta )&=\frac{\cos \phi _{\Pi _0} \sin \phi -\sin \phi _{\Pi _0} \cos \phi \cos \left( \theta -\theta _{\Pi _0}\right) }{\sin \phi _{\Pi _0} \sin \phi +\cos \phi _{\Pi _0} \cos \phi \cos \left( \theta -\theta _{\Pi _0}\right) } \end{aligned} \end{aligned}$$
(2)

where \((\phi , \theta )\) represents the coordinates on the sphere, \((x, y)\) represents the coordinates on the tangent plane, and \((\phi _{\Pi _0}, \theta _{\Pi _0})\) represents the center point of the viewport in the sphere domain.
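
As a concrete illustration of Eqs. (1) and (2), the NumPy sketch below maps ERP pixel coordinates to the sphere and then onto the tangent plane of a viewport center; the function names and vectorized form are our own illustration, not the paper’s released code.

```python
import numpy as np

def erp_to_sphere(u, v, W, H):
    """Eq. (1): ERP pixel coordinates (u, v) to spherical coordinates.
    theta (longitude) lies in [-pi, pi], phi (latitude) in [-pi/2, pi/2]."""
    theta = 2.0 * np.pi * u / W - np.pi
    phi = np.pi / 2.0 - np.pi * v / H
    return phi, theta

def gnomonic_projection(phi, theta, phi_c, theta_c):
    """Eq. (2): project spherical points (phi, theta) onto the tangent plane
    centered at the viewport center (phi_c, theta_c)."""
    denom = (np.sin(phi_c) * np.sin(phi)
             + np.cos(phi_c) * np.cos(phi) * np.cos(theta - theta_c))
    x = np.cos(phi) * np.sin(theta - theta_c) / denom
    y = (np.cos(phi_c) * np.sin(phi)
         - np.sin(phi_c) * np.cos(phi) * np.cos(theta - theta_c)) / denom
    return x, y

# Example: the center pixel of an ERP frame maps to (phi, theta) = (0, 0),
# which projects to the origin of the tangent plane centered at (0, 0).
phi, theta = erp_to_sphere(u=960, v=480, W=1920, H=960)
print(gnomonic_projection(phi, theta, phi_c=0.0, theta_c=0.0))  # (0.0, 0.0)
```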

In this module, as shown in Fig. 4, 14 candidate viewport centers \(\{ p_1, p_2, p_3, \ldots , p_{14}\}\) are set, of which 12 lie on the equator of the saliency sphere and the other two lie at its poles. Adjacent viewport centers on the equator are separated by \(\Delta \theta = 30^{\circ }\). The 14 candidate viewports fully cover all the information in the video frame. We then compute the maximum of the saliency values over the 14 viewports, \(Max\{S_1, S_2, S_3, \ldots , S_{14}\}\), where the saliency value \(S_i\) is the sum of the pixel values of the gray-scale saliency map within the viewport. The viewport with the highest saliency value represents the content that the human eye is most likely to watch in this frame, consistent with the region of interest when watching the video. The corresponding viewport content is then input into the quality assessment network as the representative of the video frame.
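
This selection rule can be summarized by the following minimal sketch, assuming a helper `extract_viewport` that renders a \(90^{\circ }\) FOV viewport from the ERP saliency map via the gnomonic projection above; the helper and its signature are hypothetical and serve only to illustrate the max-saliency rule.

```python
import numpy as np

def select_optimal_viewport(saliency_erp, extract_viewport, fov=np.pi / 2):
    """Pick the viewport center with the largest summed saliency S_i among
    the 14 candidates: 12 equatorial centers spaced 30 degrees apart plus
    the two poles."""
    centers = [(0.0, np.deg2rad(30 * k) - np.pi) for k in range(12)]  # equator
    centers += [(np.pi / 2, 0.0), (-np.pi / 2, 0.0)]                  # poles

    best_center, best_score = None, -np.inf
    for phi_c, theta_c in centers:
        patch = extract_viewport(saliency_erp, phi_c, theta_c, fov)  # gray-scale viewport
        score = float(patch.sum())                                   # S_i
        if score > best_score:
            best_center, best_score = (phi_c, theta_c), score
    return best_center
```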

3.2 Quality assessment network

Since viewport content is what is presented in the HMD for virtual reality applications and is almost unaffected by projection distortion, we input the optimal viewport content extracted by the previous module into the quality assessment network. Inspired by the 2D full-reference video quality assessment algorithm based on the just noticeable difference (JND) [23], the structure of the quality assessment network is shown in Fig. 1. Before inputting viewport content into the network, we calculate the viewport difference map as follows:

$$\begin{aligned} V_t^\mathrm{{diff}} = \left| \frac{2 \ln (255)-\ln \left( (V_t^\mathrm{{ref}} - V_t^\mathrm{{dist}})^2+1\right) }{2 \ln 255}\right| \end{aligned}$$
(3)

where \(V_t^\mathrm{{ref}}\) is the viewport content extracted from the reference \(360^{\circ }\) video frame. For each pixel in the impaired viewport content \(V_t^\mathrm{{dist}}\) and the reference viewport content \(V_t^\mathrm{{ref}}\), alignment is implemented by bilinear interpolation at the corresponding location in the frame. The residual values are normalized so that they lie between 0 and 1 and no negative pixel values occur. Both the impaired viewport \(V_t^\mathrm{{dist}}\) and the difference map \(V_t^\mathrm{{diff}}\) are then input into the quality assessment network. We use two 3D convolution layers to downsample the viewport content for spatiotemporal feature learning. A spatial attention (SA) module is introduced to weight the selection of spatial features and thereby improve the network’s ability to learn informative features. The spatial attention module illustrated in Fig. 5a can be summarized as:

$$\begin{aligned} F^{'}= & {} (1 + M_s(F)) \otimes F \end{aligned}$$
(4)
$$\begin{aligned} M_s(F)= & {} \sigma (f^{7\times 7}(\mathrm{{AvgPool}}(F) \oplus \mathrm{{MaxPool}}(F))) \end{aligned}$$
(5)

where F is the feature map after the two 3D convolution layers, \(\sigma\) denotes the sigmoid function, \(f^{7\times 7}\) represents a convolution with a filter size of \(7\times 7\), and \(\otimes\) and \(\oplus\) denote element-wise multiplication and concatenation, respectively.
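
For concreteness, a PyTorch sketch of the difference map in Eq. (3) and the spatial attention in Eqs. (4) and (5) is given below; the channel-wise pooling and the (1, 7, 7) kernel shape for 5D spatiotemporal features of shape (B, C, T, H, W) are our assumptions.

```python
import torch
import torch.nn as nn

def viewport_difference_map(v_ref, v_dist):
    """Eq. (3): log-scaled, normalized residual between the reference and
    distorted viewports (pixel values assumed to be in [0, 255])."""
    log255 = torch.log(torch.tensor(255.0))
    return torch.abs((2 * log255 - torch.log((v_ref - v_dist) ** 2 + 1)) / (2 * log255))

class SpatialAttention(nn.Module):
    """Eqs. (4)-(5): pool over channels, apply a 7x7 convolution, and
    re-weight the features with a residual term (1 + M_s(F)) * F."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3))

    def forward(self, f):                       # f: (B, C, T, H, W)
        avg = f.mean(dim=1, keepdim=True)       # AvgPool over the channel axis
        mx = f.amax(dim=1, keepdim=True)        # MaxPool over the channel axis
        m_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Eq. (5)
        return (1.0 + m_s) * f                  # Eq. (4)
```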

Fig. 5 Diagram of the attention modules: a the architecture of the spatial attention module, b the channel attention module

After channel-wise concatenation of the two feature maps \(F^{'}\), four 3D convolution layers with corresponding channel attention (CA) modules are introduced to further select features. The channel attention illustrated in Fig. 5b is computed as follows:

$$\begin{aligned} F^{''}= & {} M_c(F_c) \otimes F_c \end{aligned}$$
(6)
$$\begin{aligned} M_c(F_c)= & {} \sigma (\mathrm{{ML}}(\mathrm{{AvgPool}}(F_c)) + \mathrm{{ML}}(\mathrm{{MaxPool}}(F_c))) \end{aligned}$$
(7)

where \(\mathrm{{ML}}\) denotes a multi-layer block consisting of two convolution layers and a ReLU layer.
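
A corresponding sketch of the channel attention in Eqs. (6) and (7) follows; the channel-reduction ratio inside the ML block and the use of \(1\times 1\times 1\) convolutions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Eqs. (6)-(7): a shared ML block (conv-ReLU-conv) applied to globally
    average- and max-pooled features, summed and passed through a sigmoid
    to weight the channels."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.ml = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, f_c):                              # f_c: (B, C, T, H, W)
        avg = self.ml(F.adaptive_avg_pool3d(f_c, 1))     # ML(AvgPool(F_c))
        mx = self.ml(F.adaptive_max_pool3d(f_c, 1))      # ML(MaxPool(F_c))
        m_c = torch.sigmoid(avg + mx)                    # Eq. (7)
        return m_c * f_c                                 # Eq. (6)
```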

Before sending the final feature map into the global average pooling layer, we multiply it by the difference map \(V_t^\mathrm{{diff}}\) to represent the degree of perceived distortion. Two fully connected layers then perform the nonlinear mapping between distortions and subjective scores. The network’s training loss is:

$$\begin{aligned} \mathcal {L} = \frac{1}{K}\sum _{n=1}^{K}\left\| f_\delta \left( \textrm{x}_{\textrm{n}}\right) -y_n\right\| _2^2+\lambda L_2 \end{aligned}$$
(8)

where K denotes the total number of impaired videos in the training set, and \(\textrm{x}_\textrm{n}\) and \(y_n\) denote the n-th input video pair and its subjective score, respectively. \(\delta\) denotes the parameters of the whole network to be trained. \(\lambda\) and \(L_2\) represent a hyper-parameter and a regularization term, respectively.
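
A minimal sketch of the training objective in Eq. (8) is shown below; the explicit \(L_2\) term over the network parameters and the value of \(\lambda\) are illustrative assumptions, since in practice the regularization can equivalently be realized through the optimizer’s weight decay.

```python
import torch

def training_loss(pred, target, model, lam=1e-4):
    """Eq. (8): mean squared error over the K videos in a batch plus a
    lambda-weighted L2 regularization term over the network parameters."""
    mse = torch.mean((pred - target) ** 2)
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return mse + lam * l2
```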

4 Experimental results

In this section, we evaluate the proposed method on the commonly used \(360^{\circ }\) VQA database VQA-ODV [33] and the 2D VQA database CSIQ-VQA [52]. The Spearman rank-order correlation coefficient (SROCC), Pearson linear correlation coefficient (PLCC), Kendall rank-order correlation coefficient (KROCC), and root mean square error (RMSE) are used as evaluation criteria. Our method is compared with several state-of-the-art VQA methods for \(360^{\circ }\) video and 2D video. We design experiments to determine how many viewports per frame should be used as network input to achieve optimal performance. The effectiveness of the optimal viewport positioning module is examined by replacing the optimal viewport position with a fixed viewport position, and we also study the effect of the FOV angle. We further conduct experiments to demonstrate the effectiveness of the attention modules and the impact of the number of frames extracted from a video on model performance.

4.1 Datasets and training details

VQA-ODV dataset. This dataset consists of 60 reference videos and 540 distorted videos generated with three projection formats, i.e., ERP, reshaped cube map projection (RCMP), and truncated square pyramid projection (TSP), each with three compression levels generated using H.265, i.e., QP = 27, 37, and 42. The resolutions range from 4K (3840 \(\times\) 1920) to 8K (7680 \(\times\) 3840).

CSIQ-VQA dataset. This dataset consists of 12 reference videos and 216 distorted videos generated from six distortion types, i.e., H.264/AVC compression, H.264 transmission with packet loss, MJPEG compression, wavelet compression, white noise, and HEVC compression, each with three distortion levels. All videos in this dataset have a resolution of 832 \(\times\) 480.

In our experiments, we randomly select 80% of the reference videos for training, and the remaining 20% are used for testing. Once a reference video is assigned to the training or testing set, all distorted videos generated from it are put into the same set. We use the Adam optimizer for back-propagation and regularization; the initial learning rates for the VQA-ODV and CSIQ-VQA datasets are 7e−4 and 3e−4, and the weight decays are set to 5e−3 and 3e−3, respectively. The learning rate is multiplied by 0.9 if the loss saturates for 5 epochs. The number of video clips n fed to the method is 9 and 11, respectively. Our experiments are implemented in the PyTorch framework and run on an NVIDIA RTX 3060 Ti GPU with 8 GB of memory, an Intel i7-11700K CPU, and 32 GB of RAM.
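
The optimizer setup described above can be sketched as follows (VQA-ODV values shown); the placeholder model and the ReduceLROnPlateau wiring are our assumptions about how the learning-rate rule was realized.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the proposed quality assessment network.
model = nn.Conv3d(4, 16, kernel_size=3)

optimizer = torch.optim.Adam(model.parameters(), lr=7e-4, weight_decay=5e-3)
# Multiply the learning rate by 0.9 when the loss has not improved for 5 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.9, patience=5)

# Inside the training loop, after computing the epoch loss:
# scheduler.step(epoch_loss)
```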

4.2 Performance on \(360^{\circ }\) videos

We compare the performance of our proposed method with several state-of-the-art VQA methods on the VQA-ODV dataset [33]: WS-PSNR [24], S-PSNR [26], CPP-PSNR [27], NR-OVQA [35], MC360IQA [39], VGCN [40], BP-QAVR [30], OV-PSNR [29], TB-VMAF [31], and V-CNN [32, 33]. Among them, WS-PSNR, S-PSNR, and CPP-PSNR are PSNR-based methods for \(360^{\circ }\) VQA, implemented using publicly available code; MC360IQA and VGCN are no-reference methods for \(360^{\circ }\) images that achieve excellent results on \(360^{\circ }\) IQA databases; NR-OVQA is a no-reference method for \(360^{\circ }\) VQA; BP-QAVR, OV-PSNR, and TB-VMAF are traditional full-reference methods for \(360^{\circ }\) VQA; and V-CNN is a full-reference deep learning-based method.

The results are shown in Table 1. V-CNN achieves the best PLCC, SROCC, and RMSE, while our proposed method is superior to all other methods in KROCC and is second only to V-CNN on the other three indexes, remaining within the same order of magnitude. Moreover, in terms of model complexity, our model’s parameters and FLOPs are far lower than those of the other deep learning-based methods. This implies that our proposed method requires less computation and is better at preventing overfitting.

Table 1 Performance comparison on VQA-ODV dataset with competing methods, the best two results are in bold

4.3 Performance on 2D videos

We compare the performance of our proposed quality assessment network with several state-of-the-art 2D VQA methods on the CSIQ-VQA dataset [52]: ChipQA [13], Wu et al. [14], VMAF [15], RIR-net [22], and C3DVQA [23]. Among them, VMAF uses several metrics to compute and aggregate frame quality scores into the final video quality score; ChipQA and Wu et al. [14] are machine learning-based methods for 2D VQA; RIR-net and C3DVQA are full-reference deep learning-based methods.

The results are shown in Table 2. The proposed quality assessment network with the attention mechanism achieves the best PLCC and SROCC. This demonstrates that the attention mechanism plays a very positive role in perceiving quality degradation in 2D video quality assessment.

Table 2 Performance comparison on CSIQ-VQA dataset with competing methods, the best results are in bold

4.4 Ablation study

To study the performance of the proposed method with different numbers of viewports per video frame, we conduct VQA experiments with 1, 3, and 5 viewports. The selected viewports are those with the highest saliency values calculated by the OVPM. As shown in Table 3, the best performance is achieved when the number of viewports is 1. This result is consistent with users’ behavior when watching \(360^{\circ }\) videos.

Table 3 Ablation study on the number of viewports extracted in one single video frame on the VQA-ODV dataset

We also validate the effectiveness of the OVPM. Inspired by the view directions of the CMP format [35], we replace the module with a fixed viewport center for each frame: \(p_1(0^\circ \textrm{N}, 0^\circ \textrm{E})\), \(p_2(0^\circ \textrm{N}, 90^\circ \textrm{W})\), and \(p_3(0^\circ \textrm{N}, 90^\circ \textrm{E})\), corresponding to the front, left, and right views in CMP format. The corresponding viewport contents are used as the input of the quality assessment network. The evaluation results are shown in Table 4; as expected, the performance with the OVPM is significantly better than that with a fixed viewport center.

Table 4 Ablation study on the optimal viewport positioning module on the VQA-ODV dataset


To validate the effectiveness of the proposed attention modules, we conduct experiments on the network without SA, the network without CA, and the network with no attention modules. The results of this ablation study, shown in Table 5, indicate that both the spatial and channel attention modules significantly improve the performance of the network.

Based on our empirical observations, the number of frames extracted from a video has a significant impact on both model performance and computational cost. To investigate this, we repeated the experiment with different numbers of frames extracted per video, specifically 21, 27, 33, 39, and 45. The results, shown in Fig. 6, indicate that 33 frames per video gives the best performance. The experiments also reveal that increasing the number of frames does not necessarily lead to better performance; instead, it consumes a significant amount of computational resources and may result in overfitting. We therefore choose an appropriate number of frames as input.

4.5 Performance influence of different FOV angles

When human eyes watch \(360^{\circ }\) content in HMD devices, the FOV inevitably affects the perception of quality. Five FOV angles were tested during viewport extraction. The experimental results are shown in Table 6. The best performance is obtained when the FOV angle is \(90^{\circ }\), which shows that the FOV angle also has a great influence on \(360^{\circ }\) video quality assessment.

Table 5 Ablation study on attention module on the VQA-ODV dataset
Fig. 6 SROCC and PLCC results of the proposed method trained with different numbers of frames per video

Table 6 Ablation study on the optimal FOV angle on the VQA-ODV dataset

5 Conclusion

To evaluate \(360^{\circ }\) video quality, we propose a deep learning method based on saliency prediction and viewport extraction. A saliency prediction network predicts video frame saliency, and the viewport that most attracts human attention is extracted based on saliency for quality assessment. The experimental results show that our method achieves performance comparable to the state-of-the-art method with far fewer model parameters. The quality assessment network with attention mechanisms also achieves excellent results in 2D video quality assessment tasks.