
1 Introduction

With the rapid advancement of social productivity and information technology in recent years, human perception of real-world scenes is no longer confined to a limited field of view. This has resulted in a growing global demand for VR applications [1], presenting the VR industry with new opportunities and challenges. VR technology has revolutionized traditional media by freeing it from the constraints of conventional screens. With the help of a Head Mounted Display (HMD) [2], users can experience a 360\(^\circ \) immersive view and watch videos from any angle by simply rotating their heads. As a visual imaging technology, it offers interactive services that provide an in-depth and immersive experience, making it a leading technology for blind-spot-free visual display.

However, compared with traditional images, omnidirectional images (OI) [3] must capture 360\(^\circ \) views and typically demand high resolutions such as 4K, 8K, or higher to satisfy users’ Quality of Experience (QoE). Such images are therefore often heavily compressed for transmission and storage [4]. During immersive content acquisition and processing, image distortion is inevitable, which degrades the quality of the final image displayed to the user.

At the same time, visual degradation in VR applications can reduce the quality of experience for users. To address this issue, No-Reference Omnidirectional Image Quality Assessment (NR-OIQA) has been developed to quantify the visual distortion perceived by humans in omnidirectional images and thereby help improve the visual experience. Consequently, designing a feasible objective quality evaluation algorithm for omnidirectional images holds significant practical and theoretical value.

Based on this, we propose a scanpath-oriented deep learning network for blind omnidirectional image quality assessment. First, the scanning path of the omnidirectional image is employed as a reference to derive the trajectory of the human eye’s gaze within the head-mounted device, and viewports are extracted along this trajectory. Second, taking into account that existing CNN-based OIQA methods are limited by the receptive field and cannot establish global contextual connections, we employ the Swin Transformer to extract features for judging viewport quality. Finally, to construct a global correlation of viewports based on scanning paths, we use a graph-based approach. Notably, we extract Natural Scene Statistics (NSS) features from each viewport, which effectively represent the similarity and correlation between viewports.

Our contributions are listed as follows:

  • We propose a novel approach for extracting viewports from omnidirectional images by leveraging a model of scanning paths. A graph structure is constructed, which represents the complete viewing path of the omnidirectional image. It enables us to simulate the information interaction among different viewports and model the overall viewing process dynamically.

  • We propose employing NSS features to calculate feature similarity and correlation across various viewports, with the objective of constructing an affinity matrix.

  • We propose a novel deep learning model that integrates Swin Transformer with the graph structure to predict quality scores for omnidirectional images. This model facilitates both local and global feature interactions within and across viewports. Our network outperforms existing Full-Reference and No-Reference methods on two benchmark databases.

2 Related Works

In this section, we introduce various methods for no-reference omnidirectional image quality assessment. We then give an overview of recent related work on vision transformers.

2.1 NR-OIQA

NR-OIQA aims to objectively and accurately evaluate visual quality without reference images. Recently, deep learning technologies have promoted the development of NR-OIQA. Kim et al. [5] proposed a CNN-based adversarial learning method called DeepVR-IQA. They partitioned an omnidirectional image into patches and employed an adversarial network to estimate their local quality and weight. The weighted quality scores are then aggregated to obtain the final score.

Tian et al. [6] utilized a pseudo-reference viewport and employed spherical convolution to eliminate projection distortion. The final prediction score is obtained by merging the quality scores from two branches.

From the perspective of mitigating geometric distortion, Sun et al. [7] used a multi-channel CNN framework to predict the quality score of omnidirectional images. On this basis, Zhou et al. [8] incorporated a distortion discrimination-assisted network to promote the OIQA learning task. However, the inherent differences between viewports, as well as the interaction information between them, are overlooked.

To better illustrate the dependency of various viewports in 360\(^\circ \) images, Xu et al. [9] first introduced graph convolutional networks into OIQA and modeled the spatial positional relationship of viewports in omnidirectional images. However, they only consider the spatial position of the viewports in the construction of the graph and ignore their content characteristics. To this end, Fu et al. [10] developed an adaptive hypergraph convolutional network (AHGCN) for NR-OIQA. In addition to the location-based features, content-based features generated from content similarity are also taken into consideration.

While these methods take the spatial and content characteristics of viewports into account, they overlook the influence of viewport distortion. Therefore, we propose to use distortion-sensitive NSS features to construct the correlation between viewports together with the Swin Transformer. NSS features are also used in [11, 12], and [13] to achieve high consistency with human perception.

2.2 IQA Based on Swin Transformer

Inspired by the success of the Transformer [14] in various NLP tasks [15], an increasing number of Transformer-based methods [16] have appeared in CV tasks, including no-reference omnidirectional image quality assessment. Compared to frequently employed CNN models, the Swin Transformer introduces a shifted-window self-attention mechanism that facilitates the establishment of long-range contextual connections. In contrast, CNNs possess restricted receptive fields, which limits their ability to capture global features. In the task of IQA, both local and global quality perception are critical. Evaluators of image quality are sensitive not only to the quality of the current viewport but also to previously viewed viewports, as these can affect their overall quality perception. Inspired by this fact, we use the Swin Transformer to establish local information interaction within each viewport, and a graph structure to construct feature transfer between viewports.

3 Method

In this section, we introduce the proposed OIQA method. Figure 1 illustrates the overall architecture. Our method employs a generative scanpath model to produce realistic scanpaths over the 360\(^\circ \) image, from which viewports are extracted. We then model the interaction among viewports according to human eye perception and generate a perception score.

Fig. 1. The architecture of our proposed model. Viewports are first extracted from the distorted omnidirectional image in ERP format and fed into the feature extraction module. The semantic features are then sent to the feature interaction network together with the extracted relevance matrix, which regresses the final perception score.

3.1 Viewport Extraction

When a 360\(^\circ \) image is viewed in a VR device, the visual content is displayed on a planar viewport tangent to the viewing sphere. Moreover, when evaluating the quality of a 360\(^\circ \) image, viewers look around it from multiple perspectives. Based on this, we employ a technique that mimics the human visual perception process by examining the scanning-path data of an omnidirectional image as seen by human eyes. We use the model proposed in [17] to directly process the equirectangular projection (ERP) format. The gaze points predicted for viewing the omnidirectional image in an HMD are shown in Fig. 1.

Figure 2 illustrates the process of viewport extraction. We set the viewing angle to 90\(^{\circ }\), which is consistent with the FOV of the most popular VR devices. Given a distorted omnidirectional image \(V_d\), we select N central points along the predicted scanpath to extract viewports. The viewport set is denoted as \(\{V_i\}_{i=1}^{N}\). We thus obtain N viewports covering 90\(^{\circ }\) FOVs.
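To make the extraction step concrete, the following is a minimal sketch of rectilinear (gnomonic) viewport extraction from an ERP image, assuming the gaze points predicted by the scanpath model [17] are given as (longitude, latitude) pairs in radians; the function name, interface, and nearest-neighbour sampling are illustrative choices rather than the exact implementation.

```python
import numpy as np

def extract_viewport(erp, center_lon, center_lat, fov_deg=90.0, size=256):
    """Extract one rectilinear (gnomonic) viewport from an ERP image.

    erp        : H x W x 3 array in equirectangular projection
    center_lon : viewport-centre longitude in radians, in [-pi, pi]
    center_lat : viewport-centre latitude in radians, in [-pi/2, pi/2]
    """
    H, W = erp.shape[:2]
    half = np.tan(np.radians(fov_deg) / 2.0)

    # Tangent-plane grid: x to the right, y upwards.
    xs = np.linspace(-half, half, size)
    ys = np.linspace(half, -half, size)
    x, y = np.meshgrid(xs, ys)

    # Inverse gnomonic projection: tangent plane -> sphere (lat, lon).
    rho = np.sqrt(x ** 2 + y ** 2)
    c = np.arctan(rho)
    rho = np.where(rho == 0.0, 1e-12, rho)          # avoid 0/0 at the centre pixel
    lat = np.arcsin(np.clip(np.cos(c) * np.sin(center_lat)
                            + y * np.sin(c) * np.cos(center_lat) / rho, -1.0, 1.0))
    lon = center_lon + np.arctan2(
        x * np.sin(c),
        rho * np.cos(center_lat) * np.cos(c) - y * np.sin(center_lat) * np.sin(c))

    # Sphere -> ERP pixel coordinates, nearest-neighbour sampling.
    u = ((lon / (2.0 * np.pi) + 0.5) % 1.0) * (W - 1)
    v = (0.5 - lat / np.pi) * (H - 1)
    return erp[np.clip(np.round(v).astype(int), 0, H - 1),
               np.round(u).astype(int)]
```

Calling this function once per predicted gaze point yields the viewport set \(\{V_i\}_{i=1}^{N}\).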

Fig. 2. The process of viewport extraction.

3.2 Graph Nodes Constructed by Swin Transformer

The Swin Transformer utilizes a shifted-window-based local attention computation to build a hierarchical Transformer architecture, so we use it to extract the semantic features of each viewport. It consists of multiple Swin Transformer blocks; Figure 3 shows two successive blocks.

Fig. 3. Two successive Swin Transformer blocks.

The window-based multi-head self-attention (W-MSA) module and the shifted-window-based multi-head self-attention (SW-MSA) module are employed in two consecutive Transformer blocks. Prior to every MSA module and MLP layer, a LayerNorm (LN) layer is applied for normalization, and a residual connection is applied after each module. Based on this window-partitioning mechanism, consecutive Swin Transformer blocks are computed as:

$$\begin{aligned} \hat{F}^l=W\text {-}MSA(LN(F^{l-1}))+F^{l-1}, \end{aligned}$$
(1)
$$\begin{aligned} F^l=MLP(LN(\hat{F}^l))+\hat{F}^l, \end{aligned}$$
(2)
$$\begin{aligned} \hat{F}^{l+1}=SW\text {-}MSA(LN(F^{l}))+F^{l}, \end{aligned}$$
(3)
$$\begin{aligned} F^{l+1}=MLP(LN(\hat{F}^{l+1}))+\hat{F}^{l+1}, \end{aligned}$$
(4)

\(\hat{F}^l\) and \(F^l\) denote the outputs of the (S)W-MSA module and the MLP in the \(l\)-th block, respectively. The N viewports \(\{V_i\}_{i=1}^{N}\) are sampled and sent to the Swin Transformer, whose four stages contain 2, 2, 6, and 2 blocks, respectively. We represent the features of the N viewports as \(V=\left\{ v_1, v_2, \cdots , v_N\right\} \); the feature of each viewport serves as a node of the graph.
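As an illustration of how the node features can be obtained, the following sketch uses the timm implementation of Swin-Tiny with the classification head removed, so that each viewport is mapped to a 768-dimensional node feature; the use of timm and the \(224\times 224\) input resolution are assumptions of this sketch rather than the exact training setup.

```python
import torch
import timm

# Swin-Tiny backbone pre-trained on ImageNet; num_classes=0 returns the
# pooled 768-dimensional feature instead of classification logits.
backbone = timm.create_model('swin_tiny_patch4_window7_224',
                             pretrained=True, num_classes=0)
backbone.eval()

# N viewports, here resized to the default 224x224 Swin-Tiny input size.
N = 20
viewports = torch.rand(N, 3, 224, 224)      # placeholder viewport batch

with torch.no_grad():
    node_features = backbone(viewports)     # shape (N, 768): one node per viewport
print(node_features.shape)
```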

3.3 Graph Edges Constructed by NSS

Considering that the extracted viewports are independent of one another, they alone cannot simulate the process of viewing the omnidirectional image. Additionally, visual distortions vary across different viewports. We therefore use NSS features, which are crucial to the perceptual quality of OIs, as the edges of the graph structure to represent the similarity and correlation between different viewports.

To measure the loss of naturalness in viewports, it is necessary to compute the local mean-subtracted and contrast-normalized (MSCN) coefficients, which can be used to analyze the statistical features. For each distorted ERP map and its viewports, the MSCN coefficients are calculated by:

$$\begin{aligned} \hat{D}^z(i, j)=\frac{D^z(i, j)-\mu (i, j)}{\sigma (i, j)+C} \end{aligned}$$
(5)

where i and j represent the spatial coordinates, \(\hat{D}^z(i, j)\) denotes the MSCN coefficients, \(\mu (i, j)\) and \(\sigma (i, j)\) represent the local mean and standard deviation, and C is a small constant that prevents division by zero.

Then, a generalized Gaussian distribution (GGD) model is employed to fit these statistical features.
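For concreteness, a minimal sketch of the MSCN computation in Eq. (5) and a moment-matching GGD fit is given below; the Gaussian window parameter and the use of the fitted GGD parameters as the per-viewport NSS feature vector are assumptions of this sketch, since the exact feature set is not specified here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.special import gamma

def mscn_coefficients(img, sigma=7 / 6, C=1.0):
    """MSCN coefficients of a grayscale image (Eq. 5), with a Gaussian local window."""
    img = img.astype(np.float64)
    mu = gaussian_filter(img, sigma)
    var = gaussian_filter(img * img, sigma) - mu * mu
    std = np.sqrt(np.abs(var))
    return (img - mu) / (std + C)

def fit_ggd(coeffs):
    """Moment-matching estimate of the GGD shape (alpha) and scale (sigma) parameters."""
    coeffs = coeffs.ravel()
    sigma_sq = np.mean(coeffs ** 2)
    e_abs = np.mean(np.abs(coeffs))
    rho = sigma_sq / (e_abs ** 2 + 1e-12)
    # Invert the ratio function r(a) = Gamma(1/a) Gamma(3/a) / Gamma(2/a)^2 over a grid.
    alphas = np.arange(0.2, 10.0, 0.001)
    r = gamma(1 / alphas) * gamma(3 / alphas) / gamma(2 / alphas) ** 2
    alpha = alphas[np.argmin(np.abs(r - rho))]
    return alpha, np.sqrt(sigma_sq)
```

The two fitted parameters per viewport then form (part of) the NSS feature vector \(g_i\) used in the next subsection.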

Figure 4 shows the difference between the MSCN distributions of different viewports. It is clear that the viewport (FOV) content exhibits more discriminative statistics and a greater capacity for expressing noise-related characteristics than the ERP image.

Fig. 4. The MSCN distributions of different viewports and the ERP image.

To construct the correlation between viewports based on NSS features, we calculate their feature similarity through Eq. (6).

$$\begin{aligned} s_{i, m}=\frac{g_i \cdot g_m}{ \left\| g_i\right\| _2 \cdot \left\| g_m\right\| _2} \end{aligned}$$
(6)

where \(i, m \in \{1, 2, \cdots , N\}\), and \(g_i\), \(g_m\) represent the NSS features of the viewport i and m, respectively. \(s_{i,m}\) denotes the natural feature similarity between two viewports on a spherical domain.

Considering that the feature similarity between viewports changes with distortion type and distortion level, we use the average of the pairwise feature similarities as the similarity threshold. The viewports with the most similar NSS features are then identified by the following formula:

$$\begin{aligned} A_{i, m}\left( v_i, v_m\right) =\left\{ \begin{array}{l} 1, s_{i, m} \ge {\text {average}}\left( s_{i, m}\right) \\ 0, s_{i, m} <{\text {average}}\left( s_{i, m}\right) \end{array}\right. \end{aligned}$$
(7)

where \(A_{i,m}\) is the affinity matrix indicating whether there is information interaction between different viewports.
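The construction of Eqs. (6) and (7) can be sketched as follows; thresholding with the global mean of all pairwise similarities is our reading of \({\text {average}}(s_{i,m})\).

```python
import numpy as np

def affinity_from_nss(nss_feats):
    """Binary affinity matrix of Eq. (7) from per-viewport NSS features.

    nss_feats : (N, d) array with one NSS feature vector g_i per viewport.
    """
    norms = np.linalg.norm(nss_feats, axis=1, keepdims=True) + 1e-12
    g = nss_feats / norms
    s = g @ g.T                      # pairwise cosine similarities s_{i,m} (Eq. 6)
    threshold = s.mean()             # average similarity used as the threshold
    return (s >= threshold).astype(np.float32)
```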

3.4 Quality Prediction

With the node feature vectors \(V=\left\{ v_1, v_2, \cdots , v_N\right\} \) and the affinity matrix A, the perception process based on the omnidirectional scanning path is constructed. Each node is represented by a 768-dimensional feature vector. The quality of the omnidirectional image is then predicted by a network composed of 5 graph convolution layers. The process of interacting and updating the node information can be expressed as:

$$\begin{aligned} \boldsymbol{H}^{(l+1)}=f\left( B N\left( \hat{A} \boldsymbol{H}^{(l)} \boldsymbol{W}^{(l)}\right) \right) \end{aligned}$$
(8)

where \(\hat{A}\) is the normalized adjacency matrix, \(f(\cdot )\) is the Softplus activation function [18], and \(BN(\cdot )\) denotes batch normalization. \(H^{(l)}\) is the node feature matrix of the \(l\)-th layer and \(W^{(l)}\) is its trainable weight matrix. To match the hierarchical features of the Swin Transformer, the output dimensions of the five layers are 384, 192, 96, 48, and 1. We thus obtain a score for each viewport and aggregate the information from all viewports to produce the final quality score Q.
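A minimal sketch of the graph-based quality regression in Eq. (8) is given below; the symmetric normalization of the adjacency matrix and the simple averaging of viewport scores at the output are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer of Eq. (8): H^{l+1} = Softplus(BN(A_hat H^l W^l))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)   # W^{(l)}
        self.bn = nn.BatchNorm1d(out_dim)
        self.act = nn.Softplus()

    def forward(self, h, a_hat):
        return self.act(self.bn(a_hat @ self.weight(h)))

class QualityRegressor(nn.Module):
    """Five GCN layers mapping 768-d viewport nodes to one score per viewport."""
    def __init__(self, dims=(768, 384, 192, 96, 48, 1)):
        super().__init__()
        self.layers = nn.ModuleList(
            GCNLayer(dims[i], dims[i + 1]) for i in range(len(dims) - 1))

    def forward(self, node_feats, affinity):
        # Symmetric normalization: A_hat = D^{-1/2} (A + I) D^{-1/2}.
        a = affinity + torch.eye(affinity.size(0), device=affinity.device)
        d_inv_sqrt = a.sum(dim=1).clamp(min=1e-12).pow(-0.5)
        a_hat = d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)
        h = node_feats
        for layer in self.layers:
            h = layer(h, a_hat)
        return h.mean()              # final score Q: mean over the N viewport scores
```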

4 Experimental Results

In this section, we first introduce the databases utilized in our experiments, along with pertinent implementation details. We then compare the performance of our network with other metrics on individual databases. Finally, we conduct a cross-database evaluation and an ablation study to demonstrate the robustness and effectiveness of our model.

4.1 Databases

Two databases of omnidirectional images are utilized in the experiment: OIQA Database [19] and CVIQD Database [20].

OIQA Database: The database consists of 16 original images and 320 corresponding degraded images. The distortion types include JPEG compression (JPEG), JPEG2000 compression (JP2K), Gaussian blur (BLUR), and Gaussian white noise (WN).

CVIQD Database: This database includes 16 reference images and 528 corresponding distorted images. Three encoding techniques are used to compress images, namely JPEG, H.264/AVC, and H.265/HEVC.

4.2 Implementation Details

Our model is implemented in PyTorch and runs on an NVIDIA GeForce RTX 3090 GPU. It uses Swin-Tiny, pre-trained on ImageNet [21], as the backbone. Each viewport image is resized to \(256\times 256\) and the batch size is set to 2. We use the Adam optimizer with a learning rate of \(1\times 10^{-5}\). Each database is split into training and test sets at a ratio of 8:2, and to avoid any overlap between training and test data, distorted images corresponding to the same reference image are assigned to the same set. The training loss is the Mean Square Error (MSE) loss. The final perception score is obtained as the mean of the predicted scores of the 20 viewports.
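The following stand-in training step only illustrates the stated hyper-parameters (Adam, learning rate \(1\times 10^{-5}\), MSE loss, 20 viewports per image); the placeholder regressor below is not the actual Swin Transformer plus graph network described in Sect. 3.

```python
import torch
import torch.nn as nn

# Placeholder regressor standing in for the full Swin-Transformer + GCN model.
model = nn.Sequential(nn.Flatten(), nn.Linear(20 * 768, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.MSELoss()

features = torch.rand(2, 20, 768)   # batch of 2 images, 20 viewport features each
mos = torch.rand(2, 1)              # subjective quality labels (MOS)

pred = model(features)              # (2, 1) predicted quality scores
loss = criterion(pred, mos)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```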

Table 1. Overall performance comparison on CVIQD and OIQA databases. Best performance in bold.

4.3 Overall Performance on Individual Databases

We compare our method with state-of-the-art methods on the OIQA and CVIQD databases. Spearman’s Rank-Order Correlation Coefficient (SROCC), Pearson Linear Correlation Coefficient (PLCC), and Root Mean Squared Error (RMSE) are used to evaluate the performance of our model. The results of our comparison are presented in Table 1, where the top performances are highlighted in boldface. Results of the other methods are taken from [9, 10]. Our method exhibits superior performance on both databases when compared to six FR-IQA methods and five NR-IQA methods, which we attribute to the effective modeling of human perceptual quality in our approach. In comparison to VGCN and AHGCN on OIQA, our method not only exhibits superior monotonicity but also achieves higher accuracy. However, on CVIQD, our model achieved a slightly lower accuracy of 0.9619 when compared to VGCN and AHGCN.

4.4 Cross Database Validation

Table 2. Cross-database performances of the proposed model.

To substantiate the generalizability of our model, we carried out cross-database experiments on the two databases: the OIQA database was used for training while the CVIQD database was used for testing, and vice versa. Test results are presented in Table 2, indicating that our model achieves good cross-database performance on OIQA but performs poorly on the CVIQD database. This is largely attributed to the Swin Transformer’s local attention mechanism, which computes attention only over a portion of the input sequence. This design allows the model to focus more on relevant information when handling different types of noise, thereby reducing its sensitivity to noise.

Fig. 5. Comparison of different numbers of viewports.

4.5 Ablation Study

Viewports Sampling Strategy: We conducted two experiments to validate our viewport extraction method. Firstly, we determined the optimal number of viewports by comparing the SROCC and PLCC metrics with varying numbers of viewports. Secondly, we compared the effectiveness of our viewport extraction method with a fixed region approach.

Table 3. The influence of the number of viewports.

Figure 5 displays a comparison of various numbers of viewports, with the SROCC and PLCC values presented as line charts. Table 3 provides the specific performance for each number of viewports. The results reveal that 20 viewports achieve the highest SROCC and PLCC values, so we select 20 viewports as the optimal number for our experiments.

Table 4. Performance comparison of different viewport extraction methods.

We conducted a comparative analysis between our proposed viewport extraction method and the fixed-region viewport extraction technique to verify its effectiveness. The results in Table 4 indicate that the fixed-region extraction method is relatively ineffective, while our proposed method demonstrates superior performance.

The Effect of the NSS Features: To validate the effectiveness of constructing viewport correlation using NSS features, we compared our full model with a variant without NSS features, which relies only on the Swin Transformer. As shown in Table 5, our method demonstrates better performance on both the CVIQD and OIQA databases, especially on the OIQA database, where both SROCC and PLCC improve by over 0.02. This confirms the effectiveness of our use of NSS features.

Table 5. Performance comparison with or without NSS features.

5 Conclusion

In this article, we present a deep learning model for the evaluation of omnidirectional image quality. We take into account the fact that the quality of a viewport can affect our perception of subsequent viewports, indicating interdependence between viewport qualities. Additionally, due to the inherent differences between viewports, the perceived quality may also vary across viewports. We utilize the Swin Transformer to establish local information interaction within each viewport and employ NSS features to determine the similarity and correlation between different viewports for inter-viewport information exchange. This approach enables us to model not only local features but also the global perception process, resulting in improved quality regression. Experimental results show that the proposed model outperforms state-of-the-art approaches.