1 Introduction

Advances in hardware and software technology bring more VR and AR applications into daily life, such as computer games, live events and remote medicine. VR provides users with a new media experience, offering the freedom to explore 360-degree content and giving a visual experience that is more realistic and immersive. This kind of service is available on popular image/video platforms, such as street exploration through Google Maps or 360-degree content sharing on social platforms. 360-degree images are usually described in terms of longitudes and latitudes or in terms of 3D coordinates on a spherical surface. They are panoramic images that cover 360 degrees in the horizontal direction and 180 degrees in the vertical direction, and are also termed omnidirectional or spherical images. For efficient storage and transmission, projections are used that convert each 3D coordinate of a 360-degree image to a location on a specific 2D plane. The viewport image is generated on demand when the projected 2D image is converted back to the spherical domain, followed by rectilinear projection [45].

Several formats are used to represent 360-degree images using specified projections [45]. Equirectangular projection (ERP) is the most widely used format; it stores the 360-degree image in one 2D image. The horizontal and vertical axes in the ERP image represent the information in the longitudinal and latitudinal directions on the sphere. The ERP image is typically stored at high resolution: the test 360-degree videos of the Joint Video Exploration Team (JVET) [16] have a maximum resolution of 8192 × 4096 and a maximum frame rate of 60 Hz. Obviously, a 360-degree image or video requires huge storage capacity and transmission bandwidth. Therefore, many academic and industrial studies have sought to increase the compression efficiency for 360-degree images and video. Many studies concern the development of 360-degree image/video processing [24], coding [15, 19, 21,22,23, 26, 31, 39,40,41, 43, 46, 47, 51] and streaming [8, 12, 18, 28, 29].

The Video Coding Experts Group (VCEG, ITU-T Q6/16) worked on the standardization of 360-degree video coding [35]. The Joint Collaborative Team on Video Coding (JCT-VC) studied the means of signaling supplemental enhancement information (SEI) when encoding a 360-degree video using High Efficiency Video Coding (HEVC) (https://www.itu.int/rec/T-REC-H.265-201612-S/en). The SEI specifies information about the projection format and the region-wise packing. Later, the JVET, which was jointly established by the ITU Telecommunication Standardization Sector (ITU-T) Study Group 16 (SG16) and the Moving Picture Experts Group (MPEG), devised a standard for future video coding, called Versatile Video Coding (VVC) (https://jvet.hhi.fraunhofer.de/). 360-degree video coding is one of its most important technologies, in conjunction with high dynamic range (HDR) video coding. The JVET addresses several problems, including the compression of 360-degree video for different projection methods, coding tool libraries and test methods [45]. These developments in VVC are the reference for the MPEG-I standard (https://mpeg.chiariglione.org/standards/mpeg-i).

In terms of the development of immersive media technology, MPEG (ISO/IEC JTC1/SC29/WG 11) established the standard for immersive technology (Immersive Visual Media): MPEG-I (https://mpeg.chiariglione.org/standards/mpeg-i). This standardizes the virtual environment, the degrees of freedom and the related immersive media formats. Currently, there are ten parts. Part 1, “Technical Report on Immersive Media”, defines the requirements and two phases: phase 1a provides three degrees of freedom (3DoF), which allows the viewer to change the yaw, pitch and roll of the rendered viewport, and phase 1b extends the 3DoF coordinate system to 3DoF+ by enabling a slight translation of the viewer position. The second phase, called 6DoF, allows a significant change in the viewer position and gives enhanced immersion; the rendered viewport is a combination of point cloud data and 360-degree video. Part 2, “Omnidirectional Media Application Format” (OMAF), and Part 3, VVC, concern the delivery and coding of 360-degree video. OMAF provides basic services for monocular and stereoscopic 360-degree video; the input video is processed by projection and optional region-wise packing before encoding. VVC provides a significantly better compression capability than former standards, such as HEVC.

There are several ways to capture a 360-degree scene. Capturing and stitching multi-view images is one solution. Another is to use fisheye cameras on both the front and the rear sides; this type of product is widely deployed in the market. The 360-degree image/video can be displayed on computer monitors, smartphones, tablets and head-mounted displays (HMDs). The viewing experience on different kinds of devices is discussed in [11]. Participants reported that the HMD offers the most immersive experience, at the expense of a greater cognitive burden, motion sickness and physical discomfort. Therefore, an understanding of exploration behavior when viewing 360-degree images is crucial for the development of many related technologies, including compression, delivery and free-view rendering, and for ensuring the highest quality of experience (QoE) for the viewer [2].

Visual exploration of a 360-degree image is significantly different from that of a conventional image, because a 360-degree image offers a much greater freedom of viewpoint. When viewing a 360-degree image, the human visual system focuses on visually attractive elements and ignores the less important viewports, so predicting the visually attractive elements is important. The amount of data required for 360-degree images and video is huge, so it is inefficient to allocate the same resources to every part of the content. If the viewport that will be selected by the viewer can be predicted, more bits can be assigned to the predicted region and fewer bits to the remaining parts during encoding. Using a saliency map also increases the efficiency with which viewport-on-demand is realized. For streaming applications, several versions of the 360-degree video are encoded and stored on a server. A saliency map and the head movement collected from the HMD can be used to predict the viewing direction of the viewer at the next instant, which allows a seamless viewing experience during a change in the viewport.

Predicting areas of visual attention involves determining the significance of content with respect to its surrounding environment. There are many studies of saliency prediction for conventional 2D images [9, 14, 17, 30, 44, 50]. Itti et al. [17] proposed a method that predicts saliency using a bottom-up mechanism, whereby information about color, intensity and orientation is integrated. Zhang et al. [50] presented a saliency predictor called Boolean Map Saliency (BMS), which determines the significance of each pixel by comparing it to its neighboring pixels. The rapid development of deep learning allows saliency to be predicted using deep learning models [9, 30]. Although the ERP image is represented as a 2D image, these saliency predictors do not perform well for it. The ERP image suffers from geometric distortion, which is proportional to the latitude, so this issue must be addressed when developing saliency prediction for the ERP image.

This study proposes two techniques for 360-degree images. The first predicts the saliency map of the ERP image; the second is a coding technique for the ERP image that uses visual attention as a guide. To predict the saliency, the proposed model uses existing saliency predictors for 2D images. Pre-processing and post-processing are necessary to address the geometrical distortion of the ERP image. In particular, a smoothing-based optimization is realized in the spherical domain. During encoding of the 360-degree image, the saliency map is used to modify the distortion definition for the rate-distortion optimization (RDO) process, to provide a better visual experience.

The remainder of this paper is organized as follows: Section 2 gives an overview of related work on 360-degree images and video. The proposed saliency prediction technique for an ERP image and the proposed saliency-based coding for an ERP image are detailed in Sections 3 and 4, respectively. Section 5 details the experimental results and Section 6 gives conclusions.

2 Related work on 360-degree images/video

2.1 Projections and viewport generation

There are many projection methods for converting 3D spherical information into a 2D plane [45]. ERP is the most widely used projection method and is supported by YouTube. To describe the ERP conversion, a three-dimensional coordinate system is defined, as shown in Fig. 1(a), where the X axis, Z axis and Y axis point towards the front, the right and the top of the sphere, respectively. Any 3D point P(X, Y, Z) on the spherical surface is expressed as X = cos(θ)cos(φ), Y = sin(θ), and Z = −cos(θ)sin(φ), where θ and φ are the latitude and the longitude of the point P, respectively. The rectangular plane formed by φ and θ is the projection result, where φ is in the range (−π, π) and θ is in the range (−π/2, π/2). This projection method is simple but has an obvious artifact towards the poles. As illustrated in Fig. 1(b) (https://blog.google/products/google-vr/bringing-pixels-front-and-center-vr-video/), the projection density at the poles and at the equator is uneven, so the geometrical distortion increases with latitude. This creates problems for saliency prediction and compression of ERP images.
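As a concrete illustration of this mapping, the following minimal sketch converts between ERP pixel coordinates and points on the unit sphere using the conventions stated above. The half-pixel offset and the function names are illustrative assumptions, not part of the referenced standards or software.

```python
import numpy as np

def erp_pixel_to_sphere(i, j, width, height):
    """Map an ERP pixel (column i, row j) to longitude/latitude and 3D coordinates,
    using X = cos(theta)cos(phi), Y = sin(theta), Z = -cos(theta)sin(phi)."""
    phi = ((i + 0.5) / width - 0.5) * 2.0 * np.pi    # longitude in (-pi, pi)
    theta = (0.5 - (j + 0.5) / height) * np.pi       # latitude in (-pi/2, pi/2), top row = +pi/2
    x = np.cos(theta) * np.cos(phi)
    y = np.sin(theta)
    z = -np.cos(theta) * np.sin(phi)
    return phi, theta, (x, y, z)

def sphere_to_erp_pixel(phi, theta, width, height):
    """Inverse mapping: longitude/latitude back to fractional ERP pixel coordinates."""
    i = (phi / (2.0 * np.pi) + 0.5) * width - 0.5
    j = (0.5 - theta / np.pi) * height - 0.5
    return i, j

# The center of a 4096x2048 ERP image corresponds to (phi, theta) = (0, 0),
# i.e. the point (1, 0, 0) at the front of the sphere.
print(erp_pixel_to_sphere(2047.5, 1023.5, 4096, 2048))
```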

Fig. 1 Projection of a 3D spherical surface to a 2D plane: (a) the 3D coordinate system, (b) ERP image (https://blog.google/products/google-vr/bringing-pixels-front-and-center-vr-video/), (c) rectilinear projection [45], and (d) CMP images (https://blog.google/products/google-vr/bringing-pixels-front-and-center-vr-video/)

In addition to ERP, cube map projection (CMP), which uses six square faces to represent the surface of the sphere, is also a common projection format. Each face is a 2D image for a particular viewport with a field of view (FOV) of 90° and is rendered using rectilinear projection. As shown in Fig. 1(c) and (d), during rectilinear projection the viewing angle is along the Z axis and the 2D image is formed by projecting the surface of the sphere onto the 2D plane. The pixel value in the 2D image comes from the point on the spherical surface that is the intersection of the sphere and the line connecting the pixel on the 2D plane with the origin of the sphere. If the viewport is not along the Z axis, the sphere is first rotated so that the viewport aligns with it, and the projection is then performed. To allow free-view navigation, when the ERP or the CMP image is projected back to the sphere, a viewport is rendered by rectilinear projection, whereby the angles of rotation relative to each axis are specified by the viewer. The JVET has established the 360Lib software package for 360-degree video coding and processing (https://jvet.hhi.fraunhofer.de/svn/svn_360Lib/trunk); the conversion of the projection format and the generation of viewports are realized with 360Lib. In addition to ERP and CMP, the JVET also supports many other projections; for more information, please refer to [45].
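The sketch below illustrates the idea of rendering a rectilinear viewport directly from an ERP image. It is a simplified, nearest-neighbour stand-in for the viewport generation in 360Lib; the rotation conventions, function signature and output size are assumptions for illustration only.

```python
import numpy as np

def render_viewport(erp, yaw_deg, pitch_deg, fov_deg=90.0, out_size=512):
    """Render a rectilinear viewport from an ERP image (nearest-neighbour sampling).

    The camera sits at the sphere centre looking along +X (front), with +Y up and
    +Z to the right, matching the coordinate convention of Fig. 1(a)."""
    H, W = erp.shape[:2]
    f = 0.5 * out_size / np.tan(np.radians(fov_deg) / 2.0)   # focal length in pixels
    u, v = np.meshgrid(np.arange(out_size), np.arange(out_size))
    x = np.full(u.shape, f)
    y = -(v - out_size / 2.0 + 0.5)                          # image rows grow downwards
    z = (u - out_size / 2.0 + 0.5)                           # image columns grow to the right
    rays = np.stack([x, y, z], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Rotate the viewing direction: pitch about the Z (right) axis, yaw about the Y (up) axis.
    p, yw = np.radians(pitch_deg), np.radians(yaw_deg)
    R_pitch = np.array([[np.cos(p), -np.sin(p), 0],
                        [np.sin(p),  np.cos(p), 0],
                        [0, 0, 1]])
    R_yaw = np.array([[ np.cos(yw), 0, np.sin(yw)],
                      [0, 1, 0],
                      [-np.sin(yw), 0, np.cos(yw)]])
    d = rays @ (R_yaw @ R_pitch).T

    theta = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))         # latitude
    phi = np.arctan2(-d[..., 2], d[..., 0])                  # longitude
    i = np.clip(((phi / (2 * np.pi) + 0.5) * W).astype(int), 0, W - 1)
    j = np.clip(((0.5 - theta / np.pi) * H).astype(int), 0, H - 1)
    return erp[j, i]
```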

2.2 Saliency prediction for 360-degree images

There are several techniques for saliency prediction for a 360-degree image [1, 7, 10, 20, 25, 36, 52]. Although ERP is the most commonly used format, it suffers from geometrical distortion, particularly in regions at high latitudes, so it does not allow an entirely accurate saliency map when a saliency predictor for a traditional 2D image is used. To address this problem, the polar region can be represented using another format, such as a cube map face generated by CMP. The saliency predictor for a conventional 2D image can then be used to derive the areas of visual attention for these cube faces, and the result is projected back to the ERP image. Lebreton et al. [20] proposed GBVS360 and BMS360, which extend the Graph-Based Visual Saliency (GBVS) [14] and Boolean Map Saliency (BMS) [50] models designed for 2D images. Multi-plane projection is used in [52] to simulate the viewing behavior of human eyes in the HMD, and bottom-up and top-down feature extraction is performed on each plane. Chao et al. [7] used a fine-tuned SalGAN [30] on two image sources, the original ERP image and cube face images in several orientations; a fusion process is then used to generate the final saliency map in ERP format. ERP images centered on two different longitudes along the equator and cube map faces generated by rotating the cube center through several angles were used in [36] to generate a saliency map. Ling et al. [25] split the ERP image into patches and extracted sparse features; an integrated saliency map was produced considering visual acuity and latitude. Abreu et al. [1] determined the saliency using eye-fixation data from subjective experiments, and a fused saliency map was constructed by integrating the saliency maps of the ERP image for various translations.

2.3 360-degree image/video coding

The polar area in an ERP image is stretched, so that there are many redundant pixels. Efficient coding must assign fewer bits to regions at high latitudes [19, 21, 24, 31, 41, 46, 47]. Yu et al. [47] divided the ERP image into multiple tiles and adjusted the sampling density by resizing; the sampling rate was determined by rate-distortion optimization (RDO). In [46], a tile-based regional downsampling technique is proposed for inter-frame coding, in which three tiles that represent the top, the middle and the bottom parts of the ERP image are rearranged by resizing the top and the bottom tiles. Li et al. [21] also used the tile representation, but described the polar regions as cambered surfaces and flattened them as circles; the two circular images and the tiles are assembled as one 2D image for encoding. A nested polygonal chain mapping was proposed in [19] to improve the coding efficiency for the polar region. The tile format was again used and the tiles are resized according to their locations: the tile nearest the pole is resampled with a larger factor and placed in the middle of the re-packed rectangular region, surrounded by resampled tiles from lower latitudes. This rule is applied to all the tiles, and finally the tiles with various resampling rates are arranged as one 2D image.

The quality of a region can also be adjusted by assigning an adaptive QP (quantization parameter) [15, 31, 39, 41, 51]. Racapé et al. [31] and Tang et al. [41] expressed the QP as a function of the latitude. Another study [15] computed the QP based on the weight in WS-PSNR (weighted-to-spherically-uniform PSNR) [38]. Other coding-optimization-based techniques have been used in [22, 40]. A spherical-domain RDO is realized in [22], using a weighted distortion that depends on the latitude of the pixel in the spherical domain. Luz et al. [26] determined the QP by assessing both the saliency and the spatial activity. Several studies focus on the motion model in the spherical domain [23, 43]: Li et al. [23] proposed a spherical motion model, which derives the motion of a block in the 2D plane by projecting it to the sphere, and a rotational motion model is presented in [43], whereby the motion is described as a rotation on the sphere along geodesics. Viewport-adaptive encoding is proposed in [18], where several viewport-dependent projection schemes were studied and multi-resolution versions of the ERP and CMP formats were proposed.

2.4 Spherical objective quality metrics

To ensure compatibility with current video coding standards, a 360-degree image must be projected to a 2D image and compressed. The decoded 2D image is then projected back to the sphere, and a free-view image is generated by rectilinear projection. Since the pixels in the 2D image are not equally important, specific metrics are needed to evaluate the coding performance. The JVET supports several quality assessment metrics, including PSNR, WS-PSNR, S-PSNR-NN [48] and CPP-PSNR [49]. The architecture of the 360-degree video evaluation system specified by the JVET is shown in Fig. 2 [45]. The original ERP image is assumed to be 8K; it is down-converted to 4K and then converted to other formats, followed by encoding and decoding. The receiving end computes WS-PSNR and S-PSNR-NN between the decoded image and the uncompressed image. The PSNR computations can also be performed end-to-end.

Fig. 2 The testing procedure for 360-degree video [45]

3 Proposed saliency prediction model for 360-degree images

Omnidirectional images present a scene over a wider range than a conventional image. However, not all areas of an omnidirectional image receive intensive attention. Image features and the position in the spherical domain can be used to predict accurate saliency maps for 360-degree images.

3.1 Architecture of the proposed model

The proposed saliency prediction model projects the spherical surface into ERP images and multi-view (MV) images. The saliency for each type of image is then predicted and an initial saliency map is generated by fusing the saliency maps from the ERP images and the MV images. Figure 3 shows the overall architecture of the proposed saliency prediction model, which has four main steps:

(a) ERP-based saliency prediction
(b) MV-based saliency prediction
(c) The use of an equator bias
(d) Optimization in the spherical domain

Fig. 3 The architecture of the proposed model

Each of these steps is detailed in the following subsections.

3.2 ERP-based saliency prediction

In the ERP image, the borders on the two sides are connected in the actual scene. However, if saliency prediction uses a conventional predictor, a visual target that lies on the border is difficult to recognize. To preserve the saliency in the global context, more than one ERP image is generated by translating the original ERP image; the works in [36, 52] use 2 and 4 ERP images, respectively.

Similarly to [36, 52], this study uses 8 orientations along the equator with a longitudinal sampling interval of 45 degrees. As mentioned previously, the region near the equator in the ERP image has good geometry, but other regions suffer from geometrical distortion that increases with latitude. Therefore, conventional saliency predictors only yield accurate attention models for the middle portion. The proposed method predicts the saliency for the middle portion and the edge portions separately for each ERP image. The edge portion is the region that corresponds to the top and bottom faces of the CMP and the middle portion is the remaining region of the ERP, as illustrated in Fig. 4. Saliency maps for the middle portion and the two edge portions are generated directly by SAM-ResNet [9], which achieves good performance on the MIT300 benchmark [5] for conventional images. Since the saliency maps for the cube faces at different orientations with fixed latitude can be viewed as one map at various angles of rotation, the saliency need not be predicted for each orientation; only the saliency maps for the top and bottom faces of the original orientation are predicted. For the middle portion, a saliency map is generated for each orientation and these maps are fused into one map by taking the maximum value. The saliency maps of the middle portion and the edge portions are assembled after the top and bottom faces of the CMP are projected onto the ERP format. However, before integration, the saliency map for the edge portion is scaled so that the maximum values in the middle portion and at the edges are equal. These procedures are illustrated in Fig. 5.
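As an illustrative sketch of this fusion (not the authors' released implementation), the per-orientation middle-portion maps can be aligned, combined by a pixel-wise maximum, and merged with the re-scaled edge-portion saliency. The saliency predictor (SAM-ResNet in the text) and the CMP-to-ERP re-projection of the top/bottom faces are assumed to be available as black boxes, and the function and argument names are hypothetical.

```python
import numpy as np

def fuse_erp_saliency(middle_maps, shifts, edge_map_erp, edge_mask):
    """Fuse middle-portion saliency maps from several ERP orientations with the
    edge-portion saliency (top/bottom CMP faces already re-projected to ERP).

    middle_maps : list of HxW saliency maps, one per horizontally shifted ERP image
    shifts      : horizontal pixel shift applied to produce each orientation
    edge_map_erp: HxW map holding the re-projected top/bottom face saliency
    edge_mask   : boolean HxW mask marking the edge portion of the ERP image
    """
    H, W = middle_maps[0].shape
    fused_middle = np.zeros((H, W))
    for sal, s in zip(middle_maps, shifts):
        aligned = np.roll(sal, -s, axis=1)          # undo the longitudinal shift
        fused_middle = np.maximum(fused_middle, aligned)

    # Scale the edge saliency so its maximum matches the middle portion's maximum.
    scale = fused_middle.max() / (edge_map_erp.max() + 1e-12)
    return np.where(edge_mask, edge_map_erp * scale, fused_middle)
```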

Fig. 4 An ERP image is split into two edges and one middle portion

Fig. 5 Procedures for the ERP-based saliency prediction

3.3 Multiview-based saliency prediction

In addition to the ERP-based saliency map, a saliency map is derived using multi-view (MV) images around the sphere. These MV images are rendered by changing the viewport; one viewport corresponds to one image. Using the 360Lib software developed by the JVET, an arbitrary viewport can be generated by specifying the rotational angles about the three axes. Each viewport image is a conventional 2D image, so conventional saliency predictors can be used. Unlike CMP, which has six faces for one orientation, the number of MV images is not fixed, which allows more flexibility. This study uses 62 viewports: one viewport is rendered every 30 degrees along the equator, the procedure is repeated for the circles of latitude 30°, 60°, −30° and −60°, and one viewport is rendered for each of the north and south poles. The FOV of each viewport is 90°. The MV images representing the sphere are thus produced. Figure 6 illustrates the procedure used to generate the viewports along the equator.
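A short sketch of how the 62 viewport centers described above can be enumerated (yaw/pitch in degrees); the tuple ordering and function name are illustrative choices.

```python
def mv_viewport_centers():
    """Enumerate the 62 (yaw, pitch) viewport centres: every 30 degrees along the
    equator and along the circles of latitude +/-30 and +/-60 degrees, plus the
    two poles."""
    centers = [(lon, lat) for lat in (0, 30, 60, -30, -60) for lon in range(0, 360, 30)]
    centers += [(0, 90), (0, -90)]            # north and south poles
    return centers

assert len(mv_viewport_centers()) == 62
```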

Fig. 6 Multi-view projection for the 360-degree image

BMS [50] is used to predict the saliency of the MV images. Since these MV images overlap, an MV-based saliency map in the ERP format is obtained by combining all the MV saliency maps. A weight is assigned to each pixel in an MV saliency map, as shown in (1):

$$ w^{MV}(i,j) = \left(1 + \frac{d^2(i,j)}{r^2}\right)^{-3/2}, $$
(1)

where 2r is the width of the MV image and d(i,j) is the distance between the pixel (i,j) and the center pixel of the MV image. The idea is that pixels at the center of the MV image have the highest weight, while boundary pixels have the lowest weights. After projecting the MV saliency maps back to the ERP format, the resulting saliency is calculated as:

$$ S^{ERP2}(i,j) = \frac{\sum_{l=1}^{k} w^{MV}(i_l, j_l) \times S_l^{MV}(i_l, j_l)}{\sum_{l=1}^{k} w^{MV}(i_l, j_l)}, $$
(2)

where \( S_l^{MV} \) is the saliency map of the lth MV image, (il, jl) is the pixel location in the lth MV image whose corresponding location in the ERP image is (i, j), and k is the number of MV images that cover the current pixel (i, j).
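A minimal sketch of the weighted back-projection in (1) and (2) is given below. The re-projection from each MV image to the ERP grid (`mv_to_erp`) is assumed to be supplied by the rendering tool (e.g. 360Lib) and is a hypothetical callback, not an actual 360Lib API.

```python
import numpy as np

def accumulate_mv_saliency(erp_shape, mv_maps, mv_to_erp):
    """Combine MV saliency maps into an ERP-format map using the centre-biased
    weight of Eq. (1) and the weighted average of Eq. (2)."""
    H, W = erp_shape
    num = np.zeros((H, W))
    den = np.zeros((H, W))
    for l, sal in enumerate(mv_maps):
        size = sal.shape[0]                                 # MV image width = 2r
        r = size / 2.0
        for j_mv in range(size):
            for i_mv in range(size):
                d2 = (i_mv - (size - 1) / 2.0) ** 2 + (j_mv - (size - 1) / 2.0) ** 2
                w = (1.0 + d2 / r ** 2) ** (-1.5)           # Eq. (1)
                i_erp, j_erp = mv_to_erp(l, i_mv, j_mv)     # re-projection to ERP
                num[j_erp, i_erp] += w * sal[j_mv, i_mv]
                den[j_erp, i_erp] += w
    return num / np.maximum(den, 1e-12)                     # Eq. (2)
```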

When the ERP-based saliency \( S^{ERP1} \) and the MV-based saliency \( S^{ERP2} \) are obtained, they are combined into an initial saliency map \( S_{ini} \). Before they are combined by averaging, the two maps are scaled so that their maximum values are the same.

3.4 The use of an equator bias

Since the regions near the equator are statistically the most attractive regions for VR navigation [34], an equator bias is used when predicting saliency for 360-degree images. A dataset [32] is used to extract a global latitude-wise subjective attention map, and this latitude-driven characterization is used to refine the saliency map generated in the previous steps. The equator-bias-guided saliency at latitude i is calculated as:

$$ S_{EB}(i) = \frac{1}{m \times n} \sum_{t=1}^{n} \sum_{j=1}^{m} S_t(i,j), $$
(3)

where St(i, j) denotes the subjective saliency value for image t at location (i, j), and n and m respectively denote the number of images and the width of the image. The equator bias map \( S_{EB} \) and the initial saliency map \( S_{ini} \) generated in the previous steps are then fused by a weighted average:

$$ S_E = w \times S_{ini} + (1 - w) \times S_{EB}, $$
(4)

where w is empirically set to 0.7, which balances the contributions of the scene-dependent characteristics and the equator bias.
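The latitude-wise bias in (3) and the fusion in (4) reduce to a few lines; the sketch below assumes the subjective maps from the training dataset are stacked into a single array.

```python
import numpy as np

def equator_bias(subjective_maps):
    """Latitude-wise average of subjective saliency maps, Eq. (3).
    subjective_maps has shape (n, H, W): n maps from the training dataset."""
    profile = subjective_maps.mean(axis=(0, 2))      # one value per latitude row
    _, _, W = subjective_maps.shape
    return np.tile(profile[:, None], (1, W))         # broadcast to an HxW map

def fuse_with_equator_bias(s_ini, s_eb, w=0.7):
    """Weighted fusion of the initial saliency map and the equator bias, Eq. (4)."""
    return w * s_ini + (1.0 - w) * s_eb
```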

3.5 Optimization in the spherical domain

The last step smooths the saliency map to remove noise while preserving edges. An optimization approach is used [27], with the objective cost function:

$$ J(S_F) = \sum_p \left( \left(S_F^p - S_T^p\right)^2 + \lambda \sum_{q \in N(p)} \left(S_F^p - S_F^q\right)^2 \right), $$
(5)

where SF is the smoothed saliency map, p and q denote pixels on SF, N(p) is the set of the four nearest neighbors of pixel p, and ST is a masked version of SE: the values of some pixels of SE are retained in ST, while the remainder are set to 0. The mask is generated by a uniform sampling of the spherical surface using a spiral-based method [6], and in ST only the pixels that correspond to the uniformly sampled points on the sphere are preserved. This is done because neighboring pixels in the ERP format do not have a fixed distance in the spherical domain and not all pixels in the ERP domain are equally important. Similarly to the S-PSNR metric, which computes the PSNR for selected pixels that are uniformly distributed on the sphere surface, the uniformly sampled points on the sphere are projected back to the ERP image to form the mask. These pixels become seeds and the image is smoothed. The number of points sampled on the spherical surface is directly proportional to the size of the ERP image.
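A minimal iterative sketch of this smoothing step is given below. It assumes the data term of (5) acts only at the seed pixels retained in ST (elsewhere the update is driven purely by the smoothness term); this assumption and the Jacobi-style update are illustrative and are not the exact solver of [27].

```python
import numpy as np

def smooth_on_sphere_seeds(s_e, seed_mask, lam=0.5, num_iters=200):
    """Iteratively minimise the quadratic cost of Eq. (5).

    s_e       : saliency map S_E in ERP format
    seed_mask : boolean map, True where a uniformly sampled sphere point lands
    lam       : smoothing strength (lambda in Eq. (5))
    """
    s_t = np.where(seed_mask, s_e, 0.0)
    data_w = seed_mask.astype(np.float64)
    s_f = s_e.astype(np.float64).copy()
    for _ in range(num_iters):
        padded = np.pad(s_f, 1, mode="edge")
        nsum = (padded[:-2, 1:-1] + padded[2:, 1:-1] +     # four nearest neighbours
                padded[1:-1, :-2] + padded[1:-1, 2:])
        # Per-pixel closed-form minimiser of data term + smoothness term
        # (each neighbouring pair is counted twice in Eq. (5), hence the factor 2).
        s_f = (data_w * s_t + 2.0 * lam * nsum) / (data_w + 2.0 * lam * 4.0)
    return s_f
```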

4 Proposed saliency-driven 360-degree image coding

Since the ERP format is widely used, it is used as the input for the proposed scheme. As mentioned in the previous section, the geometrical distortion in the ERP image is greater at higher latitudes. To address this problem and to ensure efficient encoding, some works [21, 46, 47] divide the ERP image into several tiles and reduce the resources allocated to the regions at high latitude, either by squeezing the width of the tile or by assigning a larger QP. However, the coding efficiency decreases as the number of tiles increases. Moreover, when the width of a tile is squeezed, prediction across tiles becomes less efficient, so the coding efficiency is reduced. The adaptive-QP-based methods use different QPs across tiles, so the playback of a free view is unsatisfactory if the selected viewport covers several tiles that are reconstructed with different quality.

This study proposes a saliency-driven coding technique for a 360-degree image. The saliency map and the weight map that is used to calculate WS-PSNR are combined into a final weight map, and the distortion term in the RDO is modified using this final weight map. This ensures that regions with a high weight are encoded with smaller QPs, so high-quality viewports are rendered after reconstruction. The computation of the weight for WS-PSNR is detailed in the next subsection.

4.1 Weighted-to-spherically-uniform PSNR (WS-PSNR)

In addition to the saliency map, the weight used for the quality metric WS-PSNR is also used to derive the final weight map for the proposed coding technique. WS-PSNR considers the position on the spherical surface when computing the PSNR. A stretching ratio is defined as the area of a point (x, y) on the projection plane over the area of the corresponding latitude-longitude location (θ, φ) on the spherical surface. The stretching ratio for a point (x, y) in the continuous domain of the ERP format is:

$$ SR(x,y) = \cos(y), $$
(6)

where x and y range over (−π, π) and (−π/2, π/2), respectively. Since the ERP image is in a digital format, the SR expression in the continuous domain must be discretized. The SR of the pixel (i, j) in the ERP image is calculated as:

$$ SR(i,j) = \frac{\iint SR\left(x(i,j), y(i,j)\right)\, dx\, dy}{\iint dx\, dy}, $$
(7)

where (x(i,j), y(i,j)) denotes the sampling location on the continuous x-y plane for the discrete point (i,j). The weight is simplified and expressed as the SR at the center of the pixel. For the ERP, the weight is calculated as:

$$ w_s(i,j) = \cos\frac{\left(j + 0.5 - \frac{H}{2}\right)\pi}{H}, $$
(8)

where H is the height of the ERP image.
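The weight in (8) and the resulting WS-PSNR can be computed as in the sketch below (a luma-only map and an 8-bit peak value are assumed).

```python
import numpy as np

def ws_weight(height, width):
    """Latitude-dependent ERP weight of Eq. (8)."""
    j = np.arange(height)
    w_row = np.cos((j + 0.5 - height / 2.0) * np.pi / height)
    return np.tile(w_row[:, None], (1, width))

def ws_psnr(ref, rec, peak=255.0):
    """WS-PSNR between an original and a reconstructed ERP image."""
    w = ws_weight(*ref.shape)
    wmse = np.sum(w * (ref.astype(np.float64) - rec.astype(np.float64)) ** 2) / np.sum(w)
    return 10.0 * np.log10(peak ** 2 / wmse)
```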

4.2 Modified distortion for RDO

For image/video coding, rate-distortion optimization [37] is used to determine the best coding mode, in order to ensure a compromise between cost and performance. The RDO is generally expressed as:

$$ J=D+\lambda R, $$
(9)

where R is the bitrate required for the current block and D is the distortion, which is the sum of the squared differences between the original block and the reconstructed block. The Lagrange multiplier λ controls the balance between R and D and is modeled as a function of the QP.

A benefit of expressing a 360-degree image in the ERP format is that the ERP is a rectangular image that can be encoded by state-of-the-art coding standards. Although this is feasible, the performance is not optimal: the ERP image is a data format and is not designed to be displayed directly in VR applications. Therefore, the distortion term for the RDO must be modified to account for the specific characteristics of the ERP image.

This study proposes a saliency-driven RDO. The distortion is weighted in terms of the importance of the pixel and expressed as:

$$ D_{CTU} = \frac{W \times H}{\sum_{i=0}^{W-1} \sum_{j=0}^{H-1} w(i,j)} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} w(i,j) \times \left(I(i,j) - \hat{I}(i,j)\right)^2, $$
(10)

where W and H are the width and height of the ERP image, w(i, j) is the weight that represents the importance of the pixel (i, j), and \( I, \hat{I} \) denote the original and the reconstructed block, respectively. For HEVC, the coding unit is the CTU (coding tree unit) and the CTU size is N × N. The weight w is computed from the saliency value, denoted wc, and the weight in the WS-PSNR metric, denoted ws, as

$$ w = w_c \times w_s. $$
(11)

The wc and ws maps for the test image P4 in the dataset [32] are shown in Fig. 7. A high weight appears in the region around the equator for both wc(i, j) and ws(i, j).

Fig. 7 Illustration of wc and ws: (a) ERP image, (b) wc for (a), and (c) ws

The distortion term is modified during the RDO to reduce the distortion of the CTUs that are more important. In the normalization term in (10), the denominator sums the weight over the whole image, so the QP for each CTU is effectively computed by considering its relative importance within the ERP image. If the weight is uniformly distributed throughout the image, the distortion and the RDO are unchanged. However, if some regions are more important, their distortion term grows when the QP is unchanged. With the new balance between the modified distortion and the rate, a CTU with a higher weight is assigned a smaller QP by the new RDO. For a CTU that is less important, usually near the polar area, the distortion term shrinks and the rate becomes dominant, so a larger QP is assigned. To enable this QP adjustment, an adaptive QP is used in the proposed scheme, whereby all QPs within the range of the initial QP ± ΔQP are examined and the one that gives the best R-D performance is selected. In [22], a spherical-domain RDO with a weighted distortion is used, as shown in (12):

$$ J = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} w_s(i,j)\left(x(i,j) - \hat{x}(i,j)\right)^2 + \lambda R. $$
(12)

Then, (12) is rewritten as (13) by considering a block-based operation, where wa is a block-based weight:

$$ J = D + \frac{\lambda}{w_a} R. $$
(13)

In this way, the distortion term is unchanged and only the Lagrange multiplier is modified. This study differs from [22] in three respects: 1) the modified distortion in (10) is defined differently from that in [51], where no normalization term is used; 2) there is no need to modify the Lagrange multiplier in this study, whereas it is modified in [22] by incorporating the weight into the distortion term; and 3) the QP is determined automatically during the RDO process in this study, whereas it is pre-computed from the weight in WS-PSNR in [22] and is independent of the input.
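The following sketch summarizes the proposed weighted distortion of (10)-(11) and the adaptive-QP search described above. The `encode_fn` and `lam_fn` hooks are hypothetical placeholders for the HM encoder's block coding routine and Lagrange-multiplier model, not real HM APIs.

```python
import numpy as np

def weighted_ctu_distortion(orig_ctu, rec_ctu, w_ctu, w_sum, img_w, img_h):
    """Weighted distortion of Eq. (10) for one CTU.

    w_ctu : per-pixel weight w = w_c * w_s for this CTU (Eq. (11))
    w_sum : sum of the weight map over the whole ERP image (normalisation term)
    """
    diff = orig_ctu.astype(np.float64) - rec_ctu.astype(np.float64)
    return (img_w * img_h / w_sum) * np.sum(w_ctu * diff ** 2)

def select_ctu_qp(orig_ctu, w_ctu, w_sum, img_w, img_h,
                  encode_fn, lam_fn, base_qp, delta_qp=3):
    """Adaptive-QP search: try every QP in [base_qp - delta_qp, base_qp + delta_qp],
    evaluate J = D + lambda * R with the weighted distortion, and keep the best.

    encode_fn(ctu, qp) -> (reconstructed_ctu, bits)  # hypothetical encoder hook
    lam_fn(qp) -> Lagrange multiplier                # hypothetical lambda model
    """
    best = None
    for qp in range(base_qp - delta_qp, base_qp + delta_qp + 1):
        rec, bits = encode_fn(orig_ctu, qp)
        d = weighted_ctu_distortion(orig_ctu, rec, w_ctu, w_sum, img_w, img_h)
        j = d + lam_fn(qp) * bits
        if best is None or j < best[0]:
            best = (j, qp, rec)
    return best[1], best[2]
```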

5 Experimental results

The performance of the proposed saliency model for 360-degree images is presented first. The saliency model is then used in the proposed coding scheme and the coding performance is assessed using several objective metrics.

5.1 Saliency prediction

Two datasets [33, 42] are used to evaluate the performance. The dataset in [42] is the test set for the Grand Challenge Salient 360! at ICME 2017, and the dataset in [33] is the training set for the Grand Challenge Salient 360! at ICME 2018. These datasets include the original 360-degree images and the head movement and head-eye movement data collected from subjective experiments; the head saliency and head-eye saliency serve as the ground truth. This study considers the head-eye movement. Four objective metrics that are common in the saliency community are used: Kullback-Leibler Divergence (KLD), Pearson's Correlation Coefficient (CC), Normalized Scanpath Saliency (NSS) and AUC-Judd [4]. The toolbox [13] is used to compute these scores.
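For reference, the standard definitions of CC and NSS used by such toolboxes can be sketched as follows (KLD and AUC-Judd are omitted for brevity); this is not the code of the toolbox [13].

```python
import numpy as np

def cc(pred, gt):
    """Pearson's correlation coefficient between a predicted and a ground-truth saliency map."""
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    g = (gt - gt.mean()) / (gt.std() + 1e-12)
    return float(np.mean(p * g))

def nss(pred, fixation_map):
    """Normalized Scanpath Saliency: mean of the normalised prediction at the
    binary ground-truth fixation locations."""
    p = (pred - pred.mean()) / (pred.std() + 1e-12)
    return float(p[fixation_map > 0].mean())
```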

The proposed scheme is first evaluated using the database [42], which contains 25 images. Table 1 shows the results, together with the results of several other works. All the models achieve similar AUC scores. The proposed technique outperforms the other schemes in terms of the NSS score and is comparable in terms of the KLD and CC metrics. Unlike a previous study by the authors [10], which uses ERP and CMP images as the input for the saliency predictor, this study replaces the CMP-based saliency prediction with the MV-based saliency prediction, and the overall quality is improved. The performance on the dataset [33] is detailed in Table 2. Few studies report scores for the training set of the Grand Challenge Salient 360! at ICME 2018, so only the results of two studies, [7, 10], are compared. Table 2 shows that the proposed scheme has a smaller KLD score and a higher NSS score than [7].

Table 1 The head-eye movement prediction using dataset [42]
Table 2 The head-eye movement prediction using dataset [33]

5.2 Saliency-based coding for 360-degree image

To verify the performance of the proposed coding technique, 12 images from the dataset of omnidirectional images in [33] are used. These images are divided into three groups of 4 images each. In terms of KLD between the predicted saliency and the ground truth, the first group has high performance, the second group medium performance and the third group low performance. The mean scores for each group are listed in Table 3 and differ significantly between the groups. These three groups are used to verify the coding performance for saliency maps of different quality predicted by the proposed technique.

Table 3 Saliency prediction scores for three groups of images

The proposed coding scheme is implemented in HM 16.17. The coding configuration is all-intra and the QPs are 22, 27, 32 and 37. When the ERP images are decoded, two groups of viewports are rendered using the tools in 360Lib (https://jvet.hhi.fraunhofer.de/svn/svn_360Lib/trunk). The first group consists of six viewports along the equator, one every 60 degrees, with a FOV of 75° in both the horizontal and vertical directions. The second group consists of viewports centered on specified locations determined by the ground-truth saliency map. Blocks of size 64 × 64 that represent locations with high visual attention are determined, where the visual attention of a block is calculated by summing the saliency values inside the block. The centers of the top-3 blocks then serve as the specified viewport locations, and the saliency-based viewports are rendered using rectilinear projection. Only the top-3 viewports are used because some 360-degree images do not have many attractive targets, as illustrated in Fig. 8, where the ground-truth saliency has only limited regions with high saliency.
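A sketch of how the top-3 saliency-based viewport centers can be selected from the ground-truth map, following the 64 x 64 block-sum rule described above (non-overlapping blocks are an assumption here).

```python
import numpy as np

def top3_viewport_centers(saliency, block=64):
    """Pick the centres of the three 64x64 blocks with the highest summed
    saliency; these centres define the saliency-based viewports."""
    H, W = saliency.shape
    scores = []
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            scores.append((saliency[y:y + block, x:x + block].sum(),
                           x + block // 2, y + block // 2))
    scores.sort(reverse=True)
    return [(cx, cy) for _, cx, cy in scores[:3]]
```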

Fig. 8 Two 360-degree images and their saliency maps (ground truth). The top is P57 and the bottom is P25 [33]

Figure 9 shows the rendered viewports for Image P26 [33]. The first row shows the original image and the saliency for the ground truth and for the proposed model. When the optimization in the spherical domain is performed, the predicted saliency is more accurate; for the streetlight, the erroneous saliency value is corrected. Figure 9 also illustrates three kinds of rendered viewports for Image P26: the equator-based viewports, the top-3 saliency-based viewports and the viewports at the poles. The top-3 saliency-based viewports are the viewports that attract the most attention, while the viewports at the poles show the sky and the ground, which are seldom required for free-view navigation.

Fig. 9 The rendered viewports for Image P26 [33]. The first row shows the 360-degree image and the saliency overlaid on the 360-degree image; from left to right, they are the ground truth and the predicted saliency without and with the optimization in the spherical domain. The second row shows the equator-based viewports. The third row shows the top-3 saliency-based viewports inside the green box and the viewports at the poles inside the red box

The BD-rate [3] for the proposed coding scheme is computed with respect to the HEVC anchor. Two kinds of PSNR are considered: the PSNR for the 6 viewports on the equator, denoted EQ-PSNR, and the PSNR for the top-3 saliency-based viewports, denoted SM-PSNR. The metrics over the whole ERP image, namely WS-PSNR and S-PSNR-NN, are also reported. WS-PSNR uses a weighted mean square error to compute the PSNR, while S-PSNR-NN measures the quality at a set of uniformly sampled positions on the sphere.
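For completeness, the Bjøntegaard delta rate [3] used in these comparisons can be computed as in the following sketch (cubic fit of log-rate versus PSNR, integrated over the overlapping PSNR range); this mirrors the standard definition rather than any particular reference script.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta bit-rate (%) of a test codec relative to an anchor [3]."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)           # cubic fit of log-rate vs. PSNR
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))       # overlapping PSNR interval
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)           # average log-rate difference
    return (np.exp(avg_diff) - 1.0) * 100.0
```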

In (11), two types of weight are used. To demonstrate the benefit over using a single weight, an experiment is conducted on the three images P5, P25 and P7 [33], where the weight is either ws alone or wc alone (the saliency predicted using the proposed method). P5, P25 and P7 respectively represent the groups with high, medium and low KLD performance, so their results indicate whether there is a difference between images with different KLD scores. Table 4 shows the results. The bitrate for EQ-PSNR and SM-PSNR is reduced when both weights are used, so the strategy defined by (11) increases the quality of the viewports that are highly salient.

Table 4 BD-rate (%) for different weight maps (image dataset [33])

The performance of the proposed coding technique is compared with the results for [22, 26]. In [26], saliency is used to derive the QP for each CTU, while RDO in the spherical domain is adopted in [22]. The saliency used for [26] is the saliency predicted by the proposed model, in order to compare how the saliency is used for 360-degree image coding. In [22], the distortion is modified by considering the weight in WS-PSNR; the Lagrange multiplier is then changed and such a change is equivalent to a QP adjustment. The studies [22, 26] modify the QP in their own ways to improve the coding performance for 360-degree images. In this study, the QP is adjusted automatically and is determined by the RDO when the definition of distortion is modified using the saliency map and the weight in WS-PSNR.

Table 5 summarizes the performance for the 12 images with respect to the HEVC anchor. These results show that the proposed technique achieves a significant bitrate reduction, especially for the equator-based and the top-3 saliency-based viewports: the average bitrate reduction for SM-PSNR and EQ-PSNR is 8.73% and 9.76%, respectively, and the maximum bitrate reduction for SM-PSNR is 14.33%. Compared to [26], which is also a saliency-driven coding scheme, Table 5 shows that the modified RDO determines a more appropriate QP. For the WS-PSNR and S-PSNR-NN metrics, the respective bitrate increments are only 0.95% and 0.99% for the proposed technique, against 1.66% and 1.74% for [26]. The bitrate increase for the proposed scheme is smaller than its bitrate reduction in SM-PSNR and EQ-PSNR. Compared to [22], which is also an RDO-based coding scheme, the proposed scheme has a greater bitrate reduction for EQ-PSNR and SM-PSNR, whereas [22] has a greater bitrate reduction in terms of WS-PSNR and S-PSNR-NN. From the performance of the three image groups, Table 5 shows that the quality of the saliency map has an obvious effect on the coding efficiency. In general, the proposed saliency model is sufficiently accurate in identifying the visual targets, so the proposed coding scheme performs well for most of the test images.

Table 5 BD-rate (%) for different schemes (dataset [33])

The proposed scheme achieves a bitrate reduction of 11.25% for Image P25. To further analyze the performance, the QP maps for the proposed scheme, the anchor and [22, 26] are shown in Fig. 10. The HEVC anchor uses an adaptive QP strategy to give better coding efficiency. The proposed method also uses an adaptive QP, whereas [22, 26] assign a QP to each CTU according to pre-determined calculations. The ΔQP used is 3.

Fig. 10 The QP distribution for Image P25, for an initial QP of 22: (a) the 360-degree image, (b) the predicted saliency map of (a), (c) the QP map for the anchor, and (d), (e), (f) the QP maps for [22, 26] and the proposed method

For [26], although the QP map is also correlated with the saliency value, not many CTUs are encoded with a smaller QP; instead, more CTUs are encoded with the maximum QP. No RDO is involved in [26], and the coding performance degrades when the compromise between distortion and rate during the RDO process is not considered. For [22], the QP is only related to the weight for WS-PSNR and is independent of the image content. The R-D curves for Image P25 are shown in Fig. 11. The proposed scheme has better performance in terms of EQ-PSNR and SM-PSNR.

Fig. 11 R-D performance for Image P25: (a) EQ-PSNR, (b) SM-PSNR, (c) WS-PSNR, and (d) S-PSNR-NN

6 Conclusion

This study proposes a saliency-based coding scheme that comprises two techniques: saliency prediction and saliency-driven RDO. For the saliency prediction, the saliency map of a 360-degree image is predicted using saliency predictors for conventional 2D images. ERP-based and MV-based saliency predictions are realized, and an optimization in the spherical domain is employed to improve the saliency. The experimental results show that the proposed technique accurately predicts the saliency and performs particularly well in terms of the NSS score; for the three other metrics, the proposed technique gives results that are comparable to the best reported results. For the saliency-driven RDO, the saliency map serves as a reference and the distortion term is modified during the RDO process to give a better visual experience in the regions of interest. Compared to the HEVC anchor, the experimental results show that the proposed technique gives a maximum bitrate reduction of 14.33% when the image quality in the regions with high visual attention is considered. For the WS-PSNR and S-PSNR-NN metrics, the performance is comparable to that of the anchor scheme. In particular, the S-PSNR-NN results show that the strategy of allocating more resources to the regions that attract the most visual attention does not significantly reduce the quality of the whole image. A comparison with the results of other studies shows that the proposed scheme gives much better results for the viewports that contain visual targets. These results confirm the effectiveness of the proposed scheme.