1 Introduction

Traditional 2D video is gradually being replaced by 3D video, an emerging visual medium. Research in 3D video technology is receiving increased attention with the growing demand for consumer products. 3D television (3DTV) and free-viewpoint television (FTV) are the two main applications of 3D video, along with sports, medicine, education, and so on. 3DTV provides a visually realistic scene, while FTV gives viewers the flexibility to change the viewing angle. Starting with stereoscopic display technology, which requires special glasses, 3D technology has now reached autostereoscopy, where viewers can enjoy the essence of the real scene without glasses over a wide viewing angle [16, 24]. Autostereoscopic displays require multiple views to be acquired, coded, and transmitted, which increases the complexity of the whole system. The multiview video plus depth (MVD) format is used to reduce the number of views, and virtual view synthesis [6, 7] is carried out at the decoder to render intermediate views. Compression, depth-map accuracy, and the rendering algorithm all introduce distortion in the virtual view.

The virtual view synthesis process uses a depth-image-based rendering (DIBR) algorithm. In addition to compression, another important factor that affects the quality of a virtual view is the bit allocation between texture video and depth map. Though the depth map is not displayed, it is central to the view synthesis process, as it contains the geometric data of every pixel in the frame. There is no exact approach to allocating bits between texture and depth map. As a first attempt, a fixed 5:1 ratio was used between texture video and depth map [7]. A computationally complex full-search algorithm was developed by Morvan and Farin [15] assuming that a real view exists at the synthesis position. Liu et al. [13] modeled view synthesis distortion without requiring a real view. Yuan et al. [30, 31] proposed a model-based optimal bit allocation strategy: they formulated bit allocation as a convex optimization problem and used the Lagrange multiplier method to find the optimal solution. A quadratic model between the texture quantization step and the depth-map quantization step is described in [32]. Shao et al. [21] reported a distortion model between bit-rate and view synthesis distortion in which the bit-rate ratio is computed through optimization; they performed rate control at the view, texture/depth, and frame levels. A fast bit allocation method without pre-encoding was proposed by Oh et al. [18]. Adaptive bit allocation was proposed by Yang et al. [28], in which the bit rate is adjusted between views and between texture and depth depending on variations in virtual view quality. All these methods use mean square error (MSE) to measure view synthesis distortion. However, the human visual system is highly adapted to extracting structural information, and the structural similarity (SSIM) index is a quality metric that better approximates perceived image quality. In this paper, we propose an SSIM-based joint bit allocation method for 3D video.

Distortion metrics such as the sum of squared errors (SSE) or the sum of absolute differences (SAD) used in rate-distortion optimization (RDO) do not correlate well with perceptual quality. For conventional 2D video, many SSIM-based RDO schemes have been proposed to improve perceptual quality [4, 5, 11, 12, 14, 19, 29]. In this paper, SSIM is used as the distortion metric in mode decision and motion estimation to improve the perceptual quality of the texture video and depth map, and thus of the synthesized view. We use the Lagrange multiplier derived in [29] and scale it by an additional empirical factor to enhance the SSIM of the virtual view. Further, for bit allocation, we experimentally derive a relation between view synthesis distortion and texture and depth distortion in terms of SSIM. We convert the SSIM-based distortion model to an MSE-based model using the relation given in [29], and apply Lagrangian optimization to obtain expressions for the quantization steps of the texture video and depth map. The proposed RDO and bit allocation algorithms are implemented using the H.264/AVC-based 3DV-ATM reference software as well as the HEVC-based HM reference software.

SSIM-based 3DV RDO is explained in Section 2, and the SSIM-based distortion model is derived in Section 3. In Section 4, the concept of joint bit allocation is described and expressions for the texture and depth-map quantization steps are derived. Experimental results are given in Section 5: 3DV-ATM encoder results are discussed in Section 5.1, HEVC encoder results in Section 5.2, and Section 5.3 compares the performance of the proposed algorithm on 3DV-ATM and HEVC. Section 6 concludes the paper.

2 SSIM-based RDO for 3D video

The process of RDO aims to achieve a trade-off between the bitrate required and the distortion in the reconstructed video for a given rate constraint \(R_c\). A classical approach, Lagrangian optimization, combines rate and distortion using a Lagrange multiplier λ to form the Lagrangian rate-distortion cost function J (1a). Optimum values of rate and distortion are obtained by minimizing the cost function J.

$$\begin{array}{@{}rcl@{}} \min {D} \\ {s.t. R \leq {R_{c}}} \\ J=D+{\lambda}R \end{array} $$
(1a)
$$\begin{array}{@{}rcl@{}} \frac{dJ}{dR}=\frac{dD}{dR}+\lambda = 0 \\ \frac{dD}{dR}=-\lambda \end{array} $$
(1b)

where \(\lambda = 0.85\times {{2}^{\frac {(QP-12)}{3}}}\) and QP is the quantization parameter.

The Lagrange multiplier λ plays a significant role in finding the optimum values of rate and distortion. This can be illustrated using the RD curve (Fig. 1), which plots distortion against rate. The slope of the RD curve at the operating point is −λ, so the value of λ defines the operating point, and as λ changes the operating point moves along the RD curve. At operating point B in Fig. 1, R and D are optimum, minimizing the value of J. RDO is applied for mode decision and motion estimation. First, motion estimation RDO is carried out as in (2a), where \(D_{ME}\) is the prediction error measured using SSE and \(R_{ME}\) is the number of bits required to represent the motion vectors. This is followed by mode decision RDO as in (2b), where \(D_{MD}\) is the distortion between the original and reconstructed block, measured using SSE, and \(R_{MD}\) is the estimated bitrate of the associated mode.

$$\begin{array}{@{}rcl@{}} {J}_{ME}={{D}_{ME}}+{\lambda}_{ME} {R}_{ME} \end{array} $$
(2a)
$$\begin{array}{@{}rcl@{}} {J}_{MD}={{D}_{MD}}+{\lambda}_{MD} {R}_{MD} \end{array} $$
(2b)
Fig. 1

Lagrange multiplier on RD curve
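As a concrete illustration of how (1a) and (2b) drive mode selection, the sketch below computes λ from QP and picks the candidate mode with the smallest Lagrangian cost; the candidate modes and their distortion/rate values are hypothetical, for illustration only.

```python
def lagrange_multiplier(qp):
    """Mode-decision Lagrange multiplier: lambda = 0.85 * 2^((QP-12)/3)."""
    return 0.85 * 2.0 ** ((qp - 12) / 3.0)

def best_mode(candidates, qp):
    """Pick the mode minimizing J = D + lambda*R (eq. 2b).

    candidates: list of (mode_name, distortion_sse, rate_bits) tuples.
    """
    lam = lagrange_multiplier(qp)
    return min(candidates, key=lambda m: m[1] + lam * m[2])

# Hypothetical candidate modes: (name, SSE distortion, rate in bits)
modes = [("SKIP", 900.0, 2), ("16x16", 400.0, 60), ("8x8", 250.0, 180)]
```

At a large QP (large λ) the rate term dominates and cheap modes such as SKIP win; at a small QP the distortion term dominates and finer partitions are chosen.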

The 3D video in texture-plus-depth format has to be coded efficiently to ensure good display quality, which is affected by view synthesis distortion. This is possible with a proper RDO technique in which the encoder chooses the mode with the least distortion for the available bits. As our objective is to improve the perceptual quality of the synthesized view, distortion metrics like SSE and SAD need to be replaced by a perceptually meaningful metric. Hence, SSIM-based mode decision and motion estimation RDO is employed for 3D video [9]. The SSIM index, which considers the similarities of local luminances, contrasts, and structures between two image blocks x and y, is defined as in (3) [26].

$$ SSIM(x,y)=\frac{(2{{\mu}_{x}}{{\mu}_{y}}+{{C}_{1}})(2{{\sigma} _{xy}}+{{C}_{2}})}{({\mu_{x}^{2}}+{\mu_{y}^{2}}+{{C}_{1}})({\sigma_{x}^{2}}+{\sigma_{y}^{2}}+{{C}_{2}})} $$
(3)

where \(\mu_x\) and \(\sigma_x\) are the mean and standard deviation of block x, \(\mu_y\) and \(\sigma_y\) are the mean and standard deviation of block y, and \(\sigma_{xy}\) is the cross-correlation between the two blocks. \(C_1\) and \(C_2\) are constants used to stabilize the SSIM value when the means and variances are close to zero. Since SSIM is a similarity measure, dSSIM is used as the distortion metric, given by

$$ dSSIM=\frac{1}{SSIM} $$
(4)
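For illustration, the block-wise SSIM (3) and dSSIM (4) can be sketched as follows. The constants \(C_1\), \(C_2\) use the common choices \((0.01L)^2\) and \((0.03L)^2\) for 8-bit data (L = 255), which are assumptions rather than values stated here.

```python
import numpy as np

L_RANGE = 255.0
C1 = (0.01 * L_RANGE) ** 2  # assumed constant, stabilizes the mean term
C2 = (0.03 * L_RANGE) ** 2  # assumed constant, stabilizes the variance term

def ssim_block(x, y):
    """SSIM between two equal-sized image blocks, per eq. (3)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))

def dssim_block(x, y):
    """dSSIM distortion metric, per eq. (4)."""
    return 1.0 / ssim_block(x, y)
```

Identical blocks give SSIM = 1 (dSSIM = 1); any distortion drives SSIM below 1 and dSSIM above 1.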

Yeo et al. [29] related dSSIM to MSE (5) and derived an expression for the Lagrange multiplier (6) that is used in the motion estimation and mode decision RDO of every macroblock (MB). This substitution avoids computing SSIM for each block. We incorporated this SSIM-based RDO method into 3D video coding and found experimentally that performance degrades when the operating point lies where the slope is nearly vertical (point A in Fig. 1) on the RD curve. To improve performance, an empirical scaling factor \(S_f\) is introduced (7). The resulting \(\lambda_{new}\) is used in motion estimation and mode decision for both texture and depth map to improve SSIM.

$$\begin{array}{@{}rcl@{}} dSSIM&\approx& 1+\frac{MSE}{2{\sigma_{x}^{2}}+{{C}_{2}}} \end{array} $$
(5)
$$\begin{array}{@{}rcl@{}} {{\lambda}_{i}}&=&\frac{2\sigma_{{{x}_{i}}}^{2}+{{C}_{2}}}{\exp \left( \frac{1}{M}\sum\limits_{j = 1}^{M}{\log (2\sigma_{{{x}_{j}}}^{2}+{{C}_{2}})}\right)}{{\lambda}_{SSE}} \end{array} $$
(6)
$$\begin{array}{@{}rcl@{}} {{\lambda}_{new}}&=&\frac{2\sigma_{{{x}_{i}}}^{2}+{{C}_{2}}}{{{S}_{f}}\left( \exp \left( \frac{1}{M}\sum\limits_{j = 1}^{M}{\log (2\sigma_{{{x}_{j}}}^{2}+{{C}_{2}})}\right) \right)}{{\lambda}_{SSE}} \end{array} $$
(7)
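The per-macroblock multiplier (7) can be sketched as below. The exp-log term is simply the geometric mean of \(2\sigma_x^2 + C_2\) over all M blocks of the frame, and \(S_f\) is left as an input because the paper selects it empirically.

```python
import numpy as np

def lambda_new(block_vars, lambda_sse, s_f=1.0, C2=58.5225):
    """Per-macroblock Lagrange multipliers per eq. (7).

    block_vars : variances sigma_x^2 of the M blocks in the frame
    lambda_sse : conventional SSE-based Lagrange multiplier
    s_f        : empirical scaling factor S_f (assumed input)
    C2         : SSIM stabilizing constant, assumed (0.03*255)^2
    """
    v = 2.0 * np.asarray(block_vars, dtype=np.float64) + C2
    # Geometric mean of (2*sigma_xj^2 + C2) over all M blocks
    geo_mean = np.exp(np.mean(np.log(v)))
    return v / (s_f * geo_mean) * lambda_sse
```

With \(S_f = 1\), the geometric mean of the per-block ratios \(\lambda_i/\lambda_{SSE}\) is exactly 1, so high-variance (textured) blocks get a larger λ and flat blocks a smaller one.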

3 SSIM-based distortion model

3D video uses the texture-plus-depth format to reduce coding complexity by reducing the number of input videos. This introduces an additional process, virtual view synthesis, to generate intermediate views at the decoding end. A scheme of virtual view synthesis is shown in Fig. 2.

Fig. 2

Decoding end of a 3DV system using the multiview video plus depth format with three input texture videos and depth maps

At the encoding side, three out of five texture videos (views 1, 3, and 5) and the corresponding depth maps are coded and transmitted. These views are decoded and used to generate intermediate views 2 and 4. In Fig. 2, texture and depth views 1 and 3 are used to generate view 2, and similarly views 3 and 5 are used to generate view 4. Virtual view synthesis of a frame from the sequence Balloons is shown in Fig. 3.

Fig. 3

Virtual view synthesis a Texture frame 0 of view 1, b Texture frame 0 of view 3, c Depth map frame 0 of view 1, d Depth map frame 0 of view 3, and e Synthesized frame 0 of view 2

3D warping and blending are the two stages of view synthesis. In 3D warping, a pixel in the reference view (existing view) is converted to a 3D world coordinate and then projected into the virtual view (generated view). A reference-view pixel \((u_r, v_r)\) is converted to the 3D world point \((x_w, y_w, z_w)\) and then to the target pixel \((u_v, v_v)\) as given in (8) and (9) respectively [23].

$$\begin{array}{@{}rcl@{}} {{\left[ {{x}_{w}},{{y}_{w}},{{z}_{w}} \right]}^{\text{T}}} &=&R_{3X3,r}^{-1}\left( {{Z}_{c,r}}A_{3X3,r}^{-1}{{\left[ {{u}_{r}},{{v}_{r}},1 \right]}^{\text{T}}}-{{t}_{3X1,r}} \right) \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} {{Z}_{c,v}}{{\left[ {{u}_{v}},{{v}_{v}},1 \right]}^{\text{T}}} &=&{{A}_{3X3,v}}\left( {{R}_{3X3,v}}{{\left[ {{x}_{w}},{{y}_{w}},{{z}_{w}} \right]}^{\text{T}}}+{{t}_{3X1,v}} \right) \end{array} $$
(9)

where R is the rotation matrix, t is the translation vector, A is the intrinsic matrix of the camera, and Z is the depth calculated from the depth map. Since the pixel mapping from the reference view to the virtual view is not one-to-one, holes are created in the left image (Fig. 4b) if the right image is taken as reference, and vice versa. These holes can be filled by a process called blending, given by (10).

$$ {{I}_{v}}(x,y)={{w}_{L}}{{I}_{L}}({{x}_{L}},{{y}_{L}})+{{w}_{R}}{{I}_{R}}({{x}_{R}},{{y}_{R}}) $$
(10)

where \({{w}_{L}}=\frac{{{I}_{R}}}{({{I}_{L}}+{{I}_{R}})}\), \({{w}_{R}}=\frac{{{I}_{L}}}{({{I}_{L}}+{{I}_{R}})}\), \(I_L\) is the baseline distance between the left reference and virtual views, and \(I_R\) is the baseline distance between the right reference and virtual views. Warping is effective only with an accurate depth map: inaccuracies in the original depth maps, or distortion introduced into the encoded depth maps by lossy compression, cause distortions in the virtual view. Texture distortion also directly affects the quality of the virtual view. Thus the total virtual view distortion consists of texture distortion and depth distortion. Let \(S_v\) be the virtual image synthesized from the original texture and depth video, \(\overline{S}_v\) the virtual image synthesized from the original texture and compressed depth map, and \(\widetilde{S}_v\) the virtual image synthesized from the compressed texture and original depth. The view synthesis distortion is then derived in [20] as in (11). The effect of texture and depth distortion is further analyzed in [31] as in (12).

$$\begin{array}{@{}rcl@{}} {{D}_{v}}&\approx& E[{{({{S}_{v}}-{{\overline{S}}_{v}})}^{2}}]+E[{{({{S}_{v}}-\widetilde{{{S}_{v}}})}^{2}}] \end{array} $$
(11)
$$\begin{array}{@{}rcl@{}} {{D}_{v}}&=&A{{D}_{t}}+B{{D}_{d}}+C \end{array} $$
(12)

where \(D_v\) is the view synthesis distortion, \(D_t\) is the texture distortion, and \(D_d\) is the depth distortion. A, B, and C are parameters that depend on the compression distortion.
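A minimal sketch of the warping (8)-(9) and blending (10) steps, assuming standard pinhole-camera conventions; the function names are illustrative, and the blending weights use `d_left`/`d_right` for the baseline distances written \(I_L\), \(I_R\) in the text.

```python
import numpy as np

def warp_pixel(u_r, v_r, z_c, A_r, R_r, t_r, A_v, R_v, t_v):
    """3D warping of a single reference pixel per eqs. (8)-(9):
    back-project (u_r, v_r) at camera depth z_c to a world point,
    then project the world point into the virtual camera."""
    p = np.array([u_r, v_r, 1.0])
    world = np.linalg.inv(R_r) @ (z_c * np.linalg.inv(A_r) @ p - t_r)  # eq. (8)
    cam = A_v @ (R_v @ world + t_v)                                    # eq. (9)
    return cam[:2] / cam[2]  # (u_v, v_v)

def blend(img_left, img_right, d_left, d_right):
    """Distance-weighted blending per eq. (10): the reference view
    closer to the virtual viewpoint receives the larger weight."""
    w_l = d_right / (d_left + d_right)
    w_r = d_left / (d_left + d_right)
    return w_l * img_left + w_r * img_right
```

When the virtual view coincides with the left reference (`d_left = 0`), the blend reduces to the left warped image, as expected from (10).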

Fig. 4

Virtual view generated from a right view, b left view

To improve perceptual quality, the earlier view synthesis distortion models (11) and (12) are modified into an SSIM-based distortion model. The model was determined experimentally by encoding 3D video sequences: texture videos and depth maps were encoded with quantization parameters (QPs) ranging from 20 to 44. The view synthesis distortion \(dSSIM_v\) is computed between the original virtual view (generated from the original texture and depth video) and the distorted virtual view (generated from the compressed texture and depth video). The texture distortion \(dSSIM_t\) is computed between the original and compressed texture video, and the depth-map distortion \(dSSIM_d\) between the original and compressed depth map. The relationship between the dSSIM of the synthesized view, the texture video, and the depth map is a planar model, as shown in Fig. 5, and can be defined by (13).

$$ dSSI{{M}_{v}}=a \cdot dSSI{{M}_{t}}+b \cdot dSSI{{M}_{d}}+ c $$
(13)

where a, b, and c are model parameters.
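Given pre-encoding measurements of \(dSSIM_t\), \(dSSIM_d\), and \(dSSIM_v\) at several QP pairs, the plane parameters a, b, c of (13) can be estimated by ordinary least squares, e.g.:

```python
import numpy as np

def fit_plane_model(dssim_t, dssim_d, dssim_v):
    """Least-squares fit of the planar model (13):
    dSSIM_v = a*dSSIM_t + b*dSSIM_d + c,
    from measurements taken at several texture/depth QP pairs."""
    A = np.column_stack([dssim_t, dssim_d, np.ones(len(dssim_t))])
    (a, b, c), *_ = np.linalg.lstsq(A, np.asarray(dssim_v), rcond=None)
    return a, b, c
```

At least three non-degenerate (non-collinear) measurement points are needed; in practice every encoded QP pair contributes one point, so the plane is heavily over-determined.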

Fig. 5

Relationship between dSSIM of virtual view, texture video and depth map

4 Joint bit allocation

Allocating bits effectively to obtain good-quality synthesized views is a challenging task. One approach is to find optimal QP pairs. Bit allocation thus seeks the optimal bitrates (under a total rate constraint \(R_c\)) for texture video and depth maps such that view synthesis distortion is minimized (14).

$$\begin{array}{@{}rcl@{}} &&~\min\limits_{({R_{t}}, {R_{d}})} {{D}_{v}} \\ &&{s.t. {R_{t}}+{R_{d}} \leq {R_{c}}} \end{array} $$
(14)

where \(D_v\) is the synthesis distortion (12), and \(R_t\) and \(R_d\) are the bitrates of the texture video and depth map respectively. To improve perceptual quality, the bit allocation problem is restated as

$$\begin{array}{@{}rcl@{}} &&\min\limits_{({R_{t}}, {R_{d}})} {{dSSIM}_{v}} \\ &&{s.t. {R_{t}}+{R_{d}} \leq {R_{c}}} \end{array} $$
(15)

where \(dSSIM_v\) is the synthesis distortion measured using SSIM as the distortion metric. Solving the bit allocation problem amounts to finding the optimum quantization steps for both texture and depth video. As SSIM and the quantization step cannot be related in closed form [3], the SSIM-MSE relation (5) is used to express \(dSSIM_v\) in terms of MSE:

$$\begin{array}{@{}rcl@{}} {{dSSIM}_{v}}&=&\frac{a}{2{{\sigma}^{2}}_{{{x}_{t}}}+{{C}_{2}}}{{D}_{t}}+\frac{b}{2{{\sigma}^{2}}_{{{x}_{d}}}+{{C}_{2}}}{{D}_{d}}+z \end{array} $$
(16)
$$\begin{array}{@{}rcl@{}} {{dSSIM}_{v}}&=&{{p}_{1}}{{D}_{t}}+{{p}_{2}}{{D}_{d}}+z \end{array} $$
(17)

where \({{p}_{1}}=\frac {a}{2{{\sigma }^{2}}_{{{x}_{t}}}+{{C}_{2}}}\) and \({{p}_{2}}=\frac {b}{2{{\sigma }^{2}}_{{{x}_{d}}}+{{C}_{2}}}\).

With (17), the distortions of both texture and depth video can be expressed as functions of the quantization step. To obtain the distortion-quantization (D-Q) relation, texture and depth maps were pre-encoded with different quantization parameters. The D-Q models of the texture video and depth map are given in (18a) and (18b) respectively. The rate-quantization (R-Q) relation is assumed to be linear in the inverse quantization step, as in H.264/AVC. The R-Q models for texture and depth map are given in (18c) and (18d) respectively, and were verified experimentally.

$$\begin{array}{@{}rcl@{}} {{D}_{t}}&={{\alpha}_{t}}{{Q}_{t}}+{{\beta}_{t}} \end{array} $$
(18a)
$$\begin{array}{@{}rcl@{}} {{D}_{d}}&={{\alpha}_{d}}{{Q}_{d}}+{{\beta}_{d}} \end{array} $$
(18b)
$$\begin{array}{@{}rcl@{}} {{R}_{t}}&={{a}_{t}}Q_{t}^{-1}+{{b}_{t}} \end{array} $$
(18c)
$$\begin{array}{@{}rcl@{}} {{R}_{d}}&={{a}_{d}}Q_{d}^{-1}+{{b}_{d}} \end{array} $$
(18d)

where \(a_t\), \(a_d\), \(b_t\), \(b_d\), \(\alpha_t\), \(\alpha_d\), \(\beta_t\), \(\beta_d\) are parameters estimated by pre-encoding the texture and depth sequences, and \(Q_t\) and \(Q_d\) are the quantization steps of the texture video and depth map respectively. The bit allocation problem is then framed as (19) and minimized to obtain the optimum values of \(Q_t\) and \(Q_d\):

$$\begin{array}{@{}rcl@{}} &&\qquad\min ({{p}_{1}}{{D}_{t}}+{{p}_{2}}{{D}_{d}}) \\ &&{s.t. ({{a}_{t}}Q_{t}^{-1}+{{b}_{t}}+{{a}_{d}}Q_{d}^{-1}+{{b}_{d}})\leq {R_{c}}} \end{array} $$
(19)
$$\begin{array}{@{}rcl@{}} {{Q}_{t}}&=&\frac{{{a}_{t}}+\sqrt{\frac{{{K}_{2}}{{a}_{t}}{{a}_{d}}}{{{K}_{1}}}}}{{{R}_{c}}-{{b}_{t}}-{{b}_{d}}} \end{array} $$
(20a)
$$\begin{array}{@{}rcl@{}} {{Q}_{d}}&=&\sqrt{\frac{{{K}_{1}}{{a}_{d}}}{{{K}_{2}}{{a}_{t}}}}{{Q}_{t}} \end{array} $$
(20b)

where \(K_1 = p_1\alpha_t\) and \(K_2 = p_2\alpha_d\).
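A sketch of the resulting closed-form allocation, obtained by applying the method of Lagrange multipliers to (19) with the models (18a)-(18d); the function name and the parameter values in the test are hypothetical, and in practice all parameters come from pre-encoding.

```python
import numpy as np

def allocate(a_t, b_t, a_d, b_d, alpha_t, alpha_d, p1, p2, R_c):
    """Closed-form texture/depth quantization steps minimizing (19)
    subject to the linear R-Q models (18c)-(18d)."""
    K1, K2 = p1 * alpha_t, p2 * alpha_d
    # Stationarity gives Q_t = sqrt(lam*a_t/K1) and Q_d = sqrt(lam*a_d/K2),
    # hence Q_d/Q_t = sqrt(K1*a_d/(K2*a_t)); the rate constraint, met with
    # equality at the optimum, then fixes Q_t.
    Q_t = (a_t + np.sqrt(K2 * a_t * a_d / K1)) / (R_c - b_t - b_d)
    Q_d = np.sqrt(K1 * a_d / (K2 * a_t)) * Q_t
    return Q_t, Q_d
```

A sanity check on the sign of the trade-off: if depth contributes little to synthesis distortion (\(K_2\) small), \(Q_d\) grows, i.e. the depth map is quantized more coarsely and bits shift to the texture video.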

5 Experimental results

In this section, we discuss the performance of the joint bit allocation algorithm proposed to improve the perceptual quality of 3D video. The 3D sequences Kendo, Balloons, and Breakdancer [8, 33], each of size 1024 × 768, are used. We evaluated the performance of SSIM-based RDO and bit allocation using the H.264/AVC-based 3DV-ATM encoder and the HEVC-based HM encoder.

5.1 3DV-ATM results

The 3DV-ATM software used for encoding the 3D sequences is based on the H.264/AVC encoder and achieves high coding efficiency through RDO. Each MB is divided into sub-blocks of different sizes in both intra and inter mode prediction. The block partition sizes are 4 × 4 and 16 × 16 for intra prediction, and 16 × 16, 16 × 8, 8 × 16, and 8 × 8 for inter mode; an 8 × 8 block can be further subdivided into 8 × 4, 4 × 8, and 4 × 4. Every mode is coded and reconstructed, and the best mode is selected considering two factors: (i) the number of bits used for coding (rate) and (ii) the quality of the reconstructed block.

In this section, we discuss the performance of the proposed SSIM-based RDO and joint bit allocation algorithms for 3D video. For encoding, we used Nokia's 3DV-ATM v5.1r2 reference software [1]. Virtual view synthesis is done using View Synthesis Reference Software 3.0 [25]. For both SSIM-based RDO and joint bit allocation, the encoder parameters used are given in Table 1.

Table 1 Encoder parameter setting

5.1.1 SSIM-based RDO in 3DV-ATM

The performance of SSIM-based RDO is compared with the SSE-based RDO of the original 3DV-ATM encoder. In both methods, SSIM is calculated between the virtual views generated from the original views and those generated from the reconstructed views. RD curves are plotted with rate along the horizontal axis and SSIM along the vertical axis. Figure 6a shows the RD curves for the Kendo sequence, where SSE- and SSIM-based RDO have almost the same performance. Since depth maps are not displayed, we also performed SSE-based RDO on the depth map and SSIM-based RDO on the texture video to improve the perceptual quality of the synthesized view (Fig. 6b). For the Balloons (Fig. 7a) and Breakdancer (Fig. 7b) sequences, SSIM-based RDO shows significant improvement. The performance of the proposed method is also evaluated using BD-rate figures [2], as shown in Table 2; these give the average bit-rate saving along with the average gain in SSIM.
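As a sketch, the BD-rate computation [2] fits log-rate as a cubic in the quality measure for each RD curve and integrates the gap between the fits over their overlapping quality range; here SSIM replaces PSNR on the quality axis, which is an adaptation of the classic metric rather than the exact procedure used in the paper.

```python
import numpy as np

def bd_rate(rates_a, qual_a, rates_b, qual_b):
    """Bjontegaard-delta average bit-rate difference (%) of curve B
    relative to curve A, with SSIM (or any quality score) as the
    quality axis. Negative values mean B saves rate at equal quality."""
    pa = np.polyfit(qual_a, np.log10(rates_a), 3)
    pb = np.polyfit(qual_b, np.log10(rates_b), 3)
    lo = max(min(qual_a), min(qual_b))   # overlapping quality interval
    hi = min(max(qual_a), max(qual_b))
    ia, ib = np.polyint(pa), np.polyint(pb)
    avg_diff = ((np.polyval(ib, hi) - np.polyval(ib, lo))
                - (np.polyval(ia, hi) - np.polyval(ia, lo))) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0
```

For example, a curve whose rate is uniformly 10% higher at every quality point yields a BD-rate of about +10%.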

Fig. 6

SSIM vs. rate for the Kendo sequence with view 2 synthesized from views 1 and 3, when SSIM is used as the distortion metric a for both texture and depth, b only for texture

Fig. 7

SSIM vs. rate for a the Balloons sequence and b the Breakdancer sequence. In both cases view 2 is synthesized from views 1 and 3 with SSIM as the distortion metric

Table 2 Comparison of SSIM-based RDO with SSE-based RDO

5.1.2 SSIM-based joint bit allocation in 3DV-ATM

The SSIM-based joint bit allocation algorithm is evaluated and compared with model-based joint bit allocation. The model-based joint bit allocation proposed in [31] proceeds in two steps. First, the texture and depth sequences are encoded and the model parameters are determined; from these, the quantization steps of the texture video and depth map are calculated. In the second step, the sequences are re-encoded using the new optimized quantization steps. Here, bit allocation is done at the sequence level.

SSIM-based joint bit allocation also requires pre-encoding and computation of the model parameters. Quantization parameters are calculated for each MB: in calculating the QP of a texture MB, the variance of the corresponding depth MB is computed, and vice versa. For both model-based and SSIM-based bit allocation, we evaluated the SSIM of the virtual views and plotted the RD curves shown in Fig. 8. BD-rate is calculated between the RD curves to obtain the average gain in SSIM (ΔSSIM) and the bit-rate reduction (ΔRate), as shown in Table 3.

Fig. 8

RD curve of bit allocation algorithm for a Kendo, b Breakdancer

Table 3 Improved SSIM and Rate reduction of SSIM-based bit allocation

Using SSIM as the distortion metric for 3D video, we performed RDO followed by bit allocation. Objective evaluation shows improved perceptual quality at a reduced rate. For better visual appearance, structural information must be preserved. To illustrate this, the 50th and 60th frames of the Breakdancer sequence, decoded after SSIM-based and model-based bit allocation, are shown in Figs. 9 and 10. In the 50th frame, edges are preserved and structure is retained (particularly the man's fingers) with SSIM-based bit allocation (Fig. 9a) as compared to model-based bit allocation (Fig. 9b). A similar comparison is shown in Fig. 10.

Fig. 9

50th frame from decoded sequence using a SSIM-based bit allocation, b Model-based bit allocation

Fig. 10

60th frame from decoded sequence using a SSIM-based bit allocation, b Model-based bit allocation

5.1.3 Subjective evaluation

Since the aim of the proposed algorithm is to improve visual quality, we conducted a subjective evaluation using the Kendo and Breakdancer sequences encoded with SSIM-based and model-based bit allocation. Ten viewers evaluated the quality of the video in each case, and each pair of videos was shown five times. Participants were asked to rate the videos on a scale of one to five: 5—Excellent, 4—Good, 3—Fair, 2—Poor, 1—Very poor. The mean opinion score (MOS) and standard deviation (SD) obtained are shown in Table 4. For both sequences, SSIM-based bit allocation obtained a higher MOS and a lower standard deviation.

Table 4 Mean opinion score and standard deviation

5.1.4 Temporal SSIM

Objective quality metrics commonly use only the spatial information within a frame, but video quality depends on both spatial and temporal information. It is therefore necessary to use a quality metric that measures both spatial and temporal distortion. As we aim to improve perceptual quality in 3D video, we measured spatio-temporal SSIM for quality assessment of the synthesized videos.

Wang et al. [27] proposed a quality metric that captures spatio-temporal structural information in a video, and we used it to evaluate both the spatial and temporal quality of the synthesized views. The gradient is computed using a Sobel kernel along the three planes x-y, x-t, and y-t, and the spatio-temporal gradient magnitude is determined. The gradient magnitude is compared against a threshold to decide whether a pixel is salient, and SSIM (21) is computed only on the patches centered on salient pixels. We computed the temporal SSIM of the views synthesized with SSIM-based and model-based bit allocation for the Kendo and Breakdancer sequences; SSIM-based bit allocation yields better temporal SSIM, as shown in Fig. 11.

$$ SSI{{M}_{xyt}}=(SSI{{M}_{xy}}+SSI{{M}_{xt}}+SSI{{M}_{yt}})/3 $$
(21)
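Equation (21) can be sketched as below; note that this simplified version computes a single global SSIM per plane slice of the video volume and omits the Sobel-based saliency masking described above, so it illustrates only the three-plane averaging.

```python
import numpy as np

C1, C2 = 6.5025, 58.5225  # assumed constants: (0.01*255)^2, (0.03*255)^2

def ssim2d(x, y):
    """Global (single-window) SSIM between two 2-D arrays."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cxy + C2)) / \
           ((mx * mx + my * my + C1) * (vx + vy + C2))

def ssim_xyt(ref, dist):
    """Spatio-temporal SSIM per eq. (21): average of SSIM computed on
    the x-y, x-t, and y-t planes of (T, H, W) video volumes."""
    s_xy = np.mean([ssim2d(ref[t], dist[t]) for t in range(ref.shape[0])])
    s_xt = np.mean([ssim2d(ref[:, r], dist[:, r]) for r in range(ref.shape[1])])
    s_yt = np.mean([ssim2d(ref[:, :, c], dist[:, :, c]) for c in range(ref.shape[2])])
    return (s_xy + s_xt + s_yt) / 3.0
```

The x-t and y-t slices expose temporal flicker that per-frame (x-y) SSIM alone cannot see.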
Fig. 11

Comparison of temporal SSIM in a Kendo b Breakdancer

5.2 HEVC results

High Efficiency Video Coding (HEVC) is a recently developed video coding standard that succeeds the existing H.264/AVC standard. Advances in acquisition hardware have improved video quality, which in turn demands more efficient coding algorithms; for example, HD and other advanced video formats need to be compressed with higher coding efficiency. Another design goal of the HEVC standard is to exploit parallel processing architectures [22]. HEVC has been extended to encode 3D video, where either stereo or multiview video plus depth can be used as the input format [17].

HEVC uses a quadtree coding approach in which each picture is divided into coding tree units (CTUs); a CTU is the analogue of a macroblock in previous coding standards. Each CTU is divided into coding units (CUs). HEVC uses three coding modes: intra, inter, and merge. A CU can have a variable block size ranging from 8 × 8 to 64 × 64, and the coding mode is selected at the CU level. A CU is further divided into prediction units (PUs), and all PUs of a CU are coded in the same coding mode.

Intra prediction has 35 modes: one planar mode, one DC mode, and 33 angular modes, with partition types 2Nx2N and NxN. Inter prediction supports symmetric and asymmetric block partitions: 2Nx2N, NxN, Nx2N, 2NxN, 2NxnU, 2NxnD, nLx2N, and nRx2N. For a CTU, the RD cost is computed for all modes and the mode with the minimum cost is selected.

In HEVC, the distortion metrics used for the prediction cost are SAD or SATD, while SSE is used as the distortion metric for mode decision. The RD cost functions used for computing the prediction cost are given in (22a) and (22b).

$$\begin{array}{@{}rcl@{}} {{J}_{pred}}&=&SAD+{{\lambda}_{pred}}\cdot {{B}_{pred}} \end{array} $$
(22a)
$$\begin{array}{@{}rcl@{}} {{J}_{pred}}&=&SATD+{{\lambda}_{pred}}\cdot {{B}_{pred}} \end{array} $$
(22b)

where \(B_{pred}\) is the bit cost required for encoding the block. As with 3DV-ATM, we aim to replace the traditional distortion metrics by SSIM to improve the perceptual quality of 3D video encoded using HEVC. Qi et al. [19] used the SSIM-MSE relation and the new Lagrange multiplier derived in [29] to improve the perceptual quality of 2D video coded with HEVC. We extended the same methodology, including the additional empirical scaling factor in the modified Lagrange multiplier as in (7), which improved the perceptual quality of 3D video. Optimized bit allocation between texture and depth map yields synthesized views of better quality: the objective is to minimize synthesis distortion while improving visual quality at the available rate, which comprises the texture rate and the depth rate. This requires a distortion model in terms of a perceptual metric, so we assumed the linear dSSIM distortion model explained in Section 3 and used the optimum quantization parameters derived in Section 4.

5.2.1 SSIM-based RDO and bit allocation in HEVC

SSIM-based RDO is implemented in the HM-16.2 HEVC reference software [10]. Experiments were conducted on the Kendo and Balloons sequences, with view synthesis performed using the rendering algorithm available in the HM reference software. The proposed SSIM-based RDO is compared with the SSE-based original HEVC in Fig. 12: SSIM-based RDO in HEVC performs better than the original HEVC, and the BD-rate comparison in Table 5 shows improved SSIM at a reduced rate.

Fig. 12

SSIM-based RDO a Kendo and b Balloons

Table 5 Comparison of SSIM-based RDO with SSE-based RDO in HEVC

The proposed SSIM-based bit allocation method also improves the visual quality of the synthesized views in HEVC. We pre-encoded the video sequences, calculated the parameters required to compute the quantization parameters, and then encoded the sequences with the optimum quantization parameters. Results of the proposed algorithm are compared with model-based bit allocation in Fig. 13, with the BD-rate comparison in Table 6.

Fig. 13

Bit allocation in HEVC a Kendo b Balloons

Table 6 Comparison of SSIM-based bit allocation with model-based bit allocation

We computed the temporal SSIM for the Kendo and Balloons sequences to verify the spatio-temporal quality of SSIM-based bit allocation. The results in Fig. 14 indicate improved perceptual quality with SSIM-based bit allocation.

Fig. 14

Comparison of temporal SSIM a Kendo b Balloons

5.3 Performance comparison of proposed algorithms on 3DV-ATM and HEVC encoders

The proposed SSIM-based RDO and bit allocation are implemented in two different encoders: the H.264/AVC-based 3DV-ATM encoder and the HEVC encoder. Figures 15 and 16 compare the performance of the proposed algorithms on these encoders. As HEVC is designed for higher coding efficiency, it achieves better SSIM at lower rates than 3DV-ATM for both SSIM-based RDO and SSIM-based bit allocation.

Fig. 15

Bit allocation a Kendo b Balloons

Fig. 16

Bit allocation a Kendo b Balloons

6 Conclusion

A bit allocation algorithm for 3D video should ensure an improvement in the quality of the synthesized view. In this paper, we proposed an SSIM-based bit allocation algorithm to enhance the perceptual quality of 3D video using 3DV-ATM as well as HEVC. First, SSIM is used as the distortion metric in the mode decision and motion estimation of both texture and depth video, where the Lagrange multiplier is adjusted to improve SSIM and thus visual quality. For SSIM-based joint bit allocation, we experimentally derived a view synthesis distortion model using SSIM as the distortion metric and converted it into an MSE-based model, which is then used to find the optimal quantization parameters. Both objective and subjective evaluations show improved perceptual quality with the SSIM-based method.

The SSIM-based RDO stage could be made more effective by automating the selection of the empirical scaling factor. In the bit allocation algorithm, we assumed that synthesis distortion depends linearly on texture and depth distortion; however, the influence of depth distortion on synthesis distortion varies with the texture detail. The linear depth-distortion term could be replaced by a nonlinear model for more accurate results.