1 Introduction

3D video is a motion picture format that conveys real depth perception and has therefore gained wide popularity in many application fields. Depth perception requires two or more views. With two views, a stereoscopic display produces 3D vision, but the viewer needs to wear special glasses. An autostereoscopic display provides 3D perception without special glasses, but it requires multiview acquisition, coding, and transmission. To avoid coding every view, representation formats are used in which only a few views, along with their depth maps, are coded and transmitted. Virtual view synthesis [2] is then used to render the intermediate views. Because these synthesized views are distorted, they affect the quality of the final display.

Fig. 1 Block diagram of 3D video system

Normally, rate distortion optimization (RDO) computes the rate distortion (RD) cost for a macroblock (MB). When distortion is measured with a perceptual metric such as the structural similarity (SSIM) index, the quality of the reconstructed block agrees better with human vision. In [5], the authors extended the perceptual RDO concept to 3D video by assuming a linear relationship between depth distortion and synthesis distortion. However, depth map distortion is nonlinearly related to view synthesis distortion: for the same depth distortion, the synthesis distortion varies with the amount of detail in the texture video. In this paper, a suitable Lagrange multiplier is determined for RDO using nonlinear depth distortion, and the resulting RDO framework is used in the SSIM-based bit allocation algorithm [6].

Section 2 gives a brief explanation of view synthesis distortion and its relation to texture and depth distortion. In Sect. 3, the SSIM index is discussed briefly. In Sect. 4, we discuss the RDO process with linear and nonlinear depth distortion. The bit allocation criterion for 3D video is explained in Sect. 5. Performance evaluation of the proposed method is discussed in Sect. 6, and concluding remarks are given in Sect. 7.

2 View Synthesis Distortion

The 3D video system shown in Fig. 1 takes multiple videos as input and uses the multiview plus depth (MVD) representation format, which is more economical than other representation formats. The view synthesis process generates a new intermediate view (target view) from the available texture views (reference views) and depth maps. It consists of two steps: warping and blending [10]. Warping maps a pixel in the reference view to a 3D point and then projects that point into the target view. This pixel mapping is not one-to-one, and thus holes are created. These holes are filled by the blending process. Because warping relies on depth data, the accuracy of the mapping depends on the accuracy of the depth data. Since the depth map is encoded with lossy compression, the coding errors affect the warping process and cause distortion in the synthesized view.
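The warping step can be illustrated with a minimal sketch for a single image row. It assumes rectified, horizontally aligned cameras, so that warping reduces to a horizontal shift. The function names, the shift direction, and the 8-bit depth-to-disparity mapping are illustrative assumptions, not the procedure of [10] or of any particular synthesis software.

```python
def depth_to_disparity(v, f, L, z_near, z_far):
    """Map an 8-bit depth value v (0..255) to a horizontal disparity in pixels.

    Assumes the common convention that v = 255 is the nearest plane (z_near)
    and v = 0 is the farthest plane (z_far).
    """
    inv_z = (v / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return f * L * inv_z


def warp_row(texture, depth, f, L, z_near, z_far):
    """Forward-warp one row of pixels to the target view.

    Positions that receive no pixel stay None: these are the holes that the
    blending step has to fill. When two pixels land on the same position,
    the one with the larger depth value (closer to the camera) is kept.
    """
    width = len(texture)
    target = [None] * width
    target_depth = [-1] * width
    for x in range(width):
        shift = round(depth_to_disparity(depth[x], f, L, z_near, z_far))
        xt = x - shift  # assumed shift direction (target camera to the right)
        if 0 <= xt < width and depth[x] > target_depth[xt]:
            target[xt] = texture[x]
            target_depth[xt] = depth[x]
    return target


# Hypothetical row: three background pixels followed by three foreground pixels.
row = warp_row(
    texture=[10, 20, 30, 40, 50, 60],
    depth=[0, 0, 0, 255, 255, 255],
    f=2.0, L=1.0, z_near=1.0, z_far=2.0,
)
print(row)
```

In this toy example the foreground pixels shift further than the background pixels, so one background pixel is occluded and the last two positions of the row remain holes.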

Distortion in synthesized views is mainly due to texture distortion and depth inaccuracy. The distortion models derived in [14] and [8] assume that texture and depth distortion (\({D}_{t}\) and \({D}_{d}\)) are linearly related to synthesis distortion (\({D}_{v}\)) as in Eq. 1.

$$\begin{aligned} {{D}_{v}}=A{{D}_{t}}+B{{D}_{d}}+C \end{aligned}$$
(1)

During the synthesis process, geometric errors are minimal if the depth data is accurate. Inaccurate depth data leads to a change in pixel position, as shown in Fig. 2. This position error is linearly proportional to the depth error, as given in Eq. 2 [7].

Fig. 2 Geometric error due to inaccurate depth data

$$\begin{aligned} \varDelta P=\frac{f\cdot L}{255}\left( \frac{1}{{{Z}_{near}}}-\frac{1}{{{Z}_{far}}} \right) \end{aligned}$$
(2)

where f represents the camera focal length, L is the baseline distance, and \({Z}_{near}\) and \({Z}_{far}\) are the nearest and farthest depth values, respectively. However, synthesis distortion depends on the details in the texture video and is not the same in all regions with the same position error. That is, even with the same amount of geometric error, the degradation of the synthesized view differs between textured and textureless areas: textured areas containing edge information suffer more error than smooth regions. Considering these factors, the synthesis distortion caused by depth distortion is formulated as in Eq. 3 [13].

$$\begin{aligned} {{D}_{d\rightarrow v}}=\varDelta P\cdot {{D}_{d}}\cdot \left[ {{D}_{{{t}_{(x-1)}}}}+{{D}_{{{t}_{(x+1)}}}} \right] \end{aligned}$$
(3)

where \({{D}_{{{t}_{(x-1)}}}}\) and \({{D}_{{{t}_{(x+1)}}}}\) are the horizontal gradients computed between collocated texture blocks.
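As a minimal numerical sketch of Eqs. 2 and 3, the following functions evaluate the position error and the depth-induced synthesis distortion. The function names and all input values are hypothetical and chosen only to give round numbers.

```python
def position_error(f, L, z_near, z_far):
    """Eq. 2: position error factor from the camera and depth-range parameters."""
    return (f * L / 255.0) * (1.0 / z_near - 1.0 / z_far)


def depth_induced_distortion(delta_p, d_d, grad_left, grad_right):
    """Eq. 3: synthesis distortion caused by depth distortion d_d.

    grad_left and grad_right stand for the horizontal texture gradients
    D_t(x-1) and D_t(x+1).
    """
    return delta_p * d_d * (grad_left + grad_right)


# Hypothetical camera setup and the same depth distortion for two blocks.
delta_p = position_error(f=255.0, L=1.0, z_near=1.0, z_far=2.0)
smooth = depth_induced_distortion(delta_p, d_d=4.0, grad_left=1.0, grad_right=1.0)
textured = depth_induced_distortion(delta_p, d_d=4.0, grad_left=10.0, grad_right=10.0)
print(delta_p, smooth, textured)
```

With identical depth distortion, the textured block yields a synthesis distortion ten times larger than the smooth block in this example, which is the nonlinear behavior that a fixed linear weight on \({D}_{d}\) cannot capture.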

3 SSIM Index

Traditional image quality measurement relies on objective metrics, most of which do not match human visual characteristics. Human vision is sensitive to structural information in a scene, so a quality metric should measure structural degradation. Wang et al. [12] proposed the structural similarity (SSIM) index, which measures structural degradation and thus evaluates quality in line with human vision. Measuring SSIM requires two images, and the measurement is done at the block level. For each pair of blocks x and y, three components are measured, namely luminance \(l(x,y)\), contrast \(c(x,y)\), and structure \(s(x,y)\), and SSIM is expressed as in Eq. 4.

$$\begin{aligned} SSIM(x,y)=f(l(x,y),c(x,y),s(x,y)) \end{aligned}$$
(4)

RDO requires a distortion measure rather than a similarity measure. Therefore, an SSIM-based distortion, dSSIM, is used, defined as \( dSSIM =\frac{1}{ SSIM }\).
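A minimal sketch of block-level SSIM and dSSIM is given below. Eq. 4 leaves the combining function f unspecified, so the sketch assumes the widely used simplified form in which the contrast and structure terms are merged, with the usual constants for 8-bit data, \(C_1=(0.01\cdot 255)^2\) and \(C_2=(0.03\cdot 255)^2\). The function names and sample blocks are hypothetical.

```python
def ssim_block(x, y, c1=6.5025, c2=58.5225):
    """SSIM between two equally sized blocks given as flat lists of pixels.

    Assumed form: ((2*mx*my + C1) * (2*cov + C2)) /
                  ((mx^2 + my^2 + C1) * (var_x + var_y + C2)).
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    var_x = sum((a - mx) * (a - mx) for a in x) / (n - 1)
    var_y = sum((b - my) * (b - my) for b in y) / (n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx * mx + my * my + c1) * (var_x + var_y + c2)
    return num / den


def dssim_block(x, y):
    """dSSIM = 1 / SSIM, the distortion measure used in RDO."""
    return 1.0 / ssim_block(x, y)


original = [50.0, 60.0, 70.0, 80.0]
distorted = [52.0, 58.0, 73.0, 79.0]
print(ssim_block(original, original), ssim_block(original, distorted))
print(dssim_block(original, distorted))
```

Identical blocks give an SSIM of 1 and therefore a dSSIM of 1, while a distorted block gives an SSIM below 1 and a dSSIM above 1, so dSSIM grows as quality falls.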

4 SSIM-Based RDO with Nonlinear Depth Distortion

Rate distortion optimization reduces the distortion of the reconstructed video at a minimum rate, at the cost of increased computational complexity. SSIM is used instead of the sum of squared errors (SSE) in mode decision and motion estimation to improve visual quality. In our previous work [5], texture and depth distortions were assumed to be linearly related to view synthesis distortion, and a suitable Lagrange multiplier was determined as in Eq. 5.

$$\begin{aligned} {{\lambda }_{new}}=\frac{2\sigma _{{{x}_{i}}}^{2}+{{C}_{2}}}{{{S}_{f}}\left( \exp (\frac{1}{M}\sum \limits _{j=1}^{M}{\log (2\sigma _{{{x}_{j}}}^{2}+{{C}_{2}})}) \right) }{{\lambda }_{ SSE }} \end{aligned}$$
(5)

where \(S_f\) is the scaling factor, \(\sigma _{{{x}_{i}}}^{2}\) is the variance of the \(i\mathrm{th}\) macroblock, M is the total number of macroblocks, \({\lambda }_{ SSE }\) is the Lagrange multiplier of SSE-based RDO, and \(C_2\) is a constant that limits the range of SSIM.
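Eq. 5 scales \({\lambda }_{ SSE }\) by the ratio of each macroblock's term \(2\sigma _{{{x}_{i}}}^{2}+{{C}_{2}}\) to the geometric mean of that term over the frame. A minimal sketch follows, assuming \(S_f=1\) and \(C_2=58.5225\) as illustrative defaults (the source does not state the values used); the function name and the variances are hypothetical.

```python
import math


def ssim_lagrange_multipliers(variances, lambda_sse, s_f=1.0, c2=58.5225):
    """Eq. 5: per-macroblock Lagrange multipliers for SSIM-based RDO.

    variances  : list of macroblock variances (sigma_x_i squared)
    lambda_sse : Lagrange multiplier of conventional SSE-based RDO
    s_f, c2    : scaling factor and SSIM constant (assumed default values)
    """
    terms = [2.0 * v + c2 for v in variances]
    # exp of the mean log is the geometric mean of the per-macroblock terms
    geo_mean = math.exp(sum(math.log(t) for t in terms) / len(terms))
    return [t / (s_f * geo_mean) * lambda_sse for t in terms]


flat = ssim_lagrange_multipliers([100.0, 100.0, 100.0], lambda_sse=10.0)
mixed = ssim_lagrange_multipliers([10.0, 1000.0], lambda_sse=10.0)
print(flat)
print(mixed)
```

When all macroblocks have the same variance, every multiplier equals \({\lambda }_{ SSE }/S_f\). Otherwise, macroblocks with above-average variance receive a larger multiplier and those with below-average variance a smaller one.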

Depth map RDO is performed by computing RD cost as in Eq. 6.

$$\begin{aligned} {{J}_{d}}=\varDelta P\cdot {{D}_{{{t}_{G}}}}\cdot {{D}_{d}}+{{\lambda }_{ SSE }}{{R}_{d}} \end{aligned}$$
(6)

where \(R_d\) is the depth map rate and \({{D}_{{{t}_{G}}}} = {{D}_{{{t}_{(x-1)}}}}+{{D}_{{{t}_{(x+1)}}}}\) is the horizontal gradient computed from the texture video. For depth map RDO, Eq. 6 is minimized and the Lagrange multiplier is derived as in Eq. 7.

$$\begin{aligned} {{\lambda }_{i(d)}}=\frac{{{\lambda }_{new}}}{{{S}_{f}}\cdot \kappa }{{\lambda }_{SSE}} \end{aligned}$$
(7)

where \(\kappa =\varDelta P\cdot {{D}_{{{t}_{G}}}}\).
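The effect of the texture gradient on depth map mode decision can be sketched as follows. The code evaluates Eqs. 6 and 7 as stated. The candidate modes, their distortion and rate values, and the function names are hypothetical, and the check for a positive \(\kappa\) is an added safeguard, not part of the source.

```python
def depth_rd_cost(delta_p, d_tg, d_d, rate_d, lambda_sse):
    """Eq. 6: RD cost of a depth block with gradient-weighted distortion."""
    return delta_p * d_tg * d_d + lambda_sse * rate_d


def depth_lagrange_multiplier(lambda_new, lambda_sse, s_f, delta_p, d_tg):
    """Eq. 7: depth map Lagrange multiplier, with kappa = delta_p * d_tg."""
    kappa = delta_p * d_tg
    if kappa <= 0:
        raise ValueError("kappa must be positive (added safeguard)")
    return lambda_new / (s_f * kappa) * lambda_sse


def best_mode(modes, delta_p, d_tg, lambda_sse):
    """Return the name of the candidate mode with the lowest Eq. 6 cost."""
    return min(
        modes,
        key=lambda name: depth_rd_cost(delta_p, d_tg, modes[name][0],
                                       modes[name][1], lambda_sse),
    )


# Hypothetical candidates: name -> (depth distortion D_d, rate R_d)
modes = {"accurate": (10.0, 100.0), "cheap": (40.0, 20.0)}
smooth_choice = best_mode(modes, delta_p=0.5, d_tg=2.0, lambda_sse=1.0)
textured_choice = best_mode(modes, delta_p=0.5, d_tg=20.0, lambda_sse=1.0)
print(smooth_choice, textured_choice)
```

In a smooth region the low-rate mode has the lower cost, whereas in a textured region the same depth distortion is weighted more heavily and the more accurate mode is selected.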

5 Bit Allocation Algorithm

In 3D video, the bit rate must be allocated properly between the texture views and the depth maps to improve the virtual view quality. Many joint bit allocation methods have been proposed in the literature [3, 9, 13–16], and all of them aim to improve the PSNR of synthesized views. Visual quality can instead be enhanced by using dSSIM as the distortion metric (\({{ dSSIM }_{v}}\)), and bit allocation to improve SSIM is given by Eq. 8.

$$\begin{aligned} \min \limits _{({R_t}, {R_d})} {{ dSSIM }_{v}} ~~~~ \nonumber \\ {s.t. ~~ {R_t}+{R_d} \le {R_c}} \end{aligned}$$
(8)

In terms of \( dSSIM \), a planar model for synthesis distortion is determined as in Eq. 9.

$$\begin{aligned} dSSIM_v =a \cdot dSSIM_t +b \cdot dSSIM_d + c \end{aligned}$$
(9)

Fig. 3 SSIM-based bit allocation with nonlinear depth distortion: (a) Kendo sequence, (b) Balloons sequence, and (c) Breakdancer sequence

Using the relation between SSIM and MSE, the distortion model in Eq. 9 is converted into Eq. 10, which is written compactly as Eq. 11.

$$\begin{aligned} dSSIM_v&=\frac{a}{2\sigma _{{{x}_{t}}}^{2}+{{C}_{2}}}{{D}_{t}}+\frac{b}{2\sigma _{{{x}_{d}}}^{2}+{{C}_{2}}}{{D}_{d}}+c \end{aligned}$$
(10)
$$\begin{aligned} dSSIM_v&={{p}_{1}}{{D}_{t}}+{{p}_{2}}{{D}_{d}}+c \end{aligned}$$
(11)

where \({{p}_{1}}=\frac{a}{2\sigma _{{{x}_{t}}}^{2}+{{C}_{2}}}\) and \({{p}_{2}}=\frac{b}{2\sigma _{{{x}_{d}}}^{2}+{{C}_{2}}}\). The quantization steps of the texture video, \({Q}_{t}\) (Eq. 13a), and of the depth map, \({Q}_{d}\) (Eq. 13b), are determined by minimizing the distortion-quantization model in Eq. 12.

$$\begin{aligned} \min ~~ ~~~({{p}_{1}}{{D}_{t}}+{{p}_{2}}{{D}_{d}}) ~~~~~~~ \nonumber \\ {s.t. ~~ ({{a}_{t}}Q_{t}^{-1}+{{b}_{t}}+{{a}_{d}}Q_{d}^{-1}+{{b}_{d}})\le {R_c}} \end{aligned}$$
(12)
$$\begin{aligned} {{Q}_{t}}&=\frac{{{a}_{t}}+\sqrt{\frac{{{K}_{1}}{{a}_{t}}{{a}_{d}}}{K_2}}}{{{R}_{c}}-{{b}_{t}}-{{b}_{d}}} \end{aligned}$$
(13a)
$$\begin{aligned} {{Q}_{d}}&=\sqrt{\frac{{{K}_{2}}{{a}_{d}}}{{{K}_{1}}{{a}_{t}}}}{{Q}_{t}} \end{aligned}$$
(13b)

where \(K_1={p_1}{\alpha _t}\) and \(K_2={p_2}{\alpha _d}\).
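The closed-form expressions can be evaluated directly. The sketch below computes \({Q}_{t}\) and \({Q}_{d}\) exactly as stated in Eqs. 13a and 13b and checks that the resulting rate, under the rate model of the constraint in Eq. 12, meets the budget \(R_c\). It does not independently verify optimality. The function names and all parameter values are hypothetical.

```python
import math


def quantization_steps(a_t, b_t, a_d, b_d, k1, k2, r_c):
    """Evaluate Eq. 13a (Q_t) and Eq. 13b (Q_d) as stated in the text."""
    budget = r_c - b_t - b_d
    if budget <= 0:
        raise ValueError("rate budget must exceed b_t + b_d")
    q_t = (a_t + math.sqrt(k1 * a_t * a_d / k2)) / budget
    q_d = math.sqrt(k2 * a_d / (k1 * a_t)) * q_t
    return q_t, q_d


def total_rate(q_t, q_d, a_t, b_t, a_d, b_d):
    """Rate model from the constraint of Eq. 12: a/Q + b for each component."""
    return a_t / q_t + b_t + a_d / q_d + b_d


# Hypothetical model parameters chosen to give round numbers.
params = dict(a_t=4.0, b_t=0.0, a_d=1.0, b_d=0.0)
q_t, q_d = quantization_steps(k1=1.0, k2=1.0, r_c=3.0, **params)
print(q_t, q_d, total_rate(q_t, q_d, **params))
```

For these values the texture and depth quantization steps are 2 and 1, and the modeled rate equals the budget of 3, so the constraint in Eq. 12 holds with equality.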

6 Results

We conducted experiments to check the performance of the joint bit allocation algorithm with nonlinear depth distortion. Encoding is done using the 3DV-ATM reference software [1], and the VSRS 3.0 reference software [11] is used for virtual view synthesis. The test sequences used are Kendo, Balloons [4], and Breakdancer [17], each with a frame rate of 30 frames/s. The Kendo and Breakdancer sequences have 100 frames, whereas the Balloons sequence has 300 frames.

Table 1 BD-rate comparison

Fig. 4 SSIM-based bit allocation with nonlinear depth distortion compared with linear distortion: (a) Kendo sequence, (b) Balloons sequence, and (c) Breakdancer sequence

Experiments were conducted to evaluate SSIM-based bit allocation with nonlinear depth distortion implemented in RDO. SSIM is computed between the synthesized views and the original views. For comparison, we used the model-based algorithm of Yuan et al. [14] and the algorithm of Harshalatha and Biswas [6]. The RD curves in Fig. 3 compare our proposed algorithm with bit allocation with model parameters. Bjøntegaard delta rate (BD-rate) values are computed and tabulated in Table 1.

Further, the bit allocation algorithm with linear and with nonlinear depth distortion in RDO is compared in Fig. 4 and in terms of BD-rate (Table 1). Considering the nonlinear effect of depth distortion in RDO gives more accurate bit allocation results.

7 Conclusions

Rate distortion optimization improves the efficiency of an encoder. We proposed depth map RDO for 3D video that considers the nonlinear relation between depth map distortion and view synthesis distortion. To improve the visual quality of synthesized views, dSSIM is used as the distortion metric. The bit allocation algorithm was evaluated with nonlinear depth distortion RDO and performs better than with linear depth distortion RDO.