1 Introduction

The popularity of 3D technology is growing steadily as the demand for 3D services increases in application areas such as television and surveillance. To support this demand, substantial technological improvements are being made at each stage of the 3D video system: acquisition, representation, coding, view synthesis, and display. Multiview video acquisition enables autostereoscopic display, but it increases both the bandwidth requirement and the encoding workload. The multiview plus depth (MVD) representation format enables the transmission of a few views along with their respective depth maps. The remaining views are generated using view synthesis.

The view synthesis process uses decoded texture videos and depth maps to generate new views at different viewpoints. Synthesized views are generated using 3D warping, which projects the reference images to 3D world coordinates and then projects the 3D coordinates back onto the 2D target view. Both the texture video and the depth data are distorted by lossy compression, which results in synthesis distortion. The relationship of view synthesis distortion (Dvs) with texture distortion (Dtex) and depth distortion (Ddep) is given by (1), where both texture and depth distortions are assumed to be linearly related to synthesis distortion.

$$ {{D}_{vs}}=\alpha {{D}_{tex}}+\beta {{D}_{dep}}+C $$
(1)

where α, β and C are model parameters. However, an error in the depth value leads to a positional error in the target image, so the synthesis distortion caused by the same amount of depth distortion varies with the texture details. As the synthesized views are the ones displayed, it is necessary to reduce the view synthesis distortion. One approach is efficient encoding through rate-distortion optimization (RDO), which minimizes distortion at the available target rate. For 3D video, synthesis distortion is minimized through proper rate allocation between the texture video and the depth map, which improves the quality of the synthesized views.

Early bit allocation algorithms were based on the details in the depth map and texture video. The bit allocation criterion proposed in [5] uses a fixed bit ratio of 5:1 between texture video and depth map. A fixed ratio does not ensure optimal bit allocation and leads to increased distortion in synthesized views. A hierarchical optimization algorithm was proposed in [9], which searches for a pair of quantization steps for the texture video and the depth map. However, this algorithm has high computational complexity because of the extensive search. Yuan et al. [18] proposed a view synthesis distortion model along with an optimization technique to find the optimal quantization steps for texture video and depth map. These bit allocation algorithms model the synthesis distortion and try to reduce it at the available rate. However, they do not consider the total bandwidth, which is an essential factor for 3D video.

Video coding standards incorporate a rate control algorithm to meet the bandwidth constraint. A rate control (RC) algorithm makes it possible to allocate bits while maintaining good video quality: it mainly tries to balance the bit rate while preserving the quality of the views. Rate control algorithms have been proposed for 2D video in different coding standards, including TM5 for MPEG-2, VM8 for MPEG-4, and the quadratic model for H.264. Bit allocation through rate control was proposed in [11] using an H.264 encoder. Similarly, rate control algorithms were proposed for HEVC and were later replaced by the lambda-based rate control algorithm. For 3D video, rate control algorithms have been proposed by many researchers. Adaptive frame-level rate control is proposed in [14], where an initial QP decision scheme is discussed along with smoothing of the bit rate fluctuation. An adaptive bit allocation algorithm with rate control using 3D-HEVC was proposed in [16]. The authors of [3] proposed a rate control algorithm for the depth map based on a rate-lambda model with an adaptive clipping algorithm that exploits depth characteristics. Inter-view dependency-based rate control for 3D-HEVC is proposed in [13]; in that work, based on the synthesis distortion model, bit allocation is done at the texture/depth level, the view level, and the frame level. In all these works, the video quality is measured through objective metrics that do not consider the characteristics of human vision.

In the literature, rate control algorithms are based on objective metrics and, to the best of our knowledge, there are very few or no works on subjective quality based rate control for 3D video. However, a few rate control algorithms for 2D video use SSIM as the metric [4, 10, 19] to improve the subjective quality. Motivated by the fact that subjective quality improvement gives a better visual experience to the viewer, in this paper we propose a structural similarity-based rate control algorithm for 3D video. To improve perceptual quality, rate-distortion analysis is performed using dSSIM as the distortion metric to derive a new RD model. Using this RD model, the relationship between rate and the Lagrange multiplier is derived. Bit allocation is performed at the texture/depth level, frame level, and basic unit (BU) level. The proposed algorithm is implemented using the 3D-HEVC reference software HTM-16.2 [7].

Lambda-domain rate control is briefly reviewed in Section 2. The proposed SSIM-based rate control algorithm is discussed in Section 3, which includes the rate-distortion analysis of texture video and depth map in Section 3.1. The bit allocation at different levels is presented in Section 4. Finally, rate-distortion performance, subjective evaluation and rate accuracy comparison are presented in Section 5, followed by the conclusion in Section 6.

2 Review of lambda-domain rate control

Rate control regulates the varying bit rate by adjusting the relevant encoder parameters. The two main functions of a rate control algorithm are to allocate bits at the different coding levels and to monitor the allocated bits at the different stages. The coding levels in any encoder are the group of pictures (GOP) level, frame level, and basic unit level. Rate-distortion optimization (RDO) is the part of the rate control algorithm in which an optimum trade-off between rate and distortion is obtained. The rate control method adopted in the HEVC encoder that relates rate (R) and quantization step (Q) is the unified rate-quantization (URQ) model [14]. In this scheme, a quadratic RD model is used, and R is related to Q using (2).

$$ R=a{{Q}^{-1}}+b{{Q}^{-2}} $$
(2)

where the values of a and b depend on the content of the video sequence. The ρ-domain rate control algorithm relates Q to ρ, the percentage of zeros among the DCT coefficients. This model also yields a relation between R and Q. In both the URQ and ρ-domain methods, the achieved rate depends entirely on the value of Q, so a proper relation between R and Q is needed. The drawbacks of R-Q methods are discussed in [8], whose authors proposed the λ-domain rate control algorithm. In the λ-domain method, a relation between R and the Lagrange multiplier λ is established to achieve the target bit rate, where λ is the (negative) slope of the RD curve. The λ-domain RD function is the hyperbolic function given in (3).

$$ D(R)=C{{R}^{-K}} $$
(3)

where C and K are model parameters. The relation between R and λ is obtained by minimizing the rate-distortion cost, which gives λ = −∂D/∂R; applying this to (3) yields (4)

$$ R=\alpha {{\lambda }^{\beta }} $$
(4)

where α and β are model parameters.

The bit allocation for the current GOP depends on the number of bits spent on the previously coded GOPs, which balances the total bit target. The bit allocation at the GOP level is given by (5).

$$ {{\mathit{Target}}_{\mathit{GOP}}}=\frac{{{R}_{\mathit{PicAvg}}}\times ({{N}_{\mathit{coded}}}+SW)-{{R}_{\mathit{coded}}}}{SW}\times {{N}_{\mathit{GOP}}} $$
(5)

where RPicAvg is the average bits per picture, SW is the size of the sliding window, Ncoded is the number of pictures coded, Rcoded is the bit cost of all coded pictures and NGOP is the total number of pictures in a GOP. The picture level bit allocation is done using (6) and bit cost assigned to each picture depends on the value of ωPic.

$$ {{\mathit{Target}}_{\mathit{pic}}}=\frac{{{\mathit{Target}}_{\mathit{GOP}}}-{{\mathit{Coded}}_{\mathit{GOP}}}}{\sum\limits_{\{\mathit{AllNotCodedPictures}\}}{{{\omega }_{\mathit{Pic}}}}}\times {{\omega }_{\mathit{PicCurr}}} $$
(6)

where TargetGOP is the target number of bits for the GOP and CodedGOP is the number of bits already spent in the current GOP. The target bits for the BU level are given by (7).

$$ {{\mathit{Target}}_{\mathit{BU}}}=\frac{{{\mathit{Target}}_{\mathit{Pic}}}-{{\mathit{Bit}}_{H}}-{{\mathit{Coded}}_{\mathit{Pic}}}}{\sum\limits_{\{\mathit{AllNotCodedBUs}\}}{{{\omega }_{\mathit{BU}}}}}\times {{\omega }_{\mathit{BUCurr}}} $$
(7)

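where BitH is the estimated number of header bits, TargetPic is the target number of bits for the picture, CodedPic is the number of bits already spent in the current picture, and ωBU is the weight used for BU level bit allocation.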
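As an illustration of how (5), (6) and (7) chain together, the following minimal Python sketch evaluates the three targets in sequence. The function and argument names are hypothetical and do not come from the reference software; the weight lists are assumed to contain the weights of all not-yet-coded pictures (or BUs), including the current one.

```python
def target_gop_bits(r_pic_avg, n_coded, r_coded, sw, n_gop):
    """Eq. (5): GOP-level target derived from the sliding-window budget."""
    return (r_pic_avg * (n_coded + sw) - r_coded) / sw * n_gop


def target_pic_bits(target_gop, coded_gop, weights_left, w_curr):
    """Eq. (6): share of the remaining GOP budget given to the current picture.

    weights_left holds the weights of all not-yet-coded pictures,
    including the current one (an assumption of this sketch).
    """
    return (target_gop - coded_gop) / sum(weights_left) * w_curr


def target_bu_bits(target_pic, bit_header, coded_pic, weights_left, w_curr):
    """Eq. (7): share of the remaining picture budget given to the current BU."""
    return (target_pic - bit_header - coded_pic) / sum(weights_left) * w_curr


# Illustrative numbers only: 8 pictures already coded with 8400 bits,
# an average budget of 1000 bits per picture and a sliding window of 40.
gop = target_gop_bits(r_pic_avg=1000, n_coded=8, r_coded=8400, sw=40, n_gop=8)
pic = target_pic_bits(gop, coded_gop=0, weights_left=[4, 2, 1, 1], w_curr=4)
bu = target_bu_bits(pic, bit_header=60, coded_pic=0, weights_left=[1, 1, 1], w_curr=1)
```

In this example the encoder has overspent by 400 bits, so the GOP target is slightly below the nominal 8 × 1000 bits; the correction is spread over the sliding window rather than applied at once.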

3 SSIM-based rate control algorithm

To improve the visual quality, a perceptual quality metric must be used for distortion measurement. We use SSIM as the perceptual quality metric and measure distortion using dSSIM. The Lagrange multiplier is modified to incorporate dSSIM in two steps: (i) derive a rate-distortion (RD) model in which distortion is measured using dSSIM, and (ii) minimize the RD model to obtain the Lagrange multiplier. These steps are explained in the following subsections.

3.1 Rate-distortion analysis

View synthesis distortion depends on both texture video distortion and depth distortion. Thus, the relationship between rate and distortion must be analyzed for the texture video, the depth map, and the synthesized view. To obtain this relationship, texture and depth sequences are encoded at different rates, and distortion is computed between the original and encoded sequences. Distortion is measured using both an objective quality metric, MSE, and a perceptual quality metric, dSSIM. The RD graphs for the texture views and depth map of the Kendo sequence are shown in Fig. 1a and b respectively, where distortion is measured using MSE. The RD graph in Fig. 2 shows the relation between rate and dSSIM for the Kendo sequence. In both cases, rate and distortion are related by a power model, and the corresponding \(R^{2}\) values for different sequences are listed in Table 1.

Fig. 1 RD curve for (a) texture video and (b) depth map of the Kendo sequence (distortion metric: MSE)

Fig. 2 RD curve for (a) texture video and (b) depth map of the Kendo sequence (distortion metric: dSSIM)

Table 1 Coefficient of determination measured for Synthesized Views

RD analysis is also done for synthesized views of Kendo sequence as shown in Fig. 3. Original texture video and depth maps are used to generate synthesized views that have no distortion. Decoded texture views and depth maps are used to generate synthesized views that are distorted. Distortion is computed between original synthesized views and distorted synthesized views using dSSIM as distortion metric. For synthesized views also, rate and distortion are related using the power model.

Fig. 3 RD curve for the synthesized view of the Kendo sequence

According to the RD analysis with dSSIM as the distortion metric, texture video and depth maps follow the hyperbolic RD function in (8).

$$ \mathit{dSSIM}(R)=C{{R}^{-K}} $$
(8)

where C and K are parameters that depend on the characteristics of the source. The view synthesis distortion dSSIMv is related to the texture video rate Rt and the depth map rate Rd as in (9)

$$ {\mathit{dSSIM}_{v}}=a{{R_{t}}^{-K_{t}}}+ b {{R_{d}}^{-K_{d}}} $$
(9)

where a, Kt, b and Kd are the model parameters.
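The power-model fit used in this analysis can be sketched as a linear regression in the log-log domain, since (8) becomes log D = log C − K log R. The Python sketch below is only an illustration: the function name is hypothetical, the sample points are synthetic, and the coefficient of determination is computed on the log-transformed data, which is an assumption because the fitting procedure behind Table 1 is not specified.

```python
import math


def fit_power_model(rates, distortions):
    """Least-squares fit of D = C * R**(-K) in the log-log domain.

    Returns (C, K, r_squared), where r_squared is the coefficient of
    determination of the straight-line fit to (log R, log D).
    """
    xs = [math.log(r) for r in rates]
    ys = [math.log(d) for d in distortions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.exp(intercept), -slope, 1.0 - ss_res / ss_tot


# Synthetic RD points generated from D = 2.5 * R**(-0.6); not measured data.
rates = [200.0, 400.0, 800.0, 1600.0]
distortions = [2.5 * r ** -0.6 for r in rates]
c_fit, k_fit, r2 = fit_power_model(rates, distortions)
```

With measured RD points, the closer the returned coefficient of determination is to 1, the better the power model describes the sequence.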

3.2 Rate control based on structural similarity

The rate control algorithm aims to achieve better video quality while maintaining the target bit rate. To improve the subjective quality, we incorporate dSSIM for distortion measurement. Work on perceptual quality improvement through rate control exists in the literature for 2D video [2, 4]. For 3D video, view synthesis distortion is modeled as dSSIMv as in (9) and is related to the texture video rate Rt and the depth map rate Rd.

The distortion dSSIM can be related to MSE [17] and the resulting expression is given in (10).

$$ 1+\frac{MSE(R)}{2{\sigma_{x}^{2}}+{{C}_{2}}}=C{{R}^{-K}} $$
(10)

where \({\sigma _{x}^{2}}\) is the variance of the block to be coded and C2 is a constant. Using (10), the synthesis distortion of (9) can be expressed in terms of MSE as

$$ {d}_{v} = {{a}_{t}}(2\sigma_{xt}^{2}+{{C}_{2}})R_{t}^{-K_{t}}+{{a}_{d}}(2\sigma_{xd}^{2}+{{C}_{2}})R_{d}^{-K_{d}} $$
(11)

The Lagrange multiplier λ is the negative slope of the RD curve and is given by (12)

$$ \lambda =-\frac{\partial D}{\partial R}=CK(2{\sigma_{x}^{2}}+{{C}_{2}}){{R}^{-K-1}} $$
(12)

This Lagrange multiplier is used to achieve better visual quality using the relationship of dSSIM with MSE.
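The step from (10) to (12) can be made explicit. Solving (10) for the MSE gives \(MSE(R)=(2\sigma_{x}^{2}+C_{2})(CR^{-K}-1)\), and differentiating with respect to R gives \(-\partial MSE/\partial R = CK(2\sigma_{x}^{2}+C_{2})R^{-K-1}\), which is (12). Inverting this expression gives \(R=\alpha\lambda^{\beta}\) with \(\alpha=[CK(2\sigma_{x}^{2}+C_{2})]^{1/(K+1)}\) and \(\beta=-1/(K+1)\), the same form as (4). The Python sketch below evaluates both directions. The function names are hypothetical, and the default value of C2 is the usual SSIM constant \((0.03\cdot 255)^{2}\) for 8-bit video, which is an assumption because the value used here is not stated.

```python
# Usual SSIM stabilizing constant for 8-bit samples (assumed, not stated in the text).
C2_DEFAULT = (0.03 * 255) ** 2


def ssim_lambda(rate, c, k, var_x, c2=C2_DEFAULT):
    """Eq. (12): Lagrange multiplier for a block with variance var_x."""
    return c * k * (2.0 * var_x + c2) * rate ** (-k - 1.0)


def rate_from_lambda(lam, c, k, var_x, c2=C2_DEFAULT):
    """Inverse of Eq. (12), written as R = alpha * lam**beta."""
    alpha = (c * k * (2.0 * var_x + c2)) ** (1.0 / (k + 1.0))
    beta = -1.0 / (k + 1.0)
    return alpha * lam ** beta


# Illustrative parameters only: C = 2.0, K = 0.5, block variance 400.
lam = ssim_lambda(rate=0.1, c=2.0, k=0.5, var_x=400.0)
rate_back = rate_from_lambda(lam, c=2.0, k=0.5, var_x=400.0)
```

Because the factor \(2\sigma_{x}^{2}+C_{2}\) scales λ, a block with higher variance receives a larger λ at the same rate, which is how the structural information enters the rate control.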

4 Bit allocation

Bit allocation for 3D video is important for minimizing the distortion of synthesized views. We therefore consider bit allocation at three stages: (i) texture/depth level, (ii) frame level and (iii) BU level. At each stage, the aim is to allocate bits so as to improve perceptual quality.

The bit allocation at texture/depth level is solved as an RDO problem. Another approach to allocating bits between texture video and depth map is to set an optimum bit ratio between them. The bit ratio is computed by minimizing the RD model for view synthesis. In this paper, we consider the RD models with distortion being measured using dSSIM as in (9). Thus, the RDO is formulated as given in (13).

$$ \begin{array}{c} \min\limits_{({R_{t}},{R_{d}})} {{\mathit{dSSIM}}_{v}} \\ \text{s.t.}\quad {R_{t}}+{R_{d}} \leq {R_{c}} \end{array} $$
(13)

where \({\mathit {dSSIM}}_{v} = {{a}_{t}}(2\sigma _{xt}^{2}+{{C}_{2}})R_{t}^{-K_{t}}+{{a}_{d}}(2\sigma _{xd}^{2}+{{C}_{2}})R_{d}^{-K_{d}}\), \(\sigma _{xt}^{2}\) and \(\sigma _{xd}^{2}\) are the variances of the texture video and depth map respectively, and Rc is the total target bit rate. The bit ratio η is calculated following [11, 13] as in (14).

$$ \eta =\frac{(2\sigma_{xt}^{2}+{{C}_{2}})R_{t}}{(2\sigma_{xd}^{2}+{{C}_{2}})R_{d}} $$
(14)

The variances of the texture video and depth map are calculated along with the other model parameters. Given η, the rates of the texture video and depth map are obtained as in (15) and (16).

$$ {{R}_{t}}={{R}_{c}}\cdot \frac{\eta }{\eta +1} $$
(15)
$$ {{R}_{d}}={{R}_{c}}\cdot \frac{1}{\eta +1} $$
(16)
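As a quick check of (15) and (16), the short Python sketch below splits a total rate for a given η. The function name is hypothetical, and the value η = 5 is used only because it reproduces the fixed 5:1 ratio of [5] mentioned in the introduction.

```python
def split_rate(r_c, eta):
    """Eqs. (15) and (16): split the total rate r_c into texture and depth rates."""
    r_t = r_c * eta / (eta + 1.0)
    r_d = r_c / (eta + 1.0)
    return r_t, r_d


# Illustrative numbers only: a total of 1200 kbps with eta = 5.
r_t, r_d = split_rate(1200.0, 5.0)
```

By construction the two rates sum to the total rate and their ratio equals η, so the constraint in (13) is met with equality.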

For frame-level bit allocation, we use the concept of entropy. The entropy of each frame is computed for both the texture video and the depth map, and gives the amount of information present in the frame. As human vision is sensitive to structural information, frames with higher entropy must be given more importance. The target bit allocation at the frame level is calculated as in (6). The weight ωPicCurr is considered to be constant in HEVC, which distributes the target bits equally between the frames. However, the amount of information differs from frame to frame, and the bit allocation should reflect this. Thus, we measure the entropy of each frame and assign it to the weight ωPicCurr.
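A minimal Python sketch of the entropy-based weighting is given below. It assumes that the entropy is the Shannon entropy of the histogram of sample values in a frame, and it distributes a remaining budget over several frames in one shot, whereas (6) is applied picture by picture with an updated remainder. The function names and the tiny four-sample "frames" are hypothetical.

```python
import math
from collections import Counter


def frame_entropy(samples):
    """Shannon entropy in bits per sample of a flat sequence of sample values."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_weighted_targets(remaining_bits, frames):
    """Distribute remaining_bits over frames in proportion to their entropy.

    This follows the form of Eq. (6) with the entropy used as the picture
    weight. Frames with zero entropy (flat content) would receive no bits,
    so a practical encoder would need a lower bound on the weight.
    """
    weights = [frame_entropy(f) for f in frames]
    total = sum(weights)
    return [remaining_bits * w / total for w in weights]


# Toy frames: two sample values versus four sample values.
low_detail = [0, 0, 255, 255]
high_detail = [0, 1, 2, 3]
targets = entropy_weighted_targets(3000.0, [low_detail, high_detail])
```

Here the second frame has twice the entropy of the first (2 bits versus 1 bit per sample), so it receives twice as many bits.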

For bit allocation at the BU level, a combination of the just noticeable depth difference (JNDD) threshold and the texture gradient is used, considering the effect of depth on synthesis distortion. Human perception varies with depth distance and is sensitive to changes in depth value. The JNDD model explains that human vision is more sensitive to closer objects than to distant objects [12]. A JNDD threshold is set such that depth differences below the threshold cannot be perceived. Another factor that affects synthesis distortion is its non-linear relation with depth distortion: with the same depth distortion, synthesis distortion varies with the texture details. Thus, a new weight factor is proposed for the BU level bit allocation in (7), as given in (17).

$$ {\omega }_{BUCurr} = {{\mathit{dJND}}_{avg}}\cdot{{t}_{G}} $$
(17)

where dJNDavg is the average dJND and tG is the texture gradient. dJND is computed using (18).

$$ \mathit{dJND}=\left \{ 1-\frac{(1-D_{\mathit{jnd}})^{2}}{(1+D_{\mathit{jnd}})^{2}} \right \}\cdot{255} $$
(18)

where Djnd is determined by subjective perception experiments in [1] and is given by (19).

$$ D_{jnd}=\begin{cases} 21 & \text{ if } \ (0\leqslant A(i,j)<64) \\ 19 & \text{ if } \ (64\leqslant A(i,j)<128)\\ 18 & \text{ if } \ (128\leqslant A(i,j)<192) \\ 20 & \text{ if } \ (192\leqslant A(i,j)\leqslant 255) \end{cases} $$
(19)
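The weight in (17) can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements of the text: A(i, j) is taken to be the 8-bit depth value at pixel (i, j), the depth value 255 is assigned to the last interval of (19), the average in (17) is taken over the depth samples of one BU, and the texture gradient is passed in as a precomputed number because its computation is not specified. All function names are hypothetical.

```python
def jndd_threshold(depth):
    """Eq. (19): JNDD threshold for an 8-bit depth value."""
    if not 0 <= depth <= 255:
        raise ValueError("depth value must be in [0, 255]")
    if depth < 64:
        return 21
    if depth < 128:
        return 19
    if depth < 192:
        return 18
    return 20  # 192..255; 255 is included here by assumption


def d_jnd(depth):
    """Eq. (18): dJND for one depth sample."""
    t = jndd_threshold(depth)
    return (1.0 - (1.0 - t) ** 2 / (1.0 + t) ** 2) * 255.0


def bu_weight(depth_block, texture_gradient):
    """Eq. (17): average dJND of the BU multiplied by its texture gradient."""
    values = [d_jnd(d) for d in depth_block]
    return sum(values) / len(values) * texture_gradient


# Toy BU with two depth samples in the 64..127 interval and a gradient of 2.0.
weight = bu_weight([100, 100], texture_gradient=2.0)
```

For the threshold 19, (18) gives (1 − 324/400) × 255 = 48.45, so the toy BU above has a weight of 96.9.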

5 Results

For testing and verification of the proposed algorithm, we used 3D-HEVC reference software HTM-16.2. The algorithm is tested with different input sequences, namely Kendo, Balloons, Breakdancer, Ballet, Poznan Street and Shark [6, 20]. For each sequence, views are rendered using the rendering algorithm available in HTM-16.2. The GOP size is set to 8 with an intra period of 24. The main objective of the proposed algorithm is to improve the perceptual quality. To the best of our knowledge, SSIM-based rate control algorithms for 3D video are not found in the literature. Hence, all the sequences were encoded with the proposed rate control algorithm and compared with the original λ-based rate control algorithm of HTM-16.2. We also compared the proposed algorithm with the SSIM-based bit allocation algorithm that uses model parameters to determine the quantization parameter.

5.1 Rate-distortion performance evaluation

The rate-distortion performance of the proposed rate control algorithm is compared with the original rate control algorithm of HTM-16.2 and the SSIM-based bit allocation algorithm. For comparison, the original RC algorithm is considered as Anchor1, and the SSIM-based bit allocation method [15] is considered as Anchor2. The RD curves are plotted in Fig. 4. The proposed algorithm shows significant improvement in perceptual quality for Balloons, Ballet, Poznan Street, and Shark. For the Kendo and Breakdancer sequences, however, the improvement in SSIM over Anchor1 and Anchor2 appears mainly at lower rates.

Fig. 4 RD curves for (a) Kendo, (b) Balloons, (c) Breakdancer, (d) Ballet, (e) Poznan Street and (f) Shark

The BD-Rate results are presented in Tables 2 and 3, which compare the proposed algorithm with Anchor1 and Anchor2 respectively. The BD-Rate shows a reduction in rate and improved SSIM for the proposed algorithm compared with the other two algorithms: the rate is reduced by 14.6% and 27.34% with respect to Anchor1 and Anchor2 respectively. The rate control accuracy is summarized in Tables 4 and 5 for Kendo and Balloons respectively, where the error is the difference between the actual bit rate and the target bit rate. The error of the proposed algorithm is much smaller than that of Anchor1 and Anchor2, and its average mismatch is also lower.

Table 2 BD-rate comparison: proposed vs Anchor1
Table 3 BD-rate comparison: proposed vs Anchor2
Table 4 Rate accuracy for the sequence Kendo
Table 5 Rate accuracy for the sequence Balloons
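For readers unfamiliar with the BD-Rate figure, the Python sketch below follows the commonly used Bjøntegaard procedure for four RD points per curve: the logarithm of the rate is interpolated as a cubic polynomial of the quality metric, both curves are averaged over their common quality interval, and the difference is converted to a percentage. It is an illustration only and not necessarily the exact tool used for Tables 2 and 3; SSIM is assumed as the quality axis, the function names are hypothetical, and the sample points are synthetic.

```python
import math


def _cubic_through_points(xs, ys, x):
    """Evaluate at x the cubic that passes through four (xs, ys) points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _mean_log_rate(quality, rates, lo, hi):
    """Average of the interpolated log-rate over the quality interval [lo, hi]."""
    logs = [math.log(r) for r in rates]
    mid = 0.5 * (lo + hi)
    f_lo = _cubic_through_points(quality, logs, lo)
    f_mid = _cubic_through_points(quality, logs, mid)
    f_hi = _cubic_through_points(quality, logs, hi)
    # Simpson's rule integrates a cubic exactly, so this is the exact mean.
    return (f_lo + 4.0 * f_mid + f_hi) / 6.0


def bd_rate(q_anchor, r_anchor, q_test, r_test):
    """BD-Rate in percent for four RD points per curve; negative means bit saving."""
    if len(q_anchor) != 4 or len(q_test) != 4:
        raise ValueError("this sketch expects exactly four RD points per curve")
    lo = max(min(q_anchor), min(q_test))
    hi = min(max(q_anchor), max(q_test))
    diff = (_mean_log_rate(q_test, r_test, lo, hi)
            - _mean_log_rate(q_anchor, r_anchor, lo, hi))
    return (math.exp(diff) - 1.0) * 100.0


# Synthetic curves: the test codec needs 20% fewer bits at every SSIM level.
ssim = [0.90, 0.93, 0.95, 0.97]
anchor_rates = [500.0, 1000.0, 2000.0, 4000.0]
test_rates = [0.8 * r for r in anchor_rates]
saving = bd_rate(ssim, anchor_rates, ssim, test_rates)
```

For these synthetic curves the result is a BD-Rate of about −20%, which matches the constant 20% rate reduction built into the test data.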

As video quality is determined by both spatial and temporal information, we also computed the temporal SSIM between frames. The spatio-temporal structural gradient is computed in three directions: x-t, y-t and x-y [15]. The temporal SSIM is computed for the synthesized views of the proposed algorithm and compared with Anchor1 and Anchor2. The resulting RD graphs for Kendo and Ballet are shown in Fig. 5.

Fig. 5 Temporal SSIM comparison for (a) Kendo and (b) Ballet sequences

5.2 Subjective evaluation

To support the claim that the proposed rate control algorithm improves perceptual quality, we conducted a subjective evaluation comparing it with the original rate control algorithm. Twelve participants viewed sequences encoded with the SSIM-based rate control algorithm and with the lambda-based rate control algorithm. To assess the subjective quality, participants were asked to grade the sequences on a five-point scale: 5 - Excellent, 4 - Good, 3 - Fair, 2 - Poor, 1 - Very poor. Table 6 shows the mean opinion score (MOS) and standard deviation (SD), which indicate that the proposed algorithm performs better than the lambda-based algorithm.
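The MOS and SD of one sequence can be computed from the individual grades as in the Python sketch below. The twelve grades are invented for illustration and are not the scores collected in the experiment, the function name is hypothetical, and the sample standard deviation is used, which is an assumption because the text does not say whether the sample or population form was applied.

```python
import statistics


def mos_and_sd(grades):
    """Mean opinion score and sample standard deviation of five-point grades."""
    for g in grades:
        if g not in (1, 2, 3, 4, 5):
            raise ValueError("grades must be integers from 1 to 5")
    return statistics.mean(grades), statistics.stdev(grades)


# Hypothetical grades from twelve viewers for a single sequence.
grades = [5, 4, 4, 4, 3, 4, 5, 4, 4, 3, 4, 4]
mos, sd = mos_and_sd(grades)
```

A higher MOS with a comparable or smaller SD indicates that viewers consistently preferred the sequence.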

Table 6 Subjective evaluation: mean opinion score and standard deviation

6 Conclusion

In this paper, we proposed a rate control algorithm for 3D-HEVC that allocates bits between texture video and depth map while improving the visual quality. The relation between rate and distortion of synthesized views is derived experimentally, with distortion measured using dSSIM. The λ value is modified to achieve structural similarity-based rate control. In addition, bit allocation is performed at the texture/depth level, frame level, and basic unit level. At the texture/depth level, a bit ratio is derived between the texture video and the depth map, whereas at the frame level the entropy of each frame decides the weight factor. At the basic unit level, the product of dJND and the texture gradient is used as the weight factor. Subjective analysis, RD performance, and BD-Rate evaluation show that the proposed algorithm performs better than the original RC algorithm.