1 Introduction

High efficiency video coding (HEVC) is the newest video coding standard [21, 24]. It is the successor to H.264/AVC [10] and was developed by the Joint Collaborative Team on Video Coding (JCT-VC), a standardization committee formed by the Video Coding Experts Group (VCEG) and the Moving Picture Experts Group (MPEG). Compared with H.264/AVC, HEVC is able to double the compression efficiency while keeping almost the same subjective quality of the reconstructed video. The improvement in coding efficiency is achieved by exploiting many new techniques, such as a quad-tree coding structure with coding units (CUs), prediction units (PUs) and transform units (TUs) whose block sizes range from 4 × 4 to 64 × 64, advanced motion vector (MV) prediction, and so on [28]. Due to its high coding performance, HEVC is expected to be used in various applications, such as HDTV, digital satellite broadcasting, and so on.

As is well known, rate control plays an important role in video coding; its goal is to maximize video quality under bit rate and codec buffer constraints in a robust and accurate manner. Similar to H.264/AVC, HEVC adopts Lagrangian rate distortion optimization (RDO) [29] to identify the encoding settings that achieve the optimal rate distortion (RD) performance. However, HEVC still optimizes the coding performance in terms of the sum of squared error (SSE), which is not well correlated with perceptual quality, as discussed in [9]. Moreover, since video quality is ultimately judged by human eyes, it is essential to develop a perceptual rate control method for HEVC that provides the optimal perceptual quality under a given bit rate.

In this paper, an efficient perceptual rate control method for HEVC, called perceptual sensitivity-based rate control (PSRC), is proposed by considering the perceptual characteristics of the input video content. In our approach, a perceptual sensitivity measurement is first developed to evaluate the perceptual sensitivity of each coding tree unit (CTU) and each frame. Then, bit allocation is guided by the obtained perceptual sensitivity, and an improved R-λ model is exploited to determine the quantization parameter (QP) that meets the allocated target bits. Experimental results demonstrate that the proposed PSRC method improves the perceptual coding performance compared with the original rate control in HEVC.

The rest of this paper is organized as follows. Related works are briefly reviewed in Section 2. The proposed perceptual sensitivity-based rate control (PSRC) method for HEVC is presented in detail in Section 3. Extensive simulation results are documented and discussed in Section 4. Finally, conclusions are drawn in Section 5.

2 Related works

There have been many works on rate control for previous video coding standards (e.g., H.264/AVC) [5]. In these methods, the common approach is to establish a rate-distortion/quantization model based on the characteristics of the residual or the input video and then use this model to determine a suitable QP. Among them, assuming that the prediction residual follows a Laplacian distribution, Chiang et al. [6] presented a quadratic rate distortion model that uses the mean absolute difference (MAD) to estimate the complexity of the input video. Liu et al. [16] presented a switched MAD prediction scheme to reduce abrupt MAD changes and a linear rate-quantization model to describe the relationship between the bits and the QP. Kwon et al. [11] presented a rate control method for H.264, in which the interdependency between RDO and rate control is addressed by QP estimation and update, and the bits for coding header information are estimated as a function of the number of nonzero MV elements. An et al. [1] suggested an iterative RDO method for H.264 using primal-dual decomposition and sub-gradient projection. Dong et al. [8] suggested a context-adaptive model parameter prediction scheme that uses spatial and temporal correlations, so that the accuracy of the MAD, the model parameters and the bit matching can be significantly improved. Tsai et al. [23] improved the rate control performance of intra coding by applying a Taylor-series-based, scene-change-aware rate-quantization step size model. However, these methods are not directly applicable to HEVC, because HEVC exploits a more complicated quad-tree coding structure with variable-sized CUs, PUs and TUs, which differs from that of previous video coding standards. Therefore, it is both necessary and challenging to develop a novel rate control algorithm for HEVC.

To this end, multiple rate control methods for HEVC have recently been proposed. Choi et al. [7] presented a quadratic pixel-based unified rate-quantization model for HEVC that accounts for the fact that the PU size varies from CU to CU. Based on the observation that the Lagrangian multiplier is more important than the QP for achieving the target bits, Li et al. [13] suggested a rate-λ (R-λ) model to perform rate control for HEVC instead of the conventional rate-quantization model. It should be pointed out that this R-λ model has been incorporated into the HEVC reference software. Moreover, Lee et al. [12] presented a frame-level rate control scheme for HEVC based on texture and non-texture models. More specifically, the texture model is established using the transform residual, with CUs categorized into three classes (low-, medium- and high-textured), while the non-texture model is developed by considering the different characteristics of the non-texture bits at various CU depths. By taking into account the inter-frame dependency between the coding frame and its reference frame, Wang et al. [26] proposed inter-frame-dependency-based rate and distortion models. Based on these models and a mixed Laplacian distribution of the residual information, a ρ-domain, frame-level, rate-GOP (group of pictures) based rate control scheme is presented.

Unfortunately, the above-mentioned rate control methods for H.264/AVC and HEVC do not consider the characteristics of the human visual system (HVS) and might be inefficient in the sense of perceptual video coding. Since video quality is ultimately judged by the human eye, developing a perceptual rate control method that incorporates HVS characteristics has attracted increasing attention from both academia and industry. Although some perceptual rate control methods exist for H.264/AVC (e.g., [18, 27]), they cannot be directly applied to HEVC due to the different coding structures. For HEVC, considering that saliency represents the probability of human attention, Li et al. [14] incorporated graph-based visual saliency into the quantization control so that a larger QP is assigned to CUs with a lower probability of attention. Li et al. [15] developed a weight-based unified rate-quantization (URQ) scheme for perceptual coding of conversational video in HEVC. Based on the observation that humans are usually attracted to faces in conversational video, a hierarchical perceptual model of the face is used to compute a weight map, which is then utilized to guide the bit allocation. In general, perceptual rate control for HEVC has not been well investigated. Therefore, we focus on developing a perceptual rate control method for HEVC.

3 Proposed Perceptual Sensitivity-Based Rate Control (PSRC) for HEVC

The proposed PSRC method consists of three parts, described in the following three sub-sections: 1) a perceptual sensitivity measurement is developed to evaluate the perceptual characteristics of the input video; 2) bit allocation is performed based on the obtained perceptual sensitivity; and 3) an improved R-λ model is utilized to meet the target bits, including QP determination and parameter updating.

3.1 Perceptual sensitivity measurement

The development of a perceptual sensitivity measurement for guiding the video coding process should satisfy the following requirements. First, the measurement should describe both HVS perception and video coding distortions (e.g., quantization artifacts) well. Second, it should have low complexity and be easy to incorporate into the video coding process (e.g., rate control). Therefore, although many existing visual quality assessment metrics measure HVS perception well [19, 20], they are not applicable to the video coding process. As we know, a conventional video codec optimizes the RD performance in terms of SSE, which is computationally efficient; however, the correlation between SSE and HVS perception is poor [29]. Further study finds that there is a strong linear relationship between the mean squared error (MSE) and HVS perception [2, 22].

Motivated by this, we propose a perceptual sensitivity measurement as a function of MSE, which not only indicates HVS perception better than MSE alone but is also easy to incorporate into the video codec. For each CTU, the proposed perceptual sensitivity measurement (PSM) is defined by weighting its MSE according to its perceptual characteristics, and the PSM of the whole frame is computed by simply summing the PSM of each CTU in the frame:

$$ PSM_i = 1 + k_i \times MSE_i, \qquad PSMF = \sum_{i=1}^{N} PSM_i $$
(1)

where $MSE_i$ is the mean squared error between the original and reconstructed $i$-th CTU, $N$ is the number of CTUs in the current frame $f$, and $k_i$ is the perceptual weighting factor indicating the perceptual characteristics of the $i$-th CTU. Intuitively, the visibility of distortion in a video scene depends on its spatial texture and temporal motion. Accordingly, the perceptual weighting factor $k_i$ is computed by considering both spatial texture and temporal motion. First, from the spatial viewpoint, a highly textured region can generally tolerate more distortion than a low/medium-textured region. Hence, the spatial texture complexity (STC) is developed as the spatial perceptual feature of each CTU to evaluate its texture complexity:

$$ STC_i = \frac{\sigma_m}{\sigma_f} \times \frac{1}{\sigma_i} $$
(2)

where $\sigma_i$ is the variance of the $i$-th CTU in the current frame $f$, $\sigma_f$ is the variance of the current frame, and $\sigma_m$ is the variance of the mean values of all the CTUs in the current frame, which can be computed as:

$$ \sigma_m = \frac{1}{N} \sum_{i=1}^{N} \left( m_i - \frac{1}{N} \sum_{j=1}^{N} m_j \right)^2 $$
(3)

where $m_i$ is the mean value of the $i$-th CTU in the current frame $f$ and $N$ is the number of CTUs in the current frame. $STC_i$ makes full use of two texture masking properties: global smoothness $\frac{\sigma_m}{\sigma_f}$ and local contrast $\frac{1}{\sigma_i}$. One can see that the smaller the STC value is, the larger the CTU variance is, which means that the current CTU is more likely to contain complex texture.
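For illustration, the STC computation in (2) and (3) can be sketched as follows (a minimal sketch, assuming the frame has already been partitioned into CTU-sized luma blocks; the function name and the small epsilon guard against zero-variance blocks are our own additions):

```python
import numpy as np

def spatial_texture_complexity(ctu_blocks):
    """Compute STC_i for every CTU in a frame, following Eqs. (2)-(3).

    ctu_blocks: list of 2-D numpy arrays (luma samples of each CTU).
    Returns an array of STC values, one per CTU.
    """
    eps = 1e-6                                   # guard against flat (zero-variance) blocks
    ctu_vars  = np.array([blk.var()  for blk in ctu_blocks])   # sigma_i of each CTU
    ctu_means = np.array([blk.mean() for blk in ctu_blocks])   # m_i of each CTU

    sigma_f = np.concatenate([blk.ravel() for blk in ctu_blocks]).var()  # frame variance
    sigma_m = ctu_means.var()                    # variance of the CTU mean values, Eq. (3)

    # Eq. (2): global smoothness (sigma_m / sigma_f) times local contrast (1 / sigma_i)
    return (sigma_m / (sigma_f + eps)) / (ctu_vars + eps)
```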

Second, from the temporal viewpoint, people are more interested in a moving region than in a stationary one. Therefore, the temporal motion activity (TMA) is presented as the temporal perceptual feature of each CTU; namely, the MV length is used to evaluate the motion activity. As suggested in [17], once a moving object is so fast that it exceeds the spatio-temporal resolution capacity of the HVS, the moving region is smoothed and motion blurring occurs. In such motion-blurred regions, people cannot perceive good visual quality and tend to ignore them. Therefore, $TMA_i$ for each CTU is computed as:

$$ TMA_i = \begin{cases} 1 & \text{if } |x_i| + |y_i| > L, \\ \sqrt{x_i^2 + y_i^2} + 1 & \text{otherwise,} \end{cases} $$
(4)

where $MV_i = \{x_i, y_i\}$ is the MV of the $i$-th CTU in the current frame, and $L$ is empirically determined as 8 through extensive experiments. A smaller TMA value indicates that the CTU has lower motion activity; in particular, $TMA_i = 1$ means that the CTU is likely to fall in a motionless region or a motion-blurred region.
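A minimal sketch of the TMA computation in (4) is given below (the function name is hypothetical; only the per-CTU MV, obtained from the pre-analysis described in Section 3.2, is assumed to be available):

```python
import numpy as np

def temporal_motion_activity(mv_x, mv_y, L=8):
    """Compute TMA_i for one CTU from its motion vector, following Eq. (4).

    mv_x, mv_y: horizontal and vertical MV components of the CTU.
    L: empirical threshold beyond which motion blurring is assumed (L = 8 in the paper).
    """
    if abs(mv_x) + abs(mv_y) > L:
        # Very fast motion: the region is likely motion-blurred and less sensitive.
        return 1.0
    # Otherwise use the MV length plus 1, so stationary CTUs also yield TMA = 1.
    return np.sqrt(mv_x ** 2 + mv_y ** 2) + 1.0
```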

Based on the above analysis, for the current CTU we individually obtain two perceptual features: the spatial texture complexity (STC) and the temporal motion activity (TMA). In order to make the spatial STC and the temporal TMA contribute equally to the evaluation of perceptual distortion, the product of these two perceptual features is simply used as the perceptual weighting factor $k_i$:

$$ k_i = STC_i \times TMA_i $$
(5)

It should be pointed out that the proposed perceptual sensitivity measurement in (1) reflects the texture masking property of the HVS well. More specifically, if distortion with the same MSE occurs in both a complex-textured CTU and a smooth CTU, the perceptual quality reduction in the complex-textured CTU tends to be smaller than that in the smooth one; similarly, if distortion with the same MSE occurs in both a moving CTU and a stationary CTU, the quality reduction is more easily perceived in the moving region.
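Combining (1) and (5), the per-CTU and per-frame perceptual sensitivity can be sketched as follows (function names are our own; the STC, TMA and MSE values are assumed to come from the pre-analysis):

```python
def perceptual_sensitivity(stc, tma, mse):
    """PSM_i of one CTU, Eqs. (1) and (5): k_i = STC_i * TMA_i, PSM_i = 1 + k_i * MSE_i."""
    return 1.0 + (stc * tma) * mse

def frame_psm(stc_list, tma_list, mse_list):
    """PSMF of a frame: sum of the per-CTU PSM values, Eq. (1)."""
    return sum(perceptual_sensitivity(s, t, m)
               for s, t, m in zip(stc_list, tma_list, mse_list))
```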

3.2 Bit allocation guided by perceptual sensitivity

For perceptual rate control, the key problem is how to allocate the bits for each frame and each CTU according to their perceptual sensitivity, which is computed by using the PSM developed in Section 3.1. Bit allocation is then performed in the order of GOP level, frame level, and CTU level, as follows.

1) Pre-analysis process

Before bit allocation, we perform a pre-analysis process to obtain the PSM of each CTU and each frame. In this pre-analysis process, only the nearest previous frame is used as the reference frame and only the 2N × 2N partition mode is tested for each CTU to obtain its MV; the MSE is computed between the original CTU and its best matching block. From this process, the spatial texture complexity $STC_i$ and the temporal motion activity $TMA_i$ of each CTU are computed using (2) and (4), respectively. Then, the PSM of each CTU, $PSM_i$, and of each frame, $PSMF$, are computed using (1). Note that the first video frame is a special case: there is no reference frame for motion estimation, so $PSM_i$ is directly set to 1 for all CTUs in the first frame.
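A simplified sketch of this pre-analysis for one CTU is shown below (a plain full search over the luma plane with a fixed search range; the search range, function name and interface are our own assumptions and not specified by the paper, and the CTU is assumed to lie entirely within the frame):

```python
import numpy as np

def preanalyze_ctu(cur, ref, x0, y0, size=64, search=8):
    """Single-reference, 2Nx2N pre-analysis for one CTU.

    cur, ref : current and previous luma frames (2-D numpy arrays).
    x0, y0   : top-left corner of the CTU; size: CTU size.
    search   : full-search range in samples (our own choice).
    Returns (mv_x, mv_y, mse) for the best 2Nx2N match.
    """
    h, w = cur.shape
    block = cur[y0:y0 + size, x0:x0 + size].astype(np.float64)
    best = (0, 0, np.inf)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = x0 + dx, y0 + dy
            if x < 0 or y < 0 or x + size > w or y + size > h:
                continue                      # skip candidates outside the frame
            cand = ref[y:y + size, x:x + size].astype(np.float64)
            mse = np.mean((block - cand) ** 2)
            if mse < best[2]:
                best = (dx, dy, mse)
    return best
```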

2) GOP level bit allocation

Note that GOP level bit allocation is conducted in the commonly-used way suggested in [13]. Suppose $R_T$ is the target bit rate, $f$ is the frame rate, $N_{GOP}$ is the GOP length, $N_{coded}$ is the number of frames that have been coded, $SW$ is the size of the smooth window, which is used to make the bit rate change more smoothly, and $R_{used}$ is the number of bits that have already been used. First, the average bits for each frame can be simply computed as:

$$ {R}_{FraAvg}=\frac{R_T}{f} $$
(6)

Then, the target bits for each frame are computed as below, consisting of two terms: the average bits per frame and the buffer status.

$$ {T}_{AvgFra}={R}_{FraAvg}+\frac{R_{FraAvg}\cdot {N}_{coded}-{R}_{used}}{SW} $$
(7)

Finally, the target bits for the current GOP are:

$$ {T}_{GOP}={N}_{GOP}\cdot {T}_{AvgFra} $$
(8)
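Equations (6)-(8) can be realized, for example, as the following sketch (the default smooth-window size is only a typical value, not mandated by the paper):

```python
def gop_target_bits(R_T, frame_rate, n_gop, n_coded, bits_used, smooth_window=40):
    """GOP-level bit budget, Eqs. (6)-(8).

    R_T          : target bit rate (bits per second).
    frame_rate   : frames per second.
    n_gop        : GOP length; n_coded: frames coded so far; bits_used: bits spent so far.
    smooth_window: SW in Eq. (7) (a typical value; the paper does not fix it here).
    """
    r_fra_avg = R_T / frame_rate                                              # Eq. (6)
    t_avg_fra = r_fra_avg + (r_fra_avg * n_coded - bits_used) / smooth_window # Eq. (7)
    return n_gop * t_avg_fra                                                  # Eq. (8)
```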
3) Frame level bit allocation

Frame level bit allocation is performed after the bits for the current GOP have been obtained. Different frames in a GOP have different influences on the subjective quality of the whole GOP. Considering that a frame with higher perceptual sensitivity should be assigned more bits, the bits are allocated to each frame based on its perceptual sensitivity $PSMF$ computed in (1). Therefore, the target bits of the $j$-th frame in the current GOP are determined as:

$$ T_{Fra}^{j} = \frac{T_{GOP} - R_{used}^{GOP}}{\sum_{NotCodedFrames} PSMF_i} \cdot PSMF_j $$
(9)

where $R_{used}^{GOP}$ denotes the bits that have already been used for coding the frames of the current GOP and $PSMF_j$ acts as the weight of the $j$-th frame in the current GOP. One can see that the higher the perceptual sensitivity of a frame, the more bits it is assigned.
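A minimal sketch of the frame level allocation in (9) is given below (the function name and argument layout are our own):

```python
def frame_target_bits(t_gop, bits_used_gop, psmf_not_coded, psmf_current):
    """Frame-level allocation, Eq. (9).

    t_gop          : target bits of the current GOP.
    bits_used_gop  : bits already spent on the coded frames of this GOP.
    psmf_not_coded : PSMF values of the not-yet-coded frames (incl. the current one).
    psmf_current   : PSMF of the frame about to be coded.
    """
    return (t_gop - bits_used_gop) / sum(psmf_not_coded) * psmf_current
```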

4) CTU level bit allocation

Similarly, different CTUs in a frame have different effects on the subjective quality of the whole frame. Hence, the target bits of the current CTU, $T_{CurrCTU}$, are allocated according to its perceptual sensitivity computed in (1) as follows.

$$ T_{CurrCTU} = \frac{T_{CurrFra} - Bit_{header} - Coded_{Fra}}{\sum_{NotCodedCTUs} PSM_i} \cdot PSM_{CurrCTU} $$
(10)

where $Bit_{header}$ is the estimated number of header bits (including slice header, MVs, prediction modes, etc.), estimated from the actual header bits of previously coded pictures belonging to the same level, and $Coded_{Fra}$ is the number of bits already consumed by the coded CTUs of the current frame.
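Analogously, the CTU level allocation in (10) can be sketched as follows (function name and interface are our own assumptions):

```python
def ctu_target_bits(t_frame, bits_header, bits_coded_ctus,
                    psm_not_coded, psm_current):
    """CTU-level allocation, Eq. (10).

    t_frame         : target bits of the current frame.
    bits_header     : estimated header bits (from previously coded frames of the same level).
    bits_coded_ctus : bits already spent on the coded CTUs of this frame.
    psm_not_coded   : PSM values of the not-yet-coded CTUs (incl. the current one).
    psm_current     : PSM of the CTU about to be coded.
    """
    return (t_frame - bits_header - bits_coded_ctus) / sum(psm_not_coded) * psm_current
```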

With the above bit allocation guided by perceptual sensitivity, frames and CTUs with higher perceptual sensitivity are assigned more bits, so that the perceptual quality of the reconstructed video can be improved.

3.3 An improved R-λ model

After the CTU level bit allocation, the question becomes how to determine the QP so as to meet the target bits for each CTU. For simplicity, the R-λ model [13] is used, as it has higher accuracy than the traditional rate-quantization model:

$$ \lambda = \alpha \cdot bpp^{\beta} $$
(11)

where $\alpha$ and $\beta$ are model parameters related to the characteristics of the input video; hence, different CTUs have different $\alpha$ and $\beta$ values. The initial values of $\alpha$ and $\beta$ are empirically set to 3.2003 and −1.367, respectively. Note that these initial values are not critical, since they are continually updated during the encoding process. The update principle is based on the assumption that collocated CTUs in different frames belonging to the same frame level share the same parameters $\alpha$ and $\beta$. Here $bpp$ denotes the bits per pixel, which is computed as:

$$ bpp=\frac{T_{CurrCTU}}{w\cdot h} $$
(12)

where $w$ and $h$ are the width and height of the CTU, respectively.

The target bits $T_{CurrCTU}$ for each CTU are computed by (10) and then used to compute $bpp$ via (12). Finally, the $\lambda$ value is derived from (11) and exploited to determine the QP value according to (13).

$$ QP=6.3256 \ln \lambda +21.8371 $$
(13)

To keep the visual quality consistent, the $\lambda$ value and the determined QP value of the current CTU are clipped to a narrow range, as follows.

$$ \lambda_{lastCTU} \cdot 2^{-1/3} \le \lambda_{currCTU} \le \lambda_{lastCTU} \cdot 2^{1/3}, \qquad QP_{lastCTU} - 3 \le QP_{currCTU} \le QP_{lastCTU} + 3 $$
(14)

where $\lambda_{currCTU}$ and $QP_{currCTU}$ are the $\lambda$ and QP values of the current CTU, and $\lambda_{lastCTU}$ and $QP_{lastCTU}$ are those of the previously encoded CTU.
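Putting (11)-(14) together, the per-CTU QP derivation can be sketched as follows (the rounding of QP and the guard against a non-positive bit budget are our own choices):

```python
import math

def ctu_lambda_qp(t_ctu, width, height, alpha, beta,
                  lambda_last=None, qp_last=None):
    """Derive lambda and QP for a CTU from its bit budget, Eqs. (11)-(14)."""
    bpp = max(t_ctu, 1.0) / (width * height)      # Eq. (12); guard against non-positive budgets
    lam = alpha * (bpp ** beta)                   # Eq. (11), R-lambda model
    if lambda_last is not None:                   # Eq. (14): limit lambda variation
        lam = min(max(lam, lambda_last * 2 ** (-1.0 / 3.0)),
                  lambda_last * 2 ** (1.0 / 3.0))
    qp = round(6.3256 * math.log(lam) + 21.8371)  # Eq. (13)
    if qp_last is not None:                       # Eq. (14): limit QP variation
        qp = min(max(qp, qp_last - 3), qp_last + 3)
    return lam, qp
```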

In order to adapt to the characteristics of the input video, several parameters need to be continually updated during the encoding process: the perceptual sensitivity of each CTU, $PSM_i$, the perceptual sensitivity of each frame, $PSMF$, and the model parameters $\alpha$ and $\beta$. The former two are updated once per GOP in the pre-analysis process described in Section 3.2. The latter two, $\alpha$ and $\beta$, are updated using the actual encoded $bpp$ (i.e., $bpp_{real}$), the actually used $\lambda$ value (i.e., $\lambda_{real}$) and the perceptual sensitivity values as:

$$ \begin{aligned} \lambda_{comp} &= \alpha_{old} \cdot \left( \frac{bpp_{real}}{PSM_i} \right)^{\beta_{old}} \\ \alpha_{new} &= \alpha_{old} + \delta_{\alpha} \cdot \left( \ln \lambda_{real} - \ln \lambda_{comp} \right) \cdot \alpha_{old} \\ \beta_{new} &= \beta_{old} + \delta_{\beta} \cdot \left( \ln \lambda_{real} - \ln \lambda_{comp} \right) \end{aligned} $$
(15)

where $\lambda_{comp}$ is the $\lambda$ computed from the old model parameters using the $bpp_{real}$ value and the PSM value, and $\delta_{\alpha}$ and $\delta_{\beta}$ are empirically set to 0.1 and 0.05, respectively. Note that the final values of $\alpha$ and $\beta$ are clipped to the pre-determined ranges [0.05, 20] and [−3.0, −0.1], respectively, as suggested in [13].
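A minimal sketch of the parameter update in (15), including the clipping suggested in [13], is given below (function and argument names are our own):

```python
import math

def update_model(alpha_old, beta_old, bpp_real, lambda_real, psm_i,
                 d_alpha=0.1, d_beta=0.05):
    """Update alpha and beta after coding a CTU, Eq. (15)."""
    lambda_comp = alpha_old * (bpp_real / psm_i) ** beta_old
    err = math.log(lambda_real) - math.log(lambda_comp)
    alpha_new = alpha_old + d_alpha * err * alpha_old
    beta_new = beta_old + d_beta * err
    # Clip to the ranges suggested in [13].
    alpha_new = min(max(alpha_new, 0.05), 20.0)
    beta_new = min(max(beta_new, -3.0), -0.1)
    return alpha_new, beta_new
```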

3.4 Proposed PSRC method

In summary, the proposed PSRC method can be described as below:

[Algorithm figure: overall flow of the proposed PSRC method]

4 Experimental results and discussions

To evaluate its performance, the proposed perceptual sensitivity-based rate control (PSRC) method is incorporated into the HEVC reference software (HM10.0) and tested on multiple commonly-used sequences. The test conditions are the standard Low Delay (IPPP structure) setting and the Random Access setting suggested in [4]. The QP value is set to 22, 27, 32 and 37, respectively, and the first 100 frames of each test sequence are encoded. For each sequence under each QP, the target bit rate is set to the bit rate produced by the same encoding configuration with rate control disabled, and the initial QP is set to the current QP. All simulation experiments are conducted on a computer with a 3.6 GHz Intel Core i7 quad-core processor, 8 GB memory, and the Windows 7 operating system.

In this work, we use the commonly-used visual quality assessment metric SSIM [25] to measure the perceptual quality of the reconstructed video, obtained by simply averaging the SSIM values of all frames. The proposed PSRC method is compared with the original rate control method in HEVC [13], and the average difference between their rate-SSIM curves is measured according to the method in [3] using the following performance indexes: ΔSSIM denotes the average SSIM change; ΔBR denotes the total bit rate change (in percentage); "+" means an increment and "−" a decrement. In addition, the bit rate error (BRE) [13] is used to indicate the mismatch between the target and the actual bit rate.

$$ BRE=\frac{\left|{R}_{\mathrm{target}}-{R}_{\mathrm{actual}}\right|}{R_{\mathrm{target}}}\times 100\% $$
(16)

where $R_{target}$ and $R_{actual}$ are the target bit rate and the actual output bit rate, respectively. A lower BRE value indicates a better match between the target and the actual output bit rate.
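For example (a purely hypothetical illustration, not a result from the experiments), if the target bit rate is 1000 kbps and the actual output bit rate is 1018 kbps, then BRE = |1000 − 1018| / 1000 × 100 % = 1.8 %.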

Table 1 shows the performance of the proposed PSRC method for HEVC under the low delay and random access settings, respectively, compared with the original rate control in HEVC [13]. One can see that the proposed PSRC method achieves, on average, a 6.94 % bit rate reduction and a 0.0077 SSIM improvement under the low delay setting. Under the random access setting, an average bit rate saving of 5.71 % and an SSIM increment of 0.0058 are obtained. In addition, Table 2 shows the bit rate error of the proposed PSRC method and of the original rate control in HEVC [13] under the low delay and random access settings, respectively. The proposed PSRC method achieves a bit rate mismatch between target and actual output bits similar to that of the original rate control in HEVC [13]. Moreover, Figs. 1 and 2 show examples of frames reconstructed by the original rate control in HEVC [13] and by the proposed PSRC method under the Low Delay and Random Access settings, respectively. One can see that the proposed PSRC method effectively allocates the bits according to the perceptual sensitivity of the input video, significantly reducing the bits while keeping similar perceptual quality. Similar observations hold for the other sequences. Therefore, the proposed PSRC method effectively improves the perceptual coding performance of HEVC. Another merit of the proposed PSRC method is its negligible computational overhead, since the complexity of the perceptual sensitivity measurement is very small compared with the encoding process itself.

Table 1 The performance of the proposed PSRC method under the low delay and random access settings, compared with the original rate control in HEVC [13]
Table 2 The bit rate error of the original rate control in HEVC [13] and the proposed PSRC method under the low delay and random access settings
Fig. 1 The reconstructed video frame ("BQTerrace", 17th frame, QP = 32, Low Delay setting)

Fig. 2 The reconstructed video frame ("BasketballDrill", 3rd frame, QP = 27, Random Access setting)

5 Conclusions

In this paper, an efficient perceptual rate control method, called perceptual sensitivity-based rate control (PSRC), is proposed for HEVC. The perceptual coding gain is achieved by adaptively allocating the bits for each frame and each CTU in the RDO process based on their perceptual sensitivity. More specifically, a CTU with lower perceptual sensitivity, which can tolerate more distortion, is assigned fewer bits. The perceptual sensitivity is evaluated using two perceptual features that effectively reflect human perception: spatial texture complexity and temporal motion activity. Experimental results show the efficiency of the proposed PSRC method in improving the perceptual coding performance compared with the original rate control in HEVC.