1 Introduction

High efficiency video coding (HEVC) is a video compression standard finalized by the Joint Collaborative Team on Video Coding in 2013. Compared to H.264/AVC, HEVC saves approximately 50% bitrate while delivering similar reconstructed video quality [35]. The improvement in coding efficiency is achieved by several new techniques, such as advanced motion vector prediction, prediction units, transform units, and the quadtree-structured coding tree unit, also called the largest coding unit (LCU), with block sizes varying from 8 × 8 to 64 × 64 [41]. Owing to these new techniques, HEVC has become one of the most important standards in video applications.

Rate control (RC) plays a key role in video coding systems; its goal is to maintain good visual quality while meeting the constraint of the video channel bandwidth. To achieve this objective, most video coding standards incorporate RC algorithms into their encoding frameworks. RC methods can be divided into two steps. The first step is bit allocation, which aims to achieve optimal rate-distortion (R-D) performance by allocating proper bitrates at different coding levels, including the group of pictures (GOP) level, frame level, and coding unit (CU) level [4]. The second step is choosing proper quantization parameters (QPs) to achieve the allocated bits at each level. Such RC methods are applied in many video coding standards, for example TMN8 [38] in H.263 and JVT-N046 [28] in H.264/AVC.

Similar to H.264/AVC, HEVC also adopts an RC method to optimize the Lagrangian rate distortion optimization (RDO) [45] performance of coded videos. For example, the latest HEVC RC method, λ-domain RC [20, 22], is an important part of HEVC. In the λ-domain RC method, bit allocation is still driven by the mean absolute difference (MAD) of each LCU. However, video quality is ultimately judged by human eyes, and according to the human visual system (HVS), there is considerable perceptual redundancy in video frames [18]. For instance, when a person watches video frames, a region with people or moving objects (a high perceptual importance region, HPIR) receives more attention than other regions (low perceptual importance regions, LPIRs). Thus, the MAD may not be sufficiently correlated with perceptual quality [13]. Therefore, a large number of bits can be saved by reducing the perceptual redundancy in LPIRs with imperceptible loss of perceptual quality.

When people watch videos, more attention is attracted by HPIRs, e.g., human faces, figures, and moving objects [18]. The basis of perceptual video coding is that distortion in an HPIR is more likely to be perceived; thus, more bits should be assigned to these areas to maintain visual quality. However, the visual quality in an HPIR is often lower than that in an LPIR, because an HPIR usually contains texture information and fast-moving objects; compared to an LPIR, an HPIR therefore achieves a smaller visual quality improvement from equally allocated bits. Hence, it is necessary to design a perceptual importance-based RC method for HEVC that provides optimal perceptual video quality under a bandwidth constraint. Perceptual importance-based video coding quality assessment has two requirements. First, the perceptual assessment model should describe HPIRs and LPIRs with high accuracy. Second, the perceptual assessment model should have low complexity and be easy to incorporate into the video coding procedure (e.g., rate control).

The rest of this paper is organized as follows. Section 2 briefly reviews related works, and Sec. 3 describes the proposed perceptual importance classification algorithm. In Sec. 4, a perceptual importance-based RC algorithm is proposed. The experimental results and discussions are presented in Sec. 5. Finally, the conclusion of this paper is provided in Sec. 6.

2 Related works

There have already been several works on RC for video coding standards. For the previous video coding standard H.264/AVC, Liu et al. [30] presented a linear rate-quantization (R-Q) model to fit the relationship between the bitrate and the QP, together with a prediction scheme to reduce abrupt MAD fluctuations. An et al. [1] presented a primal-dual decomposition and sub-gradient projection-based method to iteratively solve the RDO problem in H.264/AVC RC, which improves the R-D performance of the RC algorithm. Dong et al. [9] proposed a context-adaptive parameter prediction scheme to improve the accuracy of the estimated MAD in the R-Q model used in H.264/AVC.

HEVC is the next-generation video coding standard, and various RC methods have been proposed for it. For instance, Choi et al. [7, 8] proposed R-Q methods, and their R-Q model was the recommended RC scheme adopted in HM6, the reference software of HEVC. Liang et al. also proposed an HEVC RC scheme based on the R-ρ model [27], where ρ is the percentage of zeros among the DCT coefficients after quantization; compared to the R-Q model, the R-ρ model shows a slight improvement in estimating the target bits. By analyzing the residual signal probability distribution of hierarchical quad-tree CUs, Lee et al. [19] proposed an RC algorithm that allocates bits to texture and non-texture content with different rate models. Subsequently, Li et al. [21, 25] found that the Lagrange multiplier (λ) is a crucial factor for RC in HEVC and proposed the R-λ model, which has a lower bitrate mismatch and better R-D performance than the R-Q and R-ρ models; owing to its outstanding rate accuracy and R-D performance, the R-λ model was adopted in the HEVC test software HM. Gao et al. [10] and Guo et al. [15] proposed temporal RDO propagation models for the HEVC bit allocation procedure, and these methods showed better coding efficiency. Li et al. [26] proposed an LCU-level HEVC bit allocation method, which achieved better R-D performance by considering the R-D characteristics of each LCU in one frame. In [6], Chen et al. proposed an optimized LCU-level low-delay RC approach for HEVC that considers the parameter distributions of the estimated R-D model, thereby efficiently improving the R-D performance. In addition to the above works, an extremely low-delay method was designed for an HEVC intra-frame RC model in [29]. Further, in [12], Gao et al. proposed a data-driven RC method that improved HEVC R-D performance through an effective initial QP selection method. A joint machine learning-based RC scheme was also proposed by Gao et al. [11] to improve the performance of the R-λ model. In our previous work [48], a new parameter updating method was proposed to improve the RC performance of the R-λ model. However, [48] has two shortcomings: (i) it utilizes only the gradient value, a spatial feature, to guide bit allocation; because temporal information is ignored, the bit allocation procedure is relatively coarse; and (ii) the gradient information does not match video perceptual importance well. To address these issues, in this work, combined temporal-spatial information is proposed to guide bit allocation, and a relationship between bitrate and perceptual importance is formulated. In addition, to preserve convergence speed, we retain the parameter updating method of [48]; it is noteworthy that only the parameter updating procedure of this work is taken from the corresponding part of [48].

As video quality is ultimately assessed by humans, video coding standards that incorporate well-designed perceptual RC algorithms have attracted considerable research interest [46]. Chadha et al. [5] proposed a rate-aware perceptual preprocessing method that can enhance visual quality with any codec and bitrate. Zeng et al. [49] developed a perceptual sensitivity scheme to guide bit allocation. Zhu et al. [52] designed a perceptual RDO scheme in which a CNN-based online training method was first explored to determine the VMAF-related distortion estimation coefficient. Recently, Zhou et al. [51] established an SSIM-based rate-distortion model and transformed it into a global optimization problem to guide the LCU-level RC of HEVC. A weight-based R-λ perceptual RC scheme was presented by Li et al. [24]; based on the observation that faces draw more attention in conventional video, an eye-tracking weight map was utilized in the bit allocation procedure. In [23], the researchers argue that visual saliency can represent the probability of human attention; hence, graph-based visual saliency was utilized to adjust the QP, assigning fewer bits to regions with a low probability of visual attention. Bai et al. [2] used average saliency to weight the bit allocation of each LCU. The ROI concept is also often utilized to guide bit allocation in RC: in [32], the coding blocks in the ROI were encoded with a lower QP to improve perceptual quality. In addition to spatial complexity, the temporal complexity of a video sequence is an important factor in perceptual bit allocation for rate control. Recently, Wei et al. [43] used static and dynamic perceptual features to control bit allocation. Wang et al. [42] also proposed a masking effect-based RC method that considers temporal and spatial information. However, the bitrate accuracy of these models is relatively poor. Gong et al. [14] used a temporal-layer-motivated method to guide bit allocation and achieved better rate control results under the random access configuration of HEVC. In [44], Wei et al. used spatial/temporal visual saliency to guide the LCU-level bit allocation in HEVC, with the distortion of each LCU weighted by the corresponding saliency. All of the above perceptual methods use spatial/temporal perceptual factors as per-region weights to represent perceptual quality, but these factors do not match perceptual importance very well. In addition, these algorithms do not balance bit allocation between different regions (such as HPIR and LPIR), which means the LCU-level bit allocation lacks a globally optimal distribution of bits between perceptual regions. Therefore, the LCU bitrate in HPIR might be excessive, leaving the perceptual quality of LPIR too low, or vice versa. Finally, the RC parameter updating procedure in these methods follows the method in [22], which is a first-order convergence model; its slow convergence when updating the parameters results in low bitrate accuracy.

To address the above problems, in this study, we investigate LCU-level rate control based on perceptual importance analysis and formulate the LCU-level rate control problem. The contributions of this study are as follows: (i) a simple but effective perceptual importance analysis algorithm that combines temporal-spatial information to express perceptual importance is proposed; (ii) the relationship between bitrate and perceptual importance is established and further applied in the formulation of bit allocation; (iii) a region-level bit allocation that is cast as a global optimization problem is established, which further balances the video quality of different regions; and (iv) a new model parameter updating strategy that is robust to scene variations is used in the R-λ RC model.

3 Proposed perceptual importance classification algorithm

It should be noted that a region of high perceptual importance is not always the region that attracts the most visual attention in the HVS. For instance, a region with moving objects is likely to attract visual attention; however, once the moving objects lie in random texture regions or move faster than human perception can track, viewers tend to ignore distortions in these regions, and such regions are perceptually less important [34]. In this section, we describe the proposed perceptual importance classification algorithm for video frames. As distortions in regions of high perceptual importance are easily noticed by the HVS, the proposed perceptual importance analysis algorithm is composed of three parts: a moving analysis model, a texture region distinction, and a model fusion. The moving degree of the LCUs in a frame is represented numerically by the moving analysis model. Subsequently, the texture region distinction separates LCUs according to the intensity of their texture information. After the LCU classification, a complete perceptual importance weight is decided by combining the results of the moving analysis and texture distinction models. The algorithm is described below.

3.1 Moving analysis model

As mentioned above, people are more sensitive to moving objects, especially in video applications such as video conferencing, video surveillance, and visual telephony. In these applications, viewers typically focus on moving objects, which means that moving regions in video frames attract more attention than stationary regions [37]; hence, any distortion in a moving area is easily detected, and the perceptual quality of moving regions is crucial to the overall video frame quality. As HEVC adopts the LCU as the basic coding unit, an LCU-based region moving degree (RMD) method is proposed to indicate the moving magnitude of each LCU. To obtain the RMD, each video frame is first passed through a low-pass filter whose main function is to remove high-frequency noise in the video frame; to reduce complexity, we use a 3 × 3 averaging filter with a uniform weight of 1/9. It is well known that the smaller the luminance difference between the LCUs at the same position in two consecutive frames [30], the more similar the two LCUs are; equivalently, the moving magnitude of the LCU in the current frame is low, and vice versa. Therefore, we use the luminance difference between the two LCUs at the same position in two consecutive frames to describe the RMD:

$$ {MD}_n\left(X,Y\right)=\sum \limits_{\left(i,j\right)\in {L}_n\left(X,Y\right)}\left({P}_n\left(i,j\right)-{P}_{n-1}\left(i,j\right)\right) $$
(1)

where Pn(i, j) and Pn − 1(i, j) represent the luminance pixels at location (i, j) in the current and previous frames, respectively, and Ln(X, Y) represents the LCU at location (X, Y) in frame fn.

For moving LCUs, the distortion sensitivity of the human eye decreases as the motion speed increases [17]. Thus, Ln(X, Y) is classified as moving at a normal speed (LCUMNS) if MDn(X, Y) is less than a threshold Tm; otherwise, Ln(X, Y) is classified as moving too fast (LCUMTS). The threshold Tm is defined as

$$ {T}_m=\alpha \times \frac{1}{N}\sum \limits_{L_n\left(X,Y\right)\in {f}_n}{MD}_n\left(X,Y\right) $$
(2)

where α is a scaling factor with a value of 1.2 in our experiments, and N is the number of LCUs in each frame.

After the LCUs are classified according to motion speed, we comprehensively consider the characteristics of LCUMNS and LCUMTS, and the RMD of each LCU is finally defined as:

$$ {RMD}_n\left(X,Y\right)=\left\{\begin{array}{cc}{T}_m/{MD}_n\left(X,Y\right),&\ \mathrm{if}\ {MD}_n\left(X,Y\right)>{T}_m\\ {}{MD}_n\left(X,Y\right)/{T}_m,&\ \mathrm{otherwise}\end{array}\right. $$
(3)

Defining the RMD with Eq. (3) has two advantages. First, as seen from the equation, RMDn(X, Y) approaches 0 for LCUMTS with very fast motion, whereas it approaches 1 for LCUMNS with pronounced but normal motion; this property qualitatively shows that LCUMNS has more perceptual importance than LCUMTS. Second, the moving speed is also quantitatively reflected by the equation: for LCUMTS, the greater the value of MDn(X, Y), the closer RMDn(X, Y) is to 0, which means the LCU lies in a region moving too fast and has less perceptual importance; for LCUMNS, a greater value of MDn(X, Y) means that the current LCU has a higher magnitude within the normal motion range and more perceptual importance.
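To make the moving analysis concrete, below is a minimal NumPy sketch of Eqs. (1)–(3). It assumes 64 × 64 LCUs, frame dimensions that are multiples of the LCU size, and absolute per-pixel differences in Eq. (1) (the equation sums signed differences, but the classification uses magnitudes); all function names are ours, not part of any codec API.

```python
import numpy as np
from scipy.ndimage import uniform_filter

LCU = 64      # LCU size (assumed 64x64)
ALPHA = 1.2   # scaling factor of Eq. (2)

def block_sum(a, lcu=LCU):
    """Sum an HxW array over non-overlapping lcu x lcu blocks."""
    h, w = a.shape
    return a[:h // lcu * lcu, :w // lcu * lcu] \
        .reshape(h // lcu, lcu, w // lcu, lcu).sum(axis=(1, 3))

def moving_degree(curr, prev):
    """Eq. (1): per-LCU luminance difference between two consecutive
    frames, after the 3x3 averaging (low-pass) pre-filter."""
    c = uniform_filter(curr.astype(np.float64), size=3)
    p = uniform_filter(prev.astype(np.float64), size=3)
    return block_sum(np.abs(c - p))   # absolute differences assumed

def region_moving_degree(md):
    """Eqs. (2)-(3): threshold T_m, then the RMD of each LCU."""
    t_m = ALPHA * md.mean()                       # Eq. (2)
    rmd = np.where(md > t_m,
                   t_m / np.maximum(md, 1e-9),    # LCU_MTS: T_m / MD
                   md / max(t_m, 1e-9))           # LCU_MNS: MD / T_m
    return rmd, t_m
```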

3.2 Texture region distinction

The conventional R-λ RC model adopts the MAD value of the co-located LCU at the same level in a previous frame to determine the bit allocation weight [20, 21]. However, the MAD value poorly represents HVS perception, which causes improper bit allocation and degrades perceptual quality [45]. There are several visual quality assessment metrics for measuring HVS perception [36, 39]. The work in [36] reports a strong relationship between texture characteristics and HVS perception. Hence, in this section, we propose an effective region texture degree (RTD) analysis model to measure the perceptual importance of video frames to the HVS. It not only indicates HVS perception better than MAD but is also easier to incorporate into the video compression standard.

An important characteristic of perceptual importance is that people are more likely to be attracted by texture regions with high spatial contrast than by smooth regions with low spatial contrast. However, distortions in texture regions that contain many edges are usually less noticeable [31]. Thus, more bits should be allocated to texture regions, and fewer bits may be allocated to edge or smooth regions. Bazen et al. [3] suggested that squared gradients can represent texture and can effectively separate texture regions from edge regions. Therefore, we propose a perceptual texture characterization method based on squared gradients. First, we adopt the Sobel operator to capture the texture information of the current frame. The gradient of the pixel Pi,j at position (i, j) in Ln(X, Y) is defined as:

$$ {\displaystyle \begin{array}{c}{G}_x={P}_{i-1,j+1}+2{P}_{i,j+1}+{P}_{i+1,j+1}-{P}_{i-1,j-1}-2{P}_{i,j-1}-{P}_{i+1,j-1}\\ {}{G}_y={P}_{i+1,j-1}+2{P}_{i+1,j}+{P}_{i+1,j+1}-{P}_{i-1,j-1}-2{P}_{i-1,j}-{P}_{i-1,j+1}\end{array}} $$
(4)

The squared gradients are then accumulated over each LCU as:

$$ {G}_{xx}=\sum \limits_{\left(i,j\right)\in {L}_n\left(X,Y\right)}{G}_x^2,\kern1em {G}_{yy}=\sum \limits_{\left(i,j\right)\in {L}_n\left(X,Y\right)}{G}_y^2,\kern1em {G}_{xy}=\sum \limits_{\left(i,j\right)\in {L}_n\left(X,Y\right)}{G}_x{G}_y $$
(5)

The texture coherence of the squared gradient can be calculated by [3]:

$$ \mathrm{Coh}=\frac{\sqrt{{\left({G}_{xx}-{G}_{yy}\right)}^2+4{G}_{xy}^2}}{G_{xx}+{G}_{yy}} $$
(6)

If the Coh value of an LCU is larger than the edge threshold Te, the current LCU contains excessive edge information and is classified as an edge LCU. In contrast, if the Coh value of an LCU is smaller than the texture threshold Tt, the current LCU contains little texture information and is classified as a smooth LCU. Otherwise, the LCU is classified as a texture LCU. This LCU-type decision is given as:

$$ LCU\ Type=\left\{\begin{array}{ll} Edge\ LCU,& if\ \mathrm{Coh}>{T}_e\\ {} Texture\ LCU,& if\ {T}_t<\mathrm{Coh}\le {T}_e\\ {} Smooth\ LCU,& others\end{array}\right. $$
(7)

After the texture LCU decision, some classified LCUs may be inconsistent with their neighboring LCUs. This inconsistency causes severe artifacts and significantly degrades video quality. Therefore, after classification, the consistency of each block is analyzed to rectify such inconsistencies. The rectification process is as follows: all classified LCUs are examined in raster scan order; if a texture LCU is surrounded by eight edge LCUs, the predetermined texture LCU is amended to an edge LCU, and vice versa. Other cases are amended likewise.
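A corresponding sketch of Eqs. (4)–(7) follows, reusing the per-LCU accumulation convention of the previous sketch; the threshold values T_E and T_T are illustrative placeholders, as the paper does not fix them here, and the consistency rectification pass is omitted.

```python
import numpy as np
from scipy.ndimage import sobel

LCU = 64
T_E, T_T = 0.8, 0.3   # edge / texture thresholds (illustrative values only)

def block_sum(a, lcu=LCU):
    h, w = a.shape
    return a[:h // lcu * lcu, :w // lcu * lcu] \
        .reshape(h // lcu, lcu, w // lcu, lcu).sum(axis=(1, 3))

def lcu_texture_type(frame):
    """Eqs. (4)-(7): Sobel gradients, per-LCU squared-gradient sums,
    coherence, and classification (0 = smooth, 1 = texture, 2 = edge)."""
    f = frame.astype(np.float64)
    gx = sobel(f, axis=1)    # Eq. (4), horizontal gradient
    gy = sobel(f, axis=0)    # Eq. (4), vertical gradient
    gxx = block_sum(gx * gx)                                   # Eq. (5)
    gyy = block_sum(gy * gy)
    gxy = block_sum(gx * gy)
    coh = np.sqrt((gxx - gyy) ** 2 + 4 * gxy ** 2) \
        / np.maximum(gxx + gyy, 1e-9)                          # Eq. (6)
    return np.where(coh > T_E, 2, np.where(coh > T_T, 1, 0)), coh  # Eq. (7)
```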

3.3 LCU perceptual importance weight decision

Based on the above analysis, the perceptual importance of each LCU is characterized by two perceptual degrees: RMD and RTD. To consider these two degrees comprehensively, a two-degree fusion scheme is proposed to represent perceptual weighting in the RC procedure.

First, in accordance with the texture analysis model, the RTD importance of each LCU is scored by a perceptual importance level (PIL); the results are shown in Table 1. As analyzed in Sec. 3.2, for the RTD characteristic, the most perceptually important regions are texture LCUs, which are scored level "3." The perceptual importance of smooth LCUs is one level lower than that of texture LCUs and is scored level "2." Finally, edge LCUs are assigned the lowest perceptual importance level, "1," as distortions in these regions are less noticeable. If the PIL values of co-located LCUs change dramatically across consecutive frames, visible flickering artifacts are produced. To address this problem, temporal consistency should be considered, and the RTD importance of an LCU should be adjusted accordingly. Let Δ denote the maximum allowed difference between the PIL of an LCU and that of its reference LCU in the previous frame. We set Δ to 1 in our experiments.

Table 1 RTD perceptual importance of LCUs

After the RTD-based perceptual importance level is decided (Table 1), the RMD is incorporated into the fusion scheme. A product-based fusion of the two perceptual degrees is utilized to compute the perceptual weighting factor:

$$ W= PIL\times {RMD}_n\left(X,Y\right) $$
(8)

With the proposed weighting factor W, a texture LCU containing relatively slow-moving objects obtains more perceptual importance than an edge LCU containing objects that move too fast, because distortions in regions that move too fast and in edge regions are both likely to go unnoticed. To verify the effectiveness of the proposed LCU perceptual importance weight decision method, an example of the LCU perceptual importance map for the sequence "BasketballDrive" is shown in Fig. 1. Black regions are edge LCUs, grey regions are moving-too-fast or smooth LCUs, white regions are normal-speed or texture LCUs, and lighter regions represent areas of higher perceptual importance. It can be observed that the regions of different importance are successfully classified by the proposed method.

Fig. 1

Perceptual importance map for BasketballDrive. (a) Original frame; (b) Perceptual importance map
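A short sketch of the fusion step follows: the PIL scores implement Table 1, the temporal clamp implements the Δ = 1 consistency rule above, and Eq. (8) is the final product. The type codes match the earlier texture sketch; the array-based interface is our own framing.

```python
import numpy as np

PIL_SCORE = np.array([2.0, 3.0, 1.0])   # index 0=smooth, 1=texture, 2=edge (Table 1)
DELTA = 1.0                             # max PIL change between co-located LCUs

def perceptual_weight(lcu_type, rmd, prev_pil=None):
    """Eq. (8): W = PIL x RMD, with PIL clamped to within DELTA of the
    co-located PIL in the previous frame for temporal consistency."""
    pil = PIL_SCORE[lcu_type]
    if prev_pil is not None:
        pil = np.clip(pil, prev_pil - DELTA, prev_pil + DELTA)
    return pil * rmd, pil
```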

The area under the curve (AUC) is a standard measure of the quality of a classification model; its value lies between 0.5 and 1.0, and a larger AUC represents better classification performance [47]. To evaluate the performance of the proposed perceptual importance classification algorithm more fairly, its AUC was also tested. The AUC comparison results under different bitrates and configurations are presented in Table 2. From Table 2, most of the AUC values of the proposed algorithm are larger than those of the two other algorithms, and the average AUC values of the proposed method are larger as well. This means that the proposed perceptual importance classification algorithm not only classifies the perceptually important areas effectively but also outperforms the existing similar classification algorithms.

Table 2 AUC Performance Comparison under LD and RA Configuration and Different Bitrates

4 Proposed perceptual importance based rate control algorithm

4.1 Perceptual importance based bit allocation

4.1.1 Region level bit allocation

In the proposed bit allocation scheme, the GOP-level and frame-level bit allocations use the same method as in [22]. As seen from Table 1, a high-level region requires more bits to reduce distortion. However, if the high-level region is allocated too many bits, the video quality of the low-level region degrades, as too few bits are left to encode it. Such quality degradation is inevitably perceived by the human eye. To address this problem, we propose a region-level bit allocation method for the proposed RC algorithm before presenting the LCU-level bit allocation method. Following the RTD procedure in Sec. 3, the LCUs in one frame are divided into three regions, each with the same RTD perceptual importance level. The bit allocation for the different regions is:

$$ {T}_{Texture}={T}_{Fra}\times \frac{Num_{Texture}}{Num_{Frame}} $$
(9)
$$ {T}_{Smooth}={T}_{Fra}\times \frac{Num_{Smooth}}{Num_{Frame}} $$
(10)
$$ {T}_{Edge}={T}_{Fra}\times \frac{Num_{Edge}}{Num_{Frame}} $$
(11)

where TTexture, TSmooth, and TEdge are the target bits of all LCUs in the texture, smooth, and edge regions, respectively; NumTexture, NumSmooth, and NumEdge are the numbers of LCUs in the texture, smooth, and edge regions, respectively; TFra is the target bits of the current frame; and NumFrame is the number of LCUs per frame.
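Eqs. (9)–(11) amount to splitting the frame budget in proportion to each region's LCU count; a one-function sketch, with region type codes as in the earlier texture sketch:

```python
import numpy as np

def region_bits(t_frame, lcu_type):
    """Eqs. (9)-(11): frame budget split by region LCU counts
    (type codes: 0 = smooth, 1 = texture, 2 = edge)."""
    n_frame = lcu_type.size
    return {r: t_frame * np.count_nonzero(lcu_type == r) / n_frame
            for r in (0, 1, 2)}
```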

4.1.2 LCU level bit allocation

In the R-λ model, the MAD value and the target bits of the whole frame are used to allocate bits at the LCU level. In our method, instead of the MAD used in [22], the LCU-level bit allocation follows the proposed perceptual weighting factor W and the target bits of the corresponding region. The target bits of each LCU are formulated as:

$$ {T}_{LCU,R}={T}_R\times \frac{W_{i,R}}{\sum \limits_{i=1}^{Num_R}{W}_{i,R}} $$
(12)

where TLCU,R is the target bits of the current LCU in region R; Wi,R is the weighting factor of the i-th LCU; NumR is the number of LCUs in the same region; TR is the target bits of the whole region with the same LCU type; and R ∈ {Texture, Smooth, Edge} denotes the texture, smooth, and edge regions, respectively.

In a real RC application, there will always be a mismatch between the allocated target bits and the actual encoded bits of each LCU. Thus, if the previous LCUs consume more or fewer bits than their targets, the target bits of the remaining LCUs should be compensated to maintain video quality. Accordingly, TLCU,R is improved as:

$$ {\displaystyle \begin{array}{l}{T}_{LCU,R}^I=\left\{{T}_{R, rem}+\frac{\sum_{j=1}^{i-1}\left({T}_{LCU,R,j}-{T}_{LCU,R,j}^{Act}\right)}{SW}\right\}\times {Rat}_{LCU,R}\\ {}{Rat}_{LCU,R}=\frac{W_{i,R}}{\sum \limits_{k=i}^{Nu{m}_R}{W}_{k,R}}\end{array}} $$
(13)

where \( {T}_{LCU,R}^I \) is the improved target bits of the current (i.e., i-th) LCU; TR,rem is the number of remaining bits used to encode the remaining LCUs in the same LCU-type region; TLCU,R,j and \( {T}_{LCU,R,j}^{Act} \) are the target bits estimated by Eq. (12) and the actual encoded bits of the previous LCUs, respectively; and SW denotes the size of the sliding window. In our experiments, SW = 8.
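The following sketch computes the improved target of Eq. (13) for the i-th LCU of a region, given the W values of the region's LCUs in coding order and the target/actual bits of the already-coded ones; the interface is our own framing of Eqs. (12)–(13).

```python
def lcu_target_bits(i, t_region_remaining, weights, past_targets, past_actual, sw=8):
    """Eqs. (12)-(13): weight-proportional target for the i-th LCU of a
    region, corrected by the sliding-window bit mismatch of past LCUs."""
    ratio = weights[i] / sum(weights[i:])                       # Rat_{LCU,R}
    mismatch = sum(t - a for t, a in zip(past_targets, past_actual)) / sw
    return (t_region_remaining + mismatch) * ratio              # Eq. (13)
```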

4.2 Improved R-λ parameter update model

Once the target bits of an LCU are determined, the next step is to determine an appropriate QP to achieve them. In our work, the QP of each LCU is obtained based on the R-λ model [22] as

$$ \lambda =\alpha \times {bpp}^{\beta } $$
(14)
$$ QP=4.2005\ln \lambda +13.7122 $$
(15)

where λ is the Lagrange multiplier, bpp is the number of target bits per pixel, and α and β are model parameters updated from the previously coded LCUs [22].
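Eqs. (14)–(15) translate into a few lines; rounding the QP to an integer is our assumption, as the rounding rule is not stated here.

```python
import math

def lcu_lambda_qp(bpp, alpha, beta):
    """Eqs. (14)-(15): target bits-per-pixel -> lambda -> QP."""
    lam = alpha * bpp ** beta                  # Eq. (14)
    qp = 4.2005 * math.log(lam) + 13.7122      # Eq. (15), natural log
    return lam, int(round(qp))
```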

To adapt to the different characteristics of the input video, the values of α and β should be continuously updated during the encoding process. In the conventional R-λ model [22], α and β are updated as:

$$ {\lambda}_{comp}={\alpha}_{old}{bpp}_{real}^{\beta_{old}} $$
(16)
$$ {\alpha}_{new}={\alpha}_{old}+{\delta}_{\alpha}\times \left(\ln {\lambda}_{real}-\ln {\lambda}_{comp}\right)\times {\alpha}_{old} $$
(17)
$$ {\beta}_{new}={\beta}_{old}+{\delta}_{\beta}\times \left(\ln {\lambda}_{real}-\ln {\lambda}_{comp}\right)\times \ln {bpp}_{real} $$
(18)

where λcomp and λreal are the predicted and actual lambda values, respectively; δα and δβ are learning rates set to 0.1 and 0.05, respectively; and bppreal denotes the actual consumed bits per pixel.
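For reference, Eqs. (16)–(18) translate directly into code; this is a plain transcription, with bppreal assumed positive.

```python
import math

def lms_update(alpha, beta, bpp_real, lam_real, d_a=0.1, d_b=0.05):
    """Eqs. (16)-(18): first-order (LMS) update of the R-lambda parameters."""
    lam_comp = alpha * bpp_real ** beta                     # Eq. (16)
    err = math.log(lam_real) - math.log(lam_comp)
    alpha_new = alpha + d_a * err * alpha                   # Eq. (17)
    beta_new = beta + d_b * err * math.log(bpp_real)        # Eq. (18)
    return alpha_new, beta_new
```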

Following the notation in [22], the least mean square (LMS) algorithm is adopted by the R-λ model to update the values of α and β. However, the LMS model is a first-order convergence model, which means its convergence speed when updating the parameters is relatively slow; thus, the LMS algorithm cannot always achieve the target bits accurately. In our previous work [48], we showed that the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method maintains a positive definite matrix, which avoids directly calculating the Hessian matrix; meanwhile, the inverse of this positive definite matrix can be obtained easily. Thus, the BFGS-based parameter updating algorithm achieves a more global and faster convergence than the LMS-based method. Therefore, in this work, we use the BFGS-based model to update the parameters of the R-λ model. The BFGS-based updating procedure for α and β is given as:

$$ {\alpha}_{new}={\alpha}_{old}+{\delta}_{armijo}\cdot {d}_{\alpha}\cdot {\alpha}_{old} $$
(19)
$$ {\beta}_{new}={\beta}_{old}+{\delta}_{armijo}\cdot {d}_{\beta } $$
(20)

where δarmijo is the search step size calculated by a line search process, and dα and dβ are the search directions of α and β, respectively [48].
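The exact search directions and Armijo line search are given in [48] and not reproduced here. As a hedged stand-in, the sketch below refits (α, β) with an off-the-shelf BFGS optimizer on the squared log-λ prediction error of Eq. (16), plus a small damping term (our addition) that keeps the step conservative.

```python
import math
import numpy as np
from scipy.optimize import minimize

def bfgs_update(alpha, beta, bpp_real, lam_real, damp=0.5):
    """Quasi-Newton stand-in for the update of [48]: minimize the squared
    log-lambda error over (alpha, beta) with BFGS."""
    def loss(p):
        a, b = p
        pred = math.log(max(a, 1e-9)) + b * math.log(bpp_real)
        reg = damp * ((a - alpha) ** 2 + (b - beta) ** 2)  # conservative step
        return (math.log(lam_real) - pred) ** 2 + reg
    res = minimize(loss, x0=np.array([alpha, beta]), method='BFGS')
    return float(res.x[0]), float(res.x[1])
```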

Although the BFGS-based model updates the parameters with faster convergence, dramatic changes in the bits of the LCUs caused by scene changes or violent object movement inevitably cause visible flickering artifacts. Thus, to keep the quality of the coded video consistent, neither λ nor QP should change significantly. We therefore propose a new clipping method: the λ and QP values of the current LCU are clipped to the following ranges:

$$ {\lambda}_{curr}= clip\left\{\begin{array}{c}\max \left({2}^{-1/3}\cdot {\lambda}_{pre}\Big/\frac{\sum_{i=1}^{n_{curr}}{w}_i}{n_{curr}},\kern0.5em {2}^{-2/3}\cdot {\lambda}_f\Big/\frac{\sum_{i=1}^{n_{curr}}{w}_i}{n_{curr}}\right),\\ {}\min \left\{{\lambda}_{pre},\kern0.5em \min \left({2}^{1/3}\cdot {\lambda}_{pre}\cdot \frac{\sum_{i=1}^{n_{curr}}{w}_i}{n_{curr}},\kern0.5em {2}^{2/3}\cdot {\lambda}_f\cdot \frac{\sum_{i=1}^{n_{curr}}{w}_i}{n_{curr}}\right)\right\}\end{array}\right\} $$
(21)
$$ {QP}_{curr}= clip\left\{\max \left({QP}_{pre}-\frac{\sum_{i=1}^{n_{curr}}{w}_i}{n_{curr}},\kern0.5em {QP}_f-\frac{2\cdot {\sum}_{i=1}^{n_{curr}}{w}_i}{n_{curr}}\right)\right\} $$
(22)

where λcurr is the λ value of the current LCU, λpre is the λ value of the previously encoded LCU, QPcurr is the QP of the current LCU, QPpre is the QP of the previously encoded LCU, λf (QPf) is the λ (QP) value of the current frame, and ncurr is the index of the current LCU. In the clipping procedure, the λ value is adjusted not only by λf but also by the perceptual importance weights wi. Thus, the clipping procedure not only maintains consistent visual quality but also ensures that more bits are allocated to the more perceptually important LCUs.
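A literal transcription of Eqs. (21)–(22) follows, under two assumptions: the λ being clipped is the one obtained from Eq. (14), and since Eq. (22) specifies only a lower bound, only that bound is enforced.

```python
def clip_lambda_qp(lam, qp, lam_pre, lam_f, qp_pre, qp_f, weights):
    """Eqs. (21)-(22): perceptually weighted clipping of the current LCU's
    lambda and QP; `weights` holds w_1..w_{n_curr} of the LCUs so far."""
    w_bar = sum(weights) / len(weights)     # mean perceptual weight so far
    lam_lo = max(2 ** (-1 / 3) * lam_pre / w_bar,
                 2 ** (-2 / 3) * lam_f / w_bar)
    lam_hi = min(lam_pre,
                 2 ** (1 / 3) * lam_pre * w_bar,
                 2 ** (2 / 3) * lam_f * w_bar)
    lam = min(max(lam, lam_lo), max(lam_lo, lam_hi))
    qp_lo = max(qp_pre - w_bar, qp_f - 2 * w_bar)   # Eq. (22), lower bound
    return lam, max(qp, qp_lo)
```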

To present the whole proposed rate control algorithm completely and clearly, a summary is given below:

Step 1: Initialize the GOP-level and frame-level bit allocations by the method in [22].

Step 2: Calculate the perceptual importance of the current frame (Steps 3–5):

Step 3: Calculate temporal complexity: classify the LCUs of the current frame into LCUMNS and LCUMTS by Eqs. (1)–(3).

Step 4: Calculate spatial complexity: classify the LCUs of the current frame into edge, texture, and smooth LCUs by Eqs. (4)–(7).

Step 5: Decide the perceptual importance of each LCU by Eq. (8) and Table 1.

Step 6: According to the perceptual importance results of Step 2, allocate target bits to each region (TTexture, TSmooth, and TEdge) of the current frame by Eqs. (9)–(11).

Step 7: Under the restriction of the above region-level bit allocation, allocate the bits of each LCU in the different regions (Steps 8–9):

Step 8: Calculate the target bits of each LCU (TLCU,R) by Eq. (12).

Step 9: Compensate for the bit allocation mismatch: improve TLCU,R by Eq. (13).

Step 10: Calculate the QP of each LCU by Eqs. (14)–(15).

Step 11: Clip the λ and QP values of the current LCU by Eqs. (21)–(22).

Step 12: Encode the LCUs with the calculated QPs.

Step 13: Update the model parameters by Eqs. (19)–(20).
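As a hedged end-to-end sketch, the following wires the earlier snippets together for one frame. encode_lcu(), the state dictionary carrying per-region (α, β) pairs and the previous PIL map, and the per-LCU bpp conversion are all our assumptions, not HM interfaces; the helper functions and the LCU constant from the previous sketches are assumed to be in scope.

```python
import numpy as np

def encode_frame(curr, prev, t_frame, state, encode_lcu):
    """Steps 2-13 for one frame (Step 1, the GOP/frame budget t_frame,
    follows [22])."""
    rmd, _ = region_moving_degree(moving_degree(curr, prev))          # Step 3
    lcu_type, _ = lcu_texture_type(curr)                              # Step 4
    w, pil = perceptual_weight(lcu_type, rmd, state.get('pil'))       # Step 5
    for region, t_region in region_bits(t_frame, lcu_type).items():   # Step 6
        idx = np.flatnonzero(lcu_type.ravel() == region)              # Step 7
        weights = w.ravel()[idx].tolist()
        past_t, past_a, remaining = [], [], t_region
        for i, k in enumerate(idx):
            t = lcu_target_bits(i, remaining, weights, past_t, past_a)  # Steps 8-9
            alpha, beta = state['ab'][region]
            lam, qp = lcu_lambda_qp(t / LCU ** 2, alpha, beta)        # Step 10
            # Step 11 (clip_lambda_qp) would be applied to lam and qp here
            bits, lam_real = encode_lcu(k, qp)                        # Step 12 (hypothetical)
            state['ab'][region] = bfgs_update(alpha, beta,
                                              bits / LCU ** 2, lam_real)  # Step 13
            past_t.append(t); past_a.append(bits); remaining -= bits
    state['pil'] = pil
```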

5 Experimental results

5.1 Experiment setup

To evaluate the performance of the proposed perceptual importance-based RC algorithm (named "Proposed" in this paper), we incorporated it into the HEVC reference software HM16.19 [16]; RDOQ and RDOQTS were disabled, and the remaining settings were the same as in HM16.19 [16]. Two encoder configurations were tested: lowdelay_P_main (LD) and randomaccess_main (RA). Thirteen standard test video sequences from classes B, C, D, and E were selected for evaluation and encoded under four QP values: 22, 27, 32, and 37. The first 300 frames of each test sequence were encoded. The target bitrate was obtained by encoding the video sequences with the same encoding configuration but with rate control disabled. We then compared the proposed method with several state-of-the-art RC algorithms: the default RC scheme in HM 16.10, named LI [22]; three related spatial/temporal perceptual RC methods [14, 43, 44], named Wei [44], H Wei [43], and Gong [14]; and our previous work [48], named Ye [48].

5.2 R-D performance

The aim of this paper is to improve video quality; thus, R-D performance is an important evaluation metric for the proposed method. The R-D performance was measured in terms of the Bjøntegaard delta peak signal-to-noise ratio (BD-PSNR), which indicates the quality improvement over the benchmark at the same coding bitrate; a positive value means a performance gain, while a negative value means a performance loss. In our experiments, the default rate control scheme in HM 16.10, named LI [22], was set as the benchmark. The R-D curves of the different RC methods are shown in Fig. 2. From the differences between the curves, the proposed RC method outperforms the other rate control methods at both high and low bitrates. To display the R-D performance of the proposed method clearly and comprehensively, the results of each test sequence for the four different methods under the LD and RA configurations are shown in Table 3. As seen from the table, compared to the benchmark (LI [22]), the proposed method improves the BD-PSNR by 0.48 dB and 0.30 dB on average under the LD and RA configurations, respectively; these improvements also exceed those of the other methods. Specifically, the BD-PSNR improved by 0.60 dB for the test sequence "RaceHorses" under the LD configuration and by 0.49 dB for the test sequence "Kimono1" under the RA configuration. For the "Johnny" sequence, the average BD-PSNR improvement of the proposed scheme is 0.28 dB higher than that of H Wei [43], 0.36 dB higher than that of Ye [48], and 0.12 dB higher than that of Wei [44] under the LD configuration. For the "Kimono1" sequence, the average BD-PSNR improvement of the proposed model is 0.16 dB higher than that of Gong [14], 0.10 dB higher than that of Wei [44], and 0.32 dB higher than that of Ye [48] under the RA configuration. The reasons for the quality improvement are as follows. First, although Wei [44], H Wei [43], and Gong [14] all use spatial/temporal information to guide bit allocation, the spatial/temporal information of the proposed method represents perceptual importance better. Second, none of these methods considers the bit balance between different regions, which induces a significant quality decrease in LPIR. In addition, Ye [48] considers only spatial information to guide bit allocation, so its R-D performance is not as good as that of the other methods.

Fig. 2

R-D curve comparisons. (a) and (b): LD configuration; (c) and (d): RA configuration

Table 3 R-D Performance Comparison In Terms of Y-Component BD_PSNR (dB) Against HEVC Anchor under LD and RA Configuration

As PSNR does not always match perceptual quality well, the perceptual importance weighted PSNR (EWPSNR) [50] is used to re-evaluate perceptual video quality. Table 4 presents the BD-EWPSNR results. From Table 4, because the bits are optimally allocated according to perceptual importance, the BD-EWPSNR gains of the proposed algorithm all exceed those of the other methods.

Table 4 R-D Performance Comparison In Terms of Y-Component BD_EWPSNR (dB) Against HEVC Anchor under LD and RA Configuration

5.3 Subjective quality comparison

As video quality is ultimately evaluated by humans, subjective assessments such as perceptual quality and the structural similarity (SSIM) index are important video perceptual quality evaluation metrics. Comparisons of the perceptual subjective quality of different sequences are shown in Fig. 3. The sequence "FourPeople" is encoded with QP 32 under the LD configuration, and the sequence "Cactus" is encoded with QP 37 under the RA configuration. Some selected regions are magnified for better comparison. The visual quality of the texture and motion regions encoded by the proposed model is better than that of the conventional RC methods, especially in the selected regions. In particular, the bottom of the cactus in Fig. 3(b) has more detail than the corresponding part in Fig. 3(a); in addition, the blocking artifacts on the face in Fig. 3(c) are more obvious than in the corresponding part of Fig. 3(d).

Fig. 3

Subjective results comparisons for HEVC anchor [22] and proposed method. (a) HEVC anchor [22] result for Cactus with QP = 37 under RA configuration, SSIM = 0.9036; (b) Proposed result for Cactus with QP = 37 under RA configuration, SSIM = 0.9137; (c) HEVC anchor [22] result for FourPeople with QP = 32 under LD configuration, SSIM = 0.9269; (d) Proposed result for FourPeople with QP = 32 under LD configuration, SSIM = 0.9428

Moreover, the SSIM values of the proposed algorithm are much higher than those of direct encoding with the HEVC anchor [22]. This is because, compared to conventional RC methods, the selected regions are allocated more bits by the proposed algorithm. Therefore, the visual quality of the encoded videos is improved effectively by the proposed algorithm. In addition, the SSIM results of each test sequence for the four different methods under the LD and RA configurations are shown in Table 5, where each entry is the average over the bitrates corresponding to the four QPs. From Table 5, the average SSIM value under the LD configuration is 0.9125 for our previous work Ye [48], 0.9197 for the method of Wei [44], and 0.9249 for the proposed algorithm; under the RA configuration, it is 0.9027 for Ye [48], 0.9112 for Wei [44], and 0.9179 for the proposed algorithm. From these results, the proposed algorithm outperforms Ye [48] and Wei [44] in terms of subjective quality. These results demonstrate that our method achieves not only better objective video quality but also better perceptual subjective quality compared with other state-of-the-art RC methods.

Table 5 SSIM Performance Comparison of Proposed Method Against Different Other Methods

Besides Fig. 3, we also adopted the assessment defined in Rec. ITU-R BT.500 to further evaluate subjective video quality; the assessment is called the single stimulus continuous quality scale (SSCQS). Different video resolutions were tested in the assessment procedure. Before each assessment, the observers were required to watch 10 other training videos to help them better understand the subjective quality assessment procedure. Ten observers participated in the assessment. All videos were displayed at their original resolutions to avoid the influence of scaling. Note that the uncompressed reference and test video sequences were displayed in random order. The quality rating scales for the observers were excellent (10–8.1), good (8–6.1), fair (6–4.1), poor (4–2.1), and bad (2–0.1). After the observers watched the videos, difference mean opinion scores (DMOS) were computed to reveal the difference in subjective quality between the compressed and uncompressed videos; a smaller DMOS corresponds to better subjective quality of the compressed sequence. Table 6 compares the average DMOS values of the different compressed video sequences; all values are averages over the bitrates corresponding to the four tested QPs. From this table, the DMOS values of our scheme are smaller than those of the perceptual URQ scheme and much smaller than those of the conventional R-λ scheme, especially at high resolutions. Therefore, our scheme provides higher subjective video quality. The works in [33, 40] demonstrate that there exists a perceptual redundancy degradation tolerance within which people cannot perceive significant video quality differences. From Table 6, the DMOS values of the Class E sequences are lower than those of the other classes for all methods. The reason is that the Class E test sequences are mostly video conference scenes, characterized by static backgrounds and slowly moving objects, which means the perceptual redundancy degradation tolerance of this class is higher than that of the other classes.

Table 6 Comparison In Terms Of DMOS Against HEVC and Other Methods

5.4 Bitrate accuracy comparison

In addition to minimizing coding distortion, the key objective of RC is to make the output bitrate of the video encoder match the target bitrate as closely as possible. Therefore, bitrate accuracy is another significant evaluation criterion of an RC algorithm, measured in terms of the bitrate error (BitError), defined as:

$$ BitError=\frac{\left|{R}_{tar}-{R}_{act}\right|}{R_{tar}} $$
(23)

where Rtar and Ract are the target and actual bitrates, respectively. The smaller the BitError value, the higher the achieved bitrate accuracy.
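In code, Eq. (23) is a one-line check (target bitrate assumed nonzero):

```python
def bit_error(r_tar, r_act):
    """Eq. (23): relative deviation of the actual from the target bitrate."""
    return abs(r_tar - r_act) / r_tar
```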

The averages of the four bitrate errors for each test sequence in the experiments are listed in Table 7. From this table, all rate control algorithms obtain very high bitrate accuracy. For most of the test videos, our proposed method outperforms the four other comparison methods under both coding configurations. On average, the bitrate error of the proposed method is 0.37%, 2.19%, and 2.63% lower than those of Li [22], Wei [44], and H Wei [43], respectively, under the LD configuration. Similarly, bitrate error reductions of 0.48%, 1.55%, and 0.26% were obtained under the RA configuration relative to Li [22], Wei [44], and Gong [14], respectively. Among the spatial/temporal RC methods, Gong [14] achieves a lower bitrate error than Li [22] and Wei [44]; however, the bitrate error of the proposed method is lower still. In general, the proposed RC method not only achieves better R-D performance but also obtains the smallest bitrate errors, only 0.32% and 0.29% on average under the LD and RA configurations, respectively. The first reason is that, in our proposed RC method, the conventional LMS-based parameter updating procedure is replaced by the BFGS-based procedure, which converges faster and more globally during the parameter update. The more important reason is that the proposed perceptual region-level and LCU-level bit allocation effectively balances the bitrate between HPIR and LPIR, which significantly reduces bit shortages in the allocation procedure and improves the bit accuracy of each LCU. Thus, the proposed method achieves more accurate bitrates than the other state-of-the-art methods.

Table 7 Bitrate Accuracy Comparison In Terms Of Bitrate Error (%) Against HEVC Anchor Under LD And RA Configuration

5.5 Complexity comparison

The additional computational cost of the proposed RC algorithm mainly comes from the perceptual importance weight decision. The computational complexity is measured by the encoding time ratio of the proposed rate control algorithm to the anchor RC method [22], expressed as

$$ \Delta T=\frac{T_{Prop}-{T}_{Anch}}{T_{Anch}}\times 100\% $$
(24)

where TProp and TAnch denote the encoding times of the proposed algorithm and the anchor RC method [22], respectively. A ΔT greater than 0 means the encoding complexity increases, and vice versa. The average encoding times over all test sequences were used to calculate ΔT, and the results are shown in Table 8. As seen from the table, the proposed RC method only slightly increases the encoding time, as the perceptual importance weight decision constitutes a small portion of the complexity of the entire encoding process. The average ΔT of the proposed method is slightly higher than that of the algorithm of Wei [44]; this is because the proposed method uses BFGS to update the RC parameters, which requires extra computation due to its iterative nature.

Table 8 Complexity Comparison of Proposed Method Against RC Method [22] Anchor Under LD And RA Configurations

6 Conclusions

In this paper, we present a perceptual importance-based RC scheme for HEVC. Based on HVS theory, a formulation of spatial and temporal perceptual importance is developed at low computational cost. A fusion method is then utilized to build a comprehensive perceptual importance model, which is formulated as a weighting factor representing the perceptual importance of each LCU. Furthermore, a new RC scheme is designed from the region level down to the LCU level. To improve the bitrate accuracy of the proposed RC method, a BFGS-based parameter updating method replaces the conventional R-λ parameter updating procedure. The experimental results verify that the proposed RC method outperforms state-of-the-art RC methods in terms of both subjective visual quality and objective quality. In particular, compared to the conventional HEVC R-λ RC model, the proposed method not only maintains a lower bit error but also achieves 0.48 dB and 0.30 dB BD-PSNR gains under the LD and RA coding configurations, respectively. In addition, the proposed method increases coding complexity only negligibly.