
8.1 Introduction

Context-Based Adaptive Binary Arithmetic Coding (CABAC) [46] is a form of entropy coding used in H.264/AVC [63] and also in HEVC [64]. Entropy coding is a lossless compression scheme that exploits the statistical properties of the data, such that the number of bits used to represent a symbol is proportional to the negative logarithm of its probability. For instance, when compressing a string of characters, frequently used characters are each represented by a few bits, while infrequently used characters are each represented by many bits. From Shannon’s information theory [72], when the compressed data is represented in bits {0,1}, the optimal average code length for a character with probability p is − log2 p.
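
As a small illustration of this relationship, the following sketch evaluates the ideal code length − log2 p for a few characters; the probability values are arbitrary examples and not taken from any particular source.

```python
import math

# Arbitrary example probabilities for three characters (illustrative only).
probabilities = {"e": 0.30, "t": 0.15, "z": 0.01}

for symbol, p in probabilities.items():
    # Optimal code length in bits according to Shannon: -log2(p)
    print(f"'{symbol}': p = {p:.2f} -> optimal length = {-math.log2(p):.2f} bits")
# 'e': ~1.74 bits, 't': ~2.74 bits, 'z': ~6.64 bits
```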

Entropy coding is performed at the last stage of video encoding (and first stage of video decoding), after the video signal has been reduced to a series of syntax elements. Syntax elements describe how the video signal can be reconstructed at the decoder. This includes the method of prediction (e.g., spatial or temporal prediction) along with its associated prediction parameters as well as the prediction error signal, also referred to as the residual signal. Note that in HEVC only the syntax elements belonging to the slice segment data are CABAC encoded. All other high level syntax elements are coded either with zero-order Exponential (Exp)-Golomb codes or fixed-pattern bit strings. Table 8.1 shows the syntax elements that are encoded with CABAC in HEVC and H.264/AVC. For HEVC, these syntax elements describe properties of the coding tree unit (CTU), prediction unit (PU), and transform unit (TU), while for H.264/AVC, the equivalent syntax elements have been grouped together along the same categories in Table 8.1. For a CTU, the related syntax elements describe the block partitioning of the CTU into coding units (CU), whether the CU is intra-picture (i.e., spatially) predicted or inter-picture (i.e., temporally) predicted, the quantization parameters of the CU, and the type (edge or band) and offsets for sample adaptive offset (SAO) in-loop filtering performed on the CTU. For a PU, the syntax elements describe the intra prediction mode or the motion data. For a TU, the syntax elements describe the residual signal in terms of frequency position, sign and magnitude of the quantized transform coefficients.

Table 8.1 CABAC coded syntax elements in HEVC and H.264/AVC

This chapter describes how CABAC entropy coding has evolved from H.264/AVC to HEVC. While high coding efficiency is important for reducing the transmission and storage cost of video, processing speed and area cost also need to be considered in the development of HEVC in order to handle the demand for higher resolutions and frame rates in future video coding systems. Accordingly, both coding efficiency and throughput improvement tools are discussed. Section 8.2 provides an overview of CABAC entropy coding. Section 8.3 explains the design considerations and techniques used to address both coding efficiency and throughput requirements. Sections 8.4–8.7 describe how these techniques were applied to coding tree unit coding, prediction unit coding, transform unit coding and context initialization, respectively. Section 8.8 compares the coding efficiency, throughput and memory requirements of HEVC and H.264/AVC for both common conditions and worst case conditions.

8.2 CABAC Overview

The CABAC algorithm was originally developed within the joint H.264/AVC standardization process of ITU-T Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG). In a first preliminary version, the new entropy-coding method of CABAC was introduced as a standard contribution [44] to the ITU-T VCEG meeting in January 2001. CABAC was adopted as one of two alternative methods of entropy coding within the H.264/AVC standard. The other method specified in H.264/AVC was a low-complexity entropy coding technique based on the usage of context-adaptively switched sets of variable-length codes, so-called Context-Adaptive Variable-Length Coding (CAVLC). Compared to CABAC, CAVLC offers reduced implementation cost at the price of lower compression efficiency. Typically, the bit-rate overhead for CAVLC relative to CABAC is in the range of 10–16 % for standard definition (SD) interlaced material, encoded at Main Profile, and 15–22 % for high definition (HD) 1080p material, encoded at High Profile, both measured at the same objective video quality and for the case that all other used coding tools within the corresponding H.264/AVC Profile remain the same [46, 48].

CABAC also became part of the first HEVC test model HM1.0 [53], together with the so-called low-complexity entropy coding (LCEC) as a follow-up of CAVLC. Later, during the HEVC standardization process, it turned out that in order to improve the compression efficiency of LCEC, the complexity of LCEC had to be increased to a point where LCEC no longer had a significant complexity advantage over CABAC. Thus, CABAC in its improved form, with respect to both throughput speed and compression efficiency, became the single entropy coding method of the HEVC standard.

The basic design of CABAC involves the key elements of binarization, context modeling, and binary arithmetic coding. These elements are illustrated as the main algorithmic building blocks of the CABAC encoding block diagram in Fig. 8.1. Binarization maps the syntax elements to binary symbols (bins). Context modeling estimates the probability of each non-bypassed (i.e., regular coded) bin based on some specific context. Finally, binary arithmetic coding compresses the bins to bits according to the estimated probability.

Fig. 8.1 CABAC block diagram (from the encoder perspective): Binarization, context modeling (including probability estimation and assignment), and binary arithmetic coding. In red: Potential throughput bottlenecks, as further discussed from the decoder perspective in Sect. 8.3.2

8.2.1 Binarization

The coding strategy of CABAC is based on the finding that a very efficient coding of non-binary syntax element values in a hybrid block-based video coder, like components of motion vector differences or transform coefficient level values, can be achieved by employing a binarization scheme as a kind of preprocessing unit for the subsequent stages of context modeling and arithmetic coding. In general, a binarization scheme defines a unique mapping of syntax element values to sequences of binary symbols, so-called bins, which can also be interpreted in terms of a binary code tree. The design of binarization schemes in CABAC both for H.264/AVC and HEVC is based on a few elementary prototypes whose structure enables fast implementations and which are representatives of some suitable model-probability distributions.

Table 8.2 Examples of different binarizations

Several different binarization processes are used in HEVC, including k-th order truncated Rice (TRk), k-th order Exp-Golomb (EGk), and fixed-length (FL) binarization. Some of these binarization forms, including the truncated unary (TrU) scheme, which is the zero-order case of TRk binarization, were also used in H.264/AVC. These various methods of binarization can be explained in terms of how they would signal an unsigned value N. Examples are also provided in Table 8.2.

  • Unary coding involves signaling a bin string of length N + 1, where the first N bins are 1 and the last bin is 0. The decoder searches for a 0 to determine when the syntax element is complete. For the TrU scheme, truncation is invoked for the largest possible value cMaxFootnote 1 of the syntax element being decoded.

  • k-th order truncated Rice is a parameterized Rice code that is composed of a prefix and a suffix. The prefix is a truncated unary string of the value N >> k, where the largest possible value is cMax. The suffix is a fixed-length binary representation of the least significant bins of N; k indicates the number of least significant bins. Note that for k = 0, the truncated Rice is equal to the truncated unary binarization.

  • The k-th order Exp-Golomb code has proven to be a robust, near-optimal prefix-free code for geometrically distributed sources with unknown or varying distribution parameter. Each codeword consists of a unary prefix of length \(l_{N} + 1\) and a suffix of length \(l_{N} + k\), where \(l_{N} = \lfloor \log _{2}((N >> k) + 1)\rfloor \) [46].

  • Fixed-length code uses a fixed-length bin string with length \(\lceil \log _{2}(\mathrm{cMax} + 1)\rceil \) and with most significant bins signaled before least significant bins.
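
To make these prototypes concrete, the following sketch implements the four binarization forms in a simplified manner; the handling of truncation at cMax is simplified relative to the exact rules of the standard, and the parameter values in the example calls are arbitrary.

```python
import math

def trunc_unary(value, c_max):
    """Truncated unary (TrU): 'value' ones followed by a terminating zero;
    the zero is omitted when value equals cMax."""
    return "1" * value + ("0" if value < c_max else "")

def trunc_rice(value, k, c_max):
    """k-th order truncated Rice (TRk), simplified: TrU prefix of (value >> k)
    followed by a k-bit suffix holding the k least significant bits."""
    prefix = trunc_unary(value >> k, c_max >> k)
    suffix = format(value & ((1 << k) - 1), "0{}b".format(k)) if k > 0 else ""
    return prefix + suffix

def exp_golomb(value, k):
    """k-th order Exp-Golomb (EGk): unary prefix followed by a variable-length suffix."""
    bins = ""
    while value >= (1 << k):
        bins += "1"
        value -= 1 << k
        k += 1
    bins += "0"
    return bins + "".join(str((value >> i) & 1) for i in reversed(range(k)))

def fixed_length(value, c_max):
    """Fixed-length (FL): ceil(log2(cMax + 1)) bins, most significant bin first."""
    n_bins = math.ceil(math.log2(c_max + 1))
    return format(value, "0{}b".format(n_bins)) if n_bins > 0 else ""

for n in range(6):
    print(n, trunc_unary(n, 5), trunc_rice(n, 1, 7), exp_golomb(n, 0), fixed_length(n, 7))
```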

The binarization process is selected based on the type of syntax element. In some cases, binarization also depends on the value of a previously processed syntax element (e.g., binarization of coeff_abs_level_remaining depends on the previously decoded coefficient levels) or slice parameters that indicate if certain modes are enabled (e.g., binarization of partition mode, so-called part_mode, depends on whether asymmetric motion partition is enabled). The majority of the syntax elements use the binarization processes as listed above, or some combination of them (e.g., cu_qp_delta_abs uses TrU (prefix) + EG0 (suffix) [98]). However, certain syntax elements (e.g., part_mode and intra_chroma_pred_mode) use custom binarization processes.

During the HEVC standardization process, special attention has been put on the development of an adequately designed binarization scheme for absolute values of transform coefficient levels. In order to guarantee a sufficiently high throughput, the goal here was the maximization of bypass-coded bins under the constraint of not sacrificing coding efficiency too much. This was accomplished by making the binarization scheme adaptive based on previously coded transform coefficient levels. More details on that are given in Sect. 8.6.5.

8.2.2 Context Modeling, Probability Estimation and Assignment

By decomposing each non-binary syntax element value into a sequence of bins, further processing of each bin value in CABAC depends on the associated coding-mode decision, which can be either chosen as the regular or the bypass mode (as described in Sect. 8.2.3). The latter is chosen for bins, which are assumed to be uniformly distributed and for which, consequently, the whole regular binary arithmetic encoding (and decoding) process is simply bypassed. In the regular coding mode, each bin value is encoded by using the regular binary arithmetic coding engine, where the associated probability model is either determined by a fixed choice, based on the type of syntax element and the bin position or bin index (binIdx) in the binarized representation of the syntax element, or adaptively chosen from two or more probability models depending on the related side information (e.g., spatial neighbors as illustrated in Fig. 8.1, component, depth or size of CU/PU/TU, or position within TU). Selection of the probability model is referred to as context modeling. As an important design decision, the latter case is generally applied to the most frequently observed bins only, whereas the other, usually less frequently observed bins, will be treated using a joint, typically zero-order probability model. In this way, CABAC enables selective adaptive probability modeling on a sub-symbol level, and hence, provides an efficient instrument for exploiting inter-symbol redundancies at significantly reduced overall modeling or learning costs. Note that for both the fixed and the adaptive case, in principle, a switch from one probability model to another can occur between any two consecutive regular coded bins. In general, the design of context models in CABAC reflects the aim to find a good compromise between the conflicting objectives of avoiding unnecessary modeling-cost overhead and exploiting the statistical dependencies to a large extent.

The parameters of probability models in CABAC are adaptive, which means that an adaptation of the model probabilities to the statistical variations of the source of bins is performed on a bin-by-bin basis in a backward-adaptive and synchronized fashion both in the encoder and decoder; this process is called probability estimation. For that purpose, each probability model in CABAC can take one out of 126 different states with associated model probability values p ranging in the interval [0.01875, 0.98125]. The two parameters of each probability model are stored as 7-bit entries in a context memory: 6 bits for each of the 63 probability states representing the model probability \(p_{\mathrm{LPS}}\) of the least probable symbol (LPS) and 1 bit for \(\nu_{\mathrm{MPS}}\), the value of the most probable symbol (MPS). The probability estimator in CABAC is based on a model of “exponential aging” with the following recursive probability update after coding a bin b at time instance t:

$$\displaystyle{ p_{\mathrm{LPS}}^{(t+1)} = \left \{\begin{array}{l@{\quad }l} \alpha {\ast} p_{\mathrm{LPS}}^{(t)}, \quad &\mbox{ if }b =\nu _{\mathrm{ MPS}},\mbox{ i.e., an MPS occurs} \\ 1 -\alpha {\ast}(1 - p_{\mathrm{LPS}}^{(t)}),\quad &\mbox{ otherwise.} \end{array} \right. }$$
(8.1)

Here, the choice of the scaling factor α determines the speed of adaptation: A value of α close to 1 results in a slow adaptation (“steady-state behavior”), while faster adaptation can be achieved for the non-stationary case with decreasing α. Note that this estimation is equivalent to using a sliding window technique [4, 65] with window size \(W_{\alpha } = (1-\alpha )^{-1}\). In the design of CABAC, Eq. (8.1) has been used together with the choice of

$$\displaystyle{ \alpha = \left (\frac{0.01875} {0.5} \right )^{ \frac{1} {63} }\mbox{ with }\min _{t}p_{\mathrm{LPS}}^{(t)} = 0.01875, }$$
(8.2)

and a suitable quantization of the underlying LPS-related model probabilities into 63 different states, to derive a finite-state machine (FSM) with tabulated transition rules [46]. This table-based probability estimation method was unchanged in HEVC, although some proposals for alternative probability estimators [1, 78] have shown average bit rate savings of 0.8–0.9 %, albeit at higher computational costs.
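
The recursion of Eqs. (8.1) and (8.2) can be illustrated with the following sketch operating on real-valued probabilities; it is a conceptual model only, since the standard realizes this recursion as a tabulated 64-state integer FSM, and the MPS/LPS swap that occurs when p_LPS would exceed 0.5 is only indicated by a comment.

```python
P_MIN = 0.01875
ALPHA = (P_MIN / 0.5) ** (1.0 / 63)        # scaling factor alpha of Eq. (8.2)

# Representative LPS probabilities of the 63 non-terminating states.
STATE_PROBS = [0.5 * ALPHA ** sigma for sigma in range(63)]

def update_p_lps(p_lps, bin_equals_mps):
    """One backward-adaptive update step according to Eq. (8.1)."""
    if bin_equals_mps:
        return max(ALPHA * p_lps, P_MIN)   # MPS observed: p_LPS decays (clamped at p_min)
    p_new = 1.0 - ALPHA * (1.0 - p_lps)    # LPS observed: p_LPS grows
    # In the actual coder, p_LPS > 0.5 is handled by swapping the MPS value instead.
    return min(p_new, 0.5)

p = 0.5
for observed_mps in (True, True, True, False, True):
    p = update_p_lps(p, observed_mps)
    print(f"p_LPS = {p:.5f}")
```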

Each probability model in CABAC is addressed using a unique context index (ctxIdx), either determined by a fixed assignment or computed by the context derivation logic by which, in turn, the given context model is specified. A lot of effort has been spent during the HEVC standardization process to improve the model assignment and context derivation logic both in terms of throughput and coding efficiency. More details on the specific choice of context models for selected syntax elements in HEVC are given in Sects. 8.4–8.6.

8.2.3 Multiplication-Free Binary Arithmetic Coding: The M Coder

Binary arithmetic coding, or arithmetic coding in general, is based on the principle of recursive interval subdivision. An initially given interval represented by its lower bound (base) L and its width (range) R is subdivided into two disjoint subintervals: one interval of width

$$\displaystyle{ R_{\mathrm{LPS}} = p_{\mathrm{LPS}} {\ast} R, }$$
(8.3)

which is associated with the LPS, and the dual interval of width \(R_{\mathrm{MPS}} = R - R_{\mathrm{LPS}}\), which is assigned to the MPS. Depending on the binary value to encode, either identified as LPS or MPS, the corresponding subinterval is then chosen as the new coding interval. By recursively applying this interval-subdivision scheme to each bin \(b_{j}\) of a given sequence \(\mathbf{b} = (b_{1},b_{2},\ldots,b_{N})\) of bins, the encoder finally determines a value \(c_{\mathbf{b}}\) in the subinterval \([L^{(N)},L^{(N)} + R^{(N)})\) that results after the Nth interval subdivision process. The (minimal) binary representation of \(c_{\mathbf{b}}\) is the arithmetic code of the input bin sequence \(\mathbf{b}\). To ensure that finite-precision registers are sufficient to represent \(R^{(j)}\) and \(L^{(j)}\) for all \(j \in \{1,2,\ldots,N\}\), a renormalization operation is required whenever \(R^{(j)}\) falls below a certain limit after one or more interval subdivision process(es). By renormalizing \(R^{(j)}\), and accordingly \(L^{(j)}\), the leading bits of the arithmetic code can be output as soon as they are unambiguously identified.

On the decoder side, the sequence of encoded binary values can be easily recovered by tracking the interval subdivision, including renormalization, according to Eq. (8.3) step-by-step and by comparing the bounds of both subintervals to the transmitted value representing the final subinterval. Note that the width \(R^{(N)}\) of the final subinterval is proportional to the product \(\prod\nolimits_{j=1}^{N}p(b_{j})\) of the individual model probabilities \(p(b_{j})\) assigned to the bins \(b_{j}\) of the bin sequence, such that for signaling the final subinterval, the lower bound of the empirical entropy of the bin sequence given by \(-\log _{2}\prod\nolimits_{j=1}^{N}p(b_{j}) = -\sum\nolimits_{j=1}^{N}\log _{2}p(b_{j})\) is approximately achieved.
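
The principle can be illustrated by the following conceptual sketch, which performs the interval subdivision of Eq. (8.3) in floating point with a fixed LPS probability; the actual coder operates with finite-precision integer registers, renormalization, and adaptive probability estimates.

```python
import math

def encode_conceptual(bins, p_lps, mps=1):
    """Infinite-precision interval subdivision: the MPS takes the lower subinterval,
    the LPS the upper one (as in CABAC). Returns the final interval [low, low + rng)."""
    low, rng = 0.0, 1.0
    for b in bins:
        r_lps = p_lps * rng                  # Eq. (8.3)
        if b == mps:
            rng -= r_lps                     # keep lower (MPS) subinterval
        else:
            low += rng - r_lps               # move to upper (LPS) subinterval
            rng = r_lps
    return low, rng

low, rng = encode_conceptual([1, 1, 0, 1], p_lps=0.2, mps=1)
# Final width = 0.8 * 0.8 * 0.2 * 0.8 = 0.1024, i.e., about 3.29 bits are needed.
print(f"interval = [{low:.4f}, {low + rng:.4f}), ideal length = {-math.log2(rng):.2f} bits")
```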

From a practical implementation point of view, the most costly operation involved in binary arithmetic coding is given by the multiplication in Eq. (8.3). Even worse, if probability estimation is based on a simple scaled-count estimator using scaled cumulative frequency counts of bins, this operation may even involve an integer division. A solution to this problem was already proposed during the H.264/AVC standardization process by using a design of a family of multiplication-free binary arithmetic coders, which later became known as the modulo coder (M coder) [43, 45]. The main innovative features of this design are given by a table-based interval subdivision coupled with the above-mentioned FSM-based probability estimation as well as a fast bypass coding mode. The former, which is also the basis of what is called the regular coding mode of the M coder, will be briefly reviewed next, followed by a short discussion of the latter aspect.

8.2.3.1 Regular Coding Mode

The basic idea of the M-coder approach of interval subdivision is to quantize the range of possible interval widths induced by renormalization into a small number of K cells. To further simplify matters, a uniform quantization with \(K = 2^{\kappa}\) is assumed to be performed, resulting in a set \(\mathbf{W} =\{ W_{0},W_{1},\cdots \,,W_{K-1}\}\) of representative interval widths. Together with the representative set of LPS-related probability values of the FSM given by \(\mathbf{P} =\{ p_{0},p_{1},\cdots \,,p_{N-1}\}\), this quantization enables the approximation of the multiplication on the right-hand side of Eq. (8.3) by means of a table of K × N pre-calculated product values \(\{W_{k} {\ast} p_{n}\,\vert \,0 \leq k < K;\ 0 \leq n < N\}\) in a suitably chosen integer precision. The entries of the corresponding 2-D lookup table TabRangeLPS are addressed by the (probability) state index n and the quantization cell index k(R) related to the given value of the interval range R. Computation of k(R) is easily carried out by a concatenation of a bit shift and a bit-masking operation, where the latter can be interpreted as a modulo operation using the operand \(K = 2^{\kappa}\), hence the naming of the family of coders.

In the context of H.264/AVC, the optimal empirical choice of the free parameters κ = 2 and N = 64 was determined under the constraint of a maximum table size of \(2^{\kappa} \cdot N \leq 256\) bytes for the lookup table TabRangeLPS with each of its entries being represented with 8 bits. This specific M-coder design of using a lookup table TabRangeLPS with 4 × 64 entries was also adopted for HEVC. Please note that by choosing a value of κ = 0, the 2-D table TabRangeLPS degenerates to a 1-D table, where for all possible values of R only one single representative interval width value W is used for each of the N product values \(p_{n} {\ast} R\), where 0 ≤ n < N. This choice is equivalent to the subinterval division operation performed in the Q coder and its derivatives of QM and MQ coder, as has been standardized in JBIG, JPEG, and JPEG2000. Thus, the M-coder design can be interpreted as a generalization of the Q-coder family.Footnote 2 Compared to the QM/MQ coder, the M coder, being configured as in H.264/AVC and HEVC, achieves an increase in throughput of 18 %, while at the same time it provides bit-rate savings of 2–4 %, when evaluated in the CABAC environment of H.264/AVC [43]. Interestingly, the throughput improvements of the M coder can be largely attributed to its unique bypass functionality, as reviewed in the next subsection, while its use of a larger lookup table for interval subdivision generates the main effects in coding-efficiency gain; however, this increased table size can also adversely affect the overall throughput gain of the M coder.
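
A decoder-side sketch of a single regular M-coder decoding step is given below; the quantization cell index is obtained by the shift-and-mask operation described above, the single TabRangeLPS row shown is purely illustrative (not the normative table values), and the probability state update is omitted.

```python
# Illustrative (non-normative) table row: R_LPS values for one probability state,
# indexed by the quantization cell k(R); the real coder uses a 64 x 4 table.
TAB_RANGE_LPS_STATE = [144, 176, 208, 240]

def decode_regular_bin(rng, offset, mps, read_bit):
    """One regular bin decode with table-based interval subdivision (state update omitted)."""
    q = (rng >> 6) & 3                       # k(R): bit shift plus bit mask, K = 2^2 cells
    r_lps = TAB_RANGE_LPS_STATE[q]           # pre-computed W_k * p_n
    rng -= r_lps                             # tentative MPS subinterval
    if offset >= rng:                        # offset falls into the LPS subinterval
        bin_val = 1 - mps
        offset -= rng
        rng = r_lps
    else:
        bin_val = mps
    while rng < 256:                         # renormalization keeps R a 9-bit value
        rng = rng << 1
        offset = (offset << 1) | read_bit()
    return bin_val, rng, offset

# Usage with a canned bit source:
bits = iter([0, 1, 1, 0, 0, 1, 0, 1])
print(decode_regular_bin(rng=320, offset=100, mps=1, read_bit=lambda: next(bits)))
```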

8.2.3.2 Bypass Coding Mode

As already mentioned, most of the throughput improvements of the M coder relative to the Q-coder technology can be attributed to its second innovative feature, which is given by a bypass of the probability estimation for approximately uniformly distributed bins. In addition, the interval subdivision is substituted by a hard-wired equipartition in this so-called bypass coding mode. In this way, the whole encoding/decoding process (including renormalization) can be realized by nothing more than a bit shift, a comparison, and for half of the symbols an additional subtraction.
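
A corresponding decoder-side sketch of bypass decoding is shown below; it follows the equipartition principle just described and is not the normative pseudo-code.

```python
def decode_bypass_bin(rng, offset, read_bit):
    """Bypass bin decode: the interval is split into two halves of equal width,
    realized by a shift, a comparison and, for bin value 1, a subtraction."""
    offset = (offset << 1) | read_bit()      # shift one new bit into the offset
    if offset >= rng:                        # upper half of the interval
        return 1, rng, offset - rng
    return 0, rng, offset                    # lower half; no range update needed

bits = iter([1, 0, 1])
rng, offset = 480, 200
for _ in range(3):
    b, rng, offset = decode_bypass_bin(rng, offset, lambda: next(bits))
    print(b)
```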

Bypass coding has become an even more important feature during the HEVC standardization process. While in H.264/AVC bypass coding was mainly used for signs and least significant bins of absolute values of quantized transform coefficients, in HEVC the majority of possible bin values is handled through the bypass coding mode. As noted above, this is also a consequence of carefully designed binarization schemes, which already serve as a kind of near-optimal prefix-free codes of the corresponding syntax elements.

8.2.3.3 Fast Renormalization

One of the major throughput bottlenecks in any arithmetic encoding and decoding process is given by the renormalization procedure. Renormalization in the M coder is required whenever the new interval range R after interval subdivision no longer stays within its admissible domain. Each time a renormalization operation must be carried out, one or more bits can be output at the encoder or, equivalently, have to be read by the decoder. This process, as it is specified in H.264/AVC and HEVC, is performed bit-by-bit and is controlled by some conditional branches to check each time if further renormalization loops are required. Both conditional branching and bitwise processing, however, constitute considerable obstacles to a sufficiently high throughput.

As a mitigation of this problem, a fast renormalization policy for the M coder was proposed in [48]. By replacing the conventionally bitwise performed operations in the regular coding mode with byte-wise or word-wise processing, a considerably increased decoder throughput of around 25 % can be achieved. The corresponding non-normative, fully standard-compliant changes were integrated into the reference software implementations of both H.264/AVC and HEVC. For more details, please refer to [47, 48].

8.2.3.4 Termination

For termination of the arithmetic codeword in the M coder a special, non-adapting probability state is reserved. The corresponding probability state index is given by n = 63 and the corresponding entries of TabRangeLPS deliver a constant value of \(R_{\mathrm{LPS}} = 2\). As a consequence, for each terminating syntax element, such as end_of_slice_segment_flag, end_of_sub_stream_one_bit, or pcm_flag, 7 bits of output are generated in the renormalization process. Two more bits need to be flushed in order to properly terminate the arithmetic codeword. Note that the least significant bit in this flushing procedure, i.e., the last written bit at the encoder, is always equal to 1 and thus represents the so-called rbsp_stop_one_bit. Before packaging of the bitstream, the arithmetic codeword is filled up for byte alignment with zero-valued alignment bits.

8.3 Design Considerations

Most of the proposals submitted to the joint Call for Proposals on HEVC in April 2010 already included some form of advanced entropy coding. Some of those techniques were based on improved versions of CAVLC or CABAC, others were using alternative methods of statistical coding, such as V2V (variable-to-variable) codes [31] or PIPE (probability interval partitioning entropy) codes [50, 51, 102], and a third category introduced increased capabilities for parallel processing on a bin level [84], syntax element level [94, 96], or slice level [25, 28, 105]. In addition, improved techniques for coding of transform coefficients, such as zero-tree representations [2], alternate scanning schemes [40], or template-based context models [55, 102], were proposed.

After an initial testing phase of video coding technology from the best performing HEVC proposals, it was decided to start the first HEVC test model (HM1.0) [53] with two alternate configurations similar to what was given for entropy coding in H.264/AVC: a high efficiency configuration based on CABAC and a low-complexity configuration based on LCEC as a CAVLC surrogate. Interestingly enough, the CABAC-based entropy coding of HM1.0 already included techniques for improving both coding efficiency and throughput relative to its H.264/AVC-related predecessor. To be more specific, a template-based context modeling scheme for larger transform block sizes [49, 55] and a parallel context processing technique for selected syntax elements of transform coefficient coding [10] were already part of HM1.0. During the subsequent collaborative HEVC standardization phase, more techniques covering both aspects of coding efficiency and throughput were integrated, as will be discussed in more details in the following.

While CABAC inherently targets high coding efficiency, its data dependencies can cause it to become a throughput bottleneck, especially at high bit rates, as was already analyzed in the context of H.264/AVC [95]. This means that, without any further provision, it might have been difficult to support the growing throughput requirements for future video codecs. Furthermore, since high throughput can be traded off for power savings using voltage scaling [14], the serial nature of CABAC may limit the battery life for video codecs that reside on mobile devices. This limitation is a critical concern, as a significant portion of video codecs today run on battery-operated devices. Accordingly, both coding efficiency and throughput improvement tools as well as the trade-off between these two requirements were investigated in the standardization of entropy coding for HEVC. The trade-off between coding efficiency and throughput comes from the fact that, in general, dependencies are a result of removing redundancy which, in turn, improves coding efficiency; however, increasing dependencies usually makes parallel processing more difficult which, as a consequence, may degrade throughput. This section describes the various techniques used to improve both coding efficiency and throughput of CABAC entropy coding for HEVC.

8.3.1 Brief Summary of HEVC Block Structures and CABAC Coding Efficiency Improvements

In the evolutionary process from H.264/AVC to HEVC, improved coding efficiency for CABAC entropy coding was addressed in a number of proposals, such as [24, 102, 106]. The majority of coding-efficiency related CABAC proposals in the HEVC standardization process was oriented towards transform coefficient coding, since at medium to high bit rates the dominant part of bits is consumed by syntax elements related to residual coding. As a consequence, this subsection will focus on considerations that were made with regards to the specific CABAC design for those syntax elements. Note, however, that due to the more consistent design of HEVC in terms of tree structures for both partitioning of prediction blocks and transform blocks, special care has also been taken to ensure an efficient modeling and coding of the corresponding tree structuring elements. In addition, for new coding tools in HEVC, such as block merging and sample adaptive offset (SAO) in-loop filtering, additional assignments of binarization and context modeling schemes were needed.

Transform coding in HEVC is based on a tree-structured variable block size approach with the corresponding quadtree structure referred to as residual quadtree (RQT) [49, 102]. RQTs are nested into the leaves of another quadtree, the so-called coding quadtree (CQT), which determines the subdivision of each block of \(2^{N} \times 2^{N}\) luma samples, referred to as a coding tree block (CTB) [49, 102]. The block partitioning for both prediction and transform coding is the same for luma and chroma picture component samples,Footnote 3 and hence, a common coding and residual quadtree syntax is used to signal the partitioning. As a result, the blocks of luma and chroma samples and associated syntax elements are grouped together in a so-called unit.

A transform unit (TU) aggregates the transform blocks (TBs) of luma and chroma samples as well as the syntax elements used to represent the associated transform coefficient levels. Each TU and the related luma and two chroma TBs are determined as a leaf of the corresponding RQT. Supported TB sizes for both luma and chroma are in the range from 4 × 4 to 32 × 32 samples, where the corresponding core transforms are separable applications of a fixed-point approximation of the 1-D Discrete Cosine Transform (DCT) for dyadically increasing lengths from 4 to 32 points [26]. An exception is given for 4 × 4 luma TBs of residual signals resulting from intra-picture predicted blocks, where instead of the DCT-like core transform a separable fixed-point approximation of the 1-D Discrete Sine Transform (DST) is used [100].

Note that a prediction unit (PU) aggregates the prediction blocks (PBs) of luma and chroma samples and the associated syntax elements like motion data. A coding unit (CU) encapsulates the luma and chroma coding block (CB) samples and the so-called prediction mode, i.e., the decision whether the corresponding samples are coded using intra-picture or inter-picture prediction, as well as some additional syntax elements. On the top level of the hierarchy, a coding tree unit (CTU) comprises the CTBs of luma and chroma samples, the associated CQT syntax structure and all CUs at the CQT leaves.

8.3.1.1 Coefficient Grouping into Subblocks

Given the larger variety of TB sizes, one of the primary goals of CABAC entropy coding for transform coefficient data in HEVC was to achieve a design that uses for all block sizes as much of the same logic and the same procedures as possible. Although at first glance this objective seems to be somehow unrelated to coding efficiency, it turns out that at least one particular element leading to such a unified design is also crucial for achieving high coding efficiency. This coding element is given by the grouping of coefficients into so-called subblocks of size 4 × 4 for transform blocks with size greater than 4 × 4. Subblocks were first proposed in [49, 55, 102] and became part of HM1.0. In the subsequent HEVC development process, their use was iteratively refined and extended in a way as will be explained in more detail in Sect. 8.6.

8.3.1.2 Hierarchy of Significance Flags

Since for most common coding conditions, a large portion of transform coefficients is quantized to zero, or equivalently, the representation of the residual signal in the DCT-/DST-like basis functions is supposed to be sparse, a hierarchical structured set of four different significance flagsFootnote 4 is introduced in HEVC to reduce the number of individual significance flags to be transmitted. This hierarchy of syntax elements also reflects the hierarchical processing of TBs within the RQT as well as the processing of subblocks within a given TB.

The use of so-called coded block flags (CBF), indicating the occurrence of significant, i.e., nonzero transform coefficients in a TB, was already part of H.264/AVC CABAC-based residual coding. In HEVC, this concept was extended to also cover the RQT root on the top level of the hierarchy as well as the subblock on a lower level of the hierarchy. Consequently, there are a rqt_root_cbf, at least for RQT roots in inter-predicted CUs, cbf_luma, cbf_cb, and cbf_cr for the visited TBs of the three color components, and a coded_sub_block_flag (CSBF) for each visited subblock in a TB. On the lowest level of the hierarchy, for each visited subblock a so-called significance map indicates the location of nonzero coefficients for each scan position in a subblock.

This hierarchy of significance flags is complemented by the syntax elements indicating the last significant scan position in a TB, which somehow serve as an entry point into each significant TB and which is equivalent to signaling the insignificance of a partial area of a TB. The latter concept differs from H.264/AVC, where for each significant_coeff_flag (SIG) with a value of one, a last_significant_coefficient_flag (LAST) is signaled indicating if the current scan position is the last nonzero coefficient inside the TB. Note that this latter signaling scheme is equivalent to using a TrU binarization (with inverted bin values) for the number of nonzero coefficients in a TB, such that each bin of the resulting bin sequence is intertwined with the corresponding nonzero significance flag. This design aspect of mixing two flags on a bin level in H.264/AVC was later found to be critical in terms of throughput, as will be discussed in Sect. 8.6.

8.3.1.3 Context Modeling for Coding of Significance Flags

Particular care has been taken to properly specify the context models for coding of significance flags. For instance, modeling of the CBF is based on the RQT depth, while that for the CSBF is using neighboring CSBF information. For coding of the significance map, which typically consumes most of the bits in HEVC transform coding, additional dependencies between neighboring elements have been exploited, at least for TBs larger than 4 × 4. Initially, for that purpose a local template was proposed [49, 55, 102] and adopted for HM1.0. Although this design provides high coding efficiency, it introduces some critical data dependencies. As a solution to this problem, a combination of position-based information (as used in H.264/AVC) and template-based neighborhood information was finally adopted for context modeling of significance map entries [41, 77]. This particular example also illustrates how both aspects of coding efficiency and throughput were considered during the HEVC standardization process in a balanced way. More on the throughput aspects is given in the next subsection, while the details of context modeling for all syntax elements related to residual coding are provided in Sect. 8.6.

Fig. 8.2 Three key operations in CABAC (from a decoder perspective): Binarization, Context Modeling/Selection and (Binary) Arithmetic Coding. Feedback loops in the decoder are highlighted with dashed lines

8.3.2 CABAC Throughput Bottlenecks

CABAC, as originally designed for H.264/AVC and also, as initially selected for the HEVC standardization starting point in HM1.0, has some serious throughput issues (particularly for decoder implementations at higher bit rates) [80, 95]. The throughput of CABAC is determined based on the number of binary symbols (bins) that it can process per second. The throughput can be improved by increasing the number of bins that can be processed in a cycle. However, the data dependencies in CABAC make processing multiple bins in parallel difficult and costly to achieve. These dependencies result in feedback loops in the CABAC decoder as shown in Fig. 8.2, and can be described as follows:

  1. The updated range is fed back for recursive interval subdivision.

  2. The updated context is fed back for probability estimation.

  3. The context modeler selects the probability model based on the type of syntax element and, as already noted above, for selected syntax elements, based on some derivation process that involves other previously decoded bin values or other relevant side information. At the decoder, for non-binary syntax elements, the decoded bin value is fed back to determine whether to continue processing the same syntax element or to switch to another syntax element. If a switch occurs, the value of the decoded bin may also be used to determine which syntax element to decode next.

  4. The context modeler may also select the probability model based on the bin position in the syntax element (binIdx). At the decoder, the decoded bin value is fed back to determine whether to increment binIdx and continue to decode the current syntax element, or set binIdx equal to 0 and switch to another syntax element.

Note that the feedback loops have different degrees of impact on throughput. The range update (1) and context update (2) feedback loops are simpler than the context modeling loops (3, 4) and thus do not affect throughput as severely. If the context of a bin depends on the value of another bin being decoded in parallel, then speculative computations are required, which increases area cost and critical path delay [94]. The amount of speculation can grow exponentially with the number of parallel bins, which limits the throughput that can be achieved [80]. Figure 8.3 shows an example of the speculation tree for significance map in H.264/AVC. Thus the throughput bottleneck is primarily due to the context modeling dependencies.

Fig. 8.3 Context speculation required to achieve 5× parallelism when processing the significance map in H.264/AVC. Notation: i = coefficient position; i1 = MaxNumCoeff(BlockType) − 1; EOB = end of block; SIG = significant_coeff_flag; LAST = last_significant_coeff_flag

8.3.3 Summary of Techniques for CABAC Throughput Improvements

Several techniques were used to improve the throughput of CABAC in HEVC [88]. There was a lot of effort spent in determining how to use these techniques with minimal coding loss. They were applied to various parts of entropy coding in HEVC and will be referred to throughout the rest of this chapter.

8.3.3.1 Reduce Regular Coded Bins

The throughput is limited for regular coded bins due to the data dependencies described in Sect. 8.3.2. However, it is easier to process bypass coded bins in parallel since they do not have the data dependencies related to context modeling (i.e., feedback loops 2, 3 and 4 in Fig. 8.2). In addition, arithmetic coding for bypass bins is simpler as it only requires a right shift versus a table look up for regular coded bins. Thus, the throughput can be improved by reducing the number of regular coded bins and using bypass coded bins instead [16, 54, 58, 59].

8.3.3.2 Group Bypass Coded Bins

Multiple bypass bins can be processed in the same cycle only if they occur consecutively within the bitstream. Thus, bins should be reordered such that bypass coded bins are grouped together in order to increase the likelihood that multiple bins are processed per cycle [19, 67, 87].

8.3.3.3 Group Bins with Same Context

Processing multiple regular coded bins in the same cycle often requires speculative calculations for context modeling. The amount of speculative computations increases if bins using different contexts and context modeling logic are interleaved, since numerous combinations and permutations must be accounted for. Thus, to reduce speculative computations, bins should be reordered such that bins with the same contexts and context modeling logic are grouped together so that they are likely to be processed in the same cycle [9, 10, 73]. This also reduces context switching resulting in fewer memory accesses, which also increases throughput and reduces power consumption. This technique was first introduced in [10] and was referred to as parallel context processing (PCP) throughout the standardization process.

8.3.3.4 Reduce Context Modeling Dependencies

Speculative computations are required for multiple bins per cycle decoding due to the dependencies in the context modeling. For instance, this is an issue when the context modeling for the next bin depends on the decoded value of the current bin. Reducing these dependencies simplifies the context modeling logic and reduces the amount of speculative calculations required to process multiple bins in parallel [18, 80, 85].

8.3.3.5 Reduce Total Number of Bins

In addition to increasing the throughput, it is desirable to reduce the workload itself by reducing the total number of bins that need to be processed. This can be achieved by changing the binarization, inferring the value of some bins,Footnote 5 and sending higher level flags to avoid signaling redundant bins [12, 56, 59].

8.3.3.6 Reduce Parsing Dependencies

As parsing with CABAC may constitute a throughput bottleneck, it is important to minimize any dependency on other video decoding processes, which could cause CABAC to stall or may even prevent a successful parsing process in case of picture loss due to transmission errors [7, 79, 108] (see Sect. 8.5.1.1). Ideally the parsing process should be decoupled from all other decoding processes, which actually is the case for CABAC in H.264/AVC. Decoupling parsing from the sample reconstruction process is also important when entropy decoupling is used, i.e., when a large frame level buffer is inserted between the entropy decoder and the rest of the decoder to absorb the variance in the bit-rate and pixel-rate workloads, respectively.

8.3.3.7 Reduce Memory Requirements

Memory accesses often contribute to the critical path delay. Thus, reducing memory storage requirements is desirable, as fewer memory accesses increase throughput and also reduce implementation cost and power consumption [81, 90].

8.4 Coding Tree Unit and Coding Unit Syntax Elements

In HEVC, a picture is partitioned into a regular grid of disjoint square blocks of \(2^{N} \times 2^{N}\) luma samples and, in case of 4:2:0 color sampling, corresponding square blocks of \(2^{N-1} \times 2^{N-1}\) chroma samples. The parameter N = 4, 5, or 6 can be chosen by the encoder and transmitted in the sequence parameter set (SPS), such that the corresponding coding tree units represent luma CTBs of size 16 × 16, 32 × 32, or 64 × 64 samples, respectively. The CTU syntax elements describe how the corresponding CTBs can be further partitioned into smaller coding blocks by use of the coding quadtree and how the method of sample adaptive offset (SAO) in-loop filtering is performed on the reconstructed luma and chroma samples belonging to the CTU.

Within a picture, an integer number of CTUs can be grouped into a slice. Each slice itself consists of one (leading) independent slice segment and zero or more subsequently ordered dependent slice segments. A flag called end_of_slice_segment_flag is sent to indicate the last CTU in a slice segment. In addition, tiles and wavefront parallel processing, which are introduced in Chap. 3, can be used to fragment the slice segment into multiple substreams,Footnote 6 each being represented by its own CABAC codeword. Therefore, if end_of_slice_segment_flag indicates that it is not the last CTU in a slice segment, a flag called end_of_sub_stream_one_bit is used to indicate whether it is the last CTU of the corresponding substream.Footnote 7 An example of this is illustrated in Fig. 8.4. Both end_of_slice_segment_flag and end_of_sub_stream_one_bit are coded using the terminating mode of the arithmetic coding engine. This is required since at the end of a slice segment or a substream, the arithmetic coding engine must be flushed and the resulting CABAC codeword must be byte aligned before, at least in the former case, inserting the startcode for the next slice or entry point for the next slice segment. Figure 8.5 shows an example of the locations of CABAC termination within a bitstream.

Fig. 8.4 These two examples illustrate which CTUs are terminated when slice segments are divided into substreams using tiles and wavefront parallel processing. Values of (end_of_slice_segment_flag, end_of_sub_stream_one_bit) are given for each configuration. (a) Tiles: CTUs 12, 24, and 36 have (0, 1); CTU 48 has (1, not signaled); and the rest of the CTUs have (0, 0). (b) Wavefront parallel processing: CTUs 8, 16, 24, 32 and 40 have (0, 1); CTU 48 has (1, not signaled); and the rest of the CTUs have (0, 0)

Fig. 8.5 Ordering of the bitstream for the tiles example in Fig. 8.4a. CABAC needs to be terminated before byte alignment (BA) as shown by the black boxes. Entry points for substreams are sent in slice_segment_header()

8.4.1 Coding Block Structure

The coding block structure is determined by the coding quadtree, which is signaled by a flag called split_cu_flag at each of its nodes to indicate whether a given coding block should be further subdivided into four smaller CBs. There is a strong spatial correlation between the chosen CQT depth of neighboring CBs, i.e., the block sizes of neighboring CBs; thus, the context selection for split_cu_flag depends on the relative depth of the top and left neighboring CBs compared to that of the current CB. Note that in H.264/AVC the partitioning information is sent together with other data as aggregated syntax elements mb_type and sub_mb_type with different ranges of allowed values and hence different binarization schemes for different slices.Footnote 8 This kind of aggregation of different information in a single syntax element is mostly due to historical reasons, reflecting the circumstance that earlier video coding standards (including H.264/AVC) were designed under the regime of VLC-based entropy coding, where alphabet extensions are used to circumvent the lower bound of 1 bit per symbol. Thus, by allowing the signaling of a coding quadtree structure with a one-bin syntax element, i.e., the split_cu_flag at each node, HEVC is much more flexible and allows many more coding and prediction block structures than H.264/AVC, even when choosing a CTB size of 16 × 16 luma samples and ignoring the fact that HEVC does not allow for inter-predicted 4 × 4 luma blocks, as discussed in Chap. 3.

Fig. 8.6 A coding tree unit is subdivided into CUs along the associated coding quadtree. Each resulting CU may be further subdivided into PUs. Intra-coded CUs are shown in blue, inter-coded CUs in orange. Note that the figure only shows the corresponding CTB, CBs, and PBs of the luma component

8.4.2 Prediction Mode and Prediction Block Structure

In P and B slices, a cu_skip_flag is sent for each CU to indicate whether all associated CBs are coded using skip mode, i.e., by using the so-called merge mode for inter-picture prediction (as explicitly described in Sect. 8.5) and not sending any residual data. To leverage spatial correlation of neighboring CUs, the context of the cu_skip_flag depends on whether the top and left neighboring CUs are also skipped. For every non-skipped CU, a regular coded flag called pred_mode_flag is sent to indicate the prediction mode, i.e., the decision whether the CU is either intra coded or inter coded.Footnote 9
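
The neighbor-based context selection for split_cu_flag and cu_skip_flag can be sketched as follows; the availability of the top and left neighbors at picture, slice, and tile boundaries is simplified here to a None check.

```python
def ctx_split_cu_flag(cur_depth, left_depth, above_depth):
    """Context increment for split_cu_flag: count available neighbors whose
    coding quadtree depth is larger than that of the current CB (0, 1 or 2)."""
    ctx = 0
    if left_depth is not None and left_depth > cur_depth:
        ctx += 1
    if above_depth is not None and above_depth > cur_depth:
        ctx += 1
    return ctx

def ctx_cu_skip_flag(left_skipped, above_skipped):
    """Context increment for cu_skip_flag: count available neighbors coded in skip mode."""
    return int(bool(left_skipped)) + int(bool(above_skipped))

print(ctx_split_cu_flag(cur_depth=1, left_depth=2, above_depth=1))   # -> 1
print(ctx_cu_skip_flag(left_skipped=True, above_skipped=False))      # -> 1
```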

Every non-skipped CU may be further subdivided into PUs, as shown for the example in Fig. 8.6. The syntax element part_mode indicates if and how each CU is partitioned for the purpose of prediction. The choice of prediction block structures for each CU depends on whether the CU is intra-coded or inter-coded; accordingly, part_mode is binarized and coded differently for intra-coded CUs and inter-coded CUs, as shown in Fig. 8.7.

Fig. 8.7 Context selection and binarization of part_mode. Underlined symbols are bypass coded. (a) Intra-coded CU, (b) inter-coded CU

Intra-coded CUs can have a single PU (referred to as PART_2Nx2N) equal to the size of the CU, or be subdivided into four smaller PUs (referred to as PART_NxN). PART_NxN, however, is only allowed when the CU size is equal to the minimum allowed CU size. If the CU size is greater than the minimum allowed CU size, split_cu_flag is used instead of part_mode to avoid redundant signaling. For instance, if the minimum CU size is 8 × 8 in terms of luma samples, a 16 × 16 luma area with four 8 × 8 intra PUs is signaled as four 8 × 8 CUs, each with part_mode = PART_2Nx2N (i.e., split_cu_flag = 1), rather than as a single 16 × 16 CU with split_cu_flag = 0 and part_mode = PART_NxN. Accordingly, part_mode is not signaled but inferred to be PART_2Nx2N when the CU size is greater than the minimum allowed CU size. If the CU size is equal to the minimum allowed CU size, part_mode is coded using a flag with a fixed context for a given slice type.

Inter-coded CUs have more prediction partitioning options than intra-coded CUs. In addition to PART_2Nx2N and PART_NxN, an inter-coded CU can also be subdivided into two rectangular PUs, either of size 2N × N (PART_2NxN) or of size N × 2N (PART_Nx2N). If the so-called asymmetric motion partitioning (AMP) is enabled, the additional prediction partitioning possibilities PART_2NxnU, PART_2NxnD, PART_nLx2N, and PART_nRx2N are supported. Custom binarization is used for part_mode as shown in Fig. 8.7b. The first bin indicates whether or not the CU is partitioned into smaller PUs. If the CU size is greater than the minimum allowed CU size, the second bin indicates the direction of the partition (vertical or horizontal), the third bin indicates whether AMP is used and, if so, a fourth bin is sent to indicate which asymmetric partition is used in the given direction. If the CU size is equal to the minimum allowed CU size, AMP is not allowed and truncated unary coding is used to indicate whether the partitioning is PART_Nx2N, PART_2NxN, or PART_NxN, respectively.Footnote 10 Different contexts are used for the first and second bins to estimate the probabilities of whether the CU is partitioned into smaller PUs and of the direction of the partitioning, respectively. Two different contexts are used for the third bin depending on whether the CU size is greater than or equal to the minimum allowed CU size. In the former case, the context is based on the probability of whether asymmetric partitions are used, while in the latter, the context is based on the probability of whether PART_NxN is used. To reduce the number of regular coded bins, the fourth bin (for AMP) is bypass coded.

8.4.3 Signaling of Special Coding Modes

HEVC supports two special coding modes, which are invoked on a CU level: the so-called I_PCM mode and the lossless coding mode. Both modes, albeit similar in appearance to some degree, serve different purposes and hence, use different syntax elements for providing different functionalities.

A pcm_flag is sent to indicate whether all samples of the whole CU are coded with pulse code modulation (PCM), such that prediction, transform, quantization, and entropy coding as well as their counterparts on the decoder side are simply bypassed. This I_PCM mode, however, is only allowed for intra-coded CUs with prediction partitioning mode PART_2Nx2N.Footnote 11 The pcm_flag is coded with the termination mode of the arithmetic coding engine, since in most cases I_PCM mode is not used, and if it is used, the arithmetic coding engine must be flushed and the resulting CABAC codeword must be byte aligned before the PCM sample values can be written directly into the bitstream with fixed length codewords.Footnote 12 This procedure also indicates that the I_PCM mode is particularly useful in cases where the statistics of the residual signal are such that an excessive number of bits would otherwise be generated when applying the regular CABAC residual coding process.

The option of lossless coding, where for coding of the prediction residual both the transform and quantization (but not the entropy coding) are bypassed, is also enabled on a CU level and indicated by a regular coded flag called cu_transquant_bypass_flag. The resulting samples of the losslessly represented residual signal in the spatial domain are entropy coded by the CABAC residual coding process (see Sect. 8.6), as if they were conventional transform coefficient levels. Note that in lossless coding mode, both in-loop filters are also bypassed in the reconstruction process (which is not necessarily the case for I_PCM), such that a mathematically lossless (local) reconstruction of the input signal is achieved.

8.4.4 Signaling of Block-Based Quantization Parameter Change

In the regular, i.e., lossy residual coding process, a different quantizer step size can be used for each CU to improve bit allocation, rate control, or both. Rather than sending the absolute quantization parameter (QP), the difference in QP steps relative to the slice QP is sent in the form of a so-called delta QP. This functionality can be enabled in the picture parameter set (PPS) by using the syntax element cu_qp_delta_enabled_flag.

In H.264/AVC, mb_qp_delta is used to provide the same instrument of delta QP at the macroblock level. The value of mb_qp_delta can range from \(-(26 +\mathrm{ QpBdOffset}_{\mathrm{Y}}/2\)) to \(25 +\mathrm{ QpBdOffset}_{\mathrm{Y}}/2\). For 8-bit video, this is −26 to 25, while for 10-bit video this is −32 to 31. mb_qp_delta is unary coded and thus requires up to 53 bins for 8-bit video and 65 bins for 10-bit video. All bins are regular coded.

In HEVC, delta QP is represented by the two syntax elements cu_qp_delta_abs and cu_qp_delta_sign_flag, if cu_qp_delta_enabled_flag in the PPS indicates so. The sign is sent separately from the absolute value, which reduces the average number of bins by half [23]. cu_qp_delta_sign_flag is only sent if the absolute value is non-zero. The absolute value is binarized with TrU (cMax=5) as the prefix and EG0 as the suffix [98]. The prefix is regular coded and the suffix is bypass coded. The first bin of the prefix uses a different context than the other four bins in the prefix (which share the same context) to capture the probability of having a zero-valued delta QP. Note that syntax elements for delta QP are only signaled for CUs that have non-vanishing prediction errors (i.e., at least one non-zero transform coefficient). Conceptually, the delta QP is an element of the transform coding part of HEVC and hence, can also be interpreted as a syntax element that is always signaled at the root of the RQT, regardless which transform block partitioning is given by the RQT structure. Table 8.3 shows examples of how delta QP is signaled for H.264/AVC and HEVC.
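
The delta QP binarization described above can be sketched as follows; the sign convention (a sign flag equal to 1 denoting a negative delta) follows the HEVC semantics, and the helper functions are simplified re-implementations of the TrU and EG0 schemes of Sect. 8.2.1.

```python
def trunc_unary(v, c_max):
    return "1" * v + ("0" if v < c_max else "")

def eg0(v):
    bins, k = "", 0
    while v >= (1 << k):
        bins, v, k = bins + "1", v - (1 << k), k + 1
    return bins + "0" + "".join(str((v >> i) & 1) for i in reversed(range(k)))

def binarize_delta_qp(delta_qp):
    """Returns (prefix, suffix, sign) bin strings for cu_qp_delta_abs
    (TrU prefix with cMax = 5, regular coded; EG0 suffix, bypass coded)
    and cu_qp_delta_sign_flag (bypass coded, only present for non-zero values)."""
    abs_val = abs(delta_qp)
    prefix = trunc_unary(min(abs_val, 5), 5)
    suffix = eg0(abs_val - 5) if abs_val >= 5 else ""
    sign = "" if abs_val == 0 else ("1" if delta_qp < 0 else "0")
    return prefix, suffix, sign

print(binarize_delta_qp(0))    # ('0', '', '')
print(binarize_delta_qp(-3))   # ('1110', '', '1')
print(binarize_delta_qp(7))    # ('11111', '101', '0')
```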

Table 8.3 Coding of delta QP in HEVC and H.264/AVC

8.4.5 Signaling of SAO Parameters

SAO is a form of in-loop filtering that was introduced in HEVC. It is used to process the output samples of the deblocking filter and is the last step of the decoding process. SAO involves sample-based rather than block-based processing. There are two types of filtering: edge offset and band offset.

Edge offset (EO) involves comparing the sample and its neighboring sample values in one of four angular directions (horizontal, vertical, 45°, 135°).Footnote 13 The sample is compared to its neighbors in the selected direction (e.g., the sample has a lower value than both its neighbors); based on the comparison, the sample is assigned to a category, which determines the offset that is added to the sample. The value of the offset for a given category is set by the encoder. Band offset (BO) involves dividing the sample intensity range into 32 equally sized bands; offsets are transmitted for four consecutive bands, and a sample is modified by the offset of the band its intensity belongs to (samples outside these four bands are left unchanged). For more details on SAO, please refer to Chap. 7.

The type, direction and offsets used to define the SAO filter can change for each CTB; however, all samples belonging to a CTB are processed with the same SAO filter (but luma and chroma CTBs may use different SAO filters). The SAO type is signaled using sao_type_idx_luma and sao_type_idx_chroma with TrU binarization. The first bin indicates whether the SAO filter is enabled and is regular coded, while the second bin indicates if edge or band offset is used and is bypass coded.

If edge offset is used, the direction of the edge is signaled using sao_eo_class_luma and sao_eo_class_chroma with FL binarization of two bins, all of which are bypass coded. If band offset is used, the sao_band_position syntax element is signaled to indicate the start position of the four consecutive bands with a FL binarization of five bins, all of which are bypass coded.

For both types of SAO filtering, four sao_offset_abs are signaled (one for each category or band) using TrU with cMax computed by Eq. (8.4) and all bins are bypass coded.

$$\displaystyle{ \mathrm{cMax} = (1 << (\min (\mathrm{bitDepth},10) - 5)) - 1 }$$
(8.4)
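
For example, Eq. (8.4) yields cMax = 7 for 8-bit video and cMax = 31 for 10-bit and higher bit-depth video, as the following sketch confirms.

```python
def sao_offset_cmax(bit_depth):
    # Eq. (8.4): cMax used for the TrU binarization of sao_offset_abs
    return (1 << (min(bit_depth, 10) - 5)) - 1

for bd in (8, 10, 12):
    print(f"bitDepth = {bd:2d} -> cMax = {sao_offset_cmax(bd)}")   # 7, 31, 31
```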

For the band offset, the sao_offset_sign is signaled only when the offset is non-zero to reduce the total number of bins [36], while for edge offset the sign is inferred from the category [40].

To leverage the spatial correlation across CTBs, sao_merge_left_flag and sao_merge_up_flag are used to indicate whether the SAO parameters are inherited from the neighboring CTBs, which reduces signaling overhead. Both of these flags are regular coded and share a single context model.

Significant effort was made to reduce the number of regular coded bins required to represent the SAO syntax elements. As a result, the only regular coded bins are the merge flags and the first bins of sao_type_idx_luma and sao_type_idx_chroma, which indicate whether SAO is enabled for the luma and chroma CTBs, respectively.

8.4.6 Comparison of HEVC and H.264/AVC

Table 8.4 highlights the differences in signaling between the CTU/CU layer in HEVC and the macroblock (MB) layer in H.264/AVC when processing 8-bit video. For a comparable block partitioning, HEVC typically produces fewer regular coded bins than H.264/AVC. At the same time, the contexts for some of those regular coded bins, in addition to those of the skip flag, are adaptively selected in HEVC based on CU depth, size and neighboring information, which improves coding efficiency relative to H.264/AVC. In general, however, the total number of bits spent for signaling at the CTU/CU or MB layer is lower by more than an order of magnitude than the total number of bits spent for transform coefficient level coding. As already discussed above and summarized in Table 8.4, the majority of bypass bins for the SAO parameters are due to the signaling of the offsets, while for H.264/AVC an excessive number of bins is only generated in the rare cases where large delta QP values have to be transmitted.

Table 8.4 Differences in signaling between CTU/CU layer in HEVC and MB layer in H.264/AVC

8.5 Prediction Unit Syntax Elements

The prediction unit (PU) syntax elements describe how the prediction is performed in order to reconstruct the samples belonging to each PU. Coding efficiency improvements have been made in HEVC for both modeling and coding of motion parameters and intra prediction modes. While H.264/AVC uses a single motion vector predictor (unless direct mode is used) and a single most probable mode (MPM), HEVC uses multiple candidate predictors or MPMs together with an index or flag for signaling the selected predictor or MPM, respectively. In addition, HEVC provides a mechanism for exploiting spatial and temporal dependencies with regard to motion modeling by merging neighboring blocks with identical motion parameters. This has been found to be particularly useful in combination with quadtree-based block partitioning, since a pure hierarchical subdivision approach may lead to partitionings with suboptimal rate-distortion behavior [32, 49, 102]. Also, due to the significantly increased number of angular intra prediction modes relative to H.264/AVC, three MPMs for each PU are considered in HEVC.

This section will discuss how the various PU syntax elements are processed in terms of binarization, context modeling, and context assignment. Also, aspects related to parsing dependencies and throughput for the various prediction parameters are considered.

8.5.1 Motion Data Coding

In HEVC, motion data can be either signaled using merge mode or directly using motion vector differences, reference indices, and inter-prediction direction.

8.5.1.1 Signaling of Merge Mode

In HEVC, merge mode enables motion data (i.e., prediction direction, reference index and motion vectors) to be inherited from a spatial or temporal (co-located) neighbor. A list of merge candidates is generated from these neighbors. merge_flag is signaled to indicate whether merge is used in a given PU. If merge is used, then merge_idx is signaled to indicate from which candidate the motion data should be inherited. merge_idx is coded with truncated unary, which means that bins are parsed until either a zero-valued bin is reached or the number of parsed bins equals cMax, the maximum allowed number of bins.

Determining how to set cMax involved evaluating the throughput and coding efficiency trade-offs in a core experiment [7]. For optimal coding efficiency, cMax should be set to equal the merge candidate list size of the PU. Furthermore, merge_flag should not be signaled if the list is empty. However, this makes parsing depend on list construction, which is needed to determine the list size. Constructing the list requires a large amount of computation since it involves reading from multiple locations (i.e., fetching the co-located neighbor and spatial neighbors) and performing several comparisons to prune the list; thus, dependency on list construction would significantly degrade parsing throughput [33, 108].

To decouple the list generation process from the parsing process such that they can operate in parallel in HEVC, cMax is signaled in the slice header using five_minus_max_num_merge_cand and does not depend on list size. To compensate for the coding loss due to the fixed cMax, combined bi-predictive and zero motion vector candidates are added when the list size is less than the maximum number of allowed candidates as defined by cMax [79]. This also ensures that the list is never empty and that merge_flag is always signaled [107]. For more details on candidate list construction please refer to Chap. 5.
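As an illustration of this decoupling, the following decoder-side sketch parses merge_idx with a cMax that depends only on the slice header. The callables decode_regular_bin and decode_bypass_bin are hypothetical stand-ins for the arithmetic decoding engine, and the assumption that only the first merge_idx bin is regular coded reflects our reading of the standard rather than the text above.

```python
def parse_merge_idx(decode_regular_bin, decode_bypass_bin, five_minus_max_num_merge_cand):
    """Decoder-side sketch of truncated unary parsing of merge_idx.

    cMax is derived from the slice header only, so parsing never has to
    wait for the merge candidate list to be constructed.
    """
    c_max = (5 - five_minus_max_num_merge_cand) - 1   # MaxNumMergeCand - 1
    merge_idx = 0
    while merge_idx < c_max:
        # Assumption: only the first bin is regular (context) coded,
        # the remaining bins are bypass coded.
        bin_val = decode_regular_bin() if merge_idx == 0 else decode_bypass_bin()
        if bin_val == 0:
            break
        merge_idx += 1
    return merge_idx
```

Note that when only one merge candidate is allowed, c_max is 0 and no bins are parsed at all, so merge_idx is inferred to be 0.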

8.5.1.2 Signaling of Motion Vector Differences, Reference Indices, and Inter-Prediction Direction

If merge mode is not used, then the motion vector is predicted from its neighboring blocks and the difference between motion vector (mv) and motion vector prediction (mvp), referred to as motion vector difference (mvd), is signaled:

$$\displaystyle\begin{array}{rcl} \mathrm{mvd}(x,y) =\mathrm{ mv}(x,y) -\mathrm{ mvp}(x,y)& & {}\\ \end{array}$$

In H.264/AVC, a single predictor is calculated for mvp from the median of the left, top and top-right spatial 4 × 4 neighbors.

In HEVC, advanced motion vector prediction (AMVP) is used, where several candidates for mvp are determined from spatial and temporal neighbors [38]. A list of mvp candidates is generated from these neighbors, and the list is pruned to remove redundant candidates such that there is a maximum of two candidates. A syntax element called mvp_l0_flag (or mvp_l1_flag depending on the reference list) is used to indicate which candidate is used from the list as the mvp. To ensure that parsing is independent of list construction, mvp_l0_flag is signaled even if there is only one candidate in the list. The list is never empty as the zero motion vector is used as the default candidate.

In HEVC, improvements were also made on the coding process of mvd itself. In H.264/AVC, the first nine bins of mvd are regular coded truncated unary bins, followed by bypass coded 3rd order Exp-Golomb bins. In HEVC, the number of regular coded bins for mvd is significantly reduced [58]. Only the first two bins are regular coded (abs_mvd_greater0_flag, abs_mvd_greater1_flag), followed by bypass coded first-order Exp-Golomb (EG1) bins (abs_mvd_minus2).
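A per-component sketch of this binarization is given below; it reuses the exp_golomb() helper from the earlier delta QP sketch, and the bin values shown are those of the individual syntax elements before the cross-component grouping of regular and bypass bins described in the next paragraph.

```python
def binarize_mvd_component(mvd):
    """Sketch of the HEVC mvd binarization for one component (x or y).

    Returns (regular_bins, bypass_bins); exp_golomb() is the EGk helper
    defined in the delta QP sketch above.
    """
    abs_mvd = abs(mvd)
    regular = [1 if abs_mvd > 0 else 0]           # abs_mvd_greater0_flag
    bypass = []
    if abs_mvd > 0:
        regular.append(1 if abs_mvd > 1 else 0)   # abs_mvd_greater1_flag
        if abs_mvd > 1:
            bypass += exp_golomb(abs_mvd - 2, 1)  # abs_mvd_minus2, EG1 bypass bins
        bypass.append(0 if mvd > 0 else 1)        # mvd_sign_flag, bypass coded
    return regular, bypass
```

For mvd = 2, this yields the regular bins 11 and the bypass bins 00 followed by a sign bin, consistent with the grouping example discussed below.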

In H.264/AVC, context selection for the first bin in mvd depends on whether the sum of the motion vectors of the top and left 4 × 4 neighbors is greater than 32 (or less than 3). This requires 5-bit storage per neighboring motion vector, which accounts for 24,576 bits of the 30,720-bit CABAC line buffer needed to support a 4k × 2k sequence. The need to reduce the line buffer size in HEVC by modifying the context selection logic was highlighted in [90]. Accordingly, all dependencies on the neighbors were removed and the context is selected based on the binIdx (i.e., whether it is the first or second bin) [82, 91].

To maximize the impact of fast bypass coding, the bypass coded bins for both the horizontal (x) and vertical (y) components of mvd are grouped together in HEVC [67]. For instance, for a motion vector difference of mvd(x, y) = (2, 2) and ignoring the sign bins, the coding order is 1111 0000, where the first four bins (the greater0 and greater1 flags of both components) are regular coded and the last four bins (abs_mvd_minus2 of both components) are bypass coded. Without bypass grouping, the coding order is 1100 1100, i.e., regular and bypass coded bins alternate in pairs. If four bypass bins can be processed in a single cycle, enabling bypass grouping reduces the number of cycles required to process the motion vector by one.

In HEVC, reference indices ref_idx_l0 and ref_idx_l1 are coded with truncated unary regular coded bins, which is the same as for H.264/AVC; the maximum length of the truncated unary binarization, cMax, is dictated by the reference picture list size. However, in HEVC only the first two bins are regular coded [71], whereas all bins are regular coded in H.264/AVC. In both HEVC and H.264/AVC, the regular coded bins of the reference indices for different reference picture lists share the same set of contexts. The inter-prediction direction (list 0, list 1 or bi-directional) is signaled using inter_pred_idc with custom binarization.

8.5.2 Intra Prediction Mode Coding

Similar to motion data coding, a most probable mode (MPM) is calculated for intra mode coding. In H.264/AVC, the minimum of the modes of the top and left neighbors is used as the MPM. prev_intra4x4_pred_mode_flag (or prev_intra8x8_pred_mode_flag) is signaled to indicate whether the most probable mode is used. If the MPM is not used, the remainder mode rem_intra4x4_pred_mode (or rem_intra8x8_pred_mode) is signaled.

In HEVC, additional MPMs are used to improve coding efficiency. A candidate list of most probable modes with a fixed length of three is constructed based on the left and top neighbors. The additional candidate modes (DC, planar, vertical) can be added if the left and top neighbors are the same or unavailable. Note that top neighbors outside the current CTU are considered unavailable in order to avoid the need for a line buffer. The flag prev_intra_luma_pred_flag is signaled to indicate whether one of the most probable modes is used. If an MPM is used, a most probable mode index (mpm_idx) is signaled to indicate which candidate to use. It should be noted that in HEVC, the order in which the coefficients of the residual are parsed (e.g., diagonal, vertical or horizontal) depends on the reconstructed intra mode (i.e., the parsing of the TU data that follows depends on list construction and intra mode reconstruction). Thus, the candidate list size was limited to three for reduced computation, in order to ensure that it would not affect entropy decoding throughput [22, 83].
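The following sketch shows one way to express the three-entry MPM list construction, following our reading of the HEVC specification. The mode numbers (0 = planar, 1 = DC, 26 = vertical), the angular-neighbor arithmetic, and the treatment of unavailable neighbors as DC are assumptions of this illustration rather than statements taken from the text above.

```python
PLANAR, DC, VERTICAL = 0, 1, 26   # HEVC intra prediction mode indices

def derive_mpm_list(cand_left, cand_above):
    """Sketch of the 3-entry MPM list derivation from the left/above luma modes.

    Unavailable neighbors (including above neighbors outside the current CTU)
    are assumed to have been replaced by DC before calling this function.
    """
    if cand_left == cand_above:
        if cand_left < 2:                    # both neighbors planar or DC
            return [PLANAR, DC, VERTICAL]
        # both neighbors use the same angular mode: add its two nearest angular modes
        return [cand_left,
                2 + ((cand_left + 29) % 32),
                2 + ((cand_left - 2 + 1) % 32)]
    mpm = [cand_left, cand_above]
    for extra in (PLANAR, DC, VERTICAL):     # first of these not already in the list
        if extra not in mpm:
            mpm.append(extra)
            break
    return mpm
```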

The number of regular coded bins was reduced for intra mode coding in HEVC relative to the corresponding part in H.264/AVC, where both the flag and the 3 fixed-length bins of the remainder mode are regular coded using two separate context models. In HEVC, the flag is regular coded as well, but the remainder mode is a fixed-length 5-bin value that is entirely bypass coded. The most probable mode index (mpm_idx) is also entirely bypass coded. The number of contexts used to code intra_chroma_pred_mode is reduced from 4 to 1 for HEVC relative to H.264/AVC. To maximize the impact of fast bypass coding, the bypass coded bins for luma intra prediction mode coding within a CU are grouped together in HEVC [19]. This is beneficial when the partition mode is PART_NxN, and there are four sets of prediction modes.

Table 8.5 Differences between prediction unit coding in HEVC and H.264/AVC

8.5.3 Comparison of HEVC and H.264/AVC

The differences between H.264/AVC and HEVC in the signaling of syntax elements at the PU layer are summarized in Table 8.5. HEVC uses both spatial and temporal neighbors as predictors, while H.264/AVC only uses spatial neighbors (unless direct mode is enabled). In terms of the impact of the throughput improvement techniques, the maximum number of regular coded bins per inter-predicted PU is around 6× lower in HEVC than in H.264/AVC. HEVC also requires around 2× fewer contexts for PU syntax elements than H.264/AVC.

8.6 Transform Unit Syntax Elements

In video coding, both intra and inter prediction are used to reduce the amount of data that needs to be transmitted. In addition, rather than sending the original samples of the prediction signal, an appropriately quantized approximation of the prediction error is transmitted. To this end, the prediction error is blockwise transformed from spatial to frequency domain, thereby decorrelating the residual samples and performing an energy compaction in the sense that, after quantization, the signal can be represented in terms of a few non-vanishing coefficients. The method of signaling the quantized values and frequency positions of these coefficients is referred to as transform coefficient coding.

Syntax elements related to transform coefficient coding account for a significant portion of the bin workload as shown in Table 8.6. At the same time, those syntax elements also account for a significant portion of the total number of bits for a compressed video, and as a result the compression of quantized transform coefficients significantly impacts the overall coding efficiency. Thus, transform coefficient coding with CABAC must be carefully designed in order to balance coding efficiency and throughput demands. Accordingly, as part of the HEVC standardization process, a core experiment on coefficient scanning and coding was established to investigate tools related to transform coefficient coding [97].

Table 8.6 Distribution of bins in CABAC for HEVC and H.264/AVC under common test conditions [6, 101] and for the worst case

This section describes how transform coefficient coding evolved from H.264/AVC to the first test model of HEVC (HM1.0) to the Final Draft International Standard (FDIS) of HEVC (HM10.0), and discusses the reasons behind the design choices that were made. Many of the throughput improvement techniques were applied, and newly adopted coding efficiency tools were simplified to limit their impact on throughput. As a reference for the beginning and end points of this development, Figs. 8.8 and 8.9 show examples of transform coefficient coding for 4 × 4 blocks in H.264/AVC and HEVC, respectively.

Fig. 8.8
figure 8

Example of CABAC-based transform coefficient coding for a 4 × 4 transform block in H.264/AVC. Note, however, that the corresponding bins for signaling of the absolute level (in yellow) are not explicitly shown

Fig. 8.9
figure 9

Example of transform coefficient coding for a 4 × 4 transform block in HEVC. Note, however, that the corresponding bins for signaling of the “last” information (in red) and absolute level remaining (in yellow) are not explicitly shown

8.6.1 Transform Block Structure

As already discussed in Sect. 8.3.1, transform coding in HEVC involves a tree-structured variable block-size approach with supported transform block sizes of 4 × 4, 8 × 8, 16 × 16, and 32 × 32. This means that the actual transform block sizes, used to code the prediction error of a given CU, can be selected based on the characteristics of the residual signal by using a quadtree-based partitioning, also known as residual quadtree (RQT), as illustrated in Fig. 8.10. While this larger variety of transform block partitioning relative to H.264/AVC provides significant coding gains, it also has implications in terms of implementation costs, both in terms of memory bandwidth and computational complexity. To address this issue, HEVC allows the RQT-based transform block partitioning to be restricted by four parameters, signaled by corresponding syntax elements in the SPS: the maximum and minimum allowed transform block size (in terms of block width) n_max and n_min, respectively, and the maximum depth of the RQT d_max, with the latter given individually for intra-picture and inter-picture prediction. Note, however, that there is a rather involved interdependency between these parameters (and other syntax elements), such that, for instance, implicit subdivisions or implicit leaf nodes of the RQT may occur. For more details, please refer to Chap. 3.

The signaling of the transform block structure for each CU is similar to that of the coding block structure at the CTU level. For each node of the RQT, a flag called split_transform_flag is signaled to indicate whether a given transform block should be further subdivided into four smaller TBs. Context modeling for the coding of this flag involves three different contexts, with the context increment given by 5 − log2(TrafoSize), where TrafoSize denotes the block width of the corresponding luma transform block at the given RQT depth. Note that for the choice of a luma CTB size of 64, n_max = 32, n_min = 4, and d_max = 4, an implicit leaf node is implied for the case of TrafoSize = 4, whereas an implicit subdivision is given for a luma CB size of 64 at RQT depth equal to 0. Table 8.7 and Fig. 8.11 illustrate an example of this configuration. Therefore, even if up to five different RQT levels are permitted, only up to three different context models are required for coding of split_transform_flag. Note that the signaling of split_transform_flag at the RQT root is omitted if the quantized residual of the corresponding CU contains no non-zero transform coefficient at all, i.e., if the corresponding coded block flag at the RQT root (see Sect. 8.6.3) is equal to 0.
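As a worked example of this context assignment, split_transform_flag of a 32 × 32 luma transform block selects context increment 5 − log2(32) = 0, a 16 × 16 block selects 1, and an 8 × 8 block selects 2; a 4 × 4 block is never split further, so no fourth context is needed.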

Fig. 8.10
figure 10

Illustration of residual quadtrees (one for each CU) used to signal transform units for residual coding of CUs. Note that the same relationships and comments as given in Fig. 8.6 apply here as well

Fig. 8.11
figure 11

Illustration of the signaling of split_transform_flag, cbf_luma, cbf_cb, and cbf_cr for an RQT with depth 4. Note that at RQT depth = 0, no split_transform_flag is signaled since an implicit transform split occurs for a CU of size 64 × 64 when n_max = 32. cbf_luma is only signaled for leaf transform blocks (highlighted in red). cbf_cb and cbf_cr are signaled for the root node and for all nodes whose corresponding CBF at the parent node is non-zero, except for nodes with TrafoSize = 4

8.6.2 Transform Skip

For regions or blocks with many sharp edges (e.g., as typically given in screen content coding), coding gains can be achieved by skipping the transform [42, 61]. When the transform is skipped for a given block, the prediction error in the spatial domain is quantized and coded in the same manner as for transform coefficient coding (i.e., the quantized block samples of the spatial error are coded as if they were quantized transform coefficients). The so-called transform skip mode is only allowed for 4 × 4 TUs and only if the corresponding functionality is enabled by the transform_skip_enabled_flag in the PPS. Signaling of this mode is performed by using the transform_skip_flag, which is coded using a single fixed context model.

8.6.3 Coded Block Flags

At the top level of the hierarchy of significance flags, as already explained in Sect. 8.3.1, coded block flags (CBFs) are signaled for the RQT root, i.e., at the CU level in the form of rqt_root_cbf, and for the subsequent luma and chroma TBs in the form of cbf_luma and cbf_cb, cbf_cr, respectively. rqt_root_cbf is only coded and transmitted for inter-predicted CUs that are not coded in merge mode with a single PU (PART_2Nx2N); it is coded using a single context model. While signaling of cbf_luma is only performed at the leaf nodes of the RQT, provided that a non-zero rqt_root_cbf was signaled before, the chroma CBFs cbf_cb and cbf_cr are transmitted at each internal node as long as the corresponding chroma CBF at its parent node is non-zero. For coding of both cbf_cb and cbf_cr, four contexts are used, with the context increment given by the RQT depth (with admissible values between 0 and 3, since no chroma CBFs are transmitted for TrafoSize = 4), whereas for cbf_luma only two contexts are provided, with the context increment depending on whether the RQT depth is equal to 0. For more background on the use of the RQT and related syntax elements, please refer to Chap. 3.

8.6.4 Significance Map

In H.264/AVC, the significance map for each transform block is signaled by transmitting a significant_coeff_flag (SIG) for each position to indicate whether the coefficient is non-zero. The positions are processed in an order based on a zig-zag scan. After each non-zero SIG, an additional flag called last_significant_coeff_flag (LAST) is immediately sent to indicate whether it is the last non-zero SIG; this prevents unnecessary SIG flags from being signaled. Different contexts are used depending on the position within the 4 × 4 and 8 × 8 transform blocks, and on whether the bin represents a SIG or a LAST. Since SIG and LAST are interleaved, the context selection of the current bin depends on the immediately preceding bin. The dependency of LAST on SIG results in a strong bin-to-bin dependency for context selection of significance map entries in H.264/AVC, as illustrated in Fig. 8.3.

8.6.4.1 sig_coeff_flag (SIG)

While HEVC retains a position-based context assignment for coding of sig_coeff_flag (SIG) in 4 × 4 TBs, as shown in Fig. 8.12, new forms of context assignment were needed for the larger transforms. In HM1.0, additional dependencies were introduced in the context selection of SIG for 16 × 16 and 32 × 32 TBs to improve coding efficiency. Specifically, the context selection of SIG was calculated based on a local template using 10 (already decoded) SIG neighbors, as shown in Fig. 8.13a [57, 102]. Bit rate savings of 1.4–2.8 % were reported for this template-based context selection [57].

Table 8.7 Derivation of context increment (ctxInc) for split_transform_flag, cbf_luma, cbf_cb, and cbf_cr for the example in Fig. 8.11
Fig. 8.12
figure 12

Context index assignment for sig_coeff_flag in 4 × 4 TBs

To reduce context selection dependencies and storage costs, Sze and Budagavi [85] proposed using fewer neighbors and showed that this could be done without severely sacrificing coding efficiency. For instance, using only a maximum of 8 neighbors (removing neighbors A and D as shown in Fig. 8.13b) had negligible impact on coding efficiency, while using only six neighbors (removing neighbors A, B, D, E and H as shown in Fig. 8.13c) resulted in a coding efficiency loss of only 0.2 %. This was further extended in [18] for HM2.0, where only a maximum of five neighbors was used by removing dependencies on positions G and K, as shown in Fig. 8.13d. In HM2.0, the significance map was scanned in zig-zag order, so removing the diagonal neighbors G and K was important, since those neighbors pertain to the most recently decoded SIG.

Fig. 8.13
figure 13

Local templates for SIG context selection. X (in blue) represents the current position of the bin being processed. (a) Ten neighbors (HM1.0), (b) eight neighbors, (c) six neighbors, (d) 5 neighbors (HM3.0), and (e) inverted for reverse scan (HM4.0)

Fig. 8.14
figure 14

Scans used to process SIG. Diagonal scan avoids dependency on the most recently processed bin. Context selection for blue positions is affected by values of the neighboring grey positions. (a) Zig-zag scan and (b) diagonal scan

Despite reducing the number of SIG neighbors in HM2.0, dependency on the most recently processed SIG neighbors still existed for the positions at the edge of the transform block as shown in Fig. 8.14a. The horizontal or vertical shift that is required to go from one diagonal to the next in the zig-zag scan causes the previously decoded bin to be one of the neighbors (F or I) that is needed for context selection. In order to address this issue, in HM4.0, a diagonal scan was introduced to replace the zig-zag scan [86] as shown in Fig. 8.14b. Changing from zig-zag to diagonal scan had negligible impact on coding efficiency, but removed the dependency on recently processed SIG for all positions in the TB. In HM4.0, the scan was also reversed (from high frequency to low frequency) [74]. Accordingly, the neighbor dependencies were inverted from top-left to bottom-right, as shown in Fig. 8.13e.

Dependencies in context selection of SIG for 16 × 16 and 32 × 32 TBs were further reduced in HM7.0, where 16 × 16 and 32 × 32 TBs are divided into 4 × 4 subblocks. This will be described in more detail in Sect. 8.6.4.3 on coded_sub_block_flag (CSBF). In HM8.0, 8 × 8 TBs were also divided into 4 × 4 subblocks such that all TB sizes above 4 × 4 are based on a 4 × 4 subblock processing for a harmonized design [77].

Fig. 8.15
figure 15

Regions in 8 × 8, 16 × 16 and 32 × 32 TBs map to different context sets for SIG

The 8 × 8, 16 × 16 and 32 × 32 TBs are divided into three regions based on frequency, as shown in Fig. 8.15. The DC, low-frequency and mid/high-frequency regions all use different sets of contexts. To reduce memory size, the contexts for coding the SIG of 16 × 16 and 32 × 32 TBs are shared [81, 99].

For improved coding efficiency for intra predicted CUs, so-called mode dependent coefficient scanning (MDCS) is used to select between vertical, horizontal, and diagonal scans based on the chosen intra prediction mode [106], as illustrated in Fig. 8.16. Table 8.8 shows how the scans are assigned based on intra prediction mode, TB size, and component. As mentioned in Sect. 8.5.2, this requires the intra mode to be decoded before decoding the corresponding transform coefficients. MDCS is only used for 4 × 4 and 8 × 8 TBs and provides coding gains of up to 1.2 %. Note that for TBs larger than 8 × 8, and for TBs of inter predicted CUs, only the diagonal scan is used.

Fig. 8.16
figure 16

Diagonal, vertical, and horizontal scans for 4 × 4 TBs

Table 8.8 Mode dependent coefficient scanning: Mapping of intra prediction mode to scans (0 = Diagonal, 1 = Horizontal, 2 = Vertical) for different TB sizes and components

8.6.4.2 Last Position Coding

As mentioned earlier, there are strong data dependencies between significant_coeff_flag (SIG) and last_significant_coeff_flag (LAST) in H.264/AVC due to the fact that they are interleaved. Budagavi and Demircin [10] proposed grouping several SIG flags together by transmitting a LAST only once per N SIG flags. If all of the N SIG flags are zero, LAST is not transmitted. The approach of Sole et al. [73] avoids interleaving of SIG and LAST altogether. Specifically, the horizontal (x) and vertical (y) position of the last non-zero coefficient in a TB is sent in advance, rather than LAST, by using the syntax elements last_sig_coeff_x and last_sig_coeff_y, respectively. For instance, in the example shown in Fig. 8.9, last_sig_coeff_x equal to 3 and last_sig_coeff_y equal to 0 are sent before processing the TB rather than signaling LAST for each SIG with a value of 1. Signaling the (x, y) position of the last non-zero coefficient for each TB was adopted into HM3.0. Note that the SIG at the last scan position is inferred to be 1.

The last position, given by its coordinates in both x and y direction, is composed of a prefix and suffix as shown in Table 8.9. The prefixes last_sig_coeff_x_prefix and last_sig_coeff_y_prefix are both regular coded using TrU binarization with \(\mathrm{cMax} = 2 {\ast} (\log _{2}\mathrm{TrafoSize}) - 1\) [70]. A suffix is present when the corresponding prefix is composed of more than four bins. In that case, the suffixes last_sig_coeff_x_suffix and last_sig_coeff_y_suffix are bypass coded using FL binarization. Some of the contexts are shared across the chroma TB sizes to reduce context memory, as shown in Table 8.10. To maximize the impact of fast bypass coding, the bypass coded bins (i.e., the suffix bins) for both the x and y coordinate of the last position are grouped together for each TB in HEVC.
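A sketch of how a decoder could reconstruct one coordinate of the last position from its prefix/suffix pair is shown below. The closed-form reconstruction and the suffix length of (prefix >> 1) − 1 bins reflect our reading of the specification and are included for illustration only.

```python
import math

def last_pos_coordinate(prefix, suffix):
    """Reconstruct one last-position coordinate from its prefix/suffix (sketch).

    'prefix' is the value of the TrU coded last_sig_coeff_{x,y}_prefix and
    'suffix' the value of the FL coded last_sig_coeff_{x,y}_suffix
    (0 when no suffix is present, i.e., when prefix <= 3).
    """
    if prefix <= 3:
        return prefix
    suffix_len = (prefix >> 1) - 1                 # FL suffix length in bins
    return ((2 + (prefix & 1)) << suffix_len) + suffix

def last_pos_prefix_cmax(trafo_size):
    """cMax of the TrU coded prefix: 2 * log2(TrafoSize) - 1."""
    return 2 * int(math.log2(trafo_size)) - 1
```

For a 32 × 32 TB, for example, the prefix cMax is 9, and a prefix value of 9 together with a 3-bin suffix covers the coordinate range 24–31.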

Table 8.9 Binarization of coordinate values of the last position
Table 8.10 Context selection for regular coded prefix bins of the coordinates of the last position last_sig_coeff_x_prefix and last_sig_coeff_y_prefix

8.6.4.3 coded_sub_block_flag (CSBF)

As already explained in Sect. 8.3.1, the number of bins to be transmitted for signaling the significance map is considerably reduced by using a hierarchical signaling scheme of significance flags. Part of this hierarchy is the coded_sub_block_flag (CSBF) that indicates for each 4 × 4 subblock of a TB whether there are non-zero coefficients in the subblock [56, 60]. If CSBF is equal to 1, the subblock contains at least one non-zero transform coefficient level and, consequently, SIGs within the subblock are signaled. No SIGs are signaled for a 4 × 4 subblock that contains all vanishing transform coefficients, since this information is signaled by a CSBF equal to 0. For large TB sizes, a reduction in SIG bins of up to 30 % can be achieved by the use of CSBFs, which corresponds to an overall bin reduction of 3–4 % under common test conditions. To avoid signaling of redundant information, the CSBF for the subblocks containing the DC and the last position are inferred to be equal to 1. Figure 8.17 shows an example of the hierarchical signaling of an 8 × 8 significance map.

Fig. 8.17
figure 17

Example of the hierarchical signaling of an 8 × 8 significance map

In HM7.0, the CSBF was additionally used to further reduce dependencies in the context selection of SIG for 16 × 16 and 32 × 32 TBs. Specifically, the neighboring subblocks and their corresponding CSBFs (Fig. 8.18) are used for context selection rather than the individual SIG neighbors shown in Fig. 8.13e [41]. This context selection scheme was extended to 8 × 8 TBs in HM8.0 [77]. According to this scheme, the CSBFs of the neighboring right and bottom subblocks (CSBF_right, CSBF_bottom) are used to select one of the four patterns shown in Fig. 8.19: (0,0) maps to pattern 1, (1,0) to pattern 2, (0,1) to pattern 3 and (1,1) to pattern 4. The pattern maps each position within the 4 × 4 subblock to one of three contexts. As a result, there are no intrinsic dependencies for context selection of SIG within each 4 × 4 subblock.
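One way to express this pattern-based mapping in code is sketched below; the exact ternary conditions follow our reading of the context derivation in the HEVC specification, with (x_p, y_p) denoting the coefficient position within the 4 × 4 subblock.

```python
def sig_ctx_in_subblock(x_p, y_p, csbf_right, csbf_bottom):
    """Sketch of the position-to-context mapping inside a 4x4 subblock.

    Returns one of three relative context indices {0, 1, 2}; the pattern is
    selected from the CSBFs of the right and bottom neighboring subblocks.
    """
    pattern = csbf_right + 2 * csbf_bottom
    if pattern == 0:    # pattern 1: neither neighboring subblock is coded
        return 2 if x_p + y_p == 0 else (1 if x_p + y_p < 3 else 0)
    if pattern == 1:    # pattern 2: only the right neighbor is coded
        return 2 if y_p == 0 else (1 if y_p == 1 else 0)
    if pattern == 2:    # pattern 3: only the bottom neighbor is coded
        return 2 if x_p == 0 else (1 if x_p == 1 else 0)
    return 2            # pattern 4: both neighbors are coded
```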

Fig. 8.18
figure 18

Neighboring CSBFs (right, bottom) used for SIG context selection

Fig. 8.19
figure 19

4 × 4 position based mapping for SIG context selection based on CSBF of neighboring subblocks. (a) Pattern 1, (b) pattern 2, (c) pattern 3 and (d) pattern 4

Reverse diagonal scanning order is used within the subblocks and for the processing order of the subblocks themselves, as shown in Fig. 8.20 [76]. Both significance map and coefficient levels are processed in this order. As an exception to this rule, for 4 × 4 and 8 × 8 TBs to which MDCS is applied, reverse vertical and horizontal scanning orders are used within the subblocks as well as for the processing order of the subblocks themselves. Furthermore, as shown in Table 8.11, different sets of contexts for coding of SIG are used for diagonal and non-diagonal (vertical and horizontal) scans in both 4 × 4 luma and chroma TBs, and 8 × 8 luma TBs [77].

Fig. 8.20
figure 20

Subblock scans. Scan for 4 × 4 TB shown in Fig. 8.16. (a) Subblock scan for 8×8 TB. (b) Subblock scan for 16 × 16 TB. Scan for 32 × 32 TB is also all diagonal

Fig. 8.21
figure 21

Flow chart for coding the syntax elements of a TB in HEVC

Table 8.11 Context selection of sig_coeff_flag based on component, TB size, scan order (Table 8.8), position of subblock within the TB (Fig. 8.15), and position based context index within 4 × 4 TB or subblock (SubIdx) (Fig. 8.12 or 8.19, resp.)

8.6.4.4 Summary of Significance Map Coding in HEVC

Figure 8.21 summarizes the steps required to code the significance map. This process is repeated for every non-zero TB in HEVC. Table 8.11 summarizes the multiple steps of classification used to assign the 42 contexts of sig_coeff_flag. Contexts 0 to 26 are used for luma TBs, while contexts 27 to 41 are used for chroma TBs. The contexts are further assigned based on the TB size, the scan direction, whether the subblock contains the DC coefficient, the CSBFs of the neighboring subblocks, and the position within the subblock. Note that context 0 is used to code the sig_coeff_flag of the DC position of all luma TBs, and context 27 is used for the DC position of all chroma TBs.

8.6.5 Absolute Coefficient Level and Coefficient Sign

In HEVC, parsing of transform coefficient level information is performed subblock-by-subblock using up to five scan passes for each subblock. The first scan pass is devoted to the SIG flags, as already explained in Sects. 8.6.4.1 and 8.6.4.3. In the second and third passes, the two additional flags coeff_abs_level_greater1_flag (ALG1) and coeff_abs_level_greater2_flag (ALG2) are conditionally parsed, indicating for each relevant scan position whether the corresponding absolute value of the coefficient level, i.e., the absolute level (AL), is greater than 1 and 2, respectively. However, only up to 8 ALG1 flags and one ALG2 flag are transmitted for each subblock, as will be explained in more detail below. In the fourth scan pass, the sign of each significant level is signaled, with the possible exception of the last non-zero coefficient of the subblock in reverse scanning order, as will be discussed in more detail in Sect. 8.6.5.2. Finally, in the fifth and last scan pass, the remaining information of the absolute levels in the subblock (if present) is transmitted using the syntax element coeff_abs_level_remaining (ALRem), as will be further detailed in Sect. 8.6.5.1 below.

8.6.5.1 Coding of Absolute Level

Coding of absolute levels requires the choice of suitable binarization schemes and, for selected bin indices, the choice of suitable context models. In line with the design considerations discussed in Sect. 8.3, both coding efficiency and throughput have been addressed by the revised CABAC design of HEVC. This is especially true for the coding of absolute levels, which typically contribute the dominant portion of the total number of generated bins. In the following, we will first elaborate on how the specific binarization scheme for absolute levels in HEVC has been designed. Then, in the second part of this subsection, we will present the context selection rules applied to the (few remaining) regular coded bins of absolute levels, except where these have already been covered in Sects. 8.6.4.1 and 8.6.4.3.

Fig. 8.22
figure 22

Illustration of the adaptive binarization scheme for absolute levels in HEVC consisting of a concatenation of the three elementary binarizations TrU, TRk, and EGk, the latter two with varying order k and k + 1, respectively (0 ≤ k ≤ 4). The two variable thresholds B_0 and B_1 specify the (variable) transition points between them

Conceptually, the binarization of an absolute level, denoted as z in the following, relies on a concatenated application of three binarization processes [21, 54, 59]: truncated unary (TrU), k-th order truncated Rice (TRk), and (k + 1)-th order Exp-Golomb (EGk). Figure 8.22 illustrates this binarization scheme for arbitrary z along the (discrete) number line. There are two thresholding parameters B_0, B_1 with B_0 < B_1, which separate the three regions from one another for application of each of the three binarization processes and which also determine the truncation parameters cMax(TrU) = B_0 + 1 and cMax(TRk) = B_1 − B_0. The selection of the two parameters B_0, B_1 together with the choice of the parameter k is performed in a backward-adaptive manner for each subblock in such a way that the resulting bin strings are already close to a minimum-redundancy prefix code for the collection of all absolute levels z in each subblock. As a consequence, the majority of resulting bins can be simply bypass coded without compromising coding efficiency.

For each subblock, the initialization and adaptation of the parameters B_0, B_1, and k are performed as follows. Before starting the processing of a subblock, the parameter k is set equal to 0, whereas B_0 is set equal to 2. The second thresholding parameter B_1 depends on k and B_0 through the fixed relation B_1 = 4 · 2^k + B_0, which means that B_1 is adapted whenever B_0 or k changes. For each scan position in the subblock processing, the absolute level z is evaluated after encoding/decoding: B_0 is decremented by 1 after the first occurrence of z > 1, which corresponds to the first scan position in the subblock for which an ALG2 flag is signaled. A further adaptation of B_0 to its minimum value of 0 is performed after the eighth occurrence of z > 0, i.e., after an ALG1 flag has been signaled eight times in the subblock. The parameter k is set to min(k + 1, 4) after each scan position for which the corresponding absolute level z fulfills the condition z > 3 · 2^k. Note that according to this adaptation rule, k can take integer values from 0 to 4 inclusive. Tables 8.12 and 8.13 show example binarizations for two different configurations of the parameters B_0, B_1, and k. Please note that the result of the binarization for z can also be interpreted as a concatenation of a unary prefix and, if present, a fixed-length suffix for different ranges of z [21]. Table 8.14 shows the corresponding binarization of ALRem, which has a maximum bin length of 32 [12].
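The following Python sketch traces this per-subblock parameter adaptation. It operates on the absolute levels of one subblock in reverse scanning order and yields the parameters in effect at each position, without constructing the actual bin strings; the bookkeeping variables (num_sig, seen_greater2) and the guard that ties the ALG2 budget to the first 8 significant positions are our own illustrative formulation of the rules above.

```python
def adapt_level_binarization_params(abs_levels):
    """Trace the adaptation of (B0, B1, k) over one 4x4 subblock (sketch).

    'abs_levels' are the absolute levels z in reverse scanning order.
    Yields the (B0, B1, k) triple that applies to each scan position.
    """
    k, b0 = 0, 2
    num_sig = 0            # positions with z > 0, i.e., ALG1 flags signaled
    seen_greater2 = False  # an ALG2 flag has already been signaled
    for z in abs_levels:
        b1 = 4 * (1 << k) + b0          # B1 tracks changes of B0 and k
        yield b0, b1, k
        # Adaptation after encoding/decoding the current level z:
        if z > 3 * (1 << k):
            k = min(k + 1, 4)
        if z > 1 and not seen_greater2 and num_sig < 8:
            seen_greater2 = True
            b0 -= 1                      # ALG2 budget used up
        if z > 0:
            num_sig += 1
            if num_sig == 8:
                b0 = 0                   # ALG1 budget (8 flags) used up
```

For instance, if all levels of a subblock are equal to 1, B_0 stays at 2 until the eighth significant position and then drops to 0, matching the restriction of at most 8 ALG1 flags and one ALG2 flag per subblock.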

Table 8.12 Binarization of the absolute level z for the choice of parameters B_0 = 2, B_1 = 6, and k = 0, corresponding to a concatenation of TrU with cMax = 3, zero-order truncated Rice (TRk) with cMax = 4, and first-order Exp-Golomb (EGk)
Table 8.13 Binarization of the absolute level z for the choice of parameters B_0 = 1, B_1 = 9, and k = 1, corresponding to a concatenation of TrU with cMax = 2, first-order truncated Rice (TRk) with cMax = 8, and second-order Exp-Golomb (EGk)
Table 8.14 An alternative representation of coeff_abs_level_remaining (ALRem) binarization as a concatenation of a unary prefix and fixed length suffix

As already indicated above, the signaling of the absolute level z involves four different syntax elements, given as sig_coeff_flag (SIG), coeff_abs_level_greater1_flag (ALG1), coeff_abs_level_greater2_flag (ALG2), and coeff_abs_level_remaining (ALRem), such that

$$\displaystyle\begin{array}{rcl} z =\mathrm{ SIG} +\mathrm{ ALG1} +\mathrm{ ALG2} +\mathrm{ ALRem},& & {}\\ \end{array}$$

where the values of the corresponding syntax elements are inferred to be equal to 0 when not explicitly signaled. Note that the flags SIG, ALG1, and ALG2 represent the first and the optional second and third bin indices of the TrU part of z, respectively. ALRem corresponds to the concatenation of the TRk and EGk parts of z, with all of its bin values being bypass coded and with a maximum bin string length of 32 [12]. Only the values of the three flags are regular coded. However, due to the adaptation rules for B_0, ALG2 can occur only once in each subblock, while the occurrence of ALG1 is restricted to at most 8 scan positions per subblock [16]. Together with the maximum of 16 SIG flags per subblock, only up to 25 regular coded bins can occur in each subblock (without accounting for the CSBF). Thus, the maximum number of regular coded bins per 4 × 4 transform (sub-)block is reduced by a factor of about 9.6 relative to the corresponding maximum number of 16 · 14 + 15 = 239 regular coded bins for H.264/AVC CABAC (including SIG bins but without accounting for LAST) [46]. This change obviously provides the most substantial reduction of the (worst case) number of regular coded bins in the entire revision of CABAC.

The rationale behind processing SIG, ALG1, ALG2, and ALRem with individual syntax elements rather than as conventional bin indices of the adaptive binarization of z is given by the fact that all values of one syntax element in each subblock are grouped together and signaled in separate scan passes. This grouping provides essentially three advantages. First, bins in the coefficient level binarization that use the same context selection logic are grouped together to reduce the amount of speculative context selection computations, as shown in Fig. 8.23. Second, by grouping bypass coded bins together, the throughput advantages of bypass bins are maximized [87]. Third, the storage for (partially reconstructed) coefficient data during the parsing process at the decoder can be reduced, as further explained in Sect. 8.6.5.2 below. Note that the reordering of bins has no impact on coding efficiency.

Fig. 8.23
figure 23

Grouping same regular coded bins and bypass bins to increase throughput. s = coeff_sign_flag

Context modeling for the regular coded bins of the absolute level is restricted to the three flags SIG, ALG1, and ALG2. Since context model selection for the SIG flag has already been introduced in Sects. 8.6.4.1 and 8.6.4.3, we focus in the following on the two flags ALG1 and ALG2. For each of these two flags, six sets of context models are provided: four sets for subblocks of the luma component and two sets for subblocks of the chroma component. Since only up to one ALG2 flag per subblock is encoded/decoded, each of the six ALG2-related sets contains only one context model. For the ALG1 flag, each set consists of four context models, and the context increment ctxInc(ALG1) for selecting one of these four models within each set is quite similar to what is specified for the coding of the first bin of the syntax element coeff_abs_level_minus1 in H.264/AVC (see [46] for a motivation of this design choice):

$$\displaystyle{ \text{ctxInc(ALG1)} = \left \{\begin{array}{l@{\quad }l} 0, \quad &\mbox{ if NumG1} > 0\\ 1 +\min (2, \text{NumT1} ),\quad &\mbox{ otherwise} \end{array} \right.,}$$

where NumT1 denotes the accumulated number of encoded/decoded trailing 1’s, i.e., absolute levels equal to 1, and NumG1 denotes the accumulated number of encoded/decoded levels with absolute value greater than 1, both computed along the reverse scanning pattern of the subblock up to (but not including) the current scan position. Note that both NumT1 and NumG1 are initialized with the value of 0 at the beginning of the subblock scan of ALG1 flags. After each encoded/decoded ALG1 flag with the value of 0, NumT1 is incremented by 1, while after each encoded/decoded ALG1 flag with the value of 1, NumG1 is incremented by 1. Figure 8.24 shows the flow chart for context increment computation of ALG1.
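In code, this accumulation can be expressed compactly as in the following sketch, where alg1_flags holds the (up to 8) ALG1 flag values of one subblock in their reverse-scan signaling order.

```python
def ctx_inc_alg1(alg1_flags):
    """Sketch of the ctxInc derivation for the ALG1 flags of one subblock."""
    num_t1 = 0   # NumT1: trailing ones seen so far (ALG1 == 0)
    num_g1 = 0   # NumG1: levels greater than 1 seen so far (ALG1 == 1)
    ctx_incs = []
    for flag in alg1_flags:
        ctx_incs.append(0 if num_g1 > 0 else 1 + min(2, num_t1))
        if flag:
            num_g1 += 1
        else:
            num_t1 += 1
    return ctx_incs
```

For example, ctx_inc_alg1([0, 0, 1, 0]) returns [1, 2, 3, 0]: the context increment grows with the number of trailing 1's until a level greater than 1 is encountered, after which increment 0 is used.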

Fig. 8.24
figure 24

Flow chart for derivation of context increment (ctxInc) for up to 8 different events b i (0 ≤ i ≤ 7) of ALG1 in a 4 × 4 subblock

Since the statistics of trailing 1's may differ from subblock to subblock as well as between subblocks belonging to different components or different locations within the TB, different sets of context models are provided for both ALG1 and ALG2, as already mentioned above. For subblocks of the luma component, two separate sets are used for subblocks containing the DC of the TB, i.e., for the top-left subblock of a TB. Another two sets are provided for luma subblocks not containing the DC, and two additional sets for chroma subblocks. Depending on the value of ctxInc(ALG1) for the last decoded ALG1 flag of the preceding subblock, one of the two members of the relevant set (luma DC, luma non-DC, or chroma) is selected: one for the case of ctxInc(ALG1) = 0 and the other for the case of ctxInc(ALG1) > 0. Thus, a total of 30 context models is used for coding of ALG1 and ALG2: 6 ⋅ 4 = 24 for ALG1 and 6 for ALG2. Interestingly enough, there was a 4× reduction (from 120 to 30) in the total number of contexts used for coding of the ALG1 and ALG2 flags during the development from HM3.0 to HM6.0, at virtually no loss in coding efficiency.

8.6.5.2 Coding of Sign

To reduce the storage cost of coefficients during parsing, as already noted above, the transform coefficient data is grouped into 4 × 4 subblocks, and the sign bins are bypass coded and signaled before the coeff_abs_level_remaining bins. Before coeff_abs_level_remaining is added, the partial value of a coefficient level can be represented with 4 bits. Thus, CABAC in HEVC only requires storage of 4 × 4 × 4 bits for each subblock (as compared to 8 × 8 × 9 bits for an 8 × 8 transform block in H.264/AVC), and the reconstructed transform coefficient level can be written out immediately once coeff_abs_level_remaining is parsed.

To improve coding efficiency, the optional sign bit hiding (SBH) technique can be used [24]. SBH hides one bit, namely the sign of one non-zero coefficient, in a group of non-zero coefficients. To this end, the encoder quantizes the coefficients in the group such that the parity of the sum of their absolute levels equals the value of the sign bit to be hidden (even for the value 0, odd for the value 1). This inherently lossy coding technique is based on the idea that, in a group of quantized coefficients, it is likely that there is at least one coefficient level whose value can be increased or decreased by 1 with only a marginal increase in rate-distortion cost. This is the case, e.g., when the unquantized coefficient was close to a quantization decision threshold, such that quantizing the coefficient to the next lower or the next higher possible quantized value are both similarly good decisions.

SBH is enabled by sign_data_hiding_enabled_flag in the PPS. If it is enabled, it applies to each 4 × 4 subblock for which the distance in scan positions between the first and the last non-zero coefficient exceeds a certain threshold. This threshold was chosen in HEVC to be 3, and the sign bit to be hidden is that of the last significant scan position in the reverse scanning pattern of each subblock. The condition for SBH can be checked while parsing the significance map and thus SBH does not have a significant impact on entropy decoding throughput. Average bit rate savings between 0.6 and 0.9 % were reported for SBH under common test conditions [104].
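A decoder-side sketch of the sign inference is given below. The input layout (16 absolute levels in forward scan order) and the helper name are illustrative choices, and the parity rule (even sum implies a positive sign) reflects our reading of the specification.

```python
def infer_hidden_sign(abs_levels_in_scan_order):
    """Sketch of decoder-side sign bit hiding for one 4x4 subblock.

    'abs_levels_in_scan_order' holds the 16 absolute levels in forward scan
    order. Returns None if no sign is hidden; otherwise returns the inferred
    sign (+1 or -1) of the first non-zero coefficient in scan order.
    """
    nz = [i for i, lvl in enumerate(abs_levels_in_scan_order) if lvl != 0]
    if not nz or nz[-1] - nz[0] <= 3:
        return None                    # SBH condition not met: sign is signaled
    parity = sum(abs_levels_in_scan_order) & 1
    return -1 if parity else +1        # even sum -> positive, odd sum -> negative
```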

8.6.5.3 Summary of Absolute Level and Sign Coding in HEVC

Figure 8.25 summarizes the last four out of up to five scan passes required for parsing the absolute levels and signs for every non-zero 4 × 4 subblock in HEVC.

Fig. 8.25
figure 25

Flow chart for coding the syntax elements of absolute level minus 1 and sign for a 4 × 4 subblock in HEVC

8.6.6 Comparison of HEVC and H.264/AVC

Table 8.15 summarizes the differences in transform coefficient coding between HEVC and H.264/AVC as well as across different transform block sizes. In terms of throughput and memory related aspects, HEVC requires 3× fewer contexts (121 vs. 359) than H.264/AVC for transform coefficient coding. Note, however, that in H.264/AVC CABAC two separate sets of context models are used for frame-based and field-based coding of SIG and LAST. Furthermore, HEVC has a 9× lower maximum number of regular coded bins per coefficient (1.9 vs. 17.1) than H.264/AVC.

Table 8.15 Differences between CABAC for different TB sizes in HEVC and H.264/AVC

8.7 Context Initialization

In HEVC, slices consist of an integer number of CTUs, which collectively form an independently decodable unit. This implies in particular that at the beginning of each slice, the parameters of all probability models must be reset to some predefined values. Typically, without any prior knowledge of the statistical nature of the source, each probability model would be initialized with the state corresponding to the uniform distribution (p = 0.5). However, in order to bridge the learning phase of the adaptive probability models and to enable a kind of preadaptation to different coding conditions, it was found to be beneficial to provide each probability model with a more appropriate initialization than the equiprobable state at the beginning of each slice.

Similar to H.264/AVC, CABAC in HEVC involves a quantization-parameter dependent initialization process that is invoked at the beginning of each slice. It generates an initial probability state value representing the LPS probability p_LPS as well as the value of the MPS ν_MPS, depending on the given initial value of the luma quantization parameter SliceQP_Y for the slice. For that purpose, a pair of so-called initialization parameters is stored for each model, from which a linear relationship between SliceQP_Y and the model probability p is derived. In contrast to H.264/AVC, the initialization parameters in HEVC do not directly represent the slope m and the offset n of the corresponding linear model. Instead, these two parameters are packed into a single 8-bit table entry in a memory-efficient way, as will be explained in more detail in the subsequent section.

For each of the three slice types I, P, and B, separate table entries are provided. However, for P and B slices the encoder can choose between the corresponding two table entries of initialization parameters and signal its choice to the decoder by use of the syntax element cabac_init_flag. Note that this mechanism is similar to that already available in H.264/AVC where, however, the choice between three instead of two pairs of initialization parameters is given for P and B slices [46, 63].

8.7.1 8-Bit Design

To reduce the memory requirements for context initialization tables, it was proposed in [52] to use 8-bit values to derive the initialization parameters rather than storing the pair of 16-bit values (m, n) of the linear model directly, as in H.264/AVC. From the high nibble of the 8-bit table entry InitValue, a variable slopeIdx is derived, while the low nibble of InitValue represents the variable offsetIdx, from which the slope m and offset n of the linear model are derived using [29]

$$\displaystyle\begin{array}{rcl} m& =& \mbox{ slopeIdx} \cdot 5 - 45 {}\\ n& =& (\mbox{ offsetIdx} << 3) - 16. {}\\ \end{array}$$

Given the values of m and n, exactly the same initialization procedure as in H.264/AVC is performed to derive the parameters of each probability model [46, 63]. Note that the 8-bit design cuts in half the amount of storage needed for the context initialization tables. The further restriction to two instead of three table entries for the P and B slice types reduces the memory requirements for those tables in HEVC by at least another 12.5–15 % relative to those of an 8-bit equivalent of H.264/AVC. Since there are 134 contexts for I slices and 154 for each of the P and B slice types, a total of 442 bytes of memory is needed for storage of all context initialization tables in HEVC.
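The derivation can be sketched as follows. The mapping from the clipped linear model output to the probability state index and MPS value follows the H.264/AVC-style procedure referenced above [46, 63]; the clipping bounds shown are our reading of that procedure rather than something stated in the text.

```python
def init_context(init_value, slice_qp_y):
    """Sketch of the QP-dependent context initialization from an 8-bit InitValue."""
    slope_idx = init_value >> 4            # high nibble
    offset_idx = init_value & 15           # low nibble
    m = slope_idx * 5 - 45
    n = (offset_idx << 3) - 16
    qp = min(max(slice_qp_y, 0), 51)       # clip the slice QP to [0, 51]
    pre_ctx_state = min(max(((m * qp) >> 4) + n, 1), 126)
    if pre_ctx_state <= 63:                # lower half of the state range: MPS value 0
        return 63 - pre_ctx_state, 0       # (pStateIdx, valMPS)
    return pre_ctx_state - 64, 1           # upper half: MPS value 1
```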

8.7.2 Context Training

The main purpose of the context initialization tables is to bridge the learning phase starting from a uniform distribution, i.e., the case of no prior knowledge of the statistics of the given bin distributions, towards the well-adapted phase of the probability estimator. Assuming that after processing of a number of N_τ bins, the probability estimator that starts from p = 0.5 reaches such a well-adapted state, the bins for each probability model were tracked for N_τ bins for each test sequence of a training set at a particular QP and for a particular slice type. As a result, a model probability p_τ,QP was estimated from the relative frequency obtained after coding the first N_τ bins for each probability model. This training procedure was performed separately for each QP and each of the three slice types. To finally determine the pair of parameters (m, n) that describe the assumed linear relationship between QP and model probability p_τ,QP, a simple linear regression was applied for each slice type. Note that a choice of N_τ = 50 was assumed to be appropriate.
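The regression step can be illustrated by the following sketch, which fits a line p ≈ slope · QP + offset to the trained (QP, p_τ,QP) pairs of one probability model by ordinary least squares; the subsequent quantization of the fitted line into the 8-bit InitValue representation is not shown here.

```python
def fit_init_line(qp_prob_pairs):
    """Least-squares fit of p ~ slope * QP + offset to (QP, p_tau_QP) pairs (sketch)."""
    n = len(qp_prob_pairs)
    sum_x = sum(qp for qp, _ in qp_prob_pairs)
    sum_y = sum(p for _, p in qp_prob_pairs)
    sum_xx = sum(qp * qp for qp, _ in qp_prob_pairs)
    sum_xy = sum(qp * p for qp, p in qp_prob_pairs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
    offset = (sum_y - slope * sum_x) / n
    return slope, offset
```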

8.7.3 Context Memory for Wavefront Parallel Processing and Dependent Slices

To improve the parallelization and low-delay capabilities beyond the use of regular slices, as known from H.264/AVC, partitioning of pictures into tiles, wavefronts and dependent slices has been introduced in HEVC. Since the use of regular slices implies in particular that the corresponding CABAC bitstream must be independently parsable, re-initialization of the CABAC probability models is required at the beginning of each regular slice. Although the initialization procedure described above mitigates the effect of such a rigorous partitioning, the loss in coding efficiency is still too large to be acceptable for certain applications.

Wavefront parallel processing (WPP) is such a picture partitioning technique, focused on improving the capabilities for parallel processing at virtually no loss in coding efficiency [27, 34]. According to the WPP scheme, a picture is partitioned into rows of CTUs, with each row being represented by its own CABAC bitstream which, however, is not fully independently parsable, except for the bitstream belonging to the first row of CTUs in a picture. Nevertheless, parallel parsing and decoding of the WPP bitstreams is possible if the processing from one CTU row to the next complies with an offset of two consecutive CTUs. This offset guarantees, on the one hand, that all spatial dependencies of the decoding process are preserved and, on the other hand, it permits inheritance of the adapted probability models from the first two CTUs of the preceding row of CTUs. The latter functionality, however, requires storing the content of all probability models after decoding the second CTU in a row. As already discussed above, the required memory depends on the slice type: 134 bytes for I slices and 154 bytes each for P and B slices. Note, however, that with proper scheduling and synchronization at the decoder, only one instance of such an additional context memory is required in addition to the N_ω context memories needed for parsing and decoding N_ω CTU rows in parallel.

The same context memory handling applies also to the concept of dependent slice segments [69]. In HEVC, slices are composed of one initial independent slice segment and zero or more dependent slice segments, each of which contains an integer number of CTUs. Compared to regular slices or independent slice segments, dependent slice segments do not break the coding dependencies within the picture area to which the corresponding CTUs belong. Although each dependent slice segment has its own CABAC bitstream, the parsing of this bitstream cannot start before the parsing of the preceding dependent or independent slice segment has been finished. In particular, the content of all adapted probability models after parsing the last CTU in the preceding slice segment needs to be stored and propagated to the current dependent slice segment. Therefore, the same amount of additional context memory is required as in the WPP case. Note, however, that WPP and dependent slices, even though most often used together, are different concepts. While WPP targets parallel processing, dependent slices cannot be processed in parallel and are most useful in applications requiring ultra-low delay, since each dependent slice segment can be put into a separate transport packet. Please refer to Chap. 3 for more details.

8.8 Overall Performance

This section analyzes the improvements of CABAC in HEVC relative to CABAC in H.264/AVC. In the first part of this section, the impact of all relevant CABAC changes in terms of coding efficiency is experimentally evaluated, while in the second part, an assessment of its throughput implications is performed. Finally, the reduction in memory requirements is analyzed.

Simulations were performed under common test conditions set by the JCT-VC [6, 101] as well as corresponding settings for H.264/AVC JM [30]. Note that those common conditions for the HEVC reference software HM [35] are intended to reflect the typical bitstreams in applications of HEVC. During standardization of HEVC, this configuration was also used to evaluate the coding efficiency impact of proposals.

In [6], four different test cases labeled as Intra, Random Access, Low Delay B, and Low Delay P are specified. The Intra test case specifies that all pictures are coded as intra pictures. In the Random Access test case, intra pictures are inserted at regular intervals of approximately 1.1 s in order to enable random access. As a temporal coding structure, hierarchical B pictures with groups of eight pictures are employed. Both the Low Delay B and Low Delay P test cases specify that the pictures are coded in display order, so that the resulting structural encoding-decoding delay is suitable for low-delay communication applications. The latter two coding conditions differ only in the slice type used. In the Low Delay B test case, B slices are used, whereas only P slices are used in the Low Delay P test case. Note that in those low-delay test cases only one intra picture is used at the beginning of each test sequence.

The same set of test sequences as in the standardization process of HEVC has been used [6]. The test sequences are categorized into different classes, each with a particular spatial resolution. As an exception, the class labeled as Screen content in the following represents a special class that contains test sequences with typical screen and graphics content, but with varying spatial resolutions.

8.8.1 Coding Efficiency

Evaluation of coding efficiency for CABAC has been restricted to the syntax elements of transform coefficient coding. For that purpose, an extension of the residual coding scheme, specified for CABAC in H.264/AVC [46], was implemented into the HM to also cover residual coding of 16 × 16 and 32 × 32 TBs. This straightforward extension was realized by increasing the number of successive scan positions sharing the same context model for both SIG and LAST of those TBs. For the remaining syntax elements related to transform coefficient level coding, the same rules as defined for CABAC in H.264/AVC are applied [46].

Table 8.16 BD-rate performance of CABAC transform coefficient coding in HEVC compared to the extended CABAC transform coefficient coding of H.264/AVC

Table 8.16 shows the so-called Bjøntegaard delta bit rate (BD-rate) for the luma component [5] as a measure of the gain in coding efficiency obtained for the transform coefficient level coding in HEVC relative to the aforementioned straightforward CABAC extension. Overall performance gains of 3.4–4.8 % in terms of averaged BD-rate savings can be attributed to the improved transform coefficient coding techniques in HEVC. The largest improvements are achieved for the Intra test case, which is mainly due to the relatively large energy of the corresponding residual signals.

Table 8.17 summarizes the individual coding efficiency impact of various adopted tools for HEVC. Note, however, that the majority of adopted tools focused on throughput improvements with minimal coding loss, as will be discussed in the following.

Table 8.17 Coding efficiency impact of adopted TU coding tools

8.8.2 Throughput Analysis

This section analyzes the throughput of HEVC relative to H.264/AVC. The impact of the techniques outlined in Sect. 8.3.3 is discussed. Analysis was also done for the worst-case throughput, which is defined as the case with the maximum number of bins per 16 × 16 coding tree unit (CTU) or macroblock. The results for the common conditions and the worst case are summarized in Tables 8.18 and 8.19, respectively.

Table 8.18 Distribution of regular coded, bypass and termination bins for CABAC in H.264/AVC (JM-16.2) and HEVC (HM8.0) under common test conditions [6, 101]
Table 8.19 Reduction of worst case number of bins and memory in HEVC over H.264/AVC

8.8.2.1 Reduce Regular Coded Bins

As mentioned earlier, bypass coded bins can be processed faster than regular coded bins, since they have no data dependencies due to context selection and their interval subdivision can be performed by a simple shift. Table 8.18 shows that the percentage of regular coded bins under common conditions is lower for HEVC than for H.264/AVC. Table 8.19 also shows that under worst case conditions there are 9× fewer regular coded bins in HEVC than in H.264/AVC. The reduction in regular coded bins is primarily due to the improved binarizations of the absolute coefficient levels and of the motion vector difference components.
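The following Python sketch illustrates why this is the case. It loosely follows the structure of the arithmetic decoding process in H.264/AVC and HEVC, but the table arguments are placeholders; the actual LPS range and state-transition tables defined in the standards are not reproduced here.

```python
# Illustrative sketch (not the standard text or HM code) contrasting a bypass
# bin with a regular coded bin.  lps_table, next_mps and next_lps are
# placeholders for the tables defined in the H.264/AVC and HEVC specifications.

def decode_bypass(rng, offset, read_bit):
    """Bypass bin: no context model, interval subdivision is a simple shift."""
    offset = (offset << 1) | read_bit()
    if offset >= rng:
        return 1, rng, offset - rng
    return 0, rng, offset

def decode_regular(rng, offset, ctx, read_bit, lps_table, next_mps, next_lps):
    """Regular bin: needs a context (probability state + MPS value), a
    data-dependent table lookup for the LPS subinterval, a context state
    update and renormalization."""
    state, mps = ctx
    r_lps = lps_table[state][(rng >> 6) & 3]    # depends on the selected context
    rng -= r_lps
    if offset >= rng:                           # LPS decoded
        offset -= rng
        rng = r_lps
        bin_val = 1 - mps
        if state == 0:
            mps = 1 - mps
        state = next_lps[state]
    else:                                       # MPS decoded
        bin_val = mps
        state = next_mps[state]
    while rng < 256:                            # renormalization
        rng <<= 1
        offset = (offset << 1) | read_bit()
    return bin_val, rng, offset, (state, mps)
```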

Using the implementation found in [103], where up to 2 regular coded bins or 4 bypass coded bins can be processed per cycle, HEVC gives 2× higher throughput than H.264/AVC in the worst case (this includes the impact of the 1.5× fewer total bins in HEVC). This can also be translated into power savings using voltage scaling, as mentioned earlier.
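To make the arithmetic behind this figure concrete, the sketch below applies a simple cycle model based on the architecture of [103]. The bin counts are placeholders rather than the actual worst-case values of Table 8.19; they are merely chosen to respect the ratios quoted above (9× fewer regular coded bins and 1.5× fewer total bins in HEVC).

```python
import math

def decode_cycles(n_regular, n_bypass, regular_per_cycle=2, bypass_per_cycle=4):
    """Simple cycle model for the architecture of [103]: up to 2 regular coded
    bins or 4 bypass coded bins per cycle, with the two kinds of bins
    processed in separate runs of cycles."""
    return (math.ceil(n_regular / regular_per_cycle)
            + math.ceil(n_bypass / bypass_per_cycle))

# Hypothetical worst-case bin counts (placeholders, not the values of Table 8.19),
# chosen so that HEVC has roughly 9x fewer regular coded bins and 1.5x fewer
# total bins than H.264/AVC:
avc_cycles  = decode_cycles(n_regular=900, n_bypass=300)
hevc_cycles = decode_cycles(n_regular=100, n_bypass=700)
print(avc_cycles / hevc_cycles)   # roughly a 2x throughput ratio under this model
```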

8.8.2.2 Group Bypass Coded Bins

Grouping bypass bins together into longer chains increases the number of bins processed per cycle and thus reduces the average number of cycles required per bypass bin. This technique is used in the coding of syntax elements related to motion vector difference, intra mode, last position, and coefficient levels. For instance, for the Kimono sequence encoded using the Random Access configuration, grouping bypass bins increases the average bypass bin run length from 2.1 to 6.4. In HEVC, under common test conditions, up to a 30 % reduction in the number of cycles can be achieved compared to the case of no grouping [89].

The benefit of bypass grouping can also be seen in the example of Figs. 8.8 and 8.9. If bypass grouping were not used, it would take five cycles to process the 5 sign bypass bins. With grouping, and assuming the architecture of [103], where 4 bypass bins are processed per cycle, only two cycles are required to process the 5 sign bins.
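The cycle counts quoted above follow directly from the assumed degree of bypass parallelism:

```python
import math

sign_bins = 5
cycles_without_grouping = sign_bins               # interleaved: one bypass bin per cycle
cycles_with_grouping = math.ceil(sign_bins / 4)   # grouped: 4 bypass bins per cycle, as in [103]
print(cycles_without_grouping, cycles_with_grouping)   # 5 cycles vs 2 cycles
```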

8.8.2.3 Group Bins with Same Context

Grouping bins with the same context together is done for motion vector difference, the significance map and coefficient levels. As a result, fewer speculative calculations are needed to decode multiple bins per cycle, since all bins that use the same logic for context selection are grouped together.

Figure 8.3 showed the speculation required when significant_coeff_flag and last_significant_coeff_flag are interleaved, as in H.264/AVC. In HEVC, no speculation is required for the significance map, as shown in Fig. 8.26. Thus, for this example, the number of operations is reduced from 14 to 5.

Fig. 8.26 No context speculation is required to achieve 5× parallelism when processing the 4 × 4 significance map in HEVC. i = coefficient position; EOB = end of block; SIG = sig_coeff_flag
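The sketch below is only an illustration of the speculation problem (the helper functions are hypothetical stubs, not HM code): to decode two bins in the same cycle, the context of the second bin must be known before the value of the first bin has been resolved.

```python
def ctx_sig(pos):   # hypothetical stub: context index of sig_coeff_flag at scan position pos
    return 10 + pos

def ctx_last(pos):  # hypothetical stub: context index of last_significant_coeff_flag at pos
    return 40 + pos

# H.264/AVC-style interleaving: the bin following sig_coeff_flag at position pos
# is last_significant_coeff_flag (if the flag is 1) or the next sig_coeff_flag
# (if it is 0), so both candidate contexts must be fetched speculatively.
def next_contexts_interleaved(pos):
    return {0: ctx_sig(pos + 1), 1: ctx_last(pos)}

# HEVC-style grouping: the last position is signaled first, so the next bin is
# always the sig_coeff_flag of the next scan position -- one non-speculative fetch.
def next_context_grouped(pos):
    return ctx_sig(pos + 1)
```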

8.8.2.4 Reduce Context Selection Dependencies

Context selection dependencies were reduced such that coding gains could be achieved without a significant penalty to throughput. For instance, the last significant coefficient position information is sent before the SIG flags to remove a tight bin-to-bin data dependency. Relative to HM1.0, the neighboring dependencies for SIG were reduced from 10 to 5 neighboring SIG bins, and then further modified to depend only on neighboring 4 × 4 subblocks. The remaining context selection for SIG is based only on its position within the block, as in H.264/AVC.

8.8.2.5 Reduce Total Number of Bins

When comparing the total number of bins in the worst case, and thus the throughput requirement, HEVC has 1.5× fewer bins than H.264/AVC. Assuming the same number of cycles per bin is required, HEVC can either run at a 1.5× lower clock rate and a lower voltage for 50 % power savings (assuming linear scaling with voltage and frequency), or process bins at a rate that is 1.5× faster than H.264/AVC.

8.8.2.6 Reduce Parsing Dependencies

Parsing dependencies were removed or reduced such that coding gains could be achieved without significantly sacrificing throughput. Removing the parsing dependency for merge and mvp enables parsing to be largely decoupled from the reconstruction process, as is the case in H.264/AVC. HEVC does have parsing dependencies on intra mode reconstruction, which are not present in H.264/AVC; however, efforts were made to keep intra mode reconstruction simple so that it does not affect parsing throughput.

8.8.2.7 Summary of Throughput Improvement Techniques

Table 8.20 contains a summary of the techniques for throughput improvement and related standard contributions. An HEVC CABAC decoder that leverages several of these improvements to achieve a throughput of over 2 Gbin/s is described in [17].

Table 8.20 Summary of throughput improvement techniques with references to related standard contributions

8.8.3 Memory Requirement Reduction

This section describes how the size and bandwidth requirements of various memories in CABAC have been reduced in HEVC in order to increase throughput as well as lower implementation cost and power consumption.

8.8.3.1 Context Memory

Context reduction was first proposed in [81], where the number of contexts was reduced for coeff_abs_level_greater1_flag and coeff_abs_level_greater2_flag without impacting coding efficiency. Subsequent proposals [3, 66, 93] reduced the number of contexts for other syntax elements (e.g., sig_coeff_flag). HEVC uses only 154 contexts, compared to the 441 (or 292 without interlace support) used in H.264/AVC, as shown in Table 8.21; thus, a 3× reduction in context memory size is achieved with HEVC.

Table 8.21 Context memory requirements for H.264/AVC (4:2:0) and HEVC
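As a rough check of the 3× figure, assuming each context is stored as 7 bits (a 6-bit probability state index plus a 1-bit MPS value, as in an H.264/AVC-style engine):

```python
BITS_PER_CONTEXT = 7          # assumption: 6-bit probability state index + 1-bit MPS
contexts_avc, contexts_hevc = 441, 154
print(contexts_avc * BITS_PER_CONTEXT, contexts_hevc * BITS_PER_CONTEXT)  # 3087 vs 1078 bits
print(contexts_avc / contexts_hevc)   # ~2.9, i.e. roughly the 3x reduction quoted above
```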

8.8.3.2 Line Buffer Memory

Reducing the size of the CABAC line buffer was first proposed in [90, 92], where the buffer size was reduced by changing the context selection for motion vector difference. Subsequent proposals [15, 20, 58, 68, 82, 91] further reduced neighboring dependencies in order to shrink the line buffer. Based on these optimizations, in the worst case the line buffer only needs to store the CU depth (2 bits) of the top neighbor for context selection of split_cu_flag for every 8 × 8 block, and a flag indicating whether the top neighbor is skipped (1 bit) for context selection of cu_skip_flag for every 4 × 4 block. Assuming a minimum CU size of 8 × 8 for a 4k × 2k sequence, HEVC only requires a line buffer of 1,024 bits versus 30,720 bits in H.264/AVC, which is a 30× reduction.

8.8.3.3 Coefficient Storage

Large TB sizes have significant hardware cost implications. Compared to the 8 × 8 transform of H.264/AVC, the 16 × 16 and 32 × 32 TBs in HEVC have 4× and 16× more coefficients, respectively, and consequently require an increase in storage cost. Several techniques were used to reduce the coefficient storage cost. First, the sign information is sent before coeff_abs_level_remaining, such that only 3 bits of storage are required per coefficient for the partially decoded value (a 2-bit magnitude with a range from 0 to 3, plus a sign bit). Second, the coefficient information is interleaved at the 4 × 4 subblock level, such that the fully reconstructed coefficients are available for each subblock and can be sent out to the next module [75]. Thus, only a coefficient storage of 4 × 4 × 3 bits is required in HEVC CABAC (compared with 8 × 8 × 9 bits in H.264/AVC) in order to reconstruct the coefficient levels.
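The buffer sizes quoted above follow directly from the stated bit widths:

```python
hevc_bits = 4 * 4 * 3   # 4x4 subblock, 3 bits per partially decoded coefficient (2-bit magnitude plus sign)
avc_bits  = 8 * 8 * 9   # 8x8 block, 9 bits per coefficient in H.264/AVC
print(hevc_bits, avc_bits, avc_bits // hevc_bits)   # 48 bits vs 576 bits, a 12x smaller buffer
```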

8.8.3.4 Context Initialization Tables

As already discussed in Sect. 8.7, the memory requirements for storing the context initialization tables in HEVC have been reduced considerably compared to those of H.264/AVC. Accounting for the reduction in the number of contexts, the number of bits per InitValue and the number of InitValue sets, HEVC has a 9× smaller context initialization table than H.264/AVC.

8.9 Conclusions

Entropy coding was a highly active area of development throughout the HEVC standardization process, with proposals for both coding efficiency and throughput improvement. The trade-off between the two requirements was carefully evaluated in multiple Core Experiments and Ad Hoc Groups [8, 11, 13, 37, 97]. Besides coding-efficiency-improving technology, many techniques were incorporated to improve throughput, including reducing the number of regular coded bins, grouping bypass bins together, grouping bins that use the same contexts together, reducing context selection dependencies, and reducing the total number of signaled bins. CABAC memory requirements were also significantly reduced. The final design of CABAC in HEVC shows that accounting for implementation cost alongside coding efficiency when designing entropy coding algorithms results in a design that can maximize processing speed and minimize area cost, while delivering high coding efficiency in the latest video coding standard.