
6.1 Introduction

In the block-based hybrid video coding approach, transforms are applied to the residual signal resulting from inter- or intra-picture prediction as shown in Fig. 6.1. At the encoder, the residual signal of a picture is divided into square blocks of size N × N, where N = 2^M and M is an integer. Each residual block (U) is then input to a two-dimensional N × N forward transform. The two-dimensional transform can be implemented as a separable transform by applying an N-point one-dimensional transform to each row and each column separately. The resulting N × N transform coefficients (coeff) are then subject to quantization (which is equivalent to division by a quantization step size Qstep and subsequent rounding) to obtain quantized transform coefficients (level). At the decoder, the quantized transform coefficients are de-quantized (which is equivalent to multiplication by Qstep). Finally, a two-dimensional N × N separable inverse transform is applied to the de-quantized transform coefficients (coeff_Q), resulting in a residual block of quantized samples which is then added to the intra- or inter-prediction samples to obtain the reconstructed block.
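
As an illustrative sketch of this chain (assuming numpy is available; it uses a real-valued orthonormal DCT rather than the HEVC integer transforms described later in this chapter), the forward transform, quantization, de-quantization, and inverse transform for one residual block can be written as:

```python
import numpy as np

def dct_matrix(N):
    # Orthonormal DCT matrix C; row i is basis vector c_i (cf. Eq. (6.2)).
    C = np.zeros((N, N))
    for i in range(N):
        P = 1.0 if i == 0 else np.sqrt(2.0)
        for j in range(N):
            C[i, j] = (P / np.sqrt(N)) * np.cos(np.pi / N * (j + 0.5) * i)
    return C

N, Qstep = 4, 16.0                    # block size and step size, chosen arbitrarily
C = dct_matrix(N)
U = np.array([[12, -3,  0,  5],       # an arbitrary residual block
              [ 7,  4, -2,  1],
              [ 0,  2,  3, -1],
              [-4,  1,  0,  2]], dtype=float)

coeff = C @ U @ C.T                   # separable 2D forward transform
level = np.round(coeff / Qstep)       # quantization: divide by Qstep and round
coeff_q = level * Qstep               # de-quantization: multiply by Qstep
rec = C.T @ coeff_q @ C               # separable 2D inverse transform
```

Removing the quantize/de-quantize pair makes the chain lossless up to floating-point precision; with quantization, rec differs from U only by quantization error.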

Fig. 6.1

Block-based hybrid video coding. (a) Encoder, (b) Decoder. C is the transform matrix and Qstep is the quantization step size. Reproduced with permission from [6]. © IEEE 2013

Typically, the forward- and inverse transform matrices are transposes of each other and are designed to achieve near lossless reconstruction of the input residual block when concatenated without the intermediate quantization and de-quantization steps.

In video coding standards such as HEVC, the de-quantization process and inverse transforms are specified, while the forward transforms and quantization process are chosen by the implementer (subject to constraints on the bitstream).

This chapter is organized as follows. Section 6.2 describes the two transform types used in HEVC: the core transform based on the discrete cosine transform and the alternate transform based on the discrete sine transform. Design principles used to develop the transform are also highlighted to provide insight into the transform design process which considered both coding efficiency and complexity. In Sect. 6.3, the HEVC quantization process is described. Topics covered in this section include the actual quantization and de-quantization steps, quantization matrices, and quantization parameter derivation. Section 6.4 provides an overview of the three special coding modes in HEVC (I_PCM mode, Lossless mode, and Transform skip mode) that modify the transform and quantization process by either skipping the transform or by skipping both transform and quantization. Sections 6.5 and 6.6 provide complexity analysis and coding performance results respectively.

6.2 HEVC Transform

Portions of this section are © 2013 IEEE. Reprinted, with permission, from M. Budagavi, A. Fuldseth, G. Bjøntegaard, V. Sze, M. Sadafale, “Core Transform Design in the High Efficiency Video Coding (HEVC) Standard,” IEEE Journal of Selected Topics in Signal Processing, December 2013.

The HEVC standard [16] specifies core transform matrices of size 4 × 4, 8 × 8, 16 × 16 and 32 × 32 to be used for two-dimensional transforms in the context of block-based motion-compensated video compression. Multiple transform sizes improve compression performance, but also increase the implementation complexity. Hence a careful design of the core transforms is needed.

HEVC specifies two-dimensional core transforms that are finite precision approximations to the inverse discrete cosine transform (IDCT) for all transform sizes. Note that because of the approximations, the HEVC core transforms are not the IDCT. The fact that an IDCT is not used does not necessarily make the HEVC core transforms imperfect. In fact, the finite precision approximations are desirable as explained in the next two paragraphs. The main purpose of the transform is to de-correlate the input residual block. The optimal de-correlating transform is the Karhunen–Loeve transform (KLT) [22] and not necessarily the DCT. This is especially true for the coding of 4 × 4 luma intra-prediction residual blocks, for which HEVC specifies an alternate 4 × 4 integer transform based on the discrete sine transform (DST) [24]. Note that only the inverse transforms are specified in the HEVC standard and the forward transforms are not. So an encoder may get additional coding efficiency benefits by using, as the forward transform, the true matrix inverse of the specified inverse transform rather than its transpose.

In the H.261, MPEG-1, H.262/MPEG-2, and H.263 video coding standards, an 8-point IDCT was specified with infinite precision. To ensure interoperability and to minimize drift between encoder and decoder implementations using finite precision, two features were included in the standards. First, block-level periodic intra refresh was mandatory. Second, a conformance test for the accuracy of the IDCT using a pseudo-random test pattern was specified.

In the H.264/MPEG-4 Advanced Video Coding (AVC) standard [15], the problem of encoder–decoder drift was solved by specifying integer valued 4 × 4 and 8 × 8 transform matrices. The transforms were designed as approximations to the IDCT with emphasis on minimizing the number of arithmetic operations. These transforms had large variations of the norm of the basis vectors. As a consequence of this, non-flat default de-quantization matrices were specified to compensate for the different norms of the basis vectors [20].

During the development of HEVC, several different approximations of the IDCT were studied for the core transform. The first version of the HEVC Test Model HM1 used the H.264/AVC transforms for 4 × 4 and 8 × 8 blocks and integer approximation of Chen’s fast IDCT [7] for 16 × 16 and 32 × 32 blocks. The HM1 inverse transforms had the following characteristics [23, 28]:

  • Non-flat de-quantization matrices for all transform sizes: While acceptable for small transform sizes, the implementation cost of using de-quantization matrices for larger transforms is high because of larger block sizes,

  • Different architectures for different transform sizes: This leads to increased area since hardware sharing across different transform sizes is difficult,

  • A 20-bit transpose buffer used for storing intermediate results after the first transform stage in 2D transform: An increased transpose buffer size leads to larger memory and memory bandwidth. In hardware, the transpose buffer area can be significant and comparable to transform logic area [30],

  • Full factorization architecture requiring cascaded multipliers and intermediate rounding for 16- and 32-point transforms: This increases data path dependencies and impacts parallel processing performance. It also leads to increased bit width for multipliers and accumulators (32 bits and 64 bits respectively in software). In hardware, in addition to area increase, it also leads to increased circuit delay thereby limiting the maximum frequency at which the inverse transform block can operate.

To address the complexity concerns of the HM1 transforms, a matrix multiplication based core transform was proposed in [10] and eventually adopted as the HEVC core transform. The design goal was to develop a transform that was efficient to implement in both software on SIMD machines and in hardware. Alternative proposals to the HEVC core transform design can be found in [1, 9, 17].

The HEVC core transform matrices were designed to have the following properties [10]:

  • Closeness to the IDCT

  • Almost orthogonal basis vectors

  • Almost equal norm of all basis vectors

  • Same symmetry properties as the IDCT basis vectors

  • Smaller transform matrices are embedded in larger transform matrices

  • Eight-bit representation of transform matrix elements

  • Sixteen-bit transpose buffer

  • Multipliers can be represented using 16 bits or less with no cascaded multiplications or intermediate rounding

  • Accumulators can be implemented using less than 32 bits

6.2.1 Discrete Cosine Transform

The N transform coefficients v_i of an N-point 1D DCT applied to the input samples u_j can be expressed as

$$ {v}_i={\displaystyle \sum_{j=0}^{N-1}}{u}_j{c}_{i j} $$
(6.1)

where i = 0, …, N−1. The elements c_ij of the DCT transform matrix C are defined as

$$ {c}_{ij}=\frac{P}{\sqrt{N}} \cos \left[\frac{\pi}{N}\left( j+\frac{1}{2}\right) i\right] $$
(6.2)

where i, j = 0, …, N−1 and where P is equal to 1 and \( \sqrt{2} \) for i = 0 and i > 0, respectively. Furthermore, the basis vectors c_i of the DCT are defined as c_i = [c_i0, …, c_i(N−1)]^T, where i = 0, …, N−1.

The DCT has several properties that are considered useful both for compression efficiency and for efficient implementation [22].

  1. The basis vectors are orthogonal, i.e. c_i^T c_j = 0 for i ≠ j. This property is desirable for compression efficiency by achieving transform coefficients that are uncorrelated.

  2. The basis vectors of the DCT have been shown to provide good energy compaction, which is also desirable for compression efficiency.

  3. The basis vectors of the DCT have equal norm, i.e. c_i^T c_i = 1 for i = 0, …, N−1. This property is desirable for simplifying the quantization/de-quantization process. Assuming that equal frequency-weighting of the quantization error is desired, equal norm of the basis vectors eliminates the need for quantization/de-quantization matrices.

  4. Let N = 2^M. The elements of a DCT matrix of size 2^M × 2^M are a subset of the elements of a DCT matrix of size 2^(M+1) × 2^(M+1). More specifically, the basis vectors of the smaller matrix are equal to the first half of the even basis vectors of the larger matrix. This property is useful for reducing implementation costs since the same multipliers can be reused for various transform sizes.

  5. The DCT matrix can be specified using a small number of unique elements. By examining the elements c_ij of (6.2) it can be shown that the number of unique elements in a DCT matrix of size 2^M × 2^M is equal to 2^M − 1. As further elaborated in Sect. 6.2.4, this is particularly advantageous in hardware implementations.

  6. The even basis vectors of the DCT are symmetric, while the odd basis vectors are anti-symmetric. This property is useful for reducing the number of arithmetic operations.

  7. The coefficients of a DCT matrix have certain trigonometric relationships that allow for a reduction in the number of arithmetic operations beyond what is possible by exploiting the (anti-)symmetry properties. These properties can be utilized to implement fast algorithms such as Chen's fast factorization [7].
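
Several of these properties can be checked numerically for the real-valued DCT (a sketch assuming numpy; the sizes 8 and 16 are chosen arbitrarily):

```python
import numpy as np

def dct_matrix(N):
    # Orthonormal DCT matrix of Eq. (6.2); row i is basis vector c_i.
    C = np.zeros((N, N))
    for i in range(N):
        P = 1.0 if i == 0 else np.sqrt(2.0)
        for j in range(N):
            C[i, j] = (P / np.sqrt(N)) * np.cos(np.pi / N * (j + 0.5) * i)
    return C

C8, C16 = dct_matrix(8), dct_matrix(16)

# Properties 1 and 3: orthogonal basis vectors of unit norm.
orthonormal = np.allclose(C8 @ C8.T, np.eye(8))

# Property 4: the 8-point basis vectors equal the first half of the even
# 16-point basis vectors, up to the sqrt(2) orthonormal normalization.
embedded = np.allclose(np.sqrt(2.0) * C16[0::2, :8], C8)

# Property 6: even basis vectors symmetric, odd ones anti-symmetric.
symmetric = np.allclose(C8[0::2], C8[0::2, ::-1])
antisymmetric = np.allclose(C8[1::2], -C8[1::2, ::-1])
```

For the scaled integer matrices of Sect. 6.2.4, the embedding of property 4 holds exactly, without the sqrt(2) factor that appears here only because of the orthonormal normalization.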

6.2.2 Finite Precision DCT Approximations

The core transform matrices of HEVC are finite precision approximations of the DCT matrix. The benefit of using finite precision in a video coding standard is that the approximation to the real-valued DCT matrix is specified in the standard rather than being implementation dependent. This avoids encoder–decoder mismatch and drift caused by manufacturers implementing the IDCT with slightly different floating point representations. On the other hand, a disadvantage of using approximate matrix elements is that some of the properties of the DCT discussed in Sect. 6.2.1 may not be satisfied anymore. More specifically, there is a trade-off between the computational cost associated with using high bit-depth for the matrix elements and the degree to which some of the conditions of Sect. 6.2.1 are satisfied.

A straightforward way of determining integer approximations to the DCT matrix elements is to scale each matrix element by some large number (typically between 2^5 and 2^16) and then round to the closest integer. However, this approach does not necessarily result in the best compression performance. As shown in Sect. 6.2.3, for a given bit depth of the matrix elements, a different strategy for approximating the DCT matrix elements results in a different trade-off between some of the properties of Sect. 6.2.1.

6.2.3 HEVC Core Transform Design Principles

The DCT approximations used for the core transforms of HEVC were chosen according to the following principles. First, properties 4–6 of Sect. 6.2.1 were satisfied without any compromise. This choice ensures that several implementation friendly aspects of the DCT are preserved. Second, for properties 1–3 and 7 of Sect. 6.2.1, there were trade-offs between the number of bits used to represent each matrix element and the degree by which each of the properties were satisfied.

To measure the degree of approximation for properties 1–3 of Sect. 6.2.1, the following measures are defined for an integer N-point DCT approximation with scaled matrix elements d_ij and basis vectors d_i = [d_i0, …, d_i(N−1)]^T, where i = 0, …, N−1.

  1. Orthogonality measure: o_ij = d_i^T d_j / (d_0^T d_0), i ≠ j

  2. Closeness to DCT measure: m_ij = |α c_ij − d_ij| / d_00

  3. Norm measure: n_i = |1 − d_i^T d_i / (d_0^T d_0)|

where i, j = 0, …, N−1, c_ij are the DCT matrix elements of (6.2), and the scale factor α is defined as d_00 N^(1/2).

As a result of careful investigation, it was decided to represent each matrix coefficient with 8 bits (including the sign bit), and to choose the elements of the first basis vector to be equal to 64 (i.e. d_0j = 64, j = 0, …, N−1). Note that this results in a scale factor of 2^(6 + M/2) for the HEVC transform matrix when compared to the orthonormal DCT. The remaining matrix elements were hand-tuned (within the constraints of properties 4–6 of Sect. 6.2.1) to achieve a good balance between properties 1–3 of Sect. 6.2.1. The hand-tuning was performed as follows. First, the real-valued scaled DCT matrix elements, αc_ij, were derived. Next, for each unique number in the resulting matrices, each integer value in the interval [−1.5, 1.5] around αc_ij was examined and the resulting values of o_ij, m_ij, and n_i were calculated. Since there are only 31 unique numbers in the transform matrices (see Sect. 6.2.4), various permutations can be examined systematically (although not exhaustively). The final integer matrix elements were chosen to give a good compromise between all measures o_ij, m_ij, and n_i. The resulting worst case values of o_ij, m_ij, and n_i are shown in the second column of Table 6.1. The norm was considered to be sufficiently close to 1 (i.e. the norm measure n_i sufficiently close to 0) to justify not using a non-flat default de-quantization matrix in HEVC (i.e. all transform coefficients are scaled equally).
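
These measures can be evaluated for the 4-point case, comparing the hand-tuned HEVC matrix D_4 (given in Sect. 6.2.4) against plain scale-and-round of the DCT elements (a sketch assuming numpy). At N = 4 both candidates happen to be exactly orthogonal, so the orthogonality differences only appear at larger sizes, but the norm advantage of the hand-tuned matrix is already visible:

```python
import numpy as np

def dct_matrix(N):
    # Orthonormal DCT matrix of Eq. (6.2).
    C = np.zeros((N, N))
    for i in range(N):
        P = 1.0 if i == 0 else np.sqrt(2.0)
        for j in range(N):
            C[i, j] = (P / np.sqrt(N)) * np.cos(np.pi / N * (j + 0.5) * i)
    return C

N = 4
C = dct_matrix(N)
alpha = 64 * np.sqrt(N)                  # alpha = d_00 * N^(1/2), with d_00 = 64
rounded = np.round(alpha * C)            # scale-and-round candidate
D4 = np.array([[64,  64,  64,  64],      # hand-tuned HEVC matrix
               [83,  36, -36, -83],
               [64, -64, -64,  64],
               [36, -83,  83, -36]], dtype=float)

def worst_case_measures(D):
    # Worst-case o_ij, m_ij and n_i as defined above.
    n0 = D[0] @ D[0]
    o = max(abs(D[i] @ D[j]) / n0 for i in range(N) for j in range(N) if i != j)
    m = np.max(np.abs(alpha * C - D)) / D[0, 0]
    n = max(abs(1.0 - (D[i] @ D[i]) / n0) for i in range(N))
    return o, m, n

o_hevc, m_hevc, n_hevc = worst_case_measures(D4)
o_rnd, m_rnd, n_rnd = worst_case_measures(rounded)
```

Consistent with Table 6.1, the rounded matrix is closer to the scaled DCT elements (smaller m), while the HEVC matrix has the better norm measure.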

Table 6.1 Comparison of transform design methods

For comparison purposes, the resulting measures when multiplying the real-valued DCT matrix elements by 2^(6 + M/2) and rounding to the closest integer are listed in the third column of Table 6.1. As can be seen from the table, although the matrix elements of the HEVC transforms are farther from the scaled DCT matrix elements, they have better orthogonality and norm properties.

Finally, with only an 8-bit representation, property 7 of Sect. 6.2.1 (trigonometric relationships between matrix elements) was not easily preserved. The authors are not aware of any trigonometric property of the HEVC core transforms that can be utilized to reduce the number of arithmetic operations below those required when using the (anti-)symmetry properties.

6.2.4 Basis Vectors of the HEVC Core Transforms

The left half of the 32 × 32 matrix specifying the 32-point forward transform is shown in Fig. 6.2. The right half can be derived by using the (anti-) symmetry properties of the basis vectors (property 6 of Sect. 6.2.1). The inverse transform matrix of HEVC is defined as the transpose of the matrix resulting from the figure. The 32 × 32 matrix contains up to 31 unique numbers as follows.

Fig. 6.2

Left half of the 32 × 32 matrix specifying the 32-point forward transform. Embedded 4-point (green shading), 8-point (pink shading) and 16-point (yellow shading) forward transform matrices are also shown in the figure. Reproduced with permission from [6]. © IEEE 2013

$$ {d}_{i,0}^{32},\ i=1,\dots,31=\left\{\begin{array}{l}90, 90, 90, 89, 88, 87, 85, 83, 82, 80, 78, 75, 73, 70, 67, 64,\\ 61, 57, 54, 50, 46, 43, 38, 36, 31, 25, 22, 18, 13, 9, 4\end{array}\right\} $$
(6.3)

These unique numbers are elements 1–31 of the first column of the forward transform matrix. Note that although the number 90 occurs three times, this is by accident and not generally true. The unique numbers property was used in [26] to enable 25 % area reduction for hardware designs with practical throughput.

Furthermore, the coefficients d_ij^N of the smaller transform matrices (N = 4, 8, 16) can be derived from the coefficients d_ij^32 of the 32 × 32 transform matrix as:

$$ {d}_{i j}^N={d}_{i\left(32/ N\right), j}^{32}, i, j=0,\dots, N-1 $$
(6.4)

Let D_4 denote the 4 × 4 transform matrix. By using (6.4) and Fig. 6.2, D_4 can be obtained as:

$$ {\mathbf{D}}_4=\left[\begin{array}{cccc} {d}_{0,0}^{32} & {d}_{0,1}^{32} & {d}_{0,2}^{32} & {d}_{0,3}^{32}\\ {d}_{8,0}^{32} & {d}_{8,1}^{32} & {d}_{8,2}^{32} & {d}_{8,3}^{32}\\ {d}_{16,0}^{32} & {d}_{16,1}^{32} & {d}_{16,2}^{32} & {d}_{16,3}^{32}\\ {d}_{24,0}^{32} & {d}_{24,1}^{32} & {d}_{24,2}^{32} & {d}_{24,3}^{32}\end{array}\right]=\left[\begin{array}{rrrr} 64 & 64 & 64 & 64\\ 83 & 36 & -36 & -83\\ 64 & -64 & -64 & 64\\ 36 & -83 & 83 & -36\end{array}\right] $$

The 8 × 8 transform matrix D_8 and the 16 × 16 transform matrix D_16 can be similarly obtained from the 32 × 32 transform matrix as shown in Fig. 6.2, where different colors are used to highlight the embedded 16 × 16, 8 × 8 and 4 × 4 forward transform matrices. This property allows different transform sizes to be implemented using the same architecture, thereby facilitating hardware sharing [6].

Note that from the unique numbers property of (6.3) and the (anti-)symmetry properties, D_4 is also equal to:

$$ {\mathbf{D}}_4=\left[\begin{array}{cccc} {d}_{16,0}^{32} & {d}_{16,0}^{32} & {d}_{16,0}^{32} & {d}_{16,0}^{32}\\ {d}_{8,0}^{32} & {d}_{24,0}^{32} & -{d}_{24,0}^{32} & -{d}_{8,0}^{32}\\ {d}_{16,0}^{32} & -{d}_{16,0}^{32} & -{d}_{16,0}^{32} & {d}_{16,0}^{32}\\ {d}_{24,0}^{32} & -{d}_{8,0}^{32} & {d}_{8,0}^{32} & -{d}_{24,0}^{32}\end{array}\right] $$
(6.5)
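
As a quick consistency check (a plain-Python sketch), the matrix of (6.5) can be populated from the unique numbers of (6.3) and compared against the explicit D_4 given earlier:

```python
# First column of the 32-point forward transform matrix: d32[i] = d^32_{i,0}.
# d32[0] = 64 (first basis vector); d32[1..31] are the unique numbers of (6.3).
d32 = [64,
       90, 90, 90, 89, 88, 87, 85, 83, 82, 80, 78, 75, 73, 70, 67, 64,
       61, 57, 54, 50, 46, 43, 38, 36, 31, 25, 22, 18, 13, 9, 4]

p, q, r = d32[16], d32[8], d32[24]      # d^32_{16,0}, d^32_{8,0}, d^32_{24,0}

# D_4 assembled per Eq. (6.5).
D4 = [[p,  p,  p,  p],
      [q,  r, -r, -q],
      [p, -p, -p,  p],
      [r, -q,  q, -r]]
```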

6.2.5 Intermediate Scaling

Since the HEVC matrices are scaled by 2^(6 + M/2) compared to an orthonormal DCT transform, and in order to preserve the norm of the residual block through the forward and inverse two-dimensional transforms, additional scale factors S_T1, S_T2, S_IT1, and S_IT2 need to be applied as shown in Fig. 6.3. Note that Fig. 6.3 is basically a fixed-point implementation of the transform and quantization in Fig. 6.1. While the HEVC standard specifies the scale factors of the inverse transform (i.e. S_IT1, S_IT2), the HEVC reference software also specifies corresponding scale factors for the forward transform (i.e. S_T1, S_T2). The scale factors were chosen with the following constraints:

Fig. 6.3

Additional scale factors S_T1, S_T2, S_IT1, S_IT2, S_Q, S_IQ required to implement the HEVC integer transform and quantization. (a) Forward transform and quantization, (b) inverse transform and quantization. The 2D forward and inverse transforms are implemented as separable 1D column and row transforms. C is the orthonormal DCT matrix. D is the scaled approximation of the DCT matrix. M = log2(N) where N is the transform size. Reproduced with permission from [6]. © IEEE 2013

  1. All scale factors shall be a power of two to allow the scaling to be implemented as a right shift.

  2. Assuming full range of the input residual block (e.g. a DC block with all samples having maximum amplitude), the bit depth after each transform stage shall be equal to 16 bits (including the sign bit). This was considered a reasonable trade-off between accuracy and implementation costs.

  3. Since the HEVC matrices are scaled by 2^(6 + M/2), cascading the two-dimensional forward and inverse transforms will result in a scaling of 2^(6 + M/2) for each of the 1D row forward transform, the 1D column forward transform, the 1D column inverse transform, and the 1D row inverse transform. Consequently, to preserve the norm through the two-dimensional forward and inverse transforms, the product of all scale factors shall be equal to (1/2^(6 + M/2))^4 = 2^−24 × 2^−2M.

The process of selecting the forward transform scale factors is illustrated using the 4 × 4 forward transform as an example in Fig. 6.4. When video has a bit depth of B bits, the residual will be in the range [−2^B + 1, 2^B − 1], requiring (B + 1) bits to represent it. In the following worst case bit-depth analysis we will assume a residual block with all samples having maximum amplitude equal to −2^B as input to the first stage of the forward transform. We believe this is a reasonable assumption since all basis vectors have almost the same norm. Note also that we are using −2^B instead of −2^B + 1 or 2^B − 1 in the worst case analysis since it is a power of 2. The scale factor derivation becomes simpler assuming the input to be −2^B (which still fits within (B + 1) bits) since all the scale factors are powers of 2. For this worst case input block, the maximum value of an output sample will be −2^B × N × 64. This corresponds to the dot product of the first basis vector (of length N with all values equal to 64) with an input vector consisting of values equal to −2^B. Therefore, with N = 2^M, for the output to fit within 16 bits (i.e., a maximum value of −2^15) a scaling of 1/(2^B × 2^M × 2^6 × 2^−15) is required. Consequently, the scale factor after the first transform stage is chosen as S_T1 = 2^−(B + M − 9).

Fig. 6.4

Intermediate scaling factor determination for the forward transform so that the intermediate and output values fit within 16 bits. B is the video bit depth and M = log2(N) where N is the transform size. Worst case bit-depth analysis is done assuming a residual block with all samples having maximum amplitude equal to −2^B (where B = 8 is the video bit depth) as input to the first stage of the forward transform. (a) First stage of the forward transform, (b) second stage of the forward transform. Reproduced with permission from [6]. © IEEE 2013

The second stage of the forward transform consists of multiplication of the result of the first transform stage with D_4^T. The input to the second stage of the forward transform is the output from the first stage, which is a matrix with all elements in the first row having the value −2^15. All other elements are zero, as shown in Fig. 6.4b. The output of multiplication with D_4^T will be a matrix with only a DC value equal to −2^15 × 2^M × 2^6 and all remaining values equal to 0. This implies that the scaling required after the second stage of the transform is S_T2 = 2^−(M + 6) in order for the output to fit within 16 bits.

The first stage of the inverse transform consists of multiplication of the result of the forward transform with D_4^T. In our example, the input to the first stage of the inverse transform is the output matrix from the forward transform, which is a matrix with only the DC element equal to −2^15. The output of multiplication with D_4^T will be a matrix with first column elements equal to −2^15 × 2^6. Consequently, the scaling required after the first stage of the inverse transform for the output to fit within 16 bits is S_IT1 = 2^−6.

The second stage of the inverse transform consists of multiplication of the result of the first stage of the inverse transform with D_4. The input to the second stage of the inverse transform is the output matrix from the first stage of the inverse transform, which is a matrix with first column elements equal to −2^15. The output of multiplication with D_4 will be a matrix with all elements equal to −2^15 × 2^6. So the scaling required after the second stage of the inverse transform to bring the output values into the original range [−2^B, 2^B − 1] is S_IT2 = 2^−(21 − B).
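
This worst-case analysis can be replayed end-to-end (a sketch assuming numpy, using the scale factors as derived so far, i.e. before the modification discussed at the end of this section): each of the first three stage outputs peaks at magnitude exactly 2^15, and the last stage returns the original residual.

```python
import numpy as np

B, M = 8, 2                                   # video bit depth and log2(N)
D4 = np.array([[64,  64,  64,  64],
               [83,  36, -36, -83],
               [64, -64, -64,  64],
               [36, -83,  83, -36]], dtype=float)

U = np.full((4, 4), -2.0 ** B)                # worst-case residual block

s1 = (D4 @ U)    * 2.0 ** -(B + M - 9)        # first forward stage,  S_T1
s2 = (s1 @ D4.T) * 2.0 ** -(M + 6)            # second forward stage, S_T2
s3 = (D4.T @ s2) * 2.0 ** -6                  # first inverse stage,  S_IT1
s4 = (s3 @ D4)   * 2.0 ** -(21 - B)           # second inverse stage, S_IT2
```

All intermediate arithmetic here is exact in double precision, so the asserted 16-bit peaks are not affected by rounding.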

In summary the constraints imposed in this section result in the following scale factors after different transform stages:

  • After the first forward transform stage: S_T1 = 2^−(B + M − 9)

  • After the second forward transform stage: S_T2 = 2^−(M + 6)

  • After the first inverse transform stage: S_IT1 = 2^−6

  • After the second inverse transform stage: S_IT2 = 2^−(21 − B)

where B is the bit depth of the input/output signal (e.g. 8 bits) and M = log2(N).

Without quantization/de-quantization, this choice of scale factors ensures a bit depth of 16 bits after all transform stages. However, quantization errors introduced by the quantization/de-quantization process might increase the dynamic range before each inverse transform stage to more than 16 bits. For example, consider the situation where B = 8 and all input samples to the forward transform are equal to 255. In this case, the output of the forward transform will be a DC coefficient with value equal to 255 << 7 = 32640. For high QP values and with a quantizer rounding upwards, the input to each inverse transform stage can easily exceed the allowed 16-bit dynamic range of [−32768, 32767]. While clipping to the 16-bit range was considered trivial after the de-quantizer, it was considered undesirable after the first inverse transform stage. In order to allow for quantization error of some reasonable magnitude and at the same time limit the dynamic range between the two inverse transform stages to 16 bits, the choice of scale factors for the inverse transform was finally modified as follows:

  • After the first inverse transform stage: S_IT1 = 2^−7

  • After the second inverse transform stage: S_IT2 = 2^−(20 − B)
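
One way to see that this modification is norm-preserving: the sum of the base-2 exponents of the four scale factors is unchanged. A small sketch (plain Python; the function name is illustrative):

```python
def scale_exponent_sum(B, M, modified):
    # Sum of the base-2 exponents of S_T1, S_T2, S_IT1 and S_IT2.
    s_t1 = -(B + M - 9)
    s_t2 = -(M + 6)
    s_it1 = -7 if modified else -6                 # modified inverse factors
    s_it2 = -(20 - B) if modified else -(21 - B)
    return s_t1 + s_t2 + s_it1 + s_it2

# For any B and N = 2^M the product of all four factors stays 2^(-24 - 2M),
# as required by constraint 3 above.
example = scale_exponent_sum(B=8, M=2, modified=True)
```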

The use of the inverse transform scale factors is illustrated in Fig. 6.5 using the 4 × 4 inverse transform as an example assuming the input to be the final output of Fig. 6.4.

Fig. 6.5

Use of the inverse transform scale factors assuming the input to be the final output of Fig. 6.4. Video bit depth B = 8 (a) First stage of the inverse transform, (b) Second stage of the inverse transform. Reproduced with permission from [6]. © IEEE 2013

Tables 6.2 and 6.3 summarize the different scaling factors of the forward and inverse transform, respectively, when compared to the orthonormal DCT.

Table 6.2 Scaling in different stages for the 2D forward transform
Table 6.3 Scaling in different stages for the 2D inverse transform

The HEVC specification specifies an offset value to be added before scaling to carry out rounding. This offset value is equal to the scale factor divided by 2. The offset is not explicitly shown in Figs. 6.3, 6.4, and 6.5.

Finally, two useful consequences of using 8-bit coefficients and limiting the bit depth of the intermediate data to 16 bits are that all multiplications can be implemented with multipliers having 16 bits or less, and that the accumulators before the right shift can be implemented with less than 32 bits for all transform stages.

Note also the related analysis in [18], which studies the dynamic range of the HEVC inverse transform and provides additional information on the bit depth limits of the intermediate data in the inverse transform.

6.2.6 HEVC Alternate 4 × 4 Transform

The alternate transform is applied to 4 × 4 luma intra-prediction residual blocks. The forward transform matrix is given by:

$$ {\mathbf{A}}_4=\left[\begin{array}{rrrr} 29 & 55 & 74 & 84\\ 74 & 74 & 0 & -74\\ 84 & -29 & -74 & 55\\ 55 & -84 & 74 & -29\end{array}\right] $$

The inverse transform matrix is A_4^T. The elements a_ij of the alternate transform matrix A_4 are a fixed-point representation of the Type-7 discrete sine transform (DST), obtained as follows:

$$ {a}_{ij}=\mathrm{round}\left(128*\frac{2}{\sqrt{2 N+1}} \sin \left(\frac{\left(2 i+1\right)\left( j+1\right)\pi}{2 N+1}\right)\right) $$
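
The formula can be evaluated directly to reproduce A_4 (a plain-Python sketch):

```python
import math

N = 4
# a_ij = round(128 * 2/sqrt(2N+1) * sin((2i+1)(j+1)*pi / (2N+1)))
A4 = [[round(128 * (2 / math.sqrt(2 * N + 1))
             * math.sin((2 * i + 1) * (j + 1) * math.pi / (2 * N + 1)))
       for j in range(N)]
      for i in range(N)]
```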

The intermediate scaling and quantization/de-quantization used for the alternate transform is the same as that for the core transform.

The alternate transform provides around 1 % bit-rate reduction while coding intra pictures [25]. In intra-picture prediction, a block is predicted from left and/or top neighboring samples. The prediction quality is better near the left and/or top boundary resulting in an intra-prediction residual that tends to have lower amplitude near the boundary samples and higher amplitudes away from the boundary samples. The DST basis functions are better than the DCT basis functions in modeling this spatial characteristic of the intra prediction residual. This can be seen from the first row (basis function) of the alternate transform matrix which increases from left to right as opposed to the DCT transform matrix that has a flat first row. A theoretical analysis of the optimality of DST for intra-prediction residual is provided in [25].

During the course of the development of HEVC, alternate transforms for transform block sizes of 8 × 8 and higher were also studied. However, only the 4 × 4 alternate transform was adopted in HEVC since the additional coding gain from using the larger alternate transforms was not significant (also, their complexity is higher since there is no symmetry in the transform matrix and a full matrix multiplication is needed to implement them for transform sizes 8 × 8 and larger).

6.3 Quantization and De-quantization

Quantization consists of division by a quantization step size (Qstep) and subsequent rounding, while inverse quantization consists of multiplication by the quantization step size. Here, Qstep refers to the equivalent step size for an orthonormal transform, i.e. without the scaling factors of Tables 6.2 and 6.3. Similar to H.264/AVC [27], a quantization parameter (QP) is used to determine the quantization step size in HEVC. QP can take 52 values from 0 to 51 for 8-bit video sequences. An increase of 1 in QP means an increase of the quantization step size by approximately 12% (i.e., by a factor of 2^(1/6)), and an increase of 6 leads to an increase in the quantization step size by a factor of 2. In addition to specifying the relative difference between the step sizes of two consecutive QP values, there is also a need to define the absolute step size associated with the range of QP values. This was done by selecting Qstep = 1 for QP = 4.

The resulting relationship between QP and the equivalent quantization step size for an orthonormal transform is now given by:

$$ Qstep(QP)={\left({2}^{1/6}\right)}^{QP-4} $$
(6.6)

Figure 6.6 shows how the quantization step size increases non-linearly with QP.

Fig. 6.6
figure 6

Relationship between quantization step size (Qstep) and quantization parameter (QP)

Equation (6.6) can also be expressed as:

$$ Qstep(QP)={G}_{QP\%6}<< \frac{QP}{6} $$
(6.7)

where

$$ \mathbf{G}={\left[{G}_0,{G}_1,{G}_2,{G}_3,{G}_4,{G}_5\right]}^T={\left[{2}^{-4/6},{2}^{-3/6},{2}^{-2/6},{2}^{-1/6},{2}^0,{2}^{1/6}\right]}^T $$

The fixed point approximation of (6.7) in HEVC is given by

$$ {g}_{QP\%6}=\mathrm{round}\left({2}^6\times {G}_{QP\%6}\right) $$

This results in

$$ \mathbf{g}={\left[{g}_0,{g}_1,{g}_2,{g}_3,{g}_4,{g}_5\right]}^T={\left[40,45,51,57,64,72\right]}^T $$
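The relationship between (6.6), (6.7), and the fixed-point table can be verified with a short Python sketch (illustrative only, not HEVC reference code):

```python
import math

# Equation (6.6): Qstep doubles every 6 QP steps and equals 1 at QP = 4.
def qstep(qp):
    return (2 ** (1 / 6)) ** (qp - 4)

# Exact multipliers G[0..5] and their fixed-point approximation g.
G = [2 ** (i / 6) for i in range(-4, 2)]
g = [round(64 * x) for x in G]          # round(2^6 * G)
print(g)                                # [40, 45, 51, 57, 64, 72]

# Equation (6.7): a table lookup plus a left shift reproduces (6.6).
for qp in range(52):
    assert math.isclose(qstep(qp), G[qp % 6] * (1 << (qp // 6)))
```

This confirms that one six-entry table plus a shift by QP/6 covers all 52 QP values exactly.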

HEVC supports frequency-dependent quantization by using quantization matrices for all transform block sizes. Let W[x][y] denote the quantization matrix weight for the transform coefficients at location (x, y) in a transform block. A value of W[x][y] = 1 indicates that there is no weighting. The fixed point representation of W[x][y] is given by:

$$ w\left[ x\right]\left[ y\right]=\mathrm{round}\left(16\times W\left[ x\right]\left[ y\right]\right) $$

where w[x][y] is represented using 8-bit values.

For a quantizer output, level[x][y], the de-quantizer is specified in the HEVC standard as

$$ \mathit{coeff}_Q\left[x\right]\left[y\right]=\left(\left(\mathit{level}\left[x\right]\left[y\right]\times w\left[x\right]\left[y\right]\times \left({g}_{QP\%6}<< \frac{QP}{6}\right)\right)+\mathit{offset}_{IQ}\right)>> \mathit{shift1} $$
(6.8)

where shift1 = M − 5 + B and offset_IQ = 1 << (M − 6 + B). Note that the quantization matrix weights w[x][y] modulate the quantization step size used for level at different positions in the transform block, leading to frequency-dependent quantization.

The scale factor S_IQ of Fig. 6.3 is equal to 2^−shift1 and is obtained as follows: when QP = 4 (i.e., Qstep = 1) and there is no frequency-dependent scaling (i.e., w[x][y] = 16), the combined scaling of the inverse transform and de-quantization in Fig. 6.3 should result in a product of 1 in order to maintain the norm of the residual block through the inverse transform and inverse quantization, i.e.,

$$ {S}_{IQ}\times {g}_4\times 16\times {2}^{-\left(15- B- M\right)}=1 $$
(6.9)

This results in S_IQ = 2^−(M − 5 + B), leading to shift1 being a right shift by (M − 5 + B). The scale factor 2^−(15 − B − M) in (6.9) is obtained from Table 6.3.
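A minimal sketch of the de-quantizer in (6.8) for a single coefficient, using the offset and shift values above (illustrative, not normative decoder code):

```python
def dequantize(level, w, qp, M, B):
    """De-quantizer of Eq. (6.8) for one coefficient.

    w = 16 means no frequency-dependent weighting, M = log2 of the
    transform size N, B = bit depth of the video.
    """
    g = [40, 45, 51, 57, 64, 72]        # round(2^6 * G[QP % 6])
    shift1 = M - 5 + B
    offset_iq = 1 << (M - 6 + B)
    return (level * w * (g[qp % 6] << (qp // 6)) + offset_iq) >> shift1

# 4x4 block (M = 2), 8-bit video (B = 8), QP = 4 (Qstep = 1):
# level 1 maps to 32 = 2^(15-B-M); the inverse transform's scale
# factor 2^-(15-B-M) then brings the residual back to 1.
print(dequantize(1, 16, 4, 2, 8))   # 32
```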

For the output sample of the forward transform, coeff[x][y], a straightforward quantization scheme can be implemented as follows:

$$ \mathit{level}\left[x\right]\left[y\right]=\mathit{sign}\left(\mathit{coeff}\left[x\right]\left[y\right]\right)\times \left(\left(\left(\mathit{abs}\left(\mathit{coeff}\left[x\right]\left[y\right]\right)\times {f}_{QP\%6}\times \frac{16}{w\left[x\right]\left[y\right]}+\mathit{offset}_Q\right)>> \frac{QP}{6}\right)>> \mathit{shift2}\right) $$
(6.10)

where shift2 = 29 −M −B, and

$$ \mathbf{f}={\left[{f}_0,{f}_1,{f}_2,{f}_3,{f}_4,{f}_5\right]}^T={\left[26214,23302,20560,18396,16384,14564\right]}^T $$

Note that f_QP%6 ≈ 2^14/G_QP%6. The value of shift2 is obtained by imposing constraints on the combined scaling of the forward transform and the quantizer similar to those in (6.9), i.e., S_Q × f_4 × 2^(15 − B − M) = 1, where S_Q = 2^−shift2.

Finally, offset_Q is chosen to achieve the desired rounding.
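The straightforward quantizer of (6.10) can be sketched as follows. Since the offset is an encoder choice that the standard does not fix, the `rounding` parameter below is an assumption (0.5 gives round-to-nearest; practical encoders often use smaller offsets):

```python
def quantize(coeff, w, qp, M, B, rounding=0.5):
    """Quantizer of Eq. (6.10) for one coefficient (a sketch).

    w = 16 means no weighting, M = log2(N), B = bit depth.  The
    rounding offset is not standardized; rounding=0.5 is assumed here.
    """
    f = [26214, 23302, 20560, 18396, 16384, 14564]  # ~ 2^14 / G[i]
    shift2 = 29 - M - B
    offset_q = int(rounding * (1 << (shift2 + qp // 6)))
    sign = 1 if coeff >= 0 else -1
    return sign * (((abs(coeff) * f[qp % 6] * 16 // w + offset_q)
                    >> (qp // 6)) >> shift2)

# 4x4 block (M = 2), 8-bit video (B = 8), QP = 4: a forward-transform
# output of 2^(15-B-M) = 32 quantizes to level 1, matching Qstep = 1.
print(quantize(32, 16, 4, 2, 8))   # 1
```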

To summarize, the quantizer multipliers, f_i, and de-quantizer multipliers, g_i, were chosen to satisfy the following conditions:

  • Ensure that g_i can be represented with a signed 8-bit data type (i.e., g_i < 2^7, i = 0, …, 5)

  • Ensure an almost equal increase in step size from one QP value to the next (approximately 12 %) (i.e., g_(i+1)/g_i ≈ 2^(1/6), i = 0, …, 4 and 2g_0/g_5 ≈ 2^(1/6))

  • Ensure approximately unity gain through the quantization and de-quantization processes (i.e., f_i × g_i × 16 ≈ 1 << (shift1 + shift2) = 2^6 × 2^14 × 16, i = 0, …, 5)

  • Provide the desired absolute value of the quantization step size for QP = 4 (i.e., Qstep(4) = 1, or equivalently, level = coeff × 2^−(15 − B − M) for QP = 4).

Note that the quantization equation in (6.10) is not specified in the HEVC standard and the encoder has flexibility to implement more sophisticated quantization schemes such as the rate-distortion optimized quantization (RDOQ) scheme implemented in the HEVC Test Model [13]. The idea behind RDOQ is briefly described in Chap. 9.

6.3.1 Quantization Matrix

In HEVC, the encoder can signal whether or not to use quantization matrices enabling frequency dependent scaling. Frequency dependent scaling is useful to carry out human visual system (HVS)-based quantization where low frequency coefficients are quantized with a finer quantization step size when compared to high frequency coefficients in the transform block [12]. HVS-based quantization can provide better visual quality than frequency independent quantization on some video sequences. HEVC uses the following 20 quantization matrices that depend on the size and type of the transform block:

  • Luma: Intra 4 × 4, Inter 4 × 4, Intra 8 × 8, Inter 8 × 8, Intra 16 × 16, Inter 16 × 16, Intra 32 × 32, Inter 32 × 32

  • Cb: Intra 4 × 4, Inter 4 × 4, Intra 8 × 8, Inter 8 × 8, Intra 16 × 16, Inter 16 × 16

  • Cr: Intra 4 × 4, Inter 4 × 4, Intra 8 × 8, Inter 8 × 8, Intra 16 × 16, Inter 16 × 16

When frequency dependent scaling is enabled by using the syntax element scaling_list_enabled_flag, the quantization matrices of sizes 4 × 4 and 8 × 8 have default values as shown in Fig. 6.7. The default quantization matrices for transform blocks of size 16 × 16 and 32 × 32 are obtained from the default 8 × 8 quantization matrices of the same type by upsampling using replication as shown in Fig. 6.8. The red colored blocks in the figure indicate that a quantization matrix entry in the 8 × 8 quantization matrix is replicated into a 2 × 2 region in the 16 × 16 quantization matrix and into a 4 × 4 region in the 32 × 32 quantization matrix. 8 × 8 matrices are used to represent 16 × 16 and 32 × 32 quantization matrices in order to reduce the memory needed to store the quantization matrices.
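The replication of Fig. 6.8 can be written as a small helper. The 8 × 8 input matrix and DC value below are placeholders for illustration, not the default matrices of Fig. 6.7:

```python
def upsample_scaling_list(m8, out_size, dc):
    """Expands an 8x8 quantization matrix to 16x16 or 32x32 by
    replicating each entry into a 2x2 or 4x4 region (Fig. 6.8).
    The DC (zero-frequency) weight is carried separately."""
    r = out_size // 8                   # replication factor: 2 or 4
    out = [[m8[y // r][x // r] for x in range(out_size)]
           for y in range(out_size)]
    out[0][0] = dc
    return out

# Toy example: entry (y, x) of the 8x8 matrix fills a 2x2 region.
m8 = [[8 * y + x for x in range(8)] for y in range(8)]
m16 = upsample_scaling_list(m8, 16, 99)
assert m16[0][0] == 99 and m16[3][5] == m8[1][2] and m16[15][15] == m8[7][7]
```

The nearest-neighbor indexing `m8[y // r][x // r]` is exactly the replication shown in the figure, so only 64 entries plus the DC weight need to be stored or transmitted per large matrix.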

Fig. 6.7
figure 7

Default quantization matrices for transform blocks of size 4 × 4 and 8 × 8

Fig. 6.8
figure 8

Construction of default quantization matrices for transform block sizes 16 × 16 and 32 × 32 by using the default quantization matrix of size 8 × 8

Non-default quantization matrices can also be optionally transmitted in the bitstream in sequence parameter sets (SPS) or picture parameter sets (PPS). Quantization matrix entries are scanned using an up-right diagonal scan and DPCM coded and transmitted. For 16 × 16 and 32 × 32 quantization matrices, only size 8 × 8 matrices (which then get upsampled to the correct size in the decoder as shown in Fig. 6.8) and the quantization matrix entry at the DC (zero-frequency) position are transmitted. HEVC also allows for prediction of a quantization matrix from another quantization matrix of the same size. The use of quantization matrix (termed as scaling matrix in HEVC) is enabled by setting the flag scaling_list_enabled_flag in SPS. When this flag is enabled, additional flags in SPS and PPS control whether the default quantization matrices or non-default quantization matrices are used.
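The scan and DPCM coding of matrix entries can be sketched as below. The exact scan derivation and the initial predictor value (8 here) follow the author's reading of the HEVC specification and should be treated as illustrative rather than normative:

```python
def up_right_diagonal_scan(n):
    """Positions (x, y) of an n x n block in up-right diagonal order:
    each anti-diagonal is traversed from bottom-left to top-right."""
    order = []
    for d in range(2 * n - 1):
        for y in range(min(d, n - 1), -1, -1):
            x = d - y
            if x < n:
                order.append((x, y))
    return order

def dpcm_encode(matrix, scan):
    """DPCM-codes quantization-matrix entries along the scan order.
    The initial predictor of 8 is an assumption for this sketch."""
    prev, deltas = 8, []
    for x, y in scan:
        deltas.append(matrix[y][x] - prev)
        prev = matrix[y][x]
    return deltas

# A flat matrix of 16s codes as one delta of 8 followed by zeros.
print(dpcm_encode([[16] * 4 for _ in range(4)], up_right_diagonal_scan(4)))
```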

6.3.2 QP Parameter Derivation

The quantization step size (and therefore the QP value) may need to be changed within a picture, e.g., for rate control or perceptual quantization purposes. HEVC allows transmission of a delta QP value at the quantization group (QG) level to enable QP changes within a picture. This is similar to H.264/AVC, which allows modification of QP values at the macroblock level. The QG size is a multiple of the coding unit size and can vary from 8 × 8 to 64 × 64 depending on the coding tree unit (CTU) size and the syntax element diff_cu_qp_delta_depth as shown in Table 6.4.

Table 6.4 Quantization group size for different coding tree unit sizes

The delta QP is transmitted only in coding units with non-zero transform coefficients. If the CTU is split into coding units larger than the QG size, then the delta QP is signaled in any such coding unit that has non-zero transform coefficients. If the CTU is split into coding units smaller than the QG size, then the delta QP is signaled in the first coding unit of the QG with non-zero transform coefficients. If all coding units of a QG have zero transform coefficients (e.g., if the merge mode is used in all the coding units of the QG), then no delta QP is signaled.

The QP predictor used for calculating the delta QP uses a combination of QP values from the left, above, and previous QGs in decoding order, as shown in Fig. 6.9 [21]. It uses spatial prediction from the left and above QGs within a CTU and uses the previous QP as the predictor at the CTU boundary. The spatially adjacent QP values, QPLEFT and QPABOVE, are considered unavailable when they are in a different CTU or when the current QG is at a slice/tile/picture boundary. When a spatially adjacent QP value is not available, it is replaced by the previous QP value, QPPREV, in decoding order. QPPREV is initialized to the slice QP value at the beginning of a slice, tile, or wavefront.
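The substitution-and-average rule can be sketched as follows (a simplification; the rounded average of the two neighbors follows the author's reading of the HEVC derivation, and the normative process is in the specification):

```python
def predict_qp(qp_left, qp_above, qp_prev):
    """QP predictor for the current QG (Fig. 6.9).

    qp_left / qp_above are None when the corresponding QG lies in a
    different CTU or across a slice/tile/picture boundary; the previous
    QP in decoding order is substituted for an unavailable neighbor.
    """
    left = qp_left if qp_left is not None else qp_prev
    above = qp_above if qp_above is not None else qp_prev
    return (left + above + 1) >> 1      # rounded average

print(predict_qp(30, 32, 26))      # 31: spatial prediction inside a CTU
print(predict_qp(None, None, 26))  # 26: falls back to the previous QP
```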

Fig. 6.9
figure 9

QP predictor calculation using QP values from the left, above and previous QGs [21]

The QP derivation process described in this subsection is used for calculating the luma QP value. The chroma QP values (one for the Cr component and one for the Cb component) are derived from the luma QP by using picture level and slice level offsets and a table lookup.

6.4 HEVC Special Coding Modes

HEVC has three special modes that modify the transform and quantization process: (a) I_PCM mode [8], (b) lossless mode [31], and (c) transform skip mode [19]. These modes skip either the transform or both the transform and quantization. Figure 6.10 shows these modes on top of the generic video decoder data flow of Fig. 6.1.

Fig. 6.10
figure 10

I_PCM, lossless and transform skip modes in decoder

  • In the I_PCM mode, both transform and transform-domain quantization are skipped. In addition, entropy coding and prediction are also skipped, and the video samples are directly coded with the specified PCM bit depth. The I_PCM mode is designed for use when there is data expansion during coding, e.g., when random noise is input to the video codec. By directly coding the video samples, data expansion can be avoided for such extreme video sequences. The I_PCM mode is signaled at the coding unit level using the syntax element pcm_flag.

  • In the lossless mode, both transform and quantization are skipped. (The in-loop filter, which is not shown in Fig. 6.1, is skipped too.) Mathematically lossless reconstruction is possible since the residual from inter- or intra-picture prediction is directly coded. The lossless mode is signaled at the coding unit level (using the syntax element cu_transquant_bypass_flag) in order to enable mixed lossy/lossless coding of pictures. Such a feature is useful in coding video sequences with mixed content, e.g., natural video with overlaid text and graphics. The text and graphics regions can be coded losslessly to maximize readability, whereas the natural content can be coded in a lossy fashion.

  • In the transform skip mode, only the transform is skipped. This mode was found to improve compression of screen-content video sequences generated in applications such as remote desktop, slideshows etc. These video sequences predominantly contain text and graphics. Transform skip is restricted to only 4 × 4 transform blocks and its use is signaled at the transform unit level by the transform_skip_flag syntax element.

6.5 Complexity Analysis

With straightforward matrix multiplication, the number of operations for the 1D inverse transform is N^2 multiplications and N(N − 1) additions. For the 2D transform, the number of multiplications required is 2N^3 and the number of additions required is 2N^2(N − 1). However, by utilizing the (anti-)symmetry properties of each basis vector inherited from the DCT, the number of arithmetic operations can be significantly reduced. We refer to the algorithm that does this as the Even–Odd decomposition in this chapter (it was also referred to as partial butterfly during HEVC development) [14]. Even–Odd decomposition is illustrated below using the 4- and 8-point inverse transforms.

Consider the 4-point forward transform matrix defined in (6.5). For notational simplicity, the constants d^32_(i,0) of Eq. (6.5) will be denoted by d_i. Using the new notation, (6.5) becomes

$$ {\mathbf{D}}_4=\left[\begin{array}{cccc}{d}_{16} & {d}_{16} & {d}_{16} & {d}_{16}\\ {d}_8 & {d}_{24} & -{d}_{24} & -{d}_8\\ {d}_{16} & -{d}_{16} & -{d}_{16} & {d}_{16}\\ {d}_{24} & -{d}_8 & {d}_8 & -{d}_{24}\end{array}\right] $$
(6.11)

The inverse transform matrix is given by D T4 . Let x = [x 0, x 1, x 2, x 3]T be the input vector and y = [y 0, y 1, y 2, y 3]T denote the output. The 1D 4-point inverse transform is given by the following equation:

$$ \mathbf{y}={\mathbf{D}}_4^T\mathbf{x} $$
(6.12)

The Even–Odd decomposition of the inverse transform of an N-point input consists of the following three steps:

  1. 1.

    Calculate the even part using an N/2 × N/2 subset matrix obtained from the even columns of the inverse transform matrix ((6.13) shows an example).

  2. 2.

    Calculate the odd part using an N/2 × N/2 subset matrix obtained from the odd columns of the inverse transform matrix ((6.15) shows an example).

  3. 3.

    Add/subtract the odd and even parts to generate the N-point output ((6.16) shows an example).

Even–Odd decomposition of the inverse 4-point transform is given by (6.14)–(6.16):

Even part:

$$ \left[\begin{array}{c}\hfill {z}_0\hfill \\ {}\hfill {z}_1\hfill \end{array}\right]=\left[\begin{array}{cc}\hfill {d}_{16}\hfill & \hfill {d}_{16}\hfill \\ {}\hfill {d}_{16}\hfill & \hfill -{d}_{16}\hfill \end{array}\right]\left[\begin{array}{c}\hfill {x}_0\hfill \\ {}\hfill {x}_2\hfill \end{array}\right] $$
(6.13)

The even part can be further simplified as:

$$ \begin{array}{l}{t}_0={d}_{16}{x}_0\\ {}{t}_1={d}_{16}{x}_2\\ {}\left[\begin{array}{c}\hfill {z}_0\hfill \\ {}\hfill {z}_1\hfill \end{array}\right]=\left[\begin{array}{c}\hfill {t}_0+{t}_1\hfill \\ {}\hfill {t}_0-{t}_1\hfill \end{array}\right]\end{array} $$
(6.14)

Odd part:

$$ \left[\begin{array}{c}\hfill {z}_2\hfill \\ {}\hfill {z}_3\hfill \end{array}\right]=\left[\begin{array}{c@{\quad}c}\hfill -{d}_{24}\hfill & \hfill {d}_8\hfill \\ {}\hfill -{d}_8\hfill & \hfill -{d}_{24}\hfill \end{array}\right]\left[\begin{array}{c}\hfill {x}_1\hfill \\ {}\hfill {x}_3\hfill \end{array}\right] $$
(6.15)

Add/sub:

$$ \left[\begin{array}{c}\hfill \begin{array}{c}\hfill {y}_0\hfill \\ {}\hfill {y}_1\hfill \end{array}\hfill \\ {}\hfill \begin{array}{c}\hfill {y}_2\hfill \\ {}\hfill {y}_3\hfill \end{array}\hfill \end{array}\right]=\left[\begin{array}{c}\hfill \begin{array}{c}\hfill {z}_0-{z}_3\hfill \\ {}\hfill {z}_1-{z}_2\hfill \end{array}\hfill \\ {}\hfill \begin{array}{c}\hfill {z}_1+{z}_2\hfill \\ {}\hfill {z}_0+{z}_3\hfill \end{array}\hfill \end{array}\right] $$
(6.16)

The direct 1D 4-point transform using (6.12) would require 16 multiplications and 12 additions, and the 2D transform would require 128 multiplications and 96 additions. Even–Odd decomposition, on the other hand, requires a total of six multiplications and eight additions for the 1D transform using (6.14)–(6.16). The 2D transform using Even–Odd decomposition requires a total of 48 multiplications and 64 additions, which is a 62.5 % savings in the number of multiplications and a 33.3 % savings in the number of additions when compared to direct matrix multiplication.
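The 4-point decomposition above can be checked numerically. Using the HEVC core-transform constants d16 = 64, d8 = 83, and d24 = 36, the following sketch compares the Even–Odd path against direct matrix multiplication (illustrative Python, not optimized decoder code):

```python
# HEVC 4-point core transform matrix with d16 = 64, d8 = 83, d24 = 36.
D4 = [[64, 64, 64, 64],
      [83, 36, -36, -83],
      [64, -64, -64, 64],
      [36, -83, 83, -36]]

def inverse_4pt_direct(x):
    # y = D4^T x per (6.12): 16 multiplications, 12 additions.
    return [sum(D4[i][j] * x[i] for i in range(4)) for j in range(4)]

def inverse_4pt_even_odd(x):
    # Even part (6.14): 2 multiplications, 2 additions.
    t0, t1 = 64 * x[0], 64 * x[2]
    z0, z1 = t0 + t1, t0 - t1
    # Odd part (6.15): 4 multiplications, 2 additions.
    z2 = -36 * x[1] + 83 * x[3]
    z3 = -83 * x[1] - 36 * x[3]
    # Add/sub (6.16): 4 additions.
    return [z0 - z3, z1 - z2, z1 + z2, z0 + z3]

x = [10, -3, 7, 1]
assert inverse_4pt_even_odd(x) == inverse_4pt_direct(x)
```

Both paths produce identical integer outputs, while the Even–Odd path uses 6 multiplications instead of 16.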

The 8-point 1D inverse transform is defined by the following equation:

$$ \mathbf{y}={\mathbf{D}}_8^T\mathbf{x} $$
(6.17)

where x = [x 0, x 1, …, x 7]T is input and y = [y 0, y 1, …, y 7]T is output, and D 8 is given by:

$$ {\mathbf{D}}_8=\left[\begin{array}{cccccccc}\hfill {d}_{16}\hfill & \hfill {d}_{16}\hfill & \hfill {d}_{16}\hfill & \hfill {d}_{16}\hfill & \hfill {d}_{16}\hfill & \hfill {d}_{16}\hfill & \hfill {d}_{16}\hfill & \hfill {d}_{16}\hfill \\ {}\hfill {d}_4\hfill & \hfill {d}_{12}\hfill & \hfill {d}_{20}\hfill & \hfill {d}_{28}\hfill & \hfill -{d}_{28}\hfill & \hfill -{d}_{20}\hfill & \hfill -{d}_{12}\hfill & \hfill -{d}_4\hfill \\ {}\hfill {d}_8\hfill & \hfill {d}_{24}\hfill & \hfill -{d}_{24}\hfill & \hfill -{d}_8\hfill & \hfill -{d}_8\hfill & \hfill -{d}_{24}\hfill & \hfill {d}_{24}\hfill & \hfill {d}_8\hfill \\ {}\hfill {d}_{12}\hfill & \hfill -{d}_{28}\hfill & \hfill -{d}_4\hfill & \hfill -{d}_{20}\hfill & \hfill {d}_{20}\hfill & \hfill {d}_4\hfill & \hfill {d}_{28}\hfill & \hfill -{d}_{12}\hfill \\ {}\hfill {d}_{16}\hfill & \hfill -{d}_{16}\hfill & \hfill -{d}_{16}\hfill & \hfill {d}_{16}\hfill & \hfill {d}_{16}\hfill & \hfill -{d}_{16}\hfill & \hfill -{d}_{16}\hfill & \hfill {d}_{16}\hfill \\ {}\hfill {d}_{20}\hfill & \hfill -{d}_4\hfill & \hfill {d}_{28}\hfill & \hfill {d}_{12}\hfill & \hfill -{d}_{12}\hfill & \hfill -{d}_{28}\hfill & \hfill {d}_4\hfill & \hfill -{d}_{20}\hfill \\ {}\hfill {d}_{24}\hfill & \hfill -{d}_8\hfill & \hfill {d}_8\hfill & \hfill -{d}_{24}\hfill & \hfill -{d}_{24}\hfill & \hfill {d}_8\hfill & \hfill -{d}_8\hfill & \hfill {d}_{24}\hfill \\ {}\hfill {d}_{28}\hfill & \hfill -{d}_{20}\hfill & \hfill {d}_{12}\hfill & \hfill -{d}_4\hfill & \hfill {d}_4\hfill & \hfill -{d}_{12}\hfill & \hfill {d}_{20}\hfill & \hfill -{d}_{28}\hfill \end{array}\right] $$
(6.18)

Even–Odd decomposition for the 8-point inverse transform is given by (6.19)–(6.21).

Even part:

$$ \left[\begin{array}{c}{z}_0\\ {z}_1\\ {z}_2\\ {z}_3\end{array}\right]=\left[\begin{array}{cccc}{d}_{16} & {d}_8 & {d}_{16} & {d}_{24}\\ {d}_{16} & {d}_{24} & -{d}_{16} & -{d}_8\\ {d}_{16} & -{d}_{24} & -{d}_{16} & {d}_8\\ {d}_{16} & -{d}_8 & {d}_{16} & -{d}_{24}\end{array}\right]\left[\begin{array}{c}{x}_0\\ {x}_2\\ {x}_4\\ {x}_6\end{array}\right] $$
(6.19)

Odd part:

$$ \left[\begin{array}{c}{z}_4\\ {z}_5\\ {z}_6\\ {z}_7\end{array}\right]=\left[\begin{array}{cccc}-{d}_{28} & {d}_{20} & -{d}_{12} & {d}_4\\ -{d}_{20} & {d}_4 & -{d}_{28} & -{d}_{12}\\ -{d}_{12} & {d}_{28} & {d}_4 & {d}_{20}\\ -{d}_4 & -{d}_{12} & -{d}_{20} & -{d}_{28}\end{array}\right]\left[\begin{array}{c}{x}_1\\ {x}_3\\ {x}_5\\ {x}_7\end{array}\right] $$
(6.20)

Add/sub:

$$ \mathbf{y}={\left[{z}_0-{z}_7,{z}_1-{z}_6,{z}_2-{z}_5,{z}_3-{z}_4,{z}_3+{z}_4,{z}_2+{z}_5,{z}_1+{z}_6,{z}_0+{z}_7\right]}^T $$
(6.21)

Note that the even part of the 8-point inverse transform is actually a 4-point inverse transform (compare (6.19) with the transpose of D_4 in (6.11)), i.e.,

$$ \left[\begin{array}{c}{z}_0\\ {z}_1\\ {z}_2\\ {z}_3\end{array}\right]={\mathbf{D}}_4^T\left[\begin{array}{c}{x}_0\\ {x}_2\\ {x}_4\\ {x}_6\end{array}\right] $$
(6.22)

So the Even–Odd decomposition of the 4-point inverse transform (6.14)–(6.16) can be used to further reduce the computational complexity of the even part of the 8-point transform in (6.19).

The direct 1D 8-point transform using (6.17) would require 64 multiplications and 56 additions, and the 2D transform would require 1,024 multiplications and 896 additions. Even–Odd decomposition, on the other hand, requires 6 multiplications for (6.22) and 16 multiplications for (6.20), resulting in a total of 22 multiplications. It requires 8 additions for (6.22), 12 additions for (6.20), and 8 additions for (6.21), resulting in a total of 28 additions. The 2D transform using Even–Odd decomposition therefore requires a total of 352 multiplications and 448 additions.

The computational complexity calculation for the 4-point and 8-point inverse transform can be extended to inverse transforms of larger sizes. In general, the resulting number of multiplications and additions (excluding the rounding operations associated with the shift operations) for the two-dimensional N-point inverse transform can be shown to be

$$ {O}_{mult}=2N\left(1+\sum_{k=1}^{\log_2 N}{2}^{2k-2}\right) $$
$$ {O}_{add}=2N\sum_{k=1}^{\log_2 N}{2}^{k-1}\left({2}^{k-1}+1\right) $$
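A quick Python check confirms that these formulas reproduce the 4- and 8-point operation counts derived above:

```python
from math import log2

def ops_2d_inverse(N):
    """(multiplications, additions) for the two-dimensional N-point
    inverse transform with Even-Odd decomposition."""
    L = int(log2(N))
    mult = 2 * N * (1 + sum(2 ** (2 * k - 2) for k in range(1, L + 1)))
    add = 2 * N * sum(2 ** (k - 1) * (2 ** (k - 1) + 1)
                      for k in range(1, L + 1))
    return mult, add

assert ops_2d_inverse(4) == (48, 64)    # matches the 4-point analysis
assert ops_2d_inverse(8) == (352, 448)  # matches the 8-point analysis
```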

The number of arithmetic operations for the inverse transform can be further reduced if knowledge about zero-valued input transform coefficients is available. In an HEVC decoder, this information can be obtained from the entropy decoding or de-quantization process. For typical video content, many blocks of size N × N will have non-zero coefficients only in a K × K low frequency sub-block. For example, in [5] it was found that on average around 75 % of the transform blocks had non-zero coefficients only in K × K low frequency sub-blocks. Computations can be saved in two ways for such transform blocks. Figure 6.11 shows the first way: columns that are completely zero need not be inverse transformed, so only K 1D IDCTs along the columns need to be carried out. However, all N rows will need to be transformed subsequently. The second way to reduce computations is to exploit the fact that each of the column and row IDCTs operates on a vector that has non-zero values only in the first K locations. For example, with K = N/2 and x_4 = x_5 = x_6 = x_7 = 0, roughly half the computations of the 8-point inverse transform can be eliminated by simplifying Eqs. (6.19) and (6.20) to

Even part:

$$ \left[\begin{array}{c}{z}_0\\ {z}_1\\ {z}_2\\ {z}_3\end{array}\right]=\left[\begin{array}{cc}{d}_{16} & {d}_8\\ {d}_{16} & {d}_{24}\\ {d}_{16} & -{d}_{24}\\ {d}_{16} & -{d}_8\end{array}\right]\left[\begin{array}{c}{x}_0\\ {x}_2\end{array}\right] $$

Odd part:

$$ \left[\begin{array}{c}{z}_4\\ {z}_5\\ {z}_6\\ {z}_7\end{array}\right]=\left[\begin{array}{cc}-{d}_{28} & {d}_{20}\\ -{d}_{20} & {d}_4\\ -{d}_{12} & {d}_{28}\\ -{d}_4 & -{d}_{12}\end{array}\right]\left[\begin{array}{c}{x}_1\\ {x}_3\end{array}\right] $$
Fig. 6.11
figure 11

Efficient implementation of inverse transform of a block with non-zero coefficients in only the K × K low frequency sub-block. Shaded regions denote the regions that can contain non-zero coefficients. Only K 1D IDCTs are required along columns

In general, the number of multiplications can be reduced by approximately a factor of (N/K)^2 for the first stage and a factor of N/K for the second stage. Table 6.5 shows the number of arithmetic operations for various values of N and K.

Table 6.5 Arithmetic operation counts for HEVC two-dimensional inverse transforms

Note that the majority of the arithmetic operations listed in Table 6.5 can be efficiently implemented using SIMD instructions since the operations are matrix multiplications. For example, for an 8 × 8 inverse transform implementation, (6.20) can be implemented on a 4-way SIMD processor in 4 cycles versus 16 cycles on a processor without SIMD acceleration. Software performance using SIMD acceleration on various Intel processor architectures for the 8 × 8, 16 × 16, and 32 × 32 transform sizes is provided in [3, 11].

Only the Even–Odd decomposition of the inverse transform has been described in this section. However, the Even–Odd decomposition idea can be used to reduce the complexity of the forward transform too. The article [6] presents an analysis of both the forward and the inverse core transform in more detail. It also describes hardware sharing enabled by property 4 of Sect. 6.2.1 (smaller transforms being embedded in larger transforms).

6.6 Coding Performance

The different transform sizes used in a coding block in HEVC are signaled in a quadtree structure [29]. The maximum transform size to use in a coding block is signaled in the sequence parameter set. Table 6.6 compares the coding performance of HEVC when all transform sizes (up to 32 × 32) are used to the coding performance when only the 4 × 4 and 8 × 8 transforms are used, as in H.264/AVC. The standard Bjøntegaard Delta-Rate (BD-Rate) metric [2] is used for comparison. Table 6.6 shows that there is a bit rate savings in the range of 5.6–6.8 % on average due to the introduction of the larger transform sizes (16 × 16 and 32 × 32) in HEVC. The bit rate savings are higher for higher-resolution video such as 2560 × 1600 and 1080p (1920 × 1080) content. The HEVC Test Model, HM-9.0.1 [13], was used for the simulations, and the video sequences and coding conditions were as described in [4].

Table 6.6 BD-rate savings of using larger transform sizes (16 × 16 and 32 × 32) on top of the smaller transform sizes (4 × 4 and 8 × 8)
Table 6.7 BD-Rate savings of the HEVC 4 × 4 and 8 × 8 transforms versus the H.264/AVC 4 × 4 and 8 × 8 transforms

Table 6.7 compares the coding performance of the HEVC 4 × 4 and 8 × 8 transforms to that of the corresponding H.264/AVC transforms. The H.264/AVC 4 × 4 and 8 × 8 transforms were converted to 8-bit precision and implemented in the HM-9.0.1 Test Model. Only the 4 × 4 and 8 × 8 transform sizes were enabled in the simulations. It can be seen from Table 6.7 that the HEVC 4 × 4 and 8 × 8 transforms provide better coding performance than the corresponding H.264/AVC transforms.