1 Introduction

The High Efficiency Video Coding (HEVC) project was officially launched in January 2010 as a partnership between ISO/IEC MPEG and ITU-T VCEG [1–4]. The state-of-the-art video compression technology developed jointly by the two international standardization organizations was released as a new international video compression standard in January 2013 [5–8]. With the release of the reference software, called the HEVC Test Model (HM) [9], together with the draft specification [10], research on HEVC has grown rapidly among both industry and academic researchers [11–16]. HEVC is now well known as a promising video coding standard with superior compression capability: it reduces the bit rate by about 50 % relative to its predecessor, the advanced video coding (AVC) High profile, at the same subjective quality [7, 17, 18]. Furthermore, owing to the increasing demand for higher-quality video, the AVC applications that dominate worldwide are expected to be replaced by HEVC soon.

HEVC is designed to support the greater color and bit-depth precision of new video content formats. It was developed for video applications that target higher-resolution formats (e.g., 4K and 8K), such as ultra-high-definition television (UHDTV), which will be launched in the near future [17–20]. The Main and Main 10 profiles [9, 10] have been specified for various applications. The Main profile, intended for typical consumer applications, is being deployed today, and the Main 10 profile supports up to 10 bits of precision [16]. Both profiles share the same basic tool configuration [10].

Compared with prior video coding standards, the remarkable compression performance of HEVC originates from its nested quad-tree coding structure, which is built on the coding tree unit (CTU) [21–23]. Through quad-tree-based coding unit (CU) partitioning, the CTU supports larger and more varied block structures than the fixed-size macroblock introduced in H.264/AVC [21]. Each CU can have various prediction unit (PU) and quad-tree transform unit (TU) structures [21–25]. HEVC also supports many prediction modes, including intra and inter prediction. The intra prediction modes of HEVC consist of 33 directional modes, a DC mode, and a planar mode [21–28]. Two in-loop filters, the deblocking filter and sample adaptive offset [29, 30], are employed to improve compression and visual quality. However, these coding tools also significantly increase the computational complexity, which may make the encoder impractical for real-time video applications that must preserve the best visual quality [21, 27, 32, 33].

Intra coding exploits spatial redundancy among neighboring blocks by extrapolating values from the reconstructed pixels of neighboring blocks. In HEVC, intra coding is designed for 2N × 2N and N × N PU sizes ranging from 64 × 64 to 4 × 4 and uses 35 intra prediction modes derived from previously coded blocks. Intra prediction mode 0 is the planar mode and mode 1 is the DC mode; the other 33 modes are angular modes. This design evidently improves the coding efficiency of HEVC for intra coded pictures over prior video codecs [27] and still picture codecs [61, 62]. HEVC intra coding is reported to reduce the bit rate by 22.3 % on average and by up to 36 % relative to the H.264/AVC reference. In addition, the average objective bit-rate reductions of HEVC intra coding over still picture codecs are 61.6 % and 20 % with respect to JPEG and JPEG 2000 [61], respectively, and 32.2 % with respect to JPEG XR [62]. Furthermore, subjective picture quality at a fixed bit rate is also significantly better than that of both prior video and still picture codecs. However, because of the various PU sizes and intra prediction modes, HEVC intra coding imposes a heavy computational load [27]. The rate distortion optimization (RDO) used to select the best intra coding mode requires an exhaustive search [26], which is impractical for most applications [16].

To find the optimal prediction mode for the current block with lower complexity, the HM intra coder consists of three stages: rough mode decision (RMD), rate distortion optimization (RDO), and the residual quad-tree (RQT) process [31]. However, it still has high computational complexity. Intra prediction in HM is discussed in detail in Sect. 2. Various algorithms have been published to reduce intra prediction complexity while maintaining the coding efficiency of HEVC. PU-based RMD algorithms were proposed to evaluate fewer prediction modes than HM does [34–36]. However, reducing the number of intra prediction modes alone yields only a small encoding time saving. Therefore, other papers proposed fast intra mode decisions that optimize not only the RMD stage but also the other stages [24, 41, 42]. In addition, early termination techniques and other approaches for CU, PU, or TU decisions have been presented to accelerate intra coding [39, 44–46].

In this paper, a context-adaptive fast intra coding algorithm is presented that decreases the computational complexity while limiting the Bjøntegaard delta rate (BD-rate) [60] increment across all the aforementioned stages of intra coding in HM. The proposed algorithm improves on our previous work [24], achieving a greater encoding time saving with a low BD-rate loss. Two new approaches for the RQT stage are introduced in this work to enhance the encoding time reduction: an early skip of the RQT stage and an early termination of the RQT partition. The proposed RQT algorithm is designed only for intra coding and is organized very differently from our previous work [37]. First, an adaptive rough mode selection is employed to reduce the candidate modes in the RMD stage. It is organized based on the upper CU layer and the neighboring PU blocks. Then, the rate distortion (RD) cost estimated in the RMD stage and the average RD cost of the available neighboring CU blocks are used as thresholds to skip the RDO and RQT stages early, respectively. The experimental results show that the proposed algorithm achieves an encoding time saving of about 40 % with a negligible BD-rate loss of 1.3 %.

The rest of this paper is organized as follows. Section 2 presents the overview of intra prediction of HEVC. In Sect. 3, the proposed context-adaptive fast intra coding algorithm is presented. Section 4 shows the performance evaluation of the proposed algorithm, and Sect. 5 concludes with further research topics.

2 Overview of intra prediction of HEVC

In this section, intra prediction in the HEVC standard is briefly introduced. To achieve high coding efficiency, intra prediction spatially predicts the target pixels from the reconstructed neighboring pixels of the left and/or upper blocks.

As shown in Fig. 1a, the target pixels (a, b, …, and p) are predicted from the neighboring pixels (A, B, …, and Q). Furthermore, to predict the target block precisely, 35 prediction modes are defined in the HEVC standard, consisting of planar, DC, and 33 angular prediction modes, as shown in Fig. 1b.

Fig. 1 Intra prediction: (a) a 4 × 4 block and its neighboring pixels and (b) the 35 intra prediction modes of HEVC

Figure 2 depicts the block diagram of the fast intra coding adopted in HM. As shown in Fig. 2, the fast intra coding algorithm selects a limited number of the 35 prediction modes as candidate modes for the RDO stage, depending on the CU size from 64 × 64 to 4 × 4.

Fig. 2 Block diagram of intra prediction of HM

The RD cost of each prediction mode is computed from the sum of absolute Hadamard-transformed differences (SATD) and its associated mode bits, and is defined by:

$$J_{\text{RMD}} = {\text{SATD}} + \lambda \cdot {\text{Bit}}_{\text{mode}}$$
(1)
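As an illustration of Eq. (1), the following minimal sketch computes the RMD cost for a single 4 × 4 block. It is not the HM implementation: the function names are hypothetical, the Hadamard transform is left unnormalized, and the scaling that HM applies to the SATD is omitted.

```python
# Minimal sketch of Eq. (1) for one 4x4 block (assumption: unnormalized
# Hadamard transform; function names are illustrative, not taken from HM).

H4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]


def hadamard_4x4(block):
    """Return H4 * block * H4^T for a 4x4 block given as a list of rows."""
    # tmp = H4 * block
    tmp = [[sum(H4[i][k] * block[k][j] for k in range(4)) for j in range(4)]
           for i in range(4)]
    # out = tmp * H4^T
    return [[sum(tmp[i][k] * H4[j][k] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd_4x4(original, prediction):
    """Sum of absolute Hadamard-transformed differences of two 4x4 blocks."""
    diff = [[original[i][j] - prediction[i][j] for j in range(4)]
            for i in range(4)]
    return sum(abs(c) for row in hadamard_4x4(diff) for c in row)


def rmd_cost(original, prediction, mode_bits, lam):
    """J_RMD = SATD + lambda * Bit_mode, as in Eq. (1)."""
    return satd_4x4(original, prediction) + lam * mode_bits
```

A perfect prediction gives an SATD of zero, so the cost then reduces to the mode-signalling term λ · Bit_mode.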

The selected modes are sorted in ascending order of cost and then evaluated in the RDO stage with an RQT depth of zero to find the best mode for the current CU. Finally, the RQT stage is performed to determine the best TU partition. In HM, the maximum depth of the TU quad-tree for intra coding is set to 3 to improve coding efficiency. Although fast intra coding has already been adopted in HM, further improvement is required because the encoding complexity remains high. Table 1 shows the observed complexity of HM intra coding. The RDO stage is the most expensive process, accounting for about 42 % of the computational complexity, followed by the RQT stage at 41 %. The RMD stage accounts for only about 17 % [24].

Table 1 Complexity observation of HM intra coding

To decrease the computational load of intra coding, many algorithms based on the HM reference software have been proposed. In general, they can be grouped into three categories according to the HM stages they target: fast intra prediction mode decision (RMD and RDO optimizations), fast block type (CU, PU, or TU) decision, and combinations of the two. Existing algorithms are discussed below under these three categories.

2.1 Fast intra prediction mode decision

Zhao et al. [34] proposed a new intra prediction algorithm to reduce the complexity of the intra prediction mode decision. Four configurations are employed on HM-1.0 to decrease the number of candidate modes. These configurations are set according to the PU size, from 4 × 4 to 64 × 64. The experimental results show that this algorithm saves up to 20 % of the encoding time with a BD-rate loss of 0.12 % under the high efficiency test condition. A similar idea was proposed by Mengmeng et al. [36]. Their algorithm treats each PU size differently based on the statistics of intra prediction mode probability on HM-4.0. It is claimed to achieve about 15 % encoding time reduction with a 0.64 % BD-rate increment for the main intra configuration. According to these and other similar works [35, 50, 52], accelerating only the RMD or RDO stage does not reduce the computational complexity significantly. These conventional algorithms rely on the observation that coding efficiency and computational complexity can be jointly optimized by adapting the number of coding modes to the PU size, instead of using a fixed number of coding modes.

Other works [31, 38, 42] optimize both the RMD and RDO stages to achieve a greater encoding time reduction with a minimal BD-rate increment. A local saliency detection for intra prediction mode decision is proposed to optimize the RMD and RDO processes [31]. It is reported to achieve about 19 % complexity reduction with 0.3 % BD-rate loss over HM-7.0. Hao’s work [38] proposed a progressive intra prediction mode reduction for the RMD stage, together with an early RDO termination for further complexity reduction in the RDO stage. However, this method still incurs a high BD-rate loss of 3 %, with an encoding time reduction of approximately 38 % over HM-6.0.

A fast intra coding algorithm for ultra-high-definition (UHD) videos is proposed by Dhollande et al. [42]. Although this work claims a significant time saving of 67.3 % on average, the BD-rate increment is rather high at 5 %, owing to the use of the Hadamard transform in the RMD and RDO stages. Note that the performance evaluation is not clearly presented in that manuscript. Jamali et al. [49] use the Sobel operator to determine the dominant edge of a PU block in order to optimize RMD; the SATD cost is also used to skip the RDO stage. The method yields about 36 % encoding time reduction with 1.07 % BD-rate loss compared with the HM-15.0 reference software.

2.2 Fast block types (CU, PU, or TU) decision

Liu et al. [43] proposed to reduce the intra coding complexity of HEVC by optimizing the RDO stage in HM-12.1. The algorithm employs a support vector machine (SVM), which is trained on the average pixel difference on the CU boundary, the pixel variance of the current CU, the variance of the means of the sub-CUs, and the number of edge points. The SVM is then used to decide the intra CU depth and thereby shorten the intra encoding time. This fast algorithm accelerates encoding by about 46.5 % on average with a BD-rate loss of over 2 %. An effective CU size decision for HEVC intra coding is proposed by Shen et al. [44]. Two algorithms (adaptive thresholding and a bypass algorithm for large CUs) are presented, based on texture homogeneity and the coding information of the adjacent coded CU blocks. The experimental results show about 47 % complexity reduction with a BD-rate loss of 1.08 % on average against HM-10.0.

Another CU size decision algorithm is introduced by Li et al. [45]. Fast CU splitting and pruning algorithms are used to reduce the intra coding complexity relative to the reference software HM-10.0. In this work, each CU is classified according to its size. Then, the statistics of the RD cost and SATD cost are used to determine whether the CU will be split or pruned further. This method reports around 38 % encoding time reduction on average with 1.05 % BD-rate loss. Other works consider the edge and texture characteristics inside a CU [47, 56, 58]. Min and Cheung [47] analyze the global and local edge complexities of each CU to decide its size. This algorithm shows about 52 % encoding time reduction with only 0.8 % BD-rate increment over HM-10.0. On the HM-RExt platform, Huang et al. [56] and Zhang et al. [58] used CU texture complexity to decide the CU size. The algorithm in [56] is reported to achieve 39 % encoding time reduction with 0.6 % BD-rate increment over HM-RExt-13.0. The algorithm in [58] is based on an early decision of the block partition and yields about 33 % complexity reduction compared with HM-RExt-12.0.

2.3 Combination of fast prediction mode decision and fast block types decision

Zhang’s algorithm [39] improves on the previous work [38] for fast intra coding of HEVC. The Hadamard cost and RD cost are employed to terminate the CU and TU partition decisions early. The progressive rough mode search in the RMD stage and the early termination of the RDO stage presented in [38] are also retained. It achieves an average encoding time saving of 32 % against HM-7.0 with only 1.1 % BD-rate loss. The algorithm in [40] was then proposed to improve on [39]. Macro and micro operations are organized to speed up intra coding in HM-10.0. These algorithms reduce the intra coding complexity by about 60 % with 1.0 % BD-rate loss.

A similar approach is introduced by Na et al. [41]. The algorithm, implemented on HM-10.0, employs a Sobel edge operation as pre-processing. The edge operation is used to compute the correlation between a local edge of the image block and an intra prediction mode in order to optimize the RMD and RDO stages. However, this algorithm incurs a high BD-rate loss of 2.5 % for an encoding time reduction of 56.8 %. Yang and Grecos [46] proposed a heuristic fast intra encoding algorithm on HM-10.0. It consists of a fast CU skip decision, CU early termination, a fast PU mode decision, and a fast TU size decision, and it considers both time reduction and BD-rate increment for intra coding. It reports approximately 65 % encoding time reduction on average with 1.3 % BD-rate loss. Hu and Yang [48] optimized the RMD and RDO stages and also proposed an entropy coding refinement for intra coding in HM-15.0. These algorithms show approximately 50 % encoding time reduction with only 0.2 % BD-rate increment. Khan et al. [51] consider a hardware implementation on FPGA to speed up the intra prediction mode decision in the RMD stage of HM-7.2. The mode decision is performed after an early evaluation of the PU block size decision. Around 50 % complexity reduction is achieved for a few test sequences. In addition, this work is reported to reduce the energy consumption by about 35 % relative to HM.

Palomino et al. [53] proposed a fast intra coding algorithm that tackles the RMD, RDO, and PU decisions. The previously chosen intra modes of the PU decision are employed to optimize the RMD and RDO processes. The algorithm achieves 59 % encoding time reduction for the intra case with 1.1 % BD-rate increment compared with HM-10.0. Coll et al. [54] use a machine learning approach to determine CU sizes and thereby reduce the complexity of the RDO process. Their algorithm reports only 28 % encoding time reduction over HM-10.0 with 0.6 % BD-rate loss. Other algorithms were introduced by Shi et al. [55] and Wang et al. [57]. Shi’s algorithm [55] reduces the encoding time by about 27 % with 0.8 % BD-rate loss over HM-12.1. This is achieved by considering edge information for a bottom-up partition in the PU decision and by optimizing the RMD stage on the largest PU size. Wang’s algorithm [57] was designed for RMD, RDO, and CU optimizations. Around 54 % complexity reduction is reported with 1.0 % BD-rate loss over HM-10.0. In this algorithm, predicting the CTU depth range is the key process for the CU size decision and for the RMD and RDO optimization.

Although many algorithms have been proposed to speed up intra coding, the computational loads of the existing algorithms and of HM remain high for real-time video applications. In this paper, a context-adaptive fast intra coding algorithm is proposed based on intra coding mode decision optimization and early termination techniques.

3 Proposed context-adaptive fast intra coding algorithm

This paper proposes a context-adaptive fast intra coding algorithm for HEVC to reduce the computational complexity. Figure 3 shows the flowchart of the proposed context-adaptive fast intra mode selection algorithm.

Fig. 3 Flowchart of the proposed fast intra decision algorithm: (a) proposed fast intra decision and (b) proposed early termination of TU split algorithm in RQT stage

Figure 3a shows the flowchart of the overall proposed algorithm. First, an adaptive rough mode selection is proposed to decrease the number of intra modes for 32 × 32 to 4 × 4 PUs in the RMD stage. The proposed algorithm organizes the modes for RMD from the prediction modes of the upper-layer CU and the neighboring PUs. Lm_RMD denotes the candidate prediction mode subset organized by the adaptive rough mode selection. Furthermore, a simple threshold α is used to terminate the RDO stage early for all PU sizes. The decision considers the RD cost of each prediction mode computed in the RMD stage, denoted as RDcost in Fig. 3a. The mean and the standard deviation of these costs, denoted by MC and SC, respectively, are calculated to divide the RD costs into groups. If the given condition is satisfied at this stage, the corresponding modes passed to the RDO stage are skipped.

In addition, an RQT skip and an early termination of the quad-tree TU split are proposed to increase the encoding time saving. The average RD cost of the reconstructed neighboring CUs, MN, is calculated to decide whether to skip the RQT stage. It is compared with the RD cost of the zero-depth TU computed in the RDO stage, referred to as RD*. Note that only the reconstructed CUs having the same depth as the current CU are evaluated, to limit the BD-rate loss. Moreover, an adaptive threshold Th is set against the TU split cost TC to terminate the split early in the RQT stage, as shown in Fig. 3b. This further accelerates the proposed context-adaptive fast intra mode selection algorithm.

To describe the proposed algorithm, the adaptive rough mode selection in the RMD stage is discussed first. Then, the proposed algorithm for the RDO stage is presented. Finally, the proposed algorithm for the RQT stage, comprising the RQT skip and the early termination of the TU split, is discussed.

3.1 Proposed adaptive rough mode selection

In the RMD stage, the proposed algorithm carefully reduces the prediction modes to limit the bit-rate increment. It organizes the intra prediction candidate list from the modes of the upper layer of a CU. First, we observe that the mode of the upper CU layer is likely to be the same as that of the current coded block.

Table 2 shows the mode dependency between the current CU block and its upper CU layers. The investigation was performed on the first 30 frames of three sequences (‘Traffic’, ‘Kimono’, and ‘Video4’) with three quantization parameters (QP) of 27, 32, and 37. Table 2 shows that smaller blocks are more likely than larger blocks to share the best prediction mode with their upper layer. In other words, the upper CU layer is likely to share its prediction mode with the lower CU layer to be encoded. In addition, we investigated the probability of the mode of the current PU in terms of its adjacent PU blocks. This investigation was performed on the first 30 frames of the ‘BQTerrace’ sequence at a QP of 32. Table 3 presents the co-occurrence of the mode of the current PU with those of the surrounding PU blocks, in terms of the PU depth. As shown in Table 3, each adjacent PU block has a different co-occurrence percentage with the current PU. A high prediction dependency is observed for the top and the left neighboring PU blocks at all PU depths.

Table 2 Probability of the current PU having the same mode with its upper PU layer
Table 3 Probability of the current PU having the same mode with neighboring PU layer

An adaptive rough mode selection algorithm is proposed for the RMD stage to reduce the complexity with a minimal bit-rate increment. The adaptive rough mode selection is performed for PU block sizes from 4 × 4 up to 32 × 32. The candidate mode list generated by the proposed algorithm is called Lm_RMD. It is built from the best mode (BM) and the second best mode (SBM) derived from the upper layer of the CU, as illustrated in Fig. 4a. Furthermore, the two modes on the left side and the two modes on the right side of both BM and SBM are also added to Lm_RMD. These additional modes are added in order, from the left and right modes of BM to the left and right modes of SBM. Lm_RMD must not contain duplicate modes: if a mode appears more than once, only its first occurrence is retained. When no modes overlap, Lm_RMD therefore contains 10 candidate prediction modes for the RMD stage of the current CU. However, if Lm_RMD contains fewer modes than should be processed in the RDO stage, the prediction modes of the neighboring PUs, shown in Fig. 4b, are added to the list. In addition, the modes from the RMD stage of the upper layer of the CU are also added to Lm_RMD. The proposed adaptive rough mode selection for the RMD stage can be illustrated with the following example:

Fig. 4 The proposed rough mode selection for RMD stage: (a) the upper layer of CU block and (b) the neighboring PUs block

  1. First, BM and SBM from the upper CU are added into Lm_RMD. For example: BM → 10 and SBM → 26.

  2. Then, add two modes from left and right of BM and SBM: two modes from BM’s left → 8, 9; two modes from BM’s right → 11, 12; two modes from SBM’s left → 24, 25; and two modes from SBM’s right → 27, 28.

  3. If the candidate mode set for RMD is smaller than the candidate modes for RDO of current PU, then add modes derived from the neighboring PU blocks (top left, top, top right, left, and bottom left), and also add modes from the RMD stage of the upper layer of CU.
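The three steps above can be sketched as follows. This is an illustrative reading rather than the authors' code: the function and parameter names are hypothetical, and two behaviours that the text does not specify are assumptions here, namely that side modes falling outside the angular range 2 to 34 are dropped and that a planar or DC mode (0 or 1) contributes no side modes.

```python
# Sketch of the Lm_RMD construction (names and boundary handling are
# assumptions; the source does not say how range edges are treated).

ANGULAR_MIN, ANGULAR_MAX = 2, 34


def side_modes(mode, count=2):
    """Return the `count` modes left of `mode`, then the `count` modes right."""
    if not ANGULAR_MIN <= mode <= ANGULAR_MAX:
        return []  # assumption: planar/DC have no angular neighbours
    left = [mode - d for d in range(count, 0, -1)]
    right = [mode + d for d in range(1, count + 1)]
    # assumption: modes outside the angular range are simply dropped
    return [m for m in left + right if ANGULAR_MIN <= m <= ANGULAR_MAX]


def build_lm_rmd(bm, sbm, num_rdo_modes, neighbor_modes=(), upper_rmd_modes=()):
    """Build the RMD candidate list from the upper-layer CU and neighbours."""
    candidates = []

    def add(mode):
        # Keep only the first occurrence of each mode.
        if mode not in candidates:
            candidates.append(mode)

    # Step 1: best and second best mode of the upper CU layer.
    add(bm)
    add(sbm)
    # Step 2: two modes on each side of BM, then of SBM.
    for mode in side_modes(bm) + side_modes(sbm):
        add(mode)
    # Step 3: if the list is shorter than the RDO candidate count, add modes
    # of the neighbouring PUs and the RMD modes of the upper CU layer.
    if len(candidates) < num_rdo_modes:
        for mode in list(neighbor_modes) + list(upper_rmd_modes):
            add(mode)
    return candidates
```

With BM = 10 and SBM = 26, the sketch reproduces the ten modes listed in the example; when BM and SBM are close together, the duplicate check leaves fewer than ten candidates.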

3.2 Proposed early RDO skip algorithm

An early RDO skip algorithm is proposed that employs an adaptive threshold to terminate the RDO stage early. The basic concept is to divide the modes, according to their RD costs estimated in the RMD stage, into two groups. One group is processed in the RDO stage, while the other is considered for skipping to save more encoding time. First, the mean RD cost (referred to as MC) is set as the threshold that divides the two groups. Prediction modes with an RD cost larger than MC are skipped in the subsequent RDO stage. However, using MC as the only boundary is not efficient in all cases. Note that only three modes are investigated in the RDO stage for 64 × 64, 32 × 32, and 16 × 16 PUs, so pruning with MC alone can cause some BD-rate loss. Therefore, we use an additional threshold to extend the boundary: SC, the normalized standard deviation of the RD-cost values. With this additional decision, more prediction modes can be fed into the RDO stage.

In general, early skip algorithms may skip some optimal modes as a side effect. The proposed algorithm is designed to improve the correct decision rate by using the additional threshold. In our observations, a block with a large MC is likely to have a larger variation among the prediction modes for the RDO stage, and vice versa. Therefore, the additional threshold is employed to reduce the chance of skipping the best mode. In the proposed algorithm, SC is calculated by normalizing the variance as follows:

$$V_{\text{C}} = \frac{1}{n \times M_{\text{C}}}\sum\limits_{i = 1}^{n} (R_{\text{C}i} - M_{\text{C}})^{2},$$
(2)
$$S_{\text{C}} = \sqrt{V_{\text{C}}},$$
(3)

where \(R_{\text{C}i}\) refers to the RD-cost value of the \(i\)-th prediction mode and \(n\) represents the total number of prediction modes for the RDO stage. In our algorithm, the condition set by the mean and the normalized standard deviation is designed as:

$$({\text{RDcost}} > M_{\text{C}} )\,\& \&\, (S_{\text{C}} > \alpha )$$
(4)

where RDcost is computed in the RMD stage from the SATD and the bits of the intra mode, and α is determined by evaluating the average BD-rate and encoding time of several video sequences with different resolutions. It should be noted that the RDcost values are already sorted in ascending order in the RMD stage. The normalized standard deviation allows more modes to be evaluated in the RDO stage, which balances the BD-rate increment against the encoding time saving.

The threshold value \(\alpha\) used in the proposed algorithm is determined by evaluating several video sequences: ‘PeopleOnStreet’ and ‘NebutaFestival’ (4K), ‘Cactus’ and ‘BQTerrace’ (1080p), ‘PartyScene’ and ‘BasketballDrill’ (WVGA), ‘RaceHorse’ and ‘BlowingBubbles’ (WQVGA), and ‘Video1’ and ‘Video4’ (720p). These video sequences are evaluated over all QP values under the common test condition (CTC) defined for HEVC [59]. Each sequence is tested with the threshold values shown in Fig. 5. Then, the average BD-rate and encoding time over the sequences are evaluated. Figure 5a and b present the BD-rate loss and the encoding time in terms of the threshold \(\alpha\), respectively. The figures show that a higher threshold value provides better coding efficiency but a lower encoding time saving, and vice versa. In the proposed work, the threshold is set to 0.15 to balance encoding time reduction against BD-rate loss.
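A minimal sketch of Eqs. (2) to (4) is given below, assuming that a mode is skipped exactly when condition (4) holds and that the mean cost is positive. The function names and the (mode, cost) input layout are hypothetical and not taken from HM.

```python
import math

# Sketch of the early RDO skip, Eqs. (2)-(4). Assumptions: a mode is skipped
# exactly when condition (4) is true, and the mean RD cost is positive.


def mean_and_normalized_std(costs):
    """Return (M_C, S_C) for a list of RMD costs, following Eqs. (2)-(3)."""
    n = len(costs)
    m_c = sum(costs) / n
    v_c = sum((c - m_c) ** 2 for c in costs) / (n * m_c)
    return m_c, math.sqrt(v_c)


def select_rdo_modes(rmd_costs, alpha=0.15):
    """Keep the modes for which (RDcost > M_C) and (S_C > alpha) is false.

    rmd_costs is a list of (mode, cost) pairs sorted by ascending cost.
    """
    m_c, s_c = mean_and_normalized_std([cost for _, cost in rmd_costs])
    return [mode for mode, cost in rmd_costs
            if not (cost > m_c and s_c > alpha)]
```

When the candidate costs are nearly equal, S_C stays below α and every mode is passed to the RDO stage; when one cost is far above the mean, that mode is pruned.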

Fig. 5 (a) BD rate in terms of the threshold and (b) encoding time saving in terms of the threshold

3.3 Proposed RQT skip and early termination of TU split

In this paper, an RQT skip and an early termination are proposed to accelerate the RQT stage. In HM, the quad-tree process is performed with a depth-first search over TU partitions from 32 × 32 down to 4 × 4. The maximum search depth of this process is set to 3 for both intra and inter coding. To determine the best TU at the current depth (e.g., a 16 × 16 TU), its cost is compared with the sum of the costs of its four sub-TU partitions (e.g., 8 × 8 TUs). However, the RQT process is handled quite differently for intra and inter coding. For the intra case, the RQT process is carried out for each PU. When it is performed to determine the best prediction mode in the RDO stage, the RQT search depth is set to 1 to save encoding time; this is referred to as the zero-depth TU in this paper. However, the search depth is set to 3 when the process is performed to decide the best TU split partition in the RQT stage. In other words, the search up to the maximum RQT depth is conducted recursively, with heavy computation, to find the optimum TU for coding efficiency. It should be noted that the RD cost of the zero-depth TU is always compared with the RD cost of the TU evaluated in the RQT stage, and the TU with the smaller RD cost is selected as the optimum TU for the intra case.

In this work, the selection percentages of the zero-depth TU (denoted by RD*) and the full quad-tree TU partition from the RQT stage (referred to as Rfull) were evaluated with HM-14.0. Several sequences from Class A, Class B, and Class C with QPs of 22, 27, 32, and 37 were used for the investigation, as shown in Table 4. According to Table 4, the zero-depth TU partition (RD*) is selected as the best TU decision more often than the full quad-tree TU partition. If we can determine in advance that the best TU is the zero-depth one, the RQT stage can be skipped to achieve a further time reduction for intra coding.

Table 4 Selection ratios of TU decision between zero-depth TU and full quad-tree TU partitions

However, it is very difficult to skip the RQT stage without increasing the bit rate. Therefore, the proposed RQT skip algorithm, applied after the RDO stage, is designed conservatively. In the proposed algorithm, the mean RD cost of the reconstructed CU blocks, denoted as MN, is adaptively used as a threshold. Each RD* is then compared with MN to decide whether the RQT stage is skipped. Note that only the RD costs of the reconstructed CUs having the same depth as the current CU are used to compute MN. Applying this condition to skip the RQT stage avoids a bit-rate increment; however, the encoding time saving can be low. Therefore, to achieve a greater encoding time reduction, we also exploit the TU split in the RQT process by proposing an early termination of the TU split.

In this algorithm, a simple adaptive threshold is computed as the average of the neighboring RD costs and the RD cost of the current-depth TU, and the threshold is defined as:

$${\text{Th}} = \frac{1}{n + 1}\left( {\sum\limits_{k = 1}^{n} {{\text{RC}}_{k}} + Cd} \right)$$
(5)

where n represents the number of available neighboring CUs, Cd is the RD cost of the current-depth TU, and RCk refers to the RD-cost value of the k-th neighboring CU. In HM, each TU is split into four partitions; the RD cost of each sub-TU partition (TC) is accumulated, and the accumulated cost is compared with Cd to determine the optimum TU partition in the RQT stage. In the proposed algorithm, the cost of a single sub-TU (TC) multiplied by four is used as an approximation of the accumulated cost to reduce the computational complexity. If TC multiplied by four is larger than the threshold (Th), the remaining sub-TU partitions are skipped, as shown in Fig. 3b.
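The threshold of Eq. (5) and the early termination test can be sketched as below. The function names are hypothetical, and the sketch assumes that the test is applied to the first sub-TU evaluated; it does not cover the RQT skip decision based on MN.

```python
# Sketch of Eq. (5) and the early termination of the TU split. Names are
# illustrative; the cost passed in is assumed to be that of the first sub-TU.


def split_threshold(neighbor_costs, current_cost):
    """Th = (sum of RC_k over n neighbouring CUs + Cd) / (n + 1)."""
    n = len(neighbor_costs)
    return (sum(neighbor_costs) + current_cost) / (n + 1)


def terminate_tu_split(sub_tu_cost, neighbor_costs, current_cost):
    """Return True when 4 * TC exceeds Th, so the remaining sub-TUs are skipped."""
    return 4 * sub_tu_cost > split_threshold(neighbor_costs, current_cost)
```

When no neighboring CU is available (n = 0), the threshold in this sketch reduces to Cd, so the test compares four times the sub-TU cost directly with the current-depth cost.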

4 Experimental results

In this section, several performance evaluations of the proposed context-adaptive fast intra coding algorithm for HEVC are presented. These evaluations cover coding efficiency, computational profiling, and power consumption. BD-rate performance [60] and encoding time reduction were evaluated on the HM-14.0 Main profile to measure the coding efficiency. Profiling is useful for determining which components are the most time consuming and is therefore widely used in practice [33]. In addition, the power consumption of the proposed algorithm was compared with that of the reference software. We employed the “Joulemeter” software [63] developed by Microsoft Research to assess the power consumption of the proposed algorithm and existing algorithms. The software measures the energy consumption of hardware resources, software applications, or individual programs every second. It monitors resource usage, such as CPU, screen, memory, and storage power, in watts [64, 65].

For the BD-rate and complexity reduction evaluations, various video sequences were employed from Class A (4K), Class B (1080p), Class C (WVGA), Class D (WQVGA), and Class E (720p). The evaluations were conducted under the ‘all intra configuration’ defined in the HEVC common test conditions. Four QP values, 22, 27, 32, and 37, were used to demonstrate the performance of the proposed algorithm. The proposed algorithm was tested on an Intel(R) Core(TM) i7-4770K CPU @ 3.50 GHz with 32 GB of memory, running Windows 7 Professional 64-bit. The overall performance of the integrated proposed algorithm is also compared with the HM software and the conventional algorithms. The encoding time reduction (TR) measures the complexity of the proposed algorithm relative to the anchor reference software and is defined by:

$${\text{TR}} = \frac{{T_{{{\text{HM - }}14.0}} - T_{\text{Proposed}} }}{{T_{{{\text{HM - }}14.0}} }} \times 100\,(\% )$$
(6)

where THM-14.0 and TProposed represent the encoding time of HM-14.0 and the proposed algorithm, respectively.
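Equation (6) can be evaluated with a few lines of Python. The sketch below uses the hypothetical function name `time_reduction`; the encoding times in the usage note are illustrative values, not measurements from this paper.

```python
def time_reduction(t_anchor: float, t_proposed: float) -> float:
    """Eq. (6): encoding time reduction in percent relative to the anchor."""
    if t_anchor <= 0:
        raise ValueError("anchor encoding time must be positive")
    return 100.0 * (t_anchor - t_proposed) / t_anchor
```

For instance, an anchor time of 100 s and a proposed time of 60 s give TR = 40 %. Note that a TR of 40 % corresponds to a speed-up factor of 1 / (1 − 0.4) ≈ 1.67, and a TR of 50 % would correspond to a factor of two.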

As shown in Table 5, the three modules of the proposed algorithm (RMD, RDO, and RQT) achieve different BD-rate increments and time savings. The proposed RMD algorithm, by decreasing the number of prediction modes, achieves a time reduction of 20 % on average with a BD-rate loss of 0.2 %. The proposed RDO algorithm, based on the early termination of the RDO stage, decreases the encoding time by about 34 % but increases the BD-rate by 1.5 %. Moreover, the proposed RQT algorithm reduces the encoding time by approximately 10 % without any loss in BD-rate performance. All of these results are measured against the HM-14.0 reference software. Based on the experimental results reported in Table 5, the proposed RDO algorithm provides both the highest BD-rate increment and the highest encoding time reduction. The proposed RMD and RQT algorithms yield similar performance in BD-rate and encoding time over the reference software.

Table 5 BD-rate performance and encoding time reduction of individual proposed algorithm in rough mode decision (RMD), rate distortion optimization (RDO), and residual quad-tree (RQT)

Additionally, the encoding performance of the proposed context-adaptive intra coding algorithm of HEVC is listed in Table 6. The proposed algorithm saves about 40 % of the encoding time with a BD-rate increment of about 1.3 % on average, compared with the default HM-14.0. The lowest complexity reduction, about 36 %, is observed for the ‘RaceHorse’ test sequence of Class D, with a slight degradation in BD-rate performance. In contrast, a favorable encoding time reduction of about 42 % is obtained for the ‘Cactus’ test sequence in Class B, with a BD-rate increment of 1.1 %.

Table 6 BD-rate and time saving of the proposed algorithm

For further evaluation, the proposed algorithm is compared with several conventional fast intra coding algorithms for HEVC. In this paper, three conventional algorithms [34, 38, 39], described in Sect. 2, were implemented on HM-14.0. Table 7 shows the overall comparison of RD performance (∆R) and encoding time (∆T) between the three conventional algorithms and the proposed algorithm. Although they were implemented on the same software model, their performance was observed to be similar to that reported in the original literature. As shown in Table 7, Zhao’s algorithm [34] yields the smallest BD-rate increment, 0.4 %, but only a small encoding time reduction of 24 %. Hao’s algorithm [38] shows the largest encoding time reduction, 43 %, but also the highest BD-rate loss, 2.7 %. Zhan’s algorithm [39] gives a BD-rate increment of 1.3 %, similar to that of the proposed algorithm. However, the proposed algorithm improves the encoding time reduction by about 10 % compared with Zhan’s algorithm. Based on these experimental results, the proposed context-adaptive fast intra coding algorithm of HEVC provides a better trade-off between coding efficiency and encoding complexity.

Table 7 Overall comparisons of RD performance (∆R) and encoding time (∆T) between the conventional algorithms and the proposed algorithm implemented in HM-14.0

For the profiling analysis, Table 8 shows the most active functions in terms of encoding time for HEVC intra coding under the ‘all-intra configuration’. The first 30 frames of one Class A test sequence, ‘NebutaFestival’, were encoded with a QP of 22 to obtain the profiling results. According to Table 8, the proposed algorithm reduces the CPU time of the same functions by about a factor of two. For the “Reconstruction of Intra Coding QT” and “Intra Coding Luma Block” functions, the proposed algorithm achieves the highest speed gain in CPU time (∆Time) against the anchor, reducing the CPU usage by about 52 % for both functions. Two other functions, “Estimation Intra Prediction QT” and “Rate Distortion Optimize Quantization,” also show a significant reduction of about 45 % in CPU usage over the anchor. Figure 6 shows a histogram of the total elapsed CPU time. The figure shows that the proposed algorithm reduces the encoder complexity by almost a factor of two compared with the reference software.

Table 8 Profiling analysis between the anchor software HM-14.0 and the proposed algorithm
Fig. 6 Histogram of total CPU time usage

For the evaluation of power consumption, several test sequences in Class C, Class D, and Class E were encoded with QP = 22 and measured using “Joulemeter”, as shown in Table 9. A notebook computer with a charged battery was used in this experiment. As presented in the table, the proposed algorithm reduces power usage by up to 42 % for two sequences, ‘Flowervase’ in Class D and ‘Video3’ in Class E. In addition, the proposed algorithm saves approximately 38 % on average over the anchor. The per-second power fluctuation of both the anchor and the proposed system was also evaluated. Based on this evaluation, the proposed algorithm saves power with less fluctuation over time than the HM-14.0 reference software. Figure 7 depicts the power consumption for the first 30 frames of the ‘NebutaFestival’ test sequence in Class A with QP = 22. The graph shows that the proposed algorithm decreases CPU power utilization by approximately 40 % over HM-14.0. The power savings of the proposed algorithm mainly follow from the encoding time reduction: the figure clearly shows that the proposed algorithm terminates earlier than the anchor. Consequently, it reduces the CPU power consumption by almost a factor of two. In addition, the power fluctuation of the proposed algorithm over time is more stable than that of the anchor during the encoding process.
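The link between earlier termination and energy saving can be illustrated with a short sketch that integrates per-second power samples, such as those logged by a power meter. The function names and the sample traces in the usage note are hypothetical and are not measurements from Table 9 or Fig. 7; a fixed sampling interval of one second is assumed.

```python
from statistics import pstdev
from typing import Sequence


def energy_joules(power_watts: Sequence[float], interval_s: float = 1.0) -> float:
    """Approximate energy as the sum of power samples times the sampling interval."""
    return sum(power_watts) * interval_s


def energy_saving_percent(anchor_watts: Sequence[float],
                          proposed_watts: Sequence[float],
                          interval_s: float = 1.0) -> float:
    """Relative energy saving of the proposed run over the anchor run, in percent."""
    e_anchor = energy_joules(anchor_watts, interval_s)
    e_proposed = energy_joules(proposed_watts, interval_s)
    if e_anchor <= 0:
        raise ValueError("anchor energy must be positive")
    return 100.0 * (e_anchor - e_proposed) / e_anchor


def power_fluctuation(power_watts: Sequence[float]) -> float:
    """Population standard deviation of the power samples, as a fluctuation measure."""
    return pstdev(power_watts)
```

As a usage example with made-up traces, an anchor run drawing a constant 10 W for 10 s consumes 100 J, while a run at the same power that finishes after 6 s consumes 60 J, which is a 40 % saving obtained purely from the shorter encoding time.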

Table 9 Power consumption comparison between the anchor software HM-14.0 and the proposed algorithm
Fig. 7 CPU power consumption for the first 30 frames of ‘NebutaFestival’ test sequence

The proposed algorithm can be implemented on many platforms, such as general-purpose processors (GPP), special-purpose processors, and hardwired circuits. However, when targeting real-time encoding of high-resolution videos, platform-dependent implementation issues should also be taken into account. A hardwired implementation based on the proposed algorithm can be studied further to deal with the complexity of intra coding.

5 Conclusion

In this paper, a context-adaptive fast intra coding algorithm of HEVC is proposed to reduce the computational complexity of intra coding. An adaptive rough mode selection is proposed to reduce the number of prediction modes in the RMD stage by utilizing the upper-layer CU and the neighboring PU blocks. The RDO skip is performed by estimating the mean and the normalized standard deviation. The RQT skip and the early termination of the TU split process are conducted using a simple threshold derived from the average RD cost of the available neighboring coded CU blocks. The experimental results show that, compared to HM-14.0, the proposed algorithm reduces the encoding time by about 40 % on average while the BD-rate increases by about 1.3 %. In addition, the proposed algorithm reduces the CPU power usage by up to approximately 42 % over the reference software. For future work, CU-level optimization or a hardware implementation for intra coding might be considered to achieve further intra coding acceleration in HEVC.