1 Introduction

The High Efficiency Video Coding (HEVC) standard is the successor of the well-known H.264/MPEG-4 AVC standard (MPEG-4 Part 10). It is designed to target a wide range of applications, especially those dealing with Ultra High Definition (UHD) content. The HEVC project was formally initiated when a joint Call for Proposals was issued by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) in January 2010 [30]. The prime focus was directed towards significantly improving the compression performance relative to existing standards.

Completed in January 2013, the HEVC standard provides roughly twice the compression capability of its predecessor. With appropriate encoder settings, a bit-rate reduction of around 50% is achievable with negligible loss in video quality [21]. However, this coding efficiency comes at the cost of increased encoding computational complexity, which can be up to 40% higher than that of H.264/AVC [34].

Among other factors, both the enhanced compression efficiency and the increased encoding computational complexity can be attributed to HEVC’s flexible partitioning structures. HEVC uses quad-tree Coding Tree Units (CTUs), Prediction Units (PUs) and Residual Quad-Trees (RQTs) rather than macroblocks (MBs). To select the optimal partitioning structure, an exhaustive rate-distortion optimization (RDO) process takes place, which is the main source of the increased computational complexity. Most of the encoding time is spent recursively repeating the RDO process at each Coding Unit (CU) depth level for each structure, where every combination of encoding structures is tested and the one that minimizes the rate-distortion (RD) cost is chosen [31].
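To make the source of this complexity concrete, the following is a minimal, illustrative Python sketch of the exhaustive quad-tree search; rd_cost and quad_split are hypothetical stand-ins for the encoder's actual RD evaluation and partitioning routines and are not part of the HM software.

```python
# Illustrative sketch of the exhaustive quad-tree RDO search (not the HM implementation).
# A CU is represented here as a dict with a per-depth "cost" list and optional "children".

MAX_DEPTH = 3  # depths 0..3 correspond to 64x64, 32x32, 16x16 and 8x8 CUs

def rd_cost(cu, depth):
    # Hypothetical stand-in: the real encoder evaluates distortion + lambda * rate here.
    return cu["cost"][depth]

def quad_split(cu):
    # Hypothetical stand-in: the real encoder partitions the CU into four sub-CUs.
    return cu.get("children", [])

def best_rd_cost(cu, depth=0):
    """Minimum RD cost over all split / no-split combinations rooted at this CU."""
    no_split = rd_cost(cu, depth)
    children = quad_split(cu)
    if depth == MAX_DEPTH or not children:
        return no_split
    split = sum(best_rd_cost(sub, depth + 1) for sub in children)
    return min(no_split, split)
```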

Several early termination algorithms for optimizing the HEVC encoding process can be found in the literature; their aim is to reduce the computational complexity while minimizing any performance degradation. Among many, some approaches utilize the textural or structural characteristics of a given CU [1, 7, 9, 12, 22, 24, 28, 35, 38, 40, 41], while others use machine learning techniques [4, 6, 11, 14,15,16,17, 29, 36, 39]. The optimizations are not limited to HEVC inter-coding, as some works also consider enhancing intra-coding [16, 22, 24, 28]. In this field of research, the use of machine learning techniques to minimize RD efficiency losses remains limited, and most proposed algorithms do not achieve substantial reductions in computational complexity without introducing significant losses in video quality.

In this work, we use a sequence-dependent approach to model the relationship between CU feature variables and split decisions. The feature variables are extracted from both the present CU and its surrounding spatial and temporal CUs. Feature extraction and modeling are performed at the three CU coding depths of 64 × 64, 32 × 32 and 16 × 16, and dimensionality reduction techniques are applied at each depth to reduce the number of extracted features. We use stepwise regression, random forest-based feature selection and principal component analysis (PCA) for dimensionality reduction. Moreover, we utilize polynomial networks and random forests for classification.

This paper is organized as follows. Section 2 presents a review of algorithms proposed in the literature to reduce the encoding computational complexity. The proposed CU split prediction system is overviewed in Section 3. The feature extraction process and dimensionality reduction are discussed in detail in Section 4, in addition to the classification tools and arrangements used in this work. The experimental setup and results are presented in Section 5. Lastly, Section 6 concludes the paper.

2 Related work

As mentioned earlier, HEVC introduced significant coding efficiency improvements at the cost of increased computational complexity. Therefore, considerable research has been conducted to limit this computational complexity whilst minimizing the adverse effect on compression efficiency. The work reported in [1, 7, 9, 12, 22, 24, 28, 35, 38, 40, 41] investigated the textural or structural characteristics of CUs at a given CU depth to optimize the HEVC encoding procedure. The authors of [41] proposed an inter-prediction optimization scheme in which the CTU structure is analyzed in reverse order. Alternatively, a subjective-driven complexity control approach is presented in [7], which examines the relationship between visual distortion and the maximum depth of all largest CUs. Another complexity control algorithm is proposed in [12], where an early termination condition is defined at each CU depth based on the content of the video sequence being encoded, the configuration files and the target complexity.

In [40], the authors present a hierarchical structure-based fast mode decision scheme. A fast CU decision algorithm is presented in [38], where the coded block flag and RD costs are checked to determine whether intra- and inter-PUs can be skipped. In [35], a two-layered motion estimation based fast CU decision process is proposed, which uses the sum of absolute differences (SAD) estimation to extract the SAD costs for a CU and its sub-CUs. The work in [22] speeds up the HEVC intra-coding process mainly by using the encoded CU depths and RD costs of the co-located CTU to predict both the current CU’s depth search range and the RD cost for CU splitting termination. Local texture descriptors or image characteristics were used in [9, 24, 28] to allow faster CU size selection. A spatiotemporal CU encoding technique is explored in [1], where sample-adaptive-offset (SAO) parameters were utilized to predict the textural complexity of the CU being encoded. The work in [5] introduces an interesting approach for predicting CU splitting based on deep learning using a reinforcement learning algorithm. The algorithm is also capable of predicting the reduction in rate-distortion cost. The solution is applied to the all-intra configuration and results in a BD-rate loss of 2.5%. In [8], the authors proposed an offline training algorithm based on random forests to predict early termination of CU splitting. Neighboring CU sizes are also used in determining the depth of the current CU. The algorithm is applied to the all-intra mode and reported a complexity reduction of 48.3% with a BD-rate increase of 0.8%.

Other approaches utilized the Bayesian decision rule and other machine learning techniques to improve the time complexity of an HEVC encoder. For instance, the work in [15, 16] uses Bayes’ rule to optimize PU and CU skip algorithms, respectively. In [14], the authors present a joint online and offline learning-based fast CU partitioning method that uses the Bayesian decision rule to optimize the CU partitioning process. Bayesian decision theory is also utilized in [29], along with the correlation between the variances of the residual coefficients and the transform size, to enhance the PU size decision process. Alternatively, a fast CU splitting and pruning algorithm is proposed in [4], which is applied at each CU depth according to a Bayes decision rule based on low-complexity RD costs and full RD costs. A fast CU size and PU mode prediction algorithm that uses the k-means clustering method is introduced in [17].

On the other hand, [11] presents an early mode decision algorithm based on the Neyman-Pearson approach. In [36], a fast pyramid motion divergence (PMD)-based CU selection algorithm is proposed, where a k-nearest neighbors (k-NN) like method is used to determine the optimal CU size. The work in [39] used a machine learning-based fast coding unit (CU) depth decision method, where the quad-tree CU depth levels are modeled as a three-level hierarchical binary decision problem. The work proposed in [6] implemented early termination techniques on CUs, PUs, and TUs using a set of decision trees grown with the aid of the Waikato Environment for Knowledge Analysis (WEKA) [10], an open-source data mining tool.

3 System overview

In the proposed prediction system, the first 10% of frames of a video sequence are used for training. Hence, modeling and prediction are specific to one video, as opposed to training the classification system using many video sequences. The former training approach is known as “video-dependent” modeling, while the latter is known as “video-independent” modeling. The problem with video-independent modeling is that it follows a one-size-fits-all approach in which there is an implicit assumption that the videos used for training are suitable for predicting the CU split decisions of all other videos. Video-dependent modeling, on the other hand, ensures that the prediction model is best suited to predicting the CU split decisions of the remaining content of the video being encoded.

The concept of video-dependent modeling was previously introduced by the author in [23, 25, 27]. The first 10% of video frames were used for training and the prediction model was then used throughout the sequence in a video transcoding context. If needed, the training can be repeated periodically or when scene cuts are detected. Reducing the percentage of training frames might result in a less accurate classification model, as the number of feature vectors in the training set is reduced. Increasing the percentage, on the other hand, might result in a more accurate classification model. However, the point at which the model starts predicting CU split decisions would then be delayed, which decreases the overall gain in terms of computational complexity.
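As a simple illustration of this trade-off, the sketch below (an assumption-laden Python fragment, not the encoder's actual control code) partitions the frame indices of a sequence into training and prediction portions given a configurable training fraction.

```python
# Minimal sketch of the video-dependent split: the first fraction of frames of the
# sequence being encoded serves as the training portion; the rest is predicted.

def split_frames(num_frames, train_fraction=0.10):
    n_train = max(1, int(round(train_fraction * num_frames)))
    return list(range(n_train)), list(range(n_train, num_frames))

# Example: a 500-frame sequence gives 50 training frames and 450 prediction frames.
train_idx, predict_idx = split_frames(500)
```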

Figures 1 and 2 present the flowcharts of the proposed training system, where FV refers to Feature Vectors. Figure 1 illustrates the data collection process of the training system. The video encoder runs its normal compression operations for the first 10% of the video frames, during which, for each CU, features are extracted and recorded at the highest level, which is typically 64 × 64. The corresponding split decision is also recorded. If the encoder decides to split the CU, then the split decisions at the 32 × 32 and 16 × 16 levels are recursively computed, and the training system records the features and corresponding split decisions at the 32 × 32 and 16 × 16 CU levels as well. The details of the selected feature variables are discussed in the next section.
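The following Python fragment sketches this data collection pass under the assumption that the encoder exposes, for every CU, a feature extractor and the split flag chosen by its normal RDO; extract_features and split_decision are hypothetical callbacks, and cu.children denotes the four sub-CUs of a split.

```python
# Sketch of recording (feature vector, split flag) pairs during normal encoding of the
# first 10% of frames. Only the 64x64, 32x32 and 16x16 levels are modeled.

def collect(cu, extract_features, split_decision, train_sets, size=64):
    features = extract_features(cu)        # e.g. the 70 variables of Table 1
    split = split_decision(cu)             # flag produced by the unmodified RDO
    train_sets[size].append((features, split))
    if split and size > 16:                # recurse only where the encoder actually split
        for sub in cu.children:
            collect(sub, extract_features, split_decision, train_sets, size // 2)

# train_sets would be initialized as {64: [], 32: [], 16: []} before the pass.
```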

Fig. 1 Flowchart of data collection during the training phase

Fig. 2 Flowchart of CU split modeling in the training phase

The output of this data collection process is three data sets. Each data set contains feature vectors and the corresponding split flags for the 64 × 64, 32 × 32 and 16 × 16 CU levels. The second step in the training system is to map the feature vectors to the split decisions. This is illustrated in Fig. 2. The result of this step is three training models that can be used for the prediction of CU split decisions at the 64 × 64, 32 × 32 and 16 × 16 CU levels. Prior to model generation, there is an optional dimensionality reduction step. Again, this is applied at the three CU levels, and the fitted dimensionality reduction models are stored and used for reducing the dimensionality of the feature vectors during the testing phase, as shall be explained next. The system modeling and dimensionality reduction techniques used in this work are explained in the next section.
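A compact sketch of this per-level model generation step is given below, assuming the train_sets structure from the data collection sketch above and scikit-learn-style estimators; reducer_factory and classifier_factory are placeholders for the concrete techniques of Section 4 (pass None for reducer_factory to skip dimensionality reduction).

```python
from sklearn.pipeline import make_pipeline

# One model per CU level (64, 32, 16), with an optional dimensionality reduction step
# fitted only on the training data, as in Fig. 2.

def train_models(train_sets, reducer_factory, classifier_factory):
    models = {}
    for size, samples in train_sets.items():
        X = [f for f, _ in samples]        # feature vectors
        y = [s for _, s in samples]        # split flags
        steps = ([reducer_factory()] if reducer_factory else []) + [classifier_factory()]
        models[size] = make_pipeline(*steps).fit(X, y)
    return models
```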

Once the system is trained, the generated models are used to predict the split decisions of the remaining CUs of the underlying video sequence. This process is illustrated in Fig. 3. Basically, feature variables are extracted at the highest CU level, which in this work is 64 × 64. The corresponding training model is then used to predict the split flag. If the prediction is ‘no split,’ then early termination is applied. Otherwise, the second training model is applied to each of the 32 × 32 CU levels and four split flags are predicted. If any of these flags is predicted as ‘split,’ then the process is repeated at the 16 × 16 CU level using the third training model. At each level, feature vectors are calculated and, if required, reduced in dimensionality using the dimensionality reduction models computed during the training phase.
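The fragment below sketches this hierarchical application of the three models, assuming the models dictionary produced by the training sketch above (each entry already embeds any dimensionality reduction) and a hypothetical extract_features helper; it illustrates the control flow of Fig. 3, not the actual HM integration.

```python
# Apply the three trained models top-down with early termination, as in Fig. 3.

def predict_splits(ctu, models, extract_features):
    if not models[64].predict([extract_features(ctu)])[0]:
        return {"split64": False}                 # early termination: code the CTU whole
    result = {"split64": True, "split32": [], "split16": []}
    for cu32 in ctu.children:                     # the four 32x32 CUs
        s32 = bool(models[32].predict([extract_features(cu32)])[0])
        result["split32"].append(s32)
        if s32:
            flags16 = [bool(models[16].predict([extract_features(cu16)])[0])
                       for cu16 in cu32.children]  # the four 16x16 CUs
            result["split16"].append(flags16)
    return result
```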

Fig. 3 Flowchart of applying the training models to predict the CU split flags

4 System training

This section introduces the proposed feature extraction and dimensionality reduction process. It also reviews the machine learning techniques used.

4.1 Feature extraction and dimensionality reduction

In this work, feature extraction is applied at each of the three coding levels (i.e. 64 × 64, 32 × 32 and 16 × 16). Common to all levels are features extracted from surrounding CTUs. The surrounding CTUs are previously encoded and comprise the CTUs at the following locations relative to the current CU: left, top-left, top, top-right and the co-located CTU in the previous frame, giving a total of five surrounding CTUs. The complete list of extracted features and their descriptions are given in Table 1, where MVs refers to Motion Vectors. The first 15 features in Table 1 belong to the current CU, whereas the remaining 55 features belong to the surrounding CTUs, giving a total of 70 features.
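As a rough illustration, one 70-dimensional feature vector can be assembled as below; cu_features and ctu_features are hypothetical helpers standing in for the per-CU measurements listed in Table 1 (15 for the current CU and, assuming an even split, 11 per surrounding CTU).

```python
import numpy as np

# Assemble a single feature vector: current CU features followed by those of the five
# previously coded neighbours (left, top-left, top, top-right, temporally co-located).

def build_feature_vector(cu, neighbours, cu_features, ctu_features):
    """cu_features(cu) -> 15 values; ctu_features(n) -> values per neighbour (55 in total)."""
    own = list(cu_features(cu))
    context = [v for n in neighbours for v in ctu_features(n)]
    return np.asarray(own + context)   # 15 + 55 = 70 features
```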

Table 1 Feature variables representing CUs

As illustrated in Fig. 2 above, the dimensionality of these features can be reduced prior to generating the training model. In this work, we generate experimental results with and without dimensionality reduction. We propose the use of the following dimensionality reduction techniques: stepwise regression, principal component analysis (PCA) and reduction based on random forests. In the following, we briefly summarize the use of each relative to the proposed solution. It is important to mention that all dimensionality reduction techniques are applied to the training data set, as illustrated in Fig. 2. The generated model is then applied to the test data set.

Stepwise regression is a feature selection algorithm; however, it can also be used as a dimensionality reduction technique, as reported in [26]. In this work, we treat the feature vectors of CUs as predictors and the split decisions as response variables. As such, the problem can be formalized in a regression context. The idea of stepwise regression is to start with one feature variable and compute its correlation with the split decision. Then, another feature variable is added and the correlation is computed again. The significance of adding another feature variable is assessed by examining the p-value at a 0.05 level of significance. If the added feature variable is found significant, it is retained; otherwise, it is removed from the list of variables. Likewise, once a variable is retained, the stepwise regression algorithm proceeds by revisiting the previously retained feature variables and reassessing their significance, taking into account that a new variable has been added. The algorithm terminates when there are no further feature variables to add or eliminate. A full description of the algorithm can be found in [20].
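For concreteness, a compact forward/backward stepwise selection in this spirit can be written as follows; this is an illustrative re-implementation using ordinary least squares p-values (via statsmodels), not necessarily the exact routine of [20] used in the paper.

```python
import numpy as np
import statsmodels.api as sm

def stepwise_select(X, y, alpha=0.05):
    """Return indices of retained feature columns of X for response y (split flags)."""
    selected = []
    changed = True
    while changed:
        changed = False
        # Forward step: add the most significant remaining variable, if any.
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        pvals = {}
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pvals[j] = fit.pvalues[-1]           # p-value of the newly added variable
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha:
                selected.append(best)
                changed = True
        # Backward step: drop a previously retained variable that became insignificant.
        if selected:
            fit = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
            p = fit.pvalues[1:]                  # skip the intercept
            worst = int(np.argmax(p))
            if p[worst] > alpha:
                del selected[worst]
                changed = True
    return selected
```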

Once applied to the training data set at the 64 × 64, 32 × 32 and 16 × 16 CU levels, the result of the stepwise regression is simply three sets of indices of the retained feature variables, one set for each CU coding depth. These indices can be used to reduce the dimensionality of the feature vectors during the testing phase. Since we are using a video-dependent approach to learning in this work, the number of retained feature variables varies from one video sequence to another. Full information about the experimental setup is given in the experimental results section; nonetheless, for completeness, we briefly discuss the results of applying the stepwise regression algorithm here. The number of retained variables for each video sequence is given in Table 2. The table lists the average number of retained CU variables over QP values of {22, 27, 32 and 37}. Since three models are generated, the table lists the retained variables at the 64 × 64, 32 × 32 and 16 × 16 CU coding levels. A full example showing the specific names of the retained variables for the RaceHorses sequence is shown in Table 3.

Table 2 Retained variables using stepwise regression
Table 3 Stepwise regression, example retained variables for 32 × 32 depth level with QP = 32

In this work, we also experiment with dimensionality reduction using random forests. In this approach, we grow a large set of trees against the CU split decision. Each tree is trained on a small number of feature variables. The usage statistics of each feature variable can then be used to find an informative subset of features. More specifically, if a feature variable is repeatedly selected as the best split, it is a good candidate to retain. More information about this algorithm can be found in [18].

Here, a random forest of 100 trees is grown, where the maximum number of decision splits or branch nodes is set to the size of the initial set of 70 features. The training dataset is sampled for each decision tree with replacement, and the feature variables selected at random for each decision split are chosen without replacement within the same decision tree. The importance of each of these features in predicting the correct classification of a test instance from the out-of-bag data is computed and used to select the features whose raw importance scores make up 80% of the total importance score. The out-of-bag data is the set of instances that were left out during the training process of a given tree in the random forest.
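A scikit-learn-based sketch of this selection step is shown below; note that sklearn's feature_importances_ attribute is impurity-based and is used here as a convenient stand-in for the out-of-bag importance estimate described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_select(X, y, keep_fraction=0.80, n_trees=100, seed=0):
    """Return indices of the features whose importances cover keep_fraction of the total."""
    forest = RandomForestClassifier(n_estimators=n_trees, bootstrap=True,
                                    oob_score=True, random_state=seed)
    forest.fit(X, y)
    importance = forest.feature_importances_
    order = np.argsort(importance)[::-1]               # most important features first
    cumulative = np.cumsum(importance[order])
    n_keep = int(np.searchsorted(cumulative, keep_fraction) + 1)
    return np.sort(order[:n_keep])                     # indices of retained features
```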

The number of retained variables, using random forests, for each video sequence is given in Table 4. Similar to Table 2 above, the table lists the average number of retained CU variables with QP values of {22, 27, 32 and 37}. Since 3 models are generated, the table lists the retained variables at 64 × 64, 32 × 32 and 16 × 16 CU coding levels. A full example showing specific names of retained variables for the RaceHorses sequence is shown in Table 5.

Table 4 Retained variables using feature importance with random forests
Table 5 Random forest feature importance, example retained variables for 32 × 32 depth level with QP = 32

Lastly, we also experiment with dimensionality reduction using PCA. In this approach, an orthogonal transformation is used to transfer the data from the feature domain to the principal component domain. The first principal component of the transformed data accounts for the highest variability in the feature data. One drawback of PCA is that it results in a reduced data set that cannot be directly interpreted, which is not the case for the other two dimensionality reduction techniques used in this paper. The number of principal components retained depends on the chosen Proportion of Variance (PoV) explained, which is 90% in this work. Again, PCA is applied to the training dataset at the 64 × 64, 32 × 32 and 16 × 16 CU levels. The resulting principal components are then stored and used for reducing the test data set.
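A minimal scikit-learn sketch of this step is shown below; the standardization step is an illustrative addition (PCA is scale-sensitive) and is not necessarily part of the paper's pipeline.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit PCA on the training feature vectors of one CU level, keeping enough principal
# components to explain 90% of the variance; the fitted objects are then reused to
# transform the test-phase feature vectors.

def fit_pca(X_train, pov=0.90):
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=pov).fit(scaler.transform(X_train))
    return scaler, pca

def apply_pca(scaler, pca, X):
    return pca.transform(scaler.transform(X))
```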

The number of retained variables, using PCA, for each video sequence is given in Table 6. Similar to Tables 2 and 4 above, the table lists the average number of retained CU variables with QP values of {22, 27, 32 and 37}. Since 3 models are generated, the table lists the retained variables at 64 × 64, 32 × 32 and 16 × 16 CU coding levels. Unlike stepwise regression and random forest variable selection, PCA results in a reduced data set that cannot be directly interpreted; hence, there are no specific retained variable names to list.

Table 6 Retained variables using PCA

4.2 Classification methods

In this work, we model the relationship between the CU features and split decisions using two classification tools, namely polynomial networks [33] and random forests [3].

For the polynomial networks, we experiment with a second-order polynomial classifier. In the case of random forests, 100 trees are grown, where the maximal number of branch nodes is the square root of the number of retained feature variables. This is determined based on the out-of-bag estimates of the features’ importance in the tree ensemble. Based on the retained features, the training dataset is sampled for each decision tree with replacement. The variables selected at random for each decision split are chosen without replacement within the same decision tree. As the purpose of growing the trees is classification, only one observation or class label is seen per tree leaf. The arrangements in which we combine the classification tools with the dimensionality reduction techniques are presented in Table 7.
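The two classifiers can be approximated with off-the-shelf components as sketched below; the second-order polynomial expansion followed by a linear decision layer is only a stand-in for the polynomial network of [33], and the forest parameters mirror the description above rather than reproduce it exactly.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def make_polynomial_classifier():
    # Second-order polynomial expansion of the (reduced) features + linear decision layer.
    return make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         LogisticRegression(max_iter=1000))

def make_random_forest():
    # 100 trees; each split considers a random subset of sqrt(#features) variables,
    # and each tree is trained on a bootstrap sample of the training set.
    return RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                  bootstrap=True, oob_score=True)
```

Either factory could be passed as the classifier_factory argument of the train_models sketch in Section 3.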

Table 7 Arrangement of classification solutions

5 Experimental results

We assess the performance and efficiency of the proposed solutions by implementing them in the HM reference software version 13.0 [13]. All video sequences are encoded with QPs of {22, 27, 32, and 37}. The main profile and the standard random-access temporal configuration are used. The same motion estimation parameters are used in both the original and proposed video coders. For a fair comparison with [14], we use the HEVC video test sequences. The sequence resolutions are Class A (2560 × 1600), Class B (1920 × 1080), Class C (832 × 480), and Class D (416 × 240). A total of 17 video sequences are used, as reported in Table 8. We ran the experiments on a PC with an Intel Core i7-3740QM 2.7-GHz CPU and 16 GB of DDR3 RAM. Compression efficiency is quantified in terms of BD-rate and BD-PSNR. The compression times of the proposed solutions are also computed and compared with the corresponding times obtained from running the unmodified HEVC encoder. Furthermore, the classification accuracy of the proposed classification systems is presented.

We start by reporting the classification accuracy of the CU split prediction. Table 9 presents the overall classification accuracies of the 4 proposed solutions. These results are the average for all video sequences coded with QPs of {22, 27, 32 and 37}. The table reports the classification results for the 3 models generated according to the CU coding depth. The results indicate that the average classification accuracy improves slightly as the CU coding depth increases. The results also show that the classification solutions using random forests are the most accurate. In Table 10, the average classification results for individual video sequences using the “R.F. Select & R.F.” solution are shown. The resolution of the video does not seem to affect the classification accuracy. Moreover, the accuracies do not seem to vary much with the underlying video sequence either.

Table 8 List of video sequences used
Table 9 Overall classification average of CU split prediction
Table 10 Classification accuracy of CU splits for the “R.F. Select & R.F.” solution

In the following experiments, we report the coding efficiency of the proposed solutions using a number of metrics, namely BD-rate, BD-PSNR [13] and the Computational Complexity Reduction (CCR). The CCR is computed as

$$ \mathrm{CCR}=\left({Time}_{ref.}-{Time}_{proposedSolution}\right)/{Time}_{ref.} $$
(1)

where Time_ref. is the time needed for regular encoding and Time_proposedSolution is the time needed to encode a sequence using the proposed fast encoder. In Table 11, the coding results of the 4 proposed solutions are listed. As mentioned earlier, all results are obtained with QPs of {22, 27, 32 and 37}. The best coding results in terms of BD-rate and BD-PSNR are achieved by the “R.F. Select & R.F.” and “R.F.” solutions. The reported results also indicate that, in general, the coding efficiency improves as the resolution of the video increases. The effect of the proposed solution on the video quality in terms of BD-PSNR ranges from −0.05 to −0.02 dB. This gives a clear indication that the effect of the proposed solution on the visual quality is minimal. Nonetheless, we show example images of regular encoding and encoding using the proposed solution in Fig. 4. As shown in the figure, there are no subjective artifacts as a result of the proposed solution.

Table 11 Coding efficiency of proposed solutions
Fig. 4 Example images of regular encoding (a and c) and encoding with the proposed R.F. Select & R.F. solution (b and d). Frame 20 of RaceHorses and PartyScene with a QP of 27

We report the time complexity in terms of CCR% in Table 12, where sequences are referred to by the IDs listed in Table 8. We present the CCR% for two cases: the first is the time savings without taking the training or model generation time into account, while the second is the CCR% with the training time taken into account. Table 12 shows that, without giving consideration to the training time, all solutions provide similar complexity reductions. However, because some of the training solutions are more computationally demanding, when the training time is taken into account, the range of CCR% widens from (37.5% – 39.1%) to (31.8% – 38.9%). Clearly, the most demanding training solution is the one that uses random forests, followed by random forests with dimensionality reduction. In this work, no attempt was made to reduce the training time; however, one solution would be to reduce the number of training feature vectors by means of sampling, therefore reducing the model generation time. The reported results also indicate that, in general, the CCR% improves as the resolution of the video increases. For comparison with existing work, we refer to the work reported in [14, 38]. These results contain the BD-rate and time savings only. However, the time savings are computed as

$$ \Delta \mathrm{Time}=\left({Time}_{ref.}-{Time}_{proposedSolution}\right)/{Time}_{proposedSolution} $$
(2)
Table 12 Complexity reduction of proposed solutions using CCR (%)

This is different from the CCR reported in Table 12 above, where the time savings are calculated by dividing by Time_ref. instead of Time_proposedSolution. For a fair comparison, in Table 13 we calculated the time savings of the proposed solution accordingly and compare the results of our best solution against existing work.
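The two time-saving conventions are contrasted in the short sketch below (sign conventions aside); the example numbers are illustrative only.

```python
# CCR (Eq. 1) normalizes by the reference encoder's time, while the Delta-Time
# convention of Eq. (2) normalizes by the proposed encoder's time, so for the same
# measurement Delta-Time always reports a larger magnitude.

def ccr(time_ref, time_proposed):
    return (time_ref - time_proposed) / time_ref           # Eq. (1)

def delta_time(time_ref, time_proposed):
    return (time_ref - time_proposed) / time_proposed      # Eq. (2)

# Illustrative example: a CCR of 39.2% corresponds to a Delta-Time of roughly 64.5%.
print(ccr(100.0, 60.8), delta_time(100.0, 60.8))
```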

Table 13 Comparison with existing work [14, 38]

It was not clear to the authors whether the reviewed work considered the training time when reporting ∆Time%; hence, in Table 13, we report our results both with and without taking the training time into account. In the table, ∆1Time% refers to the time savings without the training time and ∆2Time% refers to the time savings taking the training time into consideration.

As previously mentioned, the results of the proposed solution seem to improve with increasing sequence resolution. This is in contrast to the results reported in [14], where the BD-rate seems to improve with decreasing sequence resolution. Moreover, the time reduction in [14] does not seem to be affected by the resolution of the video sequences.

All in all, the results in Table 13 indicate that the proposed “R.F. Select & R.F.” solution has a clear advantage in terms of BD-rate and time savings. With reference to the average reported BD-rate of existing work (i.e. 0.765), the proposed solution provides an enhancement of 27.1%. Additionally, this quality enhancement comes with a further reduction in computational time. More specifically, with reference to the average reported time savings (i.e. -45.5%), the proposed solution provides an enhancement of 57% and 34% for ∆1Time% and ∆2Time%, respectively.

Lastly, the results in Table 14 present a comparison with the recent work reported in [36], which uses CCR% for measuring the time complexity reduction. Table 14 lists the results for 15 video sequences that are used in both [36] and the proposed solution.

Table 14 Comparison with existing work [32, 36]

The results in the table show that the BD-rate of the proposed solution is 0.56 on average, whereas those of the solutions in [32, 36] are 2.2 and 1.36, respectively. This noticeable enhancement in BD-rate comes with a slight advantage in computational complexity reduction as well, where the CCR of the reviewed work in [36] is 37.5% and that of the proposed work is 39.2%. The work in [32] has a higher CCR of 43.9%, but this comes at the expense of affecting the video quality, as evident in its BD-rate.

Clearly, higher computational savings are possible, but at the expense of BD-rate. For example, the work reported in [19] proposed a solution for CU split prediction based on the motion features and rate-distortion cost of the N×N inter mode. The solution reuses motion vectors to expedite compression. The reported CCR results range from 55% to 61%. However, this speedup comes at the expense of a high BD-rate loss of 1.93% to 2.33%. The work in [37] proposed a fast CU mode decision solution for HEVC trans-rating.

The solution uses the modes and motion vectors of correlated CUs to decide on an early SKIP decision. It also restricts the CU depth search by estimating a weighted average of the depth levels of correlated CUs. Additionally, CUs with higher rate-distortion costs are divided into smaller CUs without evaluating the RD costs of the remaining partitioning modes. The reported CCR is around 55%, but again, this speedup comes at the expense of a high BD-rate loss of 2.26%.

6 Conclusion

We proposed a solution for reducing the complexity of determining the CU split decisions in HEVC video coding. The solution uses a video sequence-dependent approach to collect features that represent CUs at various coding depths. The features are extracted from both the underlying CU and the previously encoded CUs. A classification model is then built based on these feature vectors and the corresponding split decisions at various CU coding depths. Additionally, dimensionality reduction is optionally used as a preprocessing step prior to model generation. Therefore, the output of the training phase is a set of classification models for predicting split decisions and a set of dimensionality reduction models. We have used a number of dimensionality reduction and classification techniques, including stepwise regression, random forest variable selection, PCA, polynomial classifiers and random forest classifiers. Experiments were carried out on many test video sequences with different resolutions. Comparison with existing work revealed that the proposed solution has an advantage in terms of coding efficiency and time savings. It was shown that, on average, the classification accuracy of the CU split models is 86.5%. In comparison to regular HEVC coding, the proposed solutions resulted in a BD-rate of 0.55 and a BD-PSNR of −0.02 dB. The average reported computational complexity reduction was 39.2%. Future work includes the investigation of model suitability over very long sequences and the possibility of model regeneration on demand. Future work can also include investigating the suitability of the proposed approach for model generation using a sequence-independent approach.