1 Introduction

Over the last decade, with the large-scale deployment of new technologies such as the Internet of Things (IoT) in smart cities and smart industry, video has become the largest source of data consumed globally. Due to the rapid growth of video applications (video surveillance, 3D video, mobile video, smart city and industry video traffic) and the increasing demand for high-quality video services, video data volume has become a challenge for transmission and multimedia storage. In particular, ultra high definition (UHD) video (4K and 8K) increases the traffic load explosively. In this context, the latest video coding standard, High Efficiency Video Coding (HEVC), standardized by the Joint Collaborative Team on Video Coding (JCT-VC) in 2013, was created to address these challenges [1]. HEVC saves approximately 50% of bitrate (BR) for the same subjective video quality with respect to its predecessor, the H.264/advanced video coding (AVC) standard. However, this performance is achieved at the cost of a large increase in encoder computational complexity, mainly due to the block partition structure [2]. The quadtree structure of the coding unit (CU) is the most critical part of HEVC coding complexity, since it requires an exhaustive rate distortion optimization (RDO) search for the best partition. Consequently, reducing HEVC complexity has become an important task, requiring advanced techniques that offer the best trade-off between rate distortion (RD) performance and complexity reduction. As shown in Fig. 1, the greatest complexity lies in the selection of the optimal prediction mode, especially in inter-mode [3].

Fig. 1 HEVC profile

To adequately address the CU mode decision issue in video coding, the existing works on fast mode decision algorithms can be divided into two categories: statistical approaches and machine learning-based schemes. Several statistical approaches have contributed in different ways to reducing HEVC complexity in both inter and intra coding [3,4,5,6]. For example, the authors of Ref. [3] proposed a mode decision algorithm based on a look-ahead stage to simplify the CU partition process at inter-mode. To alleviate the computational complexity of HEVC, the authors of Ref. [5] proposed a fast algorithm to split the CU based on pyramid motion divergence at inter prediction. In addition, a fast early CU-splitting and pruning method using a Bayesian decision rule with low complexity and full RD cost was developed in Ref. [6]. Although these statistical methods considerably enhance coding efficiency, they perform insufficiently in some cases because they rely on hand-tuned statistical thresholds.

On the other hand, machine learning has witnessed great success in many disciplines, especially in image and video compression. The mode decision problem can be transformed into a classification problem, and learning algorithms have therefore been explored for classifying modes in video coding. In this context, machine learning approaches have been adopted to predict the CU partition and thereby reduce HEVC complexity. A CU depth algorithm composed of multiple binary SVM classifiers with different parameters was proposed in Ref. [15] to predict the splitting of the CU partition. To reduce encoding complexity, a fuzzy SVM-based fast CU decision method was proposed in Ref. [16]. For HEVC intra coding, Liu et al. [17] applied a hardware-oriented CNN to reduce intra coding complexity. Regarding the complexity of HEVC inter-mode, a neural network-based inter prediction algorithm was proposed in Ref. [23]. However, these approaches are shallow, with limited learning capacity, which makes them insufficient to accurately model the complicated CU partition process. Based on machine learning, this paper proposes a fast CU partition algorithm that reduces HEVC complexity while preserving RD performance. We first propose an online SVM-based fast CU partition scheme, which reduces HEVC complexity at inter-mode. Next, we develop a deep convolutional neural network (CNN) structure to predict the CU partition at HEVC inter-mode. To train our deep CNN model, a large-scale database of inter-mode CU partitions is built. The proposed method achieves a good trade-off between complexity reduction and RD performance. Our main contributions are summarized as follows:

1. We propose an online SVM-based fast CU algorithm for reducing HEVC complexity.

2. We design a deep CNN architecture to predict the CU partition of HEVC at inter-mode.

3. We construct a large-scale database for the CU partition of inter-mode HEVC, to accurately train the deep CNN that aims to reduce HEVC complexity.

The remainder of this paper is organized as follows. Section 2 reviews related works. Section 3 gives an overview of the CU partition in HEVC. In Sect. 4, we propose a machine learning approach to reduce HEVC complexity at inter-mode. Experimental results are presented in Sect. 5, and Sect. 6 concludes the paper.

2 Related works

HEVC complexity reduction has long been a popular challenge in the video coding field. Many features have been adopted to simplify the RDO search of the CU partition; the resulting methods can be classified into statistical and machine learning approaches. Among statistical methods, several fast decision algorithms have been introduced in Refs. [3,4,5,6,7,8,9,10,11]. To reduce HEVC computational complexity, Gabriel et al. [3] introduced a fast partitioning and mode decision algorithm based on a look-ahead stage. Wang et al. [4] proposed a threshold-based splitting decision scheme with respect to the RD cost of each CU, which reduces the number of intra candidates, applies adaptive reference frame selection, and terminates coding unit splitting early. In Ref. [5], the authors proposed a fast algorithm to split the CU based on pyramid motion divergence at inter prediction. In addition, a fast early CU-splitting and pruning method with low complexity and full RD cost was developed by Cho et al. [6]. In a similar way, a fast coding unit decision based on Bayesian rules to minimize the RD cost was proposed by Shen et al. [7]. Furthermore, the authors of Ref. [8] proposed an adaptive CU depth decision approach that exploits both the existence of non-zero coefficients after the transform and the maximum depth of temporally co-located CTUs. Also relying on the spatial and temporal homogeneity of the images, some authors analyze the input pictures, such as Fernandez et al. [9] and Lee et al. [10], who proposed early termination algorithms. With regard to inter prediction, a square-type-first mode decision algorithm was proposed to decrease encoding time [11]. These methods rely on statistics of RD cost properties and temporal and spatial correlation, which limits their applicability and makes it difficult to handle varied content and complex coding structures.

The past few years have seen great success in applying machine learning tools to enhance video coding. In this vein, great efforts have been made to integrate machine learning tools that predict the CU partition and reduce HEVC complexity [12,13,14,15,16,17,18,19,20,21,22,23]. The search for the optimal partitioning has thus been cast as a classification problem. For example, Corrêa et al. [12] proposed three early termination schemes based on data mining techniques to simplify the decision on the optimal CTU structure. In a similar way, an SVM-based fast CU partition decision was proposed in Ref. [13]. To reduce encoding complexity, Zhang et al. [14] proposed a CU early termination algorithm, modeling the CU depth decision process in HEVC as a three-level hierarchical classification decision. In this regard, an SVM-based fast HEVC encoding algorithm was proposed by Zhu et al. [15] to predict both the CU partition and the PU mode: CU early termination is modeled as hierarchical binary classifications, whereas PU selection is treated as a multi-class classification. To reduce HEVC encoding complexity, Zhu et al. [16] proposed a fuzzy SVM-based CU decision method that achieves a good trade-off between computational complexity reduction and RD performance. In recent studies, learning-based techniques have also been applied to fast CU partitioning for intra-mode HEVC, such as Ref. [17], which implements a CNN algorithm along with its VLSI design, and Ref. [18], which uses logistic regression classification for fast HEVC intra mode decision. To reduce HEVC complexity, Amer et al. [20] proposed fully connected neural networks and Laplacian transparent composite models. Most recently, deep learning techniques have also been employed to speed up the encoding process and predict the CU partition [21,22,23].

The analysis of these earlier works shows that it is possible to achieve a significant HEVC complexity reduction by using learning-based solutions. More sophisticated techniques such as CNNs should then be able to yield competitive results, particularly with regard to CU prediction.

While studying the methodology of the machine learning-based approaches [12,13,14,15,16,17,18,19,20,21,22,23], we noticed that some aspects could still be improved, for example the training process. The main motivation of this paper is therefore to direct our efforts at both SVM- and CNN-based solutions, supported by a large-scale database for the CU partition of HEVC at inter-mode.

3 Overview of the CU partition

One of the main contributions of the HEVC standard is the block partition structure, which significantly improves compression performance [24]. First, the picture is partitioned into coding tree units (CTUs) of size 64 × 64, which replace the macroblock of the previous standard. The hierarchical coding structure of HEVC ranges from the largest coding unit (LCU), with a size of 64 × 64, to the smallest coding unit (SCU), of size 8 × 8. A CU can be 64 × 64, 32 × 32, 16 × 16 or 8 × 8, corresponding to the four CU depths 0, 1, 2 and 3. A quadtree can be used to represent the hierarchical partition of a CTU into CUs [25]. Moreover, CUs are further split into prediction units (PUs) and transform units (TUs). The depth choice in each CTU goes through a decision process based on the RD cost calculation of each CU partition inside the CTU. An example of the HEVC quadtree concept is shown in Fig. 2.

Fig. 2 CU partition structure in HEVC

In the 64 × 64 CTU, when the split flag is set to 0, the RD cost RD1 is calculated. When the split flag changes to 1, the four 32 × 32 sub-CUs are obtained; the first one, CU10, has an RD cost equal to RD2. At the next depth, the CU is partitioned into four CUs of size 16 × 16, and the first of them (CU20) has an RD cost equal to RD3. When its split flag is 1, the last depth (depth = 3) is reached and CU20 is partitioned into four SCUs of size 8 × 8, whose RD costs are denoted RD4, RD5, RD6 and RD7, respectively. The first decision is taken from the bottom to the top by determining whether the first 16 × 16 CU is split or not: the sum of the four RD costs of the 8 × 8 SCUs is compared with RD3 of the 16 × 16 CU. If RD3 is greater than the sum of RD4, RD5, RD6 and RD7, CU20 will be split; otherwise CU20 will not be split. Likewise for the other CUs, the decision is always based on Eq. (1). In general, RDO is the method used to decide the optimal mode in video coding, and the optimal CU modes are those with minimum RD cost. In a 64 × 64 CTU, 85 possible CUs (1 + 4 + 16 + 64) are evaluated:

$${\text{RD}}_{{{\text{cost}}_{\text{CU}} }} < \mathop \sum \limits_{k = 0}^{3} {\text{RD}}_{{{\text{cost}}_{\text{subCU}} }} (k).$$
(1)
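To make this decision concrete, the following self-contained sketch walks the quadtree bottom-up and applies Eq. (1) at each node. The RD costs here are toy numbers standing in for real RDO measurements; in the encoder, each cost comes from an actual RDO pass.

```python
# Bottom-up RD-cost comparison of Eq. (1), on a toy CU tree.

def best_cost(node):
    """Minimum RD cost of `node`: 'no split' vs. the sum of the
    best costs of its four sub-CUs (Eq. 1)."""
    cost_unsplit = node["rd"]
    children = node.get("children")
    if not children:                    # 8x8 SCU: leaf, cannot split
        return cost_unsplit
    cost_split = sum(best_cost(c) for c in children)
    return min(cost_unsplit, cost_split)

# Toy 16x16 CU (RD3) with four 8x8 sub-CUs (RD4..RD7), as in the example above.
cu16 = {"rd": 100.0,
        "children": [{"rd": 20.0}, {"rd": 25.0}, {"rd": 22.0}, {"rd": 18.0}]}
split = sum(c["rd"] for c in cu16["children"]) < cu16["rd"]
print("split 16x16:", split, "| best cost:", best_cost(cu16))  # True, 85.0
```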

4 Proposed work

In this section, we introduce the proposed fast CU partition based on learning approaches. First, an online SVM-based fast CU algorithm for reducing HEVC complexity is introduced. Second, we design the deep CNN-based network architecture that predicts the CU partition structure at each depth from 0 (64 × 64) to 3 (8 × 8). Finally, we introduce the training phase of our deep CNN, in which a large-scale database is built from the encoding information obtained from the HEVC standard.

4.1 Three-level CU classifier

In the HEVC standard, the CTU supports quadtree CU partitions with four levels of CU depth from 0 to 3, corresponding to CU sizes from 64 × 64 down to 8 × 8. The CU partition can be considered as a combination of binary classifiers \(\{ F_{l} \}_{l = 1}^{3}\) at three levels of decision l ∈ {1, 2, 3} on whether to split a parent CU into sub-CUs. For example, at level 1, the 64 × 64 CU is split into 32 × 32 CUs; l = 2 denotes the decision level for splitting 32 × 32 into 16 × 16, and l = 3 stands for 16 × 16 into 8 × 8, as shown in Fig. 2. Within a CTU, the CUs are denoted as CU, CUi and CUi,j, corresponding to depths 0, 1 and 2, where i, j ∈ {0, 1, 2, 3} are the indices of the sub-CUs. At each CU depth, we need to determine whether to split the current CU or not. The total number of splitting patterns for a CTU is 83,522, which is far too many to be solved by a single multi-class classification in one step. Therefore, due to the large number of pattern combinations, a prediction is made at each decision level, yielding \(\tilde{F}_{1} ({\text{CU}})\), \(\{ \tilde{F}_{2} ({\text{CU}}_{i} )\}_{i = 0}^{3}\) and \(\{ \tilde{F}_{3} ({\text{CU}}_{i,j} )\}_{i,j = 0}^{3}\), which denote the predicted \(F_{1} ({\text{CU}})\), \(\{ F_{2} ({\text{CU}}_{i} )\}_{i = 0}^{3}\) and \(\{ F_{3} ({\text{CU}}_{i,j} )\}_{i,j = 0}^{3}\), respectively.
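The pattern count can be verified with a short recursion: an 8 × 8 SCU admits exactly one pattern (it cannot split), and every larger CU is either left un-split or split into four sub-CUs whose pattern counts multiply.

```python
# Sanity check of the 83,522 splitting patterns of a 64x64 CTU.

def num_patterns(depth, max_depth=3):
    if depth == max_depth:      # 8x8 SCU: single pattern
        return 1
    # either un-split (1) or split into four independent sub-CUs
    return 1 + num_patterns(depth + 1, max_depth) ** 4

print(num_patterns(0))  # 83522 = 1 + 17**4, with 17 = 1 + 2**4 at 32x32
```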

4.2 Online support vector machine (SVM)

In machine learning theory, SVM is a supervised learning tool that performs classification and regression analysis [26]. SVM uses a hyperplane to separate the data, possibly after mapping it into a higher-dimensional space.

If the data points are not linearly separable in the input space, SVM can transform the data into a high-dimensional space through a nonlinear transformation. To separate the two classes of data points, SVM maps the sample data into a hyperspace, where its main goal is to find an optimal separating hyperplane. The SVM classifier chooses the hyperplane that maximizes the margin between the hyperplane and the support vectors [27]. The support vector classifier principle used in this work is shown in Fig. 3.

Fig. 3 Example of support vector classifier

The CU split decision can be modeled as a binary classification problem with the classes split and non-split. Here, we adopt an online SVM, since SVM is robust and popular for binary classification and offers significant computational advantages. The main idea is to find a hyperplane that separates the training samples of the two classes while maximizing the margin between them, in order to determine the CU splitting level. According to Eq. (2), the optimal weight vector w is a linear combination of the support vectors, i.e., the training points that determine the margin. Given a training set of N samples \(\{ x_{i} ,y_{i} \}_{i = 1}^{N}\), with xi ∈ \(R^{n}\) and yi ∈ {− 1, 1}, the hyperplane parameterized by the normal vector w that maximizes the margin is found by solving the optimization problem:

$$\min_{w} \; \frac{\gamma }{2}\left\| w \right\|^{2} + \frac{1}{N}\sum\limits_{i = 1}^{N} \max \left( 0,1 - y_{i} (w \cdot x_{i} ) \right),$$
(2)

where \(\gamma\) ≥ 0 is the smoothing parameter, defined by \(\gamma = \frac{1}{NC}\), where C is the penalty parameter that needs to be tuned during SVM training.

Mathematically, support vector machines (SVMs) handle nonlinearly separable data by using a kernel function, which maps the data into a different space where a linear hyperplane can separate the classes. In this work, the Gaussian radial basis function (RBF) is applied as the kernel function, defined as:

$$K(x_{i} ,x_{j} ) = \exp \left( { - \frac{{{\left \| x_{i} - x_{j} \right \|}^{2} }}{{2\sigma^{2} }}} \right).$$
(3)
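For reference, Eq. (3) translates directly into NumPy as follows; the two sample vectors are purely illustrative.

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma=1.0):
    """Gaussian RBF kernel of Eq. (3)."""
    return np.exp(-np.sum((x_i - x_j) ** 2) / (2.0 * sigma ** 2))

# squared distance of the toy vectors is 2, so the kernel is exp(-1) ~= 0.3679
print(rbf_kernel(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```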

To reduce HEVC complexity, we train our SVM model online to terminate the CU splitting process early. In the online training mode, some frames of a sequence are encoded with the original encoder, which outputs class labels and features for model training. The subsequent frames are then encoded with CU depths predicted by the trained model. The training frames and models can be refreshed on demand. Figure 4 illustrates an example of the online training mode, where training frames are shown in yellow and predicted frames in blue. The advantage of online training is that the properties of the training and testing portions of the video sequence are quite close, which improves prediction accuracy.

Fig. 4 Online training mode
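A minimal sketch of this train-then-predict cycle is given below, under stated assumptions: the helpers encode_with_hm() (runs the original encoder on a training frame and returns per-CU features with ground truth split labels) and extract_features() are hypothetical, `sequence` is an assumed frame iterator, and scikit-learn's SVC with an RBF kernel stands in for the paper's online SVM.

```python
from sklearn.svm import SVC

TRAIN_FRAMES, PREDICT_FRAMES = 2, 8        # illustrative refresh period
X_train, y_train, model = [], [], None

for idx, frame in enumerate(sequence):
    pos = idx % (TRAIN_FRAMES + PREDICT_FRAMES)
    if pos == 0:
        X_train, y_train = [], []          # refresh: drop the stale training set
    if pos < TRAIN_FRAMES:
        X, y = encode_with_hm(frame)       # full RDO pass -> features, split labels
        X_train.extend(X)
        y_train.extend(y)
        model = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
    else:
        split_flags = model.predict(extract_features(frame))
        # feed `split_flags` to the encoder to skip RDO checks of non-split CUs
```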

4.3 Deep CNN architecture

The convolutional neural network is the most widely used deep learning model for video processing applications. Based on the mechanism of the CU partition at inter-mode HEVC, our deep CNN structure is shown in Fig. 5.

Fig. 5 Deep CNN architecture

The residual CTU is fed into the CNN architecture; the residue is obtained by pre-coding the frame in HEVC. Our proposed architecture is composed of pre-convolution, convolutional layers, a concatenated vector and fully connected layers. The pre-convolution inputs are the residual blocks of CU, CUi or CUi,j, corresponding to the three levels. Each residual block is subtracted by its mean intensity value to reduce the variation of the input CTU samples. Specifically, at the first level of CU partition, the mean value of CU is removed in accordance with the output of \(\tilde{F}_{1} ({\text{CU}})\). At the second level, the four CUs \(\{ {\text{CU}}_{i} \}_{0}^{3}\) are subtracted by their corresponding mean values, matching the 2 × 2 output of \(\{ \tilde{F}_{2} ({\text{CU}}_{i} )\}_{i = 0}^{3}\). At the third level, \(\{ {\text{CU}}_{i,j} \}_{i,j = 0}^{3}\) have their mean values removed in each CU for the 4 × 4 output \(\{ \tilde{F}_{3} ({\text{CU}}_{i,j} )\}_{i,j = 0}^{3}\).
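As an illustration of this pre-convolution step, the sketch below subtracts per-CU means at the three levels using NumPy; the 64 × 64 residual input is randomly generated for the example, standing in for a real pre-coded residue.

```python
import numpy as np

def remove_means(ctu, blocks_per_side):
    """Subtract the per-CU mean at one level (1, 2 or 4 blocks per side)."""
    out = ctu.astype(np.float32).copy()
    size = 64 // blocks_per_side
    for bi in range(blocks_per_side):
        for bj in range(blocks_per_side):
            block = out[bi * size:(bi + 1) * size, bj * size:(bj + 1) * size]
            block -= block.mean()          # zero-mean each (sub-)CU in place
    return out

ctu = np.random.randn(64, 64)              # stand-in residual CTU
level_inputs = [remove_means(ctu, b) for b in (1, 2, 4)]   # levels 1, 2, 3
```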

After the pre-convolution stage, three convolutional layers are used to extract features from the data at all levels. A convolutional layer is a mathematical operation that takes two inputs, the CU samples and a set of filters. In each layer, the convolution kernels of all three levels have the same size. At the first convolutional layer, 16 kernels are used to extract low-level feature maps for the CU partition. Following the first convolutional layer, the feature maps are convolved twice with 2 × 2 kernels to generate higher-level features. The strides of all the above convolutions equal the sizes of the corresponding kernels, yielding non-overlapping convolutions.

This design of the convolutional layers is in accordance with all possible non-overlapping CUs of different sizes in the CTU partition. At the end of the convolutions, the final feature maps are concatenated and flattened into a vector through the concatenation layer. In the following fully connected layers, the features generated from the whole CTU are all considered to predict the CU partition at each level.

Finally, the concatenated vector flows through three fully connected stages as illustrated in Fig. 5, including two hidden layers and one output layer per level. The two hidden fully connected layers successively generate feature vectors, denoted by \(({\text{FC}}_{l} )_{l = 1}^{3}\). The outputs of the deep CNN have 1, 4 and 16 elements: the predicted binary labels \(\tilde{F}_{1} ({\text{CU}})\) in 1 × 1, \(\{ \tilde{F}_{2} ({\text{CU}}_{i} )\}_{i = 0}^{3}\) in 2 × 2 and \(\{ \tilde{F}_{3} ({\text{CU}}_{i,j} )\}_{i,j = 0}^{3}\) in 4 × 4 at the three levels, respectively. In the deep CNN structure, early termination may allow the calculation of the fully connected layers at levels 2 and 3 to be skipped, saving computation time. Specifically, if the CU is decided not to be split at level 1, the calculation of \(\{ \tilde{F}_{2} ({\text{CU}}_{i} )\}_{i = 0}^{3}\) at level 2 is terminated early. At level 3, \(\{ \tilde{F}_{3} ({\text{CU}}_{i,j} )\}_{i,j = 0}^{3}\) does not need to be computed if \(\{ {\text{CU}}_{i} \}_{0}^{3}\) are not all split. Rectified linear units (ReLU) activate all convolutional layers and hidden fully connected layers, since ReLU offers better convergence speed [28]. Moreover, since all the labels for splitting or non-splitting are binary, the output layers of all three levels are activated with the sigmoid function.
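A hedged Keras sketch of this architecture is given below. Only the details stated in the text are taken as fixed (a 64 × 64 residual input, 16 kernels in the first convolutional layer, two further 2 × 2 non-overlapping convolutions, concatenation, two hidden fully connected layers, and sigmoid outputs of 1, 4 and 16 elements); the per-branch first-layer kernel sizes, the later filter counts and the hidden widths are our assumptions, and the pre-convolution mean removal (sketched above) is omitted here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

ctu = layers.Input(shape=(64, 64, 1))      # residual CTU from HEVC pre-coding

def branch(x, first_kernel):
    # stride = kernel size throughout, i.e., non-overlapping convolutions
    x = layers.Conv2D(16, first_kernel, strides=first_kernel, activation="relu")(x)
    x = layers.Conv2D(24, 2, strides=2, activation="relu")(x)  # filter counts assumed
    x = layers.Conv2D(32, 2, strides=2, activation="relu")(x)
    return layers.Flatten()(x)

# one branch per decision level; the first-layer kernel sizes are assumptions
feats = layers.Concatenate()([branch(ctu, k) for k in (16, 8, 4)])

def head(n_out):
    x = layers.Dense(128, activation="relu")(feats)  # hidden FC widths assumed
    x = layers.Dense(64, activation="relu")(x)
    # flat 1-, 4- and 16-vectors stand for the 1x1, 2x2 and 4x4 label maps
    return layers.Dense(n_out, activation="sigmoid")(x)

model = Model(ctu, [head(1), head(4), head(16)])
model.summary()
```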

4.4 Training phase

In this section, we present the training process for the proposed deep CNN, as shown in Fig. 6. We train our model in a supervised manner, learning the deep CNN from labeled data. In this context, we create the database for training the proposed model so as to achieve high performance (high accuracy, low loss).

Fig. 6 Training process

We establish a large-scale database for the CU partition of inter-mode HEVC (CPIH) to increase prediction accuracy. To construct our CPIH database, we selected 114 raw video sequences with different resolutions (from 352 × 240 to 2560 × 1600) [29,30,31,32]. These sequences are gathered into three sub-sets: 86 sequences for training, 10 sequences for validation, and 18 sequences for test. Table 1 summarizes the chosen videos and the number of frames (41,349) in our CPIH database.

Table 1 Sequences in CPIH database

First, we encode the original database (114 video sequences) with the original HEVC encoder under the common test conditions at four quantization parameters (QP = 22, 27, 32, 37), using the low delay P configuration (encoder_lowdelay_P_main.cfg), to obtain the residue and the ground truth CU depth. The ground truth CU depth files contain the splitting decisions for the entire sequences.

Second, to construct the training samples, the training, validation and test data are generated by the ‘EXTRACT DATA’ program. The training data is used to train the model, whereas the validation data determines when to stop the learning process. For the test data, 18 sequences of classes A–E from the Joint Collaborative Team on Video Coding (JCT-VC) are used to evaluate the performance of the proposed deep CNN [31].

As shown in Fig. 6, the TRAIN MODEL process summarizes how the model is trained on the constructed CPIH database. Stochastic gradient descent with momentum (SGD) is used as the optimization algorithm to update the network weights at each iteration and minimize the error between the ground truth labels and the prediction outputs. This process continues until the loss function reaches a minimum value (Fig. 7). Furthermore, the deep CNN model is trained at the four QPs using different CU sizes, varying from 16 × 16 to 64 × 64.

Fig. 7 Training loss

To learn our deep CNN model, the cross entropy is applied as the loss function, following Eqs. (4) and (5):

$$L = \frac{1}{N}\sum\limits_{n = 1}^{N} L_{n} ,$$
(4)

where N is the number of training samples and Ln is the sum of the cross entropy:

$$L_{n} = Y(F_{1}^{n} ({\text{CU}}),\tilde{F}_{1}^{n} ({\text{CU}})) + \sum\limits_{{i \in \{ 0,1,2,3\} }} Y (F_{2}^{n} ({\text{CU}}_{i} ),\tilde{F}_{2}^{n} ({\text{CU}}_{i} )) + \sum\limits_{{i,j \in \{ 0,1,2,3\} }} Y (F_{3}^{n} ({\text{CU}}_{i,j} ),\tilde{F}_{3}^{n} ({\text{CU}}_{i,j} )),$$
(5)

where Y denotes the cross entropy between the ground truth labels and the predicted labels. The labels predicted by our deep CNN are represented by \(\{ \tilde{F}_{1}^{n} ({\text{CU}}),\{ \tilde{F}_{2}^{n} ({\text{CU}}_{i} )\}_{i = 0}^{3} \,{\text{and}}\,\{ \tilde{F}_{3}^{n} ({\text{CU}}_{i,j} )\}_{i,j = 0}^{3} \}_{n = 1}^{N} .\)
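For reference, Eqs. (4) and (5) translate into TensorFlow as sketched below, assuming the three network outputs are batched as tensors of shape (N, 1), (N, 4) and (N, 16); the function names are ours.

```python
import tensorflow as tf

def bce_sum(t, p, eps=1e-7):
    """Cross entropy Y summed over all labels of one level (inner sums of Eq. 5)."""
    p = tf.clip_by_value(p, eps, 1.0 - eps)
    return tf.reduce_sum(-(t * tf.math.log(p) + (1.0 - t) * tf.math.log(1.0 - p)),
                         axis=1)

def cpih_loss(y_true, y_pred):
    """y_true / y_pred: lists of three tensors shaped (N, 1), (N, 4), (N, 16)."""
    per_sample = sum(bce_sum(t, p) for t, p in zip(y_true, y_pred))  # L_n, Eq. (5)
    return tf.reduce_mean(per_sample)                                # Eq. (4)
```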

We use the TensorFlow-GPU deep learning framework to train our proposed deep CNN on an NVIDIA GeForce GTX 480 GPU, which dramatically improves training speed compared to the CPU. We adopt batch learning with a batch size of 64, and the momentum of the stochastic gradient descent optimizer is set to 0.9. The learning rate decays exponentially toward 0.01, changing every 1000 iterations. The total number of iterations was 2,000,000. Finally, the trained deep CNN model can be used to predict the CU partition (classes) at HEVC inter-mode.
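This training setup can be sketched as follows; the initial learning rate and per-step decay factor are assumptions, since the text fixes only the momentum (0.9), the batch size (64), the decay period (1000 iterations) and the decayed target (0.01).

```python
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,   # assumed starting rate
    decay_steps=1000,            # "changing every 1000 iterations"
    decay_rate=0.96)             # assumed factor; the rate decays toward 0.01
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

# `model` is the multi-output network sketched above; per-output binary cross
# entropy approximates Eqs. (4)-(5) up to per-level averaging
model.compile(optimizer=optimizer, loss="binary_crossentropy")
# model.fit(x, [f1, f2, f3], batch_size=64, ...)  # 2,000,000 iterations in the paper
```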

5 Experimental results

In this section, we evaluate the performance of the proposed approaches. All experiments are implemented in the HEVC reference software HM16.5 using the random access (RA) and low delay P (LDP) configurations. The QP values tested for the encoding process were 22, 27, 32 and 37. At inter-mode, our experiments use the 18 video sequences of the JCT-VC standard test set [32], covering the following resolutions: 2560 × 1600 (class A), 1920 × 1080 (B), 832 × 480 (C), 416 × 240 (D), and 1280 × 720 (E). Simulations were conducted on a Windows 10 platform with an Intel® Core™ i7-3770 @ 3.4 GHz CPU and 16 GB RAM.

5.1 Performance metric

To evaluate our proposed algorithms, we use the most important metric of fast encoding, the computational time saving ΔT, defined in Eq. (6):

$$\Delta T = \frac{{T_{\text{P}} - T_{\text{o}} }}{{T_{\text{o}} }} \times 100(\% ),$$
(6)

where To is the computational time of the original HM, and TP is the computational time of our proposed fast CU encoding algorithm.

Additionally, RD performance is a critical evaluation metric. We use the peak signal-to-noise ratio (PSNR) for objective video quality measurement and the BR, both compared to the original HM, as defined in Eqs. (7) and (8):

$$\Delta {\text{PSNR}} = {\text{PSNR}}_{\text{P}} - {\text{PSNR}}_{\text{o}} \,({\text{dB}}),$$
(7)
$$\Delta {\text{BR}} = \frac{{{\text{BR}}_{\text{P}} - {\text{BR}}_{\text{o}} }}{{{\text{BR}}_{\text{o}} }} \times 100\,(\% ).$$
(8)
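For clarity, Eqs. (6)–(8) translate directly into the following helpers; the sample values are illustrative, with the subscripts P and o denoting the proposed encoder and the original HM anchor.

```python
def delta_t(t_p, t_o):
    return (t_p - t_o) / t_o * 100.0   # Eq. (6), %; negative means faster

def delta_psnr(psnr_p, psnr_o):
    return psnr_p - psnr_o             # Eq. (7), dB

def delta_br(br_p, br_o):
    return (br_p - br_o) / br_o * 100.0  # Eq. (8), %; negative means BR saving

print(delta_t(47.0, 100.0))  # -53.0: the proposed encoder is 53% faster
```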

5.2 Performance evaluation of our online SVM

Table 2 summarizes the performance comparison between our proposed scheme and the original HEVC under the LDP and RA configurations.

Table 2 Performance analysis of our fast online SVM

According to the simulation results, our proposed online SVM significantly reduces HEVC complexity while largely preserving RD performance. The RA configuration yields better results than LDP in terms of both coding efficiency and average time reduction. With regard to computational complexity, our scheme achieves a maximum time saving of 62.21% and an average of 53.14% under the random access configuration, at the cost of an average BR increase of 1.269% with an almost negligible decrease in PSNR. Under the LDP configuration, our approach saves 52.28% of the execution time with a BR increase of 1.928%.

These results confirm the effectiveness of our algorithm in reducing the complexity of inter-mode HEVC while maintaining coding efficiency. This is due to the optimized search for the best partition with the optimal RD cost in significantly less time than the standard method, which evaluates all CU partitions.

For further evaluation, Fig. 8 shows the HEVC complexity reduction and RD performance for the different video classes under the RA and LDP configurations.

Fig. 8 Complexity reduction and RD performance at different video classes under RA and LDP configurations

The BR of our proposed method is on average better under the RA configuration than under LDP. On the other hand, the proposed scheme performs much better under LDP in terms of PSNR. As shown in Fig. 8, under the RA configuration our approach reduces the encoding time for all video sequences more than under the LDP configuration.

5.3 Performance evaluation with online SVM and deep CNN

Table 3 compares our two proposed methods, deep CNN and online SVM, in terms of complexity reduction and RD performance under the LDP configuration.

Table 3 Performances comparison between deep CNN and online SVM

The experimental results show that our deep CNN model achieves a significant complexity reduction of 53.99% with a BR saving of 0.195% under the LDP configuration. In contrast, our online SVM incurs a coding loss of 1.928% in BR with an average time reduction of 52.28%. As can be seen, our proposed deep CNN obtains its best execution time results for the class E sequences; this is caused by the low motion activity of these sequences, which leads to larger partitions. For the same reason, a slightly higher encoding time can be observed for high-resolution sequences compared with lower-resolution ones.

In summary, the overall performance evaluation shows that the proposed deep CNN outperforms the online SVM in terms of both complexity reduction and RD performance for inter-mode HEVC, as seen in Table 3. This implies that the proposed deep CNN is more robust in reducing the complexity of inter-mode HEVC than the online SVM. One explanation is that CNNs work particularly well on visual data such as residual blocks, whereas SVMs are general-purpose classifiers; moreover, SVMs are difficult to parallelize, while the CNN architecture inherently supports parallelization.

For further evaluation, the complexity reduction and RD performance of the deep CNN versus the online SVM under the LDP configuration are shown in Fig. 9.

Fig. 9 Complexity reduction and RD performance of deep CNN versus online SVM under LDP configuration

As shown in Fig. 9, the deep CNN outperforms the online SVM in terms of BR for all test sequences, whereas the online SVM is superior to the deep CNN in terms of PSNR. With regard to complexity reduction, the deep CNN is more efficient than the online SVM for classes B and E.

5.4 Comparison with the state-of-the-art

In this section, the obtained results are compared with other state-of-the-art approaches. Table 4 compares the proposed algorithm with two state-of-the-art methods.

Table 4 Results of our deep CNN model compared with two state-of-the-art methods

With regard to complexity reduction, the ΔT results are averaged over the four QPs {22, 27, 32, 37} for each class. As seen in this table, the ΔT of our proposed method reaches 53.99% on average, which is superior to the 46.44% obtained by Zhang et al. [14] and the 44.44% obtained by Mallikarachchi et al. [19]. Our scheme therefore achieves a larger complexity reduction at inter-mode HEVC than the other two approaches.

In addition to complexity reduction, RD performance is a critical evaluation metric. Table 4 lists the ΔPSNR and ΔBR results. As shown in this table, the ΔPSNR of our deep CNN method averages − 0.063 dB, which is better than the − 0.115 dB of Ref. [14]. Moreover, the ΔBR of our method averages − 0.195%, which outperforms the 1.826% of Ref. [14] and the 3.806% of Ref. [19].

Consequently, our proposed algorithm outperforms the state-of-the-art approaches of Refs. [14] and [19] in terms of both time reduction and RD performance.

6 Conclusion

In this paper, we proposed fast CU partition methods based on machine learning to reduce the complexity of inter-mode HEVC. An online SVM-based fast CU partition method was first proposed to reduce HEVC encoding complexity. Then, a deep CNN was proposed to predict the CU partition of HEVC, further reducing HEVC complexity at inter-mode. In the experimental results, the online SVM reduces the execution time by 52.28% on average with a BR increase of 1.928%, whereas the deep CNN model improves RD performance with a BR saving of 0.195% and achieves on average 53.99% time saving under the LDP configuration. Consequently, our deep CNN scheme offers a better trade-off between RD performance and complexity reduction than the online SVM. The comparative results demonstrate the effectiveness of the proposed deep CNN in reducing HEVC complexity.