1 Introduction

Following the success of H.264/AVC, video compression standards target higher resolutions, reaching up to 4K by 2K, known as ultra high definition (UHD). As demand increased, the International Telecommunication Union and the International Organization for Standardization developed a new video coding standard with improved compression tools focused on UHD videos. The resulting high-efficiency video coding (HEVC) standard aims to double the compression rate compared to H.264/AVC for the same video quality [1]. The computational complexity of the encoding and decoding process increases by a factor of 2 to 10 compared to H.264/AVC. Motion estimation and motion compensation are two essential processes in block-based video coding, because they reduce the temporal redundancy between consecutive frames. Motion estimation searches for the best motion vector in previous frames, future frames, or a combination of both. Reducing the number of search points decreases the complexity of motion estimation. In addition, several algorithms have been proposed to decrease the mode decision complexity of the HEVC encoder, achieving significant time savings with a negligible quality loss. This paper proposes a new motion estimation algorithm (MEA) for the HEVC encoder based on a modification of the pattern search. It also explains and adopts new options for the fast mode decision used for sub-partitioning the largest coding unit (LCU).

The rest of the paper is organized as follows. Section 2 presents an overview of the HEVC encoder. Section 3 details related work on MEAs and fast mode decision used in HEVC to save encoding time. Methods based on fast coding are developed in Sect. 4. In Sect. 5, the motion estimation process is detailed, followed by our proposed algorithms. The experimental results obtained by combining our two approaches are given in Sect. 6. Finally, a conclusion and some perspectives are presented in Sect. 7.

2 Overview of the HEVC encoder

2.1 Encoder structure

The video coding layer (VCL) uses the same hybrid approach, based on intra/inter-prediction modules, as the previous standards. Inter-prediction is a temporal prediction using motion estimation between different frames, while intra prediction is a spatial prediction within the same frame. These stages are followed by an integer transformation and scalar quantization. As shown in Fig. 1, the quantized coefficients are then entropy coded using CABAC. As in previous standards, two loop filters, a deblocking filter and a sample adaptive offset (SAO) filter, are applied to the reconstructed frame. The deblocking filter reduces the blocking artifacts due to block-based coding; it is applied only to samples located next to transform unit (TU) or prediction unit (PU) boundaries. SAO is a process that modifies the samples after the deblocking filter. It is a non-linear filter driven by a look-up table, able to reduce the distortion of the reconstructed frames by adding an offset to the reconstructed pixels [1]. The main feature that distinguishes HEVC from AVC is the replacement of the macroblock structure by a quad-tree structure.

Fig. 1 HEVC encoder structure

The principle of the new structure is to code coding tree units (CTUs) instead of macroblocks. This structure is based on blocks and units. The three basic units used in the HEVC quad-tree structure are the coding unit (CU), the prediction unit (PU), and the transform unit (TU) [2]. Since HEVC mainly targets larger video resolutions, larger blocks are needed to encode them more efficiently; on the other hand, fine details are better handled with small blocks. Reconciling the two is the challenge of the HEVC standard. The quad-tree structure provides a compromise between good quality and low bit-rate. This new structure is detailed further in Sect. 4.

For intra prediction in HEVC, the size of the smallest block remains 4 × 4, unchanged from H.264/AVC. In addition, to handle larger block sizes, the PU can reach up to 64 × 64 [3]. The number of prediction modes was increased to improve coding efficiency: there are at most 34 intra prediction modes. The angular modes in HEVC are more complex than the directional modes in H.264/AVC since more multiplications are required. The number of intra modes varies with the PU size, as shown in Table 1 [4].

Table 1 Number of supported intra modes for each PU size

To select the best intra prediction mode, the HEVC reference software computes for each mode its rate-distortion cost J_RDO, given by the following formula:

$$J_{\text{RDO}} = {\text{SSD}} + \lambda \times R$$
(1)

where SSD is the sum of squared differences, R is the total number of bits needed to encode the corresponding mode, and λ is a value calculated from several parameters such as the quantization parameter (QP), the slice type (I, P or B), and the configuration (random access, low delay or intra-only). The best mode is the one with the minimum rate-distortion cost J_RDO [5]. For the chrominance components, the five available modes are: intra planar, intra angular (dir = 26, vertical), intra angular (dir = 10, horizontal), intra DC, and intra derived, which means that the chrominance uses the same angular direction as its corresponding luminance [4].
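To make Eq. (1) concrete, the following C++ sketch (ours, not the HM code) selects the mode minimizing J_RDO. The lambda model shown is a commonly quoted approximation; the reference software additionally weights λ according to the slice type and hierarchy level, and the helper names are illustrative.

```cpp
#include <cstddef>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Rough lambda model often quoted for HM-style RDO; the reference software
// additionally weights lambda by slice type and hierarchy level.
double lambdaFromQP(int qp) {
    return 0.85 * std::pow(2.0, (qp - 12) / 3.0);
}

// Sum of squared differences between the original and a reconstructed block.
uint64_t computeSSD(const std::vector<int>& org, const std::vector<int>& rec) {
    uint64_t ssd = 0;
    for (std::size_t i = 0; i < org.size(); ++i) {
        const int64_t d = org[i] - rec[i];
        ssd += static_cast<uint64_t>(d * d);
    }
    return ssd;
}

struct ModeResult {
    int mode;                        // candidate mode index
    std::vector<int> reconstruction; // reconstructed samples for this mode
    int rateBits;                    // bits needed to signal mode and residual
};

// Pick the mode minimizing J = SSD + lambda * R, as in Eq. (1).
int selectBestMode(const std::vector<int>& org,
                   const std::vector<ModeResult>& candidates, int qp) {
    const double lambda = lambdaFromQP(qp);
    double bestCost = std::numeric_limits<double>::max();
    int bestMode = -1;
    for (const ModeResult& c : candidates) {
        const double j = static_cast<double>(computeSSD(org, c.reconstruction))
                       + lambda * c.rateBits;
        if (j < bestCost) { bestCost = j; bestMode = c.mode; }
    }
    return bestMode;
}
```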

Inter-picture encoding chooses a motion vector to predict the current block by exploiting the temporal redundancy present in the video signal. The encoding process for inter-picture prediction consists of forming, for each PU, motion data that include an index to a reference picture and a motion vector (MV) to be applied for predicting the samples of each block. The basic aim of motion compensation prediction (MCP) is to reduce the amount of information transmitted to the decoder. For each block in the frame, the encoder searches the reference frames to find its best matching block. The MV is coded by transmitting only an index into the candidate list and the difference vector [5]; the vector differences are much smaller in amplitude and can thus be coded more efficiently. Motion estimation (ME) is one of the key elements of a video coding standard: it achieves high coding performance by reducing temporal redundancy, and developing fast ME algorithms has been an essential and challenging issue. The simplest and most basic way to find the optimal position is the full search (FS) algorithm, which checks every possible point in the search range and selects the best one after computing the rate-distortion (RD) cost of each candidate. Although full search reaches the global minimum, its computational complexity is usually excessive. This part is detailed further in Sect. 5.

For both intra and inter-frame prediction, the residual signal, i.e., the difference between the source frame and the predicted one, is transformed to concentrate the signal energy. The purpose of the transformation is to convert the unit into the spatial frequency domain. Since HEVC is devoted mainly to large resolutions, TU sizes from 4 × 4 to 32 × 32 are used [6]. All transform coefficients in a TU are quantized uniformly depending on the QP value. HEVC uses the same uniform-reconstruction quantization (URQ) as H.264/AVC, with QP values ranging from 0 to 51. Quantization scaling matrices are also supported [7].
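As a reminder of how the QP maps to the URQ step size, the minimal sketch below uses the well-known relation that the step roughly doubles every 6 QP values; the reference software implements this with integer scaling tables and specific rounding offsets, which are omitted here.

```cpp
#include <cmath>

// Quantization step size roughly doubles every 6 QP values: Qstep ~ 2^((QP-4)/6).
double qStepFromQP(int qp) {
    return std::pow(2.0, (qp - 4) / 6.0);
}

// Schematic scalar quantizer/dequantizer for a single transform coefficient;
// the HM uses equivalent integer scaling tables and specific rounding offsets.
int quantize(double coeff, int qp) {
    const double qstep = qStepFromQP(qp);
    return static_cast<int>(coeff / qstep + (coeff >= 0 ? 0.5 : -0.5));
}

double dequantize(int level, int qp) {
    return level * qStepFromQP(qp);
}
```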

2.2 HM test reference analysis

The HM software is developed by the JCT-VC group as the common reference implementation of the HEVC encoder. Due to the complexity of HEVC, the HM is expected to be relatively slower than an AVC encoder, as it has extra modules such as the quad-tree structure, SAO, etc. Among the factors contributing to the slowness of the HM encoder is its heavy reliance on rate-distortion optimization. The JCT-VC common test conditions define different configurations for the encoder [8]. These configurations consist of:

  • All intra (AI), where all pictures are encoded using only I slices.

  • Random access (RA), where all the pictures are reordered in a pyramidal structure with a random access picture about every 1 s. This configuration may be used in a broadcasting environment.

  • Low delay (LD), where only the first frame is encoded using I slices and no picture reordering is used. This configuration is generally used in a videoconferencing environment.

This paper focuses on motion estimation algorithms and fast encoding methods that influence only the inter-mode decision. Thus, only the RA and LD configurations were considered.

The HM (HEVC test model) encoder has been profiled to determine which of the components is the most time-consuming. Figure 2 shows the time spent in various C++ modules in the HEVC encoder.

Fig. 2 Encoding time distribution

These results were obtained with the Intel VTune profiler while processing the “Vidyo1” sequence coded in RA mode with QP = 37. In this configuration, the inter-prediction module takes the lion's share of the encoding time, up to 40 %. The second most time-consuming module is RdCost, which includes heavy computation of the sum of absolute differences (SAD) and other distortion metrics and consumes about 33 % of the total encoding time. Transformation and quantization account for 9 %. On the entropy coding front, the time spent in core CABAC operations is relatively small, at 3 %. It is also interesting to note that a small amount of time, around 4 %, is spent setting and copying memory areas (the memcpy and memset system functions). Figure 2 shows that the two most time-consuming modules are the inter-prediction module, based on motion estimation, and the RD cost module, based on mode decision. That is why this work starts by exploring these two parts.

3 Related works

The complexity of HEVC is a critical problem, especially when targeting a real-time implementation. It is reported in [9] that the mode decision and motion estimation modules take the lion's share of the encoder profile. Therefore, a number of efforts have been made in the literature to explore algorithms based on fast decision and fast motion estimation. In [10], an algorithm called "Fast Coding unit Size algorithm", based on a Bayesian decision rule, saves an average of 41 % of the encoding time with a negligible RD performance loss of 1.88 %. In [11], the authors present the "Effective CU size decision" algorithm, based on skipping specific depths that are rarely used in the previous frame; it provides a time saving of up to 26 % with a negligible coding efficiency loss and a slight bit-rate increase. In [12], the recursive CU splitting process is terminated early according to an adaptive threshold on the mean square error (MSE). This work investigated the correlation between CU splitting and the MSE at the current CU level: when the MSE of the current CU is small, most CUs do not need to be split, and if the MSE of the current CU reaches the threshold MSE_th, the partitioning process can be terminated. It reduced the encoding time by 24 % with a quality decrease of at most 0.3 % in the worst cases. We also evaluated a recent work based on the contextual mode information of neighboring CUs to detect the merge skip mode; this proposal achieves an average time saving of 43.67 % in the random access configuration with the HEVC test model (HM) 10.0 reference software, with a small bit-rate loss of 0.55 % and no significant loss of image quality compared to the HM 10.0 encoder [13]. In addition to the algorithms quoted above, other algorithms are implemented in the reference software HM; they are detailed in the next section.

On the other hand, much research has concentrated on improving the motion estimation algorithm by optimizing the test zonal search (TZ search). In [14], an algorithm based on a hexagonal pattern was proposed; it contains different variants in which this pattern replaces the original one at the first step (initial grid) and at the refinement stage. The simulation results reveal that the proposed algorithms remove a major amount (around 40–80 %) of the motion estimation complexity compared to the TZ search algorithm implemented in the HEVC reference software. In [15], the authors also changed the search pattern from diamond to hexagonal, as in [14], and further improved the algorithm by modifying the search threshold in each grid of the search area. Results show that the overall encoding time can be reduced by almost 50 % compared to the TZ search algorithm, while maintaining almost the same PSNR and bit-rate.

Analyzing these previous works, we note that various mode decision algorithms have been proposed to speed up HEVC encoding, especially at the quad-tree structure level, while other algorithms decrease the number of search points to optimize the motion estimation process. This paper combines the two concepts to achieve an efficient saving in coding time.

4 Effective fast decision methods

The coding structure consists of a four-level hierarchy allowing partitioning into blocks of multiple sizes. The CU is defined as the basic unit and is always square. It essentially replaces the 16 × 16 macroblock of H.264/AVC, the principal difference being that a CU can have different sizes. The whole processing, including intra/inter-prediction, transformation, and entropy coding, is based on the CU [16]. A CU is identified by a depth p and a size 2N × 2N. It is split into several PUs when the split flag is set to zero; otherwise, it is divided into four smaller CUs with depth p + 1 and size N × N. Consequently, each CU of depth p is processed recursively.

If the depth of the divided CU reaches the smallest unit size of 8 × 8, the partitioning stops [17]. Once the CU partitioning process is launched, the prediction methods are specified. The basic unit for the prediction process is the PU. Note that the PU is defined for all CUs at each depth and that its maximum size is limited to that of its corresponding CU. Two attributes are specified for the prediction: the prediction mode and the size of the PU. As in previous standards, the prediction mode can be skip, intra or inter, and the possible PU sizes are defined according to the prediction mode [18]. Figure 3 shows the possible PU partitionings according to the prediction modes.

Fig. 3 Partitioning modes of prediction blocks

In addition to the CU and the PU, a TU is used for the transformation and quantization. This unit is defined separately. Note that the size of the TU can be larger than that of the PU, but it must not exceed the size of the CU. The TU size is controlled by a split flag: when it is set to 1, the TU is partitioned. Figure 4 illustrates the relation between the three units. For each CU, a PU is defined specifying the way in which the prediction is generated, and the TU size is determined by the CU size and the type of PU partitioning. This flexible and recursive processing provides several benefits. The first comes from supporting CU sizes greater than the conventional 16 × 16: when a region is homogeneous, larger CUs are coded, leading to a relatively smaller number of symbols than using several small blocks for the same area. In addition, by eliminating the distinction between macroblock and sub-macroblock and using only the coding unit, the multilevel quad-tree structure can be specified in a very simple way, which is especially useful for parallel processing.

Fig. 4 Relation between different units

Based on this recursive structure, the encoder needs to exhaust all the combinations of CU, PU and TU to find the optimal solution, i.e., the best compromise between rate and video quality.
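This exhaustive evaluation can be pictured as the following recursive sketch, where rdCost is an abstract callable standing for the full PU/TU mode evaluation at a given CU position and size; the cost of signaling the split flag itself is omitted for brevity, and the names are illustrative rather than taken from the HM code.

```cpp
#include <algorithm>
#include <functional>

// rdCost(x, y, size) is assumed to evaluate every PU/TU combination for the
// CU located at (x, y) with the given size and to return its best RD cost.
using RdCostFn = std::function<double(int, int, int)>;

// A CU of size 2Nx2N is either coded as a whole or split into four NxN CUs,
// whichever gives the lower total RD cost (split-flag signaling cost omitted).
double bestCuCost(int x, int y, int size, const RdCostFn& rdCost) {
    const double costNoSplit = rdCost(x, y, size);
    if (size <= 8) return costNoSplit;              // smallest CU: stop here
    const int half = size / 2;
    const double costSplit = bestCuCost(x,        y,        half, rdCost)
                           + bestCuCost(x + half, y,        half, rdCost)
                           + bestCuCost(x,        y + half, half, rdCost)
                           + bestCuCost(x + half, y + half, half, rdCost);
    return std::min(costNoSplit, costSplit);
}
```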

Figure 5 shows the conventional mode decision process for partitioning CUs. Each CU is encoded by determining its best mode: for the current CU, all the modes are tested one by one, from skip to intra N × N, by continuously calculating the RD cost of each candidate. This exhaustive computation takes enormous time and induces great complexity.

Fig. 5 The conventional mode decision process

It would be preferable if the encoder could identify the best mode without computing the RD cost for all the modes. To reduce the computational complexity of the encoder, several fast mode decision algorithms for partitioning have been adopted in the HEVC test model, such as ECU, ESD and CBF, which are detailed below.

4.1 Early CU termination

Early CU termination (ECU) [19] is an algorithm that operates on the transition from depth p to depth p + 1. The sub-tree computations can be skipped if the best prediction mode of the current CU is the skip mode, as Fig. 6 shows. The best mode is chosen by computing the RD cost; if the minimum RD cost corresponds to the skip mode, there is no need to continue the partitioning.

Fig. 6 Early CU algorithm
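A minimal sketch of the ECU condition is given below; the mode enumeration and function name are illustrative, not taken from the HM sources.

```cpp
// Illustrative mode enumeration; the actual HM check inspects the CU's
// prediction data after all modes of the current depth have been evaluated.
enum class Mode { Skip, Merge, Inter2Nx2N, Intra2Nx2N };

// ECU: stop descending the quad-tree when the current CU is best coded as skip.
bool shouldTestNextDepth(Mode bestModeOfCurrentCu, int depth, int maxDepth) {
    if (depth >= maxDepth) return false;                 // already at 8x8
    if (bestModeOfCurrentCu == Mode::Skip) return false; // ECU early termination
    return true;                                         // otherwise try depth p+1
}
```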

4.2 Early skip detection

Statistics show that skip is the most frequently chosen mode [20], which explains why a great improvement is obtained when the detection of the skip mode is anticipated. Choosing the skip mode is a very effective coding tool, as it represents a coded block without residual information.

The early skip detection (ESD) algorithm simply checks the differential motion vector (DMV) and the coded block flag (CBF), the two so-called "early skip conditions", after searching the best inter 2N × 2N mode, as shown in Fig. 7. For the current CU, the two inter 2N × 2N modes (AMVP and merge) are searched before checking the skip mode. After selecting the mode with the minimum RD cost, the DMV and CBF are checked: if the DMV of the best inter 2N × 2N mode is equal to (0, 0) and the CBF is equal to zero, the best mode of the current CU is set to skip and the remaining PU modes are not investigated [21].

Fig. 7 Early skip detection algorithm
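The two early skip conditions can be sketched as a single predicate; the structure and names below are illustrative.

```cpp
struct MotionVector { int x; int y; };

// ESD: after trying the two inter 2Nx2N modes (AMVP and merge), the CU is
// declared skip and the remaining PU modes are bypassed when the differential
// MV of the best inter 2Nx2N mode is (0, 0) and no residual is coded (CBF == 0).
bool earlySkipDetected(const MotionVector& dmvOfBestInter2Nx2N,
                       bool cbfOfBestInter2Nx2N) {
    return dmvOfBestInter2Nx2N.x == 0 && dmvOfBestInter2Nx2N.y == 0
        && !cbfOfBestInter2Nx2N;
}
```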

4.3 Coded block flag algorithm

The coded block flag (CBF) algorithm allows early selection of the best prediction mode [22]. The RD cost is computed for each PU mode of a CU, and then the coded block flag is checked. If the CBF is zero (meaning all transform coefficients are zero) after encoding a PU for the luminance (Y) and the two chrominance (U, V) components, the encoding of the remaining PU modes of the CU is terminated. The algorithm is described in Fig. 8.

Fig. 8 Coded block flag algorithm
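The CBF-based early termination can be sketched as follows, with the per-mode trial results abstracted as precomputed entries tested in the encoder's usual order.

```cpp
#include <limits>
#include <vector>

// One candidate PU mode together with the coded block flags of its components.
struct PuModeTrial {
    double rdCost;
    bool cbfY, cbfU, cbfV;  // true if the component has non-zero coefficients
};

// PU modes are examined in the encoder's usual order; as soon as one mode
// leaves no coefficients in Y, U and V, the remaining modes are skipped.
double evaluatePuModesWithCbfStop(const std::vector<PuModeTrial>& modesInTestOrder) {
    double best = std::numeric_limits<double>::max();
    for (const PuModeTrial& m : modesInTestOrder) {
        if (m.rdCost < best) best = m.rdCost;
        if (!m.cbfY && !m.cbfU && !m.cbfV) break;  // CBF == 0: early termination
    }
    return best;
}
```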

4.4 Proposed algorithm for fast partitioning

All the methods detailed above are already implemented in the HM encoder, but they are disabled in the HEVC common coding conditions.

Our proposed algorithm takes advantage of the three algorithms (ESD, ECU and CBF) and is illustrated in Fig. 9.

Fig. 9 Our proposed fast algorithm for partitioning

The inter 2N × 2N mode is tested first to take advantage of the ESD strategy. Then, the ESD and CBF conditions are checked: if they are satisfied, the evaluation of the other modes stops; otherwise, the conventional mode decision process is applied. Once all modes have been checked for a CU, the ECU test is applied by verifying whether the best mode is skip in order to decide whether to continue partitioning. A sketch of this decision flow is given below.
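In this compact sketch, the trial-encoding steps are abstracted behind callables and all names are illustrative rather than taken from the HM code.

```cpp
#include <algorithm>
#include <functional>

// Result of the inter 2Nx2N trial (AMVP and merge).
struct Inter2Nx2NTrial { double rdCost; bool dmvIsZero; bool cbfIsZero; };

// Returns true if the CU should also be evaluated at depth p+1.
bool decideSplit(int depth, int maxDepth,
                 const std::function<Inter2Nx2NTrial()>& tryInter2Nx2N,
                 const std::function<double()>& tryRemainingModes,
                 const std::function<bool(double)>& bestModeIsSkip) {
    const Inter2Nx2NTrial inter = tryInter2Nx2N();       // tested first (ESD order)
    double bestCost = inter.rdCost;
    // ESD + CBF conditions: zero differential MV and no coded residual.
    const bool earlyStop = inter.dmvIsZero && inter.cbfIsZero;
    if (!earlyStop) {
        bestCost = std::min(bestCost, tryRemainingModes()); // conventional process
    }
    // ECU: descend to depth p+1 only if the CU is not finally coded as skip.
    return !earlyStop && !bestModeIsSkip(bestCost) && depth < maxDepth;
}
```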

5 Motion estimation process

As the profiling results show, inter-prediction accounts for a considerable part of the encoding time, as does the mode decision. Inter-prediction is conceptually simple in HEVC, but comes with some overhead compared to the previous standard H.264/AVC. Motion estimation is an important stage in video coding because it exploits the temporal redundancies present within a sequence. Its main purpose is to obtain the MV, which requires significant computing power. Implementing fast algorithms for motion estimation can reduce this computational complexity, and thus the encoding time, while preserving the same quality and bit-rate. Block matching methods were developed specifically in the framework of video coding [23], and all standards to date, including HEVC, are based on this paradigm. The HEVC reference software adopts two algorithms, described below.

5.1 Full search algorithm

This algorithm is the most intuitive choice: it processes all the positions in the search window of the reference image to find the motion vector with the minimum SAD. When the FS algorithm is used as the main tool for inter-prediction, motion estimation takes more than 96 % of the required encoding time [24]. Other techniques are able to find a motion vector with significantly less complexity, but they are suboptimal: the chosen vector is not always the best one. TZ search is such an algorithm; it is implemented in the HM test model and reduces the computational complexity compared to the full search algorithm, increasing the speed by a factor of 23 compared to FS with small quality losses.
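The following sketch shows a schematic integer-pel full search minimizing the SAD over a square window; the HM search additionally accounts for the motion vector signaling cost, fractional-pel refinement and multiple reference pictures.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

struct MV { int x, y; };

// SAD between the current block at (bx, by) and the reference block displaced
// by (dx, dy); both frames are stored row-major with the given width.
uint32_t sad(const std::vector<uint8_t>& cur, const std::vector<uint8_t>& ref,
             int width, int bx, int by, int blockSize, int dx, int dy) {
    uint32_t acc = 0;
    for (int y = 0; y < blockSize; ++y)
        for (int x = 0; x < blockSize; ++x)
            acc += std::abs(int(cur[(by + y) * width + bx + x]) -
                            int(ref[(by + y + dy) * width + bx + x + dx]));
    return acc;
}

// Exhaustive search over [-range, +range] in both directions (integer pel).
MV fullSearch(const std::vector<uint8_t>& cur, const std::vector<uint8_t>& ref,
              int width, int height, int bx, int by, int blockSize, int range) {
    MV best{0, 0};
    uint32_t bestSad = UINT32_MAX;
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            // discard candidates falling outside the reference picture
            if (bx + dx < 0 || by + dy < 0 ||
                bx + dx + blockSize > width || by + dy + blockSize > height) continue;
            const uint32_t s = sad(cur, ref, width, bx, by, blockSize, dx, dy);
            if (s < bestSad) { bestSad = s; best = {dx, dy}; }
        }
    }
    return best;
}
```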

5.2 TZ search algorithm

The TZS algorithm is a combination of four stages as shown in Fig. 10: motion vector prediction, zonal search, raster search and refinement stage.

Fig. 10 TZ search algorithm diagram

The complete algorithm can be divided into four steps, described in the following sub-sections:

5.2.1 MVP: motion vector prediction

The TZS algorithm employs the median predictor obtained from the left, up, and upper-right predictors, as Fig. 11 shows. The predictor with the minimum cost is selected as the starting point for the next search steps.

$$\text{Median}(A, B, C) = A + B + C - \text{Min}(A, \text{Min}(B, C)) - \text{Max}(A, \text{Max}(B, C))$$
Fig. 11 Motion vector prediction
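The median predictor of the formula above can be computed component-wise, as in this short sketch:

```cpp
#include <algorithm>

struct MV { int x, y; };

// Median of three values, written exactly as in the formula above.
int median3(int a, int b, int c) {
    return a + b + c - std::min(a, std::min(b, c)) - std::max(a, std::max(b, c));
}

// Component-wise median of the left (A), up (B) and upper-right (C) predictors.
MV medianPredictor(const MV& A, const MV& B, const MV& C) {
    return { median3(A.x, B.x, C.x), median3(A.y, B.y, C.y) };
}
```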

5.2.2 Initial grid search

At this step, a diamond or a square pattern is used to explore the search window, with the distance (dist) varying in powers of 2 from 1 up to the search range, which defines the maximum extent of the search window. Whether the diamond or the square pattern is used depends on the configuration.

A sample grid with a search range of 8 (maximum distance = 8) is shown in Fig. 12 for the diamond (a) and the square (b) patterns. All the points of the pattern are tested, and the point with the minimum SAD is chosen as the search center for the further steps. The distance at which this best point was found is stored in the "Bestdist" variable, which is evaluated in the following steps.

Fig. 12 Search patterns for motion estimation
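The expanding grid can be sketched as below; for brevity only the four axis-aligned points of the diamond are tested at each distance, whereas the HM pattern contains additional points, and the cost function is abstracted behind a callable.

```cpp
#include <cstdint>
#include <functional>

struct MV { int x, y; };

// costOf(mv) is assumed to return the SAD (or RD cost) of a candidate MV.
using CostFn = std::function<uint32_t(const MV&)>;

struct GridResult { MV best; uint32_t bestCost; int bestDist; };

// Expand the pattern around the start point with the distance doubling from 1
// up to the search range, keeping the best point and the distance at which it
// was found ("Bestdist").
GridResult initialGridSearch(const MV& start, int searchRange, const CostFn& costOf) {
    GridResult r{start, costOf(start), 0};
    for (int dist = 1; dist <= searchRange; dist *= 2) {
        const MV candidates[4] = {
            {start.x + dist, start.y}, {start.x - dist, start.y},
            {start.x, start.y + dist}, {start.x, start.y - dist}};
        for (const MV& c : candidates) {
            const uint32_t cost = costOf(c);
            if (cost < r.bestCost) { r.bestCost = cost; r.best = c; r.bestDist = dist; }
        }
    }
    return r;
}
```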

5.2.3 Raster search

The raster search is a simple full search on a down-sampled version of the search window; Fig. 13 shows a raster search with "iRaster" = 5. The predefined value "iRaster" of the raster scan can be set in the configuration file [25]. The raster search is performed only if "Bestdist" (obtained from the previous step) is greater than "iRaster"; in that case, "Bestdist" is reset to the "iRaster" value. Otherwise, no raster search is conducted.

Fig. 13 Raster search for iRaster = 5
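A sketch of the raster step, again with the cost function abstracted; coordinates are expressed relative to the search center.

```cpp
#include <cstdint>
#include <functional>

struct MV { int x, y; };
using CostFn = std::function<uint32_t(const MV&)>;

// Raster step: triggered only when the best grid distance ("Bestdist") exceeds
// iRaster; it scans the window on a grid subsampled by iRaster and then resets
// Bestdist to iRaster.
void rasterSearch(int searchRange, int iRaster, int& bestDist,
                  MV& best, uint32_t& bestCost, const CostFn& costOf) {
    if (bestDist <= iRaster) return;                 // condition not met: skip
    for (int y = -searchRange; y <= searchRange; y += iRaster) {
        for (int x = -searchRange; x <= searchRange; x += iRaster) {
            const uint32_t cost = costOf({x, y});
            if (cost < bestCost) { bestCost = cost; best = {x, y}; }
        }
    }
    bestDist = iRaster;
}
```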

5.2.4 Raster/star refinement

This final step refines the motion vector obtained from the previous steps. The refinement can be raster or star, and only one of the two methods is enabled; either the square or the diamond pattern is used, as fixed in the TZ configuration. The difference between the two refinement types is that the raster refinement down-scales the Bestdist obtained in the previous step, whereas the star refinement up-scales it. The down-scaling continues until Bestdist equals 1, and the up-scaling stops when Bestdist reaches the search range value.

5.3 Proposed fast algorithms for motion estimation

Our proposal for fast motion estimation consists of improving the TZ search algorithm by modifying the search pattern. In these algorithms, the first steps, MVP and the initial grid search, are kept. The change starts from the raster search step: since a best point has already been chosen by the initial grid search, there is no need for an additional search when a refinement stage follows. Consequently, the refinement stage is conducted directly.

This refinement step is based on one of two algorithms, namely the small diamond search pattern (SDSP) [26] or the horizontal diamond search (HDS) [27], instead of the simple diamond or square pattern search.

5.3.1 Small diamond search pattern

The small diamond search pattern algorithm [26] operates on the central point and its four neighbors. The search loops over the small diamond pattern until the best matched point corresponds to the center of the diamond. Figure 14 shows the SDSP pattern and the final motion vector chosen. The main benefit of this algorithm is speed, as SDSP reduces the number of search points.

Fig. 14 Small diamond search pattern
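A sketch of the SDSP refinement loop (search-window boundary checks are omitted for brevity):

```cpp
#include <cstdint>
#include <functional>

struct MV { int x, y; };
using CostFn = std::function<uint32_t(const MV&)>;

// Test the center and its four direct neighbours, move the center to the best
// point, and stop when the center itself wins.
MV smallDiamondSearch(MV center, const CostFn& costOf) {
    uint32_t bestCost = costOf(center);
    while (true) {
        MV best = center;
        const MV neighbours[4] = {{center.x + 1, center.y}, {center.x - 1, center.y},
                                  {center.x, center.y + 1}, {center.x, center.y - 1}};
        for (const MV& n : neighbours) {
            const uint32_t cost = costOf(n);
            if (cost < bestCost) { bestCost = cost; best = n; }
        }
        if (best.x == center.x && best.y == center.y) return center;  // converged
        center = best;
    }
}
```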

5.3.2 Horizontal diamond search

For most videos, we observe that objects move in a translational manner and motion is more likely to be in the horizontal direction than the vertical one [28].

Figure 15 illustrates the HDS algorithm, which consists of a small diamond with two extra points added in the horizontal direction. To find the best point, the pattern is repeated until the best match corresponds to the center point or the limits of the search window are reached.

Fig. 15 Horizontal diamond search pattern
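The HDS loop differs from the SDSP sketch above only in its candidate set; the two extra horizontal points are placed at a distance of 2 here for illustration, following the pattern of Fig. 15.

```cpp
#include <cstdint>
#include <functional>

struct MV { int x, y; };
using CostFn = std::function<uint32_t(const MV&)>;

// Same loop as the SDSP, with two extra horizontal candidates; a complete
// implementation would also stop at the search window boundary.
MV horizontalDiamondSearch(MV center, const CostFn& costOf) {
    uint32_t bestCost = costOf(center);
    while (true) {
        MV best = center;
        const MV neighbours[6] = {
            {center.x + 1, center.y}, {center.x - 1, center.y},
            {center.x, center.y + 1}, {center.x, center.y - 1},
            {center.x + 2, center.y}, {center.x - 2, center.y}};  // horizontal extras
        for (const MV& n : neighbours) {
            const uint32_t cost = costOf(n);
            if (cost < bestCost) { bestCost = cost; best = n; }
        }
        if (best.x == center.x && best.y == center.y) return center;
        center = best;
    }
}
```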

By adopting one of the two proposed search patterns, we compensate for omitting the global-minimum (raster) search step and save computation: the time spent in this stage can be reduced by a factor of up to three.

6 Experimental results

6.1 Experimental conditions

To evaluate the performance of the proposed approaches, the different algorithms were implemented in the most recent available HEVC reference software (HM 10.0) [29]. The evaluation was done for the fast mode decision algorithm, for the two proposed motion estimation algorithms, and finally for the combination of the two approaches. All implementations were compared to the original algorithm in terms of PSNR, bit-rate and encoding time speedup. The experimental conditions follow the HEVC common test conditions [8]. The largest coding unit (LCU) is fixed to 64 × 64 with a maximum depth level of 4, resulting in a minimum coding unit size of 8 × 8. The maximum search range is set to 64 and the entropy coder used is CABAC. One hundred frames of each sequence are encoded.

All the simulations were carried out on a Windows 7 platform with an Intel® Core™ i7-3770 @ 3.4 GHz CPU and 12 GB RAM. The proposed algorithms were evaluated with QPs 22, 27, 32 and 37 using all the sequences recommended by JCT-VC for five resolutions [30] (416 × 240, 832 × 480, 1,280 × 720, 1,920 × 1,080 and 2,560 × 1,600) in the random access (RA) configuration.

6.2 Evaluation criteria

The evaluation was done through the criteria given in Table 2; their usual form is recalled below.

Table 2 Evaluation criteria

where:

  • PSNR_p, BP_p and T_p represent, respectively, the PSNR index, the bit-rate and the encoding time of the proposed algorithm.

  • PSNR_o, BP_o and T_o represent, respectively, the PSNR index, the bit-rate and the encoding time of the original algorithm.
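Assuming the usual definitions behind Table 2 (the exact sign conventions are those of the original table; here a positive ΔT denotes a time saving), these criteria take the form:

$$\Delta \text{PSNR} = \text{PSNR}_p - \text{PSNR}_o, \quad \Delta \text{BR} = \frac{\text{BP}_p - \text{BP}_o}{\text{BP}_o} \times 100\,\%, \quad \Delta T = \frac{T_o - T_p}{T_o} \times 100\,\%$$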

6.3 Results

Table 3 evaluates the performance of the two proposed ME algorithms, HDS and SDSP. The two approaches perform quite similarly. For SDSP, the time saving is 5 % on average while the quality is maintained for most videos, with a slight bit-rate degradation of 0.5 %. In the worst case, the time saving is only 0.1 %, for the BQSquare video, which is characterized by slow motion: the camera moves slowly from the left towards the upper right, the background has low motion, and the people in the scene move in predictable directions at low speed.

Table 3 Results of the proposed SDSP and HDS versus the original algorithm

As this algorithm targets the motion estimation complexity, the improvement for videos with low motion activity is limited, contrary to high-activity sequences such as PeopleOnStreet and BasketballDrive, where the quality degradation is imperceptible and the effectiveness of the algorithm is visible, with encoding time reductions of 11 and 13 %, respectively. BasketballDrive is a sequence captured during a basketball game; it contains high motion activity and high contrast, the background (floor and wall) has a rather uniform texture, and the players move in random directions. The proposed algorithm takes advantage of this type of video and shows significant time savings. Quality and bit-rate evolve together: a degradation in quality is accompanied by an increase in bit-rate.

Concerning HDS, this algorithm reduces the execution time by an average of 5.7 %, and the results show that most videos maintain the same quality. For bit-rate, the worst case is the RaceHorses video. RaceHorses records a horse race; it is a dynamic, motion-filled video with many high-frequency details, and the horses' tails are particularly costly to encode.

Table 4 evaluates the performance of the fast partitioning algorithm that combines the three fast coding approaches ECU, ESD, and CBF. As shown, the time savings are significant, reaching up to 70 % for some videos. On average, not only was the encoding time reduced by 40 %, but the bit-rate also decreased by 1 %. Concerning the quality, the degradation varies from 0.07 to 0.16 dB. The results in the two tables summarize the performance of each algorithm on its own.

Table 4 Results of the fast decision algorithm (ECU, ESD, CBF) versus the original algorithm

Table 5 shows the results of combining the two proposed approaches: the fast motion estimation algorithm with the fast mode decision algorithm. The two evaluated combinations are HDS fast and SDSP fast; they differ only in the motion estimation part. The results are as expected, with important time savings for the Vidyo1, Traffic, and BQSquare sequences, varying from 53 to 78 %; this saving is explained by the slow motion in these videos. As for the quality of the resulting sequences, a slight degradation is expected since some modes are omitted. The bit-rate was reduced by 1 % for some videos, except for a few sequences such as BasketBallDrive and RaceHorses, which, as indicated previously, are motion-filled and contain a lot of texture; for these, a slight bit-rate increase was observed. Analyzing the different results also shows that the gain of the final algorithm is roughly the sum of the gains of the two algorithms that constitute our proposal (fast mode decision and fast motion estimation). For the PeopleOnStreet sequence, for instance, ΔT_Fast_SDSP (%) ≈ ΔT_Fast (%) + ΔT_SDSP (%).

Table 5 Results of the combined algorithms SDSP fast and HDS fast versus the original algorithm

Figure 16 evaluates the quality degradation. The curves are plotted for the class A videos, which have the highest resolution, and the four points correspond to the four QPs 22, 27, 32, and 37. For the HDS and SDSP algorithms, the degradation is not visible; for QP equal to 32 and 37, ΔPSNR is zero, meaning there is no noticeable quality degradation. For the combined algorithms, the PSNR degradation varies from −0.35 to −0.05 dB. Figure 16 shows that the degradation is larger for lower QP values, whereas for higher QPs (32 or 37) the curves are close to each other.

Fig. 16 RD curves for sequences (2,560 × 1,600) coded in random access with QP = (22, 27, 32, 37)

Figure 17 evaluates the time saving as the QP varies. The time saving increases with the QP value: for QP = 37, the encoding time is reduced by 50 % for the Traffic video and 60 % for the PeopleOnStreet video. The speedup is larger for higher QPs because the skip mode is more likely to be chosen at high QP values [20].

Fig. 17 Time saving curves for sequences (2,560 × 1,600) coded in random access with QP = (22, 27, 32, 37)

The comparison was also done using the Bjontegaard criteria, which compute numerical averages between RD curves [31] and allow a more precise evaluation of the proposed algorithms. Table 6 summarizes the results for the Bjontegaard average PSNR (BDPSNR) and the Bjontegaard average bit-rate (BDBR). The proposed algorithms show a bit-rate saving of 2 % in BDBR and a loss of 0.1 dB in quality. The best cases are videos with low motion such as Vidyo1 and Traffic, where the fast decision and early termination algorithms lead to a BDBR decrease reaching 3 %. However, for high-activity videos such as BasketBallDrive and RaceHorses, the loss is relatively large, with 2 and 4 % in BDBR, respectively.

Table 6 Bjontegaard results of the proposed SDSP fast and HDS fast versus the original HM10.0

After implementing these algorithms, the new software was profiled to evaluate the impact of our work; the results are shown in Table 7. The effect of our contribution is clear, especially in the inter-prediction module and the RD cost module. This improvement is accompanied by a small increase in the relative share of the transformation module.

Table 7 Time distribution for the different algorithms

Table 8 summarizes the performance of the different algorithms. The proposal of Liquan [11] was based on skipping specific depths identified from the analysis of previous frames; it provided an average time saving of 29 % with a negligible quality loss and a 1 % bit-rate increase. The algorithm of Qin [12] relied on early termination according to an adaptive MSE threshold and saves time while maintaining quality. Another interesting algorithm was proposed by Park [13], based on detecting the merge skip mode from the mode information of neighboring CUs; it saves 47 % of the encoding time with a small decrease in quality and a small increase in bit-rate. Our proposed algorithms are more efficient, offering larger time savings together with a decrease in bit-rate.

Table 8 Summary of performances of comparing proposed algorithms to the original algorithm

7 Conclusion and perspectives

HEVC is a new standard that provides significantly increased coding efficiency compared to previous standards. To reach this higher coding efficiency, a robust mode decision process is required, and reducing the number of search points in motion estimation provides further time savings by reducing the RD cost computation. This paper presented two fast algorithms based on optimizing the motion estimation process, combined with a fast mode decision algorithm for CU partitioning. Combining both propositions achieves an important time saving, reaching up to 78 % and averaging 52 %, with a negligible degradation in video quality and a decrease in bit-rate.

As perspectives, we intend to further reduce the time spent in mode decision and partitioning by setting empirical threshold values for the stop criteria based on QP values, and to implement the optimized encoder on a digital platform suited to video processing.