1 Introduction

The high demand for video communication leads to the provision of highly packetized video data [32]. Video communication has become increasingly popular in recent years due to advancements in technology and the proliferation of high-speed internet connections. As a result, there has been a surge in the demand for video content, which has led to the development of highly packetized video data. Packetization is the process of dividing data into smaller units, or packets, that can be transmitted over a network. The packetization of video data offers several benefits, such as improved network efficiency, reduced latency, and increased reliability. Moreover, packetization allows for the transmission of video data over a wide range of networks, including wireless and mobile networks. The recent advancements in video codecs, such as the H.265/HEVC, have further improved the packetization of video data. These codecs use advanced compression techniques to reduce the size of video data, resulting in higher compression ratios compared to older codecs. This makes them highly suitable for a wide range of networks, including those with limited bandwidth. The use of highly packetized video data in conjunction with advanced video codecs has led to significant improvements in video quality, reduced buffering times, and increased network efficiency. As a result, these technologies are becoming increasingly important for video communication applications, such as video conferencing, live streaming, and video-on-demand services [20].

However, the high compression used in video codecs, particularly in the H.265/HEVC coding standard, makes them vulnerable to channel errors [16]. These errors can significantly degrade the video quality, especially in inter-coded consecutive frames [15]. The loss of critical information in these frames can result in motion discontinuities, blockiness, and even complete frame freezing, severely impairing the overall viewing experience. Therefore, effective algorithms are essential to mitigate the impact of channel errors and restore the video quality by accurately reconstructing the lost or corrupted information [24].

To address this problem, efforts are being made to develop error-resilient and concealment methods. The former aims to strengthen the integrity of the video data, while the latter attempts to hide the erroneously decoded frames [1]. Furthermore, Forward Error Correction (FEC) and Automatic Repeat Request (ARQ) are introduced in the literature. However, FEC increases video bandwidth, and ARQ is not suitable for interactive applications [11, 35, 38]. The Error Concealment (EC) methods, on the other hand, are post-processing techniques that try to recover a corrupted parts of a frame from the correctly decoded spatial and temporal information. These methods are categorized into spatial, temporal, and hybrid classes. Spatial Error Concealment (SEC) algorithms exploit the pixel information available in the current video frame. Also, Temporal Error Concealment (TEC) techniques benefit from motion data, consisting of motion vectors (MVs) for the nearest correctly received motion data. Moreover, hybrid EC methods use temporal and spatial correlated information to recover erroneous parts of the video frame [17, 33]. The EC algorithms in H.264/AVC aim to conceal damaged Macroblocks (MBs) within each erroneous slice. However, for H.265/HEVC, the large damaged region of a frame makes the EC techniques used in H.264/AVC less effective and causes several significant issues [17]:

  • Traditional EC methods rely only on neighboring MVs, which can produce inaccurate results when a video frame exhibits non-uniform motion.

  • In H.265/HEVC, there is no Flexible Macroblock Ordering (FMO) slicing available; therefore, the EC method cannot rely only on adjacent motion data. Moreover, the damaged region is far from the correctly received spatially neighboring blocks. Thus, there are insufficient available MVs for the concealment procedure. The reason is that the Coding Tree Unit (CTU) size can be as large as \(64 \times 64\) pixels in H.265/HEVC. To account for transmission errors, an integer number of CTUs are grouped into one slice, and an integer number of slices are encapsulated into a single transmission packet. Consequently, a lost packet in an H.265/HEVC bitstream affects several CTUs, resulting in a significant amount of frame area being lost. Therefore, all the information associated with these CTUs will be corrupted. Also, in this case, SEC algorithms may not be effective in recovering the lost data since the pixels are too far apart from each other to be useful. Therefore, alternative techniques such as TEC methods and inter-frame prediction are commonly used to mitigate the effects of packet loss on video quality [17, 18].

  • In H.265/HEVC, the traditional SEC techniques are not applicable in real scenarios since the errors damage multiple consecutive CTUs, which are wide regions of a frame [16].

  • While various EC algorithms have been proposed in the literature that rely on the side information, they are often unsuitable for real-world scenarios because they are non-standard and require modifications to the codec. Consequently, these algorithms have limited practical applications [17].

This paper introduces a novel TEC method specifically designed for estimating lost blocks in H.265/HEVC encoded video data. Our proposed block-based approach aims to strike a balance between computational complexity and video quality preservation. To achieve this, we leverage a PCA-based hierarchical clustering algorithm to identify highly correlated MVs and improve the accuracy of MV candidates. The proposed method considers each mutually orthogonal unit eigenvector as an axis of an ellipsoid. In other words, it develops a classification scheme that categorizes correlated temporally neighboring MVs of the erroneous regions of the frame. The proposed approach can also discover essential features of the most correlated motion information of the frame’s faulty parts and reveal previously unsuspected relationships. Additionally, the overall area of the ellipsoid can be interpreted as a parameter of computational complexity, i.e., there is a positive relationship between the MVs’ nonuniformity and the area of the ellipsoid.

The proposed method offers several advantages over the previous algorithms. Our method is content adaptive, which means it can adapt to the characteristics of the video being processed, resulting in more accurate and visually pleasing results. In addition, our method has lower complexity than the previous methods, which can lead to faster processing times and lower computational requirements. Another significant advantage of our method is its applicability to the H.265/HEVC video compression standard, which is becoming increasingly popular. Unlike many previous methods, our method can generate new MVs, which can be especially useful in cases where there are insufficient MVs available. This can improve the accuracy of the reconstructed frames and reduce the visual artifacts. We evaluated the performance of our method using several benchmark video sequences, and compared the results with several previous methods. The experiments show that our method outperforms compared methods in terms of visual quality and computational efficiency. Overall, our proposed method for video error concealment offers a comprehensive set of advantages that make it a valuable tool for a wide range of applications in video processing and compression. The major contribution of this paper are given as follows:

  • Our proposed method makes a significant contribution by utilizing the directional correlation among an object’s MVs in a video scene, interpreting them as elongated clusters within a frame’s region.

  • Our paper introduces a novel approach by utilizing Principal Component Analysis (PCA) for modeling ellipsoidal motion data, which enables us to limit candidate regions in directions with lower diversification, thereby enhancing the algorithm’s efficiency.

  • In our paper, we employ a divisive hierarchical clustering technique to classify MVs, leveraging the Sum of Square Error (SSE) to minimize Euclidean distance between clusters and decrease the order of complexity, thereby improving the method’s performance.

  • Our proposed method presents a comprehensive methodology for generating new MV candidates using the axis-aligned ellipse, where the magnitude and angle of the ellipse are determined by the variance and covariance of the MVs in each cluster.

  • The results and discussion in our paper highlight the exceptional efficiency and performance of our proposed scheme, demonstrating the significant contribution of our work.

This paper is organized as follows. The state-of-the-art EC algorithms are reviewed in Section 2. Section 3 summarizes the problem statement, discussing the MV modeling and the divisive hierarchical clustering algorithm. Section 4 describes the proposed algorithm. Detailed experimental results and computational complexity analysis are presented in Section 5. Finally, Section 6 concludes the paper with final remarks.

Table 1 summarizes the main symbols utilized in our proposed method.

Table 1 The description of key symbols in our proposed method

2 Related work

TEC algorithms mostly use the available motion data surrounding the damaged area. A novel nonlinear and non-Gaussian motion estimation technique is proposed in [28]. MV recovery should also consider MVs, which are strongly correlated with the lost MV, as presented in [14]. The authors find the tendency of the MV in each \(4 \times 4\) block by processing the co-located MV in the previous frame with successfully received blocks. Moreover, the method provides information about the closest MV to the motion information of the lost block. Although this method presents an appropriate strategy for motion recovery, it only exploits the upper blocks for tendency estimation, which may not be valid in all scenarios. Furthermore, the recovered MV for the more inner blocks is not sufficiently trustable for large damaged areas, and the result may not be accurate.

A trained PCA is used to conceal damaged parts of a frame using information from the previous frames’ buffer [5]. The authors proposed a novel method for detecting scene changes using dynamic thresholding and a similarity metric with a novel scene change detection algorithm. Additionally, an updated PCA model is presented by combining scene change features with index transformation buffers, and missing images are concealed by Projection Onto the Convex Set (POCS). Due to scene change, all previous frames may not be suitable and should be discarded from the buffer. Furthermore, the success of this method depends on the correlation of the lost area with the surrounding pixels, which is not the case for H.265/HEVC since the available correct pixel data is far from the lossy blocks.

Data hiding is a technique that can embed auxiliary or watermark data into digital video, which has a high degree of data redundancy. This can enable video error concealment by recovering the lost parts of the video frames from the embedded data. For example, the authors in [4] present a scheme that uses compressed sensing to hide auxiliary data into video frames over lossy channels. Compressed sensing is a method that can reconstruct sparse signals from fewer measurements than the conventional methods. Data hiding can also enhance image authentication by embedding watermark data [10].

Li et al. model neighboring MVs of the lost block by plane fitting, which shows how they change in small regions of the damaged frame [21]. However, this approach is not meaningful and needs further modifications for heavily corrupted video frames. Similarly, Lin et al. use the nearest available MVs to predict the missing ones. They propose a novel TEC method to estimate MVs’ weighting using disparities among MVs of available neighboring blocks [23]. This method does not work well for large, lossy regions where the neighboring MVs are missing for inner blocks. In [13], the authors propose a shape-preservation technique that uses different EC strategies depending on the position of the corrupted blocks. However, this method is limited by the difficulty of measuring the object’s boundaries.

In [37], the authors propose a novel weighted boundary matching EC scheme for H.265/HEVC based on the Coding Unit (CU) depths and Prediction Unit (PU) partitions in reference frames. The proposed method uses the information of CU depths in reference frames for lost slices, and for each Largest Coding Units (LCU) in a lost slice. The LCUs surrounding the co-located LCU are used to calculate the summed CU-depth weight to determine the conceal order of each CU. The co-located partition decision from the reference frame is adopted for PUs in each lost CU, and the sequence of PUs to conceal is sorted based on the texture randomness index weight. Finally, the best estimated MV for the lost PU is selected for concealment. Although experimental results show that the proposed method achieves higher PSNR gains compared to the other methods, this approach may face challenges when dealing with videos that have high motion. In such scenarios, the co-located CUs and PUs in reference frames may not provide accurate or reliable motion information, particularly when the MVs are large. As a result, the proposed method cannot determine the conceal order of each CU since the neighboring LCUs may not provide sufficient motion information to accurately estimate the CU depths. In addition, the use of MV prediction for lost PUs may result in suboptimal concealment when the MVs are unreliable or ambiguous due to high motion.

In [22], the authors propose an EC algorithm that improves upon a commonly used algorithm that extrapolates MVs from the previous frame. The proposed method considers partition decision information from the previous frame to improve the concealment process. The results of the study demonstrate that the proposed method performs better than the conventional algorithm. However, the proposed method relies on the partition decision information from the previous frame, which may not be accurate or reliable in scenarios with fast object movement, as the object boundaries and sizes may change rapidly. Moreover, it is worth noting that the proposed method may also require additional computational resources to account for the partition decision information from the previous frame, which may impact its real-time performance in certain scenarios. In [3], the paper proposes a motion-compensated EC method for H.265/HEVC. The proposed method refines the MV from the co-located block for motion compensation and merges blocks based on the reliability of these MVs. New MVs are then assigned to the merged blocks. The experimental results demonstrate a significant gain in PSNR and an improvement in visual quality compared to the conventional EC methods. However, any motion-compensated EC method may face challenges in scenarios where the lost data is significant or the MVs are unreliable, which can result in suboptimal concealment performance. Additionally, the proposed method relies on the reliability of the MVs from the co-located block, which may not be accurate or reliable in scenarios when there are fast object movements.

In [36], a trained Generative Adversarial Network (GAN) is exploited for EC. This method trains completion and discriminator networks with the temporally and spatially surrounding pixels as inputs. Thus the generator is used for loss recovery. However, training the GAN network is very time-consuming. It is not applicable since the loss area should be known apriori, which is impossible for most EC scenarios. In [30], the authors propose a convolutional Long Short-Term Memory (LSTM) layer and a simple convolutional layer for optical flow estimation. There are two rows of MBs above and below each slice for three frames which are exploited for recovery purposes. Furthermore, only using optical flow and spatial information cannot be sufficient. Training the network for potential loss locations in real scenarios is necessary, but it limits its practical application. In addition, LSTM has the ability to perform intelligent feature extraction from the data packets. For instance, in [34], the authors demonstrates a deep learning approach for feature extraction from the captured data that can handle limited sample sizes effectively.

In [6], the authors present a homography-based registration approach for the pixels surrounding the corrupted region and their corresponding pixels in the reference frame. This method aims to identify the registration with the least distortion regarding the whole frame and the area around the lossy region. Further, this technique is performed forward and backward, with the remaining points being filled by SEC. It enhances the registration quality when the correlation is sufficient for determining the registration points. However, finding good matching points is impossible for all contents, especially for H.265/HEVC, since the lossy areas are large.

There is a parallelogram partitioning approach for H.265/HEVC in [12]. Corrupted areas are divided into smaller parallelograms, and MV recovery and filling are applied. Moreover, it is used for the various sizes and angles of parallelograms. Despite this algorithm’s high quality, it is more complex since it should iterate for different angles and sizes. In [27], the two spatially adjacent MBs of each lost \(8 \times 8\) block are used to identify the most correlated MV in the previous frame. Then, if any block with a sufficiently close motion trajectory is found in the few previous frames, its MV is used to recover the lost \(8 \times 8\) block. A weighted average of two MVs of spatially adjacent MBs is calculated if the MV is not found. Then, one of these three MVs is selected, which produces the least boundary distortion. The algorithm is not appropriate for H.265/HEVC due to its reliance on nearby MVs that are not always available.

3 Problem statement

The EC methods attempt to improve the video quality by exploiting the most correlated available information for the corrupted area and replacing the damaged CU. The lossy area in H.265/HEVC is relatively large, and the available Prediction Block (PB) is far from each damaged block in the temporal domain. Specifically, the data is encapsulated in an integer number of Coding Tree Units (CTUs), and each CTU can be up to \(64 \times 64\) pixels in size. Thus, packet loss can result in losing a significant amount of frame data. A CU of H.265/HEVC varies from LCU to Smallest Coding Unit (SCU). In LCUs and SCUs, the sizes are \(64 \times 64\) pixels and \(8 \times 8\) pixels, respectively [7]. A lost LCU is split into several SCUs for further processing.

3.1 Hierarchical MV clustering using divisive approach

The divisive hierarchical clustering algorithm is a top-down technique. First, the data points are associated with one cluster, and then it branches into smaller sets as it goes down the hierarchy. Finally, each cluster splits into two new clusters iteratively until the stopping criterion is satisfied. There are ways of breaking each cluster [39]. The proposed algorithm exploits the SSE and minimizes it at each stage of cluster analysis to avoid the increasing order of complexity [26]. In addition, there is a limited distance among the MVs since they are highly correlated in a region. Furthermore, there are \(O(2^n)\) ways of dividing the clusters. The variable n represents the number of data points or observations in the dataset. In our proposed model, n refers specifically to the temporally neighboring MVs in the set S. Also, a fixed threshold determines the stopping criterion, which is determined by \(\Theta _{SSE}\). The smaller the threshold, the greater the complexity and accuracy [29].

To apply this method, we assume the set S such that \({\textbf {X}}_l \in S\) where \(l \in N\) and N is the number of temporally neighboring MVs. Each cluster is represented by \(C_{i,j}\) and is a member of the set L, which contains all clusters. The variable i denotes the level of the cluster while j specifies its position within that level. We will proceed with the following steps to apply this method.

  1. 1.

    Build the unique set S, which is \(C_{0,0}\).

  2. 2.

    Partition each cluster into two distinct clusters.

  3. 3.

    Reiterate step 2 until the SSE for each cluster \(R_{i,j}\) becomes smaller than a predefined threshold.

Algorithm 1 describes the divisive clustering method in detail. In this algorithm, \(\Theta _{SSE}\) is the threshold for the SSE of each cluster \(C_{i,j}\) and \(stop_{flag}\) sets the stopping criterion for the algorithm. Initially, the set L consists only of the set S. At each iteration of the while loop, the members of all \(C_{i,j} \in L\) are used to measure the SSE, denoted by \(R_{i,j}\). If the SSE value is greater than or equal to \(\Theta _{SSE}\), then \(stop_{flag}\) is set to zero, indicating that \(C_{i,j}\) must be divided, and the algorithm should proceed to the next iteration. Additionally, \(C_{i,j}\) should be removed from L after being divided. This iterative process continues until the SSE value of all members of L is less than \(\Theta _{SSE}\). Finally, L contains all the clusters.

The divisive hierarchical clustering method is illustrated in Fig. 1. At each level, the algorithm splits a cluster into two subclusters if the SSE is greater than a predefined threshold \(\Theta _{SSE}\).

Fig. 1
figure 1

The divisive hierarchical clustering algorithm is visualized. Clusters are split into two separate clusters at each level if the SSE exceeds the threshold value \(\Theta _{SSE}\). The \(l^{th}\) data point is denoted as \({\textbf {X}}_l \in C_{i,j}\). Initially, the parent cluster is defined as the set S. As the algorithm progresses, new clusters are formed by splitting existing clusters. Each cluster is labeled with a subscripted index, such as \(C_{i,j}\), where i denotes the level of the cluster in the hierarchy and j denotes its position within that level

Algorithm 1
figure a

Divisive hierarchical clustering algorithm

3.2 Motion vector modeling using PCA

A key challenge for EC algorithms is to accurately model the MVs of erroneous regions in a video frame. EC methods attempt to infer motion patterns to achieve better performance. The proposed approach develops a motion tracking model for the corrupted block. Also, it creates a model for each of these corrupted blocks by utilizing motion data clustering to improve the accuracy and efficiency of estimating the missing MVs. We define MV of a block as a 2D vector \({\textbf {MV}}\), which is denoted in (1):

$$\begin{aligned} {\textbf {MV}}=(MV_x,MV_y) \end{aligned}$$
(1)

where \(MV_x\) and \(MV_y\) are x and y components, respectively, of the considered missing MV.

Furthermore, the definition of a candidate region within a frame suggests more accurate predictions for MV recovery. Thoroughly determining this region requires considering the characteristic properties of the ellipsoid. In this paper, we propose to embed each MV into its specific cluster and investigate these clusters to find the most appropriate region for correctly predicting the lost motion information of the corrupted blocks. To achieve this, we use PCA, a statistical technique widely used for dimensionality reduction, to compute linearly uncorrelated Principal Components (PCs) from the input training data [2, 9, 19]. Using the parameters obtained from PCA, we generate new data points from the modeled ellipsoids, which allows us to increase the size of the training data set without collecting new data. This approach is particularly useful in applications where the cost of collecting new data is high or where training data availability is limited.

Assume that the data are distributed in a 2D plane. Figure 2 visualizes the iso-contour representation of the ellipses in the 5th frame of the ‘Stefan’ sequence for the H.265/HEVC coding scenario. This experiment illustrates clusters of MVs in the third Picture Order Count (POC). The MVs are chosen from the set S, which is made up of the adjacent motion information of the considered corrupted block. This experiment shows that the temporally neighboring motion information can be effectively classified into multiple ellipses. Figure 2 (c) depicts such scenarios where a rotated ellipse bounds two MVs.

Fig. 2
figure 2

Iso-contour visualization of the ellipses. This figure represents the MV clustering in the HEVC standard for the fifth frame of the ‘Stefan’ sequence. In addition, this experiment shows the ellipsoidal clustering regions in the third POC. (a) Region: \(P_x\in [312,336]\), \(P_y \in [264,288]\) (b) Region: \(P_x \in [88,112]\), \(P_y \in [248,272]\) (c) Region: \(P_x \in [304,328]\), \(P_y \in [192,216]\)

4 Proposed algorithm

The Generative PCA-based Hierarchical (GPH) method is an innovative clustering EC approach that utilizes the PCA technique to specify the MVs for each cluster. The main objective of MV clustering is generating the correlated MVs inside a cluster. Moreover, the commonly used criterion for clustering is to minimize the Euclidean distance between data points. In addition, the object’s MVs are directionally correlated to each other in a video scene that can be intuitively interpreted as elongated clusters in a frame’s region. The MVs reflect the motion of the camera and the objects in the scene. We argue that MVs are correlated depending on the type and extent of motion in the video. When there is a global motion (such as panning or zooming), all MVs have a similar direction and magnitude. When there is a local motion (such as an object moving independently), some MVs have a different direction and magnitude.

Furthermore, the proposed algorithm enhances accuracy and speed due to innovative ellipsoidal motion data modeling. It derives more correlated MV candidates for the corrupted area. Also, to avoid any confusion, we would like to emphasize that our proposed video EC technique does not address consecutive frame losses, and that the use of an IDR frame at the start of a GOP is a potential solution to recover from such losses.

Also, this classification technique allows us to restrict the candidate region in the directions with more limited diversification. In other words, the candidate region is related to the motion disparity and variance of the correctly received MVs. GPH proposes a comprehensive methodology to achieve the MV candidates utilizing the axis-aligned ellipse. Also, the covariance of the MVs determines the ellipse’s angle.

On the other hand, the ellipse magnitudes depend on the variance of the MVs, i.e., the larger the variance, the more extended the ellipse’s axis. Moreover, the covariance of the MVs defines the angle of the ellipse. Hence, it can be a nonzero number since the MVs are highly correlated to each other. If the covariance is zero, the corresponding MVs are uncorrelated, and the ellipse is axis-aligned in this case. Figure 3 depicts the ellipse with a major axis \(2R_x\) and a minor axis \(2R_y\), which holds all the correctly received motion data.

Fig. 3
figure 3

This illustration shows a 2D ellipse for the MV points of a corresponding erroneous block. It displays the solution intervals on both the x and y axes. For each value of \(x_i\), there exists a corresponding value of \(y_i\) that falls within the interval \([y_{min},y_{max}]\)

The axis directions are where motion data varies the most, characterized by a linearly transformed matrix called the covariance matrix. Furthermore, the ellipse axes are eigenvectors of the covariance matrix, and the length of each axis is proportional to the square root of the corresponding eigenvalue, i.e., \(\sqrt{\lambda _1}\) and \(\sqrt{\lambda _2}\). These eigenvalues indicate the amount of variation in the motion data along the corresponding eigenvectors. Moreover, each cluster can be modeled as a confidence ellipsoid shape with a 95% confidence region, where the major and minor semi-axis lengths are given by \(R_x = \sigma _x\sqrt{5.991\lambda _1}\) and \(R_y = \sigma _y\sqrt{5.991\lambda _2}\), respectively, where \(\sigma _x\) and \(\sigma _y\) are the standard deviations of the motion data along the first and second PCs, and \(\lambda _1\) and \(\lambda _2\) are the largest and second-largest eigenvalues of the covariance matrix, respectively.

One can measure the orientation of the ellipse using the largest eigenvector towards the x-axis:

$$\begin{aligned} \alpha = \tan ^{-1}\frac{\textbf{u}_1(y)}{\textbf{u}_1(x)} \end{aligned}$$
(2)

where \(\textbf{u}_1\) is the largest eigenvector of the covariance matrix and \(\alpha \in [0,\pi ]\). Accordingly, the general equation of the ellipse is:

$$\begin{aligned} \frac{((x-x_d)cos(\alpha )+(y-y_d)sin(\alpha ))^2}{R_x^2}+\frac{((x-x_d)sin(\alpha )-(y-y_d)cos(\alpha ))^2}{R_y^2}=s \end{aligned}$$
(3)

where \((x_d,y_d)\) is the displacement. Also, \(R_x\) and \(R_y\) are semi-axes in x and y directions, respectively. To simplify the (3), assume that a, b, and c are defined as follows:

$$\begin{aligned} a = \frac{cos(\alpha )^2}{R_x^2}+\frac{sin(\alpha )^2}{R_y^2} \end{aligned}$$
(4)
$$\begin{aligned} b=(\frac{1}{R_x^2}-\frac{1}{R_y^2})sin(2\alpha ) \end{aligned}$$
(5)
$$\begin{aligned} c=\frac{sin(\alpha )^2}{R_x^2}+\frac{cos(\alpha )^2}{R_y^2} \end{aligned}$$
(6)

Hence, the ellipse equation in (3) is reformulated as:

$$\begin{aligned} a(y-y_d)^2+(y-y_d)(x-x_d)b+(x-x_d)^2c-s=0 \end{aligned}$$
(7)

where a, b and c are the coefficients of the quadratic (7). It has a closed-form solution. It is categorized into four cases that affect the major and minor axes of the ellipse and its shape:

  1. 1.

    \(R_x \ne 0 , R_y \ne 0 , 0 \le \alpha \le \pi \): In this case, the solution written in terms of its coefficients is:

    $$\begin{aligned} y=\frac{-b^\prime \pm \sqrt{{b^\prime }^2 - 4ac^\prime }}{2a} \end{aligned}$$
    (8)

    where \(b^\prime \) and \(c^\prime \) are:

    $$\begin{aligned} b^\prime = -2ay_d + (x-x_d)b \end{aligned}$$
    (9)
    $$\begin{aligned} c^\prime =ay_d^2-y_d(x-x_d)b+(x-x_d)^2c-s \end{aligned}$$
    (10)

    Figure 3 illustrates the solution intervals for the x and y axes. \(x_i \in [x_{min},x_{max}]\) is selected based on the quarter pixel resolution for the entire video sequence since higher resolution can better provide accuracy. Hence, \(b^\prime \) and \(c^\prime \) are determined after obtaining \(x_i\) using (9) and (10), and then \(y_i \in [y_{min},y_{max}]\) is specified by applying (8).

  2. 2.

    \(R_x=0, R_y \ne 0, \alpha = k \pi \pm \frac{\pi }{2}\): In the present case, the ellipse transforms into a vertical segment limited between \(y_{max}\) and \(y_{min}\). It can commonly occur if there is a fast vertical camera movement in a scene or the objects shift vertically in the video frames. It is depicted in Fig. 4a.

  3. 3.

    \(R_x \ne 0, R_y = 0, \alpha = k \pi \): This case is illustrated in Fig. 4b. The MVs only vary in the x-axis, which explains the nonuniformity of motion in the x-direction. It can also be the result of the object’s displacement or camera movement only in the x-axis.

  4. 4.

    \(R_x = 0, R_y = 0\): This case indicates that the temporally neighboring MVs are uniform around the erroneous region. Figure 4c depicts this scenario where \(MV_{x_0}\) and \(MV_{y_0}\) are the x and y components, respectively, of the correctly received MV.

Fig. 4
figure 4

The special cases of the cluster. (a) \(R_x=0,R_y \ne 0, \alpha = k \pi \pm \frac{\pi }{2}\) (b) \(R_x \ne 0,R_y=0, \alpha = k \pi \) (c) \(R_x = 0,R_y=0\)

Furthermore, this paper recommends two methods to determine the relevant cluster for extracting the MV candidates: 1)Maximum Likelihood (ML) and 2)Nearest Mean (NM). The former selects the cluster with the highest probability density function (PDF) value, while the latter finds the cluster with a mean value more closely related to the average of the temporally neighboring correctly received MVs. Both of these approaches are used to select a single cluster from the clusters obtained from the clustering process. The ML approach is a standard method used in many clustering algorithms due to its ease of implementation. On the other hand, the NM approach is suitable for applications where there is a strong correlation between the data, such as neighboring MVs in a local region of a video frame. However, in the case of H.265/HEVC, the erroneous region is often large, and spatially neighboring MVs are not available, which results in a decrease in performance of the NM approach compared to the ML approach.

Figure 5 demonstrates the extraction of MV candidates utilizing the GPH method. The ellipse equation is estimated for each cluster exploiting the PCA approach and the correctly received MVs, i.e., the set S. Furthermore, the resolution of x is \(\Delta x\), with a minimum value of 0.25, and \(n_{\Delta x}\) represents the number of iterations required to compute solutions for \(x_i \in [x_{min},x_{max}]\). Additionally, the \(y_i\) component of the MV candidates are derived by applying (7) where \(y_i \in [y_{min},y_{max}]\). GPH attempts to classify the MVs field and estimates the cluster for MV recovery. The clusters with ellipsoidal shapes are in the direction with more disparity represented in two radii extremes, the most extended set of MVs. The GPH technique exploits the selected cluster for MV candidate recovery, refined utilizing the Boundary Matching Algorithm (BMA) procedure. The proposed scheme can be broken down into the following steps:

Fig. 5
figure 5

Diagram of the proposed algorithm. After building the set S, the mean of the set is subtracted from each element. Next, the set’s covariance is measured, and the eigenvalues are extracted, which results in finding the PCs and the ellipse orientation. Based on the \(R_x\) and \(R_y\), one of the four cases is determined, and the range of y is measured, leading to obtaining the MV candidates

  • Error detection: The first step involves decoding the received video stream and checking for errors. Any errors found are flagged and sent to the EC module for further processing.

  • Data extraction: In this step, the set S is built using available temporally neighboring MVs from the previous frame. These MVs are extracted from co-located LCU in the previous frame and its spatially neighboring MVs are added to the set S. These data points are aggregated, and also zero and average MVs are added to the set.

  • Data clustering: The data points are associated with a cluster in this step, and the cluster branches into smaller sets as it goes down the hierarchy. The process iteratively splits each set into smaller ones until a stopping criterion is satisfied. ML or NM algorithms are then used to determine the relevant cluster.

  • Model training: In this step, the PCA procedure is performed, and the PCs are calculated using existing data in the chosen cluster. To do this, the mean is first subtracted from each element in the cluster. Then, the covariance is calculated, and the eigenvalues are derived.

  • Data generation: Once the model is trained, it can be used to generate new data points. This involves deriving the ellipse orientation using the trained model’s parameters, as well as determining the major-axis and minor-axis lengths of the ellipse. Then, the x and y components of the data points, which are the MV candidates, are calculated.

  • Concealment: In this final step, the erroneous blocks are concealed using the MVs derived in the data generation step, utilizing the BMA approach.

5 Experimental results

5.1 Experimental environment

The simulations for EC were carried out on an Intel core i5 Dual-Core, 2.6 GHz processor with 8GB RAM of a MacBook laptop. The experiments were conducted using the H.265/HEVC reference software HM 16.20 [8]. The video sequence was encoded using the main profile with an input bit depth of 8. The maximum width and height of coding units were limited to 64 pixels, while the maximum partition depth was set at 4. The decoding refresh type was not utilized, and a Group of Pictures (GOP) size of 4 was selected. Motion estimation was performed using the TZ search method with a search range of 64.

In this paper, we use a video dataset from Xiph.org [25]. The dataset consists of various video sequences that covers a wide range of content. We tested the proposed method on five video sequences: ‘Shields,’ ‘Ducks Take-Off (DTO),’ ‘Park run (PR),’ and ‘Johnny (JO),’. All sequences have a resolution of \(1280 \times 720\) pixels and use the 720 progressive (720p) HD signal format. We use 4:2:0 color subsampling on these video signal. The frame rates for all the video sequences are 50 Frames per Second (fps), except for ‘Johnny’, which has a frame rate of 60 fps.

In our experimental setup, we specified a specific Packet Loss Rate (PLR) for each simulation and applied a concealment procedure after detecting errors in the decoded video. The video sequences were encoded in Slice Mode, with each slice containing an integer number of CTU. We set the maximum slice size to 1500 bytes, which aligns with the Maximum Transportation Unit (MTU) of the network for a single packet. We tested two PLR values of 10% and 20%, and modeled the transmission channel using the Elliot-Gilbert model. We generated 20 instances of packet loss, with an average burst length of three packets. Moreover, we only considered P-frames and did not include B-frames. The coding structure used for testing the performance is “IPPP", which consists of only one intra-frame (I) after 51 inter-frames (P). There are 150 frames used in each video sequence. Quantization Parameters (QP)s were set to 25 and 35.

We evaluated the performance of the proposed GPH method and its two approaches, \(\mathrm {GPH_{ML}}\) and \(\mathrm {GPH_{NM}}\), on a set of benchmark datasets. The experimental results show that both \(\mathrm {GPH_{ML}}\) and \(\mathrm {GPH_{NM}}\) achieve superior performance compared to the state-of-the-art methods in terms of accuracy and computational efficiency. Notably, the two approaches offer robust and efficient solutions to the problem across a range of datasets.

Table 2 Performance comparison of H.265/HEVC in terms of PSNR, SSIM, and run time for packet loss rate 10%
Table 3 performance comparison of H.265/HEVC in terms of PSNR, SSIM, and run time for packet loss rate 20%

5.2 Performance evaluation

The paper compares the performance and complexity of the proposed algorithm with those of other related EC schemes such as Xu method [37], Lin method [22], and Chang method [3] for H.265/HEVC. In experiments, we use PSNR as an objective metric for quality estimation and evaluate the video frames using Structural Similarity Index Measurement (SSIM), which captures essential perception-based facts or inter-dependent pixel information better than the PSNR [31].

$$\begin{aligned} PSNR=10 \log _{10} \frac{M^2}{MSE} \end{aligned}$$
(11)

where M represents the most significant value of the signal, which in our case was 255, and MSE represents the Mean Square Error difference between two video frames.

Tables 2 and 3 show the PSNR and SSIM results for a PLR of 10% and 20%, respectively. These tests were conducted on the entire test sequence. The table demonstrates that the proposed method outperforms the others by more than 4.99 dB and 4.19 dB in PSNR of 10% and 20%, respectively. Additionally, it achieves a total gain of 0.0208 and 0.029 in terms of SSIM for a PLR of 10% and 20%, respectively. Accordingly, the improved performance of the proposed approaches is consistent with our expectation that the clustering hypothesis experiment would outperform the motion-compensated method. Due to the high compression ratio of HEVC, error propagation is entirely severe in the succeeding frames in a video sequence which causes rapid degradation for all EC techniques since even minor error results in a high degradation in the following GOP frames. PSNR gain is consistent in our proposed schemes, and visual quality is superior to the algorithms we compared. Also, we have evaluated the performance of both the ML and NM approaches for selecting a single cluster. Our experimental results demonstrate that the \(\mathrm {GPH_{ML}}\) approach outperforms the \(\mathrm {GPH_{NM}}\) approach in terms of both concealment quality and computational complexity. This finding is consistent with the fact that in the erroneous region of a frame in H.265/HEVC, the availability of spatially neighboring MVs is limited, and the correlation between them decreases dramatically. Therefore, we recommend using the \(\mathrm {GPH_{ML}}\) approach in our proposed algorithm or in similar applications.

Fig. 6
figure 6

Error concealment results of 50th frame over ‘Park run’ sequence at the PLR of 20% and QP=35. (a)Original frame (b)Xu approach [37] (c)Lin approach [22] (d)Chang approach [3] (e)The proposed method by \(\mathrm {GPH_{nm}}\) approach. (f)The proposed method by \(\mathrm {GPH_{ml}}\) approach

Moreover, the proposed method requires fewer computations for video frames with more uniform motion vectors. For example, in Table 2, the ‘Shields’ sequence requires about 30% fewer computations for the proposed algorithms than the other compared ones. This video sequence depicts a man walking uniformly in front of a wall. The results show that the proposed methods have less computational complexity than the other compared methods.

Figure 6 illustrates a visual comparison of the 50th frame of the ‘Park Run’ sequence using the algorithms by Xu [37], Lin [22], and Chang [3]. In this experiment, QP is 35 and PLR is set to be 20%. The frames in Fig. 6b, c, and d are concealed using Xu’s, Lin’s, and Chang’s algorithms, respectively. Figure 6a shows the original frame of the 50th frame of the ‘Park Run’ sequence. Figure 6e and f show the frames recovered using the proposed methods, i.e., \(\mathrm {GPH_{nm}}\) and \(\mathrm {GPH_{ml}}\), respectively. These figures show that the proposed methods significantly enhance the visual quality compared to the other algorithms. The video sequence depicts a man running in a park with an umbrella in his hand. The proposed approaches accurately predict the shape of the man. Moreover, there are trees, snow, and water in the background with lots of intricate patterns. The proposed approaches, specifically \(\mathrm {GPH_{ml}}\), preserve the details more accurately than the other compared methods.

Fig. 7
figure 7

Error concealment results of 70th frame over ‘Johnny’ sequence at the PLR of 20% and QP=25. (a)Original frame (b)Xu approach [37] (c)Lin approach [22] (d)Chang approach [3] (e)The proposed method by \(\mathrm {GPH_{nm}}\) approach. (f)The proposed method by \(\mathrm {GPH_{ml}}\) approach

Figure 7 shows the subjective comparison of the 70th frame of the ‘Johnny’ sequence. In this experiment, QP is 25 and PLR is set to be 20%. Figure 7a shows the original frame, which contains a man moving his head and a static background with several details. The other figures, namely Fig. 7b, c, and d, demonstrate the results of using Xu’s [37], Lin’s [22], and Chang’s [3] algorithms to conceal parts of the scene. Errors occurring on moving parts, particularly the left side of the head, greatly affect the visual quality of the scene. This is because sudden movements in these areas result in more non-uniform motions that are challenging for EC methods to recover. Our proposed methods, \(\mathrm {GPH_{nm}}\) and \(\mathrm {GPH_{ml}}\), demonstrate superior motion information estimation compared to the other methods we tested. This is evidenced by the results in Fig. 7e and f. Overall, our findings demonstrate the superiority of our algorithms over the other compared methods.

We attribute this exceptional performance to two reasons that mitigate the obstacles:

  1. 1.

    The EC algorithms, such as the Xu method [37], employ limited motion information, and hence they cannot predict the missed MVs considerably.

  2. 2.

    In the absence of motion information in adjacent blocks, EC techniques employing only neighboring MVs cannot provide a good estimation of the MV.

Furthermore, the moving facial region of the frame does not provide precise PU/TU information at the slice level. Thus, applying only the neighboring MVs can result in successive slice errors and blocking artifacts as motion increases. In our study, we used the “IPPP” coding structure, which means that each frame (except the first one) is dependent on the previous frames for reconstruction. This can make the video more vulnerable to the effects of packet loss. However, we did not introduce any I-frames to the coding structure to reduce the dependence on previous frames. As a result, our results may represent the worst-case scenario in an error-prone environment. Furthermore, future studies could explore the use of I-frames in the coding structure to improve the video quality in such environments.

5.3 Computational complexity analysis

Tables 2 and 3 compare the computational complexity of decoding video frames concealed in different EC techniques. They show that the EC algorithms demand post-processing procedures that increase the overall decoding time. The required decoding time for the proposed algorithms depends on the degree of MV uniformity around the corrupted region, i.e., the smaller the MV range, the fewer computations are needed. In contrast, other EC algorithms usually consume the same amount of decoding time in different video sequences. In the proposed GPH method, the first PC determines the number of computations and the special cases of the quadratic (7) which defines the ellipse transformation. It is illustrated in Fig. 4. Consequently, these conditions automatically reduce the computational complexity based on the local information of the MV uniformity.

6 Conclusion

This paper proposes a novel approach for lost MV recovery in H.265/HEVC using a PCA-based hierarchical clustering technique to address the challenge of dealing with large lossy areas where the available PBs are far from the damaged blocks. Our study utilizes divisive hierarchical clustering to classify temporally and spatially adjacent MVs. Through several experiments, we employed the PCA method to estimate the magnitude and direction of each cluster. Additionally, our proposed procedure generates an ellipsoidal model that limits the diversification of motion data for loss concealment.

Furthermore, our proposed method introduces two strategies for selecting the best cluster: maximum likelihood and nearest mean. The former utilizes the PDF, while the latter relies on the average MV for each cluster. Other similar approaches could also be studied to select the best cluster. The proposed algorithm generates related motion data based on the variance of the selected cluster, making it suitable for processing different video sequences as its computational complexity is determined by motion diversity and variance. Thus, our method provides a new framework for video EC, and further research in this area is needed.

The experimental results demonstrate that the proposed method outperforms other tested EC algorithms in both PSNR and SSIM. Specifically, for PLR of 10%, the proposed method improves PSNR by up to 16.11% and SSIM by up to 2.1586% on average. Similarly, for PLR of 20%, the proposed method improves PSNR by up to 14.72% and SSIM by up to 3.0857% on average, compared to the other EC algorithms.

Our study proposes a PCA-based approach for error concealment in H.265/HEVC, leveraging MV clustering to generate accurate MVs for missing regions of damaged frames. While our findings are promising, there is still much to explore in this area. Future research could focus on evaluating other effective clustering methods for different types of video frames beyond the Euclidean distance measure used in this paper, potentially leading to increased homogeneity within clusters. Additionally, investigating alternative strategies for proper cluster selection could also be valuable. Finally, exploring properties of generative models, such as hidden Markov models and mixture of Gaussians, could help identify more suitable approaches for motion estimation.

In addition, our proposed method effectively preserves the structure of moving objects and edges in concealed regions. We also conducted experiments to determine the decoding time for our approach and compared it with other related techniques. The results indicate that our method outperforms the other EC schemes, achieving high video quality while reducing computational complexity.