1 Introduction

The prediction of human body joint positions in three-dimensional space from video, known as 3D human pose estimation (HPE), serves applications such as video surveillance, human–robot interaction, and physiotherapy [1]. Dedicated motion sensors such as motion capture systems, depth sensors, or stereoscopic cameras [2, 3] allow 3D human poses to be extracted directly. The task can be tackled in multi-view setups, involving multiple cameras, or in monocular settings, where a single camera is used. Although state-of-the-art multi-view methods [4,5,6,7] generally outperform monocular ones, the low cost and ubiquity of ordinary RGB monocular cameras in real-world surveillance scenarios make 3D HPE from monocular video an essential and challenging task that continues to attract research interest. Recent monocular methods can be categorized into model-based and model-free approaches [8]. Model-based methods [9, 10] incorporate parametric body models such as kinematic [11], planar [12], and volumetric models [13]. Model-free methods can be further divided into single-stage and 2D to 3D lifting methods: single-stage methods estimate the 3D pose directly from images in an end-to-end manner [14,15,16,17,18,19], whereas 2D to 3D lifting methods introduce an intermediate 2D pose estimation stage [20,21,22,23]. Notably, 2D to 3D lifting methods, particularly when driven by ground truth 2D poses, achieve superior accuracy and robustness.

Notable advances in the accuracy and efficiency of 2D human pose detection have been achieved through detectors such as Mask R-CNN (MRCNN) [24], the cascaded pyramid network (CPN) [25], the stacked hourglass (SH) detector [26], and HR-Net [27]. The intermediate 2D pose estimation stage enabled by these detectors plays a pivotal role in reducing the data volume and complexity of the 3D HPE task.

Regarding temporal information, mainstream methods [21,22,23, 28,29,30] have achieved substantial gains in accuracy and efficiency by processing extended sequences of 2D pose frames, advancing 2D to 3D lifting methods. Among them, [30] stands out for achieving state-of-the-art performance with ground truth 2D poses. Recent approaches [30, 31] streamline the pipeline by fine-tuning 2D pose detectors on the target datasets, notably improving the accuracy of the estimated 2D poses. Despite these advancements, performance with estimated 2D poses still falls short of results obtained with ground truth 2D poses. This observation motivates a focused effort to improve 3D HPE using ground truth 2D pose data, anticipating further gains as higher-quality estimated 2D poses become available.

Motivated by the promising performance and advantages of 2D to 3D lifting methods, our work adds to the existing literature in this domain. Since [20] introduced the fully connected network (FCN), advanced lifting models have fallen into three main groups: temporal convolutional network (TCN)-based [21, 22], graph convolutional network (GCN)-based [23, 28, 32], and transformer-based models [29, 30, 33]. Existing TCN- and transformer-based methods can handle large receptive fields, representing extended 2D pose sequences through strided convolutions and thereby providing richer temporal context. However, because the 2D pose sequence is flattened before being fed to the model, it is difficult to trace local joint features back to the pose structure, which calls for more effective feature extraction. In addition, these methods rely on the same fully connected layer to estimate all pose joints, potentially neglecting the unique and independent characteristics of individual joints and thus limiting prediction accuracy. Conversely, GCN-based models explicitly preserve the structure of 2D and 3D human poses during convolutional propagation, yet their potential in this context remains underexplored. Existing GCN-based methods [23, 32] likewise employ a single fully connected layer to estimate all 3D pose joints, potentially overlooking the structural features of GCN representations.

In pursuit of this objective, we introduce an effective framework for 3D human pose estimation built on a dual absorbing graph representation strategy. The input dense event stream is first downsampled into a sparse event stream and divided into non-overlapping voxel grids. Distinct absorbing graph models are then constructed for the point and voxel streams, each comprising all sparse point/voxel nodes together with a dedicated absorbing node. We further introduce a novel absorbing graph convolutional network (AGCN) for absorbing graph representation learning in 3D human pose estimation. The AGCN model offers three advantages. First, the introduced absorbing node captures the importance of event nodes when learning the graph-level representation. Second, the absorbing node dynamically aggregates information from all event nodes, summarizing node representations more effectively than conventional pooling layers. Third, each node aggregates messages from both its neighbors and the absorbing node, preserving local and global structure simultaneously and thereby enhancing the learned graph representation. Finally, the outputs of the dual AGCN branches are concatenated to extract complementary information from both streams and fed into a linear layer for 3D human pose estimation.

In summary, the primary innovations presented in this work are as follows:

  • Employing OctreeGrid filtering and voxel construction reduces computational complexity, extracting representative joint points and voxel representations through downsampling for effective 3D pose estimation.

  • We introduce CPointGraph and VoxelGraph, absorbing graphs that capture the spatiotemporal relationships in point clouds and voxels, respectively. The proposed absorbing graph convolutional network (AGCN) learns from these graphs the feature descriptors crucial for accurate 3D pose estimation.

  • AGCN’s multilayer design with a residual connection facilitates seamless information flow, enhancing the model’s ability to capture intricate spatiotemporal structures, while absorbing nodes consolidate information from all nodes to yield improved graph-level representations.

2 Related work

2.1 2D to 3D lifting

Early attempts to infer 3D positions from 2D projections, like [34,35,36], often relied on manually chosen parameters based on assumptions about joint mobility. While methods such as [10, 37] have made impressive strides in estimating 3D pose from fewer frames, they sometimes neglect the consideration of temporal information evolution over the sequence. Recent advancements in 2D human pose estimation, exemplified by [25, 26, 38], have paved the way for 2D to 3D lifting approaches, which have outperformed other methods. Building upon the principles of [20], more sophisticated learning architectures have emerged, with a particular emphasis on utilizing temporal information. These approaches, collectively known as 2D to 3D lifting, can be categorized into three directions: TCN-based, GCN-based, and transformer-based architectures [21,22,23, 28,29,30, 32].

TCN-based methods, exemplified by [21, 22], have significantly advanced 2D to 3D pose lifting, particularly through their effective handling of temporal sequences and dimensionality reduction. Their strided design progressively reduces the feature dimensionality, transforming a 2D pose sequence into a feature embedding from which a final fully connected layer, typically with 1024 channels, predicts the 3D positions of all pose joints. Prior work has extensively explored the number of input 2D pose frames, showing that longer input sequences generally benefit 3D pose reconstruction. By decreasing the number of temporal frames as features propagate through several TCN blocks, the strided design efficiently reduces the feature size, improving computational efficiency and real-time processing. Building upon this strided structure, transformer-based methods, particularly [30], have exhibited promising performance. [30] capitalizes on weighted and temporal loss functions, surpassing GCN-based methods optimized with an additional motion loss [23, 32]; notably, the effectiveness of the motion loss was found to be limited in [30]. These observations motivate the search for effective GCN-based models that incorporate the inspiring designs of TCN-based methods without relying on novel loss functions, striking a balance between innovation and simplicity so that the resulting model is both advanced and easily interpretable.

2.2 Graph convolutional network

A widely used approach for representing pose data is the Spatial Temporal GCN (ST-GCN), initially designed to model large receptive fields for improved skeleton-based action recognition. Building on this, more sophisticated GCN models like [23, 32, 39, 40] have emerged to further advance 3D human pose estimation (3D HPE). These models aim to enhance the understanding and accuracy of 3D pose estimation, leveraging the principles established by ST-GCN.

In the realm of graph convolutional network (GCN)-based models dedicated to 3D human pose estimation (3D HPE), several innovative architectures have emerged in recent years. Ci et al. [39] introduced the locally connected network (LCN), which combines ideas from both fully connected networks and GCNs. Specifically, LCN performs graph convolutions over a neighbor set defined by distance, similar to the design in ST-GCN [41]. Zhao et al. [32] presented SemGCN, a novel model that stacks GCN layers followed by a flattening fully connected layer. By optimizing with both joint positions and bone vectors, SemGCN achieves strong performance on 3D HPE. Choi et al. [40] offered a new perspective by utilizing GCN to lift 2D poses to 3D, demonstrating its effectiveness in recovering 3D human poses and meshes. Liu et al. [42] investigated different weight sharing schemes in GCNs for the pose lifting task and identified the superiority of the pre-aggregation scheme in terms of performance. The architecture proposed in [42] shares similarities with SemGCN. The aforementioned GCN-based approaches have exhibited compelling results given single-frame 2D poses as input. However, they did not fully take advantage of the temporal information available in 2D pose sequences. This reveals opportunities for future investigation into modeling the temporal dynamics to further enhance the performance of GCN-based 3D human pose estimation.

The U-shaped graph convolution networks (UGCN), as seen in [23, 28], represent a significant advancement in GCN-based methods for 3D human pose estimation (3D HPE). UGCN excels by considering the temporal characteristics of pose motion, specifically addressing the reconstruction of a single 3D pose frame from multiple 2D pose frames. UGCN leverages the spatial–temporal GCN [41] to predict a 3D pose sequence from a 2D pose sequence, regulating the temporal trajectory of pose joints with a motion loss term based on the prediction and the corresponding ground truth 3D pose sequence. While prior works like SemGCN and UGCN have made improvements by introducing novel loss terms, our contribution to the literature of 2D-3D lifting focuses on the application of a consistent loss term, inspired by the proven effectiveness of [21, 22]. In our model design, we propose to incorporate strided convolutions into a GCN-based model to represent the global information of a 2D pose sequence. Leveraging the structure of GCN representation, we explicitly employ the structured features of different pose joints to locally predict their corresponding 3D pose locations. This approach builds upon the existing literature while enhancing the accuracy and robustness of 3D HPE.

3 Method

In this section, we first give an overview of our proposed 3D human pose estimation model and initial event representation. Then, we dive into details of the absorbing graph representation learning method, focusing on graph construction and absorbing graph convolutional networks (AGCN).

Fig. 1

A comprehensive overview of our proposed absorbing graph representation learning framework for human pose estimation. The input is first converted into dual forms, namely a sparse pose point cloud and a voxelized representation. Dual graphs are then constructed from these two inputs, with each point or voxel grid serving as a graph node, and absorbing nodes are incorporated into the graph structure to capture the global information crucial for accurate pose estimation. The absorbing graph convolutional network (AGCN) is designed for structured feature learning and simultaneous global feature aggregation. In the final stage, the predictions from the dual branches are concatenated and fed through a linear layer to yield the ultimate human pose estimate

3.1 Overview

For an input video stream containing hundreds of thousands of events, we initially employ OctreeGrid filtering [43] and Voxel construction. This process aims to derive point cloud and voxel representations, respectively. Next, we construct two absorbing graphs, namely CPointGraph and VoxelGraph, to capture the spatiotemporal relationships between the point clouds and voxels. Subsequently, we introduce a novel absorbing graph convolutional network (AGCN) designed to learn feature descriptors from the information captured by the two graphs. In the final step, we integrate the information from the two graphical representations to perform 3D human pose estimation. Figure 1 outlines the overall framework, with details of each module provided in the following sections.

3.2 Initial event representation

In 3D pose estimation, where substantial data volumes and computational complexity pose evident challenges, downsampling is essential to curtail the number of events. In this paper, we employ two distinct sampling methods to obtain concise event representations that integrate seamlessly with the 3D pose estimation process. The first crucial step extracts representative joint points, which are explicitly designated as center points.

In more detail, let us focus on the original event stream, denoted as \({\mathcal {E}}\), which comprises N events. We first apply the OctreeGrid filtering algorithm [43], a pivotal step that efficiently reduces the data and extracts a set of representative events, designated as center points, \(\mathcal {C}=\{c_{1},c_{2},\ldots ,c_{M}\}\). Each representative event \(c_i\) is encapsulated in a 4D tuple:

$$\begin{aligned} c_i=(x_i,y_i,t_i,p_i). \end{aligned}$$
(1)

The variables \(x_i\) and \(y_i\) are employed to represent spatial coordinates, while \(t_i\) corresponds to the event’s timestamp. Furthermore, the variable \(p_i\) denotes the event’s attribute or polarity. In the context of our 3D pose estimation research, our primary emphasis is placed on \((x_i, y_i, t_i)\), which collectively characterizes the spatiotemporal coordinates or positions of an event. It is noteworthy that the sampled set \(\mathcal {C}\), in contrast to the original events in set \({\mathcal {E}}\), contains a substantially reduced number of events, yet it effectively preserves the fundamental spatiotemporal structure.
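As a concrete illustration of this representation, the following sketch approximates the center-point selection with a uniform spatiotemporal grid subsampling. It is not the OctreeGrid filtering of [43] itself, whose implementation details are not reproduced here, and the (N, 4) array layout of the tuples in Eq. (1) is an assumption made purely for illustration.

```python
import numpy as np

def grid_subsample_events(events, cell=(8, 8, 0.01), max_points=1024):
    """Approximate center-point selection: keep one representative event per
    occupied (x, y, t) grid cell. `events` is an (N, 4) array of (x, y, t, p)
    tuples as in Eq. (1). This is a simplified stand-in for the OctreeGrid
    filtering of [43], used only for illustration."""
    xyt = events[:, :3]
    cell_ids = np.floor(xyt / np.asarray(cell)).astype(np.int64)   # coarse cells
    _, keep_idx = np.unique(cell_ids, axis=0, return_index=True)   # one event per cell
    centers = events[np.sort(keep_idx)]
    if len(centers) > max_points:                                  # cap M if needed
        centers = centers[np.linspace(0, len(centers) - 1, max_points, dtype=int)]
    return centers  # (M, 4): the sampled set C

# Example: 100k synthetic events from a 640x480 sensor over one second.
events = np.column_stack([
    np.random.rand(100_000) * 640,           # x
    np.random.rand(100_000) * 480,           # y
    np.sort(np.random.rand(100_000)),        # t in seconds
    np.random.choice([-1.0, 1.0], 100_000),  # polarity p
])
C = grid_subsample_events(events)
```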

Beyond identifying the center points \(\mathcal {C}\), our approach also incorporates voxelization to obtain voxel representations. Specifically, considering the original event stream \({\mathcal {E}}\) within a spatial–temporal 3D space of dimensions H, W, and T, we partition this space into voxels of dimensions \(h'\), \(w'\), and \(t'\). Each voxel generally contains multiple events, and the resulting voxel grid has dimensions \(H/h'\), \(W/w'\), and \(T/t'\). In practice, this voxelization still yields tens of thousands of voxels. To further reduce their number and mitigate the impact of noisy voxels, we apply a voxel selection procedure that retains the top K voxels ranked by the number of events they contain. Let \(\mathcal {O}=\{o_1,o_2,\ldots ,o_K\}\) denote the set of selected voxels. Each voxel \(o_i\) is associated with a feature descriptor \({\textbf{a}}_i\in {\mathbb {R}}^C\) that aggregates attributes, including polarity, of the events it contains. Consequently, each \(o_i\in \mathcal {O}\) is represented as:

$$\begin{aligned} o_i=(x_i,y_i,t_i,{\textbf{a}}_i), \end{aligned}$$
(2)

where \(x_i\), \(y_i\), and \(t_i\) represent the 3D (spatiotemporal) coordinates of each voxel.
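The voxelization and top-K selection described above can be sketched as follows. The per-voxel descriptor used here (event count and mean polarity) is an illustrative choice, since the text only states that \({\textbf{a}}_i\) aggregates attributes of the contained events.

```python
import numpy as np

def voxelize_events(events, space=(480, 640, 1.0), voxel=(16, 16, 0.05), top_k=512):
    """Partition the (H, W, T) space into (h', w', t') voxels, keep the top-K
    voxels by event count, and attach a per-voxel descriptor a_i (Eq. 2).
    The descriptor (count and mean polarity) is illustrative only."""
    H, W, T = space
    h, w, tau = voxel
    x, y, ts, p = events.T
    idx = np.stack([
        np.clip((y // h).astype(int), 0, int(H // h) - 1),
        np.clip((x // w).astype(int), 0, int(W // w) - 1),
        np.clip((ts // tau).astype(int), 0, int(T // tau) - 1),
    ], axis=1)
    _, inverse, counts = np.unique(idx, axis=0, return_inverse=True, return_counts=True)
    voxels = []
    for k in np.argsort(-counts)[:top_k]:            # top-K occupied voxels
        mask = inverse == k
        a_i = np.array([counts[k], p[mask].mean()])  # simple descriptor
        voxels.append((x[mask].mean(), y[mask].mean(), ts[mask].mean(), a_i))
    return voxels  # the set O of Eq. (2)
```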

3.3 Absorbing graph representation learning

After obtaining the initial event representations from \(\mathcal {C}\) and \(\mathcal {O}\), we introduce an effective method to learn distinctive representations tailored for 3D human pose estimation tasks. These initial representations in \(\mathcal {C}\) and \(\mathcal {O}\) encapsulate crucial spatiotemporal relationships among event units, whether they are points or voxels. Recognizing the significance of these relationships, we leverage graph models and a learning approach to represent the pre-processed event streams associated with 3D human pose estimation.

In the upcoming sections, we delve into the specifics of our graph construction techniques for the data from \(\mathcal {C}\) and \(\mathcal {O}\), focusing on their relevance to 3D human pose estimation. Subsequently, we unveil a novel absorbing graph convolutional network (AGCN) designed to adeptly learn and generate effective representations for the event data originating from \(\mathcal {C}\) and \(\mathcal {O}\). This integrated approach is pivotal in improving the performance of our 3D human pose estimation tasks by capturing the inherent spatiotemporal relationships within the event data.

3.3.1 Graph construction

The core innovation is the introduction of graph convolutional networks (GCNs) to model relationships between points and voxels in event streams.

Center points graph. For each center point event \(c_i\) in \(\mathcal {C}\) with attributes \((x_{i},y_{i},t_{i},p_{i})\), we add a node \(v_i\) to \(G^c\). Nodes \(v_i\) and \(v_j\) are adjacent if the spatiotemporal distance between \(c_i\) and \(c_j\) is less than R. This geometric graph \(G^c\), with node set \(V^c\) and edge set \(E^c\), captures the relationships between nearby events:

$$\begin{aligned} d(c_i,c_j)<R\end{aligned}$$
(3)

where R is a preset parameter and \(d(c_i,c_j)\) is the distance between events \(c_i\) and \(c_j\), computed in our experiments as the Euclidean distance over the spatiotemporal coordinates:

$$\begin{aligned} d(c_i,c_j)=\sqrt{(x_i-x_j)^2+(y_i-y_j)^2+(t_i-t_j)^2}\end{aligned}$$
(4)

An absorbing node \({\bar{v}}\) is introduced and linked to every event node \(v_i\) in \(V^c\) by adding edges. This augmented center point graph is shown in Fig. 1.
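A minimal sketch of this construction is given below, using brute-force pairwise distances for clarity; a practical implementation would use a spatial index for large M, and the radius R shown is an illustrative value.

```python
import numpy as np

def build_center_point_graph(C, R=5.0):
    """Build G^c: nodes are center points, an edge (i, j) exists when the
    distance of Eq. (4) is below R (Eq. 3), and an absorbing node (stored
    as the last index) is connected to every event node."""
    coords = C[:, :3]                               # (x_i, y_i, t_i)
    M = len(coords)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))             # pairwise d(c_i, c_j)
    adj = (dist < R).astype(np.float32)
    np.fill_diagonal(adj, 0.0)                      # no self-loops among event nodes
    A = np.zeros((M + 1, M + 1), dtype=np.float32)
    A[:M, :M] = adj
    A[M, :M] = 1.0                                  # absorbing node -> all event nodes
    A[:M, M] = 1.0
    return A
```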

Voxel graph. We handle the voxel event data \(\mathcal {O}\) in a similar manner by building a geometric neighborhood graph \(G^{o}(V^{o},E^{o})\). Each node \(v_i\in V^{o}\) represents a voxel \(o_{i}=(x_{i},y_{i},t_{i},{\textbf{a}}_{i})\) from \(\mathcal {O}\), described by its feature vector \({\textbf{a}}_{i}\in {\mathbb {R}}^{C}\). We define an edge \(e_{ij}\in E^{o}\) connecting \(v_i\) and \(v_j\) if the distance between their spatiotemporal coordinates, computed as in Eq. (4), is less than a predefined threshold \(D_{\text {lim}}\). Moreover, we introduce an absorbing node \({\bar{v}}\) that connects to all voxel nodes in \(V^{o}\); its role is to aggregate and integrate information from all voxels, facilitating the extraction of a comprehensive, global-level representation of the entire voxel graph. This graph is also illustrated in Fig. 1.
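The voxel graph follows the same recipe; a brief sketch is shown below, where stacking the voxel coordinates with the descriptor \({\textbf{a}}_i\) as node features is an illustrative layout rather than something prescribed by the text.

```python
import numpy as np

def build_voxel_graph(voxels, d_lim=3.0):
    """Build G^o from the selected voxels in O: node features concatenate the
    voxel coordinates with a_i, edges follow the distance rule of Eq. (4)
    with threshold D_lim, and an absorbing node is appended last."""
    coords = np.array([[x, y, t] for x, y, t, _ in voxels], dtype=np.float32)
    feats = np.stack([np.concatenate(([x, y, t], a)) for x, y, t, a in voxels])
    K = len(coords)
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    A = np.zeros((K + 1, K + 1), dtype=np.float32)
    A[:K, :K] = (dist < d_lim) & ~np.eye(K, dtype=bool)
    A[K, :K] = 1.0                                   # absorbing node edges
    A[:K, K] = 1.0
    # Absorbing-node feature initialized to the mean of all voxel features.
    feats = np.vstack([feats, feats.mean(0, keepdims=True)]).astype(np.float32)
    return feats, A
```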

3.3.2 Absorbing GCN

We introduce the absorbing graph convolutional network (AGCN), a model designed to learn meaningful representations for 3D pose estimation from the center point and voxel graphs with absorbing nodes constructed above. AGCN is structured as multiple learning layers with a residual connection between the initial and final layers, as illustrated in Fig. 1 (right). Each layer performs message passing across the graph: within every AGCN layer, each event node \(v_i\) aggregates features from its neighboring nodes and the absorbing node as

$$\begin{aligned} f_d'(v_i)\leftarrow \sigma \Big (\sum _{v\in \mathcal {N}(v_i)\cup \{{\bar{v}}\}}\omega _d(v_i,v)f(v)\Big ),\quad d=1,2,\ldots ,D \end{aligned}$$
(5)

The absorbing node \({\bar{v}}\) collects and consolidates messages from all the remaining nodes in the following manner:

$$\begin{aligned} f_d'({\bar{v}})\leftarrow \sigma \Big (\sum _{v\in V}\omega _d({\bar{v}},v)f(v)\Big ),\quad d=1,2,\ldots ,D \end{aligned}$$
(6)

Here \(\sigma\) denotes the activation function, typically ReLU. We use \(\omega _{d}(v_{i},v)\) and \(\omega _{d}({\bar{v}},v)\) to denote the learnable convolution kernel weights. Following prior research [44, 45], we define these weights as a Gaussian mixture model (GMM) function [45] of pseudo-coordinates: for any node pair (u, v), we compute the pseudo-coordinate \(z_{uv}\) and learn the weight kernel \(\omega _{d}(u,v)\):

$$\begin{aligned} \omega _{d}(u,v)=\sum _{k=1}^{K}\alpha _{k}\exp \Big (-\frac{1}{2}(z_{uv}-\mu _{k})^{T}\Sigma _{k}^{-1}(z_{uv}-\mu _{k})\Big ) \end{aligned}$$
(7)

where the parameters \(\mu _{k}\) and \(\Sigma _{k}\) are learnable, \(\alpha _{k}\) denotes the importance assigned to the k-th Gaussian kernel, and K is the total number of Gaussian kernels.
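A minimal sketch of this mixture-of-Gaussians kernel is given below, in the spirit of the GMM-based graph kernels of [45]. Using relative spatiotemporal offsets as pseudo-coordinates and restricting each \(\Sigma_k\) to a diagonal covariance are simplifying assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class GMMEdgeWeights(nn.Module):
    """Learnable mixture-of-Gaussians kernel omega_d(u, v) of Eq. (7).
    Pseudo-coordinates z_uv are assumed to be relative (x, y, t) offsets,
    and each Sigma_k is diagonal (a common simplification)."""
    def __init__(self, num_kernels=8, coord_dim=3, num_channels=16):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(num_kernels, coord_dim))
        self.log_sigma = nn.Parameter(torch.zeros(num_kernels, coord_dim))
        # alpha_k, one mixture per output channel d = 1..D
        self.alpha = nn.Parameter(torch.full((num_channels, num_kernels), 1.0 / num_kernels))

    def forward(self, z_uv):                          # z_uv: (E, coord_dim)
        inv_var = torch.exp(-2.0 * self.log_sigma)    # diagonal Sigma_k^{-1}
        diff = z_uv.unsqueeze(1) - self.mu            # (E, K, coord_dim)
        g = torch.exp(-0.5 * (diff ** 2 * inv_var).sum(-1))  # (E, K) Gaussian responses
        return g @ self.alpha.t()                     # (E, D): one weight per channel d
```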

By implementing the layer-wise message passing method explained earlier, we can construct a multilayer AGCN architecture that integrates a residual connection, bridging the initial and final layers. This architectural design is illustrated in Fig. 1 (right).
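The layer-wise aggregation of Eqs. (5) and (6) and the residual stack can be sketched as below. For brevity, the per-edge GMM weights are folded into a shared linear map (a full implementation would weight each edge with the kernel above), and the absorbing node is assumed to be stored as the last row and column of the adjacency matrix, as in the construction sketches earlier.

```python
import torch
import torch.nn as nn

class AGCNLayer(nn.Module):
    """One message-passing layer: every node, including the absorbing node
    stored last, aggregates features from its neighbours (Eqs. 5 and 6).
    Edge-specific GMM weights are replaced by a shared linear map here."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, A):                      # x: (N+1, in_dim), A: (N+1, N+1)
        A_hat = A + torch.eye(A.size(0))          # keep each node's own feature
        deg = A_hat.sum(-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.lin((A_hat / deg) @ x))


class AGCN(nn.Module):
    """Multilayer AGCN with a residual connection bridging the first and
    last layers, as in Fig. 1 (right)."""
    def __init__(self, in_dim, hidden=64, num_layers=3):
        super().__init__()
        dims = [in_dim] + [hidden] * num_layers
        self.layers = nn.ModuleList(AGCNLayer(dims[i], dims[i + 1])
                                    for i in range(num_layers))
        self.skip = nn.Linear(in_dim, hidden)     # residual projection

    def forward(self, x, A):
        h = x
        for layer in self.layers:
            h = layer(h, A)
        return h + self.skip(x)                   # first-to-last residual connection
```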

To denote the outputs of the two branches after the AGCN module, we use \(Y^c\) for the first branch, with dimensions \(M\times d\), and \(Y^o\) for the second branch, with dimensions \(K\times d\), where d is the feature dimension:

$$\begin{aligned} Y^c=\text {AGCN}(G^c,\Omega ^c),Y^o=\text {AGCN}(G^o,\Omega ^o)\end{aligned}$$
(8)

Here, \(\Omega ^{c}\) and \(\Omega ^{o}\) denote all parameters associated with the two branches.

3.4 Classification head and network training

We use \(Y_{{\bar{v}}}^{c}\) and \(Y_{{\bar{v}}}^{o}\) to denote the representations of the absorbing nodes in the center point and voxel graphs, respectively. As detailed in Sect. 3.3.2, the absorbing node consolidates information from all event nodes, making it an effective summary of the graph-level information. Consequently, we concatenate \(Y_{{\bar{v}}}^{c}\) and \(Y_{{\bar{v}}}^{o}\) and feed them to an MLP that produces the final prediction for 3D pose estimation:

$$\begin{aligned} Y=\textrm{MLP}(Y_{{\bar{v}}}^{c}\Vert Y_{{\bar{v}}}^{o})\end{aligned}$$
(9)

Here, \(\Vert\) denotes the concatenation operation. We insert dropout and batch normalization layers between the MLP layers to mitigate overfitting and gradient issues. The entire network is trained end-to-end and optimized with the negative log-likelihood loss [46].
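The prediction head of Eq. (9) and the training objective stated above can be sketched as follows; the hidden width, dropout rate, and number of outputs are illustrative placeholders rather than the settings used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionHead(nn.Module):
    """Eq. (9): concatenate the absorbing-node features of both branches and
    map them through an MLP, with BatchNorm and Dropout between layers as
    described in Sect. 3.4. All sizes here are illustrative."""
    def __init__(self, feat_dim=64, hidden=256, num_outputs=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(hidden, num_outputs),
        )

    def forward(self, y_c_abs, y_o_abs):          # absorbing-node features of G^c and G^o
        return self.mlp(torch.cat([y_c_abs, y_o_abs], dim=-1))

# One training step with the negative log-likelihood objective stated above.
head = PredictionHead()
logits = head(torch.randn(8, 64), torch.randn(8, 64))            # batch of 8
loss = F.nll_loss(F.log_softmax(logits, dim=-1), torch.randint(0, 10, (8,)))
loss.backward()
```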

Table 1 Protocol #1 evaluates the reconstruction error using Mean Per Joint Position Error (MPJPE) in millimeters on the Human3.6M dataset
Fig. 2

Loss on the training set and MPJPE on the test set

Fig. 3

A qualitative comparison is conducted with MGCN, MixSTE, and U-CondDGConv for subjects S9 and S11 on two actions within the Human3.6M dataset. Noticeable improvements are emphasized and magnified

4 Experiments

4.1 Datasets and evaluation

Our experiments are conducted on three widely used datasets in the field of 3D human pose estimation: Human3.6M, HumanEva-I, and MPI-INF-3DHP.

For Human3.6M, we utilize data from human subjects labeled as S1, S5, S6, S7, and S8 for training, aligning with established practices in the field [21,22,23, 32]. Data from subjects S9 and S11 are reserved for testing.

In the case of HumanEva-I, following the approach taken in [20] and [22], we use data for the "walk" and "jog" actions from subjects S1, S2, and S3 for both training and testing.

Regarding MPI-INF-3DHP, we adhere to the experimental settings outlined in the recent state-of-the-art work [54] to ensure a fair and rigorous comparison.

We employ the standard evaluation protocols, namely Mean Per Joint Position Error (MPJPE) and Procrustes-aligned MPJPE (P-MPJPE). MPJPE is the mean Euclidean distance between the predicted and ground truth 3D pose joints after alignment at the root joint. P-MPJPE additionally applies a rigid alignment (scale, rotation, and translation) of the predicted pose to the ground truth before measuring the error. These metrics are consistent with previous studies [58,59,60], ensuring that our results are directly comparable within the field of 3D human pose estimation.
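For reference, the two metrics can be computed for a single pose pair as sketched below; the Procrustes alignment follows the standard similarity-transform solution, poses are assumed to be (J, 3) arrays in millimeters, and the root joint is assumed to be index 0.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: both (J, 3) poses are translated so the
    root joint (index 0) sits at the origin before measuring."""
    pred = pred - pred[:1]
    gt = gt - gt[:1]
    return np.linalg.norm(pred - gt, axis=-1).mean()

def p_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: find the similarity transform (scale,
    rotation, translation) that best maps pred onto gt, then measure."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    X, Y = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(X.T @ Y)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # avoid an improper reflection
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (X ** 2).sum()
    aligned = scale * X @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=-1).mean()
```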

4.2 Implementation details

We provide a comprehensive overview of the implementation details of our PVA-GCN model, covering three primary aspects: 2D pose detections, model configuration, and training hyperparameters. For a fair and consistent comparison with prior works such as [21, 22], we adopt the same 2D pose detections for Human3.6M and HumanEva-I: the CPN detections comprise 17 joints and the MRCNN detections 15 joints, which determines the granularity of the pose information used in our experiments.

Our model exposes several tunable hyperparameters, providing flexibility for optimization. Ablation studies on Human3.6M systematically vary the number of channels and input pose frames (\(C_{out}\), T) to assess their impact on performance. Experiments were conducted on four GeForce GTX 4060 GPUs with batch sizes of 512, 256, and 256 for Human3.6M, HumanEva-I, and MPI-INF-3DHP, respectively. Leveraging sparse 3D supervision, our approach achieves state-of-the-art performance in lifting 2D to 3D poses while requiring significantly fewer 3D labels than previous methods.

4.3 Comparison with state-of-the-art

Table 1 and Fig. 3 present a detailed comparison between our PVA-GCN approach and state-of-the-art methods, with results reported on the Human3.6M and HumanEva-I datasets under Protocol #1 and Protocol #2, respectively. The results depend on whether our implementation is optimized from ground truth (GT) 2D poses with or without the loss for reconstructing the intermediate 3D pose sequence; this choice exposes the trade-off between accuracy and computational cost and indicates the robustness of our approach across datasets and scenarios.

Figure 2 provides insights into the training process of our PVA-GCN on the Human3.6M dataset, illustrating convergence dynamics and visually representing the evolution of the model’s loss function over training epochs.

Results in Table 2 on the HumanEva-I dataset under Protocol #2 confirm the superiority of our method over state-of-the-art alternatives, particularly in reducing the reconstruction error. Notably, this improvement is achieved using the MPJPE loss alone, highlighting the efficacy of our model in improving the accuracy of 3D human pose estimation.

To further evaluate our approach, we qualitatively compare it with a state-of-the-art method lacking a 3D pose sequence reconstruction module [22]. This comparison aims to highlight the nuanced improvements achieved by our model in capturing the intricacies of 3D human pose. Our visual analysis, focusing on specific instances like the "S11 WalkT." action, reveals that our method produces more accurate and coherent representations of joint movements compared to the alternative method. The absence of a 3D pose sequence reconstruction module in the compared approach becomes apparent in scenarios where capturing temporal dynamics is crucial for accurate pose estimation. This qualitative evaluation further substantiates the advantage of our model in effectively leveraging temporal information for superior 3D human pose estimation.

Previous research predominantly focuses on performance evaluation using estimated 2D pose data, such as CPN or HR-Net pose data. This emphasis has implications for benchmarking, where evaluation criteria may favor methods adept at handling lower-quality 2D pose data, potentially influencing the perceived efficacy of different approaches (Tables 3, 4).

Acknowledging a limitation, our method exhibits subpar performance compared to recent approaches [33, 30, 54] in scenarios with relatively low-quality estimated 2D pose data. This highlights a specific challenge and prompts a deeper exploration of methodologies to enhance the robustness of 3D human pose estimation models under varied 2D pose data quality conditions. Addressing this challenge is crucial for advancing the applicability of such models in real-world scenarios with diverse data sources (Table 5).

Table 2 Results from Protocol #2 for HumanEva-I are presented
Table 3 Outcomes from Protocol #1 for MPI-INF-3DHP are provided
Table 4 Protocol #2 assesses the reconstruction error following rigid alignment, measured with P-MPJPE (millimeters), on the Human3.6M dataset
Table 5 Comparison with state-of-the-art methods on Human3.6M is conducted, implementing various receptive fields for ground truth 2D pose in the evaluation of Protocol #1

Discussion: the Effect of 2D Pose Quality. Going back to the inception of 3D pose lifting research, Martinez et al. [20] employed the SH 2D pose detector, fine-tuned on the Human3.6M dataset, to enhance 3D human pose estimation (HPE). This refinement led to a significant reduction in the average Mean Per Joint Position Error (MPJPE), from 67.5 mm to 62.9 mm, underscoring the pivotal role of high-quality 2D pose data in 3D HPE. Recent works, including [23, 30, 33], leveraged the advanced 2D pose detector HR-Net, achieving even better performance, with an average MPJPE of 39.8 mm. Furthermore, Zhu et al. [31] achieved notable progress by fine-tuning the SH network [26] on the Human3.6M dataset, resulting in an average MPJPE of 37.5 mm. However, it is important to note that these advancements still fall short of the results achieved when using ground truth (GT) 2D pose data.

The same pattern holds for the HumanEva-I and MPI-INF-3DHP datasets. As shown in Table 2, our method achieves a substantial 41% decrease in P-MPJPE on the HumanEva-I dataset: with ground truth (GT) 2D pose data, the P-MPJPE drops from 15.3 mm to 9.3 mm relative to the best-performing state-of-the-art algorithm. On the MPI-INF-3DHP dataset, the MPJPE decreases from 32.2 mm to 26.76 mm.

As a result, the performance improvement of estimated poses predominantly hinges on the quality of 2D pose data. This quality can be achieved either by employing advanced 2D pose detectors capable of generating pose data closely resembling ground truth (GT) 2D pose or by fine-tuning existing pose detectors as necessary. In contrast, the utility of reconstructed 3D pose data generated by advanced pose detectors remains uncertain in certain scenarios. One such scenario is 3D human pose estimation in real-world conditions, typically evaluated through qualitative visualization [29]. Nevertheless, the question of whether 3D pose reconstructed from estimated 2D pose data can effectively contribute to pose-based tasks remains an area that has not been thoroughly explored. Given the straightforward nature of improving the performance of estimated 2D pose and the absence of clearly defined practical use cases, we argue that comparisons based on GT 2D pose data offer a more accurate representation of a model’s 3D human pose estimation (HPE) capability than comparisons based on estimated 2D pose data.

4.4 Ablation studies

In our ablation analysis, we remove individual components from our model design, which comprises the Voxel, Point, and Proxy-Node layers. To assess the contribution of the AGCN layers, we compare our model with a variant implemented using ST-GCN [41] blocks, i.e., with the AGCN ablated. The Protocol #2 results on Human3.6M and HumanEva-I consistently indicate superior performance with the AGCN blocks, as shown in Table 6. To ablate the strided design, we instead apply average pooling along the second (temporal) dimension of the feature map. Removing the strided design not only enlarges the feature map from F(\(C_{out}\), 1, N) to F(\(C_{out}\), T, N) but also degrades the accuracy of 3D human pose estimation (3D HPE).

To validate the effectiveness of our Proxy-Node layer design, we compare it with a fully connected layer that takes the expanded feature map as input. The results in Table 6 show that our individually connected layer achieves a significant performance gain by effectively leveraging the structured representation of the GCN. Feature distinctions before the prediction layers (i.e., the individually and fully connected layers) are visualized in the upper and lower rows of Fig. 4, respectively. These visualizations show that the individually connected layer produces interpretable, joint-specific features that a standard fully connected layer cannot match; for instance, the independence of arm and leg joint features in actions such as "eating" and "walking" demonstrates that our approach preserves the nuances of each movement while maintaining predictive accuracy.

Discussion: Limitation on Model. Similar to state-of-the-art methodologies, our approach confronts the challenge of heightened computational overhead. Notably, the data presented in the lower section of Table 5 underscores that our model surpasses the performance of state-of-the-art methods while requiring slightly more model parameters. This accentuates the dual limitation of increased computational demands and a marginal rise in model complexity. Addressing these challenges constitutes a central focus for our future work, where advanced techniques like model pruning will be explored to optimize efficiency without compromising performance.

Table 6 An ablation study is performed to analyze the key designs of our PVA-GCN
Fig. 4

Visualizations of inter-joint feature cosine similarity are presented for actions "Walking" (first three columns) and "Eating" (last columns) in the Human3.6M dataset

Furthermore, echoing the constraints of existing methodologies [28, 30, 33], our approach exhibits a reliance on extensive training data. Despite achieving superior performance compared to state-of-the-art methods [23, 30], our model shows a dependency on a larger volume of training data. Subsequent endeavors will be dedicated to refining the model’s generalization capabilities and diminishing its dependence on extensive datasets, thereby enhancing overall efficiency and applicability.

5 Conclusion

In this paper, we propose a novel point-voxel absorbing graph convolutional network (PVA-GCN) method for addressing the problem of 3D human pose estimation. Our approach involves transforming the event stream into a sparse event cloud and voxel grids, creating a joint representation that strikes a balance between performance and efficiency. The dual representations facilitate improved performance by addressing the challenges of fragmented node feature learning and global classification feature aggregation encountered in previous event-based classification models. To achieve this, we introduce absorbing nodes into the dual graphs for global information aggregation, and employ absorbing graph convolution networks (AGCN) for structured feature learning and global feature aggregation simultaneously.

Our PVA-GCN framework’s efficacy has been thoroughly validated through extensive experiments on multiple benchmark datasets for event-based classification. The results of these experiments showcase the superior performance of PVA-GCN when compared to state-of-the-art methods, utilizing ground truth (GT) 2D poses across datasets like Human3.6M, HumanEva-I, and MPI-INF-3DHP. We have substantiated the appropriateness of our model design through comprehensive ablation studies and visualizations. Additionally, studies such as [61, 62] offer valuable insights into leveraging graph-based methodologies for point cloud registration and innovating image quality assessment. These insights contribute significantly to discussions on 3D human pose estimation. In our future work, we plan to address the challenge of parameter efficiency by incorporating tuning techniques [63]. Furthermore, we intend to explore the impact of our model in diverse application scenarios, such as human behavior understanding. Additionally, we will delve into the examination of other loss terms, including those based on bone features [57] and motion trajectory [23], to further refine our approach.