1 Introduction

Extracting three-dimensional (3D) human body skeletons from geometric surfaces is an important research topic in the fields of computer graphics, pattern recognition, and human–computer interaction. It has wide applications in pose estimation [24, 28], human body modeling [2, 31], and skeleton manipulation [10, 30]. In general, skeleton extraction methods can be divided into two categories according to different inputs: point cloud-based methods [13, 33, 35] and mesh-based methods [7, 32]. While many research work are devoted to human body skeleton extraction from static point clouds (e.g., [13]), the results are unsatisfactory due to many factors, e.g., self-occlusion, environmental noise.

Although a few skeletonization technologies from point clouds have been proposed and achieved great successes [7, 13, 32], those approaches suffer from issues when applying to extract body skeletons from consecutive point clouds: the number of body skeletons extracted from different frames may differ; many body skeletons locate with great errors due to the ignorance of prior structure of human body. The first issue prevents those technologies from directly applying to skeleton-based animation as well as 3D human body operation, while the second issue reduces the effect of human–computer interaction. In essence, spatial and temporal coherence of human poses is far from being sufficiently explored.

The recent development of light field acquisition devices provides an opportunity for solving those issues. Compared with traditional multiview devices, light field devices can automatically capture a sequence of multiview images of human body that performs continuous motion based on the lighting and frequency setting, while traditional multiview devices can only collect single-frame multiview image sets under specific actions. Based on a light field device, we propose a spatial–temporal consistency model (STC) for extracting human body skeletons from consecutive point clouds. The key of STC is a spatial–temporal consistency adjustment model, which fine-tunes the location of skeletons by exploiting both inter-frame and intra-frame consistency of skeletons. Compared with traditional skeleton extraction methods, STC is unsupervised, completely automated, and requires no training stages. Experimental results on light field acquisition data indicate that 3D human body skeletons produced by STC are precise and suitable for applications such as skeleton-based animation.

This paper is organized as follows. Section 2 reviews related work on body extracting skeletons of human body and skeletonization of objects. Section 3 proposes the methodology of STC. Section 4 gives the experimental results. Section 5 concludes this paper.

2 Related work

We review related work on extraction of human body skeletons and object skeletonization based on static surfaces and consecutive surfaces.

2.1 Skeleton extraction from static surfaces

Static surfaces can be divided into static mesh surfaces and static point clouds.

2.1.1 Skeleton extraction from mesh surfaces

Curvature flow methods Tagliasacchi et al. proposed a curve skeletonization method based on the property of area minimization of mean curvature flows (MCF) by assigning the curvature flow with extreme values [33]. Chuang  et al. provided an effective flow curve method for mesh evolution, which requires no initialization at the beginning of each step due to the proposed finite-elements hierarchy [6].

Contraction methods Au  et al. generated curve skeletonization by contracting the mesh geometry into zero volume skeleton shape using implicit Laplacian smoothing and global position constraints; the skeletonization method can retain the shape and topology relationship of the original mesh [1]. Cao  et al. developed a curve skeletonization technology based on local Delaunay triangulation and topology refinement [7]. A contraction operation was proposed to repair collected skeletons from meshes with missing data and applied to surfaces with boundary. Jiang  et al. proposed an algorithm to extract curve skeletonization from triangle meshes [15]. In this work, the initial skeletonization map is constructed by copying the connectivity and geometric information of the input mesh, and then, the nodes of the skeletonization map are iteratively generated by using the coupling process of graph contraction and surface clustering.

Mesh decomposition methods Katz  et al. proposed a skeleton extraction method by dividing complex objects into simple sub-objects using a hierarchical mesh decomposition algorithm [16]. Li  et al. decomposed meshes into semantic segmentation based on the idea of edge contraction and space scanning [20]. Sharf  et al. proposed a real-time skeleton extraction algorithm from both point clouds and polygonal meshes [29]. The algorithm is based on a deformable model evolution process, which captures the volume and shape of objects. The deformable model, which consists of multiple competing fronts, evolves in the interior of objects in a rough-to-fine fashion, tracks the center of these surfaces, and then merges and filters the generated arcs to obtain curve skeletons of objects. Chuang  et al. proposed a 3D object shape description based on generalized cylinders [4]. The derived generalized cylinder representation is better than the object shapes based on simple generalized cylinder subclasses.

2.1.2 Skeleton extraction from point clouds

Skeleton extraction from missing data Cao et al. developed a contraction operation for generalized discrete geometry data by local Delaunay triangulation and topology refinement, which handles missing data without explicit surface reconstruction [7]. Tagliasacchi et al. proposed an algorithm to extract skeleton curves from point clouds with a large amount of missing data based on generalized rotational symmetry axis of an oriented point set [33]. Huang  et al. developed a L1-media skeleton construction algorithm, which can be directly applied to unoriented raw point scan with significant noise, outliers and large area of missing data, without strong requirements for the quality of input point clouds [13]. Zhang  et al. proposed \(\ell _0\)-regularization-based skeleton optimization method from continuous point set of dynamic human body [35]. By integrating spatiotemporal constraints, the method recovers missing points in the skeletons, corrects the outliers in the skeleton, and maintains the motion characteristics.

Other skeleton extraction methods Liang  et al. proposed a framework for skeletonization of point clouds by using a discretization scheme of differential operators and applied to geometric understanding of point clouds [21]. The framework defined a discretized Laplace–Beltrami operator on point clouds, which effectively combines local information with global information. Zhang  et al. proposed to reconstruct a skeletonization of trees using an enhanced PyrLK optical flow method [36]. The method circumvents the issue of manual interaction and inaccuracy and facilitates automatic tree modeling by reconstructing a 3D skeletonization model of trees with realistic sense. Mei  et al. proposed an incomplete point cloud skeletonization of trees from laser scanning data, by using a hybrid model consisting of an L1 intermediate skeleton algorithm and minimum spanning tree algorithm [22].

In summary, the existing skeleton extraction methods from static surfaces suffer from three shortcomings:

  • Imperfect point clouds with missing data cannot be well treated [7, 21, 33].

  • Closed loops and glitches exist in extracted skeletons [6, 7, 15].

  • Incomplete skeletons or skeletons with error branches may be produced [35].

2.2 Skeleton extraction from consecutive surfaces

Topology matching methods Topology matching is particularly useful for interactive retrieval of 3D objects. Hilaga  et al. proposed a topology matching technology by comparing multi-resolution Reeb map, calculating the similarity between polyhedral models, establishing the corresponding relationship between various parts of the object, and achieving accurate search of 3D shape data set [12]. Dey  et al. defined a medial geodesic function and derived an approximate algorithm of curve skeletons; the generated curve skeletonization together with proper attributes can model different types of real-world objects [9]. Chen  et al. proposed a 3D object shape description based on generalized cylinders, which can better approximate the entity [4]. Zimovnov  et al. proposed an effective algorithm for calculating the three-dimensional distance transformation of voxels inside visual shells to form the first approximation of curve skeletons [37]. Zheng  et al. regarded the curve skeletons of shapes as a global description feature and assumed that the skeleton structure of the captured shape be consistent for a period of time [38]. Other scholars extract skeletons by determining motion postures based on the observation of continuous frames of body movements [5, 35].

Animation driven methods Wang  et al. proposed a shape correspondence method based on base point driving, which can extract joint object skeleton from 3D mesh shape, and can be applied to skeleton driven animation [34]. Aguiar  et al. proposed a robust framework, which automatically extracts motion skeleton and surface skin weight from any mesh animation [8]. Using this framework, deformation mesh sequences can be automatically switched to fully assembled virtual objects completely. Pantuwong  et al. proposed an algorithm for automatically generating inverse kinematic skeleton of characters and did not require the input 3D character model which have a certain attitude or direction [25]. Le  et al. introduced an example-based assembly method for automatically generating a linear hybrid skin model with skeleton structure [19]. James  et al. extended the method of skinned characters to a general setting of skinned deformable mesh animation and provided an algorithm of automatic and progressive skin approximation, which is particularly effective for pseudo-joint motion [14]. Baran  et al. proposed an automatic character animation method based on static character meshes and general skeletons, by attaching skeletons to the surface of characters for realistic animation [3]. Pang  et al. proposed to extract skeletons from animation surfaces by using a global skeleton alignment method, which can spread the key skeletons to the initial skeletons [26].

In summary, the existing skeleton extraction methods from consecutive surfaces suffer from two shortcomings.

  • Inter-frame consistency skeletons are difficult to produce.

  • Most of methods are unsuitable for the data of point cloud with outliers, noises, and much missing points.

3 Details of STC

We introduce details of STC in this section. Figure 1 shows a flowchart of STC, and Algorithm 1 shows an algorithm of STC. STC mainly consists of four stages, each of which is detailed in the following subsections.

Denote \(\Vert \mathbf {A}\Vert _F\) to be the Frobenius norm of a matrix \(\mathbf {A}\), and denote \(\Vert \mathbf {v}\Vert _2, \Vert \mathbf {v}\Vert _0\) to be the \(\ell _2\) norm, the \(\ell _0\) pseudo-norm of a vector \(\mathbf {v}\), respectively. Denote \([\mathbf {A}]_{i,j}\) to be the element of the ith row, jth column of \(\mathbf {A}\), and denote \([\mathbf {A}]_{j}\) to be the jth column of \(\mathbf {A}\).

3.1 Data preprocessing

The first stage of STC is data preprocessing, which consists of three steps (see Lines 1–3 of Algorithm 1).

Multiview image collection We set different frame rates according to different actions (see Table 1) and collect multiview images of a moving human body of each action using the light field acquisition device (see Fig. 2), which contains 50 industrial cameras with a given frame rate (see Fig. 1a).

Point cloud generation, normalization, and alignment We generate a 3D dense point cloud of human body using patch-based multiview stereopsis (PMVS [11]) and then perform a normalization and alignment scheme to scale to a unit box and move to the origin on the point cloud so that all point clouds of human body of an action sequence share similar sizes, geometric centers and orientations (see Fig. 1b–e).

Fig. 1
figure 1

A flowchart of spatial–temporal consistency model. a Multiview image collection; b sparse point cloud; c dense point cloud of whole scene; d dense point cloud of human body; e normalization and alignment; f Poisson surface reconstruction; g initial skeleton extraction; and h skeleton adjustment

Triangular mesh reconstruction In order to perform a semantic segmentation on human body, we require a mesh representation of human body besides the point cloud model. Thus, we downsample the dense point cloud by merging multiple points within the same grid box into a single point, whose color intensity (normal, resp.) is determined by averaging the color intensity (normal, resp.) of multiple points in a box whose size is set as \(0.03\times 0.03\times 0.03\), and perform Poisson surface reconstruction [17] to obtain a triangular mesh of human body (see Fig. 1f).

Table 1 Different frame rates for different actions

3.2 Initial skeleton extraction

The second stage of STC extracts initial skeletons from point clouds of each frame individually based on a semantic segmentation of triangular meshes of human body (see Fig. 3). Specially, this stage consists of four steps (see Lines 4–7 of Algorithm 1) which are introduced as follows.

Pseudo-skeleton generation (Fig. 3b, c): We segment the mesh into several semantic patches using [18] (Fig. 3b) and generate “pseudo-skeletons” using the centroid of each patch (Fig. 3c). Those pseudo-skeletons differ from standard human body skeletons in two aspects: pseudo-skeletons always have different numbers with standard skeletons within each body component, and may have incorrect locations compared with standard skeletons. We solve the first issue using the following two steps and leave the second issue until Sect. 3.4.

CShoulder, Waist, and head determination (Fig. 3d): We connect each pair of pseudo-skeletons belonging to adjacent semantic patches with an edge, and CShoulder is recognized as the unique pseudo-skeleton which achieves the maximum degree.Footnote 1 Similarly, Waist is recognized as the unique pseudo-skeleton which achieves degree three.Footnote 2 Then, we determine head as the only point which both achieves degree one and is adjacent to CShoulder.

Fig. 2
figure 2

A light field acquisition system

Fig. 3
figure 3

A flowchart of initial skeleton extraction. a Input point cloud; b semantic segmentation; c pseudo-skeleton generation; d Determining CShoulder, Waist, head; e determining LShoulder, RShoulder; and f standard skeleton completion

Fig. 4
figure 4

Standard human model: a directed rooted tree of 20 body skeletons (black spheres) and six body components

figure e
(colored line segments)

LShoulder and RShoulder determination (Fig. 3e): After determining CShoulder and Waist skeletons, we observe that the patch corresponding to CShoulder includes LShoulder and RShoulder additionally. To determine their locations, we first select the leftmost adjacent patch (i.e., left upper arm) and rightmost adjacent patch (i.e., right upper arm) of current patch by human body topology connection obtained when model segmentation. Then, we divide points of current patch into three subpatches with equal cardinality according to a distance rule: the first (second, resp.) subpatch is the point set which achieves closest distances to the leftmost (rightmost, resp.) adjacent patch. Finally, we determine LShoulder and RShoulder using the centroid of the first and second subpatches, respectively.

Standard skeleton completion (Fig. 3f): So far we obtain several pre-defined pseudo-skeletons and four skeletons (CShoulder, Waist, LShoulder, RShoulder), whose number may differ from standard skeletons. To fulfill an initial skeleton extraction with the same number and similar locations to standard skeletons, we divide the collection of all pseudo-skeletons and those four skeletons into six subsets corresponding to six components of human body: Torso, Head, LArm, RArm, LLeg, RLeg, according to their connectivity (see Fig. 4). Note each component corresponds to a number of standard skeletons. If the number of pseudo-skeletons of a component exceeds the number of corresponding standard skeletons, we select the shortest edge among all edges of the component and replace both of two endvertices of that edge with their center; if the number of pseudo-skeletons of a component is less than the number of corresponding standard skeletons, we select the longest edge among all edges of current component and add its center as a new pseudo-skeleton. Either of two tricks is repeated until the number of pseudo-skeletons equals the standard number of current component.

Fig. 5
figure 5

Effect of inter-frame skeleton matching. Column 1: the 16th frame (top) and 21st frame (bottom) of Walking. Columns 2–21: \(\textcircled {{01}}\)-\(\textcircled {{20}}\)skeletons of the 16th frame (top) and 21st frame (bottom) of Walking (marked by green). All the skeletons of the 21st frame are correctly matched

3.3 Skeleton matching

The third stage of STC is to match skeleton points between consecutive frames, i.e., to establish the correspondence between skeletons of different frames so that each skeleton of different frames is correctly matched. We first align CShoulder of all frames so that all CShoulder skeletons share the same coordinate. Next, we find a correspondence between two arms (and two legs) of pairwise adjacent frames, by comparing the total distance from LArm and RArm of the next frame with respect to LArm of the previous frame, i.e., we denote \(\mathbf {x}_{t,i}\in \mathbb {R}^3\) to be the 3D coordinates of the ith skeleton of the tth frame; if

$$\begin{aligned} \sum _{i=2}^{5}\Vert \mathbf {x}_{t,i} - \mathbf {x}_{t+1,i}\Vert _2^2 < \sum _{i=2}^{5}\Vert \mathbf {x}_{t,i} - \mathbf {x}_{t+1,i+4}\Vert _2^2 \end{aligned}$$
(1)

holds, then the skeletons of the arm of the \((t+1)\)th frame are correctly matched; otherwise we switch the skeletons of two arms of the \((t+1)\)th frame from LArm to LArm. The correspondence between two legs is computed in a similar fashion.

figure f

Figure 5 shows the effect of the skeleton matching of the 16th frame and the 21st frame of Walking. The leftmost subfigure of Fig. 5 denotes the 16th frame (top) and 21st frame (bottom) before matching. The top row represents all skeleton points of the 16th frame, while the bottom row represents all skeleton points of the 21st frame after matching.

The xyz coordinates of initial skeletons obtained in this section are denoted by \(\mathbf {X}^{(1)}_\mathrm{init}, \mathbf {X}^{(2)}_\mathrm{init}, \mathbf {X}^{(3)}_\mathrm{init}\in \mathbb {R}^{T\times 20}\), respectively, where T, 20 are total frame number and total skeleton number, respectively, and the tth row of \(\mathbf {X}^{(k)}_\mathrm{init}\) corresponds to the coordinates of initial skeletons at the tth frame, \(k=1,2,3\), \(t=1,\ldots ,T\).

$$\begin{aligned}&\min _{\{\mathbf {X}^{(k)}\}_{k=1}^3} \underbrace{\sum _{k=1}^3 {\Vert \mathbf {D}\mathbf {X}^{(k)}\Vert }_0}_{\mathrm{inter\text {-}frame~consistency}} +\underbrace{\alpha \sum _{k=1}^3\sum _{j=2}^{20} \left\| [\mathbf {X}^{(k)}]_{j}-(1-\epsilon )\cdot [\mathbf {X}^{(k)}_\mathrm{init}]_{j} -\beta _j\epsilon \cdot [\mathbf {X}^{(k)}_\mathrm{init}]_{\mathrm{parent}(j)}\right\| _2^2}_{\mathrm{intra\text {-}frame~consistency}}\nonumber \\ \beta _j= & {} \left\{ \begin{array}{ll} 1 &{}\text{ if } j=2,3,6,7,11,12,13,16,17,20 \\ -1 &{}\text{ if } j=4,5,8,9,14,15,18,19\end{array}\right. ,~j=2,\ldots ,20, \\ {[}\mathbf {D}]_{i,j}= & {} \left\{ \begin{array}{ll} -1 &{} \text{ if } 2\le i\le T-1 \wedge i=j\pm 1\\ 2 &{} \text{ if } 2\le i\le T-1 \wedge i=j\\ 0 &{} \text{ otherwise } \end{array} \right. , ~i,j=1,\ldots ,T,\nonumber \end{aligned}$$
(2)

3.4 Skeleton adjustment

The fourth stage of STC adjusts skeletons by using a spatial–temporal consistency adjustment model (see Fig. 1h). Our motivation arises from two observations. For one thing, positions of each skeleton of a motion sequence exhibit continuous change, i.e., for almost all frames, the position of a skeleton can be given by the median value of the positions of the same skeleton of the former and latter frames (see Fig. 6 which demonstrates the smooth change of the positions of a skeleton); for another, for each frame, semantic segmentation produced by [18] is imprecise: most non-root skeletons locate far from the corresponding “parent skeletons” defined by Fig. 4 except four ending skeletons (LHand, RHand, LFoot, RFoot) which locate close from their “parent skeletons.” The reason is that each of those four skeletons locates at the end of a body component, and the segmented patch produced by [18] cannot distinguish that skeleton from its parent skeleton. Based on the argument above, we propose the spatial–temporal consistency adjustment model (2), which consists of an inter-frame consistency term and an intra-frame consistency term (\(\alpha >0\) is a parameter for balancing two terms). The first term enforces the medium representation of skeletons of almost all frames, with a median representation matrix \(\mathbf {D}\in \mathbb {R}^{T\times T}\); the second term enforces a framewise fine-tuning over all non-root skeletons for approaching or keeping away from the corresponding parent skeletons, with pre-given parameter \(\beta _j,j=2,\ldots ,20\). That parameter determines whether each skeleton approaches or keeps away from its parent skeleton. Specially, for most of skeletons, we set \(\beta _j=1\) to guarantee that each of them approach its parent skeleton; for skeletons \(\textcircled {{04}},\textcircled {{05}}, \textcircled {{08}},\textcircled {{09}}\), \(\textcircled {{14}},\textcircled {{15}}, \textcircled {{18}},\textcircled {{19}}\) which belong to four limbs LArm, RArm, LLeg, RLeg, respectively, as two skeletons locating at the end of each limb are closer to the parent node, we set \(\beta _j=-1\) to keep them away from each corresponding parent skeleton.

Fig. 6
figure 6

Coordinate changes at LElbow on the 20th, 33rd, 49th, and 66th frames of Arm Stretching

To solve model (2), we introduce auxiliary matrices \(\mathbf {Y}^{(k)}\in \mathbb {R}^{T\times 20}\) for replacing \(\mathbf {D}\mathbf {X}^{(k)}\), \(k=1,2,3\), and transform (2) into (3)

$$\begin{aligned}&\min _{\{\mathbf {X}^{(k)},\mathbf {Y}^{(k)}\}_{k=1}^3} \sum _{k=1}^3 \left( \Vert \mathbf {Y}^{(k)}\Vert _0 +\alpha \sum _{j=2}^{20}\left\| [\mathbf {X}^{(k)}]_{j}- (1-\epsilon )\cdot [\mathbf {X}^{(k)}_\mathrm{init}]_{j}\right. \right. \nonumber \\&\left. \left. -\beta _j\epsilon \cdot [\mathbf {X}^{(k)}_\mathrm{init}]_{\mathrm{parent}(j)} \right\| _2^2 + {\lambda }\Vert \mathbf {Y}^{(k)} -\mathbf {D}\mathbf {X}^{(k)}\Vert _F^2 \right) . \end{aligned}$$
(3)

by using naive Lagrange multiplier method, where \(\lambda >0\) is the penalty parameter. We then solve (3) by alternating solving two subproblems of \(\mathbf {X}^{(k)}\) and \(\mathbf {Y}^{(k)}\) (see Lines 1013 of Algorithm 1).

Fig. 7
figure 7

Multiview images of three actions captured by the light field acquisition system

4 Experimental results and analysis

In this section, we evaluate the effectiveness of STC by comparing it with state-of-the-art methods. The experiments are conducted on an Intel(R) Core(TM) i5-8250U 1.8 GHZ CPU with 8 GB RAM using Matlab R2016a. We collect multiview color images of three actions by using 50 industrial cameras with 2.2 million pixels through the light field acquisition system (Fig. 2), and the captured images are of \(2048\times 1088\) resolution. The three actions are detailed as follows:

  • Arm Stretching: which consists of 40, 000 images collected from 50 different perspectives in 160 s, with frame rate 5 fps (see Fig. 7a).

  • Walking: which consists of 90, 000 images collected from 50 different perspectives in 60 s, with frame rate 30 fps (see Fig. 7b).

  • Arms&Legs Moving: which consists of 120, 000 images collected from 50 different perspectives in 80 s, with frame rate 30 fps (see Fig. 7c).

Fig. 8
figure 8

Top: input data of the 25th, 50th, 55th, 63rd, 76th, 84th, 88th, and 90th frames of Arm Stretching. Middle: results before skeleton adjustment. Bottom: results after skeleton adjustment

Fig. 9
figure 9

Top: input data of the 16th, 21st, 26th, 31st, 36th, 46th, 56th, and 61st frames of Walking. Middle: results before skeleton adjustment. Bottom: results after skeleton adjustment

Fig. 10
figure 10

Top: input data of the 71st, 86th, 91st, 101st, and 111th frames of Walking. Middle: results before skeleton adjustment. Bottom: results after skeleton adjustment

Fig. 11
figure 11

Top: input data of the 643rd–650th frames of Arms&Legs Moving. Middle: results before skeleton adjustment. Bottom: results after skeleton adjustment

Fig. 12
figure 12

Top: input data of the 651st, 652nd, 655th, 656th, 657th, 658th, 659th, 660th, 661st, and 662nd frames of Arms&Legs Moving. Middle: results before skeleton adjustment. Bottom: results after skeleton adjustment

4.1 Ablation study on skeleton adjustment

While STC consists of three stages, the skeleton adjustment stage plays a key role for the whole STC framework. Thus, we first show comparative results before skeleton adjustment and after skeleton adjustment in Figs. 8, 9, 10, 11, and 12.

According to Fig. 8, the results before skeleton adjustment tend to produce Waist of lower height (see the red boxes), and produce LWrist (RWrist, resp.) which is closer to LElbow (RElbow, resp.) (see the green boxes); moreover, both LLeg and RLeg before skeleton adjustment exhibit abnormal lengths (see the blue boxes). In contrast, results after skeleton adjustment reflect a promising location of body skeletons.

According to Figs. 9 and 10, skeleton adjustment produces a more accurate prediction of LWrist, RWrist, Waist, LAnkle, and RAnkle (the red boxes). Moreover, the length of both LLeg and RLeg is more consistent over those frames after using skeleton adjustment (the blue boxes).

According to Figs. 11 and 12, the results before skeleton adjustment produce the skeletons that are not compatible with the actual skeleton of the human body, such as LHand, RHand, LWrist, RWrist, LFoot, and RFoot (the red boxes). In contract, the skeletons after adjustment are more accurate (the green boxes).

Fig. 13
figure 13

Qualitative results of Tagliasacchi et al. [32] (row 2), Cao et al. [7] (row 3), Huang et al. [13] (row 4), Zhang et al. [35] (row 5), and STC (row 6) of 25th, 50th, 55th, 63rd, 76th, 84th, 88th, and 90th frames of Arm Stretching

Fig. 14
figure 14

Qualitative results of Tagliasacchi et al. [32] (row 2), Cao et al. [7] (row 3), Huang et al. [13] (row 4), Zhang et al. [35] (row 5), and STC (row 6) of 36th, 46th, 56th, 61st, 91st and 86th frames of Walking

Fig. 15
figure 15

Qualitative results of Tagliasacchi et al. [32] (row 2), Cao et al. [7] (row 3), Huang et al. [13] (row 4), Zhang et al. [35] (row 5), and STC (row 6) of 643rd–651st frames of Arms&Legs Moving

Overall, by using the skeleton adjustment, STC produces skeletons which are tidier, smoother, and are closer to real body skeletons, and hence achieves good results from different action sequences.

4.2 Comparative results with state-of-the-art methods

We select four state-of-the-art methods for comparative experiment and introduce them as follows:

  • Tagliasacchi et al. [32] propose a average curvature skeleton extraction method. The authors formulate the skeletonization problem via MCF. While the classical application of MCF is surface fairing, Tagliasacchi et al. take advantage of its area-minimizing characteristic to drive the curvature flow toward the extreme so as to collapse the input mesh geometry and obtain a skeletal structure.

  • Cao et al. [7] propose a Laplacian contraction method. The authors develop a contraction operation that is designed to work on generalized discrete geometry data, particularly point clouds, via local Delaunay triangulation and topological thinning. The method is robust to noise and can handle moderate amounts of missing data, allowing skeleton-based manipulation of point clouds without explicit surface reconstruction.

  • Huang et al. [13] propose an \(\ell _1\) median skeleton extraction method by introducing \(\ell _1\)-medial skeleton as a curve skeleton representation for 3D point cloud data. They adapted \(\ell _1\)-medians locally to a point set representing a 3D shape that gives rise to a one-dimensional structure, which can be seen as a localized center of the shape.

  • Zhang et al. [35] propose an \(\ell _0\)-regularization-based skeleton optimization method from consecutive point sets of kinetic human body to extract consecutive skeletons by using the temporal constraint and a spatial constraint.

Figures 13, 14, and 15 show qualitative results of STC and four state-of-the-art methods for Arm Stretching, Walking, and Arms&Legs Moving, respectively. We summarize the main shortcomings of comparative methods as follows.

Tagliasacchi et al. [32] always produce incomplete skeletons on arms obviously (see the 61st, 96th, frames of Fig. 14, the 643rd, 644th skeletons of Fig. 15, marked by red boxes), as well as inconsistent connection points of four limbs and Torso (see the 50th, 55th, 63rd frames of Fig. 13, the 36th, 46th, 61st, 91st, 86th frames of Fig. 14, and the 645th, 647th, 648th, 649th, 650th, and 651st skeletons of Fig. 15, marked by blue boxes).

Cao et al. [7] suffer from missing of skeletons, especially on LArm-Torso junction, RArm-Torso junction, LLeg-Torso junction, RLeg-Torso junction (see the 25th, 50th, 55th, 63rd, 76th, 84th, 88th, and 90th frames of Fig. 13, the 36th, 46th, 56th, 61st, 91st, and 86th frames of Fig. 14 and the 643rd–649th frames of Fig. 15, marked by blue boxes) and great prediction errors on LArm-Torso junction, RArm-Torso junction (see the 61st frames of Fig. 14), LKnee-Torso junction (see the 63rd frame of Fig. 13), the RArm skeleton (see the 56th frame of Fig. 14), and RLeg-Torso junction (see the 63rd frame of Fig. 13 marked by red boxes). The skeletons are incomplete on LArm of the 643rd frame of Fig. 15 (marked by red boxes).

Huang  et al. [13] suffer from obvious problems such as missing of skeleton points (see the 76th, 84th, and 90th frames of Fig. 13, the 61st, 91st frames of Fig. 14, the 643rd, 645–648th, 650th, and 651st frames of Fig. 15 marked by red boxes), missing of branches (the 76th of Fig. 13, the 645–648th frames of Fig. 15), incorrectness of connection between branches (the 63rd of Fig. 13, the 61st frame of Fig. 14 marked by blue boxes, and the 643rd, 645th frames of Fig. 15 marked by blue boxes), as well as incorrectness of branches (see the 651st frames of Fig. 15 marked by orange boxes).

Zhang et al. [35] occasionally produce incomplete skeletons on the head (the 50th, 55th frames of Fig. 13 marked by red boxes), incomplete skeletons on two arms (see the 644th, 649th, 650th, and 651st frames of Fig. 15), incorrect branches (see the legs of the 646th, 647th frames of Fig. 15 marked by blue boxes), and the skeletons are not an accurate representation of the human body. In particular, for the 46th, 56th, 61st frames of Fig. 14, Zhang et al. [35] produce bent left arms (which should be straight) and incorrect proportion of human body (i.e., shorter legs); for the 643rd, 644th, 645th frames skeletons of Fig. 15, Zhang et al. [35] produce very small difference between extracted skeletons exist in the results of Zhang et al. [35], while those frames of the human pose produce great different movement.

In contrast, STC produces more accurate skeletons generally, without the appearance of incorrect branches, and more complete than above skeletons, and are consistent, response to human posture better. Because initial standard skeleton extraction algorithm based on shape segmentation can extract the 3D human body skeleton with 20 points. The temporal consistency preserving skeleton optimization algorithm has the position constraints of the intra-frame skeleton points and the position constraints of inter-frame skeleton points. Our optimization model makes the final standard skeletons more accurate, more tidy, and more conformable to the original input surfaces, more in line with the actual human body skeleton points distribution. Therefore, STC is better than many traditional skeleton extraction methods and is more convenient for subsequent posture estimation, human body modeling and operation.

4.3 Analysis and discussion

It should be noted that STC has many shortcomings and requires improvement. First, the effect of skeleton extraction of STC heavily depends on the effect of mesh segmentation. Failure of model segmentation may occur when point cloud is seriously missing. Segmentation errors occur when body parts are in contact. As a result, the extracted skeleton is inaccurate. Secondly, compared with deep learning-based methods [23, 27], STC cannot treat singular poses or poses with sudden changes due to the lack of training set. This issue may be circumvented by exploiting motion principles or detecting anchor landmarks (e.g., head) of special actions.

5 Conclusion

We propose a 3D human body standard skeleton extraction method from consecutive surfaces, by using a spatiotemporal consistency model. Our model can be applied to 3D human body standard skeletons extraction from meshes which are reconstructed from either multiview images of moving body or 3D scanned human motion surfaces, without requiring manual intervention. The model produces more complete, tidier, more accurate 3D human body standard skeletons, and facilitates subsequent posture estimation, human modeling and operation.

In the future work, we shall consider generalizing our method to (semi-)supervised fashion, so that singular poses can be inferred. Moreover, we shall consider action specific periodicity estimation for improving skeleton extraction accuracy.