
1 Introduction

The capability to segment the body parts or, more generally, to estimate the skeleton of a person in an unsupervised way is fundamental for many applications: from health-care to ambient assisted living, from surveillance to action recognition and people re-identification. The introduction of affordable RGB-D cameras such as the Microsoft Kinect has boosted research in this area, and many marker-less skeleton estimation algorithms have appeared, such as Shotton’s human pose recognition [1] and Buys’ body pose detection [2]. However, all these systems perform best when the subject is seen frontally by the depth camera, mainly because most of their training examples depict this pose. In this work, we overcome this problem when multiple cameras are available by fusing the depth information coming from the cameras and generating a virtual depth image of the subject warped to a frontal view. Moreover, we propose an improvement to Buys’ body pose detector [2], here used for skeleton estimation, by adding a preliminary people detection phase for background removal and by applying an alpha-beta tracking filter to the final skeleton joints. The system has been tested on sequences of two freely moving persons imaged by a network composed of two first-generation Microsoft Kinect sensors. Summarizing, the contribution of this work is two-fold:

  • We introduce a novel multi-view method to estimate the skeleton of a person based on the fusion of the 3D information coming from all the sensors in the network and a subsequent warping to a frontal pose;

  • We improve the body pose detector in [2] by removing background points from the input depth image with a people detection phase and by adding a joint tracking filter to the output of the detector.

The remainder of the paper is organized as follows: Sect. 2 reviews the state-of-the-art of both single-camera and multi-camera skeleton tracking algorithms, while Sect. 3 gives an overview of our system. In Sect. 4, we describe the multi-view data fusion part of our system, while in Sect. 5 we describe the skeleton estimation algorithm we used and how we improved it. Finally, Sect. 6 details the experiments we performed and the results we achieved, and in Sect. 7 conclusions are drawn.

2 Related Work

The skeleton of a person gives important cues about what the person is doing (action recognition) [3, 4], who the person is (people re-identification) [5,6,7], what her intentions are (surveillance) [8] and what her health conditions are (health-care) [9]. Furthermore, the wide literature on people tracking [10,11,12] demonstrates its usefulness for both security applications and human-robot interaction. One of the most important works on skeleton tracking is the one by Shotton et al. [1], which trains a random forest to recognize the body parts of a person on a huge training dataset composed of real and synthetic depth images of people. The classifier, licensed by Microsoft for entertainment applications, achieves good performance and works in real-time. The system is released within the Microsoft Kinect SDK and only works on Windows-based computers. Another work, released as open source within the Point Cloud Library [13], is that of Buys et al. [2], which uses an approach similar to Shotton’s. In our work, we use this latter body-part detector, which we also improve by adding a people detection pre-processing phase and an alpha-beta tracking algorithm.

Intelligent surveillance systems rely more and more on camera network cooperation. Indeed, more cameras are able to cover more space from multiple views, obtaining better 3D shapes of the subject and decreasing the probability of occlusions. Recent works rely on camera collaboration within a network to enhance skeleton estimation. In [14], the skeleton obtained from single RGB images is fused with the skeleton estimated from a 3D model built with the visual hull technique; the visual hull is used to refine the pose obtained from the single images. In [15], a skeleton is computed for every camera from a single image, and these estimates are then projected to 3D and intersected in space. The work by Gao et al. [16] addresses this problem by registering a 3D model to the scanned point cloud obtained from two Kinects. This approach is very accurate but unfeasible for real-time purposes, given the 6 seconds needed to process each frame. The work of Yeung et al. [17] proposes a real-time solution to the same problem with two Kinects. In particular, they use two orthogonal Kinects and fuse the skeletons obtained from the Microsoft SDK within a constrained optimization framework.

In this work, we exploit the multi-view information at the depth level, leaving the skeleton estimation as the last part of the pipeline. In this way, we are able to obtain better skeletons even when the single-view ones are noisy or contain untracked joints. Moreover, we minimize the skeleton estimation error by warping the fused data to a frontal view, given that skeleton estimation performs best on frontally viewed persons.

3 System Overview

Figure 1 provides an overview of our system. In this work, a network composed of two first-generation Microsoft Kinects is considered, but the extension to a higher number of cameras is straightforward. At each new frame, the Kinects compute the 3D point cloud of the scene and the people detector segments only the points belonging to the persons in the scene. Afterwards, since the network is calibrated, we transform the point clouds to a common reference frame and fuse them after performing a fine registration with the Iterative Closest Point (ICP) algorithm [18]. The fused multi-view cloud is then rotated and reprojected to a virtual image plane so as to generate a depth map of the persons seen from a frontal view. Body part detection is then performed on this virtual depth map, and the joint positions are computed from the body segmentation and tracked with an alpha-beta filter. The obtained skeleton can then be reprojected to either of the original images. In Sects. 4 and 5, we review each step of the proposed method in detail.

Fig. 1 Overview of the proposed system

4 Multi-view Data Fusion

State-of-the-art body part detectors [1, 2] perform poorly in the presence of occlusions. This case often occurs when a person is viewed from the side by a camera, and having more cameras in the scene does not guarantee that one of them sees the person completely. For this reason, our system exploits the perception network to perform data fusion and frontal view generation, providing the body part detector with a more complete depth image of the person in the scene and thus improving the final performance.

4.1 People Detection for Background Removal

The body part detector in [2] poorly estimates the lower body parts of a person when the ground plane is visible under the person's feet or when the person is too close to the background. To overcome this problem, we add a people detection phase as a pre-processing step for background removal. In this way, we build a new depth image where all the background points are set to a large depth value (e.g. 10 m), so that the Random Forest in [2] can easily discard them as not belonging to the foreground person. As for the people detector, we exploit the RGB-D people detection in [10, 11], which is publicly available in the Point Cloud Library and allows us to robustly detect people and obtain the point cloud points belonging to them. These 3D points are then reprojected to 2D to create a masked image, which is used instead of the entire depth image and improves the output of the original body part detector.
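This masking step can be summarized with a short sketch. It is a minimal illustration, assuming the people detector has already returned the pixel indices of the detected person; the 10 m background value follows the text, while the function name and buffer layout are ours.

```cpp
// Minimal sketch of the background-removal step (assumptions: the people
// detector already provided the pixel indices of the person; the depth image
// is a row-major 640x480 float buffer in metres).
#include <cstddef>
#include <vector>

std::vector<float> maskBackground(const std::vector<float>& depth,
                                  const std::vector<std::size_t>& person_indices,
                                  float background_depth = 10.0f)  // 10 m, as in the text
{
  // Start from an image entirely set to the large background value...
  std::vector<float> masked(depth.size(), background_depth);
  // ...and copy back only the pixels belonging to the detected person.
  for (std::size_t idx : person_indices)
    masked[idx] = depth[idx];
  return masked;
}
```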

4.2 Point Cloud Fusion

The information coming from multiple cameras is exploited here at the depth level, fusing the point clouds by means of the Iterative Closest Point algorithm. In particular, considering the network of two cameras \(C_0, C_1\) used in the experiments, we first obtain the segmented point clouds \( P _0, P _1\) by means of the people detector and then, given the extrinsic parameters of the network, refer these point clouds to a common world reference frame. After this transformation, the resulting point clouds \( P_0^w \) and \(P_1^w\) are finely registered by means of an ICP algorithm in order to account for depth estimation errors intrinsic to the sensors [19] or possible inaccuracies in the extrinsic calibration of the network. In formulas, we obtain the point clouds:

$$\begin{aligned} P_0^w&= \mathfrak {T}_0^w ( P_0 ) \end{aligned}$$
(1)
$$\begin{aligned} P_1^w&= \mathfrak {T}_1^w ( P_1 ) \end{aligned}$$
(2)
$$\begin{aligned} P_{total}^w&= P_0^w \oplus \mathfrak {T}_{ICP}( P_1^w ) \end{aligned}$$
(3)

where \(\mathfrak {T}_i^j\) represents the transformation from the i reference frame to the j reference frame and \(\mathfrak {T}_{ICP}\) is the transformation obtained by performing ICP with \( P_0^w \) as the target cloud and \( P_1^w \) as the source cloud. In Fig. 2, an example of this process is shown. In order to reduce the time needed to compute \(\mathfrak {T}_{ICP}\), we calculate this transformation on two downsampled versions of \(P_0^w\) and \(P_1^w\) and limit the number of iterations to 30.
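A possible PCL implementation of this registration step is sketched below. It follows (1)–(3) and the 30-iteration limit mentioned above, but the 2 cm voxel size and the function name are assumptions made for illustration, not the actual code of the system.

```cpp
// Sketch of the point cloud fusion of Eqs. (1)-(3): downsample, run ICP with
// P_0^w as target and P_1^w as source, then fuse the full-resolution clouds.
#include <pcl/common/transforms.h>
#include <pcl/filters/voxel_grid.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/registration/icp.h>

using Cloud = pcl::PointCloud<pcl::PointXYZ>;

Cloud::Ptr fuseClouds(const Cloud::Ptr& P0_w, const Cloud::Ptr& P1_w)
{
  // Downsample both clouds to speed up the ICP alignment (2 cm leaf, assumed).
  pcl::VoxelGrid<pcl::PointXYZ> grid;
  grid.setLeafSize(0.02f, 0.02f, 0.02f);
  Cloud::Ptr P0_ds(new Cloud), P1_ds(new Cloud);
  grid.setInputCloud(P0_w);  grid.filter(*P0_ds);
  grid.setInputCloud(P1_w);  grid.filter(*P1_ds);

  // Estimate T_ICP with at most 30 iterations.
  pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
  icp.setMaximumIterations(30);
  icp.setInputSource(P1_ds);
  icp.setInputTarget(P0_ds);
  Cloud aligned_ds;
  icp.align(aligned_ds);

  // Apply T_ICP to the full-resolution source cloud and concatenate.
  Cloud::Ptr P1_aligned(new Cloud);
  pcl::transformPointCloud(*P1_w, *P1_aligned, icp.getFinalTransformation());
  Cloud::Ptr P_total(new Cloud(*P0_w));
  *P_total += *P1_aligned;   // P_total^w = P_0^w concatenated with T_ICP(P_1^w)
  return P_total;
}
```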

Fig. 2 An example of the depth data fusion process. In a the point cloud obtained from the \(C_1\) camera, in b the point cloud obtained from the \(C_0\) camera and in c the final fused cloud

4.3 Frontal View Generation

The best skeleton estimation is obtained from frontally viewed persons. For this reason, we want to warp the total point cloud \( P_{total}^w \) obtained in Sect. 4.2 so that it is frontal with respect to the camera chosen as the reference, here \(C_0\). To obtain this result, as shown in Fig. 3, we project the points of \(P_{total}^w\) onto the ground, that is the xOy plane of the world reference frame, thus obtaining a 2D shape O of points that usually resembles an ellipse.

Fig. 3 The frontal view generation phase. On the left: the reference systems of our Kinects and the common world reference system, the collaborative cloud obtained with ICP after the people detection phase, and the same cloud projected onto the xy plane of the world reference frame (visible in red). On the right: a top view of the world reference frame, the vector representing the principal component \(\widehat{v}\) and the angle \(\theta \) used for the frontal view warping. Best viewed in color

We then compute the principal components [20] of O in order to find a vector \(\widehat{v}\) with the same direction as the major axis of O, which is then used to rototranslate the original \(P_{total}^w\) with the matrix M:

$$\begin{aligned} M = \left[ \begin{array}{cc} R & T \\ 0^{1 \times 3} & 1 \end{array} \right] \end{aligned}$$
(4)

where R is the rotation matrix which rotates a cloud by \(\theta \) around the world z-axis and T is the translation which brings the final point cloud to be centered on the world reference frame. In order to compute R, we need \(\theta \), the angle between \(\widehat{v}\) and \(u_x=(1, 0, 0)\). In formulas, we have:

$$\begin{aligned} \theta&= \arccos \left( \frac{\widehat{v} \cdot u_x}{|\widehat{v}||u_x|} \right) \end{aligned}$$
(5)
$$\begin{aligned} R&= \left( \begin{array}{ccc} \cos \theta & \sin \theta & 0 \\ -\sin \theta & \cos \theta & 0 \\ 0 & 0 & 1 \end{array} \right) \end{aligned}$$
(6)
$$\begin{aligned} T&= - \left( \begin{array}{c} k_x \\ k_y \\ 0 \end{array} \right) ,&K = (k_x, k_y, k_z) = \frac{\sum _{i = 0}^{| P_{total}^w |} P_{total}^w(i)}{|P_{total}^w|} \end{aligned}$$
(7)

where K is the centroid of the total point cloud before the rototranslation. We can now obtain the desired frontal-view point cloud as:

$$\begin{aligned} P_{fv}^w = \{ p = (x_p, y_p, z_p)^T \text { } | \text { } \exists q \in P_{total}^w , p = Mq \} \end{aligned}$$
(8)
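The following sketch shows one way to implement (4)–(8) with PCL and Eigen: the PCA of the ground projection is reduced to the eigen-decomposition of a 2D covariance matrix. The function name and this particular formulation are our assumptions, not the authors' implementation.

```cpp
// Sketch of the frontal-view warping of Eqs. (4)-(8).
#include <pcl/common/centroid.h>
#include <pcl/common/transforms.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <Eigen/Dense>
#include <cmath>

using Cloud = pcl::PointCloud<pcl::PointXYZ>;

Cloud::Ptr warpToFrontalView(const Cloud::Ptr& P_total_w)
{
  // Centroid K of the cloud (Eq. 7) and covariance of its ground (xOy) projection.
  Eigen::Vector4f K;
  pcl::compute3DCentroid(*P_total_w, K);
  Eigen::Matrix2f cov = Eigen::Matrix2f::Zero();
  for (const auto& p : P_total_w->points) {
    Eigen::Vector2f d(p.x - K.x(), p.y - K.y());
    cov += d * d.transpose();
  }

  // Principal direction v of the projected shape O and angle theta w.r.t. u_x (Eq. 5).
  Eigen::SelfAdjointEigenSolver<Eigen::Matrix2f> es(cov);
  Eigen::Vector2f v = es.eigenvectors().col(1);   // eigenvector of the largest eigenvalue
  float theta = std::acos(v.dot(Eigen::Vector2f::UnitX()) / v.norm());

  // Rototranslation M = [R T; 0 1] as in Eqs. (4), (6) and (7).
  Eigen::Matrix4f M = Eigen::Matrix4f::Identity();
  M(0, 0) =  std::cos(theta);  M(0, 1) = std::sin(theta);
  M(1, 0) = -std::sin(theta);  M(1, 1) = std::cos(theta);
  M(0, 3) = -K.x();            M(1, 3) = -K.y();

  // Eq. (8): apply M to every point of the fused cloud.
  Cloud::Ptr P_fv_w(new Cloud);
  pcl::transformPointCloud(*P_total_w, *P_fv_w, M);
  return P_fv_w;
}
```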

5 Body Skeleton Estimation

In this work, we perform body part detection with the algorithm in [2], which is open source and available in the Point Cloud Library. This detector takes as input a depth image, which is then classified by a Random Forest. For this reason, the multi-view, frontal point cloud \( P_{fv}^{w} \) obtained in Sect. 4.3 has to be projected to 2D in order to create a virtual depth image \(D_{virtual}\) that can be processed by the body part detector.

5.1 Virtual Depth Image Generation

In this work, the virtual depth image \(D_{virtual}\) is obtained by projecting the points of \( P_{fv}^{w} \) to the image plane of the \(C_0\) camera, which has been taken as the reference. However, this process often leaves some holes in the generated image. We thus implemented a hole-filling procedure that fills each hole with the nearest valid point, up to a threshold distance. In formulas:

$$\begin{aligned} D_{virtual}&= \{ d_{ij} \text { } | \text { } (i,j) \in \mathbb {N}^2, i \in [0, 480), j \in [0,640) \} \end{aligned}$$
(9)
$$\begin{aligned} d_{ij}&= {\left\{ \begin{array}{ll} P_{fv}^0(i,j), & (i,j) \text { provides a valid point in } P_{fv}^0\\ P_{fv}^0(\overline{i,j}), & (\overline{i, j}) = \mathop {\arg \min }_{(\widehat{i,j})}\{ || (i,j) - (\widehat{i,j}) || \text { } | \text { } || (i,j) - (\widehat{i,j}) || < t, \text { } P_{fv}^0(\widehat{i,j}) \text { is valid}\}\\ 10000, & \text{ otherwise } \end{array}\right. } \end{aligned}$$
(10)
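A naive implementation of the hole-filling rule of (10) could look as follows. The search radius, the convention that invalid pixels are stored as 0, and the function name are assumptions made for illustration.

```cpp
// Sketch of the hole filling of Eq. (10): each invalid pixel takes the depth of
// the nearest valid pixel within a radius t (in pixels), otherwise a large
// background value. Invalid pixels are assumed to be stored as 0.
#include <cmath>
#include <limits>
#include <vector>

void fillHoles(std::vector<float>& depth, int width, int height,
               int t = 5, float background = 10000.0f)
{
  std::vector<float> filled(depth);
  for (int i = 0; i < height; ++i)
    for (int j = 0; j < width; ++j) {
      if (depth[i * width + j] > 0.0f) continue;        // already valid
      float best_dist = std::numeric_limits<float>::max();
      float best_depth = background;
      // Search the closest valid pixel inside the (2t+1)x(2t+1) window.
      for (int di = -t; di <= t; ++di)
        for (int dj = -t; dj <= t; ++dj) {
          const int ii = i + di, jj = j + dj;
          if (ii < 0 || ii >= height || jj < 0 || jj >= width) continue;
          const float d = depth[ii * width + jj];
          if (d <= 0.0f) continue;                       // skip other holes
          const float dist = std::sqrt(float(di * di + dj * dj));
          if (dist < t && dist < best_dist) { best_dist = dist; best_depth = d; }
        }
      filled[i * width + j] = best_depth;
    }
  depth.swap(filled);
}
```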

In Fig. 4, a comparison of a sample image with and without the hole-filling procedure is shown. This hole-filled depth map is then provided as input to the body part detector.

Fig. 4 On the left, the resultant point cloud re-projection to the image plane of the reference camera. On the right, the re-projection after the hole-filling procedure

5.2 Joint Estimation

The body part detector [2] assigns one of the 24 labels defined at training time to each pixel of the depth map and then groups coherent voxels with the same label into blobs.

From this preliminary segmentation, we compute the positions of the skeleton joints in two steps. First, we address the case in which a single label is assigned to multiple coherent groups of voxels. This issue can be solved by combining the coherent voxel groups into a single blob or by sorting them by size and selecting the largest one for the joint calculation (a sketch of the latter strategy is given below). Although the results of these simple methods were satisfactory, the body part positions were imprecise in certain cases, especially for the smaller body parts such as hands and elbows. An improvement was achieved by building an optimal tree of the body parts, starting from the Neck as the root blob and recursively estimating the child blobs. This method is based on a pre-defined skeleton structure, which defines whether two body parts are connected, as well as certain constraints on the expected size of the limbs.
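As an illustration, the simpler largest-blob strategy mentioned above can be sketched as follows; the container types and function name are ours and not part of the PCL detector's API.

```cpp
// Sketch of the simpler blob-selection strategy: among the coherent voxel
// groups that received the same label, keep the largest for joint computation.
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <vector>

using Blob = pcl::PointCloud<pcl::PointXYZ>;

const Blob* largestBlob(const std::vector<Blob>& blobs_with_same_label)
{
  const Blob* best = nullptr;
  for (const auto& blob : blobs_with_same_label)
    if (!best || blob.size() > best->size())
      best = &blob;
  return best;   // nullptr if no blob was detected for this label
}
```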

In the second step, the 3D position of each joint is calculated from the selected blob. In most cases, the 3D centroid of the corresponding blob point cloud provides a good estimate of the joint position. An exception is the Hip blob, which also contains a large part of the torso. In addition, the Shoulder and Elbow joints are special cases, described below.

Shoulders The shoulder position is calculated from the corresponding chest blob. Inside the blob point cloud, we find the voxel \(V_{y\_max}\) with the maximum Y-value. We then build a sub-group of voxels belonging to the chest blob whose distance to \(V_{y\_max}\) is below a certain threshold (10 cm) and use the centroid of this sub-blob as the final position.

Elbows If the elbow blob was detected, we use the standard approach and calculate the centroid of the blob. Otherwise, we estimate the point inside the arm blob which is farthest from the previously estimated Shoulder joint.

Hips We define a certain threshold and build a sub-group of voxels belonging to the lower part of the hip blob. The centroid of this sub-group is used as the resulting position.
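As an example of these rules, the Shoulder case can be sketched as follows. The code assumes a non-empty chest blob and uses the 10 cm threshold from the text; the function name is ours.

```cpp
// Sketch of the Shoulder rule: find the chest voxel with maximum Y, keep the
// chest voxels within 10 cm of it, and return the centroid of this sub-blob.
#include <pcl/common/centroid.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <Eigen/Core>

Eigen::Vector4f estimateShoulder(const pcl::PointCloud<pcl::PointXYZ>& chest_blob,
                                 float radius = 0.10f)   // 10 cm threshold
{
  // Voxel V_y_max with the maximum Y-value inside the chest blob (assumed non-empty).
  const pcl::PointXYZ* v_max = &chest_blob.points.front();
  for (const auto& p : chest_blob.points)
    if (p.y > v_max->y) v_max = &p;

  // Sub-blob of chest voxels closer than the threshold to V_y_max.
  pcl::PointCloud<pcl::PointXYZ> sub_blob;
  for (const auto& p : chest_blob.points) {
    const float dx = p.x - v_max->x, dy = p.y - v_max->y, dz = p.z - v_max->z;
    if (dx * dx + dy * dy + dz * dz < radius * radius)
      sub_blob.push_back(p);
  }

  // The shoulder joint is the centroid of the sub-blob.
  Eigen::Vector4f centroid;
  pcl::compute3DCentroid(sub_blob, centroid);
  return centroid;
}
```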

5.3 Joint Tracking over Time

The Alpha-Beta filter detailed in Algorithm 1 was implemented on top of the standard joint calculation to ensure consistent and continuous motion over time. This deterministic approach estimates the new position from the predicted and measured positions, where the parameter \(\alpha \) weights the measured position and \(\beta \) weights the velocity update.

Careful tuning of the \(\alpha \) and \(\beta \) parameters is necessary to achieve the best results. Additionally, we modified the update parameters for the hand joints, which usually move with higher velocities than the other body parts.

Algorithm 1 Alpha-beta joint tracking
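Since Algorithm 1 is only available as a figure, the sketch below shows a standard alpha-beta filter applied independently to each 3D joint, consistent with the description above; the parameter values and the state layout are illustrative assumptions, not the paper's settings.

```cpp
// Standard alpha-beta filter for a single 3D joint (illustrative parameters).
#include <Eigen/Dense>

struct AlphaBetaJointFilter
{
  Eigen::Vector3f position = Eigen::Vector3f::Zero();  // filtered joint position
  Eigen::Vector3f velocity = Eigen::Vector3f::Zero();  // filtered joint velocity
  float alpha = 0.5f;   // weight of the measured position
  float beta  = 0.1f;   // weight of the velocity update

  // dt is the time elapsed since the previous frame (assumed > 0).
  Eigen::Vector3f update(const Eigen::Vector3f& measurement, float dt)
  {
    // Predict the new position from the current state...
    const Eigen::Vector3f predicted = position + velocity * dt;
    // ...then correct position and velocity with the measurement residual.
    const Eigen::Vector3f residual = measurement - predicted;
    position = predicted + alpha * residual;
    velocity = velocity + (beta / dt) * residual;
    return position;
  }
};
```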

6 Experiments

We tested the different steps of the proposed approach on two sequences of RGB-D frames recorded with a network of two first-generation Microsoft Kinects. In these sequences, two different freely moving persons performed various movements. To measure accuracy, we considered the following skeleton estimation error:

$$\begin{aligned} \epsilon = \frac{\sum _{frames}\frac{\sum _{joints} || pos_{estim} - pos_{actual} ||}{N_{joints}}}{N_{frames}} \end{aligned}$$
(11)

where the ground truth joint positions \(pos_{actual}\) have been manually annotated. The system used for testing the proposed methods is an Ubuntu 14.04 machine with an Intel Core i7-4770 CPU and an NVidia GeForce GTX 670 GPU. In Table 1, we report a quantitative comparison between our methods and the original one in terms of (11). In this table, we also report a baseline multi-view approach at the skeleton level, in which each fused skeleton \(\widehat{S}\) is the average of the single-view skeletons \(S_{0}\) and \(S_{1}\). While the original method [2] is independent of the background for body pose estimation, the joint estimation algorithm is not, and this causes the large \(\epsilon \) obtained in our tests. Adding a people detection step therefore greatly improves the performance of the joint estimator, and our joint-tracking filter maintains this performance while smoothing the estimated joints. Our novel multi-view approach outperforms both the single-view skeleton estimation and the baseline multi-view method. The results achieved are from 20% to 33% better than the single-view ones and up to 24% better than the baseline skeleton-based multi-view approach. Furthermore, the computational cost of computing a skeleton is around 100 ms (60 ms for computing the virtual depth image plus 40 ms for the PCL skeleton computation), allowing real-time usage of the proposed approach. In Fig. 5, we report a qualitative comparison of skeleton estimation with these techniques.
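For completeness, (11) can be computed directly as follows; the container types are hypothetical and assume one annotated ground-truth position per estimated joint.

```cpp
// Direct transcription of Eq. (11): mean over frames of the mean per-joint error.
#include <Eigen/Dense>
#include <cstddef>
#include <vector>

using Skeleton = std::vector<Eigen::Vector3f>;   // one 3D position per joint

float skeletonError(const std::vector<Skeleton>& estimated,
                    const std::vector<Skeleton>& ground_truth)
{
  float frame_sum = 0.0f;
  for (std::size_t f = 0; f < estimated.size(); ++f) {
    float joint_sum = 0.0f;
    for (std::size_t j = 0; j < estimated[f].size(); ++j)
      joint_sum += (estimated[f][j] - ground_truth[f][j]).norm();
    frame_sum += joint_sum / estimated[f].size();   // average over joints
  }
  return frame_sum / estimated.size();              // average over frames
}
```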

Table 1 The performance achieved by the original method [2] and our method. PD stands for people detection and JT for joint tracking
Fig. 5 Some sample frames of the dataset we used for testing the proposed approach. Each row represents a frame. The different columns represent: a [2] on the \(C_0\) stream; b ours with PD and JT on the \(C_0\) stream; c ours with PD and JT on the \(C_1\) stream; d our multi-view approach re-projected on the \(C_0\) camera

7 Conclusions

In this work, we addressed the problem of human skeleton estimation and tracking in camera networks. We proposed a novel system that fuses depth data coming from multiple cameras and generates a frontal view of the person in order to improve the skeleton estimation obtainable with state-of-the-art algorithms operating on depth data. Furthermore, we improved single-camera skeletal tracking by exploiting people detection for background removal and joint tracking for filtering joint trajectories. We tested the proposed system on hundreds of frames taken from two Kinect cameras, obtaining a substantial improvement with respect to state-of-the-art skeletal tracking applied to each camera. The proposed approach can also be applied in real-time scenarios given its low computational cost.