
1 Introduction

Three-dimensional models of real-world objects, in particular models of the human body, are used in many areas, such as augmented and virtual reality applications, video games, and reverse engineering. Such models are usually created with expensive specialized hardware and software.

In this work we propose and describe a method for obtaining a three-dimensional model of a human body using two MS Kinect devices, without additional equipment. The Kinect device is much cheaper than specialized scanners and provides a depth stream that can be converted into a set of three-dimensional points and then into a polygonal mesh of the human body.

There are now a large number of systems for reconstructing complex objects with the Kinect device. However, reconstruction of the human body imposes a number of additional constraints on the configuration of the scanning system: to achieve an acceptable level of surface detail, the surface of the scanned body must be captured in separate parts [1, 2]. Some systems use a rotating mechanism to turn the scanned person; others use multiple devices, each scanning a certain region [3]; still others combine these approaches [4].

Another problem is that not all parts of the body surface fall into the view of the cameras. These areas, where the surface is not determined, are called “empty zones” [5, 6]. This problem can be solved in different ways, but most often the empty areas are simply completed manually in a 3D graphics package [7].

2 Description of the System Workflow

As input the system accepts two streams from the two Kinect depth cameras. The system processes each frame \( (R_{k}^{\prime}) \) in the streams, which is an image with a resolution of 640 × 480 pixels. Each pixel encodes the distance, in millimeters, to a region of the scanned surface.

Before processing, the depth frames must be filtered. The problem is that the depth values returned by the device are rather noisy, and this noise is random, additive, and statistically independent of the signal. The noise must be removed from the depth frames while at the same time preserving the sharpness of boundaries, because boundaries in a depth frame express transitions between surface levels; smoothing them is inadmissible and would lead to loss of information. The bilateral filter smooths the image while preserving clear boundaries [8].

Next, the filtered depth frames \( (R_{k} ,R_{k - 1} ) \) are processed to obtain a set of three-dimensional points \( (V_{k} ) \) and their normals \( (N_{k} ) \). Since the resolution of the depth camera is not high, it makes sense to scan the surface of the body in parts. A variation of the ICP algorithm is used to align the partial surface frames. The output of this step is a full surface frame \( (V^{g} ,N^{g} ) \) for the front and back parts of the body.

Next, because there are surface regions that cannot come into the view of either depth camera (“empty zones”), the surface in these zones must be completed. An algorithm for constructing vertices based on quadratic Bezier curves was developed for this purpose. The output of this stage is the overall surface \( (V^{{\prime }} ,N^{{\prime }} ) \) of the human body, in which the added vertices join the front and back surfaces. After the overall surface is computed, a polygonal model \( (M_{x,y,z} ) \) must be obtained. For this, a reconstruction algorithm based on solving the Poisson equation (Poisson Surface Reconstruction) is used [9].

As output the system produces a *.ply file containing a polygonal model of the surface of the human body. The overall operation of the system is shown in Fig. 1.

Fig. 1 Overall system workflow

3 Description of the Method for Constructing a Human Body Surface Model

Before starting the system, the MS Kinect devices must be positioned and space allocated for scanning the scene. The two devices are placed in front of and behind the person, facing in opposite directions. The distance between the two devices depends on the height of the scanned person. The devices are mounted at half the height of the scanned person.

During scanning, the person has to stand motionless exactly in the center (i.e., equally distant from both devices) of the virtual cube bounded by the camera planes of the MS Kinect devices. In general, the user can take any motionless pose; however, no part of the front or back surface should block the camera's view.

The vertical viewing angle of the MS Kinect camera is limited to 43°. This means the distance at which the camera can capture the full surface of the scanned person is constrained. However, if the surface is conceptually divided into three parts (top, middle, and bottom) with fairly large consecutive overlaps, the distance can be reduced. This increases the accuracy of the distance measurements, since each partial frame now covers only a part of the surface at the full 640 × 480 pixel resolution, rather than the whole surface in a single frame.

This completes the preparatory stage of installing and checking the system. The physical placement of the system is shown in Fig. 2.

Fig. 2 Illustration of the system setup

Before the frames of the depth stream are processed, they must be filtered. In this work the bilateral filter [10] is used, with the pixel intensity taken to be the distance to the surface. The new depth value \( ( {\text{R}}_{k} ) \) at a pixel u is given by the following formula, where \( R_{k}^{{\prime }} \) is the measured depth at a pixel of the depth frame, u is the central pixel of the filter window, \( {\mathcal{N}}_{\sigma } \left( t \right) = { \exp }( - t^{2} \sigma^{ - 2} ) \) is a Gaussian function, and \( W_{\text{p}} \) is a normalizing weight.

$$ {\text{R}}_{\text{k}} \left( {\text{u}} \right) = \frac{1}{{{\text{W}}_{\text{p}} }}\sum\nolimits_{{{\text{q}} \in {\mathcal{U}}}} {{\mathcal{N}}_{{\sigma_{\text{s}} }} (||{\text{u}} - {\text{q||}}_{2} ){\mathcal{N}}_{{\sigma_{\text{r}} }} (||{\text{R}}_{\text{k}}^{{\prime }} ({\text{u}}) - {\text{R}}_{\text{k}}^{{\prime }} ({\text{q}})||_{2} ){\text{R}}_{\text{k}}^{{\prime }} ({\text{q}})} $$
(1)
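For illustration, Eq. (1) can be sketched in Python as follows. This is a minimal reference version, not the system's actual implementation; the window radius and the values of σs and σr are illustrative assumptions.

```python
import numpy as np

def bilateral_filter_depth(depth, radius=2, sigma_s=2.0, sigma_r=30.0):
    """Bilateral filter for a depth frame (Eq. 1): smooths additive noise
    while preserving depth discontinuities between surface levels."""
    h, w = depth.shape
    out = np.zeros_like(depth, dtype=np.float64)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            patch = depth[y0:y1, x0:x1].astype(np.float64)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            # Spatial Gaussian: penalizes distance from the central pixel u.
            spatial = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / sigma_s ** 2)
            # Range Gaussian: penalizes large depth differences, so sharp
            # boundaries between surface levels are not smoothed away.
            range_w = np.exp(-((patch - depth[y, x]) ** 2) / sigma_r ** 2)
            weights = spatial * range_w
            out[y, x] = (weights * patch).sum() / weights.sum()  # 1/Wp factor
    return out
```

A single noisy pixel on a flat surface is pulled back toward its neighbors, while a 500 mm depth step between surface levels remains sharp.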

After the bilateral filter is applied, a smoothed depth frame is obtained. To transform a depth frame into a cloud of three-dimensional points, the first step is to compute the perspective projection matrix of the camera (K). It can be obtained from the internal calibration matrix of the MS Kinect camera.

Each pixel of a depth frame is represented by a vector, where x is the pixel coordinate along the frame width, y is the coordinate along the frame height, and z is the depth value.

$$ {\text{Q}} = \left[ {\begin{array}{*{20}c} {\text{x}} & {\text{y}} & {\text{z}} \\ \end{array} } \right]^{\text{T}} $$
(2)

To obtain the set of three-dimensional vertices \( (V_{k} ) \), the homogeneous coordinate vector of each pixel of the depth frame is multiplied by the inverse intrinsic matrix and scaled by the measured depth \( D_{k} \).

$$ {\text{V}}_{\text{k}} ({\text{q}}) = {\text{D}}_{\text{k}} ({\text{q}}){\text{K}}^{ - 1} \left[ {\begin{array}{*{20}c} {\text{x}} & {\text{y}} & 1 \\ \end{array} } \right]^{\text{T}} $$
(3)
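A minimal sketch of this back-projection in Python (the intrinsic matrix K used in the test below is a made-up example, not calibrated Kinect intrinsics):

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth frame into a cloud of 3-D vertices (Eq. 3):
    each pixel (x, y) with depth d maps to d * K^{-1} [x, y, 1]^T."""
    h, w = depth.shape
    K_inv = np.linalg.inv(K)
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # homogeneous coords
    rays = pixels @ K_inv.T          # viewing ray per pixel
    return rays * depth[..., None]   # scale each ray by the measured depth
```

A pixel at the principal point maps straight down the optical axis, so with depth 1000 mm it yields the point (0, 0, 1000).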

The next step is to calculate a normal for each vertex. The normal vector \( ( {\text{N}}_{\text{k}} ) \) is calculated from the coordinates of the neighboring vertices.

$$ {\text{N}}_{\text{k}} ({\text{q}}) = \left( {{\text{V}}_{\text{k}} ({\text{x}} + 1,{\text{y}}) - {\text{V}}_{\text{k}} ({\text{x}},{\text{y}})} \right) \times \left( {{\text{V}}_{\text{k}} ({\text{x}},{\text{y}} + 1) - {\text{V}}_{\text{k}} ({\text{x}},{\text{y}})} \right) $$
(4)
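Eq. (4) can be sketched over a whole vertex map at once; this is an illustrative vectorized version with the normals additionally normalized to unit length (an assumption, since Eq. (4) leaves them unnormalized):

```python
import numpy as np

def compute_normals(V):
    """Per-vertex normals via the cross product of differences to
    neighbouring vertices (Eq. 4), then normalised to unit length.
    V has shape (H, W, 3); the result has shape (H-1, W-1, 3)."""
    dx = V[:, 1:, :] - V[:, :-1, :]   # V(x+1, y) - V(x, y)
    dy = V[1:, :, :] - V[:-1, :, :]   # V(x, y+1) - V(x, y)
    n = np.cross(dx[:-1, :, :], dy[:, :-1, :])
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.where(norm > 0, norm, 1.0)  # avoid division by zero
```

For a flat vertex grid at constant depth, every normal comes out as (0, 0, 1), as expected.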

The front and back surfaces have to be constructed by aligning the partial frames. The top, middle, and bottom frames mutually overlap, so a variant of the ICP (Iterative Closest Point) algorithm [11] is used to align them. First the top and middle frames are aligned, then the middle and bottom frames. In the last step the resulting frames are merged into a single surface (front and back). The scheme of the overlaps in the partial frames is shown in Fig. 3.

Fig. 3 Scheme of the overlaps in the partial frames

The first step of the algorithm is to find coinciding vertices in two partial frames. The current frame \( ( {\text{R}}_{\text{k}} ) \) is compared with the previous frame \( ( {\text{R}}_{{{\text{k}} - 1}} ) \). If the transformation matrix of the camera from the previous step (the previous partial depth frame) is known, the vertices can be projected into global coordinates \( ( {\text{V}}_{{{\text{i}} - 1}} ) \). Since the Kinect camera is fixed at one point and has only one degree of freedom (rotation about the X axis), the transformation matrix has the form

$$ {\text{T}}_{{{\text{i}} - 1}} = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 \\ 0 & {\cos \,{\upvarphi }_{{{\text{i}} - 1}} } & { - \sin \,{\upvarphi }_{{{\text{i}} - 1}} } \\ 0 & {\sin \,{\upvarphi }_{{{\text{i}} - 1}} } & {\cos \,{\upvarphi }_{{{\text{i}} - 1}} } \\ \end{array} } \right] $$
(5)
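The single-degree-of-freedom transform of Eq. (5) is easy to verify numerically; the following is an illustrative helper, not part of the described system:

```python
import math

def rotation_about_x(phi):
    """Camera transform of Eq. (5): rotation by angle phi about the X axis,
    the only degree of freedom of the fixed Kinect camera."""
    c, s = math.cos(phi), math.sin(phi)
    return [[1.0, 0.0, 0.0],
            [0.0,   c,  -s],
            [0.0,   s,   c]]

def apply(T, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(T[i][j] * v[j] for j in range(3)) for i in range(3)]
```

Rotating the unit Y vector by 90° about X sends it onto the Z axis, and since the matrix is orthogonal its inverse in Eq. (6) is simply its transpose.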

Thus the formula for projecting vertices onto the depth map (p) has the form

$$ {\text{V}}_{{{\text{i}} - 1}} = {\text{T}}_{{{\text{i}} - 1}}^{ - 1} {\text{V}}_{{{\text{i}} - 1}}^{\text{g}} $$
(6)

Further, if the projection contains vertices (\( {\text{V}}_{\text{i}} \)), new normal values (n) must be calculated for each vertex.

$$ {\text{n}} = {\text{T}}_{\text{i}} {\text{N}}_{\text{i}} ({\text{p}}) $$
(7)

The coordinates of the vertices do not need to be recalculated, since the camera is not displaced along the spatial axes in a new frame. If the following quantity does not exceed the threshold value (ts), the current point is recognized as matched; otherwise the point is considered new (belonging to the new depth frame).

$$ ||{\text{n}} \times {\text{n}}_{{{\text{i}} - 1}}^{\text{g}} || < ts $$
(8)

The transformation T is then estimated by minimizing the point-to-plane error over the matched vertices:

$$ {\text{argmin}}_{T} \sum\nolimits_{u} {||(TV_{i} (u) - V_{i - 1}^{g} (u)) \cdot n_{i - 1}^{g} (u)||^{2} } $$
(9)
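The energy of Eq. (9) can be evaluated with a short sketch (illustrative only; a real ICP step would also solve for the minimizing T rather than just evaluate the error):

```python
import numpy as np

def point_to_plane_error(T, V_cur, V_prev, N_prev):
    """Point-to-plane ICP energy of Eq. (9): sum of squared distances of
    each transformed current vertex to the tangent plane (defined by the
    normal) of its matched vertex in the previous frame."""
    diff = V_cur @ T.T - V_prev               # T*V_i(u) - V^g_{i-1}(u)
    return np.sum(np.sum(diff * N_prev, axis=1) ** 2)  # dot with normals
```

With the identity transform and identical clouds the error is zero; shifting the current cloud by 0.5 along the normals yields an error of 2 × 0.5² = 0.5.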

The output of this stage is the back and front surfaces in the form of clouds of three-dimensional points with their normals.

After the pairwise alignment of the partial frames, separate back and front surfaces of the body are obtained. Between these surfaces there remains a thin strip of “empty zones”. These zones arise because the physical arrangement of the cameras makes them inaccessible to scanning.

The main idea of the method for filling the empty zones is shown in Figs. 4 and 5.

Fig. 4 Illustration of the method of constructing the surface in “empty zones”

Fig. 5 Construction of the quadratic Bezier curve

To find the boundary vertices, depth thresholding with the Sobel operator (3 × 3) is used [12]. A vertex \( (P_{1} ) \) is successively selected from the set of boundary vertices of the front surface \( (P_{f} ) \). For each such vertex, the nearest vertex \( (P_{2} ) \) is selected from the set for the back surface \( (P_{b} ) \). Then the normal lines of both vertices are constructed and their intersection point \( (P_{c} ) \) is computed. From these vertices a quadratic Bezier curve is constructed by the following formula, where B(t) is an auxiliary vertex and t is a parameter in [0, 1].

$$ {\text{B}}\left( {\text{t}} \right) = \left( {1 - {\text{t}}} \right)^{2} {\text{P}}_{1} + 2{\text{t}}\left( {1 - {\text{t}}} \right){\text{P}}_{\text{c}} + {\text{t}}^{2} {\text{P}}_{2} $$
(10)
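Eq. (10) and the filling of one empty-zone strip can be sketched as follows (the sampling count n is an illustrative assumption; the paper does not state how densely the curve is sampled):

```python
def quadratic_bezier(p1, pc, p2, t):
    """Quadratic Bezier point of Eq. (10):
    B(t) = (1-t)^2 P1 + 2t(1-t) Pc + t^2 P2."""
    a, b, c = (1 - t) ** 2, 2 * t * (1 - t), t ** 2
    return tuple(a * u + b * v + c * w for u, v, w in zip(p1, pc, p2))

def fill_empty_zone(p1, pc, p2, n=10):
    """Sample n+1 vertices along the curve bridging a boundary vertex of
    the front surface (P1) and of the back surface (P2) through the
    normal-intersection point (Pc)."""
    return [quadratic_bezier(p1, pc, p2, i / n) for i in range(n + 1)]
```

The curve starts exactly at P1, ends exactly at P2, and is pulled toward Pc in between, which is what joins the front and back boundaries smoothly.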

Thus, the “empty zones” are filled with a set of vertices B(t), as shown in Fig. 5.

The output is the overall model of the human body in the form of mutually aligned clouds of three-dimensional vertices. The last stage is the construction of the mesh by the method based on solving the Poisson equation. The result is a polygon mesh describing the geometry of the human body in the *.ply format.
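The *.ply output format is simple enough to sketch directly; the following minimal ASCII writer is illustrative and not the library code the system uses (PCL handles the actual export):

```python
def write_ply(path, vertices, faces):
    """Write a polygon mesh to an ASCII *.ply file: a header declaring
    the vertex and face counts, then one line per vertex and per face."""
    with open(path, "w") as f:
        f.write("ply\nformat ascii 1.0\n")
        f.write(f"element vertex {len(vertices)}\n")
        f.write("property float x\nproperty float y\nproperty float z\n")
        f.write(f"element face {len(faces)}\n")
        f.write("property list uchar int vertex_indices\n")
        f.write("end_header\n")
        for x, y, z in vertices:
            f.write(f"{x} {y} {z}\n")
        for face in faces:  # each face: count followed by vertex indices
            f.write(str(len(face)) + " " + " ".join(map(str, face)) + "\n")
```

Writing a single triangle produces a file any standard mesh viewer can open.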

4 Results

On the basis of this method we have developed a program for reconstructing the surface geometry of the human body. The Poisson reconstruction is implemented with the PCL (Point Cloud Library). The alignment of the partial surface frames is implemented on the basis of the Kinect Fusion library. The architecture of the developed program is shown in Fig. 6. In the near future we plan to add functionality for animating the models using a BVH skeleton.

Fig. 6 Architecture of the developed program

The left part of Fig. 7 shows the partial depth frames that are processed by our system. The right part shows the reconstructed body model after filtering and alignment.

Fig. 7 Example of a reconstructed female body