
1 Introduction

Kinect [4, 14, 22] has become an important 3D sensor. It has received a lot of attention thanks to the rapid human-pose recognition system developed on top of its 3D measurements [17]. Its low cost, reliability and measurement speed promise to make Kinect the primary 3D measuring device in indoor robotics [25], 3D scene reconstruction [7], and object recognition [12].

In this chapter we provide a geometrical analysis of Kinect, design its geometrical model, propose a calibration procedure and demonstrate its performance. We extend here our preliminary results presented in [18].

Approaches to modeling Kinect geometry, which have appeared recently, provide a good basis for understanding the sensor. The most relevant previous work is the following. The authors of [2] combined OpenCV camera calibration [24] with the Kinect inverse disparity measurement model [3] to obtain a basic Kinect calibration procedure; the project did not study particular features of Kinect sensors and did not correct for them. An almost identical procedure [11] is implemented in ROS, where an apparent shift between the infrared and depth images is corrected. Another variation of this approach appeared in [8], where the OpenCV calibration is replaced by Bouguet's calibration toolbox [1].

We build on top of this previous work and design an accurate calibration procedure based on geometrical models as well as on "learning" an additional correction procedure that accounts for the remaining non-modeled errors. We use the full camera models and calibration procedures as implemented in [1], the relationship between Kinect inverse disparity and depth from [3], the correction of the depth and infrared image displacement from [11], and add further corrections trained on examples of calibration boards. We demonstrate that a calibrated Kinect can be combined with Structure from Motion (SfM) to obtain 3D data in a consistent coordinate system, allowing the surface of the observed scene to be constructed by Multiview Stereo. Our comparison shows that Kinect is superior in accuracy to the SwissRanger SR-4000 3D TOF camera and close to a medium-resolution SLR stereo rig. Our results agree with [10], which reports compatible observations about the Kinect depth quantization.

2 Kinect as a 3D Measuring Device

Kinect is a composite device consisting of a near-infrared laser pattern projector, an IR camera and a color (RGB) camera, Fig. 1.1. The IR camera and the projector form a stereo pair used to triangulate points in 3D space. The RGB camera can then be used to texture the 3D points or to recognize the image content. As a measuring device, Kinect delivers three outputs: an IR image, an RGB image, and an (inverse) depth image.

Fig. 1.1 Kinect consists of an infrared (IR) projector, an IR camera and an RGB camera (illustration from [11])

2.1 IR Image

The IR camera, Fig. 1.3(b), (1280×1024 pixels, 57×45° field of view, 6.1 mm focal length, 5.2 μm pixel size) is used to observe and decode the IR projection pattern for triangulating the 3D scene. If suitably illuminated by a halogen lamp [19, 23] with the IR projector blocked, Fig. 1.7(c, d), it can be reliably calibrated by [1] using the same checkerboard pattern as for the RGB camera calibration. The camera exhibits non-negligible radial and tangential distortions, see Sect. 1.4.

2.2 RGB Image

The RGB camera, Fig. 1.3(a), (1280×1024 pixels, 63×50° field of view, 2.9 mm focal length, 2.8 μm pixel size) delivers medium-quality images. It can be calibrated by [1] and used to track relative poses between subsequent images with an SfM system, e.g. [6, 20].

2.3 Depth Image

The main raw output of Kinect is an 11-bit image, Fig. 1.3(c), which corresponds to the depth in the scene. Rather than providing the actual depth z, Kinect returns “inverse depth” 1/z, as shown in Fig. 1.4(a). Taking into account the depth resolution achievable with a Kinect (Sect. 1.2.4), we adopted the model suggested in [11]. The depth image is constructed by triangulation from the IR image and the projector and hence it is “carried” by the IR image, as shown in Eq. 1.5.

The depth image has a vertical stripe of pixels on the right (8 pixels wide) where no depth is calculated, see Fig. 1.3(c). This is probably due to the windowing effect of the block correlation used in calculating the disparity [11]. We estimated the size of the correlation window (see Sect. 1.3.1) to be 9×7 pixels.

2.4 Depth Resolution

Figure 1.4(b, c) shows the resolution of the measured depth as a function of the true depth. The depth resolution was measured by moving Kinect away (0.5 m–15 m) from a planar target in sufficiently fine steps to record all the values returned in a view field of approximately 5° around the image center.

The size of the quantization step q [mm], which is the distance between two consecutive recorded values, was found to be the following function of the depth z [m]:

$$ q(z) = 2.73\,z^2 + 0.74\,z - 0.58. $$
(1.1)

This is in accordance with the quadratic depth resolution expected for triangulation-based devices. At the beginning and at the end of the operational range, the quantization step was q(0.50 m) = 0.65 mm and q(15.7 m) = 685 mm, respectively. These findings are in accordance with [10].
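For quick reference, a minimal Python sketch that evaluates the fitted model of Eq. 1.1 at a few depths; note that the endpoint values quoted above are measured values and need not coincide exactly with the polynomial fit:

```python
def quantization_step(z):
    """Kinect depth quantization step q [mm] at depth z [m], Eq. 1.1."""
    return 2.73 * z**2 + 0.74 * z - 0.58

# Tabulate the step over the operational range of Sect. 1.2.4.
for z in (0.5, 1.0, 2.0, 5.0, 10.0, 15.7):
    print(f"z = {z:5.1f} m  ->  q = {quantization_step(z):7.2f} mm")
```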

3 Kinect Geometrical Model

We model Kinect as a multi-view system consisting of RGB, IR and Depth cameras. The geometrical model of the RGB and IR cameras, which projects a 3D point X into an image point [u,v], is given by [1]:

$$ \left[ \begin{array}{c} x\\ y\\ z \end{array} \right] = \mathtt{R}\,(\mathtt{X}-\mathtt{C}), $$
(1.2)

$$ \left[ \begin{array}{c} u\\ v\\ 1 \end{array} \right] = \mathtt{K}\operatorname {dis}\!\left(\left[ \begin{array}{c} x\\ y\\ z \end{array} \right],\mathtt{k}\right), $$
(1.3)

$$ z = \left[ \begin{array}{ccc} 0 & 0 & 1 \end{array} \right] \mathtt{R}\,(\mathtt{X}-\mathtt{C}), $$
(1.4)

where \(\operatorname {dis}\) applies the radial and tangential lens distortion of [1] to the normalized image coordinates [x/z,y/z], with distortion parameters k=[k 1,k 2,…,k 5], camera calibration matrix K, rotation R and camera center C [5].

The Depth camera of Kinect is associated with the geometry of the IR camera. For every pixel [u,v] of the IR camera, it returns the inverse depth d along the z-axis, as visible in Fig. 1.5:

$$ d(u-u_{0},\,v-v_{0}) = \frac{1}{c_{1}}\,\frac{1}{z(u,v)} - \frac{c_{0}}{c_{1}}, $$
(1.5)

where u, v are given by Eq. 1.3, the true depth z by Eq. 1.4, [u 0,v 0] by Table 1.1, X stands for the 3D coordinates of a 3D point, and c 1, c 0 are the parameters of the model. We associate the Kinect coordinate system with the IR camera and hence get R IR=I and C IR=0. A 3D point X IR is constructed from the measurement [x,y,d] in the depth image by

$$ \mathtt{X}_{\mathrm{IR}} = \frac{1}{c_1 d+c_0} \operatorname {dis}^{-1}\left(\mathtt{K}_{\mathrm{IR}}^{-1}\left[ \begin{array}{c} x+u_0\\y+v_0\\ 1 \end{array} \right],\mathtt{k}_{\mathrm{IR}}\right) $$
(1.6)

and projected to the RGB images as

$$ \mathtt{u}_{\mathrm{RGB}} = \mathtt{K}_{\mathrm{RGB}} \operatorname {dis}\bigl(\mathtt{R}_{\mathrm{RGB}}(\mathtt{X}_{\mathrm{IR}}-\mathtt{C}_{\mathrm{RGB}}), \mathtt{k}_{\mathrm{RGB}}\bigr) $$
(1.7)

where \(\operatorname {dis}\) is the distortion function given by Eq. 1.3, k IR, k RGB are the respective distortion parameters of the IR and RGB cameras, K IR is the IR camera calibration matrix and K RGB,R RGB,C RGB are the calibration matrix, the rotation matrix and the center of the RGB camera, respectively.
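The following sketch implements Eqs. 1.6 and 1.7 with OpenCV, whose distortion-coefficient order (k1, k2, p1, p2, k3) matches Bouguet's kc used here; cv2.undistortPoints realizes dis^{-1} composed with K^{-1}, and cv2.projectPoints realizes K dis(R(X−C), k). The function names and default shift values are ours; the calibration parameters are assumed to come from the procedure of Sect. 1.4.

```python
import numpy as np
import cv2

def depth_pixel_to_point(x, y, d, K_ir, k_ir, c0, c1, u0=4.0, v0=3.0):
    """Back-project a depth-image measurement [x, y, d] to a 3D point X_IR
    in the IR camera frame (Eq. 1.6). (u0, v0) is the Depth-to-IR pixel
    shift of Table 1.1; c0, c1 are the inverse-depth parameters."""
    z = 1.0 / (c1 * d + c0)                 # metric depth from inverse depth d
    pix = np.array([[[x + u0, y + v0]]], dtype=np.float64)
    # undistortPoints applies K^-1 followed by the inverse distortion dis^-1
    xn, yn = cv2.undistortPoints(pix, K_ir, k_ir)[0, 0]
    return z * np.array([xn, yn, 1.0])

def point_to_rgb_pixel(X_ir, K_rgb, k_rgb, R_rgb, C_rgb):
    """Project a point given in the IR frame into the RGB image (Eq. 1.7)."""
    rvec, _ = cv2.Rodrigues(R_rgb)
    tvec = (-R_rgb @ C_rgb).reshape(3, 1)   # R(X - C) = R X + t with t = -R C
    uv, _ = cv2.projectPoints(X_ir.reshape(1, 3), rvec, tvec, K_rgb, k_rgb)
    return uv[0, 0]
```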

Table 1.1 IR to Depth-camera pixel position shift

3.1 Shift Between IR Image and Depth Image

The IR and Depth images were found to be mutually shifted. To determine the shift [u 0,v 0], circular targets spanning the field of view were captured from different distances in the IR and Depth images, Fig. 1.8(a). Edges of the targets were computed in both images using the Sobel edge detector. To mitigate the effect of the unstable Depth-image edges, circles were fitted to the detected edge points, Fig. 1.8(b). The pixel distances between the centers of the fitted circles are shown in Table 1.1. The shift was estimated as the mean value of these distances over all the experiments. We conclude that there is a shift of about 4 pixels in the u direction and 3 pixels in the v direction.
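A minimal sketch of the circle-fitting step, using the standard algebraic least-squares fit; edge_pairs is a hypothetical list of matched IR/Depth edge-point arrays, and the sign convention of the shift is illustrative:

```python
import numpy as np

def fit_circle(pts):
    """Algebraic least-squares circle fit: center (cx, cy) and radius r.
    Solves 2*cx*x + 2*cy*y + (r^2 - cx^2 - cy^2) = x^2 + y^2."""
    x, y = pts[:, 0], pts[:, 1]
    A = np.column_stack([2.0 * x, 2.0 * y, np.ones_like(x)])
    rhs = x**2 + y**2
    (cx, cy, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return np.array([cx, cy]), np.sqrt(c + cx**2 + cy**2)

# Shift estimate: mean difference of fitted centers over all target/distance pairs.
# shift = np.mean([fit_circle(e_depth)[0] - fit_circle(e_ir)[0]
#                  for e_ir, e_depth in edge_pairs], axis=0)
```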

3.2 Identification of the IR Projector Geometrical Center

We first acquired seven IR and Depth images of a plane positioned at different distances. The projected pattern contains nine brighter, easily identifiable speckle dots, Fig. 1.9(a). These points were formed by rays l r, r=1,…,9, transmitted from the IR projector. Each point was reconstructed in 3D space and grouped by its ray of origin as \(\mathtt{X}_{\mathrm{IR}_{i,r}}\). The IR projector center C P is located at the common intersection of the nine rays. We formulated a nonlinear optimization problem to find C P by minimizing the perpendicular distances of the reconstructed points \(\mathtt{X}_{\mathrm{IR}_{i,r}}\) from a bundle of rays passing through C P. Figure 1.9(b) shows the resulting ray bundle next to the IR camera frame. Figure 1.9(c) shows the residual distances from the points \(\mathtt{X}_{\mathrm{IR}_{i,r}}\) to their corresponding rays of the optimal ray bundle. All residual distances are smaller than 2 mm. The estimated projector center has coordinates C P=[74.6, 1.1, 1.3] mm in the IR camera reference frame.
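A sketch of this bundle optimization, assuming scipy and a plain point-to-line distance residual; the initialization of C P near the nominal ~75 mm baseline and the function names are our choices:

```python
import numpy as np
from scipy.optimize import least_squares

def bundle_residuals(params, pts, ray_id, n_rays=9):
    """Perpendicular distances of the points to a bundle of rays through C_P."""
    C = params[:3]
    dirs = params[3:].reshape(n_rays, 3)
    res = []
    for r in range(n_rays):
        v = dirs[r] / np.linalg.norm(dirs[r])   # unit direction of ray r
        D = pts[ray_id == r] - C                # points on ray r, relative to C_P
        res.append(np.linalg.norm(D - np.outer(D @ v, v), axis=1))
    return np.concatenate(res)

def estimate_projector_center(pts, ray_id, n_rays=9):
    """pts: (N,3) reconstructed points [m]; ray_id: (N,) ray index 0..8."""
    # Initialize C_P near the expected ~75 mm baseline, directions from point means.
    x0 = np.concatenate([[0.075, 0.0, 0.0]] +
                        [np.mean(pts[ray_id == r], axis=0) for r in range(n_rays)])
    sol = least_squares(bundle_residuals, x0, args=(pts, ray_id, n_rays))
    return sol.x[:3]
```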

3.3 Identification of Effective Depth Resolutions of the IR Camera and Projector Stereo Pair

In this section we investigate the view fields of the Kinect IR camera and projector and their effective resolutions, which determine the distribution of the resolution of the 3D measurement.

The size of the IR image and of the depth image is known to be 640×480 pixels with 10.4 μm pixel size, spanning a view field of approximately 60°×45°. This gives an angular resolution of 0.0938°/pixel in the IR camera.

Counting the speckle dots of the projected pattern yields about 800 dots along the central horizontal line across the projector field of view. The projector FOV and the IR camera FOV are approximately the same. Hence we get 800 dots per 60°, i.e. an angular resolution of 0.075°/ray for the projector rays. The green curve in Fig. 1.11 shows the simulated depth quantization along the central IR camera ray (the blue line in Fig. 1.10) for the camera and projector resolutions described above. It clearly does not correspond to the red curve measured on a real Kinect.

To get our simulation closer to reality, we assume that the ray detection is done with higher accuracy by interpolating rays from the projected pattern. The blue curve in Fig. 1.11 corresponds to detecting rays with 1/8 pixel accuracy, as was hypothesized in [11]. Hence we get an effective resolution of 5120 = 640×8 rays per 60°, i.e. 0.0117°/ray, in the projector. This corresponds to our measurement on a real Kinect.
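The simulation behind Fig. 1.11 can be reproduced in a few lines: consecutive projector rays at the effective angular spacing cross the central camera ray at depths z = b/tan(kΔθ), and the step between neighbouring crossings is the quantization. A sketch under these assumptions, taking the ~75 mm baseline identified in Sect. 1.3.2; it yields q(1 m) ≈ 2.7 mm, consistent with Eq. 1.1:

```python
import numpy as np

b = 0.075                     # IR camera-projector baseline [m], Sect. 1.3.2
dtheta = np.deg2rad(0.0117)   # effective projector angular resolution (1/8 pixel)

# The projector ray at angle k*dtheta from the optical axis crosses the
# central camera ray at depth z = b / tan(k*dtheta).
k = np.arange(1, 2000)
z = b / np.tan(k * dtheta)
q = -np.diff(z) * 1000.0      # step [mm] between neighbouring reconstructible depths

mask = (z[1:] > 0.5) & (z[1:] < 15.0)
for zi, qi in zip(z[1:][mask][::100], q[mask][::100]):
    print(f"z = {zi:6.2f} m   q = {qi:7.2f} mm")
```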

Figure 1.10 illustrates the view fields and ray arrangements of the Kinect IR camera (blue) and projector (green). The bold blue line marks the center of the IR camera view where the distance resolution was evaluated. For clarity, we show only every 64th camera ray and every 150th projector ray, with their intersections drawn as red dots.

4 Kinect Calibration

We calibrate the Kinect cameras together, as proposed in [1], by showing the same calibration target to the IR and RGB cameras, Fig. 1.7(c). This allows both cameras to be calibrated w.r.t. the same 3D points, and hence the poses of the cameras w.r.t. the points can be chained to give their relative pose, Fig. 1.12. Taking the Cartesian coordinate system of the IR camera as the global Kinect coordinate system makes the relative camera pose equal to R RGB, C RGB.
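A small sketch of the pose chaining, assuming the x = dis(R(X − C)) convention of Eqs. 1.2–1.4 and per-camera poses (R, C) estimated w.r.t. the same calibration target:

```python
import numpy as np

def relative_pose(R_ir, C_ir, R_rgb, C_rgb):
    """Chain the per-camera poses w.r.t. the calibration target into the pose
    of the RGB camera in the IR frame. With x = R(X - C), substituting
    X = R_ir^T x_ir + C_ir into x_rgb = R_rgb(X - C_rgb) gives
    x_rgb = R_rel (x_ir - C_rel)."""
    R_rel = R_rgb @ R_ir.T
    C_rel = R_ir @ (C_rgb - C_ir)
    return R_rel, C_rel
```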

Tables 1.2 and 1.3 show the internal parameters and Fig. 1.6 shows the effect of the distortions in the cameras. We included the tangential distortion since it non-negligibly increased the overall accuracy of the 3D measurements. Figure 1.7(a) shows the IR image of the calibration board under normal Kinect operation, when it is illuminated by its IR projector. A better image is obtained by blocking the IR projector and illuminating the target with a halogen lamp, Fig. 1.7(b).

Table 1.2 Intrinsic parameters of the Kinect IR camera
Table 1.3 Intrinsic parameters of the Kinect RGB camera

The parameters c 0, c 1 of the Depth camera are calibrated as follows. We get n measurements \(\mathtt{X}_{D_{i}} = [x_{i},y_{i},d_{i}]^{\top}\), i=1,…,n, of all the calibration points from the depth images, Fig. 1.7(d). The Cartesian coordinates \(\mathtt{X}_{\mathrm{IR}_{i}}\) of the same calibration points were measured in the IR Cartesian system by intersecting the rays projecting the points into the IR images with the best plane fits to the reconstructed calibration points. The parameters c 0, c 1 were then optimized to best fit \(\mathtt{X}_{D_{i}}\) to \(\mathtt{X}_{\mathrm{IR}_{i}}\) using Eq. 1.6.
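The chapter optimizes c 0, c 1 through the full 3D fit of Eq. 1.6; for illustration, a linearized variant that fits the inverse-depth relation 1/z = c 1 d + c 0 directly by linear least squares (our simplification, not the authors' exact objective) looks as follows:

```python
import numpy as np

def fit_inverse_depth_params(d, z):
    """Fit c1, c0 in 1/z = c1*d + c0 from raw inverse-depth values d and
    reference depths z [m] of the same calibration points."""
    d = np.asarray(d, dtype=float)
    A = np.column_stack([d, np.ones_like(d)])
    (c1, c0), *_ = np.linalg.lstsq(A, 1.0 / np.asarray(z, dtype=float), rcond=None)
    return c1, c0
```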

4.1 Learning Complex Residual Errors

A Kinect calibrated with the above procedure still exhibits small but relatively complex residual errors for close-range measurements. Figure 1.13 shows the residuals after fitting a plane to the calibrated Kinect measurement of a planar target spanning the field of view. The target was captured from 18 different distances ranging from 0.7 to 1.3 meters. The residuals were found to be highly correlated across these distances, i.e. they form a fixed pattern.

Residuals along the 250th horizontal Depth image row are shown in Fig. 1.14(a). Note that the residual values do not depend on the actual distance to the target plane (within this limited range); they are consistently positive in the center and negative at the periphery. To compensate for this residual error, we form a z-correction image constructed as the pixel-wise mean of all residual images. The z-correction image is subtracted from the z coordinate of X IR computed by Eq. 1.6.

To evaluate this correction method, the z-correction image was constructed from the residuals of the even images and then applied to the odd (first row of Table 1.4) and to the even (second row of Table 1.4) depth images. The standard deviation of the residuals decreased in both cases.

Table 1.4 Evaluation of the z-correction. The standard deviation of the residuals of the plane fit to the measurement of a planar target has been reduced
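A minimal sketch of the correction step; computing the per-frame plane-fit residual images is omitted, and residuals is assumed to be a list of such images:

```python
import numpy as np

def build_z_correction(residual_images):
    """Pixel-wise mean of the per-frame plane-fit residuals in z (Fig. 1.13)."""
    return np.mean(np.stack(residual_images), axis=0)

def apply_z_correction(X_ir, x, y, z_corr):
    """Subtract the learned correction from the z coordinate of the point
    reconstructed by Eq. 1.6 from depth pixel [x, y]."""
    X = np.asarray(X_ir, dtype=float).copy()
    X[2] -= z_corr[int(round(y)), int(round(x))]
    return X

# Cross-validation as in Table 1.4: learn on even frames, evaluate on odd ones.
# z_corr = build_z_correction(residuals[0::2])
```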

After applying the z-correction to Kinect measurements from the experiment described in Sect. 1.5.2, the mean of the residual errors decreased by approximately 0.25 mm, Fig. 1.14(b). The residuals were evaluated on 4410 points spanning the field of view.

5 Validation

In this section, different publicly available Kinect depth models are tested and compared to our method on a 3D calibration object. Furthermore, we compare the accuracy of Kinect measurements against stereo triangulation and Time-of-Flight 3D measurements. Finally, we demonstrate the functionality of our Kinect calibration procedure by integrating it into an SfM pipeline.

5.1 Kinect Depth Models Evaluation on a 3D Calibration Object

We evaluate the accuracy of the calibration by measuring a reference 3D object. The object consisted of five flat targets rigidly mounted along a straight line on a rigid bench, Fig. 1.15(a). As ground truth, the distances between the centers of the targets were carefully measured with a measuring tape with an accuracy better than 1 mm.

The object was then captured with Kinect from two different distances to obtain measurements in the range from 0.7 m to 2 m, Fig. 1.15(b). After extracting the central points of the targets in the IR image, Fig. 1.15(a), several different reconstruction methods were used to compute their 3D positions, Fig. 1.15(b).

Our Kinect calibration model, described in Sect. 1.4, was compared to the ROS calibration [11], the Burrus calibration [2], the Magnenat calibration [21], the OpenNI calibration [16] and the Microsoft Kinect SDK calibration [15].

The distances between the reconstructed target points are compared to the ground truth measurements in Table 1.5 and in Fig. 1.16. The experiment was performed on two Kinect devices. Kinect 1 is the device for which the complete calibration described in this chapter was made. Kinect 2 was evaluated with the calibration of Kinect 1, to determine whether it is possible to transfer the calibration parameters of one device to another. Our method is the best for Kinect 1 and among the best three for Kinect 2.

Table 1.5 Accuracy evaluation of different reconstruction methods on a reference 3D object. Kinect 1 is the device for which we made complete calibration as described in this chapter. Kinect 2 was evaluated with the calibration from Kinect 1

5.2 Comparison of Kinect, SLR Stereo and 3D TOF

We have compared the accuracy of Kinect, SLR Stereo and a 3D TOF camera on measurements of planar targets. Kinect and the SLR Stereo rig (image size 2304×1536 pixels) were rigidly mounted (Fig. 1.2) and calibrated (Fig. 1.12) together. SLR Stereo reconstruction was performed by extracting calibration points with [1] and triangulating them by linear least-squares triangulation [5]. Both sensors measured the same planar targets in 315 control calibration points on each of the 14 targets. The SR-4000 3D TOF camera [13] measured different planar targets, but in a comparable range of distances, 0.9–1.4 meters from the sensor, in 88 control calibration points on each of the 11 calibration targets. The error e, Table 1.6, corresponds to the Euclidean distance between the points returned by the sensors and the points reconstructed in the process of calibrating the cameras of the sensors. SLR Stereo is the most accurate, Kinect follows, and the SR-4000 is the least accurate.
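For reference, a sketch of the homogeneous linear (DLT) variant of the triangulation described in [5], for one point seen in two views; P1, P2 are the 3×4 projection matrices of the calibrated SLR cameras and u1, u2 the undistorted pixel coordinates (not necessarily the exact variant used in the experiment):

```python
import numpy as np

def triangulate(P1, P2, u1, u2):
    """Linear (DLT) triangulation of one 3D point from two views.
    Each pixel contributes two rows u*P[2] - P[0] and v*P[2] - P[1];
    the point is the null vector of the stacked system."""
    A = np.vstack([
        u1[0] * P1[2] - P1[0],
        u1[1] * P1[2] - P1[1],
        u2[0] * P2[2] - P2[0],
        u2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]     # de-homogenize
```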

Fig. 1.2 A rig with a Kinect and two Nikon D60 SLR cameras

Fig. 1.3 Example of Kinect output images

Fig. 1.4 The estimated size of the Kinect quantization step q as a function of the target distance for 0–5 m

Fig. 1.5 Geometrical model of Kinect

Fig. 1.6 Estimated distortions of the Kinect cameras. The red numbers denote the sizes and the arrows the directions of the pixel displacements induced by the lens distortion. The cross indicates the image center, the circle marks the location of the principal point

Fig. 1.7 The calibration board in the IR, RGB and Depth images

Fig. 1.8 Illustration of the IR to Depth image shift

Fig. 1.9 Identification of the geometrical model

Fig. 1.10 Kinect IR camera (blue) and projector (green) view fields and ray distribution in the xy plane, estimated in Sect. 1.3.3. For clarity, we plot only every 64th camera ray (11 rays for the IR camera) and every 150th projector ray (32 projector rays). Red dots illustrate the sampling of the space by points that can be reconstructed. The bold blue line marks the central ray of the IR camera along which the distance resolution shown in Fig. 1.11 was estimated. Note that the closest point actually measured by the real device is at a depth of about 40 cm

Fig. 1.11 Comparison of the stereo reconstruction uncertainty measured with Kinect and simulated using the identified parameters of the stereo system

Fig. 1.12 Position and orientation of the Kinect IR and RGB cameras and the SLR stereo pair (Left, Right), together with the 3D calibration points reconstructed on the planar calibration targets

Fig. 1.13 Residuals of the plane fitting showing the fixed-pattern noise on depth images from different distances

Fig. 1.14 Correcting complex residual errors

Fig. 1.15 Kinect accuracy evaluation on a 3D reference object with five flat targets mounted on a rigid bench

Fig. 1.16 Accuracy evaluation of different reconstruction methods on a 3D calibration object

Table 1.6 Comparison of SLR Stereo triangulation, Kinect and SR-4000 3D TOF depth sensing

5.3 Combining Kinect and Structure from Motion

Figure 1.17 shows a pair of half-resolution (640×480) Kinect RGB and depth images (the original depth image was reprojected using Eq. 1.7 to correspond with the RGB image pixels). A sequence of 50 RGB-Depth image pairs was acquired and the relative poses of the RGB cameras were computed by an SfM pipeline [6, 20]. Figure 1.18(a) shows a surface reconstructed from 3D points obtained by Multiview stereo [9] using the Kinect RGB images alone. Using the retrieved relative poses, the depth data were registered into a common coordinate system and used in the same method to obtain an improved reconstruction, Fig. 1.18(b).
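Registering the depth data amounts to mapping every per-frame point cloud through its SfM pose; a sketch under the x = R(X − C) convention used throughout the chapter:

```python
import numpy as np

def register_depth_maps(points_per_frame, poses):
    """Map per-frame Kinect point clouds (N_i x 3 arrays of camera-frame
    points) into the common SfM frame: X_world = R^T x + C for pose (R, C).
    In row-vector form (R^T x)^T = x R, hence X @ R + C below."""
    return np.vstack([X @ R + C for X, (R, C) in zip(points_per_frame, poses)])
```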

Fig. 1.17 Example of images from the Kinect RGB camera and the corresponding depth that were used for scene reconstruction

Fig. 1.18 Scene reconstruction from the Kinect RGB camera. The figure compares the reconstruction quality when the scene is reconstructed using Multiview stereo alone and when the 3D data from Kinect are also available

Figure 1.19 compares a 3D surface reconstruction from a point cloud computed by plane sweeping [9] on the 2304×1536-pixel images with the result of the surface reconstruction of [9] applied to 70 registered Kinect depth images. The Kinect 3D data were registered into a common coordinate system via SfM [6, 20] applied to the Kinect image data. We see that when multiple measurements are combined, the Kinect result is quite comparable to the more accurate Multiview stereo reconstruction.

Fig. 1.19 Comparison of Kinect with Multiview reconstruction [9]

6 Conclusion

We have provided an analysis of the 3D measurement capabilities of Kinect and a calibration procedure that allows Kinect to be combined with SfM and Multiview Stereo, which opens a new area of applications for Kinect. It was interesting to observe that, in the quality of the multi-view reconstruction, Kinect outperformed the SwissRanger SR-4000 and came close to a 3.5 Mpixel SLR Stereo rig.