
1 Introduction

Kinect [4, 14, 22] has become an important 3D sensor. It has received a lot of attention thanks to the rapid human-pose recognition system developed on top of its 3D measurements [17]. Its low cost, reliability and measurement speed promise to make Kinect the primary 3D measuring device in indoor robotics [25], 3D scene reconstruction [7], and object recognition [12].

In this chapter we provide a geometrical analysis of Kinect, design its geometrical model, propose a calibration procedure and demonstrate its performance. We extend here our preliminary results presented in [18].

Approaches to modeling Kinect geometry, which have appeared recently, provide a good basis for understanding the sensor. The most relevant previous work is the following. The authors of [2] combined OpenCV camera calibration [24] with the Kinect inverse disparity measurement model [3] to obtain a basic Kinect calibration procedure; the project did not study particular features of Kinect sensors and did not correct for them. An almost identical procedure [11] is implemented in ROS, where an apparent shift between the infrared and depth images is corrected. Another variation of this approach appeared in [8], where the OpenCV calibration is replaced by Bouguet's calibration toolbox [1].

We build on top of this previous work and design an accurate calibration procedure based on geometrical models as well as on "learning" an additional correction procedure that accounts for the remaining non-modeled errors. We use the full camera models and calibration procedures as implemented in [1], the relationship between Kinect inverse disparity and depth from [3], the correction of the depth and infrared image displacement from [11], and add further corrections trained on examples of calibration boards. We demonstrate that a calibrated Kinect can be combined with Structure from Motion (SfM) to obtain 3D data in a consistent coordinate system, allowing the surface of the observed scene to be constructed by Multiview Stereo. Our comparison shows that Kinect is superior in accuracy to the SwissRanger SR-4000 3D TOF camera and close to a medium-resolution SLR stereo rig. Our results agree with [10], which reports compatible observations about the Kinect depth quantization.

2 Kinect as a 3D Measuring Device

Kinect is a composite device consisting of a near-infrared laser pattern projector, an IR camera and a color (RGB) camera, Fig. 1.1. The IR camera and the projector form a stereo pair used to triangulate points in 3D space. The RGB camera can then be used to texture the 3D points or to recognize the image content. As a measuring device, Kinect delivers three outputs: an IR image, an RGB image, and an (inverse) depth image.

Fig. 1.1 Kinect consists of an infrared (IR) projector, an IR camera and an RGB camera (illustration from [11])

2.1 IR Image

The IR camera, Fig. 1.3(b), (1280×1024 pixels, 57×45° field of view, 6.1 mm focal length, 5.2 μm pixel size) is used to observe and decode the IR projection pattern for triangulating the 3D scene. If suitably illuminated by a halogen lamp [19, 23] with the IR projector blocked, Fig. 1.7(c, d), it can be reliably calibrated by [1] using the same checkerboard pattern as for the RGB camera calibration. The camera exhibits non-negligible radial and tangential distortions, see Sect. 1.4.

2.2 RGB Image

The RGB camera, Fig. 1.3(a), (1280×1024 pixels, 63×50° field of view, 2.9 mm focal length, 2.8 μm pixel size) delivers medium-quality images. It can be calibrated by [1] and used to track relative poses between subsequent images with an SfM system, e.g. [6, 20].

2.3 Depth Image

The main raw output of Kinect is an 11-bit image, Fig. 1.3(c), which corresponds to the depth in the scene. Rather than providing the actual depth z, Kinect returns “inverse depth” 1/z, as shown in Fig. 1.4(a). Taking into account the depth resolution achievable with a Kinect (Sect. 1.2.4), we adopted the model suggested in [11]. The depth image is constructed by triangulation from the IR image and the projector and hence it is “carried” by the IR image, as shown in Eq. 1.5.

The depth image has a vertical stripe of pixels on the right (8 pixels wide) where no depth is calculated, see Fig. 1.3(c). This is probably due to the windowing effect of the block correlation used in calculating the disparity [11]. We estimated the size of the correlation window (see Sect. 1.3.1) to be 9×7 pixels.

2.4 Depth Resolution

Figure 1.4(b, c) shows the resolution of the measured depth as a function of the true depth. The depth resolution was measured by moving Kinect away (0.5 m–15 m) from a planar target in sufficiently fine steps to record all the values returned in a view field of approximately 5° around the image center.

The size of the quantization step q [mm], which is the distance between two consecutive recorded values, was found to be the following function of the depth z [m]:

$$ q(z) = 2.73\,z^2 + 0.74\,z - 0.58. $$
(1.1)

This is in accordance with the quadratic depth resolution expected for triangulation-based devices. At the beginning and at the end of the operational range, the quantization step was q(0.50 m) = 0.65 mm and q(15.7 m) = 685 mm, respectively. These findings are in accordance with [10].
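For quick reference, a minimal Python sketch that evaluates the fitted model of Eq. 1.1 at a few depths; note that the endpoint values quoted above are measured values and need not coincide exactly with the polynomial fit:

```python
def quantization_step(z):
    """Kinect depth quantization step q [mm] at depth z [m], Eq. 1.1."""
    return 2.73 * z**2 + 0.74 * z - 0.58

# Tabulate the step over the operational range of Sect. 1.2.4.
for z in (0.5, 1.0, 2.0, 5.0, 10.0, 15.7):
    print(f"z = {z:5.1f} m  ->  q = {quantization_step(z):7.2f} mm")
```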

3 Kinect Geometrical Model

We model Kinect as a multi-view system consisting of RGB, IR and Depth cameras. The geometrical model of the RGB and IR cameras, which projects a 3D point X into an image point [u,v], is given by [1]:

$$ \left[ \begin{array}{c} x\\ y\\ z \end{array} \right] = \mathtt{R}\,(\mathtt{X}-\mathtt{C}), $$
(1.2)

$$ \left[ \begin{array}{c} u\\ v\\ 1 \end{array} \right] = \mathtt{K}\operatorname {dis}\!\left(\left[ \begin{array}{c} x\\ y\\ z \end{array} \right],\mathtt{k}\right), $$
(1.3)

$$ z = \left[ \begin{array}{ccc} 0 & 0 & 1 \end{array} \right] \mathtt{R}\,(\mathtt{X}-\mathtt{C}), $$
(1.4)

where \(\operatorname {dis}\) applies the radial and tangential lens distortion of [1] to the normalized image coordinates [x/z,y/z], with distortion parameters k=[k 1,k 2,…,k 5], camera calibration matrix K, rotation R and camera center C [5].

The Depth camera of Kinect is associated with the geometry of the IR camera. For every pixel [u,v] of the IR camera, it returns the inverse depth d along the z-axis, as visible in Fig. 1.5:

$$ d(u-u_{0},\,v-v_{0}) = \frac{1}{c_{1}}\,\frac{1}{z(u,v)} - \frac{c_{0}}{c_{1}}, $$
(1.5)

where u, v are given by Eq. 1.3, the true depth z by Eq. 1.4, [u 0,v 0] by Table 1.1, X stands for the 3D coordinates of a 3D point, and c 1, c 0 are the parameters of the model. We associate the Kinect coordinate system with the IR camera and hence get R IR=I and C IR=0. A 3D point X IR is constructed from the measurement [x,y,d] in the depth image by

$$ \mathtt{X}_{\mathrm{IR}} = \frac{1}{c_1 d+c_0} \operatorname {dis}^{-1}\left(\mathtt{K}_{\mathrm{IR}}^{-1}\left[ \begin{array}{c} x+u_0\\y+v_0\\ 1 \end{array} \right],\mathtt{k}_{\mathrm{IR}}\right) $$
(1.6)

and projected to the RGB images as

$$ \mathtt{u}_{\mathrm{RGB}} = \mathtt{K}_{\mathrm{RGB}} \operatorname {dis}\bigl(\mathtt{R}_{\mathrm{RGB}}(\mathtt{X}_{\mathrm{IR}}-\mathtt{C}_{\mathrm{RGB}}), \mathtt{k}_{\mathrm{RGB}}\bigr) $$
(1.7)

where \(\operatorname {dis}\) is the distortion function given by Eq. 1.3, k IR, k RGB are the respective distortion parameters of the IR and RGB cameras, K IR is the IR camera calibration matrix and K RGB,R RGB,C RGB are the calibration matrix, the rotation matrix and the center of the RGB camera, respectively.
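The following sketch implements Eqs. 1.6 and 1.7 with OpenCV, whose distortion-coefficient order (k1, k2, p1, p2, k3) matches Bouguet's kc used here; cv2.undistortPoints realizes dis^{-1} composed with K^{-1}, and cv2.projectPoints realizes K dis(R(X−C), k). The function names and default shift values are ours; the calibration parameters are assumed to come from the procedure of Sect. 1.4.

```python
import numpy as np
import cv2

def depth_pixel_to_point(x, y, d, K_ir, k_ir, c0, c1, u0=4.0, v0=3.0):
    """Back-project a depth-image measurement [x, y, d] to a 3D point X_IR
    in the IR camera frame (Eq. 1.6). (u0, v0) is the Depth-to-IR pixel
    shift of Table 1.1; c0, c1 are the inverse-depth parameters."""
    z = 1.0 / (c1 * d + c0)                 # metric depth from inverse depth d
    pix = np.array([[[x + u0, y + v0]]], dtype=np.float64)
    # undistortPoints applies K^-1 followed by the inverse distortion dis^-1
    xn, yn = cv2.undistortPoints(pix, K_ir, k_ir)[0, 0]
    return z * np.array([xn, yn, 1.0])

def point_to_rgb_pixel(X_ir, K_rgb, k_rgb, R_rgb, C_rgb):
    """Project a point given in the IR frame into the RGB image (Eq. 1.7)."""
    rvec, _ = cv2.Rodrigues(R_rgb)
    tvec = (-R_rgb @ C_rgb).reshape(3, 1)   # R(X - C) = R X + t with t = -R C
    uv, _ = cv2.projectPoints(X_ir.reshape(1, 3), rvec, tvec, K_rgb, k_rgb)
    return uv[0, 0]
```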

Table 1.1 IR to Depth-camera pixel position shift

3.1 Shift Between IR Image and Depth Image

The IR and Depth images were found to be mutually shifted. To determine the shift [u 0,v 0], circular targets spanning the field of view were captured from different distances in the IR and Depth images, Fig. 1.8(a). Edges of the targets were computed in both images using the Sobel edge detector. To mitigate the effect of the unstable Depth-image edges, circles were fitted to the detected edge points, Fig. 1.8(b). The pixel distances between the centers of the fitted circles are shown in Table 1.1. The shift was estimated as the mean value of these distances over all the experiments. We conclude that there is a shift of about 4 pixels in the u direction and 3 pixels in the v direction.
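A minimal sketch of the circle-fitting step, using the standard algebraic least-squares fit; edge_pairs is a hypothetical list of matched IR/Depth edge-point arrays, and the sign convention of the shift is illustrative:

```python
import numpy as np

def fit_circle(pts):
    """Algebraic least-squares circle fit: center (cx, cy) and radius r.
    Solves 2*cx*x + 2*cy*y + (r^2 - cx^2 - cy^2) = x^2 + y^2."""
    x, y = pts[:, 0], pts[:, 1]
    A = np.column_stack([2.0 * x, 2.0 * y, np.ones_like(x)])
    rhs = x**2 + y**2
    (cx, cy, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return np.array([cx, cy]), np.sqrt(c + cx**2 + cy**2)

# Shift estimate: mean difference of fitted centers over all target/distance pairs.
# shift = np.mean([fit_circle(e_depth)[0] - fit_circle(e_ir)[0]
#                  for e_ir, e_depth in edge_pairs], axis=0)
```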

3.2 Identification of the IR Projector Geometrical Center

We first acquired seven IR and Depth images of a plane positioned at different distances. The projected pattern contains nine brighter, easily identifiable speckle dots, Fig. 1.9(a). These points were formed by rays l r, r=1,…,9, transmitted from the IR projector. Each point was reconstructed in 3D space and grouped by its ray of origin as \(\mathtt{X}_{\mathrm{IR}_{i,r}}\). The IR projector center C P is located at the common intersection of the nine rays. We formulated a nonlinear optimization problem to find C P by minimizing the perpendicular distances of the reconstructed points \(\mathtt{X}_{\mathrm{IR}_{i,r}}\) from a bundle of rays passing through C P. Figure 1.9(b) shows the resulting ray bundle next to the IR camera frame. Figure 1.9(c) shows the residual distances from the points \(\mathtt{X}_{\mathrm{IR}_{i,r}}\) to their corresponding rays of the optimal ray bundle. All residual distances are smaller than 2 mm. The estimated projector center has coordinates C P=[74.6, 1.1, 1.3] mm in the IR camera reference frame.
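A sketch of this bundle optimization, assuming scipy and a plain point-to-line distance residual; the initialization of C P near the nominal ~75 mm baseline and the function names are our choices:

```python
import numpy as np
from scipy.optimize import least_squares

def bundle_residuals(params, pts, ray_id, n_rays=9):
    """Perpendicular distances of the points to a bundle of rays through C_P."""
    C = params[:3]
    dirs = params[3:].reshape(n_rays, 3)
    res = []
    for r in range(n_rays):
        v = dirs[r] / np.linalg.norm(dirs[r])   # unit direction of ray r
        D = pts[ray_id == r] - C                # points on ray r, relative to C_P
        res.append(np.linalg.norm(D - np.outer(D @ v, v), axis=1))
    return np.concatenate(res)

def estimate_projector_center(pts, ray_id, n_rays=9):
    """pts: (N,3) reconstructed points [m]; ray_id: (N,) ray index 0..8."""
    # Initialize C_P near the expected ~75 mm baseline, directions from point means.
    x0 = np.concatenate([[0.075, 0.0, 0.0]] +
                        [np.mean(pts[ray_id == r], axis=0) for r in range(n_rays)])
    sol = least_squares(bundle_residuals, x0, args=(pts, ray_id, n_rays))
    return sol.x[:3]
```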

3.3 Identification of Effective Depth Resolutions of the IR Camera and Projector Stereo Pair

In this section we investigate the view fields of the Kinect IR camera and projector and their effective resolutions, which determine the distribution of the resolution of the 3D measurement.

The size of the IR image and of the depth image is known to be 640×480 pixels with 10.4 μm pixel size, spanning a view field of approximately 60°×45°. This gives an angular resolution of 0.0938°/pixel in the IR camera.

Counting the speckle dots of the projected pattern yields about 800 dots along the central horizontal line across the projector field of view. The projector FOV and the IR camera FOV are approximately the same. Hence we get 800 dots per 60°, i.e. an angular resolution of 0.075°/ray for the projector rays. The green curve in Fig. 1.11 shows the simulated depth quantization along the central IR camera ray (the blue line in Fig. 1.10) for the camera and projector resolutions described above. It clearly does not correspond to the red curve measured on a real Kinect.

To get our simulation closer to reality, we assume that the ray detection is done with higher accuracy by interpolating rays from the projected pattern. The blue curve in Fig. 1.11 corresponds to detecting rays with 1/8 pixel accuracy, as was hypothesized in [11]. Hence we get an effective resolution of 5120 = 640×8 rays per 60°, i.e. 0.0117°/ray, in the projector. This corresponds to our measurement on a real Kinect.
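The simulation behind Fig. 1.11 can be reproduced in a few lines: consecutive projector rays at the effective angular spacing cross the central camera ray at depths z = b/tan(kΔθ), and the step between neighbouring crossings is the quantization. A sketch under these assumptions, taking the ~75 mm baseline identified in Sect. 1.3.2; it yields q(1 m) ≈ 2.7 mm, consistent with Eq. 1.1:

```python
import numpy as np

b = 0.075                     # IR camera-projector baseline [m], Sect. 1.3.2
dtheta = np.deg2rad(0.0117)   # effective projector angular resolution (1/8 pixel)

# The projector ray at angle k*dtheta from the optical axis crosses the
# central camera ray at depth z = b / tan(k*dtheta).
k = np.arange(1, 2000)
z = b / np.tan(k * dtheta)
q = -np.diff(z) * 1000.0      # step [mm] between neighbouring reconstructible depths

mask = (z[1:] > 0.5) & (z[1:] < 15.0)
for zi, qi in zip(z[1:][mask][::100], q[mask][::100]):
    print(f"z = {zi:6.2f} m   q = {qi:7.2f} mm")
```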

Figure 1.10 illustrates the view fields and ray arrangements of the Kinect IR camera (blue) and projector (green). The bold blue line marks the center of the IR camera view where the distance resolution was evaluated. For clarity, we show only every 64th camera ray and every 150th projector ray, with their intersections drawn as red dots.

4 Kinect Calibration

We calibrate the Kinect cameras together, as proposed in [1], by showing the same calibration target to the IR and RGB cameras, Fig. 1.7(c). This allows both cameras to be calibrated w.r.t. the same 3D points, and hence the poses of the cameras w.r.t. the points can be chained to give their relative pose, Fig. 1.12. Taking the Cartesian coordinate system of the IR camera as the global Kinect coordinate system makes the relative camera pose equal to R RGB, C RGB.
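A small sketch of the pose chaining, assuming the x = dis(R(X − C)) convention of Eqs. 1.2–1.4 and per-camera poses (R, C) estimated w.r.t. the same calibration target:

```python
import numpy as np

def relative_pose(R_ir, C_ir, R_rgb, C_rgb):
    """Chain the per-camera poses w.r.t. the calibration target into the pose
    of the RGB camera in the IR frame. With x = R(X - C), substituting
    X = R_ir^T x_ir + C_ir into x_rgb = R_rgb(X - C_rgb) gives
    x_rgb = R_rel (x_ir - C_rel)."""
    R_rel = R_rgb @ R_ir.T
    C_rel = R_ir @ (C_rgb - C_ir)
    return R_rel, C_rel
```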

Tables 1.2 and 1.3 show the internal parameters and Fig. 1.6 shows the effect of the distortions in the cameras. We included the tangential distortion since it non-negligibly increased the overall accuracy of the 3D measurements. Figure 1.7(a) shows the IR image of the calibration board under normal Kinect operation, when it is illuminated by its IR projector. A better image is obtained by blocking the IR projector and illuminating the target with a halogen lamp, Fig. 1.7(b).

Table 1.2 Intrinsic parameters of the Kinect IR camera
Table 1.3 Intrinsic parameters of the Kinect RGB camera

The parameters c 0, c 1 of the Depth camera are calibrated as follows. We get n measurements \(\mathtt{X}_{D_{i}} = [x_{i},y_{i},d_{i}]^{\top}\), i=1,…,n, of all the calibration points from the depth images, Fig. 1.7(d). The Cartesian coordinates \(\mathtt{X}_{\mathrm{IR}_{i}}\) of the same calibration points were measured in the IR Cartesian system by intersecting the rays projecting the points into the IR images with the best plane fits to the reconstructed calibration points. The parameters c 0, c 1 were then optimized to best fit \(\mathtt{X}_{D_{i}}\) to \(\mathtt{X}_{\mathrm{IR}_{i}}\) using Eq. 1.6.
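The chapter optimizes c 0, c 1 through the full 3D fit of Eq. 1.6; for illustration, a linearized variant that fits the inverse-depth relation 1/z = c 1 d + c 0 directly by linear least squares (our simplification, not the authors' exact objective) looks as follows:

```python
import numpy as np

def fit_inverse_depth_params(d, z):
    """Fit c1, c0 in 1/z = c1*d + c0 from raw inverse-depth values d and
    reference depths z [m] of the same calibration points."""
    d = np.asarray(d, dtype=float)
    A = np.column_stack([d, np.ones_like(d)])
    (c1, c0), *_ = np.linalg.lstsq(A, 1.0 / np.asarray(z, dtype=float), rcond=None)
    return c1, c0
```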

4.1 Learning Complex Residual Errors

A Kinect calibrated with the above procedure still exhibits small but relatively complex residual errors for close-range measurements. Figure 1.13 shows the residuals after fitting a plane to the calibrated Kinect measurement of a planar target spanning the field of view. The target was captured from 18 different distances ranging from 0.7 to 1.3 meters. The residuals were found to be highly correlated across these distances, i.e. they form a fixed pattern.

Residuals along the 250th horizontal Depth image row are shown in Fig. 1.14(a). Note that the residual values do not depend on the actual distance to the target plane (within this limited range); they are consistently positive in the center and negative at the periphery. To compensate for this residual error, we form a z-correction image constructed as the pixel-wise mean of all residual images. The z-correction image is subtracted from the z coordinate of X IR computed by Eq. 1.6.

To evaluate this correction method, the z-correction image was constructed from the residuals of the even images and then applied to the odd (first row of Table 1.4) and to the even (second row of Table 1.4) depth images. The standard deviation of the residuals decreased in both cases.

Table 1.4 Evaluation of the z-correction. The standard deviation of the residuals of the plane fit to the measurement of a planar target has been reduced
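A minimal sketch of the correction step; computing the per-frame plane-fit residual images is omitted, and residuals is assumed to be a list of such images:

```python
import numpy as np

def build_z_correction(residual_images):
    """Pixel-wise mean of the per-frame plane-fit residuals in z (Fig. 1.13)."""
    return np.mean(np.stack(residual_images), axis=0)

def apply_z_correction(X_ir, x, y, z_corr):
    """Subtract the learned correction from the z coordinate of the point
    reconstructed by Eq. 1.6 from depth pixel [x, y]."""
    X = np.asarray(X_ir, dtype=float).copy()
    X[2] -= z_corr[int(round(y)), int(round(x))]
    return X

# Cross-validation as in Table 1.4: learn on even frames, evaluate on odd ones.
# z_corr = build_z_correction(residuals[0::2])
```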

After applying the z-correction to Kinect measurements from the experiment described in Sect. 1.5.2, the mean of the residual errors decreased by approximately 0.25 mm, Fig. 1.14(b). The residuals were evaluated on 4410 points spanning the field of view.

5 Validation

In this section, different publicly available Kinect depth models are tested and compared to our method on a 3D calibration object. Furthermore, we compare the accuracy of Kinect measurements against stereo triangulation and Time-of-Flight 3D measurements. Finally, we demonstrate the functionality of our Kinect calibration procedure by integrating it into an SfM pipeline.

5.1 Kinect Depth Models Evaluation on a 3D Calibration Object

We evaluate the accuracy of the calibration by measuring a reference 3D object. The object consisted of five flat targets rigidly mounted along a straight line on a rigid bench, Fig. 1.15(a). As ground truth, the distances between the centers of the targets were carefully measured with a measuring tape with an accuracy better than 1 mm.

The object was then captured with Kinect from two different distances to obtain measurements in the range from 0.7 m to 2 m, Fig. 1.15(b). After extracting the central points of the targets in the IR image, Fig. 1.15(a), several different reconstruction methods were used to compute their 3D positions, Fig. 1.15(b).

Our Kinect calibration model, described in Sect. 1.4, was compared to the ROS calibration [11], the Burrus calibration [2], the Magnenat calibration [21], the OpenNI calibration [16] and the Microsoft Kinect SDK calibration [15].

The distances between the reconstructed target points are compared to the ground truth measurements in Table 1.5 and in Fig. 1.16. The experiment was performed on two Kinect devices. Kinect 1 is the device for which the complete calibration described in this chapter was made. Kinect 2 was evaluated with the calibration of Kinect 1, to determine whether it is possible to transfer the calibration parameters of one device to another. Our method is the best for Kinect 1 and among the best three for Kinect 2.

Table 1.5 Accuracy evaluation of different reconstruction methods on a reference 3D object. Kinect 1 is the device for which we made complete calibration as described in this chapter. Kinect 2 was evaluated with the calibration from Kinect 1

5.2 Comparison of Kinect, SLR Stereo and 3D TOF

We have compared the accuracy of Kinect, SLR Stereo and a 3D TOF camera on measurements of planar targets. Kinect and the SLR Stereo rig (image size 2304×1536 pixels) were rigidly mounted (Fig. 1.2) and calibrated (Fig. 1.12) together. SLR Stereo reconstruction was performed by extracting calibration points with [1] and triangulating them by linear least-squares triangulation [5]. Both sensors measured the same planar targets in 315 control calibration points on each of the 14 targets. The SR-4000 3D TOF camera [13] measured different planar targets, but in a comparable range of distances, 0.9–1.4 meters from the sensor, in 88 control calibration points on each of the 11 calibration targets. The error e, Table 1.6, corresponds to the Euclidean distance between the points returned by the sensors and the points reconstructed in the process of calibrating the cameras of the sensors. SLR Stereo is the most accurate, Kinect follows, and the SR-4000 is the least accurate.
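For reference, a sketch of the homogeneous linear (DLT) variant of the triangulation described in [5], for one point seen in two views; P1, P2 are the 3×4 projection matrices of the calibrated SLR cameras and u1, u2 the undistorted pixel coordinates (not necessarily the exact variant used in the experiment):

```python
import numpy as np

def triangulate(P1, P2, u1, u2):
    """Linear (DLT) triangulation of one 3D point from two views.
    Each pixel contributes two rows u*P[2] - P[0] and v*P[2] - P[1];
    the point is the null vector of the stacked system."""
    A = np.vstack([
        u1[0] * P1[2] - P1[0],
        u1[1] * P1[2] - P1[1],
        u2[0] * P2[2] - P2[0],
        u2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]     # de-homogenize
```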

Fig. 1.2 A rig with a Kinect and two Nikon D60 SLR cameras

Fig. 1.3 Example of Kinect output images

Fig. 1.4 The estimated size of the Kinect quantization step q as a function of the target distance for 0–5 m

Fig. 1.5 Geometrical model of Kinect

Fig. 1.6 Estimated distortions of the Kinect cameras. The red numbers denote the sizes and the arrows the directions of the pixel displacements induced by the lens distortion. The cross indicates the image center, the circle marks the location of the principal point

Fig. 1.7 The calibration board in the IR, RGB and Depth images

Fig. 1.8 Illustration of the IR to Depth image shift

Fig. 1.9 Identification of the geometrical model

Fig. 1.10 Kinect IR camera (blue) and projector (green) view fields and ray distribution in the xy plane, estimated in Sect. 1.3.3. For clarity, we plot only every 64th camera ray (11 rays for the IR camera) and every 150th projector ray (32 projector rays). Red dots illustrate the sampling of the space by points that can be reconstructed. The bold blue line marks the central ray of the IR camera along which the distance resolution shown in Fig. 1.11 was estimated. Note that the closest point actually measured by the real device is at a depth of about 40 cm

Fig. 1.11 Comparison of the stereo reconstruction uncertainty measured with Kinect and simulated using the identified parameters of the stereo system

Fig. 1.12 Position and orientation of the Kinect IR and RGB cameras and the SLR stereo pair (Left, Right), together with the 3D calibration points reconstructed on the planar calibration targets

Fig. 1.13 Residuals of the plane fitting showing the fixed-pattern noise on depth images from different distances

Fig. 1.14 Correcting complex residual errors

Fig. 1.15 Kinect accuracy evaluation on a 3D reference object with five flat targets mounted on a rigid bench

Fig. 1.16 Accuracy evaluation of different reconstruction methods on a 3D calibration object

Table 1.6 Comparison of SLR Stereo triangulation, Kinect and SR-4000 3D TOF depth sensing

5.3 Combining Kinect and Structure from Motion

Figure 1.17 shows a pair of half-resolution (640×480) Kinect RGB and depth images (the original depth image was reprojected using Eq. 1.7 to correspond with the RGB image pixels). A sequence of 50 RGB-Depth image pairs was acquired and the relative poses of the RGB cameras were computed by an SfM pipeline [6, 20]. Figure 1.18(a) shows a surface reconstructed from 3D points obtained by Multiview stereo [9] using the Kinect RGB images alone. Using the retrieved relative poses, the depth data were registered into a common coordinate system and used in the same method to obtain an improved reconstruction, Fig. 1.18(b).
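Registering the depth data amounts to mapping every per-frame point cloud through its SfM pose; a sketch under the x = R(X − C) convention used throughout the chapter:

```python
import numpy as np

def register_depth_maps(points_per_frame, poses):
    """Map per-frame Kinect point clouds (N_i x 3 arrays of camera-frame
    points) into the common SfM frame: X_world = R^T x + C for pose (R, C).
    In row-vector form (R^T x)^T = x R, hence X @ R + C below."""
    return np.vstack([X @ R + C for X, (R, C) in zip(points_per_frame, poses)])
```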

Fig. 1.17 Example of images from the Kinect RGB camera and the corresponding depth that were used for scene reconstruction

Fig. 1.18 Scene reconstruction from the Kinect RGB camera. The figure compares the reconstruction quality when the scene is reconstructed using Multiview stereo alone and when the 3D data from Kinect are also available

Figure 1.19 compares a 3D surface reconstruction from a point cloud computed by plane sweeping [9] on the 2304×1536-pixel images with the result of the surface reconstruction of [9] applied to 70 registered Kinect depth images. The Kinect 3D data were registered into a common coordinate system via SfM [6, 20] applied to the Kinect image data. We see that when multiple measurements are combined, the Kinect result is quite comparable to the more accurate Multiview stereo reconstruction.

Fig. 1.19 Comparison of Kinect with Multiview reconstruction [9]

6 Conclusion

We have provided an analysis of the 3D measurement capabilities of Kinect and a calibration procedure that allows Kinect to be combined with SfM and Multiview Stereo, which opens a new area of applications for Kinect. It was interesting to observe that, in the quality of the multi-view reconstruction, Kinect outperformed the SwissRanger SR-4000 and came close to a 3.5 Mpixel SLR Stereo rig.