1 Introduction

The Microsoft Kinect Footnote 1 sensor is an RGB-D sensor which provides RGB and depth information (see Fig. 1). It was developed by the Israeli company PrimeSense and was originally intended to be used with the Microsoft Xbox 360 gaming console. However, very soon after the launch of the Kinect in November 2010 it became possible to access the Kinect functionality from a PC.Footnote 2 The possibility of using the Kinect with standard PCs attracted great interest in the Computer Vision community because of the sensor's impressive depth-extraction performance and relatively low cost. Since the Kinect actively emits texture in the infrared domain (see Fig. 1b, c), matching is possible on homogeneous surfaces. This solves a major pitfall of classical stereo vision approaches and allows a more complete 3D representation of the scene.

Fig. 1
figure 1

a Kinect device without cover (image from http://www.ifixit.com/) and b recording of projected IR pattern. c Triangulation with the Kinect: (1) the known IR pattern is projected on the scene. (2) The IR image is recorded. (3) Matching of (1) and (2) is performed to achieve (4) a disparity value, which is transformed to a depth value

Meanwhile, a large amount of work has been done utilizing the Kinect camera to its full extent; e.g., both PCL Footnote 3 (see [25]) and ROS Footnote 4 (see [23]), two major contributors in the 3D Computer Vision and Robotics community, support the Kinect device. Recent application areas of the Kinect include range flow estimation [5], human detection using contours and 3D surface models [29], 3D object detection [26], 3D modeling of indoor environments [7], and dense surface mapping and tracking of scenes [19].

Recently, a significant number of papers have addressed the calibration of Kinect cameras and their use for the processing of depth descriptors. ROS and [16] contributed general knowledge about Kinect cameras as well as the required calibration procedures. Calibration and the geometric properties of depth measurements by the Kinect are also investigated in [11, 28] (for a list of general parameters derived from the Kinect calibration process see, e.g., [28]). In [2], a Kinect calibration scheme is proposed which uses a checkerboard made of special materials that either reflect or deflect the infrared light, making the pattern visible in both the depth and RGB image. In this way, a normal stereo calibration can be performed using a binary version of the depth image and the RGB image as left and right input. The authors also systematically evaluate the degradation of matching and reconstruction under concurrent use of multiple Kinects and show that blocking the infrared emitters, so that only one Kinect pattern is projected at any point in time, actually degrades the results significantly. This indicates that the Kinect depth processing relies on temporal consistency in the IR image and is disturbed by the IR patterns of the other Kinects depending on their relative position. Calibration of time-of-flight (ToF) cameras and sensor fusion systems has been studied extensively in, e.g., [13, 27]. This is relevant here because many calibration concepts can be transferred from sensors that combine ToF and an RGB camera to Kinect cameras: both provide a dense depth image with the inherent challenge of aligning it to a color image. Another known problem of Kinect cameras is dealing with depth discontinuities and surfaces with poor reflectivity, which has also been addressed in studies of ToF cameras [24].

A primary goal of the Kinect is to be a low-cost device for 'natural interaction' (according to PrimeSense) rather than a precision tool for computer vision and robotics research. OpenNI Footnote 5 has been used to interface with the Kinect sensor. However, the OpenNI driver handles the modeling of the reconstruction process and the calibration of the involved parameters in a suboptimal way. In order to achieve more precise reconstructions, the pinhole model used for back-projection in the OpenNI driver needs to be extended, and we identify and correct some imprecisions.

In this paper, we present a real-time algorithm for the computation of local surface patches (in the following called ’texlets’, see Fig. 2c, d) with Kinect cameras. A texlet shares many properties with the patchlet feature, presented in [18], and the surfel feature presented in [3] and used in [7, 20].

Fig. 2
figure 2

a RGB scene image obtained by Kinect and b the corresponding depth map (colored for visualization purposes) from the OpenNI driver. c Texlet representation and parameterization. Red and green lines indicate principal axes of position and orientation uncertainty, respectively. d Example of extracted texlets, with details of a flat surface (including uncertainty visualization) and curvature

In the context of deriving an uncertainty model for texlets, we give both a qualitative and quantitative estimate of the uncertainty distribution of point-wise depth reconstruction with the Kinect camera. As a first contribution of the paper, we provide a quantification of the uncertainty of point reconstructions by deriving a polynomial approximation of the 3D point noise, and by means of Monte Carlo simulation we can use this to compute texlet uncertainties.

A texlet is a local surface descriptor consisting of a 3D center point P and a surface normal N, both with associated uncertainties (see Fig. 2c). In addition, information about size and appearance is associated. Texlets allow for a semantically richer and more condensed representation of smooth 3D surfaces than point clouds [21]. The texlet descriptor is part of the Early Cognitive Vision (ECV) system [22]. The ECV system is a hierarchical system consisting of visual descriptors of different abstraction levels (such as texlets, surflets and surfaces) and has been used in particular in robotic applications (see, e.g., [21]). Figure 3 shows a robot platform realizing an intelligent work-cell with two arms, which is also used for the quantifications in this paper. The workspace is observed by three pairs of Kinect and stereo cameras (see the sub-figure on the bottom left of Fig. 3).

Fig. 3
figure 3

Robotic setup with multiple stereo and Kinect cameras

Texlets as described here can be extracted from Kinect cameras as well as from stereo cameras. Since entities of higher abstraction levels in the ECV system depend on the texlets, the derivation of an appropriate uncertainty model for texlets as well as real-time capabilities are of importance. Real-time processing is achieved by means of a Graphics Processing Unit (GPU), which is now widely used in computer vision.

We use GPUs to extract texlets in real-time from the 3D information provided by the Kinect camera. Our approach can also be used for any other 3D sensing device—such as a standard stereo camera—provided an appropriate noise model exists. We compare three different methods to compute the texlet position P and the associated surface normal N. By this, we identify two methods with similar reconstruction quality, of which one is more suitable to be transferred to a GPU due to its non-iterative nature. As a second contribution, we achieve a processing speed for texlet computation (including an associated uncertainty based on the uncertainty model of point reconstruction) of up to 30 Hz, thus utilizing the full frame rate of the Kinect. As a reference, the same algorithm applied to 3D point clouds extracted from stereo camera input (with approximately double the resolution of the Kinect camera) achieves a processing speed of up to 10 Hz. Based on the pixel-wise uncertainty model and the texlet extraction mechanism, we—as a third contribution—derive an uncertainty model for texlets (also computed in real-time) which is crucial for higher level processes such as temporal accumulation (see [9]) and optimal parameter estimation (see, e.g., [1]).

The fact that depth reconstruction by means of Kinect cameras is based on the projection of an infrared pattern results in interference when multiple Kinect cameras are used at the same time, since the projected patterns overlap. This potentially reduces the matching quality and, as a consequence, also the reconstruction quality. In this context, as a fourth contribution we quantify—in terms of known geometric relations of a test object—the degradation of the quality of the texlet reconstruction on an example scene when using one, two and three Kinect cameras simultaneously. Thus, we are able to investigate the effect of the interference of multiple Kinect cameras on depth extraction.

The paper is structured as follows: In Sect. 2, we describe the geometry and reconstruction with Kinect cameras, including the indicated simplifications in the available drivers and the way we correct them. In this context, we also model the uncertainty of point reconstruction. In Sect. 3, we describe our work on texlet reconstruction and its associated uncertainty as well as the real-time implementation on GPUs. In Sect. 4, we then discuss and quantify the effect of interference when using multiple Kinect sensors for scene reconstruction.

2 Reconstruction with Kinect cameras

Multiple drivers are available for using the Microsoft Kinect. The first major initiative resulted in the unofficial open-source driver LibFreenect. PrimeSense later released their own driver and initiated a new open-source framework called OpenNI, which exposes the functionality of their RGB-D sensors including the one in the Kinect. These drivers are still under constant development. LibFreenect has the advantage of providing support for the hardware not related to the PrimeSense chip (e.g., the motors and LEDs of the Kinect), whereas OpenNI provides a more complete interface to the features of the PrimeSense chip. We prefer the latter since it gives easy access to the RGB image, the IR image and the depth image; hence it is our driver of choice. In the following section, we deal with the reconstruction process realized in Kinect cameras, and in Sect. 2.2 we address the point-wise uncertainty computation.

2.1 Reconstruction of 3D points

We first describe the 3D reconstruction process realized in the OpenNI driver in the following section and then describe our approach to correct inaccuracies in Sect. 2.1.2.

2.1.1 Reconstruction of 3D points using the OpenNI driver

The Kinect consists of two sensors and one emitter in order to provide color and depth information. Figure 1a shows the different components. It possesses a standard RGB camera, which is horizontally aligned to an IR camera with a baseline of approximately 2.5 cm. It has an IR projector which emits a pattern of speckles as shown in Fig. 1b. The IR projector and IR camera have a baseline of approximately 7.5 cm.

Since the projected IR pattern is known to the Kinect, this serves as a virtual second image. The recorded IR image will contain a distorted version of the projected image allowing for calculations of disparities between the image pairs by means of matching. The exact process of matching and the extraction of depth information through triangulation (see Fig. 1c) is undisclosed and protected by a PrimeSense patent. Reverse engineering efforts have revealed some of the details though (see [16, 28]).

When using the OpenNI driver, the depth map (Fig. 2b) can be acquired directly. The depth map is a per-pixel map containing a 16-bit depth value \(z_{\rm depth}\) (with 11-bit resolution) at each pixel position \((x_{\rm pixel}, y_{\rm pixel})\) of the IR camera. If coordinates are needed in 3D space and not as image coordinates, they can be transformed by the driver to real-world points relative to the IR camera coordinate system. OpenNI uses a simplified pinhole model (assuming undistorted images and that the principal point is at the image center) when doing the conversion from \((x_{\rm pixel}, y_{\rm pixel}, z_{\rm depth})\) to \((x, y, z)\) coordinates as described in Eqs. (1)–(3).

$$ \left( \begin{array}{l} x\\ y\\ z \end{array}\right)=\left(\begin{array}{c} x_{\rm norm}\cdot \tan\left(\frac{h_{\rm FOV}^{\rm IR}}{2}\right)\\ y_{\rm norm}\cdot \tan\left(\frac{v_{\rm FOV}^{\rm IR}}{2}\right)\\ 1 \end{array} \right)\cdot z_{\rm depth} $$
(1)

Where:

$$ x_{\rm norm} = 2 \times \left(\frac{x_{\rm pixel}}{r_x}-0.5\right) $$
(2)
$$ y_{\rm norm} = 2 \times \left(0.5-\frac{y_{\rm pixel}}{r_y}\right) $$
(3)

Here \(r_x\) and \(r_y\) are the resolution of the depth map in x and y, respectively. \(h_{\rm FOV}^{\rm IR}\) and \(v_{\rm FOV}^{\rm IR}\) are the horizontal and vertical fields of view of the IR camera, which are stored in the Kinect device. \(h_{\rm FOV}^{\rm IR}\) is displayed in Fig. 1c.
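For illustration, the back-projection of Eqs. (1)–(3) can be sketched as follows (a minimal sketch; the resolution and field-of-view values used here are illustrative assumptions, whereas the real values are read from the Kinect device):

```python
import numpy as np

def openni_backproject(x_pixel, y_pixel, z_depth,
                       r_x=640, r_y=480,
                       h_fov=np.deg2rad(58.0), v_fov=np.deg2rad(45.0)):
    """Simplified pinhole back-projection as modeled by Eqs. (1)-(3).
    FOV and resolution defaults are illustrative placeholders."""
    x_norm = 2.0 * (x_pixel / r_x - 0.5)          # Eq. (2)
    y_norm = 2.0 * (0.5 - y_pixel / r_y)          # Eq. (3)
    x = x_norm * np.tan(h_fov / 2.0) * z_depth    # Eq. (1)
    y = y_norm * np.tan(v_fov / 2.0) * z_depth
    return np.array([x, y, z_depth])
```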

2.1.2 Improving reconstruction quality

Equation (1) assumes that the principal point of the IR camera is in the center of the image. This simplification leads to some inaccuracy in the reconstruction of real-world points derived from the depth map.

The driver also provides the functionality to align the depth map and the RGB image, a process called registration, giving the ability to associate a color to each real-world point. Registration is a well-known concept from the calibration of, e.g., ToF cameras, and is studied in, e.g., [12].

When aligning the depth map obtained from the IR camera with the RGB image using OpenNI, the depth map coordinates \((x_{\rm pixel}, y_{\rm pixel}, z_{\rm depth})^T\) are first projected into the RGB camera coordinate system. This is necessary since the distance between the RGB camera and the IR sensor gives slightly different views. The FOV value read from the sensor device is used for this. One error we encountered arose when these re-projected depth map coordinates were back-projected into real-world points: here, the fields of view of the IR camera \(h_{\rm FOV}^{\rm IR}\) and \(v_{\rm FOV}^{\rm IR}\) from Eq. (1) should be replaced with the fields of view of the RGB camera (\(h_{\rm FOV}^{\rm RGB}\) and \(v_{\rm FOV}^{\rm RGB}\)).

The identified problems can be solved in several ways. One method is to simply use the correct FOV when back-projecting the aligned points, which however is still a simplification.

Since we calibrate the Kinect anyway, the precision of the reconstruction can be further increased by performing an individual monocular calibration of the RGB camera. The depth data, once registered to the RGB camera, are in this viewpoint, and the reconstruction must thus be done with the RGB camera intrinsics. Having the calibration matrix of the RGB camera, the vector representing the back-projection ray going from the focal point through the image point \((x_{\rm pixel}, y_{\rm pixel})\) can be calculated (see, e.g., [6]). Since the depth is known, the originating 3D point can be reconstructed. Calibration of the RGB camera was done as discussed in [28].
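A minimal sketch of this calibrated back-projection, assuming a 3×3 intrinsic matrix K of the RGB camera, depth values already registered to the RGB view, and lens distortion already removed:

```python
import numpy as np

def backproject_with_intrinsics(x_pixel, y_pixel, z_depth, K):
    """Back-project an (undistorted) registered depth pixel using the RGB
    camera intrinsics K. The ray through the pixel is scaled so that its
    z-component equals the measured depth."""
    ray = np.linalg.inv(K) @ np.array([x_pixel, y_pixel, 1.0])
    return ray / ray[2] * z_depth   # 3D point in RGB camera coordinates
```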

2.2 Uncertainty model for point reconstructions with Kinect cameras

In the ECV system [21, 22] the texlets represent only the initial state of a processing hierarchy. For the computation of entities at higher levels, it is crucial to propagate the uncertainty associated with the entities at lower levels of the hierarchy to higher levels for spatial–temporal disambiguation (see [9]), as well as numerical optimization purposes when estimating properties of higher level entities (see [1]).

Since the details of the matching process as well as the reconstruction process are not completely known, it is difficult to derive an analytical model for the uncertainty of depth reconstruction with Kinect cameras; we therefore rely on a Gaussian distribution model estimated experimentally for our uncertainty computations. A series of experiments was conducted to estimate the varying noise at all positions in the depth image returned from a Kinect sensor. The Kinect was pointed towards a flat wall and, for a number of specified distances and orientations, 100 depth maps were recorded. From these depth recordings a standard deviation (SD) was calculated for each pixel position. Figure 4 shows the SD images obtained at 0° at two different depths. For visualization purposes, the computed SDs are normalized to gray-scale values (with white as zero deviation and black as the maximum deviation recorded) and displayed at each pixel position.

Fig. 4
figure 4

a and b Recorded SD images (black means higher SD). a Distance between camera and wall 1.5 m. b Distance between camera and wall 2.5 m. c Residuals of plane fitting against a flat wall

From the recordings it is clearly visible how the SD increases with distance. Furthermore, there seems to be a radial dependency, where the noise increases with larger distance from the center of the depth map. This is expected since the reconstruction process is based on the very same geometric relations as a normal stereo setup. It is, however, also noticeable that straight vertical lines appear in the images in Fig. 4a, b. These vertical artifacts are also observable in the raw depth map. The depth map seems to be divided into vertical slices, with a static depth offset between two such slices. Since the size and appearance of these slices are temporally dependent, they also appear in the SD images, generating larger variations in the transition areas. The vertical slices can also be seen in Fig. 4c. Here, the Kinect is again pointed at a flat wall and all valid points are fitted to a plane. The error at each point is then depicted, with green representing points in front of the plane and red representing points behind it.

It has not been possible to determine a deterministic behavior of these lines as they seem to show up at different, seemingly random, places (see the difference between Fig. 4a, b) and are temporally dependent. The static difference between two slices, however, seems to be of the magnitude of the discretization step. These lines are not accounted for in the final uncertainty model for the Kinect since their appearance is not predictable. For a further investigation of the residual errors see [28], where a learning process is used to learn tendencies and correct some of the static error present.

Since the reconstruction process of the Kinect cameras is comparable to a standard stereo reconstruction (see Sect. 2.1), an uncertainty model \(n = f(r, d)\), which models the variance of Gaussian noise depending on the radial distance r from the center and the depth d, was chosen. We fitted the function f using different polynomial degrees, which showed that a second-order polynomial approximation is optimal.

Altogether, six sets of recordings were made at different depths, starting from around 70 cm and going up to approximately 175 cm. It was not possible to include sets at larger distances, i.e., to utilize the full range of the Kinect, since this requires a very large planar surface. Each set includes measurements where the Kinect was placed in seven different orientation configurations.

The first configuration is frontal parallel to the wall, referred to as 0°. In another two configurations the Kinect is tilted 15° relative to the wall, in either the horizontal or the vertical direction. The next two configurations are similar, but the angle is increased to 30°. In the final two configurations the Kinect was tilted both horizontally and vertically so that the combined angle relative to the wall is either 15° or 30°. The polynomial fits of some of the data sets, sorted by orientation configuration (0°, 30° horizontal and vertical, and 30° diagonal), are shown in Fig. 5. Interestingly, we see that the 3D reconstruction quality does not significantly depend on the orientation of the wall, within the tested orientations, indicating that the angle is not required as a third parameter of the noise model. This might be connected to the (unfortunately undisclosed) matching function used in the Kinect camera, which possibly has an inbuilt invariance towards changes in the infrared pattern caused by slanted surfaces in the scene. Figure 6 shows the fitted uncertainty model based on all orientation measurements. The parametrized model is as follows:

$$ f(r,d) = p_{00} + p_{10}\times r + p_{01}\times d + p_{20}\times r^2 + p_{11}\times r\times d + p_{02}\times d^2 $$
(4)

where

$$ \begin{aligned} p_{00} &= 2.344, \quad p_{10} = -1.202\times 10^{-2}, \quad p_{01} = -2.734\times 10^{-3},\\ p_{20} &= 1.818\times 10^{-5}, \quad p_{11} = 6.516\times 10^{-6}, \quad p_{02} = 1.233\times 10^{-6}. \end{aligned} $$
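For illustration, the fitted model of Eq. (4) can be evaluated as in the following sketch, with r the radial image distance in pixels and d the depth in mm as in Fig. 6:

```python
import numpy as np

# Coefficients of the second-order noise model, Eq. (4)
P = dict(p00=2.344, p10=-1.202e-2, p01=-2.734e-3,
         p20=1.818e-5, p11=6.516e-6, p02=1.233e-6)

def noise_model(r, d):
    """Evaluate f(r, d): noise level as a function of radial image
    distance r (pixels) and depth d (mm)."""
    return (P['p00'] + P['p10'] * r + P['p01'] * d
            + P['p20'] * r**2 + P['p11'] * r * d + P['p02'] * d**2)
```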

Figure 7 shows the depth quantization as a function of distance, which was measured by recording a number of Kinect images of a frontal parallel wall. The smallest change of depth value of a pixel at a given distance is regarded as the quantization step. Comparing Figs. 6 and 7, we conclude that a significant part of the measured noise is in the vicinity of the discretization step.

Fig. 5
figure 5

Polynomial fittings using data acquired with a 0° measurements, b 30° measurements in either horizontal or vertical direction, c 30° diagonal (combined vertical and horizontal) measurements

Fig. 6
figure 6

Uncertainty model using two second degree polynomials to fit radius in pixels (x-axis) and distance in mm (y-axis). z-axis is the SD of the uncertainty in mm. Sub-sampled original data are marked with dots (red 0°, green 15° and blue 30°)

Fig. 7
figure 7

Depth quantization as a function of distance

3 Texlet extraction with Kinect cameras

The texlet feature consists of appearance and geometry information and is extracted from a hexagonal grid of 2D locations in the image (see [22]). The hexagonal grid is chosen since, contrary to a square grid, it ensures a more uniform distance to all neighboring texlets [17]. Appearance is not the focus here and can be expressed in different ways (e.g., by a color histogram computed from the RGB image or, even simpler, a mean color value, see e.g., [21]). The geometric information includes the surface patch parameters P and N, which constitute the 5D pose, represented here by a point in 3D space and a normal vector of the plane, respectively. For each parameter we associate an uncertainty estimate represented using covariance matrices \({\Upsigma_{\mathbf{P}} \in {\mathbb{R}}^{3} \times {\mathbb{R}}^{3}}\) for the position and \({\Upsigma_{{\mathbf{N}}} \in {\mathbb{R}}^{2} \times {\mathbb{R}}^{2}}\) for the angular equivalent representation of the orientation vector. The texlet parameterization is shown in Fig. 2c.
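A texlet can thus be summarized by a small data structure of the following form (a sketch of the parameterization only; the field names are illustrative and the actual types in our implementation may differ):

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Texlet:
    """Local surface patch: 3D position P, unit normal N, and the
    associated covariance matrices (3x3 for position, 2x2 for the
    angular representation of the orientation)."""
    P: np.ndarray                  # shape (3,)
    N: np.ndarray                  # shape (3,), unit length
    cov_P: np.ndarray              # shape (3, 3)
    cov_N: np.ndarray              # shape (2, 2)
    color: np.ndarray = field(default_factory=lambda: np.zeros(3))  # e.g., mean RGB
```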

To extract the texlet geometry from 3D point data, usually total least-squares methods are used (see, e.g., [18]) to compute the surface patch parameters. Such methods minimize the orthogonal distance between the plane and every point associated with the patch. This least-squares optimization is often computed by SVD or other iterative algorithms. Unfortunately, such iterative approaches do not map very well to GPUs except for very large design matrices.Footnote 6 Two non-iterative methods—in the following called 'normal vector averaging' [15] and 'PCA' [8]—are compared in the following section, with an SVD approach as a reference. In Sect. 3.2, we describe our GPU implementation.

3.1 Comparison of methods for least-squares fitting of planes

The averaging method [15] is a very lightweight implementation. It uses a center point and the six surrounding neighbors in the hexagonal grid. The normal vectors of the six patches, each formed by two adjacent neighbor points and the center point, are computed, and their average is used as the orientation n.

$$ {{\mathbf{n}}} = \frac{1}{n}\sum_{i=0}^{n-1}{(p_i-p_c)\times(p_{i+1}-p_c)} $$
(5)

where \(p_c\) is the center point, and \(p_0\) to \(p_{n-1}\) are the neighboring points in counter-clockwise order around \(p_c\), with the last point \(p_n\) wrapping around and being equal to \(p_0\). This vector, combined with the center point, constitutes the texlet orientation N and position P (see Fig. 2c). Note that this approach is only suitable for small neighborhood sizes.
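A minimal sketch of the averaging method of Eq. (5), assuming the neighbors are given in counter-clockwise order (the final normalization to unit length is an addition not written out in Eq. (5)):

```python
import numpy as np

def normal_by_averaging(p_c, neighbors):
    """Average the normals of the triangles formed by the center point and
    pairs of adjacent neighbors (Eq. 5). 'neighbors' is an (n, 3) array in
    counter-clockwise order around p_c."""
    n_pts = len(neighbors)
    normal = np.zeros(3)
    for i in range(n_pts):
        a = neighbors[i] - p_c
        b = neighbors[(i + 1) % n_pts] - p_c   # wrap around: p_n = p_0
        normal += np.cross(a, b)
    normal /= n_pts
    return normal / np.linalg.norm(normal)     # return a unit vector
```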

The PCA method minimizes the variances of the points instead of the orthogonal distance. All 3D points within a certain neighborhood of the center point are used, and the covariance matrix is calculated for those. Since the covariance matrix is symmetric and positive semi-definite, the eigenvalue decomposition gives real and non-negative Eigenvalues. The Eigenvalue decomposition of a matrix A can be written as

$$ A=V\Upsigma V^T $$
(6)

where \(\Upsigma\) contains the Eigenvalues and V contains the Eigenvectors of A. The covariance matrix represents the variance ellipsoid spanned by the points. The Eigenvectors are the principal components of this ellipsoid and the Eigenvalues represent the variances in these directions. Assuming that the noise perpendicular to the true plane is smaller than the in-plane variance (which should be determined by the neighborhood size), the Eigenvector corresponding to the smallest Eigenvalue is the normal vector of the plane patch. The Eigenvalues of a matrix A can be found by solving

$$ \det(A-\sigma I) =0 $$
(7)

with σ being the variable (the Eigenvalues) and I the identity matrix. Since the points are in \({{\mathbb{R}}^3,}\) Eq. (7) can be solved as a cubic equation, and since A is symmetric and positive semi-definite, Eq. (7) gives 3 non-negative solutionsFootnote 7 for σ. This reduces the computational complexity. The texlet orientation N is thus the Eigenvector corresponding to the smallest Eigenvalue. During the covariance calculation, the average of the neighborhood points is already computed. This average is more suitable as the texlet position P than just the center point, as it represents the points' center of gravity.Footnote 8 As the average is already available, this choice does not impose extra computational complexity on the algorithm.
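A sketch of the PCA method: compute the covariance matrix of the neighborhood points and take the Eigenvector of the smallest Eigenvalue as the texlet normal and the centroid as the texlet position (a minimal sketch; a full eigendecomposition is used here instead of the closed-form cubic solution):

```python
import numpy as np

def texlet_by_pca(points):
    """Fit a plane to an (m, 3) array of neighborhood points. Returns the
    texlet position P (centroid) and orientation N (eigenvector of the
    covariance matrix corresponding to the smallest eigenvalue)."""
    P = points.mean(axis=0)
    cov = np.cov(points - P, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric PSD: real eigenvalues, ascending
    N = eigvecs[:, 0]                        # direction of smallest variance
    return P, N / np.linalg.norm(N)
```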

An evaluation of the two algorithms, with SVD as a reference, has been performed in order to investigate the stability of the calculated orientation N when the input points are exposed to noise. The mean and variance of the angular difference between the ground-truth plane normal and the estimated plane normal over 100 runs for increasing amounts of added noise can be seen in Fig. 8a, b.

Fig. 8
figure 8

Mean angle error and variance of unit normal vectors with varying noise

It is clear that the non-iterative PCA method can compete with regular SVD when calculating the normal vector. The very lightweight normal vector averaging is too influenced by even small amounts of noise to be used in our context.

3.2 GPU implementation

In order to extract the texlets in real-time, the CUDA platform for NVIDIA graphics cards has been utilized. The data independence of the computation of each texlet fits very well with the SIMT (Single Instruction Multiple Threads) architecture used in CUDA. The calculation is done in three steps as outlined below. Here, we mention a few parameters that have an impact on the performance (see also Table 1). First, the grid spacing between texlets is relevant, since with smaller spacing more texlets are computed. Each texlet is then calculated at the chosen location using a surrounding neighborhood of image points. The second parameter is thus a radius that defines the size of the texlet neighborhood. This determines the number of points used in the computation of each texlet. This is further discussed after the description of the texlet extraction process based on RGB-D data from the Kinect.

Table 1 Performance parameters of the GPU implementation when varying grid spacing or neighborhood radius

Step 1 Transformation to real-world coordinates The depth map coordinates acquired from the Kinect (aligned to the RGB image) are transformed into real-world coordinates, using the intrinsics and distortion coefficients of the RGB camera to obtain a back-projection ray and intersecting it at the known depth. If multiple Kinects are used, the data sets are aligned to a common reference coordinate system. This can be done pixel-wise, and thus, in this step, one pixel calculation maps to one thread on the CUDA GPU.

Step 2 Removing outliers from texlet neighborhoods Before the actual reconstruction is done, RANSAC [4] is used to remove any of the 3D points which are classified as outliers. This is done locally using the 3D point associated with each of the pixels in the neighborhood; we iteratively choose three points at random as shown in pseudo code in Algorithm 1 and sketched below. Each set of three points spans a plane, and the outlier criterion for the rest of the points is the distance between a point and that plane. If the distance exceeds a threshold value, the point is considered an outlier. To achieve good performance when mapping the RANSAC algorithm to the GPU architecture, we use a fixed number of iterations and omit the step of refitting the plane to the current set of inlier points. For GPU execution, the responsibility for each texlet is divided amongst threads: a thread is responsible for processing the RANSAC algorithm for one texlet neighborhood.

figure a
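The following sketch captures the simplified, fixed-iteration RANSAC variant described above; the iteration count and the distance threshold are illustrative values only:

```python
import numpy as np

def ransac_inliers(points, n_iter=20, threshold=5.0):
    """Keep the largest set of 3D points consistent with a plane spanned by
    three randomly chosen points. No refitting step, as in the GPU variant."""
    best_inliers, best_count = points, 0
    rng = np.random.default_rng()
    for _ in range(n_iter):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        n /= norm
        dist = np.abs((points - p0) @ n)     # point-to-plane distances
        inliers = points[dist < threshold]
        if len(inliers) > best_count:
            best_count, best_inliers = len(inliers), inliers
    return best_inliers
```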

Step 3 Calculating texlets from neighborhood points The orientation N of the texlet is found by applying PCA to the 3D points remaining after applying RANSAC in each texlet neighborhood. The position P is set to the average of these remaining points. The uncertainties associated with each texlet's orientation and position are calculated by Monte Carlo simulation [14], which is a computationally heavy method compared to an analytical derivation. Monte Carlo, on the other hand, gives us the advantage of easy adaptation to other feature types or input data where we want to propagate input uncertainties that might be given in another space, e.g., 2D points of stereo images.Footnote 9 This is the case in our setup shown in Fig. 3, where the amount of input data makes a fast GPU implementation beneficial. On the GPU, Monte Carlo is preferred over, e.g., SVD for propagating uncertainties because of the complexity of the algorithms and how they map to the architecture.

To compute the uncertainties of the texlet parameters, the texlet is created multiple times, each time with noise from the uncertainty model derived in Sect. 2.2 added to each 3D point. The uncertainty representation for the texlet is the covariance matrices \({\Upsigma_{{\mathbf{P}}} \in {\mathbb{R}}^{3} \times {\mathbb{R}}^{3}}\) and \({\Upsigma_{{\mathbf{N}}} \in {\mathbb{R}}^{2} \times {\mathbb{R}}^{2}}\) calculated from this Monte Carlo set. They are visualized in Fig. 2c, and the left highlight in Fig. 2d, using red lines and green lines for the principal axes of the Gaussian distribution model, for position and orientation, respectively.
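A sketch of this Monte Carlo estimate: the texlet is refitted several times with noise drawn from the point-wise model of Sect. 2.2, and the spread of the resulting positions and orientations gives the covariance matrices. The azimuth/elevation conversion used below is one possible angular parameterization of the normal and is an assumption of this sketch, not necessarily the one used in our implementation:

```python
import numpy as np

def texlet_uncertainty(points, point_sigmas, n_runs=50, rng=None):
    """Estimate covariance of texlet position (3x3) and orientation (2x2)
    by perturbing each 3D point with its modeled noise and refitting."""
    rng = rng or np.random.default_rng()
    positions, angles = [], []
    for _ in range(n_runs):
        noisy = points + rng.normal(0.0, point_sigmas[:, None], points.shape)
        P, N = texlet_by_pca(noisy)           # PCA fit from the sketch above
        if N[2] < 0:                          # resolve the sign ambiguity of the normal
            N = -N
        positions.append(P)
        # angular representation of the normal (azimuth, elevation)
        angles.append([np.arctan2(N[1], N[0]), np.arcsin(np.clip(N[2], -1, 1))])
    cov_P = np.cov(np.array(positions), rowvar=False)
    cov_N = np.cov(np.array(angles), rowvar=False)
    return cov_P, cov_N
```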

The spacing of the hexagonal grid (i.e., the distance between the texlet center pixels) can be adjusted given the desired resolution or performance demands. The impact of the spacing on the performance and the number of created texlets can be seen in Table 1. The effect of changing the neighborhood size is also shown. All the data are obtained using 20 RANSAC iterations and 50 Monte Carlo simulations, which have been found experimentally to be adequate. The data with varying spacing (rows 1–3) are obtained with a neighborhood radius of 6 pixels, and the test of neighborhood radius (rows 4–6) is obtained with a spacing of 6 pixels.

4 Quantification of texlet reconstruction with Kinect cameras

As a final step we want to give a qualitative and quantitative evaluation of the reconstruction precision of texlets in a well-defined setup. In addition, we want to measure the effects of interference when using multiple (up to three) Kinect cameras. As already mentioned in the introduction, the fact that the Kinect camera is an active sensor which emits light in the infrared domain might affect the matching quality when multiple projection patterns overlap. Since multi-view reconstruction is a common method in computer vision (e.g., to deal with occlusions), it is of interest to measure these effects of interference.

In order to evaluate the quality of extracted texlets, they need to be extracted from a known object. A test setup was made with three Kinect sensors in a triangular arrangement observing a rectangular box with known side lengths, as shown in Fig. 10a. Each texlet extracted from the scene has a relation \((n_{(i,j)}, \theta)\) to any other texlet in the scene. \(n_{(i,j)}\) is the normal distance (for parallel planes the normal distance corresponds to the distance between the planes) from texlet i to texlet j, i.e., the orthogonal projection of the center of one plane onto the other plane (see Fig. 9). θ is the angular difference between the orientations. The computed relations are expected to agree with the constraints shown in Table 2.Footnote 10 These correspond to the centers of the green areas in Fig. 10b: the two peaks at the top correspond to the two side widths of the object (texlets being parallel but normal vectors pointing in opposite directions), the left bottom peak corresponds to all texlets lying in the same plane, and the bottom right peak corresponds to the relation between the table and the top surface (texlets being parallel and pointing in the same direction). The horizontal line at a 90° angle corresponds to all 90° angles between texlet relations on different non-parallel surfaces and naturally ends at a normal distance corresponding to the height of the object.
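The relation between two texlets can be computed as sketched below (a minimal sketch assuming unit normals; here the normal distance is taken as the distance from the center of texlet i to the plane of texlet j):

```python
import numpy as np

def texlet_relation(P_i, N_i, P_j, N_j):
    """Return (normal distance, angle in degrees) between texlets i and j."""
    n_ij = abs(np.dot(P_i - P_j, N_j))                 # orthogonal projection onto plane j
    cos_theta = np.clip(np.dot(N_i, N_j), -1.0, 1.0)
    theta = np.degrees(np.arccos(cos_theta))
    return n_ij, theta
```

For two parallel texlets with normals pointing in opposite directions this yields an angle near 180°, matching the two peaks at the top of Fig. 10b, while coplanar texlets with aligned normals yield values near (0 mm, 0°).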

Fig. 9
figure 9

Texlet relation \((n_{(i,j)}, \theta)\) (normal distance and angle) between texlet i and texlet j

Fig. 10
figure 10

a Known object used in test setup. b Bounding boxes used to classify inliers

Table 2 Texlet relation constraints in test setup

The normal distances and angles from each texlet to all other extracted texlets are plotted as 2D histograms. This test was conducted with three Kinects calibrated to the same reference frame. First, each Kinect was turned on individually in order to get a sample without interference. This can be seen for neighborhood radii 2, 6 and 10 in Fig. 11a–c. For visualization purposes, logarithmic values of the histogram bins are plotted. As expected, the larger the radius of the texlets, the better the extraction quality and hence the more histogram entries are positioned in the expected areas indicated in Fig. 10b. The inlier percentages can also be seen in Table 1. Figure 11d–f shows the same diagrams but with an additional Kinect turned on to generate noise. The last row, Fig. 11g–i, is obtained with all three Kinects turned on at the same time. The degradation of the reconstruction precision is clearly visible in the larger variance when multiple Kinect cameras are used at the same time.

Fig. 11
figure 11

Quality histograms. a, b, and c Each Kinect individually (neighborhood radius 2, 6 and 10). d, e and f Two Kinects at the same time (neighborhood radius 2, 6 and 10). g, h and i Three Kinects at the same time (neighborhood radius 2, 6 and 10)

In order to quantitatively compare the measurements, the threshold bounding boxes shown in Fig. 10b are applied. All relations between texlets within the bounding boxes shown in Fig. 10 (thresholds of 10° and 5 mm) are defined as inlier relations. The inlier relations counted from scenarios with one, two and three simultaneously enabled Kinect cameras can be seen in Fig. 13. The number of inliers is shown as a function of the size of the texlet (neighborhood radius). It shows clearly that using multiple Kinects simultaneously degrades the overall quality of 3D texlets. The lower inlier count when using two or more Kinects is expected due to the overlap and interference of the infrared patterns. Accordingly, more Kinects will add more interference. This interference is also evident when looking at the raw depth maps returned from the Kinects in Fig. 12.

Fig. 12
figure 12

Interference from multiple Kinects shown on colored depth map and texlets. a, d No interference. b, e Interference from one additional Kinect. c, f Interference from two additional Kinects. Note that in the top surface the degradation effect is largest since all Kinect cameras project on the top surface while—due to occlusion—only one or two Kinect cameras project on the surfaces on the side

Fig. 13
figure 13

Percentage of inliers with one, two and three Kinects

Figure 11 also indicates an important fact about the complexity of generating an appropriate model for the interference of multiple Kinect cameras: since the number of projections influences the reconstruction precision, the effect of interference depends on the actual 3D geometry of the scene. For example, the top surface and also most of the base plane are subject to the infrared projections of all three cameras. As a consequence, texlets extracted from these areas are particularly affected. This is most visible (as expressed in the high variance) in the peak on the bottom right (see Fig. 11i), which corresponds to the relation between the top surface and the base plate (compare with Fig. 11c). For the side surfaces of the box, the corresponding peaks (the two peaks at the top in Fig. 11i) show a lower increase in variance (compared to Fig. 11c) since these surfaces are only subject to projections by one or at most two Kinect cameras. This conclusion is also in agreement with the work in [2], which states that the reconstruction degradation depends on the relative orientation of the concurrently running Kinects. Hence, any interference model would need to explicitly model the scene geometry and the relative camera-object relations.

5 Conclusion and future work

In this paper, we have described an algorithm for the real-time extraction of 3D surface patches (called 'texlets') with an associated uncertainty model. To achieve high precision, it was necessary to describe the reconstruction process and the associated calibration problems in detail. We pointed out a number of issues that lead to a degradation of reconstruction quality and that need to be taken care of when Kinect cameras are used in contexts requiring high reconstruction precision. This requires in particular an understanding of the geometry behind the reconstruction process realized in Kinect cameras. To derive the uncertainty model for surface patches, we used an approximate model for the uncertainty associated with a point-wise reconstruction with Kinect cameras. Hence, besides reaching the actual goal of real-time texlet extraction with associated uncertainties, we could provide information which will be of general use for the vision community and which will further facilitate applications of Kinect cameras requiring precise 3D reconstruction. This also includes the analysis of the interference effects when using multiple Kinect cameras. Here it would be interesting if new versions of the Kinect cameras allowed for better control of the infrared projector, e.g., switching it on and off quickly and easily, to minimize these effects. All algorithms described in this paper are available via the C++ vision library CoViS.Footnote 11

In Fig. 12 the effect of interference on the reconstruction is shown. When multiple Kinects are introduced, the matching process seems to fail more often, giving empty spots in the depth map as seen in Fig. 12a–c. The interference also propagates to the texlets, giving a visual degradation; this can be seen in Fig. 12d–f. Notice that the degradation is not present on the leftmost visible side of the box: due to the triangular placement of the Kinects, only one pattern is projected onto this surface in all cases. The rightmost visible side of the box suffers from the patterns of up to two Kinects, whereas the top of the box may have all three patterns interfering.

A particular challenge when using Kinect cameras is to get reliable information at depth discontinuities, which is also a problem with ToF cameras [24]. In [22], we have presented a 3D edge descriptor (covering 3D point and orientation information as well as a basic appearance description in terms of phase and color information for the left and right side of the edge) as part of our ECV system which is so far based on stereo information only. Integration of Kinect cameras and stereo information will allow to extend this edge descriptors by associating surface normals on both sides of the edge structure. In this context, we aim at addressing the issue of higher uncertainty at depth discontinuities.