1 Introduction

A virtual experience space provides users with experiences and services in a virtual environment. For example, a virtual reality arcade or a three-dimensional (3D) military training simulator offers unique and interesting experiences by simulating situations that might be impossible in the physical world. However, such simulations require complex, high-cost equipment.

The CAVE [1] suggested a new approach that uses multiple image projectors instead of special equipment. Projecting virtual space images onto multiple screens around a user enables the user to easily experience a virtual environment. However, the method limits direct interaction between the user and the virtual space. A virtual experience space for an e-learning system, consisting of one screen and one camera, was suggested in [6]. Even though this system has some limitations, its devices are easily available; therefore, it can be adopted easily in school and home environments.

One of the difficulties in popularizing a virtual experience space is the initialization process. A virtual experience space inherently needs a calibration step because it uses a coordinate system that differs from that of the physical environment. Lee [6] solves this problem using well-established camera calibration methods such as those of Zhang [9] and Tsai [8]; however, special markers such as a checkerboard must be positioned in the physical environment to match the reference points in the virtual space. This requirement can be cumbersome for users who are not familiar with these technologies.

The Kinect sensor, which captures 3D space information, is a popular product owing to its affordability. In this paper, a new automatic registration method is proposed that uses 3D space information captured by Kinect to replace the traditional initialization process of virtual experience systems. Some studies have focused on calibration methods using Kinect [3, 10]; however, they still require additional equipment and markers such as a checkerboard. In addition, automatic 3D space information retrieval methods such as those described in [4] and [7] are not computationally efficient. By contrast, our method does not require markers, and it performs the registration in real time.

We begin in Section 2 with an introduction to the virtual experience space and its registration. We present our approach in Section 3. The experimental results and a discussion of future work are presented in Section 4, and Section 5 concludes the paper.

2 Virtual experience space and registration

2.1 Virtual experience space

This paper aims to develop a technology for spreading virtual experience spaces with a simple setup, as shown in Fig. 1. The system consists of three parts: the display (e.g., a monitor or a combination of a projector and a screen), the audio (e.g., speakers), and the image sensor (e.g., a Kinect). These components are positioned around the action zone where users play. The display and the audio are required for the user experience, and the image sensor acquires data related to the user’s behavior and position.

Fig. 1 Virtual experience space setup

This system provides users with the combination of sounds and images that constitute the virtual space. It also simulates the users’ silhouettes based on the data acquired from Kinect, and it provides functionality for manipulating objects in the virtual space. Thus, users can feel as if they are really inside the virtual space and become more involved in the prepared situation.

2.2 Registration

Registration is the process of overlaying the real space with the virtual space in order to create a correspondence between points in real space and points in virtual space. This can be resolved by making the coordinate systems of the real camera and the virtual camera consistent with each other. The intrinsic and extrinsic parameters of the cameras can be calculated and then used to augment real objects into the virtual space and vice versa.

Using a common camera model, the coordinates of a point in the real world are first converted into camera-centered three-dimensional coordinates using the extrinsic parameter, and the converted point is then projected onto two-dimensional image coordinates using the intrinsic parameter of the camera.
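For reference, this two-step conversion is the standard pinhole projection model: writing K for the intrinsic matrix introduced in Section 3.1 and [R | t] for the extrinsic matrix of Section 3.2, a world point (X, Y, Z) maps to an image point (u, v) up to a scale factor s as

$$ s\left[\begin{array}{c} u \\ v \\ 1 \end{array}\right]=K\left[R\mid t\right]\left[\begin{array}{c} X \\ Y \\ Z \\ 1 \end{array}\right] $$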

2.3 Coordinate system

In order for users to have a new experience through interaction with virtual objects, the virtual space needs to be authored in advance to encourage the users to perform tasks, such as moving virtual objects. When authoring the virtual space, a standard coordinate system is necessary to deploy objects in the virtual space, and this coordinate system needs to reflect the real-world coordinate system where the authors or users stand.

In a camera-centered coordinate system, if the camera is positioned parallel to the floor, actions performed in the real world can be simulated in the virtual space simply by translating the movement. However, if the camera is not parallel to the floor, authors will have difficulty in translating the movement. In practice, when a camera cannot capture the entire body of the user, it is usually raised a little and tilted downward. The more tilted the camera, the greater the difference between the camera-centered coordinate system and the coordinate system of the user.

Thus, a coordinate system that is orthogonal to the floor in the real world is ideal for both authoring and registration.

2.4 Camera assumption

This paper assumes that the sensor is located at the front center of the action zone and is tilted downward to capture the entire body of a user, as in Fig. 2b. The action zone is the area where a user stands and moves their whole body to interact with the virtual environment. The world coordinates and camera coordinates are shown in Fig. 2b, c, d.

Fig. 2 a Example of coordinate origin (red cross) and expected action range (green). b Camera and action zone coordinates. c Camera and action zone (x-z plane). d Camera and action zone (y-z plane)

The origin of the real-world coordinates is assumed to correspond to the point in the 2D image that lies at the center of the x-axis and at the mass center of the valid-distance pixels along the y-axis (Fig. 2a). Section 3 discusses how the 3D position of this point in the real world is determined. The extrinsic parameter can then be calculated from the normal direction of the base plane and the position of the coordinate origin, depending on the application.

The other assumption is that the color image and the depth image are pre-aligned using other methods, such as NuiImageGetColorPixelCoordinateFrameFromDepthPixelFrame() from the MS Kinect SDK (http://www.microsoft.com/en-us/kinectforwindows/).

3 Automatic registration

3.1 Intrinsic parameter

Traditional methods require acquiring the positions of corresponding point pairs between the real world and the image to calculate a camera’s intrinsic parameter. To obtain the positions in the camera image, a prepared marker or a pattern panel is used to automatically or manually determine the corresponding locations in the real world. Each camera has its own parameters, even among cameras of the same model, although the difference is not significant. For automation, we use a simple camera model and parameters from the manufacturer.

Ignoring the radial distortion, the intrinsic parameter model is as follows:

$$ \left[\begin{array}{ccc} {F}_x & S & {C}_x \\ 0 & {F}_y & {C}_y \\ 0 & 0 & 1 \end{array}\right] $$
(1)

where the skew parameter S is assumed to be 0 and (C_x, C_y) is the center of the image.

The manufacturer of Kinect provides the camera parameters as shown in Table 1.

Table 1 Intrinsic related parameters from Kinect SDK

These values are based on a 640 × 480 image resolution, and they represent the FOV in the horizontal and vertical directions.

Horizontal FOV (FOVh) and Vertical FOV (FOVv) are defined as follows:

$$ {\mathrm{FOV}}_{\mathrm{h}}=2\,\mathrm{atan}\left(\frac{\mathit{Width}}{2{F}_x}\right) $$
(2)
$$ {\mathrm{FOV}}_{\mathrm{v}}=2\,\mathrm{atan}\left(\frac{\mathit{Height}}{2{F}_y}\right) $$
(3)

Rearranging (2) and (3),

$$ {F}_x=\frac{\mathit{Width}}{2 \tan \left(\frac{{\mathrm{FOV}}_{\mathrm{h}}}{2}\right)} $$
(4)
$$ {F}_y=\frac{\mathit{Height}}{2 \tan \left(\frac{{\mathrm{FOV}}_{\mathrm{v}}}{2}\right)} $$
(5)

Using the values in Table 1 and Eqs. (4) and (5), F_x = 532.569434 and F_y = 531.541079 are obtained.
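As a numerical sanity check, the focal lengths can be reproduced with a few lines of Python. The horizontal and vertical FOV values used below (62° and 48.6°) are assumed nominal Kinect color-camera values that reproduce the reported F_x and F_y; in practice they should be read from Table 1 / the SDK.

```python
import math

# Assumed nominal Kinect color-camera field-of-view values (degrees);
# in practice these should be taken from the Kinect SDK (Table 1).
FOV_H_DEG = 62.0
FOV_V_DEG = 48.6
WIDTH, HEIGHT = 640, 480  # image resolution assumed in the paper

def focal_from_fov(size_px: float, fov_deg: float) -> float:
    """Eqs. (4)/(5): focal length in pixels from image size and field of view."""
    return size_px / (2.0 * math.tan(math.radians(fov_deg) / 2.0))

fx = focal_from_fov(WIDTH, FOV_H_DEG)   # ~532.57
fy = focal_from_fov(HEIGHT, FOV_V_DEG)  # ~531.54

# Intrinsic matrix of Eq. (6): zero skew, principal point at the image center.
K = [[fx, 0.0, WIDTH / 2.0],
     [0.0, fy, HEIGHT / 2.0],
     [0.0, 0.0, 1.0]]
```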

Fig. 3 Origin of the floor plane

Therefore, the Kinect intrinsic parameter is as follows:

$$ \left[\begin{array}{ccc} 532.569434 & 0.000000 & 320.000000 \\ 0.000000 & 531.541079 & 240.000000 \\ 0.000000 & 0.000000 & 1.000000 \end{array}\right] $$
(6)

3.2 Extrinsic parameter

The extrinsic parameter represents the relation between the camera and the real world. It consists of a 3-DOF translation and a 3-DOF rotation, which can be obtained by calculating the position of the world coordinate origin and the angle of the camera’s rotation relative to the floor plane.

3.2.1 Rotation calculation

Using Kinect, the distance from the camera and the camera-centered coordinates of each pixel in the 2D image can be obtained. The rotation part of the extrinsic parameter is obtained by calculating the angle between the up-vector of the camera and the normal vector of the floor, which leaves one degree of freedom about the normal direction. This remaining degree of freedom is compensated by applying the inverse rotation about the normal direction, under the assumption that the camera looks toward the center of the action zone (Section 2.4). The up-vector denotes the upward direction of the camera and fixes the camera orientation by resolving the roll. It is used here because it closely resembles the normal direction of the floor. In the coordinates of Fig. 2b-d, this vector has the value (0, 1, 0).

Because the camera looks downward, it captures the entire body of the user, and the floor is assumed to be the dominant plane. Therefore, the normal direction of the floor plane is the dominant normal direction among the image pixels. Algorithms 1, 2, and 3 calculate the rotation part of the extrinsic parameter:

Algorithm 1 (pseudocode figure)
Algorithm 2 (pseudocode figure)
Algorithm 3 (pseudocode figure)

Get_Camera_Centered_Position(i, j) is from the Kinect SDK; Convert_AxisAngle_To_Matrix and Convert_Matrix_To_Euclidean are well-known methods that convert between rotation representations. These methods are discussed in detail at http://www.euclideanspace.com/maths/geometry/rotations/conversions/index.htm.

The normal directions of the pixels are calculated by Algorithm 2 and Algorithm 3. The goal is to obtain the overall direction of the whole image; therefore, only a representative subset of the pixels needs to be considered. For uniform sampling, the algorithm skips pixels by a predefined step, whose effect is evaluated in Section 4.

The normal direction of one pixel is defined as the average of the cross products of the vectors of each vertically positioned pair of nearby pixels (see (12) and (13)). This result is less accurate but faster than more elaborate methods, and it is sufficient because the algorithm aims to obtain only one representative value.
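A minimal sketch of this rotation estimation is given below in Python with NumPy rather than the paper’s pseudocode. The per-pixel normal (from difference vectors of neighboring samples), the outlier test (interpreted here as a cosine threshold against the dominant direction), and the seeding of the dominant direction are simplifications and assumptions, not the authors’ exact Algorithms 1-3.

```python
import numpy as np

def estimate_floor_rotation(points, step=10, outlier_thresh=0.8):
    """Estimate the rotation aligning the camera up-vector with the floor normal.

    points: H x W x 3 array of camera-centered 3D coordinates (e.g., from the
            Kinect depth map); invalid pixels may be NaN.
    step: sampling step in pixels (see Section 4.2).
    outlier_thresh: cosine threshold for rejecting normals that deviate from
            the dominant direction (see Section 4.1); assumed semantics.
    Returns (R, n): a 3x3 rotation matrix and the unit floor normal.
    """
    h, w, _ = points.shape
    normals = []
    for i in range(step, h - step, step):
        for j in range(step, w - step, step):
            # Difference vectors between horizontally / vertically neighboring samples.
            dx = points[i, j + step] - points[i, j - step]
            dy = points[i + step, j] - points[i - step, j]
            n = np.cross(dx, dy)
            length = np.linalg.norm(n)
            if not np.isfinite(length) or length < 1e-6:
                continue  # skip invalid depth samples
            normals.append(n / length)
    normals = np.asarray(normals)

    # Orient all normals into one half-space, average them, and drop outliers
    # whose cosine against the dominant direction falls below the threshold.
    seed = normals[np.argmax(np.abs(normals[:, 1]))]  # assume the floor normal is roughly "up"
    normals[normals @ seed < 0.0] *= -1.0
    dominant = normals.mean(axis=0)
    dominant /= np.linalg.norm(dominant)
    inliers = normals[normals @ dominant > outlier_thresh]
    n = inliers.mean(axis=0)
    n /= np.linalg.norm(n)

    # Axis-angle (Rodrigues) rotation taking the camera up-vector (0, 1, 0) onto n.
    up = np.array([0.0, 1.0, 0.0])
    axis = np.cross(up, n)
    s, c = np.linalg.norm(axis), float(up @ n)
    if s < 1e-8:
        return np.eye(3), n  # camera already level
    axis /= s
    A = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + s * A + (1.0 - c) * (A @ A)
    return R, n
```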

3.2.2 Translation calculation

The point of origin, which is needed for the translation part, is based on the assumptions described in Section 2.4. The x-axis value is fixed and the y-axis value is calculated in Algorithm 2 at (9).

This origin point in the 2D image is denoted as p_2D and its 3D position from the Kinect sensor as p_3D. If a user or other objects are located at this position, then p_3D is not the origin of the floor plane (P1 in Fig. 3). Therefore, the intersection of the floor plane and the line between the camera center and p_3D has to be calculated (P2 in Fig. 3).

A plane is represented by a 4D vector π(a, b, c, d). The normal direction obtained in the previous algorithm gives (a, b, c). Because these three values are fixed, the plane that includes p(x, y, z) is defined as (a, b, c, −ax − by − cz). If q(x′, y′, z′) is on this plane, then ax′ + by′ + cz′ + d = 0. For every pixel whose normal has the same direction as the representative normal direction, the corresponding d is calculated and the number of other points lying on that plane is counted. The plane with the most supporting pixels is taken as the floor plane.
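This plane-voting step can be sketched as follows; the cosine threshold for “same direction” and the distance tolerance for counting points on a candidate plane are assumptions, since the paper does not state them explicitly.

```python
import numpy as np

def fit_floor_plane(points_3d, normal, pixel_normals,
                    angle_thresh=0.95, dist_thresh=0.05):
    """Choose the offset d of the floor plane pi = (a, b, c, d).

    points_3d:     N x 3 camera-centered points of the sampled pixels.
    normal:        unit floor normal (a, b, c) from the previous step.
    pixel_normals: N x 3 per-pixel normals.
    angle_thresh:  cosine threshold for "same direction as the floor normal" (assumed).
    dist_thresh:   tolerance (in metres) for a point to count as lying on a
                   candidate plane (assumed).
    Returns the plane (a, b, c, d) with the most supporting points.
    """
    normal = np.asarray(normal, dtype=float)
    aligned = points_3d[pixel_normals @ normal > angle_thresh]
    best_d, best_count = 0.0, -1
    for p in aligned:
        d = -float(normal @ p)  # plane (a, b, c, -ax - by - cz) through p
        # Count how many aligned points satisfy a x' + b y' + c z' + d ~ 0.
        count = int(np.sum(np.abs(aligned @ normal + d) < dist_thresh))
        if count > best_count:
            best_d, best_count = d, count
    return np.array([normal[0], normal[1], normal[2], best_d])
```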

In homogeneous representation, the line between the camera center C(0, 0, 0, 1) and (p_3D | 1) is given by L = C · p_3D^T − p_3D · C^T, and the point where L intersects the plane π is Lπ [2]. Using the above values, the resulting homogeneous point is (p_3D.x × d, p_3D.y × d, p_3D.z × d, − π ⋅ p_3D). Therefore, the origin O is defined as:

$$ \mathrm{O}\left(\frac{{\mathbf{p}}_{3\mathrm{D}}.x\times d}{-\pi \cdot {\mathbf{p}}_{3\mathrm{D}}},\ \frac{{\mathbf{p}}_{3\mathrm{D}}.y\times d}{-\pi \cdot {\mathbf{p}}_{3\mathrm{D}}},\ \frac{{\mathbf{p}}_{3\mathrm{D}}.z\times d}{-\pi \cdot {\mathbf{p}}_{3\mathrm{D}}}\right) $$
(14)
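Equivalently, (14) is the intersection of the camera-centre ray through p_3D with the floor plane; a minimal sketch of the same computation, with the homogeneous coordinate divided out:

```python
import numpy as np

def floor_origin(p3d, plane):
    """Intersect the ray from the camera centre through p3d with the floor plane.

    p3d:   3D point behind the assumed origin pixel, in camera coordinates.
    plane: (a, b, c, d) floor plane, with (a, b, c) the unit normal.
    Returns the world origin O of Eq. (14), in camera coordinates.
    """
    p3d = np.asarray(p3d, dtype=float)
    n, d = np.asarray(plane[:3], dtype=float), float(plane[3])
    t = -d / float(n @ p3d)  # the ray X = t * p3d meets n.X + d = 0 at this t
    return t * p3d
```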

3.2.3 Extrinsic parameter form

Generally, the extrinsic parameter is represented as a 3 × 4 matrix, which is the combination of a 3 × 3 rotation matrix and a 3 × 1 translation vector. This form uses the 3 × 3 rotation matrix defined in Section 3.2.1 and the 3 × 1 translation vector obtained from (14) in Section 3.2.2.
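Explicitly, the resulting matrix has the form

$$ \left[R\mid t\right]=\left[\begin{array}{cccc} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{array}\right] $$

where r_ij are the entries of the rotation matrix from Section 3.2.1 and (t_x, t_y, t_z) is taken from (14).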

4 Experiment and analysis

A point in world coordinates is converted to a camera-centered position using the extrinsic parameter, and the result is converted to a position on the 2D image using the intrinsic parameter. Users can see the synthesized scene through this process, and gesture recognition technologies help the users sense the virtual reality by giving them tasks in which they control virtual objects.
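A minimal sketch of this world-to-image conversion, assuming K is the intrinsic matrix of (6) and Rt is the 3 × 4 extrinsic matrix of Section 3.2.3 (both as NumPy arrays):

```python
import numpy as np

def world_to_pixel(X_world, K, Rt):
    """Project a 3D world point into 2D image coordinates.

    X_world: (3,) point in the floor-anchored world coordinates.
    K:       3 x 3 intrinsic matrix, e.g. Eq. (6).
    Rt:      3 x 4 extrinsic matrix [R | t] from Section 3.2.3.
    """
    X_h = np.append(np.asarray(X_world, dtype=float), 1.0)  # homogeneous world point
    x_cam = Rt @ X_h          # camera-centered coordinates
    u, v, w = K @ x_cam       # homogeneous image coordinates
    return u / w, v / w
```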

In Section 3, four parameters were introduced: 1) the threshold value (7) for outlier removal in Algorithm 2; 2) the step size (8) for sampling image pixels; 3) the neighbor model (10) for Algorithm 3 (i.e., 4-neighbor or 8-neighbor); and 4) the distance to the near pixel for calculating the normal direction (11). The machine used for the experiments has an Intel® Core™ i5-2500 @ 3.30 GHz and 4 GB of memory, and the experiment was conducted at the location shown in Fig. 4. The overall configuration and the equipment setup are described in Figs. 1 and 2. Figure 4 shows the registration result (green grid) and the normal directions of the sampling points (red lines). The green circles are points included in the normal plane calculation, and the blue circles are outliers.

Fig. 4 Experiment location and registration result

4.1 Outlier threshold to calculate normal vector

To clearly illustrate the performance changes, the sampling step is set to 5 pixels. The graph in Fig. 5 shows the effect of the outlier threshold value (x-axis) on the rotation part of the extrinsic parameter and on the time and accuracy performance. The blue line indicates the execution time (in ms; time units on the left side of the graph), while the red and green lines show the rotation differences Axis-Diff and Angle-Diff (both in degrees; units on the right side of the graph). Axis-Diff and Angle-Diff are computed from the axis-angle representations of the rotation matrices produced by the Tsai method and the proposed method.
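For reproducibility, one way to compute these two measures is sketched below; this reflects our reading of Axis-Diff (angle between the two rotation axes) and Angle-Diff (absolute difference of the two rotation angles), not code from the paper.

```python
import numpy as np

def axis_angle(R):
    """Convert a 3x3 rotation matrix to (unit axis, angle in degrees)."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]])
    norm = np.linalg.norm(axis)
    # For a near-identity rotation the axis is undefined; fall back to "up".
    axis = axis / norm if norm > 1e-8 else np.array([0.0, 1.0, 0.0])
    return axis, np.degrees(angle)

def rotation_diff(R_ref, R_est):
    """Axis-Diff and Angle-Diff (degrees) between two rotation matrices."""
    a1, th1 = axis_angle(R_ref)
    a2, th2 = axis_angle(R_est)
    axis_diff = np.degrees(np.arccos(np.clip(abs(a1 @ a2), -1.0, 1.0)))
    angle_diff = abs(th1 - th2)
    return axis_diff, angle_diff
```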

Fig. 5 Performance according to outlier threshold value

The execution time ranges from 30 ms to 57 ms. With smaller outlier threshold values, more noise is included when the dominant plane is calculated; therefore, execution takes more time and errors increase. Based on this result, an outlier threshold of 0.8 is used in the other experiments.

4.2 Step value

The sampling step size that is used to calculate the normal direction of the floor plane also affects the execution time and the comparative difference from the Tsai results, as shown in Fig. 6.

Fig. 6 Performance according to step value

The sampling step size is the most influential parameter. The graph in Fig. 6 shows that the rotation accuracy is not affected by the step size, whereas the execution time drops dramatically as the step increases. Considering the standard deviation of the accuracy, a sampling step size of 10 pixels is acceptable, giving an average execution time of 6.27 ms.

The specifications of the machine, the configuration of the installation, and the content and other processes needed to run the scenario all affect the performance; therefore, tuning these parameters could produce better results.

4.3 4-neighbor vs. 8-neighbor

Table 2 shows the changes in time and accuracy (with the sampling step size set to 5 pixels) for the 4-neighbor and 8-neighbor models used to calculate the rotation in Algorithm 3, as described in Section 3.2.1. The result shows that this variable does not significantly affect the performance.

Table 2 Performance according to 4-neighbor and 8-neighbor

4.4 Distance of near pixels for normal calculation

Setting the outlier threshold to 0.8, the sampling step size to 5 pixels, and the neighbor model to 8-neighbor, the performance as a function of the near-pixel distance is shown in Table 3. Like the neighbor model, this variable does not significantly affect the performance.

Table 3 Performance according to the near pixel distance for calculating the normal direction

4.5 Time

The average time over 100 executions ranges from 2 ms to 40 ms, depending on the step value used for sampling. If the step is set to 10 pixels, the average execution time is 6 ms. This result shows that the registration process can be executed for each frame within a reasonable time.

4.6 Comparison of the camera parameter analysis with the Tsai method

As described in Section 3.1, this paper uses a simple camera model similar to the one used by Tsai [8]. Several points are marked on the floor, and their positions in a simulated scene are entered manually. The focal length difference is about 40 mm, and the rotation difference is about 2°.

4.7 Application

Registration makes it possible for a user to interact with the objects inside the virtual world. To demonstrate this, an application was developed (Fig. 7). This application allows a user to manipulate virtual objects by selection, translation, and rotation; it is therefore suited for educational or entertainment scenarios.

Fig. 7 A manipulation application: object selection, translation, and rotation

To implement this application, a set of manipulation gestures was designed. Touching an object with one hand for 1 s toggles selection and release. After selection, the object follows the hand, and the system switches from translation mode to rotation mode when a two-hand push gesture is held for 1 s. In rotation mode, the user rotates the selected object with both hands.
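A purely illustrative sketch of the mode transitions behind these gestures is given below; the state names, timer inputs, and the reuse of the 1-s dwell constant are hypothetical and not taken from the paper’s implementation.

```python
from enum import Enum, auto

class Mode(Enum):
    IDLE = auto()         # no object selected
    TRANSLATION = auto()  # selected object follows one hand
    ROTATION = auto()     # two hands rotate the selected object

HOLD_SECONDS = 1.0  # dwell time used for selection / mode-switch gestures

def next_mode(mode, one_hand_touch_s, two_hand_push_s):
    """Advance the interaction mode from per-frame gesture timers.

    one_hand_touch_s: seconds one hand has been touching an object (or the
                      selected object, for release).
    two_hand_push_s:  seconds both hands have been pushing forward.
    """
    if mode is Mode.IDLE and one_hand_touch_s >= HOLD_SECONDS:
        return Mode.TRANSLATION          # select: the object now follows the hand
    if mode is Mode.TRANSLATION:
        if two_hand_push_s >= HOLD_SECONDS:
            return Mode.ROTATION         # switch to two-handed rotation
        if one_hand_touch_s >= HOLD_SECONDS:
            return Mode.IDLE             # touch again for 1 s to release
    return mode
```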

4.8 Additional analysis and future work

This paper makes several assumptions that could cause problems. First, because the floor plane is assumed to be the most dominant plane, the registration can fail when two or more planes occupy similar portions of the image or when a wall is more dominant than the floor. To offset this problem, the camera can be repositioned and adjusted to point at the front center of the action zone while ensuring that the floor covers more than half of the captured image. In addition, other techniques could be developed that exploit assumptions about the possible camera rotation angle or other prior information.

The reliability of a 3D point cloud from Kinect needs to be analyzed and applied to improve the accuracy of this registration. The materials of objects could also influence the result of the Kinect sensor [5].

5 Conclusion

This paper aims to create a virtual experience space with less cost and effort so that users can experience new environments and situations. To achieve this goal, a fully automated registration method using a Kinect sensor was developed. Although it is slightly less accurate, it does not need any additional equipment, such as a calibration panel, and it is fully automated and fast: the algorithm takes about 6 ms of execution time with a rotation error of approximately 2°.

With this algorithm, scenarios in which users control virtual objects against a real-world background become possible, such as the application shown in Section 4.7. Furthermore, scenarios that use virtual-world backgrounds and segmented user images could be improved with advanced techniques, such as user image boundary enhancement using the aligned 3D and 2D information.