
1 Introduction

In recent years, with the continuous development of technologies such as digitization, visualization and mixed reality, audiences have come to expect more from real-time stage performances. In traditional stage performances that incorporate mixed reality technology, the audience's experience and satisfaction derive mainly from the interaction between actors and virtual props [1]. However, the technology for superimposing virtual elements onto the exact position of a real scene by computer is still immature and requires manual assistance. Taking stage performances of ancient Chinese mythology as a practical example, these mythological settings call for special effects that are usually produced through the interaction of real actors and props. Among the current means of producing such effects, one is to make actors "fly" using mechanical rigging, and another mainstream method has actors interact with virtual images on a background screen [2]. Both approaches suffer from low dynamic response resolution, insufficient degrees of freedom and poor process controllability. To address these shortcomings, the authors developed an entity mixed reality interactive system for occasions such as stage performances.

In this system, the experiment is conducted between the experiencer and a controllable four-axis unmanned aerial vehicle (UAV), onto which physical items can be mounted or from which they can be hung. The system mainly aims to accomplish the following function: under different mode settings, the drone carrying physical props moves along a certain path as the entity moves. Here the entity is a hand, but it is not limited to this body part. To achieve this function, the system is divided into the following sections:

  I.

    An infrared binocular camera detects the infrared light sources deployed on the experimenter's hand and on the drone. The Camshift algorithm is used to obtain the 2D coordinates of the two moving targets in each frame of the video stream in real time.

  II.

    The binocular camera is calibrated to obtain the stereo space in which the moving targets are located, and the 2D coordinates \( (x,y) \) are converted into real-time 3D coordinates \( (x,y,z) \).

  III.

    Through the movement of the experimenter's hand, the real-time coordinates \( P_{hand} = (x_{hand} ,y_{hand} ,z_{hand} ) \) are obtained. According to the preset flight-path relationship \( P_{UAV} = f(P_{hand} ) \) between the two targets, the system combines flight control commands with the Ant Colony Algorithm to adjust the drone to \( P_{UAV} = (x_{UAV} ,y_{UAV} ,z_{UAV} ) \) in real time (a minimal sketch of such a mapping follows this list).
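The Python sketch below illustrates one possible form of the mapping \( f \); the fixed vertical offset and the function name are purely illustrative assumptions, since the actual relation is configured per performance mode.

```python
import numpy as np

# Hypothetical example of a preset path relation P_UAV = f(P_hand):
# the drone simply mirrors the hand position with a fixed vertical offset.
OFFSET = np.array([0.0, 0.0, 0.5])   # assumed 0.5 m above the hand

def f(p_hand: np.ndarray) -> np.ndarray:
    """Map a hand position (x, y, z) to a target UAV position."""
    return p_hand + OFFSET
```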

Here we give the overall design framework of the system and elaborate on its theoretical and engineering implementation details; Fig. 1 illustrates the framework.

Fig. 1. System overall flow chart.

2 System Design

2.1 Camshift Moving Target Detection Algorithm

In our system, the 2D coordinate tracking of the two infrared moving target points (the hand and the drone) uses the Camshift algorithm. Camshift uses the target's color histogram model to convert the image into a color probability distribution map. It initializes the size and position of a search window and then adaptively adjusts the window's position and size based on the result obtained in the previous frame, thereby locating the center of the target in the current image [3]. The system realizes 2D coordinate tracking mainly through the following three processes.

  I.
    (1)

      To reduce the sensitivity of the image to illumination, we convert it from RGB space to HSV space.

    (2)

      By calculating the histogram of the H component, the probability (or number) of pixels corresponding to each H value is found, which gives the color probability look-up table.

    (3)

      A color probability distribution map is obtained by replacing the value of each pixel in the image with the probability with which its color appears. This is in fact a back-projection process, and the color probability distribution map is a grayscale image.

  II.

    The second process of the Camshift algorithm uses meanshift as its kernel. The meanshift algorithm is a non-parametric method for density-function gradient estimation. It detects the target by iteratively finding the extremum of the probability distribution. We use the following flow chart to represent this process.

  III.

    The Camshift algorithm extends meanshift to a continuous image sequence [4]. It performs a meanshift operation on every frame and uses the result of the previous frame, i.e., the size and center of the search window, as the initial search window for the next frame. With this iteration, we can track the 2D coordinates of the experimenter's hand and the drone in the image. This process integrates processes I and II.

In the engineering implementation of the above algorithm, we developed MFC-based PC software, using the Camshift function in OpenCV as the kernel, to display the 2D coordinates of the two moving targets in real time.
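As an illustration, the following Python sketch shows how the tracking loop described above could be realized with the CamShift implementation in OpenCV; the actual system is MFC/C++ software, and the camera index and initial search window used here are assumed values.

```python
import cv2

# Minimal CamShift tracking sketch (assumed camera index and initial window).
cap = cv2.VideoCapture(0)              # hypothetical camera index
ok, frame = cap.read()
x, y, w, h = 300, 200, 40, 40          # assumed initial search window around the marker

roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
# Histogram of the H component only, as in step I.(2)
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

track_window = (x, y, w, h)
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Back projection: grayscale colour-probability map, as in step I.(3)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # CamShift adapts the window position/size from the previous frame, as in step III
    rot_rect, track_window = cv2.CamShift(back_proj, track_window, term_crit)
    cx, cy = rot_rect[0]               # 2D coordinates (u, v) of the target centre
```

In the real system, one such loop would run per infrared target (hand and drone), and the resulting 2D coordinates would be passed to the 3D reconstruction module.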

2.2 3D Reconstruction

In order to obtain the 3D coordinates of the hand and the drone in the world coordinate system, we carried out a 3D reconstruction experiment based on camera calibration. Before computing 3D coordinates, we need to understand the four coordinate systems involved and the geometric relationships between them. First, we introduce 3D reconstruction based on a binocular camera.

  I.

    Regarding the calibration of the camera, we should first understand the four coordinate systems.

    (1)

      Pixel coordinate system. The Cartesian coordinate system u-v is defined on the image, and the coordinates \( (u,v) \) of each pixel are the column and row numbers of that pixel in the image array. Therefore, \( (u,v) \) are the image coordinates in pixels, which are also the \( (x,y) \) values obtained by the Camshift algorithm in the previous system module.

    (2)

      Retinal coordinate system. Since the image coordinate system only indicates the column and row numbers of pixels in the digital image [5] and does not express the physical position of a pixel, it is necessary to establish a retinal coordinate system x-y expressed in physical units (for example, centimeters). We use \( (x,y) \) to denote coordinates in the retinal coordinate system measured in physical units. In the x-y coordinate system, the origin \( O_{1} \) is defined at the intersection of the camera's optical axis and the image plane and is called the principal point of the image [6]. This point generally lies at the center of the image, but may deviate slightly due to camera manufacturing. \( O_{1} \) has coordinates \( (u_{0} ,v_{0} ) \) in the u-v coordinate system, and the physical size of each pixel in the x-axis and y-axis directions is \( dx \), \( dy \). The relationship between the two coordinate systems is as follows:

      $$ \left[ {\begin{array}{*{20}c} u \\ v \\ 1 \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {{1 \mathord{\left/ {\vphantom {1 {dx}}} \right. \kern-0pt} {dx}}} & {s^{\prime}} & {u_{0} } \\ 0 & {{1 \mathord{\left/ {\vphantom {1 {dy}}} \right. \kern-0pt} {dy}}} & {v_{0} } \\ 0 & 0 & 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} x \\ y \\ 1 \\ \end{array} } \right] $$

      where \( s^{\prime} \) represents the skew factor because the camera retinal coordinate system axes are not orthogonal to each other.

    (3)

      World coordinate system. The relationship between the camera coordinate system and the world coordinate system can be described by the rotation matrix R and the translation vector t. The homogeneous coordinates of a point P in space in the world coordinate system and the camera coordinate system are \( (X_{w} ,Y_{w} ,Z_{w} ,1)^{T} \) and \( (X_{c} ,Y_{c} ,Z_{c} ,1)^{T} \), respectively, and the following relationship holds:

      $$ \left[ {\begin{array}{*{20}c} {X_{c} } \\ {Y_{c} } \\ {Z_{c} } \\ 1 \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} R & t \\ {0^{T} } & 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {X_{w} } \\ {Y_{w} } \\ {Z_{w} } \\ 1 \\ \end{array} } \right] = M_{1} \left[ {\begin{array}{*{20}c} {X_{w} } \\ {Y_{w} } \\ {Z_{w} } \\ 1 \\ \end{array} } \right] $$

      where R is a \( 3 \times 3 \) orthogonal rotation matrix, t is a 3-dimensional translation vector, \( 0 = (0,0,0)^{T} \), and \( M_{1} \) is the transformation matrix relating the two coordinate systems. There are twelve unknown parameters in \( M_{1} \) to be calibrated. Our method is to randomly select twelve points in space, obtain their coordinates in both the world and camera coordinate systems, and form twelve equations, from which the \( M_{1} \) matrix can be uniquely determined.

    (4)

      Camera linear model. Perspective projection is the most commonly used imaging model and can be approximated by a pinhole imaging model [7]. Its characteristic is that all light from the scene passes through one projection center, which corresponds to the center of the lens. A line passing through the projection center and perpendicular to the image plane is called the projection axis or optical axis. As shown in Fig. 2, \( x_{1} \), \( y_{1} \) and \( z_{1} \) form a rectangular coordinate system fixed to the camera. Following the right-hand rule, the \( X_{c} \) axis and the \( Y_{c} \) axis are parallel to the coordinate axes \( x_{1} \) and \( y_{1} \) of the image plane, and the distance \( OO_{1} \) between the \( X_{c} \)-\( Y_{c} \) plane and the image plane is the camera focal length \( f \). In the actual camera, the image plane lies at distance \( f \) behind the projection center, and the projected image is inverted. To avoid this inversion, a virtual imaging plane \( x^{\prime} \), \( y^{\prime} \), \( z^{\prime} \) is assumed to lie in front of the projection center. The projection position \( (x,y) \) of \( P(X_{c} ,Y_{c} ,Z_{c} ) \) on the image plane is then obtained as the intersection of the line of sight through \( P(X_{c} ,Y_{c} ,Z_{c} ) \) with the virtual imaging plane.

      Fig. 2. Camera model.

The relationship between the camera coordinate system and the retinal coordinate system is:

$$ x = \frac{{fX_{c} }}{{Z_{c} }},y = \frac{{fY_{c} }}{{Z_{c} }} $$

where \( (x,y) \) are the coordinates of point P in the retinal coordinate system, and \( P(X_{c} ,Y_{c} ,Z_{c} ) \) are the coordinates of the space point P in the camera coordinate system; in homogeneous coordinate matrix form this is:

$$ Z_{c} \left[ {\begin{array}{*{20}c} x \\ y \\ 1 \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {X_{c} } \\ {Y_{c} } \\ {Z_{c} } \\ 1 \\ \end{array} } \right] $$

By combining the above equations, we can get the relationship between the image coordinate system and the world coordinate system:

$$ Z_{c} \left[ {\begin{array}{*{20}c} u \\ v \\ 1 \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {{1 \mathord{\left/ {\vphantom {1 {dx}}} \right. \kern-0pt} {dx}}} & {s^{\prime}} & {u_{0} } \\ 0 & {{1 \mathord{\left/ {\vphantom {1 {dy}}} \right. \kern-0pt} {dy}}} & {v_{0} } \\ 0 & 0 & 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} R & t \\ {0^{T} } & 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {X_{w} } \\ {Y_{w} } \\ {Z_{w} } \\ 1 \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {\alpha_{u} } & s & {u_{0} } \\ 0 & {\alpha_{v} } & {v_{0} } \\ 0 & 0 & 1 \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} R & t \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {X_{w} } \\ {Y_{w} } \\ {Z_{w} } \\ 1 \\ \end{array} } \right] = K\left[ {\begin{array}{*{20}c} R & t \\ \end{array} } \right]\tilde{X} = P\tilde{X} $$

where \( \alpha_{u} = \frac{f}{dx} \), \( \alpha_{v} = \frac{f}{dy} \), \( s = s^{\prime}f \). \( \left[ {\begin{array}{*{20}c} R & t \\ \end{array} } \right] \) is completely determined by the orientation and position of the camera relative to the world coordinate system, so it is called the camera external parameter matrix; it consists of the rotation matrix and the translation vector. K is related only to the internal structure of the camera, so it is called the camera internal parameter matrix. \( (u_{0} ,v_{0} ) \) are the coordinates of the principal point, \( \alpha_{u} \) and \( \alpha_{v} \) are the scale factors on the u and v axes of the image, respectively, and s describes the skew between the two image coordinate axes [8]. P is the 3 × 4 projection matrix, i.e., the transformation matrix from the world coordinate system to the image coordinate system. If the internal and external parameters of the camera are known, the projection matrix P can be obtained, and for any spatial point whose three-dimensional world coordinates \( (X_{w} ,Y_{w} ,Z_{w} ) \) are known, its position \( (u,v) \) in the image can be computed. However, even when the projection matrix is known, the spatial coordinates of a point cannot be uniquely determined from its image coordinates \( (u,v) \) alone. In our system, a binocular camera is used to form stereoscopic vision and recover depth information, so that the position of any point in the world coordinate system can be determined.
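The short sketch below illustrates the projection relation \( Z_{c} [u,v,1]^{T} = K\left[ {R\;t} \right]\tilde{X} \) derived above; the intrinsic and extrinsic values are assumed numbers for illustration, not the calibrated parameters of our cameras.

```python
import numpy as np

# Assumed intrinsics: alpha_u, alpha_v, principal point (u0, v0), zero skew.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                          # assumed camera orientation
t = np.array([[0.0], [0.0], [2.0]])    # assumed translation (metres)

X_w = np.array([[0.1], [0.2], [1.0], [1.0]])   # homogeneous world point
P = K @ np.hstack([R, t])                      # 3x4 projection matrix
x = P @ X_w                                    # = Z_c * [u, v, 1]^T
u, v = x[0, 0] / x[2, 0], x[1, 0] / x[2, 0]    # divide by Z_c to get pixel coords
```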

  II.

    3D reconstruction based on the binocular camera. When a person's eyes observe an object, the brain naturally produces a sense of its distance and depth; the effect of generating this perception is called stereo vision. By using a binocular camera to observe the same target from different angles, two images of the target can be acquired at the same time, and the three-dimensional information is recovered from the relative parallax of the target in the two images, thereby realizing stereoscopic positioning.

As shown in Fig. 3, for any point P in space, two cameras \( C_{1} \) and \( C_{2} \) observe point P at the same time. \( O_{1} \) and \( O_{2} \) are the optical centers of the two cameras, and \( P_{1} \), \( P_{2} \) are the imaging pixels of P on the imaging planes of the two cameras. Since the straight lines \( O_{1} P_{1} \) and \( O_{2} P_{2} \) intersect at point P, the point P is unique and its spatial position is determined.

Fig. 3. Binocular vision imaging principle.

In this model, the three-dimensional coordinates of the spatial point P can be solved by the least squares method using the projection transformation matrices.

Assume that, in the world coordinate system, the image coordinates of the spatial point \( P(x,y,z) \) on the imaging planes of the two cameras are \( P_{1} (u_{1} ,v_{1} ) \) and \( P_{2} (u_{2} ,v_{2} ) \). According to the camera pinhole imaging model, we get:

$$ Z_{c1} \left[ {\begin{array}{*{20}c} {u_{1} } \\ {v_{1} } \\ 1 \\ \end{array} } \right] = M_{1} \left[ {\begin{array}{*{20}c} x \\ y \\ z \\ 1 \\ \end{array} } \right],Z_{c2} \left[ {\begin{array}{*{20}c} {u_{2} } \\ {v_{2} } \\ 1 \\ \end{array} } \right] = M_{2} \left[ {\begin{array}{*{20}c} x \\ y \\ z \\ 1 \\ \end{array} } \right],M_{1} = \left[ {\begin{array}{*{20}c} {m_{111} } & {m_{112} } & {m_{113} } & {m_{114} } \\ {m_{121} } & {m_{122} } & {m_{123} } & {m_{124} } \\ {m_{131} } & {m_{132} } & {m_{133} } & {m_{134} } \\ \end{array} } \right],M_{2} = \left[ {\begin{array}{*{20}c} {m_{211} } & {m_{212} } & {m_{213} } & {m_{214} } \\ {m_{221} } & {m_{222} } & {m_{223} } & {m_{224} } \\ {m_{231} } & {m_{232} } & {m_{233} } & {m_{234} } \\ \end{array} } \right] $$

where \( Z_{c1} \) and \( Z_{c2} \) are the \( Z \) coordinates of point P in the left and right camera coordinate systems, and \( M_{1} \) and \( M_{2} \) are the projection matrices of the left and right cameras. These two formulas presuppose that the pixel coordinates \( (u_{1} ,v_{1} ) \) and \( (u_{2} ,v_{2} ) \) of point P in the left and right images have been obtained in advance. Combining the two equations above and eliminating \( Z_{c1} \) and \( Z_{c2} \) gives:

$$ AP = b $$

where:

$$ A = \left[ {\begin{array}{*{20}c} {u_{1} m_{131} - m_{111} } & {u_{1} m_{132} - m_{112} } & {u_{1} m_{133} - m_{113} } \\ {v_{1} m_{131} - m_{121} } & {v_{1} m_{132} - m_{122} } & {v_{1} m_{133} - m_{123} } \\ {u_{2} m_{231} - m_{211} } & {u_{2} m_{232} - m_{212} } & {u_{2} m_{233} - m_{213} } \\ {v_{2} m_{231} - m_{221} } & {v_{2} m_{232} - m_{222} } & {v_{2} m_{233} - m_{223} } \\ \end{array} } \right],P = \left[ {\begin{array}{*{20}c} x & y & z \\ \end{array} } \right]^{T} ,b = \left[ {\begin{array}{*{20}c} {m_{114} - u_{1} m_{134} } \\ {m_{124} - v_{1} m_{134} } \\ {m_{214} - u_{2} m_{234} } \\ {m_{224} - v_{2} m_{234} } \\ \end{array} } \right] $$

According to the least squares method, the three-dimensional coordinates of the spatial point P under the world coordinate system can be obtained as:

$$ P = (A^{T} A)^{ - 1} A^{T} b $$

Therefore, we first use Camshift to find the pixel coordinates \( (u_{hand} ,v_{hand} ) \) of the experimenter's hand in the left and right images through the series of algorithms above, and then perform 3D reconstruction with the binocular camera to obtain the 3D coordinates of the hand in the world coordinate system.
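This triangulation step can be sketched as follows; the sketch assumes the projection matrices \( M_{1} \), \( M_{2} \) and the pixel coordinates are already available, and solves \( AP = b \) by least squares exactly as derived above.

```python
import numpy as np

def triangulate(M1, M2, uv1, uv2):
    """Least-squares 3D point from two 3x4 projection matrices and pixel coords.

    Builds the A matrix and b vector of the AP = b system described above.
    """
    (u1, v1), (u2, v2) = uv1, uv2
    A = np.array([
        u1 * M1[2, :3] - M1[0, :3],
        v1 * M1[2, :3] - M1[1, :3],
        u2 * M2[2, :3] - M2[0, :3],
        v2 * M2[2, :3] - M2[1, :3],
    ])
    b = np.array([
        M1[0, 3] - u1 * M1[2, 3],
        M1[1, 3] - v1 * M1[2, 3],
        M2[0, 3] - u2 * M2[2, 3],
        M2[1, 3] - v2 * M2[2, 3],
    ])
    # P = (A^T A)^{-1} A^T b, computed with a numerically stable solver
    P, *_ = np.linalg.lstsq(A, b, rcond=None)
    return P  # (x, y, z) in the world coordinate system
```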

2.3 Ant Colony Algorithm for Path Planning

Given that the world coordinates of the points in the space are now available, the next problem to be solved is to use the coordinates of the hand so that the drone can move with it. We formulate this as a path planning problem for the drone. The system uses the Ant Colony Algorithm to adjust the two moving targets for precise motion according to the preset interaction path between the hand and the drone.

The algorithm in this paper sets several points on the preset path; these points are randomly generated by the movement of the hand and their number is not fixed, so this is a typical TSP problem. In our system, the central controller commands the drone to fly along the path formed by the random point coordinates generated by the hand. When the UAV deviates from the route or the target point changes, the system regenerates the path with the Ant Colony Algorithm based on the updated dynamic and static information of the UAV. When the flight path of the drone changes, the system responds quickly and controls the drone to fly along the newly generated track. The system converts the control commands into PWM signals sent to the drone to realize attitude control. The implementation of the entire aircraft control system is shown in Fig. 4, and a minimal sketch of the ant colony waypoint ordering follows the figure.

Fig. 4. UAV path planning system based on ant colony algorithm.
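The Python sketch below outlines an ant colony ordering of hand-generated waypoints into a short visiting tour; the colony size, iteration count and pheromone parameters are assumed illustrative values, not the tuned settings of our system.

```python
import numpy as np

def ant_colony_path(points, n_ants=20, n_iters=50, alpha=1.0, beta=3.0, rho=0.5, Q=1.0):
    """Order waypoints (an (n, 3) array) into a short tour with a basic ant colony scheme."""
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2) + 1e-9
    tau = np.ones((n, n))                 # pheromone levels
    eta = 1.0 / dist                      # heuristic visibility (inverse distance)
    best_tour, best_len = None, np.inf
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            tour = [np.random.randint(n)]
            unvisited = set(range(n)) - {tour[0]}
            while unvisited:
                i, cand = tour[-1], list(unvisited)
                w = (tau[i, cand] ** alpha) * (eta[i, cand] ** beta)
                nxt = int(np.random.choice(cand, p=w / w.sum()))
                tour.append(nxt)
                unvisited.discard(nxt)
            length = sum(dist[tour[k], tour[k + 1]] for k in range(n - 1))
            tours.append((tour, length))
            if length < best_len:
                best_tour, best_len = tour, length
        tau *= (1 - rho)                  # pheromone evaporation
        for tour, length in tours:        # deposit pheromone along each tour
            for k in range(n - 1):
                tau[tour[k], tour[k + 1]] += Q / length
    return [points[i] for i in best_tour]
```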

3 Field Experiment

The field experiment of the system is divided into three parts. The first part is the calibration of the binocular camera; the second part is the construction and deployment of the UAV path planning system and the sensor components, which implement the two main modes: drone props flying around the space, and drone props interacting with the actors' actions; the third part is the physical interaction between the experimenter and the drone props. Some pictures of our system experiments are shown below.

First, the authors designed and deployed the UAV terminal control system and the sensor system. This work is the basis for debugging the experimental hardware.

Figure 5 shows photos of an experiencer interacting with a drone carrying a balloon prop. The system detects the changing coordinates of the experiencer's hands and adjusts the drone to the corresponding coordinates in real time. The effect is that the experiencer can push the balloon into flight, realizing physical interaction between the two.

Fig. 5. Experiments in which researchers interact with physical items.

After joint debugging, it was confirmed that the system can achieve dynamic interaction between entities within a certain range, but the time response still has certain limitations. The authors intend to address this problem with a faster algorithm in future work.

4 Result

This paper develops a mixed reality system for scene interaction between actors and solid props in the field of stage performance. The implementation combines multi-point positioning, computer vision, drone control and path planning techniques. Through the technical details and field experiments described above, the authors finally realized the scene of drone props flying around the stage space and interacting with the actors according to the set mode. These implementations overcome some of the limitations of special effects in traditional stage performances and enhance the immersion of the audience. The system has great potential for future commercial deployment.