
1 Introduction

The Intelligent Systems Laboratory of the Toyota Central Research Institute in Japan has developed a personal following robot to assist with handling and loading tasks [1]. The robot is mainly equipped with a panoramic camera, a Lidar and an inertial measurement sensor for perception, and follows the target according to the robot's kinematic model. The panoramic camera is used for target tracking, the Lidar is used for obstacle avoidance, and the inertial sensor measures the acceleration and angular velocity of the robot so that it can keep its balance.

The School of Robotics Engineering at Inha University in Korea has developed a following robot [2]. The robot integrates a monocular camera and a Lidar to track human targets. For visual tracking, a particle filter is used to track morphological features extracted from the image. At the same time, the laser ranging sensor measures the distance and angle of the target, and the Lidar and visual tracking data are then fused to achieve reliable tracking of the target.

Hanyang University in South Korea has developed a following robot for marathon runners [3]. The robot obtains point cloud data of the surrounding environment through laser sensors. Based on support vector data description, the point cloud data are mapped into a high-dimensional space for classification, the target region is distinguished, and the target position is tracked. At the same time, a Kalman filter is used to estimate the state of the tracked human body and the optimal position of the tracking target, so as to realize the motion control of the robot.

Intel has developed a commercial following robot based on the Segway self-balancing platform. The balance car is intelligently upgraded to sense the surrounding environment through an RGB-D camera; vision algorithms realize gesture recognition, obstacle avoidance and following, and the robot also provides speech recognition, mobile photography and home monitoring functions.

The Shenyang Institute of Automation, Chinese Academy of Sciences, has developed a humanoid robot [4] that realizes target tracking based on a three-degree-of-freedom redundant vision system. It is equipped with a laser-based time-of-flight (TOF) camera and a binocular camera composed of two CCD cameras. Its processing logic is very similar to that of human eyes: first find and track the target over a relatively large range, and once the target is found, observe it carefully and track it. The robot first searches for the target with the TOF camera to obtain a rough location, and then locates the target accurately with the binocular camera. The TOF camera has a low resolution, which reduces the computation during the coarse localization phase, while the higher-resolution binocular camera allows precise measurement and localization of the target.

2 Following System Design

The multi-sensor human target tracking method proposed in this paper mainly addresses the shortcomings of purely visual tracking. For example, in completely occluded or poorly illuminated scenes, a visual tracking algorithm cannot re-identify and track the target once the human target leaves the camera's field of view. Therefore, this paper introduces Ultra-Wideband (UWB) and Inertial Measurement Unit (IMU) sensors, which mainly solve the problem of providing reliable coordinate information for the tracked target in the presence of occlusion.

Based on the analysis of the above application scenarios and the adopted technologies, the overall scheme of the following robot in this paper is divided into four parts: the perception layer, the core processing layer, the control system layer and the power module, as shown in Fig. 1.

(1) The perception layer is the bottom part of the whole system and is the key hardware through which the following robot perceives its surroundings. A 1080p camera is installed on the 2-DOF camera PTZ (Pan/Tilt/Zoom) platform to provide the system with the video stream during following. Two nine-axis gyroscopes are installed on the camera pan-tilt and on the chassis of the following robot, respectively, to provide the relative angle and pitch angle of the camera. The UWB tag and the IMU are fixed together with a Bluetooth module to provide the acceleration and angle information of the human target to the system. The motor driver is responsible for driving the motors that move the robot.

(2) The core processing layer is built on the ROS system and is mainly responsible for visual target detection and tracking and for combined sensor tracking. During initialization, the target detection algorithm detects the human target closest to the robot in the image as the target to be tracked subsequently and publishes the detection result. The target tracking node subscribes to the target position data from the target detection node as the initial bounding box for tracking. The fusion node parses the data according to the communication protocol to obtain the UWB tracking data and the attitude data of the nine-axis gyroscope sensor. At the same time, it matches the UWB tracking target with the visual tracking target according to the transformation between the UWB coordinate system and the camera coordinate system, achieving multi-sensor tracking fusion.

(3) The control system layer mainly collects the sensing data and exchanges it with the core processing layer through a UART serial port. The control layer contains two processing units. The first, an STM32F103, processes the measurements between the UWB base stations and the tag: from the distances between the three base stations and the tag, the position of the tag relative to the robot is obtained by a position solution algorithm (a minimal sketch of such a solution is given after this list). The second, an STM32F407, is mainly responsible for the attitude control of the 2-DOF camera PTZ; it controls the motor speed of the robot chassis through the motor driver, receives data from the core processing layer, and at the same time receives the position information computed by the STM32F103 and sends the relevant data to the core processing unit through the serial port.
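The paper does not give the details of the position solution algorithm running on the STM32F103. The following is a minimal least-squares trilateration sketch in Python (the anchor layout and range values are hypothetical) that illustrates how the tag position can be recovered from the ranges to the three base stations:

```python
import numpy as np

def solve_tag_position(anchors, distances):
    """Least-squares trilateration of a UWB tag from ranges to three anchors.

    anchors   : (3, 2) array of base-station coordinates in the robot frame
    distances : (3,) array of measured tag-anchor distances
    Returns the estimated (x, y) position of the tag.
    """
    anchors = np.asarray(anchors, dtype=float)
    d = np.asarray(distances, dtype=float)

    # Subtract the first range equation from the others to remove the
    # quadratic terms, leaving a linear system A @ [x, y]^T = b.
    x0, y0 = anchors[0]
    A = 2.0 * (anchors[1:] - anchors[0])
    b = (d[0] ** 2 - d[1:] ** 2
         + np.sum(anchors[1:] ** 2, axis=1)
         - (x0 ** 2 + y0 ** 2))

    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos

# Hypothetical anchor layout on the robot and one noisy set of ranges (metres)
anchors = [(0.0, 0.0), (0.6, 0.0), (0.0, 0.6)]
ranges = [1.92, 1.62, 1.50]
print(solve_tag_position(anchors, ranges))
```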

Fig. 1. Overall scheme design of the following robot

The hardware layout of the following robot is shown in Fig. 2.

Fig. 2. The hardware layout of the following robot

3 Multi-sensor Combined Tracking

3.1 Fusion Model of UWB and IMU Based on Adaptive Kalman Filter

According to the Kalman filter, for a general linear system the state equation and the observation equation are as follows:

$${X}_{k}={\varPhi }_{k,k-1}{X}_{k-1}+{B}_{k-1}{U}_{k-1}+{\varGamma }_{k-1}{W}_{k-1}$$
(1)
$${Z}_{k}={H}_{k}{X}_{k}+{V}_{k}$$
(2)

where \({X}_{k}\) represents the state vector at time k, \({\varPhi }_{k,k-1}\) the state transition matrix from time k−1 to time k, \({B}_{k-1}\) the influence of the input at time k−1 on the system, \({U}_{k-1}\) the input at time k−1, \({W}_{k-1}\) the dynamic noise of the random system, \({H}_{k}\) the measurement matrix at time k, \({V}_{k}\) the measurement noise sequence at time k, and \({\varGamma }_{k-1}\) the system noise matrix.
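For reference, a minimal NumPy sketch of the corresponding prediction and update steps under the model of Eqs. (1) and (2) is given below (the function and variable names are ours, not from the paper; the extended, adaptive variants used later build on this form):

```python
import numpy as np

def kf_predict(x, P, Phi, B, u, Gamma, Q):
    """Prediction step for the linear model of Eqs. (1)-(2)."""
    x_prior = Phi @ x + B @ u                         # propagate the state
    P_prior = Phi @ P @ Phi.T + Gamma @ Q @ Gamma.T   # propagate the covariance
    return x_prior, P_prior

def kf_update(x_prior, P_prior, z, H, R):
    """Measurement update step."""
    S = H @ P_prior @ H.T + R                         # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)              # Kalman gain
    x_post = x_prior + K @ (z - H @ x_prior)          # corrected state
    P_post = (np.eye(len(x_prior)) - K @ H) @ P_prior
    return x_post, P_post, K
```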

In this paper, the distances \({[{d}_{1},{d}_{2},{d}_{3}]}^{T}\) measured by the three base stations are used directly as the observation vector, so the observation equation is nonlinear and has to be handled with an extended Kalman filter. According to the human target model in this paper, the state vector describing the human motion and the measurement model obtained from the UWB solution are as follows:

$$\left(\begin{array}{c}{x}_{k}\\ {y}_{k}\\ {\dot{x}}_{k}\\ {\dot{y}}_{k}\end{array}\right)=\left(\begin{array}{cc}{I}_{2}& \varDelta T{I}_{2}\\ 0& {I}_{2}\end{array}\right)\left(\begin{array}{c}{x}_{k-1}\\ {y}_{k-1}\\ {\dot{x}}_{k-1}\\ {\dot{y}}_{k-1}\end{array}\right)+\left(\begin{array}{c}\frac{1}{2}\varDelta {T}^{2}{I}_{2}\\ \varDelta T{I}_{2}\end{array}\right)\left(\begin{array}{l}{\ddot{x}}_{k-1}\\ {\ddot{y}}_{k-1}\end{array}\right)+{W}_{k-1}$$
(3)
$$\left(\begin{array}{l}{d}_{1}\\ {d}_{2}\\ {d}_{3}\end{array}\right)=\left(\begin{array}{c}\sqrt{{\left({x}_{k}-{x}_{1}\right)}^{2}+{\left({y}_{k}-{y}_{1}\right)}^{2}}\\ \sqrt{{\left({x}_{k}-{x}_{2}\right)}^{2}+{\left({y}_{k}-{y}_{2}\right)}^{2}}\\ \sqrt{{\left({x}_{k}-{x}_{3}\right)}^{2}+{\left({y}_{k}-{y}_{3}\right)}^{2}}\end{array}\right)+{V}_{k}$$
(4)

where \(\varDelta T\) represents the sampling interval, \({I}_{2}\) represents the 2 × 2 identity matrix, the positions of the three UWB base stations are \(({x}_{1},{y}_{1})\), \(({x}_{2},{y}_{2})\) and \(({x}_{3},{y}_{3})\), respectively, and \(({x}_{k},{y}_{k})\) represents the position of the tracked human body. The nonlinear measurement function in Eq. (4) is denoted as:

$$h\left({X}_{k}\right)=\left(\begin{array}{c}\sqrt{{\left({x}_{k}-{x}_{1}\right)}^{2}+{\left({y}_{k}-{y}_{1}\right)}^{2}}\\ \sqrt{{\left({x}_{k}-{x}_{2}\right)}^{2}+{\left({y}_{k}-{y}_{2}\right)}^{2}}\\ \sqrt{{\left({x}_{k}-{x}_{3}\right)}^{2}+{\left({y}_{k}-{y}_{3}\right)}^{2}}\end{array}\right)$$
(5)

Because \(h\left({X}_{k}\right)\) is a nonlinear function, there is no constant matrix \({H}_{k}\) such that \(h\left({X}_{k}\right)={H}_{k}{X}_{k}\) holds exactly. Following the extended Kalman filter, the nonlinear function is expanded with the Taylor formula, and an approximate linearized equation is obtained by ignoring the terms of second order and above. The measurement Jacobian, in which \({p}_{i}\) denotes the i-th component of \(h\left({X}_{k}\right)\), is then:

$${H}_{k}=\frac{\partial h\left({X}_{k}\right)}{\partial {X}_{k}}=\left(\begin{array}{llll}\frac{\partial {p}_{1}}{\partial {x}_{k}}& \frac{\partial {p}_{1}}{\partial {y}_{k}}& \frac{\partial {p}_{1}}{\partial {\dot{x}}_{k}}& \frac{\partial {p}_{1}}{\partial {\dot{y}}_{k}}\\ \frac{\partial {p}_{2}}{\partial {x}_{k}}& \frac{\partial {p}_{2}}{\partial {y}_{k}}& \frac{\partial {p}_{2}}{\partial {\dot{x}}_{k}}& \frac{\partial {p}_{2}}{\partial {\dot{y}}_{k}}\\ \frac{\partial {p}_{3}}{\partial {x}_{k}}& \frac{\partial {p}_{3}}{\partial {y}_{k}}& \frac{\partial {p}_{3}}{\partial {\dot{x}}_{k}}& \frac{\partial {p}_{3}}{\partial {\dot{y}}_{k}}\end{array}\right)$$
(6)
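As an illustration, the constant-velocity state model of Eq. (3) and the measurement function and Jacobian of Eqs. (5) and (6) can be written in a few lines of Python/NumPy; the base-station coordinates and the state values used here are placeholders, not measured data:

```python
import numpy as np

def motion_matrices(dt):
    """State transition Phi and input matrix B of Eq. (3) for the CV model."""
    I2 = np.eye(2)
    Phi = np.block([[I2, dt * I2],
                    [np.zeros((2, 2)), I2]])          # acts on (x, y, vx, vy)
    B = np.vstack([0.5 * dt ** 2 * I2, dt * I2])      # acts on (ax, ay)
    return Phi, B

def measurement(x, anchors):
    """h(X_k) of Eq. (5): ranges from the state position to each base station."""
    return np.linalg.norm(x[:2] - anchors, axis=1)

def measurement_jacobian(x, anchors):
    """H_k of Eq. (6): partial derivatives of the ranges w.r.t. the state."""
    diff = x[:2] - anchors                            # shape (3, 2)
    ranges = np.linalg.norm(diff, axis=1, keepdims=True)
    H = np.zeros((3, 4))
    H[:, :2] = diff / ranges                          # d d_i / d(x, y)
    return H                                          # velocity columns stay zero

anchors = np.array([[0.0, 0.0], [0.6, 0.0], [0.0, 0.6]])   # hypothetical layout
x = np.array([1.2, 1.5, 0.1, 0.0])                          # (x, y, vx, vy)
print(measurement(x, anchors))
print(measurement_jacobian(x, anchors))
```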

When the Kalman filter is applied, the driving noise and the measurement noise of the system must be white noise. In practice, however, the driving noise and the measurement noise are colored, and changes in the actual environment cause the noise statistics to change; as a result, the deviation between the estimated value and the true value grows and the filter can no longer provide an optimal estimate of the state. Therefore, this paper uses an adaptive weighted Kalman filter to estimate the parameters of the model.

The adaptive Kalman filter introduces a fading factor to modify the error covariance matrix online and to update the measurement noise and state noise matrices, which prevents filter divergence and improves the robustness of the algorithm. It introduces a weighting coefficient to adjust the measurement noise and the state noise; the weighting coefficient is calculated as:

$${d}_{k}=\frac{1-b}{1-{b}^{k+1}}$$
(7)

where k denotes the current time step and b is the forgetting factor, taking a value between 0 and 1. It can be seen from Eq. (7) that as \(\mathrm{k}\to \infty\), \({d}_{k}\) tends to the constant \(1-b\), so recent residuals are weighted more heavily than older data.

Using the weighting coefficient, the noise statistics are dynamically adjusted from the residual as follows:

$$\left\{ {\begin{array}{*{20}l} {r_k = \left( {1 - d_k } \right)r_{k - 1} + d_k \left( {Z_k - H_k \bar{X}_k^ - } \right)} \hfill \\ {R_k = \left( {1 - d_k } \right)R_{k - 1} + d_k \left( {\varepsilon _k \varepsilon _k^T - H_k P_{k - 1} H_k^T } \right)} \hfill \\ {q_k = \left( {1 - d_k } \right)q_{k - 1} + d_k \left( {\bar{X}_k - \varPhi _{k,k - 1} \bar{X}_{k - 1} } \right)} \hfill \\ {Q_k = \left( {1 - d_k } \right)Q_{k - 1} + d_k \left( {K_k \varepsilon _k \varepsilon _k^T K_k^T + P_k - \varPhi _{k,k - 1} P_k^ - \varPhi _{k,k - 1}^T } \right)} \hfill \\ \end{array} } \right.$$
(8)

where \({\varepsilon }_{k}\) represents the residual, defined as:

$${\varepsilon }_{k}={Z}_{k}-{H}_{k}{\overline{X} }_{k}^{-}$$
(9)

where \({r}_{k}\) and \({q}_{k}\) represent the means of the measurement noise and the state noise, respectively, and \({R}_{k}\) and \({Q}_{k}\) represent their covariances.
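To make the update in Eqs. (7)–(9) concrete, the following is a minimal Python sketch of the adaptive noise estimation step (a Sage-Husa-style update); the function and variable names are assumptions, and the surrounding EKF predict/update loop is omitted:

```python
import numpy as np

def forgetting_weight(k, b=0.96):
    """Weighting coefficient d_k of Eq. (7); b in (0, 1) is the forgetting factor."""
    return (1.0 - b) / (1.0 - b ** (k + 1))

def sage_husa_update(k, z, H, Phi, K,
                     x_prior, x_post, x_prev,
                     P_prev, P_prior, P_post,
                     r, R, q, Q, b=0.96):
    """One adaptive update of the noise statistics following Eq. (8).

    z                      : measurement Z_k
    x_prior / x_post       : predicted state X_k^- and filtered state X_k
    x_prev                 : filtered state at time k-1
    P_prev/P_prior/P_post  : covariances P_{k-1}, P_k^- and P_k
    K                      : Kalman gain K_k
    r, R, q, Q             : running noise means and covariances being estimated
    """
    d = forgetting_weight(k, b)
    eps = z - H @ x_prior                            # residual, Eq. (9)

    r_new = (1 - d) * r + d * eps
    R_new = (1 - d) * R + d * (np.outer(eps, eps) - H @ P_prev @ H.T)
    q_new = (1 - d) * q + d * (x_post - Phi @ x_prev)
    Q_new = (1 - d) * Q + d * (K @ np.outer(eps, eps) @ K.T
                               + P_post - Phi @ P_prior @ Phi.T)
    return r_new, R_new, q_new, Q_new
```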

3.2 Combined Tracking with Camera Sensor After UWB Fusion

Vision-based target tracking algorithms are not completely reliable in real scenes, and misidentification is possible under some extreme conditions. UWB measurements, on the other hand, suffer from large errors due to multipath effects, but fusing the IMU data can limit these errors to a certain range. Therefore, when the difference between the visually tracked target and the UWB-fused target is small, the visually tracked position can be considered more accurate; when the difference between them is large, the visual tracker may have misidentified the target, so the UWB-fused position is more reliable and is mainly used at that time.

In this paper, three features are used to decide which tracking method is primarily used to drive the robot: whether there is a tracked target in the camera field of view, whether the visually tracked target point and the UWB tracking point lie on the same side of the x-axis of the camera coordinate system, and whether the UWB-fused coordinates lie on the same side of the camera. The resulting fusion states are shown in Table 1.

Table 1. Fusion status.

The decision tree ID3 algorithm uses the information gain criterion to select the feature for classification at each node of the decision tree: it calculates the information gain for all candidate features and selects the feature with the greatest information gain as the splitting criterion.

As shown in the table above, \({A}_{1}\) indicates whether there is a target in the camera field of view, \({A}_{2}\) indicates whether the visual and UWB tracking points are on the same side of the x-axis, and \({A}_{3}\) indicates whether the coordinates after UWB fusion are on the same side of the camera. Suppose data set D can be divided into K classes, denoted \({C}_{k}\) (k = 1, …, K). The information gain of feature A for training data set D, written \(g(D\mid A)\), is:

$$g(D\mid A)=H(D)-H(D\mid A)$$
(10)

It is defined as the difference between the empirical entropy \(H(D)\) of set D and the conditional empirical entropy \(H(D\mid A)\) of set D given feature A.

Empirical entropy \(H(D)\) represents the uncertainty in set D, as shown below:

$$H(D)=-{\sum }_{k=1}^{K} \frac{\left|{C}_{k}\right|}{|D|}log\left(\frac{\left|{C}_{k}\right|}{|D|}\right)$$
(11)

The empirical entropy of set D conditioned on feature A, where A partitions D into subsets \({D}_{1},\dots ,{D}_{n}\) and \({D}_{ik}\) denotes the samples in \({D}_{i}\) belonging to class \({C}_{k}\), is shown below:

$$H(D\mid A)={\sum }_{i=1}^{n} \frac{\left|{D}_{i}\right|}{|D|}H\left({D}_{i}\right)=-{\sum }_{i=1}^{n} \frac{\left|{D}_{i}\right|}{|D|}{\sum }_{k=1}^{K} \frac{\left|{D}_{ik}\right|}{\left|{D}_{i}\right|}log\left(\frac{\left|{D}_{ik}\right|}{\left|{D}_{i}\right|}\right)$$
(12)

The sensor fusion classification table analyzed above is shown in Table 1. The fusion of UWB data and camera data is then realized according to the ID3 decision tree algorithm. According to Eqs. (10), (11) and (12), the information gains of the different features are \(g(D,{A}_{1})=0.548\), \(g(D,{A}_{2})=0.0488\) and \(g(D,{A}_{3})=0.0488\), so \(g(D,{A}_{1})>g(D,{A}_{2})=g(D,{A}_{3})\). Therefore, splitting on feature \({A}_{1}\) first reduces the uncertainty of the class information the most, and since \(g(D,{A}_{2})=g(D,{A}_{3})\), features \({\mathrm{A}}_{2}\) and \({\mathrm{A}}_{3}\) reduce the uncertainty of the set equally. The decision tree model shown in Fig. 3 is thus obtained, and according to this model the fusion of the UWB-IMU tracking data and the camera tracking data can be realized.
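Since Table 1 is not reproduced here, the following Python sketch only illustrates how such information gains would be computed from a fusion-state table; the example rows are placeholders, not the actual contents of Table 1:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a list of class labels, Eq. (11)."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Information gain g(D|A) = H(D) - H(D|A) of one feature, Eqs. (10) and (12)."""
    n = len(labels)
    cond = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Placeholder fusion-state table: feature columns A1, A2, A3 and the chosen source.
A1 = [1, 1, 1, 0, 0, 1]      # target visible in the camera field of view
A2 = [1, 0, 1, 0, 1, 1]      # visual and UWB points on the same side of the x-axis
A3 = [1, 1, 0, 0, 1, 0]      # UWB-fused coordinates on the same side of the camera
label = ['vision', 'uwb', 'vision', 'uwb', 'uwb', 'vision']

for name, feat in [('A1', A1), ('A2', A2), ('A3', A3)]:
    print(name, information_gain(feat, label))
```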

Fig. 3. Decision tree model of multi-sensor and camera fusion
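The exact branch structure of Fig. 3 is not spelled out in the text. The sketch below shows one plausible realization in which \(A_1\), the feature with the largest information gain, is tested at the root, and the consistency checks \(A_2\) and \(A_3\) decide between the visual and the UWB-fused position; the branch outcomes are assumptions:

```python
def select_tracking_source(target_in_view, same_side_visual_uwb, same_side_uwb_camera):
    """One plausible decision rule in the spirit of Fig. 3 (assumed structure).

    target_in_view        : A1 - a tracked target exists in the camera field of view
    same_side_visual_uwb  : A2 - visual and UWB points lie on the same side of the x-axis
    same_side_uwb_camera  : A3 - UWB-fused coordinates lie on the same side of the camera
    Returns which position estimate drives the robot.
    """
    if not target_in_view:
        # Visual tracking has failed or the target is occluded: rely on UWB/IMU.
        return "uwb"
    if same_side_visual_uwb and same_side_uwb_camera:
        # The two estimates are consistent: trust the more precise visual position.
        return "vision"
    # Estimates disagree, possible visual misidentification: fall back to UWB/IMU.
    return "uwb"

print(select_tracking_source(True, True, True))    # -> vision
print(select_tracking_source(False, True, False))  # -> uwb
```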

4 Experiment Verification

Verification experiments are designed according to the design indexes of the following robot developed in this paper.

4.1 Following Distance Test

To verify the following distance, a following-distance experiment is carried out. The following scene is shown in Fig. 4; the real-time distance between the target and the following robot and the speed of the following vehicle are recorded. As shown in Fig. 4(a) and Fig. 4(d), the robot can follow the target in a straight line and on a gentle slope; in Fig. 4(b) and Fig. 4(c), the robot follows the target turning right and left, respectively.

Fig. 4. Follow distance experiment

4.2 Following Occlusion Experiment

To verify the design index of extracting the target position under occlusion, a tracking occlusion experiment is designed. The experiment verifies that the following vehicle can perceive the tracked target within a certain range and, through UWB beyond-visual-range perception, can extract the target position when the visual target is occluded.

Starting from the position shown in Fig. 5(a) and moving to the position shown in Fig. 5(b), the tracker can still perceive the position of the target. When moving from the position shown in Fig. 5(c) to the position shown in Fig. 5(d), the visual tracker has failed and cannot perceive the position of the tracked target. As shown in the visual index diagram in Fig. 6, since the target is no longer within the camera field of view, the pixel error and the overlap rate with the calibration frame are both 0. As can be seen from Fig. 7, although the tracking accuracy of UWB is not high, it can still perceive the position of the target. It can therefore be concluded from this experiment that when the tracked target is completely occluded, the following robot can still effectively perceive its position.

Fig. 5. Human target tracking experiment

Fig. 6. Visual tracking index

Fig. 7. Sage-Husa adaptive Kalman filter fusion trajectory

4.3 PTZ Following Experiment

As shown in Fig. 8(a) and (b), when the target leaves the camera field of view, the visual tracking algorithm fails and the target position cannot be perceived. As shown in Fig. 8(c), when the target reappears within the camera field of view, the target tracking algorithm in this paper can resume tracking. When moving from Fig. 8(c) to Fig. 8(d), the camera pan-tilt follows the target to keep it within the camera field of view. In the state of Fig. 8(e), the target is partially occluded, but the tracker can still track it. After manually annotating 1290 images, the average pixel error (APE) and average overlap rate (AOR) with respect to the tracking boxes output by the tracking algorithm are calculated, as shown in Fig. 9.

APE is the pixel-distance error between the predicted target center position and the ground-truth position, averaged over all frames. AOR is the intersection-over-union of the predicted region and the ground-truth region in each frame, also averaged over all frames.
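A minimal sketch of how these two metrics can be computed from predicted and ground-truth boxes is shown below (the box format and variable names are assumptions):

```python
import numpy as np

def ape(pred_centers, gt_centers):
    """Average pixel error: mean Euclidean distance between predicted and true centers."""
    pred, gt = np.asarray(pred_centers, float), np.asarray(gt_centers, float)
    return np.mean(np.linalg.norm(pred - gt, axis=1))

def aor(pred_boxes, gt_boxes):
    """Average overlap rate: mean IoU of predicted and ground-truth boxes (x1, y1, x2, y2)."""
    ious = []
    for (px1, py1, px2, py2), (gx1, gy1, gx2, gy2) in zip(pred_boxes, gt_boxes):
        iw = max(0.0, min(px2, gx2) - max(px1, gx1))
        ih = max(0.0, min(py2, gy2) - max(py1, gy1))
        inter = iw * ih
        union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
        ious.append(inter / union if union > 0 else 0.0)
    return float(np.mean(ious))
```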

As shown in Fig. 9(a), there are four intervals where the average error is 0, in frames 85–101, 303–319, 837–900 and 1073–1092; in these frames the target is completely occluded by obstacles. In terms of the overlap rate, the overlap between the tracking box and the calibration box remains above 0.6, which shows that when the target is within the field of view, the tracking algorithm can track it with a certain accuracy.

In conclusion, this experiment verifies that the tracking performance of the lightweight tracking network in this paper meets the tracking requirements, and that the two-degree-of-freedom camera platform can realize continuous tracking of the target.

Fig. 8. Tracking of camera PTZ

Fig. 9. APE and AOR indexes in the camera PTZ following experiment

5 Conclusion

Taking the following robot as the application scenario, this paper designs the overall scheme of the following robot from the perspective of reliability and focuses on a multi-sensor human target tracking method. The main results are summarized as follows:

(1) For the human target tracking method studied in this paper, a target following robot system is designed, including the selection of sensor types. At the same time, the motion model of the robot is established to realize its motion control.

(2) Based on the Kalman filter and a decision tree, a tracking method combining UWB, IMU and a monocular camera is proposed, so that even when the target is completely occluded by obstacles, the following robot can still perceive the position of the followed target and achieve robust tracking.