1 Introduction

The interaction of humans and robots in a shared workspace is an active field of research. Applications range from domains where robotic technologies have traditionally been employed, e.g. industrial scenarios, to relatively new fields such as surgical applications. In all domains, the safety of the human interacting with the robotic system is paramount. For appropriate safety considerations as well as for many applications in human-robot interaction, humans in the environment have to be perceived, e.g. detected and located in 3D space. Both the latency and the frame rate of the perception system heavily influence the possible application scenarios, especially for safety-critical applications.

This paper, which is an extension of [1], presents an approach for combining a model-free tracking algorithm using a fast 3D camera, i.e. with low latency and/or high frame rate, with a ground truth provided by a secondary, slower 3D camera. Based on the faster camera, the full point cloud of the tracked target is pre-calculated in real time. To achieve this, the time-delayed ground truth of the slower camera is propagated forward in the data stream of the faster camera using 2D optical flow and then refined to segment the full point cloud from the scene. Segmentation is performed by calculating connected regions, rejecting outliers based on a simple tracking model and applying background subtraction. This results in a highly accurate tracking estimation in time with the faster camera, based on the time-delayed ground truth of the secondary system. In addition, the model-free tracking algorithm will continue tracking even if there is no ground truth available for a period of time, e.g. because the tracking algorithm that provides the ground truth has lost the target(s).

While the approach is implemented and evaluated on the scenario of human tracking, i.e. using human tracking by a 3D camera as ground truth, the algorithm is not tailored to this application (either implicitly or explicitly). On the contrary, as a model-free tracking algorithm it is designed with the goal to be adaptable to other applications, i.e. different combinations of tracking tasks and modalities. Examples might be 3D tracking of objects with specific temperatures by using a thermal imaging camera as delayed ground truth or even 3D tracking of objects based on positional audio information provided by a microphone array.

The developed algorithm is applied to three different scenarios for tracking a human body as a point cloud in 3D space: (a) low-latency tracking based on ground truth with a latency of one to several seconds, (b) high frame rate tracking based on ground truth with a lower frame rate, (c) continuous tracking based on non-continuous ground truth.

Optical flow and depth information have been used in various works for segmenting and tracking humans and objects. Examples include [2], where depth and optical flow were used to estimate the 3D position and motion of a person; [3], where optical flow was used to track persons across multiple cameras to avoid occlusions; and [4], where optical flow and depth cues were applied to vehicle-based moving object segmentation. The combination of 2D and 3D Kinect data has been researched e.g. in [5] with the purpose of mobile mapping in indoor environments.

However, the combination of different 3D cameras with 2D/3D propagation of tracking probabilities has not been investigated before.

2 System Setup

The Robot Operating System (ROS) has been used as a communication framework [6]. It is based on sending time-stamped messages on named topics and provides transport mechanisms for both 2D and 3D image data.

Processing of data acquired by the different 3D cameras has been performed using OpenCV [7] for 2D images and Point Cloud Library [8] for 3D data.

In the following, we give a short description of the scenarios and camera systems to which the algorithm has been applied.

Fig. 1 Left: Sensor node with Kinect (top) and ToF camera (bottom center); the optical tracking (bottom right) was not used for this work. Right: Argos 3D P100 mounted on top of the Kinect II

2.1 Latency Minimization Scenario

The first scenario is based on the sensing system of OP:Sense, a research platform for surgical robotics [9]. Four RGB-D Microsoft Kinect cameras (first generation), featuring a resolution of \(640 \times 480\) pixels for both depth and color image at 30 frames/s (fps), supervise a narrow scene from different points of view. Human tracking and fusion are performed based on the OpenNI tracking libraries [10]. Due to the distributed setup of the system, the Kinect system features a latency of about 950 ms.

The secondary camera system consists of six Time of Flight (ToF) pmd[vision] S3 cameras. With a resolution of \(64 \times 48\) pixels, they provide depth sensing (i.e. point clouds and depth images) as well as an amplitude image that contains the signal strength of the measurement. The left side of Fig. 1 shows one sensor node with both Kinect and ToF camera.

2.2 Frame Rate Optimization Scenario

In this scenario, an RGB-D Microsoft Kinect II has been used for human tracking. The camera provides a color stream with \(1920 \times 1080\) pixels and a depth data stream with \(512 \times 424\) pixels, both at 30 fps. Human tracking was performed using the Microsoft Kinect SDK 2.0 on a Windows system, and streaming to ROS has been realized using a custom bridge based on the win_ros stack.

A Bluetechnix Argos 3D P100 ToF camera with a resolution of \(160 \times 120\) pixels provides depth data and an amplitude image, both at a rate of up to 160 fps. The right side of Fig. 1 shows the demonstration setup.

2.3 Tracking Reconstruction Scenario

This scenario is based on the same setup as the latency minimization scenario (Sect. 2.1), i.e. four Kinect cameras provide the ground truth and six pmd[vision] S3 cameras are used for running the pre-calculation.

3 Methods

For easier reading and consistency with the scenarios and evaluation, we designate the source of the ground truth in the following as “Kinect camera” and the secondary camera as “ToF camera”. However, the presented algorithm is naturally applicable to a wide range of different cameras.

Similarly, the tracking application, which will be referred to throughout the article, is the tracking of humans (based on ground truth provided by the Kinect camera). As the presented approach is a model-free algorithm which is deliberately based on processing an external ground truth (as opposed to implementing custom detection and/or tracking algorithms), applications to arbitrary different tracking scenarios are possible. In general, the only requirement is that an external ground truth is available and that correspondences can be established between the ground truth and the data acquired by the secondary camera.

3.1 Processing Pipelines

The proposed algorithm consists of two different processing pipelines which are executed in parallel, as depicted in Fig. 2. The first one processes all data acquired by the ToF camera (data which does not contain any tracking information) and propagates tracking information based on the delayed ground truth; thereby, a tracking estimate is provided in each time step. The second one processes the user tracking information from the Kinect camera (ground truth) and updates the ToF tracking state as well as the background model.

Fig. 2 Overview of the high-level structure of the proposed algorithm. Two pipelines (left: ToF pre-calculation, right: ground truth processing) are executed in parallel such that the pre-calculation pipeline based on ToF data is continuously updated with the latest available ground truth

ToF Processing. In the following, we use the term “ToF frame” to refer to all ToF data associated with a single time step: source data such as the 3D point cloud, the amplitude image, the depth image and the time stamp of the data acquisition, as well as processed data such as a flow field, a tracking probability map and geometric information about tracked targets. To enable applying the results of filtering in the 2D image domain to the 3D space of the point cloud, the pixel-to-point correspondences have to be preserved. For this reason, only operations that keep the ToF point clouds organized are employed, i.e. operations that do not remove or rearrange the original points in the cloud.
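
As an illustration of this constraint, the following minimal sketch (our own, not taken from the original implementation) applies a 2D mask to an organized point cloud held as an H x W x 3 NumPy array: masked-out points are invalidated rather than removed, so the pixel-to-point correspondences survive.

import numpy as np

def mask_organized_cloud(cloud_hw3, keep_mask):
    """Apply a 2D mask to an organized point cloud without destroying its
    row/column layout: instead of removing points (which would break the
    pixel-to-point correspondences), masked-out points are set to NaN.

    cloud_hw3 : float array of shape (H, W, 3), one 3D point per pixel
    keep_mask : bool array of shape (H, W), True for points to keep
    """
    masked = cloud_hw3.copy()
    masked[~keep_mask] = np.nan   # invalidate instead of deleting
    return masked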

Figure 3 visualizes the data processing of incoming ToF frames: Upon receiving a new ToF frame, the point cloud is transformed into a shared coordinate system and 2D optical flow from the previous frame is calculated based on the respective amplitude images (see Sect. 3.3). The ToF frame is then stored in a ring buffer. A tracking probability map is calculated that provides a first estimation of the current position of the tracked target(s), based on the optical flow and the tracking probability map stored in the previous ToF frame. Last, a refinement and rejection step is performed based on the tracking probability map, the background model and the spatial information encoded in the depth map (see Sect. 3.5). This yields the extended tracking map for the current time step which is then applied to the point cloud to calculate the human body point cloud tracking estimate.

Fig. 3 Processing pipeline for a newly acquired ToF frame at time t. (1) A flow field is calculated based on the amplitude images of frame t-1 and frame t. (2) The flow field is applied to the tracking probability map of frame t-1, resulting in a tracking probability map for frame t. (3) The tracking probability map is processed based on the tracking information of frame t-1 and a global background model to provide an extended tracking map. (4) Applying the extended tracking map to the point cloud results in the final tracking estimate

Fig. 4 Processing pipeline for new ground truth data acquired at time t-6 and received at time t. First, the corresponding ToF frame in the ring buffer is identified using the associated time stamps. Next, correspondences are estimated to calculate a tracking probability map for the ToF frame at time t-6. Last, the tracking probability map is propagated forwards using the flow fields associated with each ToF frame

Ground Truth Processing. Upon reception, the point cloud corresponding to the tracked human(s) is transformed to the shared coordinate system. Based on the acquisition time of the received point cloud, the closest matching ToF frame is located in the ring buffer (see Fig. 4). By determining correspondences between the ground truth and the point cloud stored in the ToF frame, a tracking probability map with full certainties is established and the ToF frame is marked as a key frame. The background model is updated using this tracking probability map and the corresponding depth map (see Sect. 3.2). These calculations are performed for each incoming ground truth frame and are therefore independent of the actual delay of the ground truth.

Using the respective flow fields, the tracking probability map is propagated forward throughout the ring buffer until the most recent ToF frame (see Sect. 3.4). Here, the number of forward-propagations is directly proportional to the length of the delay. Thereby, the tracking probability map of the next arriving ToF frame will be calculated based on the updated information from this frame.
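
A simplified sketch of this ground truth pipeline is given below. The frame attributes and the helpers make_prob_map and propagate are hypothetical names used only for illustration; they stand for the correspondence calculation of Sect. 3.4 and the flow-based propagation step, respectively.

def process_ground_truth(gt_cloud, gt_stamp, ring_buffer, make_prob_map, propagate):
    """Hypothetical sketch of the ground truth pipeline (names are ours):
    anchor a delayed ground truth frame in the ring buffer and propagate its
    tracking probability map forward to the newest ToF frame.

    ring_buffer   : list of ToF frames, oldest first; each frame carries
                    .stamp, .prob_map and .flow (flow from previous frame)
    make_prob_map : function(gt_cloud, frame) -> probability map with full
                    certainty for corresponding pixels (see Sect. 3.4)
    propagate     : function(prob_map, flow) -> propagated probability map
    """
    # 1. Locate the ToF frame whose timestamp is closest to the ground truth
    idx = min(range(len(ring_buffer)),
              key=lambda i: abs(ring_buffer[i].stamp - gt_stamp))

    # 2. Establish correspondences and mark the frame as a key frame
    ring_buffer[idx].prob_map = make_prob_map(gt_cloud, ring_buffer[idx])
    ring_buffer[idx].is_key_frame = True

    # 3. Propagate forward through all newer frames using their flow fields;
    #    the number of propagation steps grows with the delay of the ground truth
    for i in range(idx + 1, len(ring_buffer)):
        ring_buffer[i].prob_map = propagate(ring_buffer[i - 1].prob_map,
                                            ring_buffer[i].flow)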

3.2 Background Modelling

In the presented approach, almost all information is stored and processed on a frame-by-frame basis, e.g. optical flow between two frames and the tracking probability map are directly assigned to a specific ToF frame. There are two exceptions which are modelled as global components: The number of tracked humans and a background model of the scene.

Our approach to modelling the background of the scene is based on the work of [11], which extended the common Gaussian mixture models for pixel-wise background subtraction by an automatic calculation of the correct number of Gaussian distributions per pixel. We have modified the OpenCV implementation of this algorithm in two ways in order to take advantage of the data flow in our approach. First, we introduce a masking capability that enables restricting an update of the background model to specific areas of the image. Second, we split the update step of the original algorithm into two different parts: a background maintenance stage that only updates the model (without performing background subtraction on the input) and a foreground detection stage that allows performing background subtraction on an image and calculating a foreground mask without updating the background model.

Based on these modifications, the background model is being used as follows:

When a new ground truth frame arrives and correspondences to the matching ToF frame have been calculated, the background model is updated using the depth image of this ToF frame. The tracking probability map is used to mask the tracked humans, thereby ensuring that they are not incorporated into the background model. This prevents the common problem that non-moving entities are included in the background after a certain number of update steps [12].

When a new ToF frame is processed, an extended tracking map is calculated that contains the location of all pixels belonging to a tracked human. However, this map is prone to inclusion of false positives, e.g. pixels that belong to the background. For correction, a foreground mask is retrieved by querying the background model with the depth image of the ToF frame. By masking the extended tracking map with the foreground mask, we remove potential false positives.
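
The following sketch approximates this two-part usage with the stock OpenCV Python bindings; the paper modifies the C++ implementation directly, which the Python API does not expose. Detection-only queries use learningRate=0, and the masked maintenance step is emulated by overwriting the pixels of tracked humans with the current background estimate before updating. The depth image is assumed to be converted to 8 bit.

import cv2
import numpy as np

bg_model = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                              detectShadows=False)

def maintain_background(depth_img, human_mask):
    """Maintenance step (update only). Pixels covered by the tracked human
    are replaced by the current background estimate so that they are not
    absorbed into the model -- an approximation of the masked update
    described in Sect. 3.2. depth_img is assumed to be 8-bit."""
    frame = depth_img.copy()
    bg_img = bg_model.getBackgroundImage()
    if bg_img is not None and bg_img.size:
        frame[human_mask > 0] = bg_img[human_mask > 0]
    bg_model.apply(frame)                      # default learning rate

def detect_foreground(depth_img):
    """Detection-only step: learningRate=0 leaves the model untouched and
    returns the foreground mask used to reject false positives."""
    return bg_model.apply(depth_img, learningRate=0)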

3.3 Optical Flow Estimation

As described in Sect. 3.1, optical flow applied to 2D images is used to propagate the tracking probability map between the ToF frames.

When using RGB images, the sensitivity of optical flow for moving targets such as humans or objects is highly dependent on the kind of motion performed. When applying optical flow for the purpose of tracking, rotations prove more difficult to detect than translations: During rotation of a tracked target, previously visible parts of the object vanish from the image while new parts appear. For these new elements, no corresponding parts exist in the previous image. Performing optical flow calculations on the amplitude images acquired by ToF cameras partially overcomes this problem: The reflectivity of a tracked target, especially in the case of human tracking, is usually less affected by rotations than its appearance in color space.

For the actual calculation of optical flow between two amplitude images, we use the TV-L1 algorithm proposed by [13]. The flow field is calculated upon receiving a new ToF frame and stored within the frame. As the flow field based propagation of the tracking probability map is only used as a first approximation which is refined in subsequent steps, our parameterization of the optical flow algorithm is targeted at higher computation speed rather than optimal accuracy. Therefore, we set the number of warps to 2 and the number of levels to 3.
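
In OpenCV terms, this parameterization can be reproduced roughly as follows (a sketch; the module path of the Dual TV-L1 implementation depends on the OpenCV version, and the amplitude images are assumed to be 8-bit grayscale):

import cv2

# Depending on the OpenCV version, Dual TV-L1 lives in the main module or in
# the opencv-contrib "optflow" module.
try:
    tvl1 = cv2.optflow.createOptFlow_DualTVL1()   # OpenCV 4.x + contrib
except AttributeError:
    tvl1 = cv2.createOptFlow_DualTVL1()           # older OpenCV 3.x builds

# Speed over accuracy, as in Sect. 3.3: few warps and few pyramid levels.
tvl1.setWarpingsNumber(2)
tvl1.setScalesNumber(3)

def amplitude_flow(prev_amp, curr_amp):
    """Dense flow between two 8-bit amplitude images; returns an (H, W, 2)
    float field of per-pixel displacements."""
    return tvl1.calc(prev_amp, curr_amp, None)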

3.4 Tracking Probability Propagation

In ToF frames, information about the location of tracked humans has to be stored and propagated. We represent this information as a 2D probability map where the value of each pixel denotes the probability of this pixel belonging to a tracked human.

When a ground truth frame is received and the ToF frame with the closest matching timestamp has been located in the ring buffer, point-to-point correspondences between both frames have to be established. These correspondences are calculated by creating a k-d tree of the downsampled ground truth cloud, iterating over all points in the ToF point cloud and determining whether the distance to the ground truth cloud is smaller than a pre-defined threshold. For all points where this check is successful, the corresponding pixel in the zero-initialized probability map is set to one.
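
A condensed sketch of this correspondence search is shown below. The original implementation uses the Point Cloud Library in C++; scipy is used here purely for illustration, and the 5 cm threshold is an assumed value, not the one used in the paper.

import numpy as np
from scipy.spatial import cKDTree

def ground_truth_prob_map(gt_points, tof_cloud_hw3, max_dist=0.05):
    """Build a tracking probability map from ground truth correspondences.

    gt_points     : (N, 3) downsampled ground truth point cloud
    tof_cloud_hw3 : (H, W, 3) organized ToF point cloud
    max_dist      : correspondence threshold in metres (assumed value)
    """
    h, w, _ = tof_cloud_hw3.shape
    prob_map = np.zeros((h, w), dtype=np.float32)

    tree = cKDTree(gt_points)
    tof_points = tof_cloud_hw3.reshape(-1, 3)
    valid = np.isfinite(tof_points).all(axis=1)

    # Nearest-neighbour query; distances above max_dist come back as inf
    dist, _ = tree.query(tof_points[valid], k=1, distance_upper_bound=max_dist)
    hits = np.zeros(len(tof_points), dtype=bool)
    hits[np.flatnonzero(valid)[np.isfinite(dist)]] = True

    prob_map.reshape(-1)[hits] = 1.0   # full certainty for matched pixels
    return prob_map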

Propagation of the tracking probability map from ToF frame \(F_t\) to the subsequent frame \(F_{t+1}\) is performed using the flow fields associated with each ToF frame: Using the flow field, each pixel \(p_{i,t}\) with a positive probability value is projected onto the tracking probability map of frame \(F_{t+1}\). As its new coordinates \((x_{i,t+1}, y_{i,t+1})\) are in general not whole-numbered, the probability value associated with \(p_{i,t}\) is distributed onto the four adjacent pixels \(p_{j_1,t+1}, \ldots, p_{j_4,t+1}\) based on their L2 distance to the new position, provided that these pixels lie inside the image region.
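
The propagation step can be sketched as follows; bilinear weights are used here as one concrete realization of the distance-based distribution onto the four adjacent pixels (the exact weighting in the original implementation may differ).

import numpy as np

def propagate_prob_map(prob_map, flow):
    """Forward-propagate a tracking probability map using a dense flow field
    (Sect. 3.4). Each pixel's probability is splatted onto the four integer
    pixels surrounding its flow target, weighted bilinearly.

    prob_map : (H, W) float map of frame t
    flow     : (H, W, 2) flow field from frame t to frame t+1
    """
    h, w = prob_map.shape
    out = np.zeros_like(prob_map)

    ys, xs = np.nonzero(prob_map > 0)
    for y, x in zip(ys, xs):
        p = prob_map[y, x]
        nx, ny = x + flow[y, x, 0], y + flow[y, x, 1]
        x0, y0 = int(np.floor(nx)), int(np.floor(ny))
        fx, fy = nx - x0, ny - y0
        # distribute onto the four adjacent pixels
        for dy, dx, wgt in ((0, 0, (1 - fx) * (1 - fy)),
                            (0, 1, fx * (1 - fy)),
                            (1, 0, (1 - fx) * fy),
                            (1, 1, fx * fy)):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < h and 0 <= xx < w:    # stay inside the image
                out[yy, xx] += p * wgt
    return np.clip(out, 0.0, 1.0)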

In addition to populating the tracking probability map, the current total number of tracked targets is determined based on the ground truth frame and stored as part of the global tracking state.

3.5 Tracking Estimation

At the arrival of each new ToF frame, a tracking probability map is calculated that provides a first estimation of which points in the point cloud correspond to the tracked human. However, this estimation has to be refined due to potential errors introduced by the flow field based propagation of the tracking probability. In our experience, human extremities such as arms are especially prone to misdetection during optical flow propagation with low-resolution ToF cameras (false negatives). Also, tracking probabilities might be erroneously associated with non-tracked objects in the surrounding environment (false positives).

For this reason, the tracking estimation step is split into two stages: a tracking refinement stage and an outlier rejection stage.

Tracking Refinement Stage. The tracking refinement stage is primarily targeted at correcting false negative detections, e.g. non-detected extremities. The tracking probability map is first binarized by comparison against a pre-defined threshold and then segmented into connected probable tracking regions \(r_i\). For each region, the center of mass \(m_i\) is calculated. Using \(m_i\) as a seed, a floodfill operation is performed on the associated depth image in order to connect previously undetected pixels with local continuity in 3D space. The result is a refined tracking estimate \(r_i'\) for each connected region.
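
A compact sketch of this stage using OpenCV primitives is given below; the probability threshold and the depth tolerance are illustrative values, not those used in the paper, and the depth image is assumed to be single-channel 8 bit (OpenCV's flood fill does not operate on 16-bit images).

import cv2
import numpy as np

def refine_tracking(prob_map, depth_img, prob_thresh=0.2, depth_tol=40):
    """Binarize the probability map, segment connected regions, and grow
    each region on the depth image with a flood fill seeded at its centre
    of mass (Sect. 3.5). Returns one refined binary mask per region."""
    binary = (prob_map >= prob_thresh).astype(np.uint8)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)

    refined = []
    for i in range(1, n):                       # label 0 is the background
        cx, cy = centroids[i]
        seed = (int(round(cx)), int(round(cy)))
        mask = np.zeros((depth_img.shape[0] + 2, depth_img.shape[1] + 2),
                        np.uint8)
        # Grow over pixels that are locally continuous in depth
        cv2.floodFill(depth_img.copy(), mask, seed, 255,
                      loDiff=depth_tol, upDiff=depth_tol,
                      flags=4 | cv2.FLOODFILL_MASK_ONLY | (255 << 8))
        refined.append(mask[1:-1, 1:-1] > 0)
    return refined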

Outlier Rejection Stage. While false negative detections have been resolved in the previous stage, there is still a possibility for false positive detections to be present due to erroneous propagation of the tracking probability map onto untracked pixels. To reject these outliers, the current number of probable tracking regions is first checked against the number of tracked targets (see Sect. 3.4). If there are more regions than tracked targets, we perform a similarity comparison between each tracked region \(r'_{j,t-1}\) of the last frame and all current probable tracked regions \(r'_{i,t}\) in order to detect the correct correspondences. The similarity comparison is based on both 2D similarity metrics (e.g. 2D center location and area of a region) and 3D similarity metrics (e.g. Euclidean distance between the center points in 3D space). For each region \(r'_{j,t-1}\) of the previous frame, the best matching region \(r'_{i,t}\) is determined and its features are stored as detected tracked regions in the current ToF frame. In order to avoid merging of multiple regions \(r'_{j,t-1}\) onto a single region \(r'_{i,t}\), regions \(r'_{i,t}\) are exempt from further similarity comparisons once they have been successfully matched.
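
The greedy matching described above can be sketched as follows; the region feature names and the weighting of the individual similarity terms are assumptions made only for illustration.

import numpy as np

def match_regions(prev_regions, curr_regions, n_targets):
    """Sketch of the outlier rejection stage in Sect. 3.5. Each region is a
    dict with keys 'center2d', 'area', 'center3d' (hypothetical features).
    Returns the subset of curr_regions kept as tracked regions."""
    if len(curr_regions) <= n_targets:
        return list(curr_regions)

    def dissimilarity(a, b):
        d2d = np.linalg.norm(np.subtract(a['center2d'], b['center2d']))
        darea = abs(a['area'] - b['area']) / max(a['area'], b['area'])
        d3d = np.linalg.norm(np.subtract(a['center3d'], b['center3d']))
        return d2d + 100.0 * darea + 50.0 * d3d   # assumed weighting

    kept, available = [], list(range(len(curr_regions)))
    for prev in prev_regions:
        if not available:
            break
        # best match among regions not yet claimed (prevents merging)
        best = min(available, key=lambda i: dissimilarity(prev, curr_regions[i]))
        kept.append(curr_regions[best])
        available.remove(best)
    return kept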

As a last step, for each detected tracked region all corresponding points in the ToF frame cloud are selected. This results in the full body point cloud of the respective tracked human being available for further processing.

4 Results

The developed algorithm has been evaluated in the three scenarios presented in Sects. 2.1–2.3. For both the latency minimization and the frame rate optimization scenario, evaluation was performed by comparing the extended tracking map, which is calculated immediately on the arrival of each new ToF frame, against the corresponding ground truth, which becomes available with a certain delay. This also means that only frames for which a corresponding ground truth was received are taken into account. For the third scenario, which is the reconstruction of lost tracking (gaps in the ground truth), the evaluation method and results are described in Sect. 4.3.

All tests were performed under Linux Ubuntu 12.04 using an AMD Phenom II 1090T processor with six cores at 3.2 GHz and 12 GB of RAM. All cameras have been registered against an optical tracking system.

Table 1 lists the metrics employed for accuracy evaluation.

Table 1 Metrics for accuracy evaluation
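
In our reading, the point-wise precision and recall referred to throughout the evaluation follow the standard definitions
\[ \mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \]
where TP denotes points correctly classified as belonging to the tracked human, FP denotes points erroneously classified as such, and FN denotes ground truth points that were missed.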

4.1 Latency Minimization

For the latency minimization scenario, evaluation was performed on three recorded data sets. Set A has a duration of 53.5 s and contains 317 ToF frames and 265 ground truth frames. The cameras are mounted 31.2 cm apart and share the same field of view. The desired latency for evaluation was artificially introduced by playing back the Kinect data with a delay between 1 and 10 s. The average processing time per ToF frame was 39 ms, independently of the induced delay.

In set A, the tracked person enters the field of view twice. To allow for a detailed examination, evaluation has been performed on two different subsets of the measurements: A1 takes into account all frames of each measurement, A2 includes only the frames in which recall and precision were positive, i.e. in which tracking was actually performed. As a consequence, subset A1 is directly influenced by the delay of the ground truth: On entry of a person into the field of view, there is no ground truth available until the delayed ground truth is received. A higher delay therefore directly results in more frames in which no forward propagation happens and no tracking is performed, which in turn leads to a higher rate of false negative classifications and thereby a lower recall.

Fig. 5 Ground truth processing time (shown for subset A1 only)

Fig. 6 Number of false negative classifications (left) and false positive classifications (right)

In all following figures, obtained results are shown over the respective delay; the continuous line corresponds to subset A1 whereas the dotted line corresponds to subset A2. All reported results are averaged over all frames of each measurement.

Figure 5 shows the ground truth processing time. Figure 6 shows the numbers of false negative and false positive classifications. Figure 7 shows the resulting precision of the tracking estimate and the achieved recall of the tracking estimate.

Set B was recorded with the aim of evaluating the proposed algorithm in terms of robustness against data acquired from different points of view.

Fig. 7 Precision of the tracking estimate (left) and recall of the tracking estimate (right)

Fig. 8 Schematic top-down view of the spatial camera configuration: Six ToF cameras (small rectangles) and four Kinect cameras (trapezoids) are mounted on a rectangular rig to supervise the surroundings of an operating table (large rectangle)

Table 2 Spatial configuration and accuracy evaluation for six ToF cameras with different points of view compared to the Kinect camera and latency of 1 s

It contains data from six ToF cameras that are ceiling-mounted at the four corners and on the sides of a rectangle of about 2 m \(\times \) 2 m (see Fig. 8). A Kinect camera mounted in one of the corners is used as ground truth. Set B has a duration of 85 s and contains approximately 230 ToF frames per camera and 294 ground truth frames. Again, the results are split into two subsets B1 and B2, where B2 only contains frames in which a detection was performed. Further information about the spatial relation between each ToF camera and the Kinect camera as well as the achieved results (recall and precision) for both subsets B1 and B2 is shown in Table 2.

Figure 9 shows a side-by-side exemplary view of the point cloud of a single ToF camera with the delayed ground truth (left) and the pre-calculated tracking estimate for this scenario (right). All 3D points that are pre-calculated as corresponding to the tracked human have been omitted from the scene for better visibility.

4.2 Frame Rate Optimization

In contrast to the camera system used in the scenario above, which has already been well tested and optimized, e.g. with regard to crosstalk between the different cameras illuminating the scene with infrared light, the combination of the Kinect II with the Argos 3D P100 is employed as a proof of concept for the purpose of evaluating the presented algorithm. Currently, the maximum frame rate of 160 fps for the Argos camera can only be achieved with a low integration time that drastically decreases the sensing range of the camera. As a compromise, we operated the camera at 80 fps, which yielded an acceptable sensing range for objects with medium to high reflectivity (e.g. people wearing white clothes). In addition, we observed infrequent crosstalk. Figure 10 shows the pre-calculation with the Argos 3D P100.

Fig. 9 Delayed ground truth (left scene) and pre-calculated tracking estimate (right scene) in the latency minimization scenario

Fig. 10 Delayed ground truth (left scene) and pre-calculated tracking estimate (right scene) in the frame rate optimization scenario

Evaluation was performed using four different data sets with lengths between 30 and 68 s. Each data set contains at least 2300 frames acquired by the ToF camera and 600 frames taken by the Kinect II. Again, the measurements were split into subsets D1 and D2 as before.

As the processing of each ToF frame took more than 230 ms on average, which resulted in dropped frames, we slowed down the playback of the recorded data by a factor of 10. Scaled back to real time, this corresponds to a processing time of about 20 ms and can serve as an indication of the potential accuracy of the algorithm. Table 3 lists the resulting accuracy metrics.

Table 3 Accuracy evaluation for high frame rate ToF at normal and reduced speed
Table 4 Classifications for manual annotation of ground truth quality

4.3 Tracking Reconstruction

Set E with a length of 151 s represents the full configuration of the camera system. It contains 926 ToF frames for each of the six ToF cameras and 453 (460) point clouds representing the ground truth. One person was performing different tasks on both sides of the OR table.

In this scenario, no effective time delay was introduced for the fused ground truth. For each ToF frame, the correctness of the corresponding ground truth was manually annotated using the classification shown in Table 4.

In contrast to set B, the full human tracking Kinect system has been used as ground truth. Figure 11 shows an overview of the data flow for set E.

As can be seen from the spatial setup of the cameras as depicted in Fig. 8, for ToF cameras 1, 2, 3 and 4 there is a Kinect camera located close by with a similar angle of view. Each camera only supervises a part of the scene: Even-numbered cameras and odd-numbered cameras acquire different sides of the OR table in the OP:Sense setup. For this evaluation, the results of camera ToF 1, ToF 2 and ToF 3 are used as these represent the three different cases (ToF 1 and ToF 2 look at different sides of the OR table; ToF 3 does not have a spatially close Kinect camera).

In the evaluation data set, there is frequent loss of tracking by the Kinect cameras. In some instances, this has been intentionally caused by the tracked person standing still, which is a common cause of failure for tracking algorithms that rely on detecting motion in their first stage.

In contrast to the previous Sects. 4.1 and 4.2, where the accuracy of the pre-calculation was evaluated against a known ground truth, the evaluation here takes into account all frames, including those where no ground truth was available. This is necessary for evaluating the capability to fill gaps in the ground truth, i.e. to continue tracking the target even when the ground truth tracking fails.

Fig. 11 Data flow for the tracking reconstruction scenario: Four Kinect cameras supervise the scene and independently perform human tracking. Their fused output is used as ground truth for six parallel instances of the model-free pre-calculation algorithm, each based on scene data acquired by one ToF camera. The results are then fused into a point cloud containing the final pre-calculation

To determine the accuracy, recall and precision were calculated. Calculations only took into account the frames where a valid ground truth was available and tracking was performed. Tables 5 and 6 show the achieved results when taking into account either all frames where ground truth was not classified as “loss” or all frames where ground truth was classified as “correct”.

Table 5 Accuracy evaluation for tracking reconstruction scenario based on all frames where tracking was performed
Table 6 Accuracy evaluation for tracking reconstruction scenario based on frames with correct ground truth only

Figures 12, 13 and 14 show graphs of the tracking results obtained in the evaluation as bar/line diagrams with the number of tracked pixels on the x-axis and the frame number on the y-axis. The bars represent the ground truth as obtained by the Kinect camera; for better visibility, bars have been plotted without interleaving intervals. The line represents the result of the pre-calculation algorithm. For a better understanding of the graphs, the line is plotted solidly where the ground truth was labelled as tracking the human; where the ground truth incorrectly failed to track (loss of tracking), the line representing the result of the pre-calculation algorithm is plotted as points.

Fig. 12 Tracking reconstruction results for camera 1

Fig. 13 Tracking reconstruction results for camera 2

Fig. 14 Tracking reconstruction results for camera 3

5 Discussion

For the latency minimization scenario, Fig. 5 shows that the ground truth processing time starts at 47 ms for a delay of 1 s and increases with longer delays. This corresponds to a first processing step of about 45 ms, in which the transformation of the ground truth cloud and the correspondence calculation are performed, followed by the forward propagation of the ground truth, which takes about 1.7 ms per second of delay and is therefore also applicable to longer delays.

The total latency of the pre-calculated tracking can be calculated as the sum of the latency of the ToF cameras in the six-camera setup of about 240 ms and the ToF frame processing time of 39 ms. The resulting total latency of less than 300 ms is independent of the induced delay, so the observed speedup of the tracking is between 3\(\times \) and 33\(\times \) for a respective delay of 1–10 s.
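
This speedup range follows directly from relating the induced delay of the ground truth to the roughly constant pre-calculation latency of about 0.3 s:
\[ \text{speedup} = \frac{\text{ground truth delay}}{\text{pre-calculation latency}}, \qquad \frac{1\,\mathrm{s}}{0.3\,\mathrm{s}} \approx 3\times, \quad \frac{10\,\mathrm{s}}{0.3\,\mathrm{s}} \approx 33\times. \]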

As expected, the number of false negative classifications as depicted in Fig. 6 is approximately proportional to the induced delay for the subset A1 (see Sect. 4.1). For subset A2, from which frames without a ground truth were excluded, the number of false negative classifications was negligible and clearly independent of the delay.

These results lead to a high precision (see Fig. 7), i.e. close to no points are erroneously classified as belonging to the tracked human. For subset A1, recall is again proportional to the delay, as with a higher delay there is no ground truth for a large number of frames. If only frames for which a ground truth was available during the measurement are taken into account (subset A2), recall is close to 1, which means that almost all points that belong to the tracked human have been classified as such (see Fig. 9).

Measurements with six ToF cameras show that the proposed algorithm also achieves good results for different camera configurations, i.e. when the ToF camera and the Kinect camera are not mounted with a similar point of view, as can be seen from Table 2. Subset B1 shows lower recall than subset B2 because, with different fields of view, the tracked human is often not visible in both cameras at once, so no correspondences can be established. For the six-camera scenario specifically, we expect to solve this by utilizing the fused output of four spatially distributed Kinect cameras as ground truth.

The rather long processing time when using the Argos 3D P100 camera is consistent with the timings measured for the pmd[vision] S3 cameras: The Argos 3D P100 nominally provides about six times more points per frame, for which correspondences have to be determined, which leads to an increase in processing time from 46 ms to about 230 ms. However, this calculation is currently performed on the CPU in a single thread, so we expect to achieve a large speedup by parallelizing on the CPU and/or GPU. Further optimizations of the frame rate and image quality are expected from using a different high-speed ToF camera, the upcoming Argos 3D P320, which features 12 instead of 2 LEDs for illumination and thereby increases the effective sensing range.

For the tracking reconstruction scenario, the results show a good fit between the ground truth that was classified as “correct” and the corresponding result of our model-free pre-calculation tracking algorithm. It is also clearly visible that the tracking continued when the ground truth was lost.

Taking into account the tracking results depicted in Figs. 12, 13 and 14, one can further see that the tracking results of the ToF cameras complement each other: For example, when the tracked human is not visible for cameras ToF 1 and ToF 3, ToF 2 performs the pre-calculation. This is the desired behavior given the spatial camera setup of OP:Sense (as shown in Fig. 8).

In terms of accuracy, there is again a high recall and precision, as given in Tables 5 and 6. While both metrics yield slightly higher results when only “correct” ground truth is taken into account, the difference is negligible. It can also be seen that the position of the ToF camera does not affect the pre-calculation results.

6 Conclusions

We have proposed a new approach for pre-calculating the body point cloud of a human using a model-free algorithm which is based on time-delayed ground truth. It features two distinct processing pipelines: One pipeline processes the ground truth, which corresponds to a past measurement frame, and propagates it forward to the current frame. The other pipeline handles the incoming data from the faster 3D camera system and calculates a tracking estimate based on 2D optical flow in combination with a customized background model and various refinement steps.

The algorithm has been implemented and evaluated on three different scenarios. Results for the latency minimization scenario show that the presented approach consistently achieves very good results for the evaluated data sets. The distinction between two different subsets for each evaluation shows that, apart from the initial delay until tracking is established, the magnitude of the latency does not affect the high tracking quality of the algorithm. While still good, the accuracy of the second scenario is lower than that of the first scenario, and the current processing time prohibits its intended usage. For this reason, optimization of the algorithm in terms of computational costs and the optimization of our test bed for the second scenario will be addressed as detailed above. The results obtained for the third scenario show that using the proposed algorithm with multiple cameras, both for providing the ground truth and for pre-calculating the tracking, gives an accurate and robust tracking estimation. It is also capable of covering gaps in the ground truth, i.e. it continues tracking even when the ground truth tracking is lost.

As next steps, we aim to apply the developed algorithm to a number of use cases in human-robot-interaction, e.g. camera-based switching of robot control modes. We also intend to apply the algorithm to other kinds of tracking scenarios using different input modalities.