1 Introduction

Autonomous robot exploration and 3D reconstruction of a scene may be very time-consuming if not guided by active perception, even in tabletop scenarios. An active perception behavior usually drives the robot by computing the Next Best View (NBV) to observe the most relevant areas of the environment, given the data acquired so far. Traditional NBV algorithms attempt to maximize the information gain by exploring unknown or incomplete parts of the scene. However, a straightforward maximization of the volume of the unknown space may not be the proper strategy, as the robot may prioritize large occluded areas that do not contain any object of interest. Moreover, NBV planning is usually performed by constraining the viewpoint to lie on a viewing sphere around the object, but the location of the objects may be unknown in advance.

This paper proposes a novel approach for NBV planning of a robot arm equipped with an eye-in-hand range sensor in a tabletop scenario. The robot gives precedence to the exploration of the objects in the scene without any prior knowledge about their shape and position. Such a non-model-based approach is achieved by applying a point cloud segmentation algorithm to the sensor data and then assigning a saliency value to each segment. The NBV system prioritizes viewpoints that observe the segment with the highest saliency. We show that, after point cloud segmentation, a simple heuristic can be adopted to identify meaningful segments that belong to the objects. In particular, point cloud segmentation is performed with the Locally Convex Connected Patches (LCCP) method by Stein et al. (2014), which is available in the PCL library. The exploitation of point cloud segmentation for active scene exploration has been considered in only a few previous works.

A further contribution is the computation of the NBV on the GPU through a modified version of KinFu. The KinFu Large Scale (KinFu LS) project is an open source implementation of KinectFusion (Newcombe et al. 2011) in the PCL library. The system exploits the GPU NBV algorithm developed in Monica et al. (2016), and the environment is modeled as a volumetric 3D voxel grid on the GPU using a Truncated Signed Distance Function (TSDF). It is also shown how viewpoint directions can be extracted directly from the KinFu TSDF volume, using a local contour extraction algorithm. The proposed approach is fully autonomous: it only requires an initial short scan of the environment, from one side, and it does not assume the existence of a dominant plane in the scene. In the experimental setup the robot arm is equipped with a Kinect V2 sensor. To the best of our knowledge, this is the first work that reports the use of Kinect V2 for NBV planning. Kinect V2 has a higher resolution and a wider field of view than Kinect V1. Moreover, Kinect V2 has proven to be two times more accurate in the near range and it is more robust to artificial illumination. A novel procedure has also been developed for Kinect V2 depth image pre-processing. Experiments in environments with multiple complex objects show that the system is able to reconstruct the scene around the objects faster than a traditional NBV planner that maximizes the volume of the unknown space.

The paper is organized as follows. Section 2 provides an overview of the state of the art. Section 3 describes the proposed active perception system. Section 4 illustrates the experimental results. Section 5 concludes the paper and provides suggestions for possible extensions.

2 Related work

The two closest works to ours that considered NBV planning from point cloud segmentation are Wu et al. (2015) and Xu et al. (2015). Both methods were evaluated in scenes with few or no stacked objects. In Wu et al. (2015) an active object recognition system was proposed for a mobile robot. A feature-based model was used to compute the NBV in 2D space by predicting both visibility and likelihood of feature matching. Experiments were reported with box-shaped objects where the mobile robot was not autonomous but was manually placed as dictated by the NBV algorithm. The main differences are that our work focuses on an autonomous robot arm and that the objects have more complex shapes. In Xu et al. (2015) a graph-cut object segmentation is performed on an initial robot scan, acquired with a Kinect V1 through KinectFusion. Then, the PR2 robot performs proactive exploration to validate the object-aware segmentation by combining next-best push planning and NBV planning. NBV planning is performed only on pushed objects for scan refinement. The procedure for robot motion planning is not described. Another difference is that in our work the NBV is computed on the GPU.

The most common assumption for NBV planning is to determine the optimal placement of the eye-in-hand sensor on a viewing sphere around a target location. An objective function is usually chosen which maximizes the unknown volume as proposed by Connolly (1985) and Banta et al. (2000). Pito (1999) used a turntable and ensured an overlap among consecutive views. In Reed and Allen (2000) and Vasquez-Gomez et al. (2009) sensor constraints were included to minimize the distance traveled by the robot, but the methods were evaluated in simulation. In Yu and Gupta (2004) NBV was aimed at reducing ignorance of the configuration space of the robot. Potthast and Sukhatme (2014) proposed a customizable framework for NBV planning in cluttered environments where a PR2 robot estimates the visibility of occluded space using a probabilistic approach. Kahn et al. (2015) presented a method to plan the motion of the sensor to enable robot grasping by looking for object handles lying within occluded regions of the environment.

Several NBV approaches assume that the location of the target object is known and do not cope with the problem of detecting the most relevant regions of the environment to be explored. Indeed, in Torabi and Gupta (2010), Kriegel et al. (2012), Foix et al. (2010), Morooka et al. (1998), Li and Liu (2005) and Walck and Drouin (2010) a single object in the environment was considered. Another, less sophisticated, strategy is to adopt a turntable to rotate the object observed from a fixed sensor. In Kriegel et al. (2012) a next-best scan planner was proposed for a laser stripe profiler aimed at maximizing the quality of the reconstruction. In Kriegel et al. (2011) a non-model-based approach was introduced for NBV that uses the boundaries of the scan and estimates the surface trend of the unknown area beside the boundaries. Some authors have addressed the NBV problem assuming a simple geometry of the objects to be scanned (Chen and Li 2005), or adopting simple parametric models like superquadrics (Whaite and Ferrie 1997). In Welke et al. (2010) and Tsuda et al. (2012) approaches were developed for humanoid active perception of a grasped object.

In model-based approaches the environment is actively explored to discover the location of objects of interest whose template or class is known in advance (Kriegel et al. 2013; Atanasov et al. 2014; Stampfer et al. 2012; Patten et al. 2016). Kriegel et al. (2013) presented an exploration system, combining different sensors, for tabletop scenes that supports NBV planning and object recognition. In Atanasov et al. (2014) an active hypothesis testing problem was solved using a point-based approximate partially observable Markov decision process algorithm. Stampfer et al. (2012) performed active object recognition enriched with common features like text and barcode labels. In Patten et al. (2016) a viewpoint evaluation method was proposed for online active object classification that predicts which points of an object would be visible from a particular viewpoint given the previous observation of other nearby objects.

As mentioned above, active exploration strategies have also been proposed where the robot interacts with the environment by pushing the objects (van Hoof et al. 2014; Xu et al. 2015). In van Hoof et al. (2014) the robot autonomously touched the objects to resolve segmentation ambiguities using a probabilistic model. However, NBV planning was not considered. Beale et al. (2011) exploited the correlation between robot and object motion data to improve segmentation.

Several works have addressed the problem of change detection for scene reconstruction using attention-based approaches. Most attention-based approaches do not consider active exploration using NBV sensor planning. Herbst et al. (2014) presented a method for online 3D object segmentation and mapping from recordings of the same scene at several times. Attention-based systems have been proposed to direct the gaze of humanoid robots or stereo heads toward relevant locations. Bottom-up saliency maps based on blobs of uniform color were used in Orabona et al. (2005). In mobile robotics, attention-driven methods have been investigated to maintain a consistent representation of the environment as the robot moves (Finman et al. 2013; Drews et al. 2013). In Finman et al. (2013) segmentations of objects were automatically learned from dense RGB-D mapping. A method for novelty detection based on Gaussian mixture models from laser scan data was introduced in Drews et al. (2013).

3 Proposed method for next-best view planning

In traditional non-model-based approaches, next-best view planning is performed in two phases. In the first phase, candidate view poses are generated. In the second phase, all the poses are evaluated according to a score function to find the next-best view pose. The proposed pipeline to compute the NBV, illustrated in Fig. 1, differs from traditional approaches as it introduces an intermediate phase between viewpoint generation and evaluation. In the intermediate phase the input point cloud is segmented into clusters and a saliency value is computed for each point cloud segment. The aim of the point cloud segmentation phase is to automatically detect segments that belong to the objects of the scene. In the evaluation phase potential view poses are associated with point cloud segments, and the NBV is searched for among view poses in decreasing order of segment saliency.

A more detailed overview of the proposed view planning pipeline is reported in Algorithm 1. The view generation phase is performed by a contour extraction algorithm (line 1), detailed in Sect. 3.1, which extracts contour points, i.e. points at the border of incomplete surfaces. Contour extraction also produces a view direction for each contour point. Then, from each view direction multiple view poses are generated, as shown in Fig. 2, mainly to increase the probability of finding a reachable pose for the robot manipulator. In particular, for each direction four additional view directions towards the same contour point are sampled within a small solid angle (\(15^\circ \)). To convert each view direction into a pose for the sensor, a distance from the contour point must be selected, compatible with the sensor minimum and maximum sensing distance. Although view poses may be generated by selecting multiple distances, in this work a fixed distance of 80 cm was adopted, which was empirically determined by evaluating the average maximum distance that the robot is able to reach from the objects in the current experimental setup. In addition, a rotation angle around the view direction must be chosen. Eight samples for each view direction are generated at \(45^\circ \) intervals, starting from an arbitrary initial orientation.
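
The pose generation step can be illustrated with the following sketch (not the authors' implementation): it keeps the original view direction, adds four directions tilted on a cone of half-angle 15°, and samples eight roll angles at 45° intervals at the fixed 0.8 m stand-off, yielding 40 poses. The convention that the given direction points from the contour point towards free space, and the helper names, are assumptions made here for illustration.

```python
import numpy as np

def make_frame(z_axis):
    """Orthonormal rotation matrix whose third column is the (normalized) z_axis."""
    z = z_axis / np.linalg.norm(z_axis)
    helper = np.array([1.0, 0.0, 0.0]) if abs(z[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    x = np.cross(helper, z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    return np.column_stack([x, y, z])

def generate_view_poses(contour_point, outward_dir, distance=0.8,
                        cone_half_angle=np.deg2rad(15.0), n_extra_dirs=4, n_rolls=8):
    """Return a list of (R, t) sensor poses observing `contour_point`.

    `outward_dir` is assumed to point from the contour point towards free space;
    the sensor is placed at contour_point + distance * direction with its optical
    (z) axis looking back at the contour point.
    """
    dirs = [outward_dir / np.linalg.norm(outward_dir)]
    base = make_frame(outward_dir)
    for k in range(n_extra_dirs):          # 4 extra directions on the 15 degree cone
        phi = 2.0 * np.pi * k / n_extra_dirs
        local = np.array([np.sin(cone_half_angle) * np.cos(phi),
                          np.sin(cone_half_angle) * np.sin(phi),
                          np.cos(cone_half_angle)])
        dirs.append(base @ local)
    poses = []
    for d in dirs:
        position = contour_point + distance * d
        look = make_frame(-d)               # optical axis points back at the contour point
        for r in range(n_rolls):            # 8 roll samples at 45 degree intervals
            roll = np.deg2rad(45.0) * r
            c, s = np.cos(roll), np.sin(roll)
            r_roll = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
            poses.append((look @ r_roll, position))
    return poses  # 5 directions x 8 rolls = 40 candidate poses
```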

Fig. 1 Pipeline of the view planning algorithm. The grey background highlights the intermediate phase

Algorithm 1 View planning pipeline

Fig. 2 Flowchart of viewpoint generation (line 2 in Algorithm 1). Each candidate view direction generates 40 view poses for the sensor

In line 3, a point cloud (PointCloud) is extracted from the TSDF volume using the marching cubes algorithm, already available in KinFu. Marching cubes generates a mesh from the isosurface between positive (empty) and negative (occupied) TSDF voxels. The vertices of the mesh define the point cloud. In the segmentation phase (line 4) the point cloud is segmented using the LCCP algorithm. Then, a saliency value is computed for each segment (line 5), as described in Sect. 3.2. Finally, the segments are ordered by decreasing saliency (line 6).

In the viewpoint evaluation phase (lines 7–17) view poses are associated with segments and are processed by decreasing segment saliency. In particular, all contour points close to the current segment are determined (line 8). Given the set \(P_C \equiv PointCloud\) of all points in all segments, a contour point p is close to the current segment S if the nearest point to p in \(P_C\) belongs to S. All view poses generated by the contour points of the current segment are then retrieved (line 9). View poses associated with a segment are evaluated by assigning a score proportional to the expected information gain, as in traditional NBV approaches. Indeed, the expected information gain of each view pose is given by the amount of unknown voxels visible from that pose, which is available from the TSDF volume (Monica et al. 2016). A voxel contributes to the score only if it falls inside a sphere whose radius is 20 cm larger than that of the bounding sphere of the segment.

View poses associated with the current point cloud segment are then ranked and processed in decreasing order of score. If the expected information gain of a view pose exceeds a threshold value (line 13), that pose is considered the NBV. Otherwise, if the expected information gains of all the view poses of the current segment are below the threshold, the algorithm moves on to evaluate the view poses of the next most salient segment. In summary, the proposed procedure gives priority to the active exploration of salient segments of unknown, not fully reconstructed, objects, rather than favoring viewpoints that blindly try to minimize the size of the unknown space.
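
As a summary, a compact sketch of the overall loop is given below. It is a reconstruction from the textual description of Algorithm 1, not the authors' GPU implementation; the helper callables (contour extraction, pose generation, marching cubes, LCCP segmentation, saliency, expected gain) are passed in as parameters and their names are illustrative.

```python
from typing import Callable, List, Optional, Sequence
import numpy as np

def nearest_segment_index(point: np.ndarray, segments: Sequence[np.ndarray]) -> int:
    """Index of the segment containing the cloud point closest to `point`."""
    best, best_dist = -1, np.inf
    for i, seg in enumerate(segments):
        d = np.min(np.linalg.norm(seg - point, axis=1))
        if d < best_dist:
            best, best_dist = i, d
    return best

def plan_next_best_view(
    tsdf_volume,
    extract_contours: Callable,      # line 1: contour points and view directions
    generate_view_poses: Callable,   # line 2: 40 candidate poses per contour point
    extract_point_cloud: Callable,   # line 3: marching cubes on the TSDF
    segment_cloud: Callable,         # line 4: LCCP segmentation
    compute_saliency: Callable,      # line 5: saliency of a segment
    expected_gain: Callable,         # unknown voxels visible from a pose, near a segment
    gain_threshold: float,
) -> Optional[object]:
    contour_points, view_dirs = extract_contours(tsdf_volume)                  # line 1
    poses_per_contour = [generate_view_poses(p, d)
                         for p, d in zip(contour_points, view_dirs)]           # line 2
    cloud = extract_point_cloud(tsdf_volume)                                   # line 3
    segments: List[np.ndarray] = segment_cloud(cloud)                          # line 4
    saliency = [compute_saliency(s) for s in segments]                         # line 5
    for si in np.argsort(saliency)[::-1]:                                      # line 6: decreasing saliency
        segment = segments[si]
        # line 8: contour points whose nearest cloud point belongs to this segment
        close = [i for i, p in enumerate(contour_points)
                 if nearest_segment_index(p, segments) == si]
        # line 9: candidate poses generated from those contour points
        candidates = [pose for i in close for pose in poses_per_contour[i]]
        # rank by expected information gain restricted to the segment neighborhood
        scored = sorted(((expected_gain(pose, segment), pose) for pose in candidates),
                        key=lambda item: item[0], reverse=True)
        if scored and scored[0][0] > gain_threshold:                           # line 13
            return scored[0][1]          # next-best view
    return None                          # no candidate exceeds the threshold
```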

3.1 Contour extraction from TSDF volume

The TSDF volume is a volumetric representation of the environment used by the KinectFusion algorithm. The space is subdivided into a regular 3D grid of voxels and each voxel holds the sampled value \(v(x,y,z)\) of the Truncated Signed Distance Function \(R^3 \rightarrow R\), which describes the signed distance from the nearest surface, clamped between a minimum and a maximum value. The TSDF is positive in empty space and negative in occupied space. Each voxel also contains a weight w, initialized to 0, that counts the number of times the voxel has been observed, up to a maximum amount. The TSDF value v and the weight can be used to distinguish between empty, occupied and unknown voxels as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} w = 0 &{} \rightarrow unknown\; voxel\\ w> 0,\; v \le 0 &{} \rightarrow occupied\; voxel\\ w > 0,\; v > 0 &{} \rightarrow empty\; voxel \end{array}\right. } \end{aligned}$$
(1)

Rarely observed voxels have a low weight, while completely unknown voxels have 0 weight. In unexplored space, or deep inside the surface of objects, voxels are unknown.

In NBV planning a frontier is defined as the region between seen-empty voxels and unknown space. A frontier is a region that can be explored, since the viewing sensor can be placed in the empty space next to the frontier to observe the unknown space. Occupied voxels do not belong to the frontier, since the sensor cannot see through them. However, occupied voxels lying next to a frontier have implications for NBV planning. Indeed, observing the region of space in close proximity to occupied voxels next to a frontier can extend the perception of the surface of the object those occupied voxels belong to.

In the context of this work a contour is defined as the set of empty voxels that are near to occupied voxels next to a frontier, i.e. a contour consists of empty voxels that are near both an occupied voxel and an unknown voxel. To prevent voxels that have only been observed spuriously, due to noise, from being treated as known, a voxel is considered known only if it has been observed at least 5 times, i.e. \(w \ge W_{th}\), where \(W_{th} = 5\) is a lower bound threshold. Given the 6-connected neighborhood \(N^{6}_e\) and the 18-connected neighborhood \(N^{18}_e\) of a voxel at position e, the voxel belongs to a contour if the following conditions hold:

$$\begin{aligned} {\left\{ \begin{array}{ll} w\left( e\right) \ge W_{th} \; \wedge \; v\left( e\right) > 0 \\ \exists \; u \in N^{6}_e \, \mid \, w\left( u\right) < W_{th} \\ \exists \; o \in N^{18}_e \, \mid \, w\left( o\right) \ge W_{th} \; \wedge \; v\left( o\right) \le 0 \end{array}\right. } \end{aligned}$$
(2)

A simplified 2D example is shown in Fig. 3, using the Von Neumann neighborhood (4-connected) and the Moore neighborhood (8-connected) in place of the 6-connected neighborhood \(N^{6}_e\) and the 18-connected neighborhood \(N^{18}_e\) used in the 3D case. In the previous view the sensor observed the object from the right side; the view was therefore partially obstructed and the cells in the lower left part of the image were not observed and are left unknown. The cross marks a computed contour cell.
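
A dense-array sketch of the voxel classification of Eq. (1) and the contour test of Eq. (2) is given below. It is a CPU illustration with numpy, not the GPU KinFu code; the wrap-around of np.roll at the volume borders is ignored for brevity.

```python
import numpy as np

W_TH = 5  # minimum number of observations for a voxel to be considered known

def voxel_state(v, w):
    """Classify voxels per Eq. (1): 0 = unknown, 1 = occupied, 2 = empty."""
    state = np.zeros(v.shape, dtype=np.uint8)
    state[(w > 0) & (v <= 0)] = 1
    state[(w > 0) & (v > 0)] = 2
    return state

def contour_mask(v, w):
    """Boolean mask of contour voxels per Eq. (2)."""
    known = w >= W_TH
    empty_known = known & (v > 0)
    occupied_known = known & (v <= 0)
    unknown_ish = w < W_TH

    near_unknown = np.zeros_like(unknown_ish)
    near_occupied = np.zeros_like(occupied_known)
    shifts6 = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    # 18-neighborhood: offsets with at most two nonzero components (faces and edges)
    shifts18 = [(dx, dy, dz) for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
                if (dx, dy, dz) != (0, 0, 0) and abs(dx) + abs(dy) + abs(dz) <= 2]
    for s in shifts6:     # exists u in N6(e) with w(u) < W_th
        near_unknown |= np.roll(unknown_ish, s, axis=(0, 1, 2))
    for s in shifts18:    # exists o in N18(e) that is known and occupied
        near_occupied |= np.roll(occupied_known, s, axis=(0, 1, 2))
    return empty_known & near_unknown & near_occupied
```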

Fig. 3 A simplified 2D example of the contour extraction algorithm using the Von Neumann neighborhood (4-connected) and the Moore neighborhood (8-connected). In the previous view the sensor observed the object from the right side. A computed contour cell is marked with the cross. The thicker square highlights the Moore neighborhood of the contour cell. The green segment represents a frontier. Known and occupied cells are displayed in red, known and empty cells in white, unknown cells in dark grey (Color figure online)

Given the previous definitions, a method to compute a potential view direction from each contour voxel is described next. For optimal observation, the sensor should observe the object perpendicularly to its surface. Thus, the opposite of the surface normal computed at the occupied voxel next to the contour voxel can be used as a potential view direction. The normal to the surface can be computed from the TSDF volume as the gradient \(\nabla v\left( x,y,z\right) \) of v.

Given a neighborhood \(N_e\) of a voxel at position e, the normal may be approximated as (normalization omitted):

$$\begin{aligned} n_e = \sum _{c \in N_e} v\left( c\right) \cdot \frac{ c - e }{||c - e ||} \end{aligned}$$
(3)

which, for a 6-connected neighborhood, reduces to

$$\begin{aligned} n_e = \left[ \begin{array}{c} v\left( x + 1,y,z\right) - v\left( x - 1,y,z\right) \\ v\left( x,y + 1,z\right) - v\left( x,y - 1,z\right) \\ v\left( x,y,z + 1\right) - v\left( x,y,z - 1\right) \end{array} \right] \end{aligned}$$
(4)

since \(\left( c - e\right) /||c - e ||\) are unit vectors of the coordinate system.
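
For reference, the 6-connected form of Eq. (4) amounts to simple central differences on the TSDF array (a minimal numpy sketch, assuming a dense in-memory volume indexed by voxel coordinates):

```python
import numpy as np

def tsdf_surface_normal(v, x, y, z):
    """Unnormalized surface normal at voxel (x, y, z) per Eq. (4)."""
    return np.array([v[x + 1, y, z] - v[x - 1, y, z],
                     v[x, y + 1, z] - v[x, y - 1, z],
                     v[x, y, z + 1] - v[x, y, z - 1]], dtype=float)
```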

The limitation of this approach is shown in Figs. 4 and 5. In both examples the sensor takes a first observation from the bottom, at position A. The observed volume is displayed in light grey. An object, marked with a dashed line, is partially observed in the red region. The volume behind the object remains unknown (black). The surface normal for the computed contour cell is displayed as a red arrow pointing outside the object. The potential viewing pose (B) generated from the surface normal is shown as the red triangle. In Fig. 4, for a rounded object, the surface normal provides a good direction for the next view. However, for objects with sharp edges (like boxes), as illustrated in Fig. 5, the normal at the contour cell does not provide a suitable view direction since it does not allow the observation of the region of the object in the unknown space behind the edge. Indeed, in this second example at location B the sensor cannot acquire any new information, since the lower plane of the box has already been observed from the initial view.

To overcome this limitation, we propose a method that computes the potential view directions using the normal to the frontier, i.e. the normal to the unknown volume. The viewpoint obtained from the normal to the frontier is indicated as view C. While in Fig. 4, for a rounded object, this viewing pose is rather similar to the one computed from the surface normal, in Fig. 5, for a sharp edge, view C provides a much better view direction to observe the object from the side.

Fig. 4 Generation of the next potential viewpoint for a rounded object. Viewpoints B (computed from the surface normal) and C (computed from the frontier normal) are very similar

Fig. 5 Generation of the next potential viewpoint for an object with sharp edges. Only viewpoint C computed from the frontier normal allows the observation of the unknown volume behind the sharp edge

In this work we use a fast local approach that approximates the normal of the frontier using the gradient of the weight function, \(\nabla w\left( x,y,z\right) \), which can be computed as

$$\begin{aligned} n_e = \sum _{c \in N^{26}_e} w'\left( c\right) \cdot \frac{ c - e }{||c - e ||} \end{aligned}$$
(5)

where the 26-connected neighborhood of a voxel is used to reduce noise and sampling effects.

Since \(w\left( c\right) \) is a non-negative integer value, Eq. 5 uses a modified weight function \(w'\) defined as

$$\begin{aligned} w'\left( c\right) = {\left\{ \begin{array}{ll} -W_{th} &{} \text {if} \,\, c \,\, \text {occupied} \\ \min _{}\left( w\left( c\right) - W_{th},W_{th}\right) &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

For occupied voxels the modified weight \(w'\) is set to \(-W_{th}\), since we want the normal to point away from them. Otherwise, w is first shifted so that it is centered around 0 and then clamped at \(W_{th}\).
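
The frontier normal of Eqs. (5)–(6) can be sketched as follows, again on dense arrays v and w indexed by voxel coordinates (an illustration of the formulas, not the GPU implementation):

```python
import numpy as np

W_TH = 5

def modified_weight(v, w, c):
    """Modified weight w' of Eq. (6) at voxel coordinates c."""
    x, y, z = c
    if w[x, y, z] >= W_TH and v[x, y, z] <= 0:      # known, occupied voxel
        return -W_TH
    return min(w[x, y, z] - W_TH, W_TH)             # centered around 0 and clamped

def frontier_normal(v, w, e):
    """Normalized frontier normal of Eq. (5), summed over the 26-neighborhood of e."""
    ex, ey, ez = e
    n = np.zeros(3)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                if (dx, dy, dz) == (0, 0, 0):
                    continue
                offset = np.array([dx, dy, dz], dtype=float)
                c = (ex + dx, ey + dy, ez + dz)
                n += modified_weight(v, w, c) * offset / np.linalg.norm(offset)
    norm = np.linalg.norm(n)
    return n / norm if norm > 0 else n
```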

In practice, after extraction of all the contour voxels with their view directions (line 1 in Algorithm 1), similar contour voxels are reduced to a single contour point by a region growing algorithm. Two contour voxels at positions \(e_1\) and \(e_2\), with view directions \(n_1\) and \(n_2\), are considered similar if

$$\begin{aligned} {\left\{ \begin{array}{ll} ||e_1 - e_2 ||< D_{th} \\ ||n_1 \times n_2 ||< A_{th} \end{array}\right. } \end{aligned}$$
(7)

Each group of similar voxels is reduced to a single contour point with an associated view direction by averaging the positions and the view directions of the voxels.
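
A simple (quadratic-time) sketch of this reduction step is shown below. The similarity test groups voxels that are close in space and whose view directions differ by a small angle (for unit vectors this is equivalent to a bound on the misalignment between them); the thresholds d_th and a_th are illustrative values, not the ones used by the authors.

```python
import numpy as np

def cluster_contour_voxels(positions, directions, d_th=0.03, a_th=np.deg2rad(20.0)):
    """Greedy region growing over contour voxels; returns averaged points and directions."""
    positions = np.asarray(positions, dtype=float)
    directions = np.asarray(directions, dtype=float)
    n = len(positions)
    labels = -np.ones(n, dtype=int)
    current = 0
    for seed in range(n):
        if labels[seed] >= 0:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:                      # grow the region from the seed
            i = stack.pop()
            for j in range(n):
                if labels[j] >= 0:
                    continue
                close = np.linalg.norm(positions[i] - positions[j]) < d_th
                aligned = np.dot(directions[i], directions[j]) > np.cos(a_th)
                if close and aligned:
                    labels[j] = current
                    stack.append(j)
        current += 1
    # reduce each group to one contour point with an averaged view direction
    points, dirs = [], []
    for c in range(current):
        idx = labels == c
        points.append(positions[idx].mean(axis=0))
        mean_dir = directions[idx].mean(axis=0)
        dirs.append(mean_dir / np.linalg.norm(mean_dir))
    return np.array(points), np.array(dirs)
```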

Figures 6 and 7 show an example of contour extraction and viewpoint computation. In Fig. 6 the sensor observes a jug from the current NBV and a partial 3D representation is produced by KinFu. As shown by the ternary volumetric representation, voxels behind the object remain unknown. Contour voxels are extracted and clustered as illustrated in Fig. 7. The normal vectors point outwards, towards the empty space. Thus, from those directions the robot may be able to observe the unknown space behind the object.

Fig. 6 Left a jug observed from the current sensor viewpoint. Center the 3D mesh reconstructed by KinFu. Right the volumetric representation (rotated view), with occupied (white) and unknown (black) voxels

Fig. 7 Left contour voxels (black) and the contour points (red). Right contour points with normals. A contour point represents a group of similar contour voxels (Color figure online)

3.2 Saliency of point cloud segments

This section illustrates how the segmentation of the point cloud, extracted from the TSDF volume, is performed and how the saliency value of each segment is computed (lines 4–5 in Algorithm 1). The procedure is illustrated in Fig. 8. The point cloud is segmented by the LCCP (Stein et al. 2014) algorithm, available in the PCL library. LCCP partitions the input point cloud into a set of Segments (line 4 in Algorithm 1) by merging patches, called supervoxels, of an over-segmented point cloud. Supervoxels are generated by the Voxel Cloud Connectivity Segmentation (VCCS) algorithm by Papon et al. (2013).

VCCS requires the normals of the point cloud, unless all points are acquired from the same viewpoint, which is not the case in our system. Normal vectors could be computed as the normals to the faces of the mesh extracted by the marching cubes algorithm. However, we obtain the vertex normals with minimal overhead by using the gradient of the TSDF volume, as shown by Eq. 3 in Sect. 3.1, computed on a 6-connected neighborhood, which is already available for marching cubes operations.

The saliency function is a heuristic model that should provide an objectness measure, i.e. it should provide higher values for segments that belong to real objects of the scene. In this work the saliency of each segment is computed as a function of two features: the segment roundness and the degree of isolation.

Fig. 8 Proposed procedure for point cloud segmentation and computation of the saliency value of each segment

The roundness of a segment S is estimated as the ratio between the minimum and the maximum size of the Oriented Bounding Box (OBB) of S. The sizes \(\left( d_1,d_2,d_3\right) \) of the OBB are defined in a local reference frame \(T_{OBB}\) centered at the mean point of the segment, whose axes are given by the eigenvectors of the covariance matrix of the points (principal axes of inertia). More formally,

$$\begin{aligned} \begin{array}{c} d_1 = \max \limits _{c \in S} \left( c'_x\right) - \min \limits _{c \in S} \left( c'_x\right) \\ d_2 = \max \limits _{c \in S} \left( c'_y\right) - \min \limits _{c \in S} \left( c'_y\right) \\ d_3 = \max \limits _{c \in S} \left( c'_z\right) - \min \limits _{c \in S} \left( c'_z\right) \\ \end{array} \end{aligned}$$
(8)

where c is a point of S in the world reference frame and \(c'\) is the transformed point in the local reference frame

$$\begin{aligned} c' = T_{OBB}^{-1} \cdot c \end{aligned}$$
(9)

The minimum and maximum sizes of the OBB of S are then

$$\begin{aligned} \begin{array}{c} d_{max} = \max \limits _{i \in \{1,2,3\}}\left( d_i\right) \\ d_{min} = \min \limits _{i \in \{1,2,3\}}\left( d_i\right) \end{array} \end{aligned}$$
(10)
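
A numpy sketch of this computation (Eqs. (8)–(10)) is given below, using the eigenvectors of the covariance matrix of the segment points as the OBB axes; it is an illustration of the formulas, not the authors' code.

```python
import numpy as np

def obb_roundness(points):
    """Roundness d_min / d_max of a segment given as an (N, 3) array of points."""
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)
    cov = np.cov((pts - mean).T)
    _, eigvec = np.linalg.eigh(cov)              # columns = principal axes (T_OBB rotation)
    local = (pts - mean) @ eigvec                # transform points into the OBB frame
    sizes = local.max(axis=0) - local.min(axis=0)   # d1, d2, d3 of Eq. (8)
    d_min, d_max = sizes.min(), sizes.max()         # Eq. (10)
    return d_min / d_max if d_max > 0 else 0.0
```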

We define the degree of isolation of a segment as the fraction of points for which the distance to points belonging to other segments is at least \(B_{th}\). Given a segment \(S\in Segments\) and the set \(\hat{S}\) of all the points not in S, the degree of isolation of S is given by

$$\begin{aligned} F(S) = \frac{\left\Vert \left\{ c \in S \mid \forall \, o \in \hat{S}\,,\; \Vert c - o \Vert > B_{th} \right\} \right\Vert }{\Vert S \Vert } \end{aligned}$$
(11)

where \(\Vert S \Vert \) is the total number of points in S. Equation 11 can be efficiently computed using a KdTree radius search of radius \(B_{th}\). Feature F has three benefits. First, it rewards isolated segments belonging to partially observed objects, since a large part of their boundary is not shared with any other segment. Second, this heuristic is helpful for noise rejection, as noisy segments, which are not well separated from other segments, often share a large boundary with them. Third, Eq. 11 penalizes small segments.

Finally, the saliency value of a segment S is computed as

$$\begin{aligned} Saliency(S) = F(S) \cdot \frac{d_{min}}{d_{max}} \end{aligned}$$
(12)

so that saliency is proportional to the degree of segment isolation and it increases as the minimum and maximum sizes of the OBB become more similar. An example of a segmented point cloud with saliency values is shown in Fig. 9. It can be noted that the segment isolation factor reduces the saliency value of noisy segments (inside the red ellipse).
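
The isolation and saliency computations (Eqs. (11)–(12)) can be sketched with a SciPy KdTree as follows; `obb_roundness` is the helper from the previous sketch, and the default B_th matches the 0.02 m used in the experiments. This is an illustrative reconstruction, not the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def isolation(segment, others, b_th=0.02):
    """Fraction of points of `segment` farther than b_th from every other segment, Eq. (11)."""
    if len(others) == 0:
        return 1.0
    tree = cKDTree(others)
    # a point is "isolated" if no point of another segment lies within radius b_th
    neighbors = tree.query_ball_point(np.asarray(segment, dtype=float), r=b_th)
    isolated = sum(1 for nb in neighbors if len(nb) == 0)
    return isolated / len(segment)

def saliency(segment, others, b_th=0.02):
    """Saliency of Eq. (12): isolation times OBB roundness."""
    return isolation(segment, np.asarray(others, dtype=float), b_th) * obb_roundness(segment)
```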

Fig. 9 Example of point cloud segmentation and saliency evaluation. Brighter segments have higher saliency value. (1) a picture of the scenario, (2) saliency evaluated by segment roundness alone, (3) saliency evaluated by segment isolation alone, (4) saliency evaluated by both segment roundness and isolation according to Eq. 12. The segment isolation factor reduces the saliency of the noisy segments inside the red ellipse (Color figure online)

Fig. 10 Saliency computed after the initial scan in experiment 2 described in Sect. 4.2, using \(B_{th} \in \{0.005, 0.01, 0.02, 0.05, 0.1\}\) m (from left to right)

Figure 10 shows the effect of \(B_{th}\) on the saliency. As \(B_{th}\) increases, the saliency value of the noisy segments at the front decreases. However, when \(B_{th}\) is too high, all the points of the small segments are rejected and, therefore, small objects assume a zero saliency value (black color). Hence, for the experimental evaluation reported in Sect. 4.2, the value \(B_{th}=0.02\) m was chosen.

3.3 Kinect V2 depth image pre-processing

This section describes a low-level pre-processing filter that improves the quality of Kinect V2 depth data. The Kinect V2 driver (Freenect2) provides two pre-processing filters: a bilateral filter and an edge-aware filter. The proposed filter is executed at the end of the standard filtering pipeline in place of the edge-aware filter, which does not strongly contribute to the removal of invalid points. It is a known issue that Kinect V2 often produces incorrect measurements near the borders of occluded surfaces, as shown in Fig. 11. We are concerned with locating two types of invalid points and removing them from the depth map.

Points visible to the camera but falling in the shadow of an IR emitter have low accuracy. We call these points shadow points. Shadow points are due to the displacement between the IR emitter and the camera (Fig. 12), which is approximately \(\varDelta = 8\) cm. In this work we are less concerned about depth image restoration of the regions that are not directly observed by the camera, as in Liu et al. (2013), because the NBV system usually observes the same region of space from multiple viewpoints and the measured data are merged by KinFu.

Fig. 11 Top left a scene as seen from the sensor. Top right the image from the depth camera. Lower left the point cloud acquired by the sensor, filtered by the Freenect2 driver. Lower right the point cloud filtered by our method; both shadow points and veil points are correctly removed

Fig. 12 The Kinect V2 sensor with IR camera, RGB camera and IR emitters

Fig. 13 The Kinect V2 (on the left) observes a scene composed of an object (in the center) and a background plane (on the right). The object partially occludes the background plane. Three kinds of occlusions are possible: camera only (yellow), IR emitter only (blue), both (red) (Color figure online)

Fig. 14 Illustration of the horizontal angles \(\alpha \) and \(\beta \) of the camera and IR emitter with respect to the observed points

To detect shadow points we look for the occlusion regions where the background is observed by the camera but is not illuminated by the IR emitter (yellow areas). The geometry of the sensor field is illustrated in Fig. 13. Let u and v be the horizontal and vertical image coordinates of the sensor, starting from the upper left corner. Let the intrinsic parameters of the IR camera be defined as follows: \(\left[ f_u,f_v\right] \) the focal lengths, \(\left[ m_u,m_v\right] \) the principal point, \(\left[ u_{max},v_{max}\right] \) the depth image size and \(\left[ \varDelta ,0\right] \) the displacement of the IR emitter from the IR camera, which are aligned horizontally. Given a measured distance \(z_{uv}\) along the sensor axis z at image coordinates \(\left[ u,v\right] \), the coordinates of the measured point referred to the IR camera are given by

$$\begin{aligned} \begin{array}{l} x_{uv} = \frac{u - m_u}{f_u} \cdot z_{uv} \\ y_{uv} = \frac{v - m_v}{f_v} \cdot z_{uv} \end{array} \end{aligned}$$
(13)

while the horizontal angle, shown in Fig. 14, referred to the camera is

$$\begin{aligned} \alpha _{uv} = \text {atan}\left( \frac{x_{uv}}{z_{uv}}\right) + \frac{\pi }{2} = \text {atan}\left( \frac{u - m_u}{f_u}\right) + \frac{\pi }{2} \end{aligned}$$
(14)

which is monotonically increasing with respect to u. However, when referred to the leftmost IR emitter, the x coordinate becomes

$$\begin{aligned} x'_{uv} = \frac{u - m_u}{f_u} \cdot z_{uv} + \varDelta \end{aligned}$$
(15)

and the horizontal angle becomes

$$\begin{aligned} \beta _{uv} = \text {atan}\left( \frac{x'_{uv}}{z_{uv}}\right) + \frac{\pi }{2} \end{aligned}$$
(16)

Unlike \(\alpha _{uv}\), the value of \(\beta _{uv}\) is not monotonically increasing with respect to u. It can be observed that an increase in u which causes a decrease in \(\beta _{uv}\) means that the depth measurement \(z_{uv}\) suddenly increased, i.e. the sensor is no longer observing an occluding object but the object behind it.

Let \(p_j\) be an observed point in the shadow of the IR emitter (yellow area in Fig. 14). Let \(\alpha _j\) be the angle from the camera origin, computed using Eq. 14. There exists a point \(p_i\) on the object along the illumination ray \(\overline{O'p_{j}}\). There also exists a point \(p_k\) inside angle \(\widehat{OO'p_j}\) that belongs to the object; in the limiting case, \(p_k \equiv p_i\). Since \(p_k\) belongs to \(\widehat{OO'p_j}\), then \(\beta _k \ge \beta _j\). Point \(p_k\) also belongs to the interior of angle \(\widehat{O'Op_j}\), as the object does not intersect segment \(\overline{Op_j}\). Then, \(\alpha _k < \alpha _j\). Therefore, a necessary condition for a point \(p_j\) to be in shadow is the existence of a point \(p_k\) that satisfies both \(\beta _k \ge \beta _j\) and \(\alpha _k < \alpha _j\). Thus, a depth measurement is removed if:

$$\begin{aligned} \beta _j \le \max _{k \, \mid \, \alpha _k < \alpha _j} \beta _k \end{aligned}$$
(17)

i.e., since \(\alpha \) is monotonic with respect to u:

$$\begin{aligned} \text {atan}\left( \frac{x'_{uv}}{z_{uv}}\right) \le \max _{k \, \in \, \left\{ 0..\left( u - 1\right) \right\} } \text {atan}\left( \frac{x'_{kv}}{z_{kv}}\right) \end{aligned}$$
(18)

which can be efficiently computed in parallel for each \(v \in \left\{ 0 .. \left( v_{max} - 1\right) \right\} \) as shown in Algorithm 2.

Algorithm 2 Shadow point removal filter
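
The row-wise scan of Algorithm 2 can be reconstructed from Eqs. (15)–(18) as in the sketch below. The intrinsic parameters and the handling of invalid pixels are illustrative assumptions; the actual implementation runs one such scan in parallel for each image row on the GPU.

```python
import numpy as np

F_U, M_U = 365.0, 256.0      # illustrative IR camera focal length and principal point (pixels)
DELTA = 0.08                 # emitter-camera displacement (m)

def remove_shadow_points(depth):
    """Return a copy of the depth image (meters) with shadow points set to 0."""
    out = depth.astype(float).copy()
    rows, cols = out.shape
    u = np.arange(cols)
    for v in range(rows):                    # each row is independent (parallelizable)
        z = out[v, :]
        valid = z > 0
        x_e = (u - M_U) / F_U * z + DELTA    # x referred to the emitter, Eq. (15)
        beta = np.arctan2(x_e, z)            # emitter angle (up to the pi/2 offset), Eq. (16)
        running_max = -np.inf
        for col in range(cols):              # left-to-right scan implementing Eq. (18)
            if not valid[col]:
                continue
            if beta[col] <= running_max:
                out[v, col] = 0.0            # shadow point: remove the measurement
            else:
                running_max = beta[col]
    return out
```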

Although it is very likely that such a point \(p_k\) is observed by the sensor, since the object is near the sensor and the resolution is very high, condition 17 is still a heuristic. Indeed, in real scenarios an object may be closer to the sensor than the Kinect V2 minimum range, hence \(p_k\) may not actually be observed. Moreover, only a necessary condition was demonstrated, so some valid points may be misclassified as shadow points.

In the pre-processing phase invalid points called veil points are also removed, as shown in Fig. 11. Veil points are caused by the time-of-flight measurement principle, which tends to interpolate points near the object border with the background. A point is removed as a veil point if the segment connecting it to a neighboring point deviates from the observing ray by less than \(\varTheta _{max} = 10^{\circ }\), i.e. the local surface is observed at a grazing angle. In particular, given a point \(p_i\) on the depth image, the point is removed if there is a point \(p_k\) in its Von Neumann neighborhood such that

$$\begin{aligned} \left| \frac{\left( p_k - p_i\right) \cdot p_i}{||p_k - p_i||\cdot ||p_i ||}\right| > \cos \left( \varTheta _{max}\right) \end{aligned}$$
(19)
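
A vectorized sketch of the veil point test of Eq. (19) is shown below (points are assumed to be organized per pixel; the wrap-around of np.roll at image borders is ignored for brevity).

```python
import numpy as np

THETA_MAX = np.deg2rad(10.0)

def veil_point_mask(points):
    """points: (H, W, 3) array of back-projected 3D points; returns a boolean removal mask."""
    mask = np.zeros(points.shape[:2], dtype=bool)
    cos_max = np.cos(THETA_MAX)
    for dv, du in ((1, 0), (-1, 0), (0, 1), (0, -1)):      # Von Neumann neighborhood
        q = np.roll(points, (dv, du), axis=(0, 1))          # neighboring pixel p_k
        edge = q - points
        num = np.abs(np.sum(edge * points, axis=2))
        den = np.linalg.norm(edge, axis=2) * np.linalg.norm(points, axis=2)
        with np.errstate(invalid="ignore", divide="ignore"):
            mask |= np.where(den > 0, num / den > cos_max, False)
    return mask
```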

4 Experimental evaluation

4.1 Robot setup and experimental procedure

The experimental setup (Fig. 15) used for the evaluation of the proposed NBV system consists of a robot arm (Comau SMART SiX) with six degrees of freedom. The robot has a maximum horizontal reach of about 1.4 m. A Kinect V2 sensor is mounted on the end-effector and has been calibrated with respect to the robot wrist. The developed software runs under the ROS framework on an Intel Core i7 4770 at 3.40 GHz, equipped with an NVidia GeForce GTX 670. Collision-free robot movements are planned using the MoveIt! ROS stack.

Fig. 15 The experimental setup (left). Motion planning environment based on MoveIt! (top right). Screenshot of KinFu output during the initial scan phase (bottom right)

Occupied and unknown voxels are considered as obstacles in the motion planning environment. Experiments have been performed in a workspace of size 2 m \(\times \) 1.32 m. The volumetric representation of the environment within KinFu uses voxels of size 5.8 mm. In the motion planning environment voxels are subsampled to 4 cm. KinFu is fed with the robot forward kinematics, as in Monica et al. (2016), Newcombe et al. (2011), Roth and Vona (2012) and Wagner et al. (2013), to improve the accuracy of point cloud registration with respect to the standard sensor ego-motion tracking approach.

The experimental procedure consists of the following steps. At the beginning of each experiment the environment is completely unknown and the robot, starting from a collision-free configuration, takes a short initial scan of the scene, from one side, using KinFu. Then, the system iteratively computes the NBV as described in Algorithm 1. If the motion planner finds a collision-free path the robot is moved to the NBV; otherwise, the NBV is skipped. KinFu is turned on when the robot reaches each planned next-best view configuration. Since the Kinect needs to be moved for KinectFusion to operate properly, the sensor is slightly tilted around the NBV by rotating the robot wrist. The volumetric representation of the environment is, therefore, updated by KinFu after each observation. For the evaluation of the proposed approach for active exploration, the experiments were concluded after the fifth NBV.

4.2 Experiments

Experiments have been performed in four different scenarios shown in Fig. 16. Each experiment contains multiple objects with complex geometry. In particular, in experiment 1 the environment comprises two stacks of objects, while experiment 2 has been performed in a cluttered scene with eight objects.

The performance of the proposed method was compared with a standard approach where the NBV is chosen at each iteration as the viewpoint that maximizes the expected volume of unknown space of the whole environment that becomes visible. The standard approach was implemented by skipping the point cloud segmentation phase and by assigning the same saliency value to all points. A video of experiment 4 is available for download (http://rimlab.ce.unipr.it/documents/RMonica-auro-2016.avi).

Fig. 16 The experimental scenarios used for the evaluation

Quantitative data about the average computational time for each phase are reported in Table 1. The average time for point cloud segmentation and saliency computation is about 2–3% of the total time. A first advantage of the proposed method is that it completes the five next-best views faster than the standard approach. The average times for motion planning and robot movement are rather similar, since these are fixed costs due to the experimental setup, as is the running time for updating the collision map of the motion planning environment (planner map update). Also, the time required for viewpoint generation is very short (2.1 seconds for five views), since the computation is performed on the GPU directly on the TSDF volume. The running time required for the computation of the NBV is reported as a subtotal. It can be noted that for the NBV computation phase our method is 3.9 times faster than the standard approach, even though the standard approach does not require point cloud segmentation and saliency evaluation.

Table 1 Average total time (seconds) and standard deviation over the four experiments for each phase

The main difference between the two approaches in the time required to compute a NBV lies in the viewpoint evaluation phase. The standard approach evaluates all candidate viewpoints generated in the environment (on the order of thousands), most of which are located on the edges of the supporting table. In contrast, being able to focus only on the most salient segments, the proposed method rarely evaluates more than two hundred candidate viewpoints at each iteration. Indeed, the proposed method is strictly focused on the exploration of the salient segments, whose extension is smaller than the size of all the unknown regions of the environment.

Fig. 17 Candidate viewpoints (represented by arrows) for the proposed approach (experiment 1, third NBV). Left candidate viewpoints of all segments. Right candidate viewpoints for the most salient segment only

Table 2 Saliency values and number of view poses for the point cloud segments in Fig. 17 (in descending order of saliency) up to the first segment not belonging to the objects (part of the supporting table)

In Fig. 17, an example of the generated viewpoints is shown. The total number of candidate viewpoints for all segments is 91,960. Using the standard approach all viewpoints would be evaluated to find the optimal NBV. Instead, our method focuses only on the most salient segment and, therefore, only 960 viewpoints are evaluated. In this case, a reachable pose for the robot was found among these viewpoints. If a reachable pose had not been found the system would have evaluated the second most salient segment, and so on. In Table 2 the saliency values of the point cloud segments are shown as well as the number of associated viewpoints. Had the algorithm tried other segments after the most salient one, the number of evaluated viewpoints would have increased up to 10,480, which is the total number of candidate viewpoints actually pointing towards the objects. The proposed saliency function works properly even with some degree of over-segmentation by the LCCP algorithm. Indeed, some of the objects are segmented into multiple parts. For example, both jugs are split into two segments and one of the boxes is segmented into three parts. Nonetheless, each of those parts received a high saliency.

Table 3 Marks showing NBVs pointing towards the objects (\(\checkmark \)) or not (\(\times \)), for all the experiments
Fig. 18 The graphs show the number of unknown voxels near the objects in the scene for the first five next-best views

Fig. 19 Images of experiment 1 using the proposed method (left to right). Top saliency map of point cloud segments; middle 3D volumetric representation; bottom planned robot next-best views

Fig. 20 Images of experiment 1 using the standard NBV approach. Top 3D volumetric representation; bottom planned robot next-best views

In Table 3 marks are reported that indicate whether each NBV points towards the objects or not. In the proposed approach all next-best views pointing towards the objects always occur before any view not focused on the objects. In the standard approach next-best views pointing towards the objects occur in an unpredictable order. Therefore, it is possible to conclude that a second and more important advantage of the proposed approach is that it allows a more rapid exploration of the objects, thanks to point cloud segmentation and saliency evaluation at the segment level. This conclusion is also supported by the graphs in Fig. 18, which show the number of residual unknown voxels near the objects over the first five next-best views.

Fig. 21 Images of experiment 3 using the proposed method (left to right). Top saliency map of point cloud segments; middle 3D volumetric representation; bottom planned robot next-best views

Images of the planned next-best views for experiment 1 are reported in Figs. 19 and 20. Images of the planned next-best views for experiment 3 are reported in Figs. 21 and 22. In experiment 1 the robot focuses on the objects for the first four views. Afterwards, as there are no reachable viewing poses to observe the right side of the objects, due to kinematic constraints, the robot explores a region of space that does not contain any object. In particular, the robot observes the space on the supporting table in front of the objects, which is incomplete due to noise. A similar behavior is evident, for the proposed approach, in experiment 3. Conversely, it can be noted that the standard approach prioritizes exploration of the unknown voxels occluded by the objects as shown, for example, in the first two views of experiment 3. In the third view of experiment 3 the standard approach takes a frontal observation of the objects, but in the fourth view the robot again observes a region of the supporting plane without any object.

Fig. 22 Images of experiment 3 using the standard NBV approach. Top 3D volumetric representation; bottom planned robot next-best views

In some cases at the beginning of the exploration, after one or two next-best views, the standard approach achieves a lower number of unknown residual voxels. An example can be seen in Fig. 18 for experiment 1, after the second NBV. This is due to the fact that in the standard approach, when the robot observes the unknown voxels occluded by the objects, it also partially observes the back of the objects, since the sensor has a large field of view (\(70^\circ \times 60^\circ \)).

Fig. 23 3D volumetric representation of the environment in the four experiments after five next-best views: proposed method (top), standard approach (bottom)

The final voxel-based reconstruction is shown in Fig. 23 for all experiments. The reconstruction of the objects is always more complete for the proposed method. Some unknown voxels are still present, mostly due to unreachable poses aimed at observing the back of the objects or the space below them, as stated above. Also, it can be noted that most of the irrelevant voxels around the back panel of the scene remained unknown for the proposed method, while these voxels have been observed by the standard NBV approach.

4.3 Evaluation of depth image pre-processing

The proposed Kinect V2 depth image pre-processing filter (Sect. 3.3) has been evaluated in the scenario shown in Fig. 24 (top-left). The environment contains only planar surfaces to facilitate ground truth annotation. A bounding box was defined around the workspace to remove the background of the room. Thus, any point that does not belong to a plane can be considered as an outlier. Depth images were obtained by averaging 30 frames (one second) acquired by the sensor to simulate the noise-reduction effect of the KinFu algorithm. A maximum distance threshold of 3 cm was defined to consider a point as belonging to a plane.

Fig. 24 Top left the scenario used for testing the proposed depth image pre-processing filter. Top right image preprocessed by the Freenect2 bilateral filter only. Bottom left image preprocessed by the Freenect2 bilateral and edge-aware filters. Bottom right image preprocessed by the Freenect2 bilateral filter and our filter. The image is displayed in color although the algorithm operates on the depth map only. Outlier points are displayed in red (Color figure online)

In Fig. 24 it can be noted that our pre-processing method successfully removes the shadow on the left of the box-shaped object. The total number of false negatives, i.e. outlier points not belonging to any plane, is reported in Table 4, together with the number of measurements, i.e. the number of valid points reported by each algorithm. Our algorithm reports a significantly lower number of outliers compared to the standard filtering algorithms already available in the Freenect2 driver (a bilateral and an edge-aware filter). Being conservative, however, it also reports a slightly lower number of valid measurements.

Table 4 Number of measurements and false measurements produced by each algorithm
Fig. 25 The simulated environment, with object size 4 cm (top) and 16 cm (bottom). White: area illuminated by the emitter only. Grey: area illuminated and properly acquired by the camera. Red: shadow area visible to the camera. The vertical blue band in the bottom image is a region of space that is neither illuminated by the emitter nor observed by the camera

In Sect. 3.3 it was pointed out that the proposed filter for shadow point removal only provides a necessary condition and that false positives may still be present. Evaluation of false positives was carried out in the simulated environment shown in Fig. 25, which contains a ground plane, a wall, and an object (a long box). The object is at a distance of 1.5 m from the wall and the Kinect V2 sensor is placed at 1 m from the object. The sensor view of the wall and ground plane is partially occluded by the object. The IR emitter and the camera were simulated as separate entities according to the Kinect V2 technical specifications. The shadow point removal filter was tested by varying the width of the object and the observation angles of the sensor around the object. Table 5 reports the ratio between the incorrectly removed points and all the removed points (false discovery rate). For normal-sized objects (8–16 cm width) the false discovery rate is low. However, for thin objects (4 cm width), such as the one displayed in Fig. 25 (top), the false discovery rate is over 40%. This is due to the fact that light from the emitter can pass behind a thin object and illuminate part of the background, which would be correctly perceived by the real sensor but is removed by the proposed filter. It may be noted, however, that this negative result is quite rare, as it happens only if a thin object is in front of a far background; moreover, in these cases only the background region is affected.

Table 5 False discovery rate of shadow points

5 Conclusions

In this work a novel formulation of the next-best view problem was presented that prioritizes active exploration of the objects without using any prior knowledge about the environment. The next-best view is selected among candidate viewpoints that observe the border of incomplete and salient regions of space. A point cloud segmentation algorithm was adopted to extract salient point cloud segments associated with the objects.

The proposed approach has some limitations and, therefore, a number of directions are open for future research. The heuristic for saliency evaluation has proven robust in detecting common objects; however, thin objects or object parts usually receive a low score. Hence, the computation of segment saliency could be improved by considering more advanced features. Following the results achieved in Tateno et al. (2015) and Uckermann et al. (2012, 2014), the quality and robustness of the point cloud segmentation phase could also be improved by performing real-time segmentation. Indeed, real-time segmentation computed on the TSDF volume is a promising research line. Technical limitations of the current robotic setup are mainly due to the small workspace of the robot arm and to the minimum sensing distance of the Kinect sensor. Finally, a natural extension of this work is the inclusion of object recognition techniques based on point cloud segmentation (Varadarajan and Vincze 2011) and the application of the active perception system to intelligent robot manipulation.