1 Introduction

Occlusion edge detection is a fundamental capability of computer vision systems, as is evident from the number of applications and the significant attention it has received [15]. Occlusion edges are useful for a wide array of tasks including object recognition, feature selection, grasping, obstacle avoidance, navigation, path-planning, localization, mapping and stereo-vision. In addition to these numerous applications, the concept of occlusion edges is supported by human visual perception research [6], where it is referred to as figure/ground determination. Once occlusion boundaries have been established, depth ordering of regions becomes possible [7, 8], which aids navigation, simultaneous localization and mapping (SLAM) and path planning. Geometric edges are particularly useful for running SLAM algorithms on mobile robots. Many textureless environments, although relatively common indoors, are not suitable for feature-based SLAM techniques. However, maps based on occlusion and geometric edges still allow localization even in these low-texture regions (see example in Fig. 1). For indoor mapping, planes seem a natural choice of landmark, and indeed numerous researchers have explored plane-based mapping [9]. Although less common in robotic experiments, there are places not suitable for planar mapping, including buildings with curved walls, natural outdoor environments and extremely cluttered scenes such as those found in search and rescue scenarios. Although planes can be a good way of compressing map information, an observed planar surface is not as constraining to robot pose as feature points and edges. Another strong motivation for using geometric edges is that they allow localization by both range and image sensors. For many indoor environments, considering only geometric edges removes floors, walls and ceilings, leaving the elements that lie along the intersection of planes or in cluttered regions, resulting in significant compression of the map data. While such compression may not be as effective in many unstructured outdoor environments, occlusion edge detection can still be useful in certain structured scenarios such as roads and pavements.

Figure 1

A voxel map and the corresponding geometric edges for the mason hallway dataset [25].

Occlusion edges also help image feature selection by rejecting features generated from regions that span an occlusion edge. As such features depend on viewpoint position, removing these viewpoint-dependent features saves further processing and increases recognition accuracy [10]. In many object recognition problems, the shape of an object is a better cue for recognition than its appearance, which can be dramatically altered, e.g., by paint, camouflage or people wearing different clothes. However, shape determination is not the approach taken by state-of-the-art SIFT-based object recognition algorithms. Furthermore, knowledge of occlusion edges is a key component of virtual reality (VR), mixed reality (MR) [11, 12] and optic flow algorithms [7]. In robotics, the geometric edges of objects demarcate their spatial extents, helping with grasping and manipulation as well as maneuvering through the world without collision; knowledge of occlusion edges is therefore essential.

In the context of Dynamic Data Driven Applications Systems (DDDAS), it is therefore essential to develop algorithms that enable faster and more accurate occlusion edge determination via intelligent processing of heterogeneous information sources such as RGB, depth and motion-related data. Such information can be dynamically accommodated in the map of the environment and the system model to improve the accuracy of decision-making. More accurate and efficient decision-making strategies can in turn help the measurement system, in this case the RGB and depth camera, to improve scene understanding capabilities by performing necessary actions such as camera movement, pan, tilt and zoom. In this context, recent works [13, 14] show the effectiveness of the DDDAS framework for various vision and perception problems. Furthermore, to enhance automation and efficiency, meticulous hand-crafting of visual features should be avoided as much as possible. Finally, an efficient frame-wide decision-making scheme is required, along with other key information regarding the model space and control objectives, for closing the control loop. In general, this study is relevant for multiple specific DDDAS applications such as target recognition, surveillance and tracking, and video processing.

In this context, this paper evaluates the efficacy of Deep Learning tools [15] for the task of occlusion edge detection. Recently, this class of techniques has emerged as the top performing machine learning tool for various tasks such as object recognition [16], speech recognition [17], denoising [18], hashing [19] and data fusion [20]. While Deep Neural Networks (DNN) pre-trained using Deep Belief Networks (DBN) [21, 22] perform quite well on most data types, deep Convolutional Neural Networks (CNN) [23, 24] have been shown to be most suited for images. The better performance is primarily attributed to the preservation of local structures (i.e., localized pixel dependencies) by the CNN, as opposed to the DBN-DNN (where layers are typically fully connected bipartite graphs). The occlusion edge detection task can logically be conceived as a two-step process: identifying edges in an image, followed by distinguishing between occlusion and appearance edges. Therefore, deep neural networks are particularly interesting for this problem as they extract hierarchical features (features of features) from data, and visualizations of intermediate optimized filters [16] show that edge-type features are very common. It should also be noted that such an approach eliminates the need for the complicated hand-crafting of features that is common in many current approaches. Due to the availability of GPUs and recent algorithmic and implementation advances, model parameters of large CNNs can be learnt effectively for complex problems given a sufficient amount of data [16], and overfitting can largely be avoided. In fact, the CNN model size (depth and breadth) can be optimized iteratively for a given problem. Often, however, the memory of the implementing GPU becomes the bottleneck.

In this paper, the main contributions are: (i) formulation of the occlusion edge detection problem as classification of the center pixels of an image patch with RGB, depth (D) and optical flow field (UV) channels, (ii) performance evaluation of the CNN with various input information sources, namely RGB, RGB-D and RGB-D-UV, for the occlusion edge detection problem, and (iii) fusion of patch predictions to generate frame-wide occlusion edges that can be used for robotic applications. Note, similar methods and studies exist in different contexts such as tracking with occlusion detection [26], wearable multimodal sensor fusion [27], vehicle registration [28] and wide-area motion imagery [29]. This study uses a publicly available benchmark RGB-D (with multiple time frames) data set captured with a moving camera in an indoor environment by the Computer Vision group at Technische Universität München (TUM) [30]. The optimized and hardware-accelerated CNN implementation has been done on an NVIDIA K-40 GPU.

The paper is organized in eight sections including the introduction. The problem formulation along with the data set description is provided in Section 2. While Section 3 provides the details of the architecture and training parameters for the CNN, testing and post-processing are discussed in Section 4. Various experiments with corresponding quantitative results are provided in Section 5 and qualitative observations are articulated in Section 6. A comparative evaluation with an existing edge detection technique is presented in Section 7. Finally, the paper is summarized and concluded with future research directions in Section 8.

2 Problem Formulation and Generating the Training Data

In general, it is difficult to define occlusion edge pixels rigorously. In an image, edges manifest along paths of high contrast and are due to four main reasons: (i) texture change, i.e., abrupt change in surface color, (ii) lighting change, i.e., sharp shadows, (iii) range discontinuity, i.e., abrupt change in distance from the observer and (iv) surface normal change, e.g., intersection of two planes. Throughout this work it is assumed that appearance edges are a necessary but not sufficient condition for occlusion edges. This assumption is rarely violated in real-world environments, and when it is, even the human visual system fails.

It is important to appreciate the distinction among the causes of image edges. Texture change and illumination edges are not observed by 3D sensors; the remaining geometric edge types are range discontinuities and abrupt surface normal changes. Surface normal changes are pose invariant, whereas edges due to range discontinuities can vary with observer position. These surface normal and range discontinuities are illustrated in the last image of Fig. 2. The cylinder sides in Fig. 2 are examples of range discontinuities: the position of these edges varies in 3D space as the position of the observer shifts, whereas the position of the cylinder rim edge is consistent regardless of observer position. For use in mapping, the following characteristics are desired of extracted edge voxels: they should be generally invariant to rotation and translation, and they should be helpful in constraining pose. Therefore, in this study the focus is on identifying the third and fourth types of edges, i.e., edges due to range discontinuity and surface normal change.

Figure 2

Image with associated edges due to appearance and due to geometry.

Traditional approaches for detecting geometric edges in 3D data include a keypoint detector based on a 3D extension of the Harris corner operator in the Point Cloud Library [31]. This detector operates on local normals of points. A related approach for selecting interest points on 3D meshes was introduced in [32]. In principle, this study is similar to recent works on indoor scene segmentation [33] and depth map prediction [34]. However, this study focuses on whether occlusion edges alone can be isolated using CNNs, and whether reasonable performance can be achieved without using the depth channel of the RGB-D data. As mentioned earlier, this paper uses a benchmark RGB-D data set from the Computer Vision group at Technische Universität München (TUM) [30]. The data set contains RGB and depth images from a Microsoft Kinect sensor, recorded at full frame rate (30 Hz) and sensor resolution (640×480) with a moving camera in an indoor environment. The occlusion edge detection problem is formulated as a classification problem and the procedure for generating training data is provided in the following subsection.

2.1 Training Data

Although in most occlusion edge detection exercises training labels are generated manually, the occlusion edge information is largely present in a clean version of the depth (D) channel. Therefore, the occlusion edge label for a pixel, i.e., the ground truth, can be determined automatically to some extent using the depth channel data. However, it can be observed visually that ground truth obtained in this way may have a large percentage of missed detections. Still, we show that this automated label generation process enables training of a deep CNN. The label generation procedure is illustrated in Fig. 3. From left to right, the three plates in the figure show an example RGB frame, the corresponding (clean) D channel data and the classification frame generated by simple thresholding of the depth data alone. In addition to gray (signifying no edge) and white (signifying occlusion edges), black can be seen in the classification frame; it signifies bad depth measurements due to absorbing surfaces or distances beyond the maximum allowed range between the sensor and the surface.

Figure 3

Example RGB, depth and classification frames from the training data generation procedure. In the classification frame gray signifies no edge, occlusion edges are white and black is for no or unreliable data.
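As a concrete illustration, the following Python sketch shows one way such labels could be derived by thresholding range discontinuities in the depth channel. The jump threshold, the use of simple forward differences and the label encoding are illustrative assumptions, not the exact procedure used for the data set.

```python
import numpy as np

# Hypothetical label generation by thresholding depth discontinuities.
# Label encoding (assumed): 0 = no edge (gray), 1 = occlusion edge (white),
# 2 = bad/unreliable depth (black).
def label_from_depth(depth, jump_threshold=0.05):
    """depth: HxW array in meters; values <= 0 mark missing measurements."""
    labels = np.zeros(depth.shape, dtype=np.uint8)
    dz_y = np.abs(np.diff(depth, axis=0, prepend=depth[:1, :]))
    dz_x = np.abs(np.diff(depth, axis=1, prepend=depth[:, :1]))
    labels[(dz_y > jump_threshold) | (dz_x > jump_threshold)] = 1
    labels[depth <= 0] = 2
    return labels
```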

As shown in Fig. 4, the RGB-D data set was collected by moving the camera along a certain trajectory in an indoor environment. The trajectory is divided into disjoint training and testing sections so that the trained model can be tested on previously unseen data. The frames in the RGB-D data set are 480×640 in size. In order to create training examples for the Convolutional Neural Network (CNN), 32×32 patches are extracted from the large frames in the training section. The training label for each patch is determined by the pixels located at its center [35, 36]. As illustrated in Fig. 5, if the majority of the center pixels (a 2×2 region in this case) of a 32×32 patch contain occlusion edges, the patch is labeled as an Occlusion patch. On the other hand, if the center pixels contain appearance edges or no edge, the corresponding patch is labeled as a No Occlusion patch. Patches with a considerable number of bad or unlabeled pixels are pre-filtered out and not used for training. Furthermore, class balancing is performed between occlusion and no-occlusion examples within the training data set. As expected, we observe that balancing provides a significant performance improvement, as the number of occlusion patches is originally much lower than the number of no-occlusion patches.

Figure 4

Partitioning of camera trajectory for collecting RGB-D data set into training and testing sections.

Figure 5

Generation of training data 32×32 patches from original 480×640 frames and labeling based on center-pixels.
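A minimal sketch of this patch extraction and center-pixel labeling is given below. It assumes the label encoding from the earlier labeling sketch, non-overlapping tiling of the frame and a simple bad-pixel fraction for pre-filtering; these details are illustrative, and class balancing (e.g., by subsampling no-occlusion patches) would be applied afterwards.

```python
import numpy as np

# Hypothetical patch extraction and labeling; labels: 0 = no edge,
# 1 = occlusion edge, 2 = bad/unlabeled (as in the earlier sketch).
def extract_training_patches(frame, labels, patch=32, center=2, max_bad_frac=0.05):
    """frame: HxWxC array (e.g., RGB-D); yields (patch, label) pairs."""
    h, w = labels.shape
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            lab_patch = labels[y:y + patch, x:x + patch]
            if np.mean(lab_patch == 2) > max_bad_frac:      # pre-filter bad patches
                continue
            cy, cx = y + patch // 2, x + patch // 2
            center_lab = labels[cy - center // 2:cy + center // 2,
                                cx - center // 2:cx + center // 2]
            is_occlusion = int(np.mean(center_lab == 1) > 0.5)  # majority of 2x2 center
            yield frame[y:y + patch, x:x + patch, :], is_occlusion
```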

In addition to the 3-channel RGB input and the 4-channel RGB-D input, we also explore ‘structure from motion’ using optical flow information. As we are interested in depth discontinuities, changes between video frames due to motion can be quite useful. Specifically, we use a 2-frame estimation of the horizontal (U) and vertical (V) components of the optical flow field with an iterative reweighted least squares (IRLS) formulation. Figure 6 shows two consecutive frames and the output of the off-the-shelf optical flow algorithm [37].

Figure 6

Two consecutive RGB frames and the output of the optical flow algorithm.
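For illustration, the sketch below computes U and V channels from two consecutive frames. The paper uses an IRLS-based optical flow formulation [37]; OpenCV's Farneback method is used here only as a readily available stand-in, and the parameter values are defaults rather than tuned settings.

```python
import cv2

# Compute horizontal (U) and vertical (V) flow components between two frames.
def uv_channels(frame_t, frame_tp1):
    gray_t = cv2.cvtColor(frame_t, cv2.COLOR_BGR2GRAY)
    gray_tp1 = cv2.cvtColor(frame_tp1, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray_t, gray_tp1, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    u, v = flow[..., 0], flow[..., 1]
    return u, v   # stacked with the RGB-D channels to form the 6-channel input
```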

Remark II.1

In an absolute sense, occlusion edges depend on the gradient of the depth image, which is very sensitive to noise in the depth map; moreover, a depth map derived from a single image is very noisy and has large errors. In our work, we estimate the occlusion edges directly rather than estimating depth first and then calculating occlusion edges. Secondly, our technique takes advantage of additional cues (RGB, UV) beyond depth that contribute to establishing occlusion edges.

3 CNN Architecture and Model Learning

The architecture of the Convolutional Neural Network (CNN) used in this paper is illustrated in Fig. 7. The CNN has three pairs of convolution-pooling layers followed by a softmax output layer [16]. This section articulates the details of those layers as well as the various hyper-parameters used for model learning.

Figure 7

Illustration of Convolutional Neural Network (CNN) architecture used for Occlusion Edge classification.

Description of layers

As described in Section 2.1, 32×32 patches were used as data for the CNN in this study. Depending on the experiment, a different number of channels is used for the input data. For example, while 4 channels were used for single (time) frame RGB-D data (as shown in Fig. 7), 6 channels were used for an RGB-D-UV sequence. Note, all these channels are passed independently through the convolution and max-pooling processes in parallel before being combined for the output layer. So, the convolution and max-pooling processes shown in Fig. 7 apply separately to the patches of every channel. A more detailed description of the various experiments will be provided in Section 5. The layer size parameters here correspond to the RGB-D experiment with 4 channels. The first convolutional layer uses 32 filters (or kernels) of size 5×5×4 with a stride of 1 pixel and padding of 2 pixels on the edges. A two-fold sub-sampling (pooling) layer follows the first convolutional layer and generates the input data (of size 16×16×32) for the second convolutional layer. This layer uses 32 filters of size 5×5×32 with a stride of 1 pixel and padding of 2 pixels on the edges. A second pooling layer with the same specification as the first one is used after that to generate input of size 8×8×32 for the third convolutional layer, which uses 64 filters of size 5×5×32 with the same stride and padding strategies as before. The third pooling layer also has the same configuration as the two before it and leads to a softmax output layer with two labels corresponding to the No Occlusion and Occlusion classes.
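The following PyTorch sketch mirrors the layer sizes stated above for the 4-channel RGB-D case. It is illustrative only: the activation function (ReLU), the fully connected read-out before the softmax and the treatment of all channels by a single filter bank are assumptions where the text does not pin down the details.

```python
import torch
import torch.nn as nn

class OcclusionEdgeCNN(nn.Module):
    """Three convolution-pooling pairs followed by a two-class softmax output."""
    def __init__(self, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),                       # 32x32 -> 16x16
            nn.Conv2d(32, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),                       # 16x16 -> 8x8
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),                       # 8x8 -> 4x4
        )
        self.classifier = nn.Linear(64 * 4 * 4, 2)  # Occlusion vs. No Occlusion

    def forward(self, x):                          # x: (batch, channels, 32, 32)
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)                  # logits; softmax applied in the loss
```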

Hyper-parameters

The CNN described above was trained using stochastic gradient descent with a mini-batch size of 100 examples. The biases of the convolutional layer neurons were initialized to zero, while their weights were initialized from zero-mean Gaussian distributions with standard deviations of 0.0001 for the first, 0.01 for the second and 0.01 for the third convolutional layer. Interestingly, the network performed better with a comparatively larger weight standard deviation (0.3) for the output layer. The learning rate and momentum used for all convolutional layers and all training epochs were 0.001 and 0.9 respectively. Finally, L2 regularizers with weight 0.001 were used for all convolutional layers. No dropout was used for model training in this study.
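A sketch of this training setup, expressed against the illustrative OcclusionEdgeCNN class above, is shown below. Applying the weight decay to all parameters (rather than only the convolutional layers) is a simplification.

```python
import torch.nn as nn
import torch.optim as optim

def init_weights(model):
    stds = [0.0001, 0.01, 0.01]                    # per-convolutional-layer std
    conv_layers = [m for m in model.features if isinstance(m, nn.Conv2d)]
    for conv, std in zip(conv_layers, stds):
        nn.init.normal_(conv.weight, mean=0.0, std=std)
        nn.init.zeros_(conv.bias)
    nn.init.normal_(model.classifier.weight, mean=0.0, std=0.3)  # larger output-layer std

model = OcclusionEdgeCNN(in_channels=4)
init_weights(model)
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9,
                      weight_decay=0.001)          # L2 regularization weight
criterion = nn.CrossEntropyLoss()                  # softmax + negative log-likelihood
# Training loop (sketch): iterate over mini-batches of 100 patches per step.
```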

Training with GPU

The NVIDIA Kepler series K40 GPUs are FLOPS/Watt efficient and are being used to drive real-time image processing capabilities. A Kepler series GPU consists of a maximum of 15 streaming multiprocessor (SMX) units and up to six 64-bit memory controllers. Each SMX unit has 192 single-precision CUDA cores and each core comprises fully pipelined floating-point and integer arithmetic logic units. The K40 GPU has 2880 cores with 12 GB of on-board device memory (RAM). Deep Learning applications have been targeted on GPUs previously in [16] and these implementations are both compute and memory bound. Stacking the channels for the RGB time-series and RGB-delta experiments results in a vector of size 32×32×12, which is suitable for the Single Instruction Multiple Data (SIMD) architecture of the GPUs. At the same time, the training batch fits in the GPU memory, so the utilization of the K40 GPU’s memory is very high. This also allows our experiments to run successfully on a single GPU without partitioning the different layers over multiple GPUs.

4 Testing and Post-processing

Performance testing of the CNN is done in both a quantitative and a qualitative manner, with various input information as explained in Section 5. For quantitative results, classification errors are computed based on the model’s ability to predict the label of the center pixels of a test patch collected from a frame captured in the testing section of the camera motion. The qualitative observations and visualizations are made using a post-processing scheme illustrated in Fig. 8. In this scheme, the classification confidence for a patch’s center pixels is collected from the softmax posterior distribution and is extrapolated across the patch using a Gaussian kernel specified by its Full Width at Half Maximum (FWHM). Such Gaussian kernels from overlapping patches are fused in a mixture model to generate smooth occlusion edges in the testing frame.

Figure 8

Post-processing at the testing phase involves collecting 32×32 overlapping patches with a constant stride from large frames; prediction confidence of a patch center pixel label is converted into a Gaussian kernel with Full Width at Half Maximum (FWHM); Gaussian labels are fused in a mixture model to generate smooth occlusion edges.
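A minimal sketch of this fusion step is given below: per-patch softmax confidences, available at patch-center locations on a stride grid, are spread with Gaussian kernels and averaged into a frame-wide confidence map. The FWHM value and the local support radius are illustrative assumptions.

```python
import numpy as np

def fuse_patch_confidences(confidences, frame_shape, fwhm=16.0, radius=16):
    """confidences: {(row, col) of patch center: occlusion confidence in [0, 1]}."""
    sigma = fwhm / 2.355                            # FWHM = 2*sqrt(2*ln 2)*sigma
    h, w = frame_shape
    num = np.zeros((h, w))
    den = np.full((h, w), 1e-12)
    for (cy, cx), conf in confidences.items():
        y0, y1 = max(cy - radius, 0), min(cy + radius + 1, h)
        x0, x1 = max(cx - radius, 0), min(cx + radius + 1, w)
        ys, xs = np.mgrid[y0:y1, x0:x1]
        kernel = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
        num[y0:y1, x0:x1] += conf * kernel          # weighted confidence
        den[y0:y1, x0:x1] += kernel                 # normalization (mixture weights)
    return num / den                                # smooth confidence map in [0, 1]
```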

5 Experiments and Quantitative Results

Different experiments are performed with different sets of input data for comparative evaluation. They are described below along with corresponding quantitative performance of the CNN model:

RGB-D frame

The first set of experiments used single temporal frames of RGB-D data (i.e., 4 channels). This task may seem rather straightforward as the (noisy) depth information is directly available as one of the channels in the input data. However, the majority of edges in these frames are appearance edges, and the RGB channels clearly provide that information. Therefore, the task for the CNN model is to detect edges via automatic feature extraction and to distinguish occlusion edges from appearance edges.

RGB frame

The second set of experiments used single temporal frames of RGB data (i.e., 3 channels). The goal here was to investigate whether discriminative features exist and can be extracted by the CNN from the RGB channels alone in order to classify patches into the Occlusion and No Occlusion classes. In principle, without temporal information the RGB channels may not carry much occlusion information. However, occlusion cues may remain in certain features such as shadows. Therefore, the goal here is to investigate whether such features can be recognized by a CNN to detect occlusion edges.

RGB-D-UV frame

The third set of experiments used UV channels (horizontal and vertical components of the optical flow field respectively) in addition to RGB-D channels (i.e., 6 channels). These additional channels provide the critical temporal information for occlusion edge detection.

Numerical results are provided below for all of these cases. For training the CNN, 57,518 training patches extracted from large image frames (collected in the training section of the camera trajectory) are used. During testing, 1,271,002 patches (collected in the testing section of the camera trajectory) are used to provide quantitative performance data. Figure 9 shows training and testing error plots over the training epochs; the training error graph in particular demonstrates that the training process does not saturate. Table 1 provides percentages of overall error, false alarm and missed detection averaged over epochs 80 through 100, as well as the value of the Pratt Metric averaged over the same epochs. Note, while the overall error, false alarm and missed detection percentages measure pixel-wise accuracy, the Pratt Metric evaluates the similarity between detected occlusion edges and the corresponding ground truth occlusion edges as described in [38].

Figure 9

Training and testing error plots (for RGB-D, RGB and RGB-D-UV inputs) over various training epochs.

Table 1 Occlusion detection performance of CNN with RGB-D, RGB and RGB-D-UV inputs.
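For reference, a sketch of Pratt's Figure of Merit used in Table 1 is given below; the scaling constant alpha = 1/9 is the value commonly used in the literature and is an assumption here rather than a value stated in the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def pratt_fom(detected, ground_truth, alpha=1.0 / 9.0):
    """detected, ground_truth: boolean HxW occlusion edge maps."""
    if not detected.any() or not ground_truth.any():
        return 0.0
    dist = distance_transform_edt(~ground_truth)     # distance to nearest true edge
    d = dist[detected]                               # distances of detected edge pixels
    n = max(int(detected.sum()), int(ground_truth.sum()))
    return float(np.sum(1.0 / (1.0 + alpha * d ** 2)) / n)
```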

As shown in Table 1, for all input types the false alarm performance is significantly better than the missed detection performance. The primary reason is that the training labels are generated by an automated procedure based on the imperfect depth channel, which can introduce many missed detections into the training examples. For example, qualitative results in the next section show that the CNN based model captures certain occlusion edges that were not part of the ground truth. Numerically, the overall error percentage is very close to the false alarm rate, as the majority of the test example patches do not contain occlusion edges. The occlusion edge detector is more sensitive with RGB-D input than with RGB input. Therefore, the missed detection percentage with RGB-D input is 3.65 % lower than with RGB input. However, this also causes the false alarm rate to rise for RGB-D input. As the majority of test example patches do not contain occlusion edges, the overall error for RGB input is slightly lower than that of RGB-D input. Adding the two optical-flow-based channels reduces the false alarm rate without significantly increasing the missed detection rate. The overall performance is also found to be the best (both pixel-wise and in terms of the Pratt Metric) in the case of RGB-D-UV input. Overall, it is interesting to observe that performance with the RGB channels alone is quite comparable to that with all 6 channels. This is significant as it suggests that a vision system with deep learning algorithms can potentially recognize occlusion edges without any depth sensor.

6 Qualitative Observations

This section presents qualitative results in order to understand the efficacy of the deep learning tools for occlusion edge detection and for robotics applications as a whole. An example frame and corresponding ground truth obtained using the automated labeling process are shown in Fig. 10.

Figure 10

Example RGB frame and corresponding occlusion edge ground truth.

Figure 11 shows performance with RGB-D input with strides 4 and 8 (see Section 4 for details on strides) on the example testing frame. As expected, occlusion edge detection is better with a lower value of stride as more information is available per pixel in this case. It can be noted in the marked regions (circled in red) in the figures that false detection of occlusion edges decreases with a lower value of stride. The trade-off lies in computational speed. With a lower value of stride, the frame processing time increases linearly with the increase in the number of test patches. Therefore, this trade-off has to be chosen properly for real-time robotics applications.

Figure 11

Occlusion edge detection performance on a test frame for RGB-D input with stride 8 and 4; heat map shows the fused detection confidence (red-yellow-blue signifies high-medium-low); red circled region shows example of confusing appearance edges as occlusion edges; performance improves while computational time increases with decrease in strides.

Finally, Fig. 12 shows detection performance for all three input types with stride 4; examples of missed detections, false alarms and a true detection that was not labeled by the automated ground truth determination process are highlighted. From visual inspection, it is clear that the detection confidence (shown by the heat map; red-yellow-blue signifies high-medium-low) is highest for the RGB-D input type, which means the CNN model is most sensitive with that type of input. This corresponds to the quantitative results, which show that the RGB-D input type has the lowest missed detection rate but the highest false alarm rate. On the other hand, RGB has the least detection confidence and RGB-D-UV has the best overall performance. The example of a true detection counted as a false alarm because the ground truth was not labeled properly is significant. It shows that the CNN model is able to learn the occlusion features properly (not just memorize them) and hence can detect certain occlusion edges that were missed by the automated ground truth determination process. This also demonstrates the need for manual labeling of occlusion edges in order to obtain more accurate quantitative performance metrics; however, the large number of training and test examples makes this a barrier.

Figure 12

Occlusion edge detection performance on a test frame for all three input types with stride 4; heat map shows the fused detection confidence (red-yellow-blue signifies high-medium-low); RGB-D most sensitive, RGB the least, RGB-D-UV has the best overall performance; examples of missed detection, false alarm and true detection which was not labeled by automated ground truth determination process shown in red circled regions.

7 Comparative Evaluation

For performance comparison, we ran the recently developed deep learning based Holistically-Nested Edge Detection (HED) [39] algorithm on test frames from the same TUM data set. We used the latest version of Caffe [40] and the pre-trained network from the HED website. In general, HED is a contour detector that favors appearance edges likely to correspond to range discontinuities, and it displays impressive performance on single RGB images alone. Qualitative and quantitative results on one of the test frames are provided in Fig. 13 and Table 2 respectively. From visual inspection, it is evident that although the general edge detection performance of HED is quite good, it suffers from false alarms, i.e., it identifies appearance edges as object contours caused by range discontinuities. Numerical results suggest the same: the false alarm rate of HED is significantly higher than that of our technique, while the missed detection performance is slightly better for HED. The overall error rate is lower for our method. Note, these performance metrics are only based on patches (≈5000 occlusion patches and ≈120,000 non-occlusion patches in the ground truth) obtained from the single frame presented in Fig. 13. Therefore, they should not be compared with the overall results in Table 1, which consider all patches from all 109 test frames.

Figure 13

Occlusion edge detection performance comparison with Holistically-Nested Edge Detection (HED) technique on RGB frame.

Table 2 Occlusion detection performance comparison between our method and HED technique with RGB input.

8 Conclusions and Future Works

In this study, we trained deep convolutional neural networks in a supervised manner in order to detect occlusion edges in RGB-D frames. The problem is formulated as a center-pixel classification problem for an image patch extracted from a larger frame. Apart from RGB-D inputs, experiments were performed to investigate the performance associated with dropping the depth (D) channel and adding motion-related information. It is noted that although the missed detection rate increases without depth data, the false alarm performance actually becomes better. Overall performance is best when all six channels, RGB-D-UV, are used. A testing and post-processing scheme is developed to visualize the testing performance. The trade-off between high-resolution patch analysis and frame-level computation time is discussed, which is critical for real-time robotics applications. Future research directions primarily involve adding ‘closing the control loop’ capabilities to this deep learning based automated feature extraction and classification tool in order to realize an efficient DDDAS. Specific tasks are: (i) investigation of robustness to changes in lighting conditions, textures and domain, (ii) design of motion planning using decisions from CNNs and (iii) analysis of the computation speed vs. accuracy trade-off for real-time operation.