1 Introduction

The problem of scene and image understanding from monocular images has been studied extensively in recent years (Gupta et al. 2010; Hedau et al. 2012; Hoiem et al. 2007; Lee et al. 2010, 2009; Saxena et al. 2008). Some works have addressed the task of inferring the coarse 3D layout of outdoor scenes, exploiting appearance and geometric information (Hoiem et al. 2007; Saxena et al. 2008). Recently, the focus has shifted towards the more difficult case of cluttered indoor scenes (Gupta et al. 2011; Hedau et al. 2012; Lee et al. 2010, 2009). In this context, the notion of affordance and the functionality of objects for human use acquires importance. Thus, Hedau et al. (2012) recover walkable surfaces by reasoning on the location and shape of furniture, Lee et al. (2010, 2009) reason about the 3D geometry of the room and objects, while Gupta et al. (2011) interpret the scene from a human-centric perspective.

Another major line of work has been object detection. Most notable is the work in the sliding window paradigm: one of the first examples is Viola and Jones (2001), which considered the task of face detection; Dalal and Triggs (2005) proposed and benchmarked various feature choices for use with sliding window detectors; and more recent works (Bourdev et al. 2010; Felzenszwalb et al. 2010) extend the sliding window approach to reason about parts and their relative arrangements. Notably, the deformable part models (DPM) of Felzenszwalb et al. (2010) are the widely accepted state-of-the-art method for object detection.

With the introduction of commodity depth sensors such as the Microsoft Kinect, a new area of research has opened up in computer vision around tasks that have traditionally been very hard. For example, recent works have considered 3D reconstruction tasks such as real-time scene reconstruction (Izadi et al. 2011), and recovering high-fidelity albedo, shape and illumination (Barron and Malik 2013).

There has also been a lot of work on semantic understanding of images given RGB-D input from a depth sensor. A particularly striking early result is real-time human pose estimation from single RGB-D images (Shotton et al. 2011), which demonstrates that, with the availability of RGB-D input, the hard problem of human joint localization can be solved well enough to be used in a practical application. Subsequently, there have been numerous papers in both the robotics and vision communities looking at various image and scene understanding problems, namely bottom-up segmentation (Dollár and Zitnick 2013; Ren and Bo 2012; Silberman et al. 2012), semantic segmentation (Carreira et al. 2012; Koppula et al. 2011; Ren et al. 2012; Silberman et al. 2012), and object detection (Janoch et al. 2013; soo Kim et al. 2013; Lai et al. 2013; Tang et al. 2012; Ye 2013).

In this paper we tackle all three of these tasks for indoor RGB-D images: bottom-up segmentation, object detection and semantic segmentation. The output of our approach is shown in Fig. 1: given a single RGB-D image (a, b), our system produces contour detection and bottom-up segmentation (c), grouping by amodal completion (d), contour classification (e), object detection (f) and semantic labeling of objects and scene surfaces (g).

Fig. 1

Output of our system: We take in as input a single color and depth image (a, b) and produce as output a bottom-up segmentation (c), long range completions (d), contour classification (e) [into depth discontinuities (red), concave normal discontinuities (green) and convex normal discontinuities (blue)], object detections (f), and a semantic segmentation (g) (Color figure online)

This is an extended version of the work that appeared in Gupta et al. (2013). It differs from Gupta et al. (2013) in that we also investigate the problem of RGB-D object detection, and show that incorporating additional features from object detector activations further improves semantic segmentation accuracy.

This paper is organized as follows: we review related work in Sect. 2. We describe our algorithm and results for perceptual re-organization (bottom-up segmentation and amodal completion) in Sect. 3. We then describe how we train RGB-D object detectors and compare them with existing methods in the literature in Sect. 4. We then describe our system for semantic segmentation in Sect. 5. Finally, we use the output from our object detectors and scene classifiers for the task of semantic segmentation, and show how this additional knowledge can help us improve the performance of our semantic segmentation system in Sect. 6.

2 Related Work

2.1 Bottom-up and Semantic Segmentation

One of the first attempts at bottom-up and semantic segmentation of RGB-D images is that of Silberman et al. (2012), who consider the task of bottom-up RGB-D segmentation and semantic scene labeling. They modify the algorithm of Hoiem et al. (2011) to use depth for bottom-up segmentation, and then use context features derived from inferring support relationships in the scene to perform semantic segmentation. Ren et al. (2012) use features based on kernel descriptors on superpixels and their ancestors in a region hierarchy, followed by a Markov random field (MRF) context model. Koppula et al. (2011) also study the problem of indoor scene parsing with RGB-D data in the context of mobile robotics, where multiple views of the scene are acquired with a Kinect sensor and subsequently merged into a full 3D reconstruction. The full 3D point cloud is over-segmented and used as the underlying structure for an MRF model. A rich set of features is defined, describing local appearance, shape, geometry, and contextual relationships among object classes. A max-margin formulation is proposed to learn the model parameters, and inference is performed via LP relaxation.

Our work differs from the references above in both our approach to segmentation and to recognition. We visit the segmentation problem afresh by extending the gPb-ucm (Arbelaez et al. 2011) machinery to leverage depth information, giving us significantly better bottom-up segmentation when compared to earlier works. We also consider the interesting problem of amodal completion (Kanizsa 1979) and obtain long range groups, which gives us better bottom-up region proposals for scene surfaces which are often interrupted by objects in front of them. Finally, we are also able to label each edge as being a depth edge, a normal edge, or neither.

Our approach for recognition builds on insights from the performance of different methods on the PASCAL VOC segmentation challenge (Everingham et al. 2012). We observe that approaches such as Arbelaez et al. (2012) and Carreira et al. (2012), which focus on classifying bottom-up region candidates using strong features on the region, have obtained significantly better results than MRF-based methods (Ladicky et al. 2010). Based on this motivation, we propose new features to represent bottom-up region proposals (which in our case are non-overlapping superpixels and their amodal completions), and use additive kernel SVM classifiers.

2.2 Object Detection

For object detection, from a robotics perspective, Lai et al. (2011, 2013) collect a dataset of day-to-day objects and propose novel kernel descriptor features to recognize these objects. We study the same problem, but consider it in uncontrolled and cluttered real-world scenes, and develop techniques which generalize across instances of the same category. Moreover, we are more interested in the problem of detecting large furniture-like items. Johnson et al., Rusu et al., and Frome et al. compute features for describing points in point cloud data (Frome et al. 2004; Johnson and Hebert 1999; Rusu et al. 2009), but in this work we want to design features for complete objects. Janoch et al. (2013) also consider the task of object detection in RGB-D settings, proposing modifications to the approach of Felzenszwalb et al. (2010) along with re-scoring and pruning of detections to improve detection accuracy. In more recent work, soo Kim et al. (2013) propose modifications to DPMs to reason in 3D and take into account bottom-up grouping cues, and show improvements over the approach of Janoch et al. (2013). Tang et al. (2012) also look at the task of object detection in the same framework, but do not reason about perspective in their calculations of depth image gradients. Ye (2013) looks at the same task but computes features on surface normal images. Our work is most similar to that of Tang et al. (2012) and Ye (2013), but we differ in the features that we use, and observe that even a simple model with the right features can outperform more complicated approaches.

3 Perceptual Organization

One of our main goals is to perform perceptual organization on RGB-D images. We would like an algorithm that detects contours and produces a hierarchy of bottom-up segmentations from which we can extract superpixels at any granularity. We would also like a generic machinery that can be trained to detect object boundaries, but that can also be used to detect different types of geometric contours by leveraging the depth information. In order to design such a depth-aware perceptual organization system, we build on the architecture of the \(gPb-ucm\) algorithm (Arbelaez et al. 2011), a widely used framework for monocular image segmentation.

3.1 Geometric Contour Cues

In addition to color data, we have, at each image pixel, an estimate of its \(3D\) location in the scene, from which we can infer its surface normal orientation. We use this local geometric information to compute three oriented contour signals at each pixel in the image: a depth gradient \(DG\) which identifies the presence of a discontinuity in depth, a convex normal gradient \(NG_{+}\) which captures whether the surface bends out at a given point in a given direction, and a concave normal gradient \(NG_{-}\), which captures whether the surface bends in.

Generalizing the color and texture gradients of \(gPb\) to RGB-D images is not a trivial task because of the characteristics of the data, particularly: (1) a nonlinear noise model of the form \( |\delta Z| \propto Z^2 |\delta d| \), where \(\delta Z\) is the error in the depth observation, \(Z\) is the actual depth, and \(\delta d\) is the error in the disparity observation (due to the triangulation-based nature of the Kinect), causing non-stochastic and systematic quantization of the depth, (2) lack of temporal synchronization between color and depth channels, resulting in misalignment in the dataset being used, and (3) missing depth observations. We address these issues by carefully designing geometric contour cues that have a clear physical interpretation, using multiple sizes for the window of analysis, not interpolating for missing depth information, estimating normals by least squares fits to disparity instead of points in the point cloud, and independently smoothing the orientation channels with Savitsky and Golay (1964) parabolic fitting.
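The quadratic dependence of the depth error on depth follows from the inverse relation between depth and disparity in a triangulating sensor. A short derivation, assuming the standard stereo model with focal length \(f\) and baseline \(b\), is

\( Z = \frac{fb}{d} \;\Rightarrow\; \delta Z \approx \frac{\partial Z}{\partial d}\,\delta d = -\frac{fb}{d^{2}}\,\delta d = -\frac{Z^{2}}{fb}\,\delta d \;\Rightarrow\; |\delta Z| \propto Z^{2}\,|\delta d|, \)

so a fixed quantization step in disparity produces depth errors that grow quadratically with distance, which is why far-away structures are quantized so coarsely.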

In order to estimate the local geometric contour cues, we consider a disk centered at each image location. We split the disk into two halves at a pre-defined orientation and compare the information in the two disk-halves, as suggested originally in Martin et al. (2004) for contour detection in monocular images. In the experiments, we consider \(4\) different disk radii varying from \(5\) to \(20\) pixels and \(8\) orientations. We compute the 3 local geometric gradients \(DG, NG_{+}\) and \(NG_{-}\) by examining the point cloud in the 2 oriented half-disks. We first represent the distribution of points on each half-disk with a planar model. Then, for \(DG\) we calculate the distance between the two planes at the disk center and for \(NG_{+}\) and \(NG_{-}\) we calculate the angle between the normals of the planes.
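The sketch below (Python/NumPy) illustrates this computation for a single disk location and orientation. It fits planes to depth rather than disparity purely for brevity, the convex/concave sign test is one plausible convention rather than the paper's exact choice, and the half-disk extraction and the aggregation over radii and orientations are omitted.

```python
import numpy as np

def fit_plane_z(points):
    """Least-squares fit of z = a*x + b*y + c to an (N, 3) point set.

    (The paper fits disparity rather than depth for better noise behaviour;
    depth is used here only to keep the sketch short.)
    """
    A = np.c_[points[:, :2], np.ones(len(points))]
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    a, b, _ = coeffs
    normal = np.array([a, b, -1.0])
    return coeffs, normal / np.linalg.norm(normal)

def geometric_gradients(half1, half2, center_xy):
    """DG, NG+ and NG- for one disk location and orientation (sketch)."""
    (a1, b1, c1), n1 = fit_plane_z(half1)
    (a2, b2, c2), n2 = fit_plane_z(half2)
    x, y = center_xy

    # DG: distance between the two fitted planes at the disk center.
    dg = abs((a1 * x + b1 * y + c1) - (a2 * x + b2 * y + c2))

    # Angle between the two plane normals.
    angle = np.degrees(np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0)))

    # Split the angle into convex and concave channels. If one half-disk lies
    # farther away than the other half's plane predicts, the surface bends
    # out towards the viewer (convex); otherwise it bends in (concave). This
    # sign test is one plausible convention, not necessarily the paper's.
    cx1, cy1, cz1 = half1.mean(axis=0)
    bends_out = cz1 > (a2 * cx1 + b2 * cy1 + c2)
    ng_plus = angle if bends_out else 0.0
    ng_minus = 0.0 if bends_out else angle
    return dg, ng_plus, ng_minus
```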

3.2 Contour Detection and Segmentation

We formulate contour detection as a binary pixel classification problem where the goal is to separate contour from non-contour pixels, an approach commonly adopted in the literature (Arbelaez et al. 2011; Hoiem et al. 2011; Martin et al. 2004). We learn classifiers for each orientation channel independently and combine their final outputs, rather than training one single classifier for all contours.

Contour Locations We first consider the average of all local contour cues in each orientation and form a combined gradient by taking the maximum response across orientations. We then compute the watershed transform of the combined gradient and declare all pixels on the watershed lines as possible contour locations. Since the combined gradient is constructed with contours from all the cues, the watershed over-segmentation guarantees full recall for the contour locations. We then separate all the boundary location candidates by orientation.
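A rough sketch of this candidate-generation step, using SciPy and scikit-image, is shown below. The layout of the oriented cue array is an assumption, and the seeding of the watershed at regional minima is one simple choice; the real system combines many cues at several scales before this step.

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed, find_boundaries

def contour_candidates(oriented_cues):
    """Candidate contour locations from per-orientation cue averages.

    oriented_cues: (n_orientations, H, W) array (hypothetical layout) holding
    the average of all local contour cues in each orientation.
    Returns a boolean (H, W) mask of watershed-line pixels.
    """
    # Combined gradient: maximum response across orientations.
    combined = oriented_cues.max(axis=0)

    # Watershed of the combined gradient, seeded at its regional minima; the
    # watershed lines over-segment the image and therefore retain essentially
    # all true contour locations.
    minima = combined == ndimage.minimum_filter(combined, size=3)
    markers, _ = ndimage.label(minima)
    labels = watershed(combined, markers)

    return find_boundaries(labels, mode='inner')
```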

Labels We transfer the labels from ground-truth manual annotations to the candidate locations for each orientation channel independently. We first identify the ground-truth contours in a given orientation, and then declare as positives the candidate contour pixels in the same orientation within a distance tolerance. The remaining boundary location candidates in the same orientation are declared negatives.

Features For each orientation, we consider as features our geometric cues \(DG, NG_{+}\) and \(NG_{-}\) at \(4\) scales, and the monocular cues from \(gPb\): \(BG, CG\) and \(TG\) at their \(3\) default scales. We also consider three additional cues: the depth of the pixel, a spectral gradient (Arbelaez et al. 2011) obtained by globalizing the combined local gradient via spectral graph partitioning, and the length of the oriented contour.

Oriented Contour Detectors We use as classifiers support vector machines (SVMs) with additive kernels (Maji et al. 2013), which allow learning nonlinear decision boundaries with an efficiency close to linear SVMs, and use their probabilistic output as the strength of our oriented contour detectors.

Hierarchical Segmentation Finally, we use the generic machinery of Arbelaez et al. (2011) to construct a hierarchy of segmentations, by merging regions of the initial over-segmentation based on the average strength of our oriented contour detectors.

3.3 Amodal Completion

The hierarchical segmentation obtained thus far only groups regions which are continuous in 2D image space. However, surfaces which are continuous in \(3D\) space can be fragmented into smaller pieces because of occlusion. Common examples are floors, table tops and counter tops, which often get fragmented into small superpixels because of objects resting on them.

In monocular images, the only low-level signal that can be used to do this long-range grouping is color and texture continuity which is often unreliable in the presence of spatially varying illumination. However, in our case with access to \(3D\) data, we can use the more robust and invariant geometrical continuity to do long-range grouping. We operationalize this idea as follows:

  1. Estimate low dimensional parametric geometric models for individual superpixels obtained from the hierarchical segmentation.

  2. Greedily merge superpixels into bigger more complete regions based on the agreement among the parametric geometric fits, and re-estimate the geometric model.

In the context of indoor scenes we use planes as our low dimensional geometric primitive. As measures of agreement we use (1) orientation (the angle between the normals of the planar approximations to the two superpixels) and (2) residual error (the symmetrized average distance of the points in one superpixel from the plane defined by the other superpixel), and use a linear function of these two features to determine which superpixels to merge.

As an output of this greedy merging, we get a set of non-overlapping regions which consists of both long and short range completions of the base superpixels.
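A minimal sketch of this greedy merging is given below. The weights of the linear agreement function are learned in our system and are left here as free parameters; the representation of a superpixel as a dictionary holding its points and fitted plane is purely illustrative.

```python
import numpy as np

def plane_agreement(sp_a, sp_b):
    """Agreement features between two superpixels with planar fits.

    Each superpixel is a dict with 'points' (N, 3), a unit 'normal' (3,) and
    a point 'origin' on its fitted plane (hypothetical representation).
    """
    # (1) Orientation: angle between the two plane normals.
    cos_angle = np.clip(abs(np.dot(sp_a['normal'], sp_b['normal'])), 0.0, 1.0)
    angle = np.degrees(np.arccos(cos_angle))

    # (2) Residual: symmetrized average distance of each superpixel's points
    # from the plane fitted to the other superpixel.
    def avg_dist(points, normal, origin):
        return np.mean(np.abs((points - origin) @ normal))
    residual = 0.5 * (avg_dist(sp_a['points'], sp_b['normal'], sp_b['origin']) +
                      avg_dist(sp_b['points'], sp_a['normal'], sp_a['origin']))
    return angle, residual

def greedy_merge(superpixels, w_angle, w_residual, bias, fit_plane):
    """Greedily merge superpixels while a linear score indicates agreement.

    fit_plane: callable that takes an (N, 3) point array and returns a
    superpixel dict as above. Weights and bias are assumed learned offline.
    """
    merged = True
    while merged:
        merged = False
        for i in range(len(superpixels)):
            for j in range(i + 1, len(superpixels)):
                angle, residual = plane_agreement(superpixels[i], superpixels[j])
                # Merge when the linear agreement score is below zero.
                if w_angle * angle + w_residual * residual + bias < 0:
                    pts = np.vstack([superpixels[i]['points'],
                                     superpixels[j]['points']])
                    superpixels[i] = fit_plane(pts)   # re-estimate the plane
                    del superpixels[j]
                    merged = True
                    break
            if merged:
                break
    return superpixels
```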

3.4 Results

We train and test our oriented contour detectors using the instance level boundary annotations of NYUD2 as the ground-truth labels. We follow the standard train-test splits of the NYUD2 dataset, with \(795\) training images and \(654\) testing images (these splits ensure that images from the same scene are either entirely in the test set or entirely in the train set).

We evaluate performance using the standard benchmarks of the Berkeley Segmentation Dataset (Arbelaez et al. 2011): precision and recall on boundaries and Ground Truth Covering of regions. We consider two natural baselines for bottom-up segmentation: the algorithm \(gPb-ucm\), which does not have access to depth information, and the approach of Silberman et al. (2012), made available by the authors (labeled NYUD2 baseline), which produces a small set (5) of nested segmentations using color and depth.

Figure 2 and Table 1 present the results. Our depth-aware segmentation system produces contours of far higher accuracy than \(gPb-ucm\), improving the average precision (AP) from 0.55 to 0.70 and the maximal F-measure (ODS in Table 1, left) from 0.62 to 0.69. In terms of region quality, the improvement is also significant, increasing the best ground truth covering of a single level in the hierarchy (ODS in Table 1, right) from 0.55 to 0.62, and the quality of the best segments across the hierarchy from 0.69 to 0.75. Thus, on average, for each ground truth object mask in the image, there is one region in the hierarchy that overlaps 75 % with it. The comparison against the NYUD2 baseline, which has access to depth information, is also largely favorable for our approach. In all the benchmarks, the performance of the NYUD2 baseline lies between \(gPb-ucm\) and our algorithm.

Fig. 2

Boundary benchmark on NYUD2: our approach (red) significantly outperforms baselines (Arbelaez et al. 2011) (black) and Silberman et al. (2012) (blue) (Color figure online)

Table 1 Segmentation benchmarks for hierarchical segmentation on NYUD2

In Silberman et al. (2012), only the coarsest level of the NYUD2 baseline is used as spatial support to instantiate a probabilistic model for semantic segmentation. However, a drawback of choosing one single level of superpixels in later applications is that it inevitably leads to over- or under-segmentation. Table 2 compares this design choice in detail against our amodal completion approach. A first observation is that our base superpixels are finer than the NYUD2 ones: we obtain a larger number of superpixels and our ground truth covering is lower (\(0.58\) versus \(0.61\)), indicating higher over-segmentation in our superpixels. The boundary benchmark confirms this observation, as our F-measure is slightly lower, but with higher Recall and lower Precision.

The last row of Table 2 provides empirical support for our amodal completion strategy: by augmenting our fine superpixels with a small set of amodally completed regions (6 on average), we preserve the boundary Recall of the underlying over-segmentation while improving the quality of the regions significantly, increasing the bestC score from \(0.58\) to \(0.63\). The significance of this jump can be judged by comparison with the ODS score of the full hierarchy (Table 1, right), which is \(0.62\): no single level in the full hierarchy would produce better regions than our amodally completed superpixels.

Table 2 Segmentation benchmarks for superpixels on NYUD2

Our use of the depth-aware contour cues \(DG, NG_{+}\), and \(NG_{-}\) is further justified because it allows us to also infer the type of each boundary, i.e., whether it is a depth edge, a concave edge, a convex edge or an albedo edge. We simply average the strengths across the different scales for each of these channels, and threshold them appropriately to obtain labels for each contour. We show some qualitative examples of this output in Fig. 3 (5th column).

Fig. 3

Output of our system: we take in as input a single color and depth image (a, b) and produce as output bottom up segmentation (c), long range completion (d), contour classification (e) [into depth discontinuities (red), concave normal discontinuities (green) and convex normal discontinuities (blue)], object detection (f), and semantic segmentation (g) (Color figure online)

4 RGB-D Detector

Given access to point cloud data, it is natural to think of a 3D model which scans a 3D volume in space and reasons about parts and deformations in 3D space. While such a model is appealing, we argue that the choice between a 3D scanning-volume detector and a 2D scanning-window detector only changes the way computation is organized, and that the same 3D reasoning can be done in windows extracted from the 2D image. For example, this reasoning can take the form of better 3D-aware features computed from the points in the support of the 2D sliding window. Not only does this approach deal with the issue of computational complexity, but it also readily allows us to extend existing methods in computer vision to RGB-D data.

Hence, we generalize the deformable part models detector of Felzenszwalb et al. (2010) to RGB-D images by computing additional feature channels on the depth image. We adopt the paradigm of having a multi-scale scanning window detector, computing features from organized spatial cells in the detector support, and learning a model with deformable parts.

4.1 Features

Note that our sliding window detector searches over scale, so when designing the features we can assume that the window of analysis has been normalized for scale variations. In addition to the HOG features that capture appearance, we use the following features to encode the shape information from the depth image.

4.1.1 Histogram of Depth Gradients

In past work which studied the task of adapting 2D object detectors to RGB-D data (Janoch et al. 2013; Tang et al. 2012), a popular choice is to simply extend the histogram of oriented gradients (HOG) used on color images to depth images. One might think that this primarily captures depth discontinuities and object boundaries. However, as we show in the Appendix, the histogram of depth gradients actually captures the orientation of the surface and not just the depth discontinuities. Very briefly, the gradient orientation at a point is along the direction in which the surface is receding away from the viewer (the tilt), and the gradient magnitude captures the rate at which the surface is receding (the slant of the surface). Note that when the surface is more or less parallel to the viewing plane, the estimate of the gradient orientation is inaccurate, and the contribution of such points should therefore be down-weighted; this is precisely what happens when we accumulate the gradient magnitude over different orientations.

The final step in HOG computation involves contrast normalization. We keep this step, as it brings the feature vector around depth discontinuities (where the surface recedes very sharply) into the same range as the feature vector in non-depth-discontinuity areas.

With this contrast normalization step, it turns out that the histogram of depth gradients is very similar to the histogram of disparity gradients (the gradient orientation is exactly the same; the gradient magnitudes are somewhat different, but this difference essentially goes away due to contrast normalization; the complete justification is given in the Appendix). In all our experiments we use HHG, the histogram of oriented horizontal disparity gradients, as this has better error properties than the histogram of depth gradients (since a stereo sensor actually measures disparity and not depth).
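The sketch below illustrates a simplified HHG-style computation: gradients of the disparity image are binned by orientation, weighted by magnitude, and contrast-normalized per cell. The real feature follows the HOG/DPM implementation (interpolated binning, block-level normalization); the cell size and bin count here are assumptions.

```python
import numpy as np

def hhg_features(disparity, cell=8, n_bins=9, eps=1e-6):
    """HOG-style histogram of disparity gradients (simplified sketch).

    disparity: (H, W) disparity image. Gradients of disparity (rather than
    depth) are used, matching the HHG choice in the text.
    """
    gy, gx = np.gradient(disparity)
    mag = np.hypot(gx, gy)
    # Gradient orientation folded into [0, pi).
    ori = np.mod(np.arctan2(gy, gx), np.pi)

    H, W = disparity.shape
    n_cy, n_cx = H // cell, W // cell
    feats = np.zeros((n_cy, n_cx, n_bins))
    bins = np.minimum((ori / np.pi * n_bins).astype(int), n_bins - 1)

    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = np.s_[cy * cell:(cy + 1) * cell, cx * cell:(cx + 1) * cell]
            # Accumulate gradient magnitude into orientation bins; points on
            # fronto-parallel surfaces have small magnitude and are naturally
            # down-weighted, as argued in the text.
            hist = np.bincount(bins[sl].ravel(), weights=mag[sl].ravel(),
                               minlength=n_bins)
            # Contrast normalization keeps cells near depth discontinuities
            # on the same scale as smooth regions.
            feats[cy, cx] = hist / (np.linalg.norm(hist) + eps)
    return feats
```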

4.1.2 Histogram of Height

As we show in the Appendix, we can estimate the direction of gravity and the absolute height above the ground plane for each point. We use this estimate of height to compute a histogram capturing the distribution of heights of the points in each cell. We use the L2-normalized square root of the counts in each bin as features for each cell. We call this feature HH.
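A simplified version of the HH feature might look as follows; the number of bins and the height range are assumptions, as the text does not fix them.

```python
import numpy as np

def hh_features(height, cell=8, n_bins=8, h_max=3.0, eps=1e-6):
    """Histogram of heights above ground per cell (simplified sketch).

    height: (H, W) map of estimated height above the ground plane in meters.
    """
    H, W = height.shape
    n_cy, n_cx = H // cell, W // cell
    bins = np.clip((height / h_max * n_bins).astype(int), 0, n_bins - 1)

    feats = np.zeros((n_cy, n_cx, n_bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = np.s_[cy * cell:(cy + 1) * cell, cx * cell:(cx + 1) * cell]
            counts = np.bincount(bins[sl].ravel(), minlength=n_bins)
            # Square root of the counts, then L2 normalization, as in the text.
            sq = np.sqrt(counts.astype(float))
            feats[cy, cx] = sq / (np.linalg.norm(sq) + eps)
    return feats
```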

4.2 Results

In this section, we validate our design choices empirically and compare our results to related work. We report experiments on NYUD2 and B3DO.

4.2.1 Performance on NYUD2

The NYUD2 dataset was originally proposed to study bottom-up segmentation, semantic segmentation and support surface inference (Silberman et al. 2012). However, since it provides dense pixel labels for each object instance, we can easily derive bounding box annotations (by putting a tight bounding box around each instance) and study the task of object detection.

Since we are interested in investigating the task of detecting furniture-like objects in indoor scenes, we select the five most common (by number of pixels) furniture categories in the dataset: bed, chair, sofa, counter, and table (we exclude cabinets because they are more a part of the scene than a furniture item). For the sake of comparison to past and future work, we also include all categories studied by Ye (2013), and all categories that are part of the RMRC challenge (Reconstruction meets recognition challenge 2013).

We use the same standard train and test sets (of 795 and 654 images respectively, as explained in Sect. 3). We found that training with multiple components did not improve performance, given the small amount of data.

We follow the standard PASCAL (Everingham et al. 2010) metric of average precision (AP) for measuring detection performance. We report the performance that we obtain in Table 3.

Table 3 Performance on NYUD2 (Silberman et al. 2012): we use the standard PASCAL (Everingham et al. 2010) metric of average precision (AP) for measuring detection performance

We compare against the state-of-the-art appearance-only method (Felzenszwalb et al. 2010) and other approaches which make use of depth information (Ye 2013). We also compare against the output of our semantic segmentation system as proposed in Gupta et al. (2013). We compute bounding box predictions for a class \(c\) from the semantic segmentation output by putting a tight bounding box around each connected component of pixels belonging to class \(c\), and assigning each such box a score based on the confidence score for class \(c\) of the pixels within the box (note that the semantic segmentation output does not have instance information, and the tightest bounding box around a connected component often includes multiple instances). We observe that we are able to consistently outperform the baselines. We provide some qualitative visualizations of our bed, chair, sofa, table and counter detections in Fig. 3 (6th column).
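The conversion from a semantic segmentation to box predictions can be sketched as follows; scoring each box by the mean pixel confidence inside it is one plausible instantiation of the scoring described above.

```python
import numpy as np
from scipy import ndimage

def boxes_from_segmentation(label_map, score_map, cls):
    """Bounding-box predictions for class `cls` from a semantic segmentation.

    label_map: (H, W) per-pixel predicted class ids.
    score_map: (H, W) per-pixel confidence for class `cls`.
    Returns a list of (x1, y1, x2, y2, score); each box is the tight box
    around one connected component of pixels predicted as `cls`.
    """
    mask = label_map == cls
    components, n = ndimage.label(mask)
    boxes = []
    for k in range(1, n + 1):
        ys, xs = np.nonzero(components == k)
        box = (xs.min(), ys.min(), xs.max(), ys.max())
        score = float(score_map[ys, xs].mean())   # mean confidence in the box
        boxes.append(box + (score,))
    return boxes
```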

4.2.2 Performance on B3DO

The B3DO dataset considers the task of detecting mostly small 'prop-like' objects, including bottles, bowls, cups, keyboards, monitors, computer mice, phones and pillows, as well as one larger furniture object, chair, and provides 2D bounding box annotations for objects of these categories. For this dataset, we only use the HHG and HOG features and do not use the HH feature, since the gravity estimate fails on the many images in which the camera is not roughly horizontal (for example, when overlooking the top of a table).

We follow the standard evaluation protocol of training on the 6 train sets and testing on the 6 corresponding validation sets, and reporting the average AP obtained for each category. We report the performance in Table 4. We compare against the approach of soo Kim et al. (2013), who also studied the same task of object detection in RGB-D images.

Table 4 Performance on B3DO: comparison with Janoch et al. (2013), soo Kim et al. (2013) on B3DO dataset

Although we designed our model and features with large furniture-like objects in mind, we see that our approach works reasonably well on this task, and we get competitive performance even on small 'prop-like' objects. We consistently outperform previous approaches that have studied this task.

4.2.3 Ablation Study

Here we study how much each of our features contributes to the performance of our proposed detector. We conduct an ablation study by removing each component of our detector in turn. We do this analysis on the train set of the NYUD2 dataset: we split the train set into two halves, train on one and report performance on the other. We report the ablation study in Table 5.

Table 5 Ablation study: see Sect. 4.2.3

We see that all features contribute to the performance. The most important features are HOG on the appearance image and Histogram of Disparity Gradient features.

To gain further understanding of what the detector is learning, we provide visualizations of the model and its various parts in the Appendix.

5 Semantic Segmentation

We now turn to the problem of semantic segmentation on NYUD2. The task proposed in Silberman et al. (2012) consists of labeling image pixels into just four super-ordinate classes—ground, structure, furniture and props. We study a more fine-grained 40 class discrimination task, using the most common classes of NYUD2. These include scene structure categories like walls, floors, ceiling, windows, doors; furniture items like beds, chairs, tables, sofa; and objects like lamps, bags, towels, boxes. The complete list is given in Table 6.

Table 6 Performance on the 40 class task: We report the pixel-wise Jaccard index for each of the 40 categories

We leverage the reorganization machinery developed in Sect. 3 and approach the semantic segmentation task by predicting labels for each superpixel. We define features based on the geocentric pose, shape, size and appearance of the superpixel and its amodal completion. We then train classifiers using these features to obtain a probability of belonging to each class for each superpixel. We experiment with random decision tree forests (Breiman 2001; Criminisi et al. 2012) (RF), and additive kernel (Maji et al. 2013) support vector machines (SVM).

5.1 Features

As noted above, we define features for each superpixel based on the properties of both the superpixel and its amodal completion. As we describe below, our features capture affordances via absolute sizes and heights, which are more meaningful when calculated for the amodal completion rather than just the superpixel. Note that we describe the features below in the context of superpixels, but we actually calculate them for both the superpixel and its amodal completion.

5.1.1 Generic Features

Geocentric Pose These features capture the pose (orientation and height) of the superpixel relative to the gravity direction. They include (1) orientation features: we leverage our estimate of the gravity direction from the Appendix, and use as features the angle with respect to gravity, the absolute orientation in space, the fraction of the superpixel that is vertical, and the fraction that is horizontal; and (2) height above the ground: we use the height above the lowest point in the image as a surrogate for the height from the supporting ground plane, and use as features the minimum and maximum height above ground, and the mean and median height of the horizontal part of the superpixel.

Size Features These features capture the spatial extent of the superpixel. They include the size of the 3D bounding rectangle, the surface area (total area, vertical area, horizontal area facing up, horizontal area facing down), whether the superpixel is clipped by the image, and what fraction of the convex hull is occluded.

Shape Features These include the planarity of the superpixel (estimated by the error of the plane fit), the average strength of the local geometric gradients inside the region, on the boundary of the region and outside the region, and the average orientation of patches in the regions around the superpixel. These features are relatively crude and could be replaced by richer features such as spin images (Johnson and Hebert 1999) or 3D shape contexts (Frome et al. 2004).

In total, these add up to 101 features each for the superpixel and its amodal completion.
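To make the feature definitions concrete, the sketch below computes a handful of the geocentric pose features for one region. It assumes per-pixel 3D points and normals and a unit gravity estimate, and the 30-degree threshold separating vertical from horizontal is an assumption.

```python
import numpy as np

def geocentric_pose_features(points, normals, gravity, floor_height,
                             vert_thresh=30.0):
    """A handful of geocentric pose features for one region (illustrative).

    points:  (N, 3) 3D points of the superpixel (or its amodal completion).
    normals: (N, 3) unit surface normals at those points.
    gravity: (3,) unit gravity-direction estimate.
    floor_height: height (along -gravity) of the lowest point in the image,
                  used as a surrogate for the supporting ground plane.
    """
    up = -gravity
    # Angle of each local normal with respect to the "up" direction.
    angles = np.degrees(np.arccos(np.clip(normals @ up, -1.0, 1.0)))
    frac_horizontal = np.mean((angles < vert_thresh) |
                              (angles > 180.0 - vert_thresh))
    frac_vertical = np.mean(np.abs(angles - 90.0) < vert_thresh)

    # Height above the lowest point in the image.
    heights = points @ up - floor_height
    return {
        'mean_angle_with_gravity': float(angles.mean()),
        'frac_horizontal': float(frac_horizontal),
        'frac_vertical': float(frac_vertical),
        'min_height': float(heights.min()),
        'max_height': float(heights.max()),
    }
```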

5.1.2 Category Specific Features

In addition to the features above, we train one-versus-rest SVM classifiers based on the appearance and shape of the superpixel, and use the SVM scores for each category as features along with the other features mentioned above. To train these SVMs, we use (1) histograms of vector-quantized color SIFT (van de Sande et al. 2010) as the appearance features, and (2) histograms of geocentric textons (vector-quantized words in the joint 2-dimensional space of height from the ground and local angle with the gravity direction) as shape features. This gives 40 features each for the superpixel and its amodal completion.
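A simplified stand-in for the geocentric texton histogram is sketched below, replacing the learned vector quantization with a uniform grid over the 2D (height, angle-with-gravity) space; the grid resolution and height range are assumptions.

```python
import numpy as np

def geocentric_texton_histogram(heights, angles, n_h=6, n_a=6, h_max=3.0):
    """Histogram of 'geocentric textons' for one superpixel (illustrative).

    heights: (N,) height above ground for each pixel in the superpixel.
    angles:  (N,) angle (degrees) between the local normal and gravity.
    """
    # Quantize the joint (height, angle) space into n_h x n_a "words".
    h_bin = np.clip((heights / h_max * n_h).astype(int), 0, n_h - 1)
    a_bin = np.clip((angles / 180.0 * n_a).astype(int), 0, n_a - 1)
    words = h_bin * n_a + a_bin

    # Normalized histogram of word occurrences over the superpixel.
    hist = np.bincount(words, minlength=n_h * n_a).astype(float)
    return hist / (hist.sum() + 1e-6)
```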

5.2 Results

With the features described above, we experiment with two different types of classifiers: (1) random forest classifiers with 40 trees, with randomization happening both across features and across training points for each tree (we use the TreeBagger function in MATLAB), and (2) SVM classifiers with additive kernels. At test time, both methods give a posterior probability for each superpixel of belonging to each of the 40 classes, and we assign the most probable class to each superpixel.
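For readers who want to approximate the classification stage, an additive-kernel SVM can be emulated with an explicit feature map. The sketch below uses scikit-learn's additive-chi2 approximation with placeholder data and is only a stand-in for the exact formulation of Maji et al. (2013) and the probabilistic outputs used in the paper.

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
# Placeholder data: non-negative superpixel feature vectors and class labels.
X = rng.rand(500, 141)                 # e.g. 101 generic + 40 category features
y = rng.randint(0, 40, size=500)       # one of the 40 semantic classes

# An explicit additive-chi2 feature map followed by a linear SVM approximates
# an additive-kernel SVM; the argmax of the decision scores stands in for the
# calibrated posteriors assigned to superpixels at test time.
clf = make_pipeline(AdditiveChi2Sampler(sample_steps=2), LinearSVC(C=1.0))
clf.fit(X, y)
scores = clf.decision_function(X)      # (n_superpixels, n_classes)
labels = scores.argmax(axis=1)         # most probable class per superpixel
```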

We use the standard split of NYUD2 with 795 training set images and 654 test set images for evaluation. To prevent over-fitting because of retraining on the same set, we train our category specific SVMs only on half of the train set.

Performance on the 40 category task We measure the performance of our algorithm using the Jaccard index (true predictions divided by the union of predictions and true labels; the same metric used for evaluation in the PASCAL VOC segmentation task) between the predicted pixels and the ground truth pixels for each category. As an aggregate measure, we look at the frequency-weighted average of the class-wise Jaccard index (fwavacc), but for completeness we also report the unweighted average of the Jaccard index (avacc) and the pixel-level classification accuracy (pixacc). To understand the quality of the classifier for each individual category independent of calibration, we also compute maxIU, the maximum intersection over union over all thresholds of the classifier score for each category individually, and report the average of these values, denoted mean(maxIU).
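Given a pixel-level confusion matrix, the aggregate metrics (other than maxIU, which requires the raw classifier scores) can be computed as follows:

```python
import numpy as np

def segmentation_metrics(conf):
    """Aggregate metrics from an (n_classes, n_classes) confusion matrix.

    conf[i, j] = number of pixels with ground-truth class i predicted as j.
    Returns the per-class Jaccard index and the fwavacc, avacc and pixacc
    aggregates used in the text.
    """
    conf = conf.astype(float)
    tp = np.diag(conf)
    gt = conf.sum(axis=1)            # pixels of each ground-truth class
    pred = conf.sum(axis=0)          # pixels predicted as each class
    jaccard = tp / np.maximum(gt + pred - tp, 1e-9)

    pixacc = tp.sum() / conf.sum()          # pixel accuracy
    avacc = jaccard.mean()                  # unweighted mean Jaccard index
    freq = gt / conf.sum()
    fwavacc = (freq * jaccard).sum()        # frequency-weighted average
    return jaccard, fwavacc, avacc, pixacc
```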

We report the performance in Table 6 (first 4 rows in each of the two tables). As baselines, we use Silberman et al. (2012)-Structure Classifier, where we retrain their structure classifiers for the 40 class task, and Ren et al. (2012), where we again retrain their model for this task on this dataset using code available on their website. We observe that we do well on scene surfaces (walls, floors, ceilings, cabinets, counters) and most furniture items (bed, chairs, sofa). We do poorly on small objects, due to limited training data and weak shape features (our features are designed to describe large scene-level surfaces and objects). We also consistently outperform the baselines. Figure 3 presents some qualitative examples.

Ablation Studies In order to gain insight into how much each type of feature contributes to the semantic segmentation task, we conduct an ablation study by removing parts from the final system. We report our observations in Table 7. Randomized decision forests (RF) work slightly better than SVMs when using only generic or only category-specific features, but SVMs are able to combine information more effectively when using both sets of features. Using features from amodal completion also provides some improvement. Silberman et al. (2012)-SP: we also retrain our system on the superpixels from Silberman et al. (2012) and obtain better performance than Silberman et al. (2012) \((36.51)\), indicating that the gain in performance comes from better features and not just from better bottom-up segmentation. Ren et al. (2012) features: we also tried the RGB-D kernel descriptor features from Ren et al. (2012) on our superpixels, and observe that they do slightly worse than our category-specific features. We also analyse the importance of our RGB-D bottom-up segmentation, and report the performance of our system when used with RGB-based superpixels from Arbelaez et al. (2011) (SVM color sp). We note that an improved bottom-up segmentation boosts the performance of the semantic segmentation task.

Table 7 Ablation study on half of the train set: all components of our semantic segmentation system contribute to the performance

Performance on the NYUD2 4 category task We compare our performance with existing results on the super-ordinate category task as defined in Silberman et al. (2012) in Table 8. To generate predictions for the super-ordinate categories, we simply retrain our classifiers to predict the 4 super-ordinate category labels. As before, we report the pixel-wise Jaccard index for the different super-categories. Note that this metric is independent of the segmentation used for recognition, and measures the end-to-end performance of the system, unlike the metric originally used by Silberman et al. (2012) (which measures performance in terms of accuracy of predictions on superpixels, which vary from segmentation to segmentation). As before, we report the fwavacc, avacc, pixacc and mean(maxIU) aggregate metrics. As baselines, we compare against Silberman et al. (2012) and Ren et al. (2012).

Table 8 Performance on the 4 class task: comparison with Silberman et al. (2012), Ren et al. (2012) on the 4 super-ordinate categories task

6 Detectors and Scene Context for Semantic Segmentation

The features proposed in Sect. 5 are used to classify each superpixel independently and do not encode full-object information. To address this limitation, we propose augmenting the features for a superpixel with additional features computed from the activations of object detectors (which have access to whole-object information) and scene classifiers (which have access to the whole image). The features from object detector activations provide the missing top-down information, and the scene classifier outputs provide object-scene context (of the form that night stands occur more frequently in bedrooms than in living rooms). In this section, we describe how we compute these features and show experimental results which illustrate that adding these features improves performance on the semantic segmentation task. Figure 4 shows examples of error modes that get fixed by these additional features.

Fig. 4

Examples illustrating where object detectors and scene classification help: semantic segmentation output improves as we add features from object detector activations and scene classifiers (going from left image to right image)

6.1 Detector Activations Features

We compute the output of the RGB-D detector that we trained in Sect. 4, and apply the standard DPM non-maximum suppression. Then, for each class, we pick a threshold on the detector score such that the detector obtains a precision of \(p (=0.50)\) on the validation set. We then prune away all detections with a score smaller than this threshold, and use the remaining detections to compute features for each superpixel. This pruning can be seen as introducing a non-linearity on the detection scores, allowing the classifier to use information from the good detections more effectively without being influenced by the bad detections, which are not as informative.

For each superpixel and each category for which we have a detector, we consider all detections whose bounding box overlaps with the bounding box of the superpixel. Among these, we pick the detection with maximum overlap, and then compute the following features between the superpixel and the picked detection: the score of the selected detection, the overlap between the detector and superpixel bounding boxes, and the mean and median depth in the detector box and in the superpixel.
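A sketch of this feature computation for a single superpixel and category is given below; returning a zero vector when no detection overlaps the superpixel is an assumption.

```python
import numpy as np

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter + 1e-9)

def detector_features(sp_box, sp_depths, detections, depth_image):
    """Detector-activation features for one superpixel and one category.

    sp_box: bounding box of the superpixel; sp_depths: depths of its pixels.
    detections: list of (box, score) pairs that survived the precision-based
    pruning described above.
    """
    overlaps = [(box_iou(sp_box, box), box, score) for box, score in detections]
    overlaps = [o for o in overlaps if o[0] > 0]
    if not overlaps:
        return np.zeros(6)   # no overlapping detection for this category
    iou, box, score = max(overlaps, key=lambda o: o[0])

    x1, y1, x2, y2 = [int(v) for v in box]
    det_depths = depth_image[y1:y2, x1:x2].ravel()
    return np.array([score, iou,
                     det_depths.mean(), np.median(det_depths),
                     sp_depths.mean(), np.median(sp_depths)])
```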

With these additional features, we train the same superpixel classifiers that we trained in Sect. 5. We report the performance in Tables 6 and 8: our + det (RGB) corresponds to using RGB DPMs to compute these features, and our + det corresponds to using our proposed RGB-D DPM detectors. We observe very little improvement when using RGB DPMs, but a large improvement when using RGB-D DPMs, for which we see gains across all aggregate metrics and for most of the categories for which we added detectors (marked with a dagger (\(\dag \)) in the tables).

6.2 Scene Classifier Features

We use the scene label annotations provided in the NYUD2 dataset (we only consider the 9 most common scene categories and map the remaining ones into a class 'other') to train a scene classifier. To train these scene classifiers, we compute features by average pooling the prediction for each of the 40 classes over a \(1, 2\times 2, 4\times 4\) spatial pyramid (Lazebnik et al. 2006), and train an additive kernel SVM. We find that these features perform comparably to the other baseline features that we tried (see the Appendix).
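A sketch of the spatial pyramid pooling used to build these scene features is shown below, assuming per-pixel class probabilities from the Sect. 5 classifiers as input.

```python
import numpy as np

def scene_features(class_probs, levels=(1, 2, 4)):
    """Spatial-pyramid scene features from per-pixel class predictions.

    class_probs: (H, W, 40) per-pixel class probabilities (or one-hot
    predictions). The 1, 2x2 and 4x4 grid matches the pyramid described in
    the text; average pooling is used in each grid cell.
    """
    H, W, C = class_probs.shape
    feats = []
    for n in levels:
        ys = np.linspace(0, H, n + 1).astype(int)
        xs = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = class_probs[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                feats.append(cell.mean(axis=(0, 1)))   # average pooling
    return np.concatenate(feats)    # length 40 * (1 + 4 + 16) = 840
```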

We then use these scene classifiers to compute additional features for the superpixels in the image, and train the same superpixel classifiers that we trained in Sect. 5. We report the performance in Tables 6 and 8 (our + scene). We observe a consistent improvement, comparable to the improvement obtained with the detector activation features. As a final experiment, we use both the scene classifier features and the object detector activation features, and see a further improvement in performance.

7 Conclusion

We have developed a set of algorithmic tools for perceptual organization and recognition in indoor scenes from RGB-D data. Our system produces contour detection, hierarchical segmentation, grouping by amodal completion, object detection and semantic labeling of objects and scene surfaces. We report significant improvements over the state-of-the-art in all of these tasks.