
1 Introduction

Semantic segmentation is an exciting computer vision task with many potential applications in robotics, intelligent transportation systems and image retrieval. Its goal is to associate each pixel with a high-level label such as sky, tree or person. Most successful approaches in the field rely on dense, strongly supervised multi-class classification of the image window centered at the considered pixel [28]. Such pixel-level classification of image windows resembles the localization task [30], which is also concerned with finding objects in images. However, the two tasks are trained differently. Positive object localization windows have a well-defined spatial extent: they are tightly aligned around particular instances of the considered class. The class of a semantic segmentation window, on the other hand, is determined exclusively by the kind of object which projects to the central pixel of the window. The window size therefore does not affect the semantic segmentation outcome, which leads to contrasting requirements. In the case of featureless or ambiguous texture, large windows have to be considered in order to extract information from the context [7]. In the case of small distinctive objects, one has to focus on a small neighborhood, since off-object pixels may provide misleading classification cues. This suggests that a pixel-level classifier is likely to perform better if supplied with a local image representation at multiple scales [4, 11, 21] and multiple levels of detail [22, 24].

We especially consider applications in intelligent transportation systems and robotics. We note that images acquired from vehicles and robots are quite different from images taken by humans. Images taken by humans always have a purpose: the photographer wants something to be seen in the image. A vehicle-mounted camera, on the other hand, operates independently of the pose of the objects in the scene: it simply acquires a fresh image every 40 milliseconds. Hence, the role of the context [7] in car-borne datasets [6, 12] differs from its role in datasets acquired by humans [9]. In particular, objects in car-borne datasets (cars, pedestrians, riders, etc.) are likely to appear at a wide variety of scales due to forward camera motion. This is not the case in datasets taken by humans, where the majority of objects appear at particular scales determined by rules of artistic composition. This suggests that paying special attention to object scale in car-borne imagery may bring a considerable performance gain.

One approach to address scale-related problems would be to perform a joint dense recovery of depth and semantic information [8, 20]. If the depth recovery is successful, the classification network gets an opportunity to leverage that information and improve performance. However, these methods have limited accuracy and require training with depth groundtruth. Another approach would be to couple semantic segmentation with reconstructed [23] or measured [2] 3D information. However, the pixel-level classifier may not be able to exploit this information successfully. Yet another approach would be to use depth to generate better object proposals [2, 5]. However, proposing instance locations in crowded scenes may be a harder task than classifying pixels.

In this paper we present a novel technique for scale-invariant training and inference in stereoscopic semantic segmentation. Unlike previous approaches, we address scale invariance directly, by leveraging reconstructed depth information [13, 32] to disentangle object appearance from object scale. We realize this idea by introducing a novel scale selection layer into a deep network operating on the source image pyramid. The resulting scale invariance substantially improves the segmentation performance with respect to the baseline.

2 Related Work

Early semantic segmentation work was based on multi-scale hand-crafted filter banks [26, 28] with limited receptive fields (typically less than 50 \(\times \) 50 pixels). Recent approaches [3, 4, 21, 22] leverage the computational power of GPUs [11] and the capacity of deep convolutional architectures [19] to process pixel neighborhoods with ImageNet-grade classifiers. These classifiers typically possess millions of trainable parameters, while their receptive fields may exceed 200 \(\times \) 200 pixels [29]. The capacity of these architectures is dimensioned for associating an unknown input image with one of 1000 diverse classes, and they typically see around a million images during ImageNet pre-training plus billions of patches during semantic segmentation training. An architecture trained for ImageNet classification can be transformed into a fully convolutional form by converting the fully-connected layers into equivalent convolutional layers with the same weights [15, 22, 27]. The resulting fully convolutional network outputs a dense W/s \(\times \) H/s \(\times \) 1000 multi-class heat map tensor, where W \(\times \) H are the input dimensions and s is the subsampling factor due to pooling. The number of outputs of the last layer can then be set to the number of classes in the specific application, and the network is ready to be fine-tuned for the semantic segmentation task.
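
For illustration, the following PyTorch sketch (ours, not taken from the cited implementations) converts a hypothetical VGG-style fully-connected layer into an equivalent convolution; the layer sizes are assumed example values.

```python
import torch
import torch.nn as nn

# Minimal sketch: convert a fully-connected classifier layer into an
# equivalent convolution so the network can be slid over larger images.
# Assumes the FC layer was trained on 7x7x512 feature maps (VGG-style).
fc = nn.Linear(512 * 7 * 7, 1000)            # pretrained fully-connected layer
conv = nn.Conv2d(512, 1000, kernel_size=7)   # equivalent convolutional layer

with torch.no_grad():                        # reuse the FC weights unchanged
    conv.weight.copy_(fc.weight.view(1000, 512, 7, 7))
    conv.bias.copy_(fc.bias)

# On a larger feature map the converted layer yields a dense grid of class
# scores instead of a single prediction vector.
features = torch.randn(1, 512, 20, 30)       # e.g. features of a wider input
scores = conv(features)                      # shape: 1 x 1000 x 14 x 24
```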

Convolutional application of ImageNet architectures typically results in considerable downsampling of the output activations with respect to the input image. Some researchers have countered this effect with trained upsampling [22], which may be reinforced by taking into account switches from the strided pooling layers [16, 25]. Other ways to achieve the same goal include interleaved pooling [27] and dilated convolutions [22, 31]. These approaches typically improve the baseline performance in the vicinity of object borders due to more accurate upsampling of the semantic maps. We note that the system presented in our experiments does not feature any of these techniques, yet it still delivers competitive performance.

As emphasized in the introduction, presenting a pixel-level classifier with a variety of local image representations is likely to improve semantic segmentation performance. Previous researchers have devised two convolutional approaches to implement this idea: the skip architecture and the shared multi-scale architecture. The shared multi-scale architectures concatenate activations obtained by applying a common pixel-level classifier at multiple levels of the image pyramid [4, 11, 20, 21, 29]. The skip architectures concatenate activations from different levels of the deep convolutional hierarchy [22, 24]. Both architectures have their merits. The shared multi-scale architecture is able to associate the evaluated window with the training dataset at unseen scales. The skip architecture allows modeling the object appearance [30] and the surrounding context at different levels of detail, which may be especially appropriate for small objects. It therefore appears that the best results might be obtained by a combined approach, which we call a multi-scale skip architecture. The combined approach concatenates pixel representations taken at different image scales and at different levels of the deep network. We note that this idea does not appear to have been addressed in previous work.

Despite extremely large receptive fields, pixel-level classification may still fail to establish consistent activations in all cases. Most problems of this kind concern smooth parts of very large objects: for example, a pixel in the middle of a tram may get classified as a bus. A common approach to such problems is to require pairwise agreement between pixel-level labels, which leads to a global optimization across the entire image. This requirement is often formulated as MAP inference in conditional random fields (CRF) with unary, pairwise [3, 18, 21] and higher-order potentials [1]. Early methods allowed pairwise potentials exclusively between neighboring pixels; this requirement has later been relaxed by defining pairwise potentials as linear combinations of Gaussian kernels. In this case, the message passing step in approximate mean field inference can be expressed as a convolution with a truncated Gaussian kernel in the feature space [3, 18]. Recent state-of-the-art approaches couple CRF training and inference in custom convolutional [21] and recurrent [1] deep neural networks. We note that our present experiments feature a separately trained CRF with Gaussian potentials, while our future work shall include joint CRF training [21].

We now review the details of the previous research which is most closely related to our contributions. Banica et al. [2] exploit the depth sensed by RGBD sensors to improve the quality of region proposals and the subsequent region-level classification on indoor datasets. Chen et al. [4] propose a scale attention mechanism which combines classification scores at different scales, resulting in soft pooling of the classification scores across scales. Ladicky and Shi [20] propose to train binary pixel-level classifiers which detect semantic labels at some canonical depth by exploiting depth groundtruth obtained by LIDAR. Their inference jointly predicts semantic segmentation and depth by processing multiple levels of the monocular image pyramid. Unlike all previous approaches, our technique achieves efficient classification and training due to a scale-invariant image representation recovered by exploiting reconstructed depth. Unlike [4], we perform hard scale selection at the representation level. Unlike [20], we exploit the reconstructed instead of the groundtruth depth.

3 Fully Convolutional Architecture with Scale Selection

We integrate the proposed technique into an end-to-end trained fully convolutional architecture illustrated in Fig. 1. The proposed architecture independently feeds images from an N-level image pyramid into the shared feature extraction network. Features from the lower levels of the pyramid are upsampled to match the resolution of features from the original image. We forward the recovered multi-scale representation to the pixel-wise scale selection multiplexer. The multiplexer is responsible for establishing a scale-invariant image representation in accordance with the reconstructed depth at the particular pixel. The back-end classifier scores the scale-invariant features with a multi-class classification model. The resulting score maps are finally converted into the per-pixel distribution over classes by a conventional softmax layer.

3.1 Input Pyramid and Depth Reconstruction

The left image of the input stereo pair is iteratively subsampled to produce a multi-scale pyramid representation. The first pyramid level contains the left input image. Each successive level is obtained by subsampling its predecessor with the factor \(\alpha \). If the original image resolution is W \(\times \) H, the resolution of the l-th level is W\(/\alpha ^l\times \) H \(/\alpha ^l\), \(l\in [0..\text {N}-1]\) (\(\alpha \) = 1.3, N = 8). We reconstruct the depth by employing a deep correspondence metric [32] and the SGM [13] smoothness prior. The resolution of the disparity image is W \(\times \) H.
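
For concreteness, the pyramid geometry can be sketched in a few lines of Python; the rounding of non-integer resolutions is our assumption, and the depth reconstruction with the correspondence metric [32] and SGM [13] is not shown.

```python
# Minimal sketch of the input pyramid geometry (alpha = 1.3, N = 8).
# For a 2048 x 1024 input this prints the resolution of every pyramid level.
ALPHA, N = 1.3, 8
W, H = 2048, 1024

for l in range(N):
    w_l, h_l = round(W / ALPHA ** l), round(H / ALPHA ** l)
    print(f"level {l}: {w_l} x {h_l}")
```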

Fig. 1. A convolutional architecture with scale-invariant representation.

3.2 Single Scale Feature Extraction

Our single-scale feature extraction architecture is based on the feature extraction front-end of the 16-layer VGG-D network [29] up to the relu5_3 layer (13 weight layers in total). In order to improve training, we introduce a batch normalization layer [14] before each non-linearity in the 5th group: relu5_1, relu5_2 and relu5_3. This modification helps fine-tuning by increasing the flow of gradients during backprop. Subsequently, we perform a 2 \(\times \) 2 nearest-neighbor upsampling of the relu5_3 features and concatenate them with the pool3 features in the spirit of skip architectures [22, 24] (adding pool4 features did not result in significant benefits). In comparison with relu5_3, the representation from pool3 has a 2 \(\times \) 2 higher resolution and a smaller receptive field (40 \(\times \) 40 vs 185 \(\times \) 185 pixels). We hypothesize that this saves some network capacity because it relieves the network from propagating small objects through all 13 convolutional and 4 pooling layers.
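
The upsample-and-concatenate step can be sketched in PyTorch as follows; the tensors are random placeholders with the channel counts and resolutions given above, and the VGG-D front-end itself (with the added batch normalization) is omitted.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the skip concatenation, with random tensors standing in
# for the actual VGG-D activations.
B, H, W = 1, 1024, 2048
pool3  = torch.randn(B, 256, H // 8,  W // 8)    # 1/8 resolution, 256 maps
relu53 = torch.randn(B, 512, H // 16, W // 16)   # 1/16 resolution, 512 maps

# 2x2 nearest-neighbor upsampling of relu5_3, then channel-wise concatenation.
relu53_up = F.interpolate(relu53, scale_factor=2, mode="nearest")
features  = torch.cat([relu53_up, pool3], dim=1)  # B x (512 + 256) x H/8 x W/8
```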

The described feature extraction network is independently applied at all levels of the pyramid, in the spirit of the shared multi-scale architectures [4, 11, 20, 21, 29]. Subsequently, we upsample the representations of pyramid levels 1 to N−1 in order to revert the effects of subsampling and to restore a common resolution across the representations at all scales. We perform the upsampling with a nearest-neighbor algorithm in order to preserve the sparsity of the features. After upsampling, all N feature tensors have the resolution W/8 \(\times \) H/8 \(\times \) (512 + 256). The 8 \(\times \) 8 subsampling is due to three pooling layers with stride 2. Features from relu5_3 have 512 dimensions, while features from pool3 have 256 dimensions.

The described procedure produces a multi-scale convolutional representation of the input image. A straightforward approach to exploiting this representation would be to concatenate the features at all N scales. However, that would imply a huge computational complexity which would make training and inference infeasible. One could also perform such a concatenation over some subset of scales, but that would require a costly validation to choose the subset while providing less information to the back-end classification network. Consequently, we proceed towards achieving scale invariance as the main contribution of our work.

3.3 Scale Selection Multiplexer

The responsibility of the scale selection multiplexer is to represent each image pixel with scale-invariant convolutional features extracted at exactly M = 3 out of the N levels of the pyramid. The scale invariance is achieved by choosing, for each reference metric scale, the pyramid level in which its apparent size is closest to the receptive field of our features.

In order to explain the details, we first establish the notation. We denote the image pixels as \(p_i\), the corresponding disparities as \(d_i\), the stereo baseline as b, and the reconstructed depths as \(Z_i\). We then denote the width of the receptive field for our largest features (conv5_3) as \(w_{rf}\,=\,185\), the metric width of its back-projection at distance \(Z_i\) as \(W_i\) and the three reference metric scales in meters as \(W_{R}=\{1,4,7\}\). Finally, we define \(s_{mi}\) as the ratio between the m-th reference metric scale \(W_{Rm}\) and the back projection \(W_i\) of the receptive field:

$$\begin{aligned} s_{mi} = \frac{W_{Rm}}{W_i} = \frac{W_{Rm}}{\frac{b}{d_i} w_{rf}} = \frac{d_i \cdot W_{Rm}}{b \cdot w_{rf}} \;. \end{aligned}$$
(1)

The ratio \(s_{mi}\) represents the exact factor by which we would need to downsample the original image to attain the reference scale m at pixel i. We can therefore choose the representation from the pyramid level \(\hat{l}_{mi}\) whose downsampling factor is closest to the true factor \(s_{mi}\):

$$\begin{aligned} \hat{l}_{mi} = \underset{l}{\arg \!\min } \left| \alpha ^{l} - s_{mi} \right| , \;\;\;\; l\in \{0, 1, ..., N-1\} \;. \end{aligned}$$
(2)

The multiplexer determines the routing information at pixel \(p_i\) by mapping each of the M reference scales to the corresponding pyramid level \(\hat{l}_{mi}\). We illustrate the recovered \(\hat{l}_{mi}\) in Fig. 2 by color coding the computed pyramid levels at the three reference metric scales. Note that when \(s_{mi} < 1\) we simply choose the first pyramid level (\(l=0\)). We have not experimented with upsampled levels of the pyramid, mostly because of memory limitations. The output of the multiplexer is a scale-invariant image representation stored in M feature tensors of dimension W/8 \(\times \) H/8 \(\times \) (512 + 256).
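
The following PyTorch sketch illustrates Eqs. (1) and (2) and the per-pixel routing with random stand-ins for the real activations; the stereo baseline value and the tensor sizes are assumed placeholders, not constants from the paper.

```python
import torch

# Minimal sketch of the scale selection multiplexer (Eqs. 1 and 2). Assumed
# inputs: a disparity map already resampled to the feature resolution h x w
# and a list of N feature tensors of shape C x h x w (one per upsampled
# pyramid level).
ALPHA, N = 1.3, 8
W_R = [1.0, 4.0, 7.0]          # reference metric scales [m]
b, w_rf = 0.22, 185.0          # assumed stereo baseline [m], receptive field [px]

def multiplex(disparity, pyramid_feats):
    stacked = torch.stack(pyramid_feats)                     # N x C x h x w
    levels = ALPHA ** torch.arange(N, dtype=torch.float32)   # downsampling factors
    outputs = []
    for w_r in W_R:
        s = disparity * w_r / (b * w_rf)                     # Eq. (1), h x w
        l_hat = (levels[:, None, None] - s).abs().argmin(0)  # Eq. (2), h x w
        # Pick, for every pixel, the features of its selected pyramid level
        # (s < 1 naturally maps to level 0, the original resolution).
        onehot = torch.nn.functional.one_hot(l_hat, N).permute(2, 0, 1)
        outputs.append((stacked * onehot[:, None].float()).sum(0))  # C x h x w
    return torch.cat(outputs, dim=0)                         # (M*C) x h x w

# Example with random stand-ins for the real activations and disparities.
h, w, C = 32, 64, 512 + 256
feats = [torch.randn(C, h, w) for _ in range(N)]
disparity = torch.rand(h, w) * 128.0
scale_invariant = multiplex(disparity, feats)                # (3*C) x h x w
```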

Fig. 2. Visualization of the scale selection switches. From left to right: original image, switches for the three reference metric scales of 1 m, 4 m and 7 m.

3.4 Back-End Classifier

The scale-invariant feature tensors are concatenated and passed on to the classification subnetwork, which consists of one 7 \(\times \) 7 and two 1 \(\times \) 1 convolution+ReLU layers. The first two of these layers have 1024 feature maps and batch normalization before the non-linearities. The number of feature maps in the last 1 \(\times \) 1 convolutional layer corresponds to the number of classes. The resulting class scores are passed to the pixel-wise softmax layer to obtain a distribution across classes for each pixel.
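
A minimal PyTorch sketch of this subnetwork is given below; the padding and the input channel count (M = 3 concatenated tensors of 512 + 256 maps, 19 classes) are our assumptions based on the preceding sections.

```python
import torch.nn as nn

# Minimal sketch of the back-end classifier described above.
in_channels, num_classes = 3 * (512 + 256), 19
classifier = nn.Sequential(
    nn.Conv2d(in_channels, 1024, kernel_size=7, padding=3),
    nn.BatchNorm2d(1024),
    nn.ReLU(inplace=True),
    nn.Conv2d(1024, 1024, kernel_size=1),
    nn.BatchNorm2d(1024),
    nn.ReLU(inplace=True),
    nn.Conv2d(1024, num_classes, kernel_size=1),  # per-pixel class scores
)
softmax = nn.Softmax(dim=1)  # converts scores into per-pixel class distributions
```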

4 Experiments

We evaluate our method on two semantic segmentation datasets containing outdoor traffic scenes: Cityscapes [6] and KITTI [12]. The Cityscapes dataset [6] contains images recorded in 50 cities over several months and annotated with 19 classes. The dataset features good and medium weather conditions, a large number of dynamic objects, varying scene layouts and varying backgrounds. It consists of 5000 images with fine annotations and 20000 images with coarse annotations (we use only the fine annotations). The image resolution is 2048 \(\times \) 1024. The dataset includes the stereo views, which we use to reconstruct the depth.

The KITTI dataset [12] provides a large collection of 1241 \(\times \) 376 traffic videos with LIDAR reconstruction groundtruth. Unfortunately, there are no official semantic segmentation annotations for this dataset. However, a collection of 150 images annotated with 11 object classes has been published [26]. We expand that work by annotating the same 11 classes in another 299 images from the same dataset, as well as by fixing some inconsistent annotations in the original dataset. The combined dataset with 399 training and 46 test images is freely available for academic research.

We train our networks using Adam [17] and batch normalization [14] without learnable parameters. Due to memory limitations we use a batch size of one image. The input images are zero-centered and normalized. We initialize the learning rate to \(10^{-5}\), decrease it to \(0.5 \cdot 10^{-5}\) after the 2nd epoch and again to \(10^{-6}\) after the 10th epoch. Before each epoch, the training set is shuffled to eliminate bias. The first 13 convolutional layers are initialized from VGG-D [29] pretrained on ImageNet and fine-tuned during training. All other layers are randomly initialized. In all experiments we train our networks for 15 epochs on the Cityscapes dataset and 30 epochs on the KITTI dataset. We use the softmax cross-entropy loss, summed over all the pixels in a batch. In both datasets the frequency of pixel labels is highly uneven. We therefore perform class balancing by weighting each pixel's loss with the weight factor \(w_c\) of its true class. This factor can be determined from the class frequencies as \(w_c = \min (10^3, p(c)^{-1})\), where p(c) is the frequency of pixels from class c in the batch. In all experiments the three reference metric scales were set to the equidistant values \(W_{R}=\{1,4,7\}\), which were not cross-validated but had a good coverage over the input pixels (cf. Fig. 2). A Torch implementation of this procedure is freely available for academic research.
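
For clarity, a minimal PyTorch sketch of this weighting scheme follows; the ignore label and the use of a weighted cross-entropy call are our assumptions about implementation details not spelled out in the text.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the class balancing described above: each pixel's loss is
# weighted by w_c = min(10^3, 1 / p(c)), with p(c) the class frequency in the
# batch. The ignore_index value is an assumed convention for unlabeled pixels.
def class_weights(labels, num_classes, ignore_index=255):
    valid = labels[labels != ignore_index]
    freq = torch.bincount(valid, minlength=num_classes).float()
    p = freq / freq.sum().clamp(min=1)
    return (1.0 / p.clamp(min=1e-12)).clamp(max=1e3)

# Usage with logits of shape 1 x C x H x W and integer labels of shape 1 x H x W;
# the loss is summed over all pixels of the batch, as in the text.
num_classes = 19
logits = torch.randn(1, num_classes, 96, 192)
labels = torch.randint(0, num_classes, (1, 96, 192))
w = class_weights(labels, num_classes)
loss = F.cross_entropy(logits, labels, weight=w, ignore_index=255, reduction="sum")
```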

In order to alleviate the downsampling effects and improve consistency, we postprocess the semantic map with a fully-connected CRF [18]. The negative logarithms of the probability distributions across classes are used as unary potentials, while the pairwise potentials are based on a linear combination of two Gaussian kernels [18]. We fix the number of mean field iterations to 10. The smoothness kernel parameters are fixed at \(w^{(2)} = 3\), \(\theta _\gamma = 3\), while a coarse grid search on 200 Cityscapes images is performed to optimize \(w^{(1)} \in \{5, 10\}\), \(\theta _\alpha \in \{50, 60, 70, 80, 90\}\) and \(\theta _\beta \in \{3, 4, 5, 6, 7, 8, 9\}\).
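
A sketch of this post-processing step using the publicly available pydensecrf package (our choice of library; the text does not name a specific CRF implementation) might look as follows.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

# Sketch of the fully-connected CRF post-processing with the parameter ranges
# given above. probs is the C x H x W softmax output, image the H x W x 3
# uint8 RGB input; the default parameter values below are assumptions.
def crf_postprocess(probs, image, w1=10, theta_alpha=80, theta_beta=3):
    C, H, W = probs.shape
    d = dcrf.DenseCRF2D(W, H, C)
    d.setUnaryEnergy(unary_from_softmax(probs))   # negative log-probabilities
    # Smoothness kernel, fixed at w2 = 3 and theta_gamma = 3.
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel; w1, theta_alpha, theta_beta come from the grid search.
    d.addPairwiseBilateral(sxy=theta_alpha, srgb=theta_beta,
                           rgbim=np.ascontiguousarray(image), compat=w1)
    Q = d.inference(10)                           # 10 mean field iterations
    return np.argmax(Q, axis=0).reshape(H, W)
```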

The segmentation performance is measured by the intersection-over-union (IoU) score [10] and the pixel accuracy [22]. We first evaluate the performance of two single-scale networks which are obtained by eliminating the scale multiplexer layer and applying our network to full-resolution images (cf. Fig. 1). The first of these is referred to as Single3+5; the second, Single5, is Single3+5 without the representation from pool3. The label ScaleInvariant refers to the full architecture visualized in Fig. 1, while the label FixedScales refers to a similar architecture with a fixed multi-scale representation obtained by concatenating the pyramid levels 0, 3 and 7.
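
The two metrics can be computed from a confusion matrix accumulated over the test set; the following numpy sketch shows one straightforward way to do so (the ignore label is our assumption).

```python
import numpy as np

# Minimal sketch of the reported metrics: per-class IoU and pixel accuracy.
def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    mask = gt != ignore_index
    idx = num_classes * gt[mask].astype(np.int64) + pred[mask].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_and_accuracy(conf):
    tp = np.diag(conf).astype(np.float64)
    iou = tp / (conf.sum(0) + conf.sum(1) - tp + 1e-9)  # per-class IoU
    pixel_acc = tp.sum() / conf.sum()
    return iou, pixel_acc
```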

Table 1. Individual class results on the Cityscapes validation and test sets (IoU scores). The last row gives the proportion of the training set occupied by each class.

First, we show results on the Cityscapes dataset. We downsample the original images and train at a smaller resolution (1504 \(\times \) 672) due to memory limitations. Table 1 shows the results on the validation and test sets. We observe that our scale-invariant network improves over the single-scale approach across all classes. We likewise notice that the concatenation with pool3 is important, as Single3+5 produces better results than Single5, and that the improvement is larger for smaller classes like poles, traffic signs and traffic lights. This supports our hypothesis that the representation from pool3 helps to better handle smaller objects. Furthermore, the scale-invariant network achieves a significant improvement over the multi-scale network with fixed image scales (FixedScales). This agrees with our hypothesis that the proposed scale selection approach should help the network learn a better representation. The table also shows results on the test set (our online submission is entitled Scale invariant CNN + CRF).

The last row in Table 1 gives the proportion of the training set occupied by each class. We notice that the greatest contribution of our approach is achieved for classes which represent smaller objects or objects that we see less often, like buses, trains, trucks and walls. This effect is illustrated in Fig. 3, where we plot the improvement of the IoU metric against the training set proportion of each class. Likewise, we achieve an improvement in pixel accuracy from \(90.1\,\%\) (Single3+5) to \(91.9\,\%\) (ScaleInvariant).

Fig. 3. Improvement of the IoU metric between the Single3+5 and ScaleInvariant architectures with respect to the training set proportion of each class.

Figure 4 shows examples where the scale-invariant network produces better results. The improvement for big objects is clearly substantial. We also often observe that the scale-invariant network can better differentiate between road and sidewalk and between person and rider, especially when they assume a rare appearance, as in the last row, where the cobbled road is easily mistaken for sidewalk.

Fig. 4. Examples where scale invariance helps the most. From left to right: input, groundtruth, baseline segmentation (Single3+5), scale-invariant segmentation.

Table 2 shows the results on the KITTI test set. We notice a significant improvement in the mean class IoU metric, which is, however, smaller than on Cityscapes. The main reason is that KITTI has far fewer small classes (only pole, sign and cyclist). Furthermore, it is a much smaller dataset, which explains why the performance is so low on classes like cyclist and pole. Here we again report an improvement in pixel accuracy, from \(87.63\,\%\) (Single3+5) to \(88.57\,\%\) (ScaleInvariant).

Table 2. Individual class results on the KITTI test set (IoU scores).

5 Conclusion

We have presented a novel technique for improving semantic segmentation performance. We use the reconstructed depth as a guide to produce a scale-invariant representation in which the appearance is decoupled from the scale. This removes the need to recognize objects at all possible scales and allows for an efficient use of the classifier capacity and the training data. This trait is especially important for navigation datasets, which contain objects at a great variety of scales and do not exhibit photographer bias.

We have integrated the proposed technique into an end-to-end trainable fully convolutional architecture which extracts features by a multi-scale skip network. The extracted features are fed to the novel multiplexing layer which carries out dense scale selection at the pixel level and produces a scale-invariant representation which is scored by the back-end classification network.

We have performed experiments on the novel Cityscapes dataset. Our results are very close to the state of the art, despite the fact that we have trained our network at a reduced resolution. We also report experiments on the KITTI dataset, for which we have densely annotated 299 new images, improved 146 already available annotations and released the union of the two datasets to the community. The proposed scale selection approach has consistently contributed substantial increases in segmentation performance. The results show that deep neural networks are extremely powerful classification models; however, they are still unable to learn geometric transformations better than humans.