1 Introduction

Semantic segmentation, a fundamental topic in computer vision, aims to assign a semantic label to every pixel of an image. Recent approaches (Zhao et al. 2017; Chen et al. 2017, 2018b; Zhao et al. 2018b) based on fully convolutional networks (Long et al. 2015) have achieved remarkable accuracy on public benchmarks (Brostow et al. 2008; Cordts et al. 2016; Everingham et al. 2015). Such improvements, however, come at the cost of deeper and less efficient networks, which may not be applicable to many real-time systems, e.g., autonomous driving and video surveillance.

To perform fast semantic segmentation with satisfactory accuracy, the design philosophy of real-time segmentation network architectures mainly concentrates on three aspects: (1) building block design (Li and Kim 2019; Paszke et al. 2016), which considers the block-level feature representation capacity, computational complexity, and receptive field size; (2) network depth and downsampling strategy (Li and Kim 2019; Li et al. 2019a), which directly affect the accuracy and speed of a network, hence real-time networks favor shallow layers and fast downsampling; and (3) feature aggregation (Yu et al. 2018; Zhao et al. 2018a), which fuses multi-scale features to compensate for the loss of spatial details caused by fast downsampling.

The above hand-crafted networks have made huge progress, but they require expertise in architecture design gained through laborious trial and error. To relieve this burden, some researchers have introduced neural architecture search (NAS) methods (Baker et al. 2016; Zoph and Le 2016; Liu et al. 2019b; Xie et al. 2019) into this field and obtained excellent results (Chen et al. 2018a; Liu et al. 2019a; Zhang et al. 2019b; Nekrasov et al. 2019). Liu et al. (2019a) and Chen et al. (2018a) focus on high-quality segmentation instead of real-time applications. To meet the real-time demand, Zhang et al. (2019b) search a customized architecture by introducing a latency loss function. Although its building blocks are searched, the network depth, downsampling strategy, and feature aggregation method are still set by hand in advance and are nonadjustable during searching. Since these three aspects are highly correlated and indispensable for a remarkable real-time segmentation network, fixing them in advance increases the difficulty of finding an optimal real-time architecture (i.e. the best trade-off between accuracy and speed). This motivates us to explore all three aspects automatically during the searching process.

In this paper, we propose a joint search framework that searches for the optimal building blocks, network depth, downsampling strategy, and feature aggregation method simultaneously. Specifically, we propose hyper-cells to decide the network depth and downsampling strategy jointly and automatically via a cell-level pruning process, and an aggregation cell to fuse features from multiple spatial scales automatically. For the hyper-cell, we introduce a novel learnable architecture parameter, so that the network depth and downsampling strategy are fully determined concurrently according to the optimized architecture parameters. For the aggregation cell, we aggregate multi-level features in the network automatically to fuse low-level spatial details and high-level semantic context effectively.

We denote the resulting network as Auto searched Real-Time semantic segmentation network or AutoRTNet. We evaluate AutoRTNet on both Cityscapes (Cordts et al. 2016) and CamVid (Brostow et al. 2008) datasets. The experiments demonstrate the superiority of AutoRTNet, as shown in Fig. 1, where our AutoRTNet achieves the best accuracy-efficiency trade-off.

The main contributions can be summarized as follows:

  • We propose a joint search framework for real-time semantic segmentation that automatically searches for the building blocks, network depth, downsampling strategy, and feature aggregation method simultaneously.

  • We propose the hyper-cell to learn the network depth and downsampling strategy jointly and automatically via the cell-level pruning process, and the aggregation cell to achieve automatic multi-scale feature aggregation.

  • Notably, AutoRTNet has achieved 73.9% mIoU on the Cityscapes test set and 110.0 FPS on an NVIDIA TitanXP GPU card with \(768 \times 1536\) input images.

Fig. 1
figure 1

The inference speed and accuracy of different networks on the Cityscapes test set. Compared with other methods, our AutoRTNet lies in the top-right, since it features lower latency with comparable accuracy. Methods trained using both fine and coarse data are marked with \(*\)

2 Related Work

2.1 Semantic Segmentation

High-quality segmentation FCN (Long et al. 2015) is the pioneering work that greatly promoted the development of semantic segmentation. Extensions to FCN follow many directions. Encoder–decoder structures (Badrinarayanan et al. 2017; Lin et al. 2017a; Noh et al. 2015) combine low-level and high-level features to improve the accuracy of semantic segmentation. DRN (Yu et al. 2017) and DeepLab (Chen et al. 2017, 2018b) use dilated convolution operations to effectively enlarge the receptive field size. To capture multi-scale context information, DeepLabV3 (Chen et al. 2017) and PSPNet (Zhao et al. 2017) propose pyramid modules. Recently, the attention mechanism (Vaswani et al. 2017) has been used in segmentation methods (Fu et al. 2019; Zhang et al. 2019a; Zhao et al. 2018b; Li et al. 2018). These outstanding works are designed for high-quality segmentation, which makes them inapplicable to real-time applications.

Real-time methods Various algorithms have been proposed for real-time semantic segmentation. Some works (Wu et al. 2017) reduce the computation overhead by restricting the size of input images. Channel-pruning algorithms (Paszke et al. 2016; Badrinarayanan et al. 2017) are introduced to boost the inference speed, and most real-time methods focus on designing light-weight and effective network architectures. The design philosophy of real-time network architectures can be summarized in the following three aspects; in our work, we fully explore all three simultaneously.

Building block design The building block design (Paszke et al. 2016; Romera et al. 2017; Mehta et al. 2018; Li and Kim 2019) requires researchers to give sufficient consideration to the computational complexity, feature representation capacity, and receptive field size, which is essential for real-time semantic segmentation. For example, ENet (Paszke et al. 2016) and DABNet (Li and Kim 2019) propose light-weight blocks and stack them with different dilation rates to form a whole network. MobileNet and its variants (Howard et al. 2017; Sandler et al. 2018) use blocks with depth-wise separable convolution in pursuit of light-weight models.

Network depth and downsampling strategy High-quality segmentation networks typically use pre-defined backbones, e.g. ResNet (He et al. 2016) or Xception (Chollet 2017), as encoders. However, for real-time segmentation networks [e.g. DABNet (Li and Kim 2019), DFANet (Li et al. 2019a), ERFNet (Romera et al. 2017)], the network depth and downsampling strategy (i.e. how many layers are in each stage) are determined mostly by hand, as they directly affect the accuracy and speed of the networks. To pursue fast inference, real-time networks favor shallow layers and perform fast downsampling to a factor of 16 or 32.

Feature aggregation The fast downsampling in real-time networks easily results in the loss of spatial details. Thus, multi-scale feature aggregation (Yu et al. 2018; Zhao et al. 2018a; Li et al. 2019a) has been proposed to remedy the loss of spatial details. Zhao et al. (2018a) propose an image cascade network with multi-scale inputs. Yu et al. (2018) decouple the network into context and spatial paths to make the right balance between accuracy and speed. Li et al. (2019a) aggregate multi-scale features from different layers to remedy the loss of spatial details.

2.2 Neural Architecture Search

Overview Neural architecture search (NAS) focuses on automating the network architecture design process. Early NAS methods based on reinforcement learning (Zoph and Le 2016; Baker et al. 2016; Zoph et al. 2018; Tan et al. 2019) or evolutionary algorithms (Miikkulainen et al. 2019; Real et al. 2019) are time-consuming (e.g. thousands of GPU days) and computationally expensive. Recently, the emergence of differentiable NAS methods (Liu et al. 2019b; Xie et al. 2019; Cai et al. 2018) has greatly reduced the search cost while achieving excellent performance. DARTS (Liu et al. 2019b) is the pioneering work on gradient-based NAS: Liu et al. (2019b) propose an iterative optimization framework based on a continuous relaxation of the architecture representation. Xie et al. (2019) constrain the architecture parameters to be approximately one-hot, resolving the inconsistency between optimizing the performance of derived child networks and that of converged parent networks. In addition, FBNet (Wu et al. 2019), ProxylessNAS (Cai et al. 2018), and MnasNet (Tan et al. 2019) propose multi-objective optimization that takes real-world latency into consideration.

In this paper, we propose hyper-cells to jointly and automatically decide the key properties (i.e. the downsampling strategy and the depth of a network) in semantic segmentation. Searching at this network architecture level gives rise to a suitable downsampling strategy and depth for a semantic segmentation network. In contrast, DARTS (Liu et al. 2019b) and SNAS (Xie et al. 2019) only search at the cell level under a fixed network architecture, without considering the intrinsic properties of semantic segmentation. Thus, our search space and those of other NAS methods (Liu et al. 2019b; Xie et al. 2019) are essentially different.

NAS for segmentation DPC (Chen et al. 2018a) is the first work to apply NAS methods to dense image prediction; it searches for a multi-scale representation module. The work most similar to ours is Auto-DeepLab (Liu et al. 2019a), which proposes a hierarchical search space and searches for the downsampling path. Although it also searches for the downsampling strategy, the mechanism is fundamentally different from ours: it designs a network-level continuous relaxation to learn the downsampling path, while we search for the downsampling strategy via a cell-level pruning process. Moreover, Auto-DeepLab cannot search for the network depth or feature aggregation method, and it focuses on high-quality segmentation. For real-time requirements, CAS (Zhang et al. 2019b) searches for an architecture under customized resource constraints and achieves excellent real-time performance. However, our approach can search for the network depth, downsampling strategy, and feature aggregation method, which is significantly different from CAS (Zhang et al. 2019b).

Fig. 2
figure 2

Illustration of our joint network architecture search framework. The network begins with two convolution layers and contains three hyper-cells which search for the optimal network depth and downsampling strategy via the cell-level pruning process. Each hyper-cell contains a reduction cell and n normal cells. The cells marked with the dotted white line are pruned after optimization. The aggregation cell is designed to perform automatic multi-scale feature aggregation effectively, and it seamlessly integrates the outputs of hyper-cells

NAS for object detection The combination of multi-scale features is also essential for object detection (Lin et al. 2017b; Liu et al. 2016). In the field of NAS, NAS-FPN (Ghiasi et al. 2019) and Auto-FPN (Xu et al. 2019) search for architectures that merge features of varying dimensions and succeed at finding appropriate combination methods. Unlike us, Ghiasi et al. (2019) propose the merging cell and use an RNN controller to select candidate feature layers and a binary operation in each merging cell; their search space consists of only two binary operations, i.e. sum and global pooling, for simplicity. Xu et al. (2019) search for an efficient feature fusion module, and their search space is specially designed for detection and flexible enough to cover many popular detector designs. Thus, the search space design, motivation, and implementation of the above methods are significantly different from ours.

3 Methods

In this section, we illustrate the proposed real-time semantic segmentation network search framework in detail. First, we briefly introduce an overview of the proposed framework. Second, we describe the differentiable architecture search. Next, we elaborate on the proposed hyper-cell for joint network depth and downsampling search. Finally, we illustrate the proposed aggregation cell for automatic multi-scale feature aggregation.

3.1 Overview

The joint search framework is shown in Fig. 2. We propose the hyper-cell to search for the optimal network depth and downsampling strategy, as they directly affect the accuracy and speed of a network. To remedy the loss of spatial details caused by fast downsampling, a novel aggregation cell is proposed for automatic multi-scale feature aggregation. The whole framework contains two pre-defined convolution layers, three hyper-cells, and an aggregation cell. A multi-scale module (Chen et al. 2017) is subsequently used to extract global and local context for the final prediction. For real-time demands, we take real-world latency into consideration during the searching process.

3.2 Differentiable Architecture Search

Intra-cell search space

The hyper-cell is the building block of the network, and the cell is the basic component unit of the hyper-cell, as shown in Fig. 2. There are two types of cells, i.e., normal cells and reduction cells (Liu et al. 2019b; Xie et al. 2019). A reduction cell reduces the feature map size by a factor of 2 for downsampling, while the factor is 1 in normal cells.

A cell is a directed acyclic graph (DAG) consisting of an ordered sequence of N nodes, denoted by \({\mathcal {N}}\) = \(\{x^{(1)},\ldots ,x^{(N)}\}\). Each node \(x^{(i)}\) is a latent representation (i.e. feature map), and each directed edge \(\left( i, j \right) \) is associated with some candidate operations (e.g. conv, pooling) in an operation set \({\mathcal {O}}^{(i, j)}\), representing all possible transformations from \(x^{(i)}\) to \(x^{(j)}\). Each cell has two inputs (the outputs of the previous two cells) and one output (the concatenation of all intermediate nodes in the cell). The structure of a cell is shown on the right of Fig. 3. Each intermediate node \(x^{(j)}\) is computed based on all of its predecessors:

$$\begin{aligned} x^{(j)} = \sum \limits _{i < j} {\widetilde{o}}^{(i, j)} \big ( x^{(i)} \big ), \end{aligned}$$
(1)

where \({\widetilde{o}}^{(i, j)} \in {\mathcal {O}}^{(i, j)}\) is the optimal operation at edge \((i, j)\).
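To make the DAG computation concrete, the following minimal sketch (illustrative PyTorch, not the authors' released code; `ops` is a hypothetical mapping from each edge \((i, j)\) to its chosen operation) evaluates Eq. (1) for a cell with two inputs and a concatenated output:

```python
import torch
import torch.nn as nn

class Cell(nn.Module):
    """A cell as a DAG: two inputs, `num_nodes` intermediate nodes,
    and an output that concatenates the intermediate nodes."""
    def __init__(self, ops, num_nodes):
        super().__init__()
        # `ops` maps edge (i, j) to its (already selected) operation.
        self.ops = nn.ModuleDict({f"{i}->{j}": op for (i, j), op in ops.items()})
        self.num_nodes = num_nodes

    def forward(self, s0, s1):
        states = [s0, s1]  # outputs of the previous two cells
        for j in range(2, self.num_nodes + 2):
            # Eq. (1): each intermediate node sums the transformed
            # representations of all of its predecessors.
            states.append(sum(self.ops[f"{i}->{j}"](states[i])
                              for i in range(j)))
        return torch.cat(states[2:], dim=1)  # concatenate intermediate nodes
```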

In order to determine the optimal operation \({\widetilde{o}}^{(i, j)}\) at edge \((i, j)\), we represent the intra-cell search space with a set of one-hot random variables from a fully factorizable joint distribution p(M) (Xie et al. 2019). Specifically, each edge \((i, j)\) is associated with a one-hot random variable \(M^{(i,j)}\). We use \(M^{(i,j)}\) as a mask to multiply all the candidate operations \({\mathcal {O}}^{(i, j)}\) at edge \((i, j)\), and thus the intermediate node \(x^{(j)}\) is given by:

$$\begin{aligned} x^{(j)} = \sum \limits _{i < j} \sum \limits _{o \in {\mathcal {O}}} m^{(i,j)}_{o} \cdot o^{(i, j)} \big ( x^{(i)} \big ), \end{aligned}$$
(2)

where \(m^{(i,j)}_{o} \in M^{(i,j)}\) is a random variable in \(\{0, 1\}\); it evaluates to 1 if operation \(o^{(i, j)}\) is selected.

To make p(M) differentiable, we use Gumbel Softmax technique (Jang et al. 2016; Maddison et al. 2016) to relax the discrete sampling distribution to be continuous and differentiable:

$$\begin{aligned} M^{(i,j)} = f_{\alpha ^{(i,j)}}(G^{(i,j)}) = \text {softmax} ((\log \alpha ^{(i,j)} + G^{(i,j)}) / \lambda ), \end{aligned}$$
(3)

where \(M^{(i,j)}\) is the softened one-hot random variable for operation selection at edge \((i, j)\), \(\alpha ^{(i,j)}\) is the intra-cell architecture parameter at edge \((i, j)\), \(G^{(i,j)} = -\log (-\log (U^{(i,j)}))\) is a vector of Gumbel random variables, and \(U^{(i,j)}\) is a uniform random variable in the range (0, 1). \(\lambda \) is the softmax temperature; as \(\lambda \) approaches 0, \(M^{(i,j)}\) approximately becomes one-hot. The Gumbel Softmax technique makes the entire intra-cell search differentiable (Wu et al. 2018, 2019; Xie et al. 2019) with respect to both the network parameters w and the architecture parameters \(\alpha \).
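A minimal sketch of this sampling step (assuming PyTorch; the argument plays the role of \(\log \alpha ^{(i,j)}\)) is:

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_mask(log_alpha, temperature):
    """Sample the softened one-hot mask M^{(i,j)} of Eq. (3)."""
    u = torch.rand_like(log_alpha).clamp(min=1e-9)  # U ~ Uniform(0, 1)
    g = -torch.log(-torch.log(u))                   # Gumbel noise G
    return F.softmax((log_alpha + g) / temperature, dim=-1)
```

PyTorch's built-in `torch.nn.functional.gumbel_softmax` implements the same relaxation and could be used instead.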

For the candidate operation set \({\mathcal {O}}\), we collect the following operations (a sketch of how an edge combines them follows the list):

  • zero operation

  • skip connection

  • 3 \(\times \) 3 max pooling

  • 3 \(\times \) 3 conv

  • 3 \(\times \) 3 conv, repeat 2

  • 3 \(\times \) 3 separable conv

  • 3 \(\times \) 3 separable conv, repeat 2

  • 3 \(\times \) 3 dilated separable conv, dilation=2

  • 3 \(\times \) 3 dilated separable conv, dilation=4

  • 3 \(\times \) 3 dilated separable conv, dilation=2, repeat 2
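In implementation terms, each edge can hold all candidate operations and weight their outputs by the sampled mask, as in this sketch of Eq. (2) (class and argument names are illustrative assumptions):

```python
import torch.nn as nn

class MixedEdge(nn.Module):
    """Edge (i, j) holding all candidate operations (Eq. 2 in sketch form)."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)  # built from the set above

    def forward(self, x, mask):
        # `mask` is the softened one-hot M^{(i,j)}: during search every
        # candidate runs and is weighted; as the temperature anneals, the
        # mask approaches one-hot and a single operation dominates.
        return sum(m * op(x) for m, op in zip(mask, self.ops))
```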

Intra-cell latency cost

For the operation selection in the cells of a real-time network, we take real-world latency into consideration. Specifically, we build a GPU-latency lookup table (Cai et al. 2018; Tan et al. 2019; Wu et al. 2019; Zhang et al. 2019b) that records the inference time cost of each candidate operation. The latency of each operation is measured in microseconds on a TitanXP GPU. During the searching process, we associate a cost \(lat_{o}^{(i,j)}\) with each candidate operation \(o^{(i,j)}\) at edge \((i, j)\); thus the latency cost of cell p is formulated as:

$$\begin{aligned} lat_p \! = \! \sum \limits _{(i,j)} \sum \limits _{o \in {\mathcal {O}}} m^{(i,j)}_{o} \cdot lat_{o}^{(i,j)}, \end{aligned}$$
(4)

where \(m^{(i,j)}_{o} \in M^{(i,j)}\) and \(M^{(i,j)}\) denotes the softened one-hot random variable at edge \((i, j)\). By using the pre-built lookup table and the above sampling process, the latency loss is also differentiable with respect to \(m^{(i,j)}_{o}\).
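A sketch of this computation, assuming the lookup table has been flattened into one latency tensor per edge (the container names are assumptions):

```python
import torch

def cell_latency(masks, latency_table):
    """Expected latency of one cell (Eq. 4).

    `masks[e]` is the softened one-hot M^{(i,j)} for edge e = (i, j), and
    `latency_table[e]` a tensor of measured per-operation latencies from
    the pre-built lookup table; gradients flow through the masks.
    """
    return sum((masks[e] * latency_table[e]).sum() for e in masks)
```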

3.3 Joint Network Depth and Downsampling Search

Hyper-cell search space

The network depth and downsampling strategy directly affect the accuracy and speed of a real-time semantic segmentation network. To adjust them jointly and automatically, we formulate the two decision-making processes as a single cell-level pruning process. Specifically, we propose a hyper-cell, as shown in Fig. 3, which consists of a reduction cell and n normal cells. We introduce \(n + 1\) edges to connect each cell with the hyper-cell’s output and associate these edges with a learnable architecture parameter \(\beta \). The intra-cell architecture parameters \(\alpha \) of the n normal cells are shared within the same hyper-cell.

Fig. 3
figure 3

Illustration of our hyper-cell. The hyper-cell consists of a reduction cell, n normal cells, and \(n + 1\) edges whose architecture parameter encodes the depth of the hyper-cell. The structure of a cell is shown on the right of this figure

We determine the depth of each hyper-cell by requiring that only one edge be activated per hyper-cell; all cells behind the activated edge can then be pruned safely. Each edge in hyper-cell s is associated with a one-hot random variable \(U^s\) = (\(u_1^s\), \(u_2^s\), ..., \(u_{n+1}^s\)) from a fully factorizable joint distribution P(U). The \(U^s\) acts as a mask during the training process, and the output of hyper-cell s is designed as:

$$\begin{aligned} HyperOut^{(s)} = \sum \limits _{p=1}^{n+1} {u_{p}} ^ {s} \cdot ({C_{p}} ^ {s}), \end{aligned}$$
(5)

where \(C_{p}^{s}\) is the output of the p-th cell in hyper-cell s, and \(u_{p}^{s}\) is the random variable in \(\{0, 1\}\) of the p-th edge of hyper-cell s. We adopt the Gumbel Softmax based sampling process to make the training process differentiable:

$$\begin{aligned} U^{s} \! = f_{\beta ^{s}}(G^{s}) = \text {softmax} ((\log \beta ^{s} + G^{s}) / \lambda ), \end{aligned}$$
(6)

where \(U^{s}\) is the softened one-hot random variable for edge selection of hyper-cell s, and \(\beta ^{s}\) is the architecture parameter of hyper-cell s. \(G^{s}\) and \(\lambda \) are analogous to the ones in Eq. (3). The hyper-cell architecture parameter \(\beta \) can be effectively optimized together with the network parameters w and the intra-cell architecture parameters \(\alpha \) in the same round of back-propagation. After stacking hyper-cells to form a whole network, the network depth and downsampling strategy can be fully explored concurrently according to the architecture parameter \(\beta \).

To better explain the cell-level pruning process, consider the following example. In the initial phase, suppose we have five cells (one reduction cell and four normal cells), and each cell in the hyper-cell keeps its original inputs and outputs. As shown in Fig. 3, if the fourth edge is currently activated (i.e. U is \(\{0, 0, 0, 1, 0\}\)), Normal Cell-4 is pruned in this iteration, and the output of the hyper-cell is the output of Normal Cell-3. At the same time, the reduction cell in the next hyper-cell \(s+1\) takes the outputs of hyper-cell s and of Normal Cell-2 in hyper-cell s as its inputs, to stick to the “two-input” principle of the cell. This learning and adjusting continues throughout the entire searching phase.

By introducing the architecture parameter \(\beta \) in the proposed hyper-cell, we can dynamically adjust and search for the network depth as well as the downsampling strategy for real-time semantic segmentation.
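A compact sketch of the hyper-cell forward pass under Eqs. (5) and (6) follows (illustrative PyTorch; `self.beta` plays the role of \(\log \beta ^{s}\), and all cell outputs within a hyper-cell share one resolution because only the first cell downsamples):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperCell(nn.Module):
    """Hyper-cell sketch: a reduction cell, n normal cells, and a
    learnable depth selector over the n + 1 output edges."""
    def __init__(self, reduction_cell, normal_cells):
        super().__init__()
        self.cells = nn.ModuleList([reduction_cell, *normal_cells])
        self.beta = nn.Parameter(torch.zeros(len(self.cells)))  # n + 1 edges

    def forward(self, s0, s1, temperature):
        # Sample the softened one-hot edge selector U^s (Eq. 6).
        u_rand = torch.rand_like(self.beta).clamp(min=1e-9)
        g = -torch.log(-torch.log(u_rand))
        u = F.softmax((self.beta + g) / temperature, dim=-1)
        outputs, prev_prev, prev = [], s0, s1
        for cell in self.cells:
            prev_prev, prev = prev, cell(prev_prev, prev)
            outputs.append(prev)
        # Eq. (5): weight each cell's output by its edge variable; after
        # convergence, cells behind the selected edge are pruned.
        return sum(u_p * out for u_p, out in zip(u, outputs))
```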

Network Latency Cost We define the set of cells in all hyper-cells in the initial phase as P; after optimization, the cardinality of this set is reduced, and the new set is denoted \({\bar{P}}\). For the current architecture \((\alpha ,\beta )\) containing several hyper-cells, the total latency excludes the pruned cells and can be calculated as:

$$\begin{aligned} Lat(\alpha ,\beta ) = \sum \limits _{p \in {\bar{P}} } lat_{p}, \end{aligned}$$
(7)

where \({\bar{P}}\) is the set of remaining cells in all hyper-cells of architecture \((\alpha ,\beta )\), and \(lat_{p}\) is the latency of cell p. We construct the latency loss function \(L_{lat}\) as:

$$\begin{aligned} L_{lat} = \log (Lat(\alpha ,\beta )). \end{aligned}$$
(8)

Thus, the total loss function can be formulated as:

$$\begin{aligned} L_{total} = L_{CE} + \gamma ~L_{lat}, \end{aligned}$$
(9)

where \(L_{CE}\) is the cross-entropy loss between the predictions of the architecture \((\alpha ,\beta )\) with network weights w and the ground truth, and \(L_{lat}\) denotes the total latency loss of architecture \((\alpha ,\beta )\). Moreover, \(\gamma \) controls the magnitude of the latency term (i.e. it balances the trade-off between accuracy and speed).
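Putting Eqs. (7)–(9) together, the search objective can be sketched as follows (function and argument names are assumptions):

```python
import torch

def total_loss(logits, target, cell_latencies, gamma, ce_loss_fn):
    """Search objective: cross-entropy plus weighted log-latency.

    `cell_latencies` holds the differentiable latency of each remaining
    (unpruned) cell, i.e. the summands of Eq. (7).
    """
    lat = torch.stack(list(cell_latencies)).sum()      # Eq. (7): Lat(alpha, beta)
    l_lat = torch.log(lat)                             # Eq. (8)
    return ce_loss_fn(logits, target) + gamma * l_lat  # Eq. (9)
```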

3.4 Network-Level Auto Feature Aggregation

To remedy the loss of spatial details in real-time segmentation networks due to fast downsampling, we propose the aggregation cell, which automatically aggregates features from different levels in the network with optimal operations. The aggregation cell seamlessly integrates the outputs of the above hyper-cells, and the outputs of the early hyper-cells compensate for the loss of spatial details.

The structure of the proposed aggregation cell is shown in Fig. 4. The aggregation cell takes the outputs of three hyper-cells at different resolutions as its inputs and is thus designed to combine multi-scale features (i.e. low-level spatial details and high-level semantic context). The aggregation cell is a directed acyclic graph consisting of M nodes and E edges. Each node is a latent representation (i.e. feature map), and each directed edge is associated with some candidate operations. As shown in Fig. 4, each edge’s stride is set to 1, unless explicitly specified by “s = 2” (stride 2), which acts as a downsampling connection. The output of the aggregation cell is designed as the concatenation of the final feature maps from the three hyper-cells. We use the same sampling and optimization process as the intra-cell search in Sect. 3.2 to optimize the aggregation cell’s architecture parameters.

Fig. 4
figure 4

Overview of the aggregation cell for automatic multi-scale feature aggregation. The aggregation cell contains E edges (dotted arrows), and each edge is equipped with some candidate operations. The “s = 2” means stride = 2

Given the candidate operation set, the aggregation cell also efficiently enlarges the receptive field of the network. For the operation set of the aggregation cell, we collect the following five operations (a sketch that builds them appears after the list):

  • 1\(\times \)1 conv, repeat 2

  • 3\(\times \)3 conv, repeat 2

  • 3\(\times \)3 dilated separable conv, dilation=2, repeat 2

  • 3\(\times \)3 dilated separable conv, dilation=4, repeat 2

  • 3\(\times \)3 dilated separable conv, dilation=8, repeat 2
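One plausible construction of these candidates is sketched below; the BN/ReLU placement and the reading of "repeat 2" as two stacked conv units are our assumptions:

```python
import torch.nn as nn

def _unit(ch, k, dilation=1, separable=False):
    """One conv + BN + ReLU unit; `pad` keeps the spatial size."""
    pad = dilation * (k // 2)
    convs = ([nn.Conv2d(ch, ch, k, padding=pad, dilation=dilation,
                        groups=ch, bias=False),          # depth-wise
              nn.Conv2d(ch, ch, 1, bias=False)]          # point-wise
             if separable else
             [nn.Conv2d(ch, ch, k, padding=pad, dilation=dilation,
                        bias=False)])
    return nn.Sequential(*convs, nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

def aggregation_candidates(ch):
    """The five candidate operations, each repeated twice."""
    make = lambda *a, **kw: nn.Sequential(_unit(ch, *a, **kw),
                                          _unit(ch, *a, **kw))
    return nn.ModuleList([
        make(1),                              # 1x1 conv, repeat 2
        make(3),                              # 3x3 conv, repeat 2
        make(3, dilation=2, separable=True),  # dilated separable, d=2
        make(3, dilation=4, separable=True),  # dilated separable, d=4
        make(3, dilation=8, separable=True),  # dilated separable, d=8
    ])
```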

4 Experiments

To verify the effectiveness and superiority of our joint search framework, we compare our AutoRTNet with other state-of-the-art methods on two challenging benchmarks: Cityscapes (Cordts et al. 2016) and CamVid (Brostow et al. 2008). Moreover, we conduct a series of ablation studies to verify the effectiveness of the proposed hyper-cell and aggregation cell. Furthermore, we provide an in-depth analysis of the architecture searched by our framework. Finally, we give detailed quantitative results, visualization results, and adequate comparisons with other state-of-the-art methods.

4.1 Implementation Details

Searching For the searching process, the whole network contains three hyper-cells, and the initial cell numbers in these hyper-cells are \(\{5, 10, 10\}\), respectively. The intermediate node number of each cell is set to 2. The initial channel number is 8, and the channel number is tripled when downsampling in reduction cells. The searching process, conducted on the Cityscapes dataset, runs 150 epochs with mini-batch size 16, which takes approximately 16 hours with 16 TitanXP GPU cards. Similar to FBNet (Wu et al. 2019), we postpone the training of the hyper-cell architecture parameters \(\beta \) by 50 epochs to warm up the network weights w and intra-cell architecture parameters \(\alpha \). The \(\alpha \) and \(\beta \) are optimized by Adam, with an initial learning rate of 0.001, momentum parameters (0.5, 0.999), and a weight decay of 1e-4. The w is optimized using SGD with a momentum of 0.9, a weight decay of 1e-3, and a cosine learning-rate scheduler that decays the learning rate from 0.025 to 0.001. For Gumbel Softmax, we empirically set the initial temperature \(\lambda \) in Eqs. (3) and (6) to 3.0, and gradually decrease it to a minimum value of 0.03. We set the node number M and edge number E to 7 in the aggregation cell.
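For concreteness, the reported optimizer and temperature settings could be wired up as below; `weight_params` and `arch_params` are hypothetical parameter groups, and the exponential shape of the temperature decay is our assumption (the text states only the endpoints):

```python
import torch

w_opt = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9,
                        weight_decay=1e-3)
w_sched = torch.optim.lr_scheduler.CosineAnnealingLR(w_opt, T_max=150,
                                                     eta_min=0.001)
arch_opt = torch.optim.Adam(arch_params, lr=0.001, betas=(0.5, 0.999),
                            weight_decay=1e-4)

def gumbel_temperature(epoch, total_epochs=150, t0=3.0, t_min=0.03):
    """Anneal the Gumbel temperature from 3.0 down to 0.03."""
    return max(t_min, t0 * (t_min / t0) ** (epoch / total_epochs))
```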

Retraining When the searching process is over, the searched network is first pretrained on the ImageNet dataset from scratch. We then finetune the network on the specific segmentation dataset (i.e. Cityscapes or CamVid) for 200 epochs with mini-batch size 16. The base learning rate is 0.01, and the ‘poly’ learning rate policy is adopted with a power of 0.9, together with a momentum of 0.9 and a weight decay of 0.0005. Following (Wu et al. 2016; Yu et al. 2018), we compute the loss function with the online bootstrapping strategy. Data augmentation for training contains random horizontal flipping, random resizing with scales in [0.5, 2.0], and random cropping to a fixed size.
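The ‘poly’ policy scales the base learning rate by \((1 - \mathrm{iter}/\mathrm{max\_iter})^{power}\); a one-line sketch:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """The 'poly' learning-rate policy used for retraining."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# With base_lr = 0.01, halfway through training:
# poly_lr(0.01, 5000, 10000) -> ~0.0054
```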

Table 1 Accuracy and speed comparison of our method against other state-of-the-art methods on the Cityscapes test set
Table 2 Results on the CamVid test set with resolution 720 \(\times \) 960

4.2 Benchmarks and Evaluation Metrics

Cityscapes (Cordts et al. 2016), a public street scene dataset, contains high-quality pixel-level annotations of 5000 images with size 1024 \(\times \) 2048 and 19,998 images with coarse annotations. 19 semantic classes are used for training and evaluation. CamVid (Brostow et al. 2008) is another public dataset, containing 701 images in total. We follow the training/testing set split in (Zhang et al. 2019b; Brostow et al. 2008), with 468 training and 233 testing images. These images are densely labeled with 11 semantic classes. We use three evaluation metrics: the mean of class-wise intersection over union (mIoU), network forward time (latency), and frames per second (FPS).
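For reference, mIoU is typically accumulated over a confusion matrix; the sketch below follows that standard recipe (the `ignore_index=255` convention for unlabeled pixels is an assumption borrowed from common Cityscapes practice):

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Accumulate a confusion matrix (rows: ground truth, cols: prediction)."""
    valid = gt != ignore_index
    idx = num_classes * gt[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)

def mean_iou(conf):
    """mIoU: per-class TP / (TP + FP + FN), averaged over observed classes."""
    tp = np.diag(conf).astype(np.float64)
    denom = conf.sum(axis=0) + conf.sum(axis=1) - tp
    return (tp[denom > 0] / denom[denom > 0]).mean()
```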

4.3 Real-Time Semantic Segmentation Results

In this section, we compare our AutoRTNet with other real-time segmentation methods. We run all experiments based on PyTorch 0.4 (Paszke et al. 2017) and measure latency on an NVIDIA TitanXP GPU card under CUDA 9.0. For a fair comparison, we directly quote the remeasured or estimated speed results on TitanXP reported for other algorithms in (Zhang et al. 2019b; Orsic et al. 2019). For AutoRTNet, we report the inference time averaged over 500 runs. In this process, we do not employ any test augmentation.
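A measurement protocol consistent with this description might look as follows (the warm-up count and the use of CUDA timing events are our assumptions; the 500-run average follows the text):

```python
import torch

@torch.no_grad()
def measure_latency(model, input_shape=(1, 3, 768, 1536), runs=500, warmup=50):
    """Average GPU forward time in ms, plus the implied FPS."""
    model.eval().cuda()
    x = torch.randn(*input_shape, device="cuda")
    for _ in range(warmup):          # warm up kernels and caches
        model(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(runs):
        model(x)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / runs
    return ms, 1000.0 / ms
```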

Results on Cityscapes

We conduct the searching process with latency term weight \(\gamma \) 0.01 and 0.001, and obtain the relatively fast and slow networks named AutoRTNet-F and AutoRTNet-S, respectively. We evaluate them on the Cityscapes test set. The validation set is added for training before submitting to the online Cityscapes server. Following (Zhang et al. 2019b; Yu et al. 2018), we scale the resolution of the images from 1024 \(\times \) 2048 to 768 \(\times \) 1536 as inputs to measure the speed and accuracy. As shown in Table 1, our AutoRTNet achieves the best trade-off between accuracy and speed. AutoRTNet-F yields 72.2% mIoU while maintaining 110.0 FPS on the Cityscapes test set with only fine data and without any test augmentation. When the coarse data is added to the training set, the mIoU achieves 73.9%, which is the state-of-the-art trade-off for real-time semantic segmentation. Compared with BiseNet (Yu et al. 2018) and CAS (Zhang et al. 2019b) which have a comparable speed to us, AutoRTNet-F surpasses them by 3.8% and 1.7% in mIoU on the Cityscapes test set, respectively. Compared with other real-time segmentation methods (e.g. ENet (Paszke et al. 2016), ICNet (Zhao et al. 2018a)), our AutoRTNet-F surpasses them in both speed and accuracy by a large margin. Moreover, our AutoRTNet-S achieves 74.3% and 75.8% mIoU (+ coarse data) on the Cityscapes test set with 71.4 FPS, which is also the state-of-the-art real-time performance.

Results on CamVid To validate the transferability of the networks searched by our framework, we directly transfer AutoRTNet-F and AutoRTNet-S, which are obtained on Cityscapes, to the CamVid dataset, as reported in Table 2. We only transfer the network architectures and train them on CamVid from scratch. With 720 \(\times \) 960 input images, AutoRTNet-F achieves 73.5% mIoU on the CamVid test set with 140.0 FPS, which is the state-of-the-art trade-off between accuracy and speed. AutoRTNet-S achieves 74.2% mIoU with 82.5 FPS. We also conduct the architecture search on CamVid (\(\gamma = 0.1\)) and name the resulting ultrafast network AutoRTNet-U. Notably, AutoRTNet-U achieves appealing 250.0 FPS while maintaining 68.6% mIoU on the CamVid test set, which surpasses ICNet (Zhao et al. 2018a) (67.1% mIoU with 34.5 FPS) and DFANet (Li et al. 2019a) (64.7% mIoU with 120 FPS) significantly.

Parameter results Many computationally limited mobile platforms have restrictive memory constraints for real-time applications, and thus the parameter size is also an important consideration. Table 2 shows the results of our AutoRTNet and other methods on the CamVid test set. With only 2.5 million parameters, our AutoRTNet-F achieves impressive accuracy (i.e. 73.5% mIoU) on the CamVid test set, which significantly outperforms existing real-time segmentation networks. The parameter sizes of AutoRTNet-S and AutoRTNet-U are 3.9M and 1.4M, respectively.

4.4 Ablation Study

The contribution of each component is investigated in the following ablation studies on the Cityscapes validation set. The latency term weight \(\gamma \) in Eq. (9) is set to 0.01, and all networks in the following experiments are first pretrained on ImageNet for a fair comparison, unless otherwise noted.

Table 3 Comparison with random search on the Cityscapes validation set
Table 4 The optimization results of hyper-cells with different initial states and different random seeds
Table 5 Comparison to random downsampling strategy

4.4.1 Comparison with Random Search

As discussed in (Li and Talwalkar 2019; Yu et al. 2019), NAS is a specialized hyper-parameter optimization problem, and random search is a competitive baseline for it. We apply random search to semantic segmentation by randomly sampling ten architectures from our previously defined search space. The whole search space contains intra-cell operation selection and hyper-cell depth decisions, which makes it significantly challenging for random search to find a satisfactory network. As shown in Table 3, random search achieves an average of 66.7% ± 2.5% mIoU on the Cityscapes validation set with ImageNet pretraining, which is substantially lower than our AutoRTNet. These results also demonstrate the effectiveness of our search algorithm.

4.4.2 Hyper-Cell

Robustness First, to verify the robustness of the hyper-cell, we set different initial numbers of cells and different random seeds in the initialization phase. The network contains three hyper-cells, and the initial cell numbers in the hyper-cells are set to {a, b, c}; after optimization, the numbers of cells remaining in each hyper-cell are {\({\overline{a}}\), \({\overline{b}}\), \({\overline{c}}\)}. As shown in Table 4, the experiments demonstrate that the hyper-cells are insensitive to both the initial numbers of cells and random seeds, which verifies the robustness and stability of the hyper-cell.

Downsampling strategy To demonstrate the effectiveness of the downsampling strategy searched by hyper-cells, we compare random downsampling position settings with the searched one. The total cell number searched by our framework is 12 (i.e., \({\overline{a}}\)+\({\overline{b}}\)+\({\overline{c}}\) = 12); for a fair comparison, we fix the searched cell structures and only change the downsampling positions (x, y, z) randomly, where (x, y, z) represents the index positions of the reduction cells among the 12 cells. After pretraining and retraining, the results in Table 5 demonstrate the superiority of the downsampling strategy searched through hyper-cells. Compared with the random ones, our hyper-cell achieves the best trade-off between accuracy and speed.

Hyper-cell number In our framework, we empirically set the number of hyper-cells to 3; the downsampling factor is thus 16, including a stem convolution layer. In fact, the number of hyper-cells can also be learned by a learnable architecture parameter \(\delta \), whose optimization is similar to that of the architecture parameter \(\beta \). Specifically, the number of hyper-cells can be learned by a hyper-cell-level pruning process, i.e., reducing the initial hyper-cell number automatically as follows. First, we introduce edges connecting to the output of each hyper-cell and associate them with the learnable parameters \(\delta \). Then, we determine the number of hyper-cells by requiring that only one edge be activated. We set the initial hyper-cell number to 5, so the initial maximum downsampling factor is 64, which covers common practice. Finally, the parameters \(\delta \) and network parameters w are optimized to determine the number of hyper-cells automatically. We perform five repeated experiments, as shown in Table 6. With latency weight \(\gamma = \) 0.01, the hyper-cell number determined by the parameter \(\delta \) is 3 or 4.

We then conduct experiments with different numbers of hyper-cells, including the numbers 3 and 4 determined by \(\delta \) and the hand-designed number 5. The results are shown in Table 7: AutoRTNet obtains similar performance when the hyper-cell number is 3 or 4, while performance degrades when the hyper-cell number is 5, which demonstrates the effectiveness of the parameter \(\delta \). Hence, the hyper-cell number can be set to 3 or 4 in our framework.

Table 6 The optimization results of the number of hyper-cells
Table 7 The optimization results with different hyper-cell numbers
Table 8 Ablation study for the aggregation cell
Fig. 5
figure 5

Illustration of cell numbers in hyper-cells during the searching process. Blue lines from top to bottom denote the actual cell number changing in each hyper-cell with the increase of epochs, and red curves represent the mathematical expectation values of the current cell numbers in hyper-cells (Color figure online)

4.4.3 Aggregation Cell

To demonstrate the effectiveness of the proposed aggregation cell, we conduct a series of experiments with different strategies: (a) without multi-scale feature aggregation; (b) with an aggregation cell using operations selected randomly from the aggregation cell’s search space; (c) with the searched aggregation cell under \(\gamma = 0.01\), and the corresponding network is named AutoRTNet-\(\bar{{\mathrm {F}}}\); (d) with the searched unconstrained aggregation cell (i.e. our AutoRTNet-F), for which we do not introduce the latency constraint. The result of the random aggregation cell is the average over ten repeated random experiments, and the results are shown in Table 8. Overall, the searched aggregation cell boosts the mIoU from 69.9 to 72.9% on the Cityscapes validation set; in particular, it surpasses the random one by 1.5%. Moreover, we observe that when the latency constraint is added to the aggregation cell, the accuracy degrades from 72.9 to 72.2% mIoU while the speed improves by only 0.22 ms (+2.8 FPS). Thus, for a better overall trade-off between accuracy and speed, we do not introduce the latency constraint to the aggregation cell.

Fig. 6
figure 6

The results of different latency settings on CamVid

Fig. 7
figure 7

Illustration of the detailed AutoRTNet-F architecture. The structures of the reduction cells and normal cells in the three hyper-cells are shown in the figure respectively. The structure of the searched aggregation cell is shown on the right. Best viewed in color (Color figure online)

4.4.4 Hyper-Cell Searching Process

To better analyze how hyper-cell works throughout the whole searching process, we visualize the number of cells of each hyper-cell after the warm-up phase, as depicted in Fig. 5. The initial cell numbers are \(\left\{ 5, 10, 10 \right\} \) and eventually converge to \(\left\{ 2, 4, 6 \right\} \) in three hyper-cells. The blue lines from top to bottom denote the actual cell numbers according to the current architecture parameter \(\beta \) of each hyper-cell, and red curves represent the mathematical expectation values of current cell numbers. We observe that our framework explores different cell numbers (i.e. different depths) in each hyper-cell actively at the early stage of searching, and the expectation values of cell numbers also change gradually. The cell numbers progressively become stable towards the final architecture in the late stage of searching, and the actual cell number lines gradually get close to the expectation curves. Another interesting observation is that hyper-cell #1 finds its optimal depth much earlier than the other ones, indicating that the searching process follows a shallow-to-deep manner as we expected.

4.4.5 Different Latency Settings

Our joint search framework searches for the optimal network architectures under different latency settings (i.e. loss weight \(\gamma \)). In Sect. 4.3, AutoRTNet-F and AutoRTNet-S are searched with \(\gamma = 0.01\) and 0.001 on the Cityscapes dataset, respectively, which demonstrates the flexibility of our framework. We also conduct the architecture search on the CamVid dataset with different latency settings, and the results are depicted in Fig. 6. The networks searched with \(\gamma \) = 0.001, 0.01, 0.05, 0.1 achieve 73.3%, 70.2%, 69.2%, 68.6% mIoU and 138.0, 200.2, 232.1, 250.0 FPS on the CamVid test set, respectively. Notably, our AutoRTNet-U achieves 250.0 FPS while maintaining 68.6% mIoU, which surpasses ICNet (67.1% mIoU with 34.5 FPS) and DFANet (64.7% mIoU with 120 FPS) significantly.

4.5 Insights from Searched AutoRTNet-F

We provide an in-depth analysis of the AutoRTNet-F searched by our framework. We use NAS methods to search for suitable architectures for specific tasks; likewise, we should understand why the searched network works well, as this can in turn guide hand-designed architectures. We have the following three important observations.

Early downsampling We notice that the searched downsampling strategy is stable and reasonable. As shown in Table 4, in the first hyper-cell, whether the initial cell number is 5 or 10, the final number is at most 2 after optimization. The reason is that visual information is highly spatially redundant and thus can be compressed into a more efficient representation. Under the latency constraints, the searched downsampling strategy is as we expected and follows the early downsampling prior knowledge (Paszke et al. 2016).

Suitable receptive field A suitable receptive field size is crucial for semantic segmentation (Luo et al. 2016). A too-large receptive field may introduce extra noise or negative interference, while the network cannot capture enough context information if it is too small. During the searching process, our AutoRTNet continuously adjusts the operation selection to determine the suitable receptive field. For example, in the optimized aggregation cell, as shown in Fig. 7, the operations applied to the outputs of the third hyper-cell always use a dilation rate of 4 rather than the rates of 2 or 8 that are also in the search space. This suggests choosing the corresponding operations for a suitable receptive field in hand-designed semantic segmentation networks.

Operation selection The early operations act as good feature extractors. As shown in Fig. 7, the operations selected in the early stages always tend to be conv 3\(\times \)3, while the middle and deep layers show a diversity of operation selections. When performing multi-scale feature aggregation in the aggregation cell, we clearly find that the deeper layers prefer dilated convolution operations, while the lower layers prefer common convolution operations.

Fig. 8
figure 8

Illustration of the detailed AutoRTNet-F’ architecture

Table 9 The hyper-parameters used for the search and the corresponding results of AutoRTNet-F and AutoRTNet-F’

4.6 Comparison of Networks Searched on Different Datasets

In this part, we compare the similarities and differences between AutoRTNet-F, searched on Cityscapes, and its counterpart searched on CamVid. We depict the architecture searched on CamVid with \(\gamma = \) 0.001 in Fig. 8, named AutoRTNet-F’. Compared with AutoRTNet-F, AutoRTNet-F’ has a similar Hyper-cell #1 architecture. However, Hyper-cell #2 and Hyper-cell #3 of the two networks are rather different. Specifically, Hyper-cell #2 of AutoRTNet-F selects more max-pooling operations than that of AutoRTNet-F’, and Hyper-cell #3 of AutoRTNet-F prefers more convolution operations than that of AutoRTNet-F’. Moreover, AutoRTNet-F, which is searched at a higher image resolution, has more dilated convolution operations than AutoRTNet-F’. We also list the hyper-parameters used for the network search and the corresponding network performance in Table 9. The other searching hyper-parameters remain the same when searching for AutoRTNet-F and AutoRTNet-F’.

Table 10 The detailed information of our AutoRTNet and other state-of-the-art methods on the Cityscapes test set
Table 11 Detailed per-class accuracy comparison of our AutoRTNet with other methods on the Cityscapes test set
Table 12 Accuracy and speed comparison on the Cityscapes validation set with image resolution 1024 \(\times \) 2048

4.7 Detailed Time and GPU Information

The inference time or FPS is influenced by the GPU device and the input image size of the model. Here we list detailed information of previous approaches in Table 10 for reference. Our GPU device is an NVIDIA TitanXP. For a fair comparison, we directly quote the remeasured or estimated results on TitanXP reported for other algorithms in the CAS (Zhang et al. 2019b) and SwiftNet (Orsic et al. 2019) papers. We remeasure the speed of a method based on our implementation if its original speed was reported on a different GPU and is not mentioned in CAS (Zhang et al. 2019b) or SwiftNet (Orsic et al. 2019). Note that our implementations and speed measurements do not use TensorRT optimizations.

The speed of DFANet was reported on a TitanX GPU and is also not mentioned in CAS (Zhang et al. 2019b) or SwiftNet (Orsic et al. 2019). Thus, we carefully remeasure its inference time on a TitanXP for a fair comparison. There is still a gap between the original speed and the one we measured; we suspect this is caused by the inconsistency of the implementation platforms. We reimplement DFANet using official PyTorch (Paszke et al. 2017), whereas its authors measure it on their own framework, in which the depth-wise separable convolution is more fully optimized.

4.8 Detailed Quantitative Results and Visualization Results

Here we provide detailed quantitative results of per-class mIoU on the Cityscapes and CamVid datasets. Moreover, we provide the performance of the AutoRTNet on the full-resolution Cityscapes validation set. Finally, we provide some visual segmentation results on Cityscapes and CamVid.

Table 13 Detailed per-class accuracy comparison of our AutoRTNet with other methods on the CamVid test set
Fig. 9
figure 9

Visual segmentation results on the Cityscapes validation set. a Image. b Ground Truth. c ICNet. d AutoRTNet-F

4.8.1 Cityscapes Dataset

Compared with other methods, our AutoRTNet-F achieves an overall 72.2% mIoU with 110.0 FPS, which is the state-of-the-art trade-off between accuracy and speed. The per-class accuracy values are shown in Table 11. In comparison with other methods with public per-class accuracy on the Cityscapes test set, our predictions are more accurate in 13 out of 19 classes. AutoRTNet-F achieves slight improvements on the general classes (Road, Sidewalk, Building, Terrain, Car, etc.) while obtaining a significant accuracy improvement on the challenging classes (Truck, Motorbike, Train, Fence, Rider, etc.). Moreover, AutoRTNet-S achieves 74.3% mIoU on the Cityscapes test set with 71.4 FPS.

Fig. 10
figure 10

Visual segmentation results on CamVid test set. a Image. b Ground Truth. c ICNet. d AutoRTNet-F

The Cityscapes dataset contains high-resolution 1024 \(\times \) 2048 images, which makes it a big challenge for real-time semantic segmentation. With high-resolution image inputs, Zhao et al. (2018a) focus on building a practically fast semantic segmentation system while accomplishing high-quality results. SwiftNet (Orsic et al. 2019) and CAS (Zhang et al. 2019b) also perform experiments on Cityscapes with full-resolution image inputs. In this part, we compare with these methods on the Cityscapes validation set, and the results are shown in Table 12. We refer to the speed scaling factors across different GPUs in the SwiftNet (Orsic et al. 2019) paper and estimate the speed values of ICNet, SwiftNet, and CAS on a TitanXP GPU.

Table 14 Results on the Cityscapes validation set of our networks and PSPNet
Table 15 Results on the Cityscapes test set of our AutoRTNet-\(\hat{{\mathrm {S}}}\) and other high-accuracy methods

AutoRTNet-F Our AutoRTNet-F achieves 75.0% mIoU and 62.7 FPS on the full-resolution Cityscapes validation set (i.e. 1024 \(\times \) 2048). To the best of our knowledge, the real-time performance of AutoRTNet-F outperforms all existing real-time methods. AutoRTNet-F surpasses ICNet by 5.5% in mIoU with a faster inference speed. Moreover, it outperforms SwiftNet and CAS by 0.6% and 1.0% in mIoU, respectively, with a great advantage in inference speed (62.7 FPS vs 38.1 FPS and 62.7 FPS vs 45.2 FPS).

AutoRTNet-S Our AutoRTNet-S achieves 76.8% mIoU with 45.6 FPS on the full-resolution Cityscapes validation set, which is the state-of-the-art real-time performance. Compared with SwiftNet and CAS that both have a little bit slower speed than us, our AutoRTNet-S surpasses them by 2.4% and 2.8% in mIoU, respectively.

4.8.2 CamVid Dataset

As shown in Table 13, with 720 \(\times \) 960 input images, the searched AutoRTNet-F achieves 73.5% mIoU with 140.0 FPS, which is the state-of-the-art trade-off between accuracy and speed on the CamVid test set. In comparison with other methods, the predictions of our AutoRTNet-F are more accurate in 7 out of the 11 classes. More importantly, the inference speed of AutoRTNet-F reaches 140 FPS, which is impressive compared with other methods (e.g. SegNet 29.4 FPS, ENet 61.2 FPS, ICNet 34.5 FPS). The per-class accuracies of AutoRTNet-S and AutoRTNet-U are also shown in Table 13.

4.8.3 Visual Segmentation Results

We provide some visual prediction results on both the Cityscapes and CamVid datasets. As shown in Figs. 9 and 10, the columns correspond to the input image, ground truth, the prediction of ICNet, and the prediction of our AutoRTNet-F. Compared with ICNet, AutoRTNet-F produces more accurate and detailed results with faster inference speed. For example, AutoRTNet-F captures small objects in more detail (e.g. the traffic light in Fig. 9 and the poles in Fig. 10) and generates “smoother” results on object boundaries (e.g. the rider and fence in Fig. 9 and the car in Fig. 10).

4.9 Comparison with High-Accuracy Models

Our AutoRTNet has a performance gap to high-accuracy methods, e.g. PSPNet, even with small \(\gamma \) values. In our framework, the performance of the searched network is mainly affected by two key factors: (1) the search space size and (2) the latency loss weight \(\gamma \). We aim to search for real-time networks and thus use a relatively small search space, which is responsible for the performance gap. Specifically, the maximal number of channels in our search space is only 144, and the parameter size of the search space is only 12.14 M. The searched AutoRTNet-S has only 3.88 M parameters, while PSPNet-ResNet101 has 68.07 M.

For a comparison with PSPNet, we directly expand AutoRTNet-S by widening the network (i.e. 4 \(\times \) channels) to obtain a larger network named AutoRTNet-\(\hat{{\mathrm {S}}}\). We then evaluate it on the Cityscapes validation set, and the corresponding results are shown in Table 14. AutoRTNet-\(\hat{{\mathrm {S}}}\) achieves 78.5% mIoU on the validation set, which is very close (only 0.1% mIoU lower) to PSPNet-ResNet101 (78.6% mIoU). Note that AutoRTNet-\(\hat{{\mathrm {S}}}\) is about 10 times faster in inference than PSPNet-ResNet101 (8.7 FPS vs 0.78 FPS).

Moreover, we also train AutoRTNet-\(\hat{{\mathrm {S}}}\) with the coarse annotations for comparison with other high-accuracy methods, and the results on the Cityscapes test set are shown in Table 15. From the experimental results, AutoRTNet-\(\hat{{\mathrm {S}}}\) sacrifices real-time speed for an accuracy improvement. Even so, this high-performance network still infers about 10 \(\times \) faster than PSPNet. In conclusion, the experiment proves the great potential of our AutoRTNet. Although our framework focuses on real-time segmentation network search under a relatively small search space, it can also be flexibly generalized to the high-accuracy scenario by enlarging the search space.

5 Conclusion

In this paper, we propose a novel joint search framework that covers all three main aspects of the design philosophy for real-time semantic segmentation networks. The framework searches for building blocks, network depth, downsampling strategy, and feature aggregation method simultaneously. The hyper-cell is proposed for searching the network depth and downsampling strategy jointly and automatically, and the aggregation cell is introduced for automatic multi-scale feature aggregation. Extensive experiments on both Cityscapes and CamVid datasets demonstrate the superiority and effectiveness of our approach.