1 Introduction

Semantic segmentation, a fundamental topic in computer vision, aims to assign a semantic label to every pixel of an image. Recent approaches (Zhao et al. 2017; Chen et al. 2017, 2018b; Zhao et al. 2018b) based on fully convolutional networks (Long et al. 2015) have achieved remarkable accuracy on public benchmarks (Brostow et al. 2008; Cordts et al. 2016; Everingham et al. 2015). Such improvements, however, come at the cost of deeper and less efficient networks, which may not be applicable to many real-time systems, e.g., autonomous driving and video surveillance.

To perform fast semantic segmentation with satisfactory accuracy, the design philosophy of real-time segmentation network architectures mainly concentrates on three aspects: (1) building block design (Li and Kim 2019; Paszke et al. 2016), which considers the block-level feature representation capacity, computational complexity, and receptive field size; (2) network depth and downsampling strategy (Li and Kim 2019; Li et al. 2019a), which directly affect the accuracy and speed of a network, hence real-time networks favor shallow layers and fast downsampling; and (3) feature aggregation (Yu et al. 2018; Zhao et al. 2018a), which fuses multi-scale features to compensate for the loss of spatial details caused by fast downsampling.

The above hand-crafted networks have made huge progress, but they require expertise in architecture design gained through laborious trial and error. To relieve this burden, some researchers have introduced neural architecture search (NAS) methods (Baker et al. 2016; Zoph and Le 2016; Liu et al. 2019b; Xie et al. 2019) into this field and obtained excellent results (Chen et al. 2018a; Liu et al. 2019a; Zhang et al. 2019b; Nekrasov et al. 2019). Liu et al. (2019a) and Chen et al. (2018a) focus on high-quality segmentation instead of real-time applications. To meet the real-time demand, Zhang et al. (2019b) search a customized architecture by introducing a latency loss function. Although its building blocks are searched, the network depth, downsampling strategy, and feature aggregation method are still set by hand in advance and are nonadjustable during searching. Since these three aspects are highly correlated and indispensable for a remarkable real-time segmentation network, fixing them in advance increases the difficulty of finding an optimal real-time architecture (i.e. the best trade-off between accuracy and speed). This motivates us to explore all three aspects automatically during the searching process.

In this paper, we propose a joint search framework that searches for the optimal building blocks, network depth, downsampling strategy, and feature aggregation method simultaneously. Specifically, we propose hyper-cells to decide the network depth and downsampling strategy jointly and automatically via a cell-level pruning process, and an aggregation cell to fuse features from multiple spatial scales automatically. For the hyper-cell, we introduce a novel learnable architecture parameter, so that the network depth and downsampling strategy are fully determined concurrently according to the optimized architecture parameters. For the aggregation cell, we aggregate multi-level features in the network automatically to fuse low-level spatial details and high-level semantic context effectively.

We denote the resulting network as Auto searched Real-Time semantic segmentation network or AutoRTNet. We evaluate AutoRTNet on both Cityscapes (Cordts et al. 2016) and CamVid (Brostow et al. 2008) datasets. The experiments demonstrate the superiority of AutoRTNet, as shown in Fig. 1, where our AutoRTNet achieves the best accuracy-efficiency trade-off.

The main contributions can be summarized as follows:

  • We propose a joint search framework for real-time semantic segmentation that automatically searches for the building blocks, network depth, downsampling strategy, and feature aggregation method simultaneously.

  • We propose the hyper-cell to learn the network depth and downsampling strategy jointly and automatically via the cell-level pruning process, and the aggregation cell to achieve automatic multi-scale feature aggregation.

  • Notably, AutoRTNet has achieved 73.9% mIoU on the Cityscapes test set and 110.0 FPS on an NVIDIA TitanXP GPU card with \(768 \times 1536\) input images.

Fig. 1
figure 1

The inference speed and accuracy of different networks on the Cityscapes test set. Compared with other methods, our AutoRTNet lies in the top-right, since it features lower latency with comparable accuracy. Methods trained using both fine and coarse data are marked with \(*\)

2 Related Work

2.1 Semantic Segmentation

High-quality segmentation FCN (Long et al. 2015) is the pioneering work that greatly promoted the development of semantic segmentation. Extensions to FCN follow many directions. Encoder–decoder structures (Badrinarayanan et al. 2017; Lin et al. 2017a; Noh et al. 2015) combine low-level and high-level features to improve the accuracy of semantic segmentation. DRN (Yu et al. 2017) and DeepLab (Chen et al. 2017, 2018b) use dilated convolution operations to effectively enlarge the receptive field size. To capture multi-scale context information, DeepLabV3 (Chen et al. 2017) and PSPNet (Zhao et al. 2017) propose pyramid modules. Recently, the attention mechanism (Vaswani et al. 2017) has been used in segmentation methods (Fu et al. 2019; Zhang et al. 2019a; Zhao et al. 2018b; Li et al. 2018). These outstanding works are designed for high-quality segmentation, which makes them inapplicable to real-time applications.

Real-time methods Various algorithms have been proposed for real-time semantic segmentation. Some works (Wu et al. 2017) reduce the computation overhead by restricting the size of input images. Channel-pruning algorithms (Paszke et al. 2016; Badrinarayanan et al. 2017) are introduced to boost the inference speed, and most real-time methods focus on designing light-weight and effective network architectures. The design philosophy of real-time network architectures can be summarized in the following three aspects; in our work, we fully explore all three simultaneously.

Building block design The building block design (Paszke et al. 2016; Romera et al. 2017; Mehta et al. 2018; Li and Kim 2019) requires researchers to give sufficient consideration to the computational complexity, feature representation capacity, and receptive field size, which is essential for real-time semantic segmentation. For example, ENet (Paszke et al. 2016) and DABNet (Li and Kim 2019) propose light-weight blocks and stack them with different dilation rates to form a whole network. MobileNet and its variants (Howard et al. 2017; Sandler et al. 2018) use blocks with depth-wise separable convolution in pursuit of light-weight models.

Network depth and downsampling strategy High-quality segmentation networks typically use pre-defined backbones, e.g. ResNet (He et al. 2016) or Xception (Chollet 2017), as encoders. However, for real-time segmentation networks [e.g. DABNet (Li and Kim 2019), DFANet (Li et al. 2019a), ERFNet (Romera et al. 2017)], the network depth and downsampling strategy (i.e. how many layers are in each stage) are determined mostly by hand, as they directly affect the accuracy and speed of the networks. To pursue fast inference, real-time networks favor shallow layers and perform fast downsampling to a factor of 16 or 32.

Feature aggregation The fast downsampling in real-time networks easily results in the loss of spatial details. Thus, multi-scale feature aggregation (Yu et al. 2018; Zhao et al. 2018a; Li et al. 2019a) has been proposed to remedy the loss of spatial details. Zhao et al. (2018a) propose an image cascade network with multi-scale inputs. Yu et al. (2018) decouple the network into context and spatial paths to make the right balance between accuracy and speed. Li et al. (2019a) aggregate multi-scale features from different layers to remedy the loss of spatial details.

2.2 Neural Architecture Search

Overview Neural architecture search (NAS) focuses on automating the network architecture design process. Early NAS methods based on reinforcement learning (Zoph and Le 2016; Baker et al. 2016; Zoph et al. 2018; Tan et al. 2019) or evolutionary algorithms (Miikkulainen et al. 2019; Real et al. 2019) are time-consuming (e.g. thousands of GPU days) and computationally expensive. Recently, the emergence of differentiable NAS methods (Liu et al. 2019b; Xie et al. 2019; Cai et al. 2018) has greatly reduced the search cost while achieving excellent performance. DARTS (Liu et al. 2019b) is the pioneering work on gradient-based NAS: Liu et al. (2019b) propose an iterative optimization framework based on a continuous relaxation of the architecture representation. Xie et al. (2019) constrain the architecture parameters to be approximately one-hot, resolving the inconsistency between optimizing the performance of derived child networks and that of converged parent networks. In addition, FBNet (Wu et al. 2019), ProxylessNAS (Cai et al. 2018), and MnasNet (Tan et al. 2019) propose multi-objective optimization that takes real-world latency into consideration.

In this paper, we propose hyper-cells to jointly and automatically decide the key properties (i.e. the downsampling strategy and the depth of a network) in semantic segmentation. Searching at this network architecture level gives rise to a suitable downsampling strategy and depth for a semantic segmentation network. In contrast, DARTS (Liu et al. 2019b) and SNAS (Xie et al. 2019) only search at the cell level under a fixed network architecture, without considering the intrinsic properties of semantic segmentation. Thus, our search space and those of other NAS methods (Liu et al. 2019b; Xie et al. 2019) are essentially different.

NAS for segmentation DPC (Chen et al. 2018a) is the first work to apply NAS methods to dense image prediction; it searches for a multi-scale representation module. The work most similar to ours is Auto-DeepLab (Liu et al. 2019a), which proposes a hierarchical search space and searches for the downsampling path. Although it also searches for the downsampling strategy, the mechanism is fundamentally different from ours: it designs a network-level continuous relaxation to learn the downsampling path, while we search for the downsampling strategy via a cell-level pruning process. Moreover, Auto-DeepLab cannot search for the network depth or feature aggregation method, and it focuses on high-quality segmentation. For real-time requirements, CAS (Zhang et al. 2019b) searches for an architecture under customized resource constraints and achieves excellent real-time performance. However, our approach can search for the network depth, downsampling strategy, and feature aggregation method, which is significantly different from CAS (Zhang et al. 2019b).

Fig. 2
figure 2

Illustration of our joint network architecture search framework. The network begins with two convolution layers and contains three hyper-cells which search for the optimal network depth and downsampling strategy via the cell-level pruning process. Each hyper-cell contains a reduction cell and n normal cells. The cells marked with the dotted white line are pruned after optimization. The aggregation cell is designed to perform automatic multi-scale feature aggregation effectively, and it seamlessly integrates the outputs of hyper-cells

NAS for object detection The combination of multi-scale features is also essential for object detection (Lin et al. 2017b; Liu et al. 2016). In the field of NAS, NAS-FPN (Ghiasi et al. 2019) and Auto-FPN (Xu et al. 2019) search for architectures that merge features of varying dimensions and succeed at finding appropriate combination methods. Unlike us, Ghiasi et al. (2019) propose the merging cell and use an RNN controller to select candidate feature layers and a binary operation in each merging cell; their search space consists of only two binary operations, i.e. sum and global pooling, for simplicity. Xu et al. (2019) search for an efficient feature fusion module, and their search space is specially designed for detection and flexible enough to cover many popular detector designs. Thus, the search space design, motivation, and implementation of the above methods are significantly different from ours.

3 Methods

In this section, we illustrate the proposed real-time semantic segmentation network search framework in detail. First, we briefly introduce an overview of the proposed framework. Second, we describe the differentiable architecture search. Next, we elaborate on the proposed hyper-cell for joint network depth and downsampling search. Finally, we illustrate the proposed aggregation cell for automatic multi-scale feature aggregation.

3.1 Overview

The joint search framework is shown in Fig. 2. We propose the hyper-cell to search for the optimal network depth and downsampling strategy, as they directly affect the accuracy and speed of a network. To remedy the loss of spatial details caused by fast downsampling, a novel aggregation cell is proposed for automatic multi-scale feature aggregation. The whole framework contains two pre-defined convolution layers, three hyper-cells, and an aggregation cell. A multi-scale module (Chen et al. 2017) is subsequently used to extract global and local context for the final prediction. For real-time demands, we take real-world latency into consideration during the searching process.

3.2 Differentiable Architecture Search

Intra-cell search space

The hyper-cell is the building block of the network, and the cell is the basic component unit of the hyper-cell, as shown in Fig. 2. There are two types of cells, i.e., normal cells and reduction cells (Liu et al. 2019b; Xie et al. 2019). A reduction cell reduces the feature map size by a factor of 2 for downsampling, while the factor is 1 in normal cells.

A cell is a directed acyclic graph (DAG) consisting of an ordered sequence of N nodes, denoted by \({\mathcal {N}}\) = \(\{x^{(1)},\ldots ,x^{(N)}\}\). Each node \(x^{(i)}\) is a latent representation (i.e. feature map), and each directed edge \(\left( i, j \right) \) is associated with some candidate operations (e.g. conv, pooling) in an operation set \({\mathcal {O}}^{(i, j)}\), representing all possible transformations from \(x^{(i)}\) to \(x^{(j)}\). Each cell has two inputs (the outputs of the previous two cells) and one output (the concatenation of all intermediate nodes in the cell). The structure of a cell is shown on the right of Fig. 3. Each intermediate node \(x^{(j)}\) is computed based on all of its predecessors:

$$\begin{aligned} x^{(j)} = \sum \limits _{i < j} {\widetilde{o}}^{(i, j)} \big ( x^{(i)} \big ), \end{aligned}$$
(1)

where \({\widetilde{o}}^{(i, j)} \in {\mathcal {O}}^{(i, j)}\) is the optimal operation at edge \((i, j)\).
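To make the DAG computation concrete, the following minimal sketch (illustrative PyTorch, not the authors' released code; `ops` is a hypothetical mapping from each edge \((i, j)\) to its chosen operation) evaluates Eq. (1) for a cell with two inputs and a concatenated output:

```python
import torch
import torch.nn as nn

class Cell(nn.Module):
    """A cell as a DAG: two inputs, `num_nodes` intermediate nodes,
    and an output that concatenates the intermediate nodes."""
    def __init__(self, ops, num_nodes):
        super().__init__()
        # `ops` maps edge (i, j) to its (already selected) operation.
        self.ops = nn.ModuleDict({f"{i}->{j}": op for (i, j), op in ops.items()})
        self.num_nodes = num_nodes

    def forward(self, s0, s1):
        states = [s0, s1]  # outputs of the previous two cells
        for j in range(2, self.num_nodes + 2):
            # Eq. (1): each intermediate node sums the transformed
            # representations of all of its predecessors.
            states.append(sum(self.ops[f"{i}->{j}"](states[i])
                              for i in range(j)))
        return torch.cat(states[2:], dim=1)  # concatenate intermediate nodes
```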

In order to determine the optimal operation \({\widetilde{o}}^{(i, j)}\) at edge \((i, j)\), we represent the intra-cell search space with a set of one-hot random variables from a fully factorizable joint distribution p(M) (Xie et al. 2019). Specifically, each edge \((i, j)\) is associated with a one-hot random variable \(M^{(i,j)}\). We use \(M^{(i,j)}\) as a mask to multiply all the candidate operations \({\mathcal {O}}^{(i, j)}\) at edge \((i, j)\), and thus the intermediate node \(x^{(j)}\) is given by:

$$\begin{aligned} x^{(j)} = \sum \limits _{i < j} \sum \limits _{o \in {\mathcal {O}}} m^{(i,j)}_{o} \cdot o^{(i, j)} \big ( x^{(i)} \big ), \end{aligned}$$
(2)

where \(m^{(i,j)}_{o} \in M^{(i,j)}\) is a random variable in \(\{0, 1\}\); it evaluates to 1 if operation \(o^{(i, j)}\) is selected.

To make p(M) differentiable, we use Gumbel Softmax technique (Jang et al. 2016; Maddison et al. 2016) to relax the discrete sampling distribution to be continuous and differentiable:

$$\begin{aligned} M^{(i,j)} = f_{\alpha ^{(i,j)}}(G^{(i,j)}) = \text {softmax} ((\log \alpha ^{(i,j)} + G^{(i,j)}) / \lambda ), \end{aligned}$$
(3)

where \(M^{(i,j)}\) is the softened one-hot random variable for operation selection at edge \((i, j)\), \(\alpha ^{(i,j)}\) is the intra-cell architecture parameter at edge \((i, j)\), \(G^{(i,j)} = -\log (-\log (U^{(i,j)}))\) is a vector of Gumbel random variables, and \(U^{(i,j)}\) is a uniform random variable in the range (0, 1). \(\lambda \) is the softmax temperature; as \(\lambda \) approaches 0, \(M^{(i,j)}\) approximately becomes one-hot. The Gumbel Softmax technique makes the entire intra-cell search differentiable (Wu et al. 2018, 2019; Xie et al. 2019) with respect to both the network parameters w and the architecture parameters \(\alpha \).
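A minimal sketch of this sampling step (assuming PyTorch; the argument plays the role of \(\log \alpha ^{(i,j)}\)) is:

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_mask(log_alpha, temperature):
    """Sample the softened one-hot mask M^{(i,j)} of Eq. (3)."""
    u = torch.rand_like(log_alpha).clamp(min=1e-9)  # U ~ Uniform(0, 1)
    g = -torch.log(-torch.log(u))                   # Gumbel noise G
    return F.softmax((log_alpha + g) / temperature, dim=-1)
```

PyTorch's built-in `torch.nn.functional.gumbel_softmax` implements the same relaxation and could be used instead.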

For the candidate operation set \({\mathcal {O}}\), we collect the following operations (a sketch of how an edge combines them follows the list):

  • zero operation

  • skip connection

  • 3 \(\times \) 3 max pooling

  • 3 \(\times \) 3 conv

  • 3 \(\times \) 3 conv, repeat 2

  • 3 \(\times \) 3 separable conv

  • 3 \(\times \) 3 separable conv, repeat 2

  • 3 \(\times \) 3 dilated separable conv, dilation=2

  • 3 \(\times \) 3 dilated separable conv, dilation=4

  • 3 \(\times \) 3 dilated separable conv, dilation=2, repeat 2
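In implementation terms, each edge can hold all candidate operations and weight their outputs by the sampled mask, as in this sketch of Eq. (2) (class and argument names are illustrative assumptions):

```python
import torch.nn as nn

class MixedEdge(nn.Module):
    """Edge (i, j) holding all candidate operations (Eq. 2 in sketch form)."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)  # built from the set above

    def forward(self, x, mask):
        # `mask` is the softened one-hot M^{(i,j)}: during search every
        # candidate runs and is weighted; as the temperature anneals, the
        # mask approaches one-hot and a single operation dominates.
        return sum(m * op(x) for m, op in zip(mask, self.ops))
```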

Intra-cell latency cost

For the operation selection in the cells of a real-time network, we take real-world latency into consideration. Specifically, we build a GPU-latency lookup table (Cai et al. 2018; Tan et al. 2019; Wu et al. 2019; Zhang et al. 2019b) that records the inference time cost of each candidate operation. The latency of each operation is measured in microseconds on a TitanXP GPU. During the searching process, we associate a cost \(lat_{o}^{(i,j)}\) with each candidate operation \(o^{(i,j)}\) at edge \((i, j)\); thus the latency cost of cell p is formulated as:

$$\begin{aligned} lat_p \! = \! \sum \limits _{(i,j)} \sum \limits _{o \in {\mathcal {O}}} m^{(i,j)}_{o} \cdot lat_{o}^{(i,j)}, \end{aligned}$$
(4)

where \(m^{(i,j)}_{o} \in M^{(i,j)}\) and \(M^{(i,j)}\) denotes the softened one-hot random variable at edge \((i, j)\). By using the pre-built lookup table and the above sampling process, the latency loss is also differentiable with respect to \(m^{(i,j)}_{o}\).
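A sketch of this computation, assuming the lookup table has been flattened into one latency tensor per edge (the container names are assumptions):

```python
import torch

def cell_latency(masks, latency_table):
    """Expected latency of one cell (Eq. 4).

    `masks[e]` is the softened one-hot M^{(i,j)} for edge e = (i, j), and
    `latency_table[e]` a tensor of measured per-operation latencies from
    the pre-built lookup table; gradients flow through the masks.
    """
    return sum((masks[e] * latency_table[e]).sum() for e in masks)
```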

3.3 Joint Network Depth and Downsampling Search

Hyper-cell search space

The network depth and downsampling strategy directly affect the accuracy and speed of a real-time semantic segmentation network. To adjust them jointly and automatically, we formulate the two decision-making processes as a single cell-level pruning process. Specifically, we propose a hyper-cell, as shown in Fig. 3, which consists of a reduction cell and n normal cells. We introduce \(n + 1\) edges to connect each cell with the hyper-cell’s output and associate these edges with a learnable architecture parameter \(\beta \). The intra-cell architecture parameters \(\alpha \) of the n normal cells are shared within the same hyper-cell.

Fig. 3
figure 3

Illustration of our hyper-cell. The hyper-cell consists of a reduction cell, n normal cells, and \(n + 1\) edges whose architecture parameter encodes the depth of the hyper-cell. The structure of a cell is shown on the right of this figure

We determine the depth of each hyper-cell by requiring that only one edge be activated per hyper-cell; all cells behind the activated edge can then be pruned safely. Each edge in hyper-cell s is associated with a one-hot random variable \(U^s\) = (\(u_1^s\), \(u_2^s\), ..., \(u_{n+1}^s\)) from a fully factorizable joint distribution P(U). The \(U^s\) acts as a mask during the training process, and the output of hyper-cell s is designed as:

$$\begin{aligned} HyperOut^{(s)} = \sum \limits _{p=1}^{n+1} {u_{p}} ^ {s} \cdot ({C_{p}} ^ {s}), \end{aligned}$$
(5)

where \(C_{p}^{s}\) is the output of the p-th cell in hyper-cell s, and \(u_{p}^{s}\) is the random variable in \(\{0, 1\}\) of the p-th edge of hyper-cell s. We adopt the Gumbel Softmax based sampling process to make the training process differentiable:

$$\begin{aligned} U^{s} \! = f_{\beta ^{s}}(G^{s}) = \text {softmax} ((\log \beta ^{s} + G^{s}) / \lambda ), \end{aligned}$$
(6)

where \(U^{s}\) is the softened one-hot random variable for edge selection of hyper-cell s, and \(\beta ^{s}\) is the architecture parameter of hyper-cell s. \(G^{s}\) and \(\lambda \) are analogous to the ones in Eq. (3). The hyper-cell architecture parameter \(\beta \) can be effectively optimized together with the network parameters w and the intra-cell architecture parameters \(\alpha \) in the same round of back-propagation. After stacking hyper-cells to form a whole network, the network depth and downsampling strategy can be fully explored concurrently according to the architecture parameter \(\beta \).

To better explain the cell-level pruning process, consider the following example. In the initial phase, suppose we have five cells (one reduction cell and four normal cells), and each cell in the hyper-cell keeps its original inputs and outputs. As shown in Fig. 3, if the fourth edge is currently activated (i.e. U is \(\{0, 0, 0, 1, 0\}\)), Normal Cell-4 is pruned in this iteration, and the output of the hyper-cell is the output of Normal Cell-3. At the same time, the reduction cell in the next hyper-cell \(s+1\) takes the outputs of hyper-cell s and of Normal Cell-2 in hyper-cell s as its inputs, to stick to the “two-input” principle of the cell. This learning and adjusting continues throughout the entire searching phase.

By introducing the architecture parameter \(\beta \) in the proposed hyper-cell, we can dynamically adjust and search for the network depth as well as the downsampling strategy for real-time semantic segmentation.
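A compact sketch of the hyper-cell forward pass under Eqs. (5) and (6) follows (illustrative PyTorch; `self.beta` plays the role of \(\log \beta ^{s}\), and all cell outputs within a hyper-cell share one resolution because only the first cell downsamples):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperCell(nn.Module):
    """Hyper-cell sketch: a reduction cell, n normal cells, and a
    learnable depth selector over the n + 1 output edges."""
    def __init__(self, reduction_cell, normal_cells):
        super().__init__()
        self.cells = nn.ModuleList([reduction_cell, *normal_cells])
        self.beta = nn.Parameter(torch.zeros(len(self.cells)))  # n + 1 edges

    def forward(self, s0, s1, temperature):
        # Sample the softened one-hot edge selector U^s (Eq. 6).
        u_rand = torch.rand_like(self.beta).clamp(min=1e-9)
        g = -torch.log(-torch.log(u_rand))
        u = F.softmax((self.beta + g) / temperature, dim=-1)
        outputs, prev_prev, prev = [], s0, s1
        for cell in self.cells:
            prev_prev, prev = prev, cell(prev_prev, prev)
            outputs.append(prev)
        # Eq. (5): weight each cell's output by its edge variable; after
        # convergence, cells behind the selected edge are pruned.
        return sum(u_p * out for u_p, out in zip(u, outputs))
```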

Network Latency Cost We define the set of cells in all hyper-cells in the initial phase as P; after optimization, the cardinality of this set is reduced, and the new set is denoted \({\bar{P}}\). For the current architecture \((\alpha ,\beta )\) containing several hyper-cells, the total latency excludes the pruned cells and can be calculated as:

$$\begin{aligned} Lat(\alpha ,\beta ) = \sum \limits _{p \in {\bar{P}} } lat_{p}, \end{aligned}$$
(7)

where \({\bar{P}}\) is the set of remaining cells in all hyper-cells of architecture \((\alpha ,\beta )\), and \(lat_{p}\) is the latency of cell p. We construct the latency loss function \(L_{lat}\) as:

$$\begin{aligned} L_{lat} = \log (Lat(\alpha ,\beta )). \end{aligned}$$
(8)

Thus, the total loss function can be formulated as:

$$\begin{aligned} L_{total} = L_{CE} + \gamma ~L_{lat}, \end{aligned}$$
(9)

where \(L_{CE}\) is the cross-entropy loss between the predictions of the architecture \((\alpha ,\beta )\) with network weights w and the ground truth, and \(L_{lat}\) denotes the total latency loss of architecture \((\alpha ,\beta )\). Moreover, \(\gamma \) controls the magnitude of the latency term (i.e. it balances the trade-off between accuracy and speed).
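Putting Eqs. (7)–(9) together, the search objective can be sketched as follows (function and argument names are assumptions):

```python
import torch

def total_loss(logits, target, cell_latencies, gamma, ce_loss_fn):
    """Search objective: cross-entropy plus weighted log-latency.

    `cell_latencies` holds the differentiable latency of each remaining
    (unpruned) cell, i.e. the summands of Eq. (7).
    """
    lat = torch.stack(list(cell_latencies)).sum()      # Eq. (7): Lat(alpha, beta)
    l_lat = torch.log(lat)                             # Eq. (8)
    return ce_loss_fn(logits, target) + gamma * l_lat  # Eq. (9)
```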

3.4 Network-Level Auto Feature Aggregation

To remedy the loss of spatial details in real-time segmentation networks due to fast downsampling, we propose the aggregation cell, which automatically aggregates features from different levels in the network with optimal operations. The aggregation cell seamlessly integrates the outputs of the above hyper-cells, and the outputs of the early hyper-cells compensate for the loss of spatial details.

The structure of the proposed aggregation cell is shown in Fig. 4. The aggregation cell takes the outputs of three hyper-cells at different resolutions as its inputs and is thus designed to combine multi-scale features (i.e. low-level spatial details and high-level semantic context). The aggregation cell is a directed acyclic graph consisting of M nodes and E edges. Each node is a latent representation (i.e. feature map), and each directed edge is associated with some candidate operations. As shown in Fig. 4, each edge’s stride is set to 1, unless explicitly specified by “s = 2” (stride 2), which acts as a downsampling connection. The output of the aggregation cell is designed as the concatenation of the final feature maps from the three hyper-cells. We use the same sampling and optimization process as the intra-cell search in Sect. 3.2 to optimize the aggregation cell’s architecture parameters.

Fig. 4
figure 4

Overview of the aggregation cell for automatic multi-scale feature aggregation. The aggregation cell contains E edges (dotted arrows), and each edge is equipped with some candidate operations. The “s = 2” means stride = 2

Given the candidate operation set, the aggregation cell also efficiently enlarges the receptive field of the network. For the operation set of the aggregation cell, we collect the following five operations (a sketch that builds them appears after the list):

  • 1\(\times \)1 conv, repeat 2

  • 3\(\times \)3 conv, repeat 2

  • 3\(\times \)3 dilated separable conv, dilation=2, repeat 2

  • 3\(\times \)3 dilated separable conv, dilation=4, repeat 2

  • 3\(\times \)3 dilated separable conv, dilation=8, repeat 2
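One plausible construction of these candidates is sketched below; the BN/ReLU placement and the reading of "repeat 2" as two stacked conv units are our assumptions:

```python
import torch.nn as nn

def _unit(ch, k, dilation=1, separable=False):
    """One conv + BN + ReLU unit; `pad` keeps the spatial size."""
    pad = dilation * (k // 2)
    convs = ([nn.Conv2d(ch, ch, k, padding=pad, dilation=dilation,
                        groups=ch, bias=False),          # depth-wise
              nn.Conv2d(ch, ch, 1, bias=False)]          # point-wise
             if separable else
             [nn.Conv2d(ch, ch, k, padding=pad, dilation=dilation,
                        bias=False)])
    return nn.Sequential(*convs, nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

def aggregation_candidates(ch):
    """The five candidate operations, each repeated twice."""
    make = lambda *a, **kw: nn.Sequential(_unit(ch, *a, **kw),
                                          _unit(ch, *a, **kw))
    return nn.ModuleList([
        make(1),                              # 1x1 conv, repeat 2
        make(3),                              # 3x3 conv, repeat 2
        make(3, dilation=2, separable=True),  # dilated separable, d=2
        make(3, dilation=4, separable=True),  # dilated separable, d=4
        make(3, dilation=8, separable=True),  # dilated separable, d=8
    ])
```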

4 Experiments

To verify the effectiveness and superiority of our joint search framework, we compare our AutoRTNet with other state-of-the-art methods on two challenging benchmarks: Cityscapes (Cordts et al. 2016) and CamVid (Brostow et al. 2008). Moreover, we conduct a series of ablation studies to verify the effectiveness of the proposed hyper-cell and aggregation cell. Furthermore, we provide an in-depth analysis of the architecture searched by our framework. Finally, we give detailed quantitative results, visualization results, and adequate comparisons with other state-of-the-art methods.

4.1 Implementation Details

Searching For the searching process, the whole network contains three hyper-cells, and the initial cell numbers in these hyper-cells are \(\{5, 10, 10\}\), respectively. The intermediate node number of each cell is set to 2. The initial channel number is 8, and the channel number is tripled when downsampling in reduction cells. The searching process, conducted on the Cityscapes dataset, runs 150 epochs with mini-batch size 16, which takes approximately 16 hours with 16 TitanXP GPU cards. Similar to FBNet (Wu et al. 2019), we postpone the training of the hyper-cell architecture parameters \(\beta \) by 50 epochs to warm up the network weights w and intra-cell architecture parameters \(\alpha \). The \(\alpha \) and \(\beta \) are optimized by Adam, with an initial learning rate of 0.001, momentum parameters (0.5, 0.999), and a weight decay of 1e-4. The w is optimized using SGD with a momentum of 0.9, a weight decay of 1e-3, and a cosine learning-rate scheduler that decays the learning rate from 0.025 to 0.001. For Gumbel Softmax, we empirically set the initial temperature \(\lambda \) in Eqs. (3) and (6) to 3.0, and gradually decrease it to a minimum value of 0.03. We set the node number M and edge number E to 7 in the aggregation cell.
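For concreteness, the reported optimizer and temperature settings could be wired up as below; `weight_params` and `arch_params` are hypothetical parameter groups, and the exponential shape of the temperature decay is our assumption (the text states only the endpoints):

```python
import torch

w_opt = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9,
                        weight_decay=1e-3)
w_sched = torch.optim.lr_scheduler.CosineAnnealingLR(w_opt, T_max=150,
                                                     eta_min=0.001)
arch_opt = torch.optim.Adam(arch_params, lr=0.001, betas=(0.5, 0.999),
                            weight_decay=1e-4)

def gumbel_temperature(epoch, total_epochs=150, t0=3.0, t_min=0.03):
    """Anneal the Gumbel temperature from 3.0 down to 0.03."""
    return max(t_min, t0 * (t_min / t0) ** (epoch / total_epochs))
```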

Retraining When the searching process is over, the searched network is first pretrained on the ImageNet dataset from scratch. We then finetune the network on the specific segmentation dataset (i.e. Cityscapes or CamVid) for 200 epochs with mini-batch size 16. The base learning rate is 0.01, and the ‘poly’ learning rate policy is adopted with a power of 0.9, together with a momentum of 0.9 and a weight decay of 0.0005. Following (Wu et al. 2016; Yu et al. 2018), we compute the loss function with the online bootstrapping strategy. Data augmentation for training contains random horizontal flipping, random resizing with scales in [0.5, 2.0], and random cropping to a fixed size.
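The ‘poly’ policy scales the base learning rate by \((1 - \mathrm{iter}/\mathrm{max\_iter})^{power}\); a one-line sketch:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """The 'poly' learning-rate policy used for retraining."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# With base_lr = 0.01, halfway through training:
# poly_lr(0.01, 5000, 10000) -> ~0.0054
```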

Table 1 Accuracy and speed comparison of our method against other state-of-the-art methods on the Cityscapes test set
Table 2 Results on the CamVid test set with resolution 720 \(\times \) 960

4.2 Benchmarks and Evaluation Metrics

Cityscapes (Cordts et al. 2016), a public street scene dataset, contains high-quality pixel-level annotations of 5000 images with size 1024 \(\times \) 2048 and 19,998 images with coarse annotations. 19 semantic classes are used for training and evaluation. CamVid (Brostow et al. 2008) is another public dataset, containing 701 images in total. We follow the training/testing set split in (Zhang et al. 2019b; Brostow et al. 2008), with 468 training and 233 testing images. These images are densely labeled with 11 semantic classes. We use three evaluation metrics: the mean of class-wise intersection over union (mIoU), network forward time (latency), and frames per second (FPS).
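For reference, mIoU is typically accumulated over a confusion matrix; the sketch below follows that standard recipe (the `ignore_index=255` convention for unlabeled pixels is an assumption borrowed from common Cityscapes practice):

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Accumulate a confusion matrix (rows: ground truth, cols: prediction)."""
    valid = gt != ignore_index
    idx = num_classes * gt[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)

def mean_iou(conf):
    """mIoU: per-class TP / (TP + FP + FN), averaged over observed classes."""
    tp = np.diag(conf).astype(np.float64)
    denom = conf.sum(axis=0) + conf.sum(axis=1) - tp
    return (tp[denom > 0] / denom[denom > 0]).mean()
```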

4.3 Real-Time Semantic Segmentation Results

In this section, we compare our AutoRTNet with other real-time segmentation methods. We run all experiments based on PyTorch 0.4 (Paszke et al. 2017) and measure latency on an NVIDIA TitanXP GPU card under CUDA 9.0. For a fair comparison, we directly quote the remeasured or estimated speed results on TitanXP reported for other algorithms in (Zhang et al. 2019b; Orsic et al. 2019). For AutoRTNet, we report the inference time averaged over 500 runs. In this process, we do not employ any test augmentation.
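A measurement protocol consistent with this description might look as follows (the warm-up count and the use of CUDA timing events are our assumptions; the 500-run average follows the text):

```python
import torch

@torch.no_grad()
def measure_latency(model, input_shape=(1, 3, 768, 1536), runs=500, warmup=50):
    """Average GPU forward time in ms, plus the implied FPS."""
    model.eval().cuda()
    x = torch.randn(*input_shape, device="cuda")
    for _ in range(warmup):          # warm up kernels and caches
        model(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(runs):
        model(x)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / runs
    return ms, 1000.0 / ms
```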

Results on Cityscapes

We conduct the searching process with latency term weight \(\gamma \) 0.01 and 0.001, and obtain the relatively fast and slow networks named AutoRTNet-F and AutoRTNet-S, respectively. We evaluate them on the Cityscapes test set. The validation set is added for training before submitting to the online Cityscapes server. Following (Zhang et al. 2019b; Yu et al. 2018), we scale the resolution of the images from 1024 \(\times \) 2048 to 768 \(\times \) 1536 as inputs to measure the speed and accuracy. As shown in Table 1, our AutoRTNet achieves the best trade-off between accuracy and speed. AutoRTNet-F yields 72.2% mIoU while maintaining 110.0 FPS on the Cityscapes test set with only fine data and without any test augmentation. When the coarse data is added to the training set, the mIoU achieves 73.9%, which is the state-of-the-art trade-off for real-time semantic segmentation. Compared with BiseNet (Yu et al. 2018) and CAS (Zhang et al. 2019b) which have a comparable speed to us, AutoRTNet-F surpasses them by 3.8% and 1.7% in mIoU on the Cityscapes test set, respectively. Compared with other real-time segmentation methods (e.g. ENet (Paszke et al. 2016), ICNet (Zhao et al. 2018a)), our AutoRTNet-F surpasses them in both speed and accuracy by a large margin. Moreover, our AutoRTNet-S achieves 74.3% and 75.8% mIoU (+ coarse data) on the Cityscapes test set with 71.4 FPS, which is also the state-of-the-art real-time performance.

Results on CamVid To validate the transferability of the networks searched by our framework, we directly transfer AutoRTNet-F and AutoRTNet-S, which are obtained on Cityscapes, to the CamVid dataset, as reported in Table 2. We only transfer the network architectures and train them on CamVid from scratch. With 720 \(\times \) 960 input images, AutoRTNet-F achieves 73.5% mIoU on the CamVid test set with 140.0 FPS, which is the state-of-the-art trade-off between accuracy and speed. AutoRTNet-S achieves 74.2% mIoU with 82.5 FPS. We also conduct the architecture search on CamVid (\(\gamma = 0.1\)) and name the resulting ultrafast network AutoRTNet-U. Notably, AutoRTNet-U achieves appealing 250.0 FPS while maintaining 68.6% mIoU on the CamVid test set, which surpasses ICNet (Zhao et al. 2018a) (67.1% mIoU with 34.5 FPS) and DFANet (Li et al. 2019a) (64.7% mIoU with 120 FPS) significantly.

Parameter results Many computationally limited mobile platforms have restrictive memory constraints for real-time applications, and thus the parameter size is also an important consideration. Table 2 shows the results of our AutoRTNet and other methods on the CamVid test set. With only 2.5 million parameters, our AutoRTNet-F achieves impressive accuracy (i.e. 73.5% mIoU) on the CamVid test set, which significantly outperforms existing real-time segmentation networks. The parameter sizes of AutoRTNet-S and AutoRTNet-U are 3.9M and 1.4M, respectively.

4.4 Ablation Study

The contribution of each component is investigated in the following ablation studies on the Cityscapes validation set. The latency term weight \(\gamma \) in Eq. (9) is set to 0.01, and all networks in the following experiments are first pretrained on ImageNet for a fair comparison, unless otherwise noted.

Table 3 Comparison with random search on the Cityscapes validation set
Table 4 The optimization results of hyper-cells with different initial states and different random seeds
Table 5 Comparison to random downsampling strategy

4.4.1 Comparison with Random Search

As discussed in (Li and Talwalkar 2019; Yu et al. 2019), NAS is a specialized hyper-parameter optimization problem, and random search is a competitive baseline for it. We apply random search to semantic segmentation by randomly sampling ten architectures from our previously defined search space. The whole search space contains intra-cell operation selection and hyper-cell depth decisions, which makes it significantly challenging for random search to find a satisfactory network. As shown in Table 3, random search achieves an average of 66.7% ± 2.5% mIoU on the Cityscapes validation set with ImageNet pretraining, which is substantially lower than our AutoRTNet. These results also demonstrate the effectiveness of our search algorithm.

4.4.2 Hyper-Cell

Robustness First, to verify the robustness of the hyper-cell, we set different initial numbers of cells and different random seeds in the initialization phase. The network contains three hyper-cells, and the initial cell numbers in the hyper-cells are set to {a, b, c}; after optimization, the numbers of cells remaining in each hyper-cell are {\({\overline{a}}\), \({\overline{b}}\), \({\overline{c}}\)}. As shown in Table 4, the experiments demonstrate that the hyper-cells are insensitive to both the initial numbers of cells and random seeds, which verifies the robustness and stability of the hyper-cell.

Downsampling strategy To demonstrate the effectiveness of the downsampling strategy searched by hyper-cells, we compare random downsampling position settings with the searched one. The total cell number searched by our framework is 12 (i.e., \({\overline{a}}\)+\({\overline{b}}\)+\({\overline{c}}\) = 12); for a fair comparison, we fix the searched cell structures and only change the downsampling positions (x, y, z) randomly, where (x, y, z) represents the index positions of the reduction cells among the 12 cells. After pretraining and retraining, the results in Table 5 demonstrate the superiority of the downsampling strategy searched through hyper-cells. Compared with the random ones, our hyper-cell achieves the best trade-off between accuracy and speed.

Hyper-cell number In our framework, we empirically set the number of hyper-cells to 3; the downsampling factor is thus 16, including a stem convolution layer. In fact, the number of hyper-cells can also be learned by a learnable architecture parameter \(\delta \), whose optimization is similar to that of the architecture parameter \(\beta \). Specifically, the number of hyper-cells can be learned by a hyper-cell-level pruning process, i.e., reducing the initial hyper-cell number automatically as follows. First, we introduce edges connecting to the output of each hyper-cell and associate them with the learnable parameters \(\delta \). Then, we determine the number of hyper-cells by requiring that only one edge be activated. We set the initial hyper-cell number to 5, so the initial maximum downsampling factor is 64, which covers common practice. Finally, the parameters \(\delta \) and network parameters w are optimized to determine the number of hyper-cells automatically. We perform five repeated experiments, as shown in Table 6. With latency weight \(\gamma = \) 0.01, the hyper-cell number determined by the parameter \(\delta \) is 3 or 4.

We then conduct experiments with different numbers of hyper-cells, including the numbers 3 and 4 determined by \(\delta \) and the hand-designed number 5. The results are shown in Table 7: AutoRTNet obtains similar performance when the hyper-cell number is 3 or 4, while performance degrades when the hyper-cell number is 5, which demonstrates the effectiveness of the parameter \(\delta \). Hence, the hyper-cell number can be set to 3 or 4 in our framework.

Table 6 The optimization results of the number of hyper-cells
Table 7 The optimization results with different hyper-cell numbers
Table 8 Ablation study for the aggregation cell
Fig. 5
figure 5

Illustration of cell numbers in hyper-cells during the searching process. Blue lines from top to bottom denote the actual cell number changing in each hyper-cell with the increase of epochs, and red curves represent the mathematical expectation values of the current cell numbers in hyper-cells (Color figure online)

4.4.3 Aggregation Cell

To demonstrate the effectiveness of the proposed aggregation cell, we conduct a series of experiments with different strategies: (a) without multi-scale feature aggregation; (b) with an aggregation cell using operations selected randomly from the aggregation cell’s search space; (c) with the searched aggregation cell under \(\gamma = 0.01\), and the corresponding network is named AutoRTNet-\(\bar{{\mathrm {F}}}\); (d) with the searched unconstrained aggregation cell (i.e. our AutoRTNet-F), for which we do not introduce the latency constraint. The result of the random aggregation cell is the average over ten repeated random experiments, and the results are shown in Table 8. Overall, the searched aggregation cell boosts the mIoU from 69.9 to 72.9% on the Cityscapes validation set; in particular, it surpasses the random one by 1.5%. Moreover, we observe that when the latency constraint is added to the aggregation cell, the accuracy degrades from 72.9 to 72.2% mIoU while the speed improves by only 0.22 ms (+2.8 FPS). Thus, for a better overall trade-off between accuracy and speed, we do not introduce the latency constraint to the aggregation cell.

Fig. 6
figure 6

The results of different latency settings on CamVid

Fig. 7
figure 7

Illustration of the detailed AutoRTNet-F architecture. The structures of the reduction cells and normal cells in the three hyper-cells are shown in the figure respectively. The structure of the searched aggregation cell is shown on the right. Best viewed in color (Color figure online)

4.4.4 Hyper-Cell Searching Process

To better analyze how hyper-cell works throughout the whole searching process, we visualize the number of cells of each hyper-cell after the warm-up phase, as depicted in Fig. 5. The initial cell numbers are \(\left\{ 5, 10, 10 \right\} \) and eventually converge to \(\left\{ 2, 4, 6 \right\} \) in three hyper-cells. The blue lines from top to bottom denote the actual cell numbers according to the current architecture parameter \(\beta \) of each hyper-cell, and red curves represent the mathematical expectation values of current cell numbers. We observe that our framework explores different cell numbers (i.e. different depths) in each hyper-cell actively at the early stage of searching, and the expectation values of cell numbers also change gradually. The cell numbers progressively become stable towards the final architecture in the late stage of searching, and the actual cell number lines gradually get close to the expectation curves. Another interesting observation is that hyper-cell #1 finds its optimal depth much earlier than the other ones, indicating that the searching process follows a shallow-to-deep manner as we expected.

4.4.5 Different Latency Settings

Our joint search framework searches for the optimal network architectures under different latency settings (i.e. loss weight \(\gamma \)). In Sect. 4.3, AutoRTNet-F and AutoRTNet-S are searched with \(\gamma = 0.01\) and 0.001 on the Cityscapes dataset, respectively, which demonstrates the flexibility of our framework. We also conduct the architecture search on the CamVid dataset with different latency settings, and the results are depicted in Fig. 6. The networks searched with \(\gamma \) = 0.001, 0.01, 0.05, 0.1 achieve 73.3%, 70.2%, 69.2%, 68.6% mIoU and 138.0, 200.2, 232.1, 250.0 FPS on the CamVid test set, respectively. Notably, our AutoRTNet-U achieves 250.0 FPS while maintaining 68.6% mIoU, which surpasses ICNet (67.1% mIoU with 34.5 FPS) and DFANet (64.7% mIoU with 120 FPS) significantly.

4.5 Insights from Searched AutoRTNet-F

We provide an in-depth analysis of the AutoRTNet-F searched by our framework. We use NAS methods to search for suitable architectures for specific tasks; likewise, we should understand why the searched network works well, as this can in turn guide hand-designed architectures. We have the following three important observations.

Early downsampling We notice that the searched downsampling strategy is stable and reasonable. As shown in Table 4, in the first hyper-cell, whether the initial cell number is 5 or 10, the final number is at most 2 after optimization. The reason is that visual information is highly spatially redundant and thus can be compressed into a more efficient representation. Under the latency constraints, the searched downsampling strategy is as we expected and follows the early downsampling prior knowledge (Paszke et al. 2016).

Suitable receptive field A suitable receptive field size is crucial for semantic segmentation (Luo et al. 2016). A too-large receptive field may introduce extra noise or negative interference, while the network cannot capture enough context information if it is too small. During the searching process, our AutoRTNet continuously adjusts the operation selection to determine the suitable receptive field. For example, in the optimized aggregation cell, as shown in Fig. 7, the operations applied to the outputs of the third hyper-cell always use a dilation rate of 4 rather than the rates of 2 or 8 that are also in the search space. This suggests choosing the corresponding operations for a suitable receptive field in hand-designed semantic segmentation networks.

Operation selection The early operations act as good feature extractors. As shown in Fig. 7, the operations selected in the early stages always tend to be conv 3\(\times \)3, while the middle and deep layers show a diversity of operation selections. When performing multi-scale feature aggregation in the aggregation cell, we clearly find that the deeper layers prefer dilated convolution operations, while the lower layers prefer common convolution operations.

Fig. 8
figure 8

Illustration of the detailed AutoRTNet-F’ architecture

Table 9 The hyper-parameters used for the search and the corresponding results of AutoRTNet-F and AutoRTNet-F’

4.6 Comparison of Networks Searched on Different Datasets

In this part, we compare the similarities and differences between AutoRTNet-F, searched on Cityscapes, and its counterpart searched on CamVid. We depict the architecture searched on CamVid with \(\gamma = \) 0.001 in Fig. 8, named AutoRTNet-F’. Compared with AutoRTNet-F, AutoRTNet-F’ has a similar Hyper-cell #1 architecture. However, Hyper-cell #2 and Hyper-cell #3 of the two networks are rather different. Specifically, Hyper-cell #2 of AutoRTNet-F selects more max-pooling operations than that of AutoRTNet-F’, and Hyper-cell #3 of AutoRTNet-F prefers more convolution operations than that of AutoRTNet-F’. Moreover, AutoRTNet-F, which is searched at a higher image resolution, has more dilated convolution operations than AutoRTNet-F’. We also list the hyper-parameters used for the network search and the corresponding network performance in Table 9. The other searching hyper-parameters remain the same when searching for AutoRTNet-F and AutoRTNet-F’.

Table 10 The detailed information of our AutoRTNet and other state-of-the-art methods on the Cityscapes test set
Table 11 Detailed per-class accuracy comparison of our AutoRTNet with other methods on the Cityscapes test set
Table 12 Accuracy and speed comparison on the Cityscapes validation set with image resolution 1024 \(\times \) 2048

4.7 Detailed Time and GPU Information

The inference time or FPS is influenced by the GPU device and the input image size of the model. Here we list detailed information of previous approaches in Table 10 for reference. Our GPU device is an NVIDIA TitanXP. For a fair comparison, we directly quote the remeasured or estimated results on TitanXP reported for other algorithms in the CAS (Zhang et al. 2019b) and SwiftNet (Orsic et al. 2019) papers. We remeasure the speed of a method based on our implementation if its original speed was reported on a different GPU and is not mentioned in CAS (Zhang et al. 2019b) or SwiftNet (Orsic et al. 2019). Note that our implementations and speed measurements do not use TensorRT optimizations.

The speed of DFANet was reported on a TitanX GPU and is also not mentioned in CAS (Zhang et al. 2019b) or SwiftNet (Orsic et al. 2019). Thus, we carefully remeasure its inference time on a TitanXP for a fair comparison. There is still a gap between the original speed and the one we measured; we suspect this is caused by the inconsistency of the implementation platforms. We reimplement DFANet using official PyTorch (Paszke et al. 2017), whereas its authors measure it on their own framework, in which the depth-wise separable convolution is more fully optimized.

4.8 Detailed Quantitative Results and Visualization Results

Here we provide detailed quantitative results of per-class mIoU on the Cityscapes and CamVid datasets. Moreover, we provide the performance of the AutoRTNet on the full-resolution Cityscapes validation set. Finally, we provide some visual segmentation results on Cityscapes and CamVid.

Table 13 Detailed per-class accuracy comparison of our AutoRTNet with other methods on the CamVid test set
Fig. 9
figure 9

Visual segmentation results on the Cityscapes validation set. a Image. b Ground Truth. c ICNet. d AutoRTNet-F

4.8.1 Cityscapes Dataset

Compared with other methods, our AutoRTNet-F achieves an overall 72.2% mIoU with 110.0 FPS, which is the state-of-the-art trade-off between accuracy and speed. The per-class accuracy values are shown in Table 11. In comparison with other methods with public per-class accuracy on the Cityscapes test set, our predictions are more accurate in 13 out of 19 classes. AutoRTNet-F achieves slight improvements on the general classes (Road, Sidewalk, Building, Terrain, Car, etc.) while obtaining a significant accuracy improvement on the challenging classes (Truck, Motorbike, Train, Fence, Rider, etc.). Moreover, AutoRTNet-S achieves 74.3% mIoU on the Cityscapes test set with 71.4 FPS.

Fig. 10
figure 10

Visual segmentation results on CamVid test set. a Image. b Ground Truth. c ICNet. d AutoRTNet-F

The Cityscapes dataset contains high-resolution 1024 \(\times \) 2048 images, which makes it a big challenge for real-time semantic segmentation. With high-resolution image inputs, Zhao et al. (2018a) focus on building a practically fast semantic segmentation system while accomplishing high-quality results. SwiftNet (Orsic et al. 2019) and CAS (Zhang et al. 2019b) also perform experiments on Cityscapes with full-resolution image inputs. In this part, we compare with these methods on the Cityscapes validation set, and the results are shown in Table 12. We refer to the speed scaling factors across different GPUs in the SwiftNet (Orsic et al. 2019) paper and estimate the speed values of ICNet, SwiftNet, and CAS on a TitanXP GPU.

Table 14 Results on the Cityscapes validation set of our networks and PSPNet
Table 15 Results on the Cityscapes test set of our AutoRTNet-\(\hat{{\mathrm {S}}}\) and other high-accuracy methods

AutoRTNet-F Our AutoRTNet-F achieves 75.0% mIoU and 62.7 FPS on the full-resolution Cityscapes validation set (i.e. 1024 \(\times \) 2048). To the best of our knowledge, the real-time performance of AutoRTNet-F outperforms all existing real-time methods. AutoRTNet-F surpasses ICNet by 5.5% in mIoU with a faster inference speed. Moreover, it outperforms SwiftNet and CAS by 0.6% and 1.0% in mIoU, respectively, with a great advantage in inference speed (62.7 FPS vs 38.1 FPS and 62.7 FPS vs 45.2 FPS).

AutoRTNet-S Our AutoRTNet-S achieves 76.8% mIoU with 45.6 FPS on the full-resolution Cityscapes validation set, which is the state-of-the-art real-time performance. Compared with SwiftNet and CAS that both have a little bit slower speed than us, our AutoRTNet-S surpasses them by 2.4% and 2.8% in mIoU, respectively.

4.8.2 CamVid Dataset

As shown in Table 13, with 720 \(\times \) 960 input images, the searched AutoRTNet-F achieves 73.5% mIoU with 140.0 FPS, which is the state-of-the-art trade-off between accuracy and speed on the CamVid test set. In comparison with other methods, the predictions of our AutoRTNet-F are more accurate in 7 out of the 11 classes. More importantly, the inference speed of AutoRTNet-F reaches 140 FPS, which is impressive compared with other methods (e.g. SegNet 29.4 FPS, ENet 61.2 FPS, ICNet 34.5 FPS). The per-class accuracies of AutoRTNet-S and AutoRTNet-U are also shown in Table 13.

4.8.3 Visual Segmentation Results

We provide some visual prediction results on both the Cityscapes and CamVid datasets. As shown in Figs. 9 and 10, the columns correspond to the input image, ground truth, the prediction of ICNet, and the prediction of our AutoRTNet-F. Compared with ICNet, AutoRTNet-F produces more accurate and detailed results with faster inference speed. For example, AutoRTNet-F captures small objects in more detail (e.g. the traffic light in Fig. 9 and the poles in Fig. 10) and generates “smoother” results on object boundaries (e.g. the rider and fence in Fig. 9 and the car in Fig. 10).

4.9 Comparison with High-Accuracy Models

Our AutoRTNet has a performance gap to high-accuracy methods, e.g. PSPNet, even with small \(\gamma \) values. In our framework, the performance of the searched network is mainly affected by two key factors: (1) the search space size and (2) the latency loss weight \(\gamma \). We aim to search for real-time networks and thus use a relatively small search space, which is responsible for the performance gap. Specifically, the maximal number of channels in our search space is only 144, and the parameter size of the search space is only 12.14 M. The searched AutoRTNet-S has only 3.88 M parameters, while PSPNet-ResNet101 has 68.07 M.

For a comparison with PSPNet, we directly expand AutoRTNet-S by widening the network (i.e. 4 \(\times \) channels) to obtain a larger network named AutoRTNet-\(\hat{{\mathrm {S}}}\). We then evaluate it on the Cityscapes validation set, and the corresponding results are shown in Table 14. AutoRTNet-\(\hat{{\mathrm {S}}}\) achieves 78.5% mIoU on the validation set, which is very close (only 0.1% mIoU lower) to PSPNet-ResNet101 (78.6% mIoU). Note that AutoRTNet-\(\hat{{\mathrm {S}}}\) is about 10 times faster in inference than PSPNet-ResNet101 (8.7 FPS vs 0.78 FPS).

Moreover, we also train AutoRTNet-\(\hat{{\mathrm {S}}}\) with the coarse annotations for comparison with other high-accuracy methods, and the results on the Cityscapes test set are shown in Table 15. From the experimental results, AutoRTNet-\(\hat{{\mathrm {S}}}\) sacrifices real-time speed for an accuracy improvement. Even so, this high-performance network still infers about 10 \(\times \) faster than PSPNet. In conclusion, the experiment proves the great potential of our AutoRTNet. Although our framework focuses on real-time segmentation network search under a relatively small search space, it can also be flexibly generalized to the high-accuracy scenario by enlarging the search space.

5 Conclusion

In this paper, we propose a novel joint search framework that covers all three main aspects of the design philosophy for real-time semantic segmentation networks. The framework searches for building blocks, network depth, downsampling strategy, and feature aggregation method simultaneously. The hyper-cell is proposed for searching the network depth and downsampling strategy jointly and automatically, and the aggregation cell is introduced for automatic multi-scale feature aggregation. Extensive experiments on both Cityscapes and CamVid datasets demonstrate the superiority and effectiveness of our approach.