1 Introduction

Convolution is a core building block in computer vision. Early algorithms employ convolutional filters to blur images, extract edges, or detect features. Convolution has been heavily exploited in modern neural networks [46, 47] due to its efficiency and generalization ability, in comparison to fully connected models [2]. The success of convolution mainly comes from two properties: translation equivariance and locality. Translation equivariance, although not exact [93], aligns well with the nature of imaging and thus generalizes the model to different positions or to images of different sizes. Locality, on the other hand, reduces parameter counts and M-Adds, but makes modeling long-range relations challenging.

A rich set of literature has discussed approaches to modeling long range interactions in convolutional neural networks (CNNs). Some employ atrous convolutions [12, 33, 64, 74], larger kernels [67], or image pyramids [82, 94], either designed by hand or searched by algorithms [11, 57, 99]. Another line of work adopts attention mechanisms. Attention has shown its ability to model long range interactions in language modeling [80, 85], speech recognition [10, 21], and neural captioning [88]. Attention has since been extended to vision, giving significant boosts to image classification [6], object detection [36], semantic segmentation [39], video classification [84], and adversarial defense [86]. These works enrich CNNs with non-local or long-range attention modules.

Recently, stacking attention layers as stand-alone models without any spatial convolution has been proposed [37, 65] and has shown promising results. However, naive attention is computationally expensive, especially on large inputs. Applying local constraints to attention, as proposed by [37, 65], reduces the cost and enables building fully attentional models. However, local constraints limit the model’s receptive field, which is crucial to tasks such as segmentation, especially on high-resolution inputs. In this work, we propose to adopt axial-attention [32, 39], which not only allows efficient computation, but also recovers the large receptive field in stand-alone attention models. The core idea is to factorize 2D attention into two 1D attentions along the height- and width-axis sequentially. Its efficiency enables us to attend over large regions and build models to learn long range or even global interactions. Additionally, most previous attention modules do not utilize positional information, which degrades attention’s ability to model position-dependent interactions, like shapes or objects at multiple scales. Recent works [6, 37, 65] introduce positional terms to attention, but in a context-agnostic way. In this paper, we augment the positional terms to be context-dependent, making our attention position-sensitive, at a marginal cost.

We show the effectiveness of our axial-attention models on ImageNet [70] for classification, and on three datasets (COCO [56], Mapillary Vistas [62], and Cityscapes [22]) for panoptic segmentation [45], instance segmentation, and semantic segmentation. In particular, on ImageNet, we build an Axial-ResNet by replacing the \(3\times 3\) convolution in all residual blocks [31] with our position-sensitive axial-attention layer, and we further make it fully attentional [65] by adopting axial-attention layers in the ‘stem’. As a result, our Axial-ResNet attains state-of-the-art results among stand-alone attention models on ImageNet. For segmentation tasks, we convert Axial-ResNet to Axial-DeepLab by replacing the backbones in Panoptic-DeepLab [18]. On COCO [56], our Axial-DeepLab outperforms the current bottom-up state-of-the-art, Panoptic-DeepLab [19], by 2.8% PQ on the test-dev set. We also show state-of-the-art segmentation results on Mapillary Vistas [62] and Cityscapes [22].

To summarize, our contributions are four-fold:

  • The proposed method is the first attempt to build stand-alone attention models with a large or global receptive field.

  • We propose a position-sensitive attention layer that makes better use of positional information without adding much computational cost.

  • We show that axial-attention works well, not only as a stand-alone model on image classification, but also as a backbone on panoptic segmentation, instance segmentation, and semantic segmentation.

  • Our Axial-DeepLab improves significantly over the bottom-up state-of-the-art on COCO, achieving performance comparable to two-stage methods. We also surpass the previous state-of-the-art methods on Mapillary Vistas and Cityscapes.

2 Related Work

Top-Down Panoptic Segmentation: Most state-of-the-art panoptic segmentation models employ a two-stage approach in which object proposals are first generated and each proposal is then processed sequentially. We refer to such approaches as top-down or proposal-based methods. Mask R-CNN [30] is commonly deployed in the pipeline for instance segmentation, paired with a light-weight stuff segmentation branch. For example, Panoptic FPN [44] adds a semantic segmentation head to Mask R-CNN [30], while Porzi et al. [68] append a light-weight DeepLab-inspired module [13] to the multi-scale features from FPN [55]. Additionally, extra modules have been designed to resolve the overlapping instance predictions produced by Mask R-CNN. TASCNet [49] and AUNet [52] propose a module to guide the fusion between ‘thing’ and ‘stuff’ predictions, while Liu et al. [61] adopt a Spatial Ranking module. UPSNet [87] develops an efficient parameter-free panoptic head for fusing ‘thing’ and ‘stuff’, which is further explored by Li et al. [50] for end-to-end training of panoptic segmentation models. AdaptIS [77] uses point proposals to generate instance masks.

Bottom-up Panoptic Segmentation: In contrast to top-down approaches, bottom-up or proposal-free methods for panoptic segmentation typically start with the semantic segmentation prediction, followed by grouping ‘thing’ pixels into clusters to obtain instance segmentation. DeeperLab [89] predicts the four bounding box corners and object centers for class-agnostic instance segmentation. SSAP [28] exploits the pixel-pair affinity pyramid [60] enabled by an efficient graph partition method [43]. BBFNet [7] obtains instance segmentation results via the Watershed transform [4, 81] and Hough-voting [5, 48]. Recently, Panoptic-DeepLab [19], a simple, fast, and strong approach for bottom-up panoptic segmentation, employs a class-agnostic instance segmentation branch involving a simple instance center regression [42, 63, 79], coupled with DeepLab semantic segmentation outputs [12, 14, 15]. Panoptic-DeepLab has achieved state-of-the-art results on several benchmarks, and our method builds on top of it.

Self-attention: Attention, introduced by [3] for the encoder-decoder in a neural sequence-to-sequence model, was developed to capture the correspondence of tokens between two sequences. In contrast, self-attention applies attention within a single context instead of across multiple modalities. Its ability to directly encode long-range interactions and its parallelizability have led to state-of-the-art performance on various tasks [24, 25, 38, 53, 66, 72, 80]. Recently, self-attention has been applied to computer vision by augmenting CNNs with non-local or long-range modules. Non-local neural networks [84] show that self-attention is an instantiation of non-local means [9] and achieve gains on many vision tasks such as video classification and object detection. Additionally, [6, 17] show improvements on image classification by combining features from self-attention and convolution. State-of-the-art results on video action recognition tasks [17] are also achieved in this way. On semantic segmentation, self-attention has been developed as a context aggregation module that captures multi-scale context [26, 39, 95, 98]. Efficient attention methods have been proposed to reduce its complexity [39, 53, 73]. Additionally, CNNs augmented with non-local means [9] are shown to be more robust to adversarial attacks [86]. Besides discriminative tasks, self-attention is also applied to generative modeling of images [8, 32, 91]. Recently, [37, 65] show that self-attention layers alone can be stacked to form a fully attentional model by restricting the receptive field of self-attention to a local square region. Encouraging results are shown on both image classification and object detection. In this work, we follow this direction of research and propose a stand-alone self-attention model with a large or global receptive field, making self-attention models non-local again. Our models are evaluated on bottom-up panoptic segmentation and show significant improvements.

3 Method

We begin by formally introducing our position-sensitive self-attention mechanism. Then, we discuss how it is applied to axial-attention and how we build stand-alone Axial-ResNet and Axial-DeepLab with axial-attention layers.

3.1 Position-Sensitive Self-attention

Self-attention: The self-attention mechanism is usually applied to vision models as an add-on to augment CNN outputs [39, 84, 91]. Given an input feature map \(x \in \mathbb {R}^{h \times w \times d_{in}}\) with height h, width w, and channels \(d_{in}\), the output at position \(o=(i,j)\), \(y_{o} \in \mathbb {R}^{d_{out}}\), is computed by pooling over the projected input as:

$$\begin{aligned} y_{o} = \sum _{p \in \mathcal {N}} \text {softmax}_{p}(q_{o}^T k_{p})v_{p} \end{aligned}$$
(1)

where \(\mathcal {N}\) is the whole location lattice, and queries \(q_{o}=W_Q x_{o}\), keys \(k_{o}=W_K x_{o}\), values \(v_{o}=W_V x_{o}\) are all linear projections of the input \(x_{o}~\forall o \in \mathcal {N}\). \(W_Q, W_K \in \mathbb {R}^{d_{q} \times d_{in}}\) and \(W_V \in \mathbb {R}^{d_{out} \times d_{in}}\) are all learnable matrices. The \(\text {softmax}_{p}\) denotes a softmax function applied to all possible \(p=(a, b)\) positions, which in this case is also the whole 2D lattice.

This mechanism pools values \(v_{p}\) globally based on affinities \(x_{o}^T W_Q^T W_K x_{p}\), allowing us to capture related but non-local context in the whole feature map, as opposed to convolution which only captures local relations.
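
As a concrete illustration, the following is a minimal NumPy sketch of Eq. (1); the flattening of the 2D lattice into \(hw\) positions and the toy shapes are our own choices for exposition, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_self_attention(x, W_Q, W_K, W_V):
    """Eq. (1): y_o = sum_p softmax_p(q_o^T k_p) v_p over the whole 2D lattice.

    x: (h, w, d_in) input feature map.
    W_Q, W_K: (d_q, d_in); W_V: (d_out, d_in) learnable projections.
    Returns: (h, w, d_out).
    """
    h, w, _ = x.shape
    flat = x.reshape(h * w, -1)       # flatten the 2D lattice into hw positions
    q = flat @ W_Q.T                  # (hw, d_q)  queries
    k = flat @ W_K.T                  # (hw, d_q)  keys
    v = flat @ W_V.T                  # (hw, d_out) values
    logits = q @ k.T                  # (hw, hw)   affinities q_o^T k_p
    attn = softmax(logits, axis=-1)   # softmax over all positions p
    return (attn @ v).reshape(h, w, -1)

# toy usage
rng = np.random.default_rng(0)
h, w, d_in, d_q, d_out = 4, 5, 8, 4, 6
x = rng.standard_normal((h, w, d_in))
W_Q, W_K = rng.standard_normal((d_q, d_in)), rng.standard_normal((d_q, d_in))
W_V = rng.standard_normal((d_out, d_in))
y = global_self_attention(x, W_Q, W_K, W_V)   # (4, 5, 6)
```

Note that the \((hw) \times (hw)\) affinity matrix makes the \(\mathcal {O}(h^2 w^2)\) cost discussed next immediately visible.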

However, self-attention is extremely expensive to compute (\(\mathcal {O}(h^2 w^2)\)) when the spatial dimension of the input is large, restricting its use to only high levels of a CNN (i.e., downsampled feature maps) or small images. Another drawback is that the global pooling does not exploit positional information, which is critical to capture spatial structures or shapes in vision tasks.

These two issues are mitigated in [65] by adding local constraints and positional encodings to self-attention. For each location o, a local \(m\times m\) square region is extracted to serve as a memory bank for computing the output \(y_{o}\). This significantly reduces the computation to \(\mathcal {O}(h w m^2)\), allowing self-attention modules to be deployed as stand-alone layers to form a fully self-attentional neural network. Additionally, a learned relative positional encoding term is incorporated into the affinities, yielding a dynamic prior of where to look in the receptive field (i.e., the local \(m\times m\) square region). Formally, [65] proposes

$$\begin{aligned} y_{o} = \sum _{p \in \mathcal {N}_{m \times m}(o)} \text {softmax}_{p}(q_{o}^T k_{p} + q_{o}^T r_{p-o}) v_{p} \end{aligned}$$
(2)

where \(\mathcal {N}_{m \times m}(o)\) is the local \(m \times m\) square region centered around location \(o=(i,j)\), and the learnable vector \(r_{p-o} \in \mathbb {R}^{d_{q}}\) is the added relative positional encoding. The inner product \(q_{o}^T r_{p-o}\) measures the compatibility from location \(p=(a,b)\) to location \(o=(i,j)\). We do not consider absolute positional encoding \(q_{o}^T r_{p}\), because it does not generalize as well as its relative counterpart [65]. In the following paragraphs, we drop the term relative for conciseness.
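
A subtle implementation detail of Eq. (2) is how \(r_{p-o}\) is looked up. The sketch below shows one plausible indexing scheme, assuming a single learned table holding one \(d_q\)-dimensional vector per relative offset inside the \(m\times m\) window; the table layout (and any factorization across axes used in practice) is our assumption.

```python
import numpy as np

def relative_encoding_table(m, d_q, rng):
    """One learnable vector r_{p-o} in R^{d_q} per relative offset inside an
    m x m window; offsets range over [-(m-1)/2, (m-1)/2] on each axis."""
    return 0.02 * rng.standard_normal((m, m, d_q))

def lookup_r(table, p, o, m):
    """Fetch r_{p-o} for key position p = (a, b) and query position o = (i, j)."""
    half = (m - 1) // 2
    di, dj = p[0] - o[0], p[1] - o[1]     # relative offset, |di|, |dj| <= half
    return table[di + half, dj + half]    # shift offsets to non-negative indices

# e.g. with m = 5, a query at o = (10, 7) attending to p = (9, 9): offset (-1, +2)
rng = np.random.default_rng(0)
table = relative_encoding_table(m=5, d_q=8, rng=rng)
r = lookup_r(table, p=(9, 9), o=(10, 7), m=5)     # r_{p-o}, shape (8,)
```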

In practice, \(d_q\) and \(d_{out}\) are much smaller than \(d_{in}\), and one could extend single-head attention in Eq.  (2) to multi-head attention to capture a mixture of affinities. In particular, multi-head attention is computed by applying N single-head attentions in parallel on \(x_{o}\) (with different \(W_Q^n, W_K^n, W_V^n, \forall n \in \{1, 2, \dots , N\}\) for the n-th head), and then obtaining the final output \(z_{o}\) by concatenating the results from each head, i.e., \(z_{o}=\text {concat}_n(y^n_{o})\). Note that positional encodings are often shared across heads, so that they introduce marginal extra parameters.

Position-Sensitivity: We notice that previous positional bias only depends on the query pixel \(x_{o}\), not the key pixel \(x_{p}\). However, the keys \(x_{p}\) could also have information about which location to attend to. We therefore add a key-dependent positional bias term \(k_{p}^T r^{k}_{p-o}\), besides the query-dependent bias \(q_{o}^T r^{q}_{p-o}\).

Fig. 1. A non-local block (left) vs. our position-sensitive axial-attention applied along the width-axis (right). “\(\otimes \)” denotes matrix multiplication, and “\(\oplus \)” denotes element-wise sum. The softmax is performed on the last axis. Blue boxes denote \(1\times 1\) convolutions, and red boxes denote relative positional encoding. The channels \(d_{in}=128\), \(d_{q}=8\), and \(d_{out}=16\) are what we use in the first stage of ResNet after the ‘stem’

Similarly, the values \(v_{p}\) do not contain any positional information in Eq. (2). In the case of large receptive fields or memory banks, it is unlikely that \(y_{o}\) preserves the precise location that each \(v_{p}\) comes from. Thus, previous models have to trade off between using smaller receptive fields (i.e., small \(m\times m\) regions) and throwing away precise spatial structures. In this work, we enable the output \(y_{o}\) to retrieve relative positions \(r^{v}_{p-o}\), besides the content \(v_{p}\), based on query-key affinities \(q_{o}^T k_{p}\). Formally,

$$\begin{aligned} y_{o} = \sum _{p \in \mathcal {N}_{m \times m}(o)} \text {softmax}_{p}(q_{o}^T k_{p} + q_{o}^T r^{q}_{p-o} + k_{p}^T r^{k}_{p-o}) (v_{p} + r^{v}_{p-o}) \end{aligned}$$
(3)

where the learnable \(r^{k}_{p-o} \in \mathbb {R}^{d_{q}}\) is the positional encoding for keys, and \(r^{v}_{p-o} \in \mathbb {R}^{d_{out}}\) is for values. Neither vector introduces many parameters, since they are shared across attention heads in a layer and the number of local pixels \(|\mathcal {N}_{m \times m}(o)|\) is usually small.

We call this design position-sensitive self-attention (Fig. 1), which captures long range interactions with precise positional information at a reasonable computation overhead, as verified in our experiments.
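
To make Eq. (3) concrete, here is a minimal single-head NumPy sketch evaluated at one output position; the explicit loop over the neighborhood and the handling of out-of-bounds positions (simply skipping them) are our simplifications, and a real implementation would be vectorized and batched.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ps_attention_at(x, o, m, W_Q, W_K, W_V, r_q, r_k, r_v):
    """Eq. (3) at a single output position o = (i, j), single head.

    x: (h, w, d_in); r_q, r_k: (m, m, d_q); r_v: (m, m, d_out),
    indexed by the relative offset (p - o) shifted to [0, m-1] per axis.
    """
    h, w, _ = x.shape
    half = (m - 1) // 2
    i, j = o
    q_o = W_Q @ x[i, j]                                     # (d_q,)
    logits, values = [], []
    for di in range(-half, half + 1):
        for dj in range(-half, half + 1):
            a, b = i + di, j + dj
            if not (0 <= a < h and 0 <= b < w):             # skip positions outside the image
                continue
            k_p = W_K @ x[a, b]                             # (d_q,)
            v_p = W_V @ x[a, b]                             # (d_out,)
            rq, rk, rv = (r[di + half, dj + half] for r in (r_q, r_k, r_v))
            logits.append(q_o @ k_p + q_o @ rq + k_p @ rk)  # content + query bias + key bias
            values.append(v_p + rv)                         # content + positional retrieval
    attn = softmax(np.array(logits))                        # softmax over the m x m neighborhood
    return attn @ np.stack(values)                          # (d_out,)

# toy usage
rng = np.random.default_rng(0)
h, w, d_in, d_q, d_out, m = 8, 8, 8, 4, 6, 5
x = rng.standard_normal((h, w, d_in))
W_Q, W_K = rng.standard_normal((d_q, d_in)), rng.standard_normal((d_q, d_in))
W_V = rng.standard_normal((d_out, d_in))
r_q, r_k = rng.standard_normal((m, m, d_q)), rng.standard_normal((m, m, d_q))
r_v = rng.standard_normal((m, m, d_out))
y_o = ps_attention_at(x, (4, 4), m, W_Q, W_K, W_V, r_q, r_k, r_v)   # (d_out,)
```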

3.2 Axial-Attention

The local constraint, proposed by the stand-alone self-attention models [65], significantly reduces computational costs in vision tasks and enables building fully self-attentional models. However, such a constraint sacrifices the global connection, making attention’s receptive field no larger than that of a depthwise convolution with the same kernel size. Additionally, the local self-attention, performed in local square regions, still has complexity quadratic in the region side length, introducing another hyper-parameter that trades off between performance and computational complexity. In this work, we propose to adopt axial-attention [32, 39] in stand-alone self-attention, ensuring both global connection and efficient computation. Specifically, we first define an axial-attention layer on the width-axis of an image as simply one-dimensional position-sensitive self-attention, and use a similar definition for the height-axis. To be concrete, the axial-attention layer along the width-axis is defined as follows.

$$\begin{aligned} y_{o} = \sum _{p \in \mathcal {N}_{1 \times m}(o)} \text {softmax}_{p}(q_{o}^T k_{p} + q_{o}^T r^{q}_{p-o} + k_{p}^T r^{k}_{p-o}) (v_{p} + r^{v}_{p-o}) \end{aligned}$$
(4)

One axial-attention layer propagates information along one particular axis. To capture global information, we employ two axial-attention layers consecutively for the height-axis and width-axis, respectively. Both of the axial-attention layers adopt the multi-head attention mechanism, as described above.

Axial-attention reduces the complexity to \(\mathcal {O}(hwm)\). This enables a global receptive field, which is achieved by setting the span m directly to the whole input feature map. Optionally, one could also use a fixed m value, in order to reduce the memory footprint on huge feature maps.
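
The sketch below illustrates Eq. (4) with a global span (m = w), vectorized over rows; the height-axis layer is the same operation applied to the transposed feature map. Indexing the positional tables by the (output column, input column) pair, so that r[o, p] stores \(r_{p-o}\), is our simplification of the relative-offset lookup.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def width_axial_attention(x, W_Q, W_K, W_V, r_q, r_k, r_v):
    """Eq. (4) with global span (m = w): every pixel attends to all pixels in its row.

    x: (h, w, d_in); W_Q, W_K: (d_q, d_in); W_V: (d_out, d_in).
    r_q, r_k: (w, w, d_q); r_v: (w, w, d_out), where r[o, p] stores r_{p-o}
    (expanded from 2w-1 learned vectors in practice; our simplification here).
    """
    q = np.einsum('qd,hwd->hwq', W_Q, x)                # (h, w, d_q)
    k = np.einsum('qd,hwd->hwq', W_K, x)                # (h, w, d_q)
    v = np.einsum('vd,hwd->hwv', W_V, x)                # (h, w, d_out)
    content = np.einsum('hoq,hpq->hop', q, k)           # q_o^T k_p, per row
    q_bias = np.einsum('hoq,opq->hop', q, r_q)          # q_o^T r^q_{p-o}
    k_bias = np.einsum('hpq,opq->hop', k, r_k)          # k_p^T r^k_{p-o}
    attn = softmax(content + q_bias + k_bias, axis=-1)  # (h, w_out, w_in)
    y = np.einsum('hop,hpv->hov', attn, v)              # sum_p attn * v_p
    y += np.einsum('hop,opv->hov', attn, r_v)           # sum_p attn * r^v_{p-o}
    return y                                            # (h, w, d_out)

def height_axial_attention(x, *params):
    """Height-axis attention: the same operation on the transposed feature map
    (it needs its own projections and (h, h, ...) positional tables)."""
    return width_axial_attention(x.transpose(1, 0, 2), *params).transpose(1, 0, 2)

# toy usage of the width-axis layer
rng = np.random.default_rng(0)
h, w, d_in, d_q, d_out = 6, 6, 8, 4, 6
x = rng.standard_normal((h, w, d_in))
params = (rng.standard_normal((d_q, d_in)), rng.standard_normal((d_q, d_in)),
          rng.standard_normal((d_out, d_in)),
          rng.standard_normal((w, w, d_q)), rng.standard_normal((w, w, d_q)),
          rng.standard_normal((w, w, d_out)))
y = width_axial_attention(x, *params)                   # (6, 6, 6)
```

The (h, w, w) attention tensor reflects the \(\mathcal {O}(hwm)\) cost with m = w, in contrast to the (hw, hw) tensor of global 2D attention.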

Fig. 2. An axial-attention block, which consists of two axial-attention layers operating along the height- and width-axis sequentially. The channels \(d_{in}=128\) and \(d_{out}=16\) are what we use in the first stage of ResNet after the ‘stem’. We employ \(N=8\) attention heads

Axial-ResNet: To transform a ResNet [31] to an Axial-ResNet, we replace the \(3\times 3\) convolution in the residual bottleneck block by two multi-head axial-attention layers (one for height-axis and the other for width-axis). Optional striding is performed on each axis after the corresponding axial-attention layer. The two \(1\times 1\) convolutions are kept to shuffle the features. This forms our (residual) axial-attention block, as illustrated in Fig. 2, which is stacked multiple times to obtain Axial-ResNets. Note that we do not use a \(1\times 1\) convolution in-between the two axial-attention layers, since matrix multiplications (\(W_Q, W_K, W_V\)) follow immediately. Additionally, the stem (i.e., the first strided \(7\times 7\) convolution and \(3\times 3\) max-pooling) in the original ResNet is kept, resulting in a conv-stem model where convolution is used in the first layer and attention layers are used everywhere else. In conv-stem models, we set the span m to the whole input from the first block, where the feature map is 56 \(\times \) 56.
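
The residual axial-attention block can be summarized structurally as below; this is a shape-level sketch with the sub-layers passed in as callables, and the simplified shortcut and end-of-block striding are our shortcuts rather than the exact block of Fig. 2.

```python
import numpy as np

def multi_head(single_head_fns):
    """Multi-head attention: run N single-head layers in parallel and
    concatenate their outputs along the channel axis (z_o = concat_n y^n_o)."""
    return lambda t: np.concatenate([f(t) for f in single_head_fns], axis=-1)

def axial_block(x, conv_in, attn_h, attn_w, conv_out, stride=1):
    """Structural sketch of one (residual) axial-attention block.

    x: (h, w, d_in) -> 1x1 conv -> height-axis attention -> width-axis attention
      -> 1x1 conv -> + residual.
    In the paper, optional striding follows each attention layer on its own axis;
    here both axes are strided at the end, and the strided shortcut is a plain
    subsample instead of a strided 1x1 convolution (simplifications).
    """
    y = conv_in(x)        # 1x1 conv: shuffles channels, (h, w, d_in) -> (h, w, d_mid)
    y = attn_h(y)         # multi-head axial-attention along the height-axis
    y = attn_w(y)         # multi-head axial-attention along the width-axis
    y = conv_out(y)       # 1x1 conv back to the residual's channel count
    if stride > 1:
        y = y[::stride, ::stride]
        x = x[::stride, ::stride]
    return x + y          # residual connection

# toy usage with identity stand-ins for the sub-layers
rng = np.random.default_rng(0)
x = rng.standard_normal((56, 56, 128))
identity = lambda t: t
out = axial_block(x, identity, identity, identity, identity)   # (56, 56, 128)
```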

In our experiments, we also build a full axial-attention model, called Full Axial-ResNet, which further applies axial-attention to the stem. Instead of designing a special spatially-varying attention stem [65], we simply stack three axial-attention bottleneck blocks. In addition, we adopt local constraints (i.e., a local \(m\times m\) square region as in [65]) in the first few blocks of Full Axial-ResNets, in order to reduce computational cost.

Axial-DeepLab: To further convert Axial-ResNet to Axial-DeepLab for segmentation tasks, we make several changes as discussed below.

Firstly, to extract dense feature maps, DeepLab [12] changes the stride and atrous rates of the last one or two stages in ResNet [31]. Similarly, we remove the stride of the last stage but we do not implement the ‘atrous’ attention module, since our axial-attention already captures global information for the whole input. In this work, we extract feature maps with output stride (i.e., the ratio of input resolution to the final backbone feature resolution) 16. We do not pursue output stride 8, since it is computationally expensive.

Secondly, we do not adopt the atrous spatial pyramid pooling module (ASPP) [13, 14], since our axial-attention block could also efficiently encode the multi-scale or global information. We show in the experiments that our Axial-DeepLab without ASPP outperforms Panoptic-DeepLab [19] with and without ASPP.

Lastly, following Panoptic-DeepLab  [19], we adopt exactly the same stem  [78] of three convolutions, dual decoders, and prediction heads. The heads produce semantic segmentation and class-agnostic instance segmentation, and they are merged by majority voting [89] to form the final panoptic segmentation.
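
For intuition, a hypothetical sketch of the majority-voting merge is given below; it only captures the core idea (each class-agnostic instance mask adopts the most frequent semantic label inside it) and omits the thing/stuff handling and thresholding details of the actual Panoptic-DeepLab merging module.

```python
import numpy as np

def merge_by_majority_vote(semantic, instance_ids):
    """Hypothetical sketch of majority-vote merging [89].

    semantic:     (h, w) int array of predicted semantic classes (non-negative).
    instance_ids: (h, w) int array, 0 = no instance, >0 = instance index.
    Returns a panoptic class map and the instance id map.
    """
    pan_class = semantic.copy()
    for inst in np.unique(instance_ids):
        if inst == 0:
            continue                               # pixels without an instance keep
        mask = instance_ids == inst                # their (stuff) semantic prediction
        votes = np.bincount(semantic[mask])        # count semantic labels inside the mask
        pan_class[mask] = votes.argmax()           # majority class wins
    return pan_class, instance_ids

# toy usage: one predicted instance mask over mixed semantic labels
sem = np.array([[1, 1, 2], [1, 1, 2], [7, 7, 7]])  # 7 could be a 'stuff' class
ins = np.array([[1, 1, 1], [1, 1, 1], [0, 0, 0]])
pan_class, _ = merge_by_majority_vote(sem, ins)    # the instance takes class 1
```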

In cases where the inputs are extremely large (e.g., \(2177 \times 2177\)) and memory is constrained, we resort to a large span \(m=65\) in all our axial-attention blocks. Note that we do not consider the axial span as a hyper-parameter because it is already sufficient to cover long range or even global context on several datasets, and setting a smaller span does not significantly reduce M-Adds.

4 Experimental Results

We conduct experiments on four large-scale datasets. We first report results with our Axial-ResNet on ImageNet [70]. We then convert the ImageNet pretrained Axial-ResNet to Axial-DeepLab, and report results on COCO [56], Mapillary Vistas [62], and Cityscapes [22] for panoptic segmentation, evaluated by panoptic quality (PQ) [45]. We also report average precision (AP) for instance segmentation, and mean IoU for semantic segmentation on Mapillary Vistas and Cityscapes. Our models are trained using TensorFlow [1] on 128 TPU cores for ImageNet and 32 cores for panoptic segmentation.

Training Protocol: On ImageNet, we adopt the same training protocol as [65] for a fair comparison, except that we use batch size 512 for Full Axial-ResNets and 1024 for all other models, with learning rates scaled accordingly  [29].

For panoptic segmentation, we strictly follow Panoptic-DeepLab [19], except that we use the RAdam [58] optimizer with Lookahead [92] and a linear warm-up (with the same learning rate of 0.001). All our results on panoptic segmentation use this setting. We note that this change does not improve the results, but it smooths our training curves. Panoptic-DeepLab yields similar results in this setting.

4.1 ImageNet

For ImageNet, we build Axial-ResNet-L from ResNet-50 [31]. In detail, we set \(d_{in}=128\), \(d_{out}=2d_{q}=16\) for the first stage after the ‘stem’. We double them whenever the spatial resolution is reduced by a factor of 2 [76]. Additionally, we multiply all the channels [34, 35, 71] by 0.5, 0.75, and 2, resulting in Axial-ResNet-{S, M, XL}, respectively. Finally, Stand-Alone Axial-ResNets are further generated by replacing the ‘stem’ with three axial-attention blocks, where the first block has stride 2. Due to the computational cost introduced by the early layers, we set the axial span \(m=15\) in all blocks of Stand-Alone Axial-ResNets. We always use \(N=8\) heads [65]. In order to avoid careful initialization of \(W_Q, W_K, W_V, r^q, r^k, r^v\), we use batch normalization [40] in all attention layers.
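
The channel configuration described above can be summarized by a small helper; the per-stage doubling schedule and the integer rounding under width multipliers are our reading of the text, not an exact reproduction of the released models.

```python
def axial_resnet_channels(width_mult=1.0, num_stages=4):
    """Per-stage attention channels implied by the description above (our reading):
    d_in=128, d_out=2*d_q=16 in the first stage after the stem, doubled each time
    the spatial resolution is halved, then scaled by a width multiplier
    (0.5 -> S, 0.75 -> M, 1.0 -> L, 2.0 -> XL)."""
    stages = []
    d_in, d_q, d_out = 128, 8, 16
    for _ in range(num_stages):
        stages.append({
            'd_in':  int(d_in * width_mult),
            'd_q':   int(d_q * width_mult),
            'd_out': int(d_out * width_mult),
        })
        d_in, d_q, d_out = 2 * d_in, 2 * d_q, 2 * d_out
    return stages

# Axial-ResNet-S (0.5x): first stage uses d_in=64, d_q=4, d_out=8
print(axial_resnet_channels(0.5)[0])
```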

Table 1 summarizes our ImageNet results. The baselines ResNet-50 [31] (reproduced by [65]) and Conv-Stem + Attention [65] are also listed. In the conv-stem setting, adding BN to the attention layers of [65] slightly improves the performance by 0.3%. Our proposed position-sensitive self-attention (Conv-Stem + PS-Attention) further improves the performance by 0.4% at the cost of marginal extra computation. Our Conv-Stem + Axial-Attention performs on par with Conv-Stem + Attention [65] while being more parameter- and computation-efficient. When comparing with other full self-attention models, our Full Axial-Attention outperforms Full Attention [65] by 0.5%, while being 1.44\(\times \) more parameter-efficient and 1.09\(\times \) more computation-efficient.

Following [65], we experiment with different network widths (i.e., Axial-ResNets-{S, M, L, XL}), exploring the trade-off between accuracy, model parameters, and computational cost (in terms of M-Adds). As shown in Fig. 3, our proposed Conv-Stem + PS-Attention and Conv-Stem + Axial-Attention already outperform ResNet-50 [31, 65] and attention models [65] (both Conv-Stem + Attention and Full Attention) across all settings. Our Full Axial-Attention further attains the best accuracy-parameter and accuracy-complexity trade-offs.

Table 1. ImageNet validation set results. BN: Use batch normalizations in attention layers. PS: Our position-sensitive self-attention. Full: Stand-alone self-attention models without spatial convolutions
Fig. 3. Comparing parameters and M-Adds against accuracy on ImageNet classification. Our position-sensitive self-attention (Conv-Stem + PS-Attention) and axial-attention (Conv-Stem + Axial-Attention) consistently outperform ResNet-50 [31, 65] and attention models [65] (both Conv-Stem + Attention, and Full Attention), across a range of network widths (i.e., different channels). Our Full Axial-Attention works the best in terms of both parameters and M-Adds

4.2 COCO

The ImageNet pretrained Axial-ResNet model variants (with different channels) are then converted to Axial-DeepLab model variants for panoptic segmentation tasks. We first demonstrate the effectiveness of our Axial-DeepLab on the challenging COCO dataset [56], which contains objects of various scales (from less than \(32\times 32\) to larger than \(96\times 96\)).

Val Set: In Table 2, we report our validation set results and compare with other bottom-up panoptic segmentation methods, since our method also belongs to the bottom-up family. As shown in the table, our single-scale Axial-DeepLab-S outperforms DeeperLab [89] by 8% PQ, multi-scale SSAP [28] by 5.3% PQ, and single-scale Panoptic-DeepLab by 2.1% PQ. Interestingly, our single-scale Axial-DeepLab-S also outperforms multi-scale Panoptic-DeepLab by 0.6% PQ, while being 3.8\(\times \) more parameter-efficient and 27\(\times \) more computation-efficient (in M-Adds). Increasing the backbone capacity (via larger channels) consistently improves the performance. Specifically, our multi-scale Axial-DeepLab-L attains 43.9% PQ, outperforming Panoptic-DeepLab [19] by 2.7% PQ.

Table 2. COCO val set. MS: Multi-scale inputs

Test-dev Set: As shown in Table 3, our Axial-DeepLab variants show consistent improvements with larger backbones. Our multi-scale Axial-DeepLab-L attains the performance of 44.2% PQ, outperforming DeeperLab  [89] by 9.9% PQ, SSAP  [28] by 7.3% PQ, and Panoptic-DeepLab  [19] by 2.8% PQ, setting a new state-of-the-art among bottom-up approaches. We also list several top-performing methods adopting the top-down approaches in the table for reference.

Table 3. COCO test-dev set. MS: Multi-scale inputs
Fig. 4. Scale stress test on COCO val set. Axial-DeepLab gains the most when tested on extreme resolutions. On the x-axis, ratio 4.0 means inference with resolution \(4097\times 4097\)

Scale Stress Test: In order to verify that our model learns long range interactions, we perform a scale stress test in addition to standard testing. In the stress test, we train Panoptic-DeepLab (X-71) and our Axial-DeepLab-L with the standard setting, but test them on out-of-distribution resolutions (i.e., we resize the input to different resolutions). Figure 4 summarizes our relative improvements over Panoptic-DeepLab on PQ, PQ (thing), and PQ (stuff). When tested on huge images, Axial-DeepLab shows a large gain (30%), demonstrating that it encodes long range relations better than convolutions. Additionally, Axial-DeepLab improves by 40% on small images, showing that axial-attention is more robust to scale variations.

4.3 Mapillary Vistas

We evaluate our Axial-DeepLab on the large-scale Mapillary Vistas dataset [62]. We only report validation set results, since the test server is not available.

Val Set: As shown in Table 4, our Axial-DeepLab-L outperforms all state-of-the-art methods in both the single-scale and multi-scale settings. Our single-scale Axial-DeepLab-L performs 2.4% PQ better than the previous best single-scale Panoptic-DeepLab (X-71) [19]. In the multi-scale setting, our lightweight Axial-DeepLab-L performs better than Panoptic-DeepLab (Auto-DeepLab-XL++), not only on panoptic segmentation (0.8% PQ) and instance segmentation (0.3% AP), but also on semantic segmentation (0.8% mIoU), the task for which Auto-DeepLab [57] was searched. Additionally, to the best of our knowledge, our Axial-DeepLab-L attains the best single-model semantic segmentation result.

Table 4. Mapillary Vistas validation set. MS: Multi-scale inputs
Table 5. Cityscapes val set and test set. MS: Multi-scale inputs. C: Cityscapes coarse annotation. V: Cityscapes video. MV: Mapillary Vistas

4.4 Cityscapes

Val Set: In Table 5(a), we report our Cityscapes validation set results. Without using extra data (i.e., only Cityscapes fine annotation), our Axial-DeepLab achieves 65.1% PQ, which is 1% better than the current best bottom-up Panoptic-DeepLab  [19] and 3.1% better than proposal-based AdaptIS [77]. When using extra data (e.g., Mapillary Vistas [62]), our multi-scale Axial-DeepLab-XL attains 68.5% PQ, 1.5% better than Panoptic-DeepLab  [19] and 3.5% better than Seamless  [68]. Our instance segmentation and semantic segmentation results are respectively 1.7% and 1.5% better than Panoptic-DeepLab  [19].

Test Set: Table 5(b) shows our test set results. Without extra data, Axial-DeepLab-XL attains 62.8% PQ, setting a new state-of-the-art result. Our model further achieves 66.6% PQ, 39.6% AP, and 84.1% mIoU with Mapillary Vistas pretraining. Note that Panoptic-DeepLab  [19] adopts the trick of output stride 8 during inference on test set, making their M-Adds comparable to our XL models.

4.5 Ablation Studies

We perform ablation studies on Cityscapes validation set.

Importance of Position-Sensitivity and Axial-Attention: In Table 1, we experiment with attention models on ImageNet. In this ablation study, we transfer them to Cityscapes segmentation tasks. As shown in Table 6, all variants outperform ResNet-50 [31]. Position-sensitive attention performs better than the previous self-attention [65], which aligns with the ImageNet results in Table 1. However, employing axial-attention, which is on par with position-sensitive attention on ImageNet, gives a boost of more than 1% on all three segmentation tasks (in PQ, AP, and mIoU), without ASPP and with fewer parameters and M-Adds. This suggests that the ability of axial-attention to encode long range context significantly improves the performance on segmentation tasks with large input images.

Table 6. Ablating self-attention variants on Cityscapes val set. ASPP: Atrous spatial pyramid pooling. PS: Our position-sensitive self-attention

Importance of Axial-Attention Span: In Table 7, we vary the span m (i.e., spatial extent of local regions in an axial block), without ASPP. We observe that a larger span consistently improves the performance at marginal costs.

Table 7. Varying axial-attention span on Cityscapes val set

5 Conclusion and Discussion

In this work, we have shown the effectiveness of proposed position-sensitive axial-attention on image classification and segmentation tasks. On ImageNet, our Axial-ResNet, formed by stacking axial-attention blocks, achieves state-of-the-art results among stand-alone self-attention models. We further convert Axial-ResNet to Axial-DeepLab for bottom-up segmentation tasks, and also show state-of-the-art performance on several benchmarks, including COCO, Mapillary Vistas, and Cityscapes. We hope our promising results could establish that axial-attention is an effective building block for modern computer vision models.

Our method bears a similarity to decoupled convolution [41], which factorizes a depthwise convolution [20, 35, 75] into a column convolution and a row convolution. This operation could also theoretically achieve a large receptive field, but its convolutional template-matching nature limits the capacity to model multi-scale interactions. Another related method is deformable convolution [23, 27, 96], where each point attends to a few points dynamically on an image. However, deformable convolution does not make use of a key-dependent positional bias or content-based relation. In addition, axial-attention propagates information densely and more efficiently along the height- and width-axis sequentially.

Although our axial-attention model saves M-Adds, it runs slower than convolutional counterparts, as also observed by [65]. This is due to the lack of specialized kernels on various accelerators for the time being. This might well be improved if the community considers axial-attention as a plausible direction.