1 Introduction

3D point cloud representation learning is critical for autonomous driving, especially for core tasks like 3D object detection. The challenges of learning from 3D point clouds mainly come from two aspects. First, 3D points are sparsely distributed in 3D space due to the nature of LiDAR sensors. This forces 3D models to differ from dense models in natural language processing (where words in a sentence are dense) or image understanding (where pixels in an image are dense). Second, both the number of points per point cloud frame and the sensing region are growing as LiDAR sensor hardware improves. Some of the latest commercial LiDARs can sense up to 250 m [14] and 300 m [41] in all directions around the vehicle, producing point clouds that span a very large range.

To address these challenges, previous works have proposed many methods that can be roughly organized into five categories. PointNet-based methods [27, 29, 35] treat 3D point clouds as unordered sets and encode them with MLPs and max pooling; hierarchical structures are introduced to deal with the large input space and to better capture local information. These methods usually have inferior representation capacity compared with more recent methods. PointPillars-style methods [17] divide the space into grids of fixed size to convert the sparse 3D problem into a dense 2D problem. Their cost scales quadratically with the sensing range, making them hard to scale with the advancement of LiDAR hardware. Methods based on sparse submanifold convolutions [13, 33, 37] can handle the sparse input efficiently. They usually use small \(3\times 3\) convolution kernels, which cannot connect features that are sparsely disconnected without adding normal sparse convolutions and striding; this limits their representation capacity. Another weakness is their reliance on heavily optimized custom ops to be efficient on modern GPUs, and their incompatibility with matmul-optimized accelerators such as TPUs. The range image is a compact representation of a point cloud. Multi-view methods [2, 37, 40, 47] run dense convolutions in this view to extract features and fuse them with BEV features learned in the PointPillars style to improve 3D representation learning. It is hard to regress 3D objects directly from the range image because the dense 2D perspective convolutions do not encode 3D information. To tackle this weakness, graph-style kernels [4, 11] replace convolutions to make use of the range information in range images to capture 3D structure, which greatly improves the accuracy but is still inferior to the state of the art. The Transformer [38] is designed to process sequences of data; the challenge in applying it to a point cloud is its quadratic complexity in the number of inputs. Recent methods tackle this problem by attending to neighboring points [26], neighboring voxels [22], or voxels in fixed windows [10]. A generic and efficient Transformer-only model without limitations such as a limited receptive field, irregular memory access patterns, or lack of scalability has yet to be designed.

In this paper, we adapt window-based Transformers to 3D point clouds. The Transformer [38] architecture has been hugely successful in modeling language sequences and image patches. In particular, on 2D images, Swin Transformer [21] proposed to partition images into windows and merge context information in a hierarchical manner. Our Sparse Window Transformer (SWFormer) builds upon similar ideas, but with several key adaptations for sparse windows. Our first adaptation is a bucketing-based window partition for sparse windows. Although each window has the same spatial size, such as a \(10\times 10\) voxel grid, the number of non-empty voxels in each window can vary significantly, so we group these windows into buckets with different effective sequence lengths. Our second adaptation is to limit the expensive window shifting. Swin Transformer [21] shifts windows once per Transformer layer to connect features between windows and increase receptive fields, but this shifting operation is expensive in the sparse setting because it needs to re-order all the sparse features with gather operations; moreover, it is extremely slow on matmul-optimized accelerators such as TPUs. To address this issue, SWFormer employs a new hierarchical backbone architecture, where each SWFormer block has many Transformer layers but only one shifting operation, as shown in Fig. 3. It relies on multi-scale features to achieve large receptive fields for context information, and a multi-scale fusion network to effectively combine these features. The model uses additional custom downsample and upsample algorithms to properly handle the sparse features during feature fusion.

Our innovation continues from the backbone into the 3D object detection head. Existing 3D object detection methods [4, 12, 17, 22, 33, 37, 40, 43, 47, 48] can mostly be viewed as either anchor-based methods with implicit or explicit anchors or DETR [3] based methods [24]. The detection performance is closely related to the distribution of the differences between anchors and groundtruth. Methods with inaccurate anchors [4, 22] perform poorly on large objects such as vehicles, though they can perform reasonably on pedestrians. One way to solve this problem is to add a two-stage model to refine the boxes [22, 33], which greatly improves the detection accuracy. CenterNet-style detection methods [12, 37, 43] strive to place anchors only at the centers of groundtruth boxes, which encourages offset distributions with near-zero mean and small variance. However, when detecting objects directly from sparse features (e.g. features from PointNet, submanifold convolutions, or sparse Transformers), there are not necessarily features close to the object centers. To alleviate this issue, [37] applies normal sparse convolutions to insert points in the convolution output; [10] scatters the sparse features to a dense BEV grid and runs dense convolutions to expand features to missing positions. These methods are expensive. In this paper, we propose a voxel diffusion module that addresses this issue efficiently and scalably by segmenting foreground voxels and diffusing them to their nearby regions, as described in Sect. 3.4.

Extensive experiments are conducted on the challenging Waymo Open Dataset [36] to demonstrate the state-of-the-art 3D object detection results of SWFormer. We summarize our contributions as follows:

  • We propose a hierarchical Sparse Window Transformer (SWFormer) backbone for 3D representation learning. Its flexible receptive fields and multi-scale features make it suitable for different self-driving tasks like object detection and semantic segmentation.

  • We propose a generic voxel diffusion module to address the unique challenge of anchor placement in 3D object detection from sparse features.

  • We conduct extensive experiments on Waymo Open Dataset [36] to demonstrate the state of the art performance of our SWFormer model.

2 Related Work

2.1 3D Object Detection

As one of the most important tasks in autonomous driving, 3D object detection has been extensively studied in prior works. Early works like PointNet [27] and PointNet++ [29] directly apply multilayer perceptrons to individual points, but it is difficult to scale them to large point clouds with good accuracy. The current mainstream 3D object detectors often convert point clouds into bird-eye-view 3D voxels [48] or 2D voxels [17] (2D voxels are also referred to as pillars), where each voxel aggregates the information from the points it contains. In this way, regular 2D or 3D convolutional neural networks can be applied to process these bird-eye-view representations. The pseudo image of voxels also makes it easier to reuse the rich research advancements in 2D object detection, such as two-stage or anchor-based detection heads [43]. The downside is that the pseudo image of voxels grows cubically/quadratically with the voxelization granularity and detection range, not to mention that many of the voxels are effectively empty. Therefore, another type of approach performs 3D object detection without voxelization. This includes methods that detect objects from the perspective view [4, 11, 23], or look up nearest neighbors for each point [25]. However, their detection accuracy is typically inferior to the voxelization route.

To have the best of both worlds, recent approaches [33, 37, 42] start to explore multi-view approaches and make use of sparse convolutions on the voxelized point cloud. For example, the recent range sparse net (RSN [37]) adopts a two-step approach, where the first step performs class-specific segmentation on the range image view, and the second step applies sparse 3D convolutions on the voxel view for specific classes. However, submanifold sparse convolutions cannot connect features that are sparsely disconnected without adding normal sparse convolutions and striding, and they often require heavily optimized customized ops to be efficient on modern accelerators.

Our work aims to learn the 3D representations from sparse point clouds without using any dense or sparse convolutions. Instead, we resort to a hierarchical Transformer to achieve our goal.

2.2 Transformers

Transformers [38] have shown great success in natural language processing [7]. Recently, researchers have brought this architecture to computer vision [1, 6, 30, 39]. ViT [8] partitions images into patches, which greatly advanced the use of Transformers for image classification. Swin Transformer [21] further demonstrated better ways to fuse contextual information through window shifting and hierarchy, and also generalized to other tasks such as segmentation and detection.

Interestingly, Transformers are naturally suitable for sparse point clouds, because they can take sequences of any length as inputs and do not require dense 2D/3D image representations. Therefore, recent works have attempted to adopt Transformers for 3D representation learning, but they are primarily developed for object scans and indoor applications [9, 24, 26, 44]. Voxel Transformer [22] is the submanifold sparse convolution [13] counterpart in the Transformer world, replacing the convolution kernel with attention. Its irregular memory access pattern is computationally inefficient, and its accuracy is worse than state of the art methods. Recently, SST [10] proposed a single-stride transformer for 3D object detection and achieved impressive results on the Waymo Open Dataset, especially for pedestrian detection. However, due to its single-stride nature, SST has a limited receptive field and thus has difficulty dealing with large objects, making it less effective for important tasks like large vehicle detection, large object segmentation (e.g. buildings), lane detection, and trajectory prediction. It needs to scatter features to a dense BEV grid to run several dense convolutions, which limits its scalability. It is also computationally expensive, as it runs many layers of Transformers on the high-resolution feature map, which limits its applications in realtime systems.

Our work is inspired by window-based Transformers (e.g., SwinTransformer [21]) in the sense that we also adopt the hierarchical window-based Transformer backbone, but to address the unique challenges of 3D sparse point clouds, we propose several novel techniques such as the improved SWFormer blocks, multi-scale feature fusion, and voxel diffusion.

3 Sparse Window Transformer

3.1 Overall Architecture

SWFormer is a pure Transformer-based model without any convolutions. Figure 1 shows the overall network architecture: given a sequence of point cloud frames as inputs, each point is augmented with per-frame voxel features [17] and an auxiliary frame timestamp offset [37]. It uses dynamic voxelization [47] and a PointNet [17, 27] based feature embedding net to get sparse voxel features. Note that our voxels are also referred to as pillars in other works [17]. These sparse voxels are then processed by the hierarchical sparse window Transformer network described in Sect. 3.2. The resulting multi-scale features are then fused with Transformer-based feature fusion blocks. To address the unique challenge of detecting 3D boxes from sparse features, we first segment the foreground voxels and then apply a voxel diffusion module to expand foreground voxels to neighboring locations with pseudo voxels. In the end, we apply a CenterNet [37, 43, 46] style detection head to regress 3D boxes.

Fig. 1. Overview of SWFormer model architecture. Given a sparse point cloud, we first perform voxelization to generate a grid of 2D voxels. These voxels are then processed with a 5-scale sequence of hierarchical SWFormer blocks (Fig. 3), with strides \(\{1, 2, 4, 16, 32\}\). The output features are combined with a multi-scale feature fusion network (Sect. 3.3). The fused features are fed to a head, which performs foreground segmentation and voxel diffusion (Sect. 3.4), and computes CenterNet-style classification and box regression loss (Sect. 3.5). Different object classes (e.g. vehicles and pedestrians) may use a separate head on different feature scales.

3.2 Hierarchical Sparse Window Transformer Encoder

A key concept of our SWFormer is the sparse window in the bird's eye view (BEV). After points are converted to a grid of 2D voxels in the BEV, the voxel grid is further partitioned into a list of non-overlapping windows with fixed size \(H \times W\) (e.g., \(10\times 10\)), similar to Swin Transformer [21]; however, since points are often sparse, many voxels are empty with no valid points. Therefore, the number of non-empty voxels in each window may vary from 0 to HW. As we will explain later, all non-empty voxels within the same window are flattened into a single variable-length sequence and fed into Transformer layers. In practice, these variable-length sequences prevent batched training and thus lower training efficiency. To solve this issue, we borrow a widely used idea from natural language processing [7, 38] and recent works [10], which groups these sparse windows into different buckets based on their sequence lengths. Concretely, we divide sparse windows into at most k buckets \(\{B_0, B_1, ..., B_k\}\), where windows in \(B_i\) are always padded to a maximum sequence length of \(HW/2^i\). All padded tokens are masked in the Transformer layers.
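For concreteness, the NumPy sketch below illustrates one way to implement the bucketing and padding described above; the helper names, the number of buckets, and the exact bucket-assignment rule are illustrative assumptions rather than our exact implementation.

```python
import numpy as np

def bucket_windows(window_lengths, H=10, W=10, num_buckets=4):
    """Assign each sparse window to a bucket by its number of non-empty
    voxels. Bucket i pads its windows to length HW / 2**i (illustrative)."""
    caps = [H * W // (2 ** i) for i in range(num_buckets)]  # HW, HW/2, HW/4, ...
    bucket_ids = np.zeros_like(window_lengths)
    for i, cap in enumerate(caps):
        # Each window ends up in the tightest bucket that still fits it.
        bucket_ids[window_lengths <= cap] = i
    return bucket_ids, caps

def pad_bucket(features, lengths, cap):
    """Zero-pad every window of one bucket to `cap` tokens and build the
    boolean mask that hides padded tokens in the Transformer layers.
    `features` is a non-empty list of [length_j, C] arrays."""
    batch = np.zeros((len(features), cap, features[0].shape[-1]), np.float32)
    mask = np.zeros((len(features), cap), dtype=bool)
    for j, (f, n) in enumerate(zip(features, lengths)):
        batch[j, :n] = f
        mask[j, :n] = True
    return batch, mask
```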

Based on the aforementioned sparse windows, our encoder adopts hierarchical Transformers to process the inputs and produce a list of multi-scale BEV features. As shown in Fig. 1, each scale starts with a sparse window partition layer followed by a multi-layer SWFormer block.

Sparse Window Partition: We divide the BEV voxels into non-overlapping windows with fixed size \(H\times W\), which are then grouped into buckets \(\{B_0, B_1, ..., B_k\}\). For each bucket \(B_i\), we flatten all voxels within the same window into a sequence and zero-pad the sequence length to \(HW/2^i\). These sequences are then batched and fed to the Transformer blocks, where the self-attention shares the keys and values for all query voxels coming from the same window [21]. Since SWFormer processes inputs in a hierarchical fashion with multiple feature scales, we need to apply strided window partitions at the beginning of each scale. The strided window partition is similar to a traditional strided convolution, except that it always picks the non-empty voxel closest to the center of the window, with deterministic rules to break ties. Notably, no max or average pooling operations are applied because they are not friendly to sparse implementations. Figure 2 illustrates an example of a stride-4 window partition.
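A minimal NumPy sketch of this strided partition (picking the non-empty voxel nearest to each window center, with an arbitrary deterministic tie-break) is shown below; the function name and data layout are assumptions for illustration.

```python
import numpy as np

def strided_window_partition(coords, features, stride):
    """Keep, for every stride x stride cell, the non-empty voxel closest to
    the cell center (ties broken by sort order). `coords`: [N, 2] integer
    voxel coordinates, `features`: [N, C]. Illustrative sketch only."""
    cells = coords // stride
    centers = cells * stride + (stride - 1) / 2.0        # geometric cell centers
    dist = np.sum((coords - centers) ** 2, axis=1)       # squared distances
    # Sort rows by (cell_x, cell_y, distance) so the closest voxel per cell
    # comes first; np.lexsort treats the last key as the primary key.
    order = np.lexsort((dist, cells[:, 1], cells[:, 0]))
    cells_sorted = cells[order]
    # np.unique returns the first occurrence per cell, i.e. the closest voxel.
    _, first_idx = np.unique(cells_sorted, axis=0, return_index=True)
    keep = order[first_idx]
    return cells[keep], features[keep]
```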

Fig. 2. Strided Sparse Window Partition. Left shows a grid of \(16\times 16\) BEV voxels, where grey voxels are empty and others are non-empty. Right shows the results of stride-4 window partition, leading to a grid of \(4 \times 4\) voxels. For each striding window, it picks the nearest neighbor non-empty voxel feature (light green) from the center (black dot) with any deterministic rule to break ties; if all voxels are empty in the striding window, then the corresponding voxel after striding is also empty. Best viewed in color.

Fig. 3. Sparse Window Transformer Block. Given a sequence of sparse features, it first applies multi-head self-attention (MSA) on all valid voxels within the same window, followed by an MLP and layer norm. After repeating the Transformer layer N times, it performs a shifted sparse window partition to re-generate the sparse windows, and then processes the shifted windows with another M Transformer layers. If N and M are the same, we call it an N-layer SWFormer block for simplicity.

Sparse Window Transformer Block: The Transformer [38] is inherently suitable for sparse point clouds, as it does not require dense 2D/3D inputs as convolutional networks do; unfortunately, due to the quadratic complexity of self-attention with respect to the input sequence length, it is prohibitively expensive to feed the whole point cloud (with millions of points) or the voxel features (with tens of thousands of valid voxels) as a single input sequence to a Transformer. In this paper, we adopt the idea of Swin Transformer [21]: the sparse BEV voxels are first partitioned into windows, and the Transformer is applied to each window separately. To increase the receptive field and connect features across windows, Swin Transformer uses a window shifting technique to re-partition the windows for every Transformer layer. However, as we operate on sparse voxel features, such a shift-window operation is memory-read/write intensive, especially on matmul-optimized accelerators like TPUs. To alleviate this problem, we propose to limit the shift-window operation to once per stride rather than once per layer. Figure 3 shows the detailed architecture of a SWFormer block: it largely follows the same style as Swin Transformer in performing self-attention within a local window, except that it only performs the shift-window operation once, in the middle. Formally, our SWFormer block can be described as follows:

$$\begin{aligned}&\textbf{z}^0 = [\textbf{x}; \text { mask}_z] + \text {PE}_{z}&\end{aligned}$$
(1)
$$\begin{aligned}&{{\hat{\textbf{z}}}^{l}} = \text {LN}\left( {\textbf{z}}^{l - 1} + {\text {MSA}( {{{\textbf{z}}^{l - 1}}} )} \right)&l = 1...N \nonumber \\&{{\textbf{z}}^l} = \text {LN}\left( {{\hat{\textbf{z}}}^{l}} + {\text {MLP} ({{{\hat{\textbf{z}}}^{l}}} )} \right)&l = 1 ... N \nonumber \\&\textbf{u}^0 = [{\text {shift-window}(\textbf{z}}^N); \text { mask}_u] + \text {PE}_{u}&\nonumber \\&{{\hat{\textbf{u}}}^{l}} = \text {LN}\left( \textbf{u}^{l - 1} + {\text {MSA}( {{{\textbf{u}}^{l - 1}}} )} \right)&l = 1...M \nonumber \\&{{\textbf{u}}^l} = \text {LN}\left( {{\hat{\textbf{u}}}^{l}} + {\text {MLP} ({{{\hat{\textbf{u}}}^{l}}} )} \right)&l = 1 ... M \end{aligned}$$
(2)

where \(\textbf{x}\) is the input features after the sparse window partition, \(\text {mask}_z\) is the mask for input padding, and \(\text {PE}_z\) is the positional encoding. The process contains two stages: (1) the first stage applies N Transformer layers to \(\textbf{z}^0\) and outputs \(\textbf{z}^N\). Each Transformer layer consists of a standard multi-head self-attention (MSA) and a multilayer perceptron (MLP); slightly different from the standard version, we adopt the post-norm scheme here, where layer norm (LN) is added after the MSA and MLP. For simplicity, we use the standard sine/cosine absolute positional encoding in this paper. (2) The second stage first applies a window shift to \(\textbf{z}^N\), and adds the updated \(\text {mask}_u\) and positional encoding \(\text {PE}_u\) based on \(\textbf{z}^N\); afterwards, M Transformer layers process \(\textbf{u}^0\) and generate the final output \(\textbf{u}^M\). Notably, each SWFormer block has \(N+M\) Transformer layers but only one window-shift operation.
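The following NumPy sketch mirrors the structure of Eqs. (1)-(2) at a schematic level: learned projection weights, multi-head splitting, and the exact positional encoding are omitted, and `shift_window` is an assumed callback that re-partitions the sparse features and returns the new masks and positional encodings.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def masked_self_attention(x, mask):
    """Single-head attention with a key padding mask (projections omitted)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores = np.where(mask[:, None, :], scores, -1e9)     # hide padded keys
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ x

def mlp(x):
    return np.maximum(x, 0.0)      # placeholder for the learned 2-layer MLP

def transformer_layers(z, mask, num_layers):
    """Post-norm Transformer layers of Eq. (2): LN after each residual add."""
    for _ in range(num_layers):
        z = layer_norm(z + masked_self_attention(z, mask))
        z = layer_norm(z + mlp(z))
    return z

def swformer_block(x, mask_z, pe_z, shift_window, N, M):
    """One SWFormer block: N layers, a single window shift, then M layers."""
    z = np.where(mask_z[..., None], x, 0.0) + pe_z        # Eq. (1)
    z = transformer_layers(z, mask_z, N)
    u, mask_u, pe_u = shift_window(z)                     # the only shift
    u = np.where(mask_u[..., None], u, 0.0) + pe_u
    return transformer_layers(u, mask_u, M)
```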

By restricting window-shift operations, our SWFormer block is more efficient than the conventional Swin Transformer; however, it also limits the receptive field, since each Transformer layer is only applied to a small window. To address this challenge, SWFormer is designed as a hierarchical network with multiple scales, where the strides are gradually increased: for simplicity, this paper uses strides \(\{1, 2, 4, 16, 32\}\) for the five scales. For each scale, we always keep the window size fixed (e.g., \(10\times 10\)); however, as later scales have larger strides, the same window in later scales covers a much larger area. As an example, for the last scale with stride 32, a \(10\times 10\) window covers a \(320\times 320\) area on the original BEV voxel grid, and a single window-shift connects all features within an area as large as \(480\times 480\).

3.3 Multi Scale Feature Fusion

Inspired by the feature pyramid network (FPN [19]), SWFormer adopts a Transformer-based multi-scale feature network to effectively combine all features from the hierarchical Transformer encoder. Figure 4 shows the overall architecture of the feature network: given a list of encoder features \(\{P_0, P_1, ..., P_5\}\), it iteratively fuses \((P_{i+1}, P_i)\) from the large-stride \(P_5\) to the small-stride \(P_0\). Formally, our feature fusion process can be described as:

$$\begin{aligned}&\hat{P}_5 = P_5&\end{aligned}$$
(3)
$$\begin{aligned}&\hat{P}_i = \text {SWFormer}(\text {Concat}(P_i, \text {Upsample}(\hat{P}_{i+1})))&i = 0, ..., 4 \end{aligned}$$
(4)

Starting from the last feature map \(P_5\), we first upsample it to have the same stride as \(P_4\) such that they can be concatenated into a single feature map; afterwards, we simply apply a 1-layer SWFormer block to process the concatenated feature and generate the new \(\hat{P}_4\). The process is iterated until all fused features \(\{\hat{P}_0, ..., \hat{P}_5\}\) have been generated, which have the same strides as the \(\{P_0, ..., P_5\}\) features. The fused features are further used in voxel diffusion and box regression, as described in the following sections.
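A schematic sketch of this top-down fusion loop (Eqs. (3)-(4)) is shown below; the sparse feature maps are represented here as per-scale arrays of non-empty voxel features, and `upsample` and `swformer_1layer` are assumed helpers (a possible sparse upsampling rule is sketched after Fig. 4).

```python
import numpy as np

def fuse_multi_scale(features, upsample, swformer_1layer):
    """Top-down fusion of Eqs. (3)-(4). `features` maps scale index -> array
    of non-empty voxel features at that scale; `upsample` and
    `swformer_1layer` are assumed helpers, not the paper's exact modules."""
    fused = {5: features[5]}                               # P-hat_5 = P_5
    for i in range(4, -1, -1):                             # i = 4, 3, 2, 1, 0
        up = upsample(fused[i + 1], features[i])           # match P_i's sparsity
        concat = np.concatenate([features[i], up], axis=-1)
        fused[i] = swformer_1layer(concat)                 # P-hat_i
    return fused
```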

Fig. 4. Feature Fusion. Feature \(P_{i+1}\) is upsampled and concatenated with \(P_i\) to generate \(P'_i\) and the final \(P_i\). During upsampling, we only duplicate \(P_{i+1}\) features to locations that are non-empty in \(P_i\).

One challenge in sparse upsampling is that one cannot naively duplicate the feature to all upsampled locations (as is commonly done in dense upsampling), as this would cause excessive feature duplication and significantly reduce the sparsity. In this paper, we restrict features in \(P_{i+1}\) to only duplicate to locations that have non-empty features in \(P_i\), as shown in Fig. 4. In this way, we can ensure \(\hat{P}_i\) has the same sparsity as \(P_i\).
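One possible NumPy realization of this restricted duplication is sketched below; the coordinate-based lookup is an illustrative assumption, not our exact indexing scheme.

```python
import numpy as np

def sparse_upsample(parent_coords, parent_feats, child_coords, stride):
    """Duplicate each coarser-scale feature only to the non-empty finer-scale
    locations it covers, so the result has the same sparsity as the child.
    `parent_coords`: [M, 2], `parent_feats`: [M, C], `child_coords`: [N, 2]."""
    lut = {tuple(c): idx for idx, c in enumerate(parent_coords)}
    out = np.zeros((child_coords.shape[0], parent_feats.shape[-1]), np.float32)
    for j, cell in enumerate(child_coords // stride):      # parent cell of child
        idx = lut.get(tuple(cell))
        if idx is not None:       # defensive; every non-empty child has a parent
            out[j] = parent_feats[idx]
    return out
```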

3.4 Voxel Diffusion

To detect 3D objects from sparse voxel features, a unique challenge is that there might be no valid voxel feature near object centers, which are the best positions to place implicit [43] or explicit anchors [31]. Prior works have attempted to resolve this issue by: 1) second-stage box refinement [33], 2) sparse convolutions [37] or coordinate refinement [26] that can expand features to empty voxels close to the object centers, 3) scattering sparse voxel features to a dense grid and applying dense convolutions [10]. In this paper, we propose a novel voxel diffusion module to effectively and efficiently address this challenge.

Fig. 5. Voxel Diffusion. After foreground segmentation, each voxel receives a segmentation score \(s \in [0, 1]\). All voxels with scores greater than a threshold \(\gamma =0.05\) are scattered to a dense BEV grid, and then we apply a \(k\times k\) max pooling on the dense BEV grid to expand valid voxel features to their neighboring locations, where k is set to 5 in this example. (Left) before diffusion, there are only two foreground voxels with segmentation scores \(\{0.5, 0.9\}\) greater than \(\gamma \); (Right) after voxel diffusion, 47 voxels become valid. Best viewed in color.

Voxel diffusion is based on two simple ideas. First, we segment all foreground voxels by jointly performing foreground/background segmentation, thus effectively filtering out the majority of background voxels. Second, we expand all foreground voxels to their neighboring locations, with zero-initialized features at the newly added locations, using a simple \(k\times k\) max pooling operation on the dense BEV grid, where k is the detection-head-specific diffusion factor that controls the magnitude of expansion. The diffused voxel features are further connected and processed with a few Transformer layers. Combining these two ideas, we simultaneously keep the voxel features sparse (by filtering out background voxels) and ensure that voxels close to object centers have features (by voxel diffusion). Figure 5 illustrates an example of voxel diffusion.
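The following NumPy sketch shows one reading of the voxel diffusion described above (threshold by \(\gamma \), scatter to a dense BEV grid, and dilate the occupancy with a \(k\times k\) max pool); the data layout and function name are assumptions for illustration.

```python
import numpy as np

def voxel_diffusion(coords, feats, scores, grid_hw, k=5, gamma=0.05):
    """Keep voxels with foreground score > gamma, scatter them to a dense BEV
    grid, and dilate the occupancy with a k x k max pool so zero-initialized
    pseudo voxels appear around the foreground. `coords`: [N, 2] BEV voxel
    coordinates, `feats`: [N, C], `scores`: [N]."""
    H, W = grid_hw
    keep = scores > gamma
    occ = np.zeros((H, W), dtype=bool)
    dense = np.zeros((H, W, feats.shape[-1]), dtype=np.float32)
    occ[coords[keep, 0], coords[keep, 1]] = True
    dense[coords[keep, 0], coords[keep, 1]] = feats[keep]
    r = k // 2
    padded = np.pad(occ, r)
    diffused = np.zeros_like(occ)
    for dy in range(k):                      # k x k max pool on the occupancy
        for dx in range(k):
            diffused |= padded[dy:dy + H, dx:dx + W]
    ys, xs = np.nonzero(diffused)
    # Original foreground voxels keep their features; new locations are zeros.
    return np.stack([ys, xs], axis=1), dense[ys, xs]
```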

Our foreground segmentation is jointly trained with object detection. Specifically, for each voxel, we assign a binary groundtruth label: 0 (background, voxel does not overlap with any objects) and 1 (foreground, voxel overlaps with at least one object). The foreground segmentation is trained with a two-class focal loss [20] for each object class c:

$$\begin{aligned} L_\text {seg}^{c} = \frac{1}{N} \sum _i{L_i} \end{aligned}$$
(5)

where N is the total number of valid voxels and \(L_i\) is the focal loss for voxel i. At inference time, we keep voxels as foreground if their foreground scores are greater than a threshold \(\gamma \).

3.5 Box Regression

SWFormer follows [37] to use a modified CenterNet [12, 37, 43, 46] head to regress boxes from voxel features. The heatmap loss is computed as a penalty-reduced focal loss [20, 46] per object class.

$$\begin{aligned} \begin{aligned} L_{\text {hm}}^{c} = -\frac{1}{N}\sum _{i}\{ (1 - \tilde{h}_{i})^\alpha \log (\tilde{h}_{i})I_{h_{i} > 1 - \epsilon } + \\ (1-h_{i})^\beta \tilde{h}_{i}^\alpha \log (1-\tilde{h}_{i})I_{h_{i} \le 1 - \epsilon }\}, \end{aligned} \end{aligned}$$
(6)

where \(\tilde{h}_{i}\) and \(h_{i}\) are the predicted and ground truth heatmap values for object class c, respectively, at voxel i. N is the number of boxes in class c. We use \(\epsilon = 1e-3\), \(\alpha =2\) and \(\beta =4\) in all experiments, following [18, 37, 46]. SWFormer parameterizes 3D boxes as \(\boldsymbol{b} = \{d_x, d_y, d_z, l, w, h, \theta \}\), where \(d_x, d_y, d_z\) are the box center offsets relative to the voxel centers, and \(l, w, h, \theta \) are the box length, width, height, and heading. We follow [37] to apply a bin loss [35] to regress the heading \(\theta \), a smooth L1 loss to regress the other box parameters, and an IoU loss [45] to improve overall box accuracy, on the voxels with ground truth heatmap values above a threshold \(\delta _1\).

$$\begin{aligned} L_{\theta _i}^c&= L_{bin}(\theta _i, \tilde{\theta }_i), \end{aligned}$$
(7)
$$\begin{aligned} L_{\boldsymbol{b_{i}} \backslash \theta _{i}}^c&= \text {SmoothL1}(\boldsymbol{b_{i}} \backslash \theta _{i} - \boldsymbol{\tilde{b_{i}}} \backslash \tilde{\theta }_i),\end{aligned}$$
(8)
$$\begin{aligned} L_{\text {box}} ^ c&= \frac{1}{N} \sum _i {(L_{\theta _i} + L_{\boldsymbol{b_i} \backslash \theta _{i}} + L_{\text {iou}_{i}}) I_{h_i > \delta _1}}, \end{aligned}$$
(9)

where \(\tilde{b}_i\), \(b_i\) are the predicted and ground truth box parameters respectively, \(\tilde{\theta }_i\), \(\theta _i\) are the predicted and ground truth box heading respectively.
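For reference, a minimal NumPy sketch of the penalty-reduced heatmap focal loss in Eq. (6) is given below; it is an illustrative restatement of the formula, not our training code.

```python
import numpy as np

def heatmap_focal_loss(pred, gt, num_boxes, alpha=2.0, beta=4.0, eps=1e-3):
    """Penalty-reduced focal loss of Eq. (6) for one object class.
    `pred`, `gt`: flattened predicted / ground-truth heatmaps in [0, 1];
    `num_boxes`: N, the number of boxes of this class."""
    pred = np.clip(pred, 1e-6, 1.0 - 1e-6)                 # numerical safety
    pos = gt > 1.0 - eps                                   # near-center voxels
    pos_term = ((1.0 - pred) ** alpha) * np.log(pred) * pos
    neg_term = ((1.0 - gt) ** beta) * (pred ** alpha) * np.log(1.0 - pred) * (~pos)
    return -(pos_term.sum() + neg_term.sum()) / max(num_boxes, 1)
```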

The network is trained end-to-end with the total loss defined as

$$\begin{aligned} L = \sum _{c}(\lambda _1 L_{\text {seg}}^c + \lambda _2 L_{\text {hm}}^c + L_{\text {box}}^c) \end{aligned}$$
(10)

When decoding prediction boxes, we first filter out voxels with heatmap values less than a threshold \(\delta _{2}\), then run a max pool on the heatmap to select boxes corresponding to the local heatmap maxima, without any non-maximum-suppression.
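A minimal NumPy sketch of this NMS-free decoding step is shown below; the max-pool kernel size `pool_k` is an assumed hyperparameter not specified in the text.

```python
import numpy as np

def decode_boxes(heatmap, box_params, delta2=0.1, pool_k=3):
    """NMS-free decoding: drop voxels whose heatmap value is below delta2,
    keep only local maxima found with a max pool, and read out the regressed
    parameters there. `heatmap`: [H, W]; `box_params`: [H, W, 7] holding
    (dx, dy, dz, l, w, h, heading)."""
    H, W = heatmap.shape
    r = pool_k // 2
    padded = np.pad(heatmap, r, constant_values=-np.inf)
    pooled = np.full_like(heatmap, -np.inf)
    for dy in range(pool_k):                 # pool_k x pool_k max pool
        for dx in range(pool_k):
            pooled = np.maximum(pooled, padded[dy:dy + H, dx:dx + W])
    keep = (heatmap >= delta2) & (heatmap == pooled)       # local maxima only
    ys, xs = np.nonzero(keep)
    return np.stack([ys, xs], axis=1), heatmap[ys, xs], box_params[ys, xs]
```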

4 Experiments

We describe the SWFormer implementation details, and demonstrate its efficiency and accuracy in multiple experiments. Ablation studies are conducted to understand the importance of various design choices.

4.1 Waymo Open Dataset

Our experiments are primarily based on the challenging Waymo Open Dataset (WOD) [36], which has been adopted in many recent state of the art 3D detection methods [10, 28, 33, 37, 43]. The dataset contains 1150 scenes, split into 798 training, 202 validation, and 150 test scenes. Each scene has about 200 frames, where each frame captures the full 360\(^\circ \) around the ego-vehicle. The dataset has one long range LiDAR with range capped at 75 m, four near range LiDARs, and five cameras. SWFormer uses all five LiDARs in the experiments.

4.2 Implementation Details

We normalize intensity and elongation in the raw point cloud with the \(\text {tanh}\) function. The dynamic voxelization uses a 0.32 m voxel size in x, y and an infinite size in z. During training, we ignore all ground truth boxes with fewer than five points inside. The voxel feature embedding net has two layers of MLPs with a channel size of 128. All of the Transformer layers have a channel size of 128, 8 heads, and an inner MLP ratio of 2. We also use stochastic depth [15] with survival probability 0.6. The segmentation cutoff \(\gamma \) in Sect. 3.4 is set to 0.05. The heatmap thresholds \(\delta _{1}\) and \(\delta _{2}\) are set to 0.2 and 0.1, respectively, for both the vehicle and pedestrian heads. For training efficiency, we cap the number of regression targets in each frame at 1024 for vehicles and 800 for pedestrians, sorted by ground truth heatmap values. \(\lambda _1\), \(\lambda _2\) are set to 200 and 10 in Eq. 10.

Data Augmentation. We adopt several popular 3D data augmentation techniques described in [5] during training: randomly rotating the world by a yaw uniformly chosen from \([-\pi , \pi ]\) with probability 0.74, randomly flipping the world along the y-axis with probability 0.5, randomly scaling the world with a scaling factor uniformly chosen within [0.95, 1.05), and randomly dropping points with probability 0.05.
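A simple NumPy sketch of these augmentations on raw xyz points is given below; a full implementation would apply the same transforms to the box labels, and the interpretation of the y-axis flip (negating the y coordinate) is an assumption.

```python
import numpy as np

def augment_frame(points, rng):
    """Apply the listed augmentations to an [N, 3] array of xyz points."""
    if rng.random() < 0.74:                  # random global yaw in [-pi, pi]
        yaw = rng.uniform(-np.pi, np.pi)
        c, s = np.cos(yaw), np.sin(yaw)
        rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        points = points @ rot.T
    if rng.random() < 0.5:                   # flip along the y-axis
        points = points * np.array([1.0, -1.0, 1.0])
    points = points * rng.uniform(0.95, 1.05)             # global scaling
    points = points[rng.random(points.shape[0]) >= 0.05]  # drop points
    return points

# Example: augmented = augment_frame(points, np.random.default_rng(0))
```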

Training and Inference. The SWFormer models are trained end-to-end on 32 TPUv3 cores using the Adam optimizer [16] for a total of 128 epochs, with the initial learning rate set to 1e−3. We apply cosine learning rate decay and an 8-epoch warmup with the initial warmup learning rate set to 5e−4.
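A sketch of this learning rate schedule is shown below, assuming a linear warmup (the exact warmup shape is not specified above).

```python
import numpy as np

def learning_rate(step, total_steps, warmup_steps, base_lr=1e-3, warmup_lr=5e-4):
    """Cosine decay with an (assumed linear) warmup from warmup_lr to base_lr."""
    if step < warmup_steps:
        return warmup_lr + (base_lr - warmup_lr) * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * base_lr * (1.0 + np.cos(np.pi * progress))
```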

4.3 Main Results

We measure the detection results using the official WOD detection metrics: BEV and 3D average precision (AP), and heading-error-weighted BEV and 3D average precision (APH), at the L1 (easy) and L2 (hard) difficulty levels [36]. The official metric used to rank the leaderboard uses an IoU cutoff of 0.7 for vehicles and 0.5 for pedestrians. We report additional AP results at an IoU of 0.8 for vehicles and 0.6 for pedestrians. Results for large vehicles with a maximum dimension greater than 7 m are also reported. Table 1 reports the main results on the validation set, Table 2 reports additional results for high IoU and large vehicles on the validation set, and Table 3 shows the test set results obtained by submitting our predictions to the official test server. Results from methods with test time augmentation or ensembles are not included.

As shown in Table 1, SWFormer achieves new state-of-the-art results for vehicle detection on the WOD validation set: it is 1.5 APH/L2 higher than the prior best single-stage model RSN [37]. SWFormer even outperforms the prior best performing two-stage method PVRCNN++ [34] by 0.42 APH/L2. Importantly, SWFormer performs very well at detecting large vehicles, 6.35 AP/L2 higher than the prior art RSN [37], as shown in Table 2. SWFormer slightly outperforms the state of the art single-stage method SST_3f [10] by 0.12 APH/L2. Notably, the single-frame single-stage SWFormer_1f also outperforms all prior single-frame methods.

We compiled the model with XLA [32] and ran inference on the 15th frame of scene 8907419590259234067_1960_000_1980_000, which has 68 vehicles and 69 pedestrians, on an Nvidia T4 GPU with fused Transformer kernels. The latency is 20 ms, more efficient than the popular realtime detector PointPillars [17], which takes about 100 ms on the same GPU with our own implementation.

Table 1. WOD validation set results. † is from [37]. Top methods are highlighted. Top one-frame (cyan), single-stage (blue) are colored. TS: two-stage. BEV: BEV L1 AP.

Table 3 compares vehicle and pedestrian detection results with published results on the WOD test set, and shows that SWFormer outperforms all previous single-stage or two-stage methods on the official ranking metric mAPH/L2.

Table 2. Additional WOD validation set results. Top methods are highlighted.
Table 3. WOD test set results. † is from [37]. Top methods are highlighted. mAPH/L2 is the official ranking metric on the WOD leaderboard. TS is short for two-stage.
Table 4. Impact of Voxel Diffusion. Compared to the baseline (window size = 1), our voxel diffusion improves accuracy, especially with large diffusion window size.
Table 5. Impact of Multi-Scale and Window Shifting. Compared to a single scale, multi-scale features have much better accuracy. Window shifting is also important for performance.

4.4 Ablation Study

Voxel diffusion is one of the primary contributions of this paper. We study its impact by varying the diffusion window size k introduced in Sect. 3.4. The results in Table 4 show the significance of voxel diffusion. Disabling voxel diffusion (i.e. setting \(k=1\)) results in 3D AP drops of 6.37 and 3.22 compared with \(k=9\) on vehicle and pedestrian detection, respectively. Increasing k can slightly improve the detection accuracy, especially on vehicles.

Multi-scale features improve the model accuracy as shown in Table 5, especially when going from one scale to two scales. The impact is larger on vehicle detection (+2.72 3D AP) than on pedestrian detection (+1.15 3D AP). The 3-scale model has accuracy close to that of the full 5-scale model. In practice, we can trade off accuracy and latency by adjusting the number of scales. Note that some autonomous driving tasks, such as lane detection and behavior prediction, require a larger receptive field. The success of training a deep five-scale SWFormer model shows its potential for those tasks.

Window shifting is introduced in Swin Transformer [21] to connect the features among windows. We have limited its usage to once per scale. What happens if we completely remove it? Table 5 shows a clear accuracy drop, especially on vehicles, when the window-shift operations are removed from the SWFormer blocks. This matches our intuition that it is important to keep one window-shift operation per scale to make sure every voxel gets a similar receptive field in all directions.

5 Conclusion

This paper presents SWFormer, a scalable and accurate sparse window Transformer-only model, to effectively learn 3D point cloud representations for object detection. Built upon window-based Transformers, it addresses the unique challenges brought by sparse 3D point clouds with a bucketing-based multi-scale Transformer neural network. SWFormer takes full advantage of the sparsity of point clouds, and can effectively process sparse windows of point clouds using pure Transformer layers without any convolutions. It also proposes a novel voxel diffusion module to better detect 3D objects from sparse features. Experiments show state-of-the-art results on the challenging Waymo Open Dataset.