
1 Introduction

Image classification [12], object detection [16, 24], instance segmentation [8, 24] and human pose estimation [1, 24] are vital visual perception tasks in computer vision. The vision community has rapidly improved results by developing robust feature representations. Beyond the development of powerful backbones, these advances are in large part inseparable from task-aware head designs, such as TSD [36], CondInst [40] and CPN [7], or elaborately constructed frameworks, e.g., one-stage detectors [23, 26, 41] and two-stage detectors [4, 34]. These methods are conceptually specialized and introduce task exclusivity, e.g., TSD [36], developed for object detection, cannot be migrated to pose estimation. Our goal in this work is to develop a comparably generalized feature representation learning method with a task-agnostic structure design for unifying visual perception.

Fig. 1. (a) Illustration of the typical pipelines for different visual tasks. Different sub-tasks require different prediction targets and different feature structures. (b) Illustration of the UniHead design. Given a fixed spatial coordinate, UniHead adaptively scatters it to different spatial points and reasons about their relations by transformer encoders. It directly outputs a set of predictions in the form of multiple points to perform different visual tasks.

The main barriers behind this are: 1) As shown in Fig. 1(a), the different prediction targets force visual perception into different sub-tasks, i.e., a class label for image classification, a bounding box for object detection, a pixel-wise mask for instance segmentation, and a group of landmarks for pose estimation. 2) It is unclear how to construct a task-agnostic head module that can generalize to all sub-tasks and frameworks while achieving good results. Given this, one might expect that a complex head design is required to overcome these barriers. However, we show that a surprisingly simple, flexible, and universal head module can easily generalize to different visual tasks or frameworks and surpass prior expert models in each individual task.

Our method, called UniHead, can be directly migrated to various visual frameworks, e.g., Faster RCNN [34], FCOS [41] and ATSS [50], by formulating the prediction targets as dispersible points learning. As shown in Fig. 1(b), UniHead is built upon any network backbone, and the prediction targets for different tasks are achieved by a basic yet effective point estimation. Given a fixed spatial coordinate, UniHead adaptively scatters it to different spatial points and reasons about their relations by several stacked transformer encoders. It directly outputs the final set of predictions in the form of multiple points, which is robust to the geometric variations an object can exhibit, including scale, deformation, and orientation. For image classification, the points directly predict the object class. For object detection, the points are placed along the four edges of a bounding box. For instance segmentation, the points are evenly distributed along the instance mask contour. For pose estimation, the positions of the points conform to the pose distribution of the training data.

Furthermore, we found it essential to adapt the initial positions of the points according to the different prediction targets. This effectively alleviates the difficulty of optimization under the requirement of fitting objects with different scales and orientations. Additionally, UniHead adds only a small computational overhead, enabling a universal system and rapid experimentation.

Without bells and whistles, UniHead can be equipped with popular backbones for different visual tasks, such as ResNet [18], ResNeXt [46], Swin Transformer [27], etc. It excels on ImageNet [12] classification and all three tracks of the COCO [24] suite of challenges, including object detection, instance segmentation, and human pose estimation. We conduct extensive experiments to showcase the generality of our UniHead. By viewing each task as dispersible points learning via the transformer encoder architecture, UniHead performs comparably without any special design for specific tasks. UniHead can therefore be seen more broadly as a universal head module for visual perception and easily migrated to more complex tasks.

To summarize, our contributions are as follows:

1) We develop a comparably generalized dispersible points learning method for unifying visual perception. We hope our work can inspire the vision community to explore a unified vision framework.

2) We introduce the transformer encoder to reason about the relations of dispersible points, and an adaptive point initialization to handle the geometric variations an object can exhibit, including scale, deformation, and orientation.

3) Detailed experiments on ImageNet [12] and MS-COCO [24] datasets show that UniHead can easily generalize to different tasks while obtaining comparable performance compared to the expert models developed in individual tasks.

2 Related Work

Image classification [12], object detection [16, 24], instance segmentation [8, 24] and pose estimation [1, 24] are four popular tasks in computer vision. They all benefit greatly from the development of deep neural networks [18, 37]. Among them, image classification [21] was the first to which CNNs were applied, improving performance by a considerable margin. Since then, researchers have devoted themselves to designing powerful backbones [18, 19, 46], which also boost other instance-level tasks, such as object detection [23, 34] and human pose estimation [37].

Object detection requires bounding-box-level location and category information for the instances of interest in an image. The methods can be roughly categorized into three types: two-stage, one-stage and DETR detectors. Two-stage methods detect a series of region proposals first and refine them in the second stage. Faster RCNN [34] is a popular pipeline of this type, which also includes R-FCN [9], Cascade RCNN [4], Grid RCNN [29], etc. One-stage methods predict locations and class scores on a large number of pre-defined spatial candidates. They can be further divided into two types: anchor-based and anchor-free detectors. Anchor-based methods use anchor boxes as an initial set, such as SSD [26] and RetinaNet [23]. Among anchor-free methods, some make dense predictions on spatial points, such as CenterNet (objects as points) [51], FCOS [41] and RepPoints [47], while others obtain a keypoint heatmap first and get objects by grouping keypoints; CornerNet [22], ExtremeNet [52] and CenterNet (keypoint triplets) [14] fall into this category. DETR methods, such as DETR [5], Deformable DETR [53] and Conditional DETR [30], detect objects by decoding a pre-defined set of object queries with transformers. These queries are optimized one-to-one with the ground truths, so there is no need for NMS as post-processing. Such one-to-one label assignment also inspires other works like Sparse RCNN [38].

Instance segmentation requires mask and class information for instances. The methods can be categorized into two types: mask-based and contour-based. Mask-based methods predict binary masks directly and can be further divided into local-mask and global-mask methods. Most local-mask methods include two stages: the first for instance detection and the second for instance mask generation, such as Mask RCNN [17], PANet [25] and PointRend [20]. Global-mask methods usually predict the mask for the whole image and leverage dynamic mask filters to decode masks for different instances, such as YOLACT [3] and CondInst [40]. Contour-based methods obtain instance masks by predicting object boundaries. PolarMask [45] and DeepSnake [31] are two typical works using this idea.

Human pose estimation requires the keypoint locations (e.g. nose, eyes, knees) of multiple humans in an image. There are mainly two kinds of approaches: heatmap-based and regression-based. Heatmap-based methods use a multi-class classifier to generate keypoint heatmaps and compose them with clustering and grouping procedures, such as CPN [7], HRNet [37] and DARK [49]. Regression-based methods, including Integral [39] and CenterNet [51], predict the coordinates of keypoints directly and are simpler to plug into existing end-to-end learning frameworks.

Mask R-CNN [17], PointSetNet [44] and LSNet [15] have merged object detection, instance segmentation and pose estimation into one network. Beyond these tasks, UniHead can also be extended to image classification. Furthermore, UniHead can be simply embedded in various types of architectures, e.g., anchor-free, anchor-based, and two-stage detectors, showing a powerful ability to generalize across tasks and frameworks.

Fig. 2. A typical pipeline of UniHead. At first, most methods for location-sensitive tasks contain a backbone and a feature pyramid (not used in the image classification task) to extract feature maps. Then, for an anchor point, UniHead obtains multiple points via dispersible points learning. To generate point representations, bilinear interpolation is performed on the feature map according to the point coordinates, denoted by the dotted line. The obtained features are concatenated with extra learnable tokens if necessary and sent to the corresponding transformer encoders to complete various visual tasks.

3 Method

In this paper, we introduce the UniHead, a generalized visual head. It can be applied to different detection frameworks, such as Faster RCNN [34], FCOS [41] and ATSS [50], as well as different tasks including classification, object detection, instance segmentation and pose estimation. In this section, we first describe the design principle of UniHead and then detail the adaptation to different visual tasks and different visual frameworks. Finally, we delve into the inherent advantage of UniHead over other methods.

3.1 UniHead

In UniHead, given a fixed spatial coordinate (\(\mathcal {A}_x, \mathcal {A}_y\)) (referred to as the anchor point), i.e., the center point of a proposal or a point in the feature map, UniHead adaptively scatters it to different spatial points and reasons about their relations by several stacked transformer encoders. As shown in Fig. 2, UniHead adopts a sequential three-stage procedure to obtain the scattered point representations. In the first stage, it generates the anchor representation \(\mathcal {F}_{x,y}\) according to the anchor coordinate or region proposal. For one-stage or anchor-free detectors, it is the feature vector at the corresponding coordinate of the feature map. For two-stage detectors, the feature generated by RoI Pooling [34] is used. In the second stage, K scattered points are generated by:

$$\begin{aligned} \begin{aligned} P_{x_i}&= \mathcal {A}_x + s_x \cdot \varDelta x_i\\ P_{y_i}&= \mathcal {A}_y + s_y \cdot \varDelta y_i, \end{aligned} \end{aligned}$$
(1)

where \((\varDelta x_i, \varDelta y_i) = f(\mathcal {F}_{x,y}; w_i)\), f is a simple multi-layer perceptron and \(w_i\) are its learnable parameters. \((s_x, s_y)\) are scale factors that modulate the magnitude of \((\varDelta x_i, \varDelta y_i)\). Specifically, \((s_x, s_y)\) is the width and height of the region proposal in a two-stage detector, the anchor scale in an anchor-based detector, and the feature map stride in an anchor-free detector. In the final stage, instead of quantizing the floating-point coordinates \((P_{x_i}, P_{y_i})\), we perform bilinear interpolation to generate the point representations \(\mathcal {F}_{x_i,y_i}, i\in [1,K]\).
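To make the procedure concrete, a minimal PyTorch-style sketch of Eq. 1 and the bilinear sampling step is given below; the module and helper names (`DispersiblePoints`, `bilinear_sample`, `offset_mlp`) are illustrative assumptions rather than the authors' released code, and the anchor and scale inputs are assumed to be given in feature-map pixel coordinates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DispersiblePoints(nn.Module):
    """Sketch of the point-scattering step (Eq. 1); names are illustrative, not the authors' code."""
    def __init__(self, channels, num_points=16):
        super().__init__()
        self.num_points = num_points
        # f(.; w_i): a simple MLP predicting (dx_i, dy_i) for each of the K points
        self.offset_mlp = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, 2 * num_points))

    def forward(self, feat, anchor_xy, scale_xy):
        # feat:      (N, C, H, W) feature map
        # anchor_xy: (N, 2) anchor coordinates (A_x, A_y) in feature-map pixels
        # scale_xy:  (N, 2) modulating scale (s_x, s_y)
        N = feat.size(0)
        # anchor representation F_{x,y}: sample the feature at the anchor point
        anchor_feat = bilinear_sample(feat, anchor_xy.unsqueeze(1)).squeeze(1)   # (N, C)
        offsets = self.offset_mlp(anchor_feat).view(N, self.num_points, 2)       # (dx_i, dy_i)
        points = anchor_xy.unsqueeze(1) + scale_xy.unsqueeze(1) * offsets        # Eq. 1, (N, K, 2)
        point_feats = bilinear_sample(feat, points)                              # (N, K, C)
        return points, point_feats

def bilinear_sample(feat, points):
    """Bilinear interpolation of per-point features from a (N, C, H, W) map."""
    N, C, H, W = feat.shape
    # normalize pixel coordinates to [-1, 1] as required by grid_sample
    norm = points.new_tensor([W - 1, H - 1])
    grid = (points / norm * 2 - 1).unsqueeze(2)                  # (N, K, 1, 2)
    sampled = F.grid_sample(feat, grid, align_corners=True)      # (N, C, K, 1)
    return sampled.squeeze(-1).permute(0, 2, 1)                  # (N, K, C)
```

In a dense one-stage setting the offset MLP would typically be realized as a 1\(\times \)1 convolution applied to the whole feature map (cf. Sect. 3.3).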

To better reason about the relations of these scattered point representations and generate more informative features, we introduce the transformer operator to capture the correlative dependencies among them. To improve robustness across different visual tasks, we insert a task-aware token embedding:

$$\begin{aligned} \begin{aligned} z_0 = [\mathbf{T_{task}}; \mathcal {F}_{x_1,y_1}; \mathcal {F}_{x_2,y_2}; \dots ; \mathcal {F}_{x_K,y_K}], \end{aligned} \end{aligned}$$
(2)

where \(\mathbf {T_{task}}\) can be \(\mathbf {T_{class}}\), \(\mathbf {T_{IoU}}\), and \(\mathbf {T_{visibility}}\) for image classification, object detection and pose estimation, respectively. The computation in transformer encoders for point representations can be formulated as:

$$\begin{aligned} \begin{aligned}&z^{'}_{l} = \textrm{MHSA}(\textrm{LN}(z_{l-1})) + z_{l-1}, \qquad l = 1\dots L, \\&z_l = \textrm{MLP}(\textrm{LN}(z^{'}_{l})) + z^{'}_{l}, \qquad l = 1\dots L, \\&[\mathbf{T^{'}_{task}}; \mathcal {F}^{'}_{x_1,y_1}; \mathcal {F}^{'}_{x_2,y_2}; \dots ; \mathcal {F}^{'}_{x_K,y_K}]=z_{L}, \end{aligned} \end{aligned}$$
(3)

where \(\textrm{MHSA}\) denotes multi-head self-attention [43], \(\textrm{LN}\) denotes layer normalization [2], and \(\textrm{MLP}\) is a multi-layer perceptron. During training, we use L transformer encoders, and the final output \(z_{L}\) is adapted to different visual tasks to perform the task-aware prediction.
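The relation reasoning of Eq. 2 and Eq. 3 can be sketched with standard pre-LN transformer encoder blocks, as below; `EncoderBlock`, `relation_reasoning`, and the use of `nn.MultiheadAttention` are illustrative assumptions and may differ from the authors' exact implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-LN transformer encoder block, as in Eq. 3 (illustrative sketch)."""
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # MHSA + residual
        z = z + self.mlp(self.ln2(z))                      # MLP + residual
        return z

def relation_reasoning(point_feats, task_token, blocks):
    # point_feats: (N, K, C) from bilinear sampling; task_token: (1, 1, C) learnable
    z = torch.cat([task_token.expand(point_feats.size(0), -1, -1), point_feats], dim=1)  # Eq. 2
    for blk in blocks:          # L stacked encoder blocks
        z = blk(z)
    return z[:, 0], z[:, 1:]    # T'_task and the refined point features F'_{x_i, y_i}
```

With `batch_first=True` the token layout is (N, K+1, C), so the first output row corresponds to the refined task token \(\mathbf {T^{'}_{task}}\).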

3.2 Adaptation to Different Visual Tasks

Image Classification. For image classification, we directly use the final feature map to perform dispersible points learning. The anchor point is set as the center of the input image and the corresponding scale is the input image size. We align the classifier setting with standard vision transformers, i.e., only the classification token, rather than all tokens, is fed to the classifier. The training can be formulated as:

$$\begin{aligned} \mathcal {L}_\textrm{cls} = \textrm{CrossEntropy}(\textrm{softmax}(\textrm{MLP}(\mathbf {T^{'}_{cls}})),y). \end{aligned}$$
(4)

In the above, \(y\) specifies the ground-truth class and \(\textrm{MLP}\) is a single fully-connected layer predicting the model's probability for the class with label \(y\).
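As a small, self-contained illustration of Eq. 4 under assumed dimensions (all names and sizes below are placeholders):

```python
import torch
import torch.nn as nn

# Illustrative only: a single FC layer on the refined classification token (Eq. 4)
embed_dim, num_classes = 256, 1000                   # placeholder dimensions
classifier = nn.Linear(embed_dim, num_classes)       # the "MLP" in Eq. 4
criterion = nn.CrossEntropyLoss()                    # log-softmax + cross entropy in one call

cls_token = torch.randn(8, embed_dim)                # T'_cls for a dummy batch of 8
labels = torch.randint(0, num_classes, (8,))         # dummy ground-truth class indices
loss_cls = criterion(classifier(cls_token), labels)
```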

Object Detection. UniHead can be applied to a variety of detectors, such as Faster R-CNN [34], FCOS [41], etc., without changing the backbone network structure or the manner of label assignment. Specifically, we concatenate a learnable token \(\mathbf {T_{IoU}}\) as a replacement for the IoU branch. After passing through all transformer blocks, \(\mathbf {T^{'}_{IoU}}\) is used to predict the IoU, which is multiplied by the class prediction to obtain the final scores at inference time. The \(\mathcal {F}^{'}_{x_i,y_i}\) is used to predict the offset for point \((P_{x_i}, P_{y_i})\):

$$\begin{aligned} (P^{'}_{x_i}, P^{'}_{y_i}) = (P_{x_i}, P_{y_i}) + \textrm{MLP}(\mathcal {F}^{'}_{x_i,y_i}) \odot (s_x, s_y) , \end{aligned}$$
(5)

where \(\odot \) denotes element-wise multiplication, and the \(\textrm{MLP}\) is a single fully-connected layer shared between different points. The predicted bounding box can be computed by \(B^{'}\) = \((\textrm{min}\{P^{'}_{x_i}\}, \textrm{min}\{P^{'}_{y_i}\}, \textrm{max}\{P^{'}_{x_i}\}, \textrm{max}\{P^{'}_{y_i}\})\), \(i \in [1,K]\).
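A sketch of the point refinement in Eq. 5 and the min/max box decoding, assuming the points and point features produced by the preceding stages; `decode_boxes` and `offset_fc` are hypothetical names.

```python
import torch
import torch.nn as nn

def decode_boxes(points, point_feats, scale_xy, offset_fc: nn.Linear):
    """Refine the K points (Eq. 5) and take their extremes as the box (illustrative sketch).
    points: (N, K, 2), point_feats: (N, K, C), scale_xy: (N, 2),
    offset_fc: a single FC layer (C -> 2) shared across the K points."""
    refined = points + offset_fc(point_feats) * scale_xy.unsqueeze(1)            # Eq. 5
    x, y = refined[..., 0], refined[..., 1]
    boxes = torch.stack([x.min(dim=1).values, y.min(dim=1).values,
                         x.max(dim=1).values, y.max(dim=1).values], dim=1)       # (N, 4): x1, y1, x2, y2
    return refined, boxes
```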

The classification branch follows the same computation as UniHead for image classification. The regression branch shares \(z_0\) with the classification branch to reduce the computational cost of point representation generation. Our loss function for detection is defined as:

$$\begin{aligned} \mathcal {L}_{loc} = \frac{1}{n}\sum _{j=1}^n L_1(B^{'}_j, B_j), \end{aligned}$$
(6)

where j is the index of positive samples, \(B^{'}_j\) is the predicted box and \(B_j\) is the ground truth. Other kinds of detection loss can also be used, e.g., GIoU loss [35].

Instance Segmentation. For instance segmentation, we view the task as contour-based regression. UniHead is placed at the output of the backbone to generate the points \(P^{'}_{x_i,y_i}\) by Eq. 1, Eq. 2, Eq. 3 and Eq. 5. To align the number of scattered points with the number of contour points in the training data, we uniformly add new points, or delete the points with the shortest edge, until the target number is met, which is similar to Deep Snake [31]. All ground-truth points are arranged clockwise along the contour line. The scattered points \(\{P^{'}_{x_i,y_i}, i\in [1,K]\}\) are then matched one-to-one with them, uniformly and in clockwise order.
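For illustration, one simple way to obtain exactly K clockwise ground-truth contour points is uniform arc-length resampling, sketched below; this is a stand-in for the add/delete scheme described above, and `resample_contour` is an illustrative name.

```python
import numpy as np

def resample_contour(contour, k):
    """Resample a closed, clockwise ground-truth contour to exactly k points by arc length.
    (A simpler stand-in for the add/delete scheme described in the text.)"""
    # contour: (M, 2) array of clockwise (x, y) vertices
    closed = np.vstack([contour, contour[:1]])                  # close the polygon
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)       # edge lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])               # cumulative arc length
    targets = np.linspace(0.0, cum[-1], num=k, endpoint=False)  # k evenly spaced positions
    resampled = np.empty((k, 2))
    for i, t in enumerate(targets):
        j = np.searchsorted(cum, t, side='right') - 1           # edge containing position t
        r = (t - cum[j]) / max(seg[j], 1e-8)                    # fraction along edge j
        resampled[i] = (1 - r) * closed[j] + r * closed[j + 1]
    return resampled
```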

Besides, some objects are split into several components due to occlusions. To overcome this problem, we simply follow PolarMask [45] and directly treat them as multiple objects. During training, we use \(L_1\) loss to optimize each point:

$$\begin{aligned} \mathcal {L}_{seg} = \frac{1}{n}\sum _{i=1}^n L_1(P^{'}_{x_i,y_i}, P_{x_i,y_i}), \end{aligned}$$
(7)

where \(P^{'}_{x_i,y_i}\) is the predicted point and \(P_{x_i,y_i}\) is the corresponding ground truth.

Pose Estimation. The overall design for pose estimation is consistent with instance segmentation, except that an extra token \(\mathbf {T_{visibility}}\) is introduced to predict the visibility of keypoints. The number K of predicted points is aligned with the keypoint number in the dataset. For pose estimation, each keypoint has a clear definition, like nose, eyes, etc., which makes it possible to build a one-to-one correspondence with the dispersible points. \(L_1\) loss, the same as Eq. 7, is adopted to train the keypoint localization branch. For the training of keypoint visibility prediction, we use the standard binary cross entropy loss.
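A hedged sketch of the two pose-estimation losses (keypoint \(L_1\) regression as in Eq. 7 plus binary cross entropy for visibility); the shapes and the mapping of the visibility token to 17 logits through an FC layer are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pose_losses(pred_kpts, gt_kpts, vis_logits, gt_vis):
    """Sketch of the pose losses; names and shapes are illustrative, not the authors' code.
    pred_kpts, gt_kpts: (N, 17, 2) predicted / ground-truth keypoint coordinates
    vis_logits: (N, 17), e.g. an FC layer applied to the refined T'_visibility token
    gt_vis: (N, 17) binary visibility labels"""
    loss_kpt = F.l1_loss(pred_kpts, gt_kpts)                                    # as in Eq. 7
    loss_vis = F.binary_cross_entropy_with_logits(vis_logits, gt_vis.float())   # visibility BCE
    return loss_kpt, loss_vis
```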

Fig. 3. Ways of point initialization for different tasks. From left to right: image classification, object detection, instance segmentation, pose estimation.

3.3 Adaptation to Different Visual Frameworks

Two-stage Framework. UniHead is applied to region proposals in the two-stage framework. Each proposal is represented as a combination of an anchor point (\(\mathcal {A}_x, \mathcal {A}_y\)) and its scale (\(s_x, s_y\)). The offsets \((\varDelta x_i, \varDelta y_i)\) are generated from the proposal feature extracted with RoI Pooling or RoI Align. Without other modifications, UniHead can now be directly used in a two-stage framework.

One-stage Framework. UniHead is applied on dense spatial points in the one-stage framework. For anchor-free methods, (\(\mathcal {A}_x, \mathcal {A}_y\)) and (\(s_x, s_y\)) are a point on the feature map and its stride. For anchor-based methods, (\(\mathcal {A}_x, \mathcal {A}_y\)) and (\(s_x, s_y\)) are the center point and the scale of an anchor. The offsets \((\varDelta x_i, \varDelta y_i)\) are generated using a 1\(\times \)1 convolutional layer.

3.4 UniHead Initialization

To effectively alleviate the difficulty of optimization under the requirement of fitting objects with different scales and orientations, the points are initialized in a task-appropriate way, which is illustrated in Fig. 3. For image classification, points are simply scattered around the anchor point. For object detection, points are divided into four groups placed at the bottom, top, left, and right of the anchor point, respectively. For instance segmentation, we first set a 2D reference vector that starts from the anchor point. Based on the direction of this vector, the points are uniformly initialized in clockwise order on the edge of a pseudo box generated from the anchor point and its spatial scale. For pose estimation, we calculate the average positions of the different keypoints in the training dataset and use them to initialize the points.

The initial point positions are controlled by tuning the bias of the last fully-connected layer of the \(\textrm{MLP}\) used for offset generation. Taking object detection as an example, the biases for the points at the left, right, top and bottom are set to \([-0.5, 0]\), [0.5, 0], \([0, -0.5]\) and [0, 0.5], respectively.
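Assuming the offset MLP ends in a single fully-connected layer that outputs \((\varDelta x_i, \varDelta y_i)\) for all K points, the detection bias initialization described above could look like the following sketch; the helper name and the group ordering are illustrative.

```python
import torch
import torch.nn as nn

def init_detection_bias(offset_fc: nn.Linear, num_points: int = 16):
    """Set the bias of the offset MLP's last FC layer so that the K points start in four
    groups (left, right, top, bottom of the anchor). Illustrative sketch only."""
    # offset_fc outputs (dx_1, dy_1, ..., dx_K, dy_K), so the bias has 2 * num_points entries;
    # assumes num_points is divisible by 4
    g = num_points // 4
    pattern = ([[-0.5, 0.0]] * g + [[0.5, 0.0]] * g +    # left, right groups
               [[0.0, -0.5]] * g + [[0.0, 0.5]] * g)     # top, bottom groups
    with torch.no_grad():
        offset_fc.bias.copy_(torch.tensor(pattern).flatten())
```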

Table 1. Ablation study on extra blocks for the image classification task.
Table 2. Ablation study on \(\mathbf {T_{task}}\). "Det." and "Keyp." mean detection and pose estimation, respectively.

4 Experiments

For image classification, experiments are conducted on the ILSVRC-2012 ImageNet [12] dataset with 1K classes and 1.3M images. We use Top-1 accuracy as the metric in classification experiments.

We also conduct experiments with different backbones on the MS-COCO 2017 [24] dataset, covering object detection, instance segmentation, and human pose estimation. For these tasks, training is performed on the train set (over 57K images for human pose estimation and over 118K images for object detection and instance segmentation). For ablation studies, evaluation is conducted on the val set. We also report performance on the test-dev set to compare with state-of-the-art methods. The mean average precision (AP) is used as the metric in the COCO experiments, but its definition varies with the task: for object detection and instance segmentation, AP is calculated under different IoU thresholds (bounding box IoU or mask IoU), whereas for human pose estimation, AP is calculated with object keypoint similarity (OKS).

4.1 Implementation Details

In the image classification task, all models are trained using the AdamW optimizer [28] with 1e-4 initial learning rate, 0.05 weight decay, \(\beta _1=0.9\), \(\beta _2=0.999\) and a batch size of 1024. We train classification models for 300 epochs and use a cosine annealing scheduler to decrease the learning rate. The data augmentations in [42] are also used, e.g., mixup, label smoothing, etc.
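For reference, a minimal sketch of the stated classification recipe (AdamW, cosine annealing, 300 epochs); `model` and the commented training call are placeholders, not part of any released code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 10)                      # stand-in for the real network
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.05, betas=(0.9, 0.999))
scheduler = CosineAnnealingLR(optimizer, T_max=300)  # cosine annealing over 300 epochs

for epoch in range(300):
    # train_one_epoch(model, optimizer)              # batch size 1024, mixup, label smoothing, ...
    scheduler.step()
```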

For the other three tasks, we use different backbones, including ResNet [18], ResNeXt [46] and Swin Transformer [27], with weights pretrained on ImageNet [12]. For object detection, we apply our UniHead to different detection pipelines and follow their original hyper-parameters. For instance segmentation and pose estimation, the same settings as Faster RCNN [34] are used. During training, we adopt AdamW [28] as the optimizer, with 1e-4 initial learning rate, 0.05 weight decay, \(\beta _1=0.9\) and \(\beta _2=0.999\). In our \(1\times \) setting, we train our model with mini-batch size 16 for 13 epochs and decrease the learning rate by a factor of 10 at epochs 9 and 12. Unless specified, the input scale of images is [800, 1333] and no data augmentation except horizontal flipping is used in training. The hyper-parameters of the newly-added transformer encoders are kept the same as in [13].

Table 3. Ablation study on UniHead bias initialization strategy.
Table 4. Ablation study on point number. Point number 8, 16, 24, 32 are tried.

4.2 Ablation Studies

In this section, we conduct extensive ablation studies on ImageNet and the COCO val set to validate the effectiveness of UniHead on classification and localization tasks, respectively. Specifically, for the localization task, we choose object detection, and all models are trained on the Faster RCNN [34] baseline with the AdamW optimizer [28] and a ResNet-50 backbone for fair comparison. We find that AdamW stably improves performance by \(\sim 1\)% AP compared to SGD.

Extra Blocks. We add extra blocks to the classification backbone networks to align their FLOPs with UniHead. Specifically, we append two bottlenecks to ResNet-50 ([3,4,6,5] for the four stages) and two transformer blocks to Swin-T ([2,2,6,4] for the four stages); the results are shown in Table 1. Though additional layers can boost the performance, UniHead achieves better performance with similar FLOPs. We also conduct the same experiment on Swin-B. When the model becomes bigger with higher FLOPs, extra blocks can hardly bring improvement, but UniHead still achieves a continual performance boost. All these results show that the improvement brought by UniHead is not simply due to its extra transformer blocks.

Task Token. We also explore the influence of \(\mathbf {T_{IoU}}\) and \(\mathbf {T_{visibility}}\) on object detection and pose estimation, respectively. As shown in Table 2, the introduction of \(\mathbf {T_{task}}\) brings a slight improvement on both tasks, proving the effectiveness of the task tokens. It is worth noting that although visibility prediction is not used in the pose estimation evaluation, \(\textbf{T}_{visibility}\) still has a positive impact on training.

Table 5. Ablation study on block number. \(L_{cls}\) and \(L_{loc}\) denote transformer encoder block number of classification and localization, respectively. #params means parameters of the detection head. The training and inference time is measured on a 16GB V100 GPU.

UniHead Initialization. We replace our task-specific bias initialization with zero initialization on different tasks. The main results are shown in Table 3. They show that a proper initialization helps the unified architecture learn the knowledge of different tasks more quickly.

Point Number. We evaluate the performance of different point numbers in UniHead, which is shown in Table 4. It shows that our head can benefit from the increasing number of points. But more points may bring overfitting and more computational cost. So we choose to use \(K=16\) in our implementations.

Table 6. Ablation study on different modules. IoU prediction is not used in this table. "HD", "MHSA" and "DPL" mean head disentanglement, multi-head self attention and dispersible points learning, respectively.

Block Number. We also analyze the influence of the number of transformer encoder blocks. As shown in Table 5, we compare performance, head parameters, FLOPs, training time, and inference time with the baseline under different block number settings. Our head benefits slightly from an increase in the number of blocks. Considering computational cost and head capacity, we finally use \(L_{cls}=2\) and \(L_{loc}=3\) in our implementations.

Head Disentanglement. To show that our method does not only benefit from the separated task heads, a Faster RCNN with sibling heads is given in the second row of Table 6. We simply remove the shared fully connected layers in the RCNN head and replace them with separated ones. We can observe that the improvement brought by head disentanglement (0.5 AP) is actually limited.

Dispersible Points Learning and Multi-head Self Attention. In order to demonstrate the effectiveness of dispersible points learning and multi-head self attention, we conduct experiments with different head designs and compare them with our head (without IoU prediction). First, we take the output of RoI Align [17] as tokens directly (49 in total), and process them with disentangled transformer encoders. The result is in the third row of Table 6. We can see that though more points are used, it still performs worse than DPL with \(K=16\).

Then, we leverage deformable RoI Pooling [10] as another form of dispersible points learning. Specifically, multiple offsets are generated in the same way and applied to deformable RoI Pooling for feature extraction. The result is shown in the fourth row of Table 6. It indicates that the combination of dispersible points learning and multi-head attention is more effective to capture semantic information within an instance.

Table 7. Results of UniHead with various detection pipelines.
Table 8. Results of UniHead with various backbones. "DCN" means deformable convolution. * means multi-scale training.

4.3 Generalization Ability

Detection Pipeline Generalization. We evaluate the performance by transferring our UniHead to different detection pipelines. Specifically, we simply replace the detection head in Mask RCNN with UniHead to build a mask-based version. As shown in Table 7, UniHead boosts the performance of all these types of detectors, showing its generalization ability across different detection frameworks.

Backbone Generalization. We further conduct experiments with different backbones under the Faster RCNN setting. As shown in Table 8, our head steadily boosts performance by \(2 \sim 3\)% AP, demonstrating the generalization ability of our method across various backbones.

Table 9. Results on different tasks. "*" indicates multi-scale training, multi-stage refinement and 11x scheduler. "\(+\)" indicates multi-scale training and 2x scheduler.

Task Generalization. As mentioned before, our head is a unifying perception head, which means that it can be applied to various visual tasks. To be specific, we use \(K=16\) points for image classification and object detection, \(K=36\) for instance segmentation and \(K=17\) for human pose estimation. The classification baseline is trained with the same setting as UniHead for fair comparison. Performance is evaluated on the ImageNet val set for classification and the COCO val set for the other three tasks. The experimental results are shown in Table 9. With a ResNet-50 backbone, UniHead improves classification and object detection, and achieves performance close to that of expert models for instance segmentation and pose estimation.

Table 10. Comparisons of different algorithms on different tasks, evaluated on the COCO test-dev set. "FG" and "TG" indicate that the method can be generalized to different visual frameworks and visual tasks, respectively. "*" denotes multi-scale test.

4.4 Comparison with State-of-the-Art

We evaluate object detection, instance segmentation and pose estimation on COCO test-dev; the results are shown in Table 10. The reported AP corresponds to the respective task, e.g., mask AP for instance segmentation. We only adopt multi-scale training for data augmentation and no TTA is used. It should be noted that we do not introduce any task-aware algorithm design, e.g., multi-stage refinement for pose estimation.

For object detection, the multi-scale training setting is [480, 960] for the image minimum side and 1333 for the image maximum side. With stronger backbones, our UniHead achieves competitive performance, although it is not developed solely for object detection. For instance segmentation, the same augmentation strategy as object detection is used. Here we also use the mask head of Mask RCNN [17] to build a mask-based UniHead. Without bells and whistles, UniHead gets 46.7% AP with the mask-based head and 39.4% AP with the contour-based head. Compared with expert models, UniHead achieves comparable performance with a simpler pipeline. For pose estimation, we use a larger input resolution ([480, 1200] for the image minimum side and 2000 for the image maximum side). In a surprisingly simple way, i.e., direct keypoint regression using an \(L_1\) loss, UniHead achieves performance close to other regression-based methods that utilize multi-stage refinement (like [44]) and more training iterations.

5 Conclusion

In this paper, we proposed UniHead, a unifying visual perception head. It can not only be embedded in various detection frameworks, but also applied to different visual tasks, including image classification, object detection, instance segmentation and pose estimation. UniHead perceives instances by dispersible points learning and is equipped with transformer encoders to capture the semantic relations among the points. Though UniHead is designed in a simple way, it achieves performance comparable to expert models on each task. This work shows the potential of general visual learning, and we hope it can promote research on universal visual perception.