
1 Introduction

Image classification [12], object detection [16, 24], instance segmentation [8, 24] and human pose estimation [1, 24] are vital visual perception tasks in computer vision. The vision community has rapidly improved results by developing robust feature representations. Beyond the development of powerful backbones, these advances are in large part inseparable from task-aware head designs, such as TSD [36], CondInst [40] and CPN [7], or elaborately constructed frameworks, e.g., one-stage detectors [23, 26, 41] and two-stage detectors [4, 34]. These methods are conceptually specialized and introduce task exclusivity, e.g., TSD [36], developed for object detection, cannot be migrated to pose estimation. Our goal in this work is to develop a comparably generalized feature representation learning method with a task-agnostic structure design for unifying visual perception.

Fig. 1. (a) Illustration of the typical pipelines for different visual tasks. Different sub-tasks require different prediction targets and different feature structures. (b) Illustration of the UniHead design. Given a fixed spatial coordinate, UniHead adaptively scatters it to different spatial points and reasons about their relations by transformer encoders. It directly outputs a set of predictions in the form of multiple points to perform different visual tasks.

The main barriers behind this are: 1) As shown in Fig. 1(a), the different prediction targets force visual perception into different sub-tasks, i.e., a class label for image classification, a bounding box for object detection, a pixel-wise mask for instance segmentation, and a group of landmarks for pose estimation. 2) It is unclear how to construct a task-agnostic head module that can generalize to all sub-tasks and frameworks while achieving good results. Given this, one might expect that a complex head design is required to overcome these barriers. However, we show that a surprisingly simple, flexible, and universal head module can easily generalize to different visual tasks or frameworks and surpass prior expert models in each individual task.

Our method, called UniHead, can be directly migrated to various visual frameworks, e.g., Faster RCNN [34], FCOS [41] and ATSS [50], by formulating the prediction targets as dispersible points learning. As shown in Fig. 1(b), UniHead is built upon any network backbone, and the prediction targets for different tasks are achieved by a basic yet effective point estimation. Given a fixed spatial coordinate, UniHead adaptively scatters it to different spatial points and reasons about their relations by several stacked transformer encoders. It directly outputs the final set of predictions in the form of multiple points, which is robust to the geometric variations an object can exhibit, including scale, deformation, and orientation. For image classification, the points directly predict the object class. For object detection, the points are placed along the four edges of a bounding box. For instance segmentation, the points are evenly distributed along the instance mask contour. For pose estimation, the positions of the points conform to the pose distribution of the training data.

Furthermore, we found it essential to adapt the initial positions of the points according to the different prediction targets. This effectively alleviates the difficulty of optimization under the requirement of fitting objects with different scales and orientations. Additionally, UniHead adds only a small computational overhead, enabling a universal system and rapid experimentation.

Without bells and whistles, UniHead can be equipped with popular backbones for different visual tasks, such as ResNet [18], ResNeXt [46], Swin Transformer [27], etc. It excels on ImageNet [12] classification and all three tracks of the COCO [24] suite of challenges, including object detection, instance segmentation, and human pose estimation. We conduct extensive experiments to showcase the generality of our UniHead. By viewing each task as dispersible points learning via the transformer encoder architecture, UniHead performs comparably without any special design for specific tasks. UniHead can therefore be seen more broadly as a universal head module for visual perception and easily migrated to more complex tasks.

To summarize, our contributions are as follows:

1) We develop a comparably generalized dispersible points learning method for unifying visual perception. We hope our work can inspire the vision community to explore a unified vision framework.

2) We introduce the transformer encoder to reason about the relations of dispersible points, and an adaptive point initialization to handle the geometric variations an object can exhibit, including scale, deformation, and orientation.

3) Detailed experiments on ImageNet [12] and MS-COCO [24] datasets show that UniHead can easily generalize to different tasks while obtaining comparable performance compared to the expert models developed in individual tasks.

2 Related Work

Image classification [12], object detection [16, 24], instance segmentation [8, 24] and pose estimation [1, 24] are four popular tasks in computer vision. They all benefit greatly from the development of deep neural networks [18, 37]. Among them, image classification [21] was the first to which CNNs were applied, improving performance by a considerable margin. Since then, researchers have devoted themselves to designing powerful backbones [18, 19, 46], which also boost other instance-level tasks, such as object detection [23, 34] and human pose estimation [37].

Object detection requires bounding-box-level location and category information for the instances of interest in an image. The methods can be roughly categorized into three types: two-stage, one-stage and DETR detectors. Two-stage methods detect a series of region proposals first and refine them in the second stage. Faster RCNN [34] is a popular pipeline of this type, which also includes R-FCN [9], Cascade RCNN [4], Grid RCNN [29], etc. One-stage methods predict locations and class scores on a large number of pre-defined spatial candidates. They can be further divided into two types: anchor-based and anchor-free detectors. Anchor-based methods use anchor boxes as an initial set, such as SSD [26] and RetinaNet [23]. Among anchor-free methods, some make dense predictions on spatial points, such as CenterNet (objects as points) [51], FCOS [41] and RepPoints [47], while others obtain a keypoint heatmap first and get objects by grouping keypoints; CornerNet [22], ExtremeNet [52] and CenterNet (keypoint triplets) [14] fall into this category. DETR methods, such as DETR [5], Deformable DETR [53] and Conditional DETR [30], detect objects by decoding a pre-defined set of object queries with transformers. These queries are optimized one-to-one with the ground truths, so there is no need for NMS as post-processing. Such one-to-one label assignment also inspires other works like Sparse RCNN [38].

Instance segmentation requires mask and class information for instances. The methods can be categorized into two types: mask-based and contour-based. Mask-based methods predict binary masks directly and can be further divided into local-mask and global-mask methods. Most local-mask methods include two stages: the first for instance detection and the second for instance mask generation, such as Mask RCNN [17], PANet [25] and PointRend [20]. Global-mask methods usually predict the mask for the whole image and leverage dynamic mask filters to decode masks for different instances, such as YOLACT [3] and CondInst [40]. Contour-based methods obtain instance masks by predicting object boundaries. PolarMask [45] and DeepSnake [31] are two typical works using this idea.

Human pose estimation requires the keypoint locations (e.g. nose, eyes, knees) of multiple humans in an image. There are mainly two kinds of approaches: heatmap-based and regression-based. Heatmap-based methods use a multi-class classifier to generate keypoint heatmaps and compose them with clustering and grouping procedures, such as CPN [7], HRNet [37] and DARK [49]. Regression-based methods, including Integral [39] and CenterNet [51], predict the coordinates of keypoints directly and are simpler to plug into existing end-to-end learning frameworks.

Mask R-CNN [17], PointSetNet [44] and LSNet [15] have merged object detection, instance segmentation and pose estimation into one network. Beyond these tasks, UniHead can also be extended to image classification. Furthermore, UniHead can be simply embedded in various types of architectures, e.g., anchor-free, anchor-based, and two-stage detectors, showing a powerful ability to generalize across tasks and frameworks.

Fig. 2. A typical pipeline of UniHead. At first, most methods for location-sensitive tasks contain a backbone and a feature pyramid (not used in the image classification task) to extract feature maps. Then, for an anchor point, UniHead obtains multiple points via dispersible points learning. To generate point representations, bilinear interpolation is performed on the feature map according to the point coordinates, denoted by the dotted line. The obtained features are concatenated with extra learnable tokens if necessary and sent to the corresponding transformer encoders to complete various visual tasks.

3 Method

In this paper, we introduce the UniHead, a generalized visual head. It can be applied to different detection frameworks, such as Faster RCNN [34], FCOS [41] and ATSS [50], as well as different tasks including classification, object detection, instance segmentation and pose estimation. In this section, we first describe the design principle of UniHead and then detail the adaptation to different visual tasks and different visual frameworks. Finally, we delve into the inherent advantage of UniHead over other methods.

3.1 UniHead

In UniHead, given a fixed spatial coordinate (\(\mathcal {A}_x, \mathcal {A}_y\)) (referred to as the anchor point), i.e., the center point of a proposal or a point in the feature map, UniHead adaptively scatters it to different spatial points and reasons about their relations by several stacked transformer encoders. As shown in Fig. 2, UniHead adopts a sequential three-stage procedure to obtain the scattered point representations. In the first stage, it generates the anchor representation \(\mathcal {F}_{x,y}\) according to the anchor coordinate or region proposal. For one-stage or anchor-free detectors, it is the feature vector at the corresponding coordinate of the feature map. For two-stage detectors, the feature generated by RoI Pooling [34] is used. In the second stage, K scattered points are generated by:

$$\begin{aligned} \begin{aligned} P_{x_i}&= \mathcal {A}_x + s_x \cdot \varDelta x_i\\ P_{y_i}&= \mathcal {A}_y + s_y \cdot \varDelta y_i, \end{aligned} \end{aligned}$$
(1)

where \((\varDelta x_i, \varDelta y_i) = f(\mathcal {F}_{x,y}; w_i)\), f is a simple multi-layer perceptron and \(w_i\) are its learnable parameters. \((s_x, s_y)\) are scale factors that modulate the magnitude of \((\varDelta x_i, \varDelta y_i)\). Specifically, \((s_x, s_y)\) is the width and height of the region proposal in a two-stage detector, the anchor scale in an anchor-based detector, and the feature map stride in an anchor-free detector. In the final stage, instead of quantizing the floating-point coordinates \((P_{x_i}, P_{y_i})\), we perform bilinear interpolation to generate the point representations \(\mathcal {F}_{x_i,y_i}, i\in [1,K]\).
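To make the procedure concrete, a minimal PyTorch-style sketch of Eq. 1 and the bilinear sampling step is given below; the module and helper names (`DispersiblePoints`, `bilinear_sample`, `offset_mlp`) are illustrative assumptions rather than the authors' released code, and the anchor and scale inputs are assumed to be given in feature-map pixel coordinates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DispersiblePoints(nn.Module):
    """Sketch of the point-scattering step (Eq. 1); names are illustrative, not the authors' code."""
    def __init__(self, channels, num_points=16):
        super().__init__()
        self.num_points = num_points
        # f(.; w_i): a simple MLP predicting (dx_i, dy_i) for each of the K points
        self.offset_mlp = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, 2 * num_points))

    def forward(self, feat, anchor_xy, scale_xy):
        # feat:      (N, C, H, W) feature map
        # anchor_xy: (N, 2) anchor coordinates (A_x, A_y) in feature-map pixels
        # scale_xy:  (N, 2) modulating scale (s_x, s_y)
        N = feat.size(0)
        # anchor representation F_{x,y}: sample the feature at the anchor point
        anchor_feat = bilinear_sample(feat, anchor_xy.unsqueeze(1)).squeeze(1)   # (N, C)
        offsets = self.offset_mlp(anchor_feat).view(N, self.num_points, 2)       # (dx_i, dy_i)
        points = anchor_xy.unsqueeze(1) + scale_xy.unsqueeze(1) * offsets        # Eq. 1, (N, K, 2)
        point_feats = bilinear_sample(feat, points)                              # (N, K, C)
        return points, point_feats

def bilinear_sample(feat, points):
    """Bilinear interpolation of per-point features from a (N, C, H, W) map."""
    N, C, H, W = feat.shape
    # normalize pixel coordinates to [-1, 1] as required by grid_sample
    norm = points.new_tensor([W - 1, H - 1])
    grid = (points / norm * 2 - 1).unsqueeze(2)                  # (N, K, 1, 2)
    sampled = F.grid_sample(feat, grid, align_corners=True)      # (N, C, K, 1)
    return sampled.squeeze(-1).permute(0, 2, 1)                  # (N, K, C)
```

In a dense one-stage setting the offset MLP would typically be realized as a 1\(\times \)1 convolution applied to the whole feature map (cf. Sect. 3.3).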

To better reason about the relations of these scattered point representations and generate more informative features, we introduce the transformer operator to capture the correlative dependencies among them. To improve robustness across different visual tasks, we insert a task-aware token embedding:

$$\begin{aligned} \begin{aligned} z_0 = [\mathbf{T_{task}}; \mathcal {F}_{x_1,y_1}; \mathcal {F}_{x_2,y_2}; \dots ; \mathcal {F}_{x_K,y_K}], \end{aligned} \end{aligned}$$
(2)

where \(\mathbf {T_{task}}\) can be \(\mathbf {T_{class}}\), \(\mathbf {T_{IoU}}\), and \(\mathbf {T_{visibility}}\) for image classification, object detection and pose estimation, respectively. The computation in transformer encoders for point representations can be formulated as:

$$\begin{aligned} \begin{aligned}&z^{'}_{l} = \textrm{MHSA}(\textrm{LN}(z_{l-1})) + z_{l-1}, \qquad l = 1\dots L, \\&z_l = \textrm{MLP}(\textrm{LN}(z^{'}_{l})) + z^{'}_{l}, \qquad l = 1\dots L, \\&[\mathbf{T^{'}_{task}}; \mathcal {F}^{'}_{x_1,y_1}; \mathcal {F}^{'}_{x_2,y_2}; \dots ; \mathcal {F}^{'}_{x_K,y_K}]=z_{L}, \end{aligned} \end{aligned}$$
(3)

where \(\textrm{MHSA}\) denotes multi-head self-attention [43], \(\textrm{LN}\) denotes layer normalization [2], and \(\textrm{MLP}\) is a multi-layer perceptron. During training, we use L transformer encoders, and the final output \(z_{L}\) is adapted to different visual tasks to perform the task-aware prediction.
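The relation reasoning of Eq. 2 and Eq. 3 can be sketched with standard pre-LN transformer encoder blocks, as below; `EncoderBlock`, `relation_reasoning`, and the use of `nn.MultiheadAttention` are illustrative assumptions and may differ from the authors' exact implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-LN transformer encoder block, as in Eq. 3 (illustrative sketch)."""
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # MHSA + residual
        z = z + self.mlp(self.ln2(z))                      # MLP + residual
        return z

def relation_reasoning(point_feats, task_token, blocks):
    # point_feats: (N, K, C) from bilinear sampling; task_token: (1, 1, C) learnable
    z = torch.cat([task_token.expand(point_feats.size(0), -1, -1), point_feats], dim=1)  # Eq. 2
    for blk in blocks:          # L stacked encoder blocks
        z = blk(z)
    return z[:, 0], z[:, 1:]    # T'_task and the refined point features F'_{x_i, y_i}
```

With `batch_first=True` the token layout is (N, K+1, C), so the first output row corresponds to the refined task token \(\mathbf {T^{'}_{task}}\).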

3.2 Adaptation to Different Visual Tasks

Image Classification. For image classification, we directly use the final feature map to perform dispersible points learning. The anchor point is set as the center of the input image and the corresponding scale is the input image size. We align the classifier setting with standard vision transformers, i.e., only the classification token, rather than all tokens, is fed to the classifier. The training can be formulated as:

$$\begin{aligned} \mathcal {L}_\textrm{cls} = \textrm{CrossEntropy}(\textrm{softmax}(\textrm{MLP}(\mathbf {T^{'}_{cls}})),y). \end{aligned}$$
(4)

In the above, \(y\) specifies the ground-truth class and \(\textrm{MLP}\) is a single fully-connected layer predicting the model's probability for the class with label \(y\).
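As a small, self-contained illustration of Eq. 4 under assumed dimensions (all names and sizes below are placeholders):

```python
import torch
import torch.nn as nn

# Illustrative only: a single FC layer on the refined classification token (Eq. 4)
embed_dim, num_classes = 256, 1000                   # placeholder dimensions
classifier = nn.Linear(embed_dim, num_classes)       # the "MLP" in Eq. 4
criterion = nn.CrossEntropyLoss()                    # log-softmax + cross entropy in one call

cls_token = torch.randn(8, embed_dim)                # T'_cls for a dummy batch of 8
labels = torch.randint(0, num_classes, (8,))         # dummy ground-truth class indices
loss_cls = criterion(classifier(cls_token), labels)
```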

Object Detection. UniHead can be applied to a variety of detectors, such as Faster R-CNN [34], FCOS [41], etc., without changing the backbone network structure or the manner of label assignment. Specifically, we concatenate a learnable token \(\mathbf {T_{IoU}}\) as a replacement for the IoU branch. After passing through all transformer blocks, \(\mathbf {T^{'}_{IoU}}\) is used to predict the IoU, which is multiplied by the class prediction to obtain the final scores at inference time. The \(\mathcal {F}^{'}_{x_i,y_i}\) is used to predict the offset for point \((P_{x_i}, P_{y_i})\):

$$\begin{aligned} (P^{'}_{x_i}, P^{'}_{y_i}) = (P_{x_i}, P_{y_i}) + \textrm{MLP}(\mathcal {F}^{'}_{x_i,y_i}) \odot (s_x, s_y) , \end{aligned}$$
(5)

where \(\odot \) denotes element-wise multiplication, and the \(\textrm{MLP}\) is a single fully-connected layer shared between different points. The predicted bounding box can be computed by \(B^{'}\) = \((\textrm{min}\{P^{'}_{x_i}\}, \textrm{min}\{P^{'}_{y_i}\}, \textrm{max}\{P^{'}_{x_i}\}, \textrm{max}\{P^{'}_{y_i}\})\), \(i \in [1,K]\).
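A sketch of the point refinement in Eq. 5 and the min/max box decoding, assuming the points and point features produced by the preceding stages; `decode_boxes` and `offset_fc` are hypothetical names.

```python
import torch
import torch.nn as nn

def decode_boxes(points, point_feats, scale_xy, offset_fc: nn.Linear):
    """Refine the K points (Eq. 5) and take their extremes as the box (illustrative sketch).
    points: (N, K, 2), point_feats: (N, K, C), scale_xy: (N, 2),
    offset_fc: a single FC layer (C -> 2) shared across the K points."""
    refined = points + offset_fc(point_feats) * scale_xy.unsqueeze(1)            # Eq. 5
    x, y = refined[..., 0], refined[..., 1]
    boxes = torch.stack([x.min(dim=1).values, y.min(dim=1).values,
                         x.max(dim=1).values, y.max(dim=1).values], dim=1)       # (N, 4): x1, y1, x2, y2
    return refined, boxes
```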

The classification branch follows the same computation as UniHead for image classification. The regression branch shares \(z_0\) with the classification branch to reduce the computational cost of point representation generation. Our loss function for detection is defined as:

$$\begin{aligned} \mathcal {L}_{loc} = \frac{1}{n}\sum _{j=1}^n L_1(B^{'}_j, B_j), \end{aligned}$$
(6)

where j is the index of positive samples, \(B^{'}_j\) is the predicted box and \(B_j\) is the ground truth. Other kinds of detection loss can also be used, e.g., GIoU loss [35].

Instance Segmentation. For instance segmentation, we view the task as contour-based regression. UniHead is placed at the output of the backbone to generate the points \(P^{'}_{x_i,y_i}\) by Eq. 1, Eq. 2, Eq. 3 and Eq. 5. To align the number of scattered points with the number of contour points in the training data, we uniformly add new points, or delete the points with the shortest edge, until the target number is met, which is similar to Deep Snake [31]. All ground-truth points are arranged clockwise along the contour line. The scattered points \(\{P^{'}_{x_i,y_i}, i\in [1,K]\}\) are then matched one-to-one with them, uniformly and in clockwise order.
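For illustration, one simple way to obtain exactly K clockwise ground-truth contour points is uniform arc-length resampling, sketched below; this is a stand-in for the add/delete scheme described above, and `resample_contour` is an illustrative name.

```python
import numpy as np

def resample_contour(contour, k):
    """Resample a closed, clockwise ground-truth contour to exactly k points by arc length.
    (A simpler stand-in for the add/delete scheme described in the text.)"""
    # contour: (M, 2) array of clockwise (x, y) vertices
    closed = np.vstack([contour, contour[:1]])                  # close the polygon
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)       # edge lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])               # cumulative arc length
    targets = np.linspace(0.0, cum[-1], num=k, endpoint=False)  # k evenly spaced positions
    resampled = np.empty((k, 2))
    for i, t in enumerate(targets):
        j = np.searchsorted(cum, t, side='right') - 1           # edge containing position t
        r = (t - cum[j]) / max(seg[j], 1e-8)                    # fraction along edge j
        resampled[i] = (1 - r) * closed[j] + r * closed[j + 1]
    return resampled
```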

Besides, some objects are split into several components due to occlusions. To overcome this problem, we simply follow PolarMask [45] and directly treat them as multiple objects. During training, we use \(L_1\) loss to optimize each point:

$$\begin{aligned} \mathcal {L}_{seg} = \frac{1}{n}\sum _{i=1}^n L_1(P^{'}_{x_i,y_i}, P_{x_i,y_i}), \end{aligned}$$
(7)

where \(P^{'}_{x_i,y_i}\) is the predicted point and \(P_{x_i,y_i}\) is the corresponding ground truth.

Pose Estimation. The overall design for pose estimation is consistent with instance segmentation, except that an extra token \(\mathbf {T_{visibility}}\) is introduced to predict the visibility of keypoints. The number K of predicted points is aligned with the keypoint number in the dataset. For pose estimation, each keypoint has a clear definition, like nose, eyes, etc., which makes it possible to build a one-to-one correspondence with the dispersible points. \(L_1\) loss, the same as Eq. 7, is adopted to train the keypoint localization branch. For the training of keypoint visibility prediction, we use the standard binary cross entropy loss.
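A hedged sketch of the two pose-estimation losses (keypoint \(L_1\) regression as in Eq. 7 plus binary cross entropy for visibility); the shapes and the mapping of the visibility token to 17 logits through an FC layer are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pose_losses(pred_kpts, gt_kpts, vis_logits, gt_vis):
    """Sketch of the pose losses; names and shapes are illustrative, not the authors' code.
    pred_kpts, gt_kpts: (N, 17, 2) predicted / ground-truth keypoint coordinates
    vis_logits: (N, 17), e.g. an FC layer applied to the refined T'_visibility token
    gt_vis: (N, 17) binary visibility labels"""
    loss_kpt = F.l1_loss(pred_kpts, gt_kpts)                                    # as in Eq. 7
    loss_vis = F.binary_cross_entropy_with_logits(vis_logits, gt_vis.float())   # visibility BCE
    return loss_kpt, loss_vis
```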

Fig. 3. Ways of point initialization for different tasks. From left to right: image classification, object detection, instance segmentation, pose estimation.

3.3 Adaptation to Different Visual Frameworks

Two-stage Framework. UniHead is applied to region proposals in the two-stage framework. Each proposal is represented as a combination of an anchor point (\(\mathcal {A}_x, \mathcal {A}_y\)) and its scale (\(s_x, s_y\)). The offsets \((\varDelta x_i, \varDelta y_i)\) are generated from the proposal feature extracted with RoI Pooling or RoI Align. Without other modifications, UniHead can now be directly used in a two-stage framework.

One-stage Framework. UniHead is applied on dense spatial points in the one-stage framework. For anchor-free methods, (\(\mathcal {A}_x, \mathcal {A}_y\)) and (\(s_x, s_y\)) are a point on the feature map and its stride. For anchor-based methods, (\(\mathcal {A}_x, \mathcal {A}_y\)) and (\(s_x, s_y\)) are the center point and the scale of an anchor. The offsets \((\varDelta x_i, \varDelta y_i)\) are generated using a 1\(\times \)1 convolutional layer.

3.4 UniHead Initialization

To effectively alleviate the difficulty of optimization under the requirement of fitting objects with different scales and orientations, the points are initialized in a task-appropriate way, which is illustrated in Fig. 3. For image classification, points are simply scattered around the anchor point. For object detection, points are divided into four groups placed at the bottom, top, left, and right of the anchor point, respectively. For instance segmentation, we first set a 2D reference vector that starts from the anchor point. Based on the direction of this vector, the points are uniformly initialized in clockwise order on the edge of a pseudo box generated from the anchor point and its spatial scale. For pose estimation, we calculate the average positions of the different keypoints in the training dataset and use them to initialize the points.

The initial point positions are controlled by tuning the bias of the last fully-connected layer of the \(\textrm{MLP}\) used for offset generation. Taking object detection as an example, the biases for the points at the left, right, top and bottom are set to \([-0.5, 0]\), [0.5, 0], \([0, -0.5]\) and [0, 0.5], respectively.
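Assuming the offset MLP ends in a single fully-connected layer that outputs \((\varDelta x_i, \varDelta y_i)\) for all K points, the detection bias initialization described above could look like the following sketch; the helper name and the group ordering are illustrative.

```python
import torch
import torch.nn as nn

def init_detection_bias(offset_fc: nn.Linear, num_points: int = 16):
    """Set the bias of the offset MLP's last FC layer so that the K points start in four
    groups (left, right, top, bottom of the anchor). Illustrative sketch only."""
    # offset_fc outputs (dx_1, dy_1, ..., dx_K, dy_K), so the bias has 2 * num_points entries;
    # assumes num_points is divisible by 4
    g = num_points // 4
    pattern = ([[-0.5, 0.0]] * g + [[0.5, 0.0]] * g +    # left, right groups
               [[0.0, -0.5]] * g + [[0.0, 0.5]] * g)     # top, bottom groups
    with torch.no_grad():
        offset_fc.bias.copy_(torch.tensor(pattern).flatten())
```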

Table 1. Ablation study on extra blocks for the image classification task.
Table 2. Ablation study on \(\mathbf {T_{task}}\). "Det." and "Keyp." mean detection and pose estimation, respectively.

4 Experiments

For image classification, experiments are conducted on the ILSVRC-2012 ImageNet [12] dataset with 1K classes and 1.3M images. We use Top-1 accuracy as the metric in classification experiments.

We also conduct experiments with different backbones on the MS-COCO 2017 [24] dataset, covering object detection, instance segmentation, and human pose estimation. For these tasks, training is performed on the train set (over 57K images for human pose estimation and over 118K images for object detection and instance segmentation). For ablation studies, evaluation is conducted on the val set. We also report performance on the test-dev set to compare with state-of-the-art methods. The mean average precision (AP) is used as the metric in the COCO experiments, but its definition varies with the task: for object detection and instance segmentation, AP is calculated under different IoU thresholds (bounding box IoU or mask IoU), whereas for human pose estimation, AP is calculated with object keypoint similarity (OKS).

4.1 Implementation Details

In the image classification task, all models are trained using the AdamW optimizer [28] with 1e-4 initial learning rate, 0.05 weight decay, \(\beta _1=0.9\), \(\beta _2=0.999\) and a batch size of 1024. We train classification models for 300 epochs and use a cosine annealing scheduler to decrease the learning rate. The data augmentations in [42] are also used, e.g., mixup, label smoothing, etc.
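For reference, a minimal sketch of the stated classification recipe (AdamW, cosine annealing, 300 epochs); `model` and the commented training call are placeholders, not part of any released code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 10)                      # stand-in for the real network
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.05, betas=(0.9, 0.999))
scheduler = CosineAnnealingLR(optimizer, T_max=300)  # cosine annealing over 300 epochs

for epoch in range(300):
    # train_one_epoch(model, optimizer)              # batch size 1024, mixup, label smoothing, ...
    scheduler.step()
```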

For the other three tasks, we use different backbones, including ResNet [18], ResNeXt [46] and Swin Transformer [27], with weights pretrained on ImageNet [12]. For object detection, we apply our UniHead to different detection pipelines and follow their original hyper-parameters. For instance segmentation and pose estimation, the same settings as Faster RCNN [34] are used. During training, we adopt AdamW [28] as the optimizer, with 1e-4 initial learning rate, 0.05 weight decay, \(\beta _1=0.9\) and \(\beta _2=0.999\). In our \(1\times \) setting, we train our model with mini-batch size 16 for 13 epochs and decrease the learning rate by a factor of 10 at epochs 9 and 12. Unless specified, the input scale of images is [800, 1333] and no data augmentation except horizontal flipping is used in training. The hyper-parameters of the newly-added transformer encoders are kept the same as in [13].

Table 3. Ablation study on UniHead bias initialization strategy.
Table 4. Ablation study on point number. Point number 8, 16, 24, 32 are tried.

4.2 Ablation Studies

In this section, we conduct extensive ablation studies on ImageNet and the COCO val set to validate the effectiveness of UniHead on classification and localization tasks, respectively. Specifically, for the localization task, we choose object detection, and all models are trained on the Faster RCNN [34] baseline with the AdamW optimizer [28] and a ResNet-50 backbone for fair comparison. We find that AdamW stably improves performance by \(\sim 1\)% AP compared to SGD.

Extra Blocks. We add extra blocks to the classification backbone networks to align their FLOPs with UniHead. Specifically, we append two bottlenecks to ResNet-50 ([3,4,6,5] for the four stages) and two transformer blocks to Swin-T ([2,2,6,4] for the four stages); the results are shown in Table 1. Though additional layers can boost the performance, UniHead achieves better performance with similar FLOPs. We also conduct the same experiment on Swin-B. When the model becomes bigger with higher FLOPs, extra blocks can hardly bring improvement, but UniHead still achieves a continual performance boost. All these results show that the improvement brought by UniHead is not simply due to its extra transformer blocks.

Task Token. We also explore the influence of \(\mathbf {T_{IoU}}\) and \(\mathbf {T_{visibility}}\) on object detection and pose estimation, respectively. As shown in Table 2, the introduction of \(\mathbf {T_{task}}\) brings a slight improvement on both tasks, proving the effectiveness of the task tokens. It is worth noting that although visibility prediction is not used in the pose estimation evaluation, \(\textbf{T}_{visibility}\) still has a positive impact on training.

Table 5. Ablation study on block number. \(L_{cls}\) and \(L_{loc}\) denote transformer encoder block number of classification and localization, respectively. #params means parameters of the detection head. The training and inference time is measured on a 16GB V100 GPU.

UniHead Initialization. We replace our task-specific bias initialization with zero initialization on different tasks. The main results are shown in Table 3. They show that a proper initialization helps the unified architecture learn the knowledge of different tasks more quickly.

Point Number. We evaluate the performance of different point numbers in UniHead, which is shown in Table 4. It shows that our head can benefit from the increasing number of points. But more points may bring overfitting and more computational cost. So we choose to use \(K=16\) in our implementations.

Table 6. Ablation study on different modules. IoU prediction is not used in this table. "HD", "MHSA" and "DPL" mean head disentanglement, multi-head self attention and dispersible points learning, respectively.

Block Number. We also analyze the influence of the number of transformer encoder blocks. As shown in Table 5, we compare performance, head parameters, FLOPs, training time, and inference time with the baseline under different block number settings. Our head benefits slightly from an increase in the number of blocks. Considering computational cost and head capacity, we finally use \(L_{cls}=2\) and \(L_{loc}=3\) in our implementations.

Head Disentanglement. To show that our method does not only benefit from the separated task heads, a Faster RCNN with sibling heads is given in the second row of Table 6. We simply remove the shared fully connected layers in the RCNN head and replace them with separated ones. We can observe that the improvement brought by head disentanglement (0.5 AP) is actually limited.

Dispersible Points Learning and Multi-head Self Attention. In order to demonstrate the effectiveness of dispersible points learning and multi-head self attention, we conduct experiments with different head designs and compare them with our head (without IoU prediction). First, we take the output of RoI Align [17] as tokens directly (49 in total), and process them with disentangled transformer encoders. The result is in the third row of Table 6. We can see that though more points are used, it still performs worse than DPL with \(K=16\).

Then, we leverage deformable RoI Pooling [10] as another form of dispersible points learning. Specifically, multiple offsets are generated in the same way and applied to deformable RoI Pooling for feature extraction. The result is shown in the fourth row of Table 6. It indicates that the combination of dispersible points learning and multi-head attention is more effective to capture semantic information within an instance.

Table 7. Results of UniHead with various detection pipelines.
Table 8. Results of UniHead with various backbones. "DCN" means deformable convolution. * means multi-scale training.

4.3 Generalization Ability

Detection Pipeline Generalization. We evaluate the performance by transferring our UniHead to different detection pipelines. Specifically, we simply replace the detection head in Mask RCNN with UniHead to build a mask-based version. As shown in Table 7, UniHead boosts the performance of all these types of detectors, showing its generalization ability across different detection frameworks.

Backbone Generalization. We further conduct experiments with different backbones under the Faster RCNN setting. As shown in Table 8, our head steadily boosts performance by \(2 \sim 3\)% AP, demonstrating the generalization ability of our method across various backbones.

Table 9. Results on different tasks. "*" indicates multi-scale training, multi-stage refinement and 11x scheduler. "\(+\)" indicates multi-scale training and 2x scheduler.

Task Generalization. As mentioned before, our head is a unifying perception head, which means that it can be applied to various visual tasks. To be specific, we use \(K=16\) points for image classification and object detection, \(K=36\) for instance segmentation and \(K=17\) for human pose estimation. The classification baseline is trained with the same setting as UniHead for fair comparison. Performance is evaluated on the ImageNet val set for classification and the COCO val set for the other three tasks. The experimental results are shown in Table 9. With a ResNet-50 backbone, UniHead improves classification and object detection, and achieves performance close to that of expert models for instance segmentation and pose estimation.

Table 10. Comparisons of different algorithms on different tasks, evaluated on the COCO test-dev set. "FG" and "TG" indicate that the method can be generalized to different visual frameworks and visual tasks, respectively. "*" denotes multi-scale test.

4.4 Comparison with State-of-the-Art

We evaluate object detection, instance segmentation and pose estimation on COCO test-dev; the results are shown in Table 10. The reported AP corresponds to the respective task, e.g., mask AP for instance segmentation. We only adopt multi-scale training for data augmentation and no TTA is used. It should be noted that we do not introduce any task-aware algorithm design, e.g., multi-stage refinement for pose estimation.

For object detection, the multi-scale training setting is [480, 960] for the image minimum side and 1333 for the image maximum side. With stronger backbones, our UniHead achieves competitive performance, although it is not developed solely for object detection. For instance segmentation, the same augmentation strategy as object detection is used. Here we also use the mask head of Mask RCNN [17] to build a mask-based UniHead. Without bells and whistles, UniHead gets 46.7% AP with the mask-based head and 39.4% AP with the contour-based head. Compared with expert models, UniHead achieves comparable performance with a simpler pipeline. For pose estimation, we use a larger input resolution ([480, 1200] for the image minimum side and 2000 for the image maximum side). In a surprisingly simple way, i.e., direct keypoint regression using an \(L_1\) loss, UniHead achieves performance close to other regression-based methods that utilize multi-stage refinement (like [44]) and more training iterations.

5 Conclusion

In this paper, we proposed UniHead, a unifying visual perception head. It can not only be embedded in various detection frameworks, but also applied to different visual tasks, including image classification, object detection, instance segmentation and pose estimation. UniHead perceives instances by dispersible points learning and is equipped with transformer encoders to capture the semantic relations among the points. Though UniHead is designed in a simple way, it achieves performance comparable to expert models on each task. This work shows the potential of general visual learning, and we hope it can promote research on universal visual perception.