1 Introduction

Object detection is a fundamental task in computer vision [15, 17, 18, 23, 24, 30, 31, 38, 45], but it remains challenging due to the large variation in object scales. A straightforward way to handle the scale variation is to use multi-scale image inputs [34, 35], which is usually inefficient. A more efficient line of methods tackles the scale variation on the intermediate features [17, 24]. For example, Feature Pyramid Networks (FPN) [17] is a representative work that detects objects of different scales at multiple levels of a feature pyramid. On the other hand, recent works also attempt to improve detectors from the perspective of receptive fields (RFs) [15, 23]. They enhance the scale-awareness of detectors by employing multi-branch transformations with different combinations of kernel sizes and/or dilation rates. The features of different RFs are then aggregated to enrich the information of different scales at each spatial location.

An object detector typically consists of a backbone network followed by detection-specific sub-networks (i.e., heads), which play an important role in object detection. These sub-networks compute the deep features used to directly predict the object category, location and size. Unlike two-stage detectors, where the sub-networks operate on fixed-size feature maps computed via RoI-pooling from the object proposals generated by a region proposal network [31], the sub-networks in one-stage detectors must be capable of ‘looking for’ objects of arbitrary sizes directly. This is even more challenging for an anchor-free detector: multi-scale anchor boxes can be regarded as a way to explicitly handle various sizes and shapes of objects, whereas an anchor-free detector predicts only a single object at each spatial location, without any prior information about the object size. Therefore, for one-stage detectors, especially anchor-free ones, the capability of the sub-networks to capture objects with large scale variation becomes the key. In this work, we aim to enhance the power of the sub-networks in one-stage detectors by searching for the optimal combination of RFs and convolutions in a learning-based manner.

Table 1. Comparison against other NAS methods for object detection on MS-COCO  [19]. Trans. indicates the number of transformation types in the search space (‘skip-connect’ is excluded). Counterpart denotes the baseline detectors (and backbone) for direct comparison. \(^\star \) means only the dilation rates are varied.

Neural Architecture Search (NAS) has gained increasing attention. It transforms the task of neural network design from a heuristics-guided process into an optimization problem. Recently, it has been shown that NAS can achieve prominent results on object detection [1, 7, 26, 39, 42, 43]. In most of these works, the operations in the search space are directly extended from those used for image classification [22, 48], with limited variation in dilation rates. Their search spaces with respect to transformations are therefore relatively limited, as listed in Table 1. Apart from the combination of RFs, we also investigate the importance of the diversity of transformations in the NAS search space for object detection. However, searching through such a large number of candidate transformations can be computationally expensive, especially for RL-based [27, 48] and EA-based [29] approaches. This problem is even more pronounced for object detection than for image classification, due to the more complicated pipelines with larger input images.

To this end, we propose a computation-friendly method, named Fast And Diverse (FAD), to search for the task-specific sub-networks in one-stage object detectors. FAD consists of a dedicated search space and an efficient search algorithm. We first design a rich set of diverse transformations tailored for object detection, covering multiple RFs and various convolution types. To learn the optimal combinations more efficiently, a search method via representation sharing (RepShare) is proposed accordingly. By sharing intermediate representations, the proposed RepShare significantly reduces the search time and memory cost of the architecture search. Furthermore, we propose an efficient method to reduce the interference between transformations sharing the same representations and, at the same time, alleviate the degradation of search quality caused by RepShare.

To demonstrate the effectiveness of the proposed method, we redesign the sub-networks for modern one-stage object detectors and propose a searchable module for replacement. The architecture search for the module is extremely efficient with FAD, being more than \(25\times \) faster than the fastest NAS approach for object detectors so far, while achieving a comparable AP improvement (see Table 1). With ResNeXt-101 [41] as the backbone, our FAD detector achieves 46.4 AP on the MS-COCO [19] test-dev set using a single model under single-scale testing, without any additional regularization or modules (e.g., deformable conv [3]). Moreover, we show that FAD can also benefit more challenging tasks, such as instance segmentation. The contributions of this work are summarized as follows:

  • We present a novel method, named Fast And Diverse (FAD), to search for meaningful transformations in the task-specific sub-networks for one-stage object detection. The search space is designed specifically for object detection, and we empirically investigate the importance of RF coverage and convolution types for object detection.

  • We propose an efficient search method with a novel representation sharing (RepShare) algorithm, which can significantly reduce the search cost in both time and memory usage, e.g.  being more than \(25\times \) faster than all previous methods. To ensure the search quality, a new method is introduced to decouple the transformation selection from the shared representations.

  • To evaluate our method, we design a searchable module for one-stage object detection and instance segmentation. Extensive experiments show that our FAD detector obtains consistent performance improvements on different detection frameworks with various backbones, while even using fewer parameters.

2 Related Work

2.1 Object Detection and Instance Segmentation

In general, object detectors can be categorized into two groups: two-stage detectors and one-stage detectors. Modern two-stage detectors [2, 31] first adopt a region proposal network (RPN) to generate a set of object proposals, which are then fed to the R-CNN heads for object classification and bounding box regression. On the other hand, one-stage object detectors [18, 24, 30] directly perform object classification and box regression simultaneously at each spatial location on the feature maps produced by a backbone network. Taking RetinaNet as an example, it consists of a backbone network with a feature pyramid network (FPN) [17] and two sub-networks for classification and bounding box regression. Recent works attempt to get rid of hand-designed anchor boxes while achieving comparable performance [4, 14, 38, 47]. For instance, FCOS [38] additionally predicts a centerness score, which indicates the distance of the current location to the center of the corresponding object, and can even outperform RetinaNet.

Receptive Fields (RF). The RF has proven to be very important for object detectors [15, 23]. For instance, Liu et al. [23] designed a combination of kernel sizes and dilation rates to simulate the impact of the eccentricities of population receptive fields in the human visual cortex. TridentNet [15] tackles the scale variation using multi-branch modules with different dilation rates. In this work, we aim to search for an optimal combination of different conv layers and dilation rates jointly.

Instance Segmentation. Instance segmentation is closely related to object detection, and the dominant instance segmentation methods often have two stages  [10, 13]: they first detect the objects in an image, and then predict an object mask on each detected region. Mask R-CNN  [10] is a representative work in this paradigm, which has an additional mask head on top of Faster R-CNN  [31] to perform mask prediction on each object proposal. In this work, we apply the proposed FAD search method to instance segmentation, which has not been explored previously.

2.2 Neural Architecture Search

Recent attention has shifted from hand-crafted network design to neural architecture search (NAS) [21, 22, 25, 27, 48]. One stream of efficient NAS methods is differentiable NAS [22, 25]. In particular, DARTS [22] significantly increases the search efficiency by relaxing the categorical choice of operation to be continuous, so that the architecture can be optimized by gradient descent. In this work, we develop an efficient NAS algorithm that quickly searches for optimal transformations in object detectors.

Fig. 1. Search space of FAD for one-stage object detectors. The backbone and FPN [17] in the detector remain the same, while each FPN level is connected to a searchable module. The module consists of two groups of cells, with the same cell architecture within each group. In a cell, each edge connecting two nodes consists of two standard \(1\times 1\) conv layers with a transformation block in between. The cell structures and the transformations are to be searched. Each edge may have a different RF, resulting in combinations of RFs at each node which enrich the features for capturing information of various scales.

NAS for Object Detection. NAS has been applied to many vision tasks apart from image classification, such as object detection [1, 7, 26, 42]. For example, NAS-FPN [7] uses an RL-based NAS to search for an optimal FPN [17] on top of RetinaNet. DetNAS [1] aims at finding the optimal shuffle-block-based backbone network for object detectors using an evolutionary algorithm [8, 28]. A channel-level NAS is proposed in NATS [26] to search for the backbone of object detectors. Alternatively, some recent works search for the detection-specific parts rather than the backbone. For instance, Auto-FPN [42] searches for an FPN structure and head structures. SM-NAS [43] also searches for two-stage detectors by first conducting a structural-level search and then a modular-level search. Instead of exploring novel structures, CR-NAS [16] aims to re-allocate the computational resources in the backbone. NAS-FCOS [39] is an FCOS-based detector in which the structure of the FPN and the following sub-networks is searched with RL-based NAS. In this work, we design a dedicated search space and propose the FAD method to search for the sub-networks in one-stage detectors.

3 Fast Diverse-Transformation Search

3.1 Search Space of FAD

One-stage detectors like RetinaNet [18] and FCOS [38] consist of a backbone network with FPN [17] and two parallel sub-networks for object classification and bounding box regression, respectively. In this section, we design a searchable module to replace the commonly-used sub-networks. This module is searched by the proposed FAD and can be adapted, in a plug-and-play fashion, to one-stage object detectors that follow a structure similar to RetinaNet [18]. We then describe the novel search space of FAD, which is tailored for object detection and includes a variety of diverse transformations with different RFs.

Object Detector with FAD. As shown in Fig. 1, the proposed searchable module is comprised of two groups of cells, which are connected sequentially with a shortcut from the input of the module to that of the second group. The module outputs both object classification and bounding box prediction. The architectures and parameters are shared across different FPN levels.

Classification and Regression. In FAD, bounding box prediction is performed on the output of the first group, while classification is computed from the output of the second group. The intuition behind this design is that the two tasks should not operate on exactly the same feature maps, as they have different objectives: bounding box regression needs to focus on local detailed information, while object classification benefits from features with more semantic information (i.e., feature maps from deeper layers). We therefore attach the regression branch to the shallower first group and the classification branch to the deeper second group.
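To make the macro-structure concrete, below is a minimal PyTorch sketch of the module described above; the cell constructor, channel sizes and depth are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class FADModule(nn.Module):
    """Two groups of searched cells with a shortcut from the module input
    to the input of the second group (cf. Fig. 1)."""

    def __init__(self, cell_fn, channels, cells_per_group=2):
        super().__init__()
        # Within each group, the M repeated cells share one searched structure.
        self.group1 = nn.Sequential(*[cell_fn(channels) for _ in range(cells_per_group)])
        self.group2 = nn.Sequential(*[cell_fn(channels) for _ in range(cells_per_group)])

    def forward(self, x):
        reg_feat = self.group1(x)             # features for box regression
        cls_feat = self.group2(reg_feat + x)  # shortcut from the module input
        return cls_feat, reg_feat


# e.g., with a trivial placeholder cell; the same module (weights included)
# is applied to every FPN level:
module = FADModule(lambda c: nn.Conv2d(c, c, 3, padding=1), channels=256)
cls_feat, reg_feat = module(torch.randn(1, 256, 32, 32))
```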

Design of Search Space. In the following, we describe the design of the search space for FAD, which is inspired by the insights from modern neural architectures  [11, 37] and object detectors  [15, 23]. Three important considerations in our design are the coverage of RFs, the diversity in convolution types and the computational efficiency.

Groups and Cells. A group contains M repeated cells, and each cell is defined as a module that contains multiple nodes and edges. Similar to [20, 22], each cell is formulated as a directed acyclic graph of nodes. Each node is a stack of feature maps and each edge is an atomic block for the search. In this work, we empirically set the number of nodes in each cell to 3, excluding the input and output nodes. In our design, an edge consists of two \(1\times 1\) conv layers f and a transformation block T between them (Fig. 1 bottom-right). The transformation block contains a set of candidate transformations, which will be described in Sect. 3.2. Each conv layer in the transformation is followed by a group-normalization layer [40] and a ReLU. Given a node \(x_j\), its predecessors \(x_i\), and the edges pointing from each \(x_i\) to \(x_j\), we have:

$$\begin{aligned} x_j = \sum _{i=1}^{N} f^{c',c}_{i,j}\left( T^{c',c'}_{i,j}\left( f^{c,c'}_{i,j}(x_i)\right) \right) , \end{aligned}$$
(1)

where \(f^{c,c'}_{i,j}\) and \(f^{c',c}_{i,j}\) are the two \(1\times 1\) conv layers, with one transforming the input channel size c to the channel size \(c'\) used in the transformation block \(T^{c',c'}_{i,j}\), and the other vice versa. \(x_j\) is computed from its N predecessors in total. The two \(1\times 1\) convolutions enable flexibility in the channel size within T, similar to the inception module [37], while maintaining the same channel size for all the nodes. We empirically found that maintaining a relatively large channel size for the nodes is beneficial to the performance. The representations of the intermediate nodes in a cell are concatenated and passed to a \(1\times 1\) conv layer to reduce the number of channels back to c. This additional conv layer ensures a consistent channel size between the input and output of each cell. Furthermore, having two groups of cells enables greater flexibility for the architecture search, i.e., a larger search space. Within each group, the cells share the same structure. Therefore, once the search is completed, the cells in each group can be repeated multiple times, offering great scalability in architecture depth.
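As a concrete illustration of Eq. (1), below is a minimal PyTorch sketch of a single edge; the transformation T is supplied by the search, so here it is simply a constructor argument, and all names are hypothetical.

```python
import torch.nn as nn


class Edge(nn.Module):
    """One edge of a cell: 1x1 conv (c -> c'), transformation block at
    width c', then 1x1 conv (c' -> c), following Eq. (1)."""

    def __init__(self, c, c_prime, transform):
        super().__init__()
        self.reduce = nn.Conv2d(c, c_prime, kernel_size=1)   # f^{c,c'}
        self.transform = transform                           # T^{c',c'}, searched
        self.expand = nn.Conv2d(c_prime, c, kernel_size=1)   # f^{c',c}

    def forward(self, x):
        return self.expand(self.transform(self.reduce(x)))


# A node x_j then sums the outputs of the edges from all its predecessors:
#   x_j = sum_i Edge_{i,j}(x_i)
```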

Diverse Transformations. Our initial design of the candidate transformations covers 4 different sizes of RFs (Fig. 2 bottom-left). In particular, for the transformations responsible for an RF larger than 5, we use more efficient operations: a base filter followed by a dilated convolution, which spreads out the base filter to reach a larger RF. Moreover, the dilated conv layers are depthwise separable [12, 32] to keep the computation efficient. The memory-efficient design introduced in Sect. 3.2 allows us to include more types of convolutions. Hence we have two streams of transformations: the standard conv and the depthwise separable conv. Namely, for the 6 transformations shown in the bottom-left corner of Fig. 2, the ‘conv’ layers can all be either standard or depthwise separable convolutions.

No pooling layers are involved in the search space, as we empirically found them unhelpful in our scenario. This is probably because the spatial resolution of the feature maps remains the same in the sub-networks. Moreover, skip-connection is not included in the transformations. Lastly, a ‘none’ path, indicating the importance of the input edges with respect to each node, is added to the transformation block. In summary, the proposed transformation block contains 13 distinct transformations in total, including 2 types of conv layers and 3 dilation rates, and covering 4 sizes of RFs, as illustrated in Fig. 2. We thereby build a meaningful search space with strongly diverse transformations. The resultant search space has roughly \(2.3\times 10^{13}\) unique paths in total, with one cell per group at search time.
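For illustration, the sketch below enumerates one plausible reading of this transformation set in PyTorch (2 conv streams × 6 transformations + ‘none’ = 13): RFs of 3 and 5 from stacked \(3\times 3\) convs, and RFs of 7 and 9 from a base filter followed by a depthwise separable dilated conv. The exact layer orderings and the GroupNorm group count are assumptions based on Fig. 2, not the released search space.

```python
import torch.nn as nn


def conv3x3(c, dilation=1, separable=False):
    """3x3 conv (optionally depthwise separable and dilated) + GN + ReLU."""
    if separable:
        conv = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=dilation, dilation=dilation, groups=c),
            nn.Conv2d(c, c, 1))
    else:
        conv = nn.Conv2d(c, c, 3, padding=dilation, dilation=dilation)
    return nn.Sequential(conv, nn.GroupNorm(32, c), nn.ReLU(inplace=True))


def candidate_transformations(c):
    ops = {}
    for sep in (False, True):                 # two streams: std and sep conv
        s = 'sep' if sep else 'std'
        base = lambda: conv3x3(c, separable=sep)
        ops[f'{s}_rf3'] = base()                                               # t1, RF 3
        ops[f'{s}_rf5'] = nn.Sequential(base(), base())                        # t2, RF 5
        ops[f'{s}_rf7a'] = nn.Sequential(base(), base(), base())               # t3, RF 7
        ops[f'{s}_rf7b'] = nn.Sequential(base(), conv3x3(c, 2, True))          # t4, RF 7
        ops[f'{s}_rf9a'] = nn.Sequential(base(), base(), conv3x3(c, 2, True))  # t5, RF 9
        ops[f'{s}_rf9b'] = nn.Sequential(base(), conv3x3(c, 3, True))          # t6, RF 9
    return ops  # plus a 'none' (zero) path handled outside this dict
```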

FAD for Instance Segmentation. We expect that the mask prediction task can also benefit from the combination of RFs and diverse transformations. With minimal modification, FAD is readily applicable to general instance segmentation frameworks, e.g., Mask R-CNN [10] and Mask Scoring R-CNN [13]. Specifically, we replace the conv layers before the deconvolutional layer in the mask head with the proposed searchable module and search for its architecture in an end-to-end fashion. The search space follows that designed for object detectors.
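A possible form of this swap is sketched below, assuming a Mask R-CNN-style mask head in which a stack of convs precedes a deconv layer and a \(1\times 1\) mask predictor; all names are hypothetical, and a searched group of cells stands in for the replaced conv stack.

```python
import torch.nn as nn
import torch.nn.functional as F


class FADMaskHead(nn.Module):
    def __init__(self, fad_group, channels, num_classes):
        super().__init__()
        self.features = fad_group  # searched cells replacing the conv stack
        self.deconv = nn.ConvTranspose2d(channels, channels, 2, stride=2)
        self.predictor = nn.Conv2d(channels, num_classes, 1)

    def forward(self, x):
        x = F.relu(self.deconv(self.features(x)))
        return self.predictor(x)   # per-class mask logits
```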

3.2 Fast Search with Representation Sharing

In this section, we propose a novel algorithm to significantly reduce the search cost in both time and memory, followed by a description of the search procedure.

Representation Sharing. The proposed acceleration method for architecture search, named RepShare, is performed in two steps: filter decomposition and intermediate representation sharing. We elaborate on these two steps below.

Fig. 2. Transformations and representation sharing. Left: comparison between the transformations used for image classification and those proposed for object detection in our search space. The proposed transformations are listed at the bottom; conv can be either a standard or a depthwise separable convolution. Right: RepShare. Each sphere and solid line denotes a representation and a conv layer, respectively. First, large filters are decomposed into stacks of \(3\times 3\) filters. Second, \(p_1\) and \(p_2\) are shared across transformations. Note that the \(1\times 1\) conv layers are omitted for simplicity.

Decomposing Large Filters. As proposed in [33], filters with large kernel sizes can be replaced by multiple \(3\times 3\) filters. For example, a stack of three \(3\times 3\) filters has an equivalent receptive field to a \(7\times 7\) filter. The stacked filters have the advantages of fewer parameters and more non-linearities in between for learning more discriminative representations. Following this intuition, we decompose the filters with large kernel sizes and construct a transformation block containing only filters of size \(3\times 3\) (\(t_1\) to \(t_6\) shown in Fig. 2 top-right). However, replacing large filters with stacks of small ones significantly increases the memory overhead during the search. Taking the proposed transformations as an example, more than twice as many intermediate representations are generated after the decomposition.
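The equivalence follows from the usual receptive-field recurrence for stride-1 convolutions, where each layer with kernel size k and dilation d adds \((k-1)\,d\) to the RF; a quick numeric check:

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 convs, given (kernel, dilation) pairs."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf


assert receptive_field([(3, 1)] * 3) == receptive_field([(7, 1)])  # 7 == 7
assert receptive_field([(3, 1), (3, 2)]) == 7   # base 3x3 + dilated (d=2) 3x3
assert receptive_field([(3, 1), (3, 3)]) == 9   # base 3x3 + dilated (d=3) 3x3
```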

Representation Sharing. To reduce this memory overhead, we further propose a novel approach: for each receptive field (RF) level, all the intermediate representations that are not directly connected to node \(x_j\) are shared (Fig. 2 bottom-right). To be specific, we denote \(t_3\) in the top-right of Fig. 2 as the stem. In the stem, there are 3 intermediate representations with different sizes of RFs with respect to the node \(x_i\). We merge the transformations by sharing the intermediate representations in the stem. For example, in Fig. 2 (top-right), to merge \(t_1\) into the stem, we directly connect the first intermediate representation in the stem to node \(x_j\), so that this new transformation replaces the original \(t_1\) (conv \(3\times 3\)). Concretely, RepShare reduces the number of representations computed in each transformation block from 26 to 12, which significantly speeds up the search process. Moreover, the search speed is further boosted by the memory efficiency of RepShare, since the search can be done on a single GPU, avoiding the overhead introduced by training with multiple GPUs (e.g., parameter updates across GPUs).
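The sketch below illustrates this sharing under our reading of Fig. 2 (an assumption, not the released code): three stacked \(3\times 3\) convs form the stem and expose \(p_1\), \(p_2\), \(p_3\), while dilated branches off \(p_1\) and \(p_2\) reach RFs 7 and 9, so all six transformation outputs of one stream are read from a single pass. The conv3x3 helper from the earlier sketch can serve as the constructors.

```python
import torch.nn as nn


class SharedStem(nn.Module):
    """Shared computation for one conv stream of the transformation block."""

    def __init__(self, conv3x3_fn, dil_conv_fn):
        super().__init__()
        self.s1, self.s2, self.s3 = conv3x3_fn(), conv3x3_fn(), conv3x3_fn()
        self.b7 = dil_conv_fn(dilation=2)   # from p1: RF 3 + 4 = 7
        self.b9a = dil_conv_fn(dilation=2)  # from p2: RF 5 + 4 = 9
        self.b9b = dil_conv_fn(dilation=3)  # from p1: RF 3 + 6 = 9

    def forward(self, x):
        p1 = self.s1(x)    # shared by all six transformations (RF 3)
        p2 = self.s2(p1)   # shared by t2, t3 and t5 (RF 5)
        p3 = self.s3(p2)   # stem output (RF 7)
        # Outputs of t1..t6; the search selects among these candidates.
        return [p1, p2, p3, self.b7(p1), self.b9a(p2), self.b9b(p1)]
```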

Relation to Other Efficient Search Methods. The proposed RepShare shares a similar spirit with some recent approaches. For instance, parameter sharing introduced in [27] takes advantage of sharing the same sets of parameters among child models to greatly speed up the search in RL-based NAS methods. It is inspired by parameter inheritance [29], which also reuses the same parameters for child models across mutations to avoid training from scratch. RepShare goes beyond reusing the same parameters: it also reuses the same computation. Furthermore, apart from accelerating the search, RepShare reduces the memory consumption. Single-path NAS [36] also shares computation, but differs from our approach: it considers a small kernel (e.g., \(3\times 3\)) as the core of a large one (e.g., \(5\times 5\)) and uses a learnable threshold to compare the importance of the two kernels and select the better one.

Decoupling Shared Representations. Similar to the parameter sharing described in [27], where child models are coupled to some extent due to reusing the same weights, RepShare also exhibits such behaviour. In RepShare, transformations sharing the same representations might interfere with each other, so the parameters directly corresponding to the shared representations are not well optimized during the search. As a result, those transformations struggle to stand out during architecture derivation. For example, in Fig. 2 (bottom-right), two intermediate representations are shared: \(p_1\) is shared across all six transformations and \(p_2\) is shared across \(t_2\), \(t_3\) and \(t_5\). Due to this coupling effect (i.e., interference between transformations), \(t_1\) and \(t_2\) are not able to learn optimal parameters on their own, which may degrade the search quality. Notably, this effect mainly affects \(t_1\) and \(t_2\), since their outputs are exactly the shared representations; the other transformations (\(t_3\) to \(t_6\)) can compensate for it thanks to their additional operations on the shared representations.

Decoupling with Extra Functions. To address this issue in RepShare, we propose a simple yet effective method to decouple the transformations that directly depend on the shared representations (i.e., \(t_1\) and \(t_2\)) from those representations. Namely, an additional function H is applied between each shared representation and its corresponding transformation output. With this additional function, the output of \(t_1\), for example, is no longer \(p_1\) but \(H(p_1)\). In this way, \(t_1\) and \(t_2\) are decoupled from \(p_1\) and \(p_2\), respectively. For the choice of H, we use a standard \(1\times 1\) conv layer followed by a ReLU. This lightweight extra function introduces minimal computational overhead and is applied to both conv streams (i.e., the standard and depthwise separable convolution streams).
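In the SharedStem sketch above, this amounts to returning \(H_1(p_1)\) and \(H_2(p_2)\) in place of \(p_1\) and \(p_2\); a minimal sketch of H:

```python
import torch.nn as nn


def decoupling_fn(c):
    """H: a 1x1 conv + ReLU inserted between a shared representation and
    the transformation output that would otherwise coincide with it."""
    return nn.Sequential(nn.Conv2d(c, c, kernel_size=1), nn.ReLU(inplace=True))


# In SharedStem.forward, the first two outputs become decoupled:
#   return [self.h1(p1), self.h2(p2), p3, self.b7(p1), self.b9a(p2), self.b9b(p1)]
```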

Continuous Relaxation. In a cell, each edge contains a transformation block whose final transformation is determined from the set of candidates illustrated in Fig. 2. In order to search via back-propagation, we follow the continuous relaxation of the search space in [22] and adapt it to the proposed RepShare paradigm. For each of the two streams (Fig. 2 bottom-right) in the transformation block, the output of a transformation (\(T_{i,j}\)) is essentially the sum of all the intermediate representations weighted by the corresponding \(\alpha \). Therefore, we have:

$$\begin{aligned} T_{i,j}(x_{i}^{\prime })=\sum _{p \in P} \frac{\exp \left( \alpha ^p_{i,j}\right) }{\sum _{p^{\prime } \in P} \exp \left( \alpha ^{p^{\prime }}_{i, j}\right) } \; p, \end{aligned}$$
(2)

where \(x_i^{\prime }\) is the output of the first \(1\times 1\) conv layer in the transformation block, p and \(p^{\prime }\) range over the set of all intermediate representations P, and \(\alpha ^p\) is the architecture parameter corresponding to p.
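A minimal sketch of Eq. (2) in PyTorch, assuming the intermediate representations have already been computed (e.g., by the SharedStem above):

```python
import torch.nn.functional as F


def mixed_transform(reps, alpha):
    """Softmax(alpha)-weighted sum of intermediate representations (Eq. 2).

    reps:  list of tensors p, all of shape (B, c', H, W)
    alpha: tensor of shape (len(reps),), the architecture parameters
    """
    weights = F.softmax(alpha, dim=0)
    return sum(w * p for w, p in zip(weights, reps))
```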

Optimization and Derivation of Discrete Architectures. During the architecture search, \(\alpha \) and the network weights w are jointly optimized in a bilevel optimization scheme, as in [20, 22]. In particular, the first-order approximation is adopted. At the end of the search, a discrete architecture is decoded by retaining one transformation per edge and two input edges per node, based on the largest \(\alpha \) in each transformation block. Since intermediate representations rather than operations are selected, they are then mapped back to the corresponding actual transformations in the derived architecture, i.e., the transformations in Fig. 2 (top-right).
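The sketch below illustrates both steps under the first-order approximation; the alternating update and the edge/node bookkeeping are illustrative assumptions about one common way to implement this, in the style of DARTS [22].

```python
# First-order bilevel optimization: alternate between updating the network
# weights w on the training split and the architecture parameters alpha on
# the validation split (w_opt and a_opt are separate optimizers; loss_fn is
# a hypothetical detection loss).
def search_step(model, w_opt, a_opt, train_batch, val_batch, loss_fn):
    w_opt.zero_grad(); loss_fn(model, train_batch).backward(); w_opt.step()
    a_opt.zero_grad(); loss_fn(model, val_batch).backward(); a_opt.step()


# Derivation: per transformation block keep the representation with the
# largest alpha, then keep the two strongest input edges per node. The
# chosen representation index is later mapped back to its actual
# transformation (e.g., p2 -> two stacked 3x3 convs).
def derive(alphas, top_edges=2):
    # alphas: {(i, j): tensor of per-representation alphas for edge i -> j}
    best = {e: (a.max().item(), a.argmax().item()) for e, a in alphas.items()}
    chosen = {}
    for j in {e[1] for e in best}:
        incoming = sorted(((s, e) for e, (s, _) in best.items() if e[1] == j),
                          reverse=True)
        for _, e in incoming[:top_edges]:
            chosen[e] = best[e][1]  # index of the retained transformation
    return chosen
```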

4 Experiments

In this section, the proposed FAD is evaluated on two tasks: object detection and instance segmentation. In the Supplementary Material (SM), we further conduct experiments on image classification to analyze the effect of decoupling in RepShare.

4.1 Object Detection

Implementation Details. Although the proposed module can be adapted to different one-stage object detectors, we perform the architecture search using FAD on FCOS [38] due to its efficiency. The search is conducted on PASCAL VOC [5]. We also perform the search directly on MS-COCO [19] and compare the results in Table 2. More implementation details, including the search and the detector training, can be found in the SM.

Ablation Study. We conduct ablation studies on the search cost, the search spaces, as well as different backbones and detectors. Further studies on the macro-structure of the module and on network width and depth are presented in the SM.

Search Cost. A complete architecture search with FAD takes 0.6 GPU-days on a single TITAN XP. Table 1 compares the search cost of FAD against other NAS-based methods for object detection. FAD's search is at least \(25\times \) faster than the other recent approaches, while achieving a similar relative AP improvement on MS-COCO. Meanwhile, the architecture found by FAD is scalable in depth by simply adding repeated cells to the groups, which provides greater flexibility to the module.

Table 2. Comparison for the architecture search. Memory and bs denote the memory usage and images per GPU, respectively. Both Subset and Full refer to the proposed search space. Sep. and Std. mean that only depthwise separable and standard convs are used, respectively. ResNet-50 is used as the backbone. Results are obtained on the MS-COCO minival split. All the searches are performed on VOC, except for \(^{\dagger }\), which is on MS-COCO.

Search Space. To demonstrate the superiority of the proposed search space, we reuse the same search procedure but replace the proposed search operations with those in DARTS [22], listed in Fig. 2 (top-left). Note that the depthwise separable convolution is applied twice in DARTS, and hence the RFs change accordingly. As shown in Table 2, the operations used in DARTS bring only a marginal improvement of 0.4 AP over the original FCOS, while the proposed transformations improve the performance significantly, from 38.6 to 40.3. To further study the importance of the full transformation set, we search using two transformation subsets, containing the transformations with RFs smaller than 7 and 9, respectively. Our results show that with fewer transformations in the search space, the performance degrades accordingly. Moreover, we search using only one type of convolution (either the standard or the depthwise separable) for the conv layers with a dilation rate of 1. Not surprisingly, both fail to achieve performance comparable to the full search space. This illustrates the power of the proposed transformations, which fully benefit from the better combinations of RFs and convolution types. Besides, the performance slightly degrades without decoupling. More results on decoupling can be found in the SM. Another observation is that the proxyless search on MS-COCO achieves similar detection performance but takes a much longer search time. Hence, we use the architecture searched on VOC for the rest of this work.

In addition, our FAD is compared with a ‘random’ baseline, in which a transformation is randomly sampled for each block and two edges are randomly sampled for each node. The proposed FAD indeed finds much better architectures. The last conclusion to draw from Table 2 is that, compared to the search without RepShare, RepShare enables an almost \(4\times \) faster search with only one third of the GPU memory usage, without harming the performance.

Adaptation to Different Backbone Networks. We replace the ResNet-50 in the detector with three different networks: MobileNetV2 [12], ResNet-101 [11] and ResNeXt-101 [41]. As shown in Table 3, our FAD obtains a consistent improvement (about 1.4 AP on average) for all the backbones compared, with even fewer parameters and FLOPs. This indicates that the architecture of FAD generalizes well to backbone networks of different capacities. A direct comparison of the sub-networks (without the backbone and FPN) shows decreases of \(16.3\%\) in the number of parameters and \(15.2\%\) in FLOPs. Hence, we conclude that the performance gain comes from the better searched architecture rather than from network capacity itself.

Table 3. FAD on different detectors and backbones. The \(\rightarrow \) indicates the change from the original detector to FAD. Dim. is the channel size in the subnets, i.e., \(c^{\prime }\) in the transformation block of FAD. Results are obtained on MS-COCO minival.

Transferability. Our FAD is expected to be readily applicable to different types of one-stage object detectors with the two-subnet structure. To examine this property, we further plug the proposed searchable module into RetinaNet [18]. Table 3 reveals that FAD also improves the performance of RetinaNet by a large margin, even with fewer parameters. Therefore, the searched sub-networks can boost the performance of different types of detectors (and potentially more powerful detectors in the future) in a plug-and-play fashion.

Table 4. Comparison with the state-of-the-art object detectors on the MS-COCO test-dev split (including concurrent works [9, 39, 44, 46]). FCOS [38] is used as the base detector for FAD. All the results are tested under the single-scale and single-model setting. Note that models using an additional regularization method [6] or deformable convolution [3] are excluded from the table (except for NAS-FCOS [39]).

Comparison with the State-of-the-Art. We compare FAD with the state-of-the-art object detectors on the MS-COCO test-dev split, including some recent NAS-based object detectors. All the methods are evaluated under the single-model and single-scale setting. Table 4 shows that, with 128 channels in the first group and 256 in the second (98.3M parameters), FAD @128-256 achieves 46.4 AP, which surpasses all the recent object detectors, including two concurrent works, NAS-FCOS [39] and Hit-Detector [9]. Note that NAS-FCOS includes the deformable convolution [3] in its search space, which is not considered in other NAS-based detectors (including our FAD) and is well known for giving large AP improvements. Moreover, the search of FAD is almost \(50\times \) faster than that of NAS-FCOS on the same dataset (i.e., VOC).

Fig. 3. Architectures searched for object detection. The left and right cells are for the first and second group, respectively. std, sep and dil denote the standard, depthwise separable and dilated conv, respectively.

Searched Architectures. The architectures derived by FAD are presented in Fig. 3. We make two interesting observations. First, the edges correspond to a mixture of RFs (especially in the cell group for classification) and convolution types, which again validates our motivation. Second, the transformations with large RFs (i.e., 7 and 9) appear near the input node, while those with small RFs (i.e., 3 and 5) are closer to the output node. This is consistent with the DetNAS architecture found in [1].

4.2 Instance Segmentation

To showcase the generality of the proposed FAD, we apply it to another useful task: instance segmentation. Different from object detection, only one group of cells is searched in the mask head. The search is conducted on MS-COCO and takes 2.6 GPU-days. For a fair comparison, we follow [10, 13] exactly when training the searched networks. The search and training details are described in the SM.

Table 5. Comparison on instance segmentation mask AP on the MS-COCO minival split. P. is for parameters (M) and F. is for FLOPs (G).

Results. Table 5 shows that, with a similar number of parameters and FLOPs, all FAD variants outperform their counterparts with the same backbones on both Mask R-CNN and MS R-CNN. Notably, Mask FAD has relatively larger improvements in \(\mathrm {AP_{M}}\) and \(\mathrm {AP_{L}}\) (e.g., 1.6 and 1.8 AP on ResNet-50) than in \(\mathrm {AP_{S}}\) (0.6 AP), possibly due to the transformations with larger RFs. Another notable result is that Mask FAD (ResNet-50) achieves an AP similar to MS R-CNN (ResNet-50), i.e., 35.5 vs. 35.6, despite a simpler pipeline and \(26.9\%\) fewer parameters. The improvements are prominent given that we only modify the mask head architecture, which accounts for just 2.25M parameters, i.e., 2.8% to 5% of the whole network.

5 Conclusion

In this work, we propose FAD to efficiently search for better sub-networks with diverse transformations and optimal combinations of RFs for one-stage object detection and instance segmentation. To demonstrate the effectiveness of the proposed search space and search method, we design a searchable module for the two tasks at hand (and potentially other tasks). Extensive experiments show that the architectures searched by FAD consistently outperform their counterparts on different detectors and segmentation networks.