1 Introduction

Deep convolutional neural networks (CNNs) have attracted a broad range of research interests in the computer vision field and have achieved remarkable progress in various visual recognition tasks, including image classification [20, 31, 35, 55], object detection [51, 52], semantic segmentation [5], instance segmentation [17, 28], and human keypoint detection [17, 70]. Standard convolution layers, which contain collections of filters that express neighborhood spatial feature connectivity along input channels through a linear transformation, play a central role in CNNs together with non-linear activation functions. Traditional CNNs serve as robust feature extractors by stacking convolution layers followed by activation layers; e.g., VGGNets [55] construct deep CNNs from modularly designed 3x3 convolution layers with non-linear activation functions that are capable of capturing global context information. In order to further tap the potential of deeper architectures, modern CNNs introduce skip connections [20] and variants [9, 21, 27] to alleviate the vanishing gradient problem.

In addition to deeper CNNs, another category of approaches [3, 22, 25, 26, 64, 68] focuses on enriching feature representations according to long-range context dependencies learned by extra parameters, which present considerable potential for practical applications. Some methods [25, 26, 36] incorporate attention mechanisms with convolution and have boosted the performance of downstream tasks. One of the representative methods is Squeeze-and-Excitation Networks (SENet) [26], which, combined with various network architectures, brings consistent performance gains in a wide range of vision tasks at the cost of additional parameters. Unlike the approaches mentioned above that suffer from a heavier computational burden, this paper mainly focuses on the following question: Is it possible to tap the potential of attention-enhanced CNNs while easing the computational burden of attentive modules?

To address this issue, we first revisit the gating mechanism in SENet [26] and several variants [25, 42, 63, 68]. The SE block is a micro encoder-decoder architecture applied at the module level, which first aggregates long-range spatial dependencies by non-parametric global average pooling. It then encodes non-linear latent channel relationships by a cascaded fully-connected (FC) layer and ReLU [48] activation function. The decoder part models the saliency of the channel information flow using another FC layer followed by the sigmoid function. Although SENet improves the performance of CNNs, it inevitably adds complexity compared with the original models. In addition, empirical studies indicate that the dimensionality reduction of the SE block is unnecessary and inefficient due to its side effect on cross-channel information flow [63] and the increased memory access cost (MAC) [47]. GENet [25] further explores parametric sampling kernels with various fields of view, which achieve better performance than SENet at the expense of increased computational budgets for spatial conditioning. CBAM [68] enhances feature representations using a dual-attention mechanism that consists of max-pooling-enhanced channel attention and spatial attention captured by extra convolutional kernels. SCNet [42] proposes a conditional calibration-based parallel, heterogeneous, dual-path architecture to enlarge receptive fields and complement informative features, balancing complexity and performance. ECANet [63] presents a locality-prior-driven design that overcomes the dimension reduction defect of the SE block and reduces the extra computational budget of attentive modules with 1D convolutions.
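
For illustration, the SE gating pathway revisited above can be sketched in PyTorch as follows; the class name and the reduction ratio r = 16 are illustrative choices rather than the exact reference implementation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Illustrative squeeze-and-excitation block (channel gating only)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                     # squeeze: global average pooling
        self.fc1 = nn.Linear(channels, channels // reduction)   # encoder FC (dimensionality reduction)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(channels // reduction, channels)   # decoder FC
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.pool(x).view(b, c)                              # channel-wise statistics
        w = self.sigmoid(self.fc2(self.relu(self.fc1(s))))       # channel saliency in (0, 1)
        return x * w.view(b, c, 1, 1)                            # recalibrate the input features

if __name__ == "__main__":
    y = SEBlock(64)(torch.randn(2, 64, 56, 56))
    print(y.shape)  # torch.Size([2, 64, 56, 56])
```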

In order to further explore the potential of lightweight attentive architectures, we present the scaled gated convolution, an efficient approach to strengthen the feature representations of vanilla convolutional transformations and reduce the redundancy of existing attentive modules in a plug-and-play manner. The proposed scaled gated convolution consists of a triplet of operators: scaling, gating, and fusion. Specifically, the scaling operator re-scales the feature and kernel spaces into multiple portions for successive heterogeneous feature transformations. The gating operator aggregates global feature context to enlarge the receptive field and leverages cross-channel information flow to generate self-adaptive attentive gating representations. The fusion operator adaptively aggregates features across multiple heterogeneous feature spaces for the final semantic fusion.

As an enhanced version of the standard convolution, our scaled gated convolution offers three advantages. First of all, it strengthens cross-channel information flow by adaptively encoding the informative long-range context features of multiple heterogeneous feature spaces, which enlarges the receptive field and suppresses noisy signals compared to standard convolutions. Furthermore, the design of CNNs typically requires a wide selection of complicated hyperparameters and configurations. By contrast, our scaled gated convolutions can be directly deployed in existing state-of-the-art architectures by replacing the original vanilla or attentive counterparts, and the performance can be effectively boosted. Besides, the scaled gated convolutions are computationally lightweight and introduce less redundancy and a lower computational burden than existing attentive counterparts.

To verify the effectiveness and provide evidence for these claims, we develop a series of SGNets by plugging scaled gated convolutions into baselines and conduct a comprehensive evaluation on large-scale datasets. We first evaluate the proposed approach on the large-scale ImageNet [31] dataset using ResNet variants [20, 26, 72] and obtain significant improvements with comparable model complexity. We also present results on downstream tasks, including object detection, instance segmentation, keypoint detection, and panoptic segmentation, to verify the ability of our models to generalize to various typical downstream vision applications. Exhaustive experiments show that, by using SGNets, baseline results can be effectively improved for all these tasks at a comparable or lower computational budget, which indicates the efficiency of our approach.

2 Related work

2.1 Modern architecture design

Remarkable progress has been achieved in the field of network architecture design in recent years. AlexNet [35] laid the foundation for designing modern convolutional neural networks, which now dominate the image recognition field. VGGNets [55] introduce modular design and the receptive field equality principle of convolutions, and construct deeper networks with fewer parameters than AlexNet [35]. NIN [38] reduces overfitting by non-parametric global average pooling (GAP). Highway networks [16], ResNet [20, 21], and DenseNet [27] alleviate vanishing gradient problems by various skip connections and help deep networks converge. DPN [9] combines residual connections and dense connections to learn robust feature representations. GoogLeNet [58] and the Inception series [57, 59] enhance feature representations by stacking hand-engineered inception blocks, which introduce heterogeneous multi-path convolutions. ResNeXt [72] further simplifies multi-path networks by homogeneous group convolutions. WideResNet [75] strengthens shallow networks by adjusting the width of models. XOR operations are utilized for efficient deep hashing [45]. ShuffleNet [47, 77] enhances the feature representations of lightweight models with channel shuffle. EfficientNet [60, 61] scales width, depth, and resolution with NAS and achieves remarkable performance gains.

2.2 Attention and gating mechanisms

In addition to plain architectures, the design of effective attention and gating mechanisms has also attracted increasing research interest. Attention and gating mechanisms can be interpreted as self-adaptive, content-aware computational resource reallocation mechanisms based on informative components, demonstrating their utility across various tasks. SENet [26] first adopts Squeeze-and-Excitation blocks along the channel dimension. Beyond channels, GENet [25] leverages extra 2D convolutions to generate spatial region-aware attention weights. SKNet [36] further extends attention and gating mechanisms to kernels, which adjusts the receptive field of convolutions dynamically. CBAM [68] combines spatial and channel attention mechanisms to recalibrate feature representations. SCNet [42] proposes an hourglass-style calibration-based operator to enhance the standard convolution. NLNet [64] models long-range dependency using self-attention. GCNet [3] further simplifies the NL block and proposes a lightweight GC block compatible with the SE block. DANet [12] aggregates dual-path heterogeneous attention to capture large-scale feature dependency. Fuzzy attention [44] is utilized to extract robust features. CCNet [29] proposes a lightweight recurrent criss-cross attention block to reduce the computational budget for high-resolution scene parsing. ECANet [63] generates attentive weights based on a locality prior to overcome the paradox of the performance and complexity trade-off.

2.3 Dynamic neural networks

Different from static neural networks, which recognize visual objects by utilizing static, content-independent filters, dynamic neural networks construct sample-aware architectures using parametric components. CondConv [74] and DyConv [7] generate dynamic kernels conditioned on input samples. WeightNet [46] further introduces the SE block and a sparse block-diagonal matrix to save the computational budget. SkipNet [65] learns to skip specific components with reinforcement learning. DRSS [37] builds a gating mechanism to adjust feature scales according to input samples. DyReLU [8] and APReLU [78] adaptively correct the rectification factor using a learnable parametric rectified linear unit to boost performance.

2.4 Neural network scaling

In order to overcome the paradox of complexity and performance, scaling deep neural networks is widely explored in both hand-crafted and automatically searched neural network architectures. After the modular design principle introduced by VGGNets [55] became widespread, the ResNet [20, 21] series further tapped the potential of depth scaling using residual connections and achieved remarkable gains with less complexity than VGGNets. MobileNets [23, 24, 53] scale the width of the bottleneck structure to enhance feature representations. WideResNet [75] proposes depth-width scaled shallow-wide architectures and reaches performance comparable to deep-narrow counterparts [20, 21]. EfficientNet [61] introduces a neural architecture search-based compound scaling approach to scale input resolution, depth, and width automatically and achieves a better balance between complexity and performance. RegNet [50] proposes a statistical information-based principle to adjust design spaces and sample a series of compound-scaled ResNeXt-style networks with neural architecture search.

2.5 Transfer learning on vision tasks

Extracting informative and robust feature representations is of great importance to a wide range of modern deep transfer-learning-driven vision recognition tasks, including object detection, instance segmentation, human skeleton keypoint detection, and panoptic segmentation. Plenty of previous architectures demonstrate the ability to generalize to the aforementioned transfer-learning tasks.

2.5.1 Object detection

Recognizing and locating various objects in a scene requires backbone networks to balance the conflict of feature representations between classification and localization and to overcome the aliasing effect caused by the uncertainty of localization. ImageNet [31] pre-trained networks, e.g., VGGNets [55], were first utilized as feature extractors for the R-CNN family [13, 14, 52]. In order to generate fixed-size feature representations, SPPNet [19] proposes spatial pyramid pooling to bridge the gap between convolution and fully connected layers. Fast R-CNN [13] extends SPPNet and proposes RoI pooling to ease the difficulty of the learning process. The feature pyramid network (FPN) [39] further enhances the multi-scale feature representations extracted by backbone networks at different stages with a pyramid structure and alleviates the feature aliasing effect using lateral connections.

2.5.2 Instance segmentation

In order to segment instances both accurately and precisely, segmentation networks [1, 2, 17] extract instance-aware representative context features and alleviate irrelevant noise with various backbones [20, 55]. Mask R-CNN [17] alleviates feature misalignment by a bilinear-sampling-based RoI Align operation, which reduces quantization error compared to RoI pooling [13], and introduces an extra mask prediction head for high-resolution dense prediction. Inspired by single-stage object detectors [40], the YOLACT networks [1, 2] regard instance segmentation as a mask coefficient prediction task based on fully convolutional networks (FCNs). PolarMask [71] constructs a unified framework in the polar coordinate space with center-guided classification and dense distance regression, which unites coarse-grained bounding box localization and fine-grained edge prediction with the same representations. SOLO [66, 67] proposes a fast and straightforward FCN framework to segment objects by their locations. BlendMask [4] further fuses the instance feature representations and dense segmentation features using a Blender module and achieves higher performance.

2.5.3 Keypoint detection

In recent years, deep CNNs have significantly advanced keypoint detection, and various networks have been proposed to extract instance-aware skeleton features. Mask R-CNN [17] introduces a joint training scheme of keypoint and object detection based on the ResNet [20] backbone. HRNet [56] splits the main single-scale branch into multiple branches with different scales to enhance features with multi-scale representations. CPN [6] proposes a cascade pyramid refinement network together with online hard keypoint mining loss to extract keypoint features from coarse to fine. In contrast to complicated keypoint detection models, SimpleBaseline [70] constructs a simple and effective keypoint detection benchmark. DarkPose [76] designs a novel and model-agnostic encoding-decoding-based coordinate representation to boost the performance of keypoint detection.

2.5.4 Panoptic segmentation

Different from instance segmentation [4, 17] and semantic segmentation [5, 43, 49, 79], which address stuff/thing segmentation tasks in isolation, panoptic segmentation [34] requires a reconciliation between these tasks and has recently attracted increasing research interest. Panoptic FPN [33] combines FPN [39] with Mask R-CNN [17] and a semantic segmentation head to generate robust panoptic predictions. UPSNet [73] alleviates feature conflicts between semantic and instance segmentation by utilizing deformable convolution [11, 80] and a parameter-free panoptic segmentation head in a unified framework. Attentive structures have also been introduced to alleviate feature noise and improve the performance of panoptic segmentation.

3 Methodology

In this section, we present the details of the proposed scaled gated convolutions for image recognition. It is a lightweight module based on transformation \(\mathcal {\widetilde {F}}\), capable of mapping an input tensor \(\mathbf {X}=[x_{1},x_{2},\dots ,x_{\widetilde {C}}]\in \mathbb {R}^{\widetilde {C}\times \widetilde {H}\times \widetilde {W}}\) to feature representation \(\mathbf {U}=[u_{1},u_{2},\dots ,u_{C}]\in \mathbb {R}^{C\times H\times W}\). Conventional convolution transformation \(\mathcal {F}\) consists of homogeneous filters \(\mathbf {V}=[v_{1},v_{2},\dots ,v_{C}]\) and learns local representations with a fixed receptive field. Given the above notations, the transformed feature representations at the i-th channel can be written as:

$$ u_{i}=v_{i}*\mathbf{X}={\sum}_{j=1}^{\widetilde{C}}{v_{i}^{j}}*x^{j} $$
(1)

where ∗ denotes the convolution operator, \(u_{i}\in \mathbb {R}^{H\times W}\), and \(v_{i}=[{v_{i}^{1}},{v_{i}^{2}},\dots ,v_{i}^{\widetilde {C}}]\), where \({v_{i}^{j}}\) is a 2D single-channel convolutional filter with a fixed kernel size that acts on the corresponding channel \(x^{j}\) of X. Spatial dimensions and bias terms are omitted for notational simplicity. As can be seen in (1), traditional convolutions extract local information by sliding windows with predefined kernel sizes. The channel correlation of feature representations is built by the inherent weighted summation of convolutions. We expect discriminative convolutional feature transformation learning to be strengthened by explicitly exploring long-range spatial semantic information and combining cross-channel correspondence with lightweight and powerful computational components. To this end, we propose the scaled gated convolution.

3.1 Scaled gated convolutions

In SENet, Squeeze-and-Excitation modules are cascaded after the main branch; they construct cross-channel information flow as an auxiliary branch and impose a post-gating transformation to enhance feature representations. Different from this SENet-style design cascaded after the main branch, our scaled gated convolution applies the gating mechanism in a parallel paradigm, which combines both convolutional and gating transformations. Similar to group convolutions, the input feature representations are partitioned at the beginning and merged to generate the final outputs. Differently, however, group convolutions perform homogeneous transformations on the groups, while ours introduces heterogeneous transformations to construct relationships among groups so that they complement each other.

3.1.1 Overview

The overall pipeline of our scaled gated convolution is illustrated in Figure 1. The given input feature representation is denoted as \(\mathbf {X}\in \mathbb {R}^{C\times H\times W}\). The output feature map \(\mathbf {Y}\in \mathbb {R}^{C\times H\times W}\) is designed to keep the same dimensions as the input X so that the scaled gated convolution can be applied to existing architectures in a plug-and-play manner. In order to reduce the redundancy of the scaled gated convolution, the given input features X are scaled by λ and divided into two branches, i.e., X1 and X2, by the scaling operator for lightweight heterogeneous transformation. The overall framework of our proposed scaled gated convolution is formulated as:

$$ \mathbf{Y}=\{\mathbf{Y_{1}};\mathbf{Y_{2}}\}=\{\mathbf{X}_{1};\mathcal{\widetilde{F}}(\mathbf{X}_{2})\} $$
(2)

where {⋅;⋅} denotes feature fusion and \(\mathcal {\widetilde {F}}\) denotes the scaled gated transformation. Inspired by [77], the first branch uses identity mapping to generate the identity intermediate feature representation Y1, i.e., \(\mathbf {Y_{1}=X_{1}}\in \mathbb {R}^{(1-\lambda ){C}\times {H}\times {W}}\), which preserves spatial context based on the high-resolution feature representation and avoids the loss of informative details caused by introducing extra learnable parameters. More importantly, the identity branch also preserves an auxiliary constant gradient flow where \(\frac {\partial y}{\partial x}=\mathbf {1}\), so as to accelerate model convergence. The second branch adaptively adjusts the input feature \(\mathbf {X_{2}}\in \mathbb {R}^{\lambda {C}\times {H}\times {W}}\) using a global-context-embedding-guided scaled gated operation to obtain the other intermediate feature representation Y2. The scaled gated operation consists of a two-stage gating mechanism, more specifically, a scaled gated transformation module succeeded by a scaled gated activation. The final output feature representation Y is obtained by concatenating and fusing Y1 and Y2. The details of our gating mechanism are described in the following sections.

Fig. 1
figure 1

The pipeline of our proposed scaled gated convolution. The given input feature representations are divided into a gating branch and an identity branch for heterogeneous processing. The heterogeneous branches are scaled by λ so as to reduce redundancy and improve performance. The gating mechanism is composed of a scaled gated transformation module succeeded by a scaled gated activation using lightweight filters. More details can be found in Section 3.1. Best viewed in color
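
For illustration, the two-branch layout of (2) can be sketched in PyTorch as follows; the gating branch is represented here by a placeholder 3 × 3 convolution that stands in for the scaled gated transformation and activation of Sections 3.1.2 and 3.1.3, and λC is assumed to be an integer.

```python
import torch
import torch.nn as nn

class ScaledGatedConvSkeleton(nn.Module):
    """Two-branch skeleton of (2): identity branch + (placeholder) gating branch + fusion."""
    def __init__(self, channels: int, lam: float = 0.5):
        super().__init__()
        self.c_gate = int(round(lam * channels))     # lambda * C channels for the gating branch
        self.c_id = channels - self.c_gate           # (1 - lambda) * C channels for the identity branch
        # Placeholder for the scaled gated transformation + activation (Sections 3.1.2-3.1.3)
        self.gated_branch = nn.Conv2d(self.c_gate, self.c_gate, 3, padding=1, bias=False)
        # Post fusion (Section 3.1.4): grouped convolution over the concatenated branches
        self.fuse = nn.Conv2d(channels, channels, 1, groups=2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.c_id, self.c_gate], dim=1)  # scaling operator
        y1 = x1                                       # identity branch: details and constant gradient flow
        y2 = self.gated_branch(x2)                    # gating branch (placeholder transformation)
        return self.fuse(torch.cat([y1, y2], dim=1))  # concatenate and fuse

if __name__ == "__main__":
    print(ScaledGatedConvSkeleton(256)(torch.randn(2, 256, 14, 14)).shape)
```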

3.1.2 Scaled gated transformation

In order to exploit the cross-channel information flow within the scaled gated convolution effectively, we propose a scaled gated transformation module. Specifically, given the input feature representations of the second branch \(\mathbf {X_{2}}\in \mathbb {R}^{\lambda C\times H\times W}\), the channel-wise statistics \(\boldsymbol {\mu }\in \mathbb {R}^{\lambda C}\) are generated by the non-parametric global average pooling operation φ, which shrinks the spatial dimensions of X2. Thus, the c-th channel μc of the channel-wise statistics μ is calculated by:

$$ \mu_{c}=\varphi(\mathbf{{X_{2}^{c}}})=\frac{1}{H\times W}\sum\limits_{i=1}^{H}\sum\limits_{j=1}^{W}\mathbf{{X_{2}^{c}}}(i,j) $$
(3)

Furthermore, in order to construct a self-adaptive selection and adjustment mechanism, we utilize a lightweight linear projection to model the cross-channel information flow. A lightweight learnable parameter matrix \(\mathbf {W}\in \mathbb {R}^{\lambda C\times \lambda C}\), which contains only around 1.3% of the parameters of the whole model (e.g., 0.3M vs. 22.3M for SGNet-50), is introduced to build cross-channel information flow among arbitrary pairs (μi,μj) of the channel-wise statistics μ, so as to enhance the robustness of features. Formally, W is defined as:

$$ \mathbf{W}= \left[ \begin{array}{lcr} w^{1,1} & {\cdots} & w^{1,\lambda C} \\ {\vdots} & {\ddots} & {\vdots} \\ w^{\lambda C,1} & {\cdots} & w^{\lambda C,\lambda C} \end{array} \right] $$
(4)

However, the lightweight linear projection \(\mathbf {W}\in \mathbb {R}^{\lambda C\times \lambda C}\) limits the capability of modeling cross-channel information flow. On the one hand, such a linear parameter matrix lacks non-linear projection capability; on the other hand, the introduced parameters might increase the optimization difficulty and the potential risk of overfitting. Thus, residual global embeddings \(\mathbf {e}\in \mathbb {R}^{\lambda C}\) are generated by combining the linear projection with a residual connection to accelerate convergence, mitigate the risk of overfitting, and suppress the vanishing gradient problem. Formally, \(\mathbf {e}\in \mathbb {R}^{\lambda C}\) is calculated as follows:

$$ \mathbf{e}=f(\boldsymbol{\mu})=\theta(\boldsymbol{\mu})+\boldsymbol{\mu}=\mathbf{W}\circ\boldsymbol{\mu}+\boldsymbol{\mu} $$
(5)

where ∘ denotes matrix multiplication and + denotes element-wise summation. The residual global embeddings \(\mathbf {e}\in \mathbb {R}^{\lambda C}\) are then soft-gated by the gating operation δ and applied to the large-scale feature representations to construct the powerful feature representation \(\mathbf {z}\in \mathbb {R}^{\lambda C\times H\times W}\). As aforementioned, vanilla convolutions are able to enhance informative local details with fixed receptive fields. Thus, we introduce convolutional filters ψ to model local feature patterns of the large-scale input feature representations X2. Formally, the c-th channel zc of the output of the scaled gated transformation can be calculated by:

$$ z_{c}=(\mathbf{\psi*{X_{2}^{c}}})\odot\delta(e_{c}) $$
(6)

where ∗ denotes convolution and ⊙ denotes element-wise multiplication. The soft-gating selection operation δ adaptively re-weights the large-scale feature representations based on the residual global embeddings e using a sigmoid function.
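
For illustration, (3)-(6) can be sketched in PyTorch as follows; here ψ is instantiated as a 3 × 3 depthwise convolution in line with the per-channel form of (6), and W is implemented as a bias-free linear layer, which is a simplified reading rather than a prescribed configuration.

```python
import torch
import torch.nn as nn

class ScaledGatedTransformation(nn.Module):
    """Sketch of the scaled gated transformation: GAP -> residual projection -> sigmoid gate."""
    def __init__(self, gate_channels: int, kernel_size: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                               # phi, eq. (3)
        self.proj = nn.Linear(gate_channels, gate_channels, bias=False)   # W, eq. (4)
        self.psi = nn.Conv2d(gate_channels, gate_channels, kernel_size,   # local filters psi (depthwise)
                             padding=kernel_size // 2, groups=gate_channels, bias=False)
        self.delta = nn.Sigmoid()                                         # soft-gating delta

    def forward(self, x2: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x2.shape
        mu = self.pool(x2).view(b, c)          # channel-wise statistics mu, eq. (3)
        e = self.proj(mu) + mu                 # residual global embedding, eq. (5)
        gate = self.delta(e).view(b, c, 1, 1)
        return self.psi(x2) * gate             # eq. (6): gated local transformation

if __name__ == "__main__":
    print(ScaledGatedTransformation(128)(torch.randn(2, 128, 28, 28)).shape)
```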

3.1.3 Scaled gated activation

The conventional rectified linear unit [48] provides sparsity and non-linear fitting ability by suppressing negative feature representations, yet this discards potentially informative negative features. Parametric alternatives, e.g., PReLU [18] and ELU [10], introduce extra hyperparameters, which require parameter tuning for various downstream tasks.

Inspired by the success of applying APReLU [78] to fault diagnosis, we hypothesize that the negative embeddings suppressed by ReLU [48] encode noise-disturbed, class-distribution-aware information whose potential has not been fully explored for visual recognition. Thus, to adaptively enhance the non-linear fitting ability of feature representations and inhibit noise, we propose a hyperparameter-free module called scaled gated activation to tap the potential of class-aware negative embeddings and ease the learning process of our model. Formally, given the output \(\mathbf {z}\in \mathbb {R}^{\lambda C\times H\times W}\) of (6) as input, the activated amplitude \(\mathbf {m}\in \mathbb {R}^{\lambda C}\) can be calculated as:

$$ \mathbf{m}=\delta(f(\varphi(\mathbf{z}))) $$
(7)

where δ,f,φ are defined in Section 3.1.2. After the activated amplitude is obtained, the output feature representation of the second branch \(\mathbf {Y_{2}}\in \mathbb {R}^{\lambda C\times H\times W}\) can be calculated as follows:

$$ \mathbf{Y_{2}}=G(\mathbf{z})=\max(\mathbf{z}, \mathbf{0})+\min(\mathbf{z}, \mathbf{0})\odot\mathbf{m} $$
(8)

Different from the dual-branch APReLU [78], which constructs statistical correlations for positive/negative embeddings separately using computationally costly cascaded fully-connected layers, our scaled gated activation module extracts non-linear representations based on our lightweight scaled gating mechanism. Thus, our scaled gated activation module can be plugged into the scaled gated convolution as an auxiliary non-linear feature extractor with only a small additional computational budget, which APReLU [78] cannot.
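
A simplified PyTorch sketch of (7) and (8) is given below; the gating chain φ, f, δ is re-instantiated here with its own parameters, which is an assumption of the sketch rather than a stated design choice.

```python
import torch
import torch.nn as nn

class ScaledGatedActivation(nn.Module):
    """Sketch of the scaled gated activation: keep positives, rescale negatives by m."""
    def __init__(self, gate_channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                               # phi
        self.proj = nn.Linear(gate_channels, gate_channels, bias=False)   # f: residual projection
        self.delta = nn.Sigmoid()                                         # delta

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = z.shape
        s = self.pool(z).view(b, c)
        m = self.delta(self.proj(s) + s).view(b, c, 1, 1)           # activated amplitude, eq. (7)
        return torch.clamp(z, min=0) + torch.clamp(z, max=0) * m    # eq. (8)

if __name__ == "__main__":
    print(ScaledGatedActivation(128)(torch.randn(2, 128, 28, 28)).shape)
```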

3.1.4 Post fusion

Inspired by the success of the linear bottleneck in MobileNet-V2 [53] and the calibration operator in SCNet [42], we propose a post fusion module as the post-processing stage of our scaled gated convolution to gather local feature context and fuse the heterogeneous output feature representations. The intermediate feature representations Y1 and Y2 are concatenated and fused to generate the final output feature representations. Formally, given the heterogeneous outputs \(\mathbf {Y_{1}}\in \mathbb {R}^{(1-\lambda) C\times H\times W}\) and \(\mathbf {Y_{2}}\in \mathbb {R}^{\lambda C\times H\times W}\), and the post fusion convolutional filters U, the final output feature representation is fused by:

$$ \mathbf{Y}=F(\mathbf{Y_{1}},\mathbf{Y_{2}})=\mathbf{U}*[\mathbf{Y_{1};Y_{2}}] $$
(9)

where U denotes grouped convolutional filters divided into K groups, ∗ denotes the convolution operation, and [⋅;⋅] represents feature concatenation.
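
For illustration, (9) can be sketched as a grouped convolution over the concatenated branches; the 3 × 3 kernel and K = 2 groups shown here follow the complexity terms in Section 3.3 and the default settings in Section 4.1.2, and remain an assumption of the sketch.

```python
import torch
import torch.nn as nn

class PostFusion(nn.Module):
    """Grouped S x S filters U applied to the concatenated branches [Y1; Y2], eq. (9)."""
    def __init__(self, channels: int, groups: int = 2, kernel_size: int = 3):
        super().__init__()
        self.fuse = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=groups, bias=False)

    def forward(self, y1: torch.Tensor, y2: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([y1, y2], dim=1))   # concatenate then fuse

if __name__ == "__main__":
    y1 = torch.randn(2, 128, 14, 14)
    y2 = torch.randn(2, 128, 14, 14)
    print(PostFusion(channels=256)(y1, y2).shape)      # torch.Size([2, 256, 14, 14])
```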

3.2 Network architecture

Based on the scaled gated convolution, the overall architecture of our proposed SGNet is listed in Table 1. The reasons why our scaled gated convolution is applied to ResNet [20] are as follows. First of all, the design choices of ResNet follow the modular design principles introduced by VGGNets [55], which are easy to extend to various downstream tasks, e.g., object detection and pose estimation, and are compatible with existing methods like the feature pyramid network [39]. In other words, plenty of tasks can benefit from replacing the original convolutions with scaled gated convolutions in a plug-and-play manner. Moreover, the application of the scaled gated convolution can benefit from residual connections in deep models, which avoid vanishing gradient problems. Last but not least, due to the design of its efficient bottleneck modules, ResNet is one of the state-of-the-art architectures with a low computational budget and model complexity. Specifically, our proposed SGNet consists of multiple bottlenecks containing scaled gated convolutions, termed ”SG bottlenecks”. Each SG bottleneck is composed of a 1 × 1 convolution, a scaled gated convolution, and a 1 × 1 convolution stacked sequentially. By replacing large-size convolutions with our scaled gated convolutions, SGNet is able to enhance cross-channel information flow, suppress feature noise, and strengthen the robustness of representations. The detailed configuration of SGNet-50 is shown in Table 1. Similar to ResNet-50, SGNet-50 contains four stages which consist of {3,4,6,3} SG bottlenecks, respectively. Different SGNet architectures can be obtained by varying the number of bottleneck blocks in each stage. Compared with ResNet-50, our SGNet-50 is capable of maintaining comparable performance while saving around 8.6% of the parameters and 10.1% of the computational budget. Furthermore, our SGNet-50 reduces the parameters by 16.8% and the computational budget by 10.3% compared with SENet-50.

Table 1 Architectures comparison among ResNet-50, SENet-50 and our proposed SGNet-50
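
As a self-contained PyTorch sketch of the SG bottleneck layout described above (1 × 1 reduce, scaled gated convolution, 1 × 1 expand, wrapped by a residual connection), the following code substitutes a plain 3 × 3 convolution when no scaled gated convolution is supplied; the normalization placement and class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SGBottleneck(nn.Module):
    """Sketch of one SG bottleneck: 1x1 -> scaled gated conv -> 1x1, plus a shortcut."""
    expansion = 4

    def __init__(self, in_channels: int, mid_channels: int, sg_conv: nn.Module = None):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        # The scaled gated convolution replaces the original 3x3 convolution here.
        self.sg_conv = sg_conv or nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False)
        self.bn2 = nn.Sequential(nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.expand(self.bn2(self.sg_conv(self.reduce(x))))
        return self.relu(out + self.shortcut(x))

if __name__ == "__main__":
    print(SGBottleneck(256, 64)(torch.randn(2, 256, 14, 14)).shape)
```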

3.2.1 Relation to attentive architectures

Our proposed SGNet is quite different from existing attentive architectures. SENet [26] applies homogeneous attention to all channels, which leads to a lack of feature diversity. Differently, we apply heterogeneous operations, i.e., identity mapping and the scaled gated design, to enhance diversity, preserve informative features, and improve efficiency. Also, SENet [26] applies channel-wise reduction to reduce complexity, introducing information loss, while ours scales the gating path to achieve a similar purpose without channel reduction. Furthermore, SENet [26] inserts a channel-wise attention module after the convolution as an individual operator, while ours integrates the gating module with the convolution as a whole to replace the original convolution operator in a plug-and-play manner. In order to generate heterogeneous features, SCNet [42] applies filters in an hourglass style, which inevitably requires extra computational budget due to the reserved large spatial scale, while ours is based on identity mapping together with the scaled gating transformation, which is more lightweight. CBAM [68] also preserves large spatial resolutions and applies an attention mechanism to improve performance. However, our work demonstrates that even without the help of large spatial scale reservation, competitive performance can be achieved, and the computational budget can even be slightly reduced in terms of FLOPs. Besides, the heterogeneous filters utilized in SCNet [42] might introduce bias and lead to information loss, while ours may not. Moreover, our scaled gated activation is capable of enhancing the non-linear fitting ability, while SCNet is not. ECANet [63] explores a locality-based attention mechanism to reduce attention redundancy using 1D convolutions, while ours scales the gating branch to achieve this purpose. In short, Table 2 summarizes the relationship to existing attentive architectures.

Table 2 Relation to attentive architectures

3.3 Complexity analysis

Given scaling coefficient 0 < λ ≤ 1, fusion groups \(\mathbf {K}\in \mathbb {N}^{+}\), kernel size S × S and input feature \(\mathbf {X}\in \mathbb {R}^{C\times H\times W}\), the complexity of our scaled gated convolution can be formulated as:

$$ \begin{cases} \mathbf{\#P}= (\lambda^{2}+S^{2}/\mathbf{K})C^{2}\\ \mathbf{\#F}=[\lambda^{2}(HW+1)+S^{2}HW/\mathbf{K}]C^{2}\\ \end{cases} $$
(10)

where #P and #F denote the number of parameters and flops of our scaled gated convolution, respectively. Compared to vanilla convolutions, the saved computational budget can be calculated as follows. Note that Δ#P and Δ#F denote the number of saved parameters and flops, respectively.

$$ \begin{cases} \mathbf{\Delta\#P}=[(\mathbf{K}-1)S^{2}/\mathbf{K}-\lambda^{2}]C^{2}\\ \mathbf{\Delta\#F}=[(\mathbf{K}-1)S^{2}HW/\mathbf{K}-\lambda^{2}(HW+1)]C^{2}\\ \end{cases} $$
(11)

Note that the complexity might increase in some cases, e.g., when K = 1 and λ = 1. Yet, as presented in Section 4.1.2, we set K = 2 and λ = 0.5 by default to overcome the paradox between redundancy and performance, if not otherwise noted.
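
For reference, a small helper reproducing (10) and (11) is given below; only the dominant terms are counted, as in the text, and the helper is an illustration rather than the exact counting used for Table 1.

```python
def sg_conv_complexity(C, H, W, lam=0.5, K=2, S=3):
    """Parameters/flops of the scaled gated convolution per eq. (10), plus savings per eq. (11)."""
    params = (lam ** 2 + S ** 2 / K) * C ** 2                        # eq. (10), #P
    flops = (lam ** 2 * (H * W + 1) + S ** 2 * H * W / K) * C ** 2   # eq. (10), #F
    base_params = S ** 2 * C ** 2                                    # vanilla S x S convolution
    base_flops = S ** 2 * H * W * C ** 2
    return params, flops, base_params - params, base_flops - flops   # deltas match eq. (11)

if __name__ == "__main__":
    p, f, dp, df = sg_conv_complexity(C=256, H=14, W=14)
    print(f"#P={p/1e6:.2f}M  #F={f/1e6:.1f}M  saved #P={dp/1e6:.2f}M  saved #F={df/1e6:.1f}M")
```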

4 Experiments

We evaluate the performance on large-scale datasets for various tasks, including ImageNet [31] classification as well as object detection, instance segmentation, keypoint detection, and panoptic segmentation on COCO [41]. Specifically, classification performance is evaluated on ImageNet [31], and SGNet is adopted as the backbone for the downstream tasks. Faster/Mask R-CNN [17, 52] and SimpleBaseline [70] are utilized as code bases for object detection, instance segmentation, and keypoint detection, respectively, to verify the transferability of SGNet.

4.1 ImageNet classification

ImageNet [31] contains 1.28 million training images and 50k validation images from 1k classes. Models are trained on the training set, and accuracy is reported on the validation set. We run our experiments with the official code base built on the widely used PyTorch framework.

4.1.1 Implementation details

The standard data augmentation is applied as done in [20]. Specifically, the training images are randomly cropped to 224 × 224 with random horizontal flipping. All models are trained on 8 GPUs with a batch size of 256 for 100 epochs, and parameters are optimized by stochastic gradient descent (SGD) with a weight decay of 0.00005 and a momentum of 0.9. The initial learning rate is set to 0.2, and we utilize a cosine learning rate schedule [30] with a linear warmup [15] for the first five epochs. Note that all experiments share the same environment and experimental settings using the same code base.
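
A sketch of this optimization recipe in PyTorch is shown below; the model and data pipeline construction are omitted, and the helper names are illustrative.

```python
import math
import torch

def build_optimizer_and_schedule(model, epochs=100, warmup_epochs=5, base_lr=0.2):
    """SGD (momentum 0.9, weight decay 5e-5) with linear warmup then cosine decay."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=5e-5)

    def lr_factor(epoch):
        if epoch < warmup_epochs:                           # linear warmup over the first 5 epochs
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / (epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay afterwards

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
    return optimizer, scheduler

if __name__ == "__main__":
    opt, sched = build_optimizer_and_schedule(torch.nn.Linear(8, 8))
    for epoch in range(100):
        # one training epoch over 224x224 randomly cropped/flipped batches goes here
        sched.step()
```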

4.1.2 Ablation study

Fair comparison

To compare the effectiveness of SGNet with other counterparts, the original large field-of-view convolutions used in ResNet [20] are replaced by our scaled gated convolutions. We consider ResNet [20], ResNeXt [72], and attentive architectures [26, 42, 63, 68] with different depths and evaluate performance on the large-scale ImageNet [31] dataset. Specifically, for single-branch gating, SGNet-50 and SGNet-101 are obtained by replacing the corresponding convolutions of ResNet-50 and ResNet-101, respectively. For multi-branch gating, we follow [72] to simplify multiple gating branches into a grouped dual-branch scaled gated convolution containing a grouped gating branch and an identity branch. For these ResNeXt-style models, we adjust the cardinality setting to 8x14d instead of the default 32x4d setting of ResNeXt [72] to fit our scaled gated convolution, i.e., ResNeXt-8x14d and SENeXt-8x14d models are utilized as baselines when compared with our SGNeXt-8x14d models for a fair comparison, if not otherwise specified. Speed is evaluated on 8 GTX 1080 Ti GPUs.

As shown in Table 3, our SGNet-50 improves the top-1 accuracy by 1.7% (76.8% vs. 78.5%) and the top-5 accuracy by 0.9% (93.4% vs. 94.3%) over ResNet-50. Compared to ResNet-101, our SGNet-101 improves the top-1 accuracy by 1.0% (78.6% vs. 79.6%) and the top-5 accuracy by 0.5% (94.3% vs. 94.8%). Similar improvements can be seen for the other counterparts. Note that our SGNet achieves comparable or better performance with our lightweight scaled gated convolution under the same experimental settings compared to both attentive and vanilla architectures.

Table 3 Fair comparison on ImageNet [31]

Pooling choices

The global embeddings are channel-wise statistics obtained by pooling operations. Thus, we consider the influence of global average pooling (GAP) and global max pooling (GMP). As shown in Table 4, using GAP improves the top-1 accuracy by 0.6% (77.9% vs. 78.5%) and the top-5 accuracy by 0.4% (93.9% vs. 94.3%). This might be due to the fact that GMP captures only the global maximum as the statistic, while GAP constructs connections among arbitrary spatial positions so as to generate more powerful representations.

Table 4 Pooling methods and residual connection of embeddings

Residual global embedding

To evaluate the residual connections in the global embeddings, we compare global embeddings with and without residual connections. As presented in Table 4, introducing residual connections improves the top-1 accuracy by 0.4% and the top-5 accuracy by 0.2%. These results are consistent with the motivation that residual connections accelerate model convergence.

Scaling factor

In order to verify the parameter sensitivity of the scaling factor λ and the fusion factor K of our scaled gating mechanism, we scale the gating branch with different λ and K to balance redundancy and performance. As presented in Table 5, λ = 0.5 and K = 2 empirically achieve the best trade-off between the computational budget and performance. Note that λ is fixed when varying K and vice versa.

Table 5 Parameter sensitivity of λ and K using SGNet-50

Ablation study of scaled gated modules

To evaluate the influence of each scaled gated module, we remove the scaled gated transformation, the scaled gated activation, and the post fusion module in turn and validate the performance. As can be seen in Table 6, removing the post fusion module leads to a 1.8% top-1 and 0.9% top-5 accuracy drop. A similar trend can be observed when removing the scaled gated transformation or the scaled gated activation.

Table 6 Ablation study of each scaled gated module, where ”T.”, ”A.” and ”F.” represent scaled gated transformation, scaled gated activation, and post fusion defined in Section 3.1, respectively

Visualization analysis

To explore the class-specific information encoded by the scaled gated activation, we uniformly sample 1k images from 20 randomly chosen classes and then project the extracted high-level semantic features using t-SNE [62] to verify the discriminability of the class-specific distribution information before/after the scaled gated activation. As presented in Figure 2, the scaled gated activation pulls features of the same class closer together and pushes them away from samples of other classes. Thus, the proposed scaled gated activation module is capable of encoding class-specific information.

Fig. 2
figure 2

Class-aware features before/after scaled gated activation
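
For reference, the projection step can be sketched as follows, assuming the before/after features have already been extracted into NumPy arrays; the helper names and plotting details are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def project_and_plot(features: np.ndarray, labels: np.ndarray, title: str):
    """Project (N, D) features to 2D with t-SNE and color points by class label."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab20", s=5)
    plt.title(title)
    plt.show()

if __name__ == "__main__":
    feats = np.random.randn(1000, 512).astype(np.float32)    # stand-in for extracted features
    labels = np.random.randint(0, 20, size=1000)              # 20 sampled classes
    project_and_plot(feats, labels, "after scaled gated activation (illustrative)")
```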

In order to provide intuitive insight into our heterogeneous design, we visualize features at different positions and heatmaps [54] in Figure 3. Some filters focus on local details such as textures and edges, which appear darker, while others pay more attention to overall semantic information, which appears brighter. The heterogeneous design enables different modules to extract heterogeneous features that complement each other.

Fig. 3
figure 3

Visualization of selected features at different positions and heatmaps of the proposed scaled gated convolution. SGT: scaled gated transformation; SGA: scaled gated activation; PF: post fusion. Filters that generate darker visualizations focus more on textures and edges, while those that generate brighter visualizations focus more on overall semantic information

4.2 Object detection

To evaluate the transferability of our scaled gated convolutions, our SGNet models serve as backbones of object detectors and are trained on the COCO dataset [41].

4.2.1 Experimental settings

The widely used Faster R-CNN [52] with FPN [39] is utilized based on the Detectron2 [69] benchmark to run our detection experiments. All models are trained on the COCO-2017 training set, and we report COCO-style metrics (AP, AP50, AP75, APS, APM, and APL) on the COCO-2017 validation set. Images are resized so that the longer edge is no more than 1333 pixels. We use 8 GPUs to train each model for 90000 iterations with a batch size of 16. The initial learning rate is set to 0.02 and divided by 10 after 60000 and 80000 iterations. SGD is utilized to optimize parameters. The weight decay and momentum are set to 0.0001 and 0.9, respectively. For a fair comparison, multi-scale training and synchronized batch normalization are enabled for all models.
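
The setup above can be expressed with the Detectron2 configuration API roughly as follows; this is a hedged sketch in which the registration of the SGNet backbone is omitted and the values simply mirror the hyperparameters stated above (multi-scale training is inherited from the base configuration).

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.SOLVER.IMS_PER_BATCH = 16          # total batch size over 8 GPUs
cfg.SOLVER.BASE_LR = 0.02              # initial learning rate
cfg.SOLVER.STEPS = (60000, 80000)      # divide the learning rate by 10 at these iterations
cfg.SOLVER.MAX_ITER = 90000
cfg.SOLVER.MOMENTUM = 0.9
cfg.SOLVER.WEIGHT_DECAY = 0.0001
cfg.INPUT.MAX_SIZE_TRAIN = 1333        # longer edge no more than 1333 pixels
cfg.MODEL.RESNETS.NORM = "SyncBN"      # synchronized batch normalization

if __name__ == "__main__":
    trainer = DefaultTrainer(cfg)      # assumes COCO-2017 is registered locally
    trainer.resume_or_load(resume=False)
    trainer.train()
```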

4.2.2 Object detection results

As can be seen in Table 7, our SGNet-50 based detector outperforms the ResNet-50 based one by around 3.5% AP (38.9% vs. 42.4%). Besides, our SGNet-50 brings gains of 4.2%, 3.7%, and 3.4% in APS, APM, and APL, respectively. The same phenomena can be observed for other configurations in Table 7. Thus, our scaled gated convolution is capable of generating more scale-robust feature representations than ResNet-50. Moreover, our SGNeXt based Faster R-CNN models bring large performance gains (37.8% vs. 42.8% for 50 layers and 39.6% vs. 44.4% for 101 layers) compared to the ResNeXt baselines, which indicates that the heterogeneous gating transformation is able to generate much more powerful representations than the homogeneous baselines using group convolutions. Besides, our model also achieves promising performance compared to attentive approaches [26, 42, 63]. In order to verify that our scaled gated convolution is capable of overcoming the paradox of complexity and performance, we also evaluate the model complexity in terms of parameters and flops using the same code base. As can be seen in Table 7, our lightweight scaled gated convolution can be plugged into modern architectures and achieves comparable or better performance with promising computational budgets.

Table 7 Fair comparison of object detection results on COCO [41]

4.3 Instance segmentation

4.3.1 Instance segmentation results

In addition to object detection, we apply our scaled gated convolution to Mask R-CNN [17] with the same settings as in Section 4.2.1. As shown in Table 8, our SGNet-50 based Mask R-CNN outperforms the ResNet-50 based Mask R-CNN by 3.2% (34.9% vs. 38.1%), while the SGNeXt-50 version outperforms the ResNeXt-50 one by 4.2% (34.5% vs. 38.7%). For deeper configurations, our SGNet-101 version brings a 2.4% absolute improvement over the ResNet-101 based model, and the SGNeXt-101 version brings an additional 3.7% AP increase over the ResNeXt-101 based model. As can be seen, our scaled gated convolution can boost the performance of instance segmentation. Compared to SCNet [42] and ECANet [63], our approach also achieves comparable or better performance with a lower computational budget. We also observe a slight performance gap in terms of APM and APL compared to the SCNet-50 based Mask R-CNN. Further improving the capability of modeling large-scale instances will be our future work.

Table 8 Mask R-CNN [17] based instance segmentation results

4.4 Panoptic segmentation

In order to evaluate the capability of our scaled gated convolution to generalize to the dense mask prediction task, we adopt SGNet as the backbone network of Panoptic FPN [33] and compare it with several modern counterparts [20, 26, 42, 63] using the same code base.

4.4.1 Experimental settings

Following the experimental settings in previous work [33], we utilize the COCO-2017 [41] data splits with 118k images for training and 5k images for validation, with 80 thing classes for instance segmentation. We also use the COCO-2017 [41] stuff data, including 40k training images and 5k validation images with 92 stuff classes. The panoptic segmentation models are trained on all images containing the 80 thing and 53 stuff classes as in [33]. For a fair comparison, all models are trained for 90000 iterations using the same code base, and scale jitter is adopted as described in [33]. To evaluate the performance of both panoptic segmentation and semantic segmentation, we report different metrics for these tasks. Specifically, mIoU, fwIoU, mACC, and pACC are reported for semantic segmentation, and the PQ, SQ, and RQ related metrics are reported for panoptic segmentation.

4.4.2 Panoptic segmentation results

The panoptic segmentation results are listed in Table 9. As can be seen, our proposed scaled gated convolution achieves comparable or better panoptic segmentation results than both vanilla architectures [20, 72] and advanced attentive architectures. Compared to the ResNet-50 [20] based Panoptic FPN, our SGNet achieves a 3.0% performance gain (39.4% vs. 42.4%). For deep models, our SGNet boosts the performance of ResNet-101 [20] from 41.6% to 44.3%. Furthermore, our SGNet also achieves superior performance compared to SENet [26] and other attentive counterparts [42, 63], as shown in Table 9. Thus, our proposed scaled gated convolution generalizes well to dense prediction tasks.

Table 9 Panoptic FPN [33] based panoptic segmentation results using the same code base

4.5 Keypoint detection

To evaluate the ability to generalize to keypoint detection tasks, we also apply our scaled gated convolution to the human-keypoint-detection-based pose estimation pipeline and report the OKS-based mAP on the COCO-2017 validation set.

4.5.1 Experimental settings

We strictly follow the default settings in [70]: the initial learning rate is set to 0.001 and divided by 10 after 90 and 120 epochs. For a fair comparison, all models are trained with a batch size of 32 for 140 epochs using the Adam optimizer [32]. A Faster R-CNN detector with a 56.4 mAP detection result for the ”person” category is used during inference, as done in [70]. We consider input sizes of 256×192 and 384×288 as in [70].
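
This schedule can be sketched as follows; the SimpleBaseline model and data pipeline are omitted, and the helper names are illustrative.

```python
import torch

def build_pose_optimizer(model):
    """Adam with lr 1e-3, decayed by 10x after epochs 90 and 120 (trained for 140 epochs)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90, 120], gamma=0.1)
    return optimizer, scheduler

if __name__ == "__main__":
    opt, sched = build_pose_optimizer(torch.nn.Conv2d(3, 17, 1))  # toy stand-in model
    for epoch in range(140):
        # one training epoch over 256x192 (or 384x288) person crops goes here
        sched.step()
```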

4.5.2 Keypoint detection results

The keypoint detection results are shown in Table 10. We omit the complexity metrics in Table 10 since all the keypoint detection results are based on the same Faster R-CNN [52] code base, whose complexity has been discussed in Section 4.2.2, and the running speed of our SGNet backbones has been discussed in Section 4.1.2. As can be seen, our proposed scaled gated convolution outperforms previous work [42, 70] in terms of OKS-based AP. Specifically, given 256×192 input images, our SGNet-50 outperforms ResNet-50 and SCNet-50 by 2.9% and 1.2%, respectively. Our SGNet-101 outperforms ResNet-101 and SCNet-101 by 3.2% and 2.0%, respectively. Similar phenomena can be observed when larger images are given, as shown in Table 10. However, there is a performance gap between SCNet [42] and our SGNet in terms of APL and APM, which indicates that SGNet and SCNet complement each other. More specifically, SCNet might be helpful for large-scale keypoint detection, while our SGNet is complementary to SCNet in terms of AP50, AP75, and APM. Thus, we believe combining SGNet with other approaches might further boost the performance.

Table 10 Pose estimation results based on the same code base [70]

5 Conclusion

We propose a lightweight scaled gated convolution that introduces scaled heterogeneous gating to generate powerful features and reduce redundancy, and it can be plugged into modern architectures in a plug-and-play manner. The gating mechanism consists of a scaled gated transformation, a scaled gated activation, and post fusion. Experiments on large-scale datasets verify the effectiveness of our scaled gated convolution, and it can also be applied to downstream tasks to boost performance. We hope this work will inspire the study of efficient convolution design in the future.