1 Introduction

In recent years, scene text detection, as a fundamental computer vision task, has become an active research field, since it is an essential step in many applications such as autonomous driving, scene understanding and text recognition. With the rapid development of Convolutional Neural Networks [7, 9, 13, 17, 46, 47], much progress has been made [19, 20, 39, 43]. Scene text detection methods can be roughly divided into two categories: regression-based methods and segmentation-based methods. Segmentation-based methods in particular have received much attention, since segmentation results can describe text of arbitrary shape such as curve text, and several new approaches [19, 24, 27, 45] have been proposed to detect curve texts. On the one hand, many of these approaches employ classification networks as the backbone [7, 10, 42]. However, due to the diversity of curve text in shape, scale and orientation, the detector needs to adjust the size of its local receptive fields adaptively to encode sufficient context information; simply transferring classification networks to text detection is therefore not optimal. On the other hand, multi-scale detection is very important for text detection, as high layers respond strongly to global features while shallow layers are likely to retain local features. Most previous methods use the Feature Pyramid Network (FPN) [23] to extract multi-scale features. Nevertheless, the long path from low levels to the top level impedes the flow of accurate localization signals.

To address the above problems, an arbitrary-shaped text detector is proposed in this paper, namely the Adaptive Convolution and Path Enhancement Pyramid Network (ACPEPNet). The proposed detector is a segmentation-based method that can detect text of arbitrary shape. The pipeline includes two steps: 1) using a segmentation network to obtain the segmentation maps; 2) converting the segmentation maps to binarization maps and then reconstructing the text regions with a post-processing algorithm. Firstly, in order to make the detector adjust the size of its local receptive fields adaptively and improve its non-linear aggregation capability, EfficientNet [42] is used as the backbone and the Adaptive Convolution Unit is embedded into it. By redesigning the structure of EfficientNet, a set of backbone networks named ACNet-B0 and ACNet-B3 is proposed, designed for curve text detection. ACNet-B0 achieves a better efficiency/accuracy trade-off, while ACNet-B3 achieves better accuracy. Compared with classification networks, ACNet brings significant improvement to text detection tasks. Secondly, in order to let low-level features flow into the top level more smoothly, a Path Enhancement Feature Pyramid Network (PEFPN) is proposed, which constructs an extremely short path of fewer than 10 layers and adds the original feature maps to the final stage of information fusion. For high efficiency, we use depthwise separable convolution [4] instead of conventional convolution in PEFPN.

To show the effectiveness of the proposed method, experiments are carried out on four public benchmark datasets: CTW1500 [24], Total-Text [1], ICDAR 2015 [16] and MSRA-TD500 [49]. Among these datasets, CTW1500 and Total-Text are specially designed for curve text detection, ICDAR 2015 is a multi-oriented text detection dataset and MSRA-TD500 is a multi-lingual text dataset. On CTW1500, when using ACNet-B3 as the backbone network, the F-measure is 80.8%, which is 2.8% better than PSENet [45]. Meanwhile, this method also achieves promising performance on the multi-oriented and multi-lingual text datasets.

The contributions can be summarized as:

  • We introduce an adaptive convolution unit, which can adaptively adjust the size of local receptive fields and nonlinearly aggregate multi-scale spatial information.

  • We propose ACNet, a backbone network designed for text detection, which significantly improves curve text detection.

  • We propose PEFPN, a two-way feature pyramid network that benefits cross-scale feature fusion.

2 Related work

Recent scene text detection methods based on deep learning have achieved remarkable results. They can be roughly divided into two categories: regression-based methods and segmentation-based methods.

Regression-based methods are usually built on general object detection frameworks [28, 29, 37, 53] and directly regress the bounding boxes of text instances. TextBoxes [18] directly modified the anchor scales and the shape of convolutional kernels to deal with text of different aspect ratios. TextBoxes++ [21] adopted quadrilaterals to regress multi-oriented text. SSTD [36] introduced an attention mechanism to roughly identify text regions. RRD [22] extracted rotation-invariant features for text classification and rotation-sensitive features for text regression, which works better for multi-oriented and long text detection. EAST [54] and DeepReg [11] are anchor-free methods that regress multi-oriented text instances directly from pixels. SegLink [39] regressed segments of bounding boxes and studied their links to handle long text detection.

However, most of the above-mentioned methods rely on complex anchor design, which makes them cumbersome and results in sub-optimal performance. In addition, these methods were specially proposed for multi-oriented text detection; they are limited to quadrilateral bounding boxes and may fall short when dealing with curve text.

Segmentation-based methods mainly combine pixel-level prediction frameworks [2, 3, 14, 25] with post-processing algorithms to obtain the bounding boxes. Zhang et al. [52] extracted text regions by semantic segmentation and adopted MSER to detect character candidates. Yao et al. [51] formulated a text block as three parts and predicted the corresponding heat-maps with an FCN [26]. Lyu et al. [32] adopted corner localization and represented bounding boxes with irregular quadrangles. PixelLink [6] predicted pixel connections to separate texts lying close to each other. TextSnake [27] represented curve text with ordered disks for arbitrary-shaped text detection. SPCNet [48] utilized an instance segmentation framework and context information to detect curve text while suppressing false positives. PSENet [45] proposed a progressive scale expansion algorithm that constructs text instances from multi-scale kernels.

The above methods have achieved remarkable performance on several horizontal and multi-oriented text benchmarks. Nonetheless, except for TextSnake [27], SPCNet [48] and PSENet [45], most methods have not focused on curve text, and even these methods have not considered the significance of multi-scale receptive fields and low-level features for curve text detection.

3 Methods

3.1 Overall architecture

The overall architecture of our method is shown in Fig. 1. Firstly, the input image is fed into ACNet; our proposed PEFPN serves as the feature extraction and cross-scale fusion network, taking the level 2–5 feature maps {P2, P3, P4, P5} from ACNet, which have strides of {4, 8, 16, 32} pixels with respect to the input image and are aligned to the same channel dimension. Secondly, the feature maps are fused through top-down and bottom-up pathways, producing {M2, M3, M4, M5} and {N2, N3, N4, N5} respectively, and Mi and Ni are merged with the corresponding Pi by element-wise addition. Thirdly, the outputs Vi of PEFPN are upsampled to the same scale and concatenated to produce the feature map F. Finally, we use the progressive scale expansion algorithm [45] as post-processing to obtain the final results.
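As a concrete illustration of the last fusion step, the following minimal PyTorch sketch upsamples the PEFPN outputs to a common scale and concatenates them into the feature map F. The function name and the choice of nearest-neighbor upsampling are our own assumptions; the paper does not publish reference code.

```python
import torch
import torch.nn.functional as nnf

def aggregate(v2, v3, v4, v5):
    # Upsample V3-V5 to the spatial size of V2 (stride 4), then concatenate
    # along the channel axis to form the fused feature map F.
    size = v2.shape[2:]
    feats = [v2] + [nnf.interpolate(v, size=size, mode='nearest') for v in (v3, v4, v5)]
    return torch.cat(feats, dim=1)
```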

Fig. 1

The overall architecture of our detector. a ACNet, the backbone network designed for curve text detection. b PEFPN, the feature extraction and multi-scale fusion network; different colored circles indicate different levels of feature maps, the top-down pathway is visualized with blue arrows and the bottom-up pathway with red arrows. c The post-processing algorithm that obtains the final result

3.2 Adaptive convolution unit

To make the network adjust the size of its local receptive fields adaptively and improve its non-linear aggregation capability, we introduce an operation that can select the kernel size automatically. This operation is divided into three steps. We use only two parallel branches as an example; the design can easily be extended to multiple parallel branches. Figure 2 shows the architecture of the adaptive convolution unit. Next, we discuss each step in detail.

  1. Step 1:

    Given a feature map \( X \in {\mathbb{R}}^{H \times W \times C} \), we project it into multiple parallel branches with different kernel sizes for convolution operations, which can be formulated as two transformations, \( {\mathcal{F}}_1 \) mapping \( X \in {\mathbb{R}}^{H \times W \times C} \) to \( X_1 \in {\mathbb{R}}^{H \times W \times C} \) and \( {\mathcal{F}}_2 \) mapping \( X \in {\mathbb{R}}^{H \times W \times C} \) to \( X_2 \in {\mathbb{R}}^{H \times W \times C} \). We take \( {\mathcal{F}}_1 \) and \( {\mathcal{F}}_2 \) as two convolution operators; to prevent channel dependencies, we use depthwise/dilated convolution [30, 31] for feature extraction, followed by Batch Normalization [15] and ReLU [35] activation. The kernel sizes are 3 × 3 and 5 × 5; specifically, we use a 3 × 3 kernel with dilation 2 in place of the 5 × 5 kernel. After this stage, the network is able to attend to multi-scale features in the same layer.

  2. Step 2:

    To enable the network to improve its non-linear spatial aggregation capability, the kernels need to adaptively select their receptive field sizes according to different stimuli. We first fuse the features from the multiple branches by element-wise addition:

$$ {X}^{\prime }={X}_1+{X}_2, $$
(1)
Fig. 2

Adaptive Convolution Unit

Since each kernel has only a local receptive field, the output X′ is unable to exploit contextual information, which is essential for network sensitivity. Consequently, we obtain global spatial information via global average pooling to generate channel-wise statistics, denoted \( \overset{\sim }{X}\in {\mathbb{R}}^C \), by shrinking X′ through its 2D spatial dimensions H × W. Specifically, the c-th element \( {\overset{\sim }{X}}_c \) of \( \overset{\sim }{X} \) is calculated by:

$$ {\overset{\sim }{X}}_c=\frac{\sum \limits_{i=1}^H\sum \limits_{j=1}^W{X}_c^{\prime}\left(i,j\right)}{H\times W}, $$
(2)

To limit complexity, we introduce a dimensionality-reduction operation composed of two fully connected layers. In practice, we implement them with 1 × 1 convolutions, which can be simply defined as \( W_1 \in {\mathbb{R}}^{d \times C} \), \( W_{21} \in {\mathbb{R}}^{C \times d} \) and \( W_{22} \in {\mathbb{R}}^{C \times d} \):

$$ {S}_1={W}_{21}\alpha \left(\beta \left({W}_1\overset{\sim }{X}\right)\right),{S}_2={W}_{22}\alpha \left(\beta \left({W}_1\overset{\sim }{X}\right)\right), $$
(3)

where β refers to Batch Normalization [15] and α represents the ReLU [35] activation. We use d to control the compactness of \( S_i \in {\mathbb{R}}^{C \times 1} \); d takes the maximum of C/r and L, where r denotes the dimensionality-reduction ratio and L is the minimum value of d (L = 8):

$$ d=\mathit{\max}\left(\frac{C}{r},L\right), $$
(4)
  3. Step 3:

    In order to select multi-scale spatial information adaptively, we adopt SoftMax as the self-attention function:

$$ {\mu}_1=\frac{e^{S_1}}{e^{S_1}+{e}^{S_2}},{\mu}_2=\frac{e^{S_2}}{e^{S_1}+{e}^{S_2}}, $$
(5)

where \( \mu_1 \in {\mathbb{R}}^{C \times 1} \) and \( \mu_2 \in {\mathbb{R}}^{C \times 1} \) are the attention scores, which represent the sensitivity of the network to multi-scale spatial information. The final output \( \hat{X} \) is produced by channel-wise multiplication:

$$ \hat{X}={\mu}_1\cdotp {X}_1+{\mu}_2\cdotp {X}_2, $$
(6)
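For concreteness, the following PyTorch sketch implements the two-branch adaptive convolution unit of Steps 1–3 (Eqs. (1)–(6)). It is a minimal illustration under our own naming; the paper does not publish reference code, so details such as bias terms are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveConvUnit(nn.Module):
    def __init__(self, channels, r=16, L=8):
        super().__init__()
        d = max(channels // r, L)  # Eq. (4): width of the reduction layer
        # Step 1: two depthwise branches; the 5x5 receptive field is
        # realised as a 3x3 kernel with dilation 2.
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2,
                      groups=channels, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # Step 2: W1 with BN and ReLU (the two FC layers are 1x1 convs)
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, d, 1, bias=False),
            nn.BatchNorm2d(d), nn.ReLU(inplace=True))
        self.expand1 = nn.Conv2d(d, channels, 1)  # W21 in Eq. (3)
        self.expand2 = nn.Conv2d(d, channels, 1)  # W22 in Eq. (3)

    def forward(self, x):
        x1, x2 = self.branch3(x), self.branch5(x)         # Step 1
        s = (x1 + x2).mean(dim=(2, 3), keepdim=True)      # Eqs. (1)-(2): sum + GAP
        z = self.reduce(s)
        s1, s2 = self.expand1(z), self.expand2(z)         # Eq. (3)
        mu = torch.softmax(torch.stack([s1, s2]), dim=0)  # Eq. (5): branch attention
        return mu[0] * x1 + mu[1] * x2                    # Eq. (6)
```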

3.3 Backbone network design

Table 1 shows the structure of our proposed backbone ACNet. We start from EfficientNet [42] for three reasons: 1) it is a state-of-the-art network with high efficiency on classification; 2) compared with other networks, it has fewer parameters and FLOPs without losing accuracy; 3) the architecture of EfficientNet [42] was obtained by a reinforcement learning search algorithm rather than by manual design, so it has a better balance among depth, width and resolution. EfficientNet [42] is mainly composed of a stack of repeated mobile inverted bottleneck MBConv [38, 44] blocks, and we follow this design. Moreover, in order for the model to adaptively select its receptive field size, we use the Adaptive Convolution Unit instead of the depthwise convolution in each MBConv [38, 44] block, so that each block consists of a 1 × 1 expand convolution, an adaptive convolution unit and a 1 × 1 project convolution, as sketched below. The adaptive convolution unit imposes only a slight increase in parameters and computational cost.
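A hedged sketch of the modified MBConv block just described: the depthwise convolution of the original EfficientNet block is replaced by the AdaptiveConvUnit from the previous sketch. Stride handling, the squeeze-and-excitation part and the activation choices of the original MBConv are simplified for brevity, and the expansion ratio is an assumed default.

```python
import torch.nn as nn

class ACMBConv(nn.Module):
    def __init__(self, in_ch, out_ch, expand_ratio=6):
        super().__init__()
        mid = in_ch * expand_ratio
        self.expand = nn.Sequential(                 # 1x1 expand convolution
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.acu = AdaptiveConvUnit(mid)             # replaces the depthwise conv
        self.project = nn.Sequential(                # 1x1 project convolution
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.use_skip = (in_ch == out_ch)

    def forward(self, x):
        y = self.project(self.acu(self.expand(x)))
        return x + y if self.use_skip else y         # residual when shapes match
```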

Table 1 The architecture of the ACNet-B0 and ACNet-B3 networks with different channels and layers; each row refers to a stage i with \( {\hat{L}}_i \) MBConv [38, 44] blocks and input resolution \( {\hat{H}}_i\times {\hat{W}}_i \), and the MBConv [38, 44] block of each stage i includes the corresponding components listed in the operator column

As shown in Table 1, ACNet-B0 has seven stages with {1, 2, 2, 3, 3, 4, 1} MBConv [38, 44] blocks, respectively. In the adaptive convolution unit, K is the number of paths, which controls the number of choices of different kernels to be aggregated, and the dimensionality-reduction ratio r determines the number of parameters in Step 2 (see Eq. (4)). AC [6, 50] is the typical setting in ACNet. In addition, we design a deeper network based on EfficientNet-B3 [42] for better accuracy, named ACNet-B3, which has seven stages with {2, 3, 3, 5, 5, 6, 2} MBConv [38, 44] blocks.

3.4 Path enhancement feature pyramid

In the backbone network, low levels have larger feature maps and richer spatial details and are more likely to describe local textures and patterns. In contrast, high levels have smaller feature maps and respond strongly to entire text instances. Generally, localization is more sensitive to low-level features; for arbitrary-shaped text detection especially, the irregularity of text shapes requires the network to capture more sensitive edge information. However, features are restricted to one-way flow in FPN [23], so it is necessary to build a two-way path that propagates semantically strong features and enhances all features with reasonable classification capability. To address this problem, we further enhance the localization capability of the entire feature hierarchy by spreading the strong responses of low-level information: we build an extremely short path of fewer than 10 layers to prevent the loss of local features after a lengthy backbone.

We take the level 2–5 feature maps {P2, P3, P4, P5} from the backbone network and align them to the same channel dimension; they have strides of {4, 8, 16, 32} pixels with respect to the input image. Different from the traditional FPN [23], we reduce the dimension to 64 for efficiency, and then fuse the multi-scale features through the following steps:

  • Firstly, in the top-down path, we use the same approach as FPN [23]:

$$ {M}_i= Conv\left({Up}_{\times 2}\left({M}_{i+1}\right)+{P}_i\right), $$
(7)

where Mi denotes the i-th level fused feature map in the top-down path, i takes values in {2, 3, 4}, M5 is simply P5, and Up×2 refers to 2× up-sampling. To further improve efficiency, the Conv is a depthwise separable convolution [4] instead of a common 3 × 3 convolution. The structure is shown in Fig. 3a.

  • Secondly, we build a reverse path to propagate the low-level features up to the high levels:

$$ {N}_i= Conv\left({Down}_{\times 2}\left({N}_{i-1}\right)+{P}_i\right), $$
(8)

where Ni denotes the i-th level fused feature map in the bottom-up path, i takes values in {3, 4, 5}, N2 is simply P2, and Down×2 and Conv are depthwise separable convolutions [4] with stride 2 and 1, respectively. The architecture is shown in Fig. 3b.

  • Thirdly, after the above operations, we obtain two sets of feature maps, {M2, M3, M4, M5} and {N2, N3, N4, N5}. We introduce the original feature maps {P2, P3, P4, P5} at the same levels; Pi is then merged with the corresponding Mi and Ni by element-wise addition:

$$ {V}_i=\begin{cases}{M}_i+{P}_i, & i=2\\ {M}_i+{N}_i+{P}_i, & 3\le i\le 4\\ {N}_i+{P}_i, & i=5\end{cases}, $$
(9)

where Vi represents the i-th level of the final output. The operation is shown in Fig. 3c, and a code sketch of the whole fusion path is given below. With these optimizations, we name this efficient feature fusion network the path enhancement feature pyramid network (PEFPN).
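The sketch below implements the three fusion steps (Eqs. (7)–(9)) in PyTorch, assuming P2–P5 have already been reduced to 64 channels; all names are ours and the structure is an illustration rather than the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as nnf

def dw_sep_conv(ch, stride=1):
    # depthwise separable convolution [4]: depthwise 3x3 + pointwise 1x1
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, stride=stride, padding=1, groups=ch, bias=False),
        nn.Conv2d(ch, ch, 1, bias=False),
        nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

class PEFPN(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.td = nn.ModuleList([dw_sep_conv(ch) for _ in range(3)])      # Conv in Eq. (7)
        self.down = nn.ModuleList([dw_sep_conv(ch, 2) for _ in range(3)]) # Down in Eq. (8)
        self.bu = nn.ModuleList([dw_sep_conv(ch) for _ in range(3)])      # Conv in Eq. (8)

    def forward(self, p2, p3, p4, p5):
        # Top-down path, Eq. (7): M5 = P5, Mi = Conv(Up(M_{i+1}) + Pi)
        m5 = p5
        m4 = self.td[2](nnf.interpolate(m5, scale_factor=2) + p4)
        m3 = self.td[1](nnf.interpolate(m4, scale_factor=2) + p3)
        m2 = self.td[0](nnf.interpolate(m3, scale_factor=2) + p2)
        # Bottom-up path, Eq. (8): N2 = P2, Ni = Conv(Down(N_{i-1}) + Pi)
        n2 = p2
        n3 = self.bu[0](self.down[0](n2) + p3)
        n4 = self.bu[1](self.down[1](n3) + p4)
        n5 = self.bu[2](self.down[2](n4) + p5)
        # Final fusion, Eq. (9): merge with the original feature maps
        return m2 + p2, m3 + n3 + p3, m4 + n4 + p4, n5 + p5
```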

Fig. 3

The illustration of the path enhancement feature pyramid network. a Top-down path. b Bottom-up path. c Final feature fusion, where 3 ≤ i ≤ 4

3.5 Loss function

Binary cross entropy [5] is commonly used to optimize the network's weights. Nonetheless, text instances usually occupy only an extremely small region of a natural image, which biases the detector's predictions toward non-text regions. Thus, in order to obtain a better model, we utilize the dice coefficient [34] in the training stage, formulated as follows:

$$ L\left({D}_i,{G}_i\right)=\frac{2{\sum}_{x,y}\left({D}_{i,x,y}\times {G}_{i,x,y}\right)}{\sum_{x,y}{D}_{i,x,y}^2+{\sum}_{x,y}{G}_{i,x,y}^2}, $$
(10)

where Di,x,y denotes the value of pixel (x, y) in the detection result Di, and Gi,x,y the corresponding value in the ground truth Gi.

In addition, to distinguish patterns such as fences and lattices, which are similar to text strokes, we use Online Hard Example Mining (OHEM) [40] to improve the discernment of the detector. Let the training mask given by OHEM be O; the final loss can then be formulated as follows:

$$ {L}^{\prime }=1-L\left({D}_i\cdot O,{G}_i\cdot O\right), $$
(11)
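A minimal sketch of this objective (Eqs. (10)–(11)): the dice coefficient computed on the prediction and ground truth restricted to the OHEM mask O. How O is produced (e.g. keeping all positives plus the hardest negatives at a fixed ratio) is assumed to be handled elsewhere, and the small epsilon is our own addition for numerical stability.

```python
import torch

def ohem_dice_loss(pred, gt, ohem_mask, eps=1e-6):
    # pred, gt, ohem_mask: tensors of shape (B, H, W); pred in [0, 1]
    d = pred * ohem_mask                      # D_i . O
    g = gt * ohem_mask                        # G_i . O
    inter = (d * g).sum(dim=(1, 2))
    union = (d * d).sum(dim=(1, 2)) + (g * g).sum(dim=(1, 2))
    dice = (2 * inter + eps) / (union + eps)  # Eq. (10)
    return (1 - dice).mean()                  # Eq. (11), averaged over the batch
```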

4 Experiment

4.1 Datasets

CTW 1500 [24] is a popular and challenging dataset for arbitrarily curved text detection. It includes 1000 training images and 500 testing images. Different from conventional text datasets (e.g. ICDAR 2017 MLT, ICDAR 2015), each text instance in CTW 1500 is labelled with 14 points to describe its arbitrarily curved shape.

Total-Text [1] is also a newly released benchmark for long curve text detection. It consists of horizontal, multi-oriented and curve text instances. The benchmark is divided into a training set of 1255 images and a testing set of 300 images.

ICDAR 2015 [16] is a multi-oriented text detection benchmark. The scene text images in this dataset were taken by Google Glasses without careful positioning, so image quality and viewpoint vary. It contains a total of 1500 images, 1000 for training and 500 for testing. The text regions are labelled by the 4 vertices of a quadrangle.

MSRA-TD500 [49] is a commonly used text detection dataset with multi-oriented, multi-lingual and long text lines. It includes 300 training images and 200 testing images with text-line-level annotations. We follow these works [27, 32] and also train the model on HUST-TR400 [50], which includes 400 images.

4.2 Evaluation metrics

To evaluate the performance of our detector, we use the Precision (P) and Recall (R) measures that have long been utilized in the information retrieval field, together with the F-measure (F), which is obtained as follows:

$$ F=2\times \frac{P\times R}{P+R}, $$
(12)

where the precision and recall are calculated using the ICDAR 2015 intersection-over-union (IoU) metric [16], which is obtained for the j-th ground-truth and i-th detected bounding box as follows:

$$ \mathrm{IoU}=\frac{Area\left({G}_j\cap {D}_i\right)}{Area\left({G}_j\cup {D}_i\right)}, $$
(13)

where Gj and Di are as in Eqs. (10) and (11). A threshold IoU > t is used for counting a detection as correct.
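The counting logic behind Eqs. (12)–(13) can be summarized by the following simplified Python sketch, which greedily matches each detection to an unmatched ground truth at IoU > t. The polygon IoU computation itself (Eq. (13)) is passed in as `iou_fn`; in practice it would be computed from polygon areas, e.g. with a geometry library.

```python
def evaluate(gts, dets, iou_fn, t=0.5):
    matched, tp = set(), 0
    for d in dets:
        for j, g in enumerate(gts):
            if j not in matched and iou_fn(g, d) > t:  # Eq. (13) with threshold t
                matched.add(j)
                tp += 1
                break
    precision = tp / max(len(dets), 1)
    recall = tp / max(len(gts), 1)
    f = 2 * precision * recall / max(precision + recall, 1e-6)  # Eq. (12)
    return precision, recall, f
```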

4.3 Implementation details

We train our model from scratch with batch size 8 on 2 GPUs for 72K iterations; the initial learning rate is set to 1 × 10−3 and is divided by 10 at 24K and 48K iterations. All the networks are optimized using stochastic gradient descent (SGD). Note that no extra dataset is used during training. We use a weight decay of 5 × 10−4 and a Nesterov momentum [41] of 0.99 without dampening. We adopt the weight initialization introduced by [8].
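This schedule maps directly onto a standard PyTorch setup; the following hedged sketch assumes `model` is the full detector and that `scheduler.step()` is called once per iteration.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.99, nesterov=True,  # Nesterov momentum, no dampening
                            weight_decay=5e-4)
# divide the learning rate by 10 at 24K and 48K of the 72K iterations
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[24_000, 48_000], gamma=0.1)
```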

During training, three data augmentation strategies are adopted for all datasets: 1) randomly rescaling images with a ratio from {0.5, 1.0, 2.0, 3.0}; 2) randomly rotating images within the range [−10°, 10°]; 3) re-sizing all images to 640 × 640 for better efficiency. For quadrangular text, we extract the bounding boxes by computing the minimal-area rectangle. For curve text datasets, we use the result of the progressive scale expansion algorithm [45] as the final output.
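A sketch of the three augmentations, assuming OpenCV and image-only processing; in a real pipeline the ground-truth polygons would be transformed with the same parameters. All helper names are ours.

```python
import random
import cv2

def augment(img):
    # 1) random rescale with a ratio from {0.5, 1.0, 2.0, 3.0}
    s = random.choice([0.5, 1.0, 2.0, 3.0])
    img = cv2.resize(img, None, fx=s, fy=s)
    # 2) random rotation in [-10, 10] degrees about the image center
    angle = random.uniform(-10, 10)
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, m, (w, h))
    # 3) fixed 640 x 640 training resolution
    return cv2.resize(img, (640, 640))
```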

During inference, for all test images, we set a suitable width and re-scale the height according to the aspect ratio. We use a batch size of 1 and a single GTX 1080 Ti GPU to evaluate the inference speed (i.e. FPS) in a single thread. The reported inference speed includes both the model forward time and the post-processing time.

4.4 Ablation study

To investigate the effectiveness of our proposed modules, we conduct an ablation study on ICDAR 2015 and CTW 1500, which are a quadrangle text dataset and a curve text dataset, respectively. Note that all experiments use no external dataset.

Different kernels and different branches

In Section 3.2 we take only two kernel sizes as an example; to explore the effect of combining different kernels and different numbers of branches, in Table 2 we use three different kernels: 3 × 3 denotes the 3 × 3 depthwise convolution, 5 × 5 denotes the 3 × 3 depthwise convolution with dilation 2, and 7 × 7 denotes the 3 × 3 depthwise convolution with dilation 3. We use only ACNet-B0 as the backbone. If \( \hat{X} \) is ticked, the attention mechanism that produces the output of the Adaptive Convolution Unit is used; otherwise we simply add up the feature maps without SoftMax attention. Considering the efficiency of the model, we did not use convolution kernels with larger receptive fields.

Table 2 Results of backbone with different combinations of multiple kernels

As shown in Table 2, we draw the following conclusions: 1) as the number of branches N increases, the F-measure generally increases; 2) using the attention mechanism performs better than simple addition; 3) with the attention mechanism, the gain from N = 2 to N = 3 is slight. For better efficiency, N = 2 is used.

The effectiveness of PEFPN

We design a set of comparative experiments to verify the effectiveness of PEFPN. For fair comparison, we employ both our own network and ResNet-50 [9] as the backbone, and then add the original FPN [23] and PEFPN after each of the two backbones. From Table 3, we can see that PEFPN improves the F-measure by about 0.8% and 1.4% with ACNet-B0 and by about 1.0% and 0.9% with ResNet-50 [9] on the two datasets. This indicates that regardless of which backbone is used, PEFPN performs better than the original FPN [23].

Table 3 Result of with or without PEFPN, “F” means F-measure

The influence of the original feature maps

In PEFPN, we add the original feature map at the same level to the fused feature map as the final output (i.e. Pi in Eq. (9)). To verify the influence of the original feature map on the detection results, we remove it when fusing features. From Table 4 we find that without the original feature map, the F-measure decreases. Hence, it is necessary to add the original feature map in the final fusion stage; meanwhile, this operation does not bring much extra computational cost.

Table 4 Result of whether the model uses the original feature map, Pi denotes the original feature map at the i-th level, “F” means F-measure

The influence of the backbone

To better analyze the capability of our model, we adopt ACNet-B0 and ACNet-B3 as the backbone, respectively. As shown in Table 5, with the same settings, replacing the backbone with the deeper one clearly improves performance: ACNet-B0 is suited to faster inference and ACNet-B3 to better accuracy.

Table 5 Result of model with different backbone, “F” means F-measure

4.5 Comparisons with state-of-the-art methods

For efficiency, we train only on a single dataset and do not use the pre-training strategy on extra datasets adopted by PSENet [45]. It is worth mentioning that, even when comparing only the results of training on a single dataset, our method surpasses PSENet [45].

Curve text datasets detection

To evaluate the ability of curve text detection, we test our model on CTW 1500 and Total-Text, which mainly include curve texts. During the inference stage, we re-scale the longer side of the images to 1280 and evaluate the results using the same evaluation method as [45]. We report the single-scale performance of our model on CTW 1500 and Total-Text in Tables 6 and 7, respectively.

Table 6 The single-scale results on CTW 1500
Table 7 The single-scale results on Total-Text

On CTW 1500, without external data pre-training, our model achieves an F-measure of 79.5% with ACNet-B0 and 80.8% with ACNet-B3. When using ACNet-B3, the performance surpasses most of the counterparts; notably, the F-measure of our model with ACNet-B3 is 2.8% higher than that of PSENet [45], which was published at CVPR 2019.

On Total-Text, similar conclusions can be drawn. Without external data pre-training, our model with ACNet-B0 not only surpasses PSENet [45] in F-measure (79.7%) but also leads in speed (5.2 FPS). Our model with ACNet-B3 outperforms the previous state-of-the-art method with an F-measure of 80.9%.

The performance on CTW 1500 and Total-Text demonstrates the solid superiority of our method in detecting arbitrary-shaped text instances. We illustrate several detection results in Figs. 4 and 5, which clearly show that our method can elegantly distinguish complex curve text instances.

Fig. 4

Some visualization results on CTW 1500

Fig. 5

Some visualization results on Total-Text

Oriented text datasets detection

We evaluate our method on ICDAR 2015 to test its ability for oriented text detection. As in the previous experiments, we adopt ACNet-B0 and ACNet-B3 as the backbone of our model. In the test stage, we scale the long side of the images to 2240. Table 8 shows the results compared with other state-of-the-art methods. Our model with ACNet-B0 achieves an F-measure of 81.5% at 2.8 FPS; both F-measure and speed surpass PSENet [45]. When using ACNet-B3, although the speed is not as high as that of some methods, our model improves the F-measure significantly, by over 2.2%. Moreover, we show some test illustrations in Fig. 6; our method can accurately detect text instances with arbitrary orientations.

Table 8 The single-scale results on ICDAR 2015
Fig. 6

Some visualization results on ICDAR 2015

Multi-lingual text datasets detection

To prove the robustness of our method to multiple languages and long straight text lines, we test our model on the MSRA-TD500 dataset. For fair comparison, we also resize the longer edge of the test images to 2240, as for ICDAR 2015. As shown in Table 9, our model achieves F-measures of 81.6% and 82.6% with the two backbones. Compared with other methods, our model indeed obtains a slight improvement. This proves that our method is robust for multi-lingual and long straight text detection and can be deployed in complex natural scenarios. We also show some results in Fig. 7.

Table 9 The single-scale results on MSRA-TD500
Fig. 7

Some visualization results on MSRA-TD500

4.6 Effectiveness of ACPEPNet

To demonstrate the effectiveness of our proposed method, Fig. 8 provides a visualization of the feature maps from the network. In the encoding stage, owing to the Adaptive Convolution Unit, the ability to aggregate multi-scale spatial information is enhanced and the responses in text areas are more sensitive. In the decoding stage, PEFPN allows texts of different scales to be effectively retained, which indicates that the text features are not lost in the bottom-up path.

Fig. 8

Visualization of the feature maps and binary maps from the network

5 Conclusion

In this paper, we focus on two design principles for text detection tasks: 1) larger receptive fields and 2) finer low-level features, and propose an efficient detector for arbitrary-shaped text detection. Firstly, we design a set of feature extraction networks with EfficientNet as the baseline, named ACNet-B0 and ACNet-B3. These backbones embed the Adaptive Convolution Unit, which enables the network to adaptively adjust its receptive field, enhancing the ability to aggregate multi-scale spatial information while adding only minor computation. Then, in order for FPN [23] to obtain finer low-level features during the feature fusion stage, we redesign the information flow path of the original FPN [23], changing it from one-way to two-way flow and adding the original features to the final stage of information fusion; this design shortens the distance from low-level features to the top level while introducing more original features. Experiments on scene text detection datasets demonstrate performance superior to previous methods. In the future, we will continue to explore the optimization of text detection in terms of real-time performance and learnable post-processing.