
1 Introduction

Reading text from natural scenes is one of the most indispensable abilities for building an automated machine with high-level intelligence. This explains why researchers in the computer vision community have diligently explored and investigated this complex and challenging task for decades. Scene text recognition (STR) involves decoding textual content from natural images (usually cropped sub-images), and is a key component of text reading pipelines.

Previously, a number of methods [5, 30, 39, 41] were proposed to address the problem of scene text recognition. Recently, a new trend has emerged in which linguistic knowledge is introduced into the text recognition process. SRN [53] devised a global semantic reasoning module (GSRM) to model global semantic context. ABINet [9] proposed a bidirectional cloze network (BCN) as the language model to learn bidirectional feature representations. Both SRN and ABINet adopt an independent, separate language model to capture rich language priors.

Fig. 1. Pipelines of classic CNN-based, ViT-based and the proposed MGP-STR scene text recognition methods are illustrated in (a), (b) and (c), respectively. (d) Examples of Character, BPE and WordPiece subword tokenization. (Best viewed in color.)

In this paper, we propose to integrate linguistic knowledge in an implicit way for scene text recognition. Specifically, we first construct a pure vision STR model based on ViT [8] and a tailored Adaptive Addressing and Aggregation (A\(^3\)) module inspired by TokenLearner [36]. This model serves as a strong baseline, which already achieves better performance than previous methods for scene text recognition, according to our experimental comparisons. To further exploit linguistic knowledge to enhance the vision STR model, we explore a Multi-Granularity Prediction (MGP) strategy to inject information from the language modality. The output space of the model is expanded so that subword representations (BPE and WordPiece) are introduced, i.e., the augmented model produces two extra subword-level predictions besides the original character-level prediction. Notably, there is no independent, separate language model. In the training phase, the resultant model (named MGP-STR) is optimized with a standard multi-task learning paradigm (three losses for three types of predictions), and the linguistic knowledge is naturally integrated into the ViT-based STR model. In the inference phase, the three types of predictions are fused to give the final result. Experiments on standard benchmarks verify that the proposed MGP-STR algorithm achieves state-of-the-art performance. Another advantage of MGP-STR is that it does not involve iterative refinement, which can be time-consuming at inference time. The pipeline of the proposed MGP-STR algorithm, as well as those of previous CNN-based and ViT-based methods, is shown in Fig. 1. In a nutshell, the major difference between MGP-STR and other methods is that it generates three types of predictions, representing textual information at different granularities: from individual characters to short character combinations, and even whole words.

The contributions of this work are summarized as follows: (1) We construct a pure vision STR model, which combines ViT with a specially designed A\(^3\) module; it already outperforms existing methods. (2) We explore an implicit way of incorporating linguistic knowledge by introducing subword representations to facilitate multi-granularity prediction, and demonstrate that an independent language model (as used in SRN and ABINet) is not indispensable for STR models. (3) The proposed MGP-STR algorithm achieves state-of-the-art performance.

2 Related Work

Scene Text Recognition (STR) has long been a subject of attention and research [4, 28, 58]. With the popularity of deep learning methods [13, 21, 42], their effectiveness in the field of STR has been extensively verified. Depending on whether linguistic information is exploited, we roughly divide STR methods into two categories, i.e., language-free and language-augmented methods.

2.1 Language-Free STR Methods

The mainstream approach to image feature extraction in STR methods is the CNN [13, 42]. For example, earlier STR methods [21, 39, 40] utilize VGG, while current STR methods [2, 3, 26, 48] employ ResNet [13] for better performance. Based on these powerful CNN features, various methods [25, 33, 57] have been proposed to tackle the STR problem. CTC-based methods [14, 15, 26, 39, 46] use Connectionist Temporal Classification (CTC) [10] to accomplish sequence recognition. Segmentation-based methods [23, 24, 45, 47] cast STR as a semantic segmentation problem.

Inspired by the great success of the Transformer [44] in natural language processing (NLP) tasks, the application of Transformers to STR has also attracted increasing attention. The Vision Transformer (ViT) [8], which directly processes image patches without convolutions, marked the beginning of using Transformer blocks instead of CNNs to solve computer vision problems [27, 52], leading to prominent results. ViTSTR [1] simply leverages the feature representations of the last layer of ViT for parallel character decoding. In general, language-free methods often fail to recognize low-quality images due to the lack of language information.

2.2 Language-Augmented STR Methods

Obviously, language information is beneficial for the recognition of low-quality images. RNN-based methods [21, 39, 48] can effectively capture the dependency between sequential characters and can thus be regarded as implicit language models. However, they cannot execute decoding in parallel during training and inference. Recently, Transformer blocks have been introduced into CNN-based frameworks to facilitate language content learning. SRN [53] proposes a Global Semantic Reasoning Module (GSRM) to capture global semantic context through multiple parallel transmissions. ABINet [9] presents a Bidirectional Cloze Network (BCN) to explicitly model language information, which is further used for iterative correction. VisionLAN [51] proposes a visual reasoning module that simultaneously captures visual and language information by masking input images at the feature level. The above-mentioned approaches utilize a specific module to integrate language information. Meanwhile, most works [9, 16] capture semantic information only at the character or word level. In this paper, we utilize multi-granularity (character, subword and even word) semantic information based on BPE and WordPiece tokenizations.

Fig. 2. The architecture of the proposed MGP-STR algorithm.

3 Methodology

The overview of the proposed MGP-STR method is depicted in Fig. 2. It is mainly built upon the original Vision Transformer (ViT) model [8]. We propose a tailored Adaptive Addressing and Aggregation (A\(^3\)) module to select a meaningful combination of tokens from ViT and integrate them into one output token corresponding to a specific character, denoted as the Character A\(^3\) module. Moreover, subword classification heads based on a BPE A\(^3\) module and a WordPiece A\(^3\) module are devised for subword predictions, so that language information can be modelled implicitly. Finally, these multi-granularity predictions are merged via a simple and effective fusion strategy.

3.1 Vision Transformer Backbone

The fundamental architecture of MGP-STR is the Vision Transformer [8, 43], where the original image patches are directly utilized for image feature extraction via linear projection. As shown in Fig. 2, an input RGB image \( \textbf{x} \in \mathbb {R}^{H \times W \times C} \) is split into non-overlapping patches. Concretely, the image is reshaped into a sequence of flattened 2D patches \( \textbf{x}_p \in \mathbb {R}^{N \times (P^2 C)} \), where \((P \times P) \) is the resolution of each image patch and \((P^2 C)\) is the number of feature channels of \( \textbf{x}_p\). In this way, a 2D image is represented as a sequence with \(N = HW/P^2\) tokens, which serve as the effective input sequence of the Transformer blocks. Then, the tokens of \( \textbf{x}_p\) are linearly projected into D-dimensional patch embeddings. As in the original ViT [8] backbone, a learnable [class] token embedding of dimension D is prepended to the patch embeddings, and position embeddings are added to each patch embedding to retain positional information, where the standard learnable 1D position embedding is employed. Thus, the generation of the patch embedding vector is formulated as follows:

$$\begin{aligned} \begin{aligned} \textbf{z}_0=[\textbf{x}_{class}; \textbf{x}^1_p\textbf{E}; \textbf{x}^2_p\textbf{E}; \ldots ; \textbf{x}^N_p\textbf{E}] + \textbf{E}_{pos}, \end{aligned} \end{aligned}$$
(1)

where \(\textbf{x}_{class} \in \mathbb {R}^{ 1 \times D}\) is the [class] embedding, \(\textbf{E} \in \mathbb {R}^{ (P^2 C) \times D } \) is a linear projection matrix and \( \textbf{E}_{pos} \in \mathbb {R}^{ (N+1) \times D } \) is the position embedding.
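For concreteness, the following is a minimal PyTorch sketch of Eq. (1), assuming the hyper-parameter values given later in Sect. 4.2 (H = 32, W = 128, P = 4, D = 768); the module and variable names are ours and the initialization is left at its simplest, so this is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of Eq. (1): patch split, linear projection E, [class] token, position embedding E_pos."""
    def __init__(self, H=32, W=128, C=3, P=4, D=768):
        super().__init__()
        self.P = P
        self.N = (H // P) * (W // P)                          # number of patch tokens
        self.proj = nn.Linear(P * P * C, D)                   # E in Eq. (1)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, D))   # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, self.N + 1, D))  # E_pos

    def forward(self, x):                                     # x: (B, C, H, W)
        B = x.shape[0]
        # reshape into a sequence of flattened P x P patches: (B, N, P^2 * C)
        patches = x.unfold(2, self.P, self.P).unfold(3, self.P, self.P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, self.N, -1)
        tokens = self.proj(patches)                           # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed   # z_0: (B, N+1, D)
```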

Fig. 3. The detailed architectures of the three A\(^3\) modules.

The resultant feature sequence \(\textbf{z}_0 \in \mathbb {R}^{ (N+1) \times D} \) serves as the input of the Transformer encoder blocks, which are mainly composed of Multi-head Self-Attention (MSA), Layer Normalization (LN), a Multilayer Perceptron (MLP) and residual connections, as shown in Fig. 2. The Transformer encoder block is formulated as:

$$\begin{aligned} \begin{aligned}&\textbf{z}_{l}^{\prime }=\text {MSA} (\text {LN}(\textbf{z}_{l-1}))+ \textbf{z}_{l-1} \\&\textbf{z}_{l}=\text {MLP} (\text {LN}(\textbf{z}_{l}^{\prime }))+ \textbf{z}_{l}^{\prime }. \end{aligned} \end{aligned}$$
(2)

Here, L is the depth (number of blocks) of the Transformer and \( l=1 \ldots L \). The MLP consists of two linear layers with GELU activation. Finally, the output embedding \( \textbf{z}_{L} \in \mathbb {R}^{ (N+1) \times D }\) of the Transformer is utilized for subsequent text recognition.
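The block in Eq. (2) can be sketched as below. This is a generic pre-norm Transformer block (as in ViT/DeiT), not the authors' exact code; the head count and MLP expansion ratio are the usual DeiT-Base defaults, assumed here for illustration.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sketch of Eq. (2): pre-norm MSA and MLP with residual connections."""
    def __init__(self, D=768, heads=12, mlp_ratio=4.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(D)
        self.msa = nn.MultiheadAttention(D, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(D)
        self.mlp = nn.Sequential(
            nn.Linear(D, int(D * mlp_ratio)), nn.GELU(),
            nn.Linear(int(D * mlp_ratio), D),
        )

    def forward(self, z):                                   # z: (B, N+1, D)
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]    # MSA + residual
        z = z + self.mlp(self.ln2(z))                       # MLP + residual
        return z
```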

3.2 Adaptive Addressing and Aggregation (A\(^3\)) Modules

Traditional Vision Transformers [8, 43] usually prepend a learnable \(\textbf{x}_{class}\) token to the sequence of patch embeddings, which collects and aggregates the meaningful information and serves as the image representation for classifying the whole image. However, scene text recognition aims to produce a sequence of character predictions, where each character is only related to a small patch of the image. Thus, the global image representation \( \textbf{z}_{L}^0 \in \mathbb {R}^{ D } \) is inadequate for the text recognition task. ViTSTR [1] directly employs the first T tokens of \( \textbf{z}_{L} \) for text recognition, where T is the maximum text length. Unfortunately, the remaining tokens of \( \textbf{z}_{L} \) are not fully utilized.

In order to take full advantage of the rich information in the sequence \( \textbf{z}_{L} \) for text sequence prediction, we propose a tailored Adaptive Addressing and Aggregation (A\(^3\)) module to select a meaningful combination of tokens from \( \textbf{z}_{L} \) and integrate them into one token corresponding to a specific character. Specifically, we learn T tokens \(\textbf{Y} = [\textbf{y}_i]_{i=1}^{T}\) from the sequence \( \textbf{z}_{L} \) for the subsequent text recognition task. An aggregation function is thus formulated as \(\textbf{y}_i = A_i(\textbf{z}_{L})\), which converts the input \( \textbf{z}_{L} \) into a token vector \(\textbf{y}_i\), i.e., \(A_i: \mathbb {R}^{ (N+1) \times D} \mapsto \mathbb {R}^{ 1 \times D } \). T such functions are constructed for the sequential output of text recognition. Typically, the aggregation function \(A_i(\textbf{z}_{L})\) is implemented via a spatial attention mechanism [36] to adaptively select the tokens from \( \textbf{z}_{L} \) corresponding to the \(i_{th}\) character. Here, we employ a function \(\alpha _i(\textbf{z}_{L})\) followed by a softmax to generate a precise spatial attention mask \(\textbf{m}_i \in \mathbb {R}^{ (N+1) \times 1} \) from \(\textbf{z}_{L} \in \mathbb {R}^{ (N+1) \times D}\). Thus, each output token \(\textbf{y}_i \) of the A\(^3\) module is produced by

$$\begin{aligned} \begin{aligned}&\textbf{y}_{i}= A_i(\textbf{z}_{L}) = \textbf{m}_i^T \tilde{\textbf{z}}_{L} = \text {softmax}(\alpha _i(\textbf{z}_{L}))^T (\textbf{z}_{L}\textbf{U})^T.\\ \end{aligned} \end{aligned}$$
(3)

Here, \(\alpha _i(\cdot )\) is implemented by a group convolution with a single \(1 \times 1\) kernel, and \( \textbf{U}\in \mathbb {R}^{ D \times D}\) is a linear mapping matrix for learning the feature \( \tilde{\textbf{z}}_{L}\). Finally, the resulting tokens of the different aggregation functions are gathered together to form the final output tensor as follows:

$$\begin{aligned} \begin{aligned}&\textbf{Y}= [\textbf{y}_{1};\textbf{y}_{2};\ldots ;\textbf{y}_{T}] = [A_1(\textbf{z}_{L}); A_2(\textbf{z}_{L}); \ldots ;A_T(\textbf{z}_{L}) ].\\ \end{aligned} \end{aligned}$$
(4)

Owing to the effective and efficient A\(^3\) module, the ultimate representation of the text sequence is denoted as \(\textbf{Y} \in \mathbb {R}^{ T \times D}\) in Eq. (4). Then, a character classification head is built via \( \textbf{G} = \textbf{YW}^T \in \mathbb {R}^{ T \times K} \) for text sequence recognition, where \(\textbf{W} \in \mathbb {R}^{ K \times D}\) is a linear mapping matrix, K is the number of categories and \( \textbf{G} \) denotes the classification logits. We refer to this module as Character A\(^3\) for character-level prediction; its detailed structure is illustrated in Fig. 3(a).
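A minimal sketch of the Character A\(^3\) module described by Eqs. (3)–(4) follows. Note that the paper implements \(\alpha_i(\cdot)\) as a group convolution with one \(1 \times 1\) kernel; for brevity the sketch produces all T masks with a single plain \(1 \times 1\) convolution, which is our simplification, and the default sizes (D = 768, T = 27, K = 38) are taken from Sect. 4.2.

```python
import torch.nn as nn

class A3Module(nn.Module):
    """Sketch of Eqs. (3)-(4): T spatial masks over the N+1 tokens aggregate z_L into T output tokens."""
    def __init__(self, D=768, T=27, K=38):
        super().__init__()
        self.alpha = nn.Conv1d(D, T, kernel_size=1)   # alpha_i(.), simplified to one 1x1 conv for all T masks
        self.U = nn.Linear(D, D, bias=False)          # linear mapping U
        self.head = nn.Linear(D, K)                   # classification head W

    def forward(self, z_L):                           # z_L: (B, N+1, D)
        attn = self.alpha(z_L.transpose(1, 2))        # (B, T, N+1)
        m = attn.softmax(dim=-1)                      # spatial attention masks m_i
        Y = m @ self.U(z_L)                           # (B, T, D), Eq. (4)
        G = self.head(Y)                              # (B, T, K), classification logits
        return G, m
```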

3.3 Multi-granularity Predictions

Character tokenization, which simply splits text into characters, is commonly used in scene text recognition methods. However, this naive and standard way ignores the linguistic information of the text. In order to effectively resort to linguistic information for scene text recognition, we incorporate the subword [20] tokenization mechanism from NLP [7] into our text recognition method. Subword tokenization algorithms aim to decompose rare words into meaningful subwords while keeping frequently used words intact, so that the grammatical information of a word is already captured in its subwords. Meanwhile, since the A\(^3\) module is independent of the Transformer encoder backbone, we can directly add extra parallel subword A\(^3\) modules for subword predictions. In such a way, language information can be implicitly injected into model learning for better performance. Notably, previous methods, i.e., SRN [53] and ABINet [9], design an explicit Transformer module for language modelling, whereas we cast the linguistic information encoding problem as a character and subword prediction task without an explicit language model.

Specifically, we employ two subword tokenization algorithms, Byte-Pair Encoding (BPE) [38] and WordPiece [37], to produce various subword combinations, as shown in Fig. 1(d). Thus, a BPE A\(^3\) module and a WordPiece A\(^3\) module are proposed for subword attention, and two subword-level classification heads are used for subword predictions. Since subwords can be whole words (such as “coffee” in WordPiece), subword-level and even word-level predictions can be generated by the BPE and WordPiece classification heads. Together with the original character-level prediction, we denote these various outputs as multi-granularity predictions for text recognition. In this way, the character-level prediction guarantees the fundamental recognition accuracy, while subword-level or word-level predictions serve as complementary results for noisy images via linguistic information.
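To illustrate how the two tokenizations split a word, the snippet below uses off-the-shelf HuggingFace tokenizers. The vocabulary sizes reported in Sect. 4.2 (50,257 and 30,522) match the GPT-2 BPE and BERT-base WordPiece vocabularies, so we assume those tokenizers here for illustration; whether these exact pretrained vocabularies are the ones used is our assumption.

```python
from transformers import GPT2Tokenizer, BertTokenizer

# BPE (GPT-2 vocabulary, 50,257 tokens) and WordPiece (BERT-base, 30,522 tokens), assumed vocabularies.
bpe = GPT2Tokenizer.from_pretrained("gpt2")
wp = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["coffee", "watercourse"]:
    # Actual splits depend on the learned vocabulary: frequent words tend to stay whole,
    # rare words are decomposed into subwords (WordPiece marks continuations with "##").
    print(word, bpe.tokenize(word), wp.tokenize(word))
```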

Technically, the architecture of the BPE or WordPiece A\(^3\) module is the same as that of the Character one; the modules are independent of each other with different parameters. The numbers of categories differ across classification heads, depending on the vocabulary size of each tokenization method. The cross-entropy loss is employed for classification. Additionally, the mask \(\textbf{m}_{i}\) precisely indicates the attention location of the \(i_{th}\) character in the Character A\(^3\) module, while it only roughly indicates the \(i_{th}\) subword region of the image in the subword A\(^3\) modules, due to the higher complexity and uncertainty of learning subword splitting.
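Since the three heads are trained jointly with cross-entropy losses, the overall objective can be sketched as below. Equal weighting of the three losses and ignoring padding positions are our assumptions; the paper only states that a standard multi-task paradigm with three losses is used (cf. Sect. 1), and the padding index is a placeholder that differs per vocabulary.

```python
import torch.nn.functional as F

def mgp_loss(char_logits, bpe_logits, wp_logits,
             char_tgt, bpe_tgt, wp_tgt, pad_id=0):
    """Sketch of the three-head training objective: one cross-entropy per granularity."""
    # logits: (B, T, K_head); targets: (B, T) label indices per tokenization.
    # pad_id marks padding positions (placeholder index; it differs per vocabulary).
    def ce(logits, tgt):
        return F.cross_entropy(logits.flatten(0, 1), tgt.flatten(),
                               ignore_index=pad_id)
    # Equal weights for the three granularities (our assumption).
    return ce(char_logits, char_tgt) + ce(bpe_logits, bpe_tgt) + ce(wp_logits, wp_tgt)
```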

3.4 Fusion Strategy for Multi-granularity Results

Multi-granularity predictions (Character, BPE and WordPiece) are generated by different A\(^3\) modules and classification heads. Thus, a fusion strategy is required to merge these results. At first, we attempted to fuse multi-granularity information at the feature level by aggregating the text features \(\textbf{Y}\) output by the different A\(^3\) modules. However, since these features come from different granularities, the \(i_{th}\) token \(\textbf{y}_i^{char}\) at the character level is not aligned with the \(i_{th}\) token \(\textbf{y}_i^{bpe}\) (or \(\textbf{y}_i^{wp}\)) at the BPE (or WordPiece) level, so these features cannot simply be added for fusion. Meanwhile, even if we concatenate the features as [\(\textbf{Y}_i^{char}, \textbf{Y}_i^{bpe}, \textbf{Y}_i^{wp}\)], only a single character-level head can be used for the final prediction. The subword information would be greatly impaired in this way, resulting in little improvement.

Therefore, a decision-level fusion strategy is employed in our method. However, perfectly fusing these predictions is a challenging problem [11]. We thus propose a compromised but efficient fusion strategy based on the prediction confidences. Specifically, the recognition confidence of each character or subword can be obtained from the corresponding classification head. Then, we present two fusion functions \(f(\cdot )\) to produce the final recognition score based on these atomic confidences:

$$\begin{aligned} f_{Mean}([c_1, c_2, \ldots , c_{eos}])= \frac{1}{n} \sum _{i=1}^{eos} c_i, \end{aligned}$$
(5)
$$\begin{aligned} f_{Cumprod}([c_1, c_2, \ldots , c_{eos}])= \prod _{i=1}^{eos}\ c_i. \end{aligned}$$
(6)

We only consider the confidences of valid characters or subwords and the ending symbol eos, and ignore the padding symbol pad. The “Mean” recognition score is generated by the mean value function as in Eq. (5), and “Cumprod” denotes the score produced by the cumulative product function in Eq. (6). Then, three recognition scores from the three classification heads can be obtained for one image via \(f(\cdot )\). We simply pick the prediction with the highest recognition score as the final result.
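The decision-level fusion of Eqs. (5)–(6) reduces to a few lines. The sketch below assumes that the decoded string and the per-token confidences (padding excluded, eos included) of each head are already available; the dictionary keys are our naming.

```python
import math

def mean_score(confs):
    """Eq. (5): average of valid-token confidences [c_1, ..., c_eos]."""
    return sum(confs) / len(confs)

def cumprod_score(confs):
    """Eq. (6): cumulative product of valid-token confidences."""
    return math.prod(confs)

def fuse(preds, confs, score_fn=cumprod_score):
    """Pick the head whose prediction has the highest recognition score."""
    # preds / confs are keyed by head name, e.g. {'char': ..., 'bpe': ..., 'wp': ...}
    scores = {k: score_fn(confs[k]) for k in preds}
    best = max(scores, key=scores.get)
    return preds[best], scores
```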

4 Experiment

4.1 Datasets

For fair comparison, we use MJSynth [16, 17] and SynthText [12] as training data. MJSynth contains 9M realistic text images and SynthText includes 7M synthetic text images. The test data consist of “regular” and “irregular” datasets. The “regular” datasets are mainly composed of horizontally aligned text images. IIIT 5K-Words (IIIT) [31] consists of 3,000 images collected from the web. Street View Text (SVT) [49] contains 647 test images. ICDAR 2013 (IC13) [19] contains 1,095 images cropped from mall pictures; we evaluate on 857 of them, discarding images that contain non-alphanumeric characters or fewer than three characters. The text instances in the “irregular” datasets are mostly curved or distorted. ICDAR 2015 (IC15) [18] includes 2,077 images collected with Google Glass; we use 1,811 images, excluding some extremely distorted ones. Street View Text-Perspective (SVTP) [32] contains 639 images collected from Google Street View. CUTE80 (CUTE) [35] consists of 288 curved images.

4.2 Implementation Details

Model Configuration. MGP-STR is built upon the DeiT-Base model [43], which is composed of 12 stacked Transformer blocks. For each layer, the number of heads is 12 and the embedding dimension D is 768. More importantly, square \( 224 \times 224\) images [1, 8, 43] are not adopted in our method; the height H and width W of the input image are set to 32 and 128. The patch size P is set to 4, and thus \(N=8 \times 32 =256\) patch tokens plus one [class] token are produced, i.e., \(\textbf{z}_L \in \mathbb {R}^{257\times 768}\). The maximum length T of the output sequence \(\textbf{Y}\) of the A\(^3\) module is set to 27. The vocabulary size K of the Character classification head is set to 38, covering \(0-9\), \(a-z\), the padding symbol pad and the ending symbol eos. The vocabulary sizes of the BPE and WordPiece heads are 50,257 and 30,522, respectively.
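The shape bookkeeping implied by this configuration can be summarized as follows; the snippet only restates the values given above.

```python
# Shape bookkeeping for the configuration above (values from Sect. 4.2).
H, W, P, D = 32, 128, 4, 768          # input size, patch size, embedding dimension
N = (H // P) * (W // P)               # 8 * 32 = 256 patch tokens
seq_len = N + 1                       # plus one [class] token -> 257
T, K_char = 27, 38                    # max output length, character categories
K_bpe, K_wp = 50257, 30522            # BPE and WordPiece vocabulary sizes
assert (N, seq_len) == (256, 257)
```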

Model Training. The pretrained weights of DeiT-Base [43] are loaded as the initial weights, except for the patch embedding module, due to the inconsistent patch sizes. Common data augmentation methods [6] for text images are applied, such as perspective distortion, affine distortion, blur, noise and rotation. We use 2 NVIDIA Tesla V100 GPUs to train our model with a batch size of 100. The Adadelta [55] optimizer is employed with an initial learning rate of 1. The learning rate decay strategy is cosine annealing [29], and training lasts 10 epochs.
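A hedged sketch of this optimization setup is given below. Only the optimizer, initial learning rate, scheduler, batch size and number of epochs come from the paper; the data interface, function names and the use of the earlier mgp_loss sketch are our assumptions.

```python
import torch

def train(model, train_loader, loss_fn, epochs=10):
    """Training-loop sketch: Adadelta (lr=1) with cosine annealing over 10 epochs."""
    # loss_fn: e.g. the three-head cross-entropy sketch from Sect. 3.3
    optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for images, char_tgt, bpe_tgt, wp_tgt in train_loader:   # batch size 100 in the paper
            optimizer.zero_grad()
            char_logits, bpe_logits, wp_logits = model(images)   # three A^3 heads (assumed interface)
            loss = loss_fn(char_logits, bpe_logits, wp_logits,
                           char_tgt, bpe_tgt, wp_tgt)
            loss.backward()
            optimizer.step()
        scheduler.step()
```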

Table 1. The ablation study of the proposed vision model and the accuracy comparisons with some SOTA methods based on only vision information.

4.3 Discussions on Vision Transformer and A\(^3\) Modules

We analyse the influence of the patch size of the Vision Transformer and the effectiveness of the A\(^3\) module in the proposed MGP-STR method (see Table 1). MGP-STR\(_{P=16}\) denotes the model that simply uses the first T tokens of \(\textbf{z}_L\) for text recognition, as in ViTSTR [1], where the input image is resized to \(224 \times 224\) and the patch size is set to \(16 \times 16\). In order to retain the significant information of the original text image, \(32 \times 128\) images with \(4 \times 4\) patches are employed in MGP-STR\(_{P=4}\). MGP-STR\(_{P=4}\) outperforms MGP-STR\(_{P=16}\), which indicates that the standard image size of ViT [8, 43] is not well suited to text recognition. Thus, \(32 \times 128\) images with \(4 \times 4\) patches are used in MGP-STR.

When the Character A\(^3\) module is introduced into MGP-STR, denoted as MGP-STR\(_{Vision}\), the recognition performance is further improved. MGP-STR\(_{P=16}\) and MGP-STR\(_{P=4}\) cannot fully learn and utilize all tokens, while the Character A\(^3\) module can adaptively aggregate the features of the last layer, resulting in more sufficient learning and higher accuracy. Meanwhile, compared with SOTA text recognition methods with CNN feature extractors, the proposed MGP-STR\(_{Vision}\) method achieves a substantial performance improvement.

Table 2. The accuracies of MGP-STR\(_{Fuse}\) with different fusion strategies.

4.4 Discussions on Multi-Granularity Predictions

Effect of Fusion Strategy. Since the subwords generated by subword tokenization methods contain grammatical information, we directly employ subwords as targets of our method to capture language information implicitly. As described in Sect. 3.3, two different subword tokenizations (BPE and WordPiece) are employed for complementary multi-granularity predictions. Besides the character prediction, we propose two fusion strategies, “Mean” and “Cumprod” as described in Sect. 3.4, to merge these three results. We denote the method that merges the three results as MGP-STR\(_{Fuse}\), and the accuracy of MGP-STR\(_{Fuse}\) with different fusion strategies is listed in Table 2. Additionally, the first line “Char” in Table 2 records the result of the character classification head in MGP-STR\(_{Fuse}\). Both “Mean” and “Cumprod” fusion strategies significantly improve the recognition accuracy over the single character-level result. Due to the better performance of the “Cumprod” strategy, we employ it as the fusion strategy in the following experiments.

Table 3. The accuracy results of the four variants of the MGP-STR model. “Char”, “BPE” and “WP” under “Output” denote the predictions of the Character, BPE and WordPiece classification heads in each model, respectively. “Fuse” denotes the fused results.

Effect of Subword Representations. We evaluate four variants of the MGP-STR model, and their performances are reported in Table 3, including the fused results and the results of each single classification head. Specifically, MGP-STR\(_{Vision}\) with only the Character A\(^3\) module already obtains promising results. MGP-STR\(_{C+B}\) and MGP-STR\(_{C+W}\) combine the Character A\(^3\) module with the BPE A\(^3\) module and the WordPiece A\(^3\) module, respectively. No matter which subword tokenization is used alone, the accuracy of “Fuse” exceeds that of “Char” in both MGP-STR\(_{C+B}\) and MGP-STR\(_{C+W}\). Notably, the performance of the “BPE” or “WP” classification head can be better than that of “Char” on the SVT and SVTP datasets within the same model. These results show that subword predictions can boost text recognition performance by implicitly introducing language information. Thus, MGP-STR\(_{Fuse}\) with three A\(^3\) modules can produce complementary multi-granularity predictions (character, subword and even word). By fusing these multi-granularity results, MGP-STR\(_{Fuse}\) obtains the best performance.

Comparison with BCN. The Bidirectional Cloze Network (BCN) is designed in ABINet [9] for explicit language modelling, and it leads to a favorable improvement over the pure vision model. We equip MGP-STR\(_{Vision}\) with BCN as a competitor of MGP-STR\(_{Fuse}\) to verify the advantage of multi-granularity predictions. Concretely, we first reduce the dimension of the representation feature \(\textbf{Y}\) from 768 to 512 for fusing with the output of BCN. Following the training setting in [9], the results are reported in Table 4. The accuracy of “V+L” improves over the pure vision prediction “V” in MGP-STR\(_{Vision}\)+BCN and is better than the original ABINet [9]. However, the performance of MGP-STR\(_{Vision}\)+BCN is slightly worse than that of MGP-STR\(_{Fuse}\). In addition, we provide the upper bound on the performance of MGP-STR\(_{Fuse}\), denoted as MGP-STR\(_{Fuse}^*\) in Table 4: if any one of the three predictions (“Char”, “BPE” and “WP”) is right, the final prediction is considered correct. The high score of MGP-STR\(_{Fuse}^*\) demonstrates the good potential of multi-granularity predictions. Moreover, MGP-STR\(_{Fuse}\) only requires two extra subword prediction heads, rather than the design of a specific and explicit language model as in [9, 53].

Table 4. The accuracy results of MGP-STR\(_{Vision}\) equipped with BCN and multi-granularity prediction. “V” represents the results of the pure vision output. “V+L” represents the results based on the both vision and language parts.
Table 5. The accuracy results of MGP-STR\(_{Fuse}\) with different ViT backbones.

4.5 Results with Different ViT Backbones

All of the MGP-STR models mentioned above are based on DeiT-Base [43]. We also adopt two smaller models, namely DeiT-Small and DeiT-Tiny as presented in [43], to further evaluate the effectiveness of the MGP-STR\(_{Fuse}\) method. Specifically, the embedding dimensions of DeiT-Small and DeiT-Tiny are reduced to 384 and 192, respectively. Table 5 records the results of each prediction head of the MGP-STR\(_{Fuse}\) method with different ViT backbones. Clearly, fusing multi-granularity predictions still improves on the pure character-level prediction for every backbone, and larger models achieve better performance with the same head. More importantly, the “Char” results of DeiT-Small and even DeiT-Tiny already surpass the SOTA pure CNN-based vision models (cf. Table 1). Therefore, MGP-STR\(_{Vision}\) with a small or tiny ViT backbone remains a competitive vision model, and multi-granularity prediction works well with different ViT backbones.

Table 6. The comparisons with SOTA methods on several public benchmarks.

4.6 Comparisons with State-of-the-Arts

We compare the proposed MGP-STR\(_{Vision}\) and MGP-STR\(_{Fuse}\) methods with previous state-of-the-art scene text recognition methods; the results on the 6 standard benchmarks are summarized in Table 6. All compared methods and ours are trained on the synthetic datasets MJ and ST for fair evaluation, and the results are obtained without any lexicon-based post-processing. Generally, language-aware methods (i.e., SRN [53], VisionLAN [51], ABINet [9] and MGP-STR\(_{Fuse}\)) perform better than language-free methods, showing the significance of linguistic information. Notably, MGP-STR\(_{Vision}\), without any language information, already outperforms the state-of-the-art method ABINet, which uses an explicit language model. Owing to multi-granularity prediction, MGP-STR\(_{Fuse}\) obtains even more impressive results, outperforming ABINet by \(0.7\%\) in average accuracy.

Table 7. The details of multi-granularity prediction of MGP-STR\(_{Fuse}\), including the scores of each prediction head, the intermediate multi-granularity (Gra.) results and the final prediction (Pred.). Best viewed in color.

4.7 Details of Multi-Granularity Predictions

We show the detailed prediction process of the proposed MGP-STR\(_{Fuse}\) method on 6 test images from the standard datasets (Table 7). In the first three images, the character-level predictions are incorrect, due to irregular fonts, motion blur and curved shapes, respectively. The scores of the character predictions are very low, since the images are difficult to recognize and one character is wrong in each image. However, the “BPE” and “WP” heads recognize the “table” image with high scores, and “BPE” makes correct predictions with two subwords on the “dvisory” and “watercourse” images, while “WP” is wrong on the “watercourse” image. After fusion, these mistakes are corrected. From the remaining three images, interesting phenomena can be observed. The predictions of “Char” and “BPE” conform to the images, whereas the predictions of “WP” attempt to produce strings with more linguistic content, such as “today” and “guide”. Generally, “Char” produces characters one by one, while “BPE” usually generates n-gram segments related to the image and “WP” tends to directly predict words that are linguistically meaningful. This shows that the predictions of different granularities convey text information in different aspects and are indeed complementary.

4.8 Visualization of Spatial Attention Maps of A\(^3\) Modules

Exemplar attention maps \( \textbf{m}_i \) of the Character, BPE and WordPiece A\(^3\) modules are shown in Fig. 4. The Character A\(^3\) module shows extremely precise addressing ability on a variety of text images. Specifically, for the “7” image with one character, the attention mask resembles the shape of “7”. For the “day” and “bar” images with three characters, the attention masks of the middle character “a” are completely different, verifying the adaptiveness of the A\(^3\) module. As depicted in Fig. 1(d) and Table 7, BPE tends to generate short segments, so the attention masks of BPE are split into 2 or 3 areas, as shown for the “leaves” and “academy” images. This is probably because performing subword splitting and character addressing simultaneously is difficult. Moreover, WordPiece often produces a whole word, so its attention maps should cover the whole feature map. Since the attention maps produced by the softmax function are usually sparse, the attention maps of WordPiece are not as appealing as those of the Character A\(^3\) module. These results are consistent with those in Table 3, where the accuracies of “BPE” and “WP” are relatively lower than that of “Char”, due to the difficulty of precise subword prediction.

Fig. 4. The illustration of spatial attention masks of the Character A\(^3\) module, BPE A\(^3\) module and WordPiece A\(^3\) module, respectively.

4.9 Comparisons of Inference Time and Model Size

Table 8. Comparisons on inference time and model size.

The model sizes and latencies of the proposed MGP-STR with different settings, as well as those of ABINet, are reported in Table 8. Since MGP-STR is equipped with a regular Vision Transformer (ViT) and involves no iterative refinement, its inference speed is fast: 12.3 ms with the ViT-Base backbone. Compared with ABINet, MGP-STR runs much faster (12.3 ms vs. 26.8 ms) while obtaining higher performance. The model size of MGP-STR is relatively large; however, a large portion of the model parameters comes from the BPE and WordPiece branches. For scenarios that are sensitive to model size or have limited memory, MGP-STR\(_{Vision}\) is an excellent choice.

5 Conclusion

We presented a ViT-based pure vision model for STR, which shows clear superiority in recognition accuracy. To further improve the recognition accuracy of this baseline model, we proposed a Multi-Granularity Prediction strategy to take advantage of linguistic knowledge. The resultant model achieves state-of-the-art performance on widely-used datasets. In the future, we will extend the idea of multi-granularity prediction to broader domains.