1 Introduction

Recognizing text in the wild has attracted great interest in computer vision (Ye and Doermann 2015; Zhu et al. 2016; Yang et al. 2017; Shi et al. 2018; Yang et al. 2019a). Recently, methods based on convolutional neural networks (CNNs) (Wang et al. 2012; Jaderberg et al. 2015, 2016) have significantly improved the accuracy of scene text recognition. Recurrent neural networks (RNNs) (He et al. 2016b; Shi et al. 2016, 2017) and attention mechanisms (Lee and Osindero 2016; Cheng et al. 2017, 2018; Yang et al. 2017) are also beneficial for recognition.

Nevertheless, recognizing text in natural images remains challenging and largely unsolved (Shi et al. 2018). As shown in Fig. 1, text appears in various scenes with complex backgrounds, which cause difficulties for recognition. For instance, complicated images often lead to attention drift (Cheng et al. 2017) in attention networks. Thus, if the complex background style is normalized to a clean one, the recognition difficulty decreases significantly.

Fig. 1 Examples of scene text with complex backgrounds, making recognition very challenging

With the development of GANs (Johnson et al. 2016; Cheng et al. 2019; Jing et al. 2019) in recent years, it has become possible to migrate the scene background of scene text images from a complex style to a clean one. However, vanilla GANs are not sufficiently robust to generate sequence-like characters in natural images (Fang et al. 2019). As shown in Fig. 2a, directly applying the off-the-shelf CycleGAN fails to retain some strokes of the characters. In addition, as reported by (Liu et al. 2018b), applying a similar idea of image recovery to normalize the backgrounds of sequence-like objects fails to generate clean images. As illustrated in Fig. 2b, some characters in the generated images are corrupted, which leads to misclassification. One possible reason is that the discriminator is designed for non-sequential objects with global coarse supervision (Zhang et al. 2019). Therefore, the generation of sequence-like characters requires more fine-grained supervision.

One potential solution is to employ pixel-wise supervision (Isola et al. 2017), which requires training samples paired at the pixel level. However, it is impossible to collect such paired training samples in the wild, and annotating scene text images with pixel-wise labels is intractably expensive. To address the lack of paired data, one could synthesize a large number of paired training samples, because synthetic data is cheap to obtain. This may be why most state-of-the-art scene text recognition methods (Cheng et al. 2018; Shi et al. 2018; Luo et al. 2019) use only synthetic samples (Jaderberg et al. 2014a; Gupta et al. 2016) for training, as tens of millions of training samples are immediately available. However, the experiments of (Li et al. 2019) suggest that there is still much room for improvement in synthesis engines: a recognizer trained using real data typically outperforms one trained using synthetic data by a significant margin, owing to the domain gap between artificial and real data. Thus, to enable broader application, our goal is to improve GANs to meet the requirements of text image generation and to address the unpaired data issue.

Fig. 2 Text content extraction of (a) CycleGAN, (b) (Liu et al. 2018b) and (c) our method. Our method uses character-level adversarial training and thus better preserves the strokes of every character and removes complex backgrounds

We propose an adversarial learning framework with an interactive joint training scheme, which achieves success in separating text content from background noises by using only source training images and the corresponding text labels. The framework consists of an attention-based recognizer and a generative adversarial architecture. We take advantage of the attention mechanism in the attention-based recognizer to extract the features of each character for further adversarial training. In contrast to global coarse supervisions, character-level adversarial training provides guidance for the generator in a fine-grained manner, which is critical to the success of our approach.

Our proposed framework is a meta-framework. Thus, recent mainstream recognizers (Cheng et al. 2018; Shi et al. 2018; Luo et al. 2019; Li et al. 2019) equipped with attention-based decoders (Bahdanau et al. 2015) can be integrated into it. As illustrated in Fig. 3, the attention-based recognizer predicts a mask for each character, which is shared with the discriminator. Thus, the discriminator is able to focus on every character and guide the generator to filter out various background styles while retaining the character content. Benefiting from the attention mechanism, the interactive joint training scheme requires only the images and the corresponding text labels, without character bounding box annotations. Meanwhile, the target style training samples can be simply synthesized online during training. As shown in Fig. 4, for each target style sample, we randomly choose one character and render it onto a clean background. Each sample contains a black character on a white background or a white character on a black background. The target style samples are character-level, whereas the input samples are word-level. These unpaired training samples make our training process flexible.

Fig. 3 Interactive joint training of our framework. The attention-based recognizer shares the position and prediction of every character with the discriminator, whereas the discriminator learns from the confusion of the recognizer, and guides the generator so that it can generate clear text content and clean background style to ease reading

Moreover, we take the interactive joint training scheme a step further. In addition to sharing attention masks, we propose a feedback mechanism that bridges the gap between the recognizer and the discriminator. The discriminator guides the generator according to the confusion of the recognizer. Thus, erroneous character patterns in the generated images are corrected. For instance, the patterns of the characters “C” and “G” are similar, which can easily cause the recognizer to fail. After training with our feedback mechanism, the generated patterns are more discriminative, and incorrect predictions on ambiguous characters can be largely avoided.

To summarize, our main contributions are as follows.

  1)

    We propose a framework that separates text content from complex background styles to reduce recognition difficulty. The framework consists of an attention-based recognizer and a generative adversarial architecture. We devise an interactive joint training of them, which is critical to the success of our approach.

  2)

    The shared attention mask enables character-level adversarial training. Thus, the unpaired target style samples can be simply synthesized online. The training of our framework requires only the images and corresponding text labels. Additional annotations such as bounding boxes or pixel-wise labels are unnecessary.

  3)

    We further propose a feedback mechanism to improve the robustness of the generator. The discriminator learns from the confusion of the recognizer and guides the generator so that it can generate clear character patterns that facilitate reading.

  4)

    Our experiments demonstrate that mainstream recognizers can benefit from our method and achieve new state-of-the-art performance by extracting text content from complex background styles. This suggests that our framework is a meta-framework, which is flexible for integration with recognizers.

Fig. 4 Training samples and generations. Left: Widely used training datasets released by (Jaderberg et al. 2014a) and (Gupta et al. 2016). Middle: Unpaired target style samples, which are character-level and synthesized online. Right: Output of the generator

2 Related Work

In this section, we review the previous methods that are most relevant to ours, grouped into two categories: scene text recognition and generative adversarial networks.

2.1 Scene Text Recognition

Overviews of notable work in the field of scene text detection and recognition have been provided by (Ye and Doermann 2015) and (Zhu et al. 2016). Methods based on neural networks outperform those using hand-crafted features, such as HOG descriptors (Dalal and Triggs 2005), connected components (Neumann and Matas 2012), strokelet generation (Yao et al. 2014), and label embedding (Rodriguez-Serrano et al. 2015), because a trainable neural network is able to adapt to various scene styles. For instance, (Bissacco et al. 2013) applied a network with five hidden layers for character classification, and (Jaderberg et al. 2015) proposed a CNN for unconstrained recognition. These CNN-based methods significantly improve recognition performance.

Moreover, the recognition models yield better robustness when they are integrated with RNNs (He et al. 2016b; Shi et al. 2016, 2017) and attention mechanisms (Lee and Osindero 2016; Cheng et al. 2017, 2018; Yang et al. 2017). For example, (Shi et al. 2017) proposed an end-to-end trainable network using both CNNs and RNNs, namely CRNN. (Lee and Osindero 2016) proposed a recursive recurrent network using attention modeling for scene text recognition. (Cheng et al. 2017) used a focusing attention network to correct attention alignment shifts caused by the complexity or low-quality of images. These methods have made great progress in regular scene text recognition.

With respect to irregular text, the irregular shapes introduce more background noise into the images, which increases recognition difficulty. To tackle this problem, (Yang et al. 2017) and (Li et al. 2019) used a two-dimensional (2D) attention mechanism for irregular text recognition. (Liao et al. 2019b) recognized irregular scene text from a 2D perspective with a semantic segmentation network. Additionally, (Liu et al. 2016), (Shi et al. 2016, 2018), and (Luo et al. 2019) proposed rectification networks to transform irregular text images into regular ones, which alleviates the interference of background noise and makes the rectified images readable by a one-dimensional (1D) recognition network. (Yang et al. 2019a) used character-level annotations as supervision to obtain a more accurate description for rectification. Despite these praiseworthy efforts, irregular scene text on complex backgrounds remains difficult to recognize in many cases.

2.2 Generative Adversarial Networks

With the widespread application of GANs (Goodfellow et al. 2014; Mao et al. 2017; Odena et al. 2017; Zhu et al. 2017), font generation methods (Azadi et al. 2018; Yang et al. 2019b) using adversarial learning have been successful on document images. These methods focus on the style of a single character and achieve incredible visual effects.

However, our goal is to perform style normalization on noisy backgrounds, rather than on the font, size or layout. A further challenge is to retain multiple characters for recognition. This means that style normalization of the complex backgrounds of scene text images requires accurate separation between the text content and the background noise. Traditional binarization/segmentation methods (Casey and Lecolinet 1996) typically work well on document images, but fail to handle the substantial variation in text appearance and the noise in natural images (Shi et al. 2018). Style normalization of backgrounds in scene text images remains an open problem.

Recently, several attempts at scene text generation have taken a crucial step forward. (Liu et al. 2018b) guided the feature maps of an original image towards those of a clean image. The feature-level guidance reduces the recognition difficulty, whereas the image-level guidance does not result in a significant improvement in text recognition performance. (Fang et al. 2019) designed a two-stage architecture to generate repeated characters in images. An additional 10k synthetic images boost the performance, but more synthetic images do not improve accuracy linearly. (Wu et al. 2019) edited text in natural images using a set of corresponding synthetic training samples to preserve the style of both background and text. These methods provide abundant visual examples. However, poor recognition performance on complex scene text remains a challenging problem.

We are interested in taking a further step to enable recognition performance to benefit from generation. Our method integrates the advantages of the attention mechanism and the GAN, and jointly optimizes them to achieve better performance. The text content is separated from various background styles, which are normalized for easier reading.

3 Methodology

We design a framework to separate text content from noisy background styles, through an interactive joint training of an attention-based recognizer and a generative adversarial architecture. The shared attention masks from the attention-based recognizer enable character-level adversarial training. Then, the discriminator guides the generator to achieve background style normalization. In addition, a feedback mechanism bridges the gap between the discriminator and recognizer. The discriminator guides the generator according to the confusion of the recognizer. Thus, the generator can generate clear character patterns that facilitate reading.

In this section, we first introduce the attention decoder in mainstream recognizers. Then, we present a detailed description of the interactive joint training scheme.

3.1 Attention Decoder

Fig. 5 Attention decoder, which recurrently attends to informative regions and outputs predictions

Fig. 6 Interactive joint training. The recognizer shares attention masks with the discriminator, whereas the discriminator learns from the predictions of the recognizer and updates the generator using ground truth. The shared attention masks work on feature maps, which we present in the generated images for better visualization

To date, the attention decoder (Bahdanau et al. 2015) has become widely used in recent recognizers (Shi et al. 2018; Luo et al. 2019; Li et al. 2019; Yang et al. 2019a). As shown in Fig. 5, the decoder sequentially outputs predictions \((y_{1},y_{2}, \ldots ,y_{N})\) and stops processing when it predicts an end-of-sequence token “EOS” (Sutskever et al. 2014). At time step t, output \(y_t\) is given by

$$\begin{aligned} y_t = \mathrm{softmax}({\varvec{W}}_{out}{\varvec{s}}_{t} + b_{out}), \end{aligned}$$
(1)

where \({\varvec{s}}_t\) is the hidden state at the t-th step. Then, we update \({\varvec{s}}_t\) by

$$\begin{aligned} {\varvec{s}}_t = GRU({\varvec{s}}_{t-1}, (y_{t-1}, {\varvec{g}}_{t})), \end{aligned}$$
(2)

where \({\varvec{g}}_{t}\) represents the glimpse vectors

$$\begin{aligned} {\varvec{g}}_{t} = \sum _{i=1}^n(\alpha _{t,i} {\varvec{h}}_{i}), {\varvec{\alpha }}_{t} \in {\mathbb {R}}^n, \end{aligned}$$
(3)

where \({\varvec{h}}_{i}\) denotes the sequential feature vectors. Vector \({\varvec{\alpha }}_{t}\) is the attention mask, whose entries are computed as follows:

$$\begin{aligned} \alpha _{t,i} = \frac{\exp (e_{t,i})}{\sum _{j=1}^n(\exp (e_{t,j}))}, \end{aligned}$$
(4)
$$\begin{aligned} e_{t,i} = {\varvec{w}}^\mathrm {T}\tanh ({\varvec{W}}_{s}{\varvec{s}}_{t-1}+{\varvec{W}}_{h}{\varvec{h}}_{i}+b). \end{aligned}$$
(5)

Here, \({\varvec{W}}_{out}\), \(b_{out}\), \({\varvec{w}}^\mathrm {T}\), \({\varvec{W}}_{s}\), \({\varvec{W}}_{h}\) and b are trainable parameters. Note that \(y_{t-1}\) is the \((t-1)\)-th character in the ground truth in the training phase, whereas it is the previously predicted output in the testing phase. The training set is denoted as \({\varvec{D}} = \left\{ I_{i}, Y_{i} \right\} , i=1...N \). The optimization is to minimize the negative log-likelihood of the conditional probability of \({\varvec{D}}\) as follows:

$$\begin{aligned} {\mathcal {L}}_{\mathrm{reg}} = -\sum _{i=1}^N{ \sum _{t=1}^{| Y_{i} |}{\log p(Y_{i,t} \left| \right. I_{i}; \theta )} }, \end{aligned}$$
(6)

where \(Y_{i,t}\) is the ground truth of the t-th character in \(I_{i}\) and \(\theta \) denotes the parameters of the recognizer.
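For concreteness, the following is a minimal PyTorch sketch of one decoding step implementing Eqs. (1)-(5); the class and variable names, the feature and hidden sizes, and the use of a one-hot encoding for \(y_{t-1}\) are our assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    """One step of the attention decoder, Eqs. (1)-(5) (a sketch; names are ours)."""
    def __init__(self, num_classes, hidden_size=256, feat_size=512):
        super().__init__()
        self.W_s = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_h = nn.Linear(feat_size, hidden_size)        # its bias plays the role of b
        self.w = nn.Linear(hidden_size, 1, bias=False)       # w^T
        self.gru = nn.GRUCell(num_classes + feat_size, hidden_size)
        self.W_out = nn.Linear(hidden_size, num_classes)     # W_out, b_out

    def forward(self, h, s_prev, y_prev_onehot):
        # h: (B, n, feat_size) sequential features; s_prev: (B, hidden_size)
        e = self.w(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(h)))  # Eq. (5)
        alpha = F.softmax(e.squeeze(-1), dim=1)                              # Eq. (4)
        g = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)                      # Eq. (3)
        s = self.gru(torch.cat([y_prev_onehot, g], dim=1), s_prev)           # Eq. (2)
        y_logits = self.W_out(s)                                             # Eq. (1), pre-softmax
        return y_logits, s, alpha
```

During training the recognition loss of Eq. (6) is then the cross-entropy between the softmax of these logits and the ground-truth characters.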

3.2 Interactive Joint Training for Separating Text from Backgrounds

As the vanilla discriminator is designed for non-sequential objects with global coarse supervision, directly employing it fails to provide effective guidance for the generator. In contrast to applying a global discriminator, we supervise the generator in a fine-grained manner, namely, through character-level adversarial learning, by taking advantage of the attention mechanism. Training the framework at the character level also reduces the complexity of preparing target style data: every target style sample contains one character and can be easily synthesized online.

3.2.1 Sharing of Attention Masks

Given an image I as input, the goal of our generator G is to generate a clean image \(I'\) without a complex background. The discriminator D encodes the image \(I'\) as

$$\begin{aligned} {\varvec{E}} = Encode(I'). \end{aligned}$$
(7)

With settings similar to those of the backbone in the recognizer (e.g., kernel size, stride size and padding size in the convolutional and pooling layers), the encoder in the discriminator is designed to output an embedding vector \({\varvec{E}}_i\) with the same size as that of \({\varvec{h}}_i\) in Eq. (3), which enables the recognizer to share attention mask \({\varvec{\alpha }}_{t}\) with the discriminator. The character-level features of the generation are then extracted by

$$\begin{aligned} {\varvec{F}}_{gen, t} = \sum _{i=1}^n(\alpha _{t,i} {\varvec{E}}_{i}), {\varvec{\alpha }}_{t} \in {\mathbb {R}}^n. \end{aligned}$$
(8)

The extracted character features are used for further adversarial training.
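In code, this character-level pooling amounts to a batched matrix product of the shared attention masks with the discriminator's embedding; the following is a minimal sketch with assumed tensor shapes.

```python
import torch

def character_features(E, alpha):
    """Eq. (8), a sketch: pool the discriminator's embedding of the generated image
    with the attention masks shared by the recognizer.
    E:     (B, n, C) embedding vectors from Encode(I')
    alpha: (B, T, n) attention masks over T decoding steps
    Returns F_gen of shape (B, T, C), one feature vector per character.
    """
    return torch.bmm(alpha, E)
```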

3.2.2 Unpaired Target Style Samples

Benefiting from our character-level adversarial learning, target style samples can be simply synthesized online. As illustrated in Figs. 4 and 6, every target style sample contains only a black character on a white background or a white character on a black background. The characters are randomly chosen. Following previous methods for data synthesis (Jaderberg et al. 2014a; Gupta et al. 2016), we collect fonts to synthesize the target style samples. The renderer is a simple, publicly available engine that can efficiently synthesize samples online. Owing to the diversity of the fonts, the font sensitivity of the discriminator decreases, which enables the discriminator to focus on the background styles.
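Such an online renderer can be sketched with Pillow as follows; the character set, font paths, and sizes here are illustrative assumptions, not the engine used in our experiments.

```python
import random
from PIL import Image, ImageDraw, ImageFont

CHARS = "0123456789abcdefghijklmnopqrstuvwxyz"   # assumed character set
FONT_PATHS = ["fonts/ExampleFont.ttf"]            # hypothetical font files collected offline

def synth_target_sample(canvas=32):
    """Render one random character on a clean background, either black-on-white
    or white-on-black, as described above (a sketch; the real engine may differ)."""
    char = random.choice(CHARS)
    font = ImageFont.truetype(random.choice(FONT_PATHS), size=random.randint(20, 28))
    fg, bg = random.choice([(0, 255), (255, 0)])  # black-on-white or white-on-black
    img = Image.new("L", (canvas, canvas), color=bg)
    draw = ImageDraw.Draw(img)
    w, h = draw.textbbox((0, 0), char, font=font)[2:]          # character extent
    draw.text(((canvas - w) // 2, (canvas - h) // 2), char, font=font, fill=fg)
    return img, char
```

Each rendered sample is then resized to \(32 \times 32\) and fed to the encoder of the discriminator, as described in Sect. 4.2.4.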

Because there is only one character in a target style image, we apply global average-pooling to the embedding features for every target style sample \(I_{t}\) as follows:

$$\begin{aligned} {\varvec{F}}_{tgt} = \mathrm{averagePooling}\big (\mathrm{Encode}(I_t)\big ). \end{aligned}$$
(9)

The features of the t-th character in the generated image \({\varvec{F}}_{gen, t}\) and the target style sample \({\varvec{F}}_{tgt}\) are prepared for the following adversarial training.

3.2.3 Adversarial Training on Style

We use a style classifier in the discriminator to classify the style of characters in the generated images as fake and the characters in the target style samples as real. We use the 0−1 binary coding (Mao et al. 2017) for style adversarial training, which is formulated as

$$\begin{aligned} \begin{aligned}&\min \limits _{D} {\mathcal {L}}_{s} = {\mathbb {E}}_{I_t} [\big (1-Style({\varvec{F}}_{tgt})\big )^2] + {\mathbb {E}}_{I', t} \big [ Style({\varvec{F}}_{gen, t})^2\big ], \\&\min \limits _{G} {\mathcal {L}}_{s} = {\mathbb {E}}_{I', t} \big [\big (1-Style({\varvec{F}}_{gen, t})\big )^2\big ], \end{aligned} \end{aligned}$$
(10)

where \(Style(\cdot )\) denotes the style classifier.
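A minimal PyTorch sketch of these two objectives is given below, assuming the character features have already been flattened into a batch; the function names are ours, and the decision of where to detach the generated features is left to the training loop (Sect. 3.2.5).

```python
def style_loss_D(style, F_tgt, F_gen):
    """Discriminator side of Eq. (10): target-style features should score 1,
    generated character features 0 (0-1 least-squares coding). A sketch."""
    return ((1.0 - style(F_tgt)) ** 2).mean() + (style(F_gen) ** 2).mean()

def style_loss_G(style, F_gen):
    """Generator side of Eq. (10): push generated characters towards the real label."""
    return ((1.0 - style(F_gen)) ** 2).mean()
```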

The advantages of character-level adversarial training are threefold: (1) Because the background of a scene text image is complicated, the background noise varies substantially across different character regions. Considering the text string as a whole and supervising the training in a global manner may cause the generator to fail, as discussed in Sect. 1 and Fig. 2. Thus, we encourage the discriminator to inspect the generation in a more fine-grained manner, namely, with character-level supervision, which contributes to effective learning. (2) Training at the character level simplifies the preparation of target style data. To synthesize a text string, it is necessary to consider the text shape, the spacing between neighboring characters and the rotation of every character (Jaderberg et al. 2014a; Gupta et al. 2016). In contrast, we can simply synthesize a single character on a clean background for every target style sample, so our target style samples can be synthesized online during training. (3) The training is free of the need for paired data. Because the attention mechanism decomposes a text string into individual characters, only input scene text images and the corresponding text labels are required. Hence, our framework is flexible enough to make full use of available data to gain robustness.

3.2.4 Feedback Mechanism

As our goal is to improve recognition performance, we are not only interested in the styles of the backgrounds, but also the quality of the generated content. Therefore, we use a content classifier in the discriminator to supervise content generation.

In contrast to the previous work on auxiliary classifier GANs (Odena et al. 2017), which used ground truth to supervise the content classifier, our content classifier learns from the predictions of the recognizer. This bridges the gap between the recognizer and the discriminator, so the discriminator can guide the generator according to the confusion of the recognizer. After training with this feedback mechanism, the generated patterns are more discriminative, which facilitates recognition. The details of the feedback mechanism are presented as follows.

The generator G and discriminator D are updated by alternately optimizing

$$\begin{aligned} \begin{aligned} \min \limits _{D} {\mathcal {L}}_{c, D}&= {\mathbb {E}}_{(I, P), (I_t, GT)} [-\log \mathrm{Content}(GT | {\varvec{F}}_{tgt} ) \\&-\frac{1}{|P|} \sum _{t=1}^{|P|} \log \mathrm{Content}(P_t | {\varvec{F}}_{gen, t})], \end{aligned} \end{aligned}$$
(11)
$$\begin{aligned} \min \limits _{G} {\mathcal {L}}_{c, G} = {\mathbb {E}}_{I, GT} \Big [-\frac{1}{|GT|}\sum _{t=1}^{|GT|}\log \mathrm{Content}(GT_t | {\varvec{F}}_{gen, t} )\Big ], \end{aligned}$$
(12)

where GT denotes the ground truth of the input image I and the target style sample \(I_t\). In addition, \(Content(\cdot )\) is the content classifier. Note that the discriminator learns from the predictions P of the recognizer on I, whereas it uses the GT of I to update the generator. This is an adversarial process similar to GAN training (Goodfellow et al. 2014; Mao et al. 2017; Odena et al. 2017; Zhu et al. 2017): the two objectives use different labels for the discriminator and the generator, but the gradient for the generator is backpropagated through the same discriminator parameters. Alternately optimizing the discriminator and generator achieves adversarial learning.
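The corresponding content objectives of Eqs. (11) and (12) can be sketched as follows, again assuming flattened character features and integer class labels; the interfaces are our assumptions.

```python
import torch.nn.functional as F

def content_loss_D(content, F_tgt, y_tgt, F_gen, y_pred):
    """Eq. (11), a sketch: the content classifier learns the target characters from
    their ground truth and the generated characters from the recognizer's
    predictions (the feedback mechanism)."""
    return F.cross_entropy(content(F_tgt), y_tgt) + F.cross_entropy(content(F_gen), y_pred)

def content_loss_G(content, F_gen, y_gt):
    """Eq. (12), a sketch: the generator is updated with the ground-truth characters GT."""
    return F.cross_entropy(content(F_gen), y_gt)
```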

Some predictions in P contain substitution errors and therefore differ from GT. The second term on the right-hand side of Eq. (11) can thus be formulated as content adversarial training:

$$\begin{aligned} \begin{aligned} -\frac{1}{|P|}&\sum _{t=1}^{|P|}\log Content(P_t|{\varvec{F}}_{gen, t}) = \\ -\frac{1}{|P|} [&\sum _{i=1}^{|P_{real}|}\log Content(P_{real, i}|{\varvec{F}}_{gen, i}) \\ +&\sum _{j=1}^{|P_{fake}|}\log Content(P_{fake, j}|{\varvec{F}}_{gen, j}) ], \end{aligned} \end{aligned}$$
(13)

where \(P_{real}\) and \(P_{fake}\) represent the correct and incorrect predictions of the recognizer, respectively. Note that \(P_{real} \cup P_{fake} = P\).

Since the discriminator with the content classifier learns from the predictions of the recognizer, it guides the generator to correct erroneous character patterns in the generated images. For instance, similar patterns such as “C” and “G”, or “O” and “Q”, may cause the recognizer to fail. If a “G” is transformed to look more like a “C” and the recognizer predicts it as a “C”, the discriminator will learn that the pattern is a “C” and guide the generator to generate a clearer “G”. We show more examples and further discuss this issue in Sect. 4.

Algorithm 1 Interactive joint training scheme

3.2.5 Interactive Joint Training

The pseudocode of the interactive joint training scheme is presented in Algorithm 1. During the training of our framework, we found that the discriminator often learns faster than the generator. A similar problem has also been reported by others (Berthelot et al. 2017; Heusel et al. 2017). The Wasserstein GAN (Arjovsky et al. 2017) uses more update steps for the generator than the discriminator. We simply adjust the number of steps according to a balance factor \(\beta \in (0, 1)\). If the discriminator learns faster than the generator, then the value of \(\beta \) decreases, potentially resulting in a pause during the update steps for the discriminator. In practice, this trick contributes to the training stability of the generator.

We first sample a set of input samples, and randomly synthesize unpaired samples of target style. Then, the recognizer makes predictions on the generated images and shares its attention masks with the discriminator. To avoid the effects of incorrect alignment between character features and labels (Bai et al. 2018), we filter out some predictions using the metrics of edit distance and string length. The corresponding images are also filtered out. Only substitution errors exist in the remaining predictions. Finally, the discriminator and generator are alternately optimized to achieve adversarial learning.
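The following schematic Python sketch summarizes one iteration of Algorithm 1 under these rules. It reuses the loss and feature helpers sketched earlier; the recognizer and discriminator interfaces, the `sample_target_style` and `char_labels` helpers, the fixed-length padded decoding, and the interpretation of the balance factor \(\beta \) as an update probability are all assumptions made for illustration.

```python
import random

def joint_train_step(G, D, R, opt_D, opt_GR, imgs, labels, beta):
    """One iteration of the interactive joint training (a schematic sketch of
    Algorithm 1; all interfaces are assumptions). R(gen, labels) is assumed to
    return predictions, shared attention masks and the recognition loss of Eq. (6);
    char_labels turns the kept strings into per-character class indices."""
    tgt_imgs, tgt_labels = sample_target_style(len(labels))    # unpaired, rendered online

    gen = G(imgs)                                              # background-normalized images
    preds, alphas, loss_reg = R(gen, labels)

    # keep only samples whose prediction length matches the label (substitution errors only)
    keep = [i for i in range(len(labels)) if len(preds[i]) == len(labels[i])]

    # --- update D on a fraction of steps controlled by the balance factor beta ---
    if random.random() < beta:
        F_gen = character_features(D.encode(gen[keep].detach()),
                                   alphas[keep].detach()).flatten(0, 1)
        F_tgt = D.encode(tgt_imgs).mean(dim=1)                 # Eq. (9), global average pooling
        loss_D = style_loss_D(D.style, F_tgt, F_gen) \
               + content_loss_D(D.content, F_tgt, tgt_labels, F_gen, char_labels(preds, keep))
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- update G and R jointly, using the ground truth for the content term ---
    F_gen = character_features(D.encode(gen[keep]), alphas[keep]).flatten(0, 1)
    loss_G = style_loss_G(D.style, F_gen) \
           + content_loss_G(D.content, F_gen, char_labels(labels, keep)) + loss_reg
    opt_GR.zero_grad(); loss_G.backward(); opt_GR.step()
```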

After the adversarial training, the generator can separate text content from complex background styles. The generated patterns are clearer and easier to read. As illustrated in Fig. 7, the generator works well on both regular text and slanted/curved text. Because the irregular shapes of the text introduce more surrounding background noise, the recognition difficulty can be significantly reduced by using our method.

Fig. 7 Generated images for (a) regular and (b) irregular text. Input images are on the left and the corresponding generated images are on the right. The text content is separated by the generator from the noisy background styles. In the generated images, the font style tends to be an average style

4 Experiments

In this section, we provide the training details and report the results of extensive experiments on various benchmarks, including both regular and irregular text datasets, demonstrating the effectiveness and generality of our method.

As paired text images in the wild are not available and there exists great diversity in the number of characters and image structure between the input images and our target style images, popular GAN metrics such as the inception score (Salimans et al. 2016) and Fréchet inception distance (Heusel et al. 2017) cannot be directly applied in our evaluation. Instead, we use the word accuracy of recognition, which is a more straightforward metric, and is of interest for our target task, to measure the performance of all the methods. Recall that our goal here is to improve recognition accuracy.

4.1 Datasets

4.1.1 SynthData

SynthData, which contains 6 million samples released by (Jaderberg et al. 2014a) and 6 million samples released by (Gupta et al. 2016), is a widely used training dataset. Following the most recent work, we select it as the training dataset for fair comparison. Only word-level labels are used; no other extra annotations are necessary in our framework. The model is trained using only synthetic text images, without any fine-tuning for each specific dataset.

IIIT5K-Words (Mishra et al. 2012) (IIIT5K) contains 3,000 cropped word images for testing. Every image has a 50-word lexicon and a 1,000-word lexicon. The lexicon consists of the ground truth and some randomly picked words.

Street View Text (Wang et al. 2011) (SVT) was collected from the Google Street View, and consists of 647 word images. Each image is associated with a 50-word lexicon. Many images are severely corrupted by noise and blur or have very low resolutions.

ICDAR 2003 (Lucas et al. 2003) (IC03) contains 251 scene images that are labeled with text bounding boxes. For fair comparison, we discarded images that contain non-alphanumeric characters or that have fewer than three characters, following (Wang et al. 2011). The filtered dataset contains 867 cropped images. Lexicons comprise a 50-word lexicon defined by (Wang et al. 2011) and a “full lexicon” that combines all lexicon words.

ICDAR 2013 (Karatzas et al. 2013) (IC13) inherits most of its samples from IC03. It contains 1,015 cropped text images. No lexicon is associated with this dataset.

SVT-Perspective (Quy Phan et al. 2013) (SVT-P) contains 645 cropped images for testing. Images were selected from side-view angle snapshots in Google Street View. Therefore, most images are perspective distorted. Each image is associated with a 50-word lexicon and a full lexicon.

CUTE80 (Risnumawan et al. 2014) (CUTE) contains 80 high-resolution images taken of natural scenes. It was specifically collected for evaluating the performance of curved text recognition. It contains 288 cropped natural images for testing. No lexicon is associated with this dataset.

ICDAR 2015 (Karatzas et al. 2015) (IC15) contains 2077 images cropped using the ground truth word bounding boxes. (Cheng et al. 2017) filtered out some extremely distorted images and used a smaller evaluation set (referred to as IC15-S) containing only 1811 test images.

Table 1 Word accuracy on the testing datasets using different inputs. The recognizer is trained on the source and generated images, respectively

4.2 Implementation Details

As our proposed method is a meta-framework for recent attention-based recognition methods (Shi et al. 2018; Luo et al. 2019; Li et al. 2019; Yang et al. 2019a), recent recognizers can be readily integrated with our framework. Thus the recognizer implementation follows their specific design. Here we present details of the discriminator, generator, and training.

4.2.1 Generator

The generator is a feature pyramid network (FPN)-like (Lin et al. 2017) architecture that consists of eight residual units. Each residual unit comprises a \(1 \times 1\) convolution followed by two \(3 \times 3\) convolutions. Feature maps are downsampled by \(2 \times 2\) stride convolutions in the first three residual units. The numbers of output channels of the first four residual units are 64, 128, 256, and 256, respectively. The last four units are symmetrical with the first four, but we upsample the feature map by simple resizing. We apply element-wise addition to the output of the third and fifth units. At the top of the generator, there are two convolution layers that have 16 filters and one filter, respectively.
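A rough PyTorch sketch of such a generator is shown below; the shortcut connections, activation functions, output non-linearity, and the exact placement of the upsampling operations are our assumptions, so it should be read as a schematic rather than the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualUnit(nn.Module):
    """A 1x1 convolution followed by two 3x3 convolutions, with a projection
    shortcut (the shortcut and activation placement are our assumptions)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=stride, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride)

    def forward(self, x):
        return F.relu(self.body(x) + self.skip(x))

class Generator(nn.Module):
    """FPN-like generator sketch: four down-side units (64/128/256/256 channels,
    stride-2 in the first three), four symmetric up-side units with resize
    upsampling, a lateral addition of the 3rd and 5th units' outputs, and a
    16-filter plus 1-filter head."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.d1 = ResidualUnit(in_ch, 64, stride=2)
        self.d2 = ResidualUnit(64, 128, stride=2)
        self.d3 = ResidualUnit(128, 256, stride=2)
        self.d4 = ResidualUnit(256, 256)
        self.u5, self.u6 = ResidualUnit(256, 256), ResidualUnit(256, 256)
        self.u7, self.u8 = ResidualUnit(256, 128), ResidualUnit(128, 64)
        self.head = nn.Sequential(nn.Conv2d(64, 16, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, x):
        x3 = self.d3(self.d2(self.d1(x)))                      # 1/8 resolution
        x5 = self.u5(self.d4(x3)) + x3                         # lateral addition: 3rd + 5th units
        x6 = self.u6(F.interpolate(x5, scale_factor=2.0))      # simple resize upsampling
        x7 = self.u7(F.interpolate(x6, scale_factor=2.0))
        x8 = self.u8(F.interpolate(x7, scale_factor=2.0))
        return torch.sigmoid(self.head(x8))
```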

4.2.2 Discriminator

The encoder in the discriminator consists of seven convolutional layers with 16, 64, 128, 128, 192 and 256 filters. Their kernel sizes are all \(3 \times 3\), except for the last one, which is \(2 \times 2\). The first, second, fourth and sixth convolutional layers are each followed by an average-pooling layer. Using settings similar to those of the backbone in the recognizer (e.g., kernel size, stride size and padding size in the convolutional and pooling layers), the output size of the encoder can be controlled to meet the requirements of the attention mask sharing of the recognizer. Both the style and content classifiers in the discriminator are one-layer fully connected networks.

4.2.3 Training

We use Adam (Kingma et al. 2015) to optimize the GAN. The learning rate is set to 0.002 and is decreased by a factor of 0.1 at epochs 2 and 4. The interactive joint training utilizes the attention mechanism of the recognizer, so an optimized attention decoder is necessary to enable the interaction. To accelerate the training process, we pre-train the recognizer for three epochs.

4.2.4 Implementation

We implement our method using the PyTorch framework (Paszke et al. 2017). The target style samples are resized to \(32 \times 32\). Input images are resized to \(64 \times 256\) for the generator and \(32 \times 100\) for the recognizer. The outputs of the generator are also resized to \(32 \times 100\). When the batch size is set to 64, the training speed is approximately 1.7 iterations/sec. Our method takes an average of 1.1 ms to generate an image using an NVIDIA GTX-1080Ti GPU.

4.3 Ablation Study

4.3.1 Experiment Setup

To investigate the effectiveness of separating text content from noisy background styles, we conduct an ablation analysis using a simple recognizer. The backbone of the recognizer is a 45-layer residual network (He et al. 2016a), which is a popular architecture (Shi et al. 2018). On top of the backbone, there is an attention-based decoder with 256 GRU hidden units. The decoder outputs 37 classes, including 26 letters, 10 digits, and a symbol representing “EOS”. The training data is SynthData. We evaluate the recognizer on seven benchmarks, including regular and irregular text.

4.3.2 Input of the Recognizer

We study the contribution of our method by replacing the generated image with the corresponding input image. The results are listed in Table 1. The recognizer trained using SynthData serves as a baseline. Compared to the baseline, the clean images generated by our method boost recognition performance. We observe that the improvement is more substantial on irregular text. One notable improvement is an accuracy increase of 6.6% on CUTE. One possible reason for this is that the irregular text shapes introduce more background noise than the regular ones. Because our method removes the surrounding noise and extracts the text content for recognition, the recognizer can thus focus on characters and avoid noisy interference.

With respect to regular text, the baseline is much higher and there is less room for improvement, but our method still shows advantages in recognition performance. The performance gain on several kinds of scene text, including low-quality images in SVT and real scene images in IC03/IC13, suggests the generality of our method. To summarize, the clean images generated by our proposed method greatly decrease recognition difficulty.

4.3.3 Style Supervision

We study the necessity of style supervision by disabling the style classifier in the discriminator. Without style adversarial training, the background style normalization is only weakly supervised by the content label. As shown in Fig. 8, the generated images suffer from severe image degradation, which leads to poor robustness of the recognizer. The quantitative recognition results without/with style supervision are presented in the second and third rows of Table 2. The significant gaps indicate that without style supervision, the quality of the generated images is insufficient for recognition training. Thus, style adversarial training is necessary and is part of the basic design of our method.

Fig. 8 Visualization of background normalization weakly supervised by content label

Table 2 Word accuracy on generated images using variants of content supervision for the discriminator. Losses \({\mathcal {L}}_{s}\) and \({\mathcal {L}}_{c}\) denote style loss and content loss, respectively

4.3.4 Feedback Mechanism

We also study the effectiveness of the content classifier in the discriminator and the proposed feedback mechanism. In this experiment, we first disable the content classifier. Therefore, there is no content supervision. Only a style adversarial loss supervises the generator. The result is shown in the first row in Table 2. The accuracy on the generated images decreases to nearly zero. We observe that the generator fails to retain the character patterns for recognition. As the content classifier is designed for assessing the discriminability and diversity of samples (Odena et al. 2017), it is important to guide the generator so that it can determine informative character patterns and retain them for recognition. When the content supervision is not available, the generator is easily trapped into failure modes, namely mode collapse (Salimans et al. 2016). Therefore, the content supervision in the discriminator is necessary.

Then we enable the content classifier and replace the supervision in \({\mathcal {L}}_{c}\) with the ground truth. This setting is similar to that of the auxiliary classifier GANs (Odena et al. 2017), which use content supervision for discriminability and diversity in the style adversarial training. After this process, the generated text images contain text content for recognition.

Finally, we replace the content supervision with the predictions of the recognizer. The discriminator thus learns from the confusion of the recognizer, and guides the generator so that it can refine the character patterns to be easier to read. Therefore, the adversarial training is more relevant to the recognition performance. As shown in Table 2, the feedback mechanism further improves the robustness of the generator and benefits the recognition performance.

Table 3 Word accuracy on testing datasets using different transformation methods

One interesting observation is that on the SVT-P testing set, the accuracy on the source image (75.7% in Table 1) is higher than that on the generated image with content supervision of the ground truth (75.0% in Table 2). We observe the source samples and find that most images are severely corrupted by noise and blur. Some of them have low resolutions. The characters in the generated samples are also difficult to distinguish. After training with the feedback mechanism, the generator is able to generate clear patterns that facilitate reading, which boosts the recognition accuracy from 75.0 to 79.2%. As illustrated in Fig. 9, the predictions of “C” and “N” are corrected to “G” and “M”, respectively. The clear characters in the generated images are easier to read.

Fig. 9 Predictions of challenging samples in the SVT-P testing set. Recognition errors are marked as red characters. Confusing and distinct patterns are marked by red and green bounding boxes, respectively

Fig. 10 Comparison between the OTSU method (Otsu 1979) and our method

Table 4 Word accuracy on regular benchmarks

4.4 Comparisons with Generation Methods

Recently, a large body of literature (Shi et al. 2018; Luo et al. 2019; Li et al. 2019; Yang et al. 2019a) has explored the use of stronger recognizers to tackle the complications of scene text recognition. However, there has been little consideration of the quality of the source images, and the background noise in the source image has not been addressed intensively before. To the best of our knowledge, our method may be the first image generation network that removes background noise and retains text content to benefit recognition performance. Although little prior work has addressed this issue, we select several popular generation methods and perform comparisons under fair experimental conditions. The pre-trained recognizer used in the ablation study is adopted in the comparisons and is then fine-tuned on the different kinds of generated images.

First, we use a popular binarization method, namely the OTSU method (Otsu 1979), to separate the text content from the background noise by binarizing the source images. As shown in Fig. 10, we visualize the binarized images and find that a single threshold value is not sufficiently robust to separate the foreground and background in scene text images, because the background noise usually follows a multimodal distribution. Therefore, the recognition accuracy on the OTSU outputs falls behind ours in Table 3.

Then, we compare our method with generation methods. Considering the high demand for data (pixel-level paired samples) of pixel-to-pixel GANs (Isola et al. 2017), we treat this kind of method as a potential solution only when data availability is not a restriction. Here, we study CycleGAN (Zhu et al. 2017). Before the training, we synthesize word-level clean images as target style samples. The results shown in Table 3 and Fig. 11 suggest that modeling a text string with multiple characters as a whole leads to poor retention of character details. The last two rows in Fig. 11 are failed generations, which indicate that the generator fails to model the relationships among the characters. In Table 3, the recognition accuracy on this kind of generation drops substantially.

Fig. 11 Comparison between CycleGAN (Zhu et al. 2017) and our method

Table 5 Word accuracy on irregular benchmarks

Compared with previous methods, our method not only normalizes noisy backgrounds to a clean style, but also generates clear character patterns that tend towards an average style. The end-to-end training with the feedback mechanism benefits the recognition performance. We also show the effectiveness of image rectification by integrating our method with advanced rectification modules (Shi et al. 2018; Zhan and Lu 2019). It can be seen that image rectifiers are still significant for improving recognition performance. Thus, noisy background styles pose a challenge distinct from irregular text shapes.

4.5 Integration with State-of-the-Art Recognizers

As our method is a meta-framework, it can be integrated with recent recognizers (Cheng et al. 2018; Shi et al. 2018; Luo et al. 2019; Li et al. 2019; Yang et al. 2019a) equipped with attention-based decoders (Bahdanau et al. 2015). We conduct experiments using two representative methods, namely ASTER (Shi et al. 2018) and ESIR (Zhan and Lu 2019), to investigate the effectiveness of our framework. The reimplementation results are comparable with those reported in the original papers. For datasets that provide a lexicon, we choose the lexicon word under the metric of edit distance, as sketched below. The results of the comparison with previous methods are shown in Tables 4 and 5. All the results of the previous methods are collected from their original papers. If a method uses extra annotations, such as character-level bounding boxes or pixel-level annotations, we indicate this with “Add.”. For fair comparison with (Li et al. 2019), we include the results of their model trained using synthetic data.
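The lexicon-constrained decoding mentioned above can be sketched in plain Python as follows; the function names are ours.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming over one row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def constrain_to_lexicon(prediction: str, lexicon: list) -> str:
    """Pick the lexicon word closest to the raw prediction under edit distance."""
    return min(lexicon, key=lambda w: edit_distance(prediction, w))
```

For example, `constrain_to_lexicon("hcuse", ["house", "horse"])` returns "house".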

Fig. 12 Predictions corrected by our method

Fig. 13 Failure cases. Top: source images. Bottom: generated images

Using the strong baseline of ASTER, we first evaluate the contribution of our method on regular text, as shown in Table 4. Although the baseline accuracy on these benchmarks is already high, leaving little room for improvement, our method still achieves a notable improvement in lexicon-free prediction. For instance, it leads to accuracy increases of 1.4% on SVT and 1.3% on IC13. Some predictions corrected using our generations are shown in Fig. 12. Then, we reveal the superiority of our method by applying it to irregular text recognition. As shown in Table 5, our method significantly boosts the performance of ASTER by generating clean images. ASTER integrated with our approach outperforms the baseline by a wide margin on SVT-P (3.9%), CUTE (5.2%) and IC15 (4.3%). This suggests that our generator removes the background noise introduced by irregular shapes and further reduces the difficulty of rectification and recognition. It is noteworthy that ASTER with our method outperforms ESIR (Zhan and Lu 2019), which uses more rectification iterations (ASTER only rectifies the image once), demonstrating the significant contribution of our method. The performance is even comparable with the state-of-the-art method (Yang et al. 2019a), which uses character-level geometric descriptors for supervision. Our method achieves a better trade-off between recognition performance and data requirements.

We then integrate our method with a different recognizer, ESIR, to show its generality. Based on this more advanced recognizer, our method achieves further gains; for instance, the improvement is still notable on CUTE (4.1%). As a result, the performance of ESIR is also significantly boosted by our method.

Table 6 Word accuracy on testing datasets when we use a little more real training data

4.5.1 Upper Bound of GAN

We are further interested in the upper bound of our method. As our method is designed based on adversarial training, the limitations of the GAN cause some failure cases. As illustrated in Fig. 13, the well-trained generator fails to generate character patterns on difficult samples, particularly when the source image is of low quality or the curvature of the text shape is too high. One possible reason is the mode-dropping phenomenon studied by (Bau et al. 2019). Another is the lingering gap observed by (Zhu et al. 2017) between training supervision with paired and unpaired samples. To break this ceiling, one possible solution is to improve the synthesis engine and integrate various paired lifelike samples for training. This may lead to substantially more powerful generators, but it depends heavily on the development of synthesis engines.

Table 7 Comparisons of generation in RGB space and in gray

Inspired by recent work (Shi et al. 2018; Liao et al. 2019a), it is possible to integrate several outputs of the system and choose the most probable one to gain performance. Therefore, we propose a simple yet effective method to address the issue stated above. The source image and the corresponding generated image are concatenated as a batch for network inference. Then, we choose the prediction with the higher confidence. As shown in the last rows of Tables 4 and 5 (denoted as “+Com.”), this ensemble mechanism greatly boosts the system performance, which indicates that the source and generated images are complementary to each other.
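A minimal sketch of this ensemble is given below, assuming a recognizer interface that returns decoded strings together with their confidence scores.

```python
import torch

def ensemble_predict(recognizer, source_img, generated_img):
    """Sketch of the '+Com.' ensemble: infer on the source and the generated image
    in one batch and keep the more confident prediction. The recognizer is assumed
    to return a list of decoded strings and a tensor of confidence scores."""
    batch = torch.cat([source_img, generated_img], dim=0)
    texts, scores = recognizer(batch)
    return texts[0] if scores[0] >= scores[1] else texts[1]
```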

4.6 More Accessible Data

In the experiments comparing the proposed method with previous recognition methods, we used only synthetic data for fair comparison. Here, we use ASTER (Shi et al. 2018) to explore whether there is room for improvement in synthesis engines.

Following (Li et al. 2019), we collect publicly available real data for training. In contrast to synthetic data, real data is more costly to collect and annotate. Thus, there are only approximately 50k public real samples for training, whereas there are millions of synthetic samples. As shown in Table 6, after we add the small real training set to the large synthetic one, the generality of both the baseline ASTER and our method is further boosted. This suggests that synthetic data is not sufficiently realistic and the model is still data-hungry.

In summary, our approach is able to make full use of real samples in the wild to further gain robustness, because the training of our method requires only input images and the corresponding text labels. Note that our method trained using only synthetic data even outperforms the baseline trained using real data on most benchmarks, particularly on SVT-P (\(\uparrow \)6.0%). Therefore, noisy background style normalization is a promising way to improve recognition performance.

4.7 Discussion

4.7.1 Generation in RGB Space or in Gray

The background noise and text content may be easier to separate in RGB color images. To this end, we conduct an experiment to evaluate the influence of the RGB color space. The target style samples are synthesized in random colors to guide the generation in RGB space. As shown in Table 7, we find that generation in RGB space does not outperform generation in gray. Therefore, the key issue of background normalization is not the color space, but the lack of pixel-level supervision. Without fine-grained guidance at the pixel level, the generation is only guided by the attention mechanism of the recognizer to focus on informative regions, while other noisy regions in the generated image are left unconstrained.

4.7.2 Alignment Issue on Long Text

To tackle the lack of paired training samples, we exploit the attention mechanism to extract every character for adversarial training. However, there exist misalignment problems with the attention mechanism (Cheng et al. 2017; Bai et al. 2018), especially on long text. (Cong et al. 2019) conducted a comprehensive study of the attention mechanism and found that attention-based recognizers perform poorly on text sentence recognition. Thus, our method still has room for performance gains on text sentence recognition. This is a common issue for most attention mechanisms and merits further study.

5 Conclusion

We have presented a novel framework for scene text recognition from a brand new perspective of separating text content from noisy background styles. The proposed method can greatly reduce recognition difficulty and thus boost the performance dramatically. Benefiting from the interactive joint training of an attention-based recognizer and a generative adversarial architecture, we extract character-level features for further adversarial training. Thus the discriminator focuses on informative regions and provides effective guidance for the generator. Moreover, the discriminator learns from the confusion of the recognizer and further effectively guides the generator. Thus, the generated patterns are clearer and easier to read. This feedback mechanism contributes to the generality of the generator. Our framework is end-to-end trainable, requiring only the text images and corresponding labels. Because of the elegant design, our method can be flexibly integrated with recent mainstream recognizers to achieve new state-of-the-art performance.

The proposed method is a successful attempt to solve the scene text recognition problem from the brand new perspective of image generation and style normalization, which has not been addressed intensively before. In the future, we plan to extend the proposed method to end-to-end scene text recognition. Extending our method to general multi-object recognition is also a topic of interest.