1 Introduction

In recent years, convolutional and recurrent networks have made significant progress on text recognition tasks, far surpassing conventional methods. Deep learning based methods require massive amounts of data to avoid overfitting, i.e., models that achieve high accuracy on the training set but low accuracy on the test set. However, popular benchmark datasets for scene text recognition contain only thousands of images, which cannot meet the requirements of state-of-the-art deep learning based methods. The common solution to the lack of data is transfer learning, which first pre-trains the model on large datasets containing millions of images, e.g., ImageNet [2], SynthText90k [10], SynthText in the Wild [7] and FSNS [24], and then fine-tunes the pre-trained model on the target dataset with data augmentation [16] or few-shot learning [19]. However, this solution cannot effectively suppress overfitting either. The most effective solution is to generate enough samples with the same distribution as the training dataset for the fine-tuning stage.

Inspired by Neural Style Transfer [4] and style transfer based on Generative Adversarial Networks [5, 28], this paper proposes a method called SynthText-Transfer to generate arbitrary images with the same texture but different text content for specific tasks. An overview of the model is given in Fig. 1. In SynthText-Transfer, we choose an image from the raw dataset as the style image and a randomly generated word as the content text. The Content Image Initialization module initializes the content image based on the content text and the style image. The Style Transfer Network then produces a synthetic image that preserves the content and looks like the style image. In the experiments, we choose two typical sequence-based text recognition models, CRNN [21] and AttentionOCR [22], to verify the capability of the proposed data generation method. Both models achieve better performance on benchmark datasets after being fine-tuned on the synthetic dataset generated by SynthText-Transfer, which confirms the ability of SynthText-Transfer to relieve overfitting during fine-tuning.

Fig. 1 The structure of SynthText-Transfer

The contributions of this paper are three-fold. First, we propose a novel pipeline to generate a large number of text images from a few collected samples. Second, SynthText-Transfer produces appealing synthetic data, which can further improve the accuracy of existing text recognition models. Third, SynthText-Transfer is as fast as other fast style transfer methods, which allows plentiful samples to be generated quickly.

2 Related work

Synthetic data for text recognition

Synthetic datasets provide detailed ground-truth annotations and are cheap, scalable alternatives to manually annotated datasets. The datasets most relevant to SynthText-Transfer are SynthText90k [10] and SynthText in the Wild [7]. Jaderberg et al. [10] release a synthetic dataset called SynthText90k containing 9 million samples for text recognition in the wild, which is commonly used to pre-train text recognition models. SynthText in the Wild [7] is the first large synthetic dataset for both text localization and recognition. Both of them try to cover as many situations as possible but cannot synthesize images with the specific texture and distribution of a target dataset.

Style transfer

Style transfer methods can be divided into two categories: Neural Style Transfer and Generative Adversarial Networks (GAN). Gatys et al. [4] are the first to propose a neural method for style transfer, which transfers the style by iteratively updating the pixels of the image. A series of works improve this method in both quality [13, 20] and speed [1, 9]. Li et al. [13] argue that using patch matching instead of the Gram matrix enhances details. In terms of speed, generative neural methods first optimize a generative model iteratively and then produce the stylized image through a single forward pass. It is worth mentioning that Patch Swap [1], AdaIn [9] and WCT [14] allow arbitrary content and style images to be transferred within one feed-forward network. AdaIn (Adaptive Instance Normalization) [9] aligns the channel-wise mean and variance of the content feature map x with those of the style feature map y. Patch Swap [1] swaps each activation patch x of the content feature map with its closest matching style patch y of the style feature map, where patch similarity is measured by normalized cross-correlation. WCT (Whitening and Coloring Transforms) [14] directly transforms the feature map of the content image to match the covariance matrix of the feature map of the style image. GAN-based methods [5, 28] perform style transfer in different applications, such as generating various bedrooms, converting natural characters into MNIST style and human faces into cartoon style. However, SynthText-Transfer aims to generate many specific instances with strong, controllable spatial constraints and precise details, which is too difficult for a single GAN; hence we do not choose a GAN-based method. While neural style transfer methods can be trained without supervision, they generally require the user to provide existing content and style images. In our problem we choose an image from the target dataset as the style image and run a content image initialization to build the content image, which together form the input to the Style Transfer Network.
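For illustration, the following is a minimal Python (PyTorch) sketch of the AdaIn operation described above; the function name, tensor shapes and the small ε term are our own illustrative choices rather than details of the original implementation.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Sketch of AdaIn: align the channel-wise mean and standard deviation
    of the content feature map with those of the style feature map.
    Feature maps are assumed to have shape (N, C, H, W)."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean
```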

Scene text recognition

In recent years, several deep learning based methods have been proposed for the scene text recognition task. Jaderberg et al. [10] pose word recognition as a multi-class classification task with about 90K classes. He et al. [8] and Shi et al. [21] treat text recognition as a sequence-to-sequence learning problem, using an RNN for sequential predictions based on visual features learned by a CNN or RNN and adopting the CTC loss [6] to compute the conditional probability between the predicted and target sequences. To handle irregular images, Shi et al. [22] introduce an attention-based spatial transform mechanism that transforms a distorted text region into a canonical pose suitable for recognition.

3 SynthText-Transfer method

The input to SynthText-Transfer is one image from the target dataset, used as the style image, together with a desired word. The style image and the word are fed into the Content Image Initialization module to generate the content image. The Style Transfer Network (STN) then transforms the content image into the specific style. The STN is a feed-forward style transfer network: once trained, it can convert the input image into an arbitrary style through a single forward pass, which allows fast generation.

3.1 Content image initialization

Neural style transfer methods usually take an existing natural image as the content image and an artistic image as the style image. In our problem, we regard an image from the target dataset as the style image, but the content image does not exist initially. To handle this, we use an interpretable pipeline to construct a content image that contains the desired word and has colors similar to the style image.

One common solution is to initialize the content image with pure white text on a black background. However, this is bound to introduce a large bias during style transfer because of the huge gap between such an image and the final desired result. We therefore want the initial content image to already be similar to the final result, so that the style transfer network only needs to adjust it slightly. The pipeline of content image initialization is shown in Fig. 2 and consists of the following four steps; a minimal code sketch follows Fig. 2.

  a. Use the K-means algorithm to separate the style image into two regions. We regard the region that contains more pixels in the four 5*5 image corners as the background region and the other as the text region. This step produces a binary mask for the style image that distinguishes foreground from background.

  b. Extract the maximum continuous rectangle of each region in the binary mask produced in step a. This step yields the two largest rectangular regions of the foreground and the background.

  c. Crop the corresponding regions from the style image according to the maximum rectangles and resize them to a fixed size. This step produces the color bases for the text and the background.

  d. Generate a new text binary mask from the chosen string and mix the background base and the text base according to the new mask to produce the content image.

Fig. 2 The pipeline of the Content Image Initialization. a Text segmentation. b Max rectangle extraction. c Resize to fixed size. d Mix color into content mask
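To make the pipeline concrete, here is a hedged Python sketch of steps a–d. Steps b and c are simplified to taking one representative color per region instead of extracting and resizing the maximum rectangles, and all function and parameter names (e.g., `font_path`) are illustrative assumptions rather than the actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from PIL import Image, ImageDraw, ImageFont

def init_content_image(style_img, word, font_path):
    """Simplified sketch of Content Image Initialization.
    style_img: uint8 array of shape (h, w, 3); word: desired string."""
    h, w, _ = style_img.shape
    pixels = style_img.reshape(-1, 3).astype(np.float32)

    # Step a: split pixels into two clusters and build a binary mask.
    labels = KMeans(n_clusters=2, n_init=4).fit_predict(pixels).reshape(h, w)

    # The cluster with more pixels in the four 5x5 corners is the background.
    corners = np.concatenate([labels[:5, :5].ravel(), labels[:5, -5:].ravel(),
                              labels[-5:, :5].ravel(), labels[-5:, -5:].ravel()])
    bg_label = 1 if (corners == 1).sum() > (corners == 0).sum() else 0

    # Steps b-c (simplified): one representative color per region.
    bg_color = style_img[labels == bg_label].mean(axis=0)
    fg_color = style_img[labels != bg_label].mean(axis=0)

    # Step d: render the desired word as a new text mask and mix the colors.
    mask_img = Image.new("L", (w, h), 0)
    font = ImageFont.truetype(font_path, size=int(0.8 * h))
    ImageDraw.Draw(mask_img).text((2, 0), word, fill=255, font=font)
    text_mask = (np.array(mask_img) > 0)[..., None]

    content = np.where(text_mask, fg_color, bg_color)
    return content.astype(np.uint8)
```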

3.2 Style transfer network

The Style Transfer Network (STN) is similar to [1]. As shown in Fig. 3, we use a pre-trained VGG19 model [23] as the encoder to map images from RGB space into feature space and perform Patch Swap [1] on the features extracted by the encoder. A trained invert network then serves as the decoder that maps the swapped feature map back to RGB space; its structure is the encoder reversed from bottom to top. Patch Swap densely extracts 3*3 patches from the feature maps of the content and style images and replaces each feature patch of the content image with its most similar patch from the feature map of the style image, where patch similarity is measured by the cosine function. The fast parallel implementation of Patch Swap takes three steps: 1) construct a 2D convolution layer by treating each 3*3 patch of the style feature map as a convolution kernel, giving a kernel tensor of shape [3, 3, number of feature map channels, number of patches]; 2) pass the content feature through this convolution layer, then apply a channel-wise argmax and one-hot encoding to the convolution results; 3) apply a transposed 2D convolution with the kernels from step one to obtain the swapped feature map. To preserve more details, we perform Patch Swap [1] at the Relu2-1 layer, use average pooling instead of max pooling in the encoder, and use bilinear interpolation instead of nearest-neighbor up-sampling in the decoder. As shown in Fig. 4, average pooling and bilinear up-sampling make the results smoother.
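The following is a hedged sketch of this three-step parallel Patch Swap. The paper's implementation is in TensorFlow, whereas this sketch uses PyTorch for compactness, and the overlap averaging at the end is one common way to handle overlapping 3*3 patches rather than a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def patch_swap(content_feat, style_feat, patch_size=3):
    """Parallel Patch Swap sketch. Inputs are encoder feature maps of
    shape (1, C, H, W)."""
    # Step 1: every patch of the style feature map becomes one conv kernel.
    patches = F.unfold(style_feat, kernel_size=patch_size)          # (1, C*k*k, P)
    c = style_feat.shape[1]
    kernels = patches.permute(0, 2, 1).reshape(-1, c, patch_size, patch_size)

    # L2-normalize kernels so the convolution computes normalized
    # cross-correlation (normalizing the content side does not change argmax).
    norm_kernels = kernels / (kernels.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8)

    # Step 2: correlate, then channel-wise argmax and one-hot encoding.
    scores = F.conv2d(content_feat, norm_kernels, padding=patch_size // 2)
    best = scores.argmax(dim=1, keepdim=True)
    one_hot = torch.zeros_like(scores).scatter_(1, best, 1.0)

    # Step 3: transposed convolution with the original kernels pastes the
    # selected style patches back; overlapping contributions are averaged.
    swapped = F.conv_transpose2d(one_hot, kernels, padding=patch_size // 2)
    overlap = F.conv_transpose2d(one_hot, torch.ones_like(kernels),
                                 padding=patch_size // 2)
    return swapped / overlap.clamp(min=1e-8)
```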

Fig. 3 The structure of the Style Transfer Network

Fig. 4 Comparison of different settings in Patch Swap. Left: max pooling and nearest-neighbor up-sampling. Right: average pooling and bilinear interpolation up-sampling

3.3 Loss formulation

The loss function is formulated as the sum of the content loss, the style loss and the total variance loss. The content loss is the mean squared content loss [4] between the re-encoded features of the decoded result, f(g(t)), and the target features t. The style loss, following AdaIn [9], encourages the channel-wise mean and variance of the generated image to match those of the style image at several layers. The standard total variance loss [4] encourages smoothness. In Eqs. (1)-(4), t is the output of the VGG encoder for the content image, s is the output of the VGG encoder for the style image, g(.) is the VGG decoder, f(.) is the VGG encoder, Φi denotes the i-th layer of VGG-19 used to compute the style loss (e.g., the Relu2-1 and Relu1-1 layers), μ and σ denote the channel-wise mean and variance of a feature map respectively, I denotes the generated image, h, w and d denote the height, width and depth of the generated image respectively, and α, β are weight parameters.

$$ L_{c} = || f(g(t)) - t ||_{2}. $$
(1)
$$ L_{s} = \sum\limits_{i} || \mu({\Phi}_{i}(g(t))) - \mu({\Phi}_{i}(g(s))) ||_{2} + \sum\limits_{i} || \sigma({\Phi}_{i}(g(t))) - \sigma({\Phi}_{i}(g(s))) ||_{2}. $$
(2)
$$ L_{v} = \sum\limits_{i = 1}^{h-1}\sum\limits_{j = 1}^{w}\sum\limits_{k = 1}^{d} (I_{i + 1,j,k} - I_{i,j,k} )^{2} + \sum\limits_{i = 1}^{h}\sum\limits_{j = 1}^{w-1}\sum\limits_{k = 1}^{d} (I_{i,j + 1,k} - I_{i,j,k} )^{2}. $$
(3)
$$ L_{all} = L_{c} + \alpha L_{s} + \beta L_{v}. $$
(4)
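A minimal PyTorch sketch of Eqs. (1)-(4) is given below; the `encoder`, `decoder` and `style_layers` arguments stand in for f(.), g(.) and the Φi taps of VGG-19 and are assumptions of this sketch, not the paper's code.

```python
import torch

def total_loss(t, s, encoder, decoder, style_layers, alpha=0.5, beta=1e-5):
    """Compute L_all = L_c + alpha * L_s + beta * L_v as in Eqs. (1)-(4).
    t, s: encoded content and style features; style_layers: callables
    returning the Phi_i activations (e.g. Relu1-1, Relu2-1) of an image."""
    g_t = decoder(t)    # generated image g(t)
    g_s = decoder(s)    # decoded style features g(s), as written in Eq. (2)

    # Eq. (1): content loss between the re-encoded result and the target t.
    l_c = torch.norm(encoder(g_t) - t, p=2)

    # Eq. (2): match channel-wise mean and variance (sigma in the text) per layer.
    l_s = 0.0
    for phi in style_layers:
        f_t, f_s = phi(g_t), phi(g_s)
        l_s = l_s + torch.norm(f_t.mean(dim=(2, 3)) - f_s.mean(dim=(2, 3)), p=2)
        l_s = l_s + torch.norm(f_t.var(dim=(2, 3)) - f_s.var(dim=(2, 3)), p=2)

    # Eq. (3): total variance loss over the generated image I = g(t).
    l_v = ((g_t[:, :, 1:, :] - g_t[:, :, :-1, :]) ** 2).sum() \
        + ((g_t[:, :, :, 1:] - g_t[:, :, :, :-1]) ** 2).sum()

    # Eq. (4): weighted sum.
    return l_c + alpha * l_s + beta * l_v
```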

4 Experiments

In this section, we evaluate our model on a number of standard scene text recognition datasets. First, we use SynthText-Transfer to generate synthetic data for each dataset; some transferred results are shown in Figs. 5 and 6. Second, we fine-tune two typical text recognition models (lexicon-free) on the raw datasets with data augmentation and on the synthetic datasets generated with Patch Swap, AdaIn and WCT respectively, and we verify the effectiveness of the proposed method by the performance improvement brought by SynthText-Transfer. For the English benchmarks, performance is measured by word accuracy. For the English-and-Chinese benchmark, performance is measured by both word accuracy and average edit distance, to evaluate the similarity between the predicted word and the labeled word more precisely.

Fig. 5 Example comparison of Patch Swap (a, d), AdaIn (b, e) and WCT (c, f). Patch Swap introduces more edge effects and moves the text structure closer to the style source, whereas AdaIn and WCT only change color and texture

Fig. 6 More samples generated by SynthText-Transfer. In each group of four images with similar style, the first image is the style source from a benchmark dataset and the last three are transferred results; the first two rows come from English datasets, the last two rows from MTWI

4.1 Implementation details

SynthText-Transfer is implemented in TensorFlow. The expanded datasets used in the experiments are 200 times larger than the original datasets, except the last one (MTWI), whose generated dataset is 20 times larger. Our hardware is one NVIDIA Titan Xp GPU, an Intel Core i5 CPU and 16GB of memory, running Ubuntu 14.04.

Content image initialization

The font set is the same as in [10]. The text vocabulary comes from Newsgroup20 [12]. The word length is chosen uniformly at random between [w/h]-3 and [w/h]+3, where w and h are the width and height of the style image respectively.

Style transfer network

As in AdaIn [9], we train the invert network on the MSCOCO [15] and ArtPainting [3] datasets so that the decoder learns to invert features in both natural and artistic space. This is an unsupervised process. During training, we uniformly sample 100x32 crops from the images to fit the size of typical text line images. The optimizer is Adam with β1 = 0.5, and we set α = 0.5 and β = 0.00001. Training the decoder takes about 20k iterations to converge. Once trained, SynthText-Transfer can perform style transfer directly for arbitrary input. If the shorter edge of the style image is less than 64 pixels, we resize the style image to a height of 64 while keeping the aspect ratio. The transferred results have the same size as the style images. We keep both the initial content images and the results after the style swap to preserve data diversity. For WCT, we set α = 0.5 during generation.
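For completeness, a hedged sketch of the decoder training loop under these settings follows; the `data_loader` yielding 100x32 crop pairs, the single `style_layers` placeholder and the `total_loss` function from Section 3.3 are assumptions of this sketch rather than the paper's actual training code.

```python
import itertools
import torch

def train_decoder(encoder, decoder, data_loader, steps=20000):
    """Train the invert network (decoder); only the decoder's parameters are
    updated, so the pre-trained VGG19 encoder stays fixed. data_loader is
    assumed to yield (content_crop, style_crop) batches of 100x32 crops."""
    opt = torch.optim.Adam(decoder.parameters(), betas=(0.5, 0.999))  # beta1 = 0.5
    batches = itertools.cycle(data_loader)
    for step in range(steps):
        content, style = next(batches)
        t = encoder(content)          # content features
        s = encoder(style)            # style features
        loss = total_loss(t, s, encoder, decoder,
                          style_layers=[encoder],  # placeholder for Relu1-1 / Relu2-1 taps
                          alpha=0.5, beta=1e-5)
        opt.zero_grad()
        loss.backward()
        opt.step()
```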

4.2 Datasets and models

Datasets

For the text recognition task, ICDAR 2003 (IC03) [17] contains 1156 cropped word images for training and 1107 for testing. ICDAR 2013 (IC13) [11] contains 849 images for training and 1095 for testing, mostly inherited from IC03. Street View Text (SVT) [26] provides 257 training and 647 test images of street signs, and IIIT5k [18] provides 2000 training and 3000 test images collected from the Internet. SynthText90k [10] contains 8 million highly realistic training images. All of them except SynthText90k hold only thousands of images with limited vocabulary and pattern diversity.

The datasets mentioned above are all in English. In addition, MTWI [27] is a large English-and-Chinese dataset with 134816 labeled images. However, hundreds of thousands of images are still insufficient for Chinese, because there are 5530 distinct characters in common use in modern Chinese and 27484 supported in Unicode, far more than in English and therefore harder to learn. The test labels are not publicly accessible yet, so we filter out images that contain vertical text and randomly split the training set into two parts, 81387 images for training and 20276 images for testing. Following [21], we discard images that contain fewer than 3 characters during testing.

Models for evaluation

Text recognition models mainly come in two types. The first type is a pipeline consisting of a CNN, an RNN and the CTC loss [6]; we choose CRNN [21], which consists of 7 convolutional layers, 2 bi-directional LSTM layers and the CTC loss. The second type is a similar pipeline that replaces CTC with an attention mechanism in the RNN; we choose AttentionOCR [22], which consists of 7 convolutional layers and 2 bi-directional RNN layers equipped with attention. For simplicity, we only evaluate the Sequence Recognition Network part and discard the Spatial Transform Network part.

4.3 Effectiveness of SynthText-Transfer

We compare the performance obtained with the transferred data against that obtained with the raw data plus data augmentation. The data augmentation we use is random rotation within 5 degrees and Gaussian noise with μ = 0 and σ = 2.0. We conduct experiments on two typical text recognition models and several standard datasets with different training processes. The naive fine-tuning process fine-tunes the model on the target dataset only. The half fine-tuning process forces half of each training batch to come from the target dataset and the other half from SynthText90k; this is a practical trick when there is a significant imbalance between the pre-training dataset and the target dataset. A minimal sketch of the augmentation and the half-batch mixing is given below.
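The following Python sketch illustrates the baseline augmentation and the half fine-tuning batch assembly under the assumption that the datasets are simple lists of (image, label) pairs; all names are our own and not taken from the paper's code.

```python
import numpy as np
from scipy.ndimage import rotate

def augment(img):
    """Baseline augmentation: random rotation within 5 degrees plus
    Gaussian noise with mean 0 and sigma 2.0 (pixel values in [0, 255])."""
    img = rotate(img, angle=np.random.uniform(-5, 5), reshape=False, mode="nearest")
    return np.clip(img + np.random.normal(0.0, 2.0, img.shape), 0, 255)

def half_finetune_batch(target_ds, synth90k_ds, batch_size=64):
    """Half fine-tuning trick: half of each batch comes from the (small)
    target dataset, the other half from SynthText90k."""
    half = batch_size // 2
    pick = lambda ds, n: [ds[i] for i in np.random.randint(len(ds), size=n)]
    return pick(target_ds, half) + pick(synth90k_ds, batch_size - half)
```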

Improvement for text recognition models in English datasets

Comparing these results, we can infer the performance contribution of SynthText-Transfer. From Tables 1 and 2, we can see that for naive fine-tuning and half fine-tuning, fine-tuning on data transferred with Patch Swap achieves better performance on all benchmarks than fine-tuning on raw data with data augmentation or on data transferred with AdaIn or WCT; it is worse than AdaIn and WCT only when training from scratch. Compared with normal data augmentation, the random content and style transfer in SynthText-Transfer introduce a larger vocabulary and more diversity into the generated data, which brings a significant performance improvement. Meanwhile, data transferred with AdaIn and WCT sometimes performs worse than raw data with augmentation, which is discussed in the next section.

Table 1 Word accuracy of CRNN using different training processes
Table 2 Word accuracy of AttentionOCR using different training processes

When we train the text recognition models from scratch, the transferred data also outperforms the raw data with data augmentation. This is due to the limited diversity in patterns and vocabulary of the raw datasets, which shows that generating varied synthetic data is especially crucial for small datasets. When training from scratch, AdaIn and WCT outperform Patch Swap. We think this is because the simpler data produced by AdaIn and WCT are easier for a model to learn from scratch, whereas the details produced by Patch Swap are better exploited by a pre-trained model that already has general knowledge of text recognition.

Improvement for text recognition models in English and Chinese datasets

Here we only train CRNN from scratch, with a generated dataset 20 times larger than the original, to measure the improvement on MTWI [27]. As shown in Table 3, Chinese, a complex writing system containing thousands of unique characters, far beyond the 26 characters of English, is harder for a sequence model to learn. The improvement brought by the synthetic data is therefore more pronounced than when training on English datasets. This shows that the proposed method could be even more useful for languages that contain an enormous number of unique characters and require much more labor and money than English to obtain reliable public benchmarks.

Table 3 Word accuracy and Mean Edit Distance of CRNN in MTWI [27] using different training processes

The impact of data size

Table 4 shows the results of AttentionOCR on the IC13 dataset using different amounts of synthetic data generated with Patch Swap. As shown in Table 4, the accuracy of AttentionOCR grows with the size of the SynthText-Transfer dataset, which is consistent with the experiments on ImageNet in [25].

Table 4 The impact of data size: accuracy of AttentionOCR with different amounts of synthetic data on the IC13 dataset

4.4 Visual quality and time consumption comparisons for Patch Swap, AdaIn and WCT

The experiments in Section 4.3 statistically indicate that Patch Swap is the most suitable arbitrary style transfer operation for synthesizing text images. In this section, we examine Patch Swap's suitability visually and compare the time consumption. Figure 5 shows the results transferred by Patch Swap, AdaIn and WCT separately.

As shown in Fig. 5, Patch Swap is superior to AdaIn and WCT for two reasons. First, it can adjust the text font to be closer to the unknown artistic font of the style image. Second, Patch Swap can supplement the content image with edge effects from the style image. AdaIn and WCT, in contrast, focus on the similarity of global statistics; they can only change the color and texture of the content image and fail to emulate details such as the text structure and edge effects.

As for time consumption, Table 5 shows the time each algorithm needs to generate one image of height 32 pixels and width 256 pixels. Patch Swap [1], AdaIn [9] and WCT [14] are feed-forward style transfer methods that differ only in the transfer operation, while [4] is iterative style transfer, which has the highest quality but is much slower. As shown in Table 5, feed-forward style transfer [1, 9, 14] is much faster than iterative style transfer [4]. Patch Swap is slightly slower than AdaIn and WCT but has much higher visual quality: it is designed specifically to enhance details, which suits text images, whereas AdaIn and WCT are designed for natural and artistic images and focus on global features.

Table 5 Speed for different style transfer methods to generate one image

5 Discussion

In this section we discuss the disadvantages of the proposed method and some future work:

SynthText-Transfer takes a sample from the target dataset and generates more samples with similar texture but different text content. Therefore, if the collected dataset is too small to represent most situations of the target scene, the enlarged synthetic dataset can still lead to overfitting.

The fast parallel implementation of Patch Swap following [1] consumes too much memory. With our hardware, a single NVIDIA Titan GPU cannot generate an image whose longer edge exceeds 512 pixels. Reducing memory usage while preserving speed is under consideration.

The pipeline still involves many hand-crafted operations that restrict the diversity of the generated images, such as the fixed font set used when initializing content images. In addition, the K-means segmentation occasionally misclassifies the text and background regions. Making the pipeline end-to-end is left for future work.

6 Conclusion

SynthText-Transfer addresses the lack of training data by generating more samples with similar texture but different text content via style transfer. According to our experiments, Patch Swap is the most suitable operation for arbitrary style transfer. The interpretable and flexible pipeline of SynthText-Transfer lets the user strictly control the desired text content, fonts, distortion and random noise. SynthText-Transfer allows us to effectively create training datasets for text recognition in arbitrary languages, bringing obvious improvements when there is not enough training data for a language. In the future we will explore new methods to synthesize datasets that can be used for both text recognition and localization and verify their effectiveness for more languages.