Abstract
Most existing datasets for scene text recognition consist of only a few thousand training samples with a very limited vocabulary, which cannot meet the requirements of state-of-the-art deep learning based text recognition methods. Meanwhile, although synthetic datasets (e.g., SynthText90k) usually contain millions of samples, they cannot completely fit the data distribution of the small target datasets in natural scenes. To address these problems, we propose a word data generating method called SynthText-Transfer, which is capable of emulating the distribution of the target dataset. SynthText-Transfer uses a style transfer method to generate samples with arbitrary text content that preserve the texture of a reference sample from the target dataset. The generated images are not only visually similar to real images, but also capable of improving the accuracy of state-of-the-art text recognition methods, especially for the English and Chinese dataset with a large alphabet (in which many characters appear in only a few samples, making them hard for sequence models to learn). Moreover, the proposed method is fast and flexible, with a speed competitive with common style transfer methods.
1 Introduction
In recent years, convolutional and recurrent networks have made significant progress in the text recognition task, far surpassing conventional methods. Deep learning based methods require a large amount of data to avoid overfitting, i.e., models achieving high accuracy on the training dataset but low accuracy on the test dataset. However, popular benchmark datasets for scene text recognition contain only thousands of images, which cannot meet the requirements of state-of-the-art deep learning based methods. The common solution to the lack of data is transfer learning, which first pre-trains the model on large datasets that contain millions of images, e.g., ImageNet [2], SynthText90k [10], SynthText in the Wild [7] and FSNS [24], and then fine-tunes the pre-trained model on the target dataset with data augmentation [16] or few-shot learning [19]. However, this solution cannot effectively suppress the overfitting problem either. The most effective solution could be to generate enough samples with the same distribution as the training dataset during the fine-tuning period.
Inspired by Neural Style Transfer [4] and style transfer based on Generative Adversarial Networks [5, 28], this paper proposes a method called SynthText-Transfer to generate arbitrary images with the same texture but different text content for specific tasks. An overview of the model is given in Fig. 1. In SynthText-Transfer, we choose an image from the raw dataset as the style image and a randomly generated word as the content text. The Content Image Initialization module initializes the content image based on the content text and the style image. The Style Transfer Network then produces a synthetic image that preserves the content and looks like the style image. In the experiments, we choose two typical sequence-based text recognition models, CRNN [21] and AttentionOCR [22], to verify the capability of the proposed data generating method. Both models gain better performance on benchmark datasets after being fine-tuned on the synthetic dataset generated by SynthText-Transfer, which verifies the ability of SynthText-Transfer to relieve overfitting during the fine-tuning period.
The contributions of this paper are three-fold. First, we propose a novel pipeline to generate a large number of text images from a few collected samples. Second, SynthText-Transfer can produce appealing synthetic data, which can further improve the accuracy of existing text recognition models. Third, SynthText-Transfer is as fast as other fast style transfer methods, which allows plentiful samples to be generated quickly.
2 Related work
Synthetic data for text recognition
Synthetic datasets provide detailed ground-truth annotations, and are cheap and scalable alternatives to manually annotated datasets. The datasets most relevant to SynthText-Transfer are SynthText90k [10] and SynthText in the Wild [7]. Jaderberg et al. [10] release a synthetic dataset called SynthText90k containing 9 million samples for text recognition in the wild, which is commonly used to pre-train text recognition models. SynthText in the Wild [7] is the first large synthetic dataset for both text localization and recognition. Both try to cover as many situations as possible but fail to synthesize images with the specific texture and distribution of the target samples.
Style transfer
Style transfer methods can be divided into two categories: Neural Style Transfer and Generative Adversarial Networks (GAN). Gatys et al. [4] were the first to propose a neural method for style transfer, which transfers the style by iteratively updating the pixels of the image directly. A series of studies have improved this method in terms of both quality [13, 20] and speed [1, 9]. Li et al. [13] argued that using patch matching instead of the Gram matrix enhances details. In terms of speed, generative neural methods first optimize a generative model iteratively and then produce the styled image through a single forward pass. It is worth mentioning that Patch Swap [1], AdaIn [9] and WCT [14] allow transfer between arbitrary content and style images within one feed-forward network. AdaIn (Adaptive Instance Normalization) [9] aligns the channel-wise mean and variance of the feature map x to match those of y. Patch Swap [1] swaps each content activation patch x from the feature map of the content image with its closest-matching style patch y from the feature map of the style image. The patch similarity is measured by normalized cross-correlation. The goal of WCT (Whitening and Coloring Transforms) [14] is to directly transform the feature map of the content image to match the covariance matrix of the feature map of the style image. The works in [5, 28] use GANs to perform style transfer in different applications, such as generating various bedrooms, converting natural characters into MNIST style and converting human faces into cartoon style. However, SynthText-Transfer aims to generate multiple specific instances with strong, controllable spatial restrictions and precise details, which is too difficult for a single GAN. Hence we do not choose a GAN method.
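For reference, the AdaIn operation mentioned above is commonly written as follows, where μ(·) and σ(·) are computed per channel over the spatial dimensions of a feature map. This restates the usual formulation associated with [9], in which σ is the standard deviation; it is given here only as background and is not specific to SynthText-Transfer.

```latex
\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y)
```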
While neural style transfer methods can be trained without supervision, they generally require users to supply existing content and style images. In this problem, we choose an image from the target dataset as the style image and conduct a content image initialization to build the content image, which together serve as the input to the Style Transfer Network.
Scene text recognition
In recent years, several deep learning based methods have been proposed for the scene text recognition task. Jaderberg et al. [10] pose the word recognition problem as a multi-class classification task with about 90K classes. He et al. [8] and Shi et al. [21] treat the text recognition task as a sequence-to-sequence learning problem, using an RNN for sequential predictions based on visual features learned by a CNN or RNN and adopting CTC loss [6] to calculate the conditional probability between the predicted and the target sequences. To handle irregular images, Shi et al. [22] introduce an attention-based spatial transform mechanism to transform a distorted text region into a canonical pose suitable for recognition.
3 SynthText-Transfer method
The input of SynthText-Transfer is one image from the target dataset as the style image and a desired word. The style image and the word are fed into the Content Image Initialization module to generate the content image. The Style Transfer Network (STN) then transforms the content image into the specific style. The STN is a feed-forward style transfer network. Once trained, it can convert the input image into an arbitrary style through a single forward pass, which allows fast generation.
3.1 Content image initialization
Neural style transfer methods usually take an existing natural image as the content image and an artistic image as the style image. In this problem, we regard an image from the target dataset as the style image, but the content image does not exist initially. To handle this problem, we use an interpretable pipeline to construct a content image that contains the desired word and has colors similar to those of the style image.
One common solution is to initialize the content image with pure white text on a black background. However, this solution is bound to introduce a large bias when performing style transfer because of the huge gap between the initial image and the final desired image. Therefore, we want the initial content image to be similar to the final result, so that the style transfer network only needs to adjust it slightly. The pipeline of content image initialization is shown in Fig. 2 and consists of the following four steps.
a. Use the K-means algorithm to separate the style image into two unknown regions. We regard the region which contains more pixels in the four 5×5 image corners as the background region and the other one as text region. This step will produce a binary mask for the style image to distinguish its foreground and background.
b. Extract max continuous rectangles of both regions in the binary mask generated from Step a. This step will produce two max rectangle regions of the foreground and the background.
c. Select the corresponding regions in the style image according to max rectangles and resize them into the fixed size. This step will produce the color base for the text and the background.
d. Generate a new text binary mask with the chosen string and mix the background base and the text base according to the new mask to generate the content image.
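The four steps above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that are not in the original: the style image is a grayscale nested list of at least 5×5 pixels containing both text and background, K-means is run on intensities only, the mean intensity of each maximal rectangle stands in for the resized color base of step c, and the new text mask is supplied by the caller rather than rendered from a font. All function names are hypothetical.

```python
def two_means(values, iters=50):
    """1-D K-means with K=2; returns the (low, high) cluster centers."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iters):
        mid = (lo + hi) / 2.0  # nearest-center assignment is a threshold in 1-D
        low = [v for v in values if v <= mid]
        high = [v for v in values if v > mid]
        if not low or not high:
            break
        new = (sum(low) / len(low), sum(high) / len(high))
        if new == (lo, hi):
            break
        lo, hi = new
    return lo, hi


def text_mask(gray, corner=5):
    """Step a: binary mask, True for text pixels and False for background."""
    lo, hi = two_means([v for row in gray for v in row])
    labels = [[abs(v - hi) < abs(v - lo) for v in row] for row in gray]
    rows, cols = len(gray), len(gray[0])
    corner_rows = list(range(corner)) + list(range(rows - corner, rows))
    corner_cols = list(range(corner)) + list(range(cols - corner, cols))
    votes = [labels[r][c] for r in corner_rows for c in corner_cols]
    background = sum(votes) * 2 > len(votes)  # majority label in the corners
    return [[label != background for label in row] for row in labels]


def largest_rectangle(mask, value):
    """Step b: largest axis-aligned rectangle of `value` as (top, left, height, width)."""
    heights = [0] * len(mask[0])
    best, best_area = (0, 0, 0, 0), 0
    for r, row in enumerate(mask):
        # Height of the run of `value` cells ending at row r, per column.
        heights = [h + 1 if cell == value else 0 for h, cell in zip(heights, row)]
        stack = []  # (start column, height), heights increasing
        for c, h in enumerate(heights + [0]):
            start = c
            while stack and stack[-1][1] >= h:
                start, sh = stack.pop()
                if sh * (c - start) > best_area:
                    best_area = sh * (c - start)
                    best = (r - sh + 1, start, sh, c - start)
            stack.append((start, h))
    return best


def mean_in(gray, rect):
    """Step c (simplified): mean intensity of a rectangle as the color base."""
    top, left, height, width = rect
    vals = [gray[r][c] for r in range(top, top + height)
            for c in range(left, left + width)]
    return sum(vals) / len(vals)


def init_content(gray, new_mask):
    """Step d: mix the text base and background base according to new_mask."""
    mask = text_mask(gray)
    text_base = mean_in(gray, largest_rectangle(mask, True))
    back_base = mean_in(gray, largest_rectangle(mask, False))
    return [[text_base if m else back_base for m in row] for row in new_mask]
```

Using textured patches resized from the two rectangles, as the text describes, instead of a single mean value would keep more of the style image's local appearance in the initial content image.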
3.2 Style transfer network
The Style Transfer Network (STN) is similar to that of [1]. As shown in Fig. 3, we use a pre-trained VGG19 model [23] as the encoder to map images from RGB space into feature space, and perform Patch Swap [1] on the features extracted by the encoder. We then use a trained inverse network as the decoder to map the swapped feature map back to RGB space. The structure of the inverse network mirrors that of the encoder, with the layers in reverse order. Patch Swap densely extracts 3×3 patches from the feature maps of the content image and the style image, and then replaces each feature patch of the content image with its most similar patch in the feature maps of the style image. The similarity of patches is measured by cosine similarity. The fast parallel implementation of Patch Swap takes three steps: 1) Construct a 2D convolution layer by using each 3×3 patch in the style feature map as the weights of one kernel. The kernel size is [3, 3, the number of patches, the number of feature map channels]. 2) Pass the content feature through the constructed convolution layer, then perform channel-wise argmax and one-hot encoding on the convolution results. 3) Use the kernels from step 1 to perform a transposed 2D convolution and obtain the swapped feature map. To preserve more details, we perform Patch Swap [1] at the Relu2-1 layer, use average pooling instead of max pooling in the encoder, and use bilinear interpolation instead of nearest-neighbor interpolation when up-sampling in the decoder. As shown in Fig. 4, average pooling and bilinear up-sampling make the results smoother.
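The Patch Swap operation described above can be illustrated with explicit loops instead of the convolution-based parallel implementation. The sketch below is not the authors' code: it assumes feature maps stored as nested lists indexed `[y][x][channel]`, at least 3×3 in size and with matching channel counts, and it averages overlapping pasted patches, which is one common way to resolve overlaps and is an assumption here. The function names are hypothetical.

```python
import math


def extract_patches(feat, k=3):
    """Return {(y, x): flattened k*k*C patch} for every valid top-left corner."""
    h, w = len(feat), len(feat[0])
    return {
        (y, x): [v for dy in range(k) for dx in range(k) for v in feat[y + dy][x + dx]]
        for y in range(h - k + 1)
        for x in range(w - k + 1)
    }


def patch_swap(content, style, k=3):
    """Replace each content patch with its best-matching style patch."""
    style_patches = list(extract_patches(style, k).values())
    # Normalizing only the style patches is enough: the content norm is a
    # constant factor per content patch and does not change the argmax.
    norms = [math.sqrt(sum(v * v for v in p)) or 1.0 for p in style_patches]
    h, w, ch = len(content), len(content[0]), len(content[0][0])
    acc = [[[0.0] * ch for _ in range(w)] for _ in range(h)]
    cnt = [[0] * w for _ in range(h)]
    for (y, x), cp in extract_patches(content, k).items():
        scores = [sum(a * b for a, b in zip(cp, sp)) / n
                  for sp, n in zip(style_patches, norms)]
        best = style_patches[scores.index(max(scores))]
        # Paste the chosen style patch back; overlaps are averaged below.
        values = iter(best)
        for dy in range(k):
            for dx in range(k):
                cnt[y + dy][x + dx] += 1
                for c in range(ch):
                    acc[y + dy][x + dx][c] += next(values)
    return [[[v / cnt[y][x] for v in acc[y][x]] for x in range(w)] for y in range(h)]
```

The score computed for each pair is the correlation that step 1 and step 2 of the parallel version obtain with a single convolution followed by an argmax; the paste-back loop plays the role of the transposed convolution in step 3.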
3.3 Loss formulation
The loss function is formulated as the sum of the content loss, the style loss and the total variation loss. The content loss is the mean squared content loss [4] between the features of the generated result, obtained by decoding and then re-encoding, and the features of the content image. The style loss, as in AdaIn [9], encourages the mean and variance of the generated image to be closer to those of the style image at different layers and channels. The standard total variation loss [4] encourages smoothness. In Eqs. (1)-(4), t is the output of the VGG encoder for the content image, s is the output of the VGG encoder for the style image, g(.) is the VGG decoder, f(.) is the VGG encoder, Φi denotes the i-th layer in VGG-19 used to compute the style loss, such as the Relu2-1 and Relu1-1 layers, and μ and σ stand for the mean and the variance of the channel-wise feature layer, respectively. I denotes the generated image, and h, w, d denote the height, width and depth of the generated image, respectively. α and β are weight parameters.
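The equations referred to as (1)-(4) are not reproduced in this text. One plausible form that is consistent with the definitions above is sketched below, with I = g(t). The normalization constants, the pairing of α with the style term and β with the total variation term, and the symbol I_s for the style image are assumptions made for illustration, not the authors' stated formulas.

```latex
\begin{aligned}
L_{c}  &= \lVert f(g(t)) - t \rVert_2^2 \\
L_{s}  &= \sum_{i} \Big( \lVert \mu(\Phi_i(I)) - \mu(\Phi_i(I_s)) \rVert_2^2
          + \lVert \sigma(\Phi_i(I)) - \sigma(\Phi_i(I_s)) \rVert_2^2 \Big) \\
L_{tv} &= \frac{1}{h\,w\,d} \sum_{x, y, c} \Big( (I_{x+1,\,y,\,c} - I_{x,\,y,\,c})^2
          + (I_{x,\,y+1,\,c} - I_{x,\,y,\,c})^2 \Big) \\
L      &= L_{c} + \alpha\, L_{s} + \beta\, L_{tv}
\end{aligned}
```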
4 Experiments
In this section, we evaluate our model on a number of standard scene text recognition datasets. First, we use SynthText-Transfer to generate synthetic data for each dataset. Some transferred results are shown in Figs. 5 and 6. Second, we fine-tune two typical text recognition models (in the lexicon-free setting) on the raw datasets with data augmentation and on the synthetic datasets generated with Patch Swap, AdaIn and WCT, respectively. We then verify the effectiveness of the proposed method through the performance improvement brought by SynthText-Transfer. For the English benchmarks, performance is measured by word accuracy. For the English and Chinese benchmark, performance is measured by both word accuracy and average edit distance, to evaluate the similarity between the predicted word and the labeled word more precisely.
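The two metrics can be computed as in the sketch below. It assumes that "edit distance" means the standard Levenshtein distance and that the average is an unnormalized mean over all test words; the text does not state whether the distance is normalized by word length, so that choice is an assumption. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def evaluate(predictions, labels):
    """Return (word accuracy, average edit distance) over paired lists."""
    n = len(labels)
    accuracy = sum(p == t for p, t in zip(predictions, labels)) / n
    avg_distance = sum(edit_distance(p, t) for p, t in zip(predictions, labels)) / n
    return accuracy, avg_distance
```

Word accuracy counts a prediction as correct only when the whole word matches, while the edit distance still gives partial credit to a prediction that is wrong by a single character, which matters more for long Chinese text lines.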
4.1 Implementation details
SynthText-Transfer is implemented in TensorFlow. In the experiments, the expanded datasets are 200 times larger than the original datasets, except for the last one (MTWI), for which the generated set is 20 times larger. Our hardware and software environment consists of one Nvidia Titan Xp GPU, an Intel Core i5 CPU, 16 GB of memory and Ubuntu 14.04.
Content image initialization
The font set is the same as in [10]. The text vocabulary comes from Newsgroup20 [12]. The word length is chosen uniformly at random between [w/h] − 3 and [w/h] + 3, where w and h are the width and height of the style image, respectively.
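The word-length rule can be sketched as follows. The sketch assumes that [w/h] denotes the aspect ratio rounded to an integer, clamps the lower bound to 1, and samples uniformly over the lengths that actually occur in the vocabulary; these details, and the function name, are assumptions made for illustration.

```python
import random


def sample_word(vocabulary, width, height, rng=random):
    """Pick a word whose length lies in [round(w/h) - 3, round(w/h) + 3]."""
    ratio = round(width / height)
    low, high = max(1, ratio - 3), ratio + 3
    by_length = {}
    for word in vocabulary:
        if low <= len(word) <= high:
            by_length.setdefault(len(word), []).append(word)
    if not by_length:
        raise ValueError("no word of a suitable length in the vocabulary")
    length = rng.choice(sorted(by_length))  # uniform over available lengths
    return rng.choice(by_length[length])
```

Tying the word length to the aspect ratio keeps the rendered characters at roughly the same width-to-height proportion as the text in the style image.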
Style transfer network
When training the inverse network as in AdaIn [9], we use the MSCOCO [15] and ArtPainting [3] datasets so that the decoder learns to invert features from both natural and artistic images. This is an unsupervised process. During training, we uniformly sample 100×32 crops from images to match the size of typical text line images. The optimizer is Adam with β1 = 0.5, and the loss weights are α = 0.5 and β = 0.00001. Training the decoder takes about 20k iterations to converge. Once trained, SynthText-Transfer is directly used to perform style transfer on arbitrary inputs. If the shorter edge of the style image is less than 64 pixels, we resize the style image to a height of 64 while keeping the aspect ratio. Transfer results have the same size as the style images. We store both the initial content images and the results after style swap to preserve data diversity. For WCT, we set α = 0.5 during generation.
4.2 Datasets and models
Datasets
For the text recognition task, ICDAR 2003 (IC03) [17] contains 1156 cropped word images for training and 1107 for testing. ICDAR 2013 (IC13) [11] contains 849 images for training and 1095 for testing, most of which are inherited from IC03. Street View Text (SVT) [26] contains 257 training and 647 test images of street signs, and IIIT5k [18] contains 2000 training and 3000 test images collected from the Internet. SynthText90k [10] contains 8 million highly realistic training images. All of them except SynthText90k hold only thousands of images with limited vocabulary and pattern diversity.
The datasets mentioned above are all English. In addition, there is a large English and Chinese dataset from MTWI [27] with 134816 labeled images. However, even hundreds of thousands of images are still insufficient for Chinese, because 5530 distinct characters are commonly used in modern Chinese and 27484 are supported in Unicode, far more than in English and therefore harder to learn. The test labels are not publicly accessible yet, so we filter out images that contain vertical text and randomly split the training set into two parts: 81387 images for training and 20276 images for testing. Following [21], we discard images that contain fewer than 3 characters during testing.
Models for evaluation
Text recognition models mainly fall into two types. The first type is a pipeline consisting of a CNN, an RNN and CTC loss [6]. We choose CRNN [21] for this type in our experiments. The CRNN model consists of 7 convolutional layers, 2 bi-directional LSTM layers and CTC loss. The second type is a similar pipeline that uses an attention mechanism within the RNN in place of CTC. We choose AttentionOCR [22] for this type. AttentionOCR consists of 7 CNN layers and 2 bi-directional RNN layers equipped with an attention mechanism. For simplicity, we only test the Sequence Recognition Network and discard the Spatial Transform Network.
4.3 Effectiveness of SynthText-Transfer
We compare the performance obtained with transferred data against that obtained with raw data plus data augmentation. The data augmentation we use is random rotation within 5 degrees and Gaussian noise with μ = 0 and σ = 2.0. We conduct experiments on two typical text recognition models and several standard datasets with different training processes. Naive fine-tuning fine-tunes the model on the target dataset only. Half fine-tuning draws half of each training batch from the target dataset and the other half from SynthText90k. This is a practical trick when there is a significant size imbalance between the pre-training dataset and the target dataset.
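The half fine-tuning batch composition and the Gaussian-noise part of the augmentation can be sketched as below. The sketch assumes sampling with replacement, pixel values in the range 0 to 255, and grayscale images stored as nested lists; random rotation is omitted because it needs an image library. The function names are hypothetical.

```python
import random


def half_batch(target_samples, pretrain_samples, batch_size, rng=random):
    """Build one batch: half from the target set, half from the pre-training set."""
    half = batch_size // 2
    batch = rng.choices(target_samples, k=half)
    batch += rng.choices(pretrain_samples, k=batch_size - half)
    rng.shuffle(batch)  # avoid a fixed ordering of the two sources
    return batch


def add_gaussian_noise(gray, sigma=2.0, rng=random):
    """Add zero-mean Gaussian noise to every pixel and clip to [0, 255]."""
    return [[min(255.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in gray]
```

Fixing the per-batch ratio keeps the small target dataset from being swamped by the millions of pre-training images, which would happen if the two sets were simply concatenated and sampled uniformly.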
Improvement for text recognition models in English datasets
Comparing these results, we can infer the performance contribution of SynthText-Transfer. Tables 1 and 2 show that, for both naive and half fine-tuning, fine-tuning on data transferred by Patch Swap achieves better performance on all benchmarks than fine-tuning on raw data with data augmentation or on data transferred by AdaIn or WCT; when training from scratch, however, Patch Swap performs worse than AdaIn and WCT. Compared with normal data augmentation methods, the random content and style transfer in SynthText-Transfer introduce a larger vocabulary and more diversity into the generated data, which can bring significant performance improvement. Meanwhile, data transferred by AdaIn and WCT sometimes perform worse than raw data with augmentation, which will be discussed in the next section.
When we train text recognition models from scratch, the transferred data also outperforms the raw data with data augmentation. This is due to the limited diversity in patterns and vocabulary of the raw datasets, which implies that generating varied synthetic data is especially crucial for small datasets. When training from scratch, AdaIn and WCT outperform Patch Swap. We believe this is because the simpler data from AdaIn and WCT are easier for models to learn from scratch, whereas the details produced by Patch Swap are better exploited by a pre-trained model that already has general knowledge of text recognition.
Improvement for text recognition models in English and Chinese datasets
In this experiment, we only train CRNN from scratch and use a generated dataset 20 times larger than the original to measure the improvement on MTWI [27]. As shown in Table 3, Chinese, a complex writing system with thousands of unique characters compared with the 26 letters of English, is harder for sequence models to learn. The improvement brought by synthetic data is more obvious than on the English datasets. This suggests that the proposed method could be more useful for languages that contain an enormous number of unique characters and require much more labor and money than English to build reliable public benchmarks.
The impact of data size
Table 4 shows the results of AttentionOCR using different sizes of synthetic data generated with Patch Swap on the IC13 dataset. As shown in Table 4, the accuracy of AttentionOCR grows with the amount of SynthText-Transfer data, which is consistent with the experiments on ImageNet in [25].
4.4 Visual quality and time consumption comparisons for Patch Swap, AdaIn and WCT
The experiments in Section 4.3 indicate statistically that Patch Swap is the most suitable arbitrary style transfer operation for synthesizing text images. In this section, we examine Patch Swap's suitability visually and compare the time consumption. Figure 5 shows the results transferred by Patch Swap, AdaIn and WCT, respectively.
As shown in Fig. 5, Patch Swap is superior to AdaIn and WCT for two reasons. First, it can adjust the text font to be closer to the unknown artistic font of the style image. Second, Patch Swap can supplement the content image with edge effects from the style image. In contrast, AdaIn and WCT focus on matching global statistics, so they can only change the color and texture of the content image and fail to emulate details such as text structure and edge effects.
As for time consumption, Table 5 shows the time each algorithm takes to generate one image 32 pixels high and 256 pixels wide. Patch Swap [1], AdaIn [9] and WCT [14] are feed-forward style transfer methods that differ in the transfer operation. The method of [4] is an iterative style transfer method, which has the highest quality but is much slower. As shown in Table 5, feed-forward style transfer [1, 9, 14] is much faster than iterative style transfer [4]. Patch Swap is a bit slower than AdaIn and WCT but has much higher visual quality. Patch Swap is designed specifically for detail enhancement, which suits text images, while AdaIn and WCT are designed for natural and artistic images and focus on global features.
5 Discussion
In this section, we discuss the disadvantages of the proposed method and some future work:
SynthText-Transfer receives a sample from the target dataset and generates more samples with similar texture but different text content. Therefore, if the collected dataset is too small to represent most situations of the target scene, training on the larger synthetic dataset could lead to overfitting as well.
The fast parallel implementation of Patch Swap according to [1] consumes too much memory. In our hardware environment, with one Nvidia Titan GPU, we cannot generate an image whose longer edge exceeds 512 pixels. Reducing memory usage while preserving speed is under consideration.
This pipeline still involves many hand-crafted operations that restrict the diversity of the generated images, such as the fixed font set used when initializing content images. In addition, K-means segmentation occasionally misclassifies text and background regions. Making the pipeline end-to-end requires future work.
6 Conclusion
SynthText-Transfer addresses the lack of training data by generating more samples with similar texture but different text content through style transfer. According to the experiments, Patch Swap is the most suitable arbitrary style transfer operation for this task. The interpretable and flexible pipeline of SynthText-Transfer enables users to strictly control the desired text content, fonts, distortion and random noise. SynthText-Transfer allows us to effectively create training datasets in arbitrary languages for text recognition, bringing obvious improvement when sufficient training data is not available for a language. In the future, we will explore new methods to synthesize datasets that can be used for both text recognition and localization, and verify their effectiveness for more languages.
References
1. Chen T Q, Schmidt M (2016) Fast patch-based style transfer of arbitrary style. arXiv:1612.04337
2. Deng J, Dong W, Socher R, Li L J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: Computer vision and pattern recognition, 2009. CVPR 2009. IEEE conference on. IEEE, pp 248–255
3. Duck S Y (2016) Painter by numbers. https://www.kaggle.com/c/painter-by-numbers
4. Gatys L, Ecker A, Bethge M (2015) A neural algorithm of artistic style. Nature communications
5. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
6. Graves A, Fernández S, Gomez F, Schmidhuber J (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 369–376
7. Gupta A, Vedaldi A, Zisserman A (2016) Synthetic data for text localisation in natural images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2315–2324
8. He P, Huang W, Qiao Y, Loy C C, Tang X (2016) Reading scene text in deep convolutional sequences. In: AAAI, vol 16, pp 3501–3508
9. Huang X, Belongie S J (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV, pp 1510–1519
10. Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2014) Synthetic data and artificial neural networks for natural scene text recognition. arXiv:1406.2227
11. Karatzas D, Shafait F, Uchida S, Iwamura M, i Bigorda L G, Mestre S R, Mas J, Mota D F, Almazan J A, De Las Heras L P (2013) Icdar 2013 robust reading competition. In: Document analysis and recognition (ICDAR), 2013 12th international conference on. IEEE, pp 1484–1493
12. Lang K (1995) Newsweeder: Learning to filter netnews. In: Machine learning proceedings 1995. Elsevier, pp 331–339
13. Li C, Wand M (2016) Combining markov random fields and convolutional neural networks for image synthesis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2479–2486
14. Li Y, Fang C, Yang J, Wang Z, Lu X, Yang M H (2017) Universal style transfer via feature transforms. In: Advances in neural information processing systems, pp 386–396
15. Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C L (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, pp 740–755
16. Liu X, Liang D, Yan S, Chen D, Qiao Y, Yan J (2018) Fots: fast oriented text spotting with a unified network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5676–5685
17. Lucas S M, Panaretos A, Sosa L, Tang A, Wong S, Young R, Ashida K, Nagai H, Okamoto M, Yamamoto H et al (2005) Icdar 2003 robust reading competitions: entries, results, and future directions. Int J Doc Anal Recogn (IJDAR) 7 (2-3):105–122
18. Mishra A, Alahari K, Jawahar C (2012) Scene text recognition using higher order language priors. In: BMVC-British machine vision conference. BMVA
19. Ravi S, Larochelle H (2016) Optimization as a model for few-shot learning
20. Risser E, Wilmot P, Barnes C (2017) Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv:1701.08893
21. Shi B, Bai X, Yao C (2017) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39(11):2298–2304
22. Shi B, Wang X, Lyu P, Yao C, Bai X (2016) Robust scene text recognition with automatic rectification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4168–4176
23. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: ICLR
24. Smith R, Gu C, Lee D S, Hu H, Unnikrishnan R, Ibarz J, Arnoud S, Lin S (2016) End-to-end interpretation of the french street name signs dataset. In: European conference on computer vision. Springer, pp 411–426
25. Sun C, Shrivastava A, Singh S, Gupta A (2017) Revisiting unreasonable effectiveness of data in deep learning era. In: Computer vision (ICCV), 2017 IEEE international conference on. IEEE, pp 843–852
26. Wang K, Babenko B, Belongie S (2011) End-to-end scene text recognition. In: Computer vision (ICCV), 2011 IEEE international conference on. IEEE, pp 1457–1464
27. Wang Y, Bai X, Liu C-L (2018) Icpr mtwi 2018 challenge 1 text recognition of web images. https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100150.711.7.6ad52784HcABoy&raceId=231650&_lang=en_US
28. Yeh R A, Chen C, Lim T Y, Schwing A G, Hasegawa-Johnson M, Do M N (2017) Semantic image inpainting with deep generative models. In: CVPR, vol 2, p 4
Acknowledgments
This work is supported by National Natural Science Foundation of China under Grant 61673029. This work is also a research achievement of Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology).
Jiahui Li and Siwei Wang made equal contributions to this paper
Li, J., Wang, S., Wang, Y. et al. Synthesizing data for text recognition with style transfer. Multimed Tools Appl 78, 29183–29196 (2019). https://doi.org/10.1007/s11042-018-6656-3
DOI: https://doi.org/10.1007/s11042-018-6656-3