1 Introduction

Text in scene images carries important high-level semantic information, which helps in analyzing and understanding the corresponding environment. With the rapid popularization of smart phones and mobile computing devices, images containing text are acquired more conveniently and efficiently than ever. Scene text recognition (STR) has therefore become an active research topic in computer vision, with applications including image retrieval, automatic navigation and human–computer interaction [1,2,3]. Moreover, the International Conference on Document Analysis and Recognition (ICDAR) initiated the "Robust Reading" competition in 2003, and since then numerous techniques and methods have been proposed that greatly advance the development of STR.

Text detection and recognition are the two fundamental tasks of STR. Text detection aims to determine the position of text in the input image, and the position is often represented by a bounding box. Generally, the target bounding box may be a rectangle, an oriented rectangle or a quadrilateral. More precisely, the parameters \((x,y,w,h)\), \((x,y,w,h,\theta )\) and \((x_{1} ,y_{1} ,x_{2} ,y_{2} ,x_{3} ,y_{3} ,x_{4} ,y_{4} )\) denote horizontal, rotated and arbitrary quadrilateral bounding boxes respectively. Text recognition aims to convert image regions containing text into machine-readable strings. Different from general image classification, the dimension of the output sequence for text recognition is not fixed. In most cases, text detection is a preliminary step of text recognition. Recently, many researchers have begun to integrate the detection and recognition tasks into end-to-end text recognition systems. Given a small lexicon, word spotting offers an effective strategy for realizing end-to-end recognition [4].
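
As an illustration of these parameterizations, the following Python sketch (ours, not from any cited method; center-point and angle conventions vary across papers) converts an oriented rectangle \((x,y,w,h,\theta)\) into the 8-parameter quadrilateral form:

```python
import numpy as np

def rotated_rect_to_quad(x, y, w, h, theta):
    """Convert an oriented rectangle (center x, center y, w, h, theta in
    radians) into the quadrilateral form (x1, y1, ..., x4, y4)."""
    c, s = np.cos(theta), np.sin(theta)
    dx = np.array([-w, w, w, -w]) / 2.0   # corner offsets before rotation
    dy = np.array([-h, -h, h, h]) / 2.0
    xs = x + c * dx - s * dy              # rotate, then translate to center
    ys = y + s * dx + c * dy
    return np.stack([xs, ys], axis=1).reshape(-1)

print(rotated_rect_to_quad(0.0, 0.0, 4.0, 2.0, np.pi / 6))
```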

Traditional optical character recognition (OCR) mainly targets document images acquired by a scanner [5]. Since even old scanners offer sufficient resolution for text image acquisition, the recognition rates of many OCR methods easily reach 99%. Compared to traditional OCR, however, STR is more challenging for the following reasons:

  (1)

    Texts are often scattered across the scene image, and there is no prior information about their location. For scanned documents, the number of text lines, the line spacing and even the number of words are known. For scene texts, however, we cannot directly apply segmentation methods designed for document images, since no such formatting rules exist.

  (2)

    Scene texts often come in a variety of sizes, fonts and orientations. Targets in a scene image may contain decorated or specially designed characters, such as presentation slides on a screen, calligraphic slogans on a wall, or messages on a digital signboard. Such texts with multifarious appearance are difficult for traditional OCR engines to recognize.

  (3)

    The quality of scene images acquired by digital devices is potentially poor. At present, scene text covers a wide range of applications linked to wearable cameras or massive urban captures, in which acquisition conditions are difficult or undesirable to control. Therefore, characters and their background often suffer from very low contrast or perspective distortion, which makes localization and recognition difficult. Figure 1 shows some examples of scene text images that are not easily detected and recognized.

    Fig. 1 Examples of scene text images

  (4)

    There are many character-like patterns (non-characters) in scene images. Since the background of a scene image is often complex, many ambiguous objects such as leaves, windows or icons look much like characters or words. Moreover, scene texts sometimes connect to other objects, which easily produces confusing patterns.

In this paper, we provide a comprehensive review of scene text detection and recognition research over the past decade and highlight the key techniques. Moreover, we compare state-of-the-art methods and report their performance on several standard benchmark datasets.

2 Scene Text Detection

As mentioned above, scene text detection is a challenging problem. As in the majority of computer vision tasks, most earlier text detection methods are based on hand-crafted features and prior knowledge; since around 2015, deep learning based methods have emerged and gradually become the mainstream.

2.1 Hand-Crafted Feature Extraction Stage

Traditional text detectors focus on developing hand-crafted low-level features to discriminate between text and non-text components in scene images. They can be classified into two main categories, i.e., sliding window (SW) and connected component (CC) based methods.

2.1.1 SW Methods

SW methods first detect text by moving a multi-scale sub-window through all possible locations in an image, and then use a pre-trained classifier to identify whether text is contained within the sub-window [22].
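
A minimal sketch of the window enumeration step (the classifier itself is omitted; the scales and stride fraction are illustrative assumptions):

```python
def sliding_windows(img_h, img_w, scales=(32, 64, 128), stride_frac=0.25):
    """Enumerate multi-scale square sub-windows over an image. A real SW
    detector would score each window with a pre-trained text/non-text
    classifier and keep the high-scoring ones."""
    for size in scales:
        stride = max(1, int(size * stride_frac))
        for top in range(0, img_h - size + 1, stride):
            for left in range(0, img_w - size + 1, stride):
                yield (left, top, size, size)  # (x, y, w, h)

windows = list(sliding_windows(240, 320))
print(len(windows), windows[:3])
```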

Wang et al. [6] provided an end-to-end pipeline for STR, in which multi-scale character detection is performed via SW classification. Features are first extracted from chosen entries of a HOG descriptor computed at the window location, and Random Ferns are then applied to evaluate the likelihood that a character is present at that location. Pan et al. [7] estimated text confidence and scale information via SW, after which a conditional random field (CRF) model is proposed to filter out non-text components. Similarly, Mishra et al. [8] used a standard SW method with a character aspect-ratio prior to detect potential character locations in scene images. Wang et al. [9] applied a convolutional neural network (CNN) model in an SW scheme to obtain candidate text lines in a given image and thus estimate text locations. Jaderberg et al. [10] also applied a CNN in SW fashion to compute a text saliency map, which keeps the same resolution as the original image through zero-padding; word bounding boxes can then be generated from these saliency maps.

The main difficulties for this group of methods lie in designing discriminative features to train a powerful classifier, and in reasonably managing the number of scanning windows to reduce computational complexity.

2.1.2 CC Methods

CC methods first extract candidate components from the image, and then filter out non-text components using manually designed rules or automatically trained classifiers [23]. Compared to SW methods, these methods are more efficient and robust. There are two representative techniques, i.e., the stroke width transform (SWT) and maximally stable extremal regions (MSER).

Epshtein et al. [11] presented the SWT operator, which computes the width of the most likely stroke for each image pixel. The Canny edge detector is first used to find edges in the image. When a pair of edge pixels with opposite gradient directions is found, the stroke between them is considered valid, and such pixels are grouped into character candidates. Neumann et al. [12] formulated character detection as finding all contiguous regions of the image for which the probability of representing text attains a local maximum. Based on this formulation, an MSER classifier is trained to find regions containing characters, and post-processing and connection rules are then applied to combine the candidate characters into text lines. The MSER method needs less prior knowledge and is more robust to language and oriented text. In order to address problems with blurry images or low-contrast characters, the same authors later performed character detection over all extremal regions (ERs) instead of only MSERs [13, 14]. They use incrementally computable descriptors as features to train a sequential classifier, which reduces the high false positive rate in real time. Yin et al. [15] proposed a fast MSER pruning algorithm, which significantly reduces the number of character candidates to be processed. Character candidates are clustered into text candidates by a single-link clustering algorithm, whose distance weights and clustering threshold are learnt automatically. This new MSER based method is more robust and efficient for text detection.
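
For concreteness, the snippet below sketches MSER-based candidate extraction with OpenCV's `cv2.MSER_create`; the synthetic test image and the aspect-ratio filter are illustrative assumptions, not the rules of any particular paper:

```python
import cv2
import numpy as np

# Render a synthetic grayscale image so the example is self-contained.
img = np.full((120, 320), 255, np.uint8)
cv2.putText(img, "TEXT", (20, 80), cv2.FONT_HERSHEY_SIMPLEX, 2, 0, 5)

mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(img)  # stable regions + their boxes

for (x, y, w, h) in bboxes:
    aspect = w / float(h)
    if 0.1 < aspect < 10.0:  # crude geometric filter against non-text blobs
        print("character candidate:", x, y, w, h)
```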

Generally speaking, CC methods tend to produce numerous non-text components. Therefore, correctly filtering out the false positives is critical to the success of this group of methods.

2.1.3 Hybrid Methods

In order to handle scene text with cluttered backgrounds more efficiently, several hybrid methods have been proposed, which take advantage of different methods and combine them through specific schemes.

Huang et al. [16] applied a CNN to learn high-level features from the MSER components of an image. These components show high discriminative ability and strong robustness against complicated backgrounds. Moreover, an SW model and non-maximum suppression (NMS) are incorporated into the CNN classifier to handle the problem of connected multiple characters. Gomez et al. [17] used the MSER algorithm to obtain an initial segmentation of the image. They then propose a text-specific selective search strategy, which groups the initial regions by agglomerative clustering into a hierarchy where each node defines a possible word hypothesis; finally, a ranked list of proposals prioritizing the best hypotheses is provided for text detection. Busta et al. [18] proposed a stroke detector, which first finds stroke key-points and then uses them to obtain stroke segmentations for scene text. They show that, compared to the traditional MSER method, stroke-specific key-points can detect more characters with fewer region segmentations. Cho et al. [20] presented the Canny text detector, a multi-stage algorithm. The ER method is first utilized to extract as many character candidates as possible, and overlapping candidates are eliminated by NMS. The remaining candidates are then classified as strong text, weak text or non-text via double thresholding, and low-confidence candidates, i.e., weak text, are selected by hysteresis tracking from the strong text. Finally, the surviving text candidates are grouped into sentences. Fabrizio et al. [21] presented a hybrid text detector, which adopts a CC method to generate text candidates and applies texture analysis to compose text strings or discard false positives. CCs are first obtained by the toggle mapping morphological segmentation (TMMS) algorithm, and a shape descriptor based on fast wavelet decomposition classifies each CC as character or non-character. After that, a series of texture features are used to train a support vector machine (SVM) for post-processing. He et al. [22] developed a contrast-enhanced maximally stable extremal regions (CE-MSERs) detector, which extends conventional MSERs by enhancing the intensity contrast between text patterns and background. Furthermore, they trained a text-attentional CNN that extracts high-level features including a text region mask, character labels, and binary text/non-text information; the two schemes are combined into an effective text detection model. Zhang et al. [19] proposed a text detector that exploits the symmetry property of character groups. Different from traditional methods that mainly exploit the properties of single characters or strokes, this detector utilizes context information from the scene image to extract text lines.

2.2 Deep Learning Era

Recently, deep learning has been widely and very successfully applied to semantic segmentation and general object detection. Accordingly, related methods have also been adopted in the field of text detection. In general, semantic segmentation based detectors first extract text blocks from the segmentation map generated by a fully convolutional network (FCN), after which text bounding boxes are obtained by complex post-processing. General object detection based methods, in contrast, predict candidate bounding boxes directly by regarding texts as objects. Different from common objects, text has a clear definition of orientation, which should be predicted in addition to the axis-aligned bounding box information.

2.2.1 Semantic Segmentation Based Methods

Yao et al. [24] cast scene text detection as a semantic segmentation problem. They use an FCN model based on holistically-nested edge detection (HED) to produce global maps containing information on text regions, individual characters and their relationships, and the proposed algorithm can detect multi-oriented and curved text in scene images. He et al. [33] presented the cascaded convolutional text network (CCTN), which uses two networks to perform coarse-to-fine segmentation of the scene image: the coarse network outputs a per-pixel heat-map indicating the location and probability of text instances, and the fine network outputs two heat-maps for final text detection. Zhang et al. [25] also perform text detection in a coarse-to-fine manner. They first use an FCN (Text-Block FCN) to predict the salient map of text blocks, then apply the MSER method to extract multi-oriented text line candidates, and finally train another smaller FCN (Character-Centroid FCN) to provide character centroid information, based on which false text line candidates are eliminated. Qin et al. [26] proposed a text detector based on a cascade of two CNNs. Text regions of interest are first produced by an FCN and then resized to a square shape of fixed size. Word detection follows, i.e., a YOLO-like network is trained to generate oriented rectangular bounding boxes for all words, and an NMS stage finally handles overlapping bounding boxes. He et al. [40] proposed an FCN architecture for multi-oriented scene text detection with two tasks: the classification task performs down-sampled text/non-text segmentation of the input image, and the regression task determines the vertex coordinates of quadrilateral text boundaries through direct regression. Zhou et al. [44] also proposed an FCN based model for scene text detection. It generates multiple channels of pixel-level text score maps and geometry, is flexible enough to produce either word level or line level predictions, and is complemented by a locality-aware NMS with low time complexity for post-processing. Dai et al. [27] presented a detector based on fused text segmentation networks. Features of each image are first extracted by a ResNet-101 backbone; multi-level feature maps are then combined and fed to a region proposal network (RPN) for text region-of-interest (ROI) generation. The whole architecture performs text detection and segmentation simultaneously and provides predictions at both the pixel and word level. Deng et al. [28] proposed a scene text detector (PixelLink) based on instance segmentation. A Single Shot Detector (SSD) [29] like architecture is used to extract features and perform text/non-text prediction as well as link prediction. The predicted positive pixels are joined into text instances by the predicted positive links, and text bounding boxes are finally generated directly from the segmentation result without location regression. Li et al. [30] proposed the progressive scale expansion network (PSENet) for segmentation-based text detection. To handle closely adjacent text instances, a progressive scale expansion algorithm is presented: inspired by breadth-first search, the expansion starts from the pixels of multiple kernels and iteratively merges adjacent text pixels until the largest kernels are fully explored. Yang et al. [31] proposed the IncepText architecture based on instance-aware segmentation, which can deal with scene texts with large variance in scale, aspect ratio, and orientation. A ResNet-50 module is first used for feature extraction, and an Inception-Text module is appended after feature fusion. Furthermore, deformable PSROI pooling [32] is applied to detect multi-oriented text.

This group of methods is suitable for handling multi-oriented text in real-world scene images. When text instances are very close to each other, however, text/non-text semantic segmentation alone can hardly separate them. Therefore, post-processing is often inevitable to improve the performance.

2.2.2 General Object Detection Based Methods

Zhong et al. [34] developed a unified framework (DeepText) for text detection. An inception-RPN is proposed, which achieves high recall with only hundreds of word region proposals by applying multi-scale sliding windows over the feature maps and designing a set of text-characteristic prior bounding boxes for each sliding position. Gupta et al. [35] presented an efficient engine that generates synthetic scene images with text annotations, and the synthetic images are used to train a fully-convolutional regression network (FCRN) for text detection. Since an extreme variant of Hough voting is adopted in FCRN, individual predictions can be aggregated across the input image. Tian et al. [36] proposed a connectionist text proposal network (CTPN) to localize scene text. In CTPN, a VGG16 backbone is first used for feature extraction, a vertical anchor mechanism is then developed to predict text locations at a fine scale, and a bidirectional long short-term memory (BLSTM) network finally connects the fine-scale sequential text proposals. Liao et al. [37] presented an end-to-end trainable scene text detector (TextBoxes) inspired by SSD. Since SSD is a general object detector, it cannot be directly applied to text detection. To address this, text-box layers are included in the TextBoxes architecture, which detect words with extreme aspect ratios by designing long default boxes and irregular 1×5 convolutional filters. Ma et al. [38] proposed a rotation region proposal network (RRPN), which is built upon the Faster R-CNN [39] architecture. Since the ground truth (GT) of a text region is represented by the 5-tuple \((x,y,w,h,\theta )\), where \(\theta\) is the angle parameter, RRPN can generate inclined proposals with text orientation information. Jiang et al. [41] also proposed a Faster R-CNN based architecture, the rotational region CNN (R2CNN), for arbitrarily-oriented text detection. They point out that using an angle parameter makes it hard for the network to detect vertical texts; therefore, an inclined rectangle is represented in R2CNN by the coordinates of the first two vertices in clockwise order and the height of the bounding box. Liu et al. [42] designed a small set of quadrilateral sliding windows to roughly recall text. In the training phase, a shared Monte Carlo method is proposed to compute the overlapping area between the GT and a sliding window; sliding windows beyond a given overlap threshold are considered positive and used to finely localize the text. Shi et al. [43] proposed a novel perspective, i.e., that texts are composed of segments and links. A segment is a part of a word or text line, and a link connects two adjacent segments. Both segments and links are detected by an SSD-like network and then taken as the nodes and edges of a graph respectively; finally, a depth-first search (DFS) algorithm finds the connected components (words or text lines). Liao et al. [45] presented a rotation-sensitive regression detector (RRD) based on SSD, which has two network branches: the regression branch extracts rotation-sensitive features by rotating the convolutional filters, while the classification branch extracts rotation-invariant features by pooling the rotation-sensitive features.

Detectors of this kind are trained with bounding-box annotations, just as general object detection methods are, which makes it difficult to learn fine-grained information about text. When handling small-scale texts, using a single-shot model alone may cause accuracy loss. Moreover, anchors or default boxes with various scales, aspect ratios and orientations must be designed in advance.

2.2.3 Hybrid Methods

Recently, some researchers have tried to combine the two kinds of methods above so as to detect text correctly in more complex situations. He et al. [46] proposed a text attention model, which encodes strong text-specific information using a pixel-wise text mask. This model effectively suppresses background interference in the convolutional features. Furthermore, multi-scale inception features are aggregated to encode rich local and context information for text prediction, and the whole detector works in a coarse-to-fine manner. Zhong et al. [47] presented an anchor-free region proposal network (AF-RPN), which generates high-quality inclined text proposals directly without complicated hand-crafted anchor design. In AF-RPN, three detection modules are attached to different pyramid levels to detect small, medium and large text instances. Lyu et al. [48] proposed a hybrid network for multi-oriented scene text detection. The corner points of text regions are first detected and, at the same time, position-sensitive segmentation maps are predicted. Candidate bounding boxes are then generated by sampling and grouping corner points, and finally suppressed using NMS. He et al. [49] presented an end-to-end text spotter based on the idea of Mask R-CNN [50]. In particular, a text-alignment layer is designed by introducing a grid sampling scheme; it computes fixed-length convolutional features that precisely align to a detected text region of arbitrary orientation. The bounding box and segmentation mask of text are jointly predicted in this multi-task model.

3 Discussion

In general, traditional methods based on hand-crafted feature extraction consist of several steps, which makes the detection system complicated and inefficient and easily leads to error accumulation. Moreover, they need many manual optimizations of classification rules. Deep learning based methods, however, inherit the merits of machine learning: given a sufficient number of training samples, they outdistance traditional methods in terms of both accuracy and efficiency. Figure 2 shows focused scene text detection results on standard datasets (including ICDAR 2003, ICDAR 2005, ICDAR 2011 and ICDAR 2013) in terms of the F-measures reported in the literature discussed in Sects. 2.1 and 2.2. The blue and red bars represent traditional and deep learning based methods respectively. Deep learning based methods clearly achieve overwhelming performance, which explains why they have recently become the mainstream.

Fig. 2 Performance comparison of representative scene text detectors

4 Scene Text Recognition

Similar to text detection, scene text recognition has also experienced the transition from traditional approaches using hand-crafted features to the deep learning era. In this section, we roughly classify current mainstream text recognition methods into three categories: character classification based, word classification based and sequence based methods.

4.1 Character Classification Based Methods

Bissacco et al. [51] use a deep neural network trained on HOG features for character classification. To enhance recognition performance, a two-level language model is adopted: a compact character-level n-gram model is held in RAM, and a much larger distributed word-level n-gram model is accessed over the network. Jaderberg et al. [57] proposed a CNN based architecture employing a conditional random field (CRF) graphical model: unary terms are provided by a CNN that predicts a character at each position of the output, and higher-order terms are provided by another CNN that detects the presence of n-grams. Lee et al. [60] presented recursive recurrent neural networks (RNNs) with an attention model for text recognition. The RNNs learn a character-level language model without using n-grams, and the soft-attention mechanism allows the model to select features flexibly for end-to-end training.

This group of methods finds individual characters in the scene image and then recognizes them one by one. Complex heuristic rules or language models are often indispensable to integrate characters into words, owing to the occurrence of missing or superfluous characters.

4.2 Word Classification Based Methods

Jaderberg et al. [52] proposed a synthetic data engine that generates large numbers of cropped word images in different styles. A CNN framework trained on this synthetic data, without hand-crafted labeling, achieves high word recognition performance. Shi et al. [56] presented a CNN variant for script identification in multilingual scenarios. In this network, feature maps with a fixed number of rows but a variable number of columns are fed to a spatially-sensitive pooling (SSP) layer, which handles images of arbitrary size. Furthermore, a multi-stage pooling scheme is adopted so as to utilize both higher- and lower-level features for recognition. Kang et al. [63] designed a context-aware convolutional recurrent network for word recognition. Besides a lexicon dictionary, the metadata of the input image, such as title, tags, and comments, are used as a context prior to improve the recognition rate. Yang et al. [65] proposed an adaptive ensemble of deep neural networks (AdaDNNs), which selects and combines network components from different iterations within a Bayesian formulation framework for text recognition.

Word recognition is in effect a multi-class classification task with a very large number of class labels (e.g. the number of English words is about 90,000). The strong expressive power and computational capacity of CNNs make this task feasible. However, the deformation of long word images may affect the recognition rate, and this kind of method often relies on a pre-defined dictionary.

4.3 Sequence Based Methods

Shi et al. [55] proposed a convolutional recurrent neural network (CRNN) for image-based sequence recognition. A standard CNN model is first used to extract a sequential feature representation from the input image. A bidirectional long short-term memory (LSTM) network connected to the top convolutional layers then predicts a label distribution for each frame of the feature sequence. Finally, connectionist temporal classification (CTC) is applied to find the label sequence with the highest probability conditioned on the per-frame predictions. He et al. [58] developed a deep-text recurrent network (DTRN) for scene text recognition. Similar to [55], a MaxOut CNN encodes the input image into an ordered sequence, and an LSTM decodes the CNN sequence into a word string. To deal with perspective-distorted and curved text, Shi et al. [59] proposed a recognizer with automatic rectification. A thin-plate-spline (TPS) transformation is first applied to the input image, and the rectified image is then fed to a sequence recognition network (SRN) to obtain the final result. The methods mentioned above mainly follow an encoder-decoder framework and use a frame-wise loss to optimize the model. However, misalignment between the ground truth (GT) sequence and the output probability distribution (PD) sequence may mislead the training [68]. In [68], an edit probability (EP) method is therefore proposed for accurate text recognition: EP measures the probability of a text string conditioned on the input image under the parameters of the attention model being trained, while taking possible occurrences of missing or superfluous characters into account.

The advantages and disadvantages of the three kinds of methods for text recognition are summarized in Table 1.

Table 1 Comparison of different kinds of text recognition methods

4.4 Hybrid Methods

In this subsection, we also review some hybrid text recognition methods, which mainly rely on intricate graphical models or hand-crafted feature design and do not strictly fall into the above categories. Shi et al. [71] use a tree-structured model to generate detection windows containing candidate characters. A CRF model built on the detection windows then decides character locations, and word recognition is finally performed according to a cost function defined by character detection scores, spatial constraints and linguistic knowledge. Yao et al. [72] represent each candidate character by a set of strokelets that capture the essential substructures of a character at multiple scales. Coupled with a HOG descriptor, a random forest classifier with high performance and efficiency can be trained. Almazan et al. [54] proposed a word recognition method based on embedded attributes. On one hand, a pyramidal histogram of characters (PHOC) representation is defined for each word, which embeds label strings into a d-dimensional space; on the other hand, the word image is represented by a Fisher vector. Finally, the attributes and PHOCs are learned by training an SVM. Lou et al. [62] represent the word recognition model as a high-order factor graph, in which hypothetical neighboring candidate characters are constructed as the edges of the graph and taken as random variables. Four factors, i.e., transition, smoothness, consistency, and singleton, are defined and applied for word parsing.

4.5 End-to-end Text Spotting

Text detection and recognition are usually combined to implement text spotting, rather than being treated as separate tasks. In a unified system, the recognizer not only produces recognition outputs but also regularizes text detection with its semantic-level awareness [70]. Wang et al. [9] applied a CNN to implement end-to-end text recognition. In this model, NMS is used to remove overlapping candidates and obtain the set of line-level text bounding boxes, and a beam search technique then finds the best segmentation into words. The proposed method achieves state-of-the-art results on character recognition, lexicon-driven cropped word recognition and end-to-end recognition tasks. Yao et al. [53] presented a unified framework in which text detection and recognition share both features and classification. Furthermore, the dictionary is generated from Bing search, whose error correction scheme can be used to improve the recognition rate. Jaderberg et al. [61] also proposed an end-to-end text spotting system. Word-level bounding box proposals are first obtained with high recall and then filtered by a random forest classifier to improve precision; two CNNs are used for bounding box regression and text recognition respectively. Moysset et al. [64] designed a CRNN system in which the convolutional layers share parameters over the different regressors to find text lines locally, and a 2D-LSTM model trained with CTC alignment recognizes the texts. Gomez et al. [67] presented a text-specific proposal method, which first extracts connected components from the input image and then groups them by similarity via single linkage clustering (SLC). Furthermore, a ranking strategy is designed to prioritize the best word proposals. Finally, an end-to-end word spotting system is built by incorporating the word recognizers of [61]. Liao et al. [70] proposed TextBoxes++, an extension of [37] that efficiently detects arbitrarily-oriented scene text; combined with a text recognizer, TextBoxes++ can also be used for end-to-end text spotting.

More recently, researchers have begun to design unified, end-to-end trainable deep neural networks (DNNs) that predict both text regions and text labels in a single forward pass. Bartz et al. [66] presented a single DNN that trains the text detector and recognizer jointly from the input image. Moreover, a recurrent spatial transformer is applied as an attention mechanism, so that text localization is learned by the network itself. Liu et al. [69] adopted an FCN to find text bounding boxes, based on which a RoIRotate operator is introduced to extract proper features from the shared feature maps. Finally, the features of each text proposal are fed to an RNN and CTC for text recognition.

5 Key Techniques for Scene Text Detection and Recognition

In this section, state-of-the-art techniques used in current scene text detection and recognition methods are reviewed. As mentioned in Sect. 2, deep learning based methods have become the mainstream for text detection. Therefore, Sects. 5.1 to 5.3 analyze the relevant schemes and issues, including network architecture, loss functions and multi-orientation detection. For text recognition, techniques related to language models and sequence labeling are discussed in Sects. 5.4 and 5.5.

5.1 Network Architecture

5.1.1 Fully Convolutional Network (FCN)

FCN [73] yields hierarchies of features for effective semantic segmentation (see Fig. 3). Since the merits of multi-scale learning and prediction conform to the nature of scene text, many methods [24,25,26, 33, 40] adopt FCN as their backbone for text detection. Generally, a pixel-wise text/non-text salient map is first obtained using the FCN, which produces pixel-wise labels or labeled regions containing text; candidate text bounding boxes can then be generated. The skip architecture of FCN lets receptive fields of different sizes encode both the local features and the global context of text.
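
A toy PyTorch sketch of this pipeline (illustrative only; real detectors use much deeper backbones and skip connections from several stages):

```python
import torch
import torch.nn as nn

class TinyTextFCN(nn.Module):
    """Minimal FCN-style network producing a pixel-wise text saliency map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.score = nn.Conv2d(32, 1, 1)                 # per-pixel text score
        self.up = nn.ConvTranspose2d(1, 1, 4, stride=4)  # back to input size

    def forward(self, x):
        return torch.sigmoid(self.up(self.score(self.encoder(x))))

saliency = TinyTextFCN()(torch.randn(1, 3, 64, 64))
print(saliency.shape)  # torch.Size([1, 1, 64, 64])
```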

Fig. 3 Architecture of FCN [73]

5.1.2 ResNet

Deeper neural networks are more difficult to train, since accuracy may saturate and then degrade rapidly. To address this degradation problem, He et al. [74] proposed a deep residual learning framework (ResNet), whose building block is defined as \(y = F(x,\{ W_{i} \} ) + x\) (see Fig. 4), where \(x\) and \(y\) are the input and output vectors of the layers considered, and \(F(x,\{ W_{i} \} )\) is the residual mapping to be learned. Some text detectors [27, 31] use ResNet-50/101 as the backbone for feature extraction.
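
The building block can be written down directly; below is a minimal PyTorch sketch of the identity-shortcut variant (matching input and output channels assumed):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The block y = F(x, {W_i}) + x from [74], identity-shortcut variant."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(f + x)  # the shortcut connection

y = ResidualBlock(8)(torch.randn(1, 8, 32, 32))
print(y.shape)  # torch.Size([1, 8, 32, 32])
```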

Fig. 4 A building block for residual learning [74]

5.1.3 Regions with CNN (R-CNN)

Fast R-CNN [39] is an end-to-end architecture for object detection. In this architecture, an input image and multiple regions of interest (RoIs) are fed into an FCN, and softmax probabilities and per-class bounding-box regression offsets are output (see Fig. 5a). Faster R-CNN [76] improves on Fast R-CNN by reducing the time spent on region proposal generation (see Fig. 5b): a region proposal network (RPN) that shares full-image convolutional features with the detection network is introduced, and the RPN and Fast R-CNN are merged into a single network by sharing their convolutional features. By incorporating additional components into these architectures, several computationally efficient text detection methods [34, 38, 41, 49] have been proposed.

Fig. 5 Architecture of the R-CNN series. a Fast R-CNN [39], b Faster R-CNN [76]

5.1.4 You Only Look Once (YOLO)

YOLO [75] is a single convolutional network that simultaneously predicts multiple bounding boxes and class probabilities for those boxes (see Fig. 6). Since YOLO treats object detection as a single regression problem, it is extremely fast compared with R-CNN based systems. However, it may achieve poor precision when localizing small objects, so it cannot be applied directly to text detection. Inspired by YOLO, Gupta et al. [35] proposed a fully-convolutional regression network (FCRN), which effectively and efficiently detects text in scene images.

Fig. 6 Architecture of YOLO [75]

5.1.5 Single Shot Detector (SSD)

SSD [29] defines a set of default boxes over the output space of bounding boxes, and simultaneously predicts the shape offsets and the confidences for all object categories (see Fig. 7). In SSD, predictions from multiple feature maps with different resolutions are combined, so that, compared to YOLO, SSD can effectively deal with objects of various sizes. Moreover, SSD eliminates proposal generation and feature resampling, which distinguishes it from R-CNN based networks. Since SSD integrates the advantages of YOLO and Fast/Faster R-CNN, many methods [37, 42, 43, 45, 48] extend this architecture for text detection with specific modifications, such as designing default boxes with larger aspect ratios or multiple orientations, and adopting inception-style convolutional filters.
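
The sketch below illustrates how such text-oriented default boxes might be generated on a single square feature map; the aspect-ratio set follows the long ratios used by TextBoxes [37], while the grid layout and scale are illustrative assumptions:

```python
import numpy as np

def text_default_boxes(fm_size, img_size, scale,
                       aspect_ratios=(1, 2, 3, 5, 7, 10)):
    """Generate SSD-style default boxes (cx, cy, w, h) in pixels on one
    fm_size x fm_size feature map, with long aspect ratios for words."""
    boxes = []
    step = img_size / fm_size               # pixels per feature-map cell
    for i in range(fm_size):
        for j in range(fm_size):
            cx, cy = (j + 0.5) * step, (i + 0.5) * step
            for ar in aspect_ratios:        # box area stays scale**2
                w, h = scale * np.sqrt(ar), scale / np.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return np.array(boxes)

print(text_default_boxes(4, 128, scale=32).shape)  # (96, 4)
```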

Fig. 7 Architecture of SSD [29]

5.2 Loss Function

As in any general machine learning model, a loss function must first be defined in a deep neural network to measure the gap between predictions and actual values; the training algorithm then seeks to minimize this loss. The smaller the loss, the more robust the model. Most works treat text detection as a multi-task learning problem, e.g. classification plus regression. In this section, some commonly used loss functions for text detection are listed and discussed.

5.2.1 Cross-Entropy Loss Function

It is often used in tasks such as pixel/instance classification or segmentation [25, 27, 28, 30, 31, 33, 44, 48], and is defined as follows:

$$L_{ce} = - \frac{1}{N}\sum\limits_{n = 1}^{N} {[y_{n} \log \hat{y}_{n} + (1 - y_{n} )\log (1 - \hat{y}_{n} )]}$$
(1)

where \(y_{n}\) and \(\hat{y}_{n}\) are the actual value and the prediction respectively. Note that if the same weight is put on all positive pixels, performance may suffer for instances with small areas. Therefore, several balanced cross-entropy losses [28, 44] have been introduced to facilitate the training procedure.
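
A sketch of one such class-balanced variant (the weighting follows the spirit of [44]; exact formulations differ between papers):

```python
import torch

def balanced_bce(pred, target, eps=1e-6):
    """Balanced binary cross-entropy over a text/non-text score map:
    text pixels are up-weighted by the background fraction so small
    instances are not swamped by the many negative pixels."""
    pos_frac = target.mean().clamp(eps, 1 - eps)
    w_pos, w_neg = 1 - pos_frac, pos_frac
    loss = -(w_pos * target * torch.log(pred.clamp_min(eps)) +
             w_neg * (1 - target) * torch.log((1 - pred).clamp_min(eps)))
    return loss.mean()

pred = torch.rand(1, 1, 8, 8)
target = (torch.rand(1, 1, 8, 8) > 0.9).float()  # sparse text pixels
print(balanced_bce(pred, target).item())
```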

5.2.2 Softmax Loss Function

It can be found in many general object detection methods, and is defined as follows:

$$L_{sm} = \log \left( {\sum\limits_{j = 0}^{m - 1} {e^{{z_{j} }} } } \right) - z_{y}$$
(2)

where \(z_{j}\) is the jth element of the classification score vector and \(y\) is the classification label, so \(z_{y}\) is the score of the correct class. This function is used in [34, 36,37,38, 41, 43, 45,46,47, 49] as the loss for distinguishing text (y = 1) from non-text (y = 0).

5.2.3 Smooth-L1 Loss Function

It is often used for the bounding box regression task [27, 31, 34, 36,37,38, 41, 43,44,45,46,47,48], and is defined as follows:

$$L_{reg} = \sum\limits_{i \in S} {smooth_{L1} (p_{i} ,p_{i}^{*} )}$$
(3)

in which,

$$smooth_{L1} (x) = \begin{cases} 0.5(\sigma x)^{2} & \text{if } |x| < 1/\sigma^{2} \\ |x| - 0.5/\sigma^{2} & \text{otherwise} \end{cases}$$
(4)

where \(p\) and \(p^{*}\) are the predicted value and the ground truth respectively, and \(x\) represents the error between \(p\) and \(p^{*}\). Note that the derivative of smooth-L1 is also a piecewise function. In [42], Liu et al. instead defined the continuous function

$$smooth_{Ln} (x) = (|x| + 1)\ln (|x| + 1) - |x|$$
(5)

They claim that the smooth-Ln loss achieves a good tradeoff between robustness and stability (see Fig. 8).
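
Both penalties are easy to compare numerically; a short sketch implementing Eqs. (4) and (5):

```python
import numpy as np

def smooth_l1(x, sigma=1.0):
    """Eq. (4): quadratic near zero, linear elsewhere, so large
    regression errors do not produce exploding gradients."""
    a = np.abs(x)
    return np.where(a < 1.0 / sigma ** 2,
                    0.5 * (sigma * a) ** 2,
                    a - 0.5 / sigma ** 2)

def smooth_ln(x):
    """Eq. (5) from [42]: continuous and continuously differentiable."""
    a = np.abs(x)
    return (a + 1.0) * np.log(a + 1.0) - a

errors = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(smooth_l1(errors))
print(smooth_ln(errors))
```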

Fig. 8 Comparison of smooth-L1 and smooth-Ln [42]

5.2.4 Squared Loss Function

It is a conventional loss for regression tasks, defined as follows:

$$L_{squ} = (y - \hat{y})^{2}$$
(6)

where \(y\) and \(\hat{y}\) are the actual value and the prediction respectively. In [26, 35], a bounding box is parameterized by the position of its center, its width, height and orientation angle, and the confidence that the box contains a word. During training, all parameters are optimized by minimizing a multi-part squared loss function.

Many other loss functions are used for scene text detection. For example, the Dice loss [48] is adopted for position-sensitive segmentation, and the IoU loss [44] is applied to regress the four channels of the axis-aligned bounding box, since it is invariant to text scale.

5.3 Multi-orientation Detection

Most previous work focuses on horizontal text detection and achieves quite good performance. However, text in the real world can appear in any orientation, so text orientation needs to be estimated and corrected for the subsequent recognition procedure. Although many studies [81,82,83,84,85,86,87,88,89] have concentrated on multi-oriented scene text detection, the accuracy rates still need improvement. With the launch of ICDAR 2015 Competition Challenge 4, a large number of deep learning based methods have stood out and achieved superior performance over conventional approaches.

In [24], individual characters and their relationships, i.e., linking orientations, are considered, and the corresponding prediction maps are produced by training a network based on holistically-nested edge detection (HED) [77]. Since HED finds edges of different scales and orientations, it can be used for multi-orientation text detection. Similar work can be found in [43], where oriented text is decomposed into segments and links, and the final detections are produced by combining segments connected by links. Since text lines from the same text block often share a roughly uniform spatial layout, a projection-profile based skew estimation algorithm [78] is used in [25] to determine the likely orientation of a text line. In [27, 33], pixel-wise text region masks of arbitrary shape are taken as supervision for training the segmentation network, so as to handle multi-oriented text. In [30], the concept of a "kernel" is introduced, denoting one of multiple predicted segmentation areas of a text instance. The kernels have similar shapes and share the same central point but differ in scale. A progressive scale expansion algorithm grows the kernels from small to large scale to obtain the final detections, making the prediction robust to arbitrary shapes and orientations. In [31], position-sensitive RoI (PSROI) pooling [79] is replaced by deformable PSROI pooling, which enables multi-oriented text detection by adding offsets to the spatial binning positions.

Note that most of the above work includes a segmentation step, which is usually time-consuming. A new trend inspired by general object detection has emerged recently: generating inclined proposals/boxes to roughly recall text, and then applying bounding box regression to finely localize the text region. Text orientation information can be represented in different ways, such as rotation anchors [38], inclined minimum-area rectangles [41] or quadrangles inside horizontal sliding windows [42]. Different from previous text detection methods that rely on shared features for both classification and oriented bounding box regression, active rotating filters (ARF) [80] are used in [45] to extract rotation-sensitive features. Since an ARF convolves a feature map with a canonical filter and its rotated clones, it helps capture rotation-sensitive features. In [48], scene text detection is implemented by localizing the corner points of text bounding boxes and segmenting text regions in relative positions (see Fig. 9). Candidate boxes are generated by grouping corner points according to the scores of the segmentation maps.

Fig. 9 Corner points and position-sensitive maps prediction [48]

5.4 Language Model

A strong language prior, e.g. a probability distribution over character/word sequences, makes a major contribution to final text recognition. Some characters or strings cannot easily be distinguished, such as the digit "0" and the letter "O", or the string "cl" and the letter "d". If a proper language model is adopted to take context information into account, such confusions can largely be eliminated.

Inspired by the successful application of hidden Markov models (HMMs) in speech recognition, a hybrid HMM/Maxout architecture is proposed in [90], which segments words into their corresponding character/inter-character regions by integrating a lexicon. The method is highly accurate as well as fast, since it takes constant time with respect to lexicon size. A conditional random field (CRF) model is adopted to predict character positions in [8, 91, 92]. The CRF is defined over a set of random variables, each denoting a potential character in the word. In order to recognize weak characters or non-dictionary words, however, unary and higher-order terms must be computed for all candidate characters, which is computationally expensive. In [51], beam search based on an n-gram model is used to obtain candidate characters. Besides this language model, a simple dictionary is also maintained to provide a soft scoring signal, and the candidate characters are finally re-ranked using both the language model and a shape model. Similarly, a word is treated as a composition of a bag of n-grams in [57]. To compress the encoding representation, the model selects only a subset of the space of all possible n-grams. Since the n-gram based CNN has a large number of output nodes, e.g. 10k output units for n = 4 (see Fig. 10), it increases the training complexity. Different from the above methods, a recurrent neural network (RNN) is used in [60] to model the character-level statistics of text. In this model, character recognition is considered a task of learning a mapping from pixel intensities to character-level vectors, and n-grams are no longer needed.
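
The toy sketch below shows the basic idea of character n-gram scoring (the five-word corpus, add-one smoothing and vocabulary size are illustrative assumptions): a bigram model prefers the reading "office" over the look-alike "0ffice".

```python
import math
from collections import defaultdict

counts, context = defaultdict(int), defaultdict(int)
for word in ["coffee", "offer", "office", "go", "do"]:
    padded = "^" + word + "$"              # word-boundary markers
    for a, b in zip(padded, padded[1:]):
        counts[(a, b)] += 1
        context[a] += 1

def log_prob(word, vocab=30):
    """Character-bigram log-probability with add-one smoothing."""
    padded = "^" + word + "$"
    return sum(math.log((counts[(a, b)] + 1) / (context[a] + vocab))
               for a, b in zip(padded, padded[1:]))

print(log_prob("office") > log_prob("0ffice"))  # True
```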

Fig. 10 The N-gram encoding model [57]

5.5 Sequence Labeling

As mentioned in Sect. 4.1, many character classification based text recognition methods first detect individual characters in the image and subsequently recognize each character using CNN models. To train a strong character detector, however, a large number of labeled character images is needed, which is unrealistic in most cases. Word classification based methods assign a class label to each word and treat text recognition as an image classification problem. Such methods often train CNN models with a huge number of classes: English has about 90k words, while the number of potential Chinese words may exceed 1 million. Moreover, CNN models often struggle with long words containing many characters. Recently, state-of-the-art methods have come to treat text recognition as a sequence labeling problem. These methods generate an ordered high-level sequence from the input image; they can handle text of arbitrary length, are lexicon-free, and avoid character segmentation. Some key techniques are reviewed below.

5.5.1 Recurrent Neural Network (RNN)

The RNN is an important branch of the deep neural network family, and it does not need the position of each element in a sequence image. In [55, 58, 59], a CNN model is first used to convert the text image into a sequence of features, and the sequential features are then fed to an RNN model that learns context information and generates a predicted sequence. Traditional RNNs struggle to transmit gradient information consistently over long sequences due to the vanishing gradient problem, so the RNN adopted in [55, 58, 59] is the long short-term memory (LSTM) structure. More precisely, two LSTMs, one forward and one backward, are combined into a bidirectional LSTM (see Fig. 11).
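
A compact PyTorch sketch of this CNN + bidirectional LSTM pipeline (layer sizes and the 37-way label set are illustrative assumptions, not the exact CRNN configuration of [55]):

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """CNN encodes a 32-pixel-high word image into a left-to-right frame
    sequence; a bidirectional LSTM emits per-frame label distributions."""
    def __init__(self, num_classes=37):   # 26 letters + 10 digits + blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height to 1
        )
        self.rnn = nn.LSTM(128, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):                            # x: (B, 1, 32, W)
        f = self.cnn(x).squeeze(2).permute(0, 2, 1)  # (B, W/4, 128)
        out, _ = self.rnn(f)                         # (B, W/4, 256)
        return self.fc(out).log_softmax(-1)          # per-frame labels

logits = TinyCRNN()(torch.randn(2, 1, 32, 100))
print(logits.shape)  # torch.Size([2, 25, 37])
```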

Fig. 11 The structure of deep bidirectional LSTM [55]

5.5.2 Connectionist Temporal Classification (CTC)

In the CNN + LSTM model [55, 58], the length of the LSTM output may not be consistent with that of the target string. Therefore, CTC [93] is applied to approximately map the LSTM sequential output onto its target string:

$$S_{w}^{*} \approx \mathcal{B}\left( \arg \max_{\pi } P(\pi \mid p) \right)$$
(7)

where \(\mathcal{B}\) is the projection that removes the repeated labels and the non-character (blank) labels.
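
A greedy best-path sketch of the projection \(\mathcal{B}\) (in practice beam search over \(P(\pi \mid p)\) gives better accuracy):

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Apply B: collapse consecutive repeated labels, then drop blanks."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Per-frame argmax labels for 'hheel-lloo' (0 = blank; h=8, e=5, l=12, o=15)
# decode to 'hello': the blank between the two l's keeps them distinct.
print(ctc_greedy_decode([8, 8, 5, 5, 12, 0, 12, 12, 15, 15]))
# [8, 5, 12, 12, 15]
```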

6 Evaluation and Comparison

Scene text detection and recognition have received increasing attention in computer vision and document analysis, and many approaches have been proposed so far; it is impossible to give a fair evaluation and comparison of all of them. In this section, we first summarize the widely used datasets and protocols for text detection and recognition. After that, we survey the published results of representative methods for comparison.

6.1 Benchmark Datasets

In this section, we describe the widely used benchmark datasets for tasks of text detection and recognition, whose features are summarized in Table 2.

Table 2 Benchmark datasets for text detection and recognition

ICDAR 2003 [94]. It is the first released benchmark for scene text detection and recognition from the ICDAR Robust Reading Competition. There are 258 natural images for training and 251 natural images for testing. All the text instances in this dataset are in English and horizontally placed.

ICDAR 2011 [95]. It inherits from ICDAR 2003 with some modifications. There are 229 natural images for training and 255 natural images for testing.

ICDAR 2013 [96]. It also inherits from ICDAR 2003 with some modifications. There are 229 natural images for training and 233 natural images for testing.

ICDAR 2015 [97]. It comes from the Incidental Scene Text Challenge of the ICDAR 2015 Robust Reading Competition. The dataset includes 1500 natural images in total, acquired with Google Glass. The text instances (annotated by the 4 vertices of a quadrangle) are usually skewed or blurred, since the images were acquired without the user's prior preference or intention.

ICDAR 2017 MLT [98]. It is a large-scale multi-lingual text dataset composed of complete scene images in 9 languages. There are 7200 training images, 1800 validation images and 9000 testing images in this dataset.

MSRA-TD500 [99]. It has 500 high-resolution natural scene images, in which the text instances appear in multiple orientations and the languages include both Chinese and English. There are 300 images for training and 200 images for testing.

COCO-Text [100]. It is the largest benchmark for text detection and recognition so far. The original images come from the Microsoft COCO dataset, and 173,589 text instances from 63,686 images are annotated in COCO-Text. There are 43,686 images for training and 20,000 images for validation/testing.

Street View Text (SVT) [101]. It consists of 350 images annotated with word-level axis-aligned bounding boxes from Google Street View. It contains smaller and lower resolution text, and not all text instances within it are annotated.

RCTW-17 [102]. It contains various kinds of images, including street views, posters, menus, indoor scenes and screenshots, for the competition on reading Chinese text in images. The dataset contains about 8000 training images and 4000 test images, with annotations similar to ICDAR 2015.

IIIT 5K [103]. It contains 5000 cropped word images downloaded from Google image search. There are 2000 images for training and 3000 images for testing. Each image has an associated 50-word lexicon (IIIT5K-50) and a 1k-word lexicon (IIIT5K-1k).

SynthText [104]. It contains 858,750 synthetic images, in which texts with random colors, fonts, scales and orientations are carefully rendered on natural images to have a realistic look. The texts in this dataset are annotated at the character, word and line levels.

Synth90k [105]. It contains about 9 million synthetic cropped word images covering 90k different English words. Similar to SynthText, the synthetic data in Synth90k is highly realistic. There are approximately 8 million images for training and 900k images for testing.

6.2 Evaluation Protocols

In this section, we summarize the evaluation protocols for text detection and recognition. Text detection is commonly evaluated using the ICDAR or DetEval protocol, and text recognition using word recognition accuracy or the end-to-end recognition protocol.

6.2.1 ICDAR Detection Protocol

First, the best match \(m(r,R)\) for a rectangle \(r\) in a set of rectangles \(R\) is defined as

$$m(r,R) = \max \{ m_{p} (r,r') \mid r' \in R \}$$
(8)

where \(m_{p}\) denotes the match between two rectangles of text instances, calculated as the area of intersection divided by the area of the minimum bounding box containing both rectangles. Then, the metrics of precision (\(P\)), recall (\(R\)) and F-measure (\(F\)) are defined as follows:

$$P = \frac{{\sum\nolimits_{{r_{e} \in E}} {m(r_{e} ,T)} }}{|E|}$$
(9)
$$R = \frac{{\sum\nolimits_{{r_{t} \in T}} {m(r_{t} ,E)} }}{|T|}$$
(10)
$$F = \frac{1}{\alpha /P + (1 - \alpha )/R}$$
(11)

where \(T\) and \(E\) are the sets of ground-truth and estimated rectangles respectively, and \(r_{t}\) and \(r_{e}\) are a ground-truth and an estimated rectangle respectively. \(\alpha\) is a weighting parameter, often set to 0.5.
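
A straightforward implementation of Eqs. (8)–(11) for axis-aligned rectangles (a sketch; real evaluation scripts add input-validity checks):

```python
def match_mp(r1, r2):
    """m_p: intersection area over the area of the minimum bounding box
    containing both rectangles; rectangles are (x1, y1, x2, y2)."""
    ix = max(0.0, min(r1[2], r2[2]) - max(r1[0], r2[0]))
    iy = max(0.0, min(r1[3], r2[3]) - max(r1[1], r2[1]))
    bx = max(r1[2], r2[2]) - min(r1[0], r2[0])
    by = max(r1[3], r2[3]) - min(r1[1], r2[1])
    return (ix * iy) / (bx * by) if bx * by > 0 else 0.0

def icdar_prf(estimated, truth, alpha=0.5):
    best = lambda r, rects: max((match_mp(r, s) for s in rects), default=0.0)
    p = sum(best(e, truth) for e in estimated) / max(len(estimated), 1)
    r = sum(best(t, estimated) for t in truth) / max(len(truth), 1)
    f = 1.0 / (alpha / p + (1 - alpha) / r) if p > 0 and r > 0 else 0.0
    return p, r, f

print(icdar_prf([(0, 0, 10, 10)], [(2, 2, 10, 10)]))  # (0.64, 0.64, 0.64)
```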

6.2.2 DetEval Detection Protocol

Since the standard ICDAR detection protocol cannot handle one-to-many and many-to-many matches between the ground truth and the detections, it always underestimates the performance of text detection algorithms. To address this problem, Wolf et al. proposed the DetEval protocol, which combines area overlap with object-level evaluation. In this protocol, the metrics of precision (\(P'\)) and recall (\(R'\)) are defined as follows:

$$P' = \frac{{\sum\nolimits_{i} {Match_{D} (D_{i} ,G,t_{r} ,t_{p} )} }}{|D|}$$
(12)
$$R' = \frac{{\sum\nolimits_{j} {Match_{G} (G_{j} ,D,t_{r} ,t_{p} )} }}{|G|}$$
(13)

where \(Match_{D}\) and \(Match_{G}\) are functions that account for the different types of matches:

$$Match_{D} (D_{i} ,G,t_{r} ,t_{p} ) = \begin{cases} 1 & \text{if } D_{i} \text{ matches against a single ground-truth rectangle} \\ 0 & \text{if } D_{i} \text{ does not match against any ground-truth rectangle} \\ f_{sc} (k) & \text{if } D_{i} \text{ matches against several } (\to k) \text{ ground-truth rectangles} \end{cases}$$
(14)
$$Match_{G} (G_{j} ,D,t_{r} ,t_{p} ) = \begin{cases} 1 & \text{if } G_{j} \text{ matches against a single detected rectangle} \\ 0 & \text{if } G_{j} \text{ does not match against any detected rectangle} \\ f_{sc} (k) & \text{if } G_{j} \text{ matches against several } (\to k) \text{ detected rectangles} \end{cases}$$
(15)

where \(f_{sc} (k)\) is a parameter function that controls the amount of punishment for one-to-many matches; it is often set to 0.8.

6.2.3 Yao’s Detection Protocol

When handling text with arbitrary orientation, the overlap ratio computed in the manner of the standard ICDAR protocol may be inaccurate. Therefore, Yao et al. [81] proposed an evaluation protocol that decides true and false positives based on the overlap ratio between the estimated minimum-area rectangles and the ground truth rectangles. If the included angle between an estimated rectangle and a ground truth rectangle is less than \(\pi /8\) and their overlap ratio exceeds 0.5, the estimated rectangle is considered a correct detection. Multiple detections of the same text line are counted as false positives. The metrics of precision (\(P''\)) and recall (\(R''\)) are thus defined as follows:

$$P'' = |TP|/|E|$$
(16)
$$R'' = |TP|/|T|$$
(17)

where \(TP\) is the set of true positive detections, and \(E\) and \(T\) are the sets of estimated rectangles and ground truth rectangles respectively.
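
A sketch of the counting logic (the overlap/angle computation for rotated rectangles is assumed precomputed, and the extra rule that duplicate detections of one line count as false positives is omitted):

```python
import numpy as np

def yao_prf(num_estimated, num_truth, matched_pairs):
    """matched_pairs holds one (overlap_ratio, angle_diff) tuple per
    estimated rectangle, measured against its best ground-truth match."""
    tp = sum(1 for ov, da in matched_pairs if ov > 0.5 and da < np.pi / 8)
    precision = tp / max(num_estimated, 1)
    recall = tp / max(num_truth, 1)
    return precision, recall

# 3 detections, 4 ground-truth lines; only the first passes both tests.
print(yao_prf(3, 4, [(0.7, 0.10), (0.4, 0.05), (0.8, 0.50)]))
```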

6.2.4 Text Recognition Protocols

Given a cropped word image, word recognition accuracy is a commonly used evaluation metric, defined as the ratio of the number of correctly recognized words to the number of ground truth words. For holistic scene images containing text, there are two evaluation protocols, i.e., word spotting and end-to-end. Word spotting only examines whether the words in the lexicon appear in the input image; it ignores symbols, punctuation, numbers and words shorter than three characters. The end-to-end protocol concerns both detection and recognition results, and requires all words to be recognized precisely, no matter whether the lexicon contains these strings. The F-measure is also adopted by the two protocols.

6.3 Performance Comparison

In this section, we report the experimental results of representative text detection and recognition methods on several public datasets, gathered through a comprehensive literature review. Since different methods conduct experiments on different benchmark datasets, and even on the same dataset they may adopt different training sets (e.g. using a synthetic dataset for pre-training, or using a special data augmentation scheme to enlarge the training set), an absolutely fair comparison is impossible. Nevertheless, we can witness the development of state-of-the-art methods in this field and draw some inspiration.

Tables 3 and 4 report the text detection performance of different methods on eight datasets. As mentioned in Sect. 2, deep learning based methods have recently become the mainstream for text detection, so we only give results for this group of methods. As shown in Table 3, the F-measures on ICDAR 2013 and ICDAR 2015 both now exceed 90%. In particular, the performance on ICDAR 2015 has increased drastically from 54% (Zhang et al. [25]) to 90.5% (Yang et al. [31]) in terms of F-measure. In [31], deformable PSROI pooling is applied to add offsets to the spatial binning positions of PSROI pooling (see Fig. 12), which greatly enhances multi-oriented text detection. As shown in Table 4, the F-measures on the other four datasets also reach unprecedented levels. On the largest dataset, COCO-Text, performance has increased drastically from 33.31% (Yao et al. [24]) to 61% (Liao et al. [45]) in terms of F-measure. In [45], a rotation-sensitive regression network (see Fig. 13) is adopted, which helps achieve better detection results. It can be observed that abundant techniques from general object detection and semantic segmentation have been extended to scene text localization, and the current trend is to apply deep learning frameworks to train end-to-end text detectors.

Table 3 Performance of different text detection methods evaluated on ICDAR datasets
Table 4 Performance of different text detection methods evaluated on other public datasets
Fig. 12 The learned context surrounding the text by deformable PSROI pooling [31]

Fig. 13 Rotation-sensitive regression [45]

Tables 5, 6, 7 and 8 report the text recognition performance of different methods on six commonly used datasets. As shown in Tables 5 and 6, the method of Bai et al. [68] achieves relatively high performance on all ICDAR datasets. In [68], edit probability (EP) is proposed to train an attention based text recognition model. By applying a sequence generation mechanism for lexicon-free prediction, this method can effectively recognize out-of-training-set words, and it obtains the best results on ICDAR 2003 and ICDAR 2013 without a strong or weak lexicon. As shown in Tables 7 and 8, the methods of Liao et al. [70] and Liu et al. [69] achieve state-of-the-art performance. Since TextBoxes++ [70] extends directly from TextBoxes [37], which mainly handles horizontal text, it obtains relatively high F-measures on the ICDAR 2013 and SVT datasets. Note that the performance improvement of TextBoxes++ is particularly significant on SVT, thanks to its training on low-resolution images. In [69], the RoIRotate operator is proposed to connect detection and recognition in a unified network; it applies a transformation to oriented detection bounding boxes to obtain axis-aligned feature maps (see Fig. 14). Such a unified network therefore shows clear advantages on the oriented ICDAR 2015 dataset. Note that there is no general text recognition method yet; each method only performs well on certain datasets. As long as the text regions are properly localized, traditional methods already achieve relatively high cropped-word recognition accuracy. Current methods, however, attempt to construct an end-to-end framework for both text detection and recognition without complicated pre- or post-processing.

Table 5 Cropped word recognition accuracy (%) on ICDAR datasets
Table 6 Cropped word recognition accuracy (%) on other public datasets
Table 7 End-to-end F-measures (%) on ICDAR03, ICDAR11, ICDAR13 and SVT
Table 8 Word spotting and end-to-end F-measures (%) on ICDAR13 and ICDAR15
Fig. 14 Illustration of RoIRotate [69]

7 Conclusions

Scene text detection and recognition have received increasing attention in computer vision due to their potential applications in numerous fields. This paper reviews the detection and recognition methods proposed in the last decade. We comprehensively classify these methods and highlight the key techniques. Furthermore, more than 10 benchmark datasets and the corresponding evaluation protocols are described. Finally, we report the results of more than 40 representative methods and compare their performance. Although great progress has been achieved in text detection and recognition recently, some problems remain to be addressed.

Since most methods focus on English text, there is still ample room for performance improvement on non-Latin or multi-lingual datasets, such as RCTW-17, MSRA-TD500 and ICDAR 2017 MLT. It is potentially feasible to construct a common text detection engine based on character detectors, since the character is the most basic element of all languages. Some weakly supervised scene text detection frameworks [106, 107] have been proposed recently; they can train robust scene text detectors with a small amount of annotated character images, and we consider this line of work worthy of further study. The results on ICDAR 2015 and COCO-Text are also unsatisfactory, which means the problem of incidental and diversified text detection remains to be tackled. Enhancement and rectification methods [22, 31] should be integrated into conventional deep learning models to obtain better performance in future work. Moreover, many existing text recognition methods perform poorly with general lexicons. Schemes applying large-scale language information [108, 109] and sequence learning [55] have been proposed for text recognition and deserve further study.