1 Introduction

Undoubtedly, text is among the most brilliant and influential creations of humankind. As the written form of human languages, text makes it feasible to reliably and effectively spread or acquire information across time and space. In this sense, text constitutes the cornerstone of human civilization.

On the one hand, as a vital tool for communication and collaboration, text has been playing a more important role than ever in modern society; on the other hand, the rich and precise high-level semantics embodied in text can benefit our understanding of the world around us. For example, text information can be used in a wide range of real-world applications, such as image search (Tsai et al. 2011; Schroth et al. 2011), instant translation (Dvorin and Havosha 2009; Parkinson et al. 2016), robot navigation (DeSouza and Kak 2002; Liu and Samarabandu 2005a, b; Schulz et al. 2015), and industrial automation (Ham et al. 1995; He et al. 2005; Chowdhury and Deb 2013). Therefore, automatic text reading from natural environments, as illustrated in Fig. 1, a.k.a. scene text detection and recognition (Zhu et al. 2016) or PhotoOCR (Bissacco et al. 2013), has become an increasingly popular and important research topic in computer vision.

However, despite years of research, a series of grand challenges may still be encountered when detecting and recognizing text in the wild. The difficulties mainly stem from three aspects:

  • Diversity and Variability of Text in Natural Scenes In contrast to scripts in documents, text in natural scenes exhibits much higher diversity and variability. For example, instances of scene text can be in different languages, colors, fonts, sizes, orientations, and shapes. Moreover, the aspect ratios and layouts of scene text may vary significantly. All these variations pose challenges for detection and recognition algorithms designed for text in natural scenes.

  • Complexity and Interference of Backgrounds The backgrounds of natural scenes are virtually unpredictable. There might be patterns extremely similar to text (e.g., tree leaves, traffic signs, bricks, windows, and stockades), or occlusions caused by foreign objects, which may potentially lead to confusion and mistakes.

  • Imperfect Imaging Conditions In uncontrolled circumstances, the quality of text images and videos cannot be guaranteed. That is, under poor imaging conditions, text instances may suffer from low resolution and severe distortion due to inappropriate shooting distance or angle, blur caused by defocus or camera shake, noise due to low light levels, or corruption by highlights or shadows.

These difficulties persisted throughout the years before deep learning showed its potential in computer vision as well as in other fields. After deep learning came to prominence with AlexNet (Krizhevsky et al. 2012) winning the ILSVRC2012 (Russakovsky et al. 2015) contest, researchers turned to deep neural networks for automatic feature learning and started more in-depth studies. The community is now working on ever more challenging targets. The progress made in recent years can be summarized as follows:

Fig. 1

Schematic diagram of scene text detection and recognition. The image sample is from Total-Text (Ch’ng and Chan 2017).

  • Incorporation of Deep Learning Nearly all recent methods are built upon deep learning models. Most importantly, deep learning frees researchers from the exhausting work of repeatedly designing and testing hand-crafted features, giving rise to a blossoming of works that push the envelope further. Specifically, the use of deep learning substantially simplifies the overall pipeline, as illustrated in Fig. 3. Besides, these algorithms provide significant improvements over previous ones on standard benchmarks. Gradient-based training routines also facilitate end-to-end trainable methods.

  • Challenge-Oriented Algorithms and Datasets Researchers are now turning to more specific aspects and challenges. To address difficulties in real-world scenarios, newly published datasets are collected with unique and representative characteristics. For example, there are datasets featuring long text (Tu et al. 2012), blurred text (Karatzas et al. 2015), and curved text (Ch’ng and Chan 2017). Driven by these datasets, almost all algorithms published in recent years are designed to tackle specific challenges. For instance, some are proposed to detect oriented text, while others aim at blurred and unfocused scene images. These ideas are also combined to produce more general-purpose methods.

  • Advances in Auxiliary Technologies Apart from new datasets and models devoted to the main task, auxiliary technologies that do not solve the task directly also find their places in this field, such as synthetic data and bootstrapping.

In this survey, we present an overview of the recent development in deep-learning-based text detection and recognition from still scene images. We review methods from different perspectives and list the up-to-date datasets. We also analyze the status quo and future research trends.

There have already been several excellent review papers (Uchida 2014; Ye and Doermann 2015; Yin et al. 2016; Zhu et al. 2016) that organize and analyze works related to text detection and recognition. However, these papers were published before deep learning came to prominence in this field, and therefore mainly focus on more traditional, feature-based methods. We refer readers to these papers as well for a more comprehensive view and knowledge of the history of this field. This article concentrates on text information extraction from still images rather than videos. For scene text detection and recognition in videos, please refer to Jung et al. (2004) and Yin et al. (2016).

The remaining parts of this paper are arranged as follows: In Sect. 2, we briefly review the methods before the deep learning era. In Sect. 3, we list and summarize algorithms based on deep learning in a hierarchical order. Note that we do not introduce these techniques in a paper-by-paper order, but instead based on a taxonomy of their methodologies. Some papers may appear in several sections if they have contributions to multiple aspects. In Sect. 4, we take a look at the datasets and evaluation protocols. Finally, in Sects. 5 and 6, we present potential applications and our own opinions on the current status and future trends.

2 Methods Before the Deep Learning Era

In this section, we take a brief retrospective glance at algorithms before the deep learning era. More detailed and comprehensive coverage of these works can be found in Uchida (2014), Ye and Doermann (2015), Yin et al. (2016), and Zhu et al. (2016). For both text detection and recognition, the focus was on the design of features.

In this period, most text detection methods either adopt Connected Components Analysis (CCA) (Huang et al. 2013; Neumann and Matas 2010; Epshtein et al. 2010; Tu et al. 2012; Yin et al. 2014; Yi and Tian 2011; Jain and Yu 1998) or Sliding Window (SW) based classification (Lee et al. 2011; Wang et al. 2011; Coates et al. 2011; Wang et al. 2012). CCA-based methods first extract candidate components through a variety of ways (e.g., color clustering or extreme region extraction), and then filter out non-text components using manually designed rules or classifiers automatically trained on hand-crafted features (see Fig. 2). In sliding window classification methods, windows of varying sizes slide over the input image, and each window is classified as containing text or not. Windows classified as positive are further grouped into text regions with morphological operations (Lee et al. 2011), Conditional Random Fields (CRF) (Wang et al. 2011), and other graph-based methods (Coates et al. 2011; Wang et al. 2012).
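The sliding-window branch can be sketched in a few lines. The following is an illustrative toy, not any cited system: the window size, the stride, and the contrast-based stand-in for a trained text/non-text classifier are all assumptions made for demonstration.

```python
import numpy as np

def sliding_windows(image, win=(24, 24), stride=8):
    """Yield (y, x, patch) for every window position over a grayscale image."""
    h, w = image.shape
    wh, ww = win
    for y in range(0, h - wh + 1, stride):
        for x in range(0, w - ww + 1, stride):
            yield y, x, image[y:y + wh, x:x + ww]

def detect_text_windows(image, classifier, win=(24, 24), stride=8):
    """Return top-left corners of windows the classifier labels as text."""
    return [(y, x) for y, x, patch in sliding_windows(image, win, stride)
            if classifier(patch)]

# Toy "classifier": call a patch text if its intensity contrast is high.
high_contrast = lambda p: p.std() > 50

img = np.zeros((48, 64), dtype=np.float64)
img[8:32, 8:32] = 255          # a high-contrast block mimicking a character
hits = detect_text_windows(img, high_contrast)
```

In a real system the positives in `hits` would then be merged into text regions by one of the grouping schemes cited above.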

Fig. 2

Illustration of traditional methods with hand-crafted features: (1) Maximally Stable Extremal Regions (MSER) (Neumann and Matas 2010), assuming chromatic consistency within each character; (2) Stroke Width Transform (SWT) (Epshtein et al. 2010), assuming consistent stroke width within each character

For text recognition, one branch adopted feature-based methods. Shi et al. (2013) and Yao et al. (2014) propose character-segment-based recognition algorithms. Rodriguez-Serrano et al. (2013), Rodriguez-Serrano et al. (2015), Gordo (2015), and Almazán et al. (2014) utilize label embedding to directly perform matching between strings and images. Strokes (Busta et al. 2015) and character key-points (Quy Phan et al. 2013) are also detected as features for classification. Another branch decomposes the recognition process into a series of sub-problems. Various methods have been proposed to tackle these sub-problems, including text binarization (Zhiwei et al. 2010; Mishra et al. 2011; Wakahara and Kita 2011; Lee and Kim 2013), text line segmentation (Ye et al. 2003), character segmentation (Nomura et al. 2005; Shivakumara et al. 2011; Roy et al. 2009), single character recognition (Chen et al. 2004; Sheshadri and Divvala 2012), and word correction (Zhang and Chang 2003; Wachenfeld et al. 2006; Mishra et al. 2012; Karatzas and Antonacopoulos 2004; Weinman et al. 2007).
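As a concrete taste of one such sub-problem, text binarization is classically handled with a global Otsu threshold, which picks the grey-level cut that maximizes between-class variance. The sketch below is a generic Otsu implementation for illustration, not the binarization method of any particular cited paper.

```python
import numpy as np

def otsu_threshold(gray):
    """Global Otsu threshold over an integer grayscale array in [0, 255]."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)  # sum of all pixel intensities
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]                       # pixels at or below t
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        m0 = sum0 / w0                      # mean of the dark class
        m1 = (sum_all - sum0) / (total - w0)  # mean of the bright class
        var = w0 * (total - w0) * (m0 - m1) ** 2  # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# A bimodal toy "text image": dark strokes on a bright background.
img = np.concatenate([np.full(500, 20), np.full(500, 200)])
t = otsu_threshold(img)
```

Pixels at or below `t` would then be treated as stroke candidates in a classic pipeline.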

There have also been efforts devoted to integrated (i.e., end-to-end, as we call it today) systems (Wang et al. 2011; Neumann and Matas 2013). In Wang et al. (2011), characters are considered a special case of object detection: they are detected by a nearest-neighbor classifier trained on HOG features (Dalal and Triggs 2005) and then grouped into words through a Pictorial Structure (PS) based model (Felzenszwalb and Huttenlocher 2005). Neumann and Matas (2013) propose a decision-delay approach that keeps multiple segmentations of each character until the last stage, when the context of each character is known. They detect character segmentations using extremal regions and decode recognition results through a dynamic programming algorithm.

In summary, text detection and recognition methods before the deep learning era mainly extract low-level or mid-level handcrafted image features, which entails demanding and repetitive pre-processing and post-processing steps. Constrained by the limited representation ability of handcrafted features and the complexity of pipelines, those methods can hardly handle intricate circumstances, e.g. blurred images in the ICDAR 2015 dataset (Karatzas et al. 2015).

Fig. 3

Illustrations of representative scene text detection and recognition system pipelines. a Jaderberg et al. (2016) and b Yao et al. (2016) are representative multi-step methods. c and d are simplified pipelines. In c, detectors and recognizers are separate. In d, the detector passes cropped feature maps to the recognizer, which allows end-to-end training

3 Methodology in the Deep Learning Era

As implied by the title of this section, we would like to address recent advances as changes in methodology rather than merely new methods. Our conclusion is grounded in the observations explained in the following paragraph.

Methods in recent years are characterized by two distinctions: (1) most methods utilize deep-learning-based models; (2) most researchers approach the problem from a diversity of perspectives, trying to solve different challenges. Methods driven by deep learning enjoy the advantage that automatic feature learning saves us from designing and testing large numbers of potential hand-crafted features. At the same time, researchers with different viewpoints are enriching the community with more in-depth work aimed at different targets, e.g. faster and simpler pipelines (Zhou et al. 2017), text of varying aspect ratios (Shi et al. 2017a), and synthetic data (Gupta et al. 2016). As we will see further in this section, the incorporation of deep learning has totally changed the way researchers approach the task and has enlarged the scope of research by far. This is the most significant change compared to the former epoch.

In this section, we classify existing methods into a hierarchical taxonomy and introduce them in a top-down style. First, we divide them into four kinds of systems: (1) detection systems that detect and localize text in natural images; (2) recognition systems that transcribe the content of detected text regions into linguistic symbols; (3) end-to-end systems that perform both detection and recognition in one unified pipeline; (4) auxiliary methods that support the main task of text detection and recognition, e.g. synthetic data generation. Under each category, we review recent methods from different perspectives.

3.1 Detection

We acknowledge that scene text detection can be taxonomically subsumed under general object detection, which is commonly divided into one-stage and two-stage methods. Indeed, many scene text detection algorithms are largely inspired by and follow the designs of general object detectors, so we also encourage readers to refer to recent surveys on object detection methods (Han et al. 2018; Liu et al. 2018a). However, the detection of scene text has a different set of characteristics and challenges that require unique methodologies and solutions. Thus, many methods rely on special representations of scene text to solve these non-trivial problems.

The evolution of scene text detection algorithms, therefore, undergoes three main stages: (1) In the first stage, learning-based methods are equipped with multi-step pipelines, but these methods are still slow and complicated. (2) Then, the idea and methods of general object detection are successfully implanted into this task. (3) In the third stage, researchers design special representations based on sub-text components to solve the challenges of long text and irregular text.

3.1.1 Early Attempts to Utilize Deep Learning

Early deep-learning-based methods (Huang et al. 2014; Tian et al. 2015; Yao et al. 2016; Zhang et al. 2016; He et al. 2017a) cast text detection as a multi-step process. They use convolutional neural networks (CNNs) to predict local segments and then apply heuristic post-processing steps to merge the segments into text lines.

In an early attempt (Huang et al. 2014), CNNs are only used to classify local image patches into text and non-text classes. They propose to mine such image patches using MSER features. Positive patches are then merged into text lines.

Later, CNNs were applied to whole images in a fully convolutional manner. TextFlow (Tian et al. 2015) uses CNNs to detect characters and casts the character grouping task as a min-cost flow problem (Goldberg 1997).

In Yao et al. (2016), a convolutional neural network is used to predict, for each pixel in the input image, (1) whether it belongs to a character, (2) whether it is inside a text region, and (3) the text orientation around the pixel. Connected positive responses are considered detected characters or text regions. For characters belonging to the same text region, Delaunay triangulation (Kang et al. 2014) is applied, after which a graph partition algorithm groups characters into text lines based on the predicted orientation attribute.

Similarly, Zhang et al. (2016) first predict a segmentation map indicating text line regions. For each text line region, MSER (Neumann and Matas 2012) is applied to extract character candidates. Character candidates reveal information on the scale and orientation of the underlying text line. Finally, minimum bounding boxes are extracted as the final text line candidates.

He et al. (2017a) propose a detection process that also consists of several steps. First, text blocks are extracted. Then the model crops and only focuses on the extracted text blocks to extract text center line (TCL), which is defined as a shrunk version of the original text line. Each text line represents the existence of one text instance. The extracted TCL map is then split into several TCLs. Each split TCL is then concatenated to the original image. A semantic segmentation model then classifies each pixel into ones that belong to the same text instance as the given TCL, and ones that do not.

Overall, in this stage, scene text detection algorithms still have long and slow pipelines, though they have replaced some hand-crafted features with learning-based ones. The design methodology is bottom-up and based on key components, such as single characters and text center lines.

3.1.2 Methods Inspired by Object Detection

Later, researchers drew inspiration from the rapidly developing general object detection algorithms (Liu et al. 2016a; Fu et al. 2017; Girshick et al. 2014; Girshick 2015; Ren et al. 2015; He et al. 2017b). In this stage, scene text detection algorithms are designed by modifying the region proposal and bounding box regression modules of general detectors to localize text instances directly (Dai et al. 2017; He et al. 2017c; Jiang et al. 2017; Liao et al. 2017, 2018a; Liu and Jin 2017; Shi et al. 2017a; Liu et al. 2017; Ma et al. 2017; Li et al. 2017b; Liao et al. 2018b; Zhang et al. 2018), as shown in Fig. 4. These detectors mainly consist of stacked convolutional layers that encode the input images into feature maps. Each spatial location on the feature map corresponds to a region of the input image. The feature maps are then fed into a classifier that predicts the existence and localization of text instances at each such spatial location.
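The shared structure of these detectors can be made concrete as a decoding step: each feature-map cell carries a text score plus box offsets relative to a default (anchor) box centered at that cell. The anchor size, stride, and offset parametrization below are illustrative assumptions in the SSD/Faster R-CNN style, not the exact formulation of any single cited paper.

```python
import numpy as np

def decode_boxes(scores, offsets, anchor_wh=(32.0, 32.0), stride=16, thresh=0.5):
    """Decode per-location predictions into boxes.

    scores  : (H, W) text probability at each feature-map cell
    offsets : (H, W, 4) predicted (dx, dy, dw, dh) relative to the anchor
    Returns a list of (cx, cy, w, h) for cells scored as text.
    """
    aw, ah = anchor_wh
    H, W = scores.shape
    boxes = []
    for i in range(H):
        for j in range(W):
            if scores[i, j] < thresh:
                continue
            dx, dy, dw, dh = offsets[i, j]
            cx = (j + 0.5) * stride + dx * aw   # anchor centre + scaled shift
            cy = (i + 0.5) * stride + dy * ah
            w, h = aw * np.exp(dw), ah * np.exp(dh)
            boxes.append((cx, cy, w, h))
    return boxes

# One confident cell at row 1, column 2 with zero offsets.
scores = np.zeros((4, 4)); scores[1, 2] = 0.9
offsets = np.zeros((4, 4, 4))
boxes = decode_boxes(scores, offsets)
```

With zero offsets the decoded box simply sits on the anchor centre, which is the degenerate case; training moves the offsets to fit the ground truth.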

Fig. 4

High-level illustration of methods inspired by general object detection: a Similar to YOLO (Redmon et al. 2016), regressing offsets based on default bounding boxes at each anchor position. b Variants of SSD (Liu et al. 2016a), predicting at feature maps of different scales. c Predicting at each anchor position and regressing the bounding box directly. d Two-staged methods with an extra stage to correct the initial regression results

These methods greatly simplify the pipeline into an end-to-end trainable neural network, making training much easier and inference much faster. We introduce the most representative works here.

Inspired by one-stage object detectors, TextBoxes (Liao et al. 2017) adapts SSD (Liu et al. 2016a) to fit the varying orientations and aspect ratios of text by defining default boxes as quadrilaterals with different aspect-ratio specifications.
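The key adaptation here is the shape of the default boxes. A minimal sketch of SSD-style default-box generation with elongated ratios might look as follows; the scale, stride, and exact ratio set are illustrative assumptions rather than TextBoxes' reference configuration.

```python
def default_boxes(fmap_h, fmap_w, stride, scale=32.0,
                  aspect_ratios=(1, 2, 3, 5, 7, 10)):
    """Generate SSD-style default boxes (cx, cy, w, h); the long aspect
    ratios fit wide word-shaped instances that square boxes miss."""
    boxes = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for ar in aspect_ratios:
                # Keep the box area roughly constant while stretching it.
                w = scale * (ar ** 0.5)
                h = scale / (ar ** 0.5)
                boxes.append((cx, cy, w, h))
    return boxes

boxes = default_boxes(2, 3, 16)
```

Each cell thus proposes six candidate shapes, from square up to a 10:1 strip.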

EAST (Zhou et al. 2017) further simplifies anchor-based detection by adopting the U-shaped design (Ronneberger et al. 2015) to integrate features from different levels. Input images are encoded as one multi-channeled feature map, instead of multiple layers of different spatial sizes as in SSD. The feature at each spatial location is used to regress the rectangular or quadrilateral bounding box of the underlying text instance directly. Specifically, the existence of text, i.e. text/non-text, and geometries, e.g. orientation and size for rectangles, and vertex coordinates for quadrilaterals, are predicted. EAST makes a difference to the field of text detection with its highly simplified pipeline and its efficiency, performing inference at real-time speed.
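This per-pixel geometry can be made concrete with a small decoder: given one pixel's predicted distances to the four box edges and a rotation angle, recover the four corners. This is a sketch of an RBOX-style parametrization as we understand it, not EAST's reference implementation.

```python
import numpy as np

def east_decode_pixel(x, y, dists, angle=0.0):
    """Recover a rotated rectangle from one pixel's EAST-style geometry.

    dists = (d_top, d_right, d_bottom, d_left): distances from (x, y)
    to the four box edges, measured in the box's rotated frame.
    Returns the four corner points; axis-aligned when angle == 0.
    """
    dt, dr, db, dl = dists
    # Corners in the frame where the box is axis-aligned around (x, y).
    local = np.array([(-dl, -dt), (dr, -dt), (dr, db), (-dl, db)], float)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([(c, -s), (s, c)])
    return local @ rot.T + np.array([x, y])

corners = east_decode_pixel(10, 10, (2, 3, 4, 5))
```

Every positive pixel votes for one such box; the votes are then merged (EAST uses locality-aware NMS for this step).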

Other methods adapt the two-staged object detection framework of R-CNN (Girshick et al. 2014; Girshick 2015; Ren et al. 2015), where the second stage corrects the localization results based on features obtained by Region of Interest (RoI) pooling.

In Ma et al. (2017), rotation region proposal networks are adapted to generate rotating region proposals, in order to fit text of arbitrary orientations instead of axis-aligned rectangles.

In FEN (Zhang et al. 2018), the weighted sum of RoI poolings with different sizes is used. The final prediction is made by leveraging the textness score for poolings of 4 different sizes.

Zhang et al. (2019) propose to apply the RoI pooling and localization branch recursively to revise the predicted position of the text instance. This is an effective way to include features at the boundaries of bounding boxes, which localizes text better than region proposal networks (RPNs) alone.

Wang et al. (2018) propose a parametrized Instance Transformation Network (ITN) that learns to predict an appropriate affine transformation to apply to the last feature layer extracted by the base network, in order to rectify oriented text instances. Their method, with ITN, can be trained end-to-end.

To adapt to irregularly shaped text, bounding polygons (Liu et al. 2017) with as many as 14 vertexes are proposed, followed by a Bi-LSTM (Hochreiter and Schmidhuber 1997) layer to refine the coordinates of the predicted vertexes.

In a similar way, Wang et al. (2019b) propose to use recurrent neural networks (RNNs) to read the features encoded by RPN-based two-stage object detectors and predict bounding polygons of variable length. The method requires no post-processing or complex intermediate steps and achieves a much faster speed of 10.0 FPS on Total-Text.

The main contributions in this stage are the simplification of the detection pipeline and the resulting improvement in efficiency. However, one-stage methods still perform poorly on curved, oriented, or long text due to the limitation of the receptive field, while two-stage methods remain limited in efficiency.

3.1.3 Methods Based on Sub-text Components

The main distinction between text and general objects is that text is homogeneous as a whole and characterized by its locality: any part of a text instance is still text. Humans do not need to see the whole text instance to know that a part of it belongs to some text.

Fig. 5

Illustration of representative methods based on sub-text components: a SegLink (Shi et al. 2017a): with SSD as base network, predict word segments at each anchor position, and connections between adjacent anchors. b PixelLink (Deng et al. 2018): for each pixel, predict text/non-text classification and whether it belongs to the same text as adjacent pixels or not. c Corner Localization (Lyu et al. 2018b): predict the four corners of each text and group those belonging to the same text instances. d TextSnake (Long et al. 2018): predict text/non-text and local geometries, which are used to reconstruct text instance

Such a property lays the cornerstone for a new branch of text detection methods that first predict sub-text components and then assemble them into text instances. By their nature, these methods can better adapt to the aforementioned challenges of curved, long, and oriented text. As illustrated in Fig. 5, they use neural networks to predict local attributes or segments, plus a post-processing step to reconstruct text instances. Compared with early multi-stage methods, they rely more on neural networks and have shorter pipelines.

In pixel-level methods (Deng et al. 2018; Wu and Natarajan 2017), an end-to-end fully convolutional neural network learns to generate a dense prediction map indicating whether each pixel in the original image belongs to any text instance. Post-processing methods then group together the pixels that belong to the same text instance. Basically, these methods can be seen as a special case of instance segmentation (He et al. 2017b). Since text can appear in clusters, which makes the predicted pixels connected to each other, the core of pixel-level methods is to separate text instances from each other.
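The grouping step common to these methods is essentially connected-component labeling on the predicted mask. Below is a minimal 4-connectivity flood fill standing in for whatever linking rule a given paper actually uses; it is a generic sketch, not any cited method's post-processing.

```python
from collections import deque

def connected_components(mask):
    """Group positive pixels of a binary map into text-instance candidates
    via 4-connectivity flood fill. Returns a list of pixel lists."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            comp, q = [], deque([(sy, sx)])
            seen[sy][sx] = True
            while q:
                y, x = q.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        q.append((ny, nx))
            comps.append(comp)
    return comps

# Two separated blobs should become two instance candidates.
mask = [[1, 1, 0, 0, 1],
        [1, 0, 0, 0, 1],
        [0, 0, 0, 0, 0]]
comps = connected_components(mask)
```

The weakness discussed above is visible here: if two text instances touch by even one pixel, plain connectivity merges them, which is exactly what link predictions or border classes are meant to prevent.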

PixelLink (Deng et al. 2018) learns to predict whether two adjacent pixels belong to the same text instance by adding extra output channels to indicate links between adjacent pixels.

The border learning method (Wu and Natarajan 2017) classifies each pixel into three categories, text, border, and background, assuming that the border can well separate text instances.

In Wang et al. (2017), pixels are clustered according to their color consistency and edge information. The fused image segments, called superpixels, are further used to extract characters and predict text instances.

On top of the segmentation framework, Tian et al. (2019) propose to add a loss term that maximizes the Euclidean distances between pixel embedding vectors belonging to different text instances and minimizes the distances between those belonging to the same instance, in order to better separate adjacent text instances.

Wang et al. (2019a) propose to predict text regions at different shrinkage scales and enlarge the detected text region round by round until it collides with other instances. However, the prediction at different scales is itself a variation of the aforementioned border learning (Wu and Natarajan 2017).

Component-level methods usually predict at a medium granularity, where a component refers to a local region of a text instance, sometimes overlapping one or more characters.

The representative component-level method is the Connectionist Text Proposal Network (CTPN) (Tian et al. 2016). CTPN inherits the idea of anchoring and uses a recurrent neural network for sequence labeling, stacking an RNN on top of a CNN. Each position in the final feature map represents the features of the region specified by the corresponding anchor. Assuming that text appears horizontally, each row of features is fed into an RNN and labeled as text/non-text. Geometries such as segment sizes are also predicted. CTPN is the first work to predict and connect segments of scene text with deep neural networks.
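The grouping stage of such component-level detectors can be illustrated with a toy version: chain fixed-width vertical segments into lines while the horizontal gap stays small. The gap threshold and segment format here are assumptions; CTPN's actual linking and side-refinement rules are more involved.

```python
def merge_segments(segments, max_gap=20):
    """Greedy CTPN-style grouping. Each segment is (x, y_top, y_bottom);
    segments sorted by x are chained into a line while consecutive
    horizontal gaps stay within max_gap. Returns lists of segments."""
    lines, current = [], []
    for seg in sorted(segments):
        if current and seg[0] - current[-1][0] > max_gap:
            lines.append(current)   # gap too large: start a new text line
            current = []
        current.append(seg)
    if current:
        lines.append(current)
    return lines

# Three close segments form one line; two distant ones form another.
segs = [(0, 5, 20), (16, 5, 20), (32, 6, 21), (100, 5, 20), (116, 4, 19)]
lines = merge_segments(segs)
```

A real implementation would also check vertical overlap between neighbours before linking them, which this sketch omits.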

SegLink (Shi et al. 2017a) extends CTPN by considering the multi-oriented linkage between segments. The detection of segments is based on SSD (Liu et al. 2016a), where each default box represents a text segment. Links between default boxes are predicted to indicate whether the adjacent segments belong to the same text instance. Zhang et al. (2020) further improve SegLink by using a Graph Convolutional Network (Kipf and Welling 2016) to predict the linkage between segments.

The corner localization method (Lyu et al. 2018b) proposes to detect the four corners of each text instance. Since each text instance has only 4 corners, the prediction results and their relative positions can indicate which corners should be grouped into the same text instance.

Fig. 6

ac Representing text as horizontal rectangles, oriented rectangles, and quadrilaterals. d The sliding-disk representation proposed in TextSnake (Long et al. 2018)

Long et al. (2018) argue that text can be represented as a series of sliding round disks along the text center line (TCL), which is in accord with the running direction of the text instance, as shown in Fig. 6. With this novel representation, they present a new model, TextSnake, which learns to predict local attributes, including TCL/non-TCL, text-region/non-text-region, radius, and orientation. The intersection of TCL pixels and text region pixels gives the final prediction of pixel-level TCL. Local geometries are then used to extract the TCL in the form of an ordered point list. With the TCL and radii, the text line is reconstructed. TextSnake achieves state-of-the-art performance on several curved text datasets as well as more widely used ones, e.g. ICDAR 2015 (Karatzas et al. 2015) and MSRA-TD500 (Tu et al. 2012). Notably, Long et al. propose a cross-validation test across different datasets, where models are fine-tuned only on datasets with straight text instances and tested on the curved datasets. On all existing curved datasets, TextSnake achieves improvements of up to 20% over other baselines in F1-score.
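The reconstruction step of this representation can be sketched as follows: offset each center point perpendicular to the local running direction by the predicted radius, walk one side forward and the other backward, and close the polygon. This is our simplified reading of the idea, not the paper's exact post-processing.

```python
import numpy as np

def reconstruct_polygon(centers, radii):
    """Rebuild a text-region polygon from TextSnake-style local geometry:
    ordered centre-line points and per-point disk radii."""
    centers = np.asarray(centers, float)
    # Local running direction via finite differences along the centre line.
    dirs = np.gradient(centers, axis=0)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    normals = np.stack([-dirs[:, 1], dirs[:, 0]], axis=1)  # 90-degree turn
    r = np.asarray(radii, float)[:, None]
    top = centers + normals * r
    bottom = centers - normals * r
    return np.concatenate([top, bottom[::-1]])  # closed polygon, 2N points

# A horizontal centre line of three disks with radius 5.
poly = reconstruct_polygon([(0, 0), (10, 0), (20, 0)], [5, 5, 5])
```

For curved center lines the per-point normals rotate with the text, which is what lets this representation follow arbitrary shapes.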

Character-level representation is yet another effective approach. Baek et al. (2019b) propose to learn a segmentation map of character centers and the links between them. Both components and links are predicted in the form of Gaussian heat maps. However, this method requires iterative weak supervision, as real-world datasets are rarely equipped with character-level labels.

Overall, detection based on sub-text components enjoys better flexibility and generalization over the shapes and aspect ratios of text instances. The main drawback is that the module or post-processing step used to group segments into text instances may be vulnerable to noise, and the efficiency of this step is highly dependent on the actual implementation and may therefore vary among platforms.

Fig. 7

Frameworks of text recognition models. a Sequence labeling model, which uses CTC for alignment in training and inference. b Sequence-to-sequence model, which can be trained directly with cross-entropy. c Segmentation-based methods

3.2 Recognition

In this section, we introduce methods for scene text recognition. The input of these methods is cropped text instance images containing only one word.

In the deep learning era, scene text recognition models use CNNs to encode images into feature spaces. The main difference lies in the text content decoding module. Two major techniques are the Connectionist Temporal Classification (Graves et al. 2006) (CTC) and the encoder–decoder framework (Sutskever et al. 2014). We introduce recognition methods in the literature based on the main technique they employ. Mainstream frameworks are illustrated in Fig. 7.

Both CTC and encoder–decoder frameworks were originally designed for 1-dimensional sequential input data, and are therefore applicable to the recognition of straight, horizontal text, which can be encoded into a sequence of feature frames by CNNs without losing important information. However, characters in oriented and curved text are distributed over a 2-dimensional space. It remains a challenge to effectively represent oriented and curved text in feature spaces in order to fit the CTC and encoder–decoder frameworks, whose decoders require 1-dimensional inputs. For oriented and curved text, directly compressing the features into a 1-dimensional form may lose relevant information and introduce background noise, leading to inferior recognition accuracy. We introduce techniques that address this challenge below.
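The compression discussed above can be illustrated directly: for straight horizontal text, a 2-D feature map is collapsed into a width-long frame sequence by pooling over the height axis. The pooling choice below is an assumption for illustration; actual recognizers typically reach height 1 via strided convolutions and poolings inside the backbone instead.

```python
import numpy as np

def to_frame_sequence(feature_map, pool="max"):
    """Collapse a (C, H, W) feature map into a (W, C) sequence of frames
    by pooling over the height axis. Adequate for horizontal text, but
    it mixes background into the frames when text is oriented or curved."""
    if pool == "max":
        return feature_map.max(axis=1).T   # one frame per column
    return feature_map.mean(axis=1).T

fm = np.arange(4 * 8 * 25, dtype=float).reshape(4, 8, 25)
seq = to_frame_sequence(fm)                # 25 frames of 4 channels each
```

Each resulting frame is what the CTC or encoder–decoder module consumes as one time step.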

3.2.1 CTC-Based Methods

The CTC decoding module is adopted from speech recognition, where data are sequential in the time domain. To apply CTC in scene text recognition, the input images are viewed as a sequence of vertical pixel frames. The network outputs a per-frame prediction, indicating the probability distribution of label types for each frame. The CTC rule is then applied to collapse the per-frame predictions into a text string. During training, the loss is computed as the sum of the negative log probabilities of all possible per-frame predictions that can generate the target sequence under the CTC rules. Therefore, CTC makes the model end-to-end trainable with only word-level annotations, without the need for character-level annotations. The first application of CTC in the OCR domain can be traced to the handwriting recognition system of Graves et al. (2008). Now this technique is widely adopted in scene text recognition (Su and Lu 2014; He et al. 2016; Liu et al. 2016b; Gao et al. 2017; Shi et al. 2017b; Yin et al. 2017).
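The inference side of CTC is simple to illustrate with greedy decoding: take the arg-max label of every frame, collapse consecutive repeats, then drop blanks. (The training-time loss, which sums over all valid alignments via dynamic programming, is more involved and omitted here.)

```python
def ctc_greedy_decode(frame_probs, blank=0):
    """Greedy CTC decoding over a list of per-frame label distributions."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], None
    for label in best:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Per-frame distributions whose arg-max labels read 1,1,blank,1,2,2,blank.
frames = [[0.1, 0.8, 0.1],
          [0.1, 0.8, 0.1],
          [0.8, 0.1, 0.1],
          [0.1, 0.8, 0.1],
          [0.1, 0.1, 0.8],
          [0.1, 0.1, 0.8],
          [0.8, 0.1, 0.1]]
decoded = ctc_greedy_decode(frames)
```

Note how the blank between the first and second `1` preserves the repeated character, which naive repeat-collapsing alone would destroy.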

The first attempts can be referred to as convolutional recurrent neural networks (CRNN). These models are composed of RNNs stacked on top of CNNs and use CTC for training and inference. DTRN (He et al. 2016) is the first CRNN model. It slides a CNN model across the input images to generate convolutional feature slices, which are then fed into RNNs. Shi et al. (2017b) further improve DTRN by adopting a fully convolutional approach that encodes the input images as a whole to generate feature slices, utilizing the property that CNNs are not restricted by the spatial sizes of inputs.
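The number of feature frames the RNN receives is determined by the horizontal downsampling of the convolutional stack. A small helper can trace that width; the layer configuration below is a simplified stand-in, not CRNN's exact architecture (which uses 1×2-stride poolings to preserve more horizontal resolution).

```python
def output_width(w, layers):
    """Trace feature-map width through (kernel, stride, padding) layers;
    the final width is the number of frames the RNN receives."""
    for k, s, p in layers:
        w = (w + 2 * p - k) // s + 1
    return w

# A hypothetical CRNN-like stack: two 3x3 same-padded convs,
# each followed by a 2x2 pooling that halves the width.
crnn_like = [(3, 1, 1), (2, 2, 0), (3, 1, 1), (2, 2, 0)]
```

A 100-pixel-wide crop thus yields 25 frames under this stack, so roughly four pixels of width feed each CTC time step.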

Instead of RNNs, Gao et al. (2017) adopt stacked convolutional layers to effectively capture the contextual dependencies of the input sequence, characterized by lower computational complexity and easier parallel computation.

Yin et al. (2017) simultaneously detect and recognize characters by sliding the text line image with character models, which are learned end-to-end on text line images labeled with text transcripts.

3.2.2 Encoder–Decoder Methods

The encoder–decoder framework for sequence-to-sequence learning was originally proposed by Sutskever et al. (2014) for machine translation. The encoder RNN reads an input sequence and passes its final latent state to a decoder RNN, which generates output auto-regressively. The main advantage of the encoder–decoder framework is that it produces outputs of variable length, which matches the task setting of scene text recognition. The framework is usually combined with the attention mechanism (Bahdanau et al. 2014), which jointly learns to align the input and output sequences.
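The alignment step can be illustrated with a minimal sketch of Bahdanau-style additive attention (all array shapes and parameter names here are hypothetical):

```python
# Minimal sketch of additive attention: at each decoding step, attention
# scores align the decoder state with every encoder frame, and the context
# vector is their softmax-weighted sum.
import numpy as np

def additive_attention(enc, dec_state, W_enc, W_dec, v):
    """enc: (T, H) encoder features; dec_state: (H,) current decoder state."""
    # score_t = v . tanh(W_enc @ enc_t + W_dec @ dec_state)
    scores = np.tanh(enc @ W_enc.T + dec_state @ W_dec.T) @ v  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                   # softmax over frames
    context = weights @ enc                                    # (H,) weighted sum
    return context, weights

rng = np.random.default_rng(0)
T, H = 5, 8
enc = rng.normal(size=(T, H))
context, w = additive_attention(enc, rng.normal(size=H),
                                rng.normal(size=(H, H)),
                                rng.normal(size=(H, H)),
                                rng.normal(size=H))
```

The decoder consumes the context vector together with the previously emitted character, so the output length is not tied to the number of encoder frames.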

Lee and Osindero (2016) present recursive recurrent neural networks with attention modeling for lexicon-free scene text recognition. The model first passes input images through recursive convolutional layers to extract encoded image features, and then decodes them into output characters with recurrent neural networks that implicitly learn character-level language statistics. The attention mechanism performs soft feature selection for better use of image features.

Cheng et al. (2017a) observe the attention drift problem in existing attention-based methods and propose to impose localization supervision on the attention scores to attenuate it.

Bai et al. (2018) propose an edit probability (EP) metric to handle the misalignment between the ground-truth string and the attention output sequence of probability distributions. Unlike the aforementioned attention-based methods, which usually employ a frame-wise maximal likelihood loss, EP estimates the probability of generating a string from the output sequence of probability distributions conditioned on the input image, while accounting for possible missing or superfluous characters.

Liu et al. (2018d) propose an efficient attention-based encoder–decoder model, in which the encoder part is trained under binary constraints to reduce computation cost.

Both CTC and the encoder–decoder framework simplify the recognition pipeline and make it possible to train scene text recognizers with only word-level annotations instead of character-level annotations. Compared with CTC, the decoder module of the encoder–decoder framework acts as an implicit language model and can therefore incorporate more linguistic priors. For the same reason, the encoder–decoder framework requires a larger training dataset with a larger vocabulary; otherwise, the model may degenerate when reading words unseen during training. In contrast, CTC depends less on language models and yields a better character-to-pixel alignment, so it is potentially better suited to languages with large character sets, such as Chinese and Japanese. The main drawback of both methods is that they assume the text to be straight and therefore cannot adapt to irregular text.

3.2.3 Adaptions for Irregular Text Recognition

Rectification modules are a popular solution to irregular text recognition. Shi et al. (2016, 2018) propose a text recognition system that combines a Spatial Transformer Network (STN) (Jaderberg et al. 2015) and an attention-based sequence recognition network. The STN module predicts text bounding polygons with fully connected layers, which parameterize Thin-Plate-Spline transformations that rectify the input irregular text image into a more canonical form, i.e. straight text. Rectification proves to be a successful strategy and forms the basis of the winning solution (Long et al. 2019) in the ICDAR 2019 ArT irregular text recognition competition.
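The sampling step behind such rectification can be sketched as follows. For simplicity, this sketch uses an affine warp rather than the Thin-Plate-Spline transform of Shi et al., and all function names are ours:

```python
# Sketch of the differentiable sampling behind an STN rectifier (simplified
# to an affine warp): an output grid is mapped through the predicted
# transform into input coordinates, then sampled bilinearly.
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample img (H, W) at float coordinates with bilinear interpolation."""
    H, W = img.shape
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    dx, dy = xs - x0, ys - y0
    return (img[y0, x0] * (1 - dx) * (1 - dy)
            + img[y0, x0 + 1] * dx * (1 - dy)
            + img[y0 + 1, x0] * (1 - dx) * dy
            + img[y0 + 1, x0 + 1] * dx * dy)

def affine_rectify(img, theta, out_h, out_w):
    """Warp img with a 2x3 affine matrix, as a localization net would predict."""
    ys, xs = np.mgrid[0:out_h, 0:out_w].astype(float)
    # normalize the output grid to [-1, 1], map it through theta
    gx = 2 * xs / (out_w - 1) - 1
    gy = 2 * ys / (out_h - 1) - 1
    src_x = theta[0, 0] * gx + theta[0, 1] * gy + theta[0, 2]
    src_y = theta[1, 0] * gx + theta[1, 1] * gy + theta[1, 2]
    # back to input pixel coordinates
    src_x = (src_x + 1) * (img.shape[1] - 1) / 2
    src_y = (src_y + 1) * (img.shape[0] - 1) / 2
    return bilinear_sample(img, src_x, src_y)
```

Because every step is differentiable with respect to `theta`, the localization network can be trained jointly with the recognizer, without rectification supervision.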

There have also been several improved versions of rectification-based recognition. Zhan and Lu (2019) propose to perform rectification multiple times so that the text is rectified gradually. They also replace text bounding polygons with a polynomial function to represent the shape. Yang et al. (2019) propose to predict local attributes, such as radius and orientation values for pixels inside the text center region, in a way similar to TextSnake (Long et al. 2018). Here the orientation is defined by the underlying character boxes rather than the text bounding polygons. Based on these attributes, bounding polygons are reconstructed such that the perspective distortion of characters is rectified, whereas the methods of Shi et al. and Zhan and Lu may only rectify at the text level and leave individual characters distorted.

Yang et al. (2017) introduce an auxiliary dense character detection task to encourage the learning of visual representations favorable to text patterns. They adopt an alignment loss to regularize the estimated attention at each time step, and further use a coordinate map as a second input to enforce spatial awareness.

Cheng et al. (2017b) argue that encoding a text image as a 1-D sequence of features as implemented in most methods is not sufficient. They encode an input image to four feature sequences of four directions: horizontal, reversed horizontal, vertical, and reversed vertical. A weighting mechanism is applied to combine the four feature sequences.

Liu et al. (2018b) present a hierarchical attention mechanism (HAM) which consists of a recurrent RoI-Warp layer and a character-level attention layer. They adopt a local transformation to model the distortion of individual characters, resulting in improved efficiency and the ability to handle types of distortion that are hard to model with a single global transformation.

Liao et al. (2019b) cast the task of recognition into semantic segmentation, and treat each character type as one class. The method is insensitive to shapes and is thus effective on irregular text, but the lack of end-to-end training and sequence learning makes it prone to single-character errors, especially when the image quality is low. They are also the first to evaluate the robustness of their recognition method by padding and transforming test images.

Another solution to irregular scene text recognition is 2-dimensional attention (Xu et al. 2015), whose effectiveness has been verified by Li et al. (2019). Different from the sequential encoder–decoder framework, the 2D attentional model maintains 2-dimensional encoded features, and attention scores are computed over all spatial locations. In a similar spirit, Long et al. (2020) propose to first detect characters; features are then interpolated and gathered along the character center lines to form sequential feature frames.

In addition to the aforementioned techniques, Qin et al. (2019) show that simply flattening the feature maps from 2 dimensions to 1 and feeding the resulting sequential features to an RNN-based attentional encoder–decoder model is sufficient to produce state-of-the-art recognition results on irregular text, a simple yet efficient solution.

Apart from tailored model designs, Long et al. (2019) synthesize a curved text dataset, which significantly boosts recognition performance on real-world curved text datasets without sacrificing accuracy on straight text.

Although many elegant and neat solutions have been proposed, they are only evaluated and compared on a relatively small dataset, CUTE80, which consists of only 288 word samples. Besides, the training datasets used in these works contain only a negligible proportion of irregular text samples. Evaluations on larger datasets and more suitable training datasets may help us understand these methods better.

3.2.4 Other Methods

Jaderberg et al. (2014a, 2014b) perform word recognition by classifying the image into a pre-defined vocabulary, under the framework of image classification. The model is trained on synthetic images and achieves state-of-the-art performance on several benchmarks containing English words only. However, the applicability of this method is quite limited, as it cannot recognize unseen sequences such as phone numbers and email addresses.

To improve performance on difficult cases, such as occlusions that make single-character recognition ambiguous, Yu et al. (2020) propose a transformer-based semantic reasoning module that translates the coarse, error-prone text outputs of the decoder into refined, linguistically calibrated outputs. This bears some resemblance to the deliberation networks for machine translation (Xia et al. 2017), which first translate and then rewrite sentences.

Despite the progress we have seen so far, the evaluation of recognition methods lags behind. As most detection methods can detect oriented and irregular text, and some even rectify it, the separate treatment of such text in recognition may seem redundant. On the other hand, the robustness of recognition to slightly different cropping bounding boxes is seldom verified, although such robustness may be more important in real-world scenarios.

3.3 End-to-End System

In the past, text detection and recognition were usually cast as two independent sub-problems that are combined to read text from images. Recently, many end-to-end text detection and recognition systems (also known as text spotting systems) have been proposed, benefiting greatly from the idea of designing differentiable computation graphs, as shown in Fig. 8. Efforts to build such systems have gained considerable momentum as a new trend.

Two-Step Pipelines While earlier works (Wang et al. 2011, 2012) first detect single characters in the input image, recent systems usually detect and recognize text at the word or line level. Some of these systems first generate text proposals with a text detection model and then recognize them with a separate text recognition model (Jaderberg et al. 2016; Liao et al. 2017; Gupta et al. 2016). Jaderberg et al. (2016) use a combination of Edge Box proposals (Zitnick and Dollár 2014) and a trained aggregate channel features detector (Dollár et al. 2014) to generate candidate word bounding boxes. Proposal boxes are filtered and rectified before being fed into the recognition model proposed in Jaderberg et al. (2014b). Liao et al. (2017) combine an SSD-based (Liu et al. 2016a) text detector and CRNN (Shi et al. 2017b) to spot text in images.

In these methods, detected words are cropped from the image, so detection and recognition are two separate steps. One major drawback of two-step methods is that error propagation between the detection and recognition models leads to less satisfactory performance.

Two-Stage Pipelines Recently, end-to-end trainable networks have been proposed to tackle this problem (Bartz et al. 2017; Busta et al. 2017; Li et al. 2017a; He et al. 2018; Liu et al. 2018c), where feature maps, instead of images, are cropped and fed to the recognition modules.

Bartz et al. (2017) present a solution that utilizes an STN (Jaderberg et al. 2015) to circularly attend to each word in the input image, and then recognize the words separately. The unified network is trained in a weakly supervised manner, without word bounding box labels. Li et al. (2017a) substitute the object classification module in Faster R-CNN (Ren et al. 2015) with an encoder–decoder based text recognition model to build their text spotting system. Liu et al. (2018c), Busta et al. (2017) and He et al. (2018) develop unified text detection and recognition systems with very similar overall architectures, consisting of a detection branch and a recognition branch. Liu et al. (2018c) and Busta et al. (2017) adopt EAST (Zhou et al. 2017) and YOLOv2 (Redmon and Farhadi 2017) as their detection branches, respectively, and share a similar recognition branch, in which text proposals are pooled into fixed-height tensors by bilinear sampling and then transcribed into strings by a CTC-based recognition module. He et al. (2018) also adopt EAST (Zhou et al. 2017) to generate text proposals, and introduce character spatial information as explicit supervision in the attention-based recognition branch. Lyu et al. (2018a) propose a modification of Mask R-CNN: for each region of interest, character segmentation maps are produced, indicating the existence and location of individual characters, and a post-processing step that orders these characters from left to right gives the final result. In contrast to the aforementioned works, which perform RoI pooling based on oriented bounding boxes, Qin et al. (2019) propose to use axis-aligned bounding boxes and mask the cropped features with a 0/1 textness segmentation mask (He et al. 2017b).

Fig. 8

Illustration of mainstream end-to-end frameworks. a In SEE (Bartz et al. 2017), the detection results are represented as grid matrices. Image regions are cropped and transformed before being fed into the recognition branch. b Some methods crop from the feature maps and feed the crops to the recognition branch. c While a and b utilize CTC-based and attention-based recognition branches, it is also possible to retrieve each character as a generic object and compose the text

One-Stage Pipeline In addition to two-stage methods, Xing et al. (2019) predict character and text bounding boxes as well as character type segmentation maps in parallel. The text bounding boxes are then used to group character boxes into the final word transcription results. This is the first one-stage method.

3.4 Auxiliary Techniques

Recent advances are not limited to detection and recognition models that aim to solve the tasks directly. We should also give credit to auxiliary techniques that have played an important role.

3.4.1 Synthetic Data

Most deep learning models are data-hungry; their performance is guaranteed only when enough data are available. In the field of text detection and recognition, this problem is more pressing, since most human-labeled datasets are small, usually containing merely 1K–2K instances. Fortunately, there has been work (Jaderberg et al. 2014b; Gupta et al. 2016; Zhan et al. 2018; Liao et al. 2019a) that generates synthetic data of relatively high quality, and such data have been widely used for pre-training models for better performance.

Jaderberg et al. (2014b) propose to generate synthetic data for text recognition. Their method blends text with randomly cropped natural images from human-labeled datasets after rendering of font, border/shadow, color, and distortion. The results show that training merely on these synthetic data achieves state-of-the-art performance, and that synthetic data can serve as an augmentative data source for all datasets.

SynthText (Gupta et al. 2016) is the first to propose embedding text in natural scene images for the training of text detection, while most previous work only printed text on cropped regions, producing synthetic data suited only for text recognition. Printing text on whole natural images poses new challenges, as semantic coherence must be maintained. To produce more realistic data, SynthText makes use of depth prediction (Liu et al. 2015) and semantic segmentation (Arbelaez et al. 2011). Semantic segmentation groups pixels into semantic clusters, and each text instance is printed on a single semantic surface rather than overlapping several. A dense depth map is further used to determine the orientation and distortion of each text instance. A model trained only on SynthText achieves state-of-the-art results on many text detection datasets; the dataset is also used in other works (Zhou et al. 2017; Shi et al. 2017a) for initial pre-training.

Further, Zhan et al. (2018) combine text synthesis with other deep learning techniques to produce more realistic samples. They introduce selective semantic segmentation so that word instances only appear on sensible objects, e.g. a desk or wall instead of someone's face. Text rendering in their work is adapted to the image so that the text fits the artistic style and does not stand out awkwardly.

SynthText3D (Liao et al. 2019a) uses the famous open-source game engine Unreal Engine 4 (UE4), together with UnrealCV (Qiu et al. 2017), to synthesize scene text images. Text is rendered together with the scene and can thus exhibit different lighting conditions, weather, and natural occlusions. However, SynthText3D simply follows the pipeline of SynthText and only makes use of the ground-truth depth and segmentation maps provided by the game engine. As a result, SynthText3D relies on manual selection of camera views, which limits its scalability. Besides, the proposed text regions are generated by clipping maximal rectangular bounding boxes extracted from segmentation maps, and are therefore limited to the middle parts of large and well-defined regions, an unfavorable location bias.

UnrealText (Long and Yao 2020) is another work using game engines to synthesize scene text images. It features deep interactions with the 3D worlds during synthesis. A ray-casting based algorithm is proposed to navigate in the 3D worlds efficiently and is able to generate diverse camera views automatically. The text region proposal module is based on collision detection and can put text onto the whole surfaces, thus getting rid of the location bias. UnrealText achieves significant speedup and better detector performances.

Text Editing It is also worthwhile to mention the recently proposed text editing task (Wu et al. 2019; Yang et al. 2020). Both works try to replace the text content while retaining the text styles in natural images, such as the spatial arrangement of characters, text fonts, and colors. Text editing per se is useful in applications such as instant translation using cellphone cameras. It also has great potential for augmenting existing scene text images, though we have not seen any relevant experimental results yet.

3.4.2 Weakly and Semi-supervision

Bootstrapping for Character Boxes

Character-level annotations are more accurate and more informative, but most existing datasets do not provide them: since characters are small and close to each other, character-level annotation is costly and inconvenient. There has been some work on semi-supervised character detection. The basic idea is to initialize a character detector and apply rules or thresholds to pick the most reliable predicted candidates. These reliable candidates are then used as additional supervision to refine the character detector. All these methods aim to augment existing datasets with character-level annotations; their differences are illustrated in Fig. 9.

WordSup (Hu et al. 2017) first initializes the character detector by training 5K warm-up iterations on synthetic datasets. For each image, WordSup generates character candidates, which are then filtered with word-boxes. For characters in each word box, the following score is computed to select the most possible character list:

$$\begin{aligned} \small s = w\cdot \frac{area(B_{chars})}{area(B_{word})} + (1-w)\cdot (1-\frac{\lambda _2}{\lambda _1}) \end{aligned}$$
(1)

where \(B_{chars}\) is the union of the selected character boxes; \(B_{word}\) is the enclosing word bounding box; \(\lambda _1\) and \(\lambda _2\) are the largest and second-largest eigenvalues of the covariance matrix C, computed from the coordinates of the centers of the selected character boxes; and w is a scalar weight. Intuitively, the first term measures how completely the selected characters cover the word box, while the second term measures whether the selected characters lie on a straight line, the main characteristic of word instances in most datasets.
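A direct transcription of Eq. (1) might look as follows. This is a hypothetical helper: for simplicity, the union area of the character boxes is approximated by their summed areas, assuming no overlap.

```python
# Sketch of the WordSup character-selection score of Eq. (1).
# Boxes are axis-aligned (x1, y1, x2, y2) tuples.
import numpy as np

def wordsup_score(char_boxes, word_box, w=0.5):
    # coverage term: area of selected character boxes over word-box area
    # (summed areas approximate the union, assuming boxes do not overlap)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    coverage = sum(area(b) for b in char_boxes) / area(word_box)
    # straightness term: eigenvalues of the covariance of box centers
    centers = np.array([[(b[0] + b[2]) / 2, (b[1] + b[3]) / 2]
                        for b in char_boxes])
    lam = np.sort(np.linalg.eigvalsh(np.cov(centers.T)))[::-1]  # lam1 >= lam2
    straightness = 1 - lam[1] / lam[0]
    return w * coverage + (1 - w) * straightness
```

For three unit-width characters evenly spaced on a horizontal line inside a 5-unit word box, the coverage term is 0.6 and the straightness term is exactly 1, since all centers are collinear.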

Fig. 9

Overview of semi-supervised and weakly-supervised methods. Existing methods differ in how filtering is done. a WeText (Tian et al. 2017), mainly by thresholding the confidence level and filtering with word-level annotations. b Scoring-based methods, including WordSup (Hu et al. 2017), which assumes that text instances are straight lines and uses an eigenvalue-based metric to measure straightness. c Grouping characters into words using ground-truth word bounding boxes and comparing character counts (Baek et al. 2019b; Xing et al. 2019).

WeText (Tian et al. 2017) starts with a small dataset annotated at the character level. It follows two bootstrapping paradigms: semi-supervised learning and weakly supervised learning. In the semi-supervised setting, detected character candidates are filtered with a high confidence threshold. In the weakly supervised setting, ground-truth word boxes are used to mask out false positives outside them. New instances detected in either way are added to the initial small dataset, and the model is re-trained.

In Baek et al. (2019b) and Xing et al. (2019), the character candidates are filtered with the help of word-level annotations. For each word instance, if the number of detected character bounding boxes inside the word bounding box equals the length of the ground-truth word, the character bounding boxes are regarded as correct.
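This counting-based filter can be sketched as follows (hypothetical helper functions; boxes are axis-aligned (x1, y1, x2, y2) tuples):

```python
# Sketch of the counting-based filter: predicted character boxes whose centers
# fall inside a ground-truth word box are accepted only if their number
# matches the length of the annotated word.

def center_inside(char_box, word_box):
    cx = (char_box[0] + char_box[2]) / 2
    cy = (char_box[1] + char_box[3]) / 2
    return word_box[0] <= cx <= word_box[2] and word_box[1] <= cy <= word_box[3]

def accept_characters(char_boxes, word_box, word_text):
    inside = [b for b in char_boxes if center_inside(b, word_box)]
    # accept the group only when the count matches the word length
    return inside if len(inside) == len(word_text) else []
```

Accepted groups become character-level pseudo-labels for the next training round; mismatched groups are simply discarded.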

Partial Annotations To improve the recognition performance of end-to-end word spotting models on curved text, Qin et al. (2019) propose to use off-the-shelf straight scene text spotting models to annotate a large number of unlabeled images. These images are called partially labeled, since the off-the-shelf models may omit some words. The partially annotated straight text proves to greatly boost performance on irregular text.

Another similar effort is the large dataset proposed by Sun et al. (2019), where each image is annotated with only one dominant text instance. They also design an algorithm to utilize these partially labeled data, which they claim are cheaper to annotate.

Fig. 10

Selected samples from Chars74K, SVT-P, IIIT5K, MSRA-TD 500, ICDAR 2013, ICDAR 2015, ICDAR 2017 MLT, ICDAR 2017 RCTW, and Total-Text

4 Benchmark Datasets and Evaluation Protocols

As cutting-edge algorithms achieve better performance on existing datasets, researchers are able to tackle more challenging aspects of the problem. New datasets aimed at different real-world challenges have been, and are being, crafted, further benefiting the development of detection and recognition methods.

In this section, we list and briefly introduce the existing datasets and the corresponding evaluation protocols. We also identify current state-of-the-art approaches to the widely used datasets when applicable.

Table 1 Public datasets for scene text detection and recognition

4.1 Benchmark Datasets

We collect existing datasets and summarize their statistics in Table 1. Representative image samples from some of the datasets are shown in Fig. 10. Links to these datasets are also collected in our GitHub repository mentioned in the abstract, for readers' convenience. In this section, we select some representative datasets and discuss their characteristics.

The ICDAR 2015 incidental text dataset focuses on small and oriented text. The images were taken by Google Glass without deliberate control of image quality. A large proportion of the text instances are very small, blurred, occluded, and multi-oriented, making the dataset very challenging.

The ICDAR MLT 2017 and 2019 datasets contain scripts of 9 and 10 languages respectively. They are the only multilingual datasets so far.

Total-Text has a large proportion of curved text, while previous datasets contained only a few such instances. Its images are mainly taken from street billboards and are annotated as polygons with a variable number of vertices.

The Chinese Text in the Wild (CTW) dataset (Yuan et al. 2018) contains 32,285 high-resolution street view images annotated at the character level, including the underlying character type, bounding box, and detailed attributes such as whether word-art is used. The dataset is the largest to date and the only one with such detailed annotations. However, it only provides annotations for Chinese text and ignores other scripts, e.g. English.

LSVT (Sun et al. 2019) is composed of two datasets. One is fully labeled with word bounding boxes and word content. The other, while much larger, is only annotated with the word content of the dominant text instance. The authors propose to work on such partially labeled data that are much cheaper.

IIIT 5K-Word (Mishra et al. 2012) is the largest scene text recognition dataset, containing both digital and natural scene images. Its variation in fonts, colors, sizes, and other forms of noise makes it the most challenging dataset to date.

4.2 Evaluation Protocols

In this part, we briefly summarize the evaluation protocols for text detection and recognition.

As metrics for performance comparison of different algorithms, we usually refer to precision, recall, and F1-score. To compute these indicators, the list of predicted text instances must first be matched to the ground-truth labels. Precision, denoted P, is the proportion of predicted text instances that can be matched to ground-truth labels. Recall, denoted R, is the proportion of ground-truth labels that have correspondents in the predicted list. The F1-score is then computed as \(F_1=\frac{2*P*R}{P+R}\), taking both precision and recall into account.

4.2.1 Text Detection

There are mainly two protocols for text detection: the IoU-based PASCAL evaluation and the overlap-based DetEval. They differ in the criterion for matching predicted and ground-truth text instances. In the following, we use these notations: \(S_{GT}\) is the area of the ground-truth bounding box, \(S_{P}\) the area of the predicted bounding box, \(S_{I}\) the area of their intersection, and \(S_U\) the area of their union.

  • DetEval DetEval imposes constraints on both precision, i.e. \(\frac{S_{I}}{S_{P}}\) and recall, i.e. \(\frac{S_{I}}{S_{GT}}\). Only when both are larger than their respective thresholds, are they matched together.

  • PASCAL (Everingham et al. 2015) The basic idea is that, if the intersection-over-union value, i.e. \(\frac{S_{I}}{S_U}\), is larger than a designated threshold, the predicted and ground-truth boxes are matched together.
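The IoU criterion and the resulting precision/recall/F1 computation can be sketched as follows. This is a simplified, greedy one-to-one matcher over axis-aligned boxes; the official evaluation scripts additionally handle polygons and do-not-care regions.

```python
# Sketch of PASCAL-style evaluation: a prediction matches a ground-truth box
# when IoU exceeds the threshold; precision, recall, and F1 follow from a
# greedy one-to-one matching. Boxes are (x1, y1, x2, y2) tuples.

def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def prf(preds, gts, thresh=0.5):
    matched_gt, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched_gt and iou(p, g) >= thresh:
                matched_gt.add(i)
                tp += 1
                break
    P = tp / len(preds) if preds else 0.0
    R = tp / len(gts) if gts else 0.0
    F = 2 * P * R / (P + R) if P + R > 0 else 0.0
    return P, R, F
```

DetEval differs only in the matching test: it would replace the single IoU threshold with separate thresholds on \(\frac{S_I}{S_P}\) and \(\frac{S_I}{S_{GT}}\).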

Most works follow either one of the two evaluation protocols, but with small modifications. We only discuss those that are different from the two protocols mentioned above.

  • ICDAR-2003/2005 The match score m is calculated similarly to IoU: it is defined as the ratio of the intersection area to the area of the minimum rectangular bounding box containing both.

  • ICDAR-2011/2013 One major drawback of the ICDAR-2003/2005 protocol is that it only considers one-to-one matches, ignoring one-to-many, many-to-many, and many-to-one matches, which underestimates the actual performance. Therefore, ICDAR-2011/2013 follows the method proposed by Wolf and Jolion (2006), where one-to-one matches score 1, while the other two types are penalized with a constant score less than 1, usually set to 0.8.

  • MSRA-TD 500 (Tu et al. 2012) propose a new evaluation protocol for rotated bounding boxes, in which both the predicted and ground-truth boxes are rotated to the horizontal around their centers. They are matched only when the standard IoU score is higher than the threshold and the rotation difference of the original bounding boxes is less than a pre-defined value (in practice \(\frac{\pi }{4}\)).

  • TIoU (Liu et al. 2019) Tightness-IoU takes into account the fact that scene text recognition is sensitive to missing and superfluous parts in detection results: not-retrieved areas lead to missing characters in recognition results, and redundant areas lead to unexpected characters. The proposed metric penalizes IoU by scaling it down by the proportion of missing area and the proportion of superfluous area that overlaps other text.

The main drawback of existing evaluation protocols is that they only consider the best F1 scores under confidence thresholds selected arbitrarily on the test sets. Qin et al. (2019) also evaluate their method with the average precision (AP) metric that is widely adopted in general object detection. While an F1 score is only a single point on the precision-recall curve, AP considers the whole curve. Therefore, AP is a more comprehensive metric, and we urge researchers in this field to use AP instead of F1 alone.

4.2.2 Text Recognition and End-to-End System

In scene text recognition, the predicted text string is compared to the ground truth directly. Performance is evaluated either at the character level (how many characters are recognized correctly) or at the word level (whether the predicted word is exactly the same as the ground truth). ICDAR also introduces an edit-distance based evaluation.
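The edit-distance metric is the standard Levenshtein distance, which can be computed with a small dynamic program:

```python
# Levenshtein distance between the predicted string and the ground truth:
# the minimum number of insertions, deletions, and substitutions needed to
# turn one into the other, computed with a single rolling DP row.

def edit_distance(pred, gt):
    dp = list(range(len(gt) + 1))
    for i, p in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, g in enumerate(gt, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete p
                                     dp[j - 1] + 1,    # insert g
                                     prev + (p != g))  # substitute / match
    return dp[-1]
```

Word-level accuracy corresponds to requiring `edit_distance(pred, gt) == 0`, while edit-distance based protocols give partial credit to near-miss predictions.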

Table 2 Detection on ICDAR 2013

In end-to-end evaluation, matching is first performed in a similar way to that of text detection, and then the text content is compared.

The most widely used datasets for end-to-end systems are ICDAR 2013 (Karatzas et al. 2013) and ICDAR 2015 (Karatzas et al. 2015). Evaluation on these two datasets is carried out under two different settings: the Word Spotting setting and the End-to-End setting. Under Word Spotting, the evaluation only considers the text instances from the scene image that appear in a pre-designated vocabulary, while other text instances are ignored. By contrast, all text instances that appear in the scene image are included under End-to-End. Three vocabulary lists are provided for candidate transcriptions: Strongly Contextualised, Weakly Contextualised, and Generic. Their characteristics are summarized in Table 8. Note that under End-to-End, these vocabularies can still serve as references.

Evaluation results of recent methods on several widely adopted benchmark datasets are summarized in the following tables: Table 2 for detection on ICDAR 2013, Table 4 for detection on ICDAR 2015 Incidental Text, Table 3 for detection on ICDAR 2017 MLT, Table 5 for detection and end-to-end word spotting on Total-Text, Table 6 for detection on CTW1500, Table 7 for detection on MSRA-TD 500, Table 9 for recognition on several datasets, and Table 10 for end-to-end text spotting on ICDAR 2013 and ICDAR 2015. Note that we do not report performance under multi-scale conditions when single-scale performance is reported. We use \(*\) to indicate methods for which only multi-scale performance is reported. Since different backbone feature extractors are used in some works, we only report performance based on ResNet-50, unless it is not provided. For better illustration, we plot the recent progress of detection performance in Fig. 11, and of recognition performance in Fig. 12.

Note that current evaluation practices for scene text recognition may be problematic. According to Baek et al. (2019a), most researchers actually use different subsets when they refer to the same dataset, causing discrepancies in performance. Besides, Long and Yao (2020) further point out that half of the widely adopted benchmark datasets have imperfect annotations, e.g. ignoring case sensitivity and punctuation, and provide new annotations for those datasets. Although most papers claim to train their models to recognize case-sensitively and to include punctuation, they may limit their outputs to only digits and case-insensitive characters during evaluation.
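The effect of such output filtering can be illustrated with a tiny, hypothetical normalization function:

```python
# Illustrative sketch of the evaluation discrepancy described above:
# restricting outputs to lower-cased alphanumeric characters before
# comparison makes case- and punctuation-sensitive errors invisible.

def normalize(text):
    return "".join(c for c in text.lower() if c.isalnum())

# e.g. "IJCV-2020!" and "ijcv2020" become indistinguishable after filtering
```

Under such filtering, a recognizer that drops punctuation and casing entirely would still score perfectly on case-sensitive annotations.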

Table 3 Detection on ICDAR MLT 2017
Table 4 Detection on ICDAR 2015
Table 5 Detection and end-to-end on total-text
Table 6 Detection on CTW1500
Table 7 Detection on MSRA-TD 500
Table 8 Characteristics of the three vocabulary lists used in ICDAR 2013/2015

5 Application

The detection and recognition of text, the visual and physical carrier of human civilization, further connect vision with the understanding of its content. Apart from the applications mentioned at the beginning of this paper, there are numerous specific application scenarios across various industries and in our daily lives. In this part, we list and analyze the most prominent ones that have, or will have, a significant impact on our productivity and quality of life.

Automatic Data Entry Apart from digitizing existing documents, OCR can also improve our productivity in the form of automatic data entry. Some industries involve time-consuming manual data entry, e.g. express orders written by customers in the delivery industry, and handwritten information sheets in the financial and insurance industries. Applying OCR techniques can accelerate the data entry process as well as protect customer privacy. Some companies have already been using these technologies, e.g. SF-Express.Footnote 3 Another potential application is note taking, such as NEBO,Footnote 4 a note-taking application on tablets such as the iPad that performs instant transcription as users write down notes.

Identity Authentication Automatic identity authentication is yet another field where OCR can be put to full use. In fields such as Internet finance and customs, users or passengers are required to provide identification (ID) information, such as an identity card or a passport. OCR that reads and extracts the textual content of the provided documents can automate and greatly accelerate such processes. Some companies have already started working on identification based on faces and ID cards, e.g. MEGVII (Face++).Footnote 5

Augmented Computer Vision As text is an essential element for scene understanding, OCR can assist computer vision in many ways. In the scenario of autonomous vehicles, text-embedded panels carry important information, e.g. geo-location, current traffic conditions, and navigation. There have been several works on text detection and recognition for autonomous vehicles (Mammeri et al. 2014, 2016). The largest dataset so far, CTW (Yuan et al. 2018), also places extra emphasis on traffic signs. Another example is instant translation, where OCR is combined with a translation model. This is extremely helpful and time-saving when people travel or read documents written in foreign languages. Google's Translate applicationFootnote 6 can perform such instant translation. A similar application is instant text-to-speech software equipped with OCR, which can help those with visual impairments and those who are illiterate.Footnote 7

Intelligent Content Analysis OCR also allows the industry to perform more intelligent content analysis, mainly for platforms such as video-sharing websites and e-commerce. Text can be extracted from images and subtitles as well as real-time commentary (floating comments added by users, e.g. those on BilibiliFootnote 8 and NiconicoFootnote 9). On the one hand, such extracted text can be used in automatic content tagging and recommendation systems, as well as in user sentiment analysis, e.g. determining which parts of a video attract users most. On the other hand, website administrators can monitor and filter inappropriate and illegal content, such as terrorist advocacy.

Table 9 State-of-the-art recognition performance across a number of datasets
Table 10 Performance of end-to-end and word spotting on ICDAR 2015 and ICDAR 2013

6 Conclusion and Discussion

6.1 Status Quo

Algorithms The past several years have witnessed significant development of algorithms for text detection and recognition, mainly driven by the deep learning boom. Deep learning models have replaced the manual search for and design of patterns and features. With the improved capability of models, research attention has shifted to challenges such as oriented and curved text detection, and considerable progress has been made.

Applications Apart from efforts towards a general solution for all sorts of images, these algorithms can be trained and adapted to more specific scenarios, e.g. bankcards, ID cards, and driver's licenses. Some companies have been providing such scenario-specific APIs, including Baidu Inc., Tencent Inc., and MEGVII Inc. Recent development of fast and efficient methods (Ren et al. 2015; Zhou et al. 2017) has also allowed the deployment of large-scale systems (Borisyuk et al. 2018). Companies including Google Inc. and Amazon Inc. are also providing text extraction APIs.

Fig. 11 Progress of scene text detection over the past few years (evaluated as F1 scores)

Fig. 12 Progress of scene text recognition over the past few years (evaluated as word-level accuracy)

6.2 Challenges and Future Trends

We look at the present through a rear-view mirror; we march backward into the future (Liu 1975). Below, we list and discuss the remaining challenges, and analyze promising research directions in the field of scene text detection and recognition.

Languages There are more than 1000 languages in the world.Footnote 10 However, most current algorithms and datasets focus primarily on English text. While English has a rather small alphabet, other languages such as Chinese and Japanese have much larger ones, with tens of thousands of symbols. RNN-based recognizers may suffer from such enlarged symbol sets. Moreover, some languages have much more complex appearances and are therefore more sensitive to conditions such as image quality. Researchers should first verify how well current algorithms generalize to text in other languages, and further to mixed text. Unified detection and recognition systems for multiple languages have great academic value and application prospects. A feasible solution might be to explore compositional representations that capture the common patterns of text instances across different languages, and to train detection and recognition models on text examples in different languages generated by text-synthesizing engines.

Robustness of Models Although current text recognizers have proven able to generalize well to different scene text datasets, even when trained only on synthetic data, recent work (Liao et al. 2019b) shows that robustness against flawed detection is not a negligible problem. Such instability in prediction has also been observed for text detection models. The reason behind this phenomenon is still unclear. One conjecture is that the robustness of models is related to the internal operating mechanisms of deep neural networks.

Generalization Few detection algorithms except TextSnake (Long et al. 2018) have considered the problem of cross-dataset generalization, i.e. training on one dataset and testing on another. Generalization ability is important, as some application scenarios require adaptability to varying environments. For example, instant translation and OCR in autonomous vehicles should perform stably under different conditions: zoomed-in images with large text instances, far and small words, blurred words, different languages, and varying shapes. It remains unverified whether simply pooling all existing datasets together is enough, especially when the target domain is totally unknown.

Evaluation Existing evaluation metrics for detection stem from those for general object detection. Matching based on IoU scores or pixel-level precision and recall ignores the fact that missing parts and superfluous background may hurt the performance of the subsequent recognition procedure. For each text instance, pixel-level precision and recall are good metrics. However, under current protocols an instance's score is set to 1.0 once it is matched to a ground-truth instance, so the tightness of the match is not reflected in the final dataset-level score. An off-the-shelf alternative is to simply sum up the instance-level scores under DetEval instead of first binarizing them to 1.0.
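The difference between binarized and summed instance-level scores can be illustrated with a simplified sketch (axis-aligned boxes and recall only, with our own function names; actual protocols such as DetEval use area recall and precision over possibly rotated regions):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0


def dataset_recall(gts, dets, thresh=0.5, binarize=True):
    """Dataset-level recall over ground-truth boxes.

    binarize=True: a matched instance contributes 1.0 regardless of how
    tight the match is (the common protocol). binarize=False: it
    contributes its IoU, so loose detections that include superfluous
    background are penalized rather than counted as perfect.
    """
    total = 0.0
    for gt in gts:
        best = max((iou(gt, d) for d in dets), default=0.0)
        if best >= thresh:
            total += 1.0 if binarize else best
    return total / len(gts) if gts else 0.0
```

For a ground truth `(0, 0, 10, 10)` and a loose detection `(0, 0, 10, 14)` with IoU of roughly 0.71, the binarized protocol reports a recall of 1.0 while the summed variant reports about 0.71, directly reflecting the superfluous background that would burden a downstream recognizer.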

Synthetic Data While training recognizers on synthetic datasets has become a routine and results are excellent, detectors still rely heavily on real datasets. It remains a challenge to synthesize diverse and realistic images to train detectors. Potential benefits of synthetic data are not yet fully explored, such as generalization ability. Synthesis using 3D engines and models can simulate different conditions such as lighting and occlusion, and thus is worth further development.

Efficiency Another shortcoming of deep-learning-based methods lies in their efficiency. Most current systems cannot run in real time when deployed on computers without GPUs or on mobile devices. Apart from model compression and lightweight models that have proven effective in other tasks, it is also valuable to study custom speed-up mechanisms for text-related tasks.

Bigger and Better Datasets The sizes of most widely adopted datasets are small (\(\sim \) 1k images). It is worth studying whether the improvements gained by current algorithms scale up, or whether they are merely accidental results of better regularization. Besides, most datasets are only labeled with bounding boxes and transcriptions. Detailed annotation of different attributes (Yuan et al. 2018), such as word art and occlusion, could provide researchers with more targeted guidance. Finally, datasets characterized by real-world challenges, such as densely located text on products, are also important for advancing research. Another related problem is that most existing datasets do not have validation sets; it is highly possible that currently reported evaluation results are upward biased due to overfitting on the test sets. We suggest that researchers focus on large datasets, such as ICDAR MLT 2017, ICDAR MLT 2019, ICDAR ArT 2019, and COCO-Text.