1 Introduction

Scene-text recognition, or Photo-Optical Character Recognition (Photo-OCR), aims to read scene-text in natural images. It is an essential step for a wide variety of computer vision tasks and has enjoyed significant success in several commercial applications [9]. Photo-OCR has diverse applications, such as helping the visually impaired and mining street-view-like images for the information used in map services and geographic information systems [2]. Scene-text recognition conventionally involves two steps: i) text detection and ii) text recognition. Text detection typically consists of detecting bounding boxes of word images [4]. The text recognition stage involves reading the cropped word images obtained from the text detection stage or from bounding box annotations [13]. In this work, we focus on the task of text recognition.

Fig. 1.

Annotated scene-text images (top) and predictions (bottom) in Gujarati, Hindi, Bangla, Tamil, Telugu, and Malayalam, clockwise from top-left. In the bottom block, row 1 shows the baselines’ predictions and row 2 the transfer-learning models’ predictions. Green, red, and “_” represent correct predictions, errors, and missing characters, respectively. (Color figure online)

The multi-lingual text in scenes is a crucial part of human communication and globalization. Despite the popularity of recognition algorithms, advancements in non-Latin languages have been slow. Reading scene-text in such low-resource languages is a challenging research problem, as the text is generally unstructured and appears in diverse scripts, fonts, sizes, and orientations. Hence, a large amount of data is usually required to train scene-text recognition models. Synthetic datasets are conventionally used to deal with this problem, since a large number of fonts are available in such low-resource languages [13]. Synthetic data may also serve as an exciting asset for controlled experiments, e.g., to study the effect of transfer learning with a change in script or language. In this work, we investigate such effects for transfer from English to two Indian languages, i.e., Hindi and Gujarati. We also explore the transferability of features among six different Indian languages. To demonstrate such effects, we share around 3000 scene-text word images in Gujarati and Tamil obtained from over 440 scenes. In Fig. 1, we illustrate sample annotated images from our datasets and the IIIT-ILST and MLT datasets, along with the predictions of our models. Our overall methodology is as follows. We first generate synthetic datasets in the six Indian languages; we describe the dataset generation process and motivate the work in Sect. 2. We then train the two deep neural networks we introduce in Sect. 3 on the individual language datasets. Subsequently, we apply transfer learning on all the layers of different networks from one language to another. Finally, we fine-tune the networks on standard datasets as discussed in Sect. 4 and examine their performance on real scene-text images in Sect. 5. We conclude the work in Sect. 6. Our contributions are summarized as follows:

  1.

    We investigate the transfer learning of complete scene-text recognition models i) from English to two Indian languages and ii) among the six Indian languages, i.e., Gujarati, Hindi, Bangla, Telugu, Tamil, and Malayalam.

  2.

    We also contribute two datasets: around 500 word images in Gujarati and 2535 word images in Tamil, cropped from a total of 440 Indian scenes.

  3.

    We achieve gains of \(6\%\), \(5\%\), and \(2\%\) in Word Recognition Rates (WRRs) on the IIIT-ILST Hindi, Telugu, and Malayalam datasets in comparison to previous works [13, 20]. On the MLT-19 Hindi and Bangla datasets and our Gujarati and Tamil datasets, we observe WRR gains of \(8\%\), \(4\%\), \(5\%\), and \(3\%\), respectively, over our baseline models.

  4.

    For the MLT-17 Bangla dataset, we show a striking improvement of \(15\%\) in Character Recognition Rate (CRR) and \(24\%\) in WRR compared to Bušta et al. [2], by applying transfer learning from another Indian language and plugging a novel correction RNN layer into our model.

1.1 Related Work

We now discuss datasets and associated works in the field of photo-OCR.

Works of Photo-OCR on Latin Datasets: As stated earlier, Photo-OCR conventionally includes two steps: i) text detection and ii) text recognition. With the success of Convolutional Neural Networks (CNNs) for object detection, such works have been extended to text detection, treating words or lines as the objects [12, 27, 37]. Liao et al. [10] extend these works to real-time detection in scene images. Karatzas et al. [8] and Bušta et al. [1] present more efficient and accurate methods for text detection. Towards reading scene-text, Wang et al. [30] propose an object recognition pipeline based on a ground-truth lexicon, which achieves competitive performance without the need for an explicit text detection step. Shi et al. [21] propose the Convolutional Recurrent Neural Network (CRNN) architecture, which integrates feature extraction, sequence modeling, and transcription into a unified framework and achieves remarkable performance in both lexicon-free and lexicon-based scene-text recognition. Liu et al. [11] introduce the Spatial Attention Residue Network (STAR-Net), with a spatial transformer-based attention mechanism to remove image distortions, residue convolutional blocks for feature extraction, and an RNN block for decoding the text. Shi et al. [22] propose a segmentation-free Attention-based method for Text Recognition (ASTER), adopting Thin-Plate-Spline (TPS) as a rectification unit to tackle complex distortions and reduce the difficulty of irregular text recognition. The model incorporates a ResNet to improve the feature representation module and employs an attention mechanism combined with a Recurrent Neural Network (RNN) as the prediction module. Uber-Text is a large-scale Latin dataset that contains around 117K images captured from 6 US cities, annotated at the line level [36]. The French Street Name Signs (FSNS) dataset contains around 1M annotated images, each with four street-sign views. Such datasets, however, contain text-centric images. Reddy et al. [16] recently released RoadText-1K to introduce the challenges of generic driving scenarios where images are not text-centric. RoadText-1K includes 1000 video clips (each 10 seconds long at 30 fps) from the BDD dataset [32], annotated with English transcriptions.

Works of Photo-OCR on Non-Latin Datasets: Recently, there has been increasing interest in scene-text recognition for non-Latin languages such as Chinese, Korean, Devanagari, and Japanese. Several datasets exist for Chinese, such as RCTW (12k scene images), ReCTS-25k (25k signboard images), CTW (32k scene images), and RRC-LSVT (450k scene images) from the ICDAR’19 Robust Reading Competition (RRC) [23, 25, 33, 35]. Arabic datasets like ARASTEC (260 images of signboards, hoardings, and advertisements) and ALIF (7k text images from TV broadcasts) also exist in the scene-text recognition community [28, 31]. Korean and Japanese scene-text recognition datasets include KAIST (2,385 images of signboards, book covers, and English and Korean characters) and DOST (32k sequential images) [5, 7]. The MLT dataset from the ICDAR’17 RRC contains 18k scene images (around \(1-2k\) images per language) in Arabic, Bangla, Chinese, English, French, German, Italian, Japanese, and Korean [15]. The ICDAR’19 RRC builds MLT-19 on top of MLT-17, with 20k scene images containing text in Arabic, Bangla, Chinese, English, French, German, Italian, Japanese, Korean, and Devanagari [14]. The RRC also provides 277k synthetic images in these languages to assist training. Mathew et al. [13] train a conventional encoder-decoder model on synthetic data for Indian languages, where a Convolutional Neural Network (CNN) encodes the word-image features and an RNN decodes them to produce text; an additional connectionist temporal classification (CTC) layer aligns the RNN’s output to the labels. The work also releases the IIIT-ILST dataset for testing and reports Word Recognition Rates (WRRs) of \(42.9\%\), \(57.2\%\), and \(73.4\%\) on 1K real images in Hindi, Telugu, and Malayalam, respectively. Bušta et al. [2] propose a CNN- and CTC-based method for text localization, script identification, and text recognition, trained and tested on 11 languages of the MLT-17 dataset. The WRRs are above \(65\%\) for Latin and Hangul and below \(47\%\) for the remaining languages; the WRR reported for Bangla is \(34.20\%\). Recently, an OCR-on-the-go model [20] obtains a WRR of \(51.01\%\) on the IIIT-ILST Hindi dataset and a Character Recognition Rate (CRR) of \(35\%\) on a multi-lingual dataset containing 1000 videos in English, Hindi, and Marathi. An additional 2322 videos in these languages, recorded with controlled camera movements like tilt and pan, are shared at https://catalist-2021.github.io/.

Transfer Learning in Photo-OCR: With the advent of deep learning in the last decade, transfer learning has become an essential part of vision models for tasks such as detection and segmentation [17, 18]. CNN layers pre-trained on the ImageNet classification dataset are conventionally used in such models for better initialization and performance [19]. Scene-text recognition works also use CNN layers from models pre-trained on ImageNet [11, 21, 22]. However, to the best of our knowledge, there have been no significant efforts on transfer learning from one language to another in scene-text recognition, although transfer learning seems naturally suited to reading low-resource languages. We investigate the possibilities of transfer learning across all the layers of deep photo-OCR models.

Table 1. Statistics of synthetic data. \(\mu \), \(\sigma \) represent mean, standard deviation.

2 Datasets and Motivation

We now discuss the datasets we use and the motivation for our work.

Synthetic Datasets: As shown in Table 1, we generate 2.5M or more word images each in Gujarati, Hindi, Bangla, Tamil, Telugu, and Malayalam with the methodology proposed by Mathew et al. [13]. For each Indian language, we use 2M images for training our models and the remaining set for testing. Sample images of our synthetic data are shown in Fig. 2. For English, we use the models pre-trained on the 9M MJSynth and 8M SynthText images [3, 6], and generate 0.5M synthetic images in English with over 1200 fonts for testing. As shown in Table 1, English has a lower average word length than the Indian languages. We list the Indian languages in Table 1 in increasing order of language complexity, with visually similar scripts placed consecutively. Gujarati is chosen as the entry point from English to Indian languages as it has the lowest word length among the Indian languages. Additionally, like English, Gujarati does not have the top-connector line that connects different characters to form a word in Hindi and Bangla (refer to Figs. 1 and 2). Also, fewer Unicode fonts are available in Gujarati than in the other Indian languages. Next, we choose Hindi, as Hindi characters are similar to Gujarati characters and the average word length of Hindi is higher than that of Gujarati. Bangla has word-length statistics comparable to Hindi and shares the top-connector-line property with Hindi. Still, we place it after Hindi in the list, as its characters are visually dissimilar to, and more complicated than, those of Gujarati and Hindi. We use fewer than 100 fonts in Hindi, Bangla, and Telugu. We list Tamil after Bangla because these languages share similar-looking vowel glyphs (see the glyphs above the generic characters in Fig. 2). Tamil and Malayalam have the highest variability in word length and visual complexity compared to the other languages. Note that we have over 150 fonts available in Tamil.
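
For concreteness, the sketch below renders word images from a vocabulary with random Unicode fonts, in the spirit of the generation methodology of Mathew et al. [13]; the font directory, corpus file, size ranges, and gray-level choices are illustrative assumptions, not the exact published pipeline.

```python
# Hedged sketch of synthetic word-image rendering (not the exact pipeline of
# Mathew et al. [13]). Pillow >= 9 is assumed for ImageFont.getbbox.
import random
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont

FONT_DIR = Path("fonts/hindi")      # assumed layout: one font directory per language
CORPUS = Path("corpora/hindi.txt")  # assumed: plain-text vocabulary, one word per line

def render_word(word: str, font_path: Path, height: int = 32) -> Image.Image:
    font = ImageFont.truetype(str(font_path), size=random.randint(24, 40))
    left, top, right, bottom = font.getbbox(word)   # tight box around the glyphs
    w, h = right - left, bottom - top
    img = Image.new("L", (w + 20, h + 20), color=random.randint(180, 255))  # light background
    ImageDraw.Draw(img).text((10 - left, 10 - top), word,
                             font=font, fill=random.randint(0, 80))         # dark foreground
    return img.resize((max(int(w * height / max(h, 1)), 1), height))        # fixed height

words = CORPUS.read_text(encoding="utf-8").split()
fonts = list(FONT_DIR.glob("*.ttf"))
for i, word in enumerate(random.sample(words, k=10)):
    render_word(word, random.choice(fonts)).save(f"synth_{i}.png")
```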

Fig. 2.

Clockwise from top-left: synthetic word images in Gujarati, Hindi, Bangla, Tamil, Telugu, & Malayalam. Notice that a top-connector line connects the characters to form a word in Hindi or Bangla. Some vowels and characters appear above and below the generic characters in Indian languages, unlike English.

Fig. 3.

Distribution of character n-grams (\(n\in [1,5]\)) from 2.5M words in English, Gujarati, Hindi, Bangla, and Tamil (top to bottom): top-5 (left) and all (right).

Real Datasets: We also perform experiments on real images from the IIIT-ILST, MLT-17, and MLT-19 datasets (refer to Sect. 1.1). To broaden scene-text recognition research in low-resource Indian languages of varying complexity, we release 500 and 2535 annotated word images in Gujarati and Tamil, respectively. We crop the word images from 440 annotated scene images, which we obtain by capturing scenes and compiling Google images. We illustrate sample annotated images from the different datasets in Fig. 1. As in the MLT datasets, we annotate the Gujarati and Tamil datasets using four corner points around each word (see the Tamil image at the bottom-right of Fig. 1). The IIIT-ILST dataset has two-point annotations, so the background of a cropped word image may contain text from other words, as shown in the Hindi scene at the top-middle of Fig. 1.
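
Given such four-point annotations, cropping a word out of a scene amounts to a perspective warp. The sketch below shows one way to do this with OpenCV; the corner order (top-left, top-right, bottom-right, bottom-left) is our assumption for illustration.

```python
# Sketch: rectify one word region from a scene image using its four corners.
import cv2
import numpy as np

def crop_word(scene: np.ndarray, corners: np.ndarray) -> np.ndarray:
    """corners: 4x2 array of (x, y) points, assumed ordered TL, TR, BR, BL."""
    corners = np.float32(corners)
    w = int(max(np.linalg.norm(corners[0] - corners[1]),
                np.linalg.norm(corners[3] - corners[2])))   # top/bottom edge lengths
    h = int(max(np.linalg.norm(corners[0] - corners[3]),
                np.linalg.norm(corners[1] - corners[2])))   # left/right edge lengths
    dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    M = cv2.getPerspectiveTransform(corners, dst)
    return cv2.warpPerspective(scene, M, (w, h))            # axis-aligned word crop
```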

Motivation: As discussed earlier in Sect. 1.1, most scene-text recognition works use pre-trained Convolutional Neural Network (CNN) layers for improving results. We now motivate the need for transferring the complete recognition models discussed in Sect. 1, and the models we use in Sect. 3, among different languages. As discussed in these sections, Recurrent Neural Networks (RNNs) form another integral component of such reading models. Therefore, in Fig. 3, we illustrate the distribution of character-level n-grams they learn for the first five languages discussed in the previous section (we notice that the last two languages follow a similar trend). On the left, we show the frequency distribution of the top-5 n-grams (\(n\in [1,5]\)). On the right, we show the frequency distribution of all n-grams with \(n\in [1,5]\). We use 2.5M words from each language for these plots. We consider capital and small letters separately for English, as the distinction is crucial for the text recognition task. Despite this, we note that the top-5 n-grams are composed of small letters. The Indian languages, however, do not have small and capital letters like English. Nevertheless, the total number of English letters (counting small letters separately from capitals) is of the same order as in the Indian languages; the x-values (\(\le \)100) at the drops in the 1-gram plots (blue curves) of Fig. 3 also illustrate this, so it is possible to compare the distributions. Next, we note that most of the top-5 n-grams comprise vowels for all the languages. Moreover, the overall distributions are similar across languages. Hence, we propose that transferring the RNN layers among the models of different languages is worth investigating.
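
The statistics behind Fig. 3 are straightforward to reproduce; the sketch below counts character n-grams (\(n\in [1,5]\)) over a word list with sliding windows. The corpus path is a placeholder.

```python
# Sketch of the character n-gram counting behind Fig. 3.
from collections import Counter

def char_ngrams(words, n_max=5):
    counts = {n: Counter() for n in range(1, n_max + 1)}
    for w in words:
        for n in range(1, n_max + 1):
            counts[n].update(w[i:i + n] for i in range(len(w) - n + 1))
    return counts

words = open("corpora/hindi.txt", encoding="utf-8").read().split()  # placeholder corpus
for n, c in char_ngrams(words).items():
    print(n, c.most_common(5))  # the "Top-5" curves of Fig. 3 (left)
```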

It is important to note the differences between the n-grams of English and the Indian languages. Many of the top-5 n-grams in English are complete word forms, which is not the case for the Indian languages owing to their richness in inflections (or fusions) [29]. Also note that the second and third 1-grams for Hindi and Bangla in Fig. 3 (left), known as the Halanta, are a common feature of the top-5 Indic n-grams. The Halanta forms an essential part of joint glyphs, or aksharas (as advocated by Vinitha et al. [29]). In Figs. 1 and 2, the vowels, or portions of joint glyphs, in Indian-language word images often appear above the top-connector line or below the generic consonants. All this, in addition to the complex glyphs of the Indian languages, makes transfer learning from English to Indian languages ineffective, as detailed in Sect. 5. Thus, we also investigate the transferability of features among the Indic scene-text recognition models in the subsequent sections.

3 Models

This section explains the two models we use for transfer learning in Indian languages and a plug-in module we propose for learning the correction mechanism in the recognition systems.

CRNN Model: The first model we train is the Convolutional-Recurrent Neural Network (CRNN), a combination of CNN and RNN, as shown in Fig. 4 (left). The CRNN architecture consists of three fundamental components: i) an encoder composed of the standard VGG model [24], ii) a decoder consisting of an RNN, and iii) a Connectionist Temporal Classification (CTC) layer to align the decoded sequence with the ground truth. The CNN-based encoder consists of seven layers that extract feature representations from the input image. The model abandons fully connected layers for compactness and efficiency, and replaces standard square pooling with \(1\times 2\) rectangular pooling windows in the \(3^{rd}\) and \(4^{th}\) max-pooling layers to yield feature maps with a larger width. A two-layer Bi-directional Long Short-Term Memory (BiLSTM) model, each layer with a hidden size of 256 units, then decodes the features. During the training phase, the CTC layer provides non-parameterized supervision to align the decoded predictions with the ground truth; greedy decoding is used during the testing stage. We use the PyTorch implementation of the model by Shi et al. [21].
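
A condensed PyTorch sketch of this architecture follows. The channel widths and the final height-collapsing convolution are simplified relative to the full seven-layer VGG-style encoder, but the rectangular pooling that keeps the feature maps wide, the two-layer BiLSTM with hidden size 256, and the \(25\times 1\times 256\) CTC-ready output match the description above.

```python
# Simplified CRNN sketch (not the exact layer configuration of Shi et al. [21]).
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, 1), nn.ReLU(), nn.MaxPool2d(2, 2),    # 32x100 -> 16x50
            nn.Conv2d(64, 128, 3, 1, 1), nn.ReLU(), nn.MaxPool2d(2, 2),  # -> 8x25
            nn.Conv2d(128, 256, 3, 1, 1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),       # rectangular pool: halve height, keep width
            nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),       # -> 2x25
            nn.Conv2d(256, 256, (2, 1), 1, 0), nn.ReLU(),  # collapse height -> 1x25
        )
        self.rnn = nn.LSTM(256, 256, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)   # 512 = 2 directions x 256 units

    def forward(self, x):                       # x: (B, 1, 32, 100)
        f = self.cnn(x)                         # (B, 256, 1, 25)
        f = f.squeeze(2).permute(2, 0, 1)       # (T=25, B, 256) for the BiLSTM
        out, _ = self.rnn(f)                    # (25, B, 512)
        return self.fc(out)                     # (25, B, num_classes); feed to CTC
```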

Fig. 4.

CRNN model (left) and STAR-Net with a correction BiLSTM (right).

STAR-Net: As shown in Fig. 4 (right), the STAR-Net model consists of three components: i) a spatial transformer to handle image distortions, ii) a Residue Feature Extractor consisting of a residue CNN and an RNN, and iii) a CTC layer to align the predicted and ground-truth sequences. The transformer consists of a spatial attention mechanism achieved via a CNN-based localization network, a sampler, and an interpolator. The localizer predicts the parameters of an affine transformation; the sampler and the nearest-neighbor interpolator use the transformation to obtain a rectified version of the input image. The transformed image acts as the input to the Residue Feature Extractor, which includes the CNN and a single-layer BiLSTM of 256 units. The CNN used here is based on the inception-resnet architecture, which can extract the robust image features required for scene-text recognition [26]. The CTC layer finally provides the non-parameterized supervision for text alignment. The overall model consists of 26 convolutional layers and is end-to-end trainable [11].
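
The sketch below illustrates the spatial transformer stage alone, using the localization configuration given in Sect. 4 (four convolutional layers with 16, 32, 64, and 128 channels and \(2\times 2\) max-pooling). The identity initialization of the affine parameters is our own stabilizing assumption, and nearest-neighbor sampling mirrors the interpolator described above.

```python
# Sketch of STAR-Net's spatial transformer: a localization CNN predicts six
# affine parameters that warp the 150x18 input into a rectified 100x32 image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        layers, c_in = [], 1
        for c_out in (16, 32, 64, 128):         # four plain conv layers (Sect. 4)
            layers += [nn.Conv2d(c_in, c_out, 3, 1, 1), nn.ReLU(), nn.MaxPool2d(2, 2)]
            c_in = c_out
        self.localizer = nn.Sequential(*layers)
        self.fc = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU(),
                                nn.Linear(256, 6))
        self.fc[-1].weight.data.zero_()         # start from the identity transform
        self.fc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                       # x: (B, 1, 18, 150)
        theta = self.fc(self.localizer(x)).view(-1, 2, 3)
        grid = F.affine_grid(theta, (x.size(0), 1, 32, 100), align_corners=False)
        return F.grid_sample(x, grid, mode="nearest", align_corners=False)  # (B, 1, 32, 100)
```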

Correction BiLSTM: After training the STAR-Net model on a real dataset, we add a correction BiLSTM layer (of size \(1\times 256\)), an end-to-end trainable module, to the end of the model (see Fig. 4 top-right). We train the complete model again on the same dataset to implicitly learn the error correction mechanism.
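
A minimal sketch of this plug-in is given below, assuming a backbone that returns (T, B, F) sequence features; where exactly the layer attaches in the full model follows Fig. 4 (top-right), and the names here are illustrative.

```python
# Sketch of the correction BiLSTM appended to a trained recognizer.
import torch.nn as nn

class WithCorrection(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                  # e.g., STAR-Net up to its features
        self.correct = nn.LSTM(feat_dim, 256, bidirectional=True)  # the plug-in layer
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        feats = self.backbone(x)       # (T, B, feat_dim) sequence features
        out, _ = self.correct(feats)   # retrained end-to-end to learn corrections
        return self.fc(out)            # (T, B, num_classes); still CTC-supervised
```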

Table 2. Results of individual CRNN & STAR-Net models on synthetic datasets.

4 Experiments

The images, resized to \(150\times 18\), form the input of STAR-Net. The spatial transformer module, as shown in Fig. 4 (right), then outputs an image of size \(100\times 32\). The inputs to the CNN layers of CRNN and STAR-Net are of the same size, i.e., \(100\times 32\), and the output size is \(25\times 1\times 256\). The STAR-Net localization network has four plain convolutional layers with 16, 32, 64, and 128 channels. Each layer has a filter size, stride, and padding of 3, 1, and 1, respectively, followed by a \(2\times 2\) max-pooling layer with a stride of 2. Finally, a fully connected layer of size 256 outputs the parameters that transform the input image. We train all our models on 2M or more synthetic word images, as discussed in Sect. 2. We use a batch size of 16 and the ADADELTA optimizer for stochastic gradient descent (SGD) in all the experiments [34]. The number of epochs varies between 10 and 15 across experiments. We test our models on 0.5M synthetic images for each language. For testing on real data, we use the word images from the IIIT-ILST, MLT-17, and MLT-19 datasets. We fine-tune the Bangla models on 1200 training images and test them on 673 validation images from the MLT-17 dataset for a fair comparison with Bušta et al. [2]. Similarly, we fine-tune only our best Hindi model on the MLT-19 dataset and test it on the IIIT-ILST dataset to compare with OCR-on-the-go (since it is also trained on real data) [20]. To demonstrate generalizability, we also test our models on the 3766 Hindi and 3691 Bangla images available from the MLT-19 dataset [14]. For Gujarati and Tamil, we use \(75\%\) of the word images to fine-tune our models and the remaining \(25\%\) for testing.
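
A single training step under this setup can be sketched as follows, reusing the CRNN class from Sect. 3; the alphabet size, the blank index, and the data handling are assumptions for illustration.

```python
# Sketch of one CTC training step with ADADELTA (batch size 16, Sect. 4).
import torch
import torch.nn.functional as F

model = CRNN(num_classes=110)           # hypothetical alphabet size, blank at index 0
optimizer = torch.optim.Adadelta(model.parameters())
ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def train_step(images, targets, target_lengths):
    # images: (16, 1, 32, 100); targets: concatenated label indices per word
    logits = model(images)                          # (T, B, C)
    log_probs = F.log_softmax(logits, dim=2)
    input_lengths = torch.full((images.size(0),), logits.size(0), dtype=torch.long)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```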

5 Results

In this section, we discuss the results of our experiments with i) individual models for each language, ii) the transfer learning from English to two Indian languages, and iii) the transfer learning from one Indian language to another.

Performance on Synthetic Datasets: It is instructive to compare results on synthetic datasets of different languages sharing common backgrounds, as this provides a good intuition about the difficulty of reading different scripts. In Tables 2 and 3, we present the results of our experiments on synthetic datasets. As noted in Table 2, the CRNN model achieves Character Recognition Rates (CRRs) and Word Recognition Rates (WRRs) of i) 77.13% and 38.21% in English and ii) above \(82\%\) and \(48\%\) on the synthetic datasets of all the Indian languages (refer to columns 1 and 2 of Table 2). The low accuracy on the English synthetic test set is due to the presence of more than 1200 different fonts (refer to Sect. 2). Nevertheless, using a large number of fonts in training helps generalize the model to real settings [3, 6]. STAR-Net achieves remarkably better performance than CRNN on all the datasets, with CRRs and WRRs above \(90.48\%\) and \(65.02\%\) for the Indian languages. The reason is the spatial attention mechanism and the powerful residual layers, as discussed in Sect. 3. As shown in columns 3 and 5 of Table 2, the WRRs of the models trained on Gujarati, Hindi, and Bangla are higher than those of the other three Indian languages despite common backgrounds. These experiments show that the latter languages’ scripts pose a tougher reading challenge than the former’s.
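
For reference, the sketch below computes CRR and WRR under the usual definitions, which we assume here since the text does not spell them out: WRR is the exact-match rate over words, and CRR is one minus the ratio of total edit distance to total ground-truth characters.

```python
# Sketch of the evaluation metrics (assumed edit-distance-based definitions).
def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))                    # one-row Levenshtein DP
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def crr_wrr(preds, gts):
    errs = sum(edit_distance(p, g) for p, g in zip(preds, gts))
    chars = sum(len(g) for g in gts)
    wrr = sum(p == g for p, g in zip(preds, gts)) / len(gts)
    return 1 - errs / chars, wrr                    # (CRR, WRR)
```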

Table 3. Results of transfer learning (TL) on synthetic datasets. Parenthesis contain results from Table 2. TL among Indic scripts improves STAR-Net results.

We present the results of our transfer learning experiments on the synthetic datasets in Table 3. The best individual-model results from Table 2 are included in parentheses for comparison. We begin with the English models as the base because they have been trained on over 1200 fonts and 17M word images, as discussed in Sect. 2, and are hence generic. However, in the first two rows of the table, we note that transferring the layers of the model trained on the English dataset to Gujarati and Hindi is ineffective in improving results over the individual models. A possible reason is that Indic scripts are visually very different from English and have slightly different n-gram characteristics, as discussed in Sect. 2. We then note that when we apply transfer learning among Indian languages with CRNN (rows 3–7, columns 1–2 in Table 3), only some combinations work well. However, with STAR-Net (rows 3–7, columns 3–4 in Table 3), transfer learning from a simple language to a complex language helps improve results on the synthetic datasets. For Malayalam, we observe that the individual STAR-Net model is better than the one transferred from Telugu, perhaps due to Malayalam’s high average word length (refer to Sect. 2).
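
Operationally, the transfer amounts to initializing every layer of the target-language model from a source-language checkpoint and re-training all layers on the target data, replacing only the classifier when the alphabets differ in size. The sketch below shows this; the checkpoint name and alphabet sizes are hypothetical.

```python
# Sketch of cross-lingual initialization (all layers, no freezing).
import torch

src_state = torch.load("crnn_gujarati.pth", map_location="cpu")  # assumed plain state_dict
model = CRNN(num_classes=120)                                    # target-language alphabet
tgt_state = model.state_dict()
compatible = {k: v for k, v in src_state.items()
              if k in tgt_state and v.shape == tgt_state[k].shape}
model.load_state_dict(compatible, strict=False)  # classifier stays freshly initialized
# ...then continue training on the target language exactly as in Sect. 4.
```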

Table 4. Results on real datasets. FT indicates fine-tuned.
Fig. 5.

CNN-layer visualizations. Top: CRNN models trained on Hindi, English\(\rightarrow \)Hindi, and Gujarati\(\rightarrow \)Hindi; bottom: STAR-Net models trained on Gujarati, English\(\rightarrow \)Gujarati, and Hindi\(\rightarrow \)Gujarati. Red boxes indicate regions where the features of the model transferred from English are activated (white) while the features of the other two models are not. (Color figure online)

Performance on Real Datasets: Table 4 depicts the performance of our models on the real datasets. First, we observe that for each Indian language, the overall performance of the individual STAR-Net model is better than that of the individual CRNN model (except for Gujarati and Hindi, where the results are very close). Based on this and similar observations in the previous section, we present the results of transfer learning experiments on real datasets only with the STAR-Net model. Next, as in the previous section, we observe that transfer learning from English to the Gujarati and Hindi IIIT-ILST datasets (rows 3 and 8 in Table 4) is not as effective as the individual models in these Indian languages (rows 2 and 7 in Table 4). Finally, we observe that performance improves with transfer learning from a simple language to a complex language, except for Hindi\(\rightarrow \)Gujarati, where Hindi, as the nearest script, is the only straightforward source choice. We achieve better performance than the previous works, i.e., Bušta et al. [2], Mathew et al. [13], and OCR-on-the-go [20]. Overall, we observe WRR increases of \(6\%\), \(5\%\), \(2\%\), and \(23\%\) on the IIIT-ILST Hindi, Telugu, and Malayalam datasets and the MLT-17 Bangla dataset, respectively, compared to the previous works. On the MLT-19 Hindi and Bangla datasets, we achieve gains of \(8\%\) and \(4\%\) in WRR over the baseline individual CRNN models. On the datasets we release for Gujarati and Tamil, we improve over the baselines by \(5\%\) and \(3\%\) in WRR. We present the qualitative results of our baseline CRNN models as well as our best transfer learning models in Fig. 1. The green and red colors represent correct predictions and errors, respectively, and “_” represents a missing character. As can be seen, most of the mistakes are single-character errors.

Since we observe the highest gain, \(23\%\) in WRR (and \(4\%\) in CRR), on the MLT-17 Bangla dataset (Table 4), we try to improve these results further. We plug the correction BiLSTM (refer to Sect. 3) into the best model (row 18 of Table 4). The results are shown in row 19 of Table 4. The correction BiLSTM improves the CRR further by a notable margin of \(11\%\), since the BiLSTM works at the character level. We also observe a \(1\%\) WRR gain, thereby achieving an overall \(24\%\) WRR gain (and \(15\%\) CRR gain) over Bušta et al. [2].

Features Visualization: In Fig. 5 (top three triplets), we visualize the learned CNN layers of the individual Hindi CRNN model, the English\(\rightarrow \)Hindi model, and the Gujarati\(\rightarrow \)Hindi model. The red boxes mark the regions where the first four CNN layers of the model transferred from English differ from those of the other two models. The feature visualization again strengthens our claim that transferring an English reading model to an Indian-language dataset is ineffective. We notice a similar trend for the Gujarati STAR-Net models, though their initial CNN layers’ activations closely resemble the input word images (bottom three triplets in Fig. 5). This similarity also demonstrates the better learnability of STAR-Net compared to CRNN, as observed in the previous sections.

6 Conclusion

We generated 2.5M or more synthetic word images in each of six Indian languages of varying complexity to investigate language transfer for two scene-text recognition models. The underlying view is that the transfer of image features is standard in deep models, and the transfer of language (text) features is a plausible and natural choice for reading models. However, we observe that transferring generic English photo-OCR models (trained on over 1200 fonts) to Indian languages is ineffective. Our models transferred from one Indian language to another perform better than the previous works and the new baselines we created for individual languages. We thereby set new benchmarks for scene-text recognition in low-resource Indian languages. The proposed correction BiLSTM, when plugged into the STAR-Net model and trained end-to-end, further improves the results.