Abstract
This paper presents a new scene text dataset named the Downtown Osaka Scene Text Dataset (DOST dataset for short). The dataset consists of sequential images captured in shopping streets in downtown Osaka with an omnidirectional camera. Unlike most existing datasets, which consist of intentionally captured scene images, DOST dataset consists of uncontrolled scene images; the omnidirectional camera enabled us to capture videos (sequential images) of the whole scene surrounding the camera. Because the dataset preserves the real scenes containing text as they were, its texts are scene texts in the wild. DOST dataset contains 32,147 manually ground truthed sequential images, which contain 935,601 text regions consisting of 797,919 legible and 137,682 illegible regions. The legible regions contain 2,808,340 characters. The dataset was evaluated using two existing scene text detection methods and one powerful commercial end-to-end scene text recognition method to assess its difficulty and quality in comparison with existing datasets.
Keywords
- Scene text in the wild
- Uncontrolled scene text
- Omnidirectional camera
- Sequential image
- Video
- Japanese text
1 Introduction
Text plays important roles in our life. Imagine life in a world without text, in which, for example, no book, newspaper, signboard, restaurant menu, text message on a smartphone or program source code exists, or they all exist in a completely different form; we would rediscover not only the necessity of text but also the importance of reading and interpreting it. Although only human beings have been endowed with the ability to read and interpret text, researchers have long struggled to enable computers to read text as well.
Focusing on camera-captured text and scene text, some pioneering works were presented in the 1990s [21]. Since then, increasing attention has been paid to recognizing scene text. Table 1 shows the remarkable recent progress of scene text recognition techniques: most reported accuracies of the latest methods exceed 90% on major benchmark datasets. However, does this mean these methods are powerful enough to read the variety of texts in the real environment? Many people would agree that the answer is no. The text images contained in these datasets are far easier than real ones. In the real environment, scene text is much more diverse; for example, texts of various designs, styles and shapes appear under many different illuminations and are captured from a variety of angles and distances. In this regard, there is a big gap between the scene texts contained in existing datasets and those observed in the real environment.
In this paper, to fill the gap, we present a new dataset named the Downtown Osaka Scene Text Dataset (DOST dataset for short), which preserves scene texts observed in the real environment as they were. The dataset contains videos (sequential images) captured in shopping streets in downtown Osaka with an omnidirectional camera equipped with five horizontal cameras and one upward camera, shown in Fig. 1. In total, 30 image sequences (five shopping streets times six cameras) consisting of 783,150 images were captured. Among them, 27 image sequences consisting of 32,147 images were manually ground truthed. As a result, 935,601 text regions were obtained, consisting of 797,919 legible and 137,682 illegible regions. The legible regions contain 2,808,340 characters. Since the images were captured in Japan, they contain many Japanese texts. However, out of the 797,919 legible text regions, 283,940 consist of only alphabets and digits, and the legible regions contain 1,138,091 non-Japanese characters in total. Because of these features, we can say that DOST dataset preserves scene texts in the wild. Figures 3, 4, 5 and 6 show examples of ground truthed captured images and segmented words contained in DOST dataset. Since the image sequences were captured with an omnidirectional camera and are continuous in time, a single word was captured many times from multiple view angles. DOST dataset was evaluated using two existing text detection methods and one powerful commercial end-to-end scene text recognition method to measure its difficulty and quality in comparison with existing datasets.
2 Unique Features of DOST Dataset
Features of existing datasets are summarized in Table 2. The major differences of DOST dataset from existing datasets include the following.
1. DOST dataset contains only real images. Unlike MJSynth [22] and SynthText [23], which aim at training better classifiers, DOST dataset aims at the evaluation of scene text detection/recognition methods.
2. The images were not intentionally captured at all. In this regard, the most similar dataset is the one dedicated to ICDAR 2015 Robust Reading Competition Challenge 4, "incidental scene text." That dataset is regarded as not intentionally captured because its images were captured with Google Glass without any prior action taken to cause a text's appearance in the field of view or to improve its positioning or quality in the frame. DOST dataset is completely free from such intention, even from the face direction of the user wearing Google Glass.
3. The dataset is a video dataset (consecutive in time). Video datasets already exist: the 2013 and 2015 editions of the ICDAR Robust Reading Competition (RRC) Challenge 3 datasets [5, 24] consist of sequential images. The biggest difference is that DOST dataset was captured with an omnidirectional camera. Another difference is that DOST dataset contains Japanese text while the ICDAR RRC datasets consist of Latin text. Another video dataset, YVT [25], contains YouTube videos; some texts in that dataset are not scene texts but captions.
4. DOST dataset contains multiple images of each single word, taken from different view angles.
5. The scale of DOST dataset is large. In the following discussion, we exclude synthesized datasets and SVHN, which consists of digits. Though the total number of ground truthed images in DOST dataset (32,147) is not very large (almost half of the largest dataset, COCO-Text), the number of word regions (935,601 in total, consisting of 797,919 legible and 137,682 illegible) is very large (4.6 times larger than the second largest dataset, COCO-Text). This is because the image sizes are relatively large (\(1,200 \times 1,600\) pixels) and the images were captured in shopping streets where many texts exist. DOST dataset is also the largest in terms of the number of unique word sequences, 6.3 times larger than the second largest, the ICDAR2015 Challenge 3 dataset.
Another feature of DOST dataset is that it was manually ground truthed by students. The reason we did not use a crowdsourcing service such as Amazon Mechanical Turk is that most workers cannot read Japanese text.
Yet another feature of DOST dataset is that it contains many Latin characters, though the images were captured in Japan. The number of characters per category and examples of Japanese characters and symbols are shown in Fig. 7. Kanji (Chinese characters) are logograms. Katakana and Hiragana are syllabaries invented on the basis of Kanji. Though symbols were originally not intended to be ground truthed, some were actually ground truthed. They include often-used iteration marks, which represent a duplicated character. In the future, symbols other than the iteration marks will be discarded by rigorously applying the ground truthing policy.
3 Construction of DOST Dataset
DOST dataset was constructed through the following procedure.
1. Image capture
Scene images were captured with an omnidirectional camera, a Point Grey Ladybug3, consisting of five horizontal cameras and one upward camera, shown in Fig. 1. It was set up on a cart, shown in Fig. 2, together with a laptop computer and a car battery. A pair of students walked along each shopping street pushing the cart. Images were captured at 6.5 fps in the uncompressed mode. The resolution of each captured image was \(1,200 \times 1,600\) pixels. Lens distortion of the captured images was rectified by software provided by the camera vendor. This process was completed in 2012. Table 3 summarizes where, how long and how many images we captured.
2. Ground truthing
Selected sequences were ground truthed by hand, unlike the COCO-Text dataset [29], which used existing scene text detection/recognition methods. The reason we did not use such methods was that the scene texts contained in these images are very difficult for them. We developed a ground truthing tool, shown in Fig. 8, to make the process efficient. Similar to LabelMe Video [30], it has a functionality to transfer text information (text labels) in a frame to neighboring frames using a homography. However, things in the scene are not on a plane as a homography assumes. Hence, following the homography computation, more precise positions of words were determined by sliding-window-based template matching. Table 4 shows the distribution of sequence lengths. Each image was checked at least twice by different persons: once for ground truthing and once for confirmation. When the ground truthing policy was updated, the ground truths were updated at the confirmation stage. We spent more than 1,500 man-hours on this process.
3. Privacy preservation
Since the captured images preserve the real scenes in shopping streets, we could not avoid capturing passersby. To avoid privacy violations, we blurred their face regions. At first, we used the Amazon Mechanical Turk service. Later, however, we decided to assign this task to our students as well, so as to ensure the quality with less managing effort.
4 Ground Truthing Policy
The ground truthing policy of DOST dataset is mostly shared with the 2013 and 2015 editions of the ICDAR Robust Reading Competition Challenge 3 datasets [5, 24]. Since DOST dataset contains not only Latin but also Japanese text, in addition to the ground truthing policy for Latin scripts, we determined one for Japanese text. The ground truthing policy of DOST dataset is summarized below.
1. Basic unit
A bounding box is created for each basic unit such as a word. In Latin text, a word region delimited by spaces is the basic unit. On the other hand, a Japanese sentence is written without spaces between words or grammatical units. Hence, as the basic unit of a Japanese sentence, we use the bunsetsu, which is the smallest unit of words that sounds natural in a spoken sentence. A proper noun is not divided.
There is one exception: if the quality of text is "low," multiple texts of low quality are covered by a single bounding box (see "Transcription" below).
2. Partial occlusion and out of frame
Even if the region of a basic unit is partially occluded or partially out of frame, it is regarded as a single basic unit without division.
3. Bounding box
To cope with perspective distortion, the bounding box of a basic unit is represented by four independent points.
4. Transcription
The transcription of a basic unit region consists of its visible characters. If a basic unit region is partially occluded or partially out of frame, the visible characters are transcribed and the invisible character(s) are represented by a space. For example, if a segmented word region of "Barcelona" has "ce" occluded, the transcription should be "Bar lona." In Fig. 6, an underscore represents a space.
5. ID
The same ID is assigned to a sequence of a basic unit as long as it can be traced within the frame. An exception is the case where a basic unit completely disappears because it goes out of the frame; in such a case, even if it appears again, a different ID is assigned to it.
6. Quality
One of "high," "medium" or "low" is assigned to each basic unit based on subjective evaluation. Basic units rated "high" or "medium" are regarded as legible; annotators were allowed to enlarge the image to check legibility. Basic units rated "low" are regarded as "don't care" regions: even if a text detection method detects such a basic unit, it is not counted as a detection failure.
7. Language
Either "Latin" or "Japanese" is assigned to each basic unit. A basic unit consisting of only alphabets and digits is labeled "Latin." A basic unit containing at least one character that is neither an alphabet nor a digit is labeled "Japanese." This is useful for performing experiments using only Latin text.
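Two mechanical parts of the policy above, the transcription rule and the language rule, can be sketched in code. This is our own illustrative reading of the policy (helper names are hypothetical, and we assume a run of consecutive invisible characters is transcribed as a single space, consistent with the "Bar lona" example):

```python
# Sketch of two ground truthing rules (hypothetical helper names).
# transcribe(): visible characters are kept; a run of invisible
# (occluded or out-of-frame) characters becomes a single space.
# language_label(): "Latin" iff every character is an ASCII letter/digit.

def transcribe(chars, visible):
    """Build a transcription from characters and a per-character visibility list."""
    out = []
    for c, v in zip(chars, visible):
        if v:
            out.append(c)
        elif not out or out[-1] != " ":
            out.append(" ")  # collapse an invisible run into one space
    return "".join(out)

def language_label(text):
    """Assign the language label of a basic unit under the stated policy."""
    return "Latin" if all(c.isascii() and c.isalnum() for c in text) else "Japanese"

# "Barcelona" with "ce" (indices 3 and 4) occluded, as in the example above.
word = "Barcelona"
vis = [i not in (3, 4) for i in range(len(word))]
print(transcribe(word, vis))      # Bar lona
print(language_label("Barcelona"))  # Latin
print(language_label("大阪OSAKA"))   # Japanese
```

The `isascii()` check matters: Python's `isalnum()` is true for Kanji, so an ASCII test is needed to separate the two labels.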
5 Comparison of Datasets
The difficulty of major datasets was compared using two detectors and one end-to-end recognition method. To reduce the computational burden, for some datasets a subset of the data was randomly sampled and used in the experiment. The datasets compared and how they were processed are described below.
1. ICDAR2003 [4]
All (258) images in the training set were used in the experiment.
2. ICDAR2013 (Challenge 2) [5]
All (229) images in the training set were used.
3. ICDAR2015 (Challenge 3) [24]
Images were sampled once every 30 frames in 10 out of 24 training videos. As a result, 207 images were selected.
4. ICDAR2015 (Challenge 4) [24]
All (1,000) images in the training set of the "End to End" task (Task 4.4) of ICDAR 2015 Robust Reading Competition Challenge 4 were used.
5. SVT [3]
All (350) images in both the training and test sets were used.
6. YVT [25]
Images were sampled once every 30 frames in all (30) videos. As a result, 420 images were selected.
7. COCO-Text [29]
300 images were randomly sampled from those containing words annotated as English, legible and machine printed (hereafter, target words). The 300 images contained 2,403 target words as well as words that do not satisfy the target-word condition (non-target words). The non-target words were treated as "don't care" regions.
8. DOST (this paper)
Images were sampled once every 30 frames in all ground truthed sequences. As a result, 1,075 images were selected.
9. DOST Latin (this paper)
This evaluates DOST dataset as a Latin scene text dataset containing only alphabets and digits. In text detection and recognition, the same images as in "DOST" above were used. In evaluation, words containing characters other than alphabets and digits were treated as "don't care" regions; thus, even if Japanese texts are detected, they do not affect the result.
Two detection methods were used for evaluation. One was the scene text detection method contained in OpenCV 3.0, which is based on Neumann et al. [31]. The other was the method of Matsuda et al. [32]; we were privately given the source code by courtesy of the authors. In addition, the Google Vision API was used as a powerful commercial end-to-end recognition system. The API lets us designate the language of the texts. Only for "DOST" did we designate Japanese; in this mode, English texts can also be detected and recognized, though the accuracies are expected to be lower. For the other datasets, including "DOST Latin," we designated English.
For performance evaluation, regardless of the dataset, we shared the same evaluation criteria. For both the text detection and end-to-end word recognition tasks, we followed the evaluation criteria used in the "incidental scene text" challenge (Challenge 4) of the ICDAR 2015 Robust Reading Competition. That is, for the scene text detection task, based on a single Intersection-over-Union (IoU) criterion with a threshold of 50%, a detected bounding box was regarded as correct if it overlapped a ground truth bounding box by more than 50%. Recall and precision were calculated by the following equations:

$$\mathrm{Recall} = \frac{\#\,\text{correctly detected boxes}}{\#\,\text{ground truth boxes}}, \qquad \mathrm{Precision} = \frac{\#\,\text{correctly detected boxes}}{\#\,\text{detected boxes}}.$$
F-measure was then calculated as the harmonic mean of precision and recall. For the end-to-end word recognition task, a detected bounding box was regarded as correct if it satisfied the condition of the scene text detection task and the estimated transcription was exactly correct. Recall, precision and F-measure were calculated in the same way as in the detection task.
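The detection protocol above can be sketched as follows. This is a simplified illustration under our own assumptions: boxes are axis-aligned rectangles rather than the dataset's four-point quadrilaterals, each ground truth box may be matched at most once, and detections falling on "don't care" regions are neither rewarded nor penalized.

```python
# Minimal sketch of the IoU-based detection evaluation described above
# (simplified: axis-aligned boxes, greedy one-to-one matching).

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def evaluate(detections, ground_truths, dont_care=()):
    """Return (recall, precision, f_measure) under the IoU > 0.5 criterion."""
    matched_gt = set()
    tp = ignored = 0
    for d in detections:
        if any(iou(d, r) > 0.5 for r in dont_care):
            ignored += 1  # hits on "don't care" regions are not penalized
            continue
        for i, g in enumerate(ground_truths):
            if i not in matched_gt and iou(d, g) > 0.5:
                matched_gt.add(i)
                tp += 1
                break
    recall = tp / len(ground_truths) if ground_truths else 0.0
    considered = len(detections) - ignored
    precision = tp / considered if considered else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f

gts = [(0, 0, 10, 10), (20, 0, 30, 10)]
dets = [(1, 0, 10, 10), (50, 50, 60, 60)]  # one good match, one false alarm
print(evaluate(dets, gts))  # (0.5, 0.5, 0.5)
```

For end-to-end evaluation, the same matching would additionally require the transcription of a matched detection to equal the ground truth transcription exactly.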
Results are summarized in Table 5. As can be seen, the results on "DOST" and "DOST Latin" were far worse than on the other datasets. This indicates that DOST dataset, which reflects the real environment, is more challenging than the major benchmark datasets.
6 Conclusion
Although many publicly available scene text datasets already exist, none of them was intentionally constructed to reflect the real environment. Hence, even though scene text detection/recognition methods achieve high accuracies on these existing major benchmark datasets, it has not been possible to evaluate how good they are for practical use. To address this problem, we presented a new scene text dataset named the Downtown Osaka Scene Text Dataset (DOST dataset for short). Unlike most existing datasets, which consist of intentionally captured scene images, DOST dataset consists of uncontrolled scene images; the omnidirectional camera enabled us to capture videos (sequential images) of the whole scene surrounding the camera. Because the dataset preserves the real scenes containing text as they were, its texts are scene texts in the wild. Through the evaluation conducted in this paper to assess its difficulty and quality in comparison with existing datasets, we demonstrated that DOST dataset is more challenging than the major benchmark datasets.
References
Shi, B., Wang, X., Lyu, P., Yao, C., Bai, X.: Robust scene text recognition with automatic rectification. In: Proceedings of CVPR, pp. 4168–4176 (2016)
Mishra, A., Alahari, K., Jawahar, C.V.: Scene text recognition using higher order language priors. In: Proceedings of BMVC (2012)
Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: Proceedings of ICCV, pp. 1457–1464 (2011)
Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R., Ashida, K., Nagai, H., Okamoto, M., Yamamoto, H., Miyao, H., Zhu, J., Ou, W., Wolf, C., Jolion, J.M., Todoran, L., Worring, M., Lin, X.: ICDAR 2003 robust reading competitions: Entries, results and future directions. IJDAR 7(2–3), 105–122 (2005)
Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Gomez i Bigorda, L., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., de las Heras, L.P.: ICDAR 2013 robust reading competition. In: Proceedings of ICDAR, pp. 1115–1124 (2013)
Wang, T., Wu, D.J., Coates, A., Ng, A.Y.: End-to-end text recognition with convolutional neural networks. In: Proceedings of ICPR, pp. 3304–3308 (2012)
Novikova, T., Barinova, O., Kohli, P., Lempitsky, V.: Large-lexicon attribute-consistent text recognition in natural images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part VI. LNCS, vol. 7577, pp. 752–765. Springer, Heidelberg (2012)
Goel, V., Mishra, A., Alahari, K., Jawahar, C.V.: Whole is greater than sum of parts: recognizing scene text words. In: Proceedings of ICDAR, pp. 398–402 (2013)
Bissacco, A., Cummins, M., Netzer, Y., Neven, H.: Photoocr: reading text in uncontrolled conditions. In: Proceedings of ICCV, pp. 785–792 (2013)
Alsharif, O., Pineau, J.: End-to-end text recognition with hybrid HMM maxout models. In: International Conference on Learning Representations (ICLR) (2014)
Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Word spotting and recognition with embedded attributes. IEEE TPAMI 36(12), 2552–2566 (2014)
Yao, C., Bai, X., Shi, B., Liu, W.: Strokelets: a learned multi-scale representation for scene text recognition. In: Proceedings of CVPR (2014)
Jaderberg, M., Vedaldi, A., Zisserman, A.: Deep features for text spotting. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part IV. LNCS, vol. 8692, pp. 512–528. Springer, Heidelberg (2014)
Su, B., Lu, S.: Accurate scene text recognition based on recurrent neural network. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9003, pp. 35–48. Springer, Heidelberg (2015)
Rodriguez, J.A., Gordo, A., Perronnin, F.: Label embedding: a frugal baseline for text recognition. IJCV 113(3), 193–207 (2015)
Gordo, A.: Supervised mid-level features for word image representation. In: Proceedings of CVPR (2015)
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. IJCV 116(1), 1–20 (2016)
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Deep structured output learning for unconstrained text recognition. In: Proceedings of ICLR (2015)
Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. CoRR abs/1507.05717 (2015)
Poznanski, A., Wolf, L.: CNN-N-gram for handwriting word recognition. In: Proceedings of CVPR (2016)
Liang, J., Doermann, D., Li, H.: Camera-based analysis of text and documents: a survey. IJDAR 7(2), 83–104 (2005)
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. In: Proceedings of NIPS Deep Learning Workshop (2014)
Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: Proceedings of CVPR (2016)
Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., Shafait, F., Uchida, S., Valveny, E.: ICDAR 2015 robust reading competition. In: Proceedings of ICDAR, pp. 1156–1160 (2015)
Nguyen, P.X., Wang, K., Belongie, S.: Video text detection and recognition: dataset and benchmark. In: Proceedings of WACV (2014)
Nagy, R., Dicker, A., Meyer-Wegener, K.: NEOCR: a configurable dataset for natural image text recognition. In: Iwamura, M., Shafait, F. (eds.) CBDAR 2011. LNCS, vol. 7139, pp. 150–163. Springer, Heidelberg (2012)
Jung, J., Lee, S., Cho, M.S., Kim, J.H.: Touch TT: scene text extractor using touchscreen interface. ETRI J. 33(1), 78–88 (2011)
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: Proceedings of NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
Veit, A., Matera, T., Neumann, L., Matas, J., Belongie, S.: COCO-Text: dataset and benchmark for text detection and recognition in natural images. CoRR abs/1601.07140 (2016)
Yuen, J., Russell, B., Liu, C., Torralba, A.: LabelMe video: building a video database with human annotations. In: Proceedings of ICCV, pp. 1451–1458 (2009)
Neumann, L., Matas, J.: Real-time scene text localization and recognition. In: Proceedings of CVPR, pp. 3538–3545 (2012)
Matsuda, Y., Omachi, S., Aso, H.: String detection from scene images by binarization and edge detection. Trans. IEICE J93(3), 336–344 (2010). In Japanese
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by JST CREST and JSPS KAKENHI #25240028.
© 2016 Springer International Publishing Switzerland
Iwamura, M., Matsuda, T., Morimoto, N., Sato, H., Ikeda, Y., Kise, K. (2016). Downtown Osaka Scene Text Dataset. In: Hua, G., Jégou, H. (eds) Computer Vision – ECCV 2016 Workshops. ECCV 2016. Lecture Notes in Computer Science(), vol 9913. Springer, Cham. https://doi.org/10.1007/978-3-319-46604-0_32
Print ISBN: 978-3-319-46603-3
Online ISBN: 978-3-319-46604-0