Abstract
This paper presents a new scene text dataset named the Downtown Osaka Scene Text Dataset (DOST dataset for short). The dataset consists of sequential images captured in shopping streets in downtown Osaka with an omnidirectional camera. Unlike most existing datasets, which consist of intentionally captured scene images, DOST dataset consists of uncontrolled scene images; the omnidirectional camera enabled us to capture videos (sequential images) of the whole scene surrounding the camera. Because the dataset preserves the real scenes containing text as they were, its texts are scene texts in the wild. DOST dataset contains 32,147 manually ground truthed sequential images, which contain 935,601 text regions consisting of 797,919 legible and 137,682 illegible regions. The legible regions contain 2,808,340 characters. The dataset was evaluated using two existing scene text detection methods and one powerful commercial end-to-end scene text recognition method to assess its difficulty and quality in comparison with existing datasets.
Keywords
- Scene text in the wild
- Uncontrolled scene text
- Omnidirectional camera
- Sequential image
- Video
- Japanese text
1 Introduction
Text plays important roles in our life. Imagine life in a world without text, in which, for example, no book, newspaper, signboard, restaurant menu, text message on a smartphone or program source code exists, or they all exist in a completely different form; we would rediscover not only the necessity of text but also the importance of reading and interpreting it. Although only human beings have been endowed with the ability to read and interpret text, researchers have long struggled to enable computers to read text as well.
Focusing on camera-captured text and scene text, some pioneering works were presented in the 1990s [21]. Since then, increasing attention has been paid to recognizing scene text. Table 1 shows the remarkable recent progress of scene text recognition techniques: most reported accuracies of the latest methods exceed 90% on major benchmark datasets. However, does this mean these methods are powerful enough to read the variety of texts in the real environment? Many people would agree that the answer is no. The text images contained in these datasets are far easier than real ones. In the real environment, scene text is much more diverse; for example, texts of various designs, styles and shapes appear under many different illuminations and are captured from a variety of angles and distances. In this regard, there is a big gap between the scene texts contained in existing datasets and those observed in the real environment.
In this paper, to fill the gap, we present a new dataset named the Downtown Osaka Scene Text Dataset (DOST dataset for short), which preserves scene texts observed in the real environment as they were. The dataset contains videos (sequential images) captured in shopping streets in downtown Osaka with an omnidirectional camera equipped with five horizontal cameras and one upward camera, shown in Fig. 1. In total, 30 image sequences (five shopping streets times six cameras) consisting of 783,150 images were captured. Among them, 27 image sequences consisting of 32,147 images were manually ground truthed. As a result, 935,601 text regions were obtained, consisting of 797,919 legible and 137,682 illegible regions. The legible regions contain 2,808,340 characters. Since the images were captured in Japan, they contain many Japanese texts. However, out of the 797,919 legible text regions, 283,940 consist of only alphabets and digits, and the legible regions contain 1,138,091 non-Japanese characters in total. Because of these features, we can say that DOST dataset preserves scene texts in the wild. Figures 3, 4, 5 and 6 show examples of ground truthed captured images and segmented words contained in DOST dataset. Since the image sequences were captured with an omnidirectional camera and are continuous in time, a single word was captured many times from multiple view angles. DOST dataset was evaluated using two existing text detection methods and one powerful commercial end-to-end scene text recognition method to measure its difficulty and quality in comparison with existing datasets.
2 Unique Features of DOST Dataset
Features of existing datasets are summarized in Table 2. The major differences of DOST dataset from existing datasets include the following.
1. DOST dataset contains only real images. Unlike MJSynth [22] and SynthText [23], which aim at training better classifiers, DOST dataset aims at the evaluation of scene text detection/recognition methods.
2. The images were not intentionally captured at all. In this regard, the most similar dataset is the one dedicated to ICDAR 2015 Robust Reading Competition Challenge 4, "incidental scene text." That dataset is regarded as not intentionally captured because its images were captured with Google Glass without any prior action taken to cause a text's appearance in the field of view or to improve its positioning or quality in the frame. DOST dataset is completely free from such intention, even from the face direction of the user wearing Google Glass.
3. The dataset is a video dataset (consecutive in time). Video datasets already exist: the 2013 and 2015 editions of the ICDAR Robust Reading Competition (RRC) Challenge 3 datasets [5, 24] consist of sequential images. The biggest difference is that DOST dataset was captured with an omnidirectional camera. Another difference is that DOST dataset contains Japanese text while the ICDAR RRC datasets consist of Latin text. Another video dataset, YVT [25], contains YouTube videos; some texts in that dataset are not scene texts but captions.
4. DOST dataset contains multiple images of each single word, taken from different view angles.
5. The scale of DOST dataset is large. In the following discussion, we exclude synthesized datasets and SVHN, which consists of digits. Though the total number of ground truthed images in DOST dataset (32,147) is not very large (almost half of the largest dataset, COCO-Text), the number of word regions (935,601 in total, consisting of 797,919 legible and 137,682 illegible) is very large (4.6 times larger than the second largest dataset, COCO-Text). This is because the image sizes are relatively large (\(1,200 \times 1,600\) pixels) and the images were captured in shopping streets where many texts exist. DOST dataset is also the largest in terms of the number of unique word sequences, 6.3 times larger than the second largest, the ICDAR2015 Challenge 3 dataset.
Another feature of DOST dataset is that it was manually ground truthed by students. The reason we did not use a crowdsourcing service such as Amazon Mechanical Turk is that most workers cannot read Japanese text.
Yet another feature of DOST dataset is that it contains many Latin characters, though the images were captured in Japan. The number of characters per category and examples of Japanese characters and symbols are shown in Fig. 7. Kanji (Chinese characters) are logograms. Katakana and Hiragana are syllabaries invented on the basis of Kanji. Though symbols were originally not intended to be ground truthed, some were actually ground truthed. They include often-used iteration marks, which represent a duplicated character. In the future, symbols other than the iteration marks will be discarded by rigorously applying the ground truthing policy.
3 Construction of DOST Dataset
DOST dataset was constructed through the following procedure.
1. Image capture
Scene images were captured with an omnidirectional camera, a Point Grey Ladybug3, consisting of five horizontal cameras and one upward camera, shown in Fig. 1. It was set up on a cart, shown in Fig. 2, together with a laptop computer and a car battery. A pair of students walked along each shopping street pushing the cart. Images were captured at 6.5 fps in the uncompressed mode. The resolution of each captured image was \(1,200 \times 1,600\) pixels. Lens distortion of the captured images was rectified by software provided by the camera vendor. This process was completed in 2012. Table 3 summarizes where, how long and how many images we captured.
2. Ground truthing
Selected sequences were ground truthed by hand, unlike the COCO-Text dataset [29], which used existing scene text detection/recognition methods. The reason we did not use such methods was that the scene texts contained in these images are very difficult for them. We developed a ground truthing tool, shown in Fig. 8, to make the process efficient. Similar to LabelMe Video [30], it has a functionality to transfer text information (text labels) in a frame to neighboring frames using a homography. However, things in the scene are not on a plane as a homography assumes. Hence, following the homography computation, more precise positions of words were determined by sliding-window-based template matching. Table 4 shows the distribution of sequence lengths. Each image was checked at least twice by different persons: once for ground truthing and once for confirmation. When the ground truthing policy was updated, the ground truths were updated at the confirmation stage. We spent more than 1,500 man-hours on this process.
3. Privacy preservation
Since the captured images preserve the real scenes in shopping streets, we could not avoid capturing passersby. To avoid privacy violations, we blurred their face regions. At first, we used the Amazon Mechanical Turk service. Later, however, we decided to assign this task to our students as well, so as to ensure the quality with less managing effort.
4 Ground Truthing Policy
The ground truthing policy of DOST dataset is mostly shared with the 2013 and 2015 editions of the ICDAR Robust Reading Competition Challenge 3 datasets [5, 24]. Since DOST dataset contains not only Latin but also Japanese text, in addition to the ground truthing policy for Latin scripts, we determined one for Japanese text. The ground truthing policy of DOST dataset is summarized below.
1. Basic unit
A bounding box is created for each basic unit such as a word. In Latin text, a word region delimited by spaces is the basic unit. On the other hand, a Japanese sentence is written without spaces between words or grammatical units. Hence, as the basic unit of a Japanese sentence, we use the bunsetsu, which is the smallest unit of words that sounds natural in a spoken sentence. A proper noun is not divided.
There is one exception: if the quality of text is "low," multiple texts of low quality are covered by a single bounding box (see "Transcription" below).
2. Partial occlusion and out of frame
Even if the region of a basic unit is partially occluded or partially out of frame, it is regarded as a single basic unit without division.
3. Bounding box
To cope with perspective distortion, the bounding box of a basic unit is represented by four independent points.
4. Transcription
The transcription of a basic unit region consists of its visible characters. If a basic unit region is partially occluded or partially out of frame, the visible characters are transcribed and the invisible character(s) are represented by a space. For example, if a segmented word region of "Barcelona" has "ce" occluded, the transcription should be "Bar lona." In Fig. 6, an underscore represents a space.
5. ID
The same ID is assigned to a sequence of a basic unit as long as it can be traced within the frame. An exception is the case where a basic unit completely disappears because it goes out of the frame; in such a case, even if it appears again, a different ID is assigned to it.
6. Quality
One of "high," "medium" or "low" is assigned to each basic unit based on subjective evaluation. Basic units rated "high" or "medium" are regarded as legible; annotators were allowed to enlarge the image to check legibility. Basic units rated "low" are regarded as "don't care" regions: even if a text detection method detects such a basic unit, it is not counted as a detection failure.
7. Language
Either "Latin" or "Japanese" is assigned to each basic unit. A basic unit consisting of only alphabets and digits is labeled "Latin." A basic unit containing at least one character that is neither an alphabet nor a digit is labeled "Japanese." This is useful for performing experiments using only Latin text.
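Two mechanical parts of the policy above, the transcription rule and the language rule, can be sketched in code. This is our own illustrative reading of the policy (helper names are hypothetical, and we assume a run of consecutive invisible characters is transcribed as a single space, consistent with the "Bar lona" example):

```python
# Sketch of two ground truthing rules (hypothetical helper names).
# transcribe(): visible characters are kept; a run of invisible
# (occluded or out-of-frame) characters becomes a single space.
# language_label(): "Latin" iff every character is an ASCII letter/digit.

def transcribe(chars, visible):
    """Build a transcription from characters and a per-character visibility list."""
    out = []
    for c, v in zip(chars, visible):
        if v:
            out.append(c)
        elif not out or out[-1] != " ":
            out.append(" ")  # collapse an invisible run into one space
    return "".join(out)

def language_label(text):
    """Assign the language label of a basic unit under the stated policy."""
    return "Latin" if all(c.isascii() and c.isalnum() for c in text) else "Japanese"

# "Barcelona" with "ce" (indices 3 and 4) occluded, as in the example above.
word = "Barcelona"
vis = [i not in (3, 4) for i in range(len(word))]
print(transcribe(word, vis))      # Bar lona
print(language_label("Barcelona"))  # Latin
print(language_label("大阪OSAKA"))   # Japanese
```

The `isascii()` check matters: Python's `isalnum()` is true for Kanji, so an ASCII test is needed to separate the two labels.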
5 Comparison of Datasets
The difficulty of major datasets was compared using two detectors and one end-to-end recognition method. To reduce the computational burden, for some datasets a subset of the data was randomly sampled and used in the experiment. The datasets compared and how they were processed are described below.
1. ICDAR2003 [4]
All (258) images in the training set were used in the experiment.
2. ICDAR2013 (Challenge 2) [5]
All (229) images in the training set were used.
3. ICDAR2015 (Challenge 3) [24]
Images were sampled once every 30 frames in 10 out of 24 training videos. As a result, 207 images were selected.
4. ICDAR2015 (Challenge 4) [24]
All (1,000) images in the training set of the "End to End" task (Task 4.4) of ICDAR 2015 Robust Reading Competition Challenge 4 were used.
5. SVT [3]
All (350) images in both the training and test sets were used.
6. YVT [25]
Images were sampled once every 30 frames in all (30) videos. As a result, 420 images were selected.
7. COCO-Text [29]
300 images were randomly sampled from those containing words annotated as English, legible and machine printed (hereafter, target words). The 300 images contained 2,403 target words as well as words that do not satisfy the target-word condition (non-target words). The non-target words were treated as "don't care" regions.
8. DOST (this paper)
Images were sampled once every 30 frames in all ground truthed sequences. As a result, 1,075 images were selected.
9. DOST Latin (this paper)
This evaluates DOST dataset as a Latin scene text dataset containing only alphabets and digits. In text detection and recognition, the same images as in "DOST" above were used. In evaluation, words containing characters other than alphabets and digits were treated as "don't care" regions; thus, even if Japanese texts are detected, they do not affect the result.
Two detection methods were used for evaluation. One was the scene text detection method contained in OpenCV 3.0, which is based on Neumann et al. [31]. The other was the method of Matsuda et al. [32]; we were privately given the source code by courtesy of the authors. In addition, the Google Vision API was used as a powerful commercial end-to-end recognition system. The API lets us designate the language of the texts. Only for "DOST" did we designate Japanese; in this mode, English texts can also be detected and recognized, though the accuracies are expected to be lower. For the other datasets, including "DOST Latin," we designated English.
For performance evaluation, regardless of the dataset, we shared the same evaluation criteria. For both the text detection and end-to-end word recognition tasks, we followed the evaluation criteria used in the "incidental scene text" challenge (Challenge 4) of the ICDAR 2015 Robust Reading Competition. That is, for the scene text detection task, based on a single Intersection-over-Union (IoU) criterion with a threshold of 50%, a detected bounding box was regarded as correct if it overlapped a ground truth bounding box by more than 50%. Recall and precision were calculated by the following equations:

$$\mathrm{Recall} = \frac{\#\,\text{correctly detected boxes}}{\#\,\text{ground truth boxes}}, \qquad \mathrm{Precision} = \frac{\#\,\text{correctly detected boxes}}{\#\,\text{detected boxes}}.$$
F-measure was then calculated as the harmonic mean of precision and recall. For the end-to-end word recognition task, a detected bounding box was regarded as correct if it satisfied the condition of the scene text detection task and the estimated transcription was exactly correct. Recall, precision and F-measure were calculated in the same way as in the detection task.
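The detection protocol above can be sketched as follows. This is a simplified illustration under our own assumptions: boxes are axis-aligned rectangles rather than the dataset's four-point quadrilaterals, each ground truth box may be matched at most once, and detections falling on "don't care" regions are neither rewarded nor penalized.

```python
# Minimal sketch of the IoU-based detection evaluation described above
# (simplified: axis-aligned boxes, greedy one-to-one matching).

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def evaluate(detections, ground_truths, dont_care=()):
    """Return (recall, precision, f_measure) under the IoU > 0.5 criterion."""
    matched_gt = set()
    tp = ignored = 0
    for d in detections:
        if any(iou(d, r) > 0.5 for r in dont_care):
            ignored += 1  # hits on "don't care" regions are not penalized
            continue
        for i, g in enumerate(ground_truths):
            if i not in matched_gt and iou(d, g) > 0.5:
                matched_gt.add(i)
                tp += 1
                break
    recall = tp / len(ground_truths) if ground_truths else 0.0
    considered = len(detections) - ignored
    precision = tp / considered if considered else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f

gts = [(0, 0, 10, 10), (20, 0, 30, 10)]
dets = [(1, 0, 10, 10), (50, 50, 60, 60)]  # one good match, one false alarm
print(evaluate(dets, gts))  # (0.5, 0.5, 0.5)
```

For end-to-end evaluation, the same matching would additionally require the transcription of a matched detection to equal the ground truth transcription exactly.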
Results are summarized in Table 5. As can be seen, the results on "DOST" and "DOST Latin" were far worse than on the other datasets. This indicates that DOST dataset, which reflects the real environment, is more challenging than the major benchmark datasets.
6 Conclusion
Although many publicly available scene text datasets already exist, none of them was intentionally constructed to reflect the real environment. Hence, even though scene text detection/recognition methods achieve high accuracies on these existing major benchmark datasets, it has not been possible to evaluate how good they are for practical use. To address this problem, we presented a new scene text dataset named the Downtown Osaka Scene Text Dataset (DOST dataset for short). Unlike most existing datasets, which consist of intentionally captured scene images, DOST dataset consists of uncontrolled scene images; the omnidirectional camera enabled us to capture videos (sequential images) of the whole scene surrounding the camera. Because the dataset preserves the real scenes containing text as they were, its texts are scene texts in the wild. Through the evaluation conducted in this paper to assess its difficulty and quality in comparison with existing datasets, we demonstrated that DOST dataset is more challenging than the major benchmark datasets.
References
Shi, B., Wang, X., Lyu, P., Yao, C., Bai, X.: Robust scene text recognition with automatic rectification. In: Proceedings of CVPR, pp. 4168–4176 (2016)
Mishra, A., Alahari, K., Jawahar, C.V.: Scene text recognition using higher order language priors. In: Proceedings of BMVC (2012)
Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: Proceedings of ICCV, pp. 1457–1464 (2011)
Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R., Ashida, K., Nagai, H., Okamoto, M., Yamamoto, H., Miyao, H., Zhu, J., Ou, W., Wolf, C., Jolion, J.M., Todoran, L., Worring, M., Lin, X.: ICDAR 2003 robust reading competitions: Entries, results and future directions. IJDAR 7(2–3), 105–122 (2005)
Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Gomez i Bigorda, L., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., de las Heras, L.P.: ICDAR 2013 robust reading competition. In: Proceedings of ICDAR, pp. 1115–1124 (2013)
Wang, T., Wu, D.J., Coates, A., Ng, A.Y.: End-to-end text recognition with convolutional neural networks. In: Proceedings of ICPR, pp. 3304–3308 (2012)
Novikova, T., Barinova, O., Kohli, P., Lempitsky, V.: Large-lexicon attribute-consistent text recognition in natural images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part VI. LNCS, vol. 7577, pp. 752–765. Springer, Heidelberg (2012)
Goel, V., Mishra, A., Alahari, K., Jawahar, C.V.: Whole is greater than sum of parts: recognizing scene text words. In: Proceedings of ICDAR, pp. 398–402 (2013)
Bissacco, A., Cummins, M., Netzer, Y., Neven, H.: Photoocr: reading text in uncontrolled conditions. In: Proceedings of ICCV, pp. 785–792 (2013)
Alsharif, O., Pineau, J.: End-to-end text recognition with hybrid HMM maxout models. In: International Conference on Learning Representations (ICLR) (2014)
Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Word spotting and recognition with embedded attributes. IEEE TPAMI 36(12), 2552–2566 (2014)
Yao, C., Bai, X., Shi, B., Liu, W.: Strokelets: a learned multi-scale representation for scene text recognition. In: Proceedings of CVPR (2014)
Jaderberg, M., Vedaldi, A., Zisserman, A.: Deep features for text spotting. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part IV. LNCS, vol. 8692, pp. 512–528. Springer, Heidelberg (2014)
Su, B., Lu, S.: Accurate scene text recognition based on recurrent neural network. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9003, pp. 35–48. Springer, Heidelberg (2015)
Rodriguez, J.A., Gordo, A., Perronnin, F.: Label embedding: a frugal baseline for text recognition. IJCV 113(3), 193–207 (2015)
Gordo, A.: Supervised mid-level features for word image representation. In: Proceedings of CVPR (2015)
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. IJCV 116(1), 1–20 (2016)
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Deep structured output learning for unconstrained text recognition. In: Proceedings of ICLR (2015)
Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. CoRR abs/1507.05717 (2015)
Poznanski, A., Wolf, L.: CNN-N-gram for handwriting word recognition. In: Proceedings of CVPR (2016)
Liang, J., Doermann, D., Li, H.: Camera-based analysis of text and documents: a survey. IJDAR 7(2), 83–104 (2005)
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. In: Proceedings of NIPS Deep Learning Workshop (2014)
Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: Proceedings of CVPR (2016)
Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., Shafait, F., Uchida, S., Valveny, E.: ICDAR 2015 robust reading competition. In: Proceedings of ICDAR, pp. 1156–1160 (2015)
Nguyen, P.X., Wang, K., Belongie, S.: Video text detection and recognition: dataset and benchmark. In: Proceedings of WACV (2014)
Nagy, R., Dicker, A., Meyer-Wegener, K.: NEOCR: a configurable dataset for natural image text recognition. In: Iwamura, M., Shafait, F. (eds.) CBDAR 2011. LNCS, vol. 7139, pp. 150–163. Springer, Heidelberg (2012)
Jung, J., Lee, S., Cho, M.S., Kim, J.H.: Touch TT: scene text extractor using touchscreen interface. ETRI J. 33(1), 78–88 (2011)
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: Proceedings of NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
Veit, A., Matera, T., Neumann, L., Matas, J., Belongie, S.: COCO-Text: dataset and benchmark for text detection and recognition in natural images. CoRR abs/1601.07140 (2016)
Yuen, J., Russell, B., Liu, C., Torralba, A.: LabelMe video: building a video database with human annotations. In: Proceedings of ICCV, pp. 1451–1458 (2009)
Neumann, L., Matas, J.: Real-time scene text localization and recognition. In: Proceedings of CVPR, pp. 3538–3545 (2012)
Matsuda, Y., Omachi, S., Aso, H.: String detection from scene images by binarization and edge detection. Trans. IEICE J93(3), 336–344 (2010). In Japanese
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by JST CREST and JSPS KAKENHI #25240028.
© 2016 Springer International Publishing Switzerland
Iwamura, M., Matsuda, T., Morimoto, N., Sato, H., Ikeda, Y., Kise, K. (2016). Downtown Osaka Scene Text Dataset. In: Hua, G., Jégou, H. (eds) Computer Vision – ECCV 2016 Workshops. ECCV 2016. Lecture Notes in Computer Science(), vol 9913. Springer, Cham. https://doi.org/10.1007/978-3-319-46604-0_32
Print ISBN: 978-3-319-46603-3
Online ISBN: 978-3-319-46604-0