1 Introduction

Scene images serve different applications and goals depending on the situation. Detecting text in them therefore yields highly useful and efficient information in fields such as industrial automation [6], robot navigation [26], intelligent transportation systems (ITSs) [40], search and translation [8, 31], optical character recognition (OCR) [41], and other computer vision applications. ITSs need effective features for performing their algorithms. Automatic detection of traffic signs is a critical component of advanced driver assistance systems (ADASs), and it will be an integral component of future vehicles [9]. Designing such an intelligent system can therefore dramatically assist drivers and significantly reduce the rate of traffic accidents. Moreover, with the increasing demand for smart vehicles, the automatic and online detection and recognition of traffic signs is vital, a task that computer vision can facilitate.

Scene text detection is the task of locating text regions against complex backgrounds. Scene text appears in various fonts and sizes, often against noisy urban backgrounds, which makes scientific investigation difficult [20]. Text-based traffic signs are one of the most important and widely used subsets of scene images in ITSs. The extensive potential applications of this field of computer vision entail multiple challenges, which can be broadly classified into background complexity, scene-text diversity, and interference factors (noise) [41]. Noise is one of the challenges affecting scene-text detection: since the images are taken in natural environments, factors such as light intensity, text angle, color, dust, and tree branches can degrade image quality.

In this regard, scene-text datasets with a wide variety of instances help researchers generalize their vision-based algorithms [30]. Given the growth of Internet communication and the prevalence of multilingual countries and of organizational and academic documents, the authors focused on preparing a bilingual Persian/Arabic and English dataset with sufficient sample quantity and diversity for recognizing text in images [17]. Our contributions in this paper are the following:

  • Proposing a complete, large-scale Persian-English dataset named PESTD at the word level. PESTD includes 5832 instances comprising letters, digits, and symbols in three categories: Persian, English, and Persian-English.

Note that Arabic and Persian numerals are essentially identical, and the Arabic letters match 28 of the 32 Persian letters. In addition, "Farsi" is an alternative name for "Persian". PESTD can therefore be used in all countries whose official language is Farsi, Arabic, or Urdu. Moreover, because the dataset is bilingual, its English samples can be used in many other countries.

  • To prepare PESTD instances, text detection was performed on traffic signs in Iran. The CRAFT feature extraction algorithm was combined with YOLO and the Tesseract engine as an effective step toward recognizing cursive and multilingual scripts despite their specific challenges. The end-to-end structure makes this architecture usable in other applications and research; for example, the detection model of the proposed pipeline can be used to detect text in other settings, such as manuscripts and typed texts with different fonts.

  • The proposed dataset covers six general categories of challenges: weather conditions, lighting conditions, distance, background (the surrounding environment, such as trees, streets, cars, and buildings, as well as the board structure), color, and view angle. This diversity allows the performance of the proposed method to be evaluated comprehensively.

  • The accuracy and F1-score (evaluation criteria for YOLOv5) on PESTD reach 95.3% and 92.3%, respectively.

The rest of this paper is organized as follows: Section 2 reviews research related to scene-text detection and the associated datasets. Section 3 discusses the method used to prepare the proposed dataset and characterizes it. Section 4 examines the dataset's efficiency, and Section 5 evaluates the introduced dataset against other datasets. Finally, conclusions and future research suggestions are presented in Section 6.

2 Related work

As mentioned before, the various applications of text detection and recognition have made them active topics in computer vision. In general, text datasets are divided into three categories: handwritten, printed, and scene-text images. Each category splits into real and synthetic datasets depending on the collection method. Real datasets are created by scanning documents (e.g., newspapers and journals) or by capturing scene images. Synthetic datasets, in contrast, are constructed from existing texts: an image of each character or word is generated by randomly applying various fonts and, sometimes, various backgrounds, so that many more images can be produced from the existing material.

This article focuses on preparing a bilingual Persian-English dataset of scene-text images, so this section reviews only scene-text datasets. Some of the most well-known English synthetic datasets are SynthText [10], Synth90k [12], and VerisimilarSynthesis [39]. For Arabic, whose script closely resembles Persian (Farsi), two datasets, ACTIV [38] and ALIF [37], can be mentioned; they were extracted from video frames of Arabic TV channels and contain 21,520 and 6532 text images, respectively. The different types of datasets are introduced below:

  • Real datasets: Real scene-text datasets are divided into three categories: regular, irregular, and multilingual.

  • Regular datasets: The most famous regular English datasets are ICDAR 2003 (IC03) [16], ICDAR 2013 (IC13) [13], IIIT 5K-Word (IIIT5k) [18], and Street View Text (SVT) [34]. Respectively, these provide a test set of 251 scene images with labeled text bounding boxes, 1015 ground-truth cropped word images, 3000 cropped word test images collected from the Internet, and 249 street-view images collected from Google Street View. For Arabic, this category contains two datasets, ARASTI [29] and ARASTEC [28], which together include 1687 images, 1280 isolated Arabic words, 2093 isolated Arabic letters, and 60 scene-text images. Even a brief comparison shows how little variety this category offers for Arabic and Persian (Farsi).

  • Irregular datasets: In irregular datasets, most text samples have low resolution and varied fonts, and the text is curved rather than horizontal, making this category more challenging than the others. As an irregular English dataset, COCO-Text [32] contains no-text, legible-text, and illegible-text images; in total there are 22,184 training images and 7026 validation images with at least one legible text sample. The ICDAR 2015 dataset (IC15) [14] contains 1500 images, 1000 for training and 500 for testing. The StreetViewText-Perspective (SVT-P) dataset [21] contains 238 images with 639 cropped text instances.

  • Multilingual datasets: The frequency of bilingual and multilingual texts relates directly to urban development. Ahmad et al. [1] introduced a bilingual English-Arabic dataset. Table 1 compares these datasets. Note that the number of multilingual scene-text datasets is limited; in particular, there is no rich Persian-English multilingual dataset, whether of printed documents or scene-text images. This research therefore attempts to tackle the problem for the first time by preparing a bilingual Persian-English dataset.

Table 1 Description of the scene-text benchmark datasets

3 Proposed dataset

3.1 Description

Traffic signs are generally divided into three categories. The first comprises circular regulatory signs with a red border, in which a special symbol indicates the type of prohibition. Within this group, the stop sign is an octagon with a completely red background and the word STOP written on it to ensure drivers notice and obey it. Because of the importance of the yield (right-of-way) sign, it is the only sign whose apex points downward. The second category comprises warning signs, mainly triangles with red border stripes, a white background, and one vertex pointing upward, inside which the type of danger is indicated by special black markings. The third category comprises guide signs containing general advice; these are designed in white, green, brown, yellow, and blue, and in triangular, circular, rectangular, and square shapes. Figure 1 depicts the three types of traffic signs.

Fig. 1 Examples of three different types of traffic signs: (a) circular regulatory signs, (b) warning signs, (c) guide signs

In addition to the above categories, there is another type of signage: path guide signs for urban route guidance (Fig. 2). They contain much more text than the three categories introduced so far. These signs are generally rectangular; only at exits may they be designed as flags, with a sharp arrow-like ending indicating a specific direction. Path guide signs convey important guidance and destination information, including transit conditions, facilities, and route access, to drivers and pedestrians, and in some cases they also include regulatory orders.

The lettering of these signs should convey the message to all drivers quickly and efficiently, so the font design must consider readability as well as an appropriate size, and a single standard font per language should be used on all signs. Currently, Gem is used for Persian texts and Homa for English texts on urban route signs; in some cases the Abrisham font is also used. The text size is a function of the time required to read the text, which in turn depends on the speed of the vehicle approaching the sign; determining the text size is therefore especially important for route signs. The text size, measured by the height of the mosaic of Persian letters, depends on several parameters, such as the number of words, the speed of movement, and the distance of the board from the axis of view. The size of direction signs is determined by the mosaic height of the Persian letters, the volume of the text, and the arrangement of the other elements, and the letter height is set according to the design speed. Symbols are used to make the intended message more expressive and to accelerate comprehension of the text. Route signs should be repeated at specific intervals before the target location, depending on the conditions and the type of road, and finally installed at the entrance to the access road. Except in exceptional cases, signs should be installed on the right side of the road.

Fig. 2 Examples of path confirmation signs

3.2 Proposed method for data collection

Many officially monolingual countries also bear traces of other languages. The official language of Iran is Persian (Farsi), but in many settings, such as universities, organizations, scene images, and websites, English and Arabic are mixed with Persian. Accordingly, a complete and exhaustive dataset is the first requirement for recognizing multilingual text in images. Since text recognition is preceded by text detection, accurately detecting the text area is essential both for improving recognition accuracy and for reducing the computational load, because only the text area, rather than the whole image, is processed. The authors therefore prepared the dataset with the detection and recognition phases treated separately. In this research, the Character Region Awareness for Text detection (CRAFT) model was used to produce a Persian-English scene-text dataset. The backbone of CRAFT's feature extraction architecture is VGG-16, and the model predicts two scores: a region score that localizes individual characters [2] and an affinity score that indicates how characters combine. With this approach, the text in an image is first processed at the character level, and the characters are then merged into words according to the affinity score (Fig. 3); a conceptual sketch of this grouping step is given after the figure.

Fig. 3 The scene-text dataset creation flow
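
For intuition, the following is a minimal conceptual sketch (not the official CRAFT post-processing, which adds watershed-style refinement) of how word boxes can be derived from the two predicted score maps: threshold the region and affinity maps, then group connected components, since characters joined by affinity pixels fall into one component.

```python
# Conceptual sketch of CRAFT-style word grouping from score maps.
# Simplified relative to the official clovaai/CRAFT-pytorch post-processing.
import cv2
import numpy as np

def words_from_scores(region: np.ndarray, affinity: np.ndarray,
                      text_thr: float = 0.7, link_thr: float = 0.4):
    """region/affinity: HxW float maps in [0, 1] predicted by the network."""
    text_mask = (region >= text_thr).astype(np.uint8)
    link_mask = (affinity >= link_thr).astype(np.uint8)
    # Characters linked by affinity pixels merge into one connected component.
    combined = np.clip(text_mask + link_mask, 0, 1)
    n_labels, labels = cv2.connectedComponents(combined)
    boxes = []
    for k in range(1, n_labels):  # label 0 is background
        ys, xs = np.where(labels == k)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return boxes  # axis-aligned (x1, y1, x2, y2) word boxes
```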

By applying CRAFT to each sample, the text area is detected and extracted from the base image (Fig. 4). Here, CRAFT was applied to samples posing various challenges. The results show that it detects text areas reliably under different conditions, such as distance from the traffic sign (far or near) (Fig. 4b), sign height above the ground, background color, sign shape, amount of ambient light, and location. In more complex situations with multiple texts per sample, such as license plates together with traffic signs (Fig. 4a and c), the system also carries out detection effectively. Although the text on the signs is neat and within a limited range of fonts and sizes, the CRAFT model was also able to detect handwritten text in the image, in addition to the sign text, with desirable accuracy.
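
As an illustration of this cropping step, here is a minimal sketch assuming the third-party craft-text-detector PyPI wrapper around the official clovaai/CRAFT-pytorch code; the package API and return keys are assumptions and may differ by version.

```python
# A hedged sketch of word-level cropping with CRAFT, assuming the
# craft-text-detector package (pip install craft-text-detector).
import cv2
from craft_text_detector import Craft

craft = Craft(output_dir=None, crop_type="box", cuda=False)

image = cv2.imread("traffic_sign.jpg")            # placeholder input path
result = craft.detect_text("traffic_sign.jpg")    # region + affinity inference

# Each detected word region becomes one candidate dataset instance.
for i, box in enumerate(result["boxes"]):         # boxes are 4-point polygons
    xs, ys = box[:, 0], box[:, 1]
    crop = image[int(ys.min()):int(ys.max()), int(xs.min()):int(xs.max())]
    cv2.imwrite(f"word_{i:04d}.png", crop)

craft.unload_craftnet_model()  # release model weights when finished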

Fig. 4 Results of the CRAFT text detection model: (a) license plate detection, (b) text detection with the scale challenge, (c) handwritten text detection with the slant challenge

The performance of the CRAFT model on the dataset is evaluated with precision and recall, computed as follows:

$$ \mathrm{precision}=\frac{\sum_{i=1}^{|D|}\mathrm{match}_D\left(D_i\right)}{|D|} $$
(1)
$$ \mathrm{recall}=\frac{\sum_{i=1}^{|G|}\mathrm{match}_G\left(G_i\right)}{|G|} $$
(2)

where D is the list of detected rectangles and G is the list of ground-truth rectangles. Various matching functions (match_G and match_D) between ground-truth and detected rectangles exist, such as one-to-one, one-to-many, and many-to-one [35]. Under the one-to-many matching function, the precision and recall obtained were 0.9705 and 0.9822, respectively.

The prepared dataset is the first Persian-English scene-text image dataset built from text-based traffic signs in Tehran. It can help address a significant research gap caused by the lack of sufficiently exhaustive text datasets in Persian, Arabic, Urdu, and similar languages. Because this dataset also underlies the recognition of Persian texts in another study of ours, and to promote research on Persian/Arabic, it will be made publicly available to other researchers [Link]. Figure 5 depicts some instances in the three categories: Persian, Persian-English, and English.
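
To make Eqs. (1) and (2) concrete, the sketch below evaluates them under a simplified rule in which a rectangle counts as matched if it overlaps some counterpart with sufficient IoU; the full one-to-many scheme of [35] uses more elaborate match functions.

```python
# A minimal sketch (not the authors' evaluation code) of detection
# precision/recall with a simplified IoU-based match function.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned rectangles."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def precision_recall(detected: List[Box], gt: List[Box], thr: float = 0.5):
    """Eqs. (1)-(2) with match(.) = 1 iff the box finds a counterpart."""
    match_d = sum(any(iou(d, g) >= thr for g in gt) for d in detected)
    match_g = sum(any(iou(g, d) >= thr for d in detected) for g in gt)
    precision = match_d / len(detected) if detected else 0.0
    recall = match_g / len(gt) if gt else 0.0
    return precision, recall
```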

Fig. 5 PESTD instances in three categories: (a) Persian, (b) English, (c) Persian-English (bilingual)

3.3 Dataset specifications

The basic dataset used in this study is a text-based traffic sign dataset named "The Persian Text-Based Traffic Signs Dataset", which has 2643 instances containing Persian-English text [15]. The prepared dataset detects and extracts the Persian and English texts from this basic dataset. The Persian texts in our dataset contain all 32 letters of the Persian alphabet; however, Persian (and Arabic) words are purely cursive, so a letter takes different shapes depending on its position (beginning, middle, or end) in the word. Accordingly, and considering the other cases shown in Table 2, there are 122 different written shapes for the 32 Persian letters according to the Deutsches Institut für Normung (DIN) and International Phonetic Association (IPA) standards [11]. The prepared dataset includes 5832 Persian and English words and numbers; the specifications of the proposed dataset (PESTD) are given in Table 3.

Table 2 The Persian alphabet and its written shapes in the study dataset
Table 3 Category details of the Persian/English dataset (PESTD) samples

4 Experimental results

As a step toward recognizing the introduced dataset, the single-stage deep learning technique YOLO [25] version 3, combined with the Tesseract engine as introduced in [23], has been used. In addition, the improved versions YOLOv4 [3] and YOLOv5 [36] have been used; their improvement lies chiefly in a higher mean Average Precision (mAP) [33]. In YOLOv3, Darknet-53 serves as the convolutional neural network (CNN) backbone for the object detection step, while YOLOv4 and YOLOv5 use CSPDarknet53 as the backbone [27].
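
The following is a minimal sketch, not the authors' exact pipeline, of such a two-stage detect-then-recognize setup: a YOLOv5 model (assumed fine-tuned on PESTD and saved as pestd_best.pt, a placeholder path) localizes words, and Tesseract with the Persian ("fas") and English ("eng") language packs transcribes each crop. The confidence threshold is illustrative.

```python
# Hedged sketch of YOLOv5 word detection + Tesseract recognition.
import cv2
import torch
import pytesseract

# Load custom weights via the public ultralytics/yolov5 hub entry point.
model = torch.hub.load("ultralytics/yolov5", "custom", path="pestd_best.pt")
model.conf = 0.4  # confidence threshold (illustrative value)

image = cv2.imread("sign.jpg")
detections = model(image[..., ::-1])  # hub models expect RGB, cv2 gives BGR

for *xyxy, conf, cls in detections.xyxy[0].tolist():
    x1, y1, x2, y2 = map(int, xyxy)
    crop = cv2.cvtColor(image[y1:y2, x1:x2], cv2.COLOR_BGR2RGB)
    # --psm 7 treats the crop as a single text line.
    text = pytesseract.image_to_string(crop, lang="fas+eng", config="--psm 7")
    print(int(cls), round(conf, 3), text.strip())
```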

Table 4 compares YOLOv3, YOLOv4, and YOLOv5 on PESTD (which covers 199 different forms of letters, numbers, and symbols in Persian and English). Accuracy and F1-score were used as the criteria for comparing the three versions in the detection step of text recognition with the Tesseract engine. Accuracy is the ratio of the number of correct predictions to the total number of predictions, whereas the F1-score averages precision and recall (see Eqs. (3)-(6), defined in terms of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN)) [22]. Both criteria lie in the range zero to one: for the F1-score, a value of 1 indicates perfect precision and recall, while 0 indicates that either precision or recall is zero. In addition, the processing time in seconds on a PC was considered to evaluate the inference speed of the algorithms. The results show that YOLOv3 is less accurate than the other two versions because of its different backbone, and YOLOv5 outperforms YOLOv4 owing to its auto-learning anchor boxes [19] (Table 4). By comparison, the accuracy of the method in our previous study, using YOLOv3 on isolated Iranian license plates (27 different forms of letters and numbers in Persian), was almost 99%.

Table 4 Comparison of different detection methods for text recognition on PESTD
$$ F1\text{-}score=2\times \frac{p\times r}{p+r} $$
(3)

where p and r are defined by Eqs. (4) and (5).

$$ p=\frac{TP}{TP+FP} $$
(4)
$$ r=\frac{TP}{TP+FN} $$
(5)
$$ Accuracy=\frac{TP+TN}{TP+TN+FP+FN} $$
(6)
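
As a quick sanity check, all four criteria can be computed directly from the confusion-matrix counts; the sketch below implements Eqs. (3)-(6) verbatim (the example counts are illustrative only, not PESTD results).

```python
# Minimal helper implementing Eqs. (3)-(6) from accumulated counts.
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    p = tp / (tp + fp) if (tp + fp) else 0.0        # Eq. (4): precision
    r = tp / (tp + fn) if (tp + fn) else 0.0        # Eq. (5): recall
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0    # Eq. (3): F1-score
    acc = (tp + tn) / (tp + tn + fp + fn)           # Eq. (6): accuracy
    return {"precision": p, "recall": r, "f1": f1, "accuracy": acc}

# Illustrative counts only:
print(classification_metrics(tp=90, tn=5, fp=8, fn=7))
```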

A personal computer with the following specifications was used to implement the proposed algorithm:

  • CPU: 8th Gen. Intel® Core™ i7 processor, 1.80 GHz.

  • GPU: NVIDIA® GeForce® RTX 2070 SUPER ™ Turing ™ architecture with 8 GB GDDR6.

  • RAM: 32 GB DDR4.

  • Storage: 1 TB NVMe SSD.

  • Operating System: UBUNTU 18.04.6.

5 Discussion

Since the dataset prepared in this paper was extracted from scene-text images of Persian-English traffic signs, its inherent features may challenge its users, especially for Persian. Some challenges of the Persian scene-text dataset, together with its limitations, are discussed below.

5.1 Challenges of Persian scene text dataset and its limitations

  • Persian texts, unlike English texts, are written from right to left.

  • The letters of Persian words are usually written cursively (joined) and only in some cases separately, whereas English letters are always written separately; studies on Persian are therefore comparatively more complicated than on English (see the shaping sketch after this list). According to the different shapes of the Persian alphabet in Table 2, this condition applies to 100% of the Persian alphabet.

  • Some letters share the same base shape and can be distinguished only by the absence or presence of a dot ("ر" vs. "ز"), the number of dots ("ت" vs. "ث"), or the position of the dots above or below the letter ("پ" vs. "ث"). Furthermore, the two Persian letters "ک" and "گ" differ only in an extra stroke. Only the letters "لام", "میم", "واو", and "ه" are dissimilar to every other letter in all of their positional shapes (beginning, middle, end, or isolated); thus about 87.87% of the letters differ only slightly from another letter in at least one shape.

  • Most Persian letters (especially in cursive joins) are jagged; when the image is noisy, the jagged contour of a character may blend with the baseline, complicating recognition. This challenge depends strongly on the font, so it cannot be analyzed in detail.

  • The letters in a word may overlap, meaning that no vertical line can completely separate them. Like the previous challenge, this depends on the font.
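
The following sketch illustrates the cursive-joining behavior discussed above, using the third-party arabic-reshaper and python-bidi packages (assumed installed via pip; they are demonstration aids, not part of the PESTD pipeline).

```python
# Illustration of Persian contextual letter shapes.
# pip install arabic-reshaper python-bidi
import arabic_reshaper
from bidi.algorithm import get_display

word = "تهران"  # "Tehran", stored as position-independent code points

shaped = arabic_reshaper.reshape(word)  # substitute contextual glyph forms
visual = get_display(shaped)            # reorder for right-to-left display
print(visual)

# Each letter maps to a different presentation form depending on its position
# in the word; this positional variation is why 32 letters yield 122 shapes.
for logical, contextual in zip(word, shaped):
    print(f"U+{ord(logical):04X} -> U+{ord(contextual):04X}")
```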

5.2 Limitations

Since the motivation of this research is to provide a suitable dataset as a basis for detecting multilingual Persian (Farsi)/Arabic and English texts, these challenges indicate the difficulty of research in this field and demonstrate the value of work on it. To further evaluate the dataset, its font style and size and its number of samples were compared with those of other multilingual datasets (Table 5). The results show that the proposed dataset has the largest number of samples, allowing larger-scale detection with a richer variety of data. Note that the proposed dataset includes samples under different illuminations, angles, and sizes, but it does not cover all font types and sizes.

Table 5 Comparison of multilingual datasets

6 Conclusion and future work

In this study, we presented a bilingual Persian-English dataset (PESTD) based on images of traffic-sign scenes, comprising 5832 instances of letters, digits, and symbols. The Persian texts in the dataset contain all 32 letters of the Persian alphabet with 122 different written shapes according to the DIN and IPA standards. The instances are classified into Persian, Persian-English, and English categories. Given the similarity of Arabic, Persian, and Urdu numerals and letters, the dataset can serve as a suitable resource in all regions using these languages. In contrast with English, the letters in this dataset are often written cursively and only in some cases separately; some letters share a base shape and differ only in the position of their dot(s); and the jagged form of some letters and the overlap of letters within words pose further challenges. Because the instances were extracted from traffic signs with real-world challenges, the proposed dataset covers six general challenge categories: weather conditions, lighting conditions, distance, background, color, and view angle.

As a step toward recognizing the introduced dataset, the single-stage deep learning technique YOLOv3 combined with the Tesseract engine was used to recognize cursive and multilingual scripts. The CRAFT model, based on deep learning techniques, was used to prepare the dataset, achieving 0.9705 precision and 0.9822 recall in scene-text detection. In addition, the YOLOv4 and YOLOv5 algorithms were compared with YOLOv3. The accuracy and F1-score (the evaluation criteria) of YOLOv5 on PESTD reached 95.3% and 92.3%, respectively. The experiments show that the accuracy of YOLOv5 is 1.2% and 3.3% higher than that of YOLOv4 and YOLOv3, respectively, and its F1-score is 1.2% and 5.0% higher; its computation time is also 1.9 s and 7.2 s faster. As future work, the authors plan to expand the dataset and add a large variety of traffic symbols; it would also be valuable to introduce new methods for recognizing scene images with higher accuracy.