
1 Introduction

Text is one of humanity’s most essential sources of information and is widely utilized for communication. We commonly encounter text on ID cards, driving licenses, bank passbooks, scanned lecture notes, etc. Optical character recognition (OCR) [1] software is generally used to read text from such images, and these solutions are highly reliable and accurate. In the last decade, there has been a demand for a more difficult task: the real-time detection of text in natural scene images. Text in natural scene images, which may appear on posters, banners, billboards, street names, and sign poles, differs significantly from text in document images. The key distinctions are uneven lighting, texture, orientation, perspective distortion, and variation in font color and size [2]. With the tremendous rise of the smartphone market, the vast majority of individuals can now capture images of their surroundings. Text in natural scenes conveys high-level meaning directly, as a result of human ideas and creativity. Because of this semantic property, the text in natural scene images and videos is a unique and valuable source of information. As a result, text detection has a variety of real-life applications, including multimedia information retrieval systems, assistive devices for the visually impaired, self-driving cars, text translators, and toll gate car number plate detectors [3].

The official language of the Indian state of Assam is Assamese. Assamese is the world’s 67th most-spoken language, with over 15 million native speakers. It is a major language of northeast India, a region comprising seven Indian states. Assamese derives its phonetic character set and its behavior from Sanskrit. There are 11 vowels, 41 consonants, 10 digits, and over 300 compound characters in the Assamese language. Since no major research or application has yet been developed for a real-time natural scene Assamese text detection, recognition, and translation system, our foremost motivation is to develop such a system and contribute our proposed model to a multilingual nation like India and to the rest of the world. Traditionally, text detection methods can be divided into two types: sliding window and connected component-based methods [2]. The sliding window method detects text by sliding a window across the entire image at various scales. Connected component-based methods recognize single characters and then arrange them into words or text-line regions. With the advent of CNNs and deep learning, new approaches with substantially greater accuracy than previous methods have been developed. Deep learning methods are currently widely used in general object recognition, pattern and object segmentation, and text detection in natural scene images. Our proposed model focuses solely on deep learning-based methods, which clearly outperform traditional methods.

Several factors can make natural scene text detection a challenging task, such as scene background patterns, characters from different languages, and variations in text position, size, and color. Traditional machine learning methods for scene text detection have, however, been superseded by deep learning methods, which are well suited to training models on large datasets and have become the dominant approach to scene text detection and recognition in recent years. The use of machine learning for textual region detection grew out of its development for object detection, and scene text detection frameworks are in turn built on object detection frameworks. Deep learning-based text detection is basically divided into three stages: preprocessing, feature extraction, and text detection. The fundamental idea of the basic network model is to use a CNN as the image’s feature extractor. Some of the existing basic networks are LeNet, AlexNet, VGGNet, GoogLeNet, ResNet, DenseNet, etc.

The rest of the paper is arranged as follows: Sect. 2 discusses related work in the field of natural scene text detection, Sect. 3 describes the method of our proposed detection system, Sect. 4 presents the experimental results obtained, and finally, Sect. 5 draws conclusions from our work and outlines future work.

2 Related Works

Text detection in natural scenes is a challenging task because the text in such images is affected by varying textures, noise, lighting conditions, font colors, orientations, etc. Matteo et al. [2] reviewed several text detection methods for scene images and presented the most recent state-of-the-art approaches to the challenging task of scene text detection; the accuracy and real-time performance of the approaches were compared, and the most popular scene text detection evaluation datasets were presented. The authors of [3] examined, compared, and contrasted the technical obstacles, methodologies, and performance of text detection and recognition research on color images. They also described the main issues and listed items to consider when dealing with scene detection problems. Mitra et al. [4] presented a novel scene text detection system using Fully Convolutional DenseNets. They trained an FC-DenseNet to perform semantic segmentation on photos before using it to recognize text; that is, they divided each image into three regions: text, background, and word-fence. Mayuri et al. [5] introduced a novel text detection method that improved detection accuracy while decreasing average processing time. Their text identification method used the eMSER method to retain character shape and a custom clustering algorithm to converge faster. Xinyu et al. [6] proposed a method for fast and accurate text detection in natural scenes in which a single neural network predicts words of various orientations and quadrilateral shapes in entire images, avoiding superfluous intermediate stages. The authors of [7] introduced a novel method for increasing text detection and identification performance by finding flaws in text detection results. Joseph et al. [8] presented an updated version of YOLO, i.e., YOLOv3, along with comparative and performance analyses. Huibai et al. [9] used an enhanced YOLOv3-based scene text detection technique. They found that the training duration of YOLOv3 with DarkNet for a single detection target was long because of too many layers; therefore, they experimented with replacing it with DarkNet19. In addition, the original network’s multi-scale detection was preserved, and three anchors of varying sizes were utilized for bounding box prediction. Sahil et al. [10] proposed a web-based application of Tesseract-OCR where a user can upload a document image and translate it with the help of the Google Translate API; a Python script and various modules were utilized to address issues in document-based text segmentation and translation. The authors of [11] proposed a novel open-source line recognizer that combines deep convolutional networks and LSTMs and uses CUDA to achieve better training performance in PyTorch. Mani et al. [12] proposed a model for translating English phrases into Hindi using ConceptNet for Statistical Machine Translation and Rule-Based Machine Translation in tandem. Abhash et al. [13] proposed a system that utilized a Deep Neural Network (DNN) to construct a text-to-speech system for the Assamese language; the system was trained on audio data provided through collaboration and made freely available for academic use.

3 Methodology

For real-time detection of text in natural scenes, we chose the YOLOv3-Tiny [14] and YOLOv5s [15] algorithms for bounding box prediction over textual regions. Compared to YOLOv2, YOLOv3 [8] added multi-label classification and multi-scale detection and employs the DarkNet53 deep neural network as a feature extractor, improving on the older versions of YOLO, which do not perform well when detecting small objects. As a result, YOLOv3 has emerged as one of the most effective object detection algorithms. The basic workflow of the YOLOv3 network is to receive a 2D image as input; the convolution layers extract and map the hidden features of the image using a sliding window, while the pooling layers downsample and select the important features, which drastically reduces computational cost during feature extraction. Convolution [19, 20] is used to extract visual feature information. Our proposed model based on the YOLO-DarkNet architecture has three stages: text detector, text recognizer, and neural machine translator, and the text detector stage achieves detection in real time. Several versions of YOLO are optimized and trained on Google Colab and on a laptop with a discrete GPU. Hyper-parameter tuning is performed to reduce the weight file size and improve accuracy. YOLOv3-Tiny achieves desirable detection performance, but in terms of precision, YOLOv5s outperforms all the above-mentioned algorithms. Figure 1 depicts the block diagram of our proposed work.

Fig. 1
figure 1

Block diagram of natural scene Assamese text detection and recognition system

YOLOv3 begins by scaling an input image of any aspect ratio to 416 × 416 pixels and then divides it into S × S equally sized cells; using a feature pyramid network, text detection is performed on three separate feature map scales: 13 × 13, 26 × 26, and 52 × 52. Two-times up-sampling is applied between adjacent scales to align the feature maps. A particular cell uses three anchor boxes to predict three bounding boxes. YOLOv3 consists of a total of 65 million parameters. Figure 2 depicts the flowchart of the YOLOv3 algorithm.

Fig. 2
figure 2

YOLOv3 algorithm flowchart [11]

The x and y coordinates, the text width w, and the text height h are predicted by the convolutional layers of YOLOv3 for each bounding box in each cell and are denoted \(t_{x}, t_{y}, t_{w}\), and \(t_{h}\), respectively. If a cell is offset from the top left corner of the image by \((c_{x}, c_{y})\) and the prior anchor box has width \(P_{w}\) and height \(P_{h}\), then the predicted bounding box is

$$b_{x}=\sigma (t_{x})+c_{x}$$
$$b_{y}=\sigma (t_{y})+c_{y}$$
$$b_{w}=P_{w}e^{t_{w}}$$
$$b_{h}=P_{h}e^{t_{h}}$$
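A minimal PyTorch sketch of this decoding step is given below (illustrative only; it assumes the raw regression outputs, cell offsets, and anchor dimensions are already available as tensors with matching shapes).

```python
import torch

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Convert raw YOLOv3 regression outputs into bounding-box
    centre coordinates and dimensions on the feature-map grid."""
    b_x = torch.sigmoid(t_x) + c_x   # centre x, offset by the cell's column index
    b_y = torch.sigmoid(t_y) + c_y   # centre y, offset by the cell's row index
    b_w = p_w * torch.exp(t_w)       # width, scaled from the anchor width
    b_h = p_h * torch.exp(t_h)       # height, scaled from the anchor height
    return b_x, b_y, b_w, b_h
```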

In the training process, the loss is obtained by calculating the sum of squared errors. The gradient for the training iterations or epochs is obtained by minimizing this loss function. Denoting the ground-truth coordinates as \(\hat{t}_{*}\), the gradient is the difference between the ground-truth and predicted coordinate values, \(\hat{t}_{*}-t_{*}\) [9].

Logistic regression is used to predict whether an object is present in a bounding box. The objectness score of an anchor box is 1 if its overlap with the ground-truth bounding box is the largest among all anchor boxes [9]. A prediction is disregarded if its overlap exceeds the given threshold but is not the maximum. YOLOv3 is designed to assign one anchor box to each object; an anchor box that is not assigned to any object contributes no coordinate or classification loss. During training, YOLOv3 uses binary cross-entropy loss and logistic regression to make category predictions, allowing it to perform multi-label classification of a target [9]. YOLOv3-tiny is a lighter version of YOLOv3; it has 13 layers in total, including 7 convolutional and 6 max-pooling layers. YOLOv5 is regarded as the next version of the YOLO family; it was released in 2020 by Ultralytics only a few days after YOLOv4 and was made open source. It is essentially a PyTorch implementation of YOLOv3, and because there is no official paper, the authenticity of its reported performance cannot be guaranteed. It achieves roughly the same prediction speed as YOLOv3 with better detection precision while using less computational power. Figure 3 depicts the YOLOv5 family’s performance chart.

Fig. 3
figure 3

Performance analysis of the YOLOv5 family versus EfficientDet [16] (YOLOv5s is indicated by the yellow curve)
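To illustrate the objectness and multi-label class predictions described above, the sketch below pairs sigmoid (logistic) outputs with binary cross-entropy, which is how YOLOv3-style heads are commonly implemented in PyTorch; the tensor names and shapes are illustrative assumptions, not the exact YOLO source.

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy in one call

def detection_losses(raw_obj, obj_target, raw_cls, cls_target):
    """raw_obj: (N,) objectness logits for anchors matched to ground truth.
    raw_cls: (N, num_classes) class logits; targets may have several 1s per row,
    which is what makes the classification multi-label."""
    obj_loss = bce(raw_obj, obj_target)  # logistic regression on "is there an object?"
    cls_loss = bce(raw_cls, cls_target)  # independent binary cross-entropy per class
    return obj_loss, cls_loss
```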

The steps involved in training our proposed natural scene text detector are as follows.

3.1 Datasets Collection

Images of Assamese text in natural scenes are captured using a 12-megapixel smartphone camera, which produces output images of 1600 × 900 pixels. Several font styles, colors, background textures, lighting conditions, angles, and positions are taken into consideration. A total of 1000 training images and 300 validation images containing over 6000 text instances are collected. Figure 4 shows a few dataset samples. The top left-most image, with the largest font, consists of Assamese text meaning “Goswami Milk Product,” the top right-most image with two white-colored words means “Hotel Lakhimi,” and so on.

Fig. 4
figure 4

Natural scene Assamese text dataset samples

3.2 Image Annotation

Each textual region of Assamese text in an image is labeled using the open-source LabelImg software. Both the labeled images and the ground truth are stored in a single folder. The ground truth is a .txt file generated after labeling, which consists of the class number and bounding box coordinates. Our dataset annotations are prepared for a single class in the COCO dataset format. Figure 5 depicts labeled images.

Fig. 5
figure 5

Labeled images of three words (left) and two words (right) with corresponding ground truth values
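For reference, a hedged sketch of reading one such ground-truth file is given below; it assumes the YOLO-style .txt layout produced by LabelImg, i.e., one line per box with a class index followed by normalized centre-x, centre-y, width, and height.

```python
from pathlib import Path

def read_labels(txt_path):
    """Parse a LabelImg/YOLO-style annotation file into (class_id, box) tuples.
    Each line: "<class> <x_center> <y_center> <width> <height>", with the box
    values normalized to [0, 1] relative to the image size."""
    boxes = []
    for line in Path(txt_path).read_text().splitlines():
        if not line.strip():
            continue
        cls, xc, yc, w, h = line.split()
        boxes.append((int(cls), float(xc), float(yc), float(w), float(h)))
    return boxes
```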

3.3 Hyper-Parameter Optimization

The optimizers used in our proposed model are as follows.

Stochastic gradient descent optimizer. This optimizer is used in our proposed model for optimizing the training iterations and reducing the loss function. The approach optimizes an objective function with suitable smoothness properties. It is a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (computed from the entire dataset) with an estimate of it (computed from a randomly selected subset of the data). This greatly reduces the processing cost, allowing for faster iterations in exchange for a lower convergence rate, especially in high-dimensional optimization problems.
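A minimal conceptual sketch of this idea is shown below (the loss-gradient function and dataset are hypothetical placeholders): each update uses a gradient estimated from a randomly sampled mini-batch rather than from the full training set.

```python
import random

def sgd_step(params, grad_fn, dataset, batch_size, lr):
    """One stochastic gradient descent update.
    grad_fn(params, batch) is assumed to return the gradient of the loss
    evaluated only on the sampled mini-batch."""
    batch = random.sample(dataset, batch_size)   # random subset instead of the full dataset
    grads = grad_fn(params, batch)               # gradient estimate from the mini-batch
    return [p - lr * g for p, g in zip(params, grads)]
```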

Adam optimizer. Adaptive Moment Estimation (Adam) is an adaptive approach for optimizing gradient descent. This optimizer is recommended for problems with a huge number of data points and parameters. It is computationally efficient and places little burden on memory. It is a hybrid of gradient descent with momentum and the RMSProp algorithm.

In the training phase of YOLOv3 on Google Colab, the momentum is set to 0.9, stochastic gradient descent is used for optimization, and the initial learning rate is set to 0.0001. Decay is set to 0.0005, which stabilizes the network in the first 1000 training iterations. Later, to avoid gradient disappearance, a step strategy is applied at the 1800th and 2200th training iterations, which lowers the learning rate. The maximum number of batches is set to the number of classes × 2000 = 1 × 2000 = 2000, i.e., training stops after 2000 iterations. With the implementation of transfer learning, we altered the last three layers of YOLOv3: the number of classes is set to 1 and the corresponding number of filters to 18. To speed up the process, the GPU acceleration, CUDNN, and OPENCV flags are set to 1. The hyper-parameter configuration for YOLOv3-tiny and YOLOv5s is the same as that of the YOLOv3 detection model. Moreover, during the training phase of YOLOv5s on the discrete RTX 3050 mobile GPU, the number of classes is set to 1 and the initial learning rate is set according to the optimizer (SGD = 1E-2, Adam = 1E-3) for 500 epochs.
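The configuration above can be expressed in PyTorch roughly as follows; this is a sketch under stated assumptions (a placeholder module stands in for the detector, and the step-decay factor of 0.1 is assumed rather than taken from the paper).

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder standing in for the YOLO network

# YOLOv3 phase: SGD with momentum 0.9, initial learning rate 1e-4, weight decay 5e-4
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=5e-4)

# Step strategy: lower the learning rate at iterations 1800 and 2200 (assumed factor 0.1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[1800, 2200], gamma=0.1)

# YOLOv5s phase: the initial learning rate depends on the chosen optimizer
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```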

3.4 Selection of Backbone Network

There are four types of YOLO algorithm backbone networks:

  1. DarkNet19

  2. DarkNet39

  3. DarkNet53

  4. CSPDarknet53.

We chose DarkNet53 and CSPDarknet53, convolutional neural networks that are 53 layers deep, for our detection models with YOLOv3, YOLOv3-tiny, and YOLOv5. This backbone network is combined with our custom YOLOv3 network to improve feature extraction. CSPDarknet53 splits the feature map of the base layer into two parts using a CSPNet strategy and then merges them through a cross-stage hierarchy. This split-and-merge strategy provides greater gradient flow through the network. YOLOv5's backbone network is CSPDarknet53, and its architecture is depicted in Fig. 6.

Fig. 6
figure 6

Overview of YOLOv5 algorithm with CSPDarknet53 [17]
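A simplified sketch of the split-and-merge idea behind a CSP block is shown below; it is illustrative only, as the real CSPDarknet53 blocks use deeper residual stacks of convolutions.

```python
import torch
import torch.nn as nn

class SimpleCSPBlock(nn.Module):
    """Toy CSP-style block: split the feature map channel-wise, transform one half,
    pass the other half through unchanged, then merge across the stage."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.transform = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(half),
            nn.SiLU(),
        )
        self.merge = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        part1, part2 = x.chunk(2, dim=1)          # split along the channel dimension
        part2 = self.transform(part2)             # process only one branch
        return self.merge(torch.cat([part1, part2], dim=1))  # cross-stage merge
```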

3.5 Training of YOLOv5s Using RTX 3050 Laptop Discrete GPU

YOLOv5 can be trained on both GPU and CPU. Training on GPU-enabled devices with the latest version of CUDA is recommended to reduce training time. If a device does not have a GPU card, a virtual machine with an Nvidia Tesla T4 on Google Colab can be used for training. Our proposed model with YOLOv5s is trained on an RTX 3050 with CUDA version 11.1, alongside a Ryzen 7 octa-core processor running at up to 4.2 GHz and 16 GB of 3200 MHz dual-channel RAM.
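Before training, it is worth confirming that PyTorch actually sees the GPU; a quick check from the notebook is sketched below.

```python
import torch

print(torch.cuda.is_available())   # True if a CUDA-capable GPU is usable
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
print(torch.version.cuda)          # CUDA version PyTorch was built against
```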

Steps that are followed during training on Jupyter Notebook are as follows.

Clone the YOLOv5 repository. First, we cloned the YOLOv5 repository from the official Ultralytics GitHub repository [16]. This repository contains YOLOv5s with a CSPDarknet YAML file and all the essential hyper-parameters.

Install prerequisite libraries. To enable the GPU for training, we installed the CUDA toolkit v11.1 using the conda install command. The YOLOv5 repository also includes a “requirements.txt” file that contains all of the libraries needed to train the model, such as OpenCV, PyYAML, torchvision, and WandB.

Create a YAML data path file. YAML (originally “Yet Another Markup Language”) is a data serialization language that is frequently used to specify dataset paths during training. We created dataset.yml using Notepad++ to point to the dataset directory paths and set the number of classes to 1.
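A hedged example of such a file, written from Python for convenience, is sketched below; the directory paths are placeholders for the actual local dataset layout.

```python
from pathlib import Path

# Placeholder paths; the real directories depend on the local dataset layout.
dataset_yml = """\
train: ../datasets/assamese_text/images/train
val: ../datasets/assamese_text/images/val

nc: 1              # a single class: Assamese text region
names: ['text']
"""
Path("dataset.yml").write_text(dataset_yml)
```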

Integrate WandB. We used the Weights & Biases (WandB) Python package, which helped us track our training performance in real time. It integrates easily with popular deep learning frameworks such as PyTorch, TensorFlow, and Keras.
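Enabling the integration typically only requires logging in once from the notebook, as sketched below (the dashboard details are an assumption about a default WandB setup rather than our exact configuration).

```python
import wandb

wandb.login()  # prompts for the WandB API key on first use
# With a logged-in session, YOLOv5's train.py can stream losses, mAP,
# and sample predictions to the WandB dashboard during training.
```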

Train the custom YOLOv5s model. After creating the YAML file, we ran the command to load our custom dataset and the pre-trained model’s weight file to train a new model. We set the image size to 288, the batch size to 16, the number of epochs to 500, and the number of workers to 2. Training begins shortly after the training command (the train.py script) is executed and takes time depending on the model hyper-parameters and hardware specifications.
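The training cell corresponds roughly to the following notebook command; the flag names follow the Ultralytics train.py interface, and the weights and data file names are examples.

```python
# Jupyter/Colab cell: launch YOLOv5s training on the custom Assamese text dataset
!python train.py --img 288 --batch 16 --epochs 500 --workers 2 \
    --data dataset.yml --weights yolov5s.pt
```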

4 Results and Discussion

Our proposed model based on YOLOv3 is trained on custom datasets of Assamese text with a single class on Google Colab, which took around 1.2 h of training for 2000 iterations and produced a custom weights file of 234 MB. To reduce the size of the weights file, we also trained our proposed model based on the YOLOv3-tiny and YOLOv5s algorithms on our custom datasets. YOLOv3-tiny took around 1.01 h of training time and produced a weights file of 33 MB. The loss curves of YOLOv3 and YOLOv3-tiny are shown in Fig. 7a and b, respectively.

Fig. 7
figure 7

a YOLOv3-DarkNet53 training loss variations (left). b YOLOv3-tiny-DarkNet53 training loss variations (right)

Training YOLOv5s on the discrete RTX 3050 GPU took around 5.4 h for 500 epochs and generated a weights file of 13.6 MB. The training loss curves of YOLOv5s, using SGD as the optimizer, show that the objectness loss and box loss decrease during training, which ultimately leads to high performance and precision. The training loss and precision plots are shown in Fig. 8.

Fig. 8
figure 8

YOLOv5s training loss variations and mAP plots

From the above charts, we can see that the YOLO algorithm’s training loss decreases faster with the DarkNet53 network, the training is less volatile, and the final stable value is lower. The custom weights of YOLOv3 and YOLOv3-tiny can now be used for testing our detection model. In terms of detection quality, we independently tested the weights of the three different versions of the YOLO network for the detection of text in the same natural scene images, as shown in Fig. 9. Both the top left and right images indicate “Gauhati University Idol”, while the bottom left and right images indicate “Tarif Bakery” and “Welcoming Kharupetia Town Committee, Darang”. The candidate bounding boxes extracted by YOLOv3-tiny-DarkNet53 are less precise than those extracted by YOLOv3-DarkNet53, but the test images show no incorrect detections of textual regions, so the results still provide a sufficient foundation for subsequent recognition work. Furthermore, the detection speed of YOLOv3-tiny is far superior to that of YOLOv3.

Fig. 9
figure 9

Detection effect of YOLOv3 (top left), YOLOv3-tiny (top right), and YOLOv5s (bottom left and right)

Based on the analysis of the three detection results above on the same set of training data, YOLOv5s (as shown in Fig. 9) has a slower detection speed than YOLOv3-tiny but achieves the highest recognition rate among the models, with no incorrect detections for the given IOU threshold; the same holds when testing on a single detection target. Network simplification also improves training speed while maintaining the recognition rate. In terms of detection speed, YOLOv3-DarkNet53 takes 23 ms, YOLOv3-tiny-DarkNet53 takes 8.3 ms, and YOLOv5s takes 14.2 ms of inference time. Therefore, after optimizing the network with SGD and Adam, the recognition speed of YOLOv3-tiny-DarkNet53 is the fastest, and its frames-per-second count is significantly increased. In terms of precision, however, YOLOv5s performs significantly better, achieving a mean average precision (mAP) score of 94.3%. In terms of training duration, YOLOv3-tiny outperforms all other models, making hyper-parameter and network fine-tuning easier. A comparative analysis of YOLOv3-tiny with YOLOv3, YOLOv5s, and the reference model MobileNetV3 is shown in Table 1.

Table 1 Comparison between the proposed model and experimental results [18]

5 Conclusion

Text detection plays a vital role in the domain of computer vision, and its various applications have made our lives easier and more productive. Scene text detection is a computer vision task that detects text in natural scenes, where different objects, font sizes, colors, lighting, etc., must always be considered and unwanted characteristics such as background patterns and noise must be suppressed. Deep learning has made scene text detection much easier and more reliable, overcoming problems such as region segmentation and pattern recognition, whereas traditional text detection methods are slow and take a long time to train. The YOLO algorithm is one of the deep learning algorithms that has passed the 30 FPS real-time detection mark. Our proposed model with YOLOv3-tiny and the DarkNet53 backbone provides a higher FPS than YOLOv3, MobileNetV3, and YOLOv5s. In terms of detection metrics, we compute the mean average precision (mAP) and observe that YOLOv3-tiny is slightly lower than YOLOv3, YOLOv5s, and MobileNetV3. On the other hand, YOLOv5s achieves the highest detection performance and the best precision. The weights file of YOLOv5s is very small compared to those of YOLOv3 and YOLOv3-tiny but still larger than that of MobileNetV3, which might constrain low-end hardware systems. We will further develop our work to recognize the detected Assamese texts and translate them into English using an improved version of LSTM-based Tesseract-OCR and neural machine translation.