1 Introduction

Automatic License Plate Recognition (ALPR) systems serve various applications, including identifying stolen vehicles, monitoring traffic, and smart toll collection [9, 23]. Recent advances in deep learning (DL) and parallel computing have enabled excellent performance in several digital image/video applications, such as optical character recognition and object detection and recognition, which have tremendously improved ALPR systems. Recently, convolutional neural networks (CNNs) have achieved exceptional performance and have become the primary machine learning approach for LP detection and recognition [2, 6, 10,11,12,13, 15, 25, 27]. Several commercial ALPR systems also employ DL methods. They are usually integrated with web services and large data centers to process millions of vehicle images daily and constantly improve the system. Notable examples include OpenALPR, Sighthound, and Amazon Rekognition.

Moreover, with the establishment of DL, CNN-based object detection methods have become popular for LP recognition. Typically, Faster R-CNN [21], the Single Shot MultiBox Detector (SSD) [14], and You Only Look Once (YOLO) [18] models are employed. Faster R-CNN [21], a modified version of R-CNN and Fast R-CNN, forgoes the time-consuming selective search strategy and instead lets the architecture learn the region proposals: a neural network predicts region proposals directly on the feature map (FM) rather than relying on a separate search procedure. Praveen Ravirathinam and Arihant Patawari [16] demonstrated the effectiveness of Faster R-CNN for LP detection; their model could also detect tilted and non-rectangular plates, although its mAP remained relatively low because it could not capture small-scale plates. The study in [11] presented a robust object detection model using Fast YOLO and YOLOv2 to detect LPs in both simple and realistic conditions.

Despite the advances in this field, most approaches focus on recognizing LPs in controlled environments, assuming a frontal view of the vehicles and LPs. Current challenges in ALPR include image distortion, image quality degradation, weather (snow, rain, etc.), and variable illumination conditions. A less constrained capture setting (e.g., a camera on a police car tracking an offending vehicle) can also produce oblique views. In such cases, the LP may be severely distorted and thus challenging to recognize; even standard commercial solutions struggle here.

This paper proposes a comprehensive ALPR paradigm capable of performing well over various unconstrained capture scenarios and camera setups. We integrate a transformation module to estimate and rectify the distortion and improve character recognition performance. An additional contribution is the collection of images from natural scenes, which cover various challenging scenarios and contain substantial LP distortions. The proposed system can also detect and recognize LPs in independent test data sets using the same configuration. The data sets employed in this work are publicly available; samples can be obtained from the SSIG-SegPlate database [4] and the application-oriented license plate (AOLP) data set [7].

2 Materials

This section provides background information about the various vital components employed in the proposed work (Fig. 1).

Fig. 1. YOLOv4 architecture [1]

2.1 You Only Look Once (YOLO)

YOLO is a one-stage object detector. YOLOv2 [19] was built upon YOLO with several incremental enhancements, such as batch normalization, higher-resolution training, and anchor boxes. To perform better on smaller objects, YOLOv3 [20] improved upon the earlier models by predicting bounding boxes with an objectness score, adding skip connections to the backbone network layers, and making predictions at three different levels of granularity. YOLOv4 [1], a one-stage detector composed of multiple components, is the upgraded version of the earlier generations used here; more recent YOLO versions are less stable when used as a black box for our proposed methodology. The YOLOv4 model consists of a Backbone, a Neck, and a Head, as shown in Fig. 1.

Backbone: It consists of the CSPDarknet53 model, which detects objects with higher accuracy and is further enhanced by the Mish and other activation functions [1].

Neck: It consists of a spatial pyramid pooling (SPP) layer and a Path Aggregation Network (PAN). SPP sits between CSPDarknet53 and PAN and plays a crucial role in gathering adequate context information when detecting objects of various scales. It replaces the last pooling layer, which comes after the final convolutional layer: max pooling is applied with sliding kernels of various sizes, and the FMs generated by the different kernel sizes are concatenated to form the result [1].

Further, the PAN network was chosen for instance segmentation in YOLOv4 because of its capacity to reliably preserve spatial information, which aids the proper localization of pixels for mask generation. The properties that make PAN so accurate are bottom-up path augmentation, adaptive feature pooling, and fully-connected fusion [1].

Head: Bounding box localization and categorization are performed by the head (dense prediction). The procedure is the same as in YOLOv3: it predicts the objectness score and the bounding box coordinates (x, y, height, and width). The algorithm splits the input image into grid cells and uses anchor boxes to forecast the likelihood that each cell contains an object. The result is a vector containing the bounding box coordinates and the class probabilities [1].
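To make the head's output layout concrete, the following minimal Python sketch computes the size of such a prediction tensor for an assumed 13 × 13 grid with three anchor boxes and our single LP class; the exact YOLOv4 tensor layout differs slightly across its three scales, so this is only illustrative.

```python
# Illustrative sketch (not the exact YOLOv4 tensor layout, which spans three
# scales): for an S x S grid, B anchor boxes per cell, and C classes, each
# anchor predicts (x, y, w, h, objectness) plus C class scores.
S, B, C = 13, 3, 1      # assumed example values; C = 1 for the single LP class
per_anchor = 5 + C      # x, y, w, h, objectness + class scores
output_shape = (S, S, B * per_anchor)
print(output_shape)     # (13, 13, 18)
```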

Fig. 2. Proposed system pipeline.

2.2 Spatial Transformer Network

Though CNNs are a powerful class of models, they are constrained by their inability to be spatially invariant to the input data in a computationally and parameter-efficient manner. The Spatial Transformer Network (STN) [8] is a novel learnable module that explicitly permits the spatial manipulation of data within the architecture. This differentiable module can be added to existing convolutional architectures, enabling NNs to actively transform FM spatial relationships based on the FM itself, without changing the optimization procedure or adding additional training supervision.

3 Proposed Methodology

The proposed structure is demonstrated in Fig. 2 and comprises three main steps: LP Detection, LP Transformation and Rectification, and Character Recognition Network. Given an input image, the custom-trained YOLOv4 model detects LPs in the scene. The detections are cropped and forwarded to an STN, which rectifies LP images with diverse orientations and background details. The corrected images have a uniform orientation and less surrounding noise. These rectified detections are then presented to a Character Recognition Network.

3.1 License Plate Detection

Detection of LPs is an essential phase in the ALPR process; hence, we adopted a reliable model to carry it out. To select the best algorithm, we defined the following criteria: 1) the algorithm must have acceptable performance and a high recall rate, because even a small number of missed detections worsens the overall LP detection; 2) for reliable real-time detection, the method must be computationally fast; 3) the computational cost should be reasonable so that use in practical applications is not hampered. As a result, we chose YOLOv4 as our LP detection network; it offers a good trade-off between computational cost and speed. We refined the YOLOv4 model configuration to specialize it for LP detection (example detections are shown in Fig. 3). Since we need only one class, i.e., LP, we changed the number of classes from 80 to 1 and modified the maximum batch size according to the formula below,

$$\begin{aligned} max\_batches = min(training\_images,min(classes*2000,6000)); \end{aligned}$$
(1)
Fig. 3. Examples of detected LPs from the testing data set.

Secondly, we altered the number of filters in the convolutional layers using the formula below.

$$\begin{aligned} filters=(classes+5)*3; \end{aligned}$$
(2)

Thus, we employ a reconfigured model for the detection of LPs.
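As a rough illustration of this reconfiguration, the sketch below evaluates Eqs. (1) and (2) for our single LP class; the Darknet .cfg field names mentioned in the comments (classes, filters, max_batches) follow the common Darknet convention and are assumptions, not quoted from our configuration files.

```python
# Sketch: evaluate Eqs. (1) and (2) for a single-class (LP) detector. The
# corresponding Darknet .cfg fields (classes, max_batches, and filters in the
# convolutional layers before each [yolo] layer) are named here by convention.
def yolo_cfg_values(num_classes, num_training_images):
    max_batches = min(num_training_images, min(num_classes * 2000, 6000))  # Eq. (1)
    filters = (num_classes + 5) * 3                                         # Eq. (2)
    return max_batches, filters

print(yolo_cfg_values(num_classes=1, num_training_images=3000))  # -> (2000, 18)
```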

3.2 Spatial Transformer Network (STN)

The STN suggested in [8] is a differentiable and self-contained module; thus, it can be added to existing convolutional architectures. It streamlines the subsequent classification task and improves classification results by strengthening a model's spatial invariance against deformations such as translations, scaling, rotations, and cropping. The suggested model is more resistant to various shooting angles and noise because the input LP images are first rectified with the trained STN to a consistent orientation with reduced noise. As Fig. 4 shows, the STN is divided into three parts: 1) the localization network (LN) derives the affine transformation parameters \(\theta \) by extracting the key attributes of the input image \(I\); 2) the grid generator transforms the initial grid into a new sampling grid based on \(\theta \); 3) the sampler samples \(I\) with the new grid to create the rectified picture.

Fig. 4. Structure of STN [8].

Localization Network: The LN accepts the input FM \(\textit{U} \in \mathbb {R}^ {H \times W \times C}\) with height \(H\), width \(W\), and \(C\) channels and outputs \(\theta \), the parameters of the transformation \(\mathbb {T}_\theta \) applied to the FM: \(\theta = f_{loc}(\textit{U})\). The size of \(\theta \) depends on the parameterized transformation type; for an affine transformation, \(\theta \) is six-dimensional. The LN function \(f_{loc}()\) can be fully connected or convolutional, but it must include a final regression layer that produces the transformation parameters.

$$\begin{aligned} \begin{bmatrix} x'\\ y' \end{bmatrix} = \begin{bmatrix} \theta _{11} & \theta _{12}\\ \theta _{21} & \theta _{22} \end{bmatrix} \begin{bmatrix} x\\ y \end{bmatrix} + \begin{bmatrix} \theta _{13}\\ \theta _{23} \end{bmatrix} \end{aligned}$$
(3)

The affine transformation matrix is represented by \(\mathbb {A}_\theta \).

$$\begin{aligned} \mathbb {A}_\theta = \begin{bmatrix} \theta _{11} & \theta _{12} & \theta _{13}\\ \theta _{21} & \theta _{22} & \theta _{23} \end{bmatrix} \end{aligned}$$
(4)

The LN structure summarized in Table 1 consists of three sets of convolutional and max-pooling layers, followed by a fully connected layer and one output layer.

Table 1. LN Configuration.
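For illustration, a minimal Keras sketch of such a localization network is given below, assuming a 70 × 270 × 3 input crop; the filter counts, kernel sizes, and dense width are placeholders rather than the exact values of Table 1, and the output layer is biased towards the identity transform, a common STN initialization.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal sketch of a localization network in the spirit of Table 1: three
# conv + max-pool blocks, one fully connected layer, and a 6-way regression
# output for theta. Input size, filter counts, and dense width are
# illustrative assumptions, not the paper's exact configuration.
def build_localization_net(input_shape=(70, 270, 3)):
    inp = layers.Input(shape=input_shape)
    x = inp
    for n_filters in (16, 32, 64):                   # 3 conv + max-pool sets
        x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)       # fully connected layer
    theta = layers.Dense(6, kernel_initializer="zeros")(x)  # output layer
    # Bias the initial prediction towards the identity transform
    # [1, 0, 0, 0, 1, 0], so training starts from "no warp".
    identity = tf.constant([[1.0, 0.0, 0.0, 0.0, 1.0, 0.0]])
    theta = layers.Lambda(lambda t: t + identity)(theta)
    return models.Model(inp, theta, name="localization_net")
```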

Parameterised Sampling Grid: Every pixel of the input LP image has a corresponding coordinate vector \(K_i = (x_i, y_i)^T\), where \(i\) is the pixel index. The affine-transformed coordinate vector \(K'_i = (x'_i, y'_i)^T\) is obtained by multiplying \(\mathbb {A}_\theta \) with \(K_i\) in homogeneous coordinates. It is expressed as

$$\begin{aligned} \begin{pmatrix} x'_i\\ y'_i \end{pmatrix} = \mathbb {A}_\theta \begin{pmatrix} x_i\\ y_i\\ 1 \end{pmatrix} = \begin{bmatrix} \theta _{11} & \theta _{12} & \theta _{13}\\ \theta _{21} & \theta _{22} & \theta _{23} \end{bmatrix} \begin{pmatrix} x_i\\ y_i\\ 1 \end{pmatrix} \end{aligned}$$
(5)

The grid generator's final output is the set \(K' = (K'_1, K'_2, ..., K'_i,...,K'_{W \times H})\), where \(W\) and \(H\) are 270 and 70, respectively, in our experiments.

Differentiable Image Sampler: To generate the rectified image \(O\), the sampler samples the original image using the sampling grid \(K'\). This sampling uses bilinear interpolation, which is differentiable. The STN is created by combining the LN, the parameterized grid generator, and the image sampler, and it can be trained end-to-end with the other model components. Please refer to [8] for further information.
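The following NumPy sketch illustrates the grid generation of Eq. (5) and the bilinear sampling for a single grayscale image; the actual STN performs the same computation in TensorFlow, batched and differentiably, so this is only a conceptual illustration.

```python
import numpy as np

# Conceptual illustration of the grid generator (Eq. 5) and the bilinear
# sampler for one grayscale image; the trained STN performs the same steps
# in TensorFlow, batched and end-to-end differentiable.
def affine_grid(theta, H_out, W_out):
    # Normalised target coordinates in [-1, 1], as in the STN formulation [8]
    ys, xs = np.meshgrid(np.linspace(-1, 1, H_out),
                         np.linspace(-1, 1, W_out), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])  # (3, H*W)
    A = np.asarray(theta, dtype=float).reshape(2, 3)               # Eq. (4)
    return A @ coords                                              # (2, H*W)

def bilinear_sample(img, grid, H_out, W_out):
    H, W = img.shape
    # Map normalised sampling coordinates back to pixel indices
    x = (grid[0] + 1) * (W - 1) / 2
    y = (grid[1] + 1) * (H - 1) / 2
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    x0, y0 = np.clip(x0, 0, W - 1), np.clip(y0, 0, H - 1)
    wx, wy = x - x0, y - y0
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bottom = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).reshape(H_out, W_out)

# An identity theta leaves the 70 x 270 rectified-plate grid unchanged
theta = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
plate = np.random.rand(70, 270)
rectified = bilinear_sample(plate, affine_grid(theta, 70, 270), 70, 270)
```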

3.3 Character Recognition

The recognition process consists of three parts: (1) Preprocessing the rectified image output of STN; (2) Character Segmentation; (3) Recognition of segmented characters.

Fig. 5. (a) Binary conversion of the detected plate. (b) Bounding rectangles containing contours. (c) Binary images of segmented characters.

Preprocessing Stage: The rectified LP image is processed to make character extraction easier. The input image is first converted to a grayscale image with a single 8-bit channel and values ranging from 0 to 255, where 0 is black and 255 is white. This image is then converted to a binary image, in which each pixel has a value of either 0 (black) or 1 (white), as shown in Fig. 5(a). A threshold between 0 and 255 is used for this conversion; we set the threshold to 200, so a pixel whose grayscale value exceeds 200 is assigned 1 and all other pixels are assigned 0.

The binary image is then eroded. Erosion [5] removes unwanted pixels from the object boundary, i.e., pixels that have a value of 1 but should be 0. Each pixel in the image is considered together with its neighbors (the kernel size determines the number of neighbors); a pixel keeps the value 1 only if all of its neighbors are also 1, and otherwise it is set to 0.

The noise-free image is then dilated. Dilation [5] fills in missing pixels, i.e., pixels that should be 1 but have the value 0. Again, each pixel is considered together with its neighbors (the kernel size determines the number of neighbors); a pixel is set to 1 if at least one of its neighbors is 1.
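A minimal OpenCV sketch of this preprocessing pipeline is shown below; the 3 × 3 kernel and single erosion/dilation iteration are assumptions for illustration, since the text does not specify the kernel size. Note that OpenCV represents the binary values as 0/255 rather than 0/1.

```python
import cv2
import numpy as np

# Sketch of the preprocessing stage: grayscale conversion, fixed thresholding
# at 200, then erosion and dilation. The 3x3 kernel and single iteration are
# illustrative assumptions.
def preprocess_plate(rectified_bgr):
    gray = cv2.cvtColor(rectified_bgr, cv2.COLOR_BGR2GRAY)        # 8-bit, 0-255
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)  # >200 -> white
    kernel = np.ones((3, 3), np.uint8)
    eroded = cv2.erode(binary, kernel, iterations=1)    # remove boundary noise
    dilated = cv2.dilate(eroded, kernel, iterations=1)  # fill missing pixels
    return dilated
```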

Discovering every contour in the input image is essential for extracting the individual characters from the LP. Contours are curves joining all the continuous points along a boundary that share the same color or intensity. After locating each contour, we examine it individually and compute the size of its bounding rectangle, as shown in Fig. 5(b). Once we have the dimensions of the bounding rectangles, we filter out the rectangles that contain the required text using the following ranges.

$$\begin{aligned} W = range\{0, \frac{input\_length}{character\_count}\} \end{aligned}$$
(6)
$$\begin{aligned} L = range\{\frac{input\_width}{2}, \frac{4 \times input\_width}{5}\} \end{aligned}$$
(7)

Using the above equations, we perform a dimension comparison and accept only the rectangles whose width lies in the range \(\{0, \; input\_length / character\_count\}\) and whose length lies in the range \(\{input\_width / 2, \; 4 \times input\_width / 5\}\). This process segments all the characters as binary images, as shown in Fig. 5(c).
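The sketch below illustrates this segmentation step with OpenCV, reading the "input length" in Eq. (6) as the plate's horizontal extent and the "input width" in Eq. (7) as its vertical extent; the expected character count is an assumed parameter the caller must supply.

```python
import cv2

# Sketch of the contour-based segmentation and the dimension filter of
# Eqs. (6) and (7), interpreting "input length" as the plate's horizontal
# size and "input width" as its vertical size; `char_count` is an assumed
# parameter giving the expected number of characters.
def segment_characters(binary_plate, char_count=10):
    plate_h, plate_w = binary_plate.shape[:2]
    contours, _ = cv2.findContours(binary_plate, cv2.RETR_TREE,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)
        width_ok = 0 < w < plate_w / char_count            # Eq. (6)
        height_ok = plate_h / 2 < h < 4 * plate_h / 5      # Eq. (7)
        if width_ok and height_ok:
            boxes.append((x, binary_plate[y:y + h, x:x + w]))
    # Sort left-to-right so the recognised string preserves the plate order
    return [crop for x, crop in sorted(boxes, key=lambda b: b[0])]
```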

Table 2. The layout of the designed CNN.

Recognition of Segmented Characters: CNNs, trainable feature extractors, have recently achieved significant success in computer vision problems. This success results from advancements in two technical areas: methods that prevent overfitting and more robust model architectures [3, 17]. CNNs are formed of artificial neurons with self-optimizing properties, making them capable of extracting and classifying image features more precisely than other algorithms. Since LP text appears in various font styles and sizes, we trained a more powerful deep network for this task. We want the model to develop a more intuitive understanding of the text, including lower-level features such as character labels and the exact locations of text pixels. To accomplish this, we propose a deep LPR CNN trained with highly supervised text information at multiple levels, including segmentation of character regions, character labels, and text/non-text binary information. This additional supervision provides the model with more specific textual features, enabling it to perform both high-level classification and low-level region segmentation, so that it can systematically recognize where each character is and what it is, which is crucial for reliable decisions.

Table 2 and Fig. 6 display the detailed configuration and structure of the proposed CNN. The number of channels, stride, padding, and kernel sizes are similar to VGGNet [26]. These configurations have been successfully applied in other LP recognition tasks [12, 13].

Fig. 6. Architecture of the proposed CNN.
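As a rough illustration of such a VGG-style character classifier, the Keras sketch below assumes the 28 × 28 grayscale character crops and 36 classes (A–Z, 0–9) described in Sect. 4.1; the layer depths and filter counts are placeholders rather than the exact configuration of Table 2, and the auxiliary segmentation supervision discussed above is omitted.

```python
from tensorflow.keras import layers, models

# Sketch of a VGG-style character classifier for 28x28 grayscale crops and
# 36 classes (A-Z, 0-9). Depths and filter counts are illustrative, not the
# exact configuration of Table 2.
def build_char_cnn(num_classes=36):
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                      # guards against overfitting
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```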

4 Results and Discussion

To verify the effectiveness of the proposed ALPR paradigm, the model was implemented using the TensorFlow and Keras frameworks. The system configuration for evaluation is as follows: 9th Gen Intel Core i7 CPU, NVIDIA GeForce GTX 1650 Ti GPU with 4 GB memory, and 16 GB RAM.

4.1 Data Sets Description

To the best of our knowledge, a general data set of distorted LP images is unavailable. The absence of sufficient images hampers the use of robust DL algorithms for recognizing distorted LPs. To effectively train our custom YOLOv4, we created a data set of vehicles with deformed LPs captured at different shooting angles and against complex backgrounds, as shown in Fig. 7. The images were collected from Google Images and natural scenes. We collected 3000 images of vehicles with various LP styles and annotated them to train the model.

Fig. 7. Data set samples of distorted LPs.

Fig. 8. Data set samples of characters of various fonts.

We have a data set of 37,623 images to train our CNN model. The data set includes the letters A–Z and digits 0–9 in more than 50 unique fonts commonly found on LPs, as shown in Fig. 8. To make the model robust to oblique views, data augmentation methods, including random rotation and perspective transformations, were used. Each class of letter or digit thus contains 1045 images of size 28 \(\times \) 28. We randomly select 33,861 character images for training and the remaining 3762 images for testing. Table 3 provides a comparative analysis of the various data sets.
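A minimal OpenCV sketch of these augmentations is given below; the rotation range and corner-jitter magnitude are illustrative assumptions, not the exact parameters used to build the data set.

```python
import cv2
import numpy as np

# Sketch of the augmentations mentioned above: a random rotation and a random
# perspective warp applied to one character image. Angle and jitter ranges are
# assumptions chosen for illustration.
def augment_character(img, max_angle=15, max_jitter=3):
    h, w = img.shape[:2]
    # Random rotation about the image centre
    angle = np.random.uniform(-max_angle, max_angle)
    M_rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(img, M_rot, (w, h), borderValue=0)
    # Random perspective transform: jitter the four corners
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + np.random.uniform(-max_jitter, max_jitter, src.shape).astype(np.float32)
    M_persp = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(rotated, M_persp, (w, h), borderValue=0)
```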

Table 3. Comparative analysis of various data sets.

4.2 Result Analysis

The objective is to create a method that works well in several uncontrolled situations while simultaneously functioning adequately in controlled ones (such as primarily frontal views). We selected four public data sets: AOLP (RP), SSIG, and OpenALPR (EU and BR), which, as shown in Table 3, cover a wide range of scenarios. We considered two variables: the LP angle (frontal or oblique) and the distance between the vehicle and the camera (close, intermediate, or distant view). Although these data sets cover various scenarios, the lack of a more general-purpose data set for challenging scenes is still a limitation. Thus, as an additional contribution, from our collected images we selected and manually annotated a set of 104 images that cover various challenging scenarios. The images contain substantial LP distortions but are still readable by humans. A few examples are shown in Fig. 7.

Table 4. Performance analysis and comparison for multiple data sets.

Experimental Results: This section presents the experimental analysis of the proposed ALPR mechanism and its comparison with other implemented methods. To assess the overall performance of the presented model, we take the percentage of correctly recognized LPs (\(CL\)) out of the total number of test LP images (\(TL\)). The recognition accuracy is given by

$$\begin{aligned} A = CL / TL \end{aligned}$$
(8)

Note that all test data sets were evaluated with the same network; no additional fine-tuning was performed for any specific data set.

Table 4 indicates that the proposed method performs well across the various data sets. Compared to the other alternatives, it is superior on the AOLP (RP) and SSIG test data sets, on which the proposed method achieves 96.56% and 89.55%, respectively. Across the different approaches, performance varies by approximately 27.0% on the AOLP (RP) data set and by nearly 8.0% on the SSIG test data set. The error-rate reduction due to the proposed method is 88.63% and 79.19% compared to OpenALPR and Sighthound, respectively. Table 4 also compares the proposed system with other implemented systems. Our system achieves recognition rates comparable to commercially available systems on controlled scenes, where the LPs have frontal views and less complicated environments, and it achieves the best performance on AOLP (RP) and the proposed oblique-LP data set.

Table 5. mAP comparison of proposed YOLOv4.
Fig. 9. Training performance of custom YOLOv4.

Furthermore, the performance of the proposed ALPR method on the OpenALPR data set is inferior to the other alternatives. The proposed approach attains more than 90.0% accuracy, but it is 4.95% and 2.04% lower than OpenALPR and Sighthound, respectively. In addition, the performance of the proposed method, OpenALPR, and Sighthound varies by 7.01%, 26.58%, and 13.27%, respectively, across the AOLP (RP), SSIG test, and OpenALPR data sets, which indicates the stability of the proposed mechanism compared to the other alternatives.

Fig. 10. Training accuracy and loss analysis of the proposed model.

Moreover, the proposed system provides better results than the other mechanisms on the proposed data set, achieving 85.0%, which is 10.0% and 35.0% higher than OpenALPR and Sighthound, respectively. It is also important to note the beneficial impact of the STN on the recognition results. To demonstrate this effect, we removed the STN module from the proposed mechanism; as seen in Table 4, there is a significant gap in recognition performance on the oblique scenes of AOLP and the presented data set. This performance difference demonstrates how the STN contributes to improved recognition of distorted LPs.

Table 5 and Fig. 9 illustrate the training performance of the custom YOLOv4 model. The model achieved 90.0% mAP after 2800 iterations, outperforming the YOLOv2 and YOLOv3 used in [22, 24]. Fig. 10 also indicates that the model does not overfit the given input data, as the error decreases continuously. Character recognition performance is further analyzed with a confusion matrix, illustrated in Fig. 11. We observe that the presented ALPR method misclassifies 'O' and '0'.

Fig. 11. Confusion matrix of the character recognition model per class.

5 Conclusion

This work demonstrated a comprehensive approach to ALPR in uncontrolled environments. The results indicate that the presented ALPR paradigm performs significantly better than existing methods on challenging data sets with license plates captured at severely oblique viewpoints. The primary contribution of this work is the use of the spatial transformer network, which rectifies distorted license plates; this step helps the recognition network (a convolutional neural network) learn the character patterns more easily because it has to deal with far less distortion. In addition, we generated a complex data set by augmenting images to detect license plates in skewed views. Currently, the proposed system recognizes license plate numbers in English. For future work, we intend to extend the current paradigm to recognize multilingual license plates, including those written in the Devanagari script.