1 Introduction

Location-based advertising (LBA) is a targeted advertising approach that delivers ads based on local cultural practices, through channels ranging from static roadside advertising signs to mobile devices. According to [1], LBA traffic has a higher value than other application fields, and investment amounts are expanding year by year. Moreover, global market research [2] shows a sizable LBA market built on geographic positioning.

Nowadays, the value of LBA is most evident in mobile services. Yu et al. [3] showed that by placing targeted advertisements along the itinerary between a passenger's boarding point and destination, advertisers can increase their revenue by leveraging local cultural characteristics. However, this service model requires the driving route to be set in advance and advertisements to be placed along the road, which limits advertisement placement to the designated route.

As demand for systems supported by artificial intelligence (AI) increases, advanced driver assistance systems (ADAS) have become popular and are widely regarded as the trend of the future. The report in [4] shows the ADAS market growing at a compound annual growth rate (CAGR) of 13.8%. The literature in [5, 6] applied computer vision and deep learning architectures to meet driving needs.

LBA in an ADAS setting can rely on several communication technologies, including the global positioning system (GPS), Wi-Fi, cellular tower pings, QR codes, and radio frequency identification (RFID). For a driving assistance system that receives advertisements, detection accuracy, signal strength, and cost must all be considered, and QR code scanning in particular conforms well to real driving conditions. The motivation of this research is the challenge posed by QR codes: they support clear image capture, rapid scanning, and error correction [7], but Li et al. [8] point out that QR code recognition may be affected by motion or focus blur, and uneven road surfaces can cause horizontal and vertical motion blur at different speeds and road conditions during driving.

To address this problem, this study uses a small vehicle called a Donkey Car, combined with a Raspberry Pi and a high-speed camera, to capture video autonomously. The recorded footage covers the automotive application, and different YOLOv7 algorithms [9] are applied to the training and testing splits, compared across various speeds, sizes, and angles. Furthermore, to restore real images so that they can actually be scanned, an end-to-end generative adversarial network (GAN) [10] for single-image motion deblurring, DeblurGAN-v2 [11], is used to restore the blurred images; we evaluate the proportion of restored images that can be scanned successfully and compare the performance of different models, showing that the proposed method outperforms the others. The training and testing sets are distinguished by how they were collected: the former was acquired indoors with a self-propelled device, while the latter was obtained outdoors using the same device.

The remainder of this paper is organized as follows. Section 2 discusses related work on deep learning for QR code scanning in driving assistance and on deblurring methods. Section 3 describes the device and the proposed deep learning architectures. Section 4 presents the experimental results and a discussion of the evaluation and comparison with different methods. Finally, Sect. 5 provides the conclusions and future work.

2 Related Work

To improve the QR code scanning rate, prior work has studied deblurring blurred QR codes so that they can be read. Yuan et al. [12] showed that blur types such as linear motion, defocus, and Gaussian blur can be distinguished by neural networks after training. Schuler et al. [13] demonstrated that a neural network-based approach is superior to traditional methods for non-blind image deblurring, particularly in cases of motion blur. Nah et al. [14] proposed Deep Deblur, a multi-scale convolutional neural network (MCNN) that removes the dependence on an explicit blur kernel, thereby avoiding kernel-induced artifacts. Inspired by the GAN framework proposed by Goodfellow et al. [10], Kupyn et al. [15] treated deblurring as an image-to-image translation task and used a conditional GAN (cGAN) structure whose loss function evaluates the gap between the generated clear image and the ground truth; the resulting DeblurGAN achieved the best deblurring effect at the time. To obtain better image quality, DeblurGAN-v2 uses a Feature Pyramid Network (FPN) for feature fusion, and its discriminator adopts the loss function of Least Squares GANs (LSGAN) to make the overall training process more stable.

A QR code has a square shape and function patterns that yield distinctive features, and noise removal and binarization are usually performed before the image is used as input. The studies in [16, 17, 18] hold that QR code detection depends strongly on locating the Finder Pattern (FIP). Blanger et al. [18] adjusted the Single Shot MultiBox Detector (SSD) architecture and added the FIP as a sub-part feature during training, which improved the accuracy of QR code detection. Wang et al. [19] proposed a deblurring method for motion-blurred codes: a GAN-based method was used to obtain the deblurred two-dimensional code image, and comparison with traditional methods showed that the convolutional neural network (CNN) approach outperformed them in recognition accuracy and speed.

3 Methodology

3.1 Data Collection

In this study, a Raspberry Pi was installed on a vehicle module to collect QR code images while driving. Four criteria guided the collection, labeling, and design of the dataset. The first criterion is the size of the QR code, which determines how much space it occupies on a roadside sign; it also dictates the required resolution of the in-vehicle camera, since smaller QR codes demand a higher-resolution lens. The second criterion is the driving speed, which causes shaking and blurring. The urban speed limit of 50 km/hr was used as the reference, and the scale was adjusted based on the size of the QR code: for a scaled road surface 1.2 m in length and width and a displayed QR code 12 cm in size, the driving speed was reduced to 5 km/hr to keep shaking and blurring proportionate. The third criterion is the distance between the QR code and the vehicle. Because the on-vehicle camera has height limitations, QR codes seen from an inner lane may be blocked by vehicles in the outer lane; the environment therefore assumes that the vehicle drives in the outer lane or a single lane, and the QR code position is scaled according to the actual road shoulder width plus the width of the sidewalk. The fourth criterion is the shooting distance, since capturing a clear and detailed image of the QR code is crucial. During data collection, initial detection of the QR code is set at 10 m from the vehicle; at higher driving speeds, the QR code must be detected earlier, at 20 m. The zoom distance of the camera is adjusted based on the scale of the QR code. The resulting design is illustrated in Fig. 1.

Fig. 1. An illustration of QR code size and distance ratio

Data were collected in both indoor and outdoor locations, with images captured every 2 s. To analyze the impact of blur restoration at different speeds, the simulated driving speed ranges from 15 km/hr to 50 km/hr.
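As a concrete illustration of this capture step, the following is a minimal sketch of periodic frame grabbing with OpenCV; the camera index, resolution, and output directory are assumptions for illustration, not the exact settings of the study's recorder.

```python
import os
import time
import cv2  # OpenCV; assumed available on the Raspberry Pi

CAPTURE_INTERVAL_S = 2      # one frame every 2 s, as in the collection protocol
OUTPUT_DIR = "qr_frames"    # hypothetical output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

cap = cv2.VideoCapture(0)   # camera index 0 is an assumption
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

frame_id = 0
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(f"{OUTPUT_DIR}/frame_{frame_id:05d}.jpg", frame)
        frame_id += 1
        time.sleep(CAPTURE_INTERVAL_S)   # wait 2 s before the next capture
finally:
    cap.release()
```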

3.2 YOLOv7 Architecture

Real-time object detection requires high processing speed; to meet this requirement, a model must sustain a frame rate of over 30 frames per second (fps). The YOLO [20] series offers several object detection algorithms with demonstrated high execution speed and accuracy. YOLOv7 [9] employs advanced optimization methods to enhance the model architecture. It combines the original VoVNet [21] architecture with the Cross Stage Partial Network (CSPNet) and improves the gradient path, yielding CSPVoVNet [22], so that the model can learn the weights of different layers more effectively, which speeds up training and improves accuracy. The model is further improved and extended with Extended Efficient Layer Aggregation Networks (E-ELAN) [23], which stabilize learning and convergence through shuffling, expanding, and merging cardinality, and avoid the unstable states caused by excessive computational blocks. The structure is shown in Fig. 2.

Fig. 2. Extended efficient layer aggregation networks [9]
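To make the aggregation idea concrete, the following is a simplified PyTorch sketch of an ELAN-style block: parallel convolutional branches of different depths are concatenated and fused by a transition convolution. It is an illustrative approximation under stated assumptions, not the exact E-ELAN implementation of [9].

```python
import torch
import torch.nn as nn

class ELANBlockSketch(nn.Module):
    """Simplified ELAN-style block: parallel conv paths, concatenation, transition."""
    def __init__(self, c_in: int, c_mid: int, c_out: int):
        super().__init__()
        self.branch1 = nn.Conv2d(c_in, c_mid, 1)            # shortcut-like 1x1 branch
        self.branch2 = nn.Conv2d(c_in, c_mid, 1)
        self.conv3 = nn.Conv2d(c_mid, c_mid, 3, padding=1)  # deeper computational path
        self.conv4 = nn.Conv2d(c_mid, c_mid, 3, padding=1)
        # transition layer fuses all aggregated feature maps
        self.transition = nn.Conv2d(4 * c_mid, c_out, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        y1 = self.act(self.branch1(x))
        y2 = self.act(self.branch2(x))
        y3 = self.act(self.conv3(y2))
        y4 = self.act(self.conv4(y3))
        # aggregate features from every path length, then fuse
        return self.act(self.transition(torch.cat([y1, y2, y3, y4], dim=1)))

feats = ELANBlockSketch(64, 32, 128)(torch.randn(1, 64, 80, 80))
print(feats.shape)  # torch.Size([1, 128, 80, 80])
```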

The issue of transition layer width in deep neural networks is addressed through a compound model scaling method. The approach scales the concatenation-based model while maintaining the model's original nature when the depth of a computational block is scaled [24]. The scaling process also takes the transition layer into account, ensuring that its width is adjusted proportionally to the scaling of the computational block: the depth of the computational block is scaled, and the corresponding width is adjusted through the transition layer. This preserves the inherent characteristics and optimal structure of the initial model from the design phase while avoiding reduced computational utilization. Overall, the method aims to optimize the performance of deep neural networks by preserving their original architecture and characteristics while scaling them to handle more complex tasks. The structure is shown in Fig. 3.

Fig. 3. A compound scaling up depth and width for concatenation-based model [24]
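A minimal sketch of the scaling rule is shown below, with hypothetical factor values: when the computational block's depth is scaled, the number of channels reaching the transition layer changes, so the transition width is scaled by a corresponding factor. The channel bookkeeping here is a deliberate simplification, not the exact rule from [24].

```python
import math

def compound_scale(base_depth: int, base_width: int,
                   depth_factor: float, width_factor: float):
    """Scale a concatenation-based block: scaling depth changes the block's
    concatenated output width, so the transition layer width scales with it.
    Simplified illustration; assumes `depth` stacked paths of `width` channels."""
    depth = max(1, round(base_depth * depth_factor))
    width = math.ceil(base_width * width_factor / 8) * 8  # keep channels divisible by 8
    transition_in = depth * width  # channels seen by the transition layer
    return depth, width, transition_in

print(compound_scale(4, 64, depth_factor=1.5, width_factor=1.25))  # (6, 80, 480)
```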

RepVGG [25], a structural re-parameterization technique based on the Visual Geometry Group (VGG) architecture, optimizes indicators such as floating point operations per second (FLOPS), accuracy, and speed through re-parameterization. Applied naively, however, RepVGG's identity branch conflicts with the residual connections in Residual Neural Networks (ResNet) [26] and the concatenation connections in Densely Connected Convolutional Networks (DenseNet) [27], which provide a greater diversity of gradients for different feature maps. As a solution, YOLOv7 uses the RepVGG-style convolution without the identity connection (RepConvN) when re-parameterized convolutions replace convolutions in ResNet-style architectures. Ultimately, the architecture aims to optimize the accuracy of deep neural networks while reducing architectural complexity [9]; YOLOv7 further employs the planned re-parameterized model and an auxiliary head with an independent label assignment strategy.
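As an illustration of the re-parameterization idea, the sketch below fuses parallel 3×3, 1×1, and identity branches into a single 3×3 convolution, following the standard RepVGG derivation (batch-normalization fusion and biases omitted for brevity); it is not the YOLOv7 source code.

```python
import torch
import torch.nn.functional as F

def reparameterize(w3x3, w1x1, channels):
    """Merge 3x3 conv + 1x1 conv + identity branches into one 3x3 kernel.
    BatchNorm fusion is omitted; all branches are assumed bias-free."""
    # pad the 1x1 kernel to 3x3 so the branches can be summed
    w1x1_padded = F.pad(w1x1, [1, 1, 1, 1])
    # identity branch expressed as a 3x3 kernel with 1 at each channel's center
    w_id = torch.zeros_like(w3x3)
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    return w3x3 + w1x1_padded + w_id

channels = 8
w3x3 = torch.randn(channels, channels, 3, 3)
w1x1 = torch.randn(channels, channels, 1, 1)
w_fused = reparameterize(w3x3, w1x1, channels)

x = torch.randn(1, channels, 16, 16)
multi_branch = (F.conv2d(x, w3x3, padding=1)
                + F.conv2d(x, w1x1)
                + x)
single_branch = F.conv2d(x, w_fused, padding=1)
print(torch.allclose(multi_branch, single_branch, atol=1e-5))  # True
```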

3.3 DeblurGAN-v2 Architecture

GAN [10] is a type of deep learning architecture that consists of two models: a generator and a discriminator. The generator produces fake samples, which are compared with real input samples during training. The discriminator is trained to distinguish between the real and fake samples; the classification error is used to update its weights and improve its ability to discriminate.

During training, the generator creates synthetic data, and the discriminator judges whether the data it receives is real or generated. The process is designed to optimize the generator's ability to create realistic data that can fool the discriminator.

In the training process of a GAN, two networks are evaluated. The first is the generator network, which takes in noise data \(z\) and, through training, produces synthesized data \(G(z)\). The second is the discriminator network, which evaluates the difference between the synthesized data and the real data \(x\) and outputs a probability score. The value function \(V(D,G)\) [10] models the interaction between the generator and discriminator networks and is defined as follows:

$$ \min_{G}\max_{D} V(D,G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_{z}(z)}[\log(1 - D(G(z)))] $$
(1)

Let \(p_{z}\) be the distribution of the noise input \(z\) and \(p_{data}\) the distribution of the real samples; \(E\) denotes the empirical expectation over the indicated distribution. The discriminator maximizes \(E_{x \sim p_{data}(x)}[\log D(x)]\), assigning high probability to real samples \(x\), while its output \(D(G(z))\) for fake data \(G(z)\) should be low. The generator, in turn, minimizes \(E_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]\), pushing the distribution of \(G(z)\) toward that of the real samples.
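A minimal, self-contained sketch of one adversarial update corresponding to Eq. (1) is shown below, using the common non-saturating generator objective; the tiny fully connected networks and dimensions are placeholders, not a deblurring model.

```python
import torch
import torch.nn as nn

# placeholder networks; real models would be convolutional
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 64))
D = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

x_real = torch.randn(8, 64)          # stands in for real samples x ~ p_data
z = torch.randn(8, 16)               # noise input z ~ p_z

# discriminator step: maximize log D(x) + log(1 - D(G(z)))
d_loss = bce(D(x_real), torch.ones(8, 1)) + bce(D(G(z).detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# generator step: the non-saturating form maximizes log D(G(z))
g_loss = bce(D(G(z)), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```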

To fix vanishing gradients and stabilize training, the Least Squares GAN (LSGAN) discriminator [28] introduces a loss function with smoother, non-saturating gradients: the further fake samples lie from the decision boundary, the greater the penalty they receive. Minimizing this loss corresponds to minimizing the Pearson \(\chi^{2}\) divergence, which leads to better training stability. The formulas are as follows:

$$ \min_{D} V(D) = \frac{1}{2}E_{x \sim p_{data}(x)}\left[(D(x)-1)^{2}\right] + \frac{1}{2}E_{z \sim p_{z}(z)}\left[D(G(z))^{2}\right] $$
(2)
$$ \min_{G} V(G) = \frac{1}{2}E_{z \sim p_{z}(z)}\left[(D(G(z))-1)^{2}\right] $$
(3)
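Eqs. (2) and (3) translate directly into a few lines of tensor code; the sketch below assumes the discriminator outputs raw, unbounded scores.

```python
import torch

def lsgan_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Eq. (2): push real scores toward 1 and fake scores toward 0
    return 0.5 * ((d_real - 1) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Eq. (3): push fake scores toward 1 so the generator fools D
    return 0.5 * ((d_fake - 1) ** 2).mean()

d_real, d_fake = torch.randn(8, 1), torch.randn(8, 1)
print(lsgan_d_loss(d_real, d_fake).item(), lsgan_g_loss(d_fake).item())
```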

DeblurGAN-v2 upgrades the adversarial loss to the double-scale relativistic least-squares loss (RaGAN-LS), which yields sharper, clearer restored images. It adopts the relativistic wrapping [29] on the LSGAN cost function, as follows:

$$ \begin{aligned} L_{D}^{RaLSGAN} ={}& E_{x \sim p_{data}(x)}\left[\left(D(x) - E_{z \sim p_{z}(z)} D(G(z)) - 1\right)^{2}\right] \\ &+ E_{z \sim p_{z}(z)}\left[\left(D(G(z)) - E_{x \sim p_{data}(x)} D(x) + 1\right)^{2}\right] \end{aligned} $$
(4)

The relativistic discriminator estimates the probability that given real data is more realistic than randomly sampled fake data, which makes training more stable and computationally efficient.
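Likewise, Eq. (4) can be written compactly; in this sketch the expectations over the opposite class are approximated by batch means, which is how relativistic-average losses are typically implemented.

```python
import torch

def ragan_ls_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Eq. (4): relativistic least-squares discriminator loss.
    Each score is compared against the mean score of the opposite class."""
    loss_real = ((d_real - d_fake.mean() - 1) ** 2).mean()
    loss_fake = ((d_fake - d_real.mean() + 1) ** 2).mean()
    return loss_real + loss_fake

print(ragan_ls_d_loss(torch.randn(8, 1), torch.randn(8, 1)).item())
```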

4 Experimental Results and Discussion

4.1 Experimental Setup

The training experiments were conducted on Google Colab, using an NVIDIA T4 Tensor Core GPU as the primary computing device. The software implementation uses the Python programming language, with the deep learning components implemented in the PyTorch framework and the Pyzbar package serving as the barcode reader for decoding the QR codes. In addition, the QR code images were recorded with a homemade Donkey Car [30] equipped with a data collector and a Raspberry Pi 4 Model B, as shown in Fig. 4. A Raspberry Pi Camera Module v2 serves as the camera module in the hardware implementation.

Fig. 4. Homemade Donkey Car with data collector and Raspberry Pi
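For the decoding step, a minimal Pyzbar usage sketch is shown below; the image path is hypothetical.

```python
import cv2
from pyzbar.pyzbar import decode

image = cv2.imread("qr_frame.jpg")  # hypothetical captured frame
results = decode(cv2.cvtColor(image, cv2.COLOR_BGR2GRAY))
for r in results:
    print(r.type, r.data.decode("utf-8"), r.rect)  # e.g. QRCODE, payload, bounding box
if not results:
    print("scan failed: no decodable QR code in this frame")
```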

The dataset used in this research is the QR code image dataset described in Sect. 3. It comprises 1099 annotated QR code images captured by cameras mounted on the Donkey Car, divided into 779 training samples and 320 testing samples. The shooting scenes were captured in both indoor and outdoor settings: the indoor shots were taken in a room, while the outdoor shots were taken beside the bike lane in Dahan River Riverside Park, as shown in Fig. 5, with QR codes in three sizes: 10 cm × 10 cm, 15 cm × 15 cm, and 20 cm × 20 cm.

Fig. 5. Indoor and outdoor scenarios

4.2 Parameters Settings and Evaluation Metrics

Table 1. Parameter settings for each YOLO model

The parameter settings of each YOLO model are shown in Table 1. The backpropagation algorithm was employed in conjunction with the Adam optimizer [34] for gradient descent during training. The models were trained for 50 epochs with a batch size of 8, and the learning rate was set to 0.01 to facilitate convergence. The only exception is the You Only Look Once version 4 tiny (YOLOv4-tiny) [31] model, which was trained for 60 epochs to ensure optimal performance. Some models used Complete-IoU [35] and the others Focal loss [36] as the loss function. This study evaluates model performance in two stages. The first stage is QR code detection, with Precision, Recall, F1 score, and Intersection over Union (IOU) as evaluation metrics. The second stage is deblurring the detected QR codes with the pretrained DeblurGAN-v2 model, evaluated by the numbers of successfully and unsuccessfully scanned QR code images.

$$Precision=\frac{TP}{TP + FP}$$
(5)
$$Recall=\frac{TP}{TP + FN}$$
(6)
$$F1=\frac{2 * precision * recall }{precision + recall}$$
(7)
$$ IOU = \frac{\text{Object} \cap \text{Detected box}}{\text{Object} \cup \text{Detected box}} $$
(8)

where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative; the object refers to the area of the ground-truth object, while the detected box refers to the predicted candidate area, and IOU divides their overlap area by their union area [32].
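For reference, Eqs. (5)-(8) can be implemented in a few lines; the sketch below takes raw counts and axis-aligned boxes in (x1, y1, x2, y2) form.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)           # Eq. (5)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)           # Eq. (6)

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)      # Eq. (7)

def iou(box_a, box_b) -> float:
    """Eq. (8) for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

p, r = precision(95, 5), recall(95, 10)
print(p, r, f1(p, r), iou((0, 0, 10, 10), (5, 5, 15, 15)))  # IOU = 25/175
```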

4.3 Compared Approaches

In this study, the models were trained and validated on the indoor data, and the data collected from outdoor scenes were used as the testing set to evaluate the various YOLO models shown in Table 2: YOLOv4, YOLOv4-tiny, YOLOv7, YOLOv7-tiny [9], and a larger YOLOv7 model, YOLOv7-W6 [9, 33]. The main comparison criteria were detection accuracy and efficiency. The results showed that the YOLOv4 model exhibited good detection performance; relative to it, the YOLOv4-tiny model decreased Precision by 0.03, F1-score by 0.02, and IOU by 0.021. The YOLOv7 model shows a significant performance increase over both YOLOv4 and YOLOv4-tiny. The YOLOv7-tiny model decreases only Precision by 0.01 and IOU by 0.021. The YOLOv7-W6 model achieved the best performance among all models, with a Precision of 0.97 and an IOU of 0.8611. Additionally, all YOLOv7 variants outperform the YOLOv4 variants, with F1-scores of almost 0.99.

The deblurring experiments apply pre-trained models to the testing set and scan the restored images with a barcode reader; the comparison is shown in Table 3. With the Donkey Car driving in the 0–25 km/h speed range, 212 images were compared across models: no deblurring, Deep Deblur [14], DeblurGAN [15], and DeblurGAN-v2 with the sophisticated MobileNet and Inception-ResNet-v2 backbones [11]. Without deblurring, the QR code images have the highest scan-failure rate, since the more blurred an image is, the less reliably it can be scanned. The Deep Deblur model produces more successful scans than no deblurring, but it still fails on the most images among the deblurring models. The DeblurGAN model achieves the same numbers of successful and failed scans as Deep Deblur. In contrast, DeblurGAN-v2 (Inception-ResNet-v2) raises the number of successfully scanned QR codes to 89, a significant improvement over the others. Nevertheless, there is still room to increase successful scans and reduce failures.

Table 2. A comparison of QR code detection results.
Table 3. A comparison of QR code deblurring and scanning results.
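The two-stage evaluation reported in Table 3 can be summarized by the following sketch; `detect_qr` and `deblur` are hypothetical stand-ins for the trained YOLOv7 detector and the DeblurGAN-v2 restorer, not released code from this paper.

```python
import cv2
from pyzbar.pyzbar import decode

def evaluate_scan_rate(image_paths, detect_qr, deblur):
    """Count successful/failed scans after detection and deblurring.
    detect_qr(image) -> (x1, y1, x2, y2) or None; deblur(crop) -> restored crop.
    Both callables are hypothetical stand-ins for the trained models."""
    success = failure = 0
    for path in image_paths:
        image = cv2.imread(path)
        box = detect_qr(image)                    # stage 1: YOLO-style detection
        if box is None:
            failure += 1
            continue
        x1, y1, x2, y2 = box
        restored = deblur(image[y1:y2, x1:x2])    # stage 2: GAN-based deblurring
        if decode(restored):                      # stage 3: barcode reader scan
            success += 1
        else:
            failure += 1
    return success, failure
```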

5 Conclusion

This research presents a deep learning method, combined with a Donkey Car, for efficiently reading QR codes from a fast-moving vehicle. The object detection model detects QR codes that occupy small, low-density areas of the image; the detected codes are then processed with an image enhancement technique to reduce recognition errors and missed QR code images. As a result, the QR code reading success rate is significantly increased, leading to more accurate detection. A limitation of this research is the challenge of collecting data across every QR code size, lighting condition, roadside placement, and road width. In future work, data augmentation and attention mechanisms can be added to the training process for QR code detection, and corresponding image processing techniques can be developed to further enhance the scanning rate of these codes. By addressing these challenges, the accuracy and efficiency of the object detection model can be improved.