1 Introduction

The continuous development of autonomous driving technology has made unmanned driving possible, with perception models playing a crucial role in this field. Perception is one of the core modules of autonomous driving systems, with traffic target detection being a vital component of the perception module. Rapid and accurate detection can assist drivers or autonomous vehicles in making decisions earlier, thereby enhancing safety.

Detection algorithms extract information about objects, behaviors, key points, and so on from images and separate objects from the background. Traditional detection algorithms often employ sliding windows to generate candidate bounding boxes, from which relevant features are then extracted. Common feature descriptors include Haar features [1], Histogram of Oriented Gradients (HOG) features [2], and others. Classifiers such as Support Vector Machines (SVM) [3] and AdaBoost [4] are then used to classify these features and determine the location and type of each object. However, these conventional methods rely on prior knowledge and are primarily suited to simple scenes; in complex, changing environments their performance falls short of practical requirements. In recent years, deep learning-based detection methods have shown notable performance in both accuracy and real-time processing. They fall into two primary categories. The first comprises single-stage detection algorithms such as You Only Look Once (YOLO) [5,6,7,8,9,10] and the Single Shot MultiBox Detector (SSD) [11], which perform detection directly on the image and achieve efficient real-time performance. The second comprises two-stage detection algorithms such as Fast R-CNN [12] and Faster R-CNN [13], which first generate candidate regions and then perform classification and localization, generally achieving higher accuracy.

Deep learning-based target detection has been widely applied in fields such as underwater object detection [14], driver fatigue detection [15], and small target detection [16]. Nonetheless, in complex traffic scenes with significant differences in target sizes, especially for small targets such as pedestrians and distant vehicles, performance remains unsatisfactory. In real-world scenes, targets move quickly, the time available for capture and detection is short, and computational resources are limited, making rapid and accurate detection of traffic objects even more challenging. Di et al. [17] studied target detection and tracking in such dynamic scenes and achieved good results. The present study proposes an improved YOLOv8 algorithm to address these issues, significantly enhancing the model's capability to detect multi-scale targets while balancing detection speed and precision. The contributions of this paper are as follows:

1. To address the significant scale differences between vehicles and pedestrians, we design a Deep and Filtered Path Aggregation Network (DF-PAN) that exploits the complementary characteristics of deep and shallow features: the shallow information is filtered through weights generated from the deep features, removing redundant semantic information and highlighting small targets in the lower layers, while the deep information is filtered through weights generated from the shallow features, removing redundant localization information and highlighting the semantic information of the target, thereby achieving more efficient multi-scale fusion.

2. To address the challenge of limited computational resources, we devise a parameter-sharing detection head that passes the feature maps through a shared Conv_share module. A scaling factor modulates each feature map after it traverses Conv_share, yielding three distinct outputs. Sharing parameters in the detection head enhances the model's feature classification ability while reducing the number of parameters.

3. To tackle the complexity and high computational cost of the YOLOv8 network, we use FasterNet [18], a fast neural network, as the backbone to improve feature extraction speed and reduce model complexity.

The DF-YOLO algorithm significantly improves precision while reducing the number of model parameters. On the KITTI dataset, our method reaches 90.9% mAP at 77 frames/s with only 2.3 M parameters, a 3% mAP improvement over the baseline model (YOLOv8-n) and a 28.1% reduction in the number of parameters. In autonomous driving scenes, key factors such as model precision, speed, and the number of parameters must all be considered to ensure vehicle safety during operation. Our method achieves a satisfactory balance between real-time performance and precision and is well suited to autonomous driving scenes.

The paper is structured as follows: Sect. 2 provides a detailed review of existing research work. Following that, Sect. 3 delves into the improvement details of the algorithm. In Sect. 4, a comprehensive description and analysis of the experimental procedures are provided. Finally, Sect. 5 summarizes the work carried out in this study.

2 Related work

In recent years, numerous scholars have conducted extensive research on detection tasks in traffic scenes [19,20,21]. Hu et al. [22] proposed a cascaded vehicle detection method that integrates multi-feature fusion with Convolutional Neural Networks (CNNs), demonstrating exceptional robustness in complex driving environments. Ghosh [23] introduced a Faster R-CNN road vehicle detection approach employing multiple Region Proposal Networks (RPNs) of varying sizes to effectively detect vehicles of different scales, achieving promising results. Han [24] proposed a CNN enhanced with contextual information, which improves the accuracy of detecting small and occluded vehicles by progressively integrating shallow-layer information into deeper layers. Although these detection models achieve high precision, they fail to meet real-time requirements given the limited computational performance of onboard vehicle systems in real-world scenes. In terms of detection speed, single-stage detection networks are better suited to real-time traffic object detection than two-stage networks. Oreski [25] proposed the YOLO*C algorithm, which incorporates the MCTX (Multi-Context) context-aware module to exploit rich global context information, significantly improving the detection accuracy of small targets in complex traffic environments. Kang et al. [26] introduced YOLO-FA, a YOLO detector based on fuzzy attention that uses fuzzy entropy to weight features and lets the network focus on targets, effectively enhancing vehicle detection accuracy. However, these works do not address the accuracy loss caused by differences in target scales.

The scale problem lies at the heart of object detection, and many researchers have made outstanding contributions in this area. Li et al. [27] introduced a depth-based segmentation method and a multi-scale detection network aimed at substantially improving small object detection in vehicle detection systems, and validated its effectiveness experimentally. Yuan et al. [28] introduced a multi-scale feature network into the detection model to more accurately extract the features of small targets in traffic scenes. Khan et al. [29] proposed a robust method for generating object proposals that encodes objects of different sizes at different scales, achieving satisfactory results. In a follow-up work [30], they addressed scale variation by using feature maps from three dense blocks to construct three Region Proposal Networks (RPNs), each targeting objects of a different size and thereby generating multi-scale object proposals; integrating feature maps from different depths with RPNs tailored to various scales enhances detection across a range of scales. Other researchers have adopted multi-scale feature fusion to address the scale problem, fusing deep features with shallow features to obtain sufficient semantic information. The Feature Pyramid Network (FPN) [31] achieves multi-scale fusion by upsampling features and adding them to lower-level features, combining feature maps with strong low-resolution semantic information and those with rich high-resolution spatial information at minimal computational cost. Building upon FPN, the Path Aggregation Network (PANet) [32] incorporates bottom–up path enhancement to leverage precise localization information for enhancing the entire feature hierarchy. The Bidirectional Feature Pyramid Network (BiFPN) [33] proposes a more efficient bidirectional feature fusion to mitigate the information loss and redundancy that traditional feature pyramid networks suffer when extracting features of different scales.

Additionally, detection tasks in traffic scenes are significantly constrained by the following two challenges:

a. Traffic scenes are intricate and constantly changing, and small targets often appear alongside surrounding objects. Combined with the inherent characteristics of small targets, this prevents the model from extracting adequate feature information from them.

b. Targets from different categories display significant variations in size, and even targets within the same category may differ in size due to their positional differences.

To tackle the challenges mentioned above, inspired by MFDS-DETR [34], we design a Deep and Filtered Path Aggregation Network (DF-PAN) to better suit detection tasks in driving scenes. First, the top-level features filter the lower-level information through weights generated by the Feature Coordinate Filtering module (FCF), removing redundant semantic information to highlight small targets in the lower layers; second, the bottom-level features filter the top-level features through weights generated by the FCF, removing redundant localization information to highlight the semantic information of the target. DF-PAN aims to enhance the model's ability to fuse multi-scale features, thereby improving detection efficiency.

3 Method

3.1 The overall architecture of the improved algorithm

The DF-YOLO architecture consists primarily of three parts: a backbone network, a Deep and Filtered Path Aggregation Network (DF-PAN), and a parameter-sharing detection head. In the backbone, we replace the original structure with FasterNet, a fast neural network, to reduce redundant computation and improve spatial feature extraction. To tackle the multi-scale challenges arising from variations in object size, we propose DF-PAN, which effectively fuses deep and shallow features. In addition, we design a parameter-sharing detection head (PSD), which reduces the number of parameters and improves detection precision. The overall structure of the algorithm is shown in Fig. 1.

Fig. 1
figure 1

The overall structure of DF-YOLO

3.2 Backbone

The backbone network of YOLOv8 is overly large; using FasterNet as the feature extraction backbone in DF-YOLO aligns more closely with practical requirements. FasterNet addresses redundancy in convolutional neural networks by introducing Partial Convolution (PConv), which enhances spatial feature extraction while reducing unnecessary computation and memory access. The overall structure is shown in Fig. 2.

Fig. 2
figure 2

The overall structure of FasterNet

The architecture consists of four hierarchical stages, each preceded by either an embedding layer (a \(4 \times 4\) regular convolution with a stride of 4) or a merging layer (a \(2 \times 2\) regular convolution with a stride of 2) for spatial downsampling and channel expansion. Each stage contains a FasterNet block built around a PConv layer. Unlike conventional convolution, the PConv layer applies the convolution only to a subset of channels to extract spatial features, leaving the remaining channels unchanged. In the detection task of this paper, the input and output feature maps have the same number of channels, so the FLOPs of PConv are \(h \times w \times k^{2} \times c_{p}^{2}\); for \(r = c_{p}/c = 1/4\), the FLOPs of PConv are only \(1/16\) of those of a standard convolution, and its memory access is \(1/4\). To utilize information from all channels more effectively, a pointwise convolution (PWConv) is added after PConv. PWConv lets feature information flow through all channels, allowing the model to focus more on the central positions.
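To make the PConv computation concrete, the following is a minimal PyTorch sketch, not the official FasterNet implementation: a partial convolution that convolves only the first \(c_{p} = rc\) channels and concatenates the untouched channels back, followed by the pointwise convolutions described above. Class names such as `PartialConv` and the expansion ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Sketch of PConv: convolve only the first c_p = r*c channels, leave the rest untouched."""
    def __init__(self, channels: int, r: float = 0.25, kernel_size: int = 3):
        super().__init__()
        self.cp = int(channels * r)                      # channels that are convolved
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.cp, x.shape[1] - self.cp], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)     # untouched channels are concatenated back


class FasterNetBlockSketch(nn.Module):
    """PConv followed by two pointwise (1x1) convolutions, mirroring the PConv + PWConv pattern."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PartialConv(channels)
        self.pw = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x):
        return x + self.pw(self.pconv(x))                # residual connection

# FLOPs intuition: PConv costs h*w*k^2*c_p^2; with r = c_p/c = 1/4 this is
# (1/4)^2 = 1/16 of the h*w*k^2*c^2 cost of a regular convolution.
```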

Because small targets occupy only a small portion of the image, models often overlook their spatial information. The FasterNet backbone excels at spatial feature extraction, enhancing the model's ability to detect small targets. Additionally, FasterNet requires less memory access and is better suited to traffic scenes with limited computing resources.

3.3 DF-PAN

In complex traffic scenes, significant size variations exist among targets across different categories, and targets within the same category may also vary in size due to their distinct spatial locations. This inherent multi-scale diversity degrades the model's detection and recognition capability. The Feature Pyramid Network (FPN) fuses extracted multi-scale information in a top–down manner, enhancing model precision; however, this structure is less effective at transmitting localization information. To address this limitation, PANet introduces a bottom–up pathway that transfers low-level details to the upper layers, yielding feature maps with more comprehensive semantic and spatial information and thus stronger feature representation. To better tackle the inherent multi-scale issue in driving scenes, we design a Deep and Filtered Path Aggregation Network (DF-PAN). Its structure is shown in Fig. 3 and consists of two modules:

  (a) Feature Coordinate Filtering module (FCF)

  (b) Deep Feature Fusion module (DF).

Fig. 3
figure 3

The overall structure of DF-PAN

(a) Feature coordinate filtering module (FCF).

The feature coordinate filtering module estimates the importance of the feature map along different dimensions and selectively filters out unimportant features. This enriches the semantic content of the generated features and ultimately improves the model's ability to detect targets of various scales. The original channel-wise attention module considers only dependencies between channels, neglecting the positional information that is important for capturing target structure. To tackle this issue, FCF decomposes global pooling into two parallel feature encoding processes, capturing inter-feature dependencies while preserving accurate positional information and thereby significantly enhancing the model's ability to detect objects of different scales. As illustrated in Fig. 4, the feature coordinate filtering module processes the input feature map, where C denotes the number of channels, H its height, and W its width. For a given input X, each channel is encoded separately along the horizontal and vertical coordinates using pooling kernels with the two spatial extents (H, 1) and (1, W). These encodings are then combined, and a sigmoid activation determines the weights for the horizontal and vertical coordinates. The obtained weights are multiplied with the feature maps after global average pooling and finally with the corresponding proportion of the feature maps to produce the output. This transformation allows the module to capture long-range dependencies along one spatial direction while preserving positional information along the other. The purpose of average pooling is to sample all data of the feature map evenly and minimize information loss, which is particularly crucial for small target detection in traffic scenes.
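The description above follows a coordinate-attention style of filtering. The sketch below illustrates one plausible PyTorch realization under that reading; the bottleneck reduction ratio and the exact way the pooled branches are recombined are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class FCFSketch(nn.Module):
    """Coordinate-attention-style filtering: encode along H and W separately,
    derive per-direction weights with a sigmoid, and reweight the input."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(8, channels // reduction)              # bottleneck width (assumed)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))    # (H, 1) pooling
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))    # (1, W) pooling
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.attn_h = nn.Conv2d(mid, channels, 1)
        self.attn_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xh = self.pool_h(x)                                      # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)                  # (B, C, W, 1)
        y = self.reduce(torch.cat([xh, xw], dim=2))              # joint encoding of both directions
        yh, yw = torch.split(y, [h, w], dim=2)
        wh = torch.sigmoid(self.attn_h(yh))                      # (B, C, H, 1) vertical weights
        ww = torch.sigmoid(self.attn_w(yw.permute(0, 1, 3, 2)))  # (B, C, 1, W) horizontal weights
        return x * wh * ww                                       # filtered feature map
```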

Fig. 4
figure 4

The overall structure of FCF

(b) Deep feature fusion module (DF).

In the feature maps extracted by the backbone, deep features carry rich semantic information but relatively little localization information, whereas shallow features locate targets accurately but carry limited semantic information. We therefore propose a deep feature fusion module (Fig. 5): the deep features filter the shallow information through weights generated by the FCF, removing redundant semantic information to highlight small targets in the lower layers; the shallow features filter the deep features through weights generated by the FCF, removing redundant localization information to highlight the semantic information of the target. Establishing connections between feature maps at different levels enables more effective feature fusion and allows the network to detect objects of various scales.

Fig. 5
figure 5

Deep Fusion block structure diagram. Given two features \(f_{1}\) and \(f_{2}\), to achieve uniform dimensions, upsampling and downsampling are performed on the deep and shallow features respectively. Subsequently, weights are generated through the Feature Coordinate Filtering (FCF) to filter the corresponding features. Finally, the filtered features are fused together, and the output is expressed as: \(f_{out} = f_{2} \times FCF(f_{1^{\prime}} ) + f_{1^{\prime}}\)

Moreover, smaller objects occupy fewer pixels, which often leads to missed and false detections. To address this issue, we incorporate deeper upsampling in the feature fusion module to enable deeper fusion.
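The fusion rule in the caption of Fig. 5, \(f_{out} = f_{2} \times FCF(f_{1^{\prime}}) + f_{1^{\prime}}\), can be sketched as follows. The choice of nearest-neighbor resampling and the assumption that channel counts already match are illustrative, not specified in the paper; `FCFSketch` refers to the sketch given for module (a).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepFusionSketch(nn.Module):
    """Fuses a deep feature f1 with a shallow feature f2 following the Fig. 5 caption:
    f_out = f2 * FCF(f1') + f1', where f1' is f1 resampled to f2's spatial size.
    `fcf` is any module producing a filtered map of the same shape (e.g. the FCF sketch above)."""
    def __init__(self, fcf: nn.Module):
        super().__init__()
        self.fcf = fcf

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # Up- or downsample the deep feature to match the shallow feature's resolution.
        f1p = F.interpolate(f1, size=f2.shape[-2:], mode="nearest")
        # The shallow feature is gated by the FCF-filtered deep feature, and the
        # resampled deep feature is added back, per f_out = f2 * FCF(f1') + f1'.
        return f2 * self.fcf(f1p) + f1p

# Usage sketch (channel counts assumed equal, e.g. after 1x1 convolutions):
# df = DeepFusionSketch(FCFSketch(channels=256))
# out = df(deep_feat, shallow_feat)
```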

3.4 Parameter sharing detection

The detection head of YOLOv8-n accounts for 40% of the computational workload in the model. To make the algorithm better match the autonomous driving scenario, we designed a lightweight detection head called Parameter Sharing Detection (PSD) (Fig. 6), which reduces computational effort by sharing parameters. The PSD module takes feature maps from different scales obtained from DF-PAN as input to the shared convolution module. Moreover, we introduce a learnable scaling factor to adjust the feature maps after Conv_share, resulting in three distinct outputs from the detection layers. This parameter-sharing approach not only reduces the number of parameters in the model but also enables feature extraction capabilities to be shared across different locations, thereby enhancing both the model's efficiency and generalization ability. Experimental results demonstrate that PSD effectively reduces computational workload while maintaining high precision.
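A minimal sketch of the parameter-sharing idea: one shared convolution stack (standing in for Conv_share) processes every scale from the neck, and a learnable per-level scaling factor adjusts each output. A YOLOv8-style head separates classification and regression branches; that detail is omitted here for brevity, and the channel width and number of levels are assumptions.

```python
import torch
import torch.nn as nn

class SharedDetectSketch(nn.Module):
    """Parameter-sharing detection head sketch: a single shared convolution stack is applied
    to every input scale from the neck, and a learnable scalar per level rescales its output."""
    def __init__(self, channels: int, num_outputs: int, num_levels: int = 3):
        super().__init__()
        self.conv_share = nn.Sequential(                 # weights shared across all levels
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
            nn.Conv2d(channels, num_outputs, 1),
        )
        # One learnable scaling factor per detection level.
        self.scales = nn.Parameter(torch.ones(num_levels))

    def forward(self, feats):
        # feats: list of neck feature maps, one per scale, each with `channels` channels.
        return [self.conv_share(f) * self.scales[i] for i, f in enumerate(feats)]
```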

Fig. 6
figure 6

Overall structure of PSD

4 Experiments and analysis

4.1 Datasets

We conducted experiments on the KITTI dataset [35], BDD100K dataset [36], and SODA 10 M dataset [37] to comprehensively evaluate the proposed method.

The KITTI dataset is one of the most commonly used benchmark datasets internationally for evaluating detection algorithms in autonomous driving scenarios. It comprises real image data collected from various environments such as urban, rural, and highway scenes, featuring up to 15 cars and 30 pedestrians per image along with varying degrees of occlusion and truncation, posing a significant challenge in the field of object detection. The dataset annotations cover object categories including Car, Truck, Van, Tram, Pedestrian, Person_sitting, Cyclist, and Misc.

The BDD100K dataset is a publicly available driving dataset released by the University of California, Berkeley. It comprises 100,000 annotated frames collected under various weather conditions (sunny, cloudy, overcast, rainy, snowy, foggy), different times of the day (daytime, nighttime, dawn/dusk), and diverse scenes (residential areas, urban streets, highways, etc.). Due to its diverse geographical, environmental, and meteorological conditions, the BDD100K dataset is an excellent choice for evaluating network reliability.

The SODA10M dataset covers a variety of road scenes, taking into account diverse weather conditions and collecting data during various periods, including daytime, nighttime, early morning, and dusk. The dataset is annotated with six object categories (Pedestrian, Cyclist, Car, Truck, Tram, Tricycle). SODA10M presents a variety of environmental conditions and approximates the diversity of real driving environments.

4.2 Training equipment and parameters set

Throughout the training process, we optimize the model using stochastic gradient descent with a momentum of 0.937 and a learning rate of 0.01. The batch size is 64 and the number of epochs is 300. Input images are resized to 640 × 640 pixels (the training environment and hardware platform are shown in Table 1).

Table 1 Training environment and hardware platform parameters table
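For reference, the hyperparameters above map onto a standard Ultralytics training call as sketched below. The model and dataset configuration file names are placeholders, and the sketch trains the baseline YOLOv8-n architecture; the DF-YOLO modifications described in Sect. 3 are not part of the public package.

```python
from ultralytics import YOLO

# Hyperparameters mirror Sect. 4.2: SGD, momentum 0.937, lr 0.01, batch 64,
# 300 epochs, 640x640 input. "kitti.yaml" is an assumed dataset config name.
model = YOLO("yolov8n.yaml")
model.train(
    data="kitti.yaml",
    optimizer="SGD",
    momentum=0.937,
    lr0=0.01,
    batch=64,
    epochs=300,
    imgsz=640,
)
```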

The loss value reflects the convergence state of the model during training. The loss functions used in this paper comprise a classification loss and a bounding box loss. The classification loss is the binary cross-entropy loss, denoted \(L_{BCE}\). The bounding box loss includes the distribution focal loss (DFL), denoted \(L_{DFL}\), and the complete IoU (CIoU) loss, denoted \(L_{CIoU}\). Thus, the total loss \(L_{total}\) can be represented as

$$L_{total} = \lambda_{DFL} L_{DFL} + \lambda_{CIoU} L_{CIoU} + \lambda_{BCE} L_{BCE} ,$$
(1)

where \(\lambda_{BCE}\), \(\lambda_{DFL}\), and \(\lambda_{CIoU}\) are the corresponding weighting coefficients

$$L_{BCE} = - \left[ {y_{n} \log x_{n} + (1 - y_{n} )\log (1 - x_{n} )} \right],$$
(2)

where \(x_{n}\) is the predicted classification of each object and \(y_{n}\) is the ground truth of each object

$$\begin{gathered} L_{DFL} (S_{i} ,S_{i + 1} ) = - ((y_{i + 1} - y)\log (S_{i} ) + (y - y_{i} )\log (S_{i + 1} )) \hfill \\ S_{i} = \frac{{y_{i + 1} - y}}{{y_{i + 1} - y_{i} }},S_{i + 1} = \frac{{y_{i} - y}}{{y_{i} - y_{i + 1} }}, \hfill \\ \end{gathered}$$
(3)

where \(y\) is the ground-truth bounding box coordinate, and \(y_{i}\) and \(y_{i + 1}\) are the two discrete values adjacent to \(y\) (with \(y_{i} \le y \le y_{i + 1}\)).

$$\begin{gathered} L_{CIoU} = 1 - IoU + \frac{{\rho^{2} (b,b^{gt} )}}{{c^{2} }} + \alpha v \hfill \\ v = \frac{4}{{\pi^{2} }}(\arctan \frac{{w^{gt} }}{{h^{gt} }} - \arctan \frac{w}{h})^{2} \hfill \\ \alpha = \frac{v}{(1 - IoU) + v}, \hfill \\ \end{gathered}$$
(4)

where \(b\) is the central point of the prediction box, \(b^{gt}\) is the central point of the ground truth box. \(\rho\) is the Euclidean distance between prediction and ground truth points. \(c\) is the diagonal length of the smallest enclosing rectangle of the two boxes. \(v\) and \(\alpha\) are ratio coefficients. \(w^{gt}\) and \(h^{gt}\) are the width and height of the ground truth box, and \(w\) and \(h\) are the width and height of the prediction box.
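As a concrete reference for Eq. (4), the following sketch computes the CIoU loss for corner-format boxes; the combination with \(L_{DFL}\) and \(L_{BCE}\) per Eq. (1) is indicated in the trailing comment, with the weighting coefficients left symbolic.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU loss per Eq. (4) for boxes in (x1, y1, x2, y2) format; a minimal sketch."""
    # Intersection and IoU.
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance rho^2 and squared diagonal c^2 of the smallest enclosing box.
    cxp, cyp = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cxt, cyt = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term v and trade-off coefficient alpha.
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wt, ht = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

# Total loss per Eq. (1), with the lambdas left as symbolic weighting coefficients:
# l_total = lambda_dfl * l_dfl + lambda_ciou * l_ciou + lambda_bce * l_bce
```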

4.3 Evaluation metrics

To assess the effectiveness of the model, we use mAP, the number of parameters, Floating Point Operations (FLOPs), and Frames Per Second (FPS) as evaluation metrics. The number of parameters describes the space complexity of the network and reflects the storage space occupied by the model. FLOPs and FPS describe the time complexity of the network: FLOPs represent the number of floating-point operations performed during inference, indicating the computational workload, while FPS indicates the number of frames the network processes per second, reflecting its real-time performance. mAP is computed from the Average Precision (AP) of each class, which in turn is derived from Precision (P) and Recall (R). Equations (5) and (6) define Precision and Recall, respectively.

$$P = \frac{TP}{{TP + FP}}$$
(5)
$$R = \frac{TP}{{TP + FN}}$$
(6)

where TP denotes true positives, FP denotes false positives, and FN denotes false negatives.

AP denotes the area under the precision–recall curve, providing a comprehensive evaluation of model accuracy by considering both precision and recall for each class. A higher AP indicates better model performance. Equation (7) expresses AP, whereas mAP, the average of AP across all classes, is given by Eq. (8)

$$AP = \int_{0}^{1} {P(R)dr}$$
(7)
$$mAP = \frac{{\sum\nolimits_{i = 1}^{N} {AP_{i} } }}{N}$$
(8)
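A short numerical sketch of Eqs. (7) and (8): AP as the area under the precision–recall curve and mAP as the class-wise mean. All-point interpolation is assumed here; the paper does not state which interpolation scheme is used.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (Eq. 7) using all-point interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically decreasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP (Eq. 8) is then the mean of the per-class APs:
# map_value = sum(ap_per_class) / len(ap_per_class)
```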

To provide a more detailed description of the model's performance, we use the TIDE [38] evaluation indicators and categorize the model's errors into six types: Classification error (Cls), Localization error (Loc), Classification and Localization error (Cls + Loc), Duplicate detection error (Duplicate), Background error (Bkgd), and Missed detection error (Miss). \(IoU_{max}\) denotes the maximum IoU overlap between a predicted bounding box and the ground truth of the given class. The foreground IoU threshold is denoted \(t_{f}\) and the background threshold \(t_{b}\), set to 0.5 and 0.1, respectively. Details are shown in Fig. 7.

Fig. 7
figure 7

Error type definitions. Red boxes represent false positives, green boxes represent ground truth, and orange boxes represent true positives. The IoU with ground truth for each error type is indicated by an orange highlight and shown in the bottom row
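The following sketch shows, in simplified form, how a single false-positive prediction could be assigned to one of the TIDE error types using the thresholds \(t_{f} = 0.5\) and \(t_{b} = 0.1\) defined above. Duplicate and missed-detection errors require matching state across all predictions and ground truths, so they are not covered by this toy function.

```python
def classify_error(iou_max: float, correct_class: bool,
                   t_f: float = 0.5, t_b: float = 0.1) -> str:
    """Simplified assignment of a false-positive prediction to a TIDE error type."""
    if correct_class:
        if t_b <= iou_max < t_f:
            return "Loc"            # right class, poorly localized
        if iou_max < t_b:
            return "Bkgd"           # fires on background
        return "TP-or-Duplicate"    # IoU >= t_f with correct class: TP or duplicate
    if iou_max >= t_f:
        return "Cls"                # well localized but wrong class
    if t_b <= iou_max < t_f:
        return "Cls+Loc"            # wrong class and poorly localized
    return "Bkgd"
```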

4.4 Performance comparison between DF-YOLO and YOLOv8

We validate our model on the KITTI dataset, comparing DF-YOLO against the baseline YOLOv8-n (as shown in Table 2). The results show that DF-YOLO delivers a significant performance improvement over YOLOv8-n: mAP increases by 3%, while the number of parameters decreases by 28.1%. Moreover, the localization error decreases by 1.65% and the missed detection error decreases by 1.1%. When the batch size is set to 1, the model achieves 77 FPS, meeting real-time requirements. These experiments confirm that the proposed model has superior object localization capabilities and improved detection of small objects. To further illustrate its advantages, we visually compare the detection results of DF-YOLO and the original YOLOv8-n on the KITTI dataset (as shown in Fig. 8). The left column displays the detection results of the original model: as shown in the first and third images, its performance on heavily occluded objects is poor, and, as shown in the second image, it cannot accurately detect distant small targets, resulting in missed and false detections. In contrast, the right column presents the detection results of DF-YOLO. Our model effectively reduces missed and false detections and remains effective for occluded and distant small objects. In the third image, although our method reduces the probability of missed and false detections for occluded and small objects, some still occur: because the targets are distant or occluded, the detector is constrained by insufficient resolution and struggles to distinguish the boundaries and features of each target. Although our method may miss detections in certain scenarios, considering the diversity of real-world application needs and the performance of the overall system, the impact on practical applications is relatively minor. The visualized results further confirm that DF-YOLO is better suited for complex autonomous driving scenes.

Table 2 Performance comparison between DF-YOLO(n) and YOLOv8-n (%)
Fig. 8
figure 8

Visualization of detection results of YOLOv8-n detector (left) and DF-YOLO detector (right) based on the KITTI dataset. Green, blue, and red boxes represent true positives (TP), false positives (FP), and false negatives (FN), respectively. Yellow boxes indicate zoomed-in views

4.5 Ablation study

The effectiveness of the different modules is validated through extensive ablation experiments, where Baseline refers to YOLOv8-n and √ indicates that the corresponding module is included. The results (Table 3) show that, compared to YOLOv8-n, adding the FasterNet, DF-PAN, and PSD modules individually increases mAP by 1.4%, 1.5%, and 0.5%, respectively, with DF-PAN providing the largest improvement. Error analysis suggests that the FasterNet module reduces classification, localization, and background errors; the DF-PAN module significantly decreases missed detections; and the PSD module reduces the probability of localization errors. Combining FasterNet with DF-PAN and with PSD yields mAP improvements of 2.1% and 1.8%, respectively. With all modules combined, mAP reaches 90.9%, a 3% improvement over YOLOv8-n, while the number of parameters is reduced by 28.1% and the model runs at 77 FPS. However, in the process of improving accuracy we observe an increase in FLOPs and a slight decrease in FPS; we consider sacrificing some speed for higher accuracy reasonable for the detection task addressed in this paper. These experiments demonstrate both the effectiveness of the proposed modules and their suitability for deployment on resource-constrained mobile platforms.

Table 3 Ablation experiments of different modules (%)

4.6 Compare with mainstream methods

This section compares DF-YOLO with models such as YOLOv5, YOLOP [39], YOLOPv2 [40], A-YOLOM [41], CF-YOLOX [42], Faster R-CNN, DINO-Deformable-DETR [43], and HybridNets [44] on the KITTI dataset. The experimental results are shown in Table 4, where bs = 1 denotes the FPS at a batch size of 1. The difference between DF-YOLO(n) and DF-YOLO(l) lies in the complexity of the backbone network, where 'n' denotes the lightweight variant: DF-YOLO(n) reduces complexity and is suitable for deployment on edge devices with limited computing resources, while DF-YOLO(l) provides higher precision at an increased computational cost. According to the experimental results, DF-YOLO(n) achieves the most efficient trade-off. Compared with YOLOP, A-YOLOM, and HybridNets, DF-YOLO(n) has fewer parameters, higher precision, and higher FPS. Compared to YOLOPv2, DINO-Deformable-DETR, and CF-YOLOX, DF-YOLO(n) has significant advantages in model parameters and speed. With similar parameters and computation costs, our framework outperforms other object detectors, achieving a balance between real-time performance and accuracy.

Table 4 Comparison of detection results of different algorithms on the KITTI dataset (%)

4.7 Performance on the SODA 10 M dataset

This section compares DF-YOLO with models such as YOLOv5, YOLOP, YOLOPv2, A-YOLOM, TTD-YOLO [45], Faster R-CNN, DINO-Deformable-DETR, and HybridNets on the SODA10M dataset. The experimental results, shown in Table 5, indicate that DF-YOLO outperforms the other algorithms, achieving higher detection accuracy while meeting real-time requirements. To better illustrate the reliability of the proposed framework, we visually analyze the model's detection results under various environmental conditions on the SODA10M dataset (as shown in Fig. 9). Under the three environmental conditions (a), (b), and (c), the original model detects 6, 6, and 0 targets, respectively, while DF-YOLO detects 10, 9, and 3 targets. The visualization results show that the proposed framework generalizes well across different environments.

Table 5 Comparison of detection results of different algorithms on the SODA10M dataset (%)
Fig. 9
figure 9

Comparison of detection results of YOLOv8-n detector (left) and DF-YOLO detector (right) in different traffic scenarios

4.8 Performance on the BDD100K dataset

This section compares DF-YOLO with models such as YOLOv5, YOLOP, YOLOPv2, A-YOLOM, CF-YOLOX, MCS_YOLO [46], Faster R-CNN, DINO-Deformable-DETR, and HybridNets on the BDD100K dataset. The experimental results, shown in Table 6, indicate that DF-YOLO outperforms the other algorithms, achieving higher detection accuracy while meeting real-time requirements.

Table 6 Comparison of detection results of different algorithms on the BDD100K dataset (%)

5 Conclusion

High precision and real-time performance are essential requirements for detection algorithms in autonomous driving systems. This paper proposes DF-YOLO, an algorithm that balances real-time performance and accuracy in complex scenes with limited computational resources, addressing the accuracy loss caused by large differences in target scales in traffic scenarios. High-precision, real-time detection helps autonomous vehicles quickly identify and predict potentially dangerous situations, make timely decisions and plans, reduce the risk of accidents, and improve safety. DF-YOLO comprises the FasterNet feature extraction network, DF-PAN, and PSD. FasterNet extracts multi-scale feature information more efficiently; DF-PAN effectively fuses shallow and deep features, significantly improving the model's ability to detect targets with large scale differences; PSD shares feature extraction capabilities across different locations, reducing the number of model parameters while enhancing efficiency and generalization. Extensive ablation experiments demonstrate that all three modules contribute to improving accuracy. After rigorous ablation experiments and comparative analysis, mAP on the KITTI dataset reaches 90.9%, a 3% improvement over the baseline, with a significantly reduced probability of missed and false detections and a 28.1% reduction in the number of parameters. These results demonstrate the effectiveness of DF-YOLO. Overall, this study demonstrates the model's potential in resource-constrained environments and for embedded applications. However, exploration in this field remains challenging, and future work will focus on optimizing the model structure to further improve efficiency and precision.