Introduction

Currently, Unmanned Aerial Vehicles (UAV) find applications in nearly a hundred fields of civil use, spanning agriculture, forestry, power, environmental protection, land, ocean, water conservancy, and transportation, among others (Chao et al. 2022). In the field of transportation, UAV, with their compact size, high maneuverability, and flexible deployment, offer significant advantages in areas such as evidence collection for traffic violations, traffic guidance, and routine inspections (Ling et al. 2022). The cameras mounted on UAV can transmit real-time footage, allowing operators on the ground to view the captured scenes. The flexibility of UAV enables them to adapt to various complex conditions, and aerial images provide more information than ground perspectives. Through artificial intelligence technology, valuable information can be extracted from aerial images, such as tracking objects detected in the images (Xue et al. 2023a, 2023b; Sun et al. 2024), thereby improving the work efficiency of data analysts.

In traffic patrols, management personnel need to pay particular attention to vehicles on the road. Therefore, avoiding the detection of vehicles outside the designated road area becomes crucial. One solution is to extract road information and retain only detection targets that intersect with the road; however, this approach relies on accurately and efficiently extracting road information. Object detection and image segmentation, as application directions of artificial intelligence technology, leverage various manually curated datasets to train neural networks to detect and segment objects in images or videos, meeting the requirements for vehicle detection and road information extraction. Object detection methods can be classified into one-stage and two-stage methods. One-stage methods (such as YOLO (Redmon et al. 2016) and SSD (Liu et al. 2016)) directly predict the position and category of objects in a single network and are suitable for applications with high real-time requirements. Two-stage methods (such as Faster R-CNN (Ren et al. 2015)) first extract candidate regions and then classify and locate these regions, providing higher accuracy. Image segmentation algorithms based on Convolutional Neural Networks (CNN) have also made significant progress. Classical methods like the Fully Convolutional Network (FCN) (Long et al. 2015), U-Net (Ronneberger et al. 2015), and SegNet (Badrinarayanan et al. 2017), as well as deep learning architectures like DeepLab (Chen et al. 2015) and PSPNet (Zhao et al. 2017), continuously emerge, effectively improving segmentation performance by introducing techniques such as dilated convolutions and pyramid pooling.

Existing vehicle detection and road segmentation networks are mostly single-task networks. Drones, when identifying vehicles and segmenting roads, require the simultaneous operation of two networks, demanding high performance from the UAV and potentially wasting computational resources. Additionally, they may fail to extract correlated features between tasks.

Multi-task networks can save computational and storage resources by sharing network parameters. In resource-constrained environments or on mobile devices, such resource sharing is crucial for practical deployment. Moreover, multi-task networks allow neural networks to share learned representations across different tasks. By sharing the underlying feature extraction layer, the model can learn universal representations, enhancing the understanding of correlations between tasks.

Current multi-task algorithms are not optimized for the UAV perspective, which leads to poor detection performance when they are applied directly to UAV aerial imagery and makes it difficult to meet real-time requirements. Because of the large scale differences and complexity of targets in the UAV perspective, multi-task networks must be able to extract multi-scale features. Furthermore, considering the power consumption of onboard chips on UAV, further lightweight improvements are needed for multi-task networks to reduce runtime power consumption.

In addressing the aforementioned issues, this paper makes the following contributions:

  • In response to the high parameter and computational complexity of current multi-task models, as well as their unsuitability for small-target detection from the UAV perspective, a new multi-task framework, YOLO-U, is constructed by improving a lightweight backbone network, employing highly coupled backbone and neck networks, and incorporating the lightweight Ghost-Dilated convolution and G-ASPP modules. The network uses an improved lightweight backbone, neck, detection head, and segmentation head, allowing simultaneous detection and segmentation of vehicles and roads in UAV-captured images, and is better suited to this task than other multi-task models. Because both head networks share a common backbone and neck, computational and storage resources are conserved, and the model's understanding of correlations between tasks is improved, enhancing detection performance.

  • In response to the issue of increased parameter and computational complexity caused by adding multi-scale feature extraction modules, a lightweight dilated convolution called Ghost-Dilated convolution is proposed. Ghost-Dilated Convolution combines the characteristics of Ghost convolution and dilated convolution, using a two-stage feature extraction approach. It achieves a large receptive field while having fewer parameters and computational requirements.

  • The ASPP module is composed of multiple dilated convolutions, which has a high number of parameters and computational complexity. In order to reduce the increase in model parameters and computational complexity, a lightweight Ghost-Atrous Spatial Pyramid Pooling (G-ASPP) module based on Ghost-Dilated convolution is proposed. The G-ASPP module has a structure similar to the Atrous Spatial Pyramid Pooling (ASPP) module but uses Ghost-Dilated Convolution instead of dilated convolution. Therefore, compared to the ASPP module, the network using the G-ASPP module has nearly equivalent multi-scale feature extraction capability but with lower parameters and computational requirements.

This article is organized into five sections. The first section discusses the research significance of multi-task networks on UAV and introduces the main contributions of this paper. The second section covers related work, providing an overview of the current research status of UAV aerial object detection and segmentation, as well as multi-task networks. The third section details the methodology, introducing the overall network framework and the principles behind the proposed modules. The fourth section presents the experiments, validating the contributions and demonstrating their effectiveness. The fifth section concludes the paper, summarizing the contributions, experimental findings, and conclusions drawn.

Related work

UAV Aerial object detection

The task of object detection from the UAV perspective has received extensive research attention due to challenges such as varying target sizes, uneven target distribution, and complex shooting environments in UAV-captured scenes. Liu et al. (2023) proposed a novel dual-backbone detection method (DB-YOLOv5), which enhances the feature extraction capability for small targets by utilizing multiple backbone networks, achieving high detection performance for small targets. However, the use of multiple backbone networks significantly increases the model's parameters and computational requirements, making it difficult to deploy directly on UAV. Huang et al. (2023) proposed a lightweight object detection network for UAV platforms based on YOLOv5. By adding a small-target detection head, improving the IOU metric, and introducing FasterNet, they enhanced small-target detection while reducing model parameters and improving real-time performance. Khan et al. (2022) proposed a multi-scale and multi-class unified framework for detecting objects in high-resolution satellite images. The framework addresses the multi-scale problem by utilizing multiple Region Proposal Networks (RPNs), each with its own scale range, and leveraging independent pyramid levels to generate scale-specific object proposals. Li et al. (2023) proposed a lightweight infrared target detection method, named Edge-YOLO, by improving the YOLOv5m backbone network to further increase the running speed of the network. It has 5.2 million parameters and a computational workload of 14.2 GFlops, achieving a running speed of 31.9 frames per second on the RK3588 chip.

UAV Aerial image segmentation

Image segmentation from the UAV perspective faces challenges such as complex scenes, difficulty in achieving real-time performance, and significant differences in target sizes. Li et al. (2023) proposed the Dual-Stream Feature Fusion Network (DSFA-Net), which utilizes two branches to extract shallow and deep information separately. This network balances shallow and deep feature extraction, improving feature fusion for stronger segmentation capability, especially for targets with large size differences. Xu et al. (2023) introduced an automated segmentation method for insulator images based on DeepLab V3+. This method demonstrated effective segmentation of insulator images captured from UAV. Shi et al. (2023) presented a Transformer-based city scene segmentation network for UAV images. By designing a backbone with a deformable multi-head self-attention transformer block featuring an aggregation window, introducing a position attention module, and using a V-shaped encoder network, they improved the accuracy of city scene segmentation. The above algorithms lack optimization for embedded devices, which raises the performance requirements of the model and makes it difficult to meet real-time requirements on embedded devices.

Multi-task neural networks

Multi-task networks are widely applied in the field of autonomous driving. Wu et al. (2022) proposed a panoramic driving perception network, YOLOP, capable of simultaneously performing vehicle detection, drivable-area segmentation, and lane line segmentation. It achieves high detection accuracy while maintaining real-time performance. However, because aerial drone footage contains an abundance of small targets and differs significantly from the vehicle-mounted perspective, YOLOP lacks an effective solution for handling scale variations, making it difficult to apply to object detection from a drone's perspective. He et al. (2017) introduced Mask R-CNN, a model capable of simultaneously performing instance segmentation and object detection. They added a branch for predicting object masks alongside the existing bounding-box recognition branch, laying the foundation for multi-task networks. However, due to the lack of a corresponding lightweight design and the low real-time performance of the R-CNN framework, it is difficult to use for real-time UAV detection and segmentation tasks. Zhang et al. (2020) proposed a novel Multi-Scale and Occlusion Aware Network (MSOA-Net) for UAV-based vehicle segmentation, addressing scale change through a multi-scale feature adaptive fusion network. However, the network can only detect and segment vehicles and cannot perform separate detection and segmentation of different targets; additionally, due to the lack of lightweight processing in the backbone network, real-time performance on embedded devices is difficult to guarantee. Balamuralidhar et al. (2021) proposed MultEYE, a multi-task object detection network that exploits the characteristics of multi-task training to train the road segmentation head and vehicle detection head simultaneously. During inference, the road segmentation head is frozen while sharing the underlying feature extraction layers to improve vehicle detection accuracy. However, because the road segmentation head is frozen to improve real-time vehicle detection, the model cannot output road segmentation results during inference and thus loses the benefit of multi-tasking.

Methodology

Ghost-dilated convolution

In object detection and image segmentation tasks, objects to be detected and segmented in the input images vary in scale. Therefore, the network needs the capability to capture features at different scales. Researchers have used ordinary convolution with a large kernel to expand the network's receptive field, but this increases the number of network parameters. Addressing this issue, Yu et al. (2015) proposed dilated convolution, which achieves a large receptive field whose size is controlled through dilation factors. Compared to ordinary convolution with the same receptive field, this approach significantly reduces the number of parameters and computational requirements.

Han et al. (2020) analyzed the feature maps extracted by ordinary convolution and found that the feature maps of some channels were similar to those of other channels, indicating that they could be obtained from the other channels through simple linear transformations. Based on this analysis, they proposed a lightweight convolution called Ghost convolution. Ghost convolution adopts a two-stage feature extraction approach, as shown in Fig. 1. In the first stage, intrinsic feature maps of the images are extracted using ordinary convolution, with the channel number set to a smaller value. In the second stage, group convolution is employed to further process (linearly transform) the feature maps extracted in the first stage, and the results from both stages are concatenated for output. Through this process, Ghost convolution exhibits nearly the same feature extraction capability as ordinary convolution but with lower parameters and computational requirements.

Fig. 1
figure 1

Ghost convolution
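
To make the two-stage process concrete, the following PyTorch sketch shows a Ghost convolution block. It follows the publicly described GhostNet design; the class name, argument names, and the choice of BatchNorm/ReLU are ours and may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Two-stage Ghost convolution: a cheap primary convolution produces
    intrinsic feature maps, and a 3x3 group convolution generates the
    remaining 'ghost' maps by linear transformation."""
    def __init__(self, in_ch, out_ch, kernel_size=1, ratio=2, dw_size=3, stride=1):
        super().__init__()
        init_ch = out_ch // ratio                      # intrinsic channels (m)
        cheap_ch = out_ch - init_ch                    # ghost channels
        # Stage 1: ordinary convolution with a reduced channel number
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, stride,
                      kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        # Stage 2: group convolution acting as the linear transformation
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, dw_size, 1, dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)                            # intrinsic feature maps
        return torch.cat([y, self.cheap(y)], dim=1)    # concatenate both stages
```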

Dilated convolution follows the same process as ordinary convolution in feature extraction. Therefore, similar optimization techniques used for ordinary convolution can be applied to dilated convolution to further reduce its parameters and computational requirements. Combining the two-stage characteristic of Ghost convolution with dilated convolution, a lightweight Ghost-Dilated Convolution with a large receptive field is proposed. The Ghost-Dilated Convolution is illustrated in Fig. 2.

Fig. 2
figure 2

Ghost-dilated convolution

In the first stage, intrinsic feature maps are obtained by applying dilated convolution with a smaller channel number to the input feature map, as defined in Eq. (1).

$$Y^{'}=\chi\times f^{'}$$
(1)

where \(\chi\in R^{H\times W\times C}\) is the input feature map, \(f^{'}\in R^{c\times k\times k\times m\times d}\) denotes the dilated convolution operation, and \(Y^{'}\in R^{H_2\times W_2\times m}\) represents the intrinsic feature map obtained.

In the second stage, linear transformations are applied to the intrinsic feature maps using 3 × 3 group convolution, and the computation process is expressed in Eq. (2).

$$y_{ij}=\Phi_{i,j}\left(y_i^{'}\right),\;\forall i=1,...,m,\;j=1,...,s$$
(2)

where \(y_i^{'}\) is the \(i\)-th intrinsic feature map in \(Y^{'}\), \(\Phi_{i,j}\) is the \(j\)-th linear transformation (convolution kernel) applied to \(y_i^{'}\) to generate the Ghost feature map \(y_{ij}\), \(m\) is the number of intrinsic feature maps, and \(s\) is the number of linear transformations applied to each of them.

Finally, the intrinsic feature maps and the linearly transformed feature maps are concatenated for the final output, as shown in Eq. (3).

$$Y=Cat\left(Y^{'},\;y_{i,j}\right),\;\forall i=1,...,m,j=1,...,s$$
(3)

where \({Y\in R}^{H_2\times W_2\times2m}\) represents the final feature map generated by Ghost-Dilated Convolution.

Based on the Ghost-Dilated Convolution process described above, it combines the characteristics of dilated convolution with a large receptive field and the lightweight nature of Ghost convolution. This combination further reduces the parameters of networks employing dilated convolution.
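
A corresponding sketch of the Ghost-Dilated convolution in Eqs. (1)-(3) is given below: the first-stage convolution is made dilated while the second stage remains a 3 × 3 group convolution. This is our illustrative interpretation of the module, not the released YOLO-U code.

```python
import torch
import torch.nn as nn

class GhostDilatedConv(nn.Module):
    """Ghost convolution whose first stage uses a dilated convolution,
    giving a large receptive field at low parameter cost (Eqs. 1-3)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=2, ratio=2, dw_size=3):
        super().__init__()
        init_ch = out_ch // ratio
        cheap_ch = out_ch - init_ch
        pad = dilation * (kernel_size - 1) // 2        # keep the spatial size
        # Stage 1 (Eq. 1): dilated convolution with a smaller channel number
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, 1, pad,
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        # Stage 2 (Eq. 2): 3x3 group convolution as the linear transformation
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, dw_size, 1, dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)                            # intrinsic maps Y'
        return torch.cat([y, self.cheap(y)], dim=1)    # Eq. 3: concatenation
```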

G-ASPP module

Due to the relative flexibility of drones, the shooting perspective and altitude are not fixed. When a drone is at a high altitude, ground targets appear relatively small in the captured image, while they appear larger when the drone is at a lower altitude. Therefore, the network must have a multi-scale characteristic to accommodate the scale variations of targets captured by the drone. Basalamah et al. (2019) proposed a Scale-Driven Convolutional Neural Network (SD-CNN) model, which generates scale-aware object proposals by creating a scale map. This model effectively addresses the challenges of complex backgrounds, scale variations, nonuniform distributions, and occlusions in object detection tasks. He et al. (2015) proposed the use of Spatial Pyramid Pooling (SPP) modules in the network to enhance the capability of extracting scale features. YOLOv5 further improved the SPP module, introducing the SPPF module to enhance the network's recognition ability for multi-scale targets.

The ASPP module, based on the SPP module, uses dilated convolutions with parallel different dilation rates instead of max-pooling to extract features at different scales, and then combines these multi-scale features. Compared to SPP modules and SPPF modules that use simple max-pooling to increase the image's receptive field, the ASPP module enlarges the receptive field through dilated convolutions with different dilation rates. While using dilated convolutions to extract features from images can capture more multi-scale features than max-pooling operations, it also increases the network's parameters and computational requirements.

To avoid the increase in parameters resulting from adding dilated convolutions, a lightweight multi-scale feature extraction module called the G-ASPP module is proposed, based on Ghost-Dilated Convolution. The G-ASPP module replaces the original convolutions in the ASPP module with Ghost-Dilated Convolution, further reducing the module's parameters and computational requirements. The G-ASPP module is illustrated in Fig. 3. It adopts a parallel structure, passing the input through Ghost-Dilated Convolutions with dilation rates of 6, 12, and 18 and then concatenating the resulting multi-scale features.

Fig. 3
figure 3

G-ASPP module

Compared to the ASPP module, the G-ASPP module demonstrates similar multi-scale feature extraction capabilities while further reducing parameters and computational requirements. Placing the G-ASPP module in the backbone effectively enhances the network's detection performance for various scale targets.
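
Using the Ghost-Dilated convolution sketched above, a G-ASPP block with the dilation rates 6, 12, and 18 described in this section could look as follows; the final 1 × 1 projection after concatenation is our assumption, since the paper only specifies the parallel branches and their concatenation.

```python
import torch
import torch.nn as nn
# Reuses the GhostDilatedConv class sketched in the previous section.

class GASPP(nn.Module):
    """G-ASPP: parallel Ghost-Dilated convolutions with dilation rates
    6, 12, and 18 whose outputs are concatenated and fused."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [GhostDilatedConv(in_ch, out_ch, kernel_size=3, dilation=r)
             for r in rates])
        # 1x1 projection back to out_ch after concatenation (our assumption)
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * len(rates), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        feats = [b(x) for b in self.branches]          # multi-scale features
        return self.project(torch.cat(feats, dim=1))   # concatenate and fuse
```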

Overall structure of the multi-task network

When performing both vehicle detection and road segmentation tasks, a drone needs to run two networks simultaneously. Due to the lower performance of drone chips and the high cost of storage, running two neural networks concurrently is challenging and cannot guarantee real-time processing.

The emergence of multi-task networks effectively addresses the aforementioned issues. Currently, multi-task networks are predominantly designed with a parallel multi-task network structure, as illustrated in Fig. 4. The parallel multi-task network structure reduces redundancy by sharing convolutional layers. Moreover, the shared convolutional layers endow the network with the ability to extract features that are relevant to both tasks, thereby enhancing the detection and segmentation performance of the network.

Fig. 4
figure 4

Parallel multi-task network architecture

Through the analysis of vehicle detection and road segmentation tasks, it is observed that these tasks exhibit a certain level of correlation. Vehicles are usually on the road, aligning with the characteristics of multi-task networks. Therefore, a multi-task aerial drone network, named YOLO-U, is proposed to perform vehicle detection and road segmentation tasks simultaneously. The overall network structure includes a backbone, neck, and head.

The multi-task network simultaneously accomplishes both vehicle detection and road segmentation tasks. Hence, the network is designed with separate head networks for vehicle detection and road segmentation, while sharing a common backbone and neck network. The YOLO-U network structure is illustrated in Fig. 5.

Fig. 5
figure 5

YOLO-U network structure
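
The sharing scheme of Fig. 5 can be summarized by the skeleton below, in which one backbone and one neck feed both task heads in a single forward pass. All class and attribute names here are illustrative placeholders rather than the actual YOLO-U modules.

```python
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared backbone + neck with two task-specific heads (vehicle
    detection and road segmentation), computed in one forward pass."""
    def __init__(self, backbone, neck, det_head, seg_head):
        super().__init__()
        self.backbone = backbone    # lightweight GhostNet-style encoder
        self.neck = neck            # shared multi-scale feature fusion
        self.det_head = det_head    # vehicle detection head (boxes)
        self.seg_head = seg_head    # road segmentation head (mask)

    def forward(self, x):
        feats = self.backbone(x)            # multi-scale backbone features
        fused = self.neck(feats)            # shared fused features
        detections = self.det_head(fused)   # task 1: vehicle detection
        road_mask = self.seg_head(fused)    # task 2: road segmentation
        return detections, road_mask
```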

Constrained by the performance of drone devices, a lightweight network is chosen as the backbone of the multi-task network. Currently, there are various lightweight network models designed for mobile devices, such as ShuffleNet (2018), MobileNet (2017), and GhostNet (2020).

ShuffleNet introduces pointwise group convolution and channel shuffling mechanisms, reducing the parameters and computational requirements of convolutions. It also enhances the interaction between features, thereby improving the expressive capability of feature maps. MobileNet adopts depthwise separable convolutions instead of ordinary convolution, further reducing the parameters and computational requirements of convolution operations compared to pointwise group convolutions. The MobileNetV3 version introduces a lightweight Squeeze-and-Excitation (SE) attention mechanism, improving the focus on crucial features. GhostNet analyzes the feature maps generated by convolutions and proposes a Ghost convolution with lower parameters, avoiding redundant feature mappings and enhancing network efficiency.

Similar to MobileNetV3, GhostNet also integrates the SE attention mechanism. The SE attention mechanism primarily focuses on the relationships between channels, allowing the network to concentrate more on the feature channels that are crucial for the task. In comparison, the Efficient Channel Attention (ECA) mechanism replaces the fully connected layer in the SE attention mechanism with one-dimensional convolution. This not only reduces the computational and parameter requirements of the network but also enhances communication between channels.

The ECA attention mechanism is illustrated in Fig. 6. First, the input feature map undergoes global average pooling. Subsequently, a one-dimensional convolution with a kernel size of K is applied to the resulting one-dimensional vector. The Sigmoid function is then applied to the convolution result to obtain a weight for each channel, as shown in Eq. (4). Finally, the original feature map is multiplied by the obtained weights, yielding a feature map that incorporates attention information.

Fig. 6
figure 6

ECA attention mechanism

$$\omega=\sigma(C1D_k(y))$$
(4)

In the above formula, \(\sigma(\cdot)\) denotes the Sigmoid function, \(C1D_k\) represents the one-dimensional convolution with an adaptively determined kernel size \(k\), \(y\) denotes the channel descriptor obtained after global average pooling, and \(\omega\) represents the weights for each channel.
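
A minimal PyTorch sketch of the ECA operation in Eq. (4) is shown below, following the original ECA-Net formulation; for simplicity the kernel size k is passed as a fixed hyperparameter rather than being derived adaptively from the channel count.

```python
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling followed by a
    1-D convolution across channels and a sigmoid gate (Eq. 4)."""
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                       # x: (B, C, H, W)
        y = self.pool(x)                        # (B, C, 1, 1) channel descriptor
        y = y.squeeze(-1).transpose(-1, -2)     # (B, 1, C) for 1-D convolution
        y = self.conv(y)                        # cross-channel interaction
        w = self.sigmoid(y).transpose(-1, -2).unsqueeze(-1)  # (B, C, 1, 1)
        return x * w                            # re-weight the input channels
```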

The SE attention mechanism in GhostNet was replaced with the ECA attention mechanism, further enhancing the efficiency of the backbone network. The structure of the backbone network is presented in Table 1. The input image first undergoes the Focus module, which divides the image into several smaller blocks at a certain ratio, enhancing the detection performance for small targets. Subsequently, GhostNet further extracts abstract features, utilizing convolution with a stride of 2 for downsampling the feature map, thereby reducing information loss caused by downsampling operations. Finally, the G-ASPP module is employed to extract and fuse multi-scale features, further strengthening the backbone network's capability to extract features from multi-scale targets.

Table 1 Lightweight backbone network architecture
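
Assuming the Focus module follows the YOLOv5 design referenced above, its slicing step can be sketched as follows; the convolution settings applied after the slicing are our assumption.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Focus slicing as used in YOLOv5: the input is split into four
    pixel-interleaved sub-images, concatenated along the channel axis,
    and passed through a convolution. Spatial size halves while no
    pixel information is discarded."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch * 4, out_ch, kernel_size, 1,
                      kernel_size // 2, bias=False),
            nn.BatchNorm2d(out_ch), nn.SiLU(inplace=True))

    def forward(self, x):
        # Take every second pixel in four phase-shifted patterns
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(patches, dim=1))
```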

The vehicle detection head network and neck network form a PAN structure. After feature extraction by the backbone network, a G-ASPP module is applied, and the neck network performs upsampling on the features, concatenating them with the features of the same scale from the backbone network. The vehicle detection head network undergoes multiple downsampling operations and concatenates with the neck network's features of the same scale. The network outputs four-scale feature vectors. The adoption of the PAN structure promotes the fusion and propagation of multi-scale features, improving the detection performance for multi-scale targets. Unlike YOLO's three detection heads, the network employs four detection heads, enhancing the detection performance for small targets.

The road segmentation head network, neck network, and backbone network together form a structure similar to the UNet network. The backbone network performs multiple downsampling operations on the image, and the neck network and road segmentation head network perform upsampling on the features extracted by the backbone network. The features of the same scale from the backbone network are concatenated, and the network outputs a segmentation result with a size of 640 × 640. The image segmentation network, adopting the UNet structure, fully utilizes shallow features, improving feature propagation and achieving better segmentation results compared to other image segmentation networks.

Both the neck network and the head network use the CSP module as the basic feature extraction module, as shown in Fig. 7. When a feature map passes through the CSP module, it is processed by two branches before concatenation. The CSP module has strong feature extraction capability with lower computational and parameter overhead, and reduces memory access cost. To avoid the increase in computational and parameter overhead caused by transposed convolution, linear interpolation is employed to upsample the feature map.

Fig. 7
figure 7

CSP module structure
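
A simplified sketch of a CSP-style block with the two-branch structure described above is given below; the number and form of the convolutions inside the main branch are assumptions, since the paper does not detail them.

```python
import torch
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, k=1, s=1):
    """Convenience Conv-BN-SiLU block (naming is ours)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, k // 2, bias=False),
        nn.BatchNorm2d(out_ch), nn.SiLU(inplace=True))

class CSPBlock(nn.Module):
    """Cross Stage Partial block: the input is split into two branches,
    only one of which passes through the heavier convolution stack;
    the branches are then concatenated and fused."""
    def __init__(self, in_ch, out_ch, n=1):
        super().__init__()
        mid = out_ch // 2
        self.branch_main = nn.Sequential(
            conv_bn_act(in_ch, mid),
            *[conv_bn_act(mid, mid, k=3) for _ in range(n)])
        self.branch_skip = conv_bn_act(in_ch, mid)      # cheap partial branch
        self.fuse = conv_bn_act(2 * mid, out_ch)

    def forward(self, x):
        return self.fuse(torch.cat(
            [self.branch_main(x), self.branch_skip(x)], dim=1))
```

For the upsampling mentioned above, nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False) can be used in place of a transposed convolution, matching the interpolation choice described in the text.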

Overall, the network benefits from the high degree of sharing of the backbone and neck networks between the two head networks, which promotes coupling and enables the network to learn the correlation between tasks. The correlation between road segmentation and vehicle detection mainly manifests in the fact that vehicles are usually on the road. The visualizations of vehicle detection labels and road segmentation labels are shown in Fig. 8, where the red area represents the road and the green boxes represent vehicles. The figure shows a high degree of overlap between the object detection labels and the road segmentation labels, indicating an inherent correlation. When the network learns this correlation, it can avoid focusing on areas outside the road, thereby reducing the false detection rate for vehicles.

Fig. 8
figure 8

Visualization of vehicle detection and road segmentation labels

The model adopts GhostNet, combined with the ECA attention mechanism, as the backbone network, making the model lightweight. The G-ASPP module used for multi-scale feature extraction not only has lower performance overhead but also further enhances the model's ability to extract features at multiple scales. The detection and segmentation heads share the backbone and neck networks, further reducing performance overhead. Owing to these characteristics, the time complexity of the network is lower than that of other multi-task networks, meeting the real-time requirements of UAV devices while providing a stronger ability to learn correlated features.

Experiments

Experimental setup

Dataset

Currently, there is a relative lack of multi-task datasets from the UAV perspective. To address this issue, we constructed a multi-task dataset for UAV aerial object detection and road segmentation. UAV were first deployed to capture video footage of ground scenes. Frames were extracted from the video at fixed intervals of 10 s, yielding a total of 395 images containing vehicles and roads. The vehicles and roads were manually annotated using the LabelImg software, as illustrated in Fig. 9. The object detection task focuses on single-class detection, and the distribution of bounding-box sizes is depicted in Fig. 10, revealing a concentration of medium and small targets.

Fig. 9
figure 9

Multi-task dataset for vehicle detection and road segmentation

Fig. 10
figure 10

Distribution of vehicle detection annotations sizes

After augmenting the dataset with data augmentation techniques, we divided it into training, validation, and test sets in an 8:1:1 ratio. Because multi-task annotation is labor-intensive, the dataset is relatively small. To enhance the model's generalization, pre-training was performed using publicly available datasets closely related to the tasks, namely VisDrone2019 (2019) and CHN6-CUG Road (2021a).
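
As an illustration of the frame-extraction step described above, the following OpenCV sketch samples one frame every 10 s from a UAV video; the file naming and the FPS fallback are our assumptions.

```python
import cv2

def extract_frames(video_path, out_dir, interval_s=10):
    """Save one frame every `interval_s` seconds from a UAV video,
    mirroring the 10 s sampling used to build the dataset."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30          # fall back if FPS is missing
    step = int(round(fps * interval_s))
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```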

Experimental environment and parameter settings

The experiments in this paper were conducted on a Dell Precision T7920 tower graphics workstation with an Intel Xeon Silver 4100@2.10 GHz × 16 CPU. The GPU used was the Nvidia Quadro P5000 with 16 GB of memory. The system had 64 GB of RAM, and the storage configuration included a 512 GB SSD along with an 8 TB hard drive. The parameters related to model training are shown in Table 2.

Table 2 Training-related parameters

Loss functions

Since the multi-task network has different output results, it requires joint loss functions corresponding to each task. The loss function for the vehicle detection task is defined as follows in Eqs. (5)-(8).

$$L_{conf}=\sum_{i=0}^{S^2}\sum_{j=0}^BI_{ij}^{obj}{(C_i-C_i^{'})}^2$$
(5)
$$L_{cls}=\sum_{i=0}^{S^2}\sum_{j=0}^BI_{ij}^{obj}\sum_{c\in classes}(p_i(c)-p_i^{'}(c))^2$$
(6)
$$L_{iou}=1-IoU(B,B_{gt})+\frac{\rho^2(B,B_{gt})}{c^2}+\alpha v$$
(7)
$$L_{det}=\alpha_1L_{cls}+\alpha_2L_{iou}+\alpha_3L_{conf}$$
(8)

where \(L_{conf}\) represents the confidence loss, \(L_{cls}\) represents the classification loss, and \(L_{iou}\) represents the (CIoU-style) bounding-box loss, in which \(B\) and \(B_{gt}\) are the predicted and ground-truth boxes, \(\rho\) is the distance between their center points, \(c\) is the diagonal length of the smallest enclosing box, and \(\alpha v\) is the aspect-ratio consistency term; \(\alpha_1, \alpha_2, \alpha_3\) are weight parameters.

The loss function for the road segmentation task uses the Cross-Entropy (CE) Loss Function, as shown in Eq. (9):

$$L_{seg}=-\sum_{i=1}^N\left[y_i\log y_i^{'}+(1-y_i)\log(1-y_i^{'})\right]$$
(9)

In the above equation, \(y_i\) represents the ground truth for the \(i\)-th sample, and \(y_i^{'}\) represents the predicted value for the \(i\)-th sample.

Combining both tasks, the joint loss function is given by Eq. (10).

$$L_{all}=\beta_1L_{det}+\beta_2L_{seg}$$
(10)

In the above equation, \(L_{all}\) represents the joint training loss, and \(\beta_1\) and \(\beta_2\) are weight parameters.
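
A minimal sketch of how the joint objective in Eqs. (8)-(10) can be assembled is given below; the individual detection loss terms are assumed to be computed elsewhere, and the weight values shown are placeholders, since the paper does not report the ones it uses.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss(reduction='mean')   # cross-entropy on the road mask, Eq. (9)

def joint_loss(det_losses, seg_pred, seg_target,
               alphas=(1.0, 1.0, 1.0), betas=(1.0, 1.0)):
    """Combine the detection loss terms (Eq. 8) with the segmentation
    loss (Eq. 9) into the joint objective of Eq. (10).
    seg_pred: per-pixel road probabilities (after a sigmoid), in [0, 1]."""
    l_cls, l_iou, l_conf = det_losses                    # detection terms
    a1, a2, a3 = alphas
    l_det = a1 * l_cls + a2 * l_iou + a3 * l_conf        # Eq. (8)
    l_seg = bce(seg_pred, seg_target)                    # Eq. (9)
    b1, b2 = betas
    return b1 * l_det + b2 * l_seg                       # Eq. (10)
```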

Evaluation metrics

In this article, we evaluate the model using precision (P), recall (R), mean average precision (mAP), intersection over union (IOU), mean IOU (mIOU), as well as metrics related to model complexity such as parameter count and computational complexity. Specifically, P, R, and mAP are used as evaluation metrics for vehicle detection. The calculation formulas for P, R, and mAP are given by Eqs. (11), (12), and (13), respectively.

$$P=\frac{TP}{TP+FP}$$
(11)
$$R=\frac{TP}{TP+FN}$$
(12)
$$mAP=\frac{1}{N}\sum^{N}_{i=1}AP_i$$
(13)

where TP represents the number of correctly detected bounding boxes, FP represents the number of predicted boxes that do not correspond to any ground-truth object, and FN represents the number of ground-truth objects that are missed (wrongly treated as background). \(AP_i\) denotes the model's average precision for the \(i\)-th class. In the context of a single-class detection task, mAP is numerically equivalent to AP.

Using IOU and mIOU as evaluation metrics for road segmentation, the calculation formulas for IOU and mIOU are as shown in Eqs. (14) and (15).

$$IOU=\frac{TP}{TP+FN+FP}$$
(14)
$$mIOU=\frac{1}{N}\sum^{N}_{i=1}\frac{TP_i}{TP_i+FN_i+FP_i}$$
(15)

where TP represents the number of correctly predicted road pixels, FP represents the number of background pixels incorrectly predicted as road, FN represents the number of road pixels incorrectly predicted as background, and N represents the total number of classes, specifically referring to the road and background classes in this context.
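
The segmentation metrics in Eqs. (14) and (15) can be computed from pixel-level label maps as sketched below (detection precision and recall are included for completeness); this is an illustrative implementation, not the authors' evaluation code.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Detection precision and recall from matched-box counts (Eqs. 11-12)."""
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

def iou_per_class(pred, target, num_classes=2):
    """Pixel-wise IoU for each class (Eq. 14); mIoU is their mean (Eq. 15).
    pred and target are integer label maps (0 = background, 1 = road)."""
    ious = []
    for c in range(num_classes):
        tp = np.logical_and(pred == c, target == c).sum()
        fp = np.logical_and(pred == c, target != c).sum()
        fn = np.logical_and(pred != c, target == c).sum()
        ious.append(tp / (tp + fp + fn + 1e-9))
    return ious, float(np.mean(ious))
```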

Model training

Due to the relatively small size of the annotated dataset, the network suffers from weak generalization when trained from scratch. To address this, pretraining is conducted on datasets closely related to the task. First, the road segmentation head is frozen, and the sub-network comprising the backbone, neck, and detection head is trained on the VisDrone2019 dataset for 240 epochs. Next, the backbone, neck, and detection head are frozen, and the road segmentation head is trained on the CHN6-CUG Road dataset for 240 epochs. Finally, the entire network is trained on the multi-task UAV vehicle detection and road segmentation dataset for 240 epochs. The hyperparameters for each training stage remain unchanged. The pretraining procedure is outlined in Algorithm 1.

figure a

Algorithm 1. Training of multi-task neural network
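
A sketch of the three-stage schedule in Algorithm 1, implemented by toggling requires_grad on the relevant sub-networks, is given below; train_fn and the loader names are placeholders for an ordinary training loop and the corresponding datasets.

```python
def set_trainable(module, flag):
    """Freeze or unfreeze all parameters of a sub-network."""
    for p in module.parameters():
        p.requires_grad = flag

def staged_pretraining(model, train_fn, det_loader, seg_loader, joint_loader):
    """Three-stage schedule of Algorithm 1: detection pretraining with the
    segmentation head frozen, segmentation pretraining with the rest frozen,
    then joint fine-tuning on the multi-task UAV dataset (240 epochs each).
    `train_fn(model, loader, epochs)` is a placeholder training loop."""
    # Stage 1: VisDrone2019, segmentation head frozen
    set_trainable(model.seg_head, False)
    train_fn(model, det_loader, epochs=240)
    # Stage 2: CHN6-CUG Road, backbone/neck/detection head frozen
    set_trainable(model.seg_head, True)
    for part in (model.backbone, model.neck, model.det_head):
        set_trainable(part, False)
    train_fn(model, seg_loader, epochs=240)
    # Stage 3: unfreeze everything and train on the multi-task dataset
    for part in (model.backbone, model.neck, model.det_head):
        set_trainable(part, True)
    train_fn(model, joint_loader, epochs=240)
```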

Experimental results

Exploration experiment on task relevance

To further validate the correlation between road and vehicle positions, this section computes the percentage of vehicle detection labels that intersect the road labels, and compares the corresponding percentages for the detection boxes output by the single-task and multi-task networks. The experimental results are shown in Table 3. The data indicate that 98.5% of ground-truth vehicle labels are located within the road region. The percentage for detection boxes output by the multi-task network is 4.1% higher than that of the single-task network, indicating that the multi-task network has learned the correlation between road and vehicle positions and focuses more on road areas.

Table 3 Distribution of output labels in different networks

Note that YOLO-UD refers to the model with the segmentation head removed, and YOLO-US refers to the model with the detection head removed.
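
The percentage reported in Table 3 can be estimated as sketched below, by checking whether each detection box contains road pixels in the binary road mask; the exact criterion used by the authors is not stated, so the overlap test here is an assumption.

```python
import numpy as np

def boxes_on_road(boxes, road_mask, min_overlap=0.0):
    """Fraction of boxes whose area intersects the road mask.
    boxes: (N, 4) array of [x1, y1, x2, y2] pixel coordinates;
    road_mask: binary (H, W) array with 1 marking road pixels."""
    on_road = 0
    for x1, y1, x2, y2 in boxes.astype(int):
        patch = road_mask[y1:y2, x1:x2]
        if patch.size and patch.mean() > min_overlap:   # any road pixel inside
            on_road += 1
    return on_road / max(len(boxes), 1)
```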

In addition to the data comparison, visualizing the feature maps extracted from the backbone network, as shown in Fig. 11, reveals that the network focuses its attention primarily on the road area. This indicates that the network has learned the correlation between tasks, thereby improving the accuracy of network detection and segmentation.

Fig. 11
figure 11

Differences in attention distribution between multi-task and single-task networks

Comparison experiment of lightweight backbone networks

To further investigate the impact of different lightweight backbone networks on the multitask network, this section conducts experiments using popular lightweight networks as the backbone. The experimental results are summarized in Table 4.

Table 4 Comparison results of lightweight backbone networks

MobileNetV3 has the fewest parameters among the compared networks, but its computational cost is higher than that of GhostNet. In terms of object detection accuracy, the GhostNet ECA model demonstrates the best performance. It outperforms MobileNetV3 and ShuffleNetV2 by 6.3% and 5.0%, respectively, achieves performance similar to FasterNet-T0 with lower computational and parameter requirements, and shows a 1.6% improvement over GhostNet. For road segmentation accuracy, the differences among the networks are relatively small, with ShuffleNetV2 exhibiting the lowest performance.

Replacing the SE module in GhostNet with the ECA attention mechanism reduces the number of parameters by 1.51 million. Considering the overall results, GhostNet ECA exhibits superior performance in detection accuracy, segmentation accuracy, and computational cost. This validates the effectiveness of the improvement made to GhostNet and underscores the efficiency of using GhostNet as the backbone network.

G-ASPP module ablation experiment

To validate the effectiveness of the proposed G-ASPP module and Ghost-Dilated convolution module, ablation experiments are designed. SPP module, ASPP module, and G-ASPP module are compared, and the experimental results are presented in Table 5.

Table 5 Comparison results of G-ASPP module

Compared to the SPP module, the ASPP module improves the mAP metric by 4.2% and the mIOU metric by 0.5%. However, it comes with a substantial increase in parameters and computational cost, with a rise of 3.52 million parameters and 1.37 GFlops in computational cost.

By refining the ASPP module, the G-ASPP module slightly reduces the mAP and mIOU metrics compared to the ASPP module, showing a 1.2% decrease in mAP and a 0.1% decrease in mIOU. Nevertheless, there is a significant reduction in parameters and computational cost, with a decrease of 1.1 million parameters and 0.44 GFlops in computational cost.

The experimental data above demonstrates the effectiveness of the lightweight improvement made to the ASPP module through the G-ASPP module.

Comparison with other networks

To validate the effectiveness of the proposed method, comparisons are made with mainstream object detection algorithms, image segmentation algorithms, and multi-task algorithms. The experimental results are presented in Table 6.

Table 6 Comparison results between YOLO-U and other networks

In the comparison of object detection models, the YOLO-UD model proposed in this paper outperforms YOLOv5s and YOLOv8s, showing improvements of 2.5% and 1.2% in mAP, respectively, while reducing the number of parameters by 1.47 million and 5.47 million and the computational cost by 7.72 GFlops and 19.82 GFlops. Compared with TPH-YOLOv5 (Zhu et al. 2021b), which also performs well in drone aerial target detection, our model achieves similar detection accuracy with far lower parameter and computational complexity. Although SSDLite (Sandler et al. 2018) with a lightweight backbone has fewer parameters and lower computational complexity, its detection accuracy is poor and cannot meet the requirements of the detection task.

In the comparison of image segmentation models, the YOLO-US model proposed in this paper significantly outperforms the UNet and MobileUNet models on all metrics, achieving mIOU improvements of 12.3% and 10.3%, respectively, while maintaining relatively smaller parameter and computational costs. Compared with DeepLabV3-MobileNet (Chen et al. 2017), the IOU and mIOU metrics improve by 2.9% and 2.0%, respectively, while the parameter count and computational workload decrease by 2.68 M and 36.14 GFlops. It is worth noting that the multi-task model YOLO-U shows a slight decrease in segmentation accuracy compared to YOLO-US, which may be attributed to task imbalance caused by the highly shared lower-level networks, leading the network to favor the vehicle detection task.

To further validate the superiority of the proposed multi-task model from the UAV perspective, a comparison was made with the YOLO-P (Wu et al. 2022) model with its lane detection head removed. For the vehicle detection task, YOLO-U achieved a 2.5% higher mAP than YOLO-P, while both models performed similarly on road segmentation. Regarding the parameters and computational cost relevant to real-time performance, although the parameter count of YOLO-U is 0.59 M higher, its computational cost is 0.72 GFlops lower than that of YOLO-P. Overall, these results indicate that YOLO-U is better suited than YOLO-P to vehicle detection and road segmentation from the UAV perspective.

Visualization comparison experiment

To further validate the effectiveness of the proposed method, visual results are compared with mainstream object detection and image segmentation algorithms. The vehicle detection results are shown in Fig. 12.

Fig. 12
figure 12

Visualization results of vehicle detection models

In the vehicle detection task, YOLOv5s exhibits the poorest performance, missing almost all distant small objects. YOLOv8s, while slightly better than YOLOv5s, still shows missed detections with relatively low confidence. Thanks to the small-object detection heads, both the YOLO-UD and YOLO-U models detect distant small targets well. However, YOLO-UD still misses a few detections, while TPH-YOLOv5 produces a few false alarms. Overall, YOLO-U outperforms the other models in comprehensive performance.

Simultaneously, a visual comparison of road segmentation is conducted, as shown in Fig. 13. In the road segmentation task, UNet fails to segment some roads, and its overall segmentation completeness is the poorest. Although MobileUNet segments all roads, it falsely detects green belts in the center of the roads as part of the road. The DeepLabV3-MobileNet, YOLO-US, and YOLO-U models demonstrate similar overall performance: they segment the roads accurately and completely without the issues observed in UNet and MobileUNet.

Fig. 13
figure 13

Visualization comparison results of road segmentation models

A visual comparison with the YOLO-P model, which is also a multi-task model, is shown in Fig. 14. Because the YOLO-P algorithm is not optimized for small-object detection and lacks multi-scale feature extraction capability, it misses detections when performing object detection from the UAV perspective.

Fig. 14
figure 14

Visualization comparison results of road multitask models

During the experiments, some deficiencies of the model were also found. As shown in Fig. 15, when there are many vehicles outside the road area in the image, they interfere with the road segmentation task, causing areas with many vehicles to be incorrectly segmented as road; in addition, the model misses some of the vehicles outside the road.

Fig. 15
figure 15

Failed case

Ablation experiment

To better understand the impact of each module on the network, ablation experiments were designed. As shown in Table 7, adding the small-object detection head increases the network's mAP by 1.3%, indicating that it further improves the ability to detect small objects. The backbone network with the ECA attention mechanism improves detection and segmentation accuracy, with a 1.6% increase in mAP and a 0.7% increase in mIOU. Replacing the ASPP module with the G-ASPP module slightly decreases detection and segmentation accuracy by 1.2% and 0.4%, respectively, but reduces the parameter count by 1.1 M and the computational cost by 0.44 GFlops. Furthermore, pre-training on multiple datasets further enhances model generalization, leading to a 2.5% increase in mAP and a 0.8% increase in mIOU. These ablation results fully validate the effectiveness of each proposed module.

Table 7 Comparative results of the ablation experiment

Conclusions

This paper addresses the performance limitations of current UAV that must run separate object detection and road segmentation networks simultaneously and are unable to extract correlated features between tasks. The proposed multi-task model for vehicle detection and road segmentation, named YOLO-U, leads to the following conclusions:

  (1)

    The paper introduces a lightweight Ghost-Dilated convolution that combines the advantages of Ghost convolution and dilated convolution, maintaining a large receptive field with a lower parameter count. By addressing the parameter and computational cost increase issues associated with the ASPP module, a lightweight multiscale feature extraction module, G-ASPP, is proposed, effectively reducing the model's parameter count and computational cost.

  (2)

    GhostNet is chosen as the backbone network due to its effective feature extraction capabilities. An improved version, GhostNet ECA, is introduced by integrating the ECA module, resulting in a further reduction of parameters and increased detection accuracy. Leveraging these improvements, the YOLO-U model is proposed for multitask UAV aerial vehicle detection and road segmentation, sharing the backbone and neck networks between tasks to enhance feature correlation learning, leading to improved detection and segmentation results. Pretraining using self-built aerial vehicle detection and road segmentation datasets, combined with similar single-task datasets, further enhances model detection accuracy on the test set.

  (3)

    Experimental results demonstrate that GhostNet ECA, as the backbone network, outperforms GhostNet with a 1.6% improvement in vehicle detection accuracy and a lower parameter count. The proposed G-ASPP module improves detection accuracy over the SPP module while reducing the parameter count and computational cost relative to the ASPP module by 1.1 million and 0.44 GFlops, respectively. Comparisons with other single-task network models, both numerically and visually, show that the proposed YOLO-U model achieves superior accuracy and completeness in vehicle detection and road segmentation. This validates the advantages of the proposed model.

Currently, our model still has some shortcomings, such as task imbalance and missed detection of targets outside the road. In addition, deployment on embedded platforms requires further research to enhance the practical value of the model. This paper focuses on deep learning network models from the UAV perspective and provides a direction for multi-task networks carried by UAV.