1 Introduction

With increasing urbanization [1] and significant economic growth, the number of motor vehicles is rising rapidly. This growth has brought serious challenges to urban transportation and has made traffic safety issues more prominent. Consequently, autonomous vehicles (AVs) [2] and advanced driver assistance systems (ADAS) [3] have emerged as critical technological solutions to alleviate traffic problems and enhance driving safety. AVs drive autonomously without human intervention through a range of sensors, algorithms, and computing systems. ADAS significantly enhances the safety and convenience of conventional vehicles by providing features such as lane keeping, automatic emergency braking, and adaptive cruise control. Lane line detection, one of the core components of AVs and ADAS, plays a crucial role in vehicle positioning, navigation, and path planning. As technology continues to advance, AVs and ADAS will gradually become an integral part of future transportation systems. Optimizing and improving lane line detection is therefore of great significance for enhancing traffic safety, improving traffic flow, and promoting the development of intelligent transportation systems.

Although many lane line detection methods have been highly successful under certain conditions, vehicles encounter complex scenarios such as heavy occlusion, inclement weather, and extreme lighting, and the thin, elongated shape of lane lines makes them inherently difficult to detect. Accurate and rapid lane line detection therefore remains a challenging task. Existing lane line detection methods are mainly classified into two categories: traditional methods [4, 5] and deep learning methods [6,7,8,9,10]. Traditional lane line detection methods primarily rely on image processing and feature extraction. These methods usually extract the color features [11,12,13] or edge features [14, 15] of the lane lines, use the inverse perspective transform to obtain a bird’s eye view of the image [16], and finally fit the lane lines using the Hough transform [17] or a sliding window search [12, 13]. Traditional methods can achieve good results under simple conditions but face greater limitations in complex and changing driving scenarios: they adapt poorly to environmental changes and complex road conditions, require frequent parameter adjustments, have high computational complexity, and offer poor real-time performance. In recent years, deep learning methods have received extensive attention and application in lane line detection, gradually becoming the mainstream approach. Deep learning-based lane line detection methods learn to extract and understand image features autonomously from large amounts of data, allowing more robust detection across different scenarios [18,19,20,21]. The mainstream deep learning methods fall into three categories: segmentation-based methods, anchor-based methods, and parameter-based methods. Segmentation-based methods usually formulate lane line detection as a segmentation task that produces a pixel-by-pixel predicted segmentation map. These methods achieve high detection accuracy but are computationally slow, making it difficult to meet the real-time requirements of lane line detection in practice. Anchor-based methods generate candidate regions in an image using predefined anchors, then classify and regress these regions to detect and locate lane lines. These methods are more efficient but less effective in complex scenes. Parameter-based methods achieve lane line detection by modeling the lane curve and regressing the model parameters; they are computationally efficient but less accurate in complex and irregular road scenarios.

To effectively balance lane line detection accuracy and efficiency in complex environments, this paper proposes a frequency channel fusion coordinate attention mechanism network (FFCANet) for lane detection. FFCANet uses ResNet [22] as its backbone network. To enhance the model’s ability to extract lane features in complex road scenes, we propose the FFCA module, which mitigates interference from similar external features and improves detection accuracy by capturing lane line details and texture information from multiple spatial directions. Additionally, to achieve faster detection and handle scenarios that lack visual cues, a row-anchor-based prediction and classification method is employed, circumventing the high computational complexity of pixel-by-pixel segmentation. To further enhance feature extraction and capture the dynamic dependencies between channels, the ECA module [23] is incorporated into the auxiliary segmentation branch. Finally, FFCANet was evaluated on the CULane [24] and TuSimple [25] datasets. Compared to other methods, FFCANet significantly enhances detection performance in complex scenes and achieves faster detection speeds while maintaining high accuracy.

The main contributions of the work in this paper are as follows:

  • A feature enhancement method with a frequency channel fusion coordinate attention mechanism (FFCA) is proposed to increase feature diversity by capturing lane line details and texture information from multiple spatial directions.

  • A row-anchor-based prediction and classification method is employed to improve detection efficiency and address the lack of visual cues.

  • The ECA module is incorporated into the auxiliary segmentation branch to capture the dynamic dependencies between channels and further enhance feature extraction capability.

  • This approach demonstrates effectiveness and applicability on publicly available benchmark datasets, achieving an effective balance between lane line detection accuracy and speed.

2 Related work

In recent years, lane line detection technology has attracted significant attention from both academia and industry. Existing lane line detection methods are mainly classified into traditional methods and deep learning-based methods.

2.1 Traditional methods

Traditional lane line detection methods primarily utilize algorithms such as edge detection and the Hough transform to identify and track lane lines through image processing and feature extraction techniques. Li et al. [26] employed a multi-channel threshold fusion method based on gradient and background differences combined with HSV color features of lane lines for lane line detection. However, the method exhibits limited adaptability to environmental changes and complex road conditions. Muthalagu et al. [27] utilized a combination of HLS and Lab for color threshold segmentation, applied the Sobel edge detection operator to extract edge features, fused the color and edge features, and subsequently conducted lane line detection using a sliding window approach. However, this method necessitates frequent parameter adjustments and is slow in detection. Feng et al. [28] utilized enhanced gradient features to detect candidate edges of lane lines, obtained a bird’s eye view through inverse perspective transformation, and finally employed an adaptive sliding window approach. However, the method suffers from high computational complexity and suboptimal real-time performance. While traditional lane line detection methods perform well under simple conditions, they exhibit significant limitations in complex and dynamic driving scenarios.

2.2 Deep learning methods

Deep learning-based lane line detection methods offer the advantages of high accuracy and robustness, enabling efficient lane line detection under various road conditions through feature learning from large-scale datasets. These methods can be categorized into three main types: segmentation-based methods, anchor-based methods, and parameter-based methods.

2.2.1 Segmentation-based methods

Segmentation-based methods typically frame the lane line detection problem as a segmentation task, where lane lines are identified through a segmentation map predicted pixel by pixel. Pan et al. [24] introduced a Spatial Convolutional Neural Network (SCNN), employing a layer-by-layer convolutional structure facilitating efficient information exchange between pixels across rows and columns of different layers, thereby enhancing feature capture in images. However, this method exhibits high computational complexity and slow detection speed. Hou et al. [29] proposed the Self Attention Distillation (SAD) module, enabling the network to self-learn and achieve substantial improvements without additional supervision or labeling. However, the method merely enhances the interlayer information flow within the lane area without offering additional monitoring signals for occlusion handling. Xu et al. [30] utilized Neural Architecture Search (NAS) to discover a more effective backbone network for capturing more accurate information to enhance curve lane detection. However, NAS entails extremely high computational expense [31], inefficiency, and time consumption. While segmentation-based methods can attain high-precision detection, they suffer from elevated computational complexity and reduced detection efficiency.

2.2.2 Anchor-based methods

Anchor-based methods utilize predefined anchors to generate candidate regions in an image, which are subsequently classified and regressed to detect and locate lane lines. Su et al. [32] introduced a novel vanishing point-guided anchor generator, leveraging multiple structural information related to lane lines to enhance performance. However, the performance of this method may be limited in complex environments. Qin et al. [33] proposed a lane line detection method based on row anchors, treating the lane line detection process as a row-based selection problem using global features, significantly reducing computational cost and addressing the problem of missing visual cues. However, the method struggles to accurately determine the shape of the lanes. Liu et al. [34] proposed a conditional lane line detection strategy employing conditional convolution and row anchors, where the conditional convolution module adaptively adjusts the convolution kernel size, locates the starting point of the lane line, and performs row anchor-based lane line detection. However, the method struggles to recognize the starting point in certain complex scenes, leading to diminished performance. Anchor-based methods offer real-time performance advantages; however, their performance may not be optimal in more complex scenarios.

2.2.3 Parameter-based methods

The parameter-based approach models lane curves with parameters and regresses these parameters to detect lane lines. PolyLaneNet [35] proposed predicting the polynomial coefficients of lane lines via deep polynomial regression, where the output represents the polynomial of each lane line in the image. Despite its computational efficiency, this method lacks accuracy. Liu et al. [36] introduced the application of Transformer to predict the parameters of the lane shape model for each lane line, taking into account road structure and camera pose. However, the method is highly sensitive to prediction parameters, and errors in higher-order coefficients may result in alterations to lane shape. Feng et al. [37] utilized parametric Bezier curves to represent lane lines as parametric curves through curve modeling and employed deep learning networks to directly predict these curve parameters for more efficient and accurate lane line detection. However, in extremely curved road scenarios, cubic curves may inadequately represent lane lines. Although increasing the curve’s order partially addresses this issue, higher-order terms become challenging to predict, as small prediction errors can significantly alter lane line shapes. Consequently, the parameter-based method fails to surpass other lane detection methods in accuracy.

The aforementioned methods struggle to detect lane lines accurately and swiftly in complex environments, and achieving both accuracy and speed simultaneously remains challenging. Consequently, we introduce a novel lane line detection network to address this issue. The proposed method is also anchor-based but successfully handles complex lanes and improves lane line detection accuracy by employing a frequency channel fusion coordinate attention mechanism.

3 Method

In this section, the lane line detection network proposed in this article is detailed. The network structure comprises a feature extraction backbone network (ResNet), a frequency channel fusion coordinate attention module (FFCA), an auxiliary segmentation branch, and a predictive classification branch.

3.1 FFCANet architecture

The overall architecture of FFCANet is shown in Fig. 1. Initially, image preprocessing is employed to mitigate overfitting: operations such as translation and rotation [38] are applied to the input image to obtain an RGB lane line image of size 288 × 800. The ResNet backbone network then extracts preliminary features from the image. In this study, ResNet-18 and ResNet-34 are primarily used as the feature extraction networks, and the resulting preliminary feature map serves as the input to FFCA. The FFCA module processes the features independently along different spatial directions to capture diverse lane line feature information, and it employs multiple frequency components to extract additional lane line information and enhance feature diversity. This enables the model to detect and identify lane line features more accurately, capturing more lane line details and texture information and improving the model’s feature extraction capability. The enriched feature map is then fed into the row-anchor prediction and classification branch, where lane marking anchors are selected in predefined row-oriented cells and the probability of cells in different rows belonging to a lane line is predicted. With up to four possible lane lines, cells belonging to the same lane line are assigned to one class. Three feature maps output by scales 2–4 of the ResNet network serve as inputs to the auxiliary segmentation branch, into which the ECA module is integrated to highlight the lane line features most relevant to the current task and further enhance feature extraction. The auxiliary segmentation branch functions only during the training phase and is removed during testing, so it does not affect runtime speed.

Fig. 1 FFCANet overall architecture
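To make the data flow described above concrete, the following PyTorch sketch outlines one way the pipeline could be wired together. It is a minimal skeleton under assumed shapes and head sizes, not the authors' released code: the FFCA module is stood in by an identity placeholder, the row-anchor head is a simple pooling-plus-linear classifier, and the auxiliary head is a single 1 × 1 convolution over concatenated multi-scale features.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FFCANetSkeleton(nn.Module):
    """Illustrative skeleton of the FFCANet data flow (not the authors' released code)."""
    def __init__(self, num_lanes=4, num_rows=56, num_cells=100, use_aux=True):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                  resnet.maxpool, resnet.layer1)
        self.layer2, self.layer3, self.layer4 = resnet.layer2, resnet.layer3, resnet.layer4
        self.ffca = nn.Identity()                  # placeholder for the FFCA module of Sec. 3.2
        self.use_aux = use_aux
        # Row-anchor classification head: compress the deepest feature map and
        # predict (num_cells + 1) scores per row anchor and per lane.
        self.pool = nn.Conv2d(512, 8, kernel_size=1)
        self.cls = nn.Sequential(
            nn.Linear(8 * 9 * 25, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, (num_cells + 1) * num_rows * num_lanes))
        self.out_shape = (num_cells + 1, num_rows, num_lanes)
        # Auxiliary segmentation head (training only): per-pixel lane classes.
        self.aux = nn.Conv2d(128 + 256 + 512, num_lanes + 1, kernel_size=1)

    def forward(self, x):                          # x: (B, 3, 288, 800)
        x = self.stem(x)
        f2 = self.layer2(x)                        # scale-2 feature: (B, 128, 36, 100)
        f3 = self.layer3(f2)                       # scale-3 feature: (B, 256, 18, 50)
        f4 = self.layer4(f3)                       # scale-4 feature: (B, 512, 9, 25)
        f4 = self.ffca(f4)                         # feature enhancement
        cls = self.cls(self.pool(f4).flatten(1)).view(-1, *self.out_shape)
        if self.training and self.use_aux:
            size = f2.shape[-2:]
            fused = torch.cat([f2,
                               nn.functional.interpolate(f3, size=size),
                               nn.functional.interpolate(f4, size=size)], dim=1)
            return cls, self.aux(fused)            # segmentation branch dropped at test time
        return cls

x = torch.randn(1, 3, 288, 800)
print(FFCANetSkeleton().eval()(x).shape)           # torch.Size([1, 101, 56, 4])
```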

3.2 Frequency channel fusion coordinate attention mechanism

Due to their thin and sparse appearance, lane lines lack distinctive features and are susceptible to interference from other objects with a similar localized appearance. To fully leverage the shape prior of lane lines and enhance the model’s feature extraction capability, a frequency channel fusion coordinate attention mechanism (FFCA) is proposed. Utilizing multiple frequency components to extract more feature information enhances feature diversity, enabling the model to capture lane line details and texture information more accurately. Incorporating both channel information and direction-aware position information helps the model better locate and identify lane line features, thereby improving detection performance.

3.2.1 Frequency channel attention

The frequency channel attention network (FcaNet) is a channel attention mechanism based on the discrete cosine transform, which captures subtle texture variations and increases feature diversity by operating on features directly in the frequency domain [39]. Initially, FcaNet utilizes the discrete cosine transform (DCT) to convert the input feature maps into the frequency domain, allowing the network to learn and represent frequency-level information directly. The frequency channel attention module automatically learns and emphasizes the more crucial frequency components, thereby focusing more effectively on the frequency features pertinent to the task. Subsequently, the weighted frequency features are mapped back to the spatial domain using the inverse discrete cosine transform (IDCT), and the features are further refined by convolutional layers, ultimately producing a feature representation for classification.

The discrete cosine transform (DCT) is a commonly employed signal processing technique utilized to convert 1D or 2D signals from the spatial domain to the frequency domain. In image processing, DCT finds widespread application in tasks such as image compression and feature extraction. The basis function of the 2D DCT is:

$$ B_{{h,w}}^{{i,j}} = \cos \left( {\frac{{\pi h}}{H}\left( {i + \frac{1}{2}} \right)} \right)\cos \left( {\frac{{\pi w}}{W}\left( {j + \frac{1}{2}} \right)} \right) $$
(1)

The 2D DCT can be written as:

$$ f_{h,w}^{2d} = \sum\limits_{i = 0}^{H - 1} {\sum\limits_{j = 0}^{W - 1} {x_{i,j}^{2d} B_{h,w}^{i,j} } } $$
(2)

where \(h,i \in \{ 0,1, \cdots ,H - 1\}\), \(j,w \in \{ 0,1, \cdots ,W - 1\}\), \(H\) is the height of \(x^{2d}\), \(W\) is the width of \(x^{2d}\), \(f^{2d} \in R^{H \times W}\) is the spectrum of the 2D DCT, and \(x^{2d} \in R^{H \times W}\) is the input. Correspondingly, the 2D IDCT can be written as:

$$ x_{i,j}^{2d} = \sum\limits_{h = 0}^{H - 1} {\sum\limits_{w = 0}^{W - 1} {f_{h,w}^{2d} B_{h,w}^{i,j} } } $$
(3)

The input feature \(X\) is divided into \(n\) parts along the channel dimension, denoted as \([X^{0} ,X^{1} , \cdots ,X^{n - 1} ]\), where \(X^{i} \in R^{{C^{\prime} \times H \times W}}\), \(i \in \{ 0,1, \cdots ,n - 1\}\), \(C^{\prime} = C/n\), and \(C\) is divisible by \(n\). Each part is assigned a corresponding 2D DCT frequency component, and its 2D DCT result serves as the \({\text{Freq}}\) vector of that part:

$$ {\text{Freq}}^{i} = 2DDCT^{{u_{i} ,v_{i} }} (X^{i} ) = \sum\limits_{h = 0}^{H - 1} {\sum\limits_{w = 0}^{W - 1} {X_{:,h,w}^{i} B_{h,w}^{{u_{i} ,v_{i} }} } } $$
(4)

where \(i \in \{ 0,1, \cdots ,n - 1\}\) and \([u_{i} ,v_{i} ]\) is the 2D frequency index assigned to \(X^{i}\). \({\text{Freq}}^{i} \in R^{{C^{\prime}}}\) is the compressed \(C^{\prime}\)-dimensional vector, and the entire compressed vector is obtained by concatenation:

$$ {\text{Freq}} = {\text{compress}}(X) = {\text{cat}}([{\text{Freq}}^{0} ,{\text{Freq}}^{1} , \cdots ,{\text{Freq}}^{(n - 1)} ]) $$
(5)

where \({\text{Freq}} \in R^{C}\) is a multispectral vector, the entire multispectral channel attention framework can be written as:

$$ ms\_att = {\text{sigmoid}}(fc({\text{Freq}})) $$
(6)

From Eqs. (5) and (6), it is evident that the frequency channel attention module extends the original global average pooling to a framework with multiple frequency components, effectively representing compressed channel information. The frequency channel attention is depicted in Fig. 2.

Fig. 2 Frequency channel attention
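As an illustration of Eqs. (1)–(6), the sketch below builds the fixed DCT basis of Eq. (1) and uses it as a multi-spectral pooling step before a fully connected channel-attention layer. The choice of frequency indices, the reduction ratio, and the module name are assumptions for the example, not the exact FcaNet configuration.

```python
import math
import torch
import torch.nn as nn

def dct_basis(u, v, H, W):
    """2-D DCT basis of Eq. (1) for frequency component (u, v) on an H x W map."""
    i = torch.arange(H, dtype=torch.float32).view(H, 1)
    j = torch.arange(W, dtype=torch.float32).view(1, W)
    return torch.cos(math.pi * u / H * (i + 0.5)) * torch.cos(math.pi * v / W * (j + 0.5))

class MultiSpectralChannelAttention(nn.Module):
    """Sketch of the frequency channel attention of Eqs. (4)-(6); frequency indices are assumed."""
    def __init__(self, channels, H, W, freqs=((0, 0), (0, 1), (1, 0), (1, 1)), reduction=16):
        super().__init__()
        n = len(freqs)
        assert channels % n == 0
        # One fixed DCT filter per channel group, registered as a (non-trainable) buffer.
        basis = torch.stack([dct_basis(u, v, H, W) for u, v in freqs])     # (n, H, W)
        self.register_buffer("basis", basis.repeat_interleave(channels // n, dim=0))
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                               # x: (B, C, H, W)
        freq = (x * self.basis).sum(dim=(2, 3))         # Eqs. (4)-(5): compressed multispectral vector
        att = self.fc(freq)                             # Eq. (6): channel attention weights
        return x * att.view(*att.shape, 1, 1)

x = torch.randn(2, 512, 9, 25)
print(MultiSpectralChannelAttention(512, 9, 25)(x).shape)   # torch.Size([2, 512, 9, 25])
```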

3.2.2 Coordinate attention

Since FcaNet lacks understanding and application of the spatial domain of the feature map, the coordinate attention (CA) mechanism is incorporated to enhance the model’s comprehension of the spatial structure [40]. CA leverages the spatial structure of an image fully and processes features separately in various spatial directions to better capture information in different orientations. By decomposing the channel attention into two feature codes and aggregating the features along different spatial directions, the model can focus on the rows and columns of the image individually. Thus, it can efficiently capture long-range dependence of features in one direction while preserving precise positional information in the other direction. The coordinate attention mechanism is illustrated in Fig. 3.

Fig. 3 Coordinate attention mechanism

Firstly, each channel of the input feature \(x\) is encoded along the x-axis and y-axis using two pooling kernels with spatial extents \((H,1)\) and \((1,W)\), respectively. The encoding of the cth channel at height \(h\) and at width \(w\) is given in Eqs. (7) and (8), respectively.

$$ z_{c}^{h} (h) = \frac{1}{W}\sum\limits_{0 \le i < W} {x_{c} (h,i)} $$
(7)
$$ z_{c}^{w} (w) = \frac{1}{H}\sum\limits_{0 \le j < H} {x_{c} (j,w)} $$
(8)

where \(W\) and \(H\) are the width and height of the input features, respectively; \(i\) and \(j\) index the width and height coordinates; and \(x_{c}\) is the cth channel of the input feature.

The two formulas above encode each channel along the x-axis and y-axis, respectively, resulting in a pair of direction-aware feature maps that preserve the positional information of each channel. Capturing long-range dependencies along one coordinate direction while maintaining precise positional information along the other helps the network localize and identify important features more accurately.

The feature maps obtained from the coordinate information embedding stage are then concatenated and processed by a 1 × 1 convolution and a nonlinear activation:

$$ f = \delta (F_{1} ([z^{h} ,z^{w} ])) $$
(9)

where \(\delta\) is a nonlinear activation function; \(f \in R^{C/r \times (H + W)}\); \([ \cdot , \cdot ]\) denotes concatenation along the spatial dimension; and \(F_{1}\) is a 1 × 1 convolution.

\(f\) is then split into two separate tensors, \(f^{h} \in R^{C/r \times H}\) and \(f^{w} \in R^{C/r \times W}\), along the spatial dimension. Two 1 × 1 convolutions \(F_{h}\) and \(F_{w}\) transform \(f^{h}\) and \(f^{w}\) to have the same number of channels as the input features \(x\), respectively, and the formula is expressed as:

$$ g^{h} = \sigma (F_{h} (f^{h} )) $$
(10)
$$ g^{w} = \sigma (F_{w} (f^{w} )) $$
(11)

where \(\sigma\) is expressed as a sigmoid function. The output feature maps \(g^{h}\) and \(g^{w}\) are expanded and used as attentional weights for the x- and y-axis coordinates, respectively. Finally, the process of re-weighting the original feature map \(x\) through the coordinate attention can be written as:

$$ y_{c} (i,j) = x_{c} (i,j) \times g_{c}^{h} (i) \times g_{c}^{w} (j) $$
(12)
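The coordinate attention path of Eqs. (7)–(12) can be sketched in PyTorch as follows; the reduction ratio and the choice of activation are assumptions for illustration rather than the exact configuration used in FFCA.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of coordinate attention, Eqs. (7)-(12); reduction ratio r is an assumption."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)      # F1 in Eq. (9)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish(inplace=True)                     # nonlinear activation (delta)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)     # F_h in Eq. (10)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)     # F_w in Eq. (11)

    def forward(self, x):                                         # x: (B, C, H, W)
        B, C, H, W = x.shape
        z_h = x.mean(dim=3, keepdim=True)                         # Eq. (7): pool along width  -> (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True).transpose(2, 3)         # Eq. (8): pool along height -> (B, C, W, 1)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))   # Eq. (9): concat + 1x1 conv
        f_h, f_w = torch.split(f, [H, W], dim=2)                  # split back into the two directions
        g_h = torch.sigmoid(self.conv_h(f_h))                     # Eq. (10): (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.transpose(2, 3)))     # Eq. (11): (B, C, 1, W)
        return x * g_h * g_w                                      # Eq. (12): re-weight the input

x = torch.randn(2, 512, 9, 25)
print(CoordinateAttention(512)(x).shape)                          # torch.Size([2, 512, 9, 25])
```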

3.3 Auxiliary segmentation branch

The auxiliary segmentation branch is a segmentation method that utilizes multi-scale features to model local features and is primarily aimed at enhancing the network’s semantic analysis capability. This helps the model better comprehend the structural features of the lanes and enhances the feature extraction ability of the convolutional layers. The segmentation branch is utilized solely during the training phase of the model and is omitted during the testing phase, ensuring that runtime speed remains unaffected despite the additional segmentation task. To effectively capture the dynamic dependencies between channels and accurately highlight the more significant lane line features in the current task, the efficient channel attention (ECA) module is introduced into the auxiliary segmentation branch to further enhance the performance of the lane line segmentation network.

3.3.1 ECA module

The fundamental mechanism of the ECA module is to adaptively capture dependencies between channels using simple and efficient 1D convolutions, eliminating the need for cumbersome downscaling and upscaling processes [23]. Compared to the traditional attention mechanism, it circumvents the complex multi-layer perceptual machine structure, thereby reducing model complexity and computational burden. By computing an adaptive convolution kernel size, the ECA module directly applies one-dimensional convolution on the channel features, enabling it to learn the importance of each channel relative to the others. The ECA module is depicted in Fig. 4.

Fig. 4 ECA module

The size \(k\) of the 1D convolutional kernel is related to the channel dimension \(C\) through a mapping \(\phi\) between \(k\) and \(C\). The simplest mapping is a linear function such as \(\phi (k) = \gamma *k - b\). Since the channel dimension \(C\) is usually set to a power of 2, an exponential function is used to approximate the mapping \(\phi\), which can be expressed as:

$$ C = \phi (k) = 2^{(\gamma *k - b)} $$
(13)

Given the channel dimension \(C\), the convolution kernel size \(k\) can be determined adaptively by the expression.

$$ k = \psi (C) = \left| {\frac{{\log_{2} (C)}}{\gamma } + \frac{b}{\gamma }} \right|_{odd} $$
(14)

where \(\left| t \right|_{odd}\) denotes the odd number closest to \(t\). In the experiments, \(\gamma = 2\) and \(b = 1\) are used. Through this nonlinear mapping, higher-dimensional channels obtain longer-range interactions, while lower-dimensional channels obtain shorter-range interactions.
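A minimal sketch of such an ECA block, with the kernel size chosen by Eq. (14) for γ = 2 and b = 1, is given below; the rounding rule and the module wrapper are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of an ECA block; kernel size follows Eq. (14) with gamma=2, b=1."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = (math.log2(channels) + b) / gamma
        k = int(t) if int(t) % 2 == 1 else int(t) + 1             # |t|_odd: nearest odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                                         # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                                    # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)                  # 1-D conv across neighbouring channels
        return x * torch.sigmoid(y).view(*y.shape, 1, 1)          # channel re-weighting

print(ECA(512)(torch.randn(2, 512, 18, 50)).shape)                # torch.Size([2, 512, 18, 50])
```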

3.3.2 Auxiliary segmentation network

The layer2, layer3, and layer4 stages of the ResNet backbone produce three feature maps, denoted × 2, × 3, and × 4. These feature maps serve as the input to the segmentation network, enabling local feature reconstruction from multi-scale features. The segmentation network is depicted in Fig. 5.

Fig. 5 Auxiliary segmentation network

Initially, to capture the dependencies between feature channels, the × 2, × 3, and × 4 feature maps each undergo an ECA operation. After convolution, normalization, and activation, the × 3 and × 4 feature maps are upsampled, and all three are finally concatenated. The specific steps are illustrated in Fig. 6.
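The steps above can be sketched as follows. Channel widths, the intermediate width, and the number of segmentation classes are assumptions for the example, and the ECA blocks are stood in by identity placeholders (a full ECA block, as sketched in Sect. 3.3.1, could be substituted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class AuxSegHead(nn.Module):
    """Sketch of the auxiliary segmentation branch; widths and class count are assumptions."""
    def __init__(self, ch=(128, 256, 512), mid=128, num_classes=5):
        super().__init__()
        # Placeholders for the ECA blocks applied to the x2, x3, x4 feature maps.
        self.eca = nn.ModuleList([nn.Identity() for _ in ch])
        self.reduce = nn.ModuleList([conv_bn_relu(c, mid) for c in ch])
        self.head = nn.Sequential(conv_bn_relu(3 * mid, mid),
                                  nn.Conv2d(mid, num_classes, kernel_size=1))

    def forward(self, f2, f3, f4):
        feats, size = [], f2.shape[-2:]
        for i, f in enumerate((f2, f3, f4)):
            f = self.reduce[i](self.eca[i](f))             # ECA, then conv + BN + ReLU
            if f.shape[-2:] != size:
                f = F.interpolate(f, size=size, mode="bilinear", align_corners=False)
            feats.append(f)                                # upsample x3, x4 to the x2 resolution
        return self.head(torch.cat(feats, dim=1))          # concatenate and predict per-pixel classes

f2, f3, f4 = torch.randn(1, 128, 36, 100), torch.randn(1, 256, 18, 50), torch.randn(1, 512, 9, 25)
print(AuxSegHead()(f2, f3, f4).shape)                      # torch.Size([1, 5, 36, 100])
```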

Fig. 6 Specific steps of the segmentation network

3.4 Location selection and classification based on row anchors

Traditional semantic segmentation classifies the lane image pixel by pixel, which entails high computational complexity and slow lane line detection. To enhance the efficiency of lane line detection and address issues such as lane occlusion, this paper employs a location selection and classification method based on row anchors [33].

Lane line detection is thus transformed into the problem of selecting lane marking anchor points within predefined row-oriented cells using global features. Initially, the lane image is divided into a specific number of rows, with each row further subdivided into a certain number of cells. Subsequently, for each row of cells, predictions are made regarding the cells containing lane lines. Finally, the cells identified as containing lane lines across all predefined rows are categorized, with those belonging to the same lane line grouped together. Detecting lane line anchors in this way avoids processing each lane pixel individually, thereby significantly enhancing detection efficiency. Moreover, applying anchors to global image features yields a larger receptive field, which is more conducive to addressing challenging scenarios. Location selection and classification based on row anchors is depicted in Fig. 7.

Fig. 7 Location selection and classification based on row anchors

Assuming the maximum number of lanes is \(C\), \(h\) represents the number of divided rows (i.e., the number of row anchors), and \(w\) denotes the number of cells per row. If \(X\) represents a global image feature, then \(f^{ij}\) denotes the classifier used to select the lane position on the \(ith\) lane and \(jth\) row anchor. Then the lane is predicted to be:

$$ P_{i,j,:} = f^{ij} (X) $$
(15)

where \(i \in [1,C]\), \(j \in [1,h]\). \(P_{i,j,:}\) is a \((w + 1)\)-dimensional vector, representing the probability of selecting \((w + 1)\) cells for the \(ith\) lane and \(jth\) row anchor.
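For illustration, the sketch below decodes such row-anchor predictions into lane points, treating the extra \((w + 1)\)-th bin as "no lane on this row anchor" and using the expectation of Eq. (19) for the horizontal position. The cell-to-pixel mapping and the sampled row coordinates are assumptions for the example, not the exact post-processing used in the paper.

```python
import torch

def decode_row_anchors(P, row_anchor_ys, img_w):
    """Turn row-anchor scores P of shape (num_cells+1, num_rows, num_lanes) into lane point lists.

    Assumes the last bin of each row means "no lane on this row anchor"; the sampled
    y coordinates `row_anchor_ys` and the cell-to-pixel mapping are illustrative.
    """
    num_cells = P.shape[0] - 1
    probs = torch.softmax(P[:num_cells], dim=0)                    # Eq. (18) over the w cells
    cols = torch.arange(1, num_cells + 1, dtype=torch.float32)
    expected = (probs * cols.view(-1, 1, 1)).sum(dim=0)            # Eq. (19): expected cell index
    valid = P.argmax(dim=0) != num_cells                           # background bin not selected
    lanes = []
    for lane in range(P.shape[2]):
        pts = [((expected[row, lane].item() - 0.5) * img_w / num_cells, y)
               for row, y in enumerate(row_anchor_ys) if valid[row, lane]]
        lanes.append(pts)
    return lanes

# Toy usage with assumed TuSimple-style settings (100 cells, 56 row anchors, 4 lanes).
P = torch.randn(101, 56, 4)
ys = list(range(64, 288, 4))                                       # 56 sampled rows in a 288-high image
print(len(decode_row_anchors(P, ys, img_w=800)))                   # 4
```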

3.5 Loss function

During lane line detection, the background typically occupies most of the image, while the lane lines constitute only a small portion of the targets. To enhance the network’s ability to learn and focus on critical and difficult-to-classify lane line samples, classification loss is introduced. The classification loss is defined as follows:

$$ L_{cls} = \sum\limits_{i = 1}^{C} {\sum\limits_{j = 1}^{h} {L_{CE} (P_{i,j,:} ,T_{i,j,:} )} } $$
(16)

where \(L_{CE}\) denotes the cross-entropy loss and \(T_{i,j,:}\) denotes the one-hot label of the correct lane position.

Since the lane points in neighboring rows of anchors are close to each other, lane locations are represented by classification vectors. Thus, continuity is achieved by constraining the distribution of classification vectors across neighboring row anchors. The similarity loss function is defined as:

$$ L_{sim} = \sum\limits_{i = 1}^{C} {\sum\limits_{j = 1}^{h - 1} {\left\| {P_{i,j,:} - P_{i,j + 1,:} } \right\|_{1} } } $$
(17)

where \(\left\| \cdot \right\|_{1}\) represents \(L_{1}\) norm.

Another loss function emphasizes the shape of the lane. Given that most lanes are straight, second-order difference equations are employed to constrain their shape. First, the probabilities for different lane positions are computed using the softmax function, as expressed by:

$$ Prob_{i,j,:} = softmax(P_{i,j,1:w} ) $$
(18)

where \(P_{i,j,1:w}\) denotes the \(w\)-dimensional vector and \(Prob_{i,j,:}\) represents the probability of each lane position. Next, using the expectation of the predictions as the lane location rather than a non-differentiable argmax, the expected lane position can be expressed as:

$$ Loc_{i,j} = \sum\limits_{k = 1}^{w} {k \cdot Prob_{i,j,k} } $$
(19)

where \(Prob_{i,j,k}\) represents the probability of the \(ith\) lane, the \(jth\) row anchor, and the \(kth\) position.

According to Eq. (19), the second-order difference constraint can be formulated as:

$$ L_{shp} = \sum\limits_{i = 1}^{C} {\sum\limits_{j = 1}^{h - 2} {\left\| {(Loc_{i,j} - Loc_{i,j + 1} ) - (Loc_{i,j + 1} - Loc_{i,j + 2} )} \right\|_{1} } } $$
(20)

where \(Loc_{i,j}\) denotes the position on the \(ith\) lane, \(jth\) row anchor. Finally, the structural loss can be expressed as:

$$ L_{str} = L_{sim} + \lambda L_{shp} $$
(21)

where \(\lambda\) denotes the loss coefficient.

We employ cross entropy as an auxiliary segmentation loss, then the total loss function can be expressed as:

$$ L_{total} = L_{cls} + \alpha L_{str} + \beta L_{seg} $$
(22)

where \(\alpha\) and \(\beta\) denote the loss coefficients, and \(L_{seg}\) denotes the auxiliary segmentation loss.
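A compact sketch of Eqs. (16)–(22) is given below. The tensor layout, the use of mean rather than sum reductions, and the default loss coefficients are simplifying assumptions for the example rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def total_loss(P, T, seg_pred=None, seg_gt=None, lam=1.0, alpha=1.0, beta=1.0):
    """Sketch of Eqs. (16)-(22); P is (B, w+1, h, C) logits, T is (B, h, C) cell-index labels."""
    B, w1, h, C = P.shape
    # Eq. (16): cross-entropy over the (w+1) cells of every lane / row anchor.
    L_cls = F.cross_entropy(P, T)
    # Eq. (17): L1 distance between classification vectors of neighbouring row anchors.
    L_sim = (P[:, :, :-1, :] - P[:, :, 1:, :]).abs().mean()
    # Eqs. (18)-(20): expected location per row anchor, then a second-order difference.
    probs = torch.softmax(P[:, :w1 - 1], dim=1)
    cols = torch.arange(1, w1, dtype=P.dtype, device=P.device).view(1, -1, 1, 1)
    loc = (probs * cols).sum(dim=1)                                  # (B, h, C)
    L_shp = ((loc[:, :-2] - loc[:, 1:-1]) - (loc[:, 1:-1] - loc[:, 2:])).abs().mean()
    L_str = L_sim + lam * L_shp                                      # Eq. (21)
    L_seg = F.cross_entropy(seg_pred, seg_gt) if seg_pred is not None else 0.0
    return L_cls + alpha * L_str + beta * L_seg                      # Eq. (22)

P = torch.randn(2, 101, 56, 4)
T = torch.randint(0, 101, (2, 56, 4))
print(total_loss(P, T).item())
```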

4 Experiments and results

To validate the effectiveness and applicability of the FFCANet proposed in this paper, its performance is evaluated alongside other lane line detection methods on two public datasets, TuSimple and CULane, respectively. The following sections focus on experimental setup, ablation study, performance comparison and analysis, and visualization of results.

4.1 Experimental setting

4.1.1 Datasets

To assess the performance of the model proposed in this paper, it is trained and tested on two widely used benchmark datasets for lane line detection: TuSimple and CULane. The details of the two datasets are provided in Table 1.

Table 1 Details of the datasets

TuSimple is one of the most widely used datasets for lane line detection and was collected on motorways under relatively consistent lighting conditions. It encompasses motorway scenes captured in various weather conditions, with lane markings annotated in each image. Lanes are annotated by the 2D coordinates of sampled points at a uniform height interval of 10 pixels, and the dataset includes both straight and curved lanes. Created by the Chinese University of Hong Kong, the CULane dataset was primarily collected in cities and on motorways and covers nine driving scenarios: normal, crowded, night, no line, shadow, arrow, curve, dazzle light, and cross.

4.1.2 Evaluation of indicators

TuSimple and CULane use different official evaluation metrics. For the TuSimple dataset, the primary evaluation metrics are accuracy (Acc), false positive rate (FP), and false negative rate (FN). The expression for Acc is as follows:

$$ A{\text{cc}} = \frac{{\sum\nolimits_{clip} {C_{clip} } }}{{\sum\nolimits_{clip} {S_{clip} } }} $$
(23)

where \(C_{clip}\) denotes the number of correctly predicted lane points and \(S_{clip}\) denotes the total number of ground-truth lane points in the image. A predicted lane is considered correct if more than 85% of its points lie within the distance threshold of the ground truth. The FP and FN are calculated as:

$$ FP = \frac{{F_{pred} }}{{N_{pred} }} $$
(24)
$$ FN = \frac{{M_{pred} }}{{N_{{{\text{g}}t}} }} $$
(25)

where \(F_{pred}\) denotes the number of incorrectly predicted lanes and \(N_{pred}\) denotes the number of predicted lanes. \(M_{pred}\) denotes the number of missed lanes and \(N_{gt}\) denotes the number of ground truth lanes.

For the CULane dataset, each lane line is assumed to have a width of 30 pixels. The Intersection over Union (IOU) between the model predictions and the ground truth is computed, with predictions having an IOU > 0.5 considered as true positives. Finally, the F1 score is used as the evaluation metric for the CULane dataset. The expression is as follows:

$$ F1 = \frac{{2 \times {\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}} $$
(26)
$$ {\text{Precision}} = \frac{TP}{{TP + FP}} $$
(27)
$$ {\text{Recall}} = \frac{TP}{{TP + FN}} $$
(28)

where TP, FP, and FN indicate true positives, false positives, and false negatives, respectively.
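The following small sketch illustrates how these metrics are computed from counts of matched lanes and points; the numbers in the usage example are purely illustrative and are not results from the paper.

```python
def culane_f1(tp, fp, fn):
    """F1 from Eqs. (26)-(28): lanes matched by IoU > 0.5 count as true positives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def tusimple_accuracy(correct_points, gt_points):
    """Eq. (23): correctly predicted lane points over ground-truth points, summed over clips."""
    return sum(correct_points) / sum(gt_points)

# Toy numbers, purely illustrative.
print(round(culane_f1(tp=900, fp=100, fn=150), 3))         # 0.878
print(round(tusimple_accuracy([95, 88], [100, 90]), 3))    # 0.963
```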

4.1.3 Implementation details

The models in this paper were trained and tested on an NVIDIA GeForce RTX 4060 Laptop GPU and a 13th Gen Intel(R) Core(TM) i7-13650HX CPU, using the PyTorch deep learning framework. We utilize a pre-trained ResNet as the backbone network, and the input images are resized to 288 × 800 pixels. To mitigate overfitting and enhance generalization, we apply a data augmentation strategy that includes scaling, panning, and flipping. For the TuSimple dataset, the number of training epochs is set to 200 and the batch size to 32; the AdamW optimizer [41] is utilized with an initial learning rate of 4e-4 and a weight decay of 1e-4, and the cosine annealing learning rate decay strategy [42] is employed during training. For the CULane dataset, the number of training epochs is set to 50 and the batch size to 32; when training on a single GPU, the initial learning rate is set to 0.025. The division of row anchors also differs across datasets, and the specific parameter settings are shown in Table 2.

Table 2 Parameter settings on different datasets
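As a concrete illustration of the TuSimple training configuration described above, the sketch below wires AdamW and cosine annealing together in PyTorch; the placeholder model, dummy loss, and number of steps per epoch are assumptions for the example, not part of the original setup.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hypothetical model stand-in; epochs, batch size, learning rate, and weight decay
# follow the TuSimple settings reported in Sec. 4.1.3.
model = torch.nn.Linear(10, 10)                          # placeholder for FFCANet
optimizer = AdamW(model.parameters(), lr=4e-4, weight_decay=1e-4)
epochs, steps_per_epoch = 200, 100                       # steps_per_epoch is an assumption
scheduler = CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch)

for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        optimizer.zero_grad()
        loss = model(torch.randn(32, 10)).pow(2).mean()  # batch size 32; dummy loss
        loss.backward()
        optimizer.step()
        scheduler.step()                                 # cosine annealing decay per step
```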

4.2 Ablation study

To validate the effectiveness of the proposed FFCA module and of introducing the ECA module in the auxiliary segmentation branch, ablation studies are conducted on the TuSimple dataset to show the contribution of each module. ResNet-18 is selected as the baseline network, and the FFCA and ECA modules are added to the baseline network in turn to assess the impact of each module on lane line detection. The models are trained and tested separately with identical parameter settings, and the results of the ablation experiments are presented in Table 3.

Table 3 Results of ablation experiments on the TuSimple dataset

From the data in the table above, it is evident that adding the proposed FFCA module to the original network increases accuracy by 0.16% and reduces false positives and false negatives by 0.0014 and 0.0020, respectively. This demonstrates that the FFCA module makes full use of the shape prior of lane lines to better locate and identify lane line features. With the introduction of the ECA module in the auxiliary segmentation branch, accuracy improves by 0.11% and false positives and false negatives are reduced by 0.0011 and 0.0013, respectively, showing that the ECA module captures the dynamic dependencies between channels and more accurately highlights the lane line features most important for the task at hand. Introducing the FFCA and ECA modules simultaneously results in a 0.24% increase in accuracy and reductions of 0.0022 and 0.0027 in false positives and false negatives, respectively, affirming the efficacy of both modules. The visual comparison results of the ablation experiments are shown in Fig. 8. The figure indicates that lane line fitting is improved by incorporating the FFCA and ECA modules, further demonstrating the validity and feasibility of these modules.

Fig. 8 Visual comparison results of ablation experiments

4.3 Performance comparison and analysis

The FFCANet of this paper is evaluated on two public datasets, TuSimple and CULane, using ResNet-18 and ResNet-34 as backbone networks, respectively, and compared with other lane line detection methods. The comparison methods were tested on the same hardware under the same conditions. For the TuSimple dataset, eight methods, SCNN [24], SAD [29], SGNet [32], PolyLNet [35], UFLD [33], CondLNet [34], MAM [43], and BezierLNet [37], were selected for comparison. Acc, FP, FN, and runtime (elapsed time per frame) are used as evaluation metrics, and the results are shown in Table 4.

Table 4 Comparison with other methods on the TuSimple dataset

Since the TuSimple dataset only contains motorway scenes with adequate lighting and favorable weather conditions, and the traffic scenarios are relatively uncomplicated, there is little difference in the accuracy of the lane line detection methods. When using the ResNet-18 backbone, the Acc of FFCANet is 96.06%, which is 0.19%, 2.7%, 0.24%, 0.58%, 0.23%, and 0.65% higher than SGNet, PolyLNet, UFLD, CondLNet, MAM, and BezierLNet, respectively. Compared with UFLD, which is also based on row-anchor detection, FP and FN are reduced by 0.0022 and 0.0027, respectively. When using the ResNet-34 backbone, the Acc of FFCANet is 96.09%, which is 0.22%, 2.73%, 0.23%, 0.61%, 0.26%, and 0.44% higher than SGNet, PolyLNet, UFLD, CondLNet, MAM, and BezierLNet, respectively. FP and FN are reduced by 0.0026 and 0.0032 over UFLD, respectively. Although the Acc of FFCANet is slightly lower than that of SCNN and SAD, the proposed FFCANet is far superior in time consumed per image frame, running 46.03 and 4.62 times faster than SCNN and SAD, respectively. In summary, FFCANet operates quickly while maintaining a high Acc.

For the CULane dataset, nine methods, SCNN [24], SAD [29], Res34-VP [21], E2E [19], LSTR [36], DeepLab [20], GCSbn [44], STLNet [45], and UFLD [33], were selected for comparison. F1 score and FPS were used as assessment indicators, and the results are shown in Table 5. From the table, it can be seen that FFCANet achieves superior results in terms of both F1 score and speed, reaching the fastest speed of 345.8 FPS with excellent real-time performance. When using the ResNet-34 backbone network, FFCANet achieves an F1 score of 72.8, outperforming all other comparison methods except STLNet while retaining a speed advantage. Although STLNet has a slightly higher F1 score than our method, our approach is 3.17 times faster. The F1 score of FFCANet also outperforms that of UFLD when using the same backbone network, with little difference in speed, showing that the proposed FFCA module and the introduction of ECA in the segmentation branch enhance feature extraction capability. These results demonstrate the excellent performance of FFCANet in challenging scenarios.

Table 5 Comparison with other methods on the CULane dataset

4.4 Visualization results

To visualize the detection performance of the proposed method, detection is performed on the two public datasets TuSimple and CULane. The visualization results of FFCANet on the TuSimple dataset are depicted in Fig. 9. As observed from the figure, the lane lines at the far end of curves are accurately fitted (yellow elliptical frames), and lane line occlusion is effectively handled (red elliptical frames). The network demonstrates excellent performance on straight lanes, curves, and vehicle occlusion.

Fig. 9 Visualization results on the TuSimple dataset

The visualization results on the CULane dataset are illustrated in Fig. 10. A variety of lane line scenarios from the CULane dataset were selected. It is evident from the figure that FFCANet exhibits exceptional lane line detection performance in complex scenarios. Excellent detection of vehicle occlusion, glare interference, and curved lane lines is shown in the blue, red, and yellow elliptical frames in the figure, respectively. Analysis of the visualization results demonstrates that FFCANet exhibits outstanding generalization ability and robustness in detecting lane lines in complex scenarios.

Fig. 10 Visualization results on the CULane dataset

The visualization of the CULane dataset presented in Fig. 11 shows that our method obtains smoother and more accurate lane lines in these challenging scenarios compared to other methods.

Fig. 11 Visual comparison results of UFLD, GCSbn, SAD and our method on the CULane dataset

5 Conclusion

In this paper, we introduce FFCANet, a frequency channel fusion coordinate attention mechanism network for lane detection. FFCANet adopts ResNet as its backbone network. We introduce the FFCA module, which captures lane line details and texture information from different spatial orientations to enhance feature diversity. Additionally, to effectively address the challenges of detection efficiency and the absence of visual cues, we employ a row-anchor-based prediction and classification method. This method avoids the high computational complexity of pixel-by-pixel segmentation by treating lane line detection as the selection of lane marking anchors, based on global features, in predefined row-oriented cells. To further enhance feature extraction, an ECA module is introduced in the auxiliary segmentation branch to capture the dynamic dependencies between channels. We evaluated FFCANet on two publicly available benchmark datasets, TuSimple and CULane, and designed ablation experiments to validate the effectiveness of each module. Experimental results demonstrate that the proposed method achieves a balance between detection accuracy and efficiency in complex road scenarios. Furthermore, the lightweight ResNet-18 version achieves a processing speed of 345.8 frames per second.

In future research, an integrated lane line detection scheme that combines vision algorithms with LiDAR technology could be explored. By integrating convolutional neural networks (CNNs) with LiDAR, 3D information of lane lines can be obtained to enhance detection accuracy, stability, and real-time performance. This integration aims to enhance the perception and decision-making capabilities of autonomous vehicles in complex traffic environments, thereby providing crucial technical support for the advancement of self-driving cars.