1 Introduction

Respiratory diseases such as lung cancer, tuberculosis, bronchitis, emphysema, pneumonia, and cystic fibrosis remain prevalent and life-threatening [1]. One of the most recent and most harmful of these diseases is COVID-19, caused by the SARS-CoV-2 virus [2]. Because the symptoms of COVID-19 are similar to those of pneumonia, the disease is often misdiagnosed. Early diagnosis is essential both for preventing the spread of COVID-19 and for effective treatment [3]. In addition to COVID-19, according to World Health Organization (WHO) statistics for 2020, 10 million deaths from various cancer types were reported worldwide, of which 1.8 million were associated with lung cancer [4].

One way to reduce lung cancer mortality is low-dose computed tomography screening, which can be an effective life-saving tool if the screening results are correctly interpreted and analyzed [5]. However, manual analysis of medical images is laborious, operator-dependent, and subjective, which can lead to misdiagnosis when pathological information is misinterpreted or overlooked. As with many other diseases, detecting and interpreting lung nodules is challenging. First, lung nodules are often patient-specific and heterogeneous, which complicates their identification [6]. Second, lung nodules are geometrically diverse and vary widely in size. Finally, nodules are difficult to distinguish from the surrounding normal tissue, as they are often visually similar [7].

Recent advances in deep learning methods and their application to medical image analysis have shown promising potential in computer-aided detection and diagnosis (CAD). The capability of deep learning-based CAD techniques has almost reached the level of experienced radiologists [8]. CAD systems are generally divided into four stages: pre-processing, segmentation, feature extraction, and classification. Pre-processing prepares the data for a deep learning model, typically through denoising and contrast adjustment [9]. In the segmentation stage, target objects such as organs, tissues, masses, and tumors are segmented. Features including size, location, texture, and patterns are then extracted from the segmented lesions of interest. In the final classification stage, the selected features are analyzed, and a segmented area is assigned to one of the given groups. In the case of lung cancer, the target area is the tumor, the features are the characteristics of the tumor, and the classification groups are benign and malignant [10].

This comparative study focuses on state-of-the-art deep learning-based lung nodule segmentation and annotation systems. The majority of earlier research was conducted on small datasets [11,12,13,14], which limits fair comparison given that deep learning-based techniques require a large number of training images to achieve optimal performance. There was also a lack of validation by other groups, because many of these studies were performed on private datasets [15,16,17,18]. Additionally, most papers focus primarily on reliability metrics, such as overall accuracy, with little attention paid to model complexity. However, considering model complexity is vital, as it provides insight into the computational resources required and sheds light on the trade-offs between model intricacy and performance. Comparing the latest trends in the field gives a clearer picture of the direction of current studies and the horizons of future research. Most of the recently proposed models are based on the transformer architecture, which was originally presented in [19] (2017). Initially, the transformer was developed for natural language processing (NLP) tasks [20]. Nowadays, transformers have numerous applications in computer vision. As in NLP, the main component of the vision transformer is the self-attention layer, which captures long-range dependencies and thus improves the modeling of global features [21].

Deep learning-based medical image segmentation methods can be categorized in many ways, such as by network architecture, training method, or input data type [22]. Examples of network architectures include RNNs [23], LSTMs [24], CNNs [25], and ViTs [26]. The most common of them are summarized in Table 1; the references in this table are examples of implementations of these architectures in medical image segmentation. In terms of training method, deep learning models are broadly divided into supervised learning [27], unsupervised learning [28], and reinforcement learning [29]. The first two are distinguished by input data size and labels (annotated data): supervised learning generally uses a large amount of data and labels, whereas unsupervised learning requires no annotated data. Reinforcement learning is a special case in which learning proceeds through interactions between an agent and its environment; it is more focused on goal-directed learning than the other types of machine learning.

Table 1 Summary of common single networks and hybrid networks

The rest of the paper is organized as follows: (i) a review of the dataset and methods used, in terms of their key features (Sect. 2); (ii) a performance comparison of the methods presented in (i) (Sect. 3); and (iii) discussion and conclusions (Sect. 4).

2 Materials and Methods

2.1 CT Database for Model Quantification

The LIDC-IDRI (Lung Image Database Consortium and Image Database Resource Initiative) is the largest publicly available dataset with annotations of lung nodules. Sample CT image slices from the dataset are presented in Fig. 1. The size and annotation quality of this dataset allow a reliable benchmark to be produced for evaluating the performance of different segmentation models [48]. The dataset includes 1010 cases of helical thoracic CT images, annotated masks with lung nodules, and XML files with subjective characteristics of the nodules such as spiculation, lobulation, subtlety, internal structure, shape (sphericity), margin, solidity, and likelihood of malignancy. Images and masks are provided as DICOM files, which contain not only pixel data but also metadata such as device information, image acquisition parameters, and anonymized patient identification data [49].

Fig. 1
figure 1

Sample images from LIDC dataset

Annotations were made by 12 expert radiologists in a two-phase process. The first phase was the blinded-read phase, where radiologists separately reviewed each CT slice and annotated target lesions, classifying them into one of three categories based on the nodule size: nodules with a size in the 3–30 mm range, nodules with a size less than 3 mm, and non-nodules with a size greater than or equal to 3 mm. The main focus of the database is nodules with sizes in the 3–30 mm range, as they are most likely to be malignant [50]. During the second phase, known as the unblinded-read phase, each radiologist conducted an independent review of their own annotations, in addition to examining the anonymized annotations made by the three other radiologists, in order to formulate a final opinion. The objective of this process was to detect, to the fullest extent possible, all lung nodules present in each CT scan without the necessity of enforcing unanimous consensus.

2.2 Deep Learning Models for Lung Nodule Segmentation

Recent advances in deep learning have substantially impacted the computer vision field [51]. This section describes the major features, such as structure, functionality, and advantages and limitations, of the deep learning models used in this comparative study. We chose five deep learning models for further comparison and evaluation based on the following criteria:

Firstly, all models consider local and global connections, which are crucial in medical image segmentation [52] because they allow the model to understand both the fine details and the broader context of the image. Local connections focus on specific features and their relationships in close proximity, while global connections provide a wider context, capturing overall structures and relationships between different regions. This combined approach ensures that the model can effectively discern intricate details while maintaining a comprehensive understanding of the entire image.

Secondly, models have high generalization ability, which means that they can be used for a variety of medical image tasks [53]. This attribute is crucial as it implies that the models can effectively adapt and perform well on various types of medical images, even those they have not specifically been trained on. This adaptability is especially valuable in the dynamic and diverse field of medical imaging, where a single model may be required to handle a wide array of image modalities, anatomical regions, or clinical scenarios.

Finally, the models were developed and published within the past five years and reflect the latest trends in the computer-aided detection and diagnosis field; they can therefore serve as a basis for improvement in future model development.

2.2.1 Salient Attention UNet

The Salient Attention UNet model is based on the UNet model and Salient Attention blocks [54]. The architecture of the model is presented in Fig. 2. The U-shaped UNet architecture comprises three main parts: the encoder, the skip-connections, and the decoder [55]. In the encoder, global contextual representations are learned through gradually downsampling CNN layers. In the decoder, deep features are extracted, and missing spatial information is restored through the skip-connections. These features are then combined and upsampled to match the input resolution, enabling precise pixel/voxel-wise semantic prediction. In the Salient Attention UNet, attention blocks are incorporated into the encoder layers to improve the attentional control of the network, forcing it to learn feature representations that focus on high-priority target regions. Figure 3 presents the structure of the attention block. Input feature maps, denoted as \({F}_{n}\), are processed through two sequential blocks. The first block performs max-pooling to generate \({P}_{n}\) features, while the second block utilizes a 1 × 1 convolutional layer to expand the channel dimension to 128. The saliency map, S, is downsampled via max-pooling and passed through a convolutional layer so that its channel dimension matches that of the feature maps. Subsequently, the channel-expanded \({F}_{n}\) and the processed S are added to produce intermediate maps \({I}_{n}\). This result then passes through several convolutional layers followed by a sigmoid function that normalizes it within the [0, 1] range, yielding the attention map A. Finally, the product of the attention map A and the max-pooled feature map \({P}_{n}\) yields the output of the attention block, denoted as \({O}_{n}\).

Fig. 2
figure 2

Architecture of the salient attention UNet

Fig. 3
figure 3

The structure of the attention block
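
To make the data flow described above concrete, the following is a minimal PyTorch sketch of such an attention block. It follows the stated channel expansion to 128 and the Conv 3 × 3, Conv 3 × 3, Conv 1 × 1 cascade; the pooling of both branches to a common resolution, the ReLU activations, and the single-channel saliency map are assumptions made for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SalientAttentionBlock(nn.Module):
    """Sketch of the attention block in Fig. 3 (channel sizes follow the text;
    other details are illustrative assumptions)."""

    def __init__(self, in_channels: int, mid_channels: int = 128):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2)                  # F_n -> P_n
        self.expand_f = nn.Conv2d(in_channels, mid_channels, 1)  # 1x1 conv, expand to 128 channels
        self.expand_s = nn.Conv2d(1, mid_channels, 1)            # project the saliency map S
        self.attn = nn.Sequential(                               # Conv 3x3 -> Conv 3x3 -> Conv 1x1
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, in_channels, 1),
            nn.Sigmoid(),                                        # attention map A in [0, 1]
        )

    def forward(self, f_n: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        p_n = self.pool(f_n)                                     # max-pooled feature maps P_n
        i_n = self.expand_f(p_n) + self.expand_s(self.pool(s))   # intermediate maps I_n
        a = self.attn(i_n)                                       # attention map A
        return a * p_n                                           # output O_n


# Usage example (shapes are illustrative)
f = torch.randn(1, 64, 128, 128)     # encoder feature maps F_n
s = torch.rand(1, 1, 128, 128)       # saliency map S at the same resolution
o = SalientAttentionBlock(64)(f, s)  # -> (1, 64, 64, 64)
```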

In tumor segmentation, directing the network's attention is a crucial part of accurate semantic prediction. In the original paper, the model was evaluated on a breast ultrasound dataset collected from three hospitals.

The strength of the model is its use of prior knowledge about the organ and tumor, which is used to generate salient attention maps during training. However, it is important to note that the model comes with a quadratic computational cost.

2.2.2 Connected-UNets

The Connected-UNets model is based on two basic UNet models connected through skip-connections [56]. Figure 4 details the architecture of this model. Atrous Spatial Pyramid Pooling (ASPP) blocks serve as the bottleneck of the network [57]. ASPP extracts multi-scale contextual information, which helps address the loss of resolution in the case of small tumors. There are a few variants of this architecture: Connected-UNets, Connected-AUNets, and Connected-ResUNets; the last two apply the Connected-UNets scheme to Attention UNet [58] and Residual UNet [59], respectively. In the original paper, three mammography datasets were used for model evaluation: the Curated Breast Imaging Subset of the Digital Database for Screening Mammography [60], INbreast [61], and a private dataset.

Fig. 4
figure 4

Architecture of the connected-UNets
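
For reference, a minimal sketch of an ASPP-style bottleneck is given below. The dilation rates and normalization layers, and the absence of an image-pooling branch, are assumptions for illustration; the Connected-UNets implementation may differ in detail.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Sketch of an Atrous Spatial Pyramid Pooling bottleneck.
    Dilation rates and normalization are illustrative assumptions."""

    def __init__(self, in_channels: int, out_channels: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # Fuse the multi-scale responses back into a single feature map
        self.project = nn.Conv2d(out_channels * len(rates), out_channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]  # same size, different receptive fields
        return self.project(torch.cat(feats, dim=1))


# Usage: bottleneck features between the encoder and decoder (shapes illustrative)
x = torch.randn(1, 256, 32, 32)
y = ASPP(256, 256)(x)  # multi-scale context at unchanged spatial resolution
```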

The Connected-UNets variants are capable of predicting small tumors and outperform standard architectures such as UNet, Attention UNet, and Residual UNet. A limitation of the model is its use of ASPP blocks, which can discard a large amount of local detail [62].

2.2.3 DDANet

The DDANet model is based on an encoder-decoder architecture with residual blocks and a squeeze-and-excitation layer [63]. The architecture of the model is presented in Fig. 5. Because the training error tends to rise as neural networks become deeper and a particular layer's activation can tend toward zero deep in the network, residual blocks with identity shortcuts are applied to mitigate this problem. The squeeze-and-excitation layer acts as a channel-wise attention mechanism, addressing the limitation of standard CNNs that treat every feature channel as equally important. The detailed structures of the residual block, encoder block, and decoder block are provided in parts a), b), and c) of Fig. 6, respectively.

Fig. 5
figure 5

Architecture of the DDANet

Fig. 6
figure 6

a residual block, b encoder block, c decoder block of the DDANet
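
As an illustration of the residual block combined with channel-wise attention described above, a minimal sketch is given below. The exact layer ordering, activation choices, and reduction ratio are assumptions, not DDANet's actual configuration.

```python
import torch
import torch.nn as nn

class SEResidualBlock(nn.Module):
    """Sketch of a residual block followed by a squeeze-and-excitation layer.
    Layer ordering and the reduction ratio are illustrative assumptions."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # Squeeze: global average pooling; excitation: per-channel gating weights
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.body(x)
        out = out * self.se(out)    # channel-wise attention reweights the feature channels
        return torch.relu(out + x)  # identity shortcut mitigates vanishing activations


x = torch.randn(1, 64, 56, 56)
y = SEResidualBlock(64)(x)
```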

There are two parallel decoders after the shared encoder, producing two outputs: the first decoder outputs a segmentation mask, while the second outputs a grayscale image. The second decoder generates an attention map, which helps improve the semantic representation of the feature maps [64]. The attention mechanisms applied in the network increase the performance of the model by capturing essential local features, filtering irrelevant information, and improving long-range dependencies.

The model can be used for real-time prediction. However, the real-time predictions require increased computational resources due to the necessity for parallel processing of data, which can be considered a potential limitation of the model.

2.2.4 UTNet

The UTNet model is based on Transformer technology combined with a CNN in a U-shaped encoder-decoder architecture [65]. The architecture of the model is provided in Fig. 7. The hybrid architecture is beneficial for capturing both local and global dependencies. The transformer's attention block is the efficient self-attention proposed in the original paper. The key distinction between pair-wise attention and efficient self-attention lies in the latter's ability to capture feature maps from all regions, including boundary regions; this is facilitated by employing distinct projections to transform the key and the value into the embedding space. The residual block and the transformer encoder block are shown in parts a) and b) of Fig. 8, respectively. The model was evaluated on a cardiac MRI dataset consisting of 150 annotated images with segmented left ventricle (LV), right ventricle (RV), and left ventricular myocardium (MYO).

Fig. 7
figure 7

Architecture of the UTNet

Fig. 8
figure 8

a residual basic block of the UTNet, b transformer encoder block of the UTNet

Due to efficient self-attention, the model has linear computational complexity. Nevertheless, its outputs can suffer from over-partitioning or under-partitioning when a large number of small objects must be detected.
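
To illustrate why such an attention scheme scales linearly with the number of pixels, the sketch below downsamples the keys and values before computing attention. The reduction factor and the single-head formulation are assumptions for illustration; UTNet's actual block differs in detail (e.g., multi-head formulation and positional encoding).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientSelfAttention(nn.Module):
    """Sketch of efficient self-attention: keys and values are projected to a
    reduced spatial resolution, so the attention matrix is (HW x hw) rather
    than (HW x HW). Reduction factor and single-head form are assumptions."""

    def __init__(self, channels: int, reduce: int = 8):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.reduce = reduce
        self.scale = channels ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)                 # (B, HW, C)
        # Sub-sampling the keys/values is what makes the cost linear in HW
        x_small = F.adaptive_avg_pool2d(x, (h // self.reduce, w // self.reduce))
        k = self.k(x_small).flatten(2)                           # (B, C, hw)
        v = self.v(x_small).flatten(2).transpose(1, 2)           # (B, hw, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)         # (B, HW, hw)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + x                                           # residual connection


x = torch.randn(1, 64, 32, 32)
y = EfficientSelfAttention(64)(x)
```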

2.2.5 EdgeNeXt

The EdgeNeXt model was developed as a lightweight, general-purpose model that can be deployed on edge devices such as cameras, sensors, embedded systems, and personal devices [66]. EdgeNeXt has a hybrid architecture that combines CNN and transformer components. Figure 9 presents the architecture of the model. The model uses a split depth-wise transposed attention (SDTA) encoder, which increases the receptive field and encodes multi-scale features by applying depth-wise convolution together with self-attention across the channel dimensions of channel groups split from the input tensor. Figure 10 shows the convolutional encoder and the SDTA encoder in parts a) and b), respectively. Previous models such as ViT use multi-head self-attention (MHA), which has a high computational cost, whereas EdgeNeXt employs an efficient multi-head self-attention applied across channel dimensions.

Fig. 9
figure 9

Architecture of the EdgeNeXt

Fig. 10
figure 10

a the convolutional encoder, b the SDTA encoder
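
The sketch below illustrates the core idea of attention applied across the channel dimension: the attention matrix is C × C rather than N × N, so the cost grows linearly with the number of spatial positions. The depth-wise convolutions and the channel-group splitting of the full SDTA encoder are omitted, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransposedChannelAttention(nn.Module):
    """Sketch of self-attention applied across channel dimensions: the attention
    matrix is (C x C), so the cost grows linearly with the number of spatial
    positions. Details of the full SDTA encoder are omitted."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C), where N is the number of spatial positions (tokens)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = F.normalize(q.transpose(1, 2), dim=-1)            # (B, C, N)
        k = F.normalize(k.transpose(1, 2), dim=-1)            # (B, C, N)
        attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)   # (B, C, C) channel-to-channel attention
        out = (attn @ v.transpose(1, 2)).transpose(1, 2)      # back to (B, N, C)
        return self.proj(out)


tokens = torch.randn(1, 14 * 14, 96)        # 14 x 14 spatial grid, 96 channels (illustrative)
y = TransposedChannelAttention(96)(tokens)
```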

EdgeNeXt, known for its remarkable computational efficiency and versatility across tasks, particularly excels in real-time predictions owing to its lightweight architecture. However, its original testing on non-medical images necessitates careful evaluation before applying it to medical imagery for an accurate prognosis.

2.3 Data Preprocessing

Prior to feeding the dataset into the models, a data preprocessing stage was conducted [67]. To enhance image quality and optimize the input for the networks, we applied a series of preprocessing steps: windowing, thresholding, and resizing. These steps are fundamental in preparing medical images for analysis: windowing highlights certain ranges of pixel values, enhancing the visibility of particular structures or pathologies; thresholding aids in isolating regions of interest based on pixel intensity; and resizing ensures that the image is compatible with the input requirements of the network. This preparation ultimately leads to more accurate and meaningful analysis, benefiting medical professionals in their diagnostic and treatment decisions. The original images are CT images in DICOM format with a size of 512 × 512 pixels. We resized all images to 224 × 224 pixels, then windowed them to the pixel range [0, 255]. After the windowing substage, thresholding was applied. Figure 11 displays images from the LIDC dataset post-windowing, alongside corresponding masks highlighting segmented lung nodules.

Fig. 11
figure 11

Images from LIDC dataset after windowing and masks (segmented lung nodules)

With the exception of the Connected-UNets model, the compared models originally required only the resizing preprocessing step. In contrast to the original Connected-UNets pipeline, we opted for windowing instead of histogram equalization and for Otsu's thresholding instead of normalization to the [0, 1] range. While histogram equalization enhances global contrast, it may not preserve local contrast as effectively as windowing, which is specifically tailored for targeted enhancements. Moreover, windowing allows the user to adjust the window center and width interactively, optimizing the visualization of specific structures.

In terms of pixel value scaling, Otsu’s thresholding operates based on intensity levels, while normalization to the range [0, 1] standardizes pixel values for easier comparison or image processing. Notably, Otsu’s thresholding is particularly well-suited for image segmentation, as it identifies an optimal threshold, effectively separating the image into distinct classes. This adaptive approach is advantageous in scenarios where automatic and accurate determination of the threshold is essential for reliable segmentation results.
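
A minimal sketch of this preprocessing pipeline (resizing, windowing to [0, 255], and Otsu's thresholding) is shown below. The lung window center and width, as well as the use of pydicom and OpenCV, are assumptions for illustration; the study does not specify the exact window settings.

```python
import numpy as np
import pydicom
import cv2

def preprocess_slice(dicom_path, window_center=-600.0, window_width=1500.0, size=224):
    """Resize -> window to [0, 255] -> Otsu threshold, following the order
    described above. The lung window (center -600 HU, width 1500 HU) is an
    assumed setting; the study does not state the exact window used."""
    ds = pydicom.dcmread(dicom_path)
    img = ds.pixel_array.astype(np.float32)
    # Convert stored values to Hounsfield units (assumes rescale tags are present)
    img = img * float(ds.RescaleSlope) + float(ds.RescaleIntercept)

    # Resize the 512 x 512 slice to the 224 x 224 network input resolution
    img = cv2.resize(img, (size, size), interpolation=cv2.INTER_AREA)

    # Windowing: clip to [center - width/2, center + width/2] and scale to [0, 255]
    lo, hi = window_center - window_width / 2, window_center + window_width / 2
    img = np.clip(img, lo, hi)
    img = ((img - lo) / (hi - lo) * 255.0).astype(np.uint8)

    # Otsu's thresholding picks the intensity threshold that best separates classes
    _, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return img, mask
```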

2.4 Evaluation Metrics

In order to quantitatively assess the segmented results from all five models of the experiment, we used the Dice similarity coefficient (DSC) and Hausdorff distance (HD). The DSC metric is defined as follows:

$$DSC=\frac{2|X \cap Y|}{|X|+|Y|}$$
(1)

where X and Y denote the predicted and ground-truth segmentations, treated as sets of pixels, and |X| and |Y| represent the number of elements in each set.

HD measures the largest distance from a point in one set to the closest point in the other set, capturing the worst-case boundary mismatch between two segmentations. The directed HD from A to B is defined as follows:

$$\delta_{H}(A, B) = \max_{a \in A}\, \min_{b \in B} \left\| a - b \right\|$$
(2)

where A = {\({a}_{1},{a}_{2},...,{a}_{n}\)} and B = {\({b}_{1},{b}_{2},...,{b}_{n}\)} are two point sets in \({E}^{2}\), and \(\left\| \cdot \right\|\) denotes the Euclidean distance.
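
For binary segmentation masks, both metrics can be computed directly from the foreground pixel sets. The sketch below is one possible implementation; whether the reported HD values use the directed form of Eq. (2) or the symmetric maximum over both directions is not stated, so both options are provided.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred, target, eps=1e-7):
    """DSC = 2|X ∩ Y| / (|X| + |Y|) for binary masks (Eq. 1)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum() + eps)

def hausdorff_distance(pred, target, symmetric=True):
    """Directed HD from A to B as in Eq. (2); the symmetric variant takes the
    maximum over both directions."""
    a = np.argwhere(pred > 0)    # point set A: predicted foreground pixel coordinates
    b = np.argwhere(target > 0)  # point set B: ground-truth foreground pixel coordinates
    d_ab = directed_hausdorff(a, b)[0]
    return max(d_ab, directed_hausdorff(b, a)[0]) if symmetric else d_ab


# Toy example with two overlapping square masks
pred = np.zeros((64, 64), dtype=np.uint8); pred[20:40, 20:40] = 1
gt   = np.zeros((64, 64), dtype=np.uint8); gt[22:42, 22:42] = 1
print(dice_coefficient(pred, gt), hausdorff_distance(pred, gt))
```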

3 Results

3.1 Performance Comparison

The performance comparison of the five models is presented in this section. The models were trained on an NVIDIA GeForce RTX 3090 GPU with a batch size of 16, using the PyTorch [68] and TensorFlow [46] frameworks. We modified the EdgeNeXt model, originally designed for multi-class classification, to suit our task by converting it into a binary classification model. Binary cross-entropy was used as the loss function for all models, and the adaptive moment estimation (Adam) optimizer [69] was applied for model optimization.

We changed two crucial hyperparameters, the learning rate and the batch size, from the values used in the original models. The learning rate, which dictates the magnitude of the steps taken during optimization, plays a significant role in shaping the training dynamics of the model. The original learning rates and batch sizes of the models are listed in Table 2.

Table 2 Models implementation details

In our experimental setup, we set the learning rate and batch size to 0.001 and 16, respectively. A larger learning rate is beneficial for navigating smooth and flat optimization landscapes, helping the model escape flat regions and converge to the optimal solution more rapidly, particularly when dealing with large datasets. It is worth noting that, when comparing dataset sizes across models (excluding EdgeNeXt, as outlined in Table 2), LIDC-IDRI stands out with 244,527 images. In the case of EdgeNeXt, our study indicated that such a high learning rate or batch size was not necessary.

In contrast to the Salient Attention UNet and DDANet, we employed a larger batch size, which offers more stable gradient estimates and can result in smoother convergence; this reduces noise in the updates, potentially enhancing the stability of the optimization process. Compared with the original EdgeNeXt configuration, on the other hand, we used a smaller batch size, which produces noisier gradient estimates per update but demands less memory.

The evaluated models' performance is listed in Table 3. Training was stopped when the validation loss did not improve for 10 epochs. Figure 12 illustrates the progression of the loss and DSC score across epochs for the models.
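
A minimal sketch of this shared training configuration (Adam with a learning rate of 0.001, binary cross-entropy, and early stopping on the validation loss) is given below. The model, the data loaders (assumed to yield image-mask batches of size 16 with the models outputting raw logits), the maximum number of epochs, and the checkpointing behavior are assumptions for illustration.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, max_epochs=200, patience=10,
          lr=1e-3, device="cuda"):
    """Sketch of the shared training configuration: Adam (lr = 0.001),
    binary cross-entropy loss, and early stopping once the validation loss
    has not improved for `patience` epochs."""
    model = model.to(device)
    criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on raw logits
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, stale_epochs = float("inf"), 0

    for epoch in range(max_epochs):
        model.train()
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()

        # Validation pass drives early stopping
        model.eval()
        val_loss, n_batches = 0.0, 0
        with torch.no_grad():
            for images, masks in val_loader:
                images, masks = images.to(device), masks.to(device)
                val_loss += criterion(model(images), masks).item()
                n_batches += 1
        val_loss /= max(n_batches, 1)

        if val_loss < best_val:
            best_val, stale_epochs = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # validation loss has not improved for `patience` epochs
```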

Fig. 12
figure 12

Loss and dice score through epochs of the models

Table 4 presents the evaluated model complexity, described by two values: the number of trainable parameters and the model depth, which equals the number of layers.

Table 3 Models performance evaluation

4 Discussion and Conclusions

In this paper, a comparative study of deep learning models for lung nodule segmentation in computed tomography images is presented. Preprocessing was carried out to obtain better performance by enhancing image quality, adapting the input size to the networks, and emphasizing specific areas of interest within the images. We used two metrics, the Dice similarity coefficient (DSC) and the Hausdorff distance (HD), for model evaluation. Additionally, Table 4 provides the evaluated model complexity in terms of the number of trainable parameters and the model depth. Our experiments show that Connected-UNets reached the best results, with a 97.80% Dice score and an HD of 1.29 (Table 3), owing to the connection between the two UNets, which helps preserve fine-grained features from the layers of the first decoder. The difference between using a single UNet and Connected-UNets is detailed in Fig. 13.

Fig. 13
figure 13

The difference between the basic UNet and the connection of two basic UNets in connected-UNets a basic UNet decoder b modified skip-connection part of the connected-UNets

However, Connected-UNets is the most complex model, with the highest number of trainable parameters and layers among the compared models (Table 4). Consequently, it has a high computational cost. The large number of trainable parameters can also lead to overfitting, which occurs when a model acquires an excessively detailed understanding of the training data, capturing noise or random fluctuations rather than the underlying patterns.

The second-best performance was reached by the Salient Attention UNet. This is due to its use of saliency maps and attention blocks, which highlight the most visually important regions in an image. The internal structure of the salient attention block is also important: the cascade of convolutional layers Conv 3 × 3, Conv 3 × 3, and Conv 1 × 1 (Fig. 3) captures complex patterns in a non-linear way while requiring fewer parameters than a single larger convolution such as 7 × 7. Additionally, the use of saliency maps addresses the lack of interpretability in deep learning models by highlighting the regions of the input that are deemed most relevant for a given prediction.

The UTNet achieved the third-best performance by adopting an efficient self-attention mechanism, a lightweight version of standard multi-head attention. Similarly to Connected-UNets, UTNet also attempts to preserve fine-grained details from the decoder layers by applying the Transformer mechanism on top of the skip connections. However, it does not reach the results of Connected-UNets (Table 3).

Table 4 Models complexity evaluation

Overall, the results of this comparative study reveal a dichotomy in the field of medical image processing. A summary of the advantages and limitations of the five evaluated models can be found in Table 5 in the Appendix. On one hand, there are sophisticated, computationally expensive models; on the other hand, there are lighter models that, unfortunately, do not achieve comparable accuracy. This underscores the need for continued research and development in medical image processing to strike a balance between computational efficiency and model accuracy. Further exploration and innovation are needed to bridge the existing gaps and advance the capabilities of models in this critical domain.