1 Introduction

In current clinical practice, the accuracy of detecting and diagnosing many diseases from medical images depends heavily on the expertise of individual clinicians (e.g., pathologists and radiologists), which introduces substantial inter-reader variability. Computer-aided diagnosis schemes aim to assist clinicians in reading medical images more time-efficiently and in making diagnostic decisions more accurately and objectively. The scientific rationale for this approach is computer-aided quantitative image feature analysis. According to this line of thinking, such an approach can help overcome several limiting factors in clinical practice, such as the wide range of expertise among clinicians, the fatigue of human experts, and the shortage of adequate medical resources [1].

A typical development pipeline for a traditional computer-aided diagnosis (CAD) system comprises three stages: target segmentation, feature computation, and disease classification. For instance, [2] devised a CAD scheme for mass classification on digital mammograms. The process began by using a modified active contour method to separate the regions of interest (ROIs) containing the target masses from the background. Next, a wide range of image features was computed to quantitatively characterize the lesion, covering its size, shape, margin structure, texture, and other pertinent factors; this analysis transformed the original pixel data into a feature vector. Finally, a classification model based on linear discriminant analysis (LDA) evaluated the feature vector to determine whether the mass was malignant.
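To make this pipeline concrete, the following is a minimal sketch (not the method of [2]) of the feature-then-classify stage using scikit-learn's LDA on a synthetic feature matrix; the feature dimensions, sample counts, and labels are placeholders.

```python
# Minimal sketch of the classical CAD pipeline: hand-crafted features extracted
# from segmented ROIs are classified with linear discriminant analysis (LDA).
# The feature matrix and labels below are synthetic placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))      # 200 ROIs, 12 hand-crafted features (size, shape, texture, ...)
y = rng.integers(0, 2, size=200)    # 0 = benign, 1 = malignant (synthetic labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]
print("AUC on synthetic data:", roc_auc_score(y_test, scores))
```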

The first few layers of such a model can capture basic information about the lesion, such as its shape, location, and orientation. Subsequent layers can recognize and retain the characteristics that are consistently associated with malignancy (for example, shape and edge irregularity) while discarding variations that are unrelated to it (e.g., location). Higher layers further process and compose these relevant characteristics into increasingly abstract representations. As the number of layers increases, richer feature representations can be formed. Throughout this process, a generic neural-network-based model discovers important characteristics hidden inside the raw image in a self-taught manner; hence, manual feature engineering is not required.

Deep learning applications in medical image processing have previously been covered in many outstanding review publications. Several early deep learning strategies were evaluated by [3, 4], both of which focused on supervised methods. [12] recently surveyed the use of generative adversarial networks in medical imaging tasks, with promising results. [5] reviewed how semi-supervised and multiple-instance learning techniques can be employed for diagnostic or segmentation tasks. In the context of medical image segmentation, [6,7,8] studied a number of approaches for dealing with dataset restrictions (e.g., limited or noisy annotations). This survey, in contrast, aims to shed light on how recent advances in deep learning can improve medical image analysis, a field that is typically slowed by a lack of annotated data.

Our study differs from prior review papers in its comprehensiveness and technical orientation. First, while emphasizing promising “not-so-supervised” techniques such as self-supervised, unsupervised, and semi-supervised learning, we do not discount the value of more traditional supervised methods. Second, rather than focusing on a single task, we demonstrate how these learning methodologies can be applied to four common medical image analysis scenarios (classification, segmentation, detection, and registration). Deep learning-based object detection receives particular attention because it has been covered only sparsely in previous reviews (after 2019). We concentrate on chest X-rays, mammograms, CT, and MRI; physicians in the same department (radiology) routinely evaluate this wide variety of images, which share many qualities. Some of these techniques, developed for radiographic or MRI images, may also benefit other image domains (e.g., histology). Third, the most up-to-date models and architectures for accomplishing these goals are described. Although it is rapidly emerging as a promising area, self-supervised learning in the context of medical vision is still being rigorously studied. This survey should be useful to a broad range of readers, including medical researchers and researchers with expertise in deep learning, artificial intelligence, and big data.

The rest of the paper is organized as follows: deep learning advances in unsupervised and semi-supervised techniques are examined in Sect. 2, where attention mechanisms, domain knowledge, and uncertainty estimation are also presented as three significant strategies for performance improvement. Classification, segmentation, detection, and registration are covered in detail in Sect. 3. Section 4 analyses hurdles for further model improvement and offers some views on future research directions.

2 Deep Learning Models: An Overview

Deep learning comprises a variety of learning paradigms, such as supervised, semi-supervised, and unsupervised learning, depending on the availability of labeled data in the training dataset. In supervised learning tasks such as image classification, images are paired with their labels to train the model, which refines its parameters accordingly; at test time, the model assigns probability scores to each image based on the correlations it has learned. Unsupervised learning, on the other hand, discovers hidden patterns or structures in unlabeled data: the model derives correlations from the input data alone, and the unlabeled data frequently contains rich information. This study provides an overview of recent advances in these fields, with the goal of addressing the issues caused by the scarcity of annotated data in medical imaging tasks. The presentation moves through common frameworks within each learning paradigm, emphasizing their potential contributions to medical imaging, as seen in Fig. 1.

Fig. 1 Deep learning applications

2.1 Supervised Learning

Convolutional neural networks (CNNs) are the most commonly used deep learning architecture for medical image processing [9]. Convolutional and pooling layers make up the bulk of a CNN. Consider a basic CNN performing a classification task on a medical image: convolutional, pooling, and fully connected layers turn the image into class probabilities, which are then output. A pooling layer is placed after a convolutional layer to reduce the size of the feature maps and, as a result, the number of parameters; average pooling and max pooling are the two typical pooling operations. The same procedure is repeated for the remaining layers.
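As an illustration of this structure, the sketch below shows a small CNN with convolutional, pooling, and fully connected layers that outputs class probabilities; the input size, channel counts, and two-class output are illustrative assumptions rather than any specific published architecture.

```python
# A minimal CNN of the kind sketched above: convolution and pooling layers
# followed by a fully connected layer that outputs class probabilities.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # max pooling halves the feature-map size
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),                      # average pooling is the other common choice
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return torch.softmax(self.classifier(x), dim=1)   # class probabilities

probs = SmallCNN()(torch.randn(4, 1, 64, 64))   # batch of 4 single-channel 64x64 images
print(probs.shape)                              # torch.Size([4, 2])
```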

2.2 Unsupervised Learning

2.2.1 Auto Encoders

Auto-encoders are commonly used for dimensionality reduction and feature learning [9, 10]. Simple auto-encoders have limited representational power because of their shallow structure [11, 12], whereas deeper auto-encoders with additional hidden layers can increase representational strength. To learn more complex non-linear patterns, stacked auto-encoders (SAEs) stack multiple auto-encoders and optimize them greedily layer by layer; as a result, SAEs generalize better beyond the training data than shallow auto-encoders [13, 14]. A typical SAE has an encoder and a decoder network that are usually symmetric to each other. The learned representations can be further improved by adding regularization terms, such as the sparsity constraints of sparse auto-encoders [15], to the original reconstruction loss. Auto-encoders designed to be insensitive to input perturbations include the denoising auto-encoder and the contractive auto-encoder.
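The sketch below illustrates the denoising auto-encoder idea mentioned above with a small encoder and decoder; the layer sizes, noise level, and input dimensionality are illustrative assumptions.

```python
# A small denoising auto-encoder: the encoder compresses a corrupted input and
# the decoder reconstructs the clean version; the reconstruction loss is
# computed against the uncorrupted input.
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, dim_in: int = 784, dim_hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 256), nn.ReLU(),
                                     nn.Linear(256, dim_hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(dim_hidden, 256), nn.ReLU(),
                                     nn.Linear(256, dim_in), nn.Sigmoid())

    def forward(self, x):
        noisy = x + 0.1 * torch.randn_like(x)      # corrupt the input
        return self.decoder(self.encoder(noisy))   # reconstruct the clean image

model = DenoisingAE()
x = torch.rand(8, 784)                             # e.g. flattened 28x28 patches
loss = nn.functional.mse_loss(model(x), x)         # reconstruction loss against the clean input
loss.backward()
```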

2.2.2 GAN (Generative Adversarial Networks)

An unconditional generative model offers no direct control over the modes of data being synthesized. Conditioning the generator and discriminator on extra information (e.g., class labels) yields the conditional GAN (cGAN), which steers the data generation process. More specifically, a noise vector z and a class label c are both passed to G at the same time; likewise, the real/fake data and the class label c are passed to D as inputs. Class labels are not the only kind of conditional information; images and other attributes are also acceptable alternatives. The auxiliary classifier GAN (ACGAN) is another technique that enhances image synthesis through label conditioning. Unlike the discriminator in cGAN, D in ACGAN is no longer given access to the class conditional information; in addition to determining which images are real and which are fake, D is responsible for recovering the class labels. Forced to carry out this extra classification task, ACGAN can readily create images of high quality. The expected loss over the model class probabilities is given in Eq. (1),

$${\mathcal{L}}_{exp}\left(x\right)=\sum_{i=1}^{n}{p}_{i}\,\mathcal{L}\left(x,{y}_{i};\theta \right)$$
(1)

where \({y}_{i}\) is the class label. The entropy loss over the probabilities of all classes is given in Eq. (2),

$${\mathcal{L}}_{ent}\left(x\right)=-\sum_{i=1}^{n}{p}_{i}\,\mathrm{log}\,{p}_{i}$$
(2)

Generative adversarial networks (GANs) are a subcategory of deep neural networks first proposed for generative modelling by [16, 17]. The architecture provides a built-in framework for estimating generative models that can draw samples directly from the underlying data distribution, which yields more accurate results and removes the need to explicitly specify a probability distribution. It consists of two separate models: a generator G and a discriminator D. Through the adversarial process, G progressively approximates the underlying data distribution and produces realistic samples. The Fisher information used for computing gradients with respect to the model parameters is given in Eq. (3),

$${I}_{Fisher}\left(x;\theta \right)={\mathbb{E}}_{y}\left[{\nabla }_{\theta }^{2}\mathcal{L}\left(x,y;\theta \right)\right]$$
(3)
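To make the conditioning idea above concrete, the following sketch shows a generator and discriminator that both receive a class-label embedding, in the spirit of the cGAN described earlier; the network sizes and embedding scheme are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch of class-conditional GAN components: both the generator and the
# discriminator receive the class label (as an embedding) alongside the noise
# vector / image. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

NOISE_DIM, NUM_CLASSES, IMG_DIM = 100, 5, 784

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(nn.Linear(NOISE_DIM + NUM_CLASSES, 256), nn.ReLU(),
                                 nn.Linear(256, IMG_DIM), nn.Tanh())

    def forward(self, z, c):
        return self.net(torch.cat([z, self.embed(c)], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(nn.Linear(IMG_DIM + NUM_CLASSES, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1))            # real/fake logit

    def forward(self, x, c):
        return self.net(torch.cat([x, self.embed(c)], dim=1))

z = torch.randn(16, NOISE_DIM)
c = torch.randint(0, NUM_CLASSES, (16,))
fake = Generator()(z, c)                 # label-conditioned synthetic images
score = Discriminator()(fake, c)         # label-conditioned real/fake score
```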

2.3 Semi-Supervised Learning

Semi-supervised learning (SSL) exploits both labelled and unlabelled data, in contrast to unsupervised learning, which can only work on unlabelled data. In particular, SSL applies to the situation in which there is a small amount of labelled data and a large amount of unlabelled data. These two kinds of data need to be related for the extra information conveyed by the unlabelled data to usefully complement the labelled data. When completing tasks with a restricted amount of labelled data, it is reasonable to expect that adding unlabelled data will improve overall performance, and the more unlabelled data the better. In fact, this objective has been investigated for several decades, and the 1990s saw growing interest in applying SSL approaches to text categorization. The book “Semi-Supervised Learning” by Chapelle et al. (2009) is an excellent resource for readers interested in the link between SSL and traditional machine learning techniques (Figs. 2, 3).

Fig. 2 SimCLR representation

Fig. 3 Mean teacher model application in medical image analysis

Interestingly, the authors present empirical evidence that, although unlabelled data can be beneficial, it can occasionally lead to a decline in performance. The current deep learning literature, however, suggests that this finding is shifting, with more and more work, particularly in computer vision, reporting consistent gains. In addition, deep semi-supervised learning has been effectively applied to medical image analysis to lower annotation cost and improve performance. We classify common SSL techniques into three types: (1) those based on consistency regularization, (2) those based on pseudo labelling, and (3) those based on generative models (Table 1).

Table 1 Summarizing the various classification methods

In pseudo labelling, pseudo annotations for unlabelled instances are created by the SSL model itself, and the resulting examples are used together with truly labelled data to train the model. Many cycles of this procedure are run to improve the quality of the pseudo labels and the overall performance of the model. Naive pseudo-labelling and mixup augmentation [18, 19] can be combined to boost the performance of the SSL model [20]. Multi-view co-training with pseudo labelling is also effective [21, 22]. The mathematical formulation of the consistency cost is given in Eq. (4).

$$J\left(\theta \right)={\mathbb{E}}_{x,\eta ,{\eta }^{\prime}}\left[\left\Vert f\left(x,\theta ,\eta \right)-f\left(x,{\theta }^{\prime},{\eta }^{\prime}\right)\right\Vert \right]$$
(4)

The teacher model retains its previous weights with proportion alpha (α) and takes a (1 − α) portion of the student weights, as in Eq. (5),

$${\theta }_{t}^{\prime}=\alpha {\theta }_{t-1}^{\prime}+\left(1-\alpha \right){\theta }_{t}$$
(5)
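The sketch below illustrates Eqs. (4) and (5) for a toy model: a consistency cost between student and teacher predictions under different input noise, followed by the exponential moving average update of the teacher weights; the model, noise level, and α are illustrative assumptions.

```python
# Sketch of the mean-teacher update: the consistency cost (Eq. 4) compares
# student and teacher predictions under different noise, and the teacher weights
# are an exponential moving average of the student weights (Eq. 5).
import torch
import torch.nn as nn

student = nn.Linear(32, 4)
teacher = nn.Linear(32, 4)
teacher.load_state_dict(student.state_dict())
alpha = 0.99                                  # EMA decay (illustrative)

x = torch.randn(8, 32)                        # a batch of (possibly unlabelled) inputs
noise_s, noise_t = 0.1 * torch.randn_like(x), 0.1 * torch.randn_like(x)
consistency = nn.functional.mse_loss(student(x + noise_s),
                                     teacher(x + noise_t).detach())   # Eq. (4)
consistency.backward()

with torch.no_grad():                         # Eq. (5): EMA update of the teacher
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1 - alpha)
```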

2.4 Strategies for Performance Enhancement

2.4.1 Domain Knowledge

Most well-established deep learning models were built to analyze natural images, so applying them directly to medical imaging problems is likely to yield inferior results [23, 24]. This is due to fundamental differences between natural scenes and medical imagery. First, medical images often display considerable inter-class similarity, making it difficult to extract the fine-grained visual characteristics necessary for recognizing subtle distinctions and producing accurate predictions. Second, natural benchmark datasets often comprise tens of thousands to millions of images, while medical image databases are typically considerably smaller, which prevents very complex computer vision models from being used in the medical field. The question of how to tailor models for medical image analysis therefore persists. Learning useful representations can be aided by incorporating appropriate domain knowledge or task-specific features.

2.4.2 Uncertainty Estimation

In highly regulated clinical environments (e.g., cancer diagnosis), trustworthiness is of paramount importance. There are several sources of inaccuracy in model predictions, such as noisy data and inference mistakes; it is therefore preferable to measure uncertainty and ensure that the findings can be trusted [25,26,27]. Bayesian methods approximate the posterior distribution over the parameters of neural networks, while ensemble methods quantify uncertainty by combining different models. Readers who want to learn more about uncertainty estimation are referred to [28,29,30].
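As one concrete example (Monte Carlo dropout is only one of several possible approaches), the sketch below keeps dropout active at test time and reads the spread of repeated predictions as an uncertainty estimate; the model and number of forward passes are illustrative assumptions.

```python
# Sketch of uncertainty estimation via Monte Carlo dropout: dropout stays active
# at inference time and the spread of repeated predictions is used as an
# uncertainty proxy for each test input.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.5),
                      nn.Linear(64, 2))
model.train()                        # keep dropout active during inference

x = torch.randn(1, 16)
with torch.no_grad():
    probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(50)])

mean_pred = probs.mean(dim=0)        # averaged prediction over 50 stochastic passes
uncertainty = probs.std(dim=0)       # predictive spread as an uncertainty proxy
print(mean_pred, uncertainty)
```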

2.4.3 Attention Mechanisms

The visual processing system of primates, which gives rise to attention, selects a subset of relevant sensory information for complex scene interpretation instead of using all available information [31,32,33]. The idea of focusing on particular parts of the input has inspired deep learning researchers to include attention in cutting-edge models across a variety of domains. Attention mechanisms can be broadly classified as soft or hard according to how the attended locations in an image are selected: the former deterministically learns a weighted average of features over all locations, whereas the latter stochastically selects a subset of feature locations to focus on [34]. Since hard attention is not differentiable, researchers have focused on soft attention even though it is more computationally costly.

3 Deep Learning Applications

3.1 Classification

Computer-aided diagnosis (CADx) seeks to classify medical images so that malignant lesions can be distinguished from benign ones, or so that specific diseases can be identified from input images [35,36,37]. Over the last decade, deep learning-based CADx methods have shown tremendous success. However, effective performance from deep neural networks often requires a large number of labelled images, a condition that is difficult to meet for many medical imaging datasets. Many methods have been explored to compensate for the dearth of large annotated datasets, with transfer learning emerging as the clear frontrunner. While transfer learning has been shown to improve performance with little annotated data, alternative learning paradigms such as unsupervised image synthesis, self-supervised learning, and semi-supervised learning have also shown promising results. In the following sections, we introduce the use of these learning paradigms in the classification of medical images.

3.1.1 Supervised Classification

In one representative approach, attention heat maps computed on the global X-ray image were used to mask large, uninformative areas and emphasize small, key regions that hold diagnostic clues for thorax diseases. The proposed model successfully combined global and local information, leading to improved classification accuracy. Input images contain salient elements beneficial to the target task, and each attention module is trained to focus on a specific subset of these local structures.

However, the effectiveness of deep learning models is very sensitive to the size and quality of the training datasets and image annotations. In many medical image analysis tasks, particularly 3D ones, it can be difficult to build a sufficiently large and high-quality training dataset because of challenges in data gathering and annotation [38,39,40]. Pre-trained convolutional neural networks (CNNs) with sufficient fine-tuning have been shown to outperform CNNs trained from scratch [41,42,43].
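The sketch below illustrates this fine-tuning recipe: an ImageNet-pre-trained ResNet-18 backbone is loaded from torchvision, its classification head is replaced, and only the new head is trained; freezing the backbone and the two-class head are illustrative choices rather than the exact setup of [41,42,43].

```python
# Sketch of transfer learning: start from an ImageNet-pre-trained CNN, swap its
# classification head for the target classes, and fine-tune on a small dataset.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False                           # freeze pre-trained features
backbone.fc = nn.Linear(backbone.fc.in_features, 2)   # new head: benign vs malignant

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images, labels = torch.randn(4, 3, 224, 224), torch.tensor([0, 1, 1, 0])
loss = criterion(backbone(images), labels)            # train only the new head
loss.backward()
optimizer.step()
```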

3.1.2 Unsupervised Methods

3.1.2.1 Unsupervised Image Synthesis

Traditional data augmentation (e.g., rotation, scaling, flipping, translation) is simple but effective and can provide more training examples for improved performance; however, it adds little to the information contained in the existing training examples. In the medical field, GANs have been employed as a more involved data augmentation strategy because of their ability to learn the underlying data distribution and generate realistic images. To enhance liver lesion classification on a small dataset, [41] used a DCGAN to synthesize high-quality instances. The dataset included only 182 lesions, comprising cysts, metastases, and hemangiomas. The authors used standard data augmentation techniques (e.g., rotation, flip, translation, and scaling) to generate over 90,000 samples, which were necessary for training the GAN. The GAN-based synthetic data augmentation greatly enhanced classification performance, with sensitivity and specificity rising from 78.6 and 88.4 percent to 85.7 and 92.4 percent, respectively. More recently, the authors extended lesion synthesis from an unconditional setting (DCGAN) to a conditional one (ACGAN) [42]. Besides synthesizing new instances, ACGAN's discriminator also predicted lesion classes based on the auxiliary information (lesion classes). However, ACGAN-based synthetic augmentation yielded weaker classification performance than the unconditional baseline.

3.1.2.2 Self-Supervised Learning Based Classification

This approach works well when many medical images are available but only a fraction of them have been labelled. The model is optimized in two phases: unsupervised self-training followed by supervised fine-tuning. In the first step, the model is pre-trained with unlabelled images in order to learn features that are representative of the image semantics. Supervised fine-tuning is then applied to the self-trained model to improve performance on the downstream classification task. In practice, self-supervision may be formulated through pretext tasks or contrastive learning [44,45,46,47,48,49,50].

3.1.2.3 Self-Supervised Pretext Tasks

Pretext tasks such as rotation prediction [51, 52] and Rubik's cube recovery [53,54,55] are used in self-supervised pretext-task-based classification. One such novel pretext task has two phases: image corruption (by disorganizing patches) and image restoration. Classification accuracy for medical images was enhanced by this context restoration pre-training technique; after the pre-training period, the models were trained with labelled examples. In contrastive learning, feature representations are learned by maximizing agreement between positive image pairs, which may be two augmented instances of the same image or different images from the same patient. The pre-trained models were then fine-tuned with a small number of labelled dermatology and chest X-ray images. These models outperformed their ImageNet-pre-trained counterparts by 1.1% in mean area under the curve (AUC) for chest X-ray classification and by 6.7% in top-1 accuracy for the dermatology task.
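The following sketch shows a SimCLR-style contrastive (NT-Xent) loss of the kind referred to above and in Fig. 2, computed on the projected embeddings of two augmented views of the same batch; the embedding size, batch size, and temperature are illustrative assumptions.

```python
# Sketch of the NT-Xent contrastive objective: two augmented views of the same
# image form a positive pair, and their agreement is maximized against all other
# samples in the batch.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature: float = 0.5):
    """NT-Xent loss for a batch of positive pairs (z1[i], z2[i])."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N projected embeddings
    sim = z @ z.t() / temperature                         # pairwise cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim.masked_fill_(mask, float('-inf'))                 # ignore self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of each positive
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)       # embeddings of two augmented views
print(nt_xent(z1, z2))
```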

3.1.3 Semi-Supervised Learning

Semi-supervised learning, in contrast to self-supervised methods, integrates unlabelled data with labelled data through diverse strategies to train models for improved performance. Because labelled data was scarce, [57, 58] used a semi-supervisedly trained GAN [59, 60] to classify cardiac disease in chest X-rays. This semi-supervised GAN, in contrast to the vanilla GAN [61, 62], was trained on both unlabelled and labelled data: in addition to predicting whether an image is fake, its discriminator also classifies input images as normal or abnormal. The semi-supervised GAN-based classifier outperformed the supervised CNN as the number of labelled instances grew.

3.2 Segmentation

Medical image analysis requires segmenting lesions, organs, and other substructures from the background, and segmentation requires more supervision than classification or detection. Recent research has focused on using Transformers for supervised medical image segmentation, so we place them in the supervised segmentation subsection; this categorization does not preclude Transformer-based designs in semi-supervised or unsupervised settings or in other medical imaging applications.

3.2.1 Supervised Learning Based Segmenting Models

3.2.1.1 U-Net and its Variants

In a convolutional network, lower-level fine-grained features carry the information needed for exact localization (i.e., labelling each pixel) in image segmentation, whereas higher-level coarse-grained features capture semantics relevant for overall image classification. By employing skip connections, the network's output can match its input's spatial resolution. U-Net takes two-dimensional images and creates a segmentation map assigning each pixel a category. [63] examined the underlying structures to understand how long and short skip connections affect image segmentation and found that short skip connections are required for training deep segmentation networks. Simple skip connections between the U-Net encoder and decoder subnetworks fuse semantically disparate feature maps, so the authors of U-Net++ advised bridging the semantic gap before fusing feature maps: in U-Net++, nested and dense skip connections replace the basic ones. In four medical image segmentation tasks, the proposed design beat U-Net and wide U-Net [66,67,68,69,70,71,72,73,74,75].
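The sketch below shows a deliberately tiny U-Net-style network to illustrate the skip connection that fuses high-resolution encoder features with upsampled decoder features; the depth, channel widths, and input size are illustrative assumptions and far smaller than the published architectures.

```python
# Minimal U-Net-style encoder-decoder: high-resolution encoder features are
# concatenated with upsampled decoder features before the per-pixel prediction.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.enc = conv_block(1, 16)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = conv_block(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec = conv_block(32, 16)                # 32 = 16 skip + 16 upsampled channels
        self.head = nn.Conv2d(16, num_classes, 1)    # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                              # high-resolution features
        b = self.bottleneck(self.down(e))            # coarse semantic features
        d = self.dec(torch.cat([e, self.up(b)], dim=1))   # skip connection by concatenation
        return self.head(d)

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)   # torch.Size([1, 2, 64, 64])
```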

Attention gates (AGs) were proposed by [64] for inclusion in the U-Net design to suppress irrelevant information and highlight the most important salient features conveyed through the skip connections; Attention U-Net often outperformed U-Net on CT pancreas segmentation. To assess uncertainty in segmenting magnetic resonance (MR) images of the prostate and computed tomography (CT) images of the chest, [65] developed a hierarchical probabilistic model. The authors employed variational auto-encoders and a number of latent variables to model segmentation variations at different resolutions and to infer the uncertainties or ambiguities in the experts' annotations.

3.2.1.2 Transformers for Segmentation

Transformers are a class of encoder–decoder network designs originally developed for sequence-to-sequence processing in natural language processing (NLP). Multi-head self-attention (MSA) is a crucial sub-module that employs several parallel self-attention layers to produce numerous attention vectors for each input at once. In contrast to U-Net and its derivatives, which rely on convolutional neural networks, Transformers use self-attention, which has the benefit of learning complex, long-range relationships from input images. Both a hybrid approach and a Transformer-only approach exist for applying Transformers to medical image segmentation [74]: the pure Transformer approach does not use convolutional neural networks, whereas the CNN-based hybrid method does.
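The sketch below applies PyTorch's built-in multi-head attention to a sequence of embedded image patches to illustrate the MSA sub-module; the patch count, embedding size, and number of heads are illustrative assumptions.

```python
# Sketch of multi-head self-attention (MSA) over a sequence of embedded image
# patches, using PyTorch's built-in implementation with Q = K = V.
import torch
import torch.nn as nn

embed_dim, num_heads, num_patches = 256, 8, 196        # e.g. 14x14 patches
msa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

patches = torch.randn(2, num_patches, embed_dim)        # batch of embedded patch sequences
attended, attn_weights = msa(patches, patches, patches)  # self-attention: Q = K = V
print(attended.shape, attn_weights.shape)                # (2, 196, 256) and (2, 196, 196)
```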

As noted earlier, U-Net and its convolution-based derivatives have produced effective results. Skip connections allow the decoder to take advantage of the encoder's low-level, high-resolution CNN features, which provide accurate localization information. However, these models are often poor at modelling long-range relations because of the locality inherent to convolutions. Although Transformers based on self-attention are adept at capturing long-range dependencies, the authors found that this technique on its own could not deliver desirable outcomes [76,77,78,79], because it focuses exclusively on the global context while ignoring the finer, local details. The authors therefore suggest integrating global context from the Transformer with fine-grained spatial information from CNN features. TransUNet employs a skip-connected encoder–decoder architecture, as illustrated in Fig. 4.

Fig. 4 Trans UNet architecture

The embedded image patch sequence is fed to the multi-layer Transformer branch, which extracts global context. The output of the last layer is reshaped into 2D feature maps, which are upsampled to higher resolutions at three distinct scales in order to recover finer local information. In parallel, the CNN branch employs three ResNet-based blocks to extract features at three distinct spatial scales, from local to global. A separate module selectively fuses features from both branches at the same resolution scale, so the combined features capture both local and global context [80, 81]. The final segmentation mask is created from the multi-level fused features. TransFuse successfully segmented prostate MRI scans [82,83,84,85,86,87,88,89,90].

3.2.1.3 Self-Supervised Pretext Tasks

When only a limited number of annotated examples are available, self-supervision through pretext tasks and contrastive learning is often used to pre-train the model so that downstream tasks (such as medical image segmentation) can be tackled more precisely and effectively. The pretext tasks may be designed around the application scenario or selected from those already employed in computer vision. Following the former route, [66] developed a pretext task for segmenting cardiac MR images based on predicting anatomical positions; the features learned autonomously in this pretext task were then used for accurate ventricle segmentation. Even when only a small number of annotations were given, the suggested technique still outperformed the traditional U-Net trained from scratch in segmentation accuracy. An uncertainty-aware teacher model can generate more trustworthy guidance for the student model, and the student model in turn can enhance the teacher model; this strategy may also be used to improve the mean-teacher model, with the teacher model adaptively updated in the same way. The suggested technique outperformed state-of-the-art methods in segmenting pneumonia lesions and showed great resilience to label noise [91, 92].

3.3 Detection

Here, we first look back at a number of recent milestone detection frameworks, including one-stage and two-stage methods. We present these frameworks within the supervised and semi-supervised learning paradigms, since these are the contexts in which they are most often used. We then discuss the use of these frameworks for locating both common and rare lesions. Finally, we present GAN- and VAE-based unsupervised lesion detection.

3.3.1 Supervised Lesion Detection

These frameworks need to be adapted in various ways, such as by integrating domain-specific features, estimating uncertainty, or adopting a semi-supervised learning technique, in order to achieve high detection performance in the medical domain [97,98,99,100,101,102,103].

To propose candidate nodule locations from 2D axial slices, an FPN was applied to the deconvolution feature map. The authors recommended letting the classification network take in the full context of the nodule candidates to bring down the false-positive rate, and they opted for a 3D CNN over a 2D one so that more distinctive characteristics could be captured for nodule detection by exploiting the 3D context of candidate locations.

For training 3D CNN-based models, [70] created a large dataset (PN9) containing more than 40,000 annotated lung nodules. By exploiting correlations between successive CT slices, the authors enhanced the model's capacity to identify both large and tiny lung nodules; long-range relationships between locations and channels in the feature map were captured using a non-local-operation-based module applied to a slice group.

In recent years, semi-supervised techniques have been applied to medical object detection to boost accuracy. Before any lesion annotations were added to the CT scans, an FPN was applied to the images to create pseudo labels for the objects they contained; mixup augmentation was then used to combine the pseudo-labelled instances with those that had ground-truth annotations. Notably, the authors took the mixup augmentation approach originally developed for classification tasks with image class labels and applied it to the lesion detection problem with bounding-box annotations. Compared to supervised learning baselines, the semi-supervised method improved lung nodule detection significantly.
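The sketch below illustrates mixup as it might be applied to blend a ground-truth-labelled example with a pseudo-labelled one; the Beta parameter and the scalar targets are illustrative assumptions, and the cited work applies the idea to bounding-box annotations rather than the simplified targets shown here.

```python
# Sketch of mixup augmentation: an image/target pair with ground-truth labels is
# blended with a pseudo-labelled pair using a weight drawn from a Beta distribution.
import torch

def mixup(x1, y1, x2, y2, alpha: float = 0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x = lam * x1 + (1 - lam) * x2          # blended image
    y = lam * y1 + (1 - lam) * y2          # blended (soft) target
    return x, y

img_labeled, target_labeled = torch.randn(1, 64, 64), torch.tensor([1.0])
img_pseudo, target_pseudo = torch.randn(1, 64, 64), torch.tensor([0.7])   # pseudo label from the model
mixed_img, mixed_target = mixup(img_labeled, target_labeled, img_pseudo, target_pseudo)
```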

3.3.2 Universal Lesion Detection

While traditional lesion detectors have focused on a single type of lesion, there is growing interest in research that seeks to identify and localize various types of lesions across the entire body at once. Different forms of lesions, such as lung nodules, liver tumors, abdominal masses, and pelvic masses, are represented in the DeepLesion dataset of over 32,000 annotated lesions. To lower the number of false positives produced by the model, it was retrained with negative examples. Using a multitask detector (MULAN) to simultaneously carry out lesion detection, tagging, and segmentation, [50] were able to significantly enhance universal lesion detection performance; combining tasks can improve performance on a single task because they supply complementary information to each other. MULAN is an adaptation of Mask R-CNN with three head branches: the detection branch predicts each proposed region's lesion status and bounding-box regressions, while the tagging branch predicts 185 tags (such as body part, lesion type, intensity, and shape) for each lesion proposal [104,105,106,107,108,109,110].

3.3.3 Unsupervised Lesion Detection

Whether the goal is to identify a particular type of lesion or universal lesions, some level of supervision is needed to train a one-stage or two-stage detector, as discussed above. Before training the detectors, the supervision must be set up by specifying the types of lesions to be detected, and once trained, the detectors cannot recognize lesions that were not included in the original training set. Unsupervised lesion detection, on the other hand, does not need ground-truth annotations, so the types of lesions do not have to be specified in advance. It can therefore be used for rough anomaly detection and for identifying potential imaging biomarker candidates.

The effectiveness of these unsupervised models has mostly been shown in MRI, where VAEs and GANs are often used to estimate the normative distribution. The authors provide an in-depth analysis of the differences between these models and several insightful examples of effective use. Among other things, this research concludes that restoration-based methods outperform reconstruction-based methods in situations where runtime is not a factor [111,112,113,114,115,116,117,118,119]. In contrast to that survey, we give only a quick overview of reconstruction-based methods and focus on the most recent research on restoration-based detection.

3.3.3.1 Reconstruction-Based Paradigm

In the reconstruction-based paradigm, the original image is reconstructed from its latent representation using an AE- or VAE-based model. Only healthy images are used to train the model, which is tuned to yield low pixel-wise reconstruction error. When images containing abnormalities are processed by the model, reconstruction errors are expected to be small for healthy regions and large for abnormal regions. Combining these two measures, a CVAE-based model produced acceptable tumor segmentation results on MRIs. Notably, the authors incorporated local context into the CVAE by using patch positions as conditions; this location-related condition can provide extra background knowledge about healthy and diseased tissues to enhance performance.
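The following sketch illustrates the reconstruction-error idea: an auto-encoder assumed to have been trained only on healthy images is applied to a test image, and the per-pixel error is thresholded into an anomaly map; the model, image size, and threshold are illustrative assumptions.

```python
# Sketch of reconstruction-based anomaly detection: an auto-encoder trained only
# on healthy images is applied to a new image, and the per-pixel reconstruction
# error serves as an anomaly map.
import torch
import torch.nn as nn

autoencoder = nn.Sequential(                       # assumed to be trained on healthy images only
    nn.Linear(4096, 128), nn.ReLU(), nn.Linear(128, 4096), nn.Sigmoid())

image = torch.rand(1, 4096)                        # flattened 64x64 test image
with torch.no_grad():
    reconstruction = autoencoder(image)

error_map = (image - reconstruction).abs().reshape(64, 64)   # per-pixel reconstruction error
anomaly_mask = error_map > 0.3                                # large errors flag abnormal regions
```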

3.3.3.2 Restoration-Based Paradigm

Restoration seeks either an ideal latent representation or the original, non-anomalous version of the input abnormal image. Both GAN-based and VAE-based approaches have been applied, with the former often using a GAN during latent representation restoration. Given an input image (normal or anomalous), the ideal latent representation is restored by performing gradient descent in the latent space (with respect to the latent variable). More specifically, the optimization is governed by a loss function that combines a residual loss and a discrimination loss: the residual loss, similar to the reconstruction error, evaluates how much the images generated from the latent variable differ from the originals, while the discrimination loss is computed by feeding both kinds of images to the discriminator network and extracting features from an intermediate layer.

3.4 Registration

Deformable registration seeks to build a dense, spatially varying correspondence between images, as opposed to rigid registration, in which all image pixels uniformly undergo a simple transform (such as rotation). In recent years, deep learning has been applied to this field increasingly often, particularly for deformable image registration. Our assessment of deep learning-based medical image registration techniques follows the same three-fold structure as the review paper by Haskins et al. (2020): (1) deep iterative registration; (2) supervised registration; and (3) unsupervised registration. More detailed information on registration strategies is available in several other outstanding review studies [50, 51].

3.4.1 Deep Iterative Registration

To accomplish deformable registration, for instance, [50] employed a 5-layer convolutional neural network (CNN) to learn a measure of similarity between aligned 3D brain MRI T1–T2 image pairs. For multimodal registration, this deep learning-based measure performed better than manually specified similarity metrics such as mutual information. The closest comparable study is [41, 42], which used an FCN pre-trained with a stacked denoising auto-encoder to assess the similarity of 2D CT–MR patch pairs.

3.4.2 Supervised Registration

Warp/deformation fields may be synthesized or simulated, manually annotated, or obtained through classical registration. [48] built a multiscale CNN-based model to directly predict 3D displacement vector fields (DVFs). To enhance their training dataset, they generated DVFs with varied spatial frequency and amplitude, augmenting the data to one million training samples. After training, deformed images were registered in a single pass, outperforming B-spline registration. Image similarity measures may also aid registration: [68] established dual-supervised training for brain MR image registration, in which the difference between the predicted and ground-truth deformation fields provided one supervision signal and the image similarity between the template and the warped subject image provided the other. The former improved network convergence, and the latter further improved training and registration.

3.4.3 Unsupervised Registration

With traditional registration procedures it is difficult to collect ground-truth warp fields, and the limited deformations available for model training lead to unsatisfactory results on unseen images. Wu et al. (2016) used a convolutional stacked auto-encoder to improve registration performance. In unsupervised registration networks, the encoder takes a moving and a fixed image as input and the decoder outputs the registration field; a spatial transformer network [56] then warps the moving image with the registration field to reconstruct the fixed image. VoxelMorph minimizes the difference between the reconstructed and fixed images to produce deformation fields, and this unsupervised registration approach was orders of magnitude faster than symmetric normalization (SyN).

The loss functions used by the aforementioned unsupervised registration methods all combine user-defined similarity metrics with specific regularization terms. Although traditional similarity measures perform well in mono-modal registration, they are not as successful as deep similarity metrics in multi-modal cases. To obtain optimal results in multi-modal registration, it has therefore been suggested to use deep similarity metrics learned in unsupervised settings.
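The sketch below shows a typical unsupervised registration objective of this form: a similarity term (here plain MSE, standing in for any user-defined metric) between the warped moving image and the fixed image, plus a smoothness regularizer on the predicted displacement field; the loss weight and field shape are illustrative assumptions.

```python
# Sketch of an unsupervised registration objective: image similarity between the
# warped moving image and the fixed image, plus a smoothness penalty on the
# spatial gradients of the predicted deformation field.
import torch
import torch.nn.functional as F

def registration_loss(warped, fixed, flow, smooth_weight: float = 0.01):
    similarity = F.mse_loss(warped, fixed)            # user-defined similarity metric
    dx = flow[:, :, 1:, :] - flow[:, :, :-1, :]       # spatial gradients of the
    dy = flow[:, :, :, 1:] - flow[:, :, :, :-1]       # displacement field
    smoothness = dx.pow(2).mean() + dy.pow(2).mean()
    return similarity + smooth_weight * smoothness

fixed = torch.rand(1, 1, 64, 64)
warped = torch.rand(1, 1, 64, 64)          # moving image after warping by the network
flow = torch.randn(1, 2, 64, 64)           # predicted 2D displacement field
print(registration_loss(warped, fixed, flow))
```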

4 Discussions

4.1 On the Task-Specific Perspective

4.1.1 Classification

The application of deep learning to medical image analysis has lagged behind its development in computer vision. Moreover, direct use of computer vision techniques may not deliver the desired results owing to the differences between medical and natural images, so good performance requires overcoming obstacles specific to medical imaging tasks. The key to success in classification is extracting highly discriminative characteristics with respect to particular classes. Domains with high inter-class similarity make this challenging, whereas domains with considerable inter-class variation make it comparatively straightforward. It is difficult to capture discriminative characteristics for breast cancers, which contributes to poor mammography classification performance. Given the high degree of similarity across classes, it may be appropriate to learn fine-grained visual categorization (FGVC) features that distinguish one class from another. However, it is important to keep in mind that the image samples in benchmark FGVC datasets were intentionally gathered to display significant inter-class similarity; methods developed and tested on such data may therefore not translate well to medical datasets, where only a subset of images exhibits strong inter-class similarity.

4.1.2 Detection

As the need to predict bounding boxes demonstrates, detecting medical objects is more involved than classifying them, and the difficulties of classification naturally carry over to detection. There are also additional difficulties, such as class imbalance and the detection of small-scale objects such as tiny lung nodules. One-stage detectors are often as effective as two-stage detectors at detecting large objects but have more trouble with small ones. Using multiscale features has been shown to help with this problem in both one- and two-stage detectors. Featurized image pyramids (Liu et al., 2020b) are a simple but successful method in which features are extracted from different image scales separately. While there is no single right way to construct a feature pyramid, it is generally accepted that robust, high-level semantics and high-resolution feature maps must be fused for optimal performance; as shown by FPN, this is crucial for detecting small objects (Lin et al., 2017a).

4.1.3 Segmentation

Segmenting medical images is a difficult task that combines the challenges of classification and detection, and these challenges frequently appear interwoven. Structures, shapes, and contours that are crucial to a proper diagnosis and prognosis may be lost as a result. For this reason, we believe it is important to improve segmentation performance by creating non-region-based indicators that can supplement region-based information.

The power of Transformers lies in their capacity to accurately model long-range dependencies, and it is this feature that we want to highlight here. Most CNN-based approaches do not emphasize long-range dependencies, despite their usefulness for obtaining accurate segmentation. Both intra-slice dependencies (relationships between pixels within a single CT or MRI slice) and inter-slice dependencies (relationships between pixels in different slices) should be captured.

4.1.4 Registration

The goal of medical image registration is to identify the pixel- or voxel-level correspondence between two images, which is a very different challenge from the preceding tasks. Obtaining trustworthy ground-truth registrations, whether created synthetically or by traditional registration techniques, presents a distinct obstacle, and unsupervised techniques have shown much promise in resolving this problem. However, many unsupervised registration systems [50] comprise multiple stages that register images in a coarse-to-fine fashion; the improved performance may be offset by the increased computational complexity and training difficulty introduced by multi-stage frameworks. It is therefore preferable to create registration frameworks with as few stages as possible, so that they can be trained end to end.

4.1.5 Incorporating Domain Knowledge

Most medical vision models are adopted from their counterparts in the natural imaging community, yet medical images present their own issues that make them harder to work with. If used properly, domain knowledge can reduce the amount of time and effort needed to solve these problems computationally. However, we find that it is sometimes more challenging to effectively incorporate the extensive domain knowledge already possessed by radiologists. Consider breast cancer detection on mammograms: unilateral correspondence and bilateral difference are important cues that radiologists use to find suspicious regions and diagnose malignancy, yet there are few effective ways to put this information to use. Further study is therefore required to fully exploit the benefits of rich domain expertise.

4.2 More Widespread use of Deep Learning in Medical Contexts

Despite its widespread use in academic and industrial research institutes for interpreting medical images, deep learning has not yet had the profound effect on clinical practice that was anticipated. Researchers across the globe quickly applied deep learning to patient chest X-rays and CT scans in an effort to make more precise and timely diagnoses and prognoses; however, model overfitting, poor evaluation, inappropriate data sources, and related issues greatly skewed the reported positive outcomes. Another review paper (Roberts et al., 2021) came to a similar conclusion after analyzing 62 studies chosen from 415.

4.2.1 Image Datasets

Deep learning relies heavily on data to function. For the purpose of training and testing new algorithms, the medical vision field has generated, and continues to develop, medical image datasets of increasing size (typically at least several hundred images). When benchmark datasets for various diseases (such as cancer) are provided annually as part of the MICCAI challenges, for instance, this greatly promotes the development of medical vision. Nonetheless, because everyone in the community is working toward the same goal of perfect performance, overfitting is likely to occur if a single dataset is used exclusively (Roberts et al., 2021). Many researchers have recognized this issue, so it is common practice to employ several public and/or private datasets to thoroughly verify the effectiveness of a new algorithm. Although this helps lessen community-wide bias, it is not enough for widespread clinical use.

More data should be used to train and test models to further reduce community-wide bias. Data curation, the ongoing creation of large, varied databases through collaboration with experts, is one straightforward way to add new data. An alternative, more indirect recommendation is to integrate dispersed private datasets as far as ethical and legal restrictions permit. Large, representative, labelled data may always appear insufficient, at least in the eyes of the medical image analysis community, but the reality is more nuanced. It is true that many well-known public databases have restricted quantity and diversity due to time and cost constraints, and most current data sources remain private and dispersed across agencies and countries because of privacy concerns and complex regulatory environments. It is therefore preferable to harness the combined power of private datasets, and even personal data, without compromising patients' privacy. Federated learning, which gives models encrypted access to private information, is a promising strategy for attaining this objective (Li et al., 2020f): it enables deep learning algorithms to be trained on data from many institutions without exchanging the data itself. Although this technology introduces new difficulties, it allows for the training of algorithms that are less biased, more generalizable, more resilient, and better performing than before, thereby better serving the demands of clinical applications.
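The sketch below illustrates one round of federated averaging under strongly simplified assumptions (three simulated sites, one local gradient step, plain weight averaging); real federated learning systems add secure aggregation, communication scheduling, and many local epochs.

```python
# Sketch of federated averaging: each institution trains a local copy of the model
# on its private data, and only the model weights are averaged centrally, so no
# images leave a site.
import copy
import torch
import torch.nn as nn

global_model = nn.Linear(16, 2)
local_datasets = [torch.randn(32, 16) for _ in range(3)]      # stand-ins for three hospitals
local_labels = [torch.randint(0, 2, (32,)) for _ in range(3)]

local_states = []
for x, y in zip(local_datasets, local_labels):
    local = copy.deepcopy(global_model)                       # each site trains its own copy
    opt = torch.optim.SGD(local.parameters(), lr=0.1)
    loss = nn.functional.cross_entropy(local(x), y)
    loss.backward()
    opt.step()
    local_states.append(local.state_dict())

averaged = {k: torch.stack([s[k] for s in local_states]).mean(dim=0)   # FedAvg: average weights
            for k in local_states[0]}
global_model.load_state_dict(averaged)
```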

4.2.2 Performance Evaluation

While it is simple to assess the technical performance of proposed methodologies using these criteria, such metrics do not always reflect clinical applicability. Clinicians are more interested in whether using algorithms will improve patient care than in the performance gains reported in articles (Kelly et al., 2019). Consequently, we think it is crucial for research teams to interact with physicians for algorithmic assessment, in addition to using appropriate criteria. We briefly discuss two approaches that could institutionalize cooperative assessment. First, have clinicians participate in the peer review process for conferences and journals, submitting papers and giving their perspectives on open clinical problems. Second, evaluate whether deep learning algorithms can enhance physicians' performance and/or efficiency. Some research has examined using model findings as a “second opinion” to help guide physicians' final interpretation. For instance, McKinney et al. (2020) assessed the supplementary role of a deep learning model in predicting breast cancer from mammograms. They found that the model could detect cancer in numerous cases that radiologists had missed, and that it greatly decreased the workload of the second reader in the “double-reading procedure” (common practice in the UK) while keeping performance close to the consensus opinion.

Overall, deep learning is an emerging area of study that shows great promise in several areas of medical image analysis, such as disease classification, segmentation, detection, and image registration. To construct deep learning-based CAD schemes that achieve high scientific rigour, we currently face various technological problems and hazards (Roberts et al., 2021); these challenges must be addressed through further study before deep learning-based CAD methods gain widespread acceptance among clinicians. The choice of the best deep learning method depends on the specific requirements and constraints of the medical imaging task at hand, and it is often beneficial to experiment with different architectures, pre-processing techniques, and training strategies to determine the most effective approach for a particular application. Additionally, incorporating domain knowledge, collaborating with medical experts, and conducting rigorous evaluations are essential for developing reliable and clinically relevant deep learning solutions in medical image analysis.