
1 Introduction

This chapter provides a state-of-the-art review of artificial intelligence in medical image analysis. We start with a brief introduction to computer vision and an overview of deep learning architectures. We then highlight relevant progress in clinical development and translation across the medical specialties of dermatology, pathology, ophthalmology, radiology, and cardiology, focusing on the domains of computer vision and machine learning. Furthermore, we introduce some of the challenges that the disciplines of computer vision and machine learning face within a traditional regulatory environment. This chapter highlights the developments of computer vision and machine learning in medicine through a breadth of powerful examples that give the reader an understanding of the potential impact of artificial intelligence in the clinical environment and of the challenges it faces.

2 Medical Imaging Modalities

The number of available imaging modalities has grown over the past few decades. From the early days of plain fluoroscopic images and simple light microscopy, imaging modalities that are now commonplace in well-resourced hospitals include computed tomography, magnetic resonance imaging, ultrasound imaging, positron emission tomography and digital microscopy, among others [1]. Beyond the various modalities of image acquisition, the digitization of images and technology of storing, transferring and manipulating image data has accelerated the pace of the medical imaging field to a point where ample opportunities exist to take advantage of this data for the benefit of patient care. In this chapter, we will provide examples of computer vision applications in medical problems.

3 General Principles of Computer Vision and Machine Learning

Computer vision is a field of computer science focused on identifying, analyzing, and decomposing images into meaningful elements in a process that emulates the inner workings of the human visual system. Essentially, it tasks a machine with something that parallels the higher cognitive processing our brains perform, from the mere visual capturing of an image to its processing, interpretation and response.

There has been exponential growth in computer vision over the past decade due to gains in computational power, data storage and sharing capabilities, and the development of innovative machine learning models that have transformed the performance of artificial intelligence. In particular, deep learning has been a transformative approach in the field of computer vision. Whilst the early development [2] and application [3] of neural networks occurred in the 1970s and 1980s, advances in the necessary processing power with the use of graphics processing units (GPUs) happened in the late 2000s, enabling deep learning models, such as convolutional neural networks (CNNs), to be trained at an acceptable speed.

A critical step in a typical computer vision task is the ability to recognize patterns automatically. Underpinning good pattern recognition is access to a large, reliable and high-quality dataset. Typically, a dataset is split into training, validation, and testing components. The training and validation datasets allow for model selection and parameter adjustment, and the test dataset enables the assessment of that model. Thus, we find that many examples of high-performing artificial intelligence (AI) algorithms are underpinned by large, high-quality datasets.
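As an illustration of this split, the minimal Python sketch below divides a hypothetical labelled image dataset into training, validation and test sets using scikit-learn; the array names and proportions are placeholders rather than a prescription.

```python
# A minimal sketch of the train/validation/test split described above,
# using scikit-learn. The arrays `images` and `labels` are hypothetical.
import numpy as np
from sklearn.model_selection import train_test_split

images = np.random.rand(1000, 224, 224, 3)   # placeholder image data
labels = np.random.randint(0, 2, size=1000)  # placeholder binary labels

# Hold out 20% of the data as the final test set.
x_trainval, x_test, y_trainval, y_test = train_test_split(
    images, labels, test_size=0.20, stratify=labels, random_state=42)

# Split the remainder into training and validation sets, used for
# parameter tuning and model selection respectively.
x_train, x_val, y_train, y_val = train_test_split(
    x_trainval, y_trainval, test_size=0.20, stratify=y_trainval, random_state=42)
```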

Similarly, humans often rely on pattern recognition in everyday life, albeit in a more implicit way. When doctors assess images, whether from X-rays, MRI, ultrasound, or histopathology slides, pattern recognition is critical. The more images of a particular disease a doctor has seen, the better equipped they are to identify it the next time.

Medical image analysis is an exciting field for the meaningful application of computer vision and machine learning techniques. Diagnoses and clinical decisions often rely heavily on the acquisition and interpretation of images by a clinician. There is a growing number of applications of AI in this field, and in this chapter, we will mention a few of them across a broad range of specialties. However, to better understand how computer vision and machine learning can be applied, we need to break down the individual tasks that AI models perform within medical image analysis, namely object classification, object detection and image segmentation.

4 Computer Vision Tasks and CNN Architectures

4.1 Object Classification, Object Detection and Image Segmentation in Medical Image Analysis

There are nuanced differences between the concepts of object classification, detection and segmentation. With classification, the goal is to identify the objects within an image, i.e. an image of a pack of wolves would be classified as an image of wolves (Fig. 1). Object detection would involve creating bounding boxes around each wolf to ‘locate’ the wolves within the image. Segmentation would involve analysis of the image at the pixel level to determine which pixels belong to the ‘wolf’ object and which do not, so that each ‘wolf’ object is outlined more closely. Whereas object detection may produce bounding boxes that overlap, segmentation is mutually exclusive: a pixel in the image can be attributed to only a single object.

Fig. 1

Examples of various computer vision tasks. (a) Classification: determining an image does indeed contain a wolf. (b) Detection: identifying wolves within an image. (c) Segmentation (semantic): outlining wolves in the image. (d) Segmentation (instance): outlining each wolf within an image.

In a conventional sense, the concept of pattern recognition is most easily compared to a classification task. In the domain of machine learning aided medical image analysis, classification tasks take the form of making global or study level diagnoses from available medical images or videos.

Extending beyond classification, object detection and segmentation techniques are often employed when clinical decisions involve localization or complete tracing of lesions down to the pixel level. The main difference lies in the output form: object detection CNNs give a bounding box indicating the location of a detected lesion, whereas segmentation CNNs provide a precise pixel-level delineation of the lesion. This also means the annotation cost differs between the two tasks. For object detection, a bounding box enclosing a lesion is comparatively easy to draw. For segmentation, the annotation must be drawn carefully along the lesion boundary, with every pixel in the image assigned to a class. Naturally, the annotation cost of segmentation is significantly higher, and segmentation is generally more time consuming than classification or detection.
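To make the difference in output forms concrete, the following sketch shows, with entirely made-up values, the kind of output each task produces: a single label for classification, a list of bounding boxes with scores for detection, and a per-pixel class map for segmentation.

```python
# Illustrative output forms only; all values below are invented.
import numpy as np

# Classification: one label (or class probabilities) for the whole image.
classification_output = {"class": "melanoma", "probability": 0.93}

# Detection: a bounding box (x_min, y_min, x_max, y_max), class and
# confidence score for each detected lesion; boxes may overlap.
detection_output = [
    {"box": (120, 80, 210, 170), "class": "lesion", "score": 0.88},
    {"box": (300, 45, 360, 110), "class": "lesion", "score": 0.72},
]

# Segmentation: a per-pixel class map with the same height and width as
# the image; each pixel belongs to exactly one class (0 = background, 1 = lesion).
segmentation_output = np.zeros((512, 512), dtype=np.uint8)
segmentation_output[80:170, 120:210] = 1  # pixels assigned to the lesion class
```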

4.2 CNN Architectures: A Deep Dive

4.2.1 Origin of CNNs

The origin of CNNs dates back to the “neocognitron” proposed by Kunihiko Fukushima in 1980 [3]. The “neocognitron” was a self-organizing neural network model inspired by the human visual pattern recognition system. Here, the basic CNN layers were introduced: convolution (generating responses to useful spatial patterns) and down-sampling (reducing the spatial size of input or convolved feature maps). LeNet-5, a seven-layer CNN introduced by LeCun et al., is a simple CNN model by today’s standards [4]. LeNet-5 was successfully applied to the MNIST database (Modified National Institute of Standards and Technology database) of handwritten digits, achieving an error rate of about 1% [5]. Its success inspired the various CNN architectures seen today. In 2012, the development of CNNs rapidly expanded when AlexNet [6], a GPU-accelerated deep CNN architecture, won the ImageNet ILSVRC 2012 image recognition challenge [7].

4.2.2 CNN Design

A CNN architecture defines a stack of functional layers, each performing some form of mathematical computation to transform its input features into output features in a specific order. Taking LeNet-5 as an example, each convolution (conv) layer consists of a convolution operation and a nonlinear activation operation. In LeNet-5 the activation function is the hyperbolic tangent (tanh); commonly used alternatives are the sigmoid and the more popular Rectified Linear Unit (ReLU) [8]. The average pooling (avg-pool) operation is used in LeNet-5 to perform the down-sampling, whereas max-pooling is favored in more recent CNN architectures. The seven layers of LeNet-5 can be described as a sequence of (1) conv; (2) avg-pool; (3) conv; (4) avg-pool; (5) conv; (6) fc (fully connected); (7) softmax. The fully connected layer applies a linear transformation to its input and is sometimes referred to as a linear or dense layer in the literature. The final softmax layer computes the probability of each class for the classification task. Architecture-wise, modern CNNs such as the GoogLeNets [9, 10] and ResNets [11] discussed below are variants of LeNet-5 with more sophisticated and well-designed functional layers, a longer stack of layers, and millions of trainable model parameters, providing much larger learning capacity.
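As an illustration, a minimal PyTorch sketch of this seven-layer sequence is given below. It follows the description above (conv, avg-pool, conv, avg-pool, conv, fc, softmax) with tanh activations; it is a simplification of the original LeNet-5, which also includes an 84-unit fully connected layer before the output.

```python
# A simplified sketch of the seven-layer LeNet-5 sequence described above.
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # (1) conv
            nn.AvgPool2d(kernel_size=2),                   # (2) avg-pool
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # (3) conv
            nn.AvgPool2d(kernel_size=2),                   # (4) avg-pool
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # (5) conv
        )
        self.classifier = nn.Linear(120, num_classes)      # (6) fc
        self.softmax = nn.Softmax(dim=1)                   # (7) softmax

    def forward(self, x):                    # x: (batch, 1, 32, 32) grayscale digits
        x = self.features(x)
        x = torch.flatten(x, 1)              # (batch, 120)
        return self.softmax(self.classifier(x))

probs = LeNet5()(torch.randn(4, 1, 32, 32))  # class probabilities, shape (4, 10)
```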

The original GoogLeNet, often referred to as Inception v1, was designed by Google researchers [10]. The v1 network is a 22-layer deep learning architecture that utilizes Inception modules for multi-scale and multi-path processing of features. GoogLeNet was trained on the ImageNet dataset containing approximately 1.28 million images and won the ILSVRC 2014 challenge [12]. Over the years, the Inception networks went through several modifications from v1 to v4 [9, 13], exploring structural variations of the Inception module for better learning capacity. The high performance of the GoogLeNets is reflected in their frequent use in medical image analysis.
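The sketch below illustrates the core idea of an Inception-style module in a simplified form, assuming parallel 1 × 1, 3 × 3 and 5 × 5 convolutional branches plus a pooled branch whose outputs are concatenated; this gives the multi-scale, multi-path processing described above, although it is not the exact GoogLeNet module.

```python
# A simplified Inception-style module: several parallel branches operate on
# the same input and their outputs are concatenated along the channel axis.
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, out_ch_per_branch=16):
        super().__init__()
        c = out_ch_per_branch
        self.branch1 = nn.Conv2d(in_ch, c, kernel_size=1)
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, c, kernel_size=1),
                                     nn.Conv2d(c, c, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, c, kernel_size=1),
                                     nn.Conv2d(c, c, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                                         nn.Conv2d(in_ch, c, kernel_size=1))

    def forward(self, x):
        # Each branch preserves the spatial size; channels are concatenated.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

out = InceptionModule(64)(torch.randn(1, 64, 28, 28))  # shape (1, 64, 28, 28)
```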

ResNets, designed by Microsoft researchers [11], are a family of deep learning architectures that utilize residual connections to allow information to optionally bypass layers of computation. These connections enable efficient backpropagation of the loss signal to early network layers, allowing much deeper networks to be built. Members of the ResNet family are named after their number of layers; the commonly used ones are ResNet-18, 34, 50, 101, and 152. The ResNets achieved extraordinary performance in the ILSVRC 2015 and COCO 2015 challenges [14].
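The defining component is the residual (shortcut) connection, sketched below in simplified form: the block's input is added back to the output of its convolutions, so information can bypass the layers and gradients can flow more easily to the earlier parts of a very deep network.

```python
# A minimal residual block: two convolutions plus an identity shortcut.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # shortcut path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)          # residual addition

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56)) # same shape as the input
```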

4.2.3 Architectures for Object Detection

Object detection networks are commonly classified into two streams: (1) one-stage networks, e.g., YOLO [15], SSD [16], and RetinaNet [17], which simultaneously localize objects (i.e., where the lesions are) and predict the object class (i.e., what type of lesion it is); and (2) two-stage networks, e.g., the R-CNN [18] detector family, from the original R-CNN to Fast R-CNN [19] and Faster R-CNN [20], which first find regions of interest and then, in a second stage, perform the object classification. Although one-stage networks compute faster, two-stage networks often perform better.
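As a hedged illustration of running a two-stage detector in practice, the sketch below applies a Faster R-CNN model from the torchvision library to a placeholder image; in a medical setting the model would first be fine-tuned on lesion bounding-box annotations rather than used with natural-image weights.

```python
# Running a pretrained two-stage detector (Faster R-CNN) from torchvision.
import torch
import torchvision

# On older torchvision versions, use pretrained=True instead of weights="DEFAULT".
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 512, 512)          # placeholder for a preprocessed image
with torch.no_grad():
    predictions = model([image])[0]      # the model accepts a list of images

boxes = predictions["boxes"]             # (N, 4) bounding boxes: x1, y1, x2, y2
labels = predictions["labels"]           # (N,) predicted class indices
scores = predictions["scores"]           # (N,) confidence scores
keep = scores > 0.5                      # keep confident detections only
```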

4.2.4 Architectures for Image Segmentation

FCN (Fully Convolutional Network) [21], U-Net [22], DeepLab [23], and Mask R-CNN [24] are commonly used architectures for image segmentation. FCN, U-Net and DeepLab are end-to-end, one-stage solutions generating pixel-level object classification directly from an input image; the raw network outputs may therefore not be accurate at the object boundary. These raw predictions can be further refined by Dense-CRF [25] to better align with the object boundary. Mask R-CNN builds on top of Faster R-CNN by adding an FCN in the second detection stage to generate a pixel-level segmentation in addition to the object bounding box. This effectively achieves the goal of instance segmentation: separating multiple objects (even from the same class, i.e., identifying each wolf in the pack) in the same image from each other.

In the domain of medical image segmentation, U-Net has been widely adopted. This can be credited to its simplicity, widely available online implementations, and high performance, as evidenced by its wins in medical image segmentation challenges such as the Dental X-Ray Image Segmentation challenge and the Cell Tracking Challenge at ISBI 2015.
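A heavily simplified U-Net-style sketch is shown below: a contracting encoder path, an expanding decoder path, and a skip connection that concatenates encoder features with up-sampled decoder features before the per-pixel output. The real U-Net uses several such levels; this toy version keeps only one.

```python
# A toy U-Net-style encoder-decoder with a single skip connection.
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc1 = double_conv(1, 32)
        self.enc2 = double_conv(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = double_conv(64, 32)               # 32 upsampled + 32 skip channels
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                             # full-resolution features
        e2 = self.enc2(self.pool(e1))                 # down-sampled features
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return self.head(d1)                          # per-pixel class logits

logits = TinyUNet()(torch.randn(1, 1, 128, 128))      # shape (1, 2, 128, 128)
```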

5 Clinical Applications

5.1 Dermatology

Dermatologists are all too familiar with the diagnosis of skin lesions based on visual inspection, often aided by a dermatoscope. Clinicians consider additional patient demographic and clinical information to aid the diagnosis with the possibility of further downstream confirmatory testing and therapy. Nonetheless, diagnosis largely relies on visual inspection and, therefore, image analysis in the diagnostic process invites an opportunity for the use of AI techniques.

Researchers around the world have seized this opportunity. For example, along with academic institutions, large corporations such as Google and Microsoft have research divisions involved in developing high performing deep learning architectures and fine-tuning them for application in the field of Dermatology.

Esteva et al. [26] retrained the GoogLeNet Inception v3 CNN architecture with 127,463 images of epidermal and melanocytic lesions in the training and validation set (Fig. 2). They were able to classify benign and malignant lesions in a testing set of 1942 biopsy-labelled images with an AUC exceeding 91%. This result was better than the performance of the 25 dermatologists to whom the model was compared [26].

Fig. 2

Layout of the Google Inception v3 deep CNN architecture as applied to skin lesion analysis. (Reproduced from Ref. [26])

In a focused head-to-head task comparison of diagnosing melanoma from benign naevi, Brinker et al. compared the performance of the ResNet-50 deep learning architecture to 157 dermatologists from 12 different university hospitals in Germany [27, 28]. The CNN was trained on a labelled dermatoscopic dataset from the International Skin Imaging Collaboration (ISIC) archive, which contained 2169 melanomas and 18,566 atypical naevi. The CNN and specialists were tested on a dataset of 100 images. The CNN outperformed 136 of the 157 dermatologists in terms of sensitivity and specificity [28]. When the ResNet-50 CNN was trained on a different biopsy-proven dataset of 4204 dermatoscopic images with a 1:1 ratio of melanoma to naevi, it conclusively outperformed junior and senior dermatologists on a testing dataset of 804 biopsy-proven images [27].
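A hedged sketch of the transfer-learning recipe these studies share, retraining an ImageNet-pretrained CNN for a two-class melanoma versus naevus task, is shown below; the data, labels and hyperparameters are placeholders, not those of the cited studies.

```python
# Transfer learning: replace the final layer of a pretrained ResNet-50 with a
# two-class head and fine-tune it on labelled dermatoscopic images (placeholders).
import torch
import torch.nn as nn
import torchvision

# On older torchvision versions, use pretrained=True instead of weights=...
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)   # melanoma vs. benign naevus

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(8, 3, 224, 224)     # placeholder batch of dermatoscopic images
targets = torch.randint(0, 2, (8,))      # placeholder labels (0 = naevus, 1 = melanoma)

logits = model(images)                   # one fine-tuning step
loss = criterion(logits, targets)
loss.backward()
optimizer.step()
```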

Similarly, a team from Microsoft Research Asia used ResNet-152 to classify 12 different skin diseases and achieved performance comparable to that of 16 dermatologists [29].

These studies demonstrate the potential of AI to assist in diagnosing dermatological conditions from a predetermined known set of diagnoses. However, whilst seemingly obvious, one must remind themselves that AI cannot classify an image to a clinical condition that it has not been trained against. Once again, this highlights the importance of a high quality and broad training dataset relevant to the task in question.

The commercial industry has capitalized on general optimism about the capability of AI. SkinVision, based in the Netherlands, allows users to upload photos of skin moles or spots. It subsequently classifies the lesion as benign or malignant and provides a risk assessment for the patient [30]. It has reported a sensitivity of approximately 95% at a specificity of 78% in detecting pre-malignant conditions. The application makes predictions on photos uploaded by users via their personal device. To make a meaningful prediction, the software first processes the image. This involves noise removal to eliminate minor irregularities (e.g., freckles), image segmentation to separate the lesion of interest from the surrounding skin, and feature extraction to obtain geometric, texture and colour parameters. Rather than using a CNN, the hallmark of deep learning models, this application feeds the extracted features into a Support Vector Machine (SVM) classifier, a well-established machine learning method [30].
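To illustrate this classical, non-deep-learning pipeline, the sketch below feeds a handful of hypothetical hand-crafted lesion features into an SVM classifier with scikit-learn; the feature names and values are invented for illustration and are not the product's actual features.

```python
# Hand-crafted lesion features fed to a Support Vector Machine classifier.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each row: [asymmetry, border_irregularity, colour_variance, diameter_mm, texture_contrast]
features = np.array([
    [0.10, 0.20, 0.05, 3.0, 0.15],   # benign-looking lesion (toy values)
    [0.80, 0.90, 0.70, 7.5, 0.60],   # suspicious lesion (toy values)
    [0.15, 0.25, 0.10, 2.5, 0.20],
    [0.75, 0.85, 0.65, 8.0, 0.55],
])
labels = np.array([0, 1, 0, 1])      # 0 = benign, 1 = malignant (toy labels)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(features, labels)

new_lesion = np.array([[0.70, 0.80, 0.60, 6.0, 0.50]])
risk = clf.predict_proba(new_lesion)[0, 1]   # estimated probability of malignancy
```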

Whilst the clinical applications of AI in dermatology are undoubtedly exciting, there are fundamental limitations that need to be considered. For example, the conditions under which images are taken, both in training and in testing, are particularly important.

Dermatology is a highly specialized field, and it is well-known that the diagnostic accuracy of lesions assessed by family practitioners or physicians not specializing in the field is comparatively low. Dutch and British studies have estimated an underwhelming 40–60% of skin lesions are accurately diagnosed by general practitioners [31, 32]. This makes a compelling case for the use of smartphone applications to aid bedside assessment of skin lesions.

5.2 Pathology

Pathologists provide a critical perspective on the diagnosis of many medical conditions, and histological diagnoses often serve as the gold standard test in many situations. Much histopathological assessment is undertaken via light microscopy with additional tissue staining and immunohistochemistry to enhance diagnostic capability. As the diagnosis is often contingent on image analysis, the field of pathology lends itself to AI technology. Critical to image analysis by AI has been the advancement in technology that now enables digitizing histology slides with whole slide imaging scanners [33, 34].

In 2016, a competition called CAMELYON16, hosted in the Netherlands [35], tasked researchers with developing automated solutions to detect the presence of lymph node metastases in tissue biopsies of women with breast cancer. It consisted of two tasks: (a) identification of individual metastatic foci within the whole-slide image; (b) classification of whether metastasis was present within the whole-slide image. Thirty-two submitted algorithms were trained on a set of 270 images. Deep learning algorithms performed the best overall, and the best performing algorithms had an AUC comparable to that of experienced pathologists in both tasks. An example of the results of the top three performing teams is shown in Fig. 3 [35]. The top-performing model, developed by contributors from Harvard Medical School and the Massachusetts Institute of Technology, used a deep learning model with a 22-layer GoogLeNet architecture [10].

Fig. 3

Top performing models in the CAMELYON16 competition to automate detection of lymph node metastases in tissue biopsies of women with breast cancer. The left-most column (a) shows 4 annotated metastatic lesions of breast cancer. These were identified by the algorithms, with results shown in the following three columns (b-d) as probability colour maps. (Image obtained from Ref. [35])

Similarly, a group from New York University School of Medicine applied deep learning algorithms to detect the presence of lung cancer and classify its subtype from histopathology whole-slide images (WSI) [36]. Coudray et al. trained their model on WSIs from The Cancer Genome Atlas in order to classify them into normal lung, adenocarcinoma or squamous cell carcinoma. They not only achieved diagnostic performance similar to that of pathologists, with an average AUC of 0.97, but were also able to train the network to predict the mutation status of genes commonly mutated in adenocarcinoma. The prediction of mutation status had AUCs between 0.733 and 0.856. A summary of their approach can be seen in Fig. 4.

Fig. 4

Process, workflow and strategy for classification of lung tissue into normal, adenocarcinoma and squamous cell carcinoma using whole-slide images from The Cancer Genome Atlas. (a) Number of WSIs per class. (b) Training strategy: (b, i) image download; (b, ii) slides separated into training, validation and testing sets; (b, iii) slides tiled into non-overlapping 512 × 512 pixel windows; (b, iv) Inception v3 architecture trained with training and validation tiles; (b, v) classification performed on the independent test set and aggregated into heat maps. (c) Size distribution of image widths and heights. (d) Number of tiles per slide. (Reproduced from Ref. [36])
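A minimal sketch of the tiling step described in (b, iii) of the caption above is given below, cutting a placeholder slide array into non-overlapping 512 × 512 pixel tiles; real pipelines read tiles from a pyramidal WSI file rather than from a single in-memory array.

```python
# Cutting a (placeholder) whole-slide image into non-overlapping 512 x 512 tiles.
import numpy as np

slide = np.random.randint(0, 256, size=(4096, 6144, 3), dtype=np.uint8)  # placeholder WSI region
tile_size = 512

tiles = []
for y in range(0, slide.shape[0] - tile_size + 1, tile_size):
    for x in range(0, slide.shape[1] - tile_size + 1, tile_size):
        tile = slide[y:y + tile_size, x:x + tile_size]
        if tile.mean() < 240:              # crude background filter: skip mostly-white tiles
            tiles.append(((y, x), tile))   # keep each tile with its slide coordinates
```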

5.3 Ophthalmology

In the field of ophthalmology, fundus photography is a routine part of the clinical examination. Fundoscopy allows assessment of the retina, its vasculature and the optic nerve head. Fundoscopic findings can reveal a great deal about other systemic conditions such as diabetes, hypertension and raised intracranial pressure [37].

Google researchers developed and validated a deep learning algorithm to detect diabetic retinopathy in retinal fundus photographs [38]. Armed with a dataset of 128,175 fundoscopic images from patients presenting for diabetic retinopathy screening, they trained a deep learning model to detect the presence of diabetic retinopathy and diabetic macular edema. Their algorithm had an AUC of greater than 0.99 for detecting referable diabetic retinopathy or macular oedema.

A group of researchers from the University of Iowa worked on developing algorithms to automate the screening of diabetic retinopathy. They demonstrated that IDx-DR version X2.1, a system underpinned by a deep learning AI algorithm based on the AlexNet [6] and Oxford Visual Geometry Group [39] network architectures, was able to achieve 96.8% sensitivity, 87% specificity and an AUC of 0.980 in the screening of diabetic retinopathy [40]. In a prospective clinical trial with 900 enrolled patients, the AI system exceeded its pre-specified superiority endpoints when compared to the Wisconsin Fundus Photograph Reading Centre (FPRC), the typical gold standard; the IDx-DR system achieved a sensitivity of 87% and a specificity of 91% [41]. The use of deep learning enabled the high performance seen here, a marked improvement over a previous algorithm, also developed by Abramoff et al., that did not incorporate deep learning methods [42]. IDx-DR has since obtained approval from the United States Food and Drug Administration (FDA), becoming the first device authorized for marketing that provides a screening decision without the need for a clinician to also interpret the image or results [43].
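For reference, the metrics quoted throughout this section (AUC, sensitivity, specificity) can be computed from a model's predicted probabilities as in the small sketch below; the labels and scores are illustrative only.

```python
# Computing AUC, sensitivity and specificity from toy predictions.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # 1 = referable retinopathy (toy labels)
y_score = np.array([0.92, 0.10, 0.85, 0.40, 0.30, 0.05, 0.76, 0.65])

auc = roc_auc_score(y_true, y_score)                  # area under the ROC curve

y_pred = (y_score >= 0.5).astype(int)                 # threshold the probabilities
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                          # true positive rate
specificity = tn / (tn + fp)                          # true negative rate
```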

5.4 General Radiology

As a specialty primarily working with imaging modalities that include x-ray, computed tomography (CT), magnetic resonance imaging (MRI), ultrasound and nuclear imaging, there is an abundance of opportunity for AI to aid diagnostics. Radiologists may specialize further and focus on limited sections of the body, defined by anatomical regions or specific modalities.

A commonly ordered diagnostic test is the chest X-ray. Even with this common examination, technical and patient-related factors may hinder standardization of the resulting image. Researchers at Stanford University developed CheXNet, a 121-layer densely connected CNN (DenseNet) trained on a database of 112,120 frontal-view chest X-ray images, each labelled from a set of 14 different diagnoses [44]. They found the model to exceed the diagnostic performance of four radiologists.
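A hedged sketch in the spirit of this approach is shown below: a DenseNet-121 backbone from torchvision with a 14-output head and a binary cross-entropy loss, reflecting the multi-label nature of chest X-ray findings (several can co-occur on one film). It is not the authors' implementation.

```python
# DenseNet-121 with a 14-output multi-label head (placeholder data).
import torch
import torch.nn as nn
import torchvision

# On older torchvision versions, use pretrained=True instead of weights=...
model = torchvision.models.densenet121(weights="IMAGENET1K_V1")
model.classifier = nn.Linear(model.classifier.in_features, 14)  # 14 possible diagnoses

criterion = nn.BCEWithLogitsLoss()            # independent sigmoid per finding

images = torch.randn(4, 3, 224, 224)          # placeholder frontal chest X-rays
targets = torch.randint(0, 2, (4, 14)).float()  # placeholder multi-label targets

logits = model(images)
loss = criterion(logits, targets)
probabilities = torch.sigmoid(logits)         # per-finding probabilities in [0, 1]
```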

The enthusiasm of AI in the field of radiology has led to a number of image challenges, one of which is the Radiological Society of North America (RSNA) Paediatric Bone Age Machine Learning Challenge. This challenge provided a dataset of 14,236 paediatric hand radiographs and received 105 submissions globally. Most of these submissions used deep learning algorithms. With a mean patient age of 127 months in the dataset, the winning submission had an average deviation in age estimation of only 4.2 months [45].

Diagnosis of fractures on skeletal X-rays can also benefit from AI assistance. Researchers from the National University of Singapore used the Faster R-CNN architecture [46], a high performing object detection network, to identify radius and ulna wrist fractures. They trained the model on 7356 AP and lateral X-ray images of the wrist, and the model correctly detected and localized the fracture in 91% and 96% of cases in the AP and lateral views, respectively [47]. This can be seen in Fig. 5.

Fig. 5

Radius and ulna wrist fracture detection method designed by researchers from the National University of Singapore. Green boxes are marks made by the Faster R-CNN model. Percentages reflect the confidence score of a fracture within the box. (Reproduced from Ref. [47])

There are a growing number of AI applications in the field of advanced imaging. Within the MRI modality, there have been deep learning models created to perform analyses across organ systems including the brain, kidneys, prostate and spine among others [48]. Some of the many applications of deep learning models include brain lesion quantification and segmentation [49], diagnostic improvement for multiple sclerosis [50] and Alzheimer’s disease [51, 52].

Artificial Intelligence has also been used to improve workflow and image interpretation, aiding the radiologist. The US FDA has approved the use of an AI-based tool that automates brain segmentation and volumetric analysis. This tool has been studied and shown to assist in the diagnosis of Alzheimer’s dementia [52].

Artificial intelligence models can also improve the efficiency of imaging, creating safer environments for patients. For instance, a deep learning model has been developed to reduce the amount of gadolinium contrast used in brain MRI ten-fold without significant image quality degradation [53]. This may create opportunities for patients with severe renal impairment, who are often denied gadolinium-based imaging due to the nephrotoxicity of gadolinium and the possibility of an adverse reaction called nephrogenic systemic fibrosis [54].

5.5 Neuro-Radiology

Artificial intelligence can uniquely contribute to intelligent medical assessment and clinical workflow optimization.

The use of AI in imaging can enhance the speed of medical image interpretation and help prioritize images. In the outpatient setting, AI has been shown to significantly reduce the time to diagnosis of intracranial haemorrhage (ICH). Arbabshirani et al. [55] showed that a deep learning model trained on over 37,000 ICH-protocol CT brain studies can lead to earlier diagnosis of ICH. They evaluated the model prospectively for three months and found that the mean time to diagnosis of an ICH on an outpatient CT brain study was significantly reduced from 512 min to 19 min. The model achieves this impact by flagging and prioritizing scans it deems to contain an ICH. This form of AI-based prioritization can have a significant impact on the clinical workflow.

Some early success has also been demonstrated by Viz.ai, a startup whose US FDA-cleared software uses deep learning methods to diagnose large vessel occlusions on CT angiogram imaging. After the diagnosis of a large vessel stroke, the software quickly notifies a stroke response team of its findings. On average, the AI software alerts the on-call physicians within 6 min of CT angiography completion via a built-in ringtone, and physicians can access the images through the mobile application. Experience with this software has been positive: in a small sample of 43 patients, it reduced the time to treatment and the overall hospital length of stay [56]. The software can lead to a 20-min reduction in door-to-puncture times and an improvement in the mean modified Rankin Score [57]. Although it is still early days for this technology, it provides a glimpse of how AI-driven alteration of clinical workflow can improve patient outcomes and be worthy of reimbursement.

5.6 Radiation Therapy

Radiation therapy is a field that lends itself to a constructive partnership with artificial intelligence techniques. Microsoft Research Cambridge have developed methods to automate the segmentation of abnormal, malignant tissues, an integral part of the planning process in radiation therapy. As part of project “InnerEye”, researchers have successfully developed an 11-layer deep CNN called “DeepMedic” [58], for the task of brain tumour segmentation [59]. Automating parts of the planning process can save a significant amount of time for radiation oncologists as it is often a repetitive and arduous task. Segmenting out abnormal tissue that needs irradiation from benign tissue where irradiation needs to be minimized is crucial but time-consuming.

Similar adaptations of AI-based tissue segmentation have been explored. Prostate cancer segmentation has been performed on MRI images [60], creating opportunities to assist radiation therapy planning [61]. This concept has also been investigated for treatments of breast, lung and abdominal cancers [62, 63].

5.7 Cardiology Application

Cardiologists also have a unique combination of imaging tools to aid decision making. These tools include the electrocardiogram (ECG), echocardiography, CT coronary angiography (CTCA) and cardiac MRI.

The standard electrocardiogram (ECG) is an incredibly informative tool and often the sole piece of information from which many important clinical decisions are made. Digitization of ECG data has allowed large scale data collection, opening up possibilities of using AI to help diagnose rhythms. Hannun et al. developed a deep neural network trained on 91,232 single-lead, 30-second ECG strips from 53,549 patients who wore a patch monitoring device [64]. They were able to classify ten different arrhythmias, in addition to sinus rhythm and noise, to a level of accuracy that exceeded that of a group of cardiologists.
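A much-simplified sketch of this kind of model is shown below: a 1D convolutional network that takes a single-lead strip (assuming a 200 Hz sampling rate, so 30 seconds gives 6000 samples) and outputs 12 classes. Hannun et al. used a far deeper residual architecture; this is only a structural illustration.

```python
# A toy 1D CNN for single-lead ECG rhythm classification.
import torch
import torch.nn as nn

class ECGNet(nn.Module):
    def __init__(self, num_classes=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=15, stride=2, padding=7), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=15, stride=2, padding=7), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)           # global average over time
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                             # x: (batch, 1, 6000)
        x = self.pool(self.features(x)).squeeze(-1)   # (batch, 64)
        return self.classifier(x)                     # per-class logits

logits = ECGNet()(torch.randn(2, 1, 6000))            # shape (2, 12)
```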

Echocardiography is a popular tool to assess the function of the heart. In addition to ejection fraction, there are a number of important measurements that have therapeutic implications. The US FDA has approved several AI-powered tools to assist with the estimation of ejection fraction and other measurements (Ultromics EchoGo [65], Caption Health [66]).

CT coronary angiography [67] and CT calcium scoring [68] are rapidly gaining popularity as tests to rule out the presence of severe coronary artery disease and to predict the risk of cardiovascular events, respectively. Technology that uses AI to determine coronary calcium scores has already been approved by the US FDA [69, 70].

Overall, AI has been able to impact all steps of the cardiovascular imaging chain, namely decision support, examination, reconstruction, post-processing, diagnosis and prognostication [71]. Thus far, AI tools have been task-focused and do not span the entire imaging chain, nor has that been the aim. Limitations to the implementation of some of these tools in cardiovascular healthcare relate to regulatory approvals, uncertain added value to the clinician or patient, and a clear scarcity of traditional randomized controlled trials proving efficacy [71].

6 Challenges

Successful integration of AI into daily clinical workflow presents numerous challenges. These challenges span the entire process of dataset management, algorithm development, regulatory approval and implementation. The process from data acquisition to model development and ultimate clinical use is depicted in Fig. 6.

Fig. 6

Steps in AI model development from conception to clinical implementation. (Image reproduced from Ref. [72])

6.1 Dataset Management

Medical data are generated in an environment where information is held in confidence and data sharing is highly restricted. As described above, the performance of AI algorithms is closely tied to their training datasets. Therefore, great care must be taken to de-identify data prior to its use in model building.

Datasets have their limitations, and recognition of these limitations is necessary to minimize data bias. The population from which the dataset used to develop an AI model is drawn must be similar to the population to which the model will be applied. Dataset size also has implications for model performance: large datasets may tolerate some inaccuracies, whereas smaller datasets require data of high quality. With supervised learning, there must also be a widely accepted gold standard, as this becomes the premise for the ground truth [72]. AI models are trained to match the results of the presented ground truth. Therefore, as technology improves and gold standards for diagnosis evolve, models must also be updated and retrained.

6.2 Algorithm Development and Maintenance

In the fields of computer vision and machine learning, most successful models have utilized deep learning techniques. Whilst traditional AI techniques are more interpretable, deep learning models have limited transparency despite achieving good quantitative performance. Because the step-by-step processes that lead the model from input to output are difficult to inspect, these models are often considered to operate within a “black box” [73]. This phenomenon can impact the ability to generalize the model to situations it has not been directly developed against, that is, the adaptation of the model to a new testing environment. For example, a model developed on imaging data from one radiology centre may not achieve the same performance when applied to images acquired at another centre, despite evaluating the same region of interest [74].

In addition to the development of an algorithm, there are challenges in its maintenance. An advantage of AI-based methods is that, as the availability of data grows, the algorithm can be retrained and continually updated. However, this may lead to situations where the prediction for an individual case changes because the training dataset has been modified. This conflict would require reconciliation, most likely by the physician.

Model security against adversarial attacks is another concern that will need to be addressed [73]. Scientists and physicians need to be mindful that deliberate alteration of data inputs can bias a model resulting in suboptimal or erroneous decisions [75]. This is also a consideration for policymakers and regulators alike.

6.3 Regulatory Approval

There are many barriers to the routine uptake of AI by clinicians. Often, clinicians and healthcare providers gain greater confidence in a technology if it is approved by the US FDA. However, the regulation of AI by the US FDA poses challenges not seen with hardware-based medical devices or pharmaceuticals.

To be approved by the FDA, a technology currently needs to obtain one of three broad categories of clearance: 510(k) clearance [76], premarket approval (PMA) [77] or the de novo pathway [78]. The FDA determines which category of clearance is necessary for an AI tool based on three considerations: the risk to patient safety, the existence of a predicate algorithm, and the degree of human input [79].

Risk to patient safety is determined by the duration and size of the impact caused by false positives or false negatives from a particular technology. This can be classified as low (Class I), intermediate (Class II) or high (Class III). For high-risk and certain intermediate-risk scenarios, a PMA, the most stringent process of the three, is required.

Technology that is incremental with an existing predicate technology benefits from the notion that its safety and efficacy must be at least comparable to that of existing technology. Therefore, if the technology can be shown to be at least as safe as another FDA-cleared technology, it may be eligible for a 510(k) clearance [79, 80]. For lower risk novel technology or one with a novel application and no legally marketed counterparts, clearance via the de novo pathway may be sought [80].

The degree of clinician input also affects the regulatory process. A distinction is made between computer-aided detection (CAD) and computer-aided diagnosis (CADx). CAD technology alerts clinicians to relevant findings, whereas CADx technology provides an assessment of the disease by offering a diagnosis or differential list [79]. CAD involves greater clinician input, and therefore a clinical decision support system powered by AI may pose a lesser risk as a CAD than as a CADx.

7 Conclusion

In this chapter, we have discussed the various elements that go into a deep learning AI model of the kind typically used in the fields of computer vision and machine learning. We have shown several representative use cases from the many examples across a range of medical specialties and highlighted the importance of the training dataset in model success and application.

If AI is to be incorporated into routine clinical practice, the collaboration between computer scientists and physicians is essential. Computer scientists require physician expertise to identify problems, provide relevant datasets and determine the appropriateness of the clinical application of the model. Physicians rely on computer scientists for model development, refinement and maintenance. The arrival of AI as an entirely new category of technology in the field of medicine has necessitated special attention from regulatory bodies such as the US FDA.

There is still much to do before AI becomes commonplace in clinical practice, but the response of the scientific community and regulatory bodies has been promising.