Introduction

Taking advantage of the increasingly large amount of labelled digital information available in every area of technological activity, artificial intelligence (AI), and in particular methods such as deep learning, has recently achieved impressive results with ‘big data’ in many different domains [1]. The healthcare sector, rich in complex data and processes, is also being transformed, and radiology will be one of the first specialties to be affected. Despite some opinions voiced in the public space, AI is unlikely to replace human radiology expertise. However, in the long term, AI may augment existing computer-based tools or even partially replace human medical expertise for certain specialised and/or repetitive tasks, such as computing targets in radiotherapy, detecting disease indicators in images or measuring longitudinal disease burden [2].

At the same time, technological solutions for domains rich in labelled data do not necessarily translate easily to the healthcare sector. In order to recognise general object categories in images, popular internet platforms train deep learning algorithms on hundreds of millions of labelled images easily accessible in large image databases [e.g. the ImageNet project, http://image-net.org]. These data sets usually consist of 2D images with far fewer pixels than typical 2D or 3D medical images. Furthermore, such images do not come with regulatory or ethical constraints, deciding whether an image shows (for instance) a cat, a dog or a car does not require expert annotation with strong inter-user agreement, and the cost of failure is low. In contrast, establishing a gold standard for medical diagnosis is often challenging, and the correct answer is sometimes unclear. Medical data is subject to multiple privacy and regulatory considerations and is not easily accessible. These accessibility constraints can make it very difficult to curate high-quality image data sets.

Methods such as deep learning, however, are only the latest in a long history of AI-based approaches to medical image analysis spanning several decades. This effort has produced advanced methods for the automation of many tasks in radiological image processing [e.g. 3]. These tasks include organ localisation; organ segmentation; the detection and segmentation of lesions within the organ; registration for the longitudinal follow-up of disease progress, or for combining image data of different modalities; and finally, the analysis of image patterns within the organ and the tumour region, with the goal of revealing associations with clinical outcome. We review how the computational technology required to achieve these objectives has evolved over the past decades, in order to give context to the latest developments. We then discuss the functioning and limitations of these methods and present a perspective on the relationship between AI and the traditional discovery process in medicine.

Symbolic interpretation of medical images

The first attempts to develop automatic interpretation of medical images were based on human decisional models, performing high-level (i.e. symbolic) image interpretation. In the 1980s, AI approaches to medical imaging relied on so-called expert systems that performed logical inference based on rules derived from a human approach to image processing [e.g. 4]. In practice, this involved image processing operations that would be considered simple by modern standards, for instance binarising/thresholding an image and looking for geometric structures such as lines, circles or trapezoids. A set of logical rules would then deduce image content based on the presence/absence of such structures [e.g. 5].
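
As a purely illustrative sketch of this style of processing (the intensity threshold, the size criterion and the rule itself are invented for this example and are not taken from any cited system), a binarisation step followed by a hand-written logical rule might look as follows in Python:

import numpy as np

def interpret(image, intensity_threshold=128, min_area=50):
    """Toy 'expert system' rule: binarise the image and flag any
    sufficiently large bright structure as a region of interest."""
    binary = image > intensity_threshold        # simple thresholding
    area = int(binary.sum())                    # crude size descriptor
    # Symbolic rule: IF a large bright structure is present THEN report it
    if area >= min_area:
        return "large bright structure present"
    return "no structure of interest detected"

# Hypothetical usage on a synthetic 2D 'image'
img = np.zeros((64, 64))
img[20:30, 20:30] = 200                         # bright square acting as a 'structure'
print(interpret(img))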

The main strength of such methods is their ability to provide solutions that are fully human-interpretable and therefore readily acceptable to physicians. In addition, these models generalise well, because decisions are taken on the basis of human medical knowledge. On the negative side, merely encoding human decisional processes did not prove to add value to medical practice, especially given the limited computational resources of that era. While expert systems were successful in some other domains, such as business analysis or manufacturing, they did not produce a significant impact in medicine. Table 1 provides a summary of this type of approach as well as of the more recent classes of methods discussed next.

Table 1 Summary of the different types of approaches to AI in medical imaging

Probabilistic interpretation of medical images

The second generation of algorithms was also inspired by the human decisional chain, but instead of adopting a strongly symbolic interpretation of observations, it moved towards a statistical/probabilistic one. This category of methods is also based on a model derived from human expertise; however, the model’s parameters are computed from a labelled reference data set using probabilistic methods that determine the most likely solutions [e.g. 6–9]. In this manner, a statistical model—sometimes called an atlas—is created to represent the inherent signal variability in the reference population (Fig. 1).

Fig. 1

An atlas encoding the variability of the appearance of healthy livers in a reference population can be created via deformable registration of individual liver images into a common target space. In this example, the registration was obtained with the method of [28]

Image/organ segmentation—one of the pillars of medical imaging—is a classic example to demonstrate this concept. A structure of interest in a new image can be described by matching its appearance to the population atlas. If the atlas is constructed from a healthy population of images, then lesions in the new image can be modelled as abnormal outliers relative to the expected appearance of healthy tissues (Fig. 2). This approach is commonly used in neuroimaging for modelling brain anatomical structure or for lesions arising from traumatic injuries, cancer, multiple sclerosis or other diseases [e.g. 10–12].

Fig. 2

When an individual image containing a lesion is compared to a healthy population atlas, the lesion will appear as an abnormal outlier region. In this manner, lesions can be detected and mapped without the need for explicit segmentation

Such approaches come with notable strengths but also important limitations. On the positive side, these methods aggregate information across populations and experts, couple it with a human-understandable statistical model, and therefore offer good acceptability and good generalisation properties. However, the choice of the statistical model and its ability to cope with the observed anatomical variability play a significant role in the overall performance. Furthermore, in order to build an appropriate model and then successfully apply it, all images need to be transformed to the same reference space via deformable registration, which is a highly challenging problem in its own right [13].
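
As a minimal sketch of the outlier-detection idea illustrated in Fig. 2, assuming the new image has already been deformably registered to the atlas space (the array sizes, z-score threshold and simulated intensities are purely illustrative):

import numpy as np

def lesion_outlier_map(image, atlas_mean, atlas_std, z_threshold=3.0):
    """Flag voxels whose intensity deviates strongly from the voxel-wise
    mean/standard deviation of a healthy-population atlas."""
    z = (image - atlas_mean) / (atlas_std + 1e-8)   # voxel-wise z-scores
    return np.abs(z) > z_threshold                  # boolean outlier mask

# Hypothetical atlas built from 20 registered healthy images
healthy_stack = np.random.normal(100, 5, size=(20, 32, 32))
atlas_mean, atlas_std = healthy_stack.mean(axis=0), healthy_stack.std(axis=0)

new_image = np.random.normal(100, 5, size=(32, 32))
new_image[10:14, 10:14] = 160                       # simulated 'lesion'
mask = lesion_outlier_map(new_image, atlas_mean, atlas_std)
print(mask.sum(), "outlier voxels flagged")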

Data-driven/model-free approaches to automated knowledge discovery

A limitation of methods that rely on explicit expert knowledge is that transforming human-level expertise into suitable computational models can be challenging, especially when human knowledge is incomplete. Therefore, there has recently been a surge of interest in a different kind of approach—the ‘discovery science’ approach, which emphasises the mining of large amounts of data to discover new patterns and help formulate new hypotheses [e.g. 14]. As such, these approaches are model-free and purely data-driven. A currently popular application of these methods is known as ‘texture analysis’ or ‘radiomics’, where features computed over a segmented tumour region are used to classify different tumour types [15].

In the context of radiology, methods in this category learn a characteristic representation of the appearance of organs and tissues from a set of training images. This approach is referred to as supervised learning when the training images have previously been (manually) labelled. If there is no such labelling, methods exist to automatically detect groupings in the data with what is known as unsupervised learning. In both cases, learning is based on a large set of basic image descriptors (features) that are automatically extracted from the image. The goal is to use these features to discover a separation between classes in high-dimensional feature spaces; see Fig. 3 for a simple two-dimensional example. Different statistical and machine learning techniques exist to do this, for instance logistic regression, support vector machines [16], decision trees [17] and many others.

Fig. 3

Example of two-dimensional decision boundaries between two groups of data points, with increasing complexity from a to b to c; a is likely to be too simple, whereas c is likely to be overfitted
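
As a minimal, hypothetical sketch of fitting such decision boundaries with two of the classifiers mentioned above (scikit-learn is used here; the feature matrix and labels are synthetic stand-ins for features that would in practice be extracted from segmented image regions):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 cases x 5 image features (synthetic)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic binary labels

clf = LogisticRegression().fit(X, y)           # linear decision boundary
svm = SVC(kernel="rbf").fit(X, y)              # more flexible, nonlinear boundary

print("logistic regression accuracy:", clf.score(X, y))
print("SVM (RBF kernel) accuracy:", svm.score(X, y))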

However, when the number of features computed is much larger than the number of samples in a typical data set, the risk of overfitting increases. Overfitting occurs when the system learns idiosyncratic properties of the training data, i.e. properties that are too specific to the training data and do not generalise well. This prevents the system from fitting new data or predicting future observations reliably. To reduce the risk of overfitting, one option is to perform feature selection, i.e. to reduce the number of features under consideration by keeping only the most relevant features and discarding the rest [e.g. 18, 19]. One popular feature selection method is the least absolute shrinkage and selection operator (LASSO) [20], which alters standard regression methods by selecting only a subset of the available covariates, as opposed to using all of them. In doing so, it removes predictor variables that may contribute to overfitting. As another example of feature selection, in experimental paradigms that involve human interaction, for instance the manual segmentation of lesions, it is possible to assess the reproducibility of features across different human operators and then discard those features that are not reproducible.
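
A minimal sketch of LASSO-based feature selection with scikit-learn (the synthetic feature matrix and the regularisation strength alpha are illustrative; in practice alpha would typically be tuned by cross-validation):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))                 # 60 cases but 200 candidate features
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=60)

lasso = Lasso(alpha=0.1).fit(X, y)             # L1 penalty drives most coefficients to zero
selected = np.flatnonzero(lasso.coef_)         # indices of the retained features
print("features kept:", selected)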

As an example of potential overfitting, consider a small training set of images of different tumour grades: given a sufficiently large number of features, one could in principle always find a particular model that perfectly classifies the training images into the different tumour grades. However, such a model is unlikely to have practical value unless it is validated on an independent data set. In fact, it is customary in machine learning and statistical approaches to classification to work with three separate data sets. The first is the training set, used to optimise model parameters. The second, called the validation set, is then used to assess the quality of the training and the potential presence of overfitting. Finally, the third, called the test set, is used to provide a final, unbiased report of the method’s performance.
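
A minimal sketch of such a three-way split, obtained here with two successive calls to scikit-learn's train_test_split (the proportions and the synthetic data are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # synthetic feature matrix
y = (X[:, 0] > 0).astype(int)                  # synthetic labels

# 60% training, 20% validation, 20% test
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

# Fit on the training set, monitor overfitting on the validation set,
# and report final performance once on the untouched test set.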

The advantage of data-driven approaches is that they do not require expert modelling of the specific problem at hand. Instead, the underlying assumption is that, given a large enough training set, the meaningful information and variability in the data can be represented as a function of a large set of basic image features. On the negative side, these methods are often myopic (decisions are taken at a local scale where global context is ignored) and not modular (for each task, user-driven adjustment of the feature space might be required), and their generalisation needs to be verified with independent validation sets. In addition, methods such as radiomics or texture analysis suffer from a lack of standardisation, since many of the features they work with depend on specific image acquisition parameters that can vary according to scanner hardware and software.

Deep learning/sub-symbolic artificial intelligence

Typical data-driven approaches in radiology work with features designed to reflect properties of the data seen as important from a human radiologist’s point of view, such as density, heterogeneity of tumours, tumour shape, etc. However, a class of machine learning methods has recently gained popularity because it can automatically discover the best features for a given task, without requiring human ‘feature engineering’. This class of methods is known as deep learning and can often exceed human performance for specific tasks [1, 21].

Architecture

Deep learning systems are built upon an architecture of artificial neural networks (ANNs). Each ANN consists of an array of processing units, known as artificial ‘neurons’ or nodes. Each node takes as input a set of feature values, each multiplied by a corresponding weight. The node sums the weighted evidence it receives as input and passes the result through a nonlinear activation function, which determines the neuron’s output (Fig. 4). Essentially, each artificial neuron makes a decision based on weighted evidence. This design is intended to mimic, in a simplified manner, the behaviour of a biological neuron, which integrates multiple inputs from synapses on its dendrites and sends an output signal down its axon, as input to another neuron.

Fig. 4

a Schematic representation of a simple biological neural network. b Analogous artificial neural network. Adapted from [21]
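
A minimal sketch of the weighted-evidence computation performed by a single node (the input values, weights, bias and the choice of a sigmoid activation are illustrative):

import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of its inputs
    passed through a nonlinear (sigmoid) activation."""
    z = np.dot(weights, inputs) + bias         # weighted evidence
    return 1.0 / (1.0 + np.exp(-z))            # output between 0 and 1

print(neuron(np.array([0.2, 0.7, 0.1]), np.array([1.5, -2.0, 0.5]), bias=0.1))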

Based on this paradigm, hierarchical networks of neurons can be used to encode very complex nonlinear functions. Deep learning systems implement this concept with multilayer hierarchies of ANNs. The first layer in the hierarchy contains nodes that are directly sensitive to the input data, such as individual pixel values in an image. The last layer represents output values such as classification results. The intermediate layers compute intermediate feature representations that become more abstract as the information travels towards deeper layers. Ultimately, the deep learning network is designed to convert raw activation signals from the input layer to target solution values determined from the activation pattern in the last layer.
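
A minimal sketch of such a layer hierarchy as a forward pass in plain NumPy (the layer sizes, random weights and ReLU activation are illustrative; in practice this machinery is provided by deep learning frameworks):

import numpy as np

rng = np.random.default_rng(0)
sizes = [64, 32, 16, 2]                        # input 'pixels' -> hidden layers -> 2 output classes
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    """Propagate an input vector through the layer hierarchy."""
    for W, b in zip(weights, biases):
        x = np.maximum(0.0, W @ x + b)         # ReLU activation at each layer
    return x                                   # activation pattern of the last layer

x = rng.normal(size=64)                        # e.g. a flattened 8x8 image patch
print(forward(x))

In a trained network, these weights would be set by the training process described next rather than drawn at random.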

Training and learning

Deep learning networks in use today consist of thousands of nodes and millions of connections, resulting in millions of network parameters. For the network to be useful, all these parameters need to be adjusted through a training process. Each element of the training set (e.g. each image) is given as input to the network, and the network’s response based on its current configuration is recorded. The network’s current output is then quantitatively compared with the desired (true) output through an error function, termed a ‘loss function’. Through a process known as back-propagation [1, 22], each parameter of the network is adjusted by a small increment in the direction that reduces the loss function. This process is repeated iteratively, until the network reaches a state that minimises the loss function (Fig. 5).

Fig. 5

Schematic representation of an artificial neural network with backpropagation of error signals during the training process. At a very abstract level, backpropagation is analogous to human training where experiences improve a doctor’s skills and knowledge
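
A minimal sketch of this iterative training loop, written here with PyTorch (the architecture, learning rate, number of iterations and synthetic labelled data are all illustrative):

import torch
from torch import nn

# Synthetic stand-in for labelled training data: 256 cases, 64 input features each
X = torch.randn(256, 64)
y = (X[:, 0] > 0).long()                       # synthetic binary labels

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(50):                         # repeat until the loss stops decreasing
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)                # compare predictions with the true labels
    loss.backward()                            # back-propagate error signals
    optimizer.step()                           # nudge each parameter to reduce the loss

print("final training loss:", float(loss))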

This training process can be very computationally intense for networks with a large number of layers and nodes, and this is what has kept artificial neural networks in relative obscurity for many decades. Research on neural networks in AI dates back to the 1950s [23]. However, the modern resurgence in neural networks owes much to two recent developments that make training such networks practical and tractable: (1) the arrival of inexpensive and powerful computing hardware and memory, and (2) the availability of large quantities of labelled digital data that can be used for the training process. These two factors are crucial in the modern success of deep learning algorithms.

Limitations of deep learning

Despite its success and popularity in a wide variety of tasks, deep learning comes with several limitations.

Inefficient use of data and processing power

To learn a particular task, a deep learning algorithm requires several orders of magnitude more data points than other types of learning algorithms, let alone a human. When such amounts of data are available, deep learning can outperform any other method. However, more traditional methods tend to perform significantly better with smaller amounts of data, as illustrated schematically in Fig. 6. The deep learning community has recognised this problem and is now turning towards developing the next generation of algorithms, which will be mathematically more complex but make more efficient use of data; see, for instance, [24] for a layman’s introduction to this topic.

Fig. 6

Schematic representation of the impact of data availability on the performance of various machine learning methods. Traditional machine learning methods, using human-crafted features, perform best when only a small amount of data is available. Increasingly deeper neural networks (i.e. with a larger number of layers) perform increasingly better as the amount of available data is increased. Adapted from Easy Solutions Inc. [http://blog.easysol.net/building-ai-applications/]

Limited generalisation and model fragility

A given deep neural network can typically be applied only to a single, narrowly defined task, and its performance is tied to a particular data set and label set. In radiological applications, if deep learning is to be used to classify tumour images, a separate network would need to be trained, with separate labelled data, for every type of organ, tumour, modality and orientation. In adversarial situations, even minor changes to the input data, often invisible to the human eye, can result in dramatically different classifications [e.g. 25]. This is why, even if deep learning were successfully integrated into radiology, human verification would most likely still be required.

A medical approach to AI research in radiology

The current AI trends centred around deep learning have created pressure to shift from hypothesis-driven to data-driven research. The data-driven approach can be powerful and lead to novel insights; however, it cannot replace the cognitive integration of complex information that combines the semeiological analysis of image features with a precise anatomical and physiopathological context. These methodological approaches have been refined over centuries of scientific and medical thought, and they allow for solutions that are still inaccessible to AI.

AI brings new opportunities, but the fundamental principles of clinical reality do not change. In order to have a real impact on patient care, AI-based research in medical imaging must adhere to first principles of medical science. Research hypotheses, whether AI-based or not, must be clinically relevant and answerable in the clinical domain. In Table 2 we adapt the classical PICO approach [26] to the development of focused clinical questions for AI-based research in medical imaging. PICO is an acronym for patient/problem, intervention, comparison intervention and outcomes.

Table 2 The PICO approach adapted to developing focused clinical questions for AI-based research in medical imaging

Because of a certain aura of invincibility that AI has in popular perception, a common mistake is to expect AI to find analysable information in medical images even where none exists or where there is no biological plausibility. Akin to the perennial concerns in medical research about data dredging and p-hacking [e.g. 27], improperly designed AI experiments can misinform or mislead and, without critical analysis, could result in patient harm.

In the current literature, AI-based methods are usually evaluated by comparison with the performance of individual expert radiologists or of a radiologist consensus. However, the influence of clinical history and context, differences in population incidence, and the inherent subjectivity of interpretations can make such experimental paradigms unreliable. Rather, training labels should, where possible, relate to meaningful patient outcomes and to gold standards that exist beyond radiology. At the same time, it is important to compare human and AI performance in terms of efficacy, reproducibility, turnaround time and cost. In that sense, AI should be compared not only to the best experts in the field but also to general practitioners, and possibly also to healthcare systems with limited human resources, for instance those in developing countries. In certain high-need settings, AI may bring additional value, in particular as a screening tool.

Conclusion

Novel computational tools derived from AI are likely to transform radiological practice. The greatest challenge at this time—and also the most exciting opportunity—lies in determining which particular clinical tasks in radiology are the most and the least likely to benefit from AI algorithms, given the power but also the limitations of such algorithms. At the same time, regardless of the technology used, it is important to keep in mind that proper clinical hypotheses remain paramount, and appropriate validation of AI-based methods against actual clinical outcomes of patient well-being remains the most important measure of success.