
1 Introduction and Motivation

Fig. 1. Graphical abstract (brain images from [52])

In this paper, our approach is to discriminate between: general challenges to AI/ML image analysis (irrespective of the specific approach); challenges specific to Black Box methods; challenges that Explainable AI alone can help to overcome; and challenges that Black Box methods together with Explainable AI can help to overcome. In addition to outlining the challenges and our hypotheses, we also include an extensive review to assess the use of AI- and ML-enhanced magnetic resonance imaging (MRI) for the evaluation of neurodegenerative diseases in comparison to the use of histological information (Fig. 1).

2 Glossary

The following abbreviations and terms are used in this manuscript:

  • AD Alzheimer’s Disease

  • AI Artificial Intelligence

  • CAG Cytosine-adenine-guanine

  • CV Cross-Validation

  • DL Deep Learning

  • Explainable AI Explaining the behavior of otherwise black-box models

  • fMRI functional Magnetic Resonance Imaging

  • HD Huntington’s Disease

  • LDA Linear Discriminant Analysis

  • ML Machine Learning

  • MRA Magnetic Resonance Angiography

  • MRI Magnetic Resonance Imaging

  • Neurodegenerative diseases Diseases caused by the death of neurons and/or other cells in the brain

  • NMR Nuclear Magnetic Resonance

  • PCA Principal Component Analysis

  • PD Parkinson’s disease

  • RF Random Forests

  • RFDA Regularized Fisher Discriminant Analysis

  • SC-CNN Spatially Constrained Convolutional Neural Network

  • SPECT Single Photon Emission Computed Tomography

  • USG Ultrasonography

  • Virtual Autopsy The use of imaging to diagnose cause of death

3 State-of-the-Art

3.1 Role of AI/ML

We define AI as the whole field working on understanding intelligence towards context-adaptive systems; the backbone, however, is ML as a methodological subset of AI, whilst e.g. Deep Learning (DL) is just one specific methodological subset of ML [24]. In the following we only briefly point to some of the manifold possibilities which AI/ML offer for the topic of this paper. The most important methods for the classification of brain states include Linear Discriminant Analysis (LDA) and Regularized Fisher Discriminant Analysis (RFDA) to obtain a classifier, Independent Component Analysis (ICA) and Principal Component Analysis (PCA) for dimensionality reduction of the input data, and several cross-validation (CV) schemes for model evaluation and selection, keeping potential pitfalls in mind [35, 52].
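As a minimal, hypothetical illustration of such a pipeline (not taken from the cited studies), the following Python sketch combines PCA-based dimensionality reduction, an LDA classifier and a cross-validation scheme on synthetic placeholder data; all dimensions and parameters are assumptions for illustration only.

```python
# Minimal sketch: PCA + LDA with cross-validation on synthetic data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5000))        # 80 subjects, 5000 voxel/ROI features (placeholder)
y = rng.integers(0, 2, size=80)        # binary label, e.g. patient vs. control

pipe = Pipeline([
    ("pca", PCA(n_components=20)),             # dimensionality reduction
    ("lda", LinearDiscriminantAnalysis()),     # linear classifier
])

# Cross-validating the *whole* pipeline avoids a common pitfall:
# fitting PCA on the full dataset before splitting leaks information.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)
print("mean CV accuracy:", scores.mean())
```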

The human brain shows high individual variability and yields a substantial amount of information that needs to be analysed. Counting neurons manually, by human eye observations alone, is an extremely time-consuming process. Non-homogeneous brain shrinkage is one major challenge where imaging and AI/ML, in combination, may be effective. By exploiting the huge amount of information available from histology, several ML methods can help to validate in vivo MRI findings and to characterize MRI limitations, thereby improving and enhancing the capabilities of MRI through the identification of additional alterations. Moreover, problems in brain informatics include the detection and interpretation of volume changes in neurodegenerative diseases and more global aspects such as brain volume, gyrification, cortical thickness, hippocampal shape changes and allometric studies correlating, for example, cortical volume with hippocampal and subcortical nuclear volume in patients with schizophrenia.

As a first step, by using different MRI contrasts and sequences, one can obtain more information, not only on volume quantification, but also on iron deposition, myelin quantification and brain parcellation based on myeloarchitectonics [39]. Architectonics is considered to reflect the functional properties of the brain and its surface. The manifold problems with such approaches include the high dimensionality of the data, which makes standard AI approaches awkward, if not impossible, to use; it is therefore necessary to harness the full potential of current AI/ML. As a next step, one can attempt to quantify the architectonics and the size of defined fields, and correlate them with normative data for different diseases and normal ageing. Due to the high individual variability of the brain, these studies will require extensive data. For this step, accessing and analysing such large databases with simple methods will most likely lead to a null finding. We hypothesize that it is a matter for AI to find an invariably recurring parameter that has escaped human attention (e.g. due to noisy data).

However, one must take care when discussing AI in the context of the initial/early stages of neurodegenerative or neuropsychiatric diseases. When there is no access to the tissue, it is necessary to perform both MRI and volumetry. It is also important to consider that changes other than the visible ones (e.g. volume) may be causing symptoms in the early and later stages of schizophrenia or depression. In addition, from a clinical and neuropsychiatric view, schizophrenia itself is very challenging and complicated. The situation is comparable to assessing proficiency based on brain size: many would assume that bigger brains are more proficient than smaller ones, yet the contrary can be true, since small brains can be found in highly intelligent people and vice versa.

Generally, next to imaging there are genetic biomarkers for disease classification, but these are applicable only in specific cases. Established biomarkers for Alzheimer’s disease (AD), Parkinson’s disease (PD) and Huntington’s disease (HD) are based on visual analysis, including MRI. Genetic variants of apolipoprotein E, tau, amyloid precursor protein and presenilin 1 and 2 are significant for AD; however, they are not imperative for the development of the disease and are therefore not reliably applicable for diagnosis. So far, AD is diagnosed clinically from a patient’s detailed history and mental state, and finally confirmed pathologically by brain autopsy [10, 70].

The use of imaging to diagnose the cause of death (virtual autopsy) has already proven its value, although it has limitations, especially regarding microscopic changes [52]. The possibility of having postmortem MRI in a high number of cases, with histology included, is an invaluable combination to validate the usefulness of MRI in this setting. This applies not only to cases but also to healthy controls, and it provides more understanding of populational variability. Imaging yields an immense amount of data, and these data need to be interpreted. There is a region of transition between pure statistical analysis of data and interpretation by AI. One strategy is the investigation of brains from persons with well-characterized disabilities. A good example is Alzheimer’s disease, where the visual inspection of neuroimagery is susceptible to the limitations of human vision; here, AI methods have been shown to be equally or even more effective than human clinicians in diagnosing dementia from neuroimages [16].

Pathognomonic indicators include cerebral atrophy as well as neurofibrillary tangle and amyloid plaque pathology. The likelihood and/or course of HD can be genetically tested by determining the number of cytosine-adenine-guanine (CAG) repeats [12] of huntingtin, the gene product of the affected gene on chromosome 4. The neuropathological interaction between early striatal and cortical atrophy has proved puzzling [17], and prodromal cerebral cortical changes are difficult to detect with imaging [48]. An exemplary novel approach to disease characterization has recently been described for AD; the technique is based on imaging of brain structural connectivity atrophy in combination with a multiplex network for generating a classification score [1]. Automated differentiation of PD has also been described based on various ML-derived classification algorithms using quantitative MRI data [19].
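As a purely conceptual illustration of the CAG-repeat criterion (actual genotyping relies on fragment analysis or sequencing, not on code like this), the following hypothetical Python snippet counts the longest uninterrupted run of CAG triplets in a DNA string; the function name and example sequence are invented for illustration.

```python
# Toy sketch only: longest uninterrupted run of CAG triplets in a sequence.
def longest_cag_run(seq: str) -> int:
    best = run = 0
    i = 0
    while i + 3 <= len(seq):
        if seq[i:i + 3] == "CAG":
            run += 1
            best = max(best, run)
            i += 3          # advance by one full triplet
        else:
            run = 0
            i += 1          # re-align the reading frame and keep scanning
    return best

print(longest_cag_run("ATGCAGCAGCAGCAGTTT"))  # -> 4
```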

3.2 AI/ML Methods

A myriad of different AI and ML methods exist, so we only scratch the surface here and focus our description on Deep Learning (DL), which we recommend for these particular studies. We refer the reader to [21] for an overview of ML in general, and to [6, 14, 20] for more specific details. AI-aided diagnosis has been used as a supporting tool for physicians for a long time [5, 63]. Due to increasing computational power and available storage capacities, many methods that proved too computationally demanding in the past can now be used effectively in daily routine. Standard examples of state-of-the-art methods today include not only DL [2], but also Support Vector Machines (SVM) [11] and Random Forests (RF) [7].

Among these methods, DL is rapidly proving to be the state-of-the-art foundation, leading to improved accuracy. It has also opened up new frontiers in biomedical data analysis generally, and in clinical medicine specifically, with astonishing rates of progress [15]. For AI-aided diagnosis, quantitative changes could be evaluated in combination with functional neuroimaging and the interpretation of big data in a longitudinal setting.

Recent work has been performed on the automatic detection and classification of cell nuclei in histopathological images of cancerous tissue [61]. The authors applied DL [33] and produced encouraging results by applying a so-called Spatially Constrained Convolutional Neural Network (SC-CNN) [53] to perform nucleus detection. SC-CNN regresses the likelihood of a pixel being the centre of a nucleus, where high probability values are spatially constrained to lie in the vicinity of nucleus centres.
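The following PyTorch sketch illustrates only the general idea of per-pixel centre-likelihood regression on a toy patch; it is a hypothetical model, not the SC-CNN of [53], whose spatial constraint layer is omitted for brevity, and all shapes and values are placeholders.

```python
# Hedged sketch: a tiny fully convolutional net regressing a per-pixel
# nucleus-centre likelihood map (NOT the SC-CNN of [53]).
import torch
import torch.nn as nn

class CentreMapNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),      # per-pixel logit
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))         # per-pixel centre likelihood

model = CentreMapNet()
image = torch.rand(1, 3, 64, 64)                  # dummy histology patch
target = torch.zeros(1, 1, 64, 64)                # in practice: blobs around annotated centres
target[0, 0, 30:34, 30:34] = 1.0                  # toy "centre" region

loss = nn.functional.binary_cross_entropy(model(image), target)
loss.backward()                                   # one illustrative training step (optimizer omitted)
print(loss.item())
```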

However, current approaches in ML and neuroimaging do not facilitate essential mechanistic investigations, such as validation by way of histology. Beyond demonstrating the ability to detect patterns of brain alterations, ML would also benefit from improved knowledge about algorithm choices and about the characteristics of its precision and power in relation to specific disease mechanisms.

3.3 Multimodal Deep Learning in Medical Imaging

In the machine learning context, algorithms dealing with data from multiple heterogeneous sources are referred to as “multimodal” or “multi-view” learning algorithms [47]. The advantages of using multimodal deep learning in the biomedical context are: (i) it requires little or no pre-processing of input data, because both features and fused representations are learned from the data; (ii) it performs implicit dimensionality reduction within the architecture, a desirable property for feature-rich biomedical datasets; and (iii) it supports early, late, or intermediate fusion [69]. However, such models usually require powerful graphics processing units (GPUs) to keep training times reasonable.
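As a hedged illustration of point (iii), the PyTorch sketch below shows intermediate fusion of two hypothetical feature modalities (e.g. structural and functional MRI descriptors): each modality is encoded separately and the learned representations are concatenated before a joint prediction head. All class names, dimensions and data are assumptions for illustration.

```python
# Hedged sketch: intermediate fusion of two modalities by concatenating
# learned modality-specific representations.
import torch
import torch.nn as nn

class IntermediateFusionNet(nn.Module):
    def __init__(self, dim_a=256, dim_b=128, hidden=64, n_classes=2):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())   # encoder for modality A
        self.enc_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())   # encoder for modality B
        self.head = nn.Linear(2 * hidden, n_classes)                      # classifier on the fused code

    def forward(self, xa, xb):
        za = self.enc_a(xa)                    # modality-specific representation
        zb = self.enc_b(xb)
        z = torch.cat([za, zb], dim=-1)        # intermediate fusion by concatenation
        return self.head(z)

model = IntermediateFusionNet()
xa = torch.rand(8, 256)                        # batch of 8 subjects, modality A features
xb = torch.rand(8, 128)                        # matching modality B features
print(model(xa, xb).shape)                     # torch.Size([8, 2])
```

Early fusion would instead concatenate the raw inputs before any encoder, and late fusion would combine the outputs of two fully separate models.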

Multimodal deep learning can be used to solve complex machine learning problems in areas with high-dimensional unstructured data, such as computer vision, speech and natural language processing. The main advantage of deep learning is that it automatically learns hierarchical representations for each modality instead of relying on manually designed, modality-specific features that are then fed into a machine learning algorithm. In medical image analysis, the medical expert can draw on information from multiple imaging modalities, e.g. computed tomography (CT), magnetic resonance imaging (MRI) and ultrasound, for diagnosis and treatment. Multimodal deep learning is therefore well suited to medical application issues such as tissue and tumour segmentation, multimodal medical image retrieval and computer-aided diagnosis. There are, however, two significant challenges faced by the medical applications community when using multimodal deep learning, namely the difficulty in obtaining sufficient labelled data, and class imbalance [57].

Multimodal deep learning is widely used for brain imaging studies, where collecting magnetic resonance imaging (MRI) data of multiple modalities from the same individual is common. Multimodal brain imaging studies can provide a more comprehensive understanding of the brain and its disorders. For instance, they can inform us about how brain structure shapes brain function, in which way both are impacted by psychopathology, and which structural aspects of physiology could drive human behaviour and cognition [9].

Multimodal medical image fusion techniques are among the most significant methods to identify and investigate disease, as they provide complementary information from different modalities. Multimodal medical images can be categorized into several types, including computed tomography (CT), magnetic resonance angiography (MRA), magnetic resonance imaging (MRI), positron emission tomography (PET), ultrasonography (USG), nuclear magnetic resonance (NMR) spectroscopy, single photon emission computed tomography (SPECT), X-rays, and visible, infrared and ultraviolet imaging. Structural therapeutic images (MRI, CT, USG and MRA) provide high-resolution images, whereas functional therapeutic images (PET, SPECT and functional MRI, fMRI) provide low-spatial-resolution images carrying useful functional information.

Multimodal medical image fusion increases the effectiveness of image-guided disease analysis, diagnosis and assessment of medical problems. Image fusion has several applications, such as medical imaging, biometrics, automatic change detection, machine vision, navigation aids, military applications, remote sensing, digital imaging, aerial and satellite imaging, robot vision, multi-focus imaging, microscopic imaging, digital photography and concealed weapon detection [55]. Due to their versatility, multimodal algorithms can be used in wider biomedical applications involving genomics, proteomics, metabolomics and other types of omics data. Interestingly, they have been successfully used when the features originate from different domains, and some of them are generated by mechanistic models [67]. In this case, preprocessing of the features, coupled with late or intermediate fusion, should be preferred to early fusion. This approach, based on computational systems biology and machine learning, could provide key mechanistic insights into neurological disorders [60].

Medical image segmentation is a challenging task in medical image analysis. Multimodal deep learning has been used in medical imaging especially to provide multiple sources of information about the target (tumour, organ or tissue), and segmentation using multimodal data has been implemented as a fusion of this information to improve segmentation results [36]. Deep learning provides state-of-the-art performance in image classification, segmentation, object detection and tracking tasks. Recently, deep learning has gained interest in multimodal image segmentation because of its self-learning and generalization ability over large amounts of data [71].

3.4 Databases for AI/ML

Reproducibility, validation and prediction benefit from existing imaging data and related information. There already exist some databases that provide open data for modeling, testing and inference [28, 41, 54]. On top of these web resources, some information is provided about AI or ML approaches. However, both the reliability of patient data and ML performance depend largely on the studied disease and its features. Examples of data from animal models include the Cambridge MRI database, which provides open phenotypic data for animal models of HD [59], and the Mouse Tumor Biology Database, which provides different kinds of information on tumors in mice [8]. Moreover, the project BRAINS provides anonymised images and related clinical information from healthy subjects across the human life span via a data request and access agreement [29]. The Open Access Series of Imaging Studies provides MRI data sets of subjects clinically diagnosed with Alzheimer’s disease [37]. Functional MRI data from subjects with Huntington’s disease can be found within the Track-HD study [64]. The Parkinson’s Disease Biomarkers Program provides researchers with access to brain scans and related information [49].

Classification of disease subtypes, subjects, brain regions and gradings is often based on ML approaches that automatically segment brain MRI data [23]. Making use of such databases, ML not only helps in (semi-)automatically segmenting images, but is also a tool for addressing several research questions, for example predicting tumor growth [27] or investigating minimal tumor burden and therapy resistance in cancer patients [50, 51]. Some case reports also show AI outperforming human domain experts [18, 31, 56]. Recent advances already attempt to bridge imaging and genetic studies. Imaging genetics studies combine investigations of genotype and imaging phenotype to better understand brain structure and function and the further causes and effects of a specific disease; they improve our understanding of pathways that are related to the cause or effect of cerebral disorders [30]. Genome-wide association studies suggest genetic relationships for structural as well as functional measures among family members [65].

3.5 AI-Aided Disease Classification Using Chemical Imaging

Further use of chemical information, such as metabolic biomarkers, on an imaging basis could be included, for example, in therapy monitoring at the cellular level [46]. AI-aided diagnosis for clinical purposes and computational models for prediction could involve quantitative changes next to functional imaging built upon brain MRI data. In this regard, molecular fMRI techniques exhibit specificity for neural pathways or signaling components at the cellular level [4]. fMRI has been used for time-resolved volumetric measurements of dopamine release [34]. Chemical exchange saturation transfer allows for signal amplification of, for example, deoxyglucose and its phosphorylated metabolite in order to image glucose uptake [45]; this MR imaging technique has been used to image glucose uptake in head and neck cancer [66]. Proton MRS is commonly used for studies of brain metabolites, including N-acetylaspartate as a marker of neuroaxonal integrity, choline for membrane turnover, (phospho)creatine for energy metabolism and myo-inositol for astroglial activation [44].

MR techniques for imaging brain metabolism can assist in the study of brain disorders, aided by novel MRI contrasts for visualizing neuronal firing across brain regions, pH imaging of glioma, and both glutamatergic neurotransmission and cell-specific energetics [26]. Oxidative metabolism plays a fundamental role in many diseases, which supports the demand for non-invasive methods suitable for routine analyses [42].

3.6 Explainable AI

There are several methods which are relevant for further studies and for testing whether and to what extent they can contribute to the aforementioned use cases. Six of the most relevant are: BETA, LRP and LIME, as well as GAMs, Bayesian Rule Lists and Hybrid Models, particularly with a human-in-the-loop. BETA (Black Box Explanations through Transparent Approximations) is a model-agnostic framework for explaining the behavior of black-box classifiers by simultaneously optimizing for fidelity to the original model and interpretability of the explanation [32]. There is also a more general solution to the problem of understanding classification decisions by pixel-wise decomposition of nonlinear classifiers, which allows visualizing the contributions of single pixels to predictions for kernel-based classifiers over bag-of-words features and for multilayered neural networks [3].

LRP (Layer-Wise Relevance Propagation) is another general solution for understanding classification decisions via pixel-wise decomposition of nonlinear classifiers, which allows running the “thought processes” backwards [3, 43]. This makes it possible to retrace which input had which influence on the respective result. In individual cases, this lets us understand how a deep learning method has come to a certain medical diagnosis or risk assessment. LIME (Local Interpretable Model-Agnostic Explanations), developed by Ribeiro, Singh and Guestrin [58], is a model-agnostic system, where \(x \in \mathbb {R}^{d}\) is the original representation of an instance being explained, and \(x' \in \mathbb {R}^{d'}\) denotes a vector for its interpretable representation (e.g. \(x\) may be a feature vector containing word embeddings, with \(x'\) being the corresponding bag of words). The goal is to identify an interpretable model over the interpretable representation that is locally faithful to the classifier, i.e.

$$\begin{aligned} g: \mathbb {R}^{d'} \rightarrow \mathbb {R}, g \in G, \end{aligned}$$

where G is a class of potentially interpretable models, such as linear models, decision trees, or rule lists; given a model \(g \in G\), it can be visualized as an explanation to the human expert in \(\mathbb {R}^{2}\). LIME works on each instance separately: the instance is perturbed, and a measure of similarity between the perturbed samples and the original instance is calculated. The complex model then provides predictions for each of these perturbed instances, so that the influence of the alterations can be understood for each instance. In this way, for example, a medical doctor can check whether and to what extent results are plausible. None of these models can explain why a certain decision has been made; finding this out in the context of the aforementioned use cases is a goal of current research.
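The following Python sketch illustrates the LIME idea on a toy black box; it is not the reference implementation of [58], and the function names, perturbation scheme and parameters are assumptions for illustration. It perturbs one instance, weights the perturbations by similarity to the original, queries the black box, and fits a weighted linear surrogate \(g\) whose coefficients serve as the explanation.

```python
# Hedged sketch of a LIME-style local linear surrogate.
import numpy as np
from sklearn.linear_model import Ridge

def lime_local_surrogate(x, black_box, n_samples=500, sigma=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    # 1. Perturb the instance in (interpretable) feature space.
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    # 2. Weight perturbed samples by proximity to the original instance.
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / (2 * sigma ** 2))
    # 3. Query the black box for its predictions on the perturbations.
    y = black_box(Z)
    # 4. Fit a weighted linear model; its coefficients are the explanation.
    g = Ridge(alpha=1.0).fit(Z, y, sample_weight=w)
    return g.coef_

# Toy black box: probability driven mostly by the first two features.
black_box = lambda Z: 1 / (1 + np.exp(-(2 * Z[:, 0] - 1.5 * Z[:, 1])))
x = np.array([0.3, -0.2, 0.8, 0.1])
print(lime_local_surrogate(x, black_box))      # first two weights dominate
```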

In medical domains, the explainability and interpretability of algorithms are as critical as their performance [25]. It can be nearly impossible for a doctor or medical professional to effectively integrate their expert knowledge with a model’s output unless they can interpret why that model made the decision that it did [22]. Over the past several decades, AI researchers have developed a wide range of techniques for interpretable and explainable classification. These techniques fall into four general categories: sensitivity analysis, linear approximation, rule-based decompositions, and models of causality.

Sensitivity analysis techniques attempt to model which regions of the input space are most important for the classification decision. For a neural network, the simplest sensitivity analysis technique is the “input gradient” technique, which involves taking the (smoothed) gradient of the model loss function with respect to the input features [62]. Several techniques, including the above-mentioned LRP [43] and LIME [13, 58], use a linear model to (locally) approximate a complex classifier, since linear models can be easily interpreted based on their feature weights. LRP directly decomposes the model output on any training sample into a weighted sum of the model features, while LIME builds a linear classifier to approximate model behavior in the region of a particular training sample.
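A minimal PyTorch sketch of the input gradient technique is given below, assuming a toy classifier and a random input patch (both placeholders): the saliency map is simply the absolute gradient of the loss with respect to the input, without the smoothing step used in practice.

```python
# Hedged sketch: saliency via the gradient of the loss w.r.t. the input.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 2))   # toy classifier
x = torch.rand(1, 1, 32, 32, requires_grad=True)             # e.g. an image patch
target = torch.tensor([1])

loss = nn.functional.cross_entropy(model(x), target)
loss.backward()

saliency = x.grad.abs().squeeze()          # per-pixel sensitivity map
print(saliency.shape)                      # torch.Size([32, 32])
```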

Rule-based algorithms represent a classification problem as a set of rules on the input features. These include algorithms like Decision Trees and Bayesian Rule Lists [68]. The most straightforward way to build these algorithms is to assemble them directly from the training data, but this approach can have extremely high variance and is often insufficient for modern applications. A more modern approach is to use rule-based algorithms to approximate pre-trained classifiers, similarly to how LIME and LRP approximate complex algorithms with linear functions. One example of this is the above-mentioned BETA algorithm, which builds a rule set to approximate a black box classifier [32]. The most direct way to build an explainable classifier is to directly model the causal relationships between the features and the classification output. The classic way to do this is to use a Bayesian Network [38] to model the conditional independences of features and latent factors, but this approach can be challenging to scale to the size of modern datasets and feature spaces.
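The following scikit-learn sketch illustrates the surrogate idea in a simplified form (a shallow decision tree rather than the decision sets built by BETA [32]); the dataset and all parameters are synthetic placeholders. A black-box random forest is trained first, and the interpretable model is then fitted to the black box's predictions rather than to the original labels, so its rules mimic the black box.

```python
# Hedged sketch: approximate a black-box classifier with a shallow rule-like tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))       # imitate the black box, not the labels

print("fidelity:", surrogate.score(X, black_box.predict(X)))
print(export_text(surrogate))                # readable if/then rules
```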

4 Open Problems and Future Outlook

This paper explores an intriguing subject. Our understanding of brain disease comes from different sources, but pathology remains one of the most important. However, it is a very time-consuming process because it requires manually performed tasks. MRI, on the other hand, provides a great deal of information on a larger scale and has already undergone major transformations through the use of AI. For cerebral disorders, we discussed in what way AI generally, and ML specifically, can contribute.

We hypothesized that AI can help to find invariably recurring parameters that have escaped human attention (e.g. due to noisy data) to validate diagnosis. In addition, AI helps to deal with an ever-increasing amount of data that would otherwise take much longer to analyze manually. Several ML methods can help to identify and provide more meaningful information regarding the signals of different contrasts, location (with high-resolution 7 T MRI), texture, size, dimension, patient information and specific patterns. Moreover, AI could be used to locate, correlate and compare all brain regions, in order to study the high individual variability of human gyri and sulci, signal variability, the normal ageing process, clinical records and neurodegenerative diseases.

AI has the potential to go beyond helping to filter out noise. One reason humans may be limited is not only the noise, but also wrong decisions about the feature space. For example, in structural MRI, we automatically take the voxel as the feature space. However, that unit results from the measurement technique rather than from any hypothesis or regularity about the brain or the disease. There is great potential for AI to reveal other meaningful feature spaces, such as volume, heterogeneity or variability. It is possible that a meaningful feature space involves voxel-to-voxel relationships or interdependencies, for example. AI can therefore help us elucidate the right feature spaces, eliminating or reducing human bias.