
1 Introduction

Normative modelling is a popular method for studying heterogeneous brain disorders. Normative models assume disease cohorts sit at the tails of a healthy population distribution and quantify individual deviations from healthy brain patterns. Typically, a normative analysis constructs one normative model per variable, e.g., using Gaussian Process Regression (GPR) [9]. Recently, to model complex non-linear interactions between features, deep-learning approaches using adversarial autoencoder (AAE) and variational autoencoder (VAE) models have been proposed [8, 11]. These models have a uni-modal structure with a single encoder and decoder network. So far, almost all deep-learning normative models have modelled only one modality. However, many brain disorders show deviations from the norm in features of multiple imaging modalities, to varying degrees, and it is often unknown which modality will be the most sensitive. Thus, it is advantageous to develop normative models suitable for multiple modalities.

Most previous deep-learning normative and anomaly detection models measure deviations in the feature space [4, 7, 11]. However, for multi-modal models built from modalities containing highly different but complementary information (e.g., the T1 and DTI features used here), we may not expect to see significantly greater deviations in the feature space compared to uni-modal methods. Indeed, previous work has shown that, when using VAEs, even for one modality, measuring deviation in the latent space outperforms metrics in the feature space [8] and provides a single measure of abnormality. As such, we develop a latent deviation metric suitable for measuring deviations in multi-modal data.

There are many approaches to extending VAEs to integrate information from multiple modalities and learn informative joint latent representations. Most multi-modal VAE frameworks learn separate encoder and decoder networks for each modality and aggregate the encoding distributions to learn a joint latent representation. Wu and Goodman [16] introduced a multi-modal VAE (mVAE) where each encoding distribution is treated as an ‘expert’ and the Product-of-Experts (PoE), which takes a product of the experts’ densities, is used to approximate a joint encoding distribution. The PoE approach treats all experts as equally credible, taking a uniform contribution from every modality. In practice, however, different modalities carry different levels of noise, complexity and information. Furthermore, if we have an overconfident, miscalibrated expert, i.e. one with a sharp but shifted probability distribution, the joint distribution will have low density in the region supported by the other experts and a biased mean prediction. This can result in a suboptimal latent space and data reconstruction. Shi et al. [13] address this problem by combining latent representations across modalities using a Mixture-of-Experts (MoE) approach. For MoE, the joint distribution is given by a mixture of the experts’ densities, so that the density is spread over all regions covered by the experts and overconfident experts do not monopolize the resulting prediction. However, MoE is less sensitive to consensus across modalities and will assign lower probability than PoE to regions where the experts agree. Alternatively, we propose an mVAE modelling the joint encoding distribution as a generalised Product-of-Experts (gPoE) [2]. We optimise modality-specific weightings to account for different information content between experts and enable the model to down-weight experts which cause erroneous predictions. Depending on the application, either MoE or gPoE may be more appropriate, and so we consider both methods for normative modelling.

As far as we are aware, only one other multi-modal VAE normative modelling framework, which uses PoE (PoE-normVAE), has been proposed in the literature [7]. However, Kumar et al. [7] rely on measuring deviations in the feature space, which we argue does not leverage the benefits of multi-modal models. Here, we present an improved factorisation of the joint representation by modelling it as a weighted product (gPoE) or a mixture (MoE) of the individual encoding distributions.

Our contributions are two-fold. Firstly, we present two novel multi-modal normative modelling frameworks, MoE-normVAE and gPoE-normVAE, which capture the joint distribution between different imaging modalities. Our proposed models outperform baseline methods on two neuroimaging datasets. Secondly, we present a deviation metric, based on the latent space, suitable for detecting deviations in multi-modal normative distributions. We show that our metric better leverages the benefits of multi-modal normative models compared to feature space-based metrics.

2 Methods

Multi-modal Variational Autoencoder (mVAE). Let \({\textbf {X}}=\{{\textbf {x}}_m\}^M_{m=1}\) be the observations of M modalities. We use an mVAE to learn a multi-modal generative model of the form \(p_{\theta }\left( {\textbf {X}}, {\textbf {z}}\right) = p (\textbf{z}) \prod _{m=1}^{M} p_{\theta _{m}}\left( {\textbf {x}}_{m} \mid {\textbf {z}}\right) \), in which the modalities are conditionally independent given a common latent variable \({\textbf {z}}\) (Fig. 1c). The likelihood distributions \(p_{\theta _{m}}\left( {\textbf {x}}_{m} \mid {\textbf {z}}\right) \) are parameterised by decoder networks with parameters \(\theta = \{ \theta _{1}, \ldots , \theta _{M} \}\). The goal of VAE training is to maximise the marginal likelihood of the data. However, as this is intractable, we instead optimise an evidence lower bound (ELBO):

$$\begin{aligned} \mathcal {L} = \mathbb {E}_{q_{\phi }({\textbf {z}} \mid {\textbf {X}})}\left[ \sum _{m=1}^{M} \log p_{\theta }\left( {\textbf {x}}_{m} \mid {\textbf {z}}\right) \right] -D_{KL}\left( q_{\phi }({\textbf {z}} \mid {\textbf {X}}) \,\|\, p({\textbf {z}})\right) \end{aligned}$$
(1)

where the second term is the KL divergence between the approximate joint posterior \(q_{\phi }\left( {\textbf {z}} \mid {\textbf {X}}\right) \) and the prior \(p({\textbf {z}})\). We model the posterior, likelihood, and prior distributions as isotropic Gaussians.

Approximate Joint Posterior. To train the mVAE, we must specify the form of the approximate joint posterior \(q_{\phi }\left( {\textbf {z}} \mid {\textbf {X}}\right) \). Wu and Goodman [16] choose to factorise the joint posterior as a Product-of-Experts (PoE); \(q_{\phi }\left( {\textbf {z}} \mid {\textbf {X}}\right) = \frac{1}{K} \prod _{m=1}^{M} q_{\phi _{m}}\left( {\textbf {z}} \mid {\textbf {x}}_{m}\right) \), where the experts, i.e., the individual posterior distributions \(q_{\phi _{m}}\left( {\textbf {z}} \mid {\textbf {x}}_{m}\right) \), are parameterised by encoder networks with parameters \(\phi = \{ \phi _{1}, \ldots , \phi _{M} \}\), and K is a normalisation term. Assuming each encoder network outputs a Gaussian distribution \(q\left( {\textbf {z}} \mid {\textbf {x}}_{m}\right) =\mathcal {N}(\boldsymbol{\mu }_m, \boldsymbol{\sigma }_{m}^{2} {\textbf {I}})\), the parameters of the joint posterior distribution can be computed in closed form [5]; \( \boldsymbol{\mu } = \frac{\sum _{m=1}^{M} \boldsymbol{\mu }_{m} / \boldsymbol{\sigma }_{m}^{2}}{\sum _{m=1}^{M} 1 / \boldsymbol{\sigma }_{m}^{2}} \quad \text{ and } \quad \boldsymbol{\sigma }^{2} =\frac{1}{\sum _{m=1}^{M} 1 / \boldsymbol{\sigma }_{m}^{2}} \) (see Supp. for proofs).
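For illustration, the PoE fusion amounts to precision-weighted averaging and can be written in a few lines. The following is a minimal PyTorch sketch, assuming each encoder outputs a per-modality mean and log-variance; it is an illustrative implementation rather than our released code.

```python
import torch

def poe_fusion(mus, logvars):
    """Fuse M Gaussian experts N(mu_m, sigma_m^2 I) into the PoE joint posterior.

    mus, logvars: tensors of shape (M, batch, latent_dim).
    The joint precision is the sum of the expert precisions, and the joint
    mean is the precision-weighted average of the expert means.
    """
    precision = torch.exp(-logvars)                     # 1 / sigma_m^2
    joint_var = 1.0 / precision.sum(dim=0)              # sigma^2
    joint_mu = joint_var * (precision * mus).sum(dim=0)
    return joint_mu, joint_var.log()
```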

However, overconfident but miscalibrated experts may bias the joint posterior distribution (see Fig. 1b), which is undesirable for learning informative latent representations between modalities [13].

Shi et al. [13] instead factorise the approximate joint posterior as a Mixture-of-Experts (MoE); \(q_{\phi }\left( {\textbf {z}} \mid {\textbf {X}}\right) = \sum _{m=1}^{M} \frac{1}{M} q_{\phi _{m}}\left( {\textbf {z}} \mid {\textbf {x}}_{m}\right) ,\) which, as a uniform mixture of normalised densities, requires no further normalisation.

In the MoE setting, each uni-modal posterior \(q_{\phi _m}({\textbf {z}} \mid {\textbf {x}}_m)\) is evaluated against the generative model \(p_{\theta }\left( {\textbf {X}}, {\textbf {z}}\right) \) such that the ELBO becomes:

$$\begin{aligned} \mathcal {L} = \sum _{m=1}^{M}\left[ \mathbb {E}_{q_{\phi _m}({\textbf {z}} \mid {\textbf {x}}_{m})}\left[ \sum _{n=1}^{M} \log p_{\theta }\left( {\textbf {x}}_{n} \mid {\textbf {z}}\right) \right] -D_{KL}\left( q_{\phi _m}({\textbf {z}} \mid {\textbf {x}}_m) \,\|\, p({\textbf {z}})\right) \right] . \end{aligned}$$
(2)

However, this approach only takes each uni-modal encoding distribution into account separately during training. Thus, there is no explicit aggregation of information from multiple modalities in the latent representation used by the decoder networks for reconstruction. For modalities with a high degree of modality-specific variation, this enforces an undesirable upper bound on the ELBO, potentially leading to a sub-optimal approximation of the joint distribution [3].
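To make Eq. 2 concrete, the following minimal sketch computes the MoE objective, assuming Gaussian likelihoods with fixed unit variance (so each reconstruction term reduces to a squared error up to constants); `encoders` and `decoders` are illustrative per-modality networks, not our released implementation.

```python
import torch
import torch.nn.functional as F

def moe_elbo(encoders, decoders, X):
    """MoE ELBO of Eq. 2 (up to additive constants). X: list of M tensors.

    Each uni-modal posterior q_{phi_m}(z | x_m) is sampled in turn and
    evaluated against the decoders of all modalities.
    """
    elbo = 0.0
    for encode, x_m in zip(encoders, X):
        mu, logvar = encode(x_m)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterise
        log_lik = sum(-0.5 * F.mse_loss(decode(z), x_n, reduction="sum")
                      for decode, x_n in zip(decoders, X))
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        elbo += log_lik - kl
    return elbo / len(X)  # uniform mixture weights 1/M
```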

Generalised Product-of-Experts Joint Posterior. We propose an alternative approach to mitigate the problem of overconfident experts by factorising the joint posterior as a generalised Product-of-Experts (gPoE) [2]; \(q_{\phi }\left( {\textbf {z}} \mid {\textbf {X}}\right) = \frac{1}{K} \prod _{m=1}^{M} q_{\phi _{m}}^{\alpha _{m}}\left( {\textbf {z}} \mid {\textbf {x}}_{m}\right) \), where \(\alpha _{m}\) is a weighting for modality m such that \(\sum _{m=1}^{M}\alpha _{m} = 1\) for each latent dimension and \(0<\alpha _{m}<1\). We optimise \(\alpha \) during training, allowing the model to weight experts in such a way as to learn an approximate joint posterior \(q_{\phi }\left( {\textbf {z}} \mid {\textbf {X}}\right) \) under which the likelihood \(p_{\theta }\left( {\textbf {X}} \mid {\textbf {z}}\right) \) is maximised. This provides a means to down-weight overconfident experts. Furthermore, as \(\alpha \) is learnt per latent dimension, different modality weightings can be learnt for different latent dimensions, thus explicitly incorporating modality-specific variation in addition to shared information in different dimensions of the joint latent space. Similarly to the PoE approach, we can compute the parameters of the joint posterior distribution; \( \boldsymbol{\mu } = \frac{\sum _{m=1}^{M} \boldsymbol{\mu }_{m}\boldsymbol{\alpha }_{m} / \boldsymbol{\sigma }_{m}^{2}}{\sum _{m=1}^{M} \boldsymbol{\alpha }_{m} / \boldsymbol{\sigma }_{m}^{2}} \quad \text{ and } \quad \boldsymbol{\sigma }^{2} = \frac{1}{\sum _{m=1}^{M} \boldsymbol{\alpha }_{m} / \boldsymbol{\sigma }_{m}^{2}} \).
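A sketch of the gPoE fusion is given below, with the per-dimension weights parameterised via a softmax over the modality axis so that the constraints \(\sum _m \alpha _m = 1\) and \(0< \alpha _m <1\) hold by construction. The softmax parameterisation is one convenient choice of ours for this sketch; the model itself only requires that \(\alpha \) be optimised under these constraints.

```python
import torch
import torch.nn as nn

class GPoEFusion(nn.Module):
    """gPoE joint posterior with learnable per-latent-dimension weights."""

    def __init__(self, n_modalities, latent_dim):
        super().__init__()
        # Free parameters; a softmax over the modality axis yields alpha.
        self.alpha_logits = nn.Parameter(torch.zeros(n_modalities, latent_dim))

    def forward(self, mus, logvars):
        # mus, logvars: (M, batch, latent_dim)
        alpha = torch.softmax(self.alpha_logits, dim=0).unsqueeze(1)  # (M, 1, D)
        weighted_prec = alpha * torch.exp(-logvars)     # alpha_m / sigma_m^2
        joint_var = 1.0 / weighted_prec.sum(dim=0)
        joint_mu = joint_var * (weighted_prec * mus).sum(dim=0)
        return joint_mu, joint_var.log()
```

Initialising all logits to zero starts the model at the uniform weighting \(\alpha _m = 1/M\), from which training can down-weight unreliable experts.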

Recently, a gPoE mVAE was proposed for learning joint representations of hand-poses and surgical videos [6]. However, we emphasize that our approach differs in application and offers a more lightweight implementation (Joshi et al. [6] require training of auxiliary networks to learn \(\alpha \) per sample).

Multi-modal Normative Modelling. We propose two mVAE normative modelling frameworks, shown in Fig. 1a: MoE-normVAE, which uses a MoE joint posterior distribution, and gPoE-normVAE, which uses a gPoE joint posterior distribution. For both models, the encoder \(\phi \) and decoder \(\theta \) parameters are trained to characterise a healthy population cohort. normVAE models assume that abnormality due to disease effects can be quantified by measuring deviations in the latent space [8] or the feature space [11]. At test time, the clinical cohort is passed through the encoder and decoder networks, and deviations of test subjects from the multi-modal latent space of the healthy controls and data reconstruction errors are measured. We compare our methods to the previously proposed PoE-normVAE [7] and to three baseline VAEs with a uni-modal architecture: two trained on a single modality each and one trained on the concatenated multi-modal input.

To compare our normVAE models to a classical normative approach, we trained one GPR model (using the PCNToolkit) per feature on a subset of 2000 healthy UK Biobank individuals and used extreme value statistics to calculate a subject-level abnormality index [9]. We used a top-5% abnormality threshold (set using the healthy training cohort) to calculate a significance ratio (see Eq. 6).

Fig. 1.

(a) The gPoE-normVAE and MoE-normVAE normative framework. All normVAE models were implemented using the following parameter settings: maximum epochs = 2000, batch size = 256, learning rate = \(10^{-4}\), early stopping = 50 epochs, encoder layers = [20, 40], decoder layers = [20, 40]. A ReLU activation function was applied between layers. Models were trained with a range of latent space sizes (\(L_{\text {dim}}\)) from 5 to 20. Models with \(L_{\text {dim}}\)=10 were fine-tuned (maximum 100 epochs) using the ADNI healthy cohort. Learnt \(\alpha \) values are given in Supp. Table 1. (b) Example PoE and gPoE joint distributions. (c) Graphical model.

Multi-modal Latent Deviation Metric. Previous works using autoencoders as normative models have mostly relied on feature space-based deviation metrics [7, 11]. That is, they compare the input value \(x_{ij}\) for subject j at the i-th brain region to the value reconstructed by the autoencoder, \(\widehat{x}_{ij}\): \(d_{ij}=\left( x_{ij}-\widehat{x}_{ij}\right) ^{2}\). Kumar et al. [7] propose the following normalised z-score metric on the data reconstruction (a univariate feature space metric):

$$\begin{aligned} D_{\text {uf}}=\frac{d_{ij}-\mu _{\text {norm}}\left( d_{ij}^{\text {norm}}\right) }{\sigma _{\text {norm}}\left( d_{ij}^{\text {norm}}\right) } \end{aligned}$$
(3)

where \(\mu _{\text {norm}}\left( d_{ij}^{\text {norm}}\right) \) is the mean and \(\sigma _{\text {norm}}\left( d_{ij}^{\text {norm}}\right) \) the standard deviation of the deviations \(d_{ij}^{\text {norm}}\) of a holdout healthy control cohort.
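In code, Eq. 3 is a per-region z-scoring of the squared reconstruction errors against the healthy holdout statistics. A minimal NumPy sketch (array names are illustrative):

```python
import numpy as np

def d_uf(x, x_hat, d_holdout):
    """Univariate feature-space deviation (Eq. 3).

    x, x_hat:  (n_subjects, n_regions) inputs and reconstructions.
    d_holdout: (n_holdout, n_regions) squared errors of a healthy holdout cohort.
    """
    d = (x - x_hat) ** 2
    return (d - d_holdout.mean(axis=0)) / d_holdout.std(axis=0)
```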

However, in the multi-modal setting, feature space-based deviation metrics may not highlight the benefits of multi-modal models over their uni-modal counterparts. The goal of the joint latent representation is to capture information from all modalities. Thus, decoders for each modality must extract the information from the joint latent representation, which now carries information from all other modalities as well. Therefore, data reconstructions capture only information relevant to a particular modality and may also be poorer compared to uni-modal methods. As such, particularly when incorporating modalities with a high degree of modality-specific variation, we believe latent space deviation metrics would better capture deviations from normative behaviour across multiple modalities. Then, once an abnormal subject has been identified, feature space metrics can be used to identify deviating brain regions (e.g. Supp. Fig. 3).

We propose a latent deviation metric to measure deviations from the joint normative distribution. To account for correlations between latent dimensions and to derive a single multivariate measure of deviation, we measure the Mahalanobis distance from the encoding distribution of the healthy training cohort:

$$\begin{aligned} D_{\text {ml}}=\sqrt{\left( z_{j}-\mu (z^{\text {norm}})\right) ^T \varSigma (z^{\text {norm}})^{-1}\left( z_{j}-\mu (z^{\text {norm}})\right) } \end{aligned}$$
(4)

where \(z_j \sim q\left( {\textbf {z}}_j \mid {\textbf {X}}_j\right) \) is a sample from the joint posterior distribution for subject j, and \(\mu (z^{\text {norm}})\) is the mean and \(\varSigma (z^{\text {norm}})\) the covariance of the healthy cohort latent positions. We use robust estimates of the mean and covariance to account for outliers within the healthy control cohort. For closer comparison with \(D_{\text {ml}}\), we derive the following multivariate feature space metric:

$$\begin{aligned} D_{\text {mf}}=\sqrt{\left( d_{j}-\mu (d^{\text {norm}})\right) ^T \varSigma (d^{\text {norm}})^{-1}\left( d_{j}-\mu (d^{\text {norm}})\right) } \end{aligned}$$
(5)

where \(d_j = ( d_{1j}, \ldots , d_{Ij} )\) is the vector of reconstruction errors for subject j across the brain regions \((i=1,\ldots ,I)\), and \(\mu (d^{\text {norm}})\) is the mean and \(\varSigma (d^{\text {norm}})\) the covariance of the healthy cohort reconstruction errors.
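Both Mahalanobis metrics can be computed with a robust covariance estimator. The sketch below uses scikit-learn's Minimum Covariance Determinant, which is one reasonable choice for the unspecified robust estimator; Eq. 5 is obtained by passing reconstruction-error vectors instead of latent samples.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def mahalanobis_deviation(v_test, v_train):
    """Multivariate deviation (Eqs. 4 and 5).

    v_train: (n_healthy, dim) latent samples (or reconstruction errors)
             of the healthy training cohort.
    v_test:  (n_test, dim) the same quantity for the test cohort.
    """
    robust = MinCovDet().fit(v_train)        # robust mean and covariance
    # mahalanobis() returns squared distances, so take the square root.
    return np.sqrt(robust.mahalanobis(v_test))
```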

Assessing Deviation Metric Performance. For each model, we calculated \(D_{\text {ml}}\) and \(D_{\text {mf}}\) for a healthy holdout cohort and a disease cohort. For each deviation metric, we identified individuals whose deviations were significantly different from the healthy training distribution (\(p<0.001\)) [15]. Ideally, we want a model which correctly identifies disease individuals as outliers and healthy individuals as sitting within the normative distribution. As such, we use the following significance ratio (a positive likelihood ratio) to assess model performance:

$$\begin{aligned} \text {significance ratio}=\frac{\text {proportion of the disease cohort with significant deviations}}{\text {proportion of the healthy holdout cohort with significant deviations}} \end{aligned}$$
(6)

To calculate significance ratios for the univariate metric, we calculated \(D_{\text {uf}}\) relative to the training cohort for the healthy holdout and disease cohorts (Bonferroni-adjusted p=0.05/\(N_{\text {features}}\)) [7].
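Given per-subject p-values under a deviation metric, the significance ratio can then be computed as in the sketch below (illustrative, with the thresholds as described above):

```python
import numpy as np

def significance_ratio(p_disease, p_holdout, alpha=0.001):
    """Eq. 6: proportion of the disease cohort flagged as significant
    outliers divided by the same proportion in the healthy holdout cohort."""
    flagged_disease = (np.asarray(p_disease) < alpha).mean()
    flagged_holdout = (np.asarray(p_holdout) < alpha).mean()
    return flagged_disease / flagged_holdout
```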

3 Experiments

Data Processing. To train the normVAE models, we used 10,276 healthy subjects from the UK Biobank [14] (application number: 70047). We used pre-processed (provided by the UK Biobank [1]) grey-matter volumes for 66 cortical (Desikan-Killiany atlas) and 16 subcortical brain regions, and Fractional Anisotropy (FA) and Mean Diffusivity (MD) measurements for 35 white matter tracts (Johns Hopkins University atlas). At test time, we used 2,568 healthy controls from a holdout cohort and 122 individuals with one of several neurodegenerative disorders: motor neuron disease, multiple sclerosis, Parkinson's disease, dementia/Alzheimer's disease/cognitive impairment, and other demyelinating diseases.

We also tested the models using an external dataset. We extracted 213 subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI) [10] dataset with significant memory concern (SMC; N=27), early mild cognitive impairment (EMCI; N=63), late mild cognitive impairment (LMCI; N=34), and Alzheimer's disease (AD; N=43), as well as healthy controls (HC; N=45). We used the healthy controls to fine-tune the models in a transfer learning approach. The same T1 and DTI features as for the UK Biobank were extracted for the ADNI dataset.

Rather than conditioning on covariates as done in some related work [7, 8], we adjusted for confounding effects prior to analysis. Non-linear age and linear intracranial volume (ICV) effects were removed from the DTI and T1 MRI features of both datasets [12]. Each brain ROI was normalised by removing the mean and dividing by the standard deviation of the healthy control cohort brain regions.
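For illustration, the confound-adjustment step can be sketched as fitting a confound model on healthy controls only and applying it to all subjects. The quadratic age term below is an assumption standing in for the non-linear age effect; the exact basis follows [12].

```python
import numpy as np

def deconfound(X, age, icv, hc_mask):
    """Remove age (here, quadratic) and linear ICV effects, then z-score
    each ROI using healthy-control statistics. X: (n_subjects, n_rois)."""
    C = np.column_stack([np.ones_like(age), age, age ** 2, icv])
    beta, *_ = np.linalg.lstsq(C[hc_mask], X[hc_mask], rcond=None)
    resid = X - C @ beta                     # remove confound effects
    mu = resid[hc_mask].mean(axis=0)
    sd = resid[hc_mask].std(axis=0)
    return (resid - mu) / sd
```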

UK Biobank Results. As expected, we see greater significance ratios for all models when using \(D_{\text {ml}}\) rather than \(D_{\text {mf}}\) (Table 1). When using \(D_{\text {mf}}\) or \(D_{\text {uf}}\), all models perform similarly, whereas using \(D_{\text {ml}}\) over \(D_{\text {mf}}\) leads to a 4-fold increase in the significance ratio. Further, our proposed models give the best overall performance across different \(L_{\text {dim}}\), with the highest significance ratio achieved by gPoE-normVAE with \(L_{\text {dim}}\)=10. Generally, all multi-modal normVAE models performed better than the uni-modal models, suggesting that by modelling the joint distribution between modalities we can learn better normative models.

Table 1. Significance ratio calculated from \(D_{\text {ml}}\), \(D_{\text {mf}}\), and \(D_{\text {uf}}\) for the UK Biobank. See Supp. for results in figure form. Using GPR, we observed a significance ratio of 6.01, poorer performance than our models (using \(D_{\text {ml}}\)).

ADNI Results. Previous work [8] explored the ability of a uni-modal T1 normVAE to detect deviations in the ADNI cohorts. Figure 2a shows the latent deviation \(D_{\text {ml}}\) for the different diagnoses in the ADNI cohort for the T1 normVAE, DTI normVAE, PoE-normVAE and gPoE-normVAE models. All models reflect increasing disease severity with increasing disease stage. The gPoE-normVAE model showed greater sensitivity to disease stage, as indicated by the higher F statistic (and correspondingly lower p-value) from an ANOVA. We measured the Pearson correlation with composite measures of memory and executive function (Fig. 2b) and found that our proposed model exhibited greater correlation with both cognition scores than the baseline approaches. Finally, we see that the sensitivity to disease severity for the gPoE-normVAE model extends to the feature space, where we see a general increase in average \(D_{\text {uf}}\) from the LMCI to the AD cohort (Supp. Figs. 3a and 3b respectively).

Fig. 2.

(a) \(D_{\text {ml}}\) by disease label (\(L_{\text {dim}}\)=10). Statistical annotations were generated using Welch's t-tests between pairs of disease groups: \(\text {ns}: 0.05 < p \le 1\); \(*: 0.01 < p \le 0.05\); \(**: 0.001 < p \le 0.01\); \(***: 0.0001 < p \le 0.001\); \(****: p \le 0.0001\). Robust estimates of the mean and covariance were not used to calculate \(D_{\text {ml}}\) due to the small healthy cohort size. (b) Pearson correlation between \(D_{\text {ml}}\) and patient cognition represented by age-adjusted memory and executive function composite scores.

4 Discussion and Further Work

We have built on recent works [7, 8, 11] and introduced two novel mVAE normative models, which provide an alternative method of learning the joint normative distribution between modalities to address the limitations of current approaches. Our models provide a more informative joint representation compared to baseline methods as evidenced by the better significance ratio for the UK Biobank dataset and greater sensitivity to disease staging and correlation with cognitive measures in the ADNI dataset. We also proposed a latent deviation metric suitable for detecting deviations in the multivariate latent space of multi-modal normative models which gave an approximately 4-fold performance increase over metrics based on the feature space.

Further work will involve extending our models to more data modalities, such as genetic variants, to better characterise the behaviour of a physiological system. We note that, for fair comparison across models, we removed the effects of confounding variables prior to analysis. However, confounding effects could instead be removed during analysis via conditioning variables [8]. Another limitation of the normVAE models introduced here is the use of ROI-level data. Data processing software, such as FreeSurfer, may fail to accurately capture abnormality in images, particularly if large lesions are present. Future work also includes creating normative models designed for voxel-level data to better capture disease effects.

Normative models have been successfully applied to the study of a range of heterogeneous diseases. Diseases often present abnormalities across a range of neuroimaging, biological and physiological features which provide different information about the underlying disease process. Normative systems that incorporate features from different data modalities offer a holistic picture of the disease and will be capable of detecting abnormalities across a broad range of different diseases. Furthermore, multi-modal normative modelling captures the relationship between different modalities in healthy individuals, with disruption to this relationship potentially leading to a disease signal. Code is publicly available at https://github.com/alawryaguila/multimodal-normative-models.