1 Introduction

Multiple Sclerosis (MS) is an inflammatory autoimmune disorder of the central nervous system. It is characterized by the formation of lesions in the white matter, as well as marked brain atrophy primarily in deep gray matter structures [1, 2]. The increased availability of longitudinal magnetic resonance imaging (MRI) scans opens up the prospect of tracking lesion evolution and atrophy trajectories over time, enabling a better assessment of disease progression and treatment efficacy [3].

Despite the high potential clinical impact, work on computational methods for quantifying longitudinal changes in MS has remained fairly limited to date (cf. [4] for an overview). The methods that do exist suffer from one or more of the following limitations: they only assess changes in white matter lesions [5,6,7,8] or in aggregate measures of brain atrophy such as global brain or gray matter volume [9, 10], but not in individual brain structures; they can only compare two consecutive time points [11,12,13,14] instead of characterizing entire temporal trajectories; or they are developed and tested in very specific settings only, with degraded performance when applied to data from different scanners and acquisition protocols [15], which limits their usefulness in practice.

In order to address these limitations, here we propose a dedicated model for simultaneously segmenting anatomical brain structures and white matter lesions from longitudinal multi-contrast MRI scans. The proposed method builds upon a contrast-adaptive method for simultaneous whole-brain and lesion segmentation that we previously developed and validated [16]. Here we extend this approach to the longitudinal setting by additionally modeling the expected temporal consistency between repeated scans of the same subject, using latent variables that introduce a statistical dependency between the time points. By segmenting both white matter lesions and anatomical brain structures across time, the resulting method enables tracking deep gray matter atrophy trajectories and lesion evolution simultaneously. The model is fully adaptive to different MRI contrasts and scanners, and does not put any constraints on the number or the timing of longitudinal follow-up scans. To the best of our knowledge, no other method with these capabilities currently exists.

We assessed the segmentation performance of the proposed method on three longitudinal datasets. Preliminary results indicate that it produces more reliable segmentations and detects disease effects better than the cross-sectional method. An example result produced by the longitudinal method is shown in Fig. 1.

2 Existing Cross-Sectional Method

We first summarize the existing cross-sectional method for simultaneous whole-brain and lesion segmentation [16] on which the proposed method builds.

Let \(\mathbf {D} = ( \mathbf {d}_{1}, \dots , \mathbf {d}_{I} ) \) be the image intensities of a multi-contrast MRI scan with I voxels, where the vector \(\mathbf {d}_{i} = ( d_i^1, \dots , d_i^N )^T\) represents the log-transformed image intensity of voxel i for all the available N contrasts. Moreover, let \(\mathbf {l} = (l_1, \dots , l_I)^T\) be the corresponding segmentation labels, where \(l_i \in \{1, \dots , K\}\) denotes one of the K possible anatomical structures assigned to voxel i. In order for the model to be capable of segmenting white matter lesions, a binary lesion map \(\mathbf {z} = (z_1, \dots , z_I)^T\) is introduced, where \(z_i \in \{0, 1\}\) indicates the presence of a lesion in voxel i. We use a generative model, illustrated in black in Fig. 2, to estimate a joint segmentation \(\{ \mathbf {l}, \mathbf {z} \}\) from MRI data \(\mathbf {D}\). The model consists of a segmentation prior \(p(\mathbf {l},\mathbf {z} | \mathbf {h}, \mathbf {x})\) with parameters \(\mathbf {h}\) and \(\mathbf {x}\) that encode shape information, and a likelihood function \(p(\mathbf {D} | \mathbf {l}, \mathbf {z}, \varvec{\theta }, \varvec{\theta }_{z} )\) with parameters \(\varvec{\theta }\) and \(\varvec{\theta }_{z}\) that govern intensity appearance. Below we briefly describe the segmentation prior and the likelihood function, as well as how the model is “inverted” to obtain automatic segmentations.

Fig. 1. Example segmentation produced by the proposed method on a longitudinal scan with T1w and FLAIR contrast.

Segmentation Prior: The segmentation prior is composed of two components \(p(\mathbf {l} | \mathbf {x})\) and \(p(\mathbf {z} | \mathbf {h}, \mathbf {x})\) that encode spatial information regarding the neuroanatomical labels \(\mathbf {l}\) and the lesion map \(\mathbf {z}\) respectively: \( p(\mathbf {l},\mathbf {z} | \mathbf {h}, \mathbf {x}) = p(\mathbf {l} | \mathbf {x}) p(\mathbf {z} | \mathbf {h}, \mathbf {x}) . \) The first component is a deformable probabilistic atlas, encoded as a tetrahedral mesh [17] with node positions \(\mathbf {x}\) and with a deformation prior distribution defined as:

$$ p(\mathbf {x}) \propto \exp \left[ -K \sum _d U_{d}(\mathbf {x}, \mathbf {x}_{ref}) \right] . $$

Here K controls the stiffness of the mesh deformations, d loops over the tetrahedra in the mesh, and \(U_{d}(\mathbf {x}, \mathbf {x}_{ref})\) is a cost [18] associated with deforming the \(d^{th}\) tetrahedron from its shape in the atlas’s reference position \(\mathbf {x}_{ref}\). Letting \(p( l_i = k | \mathbf {x} )\) denote the probability of observing label k at voxel i for a given deformation, assuming conditional independence of the labels between voxels yields \( p(\mathbf {l}| \mathbf {x}) = \prod _{i=1}^{I} p( l_i | \mathbf {x} ) . \)

The second component of the segmentation prior is a model of the form: \( p(\mathbf {z} | \mathbf {h}, \mathbf {x}) = \prod _{i=1}^I p( z_i | \mathbf {h}, \mathbf {x}) , \quad p ( \mathbf {h} ) = \mathcal {N} ( \mathbf {h} | \mathbf {0}, \mathbf {I}) , \) where \( p( z_i = 1 | \mathbf {h}, \mathbf {x} ) \) is the probability that voxel i is part of a lesion. This model takes into account both a voxel’s spatial location within its neuroanatomical context (through \(\mathbf {x}\)) and lesion shape constraints, the latter through a variational autoencoder (VAE) [19] that “decodes” a low-dimensional latent code \(\mathbf {h}\) using a convolutional neural network.

Likelihood: For the likelihood, which links segmentations \(\{\mathbf {l}, \mathbf {z}\}\) to intensities \(\mathbf {D}\), we use a multivariate Gaussian intensity model for each structure, and model the MRI bias field artifact as a linear combination of spatially smooth basis functions that is added to the local voxel intensities [20, 21]. Letting \(\varvec{\theta }_{z} = \{\varvec{\mu }_{z}, \varvec{\varSigma }_{z}\}\) denote the mean and variance of lesion intensities, and \(\varvec{\theta }\) the collection of bias field parameters and intensity means and variances \(\{ \varvec{\mu }_k, \varvec{\varSigma }_k \}\) of all K anatomical structures, the likelihood is defined as \( p(\mathbf {D} | \mathbf {l}, \mathbf {z}, \varvec{\theta }, \varvec{\theta }_{z} ) = \prod _{i=1}^{I} p(\mathbf {d}_{i} | l_{i}, z_i, \varvec{\theta }, \varvec{\theta }_{z} ), \) where

$$\begin{aligned} p(\mathbf {d}_{i} | l_i=k, z_i, \varvec{\theta }, \varvec{\theta }_{z} ) = {\left\{ \begin{array}{ll} \mathcal {N}( \mathbf {d}_{i} | \varvec{\mu }_{z} + \mathbf {C}^T\varvec{\phi }_{i}, \varvec{\varSigma }_{z} ) &{} \text {if } z_i=1, \\ \mathcal {N}( \mathbf {d}_{i} | \varvec{\mu }_k + \mathbf {C}^T\varvec{\phi }_{i}, \varvec{\varSigma }_k ) &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

Here \(\varvec{\phi }_{i}\) evaluates the bias field basis functions at the \(i^{th}\) voxel, and \(\mathbf {C} = (\mathbf {c}_1 , \dots , \mathbf {c}_N)\), where \(\mathbf {c}_n\) denotes the parameters of the bias field model for the \(n^{th}\) contrast. The model is completed by a flat prior on \(\varvec{\theta }\), and a weak conditional prior \(p(\varvec{\theta }_{z} | \varvec{\theta })\) that ensures that the method can be robustly applied to scans with no or very small lesion loads [16].
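As an illustration only, the per-voxel likelihood term above can be sketched in a few lines of NumPy. This is not the authors' implementation: the function names, the array shapes, and the use of M for the number of bias field basis functions are assumptions made for this example.

```python
import numpy as np


def log_gaussian(d, mean, cov):
    """Log density of a multivariate Gaussian N(d | mean, cov)."""
    n = d.shape[0]
    diff = d - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)  # squared Mahalanobis distance
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + maha)


def voxel_log_likelihood(d_i, k, z_i, mu, Sigma, mu_z, Sigma_z, C, phi_i):
    """log p(d_i | l_i = k, z_i, theta, theta_z) for a single voxel.

    Assumed shapes (hypothetical, chosen for this sketch):
      d_i   : (N,)      log-transformed intensities of voxel i
      mu    : (K, N)    structure means, Sigma : (K, N, N) covariances
      mu_z  : (N,)      lesion mean, Sigma_z : (N, N) lesion covariance
      C     : (M, N)    bias field coefficients, one column per contrast
      phi_i : (M,)      bias field basis functions evaluated at voxel i
    """
    bias = C.T @ phi_i  # additive bias in the log-intensity domain
    if z_i == 1:
        return log_gaussian(d_i, mu_z + bias, Sigma_z)
    return log_gaussian(d_i, mu[k] + bias, Sigma[k])
```

Summing this quantity over all voxels gives the log of the product in the likelihood; working in the log domain simply avoids numerical underflow.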

Segmentation: Given an MRI scan \(\mathbf {D}\), segmentation proceeds by approximating the segmentation posterior using point estimates of the parameters \(\mathbf {x}\) and \(\varvec{\theta }\):

$$\begin{aligned} p( \mathbf {l}, \mathbf {z} | \mathbf {D} ) \simeq p ( \mathbf {l}, \mathbf {z} | \mathbf {D}, \hat{\varvec{\theta }}, \hat{\mathbf {x}} ) , \end{aligned}$$
(1)

and Markov chain Monte Carlo sampling to marginalize over the remaining, lesion-specific parameters \(\varvec{\theta }_{z}\) and \(\mathbf {h}\). For the purpose of finding the point estimates \(\mathbf {\hat{x}}\) and \(\varvec{\hat{\theta }}\), a simplified model is fitted to the data:

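$$\begin{aligned} \{ \hat{\mathbf {x}}, \hat{\varvec{\theta }}, \hat{\varvec{\theta }}_{z} \} = \arg \max _{\{ \mathbf {x}, \varvec{\theta }, \varvec{\theta }_{z} \}} p( \mathbf {x}, \varvec{\theta }, \varvec{\theta }_{z} | \mathbf {D} ) , \end{aligned}$$
(2)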

where the lesion-shape encoding VAE and its parameters \(\mathbf {h}\) are temporarily removed to simplify the optimization process. More details can be found in [16].

3 Longitudinal Extension

In the longitudinal setting we are given T scans with image intensities \(\{ \mathbf {D}_t \}_{t=1}^T\), and we wish to compute for each time point t the corresponding segmentation \(\{\mathbf {l}_t, \mathbf {z}_t \}\). In contrast to the cross-sectional setting where each image is treated independently, here we can exploit the fact that all images belong to the same subject to produce more consistent (and potentially more accurate) segmentations. Towards this end, we introduce subject-specific latent variables \(\mathbf {x}_0\) and \(\varvec{\theta }_0\) in the segmentation prior and likelihood, respectively, imposing a statistical dependency between the time points that encourages the segmentations to be similar to one another. The augmented generative model is depicted in Fig. 2, where the parameters \(\mathbf {x}_t, \mathbf {h}_t, \varvec{\theta }_t\) and \(\varvec{\theta }_{t,z}\) denote the model parameters of time point t, and the blue parts indicate the additional components compared to the cross-sectional model.

Segmentation Prior: In order to obtain temporal consistency in the segmentation prior, we use the concept of a “subject-specific atlas” [22]: a deformation of the cross-sectional atlas to represent the average subject-specific anatomy across all time points. In particular,

$$ p( \{ \mathbf {x}_t \}_{t=1}^T | \mathbf {x}_0 ) = \prod _{t=1}^T p( \mathbf {x}_t | \mathbf {x}_0 ) , \quad p( \mathbf {x}_{t} | \mathbf {x}_{0} ) \propto \exp \left[ - K_{1} \sum _d U_{d}(\mathbf {x}_{t}, \mathbf {x}_{0}) \right] , $$

where \(\mathbf {x}_0\) are latent atlas node positions encoding subject-specific brain shape, with prior \( p( \mathbf {x}_{0} ) \propto \exp \left[ - K_{0} \sum _d U_{d}(\mathbf {x}_{0}, \mathbf {x}_{ref}) \right] . \) Here the mesh stiffnesses \(K_0\) and \(K_1\) are hyperparameters of the model; by choosing \(K_0 = \infty \) (which pins \(\mathbf {x}_0\) to \(\mathbf {x}_{ref}\)) and \(K_1 = K\), the model reduces to the cross-sectional segmentation prior for each time point separately.

Likelihood: In a similar vein, we also introduce subject-specific latent variables to encourage temporal consistency in the Gaussian intensity models. For each anatomical structure, we condition the Gaussian parameters \(\{ \varvec{\mu }_{t,k}, \varvec{\varSigma }_{t,k} \}\) on latent variables \(\{ \varvec{\mu }_{0,k}, \varvec{\varSigma }_{0,k} \}\) using a normal-inverse-Wishart (NIW) distribution: \( p( \{ \varvec{\theta }_{t} \}_{t=1}^T | \varvec{\theta }_{0} ) = \prod _{t=1}^T p( \varvec{\theta }_{t} | \varvec{\theta }_{0} ) \) with

$$\begin{aligned} p( \varvec{\theta }_{t} | \varvec{\theta }_{0} ) \propto \prod _{k=1}^K \mathcal {N}( \varvec{\mu }_{t,k} | \varvec{\mu }_{0,k}, P_{0,k}^{-1} \varvec{\varSigma }_{t,k} ) \mathrm {IW}( \varvec{\varSigma }_{t,k} | P_{0,k} \varvec{\varSigma }_{0,k}, P_{0,k} - N - 2 ) . \end{aligned}$$

Here \(\varvec{\theta }_0 = \{ \varvec{\mu }_{0,k}, \varvec{\varSigma }_{0,k} \}_{k=1}^K\) with prior \(p(\varvec{\theta }_0) \propto 1\), and \(P_{0,k} \ge 0 \) is a hyperparameter that governs the strength of the regularization across time for label k. Note that choosing \(P_{0,k} = 0, \forall k\) yields the cross-sectional likelihood for each time point independently.

Segmentation: We follow the same overall segmentation strategy as in the cross-sectional setting: we first compute point estimates \(\{\varvec{\hat{\theta }}_t, \mathbf {\hat{x}}_t \}_{t=1}^T\) using a simplified model in which the lesion shape codes \(\{ \mathbf {h}_t \}_{t=1}^T\) are removed, and subsequently obtain segmentations as described in the cross-sectional setting, i.e., by using (1) for each time point separately. As in the cross-sectional case, we obtain the required point estimates by fitting the longitudinal model to the data:

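$$\begin{aligned} \{ \{ \hat{\varvec{\varOmega }}_t \}_{t=1}^T, \hat{\mathbf {x}}_0, \hat{\varvec{\theta }}_0 \} = \arg \max _{\{ \{ \varvec{\varOmega }_t \}_{t=1}^T, \mathbf {x}_0, \varvec{\theta }_0 \}} p( \{ \varvec{\varOmega }_t \}_{t=1}^T, \mathbf {x}_0, \varvec{\theta }_0 | \{ \mathbf {D}_t \}_{t=1}^T ) , \end{aligned}$$
(3)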

where \(\varvec{\varOmega }_t = \{ \mathbf {x}_t, \varvec{\theta }_t, \varvec{\theta }_{t,z} \}\). For optimizing (3) we use coordinate ascent, updating one variable at a time in an iterative fashion. Because \( p( \mathbf {x}_t | \mathbf {x}_0 ) \) is of the same form as the cross-sectional deformation prior, and the NIW distribution used in \( p( \varvec{\theta }_t | \varvec{\theta }_0 ) \) is the conjugate prior for the mean and variance of a Gaussian distribution, estimating \(\varvec{\varOmega }_t\) from \(\mathbf {D}_t\) for given values of \(\varvec{\theta }_0\) and \(\mathbf {x}_0\) simply involves performing an optimization of the form of (2) for each time point t separately. Conversely, for given values \(\{ \varvec{\varOmega }_t \}_{t=1}^T\) the update for \(\varvec{\theta }_0\) is given in closed form:

$$\begin{aligned} \varvec{\mu }_{0,k} \leftarrow \left( \sum _{t=1}^T \varvec{\varSigma }_{t,k}^{-1} \right) ^{-1} \sum _{t=1}^T \varvec{\varSigma }_{t,k}^{-1} \varvec{\mu }_{t,k} , \quad \varvec{\varSigma }^{-1}_{0,k} \leftarrow \left( \frac{1}{T} \sum _{t=1}^T \varvec{\varSigma }_{t,k}^{-1} \right) \frac{P_{0,k}}{P_{0,k} - N - 2} , \end{aligned}$$
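The closed-form update above can be sketched as follows for a single structure k. This is a minimal NumPy illustration rather than the released code; the function name `update_theta0`, the array layout, and the guard requiring \(P_{0,k} > N + 2\) (so that the scaling factor is positive) are assumptions of this sketch.

```python
import numpy as np


def update_theta0(mus, Sigmas, P0, N):
    """Closed-form update of (mu_0k, Sigma_0k) for one structure k.

    mus    : (T, N)    per-time-point means mu_{t,k}
    Sigmas : (T, N, N) per-time-point covariances Sigma_{t,k}
    P0     : hyperparameter P_{0,k}
    N      : number of contrasts
    """
    if P0 <= N + 2:
        # Guard added for this sketch: keeps P0 / (P0 - N - 2) positive.
        raise ValueError("this sketch requires P0 > N + 2")
    precisions = np.linalg.inv(Sigmas)  # batched inverse, shape (T, N, N)
    # Precision-weighted average of the per-time-point means.
    weighted_sum = np.einsum("tij,tj->i", precisions, mus)
    mu0 = np.linalg.solve(precisions.sum(axis=0), weighted_sum)
    # Average precision over time, rescaled by P0 / (P0 - N - 2).
    precision0 = precisions.mean(axis=0) * P0 / (P0 - N - 2)
    Sigma0 = np.linalg.inv(precision0)
    return mu0, Sigma0
```

For example, with one contrast, two time points with variances 1 and 4 and means 0 and 5, the updated mean is (0·1 + 5·0.25) / 1.25 = 1, i.e. it is pulled toward the time point with the smaller variance.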

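whereas updating \(\mathbf {x}_0\) involves the optimization (cf. [22])

$$ \mathbf {x}_0 \leftarrow \arg \max _{\mathbf {x}_0} \left[ \log p( \mathbf {x}_0 ) + \sum _{t=1}^T \log p( \mathbf {x}_t | \mathbf {x}_0 ) \right] , $$

which we solve numerically using a limited-memory BFGS algorithm.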

Implementation: In order to avoid longitudinal processing biases resulting from not treating all time points in exactly the same way, we first compute an unbiased within-subject template using an inverse consistent registration method [23]. This template is a robust representation of the average subject anatomy over time, and we use it as an unbiased reference to which all time points are registered in a preprocessing step. We also use it to initialize the proposed iterative algorithm for optimizing (3): we apply the cross-sectional method to the template, and use the estimated model parameters \(\varvec{\varOmega }\) to initialize \(\varvec{\varOmega }_t, t=1, \ldots , T\). The proposed algorithm, which interleaves updating the latent variables \(\varvec{\theta }_0\) and \(\mathbf {x}_0\) with updating the parameters \(\{ \varvec{\varOmega }_t \}_{t=1}^T\), is then run for five iterations, which we have found to be sufficient to reach convergence.

Based on initial pilot experiments on scans from the ADNI project\(^{1}\) (distinct from the ones used in the experiments below), we set the method’s hyperparameter values to \(K_1=14K\) and \(K_0=K\), where K is the mesh stiffness in the existing cross-sectional method, and \(P_{0,k}\) to the number of voxels assigned to class k in the segmentation of the within-subject template.

Our implementation builds upon the C++ and Python code of [16, 24], and is publicly available from FreeSurfer\(^{2}\). Segmenting one subject takes approximately 15 min per time point on an Intel i7-8700K processor (6 cores, 12 threads) with a GeForce GTX 1060 graphics card.

4 Experiments and Results

In order to assess whether introducing subject-specific latent variables leads to better longitudinal performance, we compared the proposed method and the cross-sectional method on three different datasets:

  • Test-retest [25]: This dataset consists of longitudinal T1-weighted (T1w) and FLuid-Attenuated Inversion Recovery (FLAIR) scans of 2 MS subjects. For each subject, 6 repeated scans were acquired on 3 different 3T scanners (Philips Achieva; Siemens Verio; GE Signa MR750) within 3 weeks.

  • Achieva: This dataset consists of longitudinal T1w and FLAIR scans of 86 MS subjects. The subjects were scanned between 3 and 6 times (with 6 to 12 months between scans) on a 3T Philips Achieva scanner at the Department of Neurology, School of Medicine, at the Technical University of Munich in the context of the in-house cohort study on MS named TUM-MS. All the subjects were diagnosed with relapsing-remitting MS.

  • ADNI: This dataset consists of longitudinal T1w scans of 135 subjects randomly selected from the ADNI project. Scanners from multiple sites were used to acquire the scans, and subjects were scanned between 2 and 6 times, with 6 or 12 months between scans. The subjects were divided into 3 groups: cognitively normal (CN, n=45), mild cognitive impairment (MCI, n=54), and Alzheimer’s disease (AD, n=36).

We report results on the estimated volumes of the following 26 regions: left and right cerebral white matter, cerebellum white matter, cerebral cortex, cerebellum cortex, lateral ventricle, hippocampus, thalamus, putamen, pallidum, caudate, amygdala and nucleus accumbens, as well as brain stem and lesions. To avoid clutter, results for left and right structures are averaged.

Temporal Consistency: We wished to assess whether the proposed method is able to reduce non-biological variations in longitudinal volume measurements, both within the short (\(<3\) weeks) and longer (\(<6\) years) time intervals of the test-retest and the Achieva datasets, respectively. For the test-retest dataset one can expect true biological changes to be minimal, and we therefore computed the coefficient of variation (ratio of the standard deviation to the mean) for each brain structure. The results, shown in Table 1, indicate that the longitudinal method indeed performs better in this respect than the cross-sectional one for almost all the structures.
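The coefficient of variation used for the test-retest comparison can be computed as in the following minimal sketch. The text does not state whether the sample or the population standard deviation is used; the sample version is assumed here, and the function name and the example volumes are purely illustrative.

```python
from statistics import mean, stdev


def coefficient_of_variation(volumes):
    """Coefficient of variation in percent for repeated volume measurements.

    Uses the sample standard deviation (divisor n - 1); this choice is an
    assumption of the sketch, not something stated in the text.
    """
    return 100.0 * stdev(volumes) / mean(volumes)


# Hypothetical volumes (arbitrary units) of one structure over 6 repeated scans.
example_volumes = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0]
example_cv = coefficient_of_variation(example_volumes)
```

A lower value indicates that repeated scans of the same subject yield more similar volume estimates.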

For the Achieva dataset, one may assume that the true change in volume of a structure over the span of a few years is approximately linear, except for lesions, whose temporal trajectory is affected by disease effects, with growing and shrinking lesions occurring at the same time. We therefore fitted, for each structure and for each subject, a linear regression model to the longitudinal volumes estimated by each method, and computed the ratio of the standard deviation of the residuals to the intercept (the time of the first scan is taken as zero). The results are summarized in Table 2 and indicate that the proposed model indeed yields generally better results in this respect.
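The deviation-from-linearity measure described above can be sketched in plain Python as follows. The function names are hypothetical, and the standard deviation of the residuals is computed with divisor n, which is an assumption since the text does not specify the estimator.

```python
from math import sqrt


def fit_line(times, values):
    """Ordinary least-squares fit values ~ intercept + slope * times."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    return slope, intercept


def deviation_from_linear(times, volumes):
    """Std of the regression residuals divided by the intercept, in percent.

    Times are shifted so that the first scan is at time zero, which makes
    the intercept the fitted volume at the first scan.
    """
    rel = [t - times[0] for t in times]
    slope, intercept = fit_line(rel, volumes)
    residuals = [v - (intercept + slope * t) for t, v in zip(rel, volumes)]
    # Assumption: population-style estimate (divisor n); least-squares
    # residuals with an intercept term have zero mean by construction.
    std = sqrt(sum(r * r for r in residuals) / len(residuals))
    return 100.0 * std / intercept
```

A perfectly linear trajectory gives a value of zero, and larger values indicate more scatter around the fitted line relative to the baseline volume.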

Fig. 2. Graphical representation of the proposed model. In black the existing cross-sectional method of [16] for each time point t; in blue the proposed additional latent variables for modeling temporal consistency between longitudinal scans with T time points. (Color figure online)

Table 1. Coefficients of variation in [%] on the test-retest dataset, both for the proposed longitudinal (“Long”) and the cross-sectional (“Cross”) method.
Table 2. Average deviation from a linear trajectory in [%] for volumetric measurements in the Achieva dataset.

Detecting Disease Effects: In order to ensure that the proposed method is not simply over-regularizing, we also assessed whether it can capture known group differences in the temporal evolution of certain brain structures better than the cross-sectional method. Towards this end, we compared the annualized percentage change (the slope of a linear regression model divided by its intercept) in the volume of the hippocampus between the CN, MCI and AD groups in the ADNI dataset. The results, shown in Fig. 3, indicate that the longitudinal method indeed detects these group differences better than the cross-sectional method.

Fig. 3. Annualized percentage change (APC) in the volume of the hippocampus for the three groups of the ADNI dataset. Statistical significance was computed with a Welch’s t-test and effect size with Cohen’s d.
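The quantities involved in this group comparison can be sketched as follows. This is an illustrative example only: the function names are hypothetical, time is assumed to be expressed in years since the first scan, Cohen's d is computed with the pooled standard deviation (one common variant; the text does not say which one was used), and only the Welch t statistic is computed, not its p-value.

```python
from math import sqrt
from statistics import mean, variance


def annualized_percentage_change(years, volumes):
    """Regression slope divided by the intercept, in percent per year."""
    n = len(years)
    t_mean = sum(years) / n
    v_mean = sum(volumes) / n
    sxx = sum((t - t_mean) ** 2 for t in years)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(years, volumes))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    return 100.0 * slope / intercept


def welch_t(a, b):
    """Welch's t statistic for two groups with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)
```

In such an analysis, `annualized_percentage_change` would be evaluated once per subject, and the resulting per-subject values of two groups (for instance CN and AD) would then be passed to `welch_t` and `cohens_d`.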

5 Discussion and Conclusion

In this paper we have proposed a novel method for the segmentation of longitudinal brain MRI scans of patients suffering from MS. The method is based on an existing cross-sectional method for simultaneous whole-brain and lesion segmentation, and leverages subject-specific latent variables to encourage segmentations across time points to be similar to each other. Preliminary results indicate that it is able to produce more consistent and reliable segmentations compared to the cross-sectional version, while being more sensitive to group differences. Future work will involve an extensive analysis of disease progression in different MS patient groups, as well as a more careful tuning of the hyperparameters of the model.