1 Introduction

Although longitudinal MRIs enable noninvasive tracking of the gradual effect of neurological diseases and environmental influences on the brain over time [23], the analysis is complicated by the complex covariance structure characterizing a mixture of time-varying and static effects across visits [8]. Therefore, training deep learning models on longitudinal data typically requires a large number of samples with accurate ground-truth labels, which are often expensive or infeasible to acquire for some neuroimaging applications [4].

Recent studies suggest that the issue of inadequate labels can be alleviated by self-supervised learning, the aim of which is to automatically learn representations by training on pretext tasks (i.e., tasks that do not require labels) before solving the supervised downstream tasks [14]. State-of-the-art self-supervised models are largely based on contrastive learning [5, 10, 17, 19, 21], i.e., learning representations by teaching models the difference and similarity of samples. For example, prior studies have generated or identified similar or dissimilar sample pairs (also referred to as positive and negative pairs) based on data augmentation [6], multi-view analysis [21], and organizing samples in a lookup dictionary [11]. Enforcing such an across-sample relationship in the learning process can then lead to more robust high-level representations for downstream tasks [16].

Despite the promising results of contrastive learning on cross-sectional data [2, 7, 15], it remains unclear whether these concepts can be successfully applied to longitudinal neuroimaging data. In this work, we propose a self-supervised learning model for longitudinal data by exploring the similarity between ‘trajectories’. Specifically, the longitudinal MRIs of a subject acquired at multiple visits characterize gradual aging and disease progression of the brain over time, which manifests as a temporal progression trajectory when projected to the latent space. Subjects with similar brain appearances are likely to exhibit similar aging trajectories. As a result, the trajectories from a cohort should collectively form a smooth trajectory field that characterizes the morphological change of the brain development over time. We hypothesize that regularizing such smoothness in a self-supervised fashion can result in a more informative latent space representation, thereby facilitating further analysis of healthy brain aging and effects of neurodegenerative diseases.

To achieve the smooth trajectory field, we build a dynamic graph in each training iteration to define a neighborhood in the latent space for each subject. The graph then connects nearby subjects and enforces their progression directions to be maximally aligned (Fig. 1). As such, the resulting latent space captures the global complexity of the progression while maintaining the local continuity of the nearby trajectory vectors. We name the trajectory vectors learned from the neighborhood as Longitudinal Neighbourhood Embedding (LNE).

We evaluate our method on two longitudinal structural MRI datasets: the first consists of 274 healthy subjects with ages ranging from 20 to 90, and the second is composed of 632 subjects from ADNI to analyze the progression trajectory of Normal Control (NC), static Mild Cognitive Impairment (sMCI), progressive Mild Cognitive Impairment (pMCI), and Alzheimer’s Disease (AD). On these datasets, the visualization of the latent space in a 2D space confirms that the smooth trajectory vector field learned by the proposed method encodes continuous variation with respect to brain aging. When evaluated on downstream tasks, we obtain higher squared-correlation (R2) in age regression and better balanced accuracy (BACC) in ADNI classifications using our pre-trained model compared to alternative self-supervised or unsupervised pre-trained models.

2 Method

We now describe LNE, a method that smooths trajectories in the latent space by local neighborhood embedding. While trajectory regularization has been explored in 2D or 3D spaces (e.g., pedestrian trajectory [1, 9] and in non-rigid registration [3]), there are several challenges in the context of longitudinal MRI analysis: (1) each trajectory is measured on sparse and asynchronous (e.g., not aligned by age or visit time) time points; (2) the trajectories live in a high-dimensional space rather than a regular 2D or 3D grid space; (3) the latent representations are defined in a variant latent space that is iteratively updated. To resolve these challenges, we first propose a strategy to train based on pairwise data and translate the trajectory-regularization problem to the estimation of a smooth vector field, which is then solved by longitudinal neighbourhood embedding on dynamic graphs.

Pairwise Training Strategy. As shown in Fig. 1, each subject is associated with a trajectory (blue vectors) across multiple visits (\(\ge \)2) in the latent space. To overcome the problem of the small number of time points in each trajectory, we propose to discretize a trajectory into multiple vectors defined by pairs of images. Compared to using the whole trajectory of sequential images as a training sample as typically done by recurrent neural networks [18], this pairwise strategy substantially increases the number of training samples. To formalize this operation, let \(\mathcal {X}\) be the collection of all MR images and \(\mathcal {S}\) be the set of subject-specific image pairs; i.e., \(\mathcal {S}\) contains all \((x^t, x^s)\) that are from the same subject with \(x^t\) scanned before \(x^s\). These image pairs are then the input to the Encoder-Decoder structure shown in Fig. 1. The latent representations generated by the encoder are denoted by \(z^t=F(x^t)\), \( z^s=F(x^s)\), where F is the encoder. Then, \(\varDelta z^{(t,s)} = (z^s - z^t) / \varDelta t^{(t,s)}\) is formulated as the normalized trajectory vector, where \(\varDelta t^{(t,s)}\) is the time interval between the two scans. All \(\varDelta z^{(t,s)}\) in the cohort define the trajectory vector field. The latent representations are then used to reconstruct the input images by the decoder H, i.e., \(\tilde{x}^t=H(z^t)\), \( \tilde{x}^s=H(z^s)\).
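
The pairwise strategy above can be illustrated with a minimal sketch. The helper names `build_pairs` and `trajectory_vector`, the toy visit records, and the choice to skip pairs with a zero time interval are assumptions made for illustration; they are not taken from the original implementation.

```python
from itertools import combinations


def build_pairs(visits):
    """Enumerate the set S of same-subject image pairs (earlier scan, later scan).

    `visits` maps a subject id to a list of (age_in_years, image_id) tuples.
    Each returned tuple is (subject, earlier_image, later_image, delta_t).
    """
    pairs = []
    for subject, scans in visits.items():
        ordered = sorted(scans)  # sort by age so the earlier scan comes first
        for (t_age, t_img), (s_age, s_img) in combinations(ordered, 2):
            if s_age > t_age:  # assumption: skip pairs with zero time interval
                pairs.append((subject, t_img, s_img, s_age - t_age))
    return pairs


def trajectory_vector(z_t, z_s, delta_t):
    """Normalized trajectory vector (z^s - z^t) / delta_t."""
    return [(b - a) / delta_t for a, b in zip(z_t, z_s)]


# Toy cohort: a subject with three visits contributes three pairs,
# a subject with a single visit contributes none.
visits = {
    "sub01": [(70.0, "a"), (72.0, "b"), (75.0, "c")],
    "sub02": [(60.0, "d")],
}
pairs = build_pairs(visits)
dz = trajectory_vector([0.0, 1.0], [2.0, 5.0], 2.0)
```

A subject with n visits yields up to n(n-1)/2 pairs, which is the source of the increase in training samples relative to treating each whole trajectory as one sample.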

Fig. 1. Overview of the proposed method: an encoder projects a subject-specific image pair \((x^t, x^s)\) into the latent space resulting in a trajectory vector (cyan). We encourage the direction of this vector to be consistent with \(\varDelta h\) (purple), a vector pooled from the neighborhood of \(z^t\) (blue circle). As a result, the latent space encodes the global morphological change linked to aging (red curve). (Color figure online)

Longitudinal Neighbourhood Embedding. Inspired by social pooling in pedestrian trajectory prediction [1, 9], we model the similarity between each subject-specific trajectory vector and those from its neighbourhood to enforce the smoothness of the entire vector field. As the high-dimensional latent space cannot be defined by a fixed regular grid (e.g., a 2D image grid space), we propose to define the neighbourhood by building a directed graph \(\mathcal {G}\) in each training iteration for the variant latent space that is iteratively updated. The position of each node is defined by the starting point \(z^t\) of the vector \(\varDelta z\) and the value (representation) of that node is \(\varDelta z\) itself. For each node i, Euclidean distances to other nodes \(j\ne i\) are computed by \(P_{i,j} = \parallel z^t_i - z^t_j\parallel _2\), and the \(N_{nb}\) closest nodes of node i form its 1-hop neighbourhood \(\mathcal {N}_i\) with edges connected to i. The adjacency matrix A for \(\mathcal {G}\) is then defined as:

$$\begin{aligned} A_{i,j} := {\left\{ \begin{array}{ll} \exp \left( -\frac{P_{i,j}^2}{2\sigma _i^2}\right) , &{} j \in \mathcal {N}_i\\ 0, &{} j \notin \mathcal {N}_i \end{array}\right. } \\ \text{ with } \sigma _i := \max _{j \in \mathcal {N}_i} P_{i,j} - \min _{j \in \mathcal {N}_i} P_{i,j}. \end{aligned}$$
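
As a rough illustration of this definition, the sketch below builds the adjacency matrix for a handful of latent points using only the Python standard library. The function name `adjacency` and the toy coordinates are hypothetical. The fallback weight of 1 when all neighbours are equidistant (so that \(\sigma _i = 0\)) is an assumption added here to avoid a division by zero; the text does not specify how that case is handled.

```python
import math


def adjacency(z, n_nb):
    """Directed k-nearest-neighbour graph with Gaussian edge weights.

    `z` is a list of latent starting points z^t (one list of floats per node);
    `n_nb` is the neighbourhood size N_nb.
    """
    n = len(z)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Euclidean distances P_ij to every other node j != i
        dists = sorted((math.dist(z[i], z[j]), j) for j in range(n) if j != i)
        neighbours = dists[:n_nb]  # the N_nb closest nodes form N_i
        d_vals = [d for d, _ in neighbours]
        sigma = max(d_vals) - min(d_vals)
        for d, j in neighbours:
            if sigma > 0:
                A[i][j] = math.exp(-d * d / (2 * sigma * sigma))
            else:
                A[i][j] = 1.0  # assumption: equidistant neighbours get weight 1
    return A


# Four points on a line; node 3 is an outlier.
z = [[0.0], [1.0], [3.0], [10.0]]
A = adjacency(z, n_nb=2)
```

Because neighbourhoods are selected per node, the graph is directed: in the toy example node 3 links to node 2, but node 2 does not link back to node 3.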

Next, we aim to impose a smoothness regularization on this graph-valued vector field. Motivated by the graph diffusion process [13], we regularize each node’s representation by a longitudinal neighbourhood embedding \(\varDelta h\) ‘pooled’ from the neighbours’ representations. For node i, the neighbourhood embedding can be computed by:

$$\begin{aligned} \varDelta h_i := \sum _{j \in \mathcal {N}_i} A_{i,j} D^{-1}_{i,i} \varDelta z_j, \end{aligned}$$

where D is the ‘out-degree matrix’ of graph \(\mathcal {G}\), a diagonal matrix that describes the sum of the weights for outgoing edges at each node. As shown in Fig. 1, the blue circle illustrates the above operation of learning the neighbourhood embedding that is shown by the purple arrow.
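
The pooling step can be sketched as follows, reading the normalization as division by the out-degree of node i (the sum of its outgoing edge weights), in line with D being diagonal. The function name `neighbourhood_embedding` and the toy matrices are illustrative assumptions, and the sketch assumes every node has at least one outgoing edge.

```python
def neighbourhood_embedding(A, dz):
    """Pool neighbours' trajectory vectors into one embedding per node.

    `A` is the weighted adjacency matrix (list of rows); `dz` holds one
    trajectory vector per node. Row i of the result is
    sum_j A[i][j] * dz[j] / D_ii, where D_ii is the out-degree of node i.
    """
    dh = []
    for i, row in enumerate(A):
        degree = sum(row)  # D_ii; assumed > 0 (at least one neighbour)
        dim = len(dz[i])
        pooled = [0.0] * dim
        for j, weight in enumerate(row):
            if weight > 0.0:
                for k in range(dim):
                    pooled[k] += weight * dz[j][k]
        dh.append([p / degree for p in pooled])
    return dh


# Toy graph with three nodes and 2D trajectory vectors.
A = [[0.0, 1.0, 3.0],
     [2.0, 0.0, 2.0],
     [1.0, 0.0, 0.0]]
dz = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dh = neighbourhood_embedding(A, dz)
```

Each pooled vector is a convex combination of the neighbours' vectors, so it summarizes the local progression direction without involving the node's own vector.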

Objective Function. As shown in [24], the speed of brain aging is already highly heterogeneous within a healthy population, and subjects with neurodegenerative diseases may exhibit accelerated aging. Therefore, instead of replacing \(\varDelta z\) with \(\varDelta h\), we define \(\theta _{\langle \varDelta z,\varDelta h \rangle }\) as the angle between \(\varDelta z\) and \(\varDelta h\), and only encourage \(\cos (\theta _{\langle \varDelta z,\varDelta h \rangle }) = 1\), i.e., a zero-angle between the subject-specific trajectory vector and the pooled trajectory vector that represents the local progression direction. As such, it enables the latent representations to model the complexity of the global progression trajectory as well as the consistency of the local trajectory vector field. To impose the direction constraint in the autoencoder, we propose to add this cosine loss for each image pair to the standard mean squared error loss, i.e.,

$$\begin{aligned} L := \mathbf {E}_{(x^t, x^s) \sim \mathcal {S}} \left( \parallel x^t - \tilde{x}^t \parallel _2^2 + \parallel x^s - \tilde{x}^s \parallel _2^2 - \lambda \cdot \cos (\theta _{\langle \varDelta z,\varDelta h \rangle })\right) , \end{aligned}$$

with \(\lambda \) being the weighting parameter and \(\mathbf {E}\) denoting the expected value. The objective function encourages the low-dimensional representation of the images to be informative while maintaining a smooth progression trajectory field in the latent space. As the cosine loss is only locally imposed, the global trajectory field can be non-linear, which relaxes the strong assumption in prior studies (e.g., LSSL [24]) that aging must define a globally linear direction in the latent space. Note, our method can be regarded as a contrastive self-supervised method. For each node, the samples in its neighbourhood serve as positive pairs with the cosine loss being the corresponding contrastive loss.
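
For a single image pair, the term inside the expectation can be written out as in the sketch below. Images are represented as flat lists of intensities for simplicity, and the names `cosine` and `pair_loss` are hypothetical; in practice the same quantity would be computed on tensors and averaged over a mini-batch.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(x_t, x_t_rec, x_s, x_s_rec, dz, dh, lam=1.0):
    """Reconstruction error of both images minus the weighted cosine term."""
    rec_t = sum((a - b) ** 2 for a, b in zip(x_t, x_t_rec))
    rec_s = sum((a - b) ** 2 for a, b in zip(x_s, x_s_rec))
    return rec_t + rec_s - lam * cosine(dz, dh)


# Perfect reconstruction with aligned directions reaches the minimum -lambda.
best = pair_loss([1.0, 2.0], [1.0, 2.0], [0.0], [0.0], [1.0, 0.0], [2.0, 0.0])
# A reconstruction error of 1 with orthogonal directions gives a loss of 1.
worse = pair_loss([1.0, 2.0], [1.0, 1.0], [0.0], [0.0], [1.0, 0.0], [0.0, 3.0])
```

Only the direction of \(\varDelta z\) enters the cosine term, so the length of the trajectory vector (the speed of change) remains free to differ across subjects.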

3 Experiments

Dataset. To show that LNE can successfully disentangle meaningful aging information in the latent space, we first evaluated the proposed method on predicting age from 582 MRIs of 274 healthy individuals with ages ranging from 20 to 90. Each subject had 1 to 13 scans with an average of 2.3 scans spanning an average time interval of 3.8 years. The second dataset comprised 2389 longitudinal T1-weighted MRIs (at least two visits per subject) from ADNI, which consisted of 185 NC (age: 75.57 ± 5.06 years), 119 subjects with AD (age: 75.17 ± 7.57 years), 193 subjects diagnosed with sMCI (age: 75.63 ± 6.62 years), and 135 subjects diagnosed with pMCI (age: 75.91 ± 5.35 years). There was no significant age difference between the NC and AD cohorts (p = 0.55, two-sample t-test) or between the sMCI and pMCI cohorts (p = 0.75). All longitudinal MRIs were preprocessed by a pipeline composed of denoising, bias field correction, skull stripping, affine registration to a template, re-scaling to a \(64\times 64\times 64\) volume, and transforming image intensities to z-scores.

Implementation Details. Let C\(_k\) denote a Convolution(kernel size of \(3\times 3\times 3\))-BatchNorm-LeakyReLU(slope of 0.2)-MaxPool(kernel size of 2) block with k filters, and CD\(_k\) a Convolution-BatchNorm-LeakyReLU-Upsample block. The architecture was designed as C\(_{16}\)-C\(_{32}\)-C\(_{64}\)-C\(_{16}\)-CD\(_{64}\)-CD\(_{32}\)-CD\(_{16}\)-CD\(_{16}\) with a convolution layer at the top for reconstruction. The regularization weight was set to \(\lambda =1.0\). The networks were trained for 50 epochs by the Adam optimizer with a learning rate of \(5 \times 10^{-4}\) and weight decay of \(10^{-5}\). To make the algorithm computationally efficient, we built the graph dynamically on the mini-batch of each iteration. A batch size \(N_{bs}=64\) and neighbour size \(N_{nb}=5\) were used.
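
As a quick sanity check on the encoder layout, the size of the bottleneck can be worked out from the block definitions. The sketch assumes that each \(3\times 3\times 3\) convolution is padded so that only the max-pooling halves the spatial size; the padding scheme is not stated in the text, and the function name `bottleneck_dim` is hypothetical.

```python
def bottleneck_dim(input_side=64, encoder_filters=(16, 32, 64, 16), pool=2):
    """Spatial side length and flattened size of the encoder output.

    Assumes padded convolutions, so each C_k block only changes the
    spatial size through its max-pooling layer.
    """
    side = input_side
    for _ in encoder_filters:
        side //= pool  # each block halves every spatial dimension
    return side, encoder_filters[-1] * side ** 3


side, dim = bottleneck_dim()
```

Under this assumption a \(64\times 64\times 64\) input is reduced to a \(4\times 4\times 4\) volume with 16 channels, i.e., 1024 values, which agrees with the 1024-dimensional bottleneck representation used in the evaluation.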

Evaluation. Five-fold cross-validation (folds split based on subjects) was conducted with 10% of the training subjects used for validation. Random flipping of brain hemispheres, and random rotation and shift were used as augmentation during training. We first qualitatively illustrated the trajectory vector field (\(\varDelta z\)) in 2D space by projecting the 1024-dimensional bottleneck representations (\(z^t\) and \(z^s\)) to their first two principal components. We then estimated the global trajectory of the vector field by a curve fitted by a robust linear mixed effect model, which considered a quadratic fixed effect with random effect of intercepts. We further quantitatively evaluated the quality of the representations by using them for downstream tasks. Note, for these experiments, we removed the decoder and only kept the encoder with its pre-trained weights for the downstream tasks. On the dataset of healthy subjects, we used the representation z to predict the chronological age of each MRI to show that our latent space was stratified by age. Note, learning a prediction model for normal aging is an emerging approach for understanding structural changes of the human brain and quantifying the impact of neurological diseases (e.g., estimating brain age gap [20]). R2 and root-mean-square error (RMSE) were used as accuracy metrics. For ADNI, we predicted the diagnosis group associated with each image pair based on both z and trajectory vector \(\varDelta z\) to highlight the aging speed between visits (an important marker for AD). In addition to classifying NC and AD, we also aimed to distinguish pMCI from sMCI, a significantly more challenging classification task.
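
Splitting folds by subject rather than by scan keeps all visits of a subject on the same side of the split. A minimal sketch of such a split is given below; the function name `subject_folds`, the fixed random seed, and the round-robin assignment to folds are illustrative assumptions rather than the procedure actually used.

```python
import random


def subject_folds(subject_ids, n_folds=5, val_fraction=0.1, seed=0):
    """Subject-level cross-validation splits with a validation hold-out.

    Returns one dict per fold with disjoint 'train', 'val' and 'test'
    lists of subject ids; all scans of a subject follow its id.
    """
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[k::n_folds] for k in range(n_folds)]  # round-robin assignment
    splits = []
    for k in range(n_folds):
        rest = [s for idx, fold in enumerate(folds) if idx != k for s in fold]
        n_val = max(1, round(val_fraction * len(rest)))
        splits.append({"train": rest[n_val:], "val": rest[:n_val], "test": folds[k]})
    return splits


# 274 hypothetical subject ids, matching the size of the healthy cohort.
splits = subject_folds(range(274))
```

Image pairs would then be formed only within each partition, so that no subject contributes pairs to both training and testing.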

The classifier was designed as a multi-layer perceptron containing two fully connected layers of dimension 1024 and 64 with LeakyReLU activation. In a separate experiment, we fine-tuned the LNE representation by incorporating the encoder into the classification models. We compared the BACC (accounting for the different number of training samples in each cohort) to models using the same architecture with encoders pre-trained by other representation learning methods, including unsupervised methods (AE, VAE [12]), a self-supervised method (SimCLR [6], in which images of two visits of the same subject with simple shift and rotation augmentation were used as a positive pair), and a longitudinal self-supervised method (LSSL [24]).
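
Balanced accuracy is the mean of the per-class recalls, which is why it compensates for unequal cohort sizes. The sketch below illustrates the metric on made-up labels; the function name `balanced_accuracy` and the toy predictions are for illustration only and do not reproduce any result from the experiments.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels: 4 sMCI and 2 pMCI; one pMCI case is misclassified as sMCI.
y_true = ["sMCI"] * 4 + ["pMCI"] * 2
y_pred = ["sMCI"] * 4 + ["sMCI", "pMCI"]
bacc = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the plain accuracy is 5/6, whereas the balanced accuracy is 0.75, since the single error falls in the smaller class.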

Fig. 2. Experiments on healthy aging: Latent space of AutoEncoder (AE) (a) and the proposed LNE (b) projected into 2D PCA space of \(z^t\) and \(z^s\). Arrows represent \(\varDelta z\) and are color-coded by the age of \(z^t\). The global trajectory in (b) is fitted by a robust linear mixed effect model (red curve). (Color figure online)

Fig. 3. Experiments on ADNI: (a) The age distribution of the latent space. Lines connecting \(z^t\) and \(z^s\) are color-coded by the age of \(z^t\); Red curve is the global trajectory fitted by a robust linear mixed effect model. (b) Trajectory vector field color-coded by diagnosis groups; (c) The norm of \(\varDelta z\) encoding the speed of aging for 4 different diagnosis groups. (Color figure online)

3.1 Healthy Aging

Figure 2 illustrates the trajectory vector field derived on one of the 5 folds by the proposed method (Fig. 2(b)). We observe that LNE resulted in a smooth vector field that was in line with the fitted global trajectory shown by the red curve in Fig. 2(b). Moreover, chronological age associated with the vectors (indicated by the color) gradually increased along the global trajectory (red curve), indicating the successful disentanglement of the aging effect in the latent space. Note, such continuous variation in the whole age range from 20 to 90 was solely learned by self-supervised training on image pairs with an average age interval of 3.8 years (without using their age information). Interestingly, the length of the vectors tended to increase along the global trajectory, suggesting a faster aging speed for older subjects. In contrast, without regularizing the longitudinal changes, AE did not lead to clear disentanglement of brain age in the space (Fig. 2(a)).

As shown in Table 1 (left), we utilized the latent representation z to predict the chronological age of the subject. In the scenario that froze the encoder, the proposed method achieved the best performance with an R2 of 0.62, which was significantly better (\(p<0.01\), t-test on absolute errors) than the second-best method LSSL with an R2 of 0.59. In addition to R2, the RMSE metrics are given in the supplement Table S1, which also suggests that LNE achieved the most accurate prediction. These results align with the expectation that a pre-trained self-supervised model that explicitly models the aging effect can lead to better downstream supervised age prediction. Lastly, when we fine-tuned the encoder during training, LNE remained the most accurate method, tied with LSSL (both achieved an R2 of 0.74).

Table 1. Supervised downstream tasks in frozen or fine-tuned scenarios. Left: Age regression on healthy subjects with R2 as an evaluation metric. Right: classification on ADNI dataset with BACC as the metric.

3.2 Progression of Alzheimer’s Disease

We also evaluated the proposed method on the ADNI dataset. All 4 cohorts (NC, sMCI, pMCI, AD) were included in the training of LNE as the method was impartial to diagnosis groups (did not use labels for training). Similar to the results of the prior experiment, the age distribution in the latent space in Fig. 3(a) suggests a continuous variation with respect to brain development along the global trajectory shown by the red curve. We further illustrated the trajectory vector field by diagnosis groups in Fig. 3(b). While the starting points (\(z^t\)) of different diagnosis groups mixed uniformly in the field, vectors of AD (pink) and pMCI (brown) were longer than those of NC (cyan) and sMCI (orange). This suggests that LNE stratified the cohorts by their ‘speed of aging’ rather than age itself, highlighting the importance of using longitudinal data for analyzing AD and pMCI. This observation was also evident in Fig. 3(c), where AD and pMCI had a statistically larger norm of \(\varDelta z\) than the other two cohorts (both with \(p<0.01\)). This finding aligned with previous AD studies [22] suggesting that the AD group has an accelerated aging effect compared to the NC group, and so does the pMCI group compared to the sMCI group.

The quantitative results on the downstream supervised classification tasks are shown in Table 1 (right). As the length of \(\varDelta z\) was shown to be informative, we concatenated \(z^t\) with \(\varDelta z\) as the feature for classification (classification accuracy based on \(z^t\) only is reported in the supplement Table S2). The representations learned by the proposed method yielded significantly more accurate predictions than all baselines (\(p<0.01\), DeLong’s test). Note that the accuracy of our model with the frozen encoder closely matched that of the other methods after fine-tuning. This was to be expected because only our method and LSSL explicitly modeled the longitudinal effects, which led to more informative \(\varDelta z\). In addition, our method, which focused on local smoothness, could capture the potentially non-linear effects underlying the morphological change along time, while the ‘global linearity’ assumption in LSSL may lead to information loss in the representations. It is worth mentioning that reliably distinguishing the subjects that will eventually develop AD (pMCI) from other MCI subjects (sMCI) is crucial for timely treatment. To this end, Supplement Table S3 suggests LNE improved over prior studies in classifying sMCI vs. pMCI, highlighting the potential clinical value of our method. An ablation study on two important hyperparameters \(N_{nb}\) and \(\lambda \) is reported in the supplement Table S4.

4 Conclusion

In this work, we proposed a self-supervised representation learning framework, called LNE, that incorporates advantages from the repeated measures design in longitudinal neuroimaging studies. By building the dynamic graph and learning longitudinal neighbourhood embedding, LNE yielded a smooth trajectory vector field in the latent space, while maintaining a globally consistent progression trajectory that modeled the morphological change of the cohort. It successfully modeled the aging effect on healthy subjects, and enabled better chronological age prediction compared to other self-supervised methods. Although LNE was trained without the use of diagnosis labels, it demonstrated the capability of differentiating diagnosis groups on the ADNI dataset based on the informative trajectory vector field. When evaluated on the downstream classification task, it also showed superior quantitative performance.