
1 Introduction

Multiple sclerosis (MS) is a chronic inflammatory disease of the central nervous system (CNS) that produces demyelination and axonal/neuronal damage [9]. The demyelinating process is associated with persistent inflammation throughout the CNS and, as a result, the demyelinated lesion, also known as plaque, is the main pathological feature of MS [1, 10]. Due to its high sensitivity, magnetic resonance imaging (MRI) is extensively used for the diagnosis and monitoring of MS [29]. However, manual segmentations by trained experts are tedious, time-consuming and often lack reproducibility [24]. For this reason, automated MS lesion segmentation techniques have been developed in recent years [29], and convolutional neural networks (CNNs) in particular have gained popularity for this task after proving their effectiveness in other neuroimaging tasks [26].

From the perspective of how the data are used to train a model, MS lesion segmentation algorithms can be classified as either longitudinal or cross-sectional. Longitudinal approaches make use of the temporal information provided by subsequent scans (known as time-points or visits) of the same patient. In cross-sectional approaches, all scans, even those belonging to the same patient, are treated as independent and no temporal information is considered. Only a few CNN-based approaches in the literature tackle MS lesion segmentation in a longitudinal manner.

The first CNN-based longitudinal method found in the literature was proposed in [6]. Although the CNN is used only to classify candidates extracted using intensity and atlas information, the method employs different time-points to perform the task. Another longitudinal method is described in [19], where the goal is to detect lesion load change using CNNs. To achieve this, a hybrid between the U-Net [23] and the Dense-Net [16] was used as the basis. A special type of loss function generates probabilities that are used together with the segmentation masks to obtain information about lesion change. Following this idea of detecting changes, a CNN is proposed in [27] for detecting new T2-weighted (T2-w) lesions. This architecture consists of a first block based on Voxelmorph [4] to learn deformation fields and register the baseline image to the follow-up images, and a second block to segment new lesions using the results of the first block.

One important issue when using longitudinal data is the normalization across time-points, or longitudinal normalization. The goal is to increase the similarity of the image intensities of corresponding tissue classes across time-points, without modifying structures whose changes are due to pathological conditions. MS lesions are an example of such structures, as they can persist, change or disappear over time [25]. A statistical normalization method is proposed in [28], in which all histograms are centered using statistical measures obtained from the white matter voxels. Other histogram-based methods use landmarks to perform the normalization across time-points [21]. Another longitudinal normalization method is presented in [25]; in this case, voxel intensity changes over time are modeled mathematically, and lesion priors are used to keep the lesion voxels unchanged.

Following the idea that including longitudinal information in CNNs can produce better segmentation results [6], this work proposes an effective pipeline based on the Chi-Square metric and convolutional long short-term memory (C-LSTM) [15] networks to exploit longitudinal MRI data for MS lesion segmentation. To the best of our knowledge, this is the first longitudinal MS lesion segmentation method that combines traditional CNNs with convolutional recurrent neural networks (C-RNN).

2 Method

2.1 Longitudinal Normalization

A simple yet effective MRI longitudinal normalization based on the Chi-Square metric \( \chi ^2 \) is proposed. The Chi-Square test is commonly used to analyze the difference between observed and expected distributions [30], but here only the metric is used, to measure and maximize the similarity between the histograms of volumes of the same modality across different patients and time-points. Equation 1 shows the Chi-Square metric as a means of comparing two histograms \(H_a\) and \(H_b\) over voxel intensities I.

Let s, t and m denote subject, time-point and modality, respectively. For each modality m, a reference volume \(V_{\hat{s}\hat{t}}^{(m)}\) is selected to normalize the other volumes \(V_{st}^{(m)}\) of that modality, with \( (s,t) \ne (\hat{s},\hat{t})\). For each \(V_{st}^{(m)}\), an optimal scalar \(\theta _{st}^{(m)}\) is found using Eq. 2, where \(H_{\hat{s}\hat{t}}^{(m)}\) and \(H_{st}^{(m)}\) are the histograms of the normal appearing white matter (NAWM) of \(V_{\hat{s}\hat{t}}^{(m)}\) and \(V_{st}^{(m)}\), respectively. The normalized images are the result of the product \(\theta _{st}^{(m)} V_{st}^{(m)}\). To obtain the NAWM masks, the Computational Anatomy Toolbox (CAT12) [13], applied within the Statistical Parametric Mapping (SPM12) [22] framework, was used.

$$\begin{aligned} \chi ^2(H_a, H_b) = \sum _{I} \frac{\left( H_a(I) - H_b(I)\right) ^2}{H_a(I) + H_b(I)} \end{aligned}$$
(1)
$$\begin{aligned} \theta _{st}^{(m)} = \mathop {\arg \min }\limits _{\theta }\; \chi ^2\!\left( H_{\hat{s}\hat{t}}^{(m)}(I),\, H_{st}^{(m)}(I/\theta )\right) \end{aligned}$$
(2)
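For illustration, a minimal NumPy sketch of Eqs. 1 and 2 is given below. The symmetric \(\chi^2\) variant, the number of histogram bins and the helper-function names are our assumptions for the sketch, not the authors' implementation.

```python
# Minimal sketch of Eqs. 1 and 2 (assumed symmetric chi-square form and
# 256-bin NAWM histograms; not the authors' implementation).
import numpy as np

def chi_square(h_a, h_b, eps=1e-12):
    """Eq. 1: chi-square distance between two histograms."""
    return np.sum((h_a - h_b) ** 2 / (h_a + h_b + eps))

def nawm_histogram(volume, nawm_mask, bins=256, rng=(0.0, 1.0)):
    """Normalized intensity histogram restricted to NAWM voxels."""
    h, _ = np.histogram(volume[nawm_mask > 0], bins=bins, range=rng, density=True)
    return h

def scaling_cost(theta, ref_vol, ref_nawm, vol, nawm, bins=256):
    """Cost of Eq. 2 for a candidate scalar theta applied to V_st."""
    rng = (0.0, 2.0 * max(float(ref_vol.max()), float(vol.max())))
    h_ref = nawm_histogram(ref_vol, ref_nawm, bins, rng)
    h_scaled = nawm_histogram(theta * vol, nawm, bins, rng)
    return chi_square(h_ref, h_scaled)
```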

2.2 Network Architecture

In order to exploit temporal information, an extension of the architecture presented in [20] is proposed for the segmentation of MS lesions from longitudinal multi-modal brain images. The architecture is a hybrid between the well-known U-Net and the convolutional long short-term memory (C-LSTM) network, and is shown in Fig. 1. The input patches have size (T, M, H, W, D), where T and M are the number of selected time-points processed per sample and the number of modalities, respectively, and H, W and D are the height, width and depth of the patches. Patches are extracted from the previously pre-processed volumes. As in the U-Net, an encoder extracts hierarchical features, but here they are extracted for each time-point separately. These features are then combined, at the deepest level and across all time-points, by a C-LSTM. After the features are processed by this first C-LSTM, a decoder upsamples them back to the input spatial dimensions. At the output of the decoder, a second C-LSTM again combines the features of the different time-points. Finally, the feature maps corresponding to a specific time-point (e.g. the middle one, if T is odd) are selected and a last convolution reduces the number of maps to two, one for each class (lesion or non-lesion).

Fig. 1.

U-Net ConvLSTM architecture. Patch dimensions are included in gray text, where T and M denote the number of selected time-points and number of modalities, respectively. H, W and D denote the spatial dimensions of the patches in the volumes. Dashed lines denote skip connections by copying and concatenation.
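To make the data flow of Fig. 1 concrete, the following simplified PyTorch sketch is provided. It is not the exact architecture of this work or of [20]: the two resolution levels, the channel counts and the basic 3D ConvLSTM cell are illustrative assumptions, and normalization layers and other details are omitted.

```python
# Simplified sketch of a U-Net + ConvLSTM hybrid for inputs of shape
# (B, T, M, H, W, D). Depth, channel counts and cell design are assumptions.
import torch
import torch.nn as nn

class ConvLSTM3dCell(nn.Module):
    """A single 3D convolutional LSTM cell applied over a sequence of volumes."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv3d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x_seq):                      # x_seq: (B, T, C, H, W, D)
        b, t, _, h, w, d = x_seq.shape
        hcur = x_seq.new_zeros(b, self.hid_ch, h, w, d)
        ccur = torch.zeros_like(hcur)
        outs = []
        for i in range(t):                         # iterate over time-points
            z = self.gates(torch.cat([x_seq[:, i], hcur], dim=1))
            f, in_, o, g = torch.chunk(z, 4, dim=1)
            ccur = torch.sigmoid(f) * ccur + torch.sigmoid(in_) * torch.tanh(g)
            hcur = torch.sigmoid(o) * torch.tanh(ccur)
            outs.append(hcur)
        return torch.stack(outs, dim=1)            # (B, T, hid_ch, H, W, D)

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class UNetConvLSTM(nn.Module):
    def __init__(self, modalities=4, base=16):
        super().__init__()
        self.enc1, self.enc2 = conv_block(modalities, base), conv_block(base, 2 * base)
        self.pool = nn.MaxPool3d(2)
        self.bottleneck_lstm = ConvLSTM3dCell(2 * base, 2 * base)
        self.up = nn.ConvTranspose3d(2 * base, base, 2, stride=2)
        self.dec1 = conv_block(2 * base, base)
        self.out_lstm = ConvLSTM3dCell(base, base)
        self.head = nn.Conv3d(base, 2, 1)          # lesion / non-lesion maps

    def forward(self, x):                          # x: (B, T, M, H, W, D)
        b, t, m, h, w, d = x.shape
        xt = x.reshape(b * t, m, h, w, d)          # shared encoder over time-points
        f1 = self.enc1(xt)
        f2 = self.enc2(self.pool(f1))
        f2 = f2.reshape(b, t, *f2.shape[1:])
        f2 = self.bottleneck_lstm(f2)              # combine time-points at deepest level
        f2 = f2.reshape(b * t, *f2.shape[2:])
        up = self.up(f2)
        dec = self.dec1(torch.cat([up, f1], dim=1))  # skip connection
        dec = dec.reshape(b, t, *dec.shape[1:])
        dec = self.out_lstm(dec)                   # combine time-points at output
        mid = dec[:, t // 2]                       # select the central time-point
        return self.head(mid)                      # (B, 2, H, W, D)
```

The reshaping between (B, T, C, ...) and (B·T, C, ...) applies the same encoder and decoder weights to every time-point, while the two ConvLSTM cells exchange information across time. For 32³ patches with T = 3 and M = 4, an input of shape (B, 3, 4, 32, 32, 32) yields an output of shape (B, 2, 32, 32, 32).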

2.3 Post-processing

After a segmentation is produced, a post-processing step is performed to exclude potential false-positive (FP) lesion detections. This is achieved by imposing a minimum lesion size of 3 \(\mathrm{mm}^3\), which has been found to improve the performance of MS lesion segmentation methods [12].
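A possible realization of this step is sketched below; it assumes SciPy connected-component labelling and an isotropic voxel spacing so that the 3 mm³ threshold translates into a voxel count, and is not necessarily the exact procedure used here.

```python
# Post-processing sketch: drop connected components smaller than a minimum
# lesion size. Assumes SciPy and isotropic voxels of known volume in mm^3.
import numpy as np
from scipy import ndimage

def remove_small_lesions(binary_mask, min_size_mm3=3.0, voxel_volume_mm3=1.0):
    labels, n = ndimage.label(binary_mask)                  # face-connected components
    sizes = ndimage.sum(binary_mask, labels, index=range(1, n + 1))
    keep = np.zeros(n + 1, dtype=bool)
    keep[1:] = np.asarray(sizes) * voxel_volume_mm3 >= min_size_mm3
    return keep[labels]                                     # cleaned boolean mask
```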

3 Experimental Setup

3.1 Dataset

One of the most popular datasets for MS lesion segmentation is the one provided by the longitudinal MS lesion segmentation challenge, which was part of the International Symposium on Biomedical Imaging (ISBI) in 2015 [17] and remains publicly available. The dataset, acquired with a 3T scanner, is subdivided into training (5 subjects) and testing (14 subjects) sets. Only the training set contains lesion segmentation masks, generated by two different expert raters. Each subject has between 4 and 6 time-points, each consisting of T1-weighted (T1-w) magnetization prepared rapid gradient echo (MPRAGE), T2-w and proton density weighted (PDw) images produced with a double spin echo (DSE) sequence, and fluid attenuated inversion recovery (FLAIR) images. The average time between subsequent time-points is 1 year [8, 17].

Both the original and the pre-processed images are available. For this work, the pre-processed images were used. The pre-processing steps are, in order: a first N4 bias-field correction, skull stripping, dura stripping, a second N4 correction, and registration to an isotropic MNI template. The proposed longitudinal normalization was applied to the images resulting from this pre-processing.

3.2 Longitudinal Normalization Configuration

One important step of the longitudinal normalization is the white matter segmentation, which was performed on each volume using CAT12 with default parameters. To find the values of \(\theta _{st}^{(m)}\), the first time-point of subject 01 was selected as the reference for each modality, and the Nelder–Mead simplex method [11] was employed to minimize the \(\chi ^2\) distance of Eq. 2.
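This minimization step could, for example, be implemented as follows; the sketch assumes SciPy's Nelder-Mead optimizer, the hypothetical scaling_cost helper sketched in Sect. 2.1 and an illustrative starting value, and is not the authors' code.

```python
# Illustrative use of Nelder-Mead to find theta_st^(m) (Eq. 2); assumes the
# scaling_cost helper sketched in Sect. 2.1 and SciPy's optimizer.
from scipy.optimize import minimize

def find_theta(ref_vol, ref_nawm, vol, nawm):
    res = minimize(
        lambda th: scaling_cost(th[0], ref_vol, ref_nawm, vol, nawm),
        x0=[1.0],                    # start from the identity scaling
        method="Nelder-Mead",
    )
    return float(res.x[0])           # normalized volume: res.x[0] * vol
```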

For comparison purposes, min-max normalization (Eq. 3) was used as the default normalization. This type of normalization, in which intensity values are mapped to the interval [0, 1], is a common pre-processing choice for MRI data [5, 7, 14]. In Eq. 3, \(I_{orig}\), \(I_{min}\) and \(I_{max}\) denote the original, minimum and maximum intensities of a volume, respectively, and \(I_{norm}\) is the assigned intensity.

$$\begin{aligned} I_{norm} = \frac{I_{orig} - I_{min} }{I_{max} - I_{min}} \end{aligned}$$
(3)

3.3 Training and Implementation Details

After normalizing the pre-processed images, a leave-one-out (subject-wise) cross-validation was performed on the training set. In each fold, one of the four subjects not used for testing was used for validation and the remaining three for training. The model was trained on 32 \(\times \) 32 \(\times \) 32 patches extracted with a step size of 16 \(\times \) 16 \(\times \) 16, and only from the brain region, using a brain mask generated by thresholding the pre-processed volumes. All four modalities were used (\(M=4\)) and three time-points were taken for each training sample (\(T=3\)). Temporal sampling was padded by repeating the first and last time-points. Training used the Adam optimizer [18] for a maximum of 200 epochs with an early stopping condition of 20 epochs and a batch size of 16. To deal with the class imbalance (far more normal tissue than lesion tissue), the Dice loss function was used. All models were trained with only one of the two available segmentation masks (mask1), and the validation subjects were randomly chosen once, yielding the order 03, 03, 04, 05, 02, and then kept the same for all experiments. No data augmentation was performed, in order to keep the different experiments comparable.
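For concreteness, a minimal sketch of two training ingredients specific to this setup, the temporal padding of the input sequence and a soft Dice loss, is given below. The exact Dice formulation and padding code are not published, so these are assumed standard variants in PyTorch.

```python
# Sketch of temporal padding (repeat first/last time-points) and a standard
# soft Dice loss; assumed formulations, not the authors' exact code.
import torch

def time_windows(time_points, T=3):
    """time_points: list of per-visit tensors. Returns one window of length T
    centred on each visit, padding the sequence by edge repetition."""
    pad = T // 2
    padded = [time_points[0]] * pad + list(time_points) + [time_points[-1]] * pad
    return [padded[i:i + T] for i in range(len(time_points))]

def dice_loss(logits, target, eps=1e-6):
    """logits: (B, 2, H, W, D); target: (B, H, W, D) with values in {0, 1}."""
    probs = torch.softmax(logits, dim=1)[:, 1]           # lesion-class probabilities
    target = target.float()
    inter = (probs * target).sum(dim=(1, 2, 3))
    denom = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()
```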

The proposed model was implemented in PyTorch, using an NVIDIA Tesla T4 GPU.

3.4 Evaluation

Four experiments were carried out to determine the advantages of the normalization procedure and of the longitudinal architecture: the cross-sectional version of the model (without ConvLSTM blocks and without the time dimension), which reduces to a standard 3D U-Net, with both min-max and the proposed normalization; and the proposed model (Fig. 1), also with both types of normalization.

To evaluate the performance of the method, the Dice score (DSC), lesion-wise false positive rate (LFPR) and lesion-wise true positive rate (LTPR) were used. The DSC is computed according to Eq. 4, where TP, FP and FN denote the numbers of true positive, false positive and false negative voxels, respectively. The LFPR is the number of lesions in the produced segmentation that do not overlap with any lesion in the ground truth, divided by the total number of lesions in the produced segmentation. The LTPR is the number of lesions in the ground truth that overlap with a lesion in the produced segmentation, divided by the total number of lesions in the ground truth [2].

$$\begin{aligned} DSC = \frac{2\times TP}{2\times TP + FP + FN} \end{aligned}$$
(4)
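These metrics can be computed from binary masks as in the following sketch; it assumes SciPy connected-component labelling and a "one overlapping voxel" rule, which may differ from the official challenge evaluation code.

```python
# Sketch of DSC, LTPR and LFPR from binary masks; connectivity and overlap
# rules are assumptions and may differ from the official evaluation.
import numpy as np
from scipy import ndimage

def dsc(pred, gt):
    """Eq. 4, with pred and gt as binary NumPy arrays."""
    tp = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * tp / denom if denom else 1.0

def lesion_wise_rates(pred, gt):
    pred_lab, n_pred = ndimage.label(pred)
    gt_lab, n_gt = ndimage.label(gt)
    # LTPR: ground-truth lesions overlapped by at least one predicted voxel
    ltp = sum(1 for i in range(1, n_gt + 1) if pred[gt_lab == i].any())
    # LFPR: predicted lesions that overlap no ground-truth lesion
    lfp = sum(1 for i in range(1, n_pred + 1) if not gt[pred_lab == i].any())
    ltpr = ltp / n_gt if n_gt else 0.0
    lfpr = lfp / n_pred if n_pred else 0.0
    return ltpr, lfpr
```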

4 Results

The results of the normalization process are presented in Fig. 2 for all pre-processed training subjects of the dataset.

Fig. 2.

Histograms of training cases for both considered normalization techniques. Only brain tissue is considered in the computation of the histograms.

Table 1 shows the mean values of the metrics for the four evaluated cases, computed over all time-points of all subjects. An increase in DSC and LTPR is observed, particularly when the longitudinal approach is combined with the proposed normalization. Figure 3 shows an example segmentation for which the higher DSC of the longitudinal approach corresponds to a visibly better segmentation than that of the cross-sectional (3D U-Net) version.

Table 1. Cross-validation comparison of the proposed longitudinal normalization and architecture with min-max normalization and the U-Net architecture, using “mask1” as ground truth. Standard deviations are given in parentheses.
Fig. 3.

Example segmentation masks from the cross-validation experiment for patient 01, slice 89 of the ISBI training dataset, for the cross-sectional and longitudinal models. Masks are shown on FLAIR images and t denotes the time-point index. The standard cross-sectional approach (second column) misses lesions, especially at \(t=2\). Each proposed technique improves the segmentation when applied individually (third and fourth columns), while the combined method (last column) yields the result closest to the ground truth.

5 Discussion and Conclusion

We have proposed a supervised longitudinal pipeline for MS lesion segmentation from MRI. The approach combines a whole-volume longitudinal normalization scheme with a patch-based 3D CNN architecture that exploits temporal information. The method was evaluated on data from the ISBI 2015 challenge, obtaining an improvement in DSC and LTPR with respect to a cross-sectional architecture and with respect to a simple normalization algorithm. Time padding reduced the loss of training samples that occurs when the time dimension is considered. Both intra- and inter-subject standard deviations of the DSC were reduced in comparison to the cross-sectional approach, indicating more temporally consistent segmentations. However, the LFPR increased for the proposed pipeline. This can be explained, to a certain extent, by the fact that in several cases (e.g. Fig. 3) the normalization and the C-RNN-based architecture allow the segmentation of lesions that are missed in the cross-sectional case; several of these newly detected lesions may correspond to a single lesion in the ground truth, so the LFPR is expected to increase. Ensembling several models could reduce the number of false-positive lesions while keeping high DSC and LTPR values. Compared to other approaches, for instance [7] with a DSC of 0.684 for the same cross-validation, or [2] with a DSC of 0.698, our DSC (0.699) is similar, but it is lower than the best reported result of 0.765 [3]. Considering the absence of data augmentation in our experiments, our approach offers promising results.

The proposed CNN architecture requires a longer training time (about 1.5 times that of the cross-sectional approach), which can be seen as a limitation, but the improvements in temporal consistency and in the segmentation metrics justify it. The increase in training time also strongly depends on the implementation of the C-LSTM, which in our case has not yet been optimized for runtime efficiency.

In comparison to other normalization methods such as histogram matching, the proposed method preserves the basic shape of the histograms, which prevents losing information about lesions and other structures. Optimizing the similarity directly reduces the problems that peak- or landmark-based methods can exhibit when the histograms differ strongly before normalization. We chose an approach based on a pre-segmented WM mask, assuming that normalizing the intensity of the tissue surrounding white matter lesions best supports the detection of the pathological lesions. This approach relies on a rough segmentation of the white matter before applying the CNN. However, the normalization can also be applied to the original histograms, at the cost of a higher influence of the lesion volume on the quality of the normalization.

The longitudinal normalization pre-processing increased the robustness of the trained network with respect to the histogram variations present in the ISBI 2015 training data. It is thus a promising technique to apply to MRI data from various sources, e.g. in the context of multi-center trials. Future work of our group will therefore include the validation of the algorithm on heterogeneous real-life data.

Diagnosis and treatment decisions based on lesion inspection in MRI data are central aspects of MS care. The clinical workflow also includes comparison with prior examinations to assess inflammatory activity. This process is tedious, requiring the review of more than 100 slices in high-resolution imaging, at least four modalities and several prior examinations. Still, common solutions for automated lesion segmentation do not rely on neural networks and are not typically applied in the clinical setting. Our work is therefore highly relevant, as it investigates ways to improve the state of the art in the important aspect of longitudinal analysis, in order to make longitudinal lesion segmentation applicable and reliable in clinical MS neuroimaging.