Introduction

The extensive use of magnetic resonance imaging (MRI) to investigate pathology in the brain entails identification of specific regions of interest (ROI) for quantitative analysis. Accurate manual tracing of deep brain structures, such as the thalamus and hippocampus, demands a high level of tracer expertise and preferably standardized segmentation protocols. Introducing automatic or semi-automatic techniques into post-processing pipelines accelerates data analysis and offers reproducible and consistent decisions across datasets in large studies, which is crucial for obtaining reliable results [1].

Several software solutions for automatic segmentation are publicly available. Software packages frequently used in clinical research include the “Oxford Centre for Functional MRI of the Brain” (FMRIB) Software Library (FSL) (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/), FreeSurfer (http://surfer.nmr.mgh.harvard.edu) and Statistical Parametric Mapping (SPM) (http://www.fil.ion.ucl.ac.uk).

The segmentation techniques applied in FSL, FreeSurfer and SPM are model-based methods. In highly variable data, such as MRI of the brain, it may be difficult for the segmentation tools to model the ROIs with sufficient accuracy, even when the techniques are trained on representative datasets. To address this, multi-atlas label fusion has been suggested and has demonstrated excellent segmentation abilities [2–4]. Label fusion relies on a representative image library with corresponding validated structure segmentations (atlases). Recently, multi-atlas segmentation techniques, such as patch-based segmentation, have become popular [5, 6]. Patch-based methods have the advantage of requiring a smaller training library compared to regular label fusion and are therefore relatively easy to implement in a local setting [5, 7]. Even though these improvements in segmentation algorithms have demonstrated highly accurate morphometric results, most of the novel approaches are still not publicly available and are therefore less used in clinical research. Moreover, the impact of segmentation accuracy on quantification of parameters from other imaging modalities, such as diffusion and perfusion MRI, is not well studied.

Quantitative diffusion tensor imaging (DTI) is widely used to investigate microstructural changes in tissue. In diseases that cause subtle microstructural changes, such as mild traumatic brain injury (mTBI), there is a need for sensitive biomarkers in clinically relevant areas of the brain. Thalamus and hippocampus are two deep brain structures, where previous DTI studies have shown microstructural changes linked to cognitive impairment [8, 9], stress [10] and headache [11]. Segmentation directly on the DTI maps is prone to inconsistency and bias, as DTI provides limited anatomical information. Unbiased and automatic studies rely on accurate T1-weighted (T1w) segmentation and co-registration for obtaining quantitative measurements within relevant brain regions. Thus, it is highly relevant to investigate the impact of automatic segmentation accuracy on these quantitative measures.

Patch-based segmentation methods [5] perform well on conventional T1w images, such as Magnetization Prepared Rapid Acquisition Gradient Echo (MPRAGE) [12]. To the best of our knowledge, the accuracy of different automated segmentation methods has not yet been compared using T1w images from the recently proposed MP2RAGE sequence, which significantly reduces the intensity bias and provides superior grey matter (GM) to white matter (WM) contrast [13].

In this study, we compared the performance of a multi-atlas, patch-based segmentation method, as implemented in the online software platform volBrain (with two different training libraries), to three widely applied methods implemented in FSL, FreeSurfer and SPM. We used manual segmentation as the gold standard and measured the segmentation accuracy of thalamus and hippocampus when imaged with MP2RAGE. Additionally, we applied the segmented masks of thalamus and hippocampus on co-registered fractional anisotropy (FA) and mean diffusivity (MD) maps for the purpose of evaluating the effect on the quantification of these diffusivity metrics.

Material and methods

Participants

Twenty-two healthy subjects were included in the study (age range 19–40 years, 12 females). MP2RAGE images were acquired from all subjects and DTI images from 10 subjects. All subjects were scanned on a Siemens Magnetom Skyra 3T MRI system with a 32-channel head coil. MP2RAGE images were acquired with \(\hbox {TR}=5\,\hbox {s}, \hbox {TI}_{1}=0.7\,\hbox {s}\), \(\mathrm{TI}_{2}=2.5\,\hbox {s}\), \(\alpha _{1}=4^{\circ }, \alpha _{2}=5^{\circ }\), a 3D sequence imaged at isotropic \(1\,\hbox {mm}^{3}\) resolution (acquisition matrix: 240 \(\times \) 256, 176 sagittal slices) and turbo factor of 176 as defined by Marques et al. [13]. DTI was acquired with 32 directions, 5 B0 acquisitions, \(\hbox {TR}=10.9\,\hbox {s}\), \(\hbox {TE}=0.079\), \(\hbox {TI}= 2.1\,\hbox {s}\), imaged at isotropic \(2.3\,\hbox {mm}^{3}\) resolution (acquisition matrix: 96 \(\times \) 96, 38 axial slices), and inversion recovery-based CSF suppression to reduce partial volume effects. Figure 1 provides an overview of the methods and comparisons.

Fig. 1

Overview. Left: six segmentation methods; middle: the two ROIs highlighted in red (upper: thalamus, lower: hippocampus); right: three MRI comparisons (upper: T1 (DSI comparison), middle: FA, lower: MD)

Pre-processing

MP2RAGE images were calculated as the ratio of two MPRAGE images, acquired with different inversion times [13]. As reported by Fujimoto et al. [14], the amplified background noise in MP2RAGE images may introduce errors in FreeSurfer segmentations. To deal with this, we used a slightly different approach than Fujimoto and colleagues. We classified the intensities of the proton density-weighted image acquired during the second inversion recovery into four classes using a Fuzzy C-means algorithm [15]. This robustly separates the background (1 class) from foreground (3 classes). To regain a “natural” background noise, we added the background noise of the second inversion recovery to the combined (flat) image. The modified image was used as input to all segmentation pipelines, and no errors were detected. Despite inherent intensity normalization of the MP2RAGE images, all segmentation tools were run with intensity non-uniformity correction, as this was the default setting.
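The classification step described above can be illustrated with a generic fuzzy C-means clustering of voxel intensities. The following is a minimal sketch, not the implementation of [15]: the function names (`fuzzy_c_means_1d`, `foreground_mask`), the fuzziness exponent `m = 2`, the evenly spaced initialisation and the rule that the lowest-intensity class is background are all assumptions made for illustration.

```python
import numpy as np


def _memberships(x, centers, m):
    """Fuzzy membership of every intensity in x to every class center."""
    # Small epsilon avoids division by zero when a voxel equals a center.
    dist = np.abs(x[:, None] - centers[None, :]) + 1e-12
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def fuzzy_c_means_1d(values, n_classes=4, m=2.0, n_iter=200, tol=1e-6):
    """Cluster 1-D intensities with fuzzy C-means; returns (centers, memberships)."""
    x = np.asarray(values, dtype=float).ravel()
    # Assumed initialisation: centers evenly spaced over the intensity range.
    centers = np.linspace(x.min(), x.max(), n_classes)
    for _ in range(n_iter):
        weights = _memberships(x, centers, m) ** m
        # Center update: membership-weighted mean intensity of each class.
        new_centers = (weights * x[:, None]).sum(axis=0) / weights.sum(axis=0)
        shift = np.max(np.abs(new_centers - centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, _memberships(x, centers, m)


def foreground_mask(pd_image, n_classes=4):
    """Boolean mask that is False for the (assumed) background class."""
    pd_image = np.asarray(pd_image, dtype=float)
    centers, u = fuzzy_c_means_1d(pd_image, n_classes=n_classes)
    labels = np.argmax(u, axis=1).reshape(pd_image.shape)
    # Assumption: the class with the lowest center intensity is the background.
    return labels != np.argmin(centers)
```

In this sketch, one class is treated as background and the remaining three as foreground, matching the 1 + 3 split described in the text; how the resulting mask is combined with the MP2RAGE image is not shown here.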

Diffusion data were eddy current and motion corrected using FSL, and EPI distortion correction was performed with ExploreDTI toolbox [16]. During pre-processing, the diffusion data were resampled to the space of the undistorted MP2RAGE image and FA and MD values were subsequently calculated with the ExploreDTI pipeline.
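Once the FA and MD maps share the voxel grid of the MP2RAGE image, a ROI value can be read out by averaging the map inside a binary mask. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the maps and masks are assumed to be already loaded as arrays of identical shape, and the percent difference is assumed to be expressed relative to the manual-segmentation value.

```python
import numpy as np


def roi_mean(metric_map, mask):
    """Mean of a diffusion metric (e.g. FA or MD) inside a binary ROI mask."""
    metric_map = np.asarray(metric_map, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if metric_map.shape != mask.shape:
        raise ValueError("map and mask must be on the same voxel grid")
    if not mask.any():
        raise ValueError("mask is empty")
    return float(metric_map[mask].mean())


def percent_difference(value, reference):
    """Difference of `value` from `reference`, in percent of the reference."""
    return 100.0 * (value - reference) / reference
```

For example, `percent_difference(roi_mean(fa, auto_mask), roi_mean(fa, manual_mask))` would give the relative FA offset of an automatic segmentation with respect to the manual one.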

Manual segmentation of ROIs

Thalamus and hippocampus from the 22 MP2RAGE images were manually segmented by an experienced neuroradiologist (EN: 7 years of experience in neuroradiology) and a trained assistant (TA) using ITK-SNAP (www.itk-snap.org) [17]. The thalami were first manually traced by EN in the axial plane using anatomical landmarks. Next, TA completed an initial training phase, supervised by EN, using the protocol outlined by Power et al. [18]. TA then adjusted the thalami in all three principal planes according to the protocol. Bilateral thalamus segmentation took 35–40 min per subject. The hippocampi were manually traced according to the EADC-ADNI segmentation protocol [19] by TA, initially supervised by EN. Segmentation of both hippocampi took 25–35 min per subject. As done in the EADC-ADNI protocol [19], all manual segmentations were performed in MNI space, where similar appearance of the nuclei is expected to improve tracing consistency and speed when using the segmentation protocols. The final segmentations were transformed back to scanner native space for comparison. Manual intra-operator reliability on hippocampus was tested 10 months after the initial segmentations by TA on the ten subjects with DTI. A 1-h training session, consisting of reading the protocol and training on a separate subject, was performed before segmentation of the ten subjects was carried out.

Automatic segmentation

The following provides a brief overview of the four processing methods volBrain, FSL, FreeSurfer and SPM with the applied settings.

volBrain

volBrain (http://volbrain.upv.es), which is an open-access platform, is based on an advanced pipeline providing automatic segmentations of several brain structures [20]. The version of volBrain used in the current setting involves an updated version of a recent patch-based method [5]. We tested the segmentation method using two different libraries: (1) the default volBrain library (external) consisting of 50 conventional T1w images (MPRAGE and SPGR), and (2) our own manually segmented library (local) of 22 MP2RAGE images in a leave-one-out fashion. In both cases, the images were flipped across the mid-sagittal plane to artificially increase the library size as done in related work [7].
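The two library-handling ideas mentioned here, mirroring across the mid-sagittal plane and leave-one-out evaluation, can be sketched as follows. This is an illustration only and not the volBrain code: the function names are hypothetical, the left-right direction is assumed to coincide with one array axis (as in images aligned to MNI space), and structures are assumed to carry a single bilateral label, so that no left/right label swap is needed after flipping.

```python
import numpy as np


def mirrored_library(images, labels, lr_axis=0):
    """Extend a template library with left-right mirrored copies."""
    # Assumption: the mid-sagittal plane is the centre of `lr_axis`.
    flipped_images = [np.flip(img, axis=lr_axis) for img in images]
    flipped_labels = [np.flip(lab, axis=lr_axis) for lab in labels]
    return list(images) + flipped_images, list(labels) + flipped_labels


def leave_one_out_splits(n_subjects):
    """Yield (test_index, training_indices) for every subject in turn."""
    for test in range(n_subjects):
        yield test, [i for i in range(n_subjects) if i != test]
```

In a leave-one-out run with mirroring, the mirrored copy of the test subject would also have to be excluded from the training templates; the sketch leaves that bookkeeping to the caller.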

FSL

FSL is freely available for download online (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/). The FMRIB’s Integrated Registration & Segmentation Tool (FIRST), from FSL v5.0, was used to segment subcortical structures [21]. FSL FIRST is a model-based segmentation tool that uses training data from manually segmented images. In the current pipeline, we used the default setting of FIRST, which applies empirically optimized settings (numbers of modes and shape/boundary correction) for each structure (see http://fsl.fmrib.ox.ac.uk/fsl/fsl-4.1.9/first/index.html).

FreeSurfer

FreeSurfer image analysis suite (version 5.3), which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu), was used in a default mode in the current setting. The segmentation maps are created using spatial intensity gradients across tissue classes and are therefore not simply reliant on absolute signal intensity [22].

SPM

SPM is a MATLAB-based (MathWorks Inc.) freely available software, which can be downloaded online (http://www.fil.ion.ucl.ac.uk). Segmentation was performed with SPM12 and MATLAB R2015b by combining the unified segmentation tool with the neuroinformatics SPM template [23], which consists of multiple structures of the brain segmented in Montreal Neurological Institute (MNI) space.

Fig. 2

ROI volumes of the thalamus and hippocampus. From left: manual, volBrain local, volBrain external, FSL, FreeSurfer and SPM. Boxes indicate the 25th and 75th percentiles and the bold line the median. Whiskers indicate the most extreme point within 1.5 times the interquartile range. Data points outside this range are plotted individually. Horizontal bars indicate a non-significant test for difference in volume. The remaining comparisons showed significant differences in volume

Fig. 3

DSI, FPR and FNR for segmentations of the thalamus and the hippocampus using volBrain local (vBlocal), volBrain external (vBext), FSL, FreeSurfer and SPM compared to the manual “gold standard”. Boxes indicate the 25th and 75th percentiles and the bold line the median. Whiskers indicate the most extreme point within 1.5 times the interquartile range. Data points outside this range are plotted individually. Horizontal bars indicate a non-significant test for difference in DSI, FPR and FNR. The remaining comparisons showed significant differences

Data and statistical analysis

The segmentations obtained from the four automatic methods were compared to the manual segmentations using volume, Dice similarity index (DSI), false positive rate (FPR), false negative rate (FNR), and Hausdorff distance estimated bilaterally. We report mean (M) and standard deviations (SD) and visualize data with boxplots.

DSI is defined as \(\frac{2{C}}{{A}+{B}}\) and is the quotient of similarity ranging from zero to one. A and B are the number of voxels in segmentation A and segmentation B, respectively, and C is the number of voxels shared by the two segmentations. FPR and FNR were calculated, respectively, as number of false positive and false negative voxels as percentage of the total manually segmented number of voxels. Hausdorff distance, h, indicates the maximum distance error and is defined as the maximum distance, d, from the surface of segmentation A to the nearest point in the surface of segmentation B: \({h}({A,B})= \max _{{a}\in {A}}\{\min _{{b}\in {B}}\{{d}({a},{b})\}\}\). Intra-rater reliability, volume, DSI, FPR, FNR and Hausdorff distance were analysed with two-way ANOVA and post-estimation was carried out, with a significance level of 0.05. Finally, FA and MD values were analysed using two-way ANOVA. Post hoc analyses of FA and MD were carried out with a primary analysis of the automatic segmentations against the manual segmentation and secondary between the automatic methods. Results are presented at a significance level of 0.05, and in addition, diffusivity results were reported with correction for multiple comparisons (60 tests on diffusion metrics were carried out, which yields a Bonferroni-corrected threshold of \(p=0.0008\)).
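The overlap measures defined above translate directly into voxel counts on binary masks. The following is a minimal sketch, assuming both segmentations are available as boolean arrays on the same grid; the function name `overlap_metrics` is hypothetical.

```python
import numpy as np


def overlap_metrics(auto_mask, manual_mask):
    """Return (DSI, FPR in %, FNR in %) of an automatic vs. a manual mask."""
    auto = np.asarray(auto_mask, dtype=bool)
    manual = np.asarray(manual_mask, dtype=bool)
    n_manual = manual.sum()                         # voxels in the reference
    shared = np.logical_and(auto, manual).sum()     # C in the DSI definition
    false_pos = np.logical_and(auto, ~manual).sum()
    false_neg = np.logical_and(~auto, manual).sum()
    dsi = 2.0 * shared / (auto.sum() + n_manual)    # 2C / (A + B)
    # FPR and FNR are expressed relative to the manually segmented volume.
    fpr = 100.0 * false_pos / n_manual
    fnr = 100.0 * false_neg / n_manual
    return float(dsi), float(fpr), float(fnr)
```

Because FPR is normalised by the manual volume rather than by the number of true-negative voxels, it can be read as the percentage of over-segmentation relative to the reference.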

Results

Intra-operator reliability of manual segmentation

The 10-month intra-operator reliability test of hippocampus manual segmentation resulted in a mean volume difference of 3.1 % (\(\hbox {SD}=4.9\,\%\)), which was not significantly different (\(p>0.05\)). Mean DSI was 0.913 (\(\hbox {SD}=0.010\)), and mean FPR and FNR were, respectively, 10.4 % (\(\hbox {SD}=3.4\,\%\)) and 7.3 % (\(\hbox {SD}=1.6\,\%\)). Intra-operator Hausdorff distances ranged from 2.2 to 4.9 mm. For DTI metrics, the overall model was significantly different for both FA (\(p<0.001\)) and MD (\(p<0.001\)). Post-estimation revealed a FA mean difference of 0.003 (\(\hbox {SD}=0.002\)) which was significantly different (\(p=0.003\)). MD obtained a mean difference of \(0.006\times 10^{-3}(\hbox {SD}=0.006\times 10^{-3})\), which was significantly different (\(p=0.018\)). If Bonferroni corrected, there is no significant difference between the manual segmentations.
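The effect of the Bonferroni correction on these two p values can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (60 tests, significance level 0.05, \(p=0.003\) for FA and \(p=0.018\) for MD); the helper name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, 60)   # about 0.0008
for metric, p in [("FA", 0.003), ("MD", 0.018)]:
    print(metric, "uncorrected:", p < 0.05, "corrected:", p < threshold)
```

Both p values fall below 0.05 but above the corrected threshold, which is why the intra-operator differences are no longer significant after correction.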

Fig. 4

Hausdorff distance of the automatic segmentations of hippocampus and thalamus compared to the manual segmentation. Boxes indicate the 25th and 75th percentiles and the bold line the median. Whiskers indicate the most extreme point within 1.5 times the interquartile range. Data points outside this range are plotted individually. The horizontal bar indicates a non-significant test for difference in Hausdorff distance. The remaining comparisons showed significant differences

Thalamus and hippocampus volumes

Figure 2 shows the volumes of thalamus and hippocampus for each of the segmentation methods. Overall the model was significantly different in both ROIs (\(p<0.001\)). There was no significant difference (\(p>0.05\)) in manual versus volBrain local, manual versus volBrain external and volBrain local versus volBrain external in thalamus, but all other comparisons for thalamus were significantly different (\(p<0.05\)). The hippocampus segmentations showed significantly higher volumes of volBrain external, FSL, FreeSurfer and SPM compared to the manual and volBrain local, and only FSL versus FreeSurfer and volBrain external versus SPM were not significantly different (\(p>0.05\)) from each other.

Manual versus automatic segmentation

Comparison of manual and automatic segmentation methods showed a substantial variation in DSI across the methods (see Fig. 3), and the overall model was significantly different (\(p<0.001\)) for both thalamus and hippocampus DSI, FPR and FNR. To maintain overview, only non-significant (\(p>0.05\)) p values are marked in Fig. 3. All other p values are significant (\(p<0.001\)).

DSI of the thalamus was significantly higher for volBrain local (\({M}=0.913\), \(\mathrm{SD}=0.014\)) and volBrain external (\({M}=0.868\), \(\mathrm{SD}=0.024\)) compared to FSL (\({M}=0.806\), \(\mathrm{SD}=0.034\)), FreeSurfer (\({M}=0.798\), \(\mathrm{SD}=0.049\)) and SPM (\({M}=0.787\), \(\mathrm{SD}=0.031\)). FreeSurfer was not significantly different from FSL or SPM. FPR in the thalamus when segmented with FSL (\(M=41\,\%\)) and SPM (\(M=42\,\%\)) was significantly higher than the other segmentation methods. Over-segmentations are exemplified in Fig. 5a, b, where the significantly lower FPR of volBrain local (\(M=9\,\%\)) and external (\(M=14\,\%\)) also can be observed. FreeSurfer FPR was significantly higher than volBrain and significantly lower than FSL and SPM. The mean FNR of the four methods ranged from 5 to 13 % all being significantly different, except volBrain local versus SPM, volBrain external versus FreeSurfer and FSL versus SPM.

The DSI of the hippocampus demonstrated significantly different values between all methods, with volBrain local (\({M}=0.892\), \(\mathrm{SD}=0.016\)) showing the best performance, followed by volBrain external (\({M}=0.859\), \(\mathrm{SD}=0.014\)), FSL (\({M}=0.808\), \(\mathrm{SD}=0.017\)), FreeSurfer (\({M}=0.771\), \(\mathrm{SD}=0.023\)) and SPM (\({M}=0.735\), \(\mathrm{SD}=0.038\)). A similar pattern was observed for FPRs, with volBrain local performing best (\({M}=9\,\%\)) followed by volBrain external (\({M}=26\,\%\)), FSL (\({M}=36\,\%\)), SPM (\({M}=40\,\%\)) and FreeSurfer performing worst (\({M}=41\,\%\)). FSL versus SPM and FreeSurfer versus SPM were the only methods which were not significantly different in FPR. Mean FNR ranged from 5 to 19 %, and all methods were significantly different, except volBrain local versus FreeSurfer.

Fig. 5

Examples of manual and automatic segmentations of thalamus and hippocampus presented in (a) the subject where volBrain local had the best performance, and (b) the subject where volBrain local had the worst performance. Upper two rows: thalamus in an axial view, overlaid on native T1 and co-registered FA images. Third row: 3D reconstructions of thalamus. The lower three rows contain similar visualizations for hippocampus segmentations. Left to right: manual, volBrain local, volBrain external, FSL, FreeSurfer and SPM methods. Green areas indicate overlap between automatic methods and manual segmentation. Red indicates areas that are included in the automatic, but not the manual method (false positives). Blue indicates areas that are included by the manual, but not the automatic method (false negatives)

Fig. 6

FA and MD values for thalamus obtained by the six different segmentation methods

Fig. 7

FA and MD values for hippocampus obtained by the six different segmentation methods and the second manual inter-rater segmentation (man2)

Table 1 FA and MD (\({\times }10^{-3}\)) values from thalamus obtained by the six different segmentation methods

In terms of Hausdorff distance, the overall model was significantly different in both thalamus and hippocampus (\(p<0.001\)). Figure 4 shows the Hausdorff distances for the automatic hippocampus and thalamus segmentations with low distances indicating good performance. Post-estimation showed that all methods had significantly different Hausdorff distances (\(p<0.05\)) except volBrain external versus FSL in thalamus. The best performance was seen with volBrain local, and the highest Hausdorff distances were measured with FreeSurfer in both thalamus and hippocampus.
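The Hausdorff distance used here is the directed form \({h}({A,B})\) given in the data analysis section, i.e. the largest distance from any surface point of A to its nearest surface point of B. A brute-force sketch is given below, assuming the surface points of both segmentations have already been extracted as coordinate arrays in millimetres; the function name is hypothetical and the surface extraction itself is not shown.

```python
import numpy as np


def directed_hausdorff_distance(points_a, points_b):
    """max over a in A of the distance from a to its nearest point in B."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    nearest = pairwise.min(axis=1)   # distance of each a to its closest b
    return float(nearest.max())
```

The measure is not symmetric: a single protruding region in A raises \({h}({A,B})\) without affecting \({h}({B,A})\). The pairwise distance matrix grows with the product of the two point counts, so a spatial search structure would be preferable for large surfaces.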

Visual inspection of ROIs

Examples of manual segmentations and the corresponding automatic segmentations of the thalamus and hippocampus, overlaid on the T1w image and the FA map, are shown in Fig. 5.

As illustrated, FreeSurfer, FSL and SPM generally over-segment the thalamus, especially the non-thalamic tissue near the border of the internal capsule (IC). volBrain external over-segments to a lesser extent, and volBrain local demonstrated only subtle over-segmentation at the inferior and lateral border of the thalamus. The same pattern of over-segmentation is found in the hippocampus with more extensive over-segmentation by FSL, FreeSurfer and SPM, but also slightly by volBrain external, compared to the manual (Fig. 5a, b). The over-segmentation of FSL, FreeSurfer and SPM in the hippocampus is mainly restricted to the superior and the rostral part of the hippocampus in the transition to thalamus and fornix.

Diffusivity results: Thalamus

The model was overall significantly different in the diffusivity measurements for both FA (\(p<0.001\)) and MD (\(p<0.001\)) in thalamus. Figure 6 shows mean FA and MD values in thalamus extracted from the six different segmentations and how MD values of all segmentations consistently change based on the segmentation method used, while FA values change less consistently. The volBrain local method provided the most accurate measurements compared to the manual segmentation.

Diffusivity results: Hippocampus

The model also provided overall significantly different results for FA (\(p<0.001\)) and MD (\(p<0.001\)) in hippocampus. Diffusivity results of the hippocampus are shown in Fig. 7. The figure illustrates the same consistent increase or decrease in MD between methods and subjects, but with different offsets and variation compared to the manual segmentation. FA showed a less consistent pattern. All automatic methods were significantly different from the manual values. When corrected for multiple comparisons, FA values for all methods stayed significantly different from the manual segmentation, except volBrain local and SPM, and for MD, all methods stayed significantly different except volBrain local.

Post hoc analysis on diffusivity parameters between manual and automatic segmentation

Post hoc analysis for thalamus and hippocampus is reported in Table 1. The post hoc analysis of thalamus revealed that only volBrain local was not significantly different from the manual segmentation and obtained the lowest mean difference of \(M=-0.3\,\%\) in FA and \(M=-0.1\,\%\) in MD. The other methods obtained a higher mean difference, ranging from \(M=3\,\%\) to \(M=9\,\%\) in FA and \(M=1\)–3 % in MD.

All methods obtained significantly different diffusivity parameters in the hippocampus when compared to the manual segmentation. The volBrain local demonstrated the most accurate result in the hippocampus, with a mean difference of \(M=-1\,\%\) of FA and \(M=-0.5\,\%\) of MD. If corrected for multiple comparisons, volBrain local FA and MD were not significantly different from the manual and neither was the SPM result of FA.

Post hoc analysis on diffusivity parameters between the automatic segmentation methods

Between-method comparison revealed more variable results. For an overview, see Table 2 with indication of corrected and un-corrected p values for both thalamus and hippocampus. All methods, except FSL, FreeSurfer and SPM, were significantly different from each other, when measuring FA in the thalamus. When measuring MD in the thalamus, all five methods yielded significantly different results. For hippocampus FA measurements, only volBrain local stood out as different from all the other methods. Furthermore, volBrain external was significantly different from FreeSurfer, while FSL versus FreeSurfer and SPM were also significantly different. For hippocampus MD, all methods were significantly different, except volBrain external versus FSL and FreeSurfer versus SPM.

Discussion

In this study, we evaluated the performance of a recent patch-based segmentation method [5] as implemented in volBrain [20] and three widely used conventional methods as implemented in FSL [21], FreeSurfer [22] and SPM [23]. Using MP2RAGE images, we tested the algorithms on two often investigated deep brain structures: the thalamus and the hippocampus. We found that the patch-based segmentation had the best overall accuracy. FreeSurfer, FSL and SPM all over-segmented the thalamus including non-thalamic tissue near the border of the IC and under-segmented in regions of the medial and lateral geniculate of the thalamus. In the segmentation of hippocampus, volBrain performed best followed by FSL, FreeSurfer and SPM. Moreover, we demonstrated that volBrain, based on a local library, was the only method, in which the diffusivity metrics of the thalamus did not differ significantly from the metrics obtained based on manual segmentation (Table 1). Analysis of hippocampus revealed that volBrain and SPM (although reporting low DSI) were not significantly different (Bonferroni corrected) from the manual method in terms of FA, and for MD only, volBrain local was not significantly different. This demonstrates that segmentation accuracy impacts the obtained diffusivity results, and less accurate methods, such as FSL, FreeSurfer and SPM, do not produce consistent diffusivity results.

Table 2 Mean difference in FA and MD (\({\times }10^{-3}\)) values of thalamus and hippocampus between the five automatic segmentation methods

The accuracy of the patch-based segmentation method in our study is comparable to previous results on hippocampus segmentations using MPRAGE images [5, 6]. A study by Patenaude et al. [21], using conventional T1w images and a leave-one-out comparison on its own library, found higher DSIs using FSL than found here. Patenaude and colleagues reported a mean DSI of 0.887 and 0.840 for the thalamus and hippocampus, respectively. This difference may reflect the importance of using coherent labelling protocols and similar imaging parameters within the template library. Patenaude et al. did not, however, reach the accuracy of the volBrain local segmentation in our study, with DSI of 0.913 and 0.892, respectively. To compare the performance of the volBrain method with a training library different from MP2RAGE, we applied volBrain with an external training library consisting of MPRAGE and SPGR images. We found that volBrain still performed better than FSL, FreeSurfer and SPM (Fig. 3). The results of the intra-operator reliability test on hippocampus further emphasize the advantage of automatic segmentation. We found a mean DSI of 0.915, which is consistent with the previous findings by Frisoni et al. [24] of DSI \(=\) 0.89. This result is at the accuracy level of volBrain local. However, in contrast to manual segmentations, automatic methods are deterministic and yield consistent errors. Thus, automatic segmentation methods are more robust in a longitudinal setting.

FSL, FreeSurfer and SPM over-segmented the structures with FPRs in the range of 15–42 %. This resulted in consistent inclusion of white matter in the segmented regions of thalamus and hippocampus (both grey matter structures) as qualitatively verified using FA maps (see Fig. 5a, b). Regarding the volBrain method, no systematic over- or underestimation for thalamus was observed with either the local or the external library (FPR and FNR balanced each other). Patches can capture texture similarities [5], and this is perhaps why the patch-based method attains consistently high accuracy on both thalamus and hippocampus. volBrain local was unbiased for hippocampus, while volBrain external slightly over-segmented hippocampus (FPR \(M = 26\,\%\)). This was unexpected, because the two libraries were constructed using hippocampus masks segmented based on the same protocol (the EADC protocol), while the protocols differed for the thalamus libraries. For hippocampus, this may be explained by different interpretations (different operators) of the EADC-ADNI protocol in the segmentation procedure of the hippocampus or by differences in contrast between the T1w images in the training libraries (MP2RAGE versus MPRAGE/SPGR).

The Hausdorff distance showed a stepwise increase between the manual and automatic methods, with mean values in the range 2–6 mm, the lowest being volBrain local followed by volBrain external, FSL, SPM and FreeSurfer (see Fig. 4). When considering the obtained FPR and FNR, the Hausdorff distance most likely reflects a maximum over-segmentation. However, evaluating the examples in Fig. 5a, b where the geniculate bodies of the thalamus are excluded (except for volBrain local), the distance may be due to under-segmentation in this specific region.

The intra-operator reliability test of manual hippocampus segmentation showed a consistent segmentation and no significant difference between volumes segmented with a time interval of 10 months. Our intra-operator DSI (\({M}=0.913\), \(\mathrm{SD}=0.010\)) is in line with previous reports of manual hippocampus segmentation reliability (\({M}=0.89\), \(\mathrm{SD}=0.01\)) [24]. The DSI of repeated tracings reveal that manual segmentation of hippocampus has the same level of accuracy as between manual segmentation and the volBrain local method. The volBrain local method though has the advantage of being more consistent, faster and less costly when the library has been established [25].

The obtained segmentation accuracies are partly reflected in the analysis of the diffusivity metrics. The volBrain local method was the only method not yielding significantly different FA and MD results in the thalamus compared to the results obtained by manual segmentation with a mean difference of −1 and −0.1 % in FA and MD, respectively. The other methods yielded mean differences between 1 and 9 % and were all significantly different in FA and MD compared to the values obtained with manual segmentation. This can be explained by the over-segmentation expanding into IC and the ventricular cerebrospinal fluid (CSF) (Fig. 5a, b). In hippocampus, the manual method was significantly different in FA compared to all methods (\(p<0.03\)). When correcting for multiple comparisons, FA values in both volBrain local and SPM and MD in volBrain local were not significantly different from manual measurements. The finding of SPM not being significantly different from the manual method, despite the inaccuracy of the SPM segmentation, can be explained by the segmentation expanding into both WM and CSF, which on average blurs the FA differences as WM and CSF, respectively, represent higher and lower FA values. Higher FA values of volBrain external, FSL and FreeSurfer in the hippocampus can be explained by over-segmentation into areas at the transition to the thalamus and fornix. The difference between volBrain local and the other methods in the hippocampus segmentation is furthermore confirmed by the post hoc analysis (Table 2), which shows that volBrain external, FSL, FreeSurfer and SPM all differ significantly in FA from volBrain local estimates, but not from each other, if corrected for multiple comparisons. This is visualized in Fig. 7 by the offsets between volBrain local and volBrain external, FSL, FreeSurfer and SPM. The FA and MD results of the intra-operator reliability test in hippocampus showed a significant difference between the two segmentations. This difference is similar to that of the best automatic segmentation (volBrain local). In both cases, the difference seems to be systematic (Fig. 7). However, this bias will be removed if using automatic methods in a longitudinal setting, as automatic methods are consistent and not prone to changing interpretations of the segmentation protocol. Finally, it should be noted that Bonferroni correction removed the significant differences between the manual segmentations.

Although the mean difference of FA and MD varies, all segmentation methods yielded consistent inter-subject differences compared to the manual approach (Figs. 6, 7). This was most pronounced for MD results. Group comparisons may therefore yield relatively similar results, using the same method within the same study. However, in diseases and disorders with subtle structural changes where the influence of segmentation errors could blur the findings and result in reduced sensitivity, it is crucial to use the most accurate method to detect pathological changes. A study by Barbagallo et al. [26] found a significant difference in MD in the thalamus between amyotrophic lateral sclerosis (ALS) patients and controls (\(0.06 \times 10^{-3}\), \(p=0.019\)) using FSL FIRST, but the FA difference of 0.01 was not significant (\(p>0.025\)). We speculate that such a result might have been significant if a more accurate segmentation method had been used. We found that volBrain local obtained the most accurate measurements compared to the manual segmentation (FA mean difference \(= -\)0.001), and all the other methods obtained a mean difference of FA higher than the 0.01 level obtained between groups in the study by Barbagallo and colleagues. Although it is not directly comparable, the impact of using different methods (more or less accurate) in clinical studies should be investigated further.

The variation of our diffusivity measurements was considerably smaller compared to those reported in the study by Barbagallo et al. The volBrain local method obtained SDs of FA values in the thalamus and hippocampus of, respectively, 0.008 and 0.007 and SDs of MD values of, respectively, 0.014 (\(\times 10^{-3}\)) and 0.017 (\(\times 10^{-3})\). In the Barbagallo study, the corresponding SDs were 0.02 and 0.01 for FA and 0.05 \((\times 10^{-3})\) and 0.07 \((\times 10^{-3})\) for MD. This could be due to the reliability of the MP2RAGE images, as pointed out in a recent study [27]. The MP2RAGE sequence is less influenced by B1 as well as M0 and T2*, improving the image contrast and sharpness, which makes it easier to discriminate between grey and white matter structures [13]. Another reason for the higher FA and MD variation in the Barbagallo study could be the ALS pathology.

In the present study, we executed the FSL pipeline with the empirically optimized default settings. Experimentation on optimizing the FSL modes and boundary correction adapted to the MP2RAGE images may improve the final segmentation of FSL. The influence of adapting the FSL segmentation tool, FIRST, to MP2RAGE remains to be investigated. However, Patenaude et al. [21] used a setting adapted and optimized to their T1w images (no specification of the T1 sequence) and still did not reach the DSI levels of the patch-based method in volBrain local. Furthermore, it was not possible to use the same template library in volBrain and FSL, which would have been optimal for comparing the methods. Thus, volBrain local has an advantage because it uses consistent training and testing data. This is similar to the Patenaude study [21], which showed good performance when using customized local settings. Patch-based methods have the advantage of requiring a relatively small library, which makes them feasible to implement and optimize locally.

State-of-the-art segmentation methods, like patch-based methods [5], together with high image quality, as in MP2RAGE data, may lead to higher sensitivity in future studies of morphometry and of microstructural changes. It could also be of interest to apply the presented methods retrospectively to clinical studies and evaluate whether statistical power and conclusions might be altered.

In conclusion, we have shown the potential of a recent, automatic patch-based segmentation method, volBrain, to provide more accurate thalamus and hippocampus segmentations in MP2RAGE images compared to conventional approaches. We have furthermore demonstrated that FA and MD values, extracted from co-registered DTI, deviated less from the reference of the manual segmentation, when using patch-based methods compared to the segmentations of FSL, FreeSurfer and SPM. We have illustrated under- and particularly over-segmentations on T1w images and FA maps, especially for FSL, FreeSurfer and SPM. Finally, we propose that MP2RAGE images are more suitable for thalamus and hippocampus segmentation compared to conventional T1w images.