1 Introduction

White matter (WM) tract segmentation on diffusion magnetic resonance imaging (dMRI) provides a valuable quantitative tool for various brain studies [1, 7, 21, 24]. Manually delineated WM tracts are generally considered the gold standard segmentation, but the annotation process is time-consuming and requires the expertise of experienced radiologists. Therefore, automated WM tract segmentation approaches have been developed, which classify fiber streamlines [4, 6] obtained with tractography [2, 9] or directly provide voxelwise labeling results [3, 19, 26]. In particular, methods based on deep neural networks (DNNs) have substantially improved the accuracy of WM tract segmentation [15, 25, 27]. For example, Zhang et al. [27] group fiber streamlines into different WM tracts with a DNN that takes the spatial coordinates of the points along a fiber streamline as input; in [25], fiber orientation maps extracted from dMRI scans are fed into a U-net [20] to directly predict the existence of WM tracts at each voxel.

The DNN-based segmentation model is generally trained on a dataset where both dMRI scans and WM tract annotations are available. However, the performance of the model on an arbitrary test dataset that is different from the training dataset may be degraded due to distribution shift, where the use of different numbers of diffusion gradients and different noise levels are two major contributing factors [18]. Since dMRI scans can be acquired with various protocols, improving the generalization of WM tract segmentation models to arbitrary test data is an important research topic. Although domain adaptation techniques [8] may be applied to improve the generalization, they require access to the test data during model training, which is not guaranteed when arbitrary test data is considered, and thus they are out of scope for this work. To account for the different numbers of diffusion gradients between training and test datasets, in [25] additional training scans are obtained by subsampling the diffusion gradients of the training data, which improves segmentation accuracy on test data. However, the segmentation accuracy may still be improved by taking the signal-to-noise ratio (SNR) into consideration during model training.

In this work, we seek to further improve the generalization of WM tract segmentation from the perspective of SNR. We focus on volumetric WM tract segmentation that directly obtains volumes of WM tract labels without requiring the tractography step. We assume that by producing diverse SNRs for training data, the training data can better represent the test data, and the trained model can better generalize to the test data. Therefore, we propose a scaled residual bootstrap strategy that augments the training scans with adjusted noise magnitude. First, we estimate a linear dictionary-based representation of diffusion signals and compute the residuals of the representation. These residuals are considered samples drawn from the noise distribution [10]. Then, for each diffusion gradient, the residual is drawn with replacement, and we adapt the standard residual bootstrap by scaling the residual. The scaled residuals are added to the linear representation of diffusion signals to generate augmented dMRI scans with different SNRs. Finally, the augmented images are used together with the original images for model training. Since it is difficult to know the SNR of the test data a priori, we choose to perform the residual scaling with multiple factors. The proposed approach was evaluated on two brain dMRI datasets, where various experimental settings of training and test scans were considered. The results show that our method consistently improved the generalization of WM tract segmentation under these various settings.

2 Methods

2.1 Problem Formulation

Suppose we are given a set of dMRI scans from a training dataset and the corresponding annotations of WM tracts. We seek to train a WM tract segmentation model with good generalization, i.e., one that performs well on an arbitrary test dataset. Like existing volumetric WM tract segmentation approaches [14, 25], the model takes as input fiber orientation maps computed from dMRI. Two major factors that cause the difference between training and test dMRI data are the use of different numbers of diffusion gradients and different noise levels. Since increasing or decreasing the number of diffusion gradients also respectively increases or decreases the SNR of the fiber orientation maps, we assume that adjusting the SNR of the dMRI scans used for model training can effectively improve the generalization of the trained model to other datasets. Although existing approaches have considered SNR manipulation in the data augmentation operations of model training [25], it is applied to the network input of fiber orientation maps. As fiber orientations are unit-length vectors, adding realistic noise consistent with imaging physics to them is nontrivial. Therefore, we seek to further explore data augmentation with SNR adjustment during model training to improve the generalization of WM tract segmentation models.

2.2 Model Training with Scaled Residual Bootstrap

To produce training data with diverse SNRs and realistic noise distributions, we propose a scaled residual bootstrap strategy for model training. For convenience, we denote the diffusion weighted signals at each voxel of a training dMRI scan by a vector \(\boldsymbol{y}\), where \(\boldsymbol{y}\in \mathbb {R}^{N_\mathrm{{d}}}\) and \(N_\mathrm{{d}}\) is the number of diffusion gradients. It has been shown that diffusion weighted signals can be linearly represented with a properly designed dictionary [16, 17]:

$$\begin{aligned} \boldsymbol{y}=\textbf{D}\boldsymbol{x}+\boldsymbol{\epsilon }, \end{aligned}$$
(1)

where \(\textbf{D}\in \mathbb {R}^{{N_\mathrm{{d}}}\times {N_\mathrm{{a}}}}\) is the dictionary with \(N_\mathrm{{a}}\) dictionary atoms, \(\boldsymbol{x}\in \mathbb {R}^{N_\mathrm{{a}}}\) is the vector of dictionary coefficients, and \(\boldsymbol{\epsilon }\in \mathbb {R}^{N_\mathrm{{d}}}\) represents the noise.

If the distribution of \(\boldsymbol{\epsilon }\) is known, different levels of realistic noise can be added to the noise-free linear representation to provide training data with different SNRs. This motivates us to adopt a residual bootstrap strategy, which provides a feasible way of approximating the noise distribution. Then, by modifying the noise distribution, we achieve the goal of augmenting the SNR levels of training data. There are two major steps in the proposed method, which are 1) residual computation and 2) data generation with scaled residuals.

Residual Computation. Like the standard residual bootstrap, we first estimate \(\boldsymbol{x}\) with the pseudoinverse of \(\textbf{D}\):

$$\begin{aligned} \hat{\boldsymbol{x}}=(\textbf{D}^{\textsf{T}}\textbf{D})^{-1}\textbf{D}^{\textsf{T}}\boldsymbol{y}, \end{aligned}$$
(2)

where \(\hat{\boldsymbol{x}}\) is the estimated coefficient vector. Then, the linear representation of the diffusion weighted signals can be estimated as

$$\begin{aligned} \hat{\boldsymbol{y}}=\textbf{D}\hat{\boldsymbol{x}}=\textbf{D}(\textbf{D}^{\textsf{T}}\textbf{D})^{-1}\textbf{D}^{\textsf{T}}\boldsymbol{y}. \end{aligned}$$
(3)

The residuals \(\hat{\boldsymbol{\epsilon }}\) of the signal representation can be simply computed by subtracting \(\hat{\boldsymbol{y}}\) from \(\boldsymbol{y}\):

$$\begin{aligned} \hat{\boldsymbol{\epsilon }}=\boldsymbol{y}-\hat{\boldsymbol{y}}=(\textbf{I}-\textbf{D}(\textbf{D}^{\textsf{T}}\textbf{D})^{-1}\textbf{D}^{\textsf{T}})\boldsymbol{y}. \end{aligned}$$
(4)

Then, to ensure that the variances of the residuals \(\hat{\boldsymbol{\epsilon }}\) are consistent with those of the noise \(\boldsymbol{\epsilon }\), the residuals are corrected with the following normalization [5, 10]:

$$\begin{aligned} \hat{\epsilon }_{i}'=\frac{\hat{\epsilon }_{i}}{\sqrt{1-h_{ii}}}. \end{aligned}$$
(5)

Here, \(\hat{\epsilon }_{i}\) is the i-th entry of \(\hat{\boldsymbol{\epsilon }}\), \(\hat{\epsilon }_{i}'\) is the corresponding corrected residual, and \(h_{ii}\) is the i-th diagonal entry of \(\textbf{H}=\textbf{D}(\textbf{D}^{\textsf{T}}\textbf{D})^{-1}\textbf{D}^{\textsf{T}}\). The set \(\mathcal {E}=\{\hat{\epsilon }_{i}'\}_{i=1}^{N_\mathrm{{d}}}\) of corrected residuals is then used in the bootstrap procedure that provides training data with diverse SNRs, and the procedure is described next.
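The residual computation in Eqs. (2)-(5) can be sketched in a few lines of NumPy. This is a minimal per-voxel illustration, assuming a precomputed dictionary `D` (e.g., a SHORE basis evaluated at the acquisition's q-space samples); the function name is illustrative, not from the original implementation.

```python
import numpy as np

def corrected_residuals(D, y):
    """Fit y with dictionary D; return the fitted signal and leverage-corrected residuals."""
    # Eq. (2): least-squares estimate of the coefficients (lstsq is used
    # instead of forming (D^T D)^{-1} explicitly, for numerical stability).
    x_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
    # Eq. (3): linear representation of the diffusion weighted signals.
    y_hat = D @ x_hat
    # Eq. (4): raw residuals of the representation.
    eps_hat = y - y_hat
    # Eq. (5): correct each residual by its leverage h_ii, the diagonal of
    # the hat matrix H = D (D^T D)^{-1} D^T.
    h = np.diag(D @ np.linalg.pinv(D.T @ D) @ D.T)
    eps_corr = eps_hat / np.sqrt(1.0 - h)
    return y_hat, eps_corr
```

Because every leverage \(h_{ii}\) lies in \([0,1)\) when \(N_\mathrm{{d}}>N_\mathrm{{a}}\), the correction only inflates the residuals, compensating for the variance absorbed by the fit.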

Data Generation with Scaled Residuals. The corrected residuals \(\mathcal {E}\) can be viewed as samples drawn from the noise distribution [5], and in the standard residual bootstrap, they are randomly drawn with replacement and added to the linear representation \(\hat{\boldsymbol{y}}\). For our purpose of better generalization, we seek to generate samples with diverse SNRs. Therefore, the standard bootstrap procedure is modified with a scaling operation. Specifically, for the i-th diffusion gradient, we sample from \(\mathcal {E}\) with replacement, and the sampled residual is denoted by \(\tilde{\epsilon }_{i}\). The vector comprising the sampled residuals for all diffusion gradients is represented as \(\tilde{\boldsymbol{\epsilon }}=(\tilde{\epsilon }_{1},\ldots ,\tilde{\epsilon }_{N_\mathrm{{d}}})\). Then, a bootstrap signal \(\tilde{\boldsymbol{y}}\) is generated as

$$\begin{aligned} \tilde{\boldsymbol{y}}=\hat{\boldsymbol{y}}+r\tilde{\boldsymbol{\epsilon }}, \end{aligned}$$
(6)

where r is the scaling factor that controls the magnitude of noise. r is selected from a predefined candidate set \(\mathcal {R}\). By repeating the scaled residual bootstrap in Eq. (6) for each voxel, bootstrap diffusion weighted images can be generated.
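The scaled bootstrap of Eq. (6) then reduces to resampling the corrected residuals with replacement and scaling them before adding them back. A minimal sketch, assuming the outputs of the residual-computation step above; names are illustrative.

```python
import numpy as np

def bootstrap_signal(y_hat, eps_corr, r, rng=None):
    """Generate one bootstrap signal vector per Eq. (6)."""
    rng = rng if rng is not None else np.random.default_rng()
    # Draw one corrected residual per diffusion gradient, with replacement.
    eps_tilde = rng.choice(eps_corr, size=y_hat.shape[0], replace=True)
    # Scale the resampled residuals by r and add them to the fitted signal.
    return y_hat + r * eps_tilde
```

With \(r=1\) this is the standard residual bootstrap; \(r>1\) yields training signals with lower SNR than the original scan.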

Note that in dMRI acquisition, b0 images without diffusion weighting are also acquired, and when more than one b0 image is available, their SNR can be adjusted as well. We denote the j-th b0 signal at each voxel by \(y_{j}^0\) and the number of b0 images by \(N_0\). Then, the residual \(\hat{\epsilon }_{j}^{0}\) for the j-th b0 signal is calculated by

$$\begin{aligned} \hat{\epsilon }_{j}^{0}=y_{j}^0-\bar{y}^0, \end{aligned}$$
(7)

where \(\bar{y}^0=\frac{1}{N_{0}}\sum _{j=1}^{N_{0}}y_{j}^0\) is the mean value of all b0 signals. These residuals form a set \(\mathcal {E}^{0}\). For each j, a sample is drawn from \(\mathcal {E}^{0}\) with replacement, which is denoted by \(\tilde{\epsilon }_{j}^{0}\), and the bootstrap b0 signal is generated as

$$\begin{aligned} \tilde{y}_{j}^0=\bar{y}^0+r\tilde{\epsilon }_{j}^0. \end{aligned}$$
(8)

Here, r has the same value as in Eq. (6). Equation (8) is repeated for each voxel to obtain bootstrap b0 images.
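The b0 bootstrap in Eqs. (7)-(8) follows the same pattern, with the voxelwise mean b0 signal playing the role of the fitted signal. A minimal sketch with an illustrative function name:

```python
import numpy as np

def bootstrap_b0(y0, r, rng=None):
    """Generate bootstrap b0 signals per Eqs. (7)-(8); y0 holds the N_0 b0 signals at one voxel."""
    rng = rng if rng is not None else np.random.default_rng()
    y0_bar = y0.mean()             # mean of all b0 signals
    eps0 = y0 - y0_bar             # Eq. (7): residuals about the mean
    # Resample with replacement and scale by the same r as in Eq. (6).
    eps0_tilde = rng.choice(eps0, size=y0.shape[0], replace=True)
    return y0_bar + r * eps0_tilde  # Eq. (8)
```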

After bootstrap b0 images and diffusion weighted images are generated, they are combined to obtain new dMRI scans with different SNRs. These bootstrap dMRI scans are used to train the segmentation model together with the original dMRI scans based on the WM tract annotations.

2.3 Implementation Details

Our method is agnostic to the architecture of the segmentation model. For demonstration, the state-of-the-art TractSeg architecture [25] is used as the backbone network, but other network structures [13, 14] may also be applied. As in [25], we extract fiber orientation maps from dMRI scans with constrained spherical deconvolution (CSD) [22] (for single-shell dMRI data) or multi-shell multi-tissue CSD (MSMT-CSD) [11] (for multi-shell dMRI data), and use these maps as network input. At most three fiber orientations are allowed, and all WM tracts are jointly segmented [25].

We use the SHORE basis [17] for the linear representation of diffusion signals, which is a common choice. To generate bootstrap training data with diverse SNRs, the set of candidate scaling factors is \(\mathcal {R}=\{2,3,4\}\). Since it is difficult to predetermine the SNR of arbitrary test data, all values in \(\mathcal {R}\) are used for bootstrap, and each value is used once for each training scan.
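The augmentation schedule above can be sketched as a simple pooling loop: each scaling factor in \(\mathcal {R}\) produces one bootstrap copy of every training scan, and the copies are trained on alongside the originals. The function names and the `bootstrap_fn` callable are illustrative placeholders for the per-scan bootstrap described in Sect. 2.2.

```python
def augment_dataset(scans, bootstrap_fn, factors=(2, 3, 4)):
    """Pool the original scans with one bootstrap copy per scaling factor in R."""
    augmented = [bootstrap_fn(scan, r) for scan in scans for r in factors]
    return list(scans) + augmented
```

With \(\mathcal {R}=\{2,3,4\}\), each training scan thus contributes four versions (one original and three bootstrap copies) to model training.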

For model training, following [25], we use the binary cross entropy loss function, which is minimized by Adamax [12] with a batch size of 56 and 300 training epochs; the initial learning rate is set to 0.001. We select the model that has the best segmentation accuracy on a validation dataset. The traditional data augmentation implemented in TractSeg, such as intensity perturbation and spatial transformation, is also applied online in the proposed method.

3 Results

3.1 Datasets and Experimental Settings

We used two dMRI datasets to evaluate our method. The first one is the publicly available Human Connectome Project (HCP) dataset [23], and the second one is an in-house dMRI dataset. A detailed description of the two datasets and their experimental settings is given below.

The HCP Dataset. The dMRI scans in the HCP dataset were acquired with 270 diffusion gradients (b = 1000, 2000, and 3000 s/mm\(^{2}\)) and an isotropic image resolution of 1.25 mm. 18 b0 images were also acquired for each dMRI scan. 72 WM tracts were manually delineated for the HCP dataset. We used 100 scans in our experiments, where 55 and 15 scans were used as the training set and validation set, respectively, and the remaining 30 scans were used for testing. To improve the generalization of the segmentation model to different imaging protocols, in TractSeg [25], subsampling of diffusion gradients was performed on the original training dMRI scans, where dMRI scans with 12 and 90 diffusion gradients associated with b = 1000 s/mm\(^{2}\) were generated for model training together with the original dMRI scans. Here, we followed [25] and performed the subsampling as well for the original and bootstrap training data for model training. For convenience, the original HCP dataset is referred to as HCP_1.25mm_270, and the subsampled datasets with 12 and 90 diffusion gradients are referred to as HCP_1.25mm_12 and HCP_1.25mm_90, respectively.

To evaluate the performance of the proposed method on test scans that were acquired with different protocols, we generated additional test sets from the 30 original test scans. First, like the training data in HCP_1.25mm_12 and HCP_1.25mm_90, the 12 and 90 diffusion gradients associated with b = 1000 s/mm\(^{2}\) were selected from the 30 test scans, respectively. Second, 34 diffusion gradients associated with b = 1000 s/mm\(^{2}\) were selected for the test scans, so that their imaging protocol was different from the original and subsampled training data, and the images associated with this subsampling are referred to as HCP_1.25mm_34. Only three b0 images were kept for HCP_1.25mm_34. Finally, another test set HCP_1.25mm_36 was generated from the test scans by selecting 18 diffusion gradients associated with b = 1000 s/mm\(^{2}\) and 18 diffusion gradients associated with b = 2000 s/mm\(^{2}\), which also produced dMRI scans that used a different imaging protocol than the training data. Only one b0 image was kept for HCP_1.25mm_36. A summary of these different datasets is listed in Table 1.

Table 1. A summary of the datasets used in the experiments

In addition, to investigate the impact of the amount of training data on the segmentation, three other experimental settings were considered, where 10, 20, or 30 training subjects were used and the other settings were not changed.

The In-House Dataset. The segmentation models trained on the HCP dataset were also applied to an in-house dataset for further evaluation. The dMRI scans in the in-house dataset were acquired with 270 diffusion gradients (b = 1000, 2000, and 3000 s/mm\(^{2}\)) and one b0 image. The spatial resolution is 1.7 mm isotropic. These scans were acquired on a scanner that is different from that of the HCP dataset. Due to the annotation cost, only ten of the 72 annotated WM tracts of the HCP dataset were manually delineated, and the delineation was performed on 17 in-house dMRI scans. These annotations were used only to evaluate the segmentation accuracy. This in-house dataset is referred to as IH_1.7mm_270. We also synthesized another dataset IH_1.7mm_36 from IH_1.7mm_270 for evaluation, where 18 diffusion gradients of b = 1000 s/mm\(^{2}\) and 18 diffusion gradients of b = 2000 s/mm\(^{2}\) were selected from the original scans. These two datasets are also summarized in Table 1.

Fig. 1. Representative segmentation results (red) for the HCP dataset, together with the gold standard (manual annotation) for reference. The cross-sectional views of the segmented tracts are shown, and they are overlaid on fractional anisotropy maps. Zoomed views of the highlighted regions are also displayed for better comparison. The image orientation is shown in the rightmost column. For the meaning of the tract abbreviations, we refer readers to [25]. (Color figure online)

3.2 Evaluation of Segmentation Results on the HCP Dataset

We first present the evaluation of the segmentation results on the HCP dataset. Our method was compared with TractSeg without using bootstrap (but with the subsampling of diffusion gradients), which is referred to as the baseline method.

Examples of the segmentation results are shown in Fig. 1. For demonstration, here we show the results of representative WM tracts on HCP_1.25mm_90, HCP_1.25mm_36, and HCP_1.25mm_34 when 55 training subjects were used. For reference, the gold standard (manual delineation) is also displayed. In Fig. 1, cross-sectional views of the WM tracts are given, and regions are highlighted with zoomed views for better comparison. It can be seen that the segmented tracts of the proposed method have more similar spatial coverage to the gold standard than the baseline method.

Table 2. The mean Dice coefficient (%) of all 72 WM tracts and the individual average Dice coefficients (%) of the three most challenging tracts for the HCP dataset across different settings. The proposed method was compared with the baseline method using paired Student’s t-tests, and asterisks indicate that the difference between the two methods is statistically significant (***: \(p<0.001\)).

We then quantitatively evaluated the proposed method by computing the Dice coefficient between the segmentation results and the gold standard. The mean Dice coefficient of all 72 WM tracts for each test dataset and each number of training subjects is shown in Table 2. As some WM tracts can be more challenging to segment [14] and the improvement of the segmentation of these tracts is important, in Table 2 we also show the individual average Dice coefficients of the three most challenging WM tracts, which are the anterior commissure (CA), left fornix (FX_left), and right fornix (FX_right) [14, 25]. Compared with the baseline method, the proposed method consistently improves the Dice coefficients across the different cases, and the improvement is more prominent for the three most challenging WM tracts. In addition, the Dice coefficients of the proposed method were compared with those of the baseline method using paired Student's t-tests, and the significance levels are indicated in Table 2. It can be seen that the improvement of the proposed method is statistically significant in all cases.

By comparing the results achieved with different numbers of training subjects, we observe that the overall improvement of the proposed method tends to be greater when the number is moderate (20 or 30) than when it is small (10) or large (55). Moreover, the Dice coefficients of the proposed method obtained with 20 training subjects are comparable to or higher than the baseline performance achieved with 55 training subjects. Also, when the number of training subjects increases from 20 to 30 or 55, the Dice coefficients of the proposed method are relatively stable, whereas the Dice coefficients of the baseline method can still increase. This is possibly because the proposed method augments the training data and thus reduces the requirement for manual annotation.

Table 3. The mean Dice coefficient (%) of all ten annotated WM tracts and the individual average Dice coefficients (%) of two challenging tracts for the in-house dataset across different settings. The proposed method was compared with the baseline method using paired Student’s t-tests, and asterisks indicate that the difference between the two methods is statistically significant (***: \(p<0.001\), **: \(p<0.01\), *: \(p<0.05\), n.s.: \(p\ge 0.05\)).

3.3 Evaluation of Segmentation Results on the In-House Dataset

The proposed method was next applied to the in-house test datasets IH_1.7mm_270 and IH_1.7mm_36, and the mean Dice coefficients of all ten annotated WM tracts are summarized in Table 3. In addition, the individual average Dice coefficients of two challenging tracts, the left uncinate fasciculus (UF_left) and right uncinate fasciculus (UF_right) [14], are also shown in Table 3. In each case, the proposed method achieves a higher Dice coefficient than the baseline method, and the improvement is more prominent for the two challenging tracts and for IH_1.7mm_36, which has a smaller number of diffusion gradients. We also performed paired Student's t-tests to compare the two methods in Table 3, and the difference between the two methods is statistically significant in most cases.

Like the results on the HCP dataset, the improvement of the proposed method over the baseline method is greater when the number of training subjects is 20 or 30 than when it is 10 or 55, and its performance becomes stable after the number of training subjects reaches 20. Also, the Dice coefficients of the proposed method obtained with 20 training subjects are already better than the baseline performance achieved with 55 training subjects.

4 Conclusion

We have proposed a WM tract segmentation approach that better generalizes to arbitrary test datasets. In the proposed method a scaled residual bootstrap strategy is developed, where the SNR levels of the training data are adjusted based on the residuals of a linear signal representation. This reduces the discrepancy between training and test data and thus improves the generalization of the trained segmentation model. Our method was validated on public and in-house datasets under various data settings, and the results show that it consistently improved the segmentation accuracy in the different cases.