Introduction

Tractography is an important and unique tool for the non-invasive, in vivo study of brain connections (Mori et al. 1999; Basser et al. 2000). With the emergence of modern and comprehensive diffusion MRI (dMRI) acquisition protocols, we can now map human brain connections at an unprecedented resolution (Wandell 2016). In particular, as a result of advances in multi-band and parallel imaging technologies (Feinberg et al. 2010; Moeller et al. 2010; Setsompop et al. 2012) and their successful application in the Human Connectome Project (HCP) (Toga et al. 2012; Van Essen et al. 2013), there has been substantial progress in multi-shell dMRI data acquisition and fiber orientation modeling (Yeh et al. 2010; Jbabdi et al. 2012; Jeurissen et al. 2014; Cheng et al. 2014). With the latest efforts on studying tissue microstructure and compartment modeling (Panagiotaki et al. 2012; Ferizi et al. 2014; Novikov et al. 2016; Tran and Shi 2015), which provide even more accurate local uncertainty models for fiber orientation densities (FODs), the community has been improving tractography algorithms as well (Aydogan and Shi 2016; Reisert et al. 2014), paving the way toward quantitative tractograms (Jbabdi and Johansen-Berg 2011; Girard et al. 2014).

In spite of the developments in dMRI and tractography techniques, we still face the long-standing challenge of missing ground truth for the validation of connections reconstructed by tractography. Validations using phantoms (Leemans et al. 2005; Campbell et al. 2006; Fieremans et al. 2008; Pullens et al. 2010; Bach et al. 2014; Girard et al. 2014; Neher et al. 2014) were proposed in various previous studies, but it is unclear how results obtained for artificial phantoms translate to biological tissues, especially when considering the neuroanatomical complexity of living organisms. Anatomically and histologically, tracer injections have long been considered the gold standard for the validation of tractography. While such comparisons are usually challenging, many important studies have compared tractography with tracer injections, including validations on post-mortem human brains (Seehaus et al. 2012), monkeys (Dauguet et al. 2007; Schmahmann et al. 2007; Jbabdi et al. 2013; Donahue et al. 2016), pigs (Dyrby et al. 2007; Knösche et al. 2015) and rats (Gyengesi et al. 2014). Notably, mapping of the mouse brain connectome has seen tremendous progress, producing detailed axonal projection maps with whole-brain coverage (Zingg et al. 2014; Oh et al. 2014). The mouse brain connectome from the Allen Mouse Brain Atlas (AMBA) (Oh et al. 2014) quickly became an important resource for validating the connectivity constructed by diffusion imaging (Keifer et al. 2015; Calabrese et al. 2015; Chen et al. 2015).

Previous validation studies show that obtaining connections in the brain using dMRI-based tractography is a challenging problem. Differences across image acquisition techniques (Tuch et al. 2002; Wedeen et al. 2005; Feinberg et al. 2010; Aganj et al. 2010; Setsompop et al. 2012; Moeller et al. 2010), diffusion models (Tournier et al. 2004; Tuch 2004; Panagiotaki et al. 2012; Basser et al. 1994), tractography algorithms and parameters (Fillard et al. 2011; Mangin et al. 2013; Pestilli et al. 2014; Daducci et al. 2015; Smith et al. 2015) all introduce limits and assumptions that result in biases and variability, affecting the accuracy, reliability and reproducibility (Besseling et al. 2012; Girard et al. 2014; Thomas et al. 2014).

Without carefully validating the connections and studying the factors that affect the performance of our complicated techniques, it is difficult to fully leverage the rich information obtained by tractography. A quantitative insight into the sources of performance variation is therefore crucial to improve how we conduct tractography experiments and interpret the results. Consequently, it is critical to rigorously analyze and study these factors, which is a challenging and multidimensional problem.

Many validation studies addressing one or a few dimensions of the sources of variation have been published to improve the practices in tractography. In Thomas et al. (2014), projections obtained using anterograde tracer injections from two locations of a macaque brain were compared against the tractography results obtained from the same subject. The authors compared the variability between deterministic and probabilistic tractography results across four different diffusion models and four curvature thresholds. They reported common limitations of diffusion models and tractography techniques, but did not quantitatively analyze and rank how much each factor influences the results. Gyengesi et al. (2014) compared the variability in performance with respect to different tractography techniques on 12 fiber bundles in rat brains and reported limitations and advantages of deterministic and probabilistic approaches. Chen et al. (2015b) used a single mouse brain and studied the variation in probabilistic tractography results with respect to the variation in fractional anisotropy (FA) and curvature thresholds. The authors suggested optimal parameters by testing three different FA and five different curvature thresholds. In Seehaus et al. (2012), carbocyanine dyes were used as tracers in a post-mortem human tractography validation experiment. The variability of results was studied using three seed locations and nine different FA thresholds. The authors reported that for their study FA values between 0.02 and 0.08 were optimal. In Dauguet et al. (2007), tractography results were compared with 3D histological tracing of two injections in the pre- and post-central gyri of a single macaque brain. The variability in results with respect to FA, curvature and step size was studied. The authors tested 13 different FA, 11 different curvature and 13 different step size values. However, they fixed the other two parameters while checking the variability due to a single parameter, and thus did not address the coupling effects of multiple parameters.

Despite its relatively long history, tractography lacks maturity. Previous validation works show that the large number of sources of variation plays an important role, obstructing the way for clinical applications of this unique technique. With such vast options for streamline reconstruction, there are very few common practices adopted in the literature, leaving a plethora of ways to fall short of adequate experimental protocols. This additionally makes parameter optimization studies challenging, since the significance of optimal parameters is questionable when several other sources of variation are in effect. Due to the large variability in scanners, pre-processing pipelines, diffusion models and tractography algorithms, optimal tractography parameters inevitably also vary. For this reason, in contrast to previous validation studies, we chose to focus on the extent of variability due to several parameters. Instead of presenting optimal values, our study provides systematic ways to obtain robust and reproducible results with tractography. While meticulously inspecting the trends, we not only take into account the coupled effects of several parameters, but also rank how much variability each source introduces.

The aim of this work is to expand our understanding of how to better conduct tractography experiments and interpret results by taking into account the sources of variation. For that we present results from our extensive (over 1 million) multidimensional tractography validation experiments using multi-shell dMRI data from mouse brains and tracer injections from the AMBA. We studied the variability in the results with respect to seven different factors, including their interactions with each other, using N-way ANOVA analysis. The factors we studied are: subject, tractography parameters (step size, curvature, cutoff, number of streamlines), use of anatomical constraints and the fiber bundle that is studied.

Our results have significant implications. Firstly and most importantly, we show that the variations in tractograms with respect to differences in subjects are comparable to variations with respect to the tractography parameters used in the experiments. Secondly, we show that tractography results vary significantly with respect to all parameters, including the commonly overlooked step size and number of streamlines. Lastly, our experiments show that while incorporating prior anatomical knowledge can dramatically reduce false positives, the overlaps between tractograms and injection experiments are still not ideal due to false negatives.

Materials and methods

Materials

Eight wild-type female mice (Mus musculus) aged 8–10 weeks were used as subjects. Subjects were deeply anesthetized with isoflurane gas then transcardially perfused with phosphate-buffered saline (PBS) with 0.05% heparin pH 7.4 at 37 °C, followed by 4% paraformaldehyde in PBS, pH 7.4, at 37 °C to fix the tissue. The subjects were then rapidly decapitated and skin, muscle, and bottom jaw were removed from the skull. The mouse brains intact within the skull were immediately postfixed in 4% paraformaldehyde in PBS, pH 7.4, at 4 °C overnight. The following day, the mouse brains were transferred to storage buffer, PBS pH 7.4, with 0.01% sodium azide and kept at 4 °C. The mouse brains were transferred to fresh storage buffer four more times during the first 48 h after fixation, and then transferred to fresh storage buffer and rocked on a nutating rotator for 5 days at 4 °C to remove any remaining paraformaldehyde. Finally, the mouse brains were transferred to fresh storage buffer and shipped from the Dulawa Lab, then at the University of Chicago, to the California Institute of Technology on ice for MRI data collection. All procedures involving animals were done in accordance with the ethical standards of the institution.

Multi-shell imaging of mouse brains

Fixed mouse brains intact within the skull were soaked in 5 mM Prohance® (Bracco Diagnostics, Inc., NJ) for 3 days prior to imaging to decrease overall T1 relaxation rates. They were then scanned three at a time, immersed in Galden® (Solvay Solexis, Inc., NJ, USA), with a 7 T Bruker BioSpin MRI scanner at the California Institute of Technology. Diffusion-weighted images (DWI) were acquired using a four-segment 3D spin-echo echo-planar imaging (SE-EPI) sequence: 128 × 110 × 100 matrix; voxel size: 0.2 × 0.2 × 0.2 mm3, TE = 50 ms; TR = 1000 ms, δ = 9 ms, Δ = 13 ms, bandwidth = 303 kHz, double sampling, NA = 1, yielding a scan time of ~ 10 h. (When scaled to the size of an average human brain (Nolte 2009), the imaging resolution corresponds to an isotropic 2.7 mm, which is well within the range of typical human dMRI acquisition settings.) 93 separate volumes were acquired: 3 T2-weighted volumes (voxel size: 0.1 × 0.1 × 0.1 mm3) with no diffusion sensitization (B0 image) and 90 diffusion-weighted images. DWIs were acquired with different angular samplings across three distinct b value shells: 1000, 3000 and 5000 s/mm2 (each b value shell contained 30 diffusion-weighted images), where the gradient directions were generated with an optimal distribution across the shells (Caruyer et al. 2013).
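As a rough sanity check of this scaling, assuming approximate brain volumes of ~0.45 cm3 for the mouse and ~1300 cm3 for the human (typical literature values we assume here, not numbers from the acquisition protocol), the linear scale factor is

\[
\left(\frac{V_{\text{human}}}{V_{\text{mouse}}}\right)^{1/3}\approx\left(\frac{1300}{0.45}\right)^{1/3}\approx 14,\qquad 0.2~\text{mm}\times 14\approx 2.8~\text{mm},
\]

which is in the neighborhood of the ~2.7 mm equivalent resolution quoted above.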

Allen Mouse Brain Atlas and selection of injection locations

The Allen Mouse Brain Atlas (AMBA) provides detailed tracer injection and projection density images from a large number of sites covering the mouse brain (Oh et al. 2014). All AMBA data are saved in an atlas space with a detailed set of anatomical labels (Dong 2007). The AMBA atlas was created by registering and averaging serial two-photon microscopy (STP) images of 1231 specimens (Kuan et al. 2015). AMBA provides 10, 25, 50 and 100 µm resolution versions of the atlas. In our study, we used the 25 µm resolution images because they provide a good balance between the level of detail and the memory required for computation. All data from the AMBA are distributed in the NRRD format and use the common coordinate framework (CCF) (Oh et al. 2014). To utilize conventional neuroimaging tools, we wrote a custom script to convert the AMBA data into RAS orientation and saved them in the NIFTI format. As of June 2017, there are 2546 anterograde recombinant adeno-associated virus (rAAV) tracer studies shared by the AMBA. Tracer studies were done on transgenic mouse lines as well as wild types. Due to the heavy computational load of our validation experiments, we limited our study to ten injection experiments performed on wild-type mice. These ten injections were selected from different parts of the brain to cover a large variety of projections. The injection IDs used in this article, the original injection IDs (given by the AMBA), and their anatomical locations are listed in Table 1. 3D visualizations of the injection sites and their projections are shown in Fig. 1.
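The conversion script itself is not distributed with the atlas; a minimal sketch of the kind of NRRD-to-NIfTI conversion described above could look as follows. The axis permutation, flips and 25 µm affine are illustrative placeholders that would need to be verified against the actual AMBA NRRD header, not the exact transform we used.

```python
import nrrd                      # pip install pynrrd
import nibabel as nib
import numpy as np

# Read an AMBA volume (e.g., a 25 um projection density image) in NRRD format.
data, header = nrrd.read("projection_density_25um.nrrd")

# Reorient the array to RAS. The permutation/flips below are placeholders;
# the correct ones depend on the native CCF axis order stored in the header.
data_ras = np.transpose(data, (2, 0, 1))[::-1, :, ::-1]

# Build a NIfTI affine with 25 um (0.025 mm) isotropic voxels in RAS.
affine = np.diag([0.025, 0.025, 0.025, 1.0])

nib.save(nib.Nifti1Image(np.ascontiguousarray(data_ras), affine),
         "projection_density_25um_RAS.nii.gz")
```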

Table 1 List of injection IDs and anatomical locations used in the study
Fig. 1

The top panel shows the 3D volume renderings for the ten tracer projection densities used in Mode-I and Mode-II comparisons. Injection sites used in the study are visualized on the bottom panel

Pre-processing of MRI data

The skull-stripping tool (BSE) (Shattuck and Leahy 2002) of BrainSuite was used to extract masks for individual mouse brains from the T2 image. Each brain was carefully extracted by manually adjusting BSE parameters. Brain masks were then reoriented to the RAS orientation with an in-house developed software tool that uses manually annotated landmarks. Both T2 and diffusion MRI were then warped to align with the AMBA using the ANTs registration tool (Avants et al. 2011). Because the AMBA is prepared using STP images, we used the mutual information similarity metric for all registration steps. A comparison using a checkerboard pattern between the AMBA and a registered image is shown on an axial slice in Fig. 2a. For each subject, registration error was quantitatively measured using the average displacement of manually marked landmarks on the AMBA and the subject's T2 image. For that, we used the landmarks proposed in Sergejeva et al. (2015). In that study, 16 landmarks were recommended for C57BL/6J mouse brain MRI registration: 2 landmarks in the cerebellum, 1 in the middle cortex, 1 in the periaqueductal gray, 1 in the pontine nucleus, 1 in the hippocampus, 3 in the interpeduncular nucleus, 1 in the corpus callosum, 1 in the middle ventricle, 2 in the anterior commissure and 3 in frontal areas. Among the 16 landmarks, we did not use the 2 in the cerebellum since the dorsal sections of the AMBA do not include these regions. In our dataset, the mean and standard deviation of the average displacement across all subjects were 145.35 ± 29.37 µm, which corresponds to 1.45 ± 0.3 voxels in MRI space. This error is comparable to that obtained by Sergejeva et al. (2015), which was 134 ± 20 µm (1.54 ± 0.23 voxels). The nonlinear transforms were saved and used to warp the injection sites and tracer projection density maps of the AMBA to individual mouse brains for our tractography experiments. Additionally, FSL's eddy_correct tool was used to reduce the artifacts in dMRI due to eddy currents (Jenkinson et al. 2012).
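For reference, the registration error metric described above reduces to the mean Euclidean distance between corresponding landmarks after warping. A minimal sketch is given below; the assumption that landmark coordinates are available as N × 3 arrays in millimeters, and the 0.1 mm voxel size taken from the T2 acquisition, are ours.

```python
import numpy as np

def landmark_error(atlas_pts_mm, warped_pts_mm, voxel_size_mm=0.1):
    """Mean Euclidean displacement between corresponding landmarks.

    atlas_pts_mm, warped_pts_mm : (N, 3) arrays of matching landmark
    coordinates, in millimeters, expressed in the same (atlas) space.
    Returns the error in micrometers and in voxels of the given size.
    """
    d = np.linalg.norm(np.asarray(atlas_pts_mm) - np.asarray(warped_pts_mm), axis=1)
    mean_um = 1000.0 * d.mean()                  # mm -> um
    return mean_um, mean_um / (1000.0 * voxel_size_mm)
```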

Fig. 2

a Visualization for registration accuracy using a checkerboard pattern on an axial slice. b FODs reconstructed from multi-shell mouse brain dMRI data are plotted on a coronal slice. c Whole brain tractography results from FOD-based probabilistic tractography

FOD reconstruction from multi-shell diffusion MRI

Using the multi-shell diffusion MRI (dMRI) data of the mouse brains, we applied the novel reconstruction method we developed recently in Tran and Shi (2015) and Kammen et al. (2016) to compute the FODs used in our tractography experiments. At each voxel, the FOD is a scalar function defined on the unit sphere that represents the probability of fibers along each direction (Fig. 2b). Numerically, each FOD is represented with spherical harmonics up to order 12, which matches the number of gradient directions in our acquisition protocol. For the diffusivity of the stick kernel in our computational framework, we used 0.0008 mm2/s, following previous literature on post-mortem mouse brain diffusion imaging (Wu et al. 2013, 2014). Note that this is smaller than the kernel diffusivity that we typically use for in vivo human brain imaging data from the HCP. Using FOD-based tractography, we can visualize the connectivity of mouse brains (Fig. 2c) and study its relation to the underlying anatomy.
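For a symmetric FOD, only even spherical harmonic orders are used, so a maximum order of 12 corresponds to 91 coefficients; our reading is that this is what "matches the number of gradient directions" (90 DWIs) refers to. A quick check:

```python
# Number of real, even-order spherical harmonic coefficients up to lmax
# (only even orders are used for antipodally symmetric FODs).
def n_sh_coeffs(lmax):
    return sum(2 * l + 1 for l in range(0, lmax + 1, 2))  # = (lmax+1)(lmax+2)/2

print(n_sh_coeffs(12))  # 91, close to the 90 gradient directions acquired
```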

Tractography technique and parameter values

We used the iFOD2 algorithm of the MRtrix3 software (Tournier et al. 2010, 2012) for FOD-based probabilistic tractography. Injection density images from the AMBA were registered to each mouse brain image and used as seed regions for tractography. Because injection density images provide a probability density function for the injected tracer, we used the -seed_rejection flag of the tckgen command of MRtrix3. This option generates track seeds in proportion to the tracer density at each voxel. To thoroughly investigate the impact of tracking parameters, we used nine different values for each of the following parameters in tractography: step, curvature and cutoff (FOD amplitude threshold for terminating tracks). Other tractography parameters used in the experiments were fixed as minlength = 0.1 mm, maxlength = 50 mm and trials = 1000. The term "curvature" is used synonymously with "radius of curvature", and values were converted to angles for the tckgen command. We also varied the total number of streamlines per injection site, using ten different values, to examine the convergence of overlap with respect to this parameter. For each injection, we conducted a total of 7290 (= 9 × 9 × 9 × 10) tractography experiments.
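For illustration, a hypothetical single run of the kind described above could be assembled as follows. The parameter values, file names and the -select flag are assumptions for the sketch (flags should be checked against the MRtrix3 version in use), and the radius-of-curvature to angle conversion uses the standard chord relation, which we assume is the one applied here.

```python
import subprocess
from math import asin, degrees

step = 0.02          # mm (one hypothetical step size)
radius = 0.1         # mm (one hypothetical radius of curvature)
cutoff = 0.0075      # FOD amplitude threshold for terminating tracks
n_streamlines = 500000

# Convert radius of curvature to the maximum angle between successive steps.
angle = degrees(2.0 * asin(step / (2.0 * radius)))

subprocess.run([
    "tckgen", "fod.mif", "tracks.tck",
    "-algorithm", "iFOD2",
    "-seed_rejection", "injection_density.nii.gz",  # seeds drawn in proportion to tracer density
    "-step", str(step),
    "-angle", f"{angle:.2f}",
    "-cutoff", str(cutoff),
    "-minlength", "0.1", "-maxlength", "50",
    "-trials", "1000",
    "-select", str(n_streamlines),
], check=True)
```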

The values of the varied tractography parameters are listed in Table 2. Because it is common to set the step size as a fraction of the voxel dimensions, this is also listed. Similarly, curvature is commonly set in terms of angular deviation, which we also included in the table. Notice that the step size (with respect to voxel size), curvature (in angles) and number of streamlines used in the tests are within the typical ranges of tractography experiments reported in the literature. Cutoff values, however, are different, since we used post-mortem mice in our experiments, which have much lower diffusivity compared to the in vivo human case.

Table 2 Tracking parameters used in the experiments

Quantitative comparison of tractography and tracer injections

We used the projection density images provided by the AMBA as the ground truth. Projection density is defined as the ratio of the number of projection-detected pixels to the number of all pixels in the division (Kuan et al. 2015), i.e., the maximum projection density value is 1 and it indicates that all the pixels in the division receive projections from the neurons in the injection area. However, the projection density values are challenging to use for quantitative analysis due to variations among intensity profiles in different sections as well as changes in experimental procedures for different injection sites, such as the tracer dose. For these reasons, as the ground truth, instead of considering the amount of projection from an injection site to a voxel, we chose to consider whether there exists a projection to this voxel or not.

We compared the ground truth with the tractography results by identifying the voxels reached by streamlines from the injection site. For this purpose, we computed track density images (TDI) at the same resolution as the AMBA (Calamante et al. 2012). TDIs were obtained using the tckmap command of MRtrix3.

To compare tractography results with the AMBA's injection studies, we used two different modes in our analysis. Mode-I and Mode-II separately consider the two common applications of tractography. Mode-I comparison aims to measure the accuracy of tractography when used for general exploration, i.e., identification of all tracks from a given seed region. On the other hand, Mode-II comparison aims to quantify the accuracy of tractography when used for targeted exploration, i.e., extraction of tracks from a given region to another. In the Mode-I comparison, voxels inside the brain mask with non-zero projection density are considered the 'positive condition'; otherwise they are the 'negative condition'. In the Mode-II comparison, we apply constraints on both the projection density images and the tractograms. For tracer projection maps, all projected voxels ideally should be morphologically connected to the injection site without any gaps. However, this is not the case in the projection density images due to discretization errors and spurious projections. For the Mode-II comparison, as ground truth, we used the voxels that have more than 1% projection density and form a single connected component with the injection site. Besides using the injection site as the seed ROI for tractography, we also apply this connected component as the target (include) ROI. The use of this additional anatomical constraint in Mode-II experiments is similar to the streamline reconstruction protocols in human brain imaging, where multiple ROIs are typically used to identify the major fiber bundles. Our goal is to examine the degree of improvement that can be achieved with the incorporation of additional anatomical constraints. Streamlines for Mode-II comparisons are obtained from those computed for Mode-I comparisons by trimming segments from the ends of each streamline until a projection site is hit. Figure 1 shows volume renderings of Mode-I and Mode-II projections for all the injections, and Fig. 3 shows a graphical explanation of the ground truths and predictions used for the Mode-I and Mode-II analysis.
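A minimal sketch of the Mode-II ground-truth construction described above (thresholding at 1% projection density and keeping only the component connected to the injection site) is given below. The use of scipy for the labeling and the default 6-connectivity are implementation choices of this sketch, not necessarily those of the original pipeline.

```python
import numpy as np
from scipy import ndimage

def mode2_ground_truth(projection_density, injection_site_mask, threshold=0.01):
    """Binary Mode-II ground truth: voxels with >1% projection density that
    form a single connected component with the injection site."""
    candidate = (projection_density > threshold) | (injection_site_mask > 0)
    labels, _ = ndimage.label(candidate)
    # Keep only the connected component(s) that overlap the injection site.
    seed_labels = np.unique(labels[injection_site_mask > 0])
    seed_labels = seed_labels[seed_labels != 0]
    return np.isin(labels, seed_labels)
```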

Fig. 3

Graphical description of Mode-I and Mode-II analysis. a Injection seed and projections are shown in green and yellow, respectively. b Mode-I uses the discretized ground truth, i.e., the existence of projections to voxels. c Mode-I results are computed using the streamlines projecting from the seed location. d In Mode-II, spurious projections are removed from the ground truth. e Mode-II results are computed by trimming the ends of streamlines that are outside the ground truth. This represents the case where an anatomical constraint requires the end points of all streamlines to project to a ground truth voxel

To quantitatively evaluate the performance of tractography results, we form a predicted projection label image from the TDI image and calculate the overlap with ground truth labels. To form the predicted projection label image, we mark a voxel with the true label if it has a non-zero TDI value; otherwise it is marked with the false label. Therefore, a true label indicates that we predict the existence of projections to that voxel, and a false label indicates that no projection is predicted. The ground truth label image is obtained similarly by thresholding the tracer projection density images used in either Mode-I or Mode-II comparisons. Using the predicted projection label and the ground truth label images, we compute the number of voxels belonging to true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) (Fig. 3c, d). After that, the following measures were calculated to characterize the performance of tractography results: true positive rate \(\text{TPR}=\frac{\text{TP}}{\text{TP}+\text{FN}}\), false positive rate \(\text{FPR}=\frac{\text{FP}}{\text{FP}+\text{TN}}\) and DICE coefficient \(\text{DICE}=\frac{2\,\text{TP}}{2\,\text{TP}+\text{FP}+\text{FN}}\) (Dice 1945).
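Given the predicted and ground-truth label images, these measures reduce to simple voxel counts. A sketch, under the assumption that both inputs are boolean arrays already restricted to the brain mask:

```python
import numpy as np

def overlap_measures(predicted, ground_truth):
    """TPR, FPR and DICE between two boolean label volumes of equal shape."""
    tp = np.count_nonzero(predicted & ground_truth)
    fp = np.count_nonzero(predicted & ~ground_truth)
    tn = np.count_nonzero(~predicted & ~ground_truth)
    fn = np.count_nonzero(~predicted & ground_truth)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return tpr, fpr, dice
```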

Statistical analysis of variation

We used N-way ANOVA to study the sources of variation in our results. Similar to previous studies (Besseling et al. 2012; Dauguet et al. 2007), we focused on the variation in the DICE measure since it is a one-dimensional quantity that measures the overlap quality. By applying N-way ANOVA to the DICE measure, we identified the changes in the quality of overlap between different groups as well as the variation within them. As a part of the ANOVA analysis, we computed the F-statistic to test and quantify the significance of parameters. The ANOVA analysis was done using MATLAB (MathWorks 2012). We show in Supplementary Material A that our dataset satisfies the ANOVA requirements.
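The analysis itself was run in MATLAB (anovan/multcompare); an equivalent N-way main-effects ANOVA on a table of DICE values can be sketched in Python as below. The file name and column names are placeholders for the seven factors we assume one row per experiment.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per experiment: the DICE value plus the seven factors studied.
df = pd.read_csv("experiments.csv")  # assumed columns: dice, subject, injection,
                                     # mode, step, curvature, cutoff, n_streamlines

model = smf.ols(
    "dice ~ C(subject) + C(injection) + C(mode) + C(step) "
    "+ C(curvature) + C(cutoff) + C(n_streamlines)",
    data=df,
).fit()

# F-statistics and p-values per factor, analogous to MATLAB's anovan output.
print(sm.stats.anova_lm(model, typ=2))
```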

Results

The overlap between tractography and tracer injection experiments changes dramatically with different choices of parameter combinations. Before we analyze the sources of this variation, in "Examination of overlap measures and variation across subjects", we demonstrate how the quantitative measures used in this study (TPR, FPR and DICE) visually correspond to the overlap. Here we also briefly point out the effect of subject variability on the results. In "Examination of overlap variations due to tractography parameters and anatomical constraints", we extend the results of "Examination of overlap measures and variation across subjects" by varying each of the parameters separately. With this, some of the trends due to parameter variations become observable. In "Examination of overlap variations due to injection sites", we dig deeper into the trends by including multiple injection locations, which gives a more complete overview of the trends in our seven-dimensional parameter space. (More detailed information and figures on the trends are provided in the supplementary materials for interested readers.) In "ANOVA analysis and the ranking of variation sources", we analyze the sources of the trends and reveal how much each of the parameters contributes to the variation in performance.

Examination of overlap measures and variation across subjects

Figure 4 visualizes examples for a range of TPR, FPR and DICE to establish a visual link between these performance measures and the overlap quality. Figure 4 shows the 3D overlap between projection density images and tractography results for the sixth injection site that we denote as I6. For qualitative inspection, we used two subjects, M1 and M2, and considered three different sets of parameters represented as Case A, Case B and Case C that are listed in Table 3.

Fig. 4

3D overlap of the tracer projection density for I6 (shown in yellow) and tractography results (shown in red) using two of the eight subjects, M1 and M2. Top and bottom rows show the results for Mode-I and Mode-II comparisons, respectively. While Mode-II does not noticeably affect the tracer data since it only removes spurious projections, there are visible differences in tractograms (shown with blue arrows) due to the introduction of anatomical constraints. Computed measures are listed under each experiment. Case A shows better results compared to Cases B and C

Table 3 Tracking parameters used for the qualitative examination

I6 projects to a large portion of the brain, including somatosensory areas on both hemispheres and dorsal regions such as the medullary nuclei. We observed in our experiments that most of the parameter combinations were successful in obtaining projections to the somatosensory areas. Contralateral and dorsal connections are more visible for M2. This coincides with the values of the DICE coefficient, which is an indicator of the quality of overlap. Capturing dorsal projections, however, required more flexible parameter combinations. In short, Cases A, B and C visually show relatively good, moderate and poor agreement between the tracer projection and tractography. While differences can be observed between the results from the two mouse brains, the overall trends for Cases A, B and C are consistent. Another consistent trend is the improvement of the overlap measures when the anatomical constraint was added to obtain Mode-II results. For all cases, the inclusion of prior anatomical knowledge improved the result for both subjects.

Examination of overlap variations due to tractography parameters and anatomical constraints

To examine the effect of tractography parameters, we used Cases A, B and C shown in Table 3 as reference points and varied each of the four parameters. The results are shown in Fig. 5 as FPR vs TPR plots on the receiver operating characteristic (ROC) plane. Overall, we ran 204 tractography experiments for each subject. The plots in Fig. 5 can be considered as samples from ROC curves for the variation of each tractography parameter separately. The size of the data points is in proportion to the corresponding DICE coefficient computed from the same experiment. The results from the reference parameters listed in Table 3 are marked with an 'x' for each case. Figure 5 confirms that Mode-II results always yield better matches with the injection experiments. It is also observed that although the step size changes the results, its effect is not as pronounced as that of the other parameters. A large radius of curvature seems to be a strong constraint; however, decreasing it does not improve the match for very low curvature values. Decreasing the cutoff threshold on the FOD magnitude increases the TPR, but also increases the FPR. There is a big difference in overlap quality between results generated using low and high numbers of streamlines. Increasing the number of streamlines shows a converging pattern for all cases.

Fig. 5

Effects of parameter variations on the overlap measures between tractography and tracer projection density for I6. Top and bottom rows show the results for subjects M1 and M2, respectively. Reference points corresponding to Cases A, B and C in Table 3 are plotted as x. In each column, only one of the parameters is varied, i.e., for the two plots under curvature, only the curvature parameter is changed and all other parameters are kept the same. The size of the data points is in proportion to the DICE coefficient

Examination of overlap variations due to injection sites

To examine the results for different injection sites, we picked I1 and I6 for the first subject (M1). For both injection sites, we ran tractography experiments with all of the possible 9 × 9 × 9 × 10 = 7290 combinations of parameter choices listed in Table 2. Because we conducted both Mode-I and Mode-II analysis, there are overall 14580 experiments for each injection. The TPR vs FPR values showing the trends with respect to the changes in the number of streamlines and cutoff parameters are plotted in Fig. 6. Notice that all of the 14580 data points in columns one and three are identical for each row, i.e., for each injection site. However, they are plotted with different coloring schemes to highlight the trends with respect to changes in the number of streamlines (first column) and cutoff (third column). Similarly, the data points used in the second and fourth columns are identical. The second column shows a subset of the data points where the cutoff value is fixed to 0.75 × 10−2 to further clarify the trend with respect to changes in the number of streamlines. Similarly, in the fourth column, the number of streamlines is fixed to 500 K, which better exposes the trend due to cutoff variation.

Fig. 6

Variability of tractography performance with respect to the number of streamlines and cutoff parameters. The dimensions of the data points are in proportion to the DICE coefficients. The first and third columns show the same data points, which are collected from all of the experiments using injections I1 and I6 on M1 (14580 experiments = 2 modes × 9 step sizes × 9 curvatures × 9 cutoff thresholds × 10 numbers of streamlines). However, the coloring of the points is done with respect to the number of streamlines in the first column and the cutoff in the third to highlight the trends. The cutoff is fixed to 0.75 × 10−2 in the second column to clarify the trend with respect to changes in the number of streamlines. Similarly, in the fourth column, the number of streamlines is fixed to 500 K to emphasize the trend with respect to cutoff changes

Overall, we can see quite dramatic differences between the plots for I1 and I6. This shows that anatomy plays a key role in determining the performance of tractography experiments in terms of the TPR, FPR and DICE values. It is also consistent with the previous experiments, where Mode-II analysis generated higher agreement between tractography and tracer projection density images. From Fig. 6, we can see that the data points with different cutoff thresholds are stratified into distinct bands. Experiments with high cutoff thresholds yield lower TPR, FPR and DICE values. Relaxing this constraint enables us to obtain a higher TPR at the cost of a higher FPR. On the other hand, the results in Fig. 6 show that the number of streamlines does not have a similar effect on constraining the performance of the tractography experiments. It can be observed that both TPR and FPR values improve almost monotonically with the increase in the number of streamlines. Given this observation, we selected the data points with the number of streamlines at the maximum value (500 K) and plotted them separately by coloring them according to the cutoff values. These plots more clearly illustrate the effect of the cutoff values on the performance of tractography. With the change of the cutoff thresholds, we can see a wide range of overlap measures.

To illustrate the effects of the other two parameters, curvature and step, we plotted in Fig. 7 the data points with the maximum number of streamlines (500 K) and three representative cutoff thresholds, low cutoff (0.75 × 10−2), medium cutoff (2.25 × 10−2) and high cutoff (3.75 × 10−2). To have a clearer visualization, we only used data points from four representative curvature and step values. For the visualization of the plots, we used different symbols to indicate different curvature values and different colors to indicate different step sizes. Similar to the cutoff parameter, curvature imposes a strong constraint and in general high curvature values yield low TPR and FPR. Relaxing curvature by decreasing its value increases both TPR and FPR. On the other hand, decreasing the step size has an opposite effect. Smaller step sizes in general yield lower TPR and FPR.

Fig. 7

Variability of tractography performance with respect to the curvature and step size parameters. The dimensions of the data points are in proportion to the DICE coefficients. All data points are a subset of the experiments shown in Fig. 6. To keep the plots simple, the number of streamlines is fixed to 500 K and three different (fixed) cutoff values are shown in each column. Data points with respect to varying curvature values are shown with different symbols, whereas different colors are used to show the points for varying step size

More details on the overall overlap variation due to tractography parameters and anatomical constraints are given in Supplementary Material B. In addition, Supplementary Material C provides detailed figures of the overall impact of injection sites, mouse brains and tractography parameters.

ANOVA analysis and the ranking of variation sources

In the previous sections, we examined several factors that affect the performance of tractography results. In this part, we perform N-way ANOVA analysis to compare the effects of these factors. The top panel of Fig. 8 shows the individual group means plotted using MATLAB's multcompare command. By checking the means and confidence intervals, we determined that only the groups with curvature = 287 µm were significantly different from all others, with a low mean value far from the other groups. We marked this data point as an outlier, crossed it with dark green and removed it from further analysis. (Outlier removal is explained in more detail in Supplementary Material D.) Gray and black in Fig. 8 show ANOVA results before and after outlier removal, respectively. Red data points show results when Mode-I and Mode-II are separated. Figure 8 shows that Mode-II gives higher DICE coefficients for all groups. Decreasing step or curvature increases the quality of overlap for Mode-II; however, this trend is reversed for Mode-I. For both modes, decreasing the cutoff threshold increases the DICE values. There is no significant gain below cutoff = 0.015 for Mode-II.

Fig. 8

Multiple comparison plots and F-statistics obtained by N-way ANOVA analysis. Multiple comparison shows a large separation of results with respect to mode. Red data points on multiple comparison plots show ANOVA results obtained when modes are separated. The outlier curvature = 287 µm group is removed from all F-statistic results

In the lower panel of Fig. 8, panel (a) shows the F-statistics for each variation source using all the experiments after the outlier group is removed. All factors are found to be statistically significant (\(p \ll 0.05\)). Here, we also included the results for TPR and FPR. The analysis mode is determined to be the most significant parameter for all measures. The order of significance among the other parameters changes depending on the measure. For the DICE coefficient, tractography parameters contribute to the total variation less than the injection or the subject in most cases.

Our results show that both the injection and the subject used in the experiments are big contributors to the total variation of the quality of overlap, and they are significant. Especially when Mode-I and Mode-II are separated, their significance becomes clearer (Fig. 8b, c). For the DICE coefficient, injection is the most significant source of variation when modes are separated. We observe from Fig. 8a–c that, in most cases, the number of streamlines is a more significant parameter than both step and curvature. Figure 8d shows that most of the variation from the number of streamlines is due to low values and that the variation decreases with the increase in the number of streamlines. It is also interesting to determine the sources of variation in the case that the experiments are performed with a sufficiently large number of streamlines. Figure 8d, e shows the F-statistics when modes are separated and the largest number of streamlines is used. We observe that, in this case, the two most significant sources of variation in the quality of overlap are the injection location and the subject, neither of which is a parameter of the tracking algorithm. The most significant tractography parameter is the cutoff threshold, followed by curvature and step.

Discussion

Understanding the roots of performance variation in tractograms is important to advance our techniques toward more reliable and reproducible results, which are essential for clinical practice. It is well known from earlier studies that the performance of tractography varies due to several factors, including the subject (Heiervang et al. 2006; Willats et al. 2014), tractography parameters (Mangin et al. 2013; Yeh et al. 2016; Thomas et al. 2014; Azadbakht et al. 2015), anatomical constraints and the fiber system (Smith et al. 2012; Donahue et al. 2016). Although the sources of variability studied in this work are well known, how much they contribute to the variability has not been thoroughly studied. Learning more about this is critical for (1) setting up protocols for tractography and (2) interpreting results. However, studying the coupled effects of these factors is challenging due to the high dimensionality of the parameter space. This demands a large set of experiments to cover a sufficient subset of the whole parameter space. For the validation of tractography, this is an additional challenge on top of the lack of ground truth, which we tried to overcome in this work.

For our study, we sampled a large portion of the parameter space responsible for performance variations in tractography, extracted trends and attained a comprehensive understanding of the factors within practical ranges of parameter choices. We collected results from 1,166,400 experiments: 8 mouse brains × 10 injections × 2 modes of analysis × 9 step sizes × 9 curvatures × 9 cutoff thresholds × 10 number of streamlines. It took about 40 days to complete all tractography experiments using 350 nodes in our computer cluster. The compressed data take 118 TB of hard drive space. Importantly, the large amount of experiments that we conducted enabled us to rank the parameters in terms of their contribution to the variability using N-way ANOVA analysis. In short, by exhaustively studying the sources of variation, we obtained an improved understanding on how to perform better tractography experiments and interpret results.

One notable study addressing the variation in tractograms due to a large number of factors is Cote et al. (2013). Here, the authors performed over 57,000 experiments to validate the results of different tractography experiments with respect to image acquisition settings, local reconstruction techniques (tensor, q-ball, and spherical deconvolution), curvature, step size and seeds. However, this study did not quantitatively analyze and compare the sources in terms of their contributions to the variability. Girard et al. (2014) is another important study toward reducing parameter biases. Here, the authors studied several factors and suggested optimal parameters for obtaining streamlines. However, while optimizing individual tractography parameters, they fixed all the other parameters and thus did not thoroughly investigate the sources of variation. One of the main differences between our work and earlier validation studies is that we analyzed how several factors in combination affect the results.

Our study fundamentally differs from the recent tractography validation works that were also based on injection experiments as ground truth. In those works (Calabrese et al. 2015; Chen et al. 2015b; Donahue et al. 2016), the authors studied a large number of injections, mainly focusing on the connectomics aspect of validation. They investigated how well the connections and connectome generated by tracer injection experiments match those generated by tractography, whereas in our study we conduct a deeper investigation on a limited number of injections and add another dimension to these works by showing how individual connections vary due to several factors. Among the previous validation studies with AMBA injection experiments, only Chen et al. (2015) tried to investigate how the performance of tractography varies with respect to its parameters for individual fiber systems. However, they studied only a few parameter combinations: three different cutoff thresholds and five different curvatures. Keeping in mind that connectomics is only one application of tractography and many other studies actually focus on specific fiber systems (Yamada et al. 2009; Mukherjee et al. 2008; Nucifora et al. 2007), we focused on studying certain injections. While doing that, we picked injections with projections to different areas of the brain to check the influence of this factor on the performance.

Anterograde tracers used in the AMBA study only label outgoing projections, i.e., the projections shown in Fig. 1 run from the cell bodies in the injection site to synapses. This is a common limitation of tracer injection-based validation studies (Heilingoetter and Jensen 2016). On the other hand, current tractography algorithms continue to propagate after synapses, since it is not possible to define stopping conditions at these locations using MRI data. Therefore, a part of the false positives obtained in this study might be due to incoming projections and connections after synapses. For the same reason, a part of the false negatives might also be overestimated. For a more thorough comparison in the future, we need more comprehensive information about different types of neuronal connections, including incoming, outgoing, reciprocal, and intermediate connections, which can be obtained using double coinjection tract tracing that combines both anterograde and retrograde agents (Zingg et al. 2014). Also, the reliability of the results can be improved by conducting the tracer experiments and dMRI-based tractography on the same subjects.

One other challenge regarding the injection experiments concerns the quantitative use of projection density values. Due to differences in experimental procedures, such as tracer doses and injection leakages, as well as variations in the intensities during STP imaging, projection density values are prone to variations (Oh et al. 2014; Dong 2007; Kuan et al. 2015). For these reasons, as ground truth, instead of the intensity of projections, we chose to consider whether there exists a projection to a voxel or not. Even so, due to partial volume effects, discretization errors and spurious projections, the ground truth images are still not ideal. Therefore, we chose to clean up the projection density data for the Mode-II analysis, where we also enforced anatomical constraints. Note that this is a typical step in other AMBA-based validation studies (Chen et al. 2015a, b; Keifer et al. 2015). On the other hand, for the Mode-I analysis, we did not process the projection density images, since doing so without quantitative information about which parts of the data are invalid would bias the results.

In addition to the parameters that we studied in our work, the choices of scanner (Sotiropoulos et al. 2016), image acquisition protocol (Daianu et al. 2015) and pre-processing pipeline (Albi et al. 2018) all have prominent effects on tractograms. However, for practical reasons, these are typically difficult to control or adjust. Results also change with respect to the choice of diffusion models (Thomas et al. 2014) and tractography algorithms (Azadbakht et al. 2015). Additionally, the length of connections (Jbabdi et al. 2015) is shown to be a critical factor as well. This factor, however, is highly dependent on the injection location that we used in this study, and the AMBA data do not explicitly provide ground truth for the length of connections. Although investigation of all sources of variation is valuable, this is out of the scope of our work. For our experimental setup, among the many factors that affect tractograms, we tried to pick those that most researchers can easily tune. In recent extensive technical validation studies (Maier-Hein et al. 2017; Neher et al. 2015), the authors tested different diffusion models and representations, including tensors (DTI) and the FOD. According to their study using 25,000 tractograms, the best results obtained with DTI-based approaches were almost always worse than any FOD-based technique. The authors concluded that "the scientific community needs to move beyond DTI for meaningful fiber tractography". Several other studies conducted earlier on diffusion models and tractography also pointed out the limitations of DTI-based tractography (Neher et al. 2014; Fillard et al. 2011; Cote et al. 2013). On the other hand, a significant majority of previous validation studies with tracer injections were performed based on this limited approach. Taking into consideration the new developments in the field, we chose to focus on a popular FOD-based probabilistic tractography approach in our experiments (Tournier et al. 2010).

Our work studies different subjects, fiber bundles (injection locations) and anatomical constraints (mode), which are common factors affecting tractography results independent of the diffusion model or tracking algorithm used. Therefore, we expect our conclusions concerning these factors to mostly translate to other studies as well, i.e., the fiber bundle that is studied is highly likely to be a more important source of variation than differences between subjects, irrespective of how tracks are obtained. For our study, the data points are collected using an FOD-based probabilistic tractography technique. We expect our results to translate to most similar probabilistic approaches as well, such as the iFOD1 algorithm in MRtrix3 (Tournier et al. 2010, 2012). Also, because the step size and curvature parameters are used in almost exactly the same way in most tractography algorithms (excluding global approaches), we expect curvature to be a more important parameter for variability than step size, regardless of the diffusion model or the tracking algorithm. Additionally, based on the similarities between commonly used tracking algorithms, we expect that an increased number of streamlines will always show a converging trend. On the other hand, the variability introduced by the number of streamlines and cutoff parameters is likely to differ with respect to the diffusion model and the tractography technique (deterministic, global).

The parameters for the fixed post-mortem mouse brain were chosen based on the studies of Wu et al. (2013, 2014). The main difference in the reconstruction of the diffusion model that affects tractography is the reduced diffusivity value (0.0008 mm2/s). On the other hand, Wu et al. (2013) report major differences between the in vivo and ex vivo imaging cases, such as the 60% reduction in ventricular volume after death and the large nearby deformations that accompany fixation. However, the volumes of major brain structures were not found to be significantly different. Because the propagation of the tracer occurs over several days while the subject is alive and the tractography experiments are conducted post-mortem, it is possible that the anatomical and microstructural changes due to death and fixation of the subject introduce a negative bias on the absolute values of the overlap quality measures, i.e., if the ground truth could be obtained without sacrificing the subject, the tractography experiment might well yield a better overlap when conducted in vivo. Additionally, because the FOD representations are common to both in vivo and ex vivo data, the lessons we learned from ex vivo data in our study will be valuable for in vivo experiments. While it is challenging to directly repeat these tractography experiments in vivo due to the lack of ground truth, there is a possibility of examining and replicating these findings on specific pathways such as the optic radiation, using the well-known anatomy of retinotopy (Benson et al. 2012; Aydogan and Shi 2016). The parameters we chose to study in our work are general and operate in the same way regardless of whether the subject is a post-mortem mouse or a living human. For any tractogram, anatomical constraints are applied by defining ROIs to avoid or include certain areas of the brain, regardless of the type of subject. Also, the tractography algorithms operate with the exact same principles irrespective of the subject that is studied. However, due to other sources of variation, such as differences in imaging devices and neuroanatomy, our results obtained using mice do not translate one-to-one to humans. Additionally, there are discrepancies in the values of tractography parameters; for example, the FOD cutoff value used to terminate streamlines is significantly lower in mouse subjects compared to the values in human studies. On the other hand, in our study we did not focus on the actual values of the parameters; we instead investigated the variability in performance within practical ranges of parameter choices. Therefore, although our quantitative results obtained using mice do not translate to human studies, we believe it is reasonable to expect the overall trends and rankings of the sources of variability to be parallel.

Comparison of the overlap between tractograms and injection experiments is a widely accepted validation approach of connections obtained using dMRI-based tractography (Dauguet et al. 2007; Thomas et al. 2014). This was also the choice to rank the performance of tractography protocols in the ISBI 2018 tractography challenge (https://my.vanderbilt.edu/votem/). In the rat brain, the mean diameters for unmyelinated and myelinated axons are reported to be around 0.2–0.6 µm and they vary from 0.02 to 3.0 µm (Partadiredja et al. 2003; Barazany et al. 2009). Capturing axon-level details requires high-resolution images (< 1 µm) which can be used to validate fiber orientations (Budde et al. 2011; Mollink et al. 2017). Although whole brain AMBA projection images are adequate for the validation of dMRI-based tractography, with a minimum resolution of 10 µm, they are not suitable to validate FODs. In our work, we used a state-of-the-art multi-shell, multi-compartment model to estimate FODs from dMRI, which are validated on simulated data (Tran and Shi 2015). We plan to investigate the accuracy of fiber orientations obtained based on this technique using high-resolution histology images in the future.

DICE measure can be significantly improved by focusing on false negatives

Our overall findings summarized in Table 4 show that the average overlap measured using the DICE coefficient without the incorporation of anatomical knowledge is 24.2%, and it increases to 31.9% when the anatomical constraint is applied. Similar to the results from previous studies (Dauguet et al. 2007; Calabrese et al. 2015), we find that the overlaps between tractography and tracer injections are relatively low but significant. Different from previous studies, our results additionally show that despite the dramatic increase in the overlap with the anatomical constraint, a large portion of the ground truth projections were still not captured by tractography—false negatives (FN). In Table 4, we separately show the estimated DICE values when false positive (FP) and FN contributions are removed. Changes in FP do not alter TP; however, decreasing FN to 0 implies perfect TP. For our estimate in the case of FN = 0, we conservatively forecasted an increase in FP that is proportional to the increase in TP. Our results point out that FN is more detrimental to the DICE score than FP. This underlines the other issue with tractograms: they not only contain a significant number of false positives, but also fail to capture all the projections.
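One way to read these estimates (the algebra below is our reconstruction of the procedure described above, under the stated assumption that FP grows in proportion to TP when FN is forced to 0, rather than a formula taken from the paper): setting FP = 0 leaves TP unchanged, while setting FN = 0 raises TP to TP + FN and scales FP by the same factor, giving

\[
\text{DICE}_{\text{FP}=0}=\frac{2\,\text{TP}}{2\,\text{TP}+\text{FN}},\qquad
\text{DICE}_{\text{FN}=0}=\frac{2(\text{TP}+\text{FN})}{2(\text{TP}+\text{FN})+\text{FP}\,\frac{\text{TP}+\text{FN}}{\text{TP}}}=\frac{2\,\text{TP}}{2\,\text{TP}+\text{FP}}.
\]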

Table 4 Average TPR, TNR, FPR, FNR, ACC and DICE values

Performance variations in tractography follow trends over parameter changes

Our results in "Examination of overlap measures and variation across subjects", "Examination of overlap variations due to tractography parameters and anatomical constraints", and "Examination of overlap variations due to injection sites" show that there is no simple way of moving toward the ideal FPR = 0 and TPR = 1 corner of the ROC plane. However, there are common trends in the results with respect to parameter changes. Figure 6 shows that the cutoff threshold determines the upper bounds for TPR and FPR regardless of the other tractography parameters. Given a subject and an injection location, this implies that the best TPR and worst FPR are limited once the cutoff is set. Figure 9 shows the overall trends that we obtained with respect to tractography parameter choices. It is a visual summary of how parameter changes move the overlap quality on the FPR vs TPR plane. In order to present a cleaner visualization, among the seven variables that are considered, the plots show average results over all subjects and injection sites. On top of each plot, the names of the fixed parameters are listed. All unlisted parameters are varied in the plots. The plots are colored according to the parameter listed in the title. For example, Fig. 9d includes data points for variable curvature and step size parameters, but the colors represent results from different curvatures.

Fig. 9

Summary of trends in tractography performance with respect to parameter changes. Data points show average values over all subjects and injection sites

Results based on different tractography protocols are difficult to relate

Our study has significant conclusions owing to the N-way ANOVA analysis. The variability in overlap performance between subjects shown in Fig. 7 suggests that subject-wise comparisons should be done with caution. Our results show that performance variations introduced by different tractography parameter choices (for example, cutoff) are comparable to variations introduced by different subjects. Additionally, our analysis indicates that all tractography parameters, including step size and number of tracks, significantly affect the performance. We found that varying injection locations and subjects affects overlap performance more than the cutoff threshold, curvature and step size. Although a part of the associated differences might be due to registration errors or actual differences between subjects, we believe variation due to neuroanatomy and subjects is one of the major challenges in determining the optimal parameters for tractography.

Improving tractography protocols and interpretation of results

Overall, our study points to new practices to consider for the application of tractography and the interpretation of its results:

  • Our work reveals that neuroanatomy plays the most critical role in determining the performance of tractography, which underscores the importance of employing anatomical information in fiber tracking. This justifies the efforts to improve tractography via anatomical constraints (Smith et al. 2012) or atlases (Rojkova et al. 2016). However, even with perfect anatomical constraints, our results show that false negatives dominate the overlap performance which should be taken into account during the interpretation.

  • The large variability in overlap quality due to injection locations shows that tractography techniques might provide better results if parameters are adjusted according to individual fiber bundles or regions of the brain that are being processed by the tracking algorithms.

  • As a corollary to the previous points, we provide evidence that tractograms obtained for specific fiber bundles with optimized parameters and well-built anatomical constraints are more reliable compared to whole brain tractograms used for connectomics where there are limited opportunities for parameter optimization and anatomical restrictions. This makes tractography a more reliable tool to study certain parts of the brain than others.

  • Our study strongly underlines the importance of documenting the complete list of tractography parameters. Because our N-way ANOVA analysis shows that variations due to subjects and tractography parameters are comparable, the motivation to document the complete tractography protocol goes beyond good practice. It is simply essential; without it, comparisons between different studies are potentially not meaningful.

  • Our results showing the importance of all tractography parameters have implications for designing tractography protocols and deciding on parameters. Because results do converge with an increasing number of tracks, it is good practice to study the convergence with respect to this parameter. On the other hand, our results show variability with respect to the step size, curvature and cutoff parameters. Therefore, conventional tractography studies that are based on a single parameter combination are prone to variations with respect to these parameters. One way to reduce biases due to fixed parameter choices is to repeat the experiments using different parameter combinations, which would improve the reliability of the results.

To conclude, we believe our findings will not only contribute to the literature of validation studies, but also improve our understanding of how to improve tractography experiments and interpret results.