
1 Introduction

Investigating the connectional organization of the human brain and its variation across subjects is a critical step toward understanding the pathology of many neuro-related disorders. Diffusion-weighted MRI offers a non-invasive approach to studying the tissue structure of white matter fiber bundles in vivo, including both their geometric shape and their diffusion properties [2, 6, 9, 12, 17, 24, 27]. Delineating diffusion statistics along fiber bundles may help identify structural connectivity abnormalities across different spatial-temporal scales, and could eventually inspire new approaches to disease prevention, diagnosis and clinical treatment.

Group analysis of fiber bundle statistics poses substantial computational and mathematical challenges to existing statistical methods. The first challenge is to efficiently and simultaneously study multiple fiber bundles with heterogeneous geometric structures and variation patterns. The second challenge is to correlate fiber bundle statistics with a large number of covariates, such as millions of genetic markers. This challenge is motivated by the demand to carry out genome-wide association studies on fiber bundle statistics. Voxel-wise methods [21] and single tract analysis [8, 26, 28] require massive multiple comparison adjustments, which severely reduce detection power. The third challenge is to properly handle the potential correlation among multiple tracts and to disentangle tract-specific information from global information shared by a large portion of fiber bundles.

The aim of this paper is to develop a hierarchical functional principal regression model (HFPRM) framework to address the three challenges discussed above. HFPRM consists of three statistical models: a varying coefficient model (VCM), a latent factor analysis (LFA) procedure, and a multivariate regression model (MRM). The path diagram of HFPRM is presented in Fig. 1. The VCM not only captures the functional structure of fiber bundle statistics for each single tract, but also maps the heterogeneous geometric structure of multiple fiber bundles onto a common coordinate system. The LFA is applied to characterize potential inter-tract correlation across multiple bundles. It allows us to explicitly identify both tract-specific and global latent signals. The integration of VCM and LFA dramatically reduces the dimension of fiber bundle statistics. Finally, using MRM, we are able to examine the effect of selected predictors at both the global level and the individual tract level.

Fig. 1. A schematic overview of HFPRM

In Sect. 2, we introduce the general framework of HFPRM and propose a two-stage estimation procedure to study both global and tract-specific effects. In Sects. 3 and 4, we use numerical simulations and a real data example to examine the finite sample performance of HFPRM. Section 5 concludes with some remarks.

2 Methods

2.1 Data Structure

Suppose that we obtain a data set with clinical and genetic variables as well as DTI statistics along M fiber bundles from n subjects. For the m-th fiber bundle, \(m=1,\cdots ,M\), we use \(s_m\in [0,S_m]\) to denote the arc length of any point relative to a fixed end point, where \(S_m\) is the longest arc length on the tract. For the i-th subject, where \(i=1,\cdots ,n\), \(y_{i,m}(s_m)\) denotes a specific diffusion statistic observed at arc length \(s_m\) along the m-th tract, and \({{\varvec{x}}}_i\) is a \(q \times 1\) vector of covariates.

2.2 HFPRM

HFPRM is proposed to study the association between diffusion properties (e.g., FA, MD or RD) along M fiber bundles and a set of covariates, such as age, gender, and genetic markers. It consists of three key components: a varying coefficient model (VCM), a latent factor analysis (LFA) procedure, and a multivariate regression model (MRM).

The VCM describes the functional association between \(\{y_{i,m}(s_m): s_m\in [0, S_m]\}\) and \({{\varvec{x}}}_i\) for a single tract. It admits the following form,

$$\begin{aligned} y_{i,m}(s_m)=\mu _m(s_m)+\eta _{i,m}(s_m)+e_{i,m}(s_m), \end{aligned}$$
(1)

where \(\mu _m(s_m)\) is the population mean function, \(\eta _{i,m}(s_m)\) is an individual function characterizing subject-specific spatial variations along the m-th tract, and \(e_{i,m}(s_m) \) is the measurement error. Let \(SP(0,\varSigma )\) represent a stochastic process with mean zero and covariance operator \(\varSigma (s_m, s_m')\). It is assumed that \(\eta _{i,m}(s_m)\) and \(e_{i,m}(s_m)\) are mutually independent and are, respectively, independent and identically distributed copies of the stochastic processes \(SP(0,\varSigma _{\eta _m})\) and \(SP(0,\varSigma _{e_m})\), in which \(\varSigma _{e_m}(s_m,s_m')=\sigma ^2_{e_m}(s_m)\mathbf{1}(s_m=s_m')\) and \(\mathbf{1}(\cdot )\) is an indicator function.

The major challenge in simultaneously studying M fiber bundles is the heterogeneity in their geometric structures. It is necessary to find a common coordinate system for \(\{\eta _{i,m}(s_m)\}_{m=1}^M\). Specifically, we use functional principal component analysis (fPCA) to extract the key features in \(\eta _{i,m}(s_m)\). Based on Mercer’s theorem, \(\varSigma _{\eta _m}(s_m,s_m')\) admits a spectral decomposition as follows:

$$\begin{aligned} \varSigma _{\eta _m}(s_m,s_m') = \sum _{d=1}^{+\infty } \lambda _{md} \phi _{md}(s_m) \phi _{md}(s_m'), \end{aligned}$$
(2)

where \(\{\lambda _{md} \ge 0\}\) are eigenvalues in descending order with \(\sum _{d=1}^{\infty } \lambda _{md} < \infty \) and \(\{\phi _{md}(s_m)\}\) are the corresponding orthonormal eigenfunctions. Using the Karhunen-Loève expansion [13, 16], \(\eta _{i,m}(s_m)\) can be expressed as

$$\begin{aligned} \eta _{i,m}(s_m)= & {} \sum _{d=1}^{+\infty } z_{i,md}\phi _{md}(s_m)~~\text{ with }~~ z_{i,md}=\displaystyle \int _0^{S_m} \eta _{i,m}(s_m) \phi _{md}(s_m) ds_m. \end{aligned}$$
(3)

The individual function \(\eta _{i,m}(s_m)\) can then be equivalently represented by a set of functional principal component (fPC) scores \(\{z_{i, md}: d=1, \ldots , \infty \}\). In practice, a relatively small number of fPC scores account for the majority of the variation in \(\eta _{i,m}(s_m)\). Therefore, we can approximate \(\eta _{i,m}(s_m)\) by a finite vector \({{\varvec{z}}}_{i,m}=(z_{i,m1},\ldots ,z_{i,mD})^T\) of dimension D. For notational simplicity, it is assumed that D is the same across all M bundles. We then use \({{\varvec{z}}}_{i,m}\) to integrate information across the M bundles and denote by \({{\varvec{z}}}_i\) the \(p \times 1\) vector that concatenates all \({{\varvec{z}}}_{i,m}\)s, where \(p=DM\).

An LFA is then proposed to account for potential inter-tract correlation across multiple bundles. Specifically, \({{\varvec{z}}}_i\) is assumed to have the following latent factor structure,

$$\begin{aligned} {{\varvec{z}}}_i={{\varvec{\varLambda }}}{{\varvec{f}}}_i+{{\varvec{u}}}_i, \end{aligned}$$
(4)

where \({\varvec{\varLambda }}\) is a \(p\times L\) loading matrix and \({{\varvec{f}}}_i\) and \({{\varvec{u}}}_i\), respectively, represent global and individual latent factors. When there exist homogeneous signal patterns across multiple fiber bundles, L is expected to be much smaller than p. The global factor \({{\varvec{f}}}_i\) thus allows us to study the shared pattern in a low dimensional space, while tract-specific patterns are captured by the components of \({{\varvec{u}}}_i=({{\varvec{u}}}_{i,1},\cdots ,{{\varvec{u}}}_{i,M})^T\).
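As a brief worked consequence of (4), and under the additional assumption (not stated explicitly in the text) that \({{\varvec{f}}}_i\) and \({{\varvec{u}}}_i\) are uncorrelated, the covariance of \({{\varvec{z}}}_i\) splits into a global part and a tract-specific part:

$$\begin{aligned} \text {Cov}({{\varvec{z}}}_i)={{\varvec{\varLambda }}}\,\text {Cov}({{\varvec{f}}}_i)\,{{\varvec{\varLambda }}}^T+\text {Cov}({{\varvec{u}}}_i). \end{aligned}$$

The first term has rank at most L, which suggests why a small number of leading eigenvectors of the covariance matrix of \({{\varvec{z}}}_i\) may be enough to capture the shared pattern.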

Finally, an MRM is introduced to correlate the global and individual latent factors with the covariates \({{\varvec{x}}}_i\),

$$\begin{aligned} {{\varvec{f}}}_i= & {} {{\varvec{B}}}_f^T{{\varvec{x}}}_i+{{\varvec{\epsilon }}}_{f,i} ~~\text{ and }~~ {{\varvec{u}}}_{i,m} ={{\varvec{B}}}_{u_m}^T {{\varvec{x}}}_i+{{\varvec{\epsilon }}}_{u_m,i}, ~ \text {for } m =1 ,\cdots , M, \end{aligned}$$
(5)

where \({{\varvec{B}}}_f\) and \({{\varvec{B}}}_{u_m}\) are, respectively, \(q\times L\) and \(q\times D\) coefficient matrices and \({{\varvec{\epsilon }}}_{f,i}\) and \({{\varvec{\epsilon }}}_{u_m,i}\) are residual terms. Using (5), we are able to perform a hierarchical analysis at both the global level and the individual tract level.

2.3 Estimation and Inference Procedure

In practice, diffusion statistics are observed on discrete grid points along each tract. For the m-th tract, assume that \(y_{i,m}(s_m)\) is observed on the sample point set \(\mathcal {S}_m=\{s_{m,1},\ldots ,s_{m,k},\ldots ,s_{m,K_m}\} \subset [0,S_m]\). We use the following two-stage procedure to estimate the fPC scores \(\mathbf {Z}=\{{{\varvec{z}}}_i\}_{1\le i\le n}\), the global factors \(\mathbf {F}=\{{{\varvec{f}}}_i\}_{1\le i\le n}\) and the individual factors \(\mathbf {U}=\{{{\varvec{u}}}_i\}_{1\le i\le n}\).

  • Stage I: For each tract, \(\mu _m(s_m)\) and \(\eta _{i,m}(s_m)\) are estimated from (1), and functional principal component analysis is applied to calculate \(\hat{\phi }_{md}(s_m)\) and \(\hat{{{\varvec{z}}}}_i\).

  • Stage II: Perform factor analysis on \(\hat{{{\varvec{z}}}}_i\) to extract global factor \(\hat{{{\varvec{f}}}}_i\) and individual factor \(\hat{{{\varvec{u}}}}_i\). Regression and hypothesis testing can then be applied on \(\hat{{{\varvec{f}}}}_i\) and \(\hat{{{\varvec{u}}}}_i\) respectively.

Details of the two stages are given below.

In Stage I, to estimate the mean curve from model (1), we apply the local linear kernel smoothing technique. \(\mu _{m}(s_m)\) is first approximated by the following Taylor expansion,

$$\begin{aligned} \mu _{m}(s_{m,k})\approx \mu _m(s_m)+d\mu _m(s_m)(s_{m,k}-s_m). \end{aligned}$$
(6)

Let K(s) be a predetermined smoothing kernel and denote by \(K_h(s)=\frac{1}{h}K(\frac{s}{h})\) the rescaled kernel with bandwidth h. Then \(\hat{\mu }_m(s_m)\) and \(d\hat{\mu }_m(s_m)\) are obtained as the minimizers, with respect to \(\mu _m(s_m)\) and \(d\mu _m(s_m)\), of the following weighted least squares function,

$$\begin{aligned} \sum _{i=1}^n \sum _{k=1}^{K_m}[y_{i,m}(s_{m,k})-\mu _m(s_m)-d\mu _m(s_m)(s_{m,k}-s_m)]^2K_h(s_{m,k}-s_m), \end{aligned}$$
(7)

and the solution \(\hat{\mu }_m(s_m)\) is a smooth curve with local linearity. A more complicated polynomial structure can be accommodated by using a higher-order expansion if necessary.
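As a rough illustration of the minimization in (7), the sketch below computes a local linear estimate of the mean curve on a grid. It assumes a Gaussian kernel and a user-supplied bandwidth `h`; both are choices made here for illustration, since the text leaves K(s) and h unspecified, and the function name `local_linear_mean` is hypothetical.

```python
import numpy as np

def local_linear_mean(s_grid, Y, h, s_eval=None):
    """Local linear estimate of a mean curve, minimizing (7) at each point.

    s_grid : (K,) arc-length grid shared by all subjects on one tract
    Y      : (n, K) observed curves, one row per subject
    h      : bandwidth (assumed given; no bandwidth selection is done here)
    """
    s_grid = np.asarray(s_grid, dtype=float)
    Y = np.asarray(Y, dtype=float)
    s_eval = s_grid if s_eval is None else np.asarray(s_eval, dtype=float)

    # The criterion is quadratic, so summing over subjects gives the same
    # minimizer as fitting the pointwise average curve.
    y_bar = Y.mean(axis=0)

    mu_hat = np.empty(len(s_eval))
    for j, s0 in enumerate(s_eval):
        d = s_grid - s0
        # Gaussian kernel K_h(d) = K(d / h) / h (an assumption of this sketch)
        w = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
        X = np.column_stack([np.ones_like(d), d])  # intercept and slope terms
        WX = X * w[:, None]
        coef = np.linalg.solve(X.T @ WX, WX.T @ y_bar)
        mu_hat[j] = coef[0]  # intercept is the fitted value at s0
    return mu_hat
```

The same routine could be applied to a single subject's centered curve to mimic (9), by passing a one-row array of residuals in place of `Y`.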

Similarly, we expand individual function \(\eta _{i,m}(s_m)\) for subject i as follows,

$$\begin{aligned} \eta _{i,m}(s_{m,k})\approx \eta _{i,m}(s_m)+d\eta _{i,m}(s_m)(s_{m,k}-s_m). \end{aligned}$$
(8)

The corresponding weighted least squares function is given by,

$$\begin{aligned} \sum _{k=1}^{K_m}[y_{i,m}(s_{m,k})-\hat{\mu }_{m}(s_{m,k})-\eta _{i,m}(s_m)-d\eta _{i,m}(s_m)(s_{m,k}-s_m)]^2K_h(s_{m,k}-s_m). \end{aligned}$$
(9)

Once the smoothed individual functions \(\{\hat{\eta }_{i,m}(s_m)\}_{i=1}^n\) are obtained, we can calculate the empirical covariance function \(\hat{\varSigma }_{\eta _m}(s_m,s_m^{\prime })=\frac{1}{n}\sum _{i=1}^n \hat{\eta }_{i,m}(s_m)\hat{\eta }_{i,m}(s_m^{\prime })\). The eigenfunctions \(\{\hat{\phi }_{md}(s_m)\}\) can then be estimated from its spectral decomposition,

$$\begin{aligned} \hat{\varSigma }_{\eta _m}(s_m,s_m^{\prime }) = \sum _{d} \hat{\lambda }_{md} \hat{\phi }_{md}(s_m) \hat{\phi }_{md}(s_m^{\prime }). \end{aligned}$$
(10)

The estimated individual function \(\hat{\eta }_{i,m}(s_m)\) is then projected onto the basis functions \(\{\hat{\phi }_{md}(s_m)\}\) to obtain the functional PC scores, with the integral in (3) approximated by a sum over the grid points,

$$\begin{aligned} \hat{z}_{i,md}=\sum _{k=1}^{K_m}\hat{\eta }_{i,m}(s_{m,k})\hat{\phi }_{md}(s_{m,k}). \end{aligned}$$
(11)
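The empirical covariance, its spectral decomposition (10) and the projection (11) can be sketched on a discrete grid as follows. This is a minimal illustration, not the authors' implementation: it assumes all subjects share one grid per tract, treats the eigenvectors as orthonormal in the plain Euclidean sense (no grid-spacing weights), and uses the hypothetical name `fpca_scores`.

```python
import numpy as np

def fpca_scores(eta_hat, D):
    """Discrete fPCA for one tract.

    eta_hat : (n, K) smoothed, mean-zero individual functions on a common grid
    D       : number of functional principal components to keep
    Returns (eigenvalues (D,), eigenvectors (K, D), scores (n, D)).
    """
    eta_hat = np.asarray(eta_hat, dtype=float)
    n = eta_hat.shape[0]

    # Empirical covariance evaluated on the grid
    sigma_hat = eta_hat.T @ eta_hat / n

    # eigh returns eigenvalues in ascending order; reorder to descending
    evals, evecs = np.linalg.eigh(sigma_hat)
    order = np.argsort(evals)[::-1][:D]
    lam = evals[order]
    phi = evecs[:, order]

    # Equation (11): project each curve onto the estimated eigenvectors
    Z = eta_hat @ phi
    return lam, phi, Z
```

With this normalization the sample second moments of the scores equal the retained eigenvalues, which gives a quick sanity check on the output.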

There are several strategies for determining the number of fPCs to extract. For example, analogs of model selection techniques such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC) [25] and cross-validation (CV) [20] have been generalized for this purpose. Alternatively, the percentage of explained variation is widely used in practice to give an appropriate cut-off. Here, we choose D as the minimum number of fPCs that accounts for at least \(V\%\) of the total variation in each tract. When the optimal \(D=D_m\) differs across tracts, the largest \(D_m\) is used for all tracts.
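The explained-variation rule described above can be written in a few lines. The sketch below is illustrative only; the function names `n_components_for_variance` and `common_D` are hypothetical, and the eigenvalues are assumed to come from a per-tract decomposition such as (10).

```python
import numpy as np

def n_components_for_variance(eigenvalues, V):
    """Smallest D whose leading eigenvalues explain at least V percent of variation."""
    lam = np.clip(np.asarray(eigenvalues, dtype=float), 0.0, None)
    lam = np.sort(lam)[::-1]               # descending order
    frac = np.cumsum(lam) / lam.sum()      # cumulative explained fraction
    d = int(np.searchsorted(frac, V / 100.0)) + 1
    return min(d, lam.size)                # guard against rounding at V = 100

def common_D(eigenvalues_per_tract, V):
    """Use the largest per-tract D_m for all tracts, as in the text."""
    return max(n_components_for_variance(lam, V) for lam in eigenvalues_per_tract)
```

Taking the maximum over tracts guarantees that every tract reaches the target, at the cost of keeping some extra components on tracts that would need fewer.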

In Stage II, a PCA-based factor analysis is performed. Let \(\{\hat{{\varvec{\xi }}}_1,\ldots ,\hat{{\varvec{\xi }}}_L\}\) be the first L eigenvectors of the sample covariance matrix \(\hat{\mathbf {\Sigma }}_\mathbf {z} =\frac{1}{n}\hat{\mathbf {Z}}^T\hat{\mathbf {Z}}\). The loading matrix, the global factors and the individual factors are estimated as,

$$\begin{aligned} \hat{\mathbf {\Lambda }}=\sqrt{p}(\hat{{\varvec{\xi }}}_1,\ldots ,\hat{{\varvec{\xi }}}_L),~ \hat{\mathbf {F}}=\frac{1}{p}\hat{\mathbf {Z}}\hat{\mathbf {\Lambda }},~ \text {and}~\hat{\mathbf {U}} = \hat{\mathbf {Z}} - \hat{\mathbf {F}}\hat{\mathbf {\Lambda }}^T \end{aligned}$$
(12)
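Equation (12) translates almost directly into matrix code. The sketch below assumes that the rows of \(\hat{\mathbf {Z}}\) are subjects and that the columns are already mean-zero, as the uncentered form of \(\hat{\mathbf {\Sigma }}_\mathbf {z}\) suggests; the function name `pca_factor_analysis` is hypothetical.

```python
import numpy as np

def pca_factor_analysis(Z_hat, L):
    """PCA-based factor analysis following (12).

    Z_hat : (n, p) matrix of concatenated fPC scores, assumed column-centered
    L     : number of global factors
    Returns (Lambda_hat (p, L), F_hat (n, L), U_hat (n, p)).
    """
    Z_hat = np.asarray(Z_hat, dtype=float)
    n, p = Z_hat.shape

    sigma_z = Z_hat.T @ Z_hat / n
    evals, evecs = np.linalg.eigh(sigma_z)   # ascending eigenvalues
    top = np.argsort(evals)[::-1][:L]        # indices of the L largest

    Lambda_hat = np.sqrt(p) * evecs[:, top]  # loading matrix
    F_hat = Z_hat @ Lambda_hat / p           # global factors
    U_hat = Z_hat - F_hat @ Lambda_hat.T     # individual (tract-specific) part
    return Lambda_hat, F_hat, U_hat
```

By construction the loadings satisfy \(\hat{\mathbf {\Lambda }}^T\hat{\mathbf {\Lambda }}/p=\mathbf {I}_L\), and the individual part is orthogonal to the loading directions, so the two levels of the hierarchy do not share signal along those directions.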

Finally, the MRM (5) is used to estimate the regression coefficients. Standard test statistics, such as Wald and score statistics, can subsequently be applied for inference.
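For a single column of \(\hat{\mathbf {F}}\) or \(\hat{\mathbf {U}}\), the regression and a one-degree-of-freedom Wald test might look like the sketch below. It assumes independent subjects with homoscedastic errors, ordinary least squares, a design matrix that already contains an intercept column, and a chi-square reference distribution; none of these details are specified in the text, and the name `ols_wald` is hypothetical.

```python
import math
import numpy as np

def ols_wald(X, y, j):
    """OLS fit of y on X and a Wald test of H0: beta_j = 0.

    X : (n, q) design matrix (include an intercept column if one is wanted)
    y : (n,) one estimated latent factor, e.g. a column of F_hat
    j : index of the coefficient under test
    Returns (beta_hat, wald_statistic, p_value).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, q = X.shape

    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (n - q)          # residual variance estimate

    wald = beta_hat[j] ** 2 / (sigma2 * xtx_inv[j, j])
    # Upper tail of a chi-square with 1 df: P(Z^2 > w) = erfc(sqrt(w / 2))
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return beta_hat, wald, p_value
```

Correlated samples, such as twin pairs, would need a model that accounts for within-pair dependence rather than this independent-subject sketch.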

3 Simulations

In this section, numerical simulations are conducted to evaluate the proposed method. In particular, we examine the performance of HFPRM in detecting covariate effects through hypothesis testing.

3.1 Setup

Eleven fiber tracts with FA measures, shown in Table 1, were selected from diffusion tensor tractography in the UNC Early Human Brain Development Studies [7]. Functional responses were simulated from a varying coefficient model with fixed covariate effects,

$$\begin{aligned} y_{i,m}(s_m)=\mu _m(s_m)+{{\varvec{\beta }}}_m(s_m)^T{{\varvec{x}}}_i+\eta _{i,m}(s_m)+e_{i,m}(s_m), \end{aligned}$$
(13)

where \(i=1,\cdots ,n\) and \(m=1,\cdots ,11\), \({{\varvec{\beta }}}_m(s_m)\) was a \(q \times 1\) vector of coefficient functions along the m-th tract, covariates \({{\varvec{x}}}_i=(x_{i1},\cdots ,x_{iq})^T\) were generated from N(0, 1) for continuous variables or from a multinomial distribution with equal probabilities for categorical variables, \(\eta _{i,m}(s_m)\) followed a Gaussian process \(GP\{0,\varSigma _{\eta _m}\}\) and \(e_{i,m}(s_m)\) followed \(GP\{0,\varSigma _{e_m}\}\). Compared to model (1), the above equation directly specified the covariates as fixed effects. The sample size n was set to 100 and the true parameters \(({{\varvec{\beta }}}_m(s_m), \varSigma _{\eta _m},\varSigma _{e_m})\) were estimated from real data using FADTTS [28].

To examine our method, the following two scenarios for \({{\varvec{\beta }}}_m(s_m)^T{{\varvec{x}}}_i\) were simulated. In case I, the aim was to study an effect shared by multiple tracts. Gender (G) and gestational age at birth (Gage) were included as covariates for all 11 tracts,

$$\begin{aligned} y_{i,m}(s_m)=\mu _m(s_m)+c\beta _{m,1}(s_m)\text {Gage}_i+\beta _{m,2}(s_m)\text {G}_i+\eta _{i,m}(s_m)+e_{i,m}(s_m),~ \forall m, \end{aligned}$$

where the effect size c was set to 0, 0.2, 0.4 or 0.6 and the Gage effect was tested.

In case II, we examined a tract-specific effect. In addition to the covariates of case I, birth weight (BW) was added as a covariate for one particular tract, the right uncinate fasciculus \((m=11)\),

$$\begin{aligned} y_{i,m}(s_m)= & {} \mu _m(s_m)+\beta _{m,1}(s_m)\text {Gage}_i+\beta _{m,2}(s_m)\text {G}_i+\eta _{i,m}(s_m)+e_{i,m}(s_m), ~m\le 10, \\ y_{i,11}(s_{11})= & {} \mu _{11}(s_{11})+\beta _{11,1}(s_{11})\text {Gage}_i+\beta _{11,2}(s_{11})\text {G}_i+c\beta _{11,3}(s_{11})\text {BW}_i\\+ & {} \eta _{i,11}(s_{11})+e_{i,11}(s_{11}), \end{aligned}$$

where the effect size c was set to 0, 0.5, 1 or 1.5 and the effect of BW was tested.

We applied HFPRM to the simulated datasets. The varying coefficient model (1) was first fitted to estimate the individual functions. Functional principal components were then extracted such that at least 85% of the total variation was included for each tract. In the factor analysis, the first elbow point in the scree plot was taken as the cut-off to determine the number of global factors. In the testing step, type I error and statistical power were calculated at significance level \(\alpha =0.05\) based on 1000 simulation replications. FADTTS was also applied to each single tract and the results were compared.
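The rejection rates reported below are Monte Carlo proportions: with c = 0 the proportion estimates the type I error, and with c > 0 it estimates power. A small sketch of that bookkeeping, including a binomial Monte Carlo standard error that the text does not report, is given here; the function name `rejection_rate` is hypothetical.

```python
import numpy as np

def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications with p < alpha, with its Monte Carlo standard error."""
    p = np.asarray(p_values, dtype=float)
    rate = float(np.mean(p < alpha))
    mc_se = float(np.sqrt(rate * (1.0 - rate) / p.size))  # binomial standard error
    return rate, mc_se
```

With 1000 replications and a true rejection probability of 0.05, this standard error is about 0.007, which gives a sense of how much an empirical type I error can deviate from the nominal level by chance.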

3.2 Results

In case I, the first five functional principal components were extracted for each tract and the first factor was identified as the global factor. The rejection rates of the global factor analysis and FADTTS for testing the Gage effect are presented in Fig. 2(a). The global factor analysis is substantially more powerful than the single tract analysis in detecting a commonly shared effect. Such results are expected, since a common effect tends to accumulate in the global factor.

Fig. 2. Simulation result

In case II, the first five functional principal components were extracted for each tract and the first two factors were identified as global factors. Figure 2(b) shows the rejection rates of the global factor analysis, the individual factor analysis and FADTTS for testing the BW effect. As can be seen, the individual factor analysis in HFPRM achieves power comparable to that of the single tract analysis in detecting a tract-specific effect.

4 Early Human Brain Development Study

To investigate how genetic factors influence brain structure in the prenatal and early postnatal stages, we conducted a genome-wide association study on the fiber bundle statistics in a unique cohort of infants. A total of 662 neonatal twin subjects were taken from the UNC Early Brain Development Studies [7].

4.1 Data Acquisition and Preprocessing

MRI scans were acquired either on a 3T Siemens Allegra head-only scanner (N = 566) or on a 3T Siemens TIM Trio scanner (N = 96). For the Allegra model, 339 diffusion weighted images were acquired by a single shot EPI DTI sequence with the following parameters: TR/TE = 5200/73 ms, voxel resolution = \(2 \times 2 \times 2\,\text {mm}^3\), 6 non-collinear directions with \(b=1000\,\text {s}/\text {mm}^2\) and 1 baseline image with \(b=0\). To improve the signal-to-noise ratio, five repeated scans were acquired and averaged. For the remaining subjects scanned on the Allegra, DWIs were acquired with the following parameters: TR/TE = 7680/82 ms, voxel resolution = \(2 \times 2 \times 2\,\text {mm}^3\), 42 non-collinear directions with diffusion gradients of \(b=1000\,\text {s}/\text {mm}^2\) in addition to 7 baseline images. For the Trio model, DWIs were acquired using a protocol similar to the 42-direction Allegra protocol, with TR/TE = 7200/83 ms. Quality control was applied to the raw DWIs using DTIPrep [18], and FSL [11, 22] was used for skull stripping and brain masking. We used a weighted least squares method [8] to estimate diffusion tensors and followed the UNC-Utah NA-MIC framework [23] to create a study-specific atlas. Subsequently, a total of 44 fiber tracts, listed in Table 1, were reconstructed in the atlas space using a streamline algorithm [5]. For each subject, four scalar diffusion properties, FA, MD, AD and RD, were then calculated at each location along each tract using neighboring diffusion tensors.

Genotyping of single nucleotide polymorphisms (SNPs) was conducted on the Affymetrix Axiom genome-wide LAT Array. Samples with call rates below \(95\%\), as well as homozygosity outliers, ancestry outliers and samples showing unexpected relatedness, were excluded from the study. We also removed genetic markers with a Hardy-Weinberg equilibrium p-value less than \(10^{-8}\), a call rate less than \(95\%\) or a Mendelian error rate larger than \(10\%\). Population stratification was assessed using PCA [19]. Imputation was performed with MaCH-Admix [15] using the 1000G reference panel [3]. To evaluate the quality of the imputed SNPs, we computed the mean \(R^2\) under varying minor allele frequency (MAF) categories and selected \(R^2\) cutoffs as described in [14]. SNPs with MAF less than 0.01 were excluded from the imputed dataset. Eventually, 472 twin subjects (32 MZ pairs, 75 DZ pairs and 259 singletons or unpaired twin subjects) and 8,538,562 genetic markers were retained for further analysis.

Table 1. List of fiber tracts in simulation and real data experiment

4.2 Data Analysis

In this experiment, we focused on the fractional anisotropy (FA) measure. FA quantifies the extent of local directional water diffusion and partially reflects the degree of bundle maturation in premature brains [4]. To eliminate the heterogeneity in variance among different tracts, \(y_{i,m}(s_m)\) was rescaled by the total standard deviation along the tract. Because the sample comes from a twin study, an ACE model was fitted in (5) to account for correlation within twin pairs. Seven variables were included as covariates: gestational age at birth, gender, DTI direction, scanner type and the first three genetic principal components, the last of which adjust for population stratification.

4.3 Results

In functional PCA, the first 5 functional principal components were extracted for each tract to include at least 70% of the variation. Figure 3(a) shows the scree plot from the factor analysis, in which the elbow point is located at factor 2. Therefore, the first factor was identified as the global factor. We then performed GWAS on the global factor; the result is shown in Fig. 3(b). In the Manhattan plot, we observed a significant region in the anaplastic lymphoma kinase (ALK) gene on chromosome 2. The ALK gene encodes a neuronal orphan receptor tyrosine kinase that plays an important role in nervous system development [1], and is highly expressed in the neonatal brain [10]. For comparison, we also performed association analysis for the top hit rs66556850 on each single tract; the result is presented in Fig. 3(c). A number of tracts have relatively small p-values, yet not small enough to be detected by a single tract GWAS. This indicates that the global factor analysis is more powerful than single tract analysis for detecting commonly shared genetic effects.

Fig. 3. Real data analysis result: (a) Functional PCA and factor analysis. (b) Visualization of the GWAS result for the global factor. (c) A comparison between the global factor analysis and the single tract analysis on marker rs66556850; the \(-\text {log}_{10} p\) value of the association test is plotted. The majority of p-values in the single tract analysis lie between \(10^{-6}\) and \(10^{-2}\).

5 Conclusion

We have developed a hierarchical functional principal regression model (HFPRM) to efficiently conduct joint analysis of diffusion statistics from multiple fiber bundles. A varying coefficient model is introduced and functional PCA is applied to capture the major variation within each tract. Factor analysis is then adopted to extract key features at both the global level and the individual tract level. Finally, standard estimation and testing procedures can be applied to study global and tract-specific effects. Simulation results demonstrated that HFPRM is powerful in detecting a common effect shared by multiple tracts. HFPRM has also been applied to a genome-wide association study on neonatal twins, in which we were able to identify genetic variants related to early brain development that were missed by single tract analysis.