
1 Introduction

Functional magnetic resonance imaging (fMRI) is an imaging technique that allows the non-invasive study of human brain activity. Such a technique provides a high-resolution 3d image reconstruction of the human brain, starting from the blood-oxygen-level dependent (bold) signal. The bold value is the difference in magnetization between oxygenated and deoxygenated blood, arising from changes in regional cerebral blood flow. In particular, the data at our disposal consist of a collection of bold signals obtained from a resting state functional magnetic resonance imaging (rs-fMRI) session, meaning that the subjects were not performing any explicit task during the scan. Refer, for instance, to [1,2,3,4] for detailed discussions on rs-fMRI data, the statistical techniques commonly employed, and the medical implications.

From a modeling perspective, what emerges from a rs-fMRI scan is a collection of spatially dependent functional observations. This kind of data collection has encouraged the development of suitable statistical techniques, and indeed, several novel spatio-temporal and dynamic models have been proposed (e.g., [5,6,7,8,9,10]). Within the Bayesian framework, comprehensive reviews of the main statistical methodologies employed for fMRI data are given in [11, 12].

Since we are proposing a preliminary specification here, we focus on modeling one subject at a time, i.e., a single brain, which is usually referred to as single-subject analysis. Although such an approach does not account for borrowing of information across subjects, it simplifies the modeling process and the related estimation procedures. An early reference to single-subject analysis is given by [13], who propose a general linear model to learn about the blood activity of a single brain. Several contributions in single-subject modeling have appeared since, and we mention just a few here. For example, the authors of [7] specify a Gaussian random field to capture the spatial correlation, while in [5] the spatial dependence is induced through a hierarchical specification of the parameters.

One of the main goals in the analysis of rs-fMRI data is to study the complex covariance structure between brain regions [3, 14]. In this paper, we propose a Bayesian factor model for fMRI data which is based on the structural assumption of separability. This means that, with regard to the dependence in brain activity across regions, we assume that the covariance structure can be split into two multiplicative components: a spatial and a temporal one. Mainly motivated by the high computational cost that would arise from a non-separable specification, separability has been employed in several fMRI applications (e.g. [6, 7, 15]). The model presented in this paper benefits from this simplifying assumption, which, in addition, provides interpretable inferential results. This allows the assessment of functional connectivity across brain regions.

The paper is organized as follows. In Sect. 2, we introduce the rs-fMRI dataset at our disposal and we conduct some explorative analyses. In Sect. 3, we specify a single-subject Bayesian factor model for the blood functional activity, which also accounts for temporal dependence. In Sect. 4, we present a Markov chain Monte Carlo (mcmc) algorithm for fitting the proposed model. In Sect. 5, we discuss the performance of our model and we present some empirical results. Concluding remarks are given in Sect. 6.

Fig. 1: bold functional activities for subjects \(i=1,2\), for the \(L\) brain regions of the Desikan parcellation [16] and \(t=1,\ldots ,403\), after the c-pac pre-processing

2 The rs-fMRI Dataset

Our dataset comes from the pilot study of the Enhanced Nathan Kline Institute-Rockland Sample project (enkirs), which aims at providing a publicly available large sample of multimodal neuroimaging data. Comprehensive information about the project can be found at the link http://fcon_1000.projects.nitrc.org/indi/CoRR/html/nki_1.html. From the original multimodal imaging dataset, we retained the bold values of two different subjects, which were randomly chosen among the patients. The bold values refer to the \(L = 68\) brain regions of the Desikan atlas parcellation [16], equally divided into the left and right hemispheres, discarding the two regions labeled as unknown.

Each measurement is a functional observation composed of \(T=403\) equally spaced bold values, with a lag of approximately 1400 ms, meaning that our dataset comprises two matrices of size \(L \times T\), for individuals \(i=1, 2\). From the original dataset, two bold values were discarded since they were missing. The bold functional activities, displayed in Fig. 1, are obtained from the raw rs-fMRI scans through an automated pipeline called c-pac, whose details can be found at https://fcp-indi.github.io.

As shown in Fig. 1, the set of regional bold functional activities can be regarded as multiple realizations of continuous functions. That is, bold, though continuous in time at each region, is evaluated over a finite grid of times \(t=1,\ldots ,T\), while brain regions are specified discretely, being obtained from the Desikan parcellation [16]. There is a considerable statistical literature on spatio-temporal modeling in continuous space and time (see e.g., [17]), particularly in the Bayesian setting [18]. However, data over continuous time and discrete space are rather uncommon in spatio-temporal applications, as pointed out in [19]. In particular, our data should not be modeled, at least in principle, through classical multivariate time series models, since the bold activities are continuous in time. Refer for instance to (Chap. 1, [20]) about the use of continuous models for functional observations. Moreover, our data cannot be modeled via standard functional data analysis techniques, because some form of dependence across brain regions is expected. There is a need for a general modeling methodology for the analysis of this type of rs-fMRI data. We aim to partially fill this gap by introducing a simple spatio-temporal model in the continuous-time and discrete-space framework. Then, we apply it to the rs-fMRI data.

Fig. 2: Absolute value of the Pearson correlation coefficients among the bold functional activities of the 68 brain regions, for subjects \(i=1,2\). Regions 1–34 refer to the left hemisphere, whereas the remaining regions 35–68 refer to the right hemisphere

As already mentioned, one of the main goals in the analysis of rs-fMRI data is to study the functional connectivity, i.e., the dependence between brain regions [3, 14]. A simple approach consists in computing the Pearson correlation coefficients between bold functional activities, treating the bold values as if they were independent over time [2]. The correlation coefficients for subjects \(i=1,2\) are shown in Fig. 2. We argue that this strategy, although useful in an explorative phase, could lead to misleading inferential results, for instance revealing fictitious relationships which are due to temporal dependence. In fact, bold functional activities are characterized by a non-negligible amount of autocorrelation, as evidenced in Fig. 3, suggesting that Pearson coefficients should be, at the very least, interpreted with care. Nonetheless, correlation matrices like those in Fig. 2 provide an interpretable picture of the bold functional connectivity. Additionally, dichotomized versions of these correlations often form the basis of network analyses for functional connectivity [21, 22]. We aim to preserve this simple structure by seeking a model that naturally leads to an alternative estimate of such a correlation structure, but also takes into account the temporal component.
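
As an illustration, the explorative quantities displayed in Figs. 2 and 3 can be computed with a few lines of R; the data matrix below is a simulated placeholder standing in for the \(L \times T\) matrix of bold values of one subject.

```r
## Explorative analysis for one subject; B is an L x T matrix of bold values
## (a simulated placeholder here, standing in for the actual data matrix).
set.seed(123)
L <- 68; TT <- 403                       # TT plays the role of T in the paper
B <- matrix(rnorm(L * TT), L, TT)

## Absolute Pearson correlations among regions (as in Fig. 2),
## treating the bold values as if they were independent over time
R_abs <- abs(cor(t(B)))                  # L x L matrix

## Autocorrelation functions by region (as in Fig. 3)
acf_by_region <- apply(B, 1, function(b) acf(b, lag.max = 50, plot = FALSE)$acf)
```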

Fig. 3: Autocorrelation functions for brain regions \(l=1,\ldots ,68\) and for subjects \(i=1,2\)

Some additional difficulties arise when trying to model the spatial component of our rs-fMRI dataset. In particular, areally referenced temporal processes typically rely on some notion of distance, or neighborhood, between different regions, whose definition crucially impacts the results of the analysis. However, given the natural complexity of brain morphometry, the choice of such a distance inevitably raises questions [23]. Although there is some evidence that connectivity among brain regions rapidly decays as a function of the Euclidean distance [23, 24], this is often a crude approximation. For instance, as shown in Fig. 2, high levels of connectivity characterize symmetric pairs of brain regions, which are far apart in terms of Euclidean distance. In order to avoid potential misspecification issues, we do not rely on any notion of physical distance between brain regions. Thus, the spatial structure of the bold functional activities is reconstructed entirely from the data, without imposing the brain morphometry. This is not to say that there is no potential information in the proximity of regions. If a suitable measure reflecting the foregoing caveats were developed, it would provide valuable information and could improve estimation performance.

3 Modeling and Theory

3.1 Low-Rank Multivariate Processes

Consistent with the discussion in Sect. 2, we propose a hierarchical model for rs-fMRI data which (i) accounts for both the spatial and temporal aspects, specifying Gaussian processes for time and latent factor models for the spatial dimension; (ii) allows a simple interpretation of the functional connectivity among brain regions in terms of a suitable covariance matrix; (iii) avoids misspecification issues by placing few assumptions on the spatial structure. Again, we focus on single-subject models, which means that the two individuals \(i=1,2\) are treated separately and independently, having in common only the model structure. To simplify notation, we omit the subject index and describe the model for a generic brain. We also note that, with a single subject, we cannot build a regression explanation of the response, since we cannot include individual-level covariate information.

We aim to describe the joint behavior of realizations from an L-dimensional stochastic process, i.e., the collection of the bold functional activities. Formally, we will denote the L-dimensional stochastic process as \(\varvec{B}(t)\), whose entries are the bold functional activities \(B_{l}(t)\), for \(l=1,\ldots ,L\). We assume a customary additive error structure, that is

$$\begin{aligned} \varvec{B}(t) = \varvec{Z}(t) + \varvec{\epsilon }(t), \end{aligned}$$
(1)

where \(\varvec{\epsilon }(t)\) is an L-dimensional pure error process, and \(\varvec{Z}(t)\) is an L-dimensional process which we refer to as the mean process. Specifically, \(\varvec{\epsilon }(t)\) is a Gaussian white noise process with variance \(\sigma ^2\), whose entries are independent Gaussian random variables over time and brain regions. Notice that no intercept term is included in specification (1) since, as shown in Fig. 1, the data are centered around zero during the c-pac pipeline.

The overarching goal of our contribution is to infer functional connectivities among brain regions [3]. Therefore, consistent with the available literature and with the descriptive analysis in Sect. 2, the components of the L-dimensional process \(\varvec{Z}(t)\) should not be modeled as independent realizations. Moreover, the high dimensionality of our data calls for parsimonious representations, which can be obtained for instance via low-rank approximations. As a notable example, covariance regression models [25, 26] address similar issues and assume that the mean process \(\varvec{Z}(t)\) can be decomposed as follows:

$$\begin{aligned} \varvec{Z}(t) = \varvec{A}(t)\varvec{V}(t), \end{aligned}$$
(2)

where \(\varvec{A}(t)\) is an \(L \times K\) time varying factor loading matrix, and \(\varvec{V}(t)\) is a K-dimensional vector whose entries are independent Gaussian processes, called latent factors in this context. The dimensionality reduction is performed by fixing some \(K \ll L\), e.g. \(K=3\) or \(K=5\). The covariance regression model of [26] is a flexible model for multidimensional stochastic processes, having large support and a familiar interpretation in terms of Bayesian factor models, for any fixed time. Moreover, with discrete time, the covariance regression model could be formally related to the class of dynamic latent factor models (Chap. 10, [27]).

3.2 A Time-Dependent Latent Factor Model

We simplify (2) by setting \(\varvec{A}(t) = \varvec{A}\), that is, the matrix \(\varvec{A}\) is now constant over time, so that

$$\begin{aligned} \varvec{Z}(t) = \varvec{A}\varvec{V}(t). \end{aligned}$$
(3)

Hence, \(\varvec{A}\) is an \(L \times K\) factor loading matrix. Although such an assumption reduces the global flexibility of the covariance regression model in Eq. (2), it allows the factor loading matrix \(\varvec{A}\) to be interpreted as a simple measure of dependence, i.e., connectivity, among brain regions, a key feature of our analysis.

In Sect. 5 we provide some empirical support for the factorization (3) as a reasonable assumption for modeling rs-fMRI data, while in Sect. 6 we discuss possible extensions to the non-stationary case. Additionally, decomposition (3) formally relates our model to the class of latent factor models, which have been used as a dimension reduction tool, for instance, in genomic applications [28, 29]. Thus, our model can be regarded as a time-dependent extension of a Bayesian factor model, in which the latent factors \(\varvec{V}(t)\) are independent random functions of time rather than independent draws.

Gaussian processes [30] are a flexible class of stochastic processes providing random realizations within the space of functions over a specified domain. Therefore, they are suitable candidates for modeling the time-dependent latent factors \(\varvec{V}(t)\). We suppose that the components of \(\varvec{V}(t)\) are independent and identically distributed Gaussian processes \(\text {GP}\left( 0, \kappa \rho (t, t')\right) \), with zero mean and correlation function \(\rho (t, t')\). As we will discuss in Sect. 3.3, for identifiability purposes we assume \(\kappa =1\). Independence among the Gaussian latent factors and the restricted factorization (3) imply a multivariate Gaussian distribution for the mean process \(\varvec{Z}(t)\) evaluated at a fixed time \(t_0\), that is

$$\begin{aligned} \varvec{Z}(t_0) \sim \text {N}_{L}\left( 0, \varvec{\varSigma }_{\varvec{A}}\right) , \qquad \text { for any fixed } t_0, \end{aligned}$$
(4)

with \(\varvec{\varSigma }_{\varvec{A}} = \varvec{A}\varvec{A}^{\mathsf {T}}\), which does not depend on time. The role of \(\varvec{A}\) is now clearer, since it can be viewed as the square root of the covariance matrix \(\varvec{\varSigma }_{\varvec{A}}\). We remark that \(\varvec{\varSigma }_{\varvec{A}}\) is singular, being of rank \(K \ll L\). In turn, this implies that \(\varvec{Z}(t_0)\), for any fixed \(t_0\), is a degenerate multivariate Gaussian, lying in a subspace of dimension K. Factorization (3) effectively induces dependence among the components of the mean process, i.e., among brain regions, but implicitly enforces some form of stationarity, since the spatial dependence structure is constant over time. Such an assumption is discussed in depth, for instance, in [31], who suggest that it might be worth looking at non-stationary models to obtain a more complete picture of the phenomenon. However, as argued by [31] themselves, stationarity is also convenient in order to prevent the model from becoming vastly more complex. The stationary temporal dependence of our model can be appreciated by observing the covariance matrix between \(\varvec{Z}(t)\) and \(\varvec{Z}(t')\), that is, the so-called cross-covariance matrix

$$\begin{aligned} \text {Cov}\left( \varvec{Z}(t),\varvec{Z}(t')\right) = \rho (t,t') \varvec{\varSigma }_{\varvec{A}}, \qquad t\ne t', \end{aligned}$$
(5)

whose limit as \(|t-t'| \rightarrow 0\) is \(\varvec{\varSigma }_{\varvec{A}} = \varvec{A}\varvec{A}^{\mathsf {T}}\), consistent with Eq. (4). Moreover, for \(t\ne t'\) we have \(\text {Cov}\left( \varvec{B}(t),\varvec{B}(t')\right) = \text {Cov}\left( \varvec{Z}(t),\varvec{Z}(t')\right) \). Thus, the cross-covariance in (5) has an appealing interpretation: dependence between bold values is multiplicatively adjusted according to temporal proximity.

In practice, we observe the bold functional activities only over a finite grid of times \(t=1,\ldots ,T\); we denote by \(\varvec{\mathcal {Z}}\) the \(L \times T\) matrix containing the values of \(\varvec{Z}(t)\) over this time grid. Also, let \(\varvec{\mathcal {B}}\) be the \(L \times T\) observed data matrix having entries \(B_{l}(t)\), for \(t=1,\ldots ,T\). We can re-express the model of Eqs. (1), (3) and (4) in terms of matrix Gaussian distributions [32], evaluated over the finite time grid:

$$\begin{aligned} (\varvec{\mathcal {B}} \mid \varvec{\mathcal {Z}}, \sigma ^2)&\sim \text {N}_{L, T}(\varvec{\mathcal {Z}}, \sigma ^2 I_{L \times L}, I_{T \times T}), \end{aligned}$$
(6)
$$\begin{aligned} (\varvec{\mathcal {Z}} \mid \varvec{A})&\sim \text {N}_{L , T}(0, \varvec{\varSigma }_{\varvec{A}}, \varvec{\varSigma }_{\varvec{T}}), \end{aligned}$$
(7)

where \(\varvec{\varSigma }_{\varvec{T}}\) denotes the Gram matrix obtained by evaluating the correlation function \(\rho (t, t')\) over the finite grid \(t=1,\ldots ,T\). Notice that the stationarity assumption translates into a separability assumption in the finite-dimensional setting, since we have

$$\begin{aligned} (\text {vec}(\varvec{\mathcal {Z}}) \mid \varvec{A}) \sim \text {N}(0, \varvec{\varSigma } ), \qquad \varvec{\varSigma } = \varvec{\varSigma }_{\varvec{T}} \otimes \varvec{\varSigma }_{\varvec{A}}. \end{aligned}$$
(8)

This convenient separability result is well described in the literature on multivariate spatial processes: see for instance [18]. The factorization \(\varvec{\varSigma } = \varvec{\varSigma }_{\varvec{T}} \otimes \varvec{\varSigma }_{\varvec{A}}\) has relevant benefits: it provides a parsimonious representation of the covariance matrix \(\varvec{\varSigma }\) and it facilitates numerical computations. Notice that, under the separability assumption, each row of \(\varvec{\mathcal {Z}}\) is marginally a multivariate Gaussian with covariance matrix proportional to \(\varvec{\varSigma }_{\varvec{T}}\) and, symmetrically, each column of \(\varvec{\mathcal {Z}}\) follows a multivariate Gaussian with covariance proportional to \(\varvec{\varSigma }_{\varvec{A}}\). In other words, the dependence structure over time does not depend on the brain regions, and vice versa.
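
A minimal R sketch of the finite-grid formulation in Eqs. (6)–(8) follows; the dimensions, the loading matrix and the correlation function are illustrative placeholders rather than the values used in the analysis.

```r
set.seed(1)
L <- 10; K <- 3; TT <- 50; sigma2 <- 0.5      # illustrative dimensions; TT plays the role of T

## Illustrative lower-triangular loading matrix A (L x K) with positive diagonal
A <- matrix(rnorm(L * K), L, K)
A[upper.tri(A)] <- 0
for (k in 1:K) A[k, k] <- abs(A[k, k])

## Spatial and temporal covariance matrices
Sigma_A <- A %*% t(A)                                   # L x L, rank K
rho     <- function(t, s) exp(-0.03 * abs(t - s))       # exponential correlation function
Sigma_T <- outer(1:TT, 1:TT, rho)                       # T x T Gram matrix

## Simulate Z from Eq. (7) as Z = A V, where the K rows of V are independent
## Gaussian processes evaluated over the time grid
V <- t(chol(Sigma_T)) %*% matrix(rnorm(TT * K), TT, K)  # TT x K, each column ~ N(0, Sigma_T)
Z <- A %*% t(V)                                         # L x TT
B <- Z + matrix(rnorm(L * TT, sd = sqrt(sigma2)), L, TT)  # Eq. (6)

## Separability (Eq. 8): Cov(vec(Z)) equals the Kronecker product Sigma_T %x% Sigma_A
Sigma <- Sigma_T %x% Sigma_A                            # (L * TT) x (L * TT)
```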

3.3 Identifiability

Without further restrictions, the model described in Eqs. (6) and (7) is not identified. There are two sources of non-identifiability, which can be handled by imposing some constraints on the parameters. Notice that \(\varvec{A}\) appears in Eq. (7) only through the product with its transpose. This means, for instance, that for any orthogonal matrix \(\varvec{Q}\) such that \(\varvec{Q} \varvec{Q}^\mathsf {T} = I_{K\times K}\) we get

$$\begin{aligned} \varvec{A} \varvec{A}^{\mathsf {T}} = \varvec{A} \varvec{Q} \varvec{Q}^\mathsf {T} \varvec{A}^{\mathsf {T}} = \tilde{\varvec{A}} \tilde{\varvec{A}}^{\mathsf {T}}, \end{aligned}$$

where \(\tilde{\varvec{A}} = \varvec{A} \varvec{Q}\). Thus, it is not possible to discriminate between a model with parameter \(\varvec{A}\) and another with parameter \(\tilde{\varvec{A}}\), since they lead to exactly the same likelihood. Thus, we let \(\varvec{A}\) to be lower triangular with positive diagonal, as commonly done in coregionalization models in spatial statistics [18]. To avoid confusions: since \(\varvec{A}\) is a \(L \times K\) rectangular matrix, by lower triangular with positive diagonal we mean that the elements \(a_{lk}\) of \(\varvec{A}\) are such that \(a_{lk} = 0\) for \(k > l\) and \(a_{kk} > 0\) for \(k=1,\ldots ,K\). Thanks to the Cholesky decomposition for positive semi-definite matrices, under these assumptions the matrix \(\varvec{A}\) is a Cholesky factor, which uniquely identifies \(\varvec{\varSigma }_{\varvec{A}} = \varvec{A}\varvec{A}^\mathsf {T}\).

The second source of non-identifiability concerns the scale of the covariance matrices in Eq. (7). For any positive constant \(c \in \mathbb {R}^+\) we have that

$$\begin{aligned} \varvec{\varSigma }_{\varvec{T}} \otimes \varvec{\varSigma }_{\varvec{A}} = \left( \frac{1}{c}\varvec{\varSigma }_{\varvec{T}} \right) \otimes (c \varvec{\varSigma }_{\varvec{A}}) = \tilde{\varvec{\varSigma }}_{\varvec{T}} \otimes \tilde{\varvec{\varSigma }}_{\varvec{A}}, \end{aligned}$$

which leads again to non-identifiability. To overcome this difficulty, we set the trace \(\text {Tr}(\varvec{\varSigma }_{\varvec{T}})\) equal to some constant, which is easily achieved by setting the scaling parameter \(\kappa \) of the covariance function equal to one. Under these constraints, the model is fully identified. As an alternative, one could constrain the first, or the last, diagonal entry of \(\varvec{\varSigma }_{\varvec{T}}\) to be equal to one.

3.4 Prior Specification

We conduct inference within the Bayesian framework and therefore we need to specify prior distributions for both \(\varvec{A}\) and \(\sigma ^2\). In the latter case, we choose an inverse gamma prior for the residual variance, that is

$$\begin{aligned} \sigma ^{-2} \sim \text {Ga}(a_\sigma , b_\sigma ), \end{aligned}$$
(9)

with \(a_\sigma , b_\sigma > 0\) some fixed hyperparameters. In the former case, we can equivalently deal with the coefficients in \(\varvec{A}\) or with the covariance matrix \(\varvec{\varSigma }_{\varvec{A}}\), since they are in a one-to-one correspondence. We formulate the prior distribution in terms of the coefficients in \(\varvec{A}\): we let its elements \(a_{lk}\), for \(l=1,\ldots ,L\) and \(k=1,\ldots , K\), be independently distributed as follows

$$\begin{aligned} \begin{aligned} a_{lk} \overset{\text {iid}}{\sim } N(0,\gamma ^2),&\qquad l=1,\ldots ,L, \qquad k=1,\ldots ,K, \qquad k < l,\\ a^2_{kk} \overset{\text {ind}}{\sim } \gamma ^2 \chi ^2_{K - k + 1},&\qquad k=1, \ldots ,K, \\ a_{lk} = 0,&\qquad \text {otherwise,} \end{aligned} \end{aligned}$$
(10)

for some variance hyperparameter \(\gamma ^2 > 0\). By employing specification (10), we automatically deal with the identifiability constraints of Sect. 3.3.
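
A draw of \(\varvec{A}\) from the prior in Eq. (10) can be generated as follows; the dimensions and the value of \(\gamma ^2\) are illustrative.

```r
set.seed(3)
L <- 68; K <- 5; gamma2 <- 100     # illustrative values, assuming K < L

## One draw of the loading matrix A from the prior in Eq. (10)
A <- matrix(0, L, K)
for (k in 1:K) {
  A[k, k] <- sqrt(gamma2 * rchisq(1, df = K - k + 1))            # positive diagonal entries
  A[(k + 1):L, k] <- rnorm(L - k, mean = 0, sd = sqrt(gamma2))   # entries below the diagonal
}
## Entries with k > l remain zero, matching the lower-triangular constraint of Sect. 3.3
```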

4 Posterior Inference

Posterior inference cannot be conducted in closed form. We need to turn to simulation-based fitting techniques to obtain samples from the posterior distribution. Generally, we would prefer to work with a marginal specification in order to reduce the dimensionality of the problem as much as possible before doing any computation. The normal-normal conjugacy enables marginalization of Eq. (6) over \(\varvec{\mathcal {Z}}\), leading to the following Gaussian model, no longer having a factorized specification

$$\begin{aligned} (\text {vec}(\varvec{\mathcal {B}}) \mid \varvec{A}, \sigma ^2) \sim \text {N}(0, \varvec{C}), \end{aligned}$$
(11)

where \(n = L \times T\), and \(\varvec{C}= \varvec{\varSigma }_{\varvec{T}} \otimes \varvec{\varSigma }_{\varvec{A}} + \sigma ^2 I_{n \times n}\). The covariance matrix in Eq. (11) is positive definite, and hence invertible, since \(\sigma ^2 > 0\); this allows us to ignore the singularity issues that would arise when considering \(\varvec{\varSigma }=\varvec{\varSigma }_{\varvec{T}} \otimes \varvec{\varSigma }_{\varvec{A}}\) alone. In Appendix A we describe a simple Metropolis-Hastings (m-h) model fitting algorithm with a multivariate Gaussian random walk proposal, which is based on the marginal specification (11) and is sufficient to guarantee satisfactory mixing. For this purpose, it is convenient to parametrize the residual variance \(\sigma ^2\) on the logarithmic scale, i.e., \(\tau = \log {\sigma ^2}\). Computational details concerning the m-h sampler are provided in Sect. 4.1, where we describe how to exploit the separability assumption for fast computations.
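
As a rough illustration of the structure of such a sampler, a generic random-walk m-h skeleton in R is given below; the target log_post is a stand-in standard Gaussian density, whereas in the actual algorithm it would be the marginal log-posterior of Eq. (14), with the scalar proposal standard deviation replaced by a tuned proposal covariance matrix (see Sect. 5).

```r
set.seed(4)
## Random-walk Metropolis-Hastings skeleton for theta = (free elements of A, tau).
## log_post() is a placeholder target; in practice it would evaluate Eq. (14).
log_post <- function(theta) sum(dnorm(theta, log = TRUE))

p       <- 10                       # illustrative dimension of theta
n_iter  <- 5000
prop_sd <- 0.5                      # stand-in for a tuned multivariate proposal covariance
theta   <- rep(0, p)
draws   <- matrix(NA_real_, n_iter, p)
lp_curr <- log_post(theta)

for (s in 1:n_iter) {
  theta_prop <- theta + rnorm(p, sd = prop_sd)   # Gaussian random walk proposal
  lp_prop    <- log_post(theta_prop)
  if (log(runif(1)) < lp_prop - lp_curr) {       # Metropolis accept/reject step
    theta   <- theta_prop
    lp_curr <- lp_prop
  }
  draws[s, ] <- theta
}
```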

Suppose we are able to draw posterior samples for \(\varvec{A}\) and \(\tau \), for instance by using the m-h sampler in Algorithm 1 (in the Appendix). Then, predicting new bold values at a new time and brain region, conditionally on the data, is relatively simple and can be done by means of the so-called kriging equations (e.g., [18], Chap. 2). We remark that kriging the bold values is not of direct interest in the analysis of rs-fMRI data. However, as we will discuss in Sect. 5, this procedure is useful to conduct model assessment in terms of out-of-sample prediction performance. Let \(\varvec{\mathcal {B}}_0\) be the \(L_0 \times T_0\) matrix of unobserved bold values over a new grid of time values of length \(T_0\), for some subset of \(L_0\) brain regions from the original L regions. We are interested in finding the predictive distribution

$$\begin{aligned} p(\varvec{\mathcal {B}}_0 \mid \varvec{\mathcal {B}}) = \int p(\varvec{\mathcal {B}}_0 \mid \varvec{\mathcal {B}}, \varvec{\theta })p(\varvec{\theta } \mid \varvec{\mathcal {B}}) d\varvec{\theta }, \end{aligned}$$
(12)

where we have defined \(\varvec{\theta } = (\text {vec}(\varvec{A}),\tau )\). The conditional distribution \(p(\text {vec}(\varvec{\mathcal {B}}_0) \mid \varvec{\mathcal {B}}, \varvec{\theta })\) is available in closed form, being a multivariate Gaussian distribution

$$\begin{aligned} \begin{aligned}&(\text {vec}(\varvec{\mathcal {B}}_0) \mid \varvec{\mathcal {B}}, \varvec{\theta }) \sim \text {N}\left( \varvec{\mu }_0,\varvec{\varSigma }_0 \right) , \\&\varvec{\mu }_0 = \tilde{\varvec{C}}_0^{\mathsf {T}}\varvec{C}^{-1}\text {vec}(\varvec{\mathcal {B}}),\\&\varvec{\varSigma }_{0} = \varvec{C}_0 - \tilde{\varvec{C}}_0^{\mathsf {T}} \varvec{C}^{-1}\tilde{\varvec{C}}_0, \end{aligned} \end{aligned}$$
(13)

where \(\varvec{C}\) is the covariance matrix of \(\text {vec}(\varvec{\mathcal {B}})\) given the parameters, \(\varvec{C}_0\) is the covariance matrix of \(\text {vec}(\varvec{\mathcal {B}}_0)\), and finally \(\tilde{\varvec{C}}_0\) represents the cross-covariance matrix between \(\text {vec}(\varvec{\mathcal {B}})\) and \(\text {vec}(\varvec{\mathcal {B}}_0)\). Thus, draws from the predictive distribution in (12) can be obtained by composition sampling, by first drawing posterior values for \(\varvec{A}\) and \(\tau \) and then by sampling from the multivariate Gaussian distribution in (13).
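
The conditional moments in Eq. (13) amount to standard Gaussian conditioning, sketched below in R; the matrices C, C0 and C0_tilde are assumed to have been assembled from \(\varvec{\varSigma }_{\varvec{A}}\), \(\varvec{\varSigma }_{\varvec{T}}\) and \(\sigma ^2\) over the observed and new grids, and the toy inputs are arbitrary.

```r
## Conditional mean and covariance of Eq. (13) for a generic Gaussian vector
krige_moments <- function(vecB, C, C0, C0_tilde) {
  Cinv_B <- solve(C, vecB)                               # C^{-1} vec(B)
  mu0    <- crossprod(C0_tilde, Cinv_B)                  # t(C0_tilde) %*% C^{-1} %*% vec(B)
  Sigma0 <- C0 - crossprod(C0_tilde, solve(C, C0_tilde)) # C0 - t(C0_tilde) %*% C^{-1} %*% C0_tilde
  list(mean = drop(mu0), cov = Sigma0)
}

## Toy usage with arbitrary matrices of conformable dimensions
n <- 5; n0 <- 2
C_mat    <- diag(n) + 0.3                  # arbitrary positive definite covariance
C0_mat   <- diag(n0) + 0.3
C0_tilde <- matrix(0.1, n, n0)
krige_moments(rnorm(n), C_mat, C0_mat, C0_tilde)
```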

4.1 Computational Difficulties

Some useful matrix identities can be exploited to reduce the computational burden both for the m-h algorithm and for Eq. (13). We start by inspecting the log-posterior distribution of the marginal model (11) which is equal, up to an additive constant, to the following quantity

$$\begin{aligned} \mathscr {L}(\varvec{A}, \tau ; \varvec{\mathcal {B}}) = -\frac{1}{2}\log {|\varvec{C}|} - \frac{1}{2}\text {vec}(\varvec{\mathcal {B}})^\mathsf {T}\varvec{C}^{-1}\text {vec}(\varvec{\mathcal {B}}) + \log {p(\varvec{A})} + \log {p(\tau )}, \end{aligned}$$
(14)

where \(p(\varvec{A})\) and \(p(\tau )\) denote the probability density functions of the priors for \(\varvec{A}\) and \(\tau = \log {\sigma ^2}\), respectively. The log-posterior is evaluated many times during the mcmc run, and it is therefore crucial to keep computations as fast as possible. A potential computational bottleneck is the inversion of the matrix \(\varvec{C} = \varvec{\varSigma }_{\varvec{T}} \otimes \varvec{\varSigma }_{\varvec{A}} + \sigma ^2 I_{n\times n}\), which in our case is an \(n \times n\) matrix, with \(n = T \times L\). In the separable case, thanks to the properties of the Kronecker product, this issue can be attenuated by exploiting the following decomposition of the inverse of \(\varvec{C}\):

$$\begin{aligned} \varvec{C}^{-1} = (\varvec{U}_{\varvec{T}} \otimes \varvec{U}_{\varvec{A}})(\varvec{\varLambda }_{\varvec{T}} \otimes \varvec{\varLambda }_{\varvec{A}}+\sigma ^2I_{n\times n})^{-1}(\varvec{U}_{\varvec{T}} \otimes \varvec{U}_{\varvec{A}})^\mathsf {T}, \end{aligned}$$
(15)

where \(\varvec{\varSigma }_{\varvec{T}} = \varvec{U}_{\varvec{T}}\varvec{\varLambda }_{\varvec{T}}\varvec{U}_{\varvec{T}}^\mathsf {T}\) and \(\varvec{\varSigma }_{\varvec{A}} = \varvec{U}_{\varvec{A}}\varvec{\varLambda }_{\varvec{A}}\varvec{U}_{\varvec{A}}^\mathsf {T}\) are the spectral decompositions of the two matrices. Detailed calculations leading to (15) are given in Appendix A. These spectral decompositions are relatively cheap in our context. Notice also that the decomposition of \(\varvec{\varSigma }_{\varvec{T}}\) has to be computed only once, since it does not depend on unknown parameters in our formulation. More importantly, the matrix \( \varvec{\varLambda }_{\varvec{T}} \otimes \varvec{\varLambda }_{\varvec{A}}+ \sigma ^2I_{n\times n}\) is diagonal and can therefore be inverted directly.

Decomposition (15) allows easy evaluation of the log-determinant of \(\varvec{C}\), which is given as a simple function of the previously obtained eigenmatrices \(\varvec{\varLambda }_{\varvec{T}}\) and \(\varvec{\varLambda }_{\varvec{A}}\),

$$\begin{aligned} \log {|\varvec{C}|} = \log {| \varvec{\varLambda }_{\varvec{T}} \otimes \varvec{\varLambda }_{\varvec{A}}+\sigma ^2I_{n\times n}|} = \sum _{i=1}^n \log {\left( \lambda _i + \sigma ^2\right) }, \end{aligned}$$

where \(\lambda _i\) is the i-th entry of the diagonal matrix \(\varvec{\varLambda }_{\varvec{T}} \otimes \varvec{\varLambda }_{\varvec{A}}\), for \(i=1,\ldots ,n\). Notice that, as long as \(K \ll L\), the covariance matrix \(\varvec{\varSigma }_{\varvec{A}}\) is not full rank, meaning that some of the eigenvalues \(\lambda _i\) are exactly equal to zero.

Leveraging the decomposition (15) and using some simple properties of the Kronecker product, we can express the quadratic form \(\text {vec}(\varvec{\mathcal {B}})^\mathsf {T} \varvec{C}^{-1} \text {vec}(\varvec{\mathcal {B}})\) in (14) as follows:

$$\begin{aligned} \text {vec}(\varvec{\mathcal {B}})^\mathsf {T} \varvec{C}^{-1} \text {vec}(\varvec{\mathcal {B}}) = \text {vec}\left( \varvec{U}_{\varvec{A}}^\mathsf {T}\varvec{\mathcal {B}}\varvec{U}_{\varvec{T}}\right) ^\mathsf {T} (\varvec{\varLambda }_{\varvec{T}} \otimes \varvec{\varLambda }_{\varvec{A}}+\sigma ^2I_{n\times n})^{-1}\text {vec}\left( \varvec{U}_{\varvec{A}}^\mathsf {T}\varvec{\mathcal {B}}\varvec{U}_{\varvec{T}}\right) . \end{aligned}$$

This drastically reduces the computational burden, since it avoids storing very large \(n \times n\) matrices in memory. With similar reasoning, the kriging Eq. (13) can also be evaluated quite cheaply, adopting fast algorithms for products between matrices involving Kronecker products, implemented for instance in the klin R package [33]. The code used in the paper is made available at the link https://github.com/tommasorigon/StartUpResearch.
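
The following R sketch illustrates the computations of this subsection, namely the evaluation of \(\log {|\varvec{C}|}\) and of the quadratic form through the spectral decompositions in Eq. (15); all inputs are illustrative placeholders.

```r
set.seed(5)
L <- 10; K <- 3; TT <- 50; sigma2 <- 0.5   # illustrative dimensions; TT plays the role of T

## Illustrative Sigma_A (rank K) and Sigma_T (exponential correlation), plus placeholder data
A <- matrix(rnorm(L * K), L, K); A[upper.tri(A)] <- 0
Sigma_A <- tcrossprod(A)                                   # A %*% t(A)
Sigma_T <- outer(1:TT, 1:TT, function(t, s) exp(-0.03 * abs(t - s)))
B <- matrix(rnorm(L * TT), L, TT)

## Spectral decompositions (the one of Sigma_T needs to be computed only once)
eig_T <- eigen(Sigma_T, symmetric = TRUE)
eig_A <- eigen(Sigma_A, symmetric = TRUE)
lambda <- as.vector(outer(eig_A$values, eig_T$values))     # diagonal of Lambda_T %x% Lambda_A

## Log-determinant of C = Sigma_T %x% Sigma_A + sigma2 * I
logdet_C <- sum(log(lambda + sigma2))

## Quadratic form vec(B)' C^{-1} vec(B), without ever forming an n x n matrix
B_tilde <- crossprod(eig_A$vectors, B) %*% eig_T$vectors   # U_A' B U_T, an L x TT matrix
quad    <- sum(as.vector(B_tilde)^2 / (lambda + sigma2))

## Gaussian log-likelihood of Eq. (11), up to an additive constant
loglik <- -0.5 * logdet_C - 0.5 * quad
```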

5 Data Analysis

5.1 Model Checking

We now apply the spatio-temporal model presented in Sect. 3 to the rs-fMRI dataset. However, before proceeding with the interpretation of the results, it is crucial to check the adequacy of the fit, to assess the plausibility of the proposed model (Chap. 6, [34]). We measure the goodness of fit of our model by means of out-of-sample predictions, dic indices, and direct graphical inspection.

In performing Bayesian inference, we employ the priors described in Sect. 3.4, which require the specification of some tuning parameters. The hyperparameter \(\gamma ^2\) controls the prior variability of the coefficients in \(\varvec{A}\). By choosing \(\gamma ^2 = 100\) we incorporate vague prior information into the model. Following a similar rationale, we set the hyperparameters of the residual variance \(\sigma ^2\) to \(a_\sigma = b_\sigma = 1\), which induces a fairly noninformative prior for the residual variance.

As discussed in Sect. 3, the temporal component is controlled by the Gaussian processes in \(\varvec{V}(t)\), which in turn are characterized by their correlation function \(\rho (t,t')\). Depending on the choice of such a function, the latent processes \(\varvec{V}(t)\) could behave quite differently. An extreme example consists in setting \(\rho (t,t') = \mathbbm {1}(t=t')\), with \(\mathbbm {1}(\cdot )\) denoting the indicator function, which would imply that the processes \(\varvec{V}(t)\) are independent over time, so that the model in Eqs. (6) and (7) reduces to a simple Bayesian factor model. Instead, by letting \(\rho (t,t') = \exp {\{ - \psi |t - t'|\}}\), with \(\psi = 3 \times 10^{-2}\), we introduce temporal dependence favoring stationary and fairly regular paths for the latent processes \(\varvec{V}(t)\). In fact, such a correlation function implicitly induces a continuous-time first order autoregressive process for each element of \(\varvec{V}(t)\), with lag-one autocorrelation coefficient equal to \(\exp {\{-3\times 10^{-2}\}} \approx 0.97\).
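
As a small check, the exponential correlation function with \(\psi = 3 \times 10^{-2}\) can be constructed in R as follows; the lag-one entry of the resulting Gram matrix coincides with the autoregressive coefficient \(\exp {\{-\psi \}} \approx 0.97\).

```r
## Exponential correlation over the unit-spaced time grid and its implied lag-one autocorrelation
psi     <- 3e-2
times   <- 1:403
Sigma_T <- outer(times, times, function(t, s) exp(-psi * abs(t - s)))

Sigma_T[1, 2]   # lag-one correlation, approximately 0.970
exp(-psi)       # the same value in closed form
```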

Finally, the number of latent processes K has to be carefully selected, since its choice critically impacts the computational performance. Indeed, the number of parameters grows linearly as a function of K, and therefore overly complex models become harder to fit using the m-h algorithm. More sophisticated and efficient approaches might mitigate this issue, and a brief discussion is given in Sect. 6. We set \(K=5\) mainly because of these practical considerations, but we provide below some empirical evidence suggesting that a model based on this choice is sufficiently flexible to capture the brain connectivity structure of our data.

Table 1: For different values of \(K=1,3,5\), and for subjects \(i=1,2\), the dic index, the total number of parameters and the out-of-sample root mean squared error (rmse) are reported. For each individual, the out-of-sample rmse of a random forest model is also shown. For both subjects, the boldface values indicate the best model according to each index; in all cases, the lower the better

To assess whether our model leads to reasonable inferential conclusions and to discriminate between competing models, we conduct some posterior checks, obtaining measures of out-of-sample accuracy as well as the dic indices [35]. The original dataset is split into two parts: the first, used for estimation, comprises \(75\%\) of the columns of \(\varvec{\mathcal {B}}\), i.e., time instants selected at random, while the remaining \(25\%\) is used as a test set to compute, for instance, the out-of-sample root mean squared error (rmse).
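
A possible R implementation of this split and of the rmse computation is sketched below; the data matrix and the predicted values are placeholders.

```r
set.seed(6)
L <- 68; TT <- 403
B <- matrix(rnorm(L * TT), L, TT)                 # placeholder data matrix

## Randomly select 75% of the time instants (columns) for estimation
train_cols <- sort(sample(TT, size = floor(0.75 * TT)))
B_train <- B[, train_cols]
B_test  <- B[, -train_cols]

## Out-of-sample rmse, given a matrix B_hat of predicted values (a zero placeholder here)
B_hat <- matrix(0, L, ncol(B_test))
rmse  <- sqrt(mean((B_test - B_hat)^2))
```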

Fig. 4: Scatter plots of the bold values for subjects \(i=1,2\), over the time grid \(t=1,\ldots ,403\), for 6 selected brain regions. Three of these regions are located in the left hemisphere (lh-lateraloccipital, lh-lateralorbitofrontal, lh-lingual), while the others are their symmetric counterparts in the right hemisphere (rh-lateraloccipital, rh-lateralorbitofrontal, rh-lingual). The solid lines represent the predicted values obtained by means of the kriging Eq. (13), after plugging in the map estimate

In Table 1 we compare our model with alternatives involving a smaller number of latent processes, i.e., \(K=1\) or \(K=3\), showing that for both subjects we obtain improved accuracy and lower dic indices with \(K=5\). The out-of-sample predictions are obtained by means of the kriging Eq. (13), after plugging in the map estimate. Formally, this is incorrect and may lead to underestimation of the predictive uncertainty. More correctly, we should average the kriged estimates over posterior realizations of the parameters. However, the latter procedure turns out to be computationally too expensive, so we adopt the aforementioned plug-in alternative. Posterior samples for all the competing models and both subjects are obtained using the m-h algorithm, which relies on a Gaussian random walk proposal, each model with its own proposal covariance matrix. These matrices, one for each competing model in Table 1, have been carefully tuned, essentially by trial and error, to ensure good mixing and quick convergence of each mcmc chain. For each mcmc chain we retain 250,000 thinned samples from a chain of 5,000,000 iterations, after a burn-in period of about 100,000 draws. The trace plots show no evidence against convergence and decent mixing.

As shown in Table 1, we further compare our model with a benchmark method for regression, random forests [36], in which the bold response values are fitted as a function of time and brain regions. Although the latter method is specifically designed to provide accurate predictions of response values, our proposal seems to have better out-of-sample performance.

Finally, in Fig. 4 we graphically explore the predictive performance of our model by comparing the original bold values with their predictions. For illustrative purposes we display only a few brain regions, but we remark that the remaining regions present similar patterns. The graphs of Fig. 4 further corroborate the reasonableness of our proposal, which is able to capture the main trends and the differences in variability of the bold values among brain regions.

We remark that, in order to reduce the computational burden, the dic indices of Table 1 and the results in Sect. 5.2 are also based on this \(75\%\) partition of the observations, which we believe represents the whole dataset well.

5.2 Network Analysis

In neurological applications it is common practice to explore functional connectivity networks exploiting graph theoretical approaches. As summarized in [21], the typical pipeline of the analysis of structural and functional brain networks consists of the following steps: the identification of the brain regions of interest, the estimation of a continuous measure of association between regions, the application of a threshold to generate a binary adjacency matrix, and the computation of network indices on the obtained undirected graph. In our case, the regions of interest are those obtained from the Desikan parcellation [16], whereas a continuous measure of association can be obtained from the covariance matrix \(\varvec{\varSigma }_{\varvec{A}}\), appropriately standardized. Following [22], we define an \(L \times L\) binary adjacency matrix \(\varvec{G}\) as the truncation of a correlation matrix, that is

$$\begin{aligned} \left[ \varvec{G}\right] _{ll'} = \mathbbm {1}\left( [\text {Cor}\left( \varvec{\varSigma }_{\varvec{A}} \right) ]_{ll'} > \text {threshold}\right) ,\qquad \text { for } l\ne l', \end{aligned}$$
(16)

and \(\left[ \varvec{G}\right] _{ll} = 0\) for \(l=1,\ldots ,L\), where threshold is a constant between 0 and 1, and \(\text {Cor}\left( \varvec{\varSigma }_{\varvec{A}}\right) \) denotes the correlation matrix obtained by standardizing \(\varvec{\varSigma }_{\varvec{A}}\). The covariance matrix has a direct interpretation, but the precision matrix, i.e., its inverse, might be considered as well. The choice of the threshold is crucial in determining \(\varvec{G}\) but, unfortunately, there are no general guidelines. Indeed, different values of the threshold could yield graphs with different sparsity and network properties. To mitigate this issue, we explored a range of plausible thresholds [21] and we noticed that, in our setting, the inferential conclusions are insensitive to moderate variations of the threshold.

Fig. 5: Posterior distributions (violin plots) of the transitivity indices and the average path lengths, for subjects \(i=1,2\), evaluated on a graph \(\varvec{G}\) with \(\text {threshold}=0.8\)

Given the threshold, the adjacency matrix \(\varvec{G}\) is a random quantity whose posterior distribution can be easily approximated using the output of the mcmc. In particular, it is possible to quantify the uncertainty of any network characteristic one could possibly be interested in. Among several alternatives, a relevant network index is the so-called clustering coefficient, also known as transitivity in the statistical literature, or fraction of transitive triples, which is a measure of global cohesion of the graph \(\varvec{G}\). Another index, which provides a measure of global connectivity of a given graph, is the average path length, defined as the average minimal distance between two brain regions.

We expect these indicators to be negatively correlated in our application: broadly speaking, a high number of transitive triples suggests that two brain regions require a small number of steps to be connected. We refer to [37] for the formal definition of these indices and their theoretical properties. In Fig. 5 we report the posterior distributions of these measures for both subjects. We see substantial differences. In particular, subject 2 presents a much higher functional activity compared to subject 1, in terms of both indices. Understanding the qualitative reasons for such a marked distinction between the two subjects is beyond the aim of this paper. Nonetheless, we remark that our proposal was able to capture the differential traits of the two brains, thus providing a tool for detecting differences in functional connectivity and for quantifying the related uncertainty.
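
For completeness, the computation of the adjacency matrix in Eq. (16) and of the two network indices can be sketched in R as follows, using the igraph package; the matrix \(\varvec{\varSigma }_{\varvec{A}}\) below is a single arbitrary placeholder, whereas in the actual analysis these quantities are computed for each posterior draw.

```r
library(igraph)
set.seed(7)

## A single placeholder draw of Sigma_A; in practice, one per mcmc sample
L <- 68; K <- 5
A <- matrix(rnorm(L * K), L, K)
Sigma_A <- tcrossprod(A)                        # A %*% t(A)

## Adjacency matrix of Eq. (16): threshold the implied correlation matrix
threshold <- 0.8
R <- cov2cor(Sigma_A)                           # standardize Sigma_A to a correlation matrix
G <- (R > threshold) * 1
diag(G) <- 0

## Network indices on the undirected graph
g <- graph_from_adjacency_matrix(G, mode = "undirected")
transitivity(g, type = "global")                # clustering coefficient (fraction of transitive triples)
mean_distance(g)                                # average path length
```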

6 Discussion

In this paper we proposed a spatio-temporal Bayesian factor model for the analysis of rs-fMRI data. For both interpretational and computational reasons, we employed a separable structure. We discussed how to obtain posterior inference using an m-h algorithm, providing also some technical details that can speed up computations. Finally, we applied our model to a real rs-fMRI dataset and presented an illustrative analysis.

Although the model we described is designed for a single-subject analysis, it could be extended to the multi-subject case by adding a further layer to the hierarchical specification of Sect. 3. One possibility is to borrow information across individuals by assuming exchangeable prior distributions for the subject-specific covariance matrices \(\varvec{\varSigma }_{\varvec{A}}^{(i)}\). In particular, if we let the \((\varvec{\varSigma }_{\varvec{A}}^{(i)} \mid \varvec{V})\) be independent and identically distributed \(\text {InverseWishart}(K, \varvec{V})\), we could then induce dependence across subjects by placing a hyperprior distribution on \(\varvec{V}\), which in turn could be interpreted as the baseline covariance structure, common to all individuals. Additionally, in the multi-subject setting it might be possible to explore the effect of individual covariates on functional connectivity, which we did not attempt, having considered only two subjects.

Another possible extension, already mentioned in Sect. 3.2, could be the implementation of a dynamic model. This would lead to a non-separable model, obtained by specifying a factor loading matrix \(\varvec{A}(t)\) that also evolves in time. This issue is examined in depth, in a different applied context, by [25, 26]. To capture the evolution of \(\varvec{A}(t)\), avoiding at the same time naive approaches with poor performance, they use independent Gaussian processes with unit variance as a set of basis functions. Thus, the factor loading matrix \(\varvec{A}(t)\) would itself be a time-varying random function, implying that \(\varvec{\varSigma }_{\varvec{A}}(t) = \varvec{A}(t)\varvec{A}(t)^{\mathsf {T}}\), for any fixed t. As a consequence, the evaluation of the adjacency matrix in Eq. (16) for each correlation matrix would generate a dynamic network.

Generalizing our model beyond separability can be done in several other ways. For instance, one could assume that the latent Gaussian processes in \(\varvec{V}(t)\) are independent but not identically distributed, each characterized by a different correlation function \(\rho _k(t,t')\). This would imply a more sophisticated and non-stationary covariance structure for the mean process \(\varvec{Z}(t)\). Both the above settings are arguably more realistic [31], but unfortunately they do not lead to the simple interpretation which follows from our separable model.

Besides the difficulties in interpretation that could arise from the above generalizations, the main challenge is on the computational side. The algorithm for posterior inference we described in Sect. 4 can be improved in several different directions. For instance, the default prior setting and the parameter expansion strategy of [38] could be adapted to our framework to provide better mixing. In multi-subject scenarios, or whenever the number of brain regions is massive, and therefore mcmc computations are prohibitive, one could attempt deterministic model fitting approximations like variational Bayes. In the context of fMRI data this approach was developed by [9], and it could possibly be adapted to our model.