1 Introduction

Bipolar disorder (BP) is a mental health condition that causes extreme mood swings. Despite decades of research, the pathophysiology of BP is still not well understood. Some of the most common presentations in patients with BP have also been associated with structural or functional brain differences. For example, [7] found that adults with BP had widespread bilateral patterns of reduced cortical thickness in frontal, temporal and parietal regions. Some studies have also shown evidence of reduced functional connectivity within the cortical control networks [1, 20].

Many brain imaging techniques, including functional MRI (fMRI) and structural MRI (sMRI), provide information on different aspects of the brain. However, most models rely on a single data type or do not combine data from different imaging modalities effectively, thus missing potentially important differences that are only partially detected by a single modality [2, 3]. Combining modalities may thus uncover previously hidden relationships that can unify disparate findings in neuroimaging. To the best of our knowledge, no previous work has combined structural and functional connectivity data to analyze BP. We hypothesize that joint information allows a better representation of BP characteristics to be learned, and we validate this hypothesis in our experiments. A main challenge in multimodal data fusion comes from the dissimilarity of the data types being fused and from the interpretation of the results. Traditional multi-modality studies in neuroimaging mainly use principal component analysis (PCA), independent component analysis (ICA), canonical correlation analysis (CCA), and partial least squares (PLS) [17]. However, these models depend intrinsically on the shape and scale of the data distribution, which causes ambiguity in component discovery and makes the results harder to interpret.

The graph-based approach to multi-modality analysis is a powerful technique for characterizing the architecture of human brain networks using graph metrics and has achieved great success in explaining functional abnormality at the network level [16]. However, this family of methods lacks accuracy in prediction tasks due to its model-driven methodology. Graph attention networks (GAT) [19] are neural network architectures that have been successfully applied to problems such as graph embedding and classification. In contrast to CNN-based interpretation of neurological disorders [9], attention mechanisms can handle variable-sized inputs and focus on the most relevant parts of the input when making decisions, which can then be used to interpret the salient input features. Motivated by this, we propose an innovative Edge-weighted Graph Attention Network (EGAT) with Dense Hierarchical Pooling (DHP), where the underlying graphs are constructed from the functional connectivity matrices and the node features consist of both anatomical features and statistics of the nodal connectivity. Our contributions are summarized as follows:

  • We propose a novel multi-modality analysis framework that combines sMRI and fMRI in a graph classification task with workable settings.

  • Our model outperforms existing methods with a 10–20% improvement, showing the necessity of multi-modality data and the attention architecture.

  • We provide an interpretable visualization to understand the co-activation pattern of sMRI and fMRI from their activation maps.

2 Methodology

2.1 Construction of Graphs

On a labeled graph set \(\mathcal {C} = \{ (G_1, y_1), (G_2, y_2), ... \}\), the general graph classification problem is to learn a classifier that maps each \(G_i\) to its label \(y_i\). In practice, \(G_i\) is usually given as a triple \(G=(V,E,X)\), where \(V=\{v_1,\dots ,v_N\}\) is the set of \(N\) nodes, \(E=\{e_{ij}\}_{N\times N}\) is the set of edges with \(e_{ij}\) denoting the edge weight, and \(X\in \mathbb {R}^{N\times F}\) is the matrix of node features.

In our BP vs. healthy control (HC) binary classification setting, the nodes are defined by the regions of interest (ROIs) of a given atlas. For the edges, we use the densely connected graph rather than applying a threshold that discards weak connections. The edge weight is defined as the correlation-induced similarity \(e_{ij} = 1 - \sqrt{(1 - r_{ij})/2}\), where \(r_{ij}\) is the Pearson correlation between the region-averaged BOLD time series of regions \(i\) and \(j\); this maps \(r_{ij} \in [-1, 1]\) to \(e_{ij} \in [0, 1]\). For each node, we construct an 11-dimensional feature vector combining structural and functional MRI. The seven anatomical features are Number of Vertices (NumVert), Surface Area (SurfArea), Gray Matter Volume (GrayVol), Average Thickness (ThickAvg), Thickness Standard Deviation (ThickStd), Integrated Rectified Mean Curvature (MeanCurv) and Integrated Rectified Gaussian Curvature (GausCurv) [6], which provide geometric information about the brain surface. The four functional features are connectivity statistics: the mean, standard deviation, kurtosis and skewness of the node’s connectivity vector to all the other nodes, which summarize the distribution of the region’s connectivity.
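The graph construction above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the connectivity vector is assumed to be the row of edge weights \(e_{ij}\) (the text does not say whether raw correlations are used instead), and kurtosis is assumed to be excess kurtosis computed from population moments.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length time series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def edge_weight(r):
    """Correlation-induced similarity e = 1 - sqrt((1 - r) / 2)."""
    r = max(-1.0, min(1.0, r))  # guard against rounding slightly outside [-1, 1]
    return 1.0 - math.sqrt((1.0 - r) / 2.0)

def connectivity_stats(values):
    """Return (mean, std, kurtosis, skewness) of a connectivity vector.

    Assumption: population moments and excess kurtosis (normal -> 0).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0.0:  # constant vector: higher moments are undefined, return zeros
        return mean, 0.0, 0.0, 0.0
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return mean, math.sqrt(m2), m4 / m2 ** 2 - 3.0, m3 / m2 ** 1.5

def build_graph(time_series):
    """Dense edge-weight matrix and 4 functional features per region."""
    n = len(time_series)
    E = [[1.0 if i == j else edge_weight(pearson(time_series[i], time_series[j]))
          for j in range(n)] for i in range(n)]
    # Statistics of each node's connectivity to all the *other* nodes.
    feats = [connectivity_stats([E[i][j] for j in range(n) if j != i])
             for i in range(n)]
    return E, feats
```

In a full pipeline, the four functional values per node would be concatenated with the seven anatomical values to form the 11-dimensional node feature vector.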

Fig. 1. Schemata of EGAT-DHP classification network

2.2 Graph Neural Network (GNN) Classifier

The architecture of our proposed GNN is shown in Fig. 1. Each graph \(G\) is first fed to a 5-head EGAT layer, followed by two pooling layers that coarsen the 129 nodes to 32 or 16 and then to 4 for graph feature embedding. The extracted features are then fed to two fully connected layers for classification.

Edge-Weighted Graph Attention Layer (EGAT). The Graph Attention Layer takes a set of node features \({X} = \{{\varvec{x}}_1, {\varvec{x}}_2... {\varvec{x}}_N \}\), \({\varvec{x}}_i \in \mathbb {R}^F\) as input, and maps them to \(\varvec{Z} = \{{\varvec{z}}_1, {\varvec{z}}_2... {\varvec{z}}_N \}\), \({\varvec{z}}_i \in \mathbb {R}^{F'}\). The idea is to compute an embedded representation of each node \(v_i \in {V}\) by aggregating its 1-hop neighborhood nodes \(\{ {\varvec{x}}_j, \forall j \in \mathcal {N}({\varvec{x}}_i) \}\) following a self-attention mechanism Att: \(\mathbb {R}^{F'} \times \mathbb {R}^{F'} \rightarrow \mathbb {R}\) [19]. Unlike the original GAT [19], we leverage the edge weights of the underlying graph. For \(P\) attention heads, the modified attention map \(\alpha \in \mathbb {R}^{N \times N \times P}\) can be expressed as a single feed-forward layer of \({\varvec{x}}_{i}\) and \({\varvec{x}}_{j}\) with edge weight \(e_{ij}\):

$$\begin{aligned} \alpha ^p_{ij} = \texttt {Att}(W^p {\varvec{x}}_i,W^p {\varvec{x}}_{j}) = \texttt {LeakyReLU}(({\varvec{a}}^p)^T[W^p {\varvec{x}}_i \Vert W^p {\varvec{x}}_{j}]) e_{ij}, \end{aligned} \tag{1}$$

where \(\alpha ^p\) is the attention weight for the p-th attention head and \(\alpha ^p_{ij}\) indicates the importance of node j’s features to node i in head p. This allows every node to attend to all the other nodes of the graph based on their node features, weighted by the underlying connectivity. \({W}^p \in \mathbb {R}^{F^{\prime } \times F}\) is a learnable linear transformation that maps each node’s feature vector from dimension F to the embedded dimension \(F'\). For each of the P attention heads, the attention mechanism Att is implemented by a learnable nodal attribute vector \({\varvec{a}}^p \in \mathbb {R}^{2F'}\) and a LeakyReLU with negative slope 0.2. The aggregation operation is then defined as \({\varvec{z}}_i = \big \vert \big \vert _{p=1}^{P} \sum _{j \in \mathcal {N}(x_i)} \alpha ^{p}_{ij} {W}^{p} {\varvec{x}}_{j}\), where \(\Vert \) denotes concatenation.
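As a concrete reading of Eq. (1) and the aggregation step, the following plain-Python sketch computes one EGAT layer on small lists. It is illustrative only: the function names are hypothetical, the neighborhood is assumed to be every node of the dense graph (including the node itself), and, following Eq. (1) literally, no softmax normalization is applied to the attention scores.

```python
def leaky_relu(v, slope=0.2):
    """LeakyReLU with negative slope 0.2, as stated in the text."""
    return v if v > 0 else slope * v

def matvec(W, x):
    """Multiply an F' x F matrix (list of rows) by a length-F vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def egat_head(X, E, W, a):
    """One attention head.

    X: N node feature vectors (length F), E: N x N edge weights,
    W: F' x F projection, a: attention vector of length 2F'.
    Returns (alpha, Z) with alpha N x N and Z N x F'.
    """
    H = [matvec(W, x) for x in X]  # projected features W x_i
    n = len(X)
    alpha = [[leaky_relu(sum(ak * hk for ak, hk in zip(a, H[i] + H[j]))) * E[i][j]
              for j in range(n)] for i in range(n)]
    f_out = len(H[0])
    Z = [[sum(alpha[i][j] * H[j][k] for j in range(n)) for k in range(f_out)]
         for i in range(n)]
    return alpha, Z

def egat_layer(X, E, Ws, As):
    """Concatenate the outputs of all heads for every node."""
    heads = [egat_head(X, E, W, a)[1] for W, a in zip(Ws, As)]
    return [sum((h[i] for h in heads), []) for i in range(len(X))]
```

With P heads the per-node output of this sketch has length \(P F'\), one block of \(F'\) values per head.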

Dense Hierarchical Pooling (DHP). To aggregate information across nodes for graph-level classification, we incorporate Dense Hierarchical Pooling (DHP) [21] to reduce the number of nodes passed to the next layer. At the last level, the graph is reduced to a few nodes and their features are flattened into a single vector, which is then passed to an MLP to generate the graph label. The pooling is performed by an assignment matrix \(\varvec{S} \in \mathbb {R}^{N \times N'}\) that coarsens both the node and edge information to a graph of \(N'\) nodes: \({\varvec{z}}_{out} = \varvec{S}^T {\varvec{z}}_{in}, \varvec{E}_{out} = \varvec{S}^T \varvec{E}_{in} \varvec{S}\). The assignment \(\varvec{S}\) is learned through another EGAT layer with the regularization loss \(L_{reg} = \Vert \varvec{E} - \varvec{SS}^T \Vert _F\), where \(\Vert \cdot \Vert _F\) denotes the Frobenius norm.
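The pooling step can be illustrated with a small plain-Python sketch. The helper names are hypothetical, the example uses a hard (0/1) assignment purely for readability, and the regularizer is read as the Frobenius norm of the difference between \(\varvec{E}\) and \(\varvec{SS}^T\), which is an assumption about the intended meaning of \(L_{reg}\).

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def dense_pool(S, Z, E):
    """Coarsen node features and edges: Z_out = S^T Z, E_out = S^T E S."""
    St = transpose(S)
    return matmul(St, Z), matmul(matmul(St, E), S)

def reg_loss(E, S):
    """Frobenius norm of E - S S^T (assumed form of the regularizer)."""
    SSt = matmul(S, transpose(S))
    return math.sqrt(sum((e - s) ** 2
                         for row_e, row_s in zip(E, SSt)
                         for e, s in zip(row_e, row_s)))

# Example: 3 nodes pooled into 2 clusters (nodes 0 and 1 share a cluster).
S = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
Z = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
E = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
Z_out, E_out = dense_pool(S, Z, E)
```

In the network described here the same operation would be applied twice, first from 129 nodes to 32 (or 16) and then to 4, with a learned soft assignment in place of the hard one used in this example.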

Neurological Motivation of Network Design. Compared to GCNs [8] with spectral convolution, our proposed GNN architecture allows for a better description of the local integration of node features, which is more biologically consistent with findings on the community structure of brain networks [12]. Secondly, the efficiency of hierarchical pooling relies on the implicit assumption that the underlying graph possesses the inferred structure. Thus, considering the typical numbers of communities discovered in previous literature [14] and the fact that the brain consists of four lobes, we add two pooling layers to our network, where the first pools the node set into 16 or 32 clusters and the second pools it into 4 clusters. In addition, considering the heterogeneity of brain networks in local signal processing, multiple heads are employed in the first EGAT convolution layer.

2.3 Interpretation from Attention Map

Characterizing BP from anatomical MRI and task-fMRI and interpreting the brain features captured by the proposed model can help neuroscientists better understand BP. The attention map \(\alpha \) in the EGAT layer learns the salient functional connectivity of the cerebral cortex for identifying BP by stacking layers in which nodes are able to attend over their neighborhoods’ features. By exploring the gradient sensitivity \({\varvec{s}}_{ij}^p = \frac{\partial (({\varvec{a}}^p)^T[W^p {\varvec{x}}_i \Vert W^p {\varvec{x}}_{j}])}{\partial {[{\varvec{x}}_i,{\varvec{x}}_j}]} \in \mathbb {R}^{F\times 2} \), we can disentangle the relationship among node features (from different modalities) in identifying BP by examining their co-activation.
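As a worked step, the sensitivity can be written in closed form because the pre-activation score is linear in the node features. The split \({\varvec{a}}^p = [{\varvec{a}}^p_1 \Vert {\varvec{a}}^p_2]\) with \({\varvec{a}}^p_1, {\varvec{a}}^p_2 \in \mathbb {R}^{F'}\) is notation introduced here for illustration only and does not appear in the original formulation:

$$\begin{aligned} ({\varvec{a}}^p)^T[W^p {\varvec{x}}_i \Vert W^p {\varvec{x}}_{j}] &= ({\varvec{a}}^p_1)^T W^p {\varvec{x}}_i + ({\varvec{a}}^p_2)^T W^p {\varvec{x}}_j, \\ {\varvec{s}}_{ij}^p &= \big [ (W^p)^T {\varvec{a}}^p_1 ,\; (W^p)^T {\varvec{a}}^p_2 \big ] \in \mathbb {R}^{F\times 2}. \end{aligned}$$

Under this reading, the first column holds the sensitivity to each of the F features of the source node i and the second column that of the target node j, so anatomical and functional entries can be compared row by row. The expression depends only on \(W^p\) and \({\varvec{a}}^p\), not on the inputs themselves.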

3 Experiment and Results

3.1 Image Acquisition and Processing

Data for this study consisted of 106 subjects (59 patients, 47 healthy controls). Each subject had 2 paired scans over 6 months (212 pairs in total); each pair consisted of one structural T1 MR scan (sMRI, dimension \(192\times 256\times 256\), voxel size \(1\times 1\times 1\,\mathrm{mm}^3\), FOV \(=192\) mm) and one functional MR scan (BOLD, dimension \(64\times 64\times 30\times 244\), voxel size \(4\times 4\times 5\,\mathrm{mm}^3\), FOV \(=256\) mm, TR = 3 s), acquired on a GE 3-T scanner. During the fMRI scans, subjects performed an “N-back” task in a block design (30 s/block, 11 blocks in total). After removing high-motion data (\({\ge }0.2\) relative mean), 150 sMRI and fMRI pairs remained (half BP pairs and half HC pairs). The data were split by subject into 5 folds for cross-validation (\(80\%\) training and \(20\%\) validation in each split).
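Splitting by subject matters because each subject contributes up to two scan pairs, and both must fall on the same side of a split to avoid leakage. A minimal sketch of such a split is given below; the function name, the random seed and the round-robin assignment are illustrative assumptions, not the procedure actually used in the study.

```python
import random

def subject_folds(pair_subjects, k=5, seed=0):
    """Split scan pairs into k folds so that no subject spans train and validation.

    pair_subjects[i] is the subject ID of scan pair i.
    Returns a list of (train_indices, val_indices) tuples, one per fold.
    """
    subjects = sorted(set(pair_subjects))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    # Round-robin assignment of shuffled subjects to folds.
    fold_of = {s: idx % k for idx, s in enumerate(subjects)}
    folds = []
    for f in range(k):
        val = [i for i, s in enumerate(pair_subjects) if fold_of[s] == f]
        train = [i for i, s in enumerate(pair_subjects) if fold_of[s] != f]
        folds.append((train, val))
    return folds

# Example: 10 hypothetical subjects with 2 scan pairs each.
pair_subjects = [s for s in range(10) for _ in range(2)]
folds = subject_folds(pair_subjects)
```

With equal numbers of pairs per subject this yields the 80%/20% proportion in every split; after motion exclusion the proportions would only be approximate.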

We preprocessed the sMRI data and extracted anatomical statistics with FreeSurfer. The fMRI data were preprocessed using the FEAT pipeline of FSL, including motion correction, spatial smoothing (FWHM 5 mm), and registration to standard MNI space. A 0.01 Hz high-pass filter was applied. We extracted regional mean BOLD time series for the \(N=129\) regions of the Lausanne atlas [4] and calculated the edge weights, connectivity matrices and functional features as described in Sect. 2.1. The functional connectivity matrices were then used as the underlying graphs for EGAT. We also normalized each node feature separately by z-scoring, considering the heterogeneity of the different measurements.
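The per-feature normalization can be sketched as follows. The text does not state over which samples the mean and standard deviation are taken, so this example simply z-scores each column of a samples-by-features table using the population standard deviation; both choices are assumptions, and the function name is hypothetical.

```python
import math

def zscore_columns(X):
    """Z-score every feature (column) of X independently.

    X is a list of rows, one row per sample and one column per feature.
    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
            for c, m in zip(cols, means)]
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, stds)] for row in X]

# Two features on very different scales end up comparable after scaling.
Z = zscore_columns([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
```

This puts measurements such as surface area and curvature, which differ by orders of magnitude, on a common scale before they enter the attention layer.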

3.2 BP vs. Healthy Control Classification

The experiment was run on 8 GTX Titan Xp GPUs (batch size = 8) with the Adam optimizer (learning rate = 1e−4, betas = (0.9, 0.999)). We investigated the effect of tuning the number of kernels of EGAT and report the performance on the validation sets of all the splits (see Table 1, rows 1–4). The best performance was achieved when the first pooling layer output 32 communities and the fully connected layer consisted of 32 nodes. The accuracy varied only slightly when we changed the community size to 16 or changed the number of nodes in the FC layer.

To illustrate the importance of integrating multi-modality data, we compared against the performance of single-modality models (see Table 1, rows 9–10). First, to show the necessity of including anatomical features, we replaced the anatomical features with dummy variables (namely fMRI only) and performed the task with the same architecture as EGAT. The performance decreased, suggesting that the anatomical features provide additional information. To test the necessity of functional connectivity, we used a 2-layer MLP to classify the two groups based on the vectorized anatomical features of all regions (namely sMRI only). The decreased performance shows the advantage of including functional data in our proposed model.

To show that our model better embeds both structural and functional features, we compared it with Random Forest, SVM with a linear kernel and GraphSAGE, whose best parameters were chosen by grid search (see Table 1, rows 5–8). Our EGAT outperformed the three alternative models. The improvement may come from two sources. First, due to the intrinsic complexity of sMRI and fMRI, complex models with more parameters are desirable, which also explains why the MLP performed better than the other two. Second, our model exploits the community structure of the brain network and thus potentially models local integration more effectively.

Table 1. Classification performance of different models (mean(std)\(\%\))

3.3 Biomarkers Discovery from Structural and Functional Features

One obstacle to applying complex models in diagnosis is the lack of interpretability. Here we use the activation maps and gradient sensitivity to show that our method can provide interpretable visualization of effective features at both the group and individual levels, in addition to the better prediction accuracy shown above. First, panel (a) of Fig. 2 shows the reordered attention maps averaged over all subjects. The chord diagrams display the location and weight of the edge attention. We assigned colors to the different brain regions and labeled their names at the bottom of panel (a). Second, panel (b) of Fig. 2 presents the gradient sensitivity of the different node features. The gradient sensitivity on the node features displayed two modes, one with weights of opposite signs on the source and target nodes and the other with weights of the same sign. The activation patterns are spatially selective, suggesting that the abnormality of biomarkers occurs heterogeneously across the brain network, except for Attention 4, which quantifies the overall effect.

Attention 1 and 3 placed strong weight on the connectivity statistics in the node features, with opposite modes. This indicates that these two attention heads emphasize the heterogeneity of functional connectivity in two modes: mean of variance for Attention 3 and variance of variance for Attention 1. Combined with the spatial preference for the default mode network (DMN), fronto-parietal (FP) and cingulo-opercular (CO) networks, this supports the previous finding of increased regional homogeneity in BP patients [11] and suggests potential sub-types of this deficit. While the focus on the DMN in Attention 1 suggests that the integration and segregation of the DMN could play a central role in psychiatry [13], the strong co-activation of connectivity and anatomical measurements suggests that the abnormality of the DMN, FP and CO in functional networks could be associated with deficits in anatomical properties [10, 18].

Fig. 2. (a) Activation maps and (b) node feature gradient sensitivity of the five attention heads

For Attention 2 and Attention 5, the highest node weight was on the Gaussian curvature, with signs that complemented each other. Gray matter volume and thickness were also emphasized in these two attention heads. While previous literature found widespread gray matter deficits [10, 18] but no atrophy in the white matter, our results suggest that white matter abnormality might be better represented by the curvature information [5]. In addition, the spatial highlight on the cingulo-opercular (CO) network besides the DMN supports the hypothesis that a deficit of CO integrity could be a reason for the deficit of cognition [15].

4 Conclusion

In this work, we proposed a novel graph-attention-based method for cerebral cortex analysis that integrates sMRI and fMRI using a GNN to classify BP vs. HC. It helps to identify the unique and shared variance associated with each imaging modality that underlies cognitive functioning in HC and impairment in BP. Our model outperforms alternative graph learning and machine learning classification models. By investigating the attention mechanism, we show that the proposed method not only provides spatial information supporting previous findings in network-based analyses but also suggests potential associations between anatomical deficits and abnormality of the functional network. This method can be generalized to other multi-modality learning problems in neuroimaging.