1 Introduction

Autism spectrum disorder (ASD) affects the structure and function of the brain. To better target the underlying roots of ASD for diagnosis and treatment, efforts to identify reliable biomarkers are growing [1]. Deep learning models have been applied to fMRI analysis [2], which is used to characterize the brain changes that occur in ASD [3]. However, how different brain regions coordinate in the decisions of a deep neural network (DNN) classifier has not been previously explored. When features are not independent, Shapley value explanation (SVE) is a useful tool for studying each feature’s contribution [4,5,6]. These methods are based on fundamental concepts from cooperative game theory [7], in which the total surplus generated by the coalition of all players is assigned a unique distribution among the players. However, when the dimension of the interacting features is high, SVE becomes computationally expensive (exponential time complexity).

The innovations of this study include: (1) we apply SVE to analyze the prediction power of interacting features; (2) our proposed method does not require retraining the classifier; (3) to handle the high-dimensional inputs of the DNN classifier, we propose two methods that reduce the dimension of the SVE testing features once the underlying graph structure of the features is defined; and (4) unlike kernel SHAP [4], our proposed methods act as a model interpreter without requiring a model approximation. In Sect. 2, we introduce the background on cooperative game theory. In Sect. 3, we propose two approaches to approximate the Shapley value and show that the approximation is exact under certain assumptions. Three experiments in Sect. 4 demonstrate the feasibility and advantages of the proposed methods.

2 Background on Cooperative Game Theory

2.1 Shapley Value

Our approach to analyzing the contributions of individual nodes to the overall network is the assignment of Shapley values. The Shapley value is a means of fairly apportioning the collective profit attained by a coalition of players, based on the relative contributions of the players in some game. Let \(\mathcal {N}=\{1,2,\dots ,N\}\) be the set of all the players, \(S \subset \mathcal {N}\) be a subset of players forming a coalition within this game, and \(v : 2^ \mathcal {N}\rightarrow {\mathbb {R}}\) be the function that assigns a real-valued profit to the subset S of players. By definition, for any v, \(v(\varnothing ) = 0\), where \(\varnothing \) is the empty set. A Shapley value is assigned by a Shapley function \(\varPhi : \mathcal {N} \rightarrow {\mathbb {R}}\), which associates each player in \(\mathcal {N}\) with a real number and is uniquely defined by the following axioms [7]: 1. Efficiency; 2. Symmetry; 3. Dummy; and 4. Additivity. In our context, we are interested in the brain regions that discriminate ASD from control subjects. The classification prediction score is the total value to be distributed, and each brain region is a player that is assigned a unique reward (i.e., importance score) according to its contribution to the classifier.

2.2 Challenges of Using Shapley Value

While Shapley values give a more accurate interpretation of the importance of each player in a coalition, their calculation is expensive: for N features (i.e., players in the game), the computational complexity is \(O(2^N)\), which is especially costly if the model is slow to evaluate. We address this computational challenge by exploiting the graph structure of the data. Consider the case where the underlying graph structure of the data is sparsely connected, e.g., a sparse brain functional network. Under this observation, we propose two approaches (Fig. 1) to simplify the Shapley value calculation. Method I considers only the centralized coalitions of each player, reducing the number of permutations by assigning weight 0 to features that rarely collaborate. Method II first applies community detection to the feature connectivity network to cluster similar features (forming different games and teams), and then assigns each feature’s contribution by SVE within the selected communities.

Fig. 1. (a) Toy visualization of the graph structure of the input data. When estimating the contribution of feature r (yellow), (b) C-SVE considers r’s directly connected neighbors (red) and (c) H-SVE considers the community (red) to which r belongs. (Color figure online)

3 Methods

In classification tasks, only certain features in a given input provide evidence for the classification decision. For a given prediction, the classifier assigns a relevance value to each input feature with respect to a class label \(Y\in \mathcal {C}\). The probability of class Y for input \(\varvec{X}= ( X_1, X_2, \dots , X_N) \) is given by the predictive score of the DNN model \(f: \mathcal {D}\rightarrow \mathbb {R}^{|\mathcal {C}|}\) where \(\mathcal {D}\) is the domain for input X and each component of the output of f represents the conditional probability of assigning a class label, i.e. \( p(Y|\varvec{X})\).

The basic idea of prediction difference analysis [8] is that the relevance of a feature \(x_i\) can be estimated by measuring how the prediction changes when the feature is unknown. Here we extend this setting by considering the interaction of a set of features instead of examining the features one by one. Denote by \(\varvec{X}_{\mathcal {N}\setminus S}\) the image corrupted at a feature set \(S \subseteq \mathcal {N}\) (i.e., only the features in \(\mathcal {N}\setminus S\) remain intact). To calculate \(p(Y|\varvec{X}_{\mathcal {N}\setminus S})\), following [8], we marginalize out the corrupted feature set S:

$$\begin{aligned} p(Y|\varvec{X}_{\mathcal {N}\setminus S}) = \mathbb {E}_{\varvec{X}_{S}\sim p(\varvec{X}_{S}|\varvec{X}_{\mathcal {N}\setminus S})}p(Y|\varvec{X}_{\mathcal {N}\setminus S},\varvec{X}_{S}). \end{aligned}$$
(1)

Denote by \(v_{\varvec{X}}\) the importance score evaluation function for input \(\varvec{X}\). The prediction power of the rth feature is the weighted sum of all possible marginal contributions:

$$\begin{aligned} \varPhi _r(v_{\varvec{X}}) = \frac{1}{ |\mathcal {N}|}\sum _{S \subseteq \mathcal {N}\setminus \{r\}} \left( \begin{array}{c} |\mathcal {N}|- 1 \\ |S| \end{array} \right) ^ {-1} (v_{\varvec{X}}(S\cup \{r\})-v_{\varvec{X}}(S)). \end{aligned}$$
(2)

Similar to [5], we introduce the importance score of a feature set S as

$$\begin{aligned} v_{\varvec{X}}(S) = \log p(Y|\varvec{X}) - \log p(Y|\varvec{X}_{\mathcal {N}\setminus S}), \end{aligned}$$

(3)

where the second term, \(-\log p(Y|\varvec{X}_{\mathcal {N}\setminus S})\), can be interpreted as the expected number of bits required to encode the output of the model based on the input \(\varvec{X}_{\mathcal {N}\setminus S}\); the first term is a constant offset ensuring \(v_{\varvec{X}}(\varnothing ) = 0\).

Theorem 1

\(\langle \mathcal {N} = \{1,2,\dots ,N\},v_{\varvec{X}}\rangle \) is a cooperative form game and \(\varPhi (v_{\varvec{X}}) = (\varPhi _1, \varPhi _2,\dots ,\varPhi _N)\) corresponds to the game’s Shapley value.

The proof can be borrowed directly from [6], showing that the game has a unique solution which satisfies Axioms 1–4.

An illustrative example is the Boolean logic expression \(\text {OR}\): for \(\mathcal {N}=\{1,2\}\) and \(\mathcal {D} =\{0,1\} \times \{0,1\}\), \(\text {OR}((x_1,x_2))=1\) when \(x_1\) or \(x_2\) is one and zero otherwise. Suppose \(p(Y=1|\varvec{X})=\text {OR}(\varvec{X})\) and the base of the logarithm is 2. We aim to find the contributions of the features to predicting 1 given input \(\varvec{X}=(1,1)\). If both values of \(\varvec{X}\) are unknown, one predicts (under a uniform prior over inputs) that the probability of the result being 1 is \(\frac{3}{4}\). We have \(v_{\varvec{X}}(\varnothing )= 0 - \log \frac{1}{1} = 0\), \(v_{\varvec{X}}(\{1\}) = v_{\varvec{X}}(\{2\}) = 0 - \log \frac{1}{1} = 0\), and total value \(v_{\varvec{X}}(\{1,2\}) = 0 - (-\log \frac{4}{3}) = \log \frac{4}{3}\). Therefore the contributions of the features are \(\varPhi _1 = \frac{1}{2}[(v_{\varvec{X}}(\{1\}) - v_{\varvec{X}}(\varnothing )) + (v_{\varvec{X}}(\{1,2\})- v_{\varvec{X}}(\{2\}))]= \frac{1}{2}[(0-0)+(\log \frac{4}{3}-0)]=\frac{1}{2}\log \frac{4}{3}\) and, by symmetry, \(\varPhi _2 = \frac{1}{2}[(v_{\varvec{X}}(\{2\}) - v_{\varvec{X}}(\varnothing )) + (v_{\varvec{X}}(\{1,2\})- v_{\varvec{X}}(\{1\}))]=\frac{1}{2}\log \frac{4}{3}\). The resulting contributions reveal that both features contribute the same amount toward predicting 1 given input (1, 1). In addition, we can infer a coalition between the two players, since \(v_{\varvec{X}}(\{1,2\}) > v_{\varvec{X}}(\{1\}) + v_{\varvec{X}}(\{2\})\). By contrast, ignoring only a single feature at a time yields the myopic conclusion that both features are unimportant, because given either feature \(X_i = 1\) alone, the prediction \(p = 1\) is already certain.
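The arithmetic above can be verified with a short brute-force computation. The following sketch (standard-library Python; the names `exact_shapley` and `v_or` are ours, purely for illustration) enumerates every coalition per Eq. (2) and uses the value function \(v_{\varvec{X}}(S) = \log _2 p(Y|\varvec{X}) - \log _2 p(Y|\varvec{X}_{\mathcal {N}\setminus S})\) from Eq. (3):

```python
import math
from itertools import combinations, product

def exact_shapley(players, v):
    """Exact Shapley values via Eq. (2): for each player r, average the
    marginal contribution v(S | {r}) - v(S) over all coalitions S that
    exclude r, using the combinatorial weights of Eq. (2)."""
    n = len(players)
    phi = {}
    for r in players:
        others = [p for p in players if p != r]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = 1.0 / (n * math.comb(n - 1, k))
                total += w * (v(set(S) | {r}) - v(set(S)))
        phi[r] = total
    return phi

def v_or(S):
    """Importance score of Eq. (3) for the OR game at input X = (1, 1):
    corrupted features (those in S) are marginalized over a uniform
    prior on {0, 1}; kept features stay at their observed value 1."""
    S = set(S)
    outcomes = []
    for x1, x2 in product([0, 1], repeat=2):
        a = x1 if 1 in S else 1
        b = x2 if 2 in S else 1
        outcomes.append(1.0 if (a or b) else 0.0)
    p_corrupted = sum(outcomes) / len(outcomes)
    return math.log2(1.0) - math.log2(p_corrupted)

print(exact_shapley([1, 2], v_or))
# {1: 0.2075..., 2: 0.2075...}, i.e. (1/2) * log2(4/3) for each feature
```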

From the underlying structure of the data, we have prior knowledge that some features are barely connected; in other words, there is very likely no coalition between these features. We define a graph \(\mathcal {G = (V,E)}\) with nodes \(\mathcal {V}\) and edges \(\mathcal {E}\). Given an adjacency matrix \(\mathcal {A} = (a_{ij})\) of the undirected graph \(\mathcal {G}\) (for example, the Pearson correlation of the mean time series of brain regions), we use a threshold th to binarize \(a_{ij}\), i.e. \(a_{ij}^b=1\) when \(a_{ij}>th\) and zero otherwise, resulting in a sparsely connected graph.
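A minimal sketch of this graph construction, assuming ROI-mean time series are available as a NumPy array (all names, shapes, and the placeholder data are illustrative):

```python
import numpy as np

def binarize_adjacency(A, th):
    """Binarize a weighted adjacency matrix: a_ij^b = 1 if a_ij > th,
    else 0; the diagonal is zeroed so a node is not its own neighbor."""
    Ab = (A > th).astype(int)
    np.fill_diagonal(Ab, 0)
    return Ab

# Hypothetical usage: `ts` holds N ROI-mean time series of length T.
ts = np.random.randn(116, 150)                # placeholder data
A = np.corrcoef(ts)                           # Pearson correlation graph
th = A[~np.eye(len(A), dtype=bool)].mean()    # e.g., the mean edge weight
Ab = binarize_adjacency(A, th)
```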

3.1 Method I: Centralized Shapley Value Explanation (C-SVE)

For a given feature r, its 1-step connected neighborhood is defined by the set \(\mathcal {N}_r = \{j \in \mathcal {V} : a^b_{rj} = 1\} \cup \{r\}\). As an approximation, we propose Centralized Shapley Value Explanation (C-SVE), which only calculates the marginal contributions of a feature when it collaborates with its neighbors.

Definition 1

Given classifier f and sample \(\varvec{X}\), the C-SVE assigns the prediction power of feature r by

$$\begin{aligned} \hat{\varPhi }^C_r(v_{\varvec{X}}) = \frac{1}{|\mathcal {N}_r|}\sum _{S \subseteq \mathcal {N}_r\setminus \{r\}} \left( \begin{array}{c} |\mathcal {N}_r| - 1 \\ |S| \end{array} \right) ^ {-1} (v_{\varvec{X}}(S\cup \{r\})-v_{\varvec{X}}(S)). \end{aligned}$$
(4)

The coefficient in front of each marginal contribution is a reweighting of the original SVE form (Eq. (2)): instead of assigning each permutation the same weight, coalitions containing features outside the neighborhood are assigned weight 0. In practice, we can reject the non-coalition permutations and average the Shapley contributions of the remaining terms.
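For concreteness, a sketch of the exact C-SVE computation of Eq. (4) might look as follows; here `v` is an abstract stand-in for the importance score \(v_{\varvec{X}}\) of Eq. (3), which in practice wraps the trained classifier and the marginalization of Eq. (1):

```python
import math
from itertools import combinations

def c_sve(r, N_r, v):
    """Centralized SVE of Eq. (4): the Shapley-style weighted average of
    feature r's marginal contributions, restricted to coalitions drawn
    from its 1-step neighborhood N_r (which contains r itself)."""
    others = [i for i in N_r if i != r]
    n = len(others) + 1                       # |N_r|
    phi = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            w = 1.0 / (n * math.comb(n - 1, k))
            phi += w * (v(set(S) | {r}) - v(set(S)))
    return phi
```

Exact enumeration costs \(2^{|\mathcal {N}_r|}\) evaluations of v, which is why the MC approximation of Sect. 3.3 is needed for densely connected nodes.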

Theorem 2

We have \(\hat{\varPhi }^C_r(v_{\varvec{X}}) = \varPhi _r(v_{\varvec{X}})\) almost surely if \(X_r \perp \varvec{X}_{\mathcal {N}\setminus \mathcal {N}_r} | \varvec{X}_U\) and \(X_r \perp \varvec{X}_{\mathcal {N}\setminus \mathcal {N}_r} | \varvec{X}_{U}, Y\) for any \(U \subset \mathcal {N}_r \setminus \{r\}\).

The proof is shown in Appendix A. It is important to verify that the proposed approximation is a good one. We can easily check the necessary condition that, for \(k \notin \mathcal {N}_r\), the cosine of the angle between the average time series \(\bar{X}_r\) in ROI r and \(\bar{X}_k\) in ROI k satisfies \(\cos (\bar{X}_r,\bar{X}_k)< \epsilon \), which corresponds to a small edge weight \((\sim 0)\) in the graph that we created using Pearson correlation.

3.2 Method II: Hierarchical Shapley Value Explanation (H-SVE)

In Method II, we approximate the Shapley value by a hierarchical approach: (1) detect communities in the graph, then (2) apply SVE within each community individually.

Modularity-Based Community Detection. We use the same undirected graph defined in Method I, but apply the greedy modularity method [9] to divide all the features into non-overlapping communities. The whole feature set can then be expressed as a union of non-overlapping communities \(\mathcal {N} = A_1 \bigcup A_2 \bigcup \dots \bigcup A_M\), where the features in one community cooperate only within their group and hence are independent of those in different communities. Therefore we can define a different game in each community while keeping the Shapley values comparable within and across communities.
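A minimal sketch of this clustering step, assuming the networkx library (whose `greedy_modularity_communities` implements Clauset–Newman–Moore greedy modularity maximization):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def detect_communities(Ab):
    """Partition the features into non-overlapping communities
    A_1, ..., A_M by greedy modularity on the binarized graph Ab."""
    G = nx.from_numpy_array(Ab)       # Ab: binarized adjacency matrix
    return [set(c) for c in greedy_modularity_communities(G)]
```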

Shapley Value of Each Feature in the Community. Under the assumption that players in different communities do not participate in the same game (i.e., rarely connect), we treat the communities of features as independent. In order to compare feature importance across the whole brain, we first define the importance score for a feature subset S in community \(A_i\) as

$$\begin{aligned} v^{A_i}_{\varvec{X}}(S) = \log p(Y|\varvec{X}) - \log p(Y|\varvec{X}_{\mathcal {N}\setminus S}), \qquad S \subseteq A_i. \end{aligned}$$

(5)

Definition 2

Suppose the features are clustered into \(\mathcal {N} = A_1 \bigcup A_2 \bigcup \dots \bigcup A_M\). The H-SVE assigns the prediction power of feature r in \(A_i\) by

$$\begin{aligned} \hat{\varPhi }^H_r(v_{\varvec{X}}) = \frac{1}{|A_i|}\sum _{S \subseteq A_i\setminus \{r\}} \left( \begin{array}{c} |A_i|- 1 \\ |S| \end{array} \right) ^ {-1} (v_{\varvec{X}}(S\cup \{r\})-v_{\varvec{X}}(S)) . \end{aligned}$$
(6)

Theorem 3

When \(\varvec{X}_{A_1} \perp \varvec{X}_{A_2} \perp \dots \perp \varvec{X}_{A_M}\), we have \(\hat{\varPhi }^H_r(v_{\varvec{X}}) = \varPhi _r(v_{\varvec{X}})\) almost surely.

The proof is similar to the proof for Theorem 2.

Algorithm 1. Unbiased Monte Carlo approximation of C-SVE/H-SVE (Sect. 3.3).

3.3 Monte Carlo Approximation for Large Neighborhood

Although C-SVE and H-SVE simplify SVE, the computation may still be challenging, for example when (1) in C-SVE, the feature node r to be analyzed is densely connected with the other nodes, or (2) in H-SVE, large communities exist. We therefore use an alternative formulation of the Shapley value (Eq. (7)). Let \(\pi (\mathcal {N})\) be the set of all ordered permutations of \(\mathcal {N}\), and let \(Pre^r(O)\) be the set of players that precede player r in the order \(O\in \pi (\mathcal {N})\); then

$$\begin{aligned} \varPhi _r(v_{\varvec{X}}) = \frac{1}{|\mathcal {N}|!}\sum _{O \in \pi (\mathcal {N})}(v_{\varvec{X}}(Pre^r(O) \cup \{r\}) - v_{\varvec{X}}(Pre^r(O))). \end{aligned}$$
(7)

We use the following Monte Carlo (MC) algorithm to approximate Eqs. (4) and (6). We define:

$$\begin{aligned} \tau (x,\hat{x},S) = (z_1,z_2,\dots ,z_N), \; \; z_i = \left\{ \begin{array}{lr} x_i; \; i\in S \\ \hat{x}_i; \; i\notin S \end{array} \right. \end{aligned}$$
(8)

Then the unbiased MC approximation can be expressed as in Algorithm 1. Given a sampling budget m, we apply the MC approximation whenever \(2^{|\mathcal {N}_r|} \gg m\).
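A sketch of this estimator follows (illustrative naming; `v` again stands in for \(v_{\varvec{X}}\), inside which corrupted inputs would be composed with the operator \(\tau \) of Eq. (8)):

```python
import numpy as np

def mc_shapley(r, players, v, m=1000, seed=0):
    """Unbiased MC estimate in the spirit of Algorithm 1: sample m random
    orderings O of the player set (N_r for C-SVE, A_i for H-SVE) and
    average r's marginal contribution over its predecessors Pre^r(O),
    following Eq. (7)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(m):
        order = list(rng.permutation(list(players)))
        pre = set(order[:order.index(r)])     # Pre^r(O)
        total += v(pre | {r}) - v(pre)
    return total / m
```

The default m = 1000 mirrors the number of MC samples used in the fMRI experiment of Sect. 4.4.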

Fig. 2. The predictive power for identifying (a) the digit 8 by (b) C-SVE, (c) H-SVE, and (d) single-ROI explanation. The prediction difference after corrupting the ROIs that contribute \(90\%\) in total is denoted in the left corner.

4 Experiments and Results

4.1 Validation on MNIST Dataset

In order to show the feasibility of the two proposed approaches, we test the explanation results on the MNIST dataset [10], where we can compare against human judgment of feature importance. We trained a convolutional network (Conv2D(32) \(\rightarrow \) Conv2D(64) \(\rightarrow \) Dense(128) \(\rightarrow \) Dense(10)) achieving \(97.32\%\) accuracy. We parcellate each image into ROIs using SLIC [11] to mimic the setting of detecting salient brain ROIs for identifying ASD. Denoting the distance between the centers of ROI i and ROI j as \(d_{ij}\), we define the connection between ROI i and j as \(a_{ij} = \exp (-d_{ij}/2)\). Here we use \(th = \frac{\sum _i\sum _j a_{ij}}{|\mathcal {E}|}\). The results are shown in Fig. 2, where we uniformly divided each ROI’s importance score by the number of pixels in the ROI to mitigate dominance by large ROIs, and divided by \(\max _{i \in \mathcal {N}}\varPhi _i\) for visualization. The interpretation results matched human perception that the “x cross” shape in the center is important for recognizing the digit 8. Compared with single-ROI testing, our proposed methods assigned smoother and more widely distributed importance scores. To examine the effect of the important ROIs on prediction, we corrupted the pixels whose importance scores summed to \(90\%\) of the total positive importance score, and compared the original prediction probability of digit 8 with the new prediction probability on the corrupted image. C-SVE and H-SVE better fooled the classifier, decreasing the prediction probability by 0.8939 and 0.9089 respectively, compared with only a 0.2043 decrease for the single-ROI method. Some ROIs may not contribute to classification on their own but influence the result when combined with other regions; the single-ROI method assigns these ROIs an importance score of 0, whereas our proposed SVE methods can discover them.
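For reference, the parcellation and graph construction for this experiment could be sketched as follows, assuming scikit-image \(\ge \) 0.19 (`channel_axis=None` selects grayscale mode in `slic`); `n_segments=40` is a hypothetical choice not stated above, and the mean off-diagonal weight is one reading of the threshold th defined in the text:

```python
import numpy as np
from skimage.segmentation import slic

def roi_graph(img, n_segments=40):
    """Parcellate a grayscale image into ROIs with SLIC and build the
    distance-based adjacency a_ij = exp(-d_ij / 2) between ROI centers."""
    labels = slic(img, n_segments=n_segments, channel_axis=None)
    rois = np.unique(labels)
    centers = np.array([np.argwhere(labels == k).mean(axis=0) for k in rois])
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    A = np.exp(-d / 2)                            # ROI-to-ROI connection
    th = A[~np.eye(len(A), dtype=bool)].mean()    # mean edge weight
    return labels, (A > th).astype(int)           # label map, binarized graph
```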

4.2 ASD Task-fMRI Dataset and Underlying Graph Structure

We tested our methods on a dataset of 82 children with ASD and 48 age- and IQ-matched healthy controls, which was used to train classifiers to distinguish the two groups. Each subject underwent a biological motion perception task [3] during fMRI scanning (BOLD, TR = 2000 ms, TE = 25 ms, flip angle = \(60^{\circ }\), voxel size \(3.44\,\times \,3.44\,\times \,4\,\text {mm}^3\)) acquired on a Siemens MAGNETOM Trio TIM 3T scanner. We randomly split the data into \(80\%\) for training, \(10\%\) for validation of model parameters, and \(10\%\) for testing.

The Automated Anatomical Labeling (AAL) atlas [12] was used to parcellate the brain into 116 regions. For each subject, we computed the \(116 \times 116\) adjacency matrix using Pearson correlation. We averaged the adjacency matrices over the patient subjects in the training data and binarized each edge according to whether its weight exceeded the average weight (assigning 1) or not (assigning 0). For the H-SVE method, we obtained a non-overlapping community clustering for each subject by the greedy modularity method [13], which resulted in 10 communities.

Fig. 3. The relative importance scores of the top 10 important ROIs assigned by Random Forest and their corresponding importance scores in C-SVE and H-SVE. The importance rank of each ROI is denoted on the bar.

Table 1. Prediction decrease after corrupting important ROIs for the DNN

4.3 Comparison with Random Forest-Based Feature Importance

As an additional “reality check” for our method, we applied a Random Forest (RF) strategy (1000 trees) to the same dataset (\(71.4\%\) accuracy on the testing set) and compared the results, using the RF-based feature importance (mean Gini impurity decrease) as a standard reference method. Instead of inputting the entire fMRI image, we input the node-weighted modularity, defined by \(\mathcal {M}_i = \sum _{j \ne i}a_{ij}\), where \(a_{ij}\) is the partial correlation coefficient between ROI i and j; the inputs are therefore \(116 \times 1\) vectors. Based on Axiom 4 (additivity), we can treat each subject as a game and each ROI as a player, and then perform group-based analysis by summing \(\varPhi _r\) over the subjects to investigate ROI r’s importance. For a fair comparison with RF, we used all of the training dataset. The interpretation results are shown in Fig. 3: seven of the top 10 important ROIs discovered by C-SVE and H-SVE overlapped with the RF interpretation.
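The RF baseline reduces to a few lines with scikit-learn; the arrays below are hypothetical placeholders for the modularity vectors and group labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholders: X_mod holds one 116-dim node-weighted modularity vector
# per subject; y holds the ASD/control labels.
X_mod = np.random.rand(130, 116)
y = np.random.randint(0, 2, size=130)

rf = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X_mod, y)
importance = rf.feature_importances_          # mean Gini impurity decrease
top10 = np.argsort(importance)[::-1][:10]     # most important ROIs
```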

Fig. 4. 2CC3D network architecture.

Fig. 5. Top 20 predictive biomarkers detected by (a) C-SVE and (b) H-SVE for the deep learning classifier. More yellow ROIs signify higher importance. (Color figure online)

4.4 Explaining the ASD Brain Biomarkers Used in Deep Convolutional Neural Network Classifier

Here we chose the deep neural network 2CC3D (Fig. 4) described in [14], which uses each voxel’s mean and standard deviation as a two-channel input. We start with preprocessed 3D fMRI volumes downsampled to \(32\times 32\times 32\). We denote the original fMRI sequence as \(\varvec{X}(x,y,z,t)\), the mean-channel sequence as \(\varvec{\tilde{X}}(x,y,z,t)\), and the standard-deviation channel as \(\hat{\varvec{X}}(x,y,z,t)\). For any x, y, z in \(\{0,1,\cdots ,31\}\), \(\varvec{\tilde{X}}(x,y,z,t) = \frac{1}{w} \sum _{\tau =t+1-w}^{t}\varvec{X}(x,y,z,\tau )\) and \(\hat{\varvec{X}}(x,y,z,t)^2 = \frac{1}{w-1} \sum _{\tau =t+1-w}^{t}[\varvec{X}(x,y,z,\tau )-\varvec{\tilde{X}}(x,y,z,t)]^2 \), where w is the temporal sliding-window size (\(w = 3\) in our experiment); this also augments the data to 18,720 samples. Training, validation, and testing data were split by subject. The classifier achieved \(85.7\%\) accuracy by majority voting. Running on a workstation with an Nvidia 1080 Ti GPU, testing all 7 ASD subjects in the testing dataset took roughly 21,000 s for C-SVE and 26,000 s for H-SVE, using 1000 samples for the MC approximation, at which point the importance ranks had converged. As in the MNIST experiment, we divided \(\varPhi _r\) by the number of voxels in ROI r to avoid domination by large ROIs.
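The two-channel construction can be sketched in a few lines of NumPy (illustrative naming; shapes follow the definitions above):

```python
import numpy as np

def two_channel_input(X, w=3):
    """Build the 2CC3D two-channel input from an fMRI sequence X of shape
    (nx, ny, nz, T): a mean channel and a standard-deviation channel over
    a causal sliding window of size w (w = 3 in the experiment)."""
    T = X.shape[-1]
    mean_ch = np.empty(X.shape[:3] + (T - w + 1,))
    std_ch = np.empty_like(mean_ch)
    for t in range(w - 1, T):
        win = X[..., t - w + 1:t + 1]
        mean_ch[..., t - w + 1] = win.mean(axis=-1)
        std_ch[..., t - w + 1] = win.std(axis=-1, ddof=1)  # 1/(w-1) norm
    return np.stack([mean_ch, std_ch])   # shape (2, nx, ny, nz, T-w+1)
```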

Fig. 6. The relative importance scores of the top 20 ROIs assigned by C-SVE and their corresponding importance scores in H-SVE for the deep learning model.

Fig. 7. (a) The top positive correlations and (b) the top negative correlations between deep learning model biomarkers and functional keywords.

The contribution/prediction power of the regions (relative to the most important one), averaged over the testing subjects, is illustrated in Fig. 5 and listed in Fig. 6. Nineteen of the top 20 important ROIs found by C-SVE and H-SVE overlapped, although their orderings differed. The Spearman rank-order correlation coefficient [15] between the importance-score ranks of all ROIs explained by the two methods was 0.58. These detected regions are consistent with previous findings in the literature [2, 3]. We also used Neurosynth [16] to decode the functional keywords associated with the overlapping biomarkers found by C-SVE and H-SVE (Fig. 7). These top regions are positively related to self-referential/perspective-taking concepts (higher-level social communication) and negatively related to more basic social and language concepts (lower-level skills). Following the corruption scheme of Eq. (1), we corrupted the important ROIs (those whose positive importance scores, taken in rank order, summed to \(50\%\) of the total) determined by C-SVE, H-SVE, and single-region testing separately, and calculated the average decrease in probability \(\varDelta prob\) (mean and standard deviation) and accuracy \(\varDelta acc\) for the subjects in the testing set. The results are listed in Table 1.

Notice that the top 10 biomarkers discovered using SVE on the RF model differed from those found for the 2CC3D model. Possible reasons are: (1) the inputs differ, as 2CC3D used voxel activations whereas RF used connectivity, and the 2CC3D analysis used the ASD subjects in the testing set whereas the RF analysis used the whole training set; (2) the prediction accuracy of the RF model is much lower than that of 2CC3D; and (3) our proposed methods act as a model interpreter rather than a data interpreter, and may respond with different sensitivity to different models.

5 Conclusion and Future Work

Considering the interaction of features, we proposed two approaches (C-SVE and H-SVE) to analyze feature importance based on SVE, using the underlying graph structure of the data to simplify the Shapley value calculation. C-SVE considers only the centralized interactions, while H-SVE takes a hierarchical approach that first clusters the features into communities and then calculates the Shapley value within each community. When a feature’s neighborhood or community still contains a large number of features, we apply an MC integration method for further approximation. Experiments on the MNIST dataset showed that our proposed methods capture more interpretable features. Comparing the results with Random Forest feature interpretation on the ASD task-fMRI dataset further validated the accuracy and feasibility of the proposed methods. Applying both methods to a deep learning model, we discovered similar candidate brain biomarkers, which matched findings in the literature and had meaningful neurological interpretations. The pipeline generalizes to other feature-importance analysis problems where the underlying graph structure of the features is available.

Our future work includes testing the methods with different atlases, graph-building methods, and community clustering methods. In addition, the interaction score is embedded in the proposed algorithms; it could be disentangled to understand the interactions between features.