Introduction

Autism spectrum disorder (ASD) is a group of complex neurodevelopmental disorders with a strong genetic component (Weiner et al. 2017; Bai et al. 2019; Grove et al. 2019; Satterstrom et al. 2020). The field of psychiatric genetics has worked vigorously for more than a decade to discover genetic contributors to the risk for ASD. As a result, it is now understood that the genetic architecture of ASD represents a combination of high-risk rare copy number variants (Sebat et al. 2007; Marshall et al. 2008; Pinto et al. 2010; Malhotra and Sebat 2012), rare coding variants detected through whole-exome sequencing of ASD families (Iossifov et al. 2012; O’Roak et al. 2012; Sanders et al. 2012; De Rubeis et al. 2014), and common variants identified in genome-wide association studies (Grove et al. 2019). More recently, non-coding variants identified through large whole-genome sequencing studies (Yuen et al. 2015; An et al. 2018; Brandler et al. 2018) have also begun to accumulate evidence for involvement in ASD. The genetic etiology of ASD is likely intermediate, with polygenic variation contributing additively in the presence of a strong de novo variant (Weiner et al. 2017; Leblond et al. 2019). In particular, pathogenic de novo variation shows potential to account for ASD occurrence in simplex families; i.e., those with a single affected child. ASD cases in such families have been found to harbor twice as many de novo loss-of-function (LoF) variants as expected by chance, although the recurrence of any particular variant is low (Iossifov et al. 2014). Other types of variants, primarily missense variants, have subtler group signatures (Iossifov et al. 2014) and have recently attracted increased attention (Chen et al. 2018; Pejaver et al. 2020; Chen et al. 2020; Koire et al. 2021).

Due to the complex genetic architecture of ASD, identification of dysregulated signaling and regulatory pathways has remained challenging. Based on the biological functions of genes that carry recurrent de novo mutations, convergence on chromatin remodeling, synaptic and neuronal signaling, and transcriptional and translational regulation has emerged (De Rubeis et al. 2014; Gilman et al. 2011; Iossifov et al. 2014; Pinto et al. 2014). Collectively, mTOR, MAPK and beta-catenin/Wnt signaling have all been implicated (Iakoucheva et al. 2019). The integration of genetic data with other data types has further demonstrated that high-risk ASD genes are highly connected in co-expression and protein interaction networks, especially during late mid-fetal stages of brain development (Parikshak et al. 2013; Willsey et al. 2013; Corominas et al. 2014; Lin et al. 2015, 2017).

The abundance of available genetic and molecular data has led to the development of computational approaches to effectively identify new genes with association to ASD. For example, Mosca et al. (2017) used a diffusion-based prioritization in a network to identify significantly connected gene modules associated with ASD. Krishnan et al. (2016) performed a genome-wide prediction of ASD risk genes using a machine-learning approach based upon a brain-specific gene network, and used a case-control sequencing-study validation set to identify pathways and brain developmental stages to predict ASD risk genes with minimal or no prior genetic evidence. Similarly, Duda et al. (2018) used a brain-specific functional relationship network for ASD risk gene prioritization. In the past several years, more comprehensive efforts have been made to integrate brain-specific gene expression data to further generate gene-level predictions for association with ASD (Gilman et al. 2011; Liu et al. 2014; Zhang and Shen 2017; Norman and Cicek 2019; Brueggeman et al. 2020; Beyreli et al. 2020; Schaaf et al. 2020). While these approaches have made strides in the identification of genes relevant to ASD, the challenge remains to incorporate these data with variant-level information to identify individual variants that significantly increase the risk for ASD.

Here, we seek to assess the utility of gene- and variant-scoring methods to prioritize impactful exonic de novo variation in individuals with ASD. We first quantify the strength of the relationship between a given gene and previously discovered high-confidence ASD risk genes by leveraging brain gene expression and protein–protein interaction data. We find that brain-specific co-expression networks improve model performance compared to the networks from other tissues, or to protein–protein interaction networks. Then, our approach integrates gene scores with variant pathogenicity predictions to prioritize individual exonic variants. The integration was carried out in a positive-unlabeled framework that allows for rigorous score calibration (Jain et al. 2016a). We apply this methodology to de novo variation derived from the Simons Foundation Collection families (Fischbach and Lord 2010; Iossifov et al. 2012; Neale et al. 2012; O’Roak et al. 2012; Sanders et al. 2012) and from other large-scale sequencing studies (O’Roak et al. 2011; Xu et al. 2011, 2012; Michaelson et al. 2012; Rauch et al. 2012; Gulsuner et al. 2013; Jiang et al. 2013; De Rubeis et al. 2014; Iossifov et al. 2014; O’Roak et al. 2014; Krumm et al. 2015; Brandler et al. 2016; Hashimoto et al. 2016; Turner et al. 2016; Yuen et al. 2016, 2017; van Bon et al. 2016; Stessman et al. 2017), and find that high-scoring variants discriminate effectively between cases and controls. Finally, we validate one missense variant in an experimental follow-up study, confirming its putative contribution to ASD risk through the disruption of interactions with three protein-binding partners.

Materials and methods

Systems data

To construct gene networks, we first integrated gene expression data and protein–protein interaction (PPI) data. We used the “RNA-Seq Gencode v10 summarized to genes” dataset from the BrainSpan atlas of the developing human brain (Kang et al. 2011; Li et al. 2018) to construct an expression matrix of 52,376 transcripts over 524 human brain samples derived from 57 postmortem brain specimens (Kang et al. 2011). The 19,113 transcripts corresponding to protein-coding genes were subsequently grouped into 4 brain regions (Table 1) and 12 developmental periods (Table 2) as previously described (Willsey et al. 2013; Lin et al. 2015).

Table 1 Brain region groupings for the BrainSpan dataset
Table 2 Time period groupings for the BrainSpan dataset

Next, we assembled a dataset of 303,040 binary protein–protein interactions by combining physical PPIs from BioGRID v3.4.159 (Chatr-Aryamontri et al. 2017), gene-level interactions from the Autism Spliceform Interaction Network (Corominas et al. 2014), the human interactome from Rolland et al. (2014) and gene-level PPIs from Yang et al. (2016).

Rare de novo variants

We obtained 9,174 protein-coding de novo variants of three types; i.e., missense, in-frame insertion/deletion (indel) and loss-of-function (LoF; stop gain and frameshifting indels) from whole-exome sequencing studies of the Simons Foundation Collection families (Iossifov et al. 2012; Neale et al. 2012; O’Roak et al. 2012; Sanders et al. 2012) and other studies (O’Roak et al. 2011, 2014; Xu et al. 2011, 2012; Michaelson et al. 2012; Rauch et al. 2012; Gulsuner et al. 2013; Jiang et al. 2013; De Rubeis et al. 2014; Iossifov et al. 2014; Krumm et al. 2015; Brandler et al. 2016; Hashimoto et al. 2016; Turner et al. 2016; Yuen et al. 2016, 2017; van Bon et al. 2016; Stessman et al. 2017). Variants present in the gnomAD database (Karczewski et al. 2020) as well as variants shared between cases and controls were filtered out from the case dataset because of their high likelihood of being non-pathogenic (Kosmicki et al. 2017). Our final set contained 3,608 variants (Table 3).

Table 3 Breakdown of the de novo variants used in this study

Network construction

We used gene expression data to build a correlation network for all brain regions (R), developmental periods (P) and their combinations (RP). Genes constituted nodes in these networks, whereas the links were constructed by reliably estimating the correlation of expression profiles across relevant samples for all pairs of genes. As criteria for noise filtering, we required that more than five pairs of independent samples support the calculation of a Pearson’s correlation coefficient (\(\rho \)) and that at least one gene from each pair had an expression of at least 0.5 Transcripts Per Million (TPM). The same network construction steps were applied to co-expression data from GTEx (Mele et al. 2015) to construct tissue-specific networks for our baseline approaches. GTEx data were available for multiple tissues, including adult human brain.

Gene co-expression networks were merged with PPI networks in two ways, referred to here as the “intersection” and “union” integration. In the “intersection” approach, the weight \(w_{ij}\) of the link between gene i and gene j was defined as

$$\begin{aligned} w_{ij} = \min \{ I_{(g_i, g_j) \in \mathrm {PPI}}, |r_{ij}| \}, \end{aligned}$$

where \(I_{\text {c}}\) is an indicator function for the logical expression c, \(r_{ij}\) is the thresholded correlation coefficient \(\rho _{ij}\) as described below, and \(|\cdot |\) is an absolute value function. Similarly, the weight of each link in the “union” approach was defined as

$$\begin{aligned} w_{ij} = \max \{ I_{(g_i, g_j) \in \mathrm {PPI}}, |r_{ij}| \}. \end{aligned}$$

In both approaches, we retained only co-expression edges with absolute values of at least 0.75; i.e., \(r_{ij} = \rho _{ij} \cdot I_{|\rho _{ij}| \ge 0.75}\), where \(\rho _{ij}\) is Pearson’s correlation between expressions of genes i and j over a set of samples in the BrainSpan dataset. In summary, three types of networks were built for each R, P, or RP for gene prioritization: (1) gene co-expression network without PPIs; (2) the intersection of co-expression and the PPI network; (3) the union of co-expression and the PPI network.
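The thresholding and the two integration rules can be illustrated with a minimal sketch. It assumes NumPy, a small genes-by-samples expression matrix, and a symmetric 0/1 PPI adjacency matrix; the function names `coexpression_weights` and `merge_with_ppi` are hypothetical, and the noise-filtering criteria (sample-pair support and the TPM cutoff) are omitted for brevity.

```python
import numpy as np


def coexpression_weights(expr, min_abs_rho=0.75):
    """Return r_ij = rho_ij * I(|rho_ij| >= min_abs_rho) for a genes x samples matrix.

    Hypothetical helper: the sample-support and TPM filters from the text are not applied.
    """
    rho = np.corrcoef(expr)      # Pearson correlation between rows (genes)
    np.fill_diagonal(rho, 0.0)   # ignore self-links
    return np.where(np.abs(rho) >= min_abs_rho, rho, 0.0)


def merge_with_ppi(r, ppi, mode):
    """Combine thresholded correlations r with a 0/1 PPI matrix.

    mode="intersection" uses min{I_PPI, |r_ij|}; mode="union" uses max{I_PPI, |r_ij|}.
    """
    abs_r = np.abs(r)
    if mode == "intersection":
        return np.minimum(ppi, abs_r)
    if mode == "union":
        return np.maximum(ppi, abs_r)
    raise ValueError("mode must be 'intersection' or 'union'")


# Toy data (made up for illustration): genes 0 and 1 are strongly co-expressed,
# gene 2 is not, and genes 0 and 2 share a PPI edge.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
])
ppi = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=float)

r = coexpression_weights(expr)
w_intersection = merge_with_ppi(r, ppi, "intersection")
w_union = merge_with_ppi(r, ppi, "union")
```

In this toy case the intersection network is empty, because no gene pair is both co-expressed above the cutoff and connected in the PPI data, whereas the union network keeps both the co-expression edge and the PPI edge.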

Gene and variant scoring

There may be hundreds or even thousands of genes involved in ASD (Brandler and Sebat 2015; de la Torre-Ubieta et al. 2016; Iakoucheva et al. 2019), but the role of each gene and its contribution to the development of ASD is for the most part unknown. In the past decades, studies have identified about a hundred genes conferring high risk for ASD (Satterstrom et al. 2020). To prioritize the remaining genes, we used the functional flow network propagation approach (Nabieva et al. 2005) across spatio-temporal co-expression and PPI networks as described below.

The network seed genes (denoted as POS65; Supplementary Table S2) were derived from Sanders et al. (2012) and consisted of 65 high-confidence, genome-wide significant ASD risk genes. The performance of gene scoring was evaluated on an independent set of 63 genes (denoted as VAL63; Supplementary Table S2) assembled by removing POS65 genes from the 102 recently identified high-risk autism genes (Satterstrom et al. 2020). As a negative control, additional evaluation datasets included 1000 lists of 63 genes, randomly sampled from BrainSpan to be similar in length and GC content (\(\pm 10\%\)) to the VAL63 genes.

The performance among methods with different parameter settings was compared along several dimensions: (i) edge weight normalization method in propagation: the original functional flow and its two variants (i.e., incoming and outgoing edge normalization); (ii) three different settings for edge cutoffs with controls of network sparsity; and (iii) number of propagation strides (Nabieva et al. 2005). Through five-fold cross-validation, we identified the parameter settings with the best performance using 51 BrainSpan networks (4 regions, 12 periods and 35 region/period combinations) with and without PPI networks, and then used the best parameters for testing. Thirteen region/period combinations were omitted due to lack of samples.
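To make the propagation step concrete, the following is a minimal sketch in the spirit of functional flow, not the implementation used in this study. It assumes NumPy and a symmetric non-negative weight matrix; seed genes are treated as infinite reservoirs, each edge carries at most its weight per step, and a non-seed node splits its reservoir across neighbors in proportion to its outgoing weights. The function name `functional_flow` is hypothetical, the strict "downhill" rule is a simplification, and the incoming/outgoing normalization variants compared in the text are not reproduced.

```python
import numpy as np


def functional_flow(W, seeds, n_steps=5):
    """Score nodes by the total flow they receive from seed nodes in n_steps steps.

    Illustrative sketch only: W is a symmetric, non-negative weight matrix,
    seeds is a list of node indices treated as infinite reservoirs.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    is_seed = np.zeros(n, dtype=bool)
    is_seed[list(seeds)] = True

    reservoir = np.zeros(n)   # finite reservoirs of non-seed nodes
    score = np.zeros(n)       # accumulated inflow per node

    # share[u, v] = w_uv / sum_y w_uy (zero for isolated nodes)
    out_strength = W.sum(axis=1, keepdims=True)
    share = np.divide(W, out_strength, out=np.zeros_like(W), where=out_strength > 0)

    for _ in range(n_steps):
        level = np.where(is_seed, np.inf, reservoir)
        downhill = level[:, None] > level[None, :]   # flow only from fuller to emptier nodes
        flow = np.minimum(W, reservoir[:, None] * share)
        flow[is_seed, :] = W[is_seed, :]             # infinite reservoirs saturate edge capacity
        flow = np.where(downhill, flow, 0.0)

        inflow = flow.sum(axis=0)
        outflow = flow.sum(axis=1)
        score += inflow
        reservoir = reservoir + inflow - outflow
        reservoir[is_seed] = 0.0                     # seed levels are handled via `level`
    return score


# Toy path graph 0 - 1 - 2 with unit weights and node 0 as the only seed.
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
scores = functional_flow(W, seeds=[0], n_steps=2)
```

On this path graph the node adjacent to the seed accumulates more flow than the node two steps away, which is the qualitative behavior that gene prioritization relies on.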

The effect of missense variants was estimated with MutPred2 (Pejaver et al. 2020), loss-of-function variants with MutPred-LOF (Pagel et al. 2017), and non-frameshifting indel variants and multi-residue substitutions with MutPred-Indel (Pagel et al. 2019). These predictors were selected because they were all trained using similar protocols, perform well in the prediction of both pathogenicity and protein function disruption (Pejaver et al. 2017), and all report molecular mechanisms potentially causative of pathogenicity. High-scoring variants with “loss of protein binding” as an underlying mechanism were of primary interest for downstream experimental validation.

Although we were interested in variants that alter protein function, we note that exonic variants could lead to phenotypic changes via other molecular mechanisms, such as splicing disruption or impact on RNA stability and folding. Our approach has not directly considered such events.

Probabilistic model for autism-specific variant scoring

We propose a simple semi-supervised probabilistic model that combines the risk that a gene is involved in ASD with the probability that the variant disrupts the function of this gene. Before describing the model, we argue that both gene scoring and variant scoring can be approached through positive-unlabeled learning, a form of semi-supervised binary classification in which all labeled data is positive and unlabeled data is a mixture of positive and negative examples at unknown proportions (Denis et al. 2005). In our problem, known ASD genes can be considered positive, whereas other genes can be considered to be unlabeled. The task of a gene prioritization model is then to identify remaining positive genes among unlabeled genes. Similarly, in variant scoring, we are given a set of disease-causing variants, such as those from the Human Gene Mutation Database (Stenson et al. 2017), and a set of unlabeled variants from gnomAD. The task of a variant interpretation model is then to identify remaining disease-causing variants among unlabeled variants.

A common approach to positive-unlabeled learning is to develop classifiers by training positive against unlabeled data (Elkan and Noto 2008). This approach was in fact shown to be optimal for a range of loss functions for model learning (Blanchard et al. 2010; Reid and Williamson 2010), in the sense that minimizing the loss function from positive and unlabeled data [models referred to as non-traditional classifiers (Elkan and Noto 2008)], simultaneously minimizes the loss if one were to train a model from positive and negative data [models referred to as traditional classifiers (Elkan and Noto 2008)]. Unfortunately, although ranking objectives such as area under the ROC curve fall under this scenario, the scores outputted by non-traditional classifiers are not calibrated to represent posterior probabilities (Jain et al. 2016a). As such, the outputs from gene prioritization tools and variant interpretation tools cannot be formally combined as probabilities. We will address the score calibration models after the model is introduced.

Let D (diagnosis) be a binary random variable indicating the diagnosis of ASD and v a single variant occurring in some gene g. We focus on a single variant at a time because the average number of de novo coding variants in an individual is around one (Acuna-Hidalgo et al. 2016). Let now E (effect) and R (risk) denote binary random variables indicating whether the function of a protein product of g is disrupted in the presence of v and whether g is an ASD risk gene, respectively. We can then use marginalization to write the probability of diagnosis d as

$$\begin{aligned} p(d | v) = \sum _{e \in \{0, 1\}} \sum _{r \in \{0, 1\}} p(d | e, r, v) p(e, r | v), \end{aligned}$$

where d, e, and r are realizations of the random variables D, E, and R, respectively, variant v can be thought of as a realization of a random variable V, and p denotes an appropriate probability mass function; e.g., \(p(d|v)=P(D=d|v)\), etc. We are primarily interested in identifying new risk genes and variants and, thus, we focus on \(P(D=1|v)\).

We now observe that nonfunctional variants cannot contribute to the positive diagnosis and neither can variants outside of the group of the ASD risk genes; i.e., \(P(D = 1 | E = e, R = r, v)=0\) unless both \(e =1\) and \(r = 1\). Hence, we can write the probability of the ASD diagnosis given variant v as

$$\begin{aligned} P(D = 1 | v) = P(D = 1 | E = 1, R = 1, v) P(E = 1, R = 1 | v) \end{aligned}$$

since all other terms reduce to 0. We now make an assumption that any variant disrupting the function of an ASD risk gene causes the phenotype with certainty. Then, by applying conditional independence between a variant disrupting gene function and that gene being an ASD risk gene, we obtain a probabilistic model of ASD diagnosis in the presence of a de novo variant v as

$$\begin{aligned} P(D = 1 | v) =&\; P(E = 1, R = 1| v) \nonumber \\ =&\; P(E = 1 | R = 1, v) P(R = 1 | v) \nonumber \\ =&\; P(E = 1 | v)P(R = 1 | g). \end{aligned}$$
(1)

The two factors in the last line of Eq. (1) correspond to a variant-level score and a gene-level score, respectively. We have further replaced the probability \(P(R = 1 | v)\) with \(P(R = 1 | g)\) to clarify that the probability that g is a risk gene is strictly a gene property, as long as variant v is within g. Probabilities \(P(E = 1 | v)\) and \(P (R = 1 | g)\) are first obtained by applying a dedicated variant- or gene-prediction tool, whose outputs are then calibrated to be proper probabilities, as described in “Score calibration in the positive-unlabeled setting”. To avoid multiplying small probabilities, we have scored each variant using a logarithm transform

$$\begin{aligned} \log P(D = 1 | v) = \log P(E = 1 | v) + \log P(R = 1 | g). \end{aligned}$$
(2)
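As a small illustration of Eq. (2), the sketch below ranks variants by the sum of log-probabilities. The function name `log_variant_score` and the example probabilities are hypothetical and are not values from this study; a zero probability is mapped to negative infinity so that it ranks last.

```python
import math


def log_variant_score(p_effect, p_risk_gene):
    """Eq. (2): log P(D=1|v) = log P(E=1|v) + log P(R=1|g)."""
    if p_effect <= 0.0 or p_risk_gene <= 0.0:
        return float("-inf")   # a zero factor makes the product zero
    return math.log(p_effect) + math.log(p_risk_gene)


# Hypothetical (variant score, gene score) pairs, for illustration only.
variants = {
    "v1": (0.90, 0.05),   # damaging variant in an unlikely risk gene
    "v2": (0.60, 0.80),   # moderately damaging variant in a likely risk gene
    "v3": (0.95, 0.00),   # damaging variant in a gene with zero risk score
}
ranked = sorted(variants, key=lambda name: log_variant_score(*variants[name]), reverse=True)
```

Because the logarithm is monotonic, this ranking is identical to the ranking by the product in Eq. (1).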

The model described above is appropriate for phenotype-specific prioritization of highly penetrant variants. However, even in complex phenotypes such as ASD, it has been observed that the polygenic effect can be modulated in the presence of a strong de novo variant (Weiner et al. 2017). Therefore, although we expect lower performance levels compared to Mendelian disorders, we believe that a useful diagnostic signal can still emerge. Polygenic scores corresponding to the set of individuals considered here were not available for the development of more sophisticated models.

Score calibration in the positive-unlabeled setting

According to the above derivation, the gene score \(P(R = 1 | g)\) and variant score \(P(E = 1 | v)\) can be simply multiplied to yield the probability that the variant v in gene g leads to ASD. However, the outputs of the gene and variant scoring tools require calibration before they can be considered good approximations of the posteriors and multiplied together. To illustrate this seeming subtlety, we will digress to discuss model development in a positive-unlabeled setting.

Consider a binary classification problem of mapping inputs \(x \in {\mathcal {X}}\) into outputs \({\mathcal {Y}} = \{0, 1\}\) on the dataset drawn i.i.d. from a fixed but unknown probability distribution \(p(x, y)\), where \((x, y)\) is a realization of a random vector \((X, Y)\) of inputs (X) and outputs (Y). In a traditional supervised setting, we are given a set of positive examples obtained from \(p(x | Y = 1)\) and a set of negative examples obtained from \(p(x | Y = 0)\), roughly available at proportions \(P(Y = 1)\) and \(P(Y = 0)\), respectively. In contrast, a positive-unlabeled setting considers training data obtained through a selection process to contain a set of positive examples drawn from \(p(x | Y = 1)\) and a set of unlabeled examples drawn from the marginal distribution \(p(x) = \sum _{y \in {\mathcal {Y}}} p(x,y)\).

Using S to represent a binary random variable that a data point is labeled (\(S = 1\) indicates labeled and \(S = 0\) unlabeled), we can train a classifier to approximate the posterior distribution between labeled and unlabeled data as \(P(S = 1 | x)\). Jain et al. (2016b) derived a formula to then convert \(P(S = 1 | x)\) into \(P(Y = 1 | x)\) as

$$\begin{aligned} P(Y = 1 | x) = \frac{P(S = 0)}{P(S = 1)} \cdot \frac{P(S = 1 | x)}{P(S = 0 | x)} \cdot P(Y = 1), \end{aligned}$$
(3)

where \(P(S = 1) = 1 - P(S = 0)\) is the probability of observing a (positively) labeled example in the training data and \(P(Y = 1)\) is the probability of observing a positive example in the unlabeled data. Therefore, to estimate the posterior probability of the positive output given some input x, two conditions must be fulfilled: (1) we must train a non-traditional classifier that estimates \(P(S = 1 | x)\), and (2) we must estimate \(P(Y = 1)\).

The first condition can be reasonably achieved by training models that approximate the posterior distributions in a binary classification setting. Posterior approximation has been covered in the literature; e.g., Rojas (1996) demonstrated it for neural networks, whereas Platt (1999) gave a post-processing algorithm for learners such as support vector machines. The second condition requires a complex step of nonparametric estimation of class priors in unlabeled data (Jain et al. 2016a, b). The class prior \(P(Y = 1)\) in this work was estimated using the AlphaMax algorithm (Jain et al. 2016a, b) and the fraction \(P(S = 0)/P(S = 1)\) was estimated as the ratio of the numbers of unlabeled and labeled training examples. The uncalibrated probability \(P(S = 1 | x)\) was the output of a dedicated tool in either gene prioritization or variant interpretation, as applicable. Finally, we note that errors in estimating \(P(S = 1|x)\) and \(P(Y = 1)\) can lead to undesired situations in which the calibrated probability is greater than 1. The (monotonic) logarithmic transform from Eq. (2) allowed us to disregard such problems.
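The rescaling in Eq. (3) can be sketched in a few lines. The function name `calibrate_pu_score` and all numerical inputs below are hypothetical; the class prior is assumed to have been estimated separately (e.g., by AlphaMax), and the ratio \(P(S = 0)/P(S = 1)\) is approximated by the ratio of unlabeled to labeled training-set sizes, as described above.

```python
def calibrate_pu_score(s, class_prior, n_unlabeled, n_labeled):
    """Eq. (3): convert a non-traditional score s ~ P(S=1|x) into an estimate of P(Y=1|x).

    class_prior is an externally supplied estimate of P(Y=1) in the unlabeled data.
    The result is not clipped, so estimation errors can push it above 1.
    """
    if not 0.0 <= s < 1.0:
        raise ValueError("s must lie in [0, 1)")
    size_ratio = n_unlabeled / n_labeled   # estimate of P(S=0) / P(S=1)
    odds = s / (1.0 - s)                   # P(S=1|x) / P(S=0|x)
    return size_ratio * odds * class_prior


# Hypothetical setting: 100 labeled and 900 unlabeled examples, class prior 0.1.
p_low = calibrate_pu_score(0.2, class_prior=0.1, n_unlabeled=900, n_labeled=100)
p_mid = calibrate_pu_score(0.5, class_prior=0.1, n_unlabeled=900, n_labeled=100)
p_high = calibrate_pu_score(0.8, class_prior=0.1, n_unlabeled=900, n_labeled=100)
```

The last call returns a value above 1, which illustrates the overshoot mentioned in the text; since the mapping is monotonic in s, the ranking of examples is unaffected.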

Clinical significance of variant scoring

Evaluation of variant interpretation for complex clinical phenotypes is a difficult task owing to the mostly low-to-moderate penetrance of pathogenic variants. Even when penetrance is high, de novo variation, compound heterozygosity, or structural variation could all be contributing factors for different subsets of individuals (Iakoucheva et al. 2019), which presents evaluation problems for de novo variation because the ground truth is unavailable. Therefore, standard machine learning approaches that include ROC curves, precision-recall curves and their derivatives (e.g., area under the curve) cannot be effectively used to evaluate the quality of predictive models.

To evaluate potential clinical impact of our scoring of de novo variation, we use the positive likelihood ratio (\(\text {LR}^+\)), defined as the ratio of posterior and prior odds of pathogenicity (Glas et al. 2003). Let \(P(Y=1)\) be the fraction of pathogenic variants in the population of interest and \(P(Y=1|f(x)=1)\) be the fraction of pathogenic variants in the same population but when a computational model \(f:{\mathcal {X}} \rightarrow {\mathcal {Y}}\) gives a positive prediction. Then, the likelihood ratio for the positive prediction \(f(x)=1\) is defined as

$$\begin{aligned} \text {LR}^+=\frac{\text {posterior odds}}{\text {prior odds}}, \end{aligned}$$
(4)

where the prior odds are defined as the ratio of \(P(Y=1)\) and \(P(Y=0)\), and the posterior odds are defined as the ratio of \(P(Y=1|f(x)=1)\) and \(P(Y=0|f(x)=1)\). The likelihood ratio is therefore the increase in odds of pathogenicity due to the positive prediction. It can be shown that \(\text {LR}^+\) is independent of the class prior \(P(Y=1)\) and can be computed as the ratio of the true positive rate and false positive rate (Glas et al. 2003).
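The independence of \(\text {LR}^+\) from the class prior follows directly from Bayes' rule. As a short derivation using only the quantities defined above, the posterior odds can be written as

$$\begin{aligned} \frac{P(Y=1|f(x)=1)}{P(Y=0|f(x)=1)} = \frac{P(f(x)=1|Y=1)}{P(f(x)=1|Y=0)} \cdot \frac{P(Y=1)}{P(Y=0)}, \end{aligned}$$

so that dividing by the prior odds leaves \(\text {LR}^+ = P(f(x)=1|Y=1) / P(f(x)=1|Y=0)\); i.e., the ratio of the true positive rate and the false positive rate, in which \(P(Y=1)\) no longer appears.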

The positive likelihood ratio is related to the Diagnostic Odds Ratio (\(\text {DOR}\)) that has been often used for risk assessment, particularly in cancer studies (Breast Cancer Association Consortium et al. 2021). The relationship is expressed as

$$\begin{aligned} \text {DOR}=\frac{\text {LR}^+}{\text {LR}^-}, \end{aligned}$$

where \(\text {LR}^-\) is defined as the ratio of the false negative rate and true negative rate (Glas et al. 2003). Since \(\text {LR}^-\) is limited to a [0, 1] interval for predictors whose ROC curve never drops below the identity line, \(\text {LR}^+\) is generally lower than \(\text {DOR}\) and thus gives a more conservative view on clinical utility.
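The relationship between these quantities can be checked numerically. The sketch below computes \(\text {LR}^+\), \(\text {LR}^-\) and \(\text {DOR}\) from the four cells of a confusion matrix; the function name `likelihood_ratios` and the counts are hypothetical, and the code assumes a nonzero false positive rate and a nonzero false negative rate so that no division by zero occurs.

```python
def likelihood_ratios(tp, fp, fn, tn):
    """Return (LR+, LR-, DOR) from confusion-matrix counts.

    LR+ = TPR / FPR, LR- = FNR / TNR, DOR = LR+ / LR-.
    """
    tpr = tp / (tp + fn)   # true positive rate
    fpr = fp / (fp + tn)   # false positive rate
    lr_pos = tpr / fpr
    lr_neg = (1.0 - tpr) / (1.0 - fpr)
    return lr_pos, lr_neg, lr_pos / lr_neg


# Hypothetical counts: 100 pathogenic and 100 benign variants.
lr_pos, lr_neg, dor = likelihood_ratios(tp=30, fp=10, fn=70, tn=90)
```

With these counts \(\text {LR}^-\) is below 1, so \(\text {DOR}\) exceeds \(\text {LR}^+\), in line with the statement that \(\text {LR}^+\) gives the more conservative view.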

Experimental validation

As a proof of principle, we selected one missense variant for experimental validation. This variant was selected based on the following criteria: (1) the gene was not in the list of POS65 or 102 genes from Satterstrom et al. (2020); (2) the gene was highly scored by top-performing BrainSpan networks to suggest that it was likely an ASD risk gene; (3) the mutation was scored with a high MutPred score to suggest pathogenicity; (4) the mutation was not present in either the Human Gene Mutation Database (Stenson et al. 2017) or ClinVar (Landrum et al. 2020); (5) no variants from controls were found in the gene.

Plasmid cloning and cell transfection

The ORF clones of the gene of interest, ATP1A3, and its interacting partners were obtained from the ORFeome Collaboration (The ORFeome Collaboration 2016). The genes were transferred from the donor plasmid pDONR223 to destination plasmids, pDEST40 and pDEST47, using the LR Gateway reaction (Invitrogen) following the manufacturer’s instructions. The gene of interest was introduced into pDEST40 to obtain V5-tagged ATP1A3, and the partners were transferred to pDEST47 to obtain GFP-tagged proteins.

HEK 293T cells were seeded at \(5 \times 10^5\) cells per well in 60 mm plates (Genesee Scientific). After 24 h, cells were transfected using Lipofectamine 3000 (Invitrogen) following the manufacturer’s instructions and then harvested after an additional 48 h.

Co-immunoprecipitation and Western Blot

HEK 293T cells were harvested and rinsed once with ice-cold 1xPBS, pH 7.2, and lysed in immunoprecipitation lysis buffer (20 mM Tris, pH 7.4, 140 mM NaCl, \(10\%\) glycerol, and \(1\%\) Triton X-100) supplemented with 1xEDTA-free complete protease inhibitor mixture (Roche) and phosphatase inhibitor cocktails-I, II (Sigma Aldrich). The cells were centrifuged at \(16,000\times g\) at 4 °C for 30 min, and the supernatants were collected. Protein concentration was quantified by modified Lowry assay (DC protein assay; Bio-Rad). The cell lysates were resolved by SDS-PAGE and transferred onto PVDF Immobilon-P membranes (Millipore). After blocking with \(5\%\) nonfat dry milk in TBS containing \(0.1\%\) Tween 20 for 1 h at room temperature, membranes were probed overnight with the appropriate primary antibodies. They were then incubated for 1 h with the species-specific peroxidase-conjugated secondary antibody. Membranes were developed using the Pierce-ECL Western Blotting Substrate Kit (Thermo Scientific).

For immunoprecipitation experiments, samples were lysed and quantified as described above. Then, 1 mg of total protein was diluted with immunoprecipitation buffer to achieve a concentration of 1 mg/ml. A total of 30 \(\upmu \mathrm {l}\) of anti-V5-magnetic beads-coupled antibody (MBL) was added to each sample and incubated for 4 h at 4 °C in a tube rotator. Beads were then washed twice with immunoprecipitation buffer and three more times with ice-cold 1xPBS. The proteins were then eluted with 40 \(\upmu \mathrm {l}\) of 2xLaemmli buffer. After a short spin, supernatants were carefully removed, and SDS-PAGE was performed. The following primary antibodies were used: anti-V5 (1:1000; Invitrogen), anti-GFP (1:1000; Cell Signaling), anti-GAPDH (1:5000; Cell Signaling).

Results

Brain-specific co-expression networks improve ASD gene prioritization

To assess the extent to which brain-specific gene co-expression networks and protein–protein interaction (PPI) networks help in autism gene prioritization, we ran a label propagation algorithm with POS65 as the seed genes on all gene networks described in “Materials and methods”. We assessed the quality of the predicted gene scores of the algorithm with various parameters using stratified five-fold cross-validation for the three network types (i.e., co-expression networks and the “union” and “intersection” with PPI). The final parameter set included a correlation cutoff of \(|\rho | \ge 0.75\) for co-expression network construction and 5 strides with outgoing weight normalization for the functional flow procedure (Nabieva et al. 2005).

Figure 1A shows the area under the Receiver Operating Characteristic (ROC) curve (AUC) evaluated over all region-, period-, and region/period-specific networks with or without the PPI data; for detailed results see Supplementary Table S3. We observed that in most cases, networks constructed using the union of co-expression and PPI networks performed better than co-expression networks without the PPI information. Similarly, many of the networks using region- and period-specific brain co-expression data outperformed the PPI-only network. Region-wise, all brain regions performed similarly well, with R1 and R2 displaying a slightly better performance than other regions. Period-wise, P5 (16–18 pcw) and P10 (1–5 years) performed best. With regard to region/period networks, P10 in combination with any region, but especially R2-P10 and R4-P10 combinations, had superior performance. Top region/period-specific networks with the union of PPI outperformed region- and period-specific networks as well as the PPI-only network.

Next, we evaluated the quality of gene scoring obtained by a one-time propagation of POS65 but on an independent validation dataset VAL63. The classification performance was only slightly lower than POS65 cross-validation performance (Fig. 1B; Supplementary Table S3). However, in agreement with cross-validation results, period P10 by itself, or P10 in combination with various regions, remained the best performing networks on the out-of-sample VAL63 dataset. Again, the addition of PPI data improved the performance for the majority of datasets. As a negative control, we also generated 1000 simulated gene lists, each of which consisted of 63 brain-expressed genes with length and GC content similar to those of the VAL63 genes. The performance of gene scoring on these control gene lists (Fig. 1C) was considerably lower than on the VAL63 gene set.

Fig. 1

Heatmap plot of the performance of gene scoring of BrainSpan networks with POS65 as seed genes. A Cross-validated performance with POS65 as seed genes. B Using POS65 as seed genes but evaluated on the VAL63 genes. C Using POS65 as seed genes but evaluated on the 1000 lists of simulated genes with similar length and GC content as VAL63. Each pie chart represents the estimated AUC on one BrainSpan region, period or region/period combination. The three patches in each pie chart represent: (top) the original BrainSpan network; (lower-left) the intersection of BrainSpan and PPI network; (lower-right) the union of BrainSpan and PPI network. More detailed results are shown in Supplementary Table S3

To understand whether region- and period-specific brain expression data were indeed important for ASD risk gene prioritization, we repeated the experiment with the same network construction procedures for gene scoring, only using tissue-specific gene expression data from the GTEx database v.7 instead of BrainSpan. We found that all GTEx networks, including those from the brain, showed inferior performance even to the protein–protein interaction network (Fig. 2; Supplementary Table S3). Of note, GTEx brain datasets are derived from adult brain samples. This suggests that the spatio-temporal developmental brain transcriptome from BrainSpan, and especially from fetal and early postnatal periods, significantly improves ASD gene prioritization.

Fig. 2

Performance of gene scoring on networks constructed from 31 GTEx tissues. Black bar corresponds to the performance of the PPI network. The bars show average AUC and standard error over 100 restarts of network propagation through cross-validation

Estimating prior probabilities

To further prioritize individual exonic variants, the variant scores were calculated by an appropriate tool from the MutPred family; i.e., MutPred2 (Pejaver et al. 2020) was used on missense variants, MutPred-LOF (Pagel et al. 2017) on frameshifting indels and stop variants (LoF variants), and MutPred-Indel (Pagel et al. 2019) on non-frameshifting indels and multiresidue substitutions. The gene scores were calculated by gene prioritization tools described in “Materials and methods”.

After non-traditional scores were obtained, we used the AlphaMax algorithm (Jain et al. 2016a, b) to estimate the prior probability of pathogenicity for each variant type: \(1.5\%\) for missense, \(2.5\%\) for loss-of-function (LoF) and \(5\%\) for indel variants. The prior probability of a gene being an ASD risk gene was estimated at \(10\%\). All raw prediction scores were then re-scaled according to Eq. (3) to obtain calibrated scores.

Gene scoring improves discriminatory power of highly scored variants

We next demonstrate that the integration of gene scoring with variant scoring from Eq. (1) increases the power of highly scored variants to discriminate between autism cases and controls. After re-weighting with gene scores, we defined high-risk variants as those whose final scores were larger than \(90\%\) of control variants. This threshold corresponds to a one-sided p-value below 0.1 under an empirical null distribution defined by the control variants. We then applied Fisher’s exact test to determine whether the proportion of high-risk de novo variants is higher among probands than in their healthy siblings. A more stringent threshold corresponding to the 95th percentile showed similar results, although we considered it less reliable due to the relatively small sample sizes of LoF and indel variants at this threshold.
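The thresholding and testing steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in this study: a variant is called high-risk when at least 90% of control scores lie strictly below it, the one-sided Fisher’s exact test is computed directly from the hypergeometric distribution, and all function names and scores are hypothetical.

```python
from bisect import bisect_left
from math import comb


def is_high_risk(score, sorted_controls, percent=90):
    """True if at least `percent`% of control scores are strictly below `score`."""
    below = bisect_left(sorted_controls, score)
    return below * 100 >= percent * len(sorted_controls)


def high_risk_table(case_scores, control_scores, percent=90):
    """2x2 counts: (case high, case other, control high, control other)."""
    ref = sorted(control_scores)
    a = sum(is_high_risk(s, ref, percent) for s in case_scores)
    c = sum(is_high_risk(s, ref, percent) for s in control_scores)
    return a, len(case_scores) - a, c, len(control_scores) - c


def fisher_one_sided(a, b, c, d):
    """P(at least `a` high-risk variants among cases) with all margins fixed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / comb(n, row1)


# Hypothetical final variant scores (illustration only, not study data).
control_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
case_scores = [0.10, 0.20, 0.30, 0.40, 0.48, 0.55, 0.60, 0.70, 0.80, 0.90]

contingency = high_risk_table(case_scores, control_scores)
p_value = fisher_one_sided(*contingency)
print(contingency, p_value)
```

Passing `percent=95` gives the more stringent threshold. With few variants in a class, the high-risk cells of the table become very small, which is the reason that threshold is considered less reliable for LoF and indel variants.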

We benchmarked the discriminating power of our gene prioritization with brain-specific networks against several baseline gene scoring methods: (i) POS65—the known 65 risk seed genes were assigned the probability of 1, while all other genes were assigned the probability of 0; (ii) MutPred—this baseline scoring scheme does not use any gene prioritization and scores variants simply based on the outputs of MutPred; (iii) Krishnan—the gene probabilities were obtained from Krishnan et al. (2016); (iv) Duda—the gene probabilities were obtained from Duda et al. (2018); (v) PPI—the genes were scored by propagating over the PPI network. The blue dashed line indicates the significance value corresponding to the p-value of 0.05 in each plot, whereas the gray dashed line indicates the Bonferroni-corrected p-value (Figs. 3, 4, 5).

Our scoring method based on BrainSpan networks was more powerful in discriminating high-risk variants between case and control groups than PPI networks and two other published methods (Duda et al. 2018; Krishnan et al. 2016). Interestingly, all scoring approaches performed better on LoF mutations than on missense and indel variants. This is consistent with the notion that LoF mutations are generally more pathogenic, and that ASD patients have an excess of LoF variants compared to controls. Across region-based networks, R1 and R2 cortical regions generally outperformed other regions, which is consistent with previous observations from the literature (Willsey et al. 2013; Parikshak et al. 2013; Lin et al. 2015, 2017). Across period-based networks, P3 (early fetal) for all mutation types, and P11 (middle and late childhood) for LoF, generally outperformed other periods. This is a surprising finding, since the P6 (late mid-fetal) period has previously been strongly implicated in ASD based on gene network-level analyses (Willsey et al. 2013; Lin et al. 2015). This suggests that adding variant scoring to the networks may pick up additional signals that were not present in the gene-based models. Furthermore, predictions using region-period combination networks (Fig. 5) generally performed better than those using region-based (Fig. 3) or period-based (Fig. 4) networks individually. Some of the region/period network-based predictions significantly outperformed the existing methods; i.e., R2–P5 and R3–P3 for missense, R1–P11 and R3–P3 for LoFs, and R2–P8 and R4–P5 for indels (Fig. 5). Note that the addition of gene scoring increased predictive performance compared to variant scoring by MutPred alone. In most cases, the combination of gene and variant scoring on certain region/period combinations improved upon POS65. This suggests that our method improves the predictions of novel ASD risk genes and variants. We note that the statistical signal also holds when the POS65 genes were completely removed from the networks (Supplementary Materials).

Fig. 3

Fisher’s exact test for discriminating case and control exonic de novo variants by using gene scores from various brain region networks. From left to right: missense, LoF, and indel. Each region has three bars (color coded from light to dark) corresponding to the co-expression network, and merged networks with PPI using the “intersection” and the “union” methods, respectively. Dotted lines show the thresholds for statistical significance, with \(p'\) being the Bonferroni-corrected value

Fig. 4

Fisher’s exact test for discriminating case and control exonic de novo variants by using gene scores from various period networks. From left to right: missense, LoF, and indel. Each period has three bars (color coded from light to dark) corresponding to the co-expression network, and merged networks with PPI using the “intersection” and the “union” methods, respectively. Dotted lines show the thresholds for statistical significance, with \(p'\) being the Bonferroni-corrected value

Fig. 5

Fisher’s exact test for discriminating case and control exonic de novo variants by using gene scores from various region/period combination networks. From left to right: missense, LoF, and indel. Each combination has three bars (color coded from light to dark) corresponding to the co-expression network, and merged networks with PPI using the “intersection” and the “union” methods, respectively. Dotted lines show the thresholds for statistical significance, with \(p'\) being the Bonferroni-corrected value

Assessing clinical significance

Recent guidelines on variant interpretation in the clinic allow for the use of computational models (Richards et al. 2015) with concrete likelihood ratio values proposed by Tavtigian et al. (2018). While these recommendations are relatively new and, for now, mostly apply to known Mendelian genes, the numerical values of likelihood ratios on the strength of clinical support can be seen as a form of guidance. In this light, we computed positive likelihood ratios (Eq. 4; “Clinical significance of variant scoring”) for our method when the decision threshold for the raw scores of the predictor was set at the level of the top 10th (and top 5th) percentile of the empirical null distribution defined by the scores from control variants. For easier interpretation, we also report the estimates of the diagnostic odds ratio (DOR). Areas under the ROC curve are reported in Supplementary Table S5.
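For orientation, the two quantities can be computed from a 2x2 table of thresholded variants. The sketch below uses the textbook definitions, \(\text {LR}^+ = \text {sensitivity}/(1-\text {specificity})\) and \(\text {DOR} = \text {LR}^+/\text {LR}^-\); it is an assumption that these match Eq. (4), which may include corrections not reproduced here. The function names and counts are hypothetical.

```python
def positive_likelihood_ratio(tp, fn, fp, tn):
    """LR+ = sensitivity / (1 - specificity), from integer counts."""
    if fp == 0:
        raise ValueError("LR+ is undefined when no control exceeds the threshold")
    return (tp * (fp + tn)) / (fp * (tp + fn))


def diagnostic_odds_ratio(tp, fn, fp, tn):
    """DOR = LR+ / LR-, which simplifies to (tp * tn) / (fn * fp)."""
    if fn == 0 or fp == 0:
        raise ValueError("DOR is undefined when fn or fp is zero")
    return (tp * tn) / (fn * fp)


# Hypothetical counts (illustration only, not study data):
# tp/fn = case variants above/below the threshold,
# fp/tn = control variants above/below the threshold.
tp, fn, fp, tn = 30, 70, 10, 90
lr_plus = positive_likelihood_ratio(tp, fn, fp, tn)
dor = diagnostic_odds_ratio(tp, fn, fp, tn)
print(lr_plus, round(dor, 3))
```

When the threshold is set at the top 10th percentile of control scores, the false-positive rate is fixed near 0.1 by construction, so in this simplified form \(\text {LR}^+\) is roughly ten times the fraction of case variants exceeding the threshold.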

The averaged and maximum values of \(\text {LR}^+\) and DOR are shown in Table 4. These results indicate that the expected increase in odds of pathogenicity is around 1.5 for missense variants, around 4.5 for loss-of-function variants, and around 2 for indels, with their maximum values being higher, depending on the best-performing network. Following Tavtigian et al. (2018), we can classify the increase in odds of pathogenicity as informative for clinical decision-making. We observe, however, that \(\text {LR}^+\) values do not completely reflect the results from “Gene scoring improves discriminatory power of highly scored variants” because of the discrepancy in the number of variants from each group. That is, we found useful statistical signal for the missense variants but their diagnostic value is the lowest, whereas the small dataset size of indel variation resulted in a loss of statistical signal despite a generally informative diagnostic value. The most trustworthy results in interpreting variation are therefore provided by our scoring of the LoF variants, where both statistical significance and moderate diagnostic signal were found.

Table 4 Aggregated positive likelihood ratio (\(\text {LR}^+\)) and diagnostic odds ratio (\(\text {DOR}\)) over the BrainSpan datasets

Experimental validation

To validate the functional effect of missense mutations predicted by our ASD variant effect predictor, we selected one highly ranked mutation (Supplementary Table S4) and investigated its impact on protein–protein interactions using co-immunoprecipitation. We selected the mutation NP_001243143.1:p.Phe309Ser in the sodium/potassium-transporting ATPase ATP1A3 gene, which had a score of 0.92 and was annotated by our model as “altered PPI hotspot”. This mutation was initially identified by exome sequencing of ASD families (De Rubeis et al. 2014). The ATP1A3 gene carries several more de novo missense mutations from other ASD or developmental delay sequencing studies (Kong et al. 2012; Deciphering Developmental Disorders Study 2015; Takata et al. 2018); however, the pathogenicity of F309S or other mutations in ATP1A3 is unknown.

We tested the interaction of the wild-type (WT) and F309S mutant (MT) ATP1A3 proteins with three interacting protein partners, TOMM22, VDAC1 and TGF\(\beta \)1 (Fig. 6B). All these PPIs had high-confidence interaction scores in the BioGRID database. To investigate the impact of the mutation on PPIs with these three partners, we tagged the WT and MT forms of ATP1A3 with the V5 tag, and the interacting partners with the GFP tag (“Experimental validation”). We then co-transfected the WT or MT form with one of its partners into HEK 293T cells, and assessed the strength of interaction by performing a V5 immunoprecipitation and blotting against GFP.

We observed a reduction in the interaction of the F309S MT form of ATP1A3 with all three partners compared to the WT ATP1A3 (Fig. 6A). We did not observe a significant reduction in the expression of MT ATP1A3 or its partners after transfection, as evident from full lysate inputs before the immunoprecipitation (Fig. 6A). Thus, the observed reduction in interaction strength is not due to lower expression levels of the MT protein or the partners. The reduction of the interaction was around \(50\%\) for all interacting partners (Fig. 6C). These results suggest that the F309S mutation weakens or disrupts the interaction of ATP1A3 with its partners, in agreement with our prediction. In addition to ASD, heterozygous mutations in the ATP1A3 gene are also implicated in other neurological disorders including alternating hemiplegia of childhood 2 [OMIM 614820], CAPOS syndrome [OMIM 601338], and dystonia-12 [OMIM 128235]. This example demonstrates the utility of our combined gene- and variant-scoring model for formulating and validating testable hypotheses with regard to the functional impact of missense mutations in ASD.

Fig. 6

A Representative images of Western Blot for ATP1A3 interaction with its selected partners. Numbers below the anti-GFP blot represent the percentage of densitometry intensity for each mutant partner compared to its WT counterpart. B Table representing different relevant parameters for ATP1A3 and the selected partners. C Graph representing the percentage of interaction for each partner, comparing the F309S mutant against its WT counterpart (\(n = 3\), paired ratio t-test, *\(p < 0.05\))

Discussion

As a result of whole-exome and whole-genome sequencing of affected families, the number of genes and variants potentially implicated in complex neurodevelopmental diseases will continue to grow. It is therefore increasingly important to be able to interpret the significance of the newly found variants in the disease context and identify molecular mechanisms, in the form of specific alterations of molecular function (Rost et al. 2016; Lugo-Martinez et al. 2016), underlying the development of the phenotype. In this work, we proposed a probabilistic framework to prioritize exonic de novo variation and evaluated the usefulness of gene- and variant-scoring methods to discriminate between individuals with and without ASD. We found that the higher-resolution systems data was beneficial to prioritization; i.e., that brain region-specific and developmental period-specific gene co-expression networks provide a valuable source of information for prioritizing ASD genes. We have also shown that gene scoring based on a network propagation method using a combination of co-expression and protein–protein interaction networks outperforms each single source of information, suggesting complementarity between the two types of data. Furthermore, co-expression networks focusing on particular brain regions and developmental periods, especially with inclusion of fetal and early postnatal brain development, were more powerful in scoring ASD genes than general tissue-specific networks, including adult brain networks from GTEx.

A novel aspect of this study is that formal integration of gene and variant scoring was based on formulating the inference in a positive-unlabeled setting through which we were able to convert general-purpose variant- and gene-scoring methods into a disease-specific variant interpretation method. We have shown that the final variant-level scores were accurate in distinguishing high-risk exonic de novo variants of all types between case and control individuals. A combination of these scoring methods is advantageous for predicting molecular mechanisms of pathogenicity in ASD risk genes and, more broadly, offers a probabilistic model for comparing the impact of multiple variants in an individual’s genome. Although we have only evaluated the impact of exonic de novo variants in the context of a pre-specified phenotype, we anticipate that this formulation has the potential to be incorporated into polygenic risk scoring schemes that combine common and rare variation (Torkamani et al. 2018).

While our results are generally positive, there are also limitations that merit discussion. First, from a technical perspective, the calibration method used in this work depends on the ability of the underlying methods to estimate prior and posterior distributions in the positive-unlabeled setting. Both problems remain open and actively researched in machine learning (Zeiberg et al. 2020; Kiryo et al. 2017). Second, our probabilistic framework for scoring variants in a disease-specific context relied on simplifying assumptions; e.g., any variant that disrupts gene function in a known disease gene was automatically assumed to be disease-causing. This limitation will be difficult to overcome until such a time as gene function can be predicted at the level of protein domains or, optimistically, in a residue-specific manner for all aspects of protein function. Third, we relied on the MutPred family of tools to capture disease-causing variants based on our familiarity with these models, their good performance, and their ability to predict specific types of functional alteration. These tools, however, have not been benchmarked against others in this project and thus a higher performance may be achievable. Fourth, our main evaluation strategies were based on our ability to discriminate cases and controls for high-scoring variants. We have selected this evaluation because only a small fraction of cases may be caused by exonic de novo variation. Since this fraction is unknown and difficult to estimate, we could not apply the available correction strategies to give robust performance estimates (Jain et al. 2017; Ramola et al. 2019). Consideration of the top 5–10% of high-scoring variants to evaluate the accuracy of our models was simply pragmatic. Finally, as recent studies have demonstrated (Farahbod and Pavlidis 2020), bulk gene expression data could be confounded by cell type-specific gene expression signals. Thus, using single-cell transcriptomic data could be beneficial in the context of the current study, although it is not yet available across multiple brain developmental periods and regions.

The probabilistic framework and findings from this study have led to encouraging results in prioritizing de novo variation in a disease-specific context. They have also led to candidates that could be experimentally evaluated and thus contribute to the knowledge of complex neurodevelopmental disorders. The F309S variant in ATP1A3 is one such example: it significantly reduces protein–protein interaction propensity and is therefore a candidate for further studies of causality.