Introduction

Posttraumatic stress disorder (PTSD) is among the most common and debilitating major psychiatric disorders, with an overall lifetime prevalence rate of 7.8% in the United States [1]. While PTSD is approximately twice as common in women, prevalence rates correlate with trauma exposure rates. Prospective studies indicate that the majority of trauma survivors experience the cardinal symptoms of re-experiencing, avoidance, and hyperarousal immediately following trauma. While these symptoms abate and eventually disappear for most patients [2], symptoms persist and develop into syndromal PTSD in a significant minority.

The most significant comorbidity for PTSD is major depressive disorder (MDD) [3]. Together, PTSD and MDD following trauma are associated with additional comorbidities (e.g., substance and alcohol abuse, panic disorder) and mortality (suicide, medical comorbidity, reduced life expectancy) as well as disability in daily activities, reduced vocational productivity, and increased health care utilization [4,5,6]. The identification of individuals at high risk to develop PTSD in the immediate aftermath of trauma is crucial to design interventions that are effective in preventing the development of PTSD and its associated disease burden. Furthermore, understanding the acute biology of the traumatized individual may lead to novel interventions, targeting stress dysregulation, fear, and trauma-memory consolidation processes that seem to be critical in the pathophysiology of PTSD and MDD following trauma exposure.

Although epidemiological studies indicate that roughly 70% of the whole population will experience a traumatic event in their lifetimes, only a minority will develop PTSD even after repeated exposures, pointing to a fundamental question which trauma survivors will eventually experience PTSD. Therefore, the development of a clinically useful metric is critical, allowing us to predict who will suffer from PTSD in a prospectively defined traumatized population using a combination of biological, epidemiological, and psychological variables. Considering the nearly universal exposure to traumatic events and the worldwide public health importance of PTSD, the determination of biological risk factors is of utmost importance for both civilian and military populations.

Several risk factors have been identified for developing stress-related disorders following trauma, including: (i) female sex [7, 8], (ii) pre-trauma personal and familial psychopathology [9,10,11], (iii) prior trauma exposure, including child abuse and neglect [12, 13], (iv) trauma severity, (v) perceived life threat and peritraumatic emotional response [14], (vi) lack of social support [15], and (vii) reduced hippocampal size [16, 17]. Differential risk that determines which patients develop PTSD is in part genetic, with at least 30–40% risk heritability for PTSD following trauma [18,19,20,21,22,23]. Similar to many medical disorders such as cancer, diabetes, and metabolism, PTSD and MDD are complex diseases in which gene–environment interactions may contribute to vulnerability [24,25,26,27,28,29,30,31]. While early studies have suggested that gene–environment interactions are important in PTSD, these data, however, are merely suggestive, indicating that definitive data have yet to be established [12, 29,30,31,32]. Although identified alleles have been found in a number of PTSD- and depression-related genes, genome-wide association studies are still early in predicting risk [33, 34].

Furthermore, the determination of dynamic differential gene expression pathways following trauma may provide more robust effects than heritable patterns of differential genetic variants for determining risk [35, 36]. Combining these different approaches—transcriptional and genetic—may be particularly powerful, such that genetic variants that modify brain gene expression may also influence risk for human diseases. In fact, studying the transcriptome as a quantitative intermediate phenotype may have greater power for discovering risk SNPs influencing expression than the use of discrete diagnostic categories. Such work suggests that using transcriptional variation may lead to causal pathway identification, and that genetic-expression effects may be useful in determining the underlying biology of associations with common diseases. One example of progress in this area is in degenerative disorders and Alzheimer’s disease, as we previously examined genotype–transcript relationships via expression quantitative trait locus (eQTL) analysis [37,38,39] and constructed regulatory networks using Alzheimer’s samples including both transcriptomics [40, 41] and proteomics [42, 43]. In particular, the analysis of genetic risk provided robust and reproducible targets, driving known brain pathogenesis [42]. The goal of the current study was to apply similar genetic–transcriptional interaction approaches to PTSD and depression.

Here, we apply a computational pipeline integrating and analyzing transcriptomic, genomic, and interactomic data to predict which trauma victim survivors may develop PTSD. In particular, we utilize a novel approach to maximize utility of blood-based transcriptomics to understand development of risk vs. resilience in the months following trauma exposure. In a two-site longitudinal, observational study we identified subjects, obtained blood samples in the immediate aftermath of trauma, and determined RNA expression data at the time of trauma and symptom levels prospectively at a 6-month follow-up timepoint. To maximize our power and understanding of transcriptional pathways predicting risk vs. resilience, we utilized gene expression sets of individuals with PTSD and/or MDD at 6 months and control individuals without PTSD and/or MDD at 6 months. To capture fluctuations of gene expression in the underlying disease cases, we applied a gene set-cover approach, allowing us to find representative sets of dysregulated genes. Linking such genes to candidate causal genes through an eQTL analysis, we determined the significance of each such link by modeling information flow through a network of molecular interactions, pointing to causal pathways as well as genes that drive the underlying disease phenotype.

Using such an unbiased genotype–transcript interaction analysis we hypothesize that our approach will identify transcriptional pathways predicting risk vs. resilience in subjects with symptoms of PTSD and/or MDD, pointing to potential causal pathways for further development. In fact, as outlined below, a number of our identified causal pathways have been previously implicated in the biology or genetics of PTSD and MDD [44,45,46,47] as well as in preclinical models of amygdala function and fear regulation [48, 49]. Our approach provides a promising set of pathways that may underlie causal mechanisms in the development of PTSD or MDD in the aftermath of trauma.

Results

Identification of dysregulated genes in PTSD and depression

Our main objective was the integration of disparate gene expression and genomics data of patients that suffer from significant symptoms of depression, PTSD and/or both after trauma exposure with molecular interaction data. Using a set-cover approach [50, 51], we utilized gene expression sets of patients with criteria PTSD or depression symptoms (“cases”) and non-disease controls to select a set of dysregulated genes. Normalizing the gene expression values in every patient as a Z-score, where we considered mean and standard deviation of gene expression values in the non-disease controls, we defined that a gene is differentially expressed and “covers” a given disease case if the normalized gene expression value of the gene in question has a p value <0.01 using a Z-test in a given disease sample. Such disease case-specific tests allowed us to better capture fluctuations of gene expression in the disease cases that would have been otherwise missed with approaches that simply consider expression of a gene in all samples at the same time. Assuming that coverage refers to the number of differentially genes in a given patient case, small coverage of patient samples points to most commonly differentially expressed genes across patient samples. In turn, increasing coverage captures genes that are specific to smaller subgroups of affected patients. Investigating different choices of parameters, we observed that the size of gene sets increased with growing coverage, but were remarkably robust with increasing numbers of outliers (Supplementary Fig. 1). To account for such fluctuations and provide a representative, small set (≤150) of dysregulated genes that represent most commonly expressed genes in the underlying disease cases, we required that all but one case was covered by at least five differentially expressed genes (see Methods for details, Fig. 1a). In particular, we obtained 138 dysregulated genes in depression (BDI), 144 genes in PTSD (PSS), and 150 genes in PTSD and/or depression (PSS×BDI) (Supplementary Table 1). The Venn diagram in Fig. 1b indicates considerable overlap of dysregulated gene sets.

Fig. 1: Determination of dysregulated genes in PTSD.
figure 1

a Outlining our approach to select dysregulated genes in a toy model, links indicate genes that are differentially expressed in a given patient, “covering” a disease case. Specifically, we stipulate that each disease case needs to be covered by at least two differentially expressed genes. Furthermore, we allow that one outlier case does not need to meet this demand. Solving such a problem with a set-cover approach (see Methods) we obtain a set of two dysregulated genes (dashed ellipse). b Utilizing a threshold of p < 0.01 (Z-test) to determine if a gene covers a patient and demanding that each patient case needs to be covered by five differentially expressed genes and allowing one outlier, we found 138 genes in depression (BDI), 144 genes in PTSD (PSS), and 150 genes in PTSD and/or depression (PSS×BDI). Notably, the Venn diagram indicates some overlap of such sets of dysregulated sets.

Determination of candidate causal genes through an eQTL analysis

In the next step, we determined genome-wide associations between expression levels of dysregulated genes and single nucleotide variations that point to candidate genes that are potentially causal for the observed dysregulation. In particular, we utilized Matrix eQTL [52], applying an additive model to find variations that were correlated with the expression of dysregulated genes (Fig. 2a). In particular, we considered trans-eQTLs on a p < 10–3 level, indicating loci that were placed outside the local genomic boundaries around coding regions of the underlying dysregulated genes. To translate such trans genomic loci to corresponding candidate causal genes of a dysregulated gene, we identified all genes that harbor these variations in their coding regions. As a consequence, we found 116,604 such links between 138 dysregulated genes and a total of 10,640 candidate causal genes in depression, while we obtained 113,238 links between 144 dysregulated genes and 10,535 candidate causal genes in PTSD and 101,995 such links between 150 dysregulated genes and 10,433 candidate causal genes in depression and/or PTSD. Such numbers translate to on average of roughly 900 candidate causal genes that are connected to a single dysregulated gene and allow us to strike a balance between finding large enough sets of candidate causal gene and avoiding associations of dysregulated genes to a large majority of remaining genes (Supplementary Fig. 2).

Fig. 2: Determination of sets of causal genes.
figure 2

a We determine genome-wide associations between expression levels of dysregulated genes and single nucleotide variations. Accounting for genomic characteristics in the (non-)coding gene region, we assume that corresponding genes may be causal candidates for the dysregulation of the underlying disease gene. b Modeling an electric circuit that connects a dysregulated gene with its candidate causal genes through a network of molecular interactions we use expression correlation levels of interacting proteins as resistance. Specifically, such a step allows us to find candidate causal genes that receive most current, indicating increased causality. In c we compare currents that were received by causal candidates using expression values in the disease cases and controls, pointing to strong correlations in all disease states. d Determining the difference of currents that candidate causal genes received in the disease cases and controls, we found normal distributions in all disease states. Such an observation indicates that candidate causal genes can either receive less or more current in the disease compared to the control cases. In e, we annotated each dysregulated gene with causal genes that showed significant differences in received current in disease and control cases using a Z-test (FDR < 0.01). Notably, we observed that a small minority of genes exist that are significantly causal for a large number of dysregulated genes and vice versa in all disease states.

Determination of causal genes for each dysregulated gene

As the previous step allowed us to find a set of candidate causal genes for each dysregulated gene, we determined the significance of such links by modeling information flow from dysregulated to causal genes by a variant of a circuit flow algorithm. In particular, this step allowed us to identify candidate causal genes that showed significantly different characteristics when we compared disease to control cases. Our circuit flow algorithm is based on the well-known analogy between random walks and electric networks where the amount of current entering a node or an edge is proportional to the expected number of times a random walker will visit the node or edge. Specifically, we constructed a network of molecular interactions by utilizing the Pathway Commons database [53] that collects interaction data from a variety of human protein–protein interactions and pathway databases. In particular, such interaction data allowed us to consider undirected interactions such as protein–protein and protein complex interactions as well as directed interactions including transcription factor–target and kinase–substrate interactions. As a result, we obtained 753,239 interactions between 12,614 human proteins. Considering gene expression correlation as a proxy of resistance, we fed a unit of electrical current to a dysregulated gene and simulated the flow of electric current to the corresponding candidate causal genes through the molecular interaction network. Thus, we ensured that non-correlated nodes posed considerable resistance to the current flow. Specifically, such a step allowed us to find candidate causal genes that received most current, indicating increased causality (Fig. 2b). Comparing current values in the disease and control cases, we largely observed strong correlations in all disease states (Fig. 2c), suggesting that most pairs of dysregulated and candidate causal genes shared similar current flows in disease cases and controls. Considering the difference of such currents that a candidate causal gene received from a dysregulated gene when we compared disease and non-disease patients, we observed normal distributions in all disease states (Fig. 2d), suggesting that a small minority of pairs of dysregulated and causal genes showed large differences. To assess the statistical significance of the observed difference, we randomly sampled current values from the empirical distributions of all pairs of dysregulated and causal candidate genes in the disease and controls and considered the flow difference of a pair of dysregulated and causal genes to be significant if the corrected p value from a Z-test was FDR < 0.01 [54]. Notably, we obtained 11,224 links between 137 dysregulated genes and 3670 causal genes in depression, 9645 links between 144 dysregulated genes and 3278 causal genes in PTSD, and 9022 links between 150 dysregulated genes and 3100 causal genes in PTSD and/or depression. Determining the number of dysregulated genes that were connected to a single causal gene, we observed a small minority of causal genes that were significantly causal for a large number of dysregulated genes and vice versa in all disease states (Fig. 2e and Supplementary Table 2).

Driver causal genes for PTSD and depression symptoms

As the previous step allowed us to annotate each dysregulated gene with a set of causal genes that showed significant differences to current flow when we compared the disease to the control cases, we aimed to find a subset of causal genes that potentially drove the underlying dysregulation of genes. In particular, we focused on the identification of a subset that explained a majority of disease cases, by defining that a causal gene explained a disease case if there was a nonempty set of corresponding dysregulated genes that were differentially expressed in the given case. In other words, we constructed a bipartite graph, where links between causal genes and disease cases were weighted by the number of corresponding differentially expressed dysregulated genes in the given disease case. We aimed to explain the majority of disease cases (allowing a few outliers) with a minimum number of causal genes, where we defined a patient’s case as explained if the total weight covering the case exceeded a certain threshold. Specifically, we applied a greedy approach to determine a subset of “driver” causal genes [50, 51] (see Methods for details, Fig. 3a). While the choice of the number of outliers hardly influenced results, we observed that the obtained sets of driver causal genes remained unchanged when we demanded that each case was explained by at least ten dysregulated genes. In particular, we obtained 11 driver causal genes in depression cases, 13 genes in PTSD, and 22 genes in PTSD and/or depression. In the Venn diagram in Fig. 3b, we observed that all disease states shared ESR1, RUNX1, PPARA, and WWOX as driver causal genes. Considering links between causal and dysregulated genes, we determined the difference of electric current that arrived at the corresponding dysregulated gene when we compared disease and control cases. As a gene can be causal for many different dysregulated genes, we counted the number of times where the corresponding current difference was positive or negative in the disease cases. In Fig. 3c, we observed that most driver causal genes sent more current to corresponding dysregulated genes in the disease cases in all disease states (Supplementary Table 3A–C).

Fig. 3: Determination of causal driver genes.
figure 3

a We aim to identify a minimum subset of causal “driver” genes that covers a maximum number of disease cases as our approach provides sets of causal genes for each dysregulated gene. In particular, we define that a gene “covers” a disease case if the gene was differentially expressed (p < 0.01, Z-test) in the underlying patient case. In our toy model, we demand that each disease case is covered by at least two differentially expressed genes allowing one outlier case, pointing to a set of two driver causal genes (dashed circles). b Solving such a weighted multi-set cover problem with a greedy algorithm, we found 11 driver genes in depression cases (BDI), 13 in PTSD cases (PSS), and 22 genes in PTSD and/or depression (PSS×BDI). Notably, the Venn diagram indicates some overlap of such driver gene sets. c Each link between a causal gene and a dysregulated gene is annotated with a current difference when we compare the disease cases to controls. Assuming that a gene can be causal for many different dysregulated genes, we determined the number of times corresponding differences were negative or positive in the disease cases. In all disease states, we observed that driver causal genes were mostly connected to dysregulated genes that received more current in the disease cases in all disease states.

As a corollary of finding driver causal genes, we determined causal paths from driver genes to corresponding dysregulated genes. In particular, we found 343 paths in depression, 458 paths in PTSD, and 409 paths in depression and/or PTSD (Supplementary Table 4A–C). Furthermore, we counted the number of times a gene appeared in the paths (Supplementary Table 5A–C), suggesting that prominent genes such as MYC and TP53 tend to appear frequently.

Discussion

There is a critical need to identify early biomarkers for risk vs. resilience in the aftermath of trauma. Such biomarkers would both provide utility in predicting those most at risk and in need of early interventions, and identify novel and potentially robust mechanisms for targeting biological processes to prevent the development of PTSD, depression, and other pathological outcomes of psychological trauma exposure.

The current study examined peripheral blood RNA expression from the time of trauma and 6-month follow-up symptoms from 297 individuals enrolled from two large Level I trauma centers in Atlanta and Miami. We utilized gene expression sets of PTSD or MDD patients and non-PTSD or depressed controls, applying a gene set-cover approach. In particular, we expected that gene expression profiles of patients with complex diseases were not clear cut and indicated few genes that generally differed in disease compared to controls. As a consequence, we used raw p values to find differentially expressed genes in each patient case to not only capture fluctuations of gene expression in the underlying disease cases through the set-cover approach but also to broaden the spectrum of potential disease-relevant genes.

To find genes that were potentially causal for the observed dysregulation of genes in the patient cases, we determined eQTL associations between their expression levels and single nucleotide variations. Note, that this step dealt with raw p values as well, as its purpose is to annotated each dysregulated gene with a relatively large portfolio of genes that may significantly influence the observed dysregulation. We made the explicit assumption that causal genes harbor genomic alterations that were associated with the expression of dysregulated genes, allowing us to utilize genes in these genomic regions as causal candidates.

To establish significance that a candidate gene is indeed causal for a dysregulated gene, we modeled the propagation of information from such candidate causal genes to a disease gene as the flow of electric current through a network of molecular interactions. In particular, we assumed that causality for the observed dysregulation of a given disease gene is not merely the consequence of an individual eQTL. Instead, we surmised that a set of candidate genes can be causal for the observed dysregulation of a given disease gene that are embedded in a network of molecular interactions as reflected by high levels of electric current as a measure for their causal importance. Weighting interactions according to their gene expression as a proxy of electric resistance, we determined genes that were causal for the observed dysregulation if the flow of information into a gene significantly differed when we compared disease and control cases. While we used lenient p value cut-offs in the previous steps to better capture gene fluctuations and broaden the portfolio of candidate causal genes, we established final sets of causal genes for each disease gene through statistically significant differences of received currents comparing disease and control cases with an adjusted p value threshold to curb spurious signals and obtain relatively small causal gene sets.

Annotating each dysregulated gene with a final set of causal genes, we further determined a subset of causal genes that mostly drove the underlying disease phenotype. Despite using largely lenient thresholds to find sets of dysregulated genes and eQTLs, consecutive steps to determine causal genes through a current flow and set-cover approach allowed us to obtain robust, meaningful, and statistically significant results as shown in our previous work on this topic [50, 51]. Furthermore, the flow of information also indicated paths through the network of molecular interactions that connect dysregulated and causal driver genes. As a consequence, our pipeline points us to molecular key players and novel paths in the underlying disease states, results that otherwise would be inaccessible through simple analysis of differentially expressed genes and enrichment of predefined, known molecular pathways and gene sets that pertain to cellular and molecular functions.

Utilizing this novel approach, we obtained a final list of 11 driver causal genes in depression cases, 13 genes in PTSD, and 22 genes in PTSD and/or depression. We observed that all disease states shared ESR1, RUNX1, PPARA, and WWOX as driver causal genes. In addition, a number of other gene pathways appeared to be driver causal genes in the PTSD only or MDD only cases. Because previous work has found increased power and unique risk factors when considering PTSD and depression together as outcomes [55], we analyzed these two primary outcomes both together and separately. A number of these identified causal pathways have been previously implicated in the biology or genetics of PTSD and depression, as well as in preclinical models of amygdala function and fear regulation. A summary of some of the top genes implicated in the driver causal gene analyses from Fig. 3 is listed in Table 1 with additional information as to their function and possible relationship to PTSD and depression outcomes following trauma. Below, we focus in more detail on three of the gene pathways (ESR1, RUNX, and PPARA) that are shared by both PTSD and MDD symptoms, as they all are supported by robust pre-existing literature suggesting that they may be particularly interesting targets for further understanding the development of trauma-related risk vs. resilience.

Table 1 Selected driver causal genes associated with PTSD and depression.

A leading causal driver gene pathway was marked by the estrogen receptor (ESR1) network. This observation is very interesting because a well-replicated finding across numerous epidemiological studies is that females have an approximately two-fold increased risk for PTSD, depression, and other stress-related disorders compared to males. Understanding the biological mechanisms of this differential risk is of critical importance, and many of these findings have been thought to be related to estrogen regulation in females. Increasing evidence suggests that a number of cellular and molecular pathways are regulated by both stress and estrogen modulation and may provide an important window into understanding mechanisms of sex differences in the stress response, and one of the best understood neural pathways related to stress effects has recently been shown to be estrogen receptor mediated [56, 57]. Specifically related to PTSD, estrogen has been associated with differential levels of fear responding and retention of fear extinction [44, 45, 58, 59]—critical intermediate phenotypes of PTSD and other stress-related disorders. Additional recent work has demonstrated estrogen regulation of epigenetic processes that underlie fear acquisition and possibly PTSD following trauma exposure [46]. While estrogen receptor effects are most commonly associated with females, the conversion of testosterone to estrogen in the brain via steroid reductase enzymes indicates the relevance of this process for both sexes [60]. In fact, estrogen and estrogen-related genetic effects have been identified in males in PTSD as well as animal models of stress [58, 61, 62]. Finally, the role of estrogen regulation and ESR1-related genomics is also appreciated as likely critical for premenstrual dysphoric disorder, premenopausal depression, and postnatal depression [47, 63], which may be closely related to sex-specific differences in trauma-related PTSD and depression examined here.

While primarily associated with epithelial cell biology and blood formation [64, 65], RUNX1 is also clearly involved in nervous system development [66] and a critical mediator of both Wnt/Beta-catenin pathways and Notch pathway regulation of cell function and development. Our labs and others have demonstrated critical roles for both of these pathways in fear and extinction processing, suggesting that there may be an important role for RUNX1 causal gene driver function in a host of stress- and trauma-related pathways, including regulation of fear formation in the aftermath of trauma exposure. Previous work has shown that during fear memory formation in the amygdala, many Wnt-signaling genes are dynamically regulated during threat memory consolidation. Furthermore, Wnt modulators prevented long-term fear memory consolidation without affecting short-term memory [49]. Additional work has shown that another RUNX-regulated pathway, Beta-catenin modulation, is also critically involved in fear memory processing [49, 67]. Such observations suggest that dynamic modulation of Wnt/β-catenin signaling, perhaps via RUNX1 regulation, during fear and trauma-memory consolidation is critical for long-term memory formation. RUNX1 has also been shown to regulate Notch signaling during development. Similar to Wnt/B-catenin, Notch has been implicated to causally regulate synaptic plasticity in the adult amygdala [68], suggesting that Notch regulation is important for the consolidation of fear and trauma-related memory. Together, this prior work expands the hypothesis that developmental molecules, such as the RUNX1-regulated Wnt, B-catenin, and Notch signaling pathways, have roles in adult behavior and trauma-memory formation, and that existing interventions targeting them hold promise for treating trauma-related neuropsychiatric disorders.

PPAR-alpha is a member of the peroxisome proliferator-activated receptor (PPAR) family. Studies in different fields have illustrated that PPARs are nuclear receptors that participate in a variety of metabolic functioning, along with cell development and cell growth [69]. Furthermore, recent data suggest that PPARs also play other roles in inflammation, neuropsychiatric diseases, and cancer. PPAR may be an important target for neuroprotection in stroke and neurodegenerative diseases such as Alzheimer’s disease [70]. Preclinical work in mouse models of neurodegeneration, stress, and memory dysfunction has suggested that PPARa may be an important substrate that mediates neuronal health and plasticity. Specifically, very recent data have suggested that PPAR-mediated cellular autophagy reduces Alzheimer’s-related neuropathology and improves cognitive decline [71]. Additional work indicates that PPARa is a critical target mediating the effects of both aspirin and HMG-CoA reductase inhibitors in improving neural plasticity and memory formation in mouse models [72, 73]. Finally, recent work implied that PPARa knockout mice show enhanced fear learning in the passive avoidance test [48]. Together, these data suggest that PPARa may be a causal gene pathway driving networks related to resilience, neuronal plasticity, and cognitive functioning in the aftermath of trauma exposure.

Limitations to this work include a relatively limited sample size for this unique type of cohort with combined “omic” datasets, lack of a well-powered replication cohort, whole-blood level RNAseq analyses, transdiagnostic approach including PTSD and/or MDD affected individuals, inclusion of subjects based on symptom severity across diagnoses and not limited to validated DSM diagnostic groups or utilization of gold-standard diagnostic interviews, and insufficient power for sex- or race/ethnic-specific stratification of analyses. Furthermore, examining a number of other potential differences at baseline, including childhood trauma exposure, differential medication history, and other possible pre-trauma confounds, was outside of the scope of the current study. Despite these limitations, there were a number of strengths including novel analytic approaches combining genotype–transcript interactome approaches with causal network analyses, multisite data collection, broad inclusion criteria, rapid enrollment in the post-trauma period in EDs with long-term follow-up data collection, and a transdiagnostic, intermediate phenotype approach to gene discovery.

In conclusion, we have examined the integration of peripheral gene expression in the immediate aftermath of trauma with molecular interaction data, from patients who go on to suffer from depression, PTSD, and/or both 6 months following trauma. Utilizing a set-cover approach [50, 51], we utilized gene expression sets of patients and non-symptomatic controls to select a set of dysregulated genes. This approach leads to a final list of over 40 driver causal genes across depression, PTSD, or the combined phenotypes at the 6-month timeframe following trauma exposure. In addition to a number of interesting and well-known genes related to stress-regulation with individual phenotypes, we observed that all disease states shared ESR1, RUNX1, and PPARA as driver causal genes—all three of which have substantial prior association with pathways known to be involved in stress- and trauma-regulation, as well as with preclinical models of neuronal function and plasticity that underlies fear- and threat-processing. Future studies are required to both replicate and validate these specific findings. Furthermore, identifying whether these driver causal gene pathways will lead to particular predictive biomarkers remains to be determined. Finally, a further understanding of the biology of these gene networks, and how they may lead to novel insight into systemic and central nervous system pathology, and potential future therapeutic targets, will be critical toward the goal of understanding and predicting outcomes in the aftermath of trauma.

Methods

Diagnosis and definitions

Many different datatypes were collected within the cohort that included, but were not limited to, demographic data, syndromal psychiatric definitions, measures of depression, anxiety and PTSD symptoms, current and childhood trauma, immediate stress response, additional treatment inventory, and self-report medication/drug use as well as urine drug analysis. As two main groups, we classified our cohort into controls and positive cases. Controls never met criteria for either PTSD or MDD throughout the year of longitudinal follow-up. In turn, transitioners were classified as meeting criteria for at least two timepoints during the year-long follow-up. Note that throughout we are focused on subjects with significant symptoms of PTSD and/or MDD, utilizing this combined definition to capture patients who were most at risk for transdiagnostic symptoms in the aftermath of significant trauma exposure. Of specific interest to the work at hand are the following.

PTSD symptoms

The PTSD Symptom Scale, Interview Version (PSS-I) was employed to assess PTSD [74] symptoms that correspond to the 17 DSM-IV PTSD symptoms, each rated on a 0–3 scale. Specifically, alpha for the full scale PSS-I is 0.91, re-experiencing 0.78, avoidance 0.80, and arousal 0.82. One-month test-retest reliability is 0.80. Inter-rater reliability is excellent for both diagnosis (kappa = 0.91) and symptom severity (intraclass correlation = 0.97). The PSS-I is highly correlated with the CAPS, and has a short administration time (20 min) [75]. The PSS was administered during the follow-up appointments to assess PTSD symptoms specifically in response to the index, study-related trauma that precipitated ED admission.

Depression symptoms

The Beck Depression Inventory, Second Edition (BDI-II) was used to assess depression symptom severity [76]. The BDI-II is a 21-item self-report that is summed for a single score assessing the existence and severity of depression symptoms. Individuals were assessed regarding the past 2 weeks thus corresponding to DSM-IV timeframe criteria for major depression.

Trauma exposure

Standardized Trauma Interview (STI) [77] was used to assess the index trauma. The STI was modified for this study, optimized based on prior work [78,79,80] from this group, and used to gather information on history of past adult traumas as well as details of the current trauma, including intensity, level of fear, interpersonal components, feelings of helplessness, and horror. To assess childhood trauma, the Childhood Trauma Questionnaire (CTQ) was employed [81], which is a 28-item and psychometrically validated, self-report inventory that assesses child abuse and neglect.

Inclusion criteria

Our inclusion criteria included the following: trauma in the past 24 h, meeting DSM-IV diagnostic criterion A in which both of the following were present: the person experienced, witnessed, or was confronted with an event that involved actual or threatened death or serious injury, or a threat to the physical integrity of self or others. The person’s response involved intense fear, helplessness, or horror. For our transdiagnostic approach to symptom severity inclusion, at least two timepoints within a year of follow-ups where the PSS-I score was >21 were required to be classified as having likely PTSD. At least two timepoints where the BDI score was >18 were required to be classified as having likely depression.

Exclusion criteria

Inclusion and exclusion criteria were determined via self-report during the initial interview in the ED, examining past medical, and psychiatric history. Exclusion criteria were a history of mania, schizophrenia, other psychosis, current suicidal ideation or suicidal ideation in the last month, suicide attempt in the previous 3 months and current intoxication. Participants were excluded for respiratory distress or if they were medically unstable or hemodynamically compromised.

Demographics

Out of 297 subjects included in the genome and transcriptome analyses, there were 193 MDD and 189 PTSD controls while there were 104 MDD and 108 PTSD cases. Furthermore, we accounted for 133 cases with MDD and/or PTSD and 164 controls that had no indication of MDD and PTSD. Note that the total of controls and cases in these four groups is >297 because individuals could be in more than one category, e.g., low BDI and PSS symptoms counting as controls for both PTSD and MDD, or high BDI and PSS symptoms counting as cases in both categories. In more detail, there were 25 MDD cases without PTSD, 29 PTSD cases without MDD, and 79 cases that were diagnosed with both MDD and PTSD, pointing to a total of 133 cases with PTSD and/or MDD.

Out of the 297 individuals, 43% were female and 57% were male. The average age was 34.9 years (SD = 12.7). In all, 69.9% self-identified as black or African American, 24.4% identified as Caucasian, and 14% identified as Hispanic or Latino, while 22% were unemployed, >53% had some college experience, and >83% graduated from high school.

Samples

Blood samples were collected from living subjects at Emory University (Grady Memorial Hospital) and the University of Miami (Jackson Health Systems). Participants were enrolled in the emergency room an average of 5.3 h (0.5–12 h) after experiencing a DSM-IV-TR Criterion A trauma as previously described [82,83,84]. We screened a total of over 10,000 individuals, with final inclusion counts of 525 individuals from Emory and 177 from Miami. Diagnoses were collected over multiple timepoints up to a year of follow-up for most individuals for the DNA-RNA cohort. Out of the inclusion set of 702 samples, 366 individuals had both DNA genotypes and RNA profiling. After removing subjects whose DNA/RNA did not pass QC, who had missing demographics or outcomes data, a total of 297 subjects were included in the multi-omics profiling defined below. All study procedures were reviewed and approved by the Emory Institutional Review Board (IRB), the Grady Hospital Research Oversight Committee, and the University of Miami IRB.

Data collection

DNA and RNA collection and preparation

Venous blood samples were collected in EDTA tubes (DNA) and Tempus Tubes (RNA, Applied Biosystems) in the emergency department after trauma exposure (mean time elapsed between trauma and blood sampling, 201 min) by medical staff using standard techniques. Within 6 h of collection, tubes were centrifuged at 4 °C, aliquoted into 500 mL samples and frozen at 280 °C until time of assay. Tempus tubes were mixed and stored overnight, per manufacturer’s instructions at 4 °C prior to processing and freezing.

DNA was extracted from 200–500 μL of whole blood using the Genfind v2 kit (Beckman Coulter Genomics, Danvers MA), utilizing automation methods on the Biomek NX (Beckman Coulter Inc., Brea, CA). All DNA for genotyping was quantified using PicoGreen (Invitrogen Corp., Carlsbad, CA) and normalized to a concentration of 5–10 ng/μL. DNAs that fell below 5 ng/μL were not used in downstream applications. DNA samples were analyzed on the Genome-Wide Human PsychArray (Illumina Inc. San Diego, CA) according to the manufacturer’s protocols.

RNA samples were isolated from peripheral blood, mRNA libraries created using the TrueSeq sample preparation kit, and sequenced using sequenced using paired-end read procedures on the Illumina HiSeq 2000 (Illumina Inc. San Diego, CA). One round of redos was performed for six samples within the cohort.

DNA: data were normalized as before in [40, 42, 85]. Samples were checked for gender-discord, missingness, heterozygocity, and relatedness errors using PLINK v1.07 [86]. Samples with PIHAT scores >0.185 were excluded as in [87]. SNP calls were examined for missingness and Hardy–Weinberg equilibrium as well as non-random missingness using PLINK v1.07 [86]. Pedigree files were pruned using PLINK v1.07 to exclude SNPs with pairwise LD threshold of r2 > 0.5 using a sliding window of 50 SNPs and a shift of 5 SNPs at each step. Ancestry was checked via EIGENSOFT [88] and PC1, 2, and 3 were retained and used in downstream analysis. Plots of a principal component analysis of eigenvalues are shown in Supplementary Fig. 3.

Genotypes were phased with SHAPEIT v2.r790 [89], and missing genotypes were imputed with Impute2 v2.3.2 [90] using the reference panel from the 1000 Genomes Project Phase 3 [91]. Markers with high imputation quality (INFO > 0.5 [90]), minor allele frequency over 1% and Hardy–Weinberg p value <10–6 were retained for downstream analysis.

RNA: raw TruSeq counts were exported, and reads were aligned to GRCh37 using STAR v2.5.2b. [92]. Gene counts were computed with FeatureCounts v1.5.1 that were processed with DESeq2 1.12.4 [93]. Raw counts were normalized using the median-ratio method. An initial principal component analysis identified three additional low count outliers, which were removed. Principal component analysis of Log-transformed remaining counts indicated no strong batch effects (Supplementary Fig. 4). Furthermore, sample data were adjusted for several biological/environmental covariates (education, income, gender, living status, CTQ score, current job, employment, number of dependents, and age) and methodological covariates (site, number of visits, batch, number of times the PSS or BDI score was collected, and library quantification). Overall, the largest explanatory covariate was gender (Supplementary Fig. 5). Residuals from this correction were then used in downstream analysis.

Transcriptomic data

To correct for potential biological or technical confounders, the contribution of every covariate of transcriptional variability was characterized using the variancePartition method [94]. This method partitions the total variance into the contribution of each variable in the experimental design (e.g., age, sex, batch), plus the residual variance. The sample data were adjusted for several biological covariates (gender, age) and several methodological covariates (institute source of sample, post-mortem interval, detection and hybridization date). The residuals from this correction were then used in downstream analysis.

Determination of dysregulated genes using patient and control cases

We utilized gene expression sets of patients and non-disease controls to select sets of dysregulated genes. In particular, we normalized gene expression values of a gene i in a given patient’s case D by \(Z_i^D = \frac{{e_i^D - \mu ^C}}{{\sigma ^C}}\), where we considered mean μC and standard deviation σC of gene expression values in the set of controls C. Using a Z-test, we considered a gene differentially expressed, thereby “covering” a given disease case if the normalized gene expression value of the gene in question had a p value <0.01 in a given disease sample. Such disease case-specific tests allowed us to capture fluctuations of gene expression in the underlying disease cases. Providing a representative set of dysregulated genes in the underlying disease, we required a minimum level of coverage and demand that each gene covers as many disease cases as possible (Fig. 1a). Formulating this problem as a minimum multi-set cover problem, we defined a bipartite graph B(T, S) between genes T and disease cases S. We added edges between gene g and case s if gene g was differentially expressed in case s. We constructed a multi-set cover instance SC = {B(T, S),α, β} where α represented the number of times that a disease case needed to be covered, and β was the maximum number of outliers. In other words, all but β cases needed to be covered at least α times with differentially expressed genes in the output cover. Specifically, we solved this problem with a greedy algorithm that we introduced in [50, 51] (Fig. 1a).

Determination of candidate causal genes

To examine the genetic contribution to gene expression for our set of dysregulated disease genes and 1,049,614 alleles, we utilized an additive eQTL model provided by Matrix eQTL [52], where the first three principle components obtained via EIGENSOFT [88] were included to correct for any allelic effects due to ethnicity and race stratifications. To detect cis-eQTLs we used a threshold of ±1 Mb from the gene’s coding region. In turn, we determined trans-eQTLs when associated loci were placed outside these boundaries. Subsequently, we identified all genes that harbor such associated variations in their coding regions that we extended to 10 kb at the 3’ and 5’ end (Fig. 2a).

Current flow algorithm

As each dysregulated gene is associated with a set of candidate causal genes, we screened such candidate causal genes through a variant of a circuit flow algorithm that we introduced in [50, 51]. In particular, we modeled the problem of finding a pathway through a molecular interaction network from a dysregulated gene to its candidate causal genes as current flow in an electric circuit (Fig. 2b). In other words, we screened for causal genes that received significant amounts of electric current from their dysregulated gene. The circuit flow algorithm is based on the well-known analogy between random walks and electric networks where the amount of current entering a node or an edge is proportional to the expected number of times a random walker visits the node or edge. Let G = (N, E) represent a network where N is a set of genes and E is a set of molecular interactions. Let vector \(I = \left[ {I\left( e \right)\,\forall e \in E} \right]\) denote current passing through the edges, while vector \(V = \left[ {V\left( n \right)\,\forall n \in N} \right]\) holds variables of voltage at the nodes. Let C be the set of candidate causal genes. Vector \(X = \left[ {X\left( c \right)\,\forall c \in C} \right]\) holds the current leaving the candidate causal genes. For an edge e = (u, v) that connects genes u and v, we calculate the gene expression correlations pr(u, d) and pr(v, d) between both interacting genes and dysregulated gene d, where pr is Pearson’s correlation coefficient. Preliminarily, we define the conductance of edge e, w(e) as the mean of pr(u, d) and pr(v, d). Thus, we ensure that a single non-correlated node reduces but not completely interrupts the current flow, while a cluster of non-correlated nodes put a considerable resistance to the current flow. Ohm’s law is defined as

$$Id \times I + P \times V = 0$$

where Id is an |E|×|E| identity matrix, and O is a zero matrix. P is a |E|×|N| resistance matrix, and P(e, n) = w(e) if n = v, −w(e) if n = u, and 0 otherwise. Kirchhoff’s current law is

$$Q \times I + R \times X = T$$

where Q is a |N|×|E| matrix, and Q(n, e) = 1 if n = u, −1 if n = v, and 0 otherwise. R is an |N|×|C| matrix where R(n,c) = 1 if n = c, and 0 otherwise. T is an |N|×1 vector where T(n) = 1 if n is the dysregulated gene d, and 0 otherwise. Finally, we set the voltage of all genes in C to be 0 so that all current flows into the candidate genes, while there was no current flow between them, defined as

$$S \times V = 0$$

where S is a |S|×|N| matrix and S(c, n) = 1 if n = c, and 0 otherwise. Specifically, we solved this system of linear equations through a series of matrix operations. In particular, we calculated vector V that holds the electric current that was applied at a given dysregulated gene and ended up at the candidate causal genes.

The setup of such a linear system implicitly considers all interactions as undirected and stipulates that each interaction can have a regulatory effect on the expression of a dysregulated gene. However, accounted for the direction of molecular interactions such as transcription and phosphorylation events by implementing a simple heuristic. After solving the linear system, we removed edges that were used in the wrong direction. We repeated this procedure until no directed edge was used in the wrong direction [50, 51].

Statistical significance of causal genes

Because a dysregulated gene may have hundreds of causal genes the amount of current flowing to candidate genes from a dysregulated gene cannot be directly compared. Therefore, we empirically estimated a p value for each pair of a dysregulated and causal gene by calculating the difference of the current flow in the disease and control cases. We estimated the statistical significance of the current difference by randomly sampling current values from the empirical distributions of all pairs in the disease cases and controls. We considered the flow difference of a pair of dysregulated and causal gene significant if the corrected p value from a Z-test was FDR < 0.01 [54].

Determination of “driving” causal genes

Based on the profile of significant causal genes of each dysregulated genes, our approach focuses on the identification of a subset that explains a majority of disease cases. We defined that a causal gene ck explained a case si if there existed a nonempty set of corresponding dysregulated genes, D(ck, si), that are differentially expressed in case si. The weight between a causal gene and a case was defined as \(w\left( {c_k,s_i} \right) = \left| {D\left( {c_k,s_i} \right)} \right|\). A weighted bipartite graph WB(C, S) between a set of causal genes C and disease cases S was constructed by adding an edge between gene ck and case si if and only if gene ck explained a case si. In other words, we set a link between a causal gene and a case if there existed at least one dysregulated gene that was differentially expressed in the underlying disease case. In particular, we defined that a gene is differentially expressed in a given disease case if the normalized gene expression value of the gene in question had a p value <0.01 in a given disease sample. For a subset of candidate causal genes C0 and a case s, let W(C0, s) be the total number of dysregulated genes covering s by the genes in C0, \(W\left( {C_o,\,s} \right) = \left| {{\mathrm{U}}_{c \in C_0}D(c,s)} \right|\). We considered a patient’s case explained if the total weight covering the case exceeded a certain threshold and want to explain the majority of cases (allowing a few outliers) with a minimum number of causal genes. We previously formulated such a problem as a variant of a minimum weighted multi-set cover problem [50, 51]. Considering an instance \(WSC = \{ WB(C,\,S),\,\gamma ,\,\delta \}\) where WB(C, S) is a weighted bipartite graph between causal genes C and cases S, we want to choose a subset of genes C’ from C such that for each case s except δ cases, \(W(C{\prime},s) \ge \gamma\). Since a very simple version of the multi-set cover problem is NP-hard, we used a greedy approach to determine a subset of “driver” causal genes [50, 51].

Determination of causal paths

After we identified sets of causal genes that received significant amounts of current from a dysregulated gene, we determined causal paths that connect pairs of causal and disease genes in the disease cases. Based on our previous work [50, 51], we determined a maximum current path from dysregulated gene d to causal gene c, defined as a simple path \(P\,(d,\,c) = (d,\,g_1,\,g_2, \ldots ,\,c)\) such that mingiP(d, c)I(gi) was maximized where I(gi) was the total current passing through gene gi. Our rationale was that we need to find the bottlenecks in such paths, suggesting that the path with the biggest bottleneck was the most causal path.

Molecular interaction data

As for a reliable source of molecular interactions between human proteins we utilized the Pathway Commons database [53] that collects data from a variety of well-curated pathway databases such as KEGG [95], Reactome [96], and PID [97] as well as protein–protein interaction databases such as HPRD [98], BioGRID [99], and BIND [100]. Specifically, such interaction data allowed us to obtain 753,239 (un-)directed interactions between 12,614 human proteins.