Introduction

A better understanding of genetic variation contributing to cellular sensitivity to chemotherapeutic drugs can lead to more precise and personalized treatment of cancer patients [1]. Lymphoblastoid cell lines (LCLs) have been established as a model system to study the genetic components of drug-induced cytotoxicity by measuring cell growth inhibition following drug exposure [2]. Previous genome-wide association studies (GWAS) have identified numerous genetic variants and gene expression variables associated with drug cytotoxicity [3,4,5,6]. However, a comprehensive study of multiple drugs in different populations can reveal new insights into the genetic susceptibility of cytotoxicity.

We studied genetic factors associated with cytotoxicity of five mechanistically distinct chemotherapeutic drugs: cisplatin, carboplatin, capecitabine, cytarabine, and paclitaxel (Fig. 1). Cytotoxicities were measured for all drugs in two HapMap populations: Utah Residents with European ancestry (CEU) and African individuals from Yoruba in Ibadan, Nigeria (YRI). Platinum-based compounds, including cisplatin and carboplatin, are the most widely applied group of cytotoxic drugs worldwide and are used to treat head and neck, testicular, lung, endometrial, and ovarian cancers [7,8,9]. Capecitabine is mainly used to treat colorectal and breast cancers [10]. Patients with acute myeloid leukemia have long been treated with cytarabine [11]. Paclitaxel is commonly used for the treatment of lung, breast, and ovarian cancers [12]. Previous studies have shown that drugs in the same class, for example cisplatin and carboplatin, share genetic loci associated with drug-induced cytotoxicity [4]. An individual’s ancestral background has also been linked to differential risks for cytotoxicity [13]. Thus, a more comprehensive understanding of the distinct and shared genetic components associated with cytotoxicity between drugs and populations would be valuable for identifying new treatment options.

Fig. 1

Using functional annotations and network analysis to model chemotherapeutic-induced cytotoxicity. IC50 values of the chemotherapeutic drugs and SNP genotypes were measured in the HapMap CEU and YRI cell lines. GWAS were used to identify the most significantly associated SNPs for each drug and population; the resulting SNPs were annotated using Biofilter. To integrate SNP and gene expression data, the associated SNPs and gene expression variables were used to train grammatical evolution neural networks. The resulting network models were prioritized using ENCODE functional data.

However, a molecular understanding of individual genetic variations is challenging because there are a large number of genetic variations that can be associated with drug cytotoxicity and each variant is a surrogate for many other variants in the same region. To address these issues, we evaluated genetic variants using higher-level biological annotations in efforts to group variants into more interpretable biological modules. Comparing CEU to YRI, we found population specific annotations for each drug. Within individual populations, we observed drugs that treat similar types of cancers are enriched for the same biological annotations. In some cases, we identified similar biological annotations across CEU and YRI, as well as across multiple drugs.

Previous studies relied on GWAS to identify genetic variants that have the strongest independent genetic effects on drug-induced cytotoxicity and incorporated gene expression levels through studies of expression quantitative trait loci (eQTL) analysis [14]. This work led to the important observation that pharmacological GWAS SNPs are enriched in eQTLs for many cytotoxic drugs [15]. While the eQTL method can capture a linear relationship between SNPs and gene expression, it overlooks the possibility that interactions among SNPs or gene expression could also play a crucial role in drug cytotoxicity. To identify these non-linear interactions, we applied the grammatical evolution neural network (GENN) algorithm to build interaction networks consisting of SNPs and gene expression variables. Although the identification of associated SNPs and gene expression variables is an important first step in understanding drug cytotoxicity, a challenge remains on how to interpret the functional relevance of the interaction models. It has been shown that many regulatory elements can aid in identifying important functional SNPs [16, 17]. To this end, we used DNAseI and genome segmentation data published by the ENCODE consortium to prioritize the network models.

In this study, our goal is to better understand chemotherapeutic drug responses from three new avenues: First, while most of the existing studies have analyzed one drug at a time [5, 9, 18], our study investigated multiple drugs in two different populations. Second, to overcome the bias of selecting individual causal variants, we grouped the associated SNPs to higher biological modules including genes, protein families, and other biological annotations. While individual variant’s association is affected by the linkage disequilibrium patterns, it should have minimal impact on the higher-level annotations. Third, we integrated SNPs and gene expression variables using a network approach and quantified the importance of the additional data source in terms of prediction accuracy. We believe that a comparative study of multiple drugs and integration of different data sources could be a useful approach for interpreting genetic factors contributing to chemotherapeutic drug responses.

Methods

Genetic variants and gene expression data

Genetic variant data for Utah residents with Northern and Western European ancestry (CEU) and African individuals from the Yoruba in Ibadan, Nigeria (YRI) were downloaded from the 1000 Genomes Project (phase1_release_v3.20101123) [19]. RNA-seq gene expression data for the same individuals were downloaded from the gEUVADIS project [20]. The gene expression data were normalized by library depth and transcript length (RPKM). Genes with 0 counts in more than half of the samples were removed, and technical variation was adjusted by PEER normalization. The detailed normalization process is described in [20].

Cytotoxicity data

Lymphoblastoid cell lines from HapMap phase 1 CEU and YRI populations were treated with increasing concentrations of capecitabine [18], carboplatin [3], cisplatin [6], cytarabine [21], and paclitaxel [22] as previously reported. For carboplatin and cisplatin, the IC50 (the concentration required to inhibit 50% of cell growth) was calculated and log2 transformed to approximate normality. The area under the survival curve (AUC) was calculated for capecitabine, cytarabine, and paclitaxel. To satisfy the regression assumptions, all AUC values were also log2 transformed to approximate a normal distribution. For replication studies, HapMap phase 3 YRI and CEU cell lines were treated with four of the drugs under the same concentrations: capecitabine, carboplatin, cisplatin, and cytarabine.

Quality control for genetic variants and gene expression data

The SNP data were first converted to the variant call format. Only SNP data from the autosomes were used for the GWAS analyses. To minimize errors associated with the sequencing technology, only SNPs with a 100% call rate were retained using GATK [23]. To remove extreme outliers and increase statistical power, we limited our analysis to SNPs for which all three possible genotypes were observed, each in at least two samples [5]. This effectively removed all rare variants, which cannot be detected with adequate power given the small sample size. Between 2.7 and 4.7 million SNPs passed quality control. Gene expression variables were filtered so that at least 90% of samples had non-zero expression values. This resulted in around 20,000 gene expression probes being retained (Table 1).
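The genotype filter described above can be illustrated with a minimal sketch. This is not the GATK-based pipeline used in the study; the 0/1/2 genotype coding, the use of `None` for a missing call, and the function name `passes_genotype_qc` are assumptions made for illustration only.

```python
def passes_genotype_qc(genotypes, min_per_genotype=2):
    """Return True if a SNP passes the two filters described in the text.

    genotypes: alternate-allele counts (0, 1, 2) for each sample, with None
    marking a missing call (this encoding is an assumption).
    """
    calls = list(genotypes)
    # 100% call rate: reject the SNP if any sample has a missing call.
    if any(call is None for call in calls):
        return False
    # All three genotypes must be observed, each in at least
    # `min_per_genotype` samples.
    return all(calls.count(code) >= min_per_genotype for code in (0, 1, 2))


# Toy data: one list per SNP, one entry per sample.
toy_snps = [
    [0, 0, 1, 1, 2, 2],     # passes: two samples per genotype
    [0, 0, 0, 1, 1, 2],     # fails: only one sample with genotype 2
    [0, 0, 1, 1, 2, None],  # fails: one missing call
]
keep = [passes_genotype_qc(snp) for snp in toy_snps]
```

With a filter of this kind, a variant whose minor-allele homozygote appears in fewer than two samples is dropped, which is why rare variants do not survive quality control.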

Table 1 SNPs and gene expression variables count after quality control

GWAS analyses of drug susceptibility

To allow subsequent integration analyses using genetic variants and gene expression data, only samples common to the cytotoxicity data, the 1000 Genomes genetic variant data, and the gEUVADIS gene expression data were used for GWAS analyses. As a result, the number of samples differs for each drug (Table 1), and all of the study samples are unrelated. To control for potential confounding effects due to population structure, SNPs that passed the quality control criteria were first LD-pruned (--indep 50 5 2) using PLINK software [24]. The principal components of the pruned SNP data were estimated using Eigenstrat [25]. Along with each individual’s sex, the significant principal components (2 or 3) were included as covariates in the association analysis for each SNP. For gene expression data, each expression probe was adjusted for the individual’s sex.
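As a rough illustration of the covariate-adjusted association test, the sketch below fits a linear model of the phenotype on SNP dosage, sex, and one principal component using ordinary least squares. It is not the software used in the study: the helper names, the additive 0/1/2 dosage coding, and the toy data are all assumptions. It returns the SNP coefficient and its t statistic; a p-value would come from a t distribution with n minus p degrees of freedom.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def snp_association(phenotype, dosage, covariates):
    """Fit phenotype ~ intercept + dosage + covariates by least squares.

    Returns (beta, t) for the dosage term.
    """
    rows = [[1.0, d] + list(c) for d, c in zip(dosage, covariates)]
    n, p = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, phenotype)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    residuals = [y - sum(b * v for b, v in zip(beta, r))
                 for r, y in zip(rows, phenotype)]
    sigma2 = sum(e * e for e in residuals) / (n - p)
    # Var(beta_dosage) = sigma2 * [(X'X)^-1] at the dosage diagonal entry,
    # obtained here by solving X'X v = e_dosage.
    unit = [1.0 if i == 1 else 0.0 for i in range(p)]
    inv_col = solve_linear_system(xtx, unit)
    se = math.sqrt(sigma2 * inv_col[1])
    return beta[1], beta[1] / se


# Toy data (hypothetical): 10 samples, covariates are (sex, PC1).
dosage = [0, 1, 2, 0, 1, 2, 0, 1, 2, 1]
sex = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
pc1 = [0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.05, -0.05, 0.15, -0.15]
noise = [0.02, -0.01, 0.03, -0.02, 0.01, -0.03, 0.0, 0.01, -0.01, 0.0]
covariates = list(zip(sex, pc1))
phenotype = [2 + 0.5 * d + 0.3 * s + 1.0 * c + e
             for d, s, c, e in zip(dosage, sex, pc1, noise)]
beta, t_stat = snp_association(phenotype, dosage, covariates)
```

In a genome-wide scan the same fit would be repeated for every SNP that passed quality control, with the covariate columns held fixed.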

Functional meta-analysis of associated SNPs

To determine the biological annotations that are associated across populations and drugs, Biofilter (v2.2) [26] was used to separately map the associated SNPs of each cytotoxicity phenotype to functional groups including gene regions [27], protein family (Pfam) [28], gene ontology (GO) term [29], KEGG pathway [30], and Reactome [31]. Then, for each of the functional groups, we investigated whether any of its functional terms were shared across multiple populations and drugs. To evaluate the significance of the sharing, we carried out 1000 permutation tests, in which we permuted each drug’s cytotoxicity values and performed GWAS on the permuted outcome. If fewer than 5 out of 1000 permutations resulted in an equal or larger amount of sharing for a functional term, the term was deemed significant (p < 0.005). After permutation, 63 genes, 35 GO terms, 2 KEGG pathways, 12 Pfam terms, and 39 Reactome terms were determined to be significant.
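The permutation rule above amounts to an empirical p-value. The following minimal sketch shows the counting step only, assuming the sharing count for a functional term has already been computed for the observed data and for each permuted GWAS; the function names and the toy counts are hypothetical.

```python
import random


def permute_phenotype(values, rng):
    """Return a shuffled copy of the cytotoxicity values (one permutation)."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def empirical_p_value(observed_sharing, permuted_sharing):
    """Fraction of permutations with sharing at least as large as observed."""
    exceed = sum(1 for s in permuted_sharing if s >= observed_sharing)
    return exceed / len(permuted_sharing)


# One permutation of a toy phenotype vector.
shuffled = permute_phenotype([1.2, 3.4, 2.2, 0.7], random.Random(0))

# Toy sharing counts from 1000 permuted analyses of a single term.
permuted_counts = [0] * 900 + [1] * 97 + [2] * 3
p_two = empirical_p_value(2, permuted_counts)  # 3 of 1000 reach 2
p_one = empirical_p_value(1, permuted_counts)  # 100 of 1000 reach 1
significant = p_two < 0.005
```

A term shared by two drug and population combinations in this toy example would be kept, whereas a term seen in only one would not.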

Integration analysis using ATHENA

The Analysis Tool for Heritable and Environmental Network Associations (ATHENA) is a multifunctional software package that provides machine learning tools to analyze genomics data. The software has been extensively tested and applied in simulation data and real world data with great success [32, 33]. The software and its modeling processes have been described previously [34]. Briefly, we used an evolutionary algorithm, grammatical evolution neural network (GENN), to optimize artificial neural networks (ANNs), which are used to integrate genetic variants and gene expression data. The evolution process initiates a set of random models and these random models compete with each other through generations. The “fittest” models, or the models that maximize desired target function, can exchange components of themselves. Through transferring of the components, some models may acquire beneficial components and eventually take over the population pool. This evolution process mimics natural selection where the “fittest model” will survive at the end of evolution. The algorithm is described below.

Step 1: The data is divided into five parts for five cross validations with 4/5 for training and 1/5 for testing.

Step 2: Under population size constraint, a random population of models (ANNs) is generated.

Step 3: All models are evaluated with training data. The models with highest fitness are selected for crossover, mutation, reproduction and migration.

Step 4: Step 3 is repeated for a set number of generations.

Step 5: The best solution at the final generation is tested on the testing data and saved.

Step 6: Steps 2–5 are repeated for each cross validation.
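Steps 1 to 6 can be sketched as a generic evolutionary loop inside five-fold cross validation. This is a simplified stand-in and not the ATHENA or GENN implementation: each model here is a plain weight vector (a bias plus a weighted sum of inputs) instead of a grammar-derived neural network, migration is omitted, and the population size, generation count, mutation scale, and toy data are all assumptions.

```python
import random


def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def predict(weights, rows):
    # Stand-in model: bias plus weighted sum of the inputs.
    return [weights[0] + sum(w * x for w, x in zip(weights[1:], row))
            for row in rows]


def evolve(train_x, train_y, rng, pop_size=30, generations=40):
    n_weights = len(train_x[0]) + 1

    def fitness(weights):
        return r_squared(train_y, predict(weights, train_x))

    # Step 2: random initial population.
    population = [[rng.uniform(-1, 1) for _ in range(n_weights)]
                  for _ in range(pop_size)]
    # Steps 3 and 4: evaluate, select, and vary for a fixed number of generations.
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]           # reproduction of the fittest
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_weights)
            child = a[:cut] + b[cut:]               # crossover
            child[rng.randrange(n_weights)] += rng.gauss(0, 0.1)  # mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


def cross_validate(x, y, folds=5, seed=0):
    rng = random.Random(seed)
    order = list(range(len(y)))
    rng.shuffle(order)
    results = []
    for k in range(folds):                          # Steps 1 and 6
        test_idx = order[k::folds]
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        best = evolve([x[i] for i in train_idx], [y[i] for i in train_idx], rng)
        # Step 5: score the best model on the held-out fold.
        test_r2 = r_squared([y[i] for i in test_idx],
                            predict(best, [x[i] for i in test_idx]))
        results.append((best, test_r2))
    return results


# Toy data (hypothetical): 25 samples, two predictors, outcome in [0, 1].
data_rng = random.Random(1)
features = [[data_rng.random(), data_rng.random()] for _ in range(25)]
target = [0.2 + 0.5 * a + 0.3 * b for a, b in features]
results = cross_validate(features, target)
```

The loop returns one model and one testing R-squared per fold, which mirrors the five models per cytotoxicity phenotype described later in this section.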

The fitness of the model aims to measure how well the variables can explain the cytotoxicity, a continuous value. We used R-squared as our fitness metric to represent the percentage of cytotoxicity variation explained by SNPs and gene expression. We scaled the cytotoxicity to be between 0 and 1 using min-max scaling so that it matches the output of neural networks, where

$$\mathrm{Normalized}\;D_i = \frac{{D_i - \min \left( D \right)}}{{\max \left( D \right) - \min \left( D \right)}}$$
(1)

Here \(D_i\) is the IC50 value for the ith sample and \(D\) is the vector of IC50 values.

And the R2 is calculated as:

$$R^2{\mathrm{ = }}1 - \frac{{\mathop {\sum }\nolimits_i^n \left( {D_{predict\,i} - D_i} \right)^2}}{{\mathop {\sum }\nolimits_i^n \left( {D_i - \overline D } \right)^2}}$$
(2)

Here \(D_{predict\,i}\) is the predicted IC50 for the ith sample, \(D_i\) is the IC50 value for the ith sample, and \(\overline D\) is the average IC50 value.
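Equations (1) and (2) can be checked on a small worked example. The sketch below is illustrative only; the function names and the four toy IC50 values are assumptions.

```python
def min_max_scale(values):
    """Equation (1): rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def r_squared(observed, predicted):
    """Equation (2): 1 minus residual sum of squares over total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


ic50 = [2.0, 4.0, 6.0, 10.0]          # toy values
scaled = min_max_scale(ic50)          # [0.0, 0.25, 0.5, 1.0]

perfect = r_squared(scaled, scaled)   # predictions equal observations
mean_only = r_squared(scaled, [sum(scaled) / len(scaled)] * len(scaled))
close = r_squared(scaled, [0.1, 0.25, 0.5, 0.9])
```

Predicting every sample with the mean gives an R2 of 0, so a model with negative R2 is doing worse than the mean predictor.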

Linkage disequilibrium (LD) patterns exist among the associated SNPs because many are located close to one another. Even though the SNPs may have distinct biological functions, they are indistinguishable with regard to their association with cytotoxicity because they are highly correlated. To reduce the correlated signals resulting from LD, pairwise LD among all associated SNPs was estimated for each cytotoxicity phenotype. A threshold of r2 > 0.7 was used to form LD clusters among the associated SNPs; if a cluster contained more than one SNP, the SNP most significantly associated with cytotoxicity was selected as the tag SNP for the cluster. To reduce multicollinearity in the gene expression data, the Pearson correlation was calculated for all possible gene pairs. Genes with a correlation coefficient r > 0.8 were grouped into a cluster. The most significantly associated gene from each cluster was selected as the tag gene for the cluster.
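The tag selection step can be sketched as follows. The text does not state how clusters were formed from the pairwise values, so this example assumes single-linkage grouping (any two variables above the threshold end up in the same cluster); the identifiers, r2 values, and p-values are hypothetical.

```python
def select_tags(ids, pairwise, p_values, threshold=0.7):
    """Cluster variables whose pairwise value exceeds `threshold` and
    return the most significant member (smallest p-value) of each cluster."""
    parent = {name: name for name in ids}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), value in pairwise.items():
        if value > threshold:
            parent[find(a)] = find(b)   # merge the two clusters

    clusters = {}
    for name in ids:
        clusters.setdefault(find(name), []).append(name)
    return sorted(min(members, key=lambda m: p_values[m])
                  for members in clusters.values())


# Toy example: snpA, snpB, and snpC are linked through r2 > 0.7; snpD is alone.
snps = ["snpA", "snpB", "snpC", "snpD"]
r2 = {("snpA", "snpB"): 0.9, ("snpB", "snpC"): 0.75,
      ("snpA", "snpC"): 0.4, ("snpC", "snpD"): 0.1}
p = {"snpA": 1e-4, "snpB": 5e-5, "snpC": 3e-4, "snpD": 2e-4}
tags = select_tags(snps, r2, p)
```

The same function applies to gene expression by passing pairwise Pearson correlations and a threshold of 0.8.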

We first used ATHENA to perform variable selection on tagging SNPs and gene expressions. SNPs and gene expressions were integrated together to build neural networks that model the data. We selected SNPs and gene expressions that were consistent between cross validations [33]. The variable selection step used training R squared to select candidate SNP and gene expression variables; however, the final models are selected based on testing R squared. Using the selected SNPs and gene expressions, we used ATHENA to build five models, one for each cross validation, and for each cytotoxicity phenotype.

Using functional data to prioritize neural network models

To distinguish neural network models that have similar predictive power for cytotoxicity, we used functional data produced by the ENCODE project [35] to quantify the functional relevance of each model. We downloaded 128 DNase-I hypersensitivity samples from the ENCODE project (http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/openchrom/jan2011/combined_peaks/). The data contain merged DNase-I peaks from UW and Duke that passed a 1% FDR cutoff. Genome segmentations of six ENCODE cell lines (GM12878, K562, H1-hESC, HeLa-S3, HepG2, and HUVEC) were obtained from (http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/segmentations/jan2011/hub/). We used the combined segmentation calls based on the consensus calling of the ChromHMM and Segway algorithms. The combined segmentations split the genome into non-overlapping regions: CTCF-enriched element, enhancer, weak enhancer, promoter-flanking region, promoter region including TSS, transcribed region, and repressed region. For every SNP in the neural network model, we determined whether it is located in DNase-I hypersensitive regions or genome segmentation regions across all cell types. Because the network models only include the tagging SNPs, we also determined the functional region overlaps for SNPs that are in LD with the tagging SNP. The functional score for each model is calculated as the sum of overlaps for each individual SNP, normalized by the model size. Where a SNP in LD with the tagging SNP had a higher number of overlaps, the tagging SNP was replaced with that LD SNP. To select the final model, we first selected the 3 models with the best testing accuracy (R2). Of those, we selected the model with the highest functional score as the final model. Once we had the final model, we used its SNPs and gene expression variables separately to build SNP-only and gene-expression-only models. Where a model predicted worse than the mean IC50, its R2 value was replaced with 0, effectively removing it.
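The scoring and selection rules can be illustrated with a short sketch. It assumes the per-SNP overlap counts have already been computed against the DNase-I and segmentation tracks; the model names, R2 values, and overlap counts are hypothetical, and the dictionary layout is an assumption made for the example.

```python
def functional_score(overlap_counts):
    """Sum of per-SNP overlap counts divided by the number of SNPs."""
    return sum(overlap_counts) / len(overlap_counts)


def select_final_model(models, top_n=3):
    """Keep the `top_n` models by testing R2 (negative values treated as 0),
    then return the one with the highest functional score."""
    by_accuracy = sorted(models,
                         key=lambda m: max(0.0, m["test_r2"]),
                         reverse=True)[:top_n]
    return max(by_accuracy, key=lambda m: functional_score(m["overlaps"]))


# Toy candidates, one per cross validation fold.
candidates = [
    {"name": "fold1", "test_r2": 0.30, "overlaps": [7, 3, 5, 1]},
    {"name": "fold2", "test_r2": 0.35, "overlaps": [2, 2]},
    {"name": "fold3", "test_r2": 0.28, "overlaps": [6, 1, 2]},
    {"name": "fold4", "test_r2": 0.10, "overlaps": [9, 9]},
    {"name": "fold5", "test_r2": -0.20, "overlaps": [4]},
]
final = select_final_model(candidates)
```

In this toy case fold4 has the highest functional score but is excluded because it is not among the three most accurate models, so the choice falls to fold1.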

Results

Chemotherapeutic drug genetic associations

Cell growth inhibition was measured previously on unrelated CEU and YRI LCLs following treatment with increasing concentrations of cisplatin [6], carboplatin [3], cytarabine [21], capecitabine [18] or paclitaxel [36]. Their dose-dependent inhibition was calculated as the IC50 (the concentration required to inhibit 50% of cell growth) or the AUC (the area under the survival-drug concentration curve) for up to 77 LCLs (Table 1).

Genome-wide SNP data for the LCLs were obtained from the 1000 Genomes Project (http://www.1000genomes.org/) and were evaluated for their association with each drug’s cytotoxicity. We adjusted for sex and significant principal components of ancestry (2 or 3) in the linear regression model. The quantile-quantile plots and the genomic inflation factors (λ) showed that there was minimal inflation (S1 Fig). We identified between 1230 and 2749 SNPs significantly associated with each drug response at p < 0.0005 (Table 2). Gene expression levels for the LCLs, measured by RNA-Seq, were downloaded from the gEUVADIS consortium (http://www.geuvadis.org/). Normalized RPKM (reads per kilobase per million) values for ~20,000 genes were tested for association with each drug’s IC50 or AUC. To keep the number of associated genes similar across drugs, we used p < 0.005 or p < 0.0005 to select candidate genes. We identified between 65 and 295 genes whose expression levels were associated with drug outcome (Table 2). A list of all associated SNPs and gene expression levels can be found in the supplemental materials (S1 Table).

Table 2 Genotype and gene expression associations with chemotherapeutic drugs

To replicate the SNP associations, we exposed an independent set of HapMap phase 3 LCLs to four of the five chemotherapeutic drugs: cisplatin, carboplatin, cytarabine, and capecitabine. We performed an association analysis on the independent LCLs; using the same p-value threshold (p < 0.0005) and requiring the same direction of effect, we replicated between 122 and 722 SNPs that were associated in the original samples (Table 2).

Pan-drug analysis of associated SNPs reveals distinct patterns of functional enrichment

To better understand the biological processes involved in differential cytotoxicity, we annotated all the SNPs associated with each drug response with gene regions, KEGG pathways, GO terms, REACTOME, and Pfam using Biofilter [26]. We observed that many biological annotations were shared across different drugs and/or populations. To remove annotations that were shared by chance, we performed a permutation test (1000×) for each drug’s IC50 or AUC. Using the permuted IC50 or AUC, we identified associated SNPs using the same criteria as in our original analysis. For each permutation, we counted how many times an annotation was shared across drugs and populations. We then removed any annotation whose level of sharing was not significantly greater than that seen in the permutations, keeping only those with p < 0.005.

Cellular sensitivity to drugs is a broad phenotype that includes cell cycle arrest, cell damage, and cell death through apoptotic and non-apoptotic mechanisms [37, 38]. Cytarabine (5 and 40 uM), cisplatin (5 uM) and paclitaxel (12.5 nM) were evaluated for chemotherapeutic-induced apoptosis because they cause a significant increase in cellular caspase-3/7 activation, a measure of apoptosis [22]. For each drug, apoptosis was measured in 30 CEU LCLs and 35 YRI LCLs. We identified SNPs that are associated with drug-induced caspase 3/7 activation (S2 Table) and mapped them using biological annotations. To obtain the most stringent list of biological annotations that are shared between different drugs and populations, we kept only the annotations that passed the permutation test and were also identified in the replication or apoptosis dataset (Fig. 2).

Fig. 2

Pan-drug analysis of functional annotations. For each drug in CEU and YRI, associated SNPs were mapped to various functional annotations. A colored square indicates that one or more SNPs were mapped to that functional term (Cisplatin: Red, Carboplatin: Blue, Cytarabine: Orange, Capecitabine: Purple, Paclitaxel: Black). Only functional terms that have significant enrichment across drugs and populations (permutation analysis p < 0.005) are shown. Functional terms were grouped using hierarchical clustering according to their enrichment. a Gene; b GO term; c KEGG pathway; d REACTOME; e Pfam

When we compared the associated functional annotations across CEU and YRI LCLs, we observed that some annotations are population specific. For gene annotations, a group of genes including HUNK, MTMR9, PRAMEF4, and ACACA were only associated in the CEU population for at least two chemotherapeutics (Fig. 2a). Meanwhile, Spermatogenesis family BioT2, GNS1/SUR4 family, Translin family, and Leukotriene A4 hydrolase C-terminal in pfam (Fig. 2e), IKK related terms in REACTOME (Fig. 2d), and several neuronal development and leukocytes GO terms (Fig. 2b) were only identified in the YRI population. On the other hand, there is a common group of functional terms associated in both CEU and YRI populations. This group consists of mostly fatty acid related functional terms clustered together in GO term, REACTOME, and KEGG pathway. One notable example is the NF-kappa B signaling pathway in the KEGG pathway. This pathway was associated with all of the drugs in both populations (Fig. 2c).

Within each population, we observed that drugs within the same class have similar associated annotation patterns. In particular, cisplatin and carboplatin, both platinating agents, have many functional annotations in common. Cytarabine and capecitabine, both antimetabolites, also have a number of overlapping annotations (Fig. 2).

We also observed overlapping annotations between drug cytotoxicity and apoptosis. The TSNAX-DISC1 and DISC1 genes were associated with cytarabine and paclitaxel for both cytotoxicity and apoptosis. A number of triglyceride and fatty acid GO terms and REACTOME pathways were shared for cytarabine, paclitaxel and cisplatin. Both the fatty acid elongation and NF-kappa B signaling pathways in KEGG are enriched for both processes. In Pfam, the GNS1/SUR4 family, Translin family, and RFX DNA binding domain were enriched for cytotoxicity and apoptosis.

Network modeling identified interactions between SNPs and gene expression variables important in cytotoxicity

Starting with the SNPs and gene expression variables that were associated with each drug’s cytotoxicity, we calculated pairwise correlations among SNPs or gene expression variables. Using cutoffs of r2 > 0.7 for SNPs and Pearson’s r > 0.8 for gene expression, we grouped highly correlated SNPs and gene expression variables into the same clusters. To reduce multicollinearity in the network analysis, we selected the one tag SNP or tag expression variable that had the strongest association with cytotoxicity to represent each cluster.

We integrated the tag SNPs and gene expression variables using GENN and built interaction network models for each drug and population combination.

Using ENCODE data to prioritize network models

It is possible that a number of network models can be similarly predictive of each drug’s cytotoxicity. To prioritize these models, we selected the model that contains variables with evidence of functional relevance from ENCODE. Previous studies suggested that SNPs that lie in open chromatin and regulatory regions are more likely to be functional [35]. Thus, we used DNAseI hypersensitivity sites from 124 cell lines and genome segmentation data from 6 cell lines produced by the ENCODE project to assess the functional relevance of each model. The DNAseI data mark genomic regions that are not occupied by heterochromatin, and the genome segmentation data divide the genome into enhancers, transcription start sites, promoter-flanking regions, CTCF binding sites, and repressed regions. For each network model, we first identified the full set of features by including SNPs that are in the same clusters as the tag SNPs in the model. We then calculated a functional score for each feature that is proportional to the number of functional elements it overlaps with across all of the cell lines. The final score for a network model is the sum of the individual feature scores normalized by network size (Fig. 3). Using the functional score, we were able to prioritize models that have similar predictive power in terms of R2 (the amount of variability explained by the model) and identified one final model for each drug and population (Table 3 & S2-10 Fig). As an example, the multi-layer neural network for capecitabine in the CEU population uses 6 SNPs to model IC50 response (Fig. 4).

Fig. 3

Schematic for functional score calculation. Functional score of a model is calculated as the sum of scores of individual SNP or SNPs in LD normalized by the model size. Individual score was determined by its positional overlap with functional regions. In this example, yellow squares represent DNAseI or genome segmentation regions. The score for a network model of SNP A, B, C, D is (7 + 3 + 5 + 1)/4 = 4

Table 3 Network model identified by GENN
Fig. 4

Neural Network model for capecitabine’s chemotherapeutic response in CEU. W is a weight node, PADD is an addition activation node. In this network, the SNP genotype values are multiplied with their associated weights and then added together in the PADD node

Discussion

Understanding a patient’s genetic susceptibility to chemotherapeutic drugs will provide important information for precision medicine. Previous studies have evaluated genotype associations with an individual chemotherapeutic drug; however, a comparative study of multiple drugs in different populations could reveal common or unique mechanisms that can be exploited therapeutically. Here, we present the first study to analyze the genetic associations of cytotoxicity induced by five chemotherapeutic drugs (cisplatin, carboplatin, capecitabine, cytarabine, and paclitaxel) in LCLs derived from two populations (CEU and YRI). To comparatively analyze the associated genetic variants across multiple drugs in two populations, higher-level biological knowledge was used to group variants into functional modules. We discovered that mechanistically distinct drugs are enriched in the same functional modules, such as the NF-kB pathway. We also set out to identify biomarkers that are predictive of drug cytotoxicity. To this end, we found that integrated networks of SNPs and gene expression performed better than either data type alone. Finally, we used DNA regulatory information to select network models that are both predictive and functionally important.

We performed genome-wide SNP association analysis for each of the five drugs in both populations to identify significant genetic associations with drug-induced cytotoxicity. A major challenge in interpreting significant SNP associations across different drugs and populations is that comparing individual SNPs alone can be misleading. A slight change in allele frequency could cause any one of the SNPs in linkage disequilibrium to be identified; however, SNPs in LD are likely to be located in the same genes or regions. We therefore annotated the associated SNPs to higher-level biological processes using gene regions, GO term, KEGG pathway, REACTOME pathway, and Pfam. We found that biological annotations are considerably different between LCLs derived from individuals of European and African ancestry. Interestingly, ancestry has also been reported to affect gene expression [14], modified cytosines [39] and sensitivity to chemotherapy [13]. The disparities might lie in the differences in population susceptibility to cancer, which could also affect the cytotoxic response. The HUNK and ACACA genes were associated only in the CEU population and are both related to breast cancer [40, 41] (Fig. 2a). A previous report has shown that differences exist between African American and European American women in the nature of breast cancer [42]. The SEMA4D and CCDC7 genes were associated in the YRI population (Fig. 2a). Expression of these genes has been reported to correlate with poor outcome in cervical cancer [43, 44]. In addition, a recent survey has found that Black American females are more likely to develop cervical cancer than White American females [45] and Black American males have higher lung cancer incidence compared to White American males [46]. These candidate genes could be further validated in their respective populations. Several IKK related REACTOME pathways were associated with the YRI population (Fig. 2d). 
IKK is a central regulator of NF-kB pathway [47] and activation of NF-kB pathway has been observed in many solid tumors [48]. Interestingly, NF-kB pathway is associated in both CEU and YRI populations (Fig. 2c), but IKK is only associated with the YRI population. This suggests a possible alternate regulator of NF-kB pathway for cytotoxic response.

Many annotation terms were also associated in both populations. Fatty acid and triglyceride related functional terms were identified in GO term, KEGG pathway, and REACTOME (Fig. 2b, c, d). In Pfam, the GNS1/SUR4 family is also involved in fatty acid elongation systems [49]. Fatty acid synthesis is an important process for cancer cells to expand and proliferate and has been found to be associated with cytotoxicity in cells [50, 51]. High expression of fatty acid synthase was also observed in colon, prostate, ovarian, breast and endometrium cancers [52, 53]. Recent studies have shown that fatty acid synthase inhibitors are potent therapeutic agents for cancer [54, 55]. Altered growth is one of the direct results of cytotoxic response, so it is likely that fatty acid synthase is also involved in the observed differential drug responses. Positive regulation of endothelial cell migration was associated with all five drugs. In addition, it was reported that during metastasis, cancer cells extravasate at metastatic sites by attaching to endothelial cells [56]. We also observed that drugs known to treat similar cancers have a high overlap of biological annotations. In particular, cisplatin and carboplatin are both platinum compounds that treat lung, head and neck, testicular, and ovarian cancer [4, 9]. Cisplatin and carboplatin show a high overlap in all annotations, especially in the YRI population (Fig. 2).

LCLs’ cellular sensitivity to drugs is a broad phenotype that encompasses many sub-phenotypes including drug-induced apoptosis. Cell apoptosis, as measured by caspase activity, was shown to be weakly correlated with cytotoxicity [22]. Despite the weak correlation at the phenotypic level, we found that many functional terms enriched for cell cytotoxicity are also associated with cell apoptosis (Fig. 2), indicating shared biological mechanism for the two responses. As an example, SNPs in RFX2 gene were identified in a clinical trial evaluating paclitaxel-induced neuropathy of breast cancer patients and shown to be functionally important in paclitaxel-induced cytotoxicity using siRNA [36]. In our analysis, RFX DNA binding domain was associated with both paclitaxel-induced cytotoxicity and apoptosis (Fig. 2e).

The integration of SNP and gene expression data yielded higher predictive R2 than SNP or gene expression data alone (Table 3), which supports the potential value for combining multiple types of genomics data [57,58,59]. Because we prioritized our model based on overlaps with DNA regulatory regions, many of our models contain SNPs that are located in the DNAseI region and functional genome segmentation regions. This information can provide additional interpretability to our models compared with using R2 alone.

Due to the small sample size of LCLs in the original analysis, we sought replication in independent HapMap3 LCLs to confirm our results. Of note, we found that a large number of biological annotations were replicated in the independent datasets. Of the annotation/drug pairs identified in the discovery analysis, between 15% and 100% were also significant in the respective HapMap3 replication population. This suggests that although the associated SNPs might not be identical between the discovery and replication studies, the underlying biological mechanisms are shared. Our results show that many genetic variants and genes are involved in the cytotoxicity of chemotherapeutic drugs. By mapping genetic variants to higher-level biological processes, we were able to encapsulate variants in the same genomic region into more informative units. Comparing groups of biological processes showed population-specific patterns between CEU and YRI. However, as CEU LCLs were derived at an earlier time point [60], further studies are needed to verify whether some of the observed differential patterns might be due to time in culture. Nonetheless, a previous study showed that the cellular proliferation rate was not significantly different between CEU and YRI and no widespread genetic differences on common SNPs were observed between phase 2 and phase 3 YRI LCLs [60]. There are also common processes across all drugs as well as between drugs that belong to the same class. These results could identify new drug repositioning candidates based on sharing of biological processes. Using the ENCODE data, we built predictive network models for drug cytotoxicity that are also functionally relevant. Future work could explore the use of additional types of functional data to better reflect the functional relevance of the models.