Introduction

Ankylosing spondylitis (AS) is a common, chronic inflammatory disease that primarily affects the spine and the sacroiliac joints, but can also affect peripheral joints and extraarticular tissues including the eye, gut and skin. Up to 70% of AS patients have subclinical intestinal inflammation, with 5–10% developing clinically diagnosed inflammatory bowel disease (IBD).1 AS is highly familial and heritable,2, 3 with a prevalence of 0.55% in European populations4 and 0.2–0.54% in Han Chinese populations.5 Although the disease is strongly associated with HLA-B27 within the major histocompatibility complex (MHC; odds ratio >50),6, 7 HLA-B27 contributes only ∼20% of the genetic risk.8 At least 39 genetic variants lying outside the MHC have been identified as associated with AS at genome-wide levels of significance.8, 9, 10, 11, 12 A further 74 non-MHC associations have been identified through studies of pleiotropic genetic associations of AS, IBD, psoriasis and primary sclerosing cholangitis.8 Conditional analyses have revealed multiple independent human leukocyte antigen (HLA) associations with AS in addition to HLA-B27; for the purposes of this paper we consider this locus as one.13

As with most common variant disease and trait-associated polymorphisms, most of the associated single-nucleotide polymorphisms (SNPs) lie within noncoding regions.14, 15 Utilizing the increasing depth of genomic annotation of epigenetic and gene expression properties of different cell types, several recent studies have shown that disease- or trait-associated SNPs often operate through effects on gene regulatory regions in specific cell types or tissues.16, 17 With the rapid increase in epigenetic and gene expression data in hundreds of cell types, associated variants can be further interrogated, and potential causal variants and loci prioritized for further study, focussing on the cell types in which they primarily operate.18, 19, 20

A wide variety of cell types have been suggested to be involved in AS, largely through flow cytometric analyses of cell counts and activation. These include CD4+ and CD8+ T lymphocytes;21 T helper 17 (Th17) cells;22 regulatory T cells;21 innate immune cells, including natural killer (NK) cells,23 gamma-delta cells and innate lymphoid cells;24 B cells;25 mast cells;11, 26 and intestinal Paneth cells.27, 28, 29 Although these studies provide valuable information about the cellular mechanisms involved in AS pathogenesis, they only indirectly inform how AS-associated SNPs affect the pathogenic mechanism of the disease. A further limitation of these studies is that they focussed on cell types that are easy to sample, such as circulating immune cells, and thus the involvement of other rarer cell types, particularly in sites not easily sampled, remains underinvestigated. To date, there have been very few studies that utilized AS-associated SNPs and multiomics data to identify the cell types involved in the pathogenesis of the disease. One that broadly examined autoimmune and inflammatory diseases used PICS (Probabilistic Identification of Causal SNPs) and the chromatin maps of different cell types, and identified naive T cells, various T helper cells, and colonic and duodenal mucosa as cell types likely to be involved in AS.30 Another, using a smaller number of curated AS-associated loci, further implicated Th17 cells in AS pathogenesis.31 Recently, a cross-disease study of a larger cohort identified additional genome-wide significant (GWS) AS-associated loci.8 Here, we leveraged the expanded AS-associated SNP set from the cross-disease study, together with publicly available cell type-specific epigenetic, gene expression and protein expression data, to elucidate the critical cell types involved in AS pathogenesis.

Results

AS-associated SNPs are enriched in regulatory elements in gut and immune cell types

AS-associated SNPs were defined as being GWS either in analyses of AS alone (AS_GWS_40) or in cross-disease studies (AS_cross_disease_114; see Materials and methods and Supplementary Information). Of the 39 most strongly AS-associated GWS (non-MHC) SNPs, only one was located in an exon, and of the 113 (non-MHC) SNPs from cross-disease studies, up to 13 were located in exons. When SNPs in linkage disequilibrium (LD; r2>0.8) were also considered, 11 of the 39 AS GWS SNPs (28%) and 40 of the 113 cross-disease SNPs (35%) were located in, or in LD with a SNP located in, an exonic region. As 65–72% of disease-causative SNPs at these loci are therefore likely to lie in noncoding regions, we examined whether any of these SNPs were located in regulatory elements of the genome as indicated by epigenetic marks. We began by assessing the enrichment of the AS-associated SNP data sets in H3K4me3 marks within 34 cell types using EPIGWAS.32 AS GWS-associated SNPs showed significant enrichment in immune cell types and gut tissue, consistent with AS being an immune-mediated arthropathy in which the gut is involved in pathogenesis (Figure 1 and Supplementary Table S1). The highest levels of enrichment were observed in rectal mucosa (P-value=0.0155) and regulatory T cells (P-value=0.0183). CD4-naive T cells and CD8 memory T cells also showed signals. With the larger associated SNP set, AS_cross_disease_114, more cell types were detected, including B cells, naive primary T cells and CD4 memory primary cells, with increased levels of significance (Supplementary Figure S1 and Supplementary Table S1).

Figure 1

Cell type specificity of the AS_GWS_40 SNP set with H3K4me3 chromatin marks in NIH Roadmap cell types, assessed using EPIGWAS. AS GWS-associated SNPs are enriched specifically in immune cell types and rectal mucosa. The blue line indicates P-value=0.05, and the red line, P-value=0.01.

As histone marks such as H3K27ac and H3K4me1 can indicate nearby active enhancers,33 we further extended the epigenetic analysis by utilizing the consolidated NIH Roadmap Epigenomics database. We focussed on the 20 epigenetic marks with data in all 127 cell types in the database. AS GWS-associated SNPs from the AS_GWS_40 set significantly colocalized with H3K4me1 and enhancers (7_Enh) in T cells, monocytes, CD4 naive T cells, T helper naive cells and NK cells (Figure 2). Similar enrichment patterns were observed in both H3K4me1 and 7_Enh, and these two epigenetic marks clustered together (Figure 2 and Supplementary Table S2). In addition to immune cells, AS_GWS_40 SNPs also showed significant enrichment in enhancer-related epigenetic regions in gut cells (small intestine, rectal mucosa, duodenal mucosa and sigmoid colon). A relatively weak enrichment in B cells was observed in genic enhancers. The signal for gut cell types increased for the larger SNP set, AS_cross_disease_114 (Supplementary Figure S2 and Supplementary Table S2). Furthermore, an enrichment signal (P-adjust=0.0025) was observed for bone marrow-derived chondrocytes in 7_Enh enhancers with the AS_GWS_40 SNPs. Notably, this bone cell signal was much weaker than that seen in immune cells with the AS_GWS_40 SNPs. Both chondrocytes and osteoblasts showed significant signals with the AS_cross_disease_114 SNP set (Supplementary Figure S2).

Figure 2

AS-associated SNPs show enrichment in enhancer-related epigenetic regions in immune cells and gut cells. (a) Heat map showing epigenetic marks and chromatin state enrichment of the AS_GWS_40 SNP set within the NIH Roadmap and ENCODE cell types. (b, c) Enlarged views of the boxed areas b and c of (a), respectively. Note that in addition to immune and gut cells, AS-associated SNPs also showed a relatively weak enrichment in B cells. Orange to red indicates significance, whereas cornflower blue to light blue indicates nonsignificance. Cells and tissues are clustered into eight groups: haematopoietic stem cells (HSCs), T cells and B cells; digestive cells; epithelial cells; cell lines from ENCODE; embryonic stem (ES) cells, induced pluripotent stem cells (iPSCs) and ES-derived cells; brain cells; heart cells; musculoskeletal and other cells, and are coloured according to the key. For a full listing of the cell types and chromatin states in (a), see Supplementary Table S15.

AS-associated loci were enriched in gut and immune cell types and their response to cytokines

As many intergenic SNPs have cis-acting effects on gene expression, we examined cell type-specific expression of genes in proximity to AS-associated SNPs. To this end, we employed the SNPsea program34 to identify potential cell types and pathways affected against four public gene expression databases. AS_GWS_40 loci were significantly enriched in gut and immune cell types, especially in monocytes (Table 1 and Supplementary Table S3). Cell types identified in the preceding epigenetic analyses, including CD4+ memory T cells and gut cells, were also identified here. Gene Ontology (GO) enrichment analysis showed that these AS_GWS_40 loci were predicted to be involved in pathways responding to cytokine stimulus (Table 1), consistent with reports that the cytokines interleukin (IL)-17 and IL-23 play important roles in the pathogenesis of AS.35, 36 The AS_cross_disease_114 SNP set was analysed in a similar way and was enriched in immune cell types, including monocytes, CD4+ and CD8+ cells and their subsets (Supplementary Table S3).

Table 1 AS-associated loci were enriched in gut and immune cell types, particularly in monocytes, and involved in cytokine response pathways

AS differentially expressed genes showed enrichment in different immune cell types

Genes demonstrated by our in-house RNA sequencing (RNA-Seq) to be differentially expressed in peripheral blood mononuclear cells between AS cases and healthy controls (herein termed ‘AS differentially expressed genes’) were then analysed for the cell type in which they were primarily expressed (a listing of the AS differentially expressed genes appears in Supplementary Table S4). This was performed by merging the NIH Roadmap and BLUEPRINT RNA-Seq data to form a combined data set representing RNA-Seq data from 93 cell types. As with epigenetic studies, and studies of genes near AS-associated SNPs, the primary cell types that demonstrated enrichment of expression of AS differentially expressed genes were circulating immune cells (Figure 3 and Supplementary Table S5). This was expected, particularly given that the in-house RNA-Seq data were derived from peripheral blood mononuclear cell RNA. No enrichment of expression was noted in intestinal or other noncirculating tissues other than in breast myoepithelial cells. AS upregulated genes (from 1.2- to 1.4-fold) showed different enrichment patterns compared with AS downregulated genes (0.9-fold; Figure 3). AS upregulated genes were most enriched in monocytes (adjusted P-value=0.00093), whereas downregulated genes were most enriched in CD8-naive primary cells compared with all the human genes in the RNA-Seq data set. Only the mixed peripheral blood mononuclear cell types showed enrichment of both up- and down-regulated genes. AS upregulated genes also showed significant enrichment in mature neutrophils, macrophages and proinflammatory macrophages after multiple hypothesis testing correction. AS downregulated genes showed relatively weak, but significant, enrichment in CD4+ primary T cells, CD56–dim NK cells, central/effector CD8+ T cells, central/effector CD4+ T cells and regulatory T cells. Bone cells were not represented in the NIH Roadmap and BLUEPRINT RNA-Seq databases, and thus could not be tested in this analysis. A high degree of correlation of gene expression patterns was observed between cell types, reducing the power of this analysis to resolve cell type-specific expression patterns of AS differentially expressed genes.

Figure 3

AS upregulated and downregulated genes were enriched in different blood cell types. (a) Heat map showing expression of AS differentially expressed genes based on RNA-Seq data. (b) Enlarged view of the boxed area b of (a). Note that different cell types were identified based on either AS up-regulated or down-regulated genes. The enrichment scores were normalized by the number of genes. Yellow to red colours indicate significance after Bonferroni correction (α=0.05), whereas white colour indicates no significance (NS). Normalization, reads per kilobase per million. For a full listing of the cell types in (a), see Supplementary Table S15. Differentially expressed genes were grouped by different basemean counts (mean of normalized counts of all samples) as an output from DESeq2, and labelled as b100, b200 and b500, where b100 represents genes with basemean counts of ⩾100 and so on. The fold change is also indicated.

To determine whether the differentially expressed gene set also showed cell type bias in protein expression levels, the E-PROT-1 data set from Expression Atlas was assessed using a similar methodology to that employed for RNA-Seq data analyses. A very similar pattern of protein expression was observed as for RNA expression (Figure 4 and Supplementary Table S5). Proteins encoded by AS upregulated genes showed significant enrichment in different cell types versus AS downregulated ones. Proteins encoded by AS upregulated genes showed the most enrichment in monocytes, whereas those encoded by AS downregulated genes showed enrichment in CD8+ T cells and NK cells. Proteins encoded by AS upregulated genes also showed significant enrichment in gastrointestinal tissues, including rectum and colon.

Figure 4

AS up-regulated and down-regulated genes were enriched in different blood cell types, as well as gut and other cell types, in proteomics data. Heat map showing expression of AS differentially expressed genes based on proteomics data. The enrichment scores were normalized by the number of genes. Yellow to red colours indicate significance after Bonferroni correction (α=0.05), whereas white colour indicates no significance (NS). Basemean and fold change are as described in Figure 3.

AS gut consensus loci were enriched in pathways implicating microbes and the gut

Overall, the AS_GWS_40 SNPs were significantly enriched in nine gut cell types in 7_Enh enhancer regions. To dissect this further, we defined SNPs in 7_Enh enhancer regions with a locus score lower than 1 (see Materials and methods) in at least six gut cell types as consensus gut SNPs, and analysed these in more depth. DAVID (the Database for Annotation, Visualization and Integrated Discovery) annotation (version 6.8) with high classification stringency was employed to perform GO enrichment analysis. There were 14 gut consensus SNPs that were located in 13 loci (all 13 loci were matched with DAVID IDs). Enrichment analysis identified multiple pathways that implicate microbes and the gut: the GO term ‘response to bacterium’, with 7 out of 13 loci being involved (TNFRSF1A, LTBR, IL27, ERAP1, NFKB1, IL6R and NOS2, P-value=3.3 × 10−7; Supplementary Table S6), and the GO terms ‘response to external biotic stimulus’, ‘response to other organism’ and ‘response to biotic stimulus’ with the same 7 loci involved (P-value=6.6 × 10−6, 6.6 × 10−6 and 8.5 × 10−6, respectively). Out of the 13 AS loci, 5 were also GWS IBD associated (C1orf106, IL1RL2, IL27, NFKB1 and LTBR).
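The consensus rule described above (locus score lower than 1 in at least six gut cell types) can be illustrated with a minimal Python sketch. This is not the code used in the study: the function name, the nested-dictionary layout, the cell type labels and the SNP identifiers are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def consensus_snps(
    locus_scores: Dict[str, Dict[str, float]],
    cell_types: Sequence[str],
    score_cutoff: float = 1.0,
    min_cell_types: int = 6,
) -> List[str]:
    """Return SNPs whose locus score is below `score_cutoff` in at least
    `min_cell_types` of the given cell types.

    `locus_scores` maps SNP id -> {cell type -> locus score}; a cell type
    missing from the inner dictionary is treated as not passing.
    """
    selected = []
    for snp, scores_by_cell in locus_scores.items():
        n_passing = sum(
            1
            for cell in cell_types
            if cell in scores_by_cell and scores_by_cell[cell] < score_cutoff
        )
        if n_passing >= min_cell_types:
            selected.append(snp)
    return sorted(selected)


# Toy data with hypothetical labels for nine gut cell types.
gut_cells = [f"gut_{k}" for k in range(1, 10)]
toy_scores = {
    "rs_example_1": {cell: 0.2 for cell in gut_cells[:7]},  # passes in 7 cell types
    "rs_example_2": {cell: 0.2 for cell in gut_cells[:5]},  # passes in only 5
    "rs_example_3": {cell: 1.0 for cell in gut_cells},      # never below the cutoff
}
toy_consensus = consensus_snps(toy_scores, gut_cells)
```

In this toy example only `rs_example_1` meets the six-cell-type threshold.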

Two more gastrointestinal cell types (fetal large intestine and fetal small intestine) were identified using the AS_cross_disease_114 SNP set in enhancer regions (7_Enh, 10 gut cell types). A similar GO term enrichment analysis was performed on this SNP set, which is enriched for IBD genes (70 GWS IBD-associated SNPs out of 113 non-MHC SNPs). SNPs in 7_Enh enhancer regions with a locus score lower than 1 in at least six gut cell types were considered as consensus gut SNPs. There were 43 gut consensus SNPs that were located in 41 loci (36/41 were matched in DAVID), and enrichment analysis identified multiple pathways (Supplementary Table S7): the GO term ‘response to bacterium’, with 13 of 36 loci being involved (TNFRSF1A, FOS, CARD9, LTBR, PTGER4, IRF5, IL27, ERAP1, NFKB1, IL6R, NOS2, IL10 and TNFSF8, P-value=3.6 × 10−10); the GO term ‘response to biotic stimulus’, with 16 loci involved (CARD9, LTBR, PTGER4, IL27, NFKB1, IL6R, IL10, TNFSF8, LSM14A, FOS, TNFRSF1A, IRF5, ERAP1, NOS2, PPP2R3C and CD28, P-value=1.1 × 10−5); and the GO terms ‘response to other organism’ and ‘response to external biotic stimulus’, with 15 loci involved (CARD9, LTBR, PTGER4, IL27, NFKB1, IL6R, IL10, TNFSF8, LSM14A, FOS, TNFRSF1A, IRF5, ERAP1, NOS2 and CD28, P-values are both 8.4 × 10−10). There were eight IBD GWS loci (CARD9, FOS, IL10, IL27, IRF5, LTBR, NFKB1 and TNFSF8) and seven AS GWS loci (ERAP1, IL27, IL6R, LTBR, NFKB1, NOS2 and TNFRSF1A) among these genes. Of the 41 loci in this consensus SNP set, 26 (26 SNPs, matched to 21 DAVID IDs) are IBD GWS loci, whereas 16 (18 SNPs, matched to 16 DAVID IDs) are AS GWS loci. The 26 shared IBD GWS loci showed enrichment in the GO terms ‘response to bacterium’, ‘response to other organism’, ‘response to external biotic stimulus’ and ‘response to biotic stimulus’ (Supplementary Table S8), whereas the 16 AS GWS loci were enriched in the GO terms ‘response to bacterium’, ‘response to external biotic stimulus’, ‘response to other organism’, ‘response to biotic stimulus’ and ‘immune response’ (Supplementary Table S9).

Discussion

Now that large numbers of robust genetic associations have been identified for common diseases, a major challenge is to identify the causative variants and unravel the mechanisms through which they operate. This is particularly challenging in the case of intergenic variants, which make up ~65–72% of the known AS-associated variants, a similar proportion to other immune-mediated diseases. Furthermore, causal variants in LD with disease-associated variants may affect a limited number of cell types or tissues. Therefore, in the present study we leveraged cell type-specific epigenetic, gene expression and protein expression data to identify the cell types involved in AS pathogenesis using a larger AS-associated SNP set recently identified in a large cohort cross-disease study.8 A limitation of this kind of approach is the lack of epigenetic annotation and gene expression data for some AS-relevant tissues, including entheseal and synovial cell populations, and a paucity of data from relevant intestinal cell subsets, such as innate lymphoid cells and Paneth cells, that have previously been implicated in AS pathogenesis. Nevertheless, the data presented here indicate that monocytes, CD4+ and CD8+ T cells, NK cells, regulatory T cells and gut cells are relevant cell types in AS, in reasonable agreement with a recent study that used a similar epigenetic analysis with predicted causal SNPs (using PICS) and identified CD4+ T cells, Th1, Th17, monocytes and intestinal mucosal cells as being implicated in AS.30

Despite the strength of association with HLA-B27 and, more recently, components of class I antigen processing machinery such as ERAP1, ERAP2 and NPEPPS, the precise role of the HLA class I system of antigen presentation in AS pathogenesis remains unclear. Several genes involved in the regulation of CD8+ T-cell number and function are GWS associated with AS, including EOMES, IL7R, RUNX3 and TBX21 (that encodes the transcription factor, T-bet).11 Total CD8+ T-cell counts are lower in the periphery of AS patients compared with healthy controls,12 although numbers of cytotoxic, perforin-secreting CD8+ T cells are elevated in AS cases.37 Recently, carriage of risk variants in TBX21 has been shown to expand a population of IFN-γ+ IL-17+ inflammatory cells in the circulation of AS patients.38 However, the coordinated effects of IL-7R, EOMES, RUNX3 and T-bet on CD8+ T-cell function in AS remain to be determined.

Enrichment of AS-associated SNPs identified enhancer and gene expression data related to CD4+ memory T cells, with nine of these loci (RP11-84D1.2, IL6R, C1orf106, GPR65, NOS2, ERN1, FAM118A, NFKB1 and ERAP1) shared in both epigenetic and gene expression enrichment analyses. Nuclear factor-κB plays an important role in regulating different CD4+ T-cell subsets39 among other diverse immunological functions, and IL-6 is an important mediator of the immune response especially by directly acting on CD4+ T cells.40

Although the primary MHC association of AS is with HLA class I alleles such as HLA-B27, the disease is also associated with HLA class II alleles.13 CD4+ T cells, which classically recognize HLA class II-presented antigens, have been shown to be an important source of IL-17 in AS cases.22 CD4+ T cells expressing the killer cell immunoglobulin-like receptor, KIR3DL2, recognize HLA-B27 homodimers, and this recognition is associated with the secretion of large amounts of inflammatory cytokines.41 Genes such as ICOSLG, BACH2 and SH2B3 that regulate CD4+ T-cell function have also been associated with AS.11 The classic definition of a Th17 cell is that of a CD4+ T cell secreting IL-17. In this regard, CD4+ T-cell function may be heavily influenced by polymorphisms in components of the IL-23 signalling pathway, including IL23R, IL27, IL12B, JAK2 and TYK2, that have all been linked with increased AS susceptibility.8, 9, 10, 11 Our analyses here provide further evidence of the involvement of this cell type in AS pathogenesis.

B cells were identified through the enrichment of AS_cross_disease_114 SNPs in enhancer-related epigenetic regions. B cells were further implicated by AS downregulated genes being enriched in these cells (in contrast to the AS genes upregulated in T cells; Figure 3). Although there is some evidence that the B cell-depleting agent, rituximab, may be beneficial for AS patients, the response to that agent appears to be variable, with many patients being nonresponders.42 It remains to be determined whether individuals who responded to rituximab carry the SNPs that were enriched in B-cell signals.

The role of innate cell types has been largely overshadowed by the extensive efforts to characterize T-cell responses in AS. Our data suggest that NK cells and monocytes are important regulators of inflammatory processes in AS, and AS patient-derived monocytes express a range of inflammatory proteins.43 It is yet to be determined how monocytes contribute to AS pathogenesis, although they may present microbial antigens derived from an altered gut microbiome, or they may be an important source of inflammatory cytokines, such as IL-23. Indeed, monocytes have been shown to secrete inflammatory cytokines in axial SpA.44 Similarly, potential roles of NK cells in AS are not well described. In a simple model, NK cells may operate in AS through HLA–KIR interactions.23 We have recently shown that NK cells expressing the transcription factor T-bet are potent producers of IL-17 and interferon-γ.38 However, NK cells may also play important protective roles at certain tissue sites, including the gut where they secrete immunoprotective IL-22.45 The data presented here should focus research on understanding how CD4+ and CD8+ T cells, NK cells and monocytes operate in AS pathogenesis. Profiling the transcriptome of these cell types in large numbers of cases and controls would aid in understanding how they are altered in disease and how they operate at sites of inflammation in AS.

Microbial interaction with the gut has been suggested in AS, and there is a discrete microbial signature in the terminal ileum of patients with AS compared with healthy subjects.29 The microbial composition correlated with disease status, and greater differences were observed between disease groups than within disease groups. Furthermore, AS-associated genes play a role in gut immunity, particularly genes involved in the IL-23 pathway (reviewed in refs 46, 47). In the present study, the AS_GWS_40 SNP set showed enrichment in gastrointestinal cells in addition to immune cells. Our enrichment analyses identified GO term pathways, some implicating microbes and the gut: ‘defense response’, ‘response to bacterium’, ‘response to molecule of bacterial origin’, ‘defense response to bacterium’, ‘immune response’ and ‘innate immune response’. This is also in reasonable agreement with a study that applied pathway enrichment analysis with 30 AS-associated genes and identified the GO term pathways ‘cytokine production’ and ‘response to bacterium’ as the most relevant to AS.31 One likely factor contributing to the difference between the studies was that we analysed up- and down-regulated AS genes. Of the pathways we identified, a number of involved loci are associated with AS; NOS2 and CARD9 are known to be involved in the regulation of IgA function and the composition of the gut microbiota, and to maintain the balance between the host and the microbiota, whereas IL-10 plays a role in the development of intestinal immunity.48 Together with the results for the AS_GWS_40 SNP set, these findings indicate that not only IBD genes but also AS genes were enriched in gastrointestinal tissue. These results further implicate microbes and the gut in the aetiology of AS and highlight the overlap in genetic risk for IBD and AS. This provides further support to the hypothesis that AS is caused by interactions between the host gut immune system and the gut microbiome.49, 50

The relatively weak enrichment signals for bone cell types are consistent with bone disease being a secondary manifestation, not a pathogenic driver for AS. There are thought to be two disease phases in AS: an initial immune-driven inflammatory period that then progresses, to varying degrees, to the debilitating joint fusion that characterizes the disease, with the changes in the joints being secondary to the initial inflammatory changes.51, 52, 53 Unfortunately, there were no bone cell data available in gene expression or protein expression datasets for our analyses. Therefore, it is difficult to investigate whether AS-associated genes were differentially expressed in bone cell types. Nevertheless, changes in bone-specific genes would be likely to alter disease progression and severity rather than initiate AS pathogenesis.

Discriminating pathogenic from bystander inflammatory cell subsets has proven difficult in AS, with many genomic and functional studies implicating a wide range of cell types.8, 54, 55 This study shows that AS-associated variants primarily operate through effects in CD8+ T cells, CD4+ T cells, NK cells and monocytes, as well as gastrointestinal cells. It also highlights that AS pathogenesis is affected by the relationship between the host genetics, the intestinal microbiome and the immune response. As the number of cell type-specific databases increases, including epigenetics and gene expression data, further integration of multiomics data will continue to advance our understanding of AS pathogenesis.

Materials and methods

SNP sets

So far, 39 non-MHC susceptibility variants have shown genome-wide significance in AS-only analyses. However, a further 74 non-MHC GWS variants were identified in analyses leveraging pleiotropy with other related diseases (Crohn’s disease, psoriasis, primary sclerosing cholangitis and ulcerative colitis).8 The set of 113 SNPs together with the HLA-B27 tag SNP, rs116488202, was labelled ‘AS_cross_disease_114’, and the set of 40 AS-only GWS SNPs, ‘AS_GWS_40’. The complete list of SNP IDs used appears in Supplementary Table S10, and a summary of analyses performed in this study appears in Supplementary Table S11. All coordinates used in this study were converted using the UCSC (University of California, Santa Cruz) Genome Browser liftOver tool to the GRCh37/hg19 assembly.

Epigenetic marks data set

A total of 127 consolidated epigenomes were generated from the integrated study of 111 reference human epigenomes in the NIH Roadmap Epigenomics Mapping Consortium (Release 9 of the compendium) and 16 cell lines in ENCODE. Each of the 127 epigenomes contains chromatin immunoprecipitation sequencing data sets for a core set of histone modifications (H3K4me1, H3K4me3, H3K27me3, H3K36me3 and H3K9me3; for the putative functions of the histone modifications, see Supplementary Table S12). Epigenetic mark data utilized included all processed narrowPeak data of histone modifications, chromatin immunoprecipitation sequencing, DNase-seq (processed by MACS 2), reduced representation bisulfite sequencing data and the putative core 15 chromatin states data set for all 127 epigenomes (Supplementary Table S13). The marks analysed were those with data available in all 127 cell types (Supplementary Table S14).

Gene and protein expression data sets

Three gene expression data sets were used for differentially expressed gene set enrichment analysis, including FANTOM5 CAGE human expression data (533 cell types), RNA-Seq data from integrated NIH Roadmap compendium (57 cell types) and RNA-Seq data from BLUEPRINT of Haematopoietic Epigenomes project (36 cell types). We summed the CAGE entries for each gene and averaged them. The protein data set used was the Expression Atlas (https://www.ebi.ac.uk/gxa/experiments/E-PROT-1),56 a data set that contains protein expression level information for 23 cell/tissue types.

Identification of the AS differentially expressed gene set

The set of genes differentially expressed between AS patients and healthy controls was derived from an in-house RNA-Seq study that will be the subject of a separate report. This study was undertaken in accordance with the Australian National Statement on Ethical Conduct in Human Research (National Health and Medical Research Council of Australia) and was approved by the Metro South Health Service District Human Research Ethics Committee (HREC/05/QPAH/221) and The University of Queensland Medical Research Ethics Committee (reference 2006000102). All individuals involved gave written informed consent. A total of 69 patients who met the modified New York criteria for AS and 79 healthy controls, all of whom were Caucasian, were involved in the study. AS patients who were taking anti-TNF treatment were excluded from this study. The mean age of AS patients was 39.8±11.6 years, and of healthy controls, 40.9±11.2 years. Among the AS patients, the mean erythrocyte sedimentation rate was 22.2±21.5 mm h−1 (median, 13 mm h−1), mean C-reactive protein level was 16.0±16.2 mg l−1 (median, 9.2 mg l−1) and mean Bath Ankylosing Spondylitis Disease Activity Index was 5.3±2.6 (median, 5.8).

Peripheral blood mononuclear cell samples were collected and 1 μg of RNA from each sample was used for library preparation using the Illumina TruSeq Stranded Total RNA with Ribo-Zero for Human/Mouse/Rat Low-throughput sample prep kit (Illumina, San Diego, CA, USA) according to the manufacturer’s instructions. Samples were pooled into groups of 12. Flow cells were sequenced using reagents from TruSeq SBS kit v3 (Illumina) on either a HiSeq 2000 or a HiSeq 2500 (Illumina) running in high output mode. Read quality was assessed using FastQC in each sequencing lane. Sequenced reads were mapped back to the human genome reference hg19 using TopHat version 2.0.6 (http://ccb.jhu.edu/software/tophat/index.shtml). The table of counts from the mapped reads was generated by HTSeq. Data were assessed for systematic differences caused by factors other than patient status using principal components analysis from the lattice package for R (https://cran.r-project.org/). Only genes that had at least 10 reads mapped to them in at least 15 samples were included in the analysis (20 088 genes). Differential gene expression was determined using DESeq2.57
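The gene inclusion rule described above (at least 10 mapped reads in at least 15 samples) can be sketched in a few lines of Python. This is an illustration only, not the pipeline used in the study: the function name, the dictionary layout of the count table and the toy gene labels are hypothetical.

```python
from typing import Dict, List


def filter_low_count_genes(
    counts: Dict[str, List[int]],
    min_reads: int = 10,
    min_samples: int = 15,
) -> Dict[str, List[int]]:
    """Keep genes with at least `min_reads` reads in at least `min_samples` samples.

    `counts` maps a gene id to its per-sample read counts.
    """
    kept = {}
    for gene, per_sample in counts.items():
        n_passing = sum(1 for c in per_sample if c >= min_reads)
        if n_passing >= min_samples:
            kept[gene] = per_sample
    return kept


# Toy count table with 20 samples per gene (hypothetical values).
toy_counts = {
    "GENE_A": [12] * 20,            # 20 samples with >= 10 reads: kept
    "GENE_B": [12] * 14 + [0] * 6,  # only 14 samples with >= 10 reads: dropped
    "GENE_C": [9] * 20,             # never reaches 10 reads: dropped
}
filtered = filter_low_count_genes(toy_counts)
```

Filtering on a minimum count in a minimum number of samples removes genes with too little signal to test before differential expression is assessed.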

Cell type specificity score with an epigenetic mark

To compare the extent of enrichment of SNP location within epigenetic marks in specific cell types, we calculated a cell type specificity score. For each testing SNP set, i, and epigenetic mark, m, in cell type, c, we calculated the cell type specificity score,

$$S_{i,m,c} = \frac{n_{i,m,c}}{N_i} \qquad (1)$$

where $N_i$ is the number of SNPs in the SNP set i, and $n_{i,m,c}$ is the number of SNPs overlapping with the epigenetic mark, m, in cell type, c. A SNP was considered to be overlapping with the epigenetic mark, m, if any other SNP in LD (r2>0.8) with the original SNP was located within the mark narrowPeak region.
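As a rough illustration of the score described above (the number of SNPs in a set that overlap a mark, divided by the set size), the following Python sketch counts a SNP as overlapping when the SNP itself or any of its LD proxies falls inside a peak. Counting the index SNP itself is an assumption made here, as are the half-open peak coordinates, the function names and all toy identifiers; none of these are taken from the study.

```python
from typing import Dict, List, Sequence, Tuple

Peak = Tuple[str, int, int]   # (chromosome, start, end), treated as half-open
Position = Tuple[str, int]    # (chromosome, coordinate)


def overlaps_any_peak(pos: Position, peaks: Sequence[Peak]) -> bool:
    """True if the position lies inside at least one peak."""
    chrom, coord = pos
    return any(c == chrom and start <= coord < end for c, start, end in peaks)


def specificity_score(
    snp_set: Sequence[str],
    ld_proxies: Dict[str, List[str]],
    positions: Dict[str, Position],
    peaks: Sequence[Peak],
) -> float:
    """Fraction of SNPs in `snp_set` that overlap the mark directly or via
    a proxy SNP in LD (proxies are assumed to be precomputed at r2 > 0.8)."""
    n_overlap = 0
    for snp in snp_set:
        candidates = [snp] + ld_proxies.get(snp, [])
        if any(
            overlaps_any_peak(positions[s], peaks)
            for s in candidates
            if s in positions
        ):
            n_overlap += 1
    return n_overlap / len(snp_set)


# Toy example: one mark peak on chr1, four query SNPs (hypothetical ids).
toy_peaks = [("chr1", 100, 200)]
toy_positions = {
    "rs1": ("chr1", 150),        # inside the peak
    "rs2": ("chr1", 900),        # outside, but its proxy is inside
    "rs2_proxy": ("chr1", 120),
    "rs3": ("chr2", 150),        # different chromosome
    "rs4": ("chr1", 500),        # outside, no proxies
}
toy_ld = {"rs2": ["rs2_proxy"]}
toy_score = specificity_score(["rs1", "rs2", "rs3", "rs4"], toy_ld, toy_positions, toy_peaks)
```

Here two of the four toy SNPs overlap the peak (one directly, one through its proxy), giving a score of 0.5.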

Assessment of SNP enrichment in biological annotations

In order to calculate enrichment in biological annotations within the AS cross-disease SNP sets, we accounted for potential biases by preparing a pool of appropriately matched background SNPs for comparison with the query SNPs based on GC content, minor allele frequency, distance to nearest transcription start site and number of SNPs in LD with the testing SNP. GC content was calculated within a 101 bp window extending 50 bp on both sides of the testing SNP. European ancestry minor allele frequencies were extracted from the 1000 Genomes Project Phase 1 Integrated v3 call data set. The transcription start site locations were defined by the GENCODE Project (v19). The number of SNPs in LD (r2>0.8) with the query SNP was calculated from the European ancestry data from the 1000 Genomes Project Phase 1 Integrated v3 call data set. SNPs were grouped into different bins according to GC content, minor allele frequency, distance to transcription start site and number of LD SNPs. If no matched background SNP could be identified for the query SNP, SNPs in neighbouring bins were considered. A permutation test and multiple testing correction were then performed thus: a null distribution was constructed by 100 000 permutations. Each time, a random SNP set with the same number of SNPs was extracted from the matched background SNP sets, and the cell type specificity score was calculated under the same epigenetic mark. As described in equation (2), the empirical P-value was calculated as the adjusted proportion of permutations for which the cell type specificity score is equal to or greater than that of the query SNP set;58

$$P_{m,c} = \frac{x_{m,c} + 1}{X_{m,c} + 1} \qquad (2)$$

where $X_{m,c}$ is the total number of permutations (100 000) and $x_{m,c}$ is the number of permutations for which the cell type specificity score is equal to or greater than that of the query SNP set. Bonferroni correction was employed to correct for multiple testing.
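The permutation procedure can be sketched as follows in Python. This is a simplified illustration under stated assumptions, not the implementation used in the study: the "adjusted proportion" is assumed to take the common (x + 1)/(X + 1) form, each query SNP is assumed to have its own pool of matched background SNPs from which one SNP is drawn per permutation, and the function names and toy data are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def empirical_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """Adjusted proportion of permutations scoring >= the observed score.

    Assumption: the adjustment is the (x + 1) / (X + 1) form.
    """
    x = sum(1 for score in null_scores if score >= observed)
    return (x + 1) / (len(null_scores) + 1)


def permutation_p_value(
    observed: float,
    matched_pools: Sequence[Sequence[str]],
    score_fn: Callable[[List[str]], float],
    n_permutations: int = 1000,
    seed: int = 0,
) -> float:
    """Build a null distribution by repeatedly drawing one background SNP
    per query SNP, so that every random set has the query set's size."""
    rng = random.Random(seed)
    null_scores = []
    for _ in range(n_permutations):
        random_set = [rng.choice(pool) for pool in matched_pools]
        null_scores.append(score_fn(random_set))
    return empirical_p_value(observed, null_scores)


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy scoring function: background SNPs named "in_*" overlap the mark.
def toy_score_fn(snps: List[str]) -> float:
    return sum(1 for s in snps if s.startswith("in_")) / len(snps)


# Two query SNPs, each with a matched pool in which nothing overlaps.
toy_pools = [["out_1", "out_2"], ["out_3", "out_4"]]
p_toy = permutation_p_value(0.5, toy_pools, toy_score_fn, n_permutations=99)
```

With the +1 adjustment the empirical P-value can never be exactly zero; its smallest possible value is 1/(X + 1), which is why the number of permutations limits the significance that can survive Bonferroni correction.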

Locus score

In order to rank the informative loci, a locus score was calculated. Similar to the cell type specificity score, the locus score was calculated as the number of matched random SNPs overlapping with the epigenetic region divided by the total number of matched random SNPs. The locus score is the probability of a SNP (under the constraints of minor allele frequency, GC content, distance to transcription start site and number of LD SNPs) overlapping an epigenetic mark region by chance. The lower the score, the more significantly the target SNP overlaps with an epigenetic mark. Locus scores <1 were considered meaningful. Each SNP was assigned to the gene whose region, as defined in the Ensembl database (GRCh37, Release 75), was nearest to it.
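A minimal Python sketch of the two steps described above, the locus score as a fraction of matched random SNPs overlapping the mark and the nearest-gene assignment, is given below. It is illustrative only: the function names, the closed gene intervals, the tie-breaking by first occurrence and the toy gene labels are assumptions, not details from the study.

```python
from typing import Dict, Optional, Sequence, Tuple

GeneRegion = Tuple[str, int, int]  # (chromosome, start, end), treated as closed


def locus_score(matched_overlaps: Sequence[bool]) -> float:
    """Fraction of matched random SNPs that overlap the epigenetic region.

    `matched_overlaps[k]` is True if the k-th matched random SNP overlaps.
    """
    return sum(1 for flag in matched_overlaps if flag) / len(matched_overlaps)


def nearest_gene(
    snp_pos: Tuple[str, int], genes: Dict[str, GeneRegion]
) -> Optional[str]:
    """Return the gene on the same chromosome whose region is closest to the
    SNP (distance 0 if the SNP lies inside the region). Ties go to the gene
    encountered first; None is returned if the chromosome has no genes."""
    chrom, coord = snp_pos
    best_name: Optional[str] = None
    best_dist: Optional[int] = None
    for name, (g_chrom, start, end) in genes.items():
        if g_chrom != chrom:
            continue
        if start <= coord <= end:
            dist = 0
        elif coord < start:
            dist = start - coord
        else:
            dist = coord - end
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    return best_name


# Toy data with hypothetical gene labels and coordinates.
toy_genes = {
    "GENE_A": ("chr1", 100, 1000),
    "GENE_B": ("chr1", 2000, 3000),
    "GENE_C": ("chr2", 1040, 1060),
}
toy_locus_score = locus_score([True, False, False, False])
toy_gene = nearest_gene(("chr1", 1050), toy_genes)
```

In the toy example one of four matched random SNPs overlaps the region, giving a locus score of 0.25, and the SNP at chr1:1050 is assigned to `GENE_A`, which ends 50 bp upstream of it.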