Introduction

White blood cell count (WBC) is an important marker of the health of the immune system. It is also emerging as a risk factor for chronic diseases. Elevated levels of total WBC and of its differential cell lines have been identified as risk factors for multiple diseases including asthma (Gudbjartsson et al. 2009), coronary artery disease (Madjid et al. 2004; Weijenberg et al. 1996), myocardial infarction (Gudbjartsson et al. 2009), and cancer (Nalls et al. 2008; Shankar et al. 2007). Genome-wide association studies of total WBC (leukocyte count) and its constituent cell lines (neutrophils, basophils, eosinophils, monocytes, and lymphocytes) are expected to identify genes that predict WBC variability as the end result of a variety of inflammatory pathways.

WBC varies with clinical events, which is why it is often gathered in the course of clinical care. High levels are associated with infections or leukemias, while low levels are associated with immune failure or diseased bone marrow. Within the normal range, however, several studies have suggested associations between chronic WBC levels and a variety of conditions, likely because WBC is an indicator of overall inflammatory activity. While environmental influences (infection, cancer, etc.) are likely responsible for the majority of variability in acute WBC levels, chronic “resting state” WBC levels are likely to be influenced by genetic variation. We thus sought to use a genome-wide association study (GWAS) of “resting state” WBC to identify genetic loci, with the ultimate goal of furthering our understanding of genetic influences on resting state inflammation.

WBC is also known to vary among groups of different ancestry (Bach 2002), and understanding the biological mechanisms behind this variation may be useful in understanding the etiology of inflammation and its ultimate effects on chronic inflammatory diseases. A lower count of neutrophils, the major class of WBC in human peripheral blood, is reported in subjects with African ancestry (Nalls et al. 2008; Reich et al. 2009). This lower WBC in those of African ancestry has been attributed, in part, to an association with the Duffy antigen/chemokine receptor gene (DARC). The 336 amino acid Duffy glycoprotein is a receptor that binds cytokines released during inflammation.

Two SNPs (rs12075 and rs2814778) determine the Duffy antigen phenotype. The Duffy Fya and Fyb phenotypes differ at a single amino acid, glycine → aspartic acid (G42D), due to a 125 G → A substitution (rs12075) (Dean 2005). The rs12075 variant has been shown to be associated with levels of monocyte chemoattractant protein-1 (MCP-1) and other inflammatory mediators (Schnabel et al. 2010). The null phenotype variant rs2814778 is located in the erythroid promoter region and is one of 34 SNPs used to infer ancestral origin by the SNPforID Consortium (Phillips et al. 2007). Individuals with the FYB allele who are homozygous for rs2814778 (−33T → C) have the phenotype Fy(a−b−) and do not express Duffy antigens on their red blood cells (RBCs) (Dean 2005). RBCs that lack the Duffy Fya and Fyb antigens are resistant to invasion by the malaria parasites Plasmodium vivax and Plasmodium knowlesi, a phenotype posited to result in positive selection in African populations. Lower WBC in subjects of African ancestry has been attributed to this variant ([MIM 613665]) (http://www.omim.org).

We performed a pooled and genetically determined ancestry (GDA)-stratified analysis of total WBC identified through electronic medical records (EMRs) among 13,923 subjects in the electronic Medical Records and Genomics (eMERGE) Network to assess influence of genetic/genomic variations on WBC level (http://www.gwas.net). An advantage of our study sample from all participating eMERGE sites was that most of the 13,923 subjects have longitudinal measurements over multiple years as well as clinical covariates, such as medications, to identify and eliminate WBC observations that may have reflected acute conditions such as cancer or infection.

Methods

Selection and description of participants

The eMERGE Network is a consortium of five U.S. cohorts linked to EMR data for conducting large-scale, high-throughput genetic research (McCarty et al. 2011). Participating sites include the following: (1) Group Health Cooperative, University of Washington and Fred Hutchinson Cancer Research Center partnership, Seattle, WA, (2) Marshfield Clinic, Marshfield, WI, (3) Mayo Clinic, Rochester, MN, (4) Northwestern University, Evanston, IL, and (5) Vanderbilt University, Nashville, TN (Roden et al. 2008). The network brings together researchers with a wide range of expertise in genomics, statistics, ethics, informatics, and clinical medicine from leading medical research institutions across the country (McCarty et al. 2011). Each center participating in the consortium, organized by the National Human Genome Research Institute (NHGRI), proposed studying the relationship between genetic variation and one or more common human traits, using the technique of genome-wide association analysis. The Center for Inherited Disease Research (CIDR) at Johns Hopkins University and The Broad Institute of MIT and Harvard served as the genotyping centers for the network.

The WBC data extraction algorithm for EMRs was developed and validated by Group Health and the University of Washington. Group Health, Marshfield Clinic and Mayo Clinic reported >98% of subjects as having WBC records. Northwestern University reported 92% and Vanderbilt University reported 91%. Visit and subject level exclusion criteria are listed in Table 1. In general, we excluded any visit and/or subject whose values were possibly reflecting something other than resting state WBC levels. The subjects were removed if there was a record of dialysis or HIV infection at any time. Visit-level exclusions included inpatient and emergency room visits. Other visit-level exclusions included the following: (1) active infections, (2) medications affecting WBC levels, (3) chemotherapy window of 6 months prior and 3 months post index visit, and (4) a prior diagnosis of Alzheimer’s disease. The algorithm was presented to all participating sites and data were returned to Group Health for harmonization. Once harmonized, the data were transferred to the University of Washington for the association analyses.
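The visit- and subject-level exclusions described above can be sketched as a simple filter. This is a minimal illustration, not the eMERGE algorithm itself: the record layout (`Visit`, `Subject`), the field names, the reading of the chemotherapy window as "any chemotherapy from about 6 months before to about 3 months after the visit", and the day counts used for those months are all assumptions. The authoritative criteria are those listed in Table 1.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional

# Hypothetical record layout; field names are illustrative only.
@dataclass
class Visit:
    visit_date: date
    wbc: float                            # total WBC, K/ul
    setting: str = "outpatient"           # "outpatient", "inpatient" or "emergency"
    active_infection: bool = False
    wbc_altering_medication: bool = False

@dataclass
class Subject:
    subject_id: str
    visits: list = field(default_factory=list)
    chemo_dates: list = field(default_factory=list)
    dialysis_or_hiv: bool = False         # subject-level exclusion
    alzheimers_dx_date: Optional[date] = None

CHEMO_BEFORE = timedelta(days=183)        # assumed length of "6 months"
CHEMO_AFTER = timedelta(days=92)          # assumed length of "3 months"

def near_chemotherapy(visit_date, chemo_dates):
    """True if any chemotherapy date lies inside the assumed window around the visit."""
    return any(visit_date - CHEMO_BEFORE <= c <= visit_date + CHEMO_AFTER
               for c in chemo_dates)

def resting_state_visits(subject):
    """Return the visits that survive the subject- and visit-level exclusions."""
    if subject.dialysis_or_hiv:
        return []
    kept = []
    for v in subject.visits:
        if v.setting != "outpatient":
            continue                      # inpatient and emergency room visits
        if v.active_infection or v.wbc_altering_medication:
            continue
        if near_chemotherapy(v.visit_date, subject.chemo_dates):
            continue
        dx = subject.alzheimers_dx_date
        if dx is not None and dx <= v.visit_date:
            continue                      # prior Alzheimer's disease diagnosis
        kept.append(v)
    return kept
```

Keeping the subject-level check first means a single flag removes every visit for that subject, which mirrors the two-tier structure of the exclusion criteria.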

Table 1 White blood cell count subject/visit algorithm using electronic medical records for the five eMERGE participating sites

Technical information

Genotyping

Most subjects were genotyped on the Illumina Human660W-Quadv1_A (660W) genotyping platform. Samples from subjects who were self-reported (Northwestern) or observer-reported (Vanderbilt University) to have more recent African ancestry were genotyped on the Illumina Human1M-Duo (1M) genotyping platform. The 660W array included 561,490 SNPs and 95,876 intensity-only probes, and the 1M array consisted of a total of 1,199,187 loci. Genotyping calls for both platforms were made at CIDR and Broad using BeadStudio version 3.3.7 and Gentrain version 1.0. Vanderbilt University served as the eMERGE coordinating center, which conducted the QA/QC process for the genotypic data, with duplicate efforts at each site. Both samples and SNPs were assessed for quality and subsequently filtered from the production data if thresholds were not met. Cryptic relatedness was assessed for all sites, and pairs related at the half-sibling level (kinship coefficient \( \theta = k_{1}/4 + k_{2}/2 = 1/8 \)) or higher were randomly broken (by dropping one member) before assessing whole-genome association. Subjects identified for filtering at any particular site through the quality assurance/quality control (QA/QC) process were subsequently filtered from the entire merged data set. The QA/QC process covered four general categories: (1) assessment of genotyping batch quality, (2) assessment of sample quality, (3) validation of sample identity, and (4) assessment of SNP quality (Laurie et al. 2010; McCarty et al. 2011; Turner et al. 2011). Detailed QA/QC reports of all eMERGE projects are available for public access on the eMERGE web site.
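As an illustration of the relatedness filter, the sketch below computes a kinship coefficient from the probabilities of sharing one (k1) or two (k2) alleles identical by descent, using the standard relation kinship = k1/4 + k2/2, and randomly drops one member of each pair at or above the half-sibling value of 1/8. The function names, the input layout, and the fixed random seed are assumptions made for the example; the network's actual tooling is not described in the text.

```python
import random

HALF_SIB_KINSHIP = 1 / 8

def kinship(k1, k2):
    """Kinship coefficient from IBD-sharing probabilities k1 (one allele) and k2 (two alleles)."""
    return k1 / 4 + k2 / 2

def break_related_pairs(pairs, seed=0):
    """pairs: iterable of (id_a, id_b, k1, k2). Returns the set of subject IDs to drop.

    One member of each pair at or above the half-sibling threshold is chosen at
    random, unless the pair is already broken by an earlier drop.
    """
    rng = random.Random(seed)
    dropped = set()
    for id_a, id_b, k1, k2 in pairs:
        if kinship(k1, k2) < HALF_SIB_KINSHIP:
            continue
        if id_a in dropped or id_b in dropped:
            continue
        dropped.add(rng.choice([id_a, id_b]))
    return dropped
```

For example, a half-sibling pair (k1 = 0.5, k2 = 0) gives a kinship of 0.125 and is broken, while a pair with k1 = 0.25 and k2 = 0 gives 0.0625 and is left intact.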

The combined genotype data set was created at the Fred Hutchinson Cancer Research Center. All source projects were generated on the TOP strand from BeadStudio final report files, so (nominally) there was no need to flip strands before merging. The released SNPs in the Group Health Cooperative project (s = 560,732) were used as the prototype, and released SNPs from the other eMERGE projects were extracted. Merging these files on the SNPs that were called in every project yielded 532,566 variants in common. There were 1,564 strand-ambiguous SNPs (A/T and C/G polymorphisms) among the 532,566 intersecting SNPs. If the most frequent alleles at these SNPs generally agreed between source projects, we concluded that the strands were concordant prior to merging.
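The merging checks described above reduce to three small operations: intersecting SNP lists, flagging A/T and C/G SNPs, and comparing the most frequent allele across projects. The sketch below shows one way to express them; the data layout (a dictionary per project mapping SNP ID to its two alleles) and the function names are assumptions for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_strand_ambiguous(allele_a, allele_b):
    """A/T and C/G SNPs read the same on either strand, so a flip cannot be detected."""
    return COMPLEMENT[allele_a] == allele_b

def shared_snps(*projects):
    """Each project maps SNP ID -> (allele_a, allele_b); return IDs present in every project."""
    common = set(projects[0])
    for project in projects[1:]:
        common &= set(project)
    return common

def major_alleles_agree(major_allele_by_project):
    """True if every project reports the same most frequent allele for one SNP."""
    return len(set(major_allele_by_project)) == 1
```

Comparing the most frequent allele is informative only when allele frequencies are well away from 0.5, which is presumably why the text speaks of alleles that "generally" agreed.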

Imputation

Imputation analyses of untyped SNPs were performed using BEAGLE version 3.3, a freely available software program written in Java (Browning and Browning 2009). We imputed the region surrounding the Duffy antigen/chemokine receptor gene (DARC) on chromosome 1, specifically to infer the rs2814778 genotypes for the 120 AA subjects not genotyped on the 1M array. We used a cosmopolitan reference panel comprising phased HapMap3_r2 founder haplotypes from six Phase 3 populations: CEU, TSI, ASW, LWK, MKK, and YRI (Altshuler et al. 2010). Given the 19 Mb region of significance, we added a 5 Mb buffer to each end. Thus, the imputed result is a 29 Mb region (140.5–169.6 Mb, NCBI build 36), which encompasses 13,516 SNPs. As a quality check, we computed the concordance between the most likely imputed genotype and the genotyped result for rs2814778 in the 1,373 overlapping individuals (AA subjects typed on the 1M array and also imputed), which was 99.1%. The estimated allele dosage \( 0 \times P(AA) + 1 \times P(AB) + 2 \times P(BB) \) for rs2814778 was subsequently used in the association analysis for the 120 subjects without observed 1M array genotypes.
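The dosage formula and the concordance check can be made concrete with a few lines of code. This is a sketch under assumed inputs: each subject is represented by a tuple of posterior genotype probabilities (P(AA), P(AB), P(BB)), and the function names are ours rather than part of BEAGLE.

```python
def allele_dosage(p_aa, p_ab, p_bb):
    """Expected number of B alleles given posterior genotype probabilities."""
    total = p_aa + p_ab + p_bb
    if abs(total - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return 0 * p_aa + 1 * p_ab + 2 * p_bb

def most_likely_genotype(p_aa, p_ab, p_bb):
    """Genotype label with the highest posterior probability."""
    probs = {"AA": p_aa, "AB": p_ab, "BB": p_bb}
    return max(probs, key=probs.get)

def concordance(imputed_probs, observed_genotypes):
    """Fraction of subjects whose most likely imputed genotype matches the observed call."""
    matches = sum(most_likely_genotype(*p) == g
                  for p, g in zip(imputed_probs, observed_genotypes))
    return matches / len(observed_genotypes)
```

A subject with probabilities (0.1, 0.3, 0.6) has a most likely genotype of BB but a dosage of 1.5, which is why the dosage carries imputation uncertainty into the regression while a hard call would not.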

Principal components analysis (PCA) for ancestry

We used principal components analysis (PCA), essentially as described by Patterson et al. (2006), using independent, autosomal SNPs with missing call rates <5.0% and minor allele frequency >5.0% across the merged data set of 17,150 unique subjects (532,566 → 482,985 SNPs). To select independent SNPs, we utilized PLINK’s linkage disequilibrium (LD) pruning function (Purcell et al. 2007). For the first round of short-range LD pruning, we used the default settings of a 50 SNP window with a shift of five SNPs, and a pair-wise genetic correlation threshold of r² = 0.2. In a second round to remove long-range LD, we took the average number of SNPs over 5 Mb from the output of the first round as our window (s = 200), with a shift of 20 SNPs and again a threshold of 0.2. The resulting 105,291 SNPs were used to generate the principal components. For PCA, we utilized an in-house program (SNPRelate), which has no limit on the number of SNPs used for the covariance matrix (http://cran.r-project.org/web/packages/SNPRelate/index.html).
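The SNP selection above combines a per-SNP quality filter with LD-based pruning. The sketch below illustrates both ideas on genotypes coded as 0, 1 or 2 copies of an allele (with `None` for a missing call). It is a deliberately simplified greedy pruner and is not a reimplementation of PLINK's pruning procedure; the function names and the handling of the window are assumptions.

```python
from statistics import mean

def missing_rate_and_maf(genotypes):
    """Return (missing call rate, minor allele frequency) for one SNP."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 1.0, 0.0
    missing_rate = 1 - len(called) / len(genotypes)
    freq = sum(called) / (2 * len(called))
    return missing_rate, min(freq, 1 - freq)

def passes_qc(genotypes, max_missing=0.05, min_maf=0.05):
    """Thresholds follow the text: missing call rate < 5% and MAF > 5%."""
    missing_rate, maf = missing_rate_and_maf(genotypes)
    return missing_rate < max_missing and maf > min_maf

def genotype_r2(x, y):
    """Squared Pearson correlation of allele counts over subjects called at both SNPs."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0
    mx = mean([a for a, _ in pairs])
    my = mean([b for _, b in pairs])
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def greedy_prune(snps, window=50, threshold=0.2):
    """Keep a SNP unless its r2 with an already kept SNP in the preceding window exceeds the threshold."""
    kept = []
    for idx, genotypes in enumerate(snps):
        recent = [k for k in kept if idx - k < window]
        if all(genotype_r2(genotypes, snps[k]) <= threshold for k in recent):
            kept.append(idx)
    return kept
```

With real data the SNPs would be ordered by chromosomal position, so that the index window approximates a physical neighbourhood.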

Supplemental Figure S1 shows a plot of the first two eigenvectors from an analysis of 17,150 subjects from all sites, along with HapMap controls (CEU, YRI, CHB and JPT). The solid and dotted lines are the means and standard deviations (SD) for eigenvectors 1 and 2 for self-identified EA and AA subjects, respectively. The first eigenvector, accounting for 4.8% of the variance, separated the self-identified (all sites other than Vanderbilt) or observed-identified (Vanderbilt) EA subjects from the self-identified or observed-identified AA subjects. The second eigenvector, accounting for 0.3% of the variance, separated the self-identified Asian subjects from the other ancestry groups.

To define the EA sample, we identified all subjects with values less than three (+3) and greater than negative one (−1) SD from the mean of eigenvectors 1 and 2 of self-identified European ancestry, respectively (see Figure S1).

For the AA sample, we identified all subjects with values less than two (+2) and greater than negative one (−1) SD from the mean of eigenvector 1, and less than and greater than one (±1) SD for eigenvector 2 of self-identified African ancestry subjects, respectively. These boundaries were chosen as an attempt to select a homogeneous set of ancestry groups based on visual inspection of the eigenvector plots. Respective numbers of self-identified and genetically determined ancestry are listed in Table 2 and discussed in the “Results” section.
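The boundary rules for genetically determined ancestry amount to standardizing each eigenvector against a self-identified reference group and keeping subjects inside a rectangle in SD units. The sketch below shows that logic. The AA bounds follow the text directly; the EA bounds encode one reading of the EA rule (both eigenvectors between −1 and +3 SD) and should be treated as an assumption, as should the function names.

```python
from statistics import mean, stdev

def sd_units(values, reference):
    """Express values as SDs from the mean of the self-identified reference group."""
    m, s = mean(reference), stdev(reference)
    return [(v - m) / s for v in values]

def in_ancestry_box(ev1, ev2, ref_ev1, ref_ev2, bounds_ev1, bounds_ev2):
    """Return one boolean per subject: True if both eigenvectors fall inside the bounds."""
    z1 = sd_units(ev1, ref_ev1)
    z2 = sd_units(ev2, ref_ev2)
    return [bounds_ev1[0] < a < bounds_ev1[1] and bounds_ev2[0] < b < bounds_ev2[1]
            for a, b in zip(z1, z2)]

AA_BOUNDS = ((-1, 2), (-1, 1))   # eigenvector 1, eigenvector 2 (from the text)
EA_BOUNDS = ((-1, 3), (-1, 3))   # assumed reading of the EA rule
```

Because the reference mean and SD come from the self-identified group, subjects who self-identify differently can still fall inside the box, which is how genetically determined and self-identified counts come to differ.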

Table 2 Summary statistics of demographic data and phenotypes by eMERGE participating site and combined

Statistics

To assess heterogeneity of potential phenotypes and covariates among sites, we used the summary function in the Hmisc package (http://cran.r-project.org/web/packages/Hmisc/index.html) (Harrell 2004) for R (Table 2). We used Pearson chi-square tests for categorical variables and Kruskal–Wallis tests for continuous variables (>2 groups). Clinical characteristics deemed important for the association analyses were then analyzed for significant association with the phenotypes of interest using ordinary least squares. A majority of the subjects from all sites had multiple visits over many years for the duration of the EMR. While other within-subject measures were considered, we used the median value of WBC and its differentials for a given subject, which provides a natural transformation of the data and is less sensitive to outlying extreme values than other summaries such as the mean. As a further check, we identified and excluded outliers whose median WBC value fell more than two standard deviations from the mean of the median values.
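The per-subject summary and the outlier rule can be sketched as follows. The input layout (a dictionary from subject ID to a list of WBC values from retained visits) and the function names are assumptions for illustration.

```python
from statistics import mean, median, stdev

def subject_medians(wbc_by_subject):
    """Collapse each subject's repeated WBC measurements to a single median."""
    return {sid: median(values) for sid, values in wbc_by_subject.items() if values}

def drop_outliers(medians, n_sd=2.0):
    """Exclude subjects whose median lies more than n_sd SDs from the mean of medians."""
    values = list(medians.values())
    centre, spread = mean(values), stdev(values)
    return {sid: v for sid, v in medians.items() if abs(v - centre) <= n_sd * spread}
```

A subject with visits of 6.0, 6.2 and 15.0 K/μl has a median of 6.2 but a mean of about 9.1, which shows why the median is the more robust summary of a resting-state level.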

We analyzed total WBC and each of its differential components as quantitative phenotypes (ordinary least squares) adjusted for the following variables: (1) sex, (2) median BMI, (3) median age, (4) eigenvectors 1 and 2 derived from PCA, and for the joint analysis (5) study site. For the GDA-stratified analyses, eigenvectors 1 and 2 were dropped from the model. We analyzed each dependent variable with the given covariates and the genotype of one SNP coded as 0, 1 and 2 copies of the minor allele (additive genotypic model) in PLINK. We deemed genome-wide significance at a p value less than 5.0e−8, which approximated a Bonferroni correction for approximately 1,000,000 independent hypothesis tests (Schnabel et al. 2010).
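To make the additive model and the significance threshold concrete, the sketch below fits an ordinary least squares regression with an intercept, a genotype coded 0/1/2, and covariates, and computes the Bonferroni threshold 0.05 / 1,000,000 = 5e−8. It is a small pure-Python illustration and not the PLINK implementation used in the study; it returns coefficient estimates only (no standard errors or p values), and the function names are assumptions.

```python
def solve_linear_system(a, b):
    """Solve a small dense system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(predictor_rows, y):
    """Least squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    x = [[1.0] + list(row) for row in predictor_rows]
    p = len(x[0])
    xtx = [[sum(xi[j] * xi[k] for xi in x) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(x, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)

# Bonferroni threshold for roughly one million independent tests.
GENOME_WIDE_ALPHA = 0.05 / 1_000_000
```

In a call such as `ols([(genotype, sex, bmi, age), ...], median_wbc)`, the second returned coefficient is the per-allele effect β reported for each SNP.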

Results

Descriptive statistics by eMERGE study site

Table 2 contains descriptive statistics of total WBC and its subtypes along with covariates by eMERGE study site and combined. Continuous variables are illustrated as box-percentile plots by study site in Supplemental Figure S2. NU (26%) and VU (34%) had the highest proportions of participants who self-identified or were observed-reported as having more recent African ancestry. Most (83%) participants self-identified or were observed-reported as having European ancestry. These numbers reflect the percentages of genetically determined ancestry groups (see “Methods”) for both the African and European continent. For AA subjects, NU and VU had the largest percentage at 22 and 32%, respectively. All other sites were at 3% or less. Only a small percentage of subjects from all sites combined (<1%) self-reported an ethnicity of Hispanic or Latino. The differences of WBC and the differentials by site are illustrated in Supplemental Figure S2, and listed in Table 2. There was a significant difference (p < 0.001) in median WBC, with NU and VU having the highest median values of 6.8 and 6.9 K/μl, respectively.

Effects of covariates on WBC

Median age, median BMI and sex were used as standard covariates for all multivariate models considered. The effects of BMI and sex on WBC are similar for the GDA-stratified and pooled analyses as illustrated in Supplemental Figure S3. Higher BMI correlated with higher WBC. On average, men had lower WBC than women. Age had suggestive effects in the stratified ancestry groups, although the confidence intervals for both effects cross β = 0.0. In the AA group, higher age correlated with lower WBC. In the EA group, higher age correlated with higher WBC. There is virtually no effect of age on WBC in the pooled analysis, but the variance may have been captured in the significant site effects.

White blood cell count association

We performed pooled and GDA-stratified association analyses as defined in the “Methods” section. The results for the major class of WBCs, the neutrophils, strongly mimicked the WBC results as expected (data not shown). The results presented were classified into two categories, one for the AA subjects and one for the EA subjects. For the AA subjects, there was a strong genome-wide association centered on the 1q21–q22 region, specifically the Duffy antigen/chemokine receptor gene (DARC) detailed in the “Introduction” section. Figure 1 illustrates this peak, which traverses the centromere on chromosome 1. The genome-wide significant association for the EA subjects was found at the 17q21 region, tagging several genes of interest. Both regions of significance were GDA-specific, but for completeness the SNP association results are presented in Table 3 for each stratum as well as the pooled analyses. We also performed association analysis by site as a sensitivity analysis and have presented the results in Supplemental Table 1. Effect sizes and p values by site are listed for the SNPs presented in Table 3 except for the Duffy SNP rs2814778. This variant was not genotyped at all sites and therefore is not provided. We also do not provide ancestry-stratified results for each site, as PCA was performed across the entire consortium. The site-specific results were suggestive of the joint results, as expected. In general, it is more advantageous to perform a joint analysis than a replication-based analysis (Skol et al. 2006). Performing PCA across the entire consortium to control for ancestry effects allowed for this joint analysis.

Fig. 1

Manhattan and Q–Q plots of p values from the WBC association analyses of subjects with GDA of African continent

Table 3 Summary of effects of loci that reached genome-wide significance for the AA, EA and pooled analyses

Duffy antigen/chemokine receptor gene (DARC)

Figures 1, 2 and 3 illustrate the Manhattan and Q–Q plots of p values from the AA, EA and pooled analyses, respectively. We also provide in Fig. 1 a zoomed-in plot of chromosome 1 that shows the chromosomal width of the extensive association peak. As illustrated in the Q–Q plot, there was marginal genomic inflation (lambda = 1.052), most likely driven by the extensive chromosome 1 region of significance, which is likely attributable to a selective sweep for Duffy as noted. While many genes fell within the peak (19 Mb), the most significant result (p value ≈ e−24) was a missense mutation (rs12075) found in the DARC (“Duffy”) gene. Table 3 outlines the SNP association results for rs12075 on WBC. The pooled analysis model included eigenvectors 1 and 2 derived from the PCA. The AA subjects with the minor allele for rs12075 on average had a higher median WBC (β = 1.28, S.E. = 0.12). The EA association result was suggestive (p value = 0.04), but the effect was small and in the opposite direction (β = −0.04, S.E. = 0.02). As rs12075 is the SNP that is responsible for the FYA and FYB Duffy phenotypes, we also assessed the corresponding Duffy variant rs2814778, which produces the null phenotype, in subjects genotyped on the 1M genotyping platform.
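The genomic inflation factor quoted above is conventionally the median of the observed 1-df chi-square statistics divided by the median expected under the null (about 0.455). The text does not state how lambda was computed, so the sketch below assumes that conventional definition; the function names are ours.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

def p_to_chisq(p):
    """1-df chi-square statistic corresponding to a two-sided p value with 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p value must lie in (0, 1]")
    # The lower-tail quantile avoids rounding 1 - p/2 to 1.0 for very small p values.
    return _STD_NORMAL.inv_cdf(p / 2.0) ** 2

# Median of the 1-df chi-square distribution (approximately 0.455).
EXPECTED_MEDIAN = p_to_chisq(0.5)

def genomic_inflation(p_values):
    """Lambda: median observed chi-square statistic over its expected null median."""
    return median(p_to_chisq(p) for p in p_values) / EXPECTED_MEDIAN
```

Because lambda depends only on the median statistic, a broad region of true association can raise it modestly even when most of the genome behaves as expected under the null.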

Fig. 2

Manhattan and Q–Q plots of p values from the WBC association analyses of subjects with GDA of European continent

Fig. 3

Manhattan and Q–Q plots of p values from the pooled WBC association analyses of subjects from all eMERGE sites

The majority of these samples were self-reported or observed-reported as African ancestry at NU and VU, respectively. For the AA subjects not genotyped on the 1M array (n = 120), we included the imputed dosages in the analyses (see “Methods”). The AA association for rs2814778 (the null variant) was more significant (p value = 6.71e−55) than the Duffy missense mutation rs12075 with a similar effect size and direction (β = 1.35, S.E. = 0.08). As expected, this allele was very rare in subjects of European ancestry.

We were able to explore a novel multi-SNP Duffy model with an interaction term because we had genotypes for both rs2814778 and rs12075 in the AA subjects (see Table 3). The LD between these two SNPs in the subjects genotyped on the 1M array was modest (r² = 0.394). When both SNPs were added to the model, the rs12075 effect was no longer significant (p value = 0.63), while the rs2814778 effect remained significant at a genome-wide level (p value = 1.62e−32). When the interaction term was added (see Table 3), the rs2814778 effect (β = 1.37, S.E. = 0.11) remained significant (p value = 7.19e−33), while both rs12075 and the interaction between the SNPs were moderately significant, although not at a genome-wide level (p value ≈ 0.001). This model accounts for about 20% of the WBC variance in our AA sample (multiple r² = 0.203). Residuals from this two-SNP interaction model were extracted and analyzed as a phenotype. Figure 4 illustrates the Manhattan and Q–Q plots of the p values generated from these analyses. There were no genome-wide significant associations, and inflation was moderate (lambda = 1.039); the X-linked associations previously evident in Fig. 1 were no longer present. Among those earlier signals, two 3′ untranslated SNPs (rs2209549 and rs7052314) tagged two genes, KIAA2022 and ZDHHC15, respectively. Both hypothetical protein LOC340533 (KIAA2022) and zinc finger, DHHC-type containing 15 isoform 1 (ZDHHC15) have been linked to X-linked mental retardation. A relationship, if any, between these genes and WBC is no longer supported. Also tagged, by SNP rs12832571 of unknown function, was UPRT. This gene encodes uracil phosphoribosyltransferase, which plays an important part in nucleotide metabolism, specifically the pyrimidine salvage pathway (Kent et al. 2002).
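Written out, a two-SNP model with an interaction term has the general form shown here. The notation is ours and is an assumption for illustration: \( g_{1} \) and \( g_{2} \) denote the additive allele counts (or imputed dosages) at rs2814778 and rs12075, \( \mathbf{c} \) the covariates described in the “Statistics” section, and \( \varepsilon \) the residual that was subsequently analyzed as a phenotype.

\[ \mathrm{WBC} = \beta_{0} + \beta_{1} g_{1} + \beta_{2} g_{2} + \beta_{3}\, g_{1} g_{2} + \boldsymbol{\gamma}^{\top} \mathbf{c} + \varepsilon \]

In this form, \( \beta_{3} \) captures how the effect of one variant changes with the number of copies of the other, and dropping the \( \beta_{3} \) term recovers the two-SNP model without interaction.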

Fig. 4

Manhattan and Q–Q plots of p values from the analysis of residuals derived from the Duffy multivariate model described in the “Results” section. Subjects of African ancestry were analyzed

Supplemental Figure 4 illustrates the difference of WBC by Duffy phenotypes (Fya+b+, Fya+b−, Fya−b+, Fya−b−) in the subjects of African ancestry. As expected, the neutrophils showed the same associations with Duffy antigens as the total leukocyte count, with Duffy null (Fya−b−) having a reduced count. Notably, the lymphocytes, eosinophils, monocytes, and basophils demonstrated the opposite trend, with Duffy null being associated with higher counts.

17q21.1 region

There was a genome-wide significant association with median WBC for EA subjects centered on the 17q21.1 region. Several genes were tagged by the genome-wide significant SNPs presented in Table 3, all located in this region. Figure 2 illustrates the Manhattan and Q–Q plots of the p values generated from the analyses (lambda = 1.034). Figure 2 also provides a zoomed-in plot of chromosome 17, and further of the 17q21.1 region illustrating genes and directions of transcription. Moving in order of increasing genomic coordinates, GSDMA is one of the genes tagged with the most significant association in both the EA (p value ≈ e−12) and pooled (p value ≈ e−11) analyses. The pooled sample minor allele for this intronic variant (rs3859192) was also the ancestral allele (A), although its frequency was 0.45. In the EA sample, higher numbers of copies of this allele were associated with higher WBC levels (β = 0.14, S.E. = 0.02), and this association was present in the overall pooled sample (β = 0.14, S.E. = 0.02). This SNP was in moderate LD (r² = 0.492) with a missense variant (rs3894194) that was suggestive in our EA (p value = 2.01e−07) and pooled (p value = 2.29e−07) samples (see Table 3). Higher numbers of copies of the minor allele were associated with higher values of WBC. For the EA sample β = 0.11 (S.E. = 0.02), and for the pooled sample β = 0.10 (S.E. = 0.02). Moving towards 3′ as illustrated in Fig. 2, the next significant association was found in the PSMD3 gene, at intronic SNP rs4065321. As listed in Table 3, this variant is in moderate LD with the GSDMA intronic SNP rs3859192 (r² = 0.458). The MAFs (0.45 and 0.44), effect sizes (both β = 0.14) and significance (p value = 3.47e−11 and 1.43e−12) were similar to those of rs3859192 in the EA and pooled analyses. On the anti-sense strand, there was a genome-wide significant intronic SNP (rs9916158) found in the MED24 gene. As outlined in Table 3, this SNP was significant in the EA (p value = 4.92e−10) and pooled analyses (p value = 8.86e−10). The sample minor allele (A) was associated with lower median WBC. For the EA sample β = −0.13 (S.E. = 0.02), and for the pooled sample β = −0.13 (S.E. = 0.02). This variant is in moderate LD with the GSDMA intronic SNP rs3859192 (r² = 0.410). None of the 17q21.1 SNPs associated with WBC in EA analyses were found to be significantly associated in analyses of AA (all p values > 0.05).

Discussion

Through the eMERGE Network, we were able to successfully mine electronic medical records linked to five U.S. cohorts encompassing 13,923 subjects for WBC. This EMR algorithm, which attempted to remove acute and chronic influences on WBC levels, was developed at one site and confirmed at the other network sites. Our findings extend previous reports of two regions of interest unique to subjects of genetically determined ancestry to the African or European continents and additionally identify a novel interaction for the former. Of interest, we were able to detect these ancestry-specific associations in pooled analyses. Further, these results may inform a previously reported asthma association on chromosome 17q21.1 and its impact on the expression of ORMDL3 or GSDMB genes (Halapi et al. 2010; Kabesch 2010; Moffatt et al. 2007, 2010; Verlaan et al. 2009).

Duffy antigen/chemokine receptor gene (DARC)

Our chromosome 1 association with WBC in the AA sample provided unified support of previous findings through admixture mapping (Nalls et al. 2008) and SNP associations (Reich et al. 2009; Schnabel et al. 2010). These included the Duffy null polymorphism (rs2814778) with its association with the major WBC component, neutrophils (Reich et al. 2009), and the Duffy missense mutation (rs12075) responsible for the two principal antigens, Fya and Fyb, and its association with monocyte chemoattractant protein-1 (MCP-1) (Schnabel et al. 2010). The direction of effect in our results supports the hypothesis that DARC on red blood cells acts as a chemokine sink, limiting the stimulation of leukocytes by IL-8 in the blood (Pruenster and Rot 2006). Chemokines play a key role in trafficking leukocytes in the blood. The two Duffy variants interact to account for WBC variance observed in our AA sample, as evidenced by the lack of association from the genome-wide analyses of the residuals derived from the novel two-SNP interaction model described in the “Results” section (Fig. 4). Still unexplained is the X chromosome peak in the AA subjects. This signal was also eliminated when “Duffy” was controlled for in the residual analysis derived from the two-SNP interaction model. While we continue to study these effects, we suspect they may represent an artifact of different patterns of admixture in males versus females of recent African origin. The atypically broad region of association on chromosome 1 demonstrated in Fig. 1 may be due to a selective sweep in the genetic region surrounding Duffy in geographic areas with high prevalence of P. vivax and P. knowlesi. This selective sweep also leads to spurious associations in regional genes not relevant to WBC. Such associations could have been misleading had Duffy not been typed. This suggests the need for caution in interpreting associations in atypically broad regions of association.

17q21.1 region

Our 17q21.1 association results in the EA subjects provided further evidence that this region may play a role in regulating inflammatory processes and disorders. In our EA subjects, we identified a genome-wide significant region that tagged three separate genes: (1) GSDMA, (2) MED24, and (3) PSMD3. The HaemGen consortium used meta-analysis to identify one variant (rs17609240) with an association with WBC of genome-wide significance (Soranzo et al. 2009); this variant was on 17q21.1 near ORMDL3, a known susceptibility locus for childhood asthma (childhood asthma [MIM 610075]) (Moffatt et al. 2007; Soranzo et al. 2009). A study of WBC in Japanese individuals identified another SNP (rs4065321) in the same region (Kamatani et al. 2010). Others have linked this region to neutrophil count levels (Okada et al. 2010), and to the inflammatory disorders asthma (Halapi et al. 2010; Moffatt et al. 2007, 2010; Verlaan et al. 2009), Crohn’s disease (IBD22 [MIM 612380]) (Barrett et al. 2008) and type 1 diabetes (IDDM [MIM 222100]) (Barrett et al. 2009). Other genes in the same region include GSDMA, MED24, and PSMD3. PSMD3 encodes one of the non-ATPase subunits of the 19S regulator lid for the multicatalytic proteinase complex that is involved in many cellular functions including inflammatory responses and apoptosis (Kent et al. 2002). Another gene in close proximity is CSF3, which encodes colony stimulating factor 3, a cytokine controlling the production, differentiation and function of granulocytes (Hollard et al. 1975; Soranzo et al. 2009). ORMDL3 has been identified as a potential risk factor for asthma (Breslow et al. 2010; Halapi et al. 2010; Kabesch 2010; Moffatt et al. 2010; Verlaan et al. 2009). Through a combination of global and target studies, Breslow et al. (2010) identified Orm proteins as homeostatic regulators of sphingolipid biosynthesis. Moffatt et al. identified genetic 17q21.1 variants regulating ORMDL3 expression and their contribution to the risk of childhood asthma (Moffatt et al. 2007). Two of these variants associated with asthma (rs3894194 and rs3859192) are missense and intronic SNPs found in GSDMA that were associated with WBC in our EA sample. With associations to multiple inflammatory diseases in addition to WBC and its differential, this 17q21.1 region is an excellent candidate for pleiotropy analyses (Kabesch 2010; Verlaan et al. 2009).

Our own and the other reported associations in the 17q21.1 region are located near the European-enriched 17q21.31 MAPT inversion polymorphism region (Zody et al. 2008). Allergic and autoimmune diseases, including asthma, type 1 diabetes and Crohn’s disease, are not evenly distributed among continents, countries, or ethnic groups (Bach 2002) and this inversion could play a role. The incidence of disease decreases from north to south in the Northern Hemisphere (Bach 2002). Further investigation of whether this inversion and the lack of recombination surrounding this region may play a role in any 17q21 association is warranted to determine if these associations are an artifact of admixture.

Conclusion

Our results provided further evidence that variation in WBC levels does have a genetic component. While Duffy is likely the source of association in the AA subjects, further elucidation of the biological mechanism driving the EA association in the 17q21.1 region is needed. These regions are excellent candidates for exploring pathway enrichment as well as gene–gene interaction for genes with SNPs meeting genome-wide significance. In particular, we find that two SNPs previously associated with asthma (rs3894194 and rs3859192) were also associated with WBC. Joint study of these SNP effects on asthma and WBC is warranted to determine if these are independent effects. Such studies will inform the reported relationship of WBC to multiple chronic diseases. Finally, we demonstrated that phenotypes such as WBC can be mined from existing EMRs and translated between sites with different systems and coding schemes, which allowed efficient study of multiple phenotypes considering the same genotyped subjects.