Introduction

The increasing prevalence of type 2 diabetes (T2D) is one of the greatest challenges in public health, causing enormous costs and a decrease in quality of life [1]. In Europe, T2D accounts for 80%–90% of all diabetes cases and decreases life expectancy by 5–10 years [2]. The risk of T2D is determined by a complex interplay of genetic and environmental factors and can be partly modified by changes in lifestyle [3].

The heritability of T2D has been estimated at 26%–69% [4]. Genome-wide association studies (GWAS) have enabled major advances in the identification of single-nucleotide polymorphisms (SNPs) associated with T2D risk [5,6,7,8]. Recently, several studies have highlighted the potential of genomic risk scores (GRSs) for risk prediction of common diseases. They identified 8.0, 6.1, 3.5, 3.2, and 1.5% of the population at greater than threefold increased risk for coronary artery disease, atrial fibrillation, type 2 diabetes, inflammatory bowel disease, and breast cancer, respectively [9, 10]. For predicting disease risk, a GRS has a notable advantage over conventional clinical factors because it can identify high-risk individuals from birth [11].

A recent T2D-GRS using 66 SNPs derived from European-ancestry participants indicated that individuals with a high GRS and unhealthy lifestyle factors had an increased risk of T2D. Compared with participants with the healthiest lifestyle and a low T2D-GRS, the relative risk of T2D for participants with the least healthy lifestyle and a high T2D-GRS was 8.72 [12]. Nevertheless, the clinical utility of a GRS depends on its ability to predict future T2D events, not on the strength of its association with T2D. Several studies have examined the utility of combining susceptibility variants for T2D prediction [13, 14]. These studies showed that the added value of a GRS over clinical factors was marginal, although statistically significant. In previous work on coronary artery disease prediction, multiple GRSs were combined into one meta-score (metaGRS) to improve predictive performance [15]. However, this metaGRS construction method has not yet been applied to the prediction of T2D.

Here, we extend the metaGRS method to predict T2D risk by incorporating GWAS summary statistics for T2D and its risk factors. The metaGRS is constructed and validated using the UK Biobank (UKB) database and compared with a previously published T2D-GRS [6]. Most previous diabetes prediction models were conventional risk models based on clinically available indicators such as lipids, blood pressure, and family history. We therefore also examine the improvement in prediction performance and reclassification accuracy after adding the metaGRS to the conventional risk model.

Methods

Study design and participants

The study design and methods of UKB have been reported previously [16]. In brief, UKB is a large-scale prospective study of 502,527 participants aged 37–73 years who were recruited in 2006–2010. At recruitment, detailed information was collected via a standardized socio-demographic questionnaire, covering health status, physician-diagnosed medical conditions, family history, and lifestyle factors. Several physical measurements were obtained, including height, weight, waist–hip ratio (WHR), systolic blood pressure (SBP), and diastolic blood pressure (DBP). Individual data were linked to Hospital Episode Statistics (HES) data and to national death and cancer registries. HES uses the International Classification of Diseases (ICD), 9th and 10th Revisions, to record diagnosis information. Death registries include all deaths in the UK, with both primary and secondary causes of death coded in ICD-10. The UKB study has approval from the North West Multicentre Research Ethical Committee. All participants provided written informed consent.

For diabetes cases, we used HES/death data (ICD-10: E10–E14, ICD-9: 250), diabetes diagnosed by a doctor (data files #2443), and glucose medication (data files #6153, #6177, #20003); T2D cases were defined by ICD-10 code E11.

We defined risk factors at the first assessment, including body mass index (BMI, data files #21001), smoking status (data files #20116), triglycerides (TG, data files #30870), high-density lipoprotein (HDL, data files #30760), low-density lipoprotein (LDL, data files #30780), glucose (data files #30740), hypertension, and family history of diabetes. For hypertension, we used an expanded definition: blood pressure medication (data files #6153, #6177), SBP > 140 mm Hg (the mean of two SBP measurements, data files #4080), DBP > 90 mm Hg (the mean of two DBP measurements, data files #4079), or HES/death data (ICD-10: I10–I15). For the family history of diabetes, we considered history in any first-degree relative (father, mother, sibling; data files #20107, #20110, #20111, respectively).
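The expanded hypertension definition is a logical OR over four criteria, which can be sketched as follows. This is only an illustration of that rule: the `BaselineRecord` layout, its field names, and the helper functions are hypothetical and do not correspond to UK Biobank column names or to the authors' code.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


# Hypothetical record layout; the field names are illustrative only.
@dataclass
class BaselineRecord:
    sbp_readings: List[float] = field(default_factory=list)  # mm Hg
    dbp_readings: List[float] = field(default_factory=list)  # mm Hg
    on_bp_medication: bool = False
    icd10_codes: List[str] = field(default_factory=list)


def has_icd10_in_range(codes, letter, low, high):
    """Return True if any code lies in letter+low .. letter+high (e.g. I10-I15)."""
    for code in codes:
        code = code.strip().upper()
        if len(code) >= 3 and code[0] == letter and code[1:3].isdigit():
            if low <= int(code[1:3]) <= high:
                return True
    return False


def has_hypertension(rec):
    """Apply the expanded definition: any one criterion is sufficient."""
    if rec.on_bp_medication:
        return True
    # Thresholds are applied to the mean of the available readings.
    if rec.sbp_readings and mean(rec.sbp_readings) > 140:
        return True
    if rec.dbp_readings and mean(rec.dbp_readings) > 90:
        return True
    return has_icd10_in_range(rec.icd10_codes, "I", 10, 15)
```

Because the thresholds are strict inequalities, a participant whose mean SBP is exactly 140 mm Hg is not classified as hypertensive on blood pressure alone.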

The design of this study and the detailed inclusion and exclusion criteria are shown in Fig. 1. We randomly divided the UKB British white data set into a training set (n = 40,423) and a validation set (n = 303,053). To increase statistical power in the metaGRS generation phase, we enriched the training set with 7,558 T2D cases that had been excluded from the validation set due to T2D diagnosis at baseline, giving 9,913 T2D cases and 38,068 controls.

Fig. 1

Study design. GRS, genomic risk score; GWAS, genome-wide association studies; SNP, single-nucleotide polymorphism. ¹ These 7,558 events were cases excluded from the validation set due to T2D diagnosis at baseline

Generation of metaGRS

The training data set was used to construct the GRSs and was excluded from further analysis. The genotyping process and arrays used in the UK Biobank study have been described previously [17]. We constructed 17 GRSs for phenotypes associated with T2D: T2D [6], HbA1c [18], 2-h blood glucose (2hGlu), fasting glucose (FG), fasting insulin (FI) [19], HDL, LDL, total cholesterol (TC), TG [20], SBP, DBP [21], waist, hip, WHR [22], BMI [23], height [24], and smoking [25]. In total, we selected 1,692 SNPs associated with the corresponding phenotypes at genome-wide significance (P < 5e-8) in populations of European descent (details in supplementary table). SNPs were thinned for linkage disequilibrium (LD) at r² < 0.1 using PLINK [26]. To control for population structure, we used the genetic principal components (PCs) supplied by UKB [27].

On the basis of the selected SNPs, the 17 GRSs were calculated separately, using a weighted method (the sum of the risk allele dosages of each variant multiplied by its marginal effect size):

$$\mathrm{Weighted\ GRS}_{i}=\sum_{j=1}^{m}x_{ij}\beta_{j} \qquad (1)$$

where \(x_{ij}\in \left\{0, 1, 2\right\}\) is the count of risk alleles for the jth variant in the ith individual, and \(\beta_{j}\) is the marginal effect size for the jth variant obtained from the reported GWAS data.
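As a minimal sketch of Eq. (1), the weighted score for one individual is the dot product of risk-allele counts and effect sizes. The function name and the example values are hypothetical; real dosages would come from genotype data and the effect sizes from published GWAS summary statistics.

```python
def weighted_grs(dosages, betas):
    """Weighted GRS for one individual: sum over variants of x_ij * beta_j.

    dosages: risk-allele counts (0, 1, or 2) for each of the m variants.
    betas:   marginal effect size for each variant, in the same order.
    """
    if len(dosages) != len(betas):
        raise ValueError("dosages and betas must have the same length")
    return sum(x * b for x, b in zip(dosages, betas))


# Illustrative values only: three variants with made-up effect sizes.
example_score = weighted_grs([0, 1, 2], [0.10, 0.05, 0.20])
```

In this made-up example the first variant contributes nothing (zero risk alleles), the second contributes 0.05, and the third contributes 2 × 0.20, for a score of about 0.45.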

Each GRS was standardized to zero mean and unit standard deviation over the entire data set. Next, we employed elastic-net logistic regression [28] using the R package “glmnet” [29] to model the associations between the 17 GRSs and T2D, adjusting for sex and ten PCs. A range of models with different penalties was evaluated using tenfold cross-validation. We selected the model with the highest cross-validated area under the receiver operating characteristic curve (AUC) as the final model to generate the metaGRS and evaluated it in the UKB validation set.

We generated the metaGRS as a weighted average of the standardized scores:

$$\mathrm{GRS}_{i}^{\mathrm{meta}}=\frac{\alpha_{1}\mathrm{GRS}_{i1}+\dots +\alpha_{17}\mathrm{GRS}_{i17}}{\alpha_{1}+\dots +\alpha_{17}} \qquad (2)$$

where \(\mathrm{GRS}_{i1},\dots ,\mathrm{GRS}_{i17}\) are the 17 standardized GRSs (zero mean, unit variance) for the ith individual, and \(\alpha_{1},\dots ,\alpha_{17}\) are the coefficients (log odds ratios) for each of the 17 GRSs.
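The standardization step and Eq. (2) can be sketched together as below. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the coefficients would in practice come from the fitted elastic-net model, and the use of the population standard deviation is an assumption because the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a list of scores to zero mean and unit standard deviation.

    Assumption: population standard deviation (pstdev) is used.
    """
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant score")
    return [(v - mu) / sd for v in values]


def meta_grs(scores_by_trait, alphas):
    """Combine per-trait GRSs into one meta-score per individual (Eq. 2).

    scores_by_trait: one list per trait, each holding that trait's raw GRS
                     for every individual (same individual order throughout).
    alphas:          one coefficient (log odds ratio) per trait.
    """
    if len(scores_by_trait) != len(alphas):
        raise ValueError("need exactly one coefficient per trait")
    total = sum(alphas)
    if total == 0:
        raise ValueError("coefficients sum to zero")
    z = [standardize(scores) for scores in scores_by_trait]
    n_individuals = len(z[0])
    return [
        sum(a * z_k[i] for a, z_k in zip(alphas, z)) / total
        for i in range(n_individuals)
    ]
```

A trait whose coefficient is shrunk to zero by the penalty drops out of both the numerator and the denominator, so it has no influence on the combined score.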

Statistical analysis

The demographic and clinical characteristics of the UKB validation subset were described using the median and interquartile range for continuous variables and the frequency and percentage for categorical variables. The subsequent analysis focused only on incident T2D events. To account for death as a competing event, we evaluated the metaGRS in the validation subset using the competing risk model proposed by Fine and Gray [30], as implemented in the R package “cmprsk”; we estimated hazard ratios (HRs) with 95% confidence intervals (CIs) and computed five-year T2D event probabilities. Based on the predicted five-year incidence probability of T2D for each individual and whether T2D actually occurred, we calculated the C-index and AUC in the presence of competing risks. We employed the competing risk model to estimate the cumulative incidence of T2D, stratified by sex, with the Gray test for comparison between groups. AUC and C-index were based on a five-year follow-up window, as previously described [31], using the R packages “pROC” and “Hmisc,” respectively. The difference in AUC between two models was tested using “roc.test”, and other differences in effect size were tested using the two-sample z test [32]. All analyses were adjusted for age, sex, and ten PCs.
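For intuition about what the AUC measures here, it is the probability that a randomly chosen individual who developed T2D within the window was assigned a higher predicted risk than a randomly chosen individual who did not. The sketch below computes that quantity directly from pairs. It is a simplified illustration with a hypothetical function name: it treats the outcome as a plain binary label and ignores censoring and competing risks, so it does not replace the analysis described above.

```python
def pairwise_auc(predicted_risk, event):
    """AUC as the fraction of (case, non-case) pairs ranked correctly.

    predicted_risk: predicted probability for each individual.
    event:          1/True if the individual had the event, else 0/False.
    Ties in predicted risk count as half a concordant pair.
    """
    if len(predicted_risk) != len(event):
        raise ValueError("inputs must have the same length")
    cases = [r for r, e in zip(predicted_risk, event) if e]
    non_cases = [r for r, e in zip(predicted_risk, event) if not e]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for r_case in cases:
        for r_non in non_cases:
            if r_case > r_non:
                concordant += 1.0
            elif r_case == r_non:
                concordant += 0.5
    return concordant / (len(cases) * len(non_cases))
```

The double loop is quadratic in sample size, which is fine for illustration but not for a cohort of several hundred thousand; rank-based implementations give the same value much faster.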

The net reclassification improvement (NRI) was used to assess the potential for improved discrimination between incident and non-incident cases when new factors were added to the T2D prediction model [33]. A base model including all conventional risk factors was compared with an alternative model that additionally included the metaGRS being evaluated. The R package "survIDINRI" was applied for NRI analysis.
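The continuous (category-free) NRI rewards the new model for moving predicted risk up in individuals who had the event and down in those who did not. A minimal sketch of that definition follows. The function name is hypothetical, and the sketch assumes fully observed binary outcomes; handling censored follow-up, as a survival-based NRI must, is not shown.

```python
def continuous_nri(base_risk, new_risk, event):
    """Continuous NRI = [P(up|event) - P(down|event)]
                      + [P(down|non-event) - P(up|non-event)].

    base_risk: predicted risk from the base model, per individual.
    new_risk:  predicted risk from the model with the added predictor.
    event:     1/True if the individual had the event, else 0/False.
    """
    if not (len(base_risk) == len(new_risk) == len(event)):
        raise ValueError("inputs must have the same length")
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for b, n, e in zip(base_risk, new_risk, event):
        if e:
            n_e += 1
            up_e += n > b
            down_e += n < b
        else:
            n_n += 1
            up_n += n > b
            down_n += n < b
    if n_e == 0 or n_n == 0:
        raise ValueError("need at least one event and one non-event")
    return (up_e - down_e) / n_e + (down_n - up_n) / n_n
```

The statistic ranges from -2 to 2 because it is the sum of two net proportions, one among events and one among non-events.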

All statistical tests were two-sided, and P < 0.05 was considered significant. All analyses were conducted in R, version 3.6.1.

Results

The characteristics of UKB participants in the validation subset are shown in Table 1. The overall UKB validation set consisted of 303,528 participants with a median age of 58 years (range 38–73 years). There were 6,724 (2.22%) incident cases of T2D during a median follow-up of 8.90 years (range 0–11.03 years), comprising 2,714 cases among women (1.64%) and 4,010 cases among men (2.92%).

Table 1 Baseline characteristics of UK Biobank validation data set

Using the independent UKB validation set, we next evaluated the association between metaGRS and T2D via survival analysis adjusted for age, sex, and ten PCs. The metaGRS was associated with incident T2D with an HR of 1.32 (95% CI 1.29–1.35) per standard deviation of metaGRS, which was higher than that of T2D-GRS (HR = 1.30 [95% CI 1.27–1.33]), although the difference did not reach statistical significance (P = 0.12). Hazard ratios associated with metaGRS quartiles are shown in Table 2. Using the bottom metaGRS quartile as the reference, the top metaGRS quartile of the population was at 2.08-fold increased risk of T2D (95%CI: 1.93–2.23). To investigate the potential role of the metaGRS in earlier-life genetic screening, we compared the sex-stratified cumulative incidence of T2D across quartiles of the metaGRS, as shown in Fig. 2. The metaGRS quartiles showed substantial differences in the cumulative incidence of T2D (Gray test across the four groups: P < 0.001). For men, the T2D risk in the top quartile of metaGRS reached 4.15% (95%CI: 3.93%–4.37%) by 10 years of follow-up, whereas the risk in the bottom quartile reached 2.05% (95%CI: 1.89%–2.20%). In women, the pattern was similar, but T2D risk was lower overall than in men. For women in the highest metaGRS quartile, T2D risk reached 2.36% (95%CI: 2.21%–2.52%) by 10 years of follow-up, whereas women in the lowest metaGRS quartile were at extremely low risk, reaching 1.16% (95%CI: 1.05%–1.26%). The difference between men and women in the highest metaGRS quartile was significant (P < 0.001). These results were consistent with the competing risk model of the metaGRS quartiles.

Table 2 Hazard ratios for incident type 2 diabetes associated with metaGRS in the validation data set
Fig. 2

Cumulative incidence of type 2 diabetes by quartiles of metaGRS in males and females

Next, we compared the predictive performance of the metaGRS with that of conventional risk factors, as shown in Fig. 3. We examined eight conventional risk factors at baseline: smoking status, glucose, TC, HDL, LDL, BMI, hypertension, and family history of diabetes. BMI had the largest C statistic (0.787, 95%CI: 0.779–0.794). Notably, metaGRS had a higher C statistic and AUC than smoking and family history of diabetes (AUC of metaGRS vs smoking: 0.684 vs 0.671, P < 0.001; vs family history of diabetes: 0.684 vs 0.668, P < 0.001). The C statistic associated with metaGRS was 0.684 (95%CI: 0.676–0.693), which was higher than that of T2D-GRS (0.681, 95%CI: 0.672–0.690). For both metaGRS and T2D-GRS, the AUC had the same value as the corresponding C statistic, and the difference in AUC between the two GRSs was significant (P < 0.001). The addition of metaGRS to all conventional clinical risk factors modestly but significantly increased the C statistic from 0.850 (95%CI: 0.843–0.856) to 0.854 (95%CI: 0.848–0.860), an increment of 0.004; the AUC increased from 0.851 (95%CI: 0.844–0.857) to 0.855 (95%CI: 0.849–0.862), also an increment of 0.004 (P < 0.001). MetaGRS plus conventional risk factors and T2D-GRS plus conventional risk factors had the same AUC and the same C statistic. The 5-year T2D probability by age group, based on the addition of metaGRS to all conventional risk factors, is shown in the supplementary figure.

Fig. 3

C-index (95%CI) for incident T2D in UK Biobank validation comparing metaGRS with conventional risk factors. T2D, type 2 diabetes, TG, triglycerides, HDL, high-density lipoprotein, LDL, low-density lipoprotein, BMI, body mass index, GRS, genomic risk score

Adding metaGRS to all conventional risk factors significantly improved the reclassification accuracy (continuous NRI = 11.8%, 95%CI: 9.2%–14.2%, P < 0.001).

Discussion

In this study, we constructed a metaGRS based on GWAS summary statistics for 17 traits, comprising T2D and its risk factors. We evaluated the predictive power of the metaGRS by comparing it with established conventional risk factors and found that adding the metaGRS to clinical information had a significant effect.

In the UKB British white data set, the incidence of type 2 diabetes was higher in men than in women (2.92% vs 1.64%). First, we showed that the new metaGRS had a stronger association with T2D than the previously published T2D-GRS, although the difference did not reach significance (HR: 1.32 vs 1.30, P = 0.12). The AUC associated with metaGRS was 0.684 (95%CI: 0.676–0.693), which was higher than that of T2D-GRS (0.681, 95%CI: 0.672–0.690; P < 0.001). In addition, individuals in the top metaGRS quartile had a 2.08-fold HR and a higher cumulative incidence of T2D than those in the bottom quartile. Next, we found that metaGRS had higher predictive power than family history of diabetes and smoking status, but lower than the other conventional risk factors. Adding metaGRS to all conventional risk factors significantly improved predictive power and improved reclassification accuracy by 11.8%.

Our findings suggest that the addition of genetic factors only marginally improved the C-index and AUC beyond the clinical risk model, although the differences were statistically significant. Moreover, metaGRS plus clinical risk factors and T2D-GRS plus these factors had the same AUC and the same C statistic. Similar results have been reported elsewhere, with the C statistic in a similar range [11]. This indicates that the modest improvement in C-index was not a result of genetic variant selection. In line with our results, a T2D-GRS constructed from 49 variants showed limited power to discriminate between susceptible and unsusceptible individuals in a Japanese population [34]. Although gene sequencing is slightly more expensive than measuring clinical indicators, it needs to be performed only once in a lifetime and can help predict disease risk from birth. A key measure of the clinical utility of a survival model is its ability to discriminate those who will develop T2D from those who will not. Although the C statistic and AUC are the most popular metrics, the increase in them is often very small in magnitude [35]. Therefore, we used the NRI to quantify the improvement. We observed that adding metaGRS to conventional factors yielded an 11.8% improvement in reclassification accuracy. This indicates that metaGRS has some clinical value for correctly reassigning individuals among risk categories.

The advantage of our study is that we examined a new metaGRS, constructed from a large number of susceptibility loci for T2D and T2D-related traits, in about three hundred thousand UKB participants with complete genotype information, and demonstrated its clinical value in predicting T2D. Our study also had some limitations. The UKB participants have a minimum enrollment age of 38 years and have been shown to be healthier than the UK general population [36]. In addition, some diabetes cases could not be classified as type 1 or type 2 diabetes. Thus, our study may have underestimated population-level lifetime T2D risk.

In conclusion, the metaGRS, constructed using 1,692 SNPs strongly associated with 17 traits (T2D and T2D-related traits) based on GWAS summary statistics, was significantly related to an increased risk of T2D in the European population. This genetic information significantly improved T2D prediction ability. It lays the groundwork for larger GWAS of T2D as well as analyses that leverage the totality of information available for T2D genomic risk prediction.