It is a truth universally acknowledged that personal genome sequences will be a core component of individualized health care in the coming decades [23]. It is rightly claimed that genomic medicine should be predictive, personalized, preventive, and participatory, meaning that individuals will be encouraged to understand their own health risks and take preemptive measures to avert the onset of disease [5, 25]. Yet expert geneticists are debating whether genotypic predictors are now or will ever be more predictive than family history and clinical indicators [17, 28, 62], and there is reasonable skepticism surrounding causal inference from rare deleterious variants [37, 39]. So while tremendous progress is being made toward routine incorporation of whole genome sequence analysis for rare congenital disorders detected at birth [4, 22, 40, 56], and in cancer diagnosis and prognostics [9, 10, 18, 51], broader application to the complex common diseases that eventually afflict most adults remains to be introduced. The purpose of this chapter is to argue that the gap between hype and reality [53] needs to be addressed on two fronts: recalibration of expectation from prediction to classification and incorporation of functional genomic data into integrative predictive health.

The WHOLE approach, whose name is an acronym for Wellness and Health Omics Linked to the Environment, also places emphasis on the concept of wellness. Whereas the focus of most western medicine is on curing illness, universal public health strategies should attend more to disease prevention. As Dr. Ken Brigham at Emory University, one of the leaders of this movement, remarks [7], the goal of health care should be to assure that “as many of us as possible should age with grace and die with painless dignity of natural causes.” Our vision at the Center for Health Discovery and Well Being in Atlanta [6, 47] is that genomic data will be integrated into primary medical care precisely for this purpose, to help people make better lifestyle choices that promote the maintenance of good health.

There are three major challenges we see that need to be confronted, which are discussed successively below. The first is the development of genomic classifiers that explain a sufficient proportion of the variance in disease risk to be informative. In the near future, these will be genotype based, incorporating rare and common variants, clearly utilizing advanced statistical methodologies but also requiring adjustment for population genetic differences [15] and family structure [49]. The second is integration of sequence data with other genomic data types [19, 25, 33], such as transcriptomic, epigenomic, and metabolomic profiles, as well as with relevant clinical and biochemical measures and family history data. Whether or not the environment can be directly incorporated as well is an open question [2], though it can be argued that functional genomic data captures lifetime environmental exposure indirectly. The third challenge is working out how to present all of this data to healthy adults in a manner that is understandable and sufficiently actionable that they will commit to positive health behaviors. To this end I conclude with an outline of one strategy that is likely to involve the training of a new generation of professional genomic counselors.

1.1 Genomic Classifiers

The foundation of genomic classification is always likely to be genotypic. Single nucleotide polymorphisms identified through genome-wide association studies [28] or classical candidate gene approaches can be combined to discriminate cases and controls more accurately than single locus classifiers [61, 63]. The simplest multivariate scores are allelic sums, where the number of alleles that are associated with disease is tallied across all identified loci. For n loci, the score theoretically ranges from zero to 2n, and the distribution is approximately normal, although it will be skewed as a function of the allele frequency spectrum. Few individuals will have extreme values, but under a liability threshold model, it is assumed that individuals with scores at the top of the range are at the most elevated risk of disease.
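To make the tally concrete, the following is a minimal sketch of a simple allelic sum. The 20 loci, their risk allele frequencies, and the 1,000 simulated individuals are hypothetical, and the genotypes are drawn under an assumed Hardy-Weinberg equilibrium rather than taken from any study cited here.

```python
import random


def allelic_sum(genotypes):
    """Tally risk alleles across loci; each entry is 0, 1, or 2 copies."""
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("each genotype must be 0, 1, or 2 risk alleles")
    return sum(genotypes)


def simulate_genotypes(risk_allele_freqs, rng):
    """Draw one person's genotypes, assuming Hardy-Weinberg proportions."""
    # Each of the two chromosomes carries the risk allele with probability f.
    return [int(rng.random() < f) + int(rng.random() < f)
            for f in risk_allele_freqs]


rng = random.Random(1)
freqs = [rng.uniform(0.1, 0.9) for _ in range(20)]  # hypothetical frequencies
scores = [allelic_sum(simulate_genotypes(freqs, rng)) for _ in range(1000)]

# Every score lies between zero and 2n, here 0 to 40.
print(min(scores), max(scores))
```

Plotting a histogram of `scores` for different choices of `freqs` shows how the shape of the distribution depends on the allele frequency spectrum.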

Risk can be modeled as a function of the score as a predictor in the same sense as Framingham risk scores predict likelihood of onset of disease in a given time period [8, 59], or more simply individuals above and below an appropriate value can be classified as high or low risk. For type 2 diabetes, a classifier based on 18 loci established that individuals with the top 1 % of simple allelic sum scores (25 or more risk alleles) have quadruple the risk relative to the bottom 2 % (fewer than 12 risk alleles) and slightly more than double the risk of the general population [30; see also 58]. This measure only marginally improves on the Framingham risk score for diabetes [60] and alone does not approach it for predictive power. However, at least in our CHDWB study the two measures (allelic sum and FRS) are only mildly correlated (unpublished observation), and so it is interesting to ask whether extreme genotype scores may suggest an alternative mode of diabetes risk.

A slightly more sophisticated approach is to weight the allelic scores by the magnitude of their effect. If one allele has a relative risk of 1.4, then it should have twice the impact of one with a relative risk of 1.2. In practice, it is not clear that weighted allelic sums improve on simple ones (Fig. 1.1a), perhaps reflecting the small amount of variance explained by current models built with variants that in general collectively explain no more than 20 % of disease risk. There is also likely to be large error in the estimation of individual allelic effects due both to sampling biases and to incomplete linkage disequilibrium (LD) between tagging SNPs and unknown causal variants. Nevertheless, for type 1 diabetes, a multiplicative allelic model based on 34 loci that collectively explain 60 % of the expected genetic contribution has been introduced [14, 44]. A score with a sensitivity of 80 % is achieved in 18 % of the population even though less than half of one percent is type 1 diabetic. However, the positive predictive value remains fairly low since more than 90 % of those who score positive are still false positives. It seems that for rare diseases (less than 1 % of the population), it is unlikely that genotypic measures will ever be predictive in a clinical setting. Nevertheless, as a screening tool, there may be enormous financial and medical value in focusing resources on the highest risk portion of the population and excluding those least at risk from unnecessary surveillance or treatment.
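The arithmetic behind these two points can be sketched as follows. The weighting by excess relative risk (relative risk minus one) is an assumption chosen so that an allele with relative risk 1.4 counts twice as much as one with 1.2, as in the example above; weighting by the logarithm of the relative risk or odds ratio is a common alternative. The prevalence of 0.005 is likewise an assumed upper bound standing in for "less than half of one percent".

```python
import math


def weighted_allelic_sum(genotypes, relative_risks):
    """Sum risk-allele counts weighted by excess relative risk (RR - 1).

    The choice of weight is an illustrative assumption, not a standard.
    """
    return sum(g * (rr - 1.0) for g, rr in zip(genotypes, relative_risks))


def positive_predictive_value(sensitivity, prevalence, fraction_flagged):
    """Fraction of flagged individuals who actually have the disease."""
    true_positives = sensitivity * prevalence  # as a share of the population
    return true_positives / fraction_flagged


# One copy of an RR 1.4 allele versus one copy of an RR 1.2 allele.
print(weighted_allelic_sum([1, 0], [1.4, 1.2]),
      weighted_allelic_sum([0, 1], [1.4, 1.2]))

# Type 1 diabetes example: 80 % sensitivity, 18 % of the population flagged.
ppv = positive_predictive_value(0.80, 0.005, 0.18)
print(f"PPV is about {ppv:.1%}; false share of positives {1 - ppv:.1%}")
```

Under these assumptions roughly 2 % of those flagged would be true cases, which is why the score is better suited to screening than to prediction.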

Fig. 1.1
Comparison of risk scores. The three xy plots compare risk scores generated by three different methods, applied to a simulated dataset consisting of 200 disease SNPs measured in 1,000 people. The alleles range in risk allele frequency from 0.1 to 0.9 with a bias toward lower frequencies, and effect sizes were drawn from a normal distribution with mean of zero and standard deviation of 0.07. (a) Comparison of simple allelic sum score and weighted allelic sum score, showing a modest effect of weighting the sum by the effect size. Red points highlight individuals in the top decile of scores. (b) Comparison of simple allelic sum score and probability calculation from odds ratios obtained following the method in Morgan, Chen, and Butte [35] which computes the probability of disease from the summation of log odds ratios that are necessarily conditioned on the allele frequency. Despite increased variance of the score reflecting the multiplicative nature of the risk assessment (due to summation of log odds), the correlation in ranks is strong. (c) Comparison of probability scores for the same data as in (b) with computations assessed after randomizing the frequencies of one-quarter of the alleles, showing how population structure potentially affects disease risk assessment even where allelic effect sizes are assumed to be constant

If allele sums are used, it also makes sense to attempt to weight scores by allele frequencies. Two individuals may have the same score, but if one of them has most of the risk attributed to alleles that are not typically the risk allele in the population, whereas the other has the common high-risk variants, then it stands to reason that the former is likely to be at elevated overall relative risk. This is illustrated in Fig. 1.1b. An obvious way to achieve the weighting is to convert relative risks into odds ratios, sum the logarithms of those odds ratios, and regenerate a probability of disease [35]. Starting with a baseline risk for the relevant gender, ethnicity, and age group, each successive allele adds to or subtracts from the log odds, which are a function of the allelic effect and frequency.
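One way to sketch this computation is shown below. It is not a reproduction of the method in [35]; it assumes a per-allele odds ratio at each locus, Hardy-Weinberg proportions in both cases and controls, and independence among loci, and the function names, genotypes, odds ratios, frequencies, and baseline risk are all hypothetical. Under those assumptions the log likelihood ratio of a genotype with g risk alleles is g·log(OR) − 2·log(1 − f + OR·f), where f is the risk allele frequency in controls.

```python
import math


def genotype_log_lr(risk_allele_count, odds_ratio, control_freq):
    """Log likelihood ratio of a genotype in cases versus controls.

    Assumes a per-allele odds ratio and Hardy-Weinberg proportions in
    both groups, so the term depends on effect size and allele frequency.
    """
    return (risk_allele_count * math.log(odds_ratio)
            - 2.0 * math.log(1.0 - control_freq + odds_ratio * control_freq))


def disease_probability(genotypes, odds_ratios, control_freqs, baseline_risk):
    """Shift the baseline log odds by each locus, then convert back."""
    log_odds = math.log(baseline_risk / (1.0 - baseline_risk))
    for g, odds_ratio, f in zip(genotypes, odds_ratios, control_freqs):
        log_odds += genotype_log_lr(g, odds_ratio, f)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical person: three loci, baseline risk of 8 % for their group.
p = disease_probability([2, 0, 1], [1.3, 1.2, 1.1], [0.2, 0.6, 0.5], 0.08)
print(round(p, 4))
```

Because each locus term is centered on the control population, a locus with an odds ratio of one leaves the baseline risk unchanged, and carrying fewer risk alleles than expected lowers the estimate.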

The immediate problem with this approach is that it is susceptible to variation in allele frequencies among populations. Two people with identical weighted allelic sums may nevertheless have very different relative risks according to whether they are, for example, of African, Asian, or European descent (Fig. 1.1c). Somewhat paradoxically, heterozygosity at a single contributing locus can either increase or decrease the odds in different ethnicities, according to whether the risk allele is rare or common in either population. Accommodations can be made by deriving separate multi-allelic scores for each ethnicity, but an additional complication arises where admixture (population mixing) exists, which is the norm in contemporary America at least. Perhaps risk scores should be adjusted by the allelic frequencies expected of individuals with the observed mixture of ethnicities, but a case for local ancestry adjustment with phased genomes can be made [54, 55], and then the issue of the appropriate baseline prevalence arises. It is not yet clear how much of an issue this is, and clearly much more research needs to be done, likely also including attention to geographically structured cultural and environmental modifiers of prevalence.
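The heterozygote paradox can be illustrated with a single locus. The sketch below assumes the same per-allele odds ratio in both populations, Hardy-Weinberg proportions, and hypothetical risk allele frequencies of 0.1 and 0.9; none of these values come from the studies cited here.

```python
import math


def heterozygote_log_lr(odds_ratio, risk_allele_freq):
    """Change in log odds for a carrier of exactly one risk allele.

    Assumes a per-allele odds ratio and Hardy-Weinberg proportions; the
    second term centers the effect on the population average genotype.
    """
    centering = 2.0 * math.log(1.0 - risk_allele_freq
                               + odds_ratio * risk_allele_freq)
    return math.log(odds_ratio) - centering


odds_ratio = 1.3  # hypothetical, assumed identical in both populations
rare = heterozygote_log_lr(odds_ratio, 0.1)    # risk allele is rare
common = heterozygote_log_lr(odds_ratio, 0.9)  # risk allele is common

# Positive where the risk allele is rare, negative where it is common.
print(f"rare: {rare:+.3f}, common: {common:+.3f}")
```

Where the risk allele is rare, a heterozygote carries more risk alleles than the average person and the log odds rise; where it is common, the same genotype carries fewer than average and the log odds fall.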

Finally, predictors and classifiers that do not assume additive effects of GWAS hits are being introduced. Sparse factorization and machine-learning methods offer very powerful ways to generate scores that incorporate SNPs that do not have strong univariate associations, or whose effects are conditional on other terms in the model [1, 3, 24, 29]. Often scores are developed purely as mathematical abstractions, though the interpretation is that they incorporate cryptic epistasis (genotype-by-genotype interactions) as well as environment or gender-specific interactions [48]. In these cases, there is always the assumption that the conditions and effects are consistent across populations. Again, it is not yet clear how reliable this assumption is and hence how transferable machine-learning-based scores typically will be.

1.2 Integrating Functional Genomic and Clinical Data to Capture Environmental Contributions

Irrespective of the nature of the risk score, the second major challenge is to combine such scores into an overall personal health profile. A key insight is that the extensive comorbidity of diseases establishes the expectation that genotypic risks should covary [42, 52]. Given risk scores for dozens or even hundreds of diseases, further mathematical manipulations may facilitate gains in prediction or classification accuracy that borrow power from across diseases. At the current stage of development of personal genomic medicine, there is insufficient data to discern robust patterns of covariance, with the exception of autoimmune diseases that share common polymorphisms [32, 46]. So long as individual disease risk scores only capture a minor fraction of the genotypic risk, they are unlikely to capture the true architecture of comorbidity, but presumably this will change as more comprehensive predictors are developed.

In the meantime, Ashley et al. [2] presented a mode of visualization of combined risk that suggests how path analyses might integrate univariate risk scores. This is reproduced in Fig. 1.2b focusing just on a half dozen common disease conditions mostly related to metabolic syndrome. On the left (Fig. 1.2a), the so-called risk-o-grams [2, 13, 16] show how baseline risk for these conditions is modified by a hypothetical individual’s genotypic risk. The point estimates should not be over-interpreted, the more important information being contained in the sign and magnitude of the genetic contribution. These are modified by comorbidity and redrawn on the right in the form of font size, where larger text represents increasingly elevated risk due to the individual’s genotypes and the disease interactions. Interrelated disease conditions are connected by directed edges where, for example, the likelihood of developing cardiovascular disease is increased by the person’s elevated risk of obesity but decreased by their low hypertension risk. Unfortunately, we do not yet have the tools to estimate the strengths of the connections, and much theoretical work on the optimal multivariate integration strategy remains to be performed.

Fig. 1.2
Risk-o-grams. Following Ashley et al. [2], a hypothetical risk-o-gram (a) shows how genotypic risk can be used to generate a point estimate of probability of disease conditioned on the population prevalence. The figure shows a hypothetical risk assessment on the log scale for 20 diseases where the black triangles show the prevalence for the individual’s gender, ethnicity, and age group, pointing to the right if genotype is predicted to increase risk or left if it decreases risk relative to the population average. The horizontal bars show the degree of genotypic effect, where, for example, Crohn’s disease risk is highly elevated, but asthma and breast cancer are reduced. (b) These risks need to be combined, recognizing the comorbidity matrix of disease and the influence of environmental factors, including dietary and psychological stressors, exercise patterns and drug usage, and personal history of illness. The modified risk for each condition conditioned on the matrix of influences is represented by the size of the font. Although we are a long way from being able to generate robust assessments, the figure implies that classification into high- and low-risk classes should be feasible in the near future

Just as importantly, the grand circle surrounding the disease prediction network shows that the environment must also be incorporated into computations. In this case, the individual’s heavy alcohol usage and lack of exercise also increase their risk of metabolic syndrome, as does a history of early life stress coupled with low family support and high work pressure. It is apparent that they are already taking statins and eating a low-fat diet to offset some of the risk, and regular yoga practice may help qualitatively. A traumatic brain injury suffered in a car accident as a child may have been a trigger that cannot be factored into population-based measures of risk, but it also feeds into likely cognitive decline with age. Again, it is not yet obvious how these environmental risks should be formulated from a statistical perspective. Drug usage can conceivably be incorporated as a cofactor in the computation of individual risk scores, but it is less obvious how to model diet and mental stress, or what the appropriate multivariate framework may be. A further advantage of this visualization is that it readily lends itself to dynamic representation of how lifestyle modifications may reduce the risk of key diseases, as individuals can observe projected changes in risks if they adopt new health behaviors.

Another aspect of the environment that we may endeavor to incorporate is cultural and geographic differentiation. Perusal of the Centers for Disease Control and Prevention (CDC) database of morbidity (see, e.g., http://www.cdc.gov/cancer/dcpc/data/state.htm for cancer data) shows that most diseases have very different prevalence according to the location within the United States. An excellent example is the well-known southern stroke belt [11] stretching from Louisiana across Alabama to Georgia and the Carolinas, but cancer incidence and many other diseases vary from region to region. Undoubtedly, rural and urban lifestyles impact disease risk, and we have shown that they also impact peripheral blood gene expression profiles [26, 36], while emerging data also suggest differences in the microbiome [64]. Most readily, this type of information could be incorporated into risk prediction already at the level of baseline prevalence, which might be assessed regionally rather than simply by gender and ethnicity. Of course someone who moves from Manhattan, New York, to Manhattan, Kansas, does not modify their risk overnight, so yet another obstacle to absolute risk prediction lies in assessing the perdurance of lifestyle effects and the impact of life stage. Notably, there is accumulating evidence that early life stress is among the biggest risk factors for a wide range of diseases, particularly in lower socioeconomic strata [21, 34, 41].

Another unresolved issue is to what extent genotype-by-environment interactions need to be taken into account in risk evaluation. There is very little evidence from GWAS that G×E is either prevalent or of sufficient magnitude to be an important component of population variance [57], notwithstanding occasional reports, for example, of smoking by nicotinic acetylcholine receptor polymorphism interactions with lung cancer [65] and of arsenic by solute carrier interactions for bladder cancer [27]. This is surprising given the prevalence of both genotypic and environmental effects on gene expression [26]. Supposing that low transcript abundance for a particular gene in a relevant tissue contributes to disease risk, those homozygous for a low expression cis-regulatory polymorphism, in an environment where expression is significantly reduced as well, will constitute the most at-risk group. Under a liability model, G×E for disease is plausible, even in the absence of interaction effects between the genotype and gene expression. However, large eQTL effects do not translate into large disease effects measured in case-control GWAS settings. It is possible that genotypic risk score-by-environment interactions will be observed, but such studies are yet to be performed. Furthermore, perhaps the more important mode of interaction is with individualized effects, such as triggers (accidents, transient stresses) that either are not captured in epidemiological surveys or have such high variance that interaction effects do not attain significance in population-scale studies.

All of these considerations add uncertainty to risk assessment and raise the question of whether it might not be better to measure the impact of the environment biochemically. The notion is that a person’s individuality results from the longitudinal interaction of their genome with all of the above lifestyle and environmental factors. These influences mediate disease risk ultimately by modifying metabolism and physiology, which in turn are a function of gene expression, which is subject to epigenetic modification. Consequently, measurement of the metabolome, transcriptome, and epigenome (e.g., chromatin methylation) should provide parallel omic information of high relevance to health care [25]. This systems biology approach is much hyped [53], but many would argue that it has yet to provide the clinical or mechanistic insights that have stemmed from genotype and sequence-based genomic medicine. A major limitation of course is that only a few tissues, principally peripheral blood or sometimes adipose biopsy, are readily available for high-throughput analysis. Blood does reflect immune and metabolic function and possibly mirrors psychological stressors [20, 31], so there is undoubtedly much to be learned from characterization of the sources of variance, and major advances in predictive health can be expected from this approach in the next decade.

I briefly highlight two strategies from our own work. First, characterization of extremes of individual transcript abundance detected by either microarray or RNASeq analysis of individuals is in many ways equivalent to rare deleterious coding variant detection from sequencing. We do not yet know how to read regulatory variation directly, but this is unnecessary if it can be directly demonstrated that an RNA (or protein) is not expressed in a particular individual. Association of such differential expression with phenotypes is subject to the same caveats as rare variant association analysis. Second, transcriptional variation is highly structured and characterized by major axes that represent aspects of lymphocyte function such as B and T cell signaling, antiviral responsiveness, and inflammation [45]. This variability is evident in the principal components of peripheral blood gene expression, but also appears in modules and axes of variation that are captured by the expression of biomarker genes [12], or blood informative transcripts. We postulate that the level of activity of gene expression in these axes will be found to correlate with aspects of immune and metabolic health.

1.3 Presenting and Interpreting Genomic Risk for Wellness

The third great challenge is to present genomic indicators of disease risk to healthy individuals in a manner that will help them to make sensible health behavior choices. This is one of the major goals of the emerging discipline of medical informatics. Risk-o-grams (Fig. 1.2a) are an excellent starting point since they present risk in absolute terms while also apportioning the genetic contribution relative to the population average. However, they have some obvious drawbacks, not least of which is the overwhelming number of assessments, many of which are for rare conditions or are clinically not actionable. They also fail to convey a sense of the error associated with risk assessments: we are used to the notion that heavy smoking more than doubles your lifetime risk of lung cancer, yet know heavy smokers who never get lung disease and never-smokers who do. Inevitably, inappropriate presentation of genetic risks will engender skepticism toward genomic medicine that may undermine the certain benefits that stand to be realized.

For this reason, in the context of wellness, classification is the more appropriate emphasis than prediction. Classification into very-high-, high-, normal-, low-, and very-low-risk levels should help individuals to focus on those aspects of their health that will benefit from close attention. It draws attention away from the myriad statistical issues discussed above, instead promoting joint consideration of genetic and clinical measures. Furthermore, it is consistent with a simplification of risk presentation in health domains that recognize patterns of comorbidity and leverage existing modes of health assessment. At the Center for Health Discovery and Well Being, we are promoting the idea that comprehensive clinical evaluation annually, starting in the fourth decade of life, will foster prevention over reaction as it engages individuals in their own health choices [43]. Figure 1.3 suggests one mode of presentation of genomic data that may be incorporated into the preventative medicine framework.

Fig. 1.3
Spider-web plots representing genomic and clinical risk in ten health domains for two hypothetical individuals. Genomic risk scores, generated by combining genotypic and functional genomic evidence, place each person in one of five risk classifications from very high (outer band) to very low (inner band) in ten health domains (IMM immunological, MET metabolic, CVD cardiovascular, MSK musculoskeletal, RSP respiratory, REP reproductive, COG cognitive, PSY psychiatric, ONC oncological, ORG organ failure). Clinical risk assessments generated from comprehensive medical examinations as well as personal and family history of disease are indicated by the size of the dots in each axis. Colors represent discordance between genomic and clinical risk as these situations are likely to be of greatest interest for individuals, alongside concordance for high risk, as they develop health action plans. Details and actual individual examples are described in Patel et al. [43]

Each radiating axis on the spider-web plots represents one of ten health domains. The bold polygon crosses each axis at a point, representing genomic risk in that domain (points further out mean higher risk), while the size of the circle at that point represents the observed clinical risk and/or evidence for disease. A quick glance at the spider-web plot tells an individual where they have high or low genetic and clinical risk. Areas of concordance between genetic and clinical data are highlighted as green dots. Discordances may be even more interesting. Those indicated in red where genetic risk is high but there is no sign of clinical danger (cardiovascular disease for A and musculoskeletal decay for B) suggest situations where the individual may pay close attention despite current good health. By contrast, situations where the genetic risk is low but clinical signs are not hopeful (respiratory disease for the smoker A and psychiatric problems for the socially isolated person B) may suggest that lifestyle changes are likely to have an impact. The main objective of this combined genomic and clinical classification is not to predict disease but to help individuals focus attention on areas where they should concentrate their health-related behaviors and surveillance.

The proposed ten common health domains are as follows:

  • Immunological, including autoimmune (type 1 diabetes, multiple sclerosis, SLE, arthritis), inflammatory (especially bowel diseases), and infectious (viral and microbial) disease susceptibility, many of which show comorbidity and all of which should be related to gene expression in various blood cells

  • Metabolic syndrome, generally referring to obesity and either hyperlipidemia or high blood glucose, leading to type 2 diabetes, and encompassing impaired insulin production and sensitivity

  • Cardiovascular, primarily atherosclerosis and hence related to metabolic dysfunction, but also including cardiomyopathy, arrhythmia, and heightened risk of myocardial infarction or stroke

  • Respiratory discomfort, namely, asthma, COPD, and fibrosis, all of which are exacerbated by smoking and call for attention to genotype-by-environment interaction

  • Musculoskeletal problems, such as low bone density, chronic back pain, and muscle weakness or wasting, which are a primary cause of reduced quality of life for a large percentage of the elderly

  • Mental health, manifesting as depression and/or anxiety in an increasingly alarming percentage of adults, but also including schizophrenia, autism spectrum, and attention deficit disorders in adolescents and young adults

  • Cognitive decline, whether due to Alzheimer’s disease, Parkinson’s disease, or generalized senile dementia and expected to become the major public health burden of the twenty-first century

  • Cancer risk, assessed from family history and possibly peripheral blood biomarkers

  • Organ malfunction, which is unlikely to have a common genomic foundation, although loss of eyesight, hearing, and renal and liver function are collectively a major source of morbidity

  • Reproductive health, namely, the capacity to conceive and maintain pregnancy or to produce fertile sperm, but also including endometriosis and other causes of uterine discomfort

Pharmacological variation, for both toxicity and responsiveness to specific drugs, is also an important aspect of genomic health, sometimes having a simple genetic basis (e.g., warfarin [50]) but generally as complex as disease risk [38]. This is not by any means an exhaustive list of disease but is meant to capture the major domains that concern adults as they enter middle age and begin to make lifestyle modifications in response to self-perception of personal health concerns. Genome-wide association studies have been performed for specific diseases in each domain, and thousands of variants are available for generation of risk scores. Similarly, relevant clinical measures can be taken during routine medical checkups or as part of a dedicated wellness program such as the CHDWB and collectively generate risk profiles in these ten domains as well.

An immediate concern is how to collapse disparate genotypic and clinical risk scores into summary measures of risk for the various domains. For clinical measures, z-scores place each person in relative risk categories, with those within one standard deviation of the mean being at intermediate risk, those between 1 and 2 standard deviations at high (or low) risk, and everyone at the extremes in the very-high- or very-low-risk categories. A similar strategy could be applied to genotypic risk, or thresholds can be established based on the risk score distributions. Geometric means might be used to combine multiple scores, enhancing the relevance of individual high-risk values. My concern here is not with the optimal mode of collapsing but rather to suggest how spider-web plots or similar visualizations might be interpreted.
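The z-score banding just described can be sketched as follows. The handling of values that fall exactly on a boundary, the assumption that higher values of the measure mean higher risk, and the example measurements are all choices made for illustration.

```python
import statistics


def z_scores(values):
    """Standardize a cohort's measurements to mean 0 and unit variance."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mean) / sd for v in values]


def risk_category(z):
    """Map a z-score to one of five risk classes (higher z = higher risk)."""
    if z >= 2.0:
        return "very high"
    if z >= 1.0:
        return "high"
    if z > -1.0:
        return "normal"
    if z > -2.0:
        return "low"
    return "very low"


# Hypothetical clinical measurements for a small cohort.
cohort = [4.8, 5.1, 5.3, 5.0, 5.2, 6.9, 4.9, 5.4, 3.2, 5.1]
for value, z in zip(cohort, z_scores(cohort)):
    print(f"{value:4.1f}  z = {z:+.2f}  {risk_category(z)}")
```

The same `risk_category` thresholds could be applied to standardized genotypic scores, which would place both kinds of evidence on the five bands of the spider-web plot.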

After consulting the spider-web plot with a physician or other health professional, the next step would be to examine the contributing risk factors in more detail. Consider three examples. (1) In the cardiovascular domain, individual B in Fig. 1.3 has intermediate overall risk, but close examination shows that she is discordant for high blood pressure and lower than average genotypic risk of hypertension. This may suggest that some aspect of lifestyle, either high levels of job stress or a high salt diet, is responsible, and the low genetic risk might in some cases provide impetus for the individual to address the root cause. (2) Person A is concordant for obesity and high genetic risk of obesity, both of which produce high scores in the metabolic domain. Rather than accepting this as a fait accompli, with appropriate counseling she may learn that much of the genetic risk is due to neurological factors rather than any deficit in metabolic enzyme function, and this may help her to seek guidance in controlling dietary compulsions. (3) Another individual may be discordant in the organ failure domain for high genetic risk of age-related macular degeneration, but as a 70-year-old with above average eyesight has paid no attention to the possibility that he may soon suffer from loss of vision. Knowing the genetic risk, he will now have regular eye exams and follow emerging guidelines directed at preventing onset of the disease.

As discussed earlier, I envisage that genomic risk assessment will eventually incorporate transcriptional, epigenomic, and metabolic measures. The costs involved will be an obstacle for the foreseeable future, and it is not clear who will pay. It is nevertheless not difficult to see how a few thousand dollars spent on genomic analyses in middle age may save tens or hundreds of thousands of dollars in acute medical care for people approaching retirement age. Employers stand to benefit from reduced absenteeism and elevated productivity, and economic modeling suggests that the savings can be substantial. Scientific demonstration of the clinical efficacy of joint genomic and clinical profiling will likely take thousands of case studies over several years, a daunting challenge, but given the stakes, one that must be taken on.

1.4 Conclusion

Assuming success of the WHOLE paradigm, there will also be a need for training of a new class of health-care professional. A few genetic counseling programs are beginning to provide training in the interpretation of genome sequences. At the CHDWB, we have developed a Certificate program for Health Partners who consult with participants on the interpretation of their clinical profiles and help them to formulate personal health action plans. The combination of advanced genetic counseling with a health partner is expected to yield genomic counselors, masters level professionals who will work alongside physicians, dieticians, personal trainers, and clinical geneticists to provide people who care to take advantage of the wealth of information implicit in genomic medicine with a path to health maintenance and extended well-being.