Introduction

Over the past 20 years, tremendous gains have been made in understanding the structure and function of the human genome [1]. Advances in technology have allowed investigators to interrogate genes, transcripts, proteins, and metabolites at the genome-wide level, with innovation progressing at a rapid pace [2]. Due to the prominent genetic nature of cancer, oncology has led the implementation of genomic-based medicine and is the initial focus of the precision medicine initiative recently announced by the President of the USA [3, 4]. The near-term goals of this initiative are to utilize molecular signatures to develop targeted tumor therapies and to better understand mechanisms underlying drug resistance, including the development of new tumor cell models to aid in these tasks. A longer-term goal is to collect biological, clinical, and environmental data from one million volunteer subjects using electronic medical records and mobile health devices, allowing investigators to look beyond cancer and apply precision medicine to other diseases. Coronary artery disease (CAD) is one such disease that should benefit greatly from this initiative, as a firm foundation has already been established in the development of molecular- and genomic-based tools for the diagnosis and treatment of CAD [5].

Although family history has long been known to be a good predictor of CAD risk, and population-based studies such as the CARDIoGRAM consortium have independently replicated a moderate number of CAD-associated loci, these single-nucleotide polymorphisms (SNPs) account for only ~10 % of CAD heritability and have relatively low effect sizes [6]. Using systems biology and network-based approaches to integrate existing SNPs and biomarkers with novel variants and other genomic markers that may be discovered in large studies, such as those outlined in the precision health initiative, may provide novel insights into complex disorders such as CAD [7, 8]. Gene expression profiling, which measures changes in gene transcript levels in response to alterations in biological state and can provide a dynamic measurement of a patient’s current physiological condition, holds great promise as a platform that can be utilized in precision health. In this article, we review what has already been accomplished using peripheral blood gene expression profiling to assess CAD and discuss the best approaches for developing and validating such signatures.

Peripheral Blood Gene Expression Studies in CAD

It has long been known that the cellular and molecular basis of atherosclerotic plaque development has a strong systemic inflammatory component involving cells of both the innate and adaptive immune systems [9, 10]. The deposition of oxidized lipids into the vascular bed initiates the process with subsequent responses by endothelial, vascular smooth muscle cells, and circulating immune cells. A large body of evidence supports the role of monocytes/macrophages in the development and progression of CAD [10], and it has been recently demonstrated that neutrophils also play a key role in CAD progression and plaque instability [11, 12]. Work in recent years has also highlighted both an athero-protective and atherogenic role for the adaptive immune system in the development of atherosclerosis [13]. In sum, strong evidence supports a role for multiple immune cell types in the development and progression of CAD and suggests that gene expression profiling in peripheral blood is a viable approach for monitoring the presence and progression of CAD.

To date, a number of studies have been published examining whole blood gene expression profiling as a method to identify subjects at risk for CAD (Table 1) [14–18]. Comparison of significant genes has shown limited overlap (Fig. 1) [14, 16], and there are several possible reasons for this. Lack of concordance in clinical phenotype can be a major contributor to inter-study differences. In two studies, control populations did not have angiographic evidence supporting disease absence, which could result in decreased power to detect CAD associations [15, 16]. Disease definitions varied between studies, as did patient populations. Different exclusions were applied to patients with a previous history of MI or CAD, as well as to other conditions whose presence might confound CAD signals; diabetes and rheumatoid arthritis have each been shown to influence the expression of genes associated with CAD [18, 19], as has the use of steroids or immunosuppressive drugs [20–23]. A variety of gene expression profiling technologies exist (summarized in Table 2), including multiple microarray platforms, and these differed between the above studies. In addition, independent technical confirmation and validation of candidate genes in separate cohorts were not consistently implemented. Lastly, poor agreement between genome association studies often indicates that true associations are weak and that individual studies are not well powered to detect associations; i.e., results of individual studies may be qualitatively correct, but marker effect size estimates are overstated [24]. In such underpowered studies, sets of genes that are associated with a given outcome can be unstable due to their correlative nature, a factor that is compounded in peripheral gene expression studies where gene expression patterns can be highly correlated [25, 26].

Table 1 Peripheral blood gene expression studies in coronary artery disease
Fig. 1

Venn diagram showing the overlap of genes in the CAD cohorts. Due to the use of different array platforms and annotations in the five coronary artery disease studies described, the diagram likely has missed shared genes and represents a slight underrepresentation of the true overlap. In addition, the list of the 365 genes described in Taurino et al. was not available and thus was not included in the analysis

Table 2 Gene expression platforms

Pathways for Clinical Test Development

In the previous section, we summarized the results from recent CAD gene expression studies and highlighted the lack of concordance between studies. Developing a gene expression-based diagnostic is complex with many critical factors to be considered. With this in mind, the remaining sections outline approaches for developing diagnostic tests. Using CAD as an example, we describe general stages of diagnostic test development and offer recommendations that focus on core considerations and common pitfalls.

Developing a diagnostic test requires clarity, particularly with respect to the clinical outcome, methods for clinical phenotyping, existing clinical and molecular prediction models and confounders, and intended use population. Test development is a multistage process starting with initial gene discovery and proceeding through test validation and beyond to post-validation studies. The stage definitions and boundaries vary in practice but generally can be defined as follows: (1) gene discovery, (2) prediction model building, (3) prediction model testing, (4) test development, and (5) test validation. Each stage is described below and may include multiple studies (Fig. 2).

Fig. 2

Overview of the pathway for new diagnostic research, development, and commercialization, with key studies (a–f). Research is characterized by marker discovery, prediction model building, and testing (a), which may be performed in multiple independent studies (1–3) or in a single step (4). Development begins with transfer of assays from research platforms to commercial platforms using technical replication studies (b) and includes analytical validation and clinical validation (c), which may continue post-development to broaden support for diagnostic validity. Commercialization includes utility studies (e) or commercial registries, directed at answering questions related to decision impact and patient health outcomes. Additional studies may be performed to demonstrate economic benefits of testing (f)

Gene Discovery

Important technical considerations must be addressed prior to undertaking gene discovery, such as the type of RNA that will be assessed, how samples will be collected, and which technology will be used to assess gene expression levels; examples of these options are summarized in Tables 2, 3, and 4. Gene discovery itself involves the identification and selection of informative gene expression markers from a larger candidate pool; this pool can represent the content on a real-time quantitative polymerase chain reaction (RT-qPCR) panel or microarray or the total sum of detectable transcripts in a biological sample. Gene discovery studies require a variety of design considerations such as the disease phenotype and subphenotypes, study population(s), the disease prevalence in the study population, the state of prior evidence supporting candidate markers, and the desired power of the study to detect modest or weak associations. Various study designs have been described in the literature, including single step and sequential, and various designs have been implemented for CAD [32–34]. Examples for CAD marker discovery range from single-center matched and unmatched case-control studies [14–16] to multistaged multicentered prospectively recruited cohorts [18]. Arguably, the majority of markers thus far identified as associated with obstructive CAD risk have shown only modest to weak effect sizes. Consequently, two statistical pitfalls of marker discovery are important to note: multiple testing artifacts and winner’s curse.

Table 3 Types of RNA
Table 4 Collection methods

Multiple Testing

Data sets used for gene discovery are commonly characterized by large initial gene sets and smaller sample sizes and are commonly examined using various statistical models (including and excluding covariates, normalizing and un-normalizing marker expression levels, defining and redefining endpoints, population subsets, etc.). When any large number of tests is performed, it is inappropriate to use traditional single model test thresholds (e.g., p ≤ 0.05) for significance testing, as many apparently significant observations occur by chance. Statistical methods exist for multiple testing corrections [35, 36]; however, these methods are accurate only when the full scope of multiple testing is defined and when the techniques are applied rigorously. Often, the iterative nature of research makes this challenging; in such cases, it may be simpler for researchers to adopt split-sample designs, i.e., to use one part of a cohort for unfettered discovery and hypothesis generation and a separate part for testing a strictly limited number of candidate hypotheses.
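As an illustration of a multiple testing correction, the following minimal sketch applies the Benjamini-Hochberg false discovery rate procedure, one widely used method; the text above does not specify which corrections are intended, so this choice is an assumption. The function name and the ten p-values are hypothetical and serve only to show how an uncorrected p ≤ 0.05 threshold admits more candidates than a corrected one.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for ten candidate genes (illustrative only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

naive_hits = sum(p <= 0.05 for p in p_values)
fdr_hits = sum(benjamini_hochberg(p_values, alpha=0.05))

print(f"Genes with p <= 0.05, uncorrected: {naive_hits}")
print(f"Genes passing Benjamini-Hochberg at FDR 0.05: {fdr_hits}")
```

The correction is only as good as the count of tests it is given: if additional models, endpoints, or subsets were explored, those tests would need to be included in `p_values` as well.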

Winner’s Curse

It is common for large discovery studies to identify candidate markers whose statistical significance clusters near statistical rejection thresholds and for the initial findings to replicate poorly in subsequent studies. This phenomenon is expected when the disease is only modestly or weakly associated with many markers and where the studies are not reliably powered to detect such associations. In such cases, a subset of true disease associations may achieve statistical significance but do so by being overestimated, by chance. This statistical problem of biased effect size estimates, conditional on statistical significance, is referred to in the literature as the “winner’s curse” or “Beavis effect” and is pronounced when the power of the marker discovery study is low [37, 38]. Consequent failure to confirm an initial finding may thus be due to a true positive finding being overestimated.
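The effect can be made concrete with a small simulation. The sketch below assumes a single marker with a weak true standardized effect, unit-variance expression values, and a two-sided z-test; the effect size, group size, and number of simulated studies are arbitrary illustrative choices, not values from any study cited here. It compares the average effect estimate across all simulated studies with the average among only those that reach significance.

```python
import random
from statistics import NormalDist, fmean


def simulate_discovery(true_effect, n_per_group, n_studies, alpha=0.05, seed=1):
    """Simulate repeated case-control studies of one weakly associated marker."""
    rng = random.Random(seed)
    # Standard error of a difference in means with unit variance per group.
    se = (2.0 / n_per_group) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    all_estimates, significant = [], []
    for _ in range(n_studies):
        cases = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        estimate = fmean(cases) - fmean(controls)
        all_estimates.append(estimate)
        if abs(estimate) / se > z_crit:
            significant.append(estimate)
    return all_estimates, significant


true_effect = 0.2  # assumed weak true effect, in standard deviation units
all_estimates, significant = simulate_discovery(true_effect, n_per_group=50, n_studies=2000)

power = len(significant) / len(all_estimates)
mean_all = fmean(all_estimates)
mean_significant = fmean(significant)

print(f"True effect:                      {true_effect:.2f}")
print(f"Empirical power:                  {power:.2f}")
print(f"Mean estimate, all studies:       {mean_all:.2f}")
print(f"Mean estimate, significant only:  {mean_significant:.2f}")
```

Under these assumptions the estimate averaged over all studies is close to the true effect, while the estimate averaged over significant studies is inflated, because at low power only overestimates can cross the significance threshold.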

In addition to the above pitfalls, it should be recognized that known clinical risk factors account for a significant portion of CAD risk. For example, increased risk for CAD is associated with age, sex, smoking, and a variety of patient symptoms [39]; ignoring clinical predictors, or other clinical variables such as medication usage, during gene discovery may yield markers correlated with different patient strata (Fig. 3). For example, any biological marker that changes value with patient age will therefore be associated with all age-associated diseases, including CAD. It is important to note that it is possible to identify gene expression changes associated with clinical co-factors associated with CAD; identifying gene expression surrogates for clinical risk factors can be advantageous when clinical data are incomplete and can measure physiological responses to either environmental or biological phenomena, thus adding information beyond clinical data alone. For example, gene expression signatures have been identified for age [40–42], smoking [43–45], hyperlipidemia [46], and hypertension [47], all of which are associated with CAD risk. Finally, whole blood cell populations may change in response to disease states. For instance, the ratio of neutrophils to lymphocytes (N/L ratio) has been shown to be prognostic for MI [48], and a gene surrogate measurement of the N/L ratio has been correlated with the presence of obstructive CAD [18].
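The age example can be illustrated with a simulated cohort. In the sketch below, a hypothetical marker rises with age but has no direct relationship to disease, while disease probability also rises with age; all coefficients, the age range, and the 5-year stratum width are assumptions chosen only for illustration. The crude case-control difference in the marker is compared with the difference computed within age strata.

```python
import random


def simulate_cohort(n, seed=7):
    """Each subject: (age, has_disease, marker). The marker depends on age only."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        age = rng.uniform(40, 80)
        has_disease = rng.random() < (age - 30) / 60  # risk rises with age
        marker = 0.05 * age + rng.gauss(0, 1)         # no direct disease effect
        cohort.append((age, has_disease, marker))
    return cohort


def mean_difference(rows):
    """Mean marker level in cases minus controls, or None if a group is empty."""
    cases = [m for _, d, m in rows if d]
    controls = [m for _, d, m in rows if not d]
    if not cases or not controls:
        return None
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def age_stratified_difference(rows, width=5):
    """Size-weighted average of within-stratum case-control differences."""
    strata = {}
    for row in rows:
        strata.setdefault(int(row[0] // width), []).append(row)
    weighted_sum, total = 0.0, 0
    for members in strata.values():
        diff = mean_difference(members)
        if diff is not None:
            weighted_sum += diff * len(members)
            total += len(members)
    return weighted_sum / total


cohort = simulate_cohort(10000)
crude = mean_difference(cohort)
adjusted = age_stratified_difference(cohort)

print(f"Crude case-control difference:   {crude:.3f}")
print(f"Age-stratified difference:       {adjusted:.3f}")
```

In this setup the crude difference reflects the age gap between cases and controls rather than disease biology, and it largely disappears once subjects are compared within narrow age bands.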

Fig. 3
figure 3

Illustration of possible relationships between two clinical factors (patient age and sex), environmental, patient symptoms, markers of disease risk, and patient disease

Prediction Model Building

The goal of prediction model building is to identify candidate predictors using a set of candidate genes and a family of prediction functions. Ideally, predictors should be chosen from families with the best general performance characteristics. In our experience, however, many families yield similar accuracy, and prediction functions are often chosen from families that are straightforward to interpret such as those where disease probability is modeled by a linear combination of independent variables.

Prediction model building need not be separate from gene discovery, and many statistical methods (e.g., stepwise regression, Random Forest, LASSO) can be used to simultaneously reduce candidate genes and combine them in a predictive model. However, it is useful to treat gene discovery and model building as separate steps, as they may proceed sequentially (e.g., when markers are identified from the literature or by univariate associations). Even when this is not the case, there are specific considerations to model development above and beyond those of gene discovery.

Decisions for prediction model building include the following: (1) model family selection, (2) gene selection, (3) determining model constraints, (4) use of clinical covariates in the model, (5) consideration of population heterogeneity, (6) the data set used for final model selection, and (7) criteria for model acceptance. Of these, the most critical are those that influence model evaluation and acceptance. It is technically valid to reuse gene discovery data for model building; however, use of such data can result in models with overstated performance, as naive reuse of gene discovery data leads to model overfitting and biased model performance [49]. When performing gene discovery and model building on the same data set, statistically valid methods such as cross-validation and bootstrapping can provide relatively unbiased overall performance estimates [50]. However, these methods only work when all steps of discovery are nested within the cross-validation loop [51]. For complex research workflows, it may be simpler to rely instead on an independent test set to evaluate candidate model performance.
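The requirement that discovery be nested within the cross-validation loop can be demonstrated on data that contain no signal at all. The sketch below is a simplified illustration under stated assumptions: expression values are pure Gaussian noise, gene discovery is a ranking by case-control mean difference, and the classifier is a nearest-centroid rule; the sample size, gene count, and function names are hypothetical. Because no gene is truly associated with the label, an unbiased accuracy estimate should be near chance.

```python
import random


def make_noise_data(n_samples, n_genes, seed=3):
    """Expression matrix of pure noise: no gene is truly associated with the label."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # alternating control (0) / case (1)
    return X, y


def class_means(X, y, rows, genes, label):
    """Mean expression of each listed gene among `rows` with the given label."""
    members = [i for i in rows if y[i] == label]
    return [sum(X[i][g] for i in members) / len(members) for g in genes]


def top_genes(X, y, rows, k):
    """Discovery step: keep the k genes with the largest case-control mean difference."""
    genes = range(len(X[0]))
    m1 = class_means(X, y, rows, genes, 1)
    m0 = class_means(X, y, rows, genes, 0)
    return sorted(genes, key=lambda g: abs(m1[g] - m0[g]), reverse=True)[:k]


def cv_accuracy(X, y, k, n_folds, nested):
    """Cross-validated accuracy of a nearest-centroid classifier on k selected genes."""
    rows = list(range(len(X)))
    genes_all = top_genes(X, y, rows, k)  # discovery on ALL samples (naive reuse)
    correct = 0
    for fold in range(n_folds):
        test = [i for i in rows if i % n_folds == fold]
        train = [i for i in rows if i % n_folds != fold]
        # Nested: repeat discovery using the training fold only.
        genes = top_genes(X, y, train, k) if nested else genes_all
        c1 = class_means(X, y, train, genes, 1)
        c0 = class_means(X, y, train, genes, 0)
        for i in test:
            d1 = sum((X[i][g] - c) ** 2 for g, c in zip(genes, c1))
            d0 = sum((X[i][g] - c) ** 2 for g, c in zip(genes, c0))
            correct += int((d1 < d0) == (y[i] == 1))
    return correct / len(rows)


X, y = make_noise_data(n_samples=40, n_genes=1000)
naive_acc = cv_accuracy(X, y, k=10, n_folds=5, nested=False)
nested_acc = cv_accuracy(X, y, k=10, n_folds=5, nested=True)

print(f"Accuracy, genes selected on all samples:  {naive_acc:.2f}")
print(f"Accuracy, selection nested in each fold:  {nested_acc:.2f}")
```

When selection uses all samples, the held-out samples have already influenced which genes were chosen, so the apparent accuracy is optimistic even though the data are noise; nesting the selection removes that leak.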

The importance of evaluating clinical covariates continues through prediction model building. Clinical covariates or biomarker surrogates may be included in the model or distinct models developed for different covariate strata (e.g., separate models for separate sexes) to ensure a well-calibrated model. As such, matched case-control designs may be of limited value, though unmatched case-control designs may still be necessary for low-prevalence disease applications.

Prediction Model Testing

Disease prediction models are tested to assess diagnostic performance. Testing may occur on the gene discovery cohort (e.g., using cross-validation) or as a distinct step on a reserved sample set. In either case, prediction model testing does not invariably constitute clinical validation.

In many cases, prediction model testing is performed on a discovery cohort or using a split population design. Model performance then cannot be known in advance, the performance testing set cannot be formally powered for confirmatory testing, and the focus of prediction model testing is to estimate performance. Split sample set designs offer the advantage of allowing estimation of both performance (e.g., AUC) and precision (e.g., confidence intervals) in the reserved test set. Cross-validation approaches offer increased power for marker discovery and model building at the cost of some bias and difficulty in characterizing the precision of performance estimates.
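For a reserved test set, performance and precision can be estimated together. The following sketch computes the AUC as the proportion of case-control pairs ranked correctly and attaches a percentile bootstrap confidence interval; the bootstrap is one of several possible interval methods and is an assumption here, as are the twelve illustrative scores and labels.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the AUC, resampling subjects with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if 0 < sum(lab) < n:  # skip resamples that lack cases or controls
            stats.append(auc([scores[i] for i in idx], lab))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Hypothetical model scores and true disease status for a small reserved test set.
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.65, 0.70, 0.80, 0.85, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

point = auc(scores, labels)
lo, hi = bootstrap_ci(scores, labels)
print(f"AUC = {point:.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```

With a test set this small the interval is wide, which is the practical reason precision should be reported alongside the point estimate.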

Many methods used to build prediction models do not require that individual model terms achieve statistical significance. Resulting models may perform well overall, though their use introduces an element of uncertainty in explaining or attributing performance. For instance, it may be difficult to know which terms are positively contributing to the model accuracy, which are not, and which (if any) are detrimental to model accuracy. In such cases, it is important to prespecify the baseline for overall model comparison (e.g., a competing clinical model or the base clinical terms of the full-disease prediction model).

Test Development

Diagnostic test development entails the translation of the test as performed in a research setting into a form and process amenable to the clinical laboratory. Test development encompasses the complete system required to run the assay, including laboratory instrumentation, reagents, and software implementing the disease prediction model. Development begins after final product requirements are accepted, with clearly defined design inputs and outputs, and should be managed under a quality system. Technical replication, necessary when assay platforms are changed such as from microarray to RT-qPCR, can be performed using a subset of samples from the discovery studies, as the objective is to prove accuracy and precision of measurements across the assay’s dynamic range and not diagnostic validity of the test. The principal sample requirements for test development are that the samples represent the analytic and clinical ranges observed during discovery and that a sufficient quantity of sample material remains for repeat testing.

Test Validation

Test validation is composed of two components: analytical validation and clinical validation.

Analytical Validation

Analytical validation demonstrates that markers (e.g., analytes or RNA transcripts in the case of gene expression tests) are correctly measured by the new test. Analytical validation is performed through a series of studies designed to confirm test accuracy, precision, limits of detection or quantitation, robustness against likely interfering substances, stability of reagents and samples against their defined limitations of use, and, potentially, the robustness of the process to variance in assay conditions [52, 53].

Clinical Validation

Clinical validation studies demonstrate that the diagnostic performance of the test meets predefined acceptance criteria. Design considerations for such studies, including sample size calculations for categorical (e.g., diagnostic sensitivity and specificity) and continuous endpoints (e.g., AUC), are complex but well defined in the statistical literature [54]. As with analytical validation, clinical validation may appear to recapitulate findings of previous studies. However, clinical validation serves as a strong claim of diagnostic performance. Clinical validation studies have the following characteristics: (1) validation is performed in the intended use population; (2) the study cohort is independent of discovery cohorts; (3) diagnostic testing is performed using the final version of the diagnostic assay(s), with qualified materials and equipment, executed in the clinical laboratory setting; (4) clinical endpoints and performance claims are prespecified, with clear null and alternative hypotheses; and (5) the study is powered to meet the primary objective (or co-primary objectives, when claims of diagnostic sensitivity and specificity are desired) (for a more detailed description of diagnostic accuracy study reporting and evaluation considerations, see Bossuyt et al. and Whiting et al. [55, 56]). Clinical validation studies are not intrinsically single studies, as it may be necessary to validate multiple claims (e.g., in different intended use populations) or to strengthen initial claims (e.g., by raising the stringency of the null hypotheses). Indeed, in the case of any new diagnostic test, it may be difficult to perform the first validation study in the desired intended use population due to costs and risks of gold standard testing on all study participants, such as invasive coronary angiography. In such cases, initial validation studies may be focused on higher-risk populations [57, 58].
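As a rough illustration of powering a validation study against a prespecified null hypothesis, the sketch below uses the standard normal-approximation sample size formula for a one-sided, one-sample test of a proportion, applied to diagnostic sensitivity. It is a simplification of the methods in the statistical literature, and every input (null sensitivity of 0.70, expected sensitivity of 0.85, one-sided alpha of 0.025, 90 % power, 30 % disease prevalence) is a hypothetical value chosen only for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cases_needed(p_null, p_alt, alpha=0.025, power=0.90):
    """Diseased subjects needed to show sensitivity exceeds p_null when it is truly p_alt.

    Normal approximation for a one-sided, one-sample test of a proportion.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(p_null * (1 - p_null))
                 + z_beta * sqrt(p_alt * (1 - p_alt)))
    return ceil((numerator / (p_alt - p_null)) ** 2)


def enrollment_needed(n_cases, prevalence):
    """Total subjects to enroll so that about n_cases are expected to have disease."""
    return ceil(n_cases / prevalence)


# Hypothetical design inputs (assumptions, not values from any cited study).
n_cases = cases_needed(p_null=0.70, p_alt=0.85)
n_total = enrollment_needed(n_cases, prevalence=0.30)

print(f"Diseased subjects required: {n_cases}")
print(f"Total enrollment at 30% prevalence: {n_total}")
```

Raising the stringency of the null hypothesis (moving `p_null` closer to `p_alt`) increases the required sample size sharply, and a co-primary specificity claim would need the same calculation applied to non-diseased subjects.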

Beyond Gene Expression

The development and clinical validation of a peripheral blood gene expression diagnostic test can be challenging although achievable; the choice of the appropriate platform and adherence to rigorous experimental, clinical, and statistical approaches is paramount to success. As mentioned previously, although gene expression profiling in whole blood is a powerful approach, it has inherent limitations especially when applied to a disease such as CAD where the measurement of the disease process may be indirect.

To remedy this, the incorporation of other types of biomarkers, whether they are gene expression based such as the measurement of circulating RNAs or other types of markers (genetic, proteomic, metabolite, etc.), should be considered. A number of studies have examined the interactions between genetics and gene expression (for a general review, please see Cookson et al. [59]), and it has been suggested that examining genetic-gene expression interactions in CAD may be a powerful approach to further understanding coronary disease [8]. One example of this approach is illustrated in a study using the same subject set described in Joehanes et al. [60] (Table 1). In this study, the investigators identified co-expression modules that were differentially represented in either CHD cases or age- and sex-matched controls and demonstrated that these differential modules were enriched in CHD risk expression SNPs (eSNPs), loci known to be associated with increased CHD risk and also to alter gene expression. This approach led to the identification of genes involved in B cell activation, immune response, and ion transport, as well as higher-level regulatory drivers. The same group of researchers also successfully employed a similar approach to investigate miRNA-mRNA-SNP interactions in the same set of subjects [61]. In addition to genetics, the inclusion of other biomarkers in a systems biology approach may strengthen the performance of a diagnostic test for CAD by incorporating orthogonal signals that may reflect different biological aspects of the disease [62].

As the field of genomics continues to develop and mature, the incorporation of precision medicine into clinical practice will continue to progress and holds great promise for altering the diagnosis and treatment of cardiovascular disease.