1 Introduction

Medical decisions are often based on an individual’s calculated risk of having or developing a condition. For example, decisions to prescribe long-term cholesterol-lowering statin therapy are often made using the Framingham risk of a cardiovascular event (Truett et al. 1967; Kannel et al. 1976; Gordon and Kannel 1982; Anderson et al. 1991), which takes as input the individual’s sex, age, blood pressure, total cholesterol, low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, smoking behavior and diabetes status. The Breast Cancer Risk Assessment Tool (BCRAT) is used to calculate an individual’s 10-year risk of breast cancer, using information on age, personal medical history (number of previous breast biopsies and the presence of atypical hyperplasia in any previous breast biopsy specimen), reproductive history (age at the start of menstruation and age at the first live birth of a child) and family history of breast cancer. If a woman’s risk exceeds an age-specific threshold, she may be recommended for hormone therapy that reduces the risk, at least in some women. Risk prediction models can also be used to determine if a person’s risk is low enough to forgo certain unpleasant or costly medical interventions (Gail et al. 1989, 1999).

Our ability to predict risk with currently available clinical predictors is often very poor. For example, the BCRAT model has only a modest capacity to discriminate women who develop breast cancer within 10 years from those who do not: the area under the age-specific receiver operating characteristic curve is approximately 0.56 (Mealiffe et al. 2010). Therefore, new predictors are sought for their capacity to improve upon its prediction performance. Recent advances in, and the wider availability of, molecular and imaging biotechnologies offer the potential for powerful new predictors. Recent studies have examined the use of data on genetic polymorphisms and breast density to improve the performance of BCRAT.

This paper concerns study designs to estimate the improvement in prediction performance that is gained by adding a new predictor \(Y\) to a set of baseline predictors \(X\), to predict the risk of an outcome \(D\) (\(D=1\) for a bad outcome and \(D=0\) for a good outcome). When resources are limited and \(Y\) is difficult to ascertain, it may not be feasible to measure it on all subjects in a study cohort. Consider, for example, a new predictor that is a biomarker measured on biological samples obtained and stored while women were healthy at enrollment in the Women’s Health Initiative. The preciousness of such biological samples dictates that they be used with maximum efficiency. Therefore, a case–control study design is typically employed wherein \(Y\) is measured on a random subset of cases (denoted by \(D=1\)) and a selected subset of controls \((D=0)\).

Our specific interest concerns whether or not the controls on whom \(Y\) is measured should be selected to frequency match the cases with regard to the baseline predictors \(X\). Matching is in fact routinely done in practice in order to avoid observing associations between \(Y\) and \(D\) that are solely due to associations of \(X\) with both \(Y\) and \(D\). However, the effect of this practice on estimation of performance improvement is not fully understood. We have raised concerns about matching with regard to bias, emphasizing that the naïve analyses typically employed are misleading because they underestimate performance (Pepe et al. 2012). The effect of matching on the efficiency of estimating incremental value has not been examined. Nevertheless, the practice is entrenched in the field of biomarker research. Here, we propose a two-stage estimator that accounts for matching to produce unbiased estimates. Using this estimator, we address the question of whether matching can improve the efficiency of estimating the increment in performance. This is an important question because matching also necessitates a somewhat more complicated analysis algorithm than is required for an unmatched study. We ask whether there is a large enough (or any) efficiency gain to justify the common practice of matching and the more complicated analysis it entails.

Matching is known to improve efficiency for estimating the odds ratio for \(Y\) in a risk model that includes \(X\) (Breslow and Day 1980). However, the odds ratio, \(\frac{P(D=1|X,Y=y+1)/P(D=0|X,Y=y+1)}{P(D=1|X,Y=y)/P(D=0|X,Y=y)}\), does not characterize prediction performance or the improvement in prediction performance gained by including \(Y\) in the risk model over and above use of \(X\) alone. The distribution of \((X,Y)\) in the population is an additional component that enters into the calculation of prediction performance. Janes and Pepe (2009) showed that matching on \(X\) is also optimal for estimating the covariate-adjusted ROC curve, which is a measure of prediction performance. However, Janes and Pepe (2008) showed that the covariate-adjusted ROC curve, which characterizes the ROC performance of \(Y\) within populations where \(X\) is fixed, does not quantify the improvement in the ROC curve gained by including \(Y\) in the risk model. It is currently unknown whether matching leads to gains in efficiency for estimating performance improvement.

There are many metrics available for gauging improvement in prediction performance, and there is much confusion in the field about which metrics are most worthy for reporting. In Sect. 2, we review the most popular measures, providing some novel insights about their interpretations and inter-relationships. We provide rationale for the measures we selected to study here. In Sect. 3, we describe how these measures can be estimated from matched and unmatched studies. Simulation studies that were performed to evaluate the properties of the estimators and the efficiencies of matched designs are described in Sect. 4 using a simulated dataset and a real dataset concerning the prediction of renal artery stenosis. In Sect. 5, we propose a bootstrap approach for inference and demonstrate its validity through simulation studies. In Sect. 6, we illustrate our methodology in the context of renal artery stenosis. We close with some recommendations and suggestions for further research.

2 Measures of improvement in prediction performance

We first consider the most popular measures used to quantify improvement in prediction performance. Table 1 presents definitions for these measures. In this section, we review the measures in more detail.

Table 1 Definitions of performance measures

2.1 Notation

Recall our use of \(D\) for the outcome variable, \(D=1\) denoting a case with a bad outcome and \(D=0\) denoting a control with a good outcome. We use \(X\) for predictors in the baseline risk function, \(\mathrm{risk}(X)=P(D=1|X)\), and \(Y\) for the novel predictors to be added, and we write \(\mathrm{risk}(X,Y)=P(D=1|X,Y)\). All measures of prediction performance involve the distributions of \(\mathrm{risk}(X)\) and \(\mathrm{risk}(X,Y)\) in cases and controls. We write these distributions as:

$$\begin{aligned} F^{D}_{X}(r)&= P(\mathrm{risk}(X)\le r|D=1)\\ F^{\bar{D}}_{X}(r)&= P(\mathrm{risk}(X)\le r|D=0)\\ F^{D}_{X,Y}(r)&= P(\mathrm{risk}(X,Y)\le r|D=1)\\ F^{\bar{D}}_{X,Y}(r)&= P(\mathrm{risk}(X,Y)\le r|D=0) \end{aligned}$$

The joint distributions of \((\mathrm{risk}(X),\mathrm{risk}(X,Y))\) in cases and controls will be denoted by \(F^{D}(r,r^{\prime })\) and \(F^{\bar{D}}(r,r^{\prime })\) respectively.

2.2 Proportions at high risk and net benefit

In some settings a threshold exists for high risk classification and patients designated as ‘high risk’ receive an intervention. For example, patients whose 10-year risk of a cardiovascular event exceeds 20 % are recommended for cholesterol-lowering therapy (Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults 2001). A risk model performs well, in the sense of treating people who would have an event in the absence of therapy, i.e. the cases, if a large proportion of those subjects are placed in the high risk category by the model, i.e. if \(\mathrm{HR}^{D}(r)\equiv P[\mathrm{risk}>r|D=1]\) is large. Conversely, one must consider to what extent subjects who would not have an event in the absence of intervention, i.e. the controls, are inappropriately given the intervention. A good model will place few of the controls in the high risk category, i.e. \(\mathrm{HR}^{\bar{D}}(r)\equiv P[\mathrm{risk}>r|D=0]\) is small. The changes in \(\mathrm{HR}^{D}(r)\) and \(\mathrm{HR}^{\bar{D}}(r)\) gained by adding \(Y\) to the risk model are therefore key quantities for quantifying improvement in model performance for decision making when a therapeutic risk threshold exists:

$$\begin{aligned} \Delta \mathrm{HR}^{D}(r)&\equiv P[\mathrm{risk}(X,Y)>r|D=1]-P[\mathrm{risk}(X)>r|D=1]\\ \Delta \mathrm{HR}^{\bar{D}}(r)&\equiv P[\mathrm{risk}(X)>r|D=0]-P[\mathrm{risk}(X,Y)>r|D=0]. \end{aligned}$$

These measures are also called changes in the true and false positive rates. Note that our goal is to increase \(\mathrm{HR}^{D}(r)\) and reduce \(\mathrm{HR}^{\bar{D}}(r)\) by adding \(Y\) to the baseline risk model. Therefore positive values of \(\Delta \mathrm{HR}^{D}\) and \(\Delta \mathrm{HR}^{\bar{D}}\) are desirable.

There is a net expected benefit \((B)\) associated with designating a case as high risk and a net expected cost \((C)\) associated with designating a control as high risk. It has been noted that a rational choice of risk threshold is \(r=C/(C+B)\) (Pauker and Kassirer 1980; Vickers and Elkin 2006) and that the expected population net benefit associated with use of a risk model and threshold \(r\) to assign treatment is \(\text{ NB}(r)=\{\rho \mathrm{HR}^{D}(r)-(1-\rho )\frac{r}{(1-r)}\mathrm{HR}^{\bar{D}}(r)\} B\), where \(\rho \) is the population prevalence, \(P(D=1)\). Baker (2009) suggests standardizing \(\text{ NB}(r)\) by the maximum possible benefit, \(\rho B\), achieved when all cases and no controls are designated as high risk. This standardized measure, \(\text{ B}(r)\equiv \mathrm{HR}^{D}(r)-\frac{(1-\rho )}{\rho } \frac{r}{(1-r)} \text{ HR}^{\bar{D}}(r)\), the proportion of maximum benefit, can also be viewed as the true positive rate \(\mathrm{HR}^{D}(r)\) discounted (appropriately) for the false positive rate \(\mathrm{HR}^{\bar{D}}(r)\). The change in \(\text{ B}(r)\) that is achieved by adding \(Y\) to the risk model is an appropriate summary of its components \(\Delta \mathrm{HR}^{D}(r)\) and \(\Delta \mathrm{HR}^{\bar{D}}(r)\):

$$\begin{aligned} \Delta \text{ B}(r)=\Delta \mathrm{HR}^{D}(r)+\frac{1-\rho }{\rho }\frac{r}{1-r} \Delta \mathrm{HR}^{\bar{D}}(r). \end{aligned}$$
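
To fix ideas, the following sketch computes empirical versions of \(\Delta \mathrm{HR}^{D}(r)\), \(\Delta \mathrm{HR}^{\bar{D}}(r)\) and \(\Delta \text{ B}(r)\) from arrays of fitted risks. It is a minimal illustration assuming cohort (simple random sample) data; all variable names are ours.

```python
import numpy as np

def delta_high_risk_measures(risk_x_cases, risk_xy_cases,
                             risk_x_controls, risk_xy_controls, r, rho):
    """Empirical Delta HR^D(r), Delta HR^{Dbar}(r) and Delta B(r).

    Inputs are fitted risks under the baseline (X) and expanded (X,Y)
    models for cases and controls; r is the high risk threshold and
    rho = P(D=1) is the population prevalence.
    """
    # Gain in the proportion of cases classified as high risk
    d_hr_cases = np.mean(risk_xy_cases > r) - np.mean(risk_x_cases > r)
    # Reduction in the proportion of controls classified as high risk
    # (sign convention chosen so that positive values are desirable)
    d_hr_controls = np.mean(risk_x_controls > r) - np.mean(risk_xy_controls > r)
    # Change in the proportion of maximum benefit
    d_b = d_hr_cases + (1 - rho) / rho * r / (1 - r) * d_hr_controls
    return d_hr_cases, d_hr_controls, d_b
```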

In some settings all subjects receive treatment by default and the purpose of a prediction model is to identify low risk subjects who can forgo treatment. Parameters analogous to \(\Delta \mathrm{HR}^{D}(r),\) \(\Delta \mathrm{HR}^{\bar{D}}(r)\) and \(\Delta \text{ B}(r)\) can be defined but we do not focus on those here.

2.3 Performance measures related to fixed points on the ROC curve

When risk thresholds or costs and benefits are not available, other approaches to summarizing prediction performance have been proposed. Points on the ROC curve or on its inverse are commonly used in practice because of their familiarity from the evaluation of diagnostic tests and classifiers. We define

$$\begin{aligned} \Delta \mathrm{ROC}\big (p^{\bar{D}}\big ) = \mathrm{ROC}_{(X,Y)}\big (p^{\bar{D}}\big ) - \mathrm{ROC}_X\big (p^{\bar{D}}\big ) \end{aligned}$$

where \(\mathrm{ROC}(p^{\bar{D}})\) is the proportion of cases with risks above the threshold \(r(p^{\bar{D}})\) that allows the fraction \(p^{\bar{D}}\) of controls to be classified as high risk. Analogously,

$$\begin{aligned} \Delta \mathrm{ROC}^{-1}\big (p^D\big ) = \mathrm{ROC}^{-1}_X\big (p^D\big ) - \mathrm{ROC}^{-1}_{(X,Y)}\big (p^D\big ) \end{aligned}$$

where \(\mathrm{ROC}^{-1}(p^D)\) is the proportion of controls with risks above the threshold \(r(p^D)\) that is exceeded by the fraction \(p^D\) of cases.

Interestingly, the ROC points are closely related to measures proposed by Pfeiffer and Gail (2011) for quantifying prediction performance. They argue for choosing a high risk threshold \(r(p^D)\) so that a specified proportion of cases \((p^D)\) are designated as high risk and define the proportion needed to follow, \(\mathrm{PNF}(p^D)=P[\mathrm{risk}>r(p^D)]\), as a performance metric. In words, \(\mathrm{PNF}(p^D)\) is the proportion of the population designated as high risk in order that \(p^D\) of the cases are classified as high risk. A little algebra shows that \(\mathrm{PNF}(p^D)=\rho p^D +(1-\rho )\mathrm{ROC}^{-1}(p^D)\). The reduction in the proportion of the population needed to follow in order to identify \(p^D\) of the cases \((\Delta \mathrm{PNF})\) that is gained by adding \(Y\) to the model is

$$\begin{aligned} \Delta \mathrm{PNF}\big (p^D\big )=(1-\rho )\Delta \mathrm{ROC}^{-1}\big (p^D\big ). \end{aligned}$$

We choose to study \(\Delta \mathrm{ROC}^{-1}(p^D)\) here as it does not depend on the prevalence. Pfeiffer and Gail (2011) also define a performance metric that is the proportion of cases followed, \(\mathrm{PCF}(p)\), when a fixed proportion \(p\) of the population is designated as highest risk. This measure relates directly to the \(\mathrm{ROC}\):

$$\begin{aligned} \mathrm{PCF}(p)=\mathrm{ROC}\big (p^{\bar{D}}\big ) \end{aligned}$$

where \(p^{\bar{D}}\) is the point on the x-axis of the ROC plot such that \(p = \rho \mathrm{ROC}(p^{\bar{D}}) + (1-\rho )p^{\bar{D}}\). We study \(\Delta \mathrm{ROC}(p^{\bar{D}})\) rather than \(\Delta \mathrm{PCF}(p)\) here because of its widespread use and its independence from the prevalence.
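
The ROC-based measures can likewise be computed empirically from fitted risks. Below is a minimal sketch, again assuming cohort data and using our own variable names; the \(\Delta \mathrm{PNF}\) relation above follows directly.

```python
import numpy as np

def roc(risk_cases, risk_controls, p_controls):
    """ROC(p): proportion of cases above the risk threshold that puts
    the fraction p_controls of controls in the high risk category."""
    threshold = np.quantile(risk_controls, 1 - p_controls)
    return np.mean(risk_cases > threshold)

def roc_inv(risk_cases, risk_controls, p_cases):
    """ROC^{-1}(p): proportion of controls above the risk threshold
    exceeded by the fraction p_cases of cases."""
    threshold = np.quantile(risk_cases, 1 - p_cases)
    return np.mean(risk_controls > threshold)

def delta_roc_inv(risk_x_cases, risk_xy_cases,
                  risk_x_controls, risk_xy_controls, p_cases, rho=None):
    """Delta ROC^{-1}(p^D); if rho is supplied, Delta PNF is also returned."""
    d = (roc_inv(risk_x_cases, risk_x_controls, p_cases)
         - roc_inv(risk_xy_cases, risk_xy_controls, p_cases))
    return d if rho is None else (d, (1 - rho) * d)
```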

2.4 Global performance measures that do not specify a risk threshold

The above measures require explicit or implicit choices of risk thresholds. Measures that, in some sense, average over all risk thresholds are popular in part because they avoid the need to choose a threshold. The change in the area under the ROC curve from adding \(Y\) to the model, denoted \(\Delta \mathrm{AUC}\), is the most commonly used measure in practice. The AUC is often written as

$$\begin{aligned} \mathrm{AUC}= P(\mathrm{risk}_i > \mathrm{risk}_j | D_i=1, D_j=0) \end{aligned}$$

and

$$\begin{aligned} \Delta \mathrm{AUC}= \mathrm{AUC}_{(X,Y)} - \mathrm{AUC}_X. \end{aligned}$$

A more recently proposed measure, called the integrated discrimination improvement (IDI) index, is the change in the difference in mean risks between cases and controls:

$$\begin{aligned} \mathrm{IDI}= \Delta \mathrm{MRD}= \mathrm{MRD}_{(X,Y)} - \mathrm{MRD}_X \end{aligned}$$

where

$$\begin{aligned} \mathrm{MRD}= E(\mathrm{risk}|D=1)-E(\mathrm{risk}|D=0). \end{aligned}$$

Both the AUC and the MRD are measures of distance between the case and control distributions of modeled risks. Another measure of distance between distributions is the above average risk difference:

$$\begin{aligned} \mathrm{AARD}= P(\mathrm{risk}>\rho |D=1)-P(\mathrm{risk}>\rho |D=0), \end{aligned}$$

the name deriving from the fact that \(E(\mathrm{risk})=\rho \) regardless of the risk model. We study the AARD because it is related to several other measures of prediction performance. We note in particular that \(\mathrm{AARD}= \text{ B}(\rho )\). Youden’s index is a measure of diagnostic performance for binary tests and we write \(\text{ YI}(r)=\mathrm{HR}^{D}(r)-\mathrm{HR}^{\bar{D}}(r)\). We note that \(\mathrm{AARD}=\text{ YI}(\rho )\). Moreover, theory from Gu and Pepe (2009a) implies that \(\text{ YI}(\rho )=\max _{p}(\mathrm{ROC}(p)-p)=\max _{r}\text{ YI}(r)\). Therefore, \(\mathrm{AARD}=\max _{r}\text{ YI}(r)\). This maximum is also known as the Kolmogorov–Smirnov measure of distance between the case and control risk distributions. Finally, Gu and Pepe (2009a) also showed that this statistic is equal to the standardized total gain statistic (Bura and Gastwirth 2001), a measure derived from the population distribution of risk. The measure of improvement in prediction performance that we consider is the difference between the AARD calculated with \(\mathrm{risk}(X,Y)\) and the AARD calculated with \(\mathrm{risk}(X)\):

$$\begin{aligned} \Delta \mathrm{AARD}= \mathrm{AARD}_{(X,Y)} - \mathrm{AARD}_X. \end{aligned}$$
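
The following sketch computes the MRD, the AARD and its Kolmogorov–Smirnov representation \(\max _{r}\text{ YI}(r)\) from fitted risks (cohort data assumed; variable names are ours). With well-calibrated risks, aard() and max_youden() should approximately agree, as the theory above indicates.

```python
import numpy as np

def mrd(risk_cases, risk_controls):
    """Mean risk difference between cases and controls."""
    return np.mean(risk_cases) - np.mean(risk_controls)

def aard(risk_cases, risk_controls, rho):
    """Above average risk difference: YI evaluated at threshold rho."""
    return np.mean(risk_cases > rho) - np.mean(risk_controls > rho)

def max_youden(risk_cases, risk_controls):
    """max_r YI(r): the Kolmogorov-Smirnov distance between the case
    and control risk distributions."""
    thresholds = np.unique(np.concatenate([risk_cases, risk_controls]))
    yi = [np.mean(risk_cases > t) - np.mean(risk_controls > t)
          for t in thresholds]
    return max(yi)

# Delta AARD and Delta MRD compare the expanded and baseline models, e.g.
# d_aard = aard(risk_xy_cases, risk_xy_controls, rho) \
#        - aard(risk_x_cases, risk_x_controls, rho)
```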

2.5 Risk reclassification performance measures

Reclassification measures of performance compare \(\mathrm{risk}(X,Y)\) with \(\mathrm{risk}(X)\) within individuals and summarize across subjects. The most popular measure is the net reclassification improvement (NRI) index (Pencina et al. 2008). We focus on the continuous NRI (Pencina et al. 2011), written NRI\((>\!\!0)\):

$$\begin{aligned} \mathrm{NRI}(>0)&\equiv P(\mathrm{risk}(X,Y)>\mathrm{risk}(X)|D=1) - P(\mathrm{risk}(X,Y)<\mathrm{risk}(X)|D=1) \\&+ P(\mathrm{risk}(X,Y)<\mathrm{risk}(X)|D=0) - P(\mathrm{risk}(X,Y)>\mathrm{risk}(X)|D=0)\\&= 2 \{P(\mathrm{risk}(X,Y)>\mathrm{risk}(X)|D=1)-P(\mathrm{risk}(X,Y)>\mathrm{risk}(X)|D=0)\} \end{aligned}$$

It is interesting to consider the NRI\((>\!\!0)\) statistic when the baseline model contains no covariates, i.e. when all subjects are assigned \(\mathrm{risk}=\rho \). In this setting it is related to measures mentioned previously:

$$\begin{aligned} \mathrm{NRI}(>0)=2\left\{ \mathrm{HR}^{D}(\rho )-\mathrm{HR}^{\bar{D}}(\rho ) \right\} = 2\, \mathrm{AARD}= 2\, \text{ YI}(\rho ) = 2\, \text{ B}(\rho ). \end{aligned}$$

Originally the NRI was proposed for categories of risk and was defined as the net proportion of cases that moved to a higher risk category plus the net proportion of controls that moved to a lower risk category. When there are two categories, above or below the risk threshold \(r\), the NRI\(=\Delta \mathrm{HR}^{D}(r)+\Delta \mathrm{HR}^{\bar{D}}(r)=\Delta \text{ YI}(r)\). Like \(\Delta \text{ B}(r)\), it is a weighted summary of the improvements in true and false positive rates, but it weights the two components equally rather than with the cost–benefit derived weights used by \(\Delta \text{ B}(r)\).

Another risk reclassification measure is the IDI, also defined as:

$$\begin{aligned} \mathrm{IDI}= E\{\mathrm{risk}(X,Y)-\mathrm{risk}(X)|D=1\}+E\{\mathrm{risk}(X)-\mathrm{risk}(X,Y)|D=0\}. \end{aligned}$$

Interestingly, because of the linearity of the expectation, this measure of individual changes in risk due to adding \(Y\) to the model can also be interpreted as a difference of two population performance measures. That is, as noted earlier

$$\begin{aligned} \Delta \mathrm{MRD}= \mathrm{MRD}_{(X,Y)} - \mathrm{MRD}_X = \mathrm{IDI}. \end{aligned}$$
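
Given fitted risks from both models, the continuous NRI and the IDI reduce to simple sample averages in a cohort. The sketch below uses the simplified form of NRI(\(>\!\!0\)) displayed above (ties are ignored); these naïve empirical versions require the reweighting developed in Sect. 3 when applied to matched case–control data. Variable names are ours.

```python
import numpy as np

def nri_continuous(risk_x_cases, risk_xy_cases,
                   risk_x_controls, risk_xy_controls):
    """NRI(>0) = 2{P(up-move | case) - P(up-move | control)},
    the simplification above; ties are ignored."""
    up_cases = np.mean(risk_xy_cases > risk_x_cases)
    up_controls = np.mean(risk_xy_controls > risk_x_controls)
    return 2 * (up_cases - up_controls)

def idi(risk_x_cases, risk_xy_cases, risk_x_controls, risk_xy_controls):
    """IDI = Delta MRD: change in the mean risk difference between
    cases and controls when Y is added to the model."""
    return ((np.mean(risk_xy_cases) - np.mean(risk_xy_controls))
            - (np.mean(risk_x_cases) - np.mean(risk_x_controls)))
```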

3 Estimation from matched and unmatched designs

We now consider how the measures defined above can be estimated from a cohort study within which a case–control study of a new predictor is nested.

3.1 Data

We assume that data on the outcome and baseline covariates are available on a simple random sample of \(N\) independent identically distributed observations: \((D_{k}, X_{k}),\ k=1,\ldots ,N\). We select a simple random sample of \(n_{D}\) cases from the cohort on whom to ascertain \(Y\): \(Y_{i},\ i=1,\ldots ,n_{D}\). The controls on whom \(Y\) is ascertained, \(\{Y_{j},\ j=1,\ldots , n_{\bar{D}}\}\), may be obtained as a simple random sample in an unmatched design. Alternatively, in a matched design, a categorical variable \(W\) is defined as a function of \(X\), \(W=W(X)\), and the number of controls within each level of \(W\) is chosen to equal a constant \(K\) times the number of cases with that value of \(W\).

As shown in Table 1, all performance improvement measures are defined as functions of the risk distributions (notation in Sect. 2.1). We estimate \(\mathrm{risk}(X)\) and \(\mathrm{risk}(X,Y)\) first, then estimate their distributions in cases and controls and substitute the estimated distributions into expressions for the performance improvement measures.

3.2 Estimating risk functions

For the baseline model, we fit a regression model to the cohort data \(\{(D_{k},X_{k}), k=1,\ldots ,N\}\) and calculate predicted risks, \(\widehat{\mathrm{risk}}(X)\), for each individual in the cohort. For the expanded model, \(\mathrm{risk}(X,Y)\), we consider two approaches.

Case–control with adjustment We fit a model to data from the case–control subset, yielding fitted values \(\widehat{\mathrm{risk}}^{cc} (X,Y)\), and then adjust the intercept to the prevalence in the cohort

$$\begin{aligned} \mathrm{logit }\ \widehat{\mathrm{risk}}^{adj} (X,Y)=\mathrm{logit }\ \widehat{\mathrm{risk}}^{cc}(X,Y)-\mathrm{logit }\left( \frac{n_{D}}{n}\right)+\mathrm{logit }\left( \frac{N_{D}}{N}\right), \end{aligned}$$

where \(n=n_{D}+n_{\bar{D}}\) and \(N_{D}\) is the number of cases in the cohort. This is a well-known, standard approach to estimating absolute risk in epidemiologic case–control studies (Breslow 1996). It draws upon the results of Prentice and Pyke (1979), who showed that a prospective logistic model can be fit to retrospective data from a case–control study with a slight modification that adds an offset term to the logistic model. The approach maximizes the pseudo- (or conditional-) likelihood that an observation in the case–control sample is a case or a control (Breslow and Cain 1988; Fears and Brown 1986).

However, this approach does not account for matching. Pencina et al. (2011) presented a similar approach that used intercept adjustment to estimate NRI(\(>\)0) in the context of simple case–control studies.
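
A minimal sketch of the intercept adjustment: given fitted risks from a logistic model fit to the case–control subset, shift their logits by the difference between the cohort and sample log odds of disease. Variable names are ours.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(z):
    return 1 / (1 + np.exp(-z))

def adjust_intercept(risk_cc, n_cases, n_controls, N_cases, N):
    """logit risk_adj = logit risk_cc - logit(n_D/n) + logit(N_D/N),
    where (n_cases, n_controls) describe the case-control sample and
    (N_cases, N) the cohort."""
    n = n_cases + n_controls
    return expit(logit(risk_cc) - logit(n_cases / n) + logit(N_cases / N))
```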

Two-stage Two-stage methods acknowledge that selection of subjects for whom \(Y\) is measured, i.e. the second stage of sampling, may depend on their values of \((D,X)\) found in the first stage. In particular, they account for matching. We generalize the intercept adjustment idea presented above to account for matching on \(X\). This requires using the cohort to adjust the odds ratio associated with \(X\). The odds ratio associated with \(Y\) is correctly estimated using standard logistic regression applied to the case–control dataset. We use the corresponding fitted values but adjust them using fitted values from the baseline model fit to the cohort and to the case–control datasets. Specifically, if we let \(\widehat{\mathrm{risk}}^{cohort}(X)\) and \(\widehat{\mathrm{risk}}^{cc}(X)\) denote the fitted values for the baseline models, then the two-stage estimator of the absolute risk is:

$$\begin{aligned} \mathrm{logit }\ \widehat{\mathrm{risk}}^{2-stage}(X,Y) = \mathrm{logit }\ \widehat{\mathrm{risk}}^{cc}(X,Y)-\mathrm{logit }\ \widehat{\mathrm{risk}}^{cc}(X)+\mathrm{logit }\ \widehat{\mathrm{risk}}^{cohort}(X). \end{aligned}$$

Using ‘\(cohort\)’ and ‘\(cc\)’ to denote sampling in the cohort or in the case–control subset, the rationale for \(\widehat{\mathrm{risk}}^{2-stage}(X,Y)\) derives from the facts that

$$\begin{aligned} \mathrm{logit }\ P(D=1|X,Y,cohort) = \mathrm{logit }\ P(D=1|X,cohort) + \log \ \mathrm{DLR}_{X}(Y) \end{aligned}$$

and

$$\begin{aligned} \mathrm{logit }\ P(D=1|X,Y,cc)=\mathrm{logit }\ P(D=1|X,cc) +\log \ \mathrm{DLR}_{X}(Y) \end{aligned}$$

where the covariate-specific diagnostic likelihood ratio

$$\begin{aligned} \mathrm{DLR}_{X}(Y)=P(Y|X,D=1)/P(Y|X,D=0) \end{aligned}$$

is the same in the (matched or unmatched) case–control and cohort populations. The equations are a simple application of Bayes’ theorem (Gu and Pepe 2009b). Substituting the expression for \(\log \ \mathrm{DLR}_X(Y)\) derived from the case–control equation into that for the cohort equation gives the expression above for \(\mathrm{logit }\ \widehat{\mathrm{risk}}^{2-stage}(X,Y)\).
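
The two-stage estimator can be assembled from three ordinary logistic fits. The sketch below uses statsmodels and, for simplicity, assumes numpy arrays with a single baseline covariate (add_constant handles a matrix of covariates equally well); argument names are ours, and the function returns fitted two-stage risks for the case–control subjects.

```python
import numpy as np
import statsmodels.api as sm

def expit(z):
    return 1 / (1 + np.exp(-z))

def two_stage_risk(D_cohort, X_cohort, D_cc, X_cc, Y_cc):
    """logit risk_2stage(X,Y) = logit risk_cc(X,Y) - logit risk_cc(X)
                              + logit risk_cohort(X)."""
    # Baseline model fit to the full cohort
    m_cohort = sm.Logit(D_cohort, sm.add_constant(X_cohort)).fit(disp=0)
    # Baseline and expanded models fit to the case-control subset
    m_cc_x = sm.Logit(D_cc, sm.add_constant(X_cc)).fit(disp=0)
    XY_cc = sm.add_constant(np.column_stack([X_cc, Y_cc]))
    m_cc_xy = sm.Logit(D_cc, XY_cc).fit(disp=0)
    # Linear predictors (logit scale) for the case-control subjects
    lp_cc_xy = XY_cc @ m_cc_xy.params
    lp_cc_x = sm.add_constant(X_cc) @ m_cc_x.params
    lp_cohort_x = sm.add_constant(X_cc) @ m_cohort.params
    # Combine on the logit scale and transform back to absolute risks
    return expit(lp_cc_xy - lp_cc_x + lp_cohort_x)
```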

3.3 Estimating distributions of risk

To estimate the risk distributions, we draw upon previously proposed methods for the estimation of risk distributions in simple case–control studies (Gu and Pepe 2009b; Huang et al. 2007; Huang and Pepe 2009). Here, we propose methodology for estimation with matched nested case–control data, which has not been previously considered. We estimate the baseline risk distributions, \(F^{D}_{X}\) and \(F^{\bar{D}}_{X}\), using the empirical distributions of \(\widehat{\mathrm{risk}}(X)\) in the cohort data. Since the cases in the case–control set are drawn as a simple random sample from the cases in the cohort, we use the empirical distribution of \(\widehat{\mathrm{risk}}(X,Y)\) in the cases as the estimator of \(F^{D}_{X,Y}\). For estimation of the distribution of \(\widehat{\mathrm{risk}}(X,Y)\) in the controls, we propose nonparametric and semiparametric approaches.

Nonparametric estimation In unmatched case–control studies we can also use the empirical distribution of \(\widehat{\mathrm{risk}}(X,Y)\) among the controls to estimate \(F^{\bar{D}}_{X,Y}\). However, in matched designs the controls are not a simple random sample and the distribution of \(\widehat{\mathrm{risk}}(X,Y)\) must be reweighted to reflect the distribution in the population. Specifically, letting \(c=1, \ldots ,C\) represent the distinct levels of the matching variable, we can write

$$\begin{aligned} F^{\bar{D}}_{X,Y}(r)&= P\{\mathrm{risk}(X,Y)\le r|D=0\} \\&= \sum ^{C}_{c=1}P\{\mathrm{risk}(X,Y)\le r|D=0,W=c\}P(W=c|D=0). \end{aligned}$$
(1)

A nonparametric estimator substitutes the observed proportions in the cohort for \(P(W=c|D=0)\) and the empirical stratum-specific distributions of \(\widehat{\mathrm{risk}}(X,Y)\) for \(P\{\mathrm{risk}(X,Y)\le r|D=0,W=c\}\). We also consider a semiparametric estimator that substitutes semiparametric stratum-specific estimates for \(P\{\mathrm{risk}(X,Y)\le r|D=0,W=c\}\).
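
A sketch of the nonparametric estimator under matching, following (1): stratum-specific empirical distributions from the sampled controls are reweighted by the cohort stratum frequencies. Variable names are ours.

```python
import numpy as np

def f_xy_controls_np(r, risk_xy_cc_controls, w_cc_controls, w_cohort_controls):
    """Nonparametric estimate of F_{X,Y}^{Dbar}(r) from a matched design.

    risk_xy_cc_controls : fitted risks for the sampled (case-control) controls
    w_cc_controls       : matching stratum W for the sampled controls
    w_cohort_controls   : matching stratum W for all cohort controls
    """
    estimate = 0.0
    levels, counts = np.unique(w_cohort_controls, return_counts=True)
    for c, n_c in zip(levels, counts):
        in_c = (w_cc_controls == c)
        # Stratum-specific empirical CDF; a stratum with no sampled
        # controls leaves this term undefined (the sparse-data problem
        # discussed in Sect. 4)
        cdf_c = np.mean(risk_xy_cc_controls[in_c] <= r) if in_c.any() else 0.0
        # Weight by the cohort control proportion P(W=c | D=0)
        estimate += cdf_c * n_c / len(w_cohort_controls)
    return estimate
```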

Semiparametric estimation Observe that

$$\begin{aligned} P\{\mathrm{risk}(X,Y)\le r|D=0,W=c\}=E\{P(\mathrm{risk}(X,Y)\le r|D=0,X)|D=0,W=c\}. \end{aligned}$$
(2)

A semiparametric location-scale model for the distribution of \(Y\) conditional on \((D=0,X)\) is written

$$\begin{aligned} Y=\mu ^{\bar{D}}(X)+\sigma ^{\bar{D}}(X)\varepsilon \end{aligned}$$

where the distribution of \(\varepsilon \) is unspecified, \(\varepsilon \sim F_{0}\), and \(\mu ^{\bar{D}}(X)\) and \(\sigma ^{\bar{D}}(X)\) are parametric functions of \(X\) (Heagerty and Pepe 1999). After fitting the regression functions \(\mu ^{\bar{D}}(X)\) and \(\sigma ^{\bar{D}}(X)\), the empirical distribution of the residuals \(\hat{\varepsilon }_{j} = (Y_{j}-\widehat{\mu }^{\bar{D}}(X_{j}))/\widehat{\sigma }^{\bar{D}}(X_{j}),\ j=1,\ldots ,n_{\bar{D}}\), calculated from the controls, yields an estimator \(\widehat{F}_{0}\). The semiparametric estimate of the distribution of \(Y\) is then

$$\begin{aligned} \widehat{P}(Y\le y|D=0,X)&= \widehat{P}\left\{ \frac{Y - \widehat{\mu }^{\bar{D}}(X)}{\widehat{\sigma }^{\bar{D}}(X)} \le \frac{y - \widehat{\mu }^{\bar{D}}(X)}{\widehat{\sigma }^{\bar{D}}(X)} \bigg | D=0, X \right\} \\&= \widehat{P}\left\{ \hat{\varepsilon } \le \frac{y - \widehat{\mu }^{\bar{D}}(X)}{\widehat{\sigma }^{\bar{D}}(X)} \bigg | D=0, X \right\} \\&= \widehat{F}_{0}\left\{ \frac{y-\widehat{\mu }^{\bar{D}}(X)}{\widehat{\sigma }^{\bar{D}}(X)} \right\} , \end{aligned}$$
(3)

which in turn yields \(\widehat{P}\{\mathrm{risk}(X,Y)\le r|D=0,X\}\). For example, if we use a logistic model for \(\mathrm{risk}(X,Y)\) and write \(\mathrm{logit }\ \widehat{\mathrm{risk}}(X,Y)=\widehat{\theta }_{0}+\widehat{\theta }_{1}X+\widehat{\theta }_{2}Y\) where \(\widehat{\theta }_{2}>0\), then

$$\begin{aligned} \widehat{P}\{\mathrm{risk}(X,Y)\le r|D=0,X\}&= \widehat{P}\left\{ \mathrm{logit }\ \widehat{\mathrm{risk}}(X,Y) \le \mathrm{logit }(r) | D=0, X \right\} \\&= \widehat{P}\left\{ \widehat{\theta }_0 + \widehat{\theta }_1 X + \widehat{\theta }_2 Y \le \mathrm{logit }(r) | D=0, X \right\} \\&= \widehat{P}\left\{ Y\le \frac{\mathrm{logit }(r)-\widehat{\theta }_{0}-\widehat{\theta }_{1}X}{\widehat{\theta }_{2}} \bigg | D=0, X \right\} \\&= \widehat{F}_{0}\left\{ \frac{ \frac{\mathrm{logit }(r)-\widehat{\theta }_{0}-\widehat{\theta }_{1}X}{\widehat{\theta }_{2}} -\widehat{\mu }^{\bar{D}}(X)}{\widehat{\sigma }^{\bar{D}}(X)} \right\} , \end{aligned}$$

by substituting into (3). In turn, we estimate (2) as

$$\begin{aligned} \widehat{P}\{\mathrm{risk}(X,Y)\le r|D=0,W=c\}=\frac{\sum ^{N}_{j=1}\widehat{P}\{\mathrm{risk}(X_{j},Y)\le r|D_{j}=0,X_{j}\}\ I\{W(X_{j})=c,D_{j}=0\}}{N_{\bar{D}}^{c}} \end{aligned}$$

where \(N_{\bar{D}}^{c}\) is the number of controls in the cohort with matching covariate value \(W=c\). This estimator is then substituted into (1) to get \(\widehat{F}^{\bar{D}}_{X,Y}(r)\). As noted above, a nonparametric estimator substitutes the observed proportions in the cohort for \(P(W=c|D=0)\), so that \(\widehat{P}(W=c|D=0) = \frac{N_{\bar{D}}^{c}}{N_{\bar{D}}}\). The semiparametric estimator then simplifies to

$$\begin{aligned} \widehat{F}^{\bar{D}}_{X,Y}(r) = \widehat{P}\{\mathrm{risk}(X,Y)\le r|D=0\} = \frac{\sum ^{N}_{j=1}\widehat{P}\{\mathrm{risk}(X_{j},Y)\le r|D_{j}=0,X_{j}\}\ I\{D_{j}=0\} }{N_{\bar{D}}} \end{aligned}$$

for both matched and unmatched studies.

Both nonparametric and semiparametric estimators of \(F^{\bar{D}}_{X,Y}\) are accompanied by a nonparametric estimator of \(F^{D}_{X,Y}\).
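
Putting the pieces together, the sketch below computes the simplified semiparametric estimator of \(\widehat{F}^{\bar{D}}_{X,Y}(r)\) displayed above, for a logistic model \(\mathrm{logit }\ \widehat{\mathrm{risk}}(X,Y)=\widehat{\theta }_{0}+\widehat{\theta }_{1}X+\widehat{\theta }_{2}Y\) with \(\widehat{\theta }_{2}>0\). For illustration it assumes a single baseline covariate, a linear mean model and a constant scale for \(Y\) given \(X\) among controls; the method allows general parametric \(\mu ^{\bar{D}}(X)\) and \(\sigma ^{\bar{D}}(X)\).

```python
import numpy as np
import statsmodels.api as sm

def f_xy_controls_sp(r, theta, X_cohort_controls, X_cc_controls, Y_cc_controls):
    """Semiparametric estimate of F_{X,Y}^{Dbar}(r).

    theta = (theta0, theta1, theta2) from the fitted expanded logistic model.
    """
    th0, th1, th2 = theta
    # Location-scale model for Y given X among sampled controls:
    # linear mean, constant scale (a special case of the general model)
    ols = sm.OLS(Y_cc_controls, sm.add_constant(X_cc_controls)).fit()
    sigma = np.std(ols.resid, ddof=2)
    eps = ols.resid / sigma          # standardized residuals, draws from F0
    # risk(X,Y) <= r  is equivalent to  Y <= (logit(r) - th0 - th1*X) / th2
    logit_r = np.log(r / (1 - r))
    y_cut = (logit_r - th0 - th1 * X_cohort_controls) / th2
    # Evaluate F0 at the standardized cutoff for every cohort control
    mu = sm.add_constant(X_cohort_controls) @ ols.params
    z = (y_cut - mu) / sigma
    return np.mean([np.mean(eps <= zi) for zi in z])
```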

3.4 Estimates of performance improvement measures

In Table 1, we presented the definitions of all performance improvement measures being studied here. Observe that estimates of \(\Delta \mathrm{HR}^{D}(r), \Delta \mathrm{HR}^{\bar{D}}(r), \Delta \text{ B}(r)\) and \(\Delta \mathrm{AARD}\) follow directly from the estimators described above for the cumulative distributions of \(\mathrm{risk}(X)\) and \(\mathrm{risk}(X,Y)\) in cases and in controls. Note that since \(\Delta \mathrm{HR}^{D}(r)\) relies only on \(F^{D}_{X,Y}\) and \(F^{D}_{X}\), what we refer to as nonparametric and semiparametric estimates of \(\Delta \mathrm{HR}^{D}(r)\) are in fact the same empirical estimate.

The pointwise ROC measures are also calculated directly, after noting that \(\mathrm{ROC}(p^{\bar{D}})=1-F^{D}(r(p^{\bar{D}}))\) where \(r(p^{\bar{D}})\) is such that \(1-F^{\bar{D}}(r(p^{\bar{D}}))=p^{\bar{D}}\) and \(\mathrm{ROC}^{-1}(p^{D})=1-F^{\bar{D}}(r(p^{D}))\) where \(r(p^{D})\) is such that \(1-F^{D}(r(p^{D}))=p^{D}\).

For \(\Delta \mathrm{AUC}\), we use the usual empirical estimator with cohort data for the baseline value \(\mathrm{AUC}_{X}\), while we use

$$\begin{aligned} \widehat{\mathrm{AUC}}_{(X,Y)}=\frac{1}{n_{D}}\sum ^{n_{D}}_{i=1}\widehat{F}^{\bar{D}}_{X,Y}\left\{ \widehat{\mathrm{risk}}(X_{i},Y_{i})\right\} , \end{aligned}$$

where the summation is over cases, for the enhanced model. Note that this is equal to the usual empirical estimator in an unmatched study but that it also yields an estimate of \(P\{\mathrm{risk}(X_{j},Y_{j})\le \mathrm{risk}(X_{i},Y_{i})|D_{i}=1,D_{j}=0\}\) in the matched design setting.
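
A sketch of this AUC estimator: average any estimator of \(F^{\bar{D}}_{X,Y}\) (for example, one of the functions sketched in Sect. 3.3) over the fitted case risks.

```python
import numpy as np

def auc_xy(risk_xy_cases, f_xy_controls):
    """AUC for the expanded model: mean of the estimated control risk
    CDF evaluated at each case's fitted risk. f_xy_controls maps a risk
    value to an estimate of F_{X,Y}^{Dbar} at that value."""
    return np.mean([f_xy_controls(r_i) for r_i in risk_xy_cases])
```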

The baseline MRD is calculated empirically from the cohort values of \(\widehat{\mathrm{risk}}(X)\) while the enhanced model MRD is calculated as

$$\begin{aligned} \widehat{\mathrm{MRD}}_{(X,Y)} = \frac{1}{n_{D}}\sum ^{n_{D}}_{i=1}\widehat{\mathrm{risk}}(X_{i},Y_{i})-\sum ^{C}_{c=1}\widehat{E}\left\{ \widehat{\mathrm{risk}}(X,Y)|D=0,W=c\right\} \widehat{P}(W=c|D=0). \end{aligned}$$

For the nonparametric estimator, \(\widehat{E}\{\widehat{\mathrm{risk}}(X,Y)|D=0,W=c\}\) is the stratum-specific sample average of \(\widehat{\mathrm{risk}}(X,Y)\) over controls in the case–control study. For the semiparametric estimator, \(\widehat{E}\{\widehat{\mathrm{risk}}(X,Y)|D=0,W=c\}\) is calculated as the average of

$$\begin{aligned} \int \widehat{\mathrm{risk}}(X_{i},y)\, d\widehat{F}_{0}\Bigg \{\frac{y-\widehat{\mu }^{\bar{D}}(X_{i})}{\widehat{\sigma }^{\bar{D}}(X_{i})}\Bigg \} = \frac{1}{n_{\bar{D}}} \sum _{j=1}^{n_{\bar{D}}} \widehat{\mathrm{risk}}\Bigg \{X_{i},\ \widehat{\mu }^{\bar{D}}(X_{i}) + \frac{Y_{j}-\widehat{\mu }^{\bar{D}}(X_{j})}{\widehat{\sigma }^{\bar{D}}(X_{j})}\, \widehat{\sigma }^{\bar{D}}(X_{i}) \Bigg \} \end{aligned}$$

over the controls in the cohort stratum with \(W=c\).

The event component of the NRI(\(>\!\!0\)) statistic requires estimation of \(P\{\mathrm{risk}(X,Y)>\mathrm{risk}(X)|D=1\}\), for which we use the observed proportion of cases in the case–control study with \(\widehat{\mathrm{risk}}(X,Y)>\widehat{\mathrm{risk}}(X)\). The non-event component requires \(P\{\mathrm{risk}(X,Y)<\mathrm{risk}(X)|D=0\}\), which is estimated as a weighted average of the stratum-specific observed proportions for the nonparametric estimator and as \(\frac{1}{N_{\bar{D}}}\sum ^{N_{\bar{D}}}_{i=1} \widehat{P}\{\widehat{\mathrm{risk}}(X_{i},Y)<\widehat{\mathrm{risk}}(X_{i})|D_i=0,X_{i}\}\) for the semiparametric estimator.

Further details of the performance measure estimators obtained in each scenario are presented in Appendix Tables 8, 9 and 10.

3.5 Summary of estimation approaches

In Table 1, we showed that all performance improvement measures are functions of the risk distributions. Therefore, regardless of which measure is used, estimation of performance improvement is a two-fold task that requires estimating: (1) the risk functions \(\text{ risk}(X)\) and \(\text{ risk}(X,Y)\), and (2) the distributions of the risk functions in cases and in controls. We then substitute the estimated distributions into expressions for the performance improvement measures.

We estimated both risk functions parametrically using simple logistic models with linear terms. Other more flexible forms may be used in practice. In Sect. 3.2, we presented two different modeling approaches for estimating \(\text{ risk}(X,Y)\) under the logistic regression framework. The first method (\(M_{adj}\)) is a commonly used approach that utilizes the data in the case–control subset together with the cohort prevalence, and it is valid only for an unmatched design. The second method (\(M_{2-stage}\)) is a two-stage estimator that utilizes additional data from the cohort and is valid for both matched and unmatched designs. By comparing these two approaches to modeling the risk function, we aim to demonstrate that matching invalidates the commonly used naïve analysis. Additionally, we investigate whether utilizing the parent cohort data for \(X\) improves the efficiency of risk function estimation.

In Sect. 3.3, we turned our attention to the estimation of the risk distributions in cases and in controls. We estimated the distributions of \(\text{ risk}(X)\) using the empirical distributions in the cohort. We also estimated the distribution of \(\text{ risk}(X,Y)\) in cases empirically. For the estimation of the risk distribution in controls, we proposed nonparametric and semiparametric approaches for matched and unmatched case–control designs. The nonparametric approach has the advantage of making no modeling assumptions about the distribution of \(Y\) given \(X\) in controls. The semiparametric approach does make modeling assumptions and borrows information across strata of controls, and is therefore expected to be more efficient. One would therefore use the nonparametric approach in situations where there is uncertainty about how to model the distribution of \(Y\) given \(X\) in controls, and the semiparametric approach in situations with sparse controls. Using these two approaches for estimating the risk distribution, we aim to compare the efficiency of semiparametric estimation to that of nonparametric estimation.

Finally, using the above methods, we aim to answer the question of whether matching in the nested case–control subset improves efficiency in the estimation of performance improvement measures.

4 Simulation studies

We investigated the performance of the estimators and the merits of matched study designs using two small simulation studies: in the first, we generated data from a bivariate binormal model; in the second, we used a real dataset.

4.1 Simulation study 1: bivariate binormal data

4.1.1 Data generation

We generated bivariate binormal cohort data of size \(N=5{,}000\) for cases \((D=1)\) and controls \((D=0)\) with population prevalence \(\rho =P(D=1)=0.10\), so that the cohort contained \(N_D=500\) cases and \(N_{\bar{D}}=4{,}500\) controls:

$$\begin{aligned} \left( \begin{matrix} X\\ Y \end{matrix} \right)\sim \mathrm{BVN} \left( \left( \begin{matrix} \mu _{X}(D)\\ \mu _{Y}(D) \end{matrix} \right), \left( \begin{matrix} 1&\quad \mathrm{corr}(X,Y|D)\\ \mathrm{corr}(X,Y|D)&\quad 1 \end{matrix} \right) \right) \end{aligned}$$

where \(\mu _{X}(0)=\mu _{Y}(0)=0\) and \(\mu _{X}(1)=\mu _{Y}(1)=0.742\). The corresponding AUC values associated with \(X\) and \(Y\) alone are \(\mathrm{AUC}_{X}=\mathrm{AUC}_{Y}=\Phi (0.742/\sqrt{2})=0.7\). The generated data \(\{(D_{i},X_{i}),\ i=1,\ldots ,N\}\) constitute the study cohort. A random sample of \(n_{D}=250\) cases was selected from the cohort and their \(Y\) values added to the dataset. For the unmatched design, \(Y\) values for a random sample of \(n_{\bar{D}}=500\) controls were also added to the dataset. For the matched design, we generated the matching variable \(W\) using quartiles of \(X\) in the control population and randomly selected two controls for each case within each of the four \(W\) strata.
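
For concreteness, a sketch of this data-generating and sampling scheme (our own code; the seed and implementation details are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_cases_cohort, mu1, corr = 5000, 500, 0.742, 0.5

# Cohort: 500 cases, 4,500 controls; (X, Y) bivariate normal with the
# mean shifted by mu1 in both coordinates for cases
D = np.r_[np.ones(n_cases_cohort, int), np.zeros(N - n_cases_cohort, int)]
Z = rng.multivariate_normal([0, 0], [[1, corr], [corr, 1]], size=N)
X, Y = mu1 * D + Z[:, 0], mu1 * D + Z[:, 1]

# Second stage: a simple random sample of 250 cases
cases, controls = np.flatnonzero(D == 1), np.flatnonzero(D == 0)
cc_cases = rng.choice(cases, 250, replace=False)

# Matched design: W = quartile of X among cohort controls,
# with 2 controls sampled per case within each W stratum
W = np.digitize(X, np.quantile(X[controls], [0.25, 0.5, 0.75]))
cc_controls = np.concatenate([
    rng.choice(controls[W[controls] == c],
               2 * np.sum(W[cc_cases] == c), replace=False)
    for c in range(4)])

# Unmatched design: a simple random sample of 500 controls
cc_controls_unmatched = rng.choice(controls, 500, replace=False)
```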

4.1.2 Results

Using the notation \(M\) for a generic performance improvement measure, Table 2 shows mean values of estimates derived from 5,000 simulations. Estimates calculated using the adjusted case–control modeling approach for \(\text{ risk}(X,Y)\) are denoted by \(M_{adj}\), while estimates calculated using the two-stage modeling approach are denoted by \(M_{2-stage}\). Bias estimates are calculated by comparing the mean estimates with the true value of each measure. We see that the \(M_{adj}\) estimators are valid in unmatched designs, in the sense that mean values are close to the true values. However, the \(M_{adj}\) estimators are biased in matched designs because they do not account for matching. Note that the direction and size of the bias are such that performance appears to decrease rather than increase with the addition of \(Y\) to the model. In contrast, the \(M_{2-stage}\) estimators are centered on the true values in both matched and unmatched designs.

Table 2 Mean estimates of improvement in prediction performance for measures defined in Table 1. Results are from 5,000 simulations of nested case–control studies (\(n_D\) = 250, \(n_{\bar{D}}\) = 500) with a cohort of 5,000 subjects. Data were generated from the bivariate binormal model described in the text with \(corr(X,Y|D)\) = 0.5. Estimates calculated with \(\widehat{\mathrm{risk}}^{adj}(X,Y)\) are denoted by \(M_{adj}\) and those calculated with \(\widehat{\mathrm{risk}}^{2-stage}(X,Y)\) are denoted by \(M_{2-stage}\). (a) Nonparametric and (b) semiparametric estimates are presented

The relative efficiencies of estimators are considered in Table 3 using ratios of standard deviations, with the standard deviation of the nonparametric \(M_{adj}\) estimator in the unmatched studies as the reference.

Table 3 Efficiency of \(M_{2-stage}\) in matched and unmatched designs relative to the nonparametric \(M_{adj}\) estimator from the unmatched design. Shown are the standard deviations of estimates from the simulation studies divided by the standard deviation of the reference estimator (\(M_{adj}\)-NP; unmatched), so smaller values indicate greater efficiency. NP and SP denote nonparametric and semiparametric estimation, respectively, of the distribution of risk\((X,Y)\) in controls

In the unmatched design, we found that the nonparametric \(M_{2-stage}\) estimator is more efficient than \(M_{adj}\) for estimating \(\Delta \mathrm{HR}^D(0.20), \Delta \mathrm{MRD}\) and \(\mathrm{NRI}(>0)\). Interestingly, \(M_{2-stage}\) performs slightly worse than \(M_{adj}\) for \(\Delta \mathrm{HR}^{\bar{D}}(0.20)\), but has similar performance to \(M_{adj}\) for all other performance measures.

To evaluate the impact of matching on efficiency we consider only \(M_{2-stage}\), because the \(M_{adj}\) estimators are biased. Comparing \(M_{2-stage}\) in matched versus unmatched designs, we see that matching improves the precision with which performance improvement is estimated for most measures. For example, with nonparametric estimation of the ROC related measures, the standard deviations in matched studies are 80–90 % the size of those in unmatched studies.

Interestingly, the improvement gained from matching can often be achieved in unmatched data by using the semiparametric estimator. In fact, for many of the measures, efficiency is improved more by modeling \(P(Y|X,D=0)\) in an unmatched study than by matching controls to cases in the design and using the nonparametric estimator. For example, the standard deviation of the nonparametric estimate of \(\Delta \mathrm{HR}^{\bar{D}}(0.20)\) in matched studies is 74.0 % of the reference, while the semiparametric estimate in unmatched studies has a standard deviation that is 53.4 % of the reference. Some intuition for this result is provided by the fact that semiparametric estimation borrows information across strata of controls. While matching enriches the strata containing many cases with controls, it leaves the strata containing few cases sparse with respect to controls. Therefore, both matched and unmatched data are prone to sparseness of controls in certain strata, and nonparametric estimation suffers in such scenarios. The semiparametric approach is less affected because it borrows information across strata.

4.2 Simulation study 2: renal artery stenosis data

4.2.1 Study description

The kidneys play several major regulatory roles in the human body, including regulation of blood pressure. The renal arteries aid in the proper functioning of the kidneys by supplying them with blood. Narrowing of the renal arteries is a condition termed renal artery stenosis (RAS); it inhibits blood flow to the kidneys and can lead to treatment-resistant hypertension.

The gold standard diagnostic test for RAS is an invasive and expensive procedure called renal angiography. In order to avoid unnecessarily performing angiography on individuals with a low likelihood of having disease, a clinical decision rule was developed to predict RAS based on patient characteristics and thus identify high-risk patients as candidates for the procedure (Krijnen et al. 1998).

We illustrate the proposed methodology using data from a RAS study (Janssens et al. 2005). For 426 patients, information is available on disease diagnosis from angiography, as well as on age (in 10-year units), BMI, gender, recent onset of hypertension, presence of atherosclerotic vascular disease and serum creatinine (SCr) concentration. We model baseline risk using the first five characteristics and estimate the incremental value gained by adding SCr concentration to the model. Age and BMI were mean-centered; SCr concentration was log-transformed and standardized to have mean 0 and standard deviation 1. The study cohort includes 98 cases and 328 controls.

4.2.2 Methods

We simulated nested case–control studies using this dataset. Specifically, we resampled 426 observations with replacement from the cohort, selected all the cases and twice that number of controls, and disregarded SCr concentration data for patients who were not in the selected case–control subset. In one set of analyses the controls were selected, unmatched, as a simple random sample from all controls. In a second set of analyses the controls were selected to match the cases with regard to estimated baseline risk category. In particular, we created a three-level risk category variable, \(W\), defined as: low if \(\widehat{\mathrm{risk}}(X) < 0.10\), medium if \(0.10 < \widehat{\mathrm{risk}}(X) < 0.20\) and high if \(\widehat{\mathrm{risk}}(X) > 0.20\). For the matched control datasets, we selected two controls per case at random without replacement within each baseline risk category. We also evaluated settings with a 1:1 case–control ratio.

4.2.3 Results from renal artery stenosis dataset

Tables 4 and 5 summarize the results of 1,000 nested case–control studies based on the renal artery stenosis dataset. We see that the \(M_{adj}\) estimators are valid only in unmatched case–control studies. Interestingly, the bias in \(M_{adj}\) in matched studies is such that prediction performance appears to deteriorate considerably with the addition of \(Y\) when the IDI, NRI(\(>\!\!0\)) or \(\Delta \mathrm{HR}^D\) performance measures are employed. This is very similar to the results in Table 2 for the simulated bivariate normal distributions. Also as in Table 2, we see that \(M_{2-stage}\) is valid in both matched and unmatched designs.

Table 4 Estimates of improvement in prediction performance from the complete renal artery stenosis dataset and from simulated nested case–control datasets derived from it using a 1:2 case–control ratio. Shown are mean (a) nonparametric and (b) semiparametric estimates. Estimates calculated with \(\widehat{\mathrm{risk}}^{adj}(X,Y)\) are denoted by \(M_{adj}\) and those calculated with \(\widehat{\mathrm{risk}}^{2-stage}(X,Y)\) are denoted by \(M_{2-stage}\). True values are obtained using the original renal artery stenosis dataset of all 426 subjects
Table 5 Efficiency of estimates of improvement in prediction performance in studies simulated from the renal artery stenosis dataset. Shown are standard deviations (SD) and the ratios of the standard deviations relative to that for nonparametric \(M_{adj}\) in the unmatched studies. NP and SP represent nonparametric and semiparametric estimation of the distribution of risk\((X,Y)\) in controls, respectively

Comparing the efficiency of \(M_{2-stage}\) to \(M_{adj}\) in unmatched designs where both are valid, we see trends in the top panel of Table 5 that are similar to those observed in Table 3. For a case–control ratio of 1:1, \(M_{2-stage}\)-NP is more efficient than \(M_{adj}\)-NP, but only for \(\Delta \mathrm{HR}^D, \Delta \mathrm{MRD}\) and NRI(\(>\!\!0\)). For a larger number of controls (case–control ratio = 1:2), \(M_{2-stage}\) loses some of its efficiency advantage. As before, \(M_{2-stage}\) has worse performance than \(M_{adj}\) for the estimation of \(\Delta \mathrm{HR}^{\bar{D}}\), although again, this effect is lessened with the larger case–control ratio of 1:2.

Turning to the main question concerning efficiency due to matching, we again see some trends in the top panel of Table 5 that are similar to observations made for the bivariate binormal simulations in Table 3. Comparing \(M_{2-stage}\)-NP in matched versus unmatched designs, matching appears to improve the efficiency with which \(\Delta \mathrm{HR}^{\bar{D}}\) is estimated. However, \(\Delta \mathrm{HR}^D\) is not affected by matching and estimation of NRI(\(>0\)) may be worse in matched studies. With larger numbers of controls, we see in the bottom panel of Table 5 that there is no gain from matching with regards to efficiency of \(M_{2-stage}\)-NP.

Semiparametric estimation improves efficiency much more than matching does in these simulations. Again, this is consistent with the earlier simulation results.

5 Bootstrap method for inference

Performance improvement estimates obtained from nested case–control data incorporate variability from both the cohort and the nested case–control subset. However, simple bootstrap resampling from observed data cannot be implemented in this setting, as data on \(Y\) are observed only for subjects selected in the original case–control subset. Below we discuss our proposed strategy for bootstrapping with nested case–control data.

5.1 Proposed approach

We propose a parametric bootstrap method that combines resampling observations in the cohort and resampling residuals in the case–control subset (Efron and Tibshirani 1993). To begin, we have the original study cohort for which \(X\) and disease status are available and a nested case–control subsample on which \(Y\) is measured. We first bootstrap a cohort (say, cohort\(^{*}\)) from the original cohort and proceed to generate the matching variable \(W^{*}\) based on quartiles of \(X^{*}\) in the bootstrapped cohort\(^{*}\). A matched or unmatched case–control subsample\(^{*}\) is then constructed in the same fashion as before. However, note that in this bootstrapped case–control subsample\(^{*}\), the only subjects that have \(Y\) data are those who were selected to be in the original case–control subsample. We generate \(Y^{*}\) values for all subjects in the bootstrapped case–control subsample\(^{*}\) using a parametric bootstrap method combined with residual resampling.

Specifically, we use the original case–control subsample to model \(Y|X,D=0\) semiparametrically as in Sect. 3.3,

$$\begin{aligned} Y^{\bar{D}} = \mu (X^{\bar{D}}) + \sigma \epsilon . \end{aligned}$$

Fitting this model on the original case–control subsample gives us estimated values \(\hat{\mu }\), \(\hat{\sigma }\) and residuals \(\hat{\epsilon }_1,\ldots ,\hat{\epsilon }_{n_{\bar{D}}}\). Then, for each control\(^{*}\) in the bootstrapped case–control subsample\(^{*}\), we use that subject’s covariate value, \(X^{*}\), and sample with replacement a residual from among \(\hat{\epsilon }_1,\ldots ,\hat{\epsilon }_{n_{\bar{D}}}\) to generate a \(Y^{*}\) value using \(\hat{\mu }\) and \(\hat{\sigma }\):

$$\begin{aligned} Y^{*}_{i} = \hat{\mu }(X^{*}_{i}) + \hat{\sigma } \hat{\epsilon }_{i}^{*}, \quad i=1,\ldots ,n_{\bar{D}}^{*}. \end{aligned}$$

We fit a separate model for \(Y|X,D=1\) in the original case–control subsample and take a similar approach to generate \(Y^{*}_{1},\ldots ,Y^{*}_{n_{D}^{*}}\) for cases in the bootstrapped case–control subsample\(^{*}\).
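
The following sketch implements one replicate of this bootstrap, assuming a single baseline covariate, a linear mean and a constant scale for \(Y\) given \(X\) (so that adding a resampled raw residual to the fitted mean is equivalent to \(\hat{\mu }+\hat{\sigma }\hat{\epsilon }^{*}\)). For brevity it generates \(Y^{*}\) for every bootstrapped subject; in practice only the case–control subsample\(^{*}\) needs \(Y^{*}\). All names are ours.

```python
import numpy as np
import statsmodels.api as sm

def bootstrap_replicate(D, X, Y, in_cc, rng):
    """One bootstrap replicate for a nested case-control study.

    D, X  : outcome and baseline covariate for the full cohort
    Y     : novel predictor, observed only where in_cc is True
    in_cc : indicator of membership in the original case-control subset
    Returns the resampled cohort (D*, X*) with generated Y* values;
    selection of the case-control subsample* is omitted.
    """
    N = len(D)
    # Stage 1 resampling: draw the bootstrapped cohort* with replacement
    idx = rng.integers(0, N, N)
    D_star, X_star = D[idx], X[idx]
    Y_star = np.empty(N)
    # Fit Y | X separately among original case-control cases and controls,
    # then generate Y* = fitted mean + resampled residual
    for d in (0, 1):
        obs = in_cc & (D == d)
        fit = sm.OLS(Y[obs], sm.add_constant(X[obs])).fit()
        tgt = (D_star == d)
        mu = sm.add_constant(X_star[tgt]) @ fit.params
        Y_star[tgt] = mu + rng.choice(fit.resid, size=tgt.sum(), replace=True)
    return D_star, X_star, Y_star
```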

5.2 Simulation study

We assessed the performance of the proposed bootstrap method with a simulation study using bivariate binormal data generated as in Sect. 4.1.1. We carried out 1,000 simulations, each time generating a new study cohort of size \(N=5{,}000\) and from this study cohort, selecting a nested case–control subsample of size 250 cases and 500 controls. We used both the matched and unmatched designs. Within each simulation, we carried out 200 bootstrap repetitions using the procedure described above. For each performance measure estimate obtained in that simulation, we estimated its standard error as the standard deviation across the 200 bootstrap repetitions and used it to calculate normality-based 95 % confidence intervals. Coverage was averaged over all 1,000 simulations.

Results are presented in Table 6. Not surprisingly, \(M_{adj}\) estimators, which are biased in matched designs, also generate confidence intervals with poor coverage. For all other settings, coverage of the 95 % bootstrap confidence intervals is good.

Table 6 Coverage of normality-based 95 % bootstrap confidence intervals

6 Illustration with renal artery stenosis study

We illustrate our methodology on the renal artery stenosis dataset by simulating a single nested case–control dataset using the unmatched design and a single dataset using the matched design, each with a 1:2 case–control ratio. We include bootstrap standard errors and normality-based 95 % confidence intervals (CIs), obtained from 500 bootstrap repetitions following the approach described in Sect. 5. In contrast with the repeated simulations of Sect. 4.2, here we have a single study cohort and a single two-phase dataset from which we bootstrap.

Results are presented in Table 7. We see that the two-phase estimates are quite different from the full-data estimates. We used only a single two-phase sample here to mimic a real-life two-phase dataset; repeating the sampling 100 times and averaging estimates across repetitions showed that the estimates are unbiased (data not shown). The observed discrepancy is therefore a result of sampling variability. As before, we see that a standard adjusted analysis (\(M_{adj}\)) underestimates performance improvement in a matched design, whereas \(M_{2-stage}\) produces valid estimates. Conclusions regarding the incremental value of SCr concentration are similar using any of the valid estimation methods in this setting. We use estimates from \(M_{2-stage}\) with semiparametric estimation and a matched design in the following discussion.

Table 7 Results from a matched and an unmatched two-phase study simulated from the renal artery stenosis dataset. 95 % bootstrap confidence intervals were obtained from 500 bootstrap repetitions, using \(M_{2-stage}\) and \(M_{adj}\) for estimation of risk\((X,Y)\) and (a) nonparametric and (b) semiparametric estimation of the distribution of risk\((X,Y)\) in controls

The incremental value of SCr concentration appears to be statistically significant using \(\Delta \text{ MRD}\) and NRI as the measures of interest; values of 0 for both measures would indicate no improvement from SCr concentration. \(\widehat{\Delta \text{ MRD}}\) is 0.069 {95 % CI (0.013, 0.124)}, indicating that adding SCr increases the difference in mean risks between cases and controls by approximately 0.069. \(\widehat{\text{ NRI}}\) is 0.547 {95 % CI (0.259, 0.836)}; given that the NRI has a range of (\(-\)2, 2), this represents a moderate improvement in risk reclassification. Small improvements that are not statistically significant are seen using all other measures.

7 Discussion

Matching controls to cases on baseline risk factors is a common practice in epidemiologic studies of risk. It has also become common practice in biomarker research (Pepe et al. 2008). It allows one to evaluate, from simple two-way analyses of \(Y\) and \(D\), whether there is any association between \(Y\) and \(D\), with assurance that any association found is not explained by the matching factors. Matching also allows for efficient estimation of the relative risk associated with \(Y\), controlling for baseline predictors \(X\), in a risk model for \(\mathrm{risk}(X,Y)\). However, the impact of matching on estimates of prediction performance measures has not been explored previously.

We demonstrated the intuitive result that matching invalidates standard estimates of performance improvement. Estimators that simply adjust for population prevalence but not for matching, \(M_{adj}\), substantially underestimated the performance of the risk model \(\mathrm{risk}(X,Y)\) and therefore underestimated the increment in performance gained by adding \(Y\) to the set of baseline predictors \(X\). Intuitively, this underestimation can be attributed to the fact that matching makes the distribution of \(X\) among study controls more similar to that among cases than it is among population controls; the distribution of \(\mathrm{risk}(X,Y)\) among study controls is therefore also more similar to that among cases than it is among population controls.

We derived two-stage estimators that are valid in matched or unmatched nested case–control studies. We were unable to derive analytic expressions for the variances of these estimates. Therefore we investigated efficiency in two simple simulation studies. Our results suggest that the impact of two-stage estimation and of matching varies with the performance measure in question. In our simulations two-stage estimation in unmatched studies had little impact on efficiencies of ROC measures but was advantageous for estimating the reclassification measures NRI(\(>\!\!0\)) and IDI = \(\Delta \)MRD. On the other hand, matching improved efficiency of estimates of ROC related measures but did little to improve estimation of reclassification measures.

Our preferred measures of performance increment are neither ROC measures nor risk reclassification measures. We argue for use of the changes in the high risk proportion of cases, \(\Delta \mathrm{HR}^{D}(r)\), the high risk proportion of controls, \(\Delta \mathrm{HR}^{\bar{D}}(r)\), and the linear combination \(\Delta \text{ B}(r)\). These measures are favored for their practical value in quantifying effects on improved medical decisions (Pepe and Janes 2013).

In our simulations we found that two-stage estimation improved the efficiency of \(\Delta \mathrm{HR}^{D}\) but that matching had little to no further impact. Note that matching affects the two-stage estimator of \(\Delta \mathrm{HR}^{D}\) only through the influence of controls on the estimator of \(\mathrm{risk}(X,Y)\). That is, given estimates of \(\mathrm{risk}(X,Y)\), the empirical estimator of \(\Delta \mathrm{HR}^{D}\) is employed in both matched and unmatched designs because the cases are a simple random sample from the cohort. We conclude that the improvement in estimating \(\mathrm{risk}(X,Y)\) gained with matched data does not carry over to substantially improve estimation of the distribution of \(\mathrm{risk}(X,Y)\) in cases. On the other hand, matching improved estimation of \(\Delta \mathrm{HR}^{\bar{D}}(r)\), at least with the smaller control-to-case ratio.

We implemented a semiparametric method that modeled the distribution of \(Y\) given \(X\) among controls. This had a profound positive influence on the efficiency with which most measures were estimated, especially in unmatched designs. If one is comfortable making the assumptions necessary to model \(Y\) given \(X\) in controls, it seems that little additional efficiency is gained by using a matched design.

We recognize that the simulation scenarios we studied are limited and our conclusions may not apply to other scenarios. There are a number of factors to consider with respect to study design and estimation, and changing any one of these factors could affect results. In fact, we see this happen in our two simulation studies: in the second study, changing the case–control ratio from 1:1 to 1:2 alone lessens the advantage of matching. Moreover, the effect of matching differs across performance measures. More work is needed to derive analytic results that could generalize our observations. In the meantime, our practical suggestion is to use simulation studies based on the application of interest to guide decisions about matching and other aspects of study design. Simulation studies may be based on hypothesized joint distributions for biomarkers, as in our first simulation study (Sect. 4.1). If pilot data are available, one could base simulation studies on them, as we did with the renal artery stenosis data (Sect. 4.2). Such simulations can guide the design of a larger study: one simulates both matched and unmatched nested case–control studies, varying factors related to study design and estimation approach, and investigates which approaches maximize efficiency for the performance improvement measures of interest.

Another consideration in the decision to match is that matching complicates inference. Asymptotic distribution theory is not available for two-stage estimators of performance measures; the difficulty in deriving analytic expressions comes from the multiple sources of variability that must be accounted for, given the complicated analytic approach and study design. Simple bootstrap resampling cannot be implemented in this setting because the nested case–control design implies that \(Y\) is available only for subjects in the original case–control subset. We proposed a parametric bootstrap approach that generates \(Y\) for all bootstrapped case–control subjects using semiparametric models for \(Y\) given \(X\) fit to the original data, and we showed in simulation studies that this method is valid with good coverage. We recommend this approach with the caveat that, near the null, estimates tend to be skewed and inference tends to be problematic for all measures of performance improvement. We and others have noted severe problems with bootstrap methods, and with inference in general, for estimates of performance improvement even in cohort studies and especially with weakly predictive markers (Pepe et al. 2013; Kerr et al. 2011; Vickers et al. 2011). In practice, we recommend simulations similar to those suggested above to determine whether valid inference is possible with the given data and study design or whether the performance improvement is too close to the null. Continued effort is needed to develop methods for inference about performance improvement measures in cohort studies and then to extend them to nested case–control designs.