Introduction

One of the main goals in public health research is to evaluate the disease burden due to a specific exposure. For this purpose, the attributable fraction (AF) is commonly used. The AF was initially defined in the 1950s and has been used extensively since then in epidemiological studies. A thorough historical review of the AF is given by Poole [1]. Originally, the AF was defined for binary outcomes as the proportion of unfavourable outcomes that would have been prevented if the exposure of interest were eliminated from the population [2]. As such, the AF takes both the exposure-outcome association and the exposure prevalence into account, and is specific to the study population. To formalize, let X and Y be the exposure and outcome of interest. In standard counterfactual notation, the AF is defined as

$$\begin{aligned} {\text {AF}}=1-\frac{\text {Pr}(Y_{0}=1)}{\text {Pr}(Y=1)}, \end{aligned}$$
(1)

where \(\text {Pr}(Y=1)\) is the factual outcome prevalence, and \(\text {Pr}(Y_{0}=1)\) is the counterfactual outcome prevalence had the exposure been eliminated (set to 0) for everyone. For instance, if the factual outcome prevalence is 10 % and the counterfactual outcome prevalence is 5 %, then \(1-0.05/0.1=50\,\%\) of all outcomes would have been prevented, had the exposure been eliminated. We note that the exposure does not have to be binary per se, but the definition in Eq. (1) assumes that there is a ‘zero-level’ for the exposure, corresponding to the exposure being completely absent.

Recently, the AF has been extended to time-to-event outcomes [3–5]. Let T be the time-to-event of interest, e.g. time to death. The AF function is then defined as

$$\begin{aligned} {\text {AF}}(t)=1-\frac{\text {Pr}(T_{0}\le t)}{\text {Pr}(\text {T}\le t)}, \end{aligned}$$
(2)

where \(\text {Pr}(\text {T}\le t)\) is the factual probability of an event at or before time t, and \(\text {Pr}(T_{0}\le t)\) is the counterfactual probability of an event at or before time t had the exposure been eliminated for everyone at baseline.

To estimate the AF from observational data, it is important to adjust for confounders of the exposure-outcome association. If a covariate set \(\mathbf{Z }\) is sufficient for confounding control, then the AF can be consistently estimated by adjusting for \(\mathbf{Z }\). For binary outcomes in cross-sectional studies, the confounder-adjusted AF can be estimated with logistic regression [6–9]. For binary outcomes in (possibly matched) case–control studies, the confounder-adjusted AF can also be estimated with logistic regression, under a ‘rare-disease’ assumption [10, 11]. For time-to-event outcomes in cohort studies, the confounder-adjusted AF function can be estimated with Cox proportional hazards regression [4].

Even though the theory for AF estimation is well developed, there is still a lack of up-to-date software implementations. In this article we focus on the open-source statistical software R [12]. To our knowledge there are three earlier packages for AF estimation available on CRAN: epiR [13], attribrisk [14] and paf [15]. These packages all have important limitations. The epiR package uses the function epi.2by2 to estimate the AF for various sampling designs, but does not allow for model-based confounder adjustment. The attribrisk package allows for confounder adjustment through logistic regression, but it relies on the ‘rare-disease’ assumption and is thus essentially restricted to case–control studies. Furthermore, the attribrisk package only provides bootstrap and jackknife standard errors, which can make estimation relatively time-consuming. The paf package estimates the AF function using Cox proportional hazards regression for confounder adjustment. However, it does not handle big data (in our simulations it breaks down for data with around 20,000 observations or more). A common limitation of all three packages is that none of them provides accurate standard errors when data are clustered, e.g. when there are repeated measures on each subject.

The aim of this article is to present a new R package for AF estimation. The new package, AF, allows for confounder-adjusted estimation of the AF for the three major study designs: cross-sectional, (possibly matched) case–control and cohort. It provides analytical standard errors for all estimates, which obviates the need for bootstrapping. When data are clustered, these standard errors are adjusted for the within-cluster correlations. The package is designed to scale well, so that it is able to handle very large datasets (up to several million observations).

The article is organized as follows. In Sects. 2, 3 and 4 we describe how the AF package is used to estimate the AF in cross-sectional, case–control and cohort studies, respectively. Each section is divided into a theoretical part and an applied part. In the theoretical sections we describe how the confounder-adjusted AF is estimated for each specific study design. These sections serve as a brief but self-contained tutorial in AF estimation. In the applied sections we use real data examples to illustrate how the AF package is used. All datasets in these examples are publicly available and included in the AF package, so readers can easily replicate all analyses.

Cross-sectional study and cohort study with binary outcome

Theory

In cross-sectional studies with binary outcomes, the AF is defined as in Eq. (1). The factual outcome prevalence, \(\text {Pr}(Y=1)\), can be estimated as the observed (sample) outcome prevalence. To estimate the counterfactual outcome prevalence, \(\text {Pr}(Y_{0}=1)\), it is usually assumed that a set of observed covariates \(\mathbf{Z }\) is sufficient for confounding control. Under this assumption, \(\text {Pr}(Y_{0}=1)\) can be obtained by averaging the outcome prevalence among the unexposed, at a given value of \(\mathbf{Z }\), over the population distribution of \(\mathbf{Z }\):

$$\begin{aligned} \text {Pr}(Y_{0}=1)=\text {E}\{\text {Pr}(Y=1\mid X=0,\mathbf{Z })\}. \end{aligned}$$

In practice, \(\text {Pr}(Y=1\mid X=0,\mathbf{Z })\) is usually estimated with a logistic regression model

$$\begin{aligned} \text {logit}\{\text {Pr}(Y=1|X,\mathbf{Z })\} = g(X,\mathbf{Z };{\varvec{\beta }}), \end{aligned}$$
(3)

where g() is a specified function indexed by the parameter vector \({\varvec{\beta }}\). For example, g() could be specified as \(\beta _0+\beta _1{X}+\beta _2{\mathbf{Z }}\). However, g() could also involve interactions and higher-order terms. The model in Eq. (3) is fitted to obtain an estimate of \({\varvec{\beta }}\). Then, for each subject i with covariate vector \(\mathbf{Z }_i\), we use \(\text {expit}\{g(X=0,\mathbf{Z }_i;\hat{{\varvec{\beta }}})\}\) as a prediction of \(\text {Pr}(Y=1\mid X=0,\mathbf{Z }_i)\). These predictions are averaged to obtain an estimate of \(\text {Pr}(Y_{0}=1)\):

$$\begin{aligned} \widehat{Pr}(Y_{0}=1)= \frac{1}{n}\sum _{i=1}^n\text {expit}\{g(X=0,\mathbf{Z }_i; \hat{{\varvec{\beta }}})\}. \end{aligned}$$
(4)

The estimates of \(\text {Pr}(Y=1)\) and \(\text {Pr}(Y_{0}=1)\) are plugged into Eq. (1), to produce an estimate of the AF. The standard error for the resulting estimate can be obtained by combining the sandwich formula with the delta method [8, 9].
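To make the procedure concrete, the following minimal sketch applies the plug-in estimator of Eqs. (3), (4) and (1) to simulated data. The variable names Y, X and Z are illustrative, and the sketch mirrors, but is not, the internal code of the AF package (which additionally computes sandwich standard errors).

```r
## Minimal sketch of the plug-in estimator, on simulated data.
set.seed(1)
n <- 1000
Z <- rnorm(n)                          # confounder
X <- rbinom(n, 1, plogis(Z))           # exposure, affected by Z
Y <- rbinom(n, 1, plogis(-1 + X + Z))  # outcome, affected by X and Z
dat <- data.frame(Y, X, Z)

fit  <- glm(Y ~ X + Z, family = binomial, data = dat)          # model in Eq. (3)
dat0 <- transform(dat, X = 0)                                  # exposure set to 0 for everyone
p0   <- mean(predict(fit, newdata = dat0, type = "response"))  # estimate of Pr(Y_0 = 1), Eq. (4)
p    <- mean(dat$Y)                                            # factual prevalence Pr(Y = 1)
1 - p0 / p                                                     # estimated AF, Eq. (1)
```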

We end this section by noting that neither the definition in Eq. (1), nor the estimation procedure described in this section, requires a cross-sectional study design per se. For instance, they are also applicable in cohort studies with time-to-event outcomes, if the outcome is dichotomized as having the event before a fixed time point, e.g. 5 years from baseline. However, when censoring is present, as is often the case in cohort studies, it is more natural to use the time-to-event analysis described in Sect. 4.

Applied example

To illustrate the theory we use a dataset on 487 births among 188 women described in Juul and Frydenberg [16]. For each birth, the following variables are measured: parity (birth), a binary indicator of whether the mother smoked during pregnancy (smoker), race of the mother (race: white, black or other), age of the mother (age), a unique identification number for each mother (id), weight of the mother at last menstrual period in pounds (lwt), birthweight of the newborn child (bwt), and a binary indicator of whether the newborn has low birthweight, defined as a birthweight less than or equal to 2500 g (lbw). These variables are stored in the data frame clslowbwt, which is included in the AF package.

We are interested in the effect of smoking during pregnancy on the risk of low birthweight. We will adjust for age and race, since both of these variables are potential confounders. Initially, we assume the following standard logistic regression model:

$$\begin{aligned}&\text {logit}\{\text {Pr}({\texttt {lbw}} =1 \mid {\texttt {smoker, age, race}})\}\nonumber \\&\quad = \beta _0+\beta _1{\texttt {smoker}} + \beta _2I({\texttt {race=``black''}}) \nonumber \\&\quad \quad +\,\beta _3I({\texttt {race=``other''}})+\beta _4{\texttt {age}}. \end{aligned}$$
(5)

To estimate the AF under this model we can use the AF.cs function in the AF package. This function fits the model ‘internally’, and then outputs the estimated AF. However, for illustrative purposes we first fit the model separately and discuss the output.

In R, the model in Eq. (5) can be fitted with the glm function.

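A sketch of the call, assuming that race is stored as a factor with ‘white’ as the reference level:

```r
library(AF)  # provides the clslowbwt data frame
data(clslowbwt)
fit <- glm(lbw ~ smoker + race + age, family = binomial, data = clslowbwt)
```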

Summarizing the output gives

(Output not shown.)

The output indicates that the odds of low birthweight are about \(e^{0.6}\approx 1.8\) times as high for children born to smokers as for children born to non-smokers. The effect is highly significant, with a p-value equal to 0.00561. To estimate the proportion of low birthweights that would have been prevented if no mother had smoked during pregnancy, we use the AF.cs function.

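A sketch of the call (the argument names are described below; the exposure variable is assumed to be passed as a character string):

```r
AF.cs.est <- AF.cs(formula = lbw ~ smoker + race + age, data = clslowbwt,
                   exposure = "smoker")
summary(AF.cs.est)
```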

Like the glm function, the AF.cs function has a formula argument and a data argument. Since the outcome is by definition binary in Eq. (1), the AF.cs function always uses logistic regression, and thus has no family argument. The name of the exposure variable is specified by the exposure argument. Summarizing the output gives

(Output not shown.)

The output indicates that approximately 17 % of all low birthweights would have been prevented if no mother had smoked during pregnancy. The AF is highly significant, with a p-value equal to 0.0076. However, the 95 % CI is quite wide, ranging from 5 to 29 %. The default CI is untransformed, but the summary function also allows for log- and logit-transformed CIs, which sometimes have more accurate coverage probabilities [17].

There are two problems with the analysis above. First, outcomes for children born to the same mother are correlated, which is not accounted for in the standard errors and p-values. Second, the model in Eq. (5) is perhaps unrealistically simple, since it assumes that the effect of smoking is the same regardless of age and race. Thus, we next consider the following model, which allows for interactions between all predictors:

$$\begin{aligned}&\text {logit}\{\text {Pr}({\texttt {lbw}}=1 \mid {\texttt {smoker, age, race}})\}= \beta _0+\beta _1{\texttt {smoker}} \nonumber \\&\qquad +\, \beta _2I({\texttt {race=``black''}}) + \beta _3I({\texttt {race=``other''}})+\beta _4{\texttt {age}}\nonumber \\&\qquad +\,\beta _5{\texttt {smoker}}\times I({\texttt {race=``black''}}) +\beta _6{\texttt {smoker}}\times I({\texttt {race=``other''}})\nonumber \\&\qquad +\, \beta _7{\texttt {smoker}}\times {\texttt {age}}\nonumber \\&\qquad +\,\beta _8I({\texttt {race=``black''}})\times {\texttt {age}} +\beta _9I({\texttt {race=``other''}})\times {\texttt {age}}. \end{aligned}$$
(6)

The glm function has no facilities for handling clustered data. Thus, we instead fit the model with the gee function from the drgee package.

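A sketch of the call, assuming a logit link and that the cluster variable is passed by name (the clusterid argument is explained below):

```r
library(drgee)
fit.gee <- gee(formula = lbw ~ (smoker + race + age)^2, link = "logit",
               data = clslowbwt, clusterid = "id")
summary(fit.gee)
```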

By typing \({\texttt {(smoker + race + age)}}^{\wedge {\texttt {2}}}\) in the formula, all main effects and two-way interactions between smoker, race and age are automatically constructed. By specifying the clusterid argument, cluster-robust standard errors are calculated. Summarizing the output gives

(Output not shown.)

We observe that the interaction term between smoker and age is highly significant, with a p-value \(<\) 0.001. Thus, it seems that the model in Eq. (5), which only includes main effects, may indeed be too simplistic. On the other hand, the more realistic model in Eq. (6) is harder to interpret and communicate, since the smoking effect in this model is captured by four parameters (one main effect plus three interaction terms). This is not a problem for the AF though, since the AF is always a single number, regardless of the complexity of the underlying model. We therefore re-estimate the AF, now with cluster-robust standard errors.

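A sketch of the call, using the same interaction model and passing the cluster variable through the clusterid argument:

```r
AF.cs.clust <- AF.cs(formula = lbw ~ (smoker + race + age)^2, data = clslowbwt,
                     exposure = "smoker", clusterid = "id")
summary(AF.cs.clust)
```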

The estimated AF is now reduced to 10 %, and is not statistically significant. At first glance, it may appear surprising that the AF is not statistically significant, given that both the main effect of smoker and the interaction between smoker and age are statistically significant in the gee output. There are two reasons for this discrepancy. First, the main effect of smoker is negative, but the smoker-age interaction is positive. This means that for younger women, smoking appears to decrease the risk of low birthweight in the newborn, whereas for older women smoking appears to increase the risk; for white women the ‘switch’ occurs at \(3.8116/0.14763=25.8\) years. Thus, when averaging over all ages, as in the AF, these effects may cancel out. Second, the statistical uncertainty in the AF depends not only on the uncertainty in the model parameters, but also on the sampling variability in the distribution of the confounders. To see this, note that even if we were to replace the estimated parameter \(\hat{{\varvec{\beta }}}\) in Eq. (4) with the true value \({\varvec{\beta }}\), there would still be residual uncertainty in \(\widehat{Pr}(Y_{0}=1)\), due to the sampling variability that arises when the distribution of \({\mathbf {Z}}_i\) is estimated with the empirical distribution.

Case–control study

Theory

Case–control sampling distorts the outcome distribution, so that \(\text {Pr}(Y=1)\) and \(\text {Pr}(Y_{0}=1)\) in Eq. (1) cannot be separately estimated from the data. However, by assuming that the covariate vector \(\mathbf{Z }\) is sufficient for confounding control, and applying Bayes’ theorem [10, 11], the AF can be reformulated as

$$\begin{aligned} {\text {AF}} = 1-\text {E}\{{\text {RR}}(\mathbf{Z })^{-X}|Y=1\} \end{aligned}$$
(7)

where

$$\begin{aligned} {\text {RR}}(\mathbf{Z })=\frac{\text {Pr}(Y=1\mid X=1, \mathbf{Z })}{\text {Pr}(Y=1\mid X=0, \mathbf{Z })} \end{aligned}$$
(8)

is the conditional risk ratio, given \(\mathbf{Z }\). If the outcome is rare, then the risk ratio \({\text {RR}}(\mathbf{Z })\) can be approximated by the conditional odds ratio

$$\begin{aligned} {\text {OR}}(\mathbf{Z })=\frac{\text {Pr}(Y=1\mid X=1, \mathbf{Z })\text {Pr}(Y=0\mid X=0, \mathbf{Z })}{\text {Pr}(Y=0\mid X=1, \mathbf{Z })\text {Pr}(Y=1\mid X=0, \mathbf{Z })}. \end{aligned}$$
(9)

The AF is thus approximately equal to \(1-\text {E}\{{\text {OR}}(\mathbf{Z })^{-X}|Y=1\}\). Estimation of the AF proceeds as follows. First, a logistic regression model is fitted to the data. Then, for each subject i with covariate vector \(\mathbf{Z }_i\), the model is used to estimate \({\text {OR}}(\mathbf{Z }_i)^{-X_i}\). For exposed subjects (those with \(X_i=1\)), \({\text {OR}}(\mathbf{Z }_i)^{-X_i}={\text {OR}}(\mathbf{Z }_i)^{-1}\). For unexposed subjects (those with \(X_i=0\)), \({\text {OR}}(\mathbf{Z }_i)^{-X_i}=1\). The predictions of \({\text {OR}}(\mathbf{Z }_i)^{-X_i}\) are then averaged among the cases (those with \(Y_i=1\)), to produce an estimate of the AF:

$$\begin{aligned} \widehat{{\text {AF}}}=1-\frac{\sum _{i=1}^nY_i{\text {OR}}(\mathbf{Z }_i)^{-X_i}}{\sum _{i=1}^nY_i}. \end{aligned}$$

For instance, if we assume a logistic model without interactions between X and \(\mathbf{Z }\)

$$\begin{aligned} \text {logit}\{\text {Pr}(Y=1|X,\mathbf{Z })\} = \beta _0+\beta _1{X}+\beta _2{\mathbf{Z }}, \end{aligned}$$
(10)

then \({\text {OR}}(\mathbf{Z })=e^{\beta _1}\). It follows that \(\widehat{{\text {AF}}}\) simplifies to

$$\begin{aligned} \widehat{{\text {AF}}} =1-\frac{\sum _{i=1}^nY_ie^{-\hat{\beta }_1X_i}}{\sum _{i=1}^nY_i} =\hat{Pr}(X=1|Y=1)(1-e^{-\hat{\beta }_1}), \end{aligned}$$
(11)

where \(\hat{Pr}(X=1|Y=1)\) is the sample proportion of exposed among the cases. If we assume a more complicated model that allows for interactions between X and \(\mathbf{Z }\)

$$\begin{aligned} \text {logit}\{\text {Pr}(Y=1|X,\mathbf{Z })\} =\beta _0+\beta _1{X}+\beta _2{\mathbf{Z }}+\beta _3{X\mathbf{Z }}, \end{aligned}$$
(12)

then \({\text {OR}}(\mathbf{Z })=e^{\beta _1 + \beta _3{\mathbf{Z }}}\). Under this model, \(\widehat{{\text {AF}}}\) does not simplify as in Eq. (11).

The estimation procedure outlined above applies to matched case–control studies as well, with conditional logistic regression commonly used instead of ordinary logistic regression. The standard error for the resulting estimate can be obtained by combining the sandwich formula with the delta method [8, 9].
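To illustrate the arithmetic in Eq. (11), the following minimal sketch simulates a full cohort with a rare outcome, fits the no-interaction model in Eq. (10) by ordinary logistic regression, and averages \({\text {OR}}(\mathbf{Z })^{-X}\) among the cases. The variable names are illustrative; the AF.cc function described below performs the corresponding computations, with sandwich standard errors, for case–control data.

```r
## Minimal sketch of the estimator in Eq. (11), on simulated data with a rare outcome.
set.seed(1)
n <- 10000
Z <- rnorm(n)
X <- rbinom(n, 1, plogis(Z))                 # exposure, affected by Z
Y <- rbinom(n, 1, plogis(-4 + 0.7 * X + Z))  # rare outcome

fit <- glm(Y ~ X + Z, family = binomial)     # model in Eq. (10)
b1  <- coef(fit)["X"]                        # estimated log odds ratio
1 - mean(exp(-b1 * X[Y == 1]))               # Eq. (11), first form
mean(X[Y == 1]) * (1 - exp(-b1))             # Eq. (11), equivalent second form
```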

Applied example

In a study on causes of oesophageal cancer, cases and controls were collected from hospitals in Singapore during 1970–1972 [18]. Each case was individually matched to 4 controls on sex and age within 5-year intervals. In this article we re-analyse a publicly available subset of these data, consisting of 80 male cases and their 320 matched male controls. De Jong et al. [18] considered various potential risk factors for oesophageal cancer, such as intake of bread, potato, bananas and beverages at burning hot temperatures, smoking, and alcohol intake. In this article we focus on intake of beverages at burning hot temperatures, which De Jong et al. [18] observed to be highly associated with oesophageal cancer. The available variables are the patient’s age (Age), dialect group (Dial: 1 for Hokhien/Teochew and 0 for Cantonese/Other), a binary indicator of whether the patient consumes Samsu wine on a daily basis (Samsu), the number of cigarettes smoked per day (Cigs), a binary indicator of whether the patient drinks beverages at ‘burning hot’ temperatures on a daily basis (Everhotbev), a unique identification number for each matched set (Set), and a binary case–control indicator (Oesophagealcancer). These variables are stored in the data frame singapore, which is included in the AF package. We will adjust for age, dialect group, Samsu wine consumption and the number of cigarettes smoked per day, since these variables are potential confounders. We assume the following conditional logistic regression model:

$$\begin{aligned}&\text {logit}\{\text {Pr}({\texttt {Oesophagealcancer}}\nonumber \\&\quad = 1\mid {\texttt {Everhotbev, Age, Dial, Samsu, Cigs, Set}})\}\nonumber \\&\quad = \beta _{{\texttt {Set}}}+\beta _1{\texttt {Everhotbev}} + \beta _2{\texttt {Age}} \nonumber \\&\quad \quad +\, \beta _3{\texttt {Dial}} + \beta _4{\texttt {Samsu}} + \beta _5{\texttt {Cigs}}, \end{aligned}$$
(13)

where \(\beta _{{\texttt {Set}}}\) is a set-specific intercept. The model in Eq. (13) can be fitted in R with the clogit function from the survival package.

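A sketch of the call, with the matched sets entering through a strata term:

```r
library(survival)
library(AF)  # provides the singapore data frame
data(singapore)
fit.clogit <- clogit(Oesophagealcancer ~ Everhotbev + Age + Dial + Samsu + Cigs +
                       strata(Set), data = singapore)
summary(fit.clogit)
```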

Summarizing the output gives:

(Output not shown.)

The output indicates that the odds of oesophageal cancer are about \(e^{1.16912}\approx 3.22\) times as high for those who drink beverages at burning hot temperatures on a daily basis as for those who do not. The effect is highly significant, with a p-value of 0.000064. To estimate the proportion of cases of oesophageal cancer that would have been prevented if no patient had consumed beverages at burning hot temperatures, we use the AF.cc function.

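A sketch of the call; it is assumed here that the matched sets are supplied through the clusterid argument, and the matched argument is described below:

```r
AF.cc.est <- AF.cc(formula = Oesophagealcancer ~ Everhotbev + Age + Dial +
                     Samsu + Cigs, data = singapore,
                   exposure = "Everhotbev", clusterid = "Set", matched = TRUE)
summary(AF.cc.est)
```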

By setting the argument matched=TRUE, conditional logistic regression is used instead of ordinary logistic regression (the default). Summarizing the output gives:

(Output not shown.)

The output indicates that approximately 34 % of all cases of oesophageal cancer would have been prevented if no patient had consumed beverages at burning hot temperatures. The AF is highly significant, with a p-value close to zero, and a 95 % CI ranging from 23 to 44 %.

Cohort study with time-to-event outcome

Theory

In cohort studies with time-to-event outcomes, the AF function is defined as in Eq. (2) [3]. Equivalently, the AF function can be expressed as

$$\begin{aligned} {\text {AF}}(t)=1-\frac{\{1-\text {S}_0(t)\}}{\{1-\text {S}(t)\}}, \end{aligned}$$
(14)

where \(\text {S}(t)=1-\text {Pr}(\text {T}\le t)\) is the factual survival function, and \(\text {S}_0(t)=1-\text {Pr}(T_0\le t)\) is the counterfactual survival function had the exposure been eliminated for everyone at baseline.

As before, we assume that a set of observed covariates \(\mathbf{Z }\) is sufficient for confounding control. Under this assumption, \(\text {S}_0(t)\) can be obtained by averaging the survival function among the unexposed, at a given value of \(\mathbf{Z }\), over the population distribution of \(\mathbf{Z }\):

$$\begin{aligned} \text {S}_0(t)=\text {E}\{\text {S}(t\mid X=0,\mathbf{Z })\}. \end{aligned}$$

In practice, \(\text {S}(t\mid X,\mathbf{Z })\) is usually estimated with a Cox proportional hazards model

$$\begin{aligned} \lambda (t \mid X, \mathbf{Z })= h(t)e^{g(X, \mathbf{Z }; {\varvec{\beta }})}, \end{aligned}$$
(15)

where \(\lambda (t \mid X, \mathbf{Z })\) is the conditional hazard at time t, given X and \(\mathbf{Z }\), h(t) is the unspecified baseline hazard, and \(g(X, \mathbf{Z }; {\varvec{\beta }})\) is a specified function of the exposure X and confounders \(\mathbf{Z }\) indexed by the parameter vector \({\varvec{\beta }}\). For example, \(g(X, \mathbf{Z }; {\varvec{\beta }})\) can be specified as \(g(X, \mathbf{Z }; {\varvec{\beta }})=\beta _1X+\beta _2\mathbf{Z }\). However, g() could also involve interactions and higher-order terms.

The model in Eq. (15) is fitted to obtain the partial likelihood estimate of \({\varvec{\beta }}\) [19] and the Breslow estimate of the cumulative baseline hazard function \(\varLambda (t)=\int _{u=0}^th(u)du\) [20]. Then, for each fixed value of t we proceed as follows. For each subject i with exposure level \(X_i\) and covariate vector \(\mathbf{Z }_i\), we use \(e^{-e^{g(X_i,\mathbf{Z }_i;\hat{{\varvec{\beta }}})}\hat{\varLambda }(t)}\) as a prediction of \(\text {S}(t|X_i,\mathbf{Z }_i)\). These predictions are averaged to obtain an estimate of \(\text {S}(t)\):

$$\begin{aligned} \hat{\text {S}}(t) = \frac{1}{n}\sum ^n_{i=1} e^{-e^{g(X_i,\mathbf{Z }_i;\hat{{\varvec{\beta }}})}\hat{\varLambda }(t)}. \end{aligned}$$
(16)

Similarly, for each subject i with covariate vector \(\mathbf{Z }_i\), we use \(e^{-e^{g(X=0,\mathbf{Z }_i;\hat{{\varvec{\beta }}})}\hat{\varLambda }(t)}\) as a prediction of \(\text {S}(t|X=0,\mathbf{Z }_i)\). These predictions are averaged to obtain an estimate of \(\text {S}_0(t)\):

$$\begin{aligned} \hat{\text {S}}_0(t) = \frac{1}{n}\sum ^n_{i=1} e^{-e^{g(X=0,\mathbf{Z }_i;\hat{{\varvec{\beta }}})}\hat{\varLambda }(t)}. \end{aligned}$$
(17)

The estimates of \(\text {S}(t)\) and \(\text {S}_0(t)\) are plugged into Eq. (14) to produce an estimate of the AF function. The standard error of the resulting estimate can be obtained by combining the sandwich formula with the delta method [5].
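The following minimal sketch carries out Eqs. (16), (17) and (14) at a single time point, using the survival package and simulated data. The variable names are illustrative, and the sketch mirrors, but is not, the internal code of the AF.ch function described below (which also provides sandwich standard errors).

```r
## Minimal sketch of Eqs. (16), (17) and (14) at a single time point t0, on simulated data.
library(survival)
set.seed(1)
n <- 1000
Z <- rnorm(n)
X <- rbinom(n, 1, plogis(Z))                      # exposure, affected by Z
U <- rexp(n, rate = exp(-2 + 0.5 * X + 0.5 * Z))  # true event time
C <- rexp(n, rate = 0.05)                         # censoring time
dat <- data.frame(time = pmin(U, C), event = as.numeric(U <= C), X, Z)

fit <- coxph(Surv(time, event) ~ X + Z, data = dat, ties = "breslow")  # model in Eq. (15)
H   <- basehaz(fit, centered = FALSE)     # Breslow estimate of the cumulative baseline hazard
t0  <- 5                                  # time point of interest
Ht  <- max(c(0, H$hazard[H$time <= t0]))  # cumulative baseline hazard at t0

b   <- coef(fit)
lp  <- b["X"] * dat$X + b["Z"] * dat$Z    # g(X_i, Z_i; beta-hat)
lp0 <- b["Z"] * dat$Z                     # same, with the exposure set to zero
S   <- mean(exp(-exp(lp)  * Ht))          # Eq. (16)
S0  <- mean(exp(-exp(lp0) * Ht))          # Eq. (17)
1 - (1 - S0) / (1 - S)                    # Eq. (14): estimated AF(t0)
```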

Applied example

To illustrate the theory we use records from 2982 women with primary breast cancer from the Rotterdam tumor bank in the Netherlands. The Rotterdam breast cancer dataset is thoroughly described in Sauerbrei et al. [21] and Royston and Lambert [22]. The follow-up time ranges from 1 to 231 months. The outcome variables are the time, measured in months, that the patient is under study (rf), and an indicator of whether the patient experienced death or relapse before censoring (rfi). Seven prognostic variables are recorded: age at surgery (age), menopausal status (meno: 0 = pre and 1 = post), tumor size in three classes (size: ‘\(\le \)20 mm’, ‘\(>\)20–50 mm’ and ‘\(>\)50 mm’), tumor grade (grade: 2 or 3), progesterone receptors (pr: fmol/l), oestrogen receptors (er: fmol/l) and the number of positive lymph nodes (nodes: ranging from 0 to 34). In our example, we consider absence of chemotherapy as the exposure, i.e. we wish to estimate the proportion of deaths that would have been prevented before a given time, if all patients had been given chemotherapy at baseline. Absence of chemotherapy is measured by the variable no.chemo, with levels 0 for ‘yes’ and 1 for ‘no’. These variables are stored in the data frame rott2, which is included in the AF package.

In the analysis we will adjust for the seven prognostic variables listed above, since these are potential confounders. We assume the following Cox proportional hazards model:

$$\begin{aligned}&\lambda (t \mid {\texttt {no.chemo, age, meno, size, grade, nodes, pr, er}})\nonumber \\&\quad = h(t)e^{ \beta _1{\texttt {no.chemo}} + \beta _2{\texttt {age}} + \beta _3{\texttt {meno}} + \beta _4{\texttt {size}} + \beta _5{\texttt {grade}} + \beta _6e^{-0.12{\texttt {nodes}}}+\beta _7{\texttt {pr}}+\beta _8{\texttt {er}}}.\nonumber \\ \end{aligned}$$
(18)

In this model we used the transformation \(e^{-0.12{\texttt {nodes}}}\), since Sauerbrei et al. [21] noted that this transformation gave a better model fit. The model in Eq. (18) can be fitted in R with the coxph function from the survival package.

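A sketch of the call, assuming the variable names in rott2 described above:

```r
library(survival)
library(AF)  # provides the rott2 data frame
data(rott2)
fit.coxph <- coxph(Surv(rf, rfi) ~ no.chemo + age + meno + size + grade +
                     exp(-0.12 * nodes) + pr + er, data = rott2)
summary(fit.coxph)
```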

Summarizing the output gives:

(Output not shown.)

The output indicates that the hazard of death or relapse is about 1.33 times as high among those who did not receive chemotherapy as among those who did.

The AF function can be estimated with the AF.ch function.

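A sketch of the call, with the AF requested at 12, 24, 36, 48 and 60 months (argument names as described below):

```r
AFest <- AF.ch(formula = Surv(rf, rfi) ~ no.chemo + age + meno + size + grade +
                 exp(-0.12 * nodes) + pr + er,
               data = rott2, exposure = "no.chemo",
               times = c(12, 24, 36, 48, 60))
summary(AFest)
```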

The formula argument has the same format as the formula argument in the coxph function. The times argument specifies the time points at which the AF is to be estimated. If not specified, times defaults to all observed event times. In the call to AF.ch above we have asked for the AF at years 1 to 5 (12 to 60 months). Summarizing the output gives:

(Output not shown.)

The output indicates that the AF function decreases over the 5-year period, from 18 % at 1 year (12 months) after baseline to 14 % at 5 years (60 months) after baseline. The AF is statistically significant at all time points.

A convenient way to visualize the AF function is to use the plot function in the AF package.

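A sketch of the call:

```r
plot(AFest)
```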

where AFest is an object of class AF, estimated by the AF.ch function. This function call produces the plot in Fig. 1.

Fig. 1 Estimated AF function over 5 years for relapse-free survival time in the Rotterdam dataset (solid line), together with point-wise 95 % confidence intervals (dashed lines)

Discussion

In this article we have presented the new R package AF, developed for epidemiologists and biostatisticians. The package AF estimates the confounder-adjusted AF for cross-sectional studies, case–control studies (matched and unmatched) and cohort studies with time-to-event outcomes, using the functions AF.cs, AF.cc and AF.ch, respectively. We have used three datasets (clslowbwt, singapore and rott2) for illustration. These datasets are all publicly available and included in the AF package. Thus, readers of the paper can easily replicate all analyses that we have presented.

In order for the estimated AF to have a causal interpretation, it is necessary that the covariates that are adjusted for are sufficient for confounding control. In practice, important confounders might be unknown and/or unmeasured, which implies that the estimated AF should always be interpreted cautiously. For instance, in our applied example in Sect. 4.2, comorbidity is a potential confounder for the association between chemotherapy and relapse-free survival, since patients with severe comorbidities are less likely to be prescribed chemotherapy, due to the health risks of this invasive treatment, and are also more likely to die early during follow-up. Thus, if comorbidities are not adjusted for, then the protective effect of chemotherapy may be overestimated. As a consequence, the AF may be overestimated as well.

The main advantage of the AF package, as compared to the other R packages for AF estimation (epiR, attribrisk and paf), is that it offers a uniform way to estimate the confounder-adjusted AF for all three major study designs. The package has a standard input/output interface, which makes it easy to use for practitioners who have some familiarity with R. Another important advantage is that the AF package provides analytic standard errors, based on the delta method and the sandwich formula, which removes the need for time-consuming bootstrap or jackknife methods. Finally, the AF package produces correct standard errors when data are clustered, e.g. when there are repeated measures on each individual or when the dataset contains related individuals.

The AF package covers the most fundamental estimation strategies developed for the most common study designs. Possible extensions include so-called ‘partial attributable fractions’ [11, 23, 24], which allow for multiple exposures, and so-called ‘generalized impact fractions’ [25–27], which allow for continuous exposures. Another possible extension is to allow for more advanced models for time-to-event outcomes, such as flexible parametric models [22]. We plan to make these extensions in the future.