Introduction

The International Agency for Research on Cancer (IARC) and US Surgeon General (US-SG) now both recognize at least 12 cancers for which active smoking is a causal factor [1, 2]. The grouping of cancers by level of evidence and strength of association with smoking is shown in Table 1. More recently, cancers with modest causal associations have been identified (e.g., nasal cavities and sinuses, squamous cell carcinoma of the cervix, renal cell carcinoma, and stomach cancer; Table 1). Only endometrial cancer in postmenopausal women clearly has a reduced rate among smokers. Null associations have consistently been found for some common cancers (category 5 in Table 1). But there remain a number of cancers for which there is uncertainty about any causal association with tobacco smoking (category 4 in Table 1). The substantive focus of this paper is on these latter ‘uncertain’ category 4 cancers.

Table 1 Association of current smoking with cancers, and date of first pronouncement of any causal relationship by either the International Agency for Research on Cancer (IARC) or US Surgeon General (US-SG)

Increasingly, epidemiological studies use large databases, often assembled through record linkage. Cancer registry studies linked to mortality data are often of this type. Record linkage beyond routine health data sets can enrich the data and extend possible analyses. In this paper, we use cancer registry data linked to national census data that happen to include smoking questions [3]. The periodic inclusion of a smoking question in New Zealand censuses (1976, 1981, 1996, 2006, and 2013) and subsequent linkage with health data allows rich epidemiological analyses (e.g., mortality and prevalence analyses we have previously published [4–8]). Linkage of 5 years of cancer registry data back to each of the 1981 and 1996 censuses creates some of the largest cohort studies of smoking and cancer incidence, enabling a precise (i.e., narrow confidence interval) determination of relative risks of multiple cancers with smoking. But, in common with many registry and linkage studies, exposure assessment is not ideal (i.e., two simple questions on smoking status only) and data on some key potential confounders are missing (e.g., neither the NZ census nor cancer registry collects data on obesity or alcohol consumption). It is, however, possible to undertake quantitative bias analyses (QBA), in this case to adjust for possible misclassification bias of smoking and residual confounding by obesity and alcohol. (It is also possible that smoking contributes to lower BMI, and thus, adjusting for BMI is a form of overcontrol for confounding. However, for the purposes of this paper, we assume that obesity is predominantly a confounder in that there are common causes of both smoking and increased BMI, or, expressed in directed acyclic graph terminology, BMI is on a backdoor path from smoking to cancer incidence.) Such QBA methods have existed in the epidemiological repertoire since 1959 [9], but are surprisingly infrequently used. Lash, Greenland, and colleagues have recently strongly argued for their wider application in epidemiology [10–12], noting that refusing to undertake QBA is tantamount to saying there is no remaining systematic error in one’s study. Thus, a second purpose of this paper is to demonstrate relatively simple QBA methods as applied to linked cancer registry data.

Brief overview of smoking–cancer associations

The association of breast cancer with smoking has been debated for some time, with studies producing inconsistent findings [1, 2, 13–16]. For instance, while a major meta-analysis concluded there was little or no effect of smoking [13], two recent large cohort studies have found a modest positive (increased risk) association of smoking with breast cancer [15, 16]. The latest IARC report (2012) identifies a consistent but weak positive association (RRs 1.1–1.3) [17].

The 2012 IARC report also states that current data support a causal relationship between smoking and colorectal cancer, although this may be limited to specific subtypes of colorectal cancer [17]. However, recent meta-analyses have reported weak positive associations with RRs of 1.07 (0.99–1.16) [18] and 1.17 (0.97–1.40) [19].

The association of smoking with leukemias and lymphomas is difficult to determine because they are heterogeneous groups of diseases, but there is some (inconsistent) evidence suggesting that the risk of follicular lymphoma may be increased [1, 20, 21]. IARC and the US-SG recognize a causal association of smoking with myeloid leukemias [1, 2, 22–24]. However, a meta-analysis of the IARC data for myeloid leukemia did not find a statistically significant increase in risk [25].

Regarding ovarian cancer, three recent meta-analyses reported RRs of about two or more for mucinous ovarian tumors (10–20 % of all ovarian cancer cases) [26–28]. A causal association between smoking and mucinous, but not other types of, ovarian cancer has been identified by IARC [17].

A causal association of smoking with liver cancer is recognized by IARC [1], but not yet by the US-SG [2]. The strong correlation of smoking with other causes of liver cancer makes it difficult to confidently rule out residual confounding. Liver cancer is usually caused by chronic infection with hepatitis B or C viruses (HBV/HCV); HBV is particularly important in New Zealand among Māori and Pacific people [29]. Alcohol and other risk factors also make a contribution. A meta-analysis that carefully considered the confounding influence of alcohol, HBV, and HCV found RRs of 1.4–1.5 for liver cancer for current versus never smokers [30].

Melanoma is thought to have either no association or a possible negative association with smoking [1, 31, 32]. There is also some evidence of a negative association for thyroid cancer [33].

Methods

Data set

The 1981 and 1996 New Zealand censuses included questions on smoking. These two censuses and New Zealand Cancer Registry (NZCR) records for 1981–1986 and 1996–2001, respectively, were anonymously and probabilistically linked using geocodes, sex, date of birth, ethnicity, and country of birth (further details of methods are available elsewhere: www.uow.otago.ac.nz/cancertrends-info.html and [34]). The proportion of people with cancer who were successfully linked to a census record was 73 % for the 1981–1986 cohort and 80 % for the 1996–2001 cohort; accordingly, we calculated inverse probability of linkage weights for strata based on age, sex, ethnicity, residential mobility of the neighborhood, time since census, small area deprivation, cancer diagnosis, and region of the country. Assuming that there is negligible linkage variation by smoking status within strata of these covariates, these weights protect against linkage bias in final-weighted analyses.
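The weighting step described above can be sketched as follows. This is a minimal illustration only: it computes an inverse probability of linkage weight per stratum as the number of registry cases divided by the number successfully linked. The function name, stratum labels, and toy counts are assumptions made for illustration and are not the strata or software actually used in this study.

```python
from collections import defaultdict

def linkage_weights(cases):
    """Return {stratum: weight} where weight = 1 / P(linked | stratum).

    `cases` is an iterable of (stratum, linked) pairs, one per registry
    cancer case; `linked` is True if the case was matched to a census record.
    Strata with no linked cases receive no weight (nothing to up-weight).
    """
    total = defaultdict(int)
    linked = defaultdict(int)
    for stratum, is_linked in cases:
        total[stratum] += 1
        linked[stratum] += int(is_linked)
    return {s: total[s] / linked[s] for s in total if linked[s] > 0}

# Hypothetical strata: labels and counts are made up for illustration.
toy_cases = (
    [("men 45-64, 1981 cohort", True)] * 73
    + [("men 45-64, 1981 cohort", False)] * 27
    + [("men 45-64, 1996 cohort", True)] * 80
    + [("men 45-64, 1996 cohort", False)] * 20
)
weights = linkage_weights(toy_cases)
```

Each linked case in a stratum is then counted `weights[stratum]` times, so that linked cases stand in for the unlinked cases with the same covariate pattern.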

Observations were included for all person-years of follow-up for current and never smokers aged 25 years and older. (Ex-smokers were excluded from the main analyses; however, results are available in Web Table 1 and Web Table 2.)

Variables

The 1981 census included the following question “Tick the box which best describes your current cigarette smoking,” with three options: Never smoked cigarettes at all, or never smoked them regularly; do not smoke cigarettes now, but used to smoke them regularly (1 or more per day); and currently smoke cigarettes regularly (1 or more per day). The 1996 census used a two-step approach: (1) “Do you smoke cigarettes regularly (that is, one or more per day)?” (Count only tobacco cigarettes. Don’t count pipes, cigars, or cigarillos); and for those answering “no,” (2) “Have you ever been a regular smoker of one or more cigarettes per day?”

Socio-demographic census data collected included age, sex, ethnicity, marital status, and multiple measures of socioeconomic position, including household income; highest educational qualification; household car access and housing tenure (both markers of asset wealth, and the latter also access to resources); and neighborhood deprivation (calculated for neighborhoods of about 100 people, using nine census variables [35]).

The outcome was a first cancer registered with the NZCR during the follow-up periods, excluding basal and squamous cell carcinomas of the skin and other in situ cancers. Cancers prior to 2000 were forward mapped from ICD-9 to ICD-10 codes. Site-specific analyses were conducted for adult cancer groupings (see Table 2 for groupings and ICD-10 codes). It was not possible to reliably identify the morphological types such as mucinous ovarian cancer.

Table 2 Number of cancer events and age- and ethnicity-standardized incidence rate ratios (SRRs) for current versus never smokers aged 25+ years by sex and by cohort
Table 3 Poisson regression incidence rate ratios (RR) for current versus never smokers aged 25+ years, and ‘best’ RR allowing or adjusting for systematic error, sexes, and cohorts combined (numbers of cases and person-years for current and never smokers included in regressions are shown in Web Table 4)

Cohort analysis

Census respondents developing cancer were censored at the date of diagnosis. Unfortunately, for privacy and logistical reasons, it was not possible to censor for death among people not developing a cancer, or for emigration out of New Zealand (estimated at less than 0.02 % per year; www.stats.govt.nz).

Age- and ethnicity-standardized (WHO world standard, further disaggregated by NZ ethnic distribution) incidence rates and rate ratios of cancer according to smoking status were calculated separately by sex and cohort. Three Poisson regression analyses were conducted for each cancer, for sexes and cohorts combined. First, a baseline regression model was run on the full data set, adjusting for sex, age, ethnicity, and cohort. Second, a model adjusting for the same variables was run on data restricted to those respondents with complete data on all socio-demographic factors. (About 26 % of respondents had missing data on one or more of the socioeconomic factors, usually household income which was classified as missing if any adult in the household had missing data due to either refusal or being away from their usual residence on census night.) Comparing the second to baseline models allowed an assessment of selection bias, or at least that selection bias due to missing socioeconomic data. Third, a Poisson regression model adjusting additionally for socioeconomic factors was run on this restricted data set. Analyses by ethnic grouping were conducted for liver cancer only, due to the marked variation in the causal role of hepatitis B virus by ethnicity [29].
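As a hedged illustration of the direct standardization step, the sketch below computes a standardized rate as a weighted mean of stratum-specific rates and then takes the ratio for current versus never smokers. The stratum labels, standard weights, and counts are hypothetical. The study's actual standard (WHO world population further disaggregated by NZ ethnic distribution) is not reproduced here.

```python
def standardized_rate(events, person_years, std_weights):
    """Directly standardized rate: weighted mean of stratum-specific rates."""
    total_weight = sum(std_weights.values())
    weighted_sum = sum(
        std_weights[k] * events[k] / person_years[k] for k in std_weights
    )
    return weighted_sum / total_weight

def standardized_rate_ratio(exposed, unexposed, std_weights):
    """SRR = standardized rate in exposed / standardized rate in unexposed.

    Each group is an (events, person_years) pair of dicts keyed by stratum.
    """
    return (standardized_rate(*exposed, std_weights)
            / standardized_rate(*unexposed, std_weights))

# Hypothetical two-stratum example (all numbers are illustrative).
std = {"45-64": 0.6, "65-84": 0.4}
current = ({"45-64": 30, "65-84": 40}, {"45-64": 1000.0, "65-84": 1000.0})
never = ({"45-64": 10, "65-84": 20}, {"45-64": 1000.0, "65-84": 1000.0})
srr = standardized_rate_ratio(current, never, std)
```

Because both groups are weighted to the same standard, differences in age or ethnic composition between current and never smokers do not contribute to the ratio.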

We selected ‘best’ rate ratio (RR) estimates from regression models 1 and 3 as follows. First, if there was negligible confounding by socioeconomic factors (which we conservatively defined as <3 % shift in the RR from model 2 to 3), then the best estimate was that from model 1. Second, if there was some confounding (i.e., ≥3 % shift in RR from model 2 to 3) and negligible selection bias (<3 % shift in RR from model 1 to 2), then the best estimate was that from model 3. Third, the remaining cancers had some evidence of both selection bias and confounding, and no direct ‘best’ estimate was available from our regression models. However, assuming that the observed amount of confounding moving from models 2 to 3 would have applied to the model 1 estimate if we actually had socioeconomic data on all census respondents, we were able to indirectly estimate what the ‘true’ RR estimate might have been. For example, if confounder adjustment resulted in a 15 % reduction in the RR from models 2 to 3, then our ‘derived’ estimate would be 85 % of the model 1 RR. (A worked example of these estimations is given in Box 1.)
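The selection rule above can be expressed as a short decision function. This is a sketch under one stated assumption: the ‘shift’ between two models is taken to be the absolute relative change in the RR, which is our reading of the text and not a confirmed detail of the analysis. The function name and the example values are hypothetical.

```python
def best_rr(rr_model1, rr_model2, rr_model3, threshold=0.03):
    """Pick the 'best' RR from the three regression models.

    Returns (estimate, source), where source is 'model 1', 'model 3',
    or 'derived'. Shifts are absolute relative changes (an assumption).
    """
    selection_shift = abs(rr_model2 / rr_model1 - 1)    # model 1 -> model 2
    confounding_shift = abs(rr_model3 / rr_model2 - 1)  # model 2 -> model 3

    if confounding_shift < threshold:
        # Negligible socioeconomic confounding: use the full-data estimate.
        return rr_model1, "model 1"
    if selection_shift < threshold:
        # Confounding present, negligible selection bias: use adjusted estimate.
        return rr_model3, "model 3"
    # Both present: apply the model 2 -> 3 ratio to the model 1 estimate.
    return rr_model1 * rr_model3 / rr_model2, "derived"

# Hypothetical example: a 15 % reduction from model 2 to model 3 combined
# with a selection shift, so the derived estimate is 85 % of the model 1 RR.
estimate, source = best_rr(2.00, 2.30, 1.955)
```

The derived branch assumes the proportional amount of confounding seen in the restricted data would also have applied to the full data set.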

Box 1 Demonstration of how the ‘best’ RR estimate was calculated for cancer of the larynx, ear, and nasosinus (as in Table 3)

Quantitative bias analysis: exposure misclassification and residual confounding

The census smoking variable is a crude measure, meaning our results were prone to exposure misclassification bias. We undertook quantitative bias analyses [10, 11] in purpose-built Excel spreadsheets using Monte Carlo simulation with Ersatz software (www.epigear.com). Details are provided in Web Appendices 1 and 2; briefly, a misclassification matrix of actual by self-reported smoking status was generated based on Heller et al. [36] and other studies and then modeled as a Dirichlet distribution applied to a subset of the crude data (45–64 and 65–84 year olds of European/other ethnicity from the 1996–2001 cohort, to avoid confounding by age or ethnicity) to generate corrected risk ratios. There was a near perfect linear association of the log of these adjusted RR estimates with the log of the crude RRs (Fig. 1), and we used this association to adjust all smoking–cancer RRs (Table 3).
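To illustrate the general idea of a probabilistic misclassification correction, the following sketch simplifies the problem to two categories (current and never smokers) and uses Python's standard library in place of Excel and Ersatz. The Dirichlet parameters, counts, and function names are hypothetical assumptions. The actual misclassification matrix and bias model are those described in Web Appendices 1 and 2, not this simplified version.

```python
import random

def dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def true_split(rep_current, rep_never, p_cc, p_nn):
    """Back-calculate true (current, never) totals from reported totals.

    p_cc = P(report current | truly current)
    p_nn = P(report never   | truly never)
    """
    total = rep_current + rep_never
    true_current = (rep_current - (1 - p_nn) * total) / (p_cc + p_nn - 1)
    return true_current, total - true_current

def corrected_rr(cases, person_years, p_cc, p_nn):
    """Rate ratio after correcting cases and person-years (non-differential)."""
    cases_c, cases_n = true_split(*cases, p_cc, p_nn)
    py_c, py_n = true_split(*person_years, p_cc, p_nn)
    return (cases_c / py_c) / (cases_n / py_n)

def simulate(cases, person_years, alpha_current, alpha_never, n=2000, seed=1):
    """Monte Carlo: return (2.5th percentile, median, 97.5th percentile)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # Each alpha is ordered [report current, report never].
        p_cc = dirichlet(alpha_current, rng)[0]
        p_nn = dirichlet(alpha_never, rng)[1]
        draws.append(corrected_rr(cases, person_years, p_cc, p_nn))
    draws.sort()
    return draws[int(0.025 * n)], draws[n // 2], draws[int(0.975 * n)]

# Hypothetical reported data: (current, never) cases and person-years.
reported_cases = (185, 115)
reported_py = (9500.0, 10500.0)
lo, med, hi = simulate(reported_cases, reported_py, (90, 10), (5, 95))
```

With non-differential misclassification of this kind, the corrected rate ratio lies further from the null than the crude one, which is the pattern reported for the stronger associations in this study.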

Fig. 1

Plot of log of smoking misclassification-adjusted risk ratio by log of crude risk ratio, for 36 sex by age (45–64 and 65–84 year olds; European/other only) by cancer strata from the 1996–2001 cohort. Lung cancer among 65–84-year-old males excluded as outlier. See Web Appendix 1 for further details

The census does not include variables on potential confounders such as obesity and alcohol. We undertook bias analyses for cancers with known associations with either obesity or alcohol (detail in Web Appendix 1). Briefly, we used: (1) distributions of obesity and total alcohol consumption by smoking status among restricted age groups for European/other from the 2002/03 New Zealand Health Survey [37]; (2) the relative risk associations of obesity [38] and alcohol [39–42] with various cancers; and (3) uncertainty modeled with Dirichlet (prevalence data) and log normal distributions (relative risks).
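One commonly used way to combine inputs of this kind is the simple external adjustment formula for a single binary unmeasured confounder, sketched below without the Monte Carlo uncertainty step. This is an illustrative assumption about the form of the calculation, not a reproduction of Web Appendix 1. The function names and all input values are hypothetical.

```python
def confounding_bias_factor(p_exposed, p_unexposed, rr_confounder):
    """Bias in the observed RR from one binary unmeasured confounder.

    p_exposed / p_unexposed: confounder prevalence among current / never smokers.
    rr_confounder: confounder-cancer relative risk (assumed equal in both groups).
    """
    return ((1 + p_exposed * (rr_confounder - 1))
            / (1 + p_unexposed * (rr_confounder - 1)))

def externally_adjusted_rr(rr_observed, p_exposed, p_unexposed, rr_confounder):
    """Observed RR divided by the bias factor."""
    return rr_observed / confounding_bias_factor(
        p_exposed, p_unexposed, rr_confounder
    )

# Hypothetical inputs: obesity prevalence 15 % in current and 25 % in never
# smokers, and an obesity-cancer RR of 2.0. Because the confounder is less
# common among smokers, the bias factor is below 1 and the adjusted RR is
# higher than the observed RR.
adjusted = externally_adjusted_rr(0.67, 0.15, 0.25, 2.0)
```

When the confounder is equally common in both groups the bias factor is 1 and the estimate is unchanged.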

Results

Smoking status was missing for 1.7 and 4.7 % of the 1981 and 1996 cohorts, respectively, giving 14.8 million person-years in total. Current smokers accounted for 36 % of person-years (41 % of the 1981–1986 cohort, and 31 % of the 1996–2001 cohort).

Table 2 shows the number of cancers for current and never smokers, and the age and ethnicity-standardized incidence rate ratios (SRRs). The SRRs were greater than 2.0, as expected, for larynx/ear/nasosinus and lung cancers.

Table 3 shows the RRs from Poisson regression models. Shifts in the RRs from model 1 to model 2 reflect selection bias arising from using the restricted data set of only respondents with complete data on all socio-demographic factors. Selection bias was notable for gallbladder and bile duct cancer (20 % increase in RR) and larynx/ear/nasosinus (16 % increase), and for liver (14 % increase; Māori/Pacific/Asian only) and lung/trachea/bronchi cancers (10 % increase).

Shifts in RRs from model 2 to model 3 (additionally adjusting for socioeconomic factors) reflect confounding. Strong confounding was apparent for melanoma (a 12 % increase in RR from 0.65 to 0.73), consistent with melanoma being more common among higher socioeconomic groups who are less likely to smoke. Rate ratios for breast cancer, follicular lymphoma, and prostate cancer also increased modestly after adjustment. Modest 5–9 % reductions in the rate ratios after adjustment were evident for cervical, larynx/ear/nasosinus, and liver cancers.

The third to last column of Table 3 presents ‘best’ empirical RR estimates from either model 1 or model 3 and ‘derived’ estimates for seven cancer groupings that had evidence of both confounding and selection biases (see “Methods” for details).

We also further adjusted for misclassification bias of the smoking exposure (details in Web Appendix 1). Briefly, we found that this adjustment had a negligible impact for our best estimate relative risks up to about 1.25, but for higher RRs, our best estimates increasingly underestimated the exposure misclassification-adjusted estimates. Notably, the association of the crude and exposure misclassification-adjusted risk ratios for 36 sex by age by cancer strata from the 1996–2001 cohort was near perfectly linear on the log–log scale (Fig. 1). We used the equation for this fitted line, and the finding that the uncertainty interval on the log scale was 8 % wider than the random error confidence interval, to calculate exposure misclassification-adjusted relative risks and uncertainty intervals (UI) for all cancers as shown in the second to last column of Table 3. The exposure misclassification-adjusted RRs were >10 % higher than the original estimates for bladder, esophagus, and oropharynx cancer and >20 % higher for lung and larynx/ear/nasosinus cancer.
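The log–log adjustment can be sketched as a least-squares line fitted to log(adjusted RR) against log(crude RR), followed by a widening of the interval on the log scale. The fitted coefficients from Fig. 1 are not given in the text, so the data points below are hypothetical. Centering the widened interval on the adjusted point estimate is likewise an assumption about how the 8 % widening was applied.

```python
import math

def fit_loglog(crude_rrs, adjusted_rrs):
    """Ordinary least squares of log(adjusted) on log(crude).

    Returns (intercept, slope).
    """
    x = [math.log(v) for v in crude_rrs]
    y = [math.log(v) for v in adjusted_rrs]
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def adjust_rr(rr, ci_low, ci_high, intercept, slope, widen=1.08):
    """Misclassification-adjusted RR with a widened uncertainty interval.

    The interval half-width on the log scale is taken from the original
    confidence interval and multiplied by `widen` (1.08 = 8 % wider).
    """
    log_point = intercept + slope * math.log(rr)
    half_width = widen * (math.log(ci_high) - math.log(ci_low)) / 2
    return (math.exp(log_point),
            math.exp(log_point - half_width),
            math.exp(log_point + half_width))

# Hypothetical calibration points lying exactly on a line with slope 1.2.
crude = [1.0, 2.0, 4.0]
adjusted = [1.0, 2.0 ** 1.2, 4.0 ** 1.2]
intercept, slope = fit_loglog(crude, adjusted)
point, ui_low, ui_high = adjust_rr(2.0, 1.0, 4.0, intercept, slope)
```

A slope above 1 with an intercept near 0 reproduces the pattern described above: RRs close to the null are almost unchanged, while larger RRs are adjusted upward by a progressively larger proportion.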

Finally, we also explored likely residual confounding by alcohol and obesity for the 1996–2001 cohort for selected cancers (details in Web Appendix 1). The impact of residual confounding was usually negligible, with underestimation of associations due to confounding by obesity (less common among smokers) often partly offset by overestimation due to confounding by alcohol (higher among smokers). Net effects of this confounding are depicted in the final column of Table 3, suggesting that our best estimate exposure misclassification-adjusted RRs still underestimate the RR for endometrial, kidney, and esophageal cancer by up to 0.20. Conversely, we likely overestimate the oropharynx association by as much as 0.4. But with the exception of endometrial cancer, these cancers still have moderate to strong RR associations with smoking.

Substantively, the following patterns are evident in Table 3. Lung and larynx/ear/nasosinus cancers had markedly increased (RR > 6) incidence among current smokers. Bladder and esophageal cancers had a clearly more than doubled incidence, and five cancers had a 20 % to twofold increased rate among current smokers with uncertainty intervals clearly excluding the null: cervix, kidney, oropharynx, pancreas, and stomach. Conversely, four cancers had at least a 20 % lower rate among current smokers with uncertainty intervals excluding the null: endometrial (empirical best estimates of 0.67; 95 % UI 0.56–0.79, and for 50+ years 0.60; 0.50–0.73; but due to residual confounding by obesity, the true RRs might be about 0.08 higher), CML (0.66; 0.44–0.99), melanoma (0.72; 0.65–0.81), and thyroid cancer (0.76; 0.58–1.00).

Eleven cancers had best RR estimates within 15 % of the null and with uncertainty intervals including the null: brain, breast, rectum, gallbladder and bile duct, Hodgkin’s, AML, CLL (and non-AML combined), myeloma, NHL (and follicular lymphoma), ovary, and testicular.

Colon cancer had a best empirical estimate RR of 0.89, with the uncertainty interval just excluding the null (0.81–0.98). However, our confounding bias analyses suggested that we were still slightly underestimating the RR, possibly to the extent that a corrected uncertainty interval would include the null.

The liver RRs varied markedly by ethnic grouping, with a strong 1.75 (1.37–2.24) association among European/other (probably slightly overestimated due to residual confounding by alcohol) but a null association among Māori/Pacific/Asian combined (0.80; 0.53–1.20).

Model 3 regressions were rerun to test for any significant interactions by sex and by cohort (Web Table 3). By sex, there was evidence of a stronger protective association of smoking for melanoma among males (RR 0.62; 0.56–0.69) compared to females (RR 0.84; 0.76–0.93). This stronger protective effect for males was evident in both the 1981–1986 and 1996–2001 cohorts (SRRs in Table 2). There was some suggestion that if there is any protective association for leukemia and its subtypes, it is only present in females. By cohort, there was a tendency for RRs to be further away from the null (be it less than or greater than 1.0) in the 1996–2001 cohort compared to the 1981–1986 cohort (Web Table 3).

Discussion

This study largely agrees with IARC or US-SG reports regarding associations of smoking with various cancers [1, 2]. Figure 2 plots the best empirical estimates from this study (i.e., second to last column of Table 3, plus additional adjustments for likely confounding by alcohol and obesity for kidney, liver (European/other only), esophageal, and oropharyngeal cancers) and recent meta-analyses (where they exist). The concordance is good. Considering category 1 and 2 cancers (positive causal association with smoking, Table 1), our estimates all have the lower 95 % confidence or uncertainty limits greater than 1.0. Our study also finds a strong protective association for endometrial cancer consistent with a previous meta-analysis [43]. We also find a protective association for thyroid cancer (RR 0.76; 95 % UI 0.58–1.00) adding weight to Mack et al.’s [33] pooled analysis that found an OR of 0.6 (0.6–0.7). One possible mechanism for both of these protective relationships is the anti-estrogenic effects of smoking [44].

Fig. 2

Forest plot of incidence rate ratios from: the current CancerTrends study (solid black error bars, with ‘multiplication sign’ central estimate); IARC data meta-analysis [25] (red bars with vertical dash central estimate); and miscellaneous meta-analyses (red bars with ‘circle’ central estimate; see text and footnotes to Table 1 for references). The best estimates from this study are those in the second to last column of Table 3 (i.e., misclassification bias adjusted), with the following five modifications for likely residual confounding: addition of 0.08 to the endometrial cancer 25+ RR giving 0.75 (0.63–0.89); addition of 0.08 to the kidney cancer RR giving 1.37 (1.13–1.65); addition of 0.15 to the esophageal cancer RR giving 2.29 (1.85–2.83); subtraction of 0.25 from the oropharynx RR giving 2.05 (1.73–2.43); subtraction of 0.05 from the European/other liver cancer RR giving 1.70 (1.33–2.17). AML acute myeloid leukemia, CML chronic myeloid leukemia, CLL chronic lymphocytic leukemia, NHL non-Hodgkin’s lymphoma. (Color figure online)

The category 4 “causality uncertain” cancers are best interpreted after a discussion of this study’s main strengths and limitations: measurement error of the smoking exposure, and potential residual confounding.

The census smoking question that classified smoking status for our study is simple and does not capture intensity or duration. There were also likely to be some current smokers (inadvertently) reporting themselves as never smokers (or ex-smokers and hence excluded from our study), and vice versa some never smokers reporting themselves as current smokers. However, a key strength of our study is the inclusion of QBA which (assuming a valid bias model and parameters as detailed in Web Appendices 1 and 2) suggests negligible impact of exposure misclassification for most cancers and allows us to derive exposure misclassification-adjusted estimates for all cancers (Table 3). While we have adjusted for bias from misclassification of the simple smoking question, correcting for further measurement error against a construct such as ‘pack-years’ was not possible; we assume that such correction would further strengthen the association for cancers already linked with smoking while having minimal effect on findings for other cancers. That said, we believe that misclassification bias is unlikely to be a major limitation for assessing ‘uncertain’ or marginal smoking–cancer associations in this study given the close agreement between our misclassification-corrected RR for cancer of the lung, trachea, and bronchus and that of Gandini et al.’s [25] meta-analysis of IARC data (Fig. 2), and the observation that misclassification of smoking status has little impact on RRs close to 1.0 (Fig. 1).

Our exploratory confounder bias adjustments provide considerable reassurance that residual confounding by alcohol and obesity is unlikely to be important, with the exceptions that our study underestimates the association of endometrial, kidney, and esophageal cancer with smoking and overestimates the association of liver (European/other only) and oropharyngeal cancer. But none of these five instances alters conclusions: smoking will still be protective for endometrial cancer and increase risks for the other four.

There was a tendency in our study for rate ratios to be further from the null in 1996–2001 than 1981–1986 (Table 2 and Web Table 3). This may be consistent with the 1996–2001 results reflecting a more mature smoking epidemic (especially for a long time-lag process such as cancer etiology) and less passive smoking risk in the never smokers, resulting in stronger rate ratios.

Returning to the substantive interpretation of our study findings for the category 4 “causality uncertain” cancers, one remaining source of bias deserves attention. The apparent negative association between smoking and prostate cancer may reflect more health-conscious never smokers being more likely to undergo prostate-specific antigen testing, resulting in higher apparent incidence among this group, rather than a genuine protective effect. In contrast, it seems unlikely that the observed protective association of smoking for melanoma is purely due to bias. A protective association for melanoma with RRs of 0.5–0.8 has been previously reported (although often with wide confidence intervals including the null) [1, 31, 32, 45]. Curiously, the one other study to report results by sex also found a strong protective association among males, consistent with our study (Web Table 3). It is possible that the apparent protective effect is due to confounding by sunlight exposure, but this would require a strong association between low sunlight exposure and smoking within strata of socioeconomic factors that we have already adjusted for. Such an association seems unlikely, and we know of no evidence for this in the NZ population. Thus, we would recommend recategorizing melanoma as “Causality likely: protective association.” It is possible that because smoking has an immunosuppressive effect [46], it protects melanocytes from UV radiation-induced inflammatory damage [32].

Smoking is recognized as a causal factor for myeloid leukemia by both IARC [1] and the US-SG [2], and the latter limits this to acute disease (AML). However, we did not find smoking to be significantly associated with AML, consistent with a recent meta-analysis for all myeloid leukemia (RR 1.09; CI 0.70–1.70) [47]. The apparently protective association for CML (best empirical estimate 0.66; 95 % UI 0.44–0.99) found in our study is inconsistent with the majority of international research, which suggests a null or possibly positive association [23, 24]. Ours may well have been a chance finding (consistent with an upper uncertainty limit of 0.99 and measurement of multiple associations).

Regarding liver cancer, our study discloses an important interaction of smoking with ethnicity (Table 3). It seems likely that among Māori/Pacific/Asian where hepatitis B is involved in the etiology of over 80 % of cases (compared to 6 % or less among European/other) [29], smoking is not a necessary causal cofactor with hepatitis B. However, our finding of an approximately 70 % increased risk among European/other smokers is consistent with the RR of 1.56 (95 % CI 1.29–1.87) from the meta-analysis of IARC data [47].

Finally, our study finds no compelling evidence of a link between smoking and breast, rectal, colon, gallbladder and bile duct, myeloma, brain, Hodgkin’s, NHL, ovary and testicular cancers, and follicular lymphoma. Regarding colon, rectal, and breast cancers, there is some evidence that cancer risk increases with duration of smoking [10, 17, 19]. However, an increased risk with increased duration of smoking is also the case for other cancers for which we do detect moderate or strong associations, and most people initiate smoking by age 20. Our study cannot separately detect long-run from short-run causation, but it does seem unlikely that our study would produce null findings for colorectal and breast cancer, but not for other cancers, if this hypothesis were true.

Conclusion

Our large study with measured and unmeasured confounder adjustment, and smoking misclassification adjustment, confirms the strong association of smoking with many cancers. However, our study also strengthens the evidence for protective associations of smoking for thyroid cancer and melanoma. We find no compelling association of smoking with breast and colorectal cancer. Contrary to recent statements, we find no association of smoking with AML. Our study also demonstrates how QBA can be applied to registry data and to strengthening causal inference in analyses on such data.