Introduction

Breast cancer is the most frequently diagnosed cancer among women in China, accounting for 15% of all new cancer cases, and is the leading cause of cancer death in women younger than 45 years [1]. Over the last decades, genome-wide association studies (GWAS) have linked 3800 SNPs to 427 diseases and traits; only 7% of these SNPs are located in protein-coding regions, whereas 93% are located in non-coding regions [2,3,4].

LincRNAs are non-coding transcripts longer than 200 nucleotides that originate from the regions between protein-coding genes, and they have been described as the largest subclass of the human non-coding transcriptome [5]. LincRNAs are emerging as key regulators of diverse biological and cellular processes through mechanisms such as transcriptional and post-transcriptional processing [6,7,8,9,10,11]. Gupta et al. found that HOTAIR could reprogram chromatin state to promote breast cancer metastasis [12]. Studies from Subhrangsu S. Mandal’s laboratory showed that HOTAIR is crucial for the viability of breast cancer cells and that its expression is transcriptionally regulated in an estradiol (E2)-dependent manner via coordination of estrogen receptors (ER) and ER coregulators [13, 14]. Huarte et al. identified lincRNA-p21 as a repressor in p53-dependent transcriptional responses; inhibition of lincRNA-p21 affected the expression of hundreds of gene targets enriched for genes normally repressed by p53 [15]. Recently, an increasing number of studies have suggested that genetic variation in lincRNAs may confer susceptibility to cancer by causing aberrant expression and dysregulation in a cell-type- or tissue-specific manner. Bayram et al. first found that the CC genotype of the HOTAIR rs920778 polymorphism might play an important role in genetic susceptibility to breast cancer development and aggressiveness in a Turkish population [16]. Peng et al. reported that tag SNPs (rs3200401, rs619586) of lncRNA MALAT1 were associated with susceptibility to breast cancer via alteration of the mRNA expression level [17]. MALAT1 overexpression predicted poor survival in tamoxifen-treated ER-positive breast cancer patients and might serve as a potential biomarker to predict endocrine treatment sensitivity [18]. Genetic variants in the INK4 locus serve as a hotspot for various diseases [13, 19]. Royds et al. reported that rs11515 was associated with aggressive breast tumors with increased ANRIL and reduced p16 (INK4a) expression [20]. Riaz et al. found that tag SNP rs2107425 was significantly associated with shorter metastasis-free survival, but not through altering H19 mRNA expression [21]. However, most of these studies included only a few SNPs of the candidate lincRNA, and several lincRNAs, such as lincRNA-p21, HULC, and HOTTIP, have not been studied for their association with breast cancer susceptibility.

Systematically identifying such loci involved in breast cancer will be important in understanding breast cancer pathogenesis and in establishing new targets for cancer screening, prevention, and therapy. Therefore, we performed a case-control study to analyze the association between lincRNA SNPs and breast cancer risk.

Materials and methods

Ethics statement

The research protocol was approved by the Ethics Committee of Tianjin Medical University Cancer Institute and Hospital, and written informed consent was obtained from all subjects who participated in the study.

Study subjects

We conducted a two-stage case-control study in a combined dataset of 6101 individuals (2881 breast cancer cases and 3220 controls), who were randomly assigned to stage I and stage II at a ratio of 1:1. The sample size was calculated with the QUANTO software, assuming a minor allele frequency (MAF) ≥ 0.1 and the following parameters: α = 0.05 (two-sided test), 1 − β = 80%, OR ≥ 1.2. All patients had newly diagnosed, histologically confirmed breast cancer and were recruited between 2006 and 2009 at Tianjin Medical University Cancer Institute and Hospital. Controls (genetically unrelated women without a personal history of malignant tumors) were enrolled from the nearby community during the same period and were frequency matched to the cases by age (± 5 years). A detailed description of the cases and controls can be found in our previous study [22]. Participants were excluded if they had a personal history of cancer or a history of blood transfusion within the 6 months before they donated blood. Male patients were also excluded. In stage I, 1492 patients and 1677 controls were randomly selected for SNP screening. In stage II, the candidate SNPs identified in the first stage were validated and evaluated; the validation set consisted of 1389 cases and 1543 controls.
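The sample size parameters above can be illustrated with a rough calculation. The sketch below is not the QUANTO procedure; it assumes a simple allele-based comparison of two proportions (normal approximation) and a per-allele odds ratio, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def alleles_per_group(control_freq: float, odds_ratio: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Alleles needed per group for a two-sided two-proportion z-test."""
    p0 = control_freq
    p1 = case_allele_freq(control_freq, odds_ratio)
    p_bar = (p0 + p1) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))) ** 2
    return ceil(numerator / (p1 - p0) ** 2)


# Parameters taken from the text: MAF = 0.1, OR = 1.2, alpha = 0.05, power = 80%.
n_alleles = alleles_per_group(control_freq=0.10, odds_ratio=1.2)
n_subjects = ceil(n_alleles / 2)  # each subject contributes two alleles
print(f"alleles per group: {n_alleles}, subjects per group: {n_subjects}")
```

Because the genetic model and test used by QUANTO may differ from this approximation, the numbers printed here should be read only as an order-of-magnitude check.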

SNP selection

We searched PubMed in 2012 using “lincRNA” as the keyword and selected well-known, functionally characterized, cancer-related lincRNAs; the search returned 40 papers covering 7 lincRNAs (HOTAIR, lincRNA-p21, MALAT1, ANRIL, H19, and HOTTIP). The dbSNP database (https://www.ncbi.nlm.nih.gov/projects/SNP/) was used to select SNPs according to the following criteria: frequency data available for Chinese or Asian populations and a minor allele frequency (MAF) ≥ 0.1. We then performed linkage disequilibrium analysis, and a total of 17 independent SNPs (r² < 0.5) in 6 lincRNAs were selected for this study.
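The r² filter can be sketched as follows. The text does not say which tool or algorithm was used for the linkage disequilibrium analysis, so this example assumes a greedy pruning pass and approximates r² by the squared Pearson correlation of genotype dosages (0, 1, or 2 copies of the minor allele). The SNP names and genotypes are toy data.

```python
def dosage_r2(x: list[int], y: list[int]) -> float:
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_x * var_y)


def prune(genotypes: dict[str, list[int]], threshold: float = 0.5) -> list[str]:
    """Keep a SNP only if its r2 with every SNP already kept is below threshold."""
    kept: list[str] = []
    for snp, dosages in genotypes.items():
        if all(dosage_r2(dosages, genotypes[k]) < threshold for k in kept):
            kept.append(snp)
    return kept


# Hypothetical dosages for eight subjects; snp_b is nearly identical to snp_a.
toy = {
    "snp_a": [0, 1, 2, 0, 1, 2, 0, 1],
    "snp_b": [0, 1, 2, 0, 1, 2, 0, 2],
    "snp_c": [1, 0, 1, 2, 2, 0, 0, 1],
}
selected = prune(toy)
print(selected)
```

A greedy pass depends on the order in which SNPs are visited, so a real analysis would normally fix that order by a stated rule, for example by prior evidence or MAF.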

SNP genotyping

Genomic DNA was extracted from peripheral blood using the QIAGEN DNA Extraction Kit (QIAGEN Inc.). Genotyping was performed on the TaqMan platform. Primers and probes were supplied by Applied Biosystems. The 5-µl reaction mixture contained 20 ng genomic DNA, 2.5 µl 2 × TaqMan Genotyping Master Mix, 0.1 µl probe, and 1.7 µl double-distilled water. The PCR conditions were as follows: 50 °C for 2 min, 95 °C for 10 min, and 60 °C for 1 min for 40 cycles. Amplifications were conducted in 384-well plates, and allelic discrimination was performed using SDS 2.4 software (Applied Biosystems, Foster City, CA, USA). Negative controls (water) and positive controls with known genotypes were included to ensure accuracy. After 2 rounds of genotyping, the success rate was 99%. Approximately 5% of the samples were randomly selected for replication, and the final concordance rate for quality control was 100%.

mRNA expression profiling

Total RNA was extracted and purified from 228 liquid nitrogen-frozen breast cancer tissues with known H19 genotypes (CC, CA, or AA) using TRIzol reagent (Invitrogen, USA) according to the manufacturer’s protocol. M-MLV Reverse Transcriptase (Applied Biosystems, USA) was used for reverse transcription. Quantitative real-time PCR was performed on an ABI Prism 7900HT Fast Real-time PCR System (Applied Biosystems, USA). The probe for H19 was Hs00262142_g1 (TaqMan® Gene Expression Assay). GAPDH was used as an endogenous control. All samples were run in triplicate. After quality control (CV < 15%, Ct(GAPDH) ≤ 28), 213 breast cancer tissues were included in the expression analysis. The relative expression of H19 was calculated using the 2^−ΔΔCt method.
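The relative quantification step can be illustrated with a short sketch of the 2^−ΔΔCt calculation. The Ct values and the choice of calibrator are hypothetical, and the quality control function assumes that the CV is computed on the triplicate Ct values of the target, which the text does not specify.

```python
from statistics import mean, stdev


def delta_ct(target_cts: list[float], reference_cts: list[float]) -> float:
    """dCt = mean Ct(target) - mean Ct(reference) across technical replicates."""
    return mean(target_cts) - mean(reference_cts)


def relative_expression(sample_dct: float, calibrator_dct: float) -> float:
    """Fold change 2^(-ddCt), where ddCt = dCt(sample) - dCt(calibrator)."""
    return 2.0 ** -(sample_dct - calibrator_dct)


def passes_qc(target_cts: list[float], reference_cts: list[float],
              max_cv: float = 0.15, max_reference_ct: float = 28.0) -> bool:
    """Assumed QC rule: replicate CV below 15% and reference Ct at most 28."""
    cv = stdev(target_cts) / mean(target_cts)
    return cv < max_cv and mean(reference_cts) <= max_reference_ct


# Hypothetical triplicate Ct values for H19 (target) and GAPDH (reference).
sample_dct = delta_ct([27.0, 27.2, 26.8], [20.0, 20.1, 19.9])      # about 7.0
calibrator_dct = delta_ct([25.0, 25.1, 24.9], [20.0, 20.0, 20.0])  # about 5.0
fold = relative_expression(sample_dct, calibrator_dct)             # about 0.25
print(f"relative H19 expression: {fold:.3f}")
```

A ΔΔCt of 2 cycles corresponds to a four-fold lower expression than the calibrator, which is why the example yields roughly 0.25.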

Statistical analysis

The Hardy–Weinberg equilibrium was assessed in controls with a χ² goodness-of-fit test. The distributions of SNP genotypes, demographic variables, and other risk factors in breast cancer cases and controls were compared with the χ² test. The association between the SNPs and breast cancer risk was evaluated by unconditional logistic regression, estimating odds ratios (ORs) and 95% confidence intervals (CIs) with and without adjustment for age, smoking status, menopausal status, oral contraceptive use, history of benign breast disease, and family history of cancer. We further analyzed the association of rs2071095 genotypes with demographic, lifestyle, and clinical characteristics by unconditional logistic regression. H19 relative expression across genotypes was compared using one-way ANOVA. All statistical tests were two-sided, and P < 0.05 was considered significant; correction for multiple comparisons was not performed. We used SPSS 20.0 (SPSS Inc., Chicago, IL, USA) for all statistical analyses, the ggplot2 and forestplot packages in R 3.4.2 for Figs. 1, 2, and 3, and GraphPad Prism 5.0 for Fig. 4.
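Two of these tests can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not the SPSS analysis: it shows a 1-df χ² goodness-of-fit test for Hardy–Weinberg equilibrium and an unadjusted odds ratio with a Wald confidence interval from a 2 × 2 table. The adjusted ORs in this study come from multivariable logistic regression, which is not reproduced here, and all counts are hypothetical.

```python
from math import erfc, exp, log, sqrt
from statistics import NormalDist


def hwe_chi2(n_aa: int, n_ab: int, n_bb: int) -> tuple[float, float]:
    """Chi-square goodness-of-fit test (1 df) against Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the first allele
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = erfc(sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
    return chi2, p_value


def odds_ratio_ci(case_exposed: int, case_ref: int,
                  control_exposed: int, control_ref: int,
                  level: float = 0.95) -> tuple[float, float, float]:
    """Unadjusted odds ratio with a Wald confidence interval on the log scale."""
    or_ = (case_exposed * control_ref) / (case_ref * control_exposed)
    se = sqrt(1 / case_exposed + 1 / case_ref
              + 1 / control_exposed + 1 / control_ref)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)


# Hypothetical control genotype counts that sit exactly at HWE (p = 0.7).
chi2, p_hwe = hwe_chi2(490, 420, 90)

# Hypothetical 2 x 2 table: variant genotype vs reference genotype.
or_, lower, upper = odds_ratio_ci(case_exposed=100, case_ref=200,
                                  control_exposed=120, control_ref=200)
print(f"HWE chi2 = {chi2:.3f}, P = {p_hwe:.3f}")
print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The Wald interval is the same form of interval that logistic regression software reports for a single binary predictor, which makes it a convenient check on unadjusted estimates.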

Fig. 1

Logistic regression analysis of associations between rs2071095 and the risk of breast cancer stratified by demographic characteristics (stage I and stage II). ᵃ Two-sided χ² test for the difference in frequency distribution of the three genotypes (CC, CA, and AA) between cases and controls; the difference in frequency distribution of the grouped genotypes (CC, CA + AA) between cases and controls is not shown; P < 0.05 was considered statistically significant. ᵇ Adjusted for age, menopausal status, oral contraceptive usage, smoking status, history of benign breast disease, and family history of cancer. ᶜ Interaction analysis between demographic variables and rs2071095. ᵈ First- and second-degree relatives with a history of cancer

Fig. 2

Logistic regression analysis of associations between rs2071095 and the risk of breast cancer stratified by demographic characteristics (stage I). ᵃ Two-sided χ² test for the difference in frequency distribution of the three genotypes (CC, CA, and AA) between cases and controls; the difference in frequency distribution of the grouped genotypes (CC, CA + AA) between cases and controls is not shown; P < 0.05 was considered statistically significant. ᵇ Adjusted for age, menopausal status, oral contraceptive usage, smoking status, history of benign breast disease, and family history of cancer. ᶜ Interaction analysis between demographic variables and rs2071095. ᵈ First- and second-degree relatives with a history of cancer

Fig. 3

Logistic regression analysis of associations between rs2071095 and the risk of breast cancer stratified by demographic characteristics (stage II). ᵃ Two-sided χ² test for the difference in frequency distribution of the three genotypes (CC, CA, and AA) between cases and controls; the difference in frequency distribution of the grouped genotypes (CC, CA + AA) between cases and controls is not shown; P < 0.05 was considered statistically significant. ᵇ Adjusted for age, menopausal status, oral contraceptive usage, smoking status, history of benign breast disease, and family history of cancer. ᶜ Interaction analysis between demographic variables and rs2071095. ᵈ First- and second-degree relatives with a history of cancer

Fig. 4

Genotype at rs2071095 affects H19 expression level. H19 levels were determined by qRT-PCR in breast cancer tissues from subjects with the AA (n = 18), CA (n = 87), or CC (n = 108) genotype at rs2071095. Results are shown as means ± s.e.m. relative to GAPDH levels, and all P values are from two-sided t tests. *P < 0.05, ***P < 0.001

Results

A total of 2881 patients with pathologically confirmed breast cancer and 3220 tumor-free controls were included in stages I and II. Their demographic characteristics are summarized in Table 1. Age was well matched between cases and controls (P = 0.808). Cases and controls differed significantly in smoking status (P < 0.001), oral contraceptive usage (P < 0.001), history of benign breast disease (P < 0.001), and family history of cancer (P < 0.001). We further analyzed the effects of the rs2071095 genotypes on the risk of breast cancer among subgroups defined by demographic characteristics (Figs. 1, 2, 3). For the AA genotype, a decreased risk of breast cancer was evident in the subgroups of younger women (age ≤ 55 years; stage I + II: adjusted OR 0.86, 95% CI 0.68–0.98; stage I: adjusted OR 0.63, 95% CI 0.45–0.87), non-smoking women (stage I + II: adjusted OR 0.88, 95% CI 0.74–0.98; stage I: adjusted OR 0.63, 95% CI 0.48–0.84), premenopausal women (stage I + II: adjusted OR 0.76, 95% CI 0.57–0.96; stage I: adjusted OR 0.57, 95% CI 0.39–0.84), women who had not used oral contraceptives (stage I: adjusted OR 0.65, 95% CI 0.48–0.87), women without a history of benign breast disease (stage I + II: adjusted OR 0.83, 95% CI 0.67–0.98; stage I: adjusted OR 0.65, 95% CI 0.48–0.86), and women without a family history of cancer (stage I + II: adjusted OR 0.82, 95% CI 0.65–0.97; stage I: adjusted OR 0.56, 95% CI 0.41–0.77). None of the stratified analyses in stage II was significant. The interactions between genotype and demographic characteristics were not significant in stage I, stage II, or stage I + II.

Table 1 Demographic characteristics of breast cancer cases and controls (Stage I and II)

The genotype and allele frequencies of the 17 candidate SNPs in stage I and their associations with breast cancer risk are shown in Table 2. The genotype distributions in controls were consistent with Hardy–Weinberg equilibrium. The significantly associated SNPs were rs2071095, located in H19 at 11p15.5, and rs2151280, located in ANRIL at 9p21.3.

Table 2 Logistic regression analysis of associations between lincRNAs’ SNPs and the risk of breast cancer (Stage I)

The 2 SNPs identified in stage I were tested for replication in an independent set. Neither association was significant at P < 0.05 in stage II. When we combined the data from stage I and stage II, the rs2071095 AA and CA + AA genotypes were associated with a significantly decreased breast cancer risk compared with the CC genotype (OR 0.82, 95% CI 0.68–0.98; OR 0.87, 95% CI 0.79–0.97, respectively). These associations remained significant after adjustment for age, menopausal status, smoking status, history of benign breast disease, and family history of cancer (AA vs CC: OR 0.83, 95% CI 0.69–0.99; CA + AA vs CC: OR 0.88, 95% CI 0.80–0.98). We further assessed the associations between rs2071095 and breast cancer risk stratified by ER and PR status. The association remained significant for ER-positive breast cancer (OR 0.83, 95% CI 0.74–0.94, adjusted OR 0.86, 95% CI 0.76–0.98, P = 0.002), but not for ER-negative breast cancer (OR 0.93, 95% CI 0.81–1.07, adjusted OR 0.94, 95% CI 0.81–1.09, P = 0.332), although the P value for heterogeneity was not significant. For PR status, the CA + AA genotypes were significantly associated with both PR-positive and PR-negative breast cancer in the unadjusted analysis (PR+: OR 0.87, 95% CI 0.76–0.99; PR−: OR 0.87, 95% CI 0.77–0.99), although the adjusted ORs were not significant (Table 3).
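The comparison of the ER-positive and ER-negative estimates can be illustrated with a simple two-estimate heterogeneity test. The text does not state how the P value for heterogeneity was obtained, so the sketch below assumes a z-test on the difference of the log ORs, with standard errors recovered from the reported unadjusted 95% CIs. It treats the two estimates as independent, which is only approximate here because both subtypes are compared against the same controls.

```python
from math import log, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)


def log_or_and_se(odds_ratio: float, lower: float, upper: float) -> tuple[float, float]:
    """Log odds ratio and its standard error recovered from a 95% CI."""
    return log(odds_ratio), (log(upper) - log(lower)) / (2.0 * Z95)


def heterogeneity_p(est_a: tuple[float, float], est_b: tuple[float, float]) -> float:
    """Two-sided P value for the difference between two log odds ratios."""
    (beta_a, se_a), (beta_b, se_b) = est_a, est_b
    z = (beta_a - beta_b) / sqrt(se_a ** 2 + se_b ** 2)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Unadjusted estimates reported in the text.
er_positive = log_or_and_se(0.83, 0.74, 0.94)
er_negative = log_or_and_se(0.93, 0.81, 1.07)
p_het = heterogeneity_p(er_positive, er_negative)
print(f"P for heterogeneity (approximate): {p_het:.3f}")
```

Under these assumptions the difference between the two subtype-specific estimates does not reach P < 0.05, in line with the non-significant heterogeneity reported above.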

Table 3 Logistic regression analysis of associations between rs2071095 and the risk of breast cancer (stage I and stage II)

We further explored the H19 expression by rs2071095 genotypes among 213 breast cancer tissues. As shown in Fig. 4, rs2071095 CA and AA genotypes were found to be significantly associated with decreased expression of H19 in breast cancer patients compared with the CC genotype (P = 0.027 and P < 0.001, respectively).

Discussion

Through a two-stage case-control study including 2881 breast cancer cases and 3220 controls, we identified a significant association with breast cancer risk for rs2071095 (A allele) in H19, which was significantly associated with decreased expression of H19 in breast cancer patients, indicating a novel susceptibility locus for breast cancer.

The H19 gene, located at the human chromosome 11p15.5 locus, is one of the first discovered paternally imprinted and maternally expressed genes [23]. It generates a highly conserved 2.3 kb capped, spliced, and polyadenylated non-coding RNA that is located mainly in the cytoplasm [24,25,26]. H19 RNA is abundantly expressed during embryogenesis, plays a pivotal role in embryonic development, and is downregulated postnatally [27]. H19 can function as a primary miR-675 precursor, which suggests that H19 expression could post-transcriptionally downregulate specific mRNAs [28]. H19 might also act as a competitive endogenous RNA [29]. H19 overexpression could result in the activation of genes involved in angiogenesis, cell viability, and proliferation, triggering malignancies such as breast, lung, gastric, bladder, and pancreatic carcinomas [26, 30,31,32].

Curgy et al. reported that overexpression of the H19 gene was significantly correlated with the presence of steroid receptors and was controlled by steroid hormones [33]. A recent study also found that H19 expression correlated positively with ERα expression in breast cancer; the estrogen-ERα-H19 signaling axis plays a role in regulating the proliferation and differentiation potential of normal luminal progenitors, and this signaling network may also be important in the development of ER(+) breast tumors [34]. Our results add another aspect to these findings: the “A” allele of rs2071095 in H19 was associated with breast cancer risk specifically in ER-positive disease. In recent years, studies have shown that trait-associated SNPs are concentrated in regulatory regions and can perturb transcription factor recognition of these regions, thereby conferring allele-specific dysregulation of the SNP-associated gene [35,36,37,38,39]. SNP rs2071095 is located about 2 kb upstream of the H19 transcriptional start site and is likely to modulate transcription factor binding to this region, resulting in the “A” allele-specific down-regulation of H19 expression observed in breast cancer tissues in our study. However, the underlying mechanism remains to be elucidated, and it may eventually be exploited for therapeutic benefit.

In the present study, we included SNP rs2151280 in ANRIL, which has been reported to affect susceptibility to basal cell carcinoma [40]. ANRIL (officially known as CDKN2B antisense RNA 1, CDKN2B-AS1) is encoded in the chromosome 9p21 region at the INK4 locus. Genetic susceptibility loci in ANRIL could serve as hotspots for multiple diseases such as cardiovascular disease, cancer, and diabetes [41,42,43,44,45,46]. ANRIL is overexpressed in a variety of cancers, including leukemia, breast cancer, and prostate cancer [13], and controls cell proliferation and senescence by interacting with CBX7 (a PRC1 component) and SUZ12 (a PRC2 component) to form heterochromatin surrounding the INK4b-ARF-INK4a locus, which leads to gene silencing [47,48,49,50]. Knockdown of ANRIL could lower multidrug resistance [51], decrease proliferation [52], and inhibit invasiveness [53]. However, we identified rs2151280 as a susceptibility variant for breast cancer only in stage I and could not validate the association in stage II.

Although this was a large, systematic case-control study to detect breast cancer risk-associated SNPs in lincRNAs, it has several limitations. First, we selected only independent SNPs (r² < 0.5) for stage I because of the limited scope of the study; fine-mapping of a larger region centered on rs2071095 is needed for in-depth analysis. Second, we included only SNPs with a minor allele frequency (MAF) ≥ 0.1 and therefore inevitably missed low-frequency SNPs associated with breast cancer risk. Third, we did not correct for multiple comparisons, although we used a large sample and an independent validation set to assess the replicability of our finding, and the expression data by genotype lend further support to the association.

In conclusion, our study identified a common variant in H19 that is implicated in susceptibility to breast cancer among Chinese women and that should be prioritized for future analysis in tumorigenicity assays.