Introduction

As DNA sequencing technology advances, our knowledge of the human genome evolves. For example, we now classify transcripts into two major categories, protein-coding and non-coding. Transcribed into messenger RNAs (mRNAs), protein-coding genes account for a very small percentage of transcripts (only about 2 %), whereas non-coding transcripts constitute over 95 % of the transcriptome [1]. Among the non-coding transcripts, long non-coding RNAs (lncRNAs; sequences longer than 200 nucleotides) have emerged as a unique group of transcripts that share structural features with protein-coding genes, such as introns and exons, but also possess a wide range of biological functions involved in a variety of cellular activities [2–8]. Given their important roles in cell signaling and regulation, the involvement of lncRNAs in various diseases, especially cancer, has been suspected and investigated [2, 3, 5–8]. However, since the functionality of lncRNAs is based on nucleotide sequences rather than peptide structures, and involves multiple molecules including proteins and other non-coding transcripts, our understanding of lncRNAs remains limited. The function and regulation of many lncRNAs and their derivatives are still unidentified or uncharacterized [9, 10].

In a previous study [11], we reported the discovery of a novel long intergenic non-coding RNA (lincRNA), LINC00472, that is closely linked to clinical and pathologic features of breast cancer. High expression of LINC00472 was found to be associated with low tumor grade, early stage disease, estrogen or progesterone receptor positivity, and less aggressive molecular subtypes. Compared to patients with low expression, those with high expression of this lincRNA also had more favorable responses to adjuvant chemotherapy and endocrine therapy and survived longer. These observations have been remarkably consistent across more than a dozen clinical studies that have involved thousands of patients. Further, our in vitro experiments demonstrated that LINC00472 expression is low in breast cancer cell lines and that up-regulating its expression via transfection of a LINC00472-expressing vector slows cell growth and inhibits cell migration [11].

In this article, we report our further investigation of this lincRNA to address three additional issues. First, we investigated which mechanism, change in gene copy number or DNA methylation, might influence LINC00472 expression in breast cancer. Second, we sought to replicate our findings in microarray datasets from platforms other than Affymetrix, because our previous results were based mainly on that platform. Third, since tumor grade was correlated with LINC00472 expression, and since both were associated with breast cancer survival, it would be helpful to determine whether LINC00472 had additional value in predicting breast cancer prognosis after eliminating the confounding effects of tumor grade. Compared to grades 1 and 3, grade 2 tumors are known to be much more heterogeneous with regard to disease prognosis. Thus, identifying additional prognostic markers for grade 2 tumors is considered necessary and valuable.

Materials and methods

Microarray-based comparative genome hybridization (aCGH)

We used the aCGH data from GEO (GSE23720) [12, 13] for copy number analysis. In the dataset, tumor DNA samples were extracted from 173 breast cancer patients, and 13 normal male DNA samples were used as reference. Genomic imbalances of the DNA samples were determined using the Agilent-014693 Human Genome CGH Microarray 244A chip. We downloaded the values obtained by circular binary segmentation (CBS) of the normalized log2 ratio Cy5/Cy3 (Cy5: label for human primary breast tumor samples; Cy3: label for the DNA pool from 13 normal male samples). Two probes (A_14_P113080 and A_14_P202474) on this Agilent chip cover the genomic region that contains the LINC00472 gene. In the same study, 193 patients had gene expression data generated by the Affymetrix Human Genome U133 Plus 2.0 Array. The Affymetrix chip has four probes (220324_at, 231136_at, 235771_at and 243974_at) mapped to different regions of the LINC00472 gene, and their values are highly correlated with one another. We used the data from probe 220324_at as we did in our previous work [11]. To investigate whether copy number variation of the LINC00472 gene contributes to its expression, we first generated a data table with both copy number and expression values of LINC00472 by matching patient IDs, which yielded a final table of 173 patients. We separated these patients into low and high expression groups using the median of LINC00472 expression values as the cutoff. We then plotted the normalized copy number values (Cy5/Cy3 ratio) of the two groups side by side, and compared the groups using the Mann–Whitney U statistic. As a reference, data from the retinoblastoma 1 (RB1) gene were extracted and analyzed in the same way.
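The median split and two-group comparison described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study: the toy values are hypothetical, samples equal to the median are assigned to the high group (an assumption), and the p value comes from a normal approximation without tie or continuity correction.

```python
from math import erfc, sqrt
from statistics import median


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic for x, approximate p value).
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u_x = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    return u_x, erfc(abs(z) / sqrt(2))


def split_by_median(expression, copy_number):
    """Group copy-number values by expression below / at-or-above the median."""
    cutoff = median(expression)
    low = [c for e, c in zip(expression, copy_number) if e < cutoff]
    high = [c for e, c in zip(expression, copy_number) if e >= cutoff]
    return low, high


# Hypothetical toy data: expression and tumor/normal copy-number ratio
# for eight matched patients (not values from GSE23720).
expression = [5.1, 6.3, 7.8, 4.2, 8.9, 6.0, 7.1, 5.5]
copy_ratio = [1.02, 0.98, 1.01, 0.97, 1.03, 1.00, 0.99, 1.04]

low, high = split_by_median(expression, copy_ratio)
u_stat, p_value = mann_whitney_u(low, high)
print(f"U = {u_stat:.1f}, approximate p = {p_value:.3f}")
```

For real data, an exact or tie-corrected test from a statistics package would be preferable to this approximation.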

Affymetrix genome-wide human SNP array 6.0

The cBioPortal for Cancer Genomics was used to analyze raw data from a provisional study of breast invasive carcinoma in The Cancer Genome Atlas (TCGA) [14, 15]. Through May 2015, 1065 tissue samples tested both by RNA sequencing and by the Affymetrix Genome-wide Human SNP6.0 Array were available for plotting. We downloaded the expression data and copy number values of the LINC00472 and RB1 genes, and compared them using the same strategy as described above for the GSE23720 data.

Illumina HumanMethylation450 BeadChip

The provisional breast invasive carcinoma study from TCGA included microarray methylation data generated from the Illumina HumanMethylation450 BeadChip. This chip covers 99 % of the RefSeq genes, with an average of 17 CpG sites per gene distributed across the promoter, 5′UTR, first exon, gene body, and 3′UTR. Fifteen CpG sites are located in the LINC00472 gene (Fig. 2a), of which 14 are in the promoter and first exon regions. The cBioPortal for Cancer Genomics [14, 15] analyzes the Spearman correlation coefficient between gene expression and DNA methylation, and automatically selects the CpG site with the strongest correlation. To examine the expression-methylation correlations in detail, we downloaded the TCGA level 3 data on all the 15 CpG sites which contained normalized DNA methylation results, and performed correlation analysis with gene expression for each CpG site.
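The per-site correlation analysis can be illustrated with a short sketch. It computes a Spearman coefficient between expression and methylation for each CpG site and then picks the site with the largest absolute coefficient. The site names and values are hypothetical placeholders, not TCGA data, and "strongest" is assumed here to mean largest in absolute value.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: expression for six samples and methylation
# beta values at two made-up CpG sites.
expression = [2.0, 3.5, 1.2, 4.8, 2.9, 3.9]
methylation = {
    "cg_site_A": [0.30, 0.22, 0.41, 0.10, 0.28, 0.15],
    "cg_site_B": [0.55, 0.50, 0.52, 0.47, 0.58, 0.49],
}

correlations = {
    site: spearman(expression, beta) for site, beta in methylation.items()
}
strongest = max(correlations, key=lambda site: abs(correlations[site]))
for site, rho in correlations.items():
    print(f"{site}: rho = {rho:.3f}")
print("strongest correlation:", strongest)
```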

Gene expression analysis

In our previous work on LINC00472, we only analyzed the GEO data generated from the Affymetrix Human Genome U133 plus 2.0 array or U133A array [11]. In the current study, we broadened the evaluation by analyzing four additional datasets in GEO that were based on the Agilent and Illumina platforms containing probes for LINC00472. These datasets included studies with a total of 561 breast cancer samples (Supplementary Table S1). Because different microarray platforms were used in these studies, we dichotomized the normalized LINC00472 expression data using study-specific median as cutoff to define “LINC00472_higher” (≥median) and “LINC00472_lower” (<median) for meta-analysis across the studies. Clinical and pathologic variables were also dichotomized. Associations of LINC00472 with clinical and pathologic variables were determined by odds ratios and their 95 % confidence intervals (95 % CI). Summary results, weighted by inverse-variance, were calculated based on the random-effects model, and presented in Forest plots. For datasets with survival information, Kaplan–Meier survival curves were constructed on individual studies and log-rank test was used to assess differences in survival between groups. In this survival analysis, LINC00472 expression was grouped into 3 categories based on its tertile distribution.
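As an illustration of the per-study association measure, the sketch below computes an odds ratio and its 95 % confidence interval from a 2 × 2 table of dichotomized variables. The log-scale (Wald) interval and the 0.5 correction for empty cells are assumptions made for this example, since the text does not specify how the intervals were obtained, and the counts are hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval on the log scale.

    a: low expression, feature present    b: low expression, feature absent
    c: high expression, feature present   d: high expression, feature absent
    Returns (odds ratio, lower limit, upper limit, SE of log odds ratio).
    """
    if min(a, b, c, d) == 0:
        # Assumed continuity correction so the ratio stays defined.
        a, b, c, d = (v + 0.5 for v in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log)
    upper = exp(log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper, se_log


# Hypothetical counts: ER-negative vs ER-positive tumors in the
# LINC00472_lower and LINC00472_higher groups of one study.
or_value, ci_low, ci_high, se = odds_ratio_ci(a=30, b=70, c=15, d=85)
print(f"OR = {or_value:.2f} (95 % CI {ci_low:.2f} to {ci_high:.2f})")
```

The log odds ratio and its standard error returned here are the quantities that an inverse-variance meta-analysis combines across studies.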

Analysis of grade 2 tumors

We analyzed the associations of LINC00472 expression with breast cancer survival specifically in grade 2 tumors in our study (Turin_Study), and in eight other GEO datasets that contained more than 60 patients with grade 2 tumors (Supplementary Table S2). In total, 936 patients with grade 2 tumors were included in this analysis. Kaplan–Meier survival analysis was performed on the individual studies, and LINC00472 expression levels were dichotomized based on the median in each study. Summarized results were also generated using the inverse-variance weighted random-effects model.
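The pooling step can be sketched as an inverse-variance random-effects meta-analysis. The DerSimonian–Laird estimate of between-study variance is assumed here; the text does not name the estimator, and the per-study ratios below are hypothetical, not results from the datasets analyzed.

```python
from math import exp, log, sqrt


def ratio_to_log_effect(estimate, lower, upper, z=1.96):
    """Convert a ratio and its 95 % CI to (log effect, standard error)."""
    return log(estimate), (log(upper) - log(lower)) / (2 * z)


def random_effects_pool(log_effects, std_errors, z=1.96):
    """Inverse-variance random-effects pooling (DerSimonian-Laird).

    Returns (pooled ratio, lower limit, upper limit, tau squared).
    """
    weights = [1 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, log_effects)) / total_w
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, log_effects))
    df = len(log_effects) - 1
    c = total_w - sum(w ** 2 for w in weights) / total_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add the between-study variance to each study.
    re_weights = [1 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(w * y for w, y in zip(re_weights, log_effects)) / sum(re_weights)
    se_pooled = sqrt(1 / sum(re_weights))
    return (
        exp(pooled),
        exp(pooled - z * se_pooled),
        exp(pooled + z * se_pooled),
        tau2,
    )


# Hypothetical per-study ratios with 95 % CIs (high vs low expression).
studies = [(0.45, 0.25, 0.81), (0.60, 0.35, 1.03), (0.38, 0.18, 0.80)]
effects, errors = zip(*(ratio_to_log_effect(*s) for s in studies))
pooled, low_ci, high_ci, tau2 = random_effects_pool(effects, errors)
print(f"pooled = {pooled:.2f} (95 % CI {low_ci:.2f} to {high_ci:.2f}), "
      f"tau^2 = {tau2:.3f}")
```

When the studies are homogeneous the between-study variance is estimated as zero, and the result coincides with the fixed-effect inverse-variance estimate.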

Statistical analysis

For data analysis, normalized LINC00472 expression intensity was analyzed as a categorical variable with low and high levels classified by median expression. Associations of LINC00472 expression with clinical and pathologic factors were determined using the Chi-square statistic. Kaplan–Meier survival curves were constructed to show survival differences according to LINC00472 expression, and the log-rank test was used for comparison. Survival outcomes considered were disease-free survival, distant relapse-free survival, relapse-free survival, and metastasis-free survival. The Mann–Whitney U statistic was used to compare differences in copy number variation. Spearman correlation coefficients were calculated for correlation analysis. Data were analyzed using the Statistical Analysis System, version 9.4 (SAS Institute Inc., Cary, NC) and GraphPad Prism 6 (GraphPad Software, Inc., La Jolla, CA). All statistics were two-sided; p values less than 0.05 were considered significant. Review Manager (Revman Version 5.3, Copenhagen, Denmark) was used for meta-analysis.
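For readers unfamiliar with the survival methods named above, the following sketch shows a Kaplan–Meier estimator and a two-group log-rank test in plain Python. The analyses in this study were run in SAS and GraphPad Prism; this is only an illustrative re-implementation under simple assumptions (right-censored data, two groups, at least one event), and the follow-up times are hypothetical.

```python
from math import erfc, sqrt


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_removed = 0
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_removed
    return curve


def log_rank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square with 1 df, p value)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical follow-up in months (1 = relapse observed, 0 = censored).
low_times, low_events = [5, 8, 12, 20, 24], [1, 1, 1, 0, 1]
high_times, high_events = [10, 18, 30, 36, 40], [0, 1, 0, 0, 0]

print(kaplan_meier(low_times, low_events))
chi2, p_value = log_rank(low_times, low_events, high_times, high_events)
print(f"log-rank chi-square = {chi2:.2f}, p = {p_value:.3f}")
```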

Results

In our previous study, we found low LINC00472 expression in tumors compared to adjacent non-tumor or normal breast tissues [11], but did not know whether the differences were the result of copy number changes in the corresponding genomic region. To address this issue, we analyzed DNA copy number variations in relation to gene expression in two publicly available datasets, one from GEO and one from TCGA. The dataset GSE23720 [12, 13] contained 173 tumor samples analyzed both by the Affymetrix gene expression microarray (Platform: GPL570) and by the Agilent CGH microarray (Platform: GPL9128). The ratio of gene copy numbers between tumor DNA and normal DNA (Cy5/Cy3) for LINC00472 was distributed almost evenly around 1.0, suggesting no loss or deletion of this gene, while for the RB1 gene, which has been reported generally to be deleted in cancer tissues, most of the Cy5/Cy3 ratios were below 1.0 (Fig. 1a). Grouping the samples into high versus low LINC00472 expression showed no differences in gene copy numbers between these groups (Fig. 1a).

Fig. 1

Copy number variation and LINC00472 expression. a Box-and-whisker plots based on the dataset GSE23720 show similar distributions of copy numbers for the LINC00472 gene (left) but different distributions for the RB1 gene (right) between patients with high and low expression of the corresponding gene. The y axis shows the normalized signal ratio between tumor tissues (Cy5) and a pool of normal male DNA (Cy3). The whiskers cover 2.5–97.5 percentiles. p values were determined by the Mann–Whitney U test. b Box-and-whisker plots based on the TCGA breast cancer study show similar distributions of copy numbers for the LINC00472 gene (left) but different distributions for the RB1 gene (right) between patients with high and low expression of the corresponding gene. The y axis shows the ratio of copy number values. The whiskers cover 2.5–97.5 percentiles. p values were determined by the Mann–Whitney U test

In the TCGA provisional breast cancer study, gene expression data were produced by RNA sequencing, and copy number variations were measured by the Affymetrix Genome-wide Human SNP6.0 Array. We plotted the data as we did for the GSE23720 data, and found that LINC00472 expression was not associated with copy number alteration, while many samples in this large TCGA dataset showed copy number loss or deletion in the RB1 gene (Fig. 1b). RB1 expression was positively correlated with gene copy number (Fig. 1b) as had been observed previously [16].

We next analyzed the relationship between LINC00472 expression and DNA methylation of the gene. In the TCGA provisional breast cancer study, 735 patient samples had information on gene expression by RNA sequencing and on DNA methylation by the HumanMethylation450 chip. The Illumina HumanMethylation450 chip contains 14 methylation probes for the CpG sites in the promoter and first exon regions of the LINC00472 gene (Fig. 2a). We downloaded the methylation data for the 14 CpG sites, and analyzed their correlations with expression of LINC00472. Our analysis showed that methylation at each of these CpG sites was inversely correlated with LINC00472 expression, that is, higher methylation accompanied lower expression (Fig. 2b), suggesting that the expression of this gene may be regulated by promoter methylation. Across the 14 probes, the strongest correlation coefficient was −0.32 (p < 0.0001) (Fig. 2c). Further analyses of methylation with respect to disease features and patient survival revealed no significant associations between these variables (data not shown).

Fig. 2

Methylation status and LINC00472 expression. a A screenshot from the UCSC Genome Browser shows the CpG island around the LINC00472 promoter and the probes included in the Illumina HumanMethylation450 BeadChip for measuring methylation at the CpG sites. b Bar chart demonstrates a consistent negative correlation between LINC00472 expression and methylation for all the probes. The y axis shows each probe, and the x axis shows the Spearman correlation coefficient for each probe (*p ≤ 0.0001; **p < 0.05). c Scatter plot shows a negative correlation between LINC00472 expression and the methylation level around the LINC00472 promoter. Normalized DNA methylation beta values are shown on the y axis. Linear regression analysis suggests a regression line of Y = −0.02139X + 0.3321

In our previous study [11], we focused exclusively on the results of the Affymetrix chip (Affymetrix Human Genome U133 plus 2.0 array and U133A array) in order to ensure that we employed consistent and reliable gene expression data for validation. In the present report, we broadened the scope of our validation by including chip results from other manufacturers. We identified four such datasets, three generated on Illumina chips and one on an Agilent chip (Supplementary Table S1). Consistent with our previous observations, analysis of these data showed that LINC00472 expression was positively associated with estrogen receptor (ER) status, and negatively associated with tumor grade and aggressive molecular subtypes (Fig. 3a). Two of the datasets also had information on disease-free survival. High expression of LINC00472 was associated with favorable disease outcomes compared to low expression (p = 0.0061 and 0.0097 for GSE19783 and GSE22219, respectively) (Fig. 3b, c). These results again confirmed the findings of our previous study.

Fig. 3

Agilent and Illumina platforms for LINC00472 expression. a A meta-analysis shows that low LINC00472 expression was associated with ER negative tumors (OR = 0.41; 95 % CI 0.27–0.63), high-grade tumors (OR = 2.48; 95 % CI 1.63–3.77), and luminal B, Her2 positive or basal-like tumors (OR = 5.29; 95 % CI 3.25–8.60). b Kaplan–Meier survival curves by low, intermediate and high LINC00472 expression in dataset GSE19783. c Kaplan–Meier survival curves by low, intermediate, and high LINC00472 expression in dataset GSE22219

LINC00472 expression is associated with tumor grade, potentially limiting its utility for prognosis, especially in high- and low-grade tumors (grade 3 and 1) where expression is relatively homogenous [17]. To improve the accuracy of breast cancer prognosis among patients with grade 2 tumors, additional tumor features, especially molecular markers, should be considered. We therefore analyzed LINC00472 data in patients with grade 2 tumors. Nine datasets including our own had more than 60 such patients. Of the 9 studies, 6 showed high expression of LINC00472 significantly associated with favorable disease-free survival compared to low expression (Fig. 4). Meta-analysis of these studies demonstrated that patients with grade 2 breast cancer had a 50 % reduction in risk of disease relapse if their tumors expressed high levels of LINC00472 transcript (Fig. 5).

Fig. 4

Kaplan–Meier survival curves by low and high LINC00472 expression in our study and 8 other datasets from GEO with more than 60 patients with grade 2 tumors in each dataset

Fig. 5

Meta-analysis of associations between LINC00472 expression and disease-free survival among patients with grade 2 tumors. The summary hazard ratio was estimated using the random-effects model, with each study weighted by the inverse of its variance. High LINC00472 expression was associated with better disease-free survival (OR = 0.49; 95 % CI 0.38–0.63)

Discussion

This study further confirms that LINC00472 expression is significantly associated with breast cancer in terms of tumor grade, estrogen receptor status, and molecular subtype, and that higher expression of LINC00472 predicts better disease outcome. Our study also provides some evidence that LINC00472 expression may be regulated by DNA methylation in its promoter, whereas changes in gene copy number are not found in breast tumors and cannot account for the variation in LINC00472 expression. More importantly, levels of LINC00472 expression can be used to distinguish survival differences among breast cancer patients with grade 2 tumors. These features underscore the potential significance of LINC00472 in serving as a marker for breast cancer prognosis.

As part of our investigation, we evaluated two aspects of LINC00472 expression regulation, gene copy number and promoter methylation, using data available online from genome-wide analyses. Data from the microarray-based comparative genome hybridization analysis and the Affymetrix genome-wide human SNP genotyping array both showed no evidence of substantial deviation from standard copy number, suggesting no deletion or amplification of this gene in tumor samples. We integrated the copy number data with gene expression results, and found no differences in gene copy number between tumor samples with high versus low expression of LINC00472. These analyses indicate that expression variation of LINC00472 in breast cancer is not due to changes in gene copy number. We also compared these results with similar data for the RB1 gene, which is known to have copy number loss in cancer; the contrast reinforces the conclusion of no copy number changes in LINC00472.

The LINC00472 gene contains a CpG island in its promoter. As reported by several lncRNA profiling studies [18–20], DNA methylation in the CpG island of a lncRNA gene promoter may regulate the expression of the lncRNA gene, just as it does for coding genes. We therefore examined methylation values in the TCGA database generated from the Illumina HumanMethylation450 BeadChip, and integrated the data with gene expression results. Both our own analysis and the analysis through the cBioPortal for Cancer Genomics showed that LINC00472 expression was inversely correlated with methylation levels of the CpG sites in the promoter and first exon. Our analysis of the TCGA data also indicates that this inverse correlation exists not only in breast cancer, but in other cancer sites as well. The Spearman correlation coefficient (r) was −0.40 (p < 0.0001) in lung adenocarcinoma, −0.30 (p < 0.0001) in lung squamous cell carcinoma, −0.30 (p < 0.0001) in uterine carcinosarcoma, and −0.53 (p < 0.0001) in uterine corpus endometrial carcinoma. Our findings suggest that promoter methylation may play a role in the regulation of LINC00472 expression. Data from another GEO dataset, GSE39004 [21], containing both gene expression and methylation information from 46 tumor samples, also showed a similar correlation (data not shown).

In our previous study, we used gene expression data exclusively from two microarray chips, the Affymetrix Human Genome U133 plus 2.0 and the U133A arrays. There were reports suggesting that microarray data from different platforms did not correlate well [22, 23]. We had the same impression when we compared gene expression signatures generated by different microarray platforms for breast cancer prognosis and found little overlap in genes across different signatures [24]. This phenomenon led us to think that our previous results needed to be validated on other microarray platforms. In this study, we included microarray data from other manufacturers to broaden the range of data sources for validation and to rule out the possibility that our validation was limited to one type of array from a single manufacturer. We identified four datasets in GEO (Supplementary Table S1), each of which contained more than 50 samples with gene expression data and clinical information that were useful for evaluation. Our meta-analysis confirmed that low LINC00472 expression was linked to breast cancer with a less favorable prognosis.

A set of tumor samples in GEO has been analyzed both by RNA sequencing (GSE60785) and by gene expression microarray (GSE60788). The results of these analyses with regard to LINC00472 expression were highly correlated (Spearman correlation coefficient = 0.74; p < 0.0001). The associations of LINC00472 expression with ER status, tumor grade, and molecular subtype were also similar between the two platforms. The provisional breast cancer dataset in TCGA, which was used in our copy number and methylation analyses, included more than 1000 patients, but these studies were conducted relatively recently and patients in the datasets had short follow-up times. The microarray data in TCGA did not cover most long non-coding RNAs, including LINC00472, and therefore we had to use RNA sequencing data to analyze the association of LINC00472 with survival. In this analysis, patients with higher expression of LINC00472 had significantly better overall survival than patients with lower expression. Considering these methods plus the RT-qPCR that we used in our previous study [11], we conclude that the associations between LINC00472 expression and disease features are consistent in breast cancer patients regardless of the analytical method used to measure the expression of LINC00472.

As a well-established indicator of breast cancer prognosis, tumor grade, determined on the basis of cell morphology, provides important information on the potential behaviors of malignant cells [25]. Determining tumor grade may be relatively straightforward for grade 1 or 3 breast cancers [26, 27], but not for grade 2, as reflected by the lowest degree of concordance among pathologists compared to grades 1 and 3 [17]. Grade 2 tumors have the most uncertainty in choice of post-surgical treatment, especially chemotherapy [17]. Several genomic tests have been developed on the basis of gene expression profiling, including Oncotype DX [28] and MammaPrint [29, 30], to assist the prediction of breast cancer prognosis for grade 2 tumors [31]. However, even for the ongoing TAILORx trial (the Trial Assigning Individualized Options for Treatment), patients with intermediate grade tumors are still randomly assigned to receive adjuvant chemotherapy or not as well as to subsequent endocrine therapy [32, 33], because risk of recurrence for these patients is uncertain. Multiple gene expression signatures have been developed with the hope that genomic grade can predict tumor prognosis better than histologic grade [13, 34–40]. However, the gene expression signatures are composed of distinct sets of genes with little overlap [28, 32, 36, 41–47], suggesting that substantial heterogeneity may exist and additional predictors are needed. To address this issue, we focused on the prognostic value of LINC00472 in patients with grade 2 tumors only, and found that survival in such patients was further distinguished when their LINC00472 levels were analyzed in tumor samples. Additional studies are needed to further confirm the prognostic and predictive values of LINC00472 in grade 2 tumors when confounding factors can be considered and adjusted for in the analysis.

Although our investigation found additional evidence in support of LINC00472 being a potential biomarker for breast cancer prognosis, more studies, especially prospective ones in which a standardized lab test is employed to measure gene expression, are still needed to further validate the results and exclude the influences of other prognostic factors or parameters. For clinical application, we also need to establish a unique cutoff for predicting prognosis, and demonstrate the sensitivity and specificity of the test. Another issue to consider is that our findings are currently based on the analysis of fresh frozen tissues, which may not be feasible or practical for application in the clinic. It should be tested whether FFPE tissue blocks can be used for measuring this marker, since these samples are more readily available for analysis. More research is also needed to understand the biologic implications of LINC00472 in breast cancer.