Introduction

Lymph node involvement is the most important prognostic factor in breast cancer [1]. However, 20–30% of node positive patients remain free of distant metastases whereas 20–30% of lymph node negative patients will develop metastasis [2]. Given this limited correlation, it remains unclear whether metastasis to distant sites proceeds sequentially from lymph node metastasis or in parallel by a hematogenous route. Moreover, it is unclear whether lymph node metastasis reflects the chronologic age or the biology of the tumor and whether there is an influence of the host on the process of metastasis to the lymph nodes [3–6]. Known factors associated with axillary lymph node metastasis include increasing tumor size, presence of lymphovascular invasion, poor histologic grade, and age [7–9]. However, even in the subgroup of patients with all favorable factors, 13% still had involved lymph nodes [8].

The genetic signature of a primary tumor holds significant prognostic value [10]. However, when looking into the literature it is unclear whether lymph node involvement can be predicted from primary tumor tissue. Huang et al. [11] identified a gene expression pattern associated with the breast tumor’s likelihood of having lymph node metastasis at diagnosis. In contrast Weigelt et al. [12] did not find an expression signature that could predict the lymph node status. Therefore, the question remains as to whether or not it is possible to predict lymph node metastases from patients’ primary tumor based on gene expression data.

MicroRNAs (miRNAs) are a class of small non-coding RNAs able to negatively regulate gene expression at the post-transcriptional level [13]. With over 700 human miRNAs reported and hundreds of target genes per miRNA, these RNAs represent one of the largest classes of gene regulators. The influence of miRNAs on potentially every cellular pathway makes it likely that deregulated miRNA expression is implicated in cancer progression. Growing evidence suggests that miRNAs can function as oncogenes or tumor suppressors [13]. Emerging evidence reveals that the pattern of miRNA expression correlates well with clinicopathological characteristics and disease outcome [14, 15].

We sought to determine whether the presence of metastasis in the regional lymph node could be predicted from the primary tumor. To answer this question, we first compared gene expression profiles of primary tumor tissue. In contrast with previous studies, we selected a group of breast tumors with very homogeneous histological characteristics, balanced for lymph node involvement. Second, in a subgroup of these breast tumors, miRNA expression profiling was performed. Third, the results of both analyses were integrated.

Materials and methods

Selection of patients

Tumor samples were selected from the multidisciplinary breast center database. Cases were chosen from women with primary breast cancer in whom axillary lymph node status was known, and the cohort was balanced for nodal status. We only selected postmenopausal patients with a poorly differentiated, estrogen receptor positive, HER-2/neu negative invasive ductal cancer. Node positive patients were those with at least one node containing a tumor deposit of >2 mm. Node negative patients were those with pathologically negative nodes. Within 30 min after surgical extirpation, tissues were deep frozen and stored at −80°C. Frozen tumor blocks were thin-sectioned and stained with hematoxylin/eosin, and only those judged to contain at least 70% viable tumor by area were carried on for RNA extraction. The final collection of tumors for the microarray study consisted of 48 lymph node negative and 48 lymph node positive tumors (training set). Of these tumors, 82 were carried on for miRNA profiling. In all cases, ER, PR, and HER-2 status were determined for diagnostic purposes. Nuclear ER and PR immunostaining (Allred score 3–8) were considered ER-positive and PR-positive, respectively. Lack of membranous HER-2 immunostaining (score 0 or 1) was considered HER-2 negative. Table 1 gives an overview of the clinical characteristics of the patients.

Table 1 Clinical characteristics of training set

Microarrays

RNA extraction

Total RNA was extracted from eight 10–20 μm slides using TRIzol reagent and further purified on column (RNeasy Mini Kit, Qiagen, Valencia, CA, USA). RNA concentration and purity were determined spectrophotometrically using the NanoDrop ND-1000 (NanoDrop Technologies), and RNA integrity was assessed using a Bioanalyzer 2100 (Agilent).

Microarray expression profiling

The analyses detailed here comply with the MIAME (minimal information about a microarray experiment) guidelines established by the Microarray gene expression data society (www.mged.org). cRNA target preparation, hybridization to Affymetrix U133 Plus 2.0 arrays, and washing and array signal acquisition were performed at the Microarray Facility of the Flanders Interuniversity Institute for Biotechnology (VIB) in Belgium.

Per sample, an amount of 2 μg of total RNA spiked with bacterial RNA transcript positive controls (Affymetrix) was converted to double-stranded cDNA in a reverse transcription reaction. Subsequently, the sample was converted and amplified to antisense cRNA and labeled with biotin in an in vitro transcription reaction. All the steps were carried out according to the manufacturer’s protocol (Affymetrix).

All amplification and labeling reactions were performed on a Biomek 3000 ArrayPlex Workstation (Beckman Coulter). A mixture of purified and fragmented biotinylated cRNA and hybridisation controls (Affymetrix) was hybridised on Affymetrix HG U133 Plus 2.0 arrays followed by staining and washing in a GeneChip® fluidics station 450 (Affymetrix) according to the manufacturer’s procedures. To assess the raw probe signal intensities, chips were scanned using a GeneChip® scanner 3000 (Affymetrix). Microarray data are available at the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo), with accession code GSE23177.

Microarray data analysis

The microarray dataset was preprocessed with MAS 5.0, the GeneChip Microarray Analysis Suite 5.0 software (Affymetrix). We used an alternative annotation for the conversion of probes to genes provided by Dai et al. To account for the low signal-to-noise ratio of microarray data, we filtered out, in an unsupervised manner, genes with low variation across all samples and retained the 5000 most variable genes. Finally, the dataset was standardized per sample across all genes. To be able to include all 96 patients independent of the number of positive nodes, we defined the lymph node ratio (LN ratio) as the number of lymph nodes found to be positive divided by the total number of examined lymph nodes [16]. The LN ratio reflects the severity of nodal involvement, and the Spearman correlation coefficient was used to identify genes whose expression gradually increases or decreases with changing LN ratio. Models for the prediction of the lymph node status were built by weighted Least-Squares Support Vector Machines [17, 18] on random splits of the 96 patients into 10 folds, and this 10-fold cross-validation was repeated 100 times for robustness [19]. Genes that were selected in at least half of the cross-validation iterations at a significance level of 0.05 were considered.
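The filtering, standardization, and LN ratio correlation steps described above can be illustrated with a minimal sketch. It is not the code used in this study: the gene names, node counts, and the choice of population variance are hypothetical, and the toy data keep 2 genes where the study keeps 5000.

```python
import math
import statistics


def top_variance_genes(expr, n_keep):
    """Return the n_keep genes with the largest variance across samples."""
    return sorted(expr, key=lambda g: statistics.pvariance(expr[g]), reverse=True)[:n_keep]


def standardize_per_sample(expr, genes):
    """Z-score every sample (column) across the retained genes."""
    n_samples = len(expr[genes[0]])
    out = {g: [0.0] * n_samples for g in genes}
    for j in range(n_samples):
        column = [expr[g][j] for g in genes]
        mu, sd = statistics.fmean(column), statistics.pstdev(column)
        for g in genes:
            out[g][j] = (expr[g][j] - mu) / sd
    return out


def average_ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: 3 genes measured in 4 patients.
expr = {
    "geneA": [1.0, 1.0, 1.0, 2.0],
    "geneB": [0.0, 5.0, 0.0, 5.0],
    "geneC": [1.0, 2.0, 3.0, 4.0],
}
positive_nodes = [0, 1, 3, 6]
examined_nodes = [10, 10, 12, 12]
ln_ratio = [p / e for p, e in zip(positive_nodes, examined_nodes)]

kept = top_variance_genes(expr, n_keep=2)      # the study keeps 5000 genes
z = standardize_per_sample(expr, kept)         # per-sample standardization
rho = {g: spearman(z[g], ln_ratio) for g in kept}
```

Because the LN ratio is bounded between 0 and 1 and is strongly skewed toward 0 in node negative patients, a rank-based coefficient such as Spearman's is a natural choice here; it only assumes a monotone relation between expression and nodal burden.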

Datasets

Besides the training set of 96 patients, an independent internal dataset of 20 patients balanced for lymph node involvement was selected from the multidisciplinary breast center database in the same way as described for the training set. The same approach for RNA extraction and microarray expression profiling was used as well. Also six publicly available datasets on breast cancer for which the lymph node status was provided were considered in the microarray study. In the independent dataset and the datasets for external validation only those patients with the defined characteristics were included (Table 2).

Table 2 Microarray datasets used in this study

miRNA (82 tumors)

miRNA extraction

RNA from each tumor specimen was extracted from five 10 μm slides using a mirVana miRNA Isolation Kit (Applied Biosystems). RNA quality was checked with an Agilent Bioanalyzer lab-on-a-chip system.

miRNA expression profiling

Biotinylated first strand cDNA was prepared from 5 μg of total RNA and hybridized onto Human miRNA assay version 1 microarray chips (Illumina) according to the manufacturer’s instructions. This assay detects 470 miRNAs described in the miRBase database v9.1 and 265 potential miRNAs identified in a RAKE analysis study [20]. Images of the arrays were acquired using Illumina BeadStudio software, and visual inspection of microarray images revealed no visual artifacts. The signal intensity was calculated for each spot while adjusting for local background. Data were exported to GeneMaths XT microarray analysis software (Applied Maths) for further analysis. Spots with a signal-to-noise ratio below 200 were marked as missing values, and probes that had 50% or more missing values were excluded, thereby retaining 235 high quality probes (195 validated and 40 potential miRNAs). Signal intensities were log2 transformed after setting values smaller than 10^-6 to 10^-6. A median centering normalization algorithm was applied to all remaining probes (i.e., all arrays were scaled with respect to the global median).
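The preprocessing rules above (masking by signal-to-noise ratio, dropping probes with 50% or more missing values, flooring before the log2 transform, and median centering) can be sketched as follows. This is an illustration under stated assumptions, not the GeneMaths XT implementation: the function name and the toy probes are hypothetical, and median centering is interpreted here as subtracting each array's median log2 signal.

```python
import math
import statistics

SNR_MIN = 200    # spots below this signal-to-noise ratio are treated as missing
FLOOR = 1e-6     # values smaller than this are raised to it before log2


def preprocess(signal, snr):
    """signal, snr: dict probe -> list with one value per array.

    Returns dict probe -> list of log2, median-centred values (None = missing).
    """
    kept = {}
    for probe, values in signal.items():
        masked = [v if s >= SNR_MIN else None for v, s in zip(values, snr[probe])]
        n_missing = sum(v is None for v in masked)
        if n_missing >= 0.5 * len(masked):
            continue  # exclude probes with 50% or more missing values
        kept[probe] = [None if v is None else math.log2(max(v, FLOOR)) for v in masked]

    # Assumption: centre every array on its own median log2 signal.
    n_arrays = len(next(iter(signal.values())))
    for j in range(n_arrays):
        column = [vals[j] for vals in kept.values() if vals[j] is not None]
        if not column:
            continue
        med = statistics.median(column)
        for vals in kept.values():
            if vals[j] is not None:
                vals[j] -= med
    return kept


# Hypothetical toy data: 3 probes on 4 arrays.
signal = {
    "p1": [8.0, 16.0, 4.0, 0.0],
    "p2": [2.0, 2.0, 2.0, 2.0],
    "p3": [32.0, 64.0, 8.0, 16.0],
}
snr = {
    "p1": [500, 500, 500, 500],
    "p2": [10, 10, 500, 500],     # 2 of 4 spots fail -> probe is dropped
    "p3": [500, 100, 500, 500],   # 1 of 4 spots fails -> probe is kept
}
clean = preprocess(signal, snr)
```

Keeping missing spots as explicit `None` values rather than imputing them leaves the choice of how to handle them to the downstream test.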

The resulting dataset was used for statistical analysis.

To evaluate the miRNAs that are significantly differentially expressed according to lymph node status, significance analysis was performed using the Wilcoxon rank sum test at significance level 0.05 without correction for multiple testing.
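As an illustration of this test, the sketch below computes a two-sided Wilcoxon rank sum test with the large-sample normal approximation. It is a simplified stand-in rather than the software used in the study: it applies no continuity or tie correction, and the expression values are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                        # rank sum of the first group
    mean_w = n1 * (n1 + n2 + 1) / 2
    var_w = n1 * n2 * (n1 + n2 + 1) / 12
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))       # equals 2 * (1 - Phi(|z|))
    return z, p


# Hypothetical log2 expression of one miRNA in two small groups.
node_positive = [1.0, 2.0, 3.0, 4.0, 5.0]
node_negative = [6.0, 7.0, 8.0, 9.0, 10.0]
z, p = rank_sum_test(node_positive, node_negative)
significant = p < 0.05    # threshold used in the text, uncorrected
```

With 235 probes tested at an uncorrected level of 0.05, roughly 12 probes would be expected to pass by chance alone, which is worth keeping in mind when interpreting the resulting list.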

Paired miRNA-microarray analysis

The correlation between a miRNA and its computationally predicted targets was evaluated in our dataset of 82 tumors. Although the targets of many miRNAs are not yet known, databases exist consisting of computationally predicted targets based on sequence complementarity between the miRNA and its target site and on evolutionary target site conservation. We used the microRNA.org database of computationally predicted targets [21] to assign genes to their corresponding miRNA. The Spearman correlation coefficient was calculated between each miRNA and its computationally predicted targets. The analysis was done starting from all miRNAs and mRNAs passing preprocessing, without supervised selection of only the differentially expressed miRNAs or mRNAs. A one-sided hypothesis test was used to determine the significance of inverse correlation between a miRNA and its target, with a significance threshold of 0.05. For each miRNA, the number of significantly inversely correlated targets was determined and compared with 1000 sets of randomly chosen targets, within the node positive and node negative patients separately, to assess whether miRNA expression is more correlated with its computationally predicted targets than expected with random sets of targets. This allowed the identification of miRNAs with lymph node-specific inverse correlation.
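The comparison against random target sets can be sketched for a single miRNA within one patient group as below. Several choices are assumptions made for illustration and are not taken from the study: the one-sided p-value uses the large-sample approximation rho * sqrt(n - 1) ~ N(0, 1), the empirical p-value uses the (1 + hits) / (1 + sets) convention, and all names and data are hypothetical.

```python
import math
import random
import statistics


def average_ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation (undefined for constant vectors)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var)


def inverse_p(rho, n):
    """One-sided p-value for rho < 0 (assumed normal approximation)."""
    z = rho * math.sqrt(n - 1)
    return 0.5 * math.erfc(-z / math.sqrt(2))   # lower tail Phi(z)


def target_enrichment(mirna, mrna, targets, n_sets=1000, alpha=0.05, seed=0):
    """Count significantly inverse targets and compare with random gene sets."""
    n = len(mirna)
    is_inverse = {g: inverse_p(spearman(mirna, v), n) < alpha for g, v in mrna.items()}
    observed = sum(is_inverse[g] for g in targets)
    rng = random.Random(seed)
    genes = sorted(mrna)
    hits = 0
    for _ in range(n_sets):
        random_set = rng.sample(genes, len(targets))
        if sum(is_inverse[g] for g in random_set) >= observed:
            hits += 1
    return observed, (1 + hits) / (1 + n_sets)


# Hypothetical toy data: 20 tumors, 3 predicted targets, 27 background genes.
noise = random.Random(1)
mirna = [float(i) for i in range(20)]
mrna = {f"target{k}": [-(k + 1) * v for v in mirna] for k in range(3)}
mrna.update({f"bg{k}": [noise.gauss(0, 1) for _ in mirna] for k in range(27)})
observed, p_emp = target_enrichment(mirna, mrna, ["target0", "target1", "target2"])
```

Running the same function on the node positive and node negative tumors separately, and also testing the opposite tail (fewer inverse targets than random sets), would correspond to the two directions of deviation from chance examined in the analysis.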

Results

Differential gene expression

A 10-fold cross-validation strategy applied to the microarray training set of 96 patients has led to a model based on 241 genes (Table 1 in the Supplementary Appendix) with an average 10-fold area under the ROC curve (AUC) of 0.66. Figure 1 shows the heatmap for the 241 genes and 96 patients.
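To make the evaluation metric concrete, the sketch below shows how a cross-validated AUC can be obtained from repeated random 10-fold splits. It deliberately does not implement the weighted Least-Squares Support Vector Machine: a simple difference-of-class-means scorer stands in for the classifier, out-of-fold scores are pooled per repeat before computing the AUC (an assumption), and the data are simulated.

```python
import random
import statistics


def auc(scores, labels):
    """Probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid(X, y):
    """Stand-in model (NOT LS-SVM): weight vector = difference of class means."""
    dims = range(len(X[0]))
    pos = [row for row, l in zip(X, y) if l == 1]
    neg = [row for row, l in zip(X, y) if l == 0]
    return [statistics.fmean(r[d] for r in pos) - statistics.fmean(r[d] for r in neg)
            for d in dims]


def repeated_cv_auc(X, y, k=10, repeats=100, seed=0):
    """Mean AUC over repeated random k-fold splits, pooling out-of-fold scores."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        scores = [0.0] * len(y)
        for fold in range(k):
            test = set(idx[fold::k])
            train = [i for i in idx if i not in test]
            w = fit_centroid([X[i] for i in train], [y[i] for i in train])
            for i in test:
                scores[i] = sum(wd * xd for wd, xd in zip(w, X[i]))
        aucs.append(auc(scores, y))
    return statistics.fmean(aucs)


# Simulated toy data: 40 patients, one informative and one noise feature.
rng = random.Random(0)
y = [1] * 20 + [0] * 20
X = [[3.0 * label + rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]
mean_auc = repeated_cv_auc(X, y, k=10, repeats=5)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported value of 0.66 indicates a modest but non-random association between the expression signature and nodal status.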

Fig. 1

Heatmap of the 241 genes used to determine lymph node involvement in a series of 96 breast cancer patients. Each column represents the profile of the 241 marker genes for one tumor, and each row represents the relative level of expression of one gene. The tumors are numbered from 1 to 96 on the x axis. The first 48 patients are lymph node positive, the last 48 lymph node negative. A red color indicates a high level of expression of messenger RNA (mRNA) in the tumor, as compared with the reference level of mRNA, and a green color indicates a low level of expression.

Next, to validate the classifier, an additional independent set of primary tumors from 20 patients was selected (9 lymph node negative and 11 lymph node positive patients). The lymph node status was predicted with the classifier and resulted in an AUC of 0.646 (Fig. 2).

Fig. 2

Area under the receiver operating characteristic curves (AUC) for (a) the internal and external validation and (b) the separate external validation datasets. Because the datasets VDX, UNT and MAINZ only contain lymph node negative patients, they could not be validated separately. We therefore validated them together with the subset of patients in TAM for which the same Affymetrix platform U133A was used

Finally, the model was validated on six publicly available datasets, all gathered with the Affymetrix technology. Again, only samples with the defined characteristics were selected. Starting from 4800 patients in six publicly available breast cancer datasets, data from only 79 lymph node negative and 27 lymph node positive patients could be used, due to the requirement for homogeneous characteristics (Table 2). On external datasets, there was a large variance in results, with an AUC ranging from 0.57 to 0.89 (Table 2 and Fig. 2).

To gain insight into the biological meaning of the genes in our expression signature, we used the Gene Ontology Tree Machine [22] and the Molecular Signatures Database [23]. We observed a marked overrepresentation of apoptosis-related genes and zinc ion-binding genes (Table 3). The model includes 11 genes mediating BAF57-induced cell death (S100A8, CITED2, SMURF2, KIAA0895, SORBS2, SOD2, PRR6, MYH10, KYNU, PALMD, CDF15).

Table 3 Functional analysis

A subset of the genes upregulated in the node positive group are cytokines and genes related to interactions between the tumor and the host (VEGFB, CXCL13, TNFRSF12A, TNFAIP8, IL1RAP, S100A8).

Differential miRNA expression

Eight miRNAs (miR-195, miR-191, miR-132, miR-203, miR-431, miR-16, miR-30c, miR-30a) were significantly differentially expressed according to lymph node status (negative or positive), with P-values ranging from 0.014 to 0.05. Only miR-431 was upregulated in the lymph node positive patients, while the seven other miRNAs were expressed at lower levels in these patients.

Paired miRNA-microarray analysis

In the node negative group, 48 miRNAs had a higher number of inversely correlated targets than expected by chance while not having a significant number of inversely correlated targets in the node positive group. In the node positive group, 21 miRNAs had a higher number of inversely correlated targets than expected by chance while not having a significant number of inversely correlated targets in the node negative group. We refer to such miRNAs as active in the respective group. There were also miRNAs with a lower number of inversely correlated targets than expected by chance. These miRNAs can be hypothesized not to bind to their targets and are referred to as inactive. There were 21 inactive miRNAs in the lymph node positive group and 17 in the lymph node negative group (Table 2 in the Supplementary Appendix). In general, there are more significantly inactive miRNAs and fewer significantly active miRNAs in the lymph node positive group than in the lymph node negative group. These results point to a general deregulation of the miRNA machinery potentially responsible for lymph node invasion.

Finally, both analyses were intersected, to investigate whether there are miRNAs that are significantly active in one group and significantly inactive in the other group. This holds for ten miRNAs being significantly active in the node negative group and significantly inactive in the node positive group (Table 4). Figure 3 shows the network for these ten miRNAs and represents the node negative patients because these miRNAs repress their targets for the node negative patients while not performing this task in the node positive patients. This network can thus be hypothesized as suppressing lymph node invasion. Moreover, the network shows that part of the targets are shared (TGFBR1, CNIH, SLC5A3, …). These shared targets can be a good starting point for further investigating whether these genes are related.

Table 4 The inverse correlation of miRNAs and the computationally predicted targets
Fig. 3

The miRNA-mRNA network of (a) the 10 miRNAs that are active in the node negative group and inactive in the node positive group and (b) the miRNA that is active in the node positive group and inactive in the node negative group

Conversely, there is only one miRNA that is significantly active in the lymph node positive group and inactive in the lymph node negative group, hsa-miR-361-5p (Fig. 3). This miRNA and its targets can be hypothesized as promoting lymph node invasion.

Discussion

It is unclear whether lymph node involvement reflects tumor chronology, tumor genetics or whether it serves as a marker of the host response to the tumor. In a meta-analysis of published correlations between various prognostic factors in breast cancer, Mittra and MacRae [4] concluded that the axillary lymph node status is simply a reflection of the chronologic age of the tumor. On the other hand, several studies show that nodal metastasis is not only a marker of diagnosis at a later point in the natural history of breast cancer but also a marker of an aggressive phenotype [3, 8]. Moreover, one might speculate that besides chronology and phenotype, a weakened host response results in early metastasis to the axillary lymph nodes [9].

Our results demonstrate that measurable differences in gene expression exist between node negative and node positive patients and thus that lymph node involvement is not a genetically random process. However, an area under the ROC curve of only 0.65 indicates that, besides tumor genetics, other factors such as tumor chronology influence lymph node involvement. Moreover, it is possible that we miss important information to define lymph node involvement by studying mainly the tumor and ignoring the continuous interplay between the tumor, the host, and its environment. It has long been recognized that the prognosis of cancer patients is determined not only by the intrinsic properties of the tumor but also by the tumor environment [24] and that interactions with immune and stromal cells play a role in tumor aggressiveness [25, 26]. Tumor-infiltrating lymphocytes are considered a manifestation of the host immune response to the tumor and are strongly correlated with lymph node metastasis [27, 28].

Another possibility is that mRNA expression does not always correlate with protein expression and with the possible functional roles of the genes of interest.

There is an overlap of five genes between our model and the Van’t Veer profile (IGFBP5; KRT18; BTG2; PEX12; CIRBP). As this is far more than expected by chance, it might be, in contrast to previous publications [12], an indication of deregulation of the same pathways in lymphatic and hematogenous metastasis.

Our study is unique because a very homogeneous group of tumors was selected, reducing the problem of multiple testing. Most microarray studies are conducted using a convenience sample of patients for whom tissue is available, but the cohort is often far too heterogeneous to support relevant conclusions [29, 30].

The lymph node ratio was defined as the number of lymph nodes found to be positive divided by the total number of examined lymph nodes and has been shown superior to the number of involved lymph nodes as a prognostic factor [16]. It was used to be able to include all 96 patients independent of the number of positive nodes. The Spearman correlation coefficient was used for the identification of genes that gradually increase or decrease with changing LN ratio.

Our model includes a marked overrepresentation of kinases, apoptosis-related genes, and zinc ion-binding genes (Table 3). Zinc is essential for cell survival, and zinc chelation leads to cell death by involvement of the NF-kappaB pathway and MAP kinase signaling [31]. The model includes 11 genes mediating BAF57-induced cell death. BAF57-induced cell death involves downregulation of gene products critical for inhibition of apoptosis [32]. A subset of the genes upregulated in the node positive group are cytokines and genes related to interactions between the tumor and the host.

IL1RAP and TNFAIP8 are potent pro-inflammatory cytokines, the so-called alarm cytokines, which are secreted by macrophages and initiate inflammation [33].

TNFAIP8 is an antiapoptotic molecule induced by the activation of the transcription factor NF-kappaB and has been shown to play a role in tumor progression [34]. Moreover, activation of NF-kappaB was shown to be an essential link between inflammation and tumorigenesis [33] and a key event in neoplastic progression [35]. The mechanism by which IL-1 promotes tumor growth remains unclear, though the protein is believed to act via stimulation of adjacent cells to produce VEGF and TNFAIP8.

VEGF-B expression on tumor cells is associated with a poor prognosis [36]. In several reports on malignant tumors, high levels of VEGF-B expression were associated with lymph node metastasis in colorectal cancer and with enhanced migration in pancreatic cells through activation of the MAPK pathway [37, 39, 40]. VEGF-B binds to VEGFR-1, and it has been shown that VEGFR-1 positive cells may regulate the homing of tumor cells [40]. IGFBP5, insulin-like growth factor binding protein 5, is one of the five genes of our model overlapping with the Van’t Veer profile [41]. In several independent studies it was associated with metastasis by protecting cells from apoptosis. Hao et al. [42] found that IGFBP5 was more frequently overexpressed in lymph node metastases than in the matched primary cancer tissues. In addition, IGFBP5 was more frequently overexpressed in T1 breast carcinoma with lymph node metastasis than in T1 carcinoma without lymph node metastasis [43]. Li et al. [44] recently found that the mRNA level of IGFBP5 positively correlates with the invasion of axillary lymph nodes.

S100A8 is a strong inflammatory kinase, induced by TNFAIP8 and VEGF. It is produced by the primary tumor and attracts monocytes and neutrophils to the premetastatic site [35, 40]. This process is called immune prepping. In this way, the primary tumor influences the environment in premetastatic sites before metastasis. Our findings might reflect the dynamic changes in tumor cell interactions with the microenvironment and suggest that some of the differences between lymph node negative and positive tumors relate to the stromal component rather than to the tumor itself.

The analysis of the inverse correlation between a miRNA and its computationally predicted target genes shows that overall miRNAs were more active and less inactive in the lymph node negative group.

Finally, ten miRNAs were significantly active in the lymph node negative group and at the same time significantly inactive in the lymph node positive group, strengthening the evidence that these miRNAs suppress lymph node invasion. In the literature, six of these miRNAs (let-7i, miR-143, miR-16, miR-26a, miR-375, and miR-519) have been associated with decreased migration and proliferation of tumor cells and with better survival [45–50]. It has been proposed that miR-375 acts as a tumor suppressor miRNA in different types of tumors by inducing apoptosis. There was only one miRNA (hsa-miR-361-5p) promoting lymph node invasion. These results point to a general deregulation of the miRNA machinery potentially responsible for lymph node invasion. However, only an exploratory analysis was performed; in a next step, these results will be used to build predictive models for lymph node involvement.

Conclusion

We can conclude that measurable differences in gene and miRNA expression exist between N0 and N+ patients and thus that lymph node involvement is not a genetically random process. Our model includes an important increase in kinases, apoptosis-related genes, and zinc ion-binding genes. Our findings might reflect the dynamic changes in tumor cell interactions with the microenvironment and suggest that some of the differences between lymph node negative and positive tumors relate to the stromal component rather than to the tumor itself. Moreover, our data suggest a general deregulation of the miRNA machinery potentially responsible for lymph node invasion. For an accurate prediction of lymph node involvement, we hypothesize that, besides the tumor, also the host and its environment have to be taken into account.